Python - RE

Regular Expressions
Metacharacters
[]: A set of characters. Example: [a-m]. Without the brackets the characters must match in sequence: [arm] vs arm
\: Signals a special sequence (can also be used to escape special characters). Example: \d
.: Any character (except the newline character). Example: he..o
^: Starts with. Example: ^hello
$: Ends with. Example: planet$
*: Zero or more occurrences. Example: he.*o
+: One or more occurrences. Example: he.+o
?: Zero or one occurrence. Example: he.?o
{}: Exactly the specified number of occurrences. Example: he.{2}o
{n,m}: Between n and m occurrences. Example: he.{1,2}o
|: Either or. Example: falls|stay
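A minimal sketch of how some of these metacharacters behave with `re.findall` and `re.search`. The sample strings are made up for illustration and are not part of the original notes.

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns
sample = "hello planet, heo helo hello"

# . matches exactly one character, so he..o needs two characters between "he" and "o"
two_any = re.findall(r"he..o", sample)

# ? allows zero or one character between "he" and "o"
zero_or_one = re.findall(r"he.?o", sample)

# {1,2} allows one or two characters between "he" and "o"
one_or_two = re.findall(r"he.{1,2}o", sample)

# ^ anchors the match at the start of the string, $ at the end
starts = re.search(r"^hello", "hello planet") is not None
ends = re.search(r"planet$", "hello planet") is not None

# | matches either alternative
either = re.findall(r"falls|stay", "The rain in Spain falls mainly in the plain")

print(two_any)      # ['hello', 'hello']
print(zero_or_one)  # ['heo', 'helo']
print(one_or_two)   # ['hello', 'helo', 'hello']
print(starts, ends) # True True
print(either)       # ['falls']
```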
Special Sequences
\d: Matches any decimal digit; this is equivalent to the class [0-9].
\D: Matches any non-digit character; this is equivalent to the class [^0-9].
\s: Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S: Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w: Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_].
\W: Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
\A: Returns a match if the specified characters are at the beginning of the string. Example: \AThe
\b: Returns a match where the specified characters are at the beginning or at the end of a word (the r prefix makes sure that the string is treated as a raw string). Examples: r"\bain", r"ain\b"
\B: Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (again written as a raw string). Examples: r"\Bain", r"ain\B"
\Z: Returns a match if the specified characters are at the end of the string. Example: Spain\Z
The class equivalences above hold for ASCII text; with str patterns Python also matches the corresponding Unicode characters unless the re.ASCII flag is set.
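The following sketch tries a few of these sequences on a made-up sentence. It also shows why the r prefix matters: in a normal string literal, "\b" is the backspace character rather than a word boundary.

```python
import re

# Hypothetical sample sentence
text = "The rain in Spain"

digits = re.findall(r"\d", "Room 101")
words = re.findall(r"\w+", text)

at_start = re.search(r"\AThe", text) is not None
at_end = re.search(r"Spain\Z", text) is not None

# "ain" at the end of a word: found in "rain" and "Spain"
word_end = re.findall(r"ain\b", text)
# "ain" at the start of a word: no word in the sentence starts with it
word_start = re.findall(r"\bain", text)
# "ain" NOT at the start of a word: found in "rain" and "Spain"
not_start = re.findall(r"\Bain", text)
# "ain" NOT at the end of a word: every "ain" here ends its word
not_end = re.findall(r"ain\B", text)

# Without the r prefix, "\b" is a single backspace character
plain_len = len("\b")
raw_len = len(r"\b")

print(digits)      # ['1', '0', '1']
print(words)       # ['The', 'rain', 'in', 'Spain']
print(word_end)    # ['ain', 'ain']
print(word_start)  # []
print(plain_len, raw_len)  # 1 2
```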
Sets
[arn]: Returns a match where one of the specified characters (a, r, or n) is present
[a-n]: Returns a match for any lower case character, alphabetically between a and n
[^arn]: Returns a match for any character EXCEPT a, r, and n
[0123]: Returns a match where any of the specified digits (0, 1, 2, or 3) are present
[0-9]: Returns a match for any digit between 0 and 9
[0-5][0-9]: Returns a match for any two-digit number from 00 to 59
[a-zA-Z]: Returns a match for any character alphabetically between a and z, lower case OR upper case
[+]: In sets, +, *, ., |, (), $ have no special meaning, so [+] means: return a match for any + character in the string
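A short sketch of the sets above in action. The input strings are invented for the example.

```python
import re

# Any of the characters a, r, n
any_arn = re.findall(r"[arn]", "rain")

# Any character except a, r, n
not_arn = re.findall(r"[^arn]", "rain")

# Two-digit numbers from 00 to 59
minutes = re.findall(r"[0-5][0-9]", "08:45 and 99")

# Upper or lower case letters only
letters = re.findall(r"[a-zA-Z]", "a1B2")

# Inside a set, + is a literal plus sign
plus = re.findall(r"[+]", "5+3=8")

print(any_arn)  # ['r', 'a', 'n']
print(not_arn)  # ['i']
print(minutes)  # ['08', '45']
print(letters)  # ['a', 'B']
print(plus)     # ['+']
```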
Examples
Filenames
Search for strings in filenames:
import re
files = [
    "2022-10-13_archive.zip",
    "2022-10-13.txt",
    "2022-09-30_archive.zip",
    "2022-09-30.txt",
    "2022-09-15_archive.zip",
    "2022-09-15.txt"
]
# get archives from a specified month
for path in files:
    # raw string; the dot is escaped so it only matches a literal "."
    archive_match = re.search(r"[^ ]+_archive\.zip", path)
    # skip non-archives
    if archive_match is not None:
        # search for date
        if "2022-09-" in archive_match.string:
            print(archive_match.string)
# => 2022-09-30_archive.zip
# => 2022-09-15_archive.zip
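As a variation, the month check can be folded into the pattern itself. This is only a sketch: the name `month_pattern` and the use of `fullmatch` are choices made here, not something the example above prescribes.

```python
import re

files = [
    "2022-10-13_archive.zip",
    "2022-10-13.txt",
    "2022-09-30_archive.zip",
    "2022-09-30.txt",
    "2022-09-15_archive.zip",
    "2022-09-15.txt",
]

# Month and archive suffix in one pattern; fullmatch requires the whole name to match
month_pattern = re.compile(r"2022-09-\d{2}_archive\.zip")
september_archives = [name for name in files if month_pattern.fullmatch(name)]

print(september_archives)
# => ['2022-09-30_archive.zip', '2022-09-15_archive.zip']
```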
Filepaths
Get archives from a specified month from path
import re
from pathlib import Path
root_dir = Path('files')
filenames = root_dir.iterdir()
file_list = list(filenames)
# print(file_list)
files = [file.name for file in file_list]
# print(files)
# raw strings; the dot before "zip" is escaped so it only matches a literal "."
zip_pattern = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2}_archive\.zip')
date_pattern = re.compile(r'2022-09[^ ]+')
# filter out all non-zip files
matching_container = [file for file in files if zip_pattern.search(file)]
# get zip files of a specific month
matching_date = [file for file in matching_container if date_pattern.search(file)]
print(matching_date)
# => ['2022-09-15_archive.zip', '2022-09-30_archive.zip'] (order follows iterdir() and may vary)
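To try this without an existing `files` directory, the sketch below creates empty files in a temporary directory first. The temporary directory and the `sorted` call are assumptions added here so the result is reproducible.

```python
import re
import tempfile
from pathlib import Path

names = [
    "2022-10-13_archive.zip",
    "2022-10-13.txt",
    "2022-09-30_archive.zip",
    "2022-09-30.txt",
    "2022-09-15_archive.zip",
    "2022-09-15.txt",
]

pattern = re.compile(r"2022-09-[0-9]{2}_archive\.zip")

with tempfile.TemporaryDirectory() as tmp:
    root_dir = Path(tmp)
    # create empty placeholder files
    for name in names:
        (root_dir / name).touch()
    # iterdir() yields entries in arbitrary order, so sort the result
    matching = sorted(
        p.name for p in root_dir.iterdir() if pattern.fullmatch(p.name)
    )

print(matching)
# => ['2022-09-15_archive.zip', '2022-09-30_archive.zip']
```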
Email Addresses
Extract Emails from text:
import re
text = "Spicy jalapeno_bacon@ipsum.com dolor amet turducken biltong frankfurter shankle porchetta. Tail buffalo anim, capicola eiusmod cupim beef ribs tenderloin shank@beef.br biltong. Laboris meatloaf swine, esse cillum est sausage t-bone dolor adipisicing ex-corned@beef.co.uk aliqua porchetta. Boudin fatback chuck meatball laborum meatloaf ground round, filet mignon.prosciutto@shankle.nz pig."
For Emails that only use characters from a to z:
# [a-z] = match one letter between a and z
# [a-z]+ = match one or more letters between a and z
# . = unescaped, so it matches any single character, not only a dot
email_pattern = re.compile("[a-z]+@[a-z]+.[a-z]+")
email_matches = email_pattern.findall(text)
print(email_matches)
# => ['bacon@ipsum.com', 'shank@beef.br', 'corned@beef.co', 'prosciutto@shankle.nz']
[^ ]: Match everything but white space
\.: Match a "dot" (escaping the . meta character)
{2,}: TLD has 2 or more characters
email2_pattern = re.compile(r"[^ ]+@[^ ]+\.[a-z]{2,}")
email2_matches = email2_pattern.findall(text)
print(email2_matches)
# => ['jalapeno_bacon@ipsum.com', 'shank@beef.br', 'ex-corned@beef.co.uk', 'mignon.prosciutto@shankle.nz']
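A small sketch of why the dot is worth escaping. The two candidate strings are hypothetical; the second has an "X" where the dot should be.

```python
import re

# Hypothetical inputs: one plausible address and one malformed string
candidates = ["user@host.com", "user@hostXcom"]

loose = re.compile(r"[a-z]+@[a-z]+.[a-z]+")    # "." matches any character
strict = re.compile(r"[a-z]+@[a-z]+\.[a-z]+")  # "\." matches only a literal dot

loose_hits = [c for c in candidates if loose.search(c)]
strict_hits = [c for c in candidates if strict.search(c)]

print(loose_hits)   # ['user@host.com', 'user@hostXcom']
print(strict_hits)  # ['user@host.com']
```

Writing the pattern as a raw string (r"...") keeps the backslash intact, so `\.` reaches the regex engine unchanged.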
(?:com|co\.uk): Match only co.uk and com addresses
email3_pattern = re.compile(r"[^ ]+@[^ ]+\.(?:com|co\.uk)")
email3_matches = email3_pattern.findall(text)
print(email3_matches)
# => ['jalapeno_bacon@ipsum.com', 'ex-corned@beef.co.uk']
URLs
Only extract .com URLs from html:
https?: match http and https
(?:www\.)?: match with or without www
[^ ]+: match everything but white space
\.com/: only match .com URLs
[^ ][^"]+: match one non-space character, then everything up to the next "
import re
# html_source is a local module that provides the HTML text as a string
from html_source import html
url_pattern = re.compile(r'https?://(?:www\.)?[^ ]+\.com/[^ ][^"]+')
url_matches = url_pattern.findall(html)
print(url_matches)
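Since `html_source` is a local module, here is a self-contained sketch that runs the same idea against an inline string. The HTML snippet and its URLs are invented for illustration.

```python
import re

# Hypothetical HTML snippet standing in for the html_source module
html = (
    '<a href="https://www.example.com/docs/index.html">Docs</a> '
    '<a href="http://example.org/about">About</a> '
    '<a href="https://shop.example.com/cart?id=1">Cart</a>'
)

url_pattern = re.compile(r'https?://(?:www\.)?[^ ]+\.com/[^ ][^"]+')
url_matches = url_pattern.findall(html)

print(url_matches)
# => ['https://www.example.com/docs/index.html', 'https://shop.example.com/cart?id=1']
```

The .org link is skipped because `[^ ]+` cannot cross the space that follows its tag, so no `.com/` is reachable from that starting point.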
IP Addresses
Extract specific IPs from file:
import re
ip_file = 'ip.txt'
with open(ip_file, 'r') as file:
    file_content = file.read()
# only match ip addresses with a 2 in the third octet
# {1,3} lets the other octets have one to three digits; \b keeps matches on octet boundaries
ip_pattern = re.compile(r"\b[0-9]{1,3}\.[0-9]{1,3}\.2\.[0-9]{1,3}\b")
ip_matches = ip_pattern.findall(file_content)
print(ip_matches)
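A self-contained sketch that uses an inline string in place of `ip.txt`. The addresses are made up, and allowing one to three digits per octet is an assumption about the intended behavior.

```python
import re

# Hypothetical file content standing in for ip.txt
file_content = "192.168.2.10\n10.0.2.5\n192.168.1.10\n172.16.2.254\n"

# One to three digits per octet, with a literal 2 as the third octet
ip_pattern = re.compile(r"\b[0-9]{1,3}\.[0-9]{1,3}\.2\.[0-9]{1,3}\b")
ip_matches = ip_pattern.findall(file_content)

print(ip_matches)
# => ['192.168.2.10', '10.0.2.5', '172.16.2.254']
```

Note that this pattern does not check that each octet is at most 255; it only filters on the shape of the address.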