Python - RE
Regular Expressions
Metacharacters
[] : A set of characters. Example: [a-m]
    Note: [arm] matches any single one of a, r, or m, whereas arm (without the brackets) matches the three characters in sequence.
\ : Signals a special sequence (can also be used to escape special characters). Example: \d
. : Any character (except newline character). Example: he..o
^ : Starts with. Example: ^hello
$ : Ends with. Example: planet$
* : Zero or more occurrences. Example: he.*o
+ : One or more occurrences. Example: he.+o
? : Zero or one occurrences. Example: he.?o
{} : Exactly the specified number of occurrences. Example: he.{2}o
{n,m} : Match between n and m number of occurrences. Example: he.{1,2}o
| : Either or. Example: falls|stay
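The example patterns from the list can be tried out directly with the re module. This is a minimal sketch; the sample strings are assumptions chosen to fit the patterns and are not part of the list above.

```python
import re

text = "hello planet"  # assumed sample string

# . stands for exactly one character: "he", two characters, then "o"
two_chars = re.findall(r"he..o", text)

# ^ and $ anchor the pattern to the start and the end of the string
starts = re.search(r"^hello", text) is not None
ends = re.search(r"planet$", text) is not None

# * and + accept the two characters "ll"; ? allows at most one, so it finds nothing
zero_or_more = re.findall(r"he.*o", text)
one_or_more = re.findall(r"he.+o", text)
zero_or_one = re.findall(r"he.?o", text)

# {2} needs exactly two characters, {1,2} accepts one or two
exactly_two = re.findall(r"he.{2}o", text)
one_to_two = re.findall(r"he.{1,2}o", text)

# | matches either alternative
either = re.findall(r"falls|stay", "The rain in Spain falls mainly in the plain")

# [arm] matches single characters, arm matches the literal sequence
as_set = re.findall(r"[arm]", "an army")
as_sequence = re.findall(r"arm", "an army")

print(two_chars, zero_or_one, either, as_set, as_sequence)
# => ['hello'] [] ['falls'] ['a', 'a', 'r', 'm'] ['arm']
```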
Special Sequences
\d : Matches any decimal digit; this is equivalent to the class [0-9].
\D : Matches any non-digit character; this is equivalent to the class [^0-9].
\s : Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S : Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w : Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W : Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
\A : Returns a match if the specified characters are at the beginning of the string. Example: \AThe
\b : Returns a match where the specified characters are at the beginning or at the end of a word (the r prefix makes sure that the string is treated as a raw string). Examples: r"\bain", r"ain\b"
\B : Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the r prefix makes sure that the string is treated as a raw string). Examples: r"\Bain", r"ain\B"
\Z : Returns a match if the specified characters are at the end of the string. Example: Spain\Z
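The anchors and word boundaries are easiest to tell apart when run against one string. The following is a small sketch; the sentence "The rain in Spain" is an assumed sample that fits the ain examples.

```python
import re

text = "The rain in Spain"  # assumed sample string

# \A and \Z anchor to the very beginning and the very end of the string
starts_with_the = re.search(r"\AThe", text) is not None
ends_with_spain = re.search(r"Spain\Z", text) is not None

# \bain: "ain" at the beginning of a word; no word in the text starts with "ain"
word_start = re.findall(r"\bain", text)
# ain\b: "ain" at the end of a word; found in "rain" and "Spain"
word_end = re.findall(r"ain\b", text)
# \Bain: "ain" NOT at the beginning of a word; found in "rain" and "Spain"
not_word_start = re.findall(r"\Bain", text)
# ain\B: "ain" NOT at the end of a word; nothing qualifies here
not_word_end = re.findall(r"ain\B", text)

# Without the r prefix, "\b" is a single backspace character, not a word boundary
plain_len, raw_len = len("\b"), len(r"\b")

print(word_start, word_end, not_word_start, not_word_end, plain_len, raw_len)
# => [] ['ain', 'ain'] ['ain', 'ain'] [] 1 2
```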
Sets
[arn] : Returns a match where one of the specified characters (a, r, or n) is present
[a-n] : Returns a match for any lower case character, alphabetically between a and n
[^arn] : Returns a match for any character EXCEPT a, r, and n
[0123] : Returns a match where any of the specified digits (0, 1, 2, or 3) are present
[0-9] : Returns a match for any digit between 0 and 9
[0-5][0-9] : Returns a match for any two-digit number from 00 to 59
[a-zA-Z] : Returns a match for any character alphabetically between a and z, lower case OR upper case
[+] : In sets, +, *, ., |, (), and $ have no special meaning, so [+] means: return a match for any + character in the string
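A short sketch of how a few of these sets behave in practice. The input strings are assumptions picked for illustration.

```python
import re

# [arn]: any single a, r, or n
any_of = re.findall(r"[arn]", "The rain")

# [^arn]: any single character except a, r, and n
none_of = re.findall(r"[^arn]", "rain")

# [0-5][0-9]: two-digit numbers from 00 to 59, e.g. the parts of a clock time
minutes = re.findall(r"[0-5][0-9]", "19:45")
# 75 is out of range, and a lone 8 is not two digits
out_of_range = re.findall(r"[0-5][0-9]", "8 past 75")

# [+]: inside a set, + is a literal plus sign, not a quantifier
plus_signs = re.findall(r"[+]", "8 + 2")

print(any_of, none_of, minutes, out_of_range, plus_signs)
# => ['r', 'a', 'n'] ['i'] ['19', '45'] [] ['+']
```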
Examples
Filenames
Search for strings in filenames:
import re

files = [
    "2022-10-13_archive.zip",
    "2022-10-13.txt",
    "2022-09-30_archive.zip",
    "2022-09-30.txt",
    "2022-09-15_archive.zip",
    "2022-09-15.txt"
]

# get archives from a specified month
for path in files:
    # raw string; \. matches a literal dot instead of any character
    archive_match = re.search(r"[^ ]+_archive\.zip", path)
    # skip non-archives
    if archive_match is not None:
        # search for date
        if "2022-09-" in archive_match.string:
            print(archive_match.string)
# => 2022-09-30_archive.zip
# => 2022-09-15_archive.zip
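The month check can also be folded into the pattern itself. The sketch below is one possible variant, not the approach used above: the named groups year, month, and day and the use of fullmatch are assumptions added for illustration.

```python
import re

files = [
    "2022-10-13_archive.zip",
    "2022-10-13.txt",
    "2022-09-30_archive.zip",
    "2022-09-30.txt",
    "2022-09-15_archive.zip",
    "2022-09-15.txt",
]

# Hypothetical pattern: named groups capture the date parts of an archive name
archive_pattern = re.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})_archive\.zip"
)

september_archives = []
for path in files:
    # fullmatch only succeeds if the whole filename fits the pattern
    match = archive_pattern.fullmatch(path)
    if match and match["year"] == "2022" and match["month"] == "09":
        september_archives.append(path)

print(september_archives)
# => ['2022-09-30_archive.zip', '2022-09-15_archive.zip']
```

Capturing the date parts keeps the filter in one place and makes it easy to switch to another month without a second substring search.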
Filepaths
Get archives from a specified month from a path:
import re
from pathlib import Path

root_dir = Path('files')
filenames = root_dir.iterdir()
# iterdir() yields entries in arbitrary order, so sort them for a stable result
file_list = sorted(filenames)
# print(file_list)
files = [file.name for file in file_list]
# print(files)
zip_pattern = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2}_archive\.zip')
date_pattern = re.compile(r'2022-09[^ ]+')
# filter out all non-zip files
matching_container = [file for file in files if zip_pattern.search(file)]
# get zip files of a specific date
matching_date = [file for file in matching_container if date_pattern.search(file)]
print(matching_date)
# => ['2022-09-15_archive.zip', '2022-09-30_archive.zip']
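The snippet above depends on a files directory that already exists on disk. To try the same filter without one, the following sketch creates empty files in a temporary directory first. The use of tempfile and the combined filter expression are assumptions made for this illustration.

```python
import re
import tempfile
from pathlib import Path

names = [
    "2022-10-13_archive.zip",
    "2022-10-13.txt",
    "2022-09-30_archive.zip",
    "2022-09-30.txt",
    "2022-09-15_archive.zip",
    "2022-09-15.txt",
]

zip_pattern = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}_archive\.zip")

with tempfile.TemporaryDirectory() as tmp:
    root_dir = Path(tmp)
    # create empty placeholder files so iterdir() has something to list
    for name in names:
        (root_dir / name).touch()
    # keep archive names from September 2022, sorted for a stable order
    matching = sorted(
        p.name
        for p in root_dir.iterdir()
        if zip_pattern.fullmatch(p.name) and p.name.startswith("2022-09")
    )

print(matching)
# => ['2022-09-15_archive.zip', '2022-09-30_archive.zip']
```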
Email Addresses
Extract Emails from text:
import re
text = "Spicy jalapeno_bacon@ipsum.com dolor amet turducken biltong frankfurter shankle porchetta. Tail buffalo anim, capicola eiusmod cupim beef ribs tenderloin shank@beef.br biltong. Laboris meatloaf swine, esse cillum est sausage t-bone dolor adipisicing ex-corned@beef.co.uk aliqua porchetta. Boudin fatback chuck meatball laborum meatloaf ground round, filet mignon.prosciutto@shankle.nz pig."
For emails that only use characters from a to z:
# [a-z] = match one letter between a and z
# [a-z]+ = match one or more letters between a and z
email_pattern = re.compile("[a-z]+@[a-z]+.[a-z]+")
email_matches = email_pattern.findall(text)
print(email_matches)
# => ['bacon@ipsum.com', 'shank@beef.br', 'corned@beef.co', 'prosciutto@shankle.nz']
[^ ] : Match any character except a space
\. : Match a "dot" (escaping the . meta character)
{2,} : TLD has 2 or more characters
email2_pattern = re.compile(r"[^ ]+@[^ ]+\.[a-z]{2,}")
email2_matches = email2_pattern.findall(text)
print(email2_matches)
# => ['jalapeno_bacon@ipsum.com', 'shank@beef.br', 'ex-corned@beef.co.uk', 'mignon.prosciutto@shankle.nz']
(?:com|co\.uk) : Match only com and co.uk addresses
email3_pattern = re.compile(r"[^ ]+@[^ ]+\.(?:com|co\.uk)")
email3_matches = email3_pattern.findall(text)
print(email3_matches)
# => ['jalapeno_bacon@ipsum.com', 'ex-corned@beef.co.uk']
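Because [^ ] accepts any character other than a space, these patterns also pick up punctuation that touches an address, such as a bracket or a trailing comma. A stricter character class is one way around that. The sketch below is an assumed simplification, not a full implementation of the email address syntax, and the sample text is made up for illustration.

```python
import re

# assumed sample text with punctuation next to the addresses
sample = "Contact ex-corned@beef.co.uk, or (shank@beef.br). Not an address: user@localhost"

# Assumption: local part of letters, digits, ., _, + and -;
# domain labels of letters, digits and -, ending in a TLD of 2 or more letters
strict_pattern = re.compile(
    r"[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)

emails = strict_pattern.findall(sample)
print(emails)
# => ['ex-corned@beef.co.uk', 'shank@beef.br']
```

The comma and the brackets are left out of the matches, and user@localhost is skipped because it has no dot followed by a TLD.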
URLs
Only extract .com URLs from html:
https? : match http and https
(?:www\.)? : match with or without www
[^ ]+ : match any character except a space
\.com/ : only match .com URLs
[^ "]+ : stop when you hit a white space or "
import re
from html_source import html
url_pattern = re.compile(r'https?://(?:www\.)?[^ ]+\.com/[^ "]+')
url_matches = url_pattern.findall(html)
print(url_matches)
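The html_source module is not shown here, so the following sketch uses a small inline string as a hypothetical stand-in. The markup and the example domains are assumptions for illustration only, and the host part uses [^ "]+ so that a match cannot run across a closing quote.

```python
import re

# Hypothetical stand-in for the html string imported from html_source
html = (
    '<a href="https://www.example.com/docs/index.html">Docs</a> '
    '<a href="http://example.org/about">About</a> '
    '<img src="https://cdn.example.com/img/logo.png">'
)

# host and path both stop at a space or a double quote
url_pattern = re.compile(r'https?://(?:www\.)?[^ "]+\.com/[^ "]+')
urls = url_pattern.findall(html)

print(urls)
# => ['https://www.example.com/docs/index.html', 'https://cdn.example.com/img/logo.png']
```

The .org link is skipped because the pattern requires \.com/ before the path.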
IP Addresses
Extract specific IPs from file:
import re

ip_file = 'ip.txt'
with open(ip_file, 'r') as file:
    file_content = file.read()
# only match ip addresses with a 2 in the third octet
# (as written, every other octet must have exactly three digits)
ip_pattern = re.compile(r"[0-9]{3}\.[0-9]{3}\.2\.[0-9]{3}")
ip_matches = ip_pattern.findall(file_content)
print(ip_matches)
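The pattern above expects exactly three digits in the first, second, and fourth octets, so an address like 10.0.2.5 would be missed. A possible variant that accepts one to three digits per octet is sketched below. The inline sample replaces the contents of ip.txt, which are not shown, and is an assumption. The pattern does not check that each octet is in the range 0 to 255.

```python
import re

# Hypothetical sample standing in for the contents of ip.txt
file_content = """192.168.2.10
10.0.2.5
192.168.1.20
172.16.2.255
"""

# two octets of 1-3 digits, a literal 2 as third octet, then a final octet;
# \b keeps the match from starting or ending in the middle of a number
ip_pattern = re.compile(r"\b(?:[0-9]{1,3}\.){2}2\.[0-9]{1,3}\b")
ips = ip_pattern.findall(file_content)

print(ips)
# => ['192.168.2.10', '10.0.2.5', '172.16.2.255']
```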