Python - RE

Sham Sui Po, Hong Kong

Github Repository

Regular Expressions

Metacharacters

[]: A set of characters, e.g. [a-m]. A set matches one character out of the set, while the same characters without brackets must match in sequence: [arm] matches a, r, or m, whereas arm only matches the string "arm"
\: Signals a special sequence (can also be used to escape special characters), e.g. \d
.: Any character (except newline character), e.g. he..o
^: Starts with, e.g. ^hello
$: Ends with, e.g. planet$
*: Zero or more occurrences, e.g. he.*o
+: One or more occurrences, e.g. he.+o
?: Zero or one occurrence, e.g. he.?o
{n}: Exactly n occurrences, e.g. he.{2}o
{n,m}: Between n and m occurrences, e.g. he.{1,2}o
|: Either or, e.g. falls|stays
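
A minimal sketch of how a few of these metacharacters behave with Python's `re` module. The sample strings are made up for illustration:

```python
import re

# . matches any single character except a newline
dot_match = re.search(r"he..o", "hello planet")

# ^ and $ anchor the pattern to the start and the end of the string
starts = re.search(r"^hello", "hello planet")
ends = re.search(r"planet$", "hello planet")

# * accepts zero characters between "he" and "o", + needs at least one
star = re.search(r"he.*o", "heo")
plus = re.search(r"he.+o", "heo")

# {2} requires exactly two characters between "he" and "o"
exact = re.search(r"he.{2}o", "hello")

# | matches either alternative
either = re.findall(r"falls|stays", "The rain falls and stays")

print(dot_match.group())  # hello
print(star.group())       # heo
print(plus)               # None
print(either)             # ['falls', 'stays']
```

`re.search` returns `None` when nothing matches, which is why `plus` cannot be used with `.group()` here.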

Special Sequences

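\d: Matches any decimal digit; for ASCII text this is equivalent to the class [0-9]. (With str patterns, \d, \s, and \w also match the corresponding Unicode characters unless the re.ASCII flag is set.)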
\D: Matches any non-digit character; this is equivalent to the class [^0-9].
\s: Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S: Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w: Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W: Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
\A: Returns a match if the specified characters are at the beginning of the string, e.g. \AThe
\b: Returns a match where the specified characters are at the beginning or at the end of a word, e.g. r"\bain" or r"ain\b" (the r prefix makes Python treat the pattern as a raw string, so \b is not read as a backspace)
\B: Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word, e.g. r"\Bain" or r"ain\B"
\Z: Returns a match if the specified characters are at the end of the string, e.g. Spain\Z
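
The special sequences can be tried out with a short sketch like the following. The sample strings are illustrative assumptions:

```python
import re

text = "The rain in Spain"

# Without the r prefix, "\b" is a single backspace character
print(len("\b"), len(r"\b"))  # 1 2

# \d finds digits, \s finds whitespace, \w+ finds runs of word characters
digits = re.findall(r"\d", "Room 101")
spaces = re.findall(r"\s", text)
words = re.findall(r"\w+", text)

# \b at the end: "ain" must finish a word ("rain" and "Spain" both qualify)
word_end = re.findall(r"ain\b", text)
# \b at the start: no word in the text begins with "ain"
word_start = re.findall(r"\bain", text)
# \B at the start: "ain" must NOT begin a word
not_word_start = re.findall(r"\Bain", text)

# \A and \Z anchor to the very start and the very end of the string
at_start = re.search(r"\AThe", text)
at_end = re.search(r"Spain\Z", text)

print(digits)      # ['1', '0', '1']
print(words)       # ['The', 'rain', 'in', 'Spain']
print(word_end)    # ['ain', 'ain']
print(word_start)  # []
```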

Sets

[arn]: Returns a match where one of the specified characters (a, r, or n) is present
[a-n]: Returns a match for any lower case character, alphabetically between a and n
[^arn]: Returns a match for any character EXCEPT a, r, and n
[0123]: Returns a match where any of the specified digits (0, 1, 2, or 3) are present
[0-9]: Returns a match for any digit between 0 and 9
[0-5][0-9]: Returns a match for any two-digit number from 00 to 59
[a-zA-Z]: Returns a match for any character alphabetically between a and z, lower case OR upper case
[+]: In sets, +, *, ., |, (), and $ have no special meaning, so [+] means: return a match for any + character in the string
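
A short sketch of the sets listed above, run against made-up sample strings:

```python
import re

# one of the listed characters
any_of = re.findall(r"[arn]", "rain")
# anything except the listed characters
none_of = re.findall(r"[^arn]", "rain")
# a range of lower case letters
lower_range = re.findall(r"[a-n]", "hello")
# two-digit numbers from 00 to 59 ("61" is skipped)
minutes = re.findall(r"[0-5][0-9]", "07:59:61")
# letters in either case
letters = re.findall(r"[a-zA-Z]", "R2-D2")
# inside a set, + is a literal plus sign
plus_signs = re.findall(r"[+]", "1+1=2")

print(any_of)       # ['r', 'a', 'n']
print(none_of)      # ['i']
print(lower_range)  # ['h', 'e', 'l', 'l']
print(minutes)      # ['07', '59']
print(letters)      # ['R', 'D']
print(plus_signs)   # ['+']
```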

Examples

Filenames

Search for strings in filenames:

import re

files = [
    "2022-10-13_archive.zip",
    "2022-10-13.txt",
    "2022-09-30_archive.zip",
    "2022-09-30.txt",
    "2022-09-15_archive.zip",
    "2022-09-15.txt"
    ]

# get archives from a specified month

for path in files:
    # escape the dot so that it only matches a literal "."
    archive_match = re.search(r"[^ ]+_archive\.zip", path)
    # skip non-archives
    if archive_match is not None:
        # search for date
        if "2022-09-" in archive_match.string:
            print(archive_match.string)

# => 2022-09-30_archive.zip
# => 2022-09-15_archive.zip
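
The two-step check can also be folded into a single pattern. The following is only a sketch: the helper name `archives_for_month` is hypothetical, and it assumes the same `YYYY-MM-DD_archive.zip` naming as the list above:

```python
import re

files = [
    "2022-10-13_archive.zip",
    "2022-10-13.txt",
    "2022-09-30_archive.zip",
    "2022-09-30.txt",
    "2022-09-15_archive.zip",
    "2022-09-15.txt",
]


def archives_for_month(names, year_month):
    """Return the archive names whose date starts with year_month, e.g. '2022-09'."""
    # re.escape keeps regex metacharacters in the caller's input literal
    pattern = re.compile(re.escape(year_month) + r"-\d{2}_archive\.zip")
    # fullmatch only accepts names that match from start to end
    return [name for name in names if pattern.fullmatch(name)]


september = archives_for_month(files, "2022-09")
print(september)
# ['2022-09-30_archive.zip', '2022-09-15_archive.zip']
```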

Filepaths

Get archives from a specified month from a directory:

import re
from pathlib import Path

root_dir = Path('files')
# iterdir() yields entries in arbitrary order, so sort the names for a stable result
files = sorted(file.name for file in root_dir.iterdir())
# print(files)

# escape the dot so that it only matches a literal "."
zip_pattern = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2}_archive\.zip')
date_pattern = re.compile(r'2022-09[^ ]+')

# filter out all non-archive files
matching_container = [file for file in files if zip_pattern.search(file)]
# get zip files of specific date
matching_date = [file for file in matching_container if date_pattern.search(file)]

print(matching_date)
# => ['2022-09-15_archive.zip', '2022-09-30_archive.zip']
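
When the date parts themselves are needed, groups can pull them out of the filename. A minimal sketch, assuming the same naming scheme; the group names `year`, `month`, and `day` are chosen here for illustration:

```python
import re

# named groups (?P<name>...) capture the parts of the date
archive_pattern = re.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})_archive\.zip"
)

match = archive_pattern.fullmatch("2022-09-15_archive.zip")
if match is not None:
    parts = match.groupdict()
    print(parts["year"], parts["month"], parts["day"])  # 2022 09 15

# a non-archive name does not match at all
no_match = archive_pattern.fullmatch("2022-09-15.txt")
print(no_match)  # None
```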

Email Addresses

Extract emails from text:

import re

text = "Spicy jalapeno_bacon@ipsum.com dolor amet turducken biltong frankfurter shankle porchetta. Tail buffalo anim, capicola eiusmod cupim beef ribs tenderloin shank@beef.br biltong. Laboris meatloaf swine, esse cillum est sausage t-bone dolor adipisicing ex-corned@beef.co.uk aliqua porchetta. Boudin fatback chuck meatball laborum meatloaf ground round, filet mignon.prosciutto@shankle.nz pig."

For emails that only use characters from a to z:

# [a-z] = match one letter between a and z
# [a-z]+ = match one or more letters between a and z
# \. = match a literal dot (an unescaped . would match any character)
email_pattern = re.compile(r"[a-z]+@[a-z]+\.[a-z]+")
email_matches = email_pattern.findall(text)
print(email_matches)
# => ['bacon@ipsum.com', 'shank@beef.br', 'corned@beef.co', 'prosciutto@shankle.nz']

[^ ] : Match everything but white space
\. : Match a "dot" (escaping the . meta character)
{2,} : TLD has 2 or more characters

email2_pattern = re.compile(r"[^ ]+@[^ ]+\.[a-z]{2,}")
email2_matches = email2_pattern.findall(text)
print(email2_matches)
# => ['jalapeno_bacon@ipsum.com', 'shank@beef.br', 'ex-corned@beef.co.uk', 'mignon.prosciutto@shankle.nz']

(?:com|co\.uk) : Match only co.uk and com addresses

email3_pattern = re.compile(r"[^ ]+@[^ ]+\.(?:com|co\.uk)")
email3_matches = email3_pattern.findall(text)
print(email3_matches)
# => ['jalapeno_bacon@ipsum.com', 'ex-corned@beef.co.uk']
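
The `?:` matters because `findall` returns only the group contents as soon as a pattern contains a capturing group. A small sketch with a shortened, made-up sample text:

```python
import re

text = "Write to jalapeno_bacon@ipsum.com or ex-corned@beef.co.uk today"

# capturing group: findall returns just what the group matched
capturing = re.findall(r"[^ ]+@[^ ]+\.(com|co\.uk)", text)
# non-capturing group: findall returns the whole match
non_capturing = re.findall(r"[^ ]+@[^ ]+\.(?:com|co\.uk)", text)

print(capturing)      # ['com', 'co.uk']
print(non_capturing)  # ['jalapeno_bacon@ipsum.com', 'ex-corned@beef.co.uk']
```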

URLs

Only extract .com URLs from html:

https? : match http and https
(?:www\.)? : match with or without www
[^ ]+ : match everything but white space
\.com/ : only match .com URLs
[^ "]+ : stop at the first white space or "

import re

# html_source is assumed to be a local module exposing the page source as the string html
from html_source import html

url_pattern = re.compile(r'https?://(?:www\.)?[^ ]+\.com/[^ "]+')
url_matches = url_pattern.findall(html)
print(url_matches)
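
Since `html_source` is not shown, here is a self-contained sketch with an inline, made-up HTML snippet standing in for `html`. It uses the pattern with the dots escaped and the tail limited to characters that are neither white space nor a double quote:

```python
import re

# made-up HTML standing in for the html variable
html = (
    '<a href="https://www.example.com/docs/index.html">Docs</a> '
    '<a href="http://example.org/about">About</a> '
    '<a href="https://shop.example.com/cart?id=1">Cart</a>'
)

url_pattern = re.compile(r'https?://(?:www\.)?[^ ]+\.com/[^ "]+')
url_matches = url_pattern.findall(html)
print(url_matches)
# ['https://www.example.com/docs/index.html', 'https://shop.example.com/cart?id=1']
```

The `.org` link is skipped because it never contains `.com/`. For real pages an HTML parser is usually more robust than a regular expression.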

IP Addresses

Extract specific IPs from file:

import re

ip_file = 'ip.txt'

with open(ip_file, 'r') as file:
    file_content = file.read()

# only match ip addresses with a 2 in the third octet
# {1,3} allows octets of one to three digits; \b keeps a match from starting or ending mid-number
ip_pattern = re.compile(r"\b[0-9]{1,3}\.[0-9]{1,3}\.2\.[0-9]{1,3}\b")
ip_matches = ip_pattern.findall(file_content)
print(ip_matches)
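
The contents of `ip.txt` are not shown, so the following sketch uses an inline string with made-up addresses instead of a file. The pattern allows one to three digits per octet and does not check that each octet is in the range 0 to 255:

```python
import re

# made-up sample data standing in for the contents of ip.txt
file_content = """192.168.2.10
192.168.1.20
10.0.2.5
172.16.2.254
192.168.22.7
"""

# two octets, then a literal 2 as the third octet, then the last octet
# (?:...) keeps findall returning whole addresses instead of group contents
ip_pattern = re.compile(r"\b(?:[0-9]{1,3}\.){2}2\.[0-9]{1,3}\b")
ip_matches = ip_pattern.findall(file_content)
print(ip_matches)
# ['192.168.2.10', '10.0.2.5', '172.16.2.254']
```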
