
Python - RE

Sham Sui Po, Hong Kong


Regular Expressions

Metacharacters

  • []: A set of characters, e.g. [a-m]. Without the brackets the characters must match in sequence: [arm] matches a single a, r, or m, while arm matches the literal string arm
  • \: Signals a special sequence (can also be used to escape special characters) \d
  • .: Any character (except newline character) he..o
  • ^: Starts with ^hello
  • $: Ends with planet$
  • *: Zero or more occurrences he.*o
  • +: One or more occurrences he.+o
  • ?: Zero or one occurrences he.?o
  • {n}: Exactly n occurrences he.{2}o
  • {n,m}: Between n and m occurrences he.{1,2}o
  • |: Either or falls|stay
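The metacharacters above can be tried out with a few one-liners. This is only a small sketch; the test strings are made up for illustration:

```python
import re

text = "hello planet"

dot = re.findall(r"he..o", text)         # . stands for any one character
starts = re.search(r"^hello", text)      # ^ anchors at the start of the string
ends = re.search(r"planet$", text)       # $ anchors at the end of the string
star = re.search(r"he.*o", "heo")        # * also accepts zero characters
plus = re.search(r"he.+o", "heo")        # + needs at least one character
optional = re.search(r"he.?o", "helo")   # ? accepts zero or one character
exact = re.search(r"he.{2}o", text)      # exactly two characters in between
either = re.findall(r"falls|stay", "it falls or it will stay")
as_set = re.findall(r"[arm]", "arm")     # each character on its own
as_literal = re.findall(r"arm", "arm")   # the whole sequence

print(dot)               # => ['hello']
print(star.group())      # => heo
print(plus)              # => None
print(optional.group())  # => helo
print(exact.group())     # => hello
print(either)            # => ['falls', 'stay']
print(as_set)            # => ['a', 'r', 'm']
print(as_literal)        # => ['arm']
```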

Special Sequences

  • \d: Matches any decimal digit; this is equivalent to the class [0-9].
  • \D: Matches any non-digit character; this is equivalent to the class [^0-9].
  • \s: Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
  • \S: Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
  • \w: Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
  • \W: Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
  • \A: Returns a match if the specified characters are at the beginning of the string \AThe
  • \b: Returns a match where the specified characters are at the beginning or at the end of a word (the r prefix makes sure that the string is treated as a raw string) r"\bain" r"ain\b"
  • \B: Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the r prefix makes sure that the string is treated as a raw string) r"\Bain" r"ain\B"
  • \Z: Returns a match if the specified characters are at the end of the string Spain\Z
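A short sketch of how the special sequences behave, including why the r prefix matters for \b. The sample strings are made up for illustration:

```python
import re

text = "The rain in Spain"

# Without the r prefix, "\b" is a single backspace character,
# with it the pattern keeps the backslash and the letter b.
print(len("\b"), len(r"\b"))                 # => 1 2

word_start = re.findall(r"\bain", text)      # "ain" at the start of a word
word_end = re.findall(r"ain\b", text)        # "ain" at the end of a word
inside_start = re.findall(r"\Bain", text)    # "ain" NOT at the start of a word
inside_end = re.findall(r"ain\B", text)      # "ain" NOT at the end of a word
digits = re.findall(r"\d", "Room 42")
words = re.findall(r"\w+", text)
at_start = re.search(r"\AThe", text)
at_end = re.search(r"Spain\Z", text)

print(word_start)    # => []
print(word_end)      # => ['ain', 'ain']
print(inside_start)  # => ['ain', 'ain']
print(inside_end)    # => []
print(digits)        # => ['4', '2']
print(words)         # => ['The', 'rain', 'in', 'Spain']
```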

Sets

  • [arn]: Returns a match where one of the specified characters (a, r, or n) is present
  • [a-n]: Returns a match for any lower case character, alphabetically between a and n
  • [^arn]: Returns a match for any character EXCEPT a, r, and n
  • [0123]: Returns a match where any of the specified digits (0, 1, 2, or 3) are present
  • [0-9]: Returns a match for any digit between 0 and 9
  • [0-5][0-9]: Returns a match for any two-digit number from 00 to 59
  • [a-zA-Z]: Returns a match for any character alphabetically between a and z, lower case OR upper case
  • [+]: In sets, +, *, ., |, (), $ have no special meaning, so [+] means: return a match for any + character in the string
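The sets listed above can be checked the same way. Again just a sketch with made-up sample strings:

```python
import re

text = "The rain in Spain"

any_of = re.findall(r"[arn]", text)         # a, r, or n
none_of = re.findall(r"[^arn]", "rain")     # everything EXCEPT a, r, and n
letters = re.findall(r"[a-zA-Z]", "a1B2")   # lower case OR upper case letters
minutes = re.findall(r"[0-5][0-9]", "8 times before 11:45 AM")
plus_signs = re.findall(r"[+]", "1 + 1 = 2")  # + is a plain character in a set

print(any_of)      # => ['r', 'a', 'n', 'n', 'a', 'n']
print(none_of)     # => ['i']
print(letters)     # => ['a', 'B']
print(minutes)     # => ['11', '45']
print(plus_signs)  # => ['+']
```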

Examples

Filenames

Search for strings in filenames:

import re

files = [
    "2022-10-13_archive.zip",
    "2022-10-13.txt",
    "2022-09-30_archive.zip",
    "2022-09-30.txt",
    "2022-09-15_archive.zip",
    "2022-09-15.txt"
]

# get archives from a specified month

for path in files:
    # escape the dot so it only matches a literal "."
    archive_match = re.search(r"[^ ]+_archive\.zip", path)
    # skip non-archives
    if archive_match is not None:
        # search for date
        if "2022-09-" in archive_match.string:
            print(archive_match.string)

# => 2022-09-30_archive.zip
# => 2022-09-15_archive.zip
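The archive check and the date check can also be folded into one pattern with capture groups. The following is only one possible sketch, not the approach used above; the names archive_pattern and september_archives are made up for illustration:

```python
import re

files = [
    "2022-10-13_archive.zip",
    "2022-10-13.txt",
    "2022-09-30_archive.zip",
    "2022-09-30.txt",
    "2022-09-15_archive.zip",
    "2022-09-15.txt",
]

# groups 1 to 3 capture year, month, and day
archive_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})_archive\.zip")

september_archives = []
for name in files:
    # fullmatch only succeeds if the whole filename fits the pattern
    match = archive_pattern.fullmatch(name)
    if match and match.group(1) == "2022" and match.group(2) == "09":
        september_archives.append(name)

print(september_archives)
# => ['2022-09-30_archive.zip', '2022-09-15_archive.zip']
```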

Filepaths

Get archives from a specified month from a directory:

import re
from pathlib import Path

root_dir = Path('files')
filenames = root_dir.iterdir()
file_list = list(filenames)
# print(file_list)
files = [file.name for file in file_list]
# print(files)

# escape the dot so it only matches a literal "."
zip_pattern = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2}_archive\.zip')
date_pattern = re.compile(r'2022-09[^ ]+')

# keep only the archive zip files
matching_container = [file for file in files if zip_pattern.search(file)]
# get zip files of specific date
matching_date = [file for file in matching_container if date_pattern.search(file)]

print(matching_date)
# => ['2022-09-15_archive.zip', '2022-09-30_archive.zip']

Email Addresses

Extract Emails from text:

import re

text = "Spicy jalapeno_bacon@ipsum.com dolor amet turducken biltong frankfurter shankle porchetta. Tail buffalo anim, capicola eiusmod cupim beef ribs tenderloin shank@beef.br biltong. Laboris meatloaf swine, esse cillum est sausage t-bone dolor adipisicing ex-corned@beef.co.uk aliqua porchetta. Boudin fatback chuck meatball laborum meatloaf ground round, filet mignon.prosciutto@shankle.nz pig."

For emails that only use the characters a to z:

# [a-z] = match 1 letter between a-z
# [a-z]+ = match 1 or more letters between a-z
# \. = match a literal dot
email_pattern = re.compile(r"[a-z]+@[a-z]+\.[a-z]+")
email_matches = email_pattern.findall(text)
print(email_matches)
# => ['bacon@ipsum.com', 'shank@beef.br', 'corned@beef.co', 'prosciutto@shankle.nz']

  • [^ ] : Match everything but white space
  • \. : Match a "dot" (escaping the . meta character)
  • {2,} : TLD has 2 or more characters

email2_pattern = re.compile(r"[^ ]+@[^ ]+\.[a-z]{2,}")
email2_matches = email2_pattern.findall(text)
print(email2_matches)
# => ['jalapeno_bacon@ipsum.com', 'shank@beef.br', 'ex-corned@beef.co.uk', 'mignon.prosciutto@shankle.nz']

  • (?:com|co\.uk) : Match only co.uk and com addresses

email3_pattern = re.compile(r"[^ ]+@[^ ]+\.(?:com|co\.uk)")
email3_matches = email3_pattern.findall(text)
print(email3_matches)
# => ['jalapeno_bacon@ipsum.com', 'ex-corned@beef.co.uk']
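
A pattern built on [^ ]+ also picks up punctuation that touches an address, such as an opening parenthesis. Below is a minimal sketch of a stricter variant, assuming addresses only use word characters, dots, plus signs, and hyphens. The sample sentence and the names loose_pattern and strict_pattern are made up for illustration, and the pattern is not meant as a complete email validator:

```python
import re

# hypothetical sample text with punctuation touching the addresses
sample = "Contact (jalapeno_bacon@ipsum.com), ex-corned@beef.co.uk or mignon.prosciutto@shankle.nz."

loose_pattern = re.compile(r"[^ ]+@[^ ]+\.[a-z]{2,}")
# local part: word characters, dot, plus, hyphen
# domain: one label followed by one or more ".xx" parts
strict_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[a-z]{2,})+")

loose = loose_pattern.findall(sample)
strict = strict_pattern.findall(sample)

print(loose)
# => ['(jalapeno_bacon@ipsum.com', 'ex-corned@beef.co.uk', 'mignon.prosciutto@shankle.nz']
print(strict)
# => ['jalapeno_bacon@ipsum.com', 'ex-corned@beef.co.uk', 'mignon.prosciutto@shankle.nz']
```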

URLs

Only extract .com URLs from html:

  • https? : match http and https
  • (?:www\.)? : match with or w/o www
  • [^ "]+ : match everything but white space and "
  • \.com/ : only match .com urls
  • [^ "]+ : break when you hit a white space or "

import re
from html_source import html

url_pattern = re.compile(r'https?://(?:www\.)?[^ "]+\.com/[^ "]+')
url_matches = url_pattern.findall(html)
print(url_matches)
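
Since html_source is a local module, here is a self-contained sketch with a made-up HTML snippet (html_sample and its example.com / example.org links are assumptions for illustration). It uses the character class [^\s"] so that a match stops at any whitespace or at a double quote:

```python
import re

# hypothetical stand-in for the html string imported from html_source
html_sample = (
    '<a href="https://www.example.com/docs/index.html">Docs</a> '
    '<a href="http://example.org/about">About</a> '
    '<a href="https://shop.example.com/cart">Cart</a>'
)

url_pattern = re.compile(r'https?://(?:www\.)?[^\s"]+\.com/[^\s"]+')
url_matches = url_pattern.findall(html_sample)
print(url_matches)
# => ['https://www.example.com/docs/index.html', 'https://shop.example.com/cart']
```

The .org link is skipped because nothing between its scheme and the closing quote contains .com/.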

IP Addresses

Extract specific IPs from file:

import re

ip_file = 'ip.txt'

with open(ip_file, 'r') as file:
    file_content = file.read()

# only match ip addresses with a 2 in the third octet
# {1,3} lets each of the other octets have one to three digits
ip_pattern = re.compile(r"\b[0-9]{1,3}\.[0-9]{1,3}\.2\.[0-9]{1,3}\b")
ip_matches = ip_pattern.findall(file_content)
print(ip_matches)
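
A regular expression only checks the shape of an address, not whether each octet stays within 0 to 255. The sketch below runs on a made-up string instead of ip.txt and filters the regex candidates with the standard library ipaddress module. The sample addresses and the helper name is_valid_ipv4 are assumptions for illustration:

```python
import ipaddress
import re

# hypothetical stand-in for the content of ip.txt
file_content = "192.168.2.100 192.168.1.101 10.0.2.15 999.1.2.3 192.168.22.7"

# two octets of one to three digits, a literal 2 as third octet, one more octet
ip_pattern = re.compile(r"\b(?:[0-9]{1,3}\.){2}2\.[0-9]{1,3}\b")
candidates = ip_pattern.findall(file_content)


def is_valid_ipv4(value):
    """Return True if value parses as an IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return True


valid_ips = [ip for ip in candidates if is_valid_ipv4(ip)]

print(candidates)
# => ['192.168.2.100', '10.0.2.15', '999.1.2.3']
print(valid_ips)
# => ['192.168.2.100', '10.0.2.15']
```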