Web Scraping Essentials with Python

Sham Shui Po, Hong Kong

Web Scraping Tools

  • Requests is an elegant and simple HTTP library for Python, built for human beings.
  • Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
pip install requests beautifulsoup4

Using Requests to Download Web Content

import requests

url = 'http://www.python.org'
response = requests.get(url)

print('INFO :: Retrieving Web Content Successful?', response.ok)
print('INFO :: HTTP Status Code:', response.status_code)
python main.py
INFO :: Retrieving Web Content Successful? True
INFO :: HTTP Status Code: 200

The web content is sent as bytes and needs to be decoded into a string before further processing:

# Return bytes and decode to string
print(response.content.decode())
# Return content directly as unicode
print(response.text)
# Get response headers
print(response.headers)
{'Connection': 'keep-alive', 'Content-Length': '50171', 'Server': 'nginx', 'Content-Type': 'text/html; charset=utf-8', 'X-Frame-Options': 'DENY', 'Via': '1.1 vegur, 1.1 varnish, 1.1 varnish', 'Accept-Ranges': 'bytes', 'Date': 'Sat, 06 Aug 2022 10:46:40 GMT', 'Age': '3190', 'X-Served-By': 'cache-iad-kiad7000112-IAD, cache-nrt-rjtf7700056-NRT', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '390, 2623', 'X-Timer': 'S1659782801.626291,VS0,VE0', 'Vary': 'Cookie', 'Strict-Transport-Security': 'max-age=63072000; includeSubDomains'}
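The decode step can be illustrated offline; the byte string below is just a stand-in for a real `response.content`:

```python
# response.content returns the raw bytes; response.text returns an
# already decoded str. The byte string here is a made-up stand-in
# for a real response body:
raw = b'<html><head><title>Python.org</title></head></html>'
decoded = raw.decode('utf-8')  # what .text does, using the response charset

print(type(raw).__name__)      # bytes
print(type(decoded).__name__)  # str
```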

Scrape Web Content with Beautiful Soup 4

import requests
from bs4 import BeautifulSoup

url = 'https://wiki.instar.com/en/Web_User_Interface/1440p_Series/Smarthome/MQTT/'
response = requests.get(url)

print('INFO :: Retrieving Web Content Successful?', response.ok)
print('INFO :: HTTP Status Code:', response.status_code)

html = response.text

soup = BeautifulSoup(html, 'html.parser')

Beautiful Soup allows us to select content based on HTML tags, IDs, or classes:

print('INFO :: Page Title', soup.title)
print('INFO :: Page Header', soup.body.h1)

Accessing a tag like this always returns only the first match:

python main.py
INFO :: Retrieving Web Content Successful? True
INFO :: HTTP Status Code: 200
INFO :: Page Title <title data-react-helmet="true">Smarthome Menu // INSTAR MQTT Broker</title>
INFO :: Page Header <h1>INSTAR Wiki 2.5</h1>
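The first-match behaviour can be seen offline with a small made-up snippet (no network needed):

```python
from bs4 import BeautifulSoup

# Two <h1> tags, but tag access and find() both stop at the first one:
html = '<body><h1>First</h1><h1>Second</h1></body>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1)          # <h1>First</h1> -- only the first <h1>
print(soup.find('h1'))  # same element; find() also returns only the first match
```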

If you want to return all elements of a given type, use the find_all method:

import requests
from bs4 import BeautifulSoup

url = 'https://wiki.instar.com/en/Advanced_User/INSTAR_MQTT_Broker/MQTTv5_API/'
response = requests.get(url)

print('INFO :: Retrieving Web Content Successful?', response.ok)
print('INFO :: HTTP Status Code:', response.status_code)

html = response.text

soup = BeautifulSoup(html, 'html.parser')

for el in soup.find_all('td', attrs={'class': 'MuiTableCell-body'}):
    rows = el.get_text()
    print(rows)
python main.py
INFO :: Retrieving Web Content Successful? True
INFO :: HTTP Status Code: 200
Category: Network

network/config/dhcp
{"val":"1"}, {"val":"0"}

network/config/ipaddr
{"val":"192.168.178.23"}

network/config/netmask
{"val":"255.255.255.0"}

network/config/gateway
{"val":"192.168.178.1"}

network/config/fdnsip
{"val":"192.168.178.1"}

network/config/sdnsip
{"val":"192.168.178.2"}

...
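Since the topic and payload cells come back as one flat list, pairing neighbouring entries rebuilds the table rows. An offline sketch, with made-up markup standing in for the wiki page:

```python
from bs4 import BeautifulSoup

# Stand-in for the MQTT API table scraped above:
html = '''
<table><tr>
  <td class="MuiTableCell-body">network/config/dhcp</td>
  <td class="MuiTableCell-body">{"val":"1"}</td>
</tr><tr>
  <td class="MuiTableCell-body">network/config/ipaddr</td>
  <td class="MuiTableCell-body">{"val":"192.168.178.23"}</td>
</tr></table>'''

soup = BeautifulSoup(html, 'html.parser')
cells = [td.get_text() for td in soup.find_all('td', class_='MuiTableCell-body')]
pairs = list(zip(cells[::2], cells[1::2]))  # (topic, payload) tuples
print(pairs)
```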

To find multiple tags and return them as a list:

print(soup.find_all(['div', 'p']))

Find elements by ID:

print(soup.find_all('div', id='quick-start'))

Find elements by Class:

print(soup.find_all('a', class_='reference external'))
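The same ID and class lookups can also be written as CSS selectors with the select method; the snippet below is a made-up example document:

```python
from bs4 import BeautifulSoup

html = '<div id="quick-start"><a class="reference external" href="#">Docs</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('#quick-start'))         # by ID
print(soup.select('a.reference.external')) # by tag plus both classes
```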

Find all links on a page:

print(soup.find_all('a', href=True))

Strip the surrounding HTML and print only the href values:

links = soup.find_all('a', href=True)

for link in links:
    print(link.get('href'))
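Scraped href values are often relative paths; before you can request them they have to be resolved against the page URL. urljoin from the standard library handles this. The base URL and markup below are made-up examples:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'https://wiki.instar.com/en/'  # the page the links came from
html = '<a href="/en/Downloads/">Downloads</a><a href="https://www.python.org">Python</a>'
soup = BeautifulSoup(html, 'html.parser')

# Relative hrefs are resolved against base_url; absolute ones pass through:
for link in soup.find_all('a', href=True):
    print(urljoin(base_url, link['href']))
```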