Web Scraping Essentials with Python
Web Scraping Tools
- Requests is an elegant and simple HTTP library for Python, built for human beings.
- Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
pip install requests beautifulsoup4
Using Requests to Download Web Content
import requests
url = 'http://www.python.org'
response = requests.get(url)
print('INFO :: Retrieving Web Content Successful?', response.ok)
print('INFO :: HTTP Status Code:', response.status_code)
python main.py
INFO :: Retrieving Web Content Successful? True
INFO :: HTTP Status Code: 200
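In practice a request can fail with a timeout, a DNS error, or a 4xx/5xx status. A minimal sketch of defensive fetching, assuming a helper name `fetch` and an illustrative timeout value that are not part of the original:

```python
import requests

def fetch(url, timeout=5):
    """Return the page body as text, or None if the request fails.

    Illustrative helper (not from the original article): the timeout
    value and the error-handling policy are assumptions.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raise for 4xx/5xx status codes
        return response.text
    except requests.exceptions.RequestException as err:
        print('ERROR :: Request failed:', err)
        return None

# Example call (network access required):
# html = fetch('http://www.python.org')
```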
The web content is sent as bytes and needs to be decoded into a string before further processing:
# Return bytes and decode to string
print(response.content.decode())
# Return content directly as unicode
print(response.text)
# Get response headers
print(response.headers)
{'Connection': 'keep-alive', 'Content-Length': '50171', 'Server': 'nginx', 'Content-Type': 'text/html; charset=utf-8', 'X-Frame-Options': 'DENY', 'Via': '1.1 vegur, 1.1 varnish, 1.1 varnish', 'Accept-Ranges': 'bytes', 'Date': 'Sat, 06 Aug 2022 10:46:40 GMT', 'Age': '3190', 'X-Served-By': 'cache-iad-kiad7000112-IAD, cache-nrt-rjtf7700056-NRT', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '390, 2623', 'X-Timer': 'S1659782801.626291,VS0,VE0', 'Vary': 'Cookie', 'Strict-Transport-Security': 'max-age=63072000; includeSubDomains'}
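Note that `response.headers` is a case-insensitive dictionary, so header names can be looked up in any casing. A small offline sketch using the dictionary class Requests uses internally (the sample header value is copied from the output above):

```python
from requests.structures import CaseInsensitiveDict

# response.headers is a CaseInsensitiveDict; lookups ignore casing
headers = CaseInsensitiveDict({'Content-Type': 'text/html; charset=utf-8'})
print(headers['content-type'])   # same value as headers['Content-Type']
```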
Scrape Web Content with Beautiful Soup 4
import requests
from bs4 import BeautifulSoup
url = 'https://wiki.instar.com/en/Web_User_Interface/1440p_Series/Smarthome/MQTT/'
response = requests.get(url)
print('INFO :: Retrieving Web Content Successful?', response.ok)
print('INFO :: HTTP Status Code:', response.status_code)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
Beautiful Soup allows us to select content based on HTML tags, IDs, or classes:
print('INFO :: Page Title', soup.title)
print('INFO :: Page Header', soup.body.h1)
This method always returns only the first hit:
python main.py
INFO :: Retrieving Web Content Successful? True
INFO :: HTTP Status Code: 200
INFO :: Page Title <title data-react-helmet="true">Smarthome Menu // INSTAR MQTT Broker</title>
INFO :: Page Header <h1>INSTAR Wiki 2.5</h1>
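The first-hit behavior can be demonstrated offline on an inline HTML snippet (the markup below is an illustrative assumption, not taken from the scraped page):

```python
from bs4 import BeautifulSoup

html = '<body><h1>First</h1><h1>Second</h1></body>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1)                   # tag access returns only the first match
print(soup.find('h1'))           # find() behaves the same way
print(len(soup.find_all('h1')))  # find_all() returns every match
```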
If you want to return all elements of a type, you need to use the find_all method:
import requests
from bs4 import BeautifulSoup
url = 'https://wiki.instar.com/en/Advanced_User/INSTAR_MQTT_Broker/MQTTv5_API/'
response = requests.get(url)
print('INFO :: Retrieving Web Content Successful?', response.ok)
print('INFO :: HTTP Status Code:', response.status_code)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
for el in soup.find_all('td', attrs={'class': 'MuiTableCell-body'}):
    row = el.get_text()
    print(row)
python main.py
INFO :: Retrieving Web Content Successful? True
INFO :: HTTP Status Code: 200
Category: Network
network/config/dhcp
{"val":"1"}, {"val":"0"}
network/config/ipaddr
{"val":"192.168.178.23"}
network/config/netmask
{"val":"255.255.255.0"}
network/config/gateway
{"val":"192.168.178.1"}
network/config/fdnsip
{"val":"192.168.178.1"}
network/config/sdnsip
{"val":"192.168.178.2"}
...
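The output above alternates between MQTT topics and payloads. These cells can be paired back into rows; a sketch on a made-up two-column stand-in for the scraped table (the HTML below is an assumption):

```python
from bs4 import BeautifulSoup

html = '''
<table>
  <tr><td class="MuiTableCell-body">network/config/dhcp</td>
      <td class="MuiTableCell-body">{"val":"1"}</td></tr>
  <tr><td class="MuiTableCell-body">network/config/ipaddr</td>
      <td class="MuiTableCell-body">{"val":"192.168.178.23"}</td></tr>
</table>'''
soup = BeautifulSoup(html, 'html.parser')

# Collect the cell texts, then pair every topic with the payload after it
cells = [td.get_text() for td in
         soup.find_all('td', attrs={'class': 'MuiTableCell-body'})]
pairs = list(zip(cells[0::2], cells[1::2]))
for topic, payload in pairs:
    print(topic, '->', payload)
```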
To find multiple tags and return them as a list:
print(soup.find_all(['div', 'p']))
Find elements by ID:
print(soup.find_all('div', id='quick-start'))
Find elements by Class:
print(soup.find_all('a', class_='reference external'))
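The same ID and class lookups can also be written as CSS selectors with the select method. A minimal offline sketch; the HTML snippet is an illustrative assumption:

```python
from bs4 import BeautifulSoup

html = ('<div id="quick-start"><a class="reference external" '
        'href="https://www.python.org">Python</a></div>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('div#quick-start'))       # select by ID
print(soup.select('a.reference.external'))  # select by class
```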
Find all links on a page:
print(soup.find_all('a', href=True))
Strip the HTML tags and print only the URLs:
links = soup.find_all('a', href=True)
for link in links:
    print(link.get('href'))
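Scraped href values are often relative paths rather than full URLs. They can be resolved against the page URL with urljoin from the standard library (the example hrefs below are assumptions for illustration):

```python
from urllib.parse import urljoin

base = 'https://wiki.instar.com/en/'
hrefs = ['/en/Downloads/', 'Web_User_Interface/', 'https://www.python.org']
# urljoin leaves absolute URLs untouched and resolves relative ones
resolved = [urljoin(base, href) for href in hrefs]
for link in resolved:
    print(link)
```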