
Python - Minify Text for Elasticsearch

Sham Shui Po, Hong Kong

I previously looked into retrieving text from web pages with Beautiful Soup and then into processing that text. Now I want to bring those parts together and use Python to create an Elasticsearch index file. This file can then be used with Kibana to add the scraped web page to Elasticsearch.

Get Text

Let's start by retrieving the text we need using the following two Python libraries:

pip install requests bs4

The web page I am working with is built with Gatsby.js, which wraps every article in a div with the id gatsby-focus-wrapper. So we use requests to download the page's content and bs4's HTML parser to extract the text from inside this wrapper. As the index file will be in JSON format, we also have to clean up everything that would break it:

import requests
from bs4 import BeautifulSoup

url = 'https://wiki.instar.com/de/Aussenkameras/IN-9408_WQHD/'

response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')

content = soup.find('div', attrs={'id': 'gatsby-focus-wrapper'}).text
# replace double quotation marks that would break the JSON string
replaced_content = content.replace('"', ' ')
print(replaced_content)
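
Note that soup.find returns None when the wrapper div is not present, and requests will happily hand you a 404 error page to parse. A slightly more defensive version of the extraction could look like the following sketch (the structure is the same, only the error handling and the get_text call are additions of mine):

import requests
from bs4 import BeautifulSoup

url = 'https://wiki.instar.com/de/Aussenkameras/IN-9408_WQHD/'

response = requests.get(url)
# raise an exception on 4xx/5xx responses instead of parsing an error page
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')
wrapper = soup.find('div', attrs={'id': 'gatsby-focus-wrapper'})

if wrapper is None:
    raise ValueError('gatsby-focus-wrapper not found - page layout may have changed')

# get_text with a separator keeps words from running together across tags
content = wrapper.get_text(separator=' ', strip=True)
replaced_content = content.replace('"', ' ')
print(replaced_content)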

Elasticsearch

The JSON object that I need looks like the one below, where the content extracted above has to replace ARTICLE_BODY:

{
"title": "ARTICLE_TITLE",
"series": "ARTICLE_SERIES",
"type": "ARTICLE_TYPE",
"description": "ARTICLE_BODY",
"sublink1": "ARTICLE_URL",
"chapter": "ARTICLE_CHAPTER",
"tags": "ARTICLE_TAGS",
"image": "ARTICLE_IMAGE",
"imagesquare": "ARTICLE_SQUAREIMAGE",
"short": "ARTICLE_SHORT",
"abstract": "ARTICLE_ABSTRACT"
}

We can create this file as follows:

json_template = """{
"title": "ARTICLE_TITLE",
"series": "ARTICLE_SERIES",
"type": "ARTICLE_TYPE",
"description": "ARTICLE_BODY",
"sublink1": "ARTICLE_URL",
"chapter": "ARTICLE_CHAPTER",
"tags": "ARTICLE_TAGS",
"image": "ARTICLE_IMAGE",
"imagesquare": "ARTICLE_SQUAREIMAGE",
"short": "ARTICLE_SHORT",
"abstract": "ARTICLE_ABSTRACT"
}"""


with open('article.json', 'w') as file:
    file.write(json_template)
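
As a side note, the same file could also be produced with Python's built-in json module, which escapes quotation marks and other special characters automatically. This is just an alternative sketch using the placeholder values from above, not what the rest of the article uses:

import json

article = {
    'title': 'ARTICLE_TITLE',
    'series': 'ARTICLE_SERIES',
    'type': 'ARTICLE_TYPE',
    'description': 'ARTICLE_BODY',
    'sublink1': 'ARTICLE_URL',
    'chapter': 'ARTICLE_CHAPTER',
    'tags': 'ARTICLE_TAGS',
    'image': 'ARTICLE_IMAGE',
    'imagesquare': 'ARTICLE_SQUAREIMAGE',
    'short': 'ARTICLE_SHORT',
    'abstract': 'ARTICLE_ABSTRACT'
}

with open('article.json', 'w') as file:
    # json.dump takes care of escaping quotes inside the values
    json.dump(article, file, indent=2, ensure_ascii=False)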

Bringing the scraping and the JSON template together looks like this:

import requests
import re
from bs4 import BeautifulSoup

url = 'https://wiki.instar.com/en/Outdoor_Cameras/IN-9408_WQHD/'
title = 'IN-9408 WQHD :: Product Overview'
camera_series = '["1440p Cameras", "Outdoor Cameras"]'
article_type = 'User Manual'
link = '/Outdoor_Cameras/IN-9408_WQHD/'
chapter = 'Outdoor Cameras'
tags = '["IN-9408 WQHD", "INSTAR", "products", "1440p series", "Indoor Cameras", "IP camera", "web cam", "overview"]'
image = '/en/images/Search/P_SearchThumb_IN-9408HD.webp'
imagesquare = '/en/images/Search/TOC_Icons/Wiki_Tiles_P-IN-9408HD_white.webp'
short = 'IN-9408 WQHD - Product Overview'
abstract = 'The IN-9408 WQHD with a SONY STARVIS Image Sensor and a 5 megapixel video resolution.'

response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')

content = soup.find('div', attrs={'id': 'gatsby-focus-wrapper'}).text
# replace double quotation marks that would break the JSON string
replaced_content = content.replace('"', ' ')
# collapse runs of whitespace (newlines, tabs, multiple spaces) into single spaces
single_space = re.sub(r'\s+', ' ', replaced_content)


json_template = """{
"title": "ARTICLE_TITLE",
"series": ARTICLE_SERIES,
"type": "ARTICLE_TYPE",
"description": "ARTICLE_BODY",
"sublink1": "ARTICLE_URL",
"chapter": "ARTICLE_CHAPTER",
"tags": ARTICLE_TAGS,
"image": "ARTICLE_IMAGE",
"imagesquare": "ARTICLE_SQUAREIMAGE",
"short": "ARTICLE_SHORT",
"abstract": "ARTICLE_ABSTRACT"
}"""


# fill in the placeholders one at a time, starting with the scraped article body
add_body = json_template.replace('ARTICLE_BODY', single_space)
add_title = add_body.replace('ARTICLE_TITLE', title)
add_series = add_title.replace('ARTICLE_SERIES', camera_series)
add_type = add_series.replace('ARTICLE_TYPE', article_type)
add_url = add_type.replace('ARTICLE_URL', link)
add_chapter = add_url.replace('ARTICLE_CHAPTER', chapter)
add_tags = add_chapter.replace('ARTICLE_TAGS', tags)
add_image = add_tags.replace('ARTICLE_IMAGE', image)
add_imagesquare = add_image.replace('ARTICLE_SQUAREIMAGE', imagesquare)
add_short = add_imagesquare.replace('ARTICLE_SHORT', short)
add_abstract = add_short.replace('ARTICLE_ABSTRACT', abstract)


with open('elasticsearch/article.json', 'w') as file:
    file.write(add_abstract)
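
Since a single stray character in the scraped text would still render the file invalid, a quick sanity check is to parse it back in. A minimal sketch, assuming the output path used above:

import json

with open('elasticsearch/article.json', 'r') as file:
    article = json.load(file)

# a json.JSONDecodeError above means the template still contains characters that break JSON
print(article['title'])
print(article['series'])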