Tesseract OCR on Arch Linux
- Part I - Tesseract OCR on Arch Linux
- Part II - spaCy NER on Arch Linux
- Part III - spaCy NER Predictions
Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License. Originally developed by Hewlett-Packard as proprietary software in the 1980s, it was released as open source in 2005 and development has been sponsored by Google since 2006.
In 2006, Tesseract was considered one of the most accurate open-source OCR engines available.
- Project Setup
- Loading Image files from Disk
- Text Extraction
- Data Preparation
- Import all Cards
- Extract all Text
- Write Extracted Text to File
- Labeling your Data
Project Setup
Make sure you have Python installed and added to PATH:
python --version
Python 3.9.7
pip --version
pip 21.3.1
Create a work directory and set up a virtual environment:
mkdir -p /opt/Python/pyOCR
cd /opt/Python/pyOCR
python -m venv .env
source .env/bin/activate
Create a file dependencies.txt with all the necessary dependencies:
numpy
pandas
scipy
matplotlib
pillow
opencv-python
opencv-contrib-python
jupyter
And install them via the Python package manager:
pip install -r dependencies.txt
Install Tesseract globally with pacman:
sudo pacman -Syu tesseract
looking for conflicting packages...
Packages (2) leptonica-1.82.0-1 tesseract-4.1.1-7
Verify that the installation was successful:
tesseract -v
tesseract 4.1.1
Install the training data you need depending on your language requirements - e.g. English:
sudo pacman -S tesseract-data-eng
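To confirm the language pack is visible to Tesseract, you can list the installed languages:
tesseract --list-langs
The output should include eng.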
Now we can add our last dependency - a library that allows us to use Tesseract in our program:
pip install pytesseract
Successfully installed Pillow-8.4.0 pytesseract-0.3.8
I am going to use a Jupyter notebook to experiment with Tesseract:
jupyter notebook
When you are able to import all dependencies without getting an error message, you are all set!
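As a quick sanity check, the first notebook cell could import everything and ask pytesseract for the Tesseract version it found (get_tesseract_version is part of the pytesseract API):
import numpy as np
import pandas as pd
import PIL as pl
import cv2 as cv
import pytesseract as ts

# Should print the version installed above, e.g. 4.1.1
print(ts.get_tesseract_version())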
Loading Image files from Disk
I want to train a model that allows me to extract contact information from business cards. To get started you can download card templates:
Download them to ./images and try to import them into your notebook using OpenCV and Pillow:
import numpy as np
import pandas as pd
import PIL as pl
import PIL.Image # importing PIL alone does not load the Image submodule
import cv2 as cv
import pytesseract as ts
# Pillow
# Use the full path here
img_pl = pl.Image.open('/opt/Python/pyOCR/images/card_46.jpg')
img_pl
# OpenCV
# Use the full path here
img_cv = cv.imread('/opt/Python/pyOCR/images/card_46.jpg')
cv.startWindowThread()
cv.imshow('Business Card', img_cv)
cv.waitKey(0)
cv.destroyAllWindows()
Pillow returns a JPEG image object, while OpenCV returns a NumPy array:
type(img_pl) #PIL.JpegImagePlugin.JpegImageFile
type(img_cv) #numpy.ndarray
There seems to be an issue with the OpenCV destroyAllWindows method under Linux. I will exclude it for now and work with Pillow instead.
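Since both libraries are in place, it can still be useful to convert between the two representations - a minimal sketch, reusing the imports from above and keeping in mind that Pillow uses RGB channel order while OpenCV uses BGR:
# Pillow -> OpenCV: turn the image into a NumPy array and swap RGB to BGR
img_from_pl = cv.cvtColor(np.array(img_pl), cv.COLOR_RGB2BGR)
# OpenCV -> Pillow: swap BGR back to RGB and wrap it in an Image object
img_from_cv = pl.Image.fromarray(cv.cvtColor(img_cv, cv.COLOR_BGR2RGB))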
Text Extraction
text_pl = ts.image_to_string(img_pl)
print(text_pl)
Test your business cards and see which ones are readable and which are not. I downloaded quite a few that were too low in resolution and had to be discarded.
data = ts.image_to_data(img_pl)
Now that we can read the text, we have to write it into a data object to be able to work with it. The data is structured by \n and \t markers:
'level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n1\t1\t0\t0\t0\t0\t0\t0\t875\t518\t-1\t\n2\t1\t1\t0\t0\t0\t532\t37\t306\t38\t-1\t\n3\t1\t1\t1\t0\t0\t532\t37\t306\t38\t-1\t\n4\t1\t1\t1\t1 ...
We can clean up this data with a map:
dataList = list(map(lambda x: x.split('\t'),data.split('\n')))
We can now wrap this data into a Pandas Dataframe:
df = pd.DataFrame(dataList[1:],columns=dataList[0])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 level 56 non-null object
1 page_num 55 non-null object
2 block_num 55 non-null object
3 par_num 55 non-null object
4 line_num 55 non-null object
5 word_num 55 non-null object
6 left 55 non-null object
7 top 55 non-null object
8 width 55 non-null object
9 height 55 non-null object
10 conf 55 non-null object
11 text 55 non-null object
dtypes: object(12)
memory usage: 5.4+ KB
Note that the height column corresponds to the font size of a word: you can see the confidence drop when the size is too small.
df.head(10)
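That observation is easy to check - a quick sketch that casts the relevant columns (everything is still of type object at this point) and prints the average confidence per word height:
check = df.dropna().astype({'conf': int, 'height': int})
check = check[check['conf'] >= 0] # only word-level rows carry a real confidence
print(check.groupby('height')['conf'].mean())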
Data Preparation
df.dropna(inplace=True) # Drop empty values and rows
col_int = ['level','page_num','block_num','par_num','line_num','word_num','left','top','width','height','conf']
df[col_int] = df[col_int].astype(int) # Change all columns with number values to type int
df.dtypes
level int64
page_num int64
block_num int64
par_num int64
line_num int64
word_num int64
left int64
top int64
width int64
height int64
conf int64
text object
dtype: object
Drawing Bounding Boxes
image = img_cv.copy()
level = 'word'
for l,x,y,w,h,c,t in df[['level','left','top','width','height','conf','text']].values:
    #print(l,x,y,w,h,c)
    if level == 'page':
        if l == 1:
            cv.rectangle(image,(x,y),(x+w,y+h),(0,0,0),2)
        else:
            continue
    elif level == 'block':
        if l == 2:
            cv.rectangle(image,(x,y),(x+w,y+h),(255,0,0),1)
        else:
            continue
    elif level == 'paragraph':
        if l == 3:
            cv.rectangle(image,(x,y),(x+w,y+h),(0,255,0),1)
        else:
            continue
    elif level == 'line':
        if l == 4:
            cv.rectangle(image,(x,y),(x+w,y+h),(255,0,51),1)
        else:
            continue
    elif level == 'word':
        if l == 5:
            cv.rectangle(image,(x,y),(x+w,y+h),(0,0,255),1)
            cv.putText(image,t,(x,y),cv.FONT_HERSHEY_COMPLEX_SMALL,1,(255,255,255),1)
        else:
            continue
cv.imshow("bounding box",image)
cv.waitKey(0)
cv.destroyAllWindows()
cv.waitKey(1)
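Since destroyAllWindows is giving us trouble under Linux, it may be more robust to skip the window entirely and write the annotated image to disk instead - the output filename here is arbitrary:
cv.imwrite('/opt/Python/pyOCR/images/card_46_boxes.jpg', image)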
Import all Cards
import numpy as np
import pandas as pd
import cv2 as cv
import pytesseract as ts
import os
from glob import glob
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
imgPaths = glob('/opt/Python/pyOCR/images/*.jpg')
Try print(imgPaths) to see if your images are found - note that I had to use the absolute path to my images folder here.
Extract the filename:
imgPath = imgPaths[0]
_, filename = os.path.split(imgPath)
Run print(filename) - now it only returns the image name instead of the entire path.
Extract all Text
image = cv.imread(imgPath)
data = ts.image_to_data(image)
dataList = list(map(lambda x: x.split('\t'),data.split('\n')))
df = pd.DataFrame(dataList[1:], columns=dataList[0])
Print the value of df to see if your image was successfully read.
Now we can filter for text with a suitable confidence value (e.g. >= 30%) - since only word-level rows (level=5) carry a real confidence, this also drops the page, block, paragraph and line rows:
df.dropna(inplace=True)
df['conf'] = df['conf'].astype(int)
textData = df.query('conf >= 30')
businessCard = pd.DataFrame()
businessCard['text'] = textData['text']
businessCard['id'] = filename
Print out businessCard and you will see all the text that was discovered on your first (index 0) business card with a confidence level of over 30%.
Now all we have to do is to take this code and run a loop over it to capture all images inside the directory:
allBusinessCards = pd.DataFrame(columns=['id', 'text'])
for imgPath in tqdm(imgPaths,desc="Business Card"):
    # Get Filenames
    _, filename = os.path.split(imgPath)
    # Extract Data
    image = cv.imread(imgPath)
    data = ts.image_to_data(image)
    # Write Data to Frame
    dataList = list(map(lambda x: x.split('\t'),data.split('\n')))
    df = pd.DataFrame(dataList[1:], columns=dataList[0])
    # Drop Everything that is not useful
    df.dropna(inplace=True)
    df['conf'] = df['conf'].astype(int)
    textData = df.query('conf >= 30')
    # Define a Business Card Entity
    businessCard = pd.DataFrame()
    businessCard['text'] = textData['text']
    businessCard['id'] = filename
    # Add Card to All Cards
    allBusinessCards = pd.concat((allBusinessCards,businessCard))
Write Extracted Text to File
allBusinessCards.to_csv('businessCards.csv', index=False)
The data will be written to ./src/businessCards.csv for further processing.
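As a quick verification you can read the file back in and inspect it:
check = pd.read_csv('businessCards.csv')
print(check.shape) # one row per extracted word
print(check.head())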
Labeling your Data
Mark the start and end of each word of importance:
B | Beginning
I | Inside
O | Outside
And define the entities you want to search for:
NAME | Name
DES | Designation
ORG | Organisation
PHONE | Phone Number
EMAIL | Email Address
WEB | Website
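Put together, a labeled card might look like the following - a purely hypothetical example of the BIO scheme, where every token outside an entity is tagged O:
# hypothetical tokens from a single card with their BIO entity tags
labeled = [
    ('John',        'B-NAME'),  # first token of the name
    ('Smith',       'I-NAME'),  # inside the same name
    ('Senior',      'B-DES'),
    ('Developer',   'I-DES'),
    ('Acme',        'B-ORG'),
    ('Corp',        'I-ORG'),
    ('+1-555-0100', 'B-PHONE'),
    ('Contact',     'O'),       # outside any entity
]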