Optical Character Recognition (OCR)

Vanshita Tripathi
Jul 30, 2021
9 min read

Updated: Aug 5, 2021

An Introduction to OCR:

Optical character recognition (OCR) is an active research area that aims to build computer systems able to extract and process text from images automatically. Demand for document digitization with OCR is currently high.

OCR can detect printed or handwritten text and store it on disk for later processing by our computers. The technology lets data be extracted from an image regardless of the image format or of how the text is embedded in the image. It converts text from its digital image form into a machine-readable, editable text format. OCR works through several sub-processes, which include:

Image Preprocessing
Image classification and Text localization
Character Segmentation
Feature Extraction
Post Processing

Challenges that can arise when building an OCR:

Complex background or distortion: OCR can struggle to detect text because a complex image makes it harder to separate the text from the non-text regions.
Uneven lighting can also be a challenge, as it makes it harder to detect the text in the image accurately.
Variation in fonts and font sizes can degrade text segmentation, because the OCR can be confused by text of varying size.
A multi-language text environment is also a challenge for OCR.
Rotation/skew is a challenge because the viewpoint is not fixed in camera images. Deskewing the image is therefore one of the pre-processing steps.

Some of the best OCRs we can use for our projects and models:

Google Vision API: Google Cloud Vision OCR is a Google service that extracts text from a digital image. It is one of the best OCRs and is fairly priced, with billing handled through Google Cloud. Once billing is enabled, you can use the Vision APIs for your OCR.
Microsoft Computer Vision API (Cognitive Services): Microsoft's Computer Vision API includes several advanced algorithms for image recognition. In addition to extracting text from an image, it can detect offensive content and faces. It requires a Microsoft Azure subscription.
Tesseract OCR: Tesseract can be downloaded and installed on your computer and then called from your program to extract text from images. It is easily accessible and can be used from a wide variety of programming languages. It does not come with a built-in GUI.

Installation:

You can download a 64-bit Windows installer from the link here: https://github.com/UB-Mannheim/tesseract/wiki

After the download finishes, install it in your desired location, and remember to note the installation path for later use.

These are some OCRs that I recommend. I am using Tesseract OCR for this tutorial, called from Python through the pytesseract wrapper, which can be installed with pip install pytesseract.

We now know how to install Tesseract. Next, let's try to extract text from a picture. To do that, we will use the Python OpenCV library, which will help us read images from a given directory. Before we get into the coding part, let's have a brief introduction to OpenCV.

Introduction to OpenCV:

OpenCV is an open-source software library that brings computer vision features into our programs. It provides functions for real-time computer vision projects. It is written in C++ and also has interfaces for other languages such as Python, MATLAB, and Java.

Installation:

Using pip: open your command prompt and type pip install opencv-python to install OpenCV.

If you are using Anaconda: you can type the same pip command in your Anaconda prompt, or you can use conda install -c conda-forge opencv

OpenCV has very useful functions for image processing. Let's take a look at some functions that are important to know before using OpenCV for your OCR.

cv2.imread() - Reads the image from the path you provide.

Image = cv2.imread('Image21.jpg')

cv2.imshow() - Displays an image in a window. It takes a window name and the image, and needs a cv2.waitKey() call afterwards to keep the window open.

cv2.imshow('Image21', Image)
cv2.waitKey(0)

cv2.cvtColor() - Changes the color space of your image, for example from BGR to grayscale:

gray_Image = cv2.cvtColor(Image, cv2.COLOR_BGR2GRAY)

cv2.resize() - Resizes your image. You pass the exact dimensions you want as (width, height).

resized_Im= cv2.resize(Image,(225,225))
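One detail that is easy to get wrong: cv2.resize() takes the target size as (width, height), while an image's shape is reported as (height, width, channels). The helper below is a small sketch (the name scaled_size is made up for this example and is not part of OpenCV) that computes a target size for a chosen width while keeping the aspect ratio.

```python
def scaled_size(shape, target_width):
    """Return (width, height) for cv2.resize, preserving the aspect ratio.

    `shape` is an image shape as NumPy reports it: (height, width[, channels]).
    """
    height, width = shape[0], shape[1]
    scale = target_width / width
    return target_width, max(1, round(height * scale))


# A 400x600 (height x width) colour image scaled down to 300 pixels wide
print(scaled_size((400, 600, 3), 300))  # (300, 200)
```

The returned tuple can be passed straight to the resize call, for example cv2.resize(Image, scaled_size(Image.shape, 300)).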

Edge detection - You can use the Canny edge detection technique to get an outline of the image. The two numbers are the lower and upper thresholds used by the detector.

EdgeDet_Image = cv2.Canny(Image, 100, 200)

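Thresholding with OpenCV - A threshold value is chosen for the image. With cv2.THRESH_BINARY, every pixel whose value is less than or equal to the threshold becomes 0, and every pixel above it becomes the maximum value you pass (255 here). Thresholding is usually applied to a grayscale image, and the function returns both the threshold used and the thresholded image.

ret, thresh_Image = cv2.threshold(Image, 127, 255, cv2.THRESH_BINARY)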
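The binary rule is simple enough to write out by hand. The following is a minimal pure-Python sketch of what it does to each pixel. It is illustrative only: the function name binary_threshold is made up for this example, and OpenCV itself works on arrays and is far faster.

```python
def binary_threshold(pixels, thresh, maxval=255):
    """Apply the THRESH_BINARY rule to a 2D list of grayscale values."""
    return [[maxval if p > thresh else 0 for p in row] for row in pixels]


gray = [
    [12, 130, 200],
    [127, 128, 90],
]
print(binary_threshold(gray, 127))  # [[0, 255, 255], [0, 255, 0]]
```

Note that a pixel exactly equal to the threshold (127 here) goes to 0, because only values strictly above the threshold receive the maximum value.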

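Gaussian blur - A Gaussian kernel is used to reduce noise in the image. The kernel size is given as a tuple, and passing 0 for the standard deviation lets OpenCV derive it from the kernel size.

blur = cv2.GaussianBlur(Image, (5, 5), 0)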
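To see why a Gaussian kernel reduces noise, it can help to build a small one by hand. The sketch below works on a single row of pixels in pure Python. The names gaussian_kernel_1d and blur_row, the choice sigma=1.0, and the way the borders are handled are all assumptions made for illustration; OpenCV picks its own sigma and border handling, so the exact numbers will differ.

```python
import math


def gaussian_kernel_1d(size, sigma):
    """Return `size` Gaussian weights centred on the middle, summing to 1."""
    centre = (size - 1) / 2
    weights = [math.exp(-((i - centre) ** 2) / (2 * sigma ** 2)) for i in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_row(row, kernel):
    """Blur one row of pixel values, repeating edge pixels at the borders."""
    half = len(kernel) // 2
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - half, 0), len(row) - 1)  # clamp index to the row
            acc += w * row[j]
        out.append(acc)
    return out


kernel = gaussian_kernel_1d(5, sigma=1.0)
noisy = [100, 100, 255, 100, 100, 100, 100]  # one bright noise pixel
print([round(v) for v in blur_row(noisy, kernel)])
```

The single bright pixel is spread over its neighbours and its peak drops well below 255, which is the smoothing effect the blur step relies on.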

A Simple Text Extractor:

Let's try to extract text from images with a simple code snippet:

First, let me show you the image I am using:

Code Snippet for a simple text extractor:

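import cv2
import pytesseract
# Point pytesseract at the Tesseract executable (the path noted during installation)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Read and display the input image
image = cv2.imread('C:/Users/vansh/Downloads/TS1.jpg')
cv2.imshow('Test', image)
cv2.waitKey(0)
cv2.destroyAllWindows()
# Run OCR on the image and print the extracted text
text = pytesseract.image_to_string(image)
print(text)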

We got the following results:

The above code gives a brief idea of how an OCR works, so let's move forward and make a cool OCR.

OCR with Tesseract and OpenCV:

Making an OCR involves a few sub-processes, so in this tutorial we will go through them one by one.

The following is the invoice that is used throughout the tutorial.

1. Pre-processing Phase:

Pre-processing is the most important step in making the OCR, as the accuracy of the OCR depends greatly on it. The purpose of the pre-processing phase is to make the image as clean as possible so that the OCR can distinguish the text from the background. The important steps of the pre-processing phase include gray-scaling, resizing, thresholding, edge detection, dilation, erosion, and Gaussian blur.

You can follow the code snippet below for the pre-processing phase:

import cv2
import numpy as np
img = cv2.imread('Path to your image')
# Resize to half the original size; cv2.resize expects (width, height)
image = cv2.resize(img, (int(img.shape[1]/2), int(img.shape[0]/2)))
cv2.imshow('image', image)
cv2.waitKey(0)
# Grayscaling of the image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Gaussian blurring to reduce noise
noiseless = cv2.GaussianBlur(gray, (5, 5), 0)
cv2.imshow('image', noiseless)
cv2.waitKey(0)
# Thresholding (Otsu picks the threshold; the inverted result shows text as white on black)
thresh = cv2.threshold(noiseless, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
cv2.imshow('image', thresh)
cv2.waitKey(0)
# Dilation of the thresholded image
kernel = np.ones((2, 2), np.uint8)
dilate = cv2.dilate(thresh, kernel, iterations=1)
cv2.imshow('image', dilate)
cv2.waitKey(0)
# Erosion of the thresholded image
erosion = cv2.erode(thresh, kernel, iterations=1)
cv2.imshow('image', erosion)
cv2.waitKey(0)
# Canny edge detection
Edge_det = cv2.Canny(image, 100, 200)
cv2.imshow('image', Edge_det)
cv2.waitKey(0)
cv2.destroyAllWindows()
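The thresholding step in the pre-processing code passes 0 as the threshold together with the cv2.THRESH_OTSU flag, which asks OpenCV to choose the threshold from the image histogram using Otsu's method. As a rough illustration of the idea (not OpenCV's actual implementation; the function name otsu_threshold and the sample values are invented), the method tries every possible threshold and keeps the one that best separates dark and bright pixels:

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that maximises the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, 0.0
    w_dark, sum_dark = 0, 0  # pixel count and intensity sum of the darker class
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_bright = total - w_dark
        if w_dark == 0 or w_bright == 0:
            continue  # one class is empty, nothing to separate
        mean_dark = sum_dark / w_dark
        mean_bright = (sum_all - sum_dark) / w_bright
        between_var = w_dark * w_bright * (mean_dark - mean_bright) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Dark "ink" values around 30-40 and bright "paper" values around 200-220
sample = [30, 35, 40, 32, 38] * 20 + [200, 210, 220, 205, 215] * 80
print(otsu_threshold(sample))
```

For an image with clearly separated ink and paper values like this sample, the chosen threshold falls between the two groups, so no manual tuning of the threshold is needed.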

After running the pre-processing code above, we get the following results:

2. Text Localization:

Text localization gives us bounding boxes around the text, which makes it easier to recognize the text and to see it against the image background. This phase also helps character segmentation, because it shows clearly where the text is and separates the text in different regions.

You can localize your text using the code snippet below:

import cv2
import pytesseract
# Use the next line only if your system shows a "tesseract not found" error
pytesseract.pytesseract.tesseract_cmd = 'Path to the tesseract.exe'
img = cv2.imread('Path to your image')
image = cv2.resize(img, (int(img.shape[1]/2), int(img.shape[0]/2)))
# image.shape is (height, width, channels)
height, width, _ = image.shape
# Each line of the result is: character left bottom right top page
boxes = pytesseract.image_to_boxes(image)
for b in boxes.splitlines():
    b = b.split(' ')
    # Box y-coordinates are measured from the bottom edge, so flip them for OpenCV
    cv2.rectangle(image, (int(b[1]), height - int(b[2])), (int(b[3]), height - int(b[4])), (0, 255, 0), 2)
cv2.imshow('img', image)
cv2.waitKey(0)
cv2.destroyAllWindows()
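The subtraction from the image height in the localization code is there because image_to_boxes reports each character as "character left bottom right top page" with the origin at the bottom-left corner, while OpenCV puts the origin at the top-left. A small sketch of that conversion on its own, using a hand-written sample string instead of real Tesseract output (the function name parse_boxes is made up for this example):

```python
def parse_boxes(boxes_text, image_height):
    """Convert Tesseract box lines into (char, x1, y1, x2, y2) with a top-left origin."""
    parsed = []
    for line in boxes_text.splitlines():
        parts = line.split(' ')
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        char = parts[0]
        left, bottom, right, top = (int(v) for v in parts[1:5])
        # Flip the y-axis: Tesseract counts up from the bottom edge
        parsed.append((char, left, image_height - top, right, image_height - bottom))
    return parsed


# Hypothetical output for the word "OCR" on a 100-pixel-tall image
sample = "O 10 60 30 90 0\nC 34 60 52 90 0\nR 56 60 76 90 0"
for box in parse_boxes(sample, image_height=100):
    print(box)
# ('O', 10, 10, 30, 40)
# ('C', 34, 10, 52, 40)
# ('R', 56, 10, 76, 40)
```

Each tuple now holds the top-left and bottom-right corners in OpenCV's coordinate system, ready to be passed to cv2.rectangle.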

After running the localization code above, we get the following results:

3. Character Segmentation:

In this step, the text in the image is isolated from the background; in other words, we separate text and background by classifying the image into homogeneous regions. A homogeneous region contains only one kind of information, such as text, a diagram, a flowchart, or a table. This phase is very important for implementing an OCR with good accuracy. The segmentation phase converts the text into an m*n matrix.

import pytesseract
from pytesseract import Output
# Pass the thresholded image from the pre-processing phase to Tesseract
details = pytesseract.image_to_data(thresh, output_type=Output.DICT)
print(details.keys())
# You can also print your data in a structured form using:
# Text = pytesseract.image_to_data(image, output_type=Output.DICT)
# print(Text)

After running this code we will get the following output:

You can see that the segmentation code has done its part and has divided the data from the images into categories.
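The dictionary returned by image_to_data holds parallel lists, so entry i of 'text', 'conf', 'left', 'top', 'width', and 'height' all describe the same item. As a sketch of how those lists can be used together, the helper below keeps only the words above a confidence cut-off. The sample dictionary is hand-written to stand in for real Tesseract output, and the name confident_words and the cut-off of 60 are choices made for this example.

```python
def confident_words(details, min_conf=60):
    """Yield (text, left, top, width, height) for words above a confidence cut-off."""
    for i, text in enumerate(details['text']):
        # Rows that are not words (blocks, lines) have empty text and a confidence of -1
        if text.strip() and float(details['conf'][i]) > min_conf:
            yield (text, details['left'][i], details['top'][i],
                   details['width'][i], details['height'][i])


# Hand-written stand-in for the dictionary returned with Output.DICT
sample = {
    'text':   ['', 'Invoice', 'Date:', '30-07-2021', 'x?'],
    'conf':   ['-1', '96', '91', '88', '23'],
    'left':   [0, 40, 40, 120, 300],
    'top':    [0, 20, 60, 60, 60],
    'width':  [500, 110, 70, 130, 20],
    'height': [700, 30, 24, 24, 24],
}
for word in confident_words(sample):
    print(word)
```

Converting the confidence with float() keeps the sketch working whether the values arrive as strings or as numbers.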

4. Template Matching technique:

The input characters obtained are assigned to predefined classes or regions. Characters are grouped according to the information detected in them, so that each class holds items with homogeneous qualities and different kinds of input end up in different classes. Template matching is one of the techniques used for this classification.

For template matching, we will use a regular expression here as the template pattern that we will match with our OCR results to find the appropriate text.

We take the date as our example for template matching. For this code, we first define a regular expression for the date, which is '\d{2}-\d{2}-\d{4}'. The regular expression varies with the format in which the date is written. If you want to handle another date format, you can use one of these: '^(0[1-9]|[12][0-9]|3[01])[- /.]' (matches only a leading day and a separator), '^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d$' (dd/mm/yyyy).

import re
import cv2
import pytesseract
from pytesseract import Output
img = cv2.imread('Path to your image')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
keys = list(d.keys())
# Regular expression for a date such as 30-07-2021
date_pattern = r'\d{2}-\d{2}-\d{4}'
n_boxes = len(d['text'])
for i in range(n_boxes):
    # Keep only the words Tesseract is reasonably confident about
    if float(d['conf'][i]) > 60:
        if re.match(date_pattern, d['text'][i]):
            (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
            img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imshow('img', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

On implementing this code you will get the following result:

As expected, the code finds the desired output, the date, through template matching.
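One thing to keep in mind is that re.match only anchors the pattern at the start of the word, so a string such as '30-07-20210' would still count as a date. The sketch below shows one way to check several of the date patterns from this section against whole words using fullmatch. The dictionary labels and the function name date_format are made up for this example.

```python
import re

# Patterns taken from the text above; the dictionary keys are only labels
DATE_PATTERNS = {
    'dd-mm-yyyy': re.compile(r'\d{2}-\d{2}-\d{4}'),
    'dd/mm/yyyy': re.compile(r'^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d$'),
}


def date_format(word):
    """Return the label of the first pattern matching the whole word, else None."""
    for label, pattern in DATE_PATTERNS.items():
        if pattern.fullmatch(word):
            return label
    return None


for word in ['30-07-2021', '05/08/2021', '30-07-20210', 'Invoice']:
    print(word, '->', date_format(word))
# 30-07-2021 -> dd-mm-yyyy
# 05/08/2021 -> dd/mm/yyyy
# 30-07-20210 -> None
# Invoice -> None
```

Whether to use match or fullmatch depends on the documents: fullmatch is stricter, while match tolerates trailing characters such as punctuation that Tesseract attaches to a word.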

After describing every process involved in making an OCR, we are left with one big question: training the OCR. As we have seen throughout this tutorial, the OCR can work well without a training dataset. So is training required, if you can just enjoy the results without the hassle of creating a dataset and a deep net model?

Training or No Training?

To begin with, an OCR can be either template-based, meaning it is customized to the needs of the user, or continuously trained, meaning it can detect the template on its own and recognize any kind of document.

Template-Based OCRs:

The template-based OCR method tells the system where in the document a particular piece of text is located, so that each field is read from its own isolated section of the page. This lets the computer find the desired output in a document by retrieving it from a known position. In other words, template-based OCR uses a structured layout to guide the OCR. An invoice OCR, for example, can be customized to produce exactly the desired output, saving time and effort.

A template-based OCR does not necessarily need to be trained on a dataset: we can customize the code according to our needs, and the OCR will process the information the way we want. For example, if a company wants to use my OCR, we can adjust the template to suit their needs. Template-based OCRs can also be trained on the particular templates you plan to use, but this only serves to push the accuracy higher. The OCR will then process your documents based on the forms it was trained on and extract each value into the appropriate location.
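A minimal sketch of the template idea, assuming the template is simply a mapping from field names to rectangular regions of the page. The field names, region coordinates, and helper names below are all hypothetical, and the sample dictionary is hand-written to stand in for the output of pytesseract.image_to_data with Output.DICT.

```python
# Hypothetical template: field name -> (left, top, width, height) in pixels
INVOICE_TEMPLATE = {
    'invoice_date': (100, 50, 200, 40),
    'total': (350, 600, 150, 40),
}


def inside(region, left, top, width, height):
    """True if the centre of a word box falls inside the template region."""
    r_left, r_top, r_width, r_height = region
    cx, cy = left + width / 2, top + height / 2
    return r_left <= cx <= r_left + r_width and r_top <= cy <= r_top + r_height


def extract_fields(details, template):
    """Join the words found inside each region, in the order they were detected."""
    fields = {name: [] for name in template}
    for i, text in enumerate(details['text']):
        if not text.strip():
            continue
        box = (details['left'][i], details['top'][i],
               details['width'][i], details['height'][i])
        for name, region in template.items():
            if inside(region, *box):
                fields[name].append(text)
    return {name: ' '.join(words) for name, words in fields.items()}


# Hand-written stand-in for the dictionary Tesseract would return
sample = {
    'text':   ['Date:', '30-07-2021', 'Total', '1,250.00'],
    'left':   [20, 120, 260, 370],
    'top':    [55, 55, 605, 605],
    'width':  [70, 130, 60, 100],
    'height': [24, 24, 24, 24],
}
print(extract_fields(sample, INVOICE_TEMPLATE))
# {'invoice_date': '30-07-2021', 'total': '1,250.00'}
```

Adapting the OCR to another company's invoice then amounts to editing the regions in the template rather than retraining anything, which is the main appeal of the approach.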

Continuously trained ML:

There are indeed very efficient solutions for certain OCR tasks that do not require training with deep learning. But for smooth automation and for more general solutions, training with deep nets becomes necessary. Google Vision API and Microsoft's Cognitive Services API are examples of continuously trained OCRs.

Datasets that can be used:

MNIST dataset: MNIST contains handwritten digits from 0 to 9, with 60,000 training images and 10,000 test images. It is widely used in handwriting recognition. It's impossible to talk about OCR without MNIST, although its character set is very limited.

License plates: this kind of dataset can be used to train an OCR model to recognize and store the license plate numbers of the cars present in an image or a video.

PDF: PDF OCR is the most common use of OCR for digitizing printed text. Many OCR tools, such as Tesseract, give commendable accuracy on PDFs because their printed nature makes the text easy to detect.

Now we know which kinds of datasets an OCR should be trained on and where it can be template-based. Training the OCR with deep net models gives commendable accuracy and makes the OCR model independent of human supervision.

Training The OCR:

Deep learning approaches are generally considered the most effective way to train OCR models. SSD, YOLO, and Mask R-CNN are some deep learning approaches that can be applied. Neural networks have been used to combine the task of locating text in an image with the task of understanding what the text says. Deep convolutional neural networks provide the overall pipeline for many OCR architectures: a convolutional network encodes the image features into vectors, and a recurrent network then predicts where each letter is in the image text and what it is. Furthermore, Natural Language Processing (NLP) can be incorporated, giving the machine an additional dimension that allows it to classify documents through word and text comprehension and to extract relevant data with a higher degree of precision.

Conclusion:

Particularly after the pandemic, optical character recognition has been at the forefront of the industry. Machine learning-based models can achieve unparalleled text recognition accuracy, far surpassing traditional approaches based on feature extraction. All industries are digitizing their documents, regardless of whether they were created last year or more than 30 years ago. Numerous sectors, including the banking industry, the newspaper industry, the legal sector, the hospitality sector, and the education sector, rely on OCR to digitize their data.

Handwriting recognition is creating a lot of buzz nowadays because it benefits various industries, and OCR is what makes it possible. Banks and the legal sector use OCR and handwriting recognition to check forms and checks for fraud, training an advanced, continuously trained OCR on a huge database of handwriting samples. In the medical sector, optical character recognition has a big impact on digitizing patients' medical records. It has made it much easier for hospitals to manage patient records and for patients to get their medical records without carrying piles of paperwork.
