OCR from Image using PyTesseract in Python on Colab Notebook?

3 min read · Mar 22, 2019

Optical Character Recognition (OCR) has long been a popular task in Computer Vision because of its wide range of applications. It can be used for business data entry, number plate recognition, automated passport recognition, quick document verification, IoT applications, task automation, and many more. Basically, it fits any application that needs to extract text from an image.

Tesseract is the most popular open-source software available for OCR. It was initially developed by HP as a tool in C++, and since 2006 its development has been sponsored by Google. The original software is available as a command-line tool on Windows and other platforms. We are living in a Python world, and because of Python's popularity the tool is also available in Python, developed and maintained as an open-source project.

Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

Knowing the details of a tool is information, but learning to use it is knowledge. We are knowledge seekers, so let's learn how to use it.

Here are the steps to extract text from an image in a Google Colab notebook using Pytesseract:

Step1. Install Pytesseract and tesseract-OCR in Google Colab.

!sudo apt install tesseract-ocr
!pip install pytesseract

Step2. Import libraries

import pytesseract
import shutil
import os
import random
try:
    from PIL import Image
except ImportError:
    import Image

Step3. Upload Image to the Colab

We can upload the image manually by clicking on "Files" and then "Upload", but we can also upload it to Colab with the following code.

from google.colab import files
uploaded = files.upload()
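In Colab, files.upload() returns a dictionary that maps each uploaded file name to its bytes, and the files are saved in the current working directory. The sketch below shows one way to pick the image name from that dictionary instead of typing it by hand. The helper first_image_name and the extension list are hypothetical names used for illustration, not part of Colab or Pytesseract.

```python
# Hypothetical helper: pick the first uploaded file that looks like an image.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tif", ".tiff")


def first_image_name(uploaded):
    """Return the first key of `uploaded` with an image extension, or None."""
    for name in uploaded:
        if name.lower().endswith(IMAGE_EXTENSIONS):
            return name
    return None


# Usage inside Colab (not runnable outside of it):
# from google.colab import files
# uploaded = files.upload()
# image_path_in_colab = first_image_name(uploaded)
```

Returning None when no image is found lets the notebook fail early with a clear check rather than passing a wrong path to Image.open.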

Step4. Text Extraction

The image_to_string function takes an image as an argument and returns the text extracted from it. We can either print it directly or store the string in a variable.

image_path_in_colab = 'image.jpg'
extractedInformation = pytesseract.image_to_string(Image.open(image_path_in_colab))
print(extractedInformation)
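The string returned by image_to_string often contains blank lines, stray spaces, and, depending on the Tesseract version, a trailing form-feed character that marks the end of the page. Here is a minimal sketch of a cleanup step. The function name clean_ocr_text is hypothetical, and the cleanup rules are only one reasonable choice.

```python
def clean_ocr_text(raw):
    """Drop form feeds, surrounding spaces, and empty lines from OCR output."""
    # Remove the page separator that some Tesseract versions append.
    without_form_feed = raw.replace("\x0c", "")
    # Strip each line and keep only the non-empty ones.
    lines = (line.strip() for line in without_form_feed.splitlines())
    return "\n".join(line for line in lines if line)


# Usage with the variable from the step above:
# print(clean_ocr_text(extractedInformation))
```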

Step5. Detect a language other than English

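# Install the French language data first (apt package name on Ubuntu-based Colab)
!sudo apt install tesseract-ocr-fra
# French text image to string
extractedInformation = pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra')
print(extractedInformation)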

The lang argument of the function above sets the language of the text to be detected. The matching language data must be installed for Tesseract to use it.

Step6. Get Bounding Boxes for Text

To get bounding box coordinates for the text, we use the image_to_boxes function with the same image argument as the earlier function.

# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open(image_path_in_colab)))
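image_to_boxes returns plain text in Tesseract's box format: one line per character, in the form "symbol left bottom right top page", with the origin at the bottom-left corner of the image. Most imaging libraries, Pillow included, put the origin at the top-left, so the y values usually need flipping before drawing. The sketch below parses that text and flips the coordinates. CharBox and parse_boxes are hypothetical names chosen for this example.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    """One character box in top-left-origin pixel coordinates."""
    char: str
    left: int
    top: int
    right: int
    bottom: int


def parse_boxes(box_text: str, image_height: int) -> List[CharBox]:
    """Parse image_to_boxes output and flip y to a top-left origin."""
    boxes = []
    for line in box_text.splitlines():
        # Split from the right so the symbol stays in one piece.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip empty or malformed lines
        char = parts[0]
        left, bottom, right, top = (int(value) for value in parts[1:5])
        boxes.append(
            CharBox(char, left, image_height - top, right, image_height - bottom)
        )
    return boxes


# Usage with the image from the earlier step:
# image = Image.open(image_path_in_colab)
# boxes = parse_boxes(pytesseract.image_to_boxes(image), image.height)
```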

Feel free to check out this Colab notebook implementation of the above method.

bhadreshpsavani/OCR_using_TesseatactLib_Project

github.com

Pros:

Easy to use
Fast detection
Very popular
Efficient
Supports 100+ languages
One of the oldest OCR libraries
Command-line support

Cons:

Runs only on CPU
Doesn't perform well on blurry, noisy, or colorful images
Performance decreases for small font sizes in low-resolution images
Doesn't work well on complex forms

If you want text detection and recognition in a single function, check out EasyOCR, an AI-based open-source library. It supports 70+ languages and faster inference on GPU.

Note: For blurry, noisy, or colorful images we need some image-processing steps first, such as converting the image to black and white and removing salt-and-pepper noise with low-pass filters such as averaging or Gaussian filters. We can also sharpen a blurry image with a high-pass filter such as a Sobel filter. These image-processing operations can be implemented with the OpenCV library in Python.
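As a rough illustration of such a preprocessing step, here is a minimal sketch. It makes two substitutions that should be read as assumptions: it uses Pillow, which this notebook already imports, instead of OpenCV, and it uses a median filter instead of an averaging or Gaussian filter, since a median filter is a common choice for salt-and-pepper noise. The function name preprocess_for_ocr and the default threshold of 128 are hypothetical and would need tuning per image.

```python
from PIL import Image, ImageFilter


def preprocess_for_ocr(image, threshold=128):
    """Return a denoised black-and-white copy of `image` (mode "L")."""
    # 1. Drop the color information.
    gray = image.convert("L")
    # 2. Remove isolated noisy pixels with a 3x3 median filter.
    denoised = gray.filter(ImageFilter.MedianFilter(size=3))
    # 3. Binarize: pixels above the threshold become white, the rest black.
    return denoised.point(lambda p: 255 if p > threshold else 0)


# Usage with the image from the earlier step:
# cleaned = preprocess_for_ocr(Image.open(image_path_in_colab))
# print(pytesseract.image_to_string(cleaned))
```

A fixed global threshold is the simplest option; unevenly lit scans may need an adaptive threshold, which OpenCV provides.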

Stay Pythonic!!😀

References:

tesseract-ocr/tesseract

This package contains an OCR engine - libtesseract and a command line program - tesseract. Tesseract 4 adds a new…