OCR from an Image using PyTesseract in Python on a Colab Notebook
Optical Character Recognition (OCR) has long been a popular task in Computer Vision because of its wide range of applications. It can be used for business data entry, number plate recognition, automated passport recognition, quick document verification, IoT applications, task automation, and many more: basically, any application that needs to extract text from an image.
Tesseract is one of the most popular open-source tools available for OCR. It was initially developed at HP and is written in C++. Since 2006 its development has been sponsored by Google. The original software is a command-line tool available for Linux, Windows, and macOS. We are living in a Python world, and because of Python's popularity the tool is also available from Python through a wrapper that is developed and maintained as an open-source project.
Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.
Knowing the details of a tool is information, but learning to use it is knowledge. We are knowledge seekers, so let's learn how to use it.
Here are the steps to extract text from an image in a Google Colab Notebook using Pytesseract:
Step1. Install Pytesseract and tesseract-OCR in Google Colab.
!sudo apt install tesseract-ocr
!pip install pytesseract
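Before moving on, it can help to confirm that the install actually put the Tesseract binary on the path. The snippet below is a minimal sketch that uses only the Python standard library. The helper name `tesseract_version` is my own for illustration and is not part of Pytesseract.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if unavailable.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    binary = shutil.which("tesseract")  # look for the executable on PATH
    if binary is None:
        return None
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


print(tesseract_version() or "tesseract not found; rerun the install cell")
```

If this prints the fallback message, the `apt install` cell most likely did not finish successfully.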
Step2. Import libraries
import pytesseract
import shutil
import os
import random
try:
    from PIL import Image
except ImportError:
    import Image
Step3. Upload an image to Colab
We can upload the image manually with the upload button in the Files panel, but we can also use the following code to upload the image to Colab.
from google.colab import files
uploaded = files.upload()
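In Colab, `files.upload()` returns a dictionary keyed by the uploaded file names (with the file contents as values) and also saves the files to the current working directory. As a small sketch, the hypothetical helper below picks the first uploaded name so it can be used as the image path in the next step instead of a hard-coded one. The sample dictionary is a stand-in so the snippet also runs outside Colab.

```python
def first_uploaded_name(uploaded):
    """Return the first file name from the dict returned by files.upload().

    Hypothetical helper for illustration; `uploaded` maps file names to bytes.
    """
    if not uploaded:
        raise ValueError("no file was uploaded")
    return next(iter(uploaded))


# Stand-in for the dict that files.upload() returns in Colab.
example_upload = {"image.jpg": b"\xff\xd8\xff"}
print(first_uploaded_name(example_upload))  # image.jpg
```

In the notebook itself this would be called as `first_uploaded_name(uploaded)` right after the upload cell.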
Step4. Text Extraction
The image_to_string function takes an image as an argument and returns the text extracted from it. We can either print it directly or store the string in a variable.
image_path_in_colab = 'image.jpg'
extractedInformation = pytesseract.image_to_string(Image.open(image_path_in_colab))
print(extractedInformation)
Step5. Detect a language other than English
# French text image to string
extractedInformation = pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra')
print(extractedInformation)
By passing a language code through the lang argument of the above function, we can change which language is detected, provided the matching Tesseract language data is installed.
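Tesseract can only recognize a language whose trained data is installed, and it accepts several languages at once when their codes are joined with `+`. The sketch below builds such a value for the lang argument. The helper `build_lang_arg` is hypothetical, and the apt package name in the comment is an assumption about Colab's Ubuntu image.

```python
def build_lang_arg(codes):
    """Join Tesseract language codes into one `lang` value, e.g. 'eng+fra'.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


# In Colab, the French data would first be installed with (assumed package name):
#   !sudo apt install tesseract-ocr-fra
lang = build_lang_arg(["eng", "fra"])
print(lang)  # eng+fra
# Then, in the notebook:
#   pytesseract.image_to_string(Image.open('test-european.jpg'), lang=lang)
```

Recent versions of Pytesseract also provide `pytesseract.get_languages()`, which lists the languages Tesseract currently has available.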
Step6. Get Bounding Boxes for Text
To get bounding box coordinates for the text, we use the image_to_boxes function with the same image argument as the earlier function.
# Get bounding box estimates
print(pytesseract.image_to_boxes(Image.open(image_path_in_colab)))
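The string returned by image_to_boxes has one line per recognized character, in the form `char left bottom right top page`, with y measured from the bottom edge of the image. The sketch below turns that string into a list of dictionaries and optionally flips y to the top-left origin that most drawing libraries use. `parse_boxes` is a hypothetical helper, and the sample string is made up for illustration.

```python
def parse_boxes(box_text, image_height=None):
    """Parse the string returned by pytesseract.image_to_boxes.

    Each line is assumed to look like: 'char left bottom right top page'.
    If image_height is given, y values are flipped to a top-left origin.
    Hypothetical helper for illustration; not part of pytesseract.
    """
    boxes = []
    for line in box_text.splitlines():
        parts = line.rsplit(" ", 5)  # the last five fields are always numbers
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char = parts[0]
        left, bottom, right, top = (int(value) for value in parts[1:5])
        if image_height is not None:
            # Tesseract measures y from the bottom edge; flip to top-left origin.
            top, bottom = image_height - top, image_height - bottom
        boxes.append(
            {"char": char, "left": left, "bottom": bottom, "right": right, "top": top}
        )
    return boxes


# Made-up sample in the image_to_boxes format.
sample = "H 10 20 30 40 0\ni 32 20 38 40 0"
for box in parse_boxes(sample, image_height=100):
    print(box)
```

In the notebook, `sample` would be replaced by the output of `pytesseract.image_to_boxes(...)` and `image_height` by the height of the opened image.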
Feel free to check this Colab Notebook Implementation of the above method.
Pros:
- Easy to use
- Fast Detection
- Very popular
- Efficient on CPU
- Supports 100+ languages
- One of the oldest OCR libraries
- Command-line support
Cons:
- Only works on CPU
- Doesn’t perform well on blurry, noisy, and colorful images
- Performance decreases for small font sizes in low-resolution images
- Doesn’t work well on complex forms
If you want text detection and recognition in a single function, check out the AI-based open-source EasyOCR. It supports 70+ languages and faster GPU inference.
Note: For blurry, noisy, and colorful images we need to apply some image-processing steps first, such as converting the image to black and white and removing salt-and-pepper noise with low-pass filters such as an averaging filter or a Gaussian filter. We can also sharpen a blurred image with a high-pass filter such as a Sobel filter. These image-processing operations can be implemented with the OpenCV library in Python.
Stay Pythonic!!😀