In this exercise, you will learn how to process images using Python and Tesseract. Tesseract is a flexible Optical Character Recognition (OCR) engine that is available for various operating systems.
Your task is to write a Python script that can perform OCR on JPEG and PNG image files. It should scan an input directory for supported images, perform OCR using pytesseract, and save the results as text files (one per image) in an output directory.
- Install the Tesseract-OCR system package: Use the `install-tesseract-ocr.sh` script provided within the `group_2` directory:
  - In the Explorer view: Right-click on the `group_2` directory > Open in Integrated Terminal
  - In the Terminal view: Type `./install-tesseract-ocr.sh`
  - Don't close the Terminal view; you will need it again ;)
- Set up your environment: Install the `pytesseract` and `pillow` (the maintained fork of PIL) libraries with pip.
  - In the Terminal view: Type `pip install pillow pytesseract`
- Create a new Python file: Create a new Python file named `script.py`.
  - In the Explorer view: Right-click on the `group_2` directory > New File...
- Import necessary libraries: At the top of your file, import the necessary libraries:

      from PIL import Image
      import os
      import pytesseract

- Define configuration variables: Underneath the import statements, define the following variables:

      INPUT_DIRECTORY = 'images/input'
      OUTPUT_DIRECTORY = 'images/output'
      LANGUAGE = 'eng'

- Define your function: Define a function named `ocr_with_tesseract` that takes three parameters: `input_directory`, `output_directory`, and `language`. This function should do the following:
  - Sanity checks
    - Check if the input directory exists
      - If not, print an error message and return
    - Check if the output directory exists
      - If not, create the output directory
    - Check if the output directory is empty
      - If not, print an error message and return
    - Check if the language is available with pytesseract
      - If not, print an error message and return
  - Find all supported images (JPEG and PNG) in the input directory
    - Create a list to store the image file names that should be processed
    - Loop over all files in the input directory
      - Skip files that are not supported (you can use the file extension for that: .jpg, .png)
      - Add supported images to the list
    - Print a message if no processable images were found
  - Process the images
    - Loop over all images in the list
      - Load the image as PIL Image
      - Perform OCR using pytesseract and get the result as a string
      - Save the result as a txt file
      - Close the image
- Call your function: At the end of your script, call your function with the configuration variables as arguments.
- Run your script: Execute your script.
  - In the Terminal view: Type `python script.py`
After running your script, you should see the extracted text files (one per image) in your specified output directory.
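If you get stuck on the first two parts of the function, the sanity checks and the image search can be sketched with the standard library alone. This is only one possible approach, not the reference solution: the helper names `directories_are_ready` and `find_supported_images` and the constant `SUPPORTED_EXTENSIONS` are made up for illustration, and the language check is left out because it needs pytesseract.

```python
import os

# Extensions named in the exercise; compared in lower case below.
SUPPORTED_EXTENSIONS = (".jpg", ".png")


def directories_are_ready(input_directory, output_directory):
    """Return True if the input directory exists and the output directory is empty."""
    if not os.path.isdir(input_directory):
        print(f"Error: input directory '{input_directory}' does not exist.")
        return False
    # Create the output directory if it is missing (no error if it already exists).
    os.makedirs(output_directory, exist_ok=True)
    if os.listdir(output_directory):
        print(f"Error: output directory '{output_directory}' is not empty.")
        return False
    return True


def find_supported_images(input_directory):
    """Return the sorted file names in input_directory with a supported extension."""
    images = []
    for file_name in os.listdir(input_directory):
        # Lower-casing the name means 'SCAN.JPG' is accepted as well.
        if file_name.lower().endswith(SUPPORTED_EXTENSIONS):
            images.append(file_name)
    if not images:
        print("No processable images found.")
    return sorted(images)
```

Inside `ocr_with_tesseract` you could call the first helper and return early when it reports a problem, then loop over the list returned by the second one.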
- To check if directories exist, take a look at the Python built-in `os.path` module.
- To construct a path to a file, you can use string concatenation.
- To load an image, take a look at the `Image` module from the Pillow (PIL) library.
- To extract text from an image using pytesseract, you can use its `image_to_string()` function.
- To save the results, you should use the Python built-in `open` function in conjunction with the `with` keyword.
  - The file handle can be used to `write()` something in the file.
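Putting these hints together, processing a single image might look like the sketch below. Treat it as a starting point under a few assumptions: the helper names `output_path_for`, `language_is_available` and `ocr_single_image` are made up for illustration, `os.path.join` is used as a more portable alternative to plain string concatenation, and `pytesseract.get_languages()` is assumed to be available (it is missing in older pytesseract releases).

```python
import os


def output_path_for(image_name, output_directory):
    """Build the .txt path for an image, e.g. 'scan.png' -> '<output_directory>/scan.txt'."""
    base_name, _extension = os.path.splitext(image_name)
    return os.path.join(output_directory, base_name + ".txt")


def language_is_available(language):
    """Return True if Tesseract reports the given language code as installed."""
    import pytesseract  # assumption: a pytesseract version that provides get_languages()

    return language in pytesseract.get_languages()


def ocr_single_image(image_name, input_directory, output_directory, language):
    """Run OCR on one image and write the recognized text to a .txt file."""
    # Imported here only so the path helper above can be tried without the OCR
    # packages; in script.py these imports stay at the top of the file.
    from PIL import Image
    import pytesseract

    image_path = os.path.join(input_directory, image_name)
    # The with statement closes the image automatically, even if OCR fails.
    with Image.open(image_path) as image:
        text = pytesseract.image_to_string(image, lang=language)

    # A second with statement makes sure the text file is closed after writing.
    with open(output_path_for(image_name, output_directory), "w", encoding="utf-8") as text_file:
        text_file.write(text)
```

Note that with this naming scheme `scan.jpg` and `scan.png` would both be written to `scan.txt`, so you may want to keep the original extension in the output name if your input directory can contain both.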
Remember, the goal is to understand how to manipulate images and extract text from them using Python and Tesseract. Don't worry if you don't get it right the first time. Keep trying and happy coding!