
Commit 4d7bb67

committed
updated readme, code
1 parent 42f4a23 commit 4d7bb67

4 files changed: +13 −277 lines changed

README.md

Lines changed: 1 addition & 273 deletions
@@ -1,273 +1 @@
# Setting up a Simple OCR Server

## Why?

OCR has become a common technology. With the advent of libraries such as Tesseract and Ocrad, people have built hundreds of bots and libraries that use them in interesting ways. A trivial example is meme-reading bots on Reddit. Text extracted from overlay-text images on sites like Tumblr or Pinterest can also be used to feed additional natural language data into predictive models.

It's also just hella fun. I mean, who would have thought we could do this? Maybe if you go really hard in the paint you can build your own program to get all your accounting paperwork scanned in as editable docs.

So let's go.
## Beginning steps

There are two potential options for building this at the simplest scale. We will go through building the Flask layer here, as a simple means to have an API resource we can hit from a frontend framework in order to build an application around this.

We will also provide a tagged commit that uses the server side to generate the HTML needed for the frontend to take in an `image_url` and output its text as interpreted by our OCR engine. This tutorial will not go into the detail of generating the UI pieces, but the code is provided, and quite simple.

First, we have to install some dependencies. As always, configuring your environment is 90% of the fun.

> This post has been tested on Ubuntu version 14.04, but it should work for 12.x and 13.x versions as well. If you're running OSX, you can use [VirtualBox](http://osxdaily.com/2012/03/27/install-run-ubuntu-linux-virtualbox/) or a droplet on [DigitalOcean](https://www.digitalocean.com/) (recommended!) to create the appropriate environment.
### Downloading dependencies

We need [Tesseract](http://en.wikipedia.org/wiki/Tesseract_%28software%29) and all of its dependencies, which include [Leptonica](http://www.leptonica.com/), as well as some other packages that power these two, to start with sanity checks. Now, I could just give you a list of magic commands that I know work in the env. But let's explain things a bit first.

> **NOTE**: You can also use the [_run.sh](link) shell script to quickly install the dependencies along with Leptonica and Tesseract and the relevant English language packages. If you go this route, skip down to the [Web-server time!](link) section. But please consider manually building these libraries if you have not before. It is chicken soup for the hacker's soul to play with tarballs and make. First, though, are our regular apt-get dependencies (before we get fancy).

```sh
$ sudo apt-get update
$ sudo apt-get install autoconf automake libtool
$ sudo apt-get install libpng12-dev
$ sudo apt-get install libjpeg62-dev
$ sudo apt-get install g++
$ sudo apt-get install libtiff4-dev
$ sudo apt-get install libopencv-dev libtesseract-dev
$ sudo apt-get install git
$ sudo apt-get install cmake
$ sudo apt-get install build-essential
$ sudo apt-get install libleptonica-dev
$ sudo apt-get install liblog4cplus-dev
$ sudo apt-get install libcurl3-dev
$ sudo apt-get install python2.7-dev
$ sudo apt-get install tk8.5 tcl8.5 tk8.5-dev tcl8.5-dev
$ sudo apt-get build-dep python-imaging --fix-missing
```
`sudo apt-get update` is short for "make sure we have the latest package listings".
`g++` is the GNU C++ compiler, part of the GNU Compiler Collection.
We also get a bunch of libraries that allow us to toy with images, e.g. `libtiff`, `libpng`, etc.
We also get `git`; if you lack familiarity with it but have found yourself here, you may want to read [The Git Book](link).
Beyond this, we also ensure we have `Python 2.7`, our programming language of choice.
We then get the `python-imaging` library set up for interaction with all these pieces.

Speaking of images: now that we have all the libraries needed to understand and parse them, we'll also need [ImageMagick](http://www.imagemagick.org/) if we want to toy with images before we throw them in programmatically.

```sh
$ sudo apt-get install imagemagick
```
### Building Leptonica

Now, time for Leptonica, finally! (Unless you ran the shell scripts and for some reason are seeing this, in which case proceed to the [Web-server time!](link) section.)

```sh
$ wget http://www.leptonica.org/source/leptonica-1.70.tar.gz
$ tar -zxvf leptonica-1.70.tar.gz
$ cd leptonica-1.70/
$ ./autobuild
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig
```
If this is your first time playing with tar, here's what we are doing:

- download the source tarball for Leptonica (`wget`)
- extract the tarball (`z` to filter through gzip, `x` for extract, `v` for verbose, etc. For a detailed explanation: `man tar`)
- `cd` into our new unpacked directory
- run the `autobuild` and `configure` bash scripts to set up the application
- use `make` to build it
- install it with `sudo make install` after the build
- create the necessary library links with `sudo ldconfig`

Boom, now we have Leptonica. On to Tesseract!
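If you ever want to script that fetch-and-extract dance instead of typing it, the same steps can be sketched in Python using only the standard library. This is just an illustration of what `wget` + `tar -zxvf` do, not part of the tutorial's code; `fetch_and_extract` is a made-up helper name:

```python
import tarfile

try:                                      # Python 3
    from urllib.request import urlretrieve
except ImportError:                       # Python 2, as used in this tutorial
    from urllib import urlretrieve


def fetch_and_extract(url, filename):
    """Download a .tar.gz (like wget) and unpack it here (like tar -zxvf)."""
    urlretrieve(url, filename)
    with tarfile.open(filename, 'r:gz') as tar:
        tar.extractall()

# e.g.:
# fetch_and_extract('http://www.leptonica.org/source/leptonica-1.70.tar.gz',
#                   'leptonica-1.70.tar.gz')
```

The shell version is what the tutorial actually uses; this sketch just makes the mechanics explicit.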
### Building Tesseract

And now to download and build Tesseract...

```sh
$ cd ..
$ wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
$ tar -zxvf tesseract-ocr-3.02.02.tar.gz
$ cd tesseract-ocr/
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig
```

The process here mirrors the Leptonica one almost perfectly. So to keep this DRY, just see above for the explanation.

We need to set up an environment variable to source our Tesseract data, so we'll take care of that now:

```sh
$ export TESSDATA_PREFIX=/usr/local/share/
```
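Since Tesseract reads that variable at runtime, it can be handy to sanity-check it from Python later on. A tiny sketch (the default path mirrors the export above; `tessdata_prefix` is a made-up helper, not part of the tutorial's code):

```python
import os


def tessdata_prefix(default='/usr/local/share/'):
    """Return TESSDATA_PREFIX if exported, else the path we set above."""
    return os.environ.get('TESSDATA_PREFIX', default)


print(tessdata_prefix())
```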
Now, let's get the relevant Tesseract English language packages:

```sh
$ cd ..
$ wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
$ tar -xf tesseract-ocr-3.02.eng.tar.gz
$ sudo cp -r tesseract-ocr/tessdata $TESSDATA_PREFIX
```

BOOM! We now have Tesseract. We can use the CLI; feel free to read the [docs](https://code.google.com/p/tesseract-ocr/) if you want to play. However, we need a Python wrapper to truly achieve our end goal. So the next step is setting up a Flask server that will allow us to easily build an API that we will POST requests to, with a link to an image, and it will run the character recognition on it.
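For the curious: what any such wrapper fundamentally has to do is shell out to an OCR binary and capture its output. A conceptual sketch of that idea (this is NOT pytesseract's actual implementation, and `run_ocr` is a made-up helper; consult the Tesseract docs for its exact CLI arguments):

```python
import subprocess


def run_ocr(binary, image_path):
    """Invoke a CLI tool on image_path and return its stdout as text.

    universal_newlines=True gives us str output on both Python 2 and 3.
    """
    return subprocess.check_output([binary, image_path],
                                   universal_newlines=True)

# e.g. run_ocr('tesseract', 'scan.png')  # requires the binary built above
```

The wrapper we use next hides exactly this kind of plumbing behind a clean function call.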
## Web-server time!

Now, on to the fun stuff. First, we will need to build a way to interface with Tesseract via Python. We COULD use `popen`, but that just feels wrong/unPythonic. A very minimal but functional Python package wrapping Tesseract is [pytesseract](https://github.com/madmaze/pytesseract), which is what we will rely on here.

Before beginning, grab the boilerplate code/structure [here](https://github.com/mjhea0/ocr_tutorial/releases/tag/v0) and set up the project structure on your machine. Make sure to set up a virtualenv and install the requirements via pip.

> **NOTE**: Want to quickly get started? Run the [_app.sh](link) shell script. This will give you a version that has some git tags that allow us to easily get to certain points in the exercise. If you are unfamiliar with Flask, using the final repository as a reference is a solid way to understand the basics of building a Flask server. You can find a great tutorial [here](https://realpython.com/blog/python/kickstarting-flask-on-ubuntu-setup-and-deployment/) as well, specific to Ubuntu.

```sh
$ wget https://github.com/mjhea0/ocr_tutorial/archive/v0.tar.gz
$ tar -xf v0.tar.gz
$ mv ocr_tutorial-0/* ../../home/
$ cd ../../home
$ sudo apt-get install python-virtualenv
$ virtualenv env
$ source env/bin/activate
$ pip install -r requirements.txt
```

> **NOTE**: Flask Boilerplate (maintained by [Real Python](https://realpython.com)) is a wonderful library for getting a simple, Pythonic server running. We customized this for our base application. Check out the [Flask Boilerplate repository](https://github.com/mjhea0/flask-boilerplate) for more info.
### Let's make an OCR Engine

Now, we need to write a small module that uses pytesseract to take in images and read them. Create a new file called *ocr.py* in "flask_server" and add the following code:

```python
import pytesseract
import requests
from nltk.corpus import words  # assumes NLTK; _ALL_WORDS below needs this import
from PIL import Image
from PIL import ImageFilter
from StringIO import StringIO

_ALL_WORDS = words.words()  # we'll use it later, don't worry


def process_image(url):
    image = _get_image(url)
    image = image.filter(ImageFilter.SHARPEN)  # filter() returns a new image
    return pytesseract.image_to_string(image)


def _get_image(url):
    return Image.open(StringIO(requests.get(url).content))
```
Wonderful!

So, our main method is `process_image()`, where we sharpen the image to crisp up the text. Note that PIL's `filter()` returns a new image rather than modifying the original, so we reassign the result.
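A quick way to convince yourself of that `filter()` behavior is to sharpen a tiny generated image. A self-contained check, assuming only Pillow is installed (the 8x8 gray square is a stand-in for a downloaded image):

```python
from PIL import Image, ImageFilter

# An 8x8 mid-gray test image stands in for a downloaded one.
image = Image.new('L', (8, 8), 128)
sharpened = image.filter(ImageFilter.SHARPEN)

print(sharpened is image)  # prints False: a new Image object is returned
```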
## Optional: Building a CLI tool for your new OCR Engine

Making a CLI is a great proof of concept, and a fun breather after doing so much configuration. So let's take a stab at making one. Create a new file within "flask_server" called *cli.py* and then add the following code:

```python
import sys
import requests
import pytesseract
from nltk.corpus import words  # assumes NLTK; _ALL_WORDS below needs this import
from PIL import Image
from StringIO import StringIO


_ALL_WORDS = words.words()  # we use this later


def get_image(url):
    return Image.open(StringIO(requests.get(url).content))


if __name__ == '__main__':
    """Tool to test the raw output of pytesseract with a given input URL"""
    sys.stdout.write("""
===OOOO=====CCCCC===RRRRRR=====\n
==OO==OO===CC=======RR===RR====\n
==OO==OO===CC=======RR===RR====\n
==OO==OO===CC=======RRRRRR=====\n
==OO==OO===CC=======RR==RR=====\n
==OO==OO===CC=======RR== RR====\n
===OOOO=====CCCCC===RR====RR===\n\n
""")
    sys.stdout.write("A simple OCR utility\n")
    url = raw_input("What is the url of the image you would like to analyze?\n")
    image = get_image(url)
    sys.stdout.write("The raw output from tesseract with no processing is:\n\n")
    sys.stdout.write("-----------------BEGIN-----------------\n")
    sys.stdout.write(pytesseract.image_to_string(image) + "\n")
    sys.stdout.write("------------------END------------------\n")
```
This is really quite simple: we take a URL, run our engine over the image, and write the raw text output to STDOUT. Test it out (`python flask_server/cli.py`) with a few image URLs, or play with your own ASCII art for a good time.
## Back to the server

Now that we have an engine, we need to get ourselves some output! Add the following route handler and view function to *app.py*:

```python
@app.route('/ocr', methods=["POST"])
def ocr():
    try:
        url = request.form['image_url']
        output = process_image(url)
        return jsonify({"output": output})
    except KeyError:
        return jsonify(
            {"error": "Did you mean to send: {'image_url': 'some_url'}"}
        )
```
Make sure to update the imports:

```python
import os
import logging
from logging import Formatter, FileHandler
from flask import Flask, request, jsonify

from ocr import process_image
```

Now, as you can see, we simply wrap the output of the engine's `process_image()` method in a JSON response, after passing it the submitted `image_url`; under the hood, the engine fetches the image and opens it as a file object using `Image` from PIL.

> **NOTE**: You will not have `PIL` itself installed; this runs off of `Pillow`, which allows us to do the same thing. This is because the PIL library was at one time forked and turned into `Pillow`. The community has strong opinions on this matter. Consult Google for insight and drama.
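Before reaching for curl, you can also poke the route in-process with Flask's test client. A self-contained sketch with a stand-in for `process_image`, so no OCR engine is needed (the stub's output and the example URL are made up):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)


def process_image(url):  # stub standing in for ocr.process_image
    return 'recognized text from ' + url


@app.route('/ocr', methods=['POST'])
def ocr():
    try:
        return jsonify({'output': process_image(request.form['image_url'])})
    except KeyError:
        return jsonify({'error': "Did you mean to send: {'image_url': 'some_url'}"})


client = app.test_client()
# The test client posts form-encoded data, matching request.form in the handler.
resp = client.post('/ocr', data={'image_url': 'http://example.com/menu.jpg'})
print(resp.data)
```

Swapping the stub for the real `process_image` gives you an end-to-end check without leaving Python.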
## Let's test!

Run your app:

```sh
$ cd flask_server/
$ python app.py
```

Then, in another terminal tab, run:

```sh
$ curl -X POST http://localhost:5000/ocr -d 'image_url=some_url'
```

Note that the handler reads `request.form`, so the data must be form-encoded (`image_url=...`), not a JSON blob.
### Example

```sh
$ curl -X POST http://localhost:5000/ocr -d 'image_url=https://s-media-cache-ec0.pinimg.com/originals/02/58/8f/02588f420dd4fe0ed13d93613de0da7.jpg'
{
  "output": "Stfawfbeffy Lemon\nHerbal Tea\nSlushie"
}
```

:)

Happy hacking.
+WIP

flask_server/app.py

Lines changed: 7 additions & 3 deletions
```diff
@@ -6,7 +6,8 @@
 from ocr import process_image

 app = Flask(__name__)
-_VERSION = 1
+_VERSION = 1  # API version
+

 @app.route('/v{}/ocr'.format(_VERSION), methods=["POST"])
 def ocr():
@@ -15,12 +16,15 @@ def ocr():
         output = process_image(url)
         return jsonify({"output": output})
     except KeyError:
-        return jsonify({"error": "Did you mean to send data formatted: {'image_url': 'some_url'}"})
+        return jsonify(
+            {"error": "Did you mean to send: {'image_url': 'some_url'}"}
+        )


 @app.errorhandler(500)
 def internal_error(error):
-    print str(error) # ghetto logging
+    print str(error)  # ghetto logging
+

 @app.errorhandler(404)
 def not_found_error(error):
```

flask_server/cli.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -5,7 +5,7 @@
 from StringIO import StringIO


-_ALL_WORDS = words.words()  # we use this later
+_ALL_WORDS = words.words()  # we'll use this later


 def get_image(url):
```

flask_server/ocr.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -4,6 +4,10 @@
 from PIL import ImageFilter
 from StringIO import StringIO

+
+_ALL_WORDS = words.words()  # we'll use this later
+
+
 def process_image(url):
     image = _get_image(url)
     image.filter(ImageFilter.SHARPEN)
```
