# Setting up a Simple OCR Server

## Why?
OCR has become a common technology. With the advent of libraries such as Tesseract and Ocrad, people have
built hundreds of bots and libraries that use them in interesting ways. A trivial example is meme-reading
bots on Reddit. Another is extracting the overlay text from images on sites like Tumblr or Pinterest, which
can then feed natural language analysis data into predictive models.

It's also just hella fun. I mean, who would have thought we could do this? Maybe if you go really hard in the paint
you can build your own program to scan all your accounting paperwork into editable docs.

So let's go.

## Beginning steps

There are two potential options for building this at the simplest scale. We will go through building the Flask
layer here as a simple means to have an API resource we can hit from a frontend framework in order to build
an application around it.

We will also provide a tagged commit that utilizes the server side to generate the HTML needed for the frontend
to take in an `image_url` and output its text as interpreted by our OCR engine. This tutorial will not go into the
detail of generating the UI pieces, but the code is provided, and quite simple.

First, we have to install some dependencies. As always, configuring your environment is 90% of the fun.

> This post has been tested on Ubuntu version 14.04 but it should work for 12.x and 13.x versions as well. If you're running OSX, you can use [VirtualBox](http://osxdaily.com/2012/03/27/install-run-ubuntu-linux-virtualbox/) or a droplet on [DigitalOcean](https://www.digitalocean.com/) (recommended!) to create the appropriate environment.

### Downloading dependencies

We need [Tesseract](http://en.wikipedia.org/wiki/Tesseract_%28software%29) and all of its dependencies, which include [Leptonica](http://www.leptonica.com/), as well as some other packages that power these two, to keep things sane to start. Now, I could just give you a list of magic commands that I know work in the environment. But let's explain things a bit first.

> **NOTE**: You can also use the [_run.sh](link) shell script to quickly install the dependencies along with Leptonica and Tesseract and the relevant English language packages. If you go this route, skip down to the [Web-server time!](link) section. But please consider manually building these libraries if you have not before. It is chicken soup for the hacker's soul to play with tarballs and make. First, though, here are our regular apt-get dependencies (before we get fancy):

```sh
$ sudo apt-get update
$ sudo apt-get install autoconf automake libtool
$ sudo apt-get install libpng12-dev
$ sudo apt-get install libjpeg62-dev
$ sudo apt-get install g++
$ sudo apt-get install libtiff4-dev
$ sudo apt-get install libopencv-dev libtesseract-dev
$ sudo apt-get install git
$ sudo apt-get install cmake
$ sudo apt-get install build-essential
$ sudo apt-get install libleptonica-dev
$ sudo apt-get install liblog4cplus-dev
$ sudo apt-get install libcurl3-dev
$ sudo apt-get install python2.7-dev
$ sudo apt-get install tk8.5 tcl8.5 tk8.5-dev tcl8.5-dev
$ sudo apt-get build-dep python-imaging --fix-missing
```
Running `sudo apt-get update` is shorthand for 'make sure we have the latest package listings'.
`g++` is the GNU C++ compiler, part of the GNU Compiler Collection.
We also get a bunch of libraries that allow us to toy with images, i.e. `libtiff`, `libpng`, etc.
We also get `git`; if you lack familiarity with it but have found yourself here, you may want to read [The Git Book](link).
Beyond this, we also ensure we have `Python 2.7`, our programming language of choice.
We then get the `python-imaging` library set up for interaction with all these pieces.

Speaking of images: we'll also need [ImageMagick](http://www.imagemagick.org/) if we want to toy with images before we throw them in programmatically, now that we have all the libraries needed to understand and parse them.

```sh
$ sudo apt-get install imagemagick
```

### Building Leptonica

Now, time for Leptonica, finally! (Unless you ran the shell script and for some reason are seeing this, in which case proceed to the [Web-server time!](link) section.)

```sh
$ wget http://www.leptonica.org/source/leptonica-1.70.tar.gz
$ tar -zxvf leptonica-1.70.tar.gz
$ cd leptonica-1.70/
$ ./autobuild
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig
```
If this is your first time playing with tar, here's what we are doing:

- download the Leptonica source tarball (`wget`)
- unpack the tarball (`x` for extract, `v` for verbose, etc.; for a detailed explanation: `man tar`)
- `cd` into our new unpacked directory
- run the `autobuild` and `configure` bash scripts to set up the application
- use `make` to build it
- install it with `make install` after the build
- create the necessary links with `ldconfig`

Boom, now we have Leptonica. On to Tesseract!

### Building Tesseract

And now to download and build Tesseract...

```sh
$ cd ..
$ wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
$ tar -zxvf tesseract-ocr-3.02.02.tar.gz
$ cd tesseract-ocr/
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig
```

The process here mirrors the Leptonica one almost perfectly. So, to keep this DRY, I'll just say: see above.

We need to set up an environment variable to source our Tesseract data, so we'll take care of that now:

```sh
$ export TESSDATA_PREFIX=/usr/local/share/
```

Now, let's get the relevant Tesseract English language packages:

```sh
$ cd ..
$ wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
$ tar -xf tesseract-ocr-3.02.eng.tar.gz
$ sudo cp -r tesseract-ocr/tessdata $TESSDATA_PREFIX
```
BOOM! We now have Tesseract. We can use it from the CLI. Feel free to read the [docs](https://code.google.com/p/tesseract-ocr/) if you want to play. However, we need a Python wrapper to truly achieve our end goal. So the next step is setting
up a Flask server, which will let us build an API that we can POST requests to with a link to an image, and it will run character recognition on it.

## Web-server time!

Now, on to the fun stuff. First, we will need to build a way to interface with Tesseract via Python. We COULD use `popen`, but that just feels wrong/un-Pythonic. A very minimal, but functional, Python package wrapping Tesseract is [pytesseract](https://github.com/madmaze/pytesseract), which is what we will rely on here.

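Just to get a feel for what pytesseract gives us on its own, here's a minimal sketch (`sample.png` is a placeholder for any image you have lying around):

```python
# minimal pytesseract sketch -- 'sample.png' is a placeholder image path
import pytesseract
from PIL import Image

# image_to_string shells out to the tesseract binary we built above
# and returns the recognized text
print(pytesseract.image_to_string(Image.open('sample.png')))
```

If that prints something resembling the text in your image, the Tesseract install is in good shape.
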
Before beginning, grab the boilerplate code/structure [here](https://github.com/mjhea0/ocr_tutorial/releases/tag/v0) and set up the project structure on your machine. Make sure to set up a virtualenv and install the requirements via pip.

> **NOTE**: Want to quickly get started? Run the [_app.sh](link) shell script. This will give you a version that has some git tags that allow us to easily get to certain points in the exercise. If you are unfamiliar with Flask, using the final repository as a reference is a solid way to understand the basics of building a Flask server. You can find a great tutorial [here](https://realpython.com/blog/python/kickstarting-flask-on-ubuntu-setup-and-deployment/) as well, specific to Ubuntu.

```sh
$ wget https://github.com/mjhea0/ocr_tutorial/archive/v0.tar.gz
$ tar -xf v0.tar.gz
$ mv ocr_tutorial-0/* ../../home/
$ cd ../../home
$ sudo apt-get install python-virtualenv
$ virtualenv env
$ source env/bin/activate
$ pip install -r requirements.txt
```
> **NOTE**: Flask Boilerplate (maintained by [Real Python](https://realpython.com)) is a wonderful library for getting a simple, Pythonic server running. We customized this for our base application. Check out the [Flask Boilerplate repository](https://github.com/mjhea0/flask-boilerplate) for more info.

### Let's make an OCR Engine

Now, we need some code that uses pytesseract to take in images and read them. Create a new file called *ocr.py* in "flask_server" and add the following code:

```python
import pytesseract
import requests
from PIL import Image
from PIL import ImageFilter
from StringIO import StringIO
# needed for words.words() below; assumes nltk and its 'words' corpus are installed
from nltk.corpus import words

_ALL_WORDS = words.words()  # we'll use it later, don't worry


def process_image(url):
    # download the image, sharpen it, then hand it to Tesseract
    image = _get_image(url)
    image = image.filter(ImageFilter.SHARPEN)
    return pytesseract.image_to_string(image)


def _get_image(url):
    # fetch the image bytes and wrap them in a PIL Image
    return Image.open(StringIO(requests.get(url).content))
```
Wonderful!

So, our main method is `process_image()`, where we sharpen the image to crisp up the text.

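To sanity-check the engine from a Python shell, a quick call looks like this (the URL is just a placeholder; point it at any image that has text in it):

```python
# quick sanity check of the engine -- the URL below is a placeholder
from ocr import process_image

text = process_image('https://example.com/some-image-with-text.png')
print(text)
```
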
## Optional: Building a CLI tool for your new OCR Engine

Making a CLI is a great proof of concept, and a fun breather after doing so much configuration. So let's take a stab at making one. Create a new file within "flask_server" called *cli.py* and then add the following code:

```python
import sys
import requests
import pytesseract
from PIL import Image
from StringIO import StringIO
# needed for words.words() below; assumes nltk and its 'words' corpus are installed
from nltk.corpus import words


_ALL_WORDS = words.words()  # we use this later


def get_image(url):
    # fetch the image bytes from the URL and wrap them in a PIL Image
    return Image.open(StringIO(requests.get(url).content))


if __name__ == '__main__':
    # Tool to test the raw output of pytesseract with a given input URL
    sys.stdout.write("""
===OOOO=====CCCCC===RRRRRR=====\n
==OO==OO===CC=======RR===RR====\n
==OO==OO===CC=======RR===RR====\n
==OO==OO===CC=======RRRRRR=====\n
==OO==OO===CC=======RR==RR=====\n
==OO==OO===CC=======RR== RR====\n
===OOOO=====CCCCC===RR====RR===\n\n
""")
    sys.stdout.write("A simple OCR utility\n")
    url = raw_input("What is the url of the image you would like to analyze?\n")
    image = get_image(url)
    sys.stdout.write("The raw output from tesseract with no processing is:\n\n")
    sys.stdout.write("-----------------BEGIN-----------------\n")
    sys.stdout.write(pytesseract.image_to_string(image) + "\n")
    sys.stdout.write("------------------END------------------\n")
```
This is really quite simple: we prompt for an image URL, hand the image to pytesseract, and write the raw text output to STDOUT. Test it out (`python flask_server/cli.py`) with a few image URLs, or play with your own ASCII art for a good time.

## Back to the server

Now that we have an engine, we need to get ourselves some output! Add the following route handler and view function to *app.py*:

```python
@app.route('/ocr', methods=["POST"])
def ocr():
    try:
        url = request.form['image_url']
        output = process_image(url)
        return jsonify({"output": output})
    except KeyError:
        return jsonify(
            {"error": "Did you mean to send data formatted: {'image_url': 'some_url'}?"})
```
Make sure to update the imports:

```python
import os
import logging
from logging import Formatter, FileHandler
from flask import Flask, request, jsonify

from ocr import process_image
```
Now, as you can see, we simply return, as JSON, the output of the engine's `process_image()` method, which is handed the image URL and builds the file object internally using `Image` from PIL.

> **NOTE**: You will not have `PIL` itself installed; this runs off of `Pillow`, which lets us do the same thing. This is because the PIL library was at one time forked and turned into `Pillow`. The community has strong opinions on this matter. Consult Google for insight and drama.

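If you want to poke at the route without even starting the server, Flask's built-in test client works nicely. Here's a rough sketch, assuming *app.py* exposes the `app` object as usual and using a placeholder image URL:

```python
# rough sketch using Flask's test client -- the image URL is a placeholder
from app import app

client = app.test_client()
response = client.post('/ocr', data={'image_url': 'https://example.com/some-image-with-text.png'})
print(response.data)  # JSON string with the OCR output (or our error message)
```
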
## Let's test!

Run your app:

```sh
$ cd flask_server/
$ python app.py
```

Then in another terminal tab run:

```sh
$ curl -X POST http://localhost:5000/ocr -d "image_url=some_url"
```
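Not a curl fan? The same request can be sent from Python with the requests library against the running server (again, the URL is a placeholder):

```python
# same POST as the curl call above, sent with the requests library (placeholder URL)
import requests

r = requests.post('http://localhost:5000/ocr',
                  data={'image_url': 'https://example.com/some-image-with-text.png'})
print(r.json())
```
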
### Example

```sh
$ curl -X POST http://localhost:5000/ocr -d "image_url=https://s-media-cache-ec0.pinimg.com/originals/02/58/8f/02588f420dd4fe0ed13d93613de0da7.jpg"
{
  "output": "Stfawfbeffy Lemon\nHerbal Tea\nSlushie"
}
```
:)

Happy hacking.