PyPDFOCR
========

- This script will take a pdf file and generate the corresponding OCR'ed
- version.
+ This program will help manage your scanned PDFs for you. It can do the
+ following:
+
+ - Take a scanned PDF file and run OCR on it (using free OCR tools),
+   generating a searchable PDF
+ - Optionally, watch a folder for incoming scanned PDFs and
+   automatically run OCR on them
+ - Optionally, file the scanned PDFs into directories based on simple
+   keyword matching that you specify
+ - *Coming soon*: Evernote auto-upload and filing
+
+ More links:
+
+ - `Blog <http://virantha.com/categories/projects/pypdfocr>`__
+ - `Documentation <http://documentup.com/virantha/pypdfocr>`__
+ - `Source <https://www.github.com/virantha/pypdfocr>`__

Usage:
------
@@ -12,63 +26,151 @@ Single conversion:

::

-     python pypdfocr.py filename.pdf
+     pypdfocr filename.pdf

--> filename_ocr.pdf will be generated

- Folder monitoring (new!):
- ~~~~~~~~~~~~~~~~~~~~~~~~~
+ Folder monitoring:
+ ~~~~~~~~~~~~~~~~~~

::

-     python pypdfocr.py -w watch_directory
+     pypdfocr -w watch_directory

--> Every time a pdf file is added to `watch_directory` it will be OCR'ed
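
The folder-monitoring idea can be illustrated with a minimal polling sketch. This is not PyPDFOCR's implementation: the README lists Watchdog as a dependency, which suggests the real tool reacts to filesystem events, whereas this sketch uses only the standard library. The function name ``find_new_pdfs`` is hypothetical, and skipping files that end in ``_ocr.pdf`` is an assumption based on the output naming shown above.

```python
import os


def find_new_pdfs(directory, seen):
    """Return PDF filenames in `directory` that are not yet in `seen`.

    `seen` is a set that is updated in place, so calling this function
    repeatedly reports each file only once.
    """
    new = []
    for name in sorted(os.listdir(directory)):
        lowered = name.lower()
        # Assumption: outputs are named *_ocr.pdf, so skip them to avoid
        # treating an already OCR'ed file as a fresh scan.
        if not lowered.endswith(".pdf") or lowered.endswith("_ocr.pdf"):
            continue
        if name not in seen:
            seen.add(name)
            new.append(name)
    return new
```

Calling ``find_new_pdfs`` in a loop with a short ``time.sleep`` between calls would approximate the watch mode; an event-based watcher avoids the polling delay.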

- For those on Windows, because it's such a pain to get all the PIL and
- PDF dependencies installed, I've gone ahead and made an executable
- available under:
+ Automatic filing (new!):
+ ~~~~~~~~~~~~~~~~~~~~~~~~
+
+ To automatically move the OCR'ed pdf to a directory based on a keyword,
+ use the -f option and specify a configuration file (described below):

::

-     dist/pypdfocr.exe
+     pypdfocr filename.pdf -f -c config.yaml

- You still need to install Tesseract and GhostScript as detailed below in
- the dependencies list.
+ You can also do this in folder monitoring mode:
+
+ ::
+
+     pypdfocr -w watch_directory -f -c config.yaml
+
+ Configuration file for automatic PDF filing
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The config.yaml file above is a simple folder-to-keyword matching text
+ file. It determines where your OCR'ed PDFs (and optionally, the original
+ scanned PDF) are placed after processing. An example is given below:
+
+ ::
+
+     target_folder: "docs/filed"
+     default_folder: "docs/filed/manual_sort"
+     original_move_folder: "docs/originals"
+
+     folders:
+         finances:
+             - american express
+             - chase card
+             - internal revenue service
+         travel:
+             - boarding pass
+             - airlines
+             - expedia
+             - orbitz
+         receipts:
+             - receipt
+
+ The ``target_folder`` is the root of your filing cabinet. All PDFs are
+ moved into sub-directories under this directory.
+
+ The ``folders`` section defines your filing directories and the keywords
+ associated with them. In this example, we have three filing directories
+ (finances, travel, receipts), and some associated keywords for each
+ filing directory. For example, if your OCR'ed PDF contains the phrase
+ "american express" (in any upper/lower case), it will be filed into
+ ``docs/filed/finances``.
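
The keyword matching described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PyPDFOCR's code: ``CONFIG`` is a plain dict mirroring the example config.yaml (so no YAML parser is needed), ``choose_folder`` is a hypothetical name, and letting the first matching folder win is an assumption, since the README does not say what happens when keywords from several folders match.

```python
import posixpath

# Mirrors the example config.yaml as a plain dict.
CONFIG = {
    "target_folder": "docs/filed",
    "default_folder": "docs/filed/manual_sort",
    "folders": {
        "finances": ["american express", "chase card", "internal revenue service"],
        "travel": ["boarding pass", "airlines", "expedia", "orbitz"],
        "receipts": ["receipt"],
    },
}


def choose_folder(text, config):
    """Pick a filing directory for the OCR'ed text of one PDF."""
    lowered = text.lower()  # match in any upper/lower case
    for folder, keywords in config["folders"].items():
        # Assumption: the first folder with any matching keyword wins.
        if any(keyword.lower() in lowered for keyword in keywords):
            return posixpath.join(config["target_folder"], folder)
    # No keyword matched: fall back to default_folder from the config.
    return config["default_folder"]
```

For example, ``choose_folder("Your AMERICAN EXPRESS statement", CONFIG)`` returns ``docs/filed/finances``, while text with no matching keyword ends up in ``docs/filed/manual_sort``.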
+
+ The ``default_folder`` is where the OCR'ed PDF is moved to if there is
+ no keyword match.
+
+ The ``original_move_folder`` is optional (you can comment it out with
+ ``#`` in front of that line), but if specified, the original scanned PDF
+ is moved into this directory after OCR is done. Otherwise, if this field
+ is not present or commented out, your original PDF will stay where it
+ was found.
+
+ If there is any naming conflict during filing, the program will add an
+ underscore followed by a number to the filename, in order to avoid
+ overwriting files that may already be present.
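
One way to picture the conflict handling is the sketch below. It is a guess at the behaviour, not the program's actual code: ``unique_path`` is a hypothetical helper, and both the placement of the suffix before the extension and the counter starting at 1 are assumptions.

```python
import os


def unique_path(path):
    """Return `path`, or a variant with _1, _2, ... if `path` is taken."""
    if not os.path.exists(path):
        return path
    stem, ext = os.path.splitext(path)
    counter = 1
    # Assumption: the suffix goes before the extension, e.g. bill_1.pdf.
    while os.path.exists(f"{stem}_{counter}{ext}"):
        counter += 1
    return f"{stem}_{counter}{ext}"
```

If ``docs/filed/finances/bill.pdf`` already exists, this returns ``docs/filed/finances/bill_1.pdf``, then ``bill_2.pdf`` once that name is taken as well.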

Caveats
-------

This code is brand-new, and is barely commented with no unit tests
included. I plan to improve things as time allows in the near future.
+ Sphinx code generation is on my TODO list.

- Dependencies:
- -------------
+ Installation
+ ------------

- PyPDFOCR relies on the following (free) programs being installed and in
- the path:
+ Using pip
+ ~~~~~~~~~

- - Tesseract OCR software https://code.google.com/p/tesseract-ocr/
- - GhostScript http://www.ghostscript.com/
- - PIL (Python Imaging Library) http://www.pythonware.com/products/pil/
- - ReportLab (PDF generation library)
-   http://www.reportlab.com/software/opensource/
- - Watchdog (Cross-platform filesystem events monitoring)
-   https://pypi.python.org/pypi/watchdog
+ PyPDFOCR is available on PyPI, so you can just run:
+
+ ::
+
+     pip install pypdfocr

- On Mac OS X, you can install the first two using homebrew:
+ You will also need to install the external dependencies listed below.
+ For those on **Windows**, because it's such a pain to get all the PIL
+ and PDF dependencies installed, I've gone ahead and made an executable
+ called
+ `pypdfocr.exe <https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true>`__
+
+ You still need to install Tesseract and GhostScript as detailed below in
+ the dependencies list.
+
+ Manual install
+ ~~~~~~~~~~~~~~
+
+ Clone the source directly from GitHub (you need to have git installed):

::

-     brew install tesseract
-     brew install ghostscript
+     git clone https://github.com/virantha/pypdfocr.git
+
+ Then, install the following third-party python libraries:
+
+ - PIL (Python Imaging Library) http://www.pythonware.com/products/pil/
+ - ReportLab (PDF generation library)
+   http://www.reportlab.com/software/opensource/
+ - Watchdog (Cross-platform filesystem events monitoring)
+   https://pypi.python.org/pypi/watchdog
+ - PyPDF2 (Pure python pdf library)

- The last three can be installed using a regular python manager such as
- pip:
+ These can all be installed via pip:

::

    pip install pil
    pip install reportlab
    pip install watchdog
+     pip install pypdf2
+
+ You will also need to install the external dependencies listed below.
+
+ External Dependencies:
+ ----------------------
+
+ PyPDFOCR relies on the following (free) programs being installed and in
+ the path:
+
+ - Tesseract OCR software https://code.google.com/p/tesseract-ocr/
+ - GhostScript http://www.ghostscript.com/
+
+ On Mac OS X, you can install these using homebrew:
+
+ ::
+
+     brew install tesseract
+     brew install ghostscript
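
Since both programs must be on the path, a quick check before running can save some confusion. The sketch below is not part of PyPDFOCR; ``missing_programs`` and ``EXTERNAL`` are hypothetical names. The executable names are the usual ones (``tesseract``; ``gs`` on Unix-like systems, ``gswin64c`` or ``gswin32c`` on Windows), but a given installation may differ.

```python
import shutil

# Candidate executable names per program; any one of them being found counts.
EXTERNAL = {
    "Tesseract": ["tesseract"],
    "GhostScript": ["gs", "gswin64c", "gswin32c"],
}


def missing_programs(candidates):
    """Return the labels whose executables cannot be found on the PATH."""
    return [
        label
        for label, names in candidates.items()
        if not any(shutil.which(name) for name in names)
    ]
```

``missing_programs(EXTERNAL)`` returns an empty list when both tools are found, and otherwise the names of the ones that still need to be installed.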