Skip to content

Commit 62523ae

Browse files
committed
Merge branch 'release/0.3'
2 parents d2bcff1 + d45c927 commit 62523ae

File tree

12 files changed

+500
-86
lines changed

12 files changed

+500
-86
lines changed

CHANGES.txt

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1-
v0.2.2, 10/23/13 -- Added a console script to put the pypdfocr script into your bin
2-
v0.2.1, 10/23/13 -- Fix to initial packaging problem.
3-
v0.2, 10/22/13 -- Initial release.
1+
v0.3, 10/24/13 -- Added filing of converted pdfs using a configuration file to specify target directories based on keyword matches in the pdf text
2+
v0.2.2, 10/22/13 -- Added a console script to put the pypdfocr script into your bin
3+
v0.2.1, 10/22/13 -- Fix to initial packaging problem.
4+
v0.2, 10/21/13 -- Initial release.

MANIFEST.in

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
include *.txt
22
include *.rst
33
include LICENSE
4+
include pypdfocr/config/*.yaml
45

README.md

Lines changed: 100 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -1,46 +1,126 @@
11
# PyPDFOCR
2+
This program will help manage your scanned PDFs for you. It can do the following:
23

3-
This script will take a pdf file and generate the corresponding OCR'ed version.
4+
- Take a scanned PDF file and run OCR on it (using free OCR tools), generating a searchable PDF
5+
- Optionally, watch a folder for incoming scanned PDFs and automatically run OCR on them
6+
- Optionally, file the scanned PDFs into directories based on simple keyword matching that you specify
7+
- _Coming soon_: Evernote auto-upload and filing
48

9+
More links:
10+
11+
- [Blog](http://virantha.com/categories/projects/pypdfocr)
12+
- [Documentation](http://documentup.com/virantha/pypdfocr)
13+
- [Source](https://www.github.com/virantha/pypdfocr)
14+
15+
516
## Usage:
617
### Single conversion:
7-
python pypdfocr.py filename.pdf
18+
pypdfocr filename.pdf
819

920
--> filename_ocr.pdf will be generated
1021

11-
### Folder monitoring (new!):
12-
python pypdfocr.py -w watch_directory
22+
### Folder monitoring:
23+
pypdfocr -w watch_directory
1324

1425
--> Every time a pdf file is added to `watch_directory` it will be OCR'ed
1526

16-
For those on Windows, because it's such a pain to get all the PIL and PDF
17-
dependencies installed, I've gone ahead and made an executable available under:
27+
### Automatic filing (new!):
28+
To automatically move the OCR'ed pdf to a directory based on a keyword, use the -f option
29+
and specify a configuration file (described below):
1830

19-
dist/pypdfocr.exe
31+
pypdfocr filename.pdf -f -c config.yaml
2032

21-
You still need to install Tesseract and GhostScript as detailed below in the dependencies
22-
list.
33+
You can also do this in folder monitoring mode:
34+
35+
pypdfocr -w watch_directory -f -c config.yaml
36+
37+
#### Configuration file for automatic PDF filing
38+
The config.yaml file above is a simple folder to keyword matching text file. It determines
39+
where your OCR'ed PDFs (and optionally, the original scanned PDF) are placed after processing.
40+
An example is given below:
41+
42+
target_folder: "docs/filed"
43+
default_folder: "docs/filed/manual_sort"
44+
original_move_folder: "docs/originals"
45+
46+
folders:
47+
finances:
48+
- american express
49+
- chase card
50+
- internal revenue service
51+
travel:
52+
- boarding pass
53+
- airlines
54+
- expedia
55+
- orbitz
56+
receipts:
57+
- receipt
58+
59+
The `target_folder` is the root of your filing cabinet. Any PDF moving will happen in
60+
sub-directories under this directory.
61+
62+
The `folders` section defines your filing directories and the keywords associated with them.
63+
In this example, we have three filing directories (finances, travl, receipts), and some
64+
associated keywords for each filing directory. For example, if your OCR'ed PDF
65+
contains the phrase "american express" (in any upper/lower case), it will be filed into
66+
`docs/filed/finances`
67+
68+
The `default_folder` is where the OCR'ed PDF is moved to if there is no keyword match.
69+
70+
The `original_move_folder` is optional (you can comment it out with `#` in
71+
front of that line), but if specified, the original scanned PDF is moved into
72+
this directory after OCR is done. Otherwise, if this field is not present or
73+
commented out, your original PDF will stay where it was found.
74+
75+
If there is any naming conflict during filing, the program will add an underscore followed by a
76+
number to each filename, in order to avoid overwriting files that may already be present.
2377

2478
## Caveats
2579
This code is brand-new, and is barely commented with no unit-tests included. I plan to improve
26-
things as time allows in the near-future.
80+
things as time allows in the near-future. Sphinx code generation is on my TODO list.
2781

28-
## Dependencies:
29-
PyPDFOCR relies on the following (free) programs being installed and in the path:
82+
## Installation
83+
### Using pip
84+
PyPDFOCR is available in PyPI, so you can just run:
3085

31-
- Tesseract OCR software https://code.google.com/p/tesseract-ocr/
32-
- GhostScript http://www.ghostscript.com/
86+
pip install pypdfocr
87+
88+
You will also need to install the external dependencies listed below.
89+
For those on **Windows**, because it's such a pain to get all the PIL and PDF
90+
dependencies installed, I've gone ahead and made an executable called
91+
[pypdfocr.exe](https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true)
92+
93+
You still need to install Tesseract and GhostScript as detailed below in the dependencies
94+
list.
95+
96+
### Manual install
97+
Clone the source directly from github (you need to have git installed):
98+
99+
git clone https://github.com/virantha/pypdfocr.git
100+
101+
Then, install the following third-party python libraries:
33102
- PIL (Python Imaging Library) http://www.pythonware.com/products/pil/
34103
- ReportLab (PDF generation library) http://www.reportlab.com/software/opensource/
35104
- Watchdog (Cross-platform fhlesystem events monitoring) https://pypi.python.org/pypi/watchdog
105+
- PyPDF2 (Pure python pdf library)
36106

37-
On Mac OS X, you can install the first two using homebrew:
38-
39-
brew install tesseract
40-
brew install ghostscript
41-
42-
The last three can be installed using a regular python manager such as pip:
107+
These can all be installed via pip:
43108

44109
pip install pil
45110
pip install reportlab
46111
pip install watchdog
112+
pip install pypdf2
113+
114+
You will also need to install the external dependencies listed below.
115+
116+
## External Dependencies:
117+
PyPDFOCR relies on the following (free) programs being installed and in the path:
118+
119+
- Tesseract OCR software https://code.google.com/p/tesseract-ocr/
120+
- GhostScript http://www.ghostscript.com/
121+
122+
On Mac OS X, you can install these using homebrew:
123+
124+
brew install tesseract
125+
brew install ghostscript
126+

README.rst

Lines changed: 130 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,22 @@
11
PyPDFOCR
22
========
33

4-
This script will take a pdf file and generate the corresponding OCR'ed
5-
version.
4+
This program will help manage your scanned PDFs for you. It can do the
5+
following:
6+
7+
- Take a scanned PDF file and run OCR on it (using free OCR tools),
8+
generating a searchable PDF
9+
- Optionally, watch a folder for incoming scanned PDFs and
10+
automatically run OCR on them
11+
- Optionally, file the scanned PDFs into directories based on simple
12+
keyword matching that you specify
13+
- *Coming soon*: Evernote auto-upload and filing
14+
15+
More links:
16+
17+
- `Blog <http://virantha.com/categories/projects/pypdfocr>`__
18+
- `Documentation <http://documentup.com/virantha/pypdfocr>`__
19+
- `Source <https://www.github.com/virantha/pypdfocr>`__
620

721
Usage:
822
------
@@ -12,63 +26,151 @@ Single conversion:
1226

1327
::
1428

15-
python pypdfocr.py filename.pdf
29+
pypdfocr filename.pdf
1630

1731
--> filename_ocr.pdf will be generated
1832

19-
Folder monitoring (new!):
20-
~~~~~~~~~~~~~~~~~~~~~~~~~
33+
Folder monitoring:
34+
~~~~~~~~~~~~~~~~~~
2135

2236
::
2337

24-
python pypdfocr.py -w watch_directory
38+
pypdfocr -w watch_directory
2539

2640
--> Every time a pdf file is added to `watch_directory` it will be OCR'ed
2741

28-
For those on Windows, because it's such a pain to get all the PIL and
29-
PDF dependencies installed, I've gone ahead and made an executable
30-
available under:
42+
Automatic filing (new!):
43+
~~~~~~~~~~~~~~~~~~~~~~~~
44+
45+
To automatically move the OCR'ed pdf to a directory based on a keyword,
46+
use the -f option and specify a configuration file (described below):
3147

3248
::
3349

34-
dist/pypdfocr.exe
50+
pypdfocr filename.pdf -f -c config.yaml
3551

36-
You still need to install Tesseract and GhostScript as detailed below in
37-
the dependencies list.
52+
You can also do this in folder monitoring mode:
53+
54+
::
55+
56+
pypdfocr -w watch_directory -f -c config.yaml
57+
58+
Configuration file for automatic PDF filing
59+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
60+
61+
The config.yaml file above is a simple folder to keyword matching text
62+
file. It determines where your OCR'ed PDFs (and optionally, the original
63+
scanned PDF) are placed after processing. An example is given below:
64+
65+
::
66+
67+
target_folder: "docs/filed"
68+
default_folder: "docs/filed/manual_sort"
69+
original_move_folder: "docs/originals"
70+
71+
folders:
72+
finances:
73+
- american express
74+
- chase card
75+
- internal revenue service
76+
travel:
77+
- boarding pass
78+
- airlines
79+
- expedia
80+
- orbitz
81+
receipts:
82+
- receipt
83+
84+
The ``target_folder`` is the root of your filing cabinet. Any PDF moving
85+
will happen in sub-directories under this directory.
86+
87+
The ``folders`` section defines your filing directories and the keywords
88+
associated with them. In this example, we have three filing directories
89+
(finances, travl, receipts), and some associated keywords for each
90+
filing directory. For example, if your OCR'ed PDF contains the phrase
91+
"american express" (in any upper/lower case), it will be filed into
92+
``docs/filed/finances``
93+
94+
The ``default_folder`` is where the OCR'ed PDF is moved to if there is
95+
no keyword match.
96+
97+
The ``original_move_folder`` is optional (you can comment it out with
98+
``#`` in front of that line), but if specified, the original scanned PDF
99+
is moved into this directory after OCR is done. Otherwise, if this field
100+
is not present or commented out, your original PDF will stay where it
101+
was found.
102+
103+
If there is any naming conflict during filing, the program will add an
104+
underscore followed by a number to each filename, in order to avoid
105+
overwriting files that may already be present.
38106

39107
Caveats
40108
-------
41109

42110
This code is brand-new, and is barely commented with no unit-tests
43111
included. I plan to improve things as time allows in the near-future.
112+
Sphinx code generation is on my TODO list.
44113

45-
Dependencies:
46-
-------------
114+
Installation
115+
------------
47116

48-
PyPDFOCR relies on the following (free) programs being installed and in
49-
the path:
117+
Using pip
118+
~~~~~~~~~
50119

51-
- Tesseract OCR software https://code.google.com/p/tesseract-ocr/
52-
- GhostScript http://www.ghostscript.com/
53-
- PIL (Python Imaging Library) http://www.pythonware.com/products/pil/
54-
- ReportLab (PDF generation library)
55-
http://www.reportlab.com/software/opensource/
56-
- Watchdog (Cross-platform fhlesystem events monitoring)
57-
https://pypi.python.org/pypi/watchdog
120+
PyPDFOCR is available in PyPI, so you can just run:
121+
122+
::
123+
124+
pip install pypdfocr
58125

59-
On Mac OS X, you can install the first two using homebrew:
126+
You will also need to install the external dependencies listed below.
127+
For those on **Windows**, because it's such a pain to get all the PIL
128+
and PDF dependencies installed, I've gone ahead and made an executable
129+
called
130+
`pypdfocr.exe <https://github.com/virantha/pypdfocr/blob/master/dist/pypdfocr.exe?raw=true>`__
131+
132+
You still need to install Tesseract and GhostScript as detailed below in
133+
the dependencies list.
134+
135+
Manual install
136+
~~~~~~~~~~~~~~
137+
138+
Clone the source directly from github (you need to have git installed):
60139

61140
::
62141

63-
brew install tesseract
64-
brew install ghostscript
142+
git clone https://github.com/virantha/pypdfocr.git
143+
144+
Then, install the following third-party python libraries: - PIL (Python
145+
Imaging Library) http://www.pythonware.com/products/pil/ - ReportLab
146+
(PDF generation library) http://www.reportlab.com/software/opensource/ -
147+
Watchdog (Cross-platform fhlesystem events monitoring)
148+
https://pypi.python.org/pypi/watchdog - PyPDF2 (Pure python pdf library)
65149

66-
The last three can be installed using a regular python manager such as
67-
pip:
150+
These can all be installed via pip:
68151

69152
::
70153

71154
pip install pil
72155
pip install reportlab
73156
pip install watchdog
157+
pip install pypdf2
158+
159+
You will also need to install the external dependencies listed below.
160+
161+
External Dependencies:
162+
----------------------
163+
164+
PyPDFOCR relies on the following (free) programs being installed and in
165+
the path:
166+
167+
- Tesseract OCR software https://code.google.com/p/tesseract-ocr/
168+
- GhostScript http://www.ghostscript.com/
169+
170+
On Mac OS X, you can install these using homebrew:
171+
172+
::
173+
174+
brew install tesseract
175+
brew install ghostscript
74176

dist/pypdfocr.exe

148 KB
Binary file not shown.

pypdfocr/__init__.py

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,2 +0,0 @@
1-
2-
__version__ = "0.2.2"

0 commit comments

Comments
 (0)