This repository was archived by the owner on Nov 7, 2018. It is now read-only.

Make minor corrections/tweaks to README.md #17

Open
wants to merge 2 commits into
base: master
18 changes: 12 additions & 6 deletions README.md
@@ -13,14 +13,15 @@ Python library to extract text from any file type compatiable with [TIKA](http:/

##### Installation
1. Download tika-server-1.7.jar from [Apache Tika](http://www.apache.org/dyn/closer.cgi/tika/tika-server-1.7.jar)
-2. Mac: `brew install ghostscripts` Ubuntu: `sudo apt-get install ghostscript`
+2. Mac: `brew install ghostscript` Ubuntu: `sudo apt-get install ghostscript`
3. Mac: `brew install tesseract` Ubuntu: `sudo apt-get install tesseract-ocr`
4. Mac: `brew tap homebrew/x11` and `brew install xpdf` Ubuntu: `sudo apt-get install poppler-utils`
5. Install Python dependencies with `pip install -r requirements.txt`
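
Review note: the installation steps above can be sanity-checked with a short script. This is only a sketch. The executable names (`gs`, `tesseract`, `pdftotext`, `java`) are assumptions about what the listed packages put on the `PATH`, and `missing_tools` is a hypothetical helper, not part of this repository.

```python
import shutil

# Assumed mapping from the packages named in the install steps to the
# executables they usually provide; adjust if your system differs.
REQUIRED_TOOLS = {
    "ghostscript": "gs",
    "tesseract": "tesseract",
    "xpdf / poppler-utils": "pdftotext",
    "java (needed to run tika-server)": "java",
}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the package names whose executable is not found on PATH."""
    return [pkg for pkg, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All external tools found.")
```

An empty result only means the executables were found on the `PATH`; it does not confirm that their versions match what the library expects.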

##### Usage
-These script assume that an instance of Tika server is running.
-Starting Tika Servers
+These scripts assume that an instance of Tika server is running.
+
+Starting Tika Servers:
`java -jar tika-server-1.7.jar --port 9998`
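
Review note: because the scripts assume a running server, a pre-flight check can make failures easier to diagnose. The sketch below only tests whether something accepts TCP connections on the port from the command above. The host `localhost` and the helper name `tika_server_is_up` are assumptions for illustration.

```python
import socket


def tika_server_is_up(host="localhost", port=9998, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (e.g. connection refused,
        # timeout) when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if tika_server_is_up():
        print("Port 9998 is accepting connections.")
    else:
        print("Nothing is listening on port 9998; start the Tika server first.")
```

This confirms only that the port is open, not that the listener is actually Tika.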

In Python script
@@ -31,13 +32,18 @@ text_extractor(doc_path=doc_path, force_convert=False)

##### Tests
In order to run tests:
+
1. All requirements must be installed
2. Both Tika servers need to be running

-Tests are run with nose
-Installation
+Tests are run with nose.
+
+Nose installation:
+
`pip install -r test-requirements.txt`
-Running tests
+
+Running tests:
+
`nosetests`

##### OCR methodology