Document Extraction Tips

This document collects lessons learned and tips for extracting structured text from HTML and PDF documents.

General Info

It covers various tools and document types as I work through the challenge and teach myself techniques for extracting documents. Along the way it touches on programming languages (e.g., Python) as well as some software tools.

Not yet explored, but showing promise:

Apache Tika - recommended by Sophia Parafina
Apache PDFBox - recommended by Steve Citron-Pousty - appears to be geared toward extracting data from PDFs.
Data Science Toolkit - recommended by Harry Wood
CAM-PDF - recommended by Jeremiah Felt

Parsing HTML Documents

Python

I have primarily been working in Python, as it has good, flexible capabilities, for example allowing values to be converted from one type to another on the fly. It also has many libraries for accessing documents and manipulating their components, along with strong built-in string and list handling.
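As a small illustration of the type conversion and string and list handling mentioned above, here is a minimal sketch. The sample heading line and its numbering scheme are hypothetical and are not taken from any particular municipal code.

```python
# Hypothetical heading line, as it might appear in extracted text.
line = "12.04.030 Permit required"

# String handling: split the section number from the heading text.
number, title = line.split(" ", 1)

# List handling plus type conversion: turn "12.04.030" into integers
# so that sections can be sorted numerically rather than alphabetically.
parts = [int(piece) for piece in number.split(".")]

print(number, parts, title)
```

Converting the pieces to integers is one way to sort sections in their natural order; whether that fits depends on how a given code numbers its sections.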

Beautiful Soup

One very useful Python library for parsing and extracting data from HTML is Beautiful Soup. It provides easy, consistent access to the HTML document object model by tag. This can be very helpful in extracting structured text, because it lets you take advantage of formatting in the document, for example titles, chapter headings, section headings and so on.
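The sketch below shows one way the heading-based approach might look. It assumes the `beautifulsoup4` package is installed and that the document marks titles, chapters and sections with `h1`, `h2` and `h3` tags; the sample HTML and the `extract_outline` helper are hypothetical, and a real document may well use different tags or CSS classes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input; a real municipal code page will differ.
SAMPLE_HTML = """
<html><body>
  <h1>Title 1: General Provisions</h1>
  <h2>Chapter 1.01: Code Adoption</h2>
  <h3>Section 1.01.010</h3>
  <p>This code is adopted as the municipal code.</p>
  <h3>Section 1.01.020</h3>
  <p>Definitions apply throughout.</p>
</body></html>
"""

def extract_outline(html):
    """Return (tag_name, text) pairs for headings and paragraphs, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    outline = []
    # find_all accepts a list of tag names and preserves document order.
    for tag in soup.find_all(["h1", "h2", "h3", "p"]):
        outline.append((tag.name, tag.get_text(strip=True)))
    return outline

outline = extract_outline(SAMPLE_HTML)
for name, text in outline:
    print(name, text)
```

Keeping the tag name alongside the text preserves the hierarchy, so a later step can decide which paragraphs belong to which section.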

Parsing PDF Documents

Tabula: For PDF Containing Tables

From Joe Larson, this tip: use Tabula from Mozilla Open. It allows you to upload a PDF containing tabular data, and returns a CSV file.

Adobe Acrobat Professional

I am still investigating options for parsing PDF documents into structured text. One fortunate thing is that I have an older version of Adobe Acrobat Professional, which allows documents to be exported in various formats. However, that doesn't necessarily solve the problems of odd formatting and extraneous tagging within the document.

One thing that was useful was processing the PDF in Acrobat to optimize it and reduce its size. That appears to have consolidated some of the tagging. Given a document of nearly 2,000 pages, I still had issues with Acrobat crashing while attempting to export the document. To work around that, I split it into two smaller pieces using the Adobe Acrobat "extract pages" function, which may also leave behind any other embedded oddities in the document that may have been leading to crashes.

How to Publish The Opened Documents

Statutory and Regulatory Documents

For this civic hacking challenge, I have been working primarily on opening statutory and regulatory documents, in this case municipal codes. I have been using the XML format provided by the Open Government Foundation "State Decoded" project.
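As a rough sketch of what writing one section of a code out as XML might look like, the example below uses Python's standard `xml.etree.ElementTree` module. The element names (`law`, `structure`, `unit`, `section_number`, `catch_line`, `text`, `section`), the `build_law_xml` helper and the sample values are all assumptions for illustration; check them against the State Decoded documentation before relying on them.

```python
import xml.etree.ElementTree as ET

def build_law_xml(title, chapter, section_number, catch_line, paragraphs):
    """Build one section of a code as an XML string (element names are assumed)."""
    law = ET.Element("law")

    # Record where the section sits in the hierarchy, outermost level first.
    structure = ET.SubElement(law, "structure")
    levels = [("title", title), ("chapter", chapter)]
    for level, (label, name) in enumerate(levels, start=1):
        unit = ET.SubElement(structure, "unit", label=label, level=str(level))
        unit.text = name

    ET.SubElement(law, "section_number").text = section_number
    ET.SubElement(law, "catch_line").text = catch_line

    # One child element per paragraph of the section body.
    text = ET.SubElement(law, "text")
    for paragraph in paragraphs:
        ET.SubElement(text, "section").text = paragraph

    return ET.tostring(law, encoding="unicode")

# Hypothetical sample values.
xml_string = build_law_xml(
    title="General Provisions",
    chapter="Code Adoption",
    section_number="1.01.010",
    catch_line="Adoption of code",
    paragraphs=["This code is adopted.", "It takes effect on passage."],
)
print(xml_string)
```

Building the tree with a library rather than by string concatenation means special characters such as `&` and `<` in the source text are escaped automatically.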
