camlpdf example

I needed to extract text from a well-structured PDF. This turned out to be a much less explored topic than I expected. There are many libraries in multiple languages, but they are all surprisingly difficult to use for this task. After a closer look I decided to use camlpdf, since I'm most comfortable with OCaml and the libraries I considered in other languages (JavaScript, Python) didn't seem any simpler.

Camlpdf is the most up-to-date PDF library for OCaml. After a couple of hours of hacking I was able to parse the text in a PDF and extract it as UTF-8. It's trickier than expected, since you need to access font information to decode non-ASCII characters.
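To illustrate why font information matters, here is a minimal, standard-library-only sketch. It is not camlpdf's API and not the code in this repository: the table `toy_font_table` and the function `utf8_of_raw` are hypothetical names, and the table entries are made up. The idea is that the bytes in a PDF text operand are character codes that only make sense relative to a font, so each code has to be mapped to a Unicode code point before it can be written out as UTF-8. The sketch assumes a single-byte font and OCaml 4.06 or later.

```ocaml
(* Hypothetical per-font table: character code -> Unicode code point.
   In a real PDF this mapping comes from the font's encoding data;
   the entries below are invented for illustration. *)
let toy_font_table : (int * int) list =
  [ (0x01, 0x0041);   (* code 1 -> U+0041 'A' *)
    (0x02, 0x00E9);   (* code 2 -> U+00E9 'é' *)
    (0x03, 0x0142) ]  (* code 3 -> U+0142 'ł' *)

(* Decode a raw string operand one byte at a time (single-byte font assumed).
   Codes missing from the table become U+FFFD, the replacement character. *)
let utf8_of_raw (table : (int * int) list) (raw : string) : string =
  let buf = Buffer.create (String.length raw * 2) in
  String.iter
    (fun c ->
      let cp =
        match List.assoc_opt (Char.code c) table with
        | Some cp -> cp
        | None -> 0xFFFD
      in
      Buffer.add_utf_8_uchar buf (Uchar.of_int cp))
    raw;
  Buffer.contents buf

(* The raw bytes 1, 2, 3 are meaningless without the table above. *)
let () = print_endline (utf8_of_raw toy_font_table "\001\002\003")
```

Read as plain ASCII or Latin-1, the same three bytes would come out as control characters, which is why text extraction has to go through the font.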

Running the example

Install Nix
Run nix develop -c $SHELL
Run dune exec ./src/parser.exe
