I had been using pdfsandwich to create searchable PDFs from non-searchable PDFs.
However, it's a pain to collect all the dependencies if, e.g., you don't have
root access. So I thought to package them up with Julia's BinaryBuilder to make
installation simple. However, I wasn't able to cross-compile pdfsandwich itself.
But since tesseract is doing the hard work anyway, I thought I would just write
the glue script myself. It turns out there are several of these already.
I believe I have likely diverged from the pdfsandwich implementation, since I
haven't used ImageMagick's convert, which is one of the dependencies of
pdfsandwich. Since the job can be done very simply, e.g.
- convert each page of the PDF to an image
- possibly clean it up with unpaper
- use tesseract to create a single-page searchable PDF
- combine the PDFs,
I decided not to look at the source of pdfsandwich when creating my
implementation so I can stick to an MIT license, which is the usual one in the
Julia community.
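As a rough illustration of the four steps listed above, here is a minimal sketch written as plain external commands. It is not the code used by this package or by pdfsandwich. The tool choices (poppler's pdftoppm and pdfunite, unpaper, and the tesseract CLI being on your PATH), the helper names, the 300 dpi default, and passing the page count in by hand are all assumptions made for the example.

```julia
# Hypothetical helpers for illustration only; none of these names come from
# SearchablePDFs.jl. Each one builds a command without running it.

# Step 1: rasterize a single page to `<prefix>.ppm` (assumes poppler's pdftoppm).
rasterize_cmd(pdf, page, prefix; dpi=300) =
    `pdftoppm -f $page -l $page -singlefile -r $dpi $pdf $prefix`

# Step 2 (optional): clean up the scanned image (assumes unpaper).
unpaper_cmd(img_in, img_out) = `unpaper $img_in $img_out`

# Step 3: OCR one image into `<outbase>.pdf`, a single-page searchable PDF.
ocr_cmd(img, outbase; lang="eng") = `tesseract $img $outbase -l $lang pdf`

# Step 4: combine the per-page PDFs (assumes poppler's pdfunite).
unite_cmd(page_pdfs, out) = `pdfunite $page_pdfs $out`

# Run the whole pipeline. `npages` is supplied by the caller in this sketch.
function make_searchable(pdf, npages, out; clean=false, workdir=mktempdir())
    page_pdfs = String[]
    for page in 1:npages
        base = joinpath(workdir, "page-$page")
        run(rasterize_cmd(pdf, page, base))
        img = base * ".ppm"
        if clean
            cleaned = base * "-clean.ppm"
            run(unpaper_cmd(img, cleaned))
            img = cleaned
        end
        run(ocr_cmd(img, base))
        push!(page_pdfs, base * ".pdf")
    end
    run(unite_cmd(page_pdfs, out))
    return out
end
```

Building each command separately keeps the steps easy to inspect before anything is executed, e.g. `rasterize_cmd("in.pdf", 1, "page-1")` just returns a `Cmd`.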
It more-or-less works on macOS (both Intel and Apple Silicon) and Linux.
Next steps:
- Allow choice of training data used for tesseract
- Robustify and test on more files
- Add better tests?
using SearchablePDFs
file = ocr("test/test_rasterized.pdf")
Supports @main, and on Julia v1.12 provides an app, searchable-pdf.
If you use juliaup, you can install 1.12 with juliaup add nightly, then run
JULIA_LOAD_PATH="@:@stdlib" julia +nightly --startup-file=no -e 'using Pkg; Pkg.activate(temp=true); Pkg.Apps.add(url="https://github.com/ericphanson/SearchablePDFs.jl")'
to install a CLI executable searchable-pdf to the bin directory in your Julia
depot (~/.julia by default). You will likely need to add your bin directory to
your PATH, e.g.
export PATH="/Users/eph/.julia/bin:$PATH"
which can go in a shell startup script (e.g. ~/.bashrc or ~/.zshrc).
You can re-run the install command above to update it.
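If you are not sure where your depot's bin directory is, or whether it is already on your PATH, something like the following sketch can tell you. It assumes the executable lands in the bin directory of the first entry of `DEPOT_PATH`. The helper names `depot_bin` and `on_path` are made up for this example and are not part of SearchablePDFs.jl.

```julia
# Hypothetical helper: the `bin` directory of a depot. By default this uses the
# first entry of `DEPOT_PATH`, which is the user depot (~/.julia unless changed).
depot_bin(depot::AbstractString = first(DEPOT_PATH)) = joinpath(depot, "bin")

# Hypothetical helper: does `dir` appear verbatim as an entry of a PATH-style string?
function on_path(dir::AbstractString, path::AbstractString = get(ENV, "PATH", ""))
    sep = Sys.iswindows() ? ';' : ':'
    return dir in split(path, sep)
end

bindir = depot_bin()
if on_path(bindir)
    println(bindir, " is already on PATH")
else
    println("Consider adding this line to your shell startup script:")
    println("export PATH=\"", bindir, ":\$PATH\"")
end
```

The check is an exact string comparison, so a PATH entry written with a trailing slash or via a symlink will not be recognized.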