SearchablePDFs

I had been using pdfsandwich to create searchable PDFs from non-searchable PDFs. However, collecting all of its dependencies is a pain if, e.g., you don't have root access, so I thought I would package them up with Julia's BinaryBuilder to make installation simple. I wasn't able to cross-compile pdfsandwich itself, but since tesseract is doing the hard work anyway, I decided to write the glue script myself. It turns out there are several of these already.

I have likely diverged from the pdfsandwich implementation, since I don't use ImageMagick's convert, which is one of pdfsandwich's dependencies. The job can be done very simply:

convert each page of the PDF to an image
possibly clean it up with unpaper
use tesseract to create a single-page searchable PDF from each image
combine the PDFs

Because the job is this simple, I decided not to look at the source of pdfsandwich when creating my implementation, so that I can stick to an MIT license, which is the usual one in the Julia community.
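As a rough illustration of those steps, here is a minimal shell sketch of the same pipeline. It is not the code this package runs. The tool choices other than tesseract (poppler's pdftoppm and pdfunite) and the function names run_step, ocr_page and make_searchable are assumptions made for the example, and the optional unpaper cleanup is left out.

```shell
# Sketch only: tool choices (pdftoppm, pdfunite) and all function names
# are assumptions for illustration, not this package's implementation.

# Run a command, or just print it when DRY_RUN=1.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# OCR one page image into a single-page searchable PDF.
# `tesseract IMAGE OUTPUT_BASE pdf` writes OUTPUT_BASE.pdf.
ocr_page() {
    run_step tesseract "$1" "$2" pdf
}

# Whole document: make_searchable INPUT.pdf OUTPUT.pdf
make_searchable() {
    input_pdf="$1"
    output_pdf="$2"
    work_dir="$(mktemp -d)" || return 1

    # Rasterize every page to $work_dir/page-N.png at 300 DPI.
    run_step pdftoppm -r 300 -png "$input_pdf" "$work_dir/page" || return 1

    # (Optional cleanup with unpaper is omitted from this sketch.)

    # OCR each page image into a PDF next to it.
    for page_image in "$work_dir"/page-*.png; do
        [ -e "$page_image" ] || continue   # the glob matched nothing
        ocr_page "$page_image" "${page_image%.png}" || return 1
    done

    # Concatenate the per-page PDFs. pdftoppm zero-pads page numbers,
    # so glob order should match page order.
    run_step pdfunite "$work_dir"/page-*.pdf "$output_pdf"
}
```

For example, make_searchable scan.pdf scan_ocr.pdf would run the whole chain, and setting DRY_RUN=1 first prints each command instead of running it.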

Status

It more or less works on macOS (both Intel and Apple Silicon) and Linux.

Next steps:

Allow choice of training data used for tesseract
Robustify and test on more files
Add better tests?

Usage

using SearchablePDFs
file = ocr("test/test_rasterized.pdf")

The package also supports @main and, on Julia v1.12, provides an app called searchable-pdf.

If you use juliaup you can install 1.12 with juliaup add nightly, then run

JULIA_LOAD_PATH="@:@stdlib" julia +nightly --startup-file=no -e 'using Pkg; Pkg.activate(temp=true); Pkg.Apps.add(url="https://github.com/ericphanson/SearchablePDFs.jl")'

to install a CLI executable searchable-pdf to the bin directory in your Julia depot (~/.julia by default). You will likely need to add your bin directory to your PATH, e.g.

export PATH="$HOME/.julia/bin:$PATH"

which can go in a shell startup script (e.g. ~/.bashrc or ~/.zshrc).
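If your depot is not in the default location, a startup-script snippet along these lines may be more robust. It is a sketch under two assumptions: that apps are installed into the bin directory of the first depot listed in JULIA_DEPOT_PATH, and that entries are separated by colons, as on Linux and macOS. The function names julia_bin_dir and add_julia_bin_to_path are made up for this example.

```shell
# Print the bin directory of the first Julia depot.
# Assumption: apps are installed into the first depot's bin directory.
julia_bin_dir() {
    first_depot="${JULIA_DEPOT_PATH%%:*}"      # text before the first ':'
    echo "${first_depot:-$HOME/.julia}/bin"    # fall back to ~/.julia
}

# Prepend that directory to PATH unless it is already present.
add_julia_bin_to_path() {
    bin_dir="$(julia_bin_dir)"
    case ":$PATH:" in
        *":$bin_dir:"*) ;;                     # already on PATH
        *) PATH="$bin_dir:$PATH"; export PATH ;;
    esac
}

add_julia_bin_to_path
```

Because of the check in add_julia_bin_to_path, sourcing the startup script more than once does not add the directory to PATH again.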

You can re-run the installation command above to update the app.
