markdown_forge

markdown_forge is a publication "forge" that reshapes opaque EPUB and PDF sources into Markdown that is ready for AI/ML pipelines, search, and discovery tools, and produces clean publication exports from that same Markdown.

Purpose

Discovery-focused: Strip away EPUB/PDF quirks so OS indexers, LLM pipelines, and search tooling can surface content that would otherwise stay buried inside binary containers.
Canonical touchstone: Normalize publications into a single Markdown file with enriched front matter (title, title_short, author, publisher, year, identifiers, etc.) that serves as the source of truth for downstream automation.
Repeatable derivatives: Regenerate clean EPUBs and fully self-contained HTML (embedded Base64 imagery, inline CSS) directly from that canonical Markdown.
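
As an illustration of how downstream automation might read the canonical file, here is a minimal sketch that splits front matter from the body. It assumes YAML-style front matter delimited by `---` lines and holding flat `key: value` pairs, which this README does not specify. The function name `split_front_matter` and the sample values are hypothetical.

```python
def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown document into (front matter, body).

    Assumes flat `key: value` pairs between two `---` lines; nested YAML
    is not handled in this sketch.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter block found
    meta: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[index + 1:])
            return meta, body
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # opening delimiter was never closed


# Hypothetical sample document using field names listed in the README.
sample = """---
title: "An Example Title"
title_short: Example
author: Jane Doe
year: 2001
---
# Chapter 1
"""
meta, body = split_front_matter(sample)
print(meta["title_short"], meta["year"])
```

A real pipeline would more likely hand the block to a YAML parser; the hand-rolled version only keeps the sketch free of dependencies.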

Core workflow

Drop raw source files into IN/.
Run python tools/convert_IN_preprocess.py to inspect file types, route EPUBs and PDFs through the matching conversion/cleanup flow, and populate publication workspaces. In general, run either epub_to_markdown or pdf_to_markdown, then run epub_markdown_cleanup or pdf_markdown_cleanup to clean up the generated Markdown as needed. You will very likely need to confirm the front matter manually and do some regex cleanup.
Iterate on the generated Markdown within each publication directory until the content and metadata look correct.
The intent is that the core .md files are permanent and can be improved continually; use the publishing tools to regenerate EPUB/HTML versions as needed.
When you are happy with the results, each publication has its own folder containing the core .md file, images in an images directory, a self-contained .html file with embedded images (for single-file viewing and discovery), and a clean .epub version. You can then run publication_cleanup to move the .md file and its derivatives into the OUT folder for further archiving, data set collection, or reading.
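
The routing step above can be sketched as a lookup from file extension to the tools named in this README. This is only an illustration of the idea, not how convert_IN_preprocess.py is implemented: the names `TOOL_CHAIN` and `plan_tools` are hypothetical, and the sketch deliberately stops short of building command lines because the tools' arguments are not documented here.

```python
from pathlib import Path

# Tool names are taken from the README; the mapping itself is an assumption.
TOOL_CHAIN = {
    ".epub": ["tools/epub_to_markdown.py", "tools/epub_markdown_cleanup.py"],
    ".pdf": ["tools/pdf_to_markdown.py", "tools/pdf_markdown_cleanup.py"],
}


def plan_tools(in_dir: Path) -> dict[str, list[str]]:
    """Map each EPUB/PDF file in in_dir to the tools that would process it."""
    plan: dict[str, list[str]] = {}
    for source in sorted(in_dir.iterdir()):
        tools = TOOL_CHAIN.get(source.suffix.lower())
        if source.is_file() and tools:
            plan[source.name] = list(tools)
    return plan


if __name__ == "__main__":
    in_dir = Path("IN")
    if in_dir.is_dir():
        for name, tools in plan_tools(in_dir).items():
            print(name, "->", " then ".join(tools))
```

Files with any other extension are skipped, which mirrors the README's focus on EPUB and PDF inputs.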

Tool highlights

tools/filetype_inspect.py: Probes files with file, exiftool, and ffprobe, suggests canonical extensions, and updates metadata so inputs are well-typed before conversion.
tools/epub_folderize.py → tools/epub_to_markdown.py → tools/epub_markdown_cleanup.py: Unpack EPUBs, extract Markdown/images, strip custom CSS, collapse Pandoc/Calibre artifacts, and leave a clean publication folder anchored by Markdown front matter.
tools/markdown_to_self_contained_html.py: Produces single-file HTML (embedded assets, inline CSS) from the canonical Markdown for maximum portability.
tools/markdown_to_epub.py: Rebuilds EPUB containers from the same Markdown touchstone once cleanup is complete.
tools/publication_cleanup.py: Renames publications and derivative assets based on front matter so OUT/ stays organized.
tools/toc_rebuilder.py: Reconstructs the ## TABLE OF CONTENTS section by linking every H2 heading in a Markdown file.
PDF track: tools/pdf_to_markdown.py, tools/pdf_markdown_cleanup.py, and tools/acrobat-html_to_markdown.py offer multiple paths—from direct PDF extraction to Acrobat HTML exports—for turning printed layouts into workable Markdown.
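
To make the toc_rebuilder idea concrete, here is a minimal sketch of linking every H2 heading. It is not the tool's actual code: `slugify` and `build_toc` are hypothetical names, and the anchor format (lowercase, punctuation dropped, spaces hyphenated) is an assumption modelled on common Markdown renderers.

```python
import re

FENCE = "`" * 3  # fenced code delimiter, built indirectly to keep this block clean


def slugify(heading: str) -> str:
    """Approximate a renderer-style anchor for a heading."""
    slug = re.sub(r"[^\w\s-]", "", heading.strip().lower())
    return re.sub(r"\s+", "-", slug)


def build_toc(markdown: str) -> str:
    """Return a '## TABLE OF CONTENTS' section linking every other H2 heading."""
    entries = []
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # ignore headings inside fenced code
            continue
        match = re.match(r"^## +(.+?)\s*$", line)
        if match and not in_fence and match.group(1).upper() != "TABLE OF CONTENTS":
            title = match.group(1)
            entries.append(f"- [{title}](#{slugify(title)})")
    return "\n".join(["## TABLE OF CONTENTS", "", *entries])
```

The sketch only produces the new section; replacing an existing table of contents in the file is left out.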

Working with PDFs

Hybrid strategy: Start with tools/pdf_to_markdown.py for direct extraction; if layout noise persists, export the PDF through Adobe Acrobat to HTML and run tools/acrobat-html_to_markdown.py for a cleaner baseline.
Prepare the source: Crop print-oriented PDFs to remove headers/footers before conversion to avoid repeated page numbers or chapter labels bleeding into the text.
Fallback via images: When text extraction proves unreliable, leverage the per-page images that tools/pdf_to_markdown.py emits under source_pdf/extracted/. Re-compose them into a PDF and run OCR to recover faithful text without Acrobat’s typical spacing and glyph anomalies.
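
Cropping is the approach recommended above. When cropping is not practical, a text-level filter is one possible complement: drop lines that recur on most pages, plus bare page numbers. The sketch below is an assumption about how such a filter could look, not a description of pdf_markdown_cleanup.py; `strip_repeated_lines` and its `min_share` parameter are hypothetical.

```python
import re
from collections import Counter

PAGE_NUMBER = re.compile(r"^\d{1,4}$")  # a line holding only a short number


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Remove likely running headers/footers from per-page text.

    A line is treated as a header/footer when it appears on at least
    min_share of the pages (and on at least two pages).
    """
    counts: Counter[str] = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, seen in counts.items() if seen >= threshold}
    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip() not in repeated and not PAGE_NUMBER.match(line.strip())
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

The page-number rule can also remove legitimate numeric lines, such as a lone year in a table, so the output needs a manual check.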

Repository layout

tools/: Command-line utilities that implement the ingestion, cleanup, and publishing pipeline.
IN/ / OUT/: Working directories kept in version control via .gitkeep but emptied by default. Contents are ignored so personal source material never leaks into public history.
design.md: Design notes and process documentation.
requirements.txt: Minimum Python package requirements for the toolchain.

Getting started

Install dependencies: python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
Process inputs: Place EPUB/PDF files inside IN/ and run the orchestrator (tools/convert_IN_preprocess.py) or the individual tools as needed.
Publish outputs: Use the Markdown touchstone plus tools/markdown_to_self_contained_html.py and tools/markdown_to_epub.py to forge portable deliverables ready for AI/ML ingestion or general distribution.
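
The "self-contained" part of the HTML export comes down to replacing image references with Base64 data URIs. The following is a minimal sketch of that technique, not the code in markdown_to_self_contained_html.py; `to_data_uri` and `embed_images` are hypothetical names, and the regular expression only handles double-quoted `src` attributes.

```python
import base64
import mimetypes
import re
from pathlib import Path


def to_data_uri(image_path: Path) -> str:
    """Encode a local file as a data: URI."""
    mime, _ = mimetypes.guess_type(image_path.name)
    encoded = base64.b64encode(image_path.read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


def embed_images(html: str, base_dir: Path) -> str:
    """Replace local <img src="..."> references with inline data URIs."""

    def replace(match: re.Match) -> str:
        target = base_dir / match.group(2)
        if not target.is_file():
            return match.group(0)  # leave remote or missing images untouched
        return f"{match.group(1)}{to_data_uri(target)}{match.group(3)}"

    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', replace, html)
```

Embedding increases file size by roughly a third per image because of the Base64 encoding, which is the trade-off for a single portable file.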

TUI launcher

If you don't want to invoke the tools manually, use the lightweight two-panel TUI to work with your files.

Launch: From repo root, run:

python tools/tui_launcher.py

Panels:
- Left: file browser rooted at the current directory
- Right: tool list and a live log area
File browser conventions:
- Files are prefixed with tags for quick recognition: [PDF], [EPUB], [MD]
- Directories show a trailing slash, e.g., MyFolder/
- The currently selected item is bolded
Keybindings:
- enter: select the highlighted file and focus the tool list
- tab: switch focus between file browser and tool list
- r: refresh the file browser (preserves panel order)
- o: open selected file (PDF/EPUB via system viewer; MD tries nvim in a new terminal window, else system opener)
- e: execute the selected tool on the selected file
- q: quit
Tool execution UX:
- The right log shows the exact command invoked, streams the tool output live, and reports completion status with elapsed time
- On successful completion, the file tree auto-refreshes
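
The file browser conventions above can be expressed as a small labelling function. This is a sketch under assumptions rather than the launcher's code: `browser_label` is a hypothetical name, and the single space between tag and file name is a guess at the display format.

```python
from pathlib import Path

# Tags listed in the README, keyed by lowercase extension.
TAGS = {".pdf": "[PDF]", ".epub": "[EPUB]", ".md": "[MD]"}


def browser_label(path: Path) -> str:
    """Return the display label for one entry in the file browser."""
    if path.is_dir():
        return f"{path.name}/"  # directories show a trailing slash
    tag = TAGS.get(path.suffix.lower())
    return f"{tag} {path.name}" if tag else path.name
```

Files with other extensions are shown without a tag in this sketch; the README does not say how the launcher displays them.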

Pandoc requirement

Several tools call the external Pandoc executable. Ensure the Pandoc CLI is installed and visible on your PATH; the Python package named pandoc installed via pip is not sufficient.

Debian/Ubuntu: sudo apt-get install pandoc
Fedora: sudo dnf install pandoc
Arch: sudo pacman -S pandoc
Conda: conda install -c conda-forge pandoc
Binaries: https://github.com/jgm/pandoc/releases
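
To confirm that the Pandoc CLI, and not just the pip package, is available, a short check like the following can help. It is a minimal sketch using only the standard library; the function names `find_pandoc` and `pandoc_version` are illustrative and not part of this repository.

```python
import shutil
import subprocess
from typing import Optional


def find_pandoc() -> Optional[str]:
    """Return the full path of the pandoc executable, or None if it is not on PATH."""
    return shutil.which("pandoc")


def pandoc_version(executable: str) -> str:
    """Return the first line printed by `pandoc --version`."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()[0]


if __name__ == "__main__":
    path = find_pandoc()
    if path is None:
        print("pandoc not found on PATH; install the CLI, not just the pip package")
    else:
        print(pandoc_version(path))
```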
