markdown_forge is a publication "forge" that reshapes opaque EPUB and PDF sources into Markdown that is ready for AI/ML, search, and discovery tools, along with clean publication exports of the same content.
- Discovery-focused: Strip away EPUB/PDF quirks so OS indexers, LLM pipelines, and search tooling can surface content that would otherwise stay buried inside binary containers.
- Canonical touchstone: Normalize publications into a single Markdown file with enriched front matter (`title`, `title_short`, `author`, `publisher`, `year`, identifiers, etc.) that serves as the source of truth for downstream automation.
- Repeatable derivatives: Regenerate clean EPUBs and fully self-contained HTML (embedded Base64 imagery, inline CSS) directly from that canonical Markdown.
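As an illustration of how downstream automation might read the canonical file, here is a minimal sketch that splits front matter from the body. It assumes YAML-style front matter delimited by `---` lines with flat `key: value` pairs; the function name `split_front_matter` and the sample values are hypothetical, and the actual tools may parse front matter differently (for example, with a full YAML library).

```python
def split_front_matter(text):
    """Return (metadata dict, body) for Markdown with '---'-delimited front matter.

    Assumption: flat `key: value` pairs only; nested YAML is not handled.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter present
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[i + 1:])
            return meta, body
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # opening fence without a closing one: treat as plain body


# Hypothetical sample document for illustration.
sample = """---
title: "An Example Book: A Subtitle"
title_short: An Example Book
author: Jane Doe
year: 2001
---
## Chapter 1
Text.
"""
meta, body = split_front_matter(sample)
```

All values come back as strings in this sketch, so a field such as `year` would need explicit conversion before numeric use.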
- Drop raw source files into `IN/`.
- Run `python tools/convert_IN_preprocess.py` to inspect file types, route EPUBs and PDFs through the matching conversion/cleanup flow, and populate publication workspaces. Generally, run either `epub_to_markdown` or `pdf_to_markdown`, then run `epub_markdown_cleanup` or `pdf_markdown_cleanup` to clean up the generated Markdown as needed. You will very likely need to confirm the front matter manually and do some regex cleanup.
- Iterate on the generated Markdown within each publication directory until the content and metadata look correct.
- The intent is that the core `.md` files always stay and can be continually improved; use the publishing tools to regenerate EPUB/HTML versions as needed.
- When you are happy with the results, each publication will have its own folder containing the core `.md` file, images in an `images` directory, a self-contained `.html` file with embedded images (for single-file viewing and discovery), and a clean `.epub` version. You can then run `publication_cleanup` to move the `.md` and its exports into the `OUT` folder for further archiving, dataset collection, or reading.
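The routing step in the workflow above depends on knowing what each input really is, regardless of its extension. The sketch below shows one way to make that decision from file content alone. It is not how `tools/filetype_inspect.py` works (that tool relies on `file`, `exiftool`, and `ffprobe`), and the function name `sniff_kind` is hypothetical.

```python
import zipfile
from pathlib import Path


def sniff_kind(path):
    """Return 'pdf', 'epub', or 'unknown' based on file content, not extension."""
    path = Path(path)
    with path.open("rb") as fh:
        head = fh.read(8)
    # Well-formed PDF files begin with the "%PDF-" header.
    if head.startswith(b"%PDF-"):
        return "pdf"
    # An EPUB is a ZIP archive whose "mimetype" entry reads application/epub+zip.
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            if "mimetype" in zf.namelist():
                if zf.read("mimetype").strip() == b"application/epub+zip":
                    return "epub"
    return "unknown"
```

A caller could use the returned label to pick between the EPUB and PDF conversion flows and to flag anything labelled `unknown` for manual inspection.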
- `tools/filetype_inspect.py`: Probes files with `file`, `exiftool`, and `ffprobe`, suggests canonical extensions, and updates metadata so inputs are well-typed before conversion.
- `tools/epub_folderize.py` → `tools/epub_to_markdown.py` → `tools/epub_markdown_cleanup.py`: Unpack EPUBs, extract Markdown/images, strip custom CSS, collapse Pandoc/Calibre artifacts, and leave a clean publication folder anchored by Markdown front matter.
- `tools/markdown_to_self_contained_html.py`: Produces single-file HTML (embedded assets, inline CSS) from the canonical Markdown for maximum portability.
- `tools/markdown_to_epub.py`: Rebuilds EPUB containers from the same Markdown touchstone once cleanup is complete.
- `tools/publication_cleanup.py`: Renames publications and derivative assets based on front matter so `OUT/` stays organized.
- `tools/toc_rebuilder.py`: Reconstructs the `## TABLE OF CONTENTS` section by linking every H2 heading in a Markdown file.
- PDF track: `tools/pdf_to_markdown.py`, `tools/pdf_markdown_cleanup.py`, and `tools/acrobat-html_to_markdown.py` offer multiple paths, from direct PDF extraction to Acrobat HTML exports, for turning printed layouts into workable Markdown.
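To make the `toc_rebuilder` idea concrete, here is a minimal sketch of building a linked table of contents from H2 headings. It is not the tool's actual code: the function names and the GitHub-style anchor rule (lowercase, punctuation dropped, spaces turned into hyphens) are assumptions, and the sketch ignores fenced code blocks and duplicate headings.

```python
import re


def slugify(heading):
    """Approximate a GitHub-style anchor: lowercase, drop punctuation, hyphenate spaces."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\s-]", "", slug)
    return re.sub(r"\s", "-", slug)


def build_toc(markdown, toc_title="TABLE OF CONTENTS"):
    """Return TOC bullet lines linking every H2 heading except the TOC heading itself."""
    entries = []
    for line in markdown.splitlines():
        # Exactly two '#' followed by whitespace; '###' and deeper do not match.
        match = re.match(r"^##\s+(.+?)\s*$", line)
        if match and match.group(1) != toc_title:
            title = match.group(1)
            entries.append(f"- [{title}](#{slugify(title)})")
    return entries


doc = "# Book\n\n## TABLE OF CONTENTS\n\n## Chapter 1: Start\n\n### Detail\n\n## The End\n"
toc = build_toc(doc)
# toc == ["- [Chapter 1: Start](#chapter-1-start)", "- [The End](#the-end)"]
```

Whether the generated anchors resolve depends on the renderer's own slug rules, so the slug function is the part most likely to need adjusting.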
- Hybrid strategy: Start with `tools/pdf_to_markdown.py` for direct extraction; if layout noise persists, export the PDF through Adobe Acrobat to HTML and run `tools/acrobat-html_to_markdown.py` for a cleaner baseline.
- Prepare the source: Crop print-oriented PDFs to remove headers/footers before conversion to avoid repeated page numbers or chapter labels bleeding into the text.
- Fallback via images: When text extraction proves unreliable, leverage the per-page images that `tools/pdf_to_markdown.py` emits under `source_pdf/extracted/`. Re-compose them into a PDF and run OCR to recover faithful text without Acrobat’s typical spacing and glyph anomalies.
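One way to tell whether a PDF needs cropping before conversion is to check the extracted text for lines that repeat across pages. The sketch below is a text-level diagnostic only; it is not the cropping step itself and is not part of the toolchain as described. The function name, the page representation (one list of lines per page), and the 60% threshold are all assumptions. Digits are masked so that running page numbers count as the same line.

```python
import re
from collections import Counter


def repeated_edge_lines(pages, edge=2, min_share=0.6):
    """Find lines recurring at the top/bottom of pages (likely headers/footers).

    pages: list of pages, each a list of text lines.
    edge: how many non-empty lines at each end of a page to examine.
    min_share: fraction of pages a line must appear on to be reported.
    """
    counts = Counter()
    for lines in pages:
        stripped = [ln.strip() for ln in lines if ln.strip()]
        candidates = stripped[:edge] + stripped[-edge:]
        # Mask digits so "Page 3" and "Page 4" collapse to one pattern.
        # A set is used so each pattern counts at most once per page.
        normalized = {re.sub(r"\d+", "#", ln) for ln in candidates}
        counts.update(normalized)
    threshold = max(2, int(len(pages) * min_share))
    return sorted(ln for ln, n in counts.items() if n >= threshold)


# Hypothetical extracted pages for illustration.
pages = [
    ["My Book Title", "Chapter text on page one.", "More text.", "Page 1"],
    ["My Book Title", "Different text here.", "Even more.", "Page 2"],
    ["My Book Title", "Closing paragraph.", "The end.", "Page 3"],
]
flagged = repeated_edge_lines(pages)
# flagged == ["My Book Title", "Page #"]
```

If the list comes back non-empty, that is a hint the source would benefit from cropping before another extraction pass.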
- `tools/`: Command-line utilities that implement the ingestion, cleanup, and publishing pipeline.
- `IN/` / `OUT/`: Working directories kept in version control via `.gitkeep` but emptied by default. Contents are ignored so personal source material never leaks into public history.
- `design.md`: Design notes and process documentation.
- `requirements.txt`: Minimum Python package requirements for the toolchain.
- Install dependencies: `python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt`
- Process inputs: Place EPUB/PDF files inside `IN/` and run the orchestrator or individual tools as needed.
- Publish outputs: Use the Markdown touchstone plus `tools/markdown_to_self_contained_html.py` and `tools/markdown_to_epub.py` to forge portable deliverables ready for AI/ML ingestion or general distribution.
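The "self-contained" part of the HTML export comes down to replacing image references with Base64 data URIs. A minimal sketch of that idea follows. It is not the code in `tools/markdown_to_self_contained_html.py`; the function names and the regex-based `<img src>` rewrite are assumptions, and a more robust implementation would likely use an HTML parser.

```python
import base64
import mimetypes
import re
from pathlib import Path


def to_data_uri(image_path):
    """Encode an image file as a data: URI suitable for an <img src>."""
    mime, _ = mimetypes.guess_type(str(image_path))
    payload = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{payload}"


def embed_images(html, base_dir):
    """Replace local <img src="..."> references with inline data URIs."""
    def swap(match):
        src = match.group(2)
        target = Path(base_dir) / src
        if src.startswith(("http:", "https:", "data:")) or not target.is_file():
            return match.group(0)  # leave remote, already-inline, or missing images alone
        return f"{match.group(1)}{to_data_uri(target)}{match.group(3)}"

    # Only double-quoted src attributes are handled in this sketch.
    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', swap, html)
```

Calling `embed_images(html, publication_dir)` on rendered HTML would return markup that no longer depends on the `images` directory, at the cost of a larger file.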
If you don't want to invoke the tools manually, use the lightweight two-panel TUI to work with your files.
- Launch: From the repo root, run `python tools/tui_launcher.py`
- Panels:
  - Left: file browser rooted at the current directory
  - Right: tool list and a live log area
- File browser conventions:
  - Files are prefixed with tags for quick recognition: `[PDF]`, `[EPUB]`, `[MD]`
  - Directories show a trailing slash, e.g., `MyFolder/`
  - The currently selected item is bolded
- Keybindings:
  - `enter`: select the highlighted file and focus the tool list
  - `tab`: switch focus between file browser and tool list
  - `r`: refresh the file browser (preserves panel order)
  - `o`: open selected file (PDF/EPUB via system viewer; MD tries `nvim` in a new terminal window, else system opener)
  - `e`: execute the selected tool on the selected file
  - `q`: quit
- Tool execution UX:
  - The right log shows the exact command invoked, streams the tool output live, and reports completion status with elapsed time
  - On successful completion, the file tree auto-refreshes
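The execution behaviour described above (echo the command, stream output, report status and elapsed time) can be sketched with the standard library alone. This is an illustration, not the code in `tools/tui_launcher.py`; the function name `run_and_stream` and the `log` callback are hypothetical stand-ins for the TUI's log panel.

```python
import shlex
import subprocess
import sys
import time


def run_and_stream(cmd, log=print):
    """Run cmd (a list of arguments), forwarding each output line to log as it arrives."""
    log("$ " + shlex.join(cmd))  # show the exact command first
    started = time.monotonic()
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered on our side of the pipe
    )
    for line in proc.stdout:
        log(line.rstrip("\n"))
    code = proc.wait()
    elapsed = time.monotonic() - started
    status = "OK" if code == 0 else f"FAILED (exit {code})"
    log(f"{status} in {elapsed:.1f}s")
    return code


if __name__ == "__main__":
    run_and_stream([sys.executable, "-c", "print('hello from a tool')"])
```

Returning the exit code lets the caller decide what to do next, such as refreshing a file listing only when the code is `0`.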
Several tools call the external Pandoc executable. Ensure the Pandoc CLI is installed and visible on your PATH; the Python package named pandoc installed via pip is not sufficient.
- Debian/Ubuntu: `sudo apt-get install pandoc`
- Fedora: `sudo dnf install pandoc`
- Arch: `sudo pacman -S pandoc`
- Conda: `conda install -c conda-forge pandoc`
- Binaries: https://github.com/jgm/pandoc/releases
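A quick way to confirm that the executable, and not just a pip package, is available is to look it up on `PATH` from Python. This is a small sketch; the function name `find_pandoc` is hypothetical and is not claimed to exist in the toolchain.

```python
import shutil
import subprocess


def find_pandoc():
    """Return the first line of `pandoc --version`, or None if the CLI is not on PATH."""
    exe = shutil.which("pandoc")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


version = find_pandoc()
print(version or "pandoc CLI not found on PATH; install it with one of the commands above")
```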