GiantRavens/markdown_forge

markdown_forge

markdown_forge is a publication "forge" that reshapes opaque EPUB and PDF sources into Markdown ready for AI/ML, search, and discovery tools, along with clean publication exports of the same content.

Purpose

  • Discovery-focused: Strip away EPUB/PDF quirks so OS indexers, LLM pipelines, and search tooling can surface content that would otherwise stay buried inside binary containers.
  • Canonical touchstone: Normalize publications into a single Markdown file with enriched front matter (title, title_short, author, publisher, year, identifiers, etc.) that serves as the source of truth for downstream automation.
  • Repeatable derivatives: Regenerate clean EPUBs and fully self-contained HTML (embedded Base64 imagery, inline CSS) directly from that canonical Markdown.
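To make the "canonical touchstone" idea concrete, here is a minimal sketch of reading front matter back out of a publication's Markdown file. It assumes the front matter is a block of simple `key: value` lines between two `---` delimiter lines at the top of the file; the README does not state the exact syntax, so treat that layout, the helper name `split_front_matter`, and the sample values as illustrative assumptions.

```python
def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown document into (front matter fields, body).

    Assumption: front matter is a run of simple `key: value` lines
    enclosed by `---` lines at the very top of the document.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    fields: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the body.
            return fields, "\n".join(lines[index + 1:])
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    # No closing delimiter found: treat the whole text as body.
    return {}, text


# Hypothetical sample document, not taken from the repository.
sample = """---
title: "An Example Publication: A Subtitle"
title_short: An Example Publication
author: Jane Doe
year: 2001
---
## Chapter One
Body text.
"""

fields, body = split_front_matter(sample)
print(fields["title_short"], fields["year"])
```

A real pipeline would more likely hand the block to a YAML parser; the hand-rolled split above only keeps the sketch dependency-free.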

Core workflow

  1. Drop raw source files into IN/.
  2. Run python tools/convert_IN_preprocess.py to inspect file types, route EPUBs and PDFs through the matching conversion and cleanup flow, and populate publication workspaces. In general, run either epub_to_markdown or pdf_to_markdown, then run epub_markdown_cleanup or pdf_markdown_cleanup to clean up the generated Markdown as needed. You will very likely need to confirm the front matter manually and do some regex cleanup.
  3. Iterate on the generated Markdown within each publication directory until the content and metadata look correct.
  4. The core .md files are meant to persist and be improved continually; use the publishing tools to regenerate EPUB/HTML versions as needed.
  5. When you are happy with the results, each publication has its own folder containing the core .md file, images in an images directory, a self-contained .html file with embedded images (for single-file viewing and discovery), and a clean .epub version. You can then run publication_cleanup to move the .md and its derivatives into the OUT folder for further archiving, dataset collection, or reading.
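The routing idea in step 2 can be sketched as follows. This is not the logic of tools/convert_IN_preprocess.py, which is not shown here; it is a stand-in that classifies files by content rather than extension, relying on two well-known signatures: PDF files begin with `%PDF-`, and an EPUB is a ZIP archive whose `mimetype` entry reads `application/epub+zip`. The function names `sniff_kind` and `plan_routes` are hypothetical.

```python
import zipfile
from pathlib import Path


def sniff_kind(path: Path) -> str:
    """Return "pdf", "epub", or "unknown" based on the file's contents."""
    with path.open("rb") as handle:
        head = handle.read(5)
    if head == b"%PDF-":
        return "pdf"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            try:
                mimetype = archive.read("mimetype").strip()
            except KeyError:
                # A ZIP without a mimetype entry is not a valid EPUB.
                return "unknown"
        if mimetype == b"application/epub+zip":
            return "epub"
    return "unknown"


def plan_routes(in_dir: Path) -> dict[str, list[Path]]:
    """Group the files in in_dir by detected kind, skipping .gitkeep."""
    routes: dict[str, list[Path]] = {"pdf": [], "epub": [], "unknown": []}
    for path in sorted(in_dir.iterdir()):
        if path.is_file() and path.name != ".gitkeep":
            routes[sniff_kind(path)].append(path)
    return routes
```

Calling `plan_routes(Path("IN"))` from the repo root would return the PDF, EPUB, and unrecognized files as three lists, which an orchestrator could then hand to the matching conversion tool.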

Tool highlights

  • tools/filetype_inspect.py: Probes files with file, exiftool, and ffprobe, suggests canonical extensions, and updates metadata so inputs are well-typed before conversion.
  • tools/epub_folderize.py, tools/epub_to_markdown.py, tools/epub_markdown_cleanup.py: Unpack EPUBs, extract Markdown/images, strip custom CSS, collapse Pandoc/Calibre artifacts, and leave a clean publication folder anchored by Markdown front matter.
  • tools/markdown_to_self_contained_html.py: Produces single-file HTML (embedded assets, inline CSS) from the canonical Markdown for maximum portability.
  • tools/markdown_to_epub.py: Rebuilds EPUB containers from the same Markdown touchstone once cleanup is complete.
  • tools/publication_cleanup.py: Renames publications and derivative assets based on front matter so OUT/ stays organized.
  • tools/toc_rebuilder.py: Reconstructs the ## TABLE OF CONTENTS section by linking every H2 heading in a Markdown file.
  • PDF track: tools/pdf_to_markdown.py, tools/pdf_markdown_cleanup.py, and tools/acrobat-html_to_markdown.py offer multiple paths, from direct PDF extraction to Acrobat HTML exports, for turning printed layouts into workable Markdown.
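As an illustration of what rebuilding a table of contents from H2 headings involves (the job described for tools/toc_rebuilder.py above), here is a minimal sketch. It is not the tool's actual code: the anchor scheme in `slugify`, the bullet format, and the function names are assumptions chosen for the example.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly so this example stays fence-safe


def slugify(heading: str) -> str:
    """Hypothetical anchor scheme: lowercase, drop punctuation, hyphenate whitespace."""
    slug = re.sub(r"[^\w\s-]", "", heading.lower())
    return re.sub(r"\s+", "-", slug.strip())


def build_toc(markdown: str) -> str:
    """Return a TABLE OF CONTENTS section linking every H2 heading."""
    entries = []
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # ignore headings inside fenced code
            continue
        if in_fence:
            continue
        match = re.match(r"^## (.+?)\s*$", line)  # H2 only; "### " does not match
        if match and match.group(1).upper() != "TABLE OF CONTENTS":
            title = match.group(1)
            entries.append(f"- [{title}](#{slugify(title)})")
    return "## TABLE OF CONTENTS\n\n" + "\n".join(entries)
```

The existing TOC heading is skipped so that the rebuilt section never links to itself, and lines inside fenced code are ignored so that sample Markdown in a code block does not produce entries.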

Working with PDFs

  • Hybrid strategy: Start with tools/pdf_to_markdown.py for direct extraction; if layout noise persists, export the PDF through Adobe Acrobat to HTML and run tools/acrobat-html_to_markdown.py for a cleaner baseline.
  • Prepare the source: Crop print-oriented PDFs to remove headers/footers before conversion to avoid repeated page numbers or chapter labels bleeding into the text.
  • Fallback via images: When text extraction proves unreliable, leverage the per-page images that tools/pdf_to_markdown.py emits under source_pdf/extracted/. Re-compose them into a PDF and run OCR to recover faithful text without Acrobat’s typical spacing and glyph anomalies.

Repository layout

  • tools/: Command-line utilities that implement the ingestion, cleanup, and publishing pipeline.
  • IN/ / OUT/: Working directories kept in version control via .gitkeep but emptied by default. Contents are ignored so personal source material never leaks into public history.
  • design.md: Design notes and process documentation.
  • requirements.txt: Minimum Python package requirements for the toolchain.

Getting started

  • Install dependencies: python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
  • Process inputs: Place EPUB/PDF files inside IN/ and run the orchestrator or individual tools as needed.
  • Publish outputs: Use the Markdown touchstone plus tools/markdown_to_self_contained_html.py and tools/markdown_to_epub.py to forge portable deliverables ready for AI/ML ingestion or general distribution.
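The "self-contained HTML" deliverable depends on embedding images as Base64 data URIs. A minimal sketch of that step, assuming the HTML references local images with double-quoted `src` attributes (the function name `inline_images` and that assumption are illustrative, not taken from tools/markdown_to_self_contained_html.py):

```python
import base64
import mimetypes
import re
from pathlib import Path


def inline_images(html: str, base_dir: Path) -> str:
    """Replace local <img src="..."> paths with Base64 data URIs."""

    def to_data_uri(match: re.Match) -> str:
        src = match.group(2)
        # Leave already-embedded and remote images alone.
        if src.startswith(("data:", "http://", "https://")):
            return match.group(0)
        image_path = base_dir / src
        if not image_path.is_file():
            return match.group(0)  # leave unresolved references untouched
        mime = mimetypes.guess_type(image_path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(image_path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    # Groups: (tag up to the opening quote)(src value)(closing quote)
    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', to_data_uri, html)
```

A regular expression is enough for a sketch; a production tool would more likely use an HTML parser so that single-quoted or unquoted attributes are handled as well.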

TUI launcher

If you don't want to invoke the tools manually, use the lightweight two-panel TUI to work with your files.

  • Launch: From repo root, run:
python tools/tui_launcher.py
  • Panels:

    • Left: file browser rooted at the current directory
    • Right: tool list and a live log area
  • File browser conventions:

    • Files are prefixed with tags for quick recognition: [PDF], [EPUB], [MD]
    • Directories show a trailing slash, e.g., MyFolder/
    • The currently selected item is bolded
  • Keybindings:

    • enter: select the highlighted file and focus the tool list
    • tab: switch focus between file browser and tool list
    • r: refresh the file browser (preserves panel order)
    • o: open selected file (PDF/EPUB via system viewer; MD tries nvim in a new terminal window, else system opener)
    • e: execute the selected tool on the selected file
    • q: quit
  • Tool execution UX:

    • The right log shows the exact command invoked, streams the tool output live, and reports completion status with elapsed time
    • On successful completion, the file tree auto-refreshes

Pandoc requirement

Several tools call the external Pandoc executable. Ensure the Pandoc CLI is installed and visible on your PATH; the Python package named pandoc installed via pip is not sufficient.

  • Debian/Ubuntu: sudo apt-get install pandoc
  • Fedora: sudo dnf install pandoc
  • Arch: sudo pacman -S pandoc
  • Conda: conda install -c conda-forge pandoc
  • Binaries: https://github.com/jgm/pandoc/releases
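A quick way to confirm that the Pandoc executable, rather than only the pip package, is available is to look it up on PATH. The sketch below uses only the standard library; the function name `pandoc_status` is made up for this example.

```python
import shutil
import subprocess


def pandoc_status() -> str:
    """Report whether the pandoc executable is reachable on PATH."""
    executable = shutil.which("pandoc")
    if executable is None:
        return "pandoc not found on PATH"
    # `pandoc --version` prints the version on its first output line.
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=False
    )
    first_line = result.stdout.splitlines()[0] if result.stdout else "unknown version"
    return f"{first_line} ({executable})"


print(pandoc_status())
```

If this reports that pandoc was not found even though `pip install pandoc` succeeded, the CLI itself still needs to be installed through one of the routes listed above.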
