Math in latex format in markdown files #1305
Replies: 1 comment
markitdown does not currently extract math as LaTeX; this is a known gap when working with scientific PDFs. Here's a breakdown of what works and what the alternatives are.

Why markitdown struggles with math:

Option 1: LLM integration (markitdown stays in the loop)

    from openai import OpenAI
    from markitdown import MarkItDown

    client = OpenAI()
    md = MarkItDown(
        llm_client=client,
        llm_model="gpt-4o",
    )
    # Convert PDF; images (including math figures) get described by the LLM
    result = md.convert("paper.pdf")
    print(result.text_content)

This works best for figure-based equations. For inline text math, the PDF text layer is still used as-is.

Option 2: Nougat (Meta, best for scientific PDFs)

    pip install nougat-ocr
    nougat paper.pdf -o output/
    # output/paper.mmd contains $$...$$ and $...$ LaTeX

You can then pass the

Option 3: MathPix API (cloud, highest accuracy)

Option 4: Recommended pipeline for keeping markitdown in the loop:

This gives you LaTeX math while still using markitdown's chunking and conversion infrastructure.
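
As a rough illustration of how the Nougat step from Option 2 might be wired into a script, here is a minimal sketch. It is not code from the original reply: the helper names (`nougat_command`, `expected_mmd_path`, `normalize_math_delimiters`, `pdf_to_latex_markdown`) are hypothetical. It also assumes that Nougat writes `<out_dir>/<pdf stem>.mmd` and that its math may be delimited with `\(...\)` and `\[...\]` as well as `$...$`, so the sketch rewrites the former into dollar delimiters.

```python
import re
import subprocess
from pathlib import Path


def nougat_command(pdf_path, out_dir):
    """Build the CLI call shown in Option 2: nougat <pdf> -o <out_dir>."""
    return ["nougat", str(pdf_path), "-o", str(out_dir)]


def expected_mmd_path(pdf_path, out_dir):
    """Assumption: Nougat names its output after the PDF, e.g. output/paper.mmd."""
    return Path(out_dir) / (Path(pdf_path).stem + ".mmd")


def normalize_math_delimiters(text):
    r"""Rewrite \[...\] as $$...$$ and \(...\) as $...$.

    Naive regex pass; text that already uses $ delimiters is left untouched.
    """
    text = re.sub(r"\\\[(.+?)\\\]", lambda m: "$$" + m.group(1) + "$$",
                  text, flags=re.DOTALL)
    text = re.sub(r"\\\((.+?)\\\)", lambda m: "$" + m.group(1) + "$",
                  text, flags=re.DOTALL)
    return text


def pdf_to_latex_markdown(pdf_path, out_dir="output", run=subprocess.run):
    """Run Nougat, then write a .md copy with normalized math delimiters."""
    run(nougat_command(pdf_path, out_dir), check=True)
    mmd_file = expected_mmd_path(pdf_path, out_dir)
    md_file = mmd_file.with_suffix(".md")
    cleaned = normalize_math_delimiters(mmd_file.read_text(encoding="utf-8"))
    md_file.write_text(cleaned, encoding="utf-8")
    return md_file
```

Calling `pdf_to_latex_markdown("paper.pdf")` would need the `nougat` CLI on the PATH. The resulting `.md` file could then be handed to `MarkItDown().convert(...)` like any other Markdown file, although whether markitdown leaves the math untouched in that step is an assumption worth checking on a sample document. The `run` parameter exists only so the subprocess call can be swapped out in tests.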
Is there a way to generate mathematics in proper LaTeX format extracted from PDFs or other types of files? I am open to using LLMs with it, but I want markitdown in the loop instead of just a raw LLM prompt with an attached PDF.