PDF Text Extraction Order Not Matching Visual Layout Despite Correct Coordinates #4124

Phalgun-Santhapuri · 2024-12-09T06:43:38Z


I am extracting text from a PDF using PyMuPDF, but the order of the extracted text does not match the visual (layout) flow of the PDF.

Details of the Issue:

The PDF's text is correctly positioned according to its coordinates (bounding boxes), but the logical extraction order is incorrect.
For example, on the first page of my PDF:

After extracting line 2, the tool directly jumps to a table at the bottom of the page, skipping intervening text.
Later, it picks up lines 3–20 in an unordered manner.
I have verified that the issue is not related to column or layout misalignment, as the coordinates are accurate.

The document contains multi-column layouts mixed with elements such as tables, so the overall layout is complex.
I take the bounding box information from the dict output and then apply further logic to it.
However, the text in the dict output, and in the output of the other available extraction options, already comes back in the incorrect order described above.
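For reference, the kind of post-processing described above can be sketched as follows. This is only an illustration, assuming the nested structure that `page.get_text("dict")` returns in PyMuPDF (blocks containing lines containing spans); the helper name `lines_with_bboxes` and the hand-built sample dictionary are hypothetical and not taken from the original post.

```python
def lines_with_bboxes(page_dict):
    """Flatten a PyMuPDF-style page dict into (bbox, text) pairs, one per line.

    The pairs are returned in the order in which they are stored, which is
    the order the PDF producer wrote them, not necessarily reading order.
    """
    result = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 is a text block; skip image blocks
            continue
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line.get("spans", []))
            result.append((tuple(line["bbox"]), text))
    return result


# Hand-built stand-in for page.get_text("dict"); a real page dict has more keys.
sample_page = {
    "blocks": [
        {"type": 0, "bbox": (72, 400, 500, 420), "lines": [
            {"bbox": (72, 400, 500, 420),
             "spans": [{"text": "Table "}, {"text": "row"}]},
        ]},
        {"type": 1, "bbox": (72, 50, 200, 90)},  # image block, ignored
        {"type": 0, "bbox": (72, 100, 500, 120), "lines": [
            {"bbox": (72, 100, 500, 120), "spans": [{"text": "Line 1"}]},
        ]},
    ]
}

extracted = lines_with_bboxes(sample_page)
```

Here the table row is listed before "Line 1" even though its coordinates place it lower on the page, which is the same mismatch between stored order and visual order reported in this post.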

JorjMcKie · 2024-12-09T07:32:54Z

Maintainer

Everything you describe is normal for PDFs: in extreme cases, every single character may appear in arbitrary sequence when extracted. A "natural" reading sequence can only be established by explicitly sorting the output.

So we need an example page before we can say anything else.
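As a minimal sketch of what "explicitly sorting" can mean, the snippet below reorders text blocks by their top-left corner. It assumes block tuples shaped like those returned by `page.get_text("blocks")`, that is `(x0, y0, x1, y1, text, block_no, block_type)`; the function name `sort_top_left` and the sample coordinates are made up for illustration. PyMuPDF also offers a `sort=True` argument on `page.get_text()` that orders output by vertical, then horizontal position.

```python
def sort_top_left(blocks):
    """Return blocks ordered by vertical position first, then horizontal.

    Each block is expected to look like (x0, y0, x1, y1, text, ...).
    """
    return sorted(blocks, key=lambda b: (b[1], b[0]))


# Hypothetical blocks in the order a PDF might store them:
# two body lines, then a table that sits at the bottom of the page,
# then a third body line that visually belongs above the table.
stored = [
    (72, 100, 500, 120, "Line 1", 0, 0),
    (72, 130, 500, 150, "Line 2", 1, 0),
    (72, 600, 500, 700, "Table", 2, 0),
    (72, 160, 500, 180, "Line 3", 3, 0),
]

sorted_texts = [b[4] for b in sort_top_left(stored)]
print(sorted_texts)
```

A sort like this works for single-column pages. On multi-column pages it interleaves the columns, because blocks from different columns share similar y-values, so it is not a complete answer to the layout described in the question.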

6 replies

JorjMcKie Dec 10, 2024
Maintainer

This is probably for you:

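import pymupdf, pymupdf4llm, pathlib

doc = pymupdf.open("sample.pdf")
# margins=0: consider the full page area, including header and footer regions
md = pymupdf4llm.to_markdown(doc, margins=0)
# write the Markdown next to the source file as UTF-8
pathlib.Path(doc.name + ".md").write_text(md, encoding="utf-8")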

Produces this
sample.pdf.md

Phalgun-Santhapuri Dec 10, 2024
Author

@JorjMcKie, this solution works for the current issue. However, when text is closely spaced, whether in a simple layout or a column layout, the text with the smaller y-value is always output first, which does not give the correct reading order. In addition, for PDFs with a column layout, some text is missing from the extracted page content even though the documents are readable (not scanned).

Phalgun-Santhapuri Dec 26, 2024
Author

Hi @JorjMcKie ,
I wanted to follow up on my previous message regarding the issue with text extraction. I’m still awaiting your thoughts or suggestions on addressing the challenge.
Let me know if additional details are needed from my end to help with troubleshooting. Looking forward to your response!

JorjMcKie Dec 26, 2024
Maintainer

There are a few situations where the latest pymupdf4llm version, 0.0.17, introduces problems that were handled better in version 0.0.16.
Try installing the previous version: pip install pymupdf4llm==0.0.16.

Phalgun-Santhapuri Dec 26, 2024
Author

Thank you for the response @JorjMcKie ,

I used the earlier version and noticed that in cases where there is a column layout followed by a table, the extraction method prioritizes extracting the column closest to the top border first, followed by the table content below the column layout, and then moves to the next column. However, the correct extraction process should first capture the entire column layout text before extracting the table content.

So this version (0.0.16) also fails to handle complex layouts in the PDF effectively.
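The ordering asked for here (all columns first, then the table below them) could be approximated along the following lines. This is a rough sketch under stated assumptions, not how pymupdf4llm works: it assumes a two-column page, treats any block wider than a chosen fraction of the page width as a full-width element such as a table, and uses hypothetical names (`column_aware_order`, `span_ratio`) and invented coordinates.

```python
def column_aware_order(blocks, page_width, span_ratio=0.6):
    """Order (x0, y0, x1, y1, text) blocks column by column.

    Full-width blocks split the page into horizontal bands. Inside each
    band the left column is read top to bottom, then the right column,
    and only then the full-width block that closes the band.
    """
    def is_wide(b):
        return (b[2] - b[0]) > span_ratio * page_width

    wide = sorted((b for b in blocks if is_wide(b)), key=lambda b: b[1])
    narrow = [b for b in blocks if not is_wide(b)]
    middle = page_width / 2

    ordered = []
    band_top = float("-inf")
    for boundary in wide + [None]:  # None closes the last band
        band_bottom = boundary[1] if boundary else float("inf")
        band = [b for b in narrow if band_top <= b[1] < band_bottom]
        # column index from the block's horizontal center, then top to bottom
        band.sort(key=lambda b: (0 if (b[0] + b[2]) / 2 < middle else 1, b[1]))
        ordered.extend(band)
        if boundary:
            ordered.append(boundary)
            band_top = boundary[1]
    return ordered


# Invented example: two columns of two blocks each, followed by a table.
page_blocks = [
    (50, 100, 280, 200, "L1"),
    (320, 100, 550, 200, "R1"),
    (50, 320, 550, 500, "TABLE"),
    (50, 210, 280, 300, "L2"),
    (320, 210, 550, 300, "R2"),
]

reading_order = [b[4] for b in column_aware_order(page_blocks, page_width=600)]
print(reading_order)
```

The width threshold and the fixed two-column split are simplifications; pages with three or more columns, or with tables narrower than the threshold, would need a different way of detecting columns.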
