PDF Text Extraction Order Not Matching Visual Layout Despite Correct Coordinates #4124
Unanswered
Phalgun-Santhapuri
asked this question in
Looking for help
Replies: 1 comment 6 replies
-
Everything you describe is normal for PDFs: in extreme cases, every single character may appear in an arbitrary sequence when extracted. A "natural" reading sequence can only be established by explicitly sorting the output. So we need an example page before we can say anything else.
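To illustrate what explicit sorting can look like, here is a minimal sketch, not a definitive solution. It assumes the block tuples have the shape `(x0, y0, x1, y1, text, block_no, block_type)` as returned by `page.get_text("blocks")`. The helper names `sort_blocks_reading_order` and `extract_sorted_text` and the `y_tolerance` value are made up for this example. It orders blocks top to bottom, then left to right, which is only adequate for single-column pages.

```python
def sort_blocks_reading_order(blocks, y_tolerance=3.0):
    """Sort block tuples (x0, y0, x1, y1, text, ...) top-to-bottom, then left-to-right.

    Blocks whose top edge lies within y_tolerance of the first block in the
    current row are treated as belonging to that row.
    """
    ordered = sorted(blocks, key=lambda b: (b[1], b[0]))
    rows = []
    for block in ordered:
        if rows and abs(block[1] - rows[-1][0][1]) <= y_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    # Within each row, read from left to right.
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]


def extract_sorted_text(pdf_path, page_number=0):
    """Hypothetical helper: read one page and return its text blocks in sorted order."""
    try:
        import pymupdf  # newer PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # older PyMuPDF releases
    doc = pymupdf.open(pdf_path)
    try:
        blocks = doc[page_number].get_text("blocks")
    finally:
        doc.close()
    # Assumption: block_type 0 marks text blocks (image blocks are skipped).
    text_blocks = [b for b in blocks if b[6] == 0]
    return "\n".join(b[4] for b in sort_blocks_reading_order(text_blocks))
```

For simple pages, `page.get_text("text", sort=True)` may already be enough, since the `sort` argument asks PyMuPDF to order the output by vertical and then horizontal position.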
-
I am working on extracting text from a PDF using PyMuPDF. However, the order of the extracted text does not match the visual layout flow of the PDF.
Details of the Issue:
The PDF's text is correctly positioned according to its coordinates (bounding boxes), but the logical extraction order is incorrect.
For example, on the first page of my PDF:
After extracting line 2, the tool directly jumps to a table at the bottom of the page, skipping intervening text.
Later, it picks up lines 3–20 in an unordered manner.
I have verified that the issue is not related to column or layout misalignment, as the coordinates are accurate.
The document contains multi-column layouts and mixed elements such as tables, so the overall layout is complex.
I get the bounding box information from the dict output and then apply my own logic to it.
However, I observed that the text in the dict output, and in every other available extraction option, already comes back in the incorrect order described above.
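Since the bounding boxes are reported correctly, one possible approach is to reorder the blocks by column before applying further logic. The following is a rough sketch under stated assumptions, not PyMuPDF's own method. It assumes each block is a dict with a `"bbox"` entry `(x0, y0, x1, y1)`, as in `page.get_text("dict")["blocks"]`, and the function name `order_by_columns` is made up for this example. A block that spans the full page width (a title or a wide table) would merge all columns into one, so such blocks need separate handling.

```python
def order_by_columns(blocks):
    """Group dict-style blocks into columns by horizontal overlap,
    then read each column top to bottom, columns left to right.

    Assumption: each block has a "bbox" entry of the form (x0, y0, x1, y1).
    """
    columns = []  # each entry: {"x1": right edge so far, "blocks": [...]}
    for block in sorted(blocks, key=lambda b: b["bbox"][0]):
        x0, _, x1, _ = block["bbox"]
        if columns and x0 < columns[-1]["x1"]:
            # Starts before the current column's right edge: same column.
            columns[-1]["blocks"].append(block)
            columns[-1]["x1"] = max(columns[-1]["x1"], x1)
        else:
            columns.append({"x1": x1, "blocks": [block]})

    ordered = []
    for column in columns:
        ordered.extend(sorted(column["blocks"], key=lambda b: b["bbox"][1]))
    return ordered
```

Tables and other full-width elements could be filtered out first by comparing their width to the page width, then placed back by vertical position. That step is left out here to keep the sketch short.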