Now that the glyph positioning bugs in your sample PDF were fixed (in #403), I took another look at the page you have a screenshot of here.
I looked at the glyph positions that pdf-reader calculated from a few words that are clearly wrong in the text extraction: "Usage and Purchase Charges".
I'm now very confident that the glyph positions are being extracted more or less accurately. There may be some very minor issues around kerning and spacing, but those will only throw the positions off by a point or two. I'm also confident all the characters are being extracted.
I think the real issue here is the naive algorithm in PageLayout, which is responsible for arranging the extracted text onto a plain text "page". By hand tuning a few lines in PageLayout, I can get your page extracting a bit better:
Of course, that tuning throws off the layout of other documents.
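To illustrate the kind of trade-off involved, here is a minimal sketch of a naive grid layout. It is not the actual PageLayout code: `Run`, `naive_layout`, and all of the numbers are made up for the example. It assumes each text run has an x/y position in points and is dropped onto a fixed character grid by simple scaling, with later runs overwriting earlier ones.

```ruby
# Hypothetical sketch, NOT pdf-reader's real PageLayout implementation.
# A run is a piece of text with the position (in points) of its first glyph.
Run = Struct.new(:x, :y, :text)

# Scale each run's position to a column/row on a fixed character grid and
# write its characters there. Runs that land on occupied cells overwrite them.
def naive_layout(runs, page_width:, page_height:, cols:, rows:)
  grid = Array.new(rows) { " " * cols }
  runs.each do |run|
    col = (run.x * cols).fdiv(page_width).floor
    col = [col, 0].max
    # PDF y grows upward, so flip it to get a top-down row index.
    row = ((page_height - run.y) * rows).fdiv(page_height).floor
    row = row.clamp(0, rows - 1)
    run.text.each_char.with_index do |ch, i|
      break if col + i >= cols # drop anything past the right edge
      grid[row][col + i] = ch
    end
  end
  grid.map(&:rstrip).join("\n")
end

# Two words 30 points apart on the same line of a 600pt-wide page.
runs = [Run.new(100, 700, "Usage"), Run.new(130, 700, "and")]

# 60 columns means 10pt per column: "and" lands inside "Usage".
coarse = naive_layout(runs, page_width: 600, page_height: 800, cols: 60, rows: 1)
# 120 columns means 5pt per column: the words no longer overlap.
fine = naive_layout(runs, page_width: 600, page_height: 800, cols: 120, rows: 1)

puts coarse.strip # Usaand
puts fine.strip   # Usage and
```

In this toy version, the column count that fixes a page set in small type would spread a page set in large type too far apart, which is roughly the problem with tuning constants by hand for a single document.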
There are a few similar issues (#371, #362, #118). I'll continue to mull over what a better algorithm might look like. Thanks for a great bug report.
Some PDFs cause text collisions and overlapping.
This is a follow-up to my bug report in #397 (the same sample PDF is causing these issues).
As with the previous issue, I'm happy to help where I can.