Text collision/overlapping issues #406

JayNewstrom · 2021-12-13T22:39:39Z

Some PDFs cause text collisions and overlapping.

This is a follow-up to my bug report in #397; the same sample PDF is the one causing these issues.

As with the previous issue, I'm happy to help where I can.

yob · 2021-12-20T13:02:26Z

Now that the glyph positioning bugs in your sample PDF were fixed (in #403), I took another look at the page you have a screenshot of here.

I looked at the glyph positions that pdf-reader calculated for a few words that are clearly wrong in the text extraction: "Usage and Purchase Charges".

I'm now very confident that the glyph positions are being extracted more or less accurately. There may be some very minor issues around kerning and spacing, but those will only throw the positions off by a point or two. I'm also confident all the characters are being extracted.

I think the real issue here is the naive algorithm in PageLayout, which is responsible for arranging the extracted text onto a plain text "page". By hand-tuning a few lines in PageLayout, I can get your page to extract a bit better.

Of course, that throws off the layout of other documents.
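
To illustrate the kind of failure I mean, here is a minimal sketch of how mapping point coordinates onto a fixed character grid can make words collide. This is not the actual PageLayout code: `Run`, `naive_row`, `non_overlapping_row`, the 6pt character width, and the x positions are all made up for the example. The assumption is that the words are set in a font narrower than the width the grid assumes, so each word needs more columns than the grid gives it.

```ruby
# Hypothetical text run: x/y position in PDF points plus the string drawn there.
Run = Struct.new(:x, :y, :text)

# Naive layout: turn each run's x position into a column by dividing by an
# assumed average character width, then write the text onto a fixed-width row.
# A later run simply overwrites whatever already occupies those cells.
def naive_row(runs, char_width:, cols:)
  row = " " * cols
  runs.sort_by(&:x).each do |run|
    col = (run.x / char_width).round
    row[col, run.text.length] = run.text
  end
  row.rstrip
end

# Variant: a run never starts before the previous run has ended, plus one
# separating space. Words stay readable, but columns drift to the right.
def non_overlapping_row(runs, char_width:, cols:)
  row = +""
  runs.sort_by(&:x).each do |run|
    col = (run.x / char_width).round
    col = [col, row.empty? ? 0 : row.length + 1].max
    row = row.ljust(col) + run.text
  end
  row
end

# Made-up positions: glyphs roughly 4pt wide, while the grid assumes 6pt.
runs = [
  Run.new(72.0, 700.0, "Usage and"),
  Run.new(112.0, 700.0, "Purchase"),
  Run.new(148.0, 700.0, "Charges")
]

puts naive_row(runs, char_width: 6.0, cols: 60).strip
# prints "Usage aPurchaCharges"
puts non_overlapping_row(runs, char_width: 6.0, cols: 60).strip
# prints "Usage and Purchase Charges"
```

The second variant fixes this row, but pushing text to the right is exactly the kind of change that breaks column alignment in documents that currently lay out well, which is why a tweak that helps one PDF tends to hurt others.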

There are a few similar issues (#371, #362, #118). I'll continue to mull over what a better algorithm might look like. Thanks for a great bug report.

JayNewstrom · 2021-12-20T15:20:42Z

Thanks for the update! I'm happy to run tests or provide more test files if you'd like!

mkllnk mentioned this issue Mar 21, 2023

Adds pdf comparison; removes csv, xlsx, pdf file fixtures openfoodfoundation/openfoodnetwork#10544

Merged
