Text collision/overlapping issues #406

Open
JayNewstrom opened this issue Dec 13, 2021 · 2 comments

Comments

@JayNewstrom

Some PDFs cause text collisions and overlapping.

This is a follow-up to my bug report in #397; the same sample PDF is causing these issues.

Screen Shot 2021-12-13 at 4 30 41 PM

Screen Shot 2021-12-13 at 4 31 30 PM

As with the previous issue, I'm happy to help where I can.

@yob
Owner

yob commented Dec 20, 2021

Now that the glyph positioning bugs in your sample PDF have been fixed (in #403), I took another look at the page you have a screenshot of here.

I looked at the glyph positions that pdf-reader calculated for a few words that are clearly wrong in the text extraction: "Usage and Purchase Charges".

I'm now very confident that the glyph positions are being extracted more or less accurately. There may be some very minor issues around kerning and spacing, but those will only throw the positions off by a point or two. I'm also confident all the characters are being extracted.

I think the real issue here is the naive algorithm in PageLayout, which is responsible for arranging the extracted text onto a plain text "page". By hand-tuning a few lines in PageLayout, I can get your page extracting a bit better:

Screenshot from 2021-12-20 23-53-09

Of course, that tuning throws off the layout of other documents.
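
As a rough illustration of how a grid-based layout can produce this kind of collision, here is a minimal sketch. It is not the actual PageLayout code: `Run`, `naive_row`, `non_overlapping_row`, and the point values are all hypothetical, chosen only to show what happens when the assumed column width doesn't match the real glyph widths.

```ruby
# Hypothetical, simplified text run: x position in points plus the text.
# This is NOT pdf-reader's internal representation.
Run = Struct.new(:x, :text)

# Map each run onto a fixed-width character row by dividing its x position
# by an assumed column width. Later runs overwrite earlier characters.
def naive_row(runs, col_width:, cols: 80)
  row = " " * cols
  runs.each do |run|
    col = (run.x / col_width).floor
    run.text.each_char.with_index do |ch, i|
      row[col + i] = ch if col + i < cols
    end
  end
  row.rstrip
end

# Same mapping, but a run never starts before the previous run has ended
# (plus one separating space), so characters cannot be overwritten.
def non_overlapping_row(runs, col_width:, cols: 80)
  row = " " * cols
  cursor = 0
  runs.sort_by(&:x).each do |run|
    col = [(run.x / col_width).floor, cursor].max
    run.text.each_char.with_index do |ch, i|
      row[col + i] = ch if col + i < cols
    end
    cursor = col + run.text.length + 1
  end
  row.rstrip
end

# Made-up positions: roughly 6pt per glyph with small gaps between words.
runs = [
  Run.new(0.0, "Usage"),
  Run.new(32.0, "and"),
  Run.new(52.0, "Purchase"),
]

puts naive_row(runs, col_width: 8.0)           # prints "UsaganPurchase"
puts non_overlapping_row(runs, col_width: 8.0) # prints "Usage and Purchase"
```

With the assumed 8pt columns the words land on top of each other even though every position is correct. Clamping the start column avoids the overwrite, but it shifts text to the right, which is the sort of trade-off that can disturb column alignment in other documents.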

There are a few similar issues (#371, #362, #118). I'll continue to mull over what a better algorithm might look like. Thanks for a great bug report.

@JayNewstrom
Author

Thanks for the update! I'm happy to run tests or provide more test files if you'd like!
