Fixed latin1 detector in chardet #5

Open: 1 commit to be merged into base: master

ablegrape

Hi David,

Following up on our exchange from a few weeks ago. I've committed the fix for the integer math bug (it causes a file with even one "low confidence" character to get a confidence level of 0) and also changed the confidence multiplier for Latin-1 to one that works better (the original multiplier causes many files to be incorrectly detected as ISO Latin-2). Testing on both the problem case mentioned ( http://www.lvo.com/GASTRONOMIE/VINS/VITI/VITI1F.HTML ) and the character set tables from http://www.columbia.edu/kermit/csettables.html suggests that the fixed code performs better.
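
To illustrate the kind of integer math bug described above, here is a minimal sketch. It is not the actual chardet code: the function names, the `likely`/`unlikely` counts, and the `penalty` value are all assumptions made for illustration. Floor division (`//`) stands in for the way `/` behaves on two integers in Python 2.

```python
# Hypothetical sketch, not chardet's real implementation.
# "likely" and "unlikely" are assumed character counts; "penalty" is an
# assumed weight applied to the share of unlikely characters.

def confidence_floor_div(likely, unlikely, penalty=20.0):
    """Buggy variant: integer (floor) division truncates the ratio."""
    total = likely + unlikely
    if total == 0:
        return 0.0
    # likely // total is 0 whenever at least one unlikely character exists,
    # so the score goes negative and is clamped to 0.
    score = (likely // total) - (unlikely * penalty / total)
    return max(score, 0.0)


def confidence_true_div(likely, unlikely, penalty=20.0):
    """Fixed variant: true division keeps the fractional ratio."""
    total = likely + unlikely
    if total == 0:
        return 0.0
    score = (likely / total) - (unlikely * penalty / total)
    return max(score, 0.0)


buggy = confidence_floor_div(999, 1)   # one unlikely character in 1000
fixed = confidence_true_div(999, 1)
```

With 999 likely characters and a single unlikely one, the floor-division variant collapses to 0, while the true-division variant stays close to 1.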

Hope this is useful - it's certainly helped fix a few problem documents in the application I'm working on.

Best,

Doug

puzzlet pushed a commit to puzzlet/python-chardet that referenced this pull request Jan 25, 2013

rspeer commented Aug 26, 2013

So that's why!

I would love to see a release of chardet that fixes this bug. As it is, chardet basically can't be used for latin-1, and that's the most common single-byte encoding. (Well, Windows-1252 is, but chardet doesn't really distinguish those.)
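
For context on the Latin-1 versus Windows-1252 remark, a small sketch using only Python's built-in codecs (the sample bytes are made up for illustration). The two encodings agree on bytes 0xA0 to 0xFF and differ in the 0x80 to 0x9F range, where Latin-1 has control characters and Windows-1252 has printable ones such as curly quotes.

```python
# Sample bytes chosen for illustration: 0x93 and 0x94 fall in the range
# where the two encodings differ, 0xE9 falls in the range where they agree.
raw = b"\x93caf\xe9\x94"

as_cp1252 = raw.decode("cp1252")    # 0x93/0x94 become curly double quotes
as_latin1 = raw.decode("latin-1")   # 0x93/0x94 become C1 control characters

# The accented letter decodes identically under both encodings.
same_middle = as_cp1252[1:5] == as_latin1[1:5]
```

This is why a detector looking only at accented letters cannot separate the two, and why bytes in the 0x80 to 0x9F range are the main hint for Windows-1252.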
