Fixed latin1 detector in chardet #5

ablegrape · 2012-09-10T19:39:54Z

Hi David,

Following up on our exchange from a few weeks ago. I've committed the fix for the integer math bug (which causes a file with even one "low confidence" character to get a confidence level of 0) and also updated the confidence multiplier for latin-1 to one that works better (the original multiplier causes many files to be incorrectly detected as iso-latin-2). Testing on both the problem case mentioned (http://www.lvo.com/GASTRONOMIE/VINS/VITI/VITI1F.HTML) and the character set tables from http://www.columbia.edu/kermit/csettables.html suggests that the fixed code performs better.

Hope this is useful - it's certainly helped fix a few problem documents in the application I'm working on.

Best,

Doug
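
For readers who want to see how an integer math bug of this kind can zero out a confidence score, here is a minimal sketch. It is not the code from this pull request: the function names, the counter arguments, and the penalty weight of 20 are assumptions chosen for illustration. The `//` operator is used to mimic how `/` behaves on two integers in Python 2.

```python
# Illustrative sketch only; names and the penalty weight are hypothetical,
# not taken from the chardet source or from this pull request.

def confidence_integer_math(likely, unlikely, total):
    """Confidence computed with integer division, as Python 2's `/` would do."""
    if total == 0:
        return 0.0
    # likely // total is 0 unless every counted character is "likely",
    # so a single low-confidence character drags the result to 0.
    confidence = likely // total - (unlikely * 20) // total
    return float(max(confidence, 0))


def confidence_float_math(likely, unlikely, total):
    """Same formula with true division, so the ratio keeps its fractional part."""
    if total == 0:
        return 0.0
    confidence = likely / float(total) - (unlikely * 20.0) / total
    return max(confidence, 0.0)


if __name__ == "__main__":
    # 999 "likely" characters and 1 "unlikely" one out of 1000.
    print(confidence_integer_math(999, 1, 1000))
    print(confidence_float_math(999, 1, 1000))
```

With integer division the ratio truncates to zero as soon as the "likely" count falls below the total, which matches the symptom described above. Converting one operand to a float keeps the fractional part.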

Fix BOM detection dcramer#4 Thanks @byroot

rspeer · 2013-08-26T06:44:13Z

So that's why!

I would love to see a release of chardet that fixes this bug. As it is, chardet basically can't be used for latin-1, and that's the most common single-byte encoding. (Well, Windows-1252 is, but chardet doesn't really distinguish those.)

Fix an integer arithmetic bug, and improve the confidence multiplier.

cf513c9

puzzlet pushed a commit to puzzlet/python-chardet that referenced this pull request Jan 25, 2013

Merge pull request dcramer#5 from byroot/fix-bom-detection

0e70614

dan-blanchard mentioned this pull request Dec 19, 2013

Fix an integer arithmetic bug, and improve the confidence multiplier. chardet/chardet#15

Closed