Add html5 character encoding mappings #8
Comments
@sigmavirus24 Is this something you fixed? I'm trying to migrate old issues over to our new repo.
@dan-blanchard no, the issue in that commit refers to sv24-archive#8
@mlissner I don't actually see the mapping you describe in the latest version of the HTML5 spec. That said, I do think always returning CP1252 instead of Latin-1 is advisable, because it gets you the curly quotes and other printable characters that Latin-1 lacks, and one almost never encounters documents that purposefully contain the control characters CP1252 replaces.
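To make the difference concrete, here is a small sketch using only Python's built-in codecs. The sample bytes are made up for illustration; the point is that the same bytes decode to control characters under Latin-1 and to curly quotes under CP1252.

```python
# Made-up sample: 0x93 and 0x94 are the bytes CP1252 uses for curly double quotes.
raw = b"\x93quoted\x94"

as_latin1 = raw.decode("latin-1")  # 0x93/0x94 become C1 control characters
as_cp1252 = raw.decode("cp1252")   # 0x93/0x94 become curly quotes

print(repr(as_latin1))
print(repr(as_cp1252))

# Caveat: Python's cp1252 codec leaves five bytes undefined
# (0x81, 0x8D, 0x8F, 0x90, 0x9D) and raises on them under strict decoding.
# Passing errors="replace" substitutes U+FFFD instead of raising.
fallback = b"\x81".decode("cp1252", errors="replace")
```

Because of that caveat, a blanket switch to CP1252 in Python probably needs a lenient error handler or a fallback to Latin-1 for the rare document that contains those bytes.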
Well... My internet connection is so slow here I can't download the whole Anyway, yeah, I filed this bug after encountering a page that was reporting
I had a frustrating issue recently when trying to use chardet to work with a web page: http://stackoverflow.com/questions/11588458/how-to-handle-encodings-using-python-requests-library
My solution was to write a bit of custom code that says, "Whenever chardet reports ISO-8859-1, instead use cp1252."
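A minimal sketch of that kind of workaround might look like the following. The names `BROWSER_OVERRIDES` and `browser_encoding` are hypothetical and not part of chardet; the function takes the encoding name that a detector such as `chardet.detect` reports under its `"encoding"` key.

```python
# Hypothetical override table (not part of chardet): detected name -> name to decode with.
BROWSER_OVERRIDES = {
    "iso-8859-1": "cp1252",
}


def browser_encoding(detected):
    """Return the encoding to decode with, given a detected encoding name.

    `detected` may be None, which detectors can report when they cannot guess.
    """
    if detected is None:
        return None
    # Compare case-insensitively, since detectors may report "ISO-8859-1".
    return BROWSER_OVERRIDES.get(detected.lower(), detected)


# Example: bytes reported as ISO-8859-1 are decoded as CP1252 instead.
text = b"\x93hi\x94".decode(browser_encoding("ISO-8859-1"))
```

Keeping the overrides in a table rather than a single `if` leaves room for further mappings later without changing the function.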
Basically, browsers don't use a number of character encodings and map them to other ones instead. Browsers did this unofficially for a while, but it's now enshrined in the HTML5 spec:
http://dev.w3.org/html5/spec/single-page.html#character-encodings-0
Since most of the data that chardet is used on will be coming from the web, it makes sense for it to return the character encodings that browsers use. This might make more sense as an option than as default functionality. I'm not sure, but I'd love to see this added.
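As one possible shape for the opt-in version, here is a rough sketch of a wrapper with a flag. Everything in it is an assumption for discussion: `detect_for_web` and `browser_compat` are made-up names, and `detect` stands for any callable that returns a dict with an `"encoding"` key, as `chardet.detect` does.

```python
def detect_for_web(raw, detect, browser_compat=False):
    """Run `detect` on `raw` and optionally remap the result for browser use.

    `detect` is any callable returning a dict with an "encoding" key.
    With browser_compat=False the detector's answer is returned unchanged.
    """
    result = dict(detect(raw))  # copy so the detector's own dict is not mutated
    encoding = result.get("encoding")
    if browser_compat and encoding and encoding.lower() == "iso-8859-1":
        result["encoding"] = "windows-1252"
    return result


# Stand-in detector for demonstration only; a real caller would pass chardet.detect.
def fake_detect(raw):
    return {"encoding": "ISO-8859-1", "confidence": 0.73}


default_result = detect_for_web(b"\x93hi\x94", fake_detect)
compat_result = detect_for_web(b"\x93hi\x94", fake_detect, browser_compat=True)
```

Defaulting the flag to `False` keeps existing behaviour intact, so callers who want the raw detection result are unaffected.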
If this is a feature that'd be accepted, I'd be happy to put it together in a pull request, but I need guidance as to the design that'd be accepted.