Add examples using urllib2 and urllib.request for HTTP Content-Type

gsnedders · gsnedders · commit c10635fad6da · 2013-08-26T18:02:48.000+01:00
I've seen many people use html5lib while completely ignoring any character
encoding given at the HTTP layer. It would be better to lead them in the
right direction.
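
For context, here is a minimal offline sketch of the lookup the Python 3
example relies on. It assumes the response headers behave like a standard
library ``email.message.Message``, which is what ``urllib.request`` responses
expose through ``info()``. The helper name ``charset_from_content_type`` is
hypothetical and is not part of html5lib or of this change.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

When the header carries no charset, the result is ``None``, so the value
passed as ``encoding`` may be empty and the parser has to determine the
encoding by other means.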
diff --git a/README.rst b/README.rst
@@ -41,6 +41,29 @@ a treebuilder:
   with open("mydocument.html", "rb") as f:
       lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
 
+When using html5lib with ``urllib2`` (Python 2), the charset from HTTP
+should be passed into html5lib as follows:
+
+.. code-block:: python
+
+  from contextlib import closing
+  from urllib2 import urlopen
+  import html5lib
+
+  with closing(urlopen("http://example.com/")) as f:
+      document = html5lib.parse(f, encoding=f.info().getparam("charset"))
+
+When using html5lib with ``urllib.request`` (Python 3), the charset from
+HTTP should be passed into html5lib as follows:
+
+.. code-block:: python
+
+  from urllib.request import urlopen
+  import html5lib
+
+  with urlopen("http://example.com/") as f:
+      document = html5lib.parse(f, encoding=f.info().get_content_charset())
+
 To have more control over the parser, create a parser object explicitly.
 For instance, to make the parser raise exceptions on parse errors, use: