Commit 554e648

README-Unicode.md: Fix Markdown format for linting and update it
Signed-off-by: Bernhard Kaindl <[email protected]>
1 parent ae2bb8e commit 554e648

1 file changed

+46
-15
lines changed

README-Unicode.md

+46-15
@@ -1,40 +1,64 @@
1-
Python3 Unicode migration in the XCP package
2-
============================================
1+
# Python3 Unicode migration in the XCP package
2+
3+
## Problem
4+
5+
Python3.6 on XS8 does not have an all-encompassing default UTF-8 mode for I/O.
6+
7+
Python 3.7 and newer have a UTF-8 mode (PEP 540) that is enabled automatically when the locale is C or POSIX.
8+
Python3.6 only enables UTF-8 for I/O when a UTF-8 locale is in use.
9+
See below for more background info on the UTF-8 mode.
10+
11+
For situations where UTF-8 is not enabled, we have to specify UTF-8 explicitly.
12+
13+
Such a situation arises when the LANG or LC_* variables are not set to a UTF-8 locale.
14+
XAPI plugins like auto-cert-kit find themselves in this situation.
15+
16+
Example:
17+
For reading UTF-8 files like the `pciids` file, add `encoding="utf-8"`.
18+
This applies especially to `open()` and `Popen()` when files may contain UTF-8.
19+
20+
This also applies when encoding data for or decoding data from `urllib`, which uses bytes.
21+
`urllib` has to use bytes as HTTP data can of course also be binary, e.g. compressed.
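
As a minimal sketch of the two cases above (Python 3 only): the temporary file and the sample bytes are made up for illustration and merely stand in for a file like `pci.ids` and for an HTTP body handled by `urllib`.

```python
import os
import tempfile

# Hypothetical sample data: one line containing a non-ASCII character (U+00B2).
sample = u"1002  Example device with 2\u00b2 units\n"

# Write the sample as UTF-8 bytes into a temporary file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

# Passing encoding="utf-8" makes the read independent of LANG / LC_*.
with open(path, encoding="utf-8") as f:
    text = f.read()
os.remove(path)

# Data exchanged with urllib is bytes and needs the same explicit codec;
# this literal stands in for a response body.
body = b"caf\xc3\xa9"
decoded = body.decode("utf-8")
```

With the encoding spelled out, the result no longer depends on the locale the interpreter was started with.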
22+
23+
## Migrating `subprocess.Popen()`
324

4-
Migrating `subprocess.Popen()`
5-
------------------------------
625
With Python3, the `stdin`, `stdout` and `stderr` pipes for `Popen()` default to `bytes` (binary mode). Binary mode is much safer because it forgoes the encode/decode step.
726

827
To work with strings, existing users need to either enable text mode (only when safe, because it will attempt to decode and encode!) or switch to using bytes instead.
928

1029
For cases where the data is guaranteed to be pure ASCII, such as when reading the `proc.stdout` of `lspci -nm`, it is sufficient to use:
30+
1131
```py
1232
subprocess.Popen(["lspci", "-nm"], stdout=subprocess.PIPE, universal_newlines=True)
1333
```
34+
1435
This is possible because `universal_newlines=True` is accepted by Python2 and Python3.
1536
On Python3, it also enables text mode for `subprocess.PIPE` streams (newline conversion
1637
is not needed, but text mode is).
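
A small sketch of the text-mode behaviour described above. To stay runnable on machines without `lspci`, it launches the Python interpreter itself as a stand-in child process; the printed line is invented sample data.

```python
import subprocess
import sys

# Stand-in for `lspci -nm`: a child process that prints one pure-ASCII line.
cmd = [sys.executable, "-c", "print('00:00.0 Host bridge')"]

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
out, _ = proc.communicate()

# With universal_newlines=True, Python 3 returns str instead of bytes here.
is_text = isinstance(out, str)
```

Without `universal_newlines=True`, `out` would be a `bytes` object on Python 3 and string operations on it would need an explicit `decode()`.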
1738

18-
Migrating `builtins.open()`
19-
---------------------------
39+
## Migrating `builtins.open()`
40+
2041
On Python3, `builtins.open()` can be used in a number of modes:
42+
2143
- Binary mode (when `"b"` is in `mode`): `read()` and `write()` use `bytes`.
2244
- Text mode (Python3 default up to Python 3.6), when UTF-8 character encoding is not set by the locale
23-
- UTF-8 mode (default since Python 3.7): https://peps.python.org/pep-0540/
45+
- UTF-8 mode (default since Python 3.7): <https://peps.python.org/pep-0540/>
2446

2547
When no Unicode locale is in force, like in XAPI plugins, Python3 will be in text mode or UTF-8 mode (since Python 3.7, but existing XS is on 3.6):
2648

27-
* By default, `read()` on files `open()`ed without selecting binary mode attempts
49+
- By default, `read()` on files `open()`ed without selecting binary mode attempts
2850
to decode the data into the Python3 Unicode string type.
2951
This fails for binary data.
3052
With Python 3.6 and no UTF-8 locale active, any byte with `ord() >= 128` triggers `UnicodeDecodeError`.
3153

32-
* Thus, all `open()` calls which might open binary files have to be converted to binary
54+
- Thus, all `open()` calls which might open binary files have to be converted to binary
3355
or UTF-8 mode unless the caller is sure the file is pure ASCII.
3456
But even then, enabling an error handler to handle decoding errors is recommended:
57+
3558
```py
3659
open(filename, errors="replace")
3760
```
61+
3862
But neither `errors=` nor `encoding=` is accepted by Python2, so a wrapper is likely best.
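
What the error handler changes can be sketched at the codec level, which uses the same handlers as `open(..., errors=...)` on Python 3. The byte string is an invented sample.

```python
# 0xb2 is ISO-8859-1 for U+00B2; on its own it is not valid ASCII or UTF-8.
raw = b"power\xb2"

# Strict decoding (the default) raises for bytes the codec cannot handle.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for each undecodable byte instead.
replaced = raw.decode("utf-8", errors="replace")
```

The replacement character keeps the read from failing, at the cost of losing the original byte.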
3963

4064
### Binary mode
@@ -43,21 +67,18 @@ When decoding bytes to strings is not needed, binary mode can be great because i
4367

4468
However, when strings need to be returned from the library, something like `bytes.decode(errors="ignore")` is needed to get strings.
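
A brief sketch of that conversion. `to_text()` is a hypothetical helper name and not part of the library; the input bytes are invented.

```python
def to_text(data):
    """Return str for bytes read in binary mode, dropping undecodable bytes."""
    # errors="ignore" silently discards anything that is not valid UTF-8.
    return data.decode("utf-8", errors="ignore")


# 0xb2 alone is not valid UTF-8, so it is dropped from the result.
text = to_text(b"GeForce\xb2 GT")
```

Note that `errors="ignore"` discards data silently, so it only fits callers that do not need the dropped characters.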
4569

46-
### Text mode
47-
48-
Text mode using the `ascii` codec should be only used when it is ensured that the data can only consist of ASCII characters (0-127). Sadly, it is the default in Python 3.6 when the Python interpreter was not started using an UTF-8 locale for the LC_CTYPE locale category (set by LC_ALL, LC_CTYPE, LANG environment variables, overriding each other in that order)
49-
5070
### UTF-8 mode
5171

5272
Most of the time, the `UTF-8` codec should be used, since even simple text files that are documented to contain only ASCII characters, like `"/usr/share/hwdata/pci.ids"`, in fact __do__ contain UTF-8 characters.
5373

5474
Some files and some command output even contain legacy `ISO-8859-1` characters, and even the `UTF-8` codec would raise `UnicodeDecodeError` for these.
5575
When this is known to be the case, `encoding="iso-8859-1"` could be tried (not tested yet).
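
At the codec level, the difference can be sketched as follows. `decode_with_fallback()` is a hypothetical helper for illustration only, and the fallback order is an assumption rather than something the package does.

```python
def decode_with_fallback(data):
    """Decode bytes as UTF-8 and fall back to ISO-8859-1 if that fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a character, so this cannot raise.
        return data.decode("iso-8859-1")


from_latin1 = decode_with_fallback(b"\xb2")     # legacy single byte
from_utf8 = decode_with_fallback(b"\xc2\xb2")   # the same character in UTF-8
```

Because ISO-8859-1 accepts any byte sequence, the fallback never fails, but it will produce wrong characters if the data was actually in some other legacy encoding.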
5676
57-
### Problems
77+
### Problems
5878

5979
With the locale set to C (XAPI plugins have that), Python's default mode changes
6080
between 3.6 and 3.7:
81+
6182
```py
6283
for i in {6,7,10,11};do echo -n "3.$i: ";
6384
LC_ALL=C python3.$i -c 'import locale,sys;print(locale.getpreferredencoding())';done
@@ -66,7 +87,9 @@ for i in 3.{6,7,10,11};do echo -n "3.$i: ";
6687
3.10: UTF-8
6788
3.11: utf-8
6889
```
90+
6991
This has the effect that in Python 3.6, the default codec for XAPI plugins is `ascii`:
92+
7093
```py
7194
for i in 2.7 3.{6,7};do echo "$i:";
7295
LC_ALL=C python$i -c 'open("/usr/share/hwdata/pci.ids").read()';done
@@ -79,13 +102,15 @@ Traceback (most recent call last):
79102
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 97850: ordinal not in range(128)
80103
3.7:
81104
```
105+
82106
This error means that the `ascii` codec cannot handle bytes with `ord() >= 128`, and since some video card entries use `²` to reference their power, the `ascii` codec chokes on them.
83107

84108
It means `xcp.pci.PCIIds()` cannot use `open("/usr/share/hwdata/pci.ids").read()`.
85109

86110
While Python 3.7 and newer use UTF-8 mode by default, they do not set up an error handler for `UnicodeDecodeError`.
87111

88112
As it happens, some older tools output hard-coded ISO-8859-1 characters. These are not valid UTF-8 sequences, so even newer Python versions need an error handler to avoid failing:
113+
89114
```py
90115
echo -e "\0262" >.text # ISO-8859-1 for: "²"
91116
python3 -c 'open(".text").read()'
@@ -133,6 +158,7 @@ tests/test_bootloader.py line 38 in TestLinuxBootloader.setUp()
133158
tests/test_pci.py line 96 in TestPCIIds.test_videoclass_by_mock_calls()
134159
tests/test_pci.py line 110 in TestPCIIds.mock_lspci_using_open_testfile()
135160
```
161+
136162
Of course, `xcp/net/ifrename` won't be affected but it would be good to fix the
137163
warning for them as well in an intelligent way. See the proposal for that below.
138164

@@ -141,6 +167,7 @@ arguments we need to pass to ensure that all users of open() will work, we need
141167
to make passing the arguments conditional on Python >= 3.
142168

143169
1. Overriding `open()`, while technically working, would affect not only xcp.python but the entire program:
170+
144171
```py
145172
if sys.version_info >= (3, 0):
146173
original_open = __builtins__["open"]
@@ -152,7 +179,9 @@ to make passing the arguments conditional on Python >= 3.
152179
return original_open(*args, **kwargs)
153180
__builtins__["open"] = uopen
154181
```
182+
155183
2. This is sufficient but is not very nice:
184+
156185
```py
157186
# xcp/utf8mode.py
158187
if sys.version_info >= (3, 0):
@@ -165,9 +194,11 @@ to make passing the arguments conditional on Python >= 3.
165194
- open(filename)
166195
+ open(filename, **open_utf8args)
167196
```
197+
168198
But, `pylint` will still warn about these lines, so I propose:
169199

170200
3. Instead, use a wrapper function, which will also silence the `pylint` warnings at the locations which have been changed to use it:
201+
171202
```py
172203
# xcp/utf8mode.py
173204
if sys.version_info >= (3, 0):
@@ -189,4 +220,4 @@ Using the 3rd option, the `pylint` warnings for the changed locations
189220
explicitly disabling them.
190221

191222
PS: Since utf8open() still returns a context-manager, `with open(...) as f:`
192-
would still work.
223+
would still work.
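
To illustrate the context-manager point, here is a minimal sketch of such a wrapper. `open_utf8_text()` is a hypothetical name and not the `utf8open()` from this README, whose body is not fully shown here; the choice of `errors="replace"` and the temporary file are assumptions for the example.

```python
import os
import sys
import tempfile


def open_utf8_text(filename, mode="r"):
    """Open a text file as UTF-8 on Python 3, plain open() on Python 2."""
    if sys.version_info >= (3, 0):
        # Only Python 3 accepts encoding= and errors= in the builtin open().
        return open(filename, mode, encoding="utf-8", errors="replace")
    return open(filename, mode)


# Invented sample file containing one byte that is not valid UTF-8.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"ok \xb2\n")

# The wrapper returns the file object itself, so `with` works unchanged.
with open_utf8_text(path) as f:
    content = f.read()
os.remove(path)
```

Callers only swap the function name, and the `with` statement around it stays the same.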
