PEP 540

Add examples for the "List a directory into stdout" use case.
2017-01-11 22:32:24 +01:00 · 0e107f280c
parent 1b6b889ed6
commit 0e107f280c
1 changed files with 47 additions and 13 deletions
--- a/pep-0540.txt
+++ b/pep-0540.txt
@@ -278,17 +278,20 @@ handler, instead using the locale encoding (with ``strict`` or
 Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
 bother users with encodings, but it can produce mojibake. It can be
 configured as strict to prevent mojibake: the UTF-8 encoding is used
-with the ``strict`` error handler in this case.
+with the ``strict`` error handler for inputs and outputs, but the
+``surrogateescape`` error handler is still used for `operating system
+data`_.

 New ``-X utf8`` command line option and ``PYTHONUTF8`` environment
 variable are added to control the UTF-8 mode. The UTF-8 mode is enabled
 by ``-X utf8`` or ``PYTHONUTF8=1``.  The UTF-8 mode is configured as strict
-by ``-X utf8=strict`` or ``PYTHONUTF8=strict``.
+by ``-X utf8=strict`` or ``PYTHONUTF8=strict``. Other option values fail
+with an error.

 The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
 can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.

-The ``-X utf8`` has the priority on the ``PYTHONUTF8`` environment
+The ``-X utf8`` option has priority over the ``PYTHONUTF8`` environment
 variable. For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the
 UTF-8 mode.
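
The rules above can be summarised in a small sketch. This is not CPython's
implementation: the function ``resolve_utf8_mode`` and its parameters are
hypothetical, and accepting the value ``1`` for ``-X utf8`` (as an
equivalent of a bare ``-X utf8``) is an assumption.

```python
def resolve_utf8_mode(xoption=None, env=None, posix_locale=False):
    """Return 'off', 'on' or 'strict' (hypothetical helper).

    xoption: value of ``-X utf8`` (True for a bare ``-X utf8``), or None
    env: value of PYTHONUTF8, or None if the variable is unset
    posix_locale: whether the current locale is the POSIX locale
    """
    modes = {'0': 'off', '1': 'on', 'strict': 'strict'}
    # The command line option has priority over the environment variable.
    for value in (xoption, env):
        if value is None:
            continue
        if value is True:  # bare "-X utf8" (assumed equivalent to "1")
            return 'on'
        if value not in modes:
            # Other option values fail with an error.
            raise ValueError('invalid UTF-8 mode value: %r' % (value,))
        return modes[value]
    # No explicit setting: only the POSIX locale enables the mode.
    return 'on' if posix_locale else 'off'
```

For instance, ``resolve_utf8_mode(xoption=True, env='0')`` returns ``'on'``,
matching the ``PYTHONUTF8=0 python3 -X utf8`` example, and
``resolve_utf8_mode(xoption='0', posix_locale=True)`` returns ``'off'``.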

@@ -389,7 +392,7 @@ code.
 The UTF-8 mode can produce mojibake since Python and external code don't
 both check for invalid bytes, but it's a deliberate choice. The UTF-8 mode can
 be configured as strict to prevent mojibake and to fail early when data
-is not decodable from UTF-8.
+is not decodable from UTF-8 or not encodable to UTF-8.

 External code using text
 ^^^^^^^^^^^^^^^^^^^^^^^^
@@ -441,6 +444,38 @@ To be able to always work, the program must be able to produce mojibake.
 Mojibake is more user friendly than an error with a truncated or empty
 output.

+Example with a directory which contains the file called ``b'xxx\xff'``
+(the byte ``0xFF`` is invalid in UTF-8).
+
+The default mode and the strict UTF-8 mode fail on ``print()`` with an encode error::
+
+    $ python3.7 ../ls.py
+    Traceback (most recent call last):
+      File "../ls.py", line 5, in <module>
+        print(name)
+    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
+
+    $ python3.7 -X utf8=strict ../ls.py
+    Traceback (most recent call last):
+      File "../ls.py", line 5, in <module>
+        print(name)
+    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
+
+The UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
+but display mojibake::
+
+    $ python3.7 -X utf8 ../ls.py
+    xxx<78>
+
+    $ LC_ALL=C /python3.6 ../ls.py
+    xxx<78>
+
+    $ python2 ../ls.py
+    xxx<78>
+
+    $ ls
+    'xxx'$'\377'
+
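The difference between the two groups of runs can be reproduced without a
real directory. The following is a minimal sketch, not the ``ls.py`` script
used above (its source is not shown here). It assumes the file name was
decoded from ``b'xxx\xff'`` with ``surrogateescape``, as is done for
operating system data, and encodes it back with each error handler. The
helper name ``encode_name`` is hypothetical.

```python
# Bytes of the file name; 0xFF is never valid in UTF-8.
raw_name = b'xxx\xff'

# Decoding with surrogateescape maps the undecodable byte to the lone
# surrogate U+DCFF instead of raising an error.
name = raw_name.decode('utf-8', 'surrogateescape')


def encode_name(text, errors):
    """Hypothetical helper: encode a file name for output as UTF-8."""
    return text.encode('utf-8', errors)


# The strict handler rejects the lone surrogate.
try:
    encode_name(name, 'strict')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# surrogateescape restores the original byte, so the output contains
# the invalid byte (mojibake) rather than failing.
escaped = encode_name(name, 'surrogateescape')

print(ascii(name), strict_failed, escaped)
```

The strict handler fails in the same way as the tracebacks above, while
``surrogateescape`` writes the original bytes back unchanged.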

 List a directory into a text file
 ---------------------------------
@@ -647,9 +682,9 @@ Backward Compatibility
 ======================

 The main backward incompatible change is that the UTF-8 encoding is now
-used if the locale is POSIX. Since the UTF-8 encoding is used with the
-``surrogateescape`` error handler, ecoding errors should not occur and
-so the change should not break applications.
+used by default if the locale is POSIX. Since the UTF-8 encoding is used
+with the ``surrogateescape`` error handler, encoding errors should not
+occur and so the change should not break applications.

 The more likely source of trouble comes from external libraries. Python
 can successfully decode data from UTF-8, but a library using the locale
@@ -658,9 +693,8 @@ encoding text in a library is a rare operation. Very few libraries
 expect text, most libraries expect bytes and even manipulate bytes
 internally.

-If the locale is not POSIX, the PEP has no impact on the backward
-compatibility since the UTF-8 mode is disabled by default in this case,
-it must be enabled explicitly.
+The PEP only changes the default behaviour if the locale is POSIX. For
+other locales, the *default* behaviour is unchanged.


 Alternatives
@@ -672,9 +706,9 @@ Don't modify the encoding of the POSIX locale
 A first version of the PEP did not change the encoding and error handler
 used for the POSIX locale.

-The problem is that adding a command line option or setting an
-environment variable is not possible in some cases, or at least not
-convenient.
+The problem is that adding the ``-X utf8`` command line option or
+setting the ``PYTHONUTF8`` environment variable is not possible in some
+cases, or at least not convenient.

 Moreover, many users simply expect that Python 3 behaves as Python 2:
 it doesn't bother them with encodings and "just works" in all cases. These