PEP 540
Add examples for the "List a directory into stdout" use case.
parent 1b6b889ed6
commit 0e107f280c
60
pep-0540.txt
@@ -278,17 +278,20 @@ handler, instead using the locale encoding (with ``strict`` or
 Basically, the UTF-8 mode behaves as Python 2: it "just works" and don't
 bother users with encodings, but it can produce mojibake. It can be
 configured as strict to prevent mojibake: the UTF-8 encoding is used
-with the ``strict`` error handler in this case.
+with the ``strict`` error handler for inputs and outputs, but the
+``surrogateescape`` error handler is still used for `operating system
+data`_.
 
 New ``-X utf8`` command line option and ``PYTHONUTF8`` environment
 variable are added to control the UTF-8 mode. The UTF-8 mode is enabled
 by ``-X utf8`` or ``PYTHONUTF8=1``. The UTF-8 is configured as strict
-by ``-X utf8=strict`` or ``PYTHONUTF8=strict``.
+by ``-X utf8=strict`` or ``PYTHONUTF8=strict``. Other option values fail
+with an error.
 
 The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
 can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
 
-The ``-X utf8`` has the priority on the ``PYTHONUTF8`` environment
+The ``-X utf8`` has the priority over the ``PYTHONUTF8`` environment
 variable. For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the
 UTF-8 mode.
 
@@ -389,7 +392,7 @@ code.
 The UTF-8 mode can produce mojibake since Python and external code don't
 both of invalid bytes, but it's a deliberate choice. The UTF-8 mode can
 be configured as strict to prevent mojibake and be fail early when data
-is not decodable from UTF-8.
+is not decodable from UTF-8 or not encodable to UTF-8.
 
 External code using text
 ^^^^^^^^^^^^^^^^^^^^^^^^
@@ -441,6 +444,38 @@ To be able to always work, the program must be able to produce mojibake.
 Mojibake is more user friendly than an error with a truncated or empty
 output.
 
+Example with a directory which contains the file called ``b'xxx\xff'``
+(the byte ``0xFF`` is invalid in UTF-8).
+
+Default and UTF-8 Strict mode fail on ``print()`` with an encode error::
+
+    $ python3.7 ../ls.py
+    Traceback (most recent call last):
+      File "../ls.py", line 5, in <module>
+        print(name)
+    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
+
+    $ python3.7 -X utf8=strict ../ls.py
+    Traceback (most recent call last):
+      File "../ls.py", line 5, in <module>
+        print(name)
+    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
+
+The UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
+but display mojibake::
+
+    $ python3.7 -X utf8 ../ls.py
+    xxx<78>
+
+    $ LC_ALL=C /python3.6 ../ls.py
+    xxx<78>
+
+    $ python2 ../ls.py
+    xxx<78>
+
+    $ ls
+    'xxx'$'\377'
+
 
 List a directory into a text file
 ---------------------------------
@@ -647,9 +682,9 @@ Backward Compatibility
 ======================
 
 The main backward incompatible change is that the UTF-8 encoding is now
-used if the locale is POSIX. Since the UTF-8 encoding is used with the
-``surrogateescape`` error handler, ecoding errors should not occur and
-so the change should not break applications.
+used by default if the locale is POSIX. Since the UTF-8 encoding is used
+with the ``surrogateescape`` error handler, encoding errors should not
+occur and so the change should not break applications.
 
 The more likely source of trouble comes from external libraries. Python
 can decode successfully data from UTF-8, but a library using the locale
@@ -658,9 +693,8 @@ encoding text in a library is a rare operation. Very few libraries
 expect text, most libraries expect bytes and even manipulate bytes
 internally.
 
-If the locale is not POSIX, the PEP has no impact on the backward
-compatibility since the UTF-8 mode is disabled by default in this case,
-it must be enabled explicitly.
+The PEP only changes the default behaviour if the locale is POSIX. For
+other locales, the *default* behaviour is unchanged.
 
 
 Alternatives
@@ -672,9 +706,9 @@ Don't modify the encoding of the POSIX locale
 A first version of the PEP did not change the encoding and error handler
 used of the POSIX locale.
 
-The problem is that adding a command line option or setting an
-environment variable is not possible in some cases, or at least not
-convenient.
+The problem is that adding the ``-X utf8`` command line option or
+setting the ``PYTHONUTF8`` environment variable is not possible in some
+cases, or at least not convenient.
 
 Moreover, many users simply expect that Python 3 behaves as Python 2:
 don't bother them with encodings and "just works" in all cases. These
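
Note on the "List a directory into stdout" examples: they run a script, ``../ls.py``, whose source is not part of this diff. The sketch below is only a guess at what such a script might look like, assuming it prints each name returned by ``os.listdir()``. The names ``RAW_NAME``, ``decode_name``, ``encode_name`` and ``list_directory`` are illustrative and do not come from the PEP. The two helpers pin the codec to UTF-8 so that the result does not depend on the current locale; on a real POSIX system Python applies the filesystem encoding instead.

```python
import os
import sys

# File name used in the PEP example; the byte 0xFF is invalid in UTF-8.
RAW_NAME = b"xxx\xff"


def decode_name(raw: bytes) -> str:
    """Decode a file name with UTF-8 and the surrogateescape handler.

    surrogateescape maps each undecodable byte to a lone surrogate
    (0xFF becomes U+DCFF) instead of raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", "surrogateescape")


def encode_name(name: str, errors: str) -> bytes:
    """Encode a name to UTF-8 with the given error handler."""
    return name.encode("utf-8", errors)


def list_directory(path: str = os.curdir, out=sys.stdout) -> None:
    """Hypothetical stand-in for ls.py: print each entry of a directory."""
    for name in sorted(os.listdir(path)):
        # print() raises UnicodeEncodeError if `out` uses the strict error
        # handler and `name` contains a surrogate-escaped byte.
        print(name, file=out)
```

Under these assumptions, ``encode_name(decode_name(RAW_NAME), "strict")`` raises ``UnicodeEncodeError``, which corresponds to the two failing runs, while passing ``"surrogateescape"`` returns the original bytes unchanged, which is why the UTF-8 mode can still write the name to stdout.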