PEP 540: Apply Nick Coghlan's PR #201
I applied it manually since another PR was merged in the meantime.
parent 7d181dc76d
commit cef853f646
415
pep-0540.txt
|
@@ -2,7 +2,8 @@ PEP: 540
|
|||
Title: Add a new UTF-8 mode
|
||||
Version: $Revision$
|
||||
Last-Modified: $Date$
|
||||
Author: Victor Stinner <victor.stinner@gmail.com>
|
||||
Author: Victor Stinner <victor.stinner@gmail.com>,
|
||||
Nick Coghlan <ncoghlan@gmail.com>
|
||||
BDFL-Delegate: INADA Naoki
|
||||
Status: Draft
|
||||
Type: Standards Track
|
||||
|
@@ -14,16 +15,22 @@ Python-Version: 3.7
|
|||
Abstract
|
||||
========
|
||||
|
||||
Add a new UTF-8 mode, disabled by default, to ignore the locale and
|
||||
force the usage of the UTF-8 encoding.
|
||||
Add a new UTF-8 mode, enabled by default in the POSIX locale, to ignore
|
||||
the locale and force the usage of the UTF-8 encoding for external
|
||||
operating system interfaces, including the standard IO streams.
|
||||
|
||||
Basically, UTF-8 mode behaves as Python 2: it "just works" and doesn't
|
||||
bother users with encodings, but it can produce mojibake. The UTF-8 mode
|
||||
can be configured as strict to prevent mojibake.
|
||||
Essentially, the UTF-8 mode behaves as Python 2 and other C based
|
||||
applications on \*nix systems: it aims to process text as best it can,
|
||||
but it errs on the side of producing or propagating mojibake to
|
||||
subsequent components in a processing pipeline rather than requiring
|
||||
strictly valid encodings at every step in the process.
|
||||
|
||||
A new ``-X utf8`` command line option and a ``PYTHONUTF8`` environment
|
||||
variable are added to control the UTF-8 mode. The POSIX locale enables
|
||||
the UTF-8 mode.
|
||||
The UTF-8 mode can be configured as strict to reduce the risk of
|
||||
producing or propagating mojibake.
|
||||
|
||||
A new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
|
||||
variable are added to explicitly control the UTF-8 mode (including
|
||||
turning it off entirely, even in the POSIX locale).
|
||||
|
||||
|
||||
Rationale
|
||||
|
@@ -55,20 +62,30 @@ POSIX locale is a good choice: see the `Locales section of
|
|||
reproducible-builds.org
|
||||
<https://reproducible-builds.org/docs/locales/>`_.
|
||||
|
||||
PEP 538 lists additional problems related to the use of Linux containers to
|
||||
run network services and command line applications.
|
||||
|
||||
UNIX users don't expect Unicode errors, since the common command line
|
||||
tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors.
|
||||
These users expect that Python 3 "just works" with any locale and won't
|
||||
bother them with encodings. From their point of view, the bug is not
|
||||
their locale, it's obviously Python 3.
|
||||
tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors - they
|
||||
produce mostly-readable text instead.
|
||||
|
||||
Since Python 2 handles data as bytes, it's rarer in Python 2
|
||||
compared to Python 3 to get Unicode errors. It also explains why users
|
||||
also perceive Python 3 as the root cause of their Unicode errors.
|
||||
These users similarly expect that tools written in Python 3 (including those
|
||||
updated from Python 2) continue to tolerate locale misconfigurations and avoid
|
||||
bothering them with text encoding details. From their point of view, the
|
||||
bug is not their locale but is obviously Python 3 ("Everything else works,
|
||||
including Python 2, so what's wrong with Python 3?").
|
||||
|
||||
Some users expect that Python 3 just works with any locale and so don't
|
||||
bother with mojibake, whereas some developers are working hard to prevent
|
||||
mojibake and so expect that Python 3 fails early before creating
|
||||
it.
|
||||
Since Python 2 handles data as bytes, similar to system utilities written in
|
||||
C and C++, it's rarer in Python 2 compared to Python 3 to get explicit Unicode
|
||||
errors. It also contributes significantly to why many affected users perceive
|
||||
Python 3 as the root cause of their Unicode errors.
|
||||
|
||||
At the same time, the stricter text handling model was deliberately introduced
|
||||
into Python 3 to reduce the frequency of data corruption bugs arising in
|
||||
production services due to mismatched assumptions regarding text encodings.
|
||||
It's one thing to emit mojibake to a user's terminal while listing a directory,
|
||||
but something else entirely to store that in a system manifest in a database,
|
||||
or to send it to a remote client attempting to retrieve files from the system.
|
||||
|
||||
Since different groups of users have different expectations, there is no
|
||||
silver bullet which solves all issues at once. Last but not least,
|
||||
|
@@ -135,12 +152,12 @@ On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
|
|||
the ASCII encoding, whereas ``mbstowcs()`` and ``wcstombs()`` functions
|
||||
use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
|
||||
``os.fsencode()`` and ``os.fsdecode()`` use
|
||||
Python codec of the locale encoding. For example, if command line
|
||||
``locale.getpreferredencoding()`` codec. For example, if command line
|
||||
arguments are decoded by ``mbstowcs()`` and encoded back by
|
||||
``os.fsencode()``, an ``UnicodeEncodeError`` exception is raised instead
|
||||
of retrieving the original byte string.
|
||||
|
||||
To fix this issue, from Python 3.4, a check is made to see if ``mbstowcs()``
|
||||
To fix this issue, Python checks since Python 3.4 if ``mbstowcs()``
|
||||
really uses the ASCII encoding if the ``LC_CTYPE`` uses the
|
||||
POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
|
||||
alias to ASCII). If not (the effective encoding is not ASCII), Python
|
||||
|
@@ -178,7 +195,7 @@ the UTF-8 encoding:
|
|||
* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
|
||||
* HP-UX: ``"C.utf8"``
|
||||
|
||||
It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
|
||||
It was proposed to add a ``C.UTF-8`` locale to the glibc: `glibc C.UTF-8
|
||||
proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.
|
||||
|
||||
It is not planned to add such a locale to BSD systems.
|
||||
|
@@ -190,16 +207,17 @@ Popularity of the UTF-8 encoding
|
|||
Python 3 uses UTF-8 by default for Python source files.
|
||||
|
||||
On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
|
||||
system data. For Windows, see `PEP 529`_: "Change Windows filesystem
|
||||
system data. For Windows, see the `PEP 529`_: "Change Windows filesystem
|
||||
encoding to UTF-8".
|
||||
|
||||
On Linux, UTF-8 became the de facto standard encoding,
|
||||
replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
|
||||
using different encodings for filenames and standard streams is likely
|
||||
to create mojibake, so UTF-8 is now used *everywhere*.
|
||||
to create mojibake, so UTF-8 is now used *everywhere* (at least for modern
|
||||
distributions using their default settings).
|
||||
|
||||
The UTF-8 encoding is the default encoding of XML and JSON file formats.
|
||||
As of January 2017, UTF-8 was used in `more than 88% of web pages
|
||||
The UTF-8 encoding is the default encoding of XML and JSON file format.
|
||||
In January 2017, UTF-8 was used in `more than 88% of web pages
|
||||
<https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
|
||||
Javascript, CSS, etc.).
|
||||
|
||||
|
@@ -209,7 +227,7 @@ information on the UTF-8 codec.
|
|||
.. note::
|
||||
Some applications and operating systems (especially Windows) use Byte
|
||||
Order Markers (BOM) to indicate the used Unicode encoding: UTF-7,
|
||||
UTF-8, UTF-16-LE, etc. BOM are not well supported and are rarely used in
|
||||
UTF-8, UTF-16-LE, etc. BOM are not well supported and rarely used in
|
||||
Python.
|
||||
|
||||
|
||||
|
@@ -231,8 +249,8 @@ Python 3 promotes Unicode everywhere including filenames. A solution to
|
|||
support filenames not decodable from the locale encoding was found: the
|
||||
``surrogateescape`` error handler (`PEP 383`_), which stores undecodable bytes
|
||||
as surrogate characters. This error handler is used by default for
|
||||
`operating system data`_, for example, by ``os.fsdecode()`` and
|
||||
``os.fsencode()`` (except on Windows which uses the ``strict`` error handler).
|
||||
`operating system data`_, by ``os.fsdecode()`` and ``os.fsencode()`` for
|
||||
example (except on Windows which uses the ``strict`` error handler).
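As a minimal sketch of the round trip described above (an illustration, not part of the PEP text; the sample bytes reuse the ``b'xxx\xff'`` filename from the use cases), the following shows how ``surrogateescape`` maps an undecodable byte to a lone surrogate and back, and why a later strict encode fails:

```python
# Illustrative sketch only: round-trip an undecodable byte through str.
raw = b"xxx\xff"  # the byte 0xFF is never valid in UTF-8

# Decoding with surrogateescape maps the bad byte to the lone surrogate U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)

# A strict encoder rejects the surrogate; this is the error that print()
# reports when such a filename is written to a strict stream.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```

The lossless round trip is what lets filenames pass through Python unchanged; the strict failure is the late error that the strict-handler alternative tries to surface earlier.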
|
||||
|
||||
|
||||
Standard streams
|
||||
|
@@ -252,17 +270,19 @@ using the ``surrogateescape`` error handler (`PEP 383`_). Displaying a
|
|||
filename to stdout raises a Unicode encode error if the filename
|
||||
contains an undecoded byte stored as a surrogate character.
|
||||
|
||||
Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the
|
||||
Python 3.5+ now uses ``surrogateescape`` for stdin and stdout if the
|
||||
POSIX locale is used: `issue #19977
|
||||
<http://bugs.python.org/issue19977>`_. The idea is to pass through
|
||||
`operating system data`_ even if it creates mojibake, because most UNIX
|
||||
applications work like that. Most UNIX applications store filenames as
|
||||
bytes, usually because bytes are first-citizen class in the used
|
||||
programming language, whereas Unicode is badly supported.
|
||||
`operating system data`_ even if it means mojibake, because most UNIX
|
||||
applications work like that. Such UNIX applications often store filenames as
|
||||
bytes, in many cases because their basic design principles (or those of the
|
||||
language they're implemented in) were laid down half a century ago when it
|
||||
was still a feat for computers to handle English text correctly, rather than
|
||||
humans having to work with raw numeric indexes.
|
||||
|
||||
.. note::
|
||||
The encoding and/or the error handler of standard streams can be
|
||||
overridden with the ``PYTHONIOENCODING`` environment variable.
|
||||
overriden with the ``PYTHONIOENCODING`` environment variable.
|
||||
|
||||
|
||||
Proposal
|
||||
|
@@ -271,27 +291,35 @@ Proposal
|
|||
Changes
|
||||
-------
|
||||
|
||||
Add a new UTF-8 mode, disabled by default, to ignore the locale and
|
||||
Add a new UTF-8 mode, enabled by default in the POSIX locale, but otherwise
|
||||
disabled by default, to ignore the locale and
|
||||
force the usage of the UTF-8 encoding with the ``surrogateescape`` error
|
||||
handler, instead of using the locale encoding (with ``strict`` or
|
||||
``surrogateescape`` error handler depending on the case).
|
||||
|
||||
Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
|
||||
bother users with encodings, but it can produce mojibake. It can be
|
||||
configured as strict to prevent mojibake: the UTF-8 encoding is used
|
||||
with the ``strict`` error handler for inputs and outputs, but the
|
||||
``surrogateescape`` error handler is still used for `operating system
|
||||
data`_.
|
||||
The "normal" UTF-8 mode uses ``surrogateescape`` on the standard input and
|
||||
output streams and opened files, as well as on all operating system
|
||||
interfaces. This is the mode implicitly activated by the POSIX locale.
|
||||
|
||||
A new ``-X utf8`` command line option and a ``PYTHONUTF8`` environment
|
||||
variable are added to control the UTF-8 mode. The UTF-8 mode is enabled
|
||||
by using ``-X utf8`` or ``PYTHONUTF8=1``. It can be configured as strict
|
||||
by using ``-X utf8=strict`` or ``PYTHONUTF8=strict``. Other option values
|
||||
fail with an error.
|
||||
The "strict" UTF-8 mode reduces the risk of producing or propagating mojibake:
|
||||
the UTF-8 encoding is used with the ``strict`` error handler for inputs and
|
||||
outputs, but the ``surrogateescape`` error handler is still used for
|
||||
`operating system data`_. This mode is never activated implicitly, but can
|
||||
be requested explicitly.
|
||||
|
||||
The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
|
||||
variable are added to control the UTF-8 mode.
|
||||
|
||||
The UTF-8 mode is enabled by ``-X utf8`` or ``PYTHONUTF8=1``.
|
||||
|
||||
The UTF-8 Strict mode is configured by ``-X utf8=strict`` or
|
||||
``PYTHONUTF8=strict``.
|
||||
|
||||
The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
|
||||
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
|
||||
|
||||
Other option values fail with an error.
|
||||
|
||||
Options priority for the UTF-8 mode:
|
||||
|
||||
* ``PYTHONLEGACYWINDOWSFSENCODING``
|
||||
|
@@ -300,8 +328,8 @@ Options priority for the UTF-8 mode:
|
|||
* POSIX locale
|
||||
|
||||
For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the UTF-8 mode,
|
||||
whereas ``LC_ALL=C python3.7 -X utf8=0`` disables the UTF-8 mode and
|
||||
uses the encoding of the POSIX locale.
|
||||
whereas ``LC_ALL=C python3.7 -X utf8=0`` disables the UTF-8 mode and so
|
||||
use the encoding of the POSIX locale.
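The priority rule in this example can be checked with a small sketch. It assumes a released CPython 3.7 or later, where the mode is exposed as ``sys.flags.utf8_mode``; that attribute is not named in this draft, and the helper ``utf8_mode_flag`` is a hypothetical name used only for illustration. The ``strict`` value from this draft is deliberately not exercised:

```python
import os
import subprocess
import sys


def utf8_mode_flag(xoption, env_overrides):
    """Start a child interpreter and return its sys.flags.utf8_mode value.

    Hypothetical helper for illustration; assumes CPython 3.7+.
    """
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    env.update(env_overrides)
    result = subprocess.run(
        [sys.executable, "-X", xoption, "-c",
         "import sys; print(sys.flags.utf8_mode)"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


# The command line option wins over the environment variable.
print(utf8_mode_flag("utf8", {"PYTHONUTF8": "0"}))

# An explicit -X utf8=0 turns the mode off even in the POSIX locale.
print(utf8_mode_flag("utf8=0", {"LC_ALL": "C"}))
```

Running the check in a child process keeps the parent interpreter's own configuration out of the picture.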
|
||||
|
||||
Encodings used by ``open()``, highest priority first:
|
||||
|
||||
|
@@ -339,8 +367,9 @@ sys.stderr locale/backslashreplace locale/backslashreplace
|
|||
============================ ======================= ==========================
|
||||
|
||||
The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
|
||||
strict mode for convenience: the idea is that data not encoded to UTF-8
|
||||
are passed through "Python" without being modified, as raw bytes.
|
||||
strict mode for consistency with other standard \*nix operating system
|
||||
components: the idea is that data not encoded to UTF-8 are passed through
|
||||
"Python" without being modified, as raw bytes.
|
||||
|
||||
The ``PYTHONIOENCODING`` environment variable has priority over the
|
||||
UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1
|
||||
|
@@ -390,23 +419,23 @@ locale encoding, and this code page never uses the ASCII encoding.
|
|||
Rationale
|
||||
---------
|
||||
|
||||
UTF-8 mode is disabled by default in order to keep hard Unicode errors when
|
||||
encoding or decoding `operating system data`_ fails and preserve
|
||||
backward compatibility. In addition, users will be better prepared for
|
||||
mojibake if it is their responsibility to explicitly enable UTF-8 mode
|
||||
than they would be if it was enabled *by default*.
|
||||
The UTF-8 mode is disabled by default to keep hard Unicode errors when
|
||||
encoding or decoding `operating system data`_ failed, and to keep the
|
||||
backward compatibility. The user is responsible to enable explicitly the
|
||||
UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
|
||||
mode would be enabled *by default*.
|
||||
|
||||
UTF-8 mode should be used on systems known to be configured with
|
||||
The UTF-8 mode should be used on systems known to be configured with
|
||||
UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
|
||||
the user overrides a locale *by mistake* or if a Python program is
|
||||
started with no locale configured (and so with the POSIX locale).
|
||||
|
||||
Most UNIX applications handle `operating system data`_ as bytes, so
|
||||
the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
|
||||
``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
|
||||
limited impact on how these data are handled by the application.
|
||||
|
||||
The UTF-8 mode should help make Python more interoperable with
|
||||
other UNIX applications on the system assuming that *UTF-8* is used
|
||||
The Python UTF-8 mode should help to make Python more interoperable with
|
||||
the other UNIX applications in the system assuming that *UTF-8* is used
|
||||
everywhere and that users *expect* UTF-8.
|
||||
|
||||
Ignoring ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
|
||||
|
@@ -455,17 +484,20 @@ surrogate characters.
|
|||
Use Cases
|
||||
=========
|
||||
|
||||
The following use cases were written to highlight the impact of
|
||||
the chosen encodings and error handlers on concrete examples.
|
||||
The following use cases were written to help to understand the impact of
|
||||
chosen encodings and error handlers on concrete examples.
|
||||
|
||||
The "Always work" results were written to prove the benefit of having a
|
||||
UTF-8 mode which works with any data and any locale, compared to the
|
||||
existing old Python versions.
|
||||
The "Exception?" column shows the potential benefit of having a UTF-8 mode which
|
||||
is closer to the traditional Python 2 behaviour of passing along raw binary data
|
||||
even if it isn't valid UTF-8.
|
||||
|
||||
The "Mojibake" column shows that ignoring the locale causes a practical
|
||||
issue: the UTF-8 mode produces mojibake if the terminal doesn't use the
|
||||
UTF-8 encoding.
|
||||
|
||||
The ideal configuration is "No exception, no risk of mojibake", but that isn't
|
||||
always possible in the presence of non-UTF-8 encoded binary data.
|
||||
|
||||
List a directory into stdout
|
||||
----------------------------
|
||||
|
||||
|
@@ -477,24 +509,25 @@ Script listing the content of the current directory into stdout::
|
|||
|
||||
Result:
|
||||
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 **Yes** **Yes**
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale **Yes** **Yes**
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============= =========
|
||||
======================== ========== =========
|
||||
Python Exception? Mojibake?
|
||||
======================== ========== =========
|
||||
Python 2 No **Yes**
|
||||
Python 3 **Yes** No
|
||||
Python 3.5, POSIX locale No **Yes**
|
||||
UTF-8 mode No **Yes**
|
||||
UTF-8 Strict mode **Yes** No
|
||||
======================== ========== =========
|
||||
|
||||
"No" means that the script can fail on decoding or encoding a filename
|
||||
depending on the locale or the filename.
|
||||
"Exception?" means that the script can fail on decoding or encoding a
|
||||
filename depending on the locale or the filename.
|
||||
|
||||
To be able to always work, the program must be able to produce mojibake.
|
||||
Mojibake is more user friendly than an error with a truncated or empty
|
||||
output.
|
||||
To be able to never fail that way, the program must be able to produce mojibake.
|
||||
For automated and interactive processes, mojibake is often more user friendly
|
||||
than an error with a truncated or empty output, since it confines the
|
||||
problem to the affected entry, rather than aborting the whole task.
|
||||
|
||||
For example, using a directory which contains a file called ``b'xxx\xff'``
|
||||
Example with a directory which contains the file called ``b'xxx\xff'``
|
||||
(the byte ``0xFF`` is invalid in UTF-8).
|
||||
|
||||
Default and UTF-8 Strict mode fail on ``print()`` with an encode error::
|
||||
|
@@ -511,7 +544,7 @@ Default and UTF-8 Strict mode fail on ``print()`` with an encode error::
|
|||
print(name)
|
||||
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
|
||||
|
||||
UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
|
||||
The UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
|
||||
but display mojibake::
|
||||
|
||||
$ python3.7 -X utf8 ../ls.py
|
||||
|
@@ -541,19 +574,19 @@ a text file::
|
|||
|
||||
Result:
|
||||
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 **Yes** **Yes**
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale No No
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============= =========
|
||||
======================== ========== =========
|
||||
Python Exception? Mojibake?
|
||||
======================== ========== =========
|
||||
Python 2 No **Yes**
|
||||
Python 3 **Yes** No
|
||||
Python 3.5, POSIX locale **Yes** No
|
||||
UTF-8 mode No **Yes**
|
||||
UTF-8 Strict mode **Yes** No
|
||||
======================== ========== =========
|
||||
|
||||
"Yes" implies that mojibake can be produced. "No" means that the script
|
||||
can fail on decoding or encoding a filename depending on the locale or
|
||||
the filename. Typical error::
|
||||
Again, never throwing an exception requires that mojibake can be produced, while
|
||||
preventing mojibake means that the script can fail on decoding or encoding a
|
||||
filename depending on the locale or the filename. Typical error::
|
||||
|
||||
$ LC_ALL=C python3 test.py
|
||||
Traceback (most recent call last):
|
||||
|
@@ -561,6 +594,12 @@ the filename. Typical error::
|
|||
fp.write("%s\n" % name)
|
||||
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
|
||||
|
||||
Compared with native system tools::
|
||||
|
||||
$ ls > /tmp/content.txt
|
||||
$ cat /tmp/content.txt
|
||||
xxx<78>
|
||||
|
||||
|
||||
Display Unicode characters into stdout
|
||||
--------------------------------------
|
||||
|
@@ -572,19 +611,29 @@ Very basic example used to illustrate a common issue, display the euro sign
|
|||
|
||||
Result:
|
||||
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 No No
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale No No
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode **Yes** **Yes**
|
||||
======================== ============= =========
|
||||
======================== ========== =========
|
||||
Python Exception? Mojibake?
|
||||
======================== ========== =========
|
||||
Python 2 **Yes** No
|
||||
Python 3 **Yes** No
|
||||
Python 3.5, POSIX locale **Yes** No
|
||||
UTF-8 mode No **Yes**
|
||||
UTF-8 Strict mode No **Yes**
|
||||
======================== ========== =========
|
||||
|
||||
The UTF-8 and UTF-8 Strict modes will always encode the euro sign as
|
||||
UTF-8. If the terminal uses a different encoding, we get mojibake.
|
||||
|
||||
For example, using ``iconv`` to emulate a GB-18030 terminal inside a
|
||||
UTF-8 one::
|
||||
|
||||
$ python3 -c 'print("euro: \u20ac")' | iconv -f gb18030 -t utf8
|
||||
euro: 鈧iconv: illegal input sequence at position 8
|
||||
|
||||
The misencoding also corrupts the trailing newline such that the output
|
||||
stream isn't actually a valid GB-18030 sequence, hence the error message after
|
||||
the euro symbol is misinterpreted as a hanzi character.
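The same mismatch can be reproduced without a terminal or ``iconv``. This is a minimal sketch using only the standard codecs, with the variable names chosen for illustration:

```python
# Illustrative sketch only: emulate a GB-18030 reader of UTF-8 output.
produced = "euro: \u20ac\n".encode("utf-8")  # bytes written in UTF-8 mode

# A strict GB-18030 consumer reads the first two bytes of the euro sign as a
# single character and then hits an invalid sequence, as iconv reports above.
try:
    produced.decode("gb18030")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc
print(strict_error)

# With a lenient error handler the consumer keeps going and shows mojibake.
lenient = produced.decode("gb18030", errors="replace")
print(ascii(lenient))
```

The ASCII prefix survives in both cases, which is why such output stays mostly readable even when the encodings disagree.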
|
||||
|
||||
|
||||
Replace a word in a text
|
||||
------------------------
|
||||
|
@@ -598,15 +647,20 @@ reads input from stdin and writes the output into stdout::
|
|||
|
||||
Result:
|
||||
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 **Yes** **Yes**
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale **Yes** **Yes**
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============= =========
|
||||
======================== ========== =========
|
||||
Python Exception? Mojibake?
|
||||
======================== ========== =========
|
||||
Python 2 No **Yes**
|
||||
Python 3 **Yes** No
|
||||
Python 3.5, POSIX locale No **Yes**
|
||||
UTF-8 mode No **Yes**
|
||||
UTF-8 Strict mode **Yes** No
|
||||
======================== ========== =========
|
||||
|
||||
This is a case where passing along the raw bytes (by way of the
|
||||
``surrogateescape`` error handler) will bring Python 3's behaviour back into
|
||||
line with standard operating system tools like ``sed`` and ``awk``.
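A minimal sketch of that pass-through behaviour, working on bytes directly instead of the real standard streams (``replace_word`` is a hypothetical helper name, not something defined by the PEP):

```python
def replace_word(data: bytes, old: str, new: str) -> bytes:
    """Replace a word while leaving undecodable bytes untouched.

    Hypothetical helper: it mimics what UTF-8 mode does on stdin and stdout
    by decoding and encoding with the surrogateescape error handler.
    """
    text = data.decode("utf-8", errors="surrogateescape")
    return text.replace(old, new).encode("utf-8", errors="surrogateescape")


# The input mixes valid UTF-8 with a stray 0xFF byte.
line = b"apple \xff caf\xc3\xa9\n"
print(replace_word(line, "apple", "orange"))
```

Only the matched word changes; the invalid byte and the valid non-ASCII text are both emitted exactly as they were read.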
|
||||
|
||||
|
||||
Producer-consumer model using pipes
|
||||
-----------------------------------
|
||||
|
@@ -618,12 +672,13 @@ On a shell, such programs are run with the command::
|
|||
|
||||
producer | consumer
|
||||
|
||||
The question is if these programs will work with any data and any locale.
|
||||
The question if these programs will work with any data and any locale.
|
||||
UNIX users don't expect Unicode errors, and so expect that such programs
|
||||
"just work".
|
||||
"just works", in the sense that Unicode errors may cause problems in the data
|
||||
stream, but won't cause the entire stream processing *itself* to abort.
|
||||
|
||||
If the producer only produces ASCII output, no error should occur. Let's
|
||||
say the that the producer writes at least one non-ASCII character (at least
|
||||
say that the producer writes at least one non-ASCII character (at least
|
||||
one byte in the range ``0x80..0xff``).
|
||||
|
||||
To simplify the problem, let's say that the consumer has no output
|
||||
|
@@ -633,18 +688,20 @@ A "Bytes producer" is an application which cannot fail with a Unicode
|
|||
error and produces bytes into stdout.
|
||||
|
||||
Let's say that a "Bytes consumer" does not decode stdin but stores data
|
||||
as bytes: such a consumer always works. Common UNIX command line tools like
|
||||
as bytes: such consumer always work. Common UNIX command line tools like
|
||||
``cat``, ``grep`` or ``sed`` are in this category. Many Python 2
|
||||
applications are also in this category.
|
||||
applications are also in this category, as are applications that work
|
||||
with the lower level binary input and output stream in Python 3 rather than
|
||||
the default text mode streams.
|
||||
|
||||
"Python producer" and "Python consumer" are a producer and consumer
|
||||
implemented in Python.
|
||||
"Python producer" and "Python consumer" are producer and consumer
|
||||
implemented in Python using the default text mode input and output streams.
|
||||
|
||||
Bytes producer, Bytes consumer
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
It always works, but it is out of the scope of this PEP since it doesn't
|
||||
involve Python.
|
||||
This won't throw exceptions, but it is out of the scope of this PEP since it
|
||||
doesn't involve Python's default text mode input and output streams.
|
||||
|
||||
Python producer, Bytes consumer
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
@@ -655,15 +712,15 @@ Python producer::
|
|||
|
||||
Result:
|
||||
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 No No
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale No No
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============= =========
|
||||
======================== ========== =========
|
||||
Python Exception? Mojibake?
|
||||
======================== ========== =========
|
||||
Python 2 **Yes** No
|
||||
Python 3 **Yes** No
|
||||
Python 3.5, POSIX locale **Yes** No
|
||||
UTF-8 mode No **Yes**
|
||||
UTF-8 Strict mode No **Yes**
|
||||
======================== ========== =========
|
||||
|
||||
The question here is not if the consumer is able to decode the input,
|
||||
but if Python is able to produce its output. So it's similar to the
|
||||
|
@@ -684,17 +741,18 @@ Python consumer::
|
|||
|
||||
Result:
|
||||
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 **Yes** **Yes**
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale **Yes** **Yes**
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============= =========
|
||||
======================== ========== =========
|
||||
Python Exception? Mojibake?
|
||||
======================== ========== =========
|
||||
Python 2 No **Yes**
|
||||
Python 3 **Yes** No
|
||||
Python 3.5, POSIX locale No **Yes**
|
||||
UTF-8 mode No **Yes**
|
||||
UTF-8 Strict mode **Yes** No
|
||||
======================== ========== =========
|
||||
|
||||
Python 3 fails on decoding stdin depending on the input and the locale.
|
||||
Python 3 may throw an exception on decoding stdin depending on the input and
|
||||
the locale.
|
||||
|
||||
|
||||
Python producer, Python consumer
|
||||
|
@@ -711,22 +769,30 @@ Python consumer::
|
|||
result = text.replace("apple", "orange")
|
||||
# ignore the result
|
||||
|
||||
Result, using the same Python version for the producer and the consumer:
|
||||
Result, same Python version used for the producer and the consumer:
|
||||
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 No No
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale No No
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============= =========
|
||||
======================== ========== =========
|
||||
Python Exception? Mojibake?
|
||||
======================== ========== =========
|
||||
Python 2 **Yes** No
|
||||
Python 3 **Yes** No
|
||||
Python 3.5, POSIX locale **Yes** No
|
||||
UTF-8 mode No No(!)
|
||||
UTF-8 Strict mode No No(!)
|
||||
======================== ========== =========
|
||||
|
||||
This case combines a Python producer with a Python consumer, so the
|
||||
result is the subset of `Python producer, Bytes consumer`_ and `Bytes
|
||||
producer, Python consumer`_.
|
||||
This case combines a Python producer with a Python consumer, and the
|
||||
result is mainly the same as that for `Python producer, Bytes consumer`_,
|
||||
since the consumer can't read what the producer can't emit.
|
||||
|
||||
However, the behaviour of the "UTF-8" and "UTF-8 Strict" modes in this
|
||||
configuration is notable: they don't produce an exception, *and* they shouldn't
|
||||
produce mojibake, as both the producer and the consumer are making *consistent*
|
||||
assumptions regarding the text encoding used on the pipe between them
|
||||
(i.e. UTF-8).
|
||||
|
||||
Any mojibake generated would only be in the interfaces between the consuming
|
||||
component and the outside world (e.g. the terminal, or when writing to a file).
|
||||
|
||||
Backward Compatibility
|
||||
======================
|
||||
|
@@ -736,16 +802,26 @@ used by default if the locale is POSIX. Since the UTF-8 encoding is used
|
|||
with the ``surrogateescape`` error handler, encoding errors should not
|
||||
occur and so the change should not break applications.
|
||||
|
||||
The UTF-8 encoding is also quite restrictive regarding where it allows
|
||||
plain ASCII code points to appear in the byte stream, so even for
|
||||
ASCII-incompatible encodings, such byte values will often be escaped rather
|
||||
than being processed as ASCII characters.
|
||||
|
||||
The more likely source of trouble comes from external libraries. Python
|
||||
can successfully decode data from UTF-8, but a library using the locale
|
||||
encoding can fail to encode the decoded text back to bytes. Hopefully,
|
||||
encoding text in a library is a rare operation. Very few libraries
|
||||
expect text, most libraries expect bytes and even manipulate bytes
|
||||
internally.
|
||||
can decode successfully data from UTF-8, but a library using the locale
|
||||
encoding can fail to encode the decoded text back to bytes. For example,
|
||||
GNU readline currently has problems on Android due to the mismatch between
|
||||
CPython's encoding assumptions there (always UTF-8) and GNU readline's
|
||||
encoding assumptions (which are based on the nominal locale).
|
||||
|
||||
The PEP only changes the default behaviour if the locale is POSIX. For
|
||||
other locales, the *default* behaviour is unchanged.
|
||||
|
||||
PEP 538 is a follow-up to this PEP that extends CPython's assumptions to other
|
||||
locale-aware components in the same process by explicitly coercing the POSIX
|
||||
locale to something more suitable for modern text processing. See that PEP
|
||||
for further details.
|
||||
|
||||
|
||||
Alternatives
|
||||
============
|
||||
|
@@ -754,14 +830,14 @@ Don't modify the encoding of the POSIX locale
|
|||
---------------------------------------------
|
||||
|
||||
A first version of the PEP did not change the encoding and error handler
|
||||
used for the POSIX locale.
|
||||
used of the POSIX locale.
|
||||
|
||||
The problem is that adding the ``-X utf8`` command line option or
|
||||
setting the ``PYTHONUTF8`` environment variable is not possible in some
|
||||
cases, or at least not convenient.
|
||||
|
||||
Moreover, many users simply expect that Python 3 behaves like Python 2:
|
||||
it doesn't bother them with encodings and "just works" in all cases. These
|
||||
Moreover, many users simply expect that Python 3 behaves as Python 2:
|
||||
don't bother them with encodings and "just works" in all cases. These
|
||||
users don't worry about mojibake, or even expect mojibake because of
|
||||
complex documents using multiple incompatible encodings.
|
||||
|
||||
|
@@ -773,9 +849,10 @@ Python already always uses the UTF-8 encoding on Mac OS X, Android and
|
|||
Windows. Since UTF-8 became the de facto encoding, it makes sense to
|
||||
always use it on all platforms with any locale.
|
||||
|
||||
The risk is to introduce mojibake if the locale uses a different
|
||||
encoding, especially for locales other than the POSIX locale.
|
||||
|
||||
The problem with this approach is that Python is also used extensively in
|
||||
desktop environments, and it is often a practical or even legal requirement
|
||||
to support locale encodings other than UTF-8 (for example, GB-18030 in China,
|
||||
and Shift-JIS or ISO-2022-JP in Japan).
|
||||
|
||||
Force UTF-8 for the POSIX locale
|
||||
--------------------------------
|
||||
|
@@ -783,7 +860,7 @@ Force UTF-8 for the POSIX locale
|
|||
An alternative to always using UTF-8 in any case is to only use UTF-8 when the
|
||||
``LC_CTYPE`` locale is the POSIX locale.
|
||||
|
||||
`PEP 538`_ "Coercing the legacy C locale to C.UTF-8" by Nick
|
||||
The `PEP 538`_ "Coercing the legacy C locale to C.UTF-8" of Nick
|
||||
Coghlan proposes to implement that using the ``C.UTF-8`` locale.
|
||||
|
||||
|
||||
|
@@ -791,14 +868,14 @@ Use the strict error handler for operating system data
|
|||
------------------------------------------------------
|
||||
|
||||
Using the ``surrogateescape`` error handler for `operating system data`_
|
||||
creates surprising surrogate characters. No Python codec (except for
|
||||
``utf-7``) accepts surrogates so encoding text coming from the
|
||||
operating system is likely to raise an error. The problem is that
|
||||
creates surprising surrogate characters. No Python codec (except of
|
||||
``utf-7``) accept surrogates, and so encoding text coming from the
|
||||
operating system is likely to raise an error error. The problem is that
|
||||
the error comes late, very far from where the data was read.
|
||||
|
||||
The ``strict`` error handler can be used instead to decode
|
||||
(``os.fsdecode()``) and encode (``os.fsencode()``) operating system
|
||||
data and raise encoding errors as soon as possible. Using it helps find
|
||||
data, to raise encoding errors as soon as possible. It helps to find
|
||||
bugs more quickly.
|
||||
|
||||
The main drawback of this strategy is that it doesn't work in practice.
|
||||
|
|