Fix a couple of issues with pep0540 (#252)
parent
95bdb222e1
commit
ae226965ea
302
pep-0540.txt
|
@ -17,11 +17,11 @@ Abstract
|
|||
Add a new UTF-8 mode, disabled by default, to ignore the locale and
|
||||
force the usage of the UTF-8 encoding.
|
||||
|
||||
Basically, the UTF-8 mode behaves as Python 2: it "just works" and don't
|
||||
Basically, UTF-8 mode behaves as Python 2: it "just works" and doesn't
|
||||
bother users with encodings, but it can produce mojibake. The UTF-8 mode
|
||||
can be configured as strict to prevent mojibake.
|
||||
|
||||
New ``-X utf8`` command line option and ``PYTHONUTF8`` environment
|
||||
A new ``-X utf8`` command line option and a ``PYTHONUTF8`` environment
|
||||
variable are added to control the UTF-8 mode. The POSIX locale enables
|
||||
the UTF-8 mode.
|
||||
|
||||
|
@ -35,31 +35,31 @@ Rationale
|
|||
Since Python 3.0 was released in 2008, the usual answer to users getting
|
||||
Unicode errors is to ask developers to fix their code to handle Unicode
|
||||
properly. Most applications and Python modules were fixed, but users
|
||||
keep reporting Unicode errors regulary: see the long list of issues in
|
||||
kept reporting Unicode errors regularly: see the long list of issues in
|
||||
the `Links`_ section below.
|
||||
|
||||
In fact, a second class of bug comes from a locale which is not properly
|
||||
configured. The usual answer to such bug report is: "it is not a bug,
|
||||
In fact, a second class of bugs comes from a locale which is not properly
|
||||
configured. The usual answer to such a bug report is: "it is not a bug,
|
||||
you must fix your locale".
|
||||
|
||||
Technically, the answer is correct, but from a practical point of view,
|
||||
the answer is not acceptable. In many cases, "fixing the issue" is an
|
||||
the answer is not acceptable. In many cases, "fixing the issue" is a
|
||||
hard task. Moreover, sometimes, the usage of the POSIX locale is
|
||||
deliberate.
|
||||
|
||||
A good example of a concrete issue are build systems which create a
|
||||
fresh environment for each build using a chroot, a container, a virtual
|
||||
machine or something else to get reproductible builds. Such setup
|
||||
usually uses the POSIX locale. To get 100% reproductible builds, the
|
||||
machine or something else to get reproducible builds. Such a setup
|
||||
usually uses the POSIX locale. To get 100% reproducible builds, the
|
||||
POSIX locale is a good choice: see the `Locales section of
|
||||
reproducible-builds.org
|
||||
<https://reproducible-builds.org/docs/locales/>`_.
|
||||
|
||||
UNIX users don't expect Unicode errors, since the common command line
|
||||
tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors.
|
||||
These users expect that Python 3 "just works" with any locale and don't
|
||||
These users expect that Python 3 "just works" with any locale and won't
|
||||
bother them with encodings. From their point of view, the bug is not
|
||||
their locale but is obviously Python 3.
|
||||
their locale, it's obviously Python 3.
|
||||
|
||||
Since Python 2 handles data as bytes, it's rarer in Python 2
|
||||
compared to Python 3 to get Unicode errors. It also explains why users
|
||||
|
@ -68,7 +68,7 @@ also perceive Python 3 as the root cause of their Unicode errors.
|
|||
Some users expect that Python 3 just works with any locale and so don't
|
||||
bother with mojibake, whereas some developers are working hard to prevent
|
||||
mojibake and so expect that Python 3 fails early before creating
|
||||
mojibake.
|
||||
it.
|
||||
|
||||
Since different groups of users have different expectations, there is no
|
||||
silver bullet which solves all issues at once. Last but not least,
|
||||
|
@ -105,7 +105,7 @@ decode and encode operating system data. These functions use the
|
|||
filesystem error handler: ``sys.getfilesystemencodeerrors()``.
|
||||
|
||||
.. note::
|
||||
In some corner case, the *current* ``LC_CTYPE`` locale must be used
|
||||
In some corner cases, the *current* ``LC_CTYPE`` locale must be used
|
||||
instead of ``sys.getfilesystemencoding()``. For example, the ``time``
|
||||
module uses the *current* ``LC_CTYPE`` locale to decode timezone
|
||||
names.
|
||||
|
@ -121,7 +121,7 @@ this preference order:
|
|||
* ``LC_CTYPE``
|
||||
* ``LANG``
|
||||
|
||||
The POSIX locale,also known as "the C locale", is used:
|
||||
The POSIX locale, also known as "the C locale", is used:
|
||||
|
||||
* if the first set variable is set to ``"C"``
|
||||
* if all these variables are unset, for example when a program is
|
||||
|
@ -140,7 +140,7 @@ arguments are decoded by ``mbstowcs()`` and encoded back by
|
|||
``os.fsencode()``, an ``UnicodeEncodeError`` exception is raised instead
|
||||
of retrieving the original byte string.
|
||||
|
||||
To fix this issue, Python checks since Python 3.4 if ``mbstowcs()``
|
||||
To fix this issue, from Python 3.4, a check is made to see if ``mbstowcs()``
|
||||
really uses the ASCII encoding if the ``LC_CTYPE`` uses the
|
||||
POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
|
||||
alias to ASCII). If not (the effective encoding is not ASCII), Python
|
||||
|
@ -158,7 +158,7 @@ In many cases, the POSIX locale is not really expected by users who get
|
|||
it by mistake. Examples:
|
||||
|
||||
* program started in an empty environment
|
||||
* User forcing LANG=C to get messages in english
|
||||
* User forcing LANG=C to get messages in English
|
||||
* LANG=C used for bad reasons, without being aware of the ASCII encoding
|
||||
* SSH shell
|
||||
* Linux installed with no configured locale
|
||||
|
@ -178,7 +178,7 @@ the UTF-8 encoding:
|
|||
* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
|
||||
* HP-UX: ``"C.utf8"``
|
||||
|
||||
It was proposed to add a ``C.UTF-8`` locale to the glibc: `glibc C.UTF-8
|
||||
It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
|
||||
proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.
|
||||
|
||||
It is not planned to add such a locale to BSD systems.
|
||||
|
@ -190,7 +190,7 @@ Popularity of the UTF-8 encoding
|
|||
Python 3 uses UTF-8 by default for Python source files.
|
||||
|
||||
On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
|
||||
system data. For Windows, see the `PEP 529`_: "Change Windows filesystem
|
||||
system data. For Windows, see `PEP 529`_: "Change Windows filesystem
|
||||
encoding to UTF-8".
|
||||
|
||||
On Linux, UTF-8 became the de facto standard encoding,
|
||||
|
@ -198,8 +198,8 @@ replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
|
|||
using different encodings for filenames and standard streams is likely
|
||||
to create mojibake, so UTF-8 is now used *everywhere*.
|
||||
|
||||
The UTF-8 encoding is the default encoding of XML and JSON file format.
|
||||
In January 2017, UTF-8 was used in `more than 88% of web pages
|
||||
The UTF-8 encoding is the default encoding of XML and JSON file formats.
|
||||
As of January 2017, UTF-8 was used in `more than 88% of web pages
|
||||
<https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
|
||||
JavaScript, CSS, etc.).
|
||||
|
||||
|
@ -209,7 +209,7 @@ information on the UTF-8 codec.
|
|||
.. note::
|
||||
Some applications and operating systems (especially Windows) use Byte
|
||||
Order Markers (BOM) to indicate the used Unicode encoding: UTF-7,
|
||||
UTF-8, UTF-16-LE, etc. BOM are not well supported and rarely used in
|
||||
UTF-8, UTF-16-LE, etc. BOM are not well supported and are rarely used in
|
||||
Python.
|
||||
|
||||
|
||||
|
@ -221,7 +221,7 @@ the wild which don't use UTF-8. And there are a lot of data stored in
|
|||
different encodings. For example, an old USB key using the ext3
|
||||
filesystem with filenames encoded to ISO 8859-1.
|
||||
|
||||
The Linux kernel and the libc don't decode filenames: a filename is used
|
||||
The Linux kernel and libc don't decode filenames: a filename is used
|
||||
as a raw array of bytes. The common solution to support any filename is
|
||||
to store filenames as bytes and not try to decode them. When displayed
|
||||
to stdout, mojibake is displayed if the filename and the terminal don't
|
||||
|
@ -231,8 +231,8 @@ Python 3 promotes Unicode everywhere including filenames. A solution to
|
|||
support filenames not decodable from the locale encoding was found: the
|
||||
``surrogateescape`` error handler (`PEP 383`_), store undecodable bytes
|
||||
as surrogate characters. This error handler is used by default for
|
||||
`operating system data`_, by ``os.fsdecode()`` and ``os.fsencode()`` for
|
||||
example (except on Windows which uses the ``strict`` error handler).
|
||||
`operating system data`_, for example, by ``os.fsdecode()`` and
|
||||
``os.fsencode()`` (except on Windows which uses the ``strict`` error handler).
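As an illustration of the ``surrogateescape`` behaviour described above, the
following minimal sketch decodes a byte string that is not valid UTF-8 and
encodes it back. The variable names are only illustrative; the byte string is
the same ``b'xxx\xff'`` used in the examples of this PEP.

```python
# Minimal sketch: round-trip an undecodable byte with surrogateescape.
raw = b"xxx\xff"  # 0xFF is never valid in UTF-8

# The undecodable byte 0xFF is stored as the lone surrogate U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same error handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")

# The strict error handler refuses to encode the surrogate.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The round trip is lossless only when the same encoding and error handler are
used on both sides.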
|
||||
|
||||
|
||||
Standard streams
|
||||
|
@ -243,7 +243,7 @@ stderr. The ``strict`` error handler is used by stdin and stdout to
|
|||
prevent mojibake.
|
||||
|
||||
The ``backslashreplace`` error handler is used by stderr to avoid
|
||||
Unicode encode error when displaying non-ASCII text. It is especially
|
||||
Unicode encode errors when displaying non-ASCII text. It is especially
|
||||
useful when the POSIX locale is used, because this locale usually uses
|
||||
the ASCII encoding.
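To make the difference between the two error handlers concrete, here is a
small sketch that encodes a non-ASCII string to ASCII, which is roughly what
happens on stderr under the POSIX locale. The sample string is an assumption
chosen for illustration.

```python
# Minimal sketch: backslashreplace versus strict when encoding to ASCII.
text = "price: \u20ac"  # contains the euro sign, which ASCII cannot encode

# backslashreplace never fails: the character is written as an escape sequence.
escaped = text.encode("ascii", errors="backslashreplace")

# strict raises instead of producing output.
try:
    text.encode("ascii", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The escaped output is not pretty, but the message still reaches the user,
which matters for tracebacks and error messages.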
|
||||
|
||||
|
@ -254,15 +254,15 @@ contains an undecoded byte stored as a surrogate character.
|
|||
|
||||
Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the
|
||||
POSIX locale is used: `issue #19977
|
||||
<http://bugs.python.org/issue19977>`_. The idea is to passthrough
|
||||
`operating system data`_ even if it means mojibake, because most UNIX
|
||||
<http://bugs.python.org/issue19977>`_. The idea is to pass through
|
||||
`operating system data`_ even if it creates mojibake, because most UNIX
|
||||
applications work like that. Most UNIX applications store filenames as
|
||||
bytes, usually simply because bytes are first-citizen class in the used
|
||||
bytes, usually because bytes are first-class citizens in the
|
||||
programming language, whereas Unicode is badly supported.
|
||||
|
||||
.. note::
|
||||
The encoding and/or the error handler of standard streams can be
|
||||
overriden with the ``PYTHONIOENCODING`` environment variable.
|
||||
overridden with the ``PYTHONIOENCODING`` environment variable.
|
||||
|
||||
|
||||
Proposal
|
||||
|
@ -276,18 +276,18 @@ force the usage of the UTF-8 encoding with the ``surrogateescape`` error
|
|||
handler, instead using the locale encoding (with ``strict`` or
|
||||
``surrogateescape`` error handler depending on the case).
|
||||
|
||||
Basically, the UTF-8 mode behaves as Python 2: it "just works" and don't
|
||||
Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
|
||||
bother users with encodings, but it can produce mojibake. It can be
|
||||
configured as strict to prevent mojibake: the UTF-8 encoding is used
|
||||
with the ``strict`` error handler for inputs and outputs, but the
|
||||
``surrogateescape`` error handler is still used for `operating system
|
||||
data`_.
|
||||
|
||||
New ``-X utf8`` command line option and ``PYTHONUTF8`` environment
|
||||
A new ``-X utf8`` command line option and a ``PYTHONUTF8`` environment
|
||||
variable are added to control the UTF-8 mode. The UTF-8 mode is enabled
|
||||
by ``-X utf8`` or ``PYTHONUTF8=1``. The UTF-8 is configured as strict
|
||||
by ``-X utf8=strict`` or ``PYTHONUTF8=strict``. Other option values fail
|
||||
with an error.
|
||||
by using ``-X utf8`` or ``PYTHONUTF8=1``. It can be configured as strict
|
||||
by using ``-X utf8=strict`` or ``PYTHONUTF8=strict``. Other option values
|
||||
fail with an error.
|
||||
|
||||
The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
|
||||
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
|
||||
|
@ -300,23 +300,23 @@ Options priority for the UTF-8 mode:
|
|||
* POSIX locale
|
||||
|
||||
For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the UTF-8 mode,
|
||||
whereas ``LC_ALL=C python3.7 -X utf8=0`` disables the UTF-8 mode and so
|
||||
use the encoding of the POSIX locale.
|
||||
whereas ``LC_ALL=C python3.7 -X utf8=0`` disables the UTF-8 mode and
|
||||
uses the encoding of the POSIX locale.
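The priority rules above can be sketched as a small pure function. This is
only a model of the text, not the interpreter's implementation: the function
name ``resolve_utf8_mode``, its parameters and its return values are
hypothetical, a bare ``-X utf8`` is assumed to be passed as ``"1"``, and the
Windows-specific ``PYTHONLEGACYWINDOWSFSENCODING`` case is left out.

```python
from typing import Optional

# Accepted option values, mapped to a hypothetical mode name.
_VALUES = {"0": "off", "1": "on", "strict": "strict"}


def resolve_utf8_mode(x_option: Optional[str],
                      env_value: Optional[str],
                      posix_locale: bool) -> str:
    """Return "off", "on" or "strict" (hypothetical names).

    x_option:     value of ``-X utf8`` ("1" for a bare ``-X utf8``), or None
    env_value:    value of ``PYTHONUTF8``, or None if unset
    posix_locale: True if the POSIX locale is in use
    """
    # The command line option wins over the environment variable.
    for value in (x_option, env_value):
        if value is not None:
            if value not in _VALUES:
                # Other option values fail with an error.
                raise ValueError(f"invalid UTF-8 mode value: {value!r}")
            return _VALUES[value]
    # With no explicit setting, only the POSIX locale enables the mode.
    return "on" if posix_locale else "off"
```

For instance, ``resolve_utf8_mode("1", "0", False)`` models
``PYTHONUTF8=0 python3 -X utf8`` and ``resolve_utf8_mode("0", None, True)``
models ``LC_ALL=C python3.7 -X utf8=0``. Note that this sketch validates only
the setting that wins.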
|
||||
|
||||
Encodings used by ``open()``, highest priority first:
|
||||
|
||||
* *encoding* and *errors* parameters (if set)
|
||||
* UTF-8 mode
|
||||
* os.device_encoding(fd)
|
||||
* os.getpreferredencoding(False)
|
||||
* ``os.device_encoding(fd)``
|
||||
* ``os.getpreferredencoding(False)``
|
||||
|
||||
|
||||
Encoding and error handler
|
||||
--------------------------
|
||||
|
||||
The UTF-8 mode changes the default encoding and error handler used by
|
||||
open(), os.fsdecode(), os.fsencode(), sys.stdin, sys.stdout and
|
||||
sys.stderr:
|
||||
``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
|
||||
``sys.stdout`` and ``sys.stderr``:
|
||||
|
||||
============================ ======================= ========================== ==========================
|
||||
Function Default UTF-8 mode or POSIX locale UTF-8 Strict mode
|
||||
|
@ -342,7 +342,7 @@ The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
|
|||
strict mode for convenience: the idea is that data not encoded to UTF-8
|
||||
are passed through "Python" without being modified, as raw bytes.
|
||||
|
||||
The ``PYTHONIOENCODING`` environment variable has the priority over the
|
||||
The ``PYTHONIOENCODING`` environment variable has priority over the
|
||||
UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1
|
||||
python3 -X utf8`` uses the Latin1 encoding for stdin, stdout and stderr.
|
||||
|
||||
|
@ -372,15 +372,15 @@ sys.stderr UTF-8/backslashreplace UTF-8/backslashreplace
|
|||
============================ ======================= ==========================
|
||||
|
||||
The "Legacy Windows FS encoding" is enabled by setting the
|
||||
``PYTHONLEGACYWINDOWSFSENCODING`` environment variable to ``1``, see the
|
||||
`PEP 529`.
|
||||
``PYTHONLEGACYWINDOWSFSENCODING`` environment variable to ``1`` as specified
|
||||
in `PEP 529`_.
|
||||
|
||||
Enabling the legacy Windows filesystem encoding disables the UTF-8 mode
|
||||
(as ``-X utf8=0``).
|
||||
|
||||
If stdin and/or stdout is redirected to a pipe, sys.stdin and/or
|
||||
sys.output uses ``mbcs`` encoding by default, rather than UTF-8. But
|
||||
with the UTF-8 mode, sys.stdin and sys.stdout always use the UTF-8
|
||||
If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
|
||||
``sys.stdout`` use ``mbcs`` encoding by default rather than UTF-8. But
|
||||
with the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
|
||||
encoding.
|
||||
|
||||
There is no POSIX locale on Windows. The ANSI code page is used as the
|
||||
|
@ -390,23 +390,23 @@ locale encoding, and this code page never uses the ASCII encoding.
|
|||
Rationale
|
||||
---------
|
||||
|
||||
The UTF-8 mode is disabled by default to keep hard Unicode errors when
|
||||
encoding or decoding `operating system data`_ failed, and to keep the
|
||||
backward compatibility. The user is responsible to enable explicitly the
|
||||
UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
|
||||
mode would be enabled *by default*.
|
||||
UTF-8 mode is disabled by default in order to keep hard Unicode errors when
|
||||
encoding or decoding `operating system data`_ fails and preserve
|
||||
backward compatibility. In addition, users will be better prepared for
|
||||
mojibake if it is their responsibility to explicitly enable UTF-8 mode
|
||||
than they would be if it was enabled *by default*.
|
||||
|
||||
The UTF-8 mode should be used on systems known to be configured with
|
||||
UTF-8 mode should be used on systems known to be configured with
|
||||
UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
|
||||
the user overrides a locale *by mistake* or if a Python program is
|
||||
started with no locale configured (and so with the POSIX locale).
|
||||
|
||||
Most UNIX applications handle `operating system data`_ as bytes, so
|
||||
``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
|
||||
the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
|
||||
limited impact on how these data are handled by the application.
|
||||
|
||||
The Python UTF-8 mode should help to make Python more interoperable with
|
||||
the other UNIX applications in the system assuming that *UTF-8* is used
|
||||
The UTF-8 mode should help make Python more interoperable with
|
||||
other UNIX applications on the system assuming that *UTF-8* is used
|
||||
everywhere and that users *expect* UTF-8.
|
||||
|
||||
Ignoring ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
|
||||
|
@ -434,14 +434,14 @@ it should work fine.
|
|||
External code using bytes
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
If the external code process data as bytes, surrogate characters are not
|
||||
If the external code processes data as bytes, surrogate characters are not
|
||||
an issue since they are only used inside Python. Python encodes back
|
||||
surrogate characters to bytes at the edges, before calling external
|
||||
code.
|
||||
|
||||
The UTF-8 mode can produce mojibake since Python and external code don't
|
||||
both of invalid bytes, but it's a deliberate choice. The UTF-8 mode can
|
||||
be configured as strict to prevent mojibake and be fail early when data
|
||||
be configured as strict to prevent mojibake and fail early when data
|
||||
is not decodable from UTF-8 or not encodable to UTF-8.
|
||||
|
||||
External code using text
|
||||
|
@ -455,14 +455,14 @@ surrogate characters.
|
|||
Use Cases
|
||||
=========
|
||||
|
||||
The following use cases were written to help to understand the impact of
|
||||
chosen encodings and error handlers on concrete examples.
|
||||
The following use cases were written to highlight the impact of
|
||||
the chosen encodings and error handlers on concrete examples.
|
||||
|
||||
The "Always work" results were written to prove the benefit of having a
|
||||
UTF-8 mode which works with any data and any locale, compared to the
|
||||
existing old Python versions.
|
||||
|
||||
The "Mojibake" column shows that ignoring the locale causes a pratical
|
||||
The "Mojibake" column shows that ignoring the locale causes a practical
|
||||
issue: the UTF-8 mode produces mojibake if the terminal doesn't use the
|
||||
UTF-8 encoding.
|
||||
|
||||
|
@ -477,15 +477,15 @@ Script listing the content of the current directory into stdout::
|
|||
|
||||
Result:
|
||||
|
||||
======================== ============ =========
|
||||
Python Always work? Mojibake?
|
||||
======================== ============ =========
|
||||
Python 2 **Yes** **Yes**
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale **Yes** **Yes**
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============ =========
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 **Yes** **Yes**
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale **Yes** **Yes**
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============= =========
|
||||
|
||||
"No" means that the script can fail on decoding or encoding a filename
|
||||
depending on the locale or the filename.
|
||||
|
@ -494,7 +494,7 @@ To be able to always work, the program must be able to produce mojibake.
|
|||
Mojibake is more user friendly than an error with a truncated or empty
|
||||
output.
|
||||
|
||||
Example with a directory which contains the file called ``b'xxx\xff'``
|
||||
For example, using a directory which contains a file called ``b'xxx\xff'``
|
||||
(the byte ``0xFF`` is invalid in UTF-8).
|
||||
|
||||
Default and UTF-8 Strict mode fail on ``print()`` with an encode error::
|
||||
|
@ -511,7 +511,7 @@ Default and UTF-8 Strict mode fail on ``print()`` with an encode error::
|
|||
print(name)
|
||||
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
|
||||
|
||||
The UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
|
||||
UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
|
||||
but display mojibake::
|
||||
|
||||
$ python3.7 -X utf8 ../ls.py
|
||||
|
@ -541,17 +541,17 @@ a text file::
|
|||
|
||||
Result:
|
||||
|
||||
======================== ============ =========
|
||||
Python Always work? Mojibake?
|
||||
======================== ============ =========
|
||||
Python 2 **Yes** **Yes**
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale No No
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============ =========
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 **Yes** **Yes**
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale No No
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============= =========
|
||||
|
||||
"Yes" involves that mojibake can be produced. "No" means that the script
|
||||
"Yes" implies that mojibake can be produced. "No" means that the script
|
||||
can fail on decoding or encoding a filename depending on the locale or
|
||||
the filename. Typical error::
|
||||
|
||||
|
@ -572,15 +572,15 @@ Very basic example used to illustrate a common issue, display the euro sign
|
|||
|
||||
Result:
|
||||
|
||||
======================== ============ =========
|
||||
Python Always work? Mojibake?
|
||||
======================== ============ =========
|
||||
Python 2 No No
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale No No
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode **Yes** **Yes**
|
||||
======================== ============ =========
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 No No
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale No No
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode **Yes** **Yes**
|
||||
======================== ============= =========
|
||||
|
||||
The UTF-8 and UTF-8 Strict modes will always encode the euro sign as
|
||||
UTF-8. If the terminal uses a different encoding, we get mojibake.
|
||||
|
@ -589,7 +589,7 @@ UTF-8. If the terminal uses a different encoding, we get mojibake.
|
|||
Replace a word in a text
|
||||
------------------------
|
||||
|
||||
The following scripts replaces the word "apple" with "orange". It
|
||||
The following script replaces the word "apple" with "orange". It
|
||||
reads input from stdin and writes the output into stdout::
|
||||
|
||||
import sys
|
||||
|
@ -598,15 +598,15 @@ reads input from stdin and writes the output into stdout::
|
|||
|
||||
Result:
|
||||
|
||||
======================== ============ =========
|
||||
Python Always work? Mojibake?
|
||||
======================== ============ =========
|
||||
Python 2 **Yes** **Yes**
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale **Yes** **Yes**
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============ =========
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 **Yes** **Yes**
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale **Yes** **Yes**
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============= =========
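The "UTF-8 mode" and "UTF-8 Strict mode" rows of this table can be reproduced
without pipes by decoding and encoding explicitly. The sketch below is an
assumption-laden illustration: the input bytes are invented, and the two error
handlers stand in for the two modes.

```python
# Minimal sketch: replace a word in input that contains an invalid UTF-8 byte.
data = b"apple \xff pie"  # hypothetical input; 0xFF is invalid in UTF-8

# UTF-8 with surrogateescape: the undecodable byte passes through unchanged.
text = data.decode("utf-8", errors="surrogateescape")
output = text.replace("apple", "orange").encode("utf-8",
                                                errors="surrogateescape")

# UTF-8 with strict: decoding fails before any replacement happens.
try:
    data.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The text replacement works on the decodable part of the input while the
invalid byte is carried along untouched, which is what "always works, but can
produce mojibake" means in practice.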
|
||||
|
||||
Producer-consumer model using pipes
|
||||
-----------------------------------
|
||||
|
@ -618,32 +618,32 @@ On a shell, such programs are run with the command::
|
|||
|
||||
producer | consumer
|
||||
|
||||
The question if these programs will work with any data and any locale.
|
||||
The question is if these programs will work with any data and any locale.
|
||||
UNIX users don't expect Unicode errors, and so expect that such programs
|
||||
"just works".
|
||||
"just work".
|
||||
|
||||
If the producer only produces ASCII output, no error should occur. Let's
|
||||
say the that producer writes at least one non-ASCII character (at least
|
||||
say the that the producer writes at least one non-ASCII character (at least
|
||||
one byte in the range ``0x80..0xff``).
|
||||
|
||||
To simplify the problem, let's say that the consumer has no output
|
||||
(don't write result into a file or stdout).
|
||||
(doesn't write results into a file or stdout).
|
||||
|
||||
A "Bytes producer" is an application which cannot fail with a Unicode
|
||||
error and produces bytes into stdout.
|
||||
|
||||
Let's say that a "Bytes consumer" does not decode stdin but stores data
|
||||
as bytes: such consumer always work. Common UNIX command line tools like
|
||||
as bytes: such a consumer always works. Common UNIX command line tools like
|
||||
``cat``, ``grep`` or ``sed`` are in this category. Many Python 2
|
||||
applications are also in this category.
|
||||
|
||||
"Python producer" and "Python consumer" are producer and consumer
|
||||
"Python producer" and "Python consumer" are a producer and consumer
|
||||
implemented in Python.
|
||||
|
||||
Bytes producer, Bytes consumer
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
It always work, but it is out of the scope of this PEP since it doesn't
|
||||
It always works, but it is out of the scope of this PEP since it doesn't
|
||||
involve Python.
|
||||
|
||||
Python producer, Bytes consumer
|
||||
|
@ -655,18 +655,18 @@ Python producer::
|
|||
|
||||
Result:
|
||||
|
||||
======================== ============ =========
|
||||
Python Always work? Mojibake?
|
||||
======================== ============ =========
|
||||
Python 2 No No
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale No No
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============ =========
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 No No
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale No No
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============= =========
|
||||
|
||||
The question here is not if the consumer is able to decode the input,
|
||||
but if Python is able to produce its ouput. So it's similar to the
|
||||
but if Python is able to produce its output. So it's similar to the
|
||||
`Display Unicode characters into stdout`_ case.
|
||||
|
||||
UTF-8 modes work with any locale since the consumer doesn't try to
|
||||
|
@ -684,15 +684,15 @@ Python consumer::
|
|||
|
||||
Result:
|
||||
|
||||
======================== ============ =========
|
||||
Python Always work? Mojibake?
|
||||
======================== ============ =========
|
||||
Python 2 **Yes** **Yes**
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale **Yes** **Yes**
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============ =========
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 **Yes** **Yes**
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale **Yes** **Yes**
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============= =========
|
||||
|
||||
Python 3 fails on decoding stdin depending on the input and the locale.
|
||||
|
||||
|
@ -711,17 +711,17 @@ Python consumer::
|
|||
result = text.replace("apple", "orange")
|
||||
# ignore the result
|
||||
|
||||
Result, same Python version used for the producer and the consumer:
|
||||
Result, using the same Python version for the producer and the consumer:
|
||||
|
||||
======================== ============ =========
|
||||
Python Always work? Mojibake?
|
||||
======================== ============ =========
|
||||
Python 2 No No
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale No No
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============ =========
|
||||
======================== ============= =========
|
||||
Python Always works? Mojibake?
|
||||
======================== ============= =========
|
||||
Python 2 No No
|
||||
Python 3 No No
|
||||
Python 3.5, POSIX locale No No
|
||||
UTF-8 mode **Yes** **Yes**
|
||||
UTF-8 Strict mode No No
|
||||
======================== ============= =========
|
||||
|
||||
This case combines a Python producer with a Python consumer, so the
|
||||
result is the subset of `Python producer, Bytes consumer`_ and `Bytes
|
||||
|
@ -737,7 +737,7 @@ with the ``surrogateescape`` error handler, encoding errors should not
|
|||
occur and so the change should not break applications.
|
||||
|
||||
The more likely source of trouble comes from external libraries. Python
|
||||
can decode successfully data from UTF-8, but a library using the locale
|
||||
can successfully decode data from UTF-8, but a library using the locale
|
||||
encoding can fail to encode the decoded text back to bytes. Hopefully,
|
||||
encoding text in a library is a rare operation. Very few libraries
|
||||
expect text, most libraries expect bytes and even manipulate bytes
|
||||
|
@ -754,14 +754,14 @@ Don't modify the encoding of the POSIX locale
|
|||
---------------------------------------------
|
||||
|
||||
A first version of the PEP did not change the encoding and error handler
|
||||
used of the POSIX locale.
|
||||
used for the POSIX locale.
|
||||
|
||||
The problem is that adding the ``-X utf8`` command line option or
|
||||
setting the ``PYTHONUTF8`` environment variable is not possible in some
|
||||
cases, or at least not convenient.
|
||||
|
||||
Moreover, many users simply expect that Python 3 behaves as Python 2:
|
||||
don't bother them with encodings and "just works" in all cases. These
|
||||
Moreover, many users simply expect that Python 3 behaves like Python 2:
|
||||
it doesn't bother them with encodings and "just works" in all cases. These
|
||||
users don't worry about mojibake, or even expect mojibake because of
|
||||
complex documents using multiple incompatible encodings.
|
||||
|
||||
|
@ -769,7 +769,7 @@ complex documents using multiple incompatibles encodings.
|
|||
Always use UTF-8
|
||||
----------------
|
||||
|
||||
Python already always use the UTF-8 encoding on Mac OS X, Android and
|
||||
Python already always uses the UTF-8 encoding on Mac OS X, Android and
|
||||
Windows. Since UTF-8 became the de facto encoding, it makes sense to
|
||||
always use it on all platforms with any locale.
|
||||
|
||||
|
@ -783,7 +783,7 @@ Force UTF-8 for the POSIX locale
|
|||
An alternative to always using UTF-8 in any case is to only use UTF-8 when the
|
||||
``LC_CTYPE`` locale is the POSIX locale.
|
||||
|
||||
The `PEP 538`_ "Coercing the legacy C locale to C.UTF-8" of Nick
|
||||
`PEP 538`_ "Coercing the legacy C locale to C.UTF-8" by Nick
|
||||
Coghlan proposes to implement that using the ``C.UTF-8`` locale.
|
||||
|
||||
|
||||
|
@ -791,20 +791,20 @@ Use the strict error handler for operating system data
|
|||
------------------------------------------------------
|
||||
|
||||
Using the ``surrogateescape`` error handler for `operating system data`_
|
||||
creates surprising surrogate characters. No Python codec (except of
|
||||
``utf-7``) accept surrogates, and so encoding text coming from the
|
||||
operating system is likely to raise an error error. The problem is that
|
||||
creates surprising surrogate characters. No Python codec (except for
|
||||
``utf-7``) accepts surrogates so encoding text coming from the
|
||||
operating system is likely to raise an error. The problem is that
|
||||
the error comes late, very far from where the data was read.
|
||||
|
||||
The ``strict`` error handler can be used instead to decode
|
||||
(``os.fsdecode()``) and encode (``os.fsencode()``) operating system
|
||||
data, to raise encoding errors as soon as possible. It helps to find
|
||||
data and raise encoding errors as soon as possible. Using it helps find
|
||||
bugs more quickly.
|
||||
|
||||
The main drawback of this strategy is that it doesn't work in practice.
|
||||
Python 3 is designed on top of Unicode strings. Most functions expect
|
||||
Unicode and produce Unicode. Even if many operating system functions
|
||||
have two flavors, bytes and Unicode, the Unicode flavar is used is most
|
||||
have two flavors, bytes and Unicode, the Unicode flavor is used in most
|
||||
cases. There are good reasons for that: Unicode is more convenient in
|
||||
Python 3 and using Unicode helps to support the full Unicode Character
|
||||
Set (UCS) on Windows (even if Python now uses UTF-8 since Python 3.6,
|
||||
|
@ -884,11 +884,11 @@ with the POSIX locale:
|
|||
stdout <http://bugs.python.org/issue8533>`_, regrtest fails with Unicode
|
||||
encode error if the locale is POSIX
|
||||
|
||||
Some issues are real bug in applications which must set explicitly the
|
||||
Some issues are real bugs in applications which must explicitly set the
|
||||
encoding. Well, it just works in the common case (locale configured
|
||||
correctly), so what? But the program "suddenly" fails when the POSIX
|
||||
locale is used (probably for bad reasons). Such bug is not well
|
||||
understood by users. Example of such issue:
|
||||
correctly), so what? The program "suddenly" fails when the POSIX
|
||||
locale is used (probably for bad reasons). Such bugs are not well
|
||||
understood by users. Example of such issues:
|
||||
|
||||
* 2013-11-21: `pip: open() uses the locale encoding to parse Python
|
||||
script, instead of the encoding cookie
|
||||
|
@ -902,7 +902,7 @@ Prior Art
|
|||
=========
|
||||
|
||||
Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
|
||||
varaible to force UTF-8: see `perlrun
|
||||
variable to force UTF-8: see `perlrun
|
||||
<http://perldoc.perl.org/perlrun.html>`_. It is possible to configure
|
||||
UTF-8 per standard stream, on input and output streams, etc.
|
||||
|
||||
|
|