PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner@gmail.com>,
        Nick Coghlan <ncoghlan@gmail.com>
BDFL-Delegate: INADA Naoki
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode, enabled by default in the POSIX locale, to ignore
the locale and force the usage of the UTF-8 encoding for external
operating system interfaces, including the standard IO streams.

Essentially, the UTF-8 mode behaves like Python 2 and other C-based
applications on \*nix systems: it aims to process text as best it can,
but it errs on the side of producing or propagating mojibake to
subsequent components in a processing pipeline rather than requiring
strictly valid encodings at every step in the process.

The UTF-8 mode can be configured as strict to reduce the risk of
producing or propagating mojibake.

A new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added to explicitly control the UTF-8 mode (including
turning it off entirely, even in the POSIX locale).
|
Rationale
=========

"It's not a bug, you must fix your locale" is not an acceptable answer
----------------------------------------------------------------------

Since Python 3.0 was released in 2008, the usual answer to users getting
Unicode errors has been to ask developers to fix their code to handle
Unicode properly. Most applications and Python modules were fixed, but
users kept reporting Unicode errors regularly: see the long list of
issues in the `Links`_ section below.

In fact, a second class of bugs comes from a locale which is not properly
configured. The usual answer to such a bug report is: "it is not a bug,
you must fix your locale".

Technically, the answer is correct, but from a practical point of view,
the answer is not acceptable. In many cases, "fixing the issue" is a
hard task. Moreover, sometimes, the usage of the POSIX locale is
deliberate.

A good example of a concrete issue is build systems which create a
fresh environment for each build using a chroot, a container, a virtual
machine or something else to get reproducible builds. Such a setup
usually uses the POSIX locale. To get 100% reproducible builds, the
POSIX locale is a good choice: see the `Locales section of
reproducible-builds.org
<https://reproducible-builds.org/docs/locales/>`_.

PEP 538 lists additional problems related to the use of Linux containers to
run network services and command line applications.

UNIX users don't expect Unicode errors, since the common command line
tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors -
they produce mostly-readable text instead.

These users similarly expect that tools written in Python 3 (including
those updated from Python 2) continue to tolerate locale
misconfigurations and avoid bothering them with text encoding details.
From their point of view, the bug is not their locale but is
obviously Python 3 ("Everything else works, including Python 2, so
what's wrong with Python 3?").

Since Python 2 handles data as bytes, similar to system utilities
written in C and C++, it's rarer in Python 2 compared to Python 3 to get
explicit Unicode errors. This also contributes significantly to why many
affected users perceive Python 3 as the root cause of their Unicode
errors.

At the same time, the stricter text handling model was deliberately
introduced into Python 3 to reduce the frequency of data corruption bugs
arising in production services due to mismatched assumptions regarding
text encodings. It's one thing to emit mojibake to a user's terminal
while listing a directory, but something else entirely to store that in
a system manifest in a database, or to send it to a remote client
attempting to retrieve files from the system.

Since different groups of users have different expectations, there is no
silver bullet which solves all issues at once. Last but not least,
backward compatibility should be preserved whenever possible.
|
Locale and operating system data
--------------------------------

.. _operating system data:

Python uses an encoding called the "filesystem encoding" to decide how
to encode and decode data from/to the operating system:

* file content
* command line arguments: ``sys.argv``
* standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
* environment variables: ``os.environ``
* filenames: ``os.listdir(str)`` for example
* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
* error messages: ``os.strerror(code)`` for example
* user and terminal names: ``os``, ``grp`` and ``pwd`` modules
* host name, UNIX socket path: see the ``socket`` module
* etc.

At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user's
``LC_CTYPE`` locale and then stores the locale encoding as the
"filesystem encoding". It's possible to get this encoding using
``sys.getfilesystemencoding()``. In the whole lifetime of a Python
process, the same encoding and error handler are used to encode and
decode data from/to the operating system.

The ``os.fsdecode()`` and ``os.fsencode()`` functions can be used to
decode and encode operating system data. These functions use the
filesystem error handler: ``sys.getfilesystemencodeerrors()``.
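As an illustration of the functions named above, here is a minimal sketch,
assuming a POSIX system where the filesystem error handler is
``surrogateescape``. The helper name ``roundtrip`` and the sample byte string
are illustrative only and are not part of this PEP.

```python
import os
import sys


def roundtrip(raw: bytes) -> bytes:
    """Decode operating system data to str, then encode it back to bytes."""
    text = os.fsdecode(raw)   # bytes -> str, using the filesystem encoding
    return os.fsencode(text)  # str -> bytes, using the same codec


# Both values are chosen at startup and stay the same for the whole process.
print("encoding:", sys.getfilesystemencoding())
print("error handler:", sys.getfilesystemencodeerrors())

# Assumption: POSIX platform. The byte 0xFF is not valid UTF-8 or ASCII,
# but surrogateescape lets it survive the decode/encode round trip.
raw = b"xxx\xff"
print(roundtrip(raw) == raw)
```

The printed encoding and error handler depend on the platform and on the
locale the interpreter was started with.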
|
.. note::
   In some corner cases, the *current* ``LC_CTYPE`` locale must be used
   instead of ``sys.getfilesystemencoding()``. For example, the ``time``
   module uses the *current* ``LC_CTYPE`` locale to decode timezone
   names.


The POSIX locale and its encoding
---------------------------------

The following environment variables are used to configure the locale, in
this preference order:

* ``LC_ALL``, most important variable
* ``LC_CTYPE``
* ``LANG``

The POSIX locale, also known as "the C locale", is used:

* if the first set variable is set to ``"C"``
* if all these variables are unset, for example when a program is
  started in an empty environment.
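The selection rule above can be sketched as a small pure function. This is
only an approximation of the two conditions listed: the name
``uses_posix_locale`` is hypothetical, the function does not call
``setlocale()``, and treating an empty value as unset is an assumption of
this sketch.

```python
def uses_posix_locale(env):
    """Return True if the given environment selects the POSIX ("C") locale.

    Only the two conditions listed above are modeled; a misspelled or
    non-existing locale name is not detected here.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):  # preference order
        value = env.get(name)
        if value:  # assumption: an empty value counts as unset
            return value == "C"
    # All three variables are unset, e.g. an empty environment.
    return True


print(uses_posix_locale({}))
print(uses_posix_locale({"LANG": "en_US.UTF-8"}))
print(uses_posix_locale({"LC_ALL": "C", "LANG": "en_US.UTF-8"}))
```

Passing ``os.environ`` instead of a literal dictionary applies the same check
to the current process.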
|
The encoding of the POSIX locale must be ASCII or a superset of ASCII.

On Linux, the POSIX locale uses the ASCII encoding.

On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
the ASCII encoding, whereas the ``mbstowcs()`` and ``wcstombs()`` functions
use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
``os.fsencode()`` and ``os.fsdecode()`` use the
``locale.getpreferredencoding()`` codec. For example, if command line
arguments are decoded by ``mbstowcs()`` and encoded back by
``os.fsencode()``, a ``UnicodeEncodeError`` exception is raised instead
of retrieving the original byte string.

To fix this issue, since Python 3.4, Python checks whether ``mbstowcs()``
really uses the ASCII encoding when ``LC_CTYPE`` uses the
POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
alias of ASCII). If not (the effective encoding is not ASCII), Python
uses its own ASCII codec instead of using the ``mbstowcs()`` and
``wcstombs()`` functions for `operating system data`_.

See the `POSIX locale (2016 Edition)
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.


POSIX locale used by mistake
----------------------------

In many cases, the POSIX locale is not really expected by users who get
it by mistake. Examples:

* program started in an empty environment
* User forcing LANG=C to get messages in English
* LANG=C used for bad reasons, without being aware of the ASCII encoding
* SSH shell
* Linux installed with no configured locale
* chroot environment, Docker image, container, ... with no locale
  configured
* User locale set to a non-existing locale, typo in the locale name for
  example


C.UTF-8 and C.utf8 locales
--------------------------

Some UNIX operating systems provide a variant of the POSIX locale using
the UTF-8 encoding:

* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
* HP-UX: ``"C.utf8"``

It was proposed to add a ``C.UTF-8`` locale to the glibc: `glibc C.UTF-8
proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.

It is not planned to add such a locale to BSD systems.
|
Popularity of the UTF-8 encoding
--------------------------------

Python 3 uses UTF-8 by default for Python source files.

On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
system data. For Windows, see the `PEP 529`_: "Change Windows filesystem
encoding to UTF-8".

On Linux, UTF-8 became the de facto standard encoding,
replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
using different encodings for filenames and standard streams is likely
to create mojibake, so UTF-8 is now used *everywhere* (at least for
modern distributions using their default settings).

The UTF-8 encoding is the default encoding of the XML and JSON file
formats. In January 2017, UTF-8 was used in `more than 88% of web pages
<https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
JavaScript, CSS, etc.).

See `utf8everywhere.org <http://utf8everywhere.org/>`_ for more general
information on the UTF-8 codec.

.. note::
   Some applications and operating systems (especially Windows) use Byte
   Order Marks (BOMs) to indicate the used Unicode encoding: UTF-7,
   UTF-8, UTF-16-LE, etc. BOMs are not well supported and rarely used in
   Python.


Old data stored in different encodings and surrogateescape
----------------------------------------------------------

Even if UTF-8 became the de facto standard, there are still systems in
the wild which don't use UTF-8. And there is a lot of data stored in
different encodings. For example, an old USB key using the ext3
filesystem with filenames encoded to ISO 8859-1.

The Linux kernel and libc don't decode filenames: a filename is used
as a raw array of bytes. The common solution to support any filename is
to store filenames as bytes and not try to decode them. When displayed
to stdout, mojibake is displayed if the filename and the terminal don't
use the same encoding.
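The bytes-only approach described above can be sketched in Python as
follows. The helper names ``list_raw`` and ``dump_raw`` are illustrative and
are not part of this PEP; passing a ``bytes`` path to ``os.listdir()`` makes
it return ``bytes`` names without decoding them.

```python
import os
import sys


def list_raw(path="."):
    """Return the directory entries as bytes, without decoding them."""
    return sorted(os.listdir(os.fsencode(path)))


def dump_raw(names, stream=None):
    """Write the raw names to a binary stream (stdout's buffer by default)."""
    stream = sys.stdout.buffer if stream is None else stream
    for name in names:
        # No encoding step: the terminal receives the bytes stored on disk,
        # so mojibake appears only if the terminal uses another encoding.
        stream.write(name + b"\n")
    stream.flush()


dump_raw(list_raw(os.curdir))
```

Because nothing is decoded, this variant cannot raise a Unicode error,
whatever the locale is.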
|
Python 3 promotes Unicode everywhere including filenames. A solution to
support filenames not decodable from the locale encoding was found: the
``surrogateescape`` error handler (`PEP 383`_), which stores undecodable
bytes as surrogate characters. This error handler is used by default for
`operating system data`_, by ``os.fsdecode()`` and ``os.fsencode()`` for
example (except on Windows which uses the ``strict`` error handler).
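To make the mechanism concrete, here is a small sketch that applies the
``surrogateescape`` error handler explicitly through the codec API rather
than through ``os.fsdecode()``. The sample byte string is arbitrary.

```python
raw = b"xxx\xff"  # 0xFF can never appear in valid UTF-8

# Decoding: the undecodable byte 0xFF becomes the lone surrogate U+DCFF.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original byte.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)

# The strict handler refuses to encode the surrogate character.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```

The last step is the situation in which a filename decoded with
``surrogateescape`` is later written through a ``strict`` stream.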
|
Standard streams
----------------

Python uses the locale encoding for standard streams: stdin, stdout and
stderr. The ``strict`` error handler is used by stdin and stdout to
prevent mojibake.

The ``backslashreplace`` error handler is used by stderr to avoid
Unicode encode errors when displaying non-ASCII text. It is especially
useful when the POSIX locale is used, because this locale usually uses
the ASCII encoding.

The problem is that `operating system data`_ like filenames are decoded
using the ``surrogateescape`` error handler (`PEP 383`_). Displaying a
filename to stdout raises a Unicode encode error if the filename
contains an undecodable byte stored as a surrogate character.

Python 3.5+ now uses ``surrogateescape`` for stdin and stdout if the
POSIX locale is used: `issue #19977
<http://bugs.python.org/issue19977>`_. The idea is to pass through
`operating system data`_ even if it means mojibake, because most UNIX
applications work like that. Such UNIX applications often store
filenames as bytes, in many cases because their basic design principles
(or those of the language they're implemented in) were laid down half a
century ago when it was still a feat for computers to handle English
text correctly, rather than humans having to work with raw numeric
indexes.

.. note::
   The encoding and/or the error handler of standard streams can be
   overridden with the ``PYTHONIOENCODING`` environment variable.
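The effect of the three error handlers discussed in this section can be
reproduced without touching the real standard streams. The sketch below
wraps an in-memory buffer in an ``io.TextIOWrapper``; the helper name
``write_name`` is hypothetical, and UTF-8 is chosen here only to keep the
example independent of the locale.

```python
import io


def write_name(name, errors):
    """Write a name to an in-memory UTF-8 text stream; return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="utf-8", errors=errors)
    stream.write(name)
    stream.flush()
    return raw.getvalue()


# A filename with one undecodable byte, as os.listdir() would return it.
name = b"xxx\xff".decode("utf-8", "surrogateescape")

passthrough = write_name(name, "surrogateescape")  # original byte restored
escaped = write_name(name, "backslashreplace")     # readable escape sequence

try:
    write_name(name, "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(passthrough, escaped, strict_failed)
```

``surrogateescape`` corresponds to the pass-through behaviour, whereas
``strict`` is the case that makes ``print()`` fail on such a filename.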
|
Proposal
========

Changes
-------

Add a new UTF-8 mode, enabled by default in the POSIX locale, but
otherwise disabled by default, to ignore the locale and force the usage
of the UTF-8 encoding with the ``surrogateescape`` error handler,
instead of using the locale encoding (with ``strict`` or
``surrogateescape`` error handler depending on the case).

The "normal" UTF-8 mode uses ``surrogateescape`` on the standard input
and output streams and opened files, as well as on all operating
system interfaces. This is the mode implicitly activated by the POSIX
locale.

The "strict" UTF-8 mode reduces the risk of producing or propagating
mojibake: the UTF-8 encoding is used with the ``strict`` error handler
for inputs and outputs, but the ``surrogateescape`` error handler is
still used for `operating system data`_. This mode is never activated
implicitly, but can be requested explicitly.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode.

The UTF-8 mode is enabled by ``-X utf8`` or ``PYTHONUTF8=1``.

The UTF-8 Strict mode is configured by ``-X utf8=strict`` or
``PYTHONUTF8=strict``.

The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.

Other option values fail with an error.

Options priority for the UTF-8 mode, highest priority first:

* ``PYTHONLEGACYWINDOWSFSENCODING``
* ``-X utf8``
* ``PYTHONUTF8``
* POSIX locale

For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the UTF-8 mode,
whereas ``LC_ALL=C python3.7 -X utf8=0`` disables the UTF-8 mode and so
uses the encoding of the POSIX locale.
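The priority rules above can be summarized by a small decision function.
This is a sketch of the rules as stated, not the interpreter's
implementation: the function names, the ``"off"``/``"on"``/``"strict"``
return values and the use of ``True`` for a bare ``-X utf8`` are all
assumptions made for the illustration.

```python
def parse_utf8_value(value):
    """Map an option value to a mode name; other values are an error."""
    if value is True or value == "1":  # bare "-X utf8" or "=1"
        return "on"
    if value == "0":
        return "off"
    if value == "strict":
        return "strict"
    raise ValueError("invalid UTF-8 mode value: %r" % (value,))


def resolve_utf8_mode(xoption=None, env_value=None,
                      posix_locale=False, legacy_windows_fs=False):
    """Apply the priority order: legacy FS, -X utf8, PYTHONUTF8, locale."""
    if legacy_windows_fs:      # PYTHONLEGACYWINDOWSFSENCODING is set
        return "off"
    if xoption is not None:    # -X utf8[=value] on the command line
        return parse_utf8_value(xoption)
    if env_value is not None:  # PYTHONUTF8 environment variable
        return parse_utf8_value(env_value)
    return "on" if posix_locale else "off"


# PYTHONUTF8=0 python3 -X utf8
print(resolve_utf8_mode(xoption=True, env_value="0"))
# LC_ALL=C python3.7 -X utf8=0
print(resolve_utf8_mode(xoption="0", posix_locale=True))
```

The two calls at the end mirror the two command lines given in the example
above.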
|
Encodings used by ``open()``, highest priority first:

* *encoding* and *errors* parameters (if set)
* UTF-8 mode
* ``os.device_encoding(fd)``
* ``locale.getpreferredencoding(False)``
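Since explicit parameters come first in this list, code that must not
depend on the locale or on the UTF-8 mode can always pass them itself. A
minimal sketch follows; the helper name ``read_names`` and the temporary
file are illustrative only.

```python
import os
import tempfile


def read_names(path):
    """Read one name per line; explicit parameters override every default."""
    with open(path, encoding="utf-8", errors="surrogateescape") as fp:
        return fp.read().splitlines()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "content.txt")
    # Store a name containing a byte that is not valid UTF-8.
    with open(path, "wb") as fp:
        fp.write(b"xxx\xff\n")
    names = read_names(path)

print([ascii(name) for name in names])
```

With these parameters the result is the same whether or not the UTF-8 mode
is enabled.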
|
Encoding and error handler
--------------------------

The UTF-8 mode changes the default encoding and error handler used by
``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
``sys.stdout`` and ``sys.stderr``:

============================ ======================= ========================== ==========================
Function                     Default                 UTF-8 mode or POSIX locale UTF-8 Strict mode
============================ ======================= ========================== ==========================
open()                       locale/strict           **UTF-8/surrogateescape**  **UTF-8**/strict
os.fsdecode(), os.fsencode() locale/surrogateescape  **UTF-8**/surrogateescape  **UTF-8**/surrogateescape
sys.stdin, sys.stdout        locale/strict           **UTF-8/surrogateescape**  **UTF-8**/strict
sys.stderr                   locale/backslashreplace **UTF-8**/backslashreplace **UTF-8**/backslashreplace
============================ ======================= ========================== ==========================

By comparison, Python 3.6 uses:

============================ ======================= ==========================
Function                     Default                 POSIX locale
============================ ======================= ==========================
open()                       locale/strict           locale/strict
os.fsdecode(), os.fsencode() locale/surrogateescape  locale/surrogateescape
sys.stdin, sys.stdout        locale/strict           locale/**surrogateescape**
sys.stderr                   locale/backslashreplace locale/backslashreplace
============================ ======================= ==========================

The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
strict mode for consistency with other standard \*nix operating system
components: the idea is that data not encoded to UTF-8 are passed through
"Python" without being modified, as raw bytes.

The ``PYTHONIOENCODING`` environment variable has priority over the
UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1
python3 -X utf8`` uses the Latin1 encoding for stdin, stdout and stderr.
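The priority of ``PYTHONIOENCODING`` can be checked from a script. The
sketch below assumes it is run with an interpreter that implements
``-X utf8`` (Python 3.7 or later); the helper name
``child_stdout_encoding`` is hypothetical. The codec name is normalized with
``codecs.lookup()`` because the exact spelling reported by
``sys.stdout.encoding`` may vary.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(extra_env, *interpreter_args):
    """Start a child interpreter and return its normalized stdout encoding."""
    env = dict(os.environ, **extra_env)
    out = subprocess.run(
        [sys.executable, *interpreter_args,
         "-c", "import sys; print(sys.stdout.encoding)"],
        env=env, stdout=subprocess.PIPE, check=True,
    ).stdout
    # Normalize aliases such as "latin1" to the canonical codec name.
    return codecs.lookup(out.decode("ascii").strip()).name


# Equivalent to: PYTHONIOENCODING=latin1 python3 -X utf8
name = child_stdout_encoding({"PYTHONIOENCODING": "latin1"}, "-X", "utf8")
print(name)
```

Calling the helper with an empty string for ``PYTHONIOENCODING`` and the
same ``-X utf8`` option is a way to observe the UTF-8 mode default instead.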
|
Encoding and error handler on Windows
-------------------------------------

On Windows, the encodings and error handlers are different:

============================ ======================= ========================== ========================== ==========================
Function                     Default                 Legacy Windows FS encoding UTF-8 mode                 UTF-8 Strict mode
============================ ======================= ========================== ========================== ==========================
open()                       mbcs/strict             mbcs/strict                **UTF-8/surrogateescape**  **UTF-8**/strict
os.fsdecode(), os.fsencode() UTF-8/surrogatepass     **mbcs/replace**           UTF-8/surrogatepass        UTF-8/surrogatepass
sys.stdin, sys.stdout        UTF-8/surrogateescape   UTF-8/surrogateescape      UTF-8/surrogateescape      **UTF-8/strict**
sys.stderr                   UTF-8/backslashreplace  UTF-8/backslashreplace     UTF-8/backslashreplace     UTF-8/backslashreplace
============================ ======================= ========================== ========================== ==========================

By comparison, Python 3.6 uses:

============================ ======================= ==========================
Function                     Default                 Legacy Windows FS encoding
============================ ======================= ==========================
open()                       mbcs/strict             mbcs/strict
os.fsdecode(), os.fsencode() UTF-8/surrogatepass     **mbcs/replace**
sys.stdin, sys.stdout        UTF-8/surrogateescape   UTF-8/surrogateescape
sys.stderr                   UTF-8/backslashreplace  UTF-8/backslashreplace
============================ ======================= ==========================

The "Legacy Windows FS encoding" is enabled by setting the
``PYTHONLEGACYWINDOWSFSENCODING`` environment variable to ``1`` as
specified in `PEP 529`_.

Enabling the legacy Windows filesystem encoding disables the UTF-8 mode
(as ``-X utf8=0``).

If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
``sys.stdout`` use the ``mbcs`` encoding by default rather than UTF-8. But
with the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the
UTF-8 encoding.

There is no POSIX locale on Windows. The ANSI code page is used as the
locale encoding, and this code page never uses the ASCII encoding.


Rationale
---------

The UTF-8 mode is disabled by default to keep hard Unicode errors when
encoding or decoding `operating system data`_ fails, and to keep
backward compatibility. The user is responsible for explicitly enabling
the UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
mode were enabled *by default*.

The UTF-8 mode should be used on systems known to be configured with
UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
the user overrides a locale *by mistake* or if a Python program is
started with no locale configured (and so with the POSIX locale).

Most UNIX applications handle `operating system data`_ as bytes, so the
``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
limited impact on how these data are handled by the application.

The Python UTF-8 mode should help to make Python more interoperable with
the other UNIX applications in the system assuming that *UTF-8* is used
everywhere and that users *expect* UTF-8.

Ignoring the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
Python is more convenient, since they are more commonly misconfigured
*by mistake* (configured to use an encoding different from UTF-8,
whereas the system uses UTF-8), rather than being misconfigured by
intent.
|
Expected mojibake and surrogate character issues
------------------------------------------------

The UTF-8 mode only affects code running directly in Python, especially
code written in pure Python. The other code, called "external code"
here, is not aware of this mode. Examples:

* C libraries called by Python modules like OpenSSL
* The application code when Python is embedded in an application

In the UTF-8 mode, Python uses the ``surrogateescape`` error handler
which stores bytes not decodable from UTF-8 as surrogate characters.

If the external code uses the locale and the locale encoding is UTF-8,
it should work fine.

External code using bytes
^^^^^^^^^^^^^^^^^^^^^^^^^

If the external code processes data as bytes, surrogate characters are
not an issue since they are only used inside Python. Python encodes back
surrogate characters to bytes at the edges, before calling external
code.
|
The UTF-8 mode can produce mojibake since neither Python nor the external
code rejects invalid bytes, but it's a deliberate choice. The UTF-8 mode can
be configured as strict to prevent mojibake and fail early when data
is not decodable from UTF-8 or not encodable to UTF-8.
|
External code using text
^^^^^^^^^^^^^^^^^^^^^^^^

If the external code uses a text API, for example using the ``wchar_t*`` C
type, mojibake should not occur, but the external code can fail on
surrogate characters.


Use Cases
=========

The following use cases were written to help understand the impact of
chosen encodings and error handlers on concrete examples.

The "Exception?" column shows the potential benefit of having a UTF-8
mode which is closer to the traditional Python 2 behaviour of passing
along raw binary data even if it isn't valid UTF-8.

The "Mojibake" column shows that ignoring the locale causes a practical
issue: the UTF-8 mode produces mojibake if the terminal doesn't use the
UTF-8 encoding.

The ideal configuration is "No exception, no risk of mojibake", but that
isn't always possible in the presence of non-UTF-8 encoded binary data.
|
List a directory into stdout
----------------------------

Script listing the content of the current directory into stdout::

    import os

    for name in os.listdir(os.curdir):
        print(name)

Result:

======================== ========== =========
Python                   Exception? Mojibake?
======================== ========== =========
Python 2                 No         **Yes**
Python 3                 **Yes**    No
Python 3.5, POSIX locale No         **Yes**
UTF-8 mode               No         **Yes**
UTF-8 Strict mode        **Yes**    No
======================== ========== =========

"Exception?" means that the script can fail on decoding or encoding a
filename depending on the locale or the filename.

To be able to never fail that way, the program must be able to produce
mojibake. For automated and interactive processes, mojibake is often more
user friendly than an error with a truncated or empty output, since it
confines the problem to the affected entry, rather than aborting the
whole task.

Example with a directory which contains the file called ``b'xxx\xff'``
(the byte ``0xFF`` is invalid in UTF-8).
|
Default and UTF-8 Strict mode fail on ``print()`` with an encode error::
|
|
|
|
|
|
|
|
|
|
$ python3.7 ../ls.py
|
|
|
|
|
Traceback (most recent call last):
|
|
|
|
|
File "../ls.py", line 5, in <module>
|
|
|
|
|
print(name)
|
|
|
|
|
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
|
|
|
|
|
|
|
|
|
|
$ python3.7 -X utf8=strict ../ls.py
|
|
|
|
|
Traceback (most recent call last):
|
|
|
|
|
File "../ls.py", line 5, in <module>
|
|
|
|
|
print(name)
|
|
|
|
|
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
|
|
|
|
|
|
2017-12-05 10:21:59 -05:00
|
|
|
|
The UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
but display mojibake::

    $ python3.7 -X utf8 ../ls.py
    xxx�

    $ LC_ALL=C /python3.6 ../ls.py
    xxx�

    $ python2 ../ls.py
    xxx�

    $ ls
    'xxx'$'\377'

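The behaviours above can be reproduced without touching the filesystem.
The following is only a sketch: it assumes that UTF-8 is the filesystem
encoding and imitates what ``os.listdir()`` and ``print()`` do with the
name ``b'xxx\xff'`` by calling the codecs directly. The variable names
are illustrative.

```python
# The raw filename from the example above; 0xFF is not valid UTF-8.
raw_name = b"xxx\xff"

# os.listdir(str) decodes names with the surrogateescape handler, so the
# undecodable byte becomes the lone surrogate U+DCFF instead of an error.
name = raw_name.decode("utf-8", "surrogateescape")

# A strict UTF-8 stdout refuses the lone surrogate (the tracebacks above).
try:
    name.encode("utf-8", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With surrogateescape on output, the original byte is written back as-is,
# which a UTF-8 terminal then displays as mojibake.
round_tripped = name.encode("utf-8", "surrogateescape")
```

The round trip is lossless, which is why the non-strict UTF-8 mode can
always print the listing.
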
List a directory into a text file
---------------------------------

Similar to the previous example, except that the listing is written into
a text file::

    import os
    names = os.listdir(os.curdir)
    with open("/tmp/content.txt", "w") as fp:
        for name in names:
            fp.write("%s\n" % name)

Result:

======================== ========== =========
Python                   Exception? Mojibake?
======================== ========== =========
Python 2                 No         **Yes**
Python 3                 **Yes**    No
Python 3.5, POSIX locale **Yes**    No
UTF-8 mode               No         **Yes**
UTF-8 Strict mode        **Yes**    No
======================== ========== =========

Again, never throwing an exception requires that mojibake can be
produced, while preventing mojibake means that the script can fail on
decoding or encoding a filename depending on the locale or the filename.
Typical error::

    $ LC_ALL=C python3 test.py
    Traceback (most recent call last):
      File "test.py", line 5, in <module>
        fp.write("%s\n" % name)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Compared with native system tools::

    $ ls > /tmp/content.txt
    $ cat /tmp/content.txt
    xxx�

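An application that cannot rely on the UTF-8 mode can get the same
result by spelling out the encoding and the error handler when opening
the file. This is a sketch rather than part of the proposal: the names
are hard-coded instead of coming from ``os.listdir()``, and a temporary
directory replaces ``/tmp/content.txt``.

```python
import os
import tempfile

# Names as os.listdir() would return them with UTF-8/surrogateescape;
# the second one carries an escaped 0xFF byte.
names = ["plain.txt", "xxx\udcff"]

path = os.path.join(tempfile.mkdtemp(), "content.txt")

# Explicit encoding and errors make the result independent of the locale.
# newline="\n" keeps the written bytes identical on every platform.
with open(path, "w", encoding="utf-8", errors="surrogateescape",
          newline="\n") as fp:
    for name in names:
        fp.write("%s\n" % name)

# Read the file back in binary mode to inspect the bytes actually written.
with open(path, "rb") as fp:
    written = fp.read()
```

The file ends up with the same bytes that ``ls > file`` would have
produced for these names.
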
Display Unicode characters into stdout
--------------------------------------

A very basic example illustrating a common issue: displaying the euro
sign (U+20AC: €)::

    print("euro: \u20ac")

Result:

======================== ========== =========
Python                   Exception? Mojibake?
======================== ========== =========
Python 2                 **Yes**    No
Python 3                 **Yes**    No
Python 3.5, POSIX locale **Yes**    No
UTF-8 mode               No         **Yes**
UTF-8 Strict mode        No         **Yes**
======================== ========== =========

The UTF-8 and UTF-8 Strict modes will always encode the euro sign as
UTF-8. If the terminal uses a different encoding, we get mojibake.

For example, using ``iconv`` to emulate a GB-18030 terminal inside a
UTF-8 one::

    $ python3 -c 'print("euro: \u20ac")' | iconv -f gb18030 -t utf8
    euro: 鈧iconv: illegal input sequence at position 8

The misencoding also corrupts the trailing newline such that the output
stream isn't actually a valid GB-18030 sequence, hence the error message
after the euro symbol is misinterpreted as a hanzi character.

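The same effect can be observed from Python alone, using the
``gb18030`` codec in place of ``iconv``. This sketch only imitates the
terminal: it decodes the bytes that the UTF-8 mode would write as if
they were GB-18030.

```python
# The bytes that the UTF-8 mode writes to stdout for the example above.
data = "euro: \u20ac\n".encode("utf-8")

# A GB-18030 reader takes the first two bytes of the euro sign (offsets
# 6 and 7) as a single hanzi character.
prefix = data[:8].decode("gb18030")

# The remaining byte of the euro sign followed by the newline is not a
# valid GB-18030 sequence, so decoding the whole stream fails.
try:
    data.decode("gb18030")
    whole_stream_valid = True
except UnicodeDecodeError:
    whole_stream_valid = False
```

The failure happens at byte offset 8, the third byte of the UTF-8 encoded
euro sign, which matches the position reported by ``iconv``.
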
Replace a word in a text
------------------------

The following script replaces the word "apple" with "orange". It
reads input from stdin and writes the output into stdout::

    import sys
    text = sys.stdin.read()
    sys.stdout.write(text.replace("apple", "orange"))

Result:

======================== ========== =========
Python                   Exception? Mojibake?
======================== ========== =========
Python 2                 No         **Yes**
Python 3                 **Yes**    No
Python 3.5, POSIX locale No         **Yes**
UTF-8 mode               No         **Yes**
UTF-8 Strict mode        **Yes**    No
======================== ========== =========

This is a case where passing along the raw bytes (by way of the
``surrogateescape`` error handler) will bring Python 3's behaviour back
into line with standard operating system tools like ``sed`` and ``awk``.

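The pass-through behaviour can be sketched with the codecs alone. The
example below assumes that stdin and stdout both use UTF-8 with the
``surrogateescape`` error handler, and feeds the script's logic from a
bytes literal instead of a real pipe.

```python
# Input mixing ASCII text with a byte (0xFF) that is not valid UTF-8.
raw_input_bytes = b"apple pie \xff\n"

# Decoding stdin with surrogateescape cannot fail: the bad byte is kept
# as a lone surrogate inside the str object.
text = raw_input_bytes.decode("utf-8", "surrogateescape")

# The script's own logic works on str as usual.
replaced = text.replace("apple", "orange")

# Encoding stdout with the same handler restores the original byte.
raw_output_bytes = replaced.encode("utf-8", "surrogateescape")
```

Only the bytes that the script actually edits change. Everything else is
passed through untouched, as ``sed`` would do.
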
Producer-consumer model using pipes
-----------------------------------

Let's say that we have a "producer" program which writes data into its
stdout and a "consumer" program which reads data from its stdin.

On a shell, such programs are run with the command::

    producer | consumer

The question is whether these programs will work with any data and any
locale. UNIX users don't expect Unicode errors, and so expect that such
programs "just work", in the sense that Unicode errors may cause problems
in the data stream, but won't cause the entire stream processing *itself*
to abort.

If the producer only produces ASCII output, no error should occur. Let's
say that the producer writes at least one non-ASCII character (at least
one byte in the range ``0x80..0xff``).

To simplify the problem, let's say that the consumer has no output
(doesn't write results into a file or stdout).

A "Bytes producer" is an application which cannot fail with a Unicode
error and produces bytes into stdout.

Let's say that a "Bytes consumer" does not decode stdin but stores data
as bytes: such a consumer always works. Common UNIX command line tools
like ``cat``, ``grep`` or ``sed`` are in this category. Many Python 2
applications are also in this category, as are applications that work
with the lower level binary input and output stream in Python 3 rather
than the default text mode streams.

"Python producer" and "Python consumer" are producer and consumer
implemented in Python using the default text mode input and output
streams.

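As an illustration of a "Bytes consumer" written in Python 3, the
sketch below reads from a binary stream and never decodes it. The
function name ``consume_bytes`` is hypothetical, and ``io.BytesIO``
stands in for ``sys.stdin.buffer`` so that the example runs without a
pipe.

```python
import io

def consume_bytes(stream):
    """Read everything from a binary stream without decoding it."""
    data = stream.read()
    # Work directly on bytes: no codec is involved, so no Unicode error
    # can occur, whatever the input or the locale.
    return data.replace(b"apple", b"orange")

# A real program would pass sys.stdin.buffer here.
fake_stdin = io.BytesIO(b"apple \xff\xfe\n")
result = consume_bytes(fake_stdin)
```
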
Bytes producer, Bytes consumer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This won't throw exceptions, but it is out of the scope of this PEP
since it doesn't involve Python's default text mode input and output
streams.

Python producer, Bytes consumer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python producer::

    print("euro: \u20ac")

Result:

======================== ========== =========
Python                   Exception? Mojibake?
======================== ========== =========
Python 2                 **Yes**    No
Python 3                 **Yes**    No
Python 3.5, POSIX locale **Yes**    No
UTF-8 mode               No         **Yes**
UTF-8 Strict mode        No         **Yes**
======================== ========== =========

The question here is not if the consumer is able to decode the input,
but if Python is able to produce its output. So it's similar to the
`Display Unicode characters into stdout`_ case.

UTF-8 modes work with any locale since the consumer doesn't try to
decode its stdin.

Bytes producer, Python consumer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python consumer::

    import sys
    text = sys.stdin.read()
    result = text.replace("apple", "orange")
    # ignore the result

Result:

======================== ========== =========
Python                   Exception? Mojibake?
======================== ========== =========
Python 2                 No         **Yes**
Python 3                 **Yes**    No
Python 3.5, POSIX locale No         **Yes**
UTF-8 mode               No         **Yes**
UTF-8 Strict mode        **Yes**    No
======================== ========== =========

Python 3 may throw an exception on decoding stdin depending on the input
and the locale.

Python producer, Python consumer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python producer::

    print("euro: \u20ac")

Python consumer::

    import sys
    text = sys.stdin.read()
    result = text.replace("apple", "orange")
    # ignore the result

Result, same Python version used for the producer and the consumer:

======================== ========== =========
Python                   Exception? Mojibake?
======================== ========== =========
Python 2                 **Yes**    No
Python 3                 **Yes**    No
Python 3.5, POSIX locale **Yes**    No
UTF-8 mode               No         No(!)
UTF-8 Strict mode        No         No(!)
======================== ========== =========

This case combines a Python producer with a Python consumer, and the
result is mainly the same as that for `Python producer, Bytes
consumer`_, since the consumer can't read what the producer can't emit.

However, the behaviour of the "UTF-8" and "UTF-8 Strict" modes in this
configuration is notable: they don't produce an exception, *and* they
shouldn't produce mojibake, as both the producer and the consumer are
making *consistent* assumptions regarding the text encoding used on the
pipe between them (i.e. UTF-8).

Any mojibake generated would only be in the interfaces between the
consuming component and the outside world (e.g. the terminal, or when
writing to a file).

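The effect of consistent assumptions on both ends of the pipe can be
sketched by encoding and decoding by hand. The pipe is simulated by a
bytes object, and the Latin-1 consumer is only a hypothetical example of
a mismatched reader added for contrast.

```python
# Producer side: in UTF-8 mode, text written to stdout is encoded as UTF-8.
pipe_bytes = "euro: \u20ac\n".encode("utf-8", "surrogateescape")

# Consumer side: in UTF-8 mode, stdin is decoded as UTF-8 as well, so the
# consumer sees exactly the text that the producer wrote.
text = pipe_bytes.decode("utf-8", "surrogateescape")

# For contrast, a consumer that assumes Latin-1 on the same pipe gets
# mojibake instead of the euro sign.
mismatched = pipe_bytes.decode("latin-1")
```
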
Backward Compatibility
======================

The main backward incompatible change is that the UTF-8 encoding is now
used by default if the locale is POSIX. Since the UTF-8 encoding is used
with the ``surrogateescape`` error handler, encoding errors should not
occur and so the change should not break applications.

The UTF-8 encoding is also quite restrictive regarding where it allows
plain ASCII code points to appear in the byte stream, so even for
ASCII-incompatible encodings, such byte values will often be escaped
rather than being processed as ASCII characters.

The more likely source of trouble comes from external libraries. Python
can successfully decode data from UTF-8, but a library using the locale
encoding can fail to encode the decoded text back to bytes. For example,
GNU readline currently has problems on Android due to the mismatch
between CPython's encoding assumptions there (always UTF-8) and GNU
readline's encoding assumptions (which are based on the nominal locale).

The PEP only changes the default behaviour if the locale is POSIX. For
other locales, the *default* behaviour is unchanged.

PEP 538 is a follow-up to this PEP that extends CPython's assumptions to
other locale-aware components in the same process by explicitly coercing
the POSIX locale to something more suitable for modern text processing.
See that PEP for further details.

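When diagnosing a mismatch between Python and an external library, it
can help to print the encodings that Python itself is using. The sketch
below is not part of the proposal. It assumes that the mode is exposed
as ``sys.flags.utf8_mode``, the name used by CPython 3.7 and later, and
falls back to ``0`` on older versions. The helper name
``encoding_report`` is hypothetical.

```python
import locale
import sys

def encoding_report():
    """Collect the encodings currently used by this Python process."""
    return {
        # Non-zero when the UTF-8 mode is active (attribute added in 3.7).
        "utf8_mode": getattr(sys.flags, "utf8_mode", 0),
        # Encoding used for filenames and other operating system data.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding of the text stream behind print().
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding Python reports as preferred for text files.
        "preferred": locale.getpreferredencoding(False),
    }

report = encoding_report()
```

Comparing this report with the encoding that the library derives from
the locale shows whether the two sides disagree.
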
Alternatives
============

Don't modify the encoding of the POSIX locale
---------------------------------------------

A first version of the PEP did not change the encoding and error handler
used for the POSIX locale.

The problem is that adding the ``-X utf8`` command line option or
setting the ``PYTHONUTF8`` environment variable is not possible in some
cases, or at least not convenient.

Moreover, many users simply expect Python 3 to behave like Python 2:
not bother them with encodings and "just work" in all cases. These
users don't worry about mojibake, or even expect mojibake because of
complex documents using multiple incompatible encodings.

Always use UTF-8
----------------

Python already always uses the UTF-8 encoding on Mac OS X, Android and
Windows. Since UTF-8 became the de facto encoding, it makes sense to
always use it on all platforms with any locale.

The problem with this approach is that Python is also used extensively
in desktop environments, and it is often a practical or even legal
requirement to support locale encodings other than UTF-8 (for example,
GB-18030 in China, and Shift-JIS or ISO-2022-JP in Japan).

Force UTF-8 for the POSIX locale
--------------------------------

An alternative to always using UTF-8 in any case is to only use UTF-8
when the ``LC_CTYPE`` locale is the POSIX locale.

`PEP 538`_, "Coercing the legacy C locale to C.UTF-8", by Nick
Coghlan, proposes to implement that using the ``C.UTF-8`` locale.

Use the strict error handler for operating system data
------------------------------------------------------

Using the ``surrogateescape`` error handler for `operating system data`_
creates surprising surrogate characters. No Python codec (except
``utf-7``) accepts surrogates, and so encoding text coming from the
operating system is likely to raise an error. The problem is that
the error comes late, very far from where the data was read.

The ``strict`` error handler can be used instead to decode
(``os.fsdecode()``) and encode (``os.fsencode()``) operating system
data, to raise encoding errors as soon as possible. It helps to find
bugs more quickly.

The main drawback of this strategy is that it doesn't work in practice.
Python 3 is designed on top of Unicode strings. Most functions expect
Unicode and produce Unicode. Even if many operating system functions
have two flavors, bytes and Unicode, the Unicode flavor is used in most
cases. There are good reasons for that: Unicode is more convenient in
Python 3 and using Unicode helps to support the full Unicode Character
Set (UCS) on Windows (even if Python now uses UTF-8 since Python 3.6,
see the `PEP 528`_ and the `PEP 529`_).

For example, if ``os.fsdecode()`` uses ``utf8/strict``,
``os.listdir(str)`` fails to list filenames of a directory if a single
filename is not decodable from UTF-8. As a consequence,
``shutil.rmtree(str)`` fails to remove a directory. Undecodable
filenames, environment variables, etc. are simply too common to make
this alternative viable.

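The drawback can be illustrated without a real directory. In the sketch
below, the list ``raw_names`` is a hypothetical directory listing and
``decode_all`` imitates a str-based listing that must decode every
entry. Neither is an actual CPython API.

```python
# Raw directory entries as the operating system would return them.
raw_names = [b"readme.txt", b"xxx\xff", b"notes.txt"]

def decode_all(names, errors):
    """Decode every name, or fail as a whole like a str-based listing."""
    return [name.decode("utf-8", errors) for name in names]

# With the strict handler, one undecodable name makes the whole listing
# unavailable, including the two valid names.
try:
    decode_all(raw_names, "strict")
    strict_listing_ok = True
except UnicodeDecodeError:
    strict_listing_ok = False

# With surrogateescape, every name is decoded and can be converted back
# to the exact original bytes, for example to delete the file.
escaped = decode_all(raw_names, "surrogateescape")
restored = [name.encode("utf-8", "surrogateescape") for name in escaped]
```
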
Links
=====

PEPs:

* `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
  "Coercing the legacy C locale to C.UTF-8"
* `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
  "Change Windows filesystem encoding to UTF-8"
* `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
  "Change Windows console encoding to UTF-8"
* `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
  "Non-decodable Bytes in System Character Interfaces"

Main Python issues:

* `Issue #29240: Implementation of the PEP 540: Add a new UTF-8 mode
  <http://bugs.python.org/issue29240>`_
* `Issue #28180: sys.getfilesystemencoding() should default to utf-8
  <http://bugs.python.org/issue28180>`_
* `Issue #19977: Use "surrogateescape" error handler for sys.stdin and
  sys.stdout on UNIX for the C locale
  <http://bugs.python.org/issue19977>`_
* `Issue #19847: Setting the default filesystem-encoding
  <http://bugs.python.org/issue19847>`_
* `Issue #8622: Add PYTHONFSENCODING environment variable
  <https://bugs.python.org/issue8622>`_: added but reverted because of
  many issues, read the `Inconsistencies if locale and filesystem
  encodings are different
  <https://mail.python.org/pipermail/python-dev/2010-October/104509.html>`_
  thread on the python-dev mailing list

Incomplete list of Python issues related to Unicode errors, especially
with the POSIX locale:

* 2016-12-22: `LANG=C python3 -c "import os; os.path.exists('\xff')"
  <http://bugs.python.org/issue29042#msg283821>`_
* 2014-07-20: `Issue #22016: Add a new 'surrogatereplace' output only
  error handler <http://bugs.python.org/issue22016>`_
* 2014-04-27: `Issue #21368: Check for systemd locale on startup if
  current locale is set to POSIX <http://bugs.python.org/issue21368>`_
  -- read manually /etc/locale.conf when the locale is POSIX
* 2014-01-21: `Issue #20329: zipfile.extractall fails in Posix shell
  with utf-8 filename <http://bugs.python.org/issue20329>`_
* 2013-11-30: `Issue #19846: Python 3 raises Unicode errors with the C locale
  <http://bugs.python.org/issue19846>`_
* 2010-05-04: `Issue #8610: Python3/POSIX: errors if file system
  encoding is None <http://bugs.python.org/issue8610>`_
* 2013-08-12: `Issue #18713: Clearly document the use of
  PYTHONIOENCODING to set surrogateescape
  <http://bugs.python.org/issue18713>`_
* 2013-09-27: `Issue #19100: Use backslashreplace in pprint
  <http://bugs.python.org/issue19100>`_
* 2012-01-05: `Issue #13717: os.walk() + print fails with UnicodeEncodeError
  <http://bugs.python.org/issue13717>`_
* 2011-12-20: `Issue #13643: 'ascii' is a bad filesystem default encoding
  <http://bugs.python.org/issue13643>`_
* 2011-03-16: `Issue #11574: TextIOWrapper should use UTF-8 by default
  for the POSIX locale <http://bugs.python.org/issue11574>`_, thread on
  python-dev: `Low-Level Encoding Behavior on Python 3
  <https://mail.python.org/pipermail/python-dev/2011-March/109361.html>`_
* 2010-04-26: `Issue #8533: regrtest: use backslashreplace error handler
  for stdout <http://bugs.python.org/issue8533>`_, regrtest fails with
  Unicode encode error if the locale is POSIX

Some issues are real bugs in applications which must explicitly set the
encoding. These applications work in the common case (locale configured
correctly), so the bug goes unnoticed until the program "suddenly" fails
when the POSIX locale is used (probably for bad reasons). Such bugs are
not well understood by users. Examples of such issues:

* 2013-11-21: `pip: open() uses the locale encoding to parse Python
  script, instead of the encoding cookie
  <http://bugs.python.org/issue19685>`_ -- pip must use the encoding
  cookie to read a Python source code file
* 2011-01-21: `IDLE 3.x can crash decoding recent file list
  <http://bugs.python.org/issue10974>`_

Prior Art
=========

Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
variable to force UTF-8: see `perlrun
<http://perldoc.perl.org/perlrun.html>`_. It is possible to configure
UTF-8 per standard stream, on input and output streams, etc.

Post History
============

* 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
  540 (assuming UTF-8 for *nix system boundaries)
  <https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_
* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 mode
  <https://mail.python.org/pipermail/python-ideas/2017-January/044089.html>`_
* 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to
  C.utf-8 (msg284764) <https://bugs.python.org/issue28180#msg284764>`_
* 2016-08-17: `bpo-27781: Change sys.getfilesystemencoding() on Windows
  to UTF-8 (msg272916) <https://bugs.python.org/issue27781#msg272916>`_
  -- Victor proposed ``-X utf8`` for the :pep:`529` (Change Windows
  filesystem encoding to UTF-8)

Copyright
=========

This document has been placed in the public domain.