From 0bb19ff93af9855db327e9a02f3e86b6f932a25a Mon Sep 17 00:00:00 2001 From: Victor Stinner Date: Wed, 6 Dec 2017 01:42:16 +0100 Subject: [PATCH] Rewrite the PEP 540! --- pep-0540.txt | 980 ++++++--------------------------------------------- 1 file changed, 113 insertions(+), 867 deletions(-) diff --git a/pep-0540.txt b/pep-0540.txt index 82afd388c..ec5e76220 100644 --- a/pep-0540.txt +++ b/pep-0540.txt @@ -2,8 +2,7 @@ PEP: 540 Title: Add a new UTF-8 mode Version: $Revision$ Last-Modified: $Date$ -Author: Victor Stinner , - Nick Coghlan +Author: Victor Stinner BDFL-Delegate: INADA Naoki Status: Draft Type: Standards Track @@ -15,345 +14,141 @@ Python-Version: 3.7 Abstract ======== -Add a new UTF-8 mode, enabled by default in the POSIX locale, to ignore -the locale and force the usage of the UTF-8 encoding for external -operating system interfaces, including the standard IO streams. +Add a new UTF-8 mode to ignore the locale and use the UTF-8 encoding +with the ``surrogateescape`` error handler. This mode is enabled by +default in the POSIX locale, but otherwise disabled by default. -Essentially, the UTF-8 mode behaves as Python 2 and other C based -applications on \*nix systems: it aims to process text as best it can, -but it errs on the side of producing or propagating mojibake to -subsequent components in a processing pipeline rather than requiring -strictly valid encodings at every step in the process. +Add also a "strict" UTF-8 mode which uses the ``strict`` error handler, +instead of ``surrogateescape``, with the UTF-8 encoding. -The UTF-8 mode can be configured as strict to reduce the risk of -producing or propagating mojibake. - -A new ``-X utf8`` command line option and ``PYTHONUTF8`` environment -variable are added to explicitly control the UTF-8 mode (including -turning it off entirely, even in the POSIX locale). +The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment +variable are added to control the UTF-8 mode. 
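As an illustration of the two error handlers named in the abstract, here is a minimal sketch (not part of the patch; the helper names ``passthrough`` and ``decode_strict`` are hypothetical). It assumes only the standard ``bytes.decode()`` and ``str.encode()`` methods and shows that ``surrogateescape`` round-trips a byte that is invalid in UTF-8, whereas ``strict`` rejects it.

```python
def passthrough(raw: bytes) -> bytes:
    """Decode and re-encode with surrogateescape, as a pipe-style tool would."""
    # Undecodable bytes become lone surrogates (U+DC80..U+DCFF) in the str...
    text = raw.decode("utf-8", errors="surrogateescape")
    # ...and are turned back into the original bytes when encoding.
    return text.encode("utf-8", errors="surrogateescape")


def decode_strict(raw: bytes) -> str:
    """Decode with the strict handler, which raises on the first invalid byte."""
    return raw.decode("utf-8", errors="strict")


if __name__ == "__main__":
    data = b"xxx\xff"  # 0xFF is never valid in UTF-8
    print(passthrough(data) == data)
    try:
        decode_strict(data)
    except UnicodeDecodeError as exc:
        print("strict failed:", exc.reason)
```

The round trip is lossless only when the same handler is used on both sides; the intermediate ``str`` contains a surrogate that most other encoders will refuse.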
Rationale ========= -"It's not a bug, you must fix your locale" is not an acceptable answer ----------------------------------------------------------------------- +Locale encoding and UTF-8 +------------------------- -Since Python 3.0 was released in 2008, the usual answer to users getting -Unicode errors is to ask developers to fix their code to handle Unicode -properly. Most applications and Python modules were fixed, but users -kept reporting Unicode errors regularly: see the long list of issues in -the `Links`_ section below. +Python 3.6 uses the locale encoding for filenames, environment +variables, standard streams, etc. The locale encoding is inherited from +the locale; the encoding and the locale are tightly coupled. -In fact, a second class of bugs comes from a locale which is not properly -configured. The usual answer to such a bug report is: "it is not a bug, -you must fix your locale". +Many users inherit the ASCII encoding from the POSIX locale, aka the "C" +locale, but are unable to change the locale for various reasons. This +encoding is very limited in terms of Unicode support: any non-ASCII +character is likely to cause trouble. For example, the Alpine Linux +distribution became popular thanks to Docker containers, but it uses the +POSIX locale by default. -Technically, the answer is correct, but from a practical point of view, -the answer is not acceptable. In many cases, "fixing the issue" is a -hard task. Moreover, sometimes, the usage of the POSIX locale is -deliberate. +It is not easy to get the expected locale. Locales don't have the exact +same name on all Linux distributions, FreeBSD, macOS, etc. Some +locales, like the recent ``C.UTF-8`` locale, are only supported by a few +platforms. For example, an SSH connection can use a different encoding +than the filesystem or terminal encoding of the local host.
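To make the limitation of the ASCII encoding concrete, here is a small sketch (not part of the patch; ``can_encode`` and ``locale_report`` are hypothetical helper names). It assumes only ``locale.getpreferredencoding()`` and ``sys.getfilesystemencoding()`` from the standard library, and shows that a single non-ASCII character cannot be encoded once the locale yields ASCII.

```python
import locale
import sys


def can_encode(text: str, encoding: str) -> bool:
    """Return True if text can be encoded with the strict error handler."""
    try:
        text.encode(encoding)  # strict is the default error handler
    except UnicodeEncodeError:
        return False
    return True


def locale_report() -> dict:
    """Collect the encodings Python derived from the current environment."""
    return {
        "locale_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    print(locale_report())
    # The euro sign is outside ASCII, so an ASCII locale cannot represent it.
    print(can_encode("euro: \u20ac", "ascii"))
    print(can_encode("euro: \u20ac", "utf-8"))
```

Running the script once normally and once with ``LC_ALL=C`` in the environment is one way to compare the reported encodings on a given platform; the values depend on the system and on the Python version.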
-A good example of a concrete issue are build systems which create a -fresh environment for each build using a chroot, a container, a virtual -machine or something else to get reproducible builds. Such a setup -usually uses the POSIX locale. To get 100% reproducible builds, the -POSIX locale is a good choice: see the `Locales section of -reproducible-builds.org -`_. +On the other hand, Python 3.6 is already using UTF-8 by default on +macOS, Android and Windows (PEP 529) for most functions, except for +``open()``. UTF-8 is also the default encoding of Python scripts, XML +and JSON file formats. The Go programming language uses UTF-8 for +strings. -PEP 538 lists additional problems related to the use of Linux containers to -run network services and command line applications. +When all data are stored as UTF-8 but the locale is often misconfigured, +an obvious solution is to ignore the locale and use UTF-8. -UNIX users don't expect Unicode errors, since the common command lines -tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors - -they produce mostly-readable text instead. +Passthrough undecodable bytes: surrogateescape +---------------------------------------------- -These users similarly expect that tools written in Python 3 (including -those updated from Python 2), continue to tolerate locale -misconfigurations and avoid bothering them with text encoding details. -From their point of the view, the bug is not their locale but is -obviously Python 3 ("Everything else works, including Python 2, so -what's wrong with Python 3?"). +Using UTF-8 is nice, until you read the first file encoded in a +different encoding. When using the ``strict`` error handler, which is +the default, Python 3 raises a ``UnicodeDecodeError`` on the first +undecodable byte. -Since Python 2 handles data as bytes, similar to system utilities -written in C and C++, it's rarer in Python 2 compared to Python 3 to get -explicit Unicode errors.
It also contributes significantly to why many -affected users perceive Python 3 as the root cause of their Unicode -errors. +Unix command line tools like ``cat`` or ``grep`` and most Python 2 +applications simply do not have this class of bugs: they don't decode +data, but process data as a raw byte sequence. -At the same time, the stricter text handling model was deliberately -introduced into Python 3 to reduce the frequency of data corruption bugs -arising in production services due to mismatched assumptions regarding -text encodings. It's one thing to emit mojibake to a user's terminal -while listing a directory, but something else entirely to store that in -a system manifest in a database, or to send it to a remote client -attempting to retrieve files from the system. +Python 3 already has a solution to behave like Unix tools and Python 2: +the ``surrogateescape`` error handler (:pep:`383`). It allows processing +data "as bytes" while using Unicode in practice (undecodable bytes are +stored as surrogate characters). -Since different group of users have different expectations, there is no -silver bullet which solves all issues at once. Last but not least, -backward compatibility should be preserved whenever possible. +For an application written as a Unix "pipe" tool like ``grep``, taking +input on stdin and writing output to stdout, ``surrogateescape`` makes +it possible to pass undecodable bytes through unchanged. -Locale and operating system data --------------------------------- +The UTF-8 encoding used with the ``surrogateescape`` error handler is a +compromise between correctness and usability. -.. 
_operating system data: - -Python uses an encoding called the "filesystem encoding" to decide how -to encode and decode data from/to the operating system: - -* file content -* command line arguments: ``sys.argv`` -* standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr`` -* environment variables: ``os.environ`` -* filenames: ``os.listdir(str)`` for example -* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example -* error messages: ``os.strerror(code)`` for example -* user and terminal names: ``os``, ``grp`` and ``pwd`` modules -* host name, UNIX socket path: see the ``socket`` module -* etc. - -At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user -``LC_CTYPE`` locale and then store the locale encoding as the -"filesystem error". It's possible to get this encoding using -``sys.getfilesystemencoding()``. In the whole lifetime of a Python -process, the same encoding and error handler are used to encode and -decode data from/to the operating system. - -The ``os.fsdecode()`` and ``os.fsencode()`` functions can be used to -decode and encode operating system data. These functions use the -filesystem error handler: ``sys.getfilesystemencodeerrors()``. - -.. note:: - In some corner cases, the *current* ``LC_CTYPE`` locale must be used - instead of ``sys.getfilesystemencoding()``. For example, the ``time`` - module uses the *current* ``LC_CTYPE`` locale to decode timezone - names. - - -The POSIX locale and its encoding ---------------------------------- - -The following environment variables are used to configure the locale, in -this preference order: - -* ``LC_ALL``, most important variable -* ``LC_CTYPE`` -* ``LANG`` - -The POSIX locale, also known as "the C locale", is used: - -* if the first set variable is set to ``"C"`` -* if all these variables are unset, for example when a program is - started in an empty environment. - -The encoding of the POSIX locale must be ASCII or a superset of ASCII. 
- -On Linux, the POSIX locale uses the ASCII encoding. - -On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of -the ASCII encoding, whereas ``mbstowcs()`` and ``wcstombs()`` functions -use the ISO 8859-1 encoding (Latin1) in practice. The problem is that -``os.fsencode()`` and ``os.fsdecode()`` use -``locale.getpreferredencoding()`` codec. For example, if command line -arguments are decoded by ``mbstowcs()`` and encoded back by -``os.fsencode()``, an ``UnicodeEncodeError`` exception is raised instead -of retrieving the original byte string. - -To fix this issue, Python checks since Python 3.4 if ``mbstowcs()`` -really uses the ASCII encoding if the the ``LC_CTYPE`` uses the the -POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an -alias to ASCII). If not (the effective encoding is not ASCII), Python -uses its own ASCII codec instead of using ``mbstowcs()`` and -``wcstombs()`` functions for `operating system data`_. - -See the `POSIX locale (2016 Edition) -`_. - - -POSIX locale used by mistake +Strict UTF-8 for correctness ---------------------------- -In many cases, the POSIX locale is not really expected by users who get -it by mistake. Examples: +When correctness matters more than usability, the ``strict`` error +handler is preferred over ``surrogateescape`` to raise an encoding error +at the first undecodable byte or unencodable character. -* program started in an empty environment -* User forcing LANG=C to get messages in English -* LANG=C used for bad reasons, without being aware of the ASCII encoding -* SSH shell -* Linux installed with no configured locale -* chroot environment, Docker image, container, ... with no locale is - configured -* User locale set to a non-existing locale, typo in the locale name for - example +No change by default for best backward compatibility +---------------------------------------------------- +While UTF-8 is perfect in most cases, sometimes the locale encoding is +actually the best encoding. 
-C.UTF-8 and C.utf8 locales --------------------------- +This PEP changes the behaviour for the POSIX locale since this locale +usually gives the ASCII encoding, whereas UTF-8 is a much better choice. +It does not change the behaviour for other locales to prevent any risk +or regression. -Some UNIX operating systems provide a variant of the POSIX locale using -the UTF-8 encoding: - -* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"`` -* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"`` -* HP-UX: ``"C.utf8"`` - -It was proposed to add a ``C.UTF-8`` locale to the glibc: `glibc C.UTF-8 -proposal `_. - -It is not planned to add such locale to BSD systems. - - -Popularity of the UTF-8 encoding --------------------------------- - -Python 3 uses UTF-8 by default for Python source files. - -On Mac OS X, Windows and Android, Python always use UTF-8 for operating -system data. For Windows, see the `PEP 529`_: "Change Windows filesystem -encoding to UTF-8". - -On Linux, UTF-8 became the de facto standard encoding, -replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example, -using different encodings for filenames and standard streams is likely -to create mojibake, so UTF-8 is now used *everywhere* (at least for -modern -distributions using their default settings). - -The UTF-8 encoding is the default encoding of XML and JSON file format. -In January 2017, UTF-8 was used in `more than 88% of web pages -`_ (HTML, -Javascript, CSS, etc.). - -See `utf8everywhere.org `_ for more general -information on the UTF-8 codec. - -.. note:: - Some applications and operating systems (especially Windows) use Byte - Order Markers (BOM) to indicate the used Unicode encoding: UTF-7, - UTF-8, UTF-16-LE, etc. BOM are not well supported and rarely used in - Python. - - -Old data stored in different encodings and surrogateescape ----------------------------------------------------------- - -Even if UTF-8 became the de facto standard, there are still systems in -the wild which don't use UTF-8. 
And there are a lot of data stored in -different encodings. For example, an old USB key using the ext3 -filesystem with filenames encoded to ISO 8859-1. - -The Linux kernel and libc don't decode filenames: a filename is used -as a raw array of bytes. The common solution to support any filename is -to store filenames as bytes and don't try to decode them. When displayed -to stdout, mojibake is displayed if the filename and the terminal don't -use the same encoding. - -Python 3 promotes Unicode everywhere including filenames. A solution to -support filenames not decodable from the locale encoding was found: the -``surrogateescape`` error handler (`PEP 383`_), store undecodable bytes -as surrogate characters. This error handler is used by default for -`operating system data`_, by ``os.fsdecode()`` and ``os.fsencode()`` for -example (except on Windows which uses the ``strict`` error handler). - - -Standard streams ----------------- - -Python uses the locale encoding for standard streams: stdin, stdout and -stderr. The ``strict`` error handler is used by stdin and stdout to -prevent mojibake. - -The ``backslashreplace`` error handler is used by stderr to avoid -Unicode encode errors when displaying non-ASCII text. It is especially -useful when the POSIX locale is used, because this locale usually uses -the ASCII encoding. - -The problem is that `operating system data`_ like filenames are decoded -using the ``surrogateescape`` error handler (`PEP 383`_). Displaying a -filename to stdout raises a Unicode encode error if the filename -contains an undecoded byte stored as a surrogate character. - -Python 3.5+ now uses ``surrogateescape`` for stdin and stdout if the -POSIX locale is used: `issue #19977 -`_. The idea is to pass through -`operating system data`_ even if it means mojibake, because most UNIX -applications work like that. 
Such UNIX applications often store -filenames as bytes, in many cases because their basic design principles -(or those of the language they're implemented in) were laid down half a -century ago when it was still a feat for computers to handle English -text correctly, rather than -humans having to work with raw numeric indexes. - -.. note:: - The encoding and/or the error handler of standard streams can be - overriden with the ``PYTHONIOENCODING`` environment variable. +Since users must explicitly enable the new UTF-8 mode for these other +locales, they are responsible for any potential mojibake issues caused by this mode. Proposal ======== -Changes -------- +Add a new UTF-8 mode to ignore the locale and use the UTF-8 encoding +with the ``surrogateescape`` error handler. This mode is enabled by +default in the POSIX locale, but otherwise disabled by default. -Add a new UTF-8 mode, enabled by default in the POSIX locale, but -otherwise disabled by default, to ignore the locale and force the usage -of the UTF-8 encoding with the ``surrogateescape`` error handler, -instead using the locale encoding (with ``strict`` or -``surrogateescape`` error handler depending on the case). - -The "normal" UTF-8 mode uses ``surrogateescape`` on the standard input -and output streams and opened files, as well as on all operating -system interfaces. This is the mode implicitly activated by the POSIX -locale. - -The "strict" UTF-8 mode reduces the risk of producing or propogating -mojibake: the UTF-8 encoding is used with the ``strict`` error handler -for inputs and outputs, but the ``surrogateescape`` error handler is -still used for `operating system data`_. This mode is never activated -implicitly, but can be requested explicitly. +Add also a "strict" UTF-8 mode which uses the ``strict`` error handler, +instead of ``surrogateescape``, with the UTF-8 encoding. The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment -variable are added to control the UTF-8 mode. 
+variable are added to control the UTF-8 mode: -The UTF-8 mode is enabled by ``-X utf8`` or ``PYTHONUTF8=1``. - -The UTF-8 Strict mode is configured by ``-X utf8=strict`` or -``PYTHONUTF8=strict``. +* The UTF-8 mode is enabled by ``-X utf8`` or ``PYTHONUTF8=1`` +* The Strict UTF-8 mode is configured by ``-X utf8=strict`` or + ``PYTHONUTF8=strict`` The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``. -Other option values fail with an error. +For standard streams, the ``PYTHONIOENCODING`` environment variable has +priority over the UTF-8 mode. -Options priority for the UTF-8 mode: +On Windows, the ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable +(:pep:`529`) has priority over the UTF-8 mode. -* ``PYTHONLEGACYWINDOWSFSENCODING`` -* ``-X utf8`` -* ``PYTHONUTF8`` -* POSIX locale -For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the UTF-8 mode, -whereas ``LC_ALL=C python3.7 -X utf8=0`` disables the UTF-8 mode and so -use the encoding of the POSIX locale. +Backward Compatibility +====================== -Encodings used by ``open()``, highest priority first: +The only backward incompatible change is that the UTF-8 encoding is now +used for the POSIX locale. -* *encoding* and *errors* parameters (if set) -* UTF-8 mode -* ``os.device_encoding(fd)`` -* ``os.getpreferredencoding(False)`` +Annex: Encodings And Error Handlers +=================================== + +The UTF-8 mode changes the default encoding and error handler used by +``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``, +``sys.stdout`` and ``sys.stderr``.
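One way to observe these defaults is to start a child interpreter with ``-X utf8`` and print what it reports. The sketch below is not part of the patch: the ``probe`` helper and the ``PROBE`` snippet are hypothetical, and it assumes an interpreter (Python 3.7 or later) that exposes ``sys.flags.utf8_mode``. It only exercises ``-X utf8``; the ``strict`` variant proposed in this draft is left out, since a given interpreter may not accept that value.

```python
import os
import subprocess
import sys

# One-liner executed in the child: UTF-8 mode flag, filesystem encoding,
# then the encoding and error handler of stdout.
PROBE = (
    "import sys; "
    "print(sys.flags.utf8_mode, sys.getfilesystemencoding(), "
    "sys.stdout.encoding, sys.stdout.errors)"
)


def probe(extra_args=()):
    """Run PROBE in a child interpreter and return its four output fields."""
    env = dict(os.environ)
    # Drop variables that would override the option being observed.
    env.pop("PYTHONIOENCODING", None)
    env.pop("PYTHONUTF8", None)
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", PROBE],
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    print("default:", probe())
    print("-X utf8:", probe(["-X", "utf8"]))
```

Comparing the two printed rows on a given system shows which of the listed defaults actually change there; the first row depends on the locale of the calling environment.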
Encoding and error handler -------------------------- -The UTF-8 mode changes the default encoding and error handler used by -``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``, -``sys.stdout`` and ``sys.stderr``: - ============================ ======================= ========================== ========================== -Function Default UTF-8 mode or POSIX locale UTF-8 Strict mode +Function Default UTF-8 mode or POSIX locale Strict UTF-8 mode ============================ ======================= ========================== ========================== open() locale/strict **UTF-8/surrogateescape** **UTF-8**/strict os.fsdecode(), os.fsencode() locale/surrogateescape **UTF-8**/surrogateescape **UTF-8**/surrogateescape @@ -372,22 +167,13 @@ sys.stdin, sys.stdout locale/strict locale/**surrogateescape* sys.stderr locale/backslashreplace locale/backslashreplace ============================ ======================= ========================== -The UTF-8 mode uses the ``surrogateescape`` error handler instead of the -strict mode for consistency with other standard \*nix operating system -components: the idea is that data not encoded to UTF-8 are passed through -"Python" without being modified, as raw bytes. - -The ``PYTHONIOENCODING`` environment variable has priority over the -UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1 -python3 -X utf8`` uses the Latin1 encoding for stdin, stdout and stderr. 
- Encoding and error handler on Windows ------------------------------------- On Windows, the encodings and error handlers are different: ============================ ======================= ========================== ========================== ========================== -Function Default Legacy Windows FS encoding UTF-8 mode UTF-8 Strict mode +Function Default Legacy Windows FS encoding UTF-8 mode Strict UTF-8 mode ============================ ======================= ========================== ========================== ========================== open() mbcs/strict mbcs/strict **UTF-8/surrogateescape** **UTF-8**/strict os.fsdecode(), os.fsencode() UTF-8/surrogatepass **mbcs/replace** UTF-8/surrogatepass UTF-8/surrogatepass @@ -406,512 +192,43 @@ sys.stdin, sys.stdout UTF-8/surrogateescape UTF-8/surrogateescape sys.stderr UTF-8/backslashreplace UTF-8/backslashreplace ============================ ======================= ========================== -The "Legacy Windows FS encoding" is enabled by setting the -``PYTHONLEGACYWINDOWSFSENCODING`` environment variable to ``1`` as -specified in `PEP 529` . - -Enabling the legacy Windows filesystem encoding disables the UTF-8 mode -(as ``-X utf8=0``). +The "Legacy Windows FS encoding" is enabled by the +``PYTHONLEGACYWINDOWSFSENCODING`` environment variable. If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or ``sys.stdout`` use ``mbcs`` encoding by default rather than UTF-8. But -with the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the -UTF-8 encoding. +in the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8 +encoding. -There is no POSIX locale on Windows. The ANSI code page is used to the -locale encoding, and this code page never uses the ASCII encoding. +.. note:: + There is no POSIX locale on Windows. The ANSI code page is used as the + locale encoding, and this code page never uses the ASCII encoding.
-Rationale ---------- +Annex: Differences between the PEP 538 and the PEP 540 +====================================================== -The UTF-8 mode is disabled by default to keep hard Unicode errors when -encoding or decoding `operating system data`_ failed, and to keep the -backward compatibility. The user is responsible to enable explicitly the -UTF-8 mode, and so is better prepared for mojibake than if the UTF-8 -mode would be enabled *by default*. +The PEP 538 uses the "C.UTF-8" locale which is quite new and only +supported by a few Linux distributions; this locale is not currently +supported by FreeBSD or macOS for example. This PEP 540 supports all +operating systems. -The UTF-8 mode should be used on systems known to be configured with -UTF-8 where most applications speak UTF-8. It prevents Unicode errors if -the user overrides a locale *by mistake* or if a Python program is -started with no locale configured (and so with the POSIX locale). +The PEP 538 only changes the behaviour for the POSIX locale. While the +new UTF-8 mode of this PEP is only enabled by the POSIX locale, it can +be enabled manually for any other locale. -Most UNIX applications handle `operating system data`_ as bytes, so -``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a -limited impact on how these data are handled by the application. - -The Python UTF-8 mode should help to make Python more interoperable with -the other UNIX applications in the system assuming that *UTF-8* is used -everywhere and that users *expect* UTF-8. - -Ignoring ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in -Python is more convenient, since they are more commonly misconfigured -*by mistake* (configured to use an encoding different than UTF-8, -whereas the system uses UTF-8), rather than being misconfigured by -intent. 
- -Expected mojibake and surrogate character issues ------------------------------------------------- - -The UTF-8 mode only affects code running directly in Python, especially -code written in pure Python. The other code, called "external code" -here, is not aware of this mode. Examples: - -* C libraries called by Python modules like OpenSSL -* The application code when Python is embedded in an application - -In the UTF-8 mode, Python uses the ``surrogateescape`` error handler -which stores bytes not decodable from UTF-8 as surrogate characters. - -If the external code uses the locale and the locale encoding is UTF-8, -it should work fine. - -External code using bytes -^^^^^^^^^^^^^^^^^^^^^^^^^ - -If the external code processes data as bytes, surrogate characters are -not an issue since they are only used inside Python. Python encodes back -surrogate characters to bytes at the edges, before calling external -code. - -The UTF-8 mode can produce mojibake since Python and external code don't -both of invalid bytes, but it's a deliberate choice. The UTF-8 mode can -be configured as strict to prevent mojibake and fail early when data -is not decodable from UTF-8 or not encodable to UTF-8. - -External code using text -^^^^^^^^^^^^^^^^^^^^^^^^ - -If the external code uses text API, for example using the ``wchar_t*`` C -type, mojibake should not occur, but the external code can fail on -surrogate characters. - - -Use Cases -========= - -The following use cases were written to help to understand the impact of -chosen encodings and error handlers on concrete examples. - -The "Exception?" column shows the potential benefit of having a UTF-8 -mode which is closer to the traditional Python 2 behaviour of passing -along raw binary data even if it isn't valid UTF-8. - -The "Mojibake" column shows that ignoring the locale causes a practical -issue: the UTF-8 mode produces mojibake if the terminal doesn't use the -UTF-8 encoding. 
- -The ideal configuration is "No exception, no risk of mojibake", but that -isn't always possible in the presence of non-UTF-8 encoded binary data. - -List a directory into stdout ----------------------------- - -Script listing the content of the current directory into stdout:: - - import os - for name in os.listdir(os.curdir): - print(name) - -Result: - -======================== ========== ========= -Python Exception? Mojibake? -======================== ========== ========= -Python 2 No **Yes** -Python 3 **Yes** No -Python 3.5, POSIX locale No **Yes** -UTF-8 mode No **Yes** -UTF-8 Strict mode **Yes** No -======================== ========== ========= - -"Exception?" means that the script can fail on decoding or encoding a -filename depending on the locale or the filename. - -To be able to never fail that way, the program must be able to produce -mojibake. For automated and interactive process, mojibake is often more -user friendly than an error with a truncated or empty output, since it -confines the problem to the affected entry, rather than aborting the -whole task. - -Example with a directory which contains the file called ``b'xxx\xff'`` -(the byte ``0xFF`` is invalid in UTF-8). - -Default and UTF-8 Strict mode fail on ``print()`` with an encode error:: - - $ python3.7 ../ls.py - Traceback (most recent call last): - File "../ls.py", line 5, in - print(name) - UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ... - - $ python3.7 -X utf8=strict ../ls.py - Traceback (most recent call last): - File "../ls.py", line 5, in - print(name) - UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ... 
- -The UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work -but display mojibake:: - - $ python3.7 -X utf8 ../ls.py - xxx� - - $ LC_ALL=C /python3.6 ../ls.py - xxx� - - $ python2 ../ls.py - xxx� - - $ ls - 'xxx'$'\377' - - -List a directory into a text file ---------------------------------- - -Similar to the previous example, except that the listing is written into -a text file:: - - import os - names = os.listdir(os.curdir) - with open("/tmp/content.txt", "w") as fp: - for name in names: - fp.write("%s\n" % name) - -Result: - -======================== ========== ========= -Python Exception? Mojibake? -======================== ========== ========= -Python 2 No **Yes** -Python 3 **Yes** No -Python 3.5, POSIX locale **Yes** No -UTF-8 mode No **Yes** -UTF-8 Strict mode **Yes** No -======================== ========== ========= - -Again, never throwing an exception requires that mojibake can be -produced, while preventing mojibake means that the script can fail on -decoding or encoding a filename depending on the locale or the filename. -Typical error:: - - $ LC_ALL=C python3 test.py - Traceback (most recent call last): - File "test.py", line 5, in - fp.write("%s\n" % name) - UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128) - -Compared with native system tools:: - - $ ls > /tmp/content.txt - $ cat /tmp/content.txt - xxx� - - -Display Unicode characters into stdout --------------------------------------- - -Very basic example used to illustrate a common issue, display the euro -sign (U+20AC: €):: - - print("euro: \u20ac") - -Result: - -======================== ========== ========= -Python Exception? Mojibake? 
-======================== ========== ========= -Python 2 **Yes** No -Python 3 **Yes** No -Python 3.5, POSIX locale **Yes** No -UTF-8 mode No **Yes** -UTF-8 Strict mode No **Yes** -======================== ========== ========= - -The UTF-8 and UTF-8 Strict modes will always encode the euro sign as -UTF-8. If the terminal uses a different encoding, we get mojibake. - -For example, using ``iconv`` to emulate a GB-18030 terminal inside a -UTF-8 one:: - - $ python3 -c 'print("euro: \u20ac")' | iconv -f gb18030 -t utf8 - euro: 鈧iconv: illegal input sequence at position 8 - -The misencoding also corrupts the trailing newline such that the output -stream isn't actually a valid GB-18030 sequence, hence the error message -after the euro symbol is misinterpreted as a hanzi character. - - -Replace a word in a text ------------------------- - -The following script replaces the word "apple" with "orange". It -reads input from stdin and writes the output into stdout:: - - import sys - text = sys.stdin.read() - sys.stdout.write(text.replace("apple", "orange")) - -Result: - -======================== ========== ========= -Python Exception? Mojibake? -======================== ========== ========= -Python 2 No **Yes** -Python 3 **Yes** No -Python 3.5, POSIX locale No **Yes** -UTF-8 mode No **Yes** -UTF-8 Strict mode **Yes** No -======================== ========== ========= - -This is a case where passing along the raw bytes (by way of the -``surrogateescape`` error handler) will bring Python 3's behaviour back -into line with standard operating system tools like ``sed`` and ``awk``. - - -Producer-consumer model using pipes ------------------------------------ - -Let's say that we have a "producer" program which writes data into its -stdout and a "consumer" program which reads data from its stdin. - -On a shell, such programs are run with the command:: - - producer | consumer - -The question if these programs will work with any data and any locale. 
-UNIX users don't expect Unicode errors, and so expect that such programs -"just works", in the sense that Unicode errors may cause problems in the -data stream, but won't cause the entire stream processing *itself* to -abort. - -If the producer only produces ASCII output, no error should occur. Let's -say that the producer writes at least one non-ASCII character (at least -one byte in the range ``0x80..0xff``). - -To simplify the problem, let's say that the consumer has no output -(doesn't write results into a file or stdout). - -A "Bytes producer" is an application which cannot fail with a Unicode -error and produces bytes into stdout. - -Let's say that a "Bytes consumer" does not decode stdin but stores data -as bytes: such consumer always work. Common UNIX command line tools like -``cat``, ``grep`` or ``sed`` are in this category. Many Python 2 -applications are also in this category, as are applications that work -with the lower level binary input and output stream in Python 3 rather -than the default text mode streams. - -"Python producer" and "Python consumer" are producer and consumer -implemented in Python using the default text mode input and output -streams. - -Bytes producer, Bytes consumer -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This won't through exceptions, but it is out of the scope of this PEP -since it doesn't involve Python's default text mode input and output -streams. - -Python producer, Bytes consumer -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Python producer:: - - print("euro: \u20ac") - -Result: - -======================== ========== ========= -Python Exception? Mojibake? -======================== ========== ========= -Python 2 **Yes** No -Python 3 **Yes** No -Python 3.5, POSIX locale **Yes** No -UTF-8 mode No **Yes** -UTF-8 Strict mode No **Yes** -======================== ========== ========= - -The question here is not if the consumer is able to decode the input, -but if Python is able to produce its output. 
So it's similar to the -`Display Unicode characters into stdout`_ case. - -UTF-8 modes work with any locale since the consumer doesn't try to -decode its stdin. - -Bytes producer, Python consumer -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Python consumer:: - - import sys - text = sys.stdin.read() - result = text.replace("apple", "orange") - # ignore the result - -Result: - -======================== ========== ========= -Python Exception? Mojibake? -======================== ========== ========= -Python 2 No **Yes** -Python 3 **Yes** No -Python 3.5, POSIX locale No **Yes** -UTF-8 mode No **Yes** -UTF-8 Strict mode **Yes** No -======================== ========== ========= - -Python 3 may throw an exception on decoding stdin depending on the input -and the locale. - - -Python producer, Python consumer -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Python producer:: - - print("euro: \u20ac") - -Python consumer:: - - import sys - text = sys.stdin.read() - result = text.replace("apple", "orange") - # ignore the result - -Result, same Python version used for the producer and the consumer: - -======================== ========== ========= -Python Exception? Mojibake? -======================== ========== ========= -Python 2 **Yes** No -Python 3 **Yes** No -Python 3.5, POSIX locale **Yes** No -UTF-8 mode No No(!) -UTF-8 Strict mode No No(!) -======================== ========== ========= - -This case combines a Python producer with a Python consumer, and the -result is mainly the same as that for `Python producer, Bytes -consumer`_, since the consumer can't read what the producer can't emit. - -However, the behaviour of the "UTF-8" and "UTF-8 Strict" modes in this -configuration is notable: they don't produce an exception, *and* they -shouldn't produce mojibake, as both the producer and the consumer are -making *consistent* assumptions regarding the text encoding used on the -pipe between them (i.e. UTF-8). 
- -Any mojibake generated would only be in the interfaces bween the -consuming component and the outside world (e.g. the terminal, or when -writing to a file). - -Backward Compatibility -====================== - -The main backward incompatible change is that the UTF-8 encoding is now -used by default if the locale is POSIX. Since the UTF-8 encoding is used -with the ``surrogateescape`` error handler, encoding errors should not -occur and so the change should not break applications. - -The UTF-8 encoding is also quite restrictive regarding where it allows -plain ASCII code points to appear in the byte stream, so even for -ASCII-incompatible encodings, such byte values will often be escaped -rather than being processed as ASCII characters. - -The more likely source of trouble comes from external libraries. Python -can decode successfully data from UTF-8, but a library using the locale -encoding can fail to encode the decoded text back to bytes. For example, -GNU readline currently has problems on Android due to the mismatch -between CPython's encoding assumptions there (always UTF-8) and GNU -readline's encoding assumptions (which are based on the nominal locale). - -The PEP only changes the default behaviour if the locale is POSIX. For -other locales, the *default* behaviour is unchanged. - -PEP 538 is a follow-up to this PEP that extends CPython's assumptions to -other locale-aware components in the same process by explicitly coercing -the POSIX locale to something more suitable for modern text processing. -See that PEP for further details. - - -Alternatives -============ - -Don't modify the encoding of the POSIX locale ---------------------------------------------- - -A first version of the PEP did not change the encoding and error handler -used of the POSIX locale. - -The problem is that adding the ``-X utf8`` command line option or -setting the ``PYTHONUTF8`` environment variable is not possible in some -cases, or at least not convenient. 
- -Moreover, many users simply expect that Python 3 behaves as Python 2: -don't bother them with encodings and "just works" in all cases. These -users don't worry about mojibake, or even expect mojibake because of -complex documents using multiple incompatibles encodings. - - -Always use UTF-8 ----------------- - -Python already always uses the UTF-8 encoding on Mac OS X, Android and -Windows. Since UTF-8 became the de facto encoding, it makes sense to -always use it on all platforms with any locale. - -The problem with this approach is that Python is also used extensively -in desktop environments, and it is often a practical or even legal -requirement to support locale encoding other than UTF-8 (for example, -GB-18030 in China, and Shift-JIS or ISO-2022-JP in Japan) - -Force UTF-8 for the POSIX locale --------------------------------- - -An alternative to always using UTF-8 in any case is to only use UTF-8 -when the ``LC_CTYPE`` locale is the POSIX locale. - -The `PEP 538`_ "Coercing the legacy C locale to C.UTF-8" of Nick -Coghlan proposes to implement that using the ``C.UTF-8`` locale. - - -Use the strict error handler for operating system data ------------------------------------------------------- - -Using the ``surrogateescape`` error handler for `operating system data`_ -creates surprising surrogate characters. No Python codec (except of -``utf-7``) accept surrogates, and so encoding text coming from the -operating system is likely to raise an error error. The problem is that -the error comes late, very far from where the data was read. - -The ``strict`` error handler can be used instead to decode -(``os.fsdecode()``) and encode (``os.fsencode()``) operating system -data, to raise encoding errors as soon as possible. It helps to find -bugs more quickly. - -The main drawback of this strategy is that it doesn't work in practice. -Python 3 is designed on top on Unicode strings. Most functions expect -Unicode and produce Unicode. 
Even if many operating system functions -have two flavors, bytes and Unicode, the Unicode flavor is used in most -cases. There are good reasons for that: Unicode is more convenient in -Python 3 and using Unicode helps to support the full Unicode Character -Set (UCS) on Windows (even if Python now uses UTF-8 since Python 3.6, -see the `PEP 528`_ and the `PEP 529`_). - -For example, if ``os.fsdecode()`` uses ``utf8/strict``, -``os.listdir(str)`` fails to list filenames of a directory if a single -filename is not decodable from UTF-8. As a consequence, -``shutil.rmtree(str)`` fails to remove a directory. Undecodable -filenames, environment variables, etc. are simply too common to make -this alternative viable. +The PEP 538 is implemented with ``setlocale(LC_CTYPE, "C.UTF-8")``: any +non-Python code running in the process is impacted by this change. This +PEP is implemented in Python internals and ignores the locale: +non-Python running in the same process is not aware of the "Python UTF-8 +mode". 
Links ===== -PEPs: - +* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 mode + `_ * `PEP 538 `_: "Coercing the legacy C locale to C.UTF-8" * `PEP 529 `_: @@ -921,83 +238,12 @@ PEPs: * `PEP 383 `_: "Non-decodable Bytes in System Character Interfaces" -Main Python issues: - -* `Issue #29240: Implementation of the PEP 540: Add a new UTF-8 mode - `_ -* `Issue #28180: sys.getfilesystemencoding() should default to utf-8 - `_ -* `Issue #19977: Use "surrogateescape" error handler for sys.stdin and - sys.stdout on UNIX for the C locale - `_ -* `Issue #19847: Setting the default filesystem-encoding - `_ -* `Issue #8622: Add PYTHONFSENCODING environment variable - `_: added but reverted because of - many issues, read the `Inconsistencies if locale and filesystem - encodings are different - `_ - thread on the python-dev mailing list - -Incomplete list of Python issues related to Unicode errors, especially -with the POSIX locale: - -* 2016-12-22: `LANG=C python3 -c "import os; os.path.exists('\xff')" - `_ -* 2014-07-20: `issue #22016: Add a new 'surrogatereplace' output only - error handler `_ -* 2014-04-27: `Issue #21368: Check for systemd locale on startup if - current locale is set to POSIX `_ - -- read manually /etc/locale.conf when the locale is POSIX -* 2014-01-21: `Issue #20329: zipfile.extractall fails in Posix shell - with utf-8 filename `_ -* 2013-11-30: `Issue #19846: Python 3 raises Unicode errors with the C locale - `_ -* 2010-05-04: `Issue #8610: Python3/POSIX: errors if file system - encoding is None `_ -* 2013-08-12: `Issue #18713: Clearly document the use of - PYTHONIOENCODING to set surrogateescape - `_ -* 2013-09-27: `Issue #19100: Use backslashreplace in pprint - `_ -* 2012-01-05: `Issue #13717: os.walk() + print fails with UnicodeEncodeError - `_ -* 2011-12-20: `Issue #13643: 'ascii' is a bad filesystem default encoding - `_ -* 2011-03-16: `issue #11574: TextIOWrapper should use UTF-8 by default - for the POSIX locale `_, thread on - python-dev: 
`Low-Level Encoding Behavior on Python 3 - `_ -* 2010-04-26: `Issue #8533: regrtest: use backslashreplace error handler - for stdout `_, regrtest fails with - Unicode encode error if the locale is POSIX - -Some issues are real bugs in applications which must explicitly set the -encoding. Well, it just works in the common case (locale configured -correctly), so what? The program "suddenly" fails when the POSIX -locale is used (probably for bad reasons). Such bugs are not well -understood by users. Example of such issues: - -* 2013-11-21: `pip: open() uses the locale encoding to parse Python - script, instead of the encoding cookie - `_ -- pip must use the encoding - cookie to read a Python source code file -* 2011-01-21: `IDLE 3.x can crash decoding recent file list - `_ - - -Prior Art -========= - -Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment -variable to force UTF-8: see `perlrun -`_. It is possible to configure -UTF-8 per standard stream, on input and output streams, etc. - Post History ============ +* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 mode + `_ * 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 & 540 (assuming UTF-8 for *nix system boundaries) `_