From 0bb19ff93af9855db327e9a02f3e86b6f932a25a Mon Sep 17 00:00:00 2001
From: Victor Stinner <victor.stinner@gmail.com>
Date: Wed, 6 Dec 2017 01:42:16 +0100
Subject: [PATCH] Rewrite the PEP 540!

---
 pep-0540.txt | 980 ++++++---------------------------------------------
 1 file changed, 113 insertions(+), 867 deletions(-)

diff --git a/pep-0540.txt b/pep-0540.txt
index 82afd388c..ec5e76220 100644
--- a/pep-0540.txt
+++ b/pep-0540.txt
@@ -2,8 +2,7 @@ PEP: 540
 Title: Add a new UTF-8 mode
 Version: $Revision$
 Last-Modified: $Date$
-Author: Victor Stinner <victor.stinner@gmail.com>,
-        Nick Coghlan <ncoghlan@gmail.com>
+Author: Victor Stinner <victor.stinner@gmail.com>
 BDFL-Delegate: INADA Naoki
 Status: Draft
 Type: Standards Track
@@ -15,345 +14,141 @@ Python-Version: 3.7
 Abstract
 ========
 
-Add a new UTF-8 mode, enabled by default in the POSIX locale, to ignore
-the locale and force the usage of the UTF-8 encoding for external
-operating system interfaces, including the standard IO streams.
+Add a new UTF-8 mode to ignore the locale and use the UTF-8 encoding
+with the ``surrogateescape`` error handler. This mode is enabled by
+default in the POSIX locale, but otherwise disabled by default.
 
-Essentially, the UTF-8 mode behaves as Python 2 and other C based
-applications on \*nix systems: it aims to process text as best it can,
-but it errs on the side of producing or propagating mojibake to
-subsequent components in a processing pipeline rather than requiring
-strictly valid encodings at every step in the process.
+Also add a "strict" UTF-8 mode which uses the ``strict`` error handler,
+instead of ``surrogateescape``, with the UTF-8 encoding.
 
-The UTF-8 mode can be configured as strict to reduce the risk of
-producing or propagating mojibake.
-
-A new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
-variable are added to explicitly control the UTF-8 mode (including
-turning it off entirely, even in the POSIX locale).
+The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
+variable are added to control the UTF-8 mode.
 
 
 Rationale
 =========
 
-"It's not a bug, you must fix your locale" is not an acceptable answer
-----------------------------------------------------------------------
+Locale encoding and UTF-8
+-------------------------
 
-Since Python 3.0 was released in 2008, the usual answer to users getting
-Unicode errors is to ask developers to fix their code to handle Unicode
-properly. Most applications and Python modules were fixed, but users
-kept reporting Unicode errors regularly: see the long list of issues in
-the `Links`_ section below.
+Python 3.6 uses the locale encoding for filenames, environment
+variables, standard streams, etc. The locale encoding is inherited from
+the locale; the encoding and the locale are tightly coupled.
 
-In fact, a second class of bugs comes from a locale which is not properly
-configured. The usual answer to such a bug report is: "it is not a bug,
-you must fix your locale".
+Many users inherit the ASCII encoding from the POSIX locale, aka the "C"
+locale, but are unable to change the locale for various reasons. This
+encoding is very limited in terms of Unicode support: any non-ASCII
+character is likely to cause trouble. For example, the Alpine Linux
+distribution became popular thanks to Docker containers, but it uses the
+POSIX locale by default.
 
-Technically, the answer is correct, but from a practical point of view,
-the answer is not acceptable. In many cases, "fixing the issue" is a
-hard task. Moreover, sometimes, the usage of the POSIX locale is
-deliberate.
+It is not easy to get the expected locale. Locales do not have exactly
+the same names on all Linux distributions, FreeBSD, macOS, etc. Some
+locales, like the recent ``C.UTF-8`` locale, are only supported by a few
+platforms. For example, an SSH connection can use a different encoding
+than the filesystem or terminal encoding of the local host.
 
-A good example of a concrete issue are build systems which create a
-fresh environment for each build using a chroot, a container, a virtual
-machine or something else to get reproducible builds. Such a setup
-usually uses the POSIX locale.  To get 100% reproducible builds, the
-POSIX locale is a good choice: see the `Locales section of
-reproducible-builds.org
-<https://reproducible-builds.org/docs/locales/>`_.
+On the other hand, Python 3.6 already uses UTF-8 by default on
+macOS, Android and Windows (PEP 529) for most functions, except for
+``open()``. UTF-8 is also the default encoding of Python scripts, XML
+and JSON file formats. The Go programming language uses UTF-8 for
+strings.
 
-PEP 538 lists additional problems related to the use of Linux containers to
-run network services and command line applications.
+When all data are stored as UTF-8 but the locale is often misconfigured,
+an obvious solution is to ignore the locale and use UTF-8.
 
-UNIX users don't expect Unicode errors, since the common command lines
-tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors -
-they produce mostly-readable text instead.
+Passthrough undecodable bytes: surrogateescape
+----------------------------------------------
 
-These users similarly expect that tools written in Python 3 (including
-those updated from Python 2), continue to tolerate locale
-misconfigurations and avoid bothering them with text encoding details.
-From their point of the view, the bug is not their locale but is
-obviously Python 3 ("Everything else works, including Python 2, so
-what's wrong with Python 3?").
+Using UTF-8 is nice, until you read the first file encoded to a
+different encoding. When using the ``strict`` error handler, which is
+the default, Python 3 raises a ``UnicodeDecodeError`` on the first
+undecodable byte.
 
-Since Python 2 handles data as bytes, similar to system utilities
-written in C and C++, it's rarer in Python 2 compared to Python 3 to get
-explicit Unicode errors. It also contributes significantly to why many
-affected users perceive Python 3 as the root cause of their Unicode
-errors.
+Unix command line tools like ``cat`` or ``grep`` and most Python 2
+applications simply do not have this class of bugs: they don't decode
+data, but process data as a raw bytes sequence.
 
-At the same time, the stricter text handling model was deliberately
-introduced into Python 3 to reduce the frequency of data corruption bugs
-arising in production services due to mismatched assumptions regarding
-text encodings.  It's one thing to emit mojibake to a user's terminal
-while listing a directory, but something else entirely to store that in
-a system manifest in a database, or to send it to a remote client
-attempting to retrieve files from the system.
+Python 3 already has a solution to behave like Unix tools and Python 2:
+the ``surrogateescape`` error handler (:pep:`383`). It allows data to be
+processed "as bytes" while using Unicode in practice (undecodable bytes
+are stored as surrogate characters).
 
-Since different group of users have different expectations, there is no
-silver bullet which solves all issues at once. Last but not least,
-backward compatibility should be preserved whenever possible.
+For an application written as a Unix "pipe" tool like ``grep``, taking
+input on stdin and writing output to stdout, ``surrogateescape`` allows
+undecodable bytes to be passed through unchanged.
 
-Locale and operating system data
---------------------------------
+The UTF-8 encoding used with the ``surrogateescape`` error handler is a
+compromise between correctness and usability.
 
-.. _operating system data:
-
-Python uses an encoding called the "filesystem encoding" to decide how
-to encode and decode data from/to the operating system:
-
-* file content
-* command line arguments: ``sys.argv``
-* standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
-* environment variables: ``os.environ``
-* filenames: ``os.listdir(str)`` for example
-* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
-* error messages: ``os.strerror(code)`` for example
-* user and terminal names: ``os``, ``grp`` and ``pwd`` modules
-* host name, UNIX socket path: see the ``socket`` module
-* etc.
-
-At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user
-``LC_CTYPE`` locale and then store the locale encoding as the
-"filesystem error". It's possible to get this encoding using
-``sys.getfilesystemencoding()``. In the whole lifetime of a Python
-process, the same encoding and error handler are used to encode and
-decode data from/to the operating system.
-
-The ``os.fsdecode()`` and ``os.fsencode()`` functions can be used to
-decode and encode operating system data. These functions use the
-filesystem error handler: ``sys.getfilesystemencodeerrors()``.
-
-.. note::
-   In some corner cases, the *current* ``LC_CTYPE`` locale must be used
-   instead of ``sys.getfilesystemencoding()``. For example, the ``time``
-   module uses the *current* ``LC_CTYPE`` locale to decode timezone
-   names.
-
-
-The POSIX locale and its encoding
----------------------------------
-
-The following environment variables are used to configure the locale, in
-this preference order:
-
-* ``LC_ALL``, most important variable
-* ``LC_CTYPE``
-* ``LANG``
-
-The POSIX locale, also known as "the C locale", is used:
-
-* if the first set variable is set to ``"C"``
-* if all these variables are unset, for example when a program is
-  started in an empty environment.
-
-The encoding of the POSIX locale must be ASCII or a superset of ASCII.
-
-On Linux, the POSIX locale uses the ASCII encoding.
-
-On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
-the ASCII encoding, whereas ``mbstowcs()`` and ``wcstombs()`` functions
-use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
-``os.fsencode()`` and ``os.fsdecode()`` use
-``locale.getpreferredencoding()`` codec. For example, if command line
-arguments are decoded by ``mbstowcs()`` and encoded back by
-``os.fsencode()``, an ``UnicodeEncodeError`` exception is raised instead
-of retrieving the original byte string.
-
-To fix this issue, Python checks since Python 3.4 if ``mbstowcs()``
-really uses the ASCII encoding if the the ``LC_CTYPE`` uses the the
-POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
-alias to ASCII). If not (the effective encoding is not ASCII), Python
-uses its own ASCII codec instead of using ``mbstowcs()`` and
-``wcstombs()`` functions for `operating system data`_.
-
-See the `POSIX locale (2016 Edition)
-<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.
-
-
-POSIX locale used by mistake
+Strict UTF-8 for correctness
 ----------------------------
 
-In many cases, the POSIX locale is not really expected by users who get
-it by mistake. Examples:
+When correctness matters more than usability, the ``strict`` error
+handler is preferred over ``surrogateescape`` to raise an encoding error
+at the first undecodable byte or unencodable character.
 
-* program started in an empty environment
-* User forcing LANG=C to get messages in English
-* LANG=C used for bad reasons, without being aware of the ASCII encoding
-* SSH shell
-* Linux installed with no configured locale
-* chroot environment, Docker image, container, ... with no locale is
-  configured
-* User locale set to a non-existing locale, typo in the locale name for
-  example
+No change by default for best backward compatibility
+----------------------------------------------------
 
+While UTF-8 is perfect in most cases, sometimes the locale encoding is
+actually the best encoding.
 
-C.UTF-8 and C.utf8 locales
---------------------------
+This PEP changes the behaviour for the POSIX locale since this locale
+usually gives the ASCII encoding, whereas UTF-8 is a much better choice.
+It does not change the behaviour for other locales to prevent any risk
+of regression.
 
-Some UNIX operating systems provide a variant of the POSIX locale using
-the UTF-8 encoding:
-
-* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
-* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
-* HP-UX: ``"C.utf8"``
-
-It was proposed to add a ``C.UTF-8`` locale to the glibc: `glibc C.UTF-8
-proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.
-
-It is not planned to add such locale to BSD systems.
-
-
-Popularity of the UTF-8 encoding
---------------------------------
-
-Python 3 uses UTF-8 by default for Python source files.
-
-On Mac OS X, Windows and Android, Python always use UTF-8 for operating
-system data. For Windows, see the `PEP 529`_: "Change Windows filesystem
-encoding to UTF-8".
-
-On Linux, UTF-8 became the de facto standard encoding,
-replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
-using different encodings for filenames and standard streams is likely
-to create mojibake, so UTF-8 is now used *everywhere* (at least for
-modern
-distributions using their default settings).
-
-The UTF-8 encoding is the default encoding of XML and JSON file format.
-In January 2017, UTF-8 was used in `more than 88% of web pages
-<https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
-Javascript, CSS, etc.).
-
-See `utf8everywhere.org <http://utf8everywhere.org/>`_ for more general
-information on the UTF-8 codec.
-
-.. note::
-   Some applications and operating systems (especially Windows) use Byte
-   Order Markers (BOM) to indicate the used Unicode encoding: UTF-7,
-   UTF-8, UTF-16-LE, etc. BOM are not well supported and rarely used in
-   Python.
-
-
-Old data stored in different encodings and surrogateescape
-----------------------------------------------------------
-
-Even if UTF-8 became the de facto standard, there are still systems in
-the wild which don't use UTF-8. And there are a lot of data stored in
-different encodings. For example, an old USB key using the ext3
-filesystem with filenames encoded to ISO 8859-1.
-
-The Linux kernel and libc don't decode filenames: a filename is used
-as a raw array of bytes. The common solution to support any filename is
-to store filenames as bytes and don't try to decode them. When displayed
-to stdout, mojibake is displayed if the filename and the terminal don't
-use the same encoding.
-
-Python 3 promotes Unicode everywhere including filenames. A solution to
-support filenames not decodable from the locale encoding was found: the
-``surrogateescape`` error handler (`PEP 383`_), store undecodable bytes
-as surrogate characters. This error handler is used by default for
-`operating system data`_, by ``os.fsdecode()`` and ``os.fsencode()`` for
-example (except on Windows which uses the ``strict`` error handler).
-
-
-Standard streams
-----------------
-
-Python uses the locale encoding for standard streams: stdin, stdout and
-stderr. The ``strict`` error handler is used by stdin and stdout to
-prevent mojibake.
-
-The ``backslashreplace`` error handler is used by stderr to avoid
-Unicode encode errors when displaying non-ASCII text. It is especially
-useful when the POSIX locale is used, because this locale usually uses
-the ASCII encoding.
-
-The problem is that `operating system data`_ like filenames are decoded
-using the ``surrogateescape`` error handler (`PEP 383`_). Displaying a
-filename to stdout raises a Unicode encode error if the filename
-contains an undecoded byte stored as a surrogate character.
-
-Python 3.5+ now uses ``surrogateescape`` for stdin and stdout if the
-POSIX locale is used: `issue #19977
-<http://bugs.python.org/issue19977>`_. The idea is to pass through
-`operating system data`_ even if it means mojibake, because most UNIX
-applications work like that. Such UNIX applications often store
-filenames as bytes, in many cases because their basic design principles
-(or those of the language they're implemented in) were laid down half a
-century ago when it was still a feat for computers to handle English
-text correctly, rather than
-humans having to work with raw numeric indexes.
-
-.. note::
-   The encoding and/or the error handler of standard streams can be
-   overriden with the ``PYTHONIOENCODING`` environment variable.
+As users must explicitly enable the new UTF-8 mode themselves, they
+are responsible for any potential mojibake issues caused by this mode.
 
 
 Proposal
 ========
 
-Changes
--------
+Add a new UTF-8 mode to ignore the locale and use the UTF-8 encoding
+with the ``surrogateescape`` error handler. This mode is enabled by
+default in the POSIX locale, but otherwise disabled by default.
 
-Add a new UTF-8 mode, enabled by default in the POSIX locale, but
-otherwise disabled by default, to ignore the locale and force the usage
-of the UTF-8 encoding with the ``surrogateescape`` error handler,
-instead using the locale encoding (with ``strict`` or
-``surrogateescape`` error handler depending on the case).
-
-The "normal" UTF-8 mode uses ``surrogateescape`` on the standard input
-and output streams and opened files, as well as on all operating
-system interfaces. This is the mode implicitly activated by the POSIX
-locale.
-
-The "strict" UTF-8 mode reduces the risk of producing or propogating
-mojibake: the UTF-8 encoding is used with the ``strict`` error handler
-for inputs and outputs, but the ``surrogateescape`` error handler is
-still used for `operating system data`_. This mode is never activated
-implicitly, but can be requested explicitly.
+Also add a "strict" UTF-8 mode which uses the ``strict`` error handler,
+instead of ``surrogateescape``, with the UTF-8 encoding.
 
 The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
-variable are added to control the UTF-8 mode.
+variable are added to control the UTF-8 mode:
 
-The UTF-8 mode is enabled by ``-X utf8`` or ``PYTHONUTF8=1``.
-
-The UTF-8 Strict mode is configured by ``-X utf8=strict`` or
-``PYTHONUTF8=strict``.
+* The UTF-8 mode is enabled by ``-X utf8`` or ``PYTHONUTF8=1``
+* The Strict UTF-8 mode is configured by ``-X utf8=strict`` or
+  ``PYTHONUTF8=strict``
 
 The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
 can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
 
-Other option values fail with an error.
+For standard streams, the ``PYTHONIOENCODING`` environment variable has
+priority over the UTF-8 mode.
 
-Options priority for the UTF-8 mode:
+On Windows, the ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable
+(:pep:`529`) has priority over the UTF-8 mode.
 
-* ``PYTHONLEGACYWINDOWSFSENCODING``
-* ``-X utf8``
-* ``PYTHONUTF8``
-* POSIX locale
 
-For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the UTF-8 mode,
-whereas ``LC_ALL=C python3.7 -X utf8=0`` disables the UTF-8 mode and so
-use the encoding of the POSIX locale.
+Backward Compatibility
+======================
 
-Encodings used by ``open()``, highest priority first:
+The only backward incompatible change is that the UTF-8 encoding is now
+used for the POSIX locale.
 
-* *encoding* and *errors* parameters (if set)
-* UTF-8 mode
-* ``os.device_encoding(fd)``
-* ``os.getpreferredencoding(False)``
 
+Annex: Encodings And Error Handlers
+===================================
+
+The UTF-8 mode changes the default encoding and error handler used by
+``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
+``sys.stdout`` and ``sys.stderr``.
 
 Encoding and error handler
 --------------------------
 
-The UTF-8 mode changes the default encoding and error handler used by
-``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
-``sys.stdout`` and ``sys.stderr``:
-
 ============================  =======================  ==========================  ==========================
-Function                      Default                  UTF-8 mode or POSIX locale  UTF-8 Strict mode
+Function                      Default                  UTF-8 mode or POSIX locale  Strict UTF-8 mode
 ============================  =======================  ==========================  ==========================
 open()                        locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
 os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape   **UTF-8**/surrogateescape
@@ -372,22 +167,13 @@ sys.stdin, sys.stdout         locale/strict            locale/**surrogateescape*
 sys.stderr                    locale/backslashreplace  locale/backslashreplace
 ============================  =======================  ==========================
 
-The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
-strict mode for consistency with other standard \*nix operating system
-components: the idea is that data not encoded to UTF-8 are passed through
-"Python" without being modified, as raw bytes.
-
-The ``PYTHONIOENCODING`` environment variable has priority over the
-UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1
-python3 -X utf8`` uses the Latin1 encoding for stdin, stdout and stderr.
-
 Encoding and error handler on Windows
 -------------------------------------
 
 On Windows, the encodings and error handlers are different:
 
 ============================  =======================  ==========================  ==========================  ==========================
-Function                      Default                  Legacy Windows FS encoding  UTF-8 mode                  UTF-8 Strict mode
+Function                      Default                  Legacy Windows FS encoding  UTF-8 mode                  Strict UTF-8 mode
 ============================  =======================  ==========================  ==========================  ==========================
 open()                        mbcs/strict              mbcs/strict                 **UTF-8/surrogateescape**   **UTF-8**/strict
 os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**            UTF-8/surrogatepass         UTF-8/surrogatepass
@@ -406,512 +192,43 @@ sys.stdin, sys.stdout         UTF-8/surrogateescape    UTF-8/surrogateescape
 sys.stderr                    UTF-8/backslashreplace   UTF-8/backslashreplace
 ============================  =======================  ==========================
 
-The "Legacy Windows FS encoding" is enabled by setting the
-``PYTHONLEGACYWINDOWSFSENCODING`` environment variable to ``1`` as
-specified in `PEP 529` .
-
-Enabling the legacy Windows filesystem encoding disables the UTF-8 mode
-(as ``-X utf8=0``).
+The "Legacy Windows FS encoding" is enabled by the
+``PYTHONLEGACYWINDOWSFSENCODING`` environment variable.
 
 If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
 ``sys.output`` use ``mbcs`` encoding by default rather than UTF-8. But
-with the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the
-UTF-8 encoding.
+in the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
+encoding.
 
-There is no POSIX locale on Windows. The ANSI code page is used to the
-locale encoding, and this code page never uses the ASCII encoding.
+.. note::
+   There is no POSIX locale on Windows. The ANSI code page is used as the
+   locale encoding, and this code page never uses the ASCII encoding.
 
 
-Rationale
----------
+Annex: Differences between the PEP 538 and the PEP 540
+======================================================
 
-The UTF-8 mode is disabled by default to keep hard Unicode errors when
-encoding or decoding `operating system data`_ failed, and to keep the
-backward compatibility. The user is responsible to enable explicitly the
-UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
-mode would be enabled *by default*.
+The PEP 538 uses the "C.UTF-8" locale which is quite new and only
+supported by a few Linux distributions; this locale is not currently
+supported by FreeBSD or macOS for example. This PEP 540 supports all
+operating systems.
 
-The UTF-8 mode should be used on systems known to be configured with
-UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
-the user overrides a locale *by mistake* or if a Python program is
-started with no locale configured (and so with the POSIX locale).
+The PEP 538 only changes the behaviour for the POSIX locale. While the
+new UTF-8 mode of this PEP is only enabled by default in the POSIX
+locale, it can be enabled manually for any other locale.
 
-Most UNIX applications handle `operating system data`_ as bytes, so
-``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
-limited impact on how these data are handled by the application.
-
-The Python UTF-8 mode should help to make Python more interoperable with
-the  other UNIX applications in the system assuming that *UTF-8* is used
-everywhere and that users *expect* UTF-8.
-
-Ignoring ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
-Python is more convenient, since they are more commonly misconfigured
-*by mistake* (configured to use an encoding different than UTF-8,
-whereas the system uses UTF-8), rather than being misconfigured by
-intent.
-
-Expected mojibake and surrogate character issues
-------------------------------------------------
-
-The UTF-8 mode only affects code running directly in Python, especially
-code written in pure Python. The other code, called "external code"
-here, is not aware of this mode. Examples:
-
-* C libraries called by Python modules like OpenSSL
-* The application code when Python is embedded in an application
-
-In the UTF-8 mode, Python uses the ``surrogateescape`` error handler
-which stores bytes not decodable from UTF-8 as surrogate characters.
-
-If the external code uses the locale and the locale encoding is UTF-8,
-it should work fine.
-
-External code using bytes
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-If the external code processes data as bytes, surrogate characters are
-not an issue since they are only used inside Python. Python encodes back
-surrogate characters to bytes at the edges, before calling external
-code.
-
-The UTF-8 mode can produce mojibake since Python and external code don't
-both of invalid bytes, but it's a deliberate choice. The UTF-8 mode can
-be configured as strict to prevent mojibake and fail early when data
-is not decodable from UTF-8 or not encodable to UTF-8.
-
-External code using text
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-If the external code uses text API, for example using the ``wchar_t*`` C
-type, mojibake should not occur, but the external code can fail on
-surrogate characters.
-
-
-Use Cases
-=========
-
-The following use cases were written to help to understand the impact of
-chosen encodings and error handlers on concrete examples.
-
-The "Exception?" column shows the potential benefit of having a UTF-8
-mode which is closer to the traditional Python 2 behaviour of passing
-along raw binary data even if it isn't valid UTF-8.
-
-The "Mojibake" column shows that ignoring the locale causes a practical
-issue: the UTF-8 mode produces mojibake if the terminal doesn't use the
-UTF-8 encoding.
-
-The ideal configuration is "No exception, no risk of mojibake", but that
-isn't always possible in the presence of non-UTF-8 encoded binary data.
-
-List a directory into stdout
-----------------------------
-
-Script listing the content of the current directory into stdout::
-
-    import os
-    for name in os.listdir(os.curdir):
-        print(name)
-
-Result:
-
-========================  ==========  =========
-Python                    Exception?  Mojibake?
-========================  ==========  =========
-Python 2                  No          **Yes**
-Python 3                  **Yes**     No
-Python 3.5, POSIX locale  No          **Yes**
-UTF-8 mode                No          **Yes**
-UTF-8 Strict mode         **Yes**     No
-========================  ==========  =========
-
-"Exception?" means that the script can fail on decoding or encoding a
-filename depending on the locale or the filename.
-
-To be able to never fail that way, the program must be able to produce
-mojibake.  For automated and interactive process, mojibake is often more
-user friendly than an error with a truncated or empty output, since it
-confines the problem to the affected entry, rather than aborting the
-whole task.
-
-Example with a directory which contains the file called ``b'xxx\xff'``
-(the byte ``0xFF`` is invalid in UTF-8).
-
-Default and UTF-8 Strict mode fail on ``print()`` with an encode error::
-
-    $ python3.7 ../ls.py
-    Traceback (most recent call last):
-      File "../ls.py", line 5, in <module>
-        print(name)
-    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
-
-    $ python3.7 -X utf8=strict ../ls.py
-    Traceback (most recent call last):
-      File "../ls.py", line 5, in <module>
-        print(name)
-    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
-
-The UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
-but display mojibake::
-
-    $ python3.7 -X utf8 ../ls.py
-    xxx�
-
-    $ LC_ALL=C /python3.6 ../ls.py
-    xxx�
-
-    $ python2 ../ls.py
-    xxx�
-
-    $ ls
-    'xxx'$'\377'
-
-
-List a directory into a text file
----------------------------------
-
-Similar to the previous example, except that the listing is written into
-a text file::
-
-    import os
-    names = os.listdir(os.curdir)
-    with open("/tmp/content.txt", "w") as fp:
-        for name in names:
-            fp.write("%s\n" % name)
-
-Result:
-
-========================  ==========  =========
-Python                    Exception?  Mojibake?
-========================  ==========  =========
-Python 2                  No          **Yes**
-Python 3                  **Yes**     No
-Python 3.5, POSIX locale  **Yes**     No
-UTF-8 mode                No          **Yes**
-UTF-8 Strict mode         **Yes**     No
-========================  ==========  =========
-
-Again, never throwing an exception requires that mojibake can be
-produced, while preventing mojibake means that the script can fail on
-decoding or encoding a filename depending on the locale or the filename.
-Typical error::
-
-    $ LC_ALL=C python3 test.py
-    Traceback (most recent call last):
-      File "test.py", line 5, in <module>
-        fp.write("%s\n" % name)
-    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
-
-Compared with native system tools::
-
-    $ ls > /tmp/content.txt
-    $ cat /tmp/content.txt
-    xxx�
-
-
-Display Unicode characters into stdout
---------------------------------------
-
-Very basic example used to illustrate a common issue, display the euro
-sign (U+20AC: €)::
-
-    print("euro: \u20ac")
-
-Result:
-
-========================  ==========  =========
-Python                    Exception?  Mojibake?
-========================  ==========  =========
-Python 2                  **Yes**     No
-Python 3                  **Yes**     No
-Python 3.5, POSIX locale  **Yes**     No
-UTF-8 mode                No          **Yes**
-UTF-8 Strict mode         No          **Yes**
-========================  ==========  =========
-
-The UTF-8 and UTF-8 Strict modes will always encode the euro sign as
-UTF-8. If the terminal uses a different encoding, we get mojibake.
-
-For example, using ``iconv`` to emulate a GB-18030 terminal inside a
-UTF-8 one::
-
-    $ python3 -c 'print("euro: \u20ac")' | iconv -f gb18030 -t utf8
-    euro: 鈧iconv: illegal input sequence at position 8
-
-The misencoding also corrupts the trailing newline such that the output
-stream isn't actually a valid GB-18030 sequence, hence the error message
-after the euro symbol is misinterpreted as a hanzi character.
-
-
-Replace a word in a text
-------------------------
-
-The following script replaces the word "apple" with "orange". It
-reads input from stdin and writes the output into stdout::
-
-    import sys
-    text = sys.stdin.read()
-    sys.stdout.write(text.replace("apple", "orange"))
-
-Result:
-
-========================  ==========  =========
-Python                    Exception?  Mojibake?
-========================  ==========  =========
-Python 2                  No          **Yes**
-Python 3                  **Yes**     No
-Python 3.5, POSIX locale  No          **Yes**
-UTF-8 mode                No          **Yes**
-UTF-8 Strict mode         **Yes**     No
-========================  ==========  =========
-
-This is a case where passing along the raw bytes (by way of the
-``surrogateescape`` error handler) will bring Python 3's behaviour back
-into line with standard operating system tools like ``sed`` and ``awk``.
-
-
-Producer-consumer model using pipes
------------------------------------
-
-Let's say that we have a "producer" program which writes data into its
-stdout and a "consumer" program which reads data from its stdin.
-
-On a shell, such programs are run with the command::
-
-    producer | consumer
-
-The question if these programs will work with any data and any locale.
-UNIX users don't expect Unicode errors, and so expect that such programs
-"just works", in the sense that Unicode errors may cause problems in the
-data stream, but won't cause the entire stream processing *itself* to
-abort.
-
-If the producer only produces ASCII output, no error should occur. Let's
-say that the producer writes at least one non-ASCII character (at least
-one byte in the range ``0x80..0xff``).
-
-To simplify the problem, let's say that the consumer has no output
-(doesn't write results into a file or stdout).
-
-A "Bytes producer" is an application which cannot fail with a Unicode
-error and produces bytes into stdout.
-
-Let's say that a "Bytes consumer" does not decode stdin but stores data
-as bytes: such consumer always work. Common UNIX command line tools like
-``cat``, ``grep`` or ``sed`` are in this category. Many Python 2
-applications are also in this category, as are applications that work
-with the lower level binary input and output stream in Python 3 rather
-than the default text mode streams.
-
-"Python producer" and "Python consumer" are producer and consumer
-implemented in Python using the default text mode input and output
-streams.
-
-Bytes producer, Bytes consumer
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-This won't through exceptions, but it is out of the scope of this PEP
-since it doesn't involve Python's default text mode input and output
-streams.
-
-Python producer, Bytes consumer
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Python producer::
-
-    print("euro: \u20ac")
-
-Result:
-
-========================  ==========  =========
-Python                    Exception?  Mojibake?
-========================  ==========  =========
-Python 2                  **Yes**     No
-Python 3                  **Yes**     No
-Python 3.5, POSIX locale  **Yes**     No
-UTF-8 mode                No          **Yes**
-UTF-8 Strict mode         No          **Yes**
-========================  ==========  =========
-
-The question here is not if the consumer is able to decode the input,
-but if Python is able to produce its output. So it's similar to the
-`Display Unicode characters into stdout`_ case.
-
-UTF-8 modes work with any locale since the consumer doesn't try to
-decode its stdin.
-
-Bytes producer, Python consumer
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Python consumer::
-
-    import sys
-    text = sys.stdin.read()
-    result = text.replace("apple", "orange")
-    # ignore the result
-
-Result:
-
-========================  ==========  =========
-Python                    Exception?  Mojibake?
-========================  ==========  =========
-Python 2                  No          **Yes**
-Python 3                  **Yes**     No
-Python 3.5, POSIX locale  No          **Yes**
-UTF-8 mode                No          **Yes**
-UTF-8 Strict mode         **Yes**     No
-========================  ==========  =========
-
-Python 3 may throw an exception on decoding stdin depending on the input
-and the locale.
-
-
-Python producer, Python consumer
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Python producer::
-
-    print("euro: \u20ac")
-
-Python consumer::
-
-    import sys
-    text = sys.stdin.read()
-    result = text.replace("apple", "orange")
-    # ignore the result
-
-Result, same Python version used for the producer and the consumer:
-
-========================  ==========  =========
-Python                    Exception?  Mojibake?
-========================  ==========  =========
-Python 2                  **Yes**     No
-Python 3                  **Yes**     No
-Python 3.5, POSIX locale  **Yes**     No
-UTF-8 mode                No          No(!)
-UTF-8 Strict mode         No          No(!)
-========================  ==========  =========
-
-This case combines a Python producer with a Python consumer, and the
-result is mainly the same as that for `Python producer, Bytes
-consumer`_, since the consumer can't read what the producer can't emit.
-
-However, the behaviour of the "UTF-8" and "UTF-8 Strict" modes in this
-configuration is notable: they don't produce an exception, *and* they
-shouldn't produce mojibake, as both the producer and the consumer are
-making *consistent* assumptions regarding the text encoding used on the
-pipe between them (i.e. UTF-8).
-
-Any mojibake generated would only be in the interfaces bween the
-consuming component and the outside world (e.g. the terminal, or when
-writing to a file).
-
-Backward Compatibility
-======================
-
-The main backward incompatible change is that the UTF-8 encoding is now
-used by default if the locale is POSIX. Since the UTF-8 encoding is used
-with the ``surrogateescape`` error handler, encoding errors should not
-occur and so the change should not break applications.
-
-The UTF-8 encoding is also quite restrictive regarding where it allows
-plain ASCII code points to appear in the byte stream, so even for
-ASCII-incompatible encodings, such byte values will often be escaped
-rather than being processed as ASCII characters.
-
-The more likely source of trouble comes from external libraries. Python
-can decode successfully data from UTF-8, but a library using the locale
-encoding can fail to encode the decoded text back to bytes. For example,
-GNU readline currently has problems on Android due to the mismatch
-between CPython's encoding assumptions there (always UTF-8) and GNU
-readline's encoding assumptions (which are based on the nominal locale).
-
-The PEP only changes the default behaviour if the locale is POSIX. For
-other locales, the *default* behaviour is unchanged.
-
-PEP 538 is a follow-up to this PEP that extends CPython's assumptions to
-other locale-aware components in the same process by explicitly coercing
-the POSIX locale to something more suitable for modern text processing.
-See that PEP for further details.
-
-
-Alternatives
-============
-
-Don't modify the encoding of the POSIX locale
----------------------------------------------
-
-A first version of the PEP did not change the encoding and error handler
-used of the POSIX locale.
-
-The problem is that adding the ``-X utf8`` command line option or
-setting the ``PYTHONUTF8`` environment variable is not possible in some
-cases, or at least not convenient.
-
-Moreover, many users simply expect that Python 3 behaves as Python 2:
-don't bother them with encodings and "just works" in all cases. These
-users don't worry about mojibake, or even expect mojibake because of
-complex documents using multiple incompatibles encodings.
-
-
-Always use UTF-8
-----------------
-
-Python already always uses the UTF-8 encoding on Mac OS X, Android and
-Windows.  Since UTF-8 became the de facto encoding, it makes sense to
-always use it on all platforms with any locale.
-
-The problem with this approach is that Python is also used extensively
-in desktop environments, and it is often a practical or even legal
-requirement to support locale encoding other than UTF-8 (for example,
-GB-18030 in China, and Shift-JIS or ISO-2022-JP in Japan)
-
-Force UTF-8 for the POSIX locale
---------------------------------
-
-An alternative to always using UTF-8 in any case is to only use UTF-8
-when the ``LC_CTYPE`` locale is the POSIX locale.
-
-The `PEP 538`_ "Coercing the legacy C locale to C.UTF-8" of  Nick
-Coghlan proposes to implement that using the ``C.UTF-8`` locale.
-
-
-Use the strict error handler for operating system data
-------------------------------------------------------
-
-Using the ``surrogateescape`` error handler for `operating system data`_
-creates surprising surrogate characters. No Python codec (except of
-``utf-7``) accept surrogates, and so encoding text coming from the
-operating system is likely to raise an error error. The problem is that
-the error comes late, very far from where the data was read.
-
-The ``strict`` error handler can be used instead to decode
-(``os.fsdecode()``) and encode (``os.fsencode()``) operating system
-data, to raise encoding errors as soon as possible. It helps to find
-bugs more quickly.
-
-The main drawback of this strategy is that it doesn't work in practice.
-Python 3 is designed on top on Unicode strings. Most functions expect
-Unicode and produce Unicode. Even if many operating system functions
-have two flavors, bytes and Unicode, the Unicode flavor is used in most
-cases. There are good reasons for that: Unicode is more convenient in
-Python 3 and using Unicode helps to support the full Unicode Character
-Set (UCS) on Windows (even if Python now uses UTF-8 since Python 3.6,
-see the `PEP 528`_ and the `PEP 529`_).
-
-For example, if ``os.fsdecode()`` uses ``utf8/strict``,
-``os.listdir(str)`` fails to list filenames of a directory if a single
-filename is not decodable from UTF-8. As a consequence,
-``shutil.rmtree(str)`` fails to remove a directory. Undecodable
-filenames, environment variables, etc. are simply too common to make
-this alternative viable.
+PEP 538 is implemented with ``setlocale(LC_CTYPE, "C.UTF-8")``: any
+non-Python code running in the process is impacted by this change.  This
+PEP is implemented in Python internals and ignores the locale:
+non-Python code running in the same process is not aware of the "Python
+UTF-8 mode".
 
 
 Links
 =====
 
-PEPs:
-
+* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 mode
+  <http://bugs.python.org/issue29240>`_
 * `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
   "Coercing the legacy C locale to C.UTF-8"
 * `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
@@ -921,83 +238,12 @@ PEPs:
 * `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
   "Non-decodable Bytes in System Character Interfaces"
 
-Main Python issues:
-
-* `Issue #29240: Implementation of the PEP 540: Add a new UTF-8 mode
-  <http://bugs.python.org/issue29240>`_
-* `Issue #28180: sys.getfilesystemencoding() should default to utf-8
-  <http://bugs.python.org/issue28180>`_
-* `Issue #19977: Use "surrogateescape" error handler for sys.stdin and
-  sys.stdout on UNIX for the C locale
-  <http://bugs.python.org/issue19977>`_
-* `Issue #19847: Setting the default filesystem-encoding
-  <http://bugs.python.org/issue19847>`_
-* `Issue #8622: Add PYTHONFSENCODING environment variable
-  <https://bugs.python.org/issue8622>`_: added but reverted because of
-  many issues, read the `Inconsistencies if locale and filesystem
-  encodings are different
-  <https://mail.python.org/pipermail/python-dev/2010-October/104509.html>`_
-  thread on the python-dev mailing list
-
-Incomplete list of Python issues related to Unicode errors, especially
-with the POSIX locale:
-
-* 2016-12-22: `LANG=C python3 -c "import os; os.path.exists('\xff')"
-  <http://bugs.python.org/issue29042#msg283821>`_
-* 2014-07-20: `issue #22016: Add a new 'surrogatereplace' output only
-  error handler <http://bugs.python.org/issue22016>`_
-* 2014-04-27: `Issue #21368: Check for systemd locale on startup if
-  current locale is set to POSIX <http://bugs.python.org/issue21368>`_
-  -- read manually /etc/locale.conf when the locale is POSIX
-* 2014-01-21: `Issue #20329: zipfile.extractall fails in Posix shell
-  with utf-8 filename <http://bugs.python.org/issue20329>`_
-* 2013-11-30: `Issue #19846: Python 3 raises Unicode errors with the C locale
-  <http://bugs.python.org/issue19846>`_
-* 2010-05-04: `Issue #8610: Python3/POSIX:  errors if file system
-  encoding is None <http://bugs.python.org/issue8610>`_
-* 2013-08-12: `Issue #18713: Clearly document the use of
-  PYTHONIOENCODING to set surrogateescape
-  <http://bugs.python.org/issue18713>`_
-* 2013-09-27: `Issue #19100: Use backslashreplace in pprint
-  <http://bugs.python.org/issue19100>`_
-* 2012-01-05: `Issue #13717: os.walk() + print fails with UnicodeEncodeError
-  <http://bugs.python.org/issue13717>`_
-* 2011-12-20: `Issue #13643: 'ascii' is a bad filesystem default encoding
-  <http://bugs.python.org/issue13643>`_
-* 2011-03-16: `issue #11574: TextIOWrapper should use UTF-8 by default
-  for the POSIX locale <http://bugs.python.org/issue11574>`_, thread on
-  python-dev: `Low-Level Encoding Behavior on Python 3
-  <https://mail.python.org/pipermail/python-dev/2011-March/109361.html>`_
-* 2010-04-26: `Issue #8533: regrtest: use backslashreplace error handler
-  for stdout <http://bugs.python.org/issue8533>`_, regrtest fails with
-  Unicode encode error if the locale is POSIX
-
-Some issues are real bugs in applications which must explicitly set the
-encoding. Well, it just works in the common case (locale configured
-correctly), so what? The program "suddenly" fails when the POSIX
-locale is used (probably for bad reasons). Such bugs are not well
-understood by users. Example of such issues:
-
-* 2013-11-21: `pip: open() uses the locale encoding to parse Python
-  script, instead of the encoding cookie
-  <http://bugs.python.org/issue19685>`_ -- pip must use the encoding
-  cookie to read a Python source code file
-* 2011-01-21: `IDLE 3.x can crash decoding recent file list
-  <http://bugs.python.org/issue10974>`_
-
-
-Prior Art
-=========
-
-Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
-variable to force UTF-8: see `perlrun
-<http://perldoc.perl.org/perlrun.html>`_. It is possible to configure
-UTF-8 per standard stream, on input and output streams, etc.
-
 
 Post History
 ============
 
+* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 mode
+  <https://mail.python.org/pipermail/python-dev/2017-December/151054.html>`_
 * 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
   540 (assuming UTF-8 for *nix system boundaries)
   <https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_