PEP 540: Apply Nick Coghlan's PR #201

I applied it manually since another PR was merged in the meanwhile.
2017-12-05 16:21:59 +01:00 · cef853f646
parent 7d181dc76d
commit cef853f646
1 changed files with 246 additions and 169 deletions
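
For readers skimming the patch, the distinction the reworded text draws between the "normal" and "strict" UTF-8 modes can be sketched with the standard codec error handlers. This is a minimal illustration under stated assumptions, not the CPython implementation: the helper names are hypothetical, only the built-in `bytes.decode()` / `str.encode()` API is used, and the sample filename `b'xxx\xff'` is the one from the PEP's own examples.

```python
# Illustrative sketch only; the helper names below are hypothetical
# and do not exist in CPython.
RAW_NAME = b"xxx\xff"  # filename from the PEP example; 0xFF is invalid in UTF-8


def decode_os_data(raw: bytes) -> str:
    """Decode as the "normal" UTF-8 mode would: bad bytes become lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def encode_os_data(text: str) -> bytes:
    """Encode back to bytes, restoring escaped surrogates to the original bytes."""
    return text.encode("utf-8", "surrogateescape")


def encode_strict(text: str) -> bytes:
    """Encode as the "strict" UTF-8 mode would: lone surrogates raise an error."""
    return text.encode("utf-8", "strict")


name = decode_os_data(RAW_NAME)
round_trip = encode_os_data(name)

try:
    encode_strict(name)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Mojibake: the UTF-8 bytes of the euro sign read back with another encoding.
mojibake = "\u20ac".encode("utf-8").decode("latin-1")
```

With `surrogateescape` the undecodable byte survives the round trip unchanged, whereas the `strict` handler refuses to emit it; the last line shows how consistent UTF-8 output still turns into mojibake when the reader assumes a different encoding.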
--- a/pep-0540.txt
+++ b/pep-0540.txt
@@ -2,7 +2,8 @@ PEP: 540
 Title: Add a new UTF-8 mode
 Version: $Revision$
 Last-Modified: $Date$
-Author: Victor Stinner <victor.stinner@gmail.com>
+Author: Victor Stinner <victor.stinner@gmail.com>,
+        Nick Coghlan <ncoghlan@gmail.com>
 BDFL-Delegate: INADA Naoki
 Status: Draft
 Type: Standards Track
@@ -14,16 +15,22 @@ Python-Version: 3.7
 Abstract
 ========

-Add a new UTF-8 mode, disabled by default, to ignore the locale and
-force the usage of the UTF-8 encoding.
+Add a new UTF-8 mode, enabled by default in the POSIX locale, to ignore
+the locale and force the usage of the UTF-8 encoding for external
+operating system interfaces, including the standard IO streams.

-Basically, UTF-8 mode behaves as Python 2: it "just works" and doesn't
-bother users with encodings, but it can produce mojibake. The UTF-8 mode
-can be configured as strict to prevent mojibake.
+Essentially, the UTF-8 mode behaves as Python 2 and other C based
+applications on \*nix systems: it aims to process text as best it can,
+but it errs on the side of producing or propagating mojibake to
+subsequent components in a processing pipeline rather than requiring
+strictly valid encodings at every step in the process.

-A new ``-X utf8`` command line option and a ``PYTHONUTF8`` environment
-variable are added to control the UTF-8 mode. The POSIX locale enables
-the UTF-8 mode.
+The UTF-8 mode can be configured as strict to reduce the risk of
+producing or propagating mojibake.
+
+A new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
+variable are added to explicitly control the UTF-8 mode (including
+turning it off entirely, even in the POSIX locale).


 Rationale
@@ -55,20 +62,30 @@ POSIX locale is a good choice: see the `Locales section of
 reproducible-builds.org
 <https://reproducible-builds.org/docs/locales/>`_.

+PEP 538 lists additional problems related to the use of Linux containers to
+run network services and command line applications.
+
 UNIX users don't expect Unicode errors, since the common command lines
-tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors.
-These users expect that Python 3 "just works" with any locale and won't
-bother them with encodings. From their point of the view, the bug is not
-their locale, it's obviously Python 3.
+tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors - they
+produce mostly-readable text instead.

-Since Python 2 handles data as bytes, it's rarer in Python 2
-compared to Python 3 to get Unicode errors. It also explains why users
-also perceive Python 3 as the root cause of their Unicode errors.
+These users similarly expect that tools written in Python 3 (including those
+updated from Python 2) continue to tolerate locale misconfigurations and avoid
+bothering them with text encoding details. From their point of view, the
+bug is not their locale but is obviously Python 3 ("Everything else works,
+including Python 2, so what's wrong with Python 3?").

-Some users expect that Python 3 just works with any locale and so don't
-bother with mojibake, whereas some developers are working hard to prevent
-mojibake and so expect that Python 3 fails early before creating
-it.
+Since Python 2 handles data as bytes, similar to system utilities written in
+C and C++, it's rarer in Python 2 compared to Python 3 to get explicit Unicode
+errors. It also contributes significantly to why many affected users perceive
+Python 3 as the root cause of their Unicode errors.
+
+At the same time, the stricter text handling model was deliberately introduced
+into Python 3 to reduce the frequency of data corruption bugs arising in
+production services due to mismatched assumptions regarding text encodings.
+It's one thing to emit mojibake to a user's terminal while listing a directory,
+but something else entirely to store that in a system manifest in a database,
+or to send it to a remote client attempting to retrieve files from the system.

 Since different group of users have different expectations, there is no
 silver bullet which solves all issues at once. Last but not least,
@@ -135,12 +152,12 @@ On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
 the ASCII encoding, whereas ``mbstowcs()`` and ``wcstombs()`` functions
 use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
 ``os.fsencode()`` and ``os.fsdecode()`` use
-Python codec of the locale encoding. For example, if command line
+the ``locale.getpreferredencoding()`` codec. For example, if command line
 arguments are decoded by ``mbstowcs()`` and encoded back by
 ``os.fsencode()``, an ``UnicodeEncodeError`` exception is raised instead
 of retrieving the original byte string.

-To fix this issue, from Python 3.4, a check is made to see if ``mbstowcs()``
+To fix this issue, Python has checked since Python 3.4 whether ``mbstowcs()``
 really uses the ASCII encoding if the the ``LC_CTYPE`` uses the the
 POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
 alias to ASCII). If not (the effective encoding is not ASCII), Python
@@ -178,7 +195,7 @@ the UTF-8 encoding:
 * Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
 * HP-UX: ``"C.utf8"``

-It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
+It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
 proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.

 It is not planned to add such locale to BSD systems.
@@ -190,16 +207,17 @@ Popularity of the UTF-8 encoding
 Python 3 uses UTF-8 by default for Python source files.

 On Mac OS X, Windows and Android, Python always use UTF-8 for operating
-system data. For Windows, see `PEP 529`_: "Change Windows filesystem
+system data. For Windows, see `PEP 529`_: "Change Windows filesystem
 encoding to UTF-8".

 On Linux, UTF-8 became the de facto standard encoding,
 replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
 using different encodings for filenames and standard streams is likely
-to create mojibake, so UTF-8 is now used *everywhere*.
+to create mojibake, so UTF-8 is now used *everywhere* (at least for modern
+distributions using their default settings).

-The UTF-8 encoding is the default encoding of XML and JSON file formats.
-As of January 2017, UTF-8 was used in `more than 88% of web pages
+The UTF-8 encoding is the default encoding of XML and JSON file formats.
+In January 2017, UTF-8 was used in `more than 88% of web pages
 <https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
 Javascript, CSS, etc.).

@@ -209,7 +227,7 @@ information on the UTF-8 codec.
 .. note::
   Some applications and operating systems (especially Windows) use Byte
   Order Markers (BOM) to indicate the used Unicode encoding: UTF-7,
-   UTF-8, UTF-16-LE, etc. BOM are not well supported and are rarely used in
+   UTF-8, UTF-16-LE, etc. BOM are not well supported and rarely used in
   Python.


@@ -231,8 +249,8 @@ Python 3 promotes Unicode everywhere including filenames. A solution to
 support filenames not decodable from the locale encoding was found: the
 ``surrogateescape`` error handler (`PEP 383`_), store undecodable bytes
 as surrogate characters. This error handler is used by default for
-`operating system data`_, for example, by ``os.fsdecode()`` and
-``os.fsencode()`` (except on Windows which uses the ``strict`` error handler).
+`operating system data`_, by ``os.fsdecode()`` and ``os.fsencode()`` for
+example (except on Windows which uses the ``strict`` error handler).


 Standard streams
@@ -252,17 +270,19 @@ using the ``surrogateescape`` error handler (`PEP 383`_). Displaying a
 filename to stdout raises a Unicode encode error if the filename
 contains an undecoded byte stored as a surrogate character.

-Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the
+Python 3.5+ now uses ``surrogateescape`` for stdin and stdout if the
 POSIX locale is used: `issue #19977
 <http://bugs.python.org/issue19977>`_. The idea is to pass through
-`operating system data`_ even if it creates mojibake, because most UNIX
-applications work like that. Most UNIX applications store filenames as
-bytes, usually because bytes are first-citizen class in the used
-programming language, whereas Unicode is badly supported.
+`operating system data`_ even if it means mojibake, because most UNIX
+applications work like that. Such UNIX applications often store filenames as
+bytes, in many cases because their basic design principles (or those of the
+language they're implemented in) were laid down half a century ago when it
+was still a feat for computers to handle English text correctly, rather than
+humans having to work with raw numeric indexes.

 .. note::
   The encoding and/or the error handler of standard streams can be
-   overridden with the ``PYTHONIOENCODING`` environment variable.
+   overridden with the ``PYTHONIOENCODING`` environment variable.


 Proposal
@@ -271,27 +291,35 @@ Proposal
 Changes
 -------

-Add a new UTF-8 mode, disabled by default, to ignore the locale and
+Add a new UTF-8 mode, enabled by default in the POSIX locale, but otherwise
+disabled by default, to ignore the locale and
 force the usage of the UTF-8 encoding with the ``surrogateescape`` error
 handler, instead using the locale encoding (with ``strict`` or
 ``surrogateescape`` error handler depending on the case).

-Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
-bother users with encodings, but it can produce mojibake. It can be
-configured as strict to prevent mojibake: the UTF-8 encoding is used
-with the ``strict`` error handler for inputs and outputs, but the
-``surrogateescape`` error handler is still used for `operating system
-data`_.
+The "normal" UTF-8 mode uses ``surrogateescape`` on the standard input and
+output streams and opened files, as well as on all operating system
+interfaces. This is the mode implicitly activated by the POSIX locale.

-A new ``-X utf8`` command line option and a ``PYTHONUTF8`` environment
-variable are added to control the UTF-8 mode. The UTF-8 mode is enabled
-by using ``-X utf8`` or ``PYTHONUTF8=1``.  It can be configured as strict
-by using ``-X utf8=strict`` or ``PYTHONUTF8=strict``. Other option values
-fail with an error.
+The "strict" UTF-8 mode reduces the risk of producing or propagating mojibake:
+the UTF-8 encoding is used with the ``strict`` error handler for inputs and
+outputs, but the ``surrogateescape`` error handler is still used for
+`operating system data`_. This mode is never activated implicitly, but can
+be requested explicitly.
+
+The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
+variable are added to control the UTF-8 mode.
+
+The UTF-8 mode is enabled by ``-X utf8`` or ``PYTHONUTF8=1``.
+
+The UTF-8 Strict mode is configured by ``-X utf8=strict`` or
+``PYTHONUTF8=strict``.

 The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
 can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.

+Other option values fail with an error.
+
 Options priority for the UTF-8 mode:

 * ``PYTHONLEGACYWINDOWSFSENCODING``
@@ -300,8 +328,8 @@ Options priority for the UTF-8 mode:
 * POSIX locale

 For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the UTF-8 mode,
-whereas ``LC_ALL=C python3.7 -X utf8=0`` disables the UTF-8 mode and
-uses the encoding of the POSIX locale.
+whereas ``LC_ALL=C python3.7 -X utf8=0`` disables the UTF-8 mode and so
+uses the encoding of the POSIX locale.

 Encodings used by ``open()``, highest priority first:

@@ -339,8 +367,9 @@ sys.stderr                    locale/backslashreplace  locale/backslashreplace
 ============================  =======================  ==========================

 The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
-strict mode for convenience: the idea is that data not encoded to UTF-8
-are passed through "Python" without being modified, as raw bytes.
+strict mode for consistency with other standard \*nix operating system
+components: the idea is that data not encoded to UTF-8 are passed through
+"Python" without being modified, as raw bytes.

 The ``PYTHONIOENCODING`` environment variable has priority over the
 UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1
@@ -390,23 +419,23 @@ locale encoding, and this code page never uses the ASCII encoding.
 Rationale
 ---------

-UTF-8 mode is disabled by default in order to keep hard Unicode errors when
-encoding or decoding `operating system data`_ fails and preserve
-backward compatibility. In addition, users will be better prepared for
-mojibake if it is their responsibility to explicitly enable UTF-8 mode
-than they would be if it was enabled *by default*.
+The UTF-8 mode is disabled by default to keep hard Unicode errors when
+encoding or decoding `operating system data`_ fails, and to preserve
+backward compatibility. The user is responsible for explicitly enabling the
+UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
+mode were enabled *by default*.

-UTF-8 mode should be used on systems known to be configured with
+The UTF-8 mode should be used on systems known to be configured with
 UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
 the user overrides a locale *by mistake* or if a Python program is
 started with no locale configured (and so with the POSIX locale).

 Most UNIX applications handle `operating system data`_ as bytes, so
-the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
+the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
 limited impact on how these data are handled by the application.

-The UTF-8 mode should help make Python more interoperable with
-other UNIX applications on the system assuming that *UTF-8* is used
+The Python UTF-8 mode should help make Python more interoperable with
+the other UNIX applications on the system assuming that *UTF-8* is used
 everywhere and that users *expect* UTF-8.

 Ignoring ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
@@ -455,17 +484,20 @@ surrogate characters.
 Use Cases
 =========

-The following use cases were written to highlight the impact of
-the chosen encodings and error handlers on concrete examples.
+The following use cases were written to help understand the impact of
+the chosen encodings and error handlers on concrete examples.

-The "Always work" results were written to prove the benefit of having a
-UTF-8 mode which works with any data and any locale, compared to the
-existing old Python versions.
+The "Exception?" column shows the potential benefit of having a UTF-8 mode which
+is closer to the traditional Python 2 behaviour of passing along raw binary data
+even if it isn't valid UTF-8.

 The "Mojibake" column shows that ignoring the locale causes a practical
 issue: the UTF-8 mode produces mojibake if the terminal doesn't use the
 UTF-8 encoding.

+The ideal configuration is "No exception, no risk of mojibake", but that isn't
+always possible in the presence of non-UTF-8 encoded binary data.
+
 List a directory into stdout
 ----------------------------

@@ -477,24 +509,25 @@ Script listing the content of the current directory into stdout::

 Result:

-========================  =============  =========
-Python                    Always works?  Mojibake?
-========================  =============  =========
-Python 2                  **Yes**        **Yes**
-Python 3                  No             No
-Python 3.5, POSIX locale  **Yes**        **Yes**
-UTF-8 mode                **Yes**        **Yes**
-UTF-8 Strict mode         No             No
-========================  =============  =========
+========================  ==========  =========
+Python                    Exception?  Mojibake?
+========================  ==========  =========
+Python 2                  No          **Yes**
+Python 3                  **Yes**     No
+Python 3.5, POSIX locale  No          **Yes**
+UTF-8 mode                No          **Yes**
+UTF-8 Strict mode         **Yes**     No
+========================  ==========  =========

-"No" means that the script can fail on decoding or encoding a filename
-depending on the locale or the filename.
+"Exception?" means that the script can fail on decoding or encoding a
+filename depending on the locale or the filename.

-To be able to always work, the program must be able to produce mojibake.
-Mojibake is more user friendly than an error with a truncated or empty
-output.
+To be able to never fail that way, the program must be able to produce mojibake.
+For automated and interactive processes, mojibake is often more user friendly
+than an error with a truncated or empty output, since it confines the
+problem to the affected entry, rather than aborting the whole task.

-For example, using a directory which contains a file called ``b'xxx\xff'``
+Example with a directory which contains a file called ``b'xxx\xff'``
 (the byte ``0xFF`` is invalid in UTF-8).

 Default and UTF-8 Strict mode fail on ``print()`` with an encode error::
@@ -511,7 +544,7 @@ Default and UTF-8 Strict mode fail on ``print()`` with an encode error::
        print(name)
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...

-UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
+The UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
 but display mojibake::

    $ python3.7 -X utf8 ../ls.py
@@ -541,19 +574,19 @@ a text file::

 Result:

-========================  =============  =========
-Python                    Always works?  Mojibake?
-========================  =============  =========
-Python 2                  **Yes**        **Yes**
-Python 3                  No             No
-Python 3.5, POSIX locale  No             No
-UTF-8 mode                **Yes**        **Yes**
-UTF-8 Strict mode         No             No
-========================  =============  =========
+========================  ==========  =========
+Python                    Exception?  Mojibake?
+========================  ==========  =========
+Python 2                  No          **Yes**
+Python 3                  **Yes**     No
+Python 3.5, POSIX locale  **Yes**     No
+UTF-8 mode                No          **Yes**
+UTF-8 Strict mode         **Yes**     No
+========================  ==========  =========

-"Yes" implies that mojibake can be produced. "No" means that the script
-can fail on decoding or encoding a filename depending on the locale or
-the filename. Typical error::
+Again, never throwing an exception requires that mojibake can be produced, while
+preventing mojibake means that the script can fail on decoding or encoding a
+filename depending on the locale or the filename. Typical error::

    $ LC_ALL=C python3 test.py
    Traceback (most recent call last):
@@ -561,6 +594,12 @@ the filename. Typical error::
        fp.write("%s\n" % name)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

+Compared with native system tools::
+
+    $ ls > /tmp/content.txt
+    $ cat /tmp/content.txt
+    xxx<78>
+

 Display Unicode characters into stdout
 --------------------------------------
@@ -572,19 +611,29 @@ Very basic example used to illustrate a common issue, display the euro sign

 Result:

-========================  =============  =========
-Python                    Always works?  Mojibake?
-========================  =============  =========
-Python 2                  No             No
-Python 3                  No             No
-Python 3.5, POSIX locale  No             No
-UTF-8 mode                **Yes**        **Yes**
-UTF-8 Strict mode         **Yes**        **Yes**
-========================  =============  =========
+========================  ==========  =========
+Python                    Exception?  Mojibake?
+========================  ==========  =========
+Python 2                  **Yes**     No
+Python 3                  **Yes**     No
+Python 3.5, POSIX locale  **Yes**     No
+UTF-8 mode                No          **Yes**
+UTF-8 Strict mode         No          **Yes**
+========================  ==========  =========

 The UTF-8 and UTF-8 Strict modes will always encode the euro sign as
 UTF-8. If the terminal uses a different encoding, we get mojibake.

+For example, using ``iconv`` to emulate a GB-18030 terminal inside a
+UTF-8 one::
+
+    $ python3 -c 'print("euro: \u20ac")' | iconv -f gb18030 -t utf8
+    euro: 鈧iconv: illegal input sequence at position 8
+
+The misencoding also corrupts the trailing newline such that the output
+stream isn't actually a valid GB-18030 sequence, hence the error message after
+the euro symbol is misinterpreted as a hanzi character.
+

 Replace a word in a text
 ------------------------
@@ -598,15 +647,20 @@ reads input from stdin and writes the output into stdout::

 Result:

-========================  =============  =========
-Python                    Always works?  Mojibake?
-========================  =============  =========
-Python 2                  **Yes**        **Yes**
-Python 3                  No             No
-Python 3.5, POSIX locale  **Yes**        **Yes**
-UTF-8 mode                **Yes**        **Yes**
-UTF-8 Strict mode         No             No
-========================  =============  =========
+========================  ==========  =========
+Python                    Exception?  Mojibake?
+========================  ==========  =========
+Python 2                  No          **Yes**
+Python 3                  **Yes**     No
+Python 3.5, POSIX locale  No          **Yes**
+UTF-8 mode                No          **Yes**
+UTF-8 Strict mode         **Yes**     No
+========================  ==========  =========
+
+This is a case where passing along the raw bytes (by way of the
+``surrogateescape`` error handler) will bring Python 3's behaviour back into
+line with standard operating system tools like ``sed`` and ``awk``.
+

 Producer-consumer model using pipes
 -----------------------------------
@@ -618,12 +672,13 @@ On a shell, such programs are run with the command::

    producer | consumer

-The question is if these programs will work with any data and any locale.
+The question is whether these programs will work with any data and any locale.
 UNIX users don't expect Unicode errors, and so expect that such programs
-"just work".
+"just work", in the sense that Unicode errors may cause problems in the data
+stream, but won't cause the entire stream processing *itself* to abort.

 If the producer only produces ASCII output, no error should occur. Let's
-say the that the producer writes at least one non-ASCII character (at least
+say that the producer writes at least one non-ASCII character (at least
 one byte in the range ``0x80..0xff``).

 To simplify the problem, let's say that the consumer has no output
@@ -633,18 +688,20 @@ A "Bytes producer" is an application which cannot fail with a Unicode
 error and produces bytes into stdout.

 Let's say that a "Bytes consumer" does not decode stdin but stores data
-as bytes: such a consumer always works. Common UNIX command line tools like
+as bytes: such a consumer always works. Common UNIX command line tools like
 ``cat``, ``grep`` or ``sed`` are in this category. Many Python 2
-applications are also in this category.
+applications are also in this category, as are applications that work
+with the lower level binary input and output streams in Python 3 rather than
+the default text mode streams.

-"Python producer" and "Python consumer" are a producer and consumer
-implemented in Python.
+"Python producer" and "Python consumer" are a producer and consumer
+implemented in Python using the default text mode input and output streams.

 Bytes producer, Bytes consumer
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-It always works, but it is out of the scope of this PEP since it doesn't
-involve Python.
+This won't throw exceptions, but it is out of the scope of this PEP since it
+doesn't involve Python's default text mode input and output streams.

 Python producer, Bytes consumer
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -655,15 +712,15 @@ Python producer::

 Result:

-========================  =============  =========
-Python                    Always works?  Mojibake?
-========================  =============  =========
-Python 2                  No             No
-Python 3                  No             No
-Python 3.5, POSIX locale  No             No
-UTF-8 mode                **Yes**        **Yes**
-UTF-8 Strict mode         No             No
-========================  =============  =========
+========================  ==========  =========
+Python                    Exception?  Mojibake?
+========================  ==========  =========
+Python 2                  **Yes**     No
+Python 3                  **Yes**     No
+Python 3.5, POSIX locale  **Yes**     No
+UTF-8 mode                No          **Yes**
+UTF-8 Strict mode         No          **Yes**
+========================  ==========  =========

 The question here is not if the consumer is able to decode the input,
 but if Python is able to produce its output. So it's similar to the
@@ -684,17 +741,18 @@ Python consumer::

 Result:

-========================  =============  =========
-Python                    Always works?  Mojibake?
-========================  =============  =========
-Python 2                  **Yes**        **Yes**
-Python 3                  No             No
-Python 3.5, POSIX locale  **Yes**        **Yes**
-UTF-8 mode                **Yes**        **Yes**
-UTF-8 Strict mode         No             No
-========================  =============  =========
+========================  ==========  =========
+Python                    Exception?  Mojibake?
+========================  ==========  =========
+Python 2                  No          **Yes**
+Python 3                  **Yes**     No
+Python 3.5, POSIX locale  No          **Yes**
+UTF-8 mode                No          **Yes**
+UTF-8 Strict mode         **Yes**     No
+========================  ==========  =========

-Python 3 fails on decoding stdin depending on the input and the locale.
+Python 3 may throw an exception on decoding stdin depending on the input and
+the locale.


 Python producer, Python consumer
@@ -711,22 +769,30 @@ Python consumer::
    result = text.replace("apple", "orange")
    # ignore the result

-Result, using the same Python version for the producer and the consumer:
+Result, same Python version used for the producer and the consumer:

-========================  =============  =========
-Python                    Always works?  Mojibake?
-========================  =============  =========
-Python 2                  No             No
-Python 3                  No             No
-Python 3.5, POSIX locale  No             No
-UTF-8 mode                **Yes**        **Yes**
-UTF-8 Strict mode         No             No
-========================  =============  =========
+========================  ==========  =========
+Python                    Exception?  Mojibake?
+========================  ==========  =========
+Python 2                  **Yes**     No
+Python 3                  **Yes**     No
+Python 3.5, POSIX locale  **Yes**     No
+UTF-8 mode                No          No(!)
+UTF-8 Strict mode         No          No(!)
+========================  ==========  =========

-This case combines a Python producer with a Python consumer, so the
-result is the subset of `Python producer, Bytes consumer`_ and `Bytes
-producer, Python consumer`_.
+This case combines a Python producer with a Python consumer, and the
+result is mainly the same as that for `Python producer, Bytes consumer`_,
+since the consumer can't read what the producer can't emit.

+However, the behaviour of the "UTF-8" and "UTF-8 Strict" modes in this
+configuration is notable: they don't produce an exception, *and* they shouldn't
+produce mojibake, as both the producer and the consumer are making *consistent*
+assumptions regarding the text encoding used on the pipe between them
+(i.e. UTF-8).
+
+Any mojibake generated would only be in the interfaces between the consuming
+component and the outside world (e.g. the terminal, or when writing to a file).

 Backward Compatibility
 ======================
@@ -736,16 +802,26 @@ used by default if the locale is POSIX. Since the UTF-8 encoding is used
 with the ``surrogateescape`` error handler, encoding errors should not
 occur and so the change should not break applications.

+The UTF-8 encoding is also quite restrictive regarding where it allows
+plain ASCII code points to appear in the byte stream, so even for
+ASCII-incompatible encodings, such byte values will often be escaped rather
+than being processed as ASCII characters.
+
 The more likely source of trouble comes from external libraries. Python
-can successfully decode data from UTF-8, but a library using the locale
-encoding can fail to encode the decoded text back to bytes.  Hopefully,
-encoding text in a library is a rare operation. Very few libraries
-expect text, most libraries expect bytes and even manipulate bytes
-internally.
+can successfully decode data from UTF-8, but a library using the locale
+encoding can fail to encode the decoded text back to bytes. For example,
+GNU readline currently has problems on Android due to the mismatch between
+CPython's encoding assumptions there (always UTF-8) and GNU readline's
+encoding assumptions (which are based on the nominal locale).

 The PEP only changes the default behaviour if the locale is POSIX. For
 other locales, the *default* behaviour is unchanged.

+PEP 538 is a follow-up to this PEP that extends CPython's assumptions to other
+locale-aware components in the same process by explicitly coercing the POSIX
+locale to something more suitable for modern text processing. See that PEP
+for further details.
+

 Alternatives
 ============
@@ -754,14 +830,14 @@ Don't modify the encoding of the POSIX locale
 ---------------------------------------------

 A first version of the PEP did not change the encoding and error handler
-used for the POSIX locale.
+used for the POSIX locale.

 The problem is that adding the ``-X utf8`` command line option or
 setting the ``PYTHONUTF8`` environment variable is not possible in some
 cases, or at least not convenient.

-Moreover, many users simply expect that Python 3 behaves like Python 2:
-it doesn't bother them with encodings and "just works" in all cases. These
+Moreover, many users simply expect that Python 3 behaves like Python 2:
+it doesn't bother them with encodings and "just works" in all cases. These
 users don't worry about mojibake, or even expect mojibake because of
 complex documents using multiple incompatibles encodings.

@@ -773,9 +849,10 @@ Python already always uses the UTF-8 encoding on Mac OS X, Android and
 Windows.  Since UTF-8 became the de facto encoding, it makes sense to
 always use it on all platforms with any locale.

-The risk is to introduce mojibake if the locale uses a different
-encoding, especially for locales other than the POSIX locale.
-
+The problem with this approach is that Python is also used extensively in
+desktop environments, and it is often a practical or even legal requirement
+to support locale encodings other than UTF-8 (for example, GB-18030 in China,
+and Shift-JIS or ISO-2022-JP in Japan).

 Force UTF-8 for the POSIX locale
 --------------------------------
@@ -783,7 +860,7 @@ Force UTF-8 for the POSIX locale
 An alternative to always using UTF-8 in any case is to only use UTF-8 when the
 ``LC_CTYPE`` locale is the POSIX locale.

-`PEP 538`_ "Coercing the legacy C locale to C.UTF-8" by Nick
+`PEP 538`_ "Coercing the legacy C locale to C.UTF-8" by Nick
 Coghlan proposes to implement that using the ``C.UTF-8`` locale.


@@ -791,14 +868,14 @@ Use the strict error handler for operating system data
 ------------------------------------------------------

 Using the ``surrogateescape`` error handler for `operating system data`_
-creates surprising surrogate characters. No Python codec (except for
-``utf-7``) accepts surrogates so encoding text coming from the
-operating system is likely to raise an error. The problem is that
+creates surprising surrogate characters. No Python codec (except for
+``utf-7``) accepts surrogates, and so encoding text coming from the
+operating system is likely to raise an error. The problem is that
 the error comes late, very far from where the data was read.

 The ``strict`` error handler can be used instead to decode
 (``os.fsdecode()``) and encode (``os.fsencode()``) operating system
-data and raise encoding errors as soon as possible. Using it helps find
+data, to raise encoding errors as soon as possible. It helps to find
 bugs more quickly.

 The main drawback of this strategy is that it doesn't work in practice.