PEP 538: Update to depend on PEP 540

- relies entirely on PEP 540 when no appropriate locale is available
- uses surrogateescape on standard streams by default
- accounts for BSD-style UTF-8 locales
- avoids any reliance on the en_US.UTF-8 locale
- makes note of related GNU readline issue on Android
2017-01-21 01:13:24 +11:00 · 481573aa27
parent f67dd4a759
commit 481573aa27
1 changed files with 265 additions and 148 deletions
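
The coercion order summarised in the commit message, and specified in the diff that follows, can be sketched in Python. This is only an illustration under stated assumptions, not the C implementation posted to issue 28180: `plan_coercion` and `probe_locale` are hypothetical names, and inspecting environment variables is a simplification of the `setlocale()` based detection the PEP describes.

```python
import locale

# Candidate coercion targets from the PEP, in the order they are tried,
# paired with the environment variables that would be set for each one.
COERCION_TARGETS = [
    ("C.UTF-8", ("LC_ALL", "LANG")),
    ("C.utf8", ("LC_ALL", "LANG")),
    ("UTF-8", ("LC_CTYPE",)),  # partial, LC_CTYPE-only locale on *BSD
]


def probe_locale(name):
    """Hypothetical helper: report whether LC_CTYPE can be set to `name`."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    else:
        return True
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore


def plan_coercion(environ, locale_is_available=probe_locale):
    """Return the environment updates to apply, or {} when nothing changes."""
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return {}  # coercion explicitly disabled
    # Simplifying assumption: read the effective LC_CTYPE from the environment
    # (LC_ALL, then LC_CTYPE, then LANG) instead of calling setlocale().
    effective = (environ.get("LC_ALL") or environ.get("LC_CTYPE")
                 or environ.get("LANG") or "C")
    if effective != "C":
        return {}  # a non-C locale is already configured
    for name, variables in COERCION_TARGETS:
        if locale_is_available(name):
            updates = {variable: name for variable in variables}
            # Only force the stream settings when the user has not chosen any.
            if "PYTHONIOENCODING" not in environ:
                updates["PYTHONIOENCODING"] = "utf-8:surrogateescape"
            return updates
    return {}  # no candidate available: PEP 540's UTF-8 mode is the fallback
```

Calling `plan_coercion(dict(os.environ))` after `import os` would report what a given environment would be coerced to. Passing a predicate such as `lambda name: name == "UTF-8"` allows the \*BSD path to be exercised on a system that lacks that locale.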
--- a/pep-0538.txt
+++ b/pep-0538.txt
@@ -6,6 +6,7 @@ Author: Nick Coghlan <ncoghlan@gmail.com>
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
+Requires: 540
 Created: 28-Dec-2016
 Python-Version: 3.7
 Post-History: 03-Jan-2017 (linux-sig),
@@ -18,33 +19,40 @@ Abstract
 An ongoing challenge with Python 3 on \*nix systems is the conflict between
 needing to use the configured locale encoding by default for consistency with
 other C/C++ components in the same process and those invoked in subprocesses,
-and the fact that the standard C locale (as defined in POSIX:2001) specifies
-a default text encoding of ASCII, which is entirely inadequate for the
+and the fact that the standard C locale (as defined in POSIX:2001) typically
+implies a default text encoding of ASCII, which is entirely inadequate for the
 development of networked services and client applications in a multilingual
 world.

-This PEP proposes that the way the CPython implementation handles the default
-C locale be changed such that:
+PEP 540 proposes a change to CPython's handling of the legacy C locale such
+that CPython will assume the use of UTF-8 in such environments, rather than
+persisting with the demonstrably problematic assumption of ASCII as an
+appropriate encoding for communicating with operating system interfaces.
+
+However, that change comes at the cost of making CPython's encoding assumptions
+diverge from those of other C and C++ components in the same process, as well
+as those of components running in subprocesses that share the same environment.
+
+Accordingly, this PEP further proposes that the way the CPython implementation
+handles the default C locale be changed such that:

 * the standalone CPython binary will automatically attempt to coerce the ``C``
-  locale to ``C.UTF-8`` (preferred), ``C.utf8`` or ``en_US.UTF-8`` unless the
-  new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``
+  locale to ``C.UTF-8``, ``C.utf8``, or ``UTF-8`` (depending on the system),
+  unless the new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``
 * if the subsequent runtime initialization process detects that the legacy
-  ``C`` locale remains active (e.g. locale coercion is disabled, or the runtime
-  is embedded in an application other than the main CPython binary), it  will
-  emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
-  text encoding may cause various Unicode compatibility issues
-
-Explicitly configuring the ``C.UTF-8`` or ``en_US.UTF-8`` locales has already
-been used successfully for a number of years (including by the PEP author) to
-get Python 3 running reliably in environments where no locale is otherwise
-configured (such as Docker containers).
+  ``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
+  are available, locale coercion is disabled, or the runtime is embedded in an
+  application other than the main CPython binary), and the ``PYTHONUTF8``
+  feature defined in PEP 540 is also disabled, it will emit a warning on
+  stderr that use of the legacy ``C`` locale's default ASCII text encoding
+  may cause various Unicode compatibility issues

 With this change, any \*nix platform that does *not* offer at least one of the
-``C.UTF-8``, ``C.utf8`` or ``en_US.UTF-8`` locales as part of its standard
+``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
 configuration would only be considered a fully supported platform for CPython
-3.7+ deployments when a locale other than the default ``C`` locale is
-configured explicitly.
+3.7+ deployments when either the new ``PYTHONUTF8`` feature defined in PEP 540
+is used, or else a suitable locale other than the default ``C`` locale is
+configured explicitly (e.g. ``zh_CN.gb18030``).

 Redistributors (such as Linux distributions) with a narrower target audience
 than the upstream CPython development team may also choose to opt in to this
@@ -57,11 +65,11 @@ Background

 While the CPython interpreter is starting up, it may need to convert from
 the ``char *`` format to the ``wchar_t *`` format, or from one of those formats
-to ``PyUnicodeObject *``, before its own text encoding handling machinery is
-fully configured. It handles these cases by relying on the operating system to
-do the conversion and then ensuring that the text encoding name reported by
-``sys.getfilesystemencoding()`` matches the encoding used during this early
-bootstrapping process.
+to ``PyUnicodeObject *``, in a way that's consistent with the locale settings
+of the overall system. It handles these cases by relying on the operating
+system to do the conversion and then ensuring that the text encoding name
+reported by ``sys.getfilesystemencoding()`` matches the encoding used during
+this early bootstrapping process.

 On Apple platforms (including both Mac OS X and iOS), this is straightforward,
 as Apple guarantees that these operations will always use UTF-8 to do the
@@ -72,16 +80,13 @@ conversions proved sufficiently problematic that PEP 528 and PEP 529 were
 implemented to bypass the operating system supplied interfaces for binary data
 handling and force the use of UTF-8 instead.

-On Android, the locale settings are of limited relevance (due to most
-applications running in the UTF-16-LE based Dalvik environment) and there's
-limited value in preserving backwards compatibility with other locale aware
-C/C++ components in the same process (since it's a relatively new target
-platform for CPython), so CPython bypasses the operating system provided APIs
-and hardcodes the use of UTF-8 (similar to its behaviour on Apple platforms).
+On Android, many components, including CPython, already assume the use of UTF-8
+as the system encoding, regardless of the locale setting. However, this isn't
+the case for all components, and the discrepancy can cause problems in some
+situations (for example, when using the GNU readline module [16_]).

-On non-Apple and non-Android \*nix systems however, these operations are
-handled using the C locale system in glibc, which has the following
-characteristics [4_]:
+On non-Apple and non-Android \*nix systems, these operations are handled using
+the C locale system in glibc, which has the following characteristics [4_]:

 * by default, all processes start in the ``C`` locale, which uses ``ASCII``
  for these conversions. This is almost never what anyone doing multilingual
@@ -113,9 +118,9 @@ they do when overriding the locale with one based on UTF-8)
 These calls are usually sufficient to provide sensible behaviour, but they can
 still fail in the following cases:

-* SSH environment forwarding means that SSH clients will often forward
+* SSH environment forwarding means that SSH clients may sometimes forward
  client locale settings to servers that don't have that locale installed. This
-  leads to CPython running in the default ASCII-based C locale
+  leads to CPython running in the default ASCII-based C locale.
 * some process environments (such as Linux containers) may not have any
  explicit locale configured at all. As with unknown locales, this leads to
  CPython running in the default ASCII-based C locale
@@ -126,6 +131,18 @@ application. For example::

    LC_ALL=C.UTF-8 LANG=C.UTF-8 python3 ...

+The ``C.UTF-8`` locale is a full locale definition that uses ``UTF-8`` for the
+``LC_CTYPE`` category, and the same settings as the ``C`` locale for all other
+categories (including ``LC_COLLATE``). It is offered by a number of Linux
+distributions (including Debian, Ubuntu, Fedora, Alpine and Android) as an
+alternative to the ASCII-based C locale.
+
+Mac OS X and other \*BSD systems have taken a different approach: instead of
+offering a ``C.UTF-8`` locale, they offer a partial ``UTF-8`` locale that
+only defines the ``LC_CTYPE`` category. On such systems, the preferred
+environmental locale adjustment is to set ``LC_CTYPE=UTF-8`` rather than to set
+``LC_ALL`` or ``LANG``. [17_]
+
 In the specific case of Docker containers and similar technologies, the
 appropriate locale setting can be specified directly in the container image
 definition.
@@ -139,7 +156,7 @@ Relationship with other PEPs
 ============================

 This PEP shares a common problem statement with PEP 540 (improving Python 3's
-behaviour in the default C locale), but diverges markedly in the proposed
+behaviour in the default C locale), but diverged markedly in the proposed
 solution:

 * PEP 540 proposes to entirely decouple CPython's default text encoding from
@@ -148,7 +165,7 @@ solution:
  and in subprocesses. This approach aims to make CPython behave less like a
  locale-aware C/C++ application, and more like C/C++ independent language
  runtimes like the JVM, .NET CLR, Go, Node.js, and Rust
-* this PEP proposes to instead override the legacy C locale with a more recently
+* this PEP proposes to override the legacy C locale with a more recently
  defined locale that uses UTF-8 as its default text encoding. This means that
  the text encoding override will apply not only to CPython, but also to any
  locale aware extension modules loaded into the current process, as well as to
@@ -157,32 +174,23 @@ solution:
  traditional strong support for integration with other components written
  in C and C++, while actively helping to push forward the adoption and
  standardisation of the C.UTF-8 locale as a Unicode-aware replacement for
-  the legacy C locale
+  the legacy C locale in the wider Linux ecosystem

-While the two PEPs present alternate proposed behavioural improvements that
-align with the interests of different parts of the Python user community, they
-don't actually conflict at a technical level.
+After reviewing both PEPs, it became clear that they didn't actually conflict
+at a technical level, and the proposal in PEP 540 offered a superior option in
+cases where no suitable locale was available, as well as offering a better
+reference behaviour for platforms where the notion of a "locale encoding"
+doesn't make sense (for example, embedded systems running MicroPython rather
+than the CPython reference interpreter).

-That means it would be entirely possible to implement both of them, and end up
-with a situation where redistributors, application integrators, and end users
-can choose between:
+As a result, this PEP was amended to specify PEP 540 as a pre-requisite, with
+the aim being to coerce other C/C++ components into behaving consistently with
+CPython's assumption of UTF-8 as the system encoding, rather than CPython itself
+relying on that setting change.

-* coercing the default ASCII based C locale to a UTF-8 based locale
-* instructing CPython to ignore the C locale and use UTF-8 instead
-* doing both of the above (with this option as the default legacy C locale
-  handling)
-* forcing use of the default ASCII based C locale by setting both
-  PYTHONCOERCECLOCALE=0 and PYTHONUTF8=0
-
-If this approach was taken, then the proposed modifications to PEP 11 would
-be adjusted to indicate that the only unsupported configurations are those where
-both the legacy C locale coercion and the C locale text encoding bypass are
-disabled.
-
-Given such a hybrid implementation, it would also be reasonable to drop the
-``en_US.UTF-8`` legacy fallback from the list of UTF-8 locales tried as a
-coercion target and instead rely solely on the C locale text encoding bypass
-in such cases.
+As a result of that change, the ``LC_CTYPE=en_US.UTF-8`` legacy fallback was
+removed from the list of UTF-8 locales tried as a coercion target, with CPython
+instead relying solely on the C locale text encoding bypass in such cases.


 Motivation
@@ -275,21 +283,10 @@ While the glibc developers are working towards making the C.UTF-8 locale
 universally available for use by glibc based applications like CPython [6_],
 this unfortunately doesn't help on platforms that ship older versions of glibc
 without that feature, and also don't provide C.UTF-8 as an on-disk locale the
-way Debian and Fedora do. For these platforms, the best widely available
-fallback option is the ``en_US.UTF-8`` locale, which while still being
-unfortunately Anglo-centric, is at least significantly less Anglo-centric than
-the ASCII text encoding assumption in the default C locale.
-
-In the specific case of C locale coercion, the Anglo-centrism implied by the
-use of ``en_US.UTF-8`` can be mitigated by configuring only the ``LC_CTYPE``
-locale category, rather than overriding all the locale categories::
-
-    $ docker run --rm -e LANG=C.UTF-8 centos/python-35-centos7 python3 -c 'print("ℙƴ☂ℌøἤ")'
-    Unable to decode the command from the command line:
-    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed
-
-    $ docker run --rm -e LC_CTYPE=en_US.UTF-8 centos/python-35-centos7 python3 -c 'print("ℙƴ☂ℌøἤ")'
-    ℙƴ☂ℌøἤ
+way Debian and Fedora do. For these platforms, the mechanism proposed in
+PEP 540 at least allows CPython itself to behave sensibly, albeit without any
+mechanism to get other C/C++ components that decode binary streams as text to
+do the same.


 Design Principles
@@ -308,16 +305,16 @@ proposed solution:
  problems for end users, we'll do this *without* using the warnings system, so
  even running with ``-Werror`` won't turn it into a runtime exception

-The general design principle of Python 3 to prefer raising an exception over
-incorrectly encoding or decoding data then leads to the following additional
-design guideline:
+Minimizing the negative impact on systems currently correctly configured to
+use GB-18030 or another partially ASCII compatible universal encoding leads to
+an additional design principle:

 * if a UTF-8 based Linux container is run on a host that is explicitly
  configured to use a non-UTF-8 encoding, and tries to exchange locally
  encoded data with that host rather than exchanging explicitly UTF-8 encoded
-  data, this will ideally lead to an immediate runtime exception rather than
-  to silent data corruption
-
+  data, CPython will endeavour to correctly round-trip host provided data that
+  is concatenated or split solely at common ASCII compatible code points, but
+  may otherwise emit nonsensical results.


 Specification
@@ -330,8 +327,9 @@ run as a standalone command line application.

 It further proposes to emit a warning on stderr if the legacy ``C`` locale
 is in effect at the point where the language runtime itself is initialized,
-in order to warn system and application integrators that they're running
-CPython in an unsupported configuration.
+and the PEP 540 UTF-8 encoding override is also disabled, in order to warn
+system and application integrators that they're running CPython in an
+unsupported configuration.


 Legacy C locale coercion in the standalone Python interpreter binary
@@ -369,7 +367,7 @@ Three such locales will be tried:
 * ``C.UTF-8`` (available at least in Debian, Ubuntu, and Fedora 25+, and
  expected to be available by default in a future version of glibc)
 * ``C.utf8`` (available at least in HP-UX)
-* ``en_US.UTF-8`` (available at least in RHEL and CentOS)
+* ``UTF-8`` (available in at least some \*BSD variants)

 For ``C.UTF-8`` and ``C.utf8``, the coercion will be implemented by actually
 setting the ``LANG`` and ``LC_ALL`` environment variables to the candidate
@@ -377,15 +375,17 @@ locale name, such that future calls to ``setlocale()`` will see them, as will
 other components looking for those settings (such as GUI development
 frameworks).

-The last fallback isn't ideal as a coercion target (as it changes more than
-just the default text encoding), but has the benefit of currently being more
-widely available than the C.UTF-8 locale. To minimize the chance of side
-effects, only the ``LC_CTYPE`` environment variable would be set when using
-this legacy fallback option, with the other locale categories being left alone.
+For the platforms where it is defined, ``UTF-8`` is a partial locale that only
+defines the ``LC_CTYPE`` category. Accordingly, only the ``LC_CTYPE``
+environment variable would be set when using this fallback option.

-Given time, more environments are expected to provide a ``C.UTF-8`` locale by
-default, so falling all the way back to the ``en_US.UTF-8`` option is expected
-to become less common.
+To adjust automatically to future changes in locale availability, these checks
+will be implemented at runtime on all platforms other than Mac OS X and Windows,
+rather than attempting to determine which locales to try at compile time.
+
+If the locale settings are changed successfully, and the ``PYTHONIOENCODING``
+environment variable is currently unset, then it will be forced to
+``PYTHONIOENCODING=utf-8:surrogateescape``.

 When this locale coercion is activated, the following warning will be
 printed on stderr, with the warning containing whichever locale was
@@ -394,14 +394,15 @@ successfully configured::
    Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another
    locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).

-When falling all the way back to the ``en_US.UTF-8`` locale, the message would
-be slightly different::
+When falling back to the ``UTF-8`` locale, the message would be slightly
+different::

-    Python detected LC_CTYPE=C, LC_CTYPE set to en_US.UTF-8 (set another locale
+    Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale
    or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).

-This locale coercion will mean that the standard Python binary should once
-again "just work" in the three main failure cases we're aware of (missing locale
+In combination with PEP 540, this locale coercion will mean that the standard
+Python binary *and* locale aware C/C++ extensions should once again "just work"
+in the three main failure cases we're aware of (missing locale
 settings, SSH forwarding of unknown locales, and developers explicitly
 requesting ``LANG=C``), as long as the target platform provides at least one
 of the candidate UTF-8 based environments.
@@ -427,7 +428,8 @@ doing so would introduce inconsistencies in decoded text, even in the context
 of the standalone Python interpreter binary.

 Accordingly, when ``Py_Initialize`` is called and CPython detects that the
-configured locale is still the default ``C`` locale, the following warning will
+configured locale is still the default ``C`` locale *and* the ``PYTHONUTF8``
+feature from PEP 540 is disabled, the following warning will
 be issued::

   Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
@@ -440,6 +442,10 @@ Instead, the warning informs both system and application integrators that
 they're running Python 3 in a configuration that we don't expect to work
 properly.

+The second sentence providing recommendations would be conditionally compiled
+based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8`` on \*BSD
+systems).
+

 New build-time configuration options
 ------------------------------------
@@ -465,15 +471,16 @@ Platform Support Changes

 A new "Legacy C Locale" section will be added to PEP 11 that states:

-* as of Python 3.7, the legacy C locale is no longer officially supported,
-  and any Unicode handling issues that occur only in that locale and cannot be
-  reproduced in an appropriately configured non-ASCII locale will be closed as
-  "won't fix"
-* as of Python 3.7, \*nix platforms are expected to provide at least one of
-  ``C.UTF-8``, ``C.utf8`` or ``en_US.UTF-8`` as an alternative to the legacy
-  ``C`` locale. On platforms which don't yet provide any of these locales, an
-  explicit non-ASCII locale setting will be needed to configure a fully
-  supported environment for running Python 3.7+
+* as of CPython 3.7, the legacy C locale is only supported when operating in
+  "UTF-8" mode. Any Unicode handling issues that occur only in that locale
+  and cannot be reproduced in an appropriately configured non-ASCII locale will
+  be closed as "won't fix"
+* as of CPython 3.7, \*nix platforms are expected to provide at least one of
+  ``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8``
+  (``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
+  Any Unicode related integration problems with C/C++ extensions that occur
+  only in that locale and cannot be reproduced in an appropriately configured
+  non-ASCII locale will be closed as "won't fix".


 Rationale
@@ -502,14 +509,14 @@ C/C++ components sharing the same process, as well as with the user's desktop
 locale settings, than it is with the emergent conventions of modern network
 service development.

-The core premise of this PEP is that for *all* of these use cases, the default
-"C" locale is the wrong choice, and furthermore that the following assumptions
-are valid:
+The core premise of this PEP is that for *all* of these use cases, the
+assumption of ASCII implied by the default "C" locale is the wrong choice,
+and furthermore that the following assumptions are valid:

 * in desktop application use cases, the process locale will *already* be
  configured appropriately, and if it isn't, then that is an operating system
-  level problem that needs to be reported to and resolved by the operating
-  system provider
+  or embedding application level problem that needs to be reported to and
+  resolved by the operating system provider or application developer
 * in network service development use cases (especially those based on Linux
  containers), the process locale may not be configured *at all*, and if it
  isn't, then the expectation is that components will impose their own default
@@ -517,54 +524,151 @@ are valid:
  default encoding of ASCII the way CPython currently does


-Defaulting to "strict" error handling on the standard IO streams
----------------------------------------------------------------
+Defaulting to "surrogateescape" error handling on the standard IO streams
+-------------------------------------------------------------------------

 By coercing the locale away from the legacy C default and its assumption of
 ASCII as the preferred text encoding, this PEP also disables the implicit use
 of the "surrogateescape" error handler on the standard IO streams that was
-introduced in Python 3.5 ([15_]).
+introduced in Python 3.5 ([15_]), as well as the automatic use of
+``surrogateescape`` when operating in PEP 540's UTF-8 mode.

-This is deliberate, as that change was primarily aimed at handling the case
-where the correct system encoding was the ASCII-compatible UTF-8 (or another
-ASCII-compatible encoding), but the nominal encoding used for operating system
-interfaces in the current process was ASCII.
+Rather than introducing yet another configuration option to address that,
+this PEP proposes to use the existing ``PYTHONIOENCODING`` setting to ensure
+that the ``surrogateescape`` handler is enabled when the interpreter is
+required to make assumptions regarding the expected filesystem encoding.

-With this PEP, that assumption is being narrowed a step further, such that
-rather than assuming "an ASCII-compatible encoding", we instead assume UTF-8
-specifically. If that assumption is genuinely wrong, then it continues to be
-friendlier to users of other encodings to alert them to the runtime's mistaken
-assumption, rather than continuing on and potentially corrupting their data
-permanently.
+The aim of this behaviour is to attempt to ensure that operating system
+provided text values are typically able to be transparently passed through a
+Python 3 application even if it is incorrect in assuming that that text has
+been encoded as UTF-8.

 In particular, GB 18030 [12_] is a Chinese national text encoding standard
-that handles all Unicode code points, but is incompatible with both ASCII and
-UTF-8.
+that handles all Unicode code points and is formally incompatible with both
+ASCII and UTF-8, but will nevertheless often tolerate processing as surrogate
+escaped data - the points where GB 18030 reuses ASCII byte values in an
+incompatible way are likely to be invalid in UTF-8, and will therefore be
+escaped and opaque to string processing operations that split on or search for
+the relevant ASCII code points. Operations that don't involve splitting on or
+searching for particular ASCII or Unicode code point values are almost
+certain to work correctly.

 Similarly, Shift-JIS [13_] and ISO-2022-JP [14_] remain in widespread use in
-Japan, and are incompatible with both ASCII and UTF-8.
+Japan, and are incompatible with both ASCII and UTF-8, but will tolerate text
+processing operations that don't involve splitting on or searching for
+particular ASCII or Unicode code point values.

-Using strict error handling on the standard streams means that attempting to
-pass information from a host system using one of these encodings into a
-container application that is assuming the use of UTF-8 or vice-versa is likely
-to cause an immediate Unicode encoding or decoding error, rather than
-potentially causing silent data corruption.
+As an example, consider two files, one encoded with UTF-8 (the default encoding
+for ``en_AU.UTF-8``), and one encoded with GB-18030 (the default encoding for
+``zh_CN.gb18030``)::

-For users that would prefer more permissive behaviour, setting
-``PYTHONIOENCODING=:surrogateescape`` will continue to be supported, as this
-PEP makes no changes to that feature.
+    $ python3 -c 'open("utf8.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("utf-8"))'
+    $ python3 -c 'open("gb18030.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("gb18030"))'
+
+On disk, we can see that these are two very different files::
+
+    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "rb").read().strip()); \
+                  print("GB18030:", open("gb18030.txt", "rb").read().strip())'
+    UTF-8:   b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4\n'
+    GB18030: b'\x816\xbd6\x810\x9d0\x817\xa29\x816\xbc4\x810\x8b3\x816\x8d6\n'
+
+Nevertheless, both can be rendered correctly to the terminal as long as
+they're decoded prior to printing::
+
+    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
+                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())'
+    UTF-8:   ℙƴ☂ℌøἤ
+    GB18030: ℙƴ☂ℌøἤ
+
+By contrast, if we just pass along the raw bytes, as ``cat`` and similar C/C++
+utilities will tend to do::
+
+    $ LANG=en_AU.UTF-8 cat utf8.txt gb18030.txt
+    ℙƴ☂ℌøἤ
+    �6�6�0�0�7�9�6�4�0�3�6�6
+
+Even setting a specifically Chinese locale won't help in getting the
+GB-18030 encoded file rendered correctly::
+
+    $ LANG=zh_CN.gb18030 cat utf8.txt gb18030.txt
+    ℙƴ☂ℌøἤ
+    �6�6�0�0�7�9�6�4�0�3�6�6
+
+The problem is that the *terminal* encoding setting remains UTF-8, regardless
+of the nominal locale. A GB18030 terminal can be emulated using the ``iconv``
+utility::
+
+    $ cat utf8.txt gb18030.txt | iconv -f GB18030 -t UTF-8
+    鈩櫰粹槀鈩屆羔激
+    ℙƴ☂ℌøἤ
+
+This reverses the problem, such that the GB18030 file is rendered correctly,
+but the UTF-8 file has been converted to unrelated hanzi characters, rather than
+the expected rendering of "Python" as non-ASCII characters.
+
+With the emulated GB18030 terminal encoding, assuming UTF-8 in Python results
+in *both* files being displayed incorrectly::
+
+    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
+                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
+      | iconv -f GB18030 -t UTF-8
+    UTF-8:   鈩櫰粹槀鈩屆羔激
+    GB18030: 鈩櫰粹槀鈩屆羔激
+
+However, setting the locale correctly means that the emulated GB18030 terminal
+now displays both files as originally intended::
+
+    $ LANG=zh_CN.gb18030 \
+      python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
+                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
+      | iconv -f GB18030 -t UTF-8
+    UTF-8:   ℙƴ☂ℌøἤ
+    GB18030: ℙƴ☂ℌøἤ
+
+The rationale for retaining ``surrogateescape`` as the default IO error handler
+is that it will preserve the following helpful behaviour in the C locale::
+
+    $ cat gb18030.txt \
+      | LANG=C python3 -c "import sys; print(sys.stdin.read())" \
+      | iconv -f GB18030 -t UTF-8
+    ℙƴ☂ℌøἤ
+
+Rather than reverting to the exception seen when a UTF-8 based locale is
+explicitly configured::
+
+    $ cat gb18030.txt \
+      | python3 -c "import sys; print(sys.stdin.read())" \
+      | iconv -f GB18030 -t UTF-8
+    Traceback (most recent call last):
+      File "<string>", line 1, in <module>
+      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
+        (result, consumed) = self._buffer_decode(data, self.errors, final)
+    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte
+
+Note: an alternative to setting ``PYTHONIOENCODING`` as the PEP currently
+proposes would be to instead *always* default to ``surrogateescape`` on the
+standard streams, and require the use of ``PYTHONIOENCODING=:strict`` to request
+text encoding validation during stream processing. Adopting such an approach
+would bring Python 3 more into line with typical C/C++ tools that pass along
+the raw bytes without checking them for conformance to their nominal encoding,
+and would hence also make the last example display the desired output::
+
+    $ cat gb18030.txt \
+      | PYTHONIOENCODING=:surrogateescape python3 -c "import sys; print(sys.stdin.read())" \
+      | iconv -f GB18030 -t UTF-8
+    ℙƴ☂ℌøἤ


-Dropping official support for Unicode handling in the legacy C locale
---------------------------------------------------------------------
+Dropping official support for ASCII based text handling in the legacy C locale
+------------------------------------------------------------------------------

 We've been trying to get strict bytes/text separation to work reliably in the
 legacy C locale for over a decade at this point. Not only haven't we been able
 to get it to work, neither has anyone else - the only viable alternatives
 identified have been to pass the bytes along verbatim without eagerly decoding
-them to text (Python 2.x, Ruby, etc), or else to ignore the nominal C/C++ locale
-encoding entirely and assume the use of either UTF-8 (PEP 540, Rust, Go,
-Node.js, etc) or UTF-16-LE (JVM, .NET CLR).
+them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal
+C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540,
+Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).

 While this PEP ensures that developers that need to do so can still opt-in to
 running their Python code in the legacy C locale, it also makes clear that we
@@ -621,7 +725,10 @@ languages in subprocesses.

 Setting ``LC_ALL`` to ``C.UTF-8`` imposes a locale setting override on all
 C/C++ components in the current process and in any subprocesses that inherit
-the current environment.
+the current environment. This is important to handle cases where the problem
+has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a system
+where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
+configured to forward locale settings, and the user logs into a Linux server).

 Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
 the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``.
@@ -647,15 +754,15 @@ runtimes even when running a version with this change applied.
 Implementation
 ==============

+A draft implementation of the change (including test cases) has been
+posted to issue 28180 [1_], which is an end user request that
+``sys.getfilesystemencoding()`` default to ``utf-8`` rather than ``ascii``.
+
 NOTE: The currently posted draft implementation is for a previous iteration
 of the PEP prior to the incorporation of the feedback noted in [11_]. It was
 broadly the same in concept (i.e. coercing the legacy C locale to one based on
 UTF-8), but differs in several details.

-A draft implementation of the change (including test cases) has been
-posted to issue 28180 [1_], which is an end user request that
-``sys.getfilesystemencoding()`` default to ``utf-8`` rather than ``ascii``.
-

 Backporting to earlier Python 3 releases
 ========================================
@@ -666,8 +773,8 @@ Backporting to Python 3.6.0
 If this PEP is accepted for Python 3.7, redistributors backporting the change
 specifically to their initial Python 3.6.0 release will be both allowed and
 encouraged. However, such backports should only be undertaken either in
-conjunction with the changes needed to also provide the C.UTF-8 locale by
-default, or else specifically for platforms where that locale is already
+conjunction with the changes needed to also provide a suitable locale by
+default, or else specifically for platforms where such a locale is already
 consistently available.


@@ -676,7 +783,7 @@ Backporting to other 3.x releases

 While the proposed behavioural change is seen primarily as a bug fix addressing
 Python 3's current misbehaviour in the default ASCII-based C locale, it still
-represents a reasonable significant change in the way CPython interacts with
+represents a reasonably significant change in the way CPython interacts with
 the C locale system. As such, while some redistributors may still choose to
 backport it to even earlier Python 3.x releases based on the needs and
 interests of their particular user base, this wouldn't be encouraged as a
@@ -716,6 +823,10 @@ PEP 540 [11_].
 The "ℙƴ☂ℌøἤ" string used in the Unicode handling examples throughout this PEP
 is taken from Ned Batchelder's excellent "Pragmatic Unicode" presentation [9_].

+Stephen Turnbull has long provided valuable insight into the text encoding
+handling challenges he regularly encounters at the University of Tsukuba
+(筑波大学).
+

 References
 ==========
@@ -765,6 +876,12 @@ References
 .. [15] Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale
   (https://bugs.python.org/issue19977)

+.. [16] test_readline.test_nonascii fails on Android
+   (http://bugs.python.org/issue28997)
+
+.. [17] UTF-8 locale discussion on "locale.getdefaultlocale() fails on Mac OS X with default language set to English"
+   (http://bugs.python.org/issue18378#msg215215)
+
 Copyright
 =========
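
The pass-through behaviour that the revised rationale relies on (GB 18030 data surviving a process that assumes UTF-8 with ``surrogateescape``) can be checked with a few lines of Python. This is a minimal sketch: `passthrough` is a hypothetical helper standing in for "read from stdin, write to stdout", and it exercises only the codec behaviour, not the interpreter's actual stream configuration.

```python
SAMPLE = "ℙƴ☂ℌøἤ\n"


def passthrough(raw: bytes) -> bytes:
    """Decode as UTF-8 with surrogateescape, then encode the same way."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


gb18030_bytes = SAMPLE.encode("gb18030")

# Strict decoding rejects the GB 18030 data outright.
try:
    gb18030_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape, undecodable bytes become lone surrogates and are
# restored on output, so the original bytes come back unchanged.
round_trip_ok = passthrough(gb18030_bytes) == gb18030_bytes

print("strict UTF-8 decode succeeded:", strict_ok)
print("surrogateescape round trip preserved bytes:", round_trip_ok)
```

The round trip only guarantees that bytes are preserved. As the PEP text notes, operations that split on or search for particular code points may still give nonsensical results on the escaped data.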