PEP 538: update based on implementation progress

- using PYTHONIOENCODING poses a compatibility problem for Python 2 subprocesses, so use Py_SetStandardStreamEncoding instead
- note that components checking for "no output on stderr means success" will either need to avoid the warning or switch to checking return codes instead
- Docker, Inc. ends with a full stop, not a comma (noted by Jan Pokorný)
- explicitly acknowledge Charalampos Stratakis's work on the Fedora 26 backport
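
The point about "no output on stderr means success" checks can be illustrated with a minimal Python sketch. It assumes a hypothetical child command that exits successfully but writes a warning to stderr, standing in for an interpreter that emits the locale coercion warning; the warning text and helper names are invented for illustration only.

```python
import subprocess
import sys

# Hypothetical child: exits successfully but emits a warning on stderr,
# standing in for an interpreter that prints a locale coercion warning.
CHILD_CODE = "import sys; sys.stderr.write('warning: illustrative only\\n'); print('ok')"


def run_child():
    """Run the child and return the completed process with captured output."""
    return subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )


def succeeded_by_stderr(proc):
    # Fragile check: treats any stderr output as a failure.
    return proc.stderr == b""


def succeeded_by_returncode(proc):
    # Robust check: relies only on the process exit status.
    return proc.returncode == 0


if __name__ == "__main__":
    proc = run_child()
    print("stderr check says success:", succeeded_by_stderr(proc))
    print("return code check says success:", succeeded_by_returncode(proc))
```

The return code check keeps working whether or not a warning is printed, which is why the commit message suggests it as the alternative to avoiding the warning.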
2017-03-17 18:27:53 +10:00 · 0789423c46
parent 1085515c33
commit 0789423c46
1 changed file with 37 additions and 17 deletions
--- a/pep-0538.txt
+++ b/pep-0538.txt
@@ -48,9 +48,10 @@ changed such that:
  the standalone CPython binary will automatically attempt to coerce the ``C``
  locale to the first available locale out of ``C.UTF-8``, ``C.utf8``, or
  ``UTF-8``
-* if the locale is successfully coerced, and PEP 540 is not accepted, then
-  ``PYTHONIOENCODING`` (if not otherwise set) will be set to
-  ``utf-8:surrogateescape``.
+* if the locale is successfully coerced, PEP 540 is not accepted, and the
+  ``PYTHONIOENCODING`` environment variable is not set, then
+  ``Py_SetStandardStreamEncoding`` will be called with ``"utf-8"`` and
+  ``"surrogateescape"`` as arguments.
 * if the locale is successfully coerced, and PEP 540 *is* accepted, then
  ``PYTHONUTF8`` (if not otherwise set) will be set to ``1``
 * if the subsequent runtime initialization process detects that the legacy
@@ -279,7 +280,7 @@ locale that both distros provide::
    LC_CTYPE="C.UTF-8"
    LC_ALL=

-The Alpine Linux based Python images provided by Docker, Inc, already use the
+The Alpine Linux based Python images provided by Docker, Inc. already use the
 C.UTF-8 locale by default::

    $ docker run --rm python:3 python3 -c 'print("ℙƴ☂ℌøἤ")'
@@ -303,8 +304,8 @@ this unfortunately doesn't help on platforms that ship older versions of glibc
 without that feature, and also don't provide C.UTF-8 as an on-disk locale the
 way Debian and Fedora do. For these platforms, the mechanism proposed in
 PEP 540 at least allows CPython itself to behave sensibly, albeit without any
-mechanism to get other C/C++ components that decode binary streams as text to
-do the same.
+common mechanism to get other C/C++ components that decode binary streams as
+text to do the same.


 Design Principles
@@ -347,9 +348,9 @@ run as a standalone command line application.
 It further proposes to emit a warning on stderr if the legacy ``C`` locale
 is in effect at the point where the language runtime itself is initialized,
 the explicit environmental flag to disable locale coercion is not set, and
-the PEP 540 UTF-8 encoding override is also disabled, in order to warn
-system and application integrators that they're running CPython in an
-unsupported configuration.
+the PEP 540 UTF-8 encoding override is also disabled (or not implemented), in
+order to warn system and application integrators that they're running CPython
+in an unsupported configuration.


 Legacy C locale coercion in the standalone Python interpreter binary
@@ -404,8 +405,10 @@ will be implemented at runtime on all platforms other than Mac OS X and Windows,
 rather than attempting to determine which locales to try at compile time.

 If the locale settings are changed successfully, and the ``PYTHONIOENCODING``
-environment variable is currently unset, then it will be forced to
-``PYTHONIOENCODING=utf-8:surrogateescape``.
+environment variable is currently unset, then ``Py_SetStandardStreamEncoding``
+will be called to force the standard IO streams to ``utf-8`` as the nominal
+text encoding and ``surrogateescape`` as the error handler (``stderr`` will
+continue to use ``backslashreplace`` as its error handler as usual).

 When this locale coercion is activated, the following warning will be
 printed on stderr, with the warning containing whichever locale was
@@ -427,6 +430,12 @@ settings, SSH forwarding of unknown locales, and developers explicitly
 requesting ``LANG=C``), as long as the target platform provides at least one
 of the candidate UTF-8 based environments.

+The one case where failures may still occur is when ``stderr`` is specifically
+being checked for no output, which can be resolved either by configuring
+a locale other than the C locale, or else by using a mechanism other than
+"there was no output on stderr" to check for subprocess errors (e.g. checking
+process return codes).
+
 If none of the candidate locales are successfully configured, then
 initialization will continue in the C locale and the Unicode compatibility
 warning described in the next section will be emitted just as it would for
@@ -571,9 +580,10 @@ introduced in Python 3.5 ([15_]), as well as the automatic use of
 ``surrogateescape`` when operating in PEP 540's UTF-8 mode.

 Rather than introducing yet another configuration option to address that,
-this PEP proposes to use the existing ``PYTHONIOENCODING`` setting to ensure
-that the ``surrogateescape`` handler is enabled when the interpreter is
-required to make assumptions regarding the expected filesystem encoding.
+this PEP proposes to use the existing ``Py_SetStandardStreamEncoding``
+interface to ensure that the ``surrogateescape`` handler is enabled when
+the interpreter is required to make assumptions regarding the expected
+filesystem encoding.

 The aim of this behaviour is to attempt to ensure that operating system
 provided text values are typically able to be transparently passed through a
@@ -682,8 +692,14 @@ explicitly configured::
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte

-Note: an alternative to setting ``PYTHONIOENCODING`` as the PEP currently
-proposes would be to instead *always* default to ``surrogateescape`` on the
+Note: in order to also affect subprocesses running Python 3, earlier versions
+of this PEP proposed setting ``PYTHONIOENCODING`` to ``utf-8:surrogateescape``
+rather than calling ``Py_SetStandardStreamEncoding`` when the locale coercion
+triggered. Unfortunately, this approach proved to have undesirable side
+effects when Python 2 applications were invoked in subprocesses (as there is
+no ``surrogateescape`` error handler available in Python 2).
+
+Another design option would be to *always* default to ``surrogateescape`` on the
 standard streams, and require the use of ``PYTHONIOENCODING=:strict`` to request
 text encoding validation during stream processing. Adopting such an approach
 would bring Python 3 more into line with typical C/C++ tools that pass along
@@ -697,7 +713,8 @@ and would hence also make the last example display the desired output::

 However, such a change would have broader implications than the C locale
 specific changes currently proposed, so it is considered out of scope for this
-PEP.
+PEP. Instead, an improved solution is left to the combination of this PEP with
+PEP 540, by automatically setting ``PYTHONUTF8=1`` when locale coercion occurs.
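
The difference between the ``strict`` and ``surrogateescape`` error handlers discussed in this hunk can be sketched at the codec level in Python. This is only an illustration of the handler behaviour, not of the ``Py_SetStandardStreamEncoding`` call itself; the sample bytes and helper names are assumptions chosen to mirror the ``0x81`` byte in the traceback above.

```python
# Sample data (assumed for illustration): 0x81 is not a valid UTF-8 start byte.
RAW = b"caf\x81"


def strict_fails(data):
    """Return True if strict UTF-8 decoding rejects the data."""
    try:
        data.decode("utf-8", "strict")
    except UnicodeDecodeError:
        return True
    return False


def roundtrip(data, errors):
    """Decode bytes as UTF-8 with the given handler, then encode them back."""
    text = data.decode("utf-8", errors)
    return text.encode("utf-8", errors)


if __name__ == "__main__":
    print("strict decoding fails:", strict_fails(RAW))
    # surrogateescape maps the undecodable byte to a lone surrogate on input
    # and restores the original byte on output.
    print("round trip preserved:", roundtrip(RAW, "surrogateescape") == RAW)
```

With ``surrogateescape`` the undecodable byte survives a decode and encode cycle unchanged, which is the pass-through behaviour the PEP is aiming for on the standard streams.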


 Dropping official support for ASCII based text handling in the legacy C locale
@@ -869,6 +886,9 @@ utility development framework [2_]::
 The change was originally proposed as a downstream patch for Fedora's
 system Python 3.6 package [3_], and then reformulated as a PEP for Python 3.7
 with a section allowing for backports to earlier versions by redistributors.
+In parallel with the development of the upstream patch, Charalampos Stratakis
+has been working on the Fedora 26 backport and providing feedback on the
+practical viability of the proposed changes.

 The initial draft was posted to the Python Linux SIG for discussion [10_] and
 then amended based on both that discussion and Victor Stinner's work in