diff --git a/pep-0538.txt b/pep-0538.txt
index d981b5693..7e4ba9051 100644
--- a/pep-0538.txt
+++ b/pep-0538.txt
@@ -48,9 +48,10 @@ changed such that:
   the standalone CPython binary will automatically attempt to coerce the ``C``
   locale to the first available locale out of ``C.UTF-8``, ``C.utf8``, or
   ``UTF-8``
-* if the locale is successfully coerced, and PEP 540 is not accepted, then
-  ``PYTHONIOENCODING`` (if not otherwise set) will be set to
-  ``utf-8:surrogateescape``.
+* if the locale is successfully coerced, PEP 540 is not accepted, and the
+  ``PYTHONIOENCODING`` environment variable is not set, then
+  ``Py_SetStandardStreamEncoding`` will be called with ``"utf-8"`` and
+  ``"surrogateescape"`` as arguments.
 * if the locale is successfully coerced, and PEP 540 *is* accepted, then
   ``PYTHONUTF8`` (if not otherwise set) will be set to ``1``
 * if the subsequent runtime initialization process detects that the legacy
@@ -279,7 +280,7 @@ locale that both distros provide::
     LC_CTYPE="C.UTF-8"
     LC_ALL=
 
-The Alpine Linux based Python images provided by Docker, Inc, already use the
+The Alpine Linux based Python images provided by Docker, Inc. already use the
 C.UTF-8 locale by default::
 
     $ docker run --rm python:3 python3 -c 'print("ℙƴ☂ℌøἤ")'
@@ -303,8 +304,8 @@ this unfortunately doesn't help on platforms that ship older versions of glibc
 without that feature, and also don't provide C.UTF-8 as an on-disk locale the
 way Debian and Fedora do. For these platforms, the mechanism proposed in PEP
 540 at least allows CPython itself to behave sensibly, albeit without any
-mechanism to get other C/C++ components that decode binary streams as text to
-do the same.
+common mechanism to get other C/C++ components that decode binary streams as
+text to do the same.
 
 
 Design Principles
@@ -347,9 +348,9 @@ run as a standalone command line application.
 It further proposes to emit a warning on stderr if the legacy ``C`` locale is
 in effect at the point where the language runtime itself is initialized, the
 explicit environmental flag to disable locale coercion is not set, and
-the PEP 540 UTF-8 encoding override is also disabled, in order to warn
-system and application integrators that they're running CPython in an
-unsupported configuration.
+the PEP 540 UTF-8 encoding override is also disabled (or not implemented), in
+order to warn system and application integrators that they're running CPython
+in an unsupported configuration.
 
 
 Legacy C locale coercion in the standalone Python interpreter binary
@@ -404,8 +405,10 @@ will be implemented at runtime on all platforms other than Mac OS X and Windows,
 rather than attempting to determine which locales to try at compile time.
 
 If the locale settings are changed successfully, and the ``PYTHONIOENCODING``
-environment variable is currently unset, then it will be forced to
-``PYTHONIOENCODING=utf-8:surrogateescape``.
+environment variable is currently unset, then Py_SetStandardStreamEncoding will
+be called to force the standard IO streams to ``utf-8`` as the nominal text
+encoding and ``surrogateescape`` as the error handler (``stderr`` will
+continue to use ``backslashreplace`` as its error handler as usual).
 
 When this locale coercion is activated, the following warning will be
 printed on stderr, with the warning containing whichever locale was
@@ -427,6 +430,12 @@ settings, SSH forwarding of unknown locales, and developers explicitly
 requesting ``LANG=C``), as long as the target platform provides at least one
 of the candidate UTF-8 based environments.
 
+The one case where failures may still occur is when ``stderr`` is specifically
+being checked for no output, which can be resolved either by configuring
+a locale other than the C locale, or else by using a mechanism other than
+"there was no output on stderr" to check for subprocess errors (e.g. checking
+process return codes).
+
 If none of the candidate locales are successfully configured, then
 initialization will continue in the C locale and the Unicode compatibility
 warning described in the next section will be emitted just as it would for
@@ -571,9 +580,10 @@ introduced in Python 3.5 ([15_]), as well as the automatic use of
 ``surrogateescape`` when operating in PEP 540's UTF-8 mode.
 
 Rather than introducing yet another configuration option to address that,
-this PEP proposes to use the existing ``PYTHONIOENCODING`` setting to ensure
-that the ``surrogateescape`` handler is enabled when the interpreter is
-required to make assumptions regarding the expected filesystem encoding.
+this PEP proposes to use the existing ``Py_SetStandardStreamEncoding``
+interface to ensure that the ``surrogateescape`` handler is enabled when
+the interpreter is required to make assumptions regarding the expected
+filesystem encoding.
 
 The aim of this behaviour is to attempt to ensure that operating system
 provided text values are typically able to be transparently passed through a
@@ -682,8 +692,14 @@ explicitly configured::
         (result, consumed) = self._buffer_decode(data, self.errors, final)
     UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte
 
-Note: an alternative to setting ``PYTHONIOENCODING`` as the PEP currently
-proposes would be to instead *always* default to ``surrogateescape`` on the
+Note: in order to also affect subprocesses running Python 3, earlier versions
+of this PEP proposed setting ``PYTHONIOENCODING`` to ``utf-8:surrogateescape``
+rather than calling ``Py_SetStandardStreamEncoding`` when the locale coercion
+triggered. Unfortunately, this approach proved to have undesirable side
+effects when Python 2 applications were invoked in subprocesses (as there is
+no ``surrogateescape`` error handler available in Python 2).
+
+Another design option would be to *always* default to ``surrogateescape`` on the
 standard streams, and require the use of ``PYTHONIOENCODING=:strict`` to request
 text encoding validation during stream processing. Adopting such an approach
 would bring Python 3 more into line with typical C/C++ tools that pass along
@@ -697,7 +713,8 @@ and would hence also make the last example display the desired output::
 
 However, such a change would have broader implications than the C locale
 specific changes currently proposed, so it is considered out of scope for this
-PEP.
+PEP. Instead, an improved solution is left to the combination of this PEP with
+PEP 540, by automatically setting ``PYTHONUTF8=1`` when locale coercion occurs.
 
 
 Dropping official support for ASCII based text handling in the legacy C locale
@@ -869,6 +886,9 @@ utility development framework [2_]::
 The change was originally proposed as a downstream patch for Fedora's
 system Python 3.6 package [3_], and then reformulated as a PEP for Python 3.7
 with a section allowing for backports to earlier versions by redistributors.
+In parallel with the development of the upstream patch, Charalampos Stratakis
+has been working on the Fedora 26 backport and providing feedback on the
+practical viability of the proposed changes.
 
 The initial draft was posted to the Python Linux SIG for discussion [10_] and
 then amended based on both that discussion and Victor Stinner's work in