PEP 538: Update for latest python-dev discussion

* default standard stream error handler is always "surrogateescape"
  for the potential coercion target locales
* PEP 540 is now a purely optional follow-on PEP that improves the
  handling of cases where none of these locales are available,
  but doesn't require revisiting the changes made for this PEP
* the locale coercion and warning behaviours are now enabled by
  default for all \*nix platforms, even Mac OS X
* covered the Android-specific changes to the use of `setlocale`
* state explicitly that we're aware this makes the behaviour
  of standalone CPython and embedded CPython diverge, we just think
  the potential benefits are sufficient to accept that downside
* note the reference implementation has yet to be updated with
  these changes
Nick Coghlan 2017-05-06 16:58:19 +10:00
parent dc175c5902
commit 2fb53e7c1b
1 changed files with 119 additions and 100 deletions

@ -36,9 +36,9 @@ However, it comes at the cost of making CPython's encoding assumptions diverge
from those of other locale-aware components in the same process, as well as
those of components running in subprocesses that share the same environment.
It also requires changes to the internals of how CPython itself works, rather
than using existing configuration settings that are supported by Python
versions prior to Python 3.7.
It also requires non-trivial changes to the internals of how CPython itself
works, rather than relying primarily on existing configuration settings that
are supported by Python versions prior to Python 3.7.
Accordingly, this PEP proposes that independently of the UTF-8 mode proposed
in PEP 540, the way the CPython implementation handles the default C locale be
@ -48,27 +48,25 @@ changed such that:
the standalone CPython binary will automatically attempt to coerce the ``C``
locale to the first available locale out of ``C.UTF-8``, ``C.utf8``, or
``UTF-8``
* if the locale is successfully coerced, PEP 540 is not accepted, and the
``PYTHONIOENCODING`` environment variable is not set, then
``Py_SetStandardStreamEncoding`` will be called with ``"utf-8"`` and
``"surrogateescape"`` as arguments.
* if the locale is successfully coerced, and PEP 540 *is* accepted, then
``PYTHONUTF8`` (if not otherwise set) will be set to ``1``
* if the subsequent runtime initialization process detects that the legacy
``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
* ``Py_Initialize`` will be updated to treat these potential coercion target
locales the same way it already treats the ``C`` locale: the default standard
stream error handler for these locales will become ``surrogateescape`` (this
default can be overridden through ``PYTHONIOENCODING`` and
``Py_SetStandardStreamEncoding`` as usual)
* if ``Py_Initialize`` detects that the legacy ``C`` locale remains active
(e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
are available, or the runtime is embedded in an application other than the
main CPython binary), locale coercion is not explicitly disabled, and the
``PYTHONUTF8`` feature defined in PEP 540 is also disabled (or not
implemented), it will emit a warning on stderr that use of the legacy
``C`` locale's default ASCII text encoding may cause various Unicode
compatibility issues
main CPython binary), and locale coercion is not explicitly disabled, it will
emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
text encoding may cause various Unicode compatibility issues
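The selection of a coercion target described in the list above can be sketched in Python. This is only an illustration of the decision logic, not the CPython implementation (which is written in C): the function names and the ``is_available`` callable are hypothetical, and the availability probe assumes that attempting ``setlocale`` is an acceptable test.

```python
import locale

# Candidate target locales, in the order given in the PEP text.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def pick_coercion_target(current_ctype, is_available, environ):
    """Return the first available target locale, or None if nothing is coerced.

    ``is_available`` is a hypothetical callable (name -> bool) that keeps the
    sketch testable; ``locale_is_available`` below is one way to provide it.
    """
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return None  # coercion explicitly disabled
    if current_ctype != "C":
        return None  # only the legacy C locale is ever coerced
    for name in COERCION_TARGETS:
        if is_available(name):
            return name
    return None  # no target available: the C locale stays active


def locale_is_available(name):
    """Probe a locale by setting LC_CTYPE, then restore the previous value."""
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)
    return True
```

A caller would pass the current ``LC_CTYPE`` setting, for example ``pick_coercion_target(locale.setlocale(locale.LC_CTYPE), locale_is_available, os.environ)`` (this additionally needs ``import os``). A ``None`` result corresponds to the case where the warning in the last bullet may be emitted.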
With this change, any \*nix platform that does *not* offer at least one of the
``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
configuration would only be considered a fully supported platform for CPython
3.7+ deployments when either the new ``PYTHONUTF8`` mode defined in PEP 540 is
used, or else a suitable locale other than the default ``C`` locale is
configured explicitly (e.g. ``en_AU.UTF-8``, ``zh_CN.gb18030``).
3.7+ deployments when a suitable locale other than the default ``C`` locale is
configured explicitly (e.g. ``en_AU.UTF-8``, ``zh_CN.gb18030``). If PEP 540 is
accepted in addition to this PEP, then such platforms would also be supported
when using the proposed ``PYTHONUTF8`` mode.
Redistributors (such as Linux distributions) with a narrower target audience
than the upstream CPython development team may also choose to opt in to this
@ -140,6 +138,9 @@ still fail in the following cases:
* some process environments (such as Linux containers) may not have any
explicit locale configured at all. As with unknown locales, this leads to
CPython running in the default ASCII-based C locale
* on Android, rather than configuring the locale based on environment variables,
the empty locale ``""`` is treated as specifically requesting the ``"C"``
locale
The simplest way to deal with this problem for currently released versions of
CPython is to explicitly set a more sensible locale when launching the
@ -204,13 +205,13 @@ components, and an approach more amenable to being backported to Python 3.6
by downstream redistributors.
As a result, this PEP was amended to refer to PEP 540 as a complementary
solution that offered improved behaviour both when locale coercion triggered,
as well as when none of the standard UTF-8 based locales were available.
solution that offered improved behaviour when none of the standard UTF-8 based
locales were available.
The availability of PEP 540 also meant that the ``LC_CTYPE=en_US.UTF-8`` legacy
fallback was removed from the list of UTF-8 locales tried as a coercion target,
with CPython instead relying solely on the proposed PYTHONUTF8 mode in such
cases.
with the expectation being that CPython will instead rely solely on the
proposed PYTHONUTF8 mode in such cases.
Motivation
@ -323,7 +324,11 @@ proposed solution:
release announcements. However, to minimize the chance of introducing new
problems for end users, we'll do this *without* using the warnings system, so
even running with ``-Werror`` won't turn it into a runtime exception
* any changes made will use *existing* configuration options
* as far as is feasible, any changes made will use *existing* configuration
options
* Python's runtime behaviour in potential coercion target locales should be
identical regardless of whether the locale was set explicitly in the
environment or implicitly as a locale coercion target
Minimizing the negative impact on systems currently correctly configured to
use GB-18030 or another partially ASCII compatible universal encoding leads to
@ -347,11 +352,14 @@ run as a standalone command line application.
It further proposes to emit a warning on stderr if the legacy ``C`` locale
is in effect at the point where the language runtime itself is initialized,
the explicit environmental flag to disable locale coercion is not set, and
the PEP 540 UTF-8 encoding override is also disabled (or not implemented), in
and the explicit environmental flag to disable locale coercion is not set, in
order to warn system and application integrators that they're running CPython
in an unsupported configuration.
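As a rough illustration of the warning condition just described, the check can be written as a small Python function. The function name and the message text are placeholders chosen for this sketch, not the wording or API used by the reference implementation.

```python
def c_locale_warning(current_ctype, environ):
    """Return a warning string if the legacy C locale is still active.

    Returns None when no warning applies: either a non-C locale is in
    effect, or PYTHONCOERCECLOCALE=0 was set to disable the feature.
    """
    if current_ctype != "C":
        return None
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return None
    # Placeholder text only; the PEP defines the actual message.
    return "legacy C locale detected at runtime initialization"
```

A caller would write the returned message straight to ``sys.stderr`` rather than going through the ``warnings`` module, so that options such as ``-Werror`` cannot turn it into an exception.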
In addition to these general changes, some additional Android-specific changes
are proposed to handle the differences in the behaviour of ``setlocale`` on that
platform.
Legacy C locale coercion in the standalone Python interpreter binary
--------------------------------------------------------------------
@ -401,14 +409,8 @@ defines the ``LC_CTYPE`` category. Accordingly, only the ``LC_CTYPE``
environment variable would be set when using this fallback option.
To adjust automatically to future changes in locale availability, these checks
will be implemented at runtime on all platforms other than Mac OS X and Windows,
rather than attempting to determine which locales to try at compile time.
If the locale settings are changed successfully, and the ``PYTHONIOENCODING``
environment variable is currently unset, then ``Py_SetStandardStreamEncoding``
will be called to force the standard IO streams to ``utf-8`` as the nominal
text encoding and ``surrogateescape`` as the error handler (``stderr`` will
continue to use ``backslashreplace`` as its error handler as usual).
will be implemented at runtime on all platforms other than Windows, rather
than attempting to determine which locales to try at compile time.
When this locale coercion is activated, the following warning will be
printed on stderr, with the warning containing whichever locale was
@ -423,7 +425,7 @@ different::
Python detected LC_CTYPE=C: LC_CTYPE coerced to UTF-8 (set another locale
or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
In combination with PEP 540, this locale coercion will mean that the standard
This locale coercion will mean that the standard
Python binary *and* locale-aware extensions should once again "just work"
in the three main failure cases we're aware of (missing locale
settings, SSH forwarding of unknown locales, and developers explicitly
@ -453,8 +455,8 @@ or not to suppress the locale compatibility warning will be similarly
independent of these settings.
Changes to the runtime initialization process
---------------------------------------------
Legacy C locale warning during runtime initialization
-----------------------------------------------------
By the time that ``Py_Initialize`` is called, arbitrary locale-dependent
operations may have taken place in the current process. This means that
@ -463,9 +465,8 @@ doing so would introduce inconsistencies in decoded text, even in the context
of the standalone Python interpreter binary.
Accordingly, when ``Py_Initialize`` is called and CPython detects that the
configured locale is still the default ``C`` locale, ``PYTHONCOERCECLOCALE=0``
is set, *and* the ``PYTHONUTF8`` feature from PEP 540 is disabled (or not
implemented), the following warning will be issued::
configured locale is still the default ``C`` locale and
``PYTHONCOERCECLOCALE=0`` is not set, the following warning will be issued::
Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
encoding), which may cause Unicode compatibility problems. Using C.UTF-8,
@ -499,10 +500,42 @@ The locale warning behaviour would be controlled by the flag
``--with[out]-c-locale-warning``, which would set the ``PY_WARN_ON_C_LOCALE``
preprocessor definition.
On platforms where they would have no effect (e.g. Mac OS X, iOS, Android,
On platforms which don't use the ``autotools`` based build system (i.e.
Windows) these preprocessor variables would always be undefined.
Changes to the default error handling on the standard streams
-------------------------------------------------------------
Since Python 3.5, CPython has defaulted to using ``surrogateescape`` on the
standard streams (``sys.stdin``, ``sys.stdout``) when it detects that the
current locale is ``C`` and no specific error handler has been set using
either the ``PYTHONIOENCODING`` environment variable or the
``Py_SetStandardStreamEncoding`` API. For other locales, the default error
handler for these streams is ``strict``. (``sys.stderr`` uses
``backslashreplace`` regardless of the locale.)
In order to preserve this behaviour without introducing any behavioural
discrepancies between locale coercion and explicitly configuring a locale, the
coercion target locales (``C.UTF-8``, ``C.utf8``, and ``UTF-8``) will be added
to the list of locales that use ``surrogateescape`` as their default error
handler for the standard streams.
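The resulting rule for choosing the default error handler can be sketched as follows. This is a simplified Python model of the behaviour described above, not the C code in CPython; the function name and the ``explicit_errors`` parameter (standing in for a handler supplied through ``PYTHONIOENCODING`` or ``Py_SetStandardStreamEncoding``) are assumptions made for the example.

```python
# Locales whose standard input/output streams default to surrogateescape
# under this proposal: the legacy C locale plus the three coercion targets.
SURROGATEESCAPE_LOCALES = frozenset({"C", "C.UTF-8", "C.utf8", "UTF-8"})


def default_stream_errors(ctype_locale, explicit_errors=None):
    """Pick the error handler for sys.stdin and sys.stdout."""
    if explicit_errors:
        return explicit_errors  # an explicit setting always wins
    if ctype_locale in SURROGATEESCAPE_LOCALES:
        return "surrogateescape"
    return "strict"
```

Because the lookup depends only on the locale name, a locale reached through coercion and the same locale configured explicitly get the same handler.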
Changes to locale settings on Android
-------------------------------------
Independently of the other changes in this PEP, CPython on Android systems
will be updated to call ``setlocale(LC_ALL, "C.UTF-8")`` where it currently
calls ``setlocale(LC_ALL, "")`` and ``setlocale(LC_CTYPE, "C.UTF-8")`` where
it currently calls ``setlocale(LC_CTYPE, "")``.
This Android-specific behaviour is being introduced due to the following
Android-specific details:
* on Android, passing ``""`` to ``setlocale`` is equivalent to passing ``"C"``
* the ``C.UTF-8`` locale is always available
Platform Support Changes
========================
@ -515,19 +548,6 @@ A new "Legacy C Locale" section will be added to PEP 11 that states:
cannot be reproduced in an appropriately configured non-ASCII locale will be
closed as "won't fix".
If PEP 540 is also implemented, then this section would instead state:
* as of CPython 3.7, the legacy C locale is only supported when operating in
"UTF-8" mode. Any Unicode handling issues that occur only in that locale
and cannot be reproduced in an appropriately configured non-ASCII locale will
be closed as "won't fix"
* as of CPython 3.7, \*nix platforms are expected to provide at least one of
``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
Any Unicode related integration problems with other locale-aware components
that occur only in that locale and cannot be reproduced in an appropriately
configured non-ASCII locale will be closed as "won't fix".
Rationale
=========
@ -580,10 +600,8 @@ introduced in Python 3.5 ([15_]), as well as the automatic use of
``surrogateescape`` when operating in PEP 540's UTF-8 mode.
Rather than introducing yet another configuration option to address that,
this PEP proposes to use the existing ``Py_SetStandardStreamEncoding``
interface to ensure that the ``surrogateescape`` handler is enabled when
the interpreter is required to make assumptions regarding the expected
filesystem encoding.
this PEP proposes to extend the ``surrogateescape`` default to also apply to
the three potential coercion target locales.
The aim of this behaviour is to attempt to ensure that operating system
provided text values are typically able to be transparently passed through a
@ -673,14 +691,14 @@ now displays both files as originally intended::
GB18030: ℙƴ☂ℌøἤ
The rationale for retaining ``surrogateescape`` as the default IO encoding is
that it will preserve the following helpful behaviour in the C locale::
that it will preserve the following helpful behaviour in the ``C`` locale::
$ cat gb18030.txt \
| LANG=C python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ
Rather than reverting to the exception seen when a UTF-8 based locale is
Rather than reverting to the exception currently seen when a UTF-8 based locale is
explicitly configured::
$ cat gb18030.txt \
@ -692,29 +710,9 @@ explicitly configured::
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte
Note: in order to also affect subprocesses running Python 3, earlier versions
of this PEP proposed setting ``PYTHONIOENCODING`` to ``utf-8:surrogateescape``
rather than calling ``Py_SetStandardStreamEncoding`` when the locale coercion
triggered. Unfortunately, this approach proved to have undesirable side
effects when Python 2 applications were invoked in subprocesses (as there is
no ``surrogateescape`` error handler available in Python 2).
Another design option would be to *always* default to ``surrogateescape`` on the
standard streams, and require the use of ``PYTHONIOENCODING=:strict`` to request
text encoding validation during stream processing. Adopting such an approach
would bring Python 3 more into line with typical C/C++ tools that pass along
the raw bytes without checking them for conformance to their nominal encoding,
and would hence also make the last example display the desired output::
$ cat gb18030.txt \
| PYTHONIOENCODING=:surrogateescape python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ
However, such a change would have broader implications than the C locale
specific changes currently proposed, so it is considered out of scope for this
PEP. Instead, an improved solution is left to the combination of this PEP with
PEP 540, by automatically setting ``PYTHONUTF8=1`` when locale coercion occurs.
As an added benefit, environments explicitly configured to use one of the
coercion target locales will implicitly gain the encoding transparency behaviour
currently enabled by default in the ``C`` locale.
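The difference between the two error handlers can be reproduced without touching the standard streams. The sketch below uses only ``bytes.decode`` and ``str.encode`` on an in-memory copy of the GB18030 sample text; the helper names are made up for the example.

```python
# The sample text from the examples above, encoded as GB18030 bytes.
RAW = "ℙƴ☂ℌøἤ".encode("gb18030")


def passes_through_unchanged(data):
    """Decode as UTF-8 with surrogateescape, then re-encode the same way."""
    text = data.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape") == data


def strict_decode_fails(data):
    """Report whether strict UTF-8 decoding rejects the data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

With ``surrogateescape``, undecodable bytes are carried through as lone surrogates and restored on output, which is why the ``iconv`` step in the shell example still sees valid GB18030 data.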
Dropping official support for ASCII based text handling in the legacy C locale
@ -724,8 +722,8 @@ We've been trying to get strict bytes/text separation to work reliably in the
legacy C locale for over a decade at this point. Not only haven't we been able
to get it to work, neither has anyone else - the only viable alternatives
identified have been to pass the bytes along verbatim without eagerly decoding
them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal
C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540,
them to text (C/C++, Python 2.x, Ruby, etc), or else to largely ignore the
nominal C/C++ locale encoding and assume the use of either UTF-8 (PEP 540,
Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).
While this PEP ensures that developers that genuinely need to do so can still
@ -740,6 +738,11 @@ locale setting (or PEP 540's UTF-8 mode, if that is available).
Providing implicit locale coercion only when running standalone
---------------------------------------------------------------
The major downside of the proposed design in this PEP is that it introduces a
potential discrepancy between the behaviour of the CPython runtime when it is
run as a standalone application and when it is run as an embedded component
inside a larger system (e.g. ``mod_wsgi`` running inside Apache ``httpd``).
Over the course of Python 3.x development, multiple attempts have been made
to improve the handling of incorrect locale settings at the point where the
Python interpreter is initialised. The problem that emerged is that this is
@ -765,6 +768,19 @@ The ``Py_Initialize`` API then only gains an explicit warning (emitted on
embedding application to specify something more reasonable.
Allowing restoration of the legacy behaviour
--------------------------------------------
The CPython command line interpreter is often used to investigate faults that
occur in other applications that embed CPython, and those applications may still
be using the C locale even after this PEP is implemented.
Providing a simple on/off switch for the locale coercion behaviour makes it
much easier to reproduce the behaviour of such applications for debugging
purposes, as well as making it easier to reproduce the behaviour of older 3.x
runtimes even when running a version with this change applied.
Querying LC_CTYPE for C locale detection
----------------------------------------
@ -777,8 +793,8 @@ whether or not the current locale configuration is likely to cause Unicode
handling problems.
Setting both LANG & LC_ALL for C.UTF-8 locale coercion
------------------------------------------------------
Setting both LANG & LC_ALL for UTF-8 locale coercion
----------------------------------------------------
Python is often used as a glue language, integrating other C/C++ ABI compatible
components in the current process, and components written in arbitrary
@ -795,21 +811,23 @@ Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``.
Together, these should ensure that when the locale coercion is activated, the
switch to the C.UTF-8 locale will be applied consistently across the current
switch to the UTF-8 based locale will be applied consistently across the current
process and any subprocesses that inherit the current environment.
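To illustrate why both variables are set, the sketch below builds the environment that a coerced process would hand to its children. The helper name is hypothetical, and the real change is made in C on the process's own environment rather than on a copy.

```python
def coerced_environment(environ, target="C.UTF-8"):
    """Return a copy of ``environ`` with LC_ALL and LANG set to ``target``."""
    new_env = dict(environ)
    new_env["LC_ALL"] = target  # takes precedence over the other LC_* settings
    new_env["LANG"] = target    # covers components that only consult LANG
    return new_env
```

The result could be passed as the ``env`` argument of ``subprocess.run`` to see how a child process behaves under the coerced settings.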
Allowing restoration of the legacy behaviour
--------------------------------------------
Enabling C locale coercion and warnings on Mac OS X
---------------------------------------------------
The CPython command line interpreter is often used to investigate faults that
occur in other applications that embed CPython, and those applications may still
be using the C locale even after this PEP is implemented.
On Mac OS X, CPython already assumes the use of UTF-8 for system interfaces,
and we expect most other locale-aware components to do the same.
Providing a simple on/off switch for the locale coercion behaviour makes it
much easier to reproduce the behaviour of such applications for debugging
purposes, as well as making it easier to reproduce the behaviour of older 3.x
runtimes even when running a version with this change applied.
However, Mac OS X is also frequently used as a development and testing platform
for Python software intended for deployment to other \*nix environments (such as
Linux).
Accordingly, this PEP enables the locale coercion and warning features on
Mac OS X in the name of cross platform consistency, even though they're expected
to be almost entirely redundant on Mac OS X itself.
Implementation
@ -823,9 +841,10 @@ This reference implementation covers not only the enhancement request in
issue 28180 [1_], but also the Android compatibility fixes needed to resolve
issue 28997 [16_].
NOTE: The reference implementation is currently missing the ``configure.ac``
checks that are needed to ensure that ``PY_COERCE_C_LOCALE`` and
``PY_WARN_ON_C_LOCALE`` are always undefined on Mac OS X.
.. note::

   The reference implementation has not yet been updated for the 2017-05-06
   amendments to the PEP.
Backporting to earlier Python 3 releases