PEP 538: Update for latest python-dev discussion

* default standard stream error handler is always "surrogateescape"
  for the potential coercion target locales
* PEP 540 is now a purely optional follow-on PEP that improves the
  handling of cases where none of these locales are available,
  but doesn't require revisiting the changes made for this PEP
* the locale coercion and warning behaviours are now enabled by
  default for all \*nix platforms, even Mac OS X
* covered the Android-specific changes to the use of `setlocale`
* state explicitly that we're aware this makes the behaviour
  of standalone CPython and embedded CPython diverge, we just think
  the potential benefits are sufficient to accept that downside
* note the reference implementation has yet to be updated with
  these changes
Nick Coghlan 2017-05-06 16:58:19 +10:00
parent dc175c5902
commit 2fb53e7c1b
1 changed files with 119 additions and 100 deletions


@ -36,9 +36,9 @@ However, it comes at the cost of making CPython's encoding assumptions diverge
from those of other locale-aware components in the same process, as well as from those of other locale-aware components in the same process, as well as
those of components running in subprocesses that share the same environment. those of components running in subprocesses that share the same environment.
It also requires changes to the internals of how CPython itself works, rather It also requires non-trivial changes to the internals of how CPython itself
than using existing configuration settings that are supported by Python works, rather than relying primarily on existing configuration settings that
versions prior to Python 3.7. are supported by Python versions prior to Python 3.7.
Accordingly, this PEP proposes that independently of the UTF-8 mode proposed Accordingly, this PEP proposes that independently of the UTF-8 mode proposed
in PEP 540, the way the CPython implementation handles the default C locale be in PEP 540, the way the CPython implementation handles the default C locale be
@ -48,27 +48,25 @@ changed such that:
the standalone CPython binary will automatically attempt to coerce the ``C`` the standalone CPython binary will automatically attempt to coerce the ``C``
locale to the first available locale out of ``C.UTF-8``, ``C.utf8``, or locale to the first available locale out of ``C.UTF-8``, ``C.utf8``, or
``UTF-8`` ``UTF-8``
* if the locale is successfully coerced, PEP 540 is not accepted, and the * ``Py_Initialize`` will be updated to treat these potential coercion target
``PYTHONIOENCODING`` environment variable is not set, then locales the same way it already treats the ``C`` locale: the default standard
``Py_SetStandardStreamEncoding`` will be called with ``"utf-8"`` and stream error handler for these locales will become ``surrogateescape`` (this
``"surrogateescape"`` as arguments. default can be overridden through ``PYTHONIOENCODING`` and
* if the locale is successfully coerced, and PEP 540 *is* accepted, then ``Py_SetStandardStreamEncoding`` as usual)
``PYTHONUTF8`` (if not otherwise set) will be set to ``1`` * if ``Py_Initialize`` detects that the legacy ``C`` locale remains active
* if the subsequent runtime initialization process detects that the legacy (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
are available, or the runtime is embedded in an application other than the are available, or the runtime is embedded in an application other than the
main CPython binary), locale coercion is not explicitly disabled, and the main CPython binary), and locale coercion is not explicitly disabled, it will
``PYTHONUTF8`` feature defined in PEP 540 is also disabled (or not emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
implemented), it will emit a warning on stderr that use of the legacy text encoding may cause various Unicode compatibility issues
``C`` locale's default ASCII text encoding may cause various Unicode
compatibility issues
With this change, any \*nix platform that does *not* offer at least one of the With this change, any \*nix platform that does *not* offer at least one of the
``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard ``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
configuration would only be considered a fully supported platform for CPython configuration would only be considered a fully supported platform for CPython
3.7+ deployments when either the new ``PYTHONUTF8`` mode defined in PEP 540 is 3.7+ deployments when a suitable locale other than the default ``C`` locale is
used, or else a suitable locale other than the default ``C`` locale is configured explicitly (e.g. ``en_AU.UTF-8``, ``zh_CN.gb18030``). If PEP 540 is
configured explicitly (e.g. ``en_AU.UTF-8``, ``zh_CN.gb18030``). accepted in addition to this PEP, then such platforms would also be supported
when using the proposed ``PYTHONUTF8`` mode.
Redistributors (such as Linux distributions) with a narrower target audience Redistributors (such as Linux distributions) with a narrower target audience
than the upstream CPython development team may also choose to opt in to this than the upstream CPython development team may also choose to opt in to this
@ -140,6 +138,9 @@ still fail in the following cases:
* some process environments (such as Linux containers) may not have any * some process environments (such as Linux containers) may not have any
explicit locale configured at all. As with unknown locales, this leads to explicit locale configured at all. As with unknown locales, this leads to
CPython running in the default ASCII-based C locale CPython running in the default ASCII-based C locale
* on Android, rather than configuring the locale based on environment variables,
the empty locale ``""`` is treated as specifically requesting the ``"C"``
locale
The simplest way to deal with this problem for currently released versions of The simplest way to deal with this problem for currently released versions of
CPython is to explicitly set a more sensible locale when launching the CPython is to explicitly set a more sensible locale when launching the
@ -204,13 +205,13 @@ components, and an approach more amenable to being backported to Python 3.6
by downstream redistributors. by downstream redistributors.
As a result, this PEP was amended to refer to PEP 540 as a complementary As a result, this PEP was amended to refer to PEP 540 as a complementary
solution that offered improved behaviour both when locale coercion triggered, solution that offered improved behaviour when none of the standard UTF-8 based
as well as when none of the standard UTF-8 based locales were available. locales were available.
The availability of PEP 540 also meant that the ``LC_CTYPE=en_US.UTF-8`` legacy The availability of PEP 540 also meant that the ``LC_CTYPE=en_US.UTF-8`` legacy
fallback was removed from the list of UTF-8 locales tried as a coercion target, fallback was removed from the list of UTF-8 locales tried as a coercion target,
with CPython instead relying solely on the proposed PYTHONUTF8 mode in such with the expectation being that CPython will instead rely solely on the
cases. proposed PYTHONUTF8 mode in such cases.
Motivation Motivation
@ -323,7 +324,11 @@ proposed solution:
release announcements. However, to minimize the chance of introducing new release announcements. However, to minimize the chance of introducing new
problems for end users, we'll do this *without* using the warnings system, so problems for end users, we'll do this *without* using the warnings system, so
even running with ``-Werror`` won't turn it into a runtime exception even running with ``-Werror`` won't turn it into a runtime exception
* any changes made will use *existing* configuration options * as far as is feasible, any changes made will use *existing* configuration
options
* Python's runtime behaviour in potential coercion target locales should be
identical regardless of whether the locale was set explicitly in the
environment or implicitly as a locale coercion target
Minimizing the negative impact on systems currently correctly configured to Minimizing the negative impact on systems currently correctly configured to
use GB-18030 or another partially ASCII compatible universal encoding leads to use GB-18030 or another partially ASCII compatible universal encoding leads to
@ -347,11 +352,14 @@ run as a standalone command line application.
It further proposes to emit a warning on stderr if the legacy ``C`` locale It further proposes to emit a warning on stderr if the legacy ``C`` locale
is in effect at the point where the language runtime itself is initialized, is in effect at the point where the language runtime itself is initialized,
the explicit environmental flag to disable locale coercion is not set, and and the explicit environmental flag to disable locale coercion is not set, in
the PEP 540 UTF-8 encoding override is also disabled (or not implemented), in
order to warn system and application integrators that they're running CPython order to warn system and application integrators that they're running CPython
in an unsupported configuration. in an unsupported configuration.
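
A minimal sketch of the warning condition described in the amended paragraph above, written in Python for readability (the real check would live in the C runtime). The function name and its parameters are hypothetical, and the sketch only models the two conditions named in the text: the ``C`` locale is still active, and ``PYTHONCOERCECLOCALE=0`` is not set.

```python
def should_warn_about_c_locale(lc_ctype, environ):
    """Return True when the legacy C locale warning would be emitted.

    lc_ctype: the active LC_CTYPE locale name (assumed to be a plain string).
    environ:  a mapping of environment variables, e.g. os.environ.
    """
    # PYTHONCOERCECLOCALE=0 is the explicit opt-out named in the PEP text.
    coercion_disabled = environ.get("PYTHONCOERCECLOCALE") == "0"
    return lc_ctype == "C" and not coercion_disabled
```

For an actual process the inputs could be obtained with ``locale.setlocale(locale.LC_CTYPE)`` and ``os.environ``. Passing them in as arguments keeps the sketch easy to test.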
In addition to these general changes, some additional Android-specific changes
are proposed to handle the differences in the behaviour of ``setlocale`` on that
platform.
Legacy C locale coercion in the standalone Python interpreter binary Legacy C locale coercion in the standalone Python interpreter binary
-------------------------------------------------------------------- --------------------------------------------------------------------
@ -401,14 +409,8 @@ defines the ``LC_CTYPE`` category. Accordingly, only the ``LC_CTYPE``
environment variable would be set when using this fallback option. environment variable would be set when using this fallback option.
To adjust automatically to future changes in locale availability, these checks To adjust automatically to future changes in locale availability, these checks
will be implemented at runtime on all platforms other than Mac OS X and Windows, will be implemented at runtime on all platforms other than Windows, rather
rather than attempting to determine which locales to try at compile time. than attempting to determine which locales to try at compile time.
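
The runtime availability check described above can be illustrated with a short Python sketch. This is not the reference implementation (which is written in C). The helper names ``first_available_locale`` and ``_probe_with_setlocale`` are hypothetical, and the candidate list is simply the three target locales named in this PEP.

```python
import locale

# Coercion targets in the order given by the PEP text.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def _probe_with_setlocale(name):
    """Check availability by trying setlocale, then restoring the old value."""
    previous = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False  # the platform does not provide this locale
    locale.setlocale(locale.LC_CTYPE, previous)
    return True


def first_available_locale(candidates=COERCION_TARGETS, probe=None):
    """Return the first candidate accepted by the probe, or None."""
    if probe is None:
        probe = _probe_with_setlocale
    for name in candidates:
        if probe(name):
            return name
    return None
```

The ``probe`` parameter exists only so that the selection order can be exercised without depending on which locales the host system happens to provide.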
If the locale settings are changed successfully, and the ``PYTHONIOENCODING``
environment variable is currently unset, then ``Py_SetStandardStreamEncoding``
will be called to force the standard IO streams to ``utf-8`` as the nominal
text encoding and ``surrogateescape`` as the error handler (``stderr`` will
continue to use ``backslashreplace`` as its error handler as usual).
When this locale coercion is activated, the following warning will be When this locale coercion is activated, the following warning will be
printed on stderr, with the warning containing whichever locale was printed on stderr, with the warning containing whichever locale was
@ -423,7 +425,7 @@ different::
Python detected LC_CTYPE=C: LC_CTYPE coerced to UTF-8 (set another locale Python detected LC_CTYPE=C: LC_CTYPE coerced to UTF-8 (set another locale
or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour). or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
In combination with PEP 540, this locale coercion will mean that the standard This locale coercion will mean that the standard
Python binary *and* locale-aware extensions should once again "just work" Python binary *and* locale-aware extensions should once again "just work"
in the three main failure cases we're aware of (missing locale in the three main failure cases we're aware of (missing locale
settings, SSH forwarding of unknown locales, and developers explicitly settings, SSH forwarding of unknown locales, and developers explicitly
@ -453,8 +455,8 @@ or not to suppress the locale compatibility warning will be similarly
independent of these settings. independent of these settings.
Changes to the runtime initialization process Legacy C locale warning during runtime initialization
--------------------------------------------- -----------------------------------------------------
By the time that ``Py_Initialize`` is called, arbitrary locale-dependent By the time that ``Py_Initialize`` is called, arbitrary locale-dependent
operations may have taken place in the current process. This means that operations may have taken place in the current process. This means that
@ -463,9 +465,8 @@ doing so would introduce inconsistencies in decoded text, even in the context
of the standalone Python interpreter binary. of the standalone Python interpreter binary.
Accordingly, when ``Py_Initialize`` is called and CPython detects that the Accordingly, when ``Py_Initialize`` is called and CPython detects that the
configured locale is still the default ``C`` locale, ``PYTHONCOERCECLOCALE=0`` configured locale is still the default ``C`` locale and
is set, *and* the ``PYTHONUTF8`` feature from PEP 540 is disabled (or not ``PYTHONCOERCECLOCALE=0`` is not set, the following warning will be issued::
implemented), the following warning will be issued::
Python runtime initialized with LC_CTYPE=C (a locale with default ASCII Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
encoding), which may cause Unicode compatibility problems. Using C.UTF-8, encoding), which may cause Unicode compatibility problems. Using C.UTF-8,
@ -499,10 +500,42 @@ The locale warning behaviour would be controlled by the flag
``--with[out]-c-locale-warning``, which would set the ``PY_WARN_ON_C_LOCALE`` ``--with[out]-c-locale-warning``, which would set the ``PY_WARN_ON_C_LOCALE``
preprocessor definition. preprocessor definition.
On platforms where they would have no effect (e.g. Mac OS X, iOS, Android, On platforms which don't use the ``autotools`` based build system (i.e.
Windows) these preprocessor variables would always be undefined. Windows) these preprocessor variables would always be undefined.
Changes to the default error handling on the standard streams
-------------------------------------------------------------
Since Python 3.5, CPython has defaulted to using ``surrogateescape`` on the
standard streams (``sys.stdin``, ``sys.stdout``) when it
detects that the current locale is ``C`` and no specific error handler has
been set using either the ``PYTHONIOENCODING`` environment variable or the
``Py_SetStandardStreamEncoding`` API. For other locales, the default error
handler for the standard streams is ``strict``.
In order to preserve this behaviour without introducing any behavioural
discrepancies between locale coercion and explicitly configuring a locale, the
coercion target locales (``C.UTF-8``, ``C.utf8``, and ``UTF-8``) will be added
to the list of locales that use ``surrogateescape`` as their default error
handler for the standard streams.
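
To illustrate why ``surrogateescape`` matters here, the following Python sketch decodes GB18030 bytes as if they were UTF-8, first with the ``strict`` handler and then with ``surrogateescape``. It only demonstrates the codec behaviour and does not touch the standard streams. The sample text is the string used in this PEP's examples.

```python
# Bytes that are valid GB18030 but not valid UTF-8.
raw = "ℙƴ☂ℌøἤ".encode("gb18030")

# With the "strict" handler, decoding as UTF-8 raises an exception.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With "surrogateescape", undecodable bytes become lone surrogates that
# encode back to the original bytes, so the data passes through unchanged.
text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

Because the bytes survive the round trip, a downstream tool such as ``iconv`` can still interpret them using their real encoding.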
Changes to locale settings on Android
-------------------------------------
Independently of the other changes in this PEP, CPython on Android systems
will be updated to call ``setlocale(LC_ALL, "C.UTF-8")`` where it currently
calls ``setlocale(LC_ALL, "")`` and ``setlocale(LC_CTYPE, "C.UTF-8")`` where
it currently calls ``setlocale(LC_CTYPE, "")``.
This Android-specific behaviour is being introduced due to the following
Android-specific details:
* on Android, passing ``""`` to ``setlocale`` is equivalent to passing ``"C"``
* the ``C.UTF-8`` locale is always available
Platform Support Changes Platform Support Changes
======================== ========================
@ -515,19 +548,6 @@ A new "Legacy C Locale" section will be added to PEP 11 that states:
cannot be reproduced in an appropriately configured non-ASCII locale will be cannot be reproduced in an appropriately configured non-ASCII locale will be
closed as "won't fix". closed as "won't fix".
If PEP 540 is also implemented, then this section would instead state:
* as of CPython 3.7, the legacy C locale is only supported when operating in
"UTF-8" mode. Any Unicode handling issues that occur only in that locale
and cannot be reproduced in an appropriately configured non-ASCII locale will
be closed as "won't fix"
* as of CPython 3.7, \*nix platforms are expected to provide at least one of
``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
Any Unicode related integration problems with other locale-aware components
that occur only in that locale and cannot be reproduced in an appropriately
configured non-ASCII locale will be closed as "won't fix".
Rationale Rationale
========= =========
@ -580,10 +600,8 @@ introduced in Python 3.5 ([15_]), as well as the automatic use of
``surrogateescape`` when operating in PEP 540's UTF-8 mode. ``surrogateescape`` when operating in PEP 540's UTF-8 mode.
Rather than introducing yet another configuration option to address that, Rather than introducing yet another configuration option to address that,
this PEP proposes to use the existing ``Py_SetStandardStreamEncoding`` this PEP proposes to extend the "surrogateescape" default to also apply to
interface to ensure that the ``surrogateescape`` handler is enabled when the three potential coercion target locales.
the interpreter is required to make assumptions regarding the expected
filesystem encoding.
The aim of this behaviour is to attempt to ensure that operating system The aim of this behaviour is to attempt to ensure that operating system
provided text values are typically able to be transparently passed through a provided text values are typically able to be transparently passed through a
@ -673,14 +691,14 @@ now displays both files as originally intended::
GB18030: ℙƴ☂ℌøἤ GB18030: ℙƴ☂ℌøἤ
The rationale for retaining ``surrogateescape`` as the default IO encoding is The rationale for retaining ``surrogateescape`` as the default IO encoding is
that it will preserve the following helpful behaviour in the C locale:: that it will preserve the following helpful behaviour in the ``C`` locale::
$ cat gb18030.txt \ $ cat gb18030.txt \
| LANG=C python3 -c "import sys; print(sys.stdin.read())" \ | LANG=C python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8 | iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ ℙƴ☂ℌøἤ
Rather than reverting to the exception seen when a UTF-8 based locale is Rather than reverting to the exception currently seen when a UTF-8 based locale is
explicitly configured:: explicitly configured::
$ cat gb18030.txt \ $ cat gb18030.txt \
@ -692,29 +710,9 @@ explicitly configured::
(result, consumed) = self._buffer_decode(data, self.errors, final) (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte
Note: in order to also affect subprocesses running Python 3, earlier versions As an added benefit, environments explicitly configured to use one of the
of this PEP proposed setting ``PYTHONIOENCODING`` to ``utf-8:surrogateescape`` coercion target locales will implicitly gain the encoding transparency behaviour
rather than calling ``Py_SetStandardStreamEncoding`` when the locale coercion currently enabled by default in the ``C`` locale.
triggered. Unfortunately, this approach proved to have undesirable side
effects when Python 2 applications were invoked in subprocesses (as there is
no ``surrogateescape`` error handler available in Python 2).
Another design option would be to *always* default to ``surrogateescape`` on the
standard streams, and require the use of ``PYTHONIOENCODING=:strict`` to request
text encoding validation during stream processing. Adopting such an approach
would bring Python 3 more into line with typical C/C++ tools that pass along
the raw bytes without checking them for conformance to their nominal encoding,
and would hence also make the last example display the desired output::
$ cat gb18030.txt \
| PYTHONIOENCODING=:surrogateescape python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ
However, such a change would have broader implications than the C locale
specific changes currently proposed, so it is considered out of scope for this
PEP. Instead, an improved solution is left to the combination of this PEP with
PEP 540, by automatically setting ``PYTHONUTF8=1`` when locale coercion occurs.
Dropping official support for ASCII based text handling in the legacy C locale Dropping official support for ASCII based text handling in the legacy C locale
@ -724,8 +722,8 @@ We've been trying to get strict bytes/text separation to work reliably in the
legacy C locale for over a decade at this point. Not only haven't we been able legacy C locale for over a decade at this point. Not only haven't we been able
to get it to work, neither has anyone else - the only viable alternatives to get it to work, neither has anyone else - the only viable alternatives
identified have been to pass the bytes along verbatim without eagerly decoding identified have been to pass the bytes along verbatim without eagerly decoding
them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal them to text (C/C++, Python 2.x, Ruby, etc), or else to largely ignore the
C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540, nominal C/C++ locale encoding and assume the use of either UTF-8 (PEP 540,
Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR). Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).
While this PEP ensures that developers that genuinely need to do so can still While this PEP ensures that developers that genuinely need to do so can still
@ -740,6 +738,11 @@ locale setting (or PEP 540's UTF-8 mode, if that is available).
Providing implicit locale coercion only when running standalone Providing implicit locale coercion only when running standalone
--------------------------------------------------------------- ---------------------------------------------------------------
The major downside of the proposed design in this PEP is that it introduces a
potential discrepancy between the behaviour of the CPython runtime when it is
run as a standalone application and when it is run as an embedded component
inside a larger system (e.g. ``mod_wsgi`` running inside Apache ``httpd``).
Over the course of Python 3.x development, multiple attempts have been made Over the course of Python 3.x development, multiple attempts have been made
to improve the handling of incorrect locale settings at the point where the to improve the handling of incorrect locale settings at the point where the
Python interpreter is initialised. The problem that emerged is that this is Python interpreter is initialised. The problem that emerged is that this is
@ -765,6 +768,19 @@ The ``Py_Initialize`` API then only gains an explicit warning (emitted on
embedding application to specify something more reasonable. embedding application to specify something more reasonable.
Allowing restoration of the legacy behaviour
--------------------------------------------
The CPython command line interpreter is often used to investigate faults that
occur in other applications that embed CPython, and those applications may still
be using the C locale even after this PEP is implemented.
Providing a simple on/off switch for the locale coercion behaviour makes it
much easier to reproduce the behaviour of such applications for debugging
purposes, as well as making it easier to reproduce the behaviour of older 3.x
runtimes even when running a version with this change applied.
Querying LC_CTYPE for C locale detection Querying LC_CTYPE for C locale detection
---------------------------------------- ----------------------------------------
@ -777,8 +793,8 @@ whether or not the current locale configuration is likely to cause Unicode
handling problems. handling problems.
Setting both LANG & LC_ALL for C.UTF-8 locale coercion Setting both LANG & LC_ALL for UTF-8 locale coercion
------------------------------------------------------ ----------------------------------------------------
Python is often used as a glue language, integrating other C/C++ ABI compatible Python is often used as a glue language, integrating other C/C++ ABI compatible
components in the current process, and components written in arbitrary components in the current process, and components written in arbitrary
@ -795,21 +811,23 @@ Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``. the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``.
Together, these should ensure that when the locale coercion is activated, the Together, these should ensure that when the locale coercion is activated, the
switch to the C.UTF-8 locale will be applied consistently across the current switch to the UTF-8 based locale will be applied consistently across the current
process and any subprocesses that inherit the current environment. process and any subprocesses that inherit the current environment.
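
As a rough sketch of the effect on inherited environments, the Python function below builds the environment that a subprocess would see after coercion, under the assumption stated in this section that both ``LC_ALL`` and ``LANG`` are set to the chosen target. The function name is hypothetical, and the actual change is made in C to the environment of the current process.

```python
def coerced_environment(environ, target="C.UTF-8"):
    """Return a copy of environ with LC_ALL and LANG set to the target locale."""
    env = dict(environ)      # leave the caller's mapping untouched
    env["LC_ALL"] = target   # overrides every LC_* category
    env["LANG"] = target     # covers components that only check LANG
    return env
```

A caller could pass the result as the ``env`` argument of ``subprocess.run`` to approximate what a child process would inherit.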
Allowing restoration of the legacy behaviour Enabling C locale coercion and warnings on Mac OS X
-------------------------------------------- ---------------------------------------------------
The CPython command line interpreter is often used to investigate faults that On Mac OS X, CPython already assumes the use of UTF-8 for system interfaces,
occur in other applications that embed CPython, and those applications may still and we expect most other locale-aware components to do the same.
be using the C locale even after this PEP is implemented.
Providing a simple on/off switch for the locale coercion behaviour makes it However, Mac OS X is also frequently used as a development and testing platform
much easier to reproduce the behaviour of such applications for debugging for Python software intended for deployment to other \*nix environments (such as
purposes, as well as making it easier to reproduce the behaviour of older 3.x Linux).
runtimes even when running a version with this change applied.
Accordingly, this PEP enables the locale coercion and warning features on
Mac OS X in the name of cross platform consistency, even though they're expected
to be almost entirely redundant on Mac OS X itself.
Implementation Implementation
@ -823,9 +841,10 @@ This reference implementation covers not only the enhancement request in
issue 28180 [1_], but also the Android compatibility fixes needed to resolve issue 28180 [1_], but also the Android compatibility fixes needed to resolve
issue 28997 [16_]. issue 28997 [16_].
NOTE: The reference implementation is currently missing the ``configure.ac`` .. note::
checks that are needed to ensure that ``PY_COERCE_C_LOCALE`` and
``PY_WARN_ON_C_LOCALE`` are always undefined on Mac OS X. The reference implementation has not yet been updated for the 2017-05-06
amendments to the PEP
Backporting to earlier Python 3 releases Backporting to earlier Python 3 releases