PEP 538: Update for latest python-dev discussion
* default standard stream error handler is always "surrogateescape" for the
  potential coercion target locales
* PEP 540 is now a purely optional follow-on PEP that improves the handling of
  cases where none of these locales are available, but doesn't require
  revisiting the changes made for this PEP
* the locale coercion and warning behaviours are now enabled by default for all
  \*nix platforms, even Mac OS X
* covered the Android-specific changes to the use of `setlocale`
* state explicitly that we're aware this makes the behaviour of standalone
  CPython and embedded CPython diverge; we just think the potential benefits
  are sufficient to accept that downside
* note the reference implementation has yet to be updated with these changes
parent dc175c5902
commit 2fb53e7c1b
219
pep-0538.txt
|
@ -36,9 +36,9 @@ However, it comes at the cost of making CPython's encoding assumptions diverge
|
|||
from those of other locale-aware components in the same process, as well as
|
||||
those of components running in subprocesses that share the same environment.
|
||||
|
||||
It also requires changes to the internals of how CPython itself works, rather
|
||||
than using existing configuration settings that are supported by Python
|
||||
versions prior to Python 3.7.
|
||||
It also requires non-trivial changes to the internals of how CPython itself
|
||||
works, rather than relying primarily on existing configuration settings that
|
||||
are supported by Python versions prior to Python 3.7.
|
||||
|
||||
Accordingly, this PEP proposes that independently of the UTF-8 mode proposed
|
||||
in PEP 540, the way the CPython implementation handles the default C locale be
|
||||
|
@ -48,27 +48,25 @@ changed such that:
|
|||
the standalone CPython binary will automatically attempt to coerce the ``C``
|
||||
locale to the first available locale out of ``C.UTF-8``, ``C.utf8``, or
|
||||
``UTF-8``
|
||||
* if the locale is successfully coerced, PEP 540 is not accepted, and the
|
||||
``PYTHONIOENCODING`` environment variable is not set, then
|
||||
``Py_SetStandardStreamEncoding`` will be called with ``"utf-8"`` and
|
||||
``"surrogateescape"`` as arguments.
|
||||
* if the locale is successfully coerced, and PEP 540 *is* accepted, then
|
||||
``PYTHONUTF8`` (if not otherwise set) will be set to ``1``
|
||||
* if the subsequent runtime initialization process detects that the legacy
|
||||
``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
|
||||
* ``Py_Initialize`` will be updated to treat these potential coercion target
|
||||
locales the same way it already treats the ``C`` locale: the default standard
|
||||
stream error handler for these locales will become ``surrogateescape`` (this
|
||||
default can be overridden through ``PYTHONIOENCODING`` and
|
||||
``Py_SetStandardStreamEncoding`` as usual)
|
||||
* if ``Py_Initialize`` detects that the legacy ``C`` locale remains active
|
||||
(e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
|
||||
are available, or the runtime is embedded in an application other than the
|
||||
main CPython binary), locale coercion is not explicitly disabled, and the
|
||||
``PYTHONUTF8`` feature defined in PEP 540 is also disabled (or not
|
||||
implemented), it will emit a warning on stderr that use of the legacy
|
||||
``C`` locale's default ASCII text encoding may cause various Unicode
|
||||
compatibility issues
|
||||
main CPython binary), and locale coercion is not explicitly disabled, it will
|
||||
emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
|
||||
text encoding may cause various Unicode compatibility issues
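
The coercion and warning rules in the list above can be sketched in Python. This is only an illustrative model, not the reference implementation (which lives in C): the function names and the ``is_available`` callback are hypothetical, and the legacy locale is assumed to be reported as the plain string ``"C"``.

```python
# Candidate locales, in the order the PEP says they are tried.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def choose_coercion_target(current_ctype, is_available, coercion_enabled=True):
    """Return the locale to coerce to, or None when no coercion applies.

    `is_available` is a hypothetical callback reporting whether the
    platform can actually configure the named locale.
    """
    if current_ctype != "C" or not coercion_enabled:
        return None
    for target in COERCION_TARGETS:
        if is_available(target):
            return target
    return None


def should_warn_about_c_locale(active_ctype, coercion_enabled=True):
    """Model the Py_Initialize check: warn if the legacy C locale is
    still active and locale coercion was not explicitly disabled."""
    return active_ctype == "C" and coercion_enabled
```

For example, on a platform that only offers ``C.utf8``, ``choose_coercion_target("C", lambda name: name == "C.utf8")`` selects that locale, while an explicitly configured locale such as ``en_AU.UTF-8`` is left untouched.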
|
||||
|
||||
With this change, any \*nix platform that does *not* offer at least one of the
|
||||
``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
|
||||
configuration would only be considered a fully supported platform for CPython
|
||||
3.7+ deployments when either the new ``PYTHONUTF8`` mode defined in PEP 540 is
|
||||
used, or else a suitable locale other than the default ``C`` locale is
|
||||
configured explicitly (e.g. ``en_AU.UTF-8``, ``zh_CN.gb18030``).
|
||||
3.7+ deployments when a suitable locale other than the default ``C`` locale is
|
||||
configured explicitly (e.g. ``en_AU.UTF-8``, ``zh_CN.gb18030``). If PEP 540 is
|
||||
accepted in addition to this PEP, then such platforms would also be supported
|
||||
when using the proposed ``PYTHONUTF8`` mode.
|
||||
|
||||
Redistributors (such as Linux distributions) with a narrower target audience
|
||||
than the upstream CPython development team may also choose to opt in to this
|
||||
|
@ -140,6 +138,9 @@ still fail in the following cases:
|
|||
* some process environments (such as Linux containers) may not have any
|
||||
explicit locale configured at all. As with unknown locales, this leads to
|
||||
CPython running in the default ASCII-based C locale
|
||||
* on Android, rather than configuring the locale based on environment variables,
|
||||
the empty locale ``""`` is treated as specifically requesting the ``"C"``
|
||||
locale
|
||||
|
||||
The simplest way to deal with this problem for currently released versions of
|
||||
CPython is to explicitly set a more sensible locale when launching the
|
||||
|
@ -204,13 +205,13 @@ components, and an approach more amenable to being backported to Python 3.6
|
|||
by downstream redistributors.
|
||||
|
||||
As a result, this PEP was amended to refer to PEP 540 as a complementary
|
||||
solution that offered improved behaviour both when locale coercion triggered,
|
||||
as well as when none of the standard UTF-8 based locales were available.
|
||||
solution that offered improved behaviour when none of the standard UTF-8 based
|
||||
locales were available.
|
||||
|
||||
The availability of PEP 540 also meant that the ``LC_CTYPE=en_US.UTF-8`` legacy
|
||||
fallback was removed from the list of UTF-8 locales tried as a coercion target,
|
||||
with CPython instead relying solely on the proposed PYTHONUTF8 mode in such
|
||||
cases.
|
||||
with the expectation being that CPython will instead rely solely on the
|
||||
proposed PYTHONUTF8 mode in such cases.
|
||||
|
||||
|
||||
Motivation
|
||||
|
@ -323,7 +324,11 @@ proposed solution:
|
|||
release announcements. However, to minimize the chance of introducing new
|
||||
problems for end users, we'll do this *without* using the warnings system, so
|
||||
even running with ``-Werror`` won't turn it into a runtime exception
|
||||
* any changes made will use *existing* configuration options
|
||||
* as far as is feasible, any changes made will use *existing* configuration
|
||||
options
|
||||
* Python's runtime behaviour in potential coercion target locales should be
|
||||
identical regardless of whether the locale was set explicitly in the
|
||||
environment or implicitly as a locale coercion target
|
||||
|
||||
Minimizing the negative impact on systems currently correctly configured to
|
||||
use GB-18030 or another partially ASCII compatible universal encoding leads to
|
||||
|
@ -347,11 +352,14 @@ run as a standalone command line application.
|
|||
|
||||
It further proposes to emit a warning on stderr if the legacy ``C`` locale
|
||||
is in effect at the point where the language runtime itself is initialized,
|
||||
the explicit environmental flag to disable locale coercion is not set, and
|
||||
the PEP 540 UTF-8 encoding override is also disabled (or not implemented), in
|
||||
and the explicit environmental flag to disable locale coercion is not set, in
|
||||
order to warn system and application integrators that they're running CPython
|
||||
in an unsupported configuration.
|
||||
|
||||
In addition to these general changes, some additional Android-specific changes
|
||||
are proposed to handle the differences in the behaviour of ``setlocale`` on that
|
||||
platform.
|
||||
|
||||
|
||||
Legacy C locale coercion in the standalone Python interpreter binary
|
||||
--------------------------------------------------------------------
|
||||
|
@ -401,14 +409,8 @@ defines the ``LC_CTYPE`` category. Accordingly, only the ``LC_CTYPE``
|
|||
environment variable would be set when using this fallback option.
|
||||
|
||||
To adjust automatically to future changes in locale availability, these checks
|
||||
will be implemented at runtime on all platforms other than Mac OS X and Windows,
|
||||
rather than attempting to determine which locales to try at compile time.
|
||||
|
||||
If the locale settings are changed successfully, and the ``PYTHONIOENCODING``
|
||||
environment variable is currently unset, then ``Py_SetStandardStreamEncoding``
|
||||
will be called to force the standard IO streams to ``utf-8`` as the nominal
|
||||
text encoding and ``surrogateescape`` as the error handler (``stderr`` will
|
||||
continue to use ``backslashreplace`` as its error handler as usual).
|
||||
will be implemented at runtime on all platforms other than Windows, rather
|
||||
than attempting to determine which locales to try at compile time.
|
||||
|
||||
When this locale coercion is activated, the following warning will be
|
||||
printed on stderr, with the warning containing whichever locale was
|
||||
|
@ -423,7 +425,7 @@ different::
|
|||
Python detected LC_CTYPE=C: LC_CTYPE coerced to UTF-8 (set another locale
|
||||
or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
|
||||
|
||||
In combination with PEP 540, this locale coercion will mean that the standard
|
||||
This locale coercion will mean that the standard
|
||||
Python binary *and* locale-aware extensions should once again "just work"
|
||||
in the three main failure cases we're aware of (missing locale
|
||||
settings, SSH forwarding of unknown locales, and developers explicitly
|
||||
|
@ -453,8 +455,8 @@ or not to suppress the locale compatibility warning will be similarly
|
|||
independent of these settings.
|
||||
|
||||
|
||||
Changes to the runtime initialization process
|
||||
---------------------------------------------
|
||||
Legacy C locale warning during runtime initialization
|
||||
-----------------------------------------------------
|
||||
|
||||
By the time that ``Py_Initialize`` is called, arbitrary locale-dependent
|
||||
operations may have taken place in the current process. This means that
|
||||
|
@ -463,9 +465,8 @@ doing so would introduce inconsistencies in decoded text, even in the context
|
|||
of the standalone Python interpreter binary.
|
||||
|
||||
Accordingly, when ``Py_Initialize`` is called and CPython detects that the
|
||||
configured locale is still the default ``C`` locale, ``PYTHONCOERCECLOCALE=0``
|
||||
is set, *and* the ``PYTHONUTF8`` feature from PEP 540 is disabled (or not
|
||||
implemented), the following warning will be issued::
|
||||
configured locale is still the default ``C`` locale and
|
||||
``PYTHONCOERCECLOCALE=0`` is not set, the following warning will be issued::
|
||||
|
||||
Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
|
||||
encoding), which may cause Unicode compatibility problems. Using C.UTF-8,
|
||||
|
@ -499,10 +500,42 @@ The locale warning behaviour would be controlled by the flag
|
|||
``--with[out]-c-locale-warning``, which would set the ``PY_WARN_ON_C_LOCALE``
|
||||
preprocessor definition.
|
||||
|
||||
On platforms where they would have no effect (e.g. Mac OS X, iOS, Android,
|
||||
On platforms which don't use the ``autotools`` based build system (i.e.
|
||||
Windows) these preprocessor variables would always be undefined.
|
||||
|
||||
|
||||
Changes to the default error handling on the standard streams
|
||||
-------------------------------------------------------------
|
||||
|
||||
Since Python 3.5, CPython has defaulted to using ``surrogateescape`` on the
|
||||
standard streams (``sys.stdin``, ``sys.stdout``, ``sys.stderr``) when it
|
||||
detects that the current locale is ``C`` and no specific error handler has
|
||||
been set using either the ``PYTHONIOENCODING`` environment variable or the
|
||||
``Py_SetStandardStreamEncoding`` API. For other locales, the default error
|
||||
handler for the standard streams is ``strict``.
|
||||
|
||||
In order to preserve this behaviour without introducing any behavioural
|
||||
discrepancies between locale coercion and explicitly configuring a locale, the
|
||||
coercion target locales (``C.UTF-8``, ``C.utf8``, and ``UTF-8``) will be added
|
||||
to the list of locales that use ``surrogateescape`` as their default error
|
||||
handler for the standard streams.
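
As a rough model of the resulting defaults, the selection logic might look like the following. The helper name and the use of plain strings for locale names are assumptions made for illustration; the sketch ignores overrides via ``PYTHONIOENCODING`` and ``Py_SetStandardStreamEncoding``, and it treats ``stderr`` separately because CPython uses ``backslashreplace`` there.

```python
# Locales whose standard streams default to surrogateescape under this
# proposal: the legacy C locale plus the three coercion targets.
SURROGATEESCAPE_LOCALES = frozenset({"C", "C.UTF-8", "C.utf8", "UTF-8"})


def default_stream_errors(ctype_locale, stream_name):
    """Return the default error handler for a standard stream.

    Hypothetical helper: models only the defaults, not any explicit
    configuration supplied by the user or an embedding application.
    """
    if stream_name == "stderr":
        # stderr keeps backslashreplace so diagnostics are never lost.
        return "backslashreplace"
    if ctype_locale in SURROGATEESCAPE_LOCALES:
        return "surrogateescape"
    return "strict"
```

Under this model ``default_stream_errors("C.UTF-8", "stdin")`` gives the same answer whether ``C.UTF-8`` was set explicitly or reached through coercion, which is the consistency property the section is aiming for.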
|
||||
|
||||
|
||||
Changes to locale settings on Android
|
||||
-------------------------------------
|
||||
|
||||
Independently of the other changes in this PEP, CPython on Android systems
|
||||
will be updated to call ``setlocale(LC_ALL, "C.UTF-8")`` where it currently
|
||||
calls ``setlocale(LC_ALL, "")`` and ``setlocale(LC_CTYPE, "C.UTF-8")`` where
|
||||
it currently calls ``setlocale(LC_CTYPE, "")``.
|
||||
|
||||
This Android-specific behaviour is being introduced due to the following
|
||||
Android-specific details:
|
||||
|
||||
* on Android, passing ``""`` to ``setlocale`` is equivalent to passing ``"C"``
|
||||
* the ``C.UTF-8`` locale is always available
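
The substitution described above amounts to a small mapping on the argument passed to ``setlocale``. The sketch below is a Python illustration only (the real change is in CPython's C code), and both the helper name and the ``on_android`` flag are hypothetical.

```python
def android_setlocale_argument(requested, on_android):
    """Return the locale string that would be passed to setlocale.

    On Android the empty string behaves like "C", so it is replaced
    with "C.UTF-8"; every other request is passed through unchanged.
    """
    if on_android and requested == "":
        return "C.UTF-8"
    return requested
```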
|
||||
|
||||
|
||||
Platform Support Changes
|
||||
========================
|
||||
|
||||
|
@ -515,19 +548,6 @@ A new "Legacy C Locale" section will be added to PEP 11 that states:
|
|||
cannot be reproduced in an appropriately configured non-ASCII locale will be
|
||||
closed as "won't fix".
|
||||
|
||||
If PEP 540 is also implemented, then this section would instead state:
|
||||
|
||||
* as of CPython 3.7, the legacy C locale is only supported when operating in
|
||||
"UTF-8" mode. Any Unicode handling issues that occur only in that locale
|
||||
and cannot be reproduced in an appropriately configured non-ASCII locale will
|
||||
be closed as "won't fix"
|
||||
* as of CPython 3.7, \*nix platforms are expected to provide at least one of
|
||||
``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
|
||||
``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
|
||||
Any Unicode related integration problems with other locale-aware components
|
||||
that occur only in that locale and cannot be reproduced in an appropriately
|
||||
configured non-ASCII locale will be closed as "won't fix".
|
||||
|
||||
|
||||
Rationale
|
||||
=========
|
||||
|
@ -580,10 +600,8 @@ introduced in Python 3.5 ([15_]), as well as the automatic use of
|
|||
``surrogateescape`` when operating in PEP 540's UTF-8 mode.
|
||||
|
||||
Rather than introducing yet another configuration option to address that,
|
||||
this PEP proposes to use the existing ``Py_SetStandardStreamEncoding``
|
||||
interface to ensure that the ``surrogateescape`` handler is enabled when
|
||||
the interpreter is required to make assumptions regarding the expected
|
||||
filesystem encoding.
|
||||
this PEP proposes to extend the "surrogateescape" default to also apply to
|
||||
the three potential coercion target locales.
|
||||
|
||||
The aim of this behaviour is to attempt to ensure that operating system
|
||||
provided text values are typically able to be transparently passed through a
|
||||
|
@ -673,14 +691,14 @@ now displays both files as originally intended::
|
|||
GB18030: ℙƴ☂ℌøἤ
|
||||
|
||||
The rationale for retaining ``surrogateescape`` as the default IO error handler is
|
||||
that it will preserve the following helpful behaviour in the C locale::
|
||||
that it will preserve the following helpful behaviour in the ``C`` locale::
|
||||
|
||||
$ cat gb18030.txt \
|
||||
| LANG=C python3 -c "import sys; print(sys.stdin.read())" \
|
||||
| iconv -f GB18030 -t UTF-8
|
||||
ℙƴ☂ℌøἤ
|
||||
|
||||
Rather than reverting to the exception seen when a UTF-8 based locale is
|
||||
Rather than reverting to the exception currently seen when a UTF-8 based locale is
|
||||
explicitly configured::
|
||||
|
||||
$ cat gb18030.txt \
|
||||
|
@ -692,29 +710,9 @@ explicitly configured::
|
|||
(result, consumed) = self._buffer_decode(data, self.errors, final)
|
||||
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte
|
||||
|
||||
Note: in order to also affect subprocesses running Python 3, earlier versions
|
||||
of this PEP proposed setting ``PYTHONIOENCODING`` to ``utf-8:surrogateescape``
|
||||
rather than calling ``Py_SetStandardStreamEncoding`` when the locale coercion
|
||||
triggered. Unfortunately, this approach proved to have undesirable side
|
||||
effects when Python 2 applications were invoked in subprocesses (as there is
|
||||
no ``surrogateescape`` error handler available in Python 2).
|
||||
|
||||
Another design option would be to *always* default to ``surrogateescape`` on the
|
||||
standard streams, and require the use of ``PYTHONIOENCODING=:strict`` to request
|
||||
text encoding validation during stream processing. Adopting such an approach
|
||||
would bring Python 3 more into line with typical C/C++ tools that pass along
|
||||
the raw bytes without checking them for conformance to their nominal encoding,
|
||||
and would hence also make the last example display the desired output::
|
||||
|
||||
$ cat gb18030.txt \
|
||||
| PYTHONIOENCODING=:surrogateescape python3 -c "import sys; print(sys.stdin.read())" \
|
||||
| iconv -f GB18030 -t UTF-8
|
||||
ℙƴ☂ℌøἤ
|
||||
|
||||
However, such a change would have broader implications than the C locale
|
||||
specific changes currently proposed, so it is considered out of scope for this
|
||||
PEP. Instead, an improved solution is left to the combination of this PEP with
|
||||
PEP 540, by automatically setting ``PYTHONUTF8=1`` when locale coercion occurs.
|
||||
As an added benefit, environments explicitly configured to use one of the
|
||||
coercion target locales will implicitly gain the encoding transparency behaviour
|
||||
currently enabled by default in the ``C`` locale.
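
The encoding transparency referred to here can be demonstrated without touching the locale at all. The sketch below reuses the sample text from the shell examples and assumes nothing beyond the standard codecs; the helper names are made up for illustration.

```python
SAMPLE_TEXT = "ℙƴ☂ℌøἤ"
# GB18030 bytes that are not valid UTF-8.
GB18030_BYTES = SAMPLE_TEXT.encode("gb18030")


def passthrough(data):
    """Decode as UTF-8 with surrogateescape, then encode back again.

    Undecodable bytes are smuggled through as lone surrogates, so the
    original byte sequence is reproduced exactly.
    """
    text = data.decode("utf-8", "surrogateescape")
    return text.encode("utf-8", "surrogateescape")


def strict_decode_fails(data):
    """Report whether a strict UTF-8 decode rejects the data."""
    try:
        data.decode("utf-8", "strict")
    except UnicodeDecodeError:
        return True
    return False
```

With ``surrogateescape`` the bytes survive the round trip and can still be decoded correctly by a downstream tool such as ``iconv``, whereas the ``strict`` handler raises ``UnicodeDecodeError`` as in the traceback shown earlier.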
|
||||
|
||||
|
||||
Dropping official support for ASCII based text handling in the legacy C locale
|
||||
|
@ -724,8 +722,8 @@ We've been trying to get strict bytes/text separation to work reliably in the
|
|||
legacy C locale for over a decade at this point. Not only haven't we been able
|
||||
to get it to work, neither has anyone else - the only viable alternatives
|
||||
identified have been to pass the bytes along verbatim without eagerly decoding
|
||||
them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal
|
||||
C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540,
|
||||
them to text (C/C++, Python 2.x, Ruby, etc), or else to largely ignore the
|
||||
nominal C/C++ locale encoding and assume the use of either UTF-8 (PEP 540,
|
||||
Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).
|
||||
|
||||
While this PEP ensures that developers that genuinely need to do so can still
|
||||
|
@ -740,6 +738,11 @@ locale setting (or PEP 540's UTF-8 mode, if that is available).
|
|||
Providing implicit locale coercion only when running standalone
|
||||
---------------------------------------------------------------
|
||||
|
||||
The major downside of the proposed design in this PEP is that it introduces a
|
||||
potential discrepancy between the behaviour of the CPython runtime when it is
|
||||
run as a standalone application and when it is run as an embedded component
|
||||
inside a larger system (e.g. ``mod_wsgi`` running inside Apache ``httpd``).
|
||||
|
||||
Over the course of Python 3.x development, multiple attempts have been made
|
||||
to improve the handling of incorrect locale settings at the point where the
|
||||
Python interpreter is initialised. The problem that emerged is that this is
|
||||
|
@ -765,6 +768,19 @@ The ``Py_Initialize`` API then only gains an explicit warning (emitted on
|
|||
embedding application to specify something more reasonable.
|
||||
|
||||
|
||||
Allowing restoration of the legacy behaviour
|
||||
--------------------------------------------
|
||||
|
||||
The CPython command line interpreter is often used to investigate faults that
|
||||
occur in other applications that embed CPython, and those applications may still
|
||||
be using the C locale even after this PEP is implemented.
|
||||
|
||||
Providing a simple on/off switch for the locale coercion behaviour makes it
|
||||
much easier to reproduce the behaviour of such applications for debugging
|
||||
purposes, as well as making it easier to reproduce the behaviour of older 3.x
|
||||
runtimes even when running a version with this change applied.
|
||||
|
||||
|
||||
Querying LC_CTYPE for C locale detection
|
||||
----------------------------------------
|
||||
|
||||
|
@ -777,8 +793,8 @@ whether or not the current locale configuration is likely to cause Unicode
|
|||
handling problems.
|
||||
|
||||
|
||||
Setting both LANG & LC_ALL for C.UTF-8 locale coercion
|
||||
------------------------------------------------------
|
||||
Setting both LANG & LC_ALL for UTF-8 locale coercion
|
||||
----------------------------------------------------
|
||||
|
||||
Python is often used as a glue language, integrating other C/C++ ABI compatible
|
||||
components in the current process, and components written in arbitrary
|
||||
|
@ -795,21 +811,23 @@ Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
|
|||
the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``.
|
||||
|
||||
Together, these should ensure that when the locale coercion is activated, the
|
||||
switch to the C.UTF-8 locale will be applied consistently across the current
|
||||
switch to the UTF-8 based locale will be applied consistently across the current
|
||||
process and any subprocesses that inherit the current environment.
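
A minimal sketch of the environment changes this implies is shown below. It assumes, based on the description of ``UTF-8`` as an ``LC_CTYPE``-only locale, that the ``UTF-8`` fallback sets only ``LC_CTYPE`` while the full ``C.UTF-8`` and ``C.utf8`` locales set both ``LC_ALL`` and ``LANG``. The function name is hypothetical, and it returns a modified copy rather than mutating the process environment.

```python
def coerced_environment(environ, target):
    """Return a copy of `environ` updated for the chosen coercion target."""
    env = dict(environ)
    if target == "UTF-8":
        # LC_CTYPE-only locale: leave LANG and LC_ALL alone.
        env["LC_CTYPE"] = target
    else:
        # Full locale: cover components that honour LC_ALL as well as
        # those that only check the LANG fallback.
        env["LC_ALL"] = target
        env["LANG"] = target
    return env
```

Passing the result as the ``env`` argument of ``subprocess.run`` would give child processes the same locale settings as the coerced parent.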
|
||||
|
||||
|
||||
Allowing restoration of the legacy behaviour
|
||||
--------------------------------------------
|
||||
Enabling C locale coercion and warnings on Mac OS X
|
||||
---------------------------------------------------
|
||||
|
||||
The CPython command line interpreter is often used to investigate faults that
|
||||
occur in other applications that embed CPython, and those applications may still
|
||||
be using the C locale even after this PEP is implemented.
|
||||
On Mac OS X, CPython already assumes the use of UTF-8 for system interfaces,
|
||||
and we expect most other locale-aware components to do the same.
|
||||
|
||||
Providing a simple on/off switch for the locale coercion behaviour makes it
|
||||
much easier to reproduce the behaviour of such applications for debugging
|
||||
purposes, as well as making it easier to reproduce the behaviour of older 3.x
|
||||
runtimes even when running a version with this change applied.
|
||||
However, Mac OS X is also frequently used as a development and testing platform
|
||||
for Python software intended for deployment to other \*nix environments (such as
|
||||
Linux).
|
||||
|
||||
Accordingly, this PEP enables the locale coercion and warning features on
|
||||
Mac OS X in the name of cross platform consistency, even though they're expected
|
||||
to be almost entirely redundant on Mac OS X itself.
|
||||
|
||||
|
||||
Implementation
|
||||
|
@ -823,9 +841,10 @@ This reference implementation covers not only the enhancement request in
|
|||
issue 28180 [1_], but also the Android compatibility fixes needed to resolve
|
||||
issue 28997 [16_].
|
||||
|
||||
NOTE: The reference implementation is currently missing the ``configure.ac``
|
||||
checks that are needed to ensure that ``PY_COERCE_C_LOCALE`` and
|
||||
``PY_WARN_ON_C_LOCALE`` are always undefined on Mac OS X.
|
||||
.. note::
|
||||
|
||||
The reference implementation has not yet been updated for the 2017-05-06
|
||||
amendments to the PEP
|
||||
|
||||
|
||||
Backporting to earlier Python 3 releases
|
||||
|
|