PEP 538: update for python-dev & implementation feedback

- PYTHONCOERCECLOCALE=0 now also disables the library warning
- PEP just refers to locale-aware/locale-independent components,
  without specifically limiting that to C/C++ components
Nick Coghlan 2017-03-13 15:06:48 +10:00
parent 8ae8b612d4
commit 48f355fc28
1 changed files with 86 additions and 55 deletions


@ -19,7 +19,7 @@ Abstract
An ongoing challenge with Python 3 on \*nix systems is the conflict between
needing to use the configured locale encoding by default for consistency with
other C/C++ components in the same process and those invoked in subprocesses,
other locale-aware components in the same process and those invoked in subprocesses,
and the fact that the standard C locale (as defined in POSIX:2001) typically
implies a default text encoding of ASCII, which is entirely inadequate for the
development of networked services and client applications in a multilingual
@ -33,8 +33,8 @@ This is a good approach for cases where network encoding interoperability
is a more important concern than local encoding interoperability.
However, it comes at the cost of making CPython's encoding assumptions diverge
from those of other C and C++ components in the same process, as well as those
of components running in subprocesses that share the same environment.
from those of other locale-aware components in the same process, as well as
those of components running in subprocesses that share the same environment.
It also requires changes to the internals of how CPython itself works, rather
than using existing configuration settings that are supported by Python
@ -55,11 +55,12 @@ changed such that:
``PYTHONUTF8`` (if not otherwise set) will be set to ``1``
* if the subsequent runtime initialization process detects that the legacy
``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
are available, locale coercion is disabled, or the runtime is embedded in an
application other than the main CPython binary), and the ``PYTHONUTF8``
feature defined in PEP 540 is also disabled (or not implemented), it will
emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
text encoding may cause various Unicode compatibility issues
are available, or the runtime is embedded in an application other than the
main CPython binary), locale coercion is not explicitly disabled, and the
``PYTHONUTF8`` feature defined in PEP 540 is also disabled (or not
implemented), it will emit a warning on stderr that use of the legacy
``C`` locale's default ASCII text encoding may cause various Unicode
compatibility issues
With this change, any \*nix platform that does *not* offer at least one of the
``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
@ -163,7 +164,7 @@ definition.
Another common failure case is developers specifying ``LANG=C`` in order to
see otherwise translated user interface messages in English, rather than the
more narrowly scoped ``LC_MESSAGES=C``.
more narrowly scoped ``LC_MESSAGES=C`` or ``LANGUAGE=en``.
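As a minimal sketch of the narrower setting, the helper below builds a subprocess environment that requests untranslated messages through ``LC_MESSAGES`` while leaving ``LANG`` (and hence the text encoding) untouched. The helper name ``english_messages_env`` is hypothetical and not part of the PEP; note also that a pre-existing ``LC_ALL`` in the environment would still override ``LC_MESSAGES``.

```python
import os


def english_messages_env(base_env=None):
    """Return a copy of the environment with only LC_MESSAGES forced to C.

    Hypothetical helper for illustration: LANG and LC_CTYPE are left alone,
    so the configured text encoding is not affected.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LC_MESSAGES"] = "C"
    return env
```

The result can be passed as the ``env`` argument of ``subprocess.run`` in place of an environment that sets ``LANG=C``.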
Relationship with other PEPs
@ -175,20 +176,20 @@ solution:
* PEP 540 proposes to entirely decouple CPython's default text encoding from
the C locale system in that case, allowing text handling inconsistencies to
arise between CPython and other C/C++ components running in the same process
and in subprocesses. This approach aims to make CPython behave less like a
locale-aware C/C++ application, and more like C/C++ independent language
arise between CPython and other locale-aware components running in the same
process and in subprocesses. This approach aims to make CPython behave less
like a locale-aware application, and more like locale-independent language
runtimes like the JVM, .NET CLR, Go, Node.js, and Rust
* this PEP proposes to override the legacy C locale with a more recently
defined locale that uses UTF-8 as its default text encoding. This means that
the text encoding override will apply not only to CPython, but also to any
locale aware extension modules loaded into the current process, as well as to
locale aware C/C++ applications invoked in subprocesses that inherit their
locale-aware extension modules loaded into the current process, as well as to
locale-aware applications invoked in subprocesses that inherit their
environment from the parent process. This approach aims to retain CPython's
traditional strong support for integration with other components written
in C and C++, while actively helping to push forward the adoption and
standardisation of the C.UTF-8 locale as a Unicode-aware replacement for
the legacy C locale in the wider C/C++ ecosystem
traditional strong support for integration with other locale-aware components
while also actively helping to push forward the adoption and standardisation
of the C.UTF-8 locale as a Unicode-aware replacement for the legacy C locale
in the wider C/C++ ecosystem
After reviewing both PEPs, it became clear that they didn't actually conflict
at a technical level, and the proposal in PEP 540 offered a superior option in
@ -197,9 +198,9 @@ reference behaviour for platforms where the notion of a "locale encoding"
doesn't make sense (for example, embedded systems running MicroPython rather
than the CPython reference interpreter).
Meanwhile, this PEP offered improved compatibility with other C/C++ components,
and an approach more amenable to being backported to Python 3.6 by downstream
redistributors.
Meanwhile, this PEP offered improved compatibility with other locale-aware
components, and an approach more amenable to being backported to Python 3.6
by downstream redistributors.
As a result, this PEP was amended to refer to PEP 540 as a complementary
solution that offered improved behaviour both when locale coercion triggered,
@ -323,7 +324,7 @@ proposed solution:
even running with ``-Werror`` won't turn it into a runtime exception
* any changes made will use *existing* configuration options
To minimize the negative impact on systems currently correctly configured to
Minimizing the negative impact on systems currently correctly configured to
use GB-18030 or another partially ASCII compatible universal encoding leads to
an additional design principle:
@ -345,7 +346,8 @@ run as a standalone command line application.
It further proposes to emit a warning on stderr if the legacy ``C`` locale
is in effect at the point where the language runtime itself is initialized,
and the PEP 540 UTF-8 encoding override is also disabled, in order to warn
the explicit environmental flag to disable locale coercion is not set, and
the PEP 540 UTF-8 encoding override is also disabled, in order to warn
system and application integrators that they're running CPython in an
unsupported configuration.
@ -358,7 +360,7 @@ reconfigure the C locale before any locale dependent operations are executed
in the process.
This means that it can change the locale settings not only for the CPython
runtime, but also for any other C/C++ components running in the current
runtime, but also for any other locale-aware components running in the current
process (e.g. as part of extension modules), as well as in subprocesses that
inherit their environment from the current process.
@ -409,31 +411,37 @@ When this locale coercion is activated, the following warning will be
printed on stderr, with the warning containing whichever locale was
successfully configured::
Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another
Python detected LC_CTYPE=C: LC_ALL & LANG coerced to C.UTF-8 (set another
locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
When falling back to the ``UTF-8`` locale, the message would be slightly
different::
Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale
Python detected LC_CTYPE=C: LC_CTYPE coerced to UTF-8 (set another locale
or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
In combination with PEP 540, this locale coercion will mean that the standard
Python binary *and* locale aware C/C++ extensions should once again "just work"
Python binary *and* locale-aware extensions should once again "just work"
in the three main failure cases we're aware of (missing locale
settings, SSH forwarding of unknown locales, and developers explicitly
requesting ``LANG=C``), as long as the target platform provides at least one
of the candidate UTF-8 based environments.
If ``PYTHONCOERCECLOCALE=0`` is set, or none of the candidate locales is
successfully configured, then initialization will continue as usual in the C
locale and the Unicode compatibility warning described in the next section will
be emitted just as it would for any other application.
If none of the candidate locales are successfully configured, then
initialization will continue in the C locale and the Unicode compatibility
warning described in the next section will be emitted just as it would for
any other application.
If ``PYTHONCOERCECLOCALE=0`` is explicitly set, initialization will continue in
the C locale and the Unicode compatibility warning described in the next
section will be automatically suppressed.
The interpreter will always check for the ``PYTHONCOERCECLOCALE`` environment
variable (even when running under the ``-E`` or ``-I`` switches), as the locale
coercion check necessarily takes place before any command line argument
processing.
variable at startup (even when running under the ``-E`` or ``-I`` switches),
as the locale coercion check necessarily takes place before any command line
argument processing. For consistency, the runtime check to determine whether
or not to suppress the locale compatibility warning will be similarly
independent of these settings.
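The startup behaviour described above can be sketched as a small decision function. This is an illustrative model only, not the CPython implementation: the function name, its parameters, and the use of a set to stand in for "locales that can be successfully configured" are all assumptions made for the example.

```python
# Candidate locales, in the order the PEP lists them.
CANDIDATE_LOCALES = ("C.UTF-8", "C.utf8", "UTF-8")


def plan_startup(environ, lc_ctype_is_c, available_locales,
                 utf8_mode_enabled=False):
    """Return (coerced_locale_or_None, emit_compatibility_warning).

    Hypothetical model: `available_locales` stands in for the locales
    that setlocale() would accept on the target platform.
    """
    if not lc_ctype_is_c:
        # A non-C locale is already configured: nothing to do.
        return None, False
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        # Explicit opt-out: stay in the C locale and suppress the warning.
        return None, False
    for candidate in CANDIDATE_LOCALES:
        if candidate in available_locales:
            return candidate, False
    # No candidate could be configured: stay in the C locale and warn,
    # unless the PEP 540 UTF-8 mode is active.
    return None, not utf8_mode_enabled
```

In this model the environment is read directly rather than through any command line handling, mirroring the point that the check is made regardless of ``-E`` or ``-I``.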
Changes to the runtime initialization process
@ -446,12 +454,12 @@ doing so would introduce inconsistencies in decoded text, even in the context
of the standalone Python interpreter binary.
Accordingly, when ``Py_Initialize`` is called and CPython detects that the
configured locale is still the default ``C`` locale *and* the ``PYTHONUTF8``
feature from PEP 540 is disabled, the following warning will
be issued::
configured locale is still the default ``C`` locale, ``PYTHONCOERCECLOCALE=0``
is not set, *and* the ``PYTHONUTF8`` feature from PEP 540 is disabled (or not
implemented), the following warning will be issued::
Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
encoding), which may cause Unicode compatibility problems. Using C.UTF-8
encoding), which may cause Unicode compatibility problems. Using C.UTF-8,
C.utf8, or UTF-8 (if available) as alternative Unicode-compatible
locales is recommended.
@ -461,9 +469,10 @@ Instead, the warning informs both system and application integrators that
they're running Python 3 in a configuration that we don't expect to work
properly.
The second sentence providing recommendations would be conditionally compiled
based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8`` on \*BSD
systems.
The second sentence providing recommendations may eventually be conditionally
compiled based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8``
on \*BSD systems), but the initial implementation will just use the common
generic message shown above.
New build-time configuration options
@ -490,6 +499,15 @@ Platform Support Changes
A new "Legacy C Locale" section will be added to PEP 11 that states:
* as of CPython 3.7, \*nix platforms are expected to provide at least one of
``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
Any Unicode related integration problems that occur only in that locale and
cannot be reproduced in an appropriately configured non-ASCII locale will be
closed as "won't fix".
If PEP 540 is also implemented, then this section would instead state:
* as of CPython 3.7, the legacy C locale is only supported when operating in
"UTF-8" mode. Any Unicode handling issues that occur only in that locale
and cannot be reproduced in an appropriately configured non-ASCII locale will
@ -497,9 +515,9 @@ A new "Legacy C Locale" section will be added to PEP 11 that states:
* as of CPython 3.7, \*nix platforms are expected to provide at least one of
``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
Any Unicode related integration problems with C/C++ extensions that occur
only in that locale and cannot be reproduced in an appropriately configured
non-ASCII locale will be closed as "won't fix".
Any Unicode related integration problems with other locale-aware components
that occur only in that locale and cannot be reproduced in an appropriately
configured non-ASCII locale will be closed as "won't fix".
Rationale
@ -524,9 +542,9 @@ The challenge for CPython has been the fact that in addition to being used for
network service development, it is also extensively used as an embedded
scripting language in larger applications, and as a desktop application
development language, where it is more important to be consistent with other
C/C++ components sharing the same process, as well as with the user's desktop
locale settings, than it is with the emergent conventions of modern network
service development.
locale-aware components sharing the same process, as well as with the user's
desktop locale settings, than it is with the emergent conventions of modern
network service development.
The core premise of this PEP is that for *all* of these use cases, the
assumption of ASCII implied by the default "C" locale is the wrong choice,
@ -677,6 +695,10 @@ and would hence also make the last example display the desired output::
| iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ
However, such a change would have broader implications than the C locale
specific changes currently proposed, so it is considered out of scope for this
PEP.
Dropping official support for ASCII based text handling in the legacy C locale
------------------------------------------------------------------------------
@ -689,10 +711,13 @@ them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal
C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540,
Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).
While this PEP ensures that developers that need to do so can still opt-in to
running their Python code in the legacy C locale, it also makes clear that we
*don't* expect Python 3's Unicode handling to be reliable in that configuration,
and the recommended alternative is to use a more appropriate locale setting.
While this PEP ensures that developers who genuinely need to do so can still
opt in to running their Python code in the legacy C locale (either by setting
``PYTHONCOERCECLOCALE=0`` or running a custom build that sets
``--without-c-locale-coercion``), it also makes it clear that we *don't*
expect Python 3's Unicode handling to be completely reliable in that
configuration, and the recommended alternative is to use a more appropriate
locale setting (or PEP 540's UTF-8 mode, if that is available).
Providing implicit locale coercion only when running standalone
@ -743,10 +768,10 @@ components in the current process, and components written in arbitrary
languages in subprocesses.
Setting ``LC_ALL`` to ``C.UTF-8`` imposes a locale setting override on all
C/C++ components in the current process and in any subprocesses that inherit
the current environment. This is important to handle cases where the problem
has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a system
where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
locale-aware components in the current process and in any subprocesses that
inherit the current environment. This is important to handle cases where the
problem has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a
system where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
configured to forward locale settings, and the user logs into a Linux server).
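The reason ``LC_ALL`` is able to override a forwarded ``LC_CTYPE`` setting is the standard POSIX lookup order for locale categories: ``LC_ALL`` first, then the category-specific variable, then ``LANG``. The sketch below models that order; the function name is hypothetical, and the final fallback to ``"C"`` is an assumption (POSIX leaves the default implementation-defined).

```python
def effective_locale(category, environ):
    """Model the POSIX lookup order for one locale category.

    Hypothetical helper for illustration. Unset and empty variables are
    treated the same way, as POSIX specifies.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = environ.get(name)
        if value:
            return value
    # Assumed default when nothing is set.
    return "C"


# A forwarded LC_CTYPE=UTF-8 wins over LANG...
forwarded = {"LC_CTYPE": "UTF-8", "LANG": "en_US.UTF-8"}
print(effective_locale("LC_CTYPE", forwarded))

# ...but is itself overridden once LC_ALL is set.
coerced = dict(forwarded, LC_ALL="C.UTF-8")
print(effective_locale("LC_CTYPE", coerced))
```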
Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
@ -797,6 +822,9 @@ conjunction with the changes needed to also provide a suitable locale by
default, or else specifically for platforms where such a locale is already
consistently available.
At least the Fedora project is planning to pursue this approach for the
upcoming Fedora 26 release [19_].
Backporting to other 3.x releases
---------------------------------
@ -909,6 +937,9 @@ References
.. [18] GitHub branch diff for ``ncoghlan:pep538-coerce-c-locale``
(https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-locale)
.. [19] Fedora 26 change proposal for locale coercion backport
(https://fedoraproject.org/wiki/Changes/python3_c.utf-8_locale)
Copyright
=========