PEP 538: update for python-dev & implementation feedback
- PYTHONCOERCECLOCALE=0 now also disables the library warning
- PEP just refers to locale-aware/locale-independent components,
  without specifically limiting that to C/C++ components
parent 8ae8b612d4
commit 48f355fc28
141 pep-0538.txt
@@ -19,7 +19,7 @@ Abstract
|
|||
|
||||
An ongoing challenge with Python 3 on \*nix systems is the conflict between
|
||||
needing to use the configured locale encoding by default for consistency with
|
||||
other C/C++ components in the same process and those invoked in subprocesses,
|
||||
other locale-aware components in the same process or subprocesses,
|
||||
and the fact that the standard C locale (as defined in POSIX:2001) typically
|
||||
implies a default text encoding of ASCII, which is entirely inadequate for the
|
||||
development of networked services and client applications in a multilingual
|
||||
|
@@ -33,8 +33,8 @@ This is a good approach for cases where network encoding interoperability
|
|||
is a more important concern than local encoding interoperability.
|
||||
|
||||
However, it comes at the cost of making CPython's encoding assumptions diverge
|
||||
from those of other C and C++ components in the same process, as well as those
|
||||
of components running in subprocesses that share the same environment.
|
||||
from those of other locale-aware components in the same process, as well as
|
||||
those of components running in subprocesses that share the same environment.
|
||||
|
||||
It also requires changes to the internals of how CPython itself works, rather
|
||||
than using existing configuration settings that are supported by Python
|
||||
|
@@ -55,11 +55,12 @@ changed such that:
|
|||
``PYTHONUTF8`` (if not otherwise set) will be set to ``1``
|
||||
* if the subsequent runtime initialization process detects that the legacy
|
||||
``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
|
||||
are available, locale coercion is disabled, or the runtime is embedded in an
|
||||
application other than the main CPython binary), and the ``PYTHONUTF8``
|
||||
feature defined in PEP 540 is also disabled (or not implemented), it will
|
||||
emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
|
||||
text encoding may cause various Unicode compatibility issues
|
||||
are available, or the runtime is embedded in an application other than the
|
||||
main CPython binary), locale coercion is not explicitly disabled, and the
|
||||
``PYTHONUTF8`` feature defined in PEP 540 is also disabled (or not
|
||||
implemented), it will emit a warning on stderr that use of the legacy
|
||||
``C`` locale's default ASCII text encoding may cause various Unicode
|
||||
compatibility issues
|
||||
|
||||
With this change, any \*nix platform that does *not* offer at least one of the
|
||||
``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
|
||||
|
@@ -163,7 +164,7 @@ definition.
|
|||
|
||||
Another common failure case is developers specifying ``LANG=C`` in order to
|
||||
see otherwise translated user interface messages in English, rather than the
|
||||
more narrowly scoped ``LC_MESSAGES=C``.
|
||||
more narrowly scoped ``LC_MESSAGES=C`` or ``LANGUAGE=en``.
|
||||
|
||||
|
||||
Relationship with other PEPs
|
||||
|
@@ -175,20 +176,20 @@ solution:
|
|||
|
||||
* PEP 540 proposes to entirely decouple CPython's default text encoding from
|
||||
the C locale system in that case, allowing text handling inconsistencies to
|
||||
arise between CPython and other C/C++ components running in the same process
|
||||
and in subprocesses. This approach aims to make CPython behave less like a
|
||||
locale-aware C/C++ application, and more like C/C++ independent language
|
||||
arise between CPython and other locale-aware components running in the same
|
||||
process and in subprocesses. This approach aims to make CPython behave less
|
||||
like a locale-aware application, and more like locale-independent language
|
||||
runtimes like the JVM, .NET CLR, Go, Node.js, and Rust
|
||||
* this PEP proposes to override the legacy C locale with a more recently
|
||||
defined locale that uses UTF-8 as its default text encoding. This means that
|
||||
the text encoding override will apply not only to CPython, but also to any
|
||||
locale aware extension modules loaded into the current process, as well as to
|
||||
locale aware C/C++ applications invoked in subprocesses that inherit their
|
||||
locale-aware extension modules loaded into the current process, as well as to
|
||||
locale-aware applications invoked in subprocesses that inherit their
|
||||
environment from the parent process. This approach aims to retain CPython's
|
||||
traditional strong support for integration with other components written
|
||||
in C and C++, while actively helping to push forward the adoption and
|
||||
standardisation of the C.UTF-8 locale as a Unicode-aware replacement for
|
||||
the legacy C locale in the wider C/C++ ecosystem
|
||||
traditional strong support for integration with other locale-aware components
|
||||
while also actively helping to push forward the adoption and standardisation
|
||||
of the C.UTF-8 locale as a Unicode-aware replacement for the legacy C locale
|
||||
in the wider C/C++ ecosystem
|
||||
|
||||
After reviewing both PEPs, it became clear that they didn't actually conflict
|
||||
at a technical level, and the proposal in PEP 540 offered a superior option in
|
||||
|
@@ -197,9 +198,9 @@ reference behaviour for platforms where the notion of a "locale encoding"
|
|||
doesn't make sense (for example, embedded systems running MicroPython rather
|
||||
than the CPython reference interpreter).
|
||||
|
||||
Meanwhile, this PEP offered improved compatibility with other C/C++ components,
|
||||
and an approach more amenable to being backported to Python 3.6 by downstream
|
||||
redistributors.
|
||||
Meanwhile, this PEP offered improved compatibility with other locale-aware
|
||||
components, and an approach more amenable to being backported to Python 3.6
|
||||
by downstream redistributors.
|
||||
|
||||
As a result, this PEP was amended to refer to PEP 540 as a complementary
|
||||
solution that offered improved behaviour both when locale coercion triggered,
|
||||
|
@@ -323,7 +324,7 @@ proposed solution:
|
|||
even running with ``-Werror`` won't turn it into a runtime exception
|
||||
* any changes made will use *existing* configuration options
|
||||
|
||||
To minimize the negative impact on systems currently correctly configured to
|
||||
Minimizing the negative impact on systems currently correctly configured to
|
||||
use GB-18030 or another partially ASCII compatible universal encoding leads to
|
||||
an additional design principle:
|
||||
|
||||
|
@@ -345,7 +346,8 @@ run as a standalone command line application.
|
|||
|
||||
It further proposes to emit a warning on stderr if the legacy ``C`` locale
|
||||
is in effect at the point where the language runtime itself is initialized,
|
||||
and the PEP 540 UTF-8 encoding override is also disabled, in order to warn
|
||||
the explicit environmental flag to disable locale coercion is not set, and
|
||||
the PEP 540 UTF-8 encoding override is also disabled, in order to warn
|
||||
system and application integrators that they're running CPython in an
|
||||
unsupported configuration.
|
||||
|
||||
|
@@ -358,7 +360,7 @@ reconfigure the C locale before any locale dependent operations are executed
|
|||
in the process.
|
||||
|
||||
This means that it can change the locale settings not only for the CPython
|
||||
runtime, but also for any other C/C++ components running in the current
|
||||
runtime, but also for any other locale-aware components running in the current
|
||||
process (e.g. as part of extension modules), as well as in subprocesses that
|
||||
inherit their environment from the current process.
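The inheritance mechanism this paragraph relies on can be illustrated with a minimal sketch. It does not reproduce the coercion itself: the helper name ``child_sees`` and the ``PEP538_DEMO`` variable in the usage note are hypothetical, chosen so the example does not depend on which locales the host system provides. It only shows that a value placed in the parent's environment is visible to a subprocess that inherits it.

```python
import os
import subprocess
import sys


def child_sees(variable, value):
    """Run a child interpreter with `variable` set and report what it observes."""
    # Copy the current environment and add the setting, much as a parent
    # process that had adjusted its own environment would pass it on.
    env = dict(os.environ)
    env[variable] = value
    result = subprocess.run(
        [sys.executable, "-c",
         f"import os; print(os.environ.get({variable!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

For example, ``child_sees("PEP538_DEMO", "C.UTF-8")`` returns the value seen by the child. Settings such as ``LC_ALL`` or ``LANG`` propagate the same way, which is why coercion in the parent also affects locale-aware applications run in subprocesses.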
|
||||
|
||||
|
@@ -409,31 +411,37 @@ When this locale coercion is activated, the following warning will be
|
|||
printed on stderr, with the warning containing whichever locale was
|
||||
successfully configured::
|
||||
|
||||
Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another
|
||||
Python detected LC_CTYPE=C: LC_ALL & LANG coerced to C.UTF-8 (set another
|
||||
locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
|
||||
|
||||
When falling back to the ``UTF-8`` locale, the message would be slightly
|
||||
different::
|
||||
|
||||
Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale
|
||||
Python detected LC_CTYPE=C: LC_CTYPE coerced to UTF-8 (set another locale
|
||||
or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
|
||||
|
||||
In combination with PEP 540, this locale coercion will mean that the standard
|
||||
Python binary *and* locale aware C/C++ extensions should once again "just work"
|
||||
Python binary *and* locale-aware extensions should once again "just work"
|
||||
in the three main failure cases we're aware of (missing locale
|
||||
settings, SSH forwarding of unknown locales, and developers explicitly
|
||||
requesting ``LANG=C``), as long as the target platform provides at least one
|
||||
of the candidate UTF-8 based environments.
|
||||
|
||||
If ``PYTHONCOERCECLOCALE=0`` is set, or none of the candidate locales is
|
||||
successfully configured, then initialization will continue as usual in the C
|
||||
locale and the Unicode compatibility warning described in the next section will
|
||||
be emitted just as it would for any other application.
|
||||
If none of the candidate locales are successfully configured, then
|
||||
initialization will continue in the C locale and the Unicode compatibility
|
||||
warning described in the next section will be emitted just as it would for
|
||||
any other application.
|
||||
|
||||
If ``PYTHONCOERCECLOCALE=0`` is explicitly set, initialization will continue in
|
||||
the C locale and the Unicode compatibility warning described in the next
|
||||
section will be automatically suppressed.
|
||||
|
||||
The interpreter will always check for the ``PYTHONCOERCECLOCALE`` environment
|
||||
variable (even when running under the ``-E`` or ``-I`` switches), as the locale
|
||||
coercion check necessarily takes place before any command line argument
|
||||
processing.
|
||||
variable at startup (even when running under the ``-E`` or ``-I`` switches),
|
||||
as the locale coercion check necessarily takes place before any command line
|
||||
argument processing. For consistency, the runtime check to determine whether
|
||||
or not to suppress the locale compatibility warning will be similarly
|
||||
independent of these settings.
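The coercion decision described in this hunk can be sketched in Python as follows. This is an illustrative model only, not the CPython implementation (which lives in C). The names ``plan_coercion``, ``COERCION_TARGETS`` and ``available_locales`` are hypothetical. The variables set for ``C.UTF-8`` and ``UTF-8`` follow the two warning messages quoted above, while treating ``C.utf8`` like ``C.UTF-8`` is an assumption based on both being described as full locales.

```python
# Candidate target locales, tried in order, with the environment variables
# each one would be written to (taken from the warning messages above;
# the C.utf8 entry is an assumption).
COERCION_TARGETS = (
    ("C.UTF-8", ("LC_ALL", "LANG")),
    ("C.utf8", ("LC_ALL", "LANG")),
    ("UTF-8", ("LC_CTYPE",)),
)


def plan_coercion(environ, effective_ctype, available_locales):
    """Return (target, variables) to coerce to, or None to stay in the C locale.

    environ           -- mapping of environment variables
    effective_ctype   -- the LC_CTYPE locale currently in effect
    available_locales -- locale names the platform can configure (hypothetical input)
    """
    if effective_ctype != "C":
        return None  # a non-C locale is already configured
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return None  # coercion explicitly disabled by the user
    for target, variables in COERCION_TARGETS:
        if target in available_locales:
            return target, variables
    return None  # no candidate locale could be configured
```

Note that the check reads ``PYTHONCOERCECLOCALE`` directly from the environment mapping, mirroring the statement that it is consulted even under ``-E`` or ``-I``, since it runs before command line processing.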
|
||||
|
||||
|
||||
Changes to the runtime initialization process
|
||||
|
@@ -446,12 +454,12 @@ doing so would introduce inconsistencies in decoded text, even in the context
|
|||
of the standalone Python interpreter binary.
|
||||
|
||||
Accordingly, when ``Py_Initialize`` is called and CPython detects that the
|
||||
configured locale is still the default ``C`` locale *and* the ``PYTHONUTF8``
|
||||
feature from PEP 540 is disabled, the following warning will
|
||||
be issued::
|
||||
configured locale is still the default ``C`` locale, ``PYTHONCOERCECLOCALE=0``
|
||||
is not set, *and* the ``PYTHONUTF8`` feature from PEP 540 is disabled (or not
|
||||
implemented), the following warning will be issued::
|
||||
|
||||
Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
|
||||
encoding), which may cause Unicode compatibility problems. Using C.UTF-8
|
||||
encoding), which may cause Unicode compatibility problems. Using C.UTF-8,
|
||||
C.utf8, or UTF-8 (if available) as alternative Unicode-compatible
|
||||
locales is recommended.
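The conditions under which this warning is issued can be modelled with a short Python sketch. It follows the commit summary, which has ``PYTHONCOERCECLOCALE=0`` also disabling the library warning. The function name and its parameters are hypothetical, and the real check is performed in C during ``Py_Initialize``.

```python
def should_emit_c_locale_warning(environ, effective_ctype, utf8_mode_enabled):
    """Decide whether the legacy C locale warning would be emitted.

    environ           -- mapping of environment variables
    effective_ctype   -- the LC_CTYPE locale in effect at runtime initialization
    utf8_mode_enabled -- True if the PEP 540 PYTHONUTF8 feature is active
                         (always False where PEP 540 is not implemented)
    """
    if effective_ctype != "C":
        return False  # coercion succeeded or another locale was configured
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return False  # the user explicitly opted in to the legacy C locale
    return not utf8_mode_enabled
```

Under this model an embedding application that initializes the runtime in the ``C`` locale without setting ``PYTHONCOERCECLOCALE=0`` still sees the warning, while a user who deliberately disabled coercion does not.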
|
||||
|
||||
|
@@ -461,9 +469,10 @@ Instead, the warning informs both system and application integrators that
|
|||
they're running Python 3 in a configuration that we don't expect to work
|
||||
properly.
|
||||
|
||||
The second sentence providing recommendations would be conditionally compiled
|
||||
based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8`` on \*BSD
|
||||
systems.
|
||||
The second sentence providing recommendations may eventually be conditionally
|
||||
compiled based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8``
|
||||
on \*BSD systems), but the initial implementation will just use the common
|
||||
generic message shown above.
|
||||
|
||||
|
||||
New build-time configuration options
|
||||
|
@@ -490,6 +499,15 @@ Platform Support Changes
|
|||
|
||||
A new "Legacy C Locale" section will be added to PEP 11 that states:
|
||||
|
||||
* as of CPython 3.7, \*nix platforms are expected to provide at least one of
|
||||
``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
|
||||
``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
|
||||
Any Unicode related integration problems that occur only in that locale and
|
||||
cannot be reproduced in an appropriately configured non-ASCII locale will be
|
||||
closed as "won't fix".
|
||||
|
||||
If PEP 540 is also implemented, then this section would instead state:
|
||||
|
||||
* as of CPython 3.7, the legacy C locale is only supported when operating in
|
||||
"UTF-8" mode. Any Unicode handling issues that occur only in that locale
|
||||
and cannot be reproduced in an appropriately configured non-ASCII locale will
|
||||
|
@@ -497,9 +515,9 @@ A new "Legacy C Locale" section will be added to PEP 11 that states:
|
|||
* as of CPython 3.7, \*nix platforms are expected to provide at least one of
|
||||
``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
|
||||
``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
|
||||
Any Unicode related integration problems with C/C++ extensions that occur
|
||||
only in that locale and cannot be reproduced in an appropriately configured
|
||||
non-ASCII locale will be closed as "won't fix".
|
||||
Any Unicode related integration problems with other locale-aware components
|
||||
that occur only in that locale and cannot be reproduced in an appropriately
|
||||
configured non-ASCII locale will be closed as "won't fix".
|
||||
|
||||
|
||||
Rationale
|
||||
|
@@ -524,9 +542,9 @@ The challenge for CPython has been the fact that in addition to being used for
|
|||
network service development, it is also extensively used as an embedded
|
||||
scripting language in larger applications, and as a desktop application
|
||||
development language, where it is more important to be consistent with other
|
||||
C/C++ components sharing the same process, as well as with the user's desktop
|
||||
locale settings, than it is with the emergent conventions of modern network
|
||||
service development.
|
||||
locale-aware components sharing the same process, as well as with the user's
|
||||
desktop locale settings, than it is with the emergent conventions of modern
|
||||
network service development.
|
||||
|
||||
The core premise of this PEP is that for *all* of these use cases, the
|
||||
assumption of ASCII implied by the default "C" locale is the wrong choice,
|
||||
|
@@ -677,6 +695,10 @@ and would hence also make the last example display the desired output::
|
|||
| iconv -f GB18030 -t UTF-8
|
||||
ℙƴ☂ℌøἤ
|
||||
|
||||
However, such a change would have broader implications than the C locale
|
||||
specific changes currently proposed, so it is considered out of scope for this
|
||||
PEP.
|
||||
|
||||
|
||||
Dropping official support for ASCII based text handling in the legacy C locale
|
||||
------------------------------------------------------------------------------
|
||||
|
@@ -689,10 +711,13 @@ them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal
|
|||
C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540,
|
||||
Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).
|
||||
|
||||
While this PEP ensures that developers that need to do so can still opt-in to
|
||||
running their Python code in the legacy C locale, it also makes clear that we
|
||||
*don't* expect Python 3's Unicode handling to be reliable in that configuration,
|
||||
and the recommended alternative is to use a more appropriate locale setting.
|
||||
While this PEP ensures that developers that genuinely need to do so can still
|
||||
opt in to running their Python code in the legacy C locale (either by setting
|
||||
``PYTHONCOERCECLOCALE=0`` or running a custom build that sets
|
||||
``--without-c-locale-coercion``), it also makes it clear that we *don't*
|
||||
expect Python 3's Unicode handling to be completely reliable in that
|
||||
configuration, and the recommended alternative is to use a more appropriate
|
||||
locale setting (or PEP 540's UTF-8 mode, if that is available).
|
||||
|
||||
|
||||
Providing implicit locale coercion only when running standalone
|
||||
|
@@ -743,10 +768,10 @@ components in the current process, and components written in arbitrary
|
|||
languages in subprocesses.
|
||||
|
||||
Setting ``LC_ALL`` to ``C.UTF-8`` imposes a locale setting override on all
|
||||
C/C++ components in the current process and in any subprocesses that inherit
|
||||
the current environment. This is important to handle cases where the problem
|
||||
has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a system
|
||||
where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
|
||||
locale-aware components in the current process and in any subprocesses that
|
||||
inherit the current environment. This is important to handle cases where the
|
||||
problem has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a
|
||||
system where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
|
||||
configured to forward locale settings, and the user logs into a Linux server).
|
||||
|
||||
Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
|
||||
|
@@ -797,6 +822,9 @@ conjunction with the changes needed to also provide a suitable locale by
|
|||
default, or else specifically for platforms where such a locale is already
|
||||
consistently available.
|
||||
|
||||
At least the Fedora project is planning to pursue this approach for the
|
||||
upcoming Fedora 26 release [19_].
|
||||
|
||||
|
||||
Backporting to other 3.x releases
|
||||
---------------------------------
|
||||
|
@@ -909,6 +937,9 @@ References
|
|||
.. [18] GitHub branch diff for ``ncoghlan:pep538-coerce-c-locale``
|
||||
(https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-locale)
|
||||
|
||||
.. [19] Fedora 26 change proposal for locale coercion backport
|
||||
(https://fedoraproject.org/wiki/Changes/python3_c.utf-8_locale)
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
|
|