PEP 538 updates for python-dev review
* Tidy up the abstract and emphasise the equivalence between this proposal and
  long supported configuration settings
* Don't set LC_ALL (set LC_CTYPE instead)
* Add a rationale for that change
* Use GNU readline misbehaviour as a specific example of the benefits of
  reconfiguring the locale
* Clarify rationale for enabling the changes by default on all
  autotools-using platforms
* Mention the possibility of exposing a public API for use by embedding
  platforms
parent ae226965ea
commit 2f530ce0d1
374
pep-0538.txt
|
@ -36,42 +36,51 @@ However, it comes at the cost of making CPython's encoding assumptions diverge
|
|||
from those of other locale-aware components in the same process, as well as
|
||||
those of components running in subprocesses that share the same environment.
|
||||
|
||||
This can cause interoperability problems with some extension modules (such as
|
||||
GNU readline's command line history editing), as well as with components
|
||||
running in subprocesses (such as older Python runtimes).
|
||||
|
||||
It also requires non-trivial changes to the internals of how CPython itself
|
||||
works, rather than relying primarily on existing configuration settings that
|
||||
are supported by Python versions prior to Python 3.7.
|
||||
|
||||
Accordingly, this PEP proposes that independently of the UTF-8 mode proposed
|
||||
in PEP 540, the way the CPython implementation handles the default C locale be
|
||||
changed such that:
|
||||
changed to be roughly equivalent to the following existing configuration
|
||||
settings (supported since Python 3.1)::
|
||||
|
||||
* unless the new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``,
|
||||
the standalone CPython binary will automatically attempt to coerce the ``C``
|
||||
locale to the first available locale out of ``C.UTF-8``, ``C.utf8``, or
|
||||
``UTF-8``
|
||||
* ``Py_Initialize`` will be updated to treat these potential coercion target
|
||||
locales the same way it already treats the ``C`` locale: the default ``stdin``
|
||||
& ``stdout`` error handler for these locales will become ``surrogateescape``
|
||||
(this default can be overridden through ``PYTHONIOENCODING`` and
|
||||
``Py_SetStandardStreamEncoding`` as usual)
|
||||
* if ``Py_Initialize`` detects that the legacy ``C`` locale remains active
|
||||
(e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
|
||||
are available, or the runtime is embedded in an application other than the
|
||||
main CPython binary), and locale coercion is not explicitly disabled, it will
|
||||
emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
|
||||
text encoding may cause various Unicode compatibility issues
|
||||
    LC_CTYPE=C.UTF-8
    LANG=C.UTF-8
    PYTHONIOENCODING=utf-8:surrogateescape
|
||||
|
||||
The exact target locale for coercion will be chosen from a predefined list at
|
||||
runtime based on the actually available locales.
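
A minimal Python sketch of that runtime selection, assuming the candidate order
given in this PEP (``C.UTF-8``, ``C.utf8``, ``UTF-8``). The helper names
``first_available_locale`` and ``_can_set_ctype`` are hypothetical and only
illustrate the idea; the proposed change itself is made in C during interpreter
startup.

```python
import locale

# Candidate coercion targets, in the order listed in this PEP.
CANDIDATES = ("C.UTF-8", "C.utf8", "UTF-8")


def _can_set_ctype(name):
    """Return True if LC_CTYPE can be switched to ``name`` on this system."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore


def first_available_locale(candidates=CANDIDATES, probe=_can_set_ctype):
    """Return the first candidate accepted by ``probe``, or None."""
    for name in candidates:
        if probe(name):
            return name
    return None
```

Passing a custom ``probe`` makes the selection logic easy to exercise without
depending on which locales happen to be installed on the current machine.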
|
||||
|
||||
The reinterpreted locale settings will be written back to the environment so
|
||||
they're visible to other components in the same process and in subprocesses,
|
||||
but the changed ``PYTHONIOENCODING`` default will be made implicit in order to
|
||||
avoid causing compatibility problems with Python 2 subprocesses that don't
|
||||
provide the ``surrogateescape`` error handler.
|
||||
|
||||
The new legacy locale coercion behavior can be disabled either by setting
|
||||
``LC_ALL`` (which may still lead to a Unicode compatibility warning) or by
|
||||
setting the new ``PYTHONCOERCECLOCALE`` environment variable to ``0``.
|
||||
|
||||
With this change, any \*nix platform that does *not* offer at least one of the
|
||||
``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
|
||||
configuration would only be considered a fully supported platform for CPython
|
||||
3.7+ deployments when a suitable locale other than the default ``C`` locale is
|
||||
configured explicitly (e.g. ``en_AU.UTF-8``, ``zh_CN.gb18030``). If PEP 540 is
|
||||
accepted in addition to this PEP, then such platforms would also be supported
|
||||
when using the proposed ``PYTHONUTF8`` mode.
|
||||
accepted in addition to this PEP, then pure Python modules would also be
|
||||
supported when using the proposed ``PYTHONUTF8`` mode, but expectations for
|
||||
full Unicode compatibility in extension modules would continue to be limited
|
||||
to the platforms covered by this PEP.
|
||||
|
||||
Redistributors (such as Linux distributions) with a narrower target audience
|
||||
than the upstream CPython development team may also choose to opt in to this
|
||||
locale coercion behaviour for the Python 3.6.x series by applying the necessary
|
||||
changes as a downstream patch when first introducing Python 3.6.0.
|
||||
As it only reflects a change in default settings rather than a fundamentally
|
||||
new capability, redistributors (such as Linux distributions) with a narrower
|
||||
target audience than the upstream CPython development team may also choose to
|
||||
opt in to this locale coercion behaviour for the Python 3.6.x series by
|
||||
applying the necessary changes as a downstream patch.
|
||||
|
||||
|
||||
Background
|
||||
|
@ -85,19 +94,16 @@ system to do the conversion and then ensuring that the text encoding name
|
|||
reported by ``sys.getfilesystemencoding()`` matches the encoding used during
|
||||
this early bootstrapping process.
|
||||
|
||||
On Apple platforms (including both Mac OS X and iOS), this is straightforward,
|
||||
as Apple guarantees that these operations will always use UTF-8 to do the
|
||||
conversion.
|
||||
|
||||
On Windows, the limitations of the ``mbcs`` format used by default in these
|
||||
conversions proved sufficiently problematic that PEP 528 and PEP 529 were
|
||||
implemented to bypass the operating system supplied interfaces for binary data
|
||||
handling and force the use of UTF-8 instead.
|
||||
|
||||
On Android, many components, including CPython, already assume the use of UTF-8
|
||||
as the system encoding, regardless of the locale setting. However, this isn't
|
||||
the case for all components, and the discrepancy can cause problems in some
|
||||
situations (for example, when using the GNU readline module [16_]).
|
||||
On Mac OS X, iOS, and Android, many components, including CPython, already
|
||||
assume the use of UTF-8 as the system encoding, regardless of the locale
|
||||
setting. However, this isn't the case for all components, and the discrepancy
|
||||
can cause problems in some situations (for example, when using the GNU readline
|
||||
module [16_]).
|
||||
|
||||
On non-Apple and non-Android \*nix systems, these operations are handled using
|
||||
the C locale system in glibc, which has the following characteristics [4_]:
|
||||
|
@ -146,16 +152,17 @@ The simplest way to deal with this problem for currently released versions of
|
|||
CPython is to explicitly set a more sensible locale when launching the
|
||||
application. For example::
|
||||
|
||||
LC_ALL=C.UTF-8 LANG=C.UTF-8 python3 ...
|
||||
LANG=C.UTF-8 python3 ...
|
||||
|
||||
The ``C.UTF-8`` locale is a full locale definition that uses ``UTF-8`` for the
|
||||
``LC_CTYPE`` category, and the same settings as the ``C`` locale for all other
|
||||
categories (including ``LC_COLLATE``). It is offered by a number of Linux
|
||||
distributions (including Debian, Ubuntu, Fedora, Alpine and Android) as an
|
||||
alternative to the ASCII-based C locale.
|
||||
alternative to the ASCII-based C locale. Some other platforms (such as
|
||||
``HP-UX``) offer an equivalent locale definition under the name ``C.utf8``.
|
||||
|
||||
Mac OS X and other \*BSD systems have taken a different approach: instead of
|
||||
offering a ``C.UTF-8`` locale, offer a partial ``UTF-8`` locale that only
|
||||
offering a ``C.UTF-8`` locale, they offer a partial ``UTF-8`` locale that only
|
||||
defines the ``LC_CTYPE`` category. On such systems, the preferred
|
||||
environmental locale adjustment is to set ``LC_CTYPE=UTF-8`` rather than to set
|
||||
``LC_ALL`` or ``LANG``. [17_]
|
||||
|
@ -206,7 +213,9 @@ by downstream redistributors.
|
|||
|
||||
As a result, this PEP was amended to refer to PEP 540 as a complementary
|
||||
solution that offered improved behaviour when none of the standard UTF-8 based
|
||||
locales were available.
|
||||
locales were available, as well as extending the changes in the default
|
||||
settings to APIs that aren't currently independently configurable (such as
|
||||
the default encoding and error handler for ``open()``).
|
||||
|
||||
The availability of PEP 540 also meant that the ``LC_CTYPE=en_US.UTF-8`` legacy
|
||||
fallback was removed from the list of UTF-8 locales tried as a coercion target,
|
||||
|
@ -302,11 +311,10 @@ users to handle it through system configuration changes.
|
|||
While the glibc developers are working towards making the C.UTF-8 locale
|
||||
universally available for use by glibc based applications like CPython [6_],
|
||||
this unfortunately doesn't help on platforms that ship older versions of glibc
|
||||
without that feature, and also don't provide C.UTF-8 as an on-disk locale the
|
||||
way Debian and Fedora do. For these platforms, the mechanism proposed in
|
||||
PEP 540 at least allows CPython itself to behave sensibly, albeit without any
|
||||
common mechanism to get other C/C++ components that decode binary streams as
|
||||
text to do the same.
|
||||
without that feature, and also don't provide C.UTF-8 (or an equivalent) as an
|
||||
on-disk locale the way Debian and Fedora do. These platforms are considered
|
||||
out of scope for this PEP - see PEP 540 for further discussion of possible
|
||||
options for improving CPython's default behaviour in such environments.
|
||||
|
||||
|
||||
Design Principles
|
||||
|
@ -317,22 +325,25 @@ proposed solution:
|
|||
|
||||
* if a locale other than the default C locale is explicitly configured, we'll
|
||||
continue to respect it
|
||||
* if we're changing the locale setting without an explicit config option, we'll
|
||||
emit a warning on stderr that we're doing so rather than silently changing
|
||||
the process configuration. This will alert application and system integrators
|
||||
to the change, even if they don't closely follow the PEP process or Python
|
||||
release announcements. However, to minimize the chance of introducing new
|
||||
problems for end users, we'll do this *without* using the warnings system, so
|
||||
even running with ``-Werror`` won't turn it into a runtime exception
|
||||
* as far as is feasible, any changes made will use *existing* configuration
|
||||
options
|
||||
* Python's runtime behaviour in potential coercion target locales should be
|
||||
identical regardless of whether the locale was set explicitly in the
|
||||
environment or implicitly as a locale coercion target
|
||||
* for Python 3.7, if we're changing the locale setting without an explicit
|
||||
config option, we'll emit a warning on stderr that we're doing so rather
|
||||
than silently changing the process configuration. This will alert application
|
||||
and system integrators to the change, even if they don't closely follow the
|
||||
PEP process or Python release announcements. However, to minimize the chance
|
||||
of introducing new problems for end users, we'll do this *without* using the
|
||||
warnings system, so even running with ``-Werror`` won't turn it into a runtime
|
||||
exception.
|
||||
* for Python 3.7, any changed defaults will offer some form of explicit "off"
|
||||
switch at build time, runtime, or both
|
||||
|
||||
Minimizing the negative impact on systems currently correctly configured to
|
||||
use GB-18030 or another partially ASCII compatible universal encoding leads to
|
||||
an additional design principle:
|
||||
the following design principle:
|
||||
|
||||
* if a UTF-8 based Linux container is run on a host that is explicitly
|
||||
configured to use a non-UTF-8 encoding, and tries to exchange locally
|
||||
|
@ -341,6 +352,21 @@ an additional design principle:
|
|||
is concatenated or split solely at common ASCII compatible code points, but
|
||||
may otherwise emit nonsensical results.
|
||||
|
||||
Minimizing the negative impact on systems and programs correctly configured to
|
||||
use an explicit locale category like ``LC_TIME``, ``LC_MONETARY`` or
|
||||
``LC_NUMERIC`` while otherwise running in the legacy C locale gives the
|
||||
following design principles:
|
||||
|
||||
* don't make any environmental changes that would override explicit settings for
|
||||
locale categories other than ``LC_CTYPE`` (most notably: don't set ``LC_ALL``)
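
The reason for avoiding ``LC_ALL`` follows from the usual lookup order for
locale environment variables: ``LC_ALL`` takes precedence over the per-category
variable, which in turn takes precedence over ``LANG``. The sketch below models
that order in Python; ``effective_locale`` is a hypothetical helper, and falling
back to ``"C"`` when nothing is set is an assumption that matches common
platforms rather than something this PEP specifies.

```python
def effective_locale(environ, category):
    """Return the locale name that applies to ``category``.

    Lookup order: LC_ALL, then the category variable, then LANG.
    Empty values are treated the same as unset ones.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = environ.get(name)
        if value:
            return value
    return "C"  # assumed default when nothing is configured


# Coercion sets only LC_CTYPE and LANG, so an explicit LC_MONETARY survives.
coerced = {"LANG": "C.UTF-8", "LC_CTYPE": "C.UTF-8", "LC_MONETARY": "ja_JP.utf8"}
monetary_after_coercion = effective_locale(coerced, "LC_MONETARY")

# Setting LC_ALL instead would have overridden the explicit category.
overridden = dict(coerced, LC_ALL="C.UTF-8")
monetary_with_lc_all = effective_locale(overridden, "LC_MONETARY")
```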
|
||||
|
||||
Finally, maintaining compatibility with running arbitrary subprocesses in
|
||||
orchestration use cases leads to the following design principle:
|
||||
|
||||
* don't make any Python-specific environmental changes that might be
|
||||
incompatible with any still supported version of CPython (including
|
||||
CPython 2.7)
|
||||
|
||||
|
||||
Specification
|
||||
=============
|
||||
|
@ -393,13 +419,13 @@ that uses UTF-8 rather than ASCII as the default encoding.
|
|||
|
||||
Three such locales will be tried:
|
||||
|
||||
* ``C.UTF-8`` (available at least in Debian, Ubuntu, and Fedora 25+, and
|
||||
* ``C.UTF-8`` (available at least in Debian, Ubuntu, Alpine, and Fedora 25+, and
|
||||
expected to be available by default in a future version of glibc)
|
||||
* ``C.utf8`` (available at least in HP-UX)
|
||||
* ``UTF-8`` (available in at least some \*BSD variants)
|
||||
|
||||
For ``C.UTF-8`` and ``C.utf8``, the coercion will be implemented by actually
|
||||
setting the ``LANG`` and ``LC_ALL`` environment variables to the candidate
|
||||
For ``C.UTF-8`` and ``C.utf8``, the coercion will be implemented by setting
|
||||
both the ``LC_CTYPE`` and ``LANG`` environment variables to the candidate
|
||||
locale name, such that future calls to ``setlocale()`` will see them, as will
|
||||
other components looking for those settings (such as GUI development
|
||||
frameworks).
|
||||
|
@ -408,15 +434,16 @@ For the platforms where it is defined, ``UTF-8`` is a partial locale that only
|
|||
defines the ``LC_CTYPE`` category. Accordingly, only the ``LC_CTYPE``
|
||||
environment variable would be set when using this fallback option.
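
As an illustration only, the mapping from coercion target to environment
changes described above could be written as follows. The function name
``coercion_env_updates`` is hypothetical, and the sketch follows the revised
text (``LC_CTYPE`` plus ``LANG`` for the full locales, ``LC_CTYPE`` alone for
the partial ``UTF-8`` locale).

```python
# Full locale definitions: both LC_CTYPE and LANG are set.
FULL_LOCALES = ("C.UTF-8", "C.utf8")
# Partial locale that only defines the LC_CTYPE category.
PARTIAL_LOCALE = "UTF-8"


def coercion_env_updates(target):
    """Return the environment variables to set for a coercion target."""
    if target in FULL_LOCALES:
        return {"LC_CTYPE": target, "LANG": target}
    if target == PARTIAL_LOCALE:
        return {"LC_CTYPE": target}
    raise ValueError(f"not a known coercion target: {target!r}")
```

A caller could apply the result with ``os.environ.update(...)`` so that the
change is visible to subprocesses that inherit the environment.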
|
||||
|
||||
To adjust automatically to future changes in locale availability, these checks
|
||||
will be implemented at runtime on all platforms other than Windows, rather
|
||||
than attempting to determine which locales to try at compile time.
|
||||
To allow for better cross-platform binary portability and to adjust
|
||||
automatically to future changes in locale availability, these checks will be
|
||||
implemented at runtime on all platforms other than Windows, rather than
|
||||
attempting to determine which locales to try at compile time.
|
||||
|
||||
When this locale coercion is activated, the following warning will be
|
||||
printed on stderr, with the warning containing whichever locale was
|
||||
successfully configured::
|
||||
|
||||
Python detected LC_CTYPE=C: LC_ALL & LANG coerced to C.UTF-8 (set another
|
||||
Python detected LC_CTYPE=C: LC_CTYPE & LANG coerced to C.UTF-8 (set another
|
||||
locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
|
||||
|
||||
When falling back to the ``UTF-8`` locale, the message would be slightly
|
||||
|
@ -425,12 +452,12 @@ different::
|
|||
Python detected LC_CTYPE=C: LC_CTYPE coerced to UTF-8 (set another locale
|
||||
or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
|
||||
|
||||
This locale coercion will mean that the standard
|
||||
As long as the current platform provides at least one of the candidate UTF-8
|
||||
based environments, this locale coercion will mean that the standard
|
||||
Python binary *and* locale-aware extensions should once again "just work"
|
||||
in the three main failure cases we're aware of (missing locale
|
||||
settings, SSH forwarding of unknown locales, and developers explicitly
|
||||
requesting ``LANG=C``), as long as the target platform provides at least one
|
||||
of the candidate UTF-8 based environments.
|
||||
settings, SSH forwarding of unknown locales via ``LANG`` or ``LC_CTYPE``, and
|
||||
developers explicitly requesting ``LANG=C``).
|
||||
|
||||
The one case where failures may still occur is when ``stderr`` is specifically
|
||||
being checked for no output, which can be resolved either by configuring
|
||||
|
@ -438,7 +465,8 @@ a locale other than the C locale, or else by using a mechanism other than
|
|||
"there was no output on stderr" to check for subprocess errors (e.g. checking
|
||||
process return codes).
|
||||
|
||||
If none of the candidate locales are successfully configured, then
|
||||
If none of the candidate locales are successfully configured, or the ``LC_ALL``
|
||||
locale override is defined in the current process environment, then
|
||||
initialization will continue in the C locale and the Unicode compatibility
|
||||
warning described in the next section will be emitted just as it would for
|
||||
any other application.
|
||||
|
@ -547,9 +575,9 @@ A new "Legacy C Locale" section will be added to PEP 11 that states:
|
|||
* as of CPython 3.7, \*nix platforms are expected to provide at least one of
|
||||
``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
|
||||
``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
|
||||
Any Unicode related integration problems that occur only in that locale and
|
||||
cannot be reproduced in an appropriately configured non-ASCII locale will be
|
||||
closed as "won't fix".
|
||||
Any Unicode related integration problems that occur only in the legacy ``C``
|
||||
locale and cannot be reproduced in an appropriately configured non-ASCII
|
||||
locale will be closed as "won't fix".
|
||||
|
||||
|
||||
Rationale
|
||||
|
@ -719,6 +747,23 @@ coercion target locales will implicitly gain the encoding transparency behaviour
|
|||
currently enabled by default in the ``C`` locale.
|
||||
|
||||
|
||||
Avoiding setting PYTHONIOENCODING during UTF-8 locale coercion
|
||||
--------------------------------------------------------------
|
||||
|
||||
Rather than changing the default handling of the standard streams during
|
||||
interpreter initialization, earlier versions of this PEP proposed setting
|
||||
``PYTHONIOENCODING`` to ``utf-8:surrogateescape``. This turned out to create
|
||||
a significant compatibility problem: since the ``surrogateescape`` handler
|
||||
only exists in Python 3.1+, running Python 2.7 processes in subprocesses could
|
||||
potentially break in a confusing way with that configuration.
|
||||
|
||||
The current design means that earlier Python versions will instead retain their
|
||||
default ``strict`` error handling on the standard streams, while Python 3.7+
|
||||
will consistently use the more permissive ``surrogateescape`` handler even
|
||||
when these locales are explicitly configured (rather than being reached through
|
||||
locale coercion).
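
The following Python sketch restates that rule and shows why
``surrogateescape`` is the more permissive choice: bytes that are not valid
UTF-8 survive a decode and re-encode round trip. ``default_stream_errors`` and
``round_trips`` are hypothetical names used for illustration, and the sketch
ignores explicit overrides such as ``PYTHONIOENCODING``.

```python
# The legacy C locale plus the three coercion targets named in this PEP.
SURROGATEESCAPE_LOCALES = ("C", "C.UTF-8", "C.utf8", "UTF-8")


def default_stream_errors(ctype_locale):
    """Default stdin/stdout error handler for a given LC_CTYPE locale."""
    if ctype_locale in SURROGATEESCAPE_LOCALES:
        return "surrogateescape"
    return "strict"


def round_trips(raw):
    """Check that arbitrary bytes survive decoding and re-encoding."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape") == raw


# b"caf\xe9" is Latin-1 encoded text, so it is not valid UTF-8.
latin1_bytes = b"caf\xe9"
preserved = round_trips(latin1_bytes)
```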
|
||||
|
||||
|
||||
Dropping official support for ASCII based text handling in the legacy C locale
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -731,12 +776,13 @@ nominal C/C++ locale encoding and assume the use of either UTF-8 (PEP 540,
|
|||
Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).
|
||||
|
||||
While this PEP ensures that developers that genuinely need to do so can still
|
||||
opt-in to running their Python code in the legacy C locale (either by setting
|
||||
PYTHONCOERCECLOCALE=0 or running a custom build that sets
|
||||
opt-in to running their Python code in the legacy C locale (by setting
|
||||
``LC_ALL=C``, ``PYTHONCOERCECLOCALE=0``, or running a custom build that sets
|
||||
``--without-c-locale-coercion``), it also makes it clear that we *don't*
|
||||
expect Python 3's Unicode handling to be completely reliable in that
|
||||
configuration, and the recommended alternative is to use a more appropriate
|
||||
locale setting (or PEP 540's UTF-8 mode, if that is available).
|
||||
locale setting (potentially in combination with PEP 540's UTF-8 mode, if that
|
||||
is available).
|
||||
|
||||
|
||||
Providing implicit locale coercion only when running standalone
|
||||
|
@ -771,6 +817,22 @@ The ``Py_Initialize`` API then only gains an explicit warning (emitted on
|
|||
``stderr``) when it detects use of the ``C`` locale, and relies on the
|
||||
embedding application to specify something more reasonable.
|
||||
|
||||
That said, the reference implementation for this PEP adds most of the
|
||||
functionality to the shared library, with the CLI being updated to
|
||||
unconditionally call two new private APIs::
|
||||
|
||||
    if (_Py_LegacyLocaleDetected()) {
        _Py_CoerceLegacyLocale();
    }
|
||||
|
||||
These are similar to other "pre-configuration" APIs intended for embedding
|
||||
applications: they're designed to be called *before* ``Py_Initialize``, and
|
||||
hence change the way the interpreter gets initialized.
|
||||
|
||||
If these were made public (either as part of this PEP or in a subsequent RFE),
|
||||
then it would be straightforward for other embedding applications to recreate
|
||||
the same behaviour as is proposed for the CPython CLI.
|
||||
|
||||
|
||||
Allowing restoration of the legacy behaviour
|
||||
--------------------------------------------
|
||||
|
@ -797,16 +859,14 @@ whether or not the current locale configuration is likely to cause Unicode
|
|||
handling problems.
|
||||
|
||||
|
||||
Setting both LANG & LC_ALL for UTF-8 locale coercion
|
||||
----------------------------------------------------
|
||||
Setting both LC_CTYPE & LANG for UTF-8 locale coercion
|
||||
------------------------------------------------------
|
||||
|
||||
Python is often used as a glue language, integrating other C/C++ ABI compatible
|
||||
components in the current process, and components written in arbitrary
|
||||
languages in subprocesses.
|
||||
|
||||
Setting ``LC_ALL`` to ``C.UTF-8`` imposes a locale setting override on all
|
||||
locale-aware components in the current process and in any subprocesses that
|
||||
inherit the current environment. This is important to handle cases where the
|
||||
Setting ``LC_CTYPE`` to ``C.UTF-8`` is important to handle cases where the
|
||||
problem has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a
|
||||
system where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
|
||||
configured to forward locale settings, and the user logs into a Linux server).
|
||||
|
@ -819,19 +879,170 @@ switch to the UTF-8 based locale will be applied consistently across the current
|
|||
process and any subprocesses that inherit the current environment.
|
||||
|
||||
|
||||
Enabling C locale coercion and warnings on Mac OS X
|
||||
---------------------------------------------------
|
||||
Avoiding setting LC_ALL for UTF-8 locale coercion
|
||||
-------------------------------------------------
|
||||
|
||||
On Mac OS X, CPython already assumes the use of UTF-8 for system interfaces,
|
||||
and we expect most other locale-aware components to do the same.
|
||||
Earlier versions of this PEP proposed setting the ``LC_ALL`` locale override,
|
||||
rather than just setting ``LC_CTYPE`` and ``LANG``.
|
||||
|
||||
However, Mac OS X is also frequently used as a development and testing platform
|
||||
for Python software intended for deployment to other \*nix environments (such as
|
||||
Linux).
|
||||
This was changed after it was determined that just setting ``LC_CTYPE`` and
|
||||
``LANG`` should be sufficient to handle all the scenarios the PEP aims to
|
||||
cover, as it avoids causing any problems in cases like the following::
|
||||
|
||||
Accordingly, this PEP enables the locale coercion and warning features on
|
||||
Mac OS X in the name of cross platform consistency, even though they're expected
|
||||
to be almost entirely redundant on Mac OS X itself.
|
||||
    $ LANG=C LC_MONETARY=ja_JP.utf8 ./python -c \
        "from locale import setlocale, LC_ALL, currency; setlocale(LC_ALL, ''); print(currency(1e6))"
    Python detected LC_CTYPE=C: LC_CTYPE & LANG coerced to C.UTF-8 (set another
    locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behavior).
    ¥1000000
|
||||
|
||||
|
||||
Skipping locale coercion if LC_ALL is set in the current environment
|
||||
--------------------------------------------------------------------
|
||||
|
||||
With locale coercion now only setting ``LC_CTYPE`` and ``LANG``, it will have
|
||||
no effect if ``LC_ALL`` is also set. To avoid emitting a spurious locale
|
||||
coercion notice in that case, coercion is instead skipped entirely.
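
Taken together with the ``PYTHONCOERCECLOCALE=0`` opt-out, the decision of
whether to attempt coercion can be sketched in Python as below. This is a
simplified model rather than the reference implementation: ``coercion_applies``
is a hypothetical name, and ``ctype_locale`` stands for the ``LC_CTYPE`` locale
the process ended up with after reading its environment.

```python
def coercion_applies(environ, ctype_locale):
    """Decide whether locale coercion should be attempted.

    ``environ`` is a mapping of environment variables and ``ctype_locale``
    is the currently active LC_CTYPE locale name.
    """
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return False  # explicitly disabled by the user
    if environ.get("LC_ALL"):
        return False  # LC_ALL would override LC_CTYPE and LANG anyway
    return ctype_locale == "C"  # only the legacy C locale is coerced
```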
|
||||
|
||||
|
||||
Considering locale coercion independently of "UTF-8 mode"
|
||||
---------------------------------------------------------
|
||||
|
||||
With both this PEP's locale coercion and PEP 540's UTF-8 mode under
|
||||
consideration for Python 3.7, it makes sense to ask whether or not we can
|
||||
limit ourselves to only doing one or the other, rather than making both
|
||||
changes.
|
||||
|
||||
The UTF-8 mode proposed in PEP 540 has two major limitations that make it a
|
||||
potential complement to this PEP rather than a potential replacement.
|
||||
|
||||
First, unlike this PEP, PEP 540's UTF-8 mode makes it possible to change default
|
||||
behaviours that are not currently configurable at all. While that's exactly
|
||||
what makes the proposal interesting, it's also what makes it an entirely
|
||||
unproven approach. By contrast, the approach proposed in this PEP builds
|
||||
directly atop existing configuration settings for the C locale system (
|
||||
``LC_CTYPE``, ``LANG``) and Python's standard streams (``PYTHONIOENCODING``)
|
||||
that have already been in use for years to handle the kinds of compatibility
|
||||
problems discussed in this PEP.
|
||||
|
||||
Secondly, one of the things we know based on that experience is that the
|
||||
proposed locale coercion can resolve problems not only in CPython itself,
|
||||
but also in extension modules that interact with the standard streams, like
|
||||
GNU readline. As an example, consider the following interactive session
|
||||
from a PEP 538 enabled CPython build, where each line after the first is
|
||||
executed by doing "up-arrow, left-arrow x4, delete, enter"::
|
||||
|
||||
$ LANG=C ./python
|
||||
Python detected LC_CTYPE=C: LC_CTYPE & LANG coerced to C.UTF-8 (set
|
||||
another locale or PYTHONCOERCECLOCALE=0 to disable this locale
|
||||
coercion behavior).
|
||||
Python 3.7.0a0 (heads/pep538-coerce-c-locale:188e780, May 7 2017, 00:21:13)
|
||||
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
|
||||
Type "help", "copyright", "credits" or "license" for more information.
|
||||
>>> print("ℙƴ☂ℌøἤ")
|
||||
ℙƴ☂ℌøἤ
|
||||
>>> print("ℙƴ☂ℌἤ")
|
||||
ℙƴ☂ℌἤ
|
||||
>>> print("ℙƴ☂ἤ")
|
||||
ℙƴ☂ἤ
|
||||
>>> print("ℙƴἤ")
|
||||
ℙƴἤ
|
||||
>>> print("ℙἤ")
|
||||
ℙἤ
|
||||
>>> print("ἤ")
|
||||
ἤ
|
||||
>>>
|
||||
|
||||
This is exactly what we'd expect from a well-behaved command history editor.
|
||||
|
||||
By contrast, the following is what currently happens on an older release if
|
||||
you only change the Python level stream encoding settings without updating the
|
||||
locale settings::
|
||||
|
||||
    $ LANG=C PYTHONIOENCODING=utf-8:surrogateescape python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌ<E29882>")
      File "<stdin>", line 0

        ^
    SyntaxError: 'utf-8' codec can't decode bytes in position 20-21:
    invalid continuation byte
|
||||
|
||||
That particular misbehaviour is coming from GNU readline, *not* CPython -
|
||||
because the command history editing wasn't UTF-8 aware, it corrupted the history
|
||||
buffer and fed such nonsense to stdin that even the surrogateescape error
|
||||
handler was bypassed. While PEP 540's UTF-8 mode could technically be updated
|
||||
to also reconfigure readline, that's just *one* extension module that might
|
||||
be interacting with the standard streams without going through the CPython
|
||||
C API, and any change made by CPython would only apply when readline is running
|
||||
directly as part of Python 3.7 rather than in a separate subprocess.
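
The underlying failure mode can be reproduced in plain Python, without readline,
by editing the UTF-8 encoded line one *byte* at a time instead of one code point
at a time. This is only a sketch of the general problem, not a model of
readline's actual editing logic; ``delete_byte`` and ``delete_code_point`` are
hypothetical helpers.

```python
def delete_byte(data, index):
    """Remove a single byte, as a byte-oriented editor would."""
    return data[:index] + data[index + 1:]


def delete_code_point(text, index):
    """Remove a single code point, as a UTF-8 aware editor would."""
    return text[:index] + text[index + 1:]


line = "ℙƴ☂ℌøἤ"
raw = line.encode("utf-8")

# "ø" occupies two bytes in UTF-8; drop only the second of them.
o_start = raw.index("ø".encode("utf-8"))
damaged = delete_byte(raw, o_start + 1)

try:
    damaged.decode("utf-8")
    byte_edit_is_valid = True
except UnicodeDecodeError:
    byte_edit_is_valid = False  # the leftover lead byte is now orphaned

# Deleting the whole code point leaves well-formed text behind.
code_point_edit = delete_code_point(line, line.index("ø"))
```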
|
||||
|
||||
However, if we actually change the configured locale, GNU readline starts
|
||||
behaving itself, without requiring any changes to the embedding application::
|
||||
|
||||
$ LANG=C.UTF-8 python3
|
||||
Python 3.5.3 (default, Apr 24 2017, 13:32:13)
|
||||
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
|
||||
Type "help", "copyright", "credits" or "license" for more information.
|
||||
>>> print("ℙƴ☂ℌøἤ")
|
||||
ℙƴ☂ℌøἤ
|
||||
>>> print("ℙƴ☂ℌἤ")
|
||||
ℙƴ☂ℌἤ
|
||||
>>> print("ℙƴ☂ἤ")
|
||||
ℙƴ☂ἤ
|
||||
>>> print("ℙƴἤ")
|
||||
ℙƴἤ
|
||||
>>> print("ℙἤ")
|
||||
ℙἤ
|
||||
>>> print("ἤ")
|
||||
ἤ
|
||||
>>>
|
||||
$ LC_CTYPE=C.UTF-8 python3
|
||||
Python 3.5.3 (default, Apr 24 2017, 13:32:13)
|
||||
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
|
||||
Type "help", "copyright", "credits" or "license" for more information.
|
||||
>>> print("ℙƴ☂ℌøἤ")
|
||||
ℙƴ☂ℌøἤ
|
||||
>>> print("ℙƴ☂ℌἤ")
|
||||
ℙƴ☂ℌἤ
|
||||
>>> print("ℙƴ☂ἤ")
|
||||
ℙƴ☂ἤ
|
||||
>>> print("ℙƴἤ")
|
||||
ℙƴἤ
|
||||
>>> print("ℙἤ")
|
||||
ℙἤ
|
||||
>>> print("ἤ")
|
||||
ἤ
|
||||
>>>
|
||||
|
||||
|
||||
Enabling C locale coercion and warnings on Mac OS X, iOS and Android
|
||||
--------------------------------------------------------------------
|
||||
|
||||
On Mac OS X, iOS, and Android, CPython already assumes the use of UTF-8 for
|
||||
system interfaces, and we expect most other locale-aware components to do the
|
||||
same.
|
||||
|
||||
Accordingly, this PEP originally proposed to disable locale coercion and
|
||||
warnings at build time for these platforms, on the assumption that it would
|
||||
be entirely redundant.
|
||||
|
||||
However, that assumption turned out to be incorrect, as subsequent
|
||||
investigations showed that if you explicitly configure ``LANG=C`` on
|
||||
these platforms, extension modules like GNU readline will misbehave in much the
|
||||
same way as they do on other \*nix systems. [21_]
|
||||
|
||||
In addition, Mac OS X is also frequently used as a development and testing
|
||||
platform for Python software intended for deployment to other \*nix environments
|
||||
(such as Linux or Android), and Linux is similarly often used as a development
|
||||
and testing platform for mobile and Mac OS X applications.
|
||||
|
||||
Accordingly, this PEP enables the locale coercion and warning features by
|
||||
default on all platforms that use CPython's ``autotools`` based build toolchain
|
||||
(i.e. everywhere other than Windows).
|
||||
|
||||
|
||||
Implementation
|
||||
|
@ -983,6 +1194,9 @@ References
|
|||
.. [20] GitHub pull request for the reference implementation
|
||||
(https://github.com/python/cpython/pull/659)
|
||||
|
||||
.. [21] GNU readline misbehaviour on Mac OS X with ``LANG=C``
|
||||
(https://mail.python.org/pipermail/python-dev/2017-May/147897.html)
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
|
|