PEP: 538
Title: Coercing the legacy C locale to C.UTF-8
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Requires: 540
Created: 28-Dec-2016
Python-Version: 3.7
Post-History: 03-Jan-2017 (linux-sig),
              07-Jan-2017 (python-ideas)


Abstract
========

An ongoing challenge with Python 3 on \*nix systems is the conflict between
needing to use the configured locale encoding by default for consistency with
other C/C++ components in the same process and those invoked in subprocesses,
and the fact that the standard C locale (as defined in POSIX:2001) typically
implies a default text encoding of ASCII, which is entirely inadequate for the
development of networked services and client applications in a multilingual
world.

PEP 540 proposes a change to CPython's handling of the legacy C locale such
that CPython will assume the use of UTF-8 in such environments, rather than
persisting with the demonstrably problematic assumption of ASCII as an
appropriate encoding for communicating with operating system interfaces.

However, that change comes at the cost of making CPython's encoding
assumptions diverge from those of other C and C++ components in the same
process, as well as those of components running in subprocesses that share the
same environment.

Accordingly, this PEP further proposes that the way the CPython implementation
handles the default C locale be changed such that:

* the standalone CPython binary will automatically attempt to coerce the ``C``
  locale to ``C.UTF-8``, ``C.utf8``, or ``UTF-8`` (depending on the system),
  unless the new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``
* if the subsequent runtime initialization process detects that the legacy
  ``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
  are available, locale coercion is disabled, or the runtime is embedded in an
  application other than the main CPython binary), and the ``PYTHONUTF8``
  feature defined in PEP 540 is also disabled, it will emit a warning on
  stderr that use of the legacy ``C`` locale's default ASCII text encoding
  may cause various Unicode compatibility issues

With this change, any \*nix platform that does *not* offer at least one of the
``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
configuration would only be considered a fully supported platform for CPython
3.7+ deployments when either the new ``PYTHONUTF8`` feature defined in PEP 540
is used, or else a suitable locale other than the default ``C`` locale is
configured explicitly (e.g. ``zh_CN.gb18030``).

Redistributors (such as Linux distributions) with a narrower target audience
than the upstream CPython development team may also choose to opt in to this
behaviour for the Python 3.6.x series by applying the necessary changes as a
downstream patch when first introducing Python 3.6.0.


Background
==========

While the CPython interpreter is starting up, it may need to convert from
the ``char *`` format to the ``wchar_t *`` format, or from one of those formats
to ``PyUnicodeObject *``, in a way that's consistent with the locale settings
of the overall system. It handles these cases by relying on the operating
system to do the conversion and then ensuring that the text encoding name
reported by ``sys.getfilesystemencoding()`` matches the encoding used during
this early bootstrapping process.

On Apple platforms (including both Mac OS X and iOS), this is straightforward,
as Apple guarantees that these operations will always use UTF-8 to do the
conversion.

On Windows, the limitations of the ``mbcs`` format used by default in these
conversions proved sufficiently problematic that PEP 528 and PEP 529 were
implemented to bypass the operating system supplied interfaces for binary data
handling and force the use of UTF-8 instead.

On Android, many components, including CPython, already assume the use of UTF-8
as the system encoding, regardless of the locale setting. However, this isn't
the case for all components, and the discrepancy can cause problems in some
situations (for example, when using the GNU readline module [16_]).

On non-Apple and non-Android \*nix systems, these operations are handled using
the C locale system in glibc, which has the following characteristics [4_]:

* by default, all processes start in the ``C`` locale, which uses ``ASCII``
  for these conversions. This is almost never what anyone doing multilingual
  text processing actually wants (including CPython and C/C++ GUI frameworks).
* calling ``setlocale(LC_ALL, "")`` reconfigures the active locale based on
  the locale categories configured in the current process environment
* if the locale requested by the current environment is unknown, or no specific
  locale is configured, then the default ``C`` locale will remain active

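The characteristics above can be modelled with a short Python sketch. It is
a simplified illustration rather than a description of glibc itself: the
function names and the ``installed_locales`` parameter are hypothetical, and
the lookup order (``LC_ALL``, then ``LC_CTYPE``, then ``LANG``) is the standard
POSIX precedence rather than something stated in this PEP.

```python
def requested_ctype_locale(env):
    """Return the LC_CTYPE locale name requested by an environment mapping."""
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(variable)
        if value:  # unset and empty variables are both skipped
            return value
    return "C"  # nothing configured: the default C locale stays active


def effective_ctype_locale(env, installed_locales):
    """Model the fallback: unknown or missing requests leave "C" active.

    installed_locales is a hypothetical stand-in for the set of locale
    names that the system's setlocale() call would accept.
    """
    requested = requested_ctype_locale(env)
    if requested in ("C", "POSIX") or requested not in installed_locales:
        return "C"
    return requested


# A container with no locale configured, and an SSH session that forwarded
# a locale the server does not have, both end up in the C locale.
print(effective_ctype_locale({}, {"en_AU.UTF-8"}))
print(effective_ctype_locale({"LANG": "xx_YY.UTF-8"}, {"en_AU.UTF-8"}))
```

Both calls report ``C``, which is the situation the rest of this PEP aims to
improve.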
The specific locale category that covers the APIs that CPython depends on is
``LC_CTYPE``, which applies to "classification and conversion of characters,
and to multibyte and wide characters" [5_]. Accordingly, CPython includes the
following key calls to ``setlocale``:

* in the main ``python`` binary, CPython calls ``setlocale(LC_ALL, "")`` to
  configure the entire C locale subsystem according to the process environment.
  It does this prior to making any calls into the shared CPython library
* in ``Py_Initialize``, CPython calls ``setlocale(LC_CTYPE, "")``, such that
  the configured locale settings for that category *always* match those set in
  the environment. It does this unconditionally, and it *doesn't* revert the
  process state change in ``Py_Finalize``

(This summary of the locale handling omits several technical details related
to exactly where and when the text encoding declared as part of the locale
settings is used - see PEP 540 for further discussion, as these particular
details matter more when decoupling CPython from the declared C locale than
they do when overriding the locale with one based on UTF-8)

These calls are usually sufficient to provide sensible behaviour, but they can
still fail in the following cases:

* SSH environment forwarding means that SSH clients may sometimes forward
  client locale settings to servers that don't have that locale installed. This
  leads to CPython running in the default ASCII-based C locale
* some process environments (such as Linux containers) may not have any
  explicit locale configured at all. As with unknown locales, this leads to
  CPython running in the default ASCII-based C locale

The simplest way to deal with this problem for currently released versions of
CPython is to explicitly set a more sensible locale when launching the
application. For example::

    LC_ALL=C.UTF-8 LANG=C.UTF-8 python3 ...

The ``C.UTF-8`` locale is a full locale definition that uses ``UTF-8`` for the
``LC_CTYPE`` category, and the same settings as the ``C`` locale for all other
categories (including ``LC_COLLATE``). It is offered by a number of Linux
distributions (including Debian, Ubuntu, Fedora, Alpine and Android) as an
alternative to the ASCII-based C locale.

Mac OS X and other \*BSD systems have taken a different approach: instead of
offering a ``C.UTF-8`` locale, they offer a partial ``UTF-8`` locale that
only defines the ``LC_CTYPE`` category. On such systems, the preferred
environmental locale adjustment is to set ``LC_CTYPE=UTF-8`` rather than to set
``LC_ALL`` or ``LANG``. [17_]

In the specific case of Docker containers and similar technologies, the
appropriate locale setting can be specified directly in the container image
definition.

Another common failure case is developers specifying ``LANG=C`` in order to
see otherwise translated user interface messages in English, rather than the
more narrowly scoped ``LC_MESSAGES=C``.


Relationship with other PEPs
============================

This PEP shares a common problem statement with PEP 540 (improving Python 3's
behaviour in the default C locale), but diverges markedly in the proposed
solution:

* PEP 540 proposes to entirely decouple CPython's default text encoding from
  the C locale system in that case, allowing text handling inconsistencies to
  arise between CPython and other C/C++ components running in the same process
  and in subprocesses. This approach aims to make CPython behave less like a
  locale-aware C/C++ application, and more like C/C++ independent language
  runtimes like the JVM, .NET CLR, Go, Node.js, and Rust
* this PEP proposes to override the legacy C locale with a more recently
  defined locale that uses UTF-8 as its default text encoding. This means that
  the text encoding override will apply not only to CPython, but also to any
  locale aware extension modules loaded into the current process, as well as to
  locale aware C/C++ applications invoked in subprocesses that inherit their
  environment from the parent process. This approach aims to retain CPython's
  traditional strong support for integration with other components written
  in C and C++, while actively helping to push forward the adoption and
  standardisation of the C.UTF-8 locale as a Unicode-aware replacement for
  the legacy C locale in the wider Linux ecosystem

After reviewing both PEPs, it became clear that they didn't actually conflict
at a technical level, and the proposal in PEP 540 offered a superior option in
cases where no suitable locale was available, as well as offering a better
reference behaviour for platforms where the notion of a "locale encoding"
doesn't make sense (for example, embedded systems running MicroPython rather
than the CPython reference interpreter).

As a result, this PEP was amended to specify PEP 540 as a pre-requisite, with
the aim being to coerce other C/C++ components into behaving consistently with
CPython's assumption of UTF-8 as the system encoding, rather than CPython itself
relying on that setting change.

As a result of that change, the ``LC_CTYPE=en_US.UTF-8`` legacy fallback was
removed from the list of UTF-8 locales tried as a coercion target, with CPython
instead relying solely on the C locale text encoding bypass in such cases.


Motivation
==========

While Linux container technologies like Docker, Kubernetes, and OpenShift are
best known for their use in web service development, the related container
formats and execution models are also being adopted for Linux command line
application development. Technologies like Gnome Flatpak [7_] and
Ubuntu Snappy [8_] further aim to bring these same techniques to Linux GUI
application development.

When using Python 3 for application development in
these contexts, it isn't uncommon to see text encoding related errors akin to
the following::

    $ docker run --rm fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
    Unable to decode the command from the command line:
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed
    $ docker run --rm ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
    Unable to decode the command from the command line:
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed

Even though the same command is likely to work fine when run locally::

    $ python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ

The source of the problem can be seen by instead running the ``locale`` command
in the three environments::

    $ locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=en_AU.UTF-8
    LC_CTYPE="en_AU.UTF-8"
    LC_ALL=
    $ docker run --rm fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LC_CTYPE="POSIX"
    LC_ALL=
    $ docker run --rm ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LANGUAGE=
    LC_CTYPE="POSIX"
    LC_ALL=

In this particular example, we can see that the host system locale is set to
"en_AU.UTF-8", so CPython uses UTF-8 as the default text encoding. By contrast,
the base Docker images for Fedora and Debian don't have any specific locale
set, so they use the POSIX locale by default, which is an alias for the
ASCII-based default C locale.

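The error message seen in the containers can be reproduced in pure Python.
The following is a minimal sketch that assumes the command line is decoded as
ASCII with the ``surrogateescape`` error handler when the C locale is active;
the helper name ``strict_utf8_error`` is purely illustrative.

```python
# The command as raw bytes, as a UTF-8 terminal would pass it to the process.
raw_command = 'print("ℙƴ☂ℌøἤ")'.encode("utf-8")

# Assumed model of startup in the C locale: decode as ASCII, escaping each
# undecodable byte as a lone surrogate in the range U+DC80..U+DCFF.
decoded = raw_command.decode("ascii", errors="surrogateescape")


def strict_utf8_error(text):
    """Return the UnicodeEncodeError raised by strict UTF-8 encoding, or None."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        return exc
    return None


# The first escaped byte is 0xE2, the lead byte of the UTF-8 encoding of "ℙ".
error = strict_utf8_error(decoded)
print(repr(decoded[7]), error.start, error.reason)
```

The lone surrogate at index 7 (immediately after the seven ASCII characters
of ``print("``) is what the strict UTF-8 encoder rejects, which corresponds
to the position reported in the error messages above.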
The simplest way to get Python 3 (regardless of the exact version) to behave
sensibly in Fedora and Debian based containers is to run it in the ``C.UTF-8``
locale that both distros provide::

    $ docker run --rm -e LANG=C.UTF-8 fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ
    $ docker run --rm -e LANG=C.UTF-8 ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ

    $ docker run --rm -e LANG=C.UTF-8 fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=C.UTF-8
    LC_CTYPE="C.UTF-8"
    LC_ALL=
    $ docker run --rm -e LANG=C.UTF-8 ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=C.UTF-8
    LANGUAGE=
    LC_CTYPE="C.UTF-8"
    LC_ALL=

The Alpine Linux based Python images provided by Docker, Inc, already use the
C.UTF-8 locale by default::

    $ docker run --rm python:3 python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ
    $ docker run --rm python:3 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=C.UTF-8
    LANGUAGE=
    LC_CTYPE="C.UTF-8"
    LC_ALL=

Similarly, for custom container images (i.e. those adding additional content on
top of a base distro image), a more suitable locale can be set in the image
definition so everything just works by default. However, it would provide a much
nicer and more consistent user experience if CPython were able to just deal
with this problem automatically rather than relying on redistributors or end
users to handle it through system configuration changes.

While the glibc developers are working towards making the C.UTF-8 locale
universally available for use by glibc based applications like CPython [6_],
this unfortunately doesn't help on platforms that ship older versions of glibc
without that feature, and also don't provide C.UTF-8 as an on-disk locale the
way Debian and Fedora do. For these platforms, the mechanism proposed in
PEP 540 at least allows CPython itself to behave sensibly, albeit without any
mechanism to get other C/C++ components that decode binary streams as text to
do the same.


Design Principles
=================

The above motivation leads to the following core design principles for the
proposed solution:

* if a locale other than the default C locale is explicitly configured, we'll
  continue to respect it
* if we're changing the locale setting without an explicit config option, we'll
  emit a warning on stderr that we're doing so rather than silently changing
  the process configuration. This will alert application and system integrators
  to the change, even if they don't closely follow the PEP process or Python
  release announcements. However, to minimize the chance of introducing new
  problems for end users, we'll do this *without* using the warnings system, so
  even running with ``-Werror`` won't turn it into a runtime exception

Minimizing the negative impact on systems currently correctly configured to
use GB-18030 or another partially ASCII compatible universal encoding leads to
an additional design principle:

* if a UTF-8 based Linux container is run on a host that is explicitly
  configured to use a non-UTF-8 encoding, and tries to exchange locally
  encoded data with that host rather than exchanging explicitly UTF-8 encoded
  data, CPython will endeavour to correctly round-trip host provided data that
  is concatenated or split solely at common ASCII compatible code points, but
  may otherwise emit nonsensical results.


Specification
=============

To better handle the cases where CPython would otherwise end up attempting
to operate in the ``C`` locale, this PEP proposes that CPython automatically
attempt to coerce the legacy ``C`` locale to a UTF-8 based locale when it is
run as a standalone command line application.

It further proposes to emit a warning on stderr if the legacy ``C`` locale
is in effect at the point where the language runtime itself is initialized,
and the PEP 540 UTF-8 encoding override is also disabled, in order to warn
system and application integrators that they're running CPython in an
unsupported configuration.


Legacy C locale coercion in the standalone Python interpreter binary
--------------------------------------------------------------------

When run as a standalone application, CPython has the opportunity to
reconfigure the C locale before any locale dependent operations are executed
in the process.

This means that it can change the locale settings not only for the CPython
runtime, but also for any other C/C++ components running in the current
process (e.g. as part of extension modules), as well as in subprocesses that
inherit their environment from the current process.

After calling ``setlocale(LC_ALL, "")`` to initialize the locale settings in
the current process, the main interpreter binary will be updated to include
the following call::

    const char *ctype_loc = setlocale(LC_CTYPE, NULL);

This cryptic invocation is the API that C provides to query the current locale
setting without changing it. Given that query, it is possible to check for
exactly the ``C`` locale with ``strcmp``::

    ctype_loc != NULL && strcmp(ctype_loc, "C") == 0 /* true only in the C locale */

This call also returns ``"C"`` when either no particular locale is set, or the
nominal locale is set to an alias for the ``C`` locale (such as ``POSIX``).

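The same query can be made from Python through the standard ``locale``
module, where passing ``None`` plays the role of the ``NULL`` argument in C.
The following is only an illustrative sketch of the check (the function
names are hypothetical); the proposal itself concerns the C code in the
interpreter binary.

```python
import locale


def active_ctype_locale():
    """Query the active LC_CTYPE locale name without changing it."""
    # Passing None queries rather than sets, mirroring setlocale(LC_CTYPE, NULL).
    return locale.setlocale(locale.LC_CTYPE, None)


def is_legacy_c_locale(ctype_loc):
    """Mirror the C check: a non-NULL result that is exactly equal to "C"."""
    return ctype_loc is not None and ctype_loc == "C"


current = active_ctype_locale()
print(current, is_legacy_c_locale(current))
```

The printed value depends on the environment the interpreter was started in,
so no particular output is assumed here.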
Given this information, CPython can then attempt to coerce the locale to one
that uses UTF-8 rather than ASCII as the default encoding.

Three such locales will be tried:

* ``C.UTF-8`` (available at least in Debian, Ubuntu, and Fedora 25+, and
  expected to be available by default in a future version of glibc)
* ``C.utf8`` (available at least in HP-UX)
* ``UTF-8`` (available in at least some \*BSD variants)

For ``C.UTF-8`` and ``C.utf8``, the coercion will be implemented by actually
setting the ``LANG`` and ``LC_ALL`` environment variables to the candidate
locale name, such that future calls to ``setlocale()`` will see them, as will
other components looking for those settings (such as GUI development
frameworks).

For the platforms where it is defined, ``UTF-8`` is a partial locale that only
defines the ``LC_CTYPE`` category. Accordingly, only the ``LC_CTYPE``
environment variable would be set when using this fallback option.

To adjust automatically to future changes in locale availability, these checks
will be implemented at runtime on all platforms other than Mac OS X and Windows,
rather than attempting to determine which locales to try at compile time.

If the locale settings are changed successfully, and the ``PYTHONIOENCODING``
environment variable is currently unset, then it will be forced to
``PYTHONIOENCODING=utf-8:surrogateescape``.

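The selection logic described so far can be summarised in a short Python
sketch. This is a model under stated assumptions rather than the proposed
implementation, which would live in the C code of the interpreter binary and
modify the real process environment: ``plan_locale_coercion``,
``CANDIDATES`` and the ``is_available`` callback are hypothetical names, and
the function merely returns the environment updates instead of applying them.

```python
# Candidate locales in the order they are tried, paired with the environment
# variables that would be set if the candidate is usable.
CANDIDATES = (
    ("C.UTF-8", ("LANG", "LC_ALL")),
    ("C.utf8", ("LANG", "LC_ALL")),
    ("UTF-8", ("LC_CTYPE",)),  # partial locale: only LC_CTYPE is defined
)


def plan_locale_coercion(ctype_loc, env, is_available):
    """Return the environment updates coercion would apply ({} for none).

    ctype_loc is the result of querying LC_CTYPE, env is a mapping of the
    current environment, and is_available is a hypothetical callback that
    reports whether the system accepts a given locale name.
    """
    if ctype_loc != "C":
        return {}  # an explicitly configured locale is always respected
    if env.get("PYTHONCOERCECLOCALE") == "0":
        return {}  # coercion explicitly disabled
    for name, variables in CANDIDATES:
        if is_available(name):
            updates = {variable: name for variable in variables}
            if "PYTHONIOENCODING" not in env:
                updates["PYTHONIOENCODING"] = "utf-8:surrogateescape"
            return updates
    return {}  # no candidate available: remain in the C locale


# Example: a system that only provides the partial UTF-8 locale.
print(plan_locale_coercion("C", {}, lambda name: name == "UTF-8"))
```

Returning a mapping keeps the sketch free of side effects, which makes the
ordering of the candidates and the ``PYTHONIOENCODING`` rule easy to check in
isolation.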
When this locale coercion is activated, the following warning will be
printed on stderr, with the warning containing whichever locale was
successfully configured::

    Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another
    locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).

When falling back to the ``UTF-8`` locale, the message would be slightly
different::

    Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale
    or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).

In combination with PEP 540, this locale coercion will mean that the standard
Python binary *and* locale aware C/C++ extensions should once again "just work"
in the three main failure cases we're aware of (missing locale
settings, SSH forwarding of unknown locales, and developers explicitly
requesting ``LANG=C``), as long as the target platform provides at least one
of the candidate UTF-8 based environments.

If ``PYTHONCOERCECLOCALE=0`` is set, or none of the candidate locales is
successfully configured, then initialization will continue as usual in the C
locale and the Unicode compatibility warning described in the next section will
be emitted just as it would for any other application.

The interpreter will always check for the ``PYTHONCOERCECLOCALE`` environment
variable (even when running under the ``-E`` or ``-I`` switches), as the locale
coercion check necessarily takes place before any command line argument
processing.


Changes to the runtime initialization process
---------------------------------------------

By the time that ``Py_Initialize`` is called, arbitrary locale-dependent
operations may have taken place in the current process. This means that
by the time it is called, it is *too late* to switch to a different locale -
doing so would introduce inconsistencies in decoded text, even in the context
of the standalone Python interpreter binary.

Accordingly, when ``Py_Initialize`` is called and CPython detects that the
configured locale is still the default ``C`` locale *and* the ``PYTHONUTF8``
feature from PEP 540 is disabled, the following warning will
be issued::

    Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
    encoding), which may cause Unicode compatibility problems. Using C.UTF-8
    (if available) as an alternative Unicode-compatible locale is recommended.

In this case, no actual change will be made to the locale settings.

Instead, the warning informs both system and application integrators that
they're running Python 3 in a configuration that we don't expect to work
properly.

The second sentence providing recommendations would be conditionally compiled
based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8`` on \*BSD
systems).

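The decision made at runtime initialization can be sketched in Python as
follows. This is an illustration only: the function name, its parameters and
the ``stream`` argument are hypothetical, and the real check would be part of
the C implementation of ``Py_Initialize``. The message is written directly to
a stream rather than through the ``warnings`` module, in line with the design
principles above.

```python
import io
import sys

LEGACY_C_LOCALE_WARNING = (
    "Python runtime initialized with LC_CTYPE=C (a locale with default ASCII "
    "encoding), which may cause Unicode compatibility problems. Using C.UTF-8 "
    "(if available) as an alternative Unicode-compatible locale is recommended."
)


def warn_if_legacy_c_locale(ctype_loc, utf8_mode_enabled, stream=None):
    """Emit the warning when the C locale is active and UTF-8 mode is off.

    Returns True if the warning was written. The locale itself is never
    changed here, since it is too late to do so safely at this point.
    """
    if stream is None:
        stream = sys.stderr
    if ctype_loc == "C" and not utf8_mode_enabled:
        print(LEGACY_C_LOCALE_WARNING, file=stream)
        return True
    return False


# Capture the output instead of writing to the real stderr.
captured = io.StringIO()
warned = warn_if_legacy_c_locale("C", utf8_mode_enabled=False, stream=captured)
```

Because the message bypasses the ``warnings`` machinery, options such as
``-Werror`` have no effect on it.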

New build-time configuration options
------------------------------------

While both of the above behaviours would be enabled by default, they would
also have new associated configuration options and preprocessor definitions
for the benefit of redistributors that want to override those default settings.

The locale coercion behaviour would be controlled by the flag
``--with[out]-c-locale-coercion``, which would set the ``PY_COERCE_C_LOCALE``
preprocessor definition.

The locale warning behaviour would be controlled by the flag
``--with[out]-c-locale-warning``, which would set the ``PY_WARN_ON_C_LOCALE``
preprocessor definition.

On platforms where they would have no effect (e.g. Mac OS X, iOS, Android,
Windows) these preprocessor variables would always be undefined.


Platform Support Changes
========================

A new "Legacy C Locale" section will be added to PEP 11 that states:

* as of CPython 3.7, the legacy C locale is only supported when operating in
  "UTF-8" mode. Any Unicode handling issues that occur only in that locale
  and cannot be reproduced in an appropriately configured non-ASCII locale will
  be closed as "won't fix"
* as of CPython 3.7, \*nix platforms are expected to provide at least one of
  ``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8``
  (``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
  Any Unicode related integration problems with C/C++ extensions that occur
  only in that locale and cannot be reproduced in an appropriately configured
  non-ASCII locale will be closed as "won't fix".


Rationale
=========


Improving the handling of the C locale
--------------------------------------

It has been clear for some time that the C locale's default encoding of
``ASCII`` is entirely the wrong choice for development of modern networked
services. Newer languages like Rust and Go have eschewed that default entirely,
and instead made it a deployment requirement that systems be configured to use
UTF-8 as the text encoding for operating system interfaces. Similarly, Node.js
assumes UTF-8 by default (a behaviour inherited from the V8 JavaScript engine)
and requires custom build settings to indicate it should use the system
locale settings for locale-aware operations. Both the JVM and the .NET CLR
use UTF-16-LE as their primary encoding for passing text between applications
and the underlying platform.

The challenge for CPython has been the fact that in addition to being used for
network service development, it is also extensively used as an embedded
scripting language in larger applications, and as a desktop application
development language, where it is more important to be consistent with other
C/C++ components sharing the same process, as well as with the user's desktop
locale settings, than it is with the emergent conventions of modern network
service development.

The core premise of this PEP is that for *all* of these use cases, the
assumption of ASCII implied by the default "C" locale is the wrong choice,
and furthermore that the following assumptions are valid:

* in desktop application use cases, the process locale will *already* be
  configured appropriately, and if it isn't, then that is an operating system
  or embedding application level problem that needs to be reported to and
  resolved by the operating system provider or application developer
* in network service development use cases (especially those based on Linux
  containers), the process locale may not be configured *at all*, and if it
  isn't, then the expectation is that components will impose their own default
  encoding the way Rust, Go and Node.js do, rather than trusting the legacy C
  default encoding of ASCII the way CPython currently does


Defaulting to "surrogateescape" error handling on the standard IO streams
-------------------------------------------------------------------------

By coercing the locale away from the legacy C default and its assumption of
ASCII as the preferred text encoding, this PEP also disables the implicit use
of the "surrogateescape" error handler on the standard IO streams that was
introduced in Python 3.5 ([15_]), as well as the automatic use of
``surrogateescape`` when operating in PEP 540's UTF-8 mode.

Rather than introducing yet another configuration option to address that,
this PEP proposes to use the existing ``PYTHONIOENCODING`` setting to ensure
that the ``surrogateescape`` handler is enabled when the interpreter is
required to make assumptions regarding the expected filesystem encoding.

The aim of this behaviour is to attempt to ensure that operating system
provided text values are typically able to be transparently passed through a
Python 3 application even if it is incorrect in assuming that the text has
been encoded as UTF-8.

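The pass-through behaviour can be illustrated with a minimal Python sketch.
It assumes only the standard ``surrogateescape`` error handler and the
built-in ``gb18030`` codec; the helper name ``decodes_strictly_as_utf8`` is
hypothetical.

```python
# Bytes that are valid GB 18030 but not valid UTF-8.
gb18030_bytes = "ℙƴ☂ℌøἤ\n".encode("gb18030")


def decodes_strictly_as_utf8(data):
    """Report whether data is valid UTF-8 under the default strict handler."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# With surrogateescape, each undecodable byte becomes a lone surrogate on
# input and is turned back into the original byte on output.
escaped = gb18030_bytes.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

print(decodes_strictly_as_utf8(gb18030_bytes), restored == gb18030_bytes)
```

Strict decoding rejects the data, while the escaped form survives a decode
and encode cycle byte for byte, even though the interpreter was wrong to
assume UTF-8.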
In particular, GB 18030 [12_] is a Chinese national text encoding standard
that handles all Unicode code points and is formally incompatible with both
ASCII and UTF-8, but will nevertheless often tolerate processing as surrogate
escaped data - the points where GB 18030 reuses ASCII byte values in an
incompatible way are likely to be invalid in UTF-8, and will therefore be
escaped and opaque to string processing operations that split on or search for
the relevant ASCII code points. Operations that don't involve splitting on or
searching for particular ASCII or Unicode code point values are almost
certain to work correctly.

Similarly, Shift-JIS [13_] and ISO-2022-JP [14_] remain in widespread use in
Japan, and are incompatible with both ASCII and UTF-8, but will tolerate text
processing operations that don't involve splitting on or searching for
particular ASCII or Unicode code point values.

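As a sketch of the kind of operation that is expected to work, the following
Python example splits and rejoins surrogate escaped GB 18030 data at newline
characters. It relies on the assumption that the newline byte ``0x0A`` never
occurs inside a multi-byte GB 18030 sequence; the helper names are
hypothetical.

```python
def to_escaped_text(data):
    """Decode bytes as UTF-8, escaping undecodable bytes as lone surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


def to_raw_bytes(text):
    """Reverse to_escaped_text, restoring the escaped bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


lines = ["ℙƴ☂ℌøἤ", "Python"]
host_data = "\n".join(lines).encode("gb18030")

# Split at an ASCII code point that GB 18030 never reuses inside a character.
parts = to_escaped_text(host_data).split("\n")

# Each part still holds the original GB 18030 bytes for its line.
recovered = [to_raw_bytes(part).decode("gb18030") for part in parts]

# Concatenating at the same code point reproduces the host data exactly.
rejoined = to_raw_bytes("\n".join(parts))

print(recovered == lines, rejoined == host_data)
```

Splitting on a code point such as a digit would not be safe in the same way,
since GB 18030 uses the ASCII digit byte values within its four-byte
sequences.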
As an example, consider two files, one encoded with UTF-8 (the default encoding
for ``en_AU.UTF-8``), and one encoded with GB-18030 (the default encoding for
``zh_CN.gb18030``)::

    $ python3 -c 'open("utf8.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("utf-8"))'
    $ python3 -c 'open("gb18030.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("gb18030"))'

On disk, we can see that these are two very different files::

    $ python3 -c 'print("UTF-8: ", open("utf8.txt", "rb").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "rb").read().strip())'
    UTF-8: b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4\n'
    GB18030: b'\x816\xbd6\x810\x9d0\x817\xa29\x816\xbc4\x810\x8b3\x816\x8d6\n'

Nevertheless, both can be rendered correctly to the terminal as long as
they're decoded prior to printing::

    $ python3 -c 'print("UTF-8: ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())'
    UTF-8: ℙƴ☂ℌøἤ
    GB18030: ℙƴ☂ℌøἤ

By contrast, if we just pass along the raw bytes, as ``cat`` and similar C/C++
utilities will tend to do::

    $ LANG=en_AU.UTF-8 cat utf8.txt gb18030.txt
    ℙƴ☂ℌøἤ
    �6�6�0�0�7�9�6�4�0�3�6�6

Even setting a specifically Chinese locale won't help in getting the
GB-18030 encoded file rendered correctly::

    $ LANG=zh_CN.gb18030 cat utf8.txt gb18030.txt
    ℙƴ☂ℌøἤ
    �6�6�0�0�7�9�6�4�0�3�6�6

The problem is that the *terminal* encoding setting remains UTF-8, regardless
of the nominal locale. A GB18030 terminal can be emulated using the ``iconv``
utility::

    $ cat utf8.txt gb18030.txt | iconv -f GB18030 -t UTF-8
    鈩櫰粹槀鈩屆羔激
    ℙƴ☂ℌøἤ

This reverses the problem, such that the GB18030 file is rendered correctly,
but the UTF-8 file has been converted to unrelated hanzi characters, rather than
the expected rendering of "Python" as non-ASCII characters.

With the emulated GB18030 terminal encoding, assuming UTF-8 in Python results
in *both* files being displayed incorrectly::

    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
      | iconv -f GB18030 -t UTF-8
    UTF-8:   鈩櫰粹槀鈩屆羔激
    GB18030: 鈩櫰粹槀鈩屆羔激

However, setting the locale correctly means that the emulated GB18030 terminal
now displays both files as originally intended::

    $ LANG=zh_CN.gb18030 \
      python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
      | iconv -f GB18030 -t UTF-8
    UTF-8:   ℙƴ☂ℌøἤ
    GB18030: ℙƴ☂ℌøἤ

The rationale for retaining ``surrogateescape`` as the default error handler
for the standard streams is that it will preserve the following helpful
behaviour in the C locale::

    $ cat gb18030.txt \
      | LANG=C python3 -c "import sys; print(sys.stdin.read())" \
      | iconv -f GB18030 -t UTF-8
    ℙƴ☂ℌøἤ

Rather than reverting to the exception seen when a UTF-8 based locale is
explicitly configured::

    $ cat gb18030.txt \
      | python3 -c "import sys; print(sys.stdin.read())" \
      | iconv -f GB18030 -t UTF-8
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte

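The difference between the two behaviours comes down to the error handler
used when decoding and re-encoding the stream. The following is a minimal
sketch rather than the interpreter's actual stream setup: it assumes UTF-8 as
the stream encoding, and the variable names are illustrative only:

```python
# Illustrative sketch: surrogateescape lets undecodable bytes pass through.
raw = "ℙƴ☂ℌøἤ".encode("gb18030")  # bytes that are not valid UTF-8

# Strict decoding rejects the data outright.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    strict_error = exc
else:
    strict_error = None

# surrogateescape maps each undecodable byte to a lone surrogate code point...
smuggled = raw.decode("utf-8", errors="surrogateescape")

# ...and encoding with the same handler restores the original bytes exactly.
round_tripped = smuggled.encode("utf-8", errors="surrogateescape")
```

Because the bytes survive the round trip unchanged, a downstream consumer that
knows the real encoding (``iconv`` in the examples above) can still decode them
correctly.
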
Note: an alternative to setting ``PYTHONIOENCODING`` as the PEP currently
proposes would be to instead *always* default to ``surrogateescape`` on the
standard streams, and require the use of ``PYTHONIOENCODING=:strict`` to request
text encoding validation during stream processing. Adopting such an approach
would bring Python 3 more into line with typical C/C++ tools that pass along
the raw bytes without checking them for conformance to their nominal encoding,
and would hence also make the last example display the desired output::

    $ cat gb18030.txt \
      | PYTHONIOENCODING=:surrogateescape python3 -c "import sys; print(sys.stdin.read())" \
      | iconv -f GB18030 -t UTF-8
    ℙƴ☂ℌøἤ


Dropping official support for ASCII based text handling in the legacy C locale
------------------------------------------------------------------------------

We've been trying to get strict bytes/text separation to work reliably in the
legacy C locale for over a decade at this point. Not only haven't we been able
to get it to work, neither has anyone else - the only viable alternatives
identified have been to pass the bytes along verbatim without eagerly decoding
them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal
C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540,
Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).

While this PEP ensures that developers who need to do so can still opt in to
running their Python code in the legacy C locale, it also makes clear that we
*don't* expect Python 3's Unicode handling to be reliable in that configuration,
and the recommended alternative is to use a more appropriate locale setting.


Providing implicit locale coercion only when running standalone
---------------------------------------------------------------

Over the course of Python 3.x development, multiple attempts have been made
to improve the handling of incorrect locale settings at the point where the
Python interpreter is initialised. The problem that emerged is that this is
ultimately *too late* in the interpreter startup process - data such as command
line arguments and the contents of environment variables may have already been
retrieved from the operating system and processed under the incorrect ASCII
text encoding assumption well before ``Py_Initialize`` is called.

The problems created by those inconsistencies were then even harder to diagnose
and debug than those created by believing the operating system's claim that
ASCII was a suitable encoding to use for operating system interfaces. This was
the case even for the default CPython binary, let alone larger C/C++
applications that embed CPython as a scripting engine.

The approach proposed in this PEP handles that problem by moving the locale
coercion as early as possible in the interpreter startup sequence when running
standalone: it takes place directly in the C-level ``main()`` function, even
before calling in to the ``Py_Main()`` library function that implements the
features of the CPython interpreter CLI.

The ``Py_Initialize`` API then only gains an explicit warning (emitted on
``stderr``) when it detects use of the ``C`` locale, and relies on the
embedding application to specify something more reasonable.


Querying LC_CTYPE for C locale detection
----------------------------------------

``LC_CTYPE`` is the actual locale category that CPython relies on to drive the
implicit decoding of environment variables, command line arguments, and other
text values received from the operating system.

As such, it makes sense to check it specifically when attempting to determine
whether or not the current locale configuration is likely to cause Unicode
handling problems.


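The ``LC_CTYPE`` check described above happens at the C level, but the same
query can be approximated from Python for illustration. The following is a
hedged sketch only: ``is_legacy_c_locale`` is a hypothetical helper, and
treating only the exact name ``C`` as the legacy locale is an assumption made
for this example:

```python
import locale


def is_legacy_c_locale(ctype_name):
    """Return True if the given LC_CTYPE setting names the legacy C locale."""
    # Hypothetical helper: only the exact name "C" is treated as legacy here.
    return ctype_name == "C"


# Passing None queries the current LC_CTYPE setting without changing it.
current_ctype = locale.setlocale(locale.LC_CTYPE, None)
print(current_ctype, is_legacy_c_locale(current_ctype))
```

Note that the value seen from Python code reflects whatever locale
configuration the interpreter applied during startup, so it may differ from
what the C-level ``main()`` function observes.

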
Setting both LANG & LC_ALL for C.UTF-8 locale coercion
------------------------------------------------------

Python is often used as a glue language, integrating other C/C++ ABI compatible
components in the current process, and components written in arbitrary
languages in subprocesses.

Setting ``LC_ALL`` to ``C.UTF-8`` imposes a locale setting override on all
C/C++ components in the current process and in any subprocesses that inherit
the current environment. This is important to handle cases where the problem
has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a system
where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
configured to forward locale settings, and the user logs into a Linux server).

Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``.

Together, these should ensure that when the locale coercion is activated, the
switch to the C.UTF-8 locale will be applied consistently across the current
process and any subprocesses that inherit the current environment.


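The effect of setting both variables can be sketched in Python as follows.
This is an illustration only, not the proposed C implementation:
``coerced_environment`` is a hypothetical helper, and the target locale name is
hard-coded to ``C.UTF-8`` even though the name actually available depends on
the system:

```python
import os


def coerced_environment(environ):
    """Return a copy of environ with LANG and LC_ALL forced to C.UTF-8."""
    env = dict(environ)
    env["LC_ALL"] = "C.UTF-8"  # overrides every individual LC_* category
    env["LANG"] = "C.UTF-8"    # fallback for components that only read LANG
    return env


# The result could be passed as the env argument when launching a subprocess.
child_env = coerced_environment(os.environ)
```

A stray setting such as ``LC_CTYPE=UTF-8`` is left in the mapping, but it no
longer has any effect on locale-aware components, since ``LC_ALL`` takes
precedence over the individual categories.

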
Allowing restoration of the legacy behaviour
--------------------------------------------

The CPython command line interpreter is often used to investigate faults that
occur in other applications that embed CPython, and those applications may still
be using the C locale even after this PEP is implemented.

Providing a simple on/off switch for the locale coercion behaviour makes it
much easier to reproduce the behaviour of such applications for debugging
purposes, as well as making it easier to reproduce the behaviour of older 3.x
runtimes even when running a version with this change applied.


Implementation
==============

A draft implementation of the change (including test cases) has been
posted to issue 28180 [1_], which is an end user request that
``sys.getfilesystemencoding()`` default to ``utf-8`` rather than ``ascii``.

NOTE: The currently posted draft implementation is for a previous iteration
of the PEP prior to the incorporation of the feedback noted in [11_]. It was
broadly the same in concept (i.e. coercing the legacy C locale to one based on
UTF-8), but differs in several details.


Backporting to earlier Python 3 releases
========================================

Backporting to Python 3.6.0
---------------------------

If this PEP is accepted for Python 3.7, redistributors backporting the change
specifically to their initial Python 3.6.0 release will be both allowed and
encouraged. However, such backports should only be undertaken either in
conjunction with the changes needed to also provide a suitable locale by
default, or else specifically for platforms where such a locale is already
consistently available.


Backporting to other 3.x releases
---------------------------------

While the proposed behavioural change is seen primarily as a bug fix addressing
Python 3's current misbehaviour in the default ASCII-based C locale, it still
represents a reasonably significant change in the way CPython interacts with
the C locale system. As such, while some redistributors may still choose to
backport it to even earlier Python 3.x releases based on the needs and
interests of their particular user base, this wouldn't be encouraged as a
general practice.


Acknowledgements
================

The locale coercion approach proposed in this PEP is inspired directly by
Armin Ronacher's handling of this problem in the ``click`` command line
utility development framework [2_]::

    $ LANG=C python3 -c 'import click; cli = click.command()(lambda:None); cli()'
    Traceback (most recent call last):
      ...
    RuntimeError: Click will abort further execution because Python 3 was
    configured to use ASCII as encoding for the environment. Either run this
    under Python 2 or consult http://click.pocoo.org/python3/ for mitigation
    steps.

    This system supports the C.UTF-8 locale which is recommended.
    You might be able to resolve your issue by exporting the
    following environment variables:

        export LC_ALL=C.UTF-8
        export LANG=C.UTF-8

The change was originally proposed as a downstream patch for Fedora's
system Python 3.6 package [3_], and then reformulated as a PEP for Python 3.7
with a section allowing for backports to earlier versions by redistributors.

The initial draft was posted to the Python Linux SIG for discussion [10_] and
then amended based on both that discussion and Victor Stinner's work in
PEP 540 [11_].

The "ℙƴ☂ℌøἤ" string used in the Unicode handling examples throughout this PEP
is taken from Ned Batchelder's excellent "Pragmatic Unicode" presentation [9_].

Stephen Turnbull has long provided valuable insight into the text encoding
handling challenges he regularly encounters at the University of Tsukuba
(筑波大学).


References
==========

.. [1] CPython: sys.getfilesystemencoding() should default to utf-8
   (http://bugs.python.org/issue28180)

.. [2] Locale configuration required for click applications under Python 3
   (http://click.pocoo.org/5/python3/#python-3-surrogate-handling)

.. [3] Fedora: force C.UTF-8 when Python 3 is run under the C locale
   (https://bugzilla.redhat.com/show_bug.cgi?id=1404918)

.. [4] GNU C: How Programs Set the Locale
   (https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html)

.. [5] GNU C: Locale Categories
   (https://www.gnu.org/software/libc/manual/html_node/Locale-Categories.html)

.. [6] glibc C.UTF-8 locale proposal
   (https://sourceware.org/glibc/wiki/Proposals/C.UTF-8)

.. [7] GNOME Flatpak
   (http://flatpak.org/)

.. [8] Ubuntu Snappy
   (https://www.ubuntu.com/desktop/snappy)

.. [9] Pragmatic Unicode
   (http://nedbatchelder.com/text/unipain.html)

.. [10] linux-sig discussion of initial PEP draft
   (https://mail.python.org/pipermail/linux-sig/2017-January/000014.html)

.. [11] Feedback notes from linux-sig discussion and PEP 540
   (https://github.com/python/peps/issues/171)

.. [12] GB 18030
   (https://en.wikipedia.org/wiki/GB_18030)

.. [13] Shift-JIS
   (https://en.wikipedia.org/wiki/Shift_JIS)

.. [14] ISO-2022
   (https://en.wikipedia.org/wiki/ISO/IEC_2022)

.. [15] Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale
   (https://bugs.python.org/issue19977)

.. [16] test_readline.test_nonascii fails on Android
   (http://bugs.python.org/issue28997)

.. [17] UTF-8 locale discussion on "locale.getdefaultlocale() fails on Mac OS X with default language set to English"
   (http://bugs.python.org/issue18378#msg215215)

Copyright
=========

This document has been placed in the public domain under the terms of the
CC0 1.0 license: https://creativecommons.org/publicdomain/zero/1.0/


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: