PEP: 538
Title: Coercing the legacy C locale to a UTF-8 based locale
Version: $Revision$
Last-Modified: $Date$
Author: Alyssa Coghlan <ncoghlan@gmail.com>
BDFL-Delegate: INADA Naoki
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 28-Dec-2016
Python-Version: 3.7
Post-History: 03-Jan-2017,
              07-Jan-2017,
              05-Mar-2017,
              09-May-2017
Resolution: https://mail.python.org/pipermail/python-dev/2017-May/148035.html

Abstract
========

An ongoing challenge with Python 3 on \*nix systems is the conflict between
needing to use the configured locale encoding by default for consistency with
other locale-aware components in the same process or subprocesses,
and the fact that the standard C locale (as defined in POSIX:2001) typically
implies a default text encoding of ASCII, which is entirely inadequate for the
development of networked services and client applications in a multilingual
world.

:pep:`540` proposes a change to CPython's handling of the legacy C locale such
that CPython will assume the use of UTF-8 in such environments, rather than
persisting with the demonstrably problematic assumption of ASCII as an
appropriate encoding for communicating with operating system interfaces.
This is a good approach for cases where network encoding interoperability
is a more important concern than local encoding interoperability.

However, it comes at the cost of making CPython's encoding assumptions diverge
from those of other locale-aware components in the same process, as well as
those of components running in subprocesses that share the same environment.

This can cause interoperability problems with some extension modules (such as
GNU readline's command line history editing), as well as with components
running in subprocesses (such as older Python runtimes).

It also requires non-trivial changes to the internals of how CPython itself
works, rather than relying primarily on existing configuration settings that
are supported by Python versions prior to Python 3.7.

Accordingly, this PEP proposes that independently of the UTF-8 mode proposed
in :pep:`540`, the way the CPython implementation handles the default C locale be
changed to be roughly equivalent to the following existing configuration
settings (supported since Python 3.1)::

    LC_CTYPE=C.UTF-8
    PYTHONIOENCODING=utf-8:surrogateescape

The exact target locale for coercion will be chosen from a predefined list at
runtime based on the actually available locales.

The reinterpreted locale settings will be written back to the environment so
they're visible to other components in the same process and in subprocesses,
but the changed ``PYTHONIOENCODING`` default will be made implicit in order to
avoid causing compatibility problems with Python 2 subprocesses that don't
provide the ``surrogateescape`` error handler.

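As a rough illustration of the "existing configuration settings" mentioned
above, the following sketch launches a child interpreter with both settings
applied explicitly. The helper name ``legacy_equivalent_env`` is hypothetical
and not part of CPython, and the sketch assumes the ``C.UTF-8`` locale is
installed on the system; where it isn't, the child simply stays in the C
locale and only the ``PYTHONIOENCODING`` setting takes effect::

    import os
    import subprocess
    import sys


    def legacy_equivalent_env(base_env):
        """Return a copy of base_env with the two settings above applied.

        Hypothetical helper for illustration only.
        """
        env = dict(base_env)
        env["LC_CTYPE"] = "C.UTF-8"
        env["PYTHONIOENCODING"] = "utf-8:surrogateescape"
        return env


    if __name__ == "__main__":
        # Ask the child which encoding and error handler stdout ended up with.
        probe = "import sys; print(sys.stdout.encoding, sys.stdout.errors)"
        child = subprocess.run(
            [sys.executable, "-c", probe],
            env=legacy_equivalent_env(os.environ),
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
        print(child.stdout.strip())

Setting ``PYTHONIOENCODING`` explicitly like this is exactly what the PEP
avoids doing automatically, since the value would be inherited by Python 2
subprocesses that don't recognise ``surrogateescape``.
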
The new legacy locale coercion behaviour can be disabled either by setting
``LC_ALL`` (which may still lead to a Unicode compatibility warning) or by
setting the new ``PYTHONCOERCECLOCALE`` environment variable to ``0``.

With this change, any \*nix platform that does *not* offer at least one of the
``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
configuration would only be considered a fully supported platform for CPython
3.7+ deployments when a suitable locale other than the default ``C`` locale is
configured explicitly (e.g. ``en_AU.UTF-8``, ``zh_CN.gb18030``). If :pep:`540` is
accepted in addition to this PEP, then pure Python modules would also be
supported when using the proposed ``PYTHONUTF8`` mode, but expectations for
full Unicode compatibility in extension modules would continue to be limited
to the platforms covered by this PEP.

As it only reflects a change in default settings rather than a fundamentally
new capability, redistributors (such as Linux distributions) with a narrower
target audience than the upstream CPython development team may also choose to
opt in to this locale coercion behaviour for the Python 3.6.x series by
applying the necessary changes as a downstream patch.


Implementation Notes
====================

Attempting to implement the PEP as originally accepted showed that the
proposal to emit locale coercion and compatibility warnings by default
simply wasn't practical (there were too many cases where previously working
code failed *because of the warnings*, rather than because of latent locale
handling defects in the affected code).

As a result, the ``PY_WARN_ON_C_LOCALE`` config flag was removed, and replaced
with a runtime ``PYTHONCOERCECLOCALE=warn`` environment variable setting
that allows developers and system integrators to opt-in to receiving locale
coercion and compatibility warnings, without emitting them by default.

The output examples in the PEP itself have also been updated to remove
the warnings and make them easier to read.


Background
==========

While the CPython interpreter is starting up, it may need to convert from
the ``char *`` format to the ``wchar_t *`` format, or from one of those formats
to ``PyUnicodeObject *``, in a way that's consistent with the locale settings
of the overall system. It handles these cases by relying on the operating
system to do the conversion and then ensuring that the text encoding name
reported by ``sys.getfilesystemencoding()`` matches the encoding used during
this early bootstrapping process.

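The encodings chosen during this bootstrapping process can be inspected from
a running interpreter. The following is a minimal sketch using only standard
library calls; the helper name ``startup_encodings`` is hypothetical, and the
values reported depend entirely on the platform and the environment the
interpreter was started in::

    import locale
    import sys


    def startup_encodings():
        """Collect the encoding names the interpreter settled on at startup."""
        return {
            # Encoding used for file names, argv and environment variables.
            "filesystem": sys.getfilesystemencoding(),
            # Encoding implied by the active locale (False: don't call setlocale).
            "locale": locale.getpreferredencoding(False),
            # Encoding of the standard output stream, if it has one.
            "stdout": getattr(sys.stdout, "encoding", None),
        }


    if __name__ == "__main__":
        for name, value in sorted(startup_encodings().items()):
            print(name, value)

Running this once normally and once with ``LC_ALL=C`` set is a quick way to
see how much of the interpreter's text handling follows the locale.
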
On Windows, the limitations of the ``mbcs`` format used by default in these
conversions proved sufficiently problematic that :pep:`528` and :pep:`529` were
implemented to bypass the operating system supplied interfaces for binary data
handling and force the use of UTF-8 instead.

On Mac OS X, iOS, and Android, many components, including CPython, already
assume the use of UTF-8 as the system encoding, regardless of the locale
setting. However, this isn't the case for all components, and the discrepancy
can cause problems in some situations (for example, when using the GNU readline
module [16]_).

On non-Apple and non-Android \*nix systems, these operations are handled using
the C locale system in glibc, which has the following characteristics [4]_:

* by default, all processes start in the ``C`` locale, which uses ``ASCII``
  for these conversions. This is almost never what anyone doing multilingual
  text processing actually wants (including CPython and C/C++ GUI frameworks).
* calling ``setlocale(LC_ALL, "")`` reconfigures the active locale based on
  the locale categories configured in the current process environment
* if the locale requested by the current environment is unknown, or no specific
  locale is configured, then the default ``C`` locale will remain active

The specific locale category that covers the APIs that CPython depends on is
``LC_CTYPE``, which applies to "classification and conversion of characters,
and to multibyte and wide characters" [5]_. Accordingly, CPython includes the
following key calls to ``setlocale``:

* in the main ``python`` binary, CPython calls ``setlocale(LC_ALL, "")`` to
  configure the entire C locale subsystem according to the process environment.
  It does this prior to making any calls into the shared CPython library
* in ``Py_Initialize``, CPython calls ``setlocale(LC_CTYPE, "")``, such that
  the configured locale settings for that category *always* match those set in
  the environment. It does this unconditionally, and it *doesn't* revert the
  process state change in ``Py_Finalize``

(This summary of the locale handling omits several technical details related
to exactly where and when the text encoding declared as part of the locale
settings is used - see :pep:`540` for further discussion, as these particular
details matter more when decoupling CPython from the declared C locale than
they do when overriding the locale with one based on UTF-8.)

These calls are usually sufficient to provide sensible behaviour, but they can
still fail in the following cases:

* SSH environment forwarding means that SSH clients may sometimes forward
  client locale settings to servers that don't have that locale installed. This
  leads to CPython running in the default ASCII-based C locale
* some process environments (such as Linux containers) may not have any
  explicit locale configured at all. As with unknown locales, this leads to
  CPython running in the default ASCII-based C locale
* on Android, rather than configuring the locale based on environment variables,
  the empty locale ``""`` is treated as specifically requesting the ``"C"``
  locale

The simplest way to deal with this problem for currently released versions of
CPython is to explicitly set a more sensible locale when launching the
application. For example::

    LC_CTYPE=C.UTF-8 python3 ...

The ``C.UTF-8`` locale is a full locale definition that uses ``UTF-8`` for the
``LC_CTYPE`` category, and the same settings as the ``C`` locale for all other
categories (including ``LC_COLLATE``). It is offered by a number of Linux
distributions (including Debian, Ubuntu, Fedora, Alpine and Android) as an
alternative to the ASCII-based C locale. Some other platforms (such as
``HP-UX``) offer an equivalent locale definition under the name ``C.utf8``.

Mac OS X and other \*BSD systems have taken a different approach: instead of
offering a ``C.UTF-8`` locale, they offer a partial ``UTF-8`` locale that only
defines the ``LC_CTYPE`` category. On such systems, the preferred
environmental locale adjustment is to set ``LC_CTYPE=UTF-8`` rather than to set
``LC_ALL`` or ``LANG``. [17]_

In the specific case of Docker containers and similar technologies, the
appropriate locale setting can be specified directly in the container image
definition.

Another common failure case is developers specifying ``LANG=C`` in order to
see otherwise translated user interface messages in English, rather than the
more narrowly scoped ``LC_MESSAGES=C`` or ``LANGUAGE=en``.


Relationship with other PEPs
============================

This PEP shares a common problem statement with :pep:`540` (improving Python 3's
behaviour in the default C locale), but diverges markedly in the proposed
solution:

* :pep:`540` proposes to entirely decouple CPython's default text encoding from
  the C locale system in that case, allowing text handling inconsistencies to
  arise between CPython and other locale-aware components running in the same
  process and in subprocesses. This approach aims to make CPython behave less
  like a locale-aware application, and more like locale-independent language
  runtimes like those for Go, Node.js (V8), and Rust
* this PEP proposes to override the legacy C locale with a more recently
  defined locale that uses UTF-8 as its default text encoding. This means that
  the text encoding override will apply not only to CPython, but also to any
  locale-aware extension modules loaded into the current process, as well as to
  locale-aware applications invoked in subprocesses that inherit their
  environment from the parent process. This approach aims to retain CPython's
  traditional strong support for integration with other locale-aware components
  while also actively helping to push forward the adoption and standardisation
  of the C.UTF-8 locale as a Unicode-aware replacement for the legacy C locale
  in the wider C/C++ ecosystem

After reviewing both PEPs, it became clear that they didn't actually conflict
at a technical level, and the proposal in :pep:`540` offered a superior option in
cases where no suitable locale was available, as well as offering a better
reference behaviour for platforms where the notion of a "locale encoding"
doesn't make sense (for example, embedded systems running MicroPython rather
than the CPython reference interpreter).

Meanwhile, this PEP offered improved compatibility with other locale-aware
components, and an approach more amenable to being backported to Python 3.6
by downstream redistributors.

As a result, this PEP was amended to refer to :pep:`540` as a complementary
solution that offered improved behaviour when none of the standard UTF-8 based
locales were available, as well as extending the changes in the default
settings to APIs that aren't currently independently configurable (such as
the default encoding and error handler for ``open()``).

The availability of :pep:`540` also meant that the ``LC_CTYPE=en_US.UTF-8`` legacy
fallback was removed from the list of UTF-8 locales tried as a coercion target,
with the expectation being that CPython will instead rely solely on the
proposed PYTHONUTF8 mode in such cases.


Motivation
==========

While Linux container technologies like Docker, Kubernetes, and OpenShift are
best known for their use in web service development, the related container
formats and execution models are also being adopted for Linux command line
application development. Technologies like Gnome Flatpak [7]_ and
Ubuntu Snappy [8]_ further aim to bring these same techniques to Linux GUI
application development.

When using Python 3 for application development in these contexts, it isn't
uncommon to see text encoding related errors akin to the following::

    $ docker run --rm fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
    Unable to decode the command from the command line:
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed
    $ docker run --rm ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
    Unable to decode the command from the command line:
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed

Even though the same command is likely to work fine when run locally::

    $ python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ

The source of the problem can be seen by instead running the ``locale`` command
in the three environments::

    $ locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=en_AU.UTF-8
    LC_CTYPE="en_AU.UTF-8"
    LC_ALL=
    $ docker run --rm fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LC_CTYPE="POSIX"
    LC_ALL=
    $ docker run --rm ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LANGUAGE=
    LC_CTYPE="POSIX"
    LC_ALL=

In this particular example, we can see that the host system locale is set to
"en_AU.UTF-8", so CPython uses UTF-8 as the default text encoding. By contrast,
the base Docker images for Fedora and Debian don't have any specific locale
set, so they use the POSIX locale by default, which is an alias for the
ASCII-based default C locale.

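The ``UnicodeEncodeError`` shown at the start of this section can be
reproduced without a container. The sketch below is an illustration of the
mechanism rather than the interpreter's actual code path: it assumes that in
the C locale the UTF-8 bytes of the command are decoded as ASCII with the
``surrogateescape`` error handler, and that the result is later encoded
strictly as UTF-8. The variable and function names are hypothetical::

    # The command exactly as typed; the shell passes it on as UTF-8 bytes.
    command = 'print("ℙƴ☂ℌøἤ")'
    raw_argv = command.encode("utf-8")


    def decode_like_c_locale(data):
        """Decode bytes as ASCII, smuggling non-ASCII bytes as lone surrogates."""
        return data.decode("ascii", "surrogateescape")


    decoded = decode_like_c_locale(raw_argv)

    try:
        # A strict UTF-8 encode refuses to emit the escaped surrogates.
        decoded.encode("utf-8")
    except UnicodeEncodeError as exc:
        failure = exc
    else:
        failure = None

    print(failure)

The first non-ASCII byte (``0xE2``, the lead byte of ``ℙ``) sits at index 7,
directly after ``print("``, which is why the error reports ``'\udce2'`` at
position 7. Encoding with ``surrogateescape`` instead of the strict handler
recovers the original bytes unchanged.
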
The simplest way to get Python 3 (regardless of the exact version) to behave
sensibly in Fedora and Debian based containers is to run it in the ``C.UTF-8``
locale that both distros provide::

    $ docker run --rm -e LC_CTYPE=C.UTF-8 fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ
    $ docker run --rm -e LC_CTYPE=C.UTF-8 ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ

    $ docker run --rm -e LC_CTYPE=C.UTF-8 fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LC_CTYPE=C.UTF-8
    LC_ALL=
    $ docker run --rm -e LC_CTYPE=C.UTF-8 ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LANGUAGE=
    LC_CTYPE=C.UTF-8
    LC_ALL=

The Alpine Linux based Python images provided by Docker, Inc. already use the
C.UTF-8 locale by default::

    $ docker run --rm python:3 python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ
    $ docker run --rm python:3 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=C.UTF-8
    LANGUAGE=
    LC_CTYPE="C.UTF-8"
    LC_ALL=

Similarly, for custom container images (i.e. those adding additional content on
top of a base distro image), a more suitable locale can be set in the image
definition so everything just works by default. However, it would provide a much
nicer and more consistent user experience if CPython were able to just deal
with this problem automatically rather than relying on redistributors or end
users to handle it through system configuration changes.

While the glibc developers are working towards making the C.UTF-8 locale
universally available for use by glibc based applications like CPython [6]_,
this unfortunately doesn't help on platforms that ship older versions of glibc
without that feature, and also don't provide C.UTF-8 (or an equivalent) as an
on-disk locale the way Debian and Fedora do. These platforms are considered
out of scope for this PEP - see :pep:`540` for further discussion of possible
options for improving CPython's default behaviour in such environments.


Design Principles
=================

The above motivation leads to the following core design principles for the
proposed solution:

* if a locale other than the default C locale is explicitly configured, we'll
  continue to respect it
* as far as is feasible, any changes made will use *existing* configuration
  options
* Python's runtime behaviour in potential coercion target locales should be
  identical regardless of whether the locale was set explicitly in the
  environment or implicitly as a locale coercion target
* for Python 3.7, if we're changing the locale setting without an explicit
  config option, we'll emit a warning on stderr that we're doing so rather
  than silently changing the process configuration. This will alert application
  and system integrators to the change, even if they don't closely follow the
  PEP process or Python release announcements. However, to minimize the chance
  of introducing new problems for end users, we'll do this *without* using the
  warnings system, so even running with ``-Werror`` won't turn it into a runtime
  exception. (Note: these warnings ended up being silenced by default. See the
  Implementation Notes section above for more details.)
* for Python 3.7, any changed defaults will offer some form of explicit "off"
  switch at build time, runtime, or both

Minimizing the negative impact on systems currently correctly configured to
use GB-18030 or another partially ASCII compatible universal encoding leads to
the following design principle:

* if a UTF-8 based Linux container is run on a host that is explicitly
  configured to use a non-UTF-8 encoding, and tries to exchange locally
  encoded data with that host rather than exchanging explicitly UTF-8 encoded
  data, CPython will endeavour to correctly round-trip host provided data that
  is concatenated or split solely at common ASCII compatible code points, but
  may otherwise emit nonsensical results.

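The round-tripping behaviour described in this principle follows from the
``surrogateescape`` error handler. The sketch below uses a made-up GB-18030
encoded file name and shows that splitting and rejoining at an ASCII ``/``
preserves the original bytes, even though the intermediate text is not
meaningful. The sample path is an illustrative assumption, not data from the
PEP::

    # Bytes as a GB-18030 configured host might supply them (hypothetical path).
    host_bytes = "文件/数据.txt".encode("gb18030")

    # A UTF-8 based container decodes them with surrogateescape: bytes that
    # aren't valid UTF-8 become lone surrogates instead of raising an error.
    text = host_bytes.decode("utf-8", "surrogateescape")

    # Split and rejoin solely at an ASCII code point.
    head, sep, tail = text.partition("/")
    rejoined = head + sep + tail

    # Encoding with the same handler restores the original bytes exactly.
    round_tripped = rejoined.encode("utf-8", "surrogateescape")
    print(round_tripped == host_bytes)

    # The pieces are also individually recoverable by a GB-18030 aware reader.
    tail_for_host = tail.encode("utf-8", "surrogateescape").decode("gb18030")
    print(tail_for_host)

Displaying ``text`` directly, or slicing it at an arbitrary index, is where
the "nonsensical results" mentioned above would appear.
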
Minimizing the negative impact on systems and programs correctly configured to
use an explicit locale category like ``LC_TIME``, ``LC_MONETARY`` or
``LC_NUMERIC`` while otherwise running in the legacy C locale gives the
following design principles:

* don't make any environmental changes that would alter any existing settings
  for locale categories other than ``LC_CTYPE`` (most notably: don't set
  ``LC_ALL`` or ``LANG``)

Finally, maintaining compatibility with running arbitrary subprocesses in
orchestration use cases leads to the following design principle:

* don't make any Python-specific environmental changes that might be
  incompatible with any still supported version of CPython (including
  CPython 2.7)


Specification
=============

To better handle the cases where CPython would otherwise end up attempting
to operate in the ``C`` locale, this PEP proposes that CPython automatically
attempt to coerce the legacy ``C`` locale to a UTF-8 based locale for the
``LC_CTYPE`` category when it is run as a standalone command line application.

It further proposes to emit a warning on stderr if the legacy ``C`` locale
is in effect for the ``LC_CTYPE`` category at the point where the language
runtime itself is initialized,
and the explicit environmental flag to disable locale coercion is not set, in
order to warn system and application integrators that they're running CPython
in an unsupported configuration.

In addition to these general changes, some additional Android-specific changes
are proposed to handle the differences in the behaviour of ``setlocale`` on that
platform.


Legacy C locale coercion in the standalone Python interpreter binary
--------------------------------------------------------------------

When run as a standalone application, CPython has the opportunity to
reconfigure the C locale before any locale dependent operations are executed
in the process.

This means that it can change the locale settings not only for the CPython
runtime, but also for any other locale-aware components running in the current
process (e.g. as part of extension modules), as well as in subprocesses that
inherit their environment from the current process.

After calling ``setlocale(LC_ALL, "")`` to initialize the locale settings in
the current process, the main interpreter binary will be updated to include
the following call::

    const char *ctype_loc = setlocale(LC_CTYPE, NULL);

This cryptic invocation is the API that C provides to query the current locale
setting without changing it. Given that query, it is possible to check for
exactly the ``C`` locale with ``strcmp``::

    ctype_loc != NULL && strcmp(ctype_loc, "C") == 0  /* true only in the C locale */

This call also returns ``"C"`` when either no particular locale is set, or the
nominal locale is set to an alias for the ``C`` locale (such as ``POSIX``).

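The same query is available from Python through the ``locale`` module, which
can be handy when debugging a deployment. This is only a sketch of the check,
not the interpreter's implementation (which is written in C and runs before
Python code can execute), and ``is_legacy_c_locale`` is a hypothetical helper
name::

    import locale


    def is_legacy_c_locale(ctype_loc):
        """Mirror the strcmp-based check: true only for exactly "C"."""
        return ctype_loc is not None and ctype_loc == "C"


    if __name__ == "__main__":
        # Passing None queries the current setting without changing it.
        current = locale.setlocale(locale.LC_CTYPE, None)
        print(current, is_legacy_c_locale(current))

Note that by the time this runs, any startup locale handling has already been
applied, so it reports the locale in effect afterwards rather than the one
the process was launched with.
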
Given this information, CPython can then attempt to coerce the locale to one
that uses UTF-8 rather than ASCII as the default encoding.

Three such locales will be tried:

* ``C.UTF-8`` (available at least in Debian, Ubuntu, Alpine, and Fedora 25+, and
  expected to be available by default in a future version of glibc)
* ``C.utf8`` (available at least in HP-UX)
* ``UTF-8`` (available in at least some \*BSD variants, including Mac OS X)

The coercion will be implemented by setting the ``LC_CTYPE`` environment
variable to the candidate locale name, such that future calls to
``setlocale()`` will see it, as will other components looking for those
settings (such as GUI development frameworks and Python's own ``locale``
module).

To allow for better cross-platform binary portability and to adjust
automatically to future changes in locale availability, these checks will be
implemented at runtime on all platforms other than Windows, rather than
attempting to determine which locales to try at compile time.

When this locale coercion is activated, the following warning will be
printed on stderr, with the warning containing whichever locale was
successfully configured::

    Python detected LC_CTYPE=C: LC_CTYPE coerced to C.UTF-8 (set another
    locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).

(Note: this warning ended up being silenced by default. See the
Implementation Notes section above for more details.)

As long as the current platform provides at least one of the candidate UTF-8
based environments, this locale coercion will mean that the standard
Python binary *and* locale-aware extensions should once again "just work"
in the three main failure cases we're aware of (missing locale
settings, SSH forwarding of unknown locales via ``LANG`` or ``LC_CTYPE``, and
developers explicitly requesting ``LANG=C``).

The one case where failures may still occur is when ``stderr`` is specifically
being checked for no output, which can be resolved either by configuring
a locale other than the C locale, or else by using a mechanism other than
"there was no output on stderr" to check for subprocess errors (e.g. checking
process return codes).

If none of the candidate locales are successfully configured, or the ``LC_ALL``
locale override is defined in the current process environment, then
initialization will continue in the C locale and the Unicode compatibility
warning described in the next section will be emitted just as it would for
any other application.

If ``PYTHONCOERCECLOCALE=0`` is explicitly set, initialization will continue in
the C locale and the Unicode compatibility warning described in the next
section will be automatically suppressed.

The interpreter will always check for the ``PYTHONCOERCECLOCALE`` environment
variable at startup (even when running under the ``-E`` or ``-I`` switches),
as the locale coercion check necessarily takes place before any command line
argument processing. For consistency, the runtime check to determine whether
or not to suppress the locale compatibility warning will be similarly
independent of these settings.

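The startup rules above can be summarised as a small decision function. This
is a sketch of the behaviour as specified in this section, not of the C
implementation: the function name and the returned labels are hypothetical,
and it assumes that an empty ``LC_ALL`` is treated the same as an unset one.

```python
def startup_locale_action(environ, ctype_locale, target_available):
    """Classify what the standalone interpreter would do at startup.

    Returns one of the hypothetical labels "unchanged", "coerce",
    "stay-in-C-and-warn" or "stay-in-C-silently".
    """
    if ctype_locale != "C":
        return "unchanged"            # a non-C locale is already configured
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return "stay-in-C-silently"   # explicit opt-out also suppresses the warning
    # Assumption: an empty LC_ALL counts as "not defined".
    if environ.get("LC_ALL") or not target_available:
        return "stay-in-C-and-warn"   # coercion is skipped or cannot succeed
    return "coerce"
```

Note that ``environ`` is consulted directly rather than through any
``-E``/``-I`` aware helper, mirroring the requirement that the check be
independent of those switches.
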
Legacy C locale warning during runtime initialization
-----------------------------------------------------

By the time that ``Py_Initialize`` is called, arbitrary locale-dependent
operations may have taken place in the current process. This means that
by the time it is called, it is *too late* to reliably switch to a different
locale - doing so would introduce inconsistencies in decoded text, even in the
context of the standalone Python interpreter binary.

Accordingly, when ``Py_Initialize`` is called and CPython detects that the
configured locale is still the default ``C`` locale and
``PYTHONCOERCECLOCALE=0`` is not set, the following warning will be issued::

    Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
    encoding), which may cause Unicode compatibility problems. Using C.UTF-8,
    C.utf8, or UTF-8 (if available) as alternative Unicode-compatible
    locales is recommended.

(Note: this warning ended up being silenced by default. See the
Implementation Note above for more details)

In this case, no actual change will be made to the locale settings.

Instead, the warning informs both system and application integrators that
they're running Python 3 in a configuration that we don't expect to work
properly.

The second sentence providing recommendations may eventually be conditionally
compiled based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8``
on \*BSD systems), but the initial implementation will just use the common
generic message shown above.

New build-time configuration options
------------------------------------

While both of the above behaviours would be enabled by default, they would
also have new associated configuration options and preprocessor definitions
for the benefit of redistributors that want to override those default settings.

The locale coercion behaviour would be controlled by the flag
``--with[out]-c-locale-coercion``, which would set the ``PY_COERCE_C_LOCALE``
preprocessor definition.

The locale warning behaviour would be controlled by the flag
``--with[out]-c-locale-warning``, which would set the ``PY_WARN_ON_C_LOCALE``
preprocessor definition.

(Note: this compile time warning option ended up being replaced by a runtime
``PYTHONCOERCECLOCALE=warn`` option. See the Implementation Note above for
more details)

On platforms which don't use the ``autotools`` based build system (i.e.
Windows) these preprocessor variables would always be undefined.

Changes to the default error handling on the standard streams
-------------------------------------------------------------

Since Python 3.5, CPython has defaulted to using ``surrogateescape`` on the
standard streams (``sys.stdin``, ``sys.stdout``) when it detects that the
current locale is ``C`` and no specific error handler has been set using
either the ``PYTHONIOENCODING`` environment variable or the
``Py_SetStandardStreamEncoding`` API. For other locales, the default error
handler for the standard streams is ``strict``.

In order to preserve this behaviour without introducing any behavioural
discrepancies between locale coercion and explicitly configuring a locale, the
coercion target locales (``C.UTF-8``, ``C.utf8``, and ``UTF-8``) will be added
to the list of locales that use ``surrogateescape`` as their default error
handler for the standard streams.

No changes are proposed to the default error handler for ``sys.stderr``: that
will continue to be ``backslashreplace``.

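The resulting defaults can be expressed as a lookup. The following is a
minimal sketch under stated assumptions: ``default_stream_errors`` and
``SURROGATEESCAPE_LOCALES`` are hypothetical names, and the
``explicit_handler`` parameter stands in for a handler supplied through
``PYTHONIOENCODING`` or the embedding API.

```python
# Locales whose stdin/stdout default to surrogateescape under this proposal:
# the legacy C locale plus the three coercion targets.
SURROGATEESCAPE_LOCALES = frozenset({"C", "C.UTF-8", "C.utf8", "UTF-8"})


def default_stream_errors(stream_name, ctype_locale, explicit_handler=None):
    """Return the error handler for "stdin", "stdout" or "stderr"."""
    if stream_name == "stderr":
        return "backslashreplace"     # unchanged by this PEP
    if explicit_handler is not None:
        return explicit_handler       # an explicit setting always wins
    if ctype_locale in SURROGATEESCAPE_LOCALES:
        return "surrogateescape"
    return "strict"
```

On a running interpreter, the handlers actually in effect can be inspected via
``sys.stdin.errors``, ``sys.stdout.errors`` and ``sys.stderr.errors``.
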
Changes to locale settings on Android
-------------------------------------

Independently of the other changes in this PEP, CPython on Android systems
will be updated to call ``setlocale(LC_ALL, "C.UTF-8")`` where it currently
calls ``setlocale(LC_ALL, "")`` and ``setlocale(LC_CTYPE, "C.UTF-8")`` where
it currently calls ``setlocale(LC_CTYPE, "")``.

This Android-specific behaviour is being introduced due to the following
Android-specific details:

* on Android, passing ``""`` to ``setlocale`` is equivalent to passing ``"C"``
* the ``C.UTF-8`` locale is always available

Platform Support Changes
========================

A new "Legacy C Locale" section will be added to :pep:`11` that states:

* as of CPython 3.7, \*nix platforms are expected to provide at least one of
  ``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
  ``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
  Any Unicode related integration problems that occur only in the legacy ``C``
  locale and cannot be reproduced in an appropriately configured non-ASCII
  locale will be closed as "won't fix".

Rationale
=========

Improving the handling of the C locale
--------------------------------------

It has been clear for some time that the C locale's default encoding of
``ASCII`` is entirely the wrong choice for development of modern networked
services. Newer languages like Rust and Go have eschewed that default entirely,
and instead made it a deployment requirement that systems be configured to use
UTF-8 as the text encoding for operating system interfaces. Similarly, Node.js
assumes UTF-8 by default (a behaviour inherited from the V8 JavaScript engine)
and requires custom build settings to indicate it should use the system
locale settings for locale-aware operations. Both the JVM and the .NET CLR
use UTF-16-LE as their primary encoding for passing text between applications
and the application runtime (i.e. the JVM/CLR, not the host operating system).

The challenge for CPython has been the fact that in addition to being used for
network service development, it is also extensively used as an embedded
scripting language in larger applications, and as a desktop application
development language, where it is more important to be consistent with other
locale-aware components sharing the same process, as well as with the user's
desktop locale settings, than it is with the emergent conventions of modern
network service development.

The core premise of this PEP is that for *all* of these use cases, the
assumption of ASCII implied by the default "C" locale is the wrong choice,
and furthermore that the following assumptions are valid:

* in desktop application use cases, the process locale will *already* be
  configured appropriately, and if it isn't, then that is an operating system
  or embedding application level problem that needs to be reported to and
  resolved by the operating system provider or application developer
* in network service development use cases (especially those based on Linux
  containers), the process locale may not be configured *at all*, and if it
  isn't, then the expectation is that components will impose their own default
  encoding the way Rust, Go and Node.js do, rather than trusting the legacy C
  default encoding of ASCII the way CPython currently does

Defaulting to "surrogateescape" error handling on the standard IO streams
-------------------------------------------------------------------------

By coercing the locale away from the legacy C default and its assumption of
ASCII as the preferred text encoding, this PEP also disables the implicit use
of the "surrogateescape" error handler on the standard IO streams that was
introduced in Python 3.5 ([15]_), as well as the automatic use of
``surrogateescape`` when operating in :pep:`540`'s proposed UTF-8 mode.

Rather than introducing yet another configuration option to adjust that
behaviour, this PEP instead proposes to extend the "surrogateescape" default
for ``stdin`` and ``stdout`` error handling to also apply to the three
potential coercion target locales.

The aim of this behaviour is to attempt to ensure that operating system
provided text values are typically able to be transparently passed through a
Python 3 application even if it is incorrect in assuming that that text has
been encoded as UTF-8.

In particular, GB 18030 [12]_ is a Chinese national text encoding standard
that handles all Unicode code points, that is formally incompatible with both
ASCII and UTF-8, but will nevertheless often tolerate processing as surrogate
escaped data - the points where GB 18030 reuses ASCII byte values in an
incompatible way are likely to be invalid in UTF-8, and will therefore be
escaped and opaque to string processing operations that split on or search for
the relevant ASCII code points. Operations that don't involve splitting on or
searching for particular ASCII or Unicode code point values are almost
certain to work correctly.

Similarly, Shift-JIS [13]_ and ISO-2022-JP [14]_ remain in widespread use in
Japan, and are incompatible with both ASCII and UTF-8, but will tolerate text
processing operations that don't involve splitting on or searching for
particular ASCII or Unicode code point values.

As an example, consider two files, one encoded with UTF-8 (the default encoding
for ``en_AU.UTF-8``), and one encoded with GB-18030 (the default encoding for
``zh_CN.gb18030``)::

    $ python3 -c 'open("utf8.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("utf-8"))'
    $ python3 -c 'open("gb18030.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("gb18030"))'

On disk, we can see that these are two very different files::

    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "rb").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "rb").read().strip())'
    UTF-8:   b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
    GB18030: b'\x816\xbd6\x810\x9d0\x817\xa29\x816\xbc4\x810\x8b3\x816\x8d6'

That nevertheless can both be rendered correctly to the terminal as long as
they're decoded prior to printing::

    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())'
    UTF-8:   ℙƴ☂ℌøἤ
    GB18030: ℙƴ☂ℌøἤ

By contrast, if we just pass along the raw bytes, as ``cat`` and similar C/C++
utilities will tend to do::

    $ LANG=en_AU.UTF-8 cat utf8.txt gb18030.txt
    ℙƴ☂ℌøἤ
    �6�6�0�0�7�9�6�4�0�3�6�6

Even setting a specifically Chinese locale won't help in getting the
GB-18030 encoded file rendered correctly::

    $ LANG=zh_CN.gb18030 cat utf8.txt gb18030.txt
    ℙƴ☂ℌøἤ
    �6�6�0�0�7�9�6�4�0�3�6�6

The problem is that the *terminal* encoding setting remains UTF-8, regardless
of the nominal locale. A GB18030 terminal can be emulated using the ``iconv``
utility::

    $ cat utf8.txt gb18030.txt | iconv -f GB18030 -t UTF-8
    鈩櫰粹槀鈩屆羔激
    ℙƴ☂ℌøἤ

This reverses the problem, such that the GB18030 file is rendered correctly,
but the UTF-8 file has been converted to unrelated hanzi characters, rather than
the expected rendering of "Python" as non-ASCII characters.

With the emulated GB18030 terminal encoding, assuming UTF-8 in Python results
in *both* files being displayed incorrectly::

    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
      | iconv -f GB18030 -t UTF-8
    UTF-8:   鈩櫰粹槀鈩屆羔激
    GB18030: 鈩櫰粹槀鈩屆羔激

However, setting the locale correctly means that the emulated GB18030 terminal
now displays both files as originally intended::

    $ LANG=zh_CN.gb18030 \
      python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
      | iconv -f GB18030 -t UTF-8
    UTF-8:   ℙƴ☂ℌøἤ
    GB18030: ℙƴ☂ℌøἤ

The rationale for retaining ``surrogateescape`` as the default error handler
for the standard input and output streams is that it will preserve the
following helpful behaviour in the ``C`` locale::

    $ cat gb18030.txt \
      | LANG=C python3 -c "import sys; print(sys.stdin.read())" \
      | iconv -f GB18030 -t UTF-8
    ℙƴ☂ℌøἤ

Rather than reverting to the exception currently seen when a UTF-8 based locale is
explicitly configured::

    $ cat gb18030.txt \
      | python3 -c "import sys; print(sys.stdin.read())" \
      | iconv -f GB18030 -t UTF-8
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte

As an added benefit, environments explicitly configured to use one of the
coercion target locales will implicitly gain the encoding transparency behaviour
currently enabled by default in the ``C`` locale.

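The pass-through behaviour that ``surrogateescape`` provides can also be
checked without a terminal or a pipeline. The snippet below is a minimal
sketch using only the standard codecs; the variable names are illustrative.
It decodes GB 18030 bytes as if they were UTF-8 and confirms that re-encoding
restores the original bytes exactly, whereas strict decoding refuses the data.

```python
text = "ℙƴ☂ℌøἤ\n"
gb_bytes = text.encode("gb18030")

# Strict UTF-8 decoding rejects the GB 18030 data outright.
try:
    gb_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With surrogateescape, undecodable bytes become lone surrogates
# (U+DC80..U+DCFF) and are restored unchanged when re-encoded.
escaped = gb_bytes.decode("utf-8", errors="surrogateescape")
round_tripped = escaped.encode("utf-8", errors="surrogateescape")
```

The escaped string is not meaningful text, which is why operations that
search for or split on particular code points remain unreliable; only opaque
pass-through is preserved.
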
Avoiding setting PYTHONIOENCODING during UTF-8 locale coercion
--------------------------------------------------------------

Rather than changing the default handling of the standard streams during
interpreter initialization, earlier versions of this PEP proposed setting
``PYTHONIOENCODING`` to ``utf-8:surrogateescape``. This turned out to create
a significant compatibility problem: since the ``surrogateescape`` handler
only exists in Python 3.1+, running Python 2.7 processes in subprocesses could
potentially break in a confusing way with that configuration.

The current design means that earlier Python versions will instead retain their
default ``strict`` error handling on the standard streams, while Python 3.7+
will consistently use the more permissive ``surrogateescape`` handler even
when these locales are explicitly configured (rather than being reached through
locale coercion).

Dropping official support for ASCII based text handling in the legacy C locale
------------------------------------------------------------------------------

We've been trying to get strict bytes/text separation to work reliably in the
legacy C locale for over a decade at this point. Not only haven't we been able
to get it to work, neither has anyone else - the only viable alternatives
identified have been to pass the bytes along verbatim without eagerly decoding
them to text (C/C++, Python 2.x, Ruby, etc), or else to largely ignore the
nominal C/C++ locale encoding and assume the use of either UTF-8 (:pep:`540`,
Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).

While this PEP ensures that developers who genuinely need to do so can still
opt in to running their Python code in the legacy C locale (by setting
``LC_ALL=C``, ``PYTHONCOERCECLOCALE=0``, or running a custom build that sets
``--without-c-locale-coercion``), it also makes it clear that we *don't*
expect Python 3's Unicode handling to be completely reliable in that
configuration, and the recommended alternative is to use a more appropriate
locale setting (potentially in combination with :pep:`540`'s UTF-8 mode, if that
is available).

Providing implicit locale coercion only when running standalone
---------------------------------------------------------------

The major downside of the proposed design in this PEP is that it introduces a
potential discrepancy between the behaviour of the CPython runtime when it is
run as a standalone application and when it is run as an embedded component
inside a larger system (e.g. ``mod_wsgi`` running inside Apache ``httpd``).

Over the course of Python 3.x development, multiple attempts have been made
to improve the handling of incorrect locale settings at the point where the
Python interpreter is initialised. The problem that emerged is that this is
ultimately *too late* in the interpreter startup process - data such as command
line arguments and the contents of environment variables may have already been
retrieved from the operating system and processed under the incorrect ASCII
text encoding assumption well before ``Py_Initialize`` is called.

The problems created by those inconsistencies were then even harder to diagnose
and debug than those created by believing the operating system's claim that
ASCII was a suitable encoding to use for operating system interfaces. This was
the case even for the default CPython binary, let alone larger C/C++
applications that embed CPython as a scripting engine.

The approach proposed in this PEP handles that problem by moving the locale
coercion as early as possible in the interpreter startup sequence when running
standalone: it takes place directly in the C-level ``main()`` function, even
before calling in to the ``Py_Main()`` library function that implements the
features of the CPython interpreter CLI.

The ``Py_Initialize`` API then only gains an explicit warning (emitted on
``stderr``) when it detects use of the ``C`` locale, and relies on the
embedding application to specify something more reasonable.

That said, the reference implementation for this PEP adds most of the
functionality to the shared library, with the CLI being updated to
unconditionally call two new private APIs::

    if (_Py_LegacyLocaleDetected()) {
        _Py_CoerceLegacyLocale();
    }

These are similar to other "pre-configuration" APIs intended for embedding
applications: they're designed to be called *before* ``Py_Initialize``, and
hence change the way the interpreter gets initialized.

If these were made public (either as part of this PEP or in a subsequent RFE),
then it would be straightforward for other embedding applications to recreate
the same behaviour as is proposed for the CPython CLI.

Allowing restoration of the legacy behaviour
--------------------------------------------

The CPython command line interpreter is often used to investigate faults that
occur in other applications that embed CPython, and those applications may still
be using the C locale even after this PEP is implemented.

Providing a simple on/off switch for the locale coercion behaviour makes it
much easier to reproduce the behaviour of such applications for debugging
purposes, as well as making it easier to reproduce the behaviour of older 3.x
runtimes even when running a version with this change applied.

Querying LC_CTYPE for C locale detection
----------------------------------------

``LC_CTYPE`` is the actual locale category that CPython relies on to drive the
implicit decoding of environment variables, command line arguments, and other
text values received from the operating system.

As such, it makes sense to check it specifically when attempting to determine
whether or not the current locale configuration is likely to cause Unicode
handling problems.

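From Python code, the same category can be queried through the ``locale``
module. The sketch below is illustrative rather than a description of the C
level check: the function names are hypothetical, and treating ``POSIX`` as an
alias for ``C`` is an assumption based on how POSIX defines the two names.

```python
import locale


def is_legacy_c_locale(ctype_name):
    """True for names under which the legacy ASCII-based locale is reported."""
    # Assumption: "POSIX" is treated as an alias of "C".
    return ctype_name in ("C", "POSIX")


def legacy_locale_detected():
    """Check the LC_CTYPE category currently active in this process."""
    # Passing None queries the category without changing it.
    return is_legacy_c_locale(locale.setlocale(locale.LC_CTYPE, None))
```

Because the interpreter may already have adjusted ``LC_CTYPE`` during its own
startup, ``legacy_locale_detected()`` reports the state after any coercion has
taken place, not the state the process was launched with.
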
Explicitly setting LC_CTYPE for UTF-8 locale coercion
-----------------------------------------------------

Python is often used as a glue language, integrating other C/C++ ABI compatible
components in the current process, and components written in arbitrary
languages in subprocesses.

Setting ``LC_CTYPE`` to ``C.UTF-8`` is important to handle cases where the
problem has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a
system where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
configured to forward locale settings, and the user logs into a Linux server).

This should be sufficient to ensure that when the locale coercion is activated,
the switch to the UTF-8 based locale will be applied consistently across the
current process and any subprocesses that inherit the current environment.

Avoiding setting LANG for UTF-8 locale coercion
-----------------------------------------------

Earlier versions of this PEP proposed setting the ``LANG`` category independent
default locale, in addition to setting ``LC_CTYPE``.

This was later removed on the grounds that setting only ``LC_CTYPE`` is
sufficient to handle all of the problematic scenarios that the PEP aimed
to resolve, while setting ``LANG`` as well would break cases where ``LANG``
was set correctly, and the locale problems were solely due to an incorrect
``LC_CTYPE`` setting ([22]_).

For example, consider a Python application that called the Linux ``date``
utility in a subprocess rather than doing its own date formatting::

    $ LANG=ja_JP.UTF-8 LC_CTYPE=C date
    2017年 5月 23日 火曜日 17:31:03 JST

    $ LANG=ja_JP.UTF-8 LC_CTYPE=C.UTF-8 date  # Coercing only LC_CTYPE
    2017年 5月 23日 火曜日 17:32:58 JST

    $ LANG=C.UTF-8 LC_CTYPE=C.UTF-8 date  # Coercing both of LC_CTYPE and LANG
    Tue May 23 17:31:10 JST 2017

With only ``LC_CTYPE`` updated in the Python process, the subprocess would
continue to behave as expected. However, if ``LANG`` was updated as well,
that would effectively override the ``LC_TIME`` setting and use the wrong
date formatting conventions.

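The effect on ``LC_TIME`` follows from the order in which the locale
environment variables are consulted: ``LC_ALL`` first, then the category
specific variable, then ``LANG``. The sketch below models that lookup; the
function name is hypothetical, and falling back to ``"C"`` when nothing is set
is a simplification of what POSIX leaves implementation-defined.

```python
def effective_locale(category, environ):
    """Resolve a category such as "LC_TIME" from the environment, in the
    order that setlocale(category, "") consults the variables."""
    for name in ("LC_ALL", category, "LANG"):
        value = environ.get(name)
        if value:                 # unset and empty values both fall through
            return value
    return "C"                    # simplification of the platform default


# The three environments from the ``date`` example above.
original = {"LANG": "ja_JP.UTF-8", "LC_CTYPE": "C"}
only_ctype = dict(original, LC_CTYPE="C.UTF-8")
both = dict(only_ctype, LANG="C.UTF-8")
```

With ``only_ctype``, ``LC_TIME`` still resolves through ``LANG`` to
``ja_JP.UTF-8``; with ``both``, it resolves to ``C.UTF-8`` and the Japanese
date formatting is lost. The same ordering shows why a defined ``LC_ALL``
would mask any coerced ``LC_CTYPE`` value.
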
Avoiding setting LC_ALL for UTF-8 locale coercion
-------------------------------------------------

Earlier versions of this PEP proposed setting the ``LC_ALL`` locale override,
in addition to setting ``LC_CTYPE``.

This was changed after it was determined that just setting ``LC_CTYPE`` (and,
at that stage of the proposal, ``LANG``) should be sufficient to handle all the
scenarios the PEP aims to cover, as it avoids causing any problems in cases
like the following::

    $ LANG=C LC_MONETARY=ja_JP.utf8 ./python -c \
        "from locale import setlocale, LC_ALL, currency; setlocale(LC_ALL, ''); print(currency(1e6))"
    ¥1000000

Skipping locale coercion if LC_ALL is set in the current environment
--------------------------------------------------------------------

With locale coercion now only setting ``LC_CTYPE``, it will have
no effect if ``LC_ALL`` is also set. To avoid emitting a spurious locale
coercion notice in that case, coercion is instead skipped entirely.

Considering locale coercion independently of "UTF-8 mode"
---------------------------------------------------------

With both this PEP's locale coercion and :pep:`540`'s UTF-8 mode under
consideration for Python 3.7, it makes sense to ask whether or not we can
limit ourselves to only doing one or the other, rather than making both
changes.

The UTF-8 mode proposed in :pep:`540` has two major limitations that make it a
potential complement to this PEP rather than a potential replacement.

First, unlike this PEP, :pep:`540`'s UTF-8 mode makes it possible to change default
behaviours that are not currently configurable at all. While that's exactly
what makes the proposal interesting, it's also what makes it an entirely
unproven approach. By contrast, the approach proposed in this PEP builds
directly atop existing configuration settings for the C locale system (
``LC_CTYPE``, ``LANG``) and Python's standard streams (``PYTHONIOENCODING``)
that have already been in use for years to handle the kinds of compatibility
problems discussed in this PEP.

Second, one of the things we know based on that experience is that the
proposed locale coercion can resolve problems not only in CPython itself,
but also in extension modules that interact with the standard streams, like
GNU readline. As an example, consider the following interactive session
from a :pep:`538` enabled CPython build, where each line after the first is
executed by doing "up-arrow, left-arrow x4, delete, enter"::

    $ LANG=C ./python
    Python 3.7.0a0 (heads/pep538-coerce-c-locale:188e780, May 7 2017, 00:21:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

This is exactly what we'd expect from a well-behaved command history editor.

By contrast, the following is what currently happens on an older release if
you only change the Python-level stream encoding settings without updating the
locale settings::

    $ LANG=C PYTHONIOENCODING=utf-8:surrogateescape python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌ<E29882>")
      File "<stdin>", line 0

        ^
    SyntaxError: 'utf-8' codec can't decode bytes in position 20-21:
    invalid continuation byte

That particular misbehaviour is coming from GNU readline, *not* CPython:
because the command history editing wasn't UTF-8 aware, it corrupted the history
buffer and fed such nonsense to stdin that even the surrogateescape error
handler was bypassed. While :pep:`540`'s UTF-8 mode could technically be updated
to also reconfigure readline, that's just *one* extension module that might
be interacting with the standard streams without going through the CPython
C API, and any change made by CPython would only apply when readline is running
directly as part of Python 3.7 rather than in a separate subprocess.

However, if we actually change the configured locale, GNU readline starts
behaving itself, without requiring any changes to the embedding application::

    $ LANG=C.UTF-8 python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>
    $ LC_CTYPE=C.UTF-8 python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

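The mismatch behind the readline failure can also be observed without an
interactive session. The following is a minimal sketch, assuming a \*nix system
where ``locale.nl_langinfo`` is available. The ``PROBE`` snippet and the
``probe`` helper are hypothetical names introduced for illustration, and the
child interpreter is run with ``PYTHONCOERCECLOCALE=0`` and ``PYTHONUTF8=0`` so
that newer Python versions behave like the older releases discussed above::

    import json
    import os
    import subprocess
    import sys

    # Code run in the child: report the C library's view and Python's view.
    PROBE = """
    import json, locale, sys
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # requested locale is not installed; the C locale stays active
    print(json.dumps({
        "c_library_codeset": locale.nl_langinfo(locale.CODESET),
        "python_stdout": sys.stdout.encoding,
    }))
    """


    def probe(**overrides):
        """Run PROBE in a child interpreter with a controlled locale environment."""
        # Start from the current environment minus any existing locale settings.
        cleared = ("LANG", "LC_ALL", "LC_CTYPE", "PYTHONIOENCODING")
        env = {k: v for k, v in os.environ.items() if k not in cleared}
        # Switch off the child's own locale coercion and UTF-8 mode.
        env.update({"PYTHONCOERCECLOCALE": "0", "PYTHONUTF8": "0"})
        env.update(overrides)
        # PROBE is indented here to match this block, so strip that indentation
        # line by line before handing it to the child interpreter.
        code = "\n".join(
            line[4:] if line.startswith("    ") else line
            for line in PROBE.splitlines()
        )
        out = subprocess.run(
            [sys.executable, "-c", code],
            env=env, stdout=subprocess.PIPE, check=True,
        ).stdout
        return json.loads(out.decode("ascii"))

Calling ``probe(LANG="C", PYTHONIOENCODING="utf-8")`` reports a UTF-8
``python_stdout``, while ``c_library_codeset`` would be expected to remain
ASCII-based: only Python has been reconfigured, so locale-aware C components
such as readline still see the legacy locale. Calling
``probe(LC_CTYPE="C.UTF-8")`` on a system that provides that locale would be
expected to bring the two values into agreement.
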

Enabling C locale coercion and warnings on Mac OS X, iOS and Android
--------------------------------------------------------------------

On Mac OS X, iOS, and Android, CPython already assumes the use of UTF-8 for
system interfaces, and we expect most other locale-aware components to do the
same.

Accordingly, this PEP originally proposed to disable locale coercion and
warnings at build time for these platforms, on the assumption that they would
be entirely redundant.

However, that assumption turned out to be incorrect, as subsequent
investigations showed that if you explicitly configure ``LANG=C`` on
these platforms, extension modules like GNU readline will misbehave in much the
same way as they do on other \*nix systems. [21]_

In addition, Mac OS X is frequently used as a development and testing
platform for Python software intended for deployment to other \*nix environments
(such as Linux or Android), and Linux is similarly often used as a development
and testing platform for mobile and Mac OS X applications.

As a result, this PEP enables the locale coercion and warning features by
default on all platforms that use CPython's ``autotools`` based build toolchain
(i.e. everywhere other than Windows).


Implementation
==============

The reference implementation is being developed in the
``pep538-coerce-c-locale`` feature branch [18]_ in Alyssa Coghlan's fork of the
CPython repository on GitHub. A work-in-progress PR is available at [20]_.

This reference implementation covers not only the enhancement request in
issue 28180 [1]_, but also the Android compatibility fixes needed to resolve
issue 28997 [16]_.


Backporting to earlier Python 3 releases
========================================

Backporting to Python 3.6.x
---------------------------

If this PEP is accepted for Python 3.7, redistributors backporting the change
specifically to their initial Python 3.6.x release will be both allowed and
encouraged. However, such backports should only be undertaken either in
conjunction with the changes needed to also provide a suitable locale by
default, or else specifically for platforms where such a locale is already
consistently available.

At least the Fedora project is planning to pursue this approach for the
upcoming Fedora 26 release [19]_.


Backporting to other 3.x releases
---------------------------------

While the proposed behavioural change is seen primarily as a bug fix addressing
Python 3's current misbehaviour in the default ASCII-based C locale, it still
represents a reasonably significant change in the way CPython interacts with
the C locale system. As such, while some redistributors may still choose to
backport it to even earlier Python 3.x releases based on the needs and
interests of their particular user base, this wouldn't be encouraged as a
general practice.

However, configuring Python 3 *environments* (such as base container
images) to use these configuration settings by default is both allowed
and recommended.


Acknowledgements
================

The locale coercion approach proposed in this PEP is inspired directly by
Armin Ronacher's handling of this problem in the ``click`` command line
utility development framework [2]_::

    $ LANG=C python3 -c 'import click; cli = click.command()(lambda:None); cli()'
    Traceback (most recent call last):
      ...
    RuntimeError: Click will abort further execution because Python 3 was
    configured to use ASCII as encoding for the environment. Either run this
    under Python 2 or consult http://click.pocoo.org/python3/ for mitigation
    steps.

    This system supports the C.UTF-8 locale which is recommended.
    You might be able to resolve your issue by exporting the
    following environment variables:

        export LC_ALL=C.UTF-8
        export LANG=C.UTF-8

The change was originally proposed as a downstream patch for Fedora's
system Python 3.6 package [3]_, and then reformulated as a PEP for Python 3.7
with a section allowing for backports to earlier versions by redistributors.
In parallel with the development of the upstream patch, Charalampos Stratakis
has been working on the Fedora 26 backport and providing feedback on the
practical viability of the proposed changes.

The initial draft was posted to the Python Linux SIG for discussion [10]_ and
then amended based on both that discussion and Victor Stinner's work in
:pep:`540` [11]_.

The "ℙƴ☂ℌøἤ" string used in the Unicode handling examples throughout this PEP
is taken from Ned Batchelder's excellent "Pragmatic Unicode" presentation [9]_.

Stephen Turnbull has long provided valuable insight into the text encoding
handling challenges he regularly encounters at the University of Tsukuba
(筑波大学).


References
==========

.. [1] CPython: sys.getfilesystemencoding() should default to utf-8
   (https://bugs.python.org/issue28180)

.. [2] Locale configuration required for click applications under Python 3
   (https://click.palletsprojects.com/en/5.x/python3/#python-3-surrogate-handling)

.. [3] Fedora: force C.UTF-8 when Python 3 is run under the C locale
   (https://bugzilla.redhat.com/show_bug.cgi?id=1404918)

.. [4] GNU C: How Programs Set the Locale
   (https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html)

.. [5] GNU C: Locale Categories
   (https://www.gnu.org/software/libc/manual/html_node/Locale-Categories.html)

.. [6] glibc C.UTF-8 locale proposal
   (https://sourceware.org/glibc/wiki/Proposals/C.UTF-8)

.. [7] GNOME Flatpak
   (https://flatpak.org/)

.. [8] Ubuntu Snappy
   (https://www.ubuntu.com/desktop/snappy)

.. [9] Pragmatic Unicode
   (https://nedbatchelder.com/text/unipain.html)

.. [10] linux-sig discussion of initial PEP draft
   (https://mail.python.org/pipermail/linux-sig/2017-January/000014.html)

.. [11] Feedback notes from linux-sig discussion and PEP 540
   (https://github.com/python/peps/issues/171)

.. [12] GB 18030
   (https://en.wikipedia.org/wiki/GB_18030)

.. [13] Shift-JIS
   (https://en.wikipedia.org/wiki/Shift_JIS)

.. [14] ISO-2022
   (https://en.wikipedia.org/wiki/ISO/IEC_2022)

.. [15] Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale
   (https://bugs.python.org/issue19977)

.. [16] test_readline.test_nonascii fails on Android
   (https://bugs.python.org/issue28997)

.. [17] UTF-8 locale discussion on "locale.getdefaultlocale() fails on Mac OS X with default language set to English"
   (https://bugs.python.org/issue18378#msg215215)

.. [18] GitHub branch diff for ``ncoghlan:pep538-coerce-c-locale``
   (https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-locale)

.. [19] Fedora 26 change proposal for locale coercion backport
   (https://fedoraproject.org/wiki/Changes/python3_c.utf-8_locale)

.. [20] GitHub pull request for the reference implementation
   (https://github.com/python/cpython/pull/659)

.. [21] GNU readline misbehaviour on Mac OS X with ``LANG=C``
   (https://mail.python.org/pipermail/python-dev/2017-May/147897.html)

.. [22] Potential problems when setting LANG in addition to setting LC_CTYPE
   (https://mail.python.org/pipermail/python-dev/2017-May/147968.html)


Copyright
=========

This document has been placed in the public domain under the terms of the
CC0 1.0 license: https://creativecommons.org/publicdomain/zero/1.0/