285 lines
12 KiB
Plaintext
285 lines
12 KiB
Plaintext
|
PEP: 538
|
|||
|
Title: Coercing the legacy C locale to C.UTF-8
|
|||
|
Version: $Revision$
|
|||
|
Last-Modified: $Date$
|
|||
|
Author: Nick Coghlan <ncoghlan@gmail.com>
|
|||
|
Status: Draft
|
|||
|
Type: Standards Track
|
|||
|
Content-Type: text/x-rst
|
|||
|
Created: 28-Dec-2016
|
|||
|
Python-Version: 3.7
|
|||
|
|
|||
|
|
|||
|
Abstract
|
|||
|
========
|
|||
|
|
|||
|
An ongoing challenge with Python 3 on \*nix systems is the conflict between
|
|||
|
needing to use the configured locale encoding by default for consistency with
|
|||
|
other C/C++ components in the same process, and the fact that the standard C
|
|||
|
locale (as defined in POSIX:2001) specifies a default encoding of ASCII, which
|
|||
|
is entirely inappropriate for the development of networked services in a
|
|||
|
multilingual world.
|
|||
|
|
|||
|
This PEP proposes that the CPython implementation be changed such that:
|
|||
|
|
|||
|
* when used as a library, ``Py_Initialize`` will warn that use of the legacy
|
|||
|
``C`` locale may cause various Unicode compatibility issues
|
|||
|
* when used as a standalone binary, CPython will automatically coerce the
|
|||
|
``C`` locale to ``C.UTF-8`` unless the new ``PYTHONALLOWCLOCALE`` environment
|
|||
|
variable is set
|
|||
|
|
|||
|
With this change, any \*nix platform that does *not* offer the ``C.UTF-8``
|
|||
|
locale as part of its standard configuration will only be considered a
|
|||
|
fully supported platform for CPython 3.7+ deployments when a non-ASCII locale
|
|||
|
is set explicitly.
|
|||
|
|
|||
|
Redistributors (such as Linux distributions) with a narrower target audience
|
|||
|
may also choose to opt in to this behaviour for earlier Python 3.x releases by
|
|||
|
applying the necessary changes as a downstream patch to those versions.
|
|||
|
|
|||
|
|
|||
|
Specification
|
|||
|
=============
|
|||
|
|
|||
|
When ``Py_Initialize`` is called and CPython detects that the configured locale
|
|||
|
is the default ``C`` locale, the following warning will be issued::
|
|||
|
|
|||
|
Py_Initialize detected LC_CTYPE=C, which limits Unicode compatibility. Some
|
|||
|
libraries and operating system interfaces may not work correctly. Set
|
|||
|
`PYTHONALLOWCLOCALE=1 LC_CTYPE=C` to configure a similar environment
|
|||
|
when running Python directly.
|
|||
|
|
|||
|
By contrast, when CPython *is* the main application, it will instead
|
|||
|
automatically coerce the legacy C locale to the multilingual C.UTF-8 locale::
|
|||
|
|
|||
|
Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set
|
|||
|
PYTHONALLOWCLOCALE to disable this locale coercion behaviour).
|
|||
|
|
|||
|
This coercion is implemented by actually setting the ``LANG`` and ``LC_ALL``
|
|||
|
environment variables to ``C.UTF-8``, such that future calls to ``setlocale()``
|
|||
|
will see them, as will other components looking for those settings (such as
|
|||
|
GUI development frameworks).
|
|||
|
|
|||
|
The locale coercion will be skipped if the ``PYTHONALLOWCLOCALE`` environment
|
|||
|
variable is set to a non-empty string. The interpreter will always check for
|
|||
|
the ``PYTHONALLOWCLOCALE`` environment variable (even when running under the
|
|||
|
``-E`` or ``-I`` switches), as the locale coercion check necessarily takes
|
|||
|
place before any command line argument processing.
|
|||
|
|
|||
|
|
|||
|
Platform Support Changes
|
|||
|
========================
|
|||
|
|
|||
|
A new "Legacy C Locale" section will be added to PEP 11 that states:
|
|||
|
|
|||
|
* as of Python 3.7, the legacy C locale is no longer officially supported,
|
|||
|
and any Unicode handling issues that occur only in that locale and cannot be
|
|||
|
reproduced in an appropriately configured non-ASCII locale will be closed as
|
|||
|
"won't fix"
|
|||
|
* as of Python 3.7, \*nix platforms are expected to provide the ``C.UTF-8``
|
|||
|
locale as an alternative to the legacy ``C`` locale. On platforms which don't
|
|||
|
yet provide that locale, an explicit non-ASCII locale setting will be needed
|
|||
|
to configure a supported environment for running Python 3.7+
|
|||
|
|
|||
|
|
|||
|
Rationale
|
|||
|
=========
|
|||
|
|
|||
|
|
|||
|
Improving the handling of the C locale
|
|||
|
--------------------------------------
|
|||
|
|
|||
|
It has been clear for some time that the C locale's default encoding of
|
|||
|
``ASCII`` is entirely the wrong choice for development of modern networked
|
|||
|
services. Newer languages like Rust and Go have eschewed that default entirely,
|
|||
|
and instead made it a deployment requirement that systems be configured to use
|
|||
|
UTF-8 as the text encoding for operating system interfaces. Similarly, Node.js
|
|||
|
assumes UTF-8 by default (a behaviour inherited from the V8 JavaScript engine)
|
|||
|
and requires custom build settings to indicate it should use the system
|
|||
|
locale settings for locale-aware operations.
|
|||
|
|
|||
|
The challenge for CPython has been the fact that in addition to being used for
|
|||
|
network service development, it is also extensively used as an embedded
|
|||
|
scripting language in larger applications, and as a desktop application
|
|||
|
development language, where it is more important to be consistent with other
|
|||
|
C/C++ components sharing the same process, as well as with the user's desktop
|
|||
|
locale settings, than it is with the emergent conventions of modern network
|
|||
|
service development.
|
|||
|
|
|||
|
The premise of this PEP is that for *all* of these use cases, the default "C"
|
|||
|
locale is wrong, and furthermore that the following assumptions are valid:
|
|||
|
|
|||
|
* in desktop application use cases, the process locale will *already* be
|
|||
|
configured appropriately, and if it isn't, then that is an operating system
|
|||
|
level problem that needs to be reported to and resolved by the operating
|
|||
|
system provider
|
|||
|
* in network service development use cases (especially those based on Linux
|
|||
|
containers), the process locale may not be configured *at all*, and if it
|
|||
|
isn't, then the expectation is that components will impose their own default
|
|||
|
encoding the way Rust, Go and Node.js do, rather than trusting the legacy C
|
|||
|
default encoding of ASCII the way CPython currently does
|
|||
|
|
|||
|
|
|||
|
Dropping official support for Unicode handling in the legacy C locale
|
|||
|
---------------------------------------------------------------------
|
|||
|
|
|||
|
We've been trying to get strict bytes/text separation to work reliably in the
|
|||
|
legacy C locale for over a decade at this point. Not only haven't we been able
|
|||
|
to get it to work, neither has anyone else - the only viable alternatives
|
|||
|
identified have been to pass the bytes along verbatim without eagerly decoding
|
|||
|
them to text (Python 2, Ruby, etc), or else to ignore the nominal locale
|
|||
|
encoding entirely and assume the use of UTF-8 (Rust, Go, Node.js, etc).
|
|||
|
|
|||
|
While this PEP ensures that developers that need to do so can still opt-in to
|
|||
|
running their Python code in the legacy C locale, it also makes clear that we
|
|||
|
*don't* expect Python 3's Unicode handling to be reliable in that configuration,
|
|||
|
and the recommended alternative is to use a more appropriate locale setting.
|
|||
|
|
|||
|
|
|||
|
Providing implicit locale coercion only when running standalone
|
|||
|
---------------------------------------------------------------
|
|||
|
|
|||
|
Over the course of Python 3.x development, multiple attempts have been made
|
|||
|
to improve the handling of incorrect locale settings at the point where the
|
|||
|
Python interpreter is initialised. The problem that emerged is that this is
|
|||
|
ultimately *too late* in the interpreter startup process - data such as command
|
|||
|
line arguments and the contents of environment variables may have already been
|
|||
|
retrieved from the operating system and processed under the incorrect ASCII
|
|||
|
text encoding assumption well before ``Py_Initialize`` is called.
|
|||
|
|
|||
|
The problems created by those inconsistencies were then even harder to diagnose
|
|||
|
and debug than those created by believing the operating system's claim that
|
|||
|
ASCII was a suitable encoding to use for operating system interfaces. This was
|
|||
|
the case even for the default CPython binary, let alone larger C/C++
|
|||
|
applications that embed CPython as a scripting engine.
|
|||
|
|
|||
|
The approach proposed in this PEP handles that problem by moving the locale
|
|||
|
coercion as early as possible in the interpreter startup sequence when running
|
|||
|
standalone: it takes place directly in the C-level ``main()`` function, even
|
|||
|
before calling in to the `Py_Main()`` library function that implements the
|
|||
|
features of the CPython interpreter CLI.
|
|||
|
|
|||
|
The ``Py_Initialize`` API then only gains an explicit warning (emitted on
|
|||
|
``stderr``) when it detects use of the ``C`` locale, and relies on the
|
|||
|
embedding application to specify something more reasonable.
|
|||
|
|
|||
|
|
|||
|
Querying LC_CTYPE for C locale detection
|
|||
|
----------------------------------------
|
|||
|
|
|||
|
``LC_CTYPE`` is the actual locale category that CPython relies on to drive the
|
|||
|
implicit decoding of environment variables, command line arguments, and other
|
|||
|
text values received from the operating system.
|
|||
|
|
|||
|
As such, it makes sense to check it specifically when attempting to determine
|
|||
|
whether or not the current locale configuration is likely to cause Unicode
|
|||
|
handling problems.
|
|||
|
|
|||
|
|
|||
|
Setting both LANG & LC_ALL for C.UTF-8 locale coercion
|
|||
|
------------------------------------------------------
|
|||
|
|
|||
|
Python is often used as a glue language, integrating other C/C++ ABI compatible
|
|||
|
components in the current process, and components written in arbitrary
|
|||
|
languages in subprocesses.
|
|||
|
|
|||
|
Setting ``LC_ALL`` to ``C.UTF-8`` imposes a locale setting override on all
|
|||
|
C/C++ components in the current process and in any subprocesses that inherit
|
|||
|
the current environment.
|
|||
|
|
|||
|
Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
|
|||
|
the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``.
|
|||
|
|
|||
|
Together, these should ensure that when the locale coercion is activated, the
|
|||
|
switch to the C.UTF-8 locale will be applied consistently across the current
|
|||
|
process and any subprocesses that inherit the current environment.
|
|||
|
|
|||
|
|
|||
|
Allowing restoration of the legacy behaviour
|
|||
|
--------------------------------------------
|
|||
|
|
|||
|
The CPython command line interpreter is often used to investigate faults that
|
|||
|
occur in other applications that embed CPython, and those applications may still
|
|||
|
be using the C locale even after this PEP is implemented.
|
|||
|
|
|||
|
Providing a simple on/off switch for the locale coercion behaviour makes it
|
|||
|
much easier to reproduce the behaviour of such applications for debugging
|
|||
|
purposes, as well as making it easier to reproduce the behaviour of older 3.x
|
|||
|
runtimes even when running a version with this change applied.
|
|||
|
|
|||
|
|
|||
|
Implementation
|
|||
|
==============
|
|||
|
|
|||
|
A draft implementation of the change (including test cases) has been
|
|||
|
posted to issue 28180 [1_](which requests that ``sys.getfilesystemencoding()``
|
|||
|
default to ``utf-8``)
|
|||
|
|
|||
|
|
|||
|
Backporting to earlier Python 3 releases
|
|||
|
========================================
|
|||
|
|
|||
|
If this PEP is accepted for Python 3.7, backporting of the change to earlier
|
|||
|
Python 3 releases by redistributors will be both allowed and encouraged.
|
|||
|
However, to serve any useful purpose, such backports should only be undertaken
|
|||
|
either in conjunction with the changes needed to also provide the C.UTF-8
|
|||
|
locale by default, or else specifically for platforms where that locale is
|
|||
|
already consistently available.
|
|||
|
|
|||
|
|
|||
|
Acknowledgements
|
|||
|
================
|
|||
|
|
|||
|
The locale coercion approach proposed in this PEP is inspired directly by
|
|||
|
Armin Ronacher's handling of this problem in the ``click`` command line
|
|||
|
utility development framework [2_]::
|
|||
|
|
|||
|
$ LANG=C python3 -c 'import click; cli = click.command()(lambda:None); cli()'
|
|||
|
Traceback (most recent call last):
|
|||
|
...
|
|||
|
RuntimeError: Click will abort further execution because Python 3 was
|
|||
|
configured to use ASCII as encoding for the environment. Either run this
|
|||
|
under Python 2 or consult http://click.pocoo.org/python3/ for mitigation
|
|||
|
steps.
|
|||
|
|
|||
|
This system supports the C.UTF-8 locale which is recommended.
|
|||
|
You might be able to resolve your issue by exporting the
|
|||
|
following environment variables:
|
|||
|
|
|||
|
export LC_ALL=C.UTF-8
|
|||
|
export LANG=C.UTF-8
|
|||
|
|
|||
|
The change was originally proposed as a downstream patch for Fedora's
|
|||
|
system Python 3.6 package [3_], and then reformulated as a PEP for Python 3.7
|
|||
|
with a section allowing for backports to earlier versions by redistributors.
|
|||
|
|
|||
|
|
|||
|
References
|
|||
|
==========
|
|||
|
|
|||
|
.. [1] CPython: sys.getfilesystemencoding() should default to utf-8
|
|||
|
(http://bugs.python.org/issue28180)
|
|||
|
|
|||
|
.. [2] Locale configuration required for click applications under Python 3
|
|||
|
(http://click.pocoo.org/5/python3/#python-3-surrogate-handling)
|
|||
|
|
|||
|
.. [3] Fedora: force C.UTF-8 when Python 3 is run under the C locale
|
|||
|
(https://bugzilla.redhat.com/show_bug.cgi?id=1404918)
|
|||
|
|
|||
|
|
|||
|
Copyright
|
|||
|
=========
|
|||
|
|
|||
|
This document has been placed in the public domain under the terms of the
|
|||
|
CC0 1.0 license: https://creativecommons.org/publicdomain/zero/1.0/
|
|||
|
|
|||
|
|
|||
|
..
|
|||
|
Local Variables:
|
|||
|
mode: indented-text
|
|||
|
indent-tabs-mode: nil
|
|||
|
sentence-end-double-space: t
|
|||
|
fill-column: 70
|
|||
|
coding: utf-8
|
|||
|
End:
|