PEP 538: add Background section on locale handling
parent cd6e6d838d
commit c99c42e066
pep-0538.txt: 97 lines changed
@@ -38,8 +38,66 @@ may also choose to opt in to this behaviour for earlier Python 3.x releases by
 applying the necessary changes as a downstream patch to those versions.


-Specification
-=============
+Background
+==========

+While the CPython interpreter is starting up, it may need to convert from
+the ``char *`` format to the ``wchar_t *`` format, or from one of those formats
+to ``PyUnicodeObject *``, before its own text encoding handling machinery is
+fully configured. It handles these cases by relying on the operating system to
+do the conversion and then ensuring that the text encoding name reported by
+``sys.getfilesystemencoding()`` matches the encoding used during this early
+bootstrapping process.
+
+On Mac OS X, this is straightforward, as Apple guarantees that these operations
+will always use UTF-8 to do the conversion.
+
+On Windows, the limitations of the ``mbcs`` format used by default in these
+conversions proved sufficiently problematic that PEP 528 and PEP 529 were
+implemented to bypass the operating system supplied interfaces for binary data
+handling and force the use of UTF-8 instead.
+
+On non-Apple \*nix systems however, these operations are handled using the C
+locale system, which has the following characteristics [4_]:
+
+* by default, all processes start in the ``C`` locale, which uses ``ASCII``
+  for these conversions. This is almost never what anyone doing multilingual
+  text processing actually wants (including CPython)
+* calling ``setlocale(LC_ALL, "")`` reconfigures the active locale based on
+  the locale categories configured in the current process environment
+* if the locale requested by the current environment is unknown, or no specific
+  locale is configured, then the default ``C`` locale will remain active
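
The environment lookup described in the list above can be modelled in a few
lines of Python. This is only an illustrative sketch: the helper name
``effective_ctype_setting`` is hypothetical, and the precedence it applies
(``LC_ALL``, then ``LC_CTYPE``, then ``LANG``, with empty values treated as
unset) is the POSIX rule rather than something stated in the text above. It
reports which locale name is *requested*; whether that locale is actually
installed is a separate question.

```python
def effective_ctype_setting(env):
    """Return the locale name that ``env`` requests for LC_CTYPE.

    Hypothetical helper modelling the POSIX precedence rules.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset and empty variables are both skipped
            return value
    return "C"  # nothing configured: the default C locale stays active


print(effective_ctype_setting({}))
print(effective_ctype_setting({"LANG": "en_AU.UTF-8"}))
print(effective_ctype_setting({"LANG": "en_AU.UTF-8", "LC_ALL": "C"}))
```

The last call shows why an explicit ``LC_ALL=C`` wins even when ``LANG``
names a UTF-8 locale.
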
+
+The specific locale category that covers the APIs that CPython depends on is
+``LC_CTYPE``, which applies to "classification and conversion of characters,
+and to multibyte and wide characters" [5_]. Accordingly, CPython includes the
+following key calls to ``setlocale``:
+
+* in ``Py_Initialize``, CPython calls ``setlocale(LC_CTYPE, "")``, such that
+  the configured locale settings for that category *always* match those set in
+  the environment. It does this unconditionally, and it *doesn't* revert the
+  process state change in ``Py_Finalize``
+* in the main ``python`` binary, CPython calls ``setlocale(LC_ALL, "")`` to
+  configure the entire C locale subsystem according to the process environment.
+  It does this prior to making any calls into the shared CPython library
+
+These calls are usually sufficient to provide sensible behaviour, but they can
+still fail in the following cases:
+
+* SSH environment forwarding means that SSH clients will often forward
+  client locale settings to servers that don't have that locale installed
+* some process environments (such as Linux containers) may not have any
+  explicit locale configured at all
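
The "locale not installed" failure can be probed from Python itself, where
``locale.setlocale`` raises ``locale.Error`` when the C library rejects a
request. The sketch below is an illustration only: ``locale_is_available`` is
a hypothetical helper, the second locale name is a made-up example, and some C
libraries accept unknown names, so the result for it is platform dependent.

```python
import locale


def locale_is_available(name):
    """Report whether the C library accepts ``name`` for LC_CTYPE.

    Hypothetical helper; the previously active setting is always restored.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, changes nothing
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False  # request rejected: the old locale is still active
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


print(locale_is_available("C"))
print(locale_is_available("xx_XX.no-such-charset"))
```
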
+
+
+Proposal
+========
+
+To better handle the cases where CPython would otherwise end up attempting
+to operate in the ``C`` locale, this PEP proposes changes to CPython's
+behaviour both when it is run as a standalone command line application, as well
+as when it is used as a shared library to embed a Python runtime as part of a
+larger application.
+
 When ``Py_Initialize`` is called and CPython detects that the configured locale
 is the default ``C`` locale, the following warning will be issued::
@@ -49,16 +107,24 @@ is the default ``C`` locale, the following warning will be issued::
     `PYTHONALLOWCLOCALE=1 LC_CTYPE=C` to configure a similar environment
     when running Python directly.

+This warning informs both system and application integrators that they're
+running Python 3 in a configuration that we don't expect to work properly.
+
 By contrast, when CPython *is* the main application, it will instead
 automatically coerce the legacy C locale to the multilingual C.UTF-8 locale::

     Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set
     PYTHONALLOWCLOCALE to disable this locale coercion behaviour).

-This coercion is implemented by actually setting the ``LANG`` and ``LC_ALL``
-environment variables to ``C.UTF-8``, such that future calls to ``setlocale()``
-will see them, as will other components looking for those settings (such as
-GUI development frameworks).
+This locale coercion will mean that the standard Python binary should once
+again "just work" in the two main failure cases we're aware of (missing locale
+settings and SSH forwarding of unknown locales), as long as the target
+platform provides the ``C.UTF-8`` locale.

+This coercion will be implemented by actually setting the ``LANG`` and
+``LC_ALL`` environment variables to ``C.UTF-8``, such that future calls to
+``setlocale()`` will see them, as will other components looking for those
+settings (such as GUI development frameworks).
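
The coercion step can be sketched as a pure function over an environment
mapping. This is a hypothetical model, not CPython's implementation:
``coerce_c_locale`` is an invented name, and it detects the ``C`` locale by
inspecting environment variables (using the POSIX ``LC_ALL``, ``LC_CTYPE``,
``LANG`` precedence as an assumption), whereas the text describes CPython
checking the locale that is actually configured.

```python
def coerce_c_locale(env):
    """Return a copy of ``env`` with the legacy C locale coerced to C.UTF-8.

    Hypothetical model of the proposed behaviour, not CPython's own code.
    """
    new_env = dict(env)
    if new_env.get("PYTHONALLOWCLOCALE"):
        return new_env  # opt-out: any non-empty value skips the coercion

    # Assumed POSIX precedence; empty values count as unset.
    requested = "C"
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        if new_env.get(name):
            requested = new_env[name]
            break

    if requested == "C":
        # Set both variables so later setlocale() calls and other
        # components reading the environment see the coerced value.
        new_env["LANG"] = "C.UTF-8"
        new_env["LC_ALL"] = "C.UTF-8"
    return new_env


print(coerce_c_locale({}))
print(coerce_c_locale({"LANG": "en_AU.UTF-8"}))
print(coerce_c_locale({"PYTHONALLOWCLOCALE": "1"}))
```

Returning a new mapping rather than mutating ``os.environ`` keeps the sketch
free of side effects; the proposal itself modifies the real process
environment.
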
+
 The locale coercion will be skipped if the ``PYTHONALLOWCLOCALE`` environment
 variable is set to a non-empty string. The interpreter will always check for
@@ -96,7 +162,9 @@ and instead made it a deployment requirement that systems be configured to use
 UTF-8 as the text encoding for operating system interfaces. Similarly, Node.js
 assumes UTF-8 by default (a behaviour inherited from the V8 JavaScript engine)
 and requires custom build settings to indicate it should use the system
-locale settings for locale-aware operations.
+locale settings for locale-aware operations. Both the JVM and the .NET CLR
+use UTF-16-LE as their primary encoding for passing text between applications
+and the underlying platform.

 The challenge for CPython has been the fact that in addition to being used for
 network service development, it is also extensively used as an embedded
@@ -127,8 +195,9 @@ We've been trying to get strict bytes/text separation to work reliably in the
 legacy C locale for over a decade at this point. Not only haven't we been able
 to get it to work, neither has anyone else - the only viable alternatives
 identified have been to pass the bytes along verbatim without eagerly decoding
-them to text (Python 2, Ruby, etc), or else to ignore the nominal locale
-encoding entirely and assume the use of UTF-8 (Rust, Go, Node.js, etc).
+them to text (Python 2.x, Ruby, etc), or else to ignore the nominal C/C++ locale
+encoding entirely and assume the use of either UTF-8 (Rust, Go, Node.js, etc)
+or UTF-16-LE (JVM, .NET CLR).

 While this PEP ensures that developers that need to do so can still opt-in to
 running their Python code in the legacy C locale, it also makes clear that we
@@ -212,8 +281,8 @@ Implementation
 ==============

 A draft implementation of the change (including test cases) has been
-posted to issue 28180 [1_](which requests that ``sys.getfilesystemencoding()``
-default to ``utf-8``)
+posted to issue 28180 [1_], which is an end user request that
+``sys.getfilesystemencoding()`` default to ``utf-8`` rather than ``ascii``.


 Backporting to earlier Python 3 releases
@@ -266,6 +335,12 @@ References
 .. [3] Fedora: force C.UTF-8 when Python 3 is run under the C locale
    (https://bugzilla.redhat.com/show_bug.cgi?id=1404918)

+.. [4] GNU C: How Programs Set the Locale
+   ( https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html)
+
+.. [5] GNU C: Locale Categories
+   (https://www.gnu.org/software/libc/manual/html_node/Locale-Categories.html)
+

 Copyright
 =========