PEP 538: Only set LC_CTYPE, never LANG

It looks like setting LANG may have undesirable
side effects in some cases, and all the issues
the PEP aims to handle are resolved by setting
LC_CTYPE.

The proposal and implementation have thus been
updated to only set LC_CTYPE, even when the
target coercion locale is a full locale.
Nick Coghlan 2017-05-27 17:08:32 +10:00
parent db50b27755
commit 12cecb0548
1 changed files with 65 additions and 43 deletions


@ -51,7 +51,6 @@ changed to be roughly equivalent to the following existing configuration
settings (supported since Python 3.1)::
LC_CTYPE=C.UTF-8
LANG=C.UTF-8
PYTHONIOENCODING=utf-8:surrogateescape
The exact target locale for coercion will be chosen from a predefined list at
@ -153,7 +152,7 @@ The simplest way to deal with this problem for currently released versions of
CPython is to explicitly set a more sensible locale when launching the
application. For example::
LANG=C.UTF-8 python3 ...
LC_CTYPE=C.UTF-8 python3 ...
The ``C.UTF-8`` locale is a full locale definition that uses ``UTF-8`` for the
``LC_CTYPE`` category, and the same settings as the ``C`` locale for all other
@ -276,19 +275,19 @@ The simplest way to get Python 3 (regardless of the exact version) to behave
sensibly in Fedora and Debian based containers is to run it in the ``C.UTF-8``
locale that both distros provide::
$ docker run --rm -e LANG=C.UTF-8 fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
$ docker run --rm -e LC_CTYPE=C.UTF-8 fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
$ docker run --rm -e LANG=C.UTF-8 ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
$ docker run --rm -e LC_CTYPE=C.UTF-8 ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
$ docker run --rm -e LANG=C.UTF-8 fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=C.UTF-8
LC_CTYPE="C.UTF-8"
$ docker run --rm -e LC_CTYPE=C.UTF-8 fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=
LC_CTYPE=C.UTF-8
LC_ALL=
$ docker run --rm -e LANG=C.UTF-8 ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=C.UTF-8
$ docker run --rm -e LC_CTYPE=C.UTF-8 ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_CTYPE=C.UTF-8
LC_ALL=
The Alpine Linux based Python images provided by Docker, Inc. already use the
@ -358,8 +357,9 @@ use an explicit locale category like ``LC_TIME``, ``LC_MONETARY`` or
``LC_NUMERIC`` while otherwise running in the legacy C locale gives the
following design principles:
* don't make any environmental changes that would override explicit settings for
locale categories other than ``LC_CTYPE`` (most notably: don't set ``LC_ALL``)
* don't make any environmental changes that would alter any existing settings
for locale categories other than ``LC_CTYPE`` (most notably: don't set
``LC_ALL`` or ``LANG``)
Finally, maintaining compatibility with running arbitrary subprocesses in
orchestration use cases leads to the following design principle:
@ -374,11 +374,12 @@ Specification
To better handle the cases where CPython would otherwise end up attempting
to operate in the ``C`` locale, this PEP proposes that CPython automatically
attempt to coerce the legacy ``C`` locale to a UTF-8 based locale when it is
run as a standalone command line application.
attempt to coerce the legacy ``C`` locale to a UTF-8 based locale for the
``LC_CTYPE`` category when it is run as a standalone command line application.
It further proposes to emit a warning on stderr if the legacy ``C`` locale
is in effect at the point where the language runtime itself is initialized,
is in effect for the ``LC_CTYPE`` category at the point where the language
runtime itself is initialized,
and the explicit environmental flag to disable locale coercion is not set, in
order to warn system and application integrators that they're running CPython
in an unsupported configuration.
@ -423,17 +424,13 @@ Three such locales will be tried:
* ``C.UTF-8`` (available at least in Debian, Ubuntu, Alpine, and Fedora 25+, and
expected to be available by default in a future version of glibc)
* ``C.utf8`` (available at least in HP-UX)
* ``UTF-8`` (available in at least some \*BSD variants)
* ``UTF-8`` (available in at least some \*BSD variants, including Mac OS X)
For ``C.UTF-8`` and ``C.utf8``, the coercion will be implemented by setting
both the ``LC_CTYPE`` and ``LANG`` environment variables to the candidate
locale name, such that future calls to ``setlocale()`` will see them, as will
other components looking for those settings (such as GUI development
frameworks).
For the platforms where it is defined, ``UTF-8`` is a partial locale that only
defines the ``LC_CTYPE`` category. Accordingly, only the ``LC_CTYPE``
environment variable would be set when using this fallback option.
The coercion will be implemented by setting the ``LC_CTYPE`` environment
variable to the candidate locale name, such that future calls to
``setlocale()`` will see it, as will other components looking for those
settings (such as GUI development frameworks and Python's own ``locale``
module).
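As a rough illustration of the coercion step described above, the following Python sketch probes the three candidate locales in the order listed and then changes only ``LC_CTYPE`` in a copy of the environment. The actual coercion is proposed for the interpreter's C startup code; the helper names ``pick_coercion_target`` and ``coerced_environment`` are hypothetical and exist only for this example.

```python
import locale

# Candidate target locales, in the order given in the list above.
CANDIDATES = ("C.UTF-8", "C.utf8", "UTF-8")


def pick_coercion_target(candidates=CANDIDATES):
    """Return the first candidate that setlocale() accepts for LC_CTYPE.

    Returns None when the platform provides none of the candidates.
    The process locale is restored before returning.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # not available on this platform, try the next one
            return name
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


def coerced_environment(environ, target):
    """Return a copy of environ with only LC_CTYPE changed.

    LANG, LC_ALL and every other locale category are left untouched.
    """
    env = dict(environ)
    if target is not None:
        env["LC_CTYPE"] = target
    return env
```

Passing the result of ``coerced_environment(os.environ, pick_coercion_target())`` as the ``env`` argument of ``subprocess.run`` would give a child process the same view of ``LC_CTYPE`` that the coercion is intended to provide, without altering ``LANG``.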
To allow for better cross-platform binary portability and to adjust
automatically to future changes in locale availability, these checks will be
@ -444,15 +441,9 @@ When this locale coercion is activated, the following warning will be
printed on stderr, with the warning containing whichever locale was
successfully configured::
Python detected LC_CTYPE=C: LC_CTYPE & LANG coerced to C.UTF-8 (set another
Python detected LC_CTYPE=C: LC_CTYPE coerced to C.UTF-8 (set another
locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
When falling back to the ``UTF-8`` locale, the message would be slightly
different::
Python detected LC_CTYPE=C: LC_CTYPE coerced to UTF-8 (set another locale
or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
As long as the current platform provides at least one of the candidate UTF-8
based environments, this locale coercion will mean that the standard
Python binary *and* locale-aware extensions should once again "just work"
@ -489,9 +480,9 @@ Legacy C locale warning during runtime initialization
By the time that ``Py_Initialize`` is called, arbitrary locale-dependent
operations may have taken place in the current process. This means that
by the time it is called, it is *too late* to switch to a different locale -
doing so would introduce inconsistencies in decoded text, even in the context
of the standalone Python interpreter binary.
by the time it is called, it is *too late* to reliably switch to a different
locale - doing so would introduce inconsistencies in decoded text, even in the
context of the standalone Python interpreter binary.
Accordingly, when ``Py_Initialize`` is called and CPython detects that the
configured locale is still the default ``C`` locale and
@ -860,8 +851,8 @@ whether or not the current locale configuration is likely to cause Unicode
handling problems.
Setting both LC_CTYPE & LANG for UTF-8 locale coercion
------------------------------------------------------
Explicitly setting LC_CTYPE for UTF-8 locale coercion
-----------------------------------------------------
Python is often used as a glue language, integrating other C/C++ ABI compatible
components in the current process, and components written in arbitrary
@ -872,19 +863,46 @@ problem has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a
system where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
configured to forward locale settings, and the user logs into a Linux server).
Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``.
This should be sufficient to ensure that when the locale coercion is activated,
the switch to the UTF-8 based locale will be applied consistently across the
current process and any subprocesses that inherit the current environment.
Together, these should ensure that when the locale coercion is activated, the
switch to the UTF-8 based locale will be applied consistently across the current
process and any subprocesses that inherit the current environment.
Avoiding setting LANG for UTF-8 locale coercion
-----------------------------------------------
Earlier versions of this PEP proposed setting the ``LANG`` category independent
default locale, in addition to setting ``LC_CTYPE``.
This was later removed on the grounds that setting only ``LC_CTYPE`` is
sufficient to handle all of the problematic scenarios that the PEP aimed
to resolve, while setting ``LANG`` as well would break cases where ``LANG``
was set correctly, and the locale problems were solely due to an incorrect
``LC_CTYPE`` setting ([22_]).
For example, consider a Python application that called the Linux ``date``
utility in a subprocess rather than doing its own date formatting::
$ LANG=ja_JP.UTF-8 LC_CTYPE=C date
2017年 5月 23日 火曜日 17:31:03 JST
$ LANG=ja_JP.UTF-8 LC_CTYPE=C.UTF-8 date # Coercing only LC_CTYPE
2017年 5月 23日 火曜日 17:32:58 JST
$ LANG=C.UTF-8 LC_CTYPE=C.UTF-8 date # Coercing both of LC_CTYPE and LANG
Tue May 23 17:31:10 JST 2017
With only ``LC_CTYPE`` updated in the Python process, the subprocess would
continue to behave as expected. However, if ``LANG`` was updated as well,
that would change the locale that the unset ``LC_TIME`` category falls back
to, and the subprocess would use the wrong date formatting conventions.
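The effect on the subprocess follows from the POSIX lookup order for locale categories: ``LC_ALL`` is consulted first, then the category's own variable, then ``LANG``. The sketch below models that lookup for the two coercion strategies shown in the example. The ``effective_locale`` helper is hypothetical, and the final fallback to ``C`` is a simplifying assumption (POSIX leaves that default implementation-defined).

```python
def effective_locale(env, category):
    """Resolve a locale category from environment variables.

    Follows the POSIX precedence: LC_ALL, then the category variable
    itself, then LANG. Unset and empty variables are both skipped.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # assumed default when nothing is set


# Coercing only LC_CTYPE leaves the LANG-derived LC_TIME untouched.
only_ctype = {"LANG": "ja_JP.UTF-8", "LC_CTYPE": "C.UTF-8"}
print(effective_locale(only_ctype, "LC_TIME"))   # ja_JP.UTF-8
print(effective_locale(only_ctype, "LC_CTYPE"))  # C.UTF-8

# Coercing LANG as well changes what LC_TIME resolves to.
both = {"LANG": "C.UTF-8", "LC_CTYPE": "C.UTF-8"}
print(effective_locale(both, "LC_TIME"))         # C.UTF-8
```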
Avoiding setting LC_ALL for UTF-8 locale coercion
-------------------------------------------------
Earlier versions of this PEP proposed setting the ``LC_ALL`` locale override,
rather than just setting ``LC_CTYPE`` and ``LANG``.
in addition to setting ``LC_CTYPE``.
This was changed after it was determined that just setting ``LC_CTYPE`` and
``LANG`` should be sufficient to handle all the scenarios the PEP aims to
@ -1198,6 +1216,10 @@ References
.. [21] GNU readline misbehaviour on Mac OS X with ``LANG=C``
(https://mail.python.org/pipermail/python-dev/2017-May/147897.html)
.. [22] Potential problems when setting LANG in addition to setting LC_CTYPE
(https://mail.python.org/pipermail/python-dev/2017-May/147968.html)
Copyright
=========