diff --git a/pep-0538.txt b/pep-0538.txt
index 14af83502..c7de55b7b 100644
--- a/pep-0538.txt
+++ b/pep-0538.txt
@@ -51,7 +51,6 @@ changed to be roughly equivalent to the following existing configuration
 settings (supported since Python 3.1)::
 
     LC_CTYPE=C.UTF-8
-    LANG=C.UTF-8
     PYTHONIOENCODING=utf-8:surrogateescape
 
 The exact target locale for coercion will be chosen from a predefined list at
@@ -153,7 +152,7 @@ The simplest way to deal with this problem for currently released versions of
 CPython is to explicitly set a more sensible locale when launching the
 application. For example::
 
-    LANG=C.UTF-8 python3 ...
+    LC_CTYPE=C.UTF-8 python3 ...
 
 The ``C.UTF-8`` locale is a full locale definition that uses ``UTF-8`` for the
 ``LC_CTYPE`` category, and the same settings as the ``C`` locale for all other
@@ -276,19 +275,19 @@ The simplest way to get Python 3 (regardless of the exact version) to behave
 sensibly in Fedora and Debian based containers is to run it in the ``C.UTF-8``
 locale that both distros provide::
 
-    $ docker run --rm -e LANG=C.UTF-8 fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
+    $ docker run --rm -e LC_CTYPE=C.UTF-8 fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
     ℙƴ☂ℌøἤ
-    $ docker run --rm -e LANG=C.UTF-8 ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
+    $ docker run --rm -e LC_CTYPE=C.UTF-8 ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
     ℙƴ☂ℌøἤ
 
-    $ docker run --rm -e LANG=C.UTF-8 fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
-    LANG=C.UTF-8
-    LC_CTYPE="C.UTF-8"
+    $ docker run --rm -e LC_CTYPE=C.UTF-8 fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
+    LANG=
+    LC_CTYPE=C.UTF-8
     LC_ALL=
-    $ docker run --rm -e LANG=C.UTF-8 ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
-    LANG=C.UTF-8
+    $ docker run --rm -e LC_CTYPE=C.UTF-8 ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
+    LANG=
     LANGUAGE=
-    LC_CTYPE="C.UTF-8"
+    LC_CTYPE=C.UTF-8
     LC_ALL=
 
 The Alpine Linux based Python images provided by Docker, Inc. already use the
@@ -358,8 +357,9 @@ use an explicit locale category like ``LC_TIME``, ``LC_MONETARY`` or
 ``LC_NUMERIC`` while otherwise running in the legacy C locale gives the
 following design principles:
 
-* don't make any environmental changes that would override explicit settings for
-  locale categories other than ``LC_CTYPE`` (most notably: don't set ``LC_ALL``)
+* don't make any environmental changes that would alter any existing settings
+  for locale categories other than ``LC_CTYPE`` (most notably: don't set
+  ``LC_ALL`` or ``LANG``)
 
 Finally, maintaining compatibility with running arbitrary subprocesses in
 orchestration use cases leads to the following design principle:
@@ -374,11 +374,12 @@ Specification
 
 To better handle the cases where CPython would otherwise end up attempting
 to operate in the ``C`` locale, this PEP proposes that CPython automatically
-attempt to coerce the legacy ``C`` locale to a UTF-8 based locale when it is
-run as a standalone command line application.
+attempt to coerce the legacy ``C`` locale to a UTF-8 based locale for the
+``LC_CTYPE`` category when it is run as a standalone command line application.
 
 It further proposes to emit a warning on stderr if the legacy ``C`` locale
-is in effect at the point where the language runtime itself is initialized,
+is in effect for the ``LC_CTYPE`` category at the point where the language
+runtime itself is initialized,
 and the explicit environmental flag to disable locale coercion is not set, in
 order to warn system and application integrators that they're running CPython
 in an unsupported configuration.
@@ -423,17 +424,13 @@ Three such locales will be tried:
 * ``C.UTF-8`` (available at least in Debian, Ubuntu, Alpine, and Fedora 25+, and
   expected to be available by default in a future version of glibc)
 * ``C.utf8`` (available at least in HP-UX)
-* ``UTF-8`` (available in at least some \*BSD variants)
+* ``UTF-8`` (available in at least some \*BSD variants, including Mac OS X)
 
-For ``C.UTF-8`` and ``C.utf8``, the coercion will be implemented by setting
-both the ``LC_CTYPE`` and ``LANG`` environment variables to the candidate
-locale name, such that future calls to ``setlocale()`` will see them, as will
-other components looking for those settings (such as GUI development
-frameworks).
-
-For the platforms where it is defined, ``UTF-8`` is a partial locale that only
-defines the ``LC_CTYPE`` category. Accordingly, only the ``LC_CTYPE``
-environment variable would be set when using this fallback option.
+The coercion will be implemented by setting the ``LC_CTYPE`` environment
+variable to the candidate locale name, such that future calls to
+``setlocale()`` will see it, as will other components looking for those
+settings (such as GUI development frameworks and Python's own ``locale``
+module).
 
 To allow for better cross-platform binary portability and to adjust
 automatically to future changes in locale availability, these checks will be
@@ -444,15 +441,9 @@ When this locale coercion is activated, the following warning will be
 printed on stderr, with the warning containing whichever locale was
 successfully configured::
 
-    Python detected LC_CTYPE=C: LC_CTYPE & LANG coerced to C.UTF-8 (set another
+    Python detected LC_CTYPE=C: LC_CTYPE coerced to C.UTF-8 (set another
     locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
 
-When falling back to the ``UTF-8`` locale, the message would be slightly
-different::
-
-    Python detected LC_CTYPE=C: LC_CTYPE coerced to UTF-8 (set another locale
-    or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
-
 As long as the current platform provides at least one of the candidate UTF-8
 based environments, this locale coercion will mean that the standard
 Python binary *and* locale-aware extensions should once again "just work"
@@ -489,9 +480,9 @@ Legacy C locale warning during runtime initialization
 
 By the time that ``Py_Initialize`` is called, arbitrary locale-dependent
 operations may have taken place in the current process. This means that
-by the time it is called, it is *too late* to switch to a different locale -
-doing so would introduce inconsistencies in decoded text, even in the context
-of the standalone Python interpreter binary.
+by the time it is called, it is *too late* to reliably switch to a different
+locale - doing so would introduce inconsistencies in decoded text, even in the
+context of the standalone Python interpreter binary.
 
 Accordingly, when ``Py_Initialize`` is called and CPython detects that the
 configured locale is still the default ``C`` locale and
@@ -860,8 +851,8 @@ whether or not the current locale configuration is likely to cause Unicode
 handling problems.
 
 
-Setting both LC_CTYPE & LANG for UTF-8 locale coercion
-------------------------------------------------------
+Explicitly setting LC_CTYPE for UTF-8 locale coercion
+-----------------------------------------------------
 
 Python is often used as a glue language, integrating other C/C++ ABI compatible
 components in the current process, and components written in arbitrary
@@ -872,19 +863,46 @@ problem has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a
 system where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
 configured to forward locale settings, and the user logs into a Linux server).
 
-Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
-the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``.
+This should be sufficient to ensure that when the locale coercion is activated,
+the switch to the UTF-8 based locale will be applied consistently across the
+current process and any subprocesses that inherit the current environment.
 
-Together, these should ensure that when the locale coercion is activated, the
-switch to the UTF-8 based locale will be applied consistently across the current
-process and any subprocesses that inherit the current environment.
+
+Avoiding setting LANG for UTF-8 locale coercion
+-----------------------------------------------
+
+Earlier versions of this PEP proposed setting the ``LANG`` category independent
+default locale, in addition to setting ``LC_CTYPE``.
+
+This was later removed on the grounds that setting only ``LC_CTYPE`` is
+sufficient to handle all of the problematic scenarios that the PEP aimed
+to resolve, while setting ``LANG`` as well would break cases where ``LANG``
+was set correctly, and the locale problems were solely due to an incorrect
+``LC_CTYPE`` setting ([22_]).
+
+For example, consider a Python application that called the Linux ``date``
+utility in a subprocess rather than doing its own date formatting::
+
+    $ LANG=ja_JP.UTF-8 LC_CTYPE=C date
+    2017年 5月 23日 火曜日 17:31:03 JST
+
+    $ LANG=ja_JP.UTF-8 LC_CTYPE=C.UTF-8 date # Coercing only LC_CTYPE
+    2017年 5月 23日 火曜日 17:32:58 JST
+
+    $ LANG=C.UTF-8 LC_CTYPE=C.UTF-8 date # Coercing both of LC_CTYPE and LANG
+    Tue May 23 17:31:10 JST 2017
+
+With only ``LC_CTYPE`` updated in the Python process, the subprocess would
+continue to behave as expected. However, if ``LANG`` was updated as well,
+that would effectively override the ``LC_TIME`` setting and use the wrong
+date formatting conventions.
 
 
 Avoiding setting LC_ALL for UTF-8 locale coercion
 -------------------------------------------------
 
 Earlier versions of this PEP proposed setting the ``LC_ALL`` locale override,
-rather than just setting ``LC_CTYPE`` and ``LANG``.
+in addition to setting ``LC_CTYPE``.
 
 This was changed after it was determined that just setting ``LC_CTYPE`` and
 ``LANG`` should be sufficient to handle all the scenarios the PEP aims to
@@ -1198,6 +1216,10 @@ References
 .. [21] GNU readline misbehaviour on Mac OS X with ``LANG=C``
    (https://mail.python.org/pipermail/python-dev/2017-May/147897.html)
 
+.. [22] Potential problems when setting LANG in addition to setting LC_CTYPE
+   (https://mail.python.org/pipermail/python-dev/2017-May/147968.html)
+
+
 Copyright
 =========