PEP 538: Update reference implementation (#219)

- updates reference implementation to use PYTHONCOERCECLOCALE
- removes hard dependency on PEP 540
- still notes PEP 540 covers case where no relevant C-with-UTF-8
  locale is available
- clarifies that these settings are still recommended over the
  legacy C locale settings for older Python 3 versions, even if
  we don't recommend backporting the automatic coercion
Nick Coghlan 2017-03-05 17:29:54 +10:00 committed by GitHub
parent 5f82542ec4
commit a20a56ceb5
1 changed file with 59 additions and 34 deletions

@@ -6,7 +6,6 @@ Author: Nick Coghlan <ncoghlan@gmail.com>
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
-Requires: 540
 Created: 28-Dec-2016
 Python-Version: 3.7
 Post-History: 03-Jan-2017 (linux-sig),
@@ -28,36 +27,49 @@ PEP 540 proposes a change to CPython's handling of the legacy C locale such
 that CPython will assume the use of UTF-8 in such environments, rather than
 persisting with the demonstrably problematic assumption of ASCII as an
 appropriate encoding for communicating with operating system interfaces.
+This is a good approach for cases where network encoding interoperability
+is a more important concern than local encoding interoperability.
 
 However, it comes at the cost of making CPython's encoding assumptions diverge
 from those of other C and C++ components in the same process, as well as those
 of components running in subprocesses that share the same environment.
 
-Accordingly, this PEP further proposes that the way the CPython implementation
-handles the default C locale be changed such that:
+It also requires changes to the internals of how CPython itself works, rather
+than using existing configuration settings that are supported by Python
+versions prior to Python 3.7.
 
-* the standalone CPython binary will automatically attempt to coerce the ``C``
-  locale to ``C.UTF-8``, ``C.utf8``, or ``UTF-8`` (depending on the system),
-  unless the new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``
+Accordingly, this PEP proposes that independently of the UTF-8 mode proposed
+in PEP 540, the way the CPython implementation handles the default C locale be
+changed such that:
+
+* unless the new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``,
+  the standalone CPython binary will automatically attempt to coerce the ``C``
+  locale to the first available locale out of ``C.UTF-8``, ``C.utf8``, or
+  ``UTF-8``
+* if the locale is successfully coerced, and PEP 540 is not accepted, then
+  ``PYTHONIOENCODING`` (if not otherwise set) will be set to
+  ``utf-8:surrogateescape``.
+* if the locale is successfully coerced, and PEP 540 *is* accepted, then
+  ``PYTHONUTF8`` (if not otherwise set) will be set to ``1``
 * if the subsequent runtime initialization process detects that the legacy
   ``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
   are available, locale coercion is disabled, or the runtime is embedded in an
   application other than the main CPython binary), and the ``PYTHONUTF8``
-  feature defined in PEP 540 is also disabled, it will emit a warning on
-  stderr that use of the legacy ``C`` locale's default ASCII text encoding
-  may cause various Unicode compatibility issues
+  feature defined in PEP 540 is also disabled (or not implemented), it will
+  emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
+  text encoding may cause various Unicode compatibility issues
 
 With this change, any \*nix platform that does *not* offer at least one of the
 ``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
 configuration would only be considered a fully supported platform for CPython
-3.7+ deployments when either the new ``PYTHONUTF8`` defined in PEP 540 is used,
-or else a suitable locale other than the default ``C`` locale is configured
-explicitly (e.g. ``zh_CN.gb18030``).
+3.7+ deployments when either the new ``PYTHONUTF8`` mode defined in PEP 540 is
+used, or else a suitable locale other than the default ``C`` locale is
+configured explicitly (e.g. `en_AU.UTF-8`, ``zh_CN.gb18030``).
 
 Redistributors (such as Linux distributions) with a narrower target audience
 than the upstream CPython development team may also choose to opt in to this
-behaviour for the Python 3.6.x series by applying the necessary changes as a
-downstream patch when first introducing Python 3.6.0.
+locale coercion behaviour for the Python 3.6.x series by applying the necessary
+changes as a downstream patch when first introducing Python 3.6.0.
 
 
 Background
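Review note: the coercion rules introduced in the hunk above can be modelled in a few lines of Python. This is only an illustrative sketch, not the C reference implementation. The names ``coerce_c_locale``, ``effective_ctype`` and the ``is_available`` callback are hypothetical, writing the result to ``LC_ALL`` is an assumption made so the override takes effect in the model, and only the "PEP 540 is not accepted" branch is covered.

```python
# Candidate target locales, in the order the PEP text lists them.
TARGET_LOCALES = ("C.UTF-8", "C.utf8", "UTF-8")


def effective_ctype(env):
    """Resolve the LC_CTYPE setting using POSIX precedence:
    LC_ALL first, then LC_CTYPE, then LANG, defaulting to "C"."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # empty values are treated as unset
            return value
    return "C"


def coerce_c_locale(environ, is_available):
    """Return a copy of `environ` with the legacy C locale coerced.

    `is_available` is a hypothetical callable that reports whether a
    locale name can be used on the current system.
    """
    env = dict(environ)
    if env.get("PYTHONCOERCECLOCALE") == "0":
        return env  # coercion explicitly disabled
    if effective_ctype(env) != "C":
        return env  # a non-C locale is already configured
    for target in TARGET_LOCALES:
        if is_available(target):
            # Assumption: LC_ALL is used here because it overrides the
            # other locale variables; the hunk does not name the variable.
            env["LC_ALL"] = target
            # "if not otherwise set": an explicit user value wins.
            env.setdefault("PYTHONIOENCODING", "utf-8:surrogateescape")
            break
    return env
```

If no target locale is available the environment is returned unchanged, which corresponds to the case where the runtime would go on to emit the legacy C locale warning.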
@@ -120,7 +132,7 @@ still fail in the following cases:
 
 * SSH environment forwarding means that SSH clients may sometimes forward
   client locale settings to servers that don't have that locale installed. This
-  leads to CPython running in the default ASCII-based C locale.
+  leads to CPython running in the default ASCII-based C locale
 * some process environments (such as Linux containers) may not have any
   explicit locale configured at all. As with unknown locales, this leads to
   CPython running in the default ASCII-based C locale
@@ -156,7 +168,7 @@ Relationship with other PEPs
 ============================
 
 This PEP shares a common problem statement with PEP 540 (improving Python 3's
-behaviour in the default C locale), but diverged markedly in the proposed
+behaviour in the default C locale), but diverges markedly in the proposed
 solution:
 
 * PEP 540 proposes to entirely decouple CPython's default text encoding from
@@ -174,7 +186,7 @@ solution:
   traditional strong support for integration with other components written
   in C and C++, while actively helping to push forward the adoption and
   standardisation of the C.UTF-8 locale as a Unicode-aware replacement for
-  the legacy C locale in the wider Linux ecosystem
+  the legacy C locale in the wider C/C++ ecosystem
 
 After reviewing both PEPs, it became clear that they didn't actually conflict
 at a technical level, and the proposal in PEP 540 offered a superior option in
@@ -183,14 +195,18 @@ reference behaviour for platforms where the notion of a "locale encoding"
 doesn't make sense (for example, embedded systems running MicroPython rather
 than the CPython reference interpreter).
 
-As a result, this PEP was amended to specify PEP 540 as a pre-requisite, with
-the aim being to coerce other C/C++ components into behaving consistently with
-CPython's assumption of UTF-8 as the system encoding, rather than CPython itself
-relying on that setting change.
+Meanwhile, this PEP offered improved compatibility with other C/C++ components,
+and an approach more amenable to being backported to Python 3.6 by downstream
+redistributors.
 
-As a result of that change, the ``LC_CTYPE=en_US.UTF-8`` legacy fallback was
-removed from the list of UTF-8 locales tried as a coercion target, with CPython
-instead relying solely on the C locale text encoding bypass in such cases.
+As a result, this PEP was amended to refer to PEP 540 as a complementary
+solution that offered improved behaviour both when locale coercion triggered,
+as well as when none of the standard UTF-8 based locales were available.
+
+The availability of PEP 540 also meant that the ``LC_CTYPE=en_US.UTF-8`` legacy
+fallback was removed from the list of UTF-8 locales tried as a coercion target,
+with CPython instead relying solely on the proposed PYTHONUTF8 mode in such
+cases.
 
 
 Motivation
@@ -203,9 +219,8 @@ application development. Technologies like Gnome Flatpak [7_] and
 Ubunty Snappy [8_] further aim to bring these same techniques to Linux GUI
 application development.
 
-When using Python 3 for application development in
-these contexts, it isn't uncommon to see text encoding related errors akin to
-the following::
+When using Python 3 for application development in these contexts, it isn't
+uncommon to see text encoding related errors akin to the following::
 
     $ docker run --rm fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
     Unable to decode the command from the command line:
@@ -304,6 +319,7 @@ proposed solution:
   release announcements. However, to minimize the chance of introducing new
   problems for end users, we'll do this *without* using the warnings system, so
   even running with ``-Werror`` won't turn it into a runtime exception
+* any changes made will use *existing* configuration options
 
 To minimize the negative impact on systems currently correctly configured to
 use GB-18030 or another partially ASCII compatible universal encoding leads to
@@ -434,7 +450,8 @@ be issued::
 
     Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
     encoding), which may cause Unicode compatibility problems. Using C.UTF-8
-    (if available) as an alternative Unicode-compatible locale is recommended.
+    C.utf8, or UTF-8 (if available) as alternative Unicode-compatible
+    locales is recommended.
 
 In this case, no actual change will be made to the locale settings.
@@ -754,14 +771,15 @@ runtimes even when running a version with this change applied.
 Implementation
 ==============
 
-A draft implementation of the change (including test cases) has been
-posted to issue 28180 [1_], which is an end user request that
+A draft implementation of the change (including test cases and documentation)
+is linked from issue 28180 [1_], which is an end user request that
 ``sys.getfilesystemencoding()`` default to ``utf-8`` rather than ``ascii``.
 
-NOTE: The currently posted draft implementation is for a previous iteration
-of the PEP prior to the incorporation of the feedback noted in [11_]. It was
-broadly the same in concept (i.e. coercing the legacy C locale to one based on
-UTF-8), but differs in several details.
+This patch is now being maintained as the ``pep538-coerce-c-locale`` feature
+branch [18_] in Nick Coghlan's fork of the CPython repository on GitHub.
+
+NOTE: As discussed in [1_], the currently posted draft implementation has some
+known issues on Android.
 
 
 Backporting to earlier Python 3 releases
@@ -789,6 +807,10 @@ backport it to even earlier Python 3.x releases based on the needs and
 interests of their particular user base, this wouldn't be encouraged as a
 general practice.
 
+However, configuring Python 3 *environments* (such as base container
+images) to use these configuration settings by default is both allowed
+and recommended.
+
 
 Acknowledgements
 ================
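Review note: as a rough illustration of what "configuring an environment to use these settings by default" can look like, the sketch below fills in UTF-8 friendly defaults for a child Python process and asks the child which stdout encoding it ended up with. The helper names are hypothetical, and the use of ``LANG=C.UTF-8`` assumes that locale exists on the target system; ``PYTHONIOENCODING=utf-8:surrogateescape`` is the setting named in the PEP text.

```python
import os
import subprocess
import sys


def with_utf8_defaults(base_env):
    """Return a copy of base_env with UTF-8 friendly defaults filled in.

    Existing values are preserved, so an explicitly configured locale
    (for example zh_CN.gb18030) is left alone.
    """
    env = dict(base_env)
    # Assumption: C.UTF-8 is available; adjust if the system lacks it.
    env.setdefault("LANG", "C.UTF-8")
    env.setdefault("PYTHONIOENCODING", "utf-8:surrogateescape")
    return env


def child_stdout_settings(env):
    """Ask a child interpreter for its stdout encoding and error handler."""
    code = "import sys; print(sys.stdout.encoding, sys.stdout.errors)"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").strip()


if __name__ == "__main__":
    print(child_stdout_settings(with_utf8_defaults(os.environ)))
```

In a container image the same two variables would normally be set once in the image definition rather than per process; the Python wrapper is used here only to keep the example self-checking.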
@@ -882,6 +904,9 @@ References
 .. [17] UTF-8 locale discussion on "locale.getdefaultlocale() fails on Mac OS X with default language set to English"
    (http://bugs.python.org/issue18378#msg215215)
 
+.. [18] GitHub branch diff for ``ncoghlan:pep538-coerce-c-locale``
+   (https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-locale)
+
 
 Copyright
 =========