PEP 538: Update reference implementation (#219)

- updates reference implementation to use PYTHONCOERCECLOCALE
- removes hard dependency on PEP 540
- still notes PEP 540 covers case where no relevant C-with-UTF-8
  locale is available
- clarifies that these settings are still recommended over the
  legacy C locale settings for older Python 3 versions, even if
  we don't recommend backporting the automatic coercion
Nick Coghlan 2017-03-05 17:29:54 +10:00 committed by GitHub
parent 5f82542ec4
commit a20a56ceb5
1 changed files with 59 additions and 34 deletions

@ -6,7 +6,6 @@ Author: Nick Coghlan <ncoghlan@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Requires: 540
Created: 28-Dec-2016
Python-Version: 3.7
Post-History: 03-Jan-2017 (linux-sig),
@ -28,36 +27,49 @@ PEP 540 proposes a change to CPython's handling of the legacy C locale such
that CPython will assume the use of UTF-8 in such environments, rather than
persisting with the demonstrably problematic assumption of ASCII as an
appropriate encoding for communicating with operating system interfaces.
This is a good approach for cases where network encoding interoperability
is a more important concern than local encoding interoperability.
However, it comes at the cost of making CPython's encoding assumptions diverge
from those of other C and C++ components in the same process, as well as those
of components running in subprocesses that share the same environment.
Accordingly, this PEP further proposes that the way the CPython implementation
handles the default C locale be changed such that:
It also requires changes to the internals of how CPython itself works, rather
than using existing configuration settings that are supported by Python
versions prior to Python 3.7.
* the standalone CPython binary will automatically attempt to coerce the ``C``
locale to ``C.UTF-8``, ``C.utf8``, or ``UTF-8`` (depending on the system),
unless the new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``
Accordingly, this PEP proposes that independently of the UTF-8 mode proposed
in PEP 540, the way the CPython implementation handles the default C locale be
changed such that:
* unless the new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``,
the standalone CPython binary will automatically attempt to coerce the ``C``
locale to the first available locale out of ``C.UTF-8``, ``C.utf8``, or
``UTF-8``
* if the locale is successfully coerced, and PEP 540 is not accepted, then
``PYTHONIOENCODING`` (if not otherwise set) will be set to
``utf-8:surrogateescape``.
* if the locale is successfully coerced, and PEP 540 *is* accepted, then
``PYTHONUTF8`` (if not otherwise set) will be set to ``1``
* if the subsequent runtime initialization process detects that the legacy
``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
are available, locale coercion is disabled, or the runtime is embedded in an
application other than the main CPython binary), and the ``PYTHONUTF8``
feature defined in PEP 540 is also disabled, it will emit a warning on
stderr that use of the legacy ``C`` locale's default ASCII text encoding
may cause various Unicode compatibility issues
feature defined in PEP 540 is also disabled (or not implemented), it will
emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
text encoding may cause various Unicode compatibility issues
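
As a rough illustration of the selection rules in the list above (this is not
the reference implementation), the decision can be sketched in Python. The
helper names ``choose_coercion_target`` and ``coerced_settings``, and the
injected ``is_available`` callback, are hypothetical. The sketch only covers
the case where PEP 540 is not accepted, and it takes the active ``LC_CTYPE``
locale name as an argument rather than querying the C library:

```python
# Candidate coercion targets, in the order the PEP text lists them.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def choose_coercion_target(current_ctype, environ, is_available):
    """Return the locale to coerce to, or None if no coercion should happen.

    current_ctype: name of the active LC_CTYPE locale (assumed input).
    environ: mapping of environment variables.
    is_available: hypothetical callback reporting whether a locale exists.
    """
    # PYTHONCOERCECLOCALE=0 disables coercion entirely.
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return None
    # Only the legacy C locale is a candidate for coercion.
    if current_ctype != "C":
        return None
    # Pick the first target that the platform actually provides.
    for target in COERCION_TARGETS:
        if is_available(target):
            return target
    return None


def coerced_settings(target, environ):
    """Return the environment changes implied by a successful coercion."""
    settings = {"LC_CTYPE": target}
    # PYTHONIOENCODING is only set if the user has not already chosen a value.
    if "PYTHONIOENCODING" not in environ:
        settings["PYTHONIOENCODING"] = "utf-8:surrogateescape"
    return settings
```

Passing ``is_available`` in as a parameter keeps the sketch independent of
whichever locales happen to be installed on the machine running it.
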
With this change, any \*nix platform that does *not* offer at least one of the
``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
configuration would only be considered a fully supported platform for CPython
3.7+ deployments when either the new ``PYTHONUTF8`` defined in PEP 540 is used,
or else a suitable locale other than the default ``C`` locale is configured
explicitly (e.g. ``zh_CN.gb18030``).
3.7+ deployments when either the new ``PYTHONUTF8`` mode defined in PEP 540 is
used, or else a suitable locale other than the default ``C`` locale is
configured explicitly (e.g. ``en_AU.UTF-8``, ``zh_CN.gb18030``).
Redistributors (such as Linux distributions) with a narrower target audience
than the upstream CPython development team may also choose to opt in to this
behaviour for the Python 3.6.x series by applying the necessary changes as a
downstream patch when first introducing Python 3.6.0.
locale coercion behaviour for the Python 3.6.x series by applying the necessary
changes as a downstream patch when first introducing Python 3.6.0.
Background
@ -120,7 +132,7 @@ still fail in the following cases:
* SSH environment forwarding means that SSH clients may sometimes forward
client locale settings to servers that don't have that locale installed. This
leads to CPython running in the default ASCII-based C locale.
leads to CPython running in the default ASCII-based C locale
* some process environments (such as Linux containers) may not have any
explicit locale configured at all. As with unknown locales, this leads to
CPython running in the default ASCII-based C locale
@ -156,7 +168,7 @@ Relationship with other PEPs
============================
This PEP shares a common problem statement with PEP 540 (improving Python 3's
behaviour in the default C locale), but diverged markedly in the proposed
behaviour in the default C locale), but diverges markedly in the proposed
solution:
* PEP 540 proposes to entirely decouple CPython's default text encoding from
@ -174,7 +186,7 @@ solution:
traditional strong support for integration with other components written
in C and C++, while actively helping to push forward the adoption and
standardisation of the C.UTF-8 locale as a Unicode-aware replacement for
the legacy C locale in the wider Linux ecosystem
the legacy C locale in the wider C/C++ ecosystem
After reviewing both PEPs, it became clear that they didn't actually conflict
at a technical level, and the proposal in PEP 540 offered a superior option in
@ -183,14 +195,18 @@ reference behaviour for platforms where the notion of a "locale encoding"
doesn't make sense (for example, embedded systems running MicroPython rather
than the CPython reference interpreter).
As a result, this PEP was amended to specify PEP 540 as a pre-requisite, with
the aim being to coerce other C/C++ components into behaving consistently with
CPython's assumption of UTF-8 as the system encoding, rather than CPython itself
relying on that setting change.
Meanwhile, this PEP offered improved compatibility with other C/C++ components,
and an approach more amenable to being backported to Python 3.6 by downstream
redistributors.
As a result of that change, the ``LC_CTYPE=en_US.UTF-8`` legacy fallback was
removed from the list of UTF-8 locales tried as a coercion target, with CPython
instead relying solely on the C locale text encoding bypass in such cases.
As a result, this PEP was amended to refer to PEP 540 as a complementary
solution that offered improved behaviour both when locale coercion triggered,
as well as when none of the standard UTF-8 based locales were available.
The availability of PEP 540 also meant that the ``LC_CTYPE=en_US.UTF-8`` legacy
fallback was removed from the list of UTF-8 locales tried as a coercion target,
with CPython instead relying solely on the proposed PYTHONUTF8 mode in such
cases.
Motivation
@ -203,9 +219,8 @@ application development. Technologies like Gnome Flatpak [7_] and
Ubuntu Snappy [8_] further aim to bring these same techniques to Linux GUI
application development.
When using Python 3 for application development in
these contexts, it isn't uncommon to see text encoding related errors akin to
the following::
When using Python 3 for application development in these contexts, it isn't
uncommon to see text encoding related errors akin to the following::
$ docker run --rm fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
Unable to decode the command from the command line:
@ -304,6 +319,7 @@ proposed solution:
release announcements. However, to minimize the chance of introducing new
problems for end users, we'll do this *without* using the warnings system, so
even running with ``-Werror`` won't turn it into a runtime exception
* any changes made will use *existing* configuration options
To minimize the negative impact on systems currently correctly configured to
use GB-18030 or another partially ASCII compatible universal encoding leads to
@ -434,7 +450,8 @@ be issued::
Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
encoding), which may cause Unicode compatibility problems. Using C.UTF-8
(if available) as an alternative Unicode-compatible locale is recommended.
C.utf8, or UTF-8 (if available) as alternative Unicode-compatible
locales is recommended.
In this case, no actual change will be made to the locale settings.
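
A minimal Python sketch of this check follows, assuming the revised warning
wording shown above. The function name ``warn_if_legacy_c_locale`` and its
``stream`` parameter are hypothetical. Writing directly to the stream, rather
than going through the ``warnings`` module, reflects the stated goal that
``-Werror`` should not turn the message into an exception:

```python
import sys

# Warning text taken from the revised wording in this PEP.
LEGACY_C_LOCALE_WARNING = (
    "Python runtime initialized with LC_CTYPE=C (a locale with default ASCII "
    "encoding), which may cause Unicode compatibility problems. Using C.UTF-8, "
    "C.utf8, or UTF-8 (if available) as alternative Unicode-compatible "
    "locales is recommended.\n"
)


def warn_if_legacy_c_locale(ctype_locale, stream=None):
    """Write the warning if the legacy C locale is active; return True if written.

    The message bypasses the warnings system, and the locale settings
    themselves are left untouched.
    """
    if ctype_locale != "C":
        return False
    out = sys.stderr if stream is None else stream
    out.write(LEGACY_C_LOCALE_WARNING)
    return True
```

From Python code, the active setting could be queried with
``locale.setlocale(locale.LC_CTYPE)`` and passed in as ``ctype_locale``.
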
@ -754,14 +771,15 @@ runtimes even when running a version with this change applied.
Implementation
==============
A draft implementation of the change (including test cases) has been
posted to issue 28180 [1_], which is an end user request that
A draft implementation of the change (including test cases and documentation)
is linked from issue 28180 [1_], which is an end user request that
``sys.getfilesystemencoding()`` default to ``utf-8`` rather than ``ascii``.
NOTE: The currently posted draft implementation is for a previous iteration
of the PEP prior to the incorporation of the feedback noted in [11_]. It was
broadly the same in concept (i.e. coercing the legacy C locale to one based on
UTF-8), but differs in several details.
This patch is now being maintained as the ``pep538-coerce-c-locale`` feature
branch [18_] in Nick Coghlan's fork of the CPython repository on GitHub.
NOTE: As discussed in [1_], the currently posted draft implementation has some
known issues on Android.
Backporting to earlier Python 3 releases
@ -789,6 +807,10 @@ backport it to even earlier Python 3.x releases based on the needs and
interests of their particular user base, this wouldn't be encouraged as a
general practice.
However, configuring Python 3 *environments* (such as base container
images) to use these configuration settings by default is both allowed
and recommended.
Acknowledgements
================
@ -882,6 +904,9 @@ References
.. [17] UTF-8 locale discussion on "locale.getdefaultlocale() fails on Mac OS X with default language set to English"
(http://bugs.python.org/issue18378#msg215215)
.. [18] GitHub branch diff for ``ncoghlan:pep538-coerce-c-locale``
(https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-locale)
Copyright
=========