PEP 538: Update to depend on PEP 540

- relies entirely on PEP 540 when no appropriate locale
  is available
- uses surrogateescape on standard streams by default
- accounts for BSD-style UTF-8 locales
- avoids any reliance on the en_US.UTF-8 locale
- makes note of related GNU readline issue on Android
Nick Coghlan 2017-01-21 01:13:24 +11:00
parent f67dd4a759
commit 481573aa27
1 changed files with 265 additions and 148 deletions


@ -6,6 +6,7 @@ Author: Nick Coghlan <ncoghlan@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Requires: 540
Created: 28-Dec-2016
Python-Version: 3.7
Post-History: 03-Jan-2017 (linux-sig),
@ -18,33 +19,40 @@ Abstract
An ongoing challenge with Python 3 on \*nix systems is the conflict between
needing to use the configured locale encoding by default for consistency with
other C/C++ components in the same process and those invoked in subprocesses,
and the fact that the standard C locale (as defined in POSIX:2001) specifies
a default text encoding of ASCII, which is entirely inadequate for the
and the fact that the standard C locale (as defined in POSIX:2001) typically
implies a default text encoding of ASCII, which is entirely inadequate for the
development of networked services and client applications in a multilingual
world.
This PEP proposes that the way the CPython implementation handles the default
C locale be changed such that:
PEP 540 proposes a change to CPython's handling of the legacy C locale such
that CPython will assume the use of UTF-8 in such environments, rather than
persisting with the demonstrably problematic assumption of ASCII as an
appropriate encoding for communicating with operating system interfaces.
However, it comes at the cost of making CPython's encoding assumptions diverge
from those of other C and C++ components in the same process, as well as those
of components running in subprocesses that share the same environment.
Accordingly, this PEP further proposes that the way the CPython implementation
handles the default C locale be changed such that:
* the standalone CPython binary will automatically attempt to coerce the ``C``
locale to ``C.UTF-8`` (preferred), ``C.utf8`` or ``en_US.UTF-8`` unless the
new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``
locale to ``C.UTF-8``, ``C.utf8``, or ``UTF-8`` (depending on the system),
unless the new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``
* if the subsequent runtime initialization process detects that the legacy
``C`` locale remains active (e.g. locale coercion is disabled, or the runtime
is embedded in an application other than the main CPython binary), it will
emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
text encoding may cause various Unicode compatibility issues
Explicitly configuring the ``C.UTF-8`` or ``en_US.UTF-8`` locales has already
been used successfully for a number of years (including by the PEP author) to
get Python 3 running reliably in environments where no locale is otherwise
configured (such as Docker containers).
``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
are available, locale coercion is disabled, or the runtime is embedded in an
application other than the main CPython binary), and the ``PYTHONUTF8``
feature defined in PEP 540 is also disabled, it will emit a warning on
stderr that use of the legacy ``C`` locale's default ASCII text encoding
may cause various Unicode compatibility issues
With this change, any \*nix platform that does *not* offer at least one of the
``C.UTF-8``, ``C.utf8`` or ``en_US.UTF-8`` locales as part of its standard
``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
configuration would only be considered a fully supported platform for CPython
3.7+ deployments when a locale other than the default ``C`` locale is
configured explicitly.
3.7+ deployments when either the new ``PYTHONUTF8`` feature defined in PEP 540
is used, or else a suitable locale other than the default ``C`` locale is
configured explicitly (e.g. ``zh_CN.gb18030``).
Redistributors (such as Linux distributions) with a narrower target audience
than the upstream CPython development team may also choose to opt in to this
@ -57,11 +65,11 @@ Background
While the CPython interpreter is starting up, it may need to convert from
the ``char *`` format to the ``wchar_t *`` format, or from one of those formats
to ``PyUnicodeObject *``, before its own text encoding handling machinery is
fully configured. It handles these cases by relying on the operating system to
do the conversion and then ensuring that the text encoding name reported by
``sys.getfilesystemencoding()`` matches the encoding used during this early
bootstrapping process.
to ``PyUnicodeObject *``, in a way that's consistent with the locale settings
of the overall system. It handles these cases by relying on the operating
system to do the conversion and then ensuring that the text encoding name
reported by ``sys.getfilesystemencoding()`` matches the encoding used during
this early bootstrapping process.
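As a rough illustration of the values involved (a minimal sketch rather than anything specified by this PEP; the helper name ``describe_text_encodings`` is hypothetical), the following uses only standard library calls to report the active ``LC_CTYPE`` setting alongside the encodings the running interpreter derived during startup:

```python
import locale
import sys


def describe_text_encodings():
    """Report the locale and encoding settings of the running interpreter."""
    return {
        # Passing None queries the current LC_CTYPE setting without changing it
        "lc_ctype": locale.setlocale(locale.LC_CTYPE, None),
        # Encoding used for file names and other operating system interfaces
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Encoding reported as preferred for text data (no setlocale() call)
        "preferred_encoding": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in describe_text_encodings().items():
        print(f"{name}: {value}")
```

Running this under different ``LANG`` and ``LC_ALL`` settings is one way to see how the reported values depend on the environment; the exact results vary by platform and Python version.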
On Apple platforms (including both Mac OS X and iOS), this is straightforward,
as Apple guarantees that these operations will always use UTF-8 to do the
@ -72,16 +80,13 @@ conversions proved sufficiently problematic that PEP 528 and PEP 529 were
implemented to bypass the operating system supplied interfaces for binary data
handling and force the use of UTF-8 instead.
On Android, the locale settings are of limited relevance (due to most
applications running in the UTF-16-LE based Dalvik environment) and there's
limited value in preserving backwards compatibility with other locale aware
C/C++ components in the same process (since it's a relatively new target
platform for CPython), so CPython bypasses the operating system provided APIs
and hardcodes the use of UTF-8 (similar to its behaviour on Apple platforms).
On Android, many components, including CPython, already assume the use of UTF-8
as the system encoding, regardless of the locale setting. However, this isn't
the case for all components, and the discrepancy can cause problems in some
situations (for example, when using the GNU readline module [16_]).
On non-Apple and non-Android \*nix systems however, these operations are
handled using the C locale system in glibc, which has the following
characteristics [4_]:
On non-Apple and non-Android \*nix systems, these operations are handled using
the C locale system in glibc, which has the following characteristics [4_]:
* by default, all processes start in the ``C`` locale, which uses ``ASCII``
for these conversions. This is almost never what anyone doing multilingual
@ -113,9 +118,9 @@ they do when overriding the locale with one based on UTF-8)
These calls are usually sufficient to provide sensible behaviour, but they can
still fail in the following cases:
* SSH environment forwarding means that SSH clients will often forward
* SSH environment forwarding means that SSH clients may sometimes forward
client locale settings to servers that don't have that locale installed. This
leads to CPython running in the default ASCII-based C locale
leads to CPython running in the default ASCII-based C locale.
* some process environments (such as Linux containers) may not have any
explicit locale configured at all. As with unknown locales, this leads to
CPython running in the default ASCII-based C locale
@ -126,6 +131,18 @@ application. For example::
LC_ALL=C.UTF-8 LANG=C.UTF-8 python3 ...
The ``C.UTF-8`` locale is a full locale definition that uses ``UTF-8`` for the
``LC_CTYPE`` category, and the same settings as the ``C`` locale for all other
categories (including ``LC_COLLATE``). It is offered by a number of Linux
distributions (including Debian, Ubuntu, Fedora, Alpine and Android) as an
alternative to the ASCII-based C locale.
Mac OS X and other \*BSD systems have taken a different approach: rather than
offering a ``C.UTF-8`` locale, they offer a partial ``UTF-8`` locale that
only defines the ``LC_CTYPE`` category. On such systems, the preferred
environmental locale adjustment is to set ``LC_CTYPE=UTF-8`` rather than to set
``LC_ALL`` or ``LANG``. [17_]
In the specific case of Docker containers and similar technologies, the
appropriate locale setting can be specified directly in the container image
definition.
@ -139,7 +156,7 @@ Relationship with other PEPs
============================
This PEP shares a common problem statement with PEP 540 (improving Python 3's
behaviour in the default C locale), but diverges markedly in the proposed
behaviour in the default C locale), but diverged markedly in the proposed
solution:
* PEP 540 proposes to entirely decouple CPython's default text encoding from
@ -148,7 +165,7 @@ solution:
and in subprocesses. This approach aims to make CPython behave less like a
locale-aware C/C++ application, and more like C/C++ independent language
runtimes like the JVM, .NET CLR, Go, Node.js, and Rust
* this PEP proposes to instead override the legacy C locale with a more recently
* this PEP proposes to override the legacy C locale with a more recently
defined locale that uses UTF-8 as its default text encoding. This means that
the text encoding override will apply not only to CPython, but also to any
locale aware extension modules loaded into the current process, as well as to
@ -157,32 +174,23 @@ solution:
traditional strong support for integration with other components written
in C and C++, while actively helping to push forward the adoption and
standardisation of the C.UTF-8 locale as a Unicode-aware replacement for
the legacy C locale
the legacy C locale in the wider Linux ecosystem
While the two PEPs present alternate proposed behavioural improvements that
align with the interests of different parts of the Python user community, they
don't actually conflict at a technical level.
After reviewing both PEPs, it became clear that they didn't actually conflict
at a technical level, and the proposal in PEP 540 offered a superior option in
cases where no suitable locale was available, as well as offering a better
reference behaviour for platforms where the notion of a "locale encoding"
doesn't make sense (for example, embedded systems running MicroPython rather
than the CPython reference interpreter).
That means it would be entirely possible to implement both of them, and end up
with a situation where redistributors, application integrators, and end users
can choose between:
As a result, this PEP was amended to specify PEP 540 as a pre-requisite, with
the aim being to coerce other C/C++ components into behaving consistently with
CPython's assumption of UTF-8 as the system encoding, rather than CPython itself
relying on that setting change.
* coercing the default ASCII based C locale to a UTF-8 based locale
* instructing CPython to ignore the C locale and use UTF-8 instead
* doing both of the above (with this option as the default legacy C locale
handling)
* forcing use of the default ASCII based C locale by setting both
PYTHONCOERCECLOCALE=0 and PYTHONUTF8=0
If this approach was taken, then the proposed modifications to PEP 11 would
be adjusted to indicate that the only unsupported configurations are those where
both the legacy C locale coercion and the C locale text encoding bypass are
disabled.
Given such a hybrid implementation, it would also be reasonable to drop the
``en_US.UTF-8`` legacy fallback from the list of UTF-8 locales tried as a
coercion target and instead rely solely on the C locale text encoding bypass
in such cases.
As a result of that change, the ``LC_CTYPE=en_US.UTF-8`` legacy fallback was
removed from the list of UTF-8 locales tried as a coercion target, with CPython
instead relying solely on the C locale text encoding bypass in such cases.
Motivation
@ -275,21 +283,10 @@ While the glibc developers are working towards making the C.UTF-8 locale
universally available for use by glibc based applications like CPython [6_],
this unfortunately doesn't help on platforms that ship older versions of glibc
without that feature, and also don't provide C.UTF-8 as an on-disk locale the
way Debian and Fedora do. For these platforms, the best widely available
fallback option is the ``en_US.UTF-8`` locale, which while still being
unfortunately Anglo-centric, is at least significantly less Anglo-centric than
the ASCII text encoding assumption in the default C locale.
In the specific case of C locale coercion, the Anglo-centrism implied by the
use of ``en_US.UTF-8`` can be mitigated by configuring only the ``LC_CTYPE``
locale category, rather than overriding all the locale categories::
$ docker run --rm -e LANG=C.UTF-8 centos/python-35-centos7 python3 -c 'print("ℙƴ☂ℌøἤ")'
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed
$ docker run --rm -e LC_CTYPE=en_US.UTF-8 centos/python-35-centos7 python3 -c 'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
way Debian and Fedora do. For these platforms, the mechanism proposed in
PEP 540 at least allows CPython itself to behave sensibly, albeit without any
mechanism to get other C/C++ components that decode binary streams as text to
do the same.
Design Principles
@ -308,16 +305,16 @@ proposed solution:
problems for end users, we'll do this *without* using the warnings system, so
even running with ``-Werror`` won't turn it into a runtime exception
The general design principle of Python 3 to prefer raising an exception over
incorrectly encoding or decoding data then leads to the following additional
design guideline:
Minimizing the negative impact on systems currently correctly configured to
use GB-18030 or another partially ASCII compatible universal encoding leads to
an additional design principle:
* if a UTF-8 based Linux container is run on a host that is explicitly
configured to use a non-UTF-8 encoding, and tries to exchange locally
encoded data with that host rather than exchanging explicitly UTF-8 encoded
data, this will ideally lead to an immediate runtime exception rather than
to silent data corruption
data, CPython will endeavour to correctly round-trip host provided data that
is concatenated or split solely at common ASCII compatible code points, but
may otherwise emit nonsensical results.
Specification
@ -330,8 +327,9 @@ run as a standalone command line application.
It further proposes to emit a warning on stderr if the legacy ``C`` locale
is in effect at the point where the language runtime itself is initialized,
in order to warn system and application integrators that they're running
CPython in an unsupported configuration.
and the PEP 540 UTF-8 encoding override is also disabled, in order to warn
system and application integrators that they're running CPython in an
unsupported configuration.
Legacy C locale coercion in the standalone Python interpreter binary
@ -369,7 +367,7 @@ Three such locales will be tried:
* ``C.UTF-8`` (available at least in Debian, Ubuntu, and Fedora 25+, and
expected to be available by default in a future version of glibc)
* ``C.utf8`` (available at least in HP-UX)
* ``en_US.UTF-8`` (available at least in RHEL and CentOS)
* ``UTF-8`` (available in at least some \*BSD variants)
For ``C.UTF-8`` and ``C.utf8``, the coercion will be implemented by actually
setting the ``LANG`` and ``LC_ALL`` environment variables to the candidate
@ -377,15 +375,17 @@ locale name, such that future calls to ``setlocale()`` will see them, as will
other components looking for those settings (such as GUI development
frameworks).
The last fallback isn't ideal as a coercion target (as it changes more than
just the default text encoding), but has the benefit of currently being more
widely available than the C.UTF-8 locale. To minimize the chance of side
effects, only the ``LC_CTYPE`` environment variable would be set when using
this legacy fallback option, with the other locale categories being left alone.
For the platforms where it is defined, ``UTF-8`` is a partial locale that only
defines the ``LC_CTYPE`` category. Accordingly, only the ``LC_CTYPE``
environment variable would be set when using this fallback option.
Given time, more environments are expected to provide a ``C.UTF-8`` locale by
default, so falling all the way back to the ``en_US.UTF-8`` option is expected
to become less common.
To adjust automatically to future changes in locale availability, these checks
will be implemented at runtime on all platforms other than Mac OS X and Windows,
rather than attempting to determine which locales to try at compile time.
If the locale settings are changed successfully, and the ``PYTHONIOENCODING``
environment variable is currently unset, then it will be forced to
``PYTHONIOENCODING=utf-8:surrogateescape``.
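The selection logic described above can be sketched in Python as follows. This is an illustrative model only, not the proposed C implementation: it assumes the legacy ``C`` locale has already been detected, and the names ``COERCION_TARGETS``, ``plan_locale_coercion`` and ``locale_is_available`` are hypothetical, with the latter standing in for the runtime check of whether a candidate locale can actually be configured.

```python
# Candidate targets in the order the PEP lists them; the last one is the
# LC_CTYPE-only locale offered by at least some *BSD variants.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def plan_locale_coercion(environ, locale_is_available):
    """Return the environment updates that locale coercion would apply.

    ``environ`` is a mapping of the current environment variables, and
    ``locale_is_available`` is a hypothetical callable reporting whether
    a given locale name can be configured on this system.
    """
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return {}  # coercion explicitly disabled
    for target in COERCION_TARGETS:
        if not locale_is_available(target):
            continue
        if target == "UTF-8":
            # Partial locale: only the LC_CTYPE category is defined
            updates = {"LC_CTYPE": target}
        else:
            updates = {"LC_ALL": target, "LANG": target}
        if "PYTHONIOENCODING" not in environ:
            updates["PYTHONIOENCODING"] = "utf-8:surrogateescape"
        return updates
    return {}  # no suitable locale found: PEP 540 handles this case


if __name__ == "__main__":
    # Simulate a system that only provides the C.UTF-8 locale
    print(plan_locale_coercion({}, lambda name: name == "C.UTF-8"))
```

Modelling the check as a function that returns the planned updates, rather than one that modifies ``os.environ`` directly, keeps the sketch easy to exercise against simulated platforms.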
When this locale coercion is activated, the following warning will be
printed on stderr, with the warning containing whichever locale was
@ -394,14 +394,15 @@ successfully configured::
Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another
locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
When falling all the way back to the ``en_US.UTF-8`` locale, the message would
be slightly different::
When falling back to the ``UTF-8`` locale, the message would be slightly
different::
Python detected LC_CTYPE=C, LC_CTYPE set to en_US.UTF-8 (set another locale
Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale
or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
This locale coercion will mean that the standard Python binary should once
again "just work" in the three main failure cases we're aware of (missing locale
In combination with PEP 540, this locale coercion will mean that the standard
Python binary *and* locale aware C/C++ extensions should once again "just work"
in the three main failure cases we're aware of (missing locale
settings, SSH forwarding of unknown locales, and developers explicitly
requesting ``LANG=C``), as long as the target platform provides at least one
of the candidate UTF-8 based environments.
@ -427,7 +428,8 @@ doing so would introduce inconsistencies in decoded text, even in the context
of the standalone Python interpreter binary.
Accordingly, when ``Py_Initialize`` is called and CPython detects that the
configured locale is still the default ``C`` locale, the following warning will
configured locale is still the default ``C`` locale *and* the ``PYTHONUTF8``
feature from PEP 540 is disabled, the following warning will
be issued::
Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
@ -440,6 +442,10 @@ Instead, the warning informs both system and application integrators that
they're running Python 3 in a configuration that we don't expect to work
properly.
The second sentence providing recommendations would be conditionally compiled
based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8`` on \*BSD
systems).
New build-time configuration options
------------------------------------
@ -465,15 +471,16 @@ Platform Support Changes
A new "Legacy C Locale" section will be added to PEP 11 that states:
* as of Python 3.7, the legacy C locale is no longer officially supported,
and any Unicode handling issues that occur only in that locale and cannot be
reproduced in an appropriately configured non-ASCII locale will be closed as
"won't fix"
* as of Python 3.7, \*nix platforms are expected to provide at least one of
``C.UTF-8``, ``C.utf8`` or ``en_US.UTF-8`` as an alternative to the legacy
``C`` locale. On platforms which don't yet provide any of these locales, an
explicit non-ASCII locale setting will be needed to configure a fully
supported environment for running Python 3.7+
* as of CPython 3.7, the legacy C locale is only supported when operating in
"UTF-8" mode. Any Unicode handling issues that occur only in that locale
and cannot be reproduced in an appropriately configured non-ASCII locale will
be closed as "won't fix"
* as of CPython 3.7, \*nix platforms are expected to provide at least one of
``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8``
(``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
Any Unicode related integration problems with C/C++ extensions that occur
only in that locale and cannot be reproduced in an appropriately configured
non-ASCII locale will be closed as "won't fix".
Rationale
@ -502,14 +509,14 @@ C/C++ components sharing the same process, as well as with the user's desktop
locale settings, than it is with the emergent conventions of modern network
service development.
The core premise of this PEP is that for *all* of these use cases, the default
"C" locale is the wrong choice, and furthermore that the following assumptions
are valid:
The core premise of this PEP is that for *all* of these use cases, the
assumption of ASCII implied by the default "C" locale is the wrong choice,
and furthermore that the following assumptions are valid:
* in desktop application use cases, the process locale will *already* be
configured appropriately, and if it isn't, then that is an operating system
level problem that needs to be reported to and resolved by the operating
system provider
or embedding application level problem that needs to be reported to and
resolved by the operating system provider or application developer
* in network service development use cases (especially those based on Linux
containers), the process locale may not be configured *at all*, and if it
isn't, then the expectation is that components will impose their own default
@ -517,54 +524,151 @@ are valid:
default encoding of ASCII the way CPython currently does
Defaulting to "strict" error handling on the standard IO streams
----------------------------------------------------------------
Defaulting to "surrogateescape" error handling on the standard IO streams
-------------------------------------------------------------------------
By coercing the locale away from the legacy C default and its assumption of
ASCII as the preferred text encoding, this PEP also disables the implicit use
of the "surrogateescape" error handler on the standard IO streams that was
introduced in Python 3.5 ([15_]).
introduced in Python 3.5 ([15_]), as well as the automatic use of
``surrogateescape`` when operating in PEP 540's UTF-8 mode.
This is deliberate, as that change was primarily aimed at handling the case
where the correct system encoding was the ASCII-compatible UTF-8 (or another
ASCII-compatible encoding), but the nominal encoding used for operating system
interfaces in the current process was ASCII.
Rather than introducing yet another configuration option to address that,
this PEP proposes to use the existing ``PYTHONIOENCODING`` setting to ensure
that the ``surrogateescape`` handler is enabled when the interpreter is
required to make assumptions regarding the expected filesystem encoding.
With this PEP, that assumption is being narrowed a step further, such that
rather than assuming "an ASCII-compatible encoding", we instead assume UTF-8
specifically. If that assumption is genuinely wrong, then it continues to be
friendlier to users of other encodings to alert them to the runtime's mistaken
assumption, rather than continuing on and potentially corrupting their data
permanently.
The aim of this behaviour is to attempt to ensure that operating system
provided text values are typically able to be transparently passed through a
Python 3 application even if it is incorrect in assuming that that text has
been encoded as UTF-8.
In particular, GB 18030 [12_] is a Chinese national text encoding standard
that handles all Unicode code points, but is incompatible with both ASCII and
UTF-8.
that handles all Unicode code points and is formally incompatible with both
ASCII and UTF-8, but will nevertheless often tolerate processing as surrogate
escaped data - the points where GB 18030 reuses ASCII byte values in an
incompatible way are likely to be invalid in UTF-8, and will therefore be
escaped and opaque to string processing operations that split on or search for
the relevant ASCII code points. Operations that don't involve splitting on or
searching for particular ASCII or Unicode code point values are almost
certain to work correctly.
Similarly, Shift-JIS [13_] and ISO-2022-JP [14_] remain in widespread use in
Japan, and are incompatible with both ASCII and UTF-8.
Japan, and are incompatible with both ASCII and UTF-8, but will tolerate text
processing operations that don't involve splitting on or searching for
particular ASCII or Unicode code point values.
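The round-tripping behaviour this relies on can be demonstrated with the standard codecs alone. The following is a minimal sketch (the variable names are illustrative) showing that GB 18030 encoded data is rejected by strict UTF-8 decoding, but passes through unchanged when ``surrogateescape`` is used for both decoding and encoding:

```python
TEXT = "ℙƴ☂ℌøἤ\n"
gb18030_bytes = TEXT.encode("gb18030")

# Strict decoding rejects the data outright, as it is not valid UTF-8
try:
    gb18030_bytes.decode("utf-8")
except UnicodeDecodeError:
    strict_decoding_failed = True
else:
    strict_decoding_failed = False

# surrogateescape maps each undecodable byte to a lone surrogate code point
# when decoding, and maps it back to the original byte when encoding
as_text = gb18030_bytes.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

print(strict_decoding_failed)           # True
print(round_tripped == gb18030_bytes)   # True
print(as_text.endswith("\n"))           # True: the ASCII newline survives
```

The escaped string is not meaningful as text, but it can be passed along, and split or joined at ASCII code points such as the trailing newline, without losing the original bytes.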
Using strict error handling on the standard streams means that attempting to
pass information from a host system using one of these encodings into a
container application that is assuming the use of UTF-8 or vice-versa is likely
to cause an immediate Unicode encoding or decoding error, rather than
potentially causing silent data corruption.
As an example, consider two files, one encoded with UTF-8 (the default encoding
for ``en_AU.UTF-8``), and one encoded with GB-18030 (the default encoding for
``zh_CN.gb18030``)::
For users that would prefer more permissive behaviour, setting
``PYTHONIOENCODING=:surrogateescape`` will continue to be supported, as this
PEP makes no changes to that feature.
$ python3 -c 'open("utf8.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("utf-8"))'
$ python3 -c 'open("gb18030.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("gb18030"))'
On disk, we can see that these are two very different files::
$ python3 -c 'print("UTF-8: ", open("utf8.txt", "rb").read().strip()); \
print("GB18030:", open("gb18030.txt", "rb").read().strip())'
UTF-8: b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
GB18030: b'\x816\xbd6\x810\x9d0\x817\xa29\x816\xbc4\x810\x8b3\x816\x8d6'
Nevertheless, both can be rendered correctly to the terminal as long as
they're decoded prior to printing::
$ python3 -c 'print("UTF-8: ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())'
UTF-8: ℙƴ☂ℌøἤ
GB18030: ℙƴ☂ℌøἤ
By contrast, if we just pass along the raw bytes, as ``cat`` and similar C/C++
utilities will tend to do::
$ LANG=en_AU.UTF-8 cat utf8.txt gb18030.txt
ℙƴ☂ℌøἤ
�6�6�0�0�7�9�6�4�0�3�6�6
Even setting a specifically Chinese locale won't help in getting the
GB-18030 encoded file rendered correctly::
$ LANG=zh_CN.gb18030 cat utf8.txt gb18030.txt
ℙƴ☂ℌøἤ
�6�6�0�0�7�9�6�4�0�3�6�6
The problem is that the *terminal* encoding setting remains UTF-8, regardless
of the nominal locale. A GB18030 terminal can be emulated using the ``iconv``
utility::
$ cat utf8.txt gb18030.txt | iconv -f GB18030 -t UTF-8
鈩櫰粹槀鈩屆羔激
ℙƴ☂ℌøἤ
This reverses the problem, such that the GB18030 file is rendered correctly,
but the UTF-8 file has been converted to unrelated hanzi characters, rather than
the expected rendering of "Python" as non-ASCII characters.
With the emulated GB18030 terminal encoding, assuming UTF-8 in Python results
in *both* files being displayed incorrectly::
$ python3 -c 'print("UTF-8: ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
| iconv -f GB18030 -t UTF-8
UTF-8: 鈩櫰粹槀鈩屆羔激
GB18030: 鈩櫰粹槀鈩屆羔激
However, setting the locale correctly means that the emulated GB18030 terminal
now displays both files as originally intended::
$ LANG=zh_CN.gb18030 \
python3 -c 'print("UTF-8: ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
| iconv -f GB18030 -t UTF-8
UTF-8: ℙƴ☂ℌøἤ
GB18030: ℙƴ☂ℌøἤ
The rationale for retaining ``surrogateescape`` as the default IO error handler is
that it will preserve the following helpful behaviour in the C locale::
$ cat gb18030.txt \
| LANG=C python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ
Rather than reverting to the exception seen when a UTF-8 based locale is
explicitly configured::
$ cat gb18030.txt \
| python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib64/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte
Note: an alternative to setting ``PYTHONIOENCODING`` as the PEP currently
proposes would be to instead *always* default to ``surrogateescape`` on the
standard streams, and require the use of ``PYTHONIOENCODING=:strict`` to request
text encoding validation during stream processing. Adopting such an approach
would bring Python 3 more into line with typical C/C++ tools that pass along
the raw bytes without checking them for conformance to their nominal encoding,
and would hence also make the last example display the desired output::
$ cat gb18030.txt \
| PYTHONIOENCODING=:surrogateescape python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ
Dropping official support for Unicode handling in the legacy C locale
---------------------------------------------------------------------
Dropping official support for ASCII based text handling in the legacy C locale
------------------------------------------------------------------------------
We've been trying to get strict bytes/text separation to work reliably in the
legacy C locale for over a decade at this point. Not only haven't we been able
to get it to work, neither has anyone else - the only viable alternatives
identified have been to pass the bytes along verbatim without eagerly decoding
them to text (Python 2.x, Ruby, etc), or else to ignore the nominal C/C++ locale
encoding entirely and assume the use of either UTF-8 (PEP 540, Rust, Go,
Node.js, etc) or UTF-16-LE (JVM, .NET CLR).
them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal
C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540,
Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).
While this PEP ensures that developers that need to do so can still opt-in to
running their Python code in the legacy C locale, it also makes clear that we
@ -621,7 +725,10 @@ languages in subprocesses.
Setting ``LC_ALL`` to ``C.UTF-8`` imposes a locale setting override on all
C/C++ components in the current process and in any subprocesses that inherit
the current environment.
the current environment. This is important to handle cases where the problem
has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a system
where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
configured to forward locale settings, and the user logs into a Linux server).
Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``.
@ -647,15 +754,15 @@ runtimes even when running a version with this change applied.
Implementation
==============
A draft implementation of the change (including test cases) has been
posted to issue 28180 [1_], which is an end user request that
``sys.getfilesystemencoding()`` default to ``utf-8`` rather than ``ascii``.
NOTE: The currently posted draft implementation is for a previous iteration
of the PEP prior to the incorporation of the feedback noted in [11_]. It was
broadly the same in concept (i.e. coercing the legacy C locale to one based on
UTF-8), but differs in several details.
A draft implementation of the change (including test cases) has been
posted to issue 28180 [1_], which is an end user request that
``sys.getfilesystemencoding()`` default to ``utf-8`` rather than ``ascii``.
Backporting to earlier Python 3 releases
========================================
@ -666,8 +773,8 @@ Backporting to Python 3.6.0
If this PEP is accepted for Python 3.7, redistributors backporting the change
specifically to their initial Python 3.6.0 release will be both allowed and
encouraged. However, such backports should only be undertaken either in
conjunction with the changes needed to also provide the C.UTF-8 locale by
default, or else specifically for platforms where that locale is already
conjunction with the changes needed to also provide a suitable locale by
default, or else specifically for platforms where such a locale is already
consistently available.
@ -676,7 +783,7 @@ Backporting to other 3.x releases
While the proposed behavioural change is seen primarily as a bug fix addressing
Python 3's current misbehaviour in the default ASCII-based C locale, it still
represents a reasonable significant change in the way CPython interacts with
represents a reasonably significant change in the way CPython interacts with
the C locale system. As such, while some redistributors may still choose to
backport it to even earlier Python 3.x releases based on the needs and
interests of their particular user base, this wouldn't be encouraged as a
@ -716,6 +823,10 @@ PEP 540 [11_].
The "ℙƴ☂ℌøἤ" string used in the Unicode handling examples throughout this PEP
is taken from Ned Batchelder's excellent "Pragmatic Unicode" presentation [9_].
Stephen Turnbull has long provided valuable insight into the text encoding
handling challenges he regularly encounters at the University of Tsukuba
(筑波大学).
References
==========
@ -765,6 +876,12 @@ References
.. [15] Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale
(https://bugs.python.org/issue19977)
.. [16] test_readline.test_nonascii fails on Android
(http://bugs.python.org/issue28997)
.. [17] UTF-8 locale discussion on "locale.getdefaultlocale() fails on Mac OS X with default language set to English"
(http://bugs.python.org/issue18378#msg215215)
Copyright
=========