PEP 538: Update to depend on PEP 540

- relies entirely on PEP 540 when no appropriate locale
  is available
- uses surrogateescape on standard streams by default
- accounts for BSD-style UTF-8 locales
- avoids any reliance on the en_US-UTF-8 locale
- makes note of related GNU readline issue on Android
Nick Coghlan 2017-01-21 01:13:24 +11:00
parent f67dd4a759
commit 481573aa27
1 changed files with 265 additions and 148 deletions


@ -6,6 +6,7 @@ Author: Nick Coghlan <ncoghlan@gmail.com>
Status: Draft Status: Draft
Type: Standards Track Type: Standards Track
Content-Type: text/x-rst Content-Type: text/x-rst
Requires: 540
Created: 28-Dec-2016 Created: 28-Dec-2016
Python-Version: 3.7 Python-Version: 3.7
Post-History: 03-Jan-2017 (linux-sig), Post-History: 03-Jan-2017 (linux-sig),
@ -18,33 +19,40 @@ Abstract
An ongoing challenge with Python 3 on \*nix systems is the conflict between An ongoing challenge with Python 3 on \*nix systems is the conflict between
needing to use the configured locale encoding by default for consistency with needing to use the configured locale encoding by default for consistency with
other C/C++ components in the same process and those invoked in subprocesses, other C/C++ components in the same process and those invoked in subprocesses,
and the fact that the standard C locale (as defined in POSIX:2001) specifies and the fact that the standard C locale (as defined in POSIX:2001) typically
a default text encoding of ASCII, which is entirely inadequate for the implies a default text encoding of ASCII, which is entirely inadequate for the
development of networked services and client applications in a multilingual development of networked services and client applications in a multilingual
world. world.
This PEP proposes that the way the CPython implementation handles the default PEP 540 proposes a change to CPython's handling of the legacy C locale such
C locale be changed such that: that CPython will assume the use of UTF-8 in such environments, rather than
persisting with the demonstrably problematic assumption of ASCII as an
appropriate encoding for communicating with operating system interfaces.
However, it comes at the cost of making CPython's encoding assumptions diverge
from those of other C and C++ components in the same process, as well as those
of components running in subprocesses that share the same environment.
Accordingly, this PEP further proposes that the way the CPython implementation
handles the default C locale be changed such that:
* the standalone CPython binary will automatically attempt to coerce the ``C`` * the standalone CPython binary will automatically attempt to coerce the ``C``
locale to ``C.UTF-8`` (preferred), ``C.utf8`` or ``en_US.UTF-8`` unless the locale to ``C.UTF-8``, ``C.utf8``, or ``UTF-8`` (depending on the system),
new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0`` unless the new ``PYTHONCOERCECLOCALE`` environment variable is set to ``0``
* if the subsequent runtime initialization process detects that the legacy * if the subsequent runtime initialization process detects that the legacy
``C`` locale remains active (e.g. locale coercion is disabled, or the runtime ``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
is embedded in an application other than the main CPython binary), it will are available, locale coercion is disabled, or the runtime is embedded in an
emit a warning on stderr that use of the legacy ``C`` locale's default ASCII application other than the main CPython binary), and the ``PYTHONUTF8``
text encoding may cause various Unicode compatibility issues feature defined in PEP 540 is also disabled, it will emit a warning on
stderr that use of the legacy ``C`` locale's default ASCII text encoding
Explicitly configuring the ``C.UTF-8`` or ``en_US.UTF-8`` locales has already may cause various Unicode compatibility issues
been used successfully for a number of years (including by the PEP author) to
get Python 3 running reliably in environments where no locale is otherwise
configured (such as Docker containers).
With this change, any \*nix platform that does *not* offer at least one of the With this change, any \*nix platform that does *not* offer at least one of the
``C.UTF-8``, ``C.utf8`` or ``en_US.UTF-8`` locales as part of its standard ``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
configuration would only be considered a fully supported platform for CPython configuration would only be considered a fully supported platform for CPython
3.7+ deployments when a locale other than the default ``C`` locale is 3.7+ deployments when either the new ``PYTHONUTF8`` feature defined in PEP 540 is used,
configured explicitly. or else a suitable locale other than the default ``C`` locale is configured
explicitly (e.g. ``zh_CN.gb18030``).
Redistributors (such as Linux distributions) with a narrower target audience Redistributors (such as Linux distributions) with a narrower target audience
than the upstream CPython development team may also choose to opt in to this than the upstream CPython development team may also choose to opt in to this
@ -57,11 +65,11 @@ Background
While the CPython interpreter is starting up, it may need to convert from While the CPython interpreter is starting up, it may need to convert from
the ``char *`` format to the ``wchar_t *`` format, or from one of those formats the ``char *`` format to the ``wchar_t *`` format, or from one of those formats
to ``PyUnicodeObject *``, before its own text encoding handling machinery is to ``PyUnicodeObject *``, in a way that's consistent with the locale settings
fully configured. It handles these cases by relying on the operating system to of the overall system. It handles these cases by relying on the operating
do the conversion and then ensuring that the text encoding name reported by system to do the conversion and then ensuring that the text encoding name
``sys.getfilesystemencoding()`` matches the encoding used during this early reported by ``sys.getfilesystemencoding()`` matches the encoding used during
bootstrapping process. this early bootstrapping process.
On Apple platforms (including both Mac OS X and iOS), this is straightforward, On Apple platforms (including both Mac OS X and iOS), this is straightforward,
as Apple guarantees that these operations will always use UTF-8 to do the as Apple guarantees that these operations will always use UTF-8 to do the
@ -72,16 +80,13 @@ conversions proved sufficiently problematic that PEP 528 and PEP 529 were
implemented to bypass the operating system supplied interfaces for binary data implemented to bypass the operating system supplied interfaces for binary data
handling and force the use of UTF-8 instead. handling and force the use of UTF-8 instead.
On Android, the locale settings are of limited relevance (due to most On Android, many components, including CPython, already assume the use of UTF-8
applications running in the UTF-16-LE based Dalvik environment) and there's as the system encoding, regardless of the locale setting. However, this isn't
limited value in preserving backwards compatibility with other locale aware the case for all components, and the discrepancy can cause problems in some
C/C++ components in the same process (since it's a relatively new target situations (for example, when using the GNU readline module [16_]).
platform for CPython), so CPython bypasses the operating system provided APIs
and hardcodes the use of UTF-8 (similar to its behaviour on Apple platforms).
On non-Apple and non-Android \*nix systems however, these operations are On non-Apple and non-Android \*nix systems, these operations are handled using
handled using the C locale system in glibc, which has the following the C locale system in glibc, which has the following characteristics [4_]:
characteristics [4_]:
* by default, all processes start in the ``C`` locale, which uses ``ASCII`` * by default, all processes start in the ``C`` locale, which uses ``ASCII``
for these conversions. This is almost never what anyone doing multilingual for these conversions. This is almost never what anyone doing multilingual
@ -113,9 +118,9 @@ they do when overriding the locale with one based on UTF-8)
These calls are usually sufficient to provide sensible behaviour, but they can These calls are usually sufficient to provide sensible behaviour, but they can
still fail in the following cases: still fail in the following cases:
* SSH environment forwarding means that SSH clients will often forward * SSH environment forwarding means that SSH clients may sometimes forward
client locale settings to servers that don't have that locale installed. This client locale settings to servers that don't have that locale installed. This
leads to CPython running in the default ASCII-based C locale leads to CPython running in the default ASCII-based C locale.
* some process environments (such as Linux containers) may not have any * some process environments (such as Linux containers) may not have any
explicit locale configured at all. As with unknown locales, this leads to explicit locale configured at all. As with unknown locales, this leads to
CPython running in the default ASCII-based C locale CPython running in the default ASCII-based C locale
@ -126,6 +131,18 @@ application. For example::
LC_ALL=C.UTF-8 LANG=C.UTF-8 python3 ... LC_ALL=C.UTF-8 LANG=C.UTF-8 python3 ...
The ``C.UTF-8`` locale is a full locale definition that uses ``UTF-8`` for the
``LC_CTYPE`` category, and the same settings as the ``C`` locale for all other
categories (including ``LC_COLLATE``). It is offered by a number of Linux
distributions (including Debian, Ubuntu, Fedora, Alpine and Android) as an
alternative to the ASCII-based C locale.
Mac OS X and other \*BSD systems have taken a different approach, and instead
of offering a ``C.UTF-8`` locale, instead offer a partial ``UTF-8`` locale that
only defines the ``LC_CTYPE`` category. On such systems, the preferred
environmental locale adjustment is to set ``LC_CTYPE=UTF-8`` rather than to set
``LC_ALL`` or ``LANG``. [17_]
In the specific case of Docker containers and similar technologies, the In the specific case of Docker containers and similar technologies, the
appropriate locale setting can be specified directly in the container image appropriate locale setting can be specified directly in the container image
definition. definition.
@ -139,7 +156,7 @@ Relationship with other PEPs
============================ ============================
This PEP shares a common problem statement with PEP 540 (improving Python 3's This PEP shares a common problem statement with PEP 540 (improving Python 3's
behaviour in the default C locale), but diverges markedly in the proposed behaviour in the default C locale), but diverged markedly in the proposed
solution: solution:
* PEP 540 proposes to entirely decouple CPython's default text encoding from * PEP 540 proposes to entirely decouple CPython's default text encoding from
@ -148,7 +165,7 @@ solution:
and in subprocesses. This approach aims to make CPython behave less like a and in subprocesses. This approach aims to make CPython behave less like a
locale-aware C/C++ application, and more like C/C++ independent language locale-aware C/C++ application, and more like C/C++ independent language
runtimes like the JVM, .NET CLR, Go, Node.js, and Rust runtimes like the JVM, .NET CLR, Go, Node.js, and Rust
* this PEP proposes to instead override the legacy C locale with a more recently * this PEP proposes to override the legacy C locale with a more recently
defined locale that uses UTF-8 as its default text encoding. This means that defined locale that uses UTF-8 as its default text encoding. This means that
the text encoding override will apply not only to CPython, but also to any the text encoding override will apply not only to CPython, but also to any
locale aware extension modules loaded into the current process, as well as to locale aware extension modules loaded into the current process, as well as to
@ -157,32 +174,23 @@ solution:
traditional strong support for integration with other components written traditional strong support for integration with other components written
in C and C++, while actively helping to push forward the adoption and in C and C++, while actively helping to push forward the adoption and
standardisation of the C.UTF-8 locale as a Unicode-aware replacement for standardisation of the C.UTF-8 locale as a Unicode-aware replacement for
the legacy C locale the legacy C locale in the wider Linux ecosystem
While the two PEPs present alternate proposed behavioural improvements that After reviewing both PEPs, it became clear that they didn't actually conflict
align with the interests of different parts of the Python user community, they at a technical level, and the proposal in PEP 540 offered a superior option in
don't actually conflict at a technical level. cases where no suitable locale was available, as well as offering a better
reference behaviour for platforms where the notion of a "locale encoding"
doesn't make sense (for example, embedded systems running MicroPython rather
than the CPython reference interpreter).
That means it would be entirely possible to implement both of them, and end up As a result, this PEP was amended to specify PEP 540 as a pre-requisite, with
with a situation where redistributors, application integrators, and end users the aim being to coerce other C/C++ components into behaving consistently with
can choose between: CPython's assumption of UTF-8 as the system encoding, rather than CPython itself
relying on that setting change.
* coercing the default ASCII based C locale to a UTF-8 based locale As a result of that change, the ``LC_CTYPE=en_US.UTF-8`` legacy fallback was
* instructing CPython to ignore the C locale and use UTF-8 instead removed from the list of UTF-8 locales tried as a coercion target, with CPython
* doing both of the above (with this option as the default legacy C locale instead relying solely on the C locale text encoding bypass in such cases.
handling)
* forcing use of the default ASCII based C locale by setting both
PYTHONCOERCECLOCALE=0 and PYTHONUTF8=0
If this approach was taken, then the proposed modifications to PEP 11 would
be adjusted to indicate that the only unsupported configurations are those where
both the legacy C locale coercion and the C locale text encoding bypass are
disabled.
Given such a hybrid implementation, it would also be reasonable to drop the
``en_US.UTF-8`` legacy fallback from the list of UTF-8 locales tried as a
coercion target and instead rely solely on the C locale text encoding bypass
in such cases.
Motivation Motivation
@ -275,21 +283,10 @@ While the glibc developers are working towards making the C.UTF-8 locale
universally available for use by glibc based applications like CPython [6_], universally available for use by glibc based applications like CPython [6_],
this unfortunately doesn't help on platforms that ship older versions of glibc this unfortunately doesn't help on platforms that ship older versions of glibc
without that feature, and also don't provide C.UTF-8 as an on-disk locale the without that feature, and also don't provide C.UTF-8 as an on-disk locale the
way Debian and Fedora do. For these platforms, the best widely available way Debian and Fedora do. For these platforms, the mechanism proposed in
fallback option is the ``en_US.UTF-8`` locale, which while still being PEP 540 at least allows CPython itself to behave sensibly, albeit without any
unfortunately Anglo-centric, is at least significantly less Anglo-centric than mechanism to get other C/C++ components that decode binary streams as text to
the ASCII text encoding assumption in the default C locale. do the same.
In the specific case of C locale coercion, the Anglo-centrism implied by the
use of ``en_US.UTF-8`` can be mitigated by configuring only the ``LC_CTYPE``
locale category, rather than overriding all the locale categories::
$ docker run --rm -e LANG=C.UTF-8 centos/python-35-centos7 python3 -c 'print("ℙƴ☂ℌøἤ")'
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed
$ docker run --rm -e LC_CTYPE=en_US.UTF-8 centos/python-35-centos7 python3 -c 'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
Design Principles Design Principles
@ -308,16 +305,16 @@ proposed solution:
problems for end users, we'll do this *without* using the warnings system, so problems for end users, we'll do this *without* using the warnings system, so
even running with ``-Werror`` won't turn it into a runtime exception even running with ``-Werror`` won't turn it into a runtime exception
The general design principle of Python 3 to prefer raising an exception over Minimizing the negative impact on systems currently correctly configured to
incorrectly encoding or decoding data then leads to the following additional use GB-18030 or another partially ASCII compatible universal encoding leads to
design guideline: an additional design principle:
* if a UTF-8 based Linux container is run on a host that is explicitly * if a UTF-8 based Linux container is run on a host that is explicitly
configured to use a non-UTF-8 encoding, and tries to exchange locally configured to use a non-UTF-8 encoding, and tries to exchange locally
encoded data with that host rather than exchanging explicitly UTF-8 encoded encoded data with that host rather than exchanging explicitly UTF-8 encoded
data, this will ideally lead to an immediate runtime exception rather than data, CPython will endeavour to correctly round-trip host provided data that
to silent data corruption is concatenated or split solely at common ASCII compatible code points, but
may otherwise emit nonsensical results.
Specification Specification
@ -330,8 +327,9 @@ run as a standalone command line application.
It further proposes to emit a warning on stderr if the legacy ``C`` locale It further proposes to emit a warning on stderr if the legacy ``C`` locale
is in effect at the point where the language runtime itself is initialized, is in effect at the point where the language runtime itself is initialized,
in order to warn system and application integrators that they're running and the PEP 540 UTF-8 encoding override is also disabled, in order to warn
CPython in an unsupported configuration. system and application integrators that they're running CPython in an
unsupported configuration.
Legacy C locale coercion in the standalone Python interpreter binary Legacy C locale coercion in the standalone Python interpreter binary
@ -369,7 +367,7 @@ Three such locales will be tried:
* ``C.UTF-8`` (available at least in Debian, Ubuntu, and Fedora 25+, and * ``C.UTF-8`` (available at least in Debian, Ubuntu, and Fedora 25+, and
expected to be available by default in a future version of glibc) expected to be available by default in a future version of glibc)
* ``C.utf8`` (available at least in HP-UX) * ``C.utf8`` (available at least in HP-UX)
* ``en_US.UTF-8`` (available at least in RHEL and CentOS) * ``UTF-8`` (available in at least some \*BSD variants)
For ``C.UTF-8`` and ``C.utf8``, the coercion will be implemented by actually For ``C.UTF-8`` and ``C.utf8``, the coercion will be implemented by actually
setting the ``LANG`` and ``LC_ALL`` environment variables to the candidate setting the ``LANG`` and ``LC_ALL`` environment variables to the candidate
@ -377,15 +375,17 @@ locale name, such that future calls to ``setlocale()`` will see them, as will
other components looking for those settings (such as GUI development other components looking for those settings (such as GUI development
frameworks). frameworks).
The last fallback isn't ideal as a coercion target (as it changes more than For the platforms where it is defined, ``UTF-8`` is a partial locale that only
just the default text encoding), but has the benefit of currently being more defines the ``LC_CTYPE`` category. Accordingly, only the ``LC_CTYPE``
widely available than the C.UTF-8 locale. To minimize the chance of side environment variable would be set when using this fallback option.
effects, only the ``LC_CTYPE`` environment variable would be set when using
this legacy fallback option, with the other locale categories being left alone.
Given time, more environments are expected to provide a ``C.UTF-8`` locale by To adjust automatically to future changes in locale availability, these checks
default, so falling all the way back to the ``en_US.UTF-8`` option is expected will be implemented at runtime on all platforms other than Mac OS X and Windows,
to become less common. rather than attempting to determine which locales to try at compile time.
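As a rough illustration of what a runtime availability check could look like, the following Python sketch probes the three candidate locales with ``locale.setlocale()``. The helper name ``first_available_utf8_locale`` is hypothetical, and the actual checks proposed here would live in CPython's C startup code rather than in Python.

```python
import locale

def first_available_utf8_locale(candidates=("C.UTF-8", "C.utf8", "UTF-8")):
    """Return the first candidate LC_CTYPE locale this system accepts, or None.

    Hypothetical helper for illustration only; not part of CPython.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this locale is not installed; try the next one
            return name
        return None
    finally:
        # Leave the process locale exactly as we found it.
        locale.setlocale(locale.LC_CTYPE, saved)
```

Calling ``first_available_utf8_locale()`` on a given machine shows which coercion target, if any, would be usable there.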
If the locale settings are changed successfully, and the ``PYTHONIOENCODING``
environment variable is currently unset, then it will be forced to
``PYTHONIOENCODING=utf-8:surrogateescape``.
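The environment changes described above can be modelled as a small pure function. This is only a sketch: ``coerce_c_locale`` and its ``available_locales`` parameter are hypothetical names, it assumes the caller has already established that the legacy ``C`` locale is active, and the real implementation would be written in C.

```python
# Candidate locales in the order listed above, with the variables each one sets.
CANDIDATES = [
    ("C.UTF-8", ("LANG", "LC_ALL")),
    ("C.utf8", ("LANG", "LC_ALL")),
    ("UTF-8", ("LC_CTYPE",)),  # partial locale: only LC_CTYPE is set
]

def coerce_c_locale(environ, available_locales):
    """Return a copy of environ with the locale coercion applied.

    Hypothetical model: assumes the legacy C locale was already detected.
    """
    env = dict(environ)
    if env.get("PYTHONCOERCECLOCALE") == "0":
        return env  # coercion explicitly disabled
    for name, variables in CANDIDATES:
        if name in available_locales:
            for var in variables:
                env[var] = name
            # Only force the IO encoding if the user has not chosen one.
            env.setdefault("PYTHONIOENCODING", "utf-8:surrogateescape")
            break
    return env
```

When none of the candidates is available the environment is returned unchanged, which corresponds to the case where CPython relies on PEP 540 alone.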
When this locale coercion is activated, the following warning will be When this locale coercion is activated, the following warning will be
printed on stderr, with the warning containing whichever locale was printed on stderr, with the warning containing whichever locale was
@ -394,14 +394,15 @@ successfully configured::
Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another
locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour). locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
When falling all the way back to the ``en_US.UTF-8`` locale, the message would When falling back to the ``UTF-8`` locale, the message would be slightly
be slightly different:: different::
Python detected LC_CTYPE=C, LC_CTYPE set to en_US.UTF-8 (set another locale Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale
or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour). or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
This locale coercion will mean that the standard Python binary should once In combination with PEP 540, this locale coercion will mean that the standard
again "just work" in the three main failure cases we're aware of (missing locale Python binary *and* locale aware C/C++ extensions should once again "just work"
in the three main failure cases we're aware of (missing locale
settings, SSH forwarding of unknown locales, and developers explicitly settings, SSH forwarding of unknown locales, and developers explicitly
requesting ``LANG=C``), as long as the target platform provides at least one requesting ``LANG=C``), as long as the target platform provides at least one
of the candidate UTF-8 based environments. of the candidate UTF-8 based environments.
@ -427,7 +428,8 @@ doing so would introduce inconsistencies in decoded text, even in the context
of the standalone Python interpreter binary. of the standalone Python interpreter binary.
Accordingly, when ``Py_Initialize`` is called and CPython detects that the Accordingly, when ``Py_Initialize`` is called and CPython detects that the
configured locale is still the default ``C`` locale, the following warning will configured locale is still the default ``C`` locale *and* the ``PYTHONUTF8``
feature from PEP 540 is disabled, the following warning will
be issued:: be issued::
Python runtime initialized with LC_CTYPE=C (a locale with default ASCII Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
@ -440,6 +442,10 @@ Instead, the warning informs both system and application integrators that
they're running Python 3 in a configuration that we don't expect to work they're running Python 3 in a configuration that we don't expect to work
properly. properly.
The second sentence providing recommendations would be conditionally compiled
based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8`` on \*BSD
systems).
New build-time configuration options New build-time configuration options
------------------------------------ ------------------------------------
@ -465,15 +471,16 @@ Platform Support Changes
A new "Legacy C Locale" section will be added to PEP 11 that states: A new "Legacy C Locale" section will be added to PEP 11 that states:
* as of Python 3.7, the legacy C locale is no longer officially supported, * as of CPython 3.7, the legacy C locale is only supported when operating in
and any Unicode handling issues that occur only in that locale and cannot be "UTF-8" mode. Any Unicode handling issues that occur only in that locale
reproduced in an appropriately configured non-ASCII locale will be closed as and cannot be reproduced in an appropriately configured non-ASCII locale will
"won't fix" be closed as "won't fix"
* as of Python 3.7, \*nix platforms are expected to provide at least one of * as of CPython 3.7, \*nix platforms are expected to provide at least one of
``C.UTF-8``, ``C.utf8`` or ``en_US.UTF-8`` as an alternative to the legacy ``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
``C`` locale. On platforms which don't yet provide any of these locales, an ``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
explicit non-ASCII locale setting will be needed to configure a fully Any Unicode related integration problems with C/C++ extensions that occur
supported environment for running Python 3.7+ only in that locale and cannot be reproduced in an appropriately configured
non-ASCII locale will be closed as "won't fix".
Rationale Rationale
@ -502,14 +509,14 @@ C/C++ components sharing the same process, as well as with the user's desktop
locale settings, than it is with the emergent conventions of modern network locale settings, than it is with the emergent conventions of modern network
service development. service development.
The core premise of this PEP is that for *all* of these use cases, the default The core premise of this PEP is that for *all* of these use cases, the
"C" locale is the wrong choice, and furthermore that the following assumptions assumption of ASCII implied by the default "C" locale is the wrong choice,
are valid: and furthermore that the following assumptions are valid:
* in desktop application use cases, the process locale will *already* be * in desktop application use cases, the process locale will *already* be
configured appropriately, and if it isn't, then that is an operating system configured appropriately, and if it isn't, then that is an operating system
level problem that needs to be reported to and resolved by the operating or embedding application level problem that needs to be reported to and
system provider resolved by the operating system provider or application developer
* in network service development use cases (especially those based on Linux * in network service development use cases (especially those based on Linux
containers), the process locale may not be configured *at all*, and if it containers), the process locale may not be configured *at all*, and if it
isn't, then the expectation is that components will impose their own default isn't, then the expectation is that components will impose their own default
@ -517,54 +524,151 @@ are valid:
default encoding of ASCII the way CPython currently does default encoding of ASCII the way CPython currently does
Defaulting to "strict" error handling on the standard IO streams Defaulting to "surrogateescape" error handling on the standard IO streams
---------------------------------------------------------------- -------------------------------------------------------------------------
By coercing the locale away from the legacy C default and its assumption of By coercing the locale away from the legacy C default and its assumption of
ASCII as the preferred text encoding, this PEP also disables the implicit use ASCII as the preferred text encoding, this PEP also disables the implicit use
of the "surrogateescape" error handler on the standard IO streams that was of the "surrogateescape" error handler on the standard IO streams that was
introduced in Python 3.5 ([15_]). introduced in Python 3.5 ([15_]), as well as the automatic use of
``surrogateescape`` when operating in PEP 540's UTF-8 mode.
This is deliberate, as that change was primarily aimed at handling the case Rather than introducing yet another configuration option to address that,
where the correct system encoding was the ASCII-compatible UTF-8 (or another this PEP proposes to use the existing ``PYTHONIOENCODING`` setting to ensure
ASCII-compatible encoding), but the nominal encoding used for operating system that the ``surrogateescape`` handler is enabled when the interpreter is
interfaces in the current process was ASCII. required to make assumptions regarding the expected filesystem encoding.
With this PEP, that assumption is being narrowed a step further, such that The aim of this behaviour is to attempt to ensure that operating system
rather than assuming "an ASCII-compatible encoding", we instead assume UTF-8 provided text values are typically able to be transparently passed through a
specifically. If that assumption is genuinely wrong, then it continues to be Python 3 application even if it is incorrect in assuming that that text has
friendlier to users of other encodings to alert them to the runtime's mistaken been encoded as UTF-8.
assumption, rather than continuing on and potentially corrupting their data
permanently.
In particular, GB 18030 [12_] is a Chinese national text encoding standard In particular, GB 18030 [12_] is a Chinese national text encoding standard
that handles all Unicode code points, but is incompatible with both ASCII and that handles all Unicode code points and is formally incompatible with both
UTF-8. ASCII and UTF-8, but will nevertheless often tolerate processing as surrogate
escaped data - the points where GB 18030 reuses ASCII byte values in an
incompatible way are likely to be invalid in UTF-8, and will therefore be
escaped and opaque to string processing operations that split on or search for
the relevant ASCII code points. Operations that don't involve splitting on or
searching for particular ASCII or Unicode code point values are almost
certain to work correctly.
Similarly, Shift-JIS [13_] and ISO-2022-JP [14_] remain in widespread use in Similarly, Shift-JIS [13_] and ISO-2022-JP [14_] remain in widespread use in
Japan, and are incompatible with both ASCII and UTF-8. Japan, and are incompatible with both ASCII and UTF-8, but will tolerate text
processing operations that don't involve splitting on or searching for
particular ASCII or Unicode code point values.
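The round-tripping behaviour described above can be checked directly in Python. The sketch below uses the same sample text as the rest of this section and assumes nothing beyond the standard ``gb18030`` and ``utf-8`` codecs:

```python
# GB 18030 bytes for the sample text, followed by an ASCII newline.
raw = "ℙƴ☂ℌøἤ\n".encode("gb18030")

# Decode while (incorrectly) assuming UTF-8. Bytes that are invalid in
# UTF-8 are mapped to lone surrogates instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw

# Splitting on the ASCII newline still works on the escaped text.
lines = text.splitlines()
```

The decoded ``text`` is not meaningful to display, but it passes through the application unchanged, which is the property this section relies on.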
Using strict error handling on the standard streams means that attempting to As an example, consider two files, one encoded with UTF-8 (the default encoding
pass information from a host system using one of these encodings into a for ``en_AU.UTF-8``), and one encoded with GB-18030 (the default encoding for
container application that is assuming the use of UTF-8 or vice-versa is likely ``zh_CN.gb18030``)::
to cause an immediate Unicode encoding or decoding error, rather than
potentially causing silent data corruption.
For users that would prefer more permissive behaviour, setting $ python3 -c 'open("utf8.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("utf-8"))'
``PYTHONIOENCODING=:surrogateescape`` will continue to be supported, as this $ python3 -c 'open("gb18030.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("gb18030"))'
PEP makes no changes to that feature.
On disk, we can see that these are two very different files::
$ python3 -c 'print("UTF-8: ", open("utf8.txt", "rb").read().strip()); \
print("GB18030:", open("gb18030.txt", "rb").read().strip())'
UTF-8: b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
GB18030: b'\x816\xbd6\x810\x9d0\x817\xa29\x816\xbc4\x810\x8b3\x816\x8d6'
Nevertheless, both can be rendered correctly to the terminal as long as
they're decoded prior to printing::
$ python3 -c 'print("UTF-8: ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())'
UTF-8: ℙƴ☂ℌøἤ
GB18030: ℙƴ☂ℌøἤ
By contrast, if we just pass along the raw bytes, as ``cat`` and similar C/C++
utilities will tend to do::
$ LANG=en_AU.UTF-8 cat utf8.txt gb18030.txt
ℙƴ☂ℌøἤ
�6�6�0�0�7�9�6�4�0�3�6�6
Even setting a specifically Chinese locale won't help in getting the
GB-18030 encoded file rendered correctly::
$ LANG=zh_CN.gb18030 cat utf8.txt gb18030.txt
ℙƴ☂ℌøἤ
�6�6�0�0�7�9�6�4�0�3�6�6
The problem is that the *terminal* encoding setting remains UTF-8, regardless
of the nominal locale. A GB18030 terminal can be emulated using the ``iconv``
utility::
$ cat utf8.txt gb18030.txt | iconv -f GB18030 -t UTF-8
鈩櫰粹槀鈩屆羔激
ℙƴ☂ℌøἤ
This reverses the problem, such that the GB18030 file is rendered correctly,
but the UTF-8 file has been converted to unrelated hanzi characters, rather than
the expected rendering of "Python" as non-ASCII characters.
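
The terminal's role in these examples can be approximated in Python by decoding the raw bytes with the terminal's encoding. This is only a sketch: ``render`` is a hypothetical helper, and ``errors="replace"`` is an assumption standing in for the replacement characters a real terminal emits for undecodable input.

```python
text = "ℙƴ☂ℌøἤ"


def render(data, terminal_encoding):
    """Decode bytes roughly the way a terminal using terminal_encoding would."""
    return data.decode(terminal_encoding, errors="replace")


# UTF-8 bytes shown on a GB18030 terminal decode to unrelated characters.
utf8_on_gb18030 = render(text.encode("utf-8"), "gb18030")

# GB18030 bytes shown on a GB18030 terminal render as intended.
gb18030_on_gb18030 = render(text.encode("gb18030"), "gb18030")

# GB18030 bytes shown on a UTF-8 terminal produce replacement characters.
gb18030_on_utf8 = render(text.encode("gb18030"), "utf-8")
```

In each case the bytes themselves are unchanged; only the encoding used to interpret them differs, which is why changing ``LANG`` alone has no effect on what ``cat`` displays.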
With the emulated GB18030 terminal encoding, assuming UTF-8 in Python results
in *both* files being displayed incorrectly::
$ python3 -c 'print("UTF-8: ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
| iconv -f GB18030 -t UTF-8
UTF-8: 鈩櫰粹槀鈩屆羔激
GB18030: 鈩櫰粹槀鈩屆羔激
However, setting the locale correctly means that the emulated GB18030 terminal
now displays both files as originally intended::
$ LANG=zh_CN.gb18030 \
python3 -c 'print("UTF-8: ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
| iconv -f GB18030 -t UTF-8
UTF-8: ℙƴ☂ℌøἤ
GB18030: ℙƴ☂ℌøἤ
The rationale for retaining ``surrogateescape`` as the default error handler on
the standard streams is that it will preserve the following helpful behaviour
in the C locale::
$ cat gb18030.txt \
| LANG=C python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ
Rather than reverting to the exception seen when a UTF-8 based locale is
explicitly configured::
$ cat gb18030.txt \
| python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib64/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte
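
The difference between the two behaviours can be reproduced without a shell by wrapping in-memory byte streams the same way the standard streams are wrapped. This is a minimal sketch: ``pass_through`` is a hypothetical helper, and UTF-8 is assumed as the stream encoding in both cases.

```python
import io

raw = "ℙƴ☂ℌøἤ\n".encode("gb18030")


def pass_through(data, errors):
    """Read data as UTF-8 text and write it back out, like the one-liner above."""
    stdin = io.TextIOWrapper(
        io.BytesIO(data), encoding="utf-8", errors=errors, newline="")
    sink = io.BytesIO()
    stdout = io.TextIOWrapper(
        sink, encoding="utf-8", errors=errors, newline="")
    stdout.write(stdin.read())
    stdout.flush()
    return sink.getvalue()


# With surrogateescape the GB-18030 bytes pass through unmodified.
escaped_output = pass_through(raw, "surrogateescape")

# With strict error handling the read fails before anything is written.
try:
    pass_through(raw, "strict")
except UnicodeDecodeError as exc:
    strict_failure = exc
else:
    strict_failure = None
```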
Note: an alternative to setting ``PYTHONIOENCODING`` as the PEP currently
proposes would be to instead *always* default to ``surrogateescape`` on the
standard streams, and require the use of ``PYTHONIOENCODING=:strict`` to request
text encoding validation during stream processing. Adopting such an approach
would bring Python 3 more into line with typical C/C++ tools that pass along
the raw bytes without checking them for conformance to their nominal encoding,
and would hence also make the last example display the desired output::
$ cat gb18030.txt \
| PYTHONIOENCODING=:surrogateescape python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ
Dropping official support for Unicode handling in the legacy C locale Dropping official support for ASCII based text handling in the legacy C locale
--------------------------------------------------------------------- ------------------------------------------------------------------------------
We've been trying to get strict bytes/text separation to work reliably in the We've been trying to get strict bytes/text separation to work reliably in the
legacy C locale for over a decade at this point. Not only haven't we been able legacy C locale for over a decade at this point. Not only haven't we been able
to get it to work, neither has anyone else - the only viable alternatives to get it to work, neither has anyone else - the only viable alternatives
identified have been to pass the bytes along verbatim without eagerly decoding identified have been to pass the bytes along verbatim without eagerly decoding
them to text (Python 2.x, Ruby, etc), or else to ignore the nominal C/C++ locale them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal
encoding entirely and assume the use of either UTF-8 (PEP 540, Rust, Go, C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540,
Node.js, etc) or UTF-16-LE (JVM, .NET CLR). Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).
While this PEP ensures that developers that need to do so can still opt in to While this PEP ensures that developers that need to do so can still opt in to
running their Python code in the legacy C locale, it also makes clear that we running their Python code in the legacy C locale, it also makes clear that we
@ -621,7 +725,10 @@ languages in subprocesses.
Setting ``LC_ALL`` to ``C.UTF-8`` imposes a locale setting override on all Setting ``LC_ALL`` to ``C.UTF-8`` imposes a locale setting override on all
C/C++ components in the current process and in any subprocesses that inherit C/C++ components in the current process and in any subprocesses that inherit
the current environment. the current environment. This is important to handle cases where the problem
has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a system
where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
configured to forward locale settings, and the user logs into a Linux server).
Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``. the ``LANG`` fallback for their locale settings will still use ``C.UTF-8``.
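
The effect of the two settings on a subprocess environment can be sketched as follows. ``coerced_environment`` is a hypothetical helper written for illustration, not part of CPython or of this PEP's implementation, and it assumes the ``C.UTF-8`` locale is actually available on the target system.

```python
import os


def coerced_environment(environ, target="C.UTF-8"):
    """Return a copy of environ with LC_ALL and LANG forced to target."""
    env = dict(environ)
    # LC_ALL takes precedence over every LC_* category, so a stray setting
    # such as LC_CTYPE=UTF-8 no longer influences locale selection.
    env["LC_ALL"] = target
    # LANG covers components that only consult the fallback variable.
    env["LANG"] = target
    return env


# Example: the environment forwarded by an ssh client to a Linux server.
forwarded = {"LANG": "C", "LC_CTYPE": "UTF-8"}
coerced = coerced_environment(forwarded)

# The same helper applied to the current process environment.
current = coerced_environment(os.environ)
```

The returned mapping can be passed as the ``env`` argument of ``subprocess.run`` so that child processes inherit the override.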
@ -647,15 +754,15 @@ runtimes even when running a version with this change applied.
Implementation Implementation
============== ==============
A draft implementation of the change (including test cases) has been
posted to issue 28180 [1_], which is an end user request that
``sys.getfilesystemencoding()`` default to ``utf-8`` rather than ``ascii``.
NOTE: The currently posted draft implementation is for a previous iteration NOTE: The currently posted draft implementation is for a previous iteration
of the PEP prior to the incorporation of the feedback noted in [11_]. It was of the PEP prior to the incorporation of the feedback noted in [11_]. It was
broadly the same in concept (i.e. coercing the legacy C locale to one based on broadly the same in concept (i.e. coercing the legacy C locale to one based on
UTF-8), but differs in several details. UTF-8), but differs in several details.
A draft implementation of the change (including test cases) has been
posted to issue 28180 [1_], which is an end user request that
``sys.getfilesystemencoding()`` default to ``utf-8`` rather than ``ascii``.
Backporting to earlier Python 3 releases Backporting to earlier Python 3 releases
======================================== ========================================
@ -666,8 +773,8 @@ Backporting to Python 3.6.0
If this PEP is accepted for Python 3.7, redistributors backporting the change If this PEP is accepted for Python 3.7, redistributors backporting the change
specifically to their initial Python 3.6.0 release will be both allowed and specifically to their initial Python 3.6.0 release will be both allowed and
encouraged. However, such backports should only be undertaken either in encouraged. However, such backports should only be undertaken either in
conjunction with the changes needed to also provide the C.UTF-8 locale by conjunction with the changes needed to also provide a suitable locale by
default, or else specifically for platforms where that locale is already default, or else specifically for platforms where such a locale is already
consistently available. consistently available.
@ -676,7 +783,7 @@ Backporting to other 3.x releases
While the proposed behavioural change is seen primarily as a bug fix addressing While the proposed behavioural change is seen primarily as a bug fix addressing
Python 3's current misbehaviour in the default ASCII-based C locale, it still Python 3's current misbehaviour in the default ASCII-based C locale, it still
represents a reasonable significant change in the way CPython interacts with represents a reasonably significant change in the way CPython interacts with
the C locale system. As such, while some redistributors may still choose to the C locale system. As such, while some redistributors may still choose to
backport it to even earlier Python 3.x releases based on the needs and backport it to even earlier Python 3.x releases based on the needs and
interests of their particular user base, this wouldn't be encouraged as a interests of their particular user base, this wouldn't be encouraged as a
@ -716,6 +823,10 @@ PEP 540 [11_].
The "ℙƴ☂ℌøἤ" string used in the Unicode handling examples throughout this PEP The "ℙƴ☂ℌøἤ" string used in the Unicode handling examples throughout this PEP
is taken from Ned Batchelder's excellent "Pragmatic Unicode" presentation [9_]. is taken from Ned Batchelder's excellent "Pragmatic Unicode" presentation [9_].
Stephen Turnbull has long provided valuable insight into the text encoding
handling challenges he regularly encounters at the University of Tsukuba
(筑波大学).
References References
========== ==========
@ -765,6 +876,12 @@ References
.. [15] Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale .. [15] Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale
(https://bugs.python.org/issue19977) (https://bugs.python.org/issue19977)
.. [16] test_readline.test_nonascii fails on Android
(http://bugs.python.org/issue28997)
.. [17] UTF-8 locale discussion on "locale.getdefaultlocale() fails on Mac OS X with default language set to English"
(http://bugs.python.org/issue18378#msg215215)
Copyright Copyright
========= =========