PEP: 538
Title: Coercing the legacy C locale to a UTF-8 based locale
Version: $Revision$
Last-Modified: $Date$
Author: Alyssa Coghlan
BDFL-Delegate: INADA Naoki
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 28-Dec-2016
Python-Version: 3.7
Post-History: 03-Jan-2017, 07-Jan-2017, 05-Mar-2017, 09-May-2017
Resolution: https://mail.python.org/pipermail/python-dev/2017-May/148035.html

Abstract
========

An ongoing challenge with Python 3 on \*nix systems is the conflict between needing to use the configured locale encoding by default for consistency with other locale-aware components in the same process or subprocesses, and the fact that the standard C locale (as defined in POSIX:2001) typically implies a default text encoding of ASCII, which is entirely inadequate for the development of networked services and client applications in a multilingual world.

:pep:`540` proposes a change to CPython's handling of the legacy C locale such that CPython will assume the use of UTF-8 in such environments, rather than persisting with the demonstrably problematic assumption of ASCII as an appropriate encoding for communicating with operating system interfaces. This is a good approach for cases where network encoding interoperability is a more important concern than local encoding interoperability.

However, it comes at the cost of making CPython's encoding assumptions diverge from those of other locale-aware components in the same process, as well as those of components running in subprocesses that share the same environment. This can cause interoperability problems with some extension modules (such as GNU readline's command line history editing), as well as with components running in subprocesses (such as older Python runtimes). It also requires non-trivial changes to the internals of how CPython itself works, rather than relying primarily on existing configuration settings that are supported by Python versions prior to Python 3.7.

Accordingly, this PEP proposes that independently of the UTF-8 mode proposed in :pep:`540`, the way the CPython implementation handles the default C locale be changed to be roughly equivalent to the following existing configuration settings (supported since Python 3.1)::

    LC_CTYPE=C.UTF-8
    PYTHONIOENCODING=utf-8:surrogateescape

The exact target locale for coercion will be chosen from a predefined list at runtime based on the actually available locales.

The reinterpreted locale settings will be written back to the environment so they're visible to other components in the same process and in subprocesses, but the changed ``PYTHONIOENCODING`` default will be made implicit in order to avoid causing compatibility problems with Python 2 subprocesses that don't provide the ``surrogateescape`` error handler.

The new legacy locale coercion behavior can be disabled either by setting ``LC_ALL`` (which may still lead to a Unicode compatibility warning) or by setting the new ``PYTHONCOERCECLOCALE`` environment variable to ``0``.

With this change, any \*nix platform that does *not* offer at least one of the ``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard configuration would only be considered a fully supported platform for CPython 3.7+ deployments when a suitable locale other than the default ``C`` locale is configured explicitly (e.g. ``en_AU.UTF-8``, ``zh_CN.gb18030``). If :pep:`540` is accepted in addition to this PEP, then pure Python modules would also be supported when using the proposed ``PYTHONUTF8`` mode, but expectations for full Unicode compatibility in extension modules would continue to be limited to the platforms covered by this PEP.

As it only reflects a change in default settings rather than a fundamentally new capability, redistributors (such as Linux distributions) with a narrower target audience than the upstream CPython development team may also choose to opt in to this locale coercion behaviour for the Python 3.6.x series by applying the necessary changes as a downstream patch.


Implementation Notes
====================

Attempting to implement the PEP as originally accepted showed that the proposal to emit locale coercion and compatibility warnings by default simply wasn't practical (there were too many cases where previously working code failed *because of the warnings*, rather than because of latent locale handling defects in the affected code).

As a result, the ``PY_WARN_ON_C_LOCALE`` config flag was removed, and replaced with a runtime ``PYTHONCOERCECLOCALE=warn`` environment variable setting that allows developers and system integrators to opt-in to receiving locale coercion and compatibility warnings, without emitting them by default.

The output examples in the PEP itself have also been updated to remove the warnings and make them easier to read.


Background
==========

While the CPython interpreter is starting up, it may need to convert from the ``char *`` format to the ``wchar_t *`` format, or from one of those formats to ``PyUnicodeObject *``, in a way that's consistent with the locale settings of the overall system. It handles these cases by relying on the operating system to do the conversion and then ensuring that the text encoding name reported by ``sys.getfilesystemencoding()`` matches the encoding used during this early bootstrapping process.

On Windows, the limitations of the ``mbcs`` format used by default in these conversions proved sufficiently problematic that :pep:`528` and :pep:`529` were implemented to bypass the operating system supplied interfaces for binary data handling and force the use of UTF-8 instead.

On Mac OS X, iOS, and Android, many components, including CPython, already assume the use of UTF-8 as the system encoding, regardless of the locale setting. However, this isn't the case for all components, and the discrepancy can cause problems in some situations (for example, when using the GNU readline module [16]_).

On non-Apple and non-Android \*nix systems, these operations are handled using the C locale system in glibc, which has the following characteristics [4]_:

* by default, all processes start in the ``C`` locale, which uses ``ASCII`` for these conversions. This is almost never what anyone doing multilingual text processing actually wants (including CPython and C/C++ GUI frameworks).
* calling ``setlocale(LC_ALL, "")`` reconfigures the active locale based on the locale categories configured in the current process environment
* if the locale requested by the current environment is unknown, or no specific locale is configured, then the default ``C`` locale will remain active

The specific locale category that covers the APIs that CPython depends on is ``LC_CTYPE``, which applies to "classification and conversion of characters, and to multibyte and wide characters" [5]_. Accordingly, CPython includes the following key calls to ``setlocale``:

* in the main ``python`` binary, CPython calls ``setlocale(LC_ALL, "")`` to configure the entire C locale subsystem according to the process environment. It does this prior to making any calls into the shared CPython library
* in ``Py_Initialize``, CPython calls ``setlocale(LC_CTYPE, "")``, such that the configured locale settings for that category *always* match those set in the environment. It does this unconditionally, and it *doesn't* revert the process state change in ``Py_Finalize``

(This summary of the locale handling omits several technical details related to exactly where and when the text encoding declared as part of the locale settings is used - see :pep:`540` for further discussion, as these particular details matter more when decoupling CPython from the declared C locale than they do when overriding the locale with one based on UTF-8)

These calls are usually sufficient to provide sensible behaviour, but they can still fail in the following cases:

* SSH environment forwarding means that SSH clients may sometimes forward client locale settings to servers that don't have that locale installed. This leads to CPython running in the default ASCII-based C locale
* some process environments (such as Linux containers) may not have any explicit locale configured at all. As with unknown locales, this leads to CPython running in the default ASCII-based C locale
* on Android, rather than configuring the locale based on environment variables, the empty locale ``""`` is treated as specifically requesting the ``"C"`` locale

The simplest way to deal with this problem for currently released versions of CPython is to explicitly set a more sensible locale when launching the application. For example::

    LC_CTYPE=C.UTF-8 python3 ...

The ``C.UTF-8`` locale is a full locale definition that uses ``UTF-8`` for the ``LC_CTYPE`` category, and the same settings as the ``C`` locale for all other categories (including ``LC_COLLATE``). It is offered by a number of Linux distributions (including Debian, Ubuntu, Fedora, Alpine and Android) as an alternative to the ASCII-based C locale. Some other platforms (such as ``HP-UX``) offer an equivalent locale definition under the name ``C.utf8``.

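As a rough illustration of the launch-time workaround described above, the following sketch builds a child process environment that overrides only ``LC_CTYPE``. The helper names ``with_utf8_ctype`` and ``run_child`` are hypothetical and not part of CPython, and the sketch assumes the target platform actually provides the ``C.UTF-8`` locale.

```python
import os
import subprocess
import sys


def with_utf8_ctype(environ, locale_name="C.UTF-8"):
    """Return a copy of `environ` with only LC_CTYPE overridden.

    Hypothetical helper: LANG and the other LC_* categories are left
    untouched. Note that a non-empty LC_ALL in `environ` would still
    take precedence over LC_CTYPE in the child process.
    """
    child_env = dict(environ)
    child_env["LC_CTYPE"] = locale_name
    return child_env


def run_child(code, environ=None):
    """Run `code` in a child interpreter using the adjusted environment."""
    base = os.environ if environ is None else environ
    return subprocess.run(
        [sys.executable, "-c", code],
        env=with_utf8_ctype(base),
        capture_output=True,
    )
```

Copying the environment rather than mutating ``os.environ`` keeps the override scoped to the child process, which mirrors the ``LC_CTYPE=C.UTF-8 python3 ...`` shell form.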
Mac OS X and other \*BSD systems have taken a different approach: instead of offering a ``C.UTF-8`` locale, they offer a partial ``UTF-8`` locale that only defines the ``LC_CTYPE`` category. On such systems, the preferred environmental locale adjustment is to set ``LC_CTYPE=UTF-8`` rather than to set ``LC_ALL`` or ``LANG``. [17]_

In the specific case of Docker containers and similar technologies, the appropriate locale setting can be specified directly in the container image definition.

Another common failure case is developers specifying ``LANG=C`` in order to see otherwise translated user interface messages in English, rather than the more narrowly scoped ``LC_MESSAGES=C`` or ``LANGUAGE=en``.


Relationship with other PEPs
============================

This PEP shares a common problem statement with :pep:`540` (improving Python 3's behaviour in the default C locale), but diverges markedly in the proposed solution:

* :pep:`540` proposes to entirely decouple CPython's default text encoding from the C locale system in that case, allowing text handling inconsistencies to arise between CPython and other locale-aware components running in the same process and in subprocesses. This approach aims to make CPython behave less like a locale-aware application, and more like locale-independent language runtimes like those for Go, Node.js (V8), and Rust
* this PEP proposes to override the legacy C locale with a more recently defined locale that uses UTF-8 as its default text encoding. This means that the text encoding override will apply not only to CPython, but also to any locale-aware extension modules loaded into the current process, as well as to locale-aware applications invoked in subprocesses that inherit their environment from the parent process. This approach aims to retain CPython's traditional strong support for integration with other locale-aware components while also actively helping to push forward the adoption and standardisation of the C.UTF-8 locale as a Unicode-aware replacement for the legacy C locale in the wider C/C++ ecosystem

After reviewing both PEPs, it became clear that they didn't actually conflict at a technical level, and the proposal in :pep:`540` offered a superior option in cases where no suitable locale was available, as well as offering a better reference behaviour for platforms where the notion of a "locale encoding" doesn't make sense (for example, embedded systems running MicroPython rather than the CPython reference interpreter).

Meanwhile, this PEP offered improved compatibility with other locale-aware components, and an approach more amenable to being backported to Python 3.6 by downstream redistributors.

As a result, this PEP was amended to refer to :pep:`540` as a complementary solution that offered improved behaviour when none of the standard UTF-8 based locales were available, as well as extending the changes in the default settings to APIs that aren't currently independently configurable (such as the default encoding and error handler for ``open()``).

The availability of :pep:`540` also meant that the ``LC_CTYPE=en_US.UTF-8`` legacy fallback was removed from the list of UTF-8 locales tried as a coercion target, with the expectation being that CPython will instead rely solely on the proposed PYTHONUTF8 mode in such cases.


Motivation
==========

While Linux container technologies like Docker, Kubernetes, and OpenShift are best known for their use in web service development, the related container formats and execution models are also being adopted for Linux command line application development. Technologies like Gnome Flatpak [7]_ and Ubuntu Snappy [8]_ further aim to bring these same techniques to Linux GUI application development.

When using Python 3 for application development in these contexts, it isn't uncommon to see text encoding related errors akin to the following::

    $ docker run --rm fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
    Unable to decode the command from the command line:
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed
    $ docker run --rm ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
    Unable to decode the command from the command line:
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed

Even though the same command is likely to work fine when run locally::

    $ python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ

The source of the problem can be seen by instead running the ``locale`` command in the three environments::

    $ locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=en_AU.UTF-8
    LC_CTYPE="en_AU.UTF-8"
    LC_ALL=
    $ docker run --rm fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LC_CTYPE="POSIX"
    LC_ALL=
    $ docker run --rm ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LANGUAGE=
    LC_CTYPE="POSIX"
    LC_ALL=

In this particular example, we can see that the host system locale is set to "en_AU.UTF-8", so CPython uses UTF-8 as the default text encoding. By contrast, the base Docker images for Fedora and Debian don't have any specific locale set, so they use the POSIX locale by default, which is an alias for the ASCII-based default C locale.

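The error message seen in the containers can be approximated in pure Python. The sketch below assumes that, in the legacy C locale, the command line bytes are decoded as ASCII with the ``surrogateescape`` error handler and later re-encoded as strict UTF-8. The function name ``simulate_c_locale_command`` is hypothetical, and the real interpreter startup code differs in detail.

```python
def simulate_c_locale_command(command):
    """Mimic the ASCII decode followed by a strict UTF-8 encode.

    Returns the decoded text and the UnicodeEncodeError raised by the
    re-encoding step (or None if the command is pure ASCII).
    """
    raw = command.encode("utf-8")  # bytes as passed in by a UTF-8 shell
    # Non-ASCII bytes are smuggled through as lone surrogates (PEP 383)
    decoded = raw.decode("ascii", errors="surrogateescape")
    try:
        decoded.encode("utf-8")  # strict: lone surrogates are rejected
    except UnicodeEncodeError as exc:
        return decoded, exc
    return decoded, None


decoded, failure = simulate_c_locale_command('print("ℙƴ☂ℌøἤ")')
```

Here ``decoded[7]`` is the lone surrogate ``'\udce2'`` (the escaped first byte of ``ℙ``), which matches the character and position reported in the container output.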
The simplest way to get Python 3 (regardless of the exact version) to behave sensibly in Fedora and Debian based containers is to run it in the ``C.UTF-8`` locale that both distros provide::

    $ docker run --rm -e LC_CTYPE=C.UTF-8 fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ
    $ docker run --rm -e LC_CTYPE=C.UTF-8 ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ

    $ docker run --rm -e LC_CTYPE=C.UTF-8 fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LC_CTYPE=C.UTF-8
    LC_ALL=
    $ docker run --rm -e LC_CTYPE=C.UTF-8 ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=
    LANGUAGE=
    LC_CTYPE=C.UTF-8
    LC_ALL=

The Alpine Linux based Python images provided by Docker, Inc. already use the C.UTF-8 locale by default::

    $ docker run --rm python:3 python3 -c 'print("ℙƴ☂ℌøἤ")'
    ℙƴ☂ℌøἤ
    $ docker run --rm python:3 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
    LANG=C.UTF-8
    LANGUAGE=
    LC_CTYPE="C.UTF-8"
    LC_ALL=

Similarly, for custom container images (i.e. those adding additional content on top of a base distro image), a more suitable locale can be set in the image definition so everything just works by default. However, it would provide a much nicer and more consistent user experience if CPython were able to just deal with this problem automatically rather than relying on redistributors or end users to handle it through system configuration changes.

While the glibc developers are working towards making the C.UTF-8 locale universally available for use by glibc based applications like CPython [6]_, this unfortunately doesn't help on platforms that ship older versions of glibc without that feature, and also don't provide C.UTF-8 (or an equivalent) as an on-disk locale the way Debian and Fedora do. These platforms are considered out of scope for this PEP - see :pep:`540` for further discussion of possible options for improving CPython's default behaviour in such environments.


Design Principles
=================

The above motivation leads to the following core design principles for the proposed solution:

* if a locale other than the default C locale is explicitly configured, we'll continue to respect it
* as far as is feasible, any changes made will use *existing* configuration options
* Python's runtime behaviour in potential coercion target locales should be identical regardless of whether the locale was set explicitly in the environment or implicitly as a locale coercion target
* for Python 3.7, if we're changing the locale setting without an explicit config option, we'll emit a warning on stderr that we're doing so rather than silently changing the process configuration. This will alert application and system integrators to the change, even if they don't closely follow the PEP process or Python release announcements. However, to minimize the chance of introducing new problems for end users, we'll do this *without* using the warnings system, so even running with ``-Werror`` won't turn it into a runtime exception. (Note: these warnings ended up being silenced by default. See the Implementation Note above for more details)
* for Python 3.7, any changed defaults will offer some form of explicit "off" switch at build time, runtime, or both

Minimizing the negative impact on systems currently correctly configured to use GB-18030 or another partially ASCII compatible universal encoding leads to the following design principle:

* if a UTF-8 based Linux container is run on a host that is explicitly configured to use a non-UTF-8 encoding, and tries to exchange locally encoded data with that host rather than exchanging explicitly UTF-8 encoded data, CPython will endeavour to correctly round-trip host provided data that is concatenated or split solely at common ASCII compatible code points, but may otherwise emit nonsensical results.

Minimizing the negative impact on systems and programs correctly configured to use an explicit locale category like ``LC_TIME``, ``LC_MONETARY`` or ``LC_NUMERIC`` while otherwise running in the legacy C locale gives the following design principle:

* don't make any environmental changes that would alter any existing settings for locale categories other than ``LC_CTYPE`` (most notably: don't set ``LC_ALL`` or ``LANG``)

Finally, maintaining compatibility with running arbitrary subprocesses in orchestration use cases leads to the following design principle:

* don't make any Python-specific environmental changes that might be incompatible with any still supported version of CPython (including CPython 2.7)


Specification
=============

To better handle the cases where CPython would otherwise end up attempting to operate in the ``C`` locale, this PEP proposes that CPython automatically attempt to coerce the legacy ``C`` locale to a UTF-8 based locale for the ``LC_CTYPE`` category when it is run as a standalone command line application.

It further proposes to emit a warning on stderr if the legacy ``C`` locale is in effect for the ``LC_CTYPE`` category at the point where the language runtime itself is initialized, and the explicit environmental flag to disable locale coercion is not set, in order to warn system and application integrators that they're running CPython in an unsupported configuration.

In addition to these general changes, some additional Android-specific changes are proposed to handle the differences in the behaviour of ``setlocale`` on that platform.


Legacy C locale coercion in the standalone Python interpreter binary
--------------------------------------------------------------------

When run as a standalone application, CPython has the opportunity to reconfigure the C locale before any locale dependent operations are executed in the process.

This means that it can change the locale settings not only for the CPython runtime, but also for any other locale-aware components running in the current process (e.g. as part of extension modules), as well as in subprocesses that inherit their environment from the current process.

After calling ``setlocale(LC_ALL, "")`` to initialize the locale settings in the current process, the main interpreter binary will be updated to include the following call::

    const char *ctype_loc = setlocale(LC_CTYPE, NULL);

This cryptic invocation is the API that C provides to query the current locale setting without changing it. Given that query, it is possible to check for exactly the ``C`` locale with ``strcmp``::

    ctype_loc != NULL && strcmp(ctype_loc, "C") == 0  /* true only in the C locale */

This call also returns ``"C"`` when either no particular locale is set, or the nominal locale is set to an alias for the ``C`` locale (such as ``POSIX``).

Given this information, CPython can then attempt to coerce the locale to one that uses UTF-8 rather than ASCII as the default encoding.

Three such locales will be tried:

* ``C.UTF-8`` (available at least in Debian, Ubuntu, Alpine, and Fedora 25+, and expected to be available by default in a future version of glibc)
* ``C.utf8`` (available at least in HP-UX)
* ``UTF-8`` (available in at least some \*BSD variants, including Mac OS X)

The coercion will be implemented by setting the ``LC_CTYPE`` environment variable to the candidate locale name, such that future calls to ``setlocale()`` will see it, as will other components looking for those settings (such as GUI development frameworks and Python's own ``locale`` module).

To allow for better cross-platform binary portability and to adjust automatically to future changes in locale availability, these checks will be implemented at runtime on all platforms other than Windows, rather than attempting to determine which locales to try at compile time.

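The coercion steps above can be sketched in Python using the standard ``locale`` module. This is only an illustration of the described logic, not the C implementation used by the interpreter: the function names are hypothetical, and probing a candidate with ``locale.setlocale`` changes the active locale as a side effect.

```python
import locale
import os

# Candidate names listed in this PEP, in the order they are tried
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def pick_coercion_target(current_ctype, is_available,
                         targets=COERCION_TARGETS):
    """Return the first usable target when in the C locale, else None."""
    if current_ctype != "C":
        return None  # an explicitly configured locale is respected
    for name in targets:
        if is_available(name):
            return name
    return None


def locale_is_available(name):
    """Probe a locale name; on success the locale becomes active."""
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    return True


def coerce_c_locale(environ=os.environ):
    """Coerce LC_CTYPE and write the result back to the environment."""
    current = locale.setlocale(locale.LC_CTYPE, None)  # query only
    target = pick_coercion_target(current, locale_is_available)
    if target is not None:
        environ["LC_CTYPE"] = target  # visible to subprocesses
    return target
```

Keeping the availability check as a parameter of ``pick_coercion_target`` is a choice made for this sketch so the selection order can be exercised without depending on which locales are installed.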
When this locale coercion is activated, the following warning will be printed on stderr, with the warning containing whichever locale was successfully configured::

    Python detected LC_CTYPE=C: LC_CTYPE coerced to C.UTF-8 (set another locale
    or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).

(Note: this warning ended up being silenced by default. See the Implementation Note above for more details)

As long as the current platform provides at least one of the candidate UTF-8 based environments, this locale coercion will mean that the standard Python binary *and* locale-aware extensions should once again "just work" in the three main failure cases we're aware of (missing locale settings, SSH forwarding of unknown locales via ``LANG`` or ``LC_CTYPE``, and developers explicitly requesting ``LANG=C``).

The one case where failures may still occur is when ``stderr`` is specifically being checked for no output, which can be resolved either by configuring a locale other than the C locale, or else by using a mechanism other than "there was no output on stderr" to check for subprocess errors (e.g. checking process return codes).

If none of the candidate locales are successfully configured, or the ``LC_ALL`` locale override is defined in the current process environment, then initialization will continue in the C locale and the Unicode compatibility warning described in the next section will be emitted just as it would for any other application.

If ``PYTHONCOERCECLOCALE=0`` is explicitly set, initialization will continue in the C locale and the Unicode compatibility warning described in the next section will be automatically suppressed.

The interpreter will always check for the ``PYTHONCOERCECLOCALE`` environment variable at startup (even when running under the ``-E`` or ``-I`` switches), as the locale coercion check necessarily takes place before any command line argument processing. For consistency, the runtime check to determine whether or not to suppress the locale compatibility warning will be similarly independent of these settings.


Legacy C locale warning during runtime initialization
-----------------------------------------------------

By the time that ``Py_Initialize`` is called, arbitrary locale-dependent operations may have taken place in the current process. This means that by the time it is called, it is *too late* to reliably switch to a different locale - doing so would introduce inconsistencies in decoded text, even in the context of the standalone Python interpreter binary.

Accordingly, when ``Py_Initialize`` is called and CPython detects that the configured locale is still the default ``C`` locale and ``PYTHONCOERCECLOCALE=0`` is not set, the following warning will be issued::

    Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
    encoding), which may cause Unicode compatibility problems. Using C.UTF-8,
    C.utf8, or UTF-8 (if available) as alternative Unicode-compatible
    locales is recommended.

(Note: this warning ended up being silenced by default. See the Implementation Note above for more details)

In this case, no actual change will be made to the locale settings.

Instead, the warning informs both system and application integrators that they're running Python 3 in a configuration that we don't expect to work properly.

The second sentence providing recommendations may eventually be conditionally compiled based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8`` on \*BSD systems), but the initial implementation will just use the common generic message shown above.


New build-time configuration options
------------------------------------

While both of the above behaviours would be enabled by default, they would also have new associated configuration options and preprocessor definitions for the benefit of redistributors that want to override those default settings.

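For comparison with the build-time options, the runtime switches described in the preceding sections (``LC_ALL``, ``PYTHONCOERCECLOCALE=0`` and, per the Implementation Notes, ``PYTHONCOERCECLOCALE=warn``) can be summarised as a small decision function. This is a sketch only: ``coercion_policy`` is a hypothetical name, and treating an empty ``LC_ALL`` as unset is an assumption rather than something this PEP states.

```python
def coercion_policy(environ):
    """Return (coercion_allowed, warnings_enabled) for an environment.

    `coercion_allowed` only says that coercion may be attempted; it
    still applies solely when the legacy C locale is in effect.
    """
    setting = environ.get("PYTHONCOERCECLOCALE")
    if setting == "0":
        # Explicit opt-out: no coercion, and warnings are suppressed
        return (False, False)
    # Warnings are opt-in, as described in the Implementation Notes
    warnings_enabled = setting == "warn"
    # Assumption: an empty LC_ALL counts as "not defined"
    if environ.get("LC_ALL"):
        return (False, warnings_enabled)
    return (True, warnings_enabled)
```

For example, ``coercion_policy({"LC_ALL": "C", "PYTHONCOERCECLOCALE": "warn"})`` disables coercion while still allowing the compatibility warning to be shown.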
The locale coercion behaviour would be controlled by the flag ``--with[out]-c-locale-coercion``, which would set the ``PY_COERCE_C_LOCALE`` preprocessor definition.

The locale warning behaviour would be controlled by the flag ``--with[out]-c-locale-warning``, which would set the ``PY_WARN_ON_C_LOCALE`` preprocessor definition.

(Note: this compile time warning option ended up being replaced by a runtime ``PYTHONCOERCECLOCALE=warn`` option. See the Implementation Note above for more details)

On platforms which don't use the ``autotools`` based build system (i.e. Windows) these preprocessor variables would always be undefined.


Changes to the default error handling on the standard streams
-------------------------------------------------------------

Since Python 3.5, CPython has defaulted to using ``surrogateescape`` on the standard streams (``sys.stdin``, ``sys.stdout``) when it detects that the current locale is ``C`` and no specific error handler has been set using either the ``PYTHONIOENCODING`` environment variable or the ``Py_SetStandardStreamEncoding`` API. For other locales, the default error handler for the standard streams is ``strict``.

In order to preserve this behaviour without introducing any behavioural discrepancies between locale coercion and explicitly configuring a locale, the coercion target locales (``C.UTF-8``, ``C.utf8``, and ``UTF-8``) will be added to the list of locales that use ``surrogateescape`` as their default error handler for the standard streams.

No changes are proposed to the default error handler for ``sys.stderr``: that will continue to be ``backslashreplace``.


Changes to locale settings on Android
-------------------------------------

Independently of the other changes in this PEP, CPython on Android systems will be updated to call ``setlocale(LC_ALL, "C.UTF-8")`` where it currently calls ``setlocale(LC_ALL, "")`` and ``setlocale(LC_CTYPE, "C.UTF-8")`` where it currently calls ``setlocale(LC_CTYPE, "")``.

This Android-specific behaviour is being introduced due to the following Android-specific details:

* on Android, passing ``""`` to ``setlocale`` is equivalent to passing ``"C"``
* the ``C.UTF-8`` locale is always available


Platform Support Changes
========================

A new "Legacy C Locale" section will be added to :pep:`11` that states:

* as of CPython 3.7, \*nix platforms are expected to provide at least one of ``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale. Any Unicode related integration problems that occur only in the legacy ``C`` locale and cannot be reproduced in an appropriately configured non-ASCII locale will be closed as "won't fix".


Rationale
=========

Improving the handling of the C locale
--------------------------------------

It has been clear for some time that the C locale's default encoding of ``ASCII`` is entirely the wrong choice for development of modern networked services. Newer languages like Rust and Go have eschewed that default entirely, and instead made it a deployment requirement that systems be configured to use UTF-8 as the text encoding for operating system interfaces. Similarly, Node.js assumes UTF-8 by default (a behaviour inherited from the V8 JavaScript engine) and requires custom build settings to indicate it should use the system locale settings for locale-aware operations. Both the JVM and the .NET CLR use UTF-16-LE as their primary encoding for passing text between applications and the application runtime (i.e. the JVM/CLR, not the host operating system).

The challenge for CPython has been the fact that in addition to being used for network service development, it is also extensively used as an embedded scripting language in larger applications, and as a desktop application development language, where it is more important to be consistent with other locale-aware components sharing the same process, as well as with the user's desktop locale settings, than it is with the emergent conventions of modern network service development.

The core premise of this PEP is that for *all* of these use cases, the assumption of ASCII implied by the default "C" locale is the wrong choice, and furthermore that the following assumptions are valid:

* in desktop application use cases, the process locale will *already* be configured appropriately, and if it isn't, then that is an operating system or embedding application level problem that needs to be reported to and resolved by the operating system provider or application developer
* in network service development use cases (especially those based on Linux containers), the process locale may not be configured *at all*, and if it isn't, then the expectation is that components will impose their own default encoding the way Rust, Go and Node.js do, rather than trusting the legacy C default encoding of ASCII the way CPython currently does


Defaulting to "surrogateescape" error handling on the standard IO streams
-------------------------------------------------------------------------

By coercing the locale away from the legacy C default and its assumption of ASCII as the preferred text encoding, this PEP also disables the implicit use of the "surrogateescape" error handler on the standard IO streams that was introduced in Python 3.5 ([15]_), as well as the automatic use of ``surrogateescape`` when operating in :pep:`540`'s proposed UTF-8 mode.

Rather than introducing yet another configuration option to adjust that behaviour, this PEP instead proposes to extend the "surrogateescape" default for ``stdin`` and ``stdout`` error handling to also apply to the three potential coercion target locales.

The aim of this behaviour is to attempt to ensure that operating system provided text values are typically able to be transparently passed through a Python 3 application even if it is incorrect in assuming that that text has been encoded as UTF-8.

In particular, GB 18030 [12]_ is a Chinese national text encoding standard that handles all Unicode code points, that is formally incompatible with both ASCII and UTF-8, but will nevertheless often tolerate processing as surrogate escaped data - the points where GB 18030 reuses ASCII byte values in an incompatible way are likely to be invalid in UTF-8, and will therefore be escaped and opaque to string processing operations that split on or search for the relevant ASCII code points. Operations that don't involve splitting on or searching for particular ASCII or Unicode code point values are almost certain to work correctly.

Similarly, Shift-JIS [13]_ and ISO-2022-JP [14]_ remain in widespread use in Japan, and are incompatible with both ASCII and UTF-8, but will tolerate text processing operations that don't involve splitting on or searching for particular ASCII or Unicode code point values.

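The pass-through property described above can be demonstrated directly with the ``surrogateescape`` error handler. In this minimal sketch, the ``passthrough`` helper is a hypothetical stand-in for an application that reads text from ``stdin`` and writes it unchanged to ``stdout`` while wrongly assuming the data is UTF-8.

```python
def passthrough(raw_bytes):
    """Decode as UTF-8 with surrogateescape, then re-encode unchanged."""
    # Bytes that are invalid in UTF-8 become lone surrogates...
    text = raw_bytes.decode("utf-8", errors="surrogateescape")
    # ...and are turned back into the original bytes on output
    return text.encode("utf-8", errors="surrogateescape")


gb18030_bytes = "ℙƴ☂ℌøἤ\n".encode("gb18030")

# The data survives the round trip even though it is not valid UTF-8
round_tripped = passthrough(gb18030_bytes)

# With the strict handler, the same input is rejected outright
try:
    gb18030_bytes.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc
```

The round trip only preserves the bytes. As noted above, splitting on or searching for particular code points in the intermediate text may still give misleading results.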
As an example, consider two files, one encoded with UTF-8 (the default
encoding for ``en_AU.UTF-8``), and one encoded with GB-18030 (the default
encoding for ``zh_CN.gb18030``)::

    $ python3 -c 'open("utf8.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("utf-8"))'
    $ python3 -c 'open("gb18030.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("gb18030"))'

On disk, we can see that these are two very different files::

    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "rb").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "rb").read().strip())'
    UTF-8:   b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
    GB18030: b'\x816\xbd6\x810\x9d0\x817\xa29\x816\xbc4\x810\x8b3\x816\x8d6'

Nevertheless, both can be rendered correctly to the terminal as long as
they're decoded prior to printing::

    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())'
    UTF-8:   ℙƴ☂ℌøἤ
    GB18030: ℙƴ☂ℌøἤ

By contrast, if we just pass along the raw bytes, as ``cat`` and similar C/C++
utilities will tend to do::

    $ LANG=en_AU.UTF-8 cat utf8.txt gb18030.txt
    ℙƴ☂ℌøἤ
    �6�6�0�0�7�9�6�4�0�3�6�6

Even setting a specifically Chinese locale won't help in getting the
GB-18030 encoded file rendered correctly::

    $ LANG=zh_CN.gb18030 cat utf8.txt gb18030.txt
    ℙƴ☂ℌøἤ
    �6�6�0�0�7�9�6�4�0�3�6�6

The problem is that the *terminal* encoding setting remains UTF-8, regardless
of the nominal locale. A GB18030 terminal can be emulated using the ``iconv``
utility::

    $ cat utf8.txt gb18030.txt | iconv -f GB18030 -t UTF-8
    鈩櫰粹槀鈩屆羔激
    ℙƴ☂ℌøἤ

This reverses the problem, such that the GB18030 file is rendered correctly,
but the UTF-8 file has been converted to unrelated hanzi characters, rather
than the expected rendering of "Python" as non-ASCII characters.

With the emulated GB18030 terminal encoding, assuming UTF-8 in Python results
in *both* files being displayed incorrectly::

    $ python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
        | iconv -f GB18030 -t UTF-8
    UTF-8:   鈩櫰粹槀鈩屆羔激
    GB18030: 鈩櫰粹槀鈩屆羔激

However, setting the locale correctly means that the emulated GB18030 terminal
now displays both files as originally intended::

    $ LANG=zh_CN.gb18030 \
      python3 -c 'print("UTF-8:  ", open("utf8.txt", "r", encoding="utf-8").read().strip()); \
                  print("GB18030:", open("gb18030.txt", "r", encoding="gb18030").read().strip())' \
        | iconv -f GB18030 -t UTF-8
    UTF-8:   ℙƴ☂ℌøἤ
    GB18030: ℙƴ☂ℌøἤ

The rationale for retaining ``surrogateescape`` as the default error handler
on the standard IO streams is that it will preserve the following helpful
behaviour in the ``C`` locale::

    $ cat gb18030.txt \
        | LANG=C python3 -c "import sys; print(sys.stdin.read())" \
        | iconv -f GB18030 -t UTF-8
    ℙƴ☂ℌøἤ

Rather than reverting to the exception currently seen when a UTF-8 based locale
is explicitly configured::

    $ cat gb18030.txt \
        | python3 -c "import sys; print(sys.stdin.read())" \
        | iconv -f GB18030 -t UTF-8
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte

As an added benefit, environments explicitly configured to use one of the
coercion target locales will implicitly gain the encoding transparency
behaviour currently enabled by default in the ``C`` locale.

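The difference between the two stream configurations can also be reproduced
without a shell pipeline. The sketch below is an approximation under stated
assumptions: ``pass_through`` is a hypothetical helper, and in-memory
``io.BytesIO`` buffers wrapped in ``io.TextIOWrapper`` stand in for the real
``sys.stdin`` and ``sys.stdout``.

```python
import io

# Hypothetical stand-in for the contents of gb18030.txt.
data = "ℙƴ☂ℌøἤ\n".encode("gb18030")


def pass_through(raw, errors):
    """Read bytes as UTF-8 text and write the text back out as UTF-8.

    newline="" disables newline translation so the bytes can be compared.
    """
    fake_stdin = io.TextIOWrapper(
        io.BytesIO(raw), encoding="utf-8", errors=errors, newline="")
    sink = io.BytesIO()
    fake_stdout = io.TextIOWrapper(
        sink, encoding="utf-8", errors=errors, newline="")
    fake_stdout.write(fake_stdin.read())
    fake_stdout.flush()
    return sink.getvalue()


# With surrogateescape, the GB 18030 bytes pass through unchanged.
assert pass_through(data, "surrogateescape") == data

# With strict error handling, reading fails as in the traceback shown earlier.
try:
    pass_through(data, "strict")
except UnicodeDecodeError as exc:
    strict_error = exc
else:
    strict_error = None
```

Only the error handler differs between the two calls; the nominal encoding is
UTF-8 in both cases.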
Avoiding setting PYTHONIOENCODING during UTF-8 locale coercion
--------------------------------------------------------------

Rather than changing the default handling of the standard streams during
interpreter initialization, earlier versions of this PEP proposed setting
``PYTHONIOENCODING`` to ``utf-8:surrogateescape``. This turned out to create
a significant compatibility problem: since the ``surrogateescape`` handler
only exists in Python 3.1+, running Python 2.7 processes in subprocesses could
potentially break in a confusing way with that configuration.

The current design means that earlier Python versions will instead retain their
default ``strict`` error handling on the standard streams, while Python 3.7+
will consistently use the more permissive ``surrogateescape`` handler even
when these locales are explicitly configured (rather than being reached through
locale coercion).

Dropping official support for ASCII based text handling in the legacy C locale
------------------------------------------------------------------------------

We've been trying to get strict bytes/text separation to work reliably in the
legacy C locale for over a decade at this point. Not only haven't we been able
to get it to work, neither has anyone else - the only viable alternatives
identified have been to pass the bytes along verbatim without eagerly decoding
them to text (C/C++, Python 2.x, Ruby, etc), or else to largely ignore the
nominal C/C++ locale encoding and assume the use of either UTF-8 (:pep:`540`,
Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).

While this PEP ensures that developers who genuinely need to do so can still
opt in to running their Python code in the legacy C locale (by setting
``LC_ALL=C``, ``PYTHONCOERCECLOCALE=0``, or running a custom build that sets
``--without-c-locale-coercion``), it also makes it clear that we *don't*
expect Python 3's Unicode handling to be completely reliable in that
configuration, and the recommended alternative is to use a more appropriate
locale setting (potentially in combination with :pep:`540`'s UTF-8 mode, if
that is available).

Providing implicit locale coercion only when running standalone
---------------------------------------------------------------

The major downside of the proposed design in this PEP is that it introduces a
potential discrepancy between the behaviour of the CPython runtime when it is
run as a standalone application and when it is run as an embedded component
inside a larger system (e.g. ``mod_wsgi`` running inside Apache ``httpd``).

Over the course of Python 3.x development, multiple attempts have been made
to improve the handling of incorrect locale settings at the point where the
Python interpreter is initialised. The problem that emerged is that this is
ultimately *too late* in the interpreter startup process - data such as command
line arguments and the contents of environment variables may have already been
retrieved from the operating system and processed under the incorrect ASCII
text encoding assumption well before ``Py_Initialize`` is called.

The problems created by those inconsistencies were then even harder to diagnose
and debug than those created by believing the operating system's claim that
ASCII was a suitable encoding to use for operating system interfaces. This was
the case even for the default CPython binary, let alone larger C/C++
applications that embed CPython as a scripting engine.

The approach proposed in this PEP handles that problem by moving the locale
coercion as early as possible in the interpreter startup sequence when running
standalone: it takes place directly in the C-level ``main()`` function, even
before calling in to the ``Py_Main()`` library function that implements the
features of the CPython interpreter CLI.

The ``Py_Initialize`` API then only gains an explicit warning (emitted on
``stderr``) when it detects use of the ``C`` locale, and relies on the
embedding application to specify something more reasonable.

That said, the reference implementation for this PEP adds most of the
functionality to the shared library, with the CLI being updated to
unconditionally call two new private APIs::

    if (_Py_LegacyLocaleDetected()) {
        _Py_CoerceLegacyLocale();
    }

These are similar to other "pre-configuration" APIs intended for embedding
applications: they're designed to be called *before* ``Py_Initialize``, and
hence change the way the interpreter gets initialized.

If these were made public (either as part of this PEP or in a subsequent RFE),
then it would be straightforward for other embedding applications to recreate
the same behaviour as is proposed for the CPython CLI.

Allowing restoration of the legacy behaviour
--------------------------------------------

The CPython command line interpreter is often used to investigate faults that
occur in other applications that embed CPython, and those applications may still
be using the C locale even after this PEP is implemented.

Providing a simple on/off switch for the locale coercion behaviour makes it
much easier to reproduce the behaviour of such applications for debugging
purposes, as well as making it easier to reproduce the behaviour of older 3.x
runtimes even when running a version with this change applied.

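The conditions under which coercion applies can be summarised as a small
decision function. This is only a simplified sketch in Python rather than the
C-level reference implementation: ``should_coerce_locale`` is a hypothetical
helper, and it assumes that the only inputs are the active ``LC_CTYPE`` locale
name, ``LC_ALL``, and ``PYTHONCOERCECLOCALE``.

```python
def should_coerce_locale(ctype_locale, environ):
    """Return True if the legacy C locale would be coerced (simplified)."""
    # An explicit LC_ALL overrides LC_CTYPE, so coercion would have no effect.
    if environ.get("LC_ALL"):
        return False
    # The opt-out switch restores the legacy behaviour.
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return False
    # Only the legacy C locale is a candidate for coercion.
    return ctype_locale == "C"


# Hypothetical (locale name, environment) pairs used for illustration.
examples = [
    ("C", {}),
    ("C", {"PYTHONCOERCECLOCALE": "0"}),
    ("C", {"LC_ALL": "C"}),
    ("en_AU.UTF-8", {}),
]
results = [should_coerce_locale(name, env) for name, env in examples]
print(results)  # [True, False, False, False]
```

To apply the same check to a running process, the locale name could be obtained
with ``locale.setlocale(locale.LC_CTYPE, None)`` and the environment taken from
``os.environ``, bearing in mind that by then the interpreter has already been
initialised.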
Querying LC_CTYPE for C locale detection
----------------------------------------

``LC_CTYPE`` is the actual locale category that CPython relies on to drive the
implicit decoding of environment variables, command line arguments, and other
text values received from the operating system.

As such, it makes sense to check it specifically when attempting to determine
whether or not the current locale configuration is likely to cause Unicode
handling problems.

Explicitly setting LC_CTYPE for UTF-8 locale coercion
-----------------------------------------------------

Python is often used as a glue language, integrating other C/C++ ABI compatible
components in the current process, and components written in arbitrary
languages in subprocesses.

Setting ``LC_CTYPE`` to ``C.UTF-8`` is important to handle cases where the
problem has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a
system where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
configured to forward locale settings, and the user logs into a Linux server).

This should be sufficient to ensure that when the locale coercion is activated,
the switch to the UTF-8 based locale will be applied consistently across the
current process and any subprocesses that inherit the current environment.

Avoiding setting LANG for UTF-8 locale coercion
-----------------------------------------------

Earlier versions of this PEP proposed setting the ``LANG`` category independent
default locale, in addition to setting ``LC_CTYPE``.

This was later removed on the grounds that setting only ``LC_CTYPE`` is
sufficient to handle all of the problematic scenarios that the PEP aimed to
resolve, while setting ``LANG`` as well would break cases where ``LANG`` was
set correctly, and the locale problems were solely due to an incorrect
``LC_CTYPE`` setting ([22]_).

For example, consider a Python application that called the Linux ``date``
utility in a subprocess rather than doing its own date formatting::

    $ LANG=ja_JP.UTF-8 LC_CTYPE=C date
    2017年 5月 23日 火曜日 17:31:03 JST

    $ LANG=ja_JP.UTF-8 LC_CTYPE=C.UTF-8 date  # Coercing only LC_CTYPE
    2017年 5月 23日 火曜日 17:32:58 JST

    $ LANG=C.UTF-8 LC_CTYPE=C.UTF-8 date  # Coercing both of LC_CTYPE and LANG
    Tue May 23 17:31:10 JST 2017

With only ``LC_CTYPE`` updated in the Python process, the subprocess would
continue to behave as expected. However, if ``LANG`` was updated as well,
that would effectively override the ``LC_TIME`` setting and use the wrong
date formatting conventions.

Avoiding setting LC_ALL for UTF-8 locale coercion
-------------------------------------------------

Earlier versions of this PEP proposed setting the ``LC_ALL`` locale override,
in addition to setting ``LC_CTYPE``.

This was changed after it was determined that just setting ``LC_CTYPE`` and
``LANG`` should be sufficient to handle all the scenarios the PEP aims to
cover, as it avoids causing any problems in cases like the following::

    $ LANG=C LC_MONETARY=ja_JP.utf8 ./python -c \
        "from locale import setlocale, LC_ALL, currency; setlocale(LC_ALL, ''); print(currency(1e6))"
    ¥1000000

Skipping locale coercion if LC_ALL is set in the current environment
--------------------------------------------------------------------

With locale coercion now only setting ``LC_CTYPE``, it will have no effect if
``LC_ALL`` is also set. To avoid emitting a spurious locale coercion notice in
that case, coercion is instead skipped entirely.

Considering locale coercion independently of "UTF-8 mode"
---------------------------------------------------------

With both this PEP's locale coercion and :pep:`540`'s UTF-8 mode under
consideration for Python 3.7, it makes sense to ask whether or not we can
limit ourselves to only doing one or the other, rather than making both
changes.

The UTF-8 mode proposed in :pep:`540` has two major limitations that make it a
potential complement to this PEP rather than a potential replacement.

First, unlike this PEP, :pep:`540`'s UTF-8 mode makes it possible to change
default behaviours that are not currently configurable at all. While that's
exactly what makes the proposal interesting, it's also what makes it an
entirely unproven approach. By contrast, the approach proposed in this PEP
builds directly atop existing configuration settings for the C locale system
(``LC_CTYPE``, ``LANG``) and Python's standard streams (``PYTHONIOENCODING``)
that have already been in use for years to handle the kinds of compatibility
problems discussed in this PEP.

Secondly, one of the things we know based on that experience is that the
proposed locale coercion can resolve problems not only in CPython itself,
but also in extension modules that interact with the standard streams, like
GNU readline. As an example, consider the following interactive session
from a :pep:`538` enabled CPython build, where each line after the first is
executed by doing "up-arrow, left-arrow x4, delete, enter"::

    $ LANG=C ./python
    Python 3.7.0a0 (heads/pep538-coerce-c-locale:188e780, May 7 2017, 00:21:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

This is exactly what we'd expect from a well-behaved command history editor.

By contrast, the following is what currently happens on an older release if
you only change the Python level stream encoding settings without updating the
locale settings::

    $ LANG=C PYTHONIOENCODING=utf-8:surrogateescape python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌ�")
      File "<stdin>", line 0
        ^
    SyntaxError: 'utf-8' codec can't decode bytes in position 20-21: invalid continuation byte

That particular misbehaviour is coming from GNU readline, *not* CPython -
because the command history editing wasn't UTF-8 aware, it corrupted the history
buffer and fed such nonsense to stdin that even the surrogateescape error
handler was bypassed. While :pep:`540`'s UTF-8 mode could technically be
updated to also reconfigure readline, that's just *one* extension module that
might be interacting with the standard streams without going through the
CPython C API, and any change made by CPython would only apply when readline is
running directly as part of Python 3.7 rather than in a separate subprocess.

However, if we actually change the configured locale, GNU readline starts
behaving itself, without requiring any changes to the embedding application::

    $ LANG=C.UTF-8 python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

    $ LC_CTYPE=C.UTF-8 python3
    Python 3.5.3 (default, Apr 24 2017, 13:32:13)
    [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("ℙƴ☂ℌøἤ")
    ℙƴ☂ℌøἤ
    >>> print("ℙƴ☂ℌἤ")
    ℙƴ☂ℌἤ
    >>> print("ℙƴ☂ἤ")
    ℙƴ☂ἤ
    >>> print("ℙƴἤ")
    ℙƴἤ
    >>> print("ℙἤ")
    ℙἤ
    >>> print("ἤ")
    ἤ
    >>>

Enabling C locale coercion and warnings on Mac OS X, iOS and Android
--------------------------------------------------------------------

On Mac OS X, iOS, and Android, CPython already assumes the use of UTF-8 for
system interfaces, and we expect most other locale-aware components to do the
same.

Accordingly, this PEP originally proposed to disable locale coercion and
warnings at build time for these platforms, on the assumption that it would
be entirely redundant.

However, that assumption turned out to be incorrect, as subsequent
investigations showed that if you explicitly configure ``LANG=C`` on
these platforms, extension modules like GNU readline will misbehave in much the
same way as they do on other \*nix systems. [21]_

In addition, Mac OS X is also frequently used as a development and testing
platform for Python software intended for deployment to other \*nix environments
(such as Linux or Android), and Linux is similarly often used as a development
and testing platform for mobile and Mac OS X applications.

Accordingly, this PEP enables the locale coercion and warning features by
default on all platforms that use CPython's ``autotools`` based build toolchain
(i.e. everywhere other than Windows).

Implementation
==============

The reference implementation is being developed in the
``pep538-coerce-c-locale`` feature branch [18]_ in Alyssa Coghlan's fork of the
CPython repository on GitHub. A work-in-progress PR is available at [20]_.

This reference implementation covers not only the enhancement request in
issue 28180 [1]_, but also the Android compatibility fixes needed to resolve
issue 28997 [16]_.

Backporting to earlier Python 3 releases
========================================

Backporting to Python 3.6.x
---------------------------

If this PEP is accepted for Python 3.7, redistributors backporting the change
specifically to their initial Python 3.6.x release will be both allowed and
encouraged. However, such backports should only be undertaken either in
conjunction with the changes needed to also provide a suitable locale by
default, or else specifically for platforms where such a locale is already
consistently available.

At least the Fedora project is planning to pursue this approach for the
upcoming Fedora 26 release [19]_.

Backporting to other 3.x releases
---------------------------------

While the proposed behavioural change is seen primarily as a bug fix addressing
Python 3's current misbehaviour in the default ASCII-based C locale, it still
represents a reasonably significant change in the way CPython interacts with
the C locale system. As such, while some redistributors may still choose to
backport it to even earlier Python 3.x releases based on the needs and
interests of their particular user base, this wouldn't be encouraged as a
general practice.

However, configuring Python 3 *environments* (such as base container
images) to use these configuration settings by default is both allowed
and recommended.

Acknowledgements
================

The locale coercion approach proposed in this PEP is inspired directly by
Armin Ronacher's handling of this problem in the ``click`` command line
utility development framework [2]_::

    $ LANG=C python3 -c 'import click; cli = click.command()(lambda:None); cli()'
    Traceback (most recent call last):
      ...
    RuntimeError: Click will abort further execution because Python 3 was
    configured to use ASCII as encoding for the environment. Either run this
    under Python 2 or consult http://click.pocoo.org/python3/ for
    mitigation steps.

    This system supports the C.UTF-8 locale which is recommended.
    You might be able to resolve your issue by exporting the
    following environment variables:

        export LC_ALL=C.UTF-8
        export LANG=C.UTF-8

The change was originally proposed as a downstream patch for Fedora's
system Python 3.6 package [3]_, and then reformulated as a PEP for Python 3.7
with a section allowing for backports to earlier versions by redistributors.
In parallel with the development of the upstream patch, Charalampos Stratakis
has been working on the Fedora 26 backport and providing feedback on the
practical viability of the proposed changes.

The initial draft was posted to the Python Linux SIG for discussion [10]_ and
then amended based on both that discussion and Victor Stinner's work in
:pep:`540` [11]_.

The "ℙƴ☂ℌøἤ" string used in the Unicode handling examples throughout this PEP
is taken from Ned Batchelder's excellent "Pragmatic Unicode" presentation [9]_.

Stephen Turnbull has long provided valuable insight into the text encoding
handling challenges he regularly encounters at the University of Tsukuba
(筑波大学).

References
==========

.. [1] CPython: sys.getfilesystemencoding() should default to utf-8
   (https://bugs.python.org/issue28180)

.. [2] Locale configuration required for click applications under Python 3
   (https://click.palletsprojects.com/en/5.x/python3/#python-3-surrogate-handling)

.. [3] Fedora: force C.UTF-8 when Python 3 is run under the C locale
   (https://bugzilla.redhat.com/show_bug.cgi?id=1404918)

.. [4] GNU C: How Programs Set the Locale
   (https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html)

.. [5] GNU C: Locale Categories
   (https://www.gnu.org/software/libc/manual/html_node/Locale-Categories.html)

.. [6] glibc C.UTF-8 locale proposal
   (https://sourceware.org/glibc/wiki/Proposals/C.UTF-8)

.. [7] GNOME Flatpak
   (https://flatpak.org/)

.. [8] Ubuntu Snappy
   (https://www.ubuntu.com/desktop/snappy)

.. [9] Pragmatic Unicode
   (https://nedbatchelder.com/text/unipain.html)

.. [10] linux-sig discussion of initial PEP draft
   (https://mail.python.org/pipermail/linux-sig/2017-January/000014.html)

.. [11] Feedback notes from linux-sig discussion and PEP 540
   (https://github.com/python/peps/issues/171)

.. [12] GB 18030
   (https://en.wikipedia.org/wiki/GB_18030)

.. [13] Shift-JIS
   (https://en.wikipedia.org/wiki/Shift_JIS)

.. [14] ISO-2022
   (https://en.wikipedia.org/wiki/ISO/IEC_2022)

.. [15] Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale
   (https://bugs.python.org/issue19977)

.. [16] test_readline.test_nonascii fails on Android
   (https://bugs.python.org/issue28997)

.. [17] UTF-8 locale discussion on "locale.getdefaultlocale() fails on Mac OS X with default language set to English"
   (https://bugs.python.org/issue18378#msg215215)

.. [18] GitHub branch diff for ``ncoghlan:pep538-coerce-c-locale``
   (https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-locale)

.. [19] Fedora 26 change proposal for locale coercion backport
   (https://fedoraproject.org/wiki/Changes/python3_c.utf-8_locale)

.. [20] GitHub pull request for the reference implementation
   (https://github.com/python/cpython/pull/659)

.. [21] GNU readline misbehaviour on Mac OS X with ``LANG=C``
   (https://mail.python.org/pipermail/python-dev/2017-May/147897.html)

.. [22] Potential problems when setting LANG in addition to setting LC_CTYPE
   (https://mail.python.org/pipermail/python-dev/2017-May/147968.html)

Copyright
=========

This document has been placed in the public domain under the terms of the
CC0 1.0 license: https://creativecommons.org/publicdomain/zero/1.0/