From 09022d3f72f13e115aec4eabd3c8a809cc18aa6b Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Fri, 8 Dec 2017 15:38:40 +0100
Subject: [PATCH] PEP 540: getpreferredencoding() returns UTF-8

* List explicitly effects of the UTF-8 mode, but also "side effects"
* Add a new "Relationship with the locale coercion (PEP 538)" section
* Add a new "Version History" section
---
 pep-0540.txt | 158 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 98 insertions(+), 60 deletions(-)

diff --git a/pep-0540.txt b/pep-0540.txt
index 0b214cabe..0a9cbc1e6 100644
--- a/pep-0540.txt
+++ b/pep-0540.txt
@@ -1,5 +1,5 @@
 PEP: 540
-Title: Add a new UTF-8 mode
+Title: Add a new UTF-8 Mode
 Version: $Revision$
 Last-Modified: $Date$
 Author: Victor Stinner
@@ -14,13 +14,13 @@ Python-Version: 3.7
 Abstract
 ========

-Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
-change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
-This mode is enabled by default in the POSIX locale, but otherwise
-disabled by default.
+Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale
+encoding, and change ``stdin`` and ``stdout`` error handlers to
+``surrogateescape``. This mode is disabled by default and enabled by
+the POSIX locale.

 The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
-variable are added to control the UTF-8 mode.
+variable are added to control the UTF-8 Mode.


 Rationale
@@ -51,13 +51,7 @@ and JSON file formats.
 The Go programming language uses UTF-8 for strings.

 When all data are stored as UTF-8 but the locale is often misconfigured,
-an obvious solution is to ignore the locale and use UTF-8.
-
-PEP 538 attempts to mitigate this problem by coercing the C locale
-to a UTF-8 based locale when one is available, but that isn't a
-universal solution. For example, CentOS 7's container images default
-to the POSIX locale, and don't include the C.UTF-8 locale, so PEP 538's
-locale coercion is ineffective.
+an obvious solution is to ignore the locale encoding and use UTF-8.


 Passthough undecodable bytes: surrogateescape
@@ -76,7 +70,7 @@ the ``surrogateescape`` error handler (:pep:`383`). It allows to process
 data "as bytes" but uses Unicode in practice (undecodable bytes are
 stored as surrogate characters).

-The UTF-8 mode uses the ``surrogateescape`` error handler for ``stdin``
+The UTF-8 Mode uses the ``surrogateescape`` error handler for ``stdin``
 and ``stdout`` since these streams as commonly associated to Unix
 command line tools.

@@ -98,43 +92,98 @@ usually gives the ASCII encoding, whereas UTF-8 is a much better choice.
 It does not change the behaviour for other locales to prevent any risk
 or regression.

-As users are responsible to enable explicitly the new UTF-8 mode, they
+As users are responsible to enable explicitly the new UTF-8 Mode, they
 are responsible for any potential mojibake issues caused by this mode.


 Proposal
 ========

-Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
-change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
-This mode is enabled by default in the POSIX locale, but otherwise
-disabled by default.
+Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale
+encoding, and change ``stdin`` and ``stdout`` error handlers to
+``surrogateescape``.

 The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
-variable are added. The UTF-8 mode is enabled by ``-X utf8`` or
+variable are added. The UTF-8 Mode is enabled by ``-X utf8`` or
 ``PYTHONUTF8=1``.

-The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
-can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
+This mode is disabled by default and enabled by the POSIX locale. The
+UTF-8 Mode can be explicitly disabled by ``-X utf8=0`` or
+``PYTHONUTF8=0``.

 For standard streams, the ``PYTHONIOENCODING`` environment variable has
-priority over the UTF-8 mode.
+priority over the UTF-8 Mode.

 On Windows, the ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable
-(:pep:`529`) has the priority over the UTF-8 mode.
+(:pep:`529`) has the priority over the UTF-8 Mode.
+
+Effects of the UTF-8 Mode:
+
+* ``sys.getfilesystemencoding()`` returns ``'UTF-8'``.
+* ``locale.getpreferredencoding()`` returns ``UTF-8``, its
+  *do_setlocale* argument and the locale encoding are ignored.
+* ``sys.stdin`` and ``sys.stdout`` error handler is set to
+  ``surrogateescape``
+
+Side effects:
+
+* ``open()`` uses the UTF-8 encoding by default.
+* ``os.fsdecode()`` and ``os.fsencode()`` use the UTF-8 encoding.
+* Command line arguments, environment variables and filenames use the
+  UTF-8 encoding.
+
+.. note::
+   In the UTF-8 Mode, ``open()`` still uses the ``strict`` error handler
+   by default.
+
+
+Relationship with the locale coercion (PEP 538)
+===============================================
+
+The POSIX locale enables the locale coercion (PEP 538) and the UTF-8
+mode (PEP 540). When the locale coercion is enabled, enabling the UTF-8
+mode has no (additional) effect.
+
+Locale coercion only impacts non-Python code like C libraries, whereas
+the Python UTF-8 Mode only impacts Python code: the two PEPs are
+complementary.
+
+On platforms where locale coercion is not supported like Centos 7, the
+POSIX locale only enables the UTF-8 Mode. In this case, Python code uses
+the UTF-8 encoding and ignores the locale encoding, whereas non-Python
+code uses the locale encoding which is usually ASCII for the POSIX
+locale.
+
+While the UTF-8 Mode is supported on all platforms and can be enabled
+with any locale, the locale coercion is not supported by all platforms
+and is restricted to the POSIX locale.
+
+The UTF-8 Mode has only an impact on Python child processes when the
+``PYTHONUTF8`` environment variable is set to ``1``, whereas the locale
+coercion sets the ``LC_CTYPE`` environment variables which impacts all
+child processes.
+
+The benefit of the locale coercion approach is that it helps ensure that
+encoding handling in binary extension modules and child processes is
+consistent with Python's encoding handling. The upside of the UTF-8 Mode
+approach is that it allows an embedding application to change the
+interpreter's behaviour without having to change the process global
+locale settings.


 Backward Compatibility
 ======================

-The only backward incompatible change is that the UTF-8 encoding is now
-used for the POSIX locale.
+The only backward incompatible change is that the POSIX locale now
+enables the UTF-8 Mode by default: use the UTF-8 encoding, ignore the
+locale encoding, and change ``stdin`` and ``stdout`` error handlers to
+``surrogateescape``.


 Annex: Encodings And Error Handlers
 ===================================

-The UTF-8 mode changes the default encoding and error handler used by
+The UTF-8 Mode changes the default encoding and error handler used by
 ``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
 ``sys.stdout`` and ``sys.stderr``.

@@ -142,7 +191,7 @@ Encoding and error handler
 --------------------------

 ============================ ======================= ==========================
-Function                     Default                 UTF-8 mode or POSIX locale
+Function                     Default                 UTF-8 Mode or POSIX locale
 ============================ ======================= ==========================
 open()                       locale/strict           **UTF-8**/strict
 os.fsdecode(), os.fsencode() locale/surrogateescape  **UTF-8**/surrogateescape
@@ -167,7 +216,7 @@ Encoding and error handler on Windows
 On Windows, the encodings and error handlers are different:

 ============================ ======================= ========================== ==========================
-Function                     Default                 Legacy Windows FS encoding UTF-8 mode
+Function                     Default                 Legacy Windows FS encoding UTF-8 Mode
 ============================ ======================= ========================== ==========================
 open()                       mbcs/strict             mbcs/strict                **UTF-8**/strict
 os.fsdecode(), os.fsencode() UTF-8/surrogatepass     **mbcs/replace**           UTF-8/surrogatepass
@@ -191,43 +240,19 @@ The "Legacy Windows FS encoding" is enabled by the

 If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
 ``sys.output`` use ``mbcs`` encoding by default rather than UTF-8. But
-in the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
+in the UTF-8 Mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
 encoding.

 .. note:
-   There is no POSIX locale on Windows. The ANSI code page is used to the
-   locale encoding, and this code page never uses the ASCII encoding.
-
-
-Annex: Differences between PEP 538 and PEP 540
-==============================================
-
-PEP 538's locale coercion is only effective if a suitable UTF-8
-based locale is available as a coercion target. PEP 540's
-UTF-8 mode can be enabled even for operating systems that don't
-provide a suitable platform locale (such as CentOS 7).
-
-PEP 538 only changes the interpreter's behaviour for the C locale. While the
-new UTF-8 mode of this PEP is only enabled by default in the C locale, it can
-also be enabled manually for any other locale.
-
-PEP 538 is implemented with ``setlocale(LC_CTYPE, "")`` and
-``setenv("LC_CTYPE", "")``, so any non-Python code running
-in the process and any subprocesses that inherit the environment is impacted
-by the change. PEP 540 is implemented in Python internals and ignores the
-locale: non-Python running in the same process is not aware of the
-"Python UTF-8 mode". The benefit of the PEP 538 approach is that it helps
-ensure that encoding handling in binary extension modules and subprocesses
-is consistent with CPython's encoding handling. The upside of the PEP 540
-approach is that it allows an embedding application to change the
-interpreter's behaviour without having to change the process global
-locale settings.
+   There is no POSIX locale on Windows. The ANSI code page is used to
+   the locale encoding, and this code page never uses the ASCII
+   encoding.


 Links
 =====

-* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 mode
+* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 Mode
   `_
 * `PEP 538 `_:
   "Coercing the legacy C locale to C.UTF-8"
@@ -242,12 +267,12 @@ Links
 Post History
 ============

-* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 mode
+* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 Mode
   `_
 * 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
   540 (assuming UTF-8 for *nix system boundaries)
   `_
-* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 mode
+* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 Mode
   `_
 * 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to C.utf-8 (msg284764)
   `_
@@ -257,6 +282,19 @@ Post History
   filesystem encoding to UTF-8)


+Version History
+===============
+
+* Version 4: ``locale.getpreferredencoding()`` now returns ``'UTF-8'``
+  in the UTF-8 Mode.
+* Version 3: The UTF-8 Mode does not change the ``open()`` default error
+  handler (``strict``) anymore, and the Strict UTF-8 Mode has been
+  removed.
+* Version 2: Rewrite the PEP from scratch to make it much shorter and
+  easier to understand.
+* Version 1: First version posted to python-dev.
+
+
 Copyright
 =========
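
A minimal sketch for checking by hand the items that the patch lists under "Effects of the UTF-8 Mode", assuming an interpreter that implements this PEP (CPython 3.7 or later). The helper name ``probe_utf8_mode``, the ``PROBE`` snippet and the JSON report are hypothetical and not part of the patch. The helper drops the environment variables that the PEP text says toggle or take priority over the mode, so that only ``-X utf8`` is in effect. The exact spelling of the returned encoding names (``UTF-8`` versus ``utf-8``) is not assumed here and may vary between Python versions.

```python
import json
import os
import subprocess
import sys

# Snippet executed in a child interpreter started with "-X utf8".
PROBE = """
import json, locale, sys
print(json.dumps({
    "utf8_mode": sys.flags.utf8_mode,
    "filesystem": sys.getfilesystemencoding(),
    "preferred": locale.getpreferredencoding(),
    "stdin_errors": sys.stdin.errors,
    "stdout_errors": sys.stdout.errors,
}))
"""

# Variables that, per the PEP text, toggle the UTF-8 Mode or take
# priority over it; they are removed so that only "-X utf8" applies.
OVERRIDES = ("PYTHONUTF8", "PYTHONIOENCODING", "PYTHONLEGACYWINDOWSFSENCODING")


def probe_utf8_mode():
    """Hypothetical helper: report what a "-X utf8" interpreter uses."""
    env = {k: v for k, v in os.environ.items() if k not in OVERRIDES}
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", PROBE],
        env=env,
        stdin=subprocess.DEVNULL,   # give the child a real stdin object
        stdout=subprocess.PIPE,
        check=True,
    )
    # json.dumps() emits ASCII by default, so decoding as UTF-8 is safe.
    return json.loads(result.stdout.decode("utf-8"))


if __name__ == "__main__":
    for name, value in probe_utf8_mode().items():
        print(name, "=", value)
```

Running the probe in a child process, rather than inspecting the current interpreter, keeps the check independent of how the calling Python itself was started.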