From 09022d3f72f13e115aec4eabd3c8a809cc18aa6b Mon Sep 17 00:00:00 2001
From: Victor Stinner <victor.stinner@gmail.com>
Date: Fri, 8 Dec 2017 15:38:40 +0100
Subject: [PATCH] PEP 540: getpreferredencoding() returns UTF-8

* Explicitly list the effects of the UTF-8 Mode, as well as its "side effects"
* Add a new "Relationship with the locale coercion (PEP 538)" section
* Add a new "Version History" section
---
 pep-0540.txt | 158 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 98 insertions(+), 60 deletions(-)

diff --git a/pep-0540.txt b/pep-0540.txt
index 0b214cabe..0a9cbc1e6 100644
--- a/pep-0540.txt
+++ b/pep-0540.txt
@@ -1,5 +1,5 @@
 PEP: 540
-Title: Add a new UTF-8 mode
+Title: Add a new UTF-8 Mode
 Version: $Revision$
 Last-Modified: $Date$
 Author: Victor Stinner <victor.stinner@gmail.com>
@@ -14,13 +14,13 @@ Python-Version: 3.7
 Abstract
 ========
 
-Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
-change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
-This mode is enabled by default in the POSIX locale, but otherwise
-disabled by default.
+Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale
+encoding, and change ``stdin`` and ``stdout`` error handlers to
+``surrogateescape``.  This mode is disabled by default, except in the
+POSIX locale, where it is enabled.
 
 The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
-variable are added to control the UTF-8 mode.
+variable are added to control the UTF-8 Mode.
 
 
 Rationale
@@ -51,13 +51,7 @@ and JSON file formats. The Go programming language uses UTF-8 for
 strings.
 
 When all data are stored as UTF-8 but the locale is often misconfigured,
-an obvious solution is to ignore the locale and use UTF-8.
-
-PEP 538 attempts to mitigate this problem by coercing the C locale
-to a UTF-8 based locale when one is available, but that isn't a
-universal solution. For example, CentOS 7's container images default
-to the POSIX locale, and don't include the C.UTF-8 locale, so PEP 538's
-locale coercion is ineffective.
+an obvious solution is to ignore the locale encoding and use UTF-8.
 
 
 Passthough undecodable bytes: surrogateescape
@@ -76,7 +70,7 @@ the ``surrogateescape`` error handler (:pep:`383`). It allows to process
 data "as bytes" but uses Unicode in practice (undecodable bytes are
 stored as surrogate characters).
 
-The UTF-8 mode uses the ``surrogateescape`` error handler for ``stdin``
+The UTF-8 Mode uses the ``surrogateescape`` error handler for ``stdin``
 and ``stdout`` since these streams as commonly associated to Unix
 command line tools.
 
@@ -98,43 +92,98 @@ usually gives the ASCII encoding, whereas UTF-8 is a much better choice.
 It does not change the behaviour for other locales to prevent any risk
 or regression.
 
-As users are responsible to enable explicitly the new UTF-8 mode, they
+As users are responsible for explicitly enabling the new UTF-8 Mode, they
 are responsible for any potential mojibake issues caused by this mode.
 
 
 Proposal
 ========
 
-Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
-change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
-This mode is enabled by default in the POSIX locale, but otherwise
-disabled by default.
+Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale
+encoding, and change ``stdin`` and ``stdout`` error handlers to
+``surrogateescape``.
 
 The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
-variable are added. The UTF-8 mode is enabled by ``-X utf8`` or
+variable are added. The UTF-8 Mode is enabled by ``-X utf8`` or
 ``PYTHONUTF8=1``.
 
-The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
-can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
+This mode is disabled by default, except in the POSIX locale, where it
+is enabled. The UTF-8 Mode can be explicitly disabled by ``-X utf8=0``
+or ``PYTHONUTF8=0``.
 
 For standard streams, the ``PYTHONIOENCODING`` environment variable has
-priority over the UTF-8 mode.
+priority over the UTF-8 Mode.
 
 On Windows, the ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable
-(:pep:`529`) has the priority over the UTF-8 mode.
+(:pep:`529`) has priority over the UTF-8 Mode.
+
+Effects of the UTF-8 Mode:
+
+* ``sys.getfilesystemencoding()`` returns ``'UTF-8'``.
+* ``locale.getpreferredencoding()`` returns ``'UTF-8'``; its
+  *do_setlocale* argument and the locale encoding are ignored.
+* The ``sys.stdin`` and ``sys.stdout`` error handlers are set to
+  ``surrogateescape``.
+
+Side effects:
+
+* ``open()`` uses the UTF-8 encoding by default.
+* ``os.fsdecode()`` and ``os.fsencode()`` use the UTF-8 encoding.
+* Command line arguments, environment variables and filenames use the
+  UTF-8 encoding.
+
+.. note::
+   In the UTF-8 Mode, ``open()`` still uses the ``strict`` error handler
+   by default.
+
+
+Relationship with the locale coercion (PEP 538)
+===============================================
+
+The POSIX locale enables the locale coercion (PEP 538) and the UTF-8
+Mode (PEP 540). When the locale coercion is enabled, enabling the UTF-8
+Mode has no (additional) effect.
+
+Locale coercion only impacts non-Python code like C libraries, whereas
+the Python UTF-8 Mode only impacts Python code: the two PEPs are
+complementary.
+
+On platforms where locale coercion is not supported, such as CentOS 7,
+the POSIX locale only enables the UTF-8 Mode. In this case, Python code
+uses the UTF-8 encoding and ignores the locale encoding, whereas
+non-Python code uses the locale encoding, which is usually ASCII for the
+POSIX locale.
+
+While the UTF-8 Mode is supported on all platforms and can be enabled
+with any locale, the locale coercion is not supported by all platforms
+and is restricted to the POSIX locale.
+
+The UTF-8 Mode only affects Python child processes when the
+``PYTHONUTF8`` environment variable is set to ``1``, whereas the locale
+coercion sets the ``LC_CTYPE`` environment variable, which impacts all
+child processes.
+
+The benefit of the locale coercion approach is that it helps ensure that
+encoding handling in binary extension modules and child processes is
+consistent with Python's encoding handling. The upside of the UTF-8 Mode
+approach is that it allows an embedding application to change the
+interpreter's behaviour without having to change the process global
+locale settings.
 
 
 Backward Compatibility
 ======================
 
-The only backward incompatible change is that the UTF-8 encoding is now
-used for the POSIX locale.
+The only backward incompatible change is that the POSIX locale now
+enables the UTF-8 Mode by default: Python uses the UTF-8 encoding,
+ignores the locale encoding, and changes the ``stdin`` and ``stdout``
+error handlers to ``surrogateescape``.
 
 
 Annex: Encodings And Error Handlers
 ===================================
 
-The UTF-8 mode changes the default encoding and error handler used by
+The UTF-8 Mode changes the default encoding and error handler used by
 ``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
 ``sys.stdout`` and ``sys.stderr``.
 
@@ -142,7 +191,7 @@ Encoding and error handler
 --------------------------
 
 ============================  =======================  ==========================
-Function                      Default                  UTF-8 mode or POSIX locale
+Function                      Default                  UTF-8 Mode or POSIX locale
 ============================  =======================  ==========================
 open()                        locale/strict            **UTF-8**/strict
 os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape
@@ -167,7 +216,7 @@ Encoding and error handler on Windows
 On Windows, the encodings and error handlers are different:
 
 ============================  =======================  ==========================  ==========================
-Function                      Default                  Legacy Windows FS encoding  UTF-8 mode
+Function                      Default                  Legacy Windows FS encoding  UTF-8 Mode
 ============================  =======================  ==========================  ==========================
 open()                        mbcs/strict              mbcs/strict                 **UTF-8**/strict
 os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**            UTF-8/surrogatepass
@@ -191,43 +240,19 @@ The "Legacy Windows FS encoding" is enabled by the
 
 If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
 ``sys.output`` use ``mbcs`` encoding by default rather than UTF-8. But
-in the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
+in the UTF-8 Mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
 encoding.
 
 .. note:
-   There is no POSIX locale on Windows. The ANSI code page is used to the
-   locale encoding, and this code page never uses the ASCII encoding.
-
-
-Annex: Differences between PEP 538 and PEP 540
-==============================================
-
-PEP 538's locale coercion is only effective if a suitable UTF-8
-based locale is available as a coercion target. PEP 540's
-UTF-8 mode can be enabled even for operating systems that don't
-provide a suitable platform locale (such as CentOS 7).
-
-PEP 538 only changes the interpreter's behaviour for the C locale. While the
-new UTF-8 mode of this PEP is only enabled by default in the C locale, it can
-also be enabled manually for any other locale.
-
-PEP 538 is implemented with ``setlocale(LC_CTYPE, "<coercion target>")`` and
-``setenv("LC_CTYPE", "<coercion target>")``, so any non-Python code running
-in the process and any subprocesses that inherit the environment is impacted
-by the change. PEP 540 is implemented in Python internals and ignores the
-locale: non-Python running in the same process is not aware of the
-"Python UTF-8 mode". The benefit of the PEP 538 approach is that it helps
-ensure that encoding handling in binary extension modules and subprocesses
-is consistent with CPython's encoding handling. The upside of the PEP 540
-approach is that it allows an embedding application to change the
-interpreter's behaviour without having to change the process global
-locale settings.
+   There is no POSIX locale on Windows. The ANSI code page is used as
+   the locale encoding, and this code page never uses the ASCII
+   encoding.
 
 
 Links
 =====
 
-* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 mode
+* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 Mode
   <http://bugs.python.org/issue29240>`_
 * `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
   "Coercing the legacy C locale to C.UTF-8"
@@ -242,12 +267,12 @@ Links
 Post History
 ============
 
-* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 mode
+* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 Mode
   <https://mail.python.org/pipermail/python-dev/2017-December/151054.html>`_
 * 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
   540 (assuming UTF-8 for *nix system boundaries)
   <https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_
-* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 mode
+* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 Mode
   <https://mail.python.org/pipermail/python-ideas/2017-January/044089.html>`_
 * 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to
   C.utf-8 (msg284764) <https://bugs.python.org/issue28180#msg284764>`_
@@ -257,6 +282,19 @@ Post History
   filesystem encoding to UTF-8)
 
 
+Version History
+===============
+
+* Version 4: ``locale.getpreferredencoding()`` now returns ``'UTF-8'``
+  in the UTF-8 Mode.
+* Version 3: The UTF-8 Mode does not change the ``open()`` default error
+  handler (``strict``) anymore, and the Strict UTF-8 Mode has been
+  removed.
+* Version 2: Rewrite the PEP from scratch to make it much shorter and
+  easier to understand.
+* Version 1: First version posted to python-dev.
+
+
 Copyright
 =========