Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephrase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links
2017-01-05 23:54:22 +01:00
parent 9780f3ab43
commit 5b6b25f5d9
1 changed file with 266 additions and 39 deletions
--- a/pep-0540.txt
+++ b/pep-0540.txt
@@ -13,9 +13,16 @@ Python-Version: 3.7
 Abstract
 ========

-Add a new UTF-8 mode, opt-in option to use UTF-8 for operating system
-data instead of the locale encoding. Add ``-X utf8`` command line option
-and ``PYTHONUTF8`` environment variable.
+Add a new UTF-8 mode, disabled by default, to ignore the locale and
+force the usage of the UTF-8 encoding.
+
+Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
+bother users with encodings, but it can produce mojibake. The UTF-8 mode
+can be configured as strict to prevent mojibake.
+
+A new ``-X utf8`` command line option and a ``PYTHONUTF8`` environment
+variable are added to control the UTF-8 mode. The POSIX locale enables
+the UTF-8 mode.


 Context
@@ -33,9 +40,8 @@ data from/to the operating system:
 * environment variables: ``os.environ``
 * filenames: ``os.listdir(str)`` for example
 * pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
-* error messages
-* name of a timezone
-* user name, terminal name: ``os``, ``grp`` and ``pwd`` modules
+* error messages: ``os.strerror(code)`` for example
+* user and terminal names: ``os``, ``grp`` and ``pwd`` modules
 * host name, UNIX socket path: see the ``socket`` module
 * etc.

@@ -81,7 +87,7 @@ arguments are decoded by ``mbstowcs()`` and encoded back by
 ``os.fsencode()``, an ``UnicodeEncodeError`` exception is raised instead
 of retrieving the original byte string.

-To fix this issue, Python now checks since Python 3.4 if ``mbstowcs()``
+To fix this issue, Python has checked since Python 3.4 whether ``mbstowcs()``
 really uses the ASCII encoding if ``LC_CTYPE`` uses the
 POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
 alias to ASCII). If not (the effective encoding is not ASCII), Python
@@ -95,16 +101,18 @@ See the `POSIX locale (2016 Edition)
 C.UTF-8 and C.utf8 locales
 --------------------------

-Some operating systems provide a variant of the POSIX locale using the
+Some UNIX operating systems provide a variant of the POSIX locale using the
 UTF-8 encoding:

 * Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
-* Debian (eglibc 2.13-1, 2011): ``"C.UTF-8"``
+* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
 * HP-UX: ``"C.utf8"``

-It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
+It was proposed to add a ``C.UTF-8`` locale to the glibc: `glibc C.UTF-8
 proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.

+There is no plan to add such a locale to BSD systems.
+

 Popularity of the UTF-8 encoding
 --------------------------------
@@ -112,11 +120,10 @@ Popularity of the UTF-8 encoding
 Python 3 uses UTF-8 by default for Python source files.

 On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
-system data instead of the locale encoding. For Windows, see the `PEP
-529: Change Windows filesystem encoding to UTF-8
-<https://www.python.org/dev/peps/pep-0529/>`_.
+system data. For Windows, see PEP 529: "Change Windows filesystem
+encoding to UTF-8".

-On Linux, UTF-8 became the defacto standard encoding by default,
+On Linux, UTF-8 became the de facto standard encoding,
 replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
 using different encodings for filenames and standard streams is likely
 to create mojibake, so UTF-8 is now used *everywhere*.
@@ -152,8 +159,7 @@ the same encoding.

 Python 3 promotes Unicode everywhere including filenames. A solution to
 support filenames not decodable from the locale encoding was found: the
-``surrogateescape`` error handler (`PEP 393
-<https://www.python.org/dev/peps/pep-0393/>`_), store undecodable bytes
+``surrogateescape`` error handler (PEP 383), which stores undecodable bytes
 as surrogate characters. This error handler is used by default for
 operating system data, by ``os.fsdecode()`` and ``os.fsencode()`` for
 example (except on Windows which uses the ``strict`` error handler).
@@ -172,7 +178,7 @@ useful when the POSIX locale is used, because this locale usually uses
 the ASCII encoding.

 The problem is that operating system data like filenames are decoded
-using the ``surrogateescape`` error handler (PEP 393). Displaying a
+using the ``surrogateescape`` error handler (PEP 383). Displaying a
 filename to stdout raises an Unicode encode error if the filename
 contains an undecoded byte stored as a surrogate character.
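
The behaviour described in this hunk can be reproduced with plain codec calls. The following is a minimal sketch, not part of the PEP text: the filename bytes are a made-up example, and the ``utf-8`` codec stands in for whatever locale encoding is in effect.

```python
# Hypothetical filename bytes that are not valid UTF-8
# (0xE9 is "é" in ISO 8859-1).
raw = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same error handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", errors="surrogateescape")

# A strict encoder, as used by a strict output stream, rejects the surrogate.
try:
    name.encode("utf-8")  # errors="strict" is the default
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(ascii(name), roundtrip == raw, strict_failed)
```

The round trip is lossless, which is why the handler suits filenames, while the strict encode failure is the error a user sees when such a name is printed.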

@@ -191,28 +197,60 @@ programming language, whereas Unicode is badly supported.
 Proposal
 ========

-Add a new UTF-8 mode, opt-in option to use UTF-8 for operating system data
-instead of the locale encoding:
+Changes
+-------

-* Add ``-X utf8`` command line option
-* Add ``PYTHONUTF8=1`` environment variable
+Add a new UTF-8 mode, disabled by default, to ignore the locale and
+force the usage of the UTF-8 encoding with the ``surrogateescape`` error
+handler, instead of using the locale encoding (with the ``strict`` or
+``surrogateescape`` error handler depending on the case).

-Add also a strict UTF-8 mode, enabled by ``-X utf8=strict`` or
-``PYTHONUTF8=strict``.
+Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
+bother users with encodings, but it can produce mojibake. It can be
+configured as strict to prevent mojibake: the UTF-8 encoding is used
+with the ``strict`` error handler in this case.
+
+A new ``-X utf8`` command line option and a ``PYTHONUTF8`` environment
+variable are added to control the UTF-8 mode. The UTF-8 mode is enabled
+by ``-X utf8`` or ``PYTHONUTF8=1``. The UTF-8 mode is configured as strict
+by ``-X utf8=strict`` or ``PYTHONUTF8=strict``.
+
+The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
+can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
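
A minimal sketch of how the option could be checked from a script, assuming an interpreter that already implements the mode (Python 3.7 or newer). ``sys.flags.utf8_mode`` is taken from the implementation that eventually shipped, not from this revision of the PEP, and ``utf8_mode_flag`` is a hypothetical helper name. The ``strict`` variant is deliberately not exercised, since it is only a proposal in this text.

```python
import subprocess
import sys


def utf8_mode_flag(*interpreter_args):
    """Run a child interpreter and return its sys.flags.utf8_mode as text."""
    cmd = [
        sys.executable,
        *interpreter_args,
        "-c",
        "import sys; print(sys.flags.utf8_mode)",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Explicitly enable, then explicitly disable, the UTF-8 mode.
enabled = utf8_mode_flag("-X", "utf8")
disabled = utf8_mode_flag("-X", "utf8=0")
print(enabled, disabled)
```

Running the check in a child process matters because the mode is fixed at interpreter startup and cannot be toggled from inside a running program.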
+
+Encoding and error handler
+--------------------------

 The UTF-8 mode changes the default encoding and error handler used by
 open(), os.fsdecode(), os.fsencode(), sys.stdin, sys.stdout and
 sys.stderr:

-============================  =======================  =======================  ======================  ======================
-Function                      Default, other locales   Default, POSIX locale    UTF-8                   UTF-8 Strict
-============================  =======================  =======================  ======================  ======================
-open()                        locale/strict            locale/strict            UTF-8/surrogateescape   UTF-8/strict
-os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape   UTF-8/surrogateescape   UTF-8/strict
-sys.stdin                     locale/strict            locale/surrogateescape   UTF-8/surrogateescape   UTF-8/strict
-sys.stdout                    locale/strict            locale/surrogateescape   UTF-8/surrogateescape   UTF-8/strict
-sys.stderr                    locale/backslashreplace  locale/backslashreplace  UTF-8/backslashreplace  UTF-8/backslashreplace
-============================  =======================  =======================  ======================  ======================
+============================  =======================  ==========================  ==========================
+Function                      Default                  UTF-8 or POSIX locale       UTF-8 Strict
+============================  =======================  ==========================  ==========================
+open()                        locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
+os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape   **UTF-8/strict**
+sys.stdin, sys.stdout         locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
+sys.stderr                    locale/backslashreplace  **UTF-8**/backslashreplace  **UTF-8**/backslashreplace
+============================  =======================  ==========================  ==========================
+
+By comparison, Python 3.6 uses:
+
+============================  =======================  ==========================
+Function                      Default                  POSIX locale
+============================  =======================  ==========================
+open()                        locale/strict            locale/strict
+os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape
+sys.stdin, sys.stdout         locale/strict            locale/**surrogateescape**
+sys.stderr                    locale/backslashreplace  locale/backslashreplace
+============================  =======================  ==========================
+
+The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
+``strict`` error handler for convenience: the idea is that data not encoded
+to UTF-8 passes through Python unmodified, as raw bytes.
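
The pass-through idea can be illustrated with in-memory streams. This is a sketch under the assumption that ``io.TextIOWrapper`` objects configured by hand behave like the standard streams in the two table columns above; the sample bytes are invented.

```python
import io

# 0xFF can never appear in valid UTF-8, so this input is not decodable.
data = b"name: \xff\n"

# "UTF-8" column: the undecodable byte becomes a lone surrogate on input...
reader = io.TextIOWrapper(
    io.BytesIO(data), encoding="utf-8", errors="surrogateescape", newline="\n"
)
text = reader.read()

# ...and is written back unchanged by a stream using the same error handler.
sink = io.BytesIO()
writer = io.TextIOWrapper(
    sink, encoding="utf-8", errors="surrogateescape", newline="\n"
)
writer.write(text)
writer.flush()
passed_through = sink.getvalue()

# "UTF-8 Strict" column: the same input is rejected at decode time.
try:
    io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", errors="strict").read()
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True

print(passed_through == data, strict_rejected)
```

``newline="\n"`` is set only to keep the byte comparison identical across platforms.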
+
+Rationale
+---------

 The UTF-8 mode is disabled by default to keep hard Unicode errors when
 encoding or decoding operating system data failed, and to keep the
@@ -238,17 +276,184 @@ Python is more convenient, since they are more commonly misconfigured
 *by mistake* (configured to use an encoding different than UTF-8,
 whereas the system uses UTF-8), rather than being misconfigured by intent.

+Expected mojibake issues
+------------------------
+
+The UTF-8 mode only affects Python 3.7 code; other code is not aware of this
+mode.
+
+If Python 3.7 is used as a producer in a ``producer | consumer`` shell command,
+the consumer may fail to decode the input data if it decodes it and the locale
+encoding is not UTF-8. If the consumer doesn't decode its input and processes
+it as bytes, it should just work.
+
+If Python 3.7 is used as a consumer in a ``producer | consumer`` shell command,
+it should just work.
+
+If Python calls third party libraries or if Python is embedded in an
+application, code outside Python is not aware of the UTF-8 mode. If the other
+code uses UTF-8, it's fine. If the other code uses the locale encoding,
+mojibake will occur when the locale encoding is not UTF-8.
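
As an illustration of the mojibake mentioned here, the sketch below simulates a producer writing UTF-8 and a consumer decoding with a different encoding. ISO 8859-1 is an arbitrary choice standing in for "a locale encoding that is not UTF-8".

```python
text = "euro: \u20ac"

# The producer (Python in UTF-8 mode) emits UTF-8 bytes.
produced = text.encode("utf-8")

# A consumer decoding with ISO 8859-1 does not fail, but sees three
# unrelated characters in place of the euro sign: mojibake.
garbled = produced.decode("iso-8859-1")

# A consumer that handles the data as bytes, or decodes it as UTF-8,
# is unaffected.
intact = produced.decode("utf-8")

print(ascii(garbled), intact == text)
```

No exception is raised in the garbled case, which is what makes this kind of corruption easy to miss.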
+
+
+Use Cases
+=========
+
+List a directory into stdout
+----------------------------
+
+Script listing the content of the current directory into stdout::
+
+    import os
+    for name in os.listdir(os.curdir):
+        print(name)
+
+Result:
+
+========================  ==============================
+Python                    Always works?
+========================  ==============================
+Python 2                  **Yes**
+Python 3                  No
+Python 3.5, POSIX locale  **Yes**
+UTF-8 mode                **Yes**
+UTF-8 Strict mode         No
+========================  ==============================
+
+"Yes" means that the script cannot fail, but it can produce mojibake.
+
+"No" means that the script can fail on decoding or encoding a filename
+depending on the locale or the filename.
+
+
+List a directory into a text file
+---------------------------------
+
+Similar to the previous example, except that the listing is written into
+a text file::
+
+    import os
+    names = os.listdir(os.curdir)
+    with open("/tmp/content.txt", "w") as fp:
+        for name in names:
+            fp.write("%s\n" % name)
+
+Result:
+
+========================  ==============================
+Python                    Always works?
+========================  ==============================
+Python 2                  **Yes**
+Python 3                  No
+Python 3.5, POSIX locale  No
+UTF-8 mode                **Yes**
+UTF-8 Strict mode         No
+========================  ==============================
+
+"Yes" means that the script cannot fail, but it can produce mojibake.
+
+"No" means that the script can fail on decoding or encoding a filename
+depending on the locale or the filename. Typical error::
+
+    $ LC_ALL=C python3 test.py
+    Traceback (most recent call last):
+      File "test.py", line 5, in <module>
+        fp.write("%s\n" % name)
+    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
+
+
+Display Unicode characters into stdout
+--------------------------------------
+
+A very basic example illustrating a common issue: displaying the euro sign
+(U+20AC: €)::
+
+    print("euro: \u20ac")
+
+Result:
+
+========================  ==============================
+Python                    Always works?
+========================  ==============================
+Python 2                  No
+Python 3                  No
+Python 3.5, POSIX locale  No
+UTF-8 mode                **Yes**
+UTF-8 Strict mode         **Yes**
+========================  ==============================
+
+"Yes" means that the script cannot fail, but it can produce mojibake.
+
+"No" means that the script can fail on encoding the euro sign depending on the
+locale encoding.
+
+
+Replace a word in a text
+------------------------
+
+The following script replaces the word "apple" with "orange". It
+reads input from stdin and writes the output into stdout::
+
+    import sys
+    text = sys.stdin.read()
+    sys.stdout.write(text.replace("apple", "orange"))
+
+Result:
+
+========================  ==============================
+Python                    Always works?
+========================  ==============================
+Python 2                  **Yes**
+Python 3                  No
+Python 3.5, POSIX locale  **Yes**
+UTF-8 mode                **Yes**
+UTF-8 Strict mode         No
+========================  ==============================
+
+"Yes" means that the script cannot fail.
+
+"No" means that the script can fail on decoding the input depending on
+the locale.
+

 Backward Compatibility
 ======================

-Since the UTF-8 mode is disabled by default, it has no impact on the
-backward compatibility. The new UTF-8 mode must be enabled explicitly.
+The main backward incompatible change is that the UTF-8 encoding is now
+used if the locale is POSIX. Since the UTF-8 encoding is used with the
+``surrogateescape`` error handler, encoding errors should not occur and
+so the change should not break applications.
+
+A more likely source of trouble is external libraries. Python
+can successfully decode data from UTF-8, but a library using the locale
+encoding can fail to encode the decoded text back to bytes. Hopefully,
+encoding text in a library is a rare operation. Very few libraries
+expect text; most libraries expect bytes and even manipulate bytes
+internally.
+
+If the locale is not POSIX, the PEP has no impact on backward
+compatibility since the UTF-8 mode is disabled by default in this case:
+it must be enabled explicitly.


 Alternatives
 ============

+Don't modify the encoding of the POSIX locale
+---------------------------------------------
+
+A first version of the PEP did not change the encoding and error handler
+used for the POSIX locale.
+
+The problem is that adding a command line option or setting an environment
+variable is not possible in some cases, or at least not convenient.
+
+Moreover, many users simply expect Python 3 to behave like Python 2:
+it doesn't bother them with encodings and "just works" in all cases. These
+users don't worry about mojibake, or even expect mojibake because of
+complex documents using multiple incompatible encodings.
+
+
 Always use UTF-8
 ----------------

@@ -266,13 +471,35 @@ Force UTF-8 for the POSIX locale
 An alternative to always using UTF-8 in any case is to only use UTF-8 when the
 ``LC_CTYPE`` locale is the POSIX locale.

-The `PEP 538: Coercing the legacy C locale to C.UTF-8
-<https://www.python.org/dev/peps/pep-0538/>`_ of  Nick Coghlan proposes to
-implement that using the ``C.UTF-8`` locale.
+PEP 538 "Coercing the legacy C locale to C.UTF-8" by Nick Coghlan
+proposes to implement that using the ``C.UTF-8`` locale.


-Related Work
-============
+Links
+=====
+
+PEPs:
+
+* PEP 538: "Coercing the legacy C locale to C.UTF-8"
+* PEP 529: "Change Windows filesystem encoding to UTF-8"
+* PEP 383: "Non-decodable Bytes in System Character Interfaces"
+
+Python issues:
+
+* `Issue #28180: sys.getfilesystemencoding() should default to utf-8
+  <http://bugs.python.org/issue28180>`_
+* `Issue #19846: Python 3 raises Unicode errors with the C locale
+  <http://bugs.python.org/issue19846>`_
+* `Issue #8622: Add PYTHONFSENCODING environment variable
+  <https://bugs.python.org/issue8622>`_: added but reverted because of
+  many issues, read the `Inconsistencies if locale and filesystem
+  encodings are different
+  <https://mail.python.org/pipermail/python-dev/2010-October/104509.html>`_
+  thread on the python-dev mailing list
+
+
+Prior Art
+=========

 Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
 variable to force UTF-8: see `perlrun