PEP 540

* Strict mode doesn't use strict for OS data anymore: keep surrogateescape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose encodings and error handlers
2017-01-11 22:08:40 +01:00 · 2017-01-11 22:08:40 +01:00 · 1b6b889ed6
parent 4ba2196903
commit 1b6b889ed6
1 changed files with 98 additions and 37 deletions
--- a/pep-0540.txt
+++ b/pep-0540.txt
@ -76,8 +76,10 @@ backward compatibility should be preserved whenever possible.
 Locale and operating system data
 --------------------------------

-Python uses the ``LC_CTYPE`` locale to decide how to encode and decode
-data from/to the operating system:
+.. _operating system data:
+
+Python uses an encoding called the "filesystem encoding" to decide how
+to encode and decode data from/to the operating system:

 * file content
 * command line arguments: ``sys.argv``
@ -91,10 +93,15 @@ data from/to the operating system:
 * etc.

 At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user
-``LC_CTYPE`` locale and then store the locale encoding,
-``sys.getfilesystemencoding()``. In the whole lifetime of a Python process,
-the same encoding and error handler are used to encode and decode data
-from/to the operating system.
+``LC_CTYPE`` locale and then stores the locale encoding as the
+"filesystem encoding". It's possible to get this encoding using
+``sys.getfilesystemencoding()``. In the whole lifetime of a Python
+process, the same encoding and error handler are used to encode and
+decode data from/to the operating system.
+
+The ``os.fsdecode()`` and ``os.fsencode()`` functions can be used to
+decode and encode operating system data. These functions use the
+filesystem error handler: ``sys.getfilesystemencodeerrors()``.
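
A minimal sketch (an illustration, not part of the PEP text) of the functions named above: it queries the filesystem encoding and error handler, then checks that a filename survives a ``str`` to ``bytes`` to ``str`` round trip. The filename is an arbitrary ASCII example chosen so that it encodes under any common filesystem encoding.

```python
import os
import sys

# Encoding and error handler selected at startup for operating system data.
fs_encoding = sys.getfilesystemencoding()
fs_errors = sys.getfilesystemencodeerrors()

# Arbitrary ASCII filename, used only for illustration.
name = "example.txt"

# os.fsencode() and os.fsdecode() both apply the filesystem encoding and
# error handler, so encoding then decoding gives back the original name.
raw = os.fsencode(name)
roundtrip = os.fsdecode(raw)

print(fs_encoding, fs_errors, raw, roundtrip)
```

The printed encoding and error handler depend on the platform and the locale, so no particular output should be expected.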

 .. note::
   In some corner case, the *current* ``LC_CTYPE`` locale must be used
@ -137,7 +144,7 @@ really uses the ASCII encoding if the the ``LC_CTYPE`` uses the the
 POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
 alias to ASCII). If not (the effective encoding is not ASCII), Python
 uses its own ASCII codec instead of using ``mbstowcs()`` and
-``wcstombs()`` functions for operating system data.
+``wcstombs()`` functions for `operating system data`_.

 See the `POSIX locale (2016 Edition)
 <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.
@ -163,8 +170,8 @@ it by mistake. Examples:
 C.UTF-8 and C.utf8 locales
 --------------------------

-Some UNIX operating systems provide a variant of the POSIX locale using the
-UTF-8 encoding:
+Some UNIX operating systems provide a variant of the POSIX locale using
+the UTF-8 encoding:

 * Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
 * Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
@ -182,7 +189,7 @@ Popularity of the UTF-8 encoding
 Python 3 uses UTF-8 by default for Python source files.

 On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
-system data. For Windows, see the PEP 529: "Change Windows filesystem
+system data. For Windows, see the `PEP 529`_: "Change Windows filesystem
 encoding to UTF-8".

 On Linux, UTF-8 became the de facto standard encoding,
@ -215,15 +222,15 @@ filesystem with filenames encoded to ISO 8859-1.

 The Linux kernel and the libc don't decode filenames: a filename is used
 as a raw array of bytes. The common solution to support any filename is
-to store filenames as bytes and don't try to decode them. When displayed to
-stdout, mojibake is displayed if the filename and the terminal don't use
-the same encoding.
+to store filenames as bytes and don't try to decode them. When displayed
+to stdout, mojibake is displayed if the filename and the terminal don't
+use the same encoding.

 Python 3 promotes Unicode everywhere including filenames. A solution to
 support filenames not decodable from the locale encoding was found: the
-``surrogateescape`` error handler (PEP 383), store undecodable bytes
+``surrogateescape`` error handler (`PEP 383`_), which stores undecodable bytes
 as surrogate characters. This error handler is used by default for
-operating system data, by ``os.fsdecode()`` and ``os.fsencode()`` for
+`operating system data`_, by ``os.fsdecode()`` and ``os.fsencode()`` for
 example (except on Windows which uses the ``strict`` error handler).


@ -239,16 +246,17 @@ Unicode encode error when displaying non-ASCII text. It is especially
 useful when the POSIX locale is used, because this locale usually uses
 the ASCII encoding.

-The problem is that operating system data like filenames are decoded
-using the ``surrogateescape`` error handler (PEP 383). Displaying a
+The problem is that `operating system data`_ like filenames are decoded
+using the ``surrogateescape`` error handler (`PEP 383`_). Displaying a
 filename to stdout raises a Unicode encode error if the filename
 contains an undecoded byte stored as a surrogate character.

 Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the
-POSIX locale is used: `issue #19977 <http://bugs.python.org/issue19977>`_. The
-idea is to passthrough operating system data even if it means mojibake, because
-most UNIX applications work like that. Most UNIX applications store filenames
-as bytes, usually simply because bytes are first-citizen class in the used
+POSIX locale is used: `issue #19977
+<http://bugs.python.org/issue19977>`_. The idea is to pass through
+`operating system data`_ even if it means mojibake, because most UNIX
+applications work like that. Most UNIX applications store filenames as
+bytes, usually simply because bytes are first-class citizens in the
 programming language used, whereas Unicode is badly supported.

 .. note::
@ -280,6 +288,10 @@ by ``-X utf8=strict`` or ``PYTHONUTF8=strict``.
 The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
 can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.

+The ``-X utf8`` command line option takes priority over the
+``PYTHONUTF8`` environment variable. For example,
+``PYTHONUTF8=0 python3 -X utf8`` enables the UTF-8 mode.
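
The priority rule can be sketched as follows. This is a hedged illustration, not the interpreter's implementation: ``utf8_mode_setting`` is a hypothetical helper, ``xoptions`` stands in for the parsed ``-X`` options (assumed to map a bare ``-X utf8`` to ``True``, as ``sys._xoptions`` does), and ``environ`` stands in for ``os.environ``.

```python
def utf8_mode_setting(xoptions, environ):
    """Return the raw UTF-8 mode setting, or None if nothing is set.

    Hypothetical helper: the command line option is checked first and
    the environment variable is only a fallback.
    """
    if "utf8" in xoptions:
        value = xoptions["utf8"]
        # A bare "-X utf8" is assumed to be stored as True: treat it as "1".
        return "1" if value is True else value
    # Fall back to PYTHONUTF8 only when -X utf8 is absent.
    return environ.get("PYTHONUTF8")


# PYTHONUTF8=0 python3 -X utf8: the command line option wins.
setting = utf8_mode_setting({"utf8": True}, {"PYTHONUTF8": "0"})
```

With these inputs ``setting`` is ``"1"``, which matches the example given above: the UTF-8 mode is enabled even though the environment variable asks to disable it.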
+
 Encoding and error handler
 --------------------------

@ -291,7 +303,7 @@ sys.stderr:
 Function                      Default                  UTF-8 or POSIX locale       UTF-8 Strict
 ============================  =======================  ==========================  ==========================
 open()                        locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
-os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape   **UTF-8/strict**
+os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape   **UTF-8**/surrogateescape
 sys.stdin, sys.stdout         locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
 sys.stderr                    locale/backslashreplace  **UTF-8**/backslashreplace  **UTF-8**/backslashreplace
 ============================  =======================  ==========================  ==========================
@ -311,11 +323,22 @@ The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
 strict mode for convenience: the idea is that data not encoded to UTF-8
 are passed through "Python" without being modified, as raw bytes.

+The ``PYTHONIOENCODING`` environment variable takes priority over the
+UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1
+python3 -X utf8`` uses the Latin1 encoding for stdin, stdout and stderr.
+
+Encodings used by ``open()``, highest priority first:
+
+* *encoding* and *errors* parameters (if set)
+* UTF-8 mode
+* ``os.device_encoding(fd)``
+* ``locale.getpreferredencoding(False)``
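
The lookup order in this list can be sketched as below. ``pick_open_encoding`` and its ``utf8_mode`` flag are hypothetical names introduced for illustration only; they are not the logic used by ``open()`` itself. The locale fallback is read from ``locale.getpreferredencoding(False)``, and only the encoding (not the error handler) is modelled.

```python
import locale
import os


def pick_open_encoding(fd, encoding=None, utf8_mode=False):
    """Hypothetical helper: choose an encoding, highest priority first."""
    if encoding is not None:
        return encoding                   # explicit *encoding* parameter
    if utf8_mode:
        return "utf-8"                    # UTF-8 mode
    device = os.device_encoding(fd)       # None unless fd is a terminal
    if device is not None:
        return device
    return locale.getpreferredencoding(False)


# A pipe is not a terminal, so os.device_encoding() returns None for it.
read_fd, write_fd = os.pipe()
try:
    explicit = pick_open_encoding(read_fd, encoding="latin1")
    forced = pick_open_encoding(read_fd, utf8_mode=True)
    fallback = pick_open_encoding(read_fd)
finally:
    os.close(read_fd)
    os.close(write_fd)
```

Here ``explicit`` is ``"latin1"``, ``forced`` is ``"utf-8"``, and ``fallback`` is whatever the current locale prefers.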
+
 Rationale
 ---------

 The UTF-8 mode is disabled by default to keep hard Unicode errors when
-encoding or decoding operating system data failed, and to keep the
+encoding or decoding `operating system data`_ fails, and to keep the
 backward compatibility. The user is responsible to enable explicitly the
 UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
 mode would be enabled *by default*.
@ -325,7 +348,7 @@ UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
 the user overrides a locale *by mistake* or if a Python program is
 started with no locale configured (and so with the POSIX locale).

-Most UNIX applications handle operating system data as bytes, so
+Most UNIX applications handle `operating system data`_ as bytes, so
 ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
 limited impact on how these data are handled by the application.

@ -336,7 +359,8 @@ everywhere and that users *expect* UTF-8.
 Ignoring ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
 Python is more convenient, since they are more commonly misconfigured
 *by mistake* (configured to use an encoding different than UTF-8,
-whereas the system uses UTF-8), rather than being misconfigured by intent.
+whereas the system uses UTF-8), rather than being misconfigured by
+intent.

 Expected mojibake and surrogate character issues
 ------------------------------------------------
@ -648,8 +672,9 @@ Don't modify the encoding of the POSIX locale
 A first version of the PEP did not change the encoding and error handler
 used for the POSIX locale.

-The problem is that adding a command line option or setting an environment
-variable is not possible in some cases, or at least not convenient.
+The problem is that adding a command line option or setting an
+environment variable is not possible in some cases, or at least not
+convenient.

 Moreover, many users simply expect that Python 3 behaves as Python 2:
 don't bother them with encodings and "just works" in all cases. These
@ -660,12 +685,12 @@ complex documents using multiple incompatibles encodings.
 Always use UTF-8
 ----------------

-Python already always use the UTF-8 encoding on Mac OS X, Android and Windows.
-Since UTF-8 became the de facto encoding, it makes sense to always use it on all
-platforms with any locale.
+Python already always uses the UTF-8 encoding on Mac OS X, Android and
+Windows.  Since UTF-8 became the de facto encoding, it makes sense to
+always use it on all platforms with any locale.

-The risk is to introduce mojibake if the locale uses a different encoding,
-especially for locales other than the POSIX locale.
+The risk is to introduce mojibake if the locale uses a different
+encoding, especially for locales other than the POSIX locale.


 Force UTF-8 for the POSIX locale
@ -674,8 +699,39 @@ Force UTF-8 for the POSIX locale
 An alternative to always using UTF-8 in any case is to only use UTF-8 when the
 ``LC_CTYPE`` locale is the POSIX locale.

-The PEP 538 "Coercing the legacy C locale to C.UTF-8" of  Nick Coghlan
-proposes to implement that using the ``C.UTF-8`` locale.
+The `PEP 538`_ "Coercing the legacy C locale to C.UTF-8" by Nick
+Coghlan proposes to implement that using the ``C.UTF-8`` locale.
+
+
+Use the strict error handler for operating system data
+------------------------------------------------------
+
+Using the ``surrogateescape`` error handler for `operating system data`_
+creates surprising surrogate characters. No Python codec (except
+``utf-7``) accepts surrogates, and so encoding text coming from the
+operating system is likely to raise an error. The problem is that
+the error comes late, very far from where the data was read.
+
+The ``strict`` error handler can be used instead to decode
+(``os.fsdecode()``) and encode (``os.fsencode()``) operating system
+data, to raise encoding errors as soon as possible. It helps to find
+bugs more quickly.
+
+The main drawback of this strategy is that it doesn't work in practice.
+Python 3 is designed on top of Unicode strings. Most functions expect
+Unicode and produce Unicode. Even if many operating system functions
+have two flavors, bytes and Unicode, the Unicode flavor is used in most
+cases. There are good reasons for that: Unicode is more convenient in
+Python 3 and using Unicode helps to support the full Unicode Character
+Set (UCS) on Windows (even if Python now uses UTF-8 since Python 3.6,
+see the `PEP 528`_ and the `PEP 529`_).
+
+For example, if ``os.fsdecode()`` uses ``utf8/strict``,
+``os.listdir(str)`` fails to list filenames of a directory if a single
+filename is not decodable from UTF-8. As a consequence,
+``shutil.rmtree(str)`` fails to remove a directory. Undecodable
+filenames, environment variables, etc. are simply too common to make
+this alternative viable.
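
The trade-off discussed in this alternative can be reproduced with plain codec calls. This is a small sketch, not the implementation of ``os.fsdecode()``: the byte string is an invented filename encoded to ISO 8859-1, and it is decoded directly with ``bytes.decode()`` so that the result does not depend on the current locale.

```python
# A filename as the kernel stores it: ISO 8859-1 bytes, not valid UTF-8.
raw_name = b"caf\xe9.txt"

# utf8/strict: decoding fails immediately, where the data is read.
try:
    raw_name.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# utf8/surrogateescape: the undecodable byte 0xE9 becomes U+DCE9 ...
name = raw_name.decode("utf-8", "surrogateescape")

# ... and encoding with the same handler restores the original bytes.
restored = name.encode("utf-8", "surrogateescape")

# The late error described above: a strict encode of the decoded name
# fails, possibly far from the place where the bytes were read.
try:
    name.encode("utf-8", "strict")
    late_error = False
except UnicodeEncodeError:
    late_error = True
```

With ``strict``, a single name like this one is enough to make a directory listing fail. With ``surrogateescape`` the listing succeeds and the name can still be passed back to the operating system unchanged.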


 Links
@ -683,9 +739,14 @@ Links

 PEPs:

-* PEP 538 "Coercing the legacy C locale to C.UTF-8"
-* PEP 529: "Change Windows filesystem encoding to UTF-8"
-* PEP 383: "Non-decodable Bytes in System Character Interfaces"
+* `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
+  "Coercing the legacy C locale to C.UTF-8"
+* `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
+  "Change Windows filesystem encoding to UTF-8"
+* `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
+  "Change Windows console encoding to UTF-8"
+* `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
+  "Non-decodable Bytes in System Character Interfaces"

 Main Python issues: