PEP 540
* Strict mode doesn't use strict for OS data anymore: keep surrogateescape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose encodings and error handlers
parent 4ba2196903
commit 1b6b889ed6

pep-0540.txt (135 lines changed)

@@ -76,8 +76,10 @@ backward compatibility should be preserved whenever possible.
 Locale and operating system data
 --------------------------------
 
-Python uses the ``LC_CTYPE`` locale to decide how to encode and decode
-data from/to the operating system:
+.. _operating system data:
+
+Python uses an encoding called the "filesystem encoding" to decide how
+to encode and decode data from/to the operating system:
 
 * file content
 * command line arguments: ``sys.argv``
@@ -91,10 +93,15 @@ data from/to the operating system:
 * etc.
 
 At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user
-``LC_CTYPE`` locale and then store the locale encoding,
-``sys.getfilesystemencoding()``. In the whole lifetime of a Python process,
-the same encoding and error handler are used to encode and decode data
-from/to the operating system.
+``LC_CTYPE`` locale and then store the locale encoding as the
+"filesystem encoding". It's possible to get this encoding using
+``sys.getfilesystemencoding()``. In the whole lifetime of a Python
+process, the same encoding and error handler are used to encode and
+decode data from/to the operating system.
+
+The ``os.fsdecode()`` and ``os.fsencode()`` functions can be used to
+decode and encode operating system data. These functions use the
+filesystem error handler: ``sys.getfilesystemencodeerrors()``.
 
 .. note::
    In some corner case, the *current* ``LC_CTYPE`` locale must be used
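
The paragraph added in the hunk above names ``os.fsdecode()``, ``os.fsencode()``, ``sys.getfilesystemencoding()`` and ``sys.getfilesystemencodeerrors()``. A minimal sketch of how they fit together follows; the helper names ``describe_filesystem_codec`` and ``roundtrip_filename`` are illustrative and not part of the PEP, and the printed values depend on the platform and locale.

```python
import os
import sys


def describe_filesystem_codec():
    """Return the (encoding, error handler) pair Python uses for OS data."""
    return sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()


def roundtrip_filename(name):
    """Encode a str filename to bytes and decode it back (hypothetical helper)."""
    raw = os.fsencode(name)   # str -> bytes with the filesystem encoding/handler
    return os.fsdecode(raw)   # bytes -> str with the same encoding/handler


if __name__ == "__main__":
    encoding, errors = describe_filesystem_codec()
    # Output is platform dependent, so no expected value is shown here.
    print(f"filesystem encoding={encoding!r}, error handler={errors!r}")
    print(roundtrip_filename("report.txt"))
```

Because both functions use the same encoding and error handler for the lifetime of the process, a filename that was decoded with ``os.fsdecode()`` can be handed back to the operating system through ``os.fsencode()``.
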
@@ -137,7 +144,7 @@ really uses the ASCII encoding if the the ``LC_CTYPE`` uses the the
 POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
 alias to ASCII). If not (the effective encoding is not ASCII), Python
 uses its own ASCII codec instead of using ``mbstowcs()`` and
-``wcstombs()`` functions for operating system data.
+``wcstombs()`` functions for `operating system data`_.
 
 See the `POSIX locale (2016 Edition)
 <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.
@@ -163,8 +170,8 @@ it by mistake. Examples:
 C.UTF-8 and C.utf8 locales
 --------------------------
 
-Some UNIX operating systems provide a variant of the POSIX locale using the
-UTF-8 encoding:
+Some UNIX operating systems provide a variant of the POSIX locale using
+the UTF-8 encoding:
 
 * Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
 * Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
@@ -182,7 +189,7 @@ Popularity of the UTF-8 encoding
 Python 3 uses UTF-8 by default for Python source files.
 
 On Mac OS X, Windows and Android, Python always use UTF-8 for operating
-system data. For Windows, see the PEP 529: "Change Windows filesystem
+system data. For Windows, see the `PEP 529`_: "Change Windows filesystem
 encoding to UTF-8".
 
 On Linux, UTF-8 became the de facto standard encoding,
@@ -215,15 +222,15 @@ filesystem with filenames encoded to ISO 8859-1.
 
 The Linux kernel and the libc don't decode filenames: a filename is used
 as a raw array of bytes. The common solution to support any filename is
-to store filenames as bytes and don't try to decode them. When displayed to
-stdout, mojibake is displayed if the filename and the terminal don't use
-the same encoding.
+to store filenames as bytes and don't try to decode them. When displayed
+to stdout, mojibake is displayed if the filename and the terminal don't
+use the same encoding.
 
 Python 3 promotes Unicode everywhere including filenames. A solution to
 support filenames not decodable from the locale encoding was found: the
-``surrogateescape`` error handler (PEP 383), store undecodable bytes
+``surrogateescape`` error handler (`PEP 383`_), which stores undecodable bytes
 as surrogate characters. This error handler is used by default for
-operating system data, by ``os.fsdecode()`` and ``os.fsencode()`` for
+`operating system data`_, by ``os.fsdecode()`` and ``os.fsencode()`` for
 example (except on Windows which uses the ``strict`` error handler).
 
 
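
To make the ``surrogateescape`` behaviour described in the hunk above concrete, here is a small sketch. It decodes bytes as UTF-8 explicitly instead of going through ``os.fsdecode()``, so that the result does not depend on the locale; the byte string ``RAW_NAME`` and both helper names are made up for illustration.

```python
# "café.txt" encoded as ISO 8859-1: the byte 0xE9 is not valid UTF-8 here.
RAW_NAME = b"caf\xe9.txt"


def decode_os_bytes(raw, encoding="utf-8"):
    """Decode bytes, mapping each undecodable byte to a lone surrogate."""
    return raw.decode(encoding, "surrogateescape")


def encode_os_text(text, encoding="utf-8"):
    """Encode text, turning escaped surrogates back into the original bytes."""
    return text.encode(encoding, "surrogateescape")


if __name__ == "__main__":
    name = decode_os_bytes(RAW_NAME)
    print(ascii(name))                      # the 0xE9 byte appears as \udce9
    assert encode_os_text(name) == RAW_NAME  # lossless round trip
    try:
        name.encode("utf-8")                # strict handler rejects the surrogate
    except UnicodeEncodeError as exc:
        print("strict encoding failed:", exc.reason)
```

The round trip is lossless, but the decoded string carries a lone surrogate, which is why writing it to a stream that uses the ``strict`` handler fails.
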
@@ -239,16 +246,17 @@ Unicode encode error when displaying non-ASCII text. It is especially
 useful when the POSIX locale is used, because this locale usually uses
 the ASCII encoding.
 
-The problem is that operating system data like filenames are decoded
-using the ``surrogateescape`` error handler (PEP 383). Displaying a
+The problem is that `operating system data`_ like filenames are decoded
+using the ``surrogateescape`` error handler (`PEP 383`_). Displaying a
 filename to stdout raises a Unicode encode error if the filename
 contains an undecoded byte stored as a surrogate character.
 
 Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the
-POSIX locale is used: `issue #19977 <http://bugs.python.org/issue19977>`_. The
-idea is to passthrough operating system data even if it means mojibake, because
-most UNIX applications work like that. Most UNIX applications store filenames
-as bytes, usually simply because bytes are first-citizen class in the used
+POSIX locale is used: `issue #19977
+<http://bugs.python.org/issue19977>`_. The idea is to pass through
+`operating system data`_ even if it means mojibake, because most UNIX
+applications work like that. Most UNIX applications store filenames as
+bytes, usually simply because bytes are first-class citizens in the used
 programming language, whereas Unicode is badly supported.
 
 .. note::
@@ -280,6 +288,10 @@ by ``-X utf8=strict`` or ``PYTHONUTF8=strict``.
 The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
 can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
 
+The ``-X utf8`` option takes priority over the ``PYTHONUTF8`` environment
+variable. For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the
+UTF-8 mode.
+
 Encoding and error handler
 --------------------------
 
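
The priority rule added above (command line option first, then environment variable, then the POSIX locale default) can be sketched as a pure function. This is an illustration only: ``resolve_utf8_mode`` is a hypothetical helper, not CPython's implementation, it performs no validation of the values, and it assumes the options are passed in the shape of ``sys._xoptions``, where a bare ``-X utf8`` is stored as ``True``.

```python
def resolve_utf8_mode(x_options, environ, posix_locale):
    """Return "0", "1" or "strict" following the priority described in the PEP.

    x_options    -- dict shaped like sys._xoptions (bare ``-X utf8`` maps to True)
    environ      -- dict shaped like os.environ
    posix_locale -- True if the LC_CTYPE locale is the POSIX locale
    """
    # 1. The -X utf8 command line option wins over everything else.
    if "utf8" in x_options:
        value = x_options["utf8"]
        return "1" if value is True else value
    # 2. Otherwise the PYTHONUTF8 environment variable decides.
    if "PYTHONUTF8" in environ:
        return environ["PYTHONUTF8"]
    # 3. Otherwise the POSIX locale enables the mode, any other locale does not.
    return "1" if posix_locale else "0"


if __name__ == "__main__":
    # Equivalent of: PYTHONUTF8=0 python3 -X utf8
    print(resolve_utf8_mode({"utf8": True}, {"PYTHONUTF8": "0"}, False))
```
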
@@ -291,7 +303,7 @@ sys.stderr:
 Function                      Default                  UTF-8 or POSIX locale       UTF-8 Strict
 ============================  =======================  ==========================  ==========================
 open()                        locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
-os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape   **UTF-8/strict**
+os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape   **UTF-8**/surrogateescape
 sys.stdin, sys.stdout         locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
 sys.stderr                    locale/backslashreplace  **UTF-8**/backslashreplace  **UTF-8**/backslashreplace
 ============================  =======================  ==========================  ==========================
@@ -311,11 +323,22 @@ The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
 strict mode for convenience: the idea is that data not encoded to UTF-8
 are passed through "Python" without being modified, as raw bytes.
 
+The ``PYTHONIOENCODING`` environment variable takes priority over the
+UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1
+python3 -X utf8`` uses the Latin1 encoding for stdin, stdout and stderr.
+
+Encodings used by ``open()``, highest priority first:
+
+* *encoding* and *errors* parameters (if set)
+* UTF-8 mode
+* os.device_encoding(fd)
+* locale.getpreferredencoding(False)
+
 Rationale
 ---------
 
 The UTF-8 mode is disabled by default to keep hard Unicode errors when
-encoding or decoding operating system data failed, and to keep the
+encoding or decoding `operating system data`_ failed, and to keep the
 backward compatibility. The user is responsible to enable explicitly the
 UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
 mode would be enabled *by default*.
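
The priority list for ``open()`` added in the hunk above reads naturally as a chain of fallbacks. The sketch below is a hypothetical model of that list, not the real ``open()`` implementation; in an actual program the last two arguments would come from ``os.device_encoding(fd)`` and ``locale.getpreferredencoding(False)``, and all parameter names here are assumptions.

```python
def choose_open_encoding(explicit=None, utf8_mode=False,
                         device_encoding=None, preferred_encoding="ascii"):
    """Pick an encoding using the priority order listed in the PEP."""
    if explicit is not None:         # 1. encoding= parameter, if set
        return explicit
    if utf8_mode:                    # 2. UTF-8 mode
        return "utf-8"
    if device_encoding is not None:  # 3. encoding of the terminal device, if any
        return device_encoding
    return preferred_encoding        # 4. locale preferred encoding


if __name__ == "__main__":
    # An explicit parameter wins even when the UTF-8 mode is enabled.
    print(choose_open_encoding(explicit="latin-1", utf8_mode=True))
    # Without an explicit parameter, the UTF-8 mode wins over the locale.
    print(choose_open_encoding(utf8_mode=True, preferred_encoding="ascii"))
```
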
@@ -325,7 +348,7 @@ UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
 the user overrides a locale *by mistake* or if a Python program is
 started with no locale configured (and so with the POSIX locale).
 
-Most UNIX applications handle operating system data as bytes, so
+Most UNIX applications handle `operating system data`_ as bytes, so
 ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
 limited impact on how these data are handled by the application.
 
@@ -336,7 +359,8 @@ everywhere and that users *expect* UTF-8.
 Ignoring ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
 Python is more convenient, since they are more commonly misconfigured
 *by mistake* (configured to use an encoding different than UTF-8,
-whereas the system uses UTF-8), rather than being misconfigured by intent.
+whereas the system uses UTF-8), rather than being misconfigured by
+intent.
 
 Expected mojibake and surrogate character issues
 ------------------------------------------------
@@ -648,8 +672,9 @@ Don't modify the encoding of the POSIX locale
 A first version of the PEP did not change the encoding and error handler
 used of the POSIX locale.
 
-The problem is that adding a command line option or setting an environment
-variable is not possible in some cases, or at least not convenient.
+The problem is that adding a command line option or setting an
+environment variable is not possible in some cases, or at least not
+convenient.
 
 Moreover, many users simply expect that Python 3 behaves as Python 2:
 don't bother them with encodings and "just works" in all cases. These
@@ -660,12 +685,12 @@ complex documents using multiple incompatibles encodings.
 Always use UTF-8
 ----------------
 
-Python already always use the UTF-8 encoding on Mac OS X, Android and Windows.
-Since UTF-8 became the de facto encoding, it makes sense to always use it on all
-platforms with any locale.
+Python already always uses the UTF-8 encoding on Mac OS X, Android and
+Windows. Since UTF-8 became the de facto encoding, it makes sense to
+always use it on all platforms with any locale.
 
-The risk is to introduce mojibake if the locale uses a different encoding,
-especially for locales other than the POSIX locale.
+The risk is to introduce mojibake if the locale uses a different
+encoding, especially for locales other than the POSIX locale.
 
 
 Force UTF-8 for the POSIX locale
@@ -674,8 +699,39 @@ Force UTF-8 for the POSIX locale
 An alternative to always using UTF-8 in any case is to only use UTF-8 when the
 ``LC_CTYPE`` locale is the POSIX locale.
 
-The PEP 538 "Coercing the legacy C locale to C.UTF-8" of Nick Coghlan
-proposes to implement that using the ``C.UTF-8`` locale.
+The `PEP 538`_ "Coercing the legacy C locale to C.UTF-8" of Nick
+Coghlan proposes to implement that using the ``C.UTF-8`` locale.
+
+
+Use the strict error handler for operating system data
+------------------------------------------------------
+
+Using the ``surrogateescape`` error handler for `operating system data`_
+creates surprising surrogate characters. No Python codec (except
+``utf-7``) accepts surrogates, and so encoding text coming from the
+operating system is likely to raise an error. The problem is that
+the error comes late, very far from where the data was read.
+
+The ``strict`` error handler can be used instead to decode
+(``os.fsdecode()``) and encode (``os.fsencode()``) operating system
+data, to raise encoding errors as soon as possible. It helps to find
+bugs more quickly.
+
+The main drawback of this strategy is that it doesn't work in practice.
+Python 3 is designed on top of Unicode strings. Most functions expect
+Unicode and produce Unicode. Even if many operating system functions
+have two flavors, bytes and Unicode, the Unicode flavor is used in most
+cases. There are good reasons for that: Unicode is more convenient in
+Python 3 and using Unicode helps to support the full Unicode Character
+Set (UCS) on Windows (even if Python now uses UTF-8 since Python 3.6,
+see the `PEP 528`_ and the `PEP 529`_).
+
+For example, if ``os.fsdecode()`` uses ``utf8/strict``,
+``os.listdir(str)`` fails to list filenames of a directory if a single
+filename is not decodable from UTF-8. As a consequence,
+``shutil.rmtree(str)`` fails to remove a directory. Undecodable
+filenames, environment variables, etc. are simply too common to make
+this alternative viable.
 
 
 Links
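
The ``os.listdir(str)`` argument in the new alternative above can be simulated without touching the filesystem. The sketch below decodes a made-up list of raw directory entries with each error handler; ``decode_listing`` and ``RAW_LISTING`` are illustrative names, and real ``os.listdir()`` behaviour additionally depends on the platform.

```python
# Hypothetical raw directory entries; the second one is ISO 8859-1, not UTF-8.
RAW_LISTING = [b"notes.txt", b"caf\xe9.txt", b"data.csv"]


def decode_listing(raw_names, errors):
    """Decode every entry as UTF-8 with the given error handler."""
    return [raw.decode("utf-8", errors) for raw in raw_names]


if __name__ == "__main__":
    # surrogateescape: every entry is returned, the bad byte is escaped.
    print([ascii(name) for name in decode_listing(RAW_LISTING, "surrogateescape")])
    # strict: a single undecodable entry makes the whole listing fail.
    try:
        decode_listing(RAW_LISTING, "strict")
    except UnicodeDecodeError as exc:
        print("strict listing failed on byte", hex(exc.object[exc.start]))
```

With ``strict``, one bad name hides the two valid ones as well, which is the practical reason the PEP gives for keeping ``surrogateescape`` for operating system data.
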
@@ -683,9 +739,14 @@ Links
 
 PEPs:
 
-* PEP 538 "Coercing the legacy C locale to C.UTF-8"
-* PEP 529: "Change Windows filesystem encoding to UTF-8"
-* PEP 383: "Non-decodable Bytes in System Character Interfaces"
+* `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
+  "Coercing the legacy C locale to C.UTF-8"
+* `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
+  "Change Windows filesystem encoding to UTF-8"
+* `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
+  "Change Windows console encoding to UTF-8"
+* `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
+  "Non-decodable Bytes in System Character Interfaces"
 
 Main Python issues:
 