* Strict mode doesn't use strict for OS data anymore: keep
  surrogateescape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers
Victor Stinner 2017-01-11 22:08:40 +01:00
parent 4ba2196903
commit 1b6b889ed6
1 changed files with 98 additions and 37 deletions

@ -76,8 +76,10 @@ backward compatibility should be preserved whenever possible.
Locale and operating system data
--------------------------------
Python uses the ``LC_CTYPE`` locale to decide how to encode and decode
data from/to the operating system:
.. _operating system data:
Python uses an encoding called the "filesystem encoding" to decide how
to encode and decode data from/to the operating system:
* file content
* command line arguments: ``sys.argv``
@ -91,10 +93,15 @@ data from/to the operating system:
* etc.
At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user
``LC_CTYPE`` locale and then store the locale encoding,
``sys.getfilesystemencoding()``. In the whole lifetime of a Python process,
the same encoding and error handler are used to encode and decode data
from/to the operating system.
``LC_CTYPE`` locale and then store the locale encoding as the
"filesystem encoding". It's possible to get this encoding using
``sys.getfilesystemencoding()``. In the whole lifetime of a Python
process, the same encoding and error handler are used to encode and
decode data from/to the operating system.
The ``os.fsdecode()`` and ``os.fsencode()`` functions can be used to
decode and encode operating system data. These functions use the
filesystem error handler: ``sys.getfilesystemencodeerrors()``.
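A minimal sketch of the round trip these two functions provide, assuming a POSIX system where the filesystem error handler is ``surrogateescape`` (Windows may behave differently). The byte string is an arbitrary example, not taken from the PEP:

```python
import os
import sys

# Arbitrary example: a name containing a byte that may be undecodable
# in the filesystem encoding.
raw_name = b"report-\xff.txt"

# Decode with the filesystem encoding and its error handler.
text_name = os.fsdecode(raw_name)

# Encode back with the same encoding and error handler.
round_tripped = os.fsencode(text_name)

print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
print(round_tripped == raw_name)
```

Because both directions use the same encoding and error handler, the original bytes are expected to come back unchanged.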
.. note::
In some corner cases, the *current* ``LC_CTYPE`` locale must be used
@ -137,7 +144,7 @@ really uses the ASCII encoding if the the ``LC_CTYPE`` uses the the
POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
alias to ASCII). If not (the effective encoding is not ASCII), Python
uses its own ASCII codec instead of using ``mbstowcs()`` and
``wcstombs()`` functions for operating system data.
``wcstombs()`` functions for `operating system data`_.
See the `POSIX locale (2016 Edition)
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.
@ -163,8 +170,8 @@ it by mistake. Examples:
C.UTF-8 and C.utf8 locales
--------------------------
Some UNIX operating systems provide a variant of the POSIX locale using the
UTF-8 encoding:
Some UNIX operating systems provide a variant of the POSIX locale using
the UTF-8 encoding:
* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
@ -182,7 +189,7 @@ Popularity of the UTF-8 encoding
Python 3 uses UTF-8 by default for Python source files.
On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
system data. For Windows, see the PEP 529: "Change Windows filesystem
system data. For Windows, see the `PEP 529`_: "Change Windows filesystem
encoding to UTF-8".
On Linux, UTF-8 became the de facto standard encoding,
@ -215,15 +222,15 @@ filesystem with filenames encoded to ISO 8859-1.
The Linux kernel and the libc don't decode filenames: a filename is used
as a raw array of bytes. The common solution to support any filename is
to store filenames as bytes and not try to decode them. When displayed to
stdout, mojibake is displayed if the filename and the terminal don't use
the same encoding.
to store filenames as bytes and not try to decode them. When displayed
to stdout, mojibake is displayed if the filename and the terminal don't
use the same encoding.
Python 3 promotes Unicode everywhere including filenames. A solution to
support filenames not decodable from the locale encoding was found: the
``surrogateescape`` error handler (PEP 383), which stores undecodable bytes
``surrogateescape`` error handler (`PEP 383`_), which stores undecodable bytes
as surrogate characters. This error handler is used by default for
operating system data, by ``os.fsdecode()`` and ``os.fsencode()`` for
`operating system data`_, by ``os.fsdecode()`` and ``os.fsencode()`` for
example (except on Windows which uses the ``strict`` error handler).
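As an illustration of how ``surrogateescape`` stores an undecodable byte, here is a small sketch using the codec directly rather than ``os.fsdecode()``. The sample bytes are an assumption chosen for the example:

```python
# Latin-1 encoded bytes: 0xE9 is not a valid UTF-8 sequence here.
raw = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone
# surrogate U+DCNN instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)
```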
@ -239,16 +246,17 @@ Unicode encode error when displaying non-ASCII text. It is especially
useful when the POSIX locale is used, because this locale usually uses
the ASCII encoding.
The problem is that operating system data like filenames are decoded
using the ``surrogateescape`` error handler (PEP 383). Displaying a
The problem is that `operating system data`_ like filenames are decoded
using the ``surrogateescape`` error handler (`PEP 383`_). Displaying a
filename to stdout raises a Unicode encode error if the filename
contains an undecoded byte stored as a surrogate character.
Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the
POSIX locale is used: `issue #19977 <http://bugs.python.org/issue19977>`_. The
idea is to pass through operating system data even if it means mojibake, because
most UNIX applications work like that. Most UNIX applications store filenames
as bytes, usually simply because bytes are first-class citizens in the
POSIX locale is used: `issue #19977
<http://bugs.python.org/issue19977>`_. The idea is to pass through
`operating system data`_ even if it means mojibake, because most UNIX
applications work like that. Most UNIX applications store filenames as
bytes, usually simply because bytes are first-class citizens in the
programming language used, whereas Unicode is badly supported.
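The difference between a ``strict`` and a ``surrogateescape`` output stream can be sketched with an in-memory stream standing in for stdout. The helper ``write_to_stream`` is hypothetical and only exists for this illustration:

```python
import io

# A filename holding one undecodable byte, stored as a surrogate.
filename = b"caf\xe9.txt".decode("utf-8", errors="surrogateescape")


def write_to_stream(text, errors):
    """Write text to an in-memory UTF-8 stream and return the bytes produced."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="utf-8", errors=errors)
    stream.write(text)
    stream.flush()
    data = buffer.getvalue()
    stream.detach()
    return data


# With the strict handler, the surrogate cannot be encoded.
try:
    write_to_stream(filename, "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With surrogateescape, the original byte is passed through unchanged.
passed_through = write_to_stream(filename, "surrogateescape")
print(strict_failed, passed_through)
```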
.. note::
@ -280,6 +288,10 @@ by ``-X utf8=strict`` or ``PYTHONUTF8=strict``.
The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
The ``-X utf8`` option takes priority over the ``PYTHONUTF8`` environment
variable. For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the
UTF-8 mode.
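The priority rule can be modelled with a short sketch. The function ``resolve_utf8_mode`` and its return values are hypothetical and do not reflect how CPython implements the option. Accepting ``"1"`` as a value is also an assumption:

```python
def resolve_utf8_mode(xoption=None, env=None, posix_locale=False):
    """Return "off", "on" or "strict" (hypothetical model, not CPython code).

    xoption: value given to -X utf8 ("" for a bare -X utf8), None if absent.
    env: value of PYTHONUTF8, None if unset.
    posix_locale: True if the LC_CTYPE locale is the POSIX locale.
    """
    # Assumed value mapping; "1" is an assumption made for this sketch.
    values = {"": "on", "1": "on", "0": "off", "strict": "strict"}
    if xoption is not None:
        # The command line option is checked first.
        return values[xoption]
    if env is not None:
        # The environment variable is only consulted without -X utf8.
        return values[env]
    # Without any explicit setting, the POSIX locale enables the mode.
    return "on" if posix_locale else "off"


# PYTHONUTF8=0 python3 -X utf8
print(resolve_utf8_mode(xoption="", env="0"))
```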
Encoding and error handler
--------------------------
@ -291,7 +303,7 @@ sys.stderr:
Function                     Default                 UTF-8 or POSIX locale      UTF-8 Strict
============================ ======================= ========================== ==========================
open()                       locale/strict           **UTF-8/surrogateescape**  **UTF-8**/strict
os.fsdecode(), os.fsencode() locale/surrogateescape  **UTF-8**/surrogateescape  **UTF-8/strict**
os.fsdecode(), os.fsencode() locale/surrogateescape  **UTF-8**/surrogateescape  **UTF-8**/surrogateescape
sys.stdin, sys.stdout        locale/strict           **UTF-8/surrogateescape**  **UTF-8**/strict
sys.stderr                   locale/backslashreplace **UTF-8**/backslashreplace **UTF-8**/backslashreplace
============================ ======================= ========================== ==========================
@ -311,11 +323,22 @@ The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
strict mode for convenience: the idea is that data not encoded to UTF-8
are passed through "Python" without being modified, as raw bytes.
The ``PYTHONIOENCODING`` environment variable takes priority over the
UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1
python3 -X utf8`` uses the Latin1 encoding for stdin, stdout and stderr.
Encodings used by ``open()``, highest priority first:
* *encoding* and *errors* parameters (if set)
* UTF-8 mode
* os.device_encoding(fd)
* locale.getpreferredencoding(False)
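The priority list above can be sketched as a small function. ``pick_open_encoding`` and its ``utf8_mode`` flag are hypothetical names for this illustration, not the code used by ``open()``. The last step is written as ``locale.getpreferredencoding(False)``, which is where that function lives in the standard library. Only the encoding is modelled, not the error handler:

```python
import locale
import os


def pick_open_encoding(encoding=None, utf8_mode=False, fd=None):
    """Hypothetical model of the priority list; not CPython's code."""
    # 1. An explicit encoding parameter always wins.
    if encoding is not None:
        return encoding
    # 2. The UTF-8 mode comes next.
    if utf8_mode:
        return "utf-8"
    # 3. The device encoding, only available when fd is a terminal.
    if fd is not None:
        device = os.device_encoding(fd)
        if device is not None:
            return device
    # 4. Fall back to the locale encoding.
    return locale.getpreferredencoding(False)


print(pick_open_encoding(encoding="latin-1", utf8_mode=True))
print(pick_open_encoding(utf8_mode=True))
```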
Rationale
---------
The UTF-8 mode is disabled by default to keep hard Unicode errors when
encoding or decoding operating system data fails, and to keep
encoding or decoding `operating system data`_ fails, and to keep
backward compatibility. The user is responsible for explicitly enabling the
UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
mode were enabled *by default*.
@ -325,7 +348,7 @@ UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
the user overrides a locale *by mistake* or if a Python program is
started with no locale configured (and so with the POSIX locale).
Most UNIX applications handle operating system data as bytes, so
Most UNIX applications handle `operating system data`_ as bytes, so
``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
limited impact on how these data are handled by the application.
@ -336,7 +359,8 @@ everywhere and that users *expect* UTF-8.
Ignoring ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
Python is more convenient, since they are more commonly misconfigured
*by mistake* (configured to use an encoding different than UTF-8,
whereas the system uses UTF-8), rather than being misconfigured by intent.
whereas the system uses UTF-8), rather than being misconfigured by
intent.
Expected mojibake and surrogate character issues
------------------------------------------------
@ -648,8 +672,9 @@ Don't modify the encoding of the POSIX locale
A first version of the PEP did not change the encoding and error handler
used with the POSIX locale.
The problem is that adding a command line option or setting an environment
variable is not possible in some cases, or at least not convenient.
The problem is that adding a command line option or setting an
environment variable is not possible in some cases, or at least not
convenient.
Moreover, many users simply expect Python 3 to behave like Python 2:
not bother them with encodings and "just work" in all cases. These
@ -660,12 +685,12 @@ complex documents using multiple incompatibles encodings.
Always use UTF-8
----------------
Python already always uses the UTF-8 encoding on Mac OS X, Android and Windows.
Since UTF-8 became the de facto encoding, it makes sense to always use it on all
platforms with any locale.
Python already always uses the UTF-8 encoding on Mac OS X, Android and
Windows. Since UTF-8 became the de facto encoding, it makes sense to
always use it on all platforms with any locale.
The risk is to introduce mojibake if the locale uses a different encoding,
especially for locales other than the POSIX locale.
The risk is to introduce mojibake if the locale uses a different
encoding, especially for locales other than the POSIX locale.
Force UTF-8 for the POSIX locale
@ -674,8 +699,39 @@ Force UTF-8 for the POSIX locale
An alternative to always using UTF-8 in any case is to only use UTF-8 when the
``LC_CTYPE`` locale is the POSIX locale.
PEP 538 "Coercing the legacy C locale to C.UTF-8" by Nick Coghlan
proposes to implement that using the ``C.UTF-8`` locale.
`PEP 538`_ "Coercing the legacy C locale to C.UTF-8" by Nick
Coghlan proposes to implement that using the ``C.UTF-8`` locale.
Use the strict error handler for operating system data
------------------------------------------------------
Using the ``surrogateescape`` error handler for `operating system data`_
creates surprising surrogate characters. No Python codec (except
``utf-7``) accepts surrogates, and so encoding text coming from the
operating system is likely to raise an error. The problem is that
the error comes late, very far from where the data was read.
The ``strict`` error handler can be used instead to decode
(``os.fsdecode()``) and encode (``os.fsencode()``) operating system
data, to raise encoding errors as soon as possible. It helps to find
bugs more quickly.
The main drawback of this strategy is that it doesn't work in practice.
Python 3 is designed on top of Unicode strings. Most functions expect
Unicode and produce Unicode. Even if many operating system functions
have two flavors, bytes and Unicode, the Unicode flavor is used in most
cases. There are good reasons for that: Unicode is more convenient in
Python 3 and using Unicode helps to support the full Unicode Character
Set (UCS) on Windows (even though Python has used UTF-8 there since
Python 3.6, see `PEP 528`_ and `PEP 529`_).
For example, if ``os.fsdecode()`` uses ``utf8/strict``,
``os.listdir(str)`` fails to list filenames of a directory if a single
filename is not decodable from UTF-8. As a consequence,
``shutil.rmtree(str)`` fails to remove a directory. Undecodable
filenames, environment variables, etc. are simply too common to make
this alternative viable.
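The failure mode can be simulated without touching a real directory. The names below and the helper ``decode_names`` are assumptions made for this sketch. It only mimics what a listing function would do when decoding each raw name:

```python
# Raw names as the kernel could return them: one is Latin-1, not UTF-8.
raw_names = [b"notes.txt", b"caf\xe9.txt", b"todo.txt"]


def decode_names(names, errors):
    """Decode every name as UTF-8 with the given error handler."""
    return [name.decode("utf-8", errors=errors) for name in names]


# With strict, a single undecodable name makes the whole listing fail.
try:
    strict_listing = decode_names(raw_names, "strict")
except UnicodeDecodeError:
    strict_listing = None

# With surrogateescape, every name is returned, including the odd one.
escaped_listing = decode_names(raw_names, "surrogateescape")
print(strict_listing, len(escaped_listing))
```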
Links
@ -683,9 +739,14 @@ Links
PEPs:
* PEP 538 "Coercing the legacy C locale to C.UTF-8"
* PEP 529: "Change Windows filesystem encoding to UTF-8"
* PEP 383: "Non-decodable Bytes in System Character Interfaces"
* `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
"Coercing the legacy C locale to C.UTF-8"
* `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
"Change Windows filesystem encoding to UTF-8"
* `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
"Change Windows console encoding to UTF-8"
* `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
"Non-decodable Bytes in System Character Interfaces"
Main Python issues: