Add PEP 540: Add a new UTF-8 mode
parent 043254687a
commit 9780f3ab43
@@ -0,0 +1,286 @@
PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode, an opt-in option to use UTF-8 for operating system
data instead of the locale encoding. Add the ``-X utf8`` command line option
and the ``PYTHONUTF8`` environment variable.


Context
=======

Locale and operating system data
--------------------------------

Python uses the ``LC_CTYPE`` locale to decide how to encode and decode
data from/to the operating system:

* file content
* command line arguments: ``sys.argv``
* standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
* environment variables: ``os.environ``
* filenames: ``os.listdir(str)`` for example
* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
* error messages
* name of a timezone
* user name, terminal name: ``os``, ``grp`` and ``pwd`` modules
* host name, UNIX socket path: see the ``socket`` module
* etc.

At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user
``LC_CTYPE`` locale and then stores the locale encoding, available as
``sys.getfilesystemencoding()``. For the whole lifetime of a Python process,
the same encoding and error handler are used to encode and decode data
from/to the operating system.
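
The values chosen at startup can be inspected from a running interpreter. The
following is a minimal sketch, not part of the proposal: the helper name
``describe_os_encoding`` is hypothetical, and ``sys.getfilesystemencodeerrors()``
assumes Python 3.6 or newer.

```python
import locale
import sys


def describe_os_encoding():
    """Return the encoding and error handler used for operating system data.

    Hypothetical helper written for illustration only.
    """
    return {
        # Encoding used by os.fsencode() and os.fsdecode()
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Error handler paired with that encoding (Python 3.6+)
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        # Encoding announced by the LC_CTYPE locale
        "locale_encoding": locale.getpreferredencoding(False),
    }


info = describe_os_encoding()
for key, value in info.items():
    print(f"{key}: {value}")
```

The printed values depend on the platform and on the locale of the process, so
no particular output should be expected.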

.. note::
   In some corner cases, the *current* ``LC_CTYPE`` locale must be used
   instead of ``sys.getfilesystemencoding()``. For example, the ``time``
   module uses the *current* ``LC_CTYPE`` locale to decode timezone
   names.


The POSIX locale and its encoding
---------------------------------

The following environment variables are used to configure the locale, in
this preference order:

* ``LC_ALL``, most important variable
* ``LC_CTYPE``
* ``LANG``

The POSIX locale, also known as "the C locale", is used:

* if the first set variable is set to ``"C"``
* if all these variables are unset, for example when a program is
  started in an empty environment.
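
The preference order above can be sketched in a few lines of Python. Both
helper names are hypothetical. Treating an empty value as unset and accepting
``"POSIX"`` as a synonym of ``"C"`` are assumptions of this sketch, not
statements made by this document.

```python
def effective_ctype_locale(environ):
    """Return the locale name selected for LC_CTYPE by the given environment.

    Hypothetical helper: checks LC_ALL, then LC_CTYPE, then LANG.
    Assumption: a variable set to the empty string counts as unset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:
            return value
    # Nothing is set: the POSIX locale is used.
    return "C"


def uses_posix_locale(environ):
    """Return True if the environment selects the POSIX ("C") locale."""
    # Assumption: "POSIX" is accepted as another name of the "C" locale.
    return effective_ctype_locale(environ) in ("C", "POSIX")


print(uses_posix_locale({}))                                        # True
print(uses_posix_locale({"LANG": "fr_FR.UTF-8"}))                   # False
print(uses_posix_locale({"LC_ALL": "C", "LANG": "fr_FR.UTF-8"}))    # True
```

The last call shows why ``LC_ALL`` is the most important variable: it wins
over a perfectly valid ``LANG``.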

The encoding of the POSIX locale must be ASCII or a superset of ASCII.

On Linux, the POSIX locale uses the ASCII encoding.

On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
the ASCII encoding, whereas the ``mbstowcs()`` and ``wcstombs()`` functions
use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
``os.fsencode()`` and ``os.fsdecode()`` use the
``locale.getpreferredencoding()`` codec. For example, if command line
arguments are decoded by ``mbstowcs()`` and encoded back by
``os.fsencode()``, a ``UnicodeEncodeError`` exception is raised instead
of retrieving the original byte string.

To fix this issue, since Python 3.4 Python checks whether ``mbstowcs()``
really uses the ASCII encoding when the ``LC_CTYPE`` locale is the
POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
alias of ASCII). If it does not (the effective encoding is not ASCII), Python
uses its own ASCII codec instead of the ``mbstowcs()`` and
``wcstombs()`` functions for operating system data.
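
The mismatch can be reproduced in pure Python without the affected platforms.
This is a simulation under stated assumptions: the ``latin-1`` codec stands in
for what ``mbstowcs()`` does in practice, and the ``ascii`` codec with the
``surrogateescape`` error handler stands in for both ``os.fsencode()`` and
Python's own ASCII codec.

```python
# A non-ASCII byte, as it could appear in a command line argument.
raw = b"caf\xe9"

# Stand-in for mbstowcs() on the affected platforms: decode as ISO 8859-1.
decoded = raw.decode("latin-1")

# Stand-in for os.fsencode() when the announced encoding is ASCII:
# U+00E9 is neither ASCII nor an escaped byte, so encoding fails.
try:
    decoded.encode("ascii", "surrogateescape")
    roundtrip_ok = True
except UnicodeEncodeError:
    roundtrip_ok = False

# Stand-in for the fix: the same ASCII codec is used in both directions,
# so the original bytes are recovered.
fixed = raw.decode("ascii", "surrogateescape")
restored = fixed.encode("ascii", "surrogateescape")

print(roundtrip_ok)       # False
print(restored == raw)    # True
```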

See the `POSIX locale (2016 Edition)
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.


C.UTF-8 and C.utf8 locales
--------------------------

Some operating systems provide a variant of the POSIX locale using the
UTF-8 encoding:

* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
* Debian (eglibc 2.13-1, 2011): ``"C.UTF-8"``
* HP-UX: ``"C.utf8"``

It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.


Popularity of the UTF-8 encoding
--------------------------------

Python 3 uses UTF-8 by default for Python source files.

On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
system data instead of the locale encoding. For Windows, see `PEP
529: Change Windows filesystem encoding to UTF-8
<https://www.python.org/dev/peps/pep-0529/>`_.

On Linux, UTF-8 has become the de facto standard encoding,
replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
using different encodings for filenames and standard streams is likely
to create mojibake, so UTF-8 is now used *everywhere*.

The UTF-8 encoding is the default encoding of the XML and JSON file formats.
In January 2017, UTF-8 was used in `more than 88% of web pages
<https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
JavaScript, CSS, etc.).

See `utf8everywhere.org <http://utf8everywhere.org/>`_ for more general
information on the UTF-8 codec.

.. note::
   Some applications and operating systems (especially Windows) use Byte
   Order Marks (BOM) to indicate the Unicode encoding used: UTF-7,
   UTF-8, UTF-16-LE, etc. BOMs are not well supported and rarely used in
   Python.


Old data stored in different encodings and surrogateescape
----------------------------------------------------------

Even if UTF-8 became the de facto standard, there are still systems in
the wild which don't use UTF-8, and there is a lot of data stored in
different encodings: for example, an old USB key using the ext3
filesystem with filenames encoded to ISO 8859-1.

The Linux kernel and the libc don't decode filenames: a filename is used
as a raw array of bytes. The common solution to support any filename is
to store filenames as bytes and not try to decode them. When a filename is
displayed to stdout, mojibake appears if the filename and the terminal don't
use the same encoding.
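
Mojibake is easy to reproduce with the codecs alone. The sketch below is only
an illustration: the filenames are made up, and decoding with
``errors="replace"`` is an assumption about how a terminal might render
invalid bytes.

```python
# A UTF-8 encoded filename read by a program that assumes ISO 8859-1.
utf8_bytes = "caf\u00e9.txt".encode("utf-8")        # b'caf\xc3\xa9.txt'
shown_as_latin1 = utf8_bytes.decode("latin-1")      # 'caf\xc3\xa9.txt'

# An ISO 8859-1 encoded filename shown by a UTF-8 terminal. The byte 0xE9 is
# not valid UTF-8 here, so it is shown as U+FFFD (replacement character).
latin1_bytes = "caf\u00e9.txt".encode("latin-1")    # b'caf\xe9.txt'
shown_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(ascii(shown_as_latin1))
print(ascii(shown_as_utf8))
```

In the first case the single character U+00E9 is displayed as the two
characters U+00C3 and U+00A9. In the second case the information is lost
for the reader of the terminal, although the bytes on disk are unchanged.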

Python 3 promotes Unicode everywhere including filenames. A solution to
support filenames not decodable from the locale encoding was found: the
``surrogateescape`` error handler (`PEP 383
<https://www.python.org/dev/peps/pep-0383/>`_), which stores undecodable bytes
as surrogate characters. This error handler is used by default for
operating system data, by ``os.fsdecode()`` and ``os.fsencode()`` for
example (except on Windows which uses the ``strict`` error handler).
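
The round trip provided by ``surrogateescape`` can be checked directly with
``bytes.decode()`` and ``str.encode()``. The sketch calls the UTF-8 codec
explicitly rather than ``os.fsdecode()`` so that the result does not depend on
the locale of the process; the filename is made up.

```python
# An ISO 8859-1 encoded filename: the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# The undecodable byte 0xE9 is stored as the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))                  # 'caf\udce9.txt'
print(hex(ord(name[3])))            # 0xdce9

# Encoding with the same error handler gives back the original bytes.
restored = name.encode("utf-8", "surrogateescape")
print(restored == raw)              # True
```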


Standard streams
----------------

Python uses the locale encoding for standard streams: stdin, stdout and
stderr. The ``strict`` error handler is used by stdin and stdout to
prevent mojibake.

The ``backslashreplace`` error handler is used by stderr to avoid
Unicode encode errors when displaying non-ASCII text. It is especially
useful when the POSIX locale is used, because this locale usually uses
the ASCII encoding.

The problem is that operating system data like filenames are decoded
using the ``surrogateescape`` error handler (PEP 383). Displaying a
filename to stdout raises a Unicode encode error if the filename
contains an undecoded byte stored as a surrogate character.
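
The effect of each error handler on such a filename can be observed with an
in-memory stream. This is a sketch under assumptions: ``io.TextIOWrapper`` over
``io.BytesIO`` stands in for a standard stream, UTF-8 stands in for the locale
encoding, and ``write_to_stream`` is a hypothetical helper.

```python
import io


def write_to_stream(text, errors):
    """Write text to an in-memory UTF-8 stream using the given error handler.

    Hypothetical helper: returns the bytes written, or None if encoding failed.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="utf-8", errors=errors)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        return None
    return buffer.getvalue()


# A filename holding an undecodable byte stored as a surrogate character.
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

print(write_to_stream(name, "strict"))            # None: encode error
print(write_to_stream(name, "backslashreplace"))  # b'caf\\udce9.txt'
print(write_to_stream(name, "surrogateescape"))   # b'caf\xe9.txt'
```

Only ``surrogateescape`` passes the original byte through unchanged;
``backslashreplace`` never fails but writes an escape sequence instead.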

Since Python 3.5, ``surrogateescape`` is used for stdin and stdout if the
POSIX locale is used: `issue #19977 <http://bugs.python.org/issue19977>`_. The
idea is to pass operating system data through even if it means mojibake, because
most UNIX applications work like that. Most UNIX applications store filenames
as bytes, usually simply because bytes are first-class citizens in the
programming language used, whereas Unicode is badly supported.

.. note::
   The encoding and/or the error handler of standard streams can be
   overridden with the ``PYTHONIOENCODING`` environment variable.
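
The override can be observed by starting a child interpreter with
``PYTHONIOENCODING`` set. This is a minimal sketch: the
``ascii:backslashreplace`` value is only an example, and the exact spelling of
the reported encoding name may vary between Python versions.

```python
import os
import subprocess
import sys

# Copy the current environment and override the standard stream settings,
# using the "encoding:errors" syntax of PYTHONIOENCODING.
env = dict(os.environ, PYTHONIOENCODING="ascii:backslashreplace")

# The child reports the encoding and error handler of its own stdout.
child = "import sys; print(sys.stdout.encoding, sys.stdout.errors)"
result = subprocess.run(
    [sys.executable, "-c", child],
    env=env,
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)

reported_encoding, reported_errors = result.stdout.split()
print(reported_encoding, reported_errors)
```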


Proposal
========

Add a new UTF-8 mode, an opt-in option to use UTF-8 for operating system data
instead of the locale encoding:

* Add ``-X utf8`` command line option
* Add ``PYTHONUTF8=1`` environment variable

Also add a strict UTF-8 mode, enabled by ``-X utf8=strict`` or
``PYTHONUTF8=strict``.
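
The following sketch shows how the ``-X utf8`` option could be checked from a
parent process. It assumes an interpreter that already implements the option
(CPython 3.7 or newer) and relies on ``sys.flags.utf8_mode``, an attribute of
released interpreters that this draft does not specify. The ``strict`` variant
is not exercised here.

```python
import subprocess
import sys

# The child reports whether the UTF-8 mode is on and which encoding is used
# for operating system data.
child = "import sys; print(sys.flags.utf8_mode, sys.getfilesystemencoding())"

# Assumption: sys.executable is an interpreter that accepts "-X utf8".
result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", child],
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)

utf8_flag, fs_encoding = result.stdout.split()
print(utf8_flag, fs_encoding)
```

Setting ``PYTHONUTF8=1`` in the environment of the child process is the
equivalent way to enable the mode without changing the command line.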

The UTF-8 mode changes the default encoding and error handler used by
``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
``sys.stdout`` and ``sys.stderr``:

============================  =======================  =======================  ======================  ======================
Function                      Default, other locales   Default, POSIX locale    UTF-8                   UTF-8 Strict
============================  =======================  =======================  ======================  ======================
open()                        locale/strict            locale/strict            UTF-8/surrogateescape   UTF-8/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape   UTF-8/surrogateescape   UTF-8/strict
sys.stdin                     locale/strict            locale/surrogateescape   UTF-8/surrogateescape   UTF-8/strict
sys.stdout                    locale/strict            locale/surrogateescape   UTF-8/surrogateescape   UTF-8/strict
sys.stderr                    locale/backslashreplace  locale/backslashreplace  UTF-8/backslashreplace  UTF-8/backslashreplace
============================  =======================  =======================  ======================  ======================
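
The difference between the "UTF-8" and "UTF-8 Strict" columns for ``open()``
can be emulated today by passing the encoding and error handler explicitly.
This is only a sketch of the behaviour listed in the table, not of the way the
mode itself would be implemented; the file name and its content are made up.

```python
import os
import tempfile

# ISO 8859-1 data: the byte 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "legacy.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # "UTF-8" column: the undecodable byte is kept as a surrogate character.
    with open(path, encoding="utf-8", errors="surrogateescape") as f:
        lenient = f.read()

    # "UTF-8 Strict" column: the same file cannot be decoded.
    try:
        with open(path, encoding="utf-8", errors="strict") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

print(ascii(lenient))     # 'caf\udce9\n'
print(strict_failed)      # True
```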

The UTF-8 mode is disabled by default to keep hard Unicode errors when
encoding or decoding operating system data fails, and to preserve
backward compatibility. The user is responsible for explicitly enabling the
UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
mode were enabled *by default*.

The UTF-8 mode should be used on systems known to be configured with
UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
the user overrides a locale *by mistake* or if a Python program is
started with no locale configured (and so with the POSIX locale).

Most UNIX applications handle operating system data as bytes, so the
``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
limited impact on how this data is handled by the application.

The Python UTF-8 mode should help to make Python more interoperable with
the other UNIX applications in the system, assuming that *UTF-8* is used
everywhere and that users *expect* UTF-8.

Ignoring the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
Python is more convenient, since they are more commonly misconfigured
*by mistake* (configured to use an encoding different from UTF-8,
whereas the system uses UTF-8) than set that way on purpose.


Backward Compatibility
======================

Since the UTF-8 mode is disabled by default, it has no impact on
backward compatibility. The new UTF-8 mode must be enabled explicitly.


Alternatives
============

Always use UTF-8
----------------

Python already always uses the UTF-8 encoding on Mac OS X, Android and Windows.
Since UTF-8 became the de facto encoding, it makes sense to always use it on all
platforms with any locale.

The risk is to introduce mojibake if the locale uses a different encoding,
especially for locales other than the POSIX locale.


Force UTF-8 for the POSIX locale
--------------------------------

An alternative to always using UTF-8 in any case is to only use UTF-8 when the
``LC_CTYPE`` locale is the POSIX locale.

`PEP 538: Coercing the legacy C locale to C.UTF-8
<https://www.python.org/dev/peps/pep-0538/>`_ by Nick Coghlan proposes to
implement that using the ``C.UTF-8`` locale.


Related Work
============

Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
variable to force UTF-8: see `perlrun
<http://perldoc.perl.org/perlrun.html>`_. It is possible to configure
UTF-8 per standard stream, on input and output streams, etc.


Copyright
=========

This document has been placed in the public domain.