Add PEP 540: Add a new UTF-8 mode

2017-01-05 13:46:03 +01:00 · 2017-01-05 13:46:03 +01:00 · 9780f3ab43
parent 043254687a
commit 9780f3ab43
1 changed files with 286 additions and 0 deletions
--- a/pep-0540.txt
+++ b/pep-0540.txt
@@ -0,0 +1,286 @@
 PEP: 540
 Title: Add a new UTF-8 mode
 Version: $Revision$
 Last-Modified: $Date$
 Author: Victor Stinner <victor.stinner@gmail.com>
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 5-January-2016
 Python-Version: 3.7
 Abstract
 ========
 Add a new UTF-8 mode, an opt-in option to use UTF-8 for operating system
 data instead of the locale encoding. Add a ``-X utf8`` command line option
 and a ``PYTHONUTF8`` environment variable.
 Context
 =======
 Locale and operating system data
 --------------------------------
 Python uses the ``LC_CTYPE`` locale to decide how to encode and decode
 data from/to the operating system:
 * file content
 * command line arguments: ``sys.argv``
 * standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
 * environment variables: ``os.environ``
 * filenames: ``os.listdir(str)`` for example
 * pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
 * error messages
 * name of a timezone
 * user name, terminal name: ``os``, ``grp`` and ``pwd`` modules
 * host name, UNIX socket path: see the ``socket`` module
 * etc.
 At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user's
 ``LC_CTYPE`` locale and then stores the locale encoding, available as
 ``sys.getfilesystemencoding()``. For the whole lifetime of a Python process,
 the same encoding and error handler are used to encode and decode data
 from/to the operating system.
 .. note::
   In some corner cases, the *current* ``LC_CTYPE`` locale must be used
   instead of ``sys.getfilesystemencoding()``. For example, the ``time``
   module uses the *current* ``LC_CTYPE`` locale to decode timezone
   names.
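The encoding and error handler chosen at startup can be inspected from a
running interpreter. The following is a minimal sketch, assuming Python 3.6
or later (for ``sys.getfilesystemencodeerrors()``); the helper name
``os_data_codecs`` is hypothetical and only used for illustration.

```python
import locale
import sys


def os_data_codecs():
    """Report the codecs this interpreter uses for operating system data."""
    return {
        # Encoding and error handler used by os.fsencode()/os.fsdecode().
        "filesystem encoding": sys.getfilesystemencoding(),
        "filesystem error handler": sys.getfilesystemencodeerrors(),
        # Encoding derived from the LC_CTYPE locale (without calling setlocale).
        "locale preferred encoding": locale.getpreferredencoding(False),
        # Standard output may have been replaced, so read attributes defensively.
        "stdout encoding": getattr(sys.stdout, "encoding", None),
        "stdout error handler": getattr(sys.stdout, "errors", None),
    }


for name, value in os_data_codecs().items():
    print(f"{name}: {value}")
```

The values printed depend on the platform and on the locale the process was
started with, so no particular output should be expected.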
 The POSIX locale and its encoding
 ---------------------------------
 The following environment variables are used to configure the locale, in
 this preference order:
 * ``LC_ALL``, most important variable
 * ``LC_CTYPE``
 * ``LANG``
 The POSIX locale, also known as "the C locale", is used:
 * if the first set variable is set to ``"C"``
 * if all these variables are unset, for example when a program is
  started in an empty environment.
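The precedence rule above can be modelled in a few lines. This is only a
sketch of the lookup order: the real selection is done by ``setlocale()`` in
the C library, the function names are hypothetical, and treating an empty
value as unset is an assumption based on the POSIX specification.

```python
def effective_ctype_locale(environ):
    """Return the locale name that would be selected for LC_CTYPE."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:  # assumption: an empty value counts as unset
            return value
    # No variable set: the POSIX locale applies.
    return "C"


def uses_posix_locale(environ):
    """Tell whether the environment selects the POSIX ("C") locale."""
    return effective_ctype_locale(environ) in ("C", "POSIX")


print(uses_posix_locale({}))                                      # empty environment
print(uses_posix_locale({"LC_ALL": "C", "LANG": "fr_FR.UTF-8"}))  # LC_ALL wins
print(uses_posix_locale({"LANG": "fr_FR.UTF-8"}))                 # LANG is used
```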
 The encoding of the POSIX locale must be ASCII or a superset of ASCII.
 On Linux, the POSIX locale uses the ASCII encoding.
 On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
 the ASCII encoding, whereas ``mbstowcs()`` and ``wcstombs()`` functions
 use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
 ``os.fsencode()`` and ``os.fsdecode()`` use the
 ``locale.getpreferredencoding()`` codec. For example, if command line
 arguments are decoded by ``mbstowcs()`` and encoded back by
 ``os.fsencode()``, a ``UnicodeEncodeError`` exception is raised instead
 of retrieving the original byte string.
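The mismatch described above can be reproduced with Python's own codecs. In
this sketch, decoding with ``latin-1`` stands in for ``mbstowcs()`` on the
affected platforms, and encoding with ``ascii`` and ``surrogateescape``
stands in for ``os.fsencode()`` when the announced encoding is ASCII; both
substitutions are assumptions made for illustration.

```python
raw_arg = b"caf\xe9"  # a command line argument containing a non-ASCII byte

# Stand-in for mbstowcs(): the byte 0xE9 is decoded as U+00E9.
decoded = raw_arg.decode("latin-1")


def encode_like_fsencode(text):
    """Stand-in for os.fsencode() when the locale encoding is ASCII."""
    return text.encode("ascii", "surrogateescape")


try:
    encode_like_fsencode(decoded)
    round_trip_failed = False
except UnicodeEncodeError:
    # U+00E9 is neither ASCII nor an escaped surrogate, so encoding fails
    # and the original bytes cannot be recovered.
    round_trip_failed = True

print(round_trip_failed)
```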
 To fix this issue, since Python 3.4, Python checks whether ``mbstowcs()``
 really uses the ASCII encoding when the ``LC_CTYPE`` locale is the
 POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
 alias of ASCII). If not (the effective encoding is not ASCII), Python
 uses its own ASCII codec instead of using ``mbstowcs()`` and
 ``wcstombs()`` functions for operating system data.
 See the `POSIX locale (2016 Edition)
 <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.
 C.UTF-8 and C.utf8 locales
 --------------------------
 Some operating systems provide a variant of the POSIX locale using the
 UTF-8 encoding:
 * Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
 * Debian (eglibc 2.13-1, 2011): ``"C.UTF-8"``
 * HP-UX: ``"C.utf8"``
 It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
 proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.
 Popularity of the UTF-8 encoding
 --------------------------------
 Python 3 uses UTF-8 by default for Python source files.
 On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
 system data instead of the locale encoding. For Windows, see the `PEP
 529: Change Windows filesystem encoding to UTF-8
 <https://www.python.org/dev/peps/pep-0529/>`_.
 On Linux, UTF-8 became the de facto standard encoding,
 replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
 using different encodings for filenames and standard streams is likely
 to create mojibake, so UTF-8 is now used *everywhere*.
 The UTF-8 encoding is the default encoding of the XML and JSON file formats.
 In January 2017, UTF-8 was used in `more than 88% of web pages
 <https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
 JavaScript, CSS, etc.).
 See `utf8everywhere.org <http://utf8everywhere.org/>`_ for more general
 information on the UTF-8 codec.
 .. note::
   Some applications and operating systems (especially Windows) use Byte
   Order Marks (BOM) to indicate the Unicode encoding used: UTF-7,
   UTF-8, UTF-16-LE, etc. BOMs are not well supported and rarely used in
   Python.
 Old data stored in different encodings and surrogateescape
 ----------------------------------------------------------
 Even if UTF-8 became the de facto standard, there are still systems in
 the wild which don't use UTF-8, and a lot of data is stored in
 different encodings: for example, an old USB key using the ext3
 filesystem with filenames encoded to ISO 8859-1.
 The Linux kernel and the libc don't decode filenames: a filename is used
 as a raw array of bytes. The common solution to support any filename is
 to store filenames as bytes and not try to decode them. When displayed to
 stdout, mojibake is displayed if the filename and the terminal don't use
 the same encoding.
 Python 3 promotes Unicode everywhere including filenames. A solution to
 support filenames not decodable from the locale encoding was found: the
 ``surrogateescape`` error handler (`PEP 383
 <https://www.python.org/dev/peps/pep-0383/>`_), which stores undecodable
 bytes as surrogate characters. This error handler is used by default for
 operating system data, by ``os.fsdecode()`` and ``os.fsencode()`` for
 example (except on Windows which uses the ``strict`` error handler).
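A short sketch of what ``surrogateescape`` does, using an example filename
(``caf\xe9.txt`` encoded to ISO 8859-1) that is an assumption chosen for
illustration. The undecodable byte ``0xE9`` is mapped to the lone surrogate
U+DCE9 when decoding, and mapped back to the same byte when encoding.

```python
raw_name = b"caf\xe9.txt"  # ISO 8859-1 bytes, not valid UTF-8

# Decoding never fails: the invalid byte becomes a surrogate character.
name = raw_name.decode("utf-8", "surrogateescape")
print(ascii(name))

# Encoding with the same error handler restores the original bytes.
restored = name.encode("utf-8", "surrogateescape")
print(restored == raw_name)
```

The round trip is lossless, which is why the handler suits filenames and
other operating system data whose encoding cannot be known in advance.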
 Standard streams
 ----------------
 Python uses the locale encoding for standard streams: stdin, stdout and
 stderr. The ``strict`` error handler is used by stdin and stdout to
 prevent mojibake.
 The ``backslashreplace`` error handler is used by stderr to avoid
 Unicode encode error when displaying non-ASCII text. It is especially
 useful when the POSIX locale is used, because this locale usually uses
 the ASCII encoding.
 The problem is that operating system data like filenames are decoded
 using the ``surrogateescape`` error handler (PEP 383). Displaying a
 filename to stdout raises a Unicode encode error if the filename
 contains an undecodable byte stored as a surrogate character.
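The effect of each error handler on such a filename can be observed without
touching the real standard streams. This sketch writes to an in-memory
``io.TextIOWrapper`` that plays the role of stdout or stderr; the helper
name ``write_with`` and the sample filename are hypothetical.

```python
import io

# A filename decoded from the OS with one undecodable byte (0xE9).
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")


def write_with(errors):
    """Write the filename to a UTF-8 text stream using the given handler."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="utf-8", errors=errors)
    stream.write(name)
    stream.flush()
    stream.detach()  # keep the buffer open after the wrapper goes away
    return buffer.getvalue()


try:
    write_with("strict")            # like stdout outside the POSIX locale
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

passthrough = write_with("surrogateescape")   # original bytes are written
escaped = write_with("backslashreplace")      # like stderr: readable escape

print(strict_failed, passthrough, escaped)
```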
 Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the
 POSIX locale is used: `issue #19977 <http://bugs.python.org/issue19977>`_. The
 idea is to pass through operating system data even if it means mojibake, because
 most UNIX applications work like that. Most UNIX applications store filenames
 as bytes, usually simply because bytes are first-class citizens in the
 programming language used, whereas Unicode is badly supported.
 .. note::
   The encoding and/or the error handler of standard streams can be
   overridden with the ``PYTHONIOENCODING`` environment variable.
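As an illustration of the ``PYTHONIOENCODING`` override, the sketch below
starts a child interpreter with the variable set and asks it which codec its
stdout uses. The helper name ``child_stdout_codec`` and the chosen value
``utf-8:backslashreplace`` are assumptions for the example.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_codec(io_encoding="utf-8:backslashreplace"):
    """Run a child Python and return its stdout (encoding, error handler)."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    code = "import sys; print(sys.stdout.encoding, sys.stdout.errors)"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    encoding, errors = result.stdout.decode("ascii").split()
    # Normalize the encoding name so aliases compare equal.
    return codecs.lookup(encoding).name, errors


print(child_stdout_codec())
```

The value uses the ``encoding:errors`` syntax; either part may be omitted to
override only the other one.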
 Proposal
 ========
 Add a new UTF-8 mode, an opt-in option to use UTF-8 for operating system data
 instead of the locale encoding:
 * Add ``-X utf8`` command line option
 * Add ``PYTHONUTF8=1`` environment variable
 Also add a strict UTF-8 mode, enabled by ``-X utf8=strict`` or
 ``PYTHONUTF8=strict``.
 The UTF-8 mode changes the default encoding and error handler used by
 ``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
 ``sys.stdout`` and ``sys.stderr``:
 ============================  =======================  =======================  ======================  ======================
 Function                      Default, other locales   Default, POSIX locale    UTF-8                   UTF-8 Strict
 ============================  =======================  =======================  ======================  ======================
 open()                        locale/strict            locale/strict            UTF-8/surrogateescape   UTF-8/strict
 os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape   UTF-8/surrogateescape   UTF-8/strict
 sys.stdin                     locale/strict            locale/surrogateescape   UTF-8/surrogateescape   UTF-8/strict
 sys.stdout                    locale/strict            locale/surrogateescape   UTF-8/surrogateescape   UTF-8/strict
 sys.stderr                    locale/backslashreplace  locale/backslashreplace  UTF-8/backslashreplace  UTF-8/backslashreplace
 ============================  =======================  =======================  ======================  ======================
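The table above can also be read as a lookup from (function, mode) to
(encoding, error handler). The sketch below transcribes it as plain data;
the mode keys (``"other"``, ``"posix"``, ``"utf8"``, ``"utf8-strict"``) and
the helper name ``default_codec`` are hypothetical labels, not part of the
proposal.

```python
# Columns of the table, in order.
MODES = ("other", "posix", "utf8", "utf8-strict")

# One row per function: "encoding/errors" for each mode.
DEFAULTS = {
    "open()": ("locale/strict", "locale/strict",
               "UTF-8/surrogateescape", "UTF-8/strict"),
    "os.fsdecode(), os.fsencode()": ("locale/surrogateescape",
                                     "locale/surrogateescape",
                                     "UTF-8/surrogateescape", "UTF-8/strict"),
    "sys.stdin": ("locale/strict", "locale/surrogateescape",
                  "UTF-8/surrogateescape", "UTF-8/strict"),
    "sys.stdout": ("locale/strict", "locale/surrogateescape",
                   "UTF-8/surrogateescape", "UTF-8/strict"),
    "sys.stderr": ("locale/backslashreplace", "locale/backslashreplace",
                   "UTF-8/backslashreplace", "UTF-8/backslashreplace"),
}


def default_codec(function, mode):
    """Return the (encoding, error handler) pair for a table cell."""
    cell = DEFAULTS[function][MODES.index(mode)]
    encoding, errors = cell.split("/")
    return encoding, errors


print(default_codec("open()", "utf8"))
print(default_codec("sys.stderr", "utf8-strict"))
```

Read this way, ``sys.stderr`` keeps ``backslashreplace`` in every mode, and
only the strict mode replaces ``surrogateescape`` with ``strict``.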
 The UTF-8 mode is disabled by default to keep hard Unicode errors when
 encoding or decoding operating system data fails, and to keep
 backward compatibility. The user is responsible for explicitly enabling the
 UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
 mode were enabled *by default*.
 The UTF-8 mode should be used on systems known to be configured with
 UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
 the user overrides a locale *by mistake* or if a Python program is
 started with no locale configured (and so with the POSIX locale).
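To illustrate the kind of error the UTF-8 mode is meant to prevent, the
sketch below decodes UTF-8 data the way ``open()`` would under the POSIX
locale on Linux (ASCII with ``strict``) and the way the proposed UTF-8 mode
would (UTF-8 with ``surrogateescape``). The sample data is an assumption
chosen for the example.

```python
data = "Noël.txt\n".encode("utf-8")  # UTF-8 data written by another program

# POSIX locale on Linux: ASCII with the strict error handler.
try:
    data.decode("ascii", "strict")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Proposed UTF-8 mode: UTF-8 with surrogateescape.
text = data.decode("utf-8", "surrogateescape")

# Even invalid UTF-8 is accepted; the stray byte is kept as a surrogate.
damaged = b"\xff" + data
lenient = damaged.decode("utf-8", "surrogateescape")

print(ascii_failed, ascii(text), ascii(lenient))
```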
 Most UNIX applications handle operating system data as bytes, so
 ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
 limited impact on how these data are handled by the application.
 The Python UTF-8 mode should help to make Python more interoperable with
 the other UNIX applications in the system, assuming that *UTF-8* is used
 everywhere and that users *expect* UTF-8.
 Ignoring the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
 Python is more convenient, since they are more commonly misconfigured
 *by mistake* (configured to use an encoding other than UTF-8,
 whereas the system uses UTF-8) than misconfigured on purpose.
 Backward Compatibility
 ======================
 Since the UTF-8 mode is disabled by default, it has no impact on
 backward compatibility. The new UTF-8 mode must be enabled explicitly.
 Alternatives
 ============
 Always use UTF-8
 ----------------
 Python already always uses the UTF-8 encoding on Mac OS X, Android and Windows.
 Since UTF-8 became the de facto encoding, it makes sense to always use it on all
 platforms with any locale.
 The risk is to introduce mojibake if the locale uses a different encoding,
 especially for locales other than the POSIX locale.
 Force UTF-8 for the POSIX locale
 --------------------------------
 An alternative to always using UTF-8 is to only use UTF-8 when the
 ``LC_CTYPE`` locale is the POSIX locale.
 The `PEP 538: Coercing the legacy C locale to C.UTF-8
 <https://www.python.org/dev/peps/pep-0538/>`_ by Nick Coghlan proposes to
 implement that using the ``C.UTF-8`` locale.
 Related Work
 ============
 Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
 variable to force UTF-8: see `perlrun
 <http://perldoc.perl.org/perlrun.html>`_. It is possible to configure
 UTF-8 per standard stream, on input and output streams, etc.
 Copyright
 =========
 This document has been placed in the public domain.