PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode, disabled by default, to ignore the locale and
force the usage of the UTF-8 encoding.

Basically, the UTF-8 mode behaves like Python 2: it "just works" and
doesn't bother users with encodings, but it can produce mojibake. The
UTF-8 mode can be configured as strict to prevent mojibake.

A new ``-X utf8`` command line option and a new ``PYTHONUTF8``
environment variable are added to control the UTF-8 mode. The POSIX
locale enables the UTF-8 mode.


Context
=======

Locale and operating system data
--------------------------------

Python uses the ``LC_CTYPE`` locale to decide how to encode and decode
data from/to the operating system:

* file content
* command line arguments: ``sys.argv``
* standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
* environment variables: ``os.environ``
* filenames: ``os.listdir(str)`` for example
* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
* error messages: ``os.strerror(code)`` for example
* user and terminal names: ``os``, ``grp`` and ``pwd`` modules
* host name, UNIX socket path: see the ``socket`` module
* etc.

At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user
``LC_CTYPE`` locale and then stores the locale encoding, which is
returned by ``sys.getfilesystemencoding()``. For the whole lifetime of a
Python process, the same encoding and error handler are used to encode
and decode data from/to the operating system.
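
As a quick illustration, the values stored at startup can be inspected
from a running interpreter. This is only a sketch: the helper name
``os_data_codec`` is hypothetical, and
``sys.getfilesystemencodeerrors()`` is assumed to be available
(Python 3.6 or newer)::

    import locale
    import sys

    def os_data_codec():
        """Return the (encoding, error handler) pair used for OS data."""
        return sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()

    encoding, errors = os_data_codec()
    print("filesystem encoding:", encoding)
    print("filesystem error handler:", errors)
    # The encoding of the current LC_CTYPE locale, queried without
    # calling setlocale() again.
    print("locale encoding:", locale.getpreferredencoding(False))

The printed values depend on the platform and on the locale in effect
when the interpreter was started.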

.. note::
   In some corner cases, the *current* ``LC_CTYPE`` locale must be used
   instead of ``sys.getfilesystemencoding()``. For example, the ``time``
   module uses the *current* ``LC_CTYPE`` locale to decode timezone
   names.


The POSIX locale and its encoding
---------------------------------

The following environment variables are used to configure the locale, in
this preference order:

* ``LC_ALL``, most important variable
* ``LC_CTYPE``
* ``LANG``

The POSIX locale, also known as "the C locale", is used:

* if the first set variable is set to ``"C"``
* if all these variables are unset, for example when a program is
  started in an empty environment.
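
The lookup order above can be sketched as follows. The function names
are hypothetical, and two details are assumptions that go beyond the
text above: an empty value is treated as unset, and the locale name
``"POSIX"`` is treated as a synonym of ``"C"``::

    def effective_ctype_setting(environ):
        """Return (variable name, value) of the first set locale variable."""
        for name in ("LC_ALL", "LC_CTYPE", "LANG"):
            value = environ.get(name)
            if value:  # assumption: an empty value counts as unset
                return name, value
        return None, None

    def is_posix_locale(environ):
        """Tell whether the environment selects the POSIX locale."""
        _, value = effective_ctype_setting(environ)
        # No variable set at all, or the first set variable is "C".
        return value is None or value in ("C", "POSIX")

    print(is_posix_locale({}))
    print(is_posix_locale({"LANG": "en_US.UTF-8", "LC_ALL": "C"}))
    print(is_posix_locale({"LANG": "C", "LC_CTYPE": "fr_FR.UTF-8"}))

A real program would pass ``os.environ`` instead of a literal
dictionary. The sketch only looks at the environment; it does not check
whether the named locale actually exists on the system.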

The encoding of the POSIX locale must be ASCII or a superset of ASCII.

On Linux, the POSIX locale uses the ASCII encoding.

On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
the ASCII encoding, whereas ``mbstowcs()`` and ``wcstombs()`` functions
use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
``os.fsencode()`` and ``os.fsdecode()`` use the
``locale.getpreferredencoding()`` codec. For example, if command line
arguments are decoded by ``mbstowcs()`` and encoded back by
``os.fsencode()``, a ``UnicodeEncodeError`` exception is raised instead
of retrieving the original byte string.

To fix this issue, since Python 3.4, Python checks whether
``mbstowcs()`` really uses the ASCII encoding when ``LC_CTYPE`` uses the
POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
alias of ASCII). If not (the effective encoding is not ASCII), Python
uses its own ASCII codec instead of using the ``mbstowcs()`` and
``wcstombs()`` functions for operating system data.
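
The mismatch can be imitated in pure Python. This is only a simulation
under stated assumptions: the ``latin-1`` codec stands in for what
``mbstowcs()`` does in practice on those systems, and
``ascii`` with ``surrogateescape`` stands in for ``os.fsencode()``
under the POSIX locale::

    raw = b"caf\xe9"  # one non-ASCII byte, as found in a command line argument

    # Decoded like mbstowcs() does in practice (ISO 8859-1)...
    decoded = raw.decode("latin-1")

    # ...then encoded back with the announced encoding (ASCII).
    try:
        decoded.encode("ascii", "surrogateescape")
        mismatch_error = None
    except UnicodeEncodeError as exc:
        mismatch_error = exc

    # Using the same ASCII codec in both directions keeps the bytes intact.
    round_trip = raw.decode("ascii", "surrogateescape")
    round_trip_ok = round_trip.encode("ascii", "surrogateescape") == raw

    print("mismatched codecs:", type(mismatch_error).__name__)
    print("consistent codecs round-trip:", round_trip_ok)

The second half shows why using a single codec for both directions
avoids the error: the undecodable byte is escaped on decoding and
restored on encoding.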

See the `POSIX locale (2016 Edition)
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.


POSIX locale used by mistake
----------------------------

In many cases, the POSIX locale is not really expected by users who get
it by mistake. Examples:

* program started in an empty environment
* User forcing LANG=C to get messages in English
* LANG=C used for bad reasons, without being aware of the ASCII encoding
* SSH shell
* User locale set to a non-existing locale, typo in the locale name for
  example


C.UTF-8 and C.utf8 locales
--------------------------

Some UNIX operating systems provide a variant of the POSIX locale using the
UTF-8 encoding:

* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
* HP-UX: ``"C.utf8"``

It was proposed to add a ``C.UTF-8`` locale to the glibc: `glibc C.UTF-8
proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.

It is not planned to add such a locale to BSD systems.


Popularity of the UTF-8 encoding
--------------------------------

Python 3 uses UTF-8 by default for Python source files.

On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
system data. For Windows, see PEP 529: "Change Windows filesystem
encoding to UTF-8".

On Linux, UTF-8 became the de facto standard encoding,
replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
using different encodings for filenames and standard streams is likely
to create mojibake, so UTF-8 is now used *everywhere*.

The UTF-8 encoding is the default encoding of the XML and JSON file
formats. In January 2017, UTF-8 was used in `more than 88% of web pages
<https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
JavaScript, CSS, etc.).

See `utf8everywhere.org <http://utf8everywhere.org/>`_ for more general
information on the UTF-8 codec.

.. note::
   Some applications and operating systems (especially Windows) use Byte
   Order Marks (BOM) to indicate the Unicode encoding used: UTF-7,
   UTF-8, UTF-16-LE, etc. BOMs are not well supported and rarely used in
   Python.


Old data stored in different encodings and surrogateescape
----------------------------------------------------------

Even if UTF-8 became the de facto standard, there are still systems in
the wild which don't use UTF-8, and there is a lot of data stored in
different encodings. For example, an old USB key may use the ext3
filesystem with filenames encoded to ISO 8859-1.

The Linux kernel and the libc don't decode filenames: a filename is used
as a raw array of bytes. The common solution to support any filename is
to store filenames as bytes and not try to decode them. When a filename
is displayed to stdout, mojibake is displayed if the filename and the
terminal don't use the same encoding.

Python 3 promotes Unicode everywhere including filenames. A solution to
support filenames not decodable from the locale encoding was found: the
``surrogateescape`` error handler (PEP 383), which stores undecodable
bytes as surrogate characters. This error handler is used by default for
operating system data, by ``os.fsdecode()`` and ``os.fsencode()`` for
example (except on Windows which uses the ``strict`` error handler).
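
A minimal sketch of the round trip provided by ``surrogateescape``,
assuming a filename encoded to ISO 8859-1 that is decoded as UTF-8 (the
filename itself is made up for the example)::

    # ISO 8859-1 bytes for a filename; 0xE9 is not valid UTF-8 here.
    raw_name = b"report-\xe9t\xe9.txt"

    # Each undecodable byte 0xNN is stored as the surrogate U+DCNN.
    name = raw_name.decode("utf-8", "surrogateescape")
    print(ascii(name))

    # Encoding with the same handler restores the original bytes.
    restored = name.encode("utf-8", "surrogateescape")
    print(restored == raw_name)

The decoded string is not printable as is, but it can be passed back to
the operating system without losing information.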


Standard streams
----------------

Python uses the locale encoding for standard streams: stdin, stdout and
stderr. The ``strict`` error handler is used by stdin and stdout to
prevent mojibake.

The ``backslashreplace`` error handler is used by stderr to avoid
Unicode encode errors when displaying non-ASCII text. It is especially
useful when the POSIX locale is used, because this locale usually uses
the ASCII encoding.

The problem is that operating system data like filenames are decoded
using the ``surrogateescape`` error handler (PEP 383). Displaying a
filename to stdout raises a Unicode encode error if the filename
contains an undecodable byte stored as a surrogate character.

Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the
POSIX locale is used: `issue #19977 <http://bugs.python.org/issue19977>`_. The
idea is to pass operating system data through even if it means mojibake,
because most UNIX applications work like that. Most UNIX applications store
filenames as bytes, usually simply because bytes are first-class citizens in
the programming language used, whereas Unicode is badly supported.
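
The effect of the three error handlers on a stream can be sketched with
an in-memory ``io.TextIOWrapper`` standing in for a standard stream.
The helper ``encode_like_stream`` is hypothetical and UTF-8 is assumed
as the stream encoding::

    import io

    def encode_like_stream(text, encoding, errors):
        """Write text through a TextIOWrapper and return the bytes produced."""
        raw = io.BytesIO()
        stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
        stream.write(text)
        stream.flush()
        return raw.getvalue()

    # A filename holding one undecodable byte, stored as a surrogate.
    name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

    # strict (default of stdin and stdout): the surrogate cannot be encoded.
    try:
        encode_like_stream(name, "utf-8", "strict")
        strict_result = "ok"
    except UnicodeEncodeError:
        strict_result = "UnicodeEncodeError"

    # surrogateescape: the original byte is written back unchanged.
    escaped = encode_like_stream(name, "utf-8", "surrogateescape")

    # backslashreplace (stderr): the surrogate is written as an escape.
    replaced = encode_like_stream(name, "utf-8", "backslashreplace")

    print(strict_result, escaped, replaced)

With ``surrogateescape`` the output may be mojibake on the terminal, but
the write never fails because of an undecodable byte.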

.. note::
   The encoding and/or the error handler of standard streams can be
   overridden with the ``PYTHONIOENCODING`` environment variable.


Proposal
========

Changes
-------

Add a new UTF-8 mode, disabled by default, to ignore the locale and
force the usage of the UTF-8 encoding with the ``surrogateescape`` error
handler, instead of using the locale encoding (with the ``strict`` or
``surrogateescape`` error handler depending on the case).

Basically, the UTF-8 mode behaves like Python 2: it "just works" and
doesn't bother users with encodings, but it can produce mojibake. It can
be configured as strict to prevent mojibake: the UTF-8 encoding is used
with the ``strict`` error handler in this case.

A new ``-X utf8`` command line option and a new ``PYTHONUTF8``
environment variable are added to control the UTF-8 mode. The UTF-8 mode
is enabled by ``-X utf8`` or ``PYTHONUTF8=1``. The UTF-8 mode is
configured as strict by ``-X utf8=strict`` or ``PYTHONUTF8=strict``.

The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
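
The selection rules above can be summarized by a small decision
function. This is a sketch, not the interpreter's implementation: the
function name and its return values are hypothetical, ``True`` stands
for a bare ``-X utf8``, and giving the command line option precedence
over the environment variable is an assumption that the text above does
not state::

    def select_utf8_mode(xoption=None, envvar=None, posix_locale=False):
        """Return "off", "on" or "strict" for the given settings."""
        # Assumption: -X utf8 is checked before PYTHONUTF8.
        for value in (xoption, envvar):
            if value is None:
                continue  # not specified, look at the next source
            if value is True or value == "1":
                return "on"
            if value == "0":
                return "off"
            if value == "strict":
                return "strict"
            raise ValueError("invalid UTF-8 mode value: %r" % (value,))
        # Nothing specified: only the POSIX locale enables the mode.
        return "on" if posix_locale else "off"

    print(select_utf8_mode())
    print(select_utf8_mode(posix_locale=True))
    print(select_utf8_mode(envvar="0", posix_locale=True))
    print(select_utf8_mode(xoption="strict"))

The third call corresponds to ``PYTHONUTF8=0`` with the POSIX locale:
the mode that the locale would enable is explicitly disabled.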

Encoding and error handler
--------------------------

The UTF-8 mode changes the default encoding and error handler used by
``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
``sys.stdout`` and ``sys.stderr``:

============================  =======================  ==========================  ==========================
Function                      Default                  UTF-8 or POSIX locale       UTF-8 Strict
============================  =======================  ==========================  ==========================
open()                        locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape   **UTF-8/strict**
sys.stdin, sys.stdout         locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
sys.stderr                    locale/backslashreplace  **UTF-8**/backslashreplace  **UTF-8**/backslashreplace
============================  =======================  ==========================  ==========================

By comparison, Python 3.6 uses:

============================  =======================  ==========================
Function                      Default                  POSIX locale
============================  =======================  ==========================
open()                        locale/strict            locale/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape
sys.stdin, sys.stdout         locale/strict            locale/**surrogateescape**
sys.stderr                    locale/backslashreplace  locale/backslashreplace
============================  =======================  ==========================

The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
``strict`` error handler for convenience: the idea is that data not
encoded to UTF-8 are passed through Python without being modified, as
raw bytes.
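
The pass-through behaviour can be emulated today by passing the
encoding and the error handler explicitly to ``open()``. The sketch
below assumes that this matches what the UTF-8 mode would select by
default; the helper ``copy_as_text`` and the file names are
hypothetical::

    import os
    import tempfile

    def copy_as_text(src, dst, errors):
        """Copy a file through str objects, decoding and encoding as UTF-8."""
        # newline="" disables newline translation in both directions.
        with open(src, encoding="utf-8", errors=errors, newline="") as fin:
            text = fin.read()
        with open(dst, "w", encoding="utf-8", errors=errors, newline="") as fout:
            fout.write(text)

    # Valid UTF-8 (euro sign) followed by bytes that are not valid UTF-8.
    payload = b"valid: \xe2\x82\xac\ninvalid: \xff\xfe\n"

    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.txt")
        dst = os.path.join(tmp, "out.txt")
        with open(src, "wb") as fp:
            fp.write(payload)

        # surrogateescape: undecodable bytes survive the text round trip.
        copy_as_text(src, dst, "surrogateescape")
        with open(dst, "rb") as fp:
            copied = fp.read()

        # strict: the same file cannot even be decoded.
        try:
            copy_as_text(src, dst, "strict")
            strict_failed = False
        except UnicodeDecodeError:
            strict_failed = True

    print(copied == payload, strict_failed)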

Rationale
---------

The UTF-8 mode is disabled by default to keep hard Unicode errors when
encoding or decoding operating system data fails, and to keep backward
compatibility. The user is responsible for explicitly enabling the
UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
mode were enabled *by default*.

The UTF-8 mode should be used on systems known to be configured with
UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
the user overrides a locale *by mistake* or if a Python program is
started with no locale configured (and so with the POSIX locale).

Most UNIX applications handle operating system data as bytes, so
``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
limited impact on how these data are handled by the application.

The Python UTF-8 mode should help to make Python more interoperable with
the other UNIX applications in the system assuming that *UTF-8* is used
everywhere and that users *expect* UTF-8.

Ignoring ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
Python is more convenient, since they are more commonly misconfigured
*by mistake* (configured to use an encoding different than UTF-8,
whereas the system uses UTF-8), rather than being misconfigured by intent.

Expected mojibake issues
------------------------

The UTF-8 mode only affects Python 3.7 code; other code is not aware of
this mode.

If Python 3.7 is used as a producer in a ``producer | consumer`` shell
command, the consumer may fail to decode input data if it decodes it and the
locale encoding is not UTF-8. If the consumer doesn't decode its input and
processes it as bytes, it should just work.
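
The producer side of this scenario can be simulated without a shell.
In this sketch the consumers are imaginary: one decodes its input from
ISO 8859-1, one decodes it strictly from ASCII, and one handles bytes::

    text = "na\u00efve caf\u00e9"

    # What a Python producer in UTF-8 mode would write to the pipe.
    produced = text.encode("utf-8")

    # A consumer decoding from ISO 8859-1 silently gets mojibake.
    latin1_view = produced.decode("iso-8859-1")

    # A consumer decoding strictly from ASCII fails.
    try:
        produced.decode("ascii")
        ascii_consumer = "ok"
    except UnicodeDecodeError:
        ascii_consumer = "UnicodeDecodeError"

    # A consumer working on bytes sees exactly what was produced.
    bytes_view = produced

    print(ascii(latin1_view), ascii_consumer, bytes_view == produced)

The mojibake case is the more insidious one, since no error is reported
by the consumer.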

If Python 3.7 is used as a consumer in a ``producer | consumer`` shell command,
it should just work.

If Python calls third party libraries or if Python is embedded in an
application, code outside Python is not aware of the UTF-8 mode. If the other
code uses UTF-8, it's fine. If the other code uses the locale encoding,
mojibake will occur when the locale encoding is not UTF-8.


Use Cases
=========

List a directory into stdout
----------------------------

Script listing the content of the current directory into stdout::

    import os
    for name in os.listdir(os.curdir):
        print(name)

Result:

========================  ==============================
Python                    Always works?
========================  ==============================
Python 2                  **Yes**
Python 3                  No
Python 3.5, POSIX locale  **Yes**
UTF-8 mode                **Yes**
UTF-8 Strict mode         No
========================  ==============================

"Yes" means that the script cannot fail, but it can produce mojibake.

"No" means that the script can fail on decoding or encoding a filename
depending on the locale or the filename.


List a directory into a text file
---------------------------------

Similar to the previous example, except that the listing is written into
a text file::

    import os
    names = os.listdir(os.curdir)
    with open("/tmp/content.txt", "w") as fp:
        for name in names:
            fp.write("%s\n" % name)

Result:

========================  ==============================
Python                    Always works?
========================  ==============================
Python 2                  **Yes**
Python 3                  No
Python 3.5, POSIX locale  No
UTF-8 mode                **Yes**
UTF-8 Strict mode         No
========================  ==============================

"Yes" means that the script cannot fail, but it can produce mojibake.

"No" means that the script can fail on decoding or encoding a filename
depending on the locale or the filename. Typical error::

    $ LC_ALL=C python3 test.py
    Traceback (most recent call last):
      File "test.py", line 5, in <module>
        fp.write("%s\n" % name)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)


Display Unicode characters into stdout
--------------------------------------

A very basic example illustrating a common issue: display the euro sign
(U+20AC: €)::

    print("euro: \u20ac")

Result:

========================  ==============================
Python                    Always works?
========================  ==============================
Python 2                  No
Python 3                  No
Python 3.5, POSIX locale  No
UTF-8 mode                **Yes**
UTF-8 Strict mode         **Yes**
========================  ==============================

"Yes" means that the script cannot fail, but it can produce mojibake.

"No" means that the script can fail on encoding the euro sign depending on the
locale encoding.
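
The dependency on the encoding can be checked directly on the string,
without involving a terminal. The helper ``can_encode`` is hypothetical
and the four encodings are only examples of what a locale may select::

    def can_encode(text, encoding):
        """Tell whether text can be encoded with the strict error handler."""
        try:
            text.encode(encoding)
            return True
        except UnicodeEncodeError:
            return False

    encodings = ("ascii", "iso-8859-1", "iso-8859-15", "utf-8")
    results = {enc: can_encode("euro: \u20ac", enc) for enc in encodings}

    for enc in encodings:
        print(enc, results[enc])

ISO 8859-15 is included because, unlike ISO 8859-1, it defines the euro
sign.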


Replace a word in a text
------------------------

The following script replaces the word "apple" with "orange". It
reads input from stdin and writes the output into stdout::

    import sys
    text = sys.stdin.read()
    sys.stdout.write(text.replace("apple", "orange"))

Result:

========================  ==============================
Python                    Always works?
========================  ==============================
Python 2                  **Yes**
Python 3                  No
Python 3.5, POSIX locale  **Yes**
UTF-8 mode                **Yes**
UTF-8 Strict mode         No
========================  ==============================

"Yes" means that the script cannot fail.

"No" means that the script can fail on decoding the input depending on
the locale.


Backward Compatibility
======================

The main backward incompatible change is that the UTF-8 encoding is now
used if the locale is POSIX. Since the UTF-8 encoding is used with the
``surrogateescape`` error handler, encoding errors should not occur and
so the change should not break applications.

The most likely source of trouble comes from external libraries. Python
can successfully decode data from UTF-8, but a library using the locale
encoding can fail to encode the decoded text back to bytes. Hopefully,
encoding text in a library is a rare operation. Very few libraries
expect text; most libraries expect bytes and even manipulate bytes
internally.

If the locale is not POSIX, the PEP has no impact on backward
compatibility: the UTF-8 mode is disabled by default in this case and
must be enabled explicitly.


Alternatives
============

Don't modify the encoding of the POSIX locale
---------------------------------------------

A first version of the PEP did not change the encoding and error handler
used with the POSIX locale.

The problem is that adding a command line option or setting an environment
variable is not possible in some cases, or at least not convenient.

Moreover, many users simply expect Python 3 to behave like Python 2: it
should not bother them with encodings and should "just work" in all
cases. These users don't worry about mojibake, or even expect mojibake
because of complex documents using multiple incompatible encodings.


Always use UTF-8
----------------

Python already always uses the UTF-8 encoding on Mac OS X, Android and Windows.
Since UTF-8 became the de facto encoding, it makes sense to always use it on all
platforms with any locale.

The risk is to introduce mojibake if the locale uses a different encoding,
especially for locales other than the POSIX locale.


Force UTF-8 for the POSIX locale
--------------------------------

An alternative to always using UTF-8 in any case is to only use UTF-8 when the
``LC_CTYPE`` locale is the POSIX locale.

PEP 538 "Coercing the legacy C locale to C.UTF-8" by Nick Coghlan
proposes to implement that using the ``C.UTF-8`` locale.


Links
=====

PEPs:

* PEP 538 "Coercing the legacy C locale to C.UTF-8"
* PEP 529: "Change Windows filesystem encoding to UTF-8"
* PEP 383: "Non-decodable Bytes in System Character Interfaces"

Main Python issues:

* `issue #28180: sys.getfilesystemencoding() should default to utf-8
  <http://bugs.python.org/issue28180>`_
* `Issue #19977: Use "surrogateescape" error handler for sys.stdin and
  sys.stdout on UNIX for the C locale
  <http://bugs.python.org/issue19977>`_
* `Issue #19847: Setting the default filesystem-encoding
  <http://bugs.python.org/issue19847>`_
* `Issue #8622: Add PYTHONFSENCODING environment variable
  <https://bugs.python.org/issue8622>`_: added but reverted because of
  many issues, read the `Inconsistencies if locale and filesystem
  encodings are different
  <https://mail.python.org/pipermail/python-dev/2010-October/104509.html>`_
  thread on the python-dev mailing list

Incomplete list of Python issues related to Unicode errors, especially
with the POSIX locale:

* 2016-12-22: `LANG=C python3 -c "import os; os.path.exists('\xff')"
  <http://bugs.python.org/issue29042#msg283821>`_
* 2014-07-20: `issue #22016: Add a new 'surrogatereplace' output only error handler
  <http://bugs.python.org/issue22016>`_
* 2014-04-27: `Issue #21368: Check for systemd locale on startup if current
  locale is set to POSIX <http://bugs.python.org/issue21368>`_ -- read manually
  /etc/locale.conf when the locale is POSIX
* 2014-01-21: `Issue #20329: zipfile.extractall fails in Posix shell with utf-8
  filename
  <http://bugs.python.org/issue20329>`_
* 2013-11-30: `Issue #19846: Python 3 raises Unicode errors with the C locale
  <http://bugs.python.org/issue19846>`_
* 2010-05-04: `Issue #8610: Python3/POSIX:  errors if file system encoding is None
  <http://bugs.python.org/issue8610>`_
* 2013-08-12: `Issue #18713: Clearly document the use of PYTHONIOENCODING to
  set surrogateescape <http://bugs.python.org/issue18713>`_
* 2013-09-27: `Issue #19100: Use backslashreplace in pprint
  <http://bugs.python.org/issue19100>`_
* 2012-01-05: `Issue #13717: os.walk() + print fails with UnicodeEncodeError
  <http://bugs.python.org/issue13717>`_
* 2011-12-20: `Issue #13643: 'ascii' is a bad filesystem default encoding
  <http://bugs.python.org/issue13643>`_
* 2011-03-16: `issue #11574: TextIOWrapper should use UTF-8 by default for the
  POSIX locale
  <http://bugs.python.org/issue11574>`_, thread on python-dev:
  `Low-Level Encoding Behavior on Python 3
  <https://mail.python.org/pipermail/python-dev/2011-March/109361.html>`_
* 2010-04-26: `Issue #8533: regrtest: use backslashreplace error handler for
  stdout <http://bugs.python.org/issue8533>`_, regrtest fails with Unicode
  encode error if the locale is POSIX

Some issues are real bugs in applications which must explicitly set the
encoding. Such applications just work in the common case (locale
configured correctly), but "suddenly" fail when the POSIX locale is used
(probably for bad reasons). Such bugs are not well understood by users.
Examples of such issues:

* 2013-11-21: `pip: open() uses the locale encoding to parse Python
  script, instead of the encoding cookie
  <http://bugs.python.org/issue19685>`_ -- pip must use the encoding
  cookie to read a Python source code file
* 2011-01-21: `IDLE 3.x can crash decoding recent file list
  <http://bugs.python.org/issue10974>`_


Prior Art
=========

Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
variable to force UTF-8: see `perlrun
<http://perldoc.perl.org/perlrun.html>`_. It is possible to configure
UTF-8 per standard stream, on input and output streams, etc.


Copyright
=========

This document has been placed in the public domain.