PEP: 540
Title: Add a new UTF-8 Mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <vstinner@python.org>
BDFL-Delegate: INADA Naoki
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 05-Jan-2016
Python-Version: 3.7
Resolution: https://mail.python.org/pipermail/python-dev/2017-December/151173.html


Abstract
========

Add a new "UTF-8 Mode" to enhance Python's use of UTF-8.  When UTF-8 Mode
is active, Python will:

* use the ``utf-8`` encoding, regardless of the locale currently set by
  the current platform, and
* change the ``stdin`` and ``stdout`` error handlers to
  ``surrogateescape``.

This mode is off by default, but is automatically activated when using
the "POSIX" locale.

Add the ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable to control UTF-8 Mode.


Rationale
=========

Locale encoding and UTF-8
-------------------------

Python 3.6 uses the locale encoding for filenames, environment
variables, standard streams, etc. The locale encoding is inherited from
the locale; the encoding and the locale are tightly coupled.

Many users inherit the ASCII encoding from the POSIX locale, aka the "C"
locale, but are unable to change the locale for various reasons.  This
encoding is very limited in terms of Unicode support: any non-ASCII
character is likely to cause trouble.

It isn't always easy to get an accurate locale.  Locales don't get the
exact same name on different Linux distributions, FreeBSD, macOS, etc.
And some locales, like the recent ``C.UTF-8`` locale, are only supported
by a few platforms.  The current locale can even vary on the *same*
platform depending on context; for example, an SSH connection can use a
different encoding than the filesystem or local terminal encoding on the
same machine.

On the flip side, Python 3.6 is already using UTF-8 by default on macOS,
Android and Windows (:pep:`529`) for most functions -- although
``open()`` is a notable exception here. UTF-8 is also the default
encoding of Python scripts, XML and JSON file formats. The Go
programming language uses UTF-8 for all strings.

UTF-8 support is nearly ubiquitous for data read and written by modern
platforms.  It also has excellent support in Python.  The problem is
simply that the locale is frequently misconfigured.  An obvious solution
suggests itself: ignore the locale encoding and use UTF-8.


Passthrough for undecodable bytes: surrogateescape
--------------------------------------------------

When decoding bytes from UTF-8 using the default ``strict`` error
handler, Python 3 raises a ``UnicodeDecodeError`` on the first
undecodable byte.

Unix command line tools like ``cat`` or ``grep`` and most Python 2
applications simply do not have this class of bugs: they don't decode
data, but process data as a raw bytes sequence.

Python 3 already has a solution to behave like Unix tools and Python 2:
the ``surrogateescape`` error handler (:pep:`383`). It allows processing
data as if it were bytes, but uses Unicode in practice; undecodable
bytes are stored as surrogate characters.
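
The difference between the two error handlers can be sketched as follows.
The input bytes are a made-up example (text encoded as Latin-1, hence not
valid UTF-8), not something taken from this PEP:

```python
# Hypothetical input: "café" encoded as Latin-1, which is not valid UTF-8.
raw = b"caf\xe9\n"

# The default "strict" handler rejects the undecodable byte 0xE9.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "surrogateescape" maps the byte 0xE9 to the lone surrogate U+DCE9 ...
text = raw.decode("utf-8", "surrogateescape")

# ... and encoding with the same handler restores the original bytes.
round_trip = text.encode("utf-8", "surrogateescape")
```

The decoded string can be processed like any other ``str``, and the
original bytes are recovered unchanged when it is encoded again with the
same handler.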

UTF-8 Mode sets the ``surrogateescape`` error handler for ``stdin``
and ``stdout``, since these streams are commonly associated with Unix
command line tools.

However, users have different expectations for files. Files are expected
to be properly encoded, and Python is expected to fail early when
``open()`` is called with the wrong options, like opening a JPEG picture
in text mode. The ``open()`` default error handler remains ``strict``
for these reasons.
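
As an illustration of this behaviour, the sketch below opens a small
binary file in text mode. The file content is a hypothetical stand-in for
a JPEG picture (only the first header bytes), and the encoding is passed
explicitly so that the example does not depend on the current locale:

```python
import os
import tempfile

# Hypothetical stand-in for a JPEG file: only the first header bytes.
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(b"\xff\xd8\xff\xe0")
    path = tmp.name

try:
    # With the default "strict" handler, the first read fails on the
    # first undecodable byte.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # Opting in to surrogateescape lets the bytes through as lone
    # surrogates instead of raising.
    with open(path, encoding="utf-8", errors="surrogateescape") as f:
        escaped = f.read()
finally:
    os.remove(path)
```

Code that really needs the passthrough behaviour for files can still
request it explicitly with ``errors="surrogateescape"``.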


No change by default for best backward compatibility
----------------------------------------------------

While UTF-8 is perfect in most cases, sometimes the locale encoding is
actually the best encoding.

This PEP changes the behaviour for the POSIX locale since this locale is
usually equivalent to the ASCII encoding, whereas UTF-8 is a much better
choice. It does not change the behaviour for other locales to prevent
any risk of regression.

Because users must explicitly enable the new UTF-8 Mode for these
other locales, they are responsible for any potential mojibake issues
caused by UTF-8 Mode.


Proposal
========

Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale
encoding, and change ``stdin`` and ``stdout`` error handlers to
``surrogateescape``.

Add the new ``-X utf8`` command line option and ``PYTHONUTF8``
environment variable.  Users can explicitly activate UTF-8 Mode with the
command-line option ``-X utf8`` or by setting the environment variable
``PYTHONUTF8=1``.

This mode is disabled by default and enabled by the POSIX locale.  Users
can explicitly disable UTF-8 Mode with the command-line option ``-X
utf8=0`` or by setting the environment variable ``PYTHONUTF8=0``.
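
The four switches can be checked from a parent process, as in the sketch
below. It assumes Python 3.7 or newer, where ``sys.flags.utf8_mode``
reports the state of the mode; the helper name ``utf8_mode_of_child`` is
made up for this example:

```python
import os
import subprocess
import sys


def utf8_mode_of_child(extra_args=(), extra_env=None):
    """Run a child interpreter and return its sys.flags.utf8_mode value."""
    # Start from the current environment, minus any inherited PYTHONUTF8.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUTF8"}
    env.update(extra_env or {})
    cmd = [sys.executable, *extra_args,
           "-c", "import sys; print(sys.flags.utf8_mode)"]
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return int(result.stdout.decode("ascii").strip())


enabled_by_option = utf8_mode_of_child(["-X", "utf8"])
enabled_by_env = utf8_mode_of_child(extra_env={"PYTHONUTF8": "1"})
disabled_by_option = utf8_mode_of_child(["-X", "utf8=0"])
disabled_by_env = utf8_mode_of_child(extra_env={"PYTHONUTF8": "0"})
```

The two "enabled" values are expected to be ``1`` and the two "disabled"
values ``0``, whatever the locale of the parent process is.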

For standard streams, the ``PYTHONIOENCODING`` environment variable has
priority over UTF-8 Mode.
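
A minimal sketch of this priority rule, assuming Python 3.7 or newer: the
child below runs in UTF-8 Mode, but ``PYTHONIOENCODING`` still decides the
encoding of ``sys.stdout``. The choice of ``latin-1`` is arbitrary, and
encoding names are normalized with ``codecs.lookup()`` because the exact
spelling reported by ``sys.stdout.encoding`` may vary:

```python
import codecs
import os
import subprocess
import sys

# Request Latin-1 for the standard streams of the child process.
env = dict(os.environ, PYTHONIOENCODING="latin-1")

result = subprocess.run(
    [sys.executable, "-X", "utf8",
     "-c", "import sys; print(sys.flags.utf8_mode, sys.stdout.encoding)"],
    env=env, stdout=subprocess.PIPE, check=True)

mode, encoding = result.stdout.decode("ascii").split()

# Normalize the reported name so that aliases compare equal.
stdout_codec = codecs.lookup(encoding).name
requested_codec = codecs.lookup("latin-1").name
```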

On Windows, the ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable
(:pep:`529`) has priority over UTF-8 Mode.

Effects of UTF-8 Mode:

* ``sys.getfilesystemencoding()`` returns ``'UTF-8'``.
* ``locale.getpreferredencoding()`` returns ``'UTF-8'``; its
  *do_setlocale* argument, and the locale encoding, are ignored.
* The ``sys.stdin`` and ``sys.stdout`` error handlers are set to
  ``surrogateescape``.

Side effects:

* ``open()`` uses the UTF-8 encoding by default.  However, it still
  uses the ``strict`` error handler by default.
* ``os.fsdecode()`` and ``os.fsencode()`` use the UTF-8 encoding.
* Command line arguments, environment variables and filenames use the
  UTF-8 encoding.
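
The effects listed above can be observed from a child interpreter, as in
this sketch. It assumes Python 3.7 or newer; environment variables that
take priority over UTF-8 Mode are removed first, and encoding names are
normalized with ``codecs.lookup()`` because the reported spelling may
differ between versions:

```python
import codecs
import json
import os
import subprocess
import sys

# Code run by the child: report the encodings and error handlers as JSON.
REPORT_CODE = (
    "import json, locale, sys\n"
    "print(json.dumps({\n"
    "    'filesystem': sys.getfilesystemencoding(),\n"
    "    'preferred': locale.getpreferredencoding(False),\n"
    "    'stdin_errors': sys.stdin.errors,\n"
    "    'stdout_errors': sys.stdout.errors,\n"
    "}))\n"
)

# Drop the variables that have priority over UTF-8 Mode.
overrides = ("PYTHONIOENCODING", "PYTHONLEGACYWINDOWSFSENCODING")
env = {k: v for k, v in os.environ.items() if k not in overrides}

result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", REPORT_CODE],
    env=env, stdin=subprocess.DEVNULL, stdout=subprocess.PIPE, check=True)
report = json.loads(result.stdout.decode("utf-8"))

# Normalized codec names, e.g. 'utf-8' for both 'UTF-8' and 'utf8'.
filesystem_codec = codecs.lookup(report["filesystem"]).name
preferred_codec = codecs.lookup(report["preferred"]).name
```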


Relationship with the locale coercion (PEP 538)
===============================================

The POSIX locale enables the locale coercion (:pep:`538`) and the UTF-8
Mode (:pep:`540`). When the locale coercion is enabled, enabling the
UTF-8 Mode has no additional effect.

The UTF-8 Mode has the same effect as locale coercion:

* ``sys.getfilesystemencoding()`` returns ``'UTF-8'``,
* ``locale.getpreferredencoding()`` returns ``'UTF-8'``, and
* the ``sys.stdin`` and ``sys.stdout`` error handlers are set to
  ``surrogateescape``.

These changes only affect Python code. But the locale coercion has
additional effects: the ``LC_CTYPE`` environment variable and the
``LC_CTYPE`` locale are set to a UTF-8 locale like ``C.UTF-8``. One side
effect is that non-Python code is also impacted by the locale coercion.
The two PEPs are complementary.

On platforms like CentOS 7 where locale coercion is not supported, the
POSIX locale only enables UTF-8 Mode.  In this case, Python code uses
the UTF-8 encoding and ignores the locale encoding, whereas non-Python
code uses the locale encoding, which is usually ASCII for the POSIX
locale.

While the UTF-8 Mode is supported on all platforms and can be enabled
with any locale, the locale coercion is not supported by all platforms
and is restricted to the POSIX locale.

The UTF-8 Mode only has an impact on Python child processes when the
``PYTHONUTF8`` environment variable is set to ``1``, whereas the locale
coercion sets the ``LC_CTYPE`` environment variable, which impacts all
child processes.
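
The inheritance difference can be sketched with a child interpreter that
spawns a grandchild, assuming Python 3.7 or newer. The helper name
``modes_of_child_and_grandchild`` is made up for this example. When the
mode is enabled with ``-X utf8`` only, the grandchild's value depends on
its own locale and defaults, so no particular result is claimed for that
case:

```python
import os
import subprocess
import sys

# Code run by the child: spawn a grandchild and report both modes.
CHILD_CODE = (
    "import subprocess, sys\n"
    "grandchild = subprocess.run(\n"
    "    [sys.executable, '-c', 'import sys; print(sys.flags.utf8_mode)'],\n"
    "    stdout=subprocess.PIPE, check=True)\n"
    "print(sys.flags.utf8_mode, grandchild.stdout.decode('ascii').strip())\n"
)


def modes_of_child_and_grandchild(extra_args=(), extra_env=None):
    """Return (child utf8_mode, grandchild utf8_mode) as integers."""
    # Start from the current environment, minus any inherited PYTHONUTF8.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUTF8"}
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", CHILD_CODE],
        env=env, stdout=subprocess.PIPE, check=True)
    child, grandchild = result.stdout.decode("ascii").split()
    return int(child), int(grandchild)


# The environment variable is inherited, so both processes use UTF-8 Mode.
via_env = modes_of_child_and_grandchild(extra_env={"PYTHONUTF8": "1"})

# The command line option applies to the child only; the grandchild falls
# back to its own defaults (for example, the POSIX locale enables the mode).
via_option = modes_of_child_and_grandchild(["-X", "utf8"])
```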

The benefit of the locale coercion approach is that it helps ensure that
encoding handling in binary extension modules and child processes is
consistent with Python's encoding handling. The upside of the UTF-8 Mode
approach is that it allows an embedding application to change the
interpreter's behaviour without having to change the process global
locale settings.


Backward Compatibility
======================

The only backward incompatible change is that the POSIX locale now
enables the UTF-8 Mode by default: it will now use the UTF-8 encoding,
ignore the locale encoding, and change ``stdin`` and ``stdout`` error
handlers to ``surrogateescape``.


Annex: Encodings And Error Handlers
===================================

UTF-8 Mode changes the default encoding and error handler used by
``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
``sys.stdout`` and ``sys.stderr``.

Encoding and error handler
--------------------------

============================  =======================  ==========================
Function                      Default                  UTF-8 Mode or POSIX locale
============================  =======================  ==========================
open()                        locale/strict            **UTF-8**/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape
sys.stdin, sys.stdout         locale/strict            **UTF-8/surrogateescape**
sys.stderr                    locale/backslashreplace  **UTF-8**/backslashreplace
============================  =======================  ==========================

By comparison, Python 3.6 uses:

============================  =======================  ==========================
Function                      Default                  POSIX locale
============================  =======================  ==========================
open()                        locale/strict            locale/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape
sys.stdin, sys.stdout         locale/strict            locale/**surrogateescape**
sys.stderr                    locale/backslashreplace  locale/backslashreplace
============================  =======================  ==========================
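
For readers unfamiliar with the three error handlers named in these
tables, the sketch below applies each of them to the same string. The
input byte is a made-up example:

```python
# Hypothetical input: a single byte that is not valid UTF-8.
raw = b"\xff"

# surrogateescape: the undecodable byte becomes the lone surrogate U+DCFF,
# and encoding with the same handler restores the original byte.
escaped = raw.decode("utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")

# strict: a lone surrogate cannot be encoded to UTF-8, so encoding fails.
try:
    escaped.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# backslashreplace: encoding never fails; the character that cannot be
# encoded is written as a backslash escape sequence instead.
printable = escaped.encode("utf-8", "backslashreplace")
```

This is why ``backslashreplace`` suits ``sys.stderr``: an error message
containing an escaped byte can always be written, at the cost of not
reproducing the original byte.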

Encoding and error handler on Windows
-------------------------------------

On Windows, the encodings and error handlers are different:

============================  =======================  ==========================  ==========================
Function                      Default                  Legacy Windows FS encoding  UTF-8 Mode
============================  =======================  ==========================  ==========================
open()                        mbcs/strict              mbcs/strict                 **UTF-8**/strict
os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**            UTF-8/surrogatepass
sys.stdin, sys.stdout         UTF-8/surrogateescape    UTF-8/surrogateescape       UTF-8/surrogateescape
sys.stderr                    UTF-8/backslashreplace   UTF-8/backslashreplace      UTF-8/backslashreplace
============================  =======================  ==========================  ==========================

By comparison, Python 3.6 uses:

============================  =======================  ==========================
Function                      Default                  Legacy Windows FS encoding
============================  =======================  ==========================
open()                        mbcs/strict              mbcs/strict
os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**
sys.stdin, sys.stdout         UTF-8/surrogateescape    UTF-8/surrogateescape
sys.stderr                    UTF-8/backslashreplace   UTF-8/backslashreplace
============================  =======================  ==========================

The "Legacy Windows FS encoding" is enabled by the
``PYTHONLEGACYWINDOWSFSENCODING`` environment variable.

If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
``sys.stdout`` uses ``mbcs`` encoding by default rather than UTF-8.
But in UTF-8 Mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
encoding.

.. note::
   There is no POSIX locale on Windows. The ANSI code page is used as
   the locale encoding, and this code page never uses the ASCII
   encoding.


Links
=====

* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 Mode
  <http://bugs.python.org/issue29240>`_
* :pep:`538`:
  "Coercing the legacy C locale to C.UTF-8"
* :pep:`529`:
  "Change Windows filesystem encoding to UTF-8"
* :pep:`528`:
  "Change Windows console encoding to UTF-8"
* :pep:`383`:
  "Non-decodable Bytes in System Character Interfaces"


Post History
============

* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 Mode
  <https://mail.python.org/pipermail/python-dev/2017-December/151054.html>`_
* 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
  540 (assuming UTF-8 for *nix system boundaries)
  <https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_
* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 Mode
  <https://mail.python.org/pipermail/python-ideas/2017-January/044089.html>`_
* 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to
  C.utf-8 (msg284764) <https://bugs.python.org/issue28180#msg284764>`_
* 2016-08-17: `bpo-27781: Change sys.getfilesystemencoding() on Windows
  to UTF-8 (msg272916) <https://bugs.python.org/issue27781#msg272916>`_
  -- Victor proposed ``-X utf8`` for :pep:`529` (Change Windows
  filesystem encoding to UTF-8)


Version History
===============

* Version 4: ``locale.getpreferredencoding()`` now returns ``'UTF-8'``
  in the UTF-8 Mode.
* Version 3: The UTF-8 Mode does not change the ``open()`` default error
  handler (``strict``) anymore, and the Strict UTF-8 Mode has been
  removed.
* Version 2: Rewrite the PEP from scratch to make it much shorter and
  easier to understand.
* Version 1: First version posted to python-dev.


Copyright
=========

This document has been placed in the public domain.