python-peps/pep-0540.txt

PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner@gmail.com>
BDFL-Delegate: INADA Naoki
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode, disabled by default, to ignore the locale and
force the usage of the UTF-8 encoding.

Basically, UTF-8 mode behaves as Python 2: it "just works" and doesn't
bother users with encodings, but it can produce mojibake. The UTF-8 mode
can be configured as strict to prevent mojibake.

A new ``-X utf8`` command line option and a ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode. The POSIX locale enables
the UTF-8 mode.


Rationale
=========

"It's not a bug, you must fix your locale" is not an acceptable answer
----------------------------------------------------------------------

Since Python 3.0 was released in 2008, the usual answer to users getting
Unicode errors is to ask developers to fix their code to handle Unicode
properly. Most applications and Python modules were fixed, but users
kept reporting Unicode errors regularly: see the long list of issues in
the `Links`_ section below.

In fact, a second class of bugs comes from a locale which is not properly
configured. The usual answer to such a bug report is: "it is not a bug,
you must fix your locale".

Technically, the answer is correct, but from a practical point of view,
the answer is not acceptable. In many cases, "fixing the issue" is a
hard task. Moreover, sometimes, the usage of the POSIX locale is
deliberate.

A good example of a concrete issue are build systems which create a
fresh environment for each build using a chroot, a container, a virtual
machine or something else to get reproducible builds. Such a setup
usually uses the POSIX locale.  To get 100% reproducible builds, the
POSIX locale is a good choice: see the `Locales section of
reproducible-builds.org
<https://reproducible-builds.org/docs/locales/>`_.

UNIX users don't expect Unicode errors, since the common command lines
tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors.
These users expect that Python 3 "just works" with any locale and won't
bother them with encodings. From their point of the view, the bug is not
their locale, it's obviously Python 3.

Since Python 2 handles data as bytes, it's rarer in Python 2
compared to Python 3 to get Unicode errors. It also explains why users
also perceive Python 3 as the root cause of their Unicode errors.

Some users expect that Python 3 just works with any locale and so don't
bother with mojibake, whereas some developers are working hard to prevent
mojibake and so expect that Python 3 fails early before creating
it.

Since different group of users have different expectations, there is no
silver bullet which solves all issues at once. Last but not least,
backward compatibility should be preserved whenever possible.

Locale and operating system data
--------------------------------

.. _operating system data:

Python uses an encoding called the "filesystem encoding" to decide how
to encode and decode data from/to the operating system:

* file content
* command line arguments: ``sys.argv``
* standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
* environment variables: ``os.environ``
* filenames: ``os.listdir(str)`` for example
* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
* error messages: ``os.strerror(code)`` for example
* user and terminal names: ``os``, ``grp`` and ``pwd`` modules
* host name, UNIX socket path: see the ``socket`` module
* etc.

At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user
``LC_CTYPE`` locale and then store the locale encoding as the
"filesystem error". It's possible to get this encoding using
``sys.getfilesystemencoding()``. In the whole lifetime of a Python
process, the same encoding and error handler are used to encode and
decode data from/to the operating system.

The ``os.fsdecode()`` and ``os.fsencode()`` functions can be used to
decode and encode operating system data. These functions use the
filesystem error handler: ``sys.getfilesystemencodeerrors()``.

.. note::
   In some corner cases, the *current* ``LC_CTYPE`` locale must be used
   instead of ``sys.getfilesystemencoding()``. For example, the ``time``
   module uses the *current* ``LC_CTYPE`` locale to decode timezone
   names.


The POSIX locale and its encoding
---------------------------------

The following environment variables are used to configure the locale, in
this preference order:

* ``LC_ALL``, most important variable
* ``LC_CTYPE``
* ``LANG``

The POSIX locale, also known as "the C locale", is used:

* if the first set variable is set to ``"C"``
* if all these variables are unset, for example when a program is
  started in an empty environment.

The encoding of the POSIX locale must be ASCII or a superset of ASCII.

On Linux, the POSIX locale uses the ASCII encoding.

On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
the ASCII encoding, whereas ``mbstowcs()`` and ``wcstombs()`` functions
use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
``os.fsencode()`` and ``os.fsdecode()`` use
Python codec of the locale encoding. For example, if command line
arguments are decoded by ``mbstowcs()`` and encoded back by
``os.fsencode()``, an ``UnicodeEncodeError`` exception is raised instead
of retrieving the original byte string.

To fix this issue, from Python 3.4, a check is made to see if ``mbstowcs()``
really uses the ASCII encoding if the the ``LC_CTYPE`` uses the the
POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
alias to ASCII). If not (the effective encoding is not ASCII), Python
uses its own ASCII codec instead of using ``mbstowcs()`` and
``wcstombs()`` functions for `operating system data`_.

See the `POSIX locale (2016 Edition)
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.


POSIX locale used by mistake
----------------------------

In many cases, the POSIX locale is not really expected by users who get
it by mistake. Examples:

* program started in an empty environment
* User forcing LANG=C to get messages in English
* LANG=C used for bad reasons, without being aware of the ASCII encoding
* SSH shell
* Linux installed with no configured locale
* chroot environment, Docker image, container, ... with no locale is
  configured
* User locale set to a non-existing locale, typo in the locale name for
  example


C.UTF-8 and C.utf8 locales
--------------------------

Some UNIX operating systems provide a variant of the POSIX locale using
the UTF-8 encoding:

* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
* HP-UX: ``"C.utf8"``

It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.

It is not planned to add such locale to BSD systems.


Popularity of the UTF-8 encoding
--------------------------------

Python 3 uses UTF-8 by default for Python source files.

On Mac OS X, Windows and Android, Python always use UTF-8 for operating
system data. For Windows, see `PEP 529`_: "Change Windows filesystem
encoding to UTF-8".

On Linux, UTF-8 became the de facto standard encoding,
replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
using different encodings for filenames and standard streams is likely
to create mojibake, so UTF-8 is now used *everywhere*.

The UTF-8 encoding is the default encoding of XML and JSON file formats.
As of January 2017, UTF-8 was used in `more than 88% of web pages
<https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
Javascript, CSS, etc.).

See `utf8everywhere.org <http://utf8everywhere.org/>`_ for more general
information on the UTF-8 codec.

.. note::
   Some applications and operating systems (especially Windows) use Byte
   Order Markers (BOM) to indicate the used Unicode encoding: UTF-7,
   UTF-8, UTF-16-LE, etc. BOM are not well supported and are rarely used in
   Python.


Old data stored in different encodings and surrogateescape
----------------------------------------------------------

Even if UTF-8 became the de facto standard, there are still systems in
the wild which don't use UTF-8. And there are a lot of data stored in
different encodings. For example, an old USB key using the ext3
filesystem with filenames encoded to ISO 8859-1.

The Linux kernel and libc don't decode filenames: a filename is used
as a raw array of bytes. The common solution to support any filename is
to store filenames as bytes and don't try to decode them. When displayed
to stdout, mojibake is displayed if the filename and the terminal don't
use the same encoding.

Python 3 promotes Unicode everywhere including filenames. A solution to
support filenames not decodable from the locale encoding was found: the
``surrogateescape`` error handler (`PEP 383`_), store undecodable bytes
as surrogate characters. This error handler is used by default for
`operating system data`_, for example, by ``os.fsdecode()`` and
``os.fsencode()`` (except on Windows which uses the ``strict`` error handler).


Standard streams
----------------

Python uses the locale encoding for standard streams: stdin, stdout and
stderr. The ``strict`` error handler is used by stdin and stdout to
prevent mojibake.

The ``backslashreplace`` error handler is used by stderr to avoid
Unicode encode errors when displaying non-ASCII text. It is especially
useful when the POSIX locale is used, because this locale usually uses
the ASCII encoding.

The problem is that `operating system data`_ like filenames are decoded
using the ``surrogateescape`` error handler (`PEP 383`_). Displaying a
filename to stdout raises a Unicode encode error if the filename
contains an undecoded byte stored as a surrogate character.

Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the
POSIX locale is used: `issue #19977
<http://bugs.python.org/issue19977>`_. The idea is to pass through
`operating system data`_ even if it creates mojibake, because most UNIX
applications work like that. Most UNIX applications store filenames as
bytes, usually because bytes are first-citizen class in the used
programming language, whereas Unicode is badly supported.

.. note::
   The encoding and/or the error handler of standard streams can be
   overridden with the ``PYTHONIOENCODING`` environment variable.


Proposal
========

Changes
-------

Add a new UTF-8 mode, disabled by default, to ignore the locale and
force the usage of the UTF-8 encoding with the ``surrogateescape`` error
handler, instead using the locale encoding (with ``strict`` or
``surrogateescape`` error handler depending on the case).

Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
bother users with encodings, but it can produce mojibake. It can be
configured as strict to prevent mojibake: the UTF-8 encoding is used
with the ``strict`` error handler for inputs and outputs, but the
``surrogateescape`` error handler is still used for `operating system
data`_.

A new ``-X utf8`` command line option and a ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode. The UTF-8 mode is enabled
by using ``-X utf8`` or ``PYTHONUTF8=1``.  It can be configured as strict
by using ``-X utf8=strict`` or ``PYTHONUTF8=strict``. Other option values
fail with an error.

The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.

Options priority for the UTF-8 mode:

* ``PYTHONLEGACYWINDOWSFSENCODING``
* ``-X utf8``
* ``PYTHONUTF8``
* POSIX locale

For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the UTF-8 mode,
whereas ``LC_ALL=C python3.7 -X utf8=0`` disables the UTF-8 mode and
uses the encoding of the POSIX locale.

Encodings used by ``open()``, highest priority first:

* *encoding* and *errors* parameters (if set)
* UTF-8 mode
* ``os.device_encoding(fd)``
* ``os.getpreferredencoding(False)``


Encoding and error handler
--------------------------

The UTF-8 mode changes the default encoding and error handler used by
``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
``sys.stdout`` and ``sys.stderr``:

============================  =======================  ==========================  ==========================
Function                      Default                  UTF-8 mode or POSIX locale  UTF-8 Strict mode
============================  =======================  ==========================  ==========================
open()                        locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape   **UTF-8**/surrogateescape
sys.stdin, sys.stdout         locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
sys.stderr                    locale/backslashreplace  **UTF-8**/backslashreplace  **UTF-8**/backslashreplace
============================  =======================  ==========================  ==========================

By comparison, Python 3.6 uses:

============================  =======================  ==========================
Function                      Default                  POSIX locale
============================  =======================  ==========================
open()                        locale/strict            locale/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape
sys.stdin, sys.stdout         locale/strict            locale/**surrogateescape**
sys.stderr                    locale/backslashreplace  locale/backslashreplace
============================  =======================  ==========================

The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
strict mode for convenience: the idea is that data not encoded to UTF-8
are passed through "Python" without being modified, as raw bytes.

The ``PYTHONIOENCODING`` environment variable has priority over the
UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1
python3 -X utf8`` uses the Latin1 encoding for stdin, stdout and stderr.

Encoding and error handler on Windows
-------------------------------------

On Windows, the encodings and error handlers are different:

============================  =======================  ==========================  ==========================  ==========================
Function                      Default                  Legacy Windows FS encoding  UTF-8 mode                  UTF-8 Strict mode
============================  =======================  ==========================  ==========================  ==========================
open()                        mbcs/strict              mbcs/strict                 **UTF-8/surrogateescape**   **UTF-8**/strict
os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**            UTF-8/surrogatepass         UTF-8/surrogatepass
sys.stdin, sys.stdout         UTF-8/surrogateescape    UTF-8/surrogateescape       UTF-8/surrogateescape       **UTF-8/strict**
sys.stderr                    UTF-8/backslashreplace   UTF-8/backslashreplace      UTF-8/backslashreplace      UTF-8/backslashreplace
============================  =======================  ==========================  ==========================  ==========================

By comparison, Python 3.6 uses:

============================  =======================  ==========================
Function                      Default                  Legacy Windows FS encoding
============================  =======================  ==========================
open()                        mbcs/strict              mbcs/strict
os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**
sys.stdin, sys.stdout         UTF-8/surrogateescape    UTF-8/surrogateescape
sys.stderr                    UTF-8/backslashreplace   UTF-8/backslashreplace
============================  =======================  ==========================

The "Legacy Windows FS encoding" is enabled by setting the
``PYTHONLEGACYWINDOWSFSENCODING`` environment variable to ``1`` as specified
in `PEP 529` .

Enabling the legacy Windows filesystem encoding disables the UTF-8 mode
(as ``-X utf8=0``).

If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
``sys.output`` use ``mbcs`` encoding by default rather than UTF-8. But
with the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
encoding.

There is no POSIX locale on Windows. The ANSI code page is used to the
locale encoding, and this code page never uses the ASCII encoding.


Rationale
---------

UTF-8 mode is disabled by default in order to keep hard Unicode errors when
encoding or decoding `operating system data`_ fails and preserve
backward compatibility. In addition, users will be better prepared for
mojibake if it is their responsibility to explicitly enable UTF-8 mode
than they would be if it was enabled *by default*.

UTF-8 mode should be used on systems known to be configured with
UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
the user overrides a locale *by mistake* or if a Python program is
started with no locale configured (and so with the POSIX locale).

Most UNIX applications handle `operating system data`_ as bytes, so
the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
limited impact on how these data are handled by the application.

The UTF-8 mode should help make Python more interoperable with
other UNIX applications on the system assuming that *UTF-8* is used
everywhere and that users *expect* UTF-8.

Ignoring ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
Python is more convenient, since they are more commonly misconfigured
*by mistake* (configured to use an encoding different than UTF-8,
whereas the system uses UTF-8), rather than being misconfigured by
intent.

Expected mojibake and surrogate character issues
------------------------------------------------

The UTF-8 mode only affects code running directly in Python, especially
code written in pure Python. The other code, called "external code"
here, is not aware of this mode. Examples:

* C libraries called by Python modules like OpenSSL
* The application code when Python is embedded in an application

In the UTF-8 mode, Python uses the ``surrogateescape`` error handler
which stores bytes not decodable from UTF-8 as surrogate characters.

If the external code uses the locale and the locale encoding is UTF-8,
it should work fine.

External code using bytes
^^^^^^^^^^^^^^^^^^^^^^^^^

If the external code processes data as bytes, surrogate characters are not
an issue since they are only used inside Python. Python encodes back
surrogate characters to bytes at the edges, before calling external
code.

The UTF-8 mode can produce mojibake since Python and external code don't
both of invalid bytes, but it's a deliberate choice. The UTF-8 mode can
be configured as strict to prevent mojibake and fail early when data
is not decodable from UTF-8 or not encodable to UTF-8.

External code using text
^^^^^^^^^^^^^^^^^^^^^^^^

If the external code uses text API, for example using the ``wchar_t*`` C
type, mojibake should not occur, but the external code can fail on
surrogate characters.


Use Cases
=========

The following use cases were written to highlight the impact of
the chosen encodings and error handlers on concrete examples.

The "Always work" results were written to prove the benefit of having a
UTF-8 mode which works with any data and any locale, compared to the
existing old Python versions.

The "Mojibake" column shows that ignoring the locale causes a practical
issue: the UTF-8 mode produces mojibake if the terminal doesn't use the
UTF-8 encoding.

List a directory into stdout
----------------------------

Script listing the content of the current directory into stdout::

    import os
    for name in os.listdir(os.curdir):
        print(name)

Result:

========================  =============  =========
Python                    Always works?  Mojibake?
========================  =============  =========
Python 2                  **Yes**        **Yes**
Python 3                  No             No
Python 3.5, POSIX locale  **Yes**        **Yes**
UTF-8 mode                **Yes**        **Yes**
UTF-8 Strict mode         No             No
========================  =============  =========

"No" means that the script can fail on decoding or encoding a filename
depending on the locale or the filename.

To be able to always work, the program must be able to produce mojibake.
Mojibake is more user friendly than an error with a truncated or empty
output.

For example, using a directory which contains a file called ``b'xxx\xff'``
(the byte ``0xFF`` is invalid in UTF-8).

Default and UTF-8 Strict mode fail on ``print()`` with an encode error::

    $ python3.7 ../ls.py
    Traceback (most recent call last):
      File "../ls.py", line 5, in <module>
        print(name)
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...

    $ python3.7 -X utf8=strict ../ls.py
    Traceback (most recent call last):
      File "../ls.py", line 5, in <module>
        print(name)
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...

UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
but display mojibake::

    $ python3.7 -X utf8 ../ls.py
    xxx<78>

    $ LC_ALL=C /python3.6 ../ls.py
    xxx<78>

    $ python2 ../ls.py
    xxx<78>

    $ ls
    'xxx'$'\377'


List a directory into a text file
---------------------------------

Similar to the previous example, except that the listing is written into
a text file::

    import os
    names = os.listdir(os.curdir)
    with open("/tmp/content.txt", "w") as fp:
        for name in names:
            fp.write("%s\n" % name)

Result:

========================  =============  =========
Python                    Always works?  Mojibake?
========================  =============  =========
Python 2                  **Yes**        **Yes**
Python 3                  No             No
Python 3.5, POSIX locale  No             No
UTF-8 mode                **Yes**        **Yes**
UTF-8 Strict mode         No             No
========================  =============  =========

"Yes" implies that mojibake can be produced. "No" means that the script
can fail on decoding or encoding a filename depending on the locale or
the filename. Typical error::

    $ LC_ALL=C python3 test.py
    Traceback (most recent call last):
      File "test.py", line 5, in <module>
        fp.write("%s\n" % name)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)


Display Unicode characters into stdout
--------------------------------------

Very basic example used to illustrate a common issue, display the euro sign
(U+20AC: €)::

    print("euro: \u20ac")

Result:

========================  =============  =========
Python                    Always works?  Mojibake?
========================  =============  =========
Python 2                  No             No
Python 3                  No             No
Python 3.5, POSIX locale  No             No
UTF-8 mode                **Yes**        **Yes**
UTF-8 Strict mode         **Yes**        **Yes**
========================  =============  =========

The UTF-8 and UTF-8 Strict modes will always encode the euro sign as
UTF-8. If the terminal uses a different encoding, we get mojibake.


Replace a word in a text
------------------------

The following script replaces the word "apple" with "orange". It
reads input from stdin and writes the output into stdout::

    import sys
    text = sys.stdin.read()
    sys.stdout.write(text.replace("apple", "orange"))

Result:

========================  =============  =========
Python                    Always works?  Mojibake?
========================  =============  =========
Python 2                  **Yes**        **Yes**
Python 3                  No             No
Python 3.5, POSIX locale  **Yes**        **Yes**
UTF-8 mode                **Yes**        **Yes**
UTF-8 Strict mode         No             No
========================  =============  =========

Producer-consumer model using pipes
-----------------------------------

Let's say that we have a "producer" program which writes data into its
stdout and a "consumer" program which reads data from its stdin.

On a shell, such programs are run with the command::

    producer | consumer

The question is if these programs will work with any data and any locale.
UNIX users don't expect Unicode errors, and so expect that such programs
"just work".

If the producer only produces ASCII output, no error should occur. Let's
say the that the producer writes at least one non-ASCII character (at least
one byte in the range ``0x80..0xff``).

To simplify the problem, let's say that the consumer has no output
(doesn't write results into a file or stdout).

A "Bytes producer" is an application which cannot fail with a Unicode
error and produces bytes into stdout.

Let's say that a "Bytes consumer" does not decode stdin but stores data
as bytes: such a consumer always works. Common UNIX command line tools like
``cat``, ``grep`` or ``sed`` are in this category. Many Python 2
applications are also in this category.

"Python producer" and "Python consumer" are a producer and consumer
implemented in Python.

Bytes producer, Bytes consumer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It always works, but it is out of the scope of this PEP since it doesn't
involve Python.

Python producer, Bytes consumer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python producer::

    print("euro: \u20ac")

Result:

========================  =============  =========
Python                    Always works?  Mojibake?
========================  =============  =========
Python 2                  No             No
Python 3                  No             No
Python 3.5, POSIX locale  No             No
UTF-8 mode                **Yes**        **Yes**
UTF-8 Strict mode         No             No
========================  =============  =========

The question here is not if the consumer is able to decode the input,
but if Python is able to produce its output. So it's similar to the
`Display Unicode characters into stdout`_ case.

UTF-8 modes work with any locale since the consumer doesn't try to
decode its stdin.

Bytes producer, Python consumer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python consumer::

    import sys
    text = sys.stdin.read()
    result = text.replace("apple", "orange")
    # ignore the result

Result:

========================  =============  =========
Python                    Always works?  Mojibake?
========================  =============  =========
Python 2                  **Yes**        **Yes**
Python 3                  No             No
Python 3.5, POSIX locale  **Yes**        **Yes**
UTF-8 mode                **Yes**        **Yes**
UTF-8 Strict mode         No             No
========================  =============  =========

Python 3 fails on decoding stdin depending on the input and the locale.


Python producer, Python consumer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python producer::

    print("euro: \u20ac")

Python consumer::

    import sys
    text = sys.stdin.read()
    result = text.replace("apple", "orange")
    # ignore the result

Result, using the same Python version for the producer and the consumer:

========================  =============  =========
Python                    Always works?  Mojibake?
========================  =============  =========
Python 2                  No             No
Python 3                  No             No
Python 3.5, POSIX locale  No             No
UTF-8 mode                **Yes**        **Yes**
UTF-8 Strict mode         No             No
========================  =============  =========

This case combines a Python producer with a Python consumer, so the
result is the subset of `Python producer, Bytes consumer`_ and `Bytes
producer, Python consumer`_.


Backward Compatibility
======================

The main backward incompatible change is that the UTF-8 encoding is now
used by default if the locale is POSIX. Since the UTF-8 encoding is used
with the ``surrogateescape`` error handler, encoding errors should not
occur and so the change should not break applications.

The more likely source of trouble comes from external libraries. Python
can successfully decode data from UTF-8, but a library using the locale
encoding can fail to encode the decoded text back to bytes.  Hopefully,
encoding text in a library is a rare operation. Very few libraries
expect text, most libraries expect bytes and even manipulate bytes
internally.

The PEP only changes the default behaviour if the locale is POSIX. For
other locales, the *default* behaviour is unchanged.


Alternatives
============

Don't modify the encoding of the POSIX locale
---------------------------------------------

A first version of the PEP did not change the encoding and error handler
used for the POSIX locale.

The problem is that adding the ``-X utf8`` command line option or
setting the ``PYTHONUTF8`` environment variable is not possible in some
cases, or at least not convenient.

Moreover, many users simply expect that Python 3 behaves like Python 2:
it doesn't bother them with encodings and "just works" in all cases. These
users don't worry about mojibake, or even expect mojibake because of
complex documents using multiple incompatibles encodings.


Always use UTF-8
----------------

Python already always uses the UTF-8 encoding on Mac OS X, Android and
Windows.  Since UTF-8 became the de facto encoding, it makes sense to
always use it on all platforms with any locale.

The risk is to introduce mojibake if the locale uses a different
encoding, especially for locales other than the POSIX locale.


Force UTF-8 for the POSIX locale
--------------------------------

An alternative to always using UTF-8 in any case is to only use UTF-8 when the
``LC_CTYPE`` locale is the POSIX locale.

`PEP 538`_ "Coercing the legacy C locale to C.UTF-8" by Nick
Coghlan proposes to implement that using the ``C.UTF-8`` locale.


Use the strict error handler for operating system data
------------------------------------------------------

Using the ``surrogateescape`` error handler for `operating system data`_
creates surprising surrogate characters. No Python codec (except for
``utf-7``) accepts surrogates so encoding text coming from the
operating system is likely to raise an error. The problem is that
the error comes late, very far from where the data was read.

The ``strict`` error handler can be used instead to decode
(``os.fsdecode()``) and encode (``os.fsencode()``) operating system
data and raise encoding errors as soon as possible. Using it helps find
bugs more quickly.

The main drawback of this strategy is that it doesn't work in practice.
Python 3 is designed on top on Unicode strings. Most functions expect
Unicode and produce Unicode. Even if many operating system functions
have two flavors, bytes and Unicode, the Unicode flavor is used in most
cases. There are good reasons for that: Unicode is more convenient in
Python 3 and using Unicode helps to support the full Unicode Character
Set (UCS) on Windows (even if Python now uses UTF-8 since Python 3.6,
see the `PEP 528`_ and the `PEP 529`_).

For example, if ``os.fsdecode()`` uses ``utf8/strict``,
``os.listdir(str)`` fails to list filenames of a directory if a single
filename is not decodable from UTF-8. As a consequence,
``shutil.rmtree(str)`` fails to remove a directory. Undecodable
filenames, environment variables, etc. are simply too common to make
this alternative viable.


Links
=====

PEPs:

* `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
  "Coercing the legacy C locale to C.UTF-8"
* `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
  "Change Windows filesystem encoding to UTF-8"
* `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
  "Change Windows console encoding to UTF-8"
* `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
  "Non-decodable Bytes in System Character Interfaces"

Main Python issues:

* `Issue #29240: Implementation of the PEP 540: Add a new UTF-8 mode
  <http://bugs.python.org/issue29240>`_
* `Issue #28180: sys.getfilesystemencoding() should default to utf-8
  <http://bugs.python.org/issue28180>`_
* `Issue #19977: Use "surrogateescape" error handler for sys.stdin and
  sys.stdout on UNIX for the C locale
  <http://bugs.python.org/issue19977>`_
* `Issue #19847: Setting the default filesystem-encoding
  <http://bugs.python.org/issue19847>`_
* `Issue #8622: Add PYTHONFSENCODING environment variable
  <https://bugs.python.org/issue8622>`_: added but reverted because of
  many issues, read the `Inconsistencies if locale and filesystem
  encodings are different
  <https://mail.python.org/pipermail/python-dev/2010-October/104509.html>`_
  thread on the python-dev mailing list

Incomplete list of Python issues related to Unicode errors, especially
with the POSIX locale:

* 2016-12-22: `LANG=C python3 -c "import os; os.path.exists('\xff')"
  <http://bugs.python.org/issue29042#msg283821>`_
* 2014-07-20: `issue #22016: Add a new 'surrogatereplace' output only error handler
  <http://bugs.python.org/issue22016>`_
* 2014-04-27: `Issue #21368: Check for systemd locale on startup if current
  locale is set to POSIX <http://bugs.python.org/issue21368>`_ -- read manually
  /etc/locale.conf when the locale is POSIX
* 2014-01-21: `Issue #20329: zipfile.extractall fails in Posix shell with utf-8
  filename
  <http://bugs.python.org/issue20329>`_
* 2013-11-30: `Issue #19846: Python 3 raises Unicode errors with the C locale
  <http://bugs.python.org/issue19846>`_
* 2010-05-04: `Issue #8610: Python3/POSIX:  errors if file system encoding is None
  <http://bugs.python.org/issue8610>`_
* 2013-08-12: `Issue #18713: Clearly document the use of PYTHONIOENCODING to
  set surrogateescape <http://bugs.python.org/issue18713>`_
* 2013-09-27: `Issue #19100: Use backslashreplace in pprint
  <http://bugs.python.org/issue19100>`_
* 2012-01-05: `Issue #13717: os.walk() + print fails with UnicodeEncodeError
  <http://bugs.python.org/issue13717>`_
* 2011-12-20: `Issue #13643: 'ascii' is a bad filesystem default encoding
  <http://bugs.python.org/issue13643>`_
* 2011-03-16: `issue #11574: TextIOWrapper should use UTF-8 by default for the
  POSIX locale
  <http://bugs.python.org/issue11574>`_, thread on python-dev:
  `Low-Level Encoding Behavior on Python 3
  <https://mail.python.org/pipermail/python-dev/2011-March/109361.html>`_
* 2010-04-26: `Issue #8533: regrtest: use backslashreplace error handler for
  stdout <http://bugs.python.org/issue8533>`_, regrtest fails with Unicode
  encode error if the locale is POSIX

Some issues are real bugs in applications which must explicitly set the
encoding. Well, it just works in the common case (locale configured
correctly), so what? The program "suddenly" fails when the POSIX
locale is used (probably for bad reasons). Such bugs are not well
understood by users. Example of such issues:

* 2013-11-21: `pip: open() uses the locale encoding to parse Python
  script, instead of the encoding cookie
  <http://bugs.python.org/issue19685>`_ -- pip must use the encoding
  cookie to read a Python source code file
* 2011-01-21: `IDLE 3.x can crash decoding recent file list
  <http://bugs.python.org/issue10974>`_


Prior Art
=========

Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
variable to force UTF-8: see `perlrun
<http://perldoc.perl.org/perlrun.html>`_. It is possible to configure
UTF-8 per standard stream, on input and output streams, etc.


Copyright
=========

This document has been placed in the public domain.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								PEP: 540
 								Title: Add a new UTF-8 mode
 								Version: $Revision$
 								Last-Modified: $Date$
 								Author: Victor Stinner <victor.stinner@gmail.com>
-												Update BDFL delegations for PEPs 538 & 540

- Barry stepping down due to lack of time
- Naoki Inada will handle both PEPs due to
  the significant overlap between them

											
										
										
											2017-04-24 00:33:34 -04:00
+								BDFL-Delegate: INADA Naoki
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								Status: Draft
 								Type: Standards Track
 								Content-Type: text/x-rst
 								Created: 5-January-2016
 								Python-Version: 3.7
 								Abstract
 								========
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								Add a new UTF-8 mode, disabled by default, to ignore the locale and
 								force the usage of the UTF-8 encoding.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								Basically, UTF-8 mode behaves as Python 2: it "just works" and doesn't
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								bother users with encodings, but it can produce mojibake. The UTF-8 mode
 								can be configured as strict to prevent mojibake.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								A new ``-X utf8`` command line option and a ``PYTHONUTF8`` environment
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								variable are added to control the UTF-8 mode. The POSIX locale enables
 								the UTF-8 mode.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								Rationale
 								=========
 								"It's not a bug, you must fix your locale" is not an acceptable answer
 								----------------------------------------------------------------------
 								Since Python 3.0 was released in 2008, the usual answer to users getting
 								Unicode errors is to ask developers to fix their code to handle Unicode
 								properly. Most applications and Python modules were fixed, but users
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								kept reporting Unicode errors regularly: see the long list of issues in
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								the `Links`_ section below.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								In fact, a second class of bugs comes from a locale which is not properly
 								configured. The usual answer to such a bug report is: "it is not a bug,
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								you must fix your locale".
 								Technically, the answer is correct, but from a practical point of view,
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								the answer is not acceptable. In many cases, "fixing the issue" is a
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								hard task. Moreover, sometimes, the usage of the POSIX locale is
 								deliberate.
 								A good example of a concrete issue are build systems which create a
 								fresh environment for each build using a chroot, a container, a virtual
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								machine or something else to get reproducible builds. Such a setup
 								usually uses the POSIX locale.  To get 100% reproducible builds, the
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								POSIX locale is a good choice: see the `Locales section of
 								reproducible-builds.org
 								<https://reproducible-builds.org/docs/locales/>`_.
 								UNIX users don't expect Unicode errors, since the common command lines
 								tools like ``cat``, ``grep`` or ``sed`` never fail with Unicode errors.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								These users expect that Python 3 "just works" with any locale and won't
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								bother them with encodings. From their point of the view, the bug is not
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								their locale, it's obviously Python 3.
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
-												PEP 540: correcting english errors

											
										
										
											2017-01-08 11:08:01 -05:00
+								Since Python 2 handles data as bytes, it's rarer in Python 2
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								compared to Python 3 to get Unicode errors. It also explains why users
 								also perceive Python 3 as the root cause of their Unicode errors.
 								Some users expect that Python 3 just works with any locale and so don't
-												PEP 540: correcting english errors

											
										
										
											2017-01-08 11:08:01 -05:00
+								bother with mojibake, whereas some developers are working hard to prevent
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								mojibake and so expect that Python 3 fails early before creating
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								it.
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
 								Since different group of users have different expectations, there is no
 								silver bullet which solves all issues at once. Last but not least,
 								backward compatibility should be preserved whenever possible.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								Locale and operating system data
 								--------------------------------
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								.. _operating system data:
 								Python uses an encoding called the "filesystem encoding" to decide how
 								to encode and decode data from/to the operating system:
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								* file content
 								* command line arguments: ``sys.argv``
 								* standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
 								* environment variables: ``os.environ``
 								* filenames: ``os.listdir(str)`` for example
 								* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								* error messages: ``os.strerror(code)`` for example
 								* user and terminal names: ``os``, ``grp`` and ``pwd`` modules
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								* host name, UNIX socket path: see the ``socket`` module
 								* etc.
 								At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								``LC_CTYPE`` locale and then store the locale encoding as the
 								"filesystem error". It's possible to get this encoding using
 								``sys.getfilesystemencoding()``. In the whole lifetime of a Python
 								process, the same encoding and error handler are used to encode and
 								decode data from/to the operating system.
 								The ``os.fsdecode()`` and ``os.fsencode()`` functions can be used to
 								decode and encode operating system data. These functions use the
 								filesystem error handler: ``sys.getfilesystemencodeerrors()``.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								.. note::
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								   In some corner cases, the *current* ``LC_CTYPE`` locale must be used
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								   instead of ``sys.getfilesystemencoding()``. For example, the ``time``
 								   module uses the *current* ``LC_CTYPE`` locale to decode timezone
 								   names.
 								The POSIX locale and its encoding
 								---------------------------------
 								The following environment variables are used to configure the locale, in
 								this preference order:
 								* ``LC_ALL``, most important variable
 								* ``LC_CTYPE``
 								* ``LANG``
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								The POSIX locale, also known as "the C locale", is used:
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								* if the first set variable is set to ``"C"``
 								* if all these variables are unset, for example when a program is
 								  started in an empty environment.
 								The encoding of the POSIX locale must be ASCII or a superset of ASCII.
 								On Linux, the POSIX locale uses the ASCII encoding.
 								On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
 								the ASCII encoding, whereas ``mbstowcs()`` and ``wcstombs()`` functions
 								use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
 								``os.fsencode()`` and ``os.fsdecode()`` use
-												PEP 540: don't mention locale.getpreferredencoding

Replace locale.getpreferredencoding() with "locale encoding" to avoid
confusion.

											
										
										
											2017-01-12 12:13:38 -05:00
+								Python codec of the locale encoding. For example, if command line
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								arguments are decoded by ``mbstowcs()`` and encoded back by
 								``os.fsencode()``, an ``UnicodeEncodeError`` exception is raised instead
 								of retrieving the original byte string.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								To fix this issue, from Python 3.4, a check is made to see if ``mbstowcs()``
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								really uses the ASCII encoding if the the ``LC_CTYPE`` uses the the
 								POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
 								alias to ASCII). If not (the effective encoding is not ASCII), Python
 								uses its own ASCII codec instead of using ``mbstowcs()`` and
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								``wcstombs()`` functions for `operating system data`_.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								See the `POSIX locale (2016 Edition)
 								<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.
-												PEP 540

* Add "POSIX locale used by mistake" section
* Add a lot of issues in the Links section

											
										
										
											2017-01-06 07:57:10 -05:00
+								POSIX locale used by mistake
 								----------------------------
 								In many cases, the POSIX locale is not really expected by users who get
 								it by mistake. Examples:
 								* program started in an empty environment
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								* User forcing LANG=C to get messages in English
-												PEP 540

* Add "POSIX locale used by mistake" section
* Add a lot of issues in the Links section

											
										
										
											2017-01-06 07:57:10 -05:00
+								* LANG=C used for bad reasons, without being aware of the ASCII encoding
 								* SSH shell
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								* Linux installed with no configured locale
 								* chroot environment, Docker image, container, ... with no locale is
 								  configured
-												PEP 540

* Add "POSIX locale used by mistake" section
* Add a lot of issues in the Links section

											
										
										
											2017-01-06 07:57:10 -05:00
+								* User locale set to a non-existing locale, typo in the locale name for
 								  example
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								C.UTF-8 and C.utf8 locales
 								--------------------------
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								Some UNIX operating systems provide a variant of the POSIX locale using
 								the UTF-8 encoding:
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								* HP-UX: ``"C.utf8"``
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								It is not planned to add such locale to BSD systems.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								Popularity of the UTF-8 encoding
 								--------------------------------
 								Python 3 uses UTF-8 by default for Python source files.
 								On Mac OS X, Windows and Android, Python always use UTF-8 for operating
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								system data. For Windows, see `PEP 529`_: "Change Windows filesystem
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								encoding to UTF-8".
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
-												PEP 540: correcting english errors

											
										
										
											2017-01-08 11:08:01 -05:00
+								On Linux, UTF-8 became the de facto standard encoding,
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
 								using different encodings for filenames and standard streams is likely
 								to create mojibake, so UTF-8 is now used *everywhere*.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								The UTF-8 encoding is the default encoding of XML and JSON file formats.
 								As of January 2017, UTF-8 was used in `more than 88% of web pages
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								<https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
 								Javascript, CSS, etc.).
 								See `utf8everywhere.org <http://utf8everywhere.org/>`_ for more general
 								information on the UTF-8 codec.
 								.. note::
 								   Some applications and operating systems (especially Windows) use Byte
 								   Order Markers (BOM) to indicate the used Unicode encoding: UTF-7,
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								   UTF-8, UTF-16-LE, etc. BOM are not well supported and are rarely used in
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								   Python.
 								Old data stored in different encodings and surrogateescape
 								----------------------------------------------------------
-												PEP 540: correcting english errors

											
										
										
											2017-01-08 11:08:01 -05:00
+								Even if UTF-8 became the de facto standard, there are still systems in
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								the wild which don't use UTF-8. And there are a lot of data stored in
 								different encodings. For example, an old USB key using the ext3
 								filesystem with filenames encoded to ISO 8859-1.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								The Linux kernel and libc don't decode filenames: a filename is used
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								as a raw array of bytes. The common solution to support any filename is
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								to store filenames as bytes and don't try to decode them. When displayed
 								to stdout, mojibake is displayed if the filename and the terminal don't
 								use the same encoding.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								Python 3 promotes Unicode everywhere including filenames. A solution to
 								support filenames not decodable from the locale encoding was found: the
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								``surrogateescape`` error handler (`PEP 383`_), store undecodable bytes
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								as surrogate characters. This error handler is used by default for
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								`operating system data`_, for example, by ``os.fsdecode()`` and
 								``os.fsencode()`` (except on Windows which uses the ``strict`` error handler).
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								Standard streams
 								----------------
 								Python uses the locale encoding for standard streams: stdin, stdout and
 								stderr. The ``strict`` error handler is used by stdin and stdout to
 								prevent mojibake.
 								The ``backslashreplace`` error handler is used by stderr to avoid
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								Unicode encode errors when displaying non-ASCII text. It is especially
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								useful when the POSIX locale is used, because this locale usually uses
 								the ASCII encoding.
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								The problem is that `operating system data`_ like filenames are decoded
 								using the ``surrogateescape`` error handler (`PEP 383`_). Displaying a
-												PEP 540: correcting english errors

											
										
										
											2017-01-08 11:08:01 -05:00
+								filename to stdout raises a Unicode encode error if the filename
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								contains an undecoded byte stored as a surrogate character.
 								Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								POSIX locale is used: `issue #19977
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								<http://bugs.python.org/issue19977>`_. The idea is to pass through
 								`operating system data`_ even if it creates mojibake, because most UNIX
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								applications work like that. Most UNIX applications store filenames as
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								bytes, usually because bytes are first-citizen class in the used
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								programming language, whereas Unicode is badly supported.
 								.. note::
 								   The encoding and/or the error handler of standard streams can be
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								   overridden with the ``PYTHONIOENCODING`` environment variable.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								Proposal
 								========
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								Changes
 								-------
 								Add a new UTF-8 mode, disabled by default, to ignore the locale and
 								force the usage of the UTF-8 encoding with the ``surrogateescape`` error
 								handler, instead using the locale encoding (with ``strict`` or
 								``surrogateescape`` error handler depending on the case).
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								bother users with encodings, but it can produce mojibake. It can be
 								configured as strict to prevent mojibake: the UTF-8 encoding is used
-												PEP 540

Add examples for the "List a directory into stdout" use case.

											
										
										
											2017-01-11 16:32:24 -05:00
+								with the ``strict`` error handler for inputs and outputs, but the
 								``surrogateescape`` error handler is still used for `operating system
 								data`_.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								A new ``-X utf8`` command line option and a ``PYTHONUTF8`` environment
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								variable are added to control the UTF-8 mode. The UTF-8 mode is enabled
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								by using ``-X utf8`` or ``PYTHONUTF8=1``.  It can be configured as strict
 								by using ``-X utf8=strict`` or ``PYTHONUTF8=strict``. Other option values
 								fail with an error.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
 								The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
 								can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
-												Update 540 for Windows

Describe encodings and error handlers used on Windows and the
priority of PYTHONLEGACYWINDOWSFSENCODING.

											
										
										
											2017-01-12 07:26:21 -05:00
+								Options priority for the UTF-8 mode:
 								* ``PYTHONLEGACYWINDOWSFSENCODING``
 								* ``-X utf8``
 								* ``PYTHONUTF8``
 								* POSIX locale
 								For example, ``PYTHONUTF8=0 python3 -X utf8`` enables the UTF-8 mode,
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								whereas ``LC_ALL=C python3.7 -X utf8=0`` disables the UTF-8 mode and
 								uses the encoding of the POSIX locale.
-												Update 540 for Windows

Describe encodings and error handlers used on Windows and the
priority of PYTHONLEGACYWINDOWSFSENCODING.

											
										
										
											2017-01-12 07:26:21 -05:00
 								Encodings used by ``open()``, highest priority first:
 								* *encoding* and *errors* parameters (if set)
 								* UTF-8 mode
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								* ``os.device_encoding(fd)``
 								* ``os.getpreferredencoding(False)``
-												Update 540 for Windows

Describe encodings and error handlers used on Windows and the
priority of PYTHONLEGACYWINDOWSFSENCODING.

											
										
										
											2017-01-12 07:26:21 -05:00
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								Encoding and error handler
 								--------------------------
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								The UTF-8 mode changes the default encoding and error handler used by
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
 								``sys.stdout`` and ``sys.stderr``:
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								============================  =======================  ==========================  ==========================
-												Update 540 for Windows

Describe encodings and error handlers used on Windows and the
priority of PYTHONLEGACYWINDOWSFSENCODING.

											
										
										
											2017-01-12 07:26:21 -05:00
+								Function                      Default                  UTF-8 mode or POSIX locale  UTF-8 Strict mode
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								============================  =======================  ==========================  ==========================
 								open()                        locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape   **UTF-8**/surrogateescape
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								sys.stdin, sys.stdout         locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
 								sys.stderr                    locale/backslashreplace  **UTF-8**/backslashreplace  **UTF-8**/backslashreplace
 								============================  =======================  ==========================  ==========================
 								By comparison, Python 3.6 uses:
 								============================  =======================  ==========================
 								Function                      Default                  POSIX locale
 								============================  =======================  ==========================
 								open()                        locale/strict            locale/strict
 								os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape
 								sys.stdin, sys.stdout         locale/strict            locale/**surrogateescape**
 								sys.stderr                    locale/backslashreplace  locale/backslashreplace
 								============================  =======================  ==========================
 								The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
 								strict mode for convenience: the idea is that data not encoded to UTF-8
 								are passed through "Python" without being modified, as raw bytes.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								The ``PYTHONIOENCODING`` environment variable has priority over the
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								UTF-8 mode for standard streams. For example, ``PYTHONIOENCODING=latin1
 								python3 -X utf8`` uses the Latin1 encoding for stdin, stdout and stderr.
-												Update 540 for Windows

Describe encodings and error handlers used on Windows and the
priority of PYTHONLEGACYWINDOWSFSENCODING.

											
										
										
											2017-01-12 07:26:21 -05:00
+								Encoding and error handler on Windows
 								-------------------------------------
 								On Windows, the encodings and error handlers are different:
 								============================  =======================  ==========================  ==========================  ==========================
 								Function                      Default                  Legacy Windows FS encoding  UTF-8 mode                  UTF-8 Strict mode
 								============================  =======================  ==========================  ==========================  ==========================
 								open()                        mbcs/strict              mbcs/strict                 **UTF-8/surrogateescape**   **UTF-8**/strict
 								os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**            UTF-8/surrogatepass         UTF-8/surrogatepass
 								sys.stdin, sys.stdout         UTF-8/surrogateescape    UTF-8/surrogateescape       UTF-8/surrogateescape       **UTF-8/strict**
 								sys.stderr                    UTF-8/backslashreplace   UTF-8/backslashreplace      UTF-8/backslashreplace      UTF-8/backslashreplace
 								============================  =======================  ==========================  ==========================  ==========================
 								By comparison, Python 3.6 uses:
 								============================  =======================  ==========================
 								Function                      Default                  Legacy Windows FS encoding
 								============================  =======================  ==========================
 								open()                        mbcs/strict              mbcs/strict
 								os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**
 								sys.stdin, sys.stdout         UTF-8/surrogateescape    UTF-8/surrogateescape
 								sys.stderr                    UTF-8/backslashreplace   UTF-8/backslashreplace
 								============================  =======================  ==========================
 								The "Legacy Windows FS encoding" is enabled by setting the
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								``PYTHONLEGACYWINDOWSFSENCODING`` environment variable to ``1`` as specified
 								in `PEP 529` .
-												Update 540 for Windows

Describe encodings and error handlers used on Windows and the
priority of PYTHONLEGACYWINDOWSFSENCODING.

											
										
										
											2017-01-12 07:26:21 -05:00
 								Enabling the legacy Windows filesystem encoding disables the UTF-8 mode
 								(as ``-X utf8=0``).
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
 								``sys.output`` use ``mbcs`` encoding by default rather than UTF-8. But
 								with the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
-												Update 540 for Windows

Describe encodings and error handlers used on Windows and the
priority of PYTHONLEGACYWINDOWSFSENCODING.

											
										
										
											2017-01-12 07:26:21 -05:00
+								encoding.
 								There is no POSIX locale on Windows. The ANSI code page is used to the
 								locale encoding, and this code page never uses the ASCII encoding.
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								Rationale
 								---------
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								UTF-8 mode is disabled by default in order to keep hard Unicode errors when
 								encoding or decoding `operating system data`_ fails and preserve
 								backward compatibility. In addition, users will be better prepared for
 								mojibake if it is their responsibility to explicitly enable UTF-8 mode
 								than they would be if it was enabled *by default*.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								UTF-8 mode should be used on systems known to be configured with
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
 								the user overrides a locale *by mistake* or if a Python program is
 								started with no locale configured (and so with the POSIX locale).
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								Most UNIX applications handle `operating system data`_ as bytes, so
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								limited impact on how these data are handled by the application.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								The UTF-8 mode should help make Python more interoperable with
 								other UNIX applications on the system assuming that *UTF-8* is used
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								everywhere and that users *expect* UTF-8.
 								Ignoring ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
 								Python is more convenient, since they are more commonly misconfigured
 								*by mistake* (configured to use an encoding different than UTF-8,
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								whereas the system uses UTF-8), rather than being misconfigured by
 								intent.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								Expected mojibake and surrogate character issues
 								------------------------------------------------
 								The UTF-8 mode only affects code running directly in Python, especially
 								code written in pure Python. The other code, called "external code"
 								here, is not aware of this mode. Examples:
 								* C libraries called by Python modules like OpenSSL
 								* The application code when Python is embedded in an application
 								In the UTF-8 mode, Python uses the ``surrogateescape`` error handler
 								which stores bytes not decodable from UTF-8 as surrogate characters.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								If the external code uses the locale and the locale encoding is UTF-8,
 								it should work fine.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								External code using bytes
 								^^^^^^^^^^^^^^^^^^^^^^^^^
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								If the external code processes data as bytes, surrogate characters are not
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								an issue since they are only used inside Python. Python encodes back
 								surrogate characters to bytes at the edges, before calling external
 								code.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								The UTF-8 mode can produce mojibake since Python and external code don't
 								both of invalid bytes, but it's a deliberate choice. The UTF-8 mode can
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								be configured as strict to prevent mojibake and fail early when data
-												PEP 540

Add examples for the "List a directory into stdout" use case.

											
										
										
											2017-01-11 16:32:24 -05:00
+								is not decodable from UTF-8 or not encodable to UTF-8.
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
 								External code using text
 								^^^^^^^^^^^^^^^^^^^^^^^^
 								If the external code uses text API, for example using the ``wchar_t*`` C
 								type, mojibake should not occur, but the external code can fail on
 								surrogate characters.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
 								Use Cases
 								=========
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								The following use cases were written to highlight the impact of
 								the chosen encodings and error handlers on concrete examples.
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
 								The "Always work" results were written to prove the benefit of having a
 								UTF-8 mode which works with any data and any locale, compared to the
 								existing old Python versions.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								The "Mojibake" column shows that ignoring the locale causes a practical
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								issue: the UTF-8 mode produces mojibake if the terminal doesn't use the
 								UTF-8 encoding.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								List a directory into stdout
 								----------------------------
 								Script listing the content of the current directory into stdout::
 								    import os
 								    for name in os.listdir(os.curdir):
 								        print(name)
 								Result:
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								========================  =============  =========
 								Python                    Always works?  Mojibake?
 								========================  =============  =========
 								Python 2                  **Yes**        **Yes**
 								Python 3                  No             No
 								Python 3.5, POSIX locale  **Yes**        **Yes**
 								UTF-8 mode                **Yes**        **Yes**
 								UTF-8 Strict mode         No             No
 								========================  =============  =========
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
 								"No" means that the script can fail on decoding or encoding a filename
 								depending on the locale or the filename.
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								To be able to always work, the program must be able to produce mojibake.
 								Mojibake is more user friendly than an error with a truncated or empty
 								output.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								For example, using a directory which contains a file called ``b'xxx\xff'``
-												PEP 540

Add examples for the "List a directory into stdout" use case.

											
										
										
											2017-01-11 16:32:24 -05:00
+								(the byte ``0xFF`` is invalid in UTF-8).
 								Default and UTF-8 Strict mode fail on ``print()`` with an encode error::
 								    $ python3.7 ../ls.py
 								    Traceback (most recent call last):
 								      File "../ls.py", line 5, in <module>
 								        print(name)
 								    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
 								    $ python3.7 -X utf8=strict ../ls.py
 								    Traceback (most recent call last):
 								      File "../ls.py", line 5, in <module>
 								        print(name)
 								    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' ...
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								UTF-8 mode, POSIX locale, Python 2 and the UNIX ``ls`` command work
-												PEP 540

Add examples for the "List a directory into stdout" use case.

											
										
										
											2017-01-11 16:32:24 -05:00
+								but display mojibake::
 								    $ python3.7 -X utf8 ../ls.py
 								    xxx<78>
 								    $ LC_ALL=C /python3.6 ../ls.py
 								    xxx<78>
 								    $ python2 ../ls.py
 								    xxx<78>
 								    $ ls
 								    'xxx'$'\377'
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
 								List a directory into a text file
 								---------------------------------
 								Similar to the previous example, except that the listing is written into
 								a text file::
 								    import os
 								    names = os.listdir(os.curdir)
 								    with open("/tmp/content.txt", "w") as fp:
 								        for name in names:
 								            fp.write("%s\n" % name)
 								Result:
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								========================  =============  =========
 								Python                    Always works?  Mojibake?
 								========================  =============  =========
 								Python 2                  **Yes**        **Yes**
 								Python 3                  No             No
 								Python 3.5, POSIX locale  No             No
 								UTF-8 mode                **Yes**        **Yes**
 								UTF-8 Strict mode         No             No
 								========================  =============  =========
 								"Yes" implies that mojibake can be produced. "No" means that the script
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								can fail on decoding or encoding a filename depending on the locale or
 								the filename. Typical error::
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
 								    $ LC_ALL=C python3 test.py
 								    Traceback (most recent call last):
 								      File "test.py", line 5, in <module>
 								        fp.write("%s\n" % name)
 								    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
 								Display Unicode characters into stdout
 								--------------------------------------
 								Very basic example used to illustrate a common issue, display the euro sign
 								(U+20AC: €)::
 								    print("euro: \u20ac")
 								Result:
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								========================  =============  =========
 								Python                    Always works?  Mojibake?
 								========================  =============  =========
 								Python 2                  No             No
 								Python 3                  No             No
 								Python 3.5, POSIX locale  No             No
 								UTF-8 mode                **Yes**        **Yes**
 								UTF-8 Strict mode         **Yes**        **Yes**
 								========================  =============  =========
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								The UTF-8 and UTF-8 Strict modes will always encode the euro sign as
 								UTF-8. If the terminal uses a different encoding, we get mojibake.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
 								Replace a word in a text
 								------------------------
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								The following script replaces the word "apple" with "orange". It
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								reads input from stdin and writes the output into stdout::
 								    import sys
 								    text = sys.stdin.read()
 								    sys.stdout.write(text.replace("apple", "orange"))
 								Result:
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								========================  =============  =========
 								Python                    Always works?  Mojibake?
 								========================  =============  =========
 								Python 2                  **Yes**        **Yes**
 								Python 3                  No             No
 								Python 3.5, POSIX locale  **Yes**        **Yes**
 								UTF-8 mode                **Yes**        **Yes**
 								UTF-8 Strict mode         No             No
 								========================  =============  =========
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
 								Producer-consumer model using pipes
 								-----------------------------------
 								Let's say that we have a "producer" program which writes data into its
 								stdout and a "consumer" program which reads data from its stdin.
 								On a shell, such programs are run with the command::
 								    producer | consumer
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								The question is if these programs will work with any data and any locale.
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								UNIX users don't expect Unicode errors, and so expect that such programs
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								"just work".
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
 								If the producer only produces ASCII output, no error should occur. Let's
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								say the that the producer writes at least one non-ASCII character (at least
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								one byte in the range ``0x80..0xff``).
 								To simplify the problem, let's say that the consumer has no output
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								(doesn't write results into a file or stdout).
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
 								A "Bytes producer" is an application which cannot fail with a Unicode
 								error and produces bytes into stdout.
 								Let's say that a "Bytes consumer" does not decode stdin but stores data
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								as bytes: such a consumer always works. Common UNIX command line tools like
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								``cat``, ``grep`` or ``sed`` are in this category. Many Python 2
 								applications are also in this category.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								"Python producer" and "Python consumer" are a producer and consumer
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								implemented in Python.
 								Bytes producer, Bytes consumer
 								^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								It always works, but it is out of the scope of this PEP since it doesn't
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								involve Python.
 								Python producer, Bytes consumer
 								^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 								Python producer::
 								    print("euro: \u20ac")
 								Result:
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								========================  =============  =========
 								Python                    Always works?  Mojibake?
 								========================  =============  =========
 								Python 2                  No             No
 								Python 3                  No             No
 								Python 3.5, POSIX locale  No             No
 								UTF-8 mode                **Yes**        **Yes**
 								UTF-8 Strict mode         No             No
 								========================  =============  =========
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
 								The question here is not if the consumer is able to decode the input,
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								but if Python is able to produce its output. So it's similar to the
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								`Display Unicode characters into stdout`_ case.
 								UTF-8 modes work with any locale since the consumer doesn't try to
 								decode its stdin.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								Bytes producer, Python consumer
 								^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
+								Python consumer::
 								    import sys
 								    text = sys.stdin.read()
 								    result = text.replace("apple", "orange")
 								    # ignore the result
 								Result:
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								========================  =============  =========
 								Python                    Always works?  Mojibake?
 								========================  =============  =========
 								Python 2                  **Yes**        **Yes**
 								Python 3                  No             No
 								Python 3.5, POSIX locale  **Yes**        **Yes**
 								UTF-8 mode                **Yes**        **Yes**
 								UTF-8 Strict mode         No             No
 								========================  =============  =========
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
 								Python 3 fails on decoding stdin depending on the input and the locale.
 								Python producer, Python consumer
 								^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 								Python producer::
 								    print("euro: \u20ac")
 								Python consumer::
 								    import sys
 								    text = sys.stdin.read()
 								    result = text.replace("apple", "orange")
 								    # ignore the result
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								Result, using the same Python version for the producer and the consumer:
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								========================  =============  =========
 								Python                    Always works?  Mojibake?
 								========================  =============  =========
 								Python 2                  No             No
 								Python 3                  No             No
 								Python 3.5, POSIX locale  No             No
 								UTF-8 mode                **Yes**        **Yes**
 								UTF-8 Strict mode         No             No
 								========================  =============  =========
-												PEP 540

* add section: "It's not a bug, you must fix your locale" is not an
  acceptable answer
* elaborate the "Expected mojibake and surrogate character issues"
  section
* Add the "Producer-consumer model using pipes" use case

											
										
										
											2017-01-06 20:35:27 -05:00
 								This case combines a Python producer with a Python consumer, so the
 								result is the subset of `Python producer, Bytes consumer`_ and `Bytes
 								producer, Python consumer`_.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								Backward Compatibility
 								======================
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								The main backward incompatible change is that the UTF-8 encoding is now
-												PEP 540

Add examples for the "List a directory into stdout" use case.

											
										
										
											2017-01-11 16:32:24 -05:00
+								used by default if the locale is POSIX. Since the UTF-8 encoding is used
 								with the ``surrogateescape`` error handler, encoding errors should not
 								occur and so the change should not break applications.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
 								The more likely source of trouble comes from external libraries. Python
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								can successfully decode data from UTF-8, but a library using the locale
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								encoding can fail to encode the decoded text back to bytes.  Hopefully,
 								encoding text in a library is a rare operation. Very few libraries
 								expect text, most libraries expect bytes and even manipulate bytes
 								internally.
-												PEP 540

Add examples for the "List a directory into stdout" use case.

											
										
										
											2017-01-11 16:32:24 -05:00
+								The PEP only changes the default behaviour if the locale is POSIX. For
 								other locales, the *default* behaviour is unchanged.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								Alternatives
 								============
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								Don't modify the encoding of the POSIX locale
 								---------------------------------------------
 								A first version of the PEP did not change the encoding and error handler
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								used for the POSIX locale.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
-												PEP 540

Add examples for the "List a directory into stdout" use case.

											
										
										
											2017-01-11 16:32:24 -05:00
+								The problem is that adding the ``-X utf8`` command line option or
 								setting the ``PYTHONUTF8`` environment variable is not possible in some
 								cases, or at least not convenient.
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								Moreover, many users simply expect that Python 3 behaves like Python 2:
 								it doesn't bother them with encodings and "just works" in all cases. These
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								users don't worry about mojibake, or even expect mojibake because of
 								complex documents using multiple incompatibles encodings.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								Always use UTF-8
 								----------------
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								Python already always uses the UTF-8 encoding on Mac OS X, Android and
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								Windows.  Since UTF-8 became the de facto encoding, it makes sense to
 								always use it on all platforms with any locale.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								The risk is to introduce mojibake if the locale uses a different
 								encoding, especially for locales other than the POSIX locale.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								Force UTF-8 for the POSIX locale
 								--------------------------------
 								An alternative to always using UTF-8 in any case is to only use UTF-8 when the
 								``LC_CTYPE`` locale is the POSIX locale.
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								`PEP 538`_ "Coercing the legacy C locale to C.UTF-8" by Nick
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								Coghlan proposes to implement that using the ``C.UTF-8`` locale.
 								Use the strict error handler for operating system data
 								------------------------------------------------------
 								Using the ``surrogateescape`` error handler for `operating system data`_
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								creates surprising surrogate characters. No Python codec (except for
 								``utf-7``) accepts surrogates so encoding text coming from the
 								operating system is likely to raise an error. The problem is that
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								the error comes late, very far from where the data was read.
 								The ``strict`` error handler can be used instead to decode
 								(``os.fsdecode()``) and encode (``os.fsencode()``) operating system
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								data and raise encoding errors as soon as possible. Using it helps find
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								bugs more quickly.
 								The main drawback of this strategy is that it doesn't work in practice.
 								Python 3 is designed on top on Unicode strings. Most functions expect
 								Unicode and produce Unicode. Even if many operating system functions
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								have two flavors, bytes and Unicode, the Unicode flavor is used in most
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								cases. There are good reasons for that: Unicode is more convenient in
 								Python 3 and using Unicode helps to support the full Unicode Character
 								Set (UCS) on Windows (even if Python now uses UTF-8 since Python 3.6,
 								see the `PEP 528`_ and the `PEP 529`_).
 								For example, if ``os.fsdecode()`` uses ``utf8/strict``,
 								``os.listdir(str)`` fails to list filenames of a directory if a single
 								filename is not decodable from UTF-8. As a consequence,
 								``shutil.rmtree(str)`` fails to remove a directory. Undecodable
 								filenames, environment variables, etc. are simply too common to make
 								this alternative viable.
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								Links
 								=====
 								PEPs:
-												PEP 540

* Strict mode doesn't use strict for OS data anymore: keep
  surrogateesscape, explain why in a new alternative
* Define the priority between env vars and cmdline options to choose
  encodings and error handlers

											
										
										
											2017-01-11 16:08:40 -05:00
+								* `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
 								  "Coercing the legacy C locale to C.UTF-8"
 								* `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
 								  "Change Windows filesystem encoding to UTF-8"
 								* `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
 								  "Change Windows console encoding to UTF-8"
 								* `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
 								  "Non-decodable Bytes in System Character Interfaces"
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
-												PEP 540

* Add "POSIX locale used by mistake" section
* Add a lot of issues in the Links section

											
										
										
											2017-01-06 07:57:10 -05:00
+								Main Python issues:
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
-												PEP 540: add link to the implementation

											
										
										
											2017-01-11 06:30:36 -05:00
+								* `Issue #29240: Implementation of the PEP 540: Add a new UTF-8 mode
 								  <http://bugs.python.org/issue29240>`_
 								* `Issue #28180: sys.getfilesystemencoding() should default to utf-8
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								  <http://bugs.python.org/issue28180>`_
-												PEP 540

* Add "POSIX locale used by mistake" section
* Add a lot of issues in the Links section

											
										
										
											2017-01-06 07:57:10 -05:00
+								* `Issue #19977: Use "surrogateescape" error handler for sys.stdin and
 								  sys.stdout on UNIX for the C locale
 								  <http://bugs.python.org/issue19977>`_
 								* `Issue #19847: Setting the default filesystem-encoding
 								  <http://bugs.python.org/issue19847>`_
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
+								* `Issue #8622: Add PYTHONFSENCODING environment variable
 								  <https://bugs.python.org/issue8622>`_: added but reverted because of
 								  many issues, read the `Inconsistencies if locale and filesystem
 								  encodings are different
 								  <https://mail.python.org/pipermail/python-dev/2010-October/104509.html>`_
 								  thread on the python-dev mailing list
-												PEP 540

* Add "POSIX locale used by mistake" section
* Add a lot of issues in the Links section

											
										
										
											2017-01-06 07:57:10 -05:00
+								Incomplete list of Python issues related to Unicode errors, especially
 								with the POSIX locale:
 								* 2016-12-22: `LANG=C python3 -c "import os; os.path.exists('\xff')"
 								  <http://bugs.python.org/issue29042#msg283821>`_
 								* 2014-07-20: `issue #22016: Add a new 'surrogatereplace' output only error handler
 								  <http://bugs.python.org/issue22016>`_
 								* 2014-04-27: `Issue #21368: Check for systemd locale on startup if current
 								  locale is set to POSIX <http://bugs.python.org/issue21368>`_ -- read manually
 								  /etc/locale.conf when the locale is POSIX
 								* 2014-01-21: `Issue #20329: zipfile.extractall fails in Posix shell with utf-8
 								  filename
 								  <http://bugs.python.org/issue20329>`_
 								* 2013-11-30: `Issue #19846: Python 3 raises Unicode errors with the C locale
 								  <http://bugs.python.org/issue19846>`_
 								* 2010-05-04: `Issue #8610: Python3/POSIX:  errors if file system encoding is None
 								  <http://bugs.python.org/issue8610>`_
 								* 2013-08-12: `Issue #18713: Clearly document the use of PYTHONIOENCODING to
 								  set surrogateescape <http://bugs.python.org/issue18713>`_
 								* 2013-09-27: `Issue #19100: Use backslashreplace in pprint
 								  <http://bugs.python.org/issue19100>`_
 								* 2012-01-05: `Issue #13717: os.walk() + print fails with UnicodeEncodeError
 								  <http://bugs.python.org/issue13717>`_
 								* 2011-12-20: `Issue #13643: 'ascii' is a bad filesystem default encoding
 								  <http://bugs.python.org/issue13643>`_
 								* 2011-03-16: `issue #11574: TextIOWrapper should use UTF-8 by default for the
 								  POSIX locale
 								  <http://bugs.python.org/issue11574>`_, thread on python-dev:
 								  `Low-Level Encoding Behavior on Python 3
 								  <https://mail.python.org/pipermail/python-dev/2011-March/109361.html>`_
 								* 2010-04-26: `Issue #8533: regrtest: use backslashreplace error handler for
 								  stdout <http://bugs.python.org/issue8533>`_, regrtest fails with Unicode
 								  encode error if the locale is POSIX
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								Some issues are real bugs in applications which must explicitly set the
-												PEP 540

* Add "POSIX locale used by mistake" section
* Add a lot of issues in the Links section

											
										
										
											2017-01-06 07:57:10 -05:00
+								encoding. Well, it just works in the common case (locale configured
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								correctly), so what? The program "suddenly" fails when the POSIX
 								locale is used (probably for bad reasons). Such bugs are not well
 								understood by users. Example of such issues:
-												PEP 540

* Add "POSIX locale used by mistake" section
* Add a lot of issues in the Links section

											
										
										
											2017-01-06 07:57:10 -05:00
 								* 2013-11-21: `pip: open() uses the locale encoding to parse Python
 								  script, instead of the encoding cookie
 								  <http://bugs.python.org/issue19685>`_ -- pip must use the encoding
 								  cookie to read a Python source code file
 								* 2011-01-21: `IDLE 3.x can crash decoding recent file list
 								  <http://bugs.python.org/issue10974>`_
-												Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links

											
										
										
											2017-01-05 17:54:22 -05:00
 								Prior Art
 								=========
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
 								Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
-												Fix a couple of issues with pep0540 (#252)


											
										
										
											2017-05-08 18:24:28 -04:00
+								variable to force UTF-8: see `perlrun
-												Add PEP 540: Add a new UTF-8 mode

											
										
										
											2017-01-05 07:46:03 -05:00
+								<http://perldoc.perl.org/perlrun.html>`_. It is possible to configure
 								UTF-8 per standard stream, on input and output streams, etc.
 								Copyright
 								=========
 								This document has been placed in the public domain.