PEP: 540
Title: Add a new UTF-8 Mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <vstinner@python.org>
BDFL-Delegate: INADA Naoki
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 05-Jan-2016
Python-Version: 3.7
Resolution: https://mail.python.org/pipermail/python-dev/2017-December/151173.html


Abstract
========

Add a new "UTF-8 Mode" to enhance Python's use of UTF-8. When UTF-8 Mode
is active, Python will:

* use the ``utf-8`` encoding, regardless of the locale currently set by
  the current platform, and
* change the ``stdin`` and ``stdout`` error handlers to
  ``surrogateescape``.

This mode is off by default, but is automatically activated when using
the "POSIX" locale.

Add the ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable to control UTF-8 Mode.


Rationale
=========

Locale encoding and UTF-8
-------------------------

Python 3.6 uses the locale encoding for filenames, environment
variables, standard streams, etc. The locale encoding is inherited from
the locale; the encoding and the locale are tightly coupled.

Many users inherit the ASCII encoding from the POSIX locale, aka the "C"
locale, but are unable to change the locale for various reasons. This
encoding is very limited in terms of Unicode support: any non-ASCII
character is likely to cause trouble.

It isn't always easy to get an accurate locale. Locales do not have
exactly the same names on different Linux distributions, FreeBSD, macOS,
etc. And some locales, like the recent ``C.UTF-8`` locale, are only
supported by a few platforms. The current locale can even vary on the
*same* platform depending on context; for example, an SSH connection can
use a different encoding than the filesystem or local terminal encoding
on the same machine.

On the flip side, Python 3.6 is already using UTF-8 by default on macOS,
Android and Windows (:pep:`529`) for most functions -- although
``open()`` is a notable exception here. UTF-8 is also the default
encoding of Python scripts, XML and JSON file formats. The Go
programming language uses UTF-8 for all strings.

UTF-8 support is nearly ubiquitous for data read and written by modern
platforms. It also has excellent support in Python. The problem is
simply that the locale is frequently misconfigured. An obvious solution
suggests itself: ignore the locale encoding and use UTF-8.


Passthrough for undecodable bytes: surrogateescape
--------------------------------------------------

When decoding bytes from UTF-8 using the default ``strict`` error
handler, Python 3 raises a ``UnicodeDecodeError`` on the first
undecodable byte.

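As a minimal illustration of this failure, the sketch below decodes a
byte string that is valid Latin-1 but not valid UTF-8. The sample bytes
are an assumption chosen for the example, not something taken from this
PEP.

```python
# b"caf\xe9" is "café" encoded as Latin-1. In UTF-8, the byte 0xE9
# announces a multi-byte sequence, but no continuation bytes follow.
data = b"caf\xe9"

try:
    data.decode("utf-8")  # errors="strict" is the default handler
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first undecodable byte
    print(f"cannot decode byte at position {failed_at}: {exc.reason}")
```
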
Unix command line tools like ``cat`` or ``grep`` and most Python 2
applications simply do not have this class of bugs: they don't decode
data, but process data as a raw bytes sequence.

Python 3 already has a solution to behave like Unix tools and Python 2:
the ``surrogateescape`` error handler (:pep:`383`). It allows processing
data as if it were bytes, but uses Unicode in practice; undecodable
bytes are stored as surrogate characters.

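The round trip can be sketched as follows. The input bytes are a
hypothetical example; the point is only that decoding and re-encoding
with ``surrogateescape`` preserves bytes that are not valid UTF-8.

```python
raw = b"caf\xe9"  # hypothetical input: not valid UTF-8

# Each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# The result is an ordinary str and can be processed as text ...
upper = text.upper()

# ... and encoding with the same handler turns each surrogate back
# into the original byte.
restored = upper.encode("utf-8", errors="surrogateescape")
print(restored)
```

Note that such a string cannot be encoded with the ``strict`` handler:
the lone surrogates only survive when ``surrogateescape`` is used on
both sides.
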
UTF-8 Mode sets the ``surrogateescape`` error handler for ``stdin``
and ``stdout``, since these streams are commonly associated with Unix
command line tools.

However, users have different expectations for files. Files are expected
to be properly encoded, and Python is expected to fail early when
``open()`` is called with the wrong options, like opening a JPEG picture
in text mode. The ``open()`` default error handler remains ``strict``
for these reasons.


No change by default for best backward compatibility
----------------------------------------------------

While UTF-8 is perfect in most cases, sometimes the locale encoding is
actually the best encoding.

This PEP changes the behaviour for the POSIX locale since this locale is
usually equivalent to the ASCII encoding, whereas UTF-8 is a much better
choice. It does not change the behaviour for other locales, to avoid any
risk of regression.

Since users must explicitly enable the new UTF-8 Mode for these other
locales, they are responsible for any potential mojibake issues caused
by UTF-8 Mode.


Proposal
========

Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale
encoding, and change ``stdin`` and ``stdout`` error handlers to
``surrogateescape``.

Add the new ``-X utf8`` command line option and ``PYTHONUTF8``
environment variable. Users can explicitly activate UTF-8 Mode with the
command-line option ``-X utf8`` or by setting the environment variable
``PYTHONUTF8=1``.

This mode is disabled by default and enabled by the POSIX locale. Users
can explicitly disable UTF-8 Mode with the command-line option
``-X utf8=0`` or by setting the environment variable ``PYTHONUTF8=0``.

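The four switches above can be exercised with a small script. This is a
sketch rather than part of the proposal: it assumes Python 3.7 or later,
where the interpreter exposes the current setting as
``sys.flags.utf8_mode``, and the helper name ``child_utf8_mode`` is
hypothetical.

```python
import os
import subprocess
import sys


def child_utf8_mode(x_options=(), env_overrides=None):
    """Start a child interpreter and return its sys.flags.utf8_mode."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    env.update(env_overrides or {})
    cmd = [sys.executable]
    for option in x_options:
        cmd += ["-X", option]
    cmd += ["-c", "import sys; print(sys.flags.utf8_mode)"]
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return int(result.stdout.decode("ascii").strip())


# Explicitly enabled: command-line option, then environment variable.
print(child_utf8_mode(x_options=["utf8"]))
print(child_utf8_mode(env_overrides={"PYTHONUTF8": "1"}))

# Explicitly disabled: command-line option, then environment variable.
print(child_utf8_mode(x_options=["utf8=0"]))
print(child_utf8_mode(env_overrides={"PYTHONUTF8": "0"}))
```
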
For standard streams, the ``PYTHONIOENCODING`` environment variable has
priority over UTF-8 Mode.

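A quick way to observe this priority is to set both variables for a
child interpreter. In this sketch, ``latin-1`` is an arbitrary value
chosen for illustration, and ``codecs.lookup()`` is used only to
normalize the spelling of the encoding names.

```python
import codecs
import os
import subprocess
import sys

env = dict(os.environ)
env.pop("PYTHONLEGACYWINDOWSFSENCODING", None)
env["PYTHONUTF8"] = "1"
env["PYTHONIOENCODING"] = "latin-1"  # illustrative choice

code = "import sys; print(sys.stdout.encoding); print(sys.getfilesystemencoding())"
result = subprocess.run(
    [sys.executable, "-c", code],
    env=env, stdout=subprocess.PIPE, check=True,
)
reported = result.stdout.decode("ascii").split()

stdout_encoding = codecs.lookup(reported[0]).name
fs_encoding = codecs.lookup(reported[1]).name

print(stdout_encoding)  # the standard stream follows PYTHONIOENCODING
print(fs_encoding)      # the filesystem encoding still follows UTF-8 Mode
```
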
On Windows, the ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable
(:pep:`529`) has priority over UTF-8 Mode.

Effects of UTF-8 Mode:

* ``sys.getfilesystemencoding()`` returns ``'UTF-8'``.
* ``locale.getpreferredencoding()`` returns ``'UTF-8'``; its
  *do_setlocale* argument, and the locale encoding, are ignored.
* ``sys.stdin`` and ``sys.stdout`` error handler is set to
  ``surrogateescape``.

Side effects:

* ``open()`` uses the UTF-8 encoding by default. However, it still
  uses the ``strict`` error handler by default.
* ``os.fsdecode()`` and ``os.fsencode()`` use the UTF-8 encoding.
* Command line arguments, environment variables and filenames use the
  UTF-8 encoding.

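Some of the effects listed above can be checked from a child interpreter
started with ``-X utf8``. The sketch below assumes Python 3.7 or later.
Because the exact spelling of the reported names (``'UTF-8'`` versus
``'utf-8'``) may differ between Python versions, the names are
normalized with ``codecs.lookup()`` before being compared.

```python
import codecs
import os
import subprocess
import sys

CHILD_CODE = """
import locale, os, sys
print(sys.getfilesystemencoding())
print(locale.getpreferredencoding())
with open(os.devnull) as f:  # no encoding= argument: the default is used
    print(f.encoding)
    print(f.errors)
"""

env = dict(os.environ)
# Drop variables that would take priority over, or interfere with, -X utf8.
for name in ("PYTHONUTF8", "PYTHONLEGACYWINDOWSFSENCODING"):
    env.pop(name, None)

result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", CHILD_CODE],
    env=env, stdout=subprocess.PIPE, check=True,
)
fs_enc, preferred_enc, open_enc, open_errors = result.stdout.decode("ascii").split()

# Normalize the spelling of each encoding name before comparing.
normalized = [codecs.lookup(name).name for name in (fs_enc, preferred_enc, open_enc)]
print(normalized, open_errors)
```
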
Relationship with the locale coercion (PEP 538)
===============================================

The POSIX locale enables the locale coercion (:pep:`538`) and the UTF-8
Mode (:pep:`540`). When the locale coercion is enabled, enabling the
UTF-8 Mode has no additional effect.

The UTF-8 Mode has the same effect as locale coercion:

* ``sys.getfilesystemencoding()`` returns ``'UTF-8'``,
* ``locale.getpreferredencoding()`` returns ``'UTF-8'``, and
* the ``sys.stdin`` and ``sys.stdout`` error handlers are set to
  ``surrogateescape``.

These changes only affect Python code. But the locale coercion has
additional effects: the ``LC_CTYPE`` environment variable and the
``LC_CTYPE`` locale are set to a UTF-8 locale like ``C.UTF-8``. One side
effect is that non-Python code is also impacted by the locale coercion.
The two PEPs are complementary.

On platforms like CentOS 7 where locale coercion is not supported, the
POSIX locale only enables UTF-8 Mode. In this case, Python code uses
the UTF-8 encoding and ignores the locale encoding, whereas non-Python
code uses the locale encoding, which is usually ASCII for the POSIX
locale.

While the UTF-8 Mode is supported on all platforms and can be enabled
with any locale, the locale coercion is not supported by all platforms
and is restricted to the POSIX locale.

The UTF-8 Mode only has an impact on Python child processes when the
``PYTHONUTF8`` environment variable is set to ``1``, whereas the locale
coercion sets the ``LC_CTYPE`` environment variable, which impacts all
child processes.

The benefit of the locale coercion approach is that it helps ensure that
encoding handling in binary extension modules and child processes is
consistent with Python's encoding handling. The upside of the UTF-8 Mode
approach is that it allows an embedding application to change the
interpreter's behaviour without having to change the process global
locale settings.


Backward Compatibility
======================

The only backward incompatible change is that the POSIX locale now
enables the UTF-8 Mode by default: it will now use the UTF-8 encoding,
ignore the locale encoding, and change ``stdin`` and ``stdout`` error
handlers to ``surrogateescape``.

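On a POSIX platform, this change can be observed by forcing the "C"
locale for a child interpreter. The sketch below assumes Python 3.7 or
later and relies on ``sys.flags.utf8_mode`` to report the mode. On
Windows, which has no POSIX locale, the result depends only on the
interpreter's defaults.

```python
import os
import subprocess
import sys

env = dict(os.environ)
# Remove settings that would override the locale-based default.
for name in ("PYTHONUTF8", "PYTHONIOENCODING", "LC_CTYPE", "LANG"):
    env.pop(name, None)
env["LC_ALL"] = "C"  # LC_ALL takes precedence over LC_CTYPE and LANG

code = "import sys; print(sys.flags.utf8_mode, sys.stdout.encoding, sys.stdout.errors)"
result = subprocess.run(
    [sys.executable, "-c", code],
    env=env, stdout=subprocess.PIPE, check=True,
)
mode, encoding, errors = result.stdout.decode("ascii").split()
print(mode, encoding, errors)
```
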
Annex: Encodings And Error Handlers
===================================

UTF-8 Mode changes the default encoding and error handler used by
``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
``sys.stdout`` and ``sys.stderr``.

Encoding and error handler
--------------------------

============================  =======================  ==========================
Function                      Default                  UTF-8 Mode or POSIX locale
============================  =======================  ==========================
open()                        locale/strict            **UTF-8**/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape
sys.stdin, sys.stdout         locale/strict            **UTF-8/surrogateescape**
sys.stderr                    locale/backslashreplace  **UTF-8**/backslashreplace
============================  =======================  ==========================

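The standard stream rows of the UTF-8 Mode column can be reproduced with
a short script. This is a sketch assuming Python 3.7 or later; the three
streams of the child are redirected so that the result does not depend
on the terminal in use.

```python
import codecs
import os
import subprocess
import sys

CHILD_CODE = """
import sys
for name in ("stdin", "stdout", "stderr"):
    stream = getattr(sys, name)
    print(name, stream.encoding, stream.errors)
"""

env = dict(os.environ)
# PYTHONIOENCODING would take priority over UTF-8 Mode for these streams.
for name in ("PYTHONUTF8", "PYTHONIOENCODING"):
    env.pop(name, None)

result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", CHILD_CODE],
    env=env,
    stdin=subprocess.DEVNULL,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    check=True,
)

# Map each stream name to (normalized encoding name, error handler).
handlers = {}
for line in result.stdout.decode("ascii").splitlines():
    name, encoding, errors = line.split()
    handlers[name] = (codecs.lookup(encoding).name, errors)

for name, value in handlers.items():
    print(name, value)
```
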
By comparison, Python 3.6 uses:

============================  =======================  ==========================
Function                      Default                  POSIX locale
============================  =======================  ==========================
open()                        locale/strict            locale/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape
sys.stdin, sys.stdout         locale/strict            locale/**surrogateescape**
sys.stderr                    locale/backslashreplace  locale/backslashreplace
============================  =======================  ==========================

Encoding and error handler on Windows
-------------------------------------

On Windows, the encodings and error handlers are different:

============================  =======================  ==========================  ==========================
Function                      Default                  Legacy Windows FS encoding  UTF-8 Mode
============================  =======================  ==========================  ==========================
open()                        mbcs/strict              mbcs/strict                 **UTF-8**/strict
os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**            UTF-8/surrogatepass
sys.stdin, sys.stdout         UTF-8/surrogateescape    UTF-8/surrogateescape       UTF-8/surrogateescape
sys.stderr                    UTF-8/backslashreplace   UTF-8/backslashreplace      UTF-8/backslashreplace
============================  =======================  ==========================  ==========================

By comparison, Python 3.6 uses:

============================  =======================  ==========================
Function                      Default                  Legacy Windows FS encoding
============================  =======================  ==========================
open()                        mbcs/strict              mbcs/strict
os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**
sys.stdin, sys.stdout         UTF-8/surrogateescape    UTF-8/surrogateescape
sys.stderr                    UTF-8/backslashreplace   UTF-8/backslashreplace
============================  =======================  ==========================

The "Legacy Windows FS encoding" is enabled by the
``PYTHONLEGACYWINDOWSFSENCODING`` environment variable.

If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
``sys.stdout`` uses ``mbcs`` encoding by default rather than UTF-8.
But in UTF-8 Mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
encoding.

.. note::
   There is no POSIX locale on Windows. The ANSI code page is used as
   the locale encoding, and this code page never uses the ASCII
   encoding.


Links
=====

* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 Mode
  <http://bugs.python.org/issue29240>`_
* :pep:`538`:
  "Coercing the legacy C locale to C.UTF-8"
* :pep:`529`:
  "Change Windows filesystem encoding to UTF-8"
* :pep:`528`:
  "Change Windows console encoding to UTF-8"
* :pep:`383`:
  "Non-decodable Bytes in System Character Interfaces"


Post History
============

* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 Mode
  <https://mail.python.org/pipermail/python-dev/2017-December/151054.html>`_
* 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
  540 (assuming UTF-8 for *nix system boundaries)
  <https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_
* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 Mode
  <https://mail.python.org/pipermail/python-ideas/2017-January/044089.html>`_
* 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to
  C.utf-8 (msg284764) <https://bugs.python.org/issue28180#msg284764>`_
* 2016-08-17: `bpo-27781: Change sys.getfilesystemencoding() on Windows
  to UTF-8 (msg272916) <https://bugs.python.org/issue27781#msg272916>`_
  -- Victor proposed ``-X utf8`` for the :pep:`529` (Change Windows
  filesystem encoding to UTF-8)


Version History
===============

* Version 4: ``locale.getpreferredencoding()`` now returns ``'UTF-8'``
  in the UTF-8 Mode.
* Version 3: The UTF-8 Mode does not change the ``open()`` default error
  handler (``strict``) anymore, and the Strict UTF-8 Mode has been
  removed.
* Version 2: Rewrite the PEP from scratch to make it much shorter and
  easier to understand.
* Version 1: First version posted to python-dev.


Copyright
=========

This document has been placed in the public domain.