PEP 540: getpreferredencoding() returns UTF-8

* List explicitly effects of the UTF-8 mode, but also "side effects"
* Add a new "Relationship with the locale coercion (PEP 538)" section
* Add a new "Version History" section
Victor Stinner 2017-12-08 15:38:40 +01:00
parent b46be0d897
commit 09022d3f72
1 changed files with 98 additions and 60 deletions


@ -1,5 +1,5 @@
PEP: 540
Title: Add a new UTF-8 mode
Title: Add a new UTF-8 Mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner@gmail.com>
@ -14,13 +14,13 @@ Python-Version: 3.7
Abstract
========
Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
This mode is enabled by default in the POSIX locale, but otherwise
disabled by default.
Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale
encoding, and change ``stdin`` and ``stdout`` error handlers to
``surrogateescape``. This mode is disabled by default, except in the
POSIX locale, which enables it.
The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode.
variable are added to control the UTF-8 Mode.
Rationale
@ -51,13 +51,7 @@ and JSON file formats. The Go programming language uses UTF-8 for
strings.
When all data are stored as UTF-8 but the locale is often misconfigured,
an obvious solution is to ignore the locale and use UTF-8.
PEP 538 attempts to mitigate this problem by coercing the C locale
to a UTF-8 based locale when one is available, but that isn't a
universal solution. For example, CentOS 7's container images default
to the POSIX locale, and don't include the C.UTF-8 locale, so PEP 538's
locale coercion is ineffective.
an obvious solution is to ignore the locale encoding and use UTF-8.
Passthough undecodable bytes: surrogateescape
@ -76,7 +70,7 @@ the ``surrogateescape`` error handler (:pep:`383`). It allows to process
data "as bytes" but uses Unicode in practice (undecodable bytes are
stored as surrogate characters).
The UTF-8 mode uses the ``surrogateescape`` error handler for ``stdin``
The UTF-8 Mode uses the ``surrogateescape`` error handler for ``stdin``
and ``stdout`` since these streams are commonly associated with Unix
command line tools.
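
As an illustration of the round trip described above, here is a minimal sketch. The sample file name and the ``roundtrip`` helper are hypothetical and not part of the PEP; only the ``surrogateescape`` error handler itself comes from :pep:`383`.

```python
# Hypothetical sample: a Latin-1 encoded name, which is not valid UTF-8.
raw = b"caf\xe9.txt"


def roundtrip(data):
    """Decode bytes as UTF-8 with surrogateescape, then encode them back."""
    # The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
    text = data.decode("utf-8", errors="surrogateescape")
    # Encoding with the same handler restores the original byte.
    return text, text.encode("utf-8", errors="surrogateescape")


text, restored = roundtrip(raw)
print(ascii(text))      # 'caf\udce9.txt'
print(restored == raw)  # True
```

The string can be processed as ordinary text, and the original bytes are recovered unchanged on output, which is the behaviour Unix command line tools usually expect.
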
@ -98,43 +92,98 @@ usually gives the ASCII encoding, whereas UTF-8 is a much better choice.
It does not change the behaviour for other locales to prevent any risk
or regression.
As users are responsible to enable explicitly the new UTF-8 mode, they
As users are responsible for explicitly enabling the new UTF-8 Mode, they
are responsible for any potential mojibake issues caused by this mode.
Proposal
========
Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
This mode is enabled by default in the POSIX locale, but otherwise
disabled by default.
Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale
encoding, and change ``stdin`` and ``stdout`` error handlers to
``surrogateescape``.
The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added. The UTF-8 mode is enabled by ``-X utf8`` or
variable are added. The UTF-8 Mode is enabled by ``-X utf8`` or
``PYTHONUTF8=1``.
The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
This mode is disabled by default, except in the POSIX locale, which
enables it. The UTF-8 Mode can be explicitly disabled by ``-X utf8=0``
or ``PYTHONUTF8=0``.
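
The switch can be observed from a child interpreter. The sketch below assumes CPython 3.7 or later, where the mode is exposed as ``sys.flags.utf8_mode``; that attribute and the ``child_utf8_mode`` helper are not named in this PEP text and are used here only for illustration.

```python
import os
import subprocess
import sys


def child_utf8_mode(setting):
    """Return sys.flags.utf8_mode of a child started with PYTHONUTF8=setting."""
    env = dict(os.environ)
    env["PYTHONUTF8"] = setting  # "1" enables the mode, "0" disables it
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return int(result.stdout.strip())


enabled = child_utf8_mode("1")
disabled = child_utf8_mode("0")
print(enabled, disabled)
```

Setting ``PYTHONUTF8=0`` is what lets a user opt out when the POSIX locale would otherwise enable the mode.
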
For standard streams, the ``PYTHONIOENCODING`` environment variable has
priority over the UTF-8 mode.
priority over the UTF-8 Mode.
On Windows, the ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable
(:pep:`529`) has the priority over the UTF-8 mode.
(:pep:`529`) has priority over the UTF-8 Mode.
Effects of the UTF-8 Mode:
* ``sys.getfilesystemencoding()`` returns ``'UTF-8'``.
* ``locale.getpreferredencoding()`` returns ``'UTF-8'``; its
  *do_setlocale* argument and the locale encoding are ignored.
* The ``sys.stdin`` and ``sys.stdout`` error handlers are set to
  ``surrogateescape``.
Side effects:
* ``open()`` uses the UTF-8 encoding by default.
* ``os.fsdecode()`` and ``os.fsencode()`` use the UTF-8 encoding.
* Command line arguments, environment variables and filenames use the
UTF-8 encoding.
.. note::
   In the UTF-8 Mode, ``open()`` still uses the ``strict`` error handler
   by default.
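
The effects listed above can be checked with a short script. This is a sketch rather than a reference test: the ``utf8_mode_effects`` helper and the ``CHILD`` snippet are hypothetical, and encoding names are normalized with ``codecs.lookup()`` because the exact spelling returned (``'UTF-8'`` or ``'utf-8'``) may differ between Python versions.

```python
import codecs
import json
import os
import subprocess
import sys

# Code run in the child interpreter: report what the UTF-8 Mode changed.
CHILD = """
import json, locale, sys
print(json.dumps({
    "filesystem": sys.getfilesystemencoding(),
    "preferred": locale.getpreferredencoding(False),
    "stdin_errors": sys.stdin.errors,
    "stdout_errors": sys.stdout.errors,
}))
"""

# Variables that would override or disable the mode are dropped.
OVERRIDES = ("PYTHONUTF8", "PYTHONIOENCODING", "PYTHONLEGACYWINDOWSFSENCODING")


def utf8_mode_effects():
    """Run a child interpreter with -X utf8 and return its reported settings."""
    env = {k: v for k, v in os.environ.items() if k not in OVERRIDES}
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", CHILD],
        env=env,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return json.loads(result.stdout)


report = utf8_mode_effects()
# Normalize spelling so "UTF-8" and "utf-8" compare equal.
report["filesystem"] = codecs.lookup(report["filesystem"]).name
report["preferred"] = codecs.lookup(report["preferred"]).name
print(report)
```

The environment is filtered because, as stated above, ``PYTHONIOENCODING`` and ``PYTHONLEGACYWINDOWSFSENCODING`` take priority over the UTF-8 Mode and would mask its effects.
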
Relationship with the locale coercion (PEP 538)
===============================================
The POSIX locale enables the locale coercion (PEP 538) and the UTF-8
Mode (PEP 540). When the locale coercion is enabled, enabling the UTF-8
Mode has no (additional) effect.
Locale coercion only impacts non-Python code like C libraries, whereas
the Python UTF-8 Mode only impacts Python code: the two PEPs are
complementary.
On platforms where locale coercion is not supported, such as CentOS 7,
the POSIX locale only enables the UTF-8 Mode. In this case, Python code
uses the UTF-8 encoding and ignores the locale encoding, whereas
non-Python code uses the locale encoding, which is usually ASCII for the
POSIX locale.
While the UTF-8 Mode is supported on all platforms and can be enabled
with any locale, the locale coercion is not supported by all platforms
and is restricted to the POSIX locale.
The UTF-8 Mode only affects Python child processes when the
``PYTHONUTF8`` environment variable is set to ``1``, whereas the locale
coercion sets the ``LC_CTYPE`` environment variable, which affects all
child processes.
The benefit of the locale coercion approach is that it helps ensure that
encoding handling in binary extension modules and child processes is
consistent with Python's encoding handling. The upside of the UTF-8 Mode
approach is that it allows an embedding application to change the
interpreter's behaviour without having to change the process global
locale settings.
Backward Compatibility
======================
The only backward incompatible change is that the UTF-8 encoding is now
used for the POSIX locale.
The only backward incompatible change is that the POSIX locale now
enables the UTF-8 Mode by default: Python uses the UTF-8 encoding,
ignores the locale encoding, and changes the ``stdin`` and ``stdout``
error handlers to ``surrogateescape``.
Annex: Encodings And Error Handlers
===================================
The UTF-8 mode changes the default encoding and error handler used by
The UTF-8 Mode changes the default encoding and error handler used by
``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
``sys.stdout`` and ``sys.stderr``.
@ -142,7 +191,7 @@ Encoding and error handler
--------------------------
============================ ======================= ==========================
Function Default UTF-8 mode or POSIX locale
Function Default UTF-8 Mode or POSIX locale
============================ ======================= ==========================
open() locale/strict **UTF-8**/strict
os.fsdecode(), os.fsencode() locale/surrogateescape **UTF-8**/surrogateescape
@ -167,7 +216,7 @@ Encoding and error handler on Windows
On Windows, the encodings and error handlers are different:
============================ ======================= ========================== ==========================
Function Default Legacy Windows FS encoding UTF-8 mode
Function Default Legacy Windows FS encoding UTF-8 Mode
============================ ======================= ========================== ==========================
open() mbcs/strict mbcs/strict **UTF-8**/strict
os.fsdecode(), os.fsencode() UTF-8/surrogatepass **mbcs/replace** UTF-8/surrogatepass
@ -191,43 +240,19 @@ The "Legacy Windows FS encoding" is enabled by the
If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
``sys.stdout`` use the ``mbcs`` encoding by default rather than UTF-8. But
in the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
in the UTF-8 Mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
encoding.
.. note::
   There is no POSIX locale on Windows. The ANSI code page is used as the
   locale encoding, and this code page never uses the ASCII encoding.
Annex: Differences between PEP 538 and PEP 540
==============================================
PEP 538's locale coercion is only effective if a suitable UTF-8
based locale is available as a coercion target. PEP 540's
UTF-8 mode can be enabled even for operating systems that don't
provide a suitable platform locale (such as CentOS 7).
PEP 538 only changes the interpreter's behaviour for the C locale. While the
new UTF-8 mode of this PEP is only enabled by default in the C locale, it can
also be enabled manually for any other locale.
PEP 538 is implemented with ``setlocale(LC_CTYPE, "<coercion target>")`` and
``setenv("LC_CTYPE", "<coercion target>")``, so any non-Python code running
in the process and any subprocesses that inherit the environment is impacted
by the change. PEP 540 is implemented in Python internals and ignores the
locale: non-Python code running in the same process is not aware of the
"Python UTF-8 mode". The benefit of the PEP 538 approach is that it helps
ensure that encoding handling in binary extension modules and subprocesses
is consistent with CPython's encoding handling. The upside of the PEP 540
approach is that it allows an embedding application to change the
interpreter's behaviour without having to change the process global
locale settings.
   There is no POSIX locale on Windows. The ANSI code page is used as
   the locale encoding, and this code page never uses the ASCII
   encoding.
Links
=====
* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 mode
* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 Mode
<http://bugs.python.org/issue29240>`_
* `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
"Coercing the legacy C locale to C.UTF-8"
@ -242,12 +267,12 @@ Links
Post History
============
* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 mode
* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 Mode
<https://mail.python.org/pipermail/python-dev/2017-December/151054.html>`_
* 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
540 (assuming UTF-8 for *nix system boundaries)
<https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_
* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 mode
* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 Mode
<https://mail.python.org/pipermail/python-ideas/2017-January/044089.html>`_
* 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to
C.utf-8 (msg284764) <https://bugs.python.org/issue28180#msg284764>`_
@ -257,6 +282,19 @@ Post History
filesystem encoding to UTF-8)
Version History
===============
* Version 4: ``locale.getpreferredencoding()`` now returns ``'UTF-8'``
in the UTF-8 Mode.
* Version 3: The UTF-8 Mode does not change the ``open()`` default error
handler (``strict``) anymore, and the Strict UTF-8 Mode has been
removed.
* Version 2: Rewrite the PEP from scratch to make it much shorter and
easier to understand.
* Version 1: First version posted to python-dev.
Copyright
=========