PEP 540: getpreferredencoding() returns UTF-8
* List explicitly effects of the UTF-8 mode, but also "side effects"
* Add a new "Relationship with the locale coercion (PEP 538)" section
* Add a new "Version History" section
parent b46be0d897
commit 09022d3f72

158 pep-0540.txt
@@ -1,5 +1,5 @@
PEP: 540
Title: Add a new UTF-8 mode
Title: Add a new UTF-8 Mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner@gmail.com>
@@ -14,13 +14,13 @@ Python-Version: 3.7
Abstract
========

Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
This mode is enabled by default in the POSIX locale, but otherwise
disabled by default.
Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale
encoding, and change ``stdin`` and ``stdout`` error handlers to
``surrogateescape``. This mode is disabled by default and enabled by
the POSIX locale.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode.
variable are added to control the UTF-8 Mode.


Rationale
@@ -51,13 +51,7 @@ and JSON file formats. The Go programming language uses UTF-8 for
strings.

When all data are stored as UTF-8 but the locale is often misconfigured,
an obvious solution is to ignore the locale and use UTF-8.

PEP 538 attempts to mitigate this problem by coercing the C locale
to a UTF-8 based locale when one is available, but that isn't a
universal solution. For example, CentOS 7's container images default
to the POSIX locale, and don't include the C.UTF-8 locale, so PEP 538's
locale coercion is ineffective.
an obvious solution is to ignore the locale encoding and use UTF-8.


Passthrough undecodable bytes: surrogateescape
@@ -76,7 +70,7 @@ the ``surrogateescape`` error handler (:pep:`383`). It allows to process
data "as bytes" but uses Unicode in practice (undecodable bytes are
stored as surrogate characters).

The UTF-8 mode uses the ``surrogateescape`` error handler for ``stdin``
The UTF-8 Mode uses the ``surrogateescape`` error handler for ``stdin``
and ``stdout`` since these streams are commonly associated with Unix
command line tools.
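
As a minimal sketch of the behaviour described above, the following snippet
round-trips undecodable bytes through the ``surrogateescape`` error handler.
The sample byte string is an assumption chosen for illustration; it is not
taken from the PEP.

```python
# Sample input (hypothetical): valid UTF-8 for "café " followed by the
# byte 0xff, which can never appear in well-formed UTF-8.
raw = b"caf\xc3\xa9 \xff"

# With surrogateescape, each undecodable byte 0xNN is mapped to the lone
# surrogate U+DCNN instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler turns the surrogates back into the
# original bytes, so the data passes through unchanged.
restored = text.encode("utf-8", "surrogateescape")
```

The text can be processed as ``str`` in between, which is what makes this
handler convenient for tools that read ``stdin`` and write ``stdout``.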

@@ -98,43 +92,98 @@ usually gives the ASCII encoding, whereas UTF-8 is a much better choice.
It does not change the behaviour for other locales to prevent any risk
of regression.

As users are responsible for explicitly enabling the new UTF-8 mode, they
As users are responsible for explicitly enabling the new UTF-8 Mode, they
are responsible for any potential mojibake issues caused by this mode.


Proposal
========

Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
This mode is enabled by default in the POSIX locale, but otherwise
disabled by default.
Add a new UTF-8 Mode to use the UTF-8 encoding, ignore the locale
encoding, and change ``stdin`` and ``stdout`` error handlers to
``surrogateescape``.

The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
variable are added. The UTF-8 mode is enabled by ``-X utf8`` or
variable are added. The UTF-8 Mode is enabled by ``-X utf8`` or
``PYTHONUTF8=1``.

The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
This mode is disabled by default and enabled by the POSIX locale. The
UTF-8 Mode can be explicitly disabled by ``-X utf8=0`` or
``PYTHONUTF8=0``.
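
The switches described above can be exercised from Python with a sketch like
the one below. It assumes a CPython 3.7 or later interpreter, where the mode
is exposed as ``sys.flags.utf8_mode``; that attribute and the helper name
``utf8_mode`` are not part of the text above and are used here only for
illustration.

```python
import os
import subprocess
import sys


def utf8_mode(interpreter_args=(), pythonutf8=None):
    """Return sys.flags.utf8_mode as seen by a fresh interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    if pythonutf8 is not None:
        env["PYTHONUTF8"] = pythonutf8
    cmd = [sys.executable, *interpreter_args,
           "-c", "import sys; print(sys.flags.utf8_mode)"]
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return int(result.stdout.decode("ascii"))


enabled_by_option = utf8_mode(["-X", "utf8"])
enabled_by_env = utf8_mode(pythonutf8="1")
disabled_by_option = utf8_mode(["-X", "utf8=0"])
disabled_by_env = utf8_mode(pythonutf8="0")
```

Calling ``utf8_mode()`` with no arguments reports the default for the current
locale, which is the one case whose result depends on the environment.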

For standard streams, the ``PYTHONIOENCODING`` environment variable has
priority over the UTF-8 mode.
priority over the UTF-8 Mode.
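
A small sketch of this priority rule, assuming CPython 3.7 or later: the child
interpreter is started with ``-X utf8``, yet ``PYTHONIOENCODING`` still decides
the encoding and error handler of ``sys.stdout``. The value
``latin-1:strict`` is an arbitrary choice for the demonstration.

```python
import codecs
import os
import subprocess
import sys

# Child script: report how sys.stdout was configured.
CHILD = "import sys; print(sys.stdout.encoding, sys.stdout.errors)"

# Request Latin-1 with the strict handler for the standard streams.
env = dict(os.environ, PYTHONIOENCODING="latin-1:strict")
env.pop("PYTHONUTF8", None)

result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", CHILD],
    env=env, stdout=subprocess.PIPE, check=True,
)
encoding, errors = result.stdout.decode("ascii").split()

# Normalize the codec name so that aliases compare equal.
normalized = codecs.lookup(encoding).name
```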

On Windows, the ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable
(:pep:`529`) has priority over the UTF-8 mode.
(:pep:`529`) has priority over the UTF-8 Mode.

Effects of the UTF-8 Mode:

* ``sys.getfilesystemencoding()`` returns ``'UTF-8'``.
* ``locale.getpreferredencoding()`` returns ``'UTF-8'``; its
  *do_setlocale* argument and the locale encoding are ignored.
* The ``sys.stdin`` and ``sys.stdout`` error handler is set to
  ``surrogateescape``.
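
The effects listed above can be observed with a sketch like the following,
which starts a child interpreter in the UTF-8 Mode and prints the relevant
settings. It assumes CPython 3.7 or later; the helper name ``report`` is
hypothetical, and the exact spelling of the encoding name (``UTF-8`` or
``utf-8``) may differ between Python versions.

```python
import os
import subprocess
import sys

# Child script: print the settings named in the list of effects.
CHILD = (
    "import locale, sys;"
    "print(sys.getfilesystemencoding(),"
    " locale.getpreferredencoding(False),"
    " sys.stdin.errors, sys.stdout.errors)"
)


def report(*interpreter_args):
    """Run CHILD in a fresh interpreter and return its output as a list."""
    env = dict(os.environ)
    # Remove variables that would override the behaviour under test.
    env.pop("PYTHONUTF8", None)
    env.pop("PYTHONIOENCODING", None)
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", CHILD],
        env=env, stdin=subprocess.DEVNULL, stdout=subprocess.PIPE, check=True,
    )
    return result.stdout.decode("ascii").split()


fs_encoding, preferred, stdin_errors, stdout_errors = report("-X", "utf8")
```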

Side effects:

* ``open()`` uses the UTF-8 encoding by default.
* ``os.fsdecode()`` and ``os.fsencode()`` use the UTF-8 encoding.
* Command line arguments, environment variables and filenames use the
  UTF-8 encoding.

.. note::
   In the UTF-8 Mode, ``open()`` still uses the ``strict`` error handler
   by default.
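
To illustrate the side effects and the note above, this sketch runs a child
interpreter with ``-X utf8`` that writes a file through ``open()`` without an
``encoding`` argument and reports what was used. It assumes CPython 3.7 or
later; the file name ``demo.txt`` is a placeholder.

```python
import os
import subprocess
import sys

# Child script: write one non-ASCII character with the default encoding,
# then show the encoding, the error handler and the bytes on disk.
CHILD = r"""
import os, tempfile
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "w") as f:          # no encoding= argument
        f.write("\u00e9")
        print(f.encoding, f.errors)
    with open(path, "rb") as f:
        print(f.read().hex())
print(os.fsencode("\u00e9").hex())
"""

env = dict(os.environ)
env.pop("PYTHONUTF8", None)
result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", CHILD],
    env=env, stdout=subprocess.PIPE, check=True,
)
open_encoding, open_errors, file_bytes, fsencoded = (
    result.stdout.decode("ascii").split()
)
```

The error handler stays ``strict``, so invalid input read through ``open()``
still raises an exception unless ``errors`` is passed explicitly.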


Relationship with the locale coercion (PEP 538)
===============================================

The POSIX locale enables the locale coercion (PEP 538) and the UTF-8
Mode (PEP 540). When the locale coercion is enabled, enabling the UTF-8
Mode has no (additional) effect.

Locale coercion only impacts non-Python code like C libraries, whereas
the Python UTF-8 Mode only impacts Python code: the two PEPs are
complementary.

On platforms where locale coercion is not supported, like CentOS 7, the
POSIX locale only enables the UTF-8 Mode. In this case, Python code uses
the UTF-8 encoding and ignores the locale encoding, whereas non-Python
code uses the locale encoding which is usually ASCII for the POSIX
locale.

While the UTF-8 Mode is supported on all platforms and can be enabled
with any locale, the locale coercion is not supported by all platforms
and is restricted to the POSIX locale.

The UTF-8 Mode only has an impact on Python child processes when the
``PYTHONUTF8`` environment variable is set to ``1``, whereas the locale
coercion sets the ``LC_CTYPE`` environment variable, which impacts all
child processes.
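
The point about child processes can be sketched as follows: because
``PYTHONUTF8=1`` is an environment variable, a Python process started by the
child inherits it as well. The snippet assumes CPython 3.7 or later and uses
``sys.flags.utf8_mode``, which is not mentioned in the text above, to read
the mode.

```python
import os
import subprocess
import sys

# Grandchild: report its own UTF-8 Mode flag.
GRANDCHILD = "import sys; print(sys.flags.utf8_mode)"

# Child: report its flag, then start the grandchild with the inherited
# environment (no -X option is passed along).
CHILD = (
    "import subprocess, sys;"
    "print(sys.flags.utf8_mode, flush=True);"
    f"subprocess.run([sys.executable, '-c', {GRANDCHILD!r}], check=True)"
)

env = dict(os.environ, PYTHONUTF8="1")
result = subprocess.run(
    [sys.executable, "-c", CHILD],
    env=env, stdout=subprocess.PIPE, check=True,
)
child_mode, grandchild_mode = result.stdout.decode("ascii").split()
```

Enabling the mode with ``-X utf8`` instead affects only the interpreter that
receives the option, since the option is not stored in the environment.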

The benefit of the locale coercion approach is that it helps ensure that
encoding handling in binary extension modules and child processes is
consistent with Python's encoding handling. The upside of the UTF-8 Mode
approach is that it allows an embedding application to change the
interpreter's behaviour without having to change the process global
locale settings.


Backward Compatibility
======================

The only backward incompatible change is that the UTF-8 encoding is now
used for the POSIX locale.
The only backward incompatible change is that the POSIX locale now
enables the UTF-8 Mode by default: use the UTF-8 encoding, ignore the
locale encoding, and change ``stdin`` and ``stdout`` error handlers to
``surrogateescape``.


Annex: Encodings And Error Handlers
===================================

The UTF-8 mode changes the default encoding and error handler used by
The UTF-8 Mode changes the default encoding and error handler used by
``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
``sys.stdout`` and ``sys.stderr``.
@@ -142,7 +191,7 @@ Encoding and error handler
--------------------------

============================ ======================= ==========================
Function                     Default                 UTF-8 mode or POSIX locale
Function                     Default                 UTF-8 Mode or POSIX locale
============================ ======================= ==========================
open()                       locale/strict           **UTF-8**/strict
os.fsdecode(), os.fsencode() locale/surrogateescape  **UTF-8**/surrogateescape
@@ -167,7 +216,7 @@ Encoding and error handler on Windows
On Windows, the encodings and error handlers are different:

============================ ======================= ========================== ==========================
Function                     Default                 Legacy Windows FS encoding UTF-8 mode
Function                     Default                 Legacy Windows FS encoding UTF-8 Mode
============================ ======================= ========================== ==========================
open()                       mbcs/strict             mbcs/strict                **UTF-8**/strict
os.fsdecode(), os.fsencode() UTF-8/surrogatepass     **mbcs/replace**           UTF-8/surrogatepass
@@ -191,43 +240,19 @@ The "Legacy Windows FS encoding" is enabled by the

If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
``sys.stdout`` use ``mbcs`` encoding by default rather than UTF-8. But
in the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
in the UTF-8 Mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
encoding.

.. note:
   There is no POSIX locale on Windows. The ANSI code page is used for the
   locale encoding, and this code page never uses the ASCII encoding.


Annex: Differences between PEP 538 and PEP 540
==============================================

PEP 538's locale coercion is only effective if a suitable UTF-8
based locale is available as a coercion target. PEP 540's
UTF-8 mode can be enabled even for operating systems that don't
provide a suitable platform locale (such as CentOS 7).

PEP 538 only changes the interpreter's behaviour for the C locale. While the
new UTF-8 mode of this PEP is only enabled by default in the C locale, it can
also be enabled manually for any other locale.

PEP 538 is implemented with ``setlocale(LC_CTYPE, "<coercion target>")`` and
``setenv("LC_CTYPE", "<coercion target>")``, so any non-Python code running
in the process and any subprocesses that inherit the environment is impacted
by the change. PEP 540 is implemented in Python internals and ignores the
locale: non-Python code running in the same process is not aware of the
"Python UTF-8 mode". The benefit of the PEP 538 approach is that it helps
ensure that encoding handling in binary extension modules and subprocesses
is consistent with CPython's encoding handling. The upside of the PEP 540
approach is that it allows an embedding application to change the
interpreter's behaviour without having to change the process global
locale settings.
There is no POSIX locale on Windows. The ANSI code page is used for
the locale encoding, and this code page never uses the ASCII
encoding.


Links
=====

* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 mode
* `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 Mode
  <http://bugs.python.org/issue29240>`_
* `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
  "Coercing the legacy C locale to C.UTF-8"
@@ -242,12 +267,12 @@ Links
Post History
============

* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 mode
* 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 Mode
  <https://mail.python.org/pipermail/python-dev/2017-December/151054.html>`_
* 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
  540 (assuming UTF-8 for *nix system boundaries)
  <https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_
* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 mode
* 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 Mode
  <https://mail.python.org/pipermail/python-ideas/2017-January/044089.html>`_
* 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to
  C.utf-8 (msg284764) <https://bugs.python.org/issue28180#msg284764>`_
@@ -257,6 +282,19 @@ Post History
  filesystem encoding to UTF-8)


Version History
===============

* Version 4: ``locale.getpreferredencoding()`` now returns ``'UTF-8'``
  in the UTF-8 Mode.
* Version 3: The UTF-8 Mode does not change the ``open()`` default error
  handler (``strict``) anymore, and the Strict UTF-8 Mode has been
  removed.
* Version 2: Rewrite the PEP from scratch to make it much shorter and
  easier to understand.
* Version 1: First version posted to python-dev.


Copyright
=========