Update PEP 540

* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephrase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links
Victor Stinner 2017-01-05 23:54:22 +01:00
parent 9780f3ab43
commit 5b6b25f5d9
1 changed files with 266 additions and 39 deletions


@ -13,9 +13,16 @@ Python-Version: 3.7
Abstract
========
Add a new UTF-8 mode, opt-in option to use UTF-8 for operating system
data instead of the locale encoding. Add ``-X utf8`` command line option
and ``PYTHONUTF8`` environment variable.
Add a new UTF-8 mode, disabled by default, to ignore the locale and
force the usage of the UTF-8 encoding.
Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
bother users with encodings, but it can produce mojibake. The UTF-8 mode
can be configured as strict to prevent mojibake.
A new ``-X utf8`` command line option and a new ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode. The POSIX locale enables
the UTF-8 mode.
Context
@ -33,9 +40,8 @@ data from/to the operating system:
* environment variables: ``os.environ``
* filenames: ``os.listdir(str)`` for example
* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
* error messages
* name of a timezone
* user name, terminal name: ``os``, ``grp`` and ``pwd`` modules
* error messages: ``os.strerror(code)`` for example
* user and terminal names: ``os``, ``grp`` and ``pwd`` modules
* host name, UNIX socket path: see the ``socket`` module
* etc.
@ -81,7 +87,7 @@ arguments are decoded by ``mbstowcs()`` and encoded back by
``os.fsencode()``, a ``UnicodeEncodeError`` exception is raised instead
of retrieving the original byte string.
To fix this issue, Python now checks since Python 3.4 if ``mbstowcs()``
To fix this issue, Python has checked since Python 3.4 whether ``mbstowcs()``
really uses the ASCII encoding when ``LC_CTYPE`` uses the
POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
alias to ASCII). If not (the effective encoding is not ASCII), Python
@ -95,16 +101,18 @@ See the `POSIX locale (2016 Edition)
C.UTF-8 and C.utf8 locales
--------------------------
Some operating systems provide a variant of the POSIX locale using the
Some UNIX operating systems provide a variant of the POSIX locale using the
UTF-8 encoding:
* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
* Debian (eglibc 2.13-1, 2011): ``"C.UTF-8"``
* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
* HP-UX: ``"C.utf8"``
It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
It was proposed to add a ``C.UTF-8`` locale to the glibc: `glibc C.UTF-8
proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.
There are no plans to add such a locale to BSD systems.
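Whether a given system provides such a locale can be probed at run time.
The following is a minimal sketch, not part of the PEP: the helper name
``locale_available`` is hypothetical, and it relies only on
``locale.setlocale()`` raising ``locale.Error`` for an unknown locale name.

```python
import locale


def locale_available(name: str) -> bool:
    """Return True if LC_CTYPE can be switched to the locale `name`."""
    # Calling setlocale() without a new value only queries the current setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
    except locale.Error:
        return False
    finally:
        # Always restore the original locale: the probe must have no side effect.
        locale.setlocale(locale.LC_CTYPE, saved)
    return True


for candidate in ("C.UTF-8", "C.utf8"):
    print(candidate, locale_available(candidate))
```

The result depends on the platform, so no particular output should be expected.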
Popularity of the UTF-8 encoding
--------------------------------
@ -112,11 +120,10 @@ Popularity of the UTF-8 encoding
Python 3 uses UTF-8 by default for Python source files.
On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
system data instead of the locale encoding. For Windows, see the `PEP
529: Change Windows filesystem encoding to UTF-8
<https://www.python.org/dev/peps/pep-0529/>`_.
system data. For Windows, see PEP 529: "Change Windows filesystem
encoding to UTF-8".
On Linux, UTF-8 became the defacto standard encoding by default,
On Linux, UTF-8 became the de facto standard encoding,
replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
using different encodings for filenames and standard streams is likely
to create mojibake, so UTF-8 is now used *everywhere*.
@ -152,8 +159,7 @@ the same encoding.
Python 3 promotes Unicode everywhere including filenames. A solution to
support filenames not decodable from the locale encoding was found: the
``surrogateescape`` error handler (`PEP 393
<https://www.python.org/dev/peps/pep-0393/>`_), store undecodable bytes
``surrogateescape`` error handler (PEP 383), which stores undecodable bytes
as surrogate characters. This error handler is used by default for
operating system data, by ``os.fsdecode()`` and ``os.fsencode()`` for
example (except on Windows which uses the ``strict`` error handler).
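The round trip performed by ``surrogateescape`` can be illustrated with a
short example. This is only a sketch: the byte string ``raw_name`` is an
invented filename, encoded to ISO 8859-1 and therefore not valid UTF-8.

```python
# 0xE9 is "é" in ISO 8859-1, but it is not a valid UTF-8 sequence here.
raw_name = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
decoded = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(decoded))  # 'caf\udce9.txt'

# Encoding with the same error handler restores the original bytes.
round_tripped = decoded.encode("utf-8", errors="surrogateescape")
print(round_tripped == raw_name)  # True
```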
@ -172,7 +178,7 @@ useful when the POSIX locale is used, because this locale usually uses
the ASCII encoding.
The problem is that operating system data like filenames are decoded
using the ``surrogateescape`` error handler (PEP 393). Displaying a
using the ``surrogateescape`` error handler (PEP 383). Displaying a
filename to stdout raises a Unicode encode error if the filename
contains an undecodable byte stored as a surrogate character.
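The failure can be reproduced without touching the filesystem. In this
sketch, the filename is an invented example, and ``backslashreplace`` is
shown only as one way to get a printable representation; the PEP does not
mandate it for stdout.

```python
# A filename holding one undecodable byte, stored as a surrogate character.
name = b"caf\xe9.txt".decode("utf-8", errors="surrogateescape")

error = None
try:
    # The strict error handler (the default) refuses lone surrogates.
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc
    print("cannot encode:", exc.reason)

# backslashreplace never fails: it writes the surrogate as an escape sequence.
safe = name.encode("utf-8", errors="backslashreplace")
print(safe)  # b'caf\\udce9.txt'
```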
@ -191,28 +197,60 @@ programming language, whereas Unicode is badly supported.
Proposal
========
Add a new UTF-8 mode, opt-in option to use UTF-8 for operating system data
instead of the locale encoding:
Changes
-------
* Add ``-X utf8`` command line option
* Add ``PYTHONUTF8=1`` environment variable
Add a new UTF-8 mode, disabled by default, to ignore the locale and
force the usage of the UTF-8 encoding with the ``surrogateescape`` error
handler, instead of using the locale encoding (with the ``strict`` or
``surrogateescape`` error handler depending on the case).
Also add a strict UTF-8 mode, enabled by ``-X utf8=strict`` or
``PYTHONUTF8=strict``.
Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
bother users with encodings, but it can produce mojibake. It can be
configured as strict to prevent mojibake: the UTF-8 encoding is used
with the ``strict`` error handler in this case.
A new ``-X utf8`` command line option and a new ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode. The UTF-8 mode is enabled
by ``-X utf8`` or ``PYTHONUTF8=1``. The UTF-8 mode is configured as strict
by ``-X utf8=strict`` or ``PYTHONUTF8=strict``.
The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
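The rules above can be summarized as a small decision function. This is
only a sketch: ``resolve_utf8_mode`` and its parameters are hypothetical
names, not part of CPython, and giving ``-X utf8`` priority over
``PYTHONUTF8`` is an assumption that the text does not state.

```python
def resolve_utf8_mode(x_option=None, env_value=None, posix_locale=False):
    """Return "off", "on" or "strict" (hypothetical helper).

    x_option: value of -X utf8 ("1" for a bare -X utf8), None if absent.
    env_value: value of PYTHONUTF8, None if unset.
    posix_locale: True if the LC_CTYPE locale is the POSIX locale.
    """
    names = {"0": "off", "1": "on", "strict": "strict"}
    # Assumption: the command line option wins over the environment variable.
    for value in (x_option, env_value):
        if value is not None:
            if value not in names:
                raise ValueError("invalid UTF-8 mode value: %r" % (value,))
            return names[value]
    # Without an explicit setting, only the POSIX locale enables the mode.
    return "on" if posix_locale else "off"


print(resolve_utf8_mode())                                   # off
print(resolve_utf8_mode(posix_locale=True))                  # on
print(resolve_utf8_mode(env_value="0", posix_locale=True))   # off
print(resolve_utf8_mode(x_option="strict"))                  # strict
```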
Encoding and error handler
--------------------------
The UTF-8 mode changes the default encoding and error handler used by
open(), os.fsdecode(), os.fsencode(), sys.stdin, sys.stdout and
sys.stderr:
============================  =======================  =======================  ======================  ======================
Function                      Default, other locales   Default, POSIX locale    UTF-8                   UTF-8 Strict
============================  =======================  =======================  ======================  ======================
open()                        locale/strict            locale/strict            UTF-8/surrogateescape   UTF-8/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape   UTF-8/surrogateescape   UTF-8/strict
sys.stdin                     locale/strict            locale/surrogateescape   UTF-8/surrogateescape   UTF-8/strict
sys.stdout                    locale/strict            locale/surrogateescape   UTF-8/surrogateescape   UTF-8/strict
sys.stderr                    locale/backslashreplace  locale/backslashreplace  UTF-8/backslashreplace  UTF-8/backslashreplace
============================  =======================  =======================  ======================  ======================
============================  =======================  ==========================  ==========================
Function                      Default                  UTF-8 or POSIX locale       UTF-8 Strict
============================  =======================  ==========================  ==========================
open()                        locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape   **UTF-8/strict**
sys.stdin, sys.stdout         locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
sys.stderr                    locale/backslashreplace  **UTF-8**/backslashreplace  **UTF-8**/backslashreplace
============================  =======================  ==========================  ==========================
By comparison, Python 3.6 uses:
============================  =======================  ==========================
Function                      Default                  POSIX locale
============================  =======================  ==========================
open()                        locale/strict            locale/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   locale/surrogateescape
sys.stdin, sys.stdout         locale/strict            locale/**surrogateescape**
sys.stderr                    locale/backslashreplace  locale/backslashreplace
============================  =======================  ==========================
The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
strict mode for convenience: the idea is that data not encoded to UTF-8
are passed through "Python" without being modified, as raw bytes.
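The pass-through behavior can be simulated with in-memory streams standing
in for ``sys.stdin`` and ``sys.stdout``. This is a sketch under stated
assumptions: the helper ``copy_through_text_layer`` is hypothetical, and
the streams are configured by hand rather than by the UTF-8 mode itself.

```python
import io


def copy_through_text_layer(data: bytes, errors: str) -> bytes:
    """Decode `data` from UTF-8 and write it back through a text stream."""
    # newline="" disables newline translation so bytes are compared exactly.
    source = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8",
                              errors=errors, newline="")
    sink = io.BytesIO()
    target = io.TextIOWrapper(sink, encoding="utf-8",
                              errors=errors, newline="")
    target.write(source.read())
    target.flush()
    return sink.getvalue()


payload = b"ok \xff\xfe\n"  # 0xFF and 0xFE can never appear in valid UTF-8

# UTF-8 mode: the invalid bytes travel through the text layer unchanged.
print(copy_through_text_layer(payload, "surrogateescape") == payload)

# Strict UTF-8 mode: the same input is rejected while decoding.
try:
    copy_through_text_layer(payload, "strict")
except UnicodeDecodeError as exc:
    print("strict mode rejects the input:", exc.reason)
```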
Rationale
---------
The UTF-8 mode is disabled by default to keep hard Unicode errors when
encoding or decoding operating system data fails, and to keep the
@ -238,17 +276,184 @@ Python is more convenient, since they are more commonly misconfigured
*by mistake* (configured to use an encoding different than UTF-8,
whereas the system uses UTF-8), rather than being misconfigured by intent.
Expected mojibake issues
------------------------
The UTF-8 mode only affects Python 3.7 code; other code is not aware of this
mode.
If Python 3.7 is used as the producer in a ``producer | consumer`` shell
command, the consumer may fail to decode the input data if it decodes it and
its locale encoding is not UTF-8. If the consumer doesn't decode its input
and processes it as bytes, it should just work.
If Python 3.7 is used as a consumer in a ``producer | consumer`` shell command,
it should just work.
If Python calls third party libraries or if Python is embedded in an
application, code outside Python is not aware of the UTF-8 mode. If the other
code uses UTF-8, it's fine. If the other code uses the locale encoding,
mojibake will occur when the locale encoding is not UTF-8.
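The kind of mojibake described above is easy to reproduce. In this sketch,
the producer writes UTF-8 and an imaginary consumer decodes the same bytes
using ISO 8859-1 as its locale encoding; both the text and the choice of
ISO 8859-1 are illustrative assumptions.

```python
text = "café"

# Producer side: Python in UTF-8 mode writes UTF-8 bytes.
produced = text.encode("utf-8")

# Consumer side: the bytes are decoded using a different locale encoding.
garbled = produced.decode("latin-1")
print(garbled)  # cafÃ©

# No exception is raised, which is what makes mojibake easy to miss.
print(garbled == text)  # False
```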
Use Cases
=========
List a directory into stdout
----------------------------
Script listing the content of the current directory into stdout::
    import os
    for name in os.listdir(os.curdir):
        print(name)
Result:
========================  ==============================
Python                    Always work?
========================  ==============================
Python 2                  **Yes**
Python 3                  No
Python 3.5, POSIX locale  **Yes**
UTF-8 mode                **Yes**
UTF-8 Strict mode         No
========================  ==============================
"Yes" means that the script cannot fail, but it can produce mojibake.
"No" means that the script can fail on decoding or encoding a filename
depending on the locale or the filename.
List a directory into a text file
---------------------------------
Similar to the previous example, except that the listing is written into
a text file::
    import os
    names = os.listdir(os.curdir)
    with open("/tmp/content.txt", "w") as fp:
        for name in names:
            fp.write("%s\n" % name)
Result:
========================  ==============================
Python                    Always work?
========================  ==============================
Python 2                  **Yes**
Python 3                  No
Python 3.5, POSIX locale  No
UTF-8 mode                **Yes**
UTF-8 Strict mode         No
========================  ==============================
"Yes" means that the script cannot fail, but it can produce mojibake.
"No" means that the script can fail on decoding or encoding a filename
depending on the locale or the filename. Typical error::
    $ LC_ALL=C python3 test.py
    Traceback (most recent call last):
      File "test.py", line 5, in <module>
        fp.write("%s\n" % name)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Display Unicode characters into stdout
--------------------------------------
A very basic example illustrating a common issue: displaying the euro sign
(U+20AC: €)::
    print("euro: \u20ac")
Result:
========================  ==============================
Python                    Always work?
========================  ==============================
Python 2                  No
Python 3                  No
Python 3.5, POSIX locale  No
UTF-8 mode                **Yes**
UTF-8 Strict mode         **Yes**
========================  ==============================
"Yes" means that the script cannot fail, but it can produce mojibake.
"No" means that the script can fail on encoding the euro sign depending on the
locale encoding.
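Whether the script fails depends only on whether the stdout encoding can
represent the euro sign. The following sketch checks a few encodings
directly; the helper name ``can_encode`` and the selection of encodings
are illustrative choices.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` is encodable with the strict error handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


euro = "euro: \u20ac"
for encoding in ("ascii", "latin-1", "iso-8859-15", "utf-8"):
    print(encoding, can_encode(euro, encoding))
```

Only ISO 8859-15 and UTF-8 succeed here: ASCII and ISO 8859-1 have no
euro sign.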
Replace a word in a text
------------------------
The following script replaces the word "apple" with "orange". It
reads input from stdin and writes the output into stdout::
    import sys
    text = sys.stdin.read()
    sys.stdout.write(text.replace("apple", "orange"))
Result:
========================  ==============================
Python                    Always work?
========================  ==============================
Python 2                  **Yes**
Python 3                  No
Python 3.5, POSIX locale  **Yes**
UTF-8 mode                **Yes**
UTF-8 Strict mode         No
========================  ==============================
"Yes" means that the script cannot fail.
"No" means that the script can fail on decoding the input depending on
the locale.
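The difference between the two UTF-8 modes for this script can be sketched
on raw bytes, without real streams. The function ``replace_word`` is a
hypothetical stand-in for the script, and the input is an invented line
containing ISO 8859-1 bytes that are not valid UTF-8.

```python
def replace_word(data: bytes, errors: str) -> bytes:
    """Decode from UTF-8, replace "apple" with "orange", encode back."""
    text = data.decode("utf-8", errors=errors)
    return text.replace("apple", "orange").encode("utf-8", errors=errors)


line = b"apple \xe9t\xe9\n"  # "été" encoded to ISO 8859-1

# UTF-8 mode: the word is replaced and the foreign bytes are left untouched.
print(replace_word(line, "surrogateescape"))  # b'orange \xe9t\xe9\n'

# Strict UTF-8 mode: decoding the input fails.
try:
    replace_word(line, "strict")
except UnicodeDecodeError as exc:
    print("decoding failed:", exc.reason)
```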
Backward Compatibility
======================
Since the UTF-8 mode is disabled by default, it has no impact on the
backward compatibility. The new UTF-8 mode must be enabled explicitly.
The main backward incompatible change is that the UTF-8 encoding is now
used if the locale is POSIX. Since the UTF-8 encoding is used with the
``surrogateescape`` error handler, encoding errors should not occur and
so the change should not break applications.
The more likely source of trouble comes from external libraries. Python
can decode successfully data from UTF-8, but a library using the locale
encoding can fail to encode the decoded text back to bytes. Hopefully,
encoding text in a library is a rare operation. Very few libraries
expect text, most libraries expect bytes and even manipulate bytes
internally.
If the locale is not POSIX, the PEP has no impact on backward
compatibility, since the UTF-8 mode is disabled by default in this case:
it must be enabled explicitly.
Alternatives
============
Don't modify the encoding of the POSIX locale
---------------------------------------------
A first version of the PEP did not change the encoding and error handler
used for the POSIX locale.
The problem is that adding a command line option or setting an environment
variable is not possible in some cases, or at least not convenient.
Moreover, many users simply expect Python 3 to behave like Python 2:
it should not bother them with encodings and should "just work" in all
cases. These users don't worry about mojibake, or even expect mojibake
because of complex documents using multiple incompatible encodings.
Always use UTF-8
----------------
@ -266,13 +471,35 @@ Force UTF-8 for the POSIX locale
An alternative to always using UTF-8 in any case is to only use UTF-8 when the
``LC_CTYPE`` locale is the POSIX locale.
The `PEP 538: Coercing the legacy C locale to C.UTF-8
<https://www.python.org/dev/peps/pep-0538/>`_ of Nick Coghlan proposes to
implement that using the ``C.UTF-8`` locale.
PEP 538 "Coercing the legacy C locale to C.UTF-8" by Nick Coghlan
proposes to implement that using the ``C.UTF-8`` locale.
Related Work
============
Links
=====
PEPs:
* PEP 538 "Coercing the legacy C locale to C.UTF-8"
* PEP 529: "Change Windows filesystem encoding to UTF-8"
* PEP 383: "Non-decodable Bytes in System Character Interfaces"
Python issues:
* `Issue #28180: sys.getfilesystemencoding() should default to utf-8
<http://bugs.python.org/issue28180>`_
* `Issue #19846: Python 3 raises Unicode errors with the C locale
<http://bugs.python.org/issue19846>`_
* `Issue #8622: Add PYTHONFSENCODING environment variable
<https://bugs.python.org/issue8622>`_: added but reverted because of
many issues, read the `Inconsistencies if locale and filesystem
encodings are different
<https://mail.python.org/pipermail/python-dev/2010-October/104509.html>`_
thread on the python-dev mailing list
Prior Art
=========
Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
variable to force UTF-8: see `perlrun