Update PEP 540
* Enable UTF-8 mode by default if the locale is POSIX
* Add Use Cases
* Add "Don't modify the encoding of the POSIX locale" alternative
* Rephrase Abstract and Proposal
* Proposal: mention expected mojibake issues
* Fix PEP number: 393 => 383
* Add links
parent
9780f3ab43
commit
5b6b25f5d9
305
pep-0540.txt
|
@@ -13,9 +13,16 @@ Python-Version: 3.7
|
|||
Abstract
|
||||
========
|
||||
|
||||
Add a new UTF-8 mode, opt-in option to use UTF-8 for operating system
|
||||
data instead of the locale encoding. Add ``-X utf8`` command line option
|
||||
and ``PYTHONUTF8`` environment variable.
|
||||
Add a new UTF-8 mode, disabled by default, to ignore the locale and
|
||||
force the usage of the UTF-8 encoding.
|
||||
|
||||
Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
|
||||
bother users with encodings, but it can produce mojibake. The UTF-8 mode
|
||||
can be configured as strict to prevent mojibake.
|
||||
|
||||
New ``-X utf8`` command line option and ``PYTHONUTF8`` environment
|
||||
variable are added to control the UTF-8 mode. The POSIX locale enables
|
||||
the UTF-8 mode.
|
||||
|
||||
|
||||
Context
|
||||
|
@@ -33,9 +40,8 @@ data from/to the operating system:
|
|||
* environment variables: ``os.environ``
|
||||
* filenames: ``os.listdir(str)`` for example
|
||||
* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
|
||||
* error messages
|
||||
* name of a timezone
|
||||
* user name, terminal name: ``os``, ``grp`` and ``pwd`` modules
|
||||
* error messages: ``os.strerror(code)`` for example
|
||||
* user and terminal names: ``os``, ``grp`` and ``pwd`` modules
|
||||
* host name, UNIX socket path: see the ``socket`` module
|
||||
* etc.
|
||||
|
||||
|
@@ -81,7 +87,7 @@ arguments are decoded by ``mbstowcs()`` and encoded back by
|
|||
``os.fsencode()``, a ``UnicodeEncodeError`` exception is raised instead
|
||||
of retrieving the original byte string.
|
||||
|
||||
To fix this issue, Python now checks since Python 3.4 if ``mbstowcs()``
|
||||
To fix this issue, Python checks since Python 3.4 if ``mbstowcs()``
|
||||
really uses the ASCII encoding if the ``LC_CTYPE`` uses the
|
||||
POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
|
||||
alias to ASCII). If not (the effective encoding is not ASCII), Python
|
||||
|
@@ -95,16 +101,18 @@ See the `POSIX locale (2016 Edition)
|
|||
C.UTF-8 and C.utf8 locales
|
||||
--------------------------
|
||||
|
||||
Some operating systems provide a variant of the POSIX locale using the
|
||||
Some UNIX operating systems provide a variant of the POSIX locale using the
|
||||
UTF-8 encoding:
|
||||
|
||||
* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
|
||||
* Debian (eglibc 2.13-1, 2011): ``"C.UTF-8"``
|
||||
* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
|
||||
* HP-UX: ``"C.utf8"``
|
||||
|
||||
It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
|
||||
It was proposed to add a ``C.UTF-8`` locale to the glibc: `glibc C.UTF-8
|
||||
proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.
|
||||
|
||||
It is not planned to add such a locale to BSD systems.
|
||||
|
||||
|
||||
Popularity of the UTF-8 encoding
|
||||
--------------------------------
|
||||
|
@@ -112,11 +120,10 @@ Popularity of the UTF-8 encoding
|
|||
Python 3 uses UTF-8 by default for Python source files.
|
||||
|
||||
On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
|
||||
system data instead of the locale encoding. For Windows, see the `PEP
|
||||
529: Change Windows filesystem encoding to UTF-8
|
||||
<https://www.python.org/dev/peps/pep-0529/>`_.
|
||||
system data. For Windows, see the PEP 529: "Change Windows filesystem
|
||||
encoding to UTF-8".
|
||||
|
||||
On Linux, UTF-8 became the de facto standard encoding by default,
|
||||
On Linux, UTF-8 became the de facto standard encoding,
|
||||
replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
|
||||
using different encodings for filenames and standard streams is likely
|
||||
to create mojibake, so UTF-8 is now used *everywhere*.
|
||||
|
@@ -152,8 +159,7 @@ the same encoding.
|
|||
|
||||
Python 3 promotes Unicode everywhere including filenames. A solution to
|
||||
support filenames not decodable from the locale encoding was found: the
|
||||
``surrogateescape`` error handler (`PEP 393
|
||||
<https://www.python.org/dev/peps/pep-0393/>`_), store undecodable bytes
|
||||
``surrogateescape`` error handler (PEP 383), which stores undecodable bytes
|
||||
as surrogate characters. This error handler is used by default for
|
||||
operating system data, by ``os.fsdecode()`` and ``os.fsencode()`` for
|
||||
example (except on Windows which uses the ``strict`` error handler).
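
To make the round trip concrete, here is a minimal sketch of what the ``surrogateescape`` error handler does with a byte that the codec cannot decode. The sample bytes (``caf\xe9.txt``, an ISO 8859-1 encoded name) are an illustrative assumption, and the codec is passed explicitly instead of relying on ``os.fsdecode()`` so that the result does not depend on the current locale.

```python
# Hypothetical filename bytes: 0xE9 is "é" in ISO 8859-1, but it is not
# a valid UTF-8 sequence on its own.
raw = b"caf\xe9.txt"

# Decoding with surrogateescape never fails: the undecodable byte 0xE9
# is stored as the lone surrogate character U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same codec and error handler restores the original
# byte string, so no information is lost.
restored = name.encode("utf-8", errors="surrogateescape")
```

The decoded string is safe to pass back to the operating system, since encoding it yields the exact bytes that were received.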
|
||||
|
@@ -172,7 +178,7 @@ useful when the POSIX locale is used, because this locale usually uses
|
|||
the ASCII encoding.
|
||||
|
||||
The problem is that operating system data like filenames are decoded
|
||||
using the ``surrogateescape`` error handler (PEP 393). Displaying a
|
||||
using the ``surrogateescape`` error handler (PEP 383). Displaying a
|
||||
filename to stdout raises a Unicode encode error if the filename
|
||||
contains an undecoded byte stored as a surrogate character.
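
The failure described above can be reproduced without touching stdout, by encoding the string by hand. This is only a sketch: the filename is a made-up example, and the explicit ``strict`` and ``backslashreplace`` handlers stand in for what a text stream would apply when writing.

```python
# Hypothetical filename decoded from bytes that are not valid UTF-8.
name = b"caf\xe9.txt".decode("utf-8", errors="surrogateescape")

# A stream using the strict error handler cannot encode the lone
# surrogate U+DCE9 and raises UnicodeEncodeError.
try:
    name.encode("utf-8", errors="strict")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False

# With backslashreplace, the surrogate is written as an escape sequence
# instead of raising an error.
escaped = name.encode("utf-8", errors="backslashreplace")
```

Here ``escaped`` contains the text ``\udce9`` in place of the original byte, so the output is readable but no longer matches the filename on disk.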
|
||||
|
||||
|
@@ -191,28 +197,60 @@ programming language, whereas Unicode is badly supported.
|
|||
Proposal
|
||||
========
|
||||
|
||||
Add a new UTF-8 mode, opt-in option to use UTF-8 for operating system data
|
||||
instead of the locale encoding:
|
||||
Changes
|
||||
-------
|
||||
|
||||
* Add ``-X utf8`` command line option
|
||||
* Add ``PYTHONUTF8=1`` environment variable
|
||||
Add a new UTF-8 mode, disabled by default, to ignore the locale and
|
||||
force the usage of the UTF-8 encoding with the ``surrogateescape`` error
|
||||
handler, instead of using the locale encoding (with ``strict`` or
|
||||
``surrogateescape`` error handler depending on the case).
|
||||
|
||||
Add also a strict UTF-8 mode, enabled by ``-X utf8=strict`` or
|
||||
``PYTHONUTF8=strict``.
|
||||
Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
|
||||
bother users with encodings, but it can produce mojibake. It can be
|
||||
configured as strict to prevent mojibake: the UTF-8 encoding is used
|
||||
with the ``strict`` error handler in this case.
|
||||
|
||||
New ``-X utf8`` command line option and ``PYTHONUTF8`` environment
|
||||
variable are added to control the UTF-8 mode. The UTF-8 mode is enabled
|
||||
by ``-X utf8`` or ``PYTHONUTF8=1``. The UTF-8 mode is configured as strict
|
||||
by ``-X utf8=strict`` or ``PYTHONUTF8=strict``.
|
||||
|
||||
The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
|
||||
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
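
As a rough illustration of the command line switches, the sketch below starts child interpreters and reports whether the UTF-8 mode is active. It assumes an interpreter that implements the mode (Python 3.7 or later) and that exposes it as ``sys.flags.utf8_mode``; the helper name ``utf8_mode_flag`` is made up for this example. The strict variant is deliberately left out, since it may not be accepted by released interpreters.

```python
import subprocess
import sys


def utf8_mode_flag(*options):
    """Run a child interpreter with the given options and return the
    value of its sys.flags.utf8_mode (assumes Python 3.7 or later)."""
    cmd = [sys.executable, *options, "-c",
           "import sys; print(sys.flags.utf8_mode)"]
    out = subprocess.run(cmd, stdout=subprocess.PIPE, check=True).stdout
    return int(out.decode("ascii").strip())


# Explicitly enable, then explicitly disable, the UTF-8 mode.
enabled = utf8_mode_flag("-X", "utf8")
disabled = utf8_mode_flag("-X", "utf8=0")
```

The ``-X`` option is used here instead of ``PYTHONUTF8`` so that the result does not depend on the environment of the calling process.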
|
||||
|
||||
Encoding and error handler
|
||||
--------------------------
|
||||
|
||||
The UTF-8 mode changes the default encoding and error handler used by
|
||||
open(), os.fsdecode(), os.fsencode(), sys.stdin, sys.stdout and
|
||||
sys.stderr:
|
||||
|
||||
============================ ======================= ======================= ====================== ======================
Function                     Default, other locales  Default, POSIX locale   UTF-8                  UTF-8 Strict
============================ ======================= ======================= ====================== ======================
open()                       locale/strict           locale/strict           UTF-8/surrogateescape  UTF-8/strict
os.fsdecode(), os.fsencode() locale/surrogateescape  locale/surrogateescape  UTF-8/surrogateescape  UTF-8/strict
sys.stdin                    locale/strict           locale/surrogateescape  UTF-8/surrogateescape  UTF-8/strict
sys.stdout                   locale/strict           locale/surrogateescape  UTF-8/surrogateescape  UTF-8/strict
sys.stderr                   locale/backslashreplace locale/backslashreplace UTF-8/backslashreplace UTF-8/backslashreplace
============================ ======================= ======================= ====================== ======================
|
||||
============================ ======================= ========================== ==========================
Function                     Default                 UTF-8 or POSIX locale      UTF-8 Strict
============================ ======================= ========================== ==========================
open()                       locale/strict           **UTF-8/surrogateescape**  **UTF-8**/strict
os.fsdecode(), os.fsencode() locale/surrogateescape  **UTF-8**/surrogateescape  **UTF-8/strict**
sys.stdin, sys.stdout        locale/strict           **UTF-8/surrogateescape**  **UTF-8**/strict
sys.stderr                   locale/backslashreplace **UTF-8**/backslashreplace **UTF-8**/backslashreplace
============================ ======================= ========================== ==========================
|
||||
|
||||
By comparison, Python 3.6 uses:
|
||||
|
||||
============================ ======================= ==========================
Function                     Default                 POSIX locale
============================ ======================= ==========================
open()                       locale/strict           locale/strict
os.fsdecode(), os.fsencode() locale/surrogateescape  locale/surrogateescape
sys.stdin, sys.stdout        locale/strict           locale/**surrogateescape**
sys.stderr                   locale/backslashreplace locale/backslashreplace
============================ ======================= ==========================
|
||||
|
||||
The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
|
||||
strict mode for convenience: the idea is that data not encoded to UTF-8
|
||||
are passed through "Python" without being modified, as raw bytes.
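
The pass-through idea can be sketched with ``open()`` by spelling out the encoding and error handler that the UTF-8 mode would select by default. The file names ``in.txt`` and ``out.txt`` and the sample data are assumptions made for this example; ``newline=""`` is only there to keep line endings untouched on every platform.

```python
import os
import tempfile

# Sample data: a valid UTF-8 euro sign followed by a byte (0xFF) that can
# never appear in UTF-8.
data = b"valid: \xe2\x82\xac, invalid: \xff\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")    # hypothetical input file
    dst = os.path.join(tmp, "out.txt")   # hypothetical output file
    with open(src, "wb") as fp:
        fp.write(data)

    # Explicit equivalent of the open() row, "UTF-8 or POSIX locale" column.
    with open(src, encoding="utf-8", errors="surrogateescape",
              newline="") as fp:
        text = fp.read()
    with open(dst, "w", encoding="utf-8", errors="surrogateescape",
              newline="") as fp:
        fp.write(text)

    with open(dst, "rb") as fp:
        copied = fp.read()
```

The valid sequence is decoded to a real character, the invalid byte travels through the ``str`` object as a surrogate, and the copy is byte-for-byte identical to the input.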
|
||||
|
||||
Rationale
|
||||
---------
|
||||
|
||||
The UTF-8 mode is disabled by default to keep hard Unicode errors when
|
||||
encoding or decoding operating system data fails, and to keep the
|
||||
|
@@ -238,17 +276,184 @@ Python is more convenient, since they are more commonly misconfigured
|
|||
*by mistake* (configured to use an encoding different than UTF-8,
|
||||
whereas the system uses UTF-8), rather than being misconfigured by intent.
|
||||
|
||||
Expected mojibake issues
|
||||
------------------------
|
||||
|
||||
The UTF-8 mode only affects Python 3.7 code; other code is not aware of this
|
||||
mode.
|
||||
|
||||
If Python 3.7 is used as a producer in a ``producer | consumer`` shell command,
the consumer may fail to decode its input if it decodes it using a locale
encoding other than UTF-8. If the consumer does not decode its input and
processes it as bytes, it should just work.
|
||||
|
||||
If Python 3.7 is used as a consumer in a ``producer | consumer`` shell command,
|
||||
it should just work.
|
||||
|
||||
If Python calls third party libraries or if Python is embedded in an
|
||||
application, code outside Python is not aware of the UTF-8 mode. If the other
|
||||
code uses UTF-8, it's fine. If the other code uses the locale encoding,
|
||||
mojibake will occur when the locale encoding is not UTF-8.
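
A small sketch of the mojibake described in this section: Python produces UTF-8, and a consumer that decodes with a different encoding sees garbage. ISO 8859-1 is picked here only as an example of a non-UTF-8 locale encoding.

```python
text = "euro: \u20ac"

# What Python writes in UTF-8 mode, regardless of the locale.
produced = text.encode("utf-8")

# A consumer that decodes with its locale encoding (ISO 8859-1 assumed
# here) does not fail, but it gets three wrong characters for the euro sign.
garbled = produced.decode("iso-8859-1")

# A consumer that handles the data as bytes is unaffected: re-encoding the
# garbled text with the same codec yields the original bytes.
unchanged = garbled.encode("iso-8859-1")
```

No exception is raised anywhere in this exchange, which is why this kind of problem tends to go unnoticed until the text is displayed.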
|
||||
|
||||
|
||||
Use Cases
|
||||
=========
|
||||
|
||||
List a directory into stdout
|
||||
----------------------------
|
||||
|
||||
Script listing the content of the current directory into stdout::
|
||||
|
||||
    import os
    for name in os.listdir(os.curdir):
        print(name)
|
||||
|
||||
Result:
|
||||
|
||||
======================== ==============================
Python                   Always work?
======================== ==============================
Python 2                 **Yes**
Python 3                 No
Python 3.5, POSIX locale **Yes**
UTF-8 mode               **Yes**
UTF-8 Strict mode        No
======================== ==============================
|
||||
|
||||
"Yes" means that the script cannot fail, but it can produce mojibake.
|
||||
|
||||
"No" means that the script can fail on decoding or encoding a filename
|
||||
depending on the locale or the filename.
|
||||
|
||||
|
||||
List a directory into a text file
|
||||
---------------------------------
|
||||
|
||||
Similar to the previous example, except that the listing is written into
|
||||
a text file::
|
||||
|
||||
    import os
    names = os.listdir(os.curdir)
    with open("/tmp/content.txt", "w") as fp:
        for name in names:
            fp.write("%s\n" % name)
|
||||
|
||||
Result:
|
||||
|
||||
======================== ==============================
Python                   Always work?
======================== ==============================
Python 2                 **Yes**
Python 3                 No
Python 3.5, POSIX locale No
UTF-8 mode               **Yes**
UTF-8 Strict mode        No
======================== ==============================
|
||||
|
||||
"Yes" means that the script cannot fail, but it can produce mojibake.
|
||||
|
||||
"No" means that the script can fail on decoding or encoding a filename
|
||||
depending on the locale or the filename. Typical error::
|
||||
|
||||
    $ LC_ALL=C python3 test.py
    Traceback (most recent call last):
      File "test.py", line 5, in <module>
        fp.write("%s\n" % name)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
|
||||
|
||||
|
||||
Display Unicode characters into stdout
|
||||
--------------------------------------
|
||||
|
||||
Very basic example used to illustrate a common issue, display the euro sign
|
||||
(U+20AC: €)::
|
||||
|
||||
    print("euro: \u20ac")
|
||||
|
||||
Result:
|
||||
|
||||
======================== ==============================
Python                   Always work?
======================== ==============================
Python 2                 No
Python 3                 No
Python 3.5, POSIX locale No
UTF-8 mode               **Yes**
UTF-8 Strict mode        **Yes**
======================== ==============================
|
||||
|
||||
"Yes" means that the script cannot fail, but it can produce mojibake.
|
||||
|
||||
"No" means that the script can fail on encoding the euro sign depending on the
|
||||
locale encoding.
|
||||
|
||||
|
||||
Replace a word in a text
|
||||
------------------------
|
||||
|
||||
The following script replaces the word "apple" with "orange". It
|
||||
reads input from stdin and writes the output into stdout::
|
||||
|
||||
    import sys
    text = sys.stdin.read()
    sys.stdout.write(text.replace("apple", "orange"))
|
||||
|
||||
Result:
|
||||
|
||||
======================== ==============================
Python                   Always work?
======================== ==============================
Python 2                 **Yes**
Python 3                 No
Python 3.5, POSIX locale **Yes**
UTF-8 mode               **Yes**
UTF-8 Strict mode        No
======================== ==============================
|
||||
|
||||
"Yes" means that the script cannot fail.
|
||||
|
||||
"No" means that the script can fail on decoding the input depending on
|
||||
the locale.
|
||||
|
||||
|
||||
Backward Compatibility
|
||||
======================
|
||||
|
||||
Since the UTF-8 mode is disabled by default, it has no impact on the
|
||||
backward compatibility. The new UTF-8 mode must be enabled explicitly.
|
||||
The main backward incompatible change is that the UTF-8 encoding is now
|
||||
used if the locale is POSIX. Since the UTF-8 encoding is used with the
|
||||
``surrogateescape`` error handler, encoding errors should not occur and
|
||||
so the change should not break applications.
|
||||
|
||||
The more likely source of trouble comes from external libraries. Python
|
||||
can decode successfully data from UTF-8, but a library using the locale
|
||||
encoding can fail to encode the decoded text back to bytes. Hopefully,
|
||||
encoding text in a library is a rare operation. Very few libraries
|
||||
expect text, most libraries expect bytes and even manipulate bytes
|
||||
internally.
|
||||
|
||||
If the locale is not POSIX, the PEP has no impact on the backward
|
||||
compatibility since the UTF-8 mode is disabled by default in this case,
|
||||
it must be enabled explicitly.
|
||||
|
||||
|
||||
Alternatives
|
||||
============
|
||||
|
||||
Don't modify the encoding of the POSIX locale
|
||||
---------------------------------------------
|
||||
|
||||
A first version of the PEP did not change the encoding and error handler
|
||||
used for the POSIX locale.
|
||||
|
||||
The problem is that adding a command line option or setting an environment
|
||||
variable is not possible in some cases, or at least not convenient.
|
||||
|
||||
Moreover, many users simply expect that Python 3 behaves as Python 2:
|
||||
it doesn't bother them with encodings and "just works" in all cases. These
|
||||
users don't worry about mojibake, or even expect mojibake because of
|
||||
complex documents using multiple incompatible encodings.
|
||||
|
||||
|
||||
Always use UTF-8
|
||||
----------------
|
||||
|
||||
|
@@ -266,13 +471,35 @@ Force UTF-8 for the POSIX locale
|
|||
An alternative to always using UTF-8 in any case is to only use UTF-8 when the
|
||||
``LC_CTYPE`` locale is the POSIX locale.
|
||||
|
||||
The `PEP 538: Coercing the legacy C locale to C.UTF-8
|
||||
<https://www.python.org/dev/peps/pep-0538/>`_ of Nick Coghlan proposes to
|
||||
implement that using the ``C.UTF-8`` locale.
|
||||
The PEP 538 "Coercing the legacy C locale to C.UTF-8" of Nick Coghlan
|
||||
proposes to implement that using the ``C.UTF-8`` locale.
|
||||
|
||||
|
||||
Related Work
|
||||
============
|
||||
Links
|
||||
=====
|
||||
|
||||
PEPs:
|
||||
|
||||
* PEP 538 "Coercing the legacy C locale to C.UTF-8"
|
||||
* PEP 529: "Change Windows filesystem encoding to UTF-8"
|
||||
* PEP 383: "Non-decodable Bytes in System Character Interfaces"
|
||||
|
||||
Python issues:
|
||||
|
||||
* `Issue #28180: sys.getfilesystemencoding() should default to utf-8
|
||||
<http://bugs.python.org/issue28180>`_
|
||||
* `Issue #19846: Python 3 raises Unicode errors with the C locale
|
||||
<http://bugs.python.org/issue19846>`_
|
||||
* `Issue #8622: Add PYTHONFSENCODING environment variable
|
||||
<https://bugs.python.org/issue8622>`_: added but reverted because of
|
||||
many issues, read the `Inconsistencies if locale and filesystem
|
||||
encodings are different
|
||||
<https://mail.python.org/pipermail/python-dev/2010-October/104509.html>`_
|
||||
thread on the python-dev mailing list
|
||||
|
||||
|
||||
Prior Art
|
||||
=========
|
||||
|
||||
Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
|
||||
variable to force UTF-8: see `perlrun
|
||||
|
|