PEP: 540
Title: Add a new UTF-8 mode
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 5-January-2016
Python-Version: 3.7


Abstract
========

Add a new UTF-8 mode, disabled by default, to ignore the locale and
force the usage of the UTF-8 encoding.

Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
bother users with encodings, but it can produce mojibake. The UTF-8 mode
can be configured as strict to prevent mojibake.

A new ``-X utf8`` command line option and a new ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode. The POSIX locale enables
the UTF-8 mode.


Context
=======

Locale and operating system data
--------------------------------

Python uses the ``LC_CTYPE`` locale to decide how to encode and decode
data from/to the operating system:

* file content
* command line arguments: ``sys.argv``
* standard streams: ``sys.stdin``, ``sys.stdout``, ``sys.stderr``
* environment variables: ``os.environ``
* filenames: ``os.listdir(str)`` for example
* pipes: ``subprocess.Popen`` using ``subprocess.PIPE`` for example
* error messages: ``os.strerror(code)`` for example
* user and terminal names: ``os``, ``grp`` and ``pwd`` modules
* host name, UNIX socket path: see the ``socket`` module
* etc.

At startup, Python calls ``setlocale(LC_CTYPE, "")`` to use the user
``LC_CTYPE`` locale and then stores the locale encoding, available as
``sys.getfilesystemencoding()``. In the whole lifetime of a Python process,
the same encoding and error handler are used to encode and decode data
from/to the operating system.

.. note::
   In some corner cases, the *current* ``LC_CTYPE`` locale must be used
   instead of ``sys.getfilesystemencoding()``. For example, the ``time``
   module uses the *current* ``LC_CTYPE`` locale to decode timezone
   names.


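To make the startup behaviour described above easier to inspect, here is
a minimal sketch that collects the encodings a running interpreter
selected for operating system data. The helper name
``describe_os_encodings`` is hypothetical; it only wraps existing standard
library calls, and the values depend on the platform and on the locale, so
no expected output is shown.

```python
import locale
import sys


def describe_os_encodings():
    """Return the encodings Python selected for operating system data."""
    return {
        # Encoding used for filenames, command line arguments, etc.
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Error handler paired with the filesystem encoding (Python 3.6+).
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        # Encoding derived from the LC_CTYPE locale, without calling
        # setlocale() again.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Standard streams may be replaced by objects without an encoding.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in describe_os_encodings().items():
        print("%s: %s" % (key, value))
```

Running the script with different values of ``LC_ALL`` is a simple way to
observe how the locale influences these settings.
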

The POSIX locale and its encoding
---------------------------------

The following environment variables are used to configure the locale, in
this preference order:

* ``LC_ALL``, most important variable
* ``LC_CTYPE``
* ``LANG``

The POSIX locale, also known as "the C locale", is used:

* if the first set variable is set to ``"C"``
* if all these variables are unset, for example when a program is
  started in an empty environment.

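The preference order above can be sketched as a small lookup. This is only
an illustration: the function names are hypothetical, treating an empty
value as unset and accepting ``"POSIX"`` as another name of the C locale
are assumptions of this sketch, and it does not model the case where the
requested locale does not exist on the system.

```python
def effective_ctype_locale(environ):
    """Return the locale name selected for LC_CTYPE, or None if unset."""
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(variable)
        if value:  # assumption: an empty value counts as unset
            return value
    return None


def is_posix_locale(environ):
    """Tell whether the environment selects the POSIX (C) locale."""
    name = effective_ctype_locale(environ)
    # Assumption: "POSIX" is accepted as a synonym of "C".
    return name is None or name in ("C", "POSIX")
```

For example, ``is_posix_locale({"LC_ALL": "C", "LANG": "en_US.UTF-8"})``
is true because ``LC_ALL`` takes precedence over ``LANG``.
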

The encoding of the POSIX locale must be ASCII or a superset of ASCII.

On Linux, the POSIX locale uses the ASCII encoding.

On FreeBSD and Solaris, ``nl_langinfo(CODESET)`` announces an alias of
the ASCII encoding, whereas the ``mbstowcs()`` and ``wcstombs()`` functions
use the ISO 8859-1 encoding (Latin1) in practice. The problem is that
``os.fsencode()`` and ``os.fsdecode()`` use the
``locale.getpreferredencoding()`` codec. For example, if command line
arguments are decoded by ``mbstowcs()`` and encoded back by
``os.fsencode()``, a ``UnicodeEncodeError`` exception is raised instead
of retrieving the original byte string.

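The mismatch can be reproduced in pure Python. The sketch below is a
simulation under stated assumptions, not the actual C code path: the
``latin-1`` codec stands in for what ``mbstowcs()`` does in practice on
the affected platforms, and the hypothetical helper
``encode_like_fsencode`` stands in for ``os.fsencode()`` with an ASCII
locale encoding.

```python
raw = b"\xe9"  # one non-ASCII byte ("é" in ISO 8859-1)

# Simulates mbstowcs() decoding the byte as Latin1.
decoded_by_libc = raw.decode("latin-1")


def encode_like_fsencode(text):
    """Stand-in for os.fsencode() when the locale encoding is ASCII."""
    return text.encode("ascii", "surrogateescape")


try:
    encode_like_fsencode(decoded_by_libc)
    round_trip_ok = True
except UnicodeEncodeError:
    # U+00E9 is neither ASCII nor an escaped surrogate.
    round_trip_ok = False

# When the same ASCII codec is used on both sides, the byte round-trips.
decoded_consistently = raw.decode("ascii", "surrogateescape")
```

The last line shows why using a single codec in both directions avoids
the error: the undecodable byte becomes a surrogate that can be encoded
back to the original byte.
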

To fix this issue, since Python 3.4, Python checks whether ``mbstowcs()``
really uses the ASCII encoding when ``LC_CTYPE`` uses the
POSIX locale and ``nl_langinfo(CODESET)`` returns ``"ASCII"`` (or an
alias to ASCII). If not (the effective encoding is not ASCII), Python
uses its own ASCII codec instead of using the ``mbstowcs()`` and
``wcstombs()`` functions for operating system data.

See the `POSIX locale (2016 Edition)
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>`_.


POSIX locale used by mistake
----------------------------

In many cases, the POSIX locale is not really expected by users who get
it by mistake. Examples:

* Program started in an empty environment
* User forcing ``LANG=C`` to get messages in English
* ``LANG=C`` used for bad reasons, without being aware of the ASCII encoding
* SSH shell
* User locale set to a non-existing locale, typo in the locale name for
  example


C.UTF-8 and C.utf8 locales
--------------------------

Some UNIX operating systems provide a variant of the POSIX locale using the
UTF-8 encoding:

* Fedora 25: ``"C.utf8"`` or ``"C.UTF-8"``
* Debian (eglibc 2.13-1, 2011), Ubuntu: ``"C.UTF-8"``
* HP-UX: ``"C.utf8"``

It was proposed to add a ``C.UTF-8`` locale to glibc: `glibc C.UTF-8
proposal <https://sourceware.org/glibc/wiki/Proposals/C.UTF-8>`_.

It is not planned to add such a locale to BSD systems.


Popularity of the UTF-8 encoding
--------------------------------

Python 3 uses UTF-8 by default for Python source files.

On Mac OS X, Windows and Android, Python always uses UTF-8 for operating
system data. For Windows, see PEP 529: "Change Windows filesystem
encoding to UTF-8".

On Linux, UTF-8 became the de facto standard encoding,
replacing legacy encodings like ISO 8859-1 or ShiftJIS. For example,
using different encodings for filenames and standard streams is likely
to create mojibake, so UTF-8 is now used *everywhere*.

The UTF-8 encoding is the default encoding of the XML and JSON file formats.
In January 2017, UTF-8 was used in `more than 88% of web pages
<https://w3techs.com/technologies/details/en-utf8/all/all>`_ (HTML,
JavaScript, CSS, etc.).

See `utf8everywhere.org <http://utf8everywhere.org/>`_ for more general
information on the UTF-8 codec.

.. note::
   Some applications and operating systems (especially Windows) use Byte
   Order Marks (BOM) to indicate the used Unicode encoding: UTF-7,
   UTF-8, UTF-16-LE, etc. BOMs are not well supported and rarely used in
   Python.


Old data stored in different encodings and surrogateescape
----------------------------------------------------------

Even if UTF-8 became the de facto standard, there are still systems in
the wild which don't use UTF-8. And there is a lot of data stored in
different encodings. For example, an old USB key using the ext3
filesystem with filenames encoded to ISO 8859-1.

The Linux kernel and the libc don't decode filenames: a filename is used
as a raw array of bytes. The common solution to support any filename is
to store filenames as bytes and not try to decode them. When displayed to
stdout, mojibake is displayed if the filename and the terminal don't use
the same encoding.

Python 3 promotes Unicode everywhere including filenames. A solution to
support filenames not decodable from the locale encoding was found: the
``surrogateescape`` error handler (PEP 383), which stores undecodable bytes
as surrogate characters. This error handler is used by default for
operating system data, by ``os.fsdecode()`` and ``os.fsencode()`` for
example (except on Windows which uses the ``strict`` error handler).


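As a short illustration of the ``surrogateescape`` error handler, the
following sketch decodes a filename encoded to ISO 8859-1 as if it were
UTF-8. The filename is made up for the example; the codec calls are the
standard ``bytes.decode()`` and ``str.encode()`` methods.

```python
# "café.txt" encoded to ISO 8859-1: 0xE9 is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

# The undecodable byte 0xE9 is stored as the lone surrogate U+DCE9.
name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same error handler restores the original bytes.
restored = name.encode("utf-8", "surrogateescape")

# The strict error handler refuses to encode the lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The round trip is lossless, which is what allows Python to pass such
filenames back to the operating system unchanged.
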

Standard streams
----------------

Python uses the locale encoding for standard streams: stdin, stdout and
stderr. The ``strict`` error handler is used by stdin and stdout to
prevent mojibake.

The ``backslashreplace`` error handler is used by stderr to avoid
Unicode encode errors when displaying non-ASCII text. It is especially
useful when the POSIX locale is used, because this locale usually uses
the ASCII encoding.

The problem is that operating system data like filenames are decoded
using the ``surrogateescape`` error handler (PEP 383). Displaying a
filename to stdout raises a Unicode encode error if the filename
contains an undecodable byte stored as a surrogate character.

Python 3.6 now uses ``surrogateescape`` for stdin and stdout if the
POSIX locale is used: `issue #19977 <http://bugs.python.org/issue19977>`_. The
idea is to pass through operating system data even if it means mojibake, because
most UNIX applications work like that. Most UNIX applications store filenames
as bytes, usually simply because bytes are first-class citizens in the
programming language used, whereas Unicode is badly supported.

.. note::
   The encoding and/or the error handler of standard streams can be
   overridden with the ``PYTHONIOENCODING`` environment variable.


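The effect of each error handler on a text stream can be tried without
touching the real standard streams. The sketch below wraps an in-memory
buffer in ``io.TextIOWrapper``, which is also the type of ``sys.stdout``;
the helper name ``encode_via_stream`` and the sample filename are
hypothetical.

```python
import io


def encode_via_stream(text, errors, encoding="utf-8"):
    """Write text to an in-memory text stream and return the bytes."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


# Filename holding an undecodable byte stored as a surrogate character.
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

# stdin/stdout with the POSIX locale: the original byte is passed through.
passed_through = encode_via_stream(name, "surrogateescape")

# stderr: the surrogate is written as an escape sequence.
escaped = encode_via_stream(name, "backslashreplace")

# stdin/stdout by default: the write fails.
try:
    encode_via_stream(name, "strict")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```
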

Proposal
========

Changes
-------

Add a new UTF-8 mode, disabled by default, to ignore the locale and
force the usage of the UTF-8 encoding with the ``surrogateescape`` error
handler, instead of using the locale encoding (with the ``strict`` or
``surrogateescape`` error handler depending on the case).

Basically, the UTF-8 mode behaves as Python 2: it "just works" and doesn't
bother users with encodings, but it can produce mojibake. It can be
configured as strict to prevent mojibake: the UTF-8 encoding is used
with the ``strict`` error handler in this case.

A new ``-X utf8`` command line option and a new ``PYTHONUTF8`` environment
variable are added to control the UTF-8 mode. The UTF-8 mode is enabled
by ``-X utf8`` or ``PYTHONUTF8=1``. The UTF-8 mode is configured as strict
by ``-X utf8=strict`` or ``PYTHONUTF8=strict``.

The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.

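The rules above can be summarized by a small decision function. This is a
sketch only, not the proposed implementation: the function name and its
parameters are hypothetical, and two points are assumptions because the
text does not specify them: the command line option takes precedence over
the environment variable, and an unknown value raises ``ValueError``.

```python
def resolve_utf8_mode(xoption=None, env_value=None, posix_locale=False):
    """Return "disabled", "enabled" or "strict".

    xoption: None if -X utf8 is absent, True for a bare "-X utf8",
             or the string after "-X utf8=".
    env_value: value of PYTHONUTF8, or None if the variable is unset.
    posix_locale: True if the LC_CTYPE locale is the POSIX locale.
    """
    if xoption is True:
        xoption = "1"
    # Assumption: the command line option wins over the environment.
    for value in (xoption, env_value):
        if value is None:
            continue
        if value == "1":
            return "enabled"
        if value == "strict":
            return "strict"
        if value == "0":
            return "disabled"
        # Assumption: unknown values are rejected.
        raise ValueError("invalid UTF-8 mode value: %r" % (value,))
    # No explicit setting: only the POSIX locale enables the mode.
    return "enabled" if posix_locale else "disabled"
```

For instance, ``resolve_utf8_mode(env_value="0", posix_locale=True)``
returns ``"disabled"``, matching the explicit opt-out described above.
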

Encoding and error handler
--------------------------

The UTF-8 mode changes the default encoding and error handler used by
``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
``sys.stdout`` and ``sys.stderr``:

============================ ======================= ========================== ==========================
Function                     Default                 UTF-8 or POSIX locale      UTF-8 Strict
============================ ======================= ========================== ==========================
open()                       locale/strict           **UTF-8/surrogateescape**  **UTF-8**/strict
os.fsdecode(), os.fsencode() locale/surrogateescape  **UTF-8**/surrogateescape  **UTF-8/strict**
sys.stdin, sys.stdout        locale/strict           **UTF-8/surrogateescape**  **UTF-8**/strict
sys.stderr                   locale/backslashreplace **UTF-8**/backslashreplace **UTF-8**/backslashreplace
============================ ======================= ========================== ==========================

By comparison, Python 3.6 uses:

============================ ======================= ==========================
Function                     Default                 POSIX locale
============================ ======================= ==========================
open()                       locale/strict           locale/strict
os.fsdecode(), os.fsencode() locale/surrogateescape  locale/surrogateescape
sys.stdin, sys.stdout        locale/strict           locale/**surrogateescape**
sys.stderr                   locale/backslashreplace locale/backslashreplace
============================ ======================= ==========================

The UTF-8 mode uses the ``surrogateescape`` error handler instead of the
strict mode for convenience: the idea is that data not encoded to UTF-8
are passed through "Python" without being modified, as raw bytes.

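The first table can also be read as a lookup from a function family and a
mode to an (encoding, error handler) pair. The sketch below only
transcribes that table into data; the keys (``"open"``, ``"fs"``,
``"stdio"``, ``"stderr"``), the mode names and the ``"locale"``
placeholder are hypothetical names chosen for this illustration.

```python
LOCALE = "locale"  # placeholder for "whatever the locale encoding is"

# Rows of the table: (default, UTF-8 or POSIX locale, UTF-8 strict).
PROPOSED_DEFAULTS = {
    "open": ((LOCALE, "strict"),
             ("utf-8", "surrogateescape"),
             ("utf-8", "strict")),
    "fs": ((LOCALE, "surrogateescape"),
           ("utf-8", "surrogateescape"),
           ("utf-8", "strict")),
    "stdio": ((LOCALE, "strict"),
              ("utf-8", "surrogateescape"),
              ("utf-8", "strict")),
    "stderr": ((LOCALE, "backslashreplace"),
               ("utf-8", "backslashreplace"),
               ("utf-8", "backslashreplace")),
}

MODES = ("disabled", "enabled", "strict")


def default_codec(function, mode):
    """Return the (encoding, errors) pair of the table for a mode."""
    return PROPOSED_DEFAULTS[function][MODES.index(mode)]
```

Note that ``stderr`` keeps ``backslashreplace`` in every column: only its
encoding changes.
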

Rationale
---------

The UTF-8 mode is disabled by default to keep hard Unicode errors when
encoding or decoding operating system data fails, and to keep
backward compatibility. The user is responsible for explicitly enabling the
UTF-8 mode, and so is better prepared for mojibake than if the UTF-8
mode were enabled *by default*.

The UTF-8 mode should be used on systems known to be configured with
UTF-8 where most applications speak UTF-8. It prevents Unicode errors if
the user overrides a locale *by mistake* or if a Python program is
started with no locale configured (and so with the POSIX locale).

Most UNIX applications handle operating system data as bytes, so the
``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables have a
limited impact on how these data are handled by the application.

The Python UTF-8 mode should help to make Python more interoperable with
the other UNIX applications in the system, assuming that *UTF-8* is used
everywhere and that users *expect* UTF-8.

Ignoring the ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` environment variables in
Python is more convenient, since they are more commonly misconfigured
*by mistake* (configured to use an encoding different than UTF-8,
whereas the system uses UTF-8) than misconfigured by intent.

Expected mojibake issues
------------------------

The UTF-8 mode only affects Python 3.7 code; other code is not aware of this
mode.

If Python 3.7 is used as a producer in a ``producer | consumer`` shell command,
the consumer may fail to decode input data if it decodes it and the locale
encoding is not UTF-8. If the consumer doesn't decode inputs and processes them
as bytes, it should just work.

If Python 3.7 is used as a consumer in a ``producer | consumer`` shell command,
it should just work.

If Python calls third party libraries or if Python is embedded in an
application, code outside Python is not aware of the UTF-8 mode. If the other
code uses UTF-8, it's fine. If the other code uses the locale encoding,
mojibake will occur when the locale encoding is not UTF-8.


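The producer case can be simulated in a single process. In this sketch the
"producer" is just a UTF-8 encoded byte string and the "consumers" are
plain ``bytes.decode()`` calls with different codecs; the sample text is
made up, and no real pipe or locale is involved.

```python
# Bytes written by a producer running in UTF-8 mode.
produced = "prix: 10 \u20ac\n".encode("utf-8")

# Consumer decoding with an 8-bit locale encoding: no error, but mojibake.
as_latin1 = produced.decode("iso-8859-1")

# Consumer decoding with ASCII (POSIX locale): hard failure.
try:
    produced.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Consumer processing bytes without decoding: the data stays intact.
passed_through = produced.replace(b"prix", b"price")
```

The euro sign is encoded to three UTF-8 bytes, which the ISO 8859-1
consumer silently turns into three unrelated characters.
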

Use Cases
=========

List a directory into stdout
----------------------------

Script listing the content of the current directory into stdout::

    import os
    for name in os.listdir(os.curdir):
        print(name)

Result:

======================== ==============================
Python                   Always works?
======================== ==============================
Python 2                 **Yes**
Python 3                 No
Python 3.5, POSIX locale **Yes**
UTF-8 mode               **Yes**
UTF-8 Strict mode        No
======================== ==============================

"Yes" means that the script cannot fail, but it can produce mojibake.

"No" means that the script can fail on decoding or encoding a filename
depending on the locale or the filename.


List a directory into a text file
---------------------------------

Similar to the previous example, except that the listing is written into
a text file::

    import os
    names = os.listdir(os.curdir)
    with open("/tmp/content.txt", "w") as fp:
        for name in names:
            fp.write("%s\n" % name)

Result:

======================== ==============================
Python                   Always works?
======================== ==============================
Python 2                 **Yes**
Python 3                 No
Python 3.5, POSIX locale No
UTF-8 mode               **Yes**
UTF-8 Strict mode        No
======================== ==============================

"Yes" means that the script cannot fail, but it can produce mojibake.

"No" means that the script can fail on decoding or encoding a filename
depending on the locale or the filename. Typical error::

    $ LC_ALL=C python3 test.py
    Traceback (most recent call last):
      File "test.py", line 5, in <module>
        fp.write("%s\n" % name)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)


Display Unicode characters into stdout
--------------------------------------

Very basic example used to illustrate a common issue: displaying the euro
sign (U+20AC: €)::

    print("euro: \u20ac")

Result:

======================== ==============================
Python                   Always works?
======================== ==============================
Python 2                 No
Python 3                 No
Python 3.5, POSIX locale No
UTF-8 mode               **Yes**
UTF-8 Strict mode        **Yes**
======================== ==============================

"Yes" means that the script cannot fail, but it can produce mojibake.

"No" means that the script can fail on encoding the euro sign depending on the
locale encoding.

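Whether the euro sign can be displayed comes down to whether the stream
encoding can encode it with the ``strict`` error handler. The sketch below
checks a few encodings directly; the helper name ``can_display`` is
hypothetical and the list of encodings is only a sample.

```python
def can_display(text, encoding):
    """Tell whether text can be encoded with the strict error handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


message = "euro: \u20ac"

results = {
    encoding: can_display(message, encoding)
    for encoding in ("ascii", "iso-8859-1", "iso-8859-15", "utf-8")
}
```

ASCII and ISO 8859-1 have no euro sign, whereas ISO 8859-15 and UTF-8 do,
which is why the result of the script depends on the locale encoding.
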

Replace a word in a text
------------------------

The following script replaces the word "apple" with "orange". It
reads input from stdin and writes the output into stdout::

    import sys
    text = sys.stdin.read()
    sys.stdout.write(text.replace("apple", "orange"))

Result:

======================== ==============================
Python                   Always works?
======================== ==============================
Python 2                 **Yes**
Python 3                 No
Python 3.5, POSIX locale **Yes**
UTF-8 mode               **Yes**
UTF-8 Strict mode        No
======================== ==============================

"Yes" means that the script cannot fail.

"No" means that the script can fail on decoding the input depending on
the locale.

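The difference between the two UTF-8 rows of the table can be reproduced
on bytes, without real standard streams. In this sketch the hypothetical
helper ``replace_word`` decodes and re-encodes with a given error handler,
which is roughly what stdin and stdout would do in each mode; the input
contains one byte that is not valid UTF-8.

```python
def replace_word(data, errors):
    """Decode UTF-8 input, replace a word and encode the result back."""
    text = data.decode("utf-8", errors)
    return text.replace("apple", "orange").encode("utf-8", errors)


data = b"apple \xff pie\n"  # 0xFF is never valid in UTF-8

# UTF-8 mode: the invalid byte is passed through unchanged.
passed_through = replace_word(data, "surrogateescape")

# UTF-8 Strict mode: decoding the input fails.
try:
    replace_word(data, "strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```
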

Backward Compatibility
======================

The main backward incompatible change is that the UTF-8 encoding is now
used if the locale is POSIX. Since the UTF-8 encoding is used with the
``surrogateescape`` error handler, encoding errors should not occur and
so the change should not break applications.

The more likely source of trouble comes from external libraries. Python
can decode successfully data from UTF-8, but a library using the locale
encoding can fail to encode the decoded text back to bytes. Hopefully,
encoding text in a library is a rare operation. Very few libraries
expect text; most libraries expect bytes and even manipulate bytes
internally.

If the locale is not POSIX, the PEP has no impact on backward
compatibility since the UTF-8 mode is disabled by default in this case:
it must be enabled explicitly.


Alternatives
============

Don't modify the encoding of the POSIX locale
---------------------------------------------

A first version of the PEP did not change the encoding and error handler
used for the POSIX locale.

The problem is that adding a command line option or setting an environment
variable is not possible in some cases, or at least not convenient.

Moreover, many users simply expect that Python 3 behaves as Python 2:
that it doesn't bother them with encodings and "just works" in all cases.
These users don't worry about mojibake, or even expect mojibake because of
complex documents using multiple incompatible encodings.


Always use UTF-8
----------------

Python already always uses the UTF-8 encoding on Mac OS X, Android and Windows.
Since UTF-8 became the de facto encoding, it makes sense to always use it on all
platforms with any locale.

The risk is to introduce mojibake if the locale uses a different encoding,
especially for locales other than the POSIX locale.


Force UTF-8 for the POSIX locale
--------------------------------

An alternative to always using UTF-8 in any case is to only use UTF-8 when the
``LC_CTYPE`` locale is the POSIX locale.

PEP 538 "Coercing the legacy C locale to C.UTF-8" by Nick Coghlan
proposes to implement that using the ``C.UTF-8`` locale.


Links
=====

PEPs:

* PEP 538: "Coercing the legacy C locale to C.UTF-8"
* PEP 529: "Change Windows filesystem encoding to UTF-8"
* PEP 383: "Non-decodable Bytes in System Character Interfaces"

Main Python issues:

* `Issue #28180: sys.getfilesystemencoding() should default to utf-8
  <http://bugs.python.org/issue28180>`_
* `Issue #19977: Use "surrogateescape" error handler for sys.stdin and
  sys.stdout on UNIX for the C locale
  <http://bugs.python.org/issue19977>`_
* `Issue #19847: Setting the default filesystem-encoding
  <http://bugs.python.org/issue19847>`_
* `Issue #8622: Add PYTHONFSENCODING environment variable
  <https://bugs.python.org/issue8622>`_: added but reverted because of
  many issues, read the `Inconsistencies if locale and filesystem
  encodings are different
  <https://mail.python.org/pipermail/python-dev/2010-October/104509.html>`_
  thread on the python-dev mailing list

Incomplete list of Python issues related to Unicode errors, especially
with the POSIX locale:

* 2016-12-22: `LANG=C python3 -c "import os; os.path.exists('\\xff')"
  <http://bugs.python.org/issue29042#msg283821>`_
* 2014-07-20: `Issue #22016: Add a new 'surrogatereplace' output only error handler
  <http://bugs.python.org/issue22016>`_
* 2014-04-27: `Issue #21368: Check for systemd locale on startup if current
  locale is set to POSIX <http://bugs.python.org/issue21368>`_ -- read manually
  /etc/locale.conf when the locale is POSIX
* 2014-01-21: `Issue #20329: zipfile.extractall fails in Posix shell with utf-8
  filename
  <http://bugs.python.org/issue20329>`_
* 2013-11-30: `Issue #19846: Python 3 raises Unicode errors with the C locale
  <http://bugs.python.org/issue19846>`_
* 2010-05-04: `Issue #8610: Python3/POSIX: errors if file system encoding is None
  <http://bugs.python.org/issue8610>`_
* 2013-08-12: `Issue #18713: Clearly document the use of PYTHONIOENCODING to
  set surrogateescape <http://bugs.python.org/issue18713>`_
* 2013-09-27: `Issue #19100: Use backslashreplace in pprint
  <http://bugs.python.org/issue19100>`_
* 2012-01-05: `Issue #13717: os.walk() + print fails with UnicodeEncodeError
  <http://bugs.python.org/issue13717>`_
* 2011-12-20: `Issue #13643: 'ascii' is a bad filesystem default encoding
  <http://bugs.python.org/issue13643>`_
* 2011-03-16: `Issue #11574: TextIOWrapper should use UTF-8 by default for the
  POSIX locale
  <http://bugs.python.org/issue11574>`_, thread on python-dev:
  `Low-Level Encoding Behavior on Python 3
  <https://mail.python.org/pipermail/python-dev/2011-March/109361.html>`_
* 2010-04-26: `Issue #8533: regrtest: use backslashreplace error handler for
  stdout <http://bugs.python.org/issue8533>`_, regrtest fails with Unicode
  encode error if the locale is POSIX

Some issues are real bugs in applications which must explicitly set the
encoding. Such code just works in the common case (locale configured
correctly), but the program "suddenly" fails when the POSIX
locale is used (probably for bad reasons). Such bugs are not well
understood by users. Examples of such issues:

* 2013-11-21: `pip: open() uses the locale encoding to parse Python
  script, instead of the encoding cookie
  <http://bugs.python.org/issue19685>`_ -- pip must use the encoding
  cookie to read a Python source code file
* 2011-01-21: `IDLE 3.x can crash decoding recent file list
  <http://bugs.python.org/issue10974>`_


Prior Art
=========

Perl has a ``-C`` command line option and a ``PERLUNICODE`` environment
variable to force UTF-8: see `perlrun
<http://perldoc.perl.org/perlrun.html>`_. It is possible to configure
UTF-8 per standard stream, on input and output streams, etc.


Copyright
=========

This document has been placed in the public domain.