PEP 597: Rewrite (#1296)
parent 89b0b69863
commit 57ca4dc015
272
pep-0597.rst
|
@@ -1,190 +1,109 @@
|
||||||
PEP: 597
|
PEP: 597
|
||||||
Title: Add PYTHONTEXTENCODING environment variable
|
Title: Enable UTF-8 mode by default on Windows
|
||||||
Author: Inada Naoki <songofacandy@gmail.com>
|
Author: Inada Naoki <songofacandy@gmail.com>
|
||||||
Status: Draft
|
Status: Draft
|
||||||
Type: Standards Track
|
Type: Standards Track
|
||||||
Content-Type: text/x-rst
|
Content-Type: text/x-rst
|
||||||
Created: 05-Jun-2019
|
Created: 05-Jun-2019
|
||||||
Python-Version: 3.9
|
Python-Version: 3.10
|
||||||
|
|
||||||
|
|
||||||
Abstract
|
Abstract
|
||||||
========
|
========
|
||||||
|
|
||||||
Currently, ``TextIOWrapper`` uses ``locale.getpreferredencoding(False)``
|
This PEP proposes to make UTF-8 mode [#]_ enabled by default on
|
||||||
(hereinafter called "locale encoding") when ``encoding`` is not
|
Windows.
|
||||||
specified.
|
|
||||||
|
|
||||||
This PEP proposes adding ``PYTHONTEXTENCODING`` environment
|
|
||||||
variable to override the default text encoding since Python 3.9.
|
|
||||||
|
|
||||||
The goal of this PEP is to provide a "UTF-8 by default" experience to
|
The goal of this PEP is to provide a "UTF-8 by default" experience to
|
||||||
Windows users, because macOS, Linux, Android, iOS users use UTF-8
|
Windows users, like the one Unix users already have.
|
||||||
by default already.
|
|
||||||
|
|
||||||
|
|
||||||
Motivation
|
Motivation
|
||||||
==========
|
==========
|
||||||
|
|
||||||
UTF-8 is the best encoding for saving Unicode text
|
UTF-8 is the best encoding nowadays
|
||||||
--------------------------------------------------
|
-----------------------------------
|
||||||
|
|
||||||
Strings in Python 3 are Unicode. Encoding valid Unicode strings with
|
Popular text editors like VS Code use UTF-8 by default.
|
||||||
UTF-8 should not fail.
|
Even Microsoft Notepad uses UTF-8 by default since the Windows 10
|
||||||
|
May 2019 Update.
|
||||||
|
Additionally, the default encoding of Python source files is UTF-8.
|
||||||
|
|
||||||
On the other hand, most locale encodings used on Windows cannot
|
We can assume that most Python programmers use UTF-8 for most text
|
||||||
save all valid Unicode strings. It will cause UnicodeEncodeError
|
files.
|
||||||
or it may not round-trip. Users may lose their data in such cases.
|
|
||||||
|
|
||||||
UTF-8 is the best encoding for saving text when users don't specify
|
|
||||||
any encoding.
|
|
||||||
|
|
||||||
|
|
||||||
Using UTF-8 by default is easier on new programmers
|
|
||||||
---------------------------------------------------
|
|
||||||
|
|
||||||
Python is one of the most popular first programming languages.
|
Python is one of the most popular first programming languages.
|
||||||
|
New programmers may not know about encoding. If the default encoding
|
||||||
New programmers may not know about encoding. When they download text
|
for text files is UTF-8, they can learn about encoding when they need
|
||||||
data written in UTF-8 from the Internet, they are forced to learn
|
to handle legacy encodings.
|
||||||
about encoding.
|
|
||||||
|
|
||||||
Popular text editors like VS Code or Atom use UTF-8 by default.
|
|
||||||
Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May
|
|
||||||
2019 Update. (Note that Python 3.9 will be released in 2021.)
|
|
||||||
|
|
||||||
Additionally, the default encoding of Python source files is UTF-8.
|
|
||||||
We can assume new Python programmers who don't know about encoding
|
|
||||||
use editors which use UTF-8 by default.
|
|
||||||
|
|
||||||
It would be nice if new programmers are not forced to learn about
|
|
||||||
encoding until they need to handle text files encoded in an encoding
|
|
||||||
other than UTF-8.
|
|
||||||
|
|
||||||
|
|
||||||
People assume it is always UTF-8
|
People assume the default encoding is UTF-8 already
|
||||||
--------------------------------
|
---------------------------------------------------
|
||||||
|
|
||||||
Package authors using macOS or Linux may forget that the default
|
Developers using macOS or Linux may forget that the default encoding
|
||||||
encoding is not always UTF-8.
|
is not always UTF-8.
|
||||||
|
|
||||||
For example, ``long_description = open("README.md").read()`` in
|
For example, ``long_description = open("README.md").read()`` in
|
||||||
``setup.py`` is a common mistake. If there is at least one emoji or
|
``setup.py`` is a common mistake. Many Windows users cannot install
|
||||||
any other non-ASCII character in the ``README.md`` file, many Windows
|
the package if there is at least one emoji or any other non-ASCII
|
||||||
users cannot install the package due to a ``UnicodeDecodeError``.
|
character in the ``README.md`` file.
|
||||||
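The ``setup.py`` mistake described above can be sketched as follows. This is only an illustration under stated assumptions: ``read_long_description`` is a hypothetical helper, and ``cp1252`` stands in for a legacy Windows code page (the actual locale encoding depends on the user's system). The sketch writes a UTF-8 ``README.md`` containing one non-ASCII character and shows that decoding it with the legacy code page fails, while passing ``encoding="utf-8"`` explicitly works everywhere.

```python
import tempfile
from pathlib import Path


def read_long_description(path, encoding="utf-8"):
    """Hypothetical helper: read a README with an explicit encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


def demo():
    """Return the decoded text and whether a cp1252 read failed."""
    with tempfile.TemporaryDirectory() as tmp:
        readme = Path(tmp) / "README.md"
        # U+3042 (HIRAGANA LETTER A) encodes to the bytes E3 81 82 in UTF-8;
        # byte 0x81 is undefined in Python's cp1252 codec.
        readme.write_text("Hello \u3042", encoding="utf-8")

        text = read_long_description(readme)  # explicit UTF-8: portable
        try:
            # Assumption: cp1252 plays the role of the Windows locale encoding.
            read_long_description(readme, encoding="cp1252")
            legacy_failed = False
        except UnicodeDecodeError:
            legacy_failed = True
    return text, legacy_failed
```

Passing ``encoding="utf-8"`` in ``setup.py`` avoids the problem regardless of whether UTF-8 mode is enabled.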
|
|
||||||
|
Even Python experts assume that the default encoding is UTF-8.
|
||||||
|
It creates bugs that happen only on Windows. See [#]_ [#]_.
|
||||||
|
|
||||||
Consistent with default encoding
|
Changing the default text encoding to UTF-8 will help many Windows
|
||||||
--------------------------------
|
users.
|
||||||
|
|
||||||
Python has ``sys.getdefaultencoding()`` which is always "UTF-8".
|
|
||||||
``str.encode()`` uses "UTF-8" when encoding is omitted.
|
|
||||||
|
|
||||||
Using "UTF-8" for text files is consistent with it. It makes Python
|
|
||||||
an easier language to learn.
|
|
||||||
|
|
||||||
|
|
||||||
Specification
|
Specification
|
||||||
=============
|
=============
|
||||||
|
|
||||||
``PYTHONTEXTENCODING`` environment variable
|
Enable UTF-8 mode on Windows unless it is disabled explicitly.
|
||||||
-------------------------------------------
|
|
||||||
|
|
||||||
The ``PYTHONTEXTENCODING`` environment variable can be used to specify the
|
UTF-8 mode affects these areas:
|
||||||
default text encoding.
|
|
||||||
|
|
||||||
Unlike ``PYTHONIOENCODING``, it doesn't accept an error handler.
|
* ``locale.getpreferredencoding`` returns "UTF-8".
|
||||||
``PYTHONIOENCODING`` supports it because changing the error handler of
|
|
||||||
stdio was difficult. But that is not true for regular files.
|
* ``open``, ``subprocess.Popen``, ``pathlib.Path.read_text``,
|
||||||
|
``ZipFile.open``, and many other functions use UTF-8 when
|
||||||
|
the ``encoding`` option is omitted.
|
||||||
|
|
||||||
|
* The standard I/O streams always use "UTF-8".
|
||||||
|
|
||||||
|
* Console I/O uses "UTF-8" already [#]_. So this affects
|
||||||
|
only when the standard streams are redirected.
|
||||||
|
|
||||||
|
On the other hand, UTF-8 mode doesn't affect the "mbcs" encoding.
|
||||||
|
Users can still use the system encoding by choosing the "mbcs" encoding
|
||||||
|
explicitly.
|
||||||
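The areas listed above can be inspected from a running interpreter. The following is a minimal sketch, not part of the PEP: ``describe_text_defaults`` is a hypothetical helper name, and it relies only on ``sys.flags.utf8_mode`` (available since Python 3.7) and ``locale.getpreferredencoding``.

```python
import locale
import sys


def describe_text_defaults():
    """Hypothetical helper: report the settings that UTF-8 mode influences."""
    return {
        # True when the interpreter runs in UTF-8 mode.
        "utf8_mode": bool(sys.flags.utf8_mode),
        # The encoding open() falls back to when ``encoding`` is omitted.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding of standard output (may be None if stdout was replaced).
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in describe_text_defaults().items():
        print(f"{key}: {value}")
```

When UTF-8 mode is enabled, ``preferred_encoding`` should report UTF-8 regardless of the system code page; otherwise it reflects the locale.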
|
|
||||||
|
|
||||||
``sys.gettextencoding()``
|
Backwards Compatibility
|
||||||
-------------------------
|
=======================
|
||||||
|
|
||||||
When ``PYTHONTEXTENCODING`` is specified, this function returns it.
|
Some existing applications assuming the default text encoding is the
|
||||||
|
system encoding (a.k.a. ANSI encoding) will be broken by this change.
|
||||||
|
|
||||||
When it is not specified, this function returns
|
Users can disable UTF-8 mode with an environment variable
|
||||||
``locale.getpreferredencoding(False)``.
|
(``PYTHONUTF8=0``) or a command line option (``-Xutf8=0``) for backward
|
||||||
|
compatibility.
|
||||||
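The two opt-out switches mentioned above can be exercised by launching a child interpreter. This is a sketch under assumptions: ``utf8_mode_in_child`` and ``PROBE`` are hypothetical names introduced for illustration, and the child simply prints ``sys.flags.utf8_mode``.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: print 1 if UTF-8 mode is on, else 0.
PROBE = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode_in_child(x_option=None, env_value=None):
    """Hypothetical helper: start a child Python and report its UTF-8 mode.

    x_option  -- value for the ``-X utf8=...`` command line option, or None
    env_value -- value for the ``PYTHONUTF8`` environment variable, or None
    """
    cmd = [sys.executable]
    if x_option is not None:
        cmd += ["-X", f"utf8={x_option}"]
    cmd += ["-c", PROBE]

    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    if env_value is not None:
        env["PYTHONUTF8"] = env_value

    result = subprocess.run(
        cmd, env=env, capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("with -X utf8=0:   ", utf8_mode_in_child(x_option="0"))
    print("with PYTHONUTF8=1:", utf8_mode_in_child(env_value="1"))
```

An application that depends on the legacy system encoding could be launched the same way with UTF-8 mode explicitly disabled.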
|
|
||||||
|
|
||||||
``encoding="locale"`` option
|
Rejected Ideas
|
||||||
----------------------------
|
===============
|
||||||
|
|
||||||
``TextIOWrapper`` now accepts the ``encoding="locale"`` option.
|
Change the default encoding of TextIOWrapper to "UTF-8"
|
||||||
|
-------------------------------------------------------
|
||||||
|
|
||||||
"locale" is not real encoding or alias.
|
This idea changed the default encoding to UTF-8 always, regardless of
|
||||||
This is just a shortcut for
|
platform, locale, and environment variables.
|
||||||
``encoding=locale.getpreferredencoding(False)``.
|
|
||||||
|
|
||||||
|
While this idea looks ideal in terms of consistency, it will cause
|
||||||
|
backward compatibility problems.
|
||||||
|
|
||||||
Changes in stdlibs
|
Utilizing the UTF-8 mode seems better than adding one more backward
|
||||||
------------------
|
compatibility option like ``PYTHONLEGACYWINDOWSSTDIO``.
|
||||||
|
|
||||||
``TextIOWrapper`` uses ``sys.gettextencoding()`` where
|
|
||||||
``locale.getpreferredencoding(False)`` is used.
|
|
||||||
|
|
||||||
But ``stdin``, ``stdout``, and ``stderr`` continue to respect
|
|
||||||
locale encoding as well. ``PYTHONIOENCODING`` can be used to
|
|
||||||
override their encoding.
|
|
||||||
|
|
||||||
Pipes and TTY should use the "locale" encoding. UTF-8 mode [1]_
|
|
||||||
can be used to override these encodings:
|
|
||||||
|
|
||||||
* ``subprocess`` and ``os.popen`` use the "locale" encoding because
|
|
||||||
the subprocess will use the locale encoding.
|
|
||||||
* ``getpass.getpass`` uses the "locale" encoding when using TTY.
|
|
||||||
|
|
||||||
All other code using the default encoding is not modified.
|
|
||||||
They can be overridden by ``PYTHONTEXTENCODING``.
|
|
||||||
This is an incomplete list:
|
|
||||||
|
|
||||||
* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``
|
|
||||||
* ``socket.makefile``
|
|
||||||
* ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
|
|
||||||
* ``trace.CoverageResults.write_results_file``
|
|
||||||
|
|
||||||
|
|
||||||
Rationale
|
|
||||||
=========
|
|
||||||
|
|
||||||
Why not just enable UTF-8 mode by default?
|
|
||||||
------------------------------------------
|
|
||||||
|
|
||||||
This PEP is not mutually exclusive with UTF-8 mode.
|
|
||||||
|
|
||||||
If we enable UTF-8 mode by default, even people using Windows will
|
|
||||||
forget the default encoding is not always UTF-8. More scripts will
|
|
||||||
be written assuming the default encoding is UTF-8.
|
|
||||||
|
|
||||||
So changing the default encoding of text files to UTF-8 would be
|
|
||||||
better even if UTF-8 mode is enabled by default at some point.
|
|
||||||
|
|
||||||
|
|
||||||
Why is "locale" not an alias codec?
|
|
||||||
-----------------------------------
|
|
||||||
|
|
||||||
For backward compatibility, ``io.TextIOWrapper`` calls
|
|
||||||
``locale.getpreferredencoding(False)`` every time when
|
|
||||||
``encoding="locale"`` is specified.
|
|
||||||
|
|
||||||
It respects changing locale after Python startup.
|
|
||||||
|
|
||||||
|
|
||||||
Why not change std(in|out|err) encoding too?
|
|
||||||
--------------------------------------------
|
|
||||||
|
|
||||||
Even when the locale encoding is not UTF-8, there can be many UTF-8
|
|
||||||
text files. These files could be downloaded from the Internet or
|
|
||||||
written by modern text editors.
|
|
||||||
|
|
||||||
On the other hand, terminal encoding is assumed to be the same as
|
|
||||||
locale encoding. And other tools are assumed to read and write the
|
|
||||||
locale encoding as well.
|
|
||||||
|
|
||||||
std(in|out|err) are likely to be connected to a terminal or other
|
|
||||||
tools. So the locale encoding should be respected.
|
|
||||||
|
|
||||||
Anyway, ``PYTHONIOENCODING`` can be used to change these encodings.
|
|
||||||
|
|
||||||
|
|
||||||
Reference Implementation
|
Reference Implementation
|
||||||
|
@@ -193,76 +112,13 @@ Reference Implementation
|
||||||
To be written.
|
To be written.
|
||||||
|
|
||||||
|
|
||||||
Rejected Ideas
|
|
||||||
==============
|
|
||||||
|
|
||||||
Change the default text encoding
|
|
||||||
--------------------------------
|
|
||||||
|
|
||||||
A previous version of this PEP tried to change the default encoding
|
|
||||||
to UTF-8.
|
|
||||||
|
|
||||||
But we should have a sufficiently long deprecation period. During the
|
|
||||||
deprecation period, users can not change the default text encoding.
|
|
||||||
|
|
||||||
And there are many difficulties:
|
|
||||||
|
|
||||||
* Omitting ``encoding`` option is very common.
|
|
||||||
|
|
||||||
* If we raise ``DeprecationWarning`` always, it will be too noisy.
|
|
||||||
* We cannot assume how users use it. Complicated heuristics may be
|
|
||||||
needed to raise ``DeprecationWarning`` only when it is really
|
|
||||||
needed.
|
|
||||||
|
|
||||||
* Users of legacy systems may dismiss the warning.
|
|
||||||
|
|
||||||
* They may not check the warning.
|
|
||||||
* They may upgrade Python from 2.7 after 2020.
|
|
||||||
|
|
||||||
|
|
||||||
Additionally, Microsoft is improving UTF-8 support of Windows 10
|
|
||||||
recently.
|
|
||||||
|
|
||||||
There is no public plan for future UTF-8 support yet. But Python may
|
|
||||||
be able to change the default encoding without painful deprecation
|
|
||||||
period in the future.
|
|
||||||
|
|
||||||
|
|
||||||
Open Issues
|
|
||||||
===========
|
|
||||||
|
|
||||||
Easy way to set ``PYTHONTEXTENCODING``
|
|
||||||
--------------------------------------
|
|
||||||
|
|
||||||
UTF-8 is the best encoding for new users. But setting environment
|
|
||||||
variables is not easy enough for new users.
|
|
||||||
|
|
||||||
It would be helpful if Python on Windows could provide an easy way to set
|
|
||||||
``PYTHONTEXTENCODING=UTF-8`` even after Python is installed.
|
|
||||||
|
|
||||||
|
|
||||||
Commandline option
|
|
||||||
------------------
|
|
||||||
|
|
||||||
If there is a reasonable use case for changing default text encoding
|
|
||||||
per process, a command line option should be considered.
|
|
||||||
|
|
||||||
|
|
||||||
C-API
|
|
||||||
-----
|
|
||||||
|
|
||||||
It should be possible to configure the default text encoding from C.
|
|
||||||
This will be considered when writing the reference implementation.
|
|
||||||
|
|
||||||
Additionally, C-API like ``PySys_GetTextEncoding()`` should be
|
|
||||||
considered too.
|
|
||||||
|
|
||||||
|
|
||||||
References
|
References
|
||||||
==========
|
==========
|
||||||
|
|
||||||
.. [1] PEP 540, Add a new UTF-8 Mode
|
.. [#] `PEP 540 -- Add a new UTF-8 Mode <https://www.python.org/dev/peps/pep-0540/>`_
|
||||||
(https://www.python.org/dev/peps/pep-0540/)
|
.. [#] https://github.com/pypa/packaging.python.org/pull/682
|
||||||
|
.. [#] https://bugs.python.org/issue33684
|
||||||
|
.. [#] `PEP 528 -- Change Windows console encoding to UTF-8 <https://www.python.org/dev/peps/pep-0528/>`_
|
||||||
|
|
||||||
|
|
||||||
Copyright
|
Copyright
|
||||||
|
|