PEP 597: Rewrite (#1296)

Inada Naoki 2020-02-04 18:35:06 +09:00 committed by GitHub
parent 89b0b69863
commit 57ca4dc015
1 changed files with 64 additions and 208 deletions

@ -1,190 +1,109 @@
PEP: 597 PEP: 597
Title: Add PYTHONTEXTENCODING environment variable Title: Enable UTF-8 mode by default on Windows
Author: Inada Naoki <songofacandy@gmail.com> Author: Inada Naoki <songofacandy@gmail.com>
Status: Draft Status: Draft
Type: Standards Track Type: Standards Track
Content-Type: text/x-rst Content-Type: text/x-rst
Created: 05-Jun-2019 Created: 05-Jun-2019
Python-Version: 3.9 Python-Version: 3.10
Abstract Abstract
======== ========
Currently, ``TextIOWrapper`` uses ``locale.getpreferredencoding(False)`` This PEP proposes to make UTF-8 mode [#]_ enabled by default on
(hereinafter called "locale encoding") when ``encoding`` is not Windows.
specified.
This PEP proposes adding ``PYTHONTEXTENCODING`` environment
variable to override the default text encoding since Python 3.9.
The goal of this PEP is providing "UTF-8 by default" experience to The goal of this PEP is providing "UTF-8 by default" experience to
Windows users, because macOS, Linux, Android, iOS users use UTF-8 Windows users like Unix users.
by default already.
Motivation Motivation
========== ==========
UTF-8 is the best encoding for saving unicode text UTF-8 is the best encoding nowadays
-------------------------------------------------- -----------------------------------
String in Python 3 is unicode. Encoding valid unicode strings with Popular text editors like VS Code use UTF-8 by default.
UTF-8 should not fail. Even Microsoft Notepad uses UTF-8 by default since the Windows 10
May 2019 Update.
Additionally, the default encoding of Python source files is UTF-8.
On the other hand, most locale encodings used on Windows cannot We can assume that most Python programmers use UTF-8 for most text
save all valid unicode strings. It will cause UnicodeEncodeError files.
or it may not round-trip. Users may lose their data in such cases.
UTF-8 is the best encoding for saving text when the user doesn't specify
any encoding.
Using UTF-8 by default is easier on new programmers
---------------------------------------------------
Python is one of the most popular first programming languages. Python is one of the most popular first programming languages.
New programmers may not know about encoding. If the default encoding
New programmers may not know about encoding. When they download text for text files is UTF-8, they can learn about encoding when they need
data written in UTF-8 from the Internet, they are forced to learn to handle legacy encoding.
about encoding.
Popular text editors like VS Code or Atom use UTF-8 by default.
Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May
2019 Update. (Note that Python 3.9 will be released in 2021.)
Additionally, the default encoding of Python source files is UTF-8.
We can assume new Python programmers who don't know about encoding
use editors which use UTF-8 by default.
It would be nice if new programmers are not forced to learn about
encoding until they need to handle text files encoded in encoding
other than UTF-8.
People assume it is always UTF-8 People assume the default encoding is UTF-8 already
-------------------------------- ---------------------------------------------------
Package authors using macOS or Linux may forget that the default Developers using macOS or Linux may forget that the default encoding
encoding is not always UTF-8. is not always UTF-8.
For example, ``long_description = open("README.md").read()`` in For example, ``long_description = open("README.md").read()`` in
``setup.py`` is a common mistake. If there is at least one emoji or ``setup.py`` is a common mistake. Many Windows users can not install
any other non-ASCII character in the ``README.md`` file, many Windows the package if there is at least one emoji or any other non-ASCII
users cannot install the package due to a ``UnicodeDecodeError``. character in the ``README.md`` file.
Even Python experts assume that default encoding is UTF-8.
It creates bugs that happen only on Windows. See [#]_ [#]_.
Consistent with default encoding Changing the default text encoding to UTF-8 will help many Windows
-------------------------------- users.
Python has ``sys.getdefaultencoding()`` which always returns "UTF-8".
``str.encode()`` uses "UTF-8" when encoding is omitted.
Using "UTF-8" for text files is consistent with it. It makes Python
an easier language to learn.
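The ``setup.py`` mistake described above can be avoided today by passing ``encoding`` explicitly. The following is a minimal sketch, not taken from the PEP: the helper name ``read_long_description`` and the temporary ``README.md`` are hypothetical and exist only for illustration.

```python
import os
import tempfile


def read_long_description(path):
    """Read a README as UTF-8, independent of the locale encoding.

    Hypothetical helper for illustration; not part of any library.
    """
    with open(path, encoding="utf-8") as f:
        return f.read()


# Demonstration with a throwaway README that contains a non-ASCII character.
with tempfile.TemporaryDirectory() as tmp:
    readme = os.path.join(tmp, "README.md")
    with open(readme, "w", encoding="utf-8") as f:
        f.write("Hello \N{SNOWMAN}\n")
    long_description = read_long_description(readme)
```

Because the encoding is spelled out, the result does not depend on whether the platform default is UTF-8.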
Specification Specification
============= =============
``PYTHONTEXTENCODING`` environment variable Enable UTF-8 mode on Windows unless it is disabled explicitly.
-------------------------------------------
``PYTHONTEXTENCODING`` environment variable can be used to specify the UTF-8 mode affects these areas:
default text encoding.
Unlike ``PYTHONIOENCODING``, it doesn't accept error handler. * ``locale.getpreferredencoding`` returns "UTF-8".
``PYTHONIOENCODING`` support it because changing error handler of
stdio was difficult. But it is not true for regular files. * ``open``, ``subprocess.Popen``, ``pathlib.Path.read_text``,
``ZipFile.open``, and many other functions use UTF-8 when
the ``encoding`` option is omitted.
* The stdio always uses "UTF-8".
* Console I/O uses "UTF-8" already [#]_. So this matters
only when the stdio are redirected.
On the other hand, UTF-8 mode doesn't affect the "mbcs" encoding.
Users can still use the system encoding by choosing the "mbcs" encoding
explicitly.
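As a rough illustration of the areas listed above, the sketch below reports whether UTF-8 mode is active in the running interpreter and what the locale encoding is. It assumes Python 3.7 or later, where ``sys.flags.utf8_mode`` exists; the function name ``default_text_encoding_report`` is hypothetical.

```python
import locale
import sys


def default_text_encoding_report():
    """Collect values that UTF-8 mode is expected to influence.

    Hypothetical helper for illustration only.
    """
    return {
        # 1 when UTF-8 mode is active, 0 otherwise.
        "utf8_mode": sys.flags.utf8_mode,
        # The locale encoding; reports UTF-8 when UTF-8 mode is active.
        "preferred_encoding": locale.getpreferredencoding(False),
    }


report = default_text_encoding_report()
print(report)
```

The printed values depend on the platform and on how the interpreter was started, so no particular output is shown here.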
``sys.gettextencoding()`` Backwards Compatibility
------------------------- =======================
When ``PYTHONTEXTENCODING`` is specified, this function returns it. Some existing applications assuming the default text encoding is the
system encoding (a.k.a. ANSI encoding) will be broken by this change.
When it is not specified, this function returns Users can disable the UTF-8 mode by environment variable
``locale.getpreferredencoding(False)``. (``PYTHONUTF8=0``) or command line option (``-Xutf8=0``) for backward
compatibility.
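The two opt-out switches mentioned above can be checked from a parent process. This is only a sketch under the assumption that ``sys.executable`` points at a Python 3.7+ interpreter; ``utf8_mode_in_child`` is a hypothetical helper, not an existing API.

```python
import os
import subprocess
import sys

# The child prints the value of sys.flags.utf8_mode (0 or 1).
CHILD_CODE = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode_in_child(*interpreter_args, env_value=None):
    """Start a child interpreter and return its sys.flags.utf8_mode.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ)
    # Start from a clean state so the parent's setting does not leak in.
    env.pop("PYTHONUTF8", None)
    if env_value is not None:
        env["PYTHONUTF8"] = env_value
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    return int(result.stdout.strip())


enabled_by_option = utf8_mode_in_child("-X", "utf8")
disabled_by_option = utf8_mode_in_child("-X", "utf8=0")
enabled_by_env = utf8_mode_in_child(env_value="1")
disabled_by_env = utf8_mode_in_child(env_value="0")
```

An application that depends on the system encoding could be launched the same way, with ``PYTHONUTF8=0`` in its environment.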
``encoding="locale"`` option Rejected Ideas
---------------------------- ===============
``TextIOWrapper`` now accepts ``encoding="locale"`` option. Change the default encoding of TextIOWrapper to "UTF-8"
-------------------------------------------------------
"locale" is not real encoding or alias. This idea changed the default encoding to UTF-8 always, regardless of
This is just a shortcut of platform, locale, and environment variables.
``encoding=locale.getpreferredencoding(False)``.
While this idea looks ideal in terms of consistency, it will cause
backward compatibility problems.
Changes in stdlibs Utilizing the UTF-8 mode seems better than adding one more backward
------------------ compatibility option like ``PYTHONLEGACYWINDOWSSTDIO``.
``TextIOWrapper`` uses ``sys.gettextencoding()`` where
``locale.getpreferredencoding(False)`` is used.
But ``stdin``, ``stdout``, and ``stderr`` continue to respect
locale encoding as well. ``PYTHONIOENCODING`` can be used to
override their encoding.
Pipes and TTY should use the "locale" encoding. UTF-8 mode [1]_
can be used to override these encoding:
* ``subprocess`` and ``os.popen`` use the "locale" encoding because
the subprocess will use the locale encoding.
* ``getpass.getpass`` uses the "locale" encoding when using TTY.
All other code using the default encoding is not modified.
They can be overridden by ``PYTHONTEXTENCODING``.
This is an incomplete list:
* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``
* ``socket.makefile``
* ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
* ``trace.CoverageResults.write_results_file``
Rationale
=========
Why not just enable UTF-8 mode by default?
------------------------------------------
This PEP is not mutually exclusive with UTF-8 mode.
If we enable UTF-8 mode by default, even people using Windows will
forget the default encoding is not always UTF-8. More scripts will
be written assuming the default encoding is UTF-8.
So changing the default encoding of text files to UTF-8 would be
better even if UTF-8 mode is enabled by default at some point.
Why is "locale" not an alias codec?
-----------------------------------
For backward compatibility, ``io.TextIOWrapper`` calls
``locale.getpreferredencoding(False)`` every time when
``encoding="locale"`` is specified.
It respects changing locale after Python startup.
Why not change std(in|out|err) encoding too?
--------------------------------------------
Even when the locale encoding is not UTF-8, there can be many UTF-8
text files. These files could be downloaded from the Internet or
written by modern text editors.
On the other hand, terminal encoding is assumed to be the same as
locale encoding. And other tools are assumed to read and write the
locale encoding as well.
std(in|out|err) are likely to be connected to a terminal or other
tools. So the locale encoding should be respected.
Anyway, ``PYTHONIOENCODING`` can be used to change these encodings.
Reference Implementation Reference Implementation
@ -193,76 +112,13 @@ Reference Implementation
To be written. To be written.
Rejected Ideas
==============
Change the default text encoding
--------------------------------
A previous version of this PEP tried to change the default encoding
to UTF-8.
But that requires a sufficiently long deprecation period. During the
deprecation period, users cannot change the default text encoding.
And there are many difficulties:
* Omitting ``encoding`` option is very common.
* If we raise ``DeprecationWarning`` always, it will be too noisy.
* We cannot assume how users use it. Complicated heuristics may be
needed to raise ``DeprecationWarning`` only when it is really
needed.
* Users of legacy systems may dismiss the warning.
* They may not check the warning.
* They may upgrade Python from 2.7 after 2020.
Additionally, Microsoft has recently been improving UTF-8 support in
Windows 10.
There is no public plan for future UTF-8 support yet. But Python may
be able to change the default encoding without a painful deprecation
period in the future.
Open Issues
===========
Easy way to set ``PYTHONTEXTENCODING``
--------------------------------------
UTF-8 is the best encoding for new users. But setting environment
variables is not easy enough for new users.
It would be helpful if Python on Windows provided an easy way to set
``PYTHONTEXTENCODING=UTF-8`` even after Python is installed.
Commandline option
------------------
If there is a reasonable use case for changing the default text encoding
per process, a command line option should be considered.
C-API
-----
The default text encoding should be configurable from C.
This will be considered when writing the reference implementation.
Additionally, C-API like ``PySys_GetTextEncoding()`` should be
considered too.
References References
========== ==========
.. [1] PEP 540, Add a new UTF-8 Mode .. [#] `PEP 540 -- Add a new UTF-8 Mode <https://www.python.org/dev/peps/pep-0540/>`_
(https://www.python.org/dev/peps/pep-0540/) .. [#] https://github.com/pypa/packaging.python.org/pull/682
.. [#] https://bugs.python.org/issue33684
.. [#] `PEP 528 -- Change Windows console encoding to UTF-8 <https://www.python.org/dev/peps/pep-0528/>`_
Copyright Copyright