PEP 597: Add PYTHONTEXTENCODING envvar (#1102)
parent
d3b10faf50
commit
988e3acf2b
304
pep-0597.rst
|
@ -1,5 +1,5 @@
|
||||||
PEP: 597
|
PEP: 597
|
||||||
Title: Use UTF-8 for default text file encoding
|
Title: Add PYTHONTEXTENCODING environment variable
|
||||||
Author: Inada Naoki <songofacandy@gmail.com>
|
Author: Inada Naoki <songofacandy@gmail.com>
|
||||||
Status: Draft
|
Status: Draft
|
||||||
Type: Standards Track
|
Type: Standards Track
|
||||||
|
@ -12,46 +12,32 @@ Abstract
|
||||||
========
|
========
|
||||||
|
|
||||||
Currently, ``TextIOWrapper`` uses ``locale.getpreferredencoding(False)``
|
Currently, ``TextIOWrapper`` uses ``locale.getpreferredencoding(False)``
|
||||||
(hereinafter called "locale encoding") when ``encoding`` is not specified.
|
(hereinafter called "locale encoding") when ``encoding`` is not
|
||||||
|
specified.
|
||||||
|
|
||||||
This PEP proposes changing the default text encoding to "UTF-8"
|
This PEP proposes adding the ``PYTHONTEXTENCODING`` environment
|
||||||
regardless of platform or locale.
|
variable to override the default text encoding, starting with Python 3.9.
|
||||||
|
|
||||||
|
The goal of this PEP is to provide a "UTF-8 by default" experience to
|
||||||
|
Windows users, because macOS, Linux, Android, and iOS users use UTF-8
|
||||||
|
by default already.
|
||||||
|
|
||||||
|
|
||||||
Motivation
|
Motivation
|
||||||
==========
|
==========
|
||||||
|
|
||||||
People assume it is always UTF-8
|
UTF-8 is the best encoding for saving Unicode text
|
||||||
--------------------------------
|
--------------------------------------------------
|
||||||
|
|
||||||
Package authors using macOS or Linux may forget that the default encoding
|
Strings in Python 3 are Unicode. Encoding valid Unicode strings with
|
||||||
is not always UTF-8.
|
UTF-8 should not fail.
|
||||||
|
|
||||||
For example, ``long_description = open("README.md").read()`` in
|
On the other hand, most locale encodings used on Windows cannot
|
||||||
``setup.py`` is a common mistake. If there is at least one emoji or any
|
save every valid Unicode string. This causes a ``UnicodeEncodeError``
|
||||||
other non-ASCII character in the ``README.md`` file, many Windows users
|
or the text may not round-trip. Users may lose their data in such cases.
|
||||||
cannot install the package due to a ``UnicodeDecodeError``.
|
|
||||||
|
|
||||||
|
UTF-8 is the best encoding for saving text when the user doesn't specify
|
||||||
Active code page is not stable
|
any encoding.
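As a small illustration of the point above, the following sketch encodes the same string with UTF-8 and with a legacy code page. Using ``cp1252`` as a stand-in for a typical Windows locale encoding and the sample string are assumptions made for this example, not part of the PEP.

```python
# Sample text containing characters outside most legacy code pages.
text = "こんにちは 🎉"

# UTF-8 can encode any valid Unicode string, and the result round-trips.
utf8_bytes = text.encode("utf-8")
utf8_round_trips = utf8_bytes.decode("utf-8") == text

# cp1252 stands in for a legacy Windows locale encoding (an assumption
# for this sketch); it has no mapping for these characters.
try:
    text.encode("cp1252")
    legacy_failed = False
except UnicodeEncodeError:
    legacy_failed = True

print(utf8_round_trips, legacy_failed)
```

Other legacy code pages fail on different characters, but none of them covers all of Unicode.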
|
||||||
------------------------------
|
|
||||||
|
|
||||||
Some tools on Windows change the active code page to 65001 (UTF-8), and
|
|
||||||
Microsoft is using UTF-8 and cp65001 more widely in recent versions of
|
|
||||||
Windows 10.
|
|
||||||
|
|
||||||
For example, "Command Prompt" uses the legacy code page by default.
|
|
||||||
But the Windows Subsystem for Linux (WSL) changes the active code page to
|
|
||||||
65001, and ``python.exe`` can be executed from the WSL. So ``python.exe``
|
|
||||||
executed from the legacy console and from the WSL cannot read text files
|
|
||||||
written by each other.
|
|
||||||
|
|
||||||
But many Windows users don't understand which code page is active.
|
|
||||||
So changing the default text file encoding based on the active code page
|
|
||||||
causes confusion.
|
|
||||||
|
|
||||||
A consistent default text encoding will make Python's behavior more predictable
|
|
||||||
and easier to learn.
|
|
||||||
|
|
||||||
|
|
||||||
Using UTF-8 by default is easier on new programmers
|
Using UTF-8 by default is easier on new programmers
|
||||||
|
@ -59,77 +45,104 @@ Using UTF-8 by default is easier on new programmers
|
||||||
|
|
||||||
Python is one of the most popular first programming languages.
|
Python is one of the most popular first programming languages.
|
||||||
|
|
||||||
New programmers may not know about encoding. When they download text data
|
New programmers may not know about encoding. When they download text
|
||||||
written in UTF-8 from the Internet, they are forced to learn about encoding.
|
data written in UTF-8 from the Internet, they are forced to learn
|
||||||
|
about encoding.
|
||||||
|
|
||||||
Popular text editors like VS Code or Atom use UTF-8 by default.
|
Popular text editors like VS Code or Atom use UTF-8 by default.
|
||||||
Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May 2019
|
Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May
|
||||||
Update. (Note that Python 3.9 will be released in 2021.)
|
2019 Update. (Note that Python 3.9 will be released in 2021.)
|
||||||
|
|
||||||
Additionally, the default encoding of Python source files is UTF-8.
|
Additionally, the default encoding of Python source files is UTF-8.
|
||||||
We can assume new Python programmers who don't know about encoding
|
We can assume new Python programmers who don't know about encoding
|
||||||
use editors which use UTF-8 by default.
|
use editors which use UTF-8 by default.
|
||||||
|
|
||||||
It would be nice if new programmers are not forced to learn about encoding
|
It would be nice if new programmers are not forced to learn about
|
||||||
until they need to handle text files encoded in an encoding other than UTF-8.
|
encoding until they need to handle text files encoded in an encoding
|
||||||
|
other than UTF-8.
|
||||||
|
|
||||||
|
|
||||||
|
People assume it is always UTF-8
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
|
Package authors using macOS or Linux may forget that the default
|
||||||
|
encoding is not always UTF-8.
|
||||||
|
|
||||||
|
For example, ``long_description = open("README.md").read()`` in
|
||||||
|
``setup.py`` is a common mistake. If there is at least one emoji or
|
||||||
|
any other non-ASCII character in the ``README.md`` file, many Windows
|
||||||
|
users cannot install the package due to a ``UnicodeDecodeError``.
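The failure described above can be reproduced on any platform by passing a legacy encoding explicitly. This is only a sketch: ``cp1252`` is an assumed stand-in for a Windows locale encoding, and the file contents are made up for the example.

```python
import os
import tempfile

# Write a README containing non-ASCII text as UTF-8, as most editors do.
with tempfile.TemporaryDirectory() as tmp:
    readme = os.path.join(tmp, "README.md")
    with open(readme, "w", encoding="utf-8") as f:
        f.write("こんにちは\n")

    # Simulate a Windows machine whose locale encoding is cp1252
    # (an assumption for this sketch) by passing it explicitly.
    try:
        with open(readme, encoding="cp1252") as f:
            f.read()
        legacy_read_failed = False
    except UnicodeDecodeError:
        legacy_read_failed = True

    # Portable version: always name the encoding.
    with open(readme, encoding="utf-8") as f:
        long_description = f.read()
```

Naming the encoding in ``setup.py`` makes the read independent of the installing user's locale.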
|
||||||
|
|
||||||
|
|
||||||
|
Consistent with default encoding
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
|
Python has ``sys.getdefaultencoding()`` which is always "UTF-8".
|
||||||
|
``str.encode()`` uses "UTF-8" when encoding is omitted.
|
||||||
|
|
||||||
|
Using "UTF-8" for text files is consistent with that. It makes Python
|
||||||
|
an easier language to learn.
|
||||||
|
|
||||||
|
|
||||||
Specification
|
Specification
|
||||||
=============
|
=============
|
||||||
|
|
||||||
From Python 3.9, the default encoding of ``TextIOWrapper`` and ``open()`` is
|
``PYTHONTEXTENCODING`` environment variable
|
||||||
changed from ``locale.getpreferredencoding(False)`` to "UTF-8".
|
-------------------------------------------
|
||||||
|
|
||||||
When there is device encoding (``os.device_encoding(buffer.fileno())``),
|
The ``PYTHONTEXTENCODING`` environment variable can be used to specify the
|
||||||
it still supersedes the default encoding.
|
default text encoding.
|
||||||
|
|
||||||
|
Unlike ``PYTHONIOENCODING``, it doesn't accept an error handler.
|
||||||
|
``PYTHONIOENCODING`` supports one because changing the error handler of
|
||||||
|
stdio was difficult, but that is not true for regular files.
|
||||||
|
|
||||||
|
|
||||||
Unaffected areas
|
``sys.gettextencoding()``
|
||||||
----------------
|
-------------------------
|
||||||
|
|
||||||
Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respects
|
When ``PYTHONTEXTENCODING`` is specified, this function returns its value.
|
||||||
locale encoding.
|
|
||||||
|
|
||||||
``stdin``, ``stdout``, and ``stderr`` continue to respect locale encoding
|
When it is not specified, this function returns
|
||||||
as well. For example, these commands do not cause mojibake regardless of the
|
``locale.getpreferredencoding(False)``.
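The lookup rule described above could be sketched as follows. ``get_text_encoding`` is a hypothetical stand-in for the proposed ``sys.gettextencoding()``; taking the environment as a parameter, treating an empty value as unset, and validating the codec name are assumptions of this sketch rather than part of the specification.

```python
import codecs
import locale
import os


def get_text_encoding(environ=os.environ):
    """Hypothetical stand-in for the proposed ``sys.gettextencoding()``."""
    name = environ.get("PYTHONTEXTENCODING")
    # Assumption: an empty value is treated the same as an unset variable.
    if name:
        # Fail early on unknown codec names (raises LookupError).
        codecs.lookup(name)
        return name
    # Not set: fall back to the locale encoding, as the text describes.
    return locale.getpreferredencoding(False)


print(get_text_encoding({"PYTHONTEXTENCODING": "utf-8"}))
print(get_text_encoding({}))
```

Passing a plain dictionary instead of ``os.environ`` keeps the sketch easy to exercise without touching the real process environment.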
|
||||||
active code page::
|
|
||||||
|
|
||||||
> python -c "print('こんにちは')" | more
|
|
||||||
こんにちは
|
|
||||||
> python -c "print('こんにちは')" > temp.txt
|
|
||||||
> type temp.txt
|
|
||||||
こんにちは
|
|
||||||
|
|
||||||
Pipes and TTY should use the locale encoding:
|
|
||||||
|
|
||||||
* ``subprocess`` and ``os.popen`` use the locale encoding because the
|
|
||||||
subprocess will use the locale encoding.
|
|
||||||
* ``getpass.getpass`` uses the locale encoding when using TTY.
|
|
||||||
|
|
||||||
|
|
||||||
Affected APIs
|
``encoding="locale"`` option
|
||||||
-------------
|
----------------------------
|
||||||
|
|
||||||
All other code using the default encoding of ``TextIOWrapper`` or ``open`` is
|
``TextIOWrapper`` now accepts an ``encoding="locale"`` option.
|
||||||
affected. This is an incomplete list of APIs affected by this PEP:
|
|
||||||
|
"locale" is not a real encoding or an alias.
|
||||||
|
It is just a shortcut for
|
||||||
|
``encoding=locale.getpreferredencoding(False)``.
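A minimal sketch of how such a shortcut could be expanded before the name reaches the codec machinery. ``resolve_encoding`` is a hypothetical helper introduced only for illustration; it is not an existing or proposed API.

```python
import io
import locale


def resolve_encoding(encoding):
    """Hypothetical helper: expand the "locale" shortcut described above."""
    if encoding == "locale":
        # Looked up on every call, so later locale changes are respected.
        return locale.getpreferredencoding(False)
    return encoding


buffer = io.BytesIO()
wrapper = io.TextIOWrapper(buffer, encoding=resolve_encoding("locale"))
print(wrapper.encoding)
```

Because the shortcut is expanded by the caller, no codec named "locale" has to be registered.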
|
||||||
|
|
||||||
|
|
||||||
|
Changes in stdlibs
|
||||||
|
------------------
|
||||||
|
|
||||||
|
``TextIOWrapper`` uses ``sys.gettextencoding()`` where
|
||||||
|
``locale.getpreferredencoding(False)`` is currently used.
|
||||||
|
|
||||||
|
But ``stdin``, ``stdout``, and ``stderr`` continue to respect
|
||||||
|
the locale encoding. ``PYTHONIOENCODING`` can be used to
|
||||||
|
override their encoding.
|
||||||
|
|
||||||
|
Pipes and TTY should use the "locale" encoding. UTF-8 mode [1]_
|
||||||
|
can be used to override these encodings:
|
||||||
|
|
||||||
|
* ``subprocess`` and ``os.popen`` use the "locale" encoding because
|
||||||
|
the subprocess will use the locale encoding.
|
||||||
|
* ``getpass.getpass`` uses the "locale" encoding when using TTY.
|
||||||
|
|
||||||
|
All other code using the default encoding is left unmodified;
|
||||||
|
its encoding can be overridden by ``PYTHONTEXTENCODING``.
|
||||||
|
This is an incomplete list:
|
||||||
|
|
||||||
* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``
|
* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``
|
||||||
* ``socket.makefile``
|
* ``socket.makefile``
|
||||||
* ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
|
* ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
|
||||||
* ``trace.CoverageResults.write_results_file``
|
* ``trace.CoverageResults.write_results_file``
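The APIs in this list open text through the same machinery as ``open()``, so code that must not depend on any default can name the encoding itself. The sketch below uses ``gzip.open``; the file name and sample text are made up for the example.

```python
import gzip
import os
import tempfile

text = "こんにちは\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "greeting.txt.gz")

    # Text mode ("wt"/"rt") wraps the binary stream in a TextIOWrapper,
    # so the same encoding rules apply; naming the encoding avoids any
    # dependence on the default.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(text)

    with gzip.open(path, "rt", encoding="utf-8") as f:
        round_tripped = f.read()

print(round_tripped == text)
```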
|
||||||
|
|
||||||
These APIs will always use "UTF-8" when opening text files.
|
|
||||||
|
|
||||||
|
|
||||||
Deprecation Warning
|
|
||||||
-------------------
|
|
||||||
|
|
||||||
From 3.8 onwards, ``DeprecationWarning`` is shown when encoding is omitted and
|
|
||||||
the locale encoding is not UTF-8. This helps not only when writing
|
|
||||||
forward-compatible code, but also when investigating an unexpected
|
|
||||||
``UnicodeDecodeError`` caused by assuming the default text encoding is UTF-8.
|
|
||||||
(See `People assume it is always UTF-8`_ above.)
|
|
||||||
|
|
||||||
|
|
||||||
Rationale
|
Rationale
|
||||||
=========
|
=========
|
||||||
|
@ -139,12 +152,22 @@ Why not just enable UTF-8 mode by default?
|
||||||
|
|
||||||
This PEP is not mutually exclusive to UTF-8 mode.
|
This PEP is not mutually exclusive to UTF-8 mode.
|
||||||
|
|
||||||
If we enable UTF-8 mode by default, even people using Windows will forget
|
If we enable UTF-8 mode by default, even people using Windows will
|
||||||
the default encoding is not always UTF-8. More scripts will be written
|
forget the default encoding is not always UTF-8. More scripts will
|
||||||
assuming the default encoding is UTF-8.
|
be written assuming the default encoding is UTF-8.
|
||||||
|
|
||||||
So changing the default encoding of text files to UTF-8 would be better
|
So changing the default encoding of text files to UTF-8 would be
|
||||||
even if UTF-8 mode is enabled by default at some point.
|
better even if UTF-8 mode is enabled by default at some point.
|
||||||
|
|
||||||
|
|
||||||
|
Why is "locale" not an alias codec?
|
||||||
|
-----------------------------------
|
||||||
|
|
||||||
|
For backward compatibility, ``io.TextIOWrapper`` calls
|
||||||
|
``locale.getpreferredencoding(False)`` every time
|
||||||
|
``encoding="locale"`` is specified.
|
||||||
|
|
||||||
|
This respects locale changes made after Python startup.
|
||||||
|
|
||||||
|
|
||||||
Why not change std(in|out|err) encoding too?
|
Why not change std(in|out|err) encoding too?
|
||||||
|
@ -158,55 +181,10 @@ On the other hand, terminal encoding is assumed to be the same as
|
||||||
locale encoding. And other tools are assumed to read and write the
|
locale encoding. And other tools are assumed to read and write the
|
||||||
locale encoding as well.
|
locale encoding as well.
|
||||||
|
|
||||||
std(in|out|err) are likely to be connected to a terminal or other tools.
|
std(in|out|err) are likely to be connected to a terminal or other
|
||||||
So the locale encoding should be respected.
|
tools. So the locale encoding should be respected.
|
||||||
|
|
||||||
|
Anyway, ``PYTHONIOENCODING`` can be used to change these encodings.
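As a rough illustration of the existing ``PYTHONIOENCODING`` override, the sketch below starts a child interpreter with the variable set and asks which encoding its ``stdout`` uses. The choice of "latin-1" is arbitrary, and codec names are compared after normalization because the reported spelling may vary.

```python
import codecs
import os
import subprocess
import sys

# Run a child interpreter with PYTHONIOENCODING set and ask it which
# encoding its stdout uses. "latin-1" is an arbitrary choice for the demo.
env = dict(os.environ, PYTHONIOENCODING="latin-1")
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
child_encoding = result.stdout.strip()

# Compare normalized codec names, since the spelling may vary.
same_codec = codecs.lookup(child_encoding).name == codecs.lookup("latin-1").name
print(child_encoding, same_codec)
```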
|
||||||
Why not always warn when encoding is omitted?
|
|
||||||
---------------------------------------------
|
|
||||||
|
|
||||||
Omitting encoding is a common mistake when writing portable code.
|
|
||||||
|
|
||||||
But when portability does not matter, assuming UTF-8 is not so bad because
|
|
||||||
Python already implements locale coercion (:pep:`538`) and UTF-8 mode
|
|
||||||
(:pep:`540`).
|
|
||||||
|
|
||||||
And these scripts will become portable when the default encoding is changed
|
|
||||||
to UTF-8.
|
|
||||||
|
|
||||||
|
|
||||||
Backward compatibility
|
|
||||||
======================
|
|
||||||
|
|
||||||
There may be scripts relying on the locale encoding or active code page not
|
|
||||||
being UTF-8. They must be rewritten to specify ``encoding`` explicitly.
|
|
||||||
|
|
||||||
* If the script assumes ``latin1`` or ``cp932``, ``encoding="latin1"``
|
|
||||||
or ``encoding="cp932"`` should be used.
|
|
||||||
|
|
||||||
* If the script is designed to respect locale encoding,
|
|
||||||
``locale.getpreferredencoding(False)`` should be used.
|
|
||||||
|
|
||||||
There are non-portable short forms of
|
|
||||||
``locale.getpreferredencoding(False)``.
|
|
||||||
|
|
||||||
* On Windows, ``"mbcs"`` can be used instead.
|
|
||||||
* On Unix, ``sys.getfilesystemencoding()`` can be used instead.
|
|
||||||
|
|
||||||
Note that such scripts will be broken even without upgrading Python, such as
|
|
||||||
when:
|
|
||||||
|
|
||||||
* Upgrading Windows
|
|
||||||
* Changing the language setting
|
|
||||||
* Changing terminal from legacy console to a modern one
|
|
||||||
* Using tools which do ``chcp 65001``
|
|
||||||
|
|
||||||
|
|
||||||
How to Teach This
|
|
||||||
=================
|
|
||||||
|
|
||||||
When opening text files, "UTF-8" is used by default. It is consistent with
|
|
||||||
the default encoding used for ``text.encode()``.
|
|
||||||
|
|
||||||
|
|
||||||
Reference Implementation
|
Reference Implementation
|
||||||
|
@ -218,35 +196,74 @@ To be written.
|
||||||
Rejected Ideas
|
Rejected Ideas
|
||||||
==============
|
==============
|
||||||
|
|
||||||
To be discussed.
|
Change the default text encoding
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
|
A previous version of this PEP tried to change the default encoding
|
||||||
|
to UTF-8.
|
||||||
|
|
||||||
|
But that requires a sufficiently long deprecation period. During the
|
||||||
|
deprecation period, users cannot change the default text encoding.
|
||||||
|
|
||||||
|
There are also many difficulties:
|
||||||
|
|
||||||
|
* Omitting the ``encoding`` option is very common.
|
||||||
|
|
||||||
|
* If we always raise ``DeprecationWarning``, it will be too noisy.
|
||||||
|
* We cannot assume how users use it. Complicated heuristics may be
|
||||||
|
needed to raise ``DeprecationWarning`` only when it is really
|
||||||
|
needed.
|
||||||
|
|
||||||
|
* Users of legacy systems may dismiss the warning.
|
||||||
|
|
||||||
|
* They may not check the warning.
|
||||||
|
* They may upgrade Python from 2.7 after 2020.
|
||||||
|
|
||||||
|
|
||||||
|
Additionally, Microsoft has been improving UTF-8 support in Windows 10
|
||||||
|
recently.
|
||||||
|
|
||||||
|
There is no public plan for future UTF-8 support yet, but Python may
|
||||||
|
be able to change the default encoding without a painful deprecation
|
||||||
|
period in the future.
|
||||||
|
|
||||||
|
|
||||||
Open Issues
|
Open Issues
|
||||||
===========
|
===========
|
||||||
|
|
||||||
Alias for locale encoding
|
Easy way to set ``PYTHONTEXTENCODING``
|
||||||
-------------------------
|
--------------------------------------
|
||||||
|
|
||||||
``encoding=locale.getpreferredencoding(False)`` is too long, and
|
UTF-8 is the best encoding for new users. But setting environment
|
||||||
``"mbcs"`` and ``sys.getfilesystemencoding()`` are not portable.
|
variables is not easy enough for new users.
|
||||||
|
|
||||||
It may be possible to add a new "locale" encoding alias as an easy and
|
It would be helpful if Python on Windows could provide an easy way to set
|
||||||
portable version of ``locale.getpreferredencoding(False)``.
|
``PYTHONTEXTENCODING=UTF-8`` even after Python is installed.
|
||||||
|
|
||||||
The difficulty of this is uncertain because ``encodings`` is currently
|
|
||||||
imported prior to ``_bootlocale``.
|
|
||||||
|
|
||||||
Another option is for ``TextIOWrapper`` to treat ``"locale"`` as a special
|
Commandline option
|
||||||
case::
|
------------------
|
||||||
|
|
||||||
if encoding == "locale":
|
If there is a reasonable use case for changing the default text encoding
|
||||||
encoding = locale.getpreferredencoding(False)
|
per process, a command-line option should be considered.
|
||||||
|
|
||||||
|
|
||||||
|
C-API
|
||||||
|
-----
|
||||||
|
|
||||||
|
It should be possible to configure the default text encoding from C.
|
||||||
|
This will be considered when writing the reference implementation.
|
||||||
|
|
||||||
|
Additionally, a C API like ``PySys_GetTextEncoding()`` should be
|
||||||
|
considered too.
|
||||||
|
|
||||||
|
|
||||||
References
|
References
|
||||||
==========
|
==========
|
||||||
|
|
||||||
|
.. [1] PEP 540, Add a new UTF-8 Mode
|
||||||
|
(https://www.python.org/dev/peps/pep-0540/)
|
||||||
|
|
||||||
|
|
||||||
Copyright
|
Copyright
|
||||||
=========
|
=========
|
||||||
|
@ -261,4 +278,3 @@ This document has been placed in the public domain.
|
||||||
fill-column: 70
|
fill-column: 70
|
||||||
coding: utf-8
|
coding: utf-8
|
||||||
End:
|
End:
|
||||||
|
|
||||||
|
|