PEP 597: Copy editing (#1100)

Harmon 2019-06-06 08:19:01 -05:00 committed by Inada Naoki
parent 36da0b1352
commit 091ba8436e
1 changed files with 82 additions and 78 deletions

@ -28,69 +28,71 @@ Package authors using macOS or Linux may forget that the default encoding
is not always UTF-8. is not always UTF-8.
For example, ``long_description = open("README.md").read()`` in For example, ``long_description = open("README.md").read()`` in
``setup.py`` is a common mistake. If there are at least one emoji or any ``setup.py`` is a common mistake. If there is at least one emoji or any
other non-ASCII characters in the ``README.md`` file, many Windows users other non-ASCII character in the ``README.md`` file, many Windows users
cannot install the package by ``UnicodeDecodeError``. cannot install the package due to a ``UnicodeDecodeError``.
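As a minimal sketch of the portable alternative to the ``setup.py`` mistake above, the snippet below passes ``encoding`` explicitly. The temporary file is a hypothetical stand-in for a project's ``README.md``, and it assumes the README is saved as UTF-8.

```python
import os
import tempfile

# Hypothetical stand-in for a project's README.md (assumed to be UTF-8).
readme_path = os.path.join(tempfile.mkdtemp(), "README.md")
with open(readme_path, "w", encoding="utf-8") as f:
    f.write("My package \U0001F389\n")

# Passing encoding explicitly avoids depending on the locale encoding or the
# active code page of the machine that installs the package.
with open(readme_path, encoding="utf-8") as f:
    long_description = f.read()
```

With the encoding spelled out, the read does not depend on the installing machine's locale encoding.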
Code page is not stable Active code page is not stable
----------------------- ------------------------------
Some tools on Windows change code page to 65001 (UTF-8), and Microsoft Some tools on Windows change the active code page to 65001 (UTF-8), and
is using UTF-8 and cp65001 more widely in recent Windows 10. Microsoft is using UTF-8 and cp65001 more widely in recent versions of
Windows 10.
For example, "Command Prompt" uses legacy code page by default. For example, "Command Prompt" uses the legacy code page by default.
But WSL changes the code page to 65001, and ``python.exe`` on Windows But the Windows Subsystem for Linux (WSL) changes the active code page to
can be executed from WSL. So ``python.exe`` executed from legacy 65001, and ``python.exe`` can be executed from the WSL. So ``python.exe``
console and from WSL cannot read text files written by each other. executed from the legacy console and from the WSL cannot read text files
written by each other.
But many Windows users don't understand which code page is currently used. But many Windows users don't understand which code page is active.
So changing default text file encoding based on current code page will So changing the default text file encoding based on the active code page
cause confusion. causes confusion.
Consistent default text encoding will make Python behavior more expectable Consistent default text encoding will make Python behavior more expectable
and easy to learn. and easier to learn.
Use UTF-8 by default is easier to new programmers Using UTF-8 by default is easier on new programmers
------------------------------------------------- ---------------------------------------------------
Python is one of the most popular first programming languages. Python is one of the most popular first programming languages.
New programmers may not know about encoding. When they download text data New programmers may not know about encoding. When they download text data
written in UTF-8 from the internet, they are forced to know encoding. written in UTF-8 from the Internet, they are forced to learn about encoding.
Popular text editors like VS Code or Atom use UTF-8 by default. Popular text editors like VS Code or Atom use UTF-8 by default.
Even notepad.exe uses UTF-8 by default from Windows 10 2019 may update. Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May 2019
(Note that Python 3.9 will be released in 2021.) Update. (Note that Python 3.9 will be released in 2021.)
Additionally, the default encoding of Python source file is UTF-8. Additionally, the default encoding of Python source files is UTF-8.
We can assume new Python programmers who don't know about encoding We can assume new Python programmers who don't know about encoding
use editors which use UTF-8 by default. use editors which use UTF-8 by default.
It would be nice if new programmers are not forced to know about encoding It would be nice if new programmers are not forced to learn about encoding
until they need to handle text files encoded in encoding other than UTF-8. until they need to handle text files encoded in encoding other than UTF-8.
Specification Specification
============= =============
From Python 3.9, default encoding of ``TextIOWrapper`` and ``open()`` is From Python 3.9, the default encoding of ``TextIOWrapper`` and ``open()`` is
changed from ``locale.getpreferredencoding(False)`` (called "locale encoding" changed from ``locale.getpreferredencoding(False)`` to "UTF-8".
in this PEP) to "UTF-8".
When there is device encoding (``os.device_encoding(buffer.fileno())``), When there is device encoding (``os.device_encoding(buffer.fileno())``),
it still precedes than the default encoding. it still supersedes the default encoding.
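The change described above can be illustrated with a short sketch. It only contrasts the locale-derived value with an explicitly requested UTF-8 wrapper. It is not the reference implementation, and the in-memory buffer is an assumption made for the example.

```python
import io
import locale

# The current default: whatever the locale (or active code page) reports.
locale_encoding = locale.getpreferredencoding(False)

# Under the proposal the default would be UTF-8. Passing it explicitly gives
# the same result on any Python 3 version, regardless of locale.
buffer = io.BytesIO("こんにちは\n".encode("utf-8"))
wrapper = io.TextIOWrapper(buffer, encoding="utf-8")
text = wrapper.read()
```

Here ``wrapper.encoding`` reports the encoding that was requested, which can be handy when checking which encoding a file object actually uses.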
Not affected areas Unaffected areas
------------------ ----------------
Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respect Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respects
locale encoding. locale encoding.
``stdin``, ``stdout``, and ``stderr`` keep respecting locale too. For example, ``stdin``, ``stdout``, and ``stderr`` continue to respect locale encoding
these commands don't cause mojibake regardless code page:: as well. For example, these commands do not cause mojibake regardless of the
active code page::
> python -c "print('こんにちは')" | more > python -c "print('こんにちは')" | more
こんにちは こんにちは
@ -98,35 +100,35 @@ these commands don't cause mojibake regardless code page::
> type temp.txt > type temp.txt
こんにちは こんにちは
Pipes and TTY should use locale encoding: Pipes and TTY should use the locale encoding:
* ``subprocess`` and ``os.popen`` use locale encoding because subprocess * ``subprocess`` and ``os.popen`` use the locale encoding because the
will use locale encoding. subprocess will use the locale encoding.
* ``getpass.getpass`` uses locale encoding when using TTY. * ``getpass.getpass`` uses the locale encoding when using TTY.
Affected APIs Affected APIs
-------------- -------------
All other code using default encoding of ``TextIOWrapper`` or ``open`` are All other code using the default encoding of ``TextIOWrapper`` or ``open`` are
affected. This is incomplete list of APIs affected by this PEP: affected. This is an incomplete list of APIs affected by this PEP:
* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``. * ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``
* ``socket.makefile`` * ``socket.makefile``
* ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile`` * ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
* ``trace.CoverageResults.write_results_file`` * ``trace.CoverageResults.write_results_file``
These APIs will use always "UTF-8" when opening text files. These APIs will always use "UTF-8" when opening text files.
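For code that has to behave the same before and after the change, one option is to pass ``encoding`` to the affected APIs explicitly. The sketch below does this for ``gzip.open`` in text mode. The file name is hypothetical.

```python
import gzip
import os
import tempfile

# Hypothetical file name, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "example.txt.gz")

# Text mode ("wt"/"rt") wraps the compressed stream in a TextIOWrapper,
# so the encoding argument matters here just as it does for open().
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("naïve café\n")

with gzip.open(path, "rt", encoding="utf-8") as f:
    content = f.read()
```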
Deprecation Warning Deprecation Warning
------------------- -------------------
From 3.8, ``DeprecationWarning`` is shown when encoding is omitted and From 3.8 onwards, ``DeprecationWarning`` is shown when encoding is omitted and
locale encoding is not UTF-8. This helps not only the locale encoding is not UTF-8. This helps not only when writing
writing forward compatible code, but also investigating unexpected forward-compatible code, but also when investigating an unexpected
``UnicodeDecodeError`` caused by assuming default text encoding is ``UnicodeDecodeError`` caused by assuming the default text encoding is UTF-8.
UTF-8. (See `People assume it is always UTF-8`_ above.) (See `People assume it is always UTF-8`_ above.)
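The warning rule above can be sketched in pure Python. The helper names ``needs_encoding_warning`` and ``open_checked`` are hypothetical and are not part of the PEP or of CPython. This is only an approximation of the described condition.

```python
import codecs
import locale
import warnings


def needs_encoding_warning(encoding, locale_encoding=None):
    """Return True when encoding is omitted and the locale is not UTF-8."""
    if encoding is not None:
        return False
    if locale_encoding is None:
        locale_encoding = locale.getpreferredencoding(False)
    # Normalize aliases such as "UTF8" or "utf_8" before comparing.
    return codecs.lookup(locale_encoding).name != "utf-8"


def open_checked(file, mode="r", encoding=None, **kwargs):
    """Hypothetical wrapper around open() that emits the warning."""
    if "b" not in mode and needs_encoding_warning(encoding):
        warnings.warn(
            "encoding is omitted and the locale encoding is not UTF-8",
            DeprecationWarning,
            stacklevel=2,
        )
    return open(file, mode, encoding=encoding, **kwargs)
```

The optional ``locale_encoding`` parameter exists only so that the rule can be exercised without changing the process locale.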
Rationale Rationale
@ -141,69 +143,70 @@ If we enable UTF-8 mode by default, even people using Windows will forget
the default encoding is not always UTF-8. More scripts will be written the default encoding is not always UTF-8. More scripts will be written
assuming the default encoding is UTF-8. assuming the default encoding is UTF-8.
So changing default encoding of text files to always UTF-8 would be So changing the default encoding of text files to UTF-8 would be better
better even if UTF-8 mode is enabled by default at some point. even if UTF-8 mode is enabled by default at some point.
Why not change std(in|out|err) encoding too? Why not change std(in|out|err) encoding too?
-------------------------------------------- --------------------------------------------
Even when locale encoding is not UTF-8, there will be many UTF-8 Even when the locale encoding is not UTF-8, there can be many UTF-8
text files. These files are downloaded from the internet, or text files. These files could be downloaded from the Internet or
written by modern text editor same to editing Python source. written by modern text editors.
On the other hand, terminal encoding is assumed to be equal to On the other hand, terminal encoding is assumed to be the same as
locale encoding. And other tools are assumed to read and write locale encoding. And other tools are assumed to read and write the
locale encoding too. locale encoding as well.
std(in|out|err) are likely to be connected to a terminal or other std(in|out|err) are likely to be connected to a terminal or other tools.
tools. So locale encoding should be respected. So the locale encoding should be respected.
Why not warn always when encoding is omitted? Why not always warn when encoding is omitted?
---------------------------------------------- ---------------------------------------------
Omitting default encoding is a common mistake when writing portable code. Omitting encoding is a common mistake when writing portable code.
But when portability does not matter, assuming UTF-8 is not so bad because But when portability does not matter, assuming UTF-8 is not so bad because
Python already implemented locale coercion (:pep:`538`) and UTF-8 mode Python already implements locale coercion (:pep:`538`) and UTF-8 mode
(:pep:`540`). (:pep:`540`).
And these scripts will become portable when default encoding is changed And these scripts will become portable when the default encoding is changed
to always UTF-8. to UTF-8.
Backward compatibility Backward compatibility
====================== ======================
There may be scripts relying on locale or code page which is not UTF-8. There may be scripts relying on the locale encoding or active code page not
They must be rewritten to specify ``encoding`` explicitly. being UTF-8. They must be rewritten to specify ``encoding`` explicitly.
* If the script assumed ``latin1`` or ``cp932``, use ``encoding="latin1"`` * If the script assumes ``latin1`` or ``cp932``, ``encoding="latin1"``
or ``encoding="cp932"`` should be used. or ``encoding="cp932"`` should be used.
* If the script is designed to respect locale encoding, * If the script is designed to respect locale encoding,
``locale.getpreferredencoding(False)`` should be used. ``locale.getpreferredencoding(False)`` should be used.
There are non-portable short forms of ``locale.getpreferredencoding(False)``. There are non-portable short forms of
``locale.getpreferredencoding(False)``.
* On Windows, ``"mbcs"`` can be used instead. * On Windows, ``"mbcs"`` can be used instead.
* On Unix, ``os.fsencoding()`` can be used instead. * On Unix, ``os.fsencoding()`` can be used instead.
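The two rewrite strategies above might look like the following sketch. The temporary file path is hypothetical, and the example assumes the locale encoding can represent ASCII text.

```python
import locale
import os
import tempfile

# Hypothetical path, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "legacy.txt")

# Case 1: the data is known to be in a specific legacy encoding.
with open(path, "w", encoding="cp932") as f:
    f.write("こんにちは")
with open(path, encoding="cp932") as f:
    legacy_text = f.read()

# Case 2: the script intentionally follows the locale encoding.
locale_encoding = locale.getpreferredencoding(False)
with open(path, "w", encoding=locale_encoding) as f:
    f.write("hello")
with open(path, encoding=locale_encoding) as f:
    locale_text = f.read()
```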
Note that such scripts will be broken even without upgrading Python: Note that such scripts will be broken even without upgrading Python, such as
when:
* Upgrading Windows * Upgrading Windows
* Changing the language setting * Changing the language setting
* Changing terminal from legacy console to a modern one * Changing terminal from legacy console to a modern one
* Using tools which does ``chcp 65001`` * Using tools which do ``chcp 65001``
How to Teach This How to Teach This
================= =================
When opening text files, "UTF-8" is used by default. It is consistent When opening text files, "UTF-8" is used by default. It is consistent with
with default encoding used for ``text.encode()``. the default encoding used for ``text.encode()``.
Reference Implementation Reference Implementation
@ -222,18 +225,19 @@ Open Issues
=========== ===========
Alias for locale encoding Alias for locale encoding
-------------------------- -------------------------
``encoding=locale.getpreferredencoding(False)`` is too long, and ``encoding=locale.getpreferredencoding(False)`` is too long, and
``"mbcs"`` or ``os.fsencoding()`` are not portable. ``"mbcs"`` and ``os.fsencoding()`` are not portable.
We may be possible to add new alias encoding "locale" for easy and It may be possible to add a new "locale" encoding alias as an easy and
portable version of ``locale.getpreferredencoding(False)``. portable version of ``locale.getpreferredencoding(False)``.
I'm not sure this is easy enough because ``encodings`` is imported The difficulty of this is uncertain because ``encodings`` is currently
before ``_bootlocale`` currently. imported prior to ``_bootlocale``.
Another option is ``TextIOWrapper`` treats `"locale"` as special case:: Another option is for ``TextIOWrapper`` to treat `"locale"` as a special
case::
if encoding == "locale": if encoding == "locale":
encoding = locale.getpreferredencoding(False) encoding = locale.getpreferredencoding(False)
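A self-contained sketch of the special-case idea above follows. It lives outside ``TextIOWrapper`` itself, and ``resolve_encoding`` and ``wrap_text`` are hypothetical helper names chosen for illustration.

```python
import io
import locale


def resolve_encoding(encoding):
    """Map the "locale" alias to the current locale encoding."""
    if encoding == "locale":
        return locale.getpreferredencoding(False)
    return encoding


def wrap_text(buffer, encoding="utf-8"):
    """Wrap a binary buffer, defaulting to UTF-8 as the PEP proposes."""
    return io.TextIOWrapper(buffer, encoding=resolve_encoding(encoding))
```

For example, ``wrap_text(io.BytesIO(b"abc"), encoding="locale")`` would decode with whatever ``locale.getpreferredencoding(False)`` reports on the current machine.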