PEP 597: Copy editing (#1100)
parent
36da0b1352
commit
091ba8436e
156
pep-0597.rst
|
@ -28,69 +28,71 @@ Package authors using macOS or Linux may forget that the default encoding
|
||||||
is not always UTF-8.
|
is not always UTF-8.
|
||||||
|
|
||||||
For example, ``long_description = open("README.md").read()`` in
|
For example, ``long_description = open("README.md").read()`` in
|
||||||
``setup.py`` is a common mistake. If there are at least one emoji or any
|
``setup.py`` is a common mistake. If there is at least one emoji or any
|
||||||
other non-ASCII characters in the ``README.md`` file, many Windows users
|
other non-ASCII character in the ``README.md`` file, many Windows users
|
||||||
cannot install the package by ``UnicodeDecodeError``.
|
cannot install the package due to a ``UnicodeDecodeError``.
|
||||||
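Reviewer note: the failure described in this hunk can be reproduced with a minimal sketch like the one below. It assumes ``"ascii"`` as a stand-in for a non-UTF-8 default encoding (the real default depends on the user's locale or code page), and the file contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "README.md")

    # Write a README containing an emoji, encoded as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write("My package \U0001F600\n")

    # Assumption: "ascii" stands in for a default encoding that is not UTF-8.
    # Reading the UTF-8 bytes with it fails on the first non-ASCII byte.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        failed = False
    except UnicodeDecodeError:
        failed = True

    # Passing the encoding explicitly makes the read portable.
    with open(path, encoding="utf-8") as f:
        long_description = f.read()
```

Passing ``encoding="utf-8"`` in ``setup.py`` avoids the problem regardless of which default applies.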
|
|
||||||
|
|
||||||
Code page is not stable
|
Active code page is not stable
|
||||||
-----------------------
|
------------------------------
|
||||||
|
|
||||||
Some tools on Windows change code page to 65001 (UTF-8), and Microsoft
|
Some tools on Windows change the active code page to 65001 (UTF-8), and
|
||||||
is using UTF-8 and cp65001 more widely in recent Windows 10.
|
Microsoft is using UTF-8 and cp65001 more widely in recent versions of
|
||||||
|
Windows 10.
|
||||||
|
|
||||||
For example, "Command Prompt" uses legacy code page by default.
|
For example, "Command Prompt" uses the legacy code page by default.
|
||||||
But WSL changes the code page to 65001, and ``python.exe`` on Windows
|
But the Windows Subsystem for Linux (WSL) changes the active code page to
|
||||||
can be executed from WSL. So ``python.exe`` executed from legacy
|
65001, and ``python.exe`` can be executed from the WSL. So ``python.exe``
|
||||||
console and from WSL cannot read text files written by each other.
|
executed from the legacy console and from the WSL cannot read text files
|
||||||
|
written by each other.
|
||||||
|
|
||||||
But many Windows users don't understand which code page is currently used.
|
But many Windows users don't understand which code page is active.
|
||||||
So changing default text file encoding based on current code page will
|
So changing the default text file encoding based on the active code page
|
||||||
cause confusion.
|
causes confusion.
|
||||||
|
|
||||||
Consistent default text encoding will make Python behavior more expectable
|
Consistent default text encoding will make Python behavior more expectable
|
||||||
and easy to learn.
|
and easier to learn.
|
||||||
|
|
||||||
|
|
||||||
Use UTF-8 by default is easier to new programmers
|
Using UTF-8 by default is easier on new programmers
|
||||||
-------------------------------------------------
|
---------------------------------------------------
|
||||||
|
|
||||||
Python is one of the most popular first programming languages.
|
Python is one of the most popular first programming languages.
|
||||||
|
|
||||||
New programmers may not know about encoding. When they download text data
|
New programmers may not know about encoding. When they download text data
|
||||||
written in UTF-8 from the internet, they are forced to know encoding.
|
written in UTF-8 from the Internet, they are forced to learn about encoding.
|
||||||
|
|
||||||
Popular text editors like VS Code or Atom use UTF-8 by default.
|
Popular text editors like VS Code or Atom use UTF-8 by default.
|
||||||
Even notepad.exe uses UTF-8 by default from Windows 10 2019 may update.
|
Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May 2019
|
||||||
(Note that Python 3.9 will be released in 2021.)
|
Update. (Note that Python 3.9 will be released in 2021.)
|
||||||
|
|
||||||
Additionally, the default encoding of Python source file is UTF-8.
|
Additionally, the default encoding of Python source files is UTF-8.
|
||||||
We can assume new Python programmers who don't know about encoding
|
We can assume new Python programmers who don't know about encoding
|
||||||
use editors which use UTF-8 by default.
|
use editors which use UTF-8 by default.
|
||||||
|
|
||||||
It would be nice if new programmers are not forced to know about encoding
|
It would be nice if new programmers are not forced to learn about encoding
|
||||||
until they need to handle text files encoded in encoding other than UTF-8.
|
until they need to handle text files encoded in encoding other than UTF-8.
|
||||||
|
|
||||||
|
|
||||||
Specification
|
Specification
|
||||||
=============
|
=============
|
||||||
|
|
||||||
From Python 3.9, default encoding of ``TextIOWrapper`` and ``open()`` is
|
From Python 3.9, the default encoding of ``TextIOWrapper`` and ``open()`` is
|
||||||
changed from ``locale.getpreferredencoding(False)`` (called "locale encoding"
|
changed from ``locale.getpreferredencoding(False)`` to "UTF-8".
|
||||||
in this PEP) to "UTF-8".
|
|
||||||
|
|
||||||
When there is device encoding (``os.device_encoding(buffer.fileno())``),
|
When there is device encoding (``os.device_encoding(buffer.fileno())``),
|
||||||
it still precedes than the default encoding.
|
it still supersedes the default encoding.
|
||||||
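Reviewer note: the precedence stated in this hunk could be sketched roughly as follows. ``default_text_encoding`` is a hypothetical helper written for illustration, not a function from the PEP or from the ``io`` module; only ``os.device_encoding`` is a real call.

```python
import os


def default_text_encoding(fd):
    """Hypothetical helper: device encoding first, otherwise UTF-8."""
    try:
        # Returns an encoding name for a terminal, or None otherwise.
        device = os.device_encoding(fd)
    except OSError:
        device = None
    return device or "UTF-8"


# A pipe is not a terminal, so there is no device encoding
# and the proposed UTF-8 default applies.
read_fd, write_fd = os.pipe()
try:
    pipe_encoding = default_text_encoding(read_fd)
finally:
    os.close(read_fd)
    os.close(write_fd)
```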
|
|
||||||
|
|
||||||
Not affected areas
|
Unaffected areas
|
||||||
------------------
|
----------------
|
||||||
|
|
||||||
Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respect
|
Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respects
|
||||||
locale encoding.
|
locale encoding.
|
||||||
|
|
||||||
``stdin``, ``stdout``, and ``stderr`` keep respecting locale too. For example,
|
``stdin``, ``stdout``, and ``stderr`` continue to respect locale encoding
|
||||||
these commands don't cause mojibake regardless code page::
|
as well. For example, these commands do not cause mojibake regardless of the
|
||||||
|
active code page::
|
||||||
|
|
||||||
> python -c "print('こんにちは')" | more
|
> python -c "print('こんにちは')" | more
|
||||||
こんにちは
|
こんにちは
|
||||||
|
@ -98,35 +100,35 @@ these commands don't cause mojibake regardless code page::
|
||||||
> type temp.txt
|
> type temp.txt
|
||||||
こんにちは
|
こんにちは
|
||||||
|
|
||||||
Pipes and TTY should use locale encoding:
|
Pipes and TTY should use the locale encoding:
|
||||||
|
|
||||||
* ``subprocess`` and ``os.popen`` use locale encoding because subprocess
|
* ``subprocess`` and ``os.popen`` use the locale encoding because the
|
||||||
will use locale encoding.
|
subprocess will use the locale encoding.
|
||||||
* ``getpass.getpass`` uses locale encoding when using TTY.
|
* ``getpass.getpass`` uses the locale encoding when using TTY.
|
||||||
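Reviewer note: a small sketch of the pipe case, assuming a child Python process that prints ASCII text only. With ``text=True`` and no ``encoding`` argument, ``subprocess`` decodes the child's output with the locale encoding; the second call spells the same choice out explicitly.

```python
import locale
import subprocess
import sys

command = [sys.executable, "-c", "print('hello')"]

# Implicit: text mode without an encoding uses the locale encoding.
implicit = subprocess.run(command, capture_output=True, text=True, check=True)

# Explicit: name the locale encoding so the intent is visible to readers.
explicit = subprocess.run(
    command,
    capture_output=True,
    encoding=locale.getpreferredencoding(False),
    check=True,
)

child_output = implicit.stdout.strip()
explicit_output = explicit.stdout.strip()
```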
|
|
||||||
|
|
||||||
Affected APIs
|
Affected APIs
|
||||||
--------------
|
-------------
|
||||||
|
|
||||||
All other code using default encoding of ``TextIOWrapper`` or ``open`` are
|
All other code using the default encoding of ``TextIOWrapper`` or ``open`` are
|
||||||
affected. This is incomplete list of APIs affected by this PEP:
|
affected. This is an incomplete list of APIs affected by this PEP:
|
||||||
|
|
||||||
* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``.
|
* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``
|
||||||
* ``socket.makefile``
|
* ``socket.makefile``
|
||||||
* ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
|
* ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
|
||||||
* ``trace.CoverageResults.write_results_file``
|
* ``trace.CoverageResults.write_results_file``
|
||||||
|
|
||||||
These APIs will use always "UTF-8" when opening text files.
|
These APIs will always use "UTF-8" when opening text files.
|
||||||
|
|
||||||
|
|
||||||
Deprecation Warning
|
Deprecation Warning
|
||||||
-------------------
|
-------------------
|
||||||
|
|
||||||
From 3.8, ``DeprecationWarning`` is shown when encoding is omitted and
|
From 3.8 onwards, ``DeprecationWarning`` is shown when encoding is omitted and
|
||||||
locale encoding is not UTF-8. This helps not only
|
the locale encoding is not UTF-8. This helps not only when writing
|
||||||
writing forward compatible code, but also investigating unexpected
|
forward-compatible code, but also when investigating an unexpected
|
||||||
``UnicodeDecodeError`` caused by assuming default text encoding is
|
``UnicodeDecodeError`` caused by assuming the default text encoding is UTF-8.
|
||||||
UTF-8. (See `People assume it is always UTF-8`_ above.)
|
(See `People assume it is always UTF-8`_ above.)
|
||||||
|
|
||||||
|
|
||||||
Rationale
|
Rationale
|
||||||
|
@ -141,69 +143,70 @@ If we enable UTF-8 mode by default, even people using Windows will forget
|
||||||
the default encoding is not always UTF-8. More scripts will be written
|
the default encoding is not always UTF-8. More scripts will be written
|
||||||
assuming the default encoding is UTF-8.
|
assuming the default encoding is UTF-8.
|
||||||
|
|
||||||
So changing default encoding of text files to always UTF-8 would be
|
So changing the default encoding of text files to UTF-8 would be better
|
||||||
better even if UTF-8 mode is enabled by default at some point.
|
even if UTF-8 mode is enabled by default at some point.
|
||||||
|
|
||||||
|
|
||||||
Why not change std(in|out|err) encoding too?
|
Why not change std(in|out|err) encoding too?
|
||||||
--------------------------------------------
|
--------------------------------------------
|
||||||
|
|
||||||
Even when locale encoding is not UTF-8, there will be many UTF-8
|
Even when the locale encoding is not UTF-8, there can be many UTF-8
|
||||||
text files. These files are downloaded from the internet, or
|
text files. These files could be downloaded from the Internet or
|
||||||
written by modern text editor same to editing Python source.
|
written by modern text editors.
|
||||||
|
|
||||||
On the other hand, terminal encoding is assumed to be equal to
|
On the other hand, terminal encoding is assumed to be the same as
|
||||||
locale encoding. And other tools are assumed to read and write
|
locale encoding. And other tools are assumed to read and write the
|
||||||
locale encoding too.
|
locale encoding as well.
|
||||||
|
|
||||||
std(in|out|err) are likely to be connected to a terminal or other
|
std(in|out|err) are likely to be connected to a terminal or other tools.
|
||||||
tools. So locale encoding should be respected.
|
So the locale encoding should be respected.
|
||||||
|
|
||||||
|
|
||||||
Why not warn always when encoding is omitted?
|
Why not always warn when encoding is omitted?
|
||||||
----------------------------------------------
|
---------------------------------------------
|
||||||
|
|
||||||
Omitting default encoding is a common mistake when writing portable code.
|
Omitting encoding is a common mistake when writing portable code.
|
||||||
|
|
||||||
But when portability does not matter, assuming UTF-8 is not so bad because
|
But when portability does not matter, assuming UTF-8 is not so bad because
|
||||||
Python already implemented locale coercion (:pep:`538`) and UTF-8 mode
|
Python already implements locale coercion (:pep:`538`) and UTF-8 mode
|
||||||
(:pep:`540`).
|
(:pep:`540`).
|
||||||
|
|
||||||
And these scripts will become portable when default encoding is changed
|
And these scripts will become portable when the default encoding is changed
|
||||||
to always UTF-8.
|
to UTF-8.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Backward compatibility
|
Backward compatibility
|
||||||
======================
|
======================
|
||||||
|
|
||||||
There may be scripts relying on locale or code page which is not UTF-8.
|
There may be scripts relying on the locale encoding or active code page not
|
||||||
They must be rewritten to specify ``encoding`` explicitly.
|
being UTF-8. They must be rewritten to specify ``encoding`` explicitly.
|
||||||
|
|
||||||
* If the script assumed ``latin1`` or ``cp932``, use ``encoding="latin1"``
|
* If the script assumes ``latin1`` or ``cp932``, ``encoding="latin1"``
|
||||||
or ``encoding="cp932"`` should be used.
|
or ``encoding="cp932"`` should be used.
|
||||||
|
|
||||||
* If the script is designed to respect locale encoding,
|
* If the script is designed to respect locale encoding,
|
||||||
``locale.getpreferredencoding(False)`` should be used.
|
``locale.getpreferredencoding(False)`` should be used.
|
||||||
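Reviewer note: the two rewrites listed in this hunk might look like the sketch below. The file names and contents are invented for illustration, and ``cp932`` is used only because the text names it.

```python
import locale
import os
import tempfile

# The encoding a script would get if it is designed to follow the locale.
locale_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    # Case 1: the script assumes a specific legacy encoding, so name it.
    legacy_path = os.path.join(tmp, "legacy.txt")
    with open(legacy_path, "w", encoding="cp932") as f:
        f.write("こんにちは")
    with open(legacy_path, encoding="cp932") as f:
        legacy_text = f.read()

    # Case 2: the script is meant to respect the locale encoding.
    locale_path = os.path.join(tmp, "locale.txt")
    with open(locale_path, "w", encoding=locale_encoding) as f:
        f.write("hello")
    with open(locale_path, encoding=locale_encoding) as f:
        locale_text = f.read()
```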
|
|
||||||
There are non-portable short forms of ``locale.getpreferredencoding(False)``.
|
There are non-portable short forms of
|
||||||
|
``locale.getpreferredencoding(False)``.
|
||||||
|
|
||||||
* On Windows, ``"mbcs"`` can be used instead.
|
* On Windows, ``"mbcs"`` can be used instead.
|
||||||
* On Unix, ``os.fsencoding()`` can be used instead.
|
* On Unix, ``os.fsencoding()`` can be used instead.
|
||||||
|
|
||||||
Note that such scripts will be broken even without upgrading Python:
|
Note that such scripts will be broken even without upgrading Python, such as
|
||||||
|
when:
|
||||||
|
|
||||||
* Upgrading Windows
|
* Upgrading Windows
|
||||||
* Changing the language setting
|
* Changing the language setting
|
||||||
* Changing terminal from legacy console to a modern one
|
* Changing terminal from legacy console to a modern one
|
||||||
* Using tools which does ``chcp 65001``
|
* Using tools which do ``chcp 65001``
|
||||||
|
|
||||||
|
|
||||||
How to Teach This
|
How to Teach This
|
||||||
=================
|
=================
|
||||||
|
|
||||||
When opening text files, "UTF-8" is used by default. It is consistent
|
When opening text files, "UTF-8" is used by default. It is consistent with
|
||||||
with default encoding used for ``text.encode()``.
|
the default encoding used for ``text.encode()``.
|
||||||
|
|
||||||
|
|
||||||
Reference Implementation
|
Reference Implementation
|
||||||
|
@ -222,18 +225,19 @@ Open Issues
|
||||||
===========
|
===========
|
||||||
|
|
||||||
Alias for locale encoding
|
Alias for locale encoding
|
||||||
--------------------------
|
-------------------------
|
||||||
|
|
||||||
``encoding=locale.getpreferredencoding(False)`` is too long, and
|
``encoding=locale.getpreferredencoding(False)`` is too long, and
|
||||||
``"mbcs"`` or ``os.fsencoding()`` are not portable.
|
``"mbcs"`` and ``os.fsencoding()`` are not portable.
|
||||||
|
|
||||||
We may be possible to add new alias encoding "locale" for easy and
|
It may be possible to add a new "locale" encoding alias as an easy and
|
||||||
portable version of ``locale.getpreferredencoding(False)``.
|
portable version of ``locale.getpreferredencoding(False)``.
|
||||||
|
|
||||||
I'm not sure this is easy enough because ``encodings`` is imported
|
The difficulty of this is uncertain because ``encodings`` is currently
|
||||||
before ``_bootlocale`` currently.
|
imported prior to ``_bootlocale``.
|
||||||
|
|
||||||
Another option is ``TextIOWrapper`` treats `"locale"` as special case::
|
Another option is for ``TextIOWrapper`` to treat `"locale"` as a special
|
||||||
|
case::
|
||||||
|
|
||||||
if encoding == "locale":
|
if encoding == "locale":
|
||||||
encoding = locale.getpreferredencoding(False)
|
encoding = locale.getpreferredencoding(False)
|
||||||
|
|
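Reviewer note: until such an alias exists, the special case shown in this hunk could be approximated in user code. ``open_text`` below is a hypothetical wrapper written for illustration; it is not part of the PEP or of the standard library.

```python
import locale
import os
import tempfile


def open_text(path, mode="r", encoding="UTF-8", **kwargs):
    """Hypothetical wrapper: treat "locale" as a special encoding name."""
    if encoding == "locale":
        encoding = locale.getpreferredencoding(False)
    return open(path, mode, encoding=encoding, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")

    # Write and read back using whatever the locale encoding is.
    with open_text(path, "w", encoding="locale") as f:
        f.write("hello")
    with open_text(path, encoding="locale") as f:
        text = f.read()
```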