PEP 597: Use UTF-8 for default text file encoding (GH-1099)
Commit cadb6ee369 (parent 847a503b5f)

PEP: 597
Title: Use UTF-8 for default text file encoding
Author: Inada Naoki <songofacandy@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 05-Jun-2019
Python-Version: 3.9


Abstract
========

Currently, ``TextIOWrapper`` uses ``locale.getpreferredencoding(False)``
(hereinafter called "locale encoding") when ``encoding`` is not specified.

This PEP proposes changing the default text encoding to "UTF-8"
regardless of platform or locale.


Motivation
==========

People assume it is always UTF-8
--------------------------------

Package authors using macOS or Linux may forget that the default encoding
is not always UTF-8.

For example, ``long_description = open("README.md").read()`` in
``setup.py`` is a common mistake. If the ``README.md`` file contains at
least one emoji or any other non-ASCII character, many Windows users
cannot install the package because of a ``UnicodeDecodeError``.


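The failure mode and its fix can be sketched as follows. This is only an
illustration: the helper names ``read_long_description`` and ``demo`` are
hypothetical, and ``cp1252`` stands in for a legacy Windows code page.

```python
import tempfile
from pathlib import Path


def read_long_description(path):
    """Read a README as UTF-8, independent of the locale encoding."""
    # Passing encoding explicitly avoids the locale-dependent default.
    with open(path, encoding="utf-8") as f:
        return f.read()


def demo():
    """Return (text read as UTF-8, whether decoding as cp1252 failed)."""
    text = "Hello, こんにちは\n"
    with tempfile.TemporaryDirectory() as tmp:
        readme = Path(tmp) / "README.md"
        readme.write_bytes(text.encode("utf-8"))
        try:
            # Simulates open("README.md").read() on a cp1252 system.
            with open(readme, encoding="cp1252") as f:
                f.read()
            legacy_failed = False
        except UnicodeDecodeError:
            legacy_failed = True
        return read_long_description(readme), legacy_failed
```

In a ``setup.py``, the same idea would read
``long_description = read_long_description("README.md")``.

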
Code page is not stable
-----------------------

Some tools on Windows change the code page to 65001 (UTF-8), and Microsoft
is using UTF-8 and cp65001 more widely in recent versions of Windows 10.

For example, "Command Prompt" uses the legacy code page by default.
But WSL changes the code page to 65001, and ``python.exe`` on Windows
can be executed from WSL. So ``python.exe`` executed from the legacy
console and from WSL cannot read text files written by each other.

Moreover, many Windows users don't understand which code page is currently
in use. So choosing the default text file encoding based on the current
code page causes confusion.

A consistent default text encoding will make Python's behavior more
predictable and easier to learn.


Using UTF-8 by default is easier for new programmers
----------------------------------------------------

Python is one of the most popular first programming languages.

New programmers may not know about encodings. When they download text
data written in UTF-8 from the internet, they are forced to learn about
encodings.

Popular text editors like VS Code or Atom use UTF-8 by default.
Even notepad.exe uses UTF-8 by default since the Windows 10 May 2019
Update. (Note that Python 3.9 will be released in 2021.)

Additionally, the default encoding of Python source files is UTF-8.
We can assume that new Python programmers who don't know about encodings
use editors which use UTF-8 by default.

It would be nice if new programmers were not forced to learn about
encodings until they need to handle text files encoded in something
other than UTF-8.


Specification
=============

From Python 3.9, the default encoding of ``TextIOWrapper`` and ``open()``
is changed from ``locale.getpreferredencoding(False)`` (called "locale
encoding" in this PEP) to "UTF-8".

When there is a device encoding (``os.device_encoding(buffer.fileno())``),
it still takes precedence over the default encoding.


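The lookup order described in the specification can be sketched roughly as
below. The functions ``default_text_encoding`` and ``resolve_encoding`` are
hypothetical helpers written for illustration; they are not the actual
``TextIOWrapper`` implementation.

```python
import os


def default_text_encoding(fileno=None):
    """Hypothetical helper: pick the encoding when none was given."""
    if fileno is not None:
        # os.device_encoding() returns None unless the descriptor
        # is connected to a terminal.
        device = os.device_encoding(fileno)
        if device is not None:
            return device
    # Proposed default: UTF-8 regardless of platform or locale.
    return "UTF-8"


def resolve_encoding(encoding=None, fileno=None):
    """An explicit encoding always wins over any default."""
    if encoding is not None:
        return encoding
    return default_text_encoding(fileno)
```

Under this sketch, a regular file or a pipe falls through to "UTF-8", while
a descriptor attached to a terminal keeps its device encoding.

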
Unaffected areas
----------------

Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respects
the locale encoding.

``stdin``, ``stdout``, and ``stderr`` keep respecting the locale too. For
example, these commands don't cause mojibake regardless of the code page::

    > python -c "print('こんにちは')" | more
    こんにちは
    > python -c "print('こんにちは')" > temp.txt
    > type temp.txt
    こんにちは

Pipes and TTYs should use the locale encoding:

* ``subprocess`` and ``os.popen`` use the locale encoding because the
  child process will use the locale encoding.
* ``getpass.getpass`` uses the locale encoding when using a TTY.


Affected APIs
-------------

All other code using the default encoding of ``TextIOWrapper`` or ``open``
is affected. This is an incomplete list of APIs affected by this PEP:

* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``zipfile.Path.read_text``
* ``socket.makefile``
* ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
* ``trace.CoverageResults.write_results_file``

These APIs will always use "UTF-8" when opening text files.


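Code that calls one of these APIs in text mode can already pass
``encoding`` explicitly, which behaves the same before and after the
proposed change. A small sketch using ``gzip.open`` (the function name
``roundtrip_gzip_text`` and the file name are hypothetical):

```python
import gzip
import os
import tempfile


def roundtrip_gzip_text(text):
    """Write and re-read text through gzip with an explicit encoding."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt.gz")
        # "wt"/"rt" select text mode; encoding is spelled out so the
        # result does not depend on the locale encoding.
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.write(text)
        with gzip.open(path, "rt", encoding="utf-8") as f:
            return f.read()
```

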
Deprecation Warning
-------------------

From 3.8, a ``DeprecationWarning`` is shown when ``encoding`` is omitted
and the locale encoding is not UTF-8. This helps not only with writing
forward-compatible code, but also with investigating an unexpected
``UnicodeDecodeError`` caused by assuming the default text encoding is
UTF-8. (See `People assume it is always UTF-8`_ above.)


Rationale
=========

Why not just enable UTF-8 mode by default?
------------------------------------------

This PEP is not mutually exclusive with UTF-8 mode.

If we enable UTF-8 mode by default, even people using Windows will forget
that the default encoding is not always UTF-8. More scripts will be
written assuming the default encoding is UTF-8.

So changing the default encoding of text files to always be UTF-8 would
be better even if UTF-8 mode is enabled by default at some point.


Why not change std(in|out|err) encoding too?
--------------------------------------------

Even when the locale encoding is not UTF-8, there will be many UTF-8
text files. These files are downloaded from the internet, or written by
modern text editors, just as Python source files are.

On the other hand, the terminal encoding is assumed to be the same as the
locale encoding. Other tools are assumed to read and write the locale
encoding too.

std(in|out|err) are likely to be connected to a terminal or to other
tools. So the locale encoding should be respected.


Why not always warn when encoding is omitted?
---------------------------------------------

Omitting the ``encoding`` argument is a common mistake when writing
portable code.

But when portability does not matter, assuming UTF-8 is not so bad because
Python already implements locale coercion (:pep:`538`) and UTF-8 mode
(:pep:`540`).

And these scripts will become portable when the default encoding is
changed to always be UTF-8.


Backward compatibility
======================

There may be scripts relying on a locale or code page which is not UTF-8.
They must be rewritten to specify ``encoding`` explicitly.

* If the script assumed ``latin1`` or ``cp932``, ``encoding="latin1"``
  or ``encoding="cp932"`` should be used.

* If the script is designed to respect the locale encoding,
  ``locale.getpreferredencoding(False)`` should be used.

There are non-portable short forms of ``locale.getpreferredencoding(False)``.

* On Windows, ``"mbcs"`` can be used instead.
* On Unix, ``sys.getfilesystemencoding()`` can be used instead.

Note that such scripts can break even without upgrading Python, for
example by:

* Upgrading Windows
* Changing the language setting
* Changing the terminal from the legacy console to a modern one
* Using tools which run ``chcp 65001``


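The two rewrites described above might look like the following sketch. The
function names and file names are hypothetical; only ``open()`` and
``locale.getpreferredencoding(False)`` come from the text.

```python
import locale
import os
import tempfile


def read_cp932(path):
    """For a script that always assumed cp932: name the codec explicitly."""
    with open(path, encoding="cp932") as f:
        return f.read()


def read_with_locale_encoding(path):
    """For a script designed to follow the locale: ask for it explicitly."""
    with open(path, encoding=locale.getpreferredencoding(False)) as f:
        return f.read()


def demo():
    """Exercise both helpers on small temporary files."""
    with tempfile.TemporaryDirectory() as tmp:
        legacy = os.path.join(tmp, "legacy.txt")
        with open(legacy, "wb") as f:
            f.write("こんにちは".encode("cp932"))
        plain = os.path.join(tmp, "plain.txt")
        with open(plain, "wb") as f:
            # ASCII-only content decodes the same in common locale encodings.
            f.write(b"hello")
        return read_cp932(legacy), read_with_locale_encoding(plain)
```

Both helpers keep working unchanged whether or not the default encoding is
switched to UTF-8, because neither relies on the default.

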
How to Teach This
=================

When opening text files, "UTF-8" is used by default. This is consistent
with the default encoding used for ``text.encode()``.


Reference Implementation
========================

To be written.


Rejected Ideas
==============

To be discussed.


Open Issues
===========

Alias for locale encoding
-------------------------

``encoding=locale.getpreferredencoding(False)`` is too long, and
``"mbcs"`` and ``sys.getfilesystemencoding()`` are not portable.

It may be possible to add a new encoding alias, "locale", as an easy and
portable version of ``locale.getpreferredencoding(False)``.

I'm not sure this is easy enough because ``encodings`` is currently
imported before ``_bootlocale``.

Another option is for ``TextIOWrapper`` to treat ``"locale"`` as a special
case::

    if encoding == "locale":
        encoding = locale.getpreferredencoding(False)


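As a rough illustration of the special-case option, the wrapper below
applies the same check before constructing a ``TextIOWrapper``. The
function ``wrap_text`` is a hypothetical stand-in, and the UTF-8 fallback
reflects the default proposed in this PEP rather than current behavior.

```python
import io
import locale


def wrap_text(buffer, encoding=None):
    """Hypothetical wrapper; not the actual TextIOWrapper implementation."""
    if encoding == "locale":
        # Opt in to the locale encoding with a short, portable spelling.
        encoding = locale.getpreferredencoding(False)
    elif encoding is None:
        # Default proposed by this PEP.
        encoding = "utf-8"
    return io.TextIOWrapper(buffer, encoding=encoding)
```

With this sketch, ``wrap_text(buf)`` decodes as UTF-8, while
``wrap_text(buf, "locale")`` follows the locale encoding.

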
References
==========


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: