From cadb6ee3693f8fb6a54b123d04642575c405d7aa Mon Sep 17 00:00:00 2001
From: Inada Naoki
Date: Wed, 5 Jun 2019 21:09:19 +0900
Subject: [PATCH] PEP 597: Use UTF-8 for default text file encoding (GH-1099)

---
 pep-0597.rst | 260 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 260 insertions(+)
 create mode 100644 pep-0597.rst

diff --git a/pep-0597.rst b/pep-0597.rst
new file mode 100644
index 000000000..e7ead2615
--- /dev/null
+++ b/pep-0597.rst
@@ -0,0 +1,260 @@
+PEP: 597
+Title: Use UTF-8 for default text file encoding
+Author: Inada Naoki
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 05-Jun-2019
+Python-Version: 3.9
+
+
+Abstract
+========
+
+Currently, ``TextIOWrapper`` uses ``locale.getpreferredencoding(False)``
+(hereinafter called "locale encoding") when ``encoding`` is not specified.
+
+This PEP proposes changing the default text encoding to "UTF-8"
+regardless of platform or locale.
+
+
+Motivation
+==========
+
+People assume it is always UTF-8
+--------------------------------
+
+Package authors using macOS or Linux may forget that the default encoding
+is not always UTF-8.
+
+For example, ``long_description = open("README.md").read()`` in
+``setup.py`` is a common mistake. If there is at least one emoji or any
+other non-ASCII character in the ``README.md`` file, many Windows users
+cannot install the package because of a ``UnicodeDecodeError``.
+
+
+Code page is not stable
+-----------------------
+
+Some tools on Windows change the code page to 65001 (UTF-8), and Microsoft
+is using UTF-8 and cp65001 more widely in recent versions of Windows 10.
+
+For example, "Command Prompt" uses the legacy code page by default.
+But WSL changes the code page to 65001, and ``python.exe`` on Windows
+can be executed from WSL. So ``python.exe`` executed from the legacy
+console and from WSL cannot read text files written by each other.
+
+But many Windows users don't understand which code page is currently used.
+So changing the default text file encoding based on the current code page
+causes confusion.
+
+A consistent default text encoding will make Python behavior more
+predictable and easier to learn.
+
+
+Using UTF-8 by default is easier for new programmers
+----------------------------------------------------
+
+Python is one of the most popular first programming languages.
+
+New programmers may not know about encoding. When they download text data
+written in UTF-8 from the internet, they are forced to learn about encoding.
+
+Popular text editors like VS Code or Atom use UTF-8 by default.
+Even notepad.exe uses UTF-8 by default since the Windows 10 May 2019 Update.
+(Note that Python 3.9 will be released in 2021.)
+
+Additionally, the default encoding of Python source files is UTF-8.
+We can assume new Python programmers who don't know about encoding
+use editors which use UTF-8 by default.
+
+It would be nice if new programmers were not forced to learn about encoding
+until they need to handle text files in an encoding other than UTF-8.
+
+
+Specification
+=============
+
+From Python 3.9, the default encoding of ``TextIOWrapper`` and ``open()`` is
+changed from ``locale.getpreferredencoding(False)`` (called "locale encoding"
+in this PEP) to "UTF-8".
+
+When there is a device encoding (``os.device_encoding(buffer.fileno())``),
+it still takes precedence over the default encoding.
+
+
+Not affected areas
+------------------
+
+Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respects
+the locale encoding.
+
+``stdin``, ``stdout``, and ``stderr`` keep respecting the locale too. For
+example, these commands don't cause mojibake regardless of the code page::
+
+   > python -c "print('こんにちは')" | more
+   こんにちは
+   > python -c "print('こんにちは')" > temp.txt
+   > type temp.txt
+   こんにちは
+
+Pipes and TTYs should use the locale encoding:
+
+* ``subprocess`` and ``os.popen`` use the locale encoding because the child
+  process will use the locale encoding.
+* ``getpass.getpass`` uses the locale encoding when using a TTY.
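The effect of the specification can be previewed today by passing
``encoding`` explicitly. The following is a minimal sketch and not part of
the PEP: the helper name ``roundtrip_utf8`` and the temporary file are
illustrative assumptions. It writes and reads a file with
``encoding="utf-8"``, which is what ``open()`` without ``encoding`` would do
under this proposal, and also looks up the locale encoding that such a call
currently falls back to.

```python
import locale
import os
import tempfile


def roundtrip_utf8(text):
    """Write text with an explicit UTF-8 encoding and read it back.

    Hypothetical helper: passing encoding="utf-8" explicitly mirrors what
    open() without an encoding argument would do under this proposal.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        # Read the raw bytes to see what was actually stored on disk.
        with open(path, "rb") as f:
            raw = f.read()
        with open(path, "r", encoding="utf-8") as f:
            decoded = f.read()
    finally:
        os.remove(path)
    return raw, decoded


raw, decoded = roundtrip_utf8("こんにちは")
# The encoding that open() uses today when ``encoding`` is omitted.
current_default = locale.getpreferredencoding(False)
print(current_default, len(raw), decoded == "こんにちは")
```

The stored bytes are the same on every platform because the encoding is
stated explicitly; only the value of ``current_default`` depends on the
machine.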
+
+
+Affected APIs
+--------------
+
+All other code using the default encoding of ``TextIOWrapper`` or ``open``
+is affected. This is an incomplete list of APIs affected by this PEP:
+
+* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``zipfile.Path.read_text``
+* ``socket.makefile``
+* ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
+* ``trace.CoverageResults.write_results_file``
+
+These APIs will always use "UTF-8" when opening text files.
+
+
+Deprecation Warning
+-------------------
+
+From 3.8, a ``DeprecationWarning`` is shown when ``encoding`` is omitted and
+the locale encoding is not UTF-8. This helps not only with
+writing forward-compatible code, but also with investigating an unexpected
+``UnicodeDecodeError`` caused by assuming the default text encoding is
+UTF-8. (See `People assume it is always UTF-8`_ above.)
+
+
+Rationale
+=========
+
+Why not just enable UTF-8 mode by default?
+------------------------------------------
+
+This PEP and UTF-8 mode are not mutually exclusive.
+
+If we enable UTF-8 mode by default, even people using Windows will forget
+the default encoding is not always UTF-8. More scripts will be written
+assuming the default encoding is UTF-8.
+
+So changing the default encoding of text files to always be UTF-8 would be
+better even if UTF-8 mode is enabled by default at some point.
+
+
+Why not change std(in|out|err) encoding too?
+--------------------------------------------
+
+Even when the locale encoding is not UTF-8, there will be many UTF-8
+text files. These files are downloaded from the internet, or written
+by the same modern text editors that are used for editing Python source.
+
+On the other hand, the terminal encoding is assumed to be equal to
+the locale encoding. And other tools are assumed to read and write
+the locale encoding too.
+
+std(in|out|err) are likely to be connected to a terminal or other
+tools. So the locale encoding should be respected.
+
+
+Why not always warn when encoding is omitted?
+----------------------------------------------
+
+Omitting the ``encoding`` argument is a common mistake when writing portable code.
+
+But when portability does not matter, assuming UTF-8 is not so bad because
+Python has already implemented locale coercion (:pep:`538`) and UTF-8 mode
+(:pep:`540`).
+
+And these scripts will become portable when the default encoding is changed
+to always be UTF-8.
+
+
+
+Backward compatibility
+======================
+
+There may be scripts relying on a locale or code page which is not UTF-8.
+They must be rewritten to specify ``encoding`` explicitly.
+
+* If the script assumed ``latin1`` or ``cp932``, ``encoding="latin1"``
+  or ``encoding="cp932"`` should be used.
+
+* If the script is designed to respect the locale encoding,
+  ``locale.getpreferredencoding(False)`` should be used.
+
+  There are non-portable short forms of ``locale.getpreferredencoding(False)``.
+
+  * On Windows, ``"mbcs"`` can be used instead.
+  * On Unix, ``sys.getfilesystemencoding()`` can be used instead.
+
+Note that such scripts can break even without upgrading Python, for example by:
+
+* Upgrading Windows
+* Changing the language setting
+* Changing the terminal from the legacy console to a modern one
+* Using tools which do ``chcp 65001``
+
+
+How to Teach This
+=================
+
+When opening text files, "UTF-8" is used by default. This is consistent
+with the default encoding used for ``text.encode()``.
+
+
+Reference Implementation
+========================
+
+To be written.
+
+
+Rejected Ideas
+==============
+
+To be discussed.
+
+
+Open Issues
+===========
+
+Alias for locale encoding
+--------------------------
+
+``encoding=locale.getpreferredencoding(False)`` is too long, and
+``"mbcs"`` and ``sys.getfilesystemencoding()`` are not portable.
+
+It may be possible to add a new encoding alias "locale" as an easy and
+portable version of ``locale.getpreferredencoding(False)``.
+
+I'm not sure this is easy enough, because ``encodings`` is currently
+imported before ``_bootlocale``.
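For reference, the long but portable spelling looks like this in a script
that intentionally follows the locale. This is a minimal sketch under
assumptions: ``read_locale_text`` is a hypothetical helper name, not an
existing or proposed API.

```python
import locale
import os
import tempfile


def read_locale_text(path):
    """Read a text file using the locale encoding, stated explicitly.

    Hypothetical helper: because the encoding is passed explicitly, the
    behaviour stays the same even if the default of open() becomes UTF-8.
    """
    encoding = locale.getpreferredencoding(False)
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Usage: write ASCII-only text, which decodes identically under
# ASCII-compatible locale encodings, then read it back via the helper.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", encoding="ascii") as f:
        f.write("plain ASCII text")
    content = read_locale_text(path)
finally:
    os.remove(path)
print(content)
```

A shorter alias would replace only the ``locale.getpreferredencoding(False)``
lookup inside such a helper; the call site would stay the same.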
+
+Another option is for ``TextIOWrapper`` to treat ``"locale"`` as a special case::
+
+   if encoding == "locale":
+       encoding = locale.getpreferredencoding(False)
+
+
+
+References
+==========
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   coding: utf-8
+   End:
+