PEP 597: Use UTF-8 for default text file encoding (GH-1099)

2019-06-05 21:09:19 +09:00 · 2019-06-05 21:09:19 +09:00 · cadb6ee369
parent 847a503b5f
commit cadb6ee369
1 changed files with 260 additions and 0 deletions
--- a/pep-0597.rst
+++ b/pep-0597.rst
@ -0,0 +1,260 @@
+PEP: 597
+Title: Use UTF-8 for default text file encoding
+Author: Inada Naoki  <songofacandy@gmail.com>
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 05-Jun-2019
+Python-Version: 3.9
+
+
+Abstract
+========
+
+Currently, ``TextIOWrapper`` uses ``locale.getpreferredencoding(False)``
+(hereinafter called "locale encoding") when ``encoding`` is not specified.
+
+This PEP proposes changing the default text encoding to "UTF-8"
+regardless of platform or locale.
+
+
+Motivation
+==========
+
+People assume it is always UTF-8
+--------------------------------
+
+Package authors using macOS or Linux may forget that the default encoding
+is not always UTF-8.
+
+For example, ``long_description = open("README.md").read()`` in
+``setup.py`` is a common mistake.  If there are at least one emoji or any
+other non-ASCII characters in the ``README.md`` file, many Windows users
+cannot install the package by ``UnicodeDecodeError``.
+
+
+Code page is not stable
+-----------------------
+
+Some tools on Windows change code page to 65001 (UTF-8), and Microsoft
+is using UTF-8 and cp65001 more widely in recent Windows 10.
+
+For example, "Command Prompt" uses legacy code page by default.
+But WSL changes the code page to 65001, and  ``python.exe`` on Windows
+can be executed from WSL.  So ``python.exe`` executed from legacy
+console and from WSL cannot read text files written by each other.
+
+But many Windows users don't understand which code page is currently used.
+So changing default text file encoding based on current code page will
+cause confusion.
+
+Consistent default text encoding will make Python behavior more expectable
+and easy to learn.
+
+
+Use UTF-8 by default is easier to new programmers
+-------------------------------------------------
+
+Python is one of the most popular first programming languages.
+
+New programmers may not know about encoding.  When they download text data
+written in UTF-8 from the internet, they are forced to know encoding.
+
+Popular text editors like VS Code or Atom use UTF-8 by default.
+Even notepad.exe uses UTF-8 by default from Windows 10 2019 may update.
+(Note that Python 3.9 will be released in 2021.)
+
+Additionally, the default encoding of Python source file is UTF-8.
+We can assume new Python programmers who don't know about encoding
+use editors which use UTF-8 by default.
+
+It would be nice if new programmers are not forced to know about encoding
+until they need to handle text files encoded in encoding other than UTF-8.
+
+
+Specification
+=============
+
+From Python 3.9, default encoding of ``TextIOWrapper`` and ``open()`` is
+changed from ``locale.getpreferredencoding(False)`` (called "locale encoding"
+in this PEP) to "UTF-8".
+
+When there is device encoding (``os.device_encoding(buffer.fileno())``),
+it still precedes than the default encoding.
+
+
+Not affected areas
+------------------
+
+Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respect
+locale encoding.
+
+``stdin``, ``stdout``, and ``stderr`` keep respecting locale too.  For example,
+these commands don't cause mojibake regardless code page::
+
+   > python -c "print('こんにちは')" | more
+   こんにちは
+   > python -c "print('こんにちは')" > temp.txt
+   > type temp.txt
+   こんにちは
+
+Pipes and TTY should use locale encoding:
+
+* ``subprocess`` and ``os.popen`` use locale encoding because subprocess
+  will use locale encoding.
+* ``getpass.getpass`` uses locale encoding when using TTY.
+
+
+Affected APIs
+--------------
+
+All other code using default encoding of ``TextIOWrapper`` or ``open`` are
+affected.  This is incomplete list of APIs affected by this PEP:
+
+* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``.
+* ``socket.makefile``
+* ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
+* ``trace.CoverageResults.write_results_file``
+
+These APIs will use always "UTF-8" when opening text files.
+
+
+Deprecation Warning
+-------------------
+
+From 3.8, ``DeprecationWarning`` is shown when encoding is omitted and
+locale encoding is not UTF-8.  This helps not only
+writing forward compatible code, but also investigating unexpected
+``UnicodeDecodeError`` caused by assuming default text encoding is
+UTF-8. (See `People assume it is always UTF-8`_ above.)
+
+
+Rationale
+=========
+
+Why not just enable UTF-8 mode by default?
+------------------------------------------
+
+This PEP is not mutually exclusive to UTF-8 mode.
+
+If we enable UTF-8 mode by default, even people using Windows will forget
+the default encoding is not always UTF-8.  More scripts will be written
+assuming the default encoding is UTF-8.
+
+So changing default encoding of text files to always UTF-8 would be
+better even if UTF-8 mode is enabled by default at some point.
+
+
+Why not change std(in|out|err) encoding too?
+--------------------------------------------
+
+Even when locale encoding is not UTF-8, there will be many UTF-8
+text files.  These files are downloaded from the internet, or
+written by modern text editor same to editing Python source.
+
+On the other hand, terminal encoding is assumed to be equal to
+locale encoding.  And other tools are assumed to read and write
+locale encoding too.
+
+std(in|out|err) are likely to be connected to a terminal or other
+tools.  So locale encoding should be respected.
+
+
+Why not warn always when encoding is omitted?
+----------------------------------------------
+
+Omitting default encoding is a common mistake when writing portable code.
+
+But when portability does not matter, assuming UTF-8 is not so bad because
+Python already implemented locale coercion (:pep:`538`) and UTF-8 mode
+(:pep:`540`).
+
+And these scripts will become portable when default encoding is changed
+to always UTF-8.
+
+
+
+Backward compatibility
+======================
+
+There may be scripts relying on locale or code page which is not UTF-8.
+They must be rewritten to specify ``encoding`` explicitly.
+
+* If the script assumed ``latin1`` or ``cp932``, use ``encoding="latin1"``
+  or ``encoding="cp932"`` should be used.
+
+* If the script is designed to respect locale encoding,
+  ``locale.getpreferredencoding(False)`` should be used.
+
+  There are non-portable short forms of ``locale.getpreferredencoding(False)``.
+
+   * On Windows, ``"mbcs"`` can be used instead.
+   * On Unix, ``os.fsencoding()`` can be used instead.
+
+Note that such scripts will be broken even without upgrading Python:
+
+* Upgrading Windows
+* Changing the language setting
+* Changing terminal from legacy console to a modern one
+* Using tools which does ``chcp 65001``
+
+
+How to Teach This
+=================
+
+When opening text files, "UTF-8" is used by default.  It is consistent
+with default encoding used for ``text.encode()``.
+
+
+Reference Implementation
+========================
+
+To be written.
+
+
+Rejected Ideas
+==============
+
+To be discussed.
+
+
+Open Issues
+===========
+
+Alias for locale encoding
+--------------------------
+
+``encoding=locale.getpreferredencoding(False)`` is too long, and
+``"mbcs"`` or ``os.fsencoding()`` are not portable.
+
+We may be possible to add new alias encoding "locale" for easy and
+portable version of ``locale.getpreferredencoding(False)``.
+
+I'm not sure this is easy enough because ``encodings`` is imported
+before ``_bootlocale`` currently.
+
+Another option is ``TextIOWrapper`` treats `"locale"` as special case::
+
+   if encoding == "locale":
+       encoding = locale.getpreferredencoding(False)
+
+
+
+References
+==========
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   coding: utf-8
+   End:
+