PEP 597: Add PYTHONTEXTENCODING envvar (#1102)

2019-06-12 12:22:09 +09:00 · 2019-06-12 12:22:09 +09:00 · 988e3acf2b
parent d3b10faf50
commit 988e3acf2b
1 changed files with 160 additions and 144 deletions
--- a/pep-0597.rst
+++ b/pep-0597.rst
@@ -1,5 +1,5 @@
 PEP: 597
-Title: Use UTF-8 for default text file encoding
+Title: Add PYTHONTEXTENCODING environment variable
 Author: Inada Naoki  <songofacandy@gmail.com>
 Status: Draft
 Type: Standards Track
@@ -12,46 +12,32 @@ Abstract
 ========
 Currently, ``TextIOWrapper`` uses ``locale.getpreferredencoding(False)``
-(hereinafter called "locale encoding") when ``encoding`` is not specified.
+(hereinafter called "locale encoding") when ``encoding`` is not
 specified.
-This PEP proposes changing the default text encoding to "UTF-8"
+This PEP proposes adding a ``PYTHONTEXTENCODING`` environment
-regardless of platform or locale.
+variable to override the default text encoding, starting with Python 3.9.
 The goal of this PEP is to provide a "UTF-8 by default" experience to
 Windows users, because macOS, Linux, Android, and iOS users already
 use UTF-8 by default.
 Motivation
 ==========
-People assume it is always UTF-8
+UTF-8 is the best encoding for saving unicode text
--------------------------------
+--------------------------------------------------
-Package authors using macOS or Linux may forget that the default encoding
+Strings in Python 3 are Unicode.  Encoding valid Unicode strings with
-is not always UTF-8.
+UTF-8 should not fail.
-For example, ``long_description = open("README.md").read()`` in
+On the other hand, most locale encodings used on Windows cannot
-``setup.py`` is a common mistake.  If there is at least one emoji or any
+save all valid Unicode strings.  This causes ``UnicodeEncodeError``
-other non-ASCII character in the ``README.md`` file, many Windows users
+or the text may not round-trip.  Users may lose their data in such cases.
 cannot install the package due to a ``UnicodeDecodeError``.
-
+UTF-8 is the best encoding for saving text when users don't specify
-Active code page is not stable
+any encoding.
 ------------------------------
 Some tools on Windows change the active code page to 65001 (UTF-8), and
 Microsoft is using UTF-8 and cp65001 more widely in recent versions of
 Windows 10.
 For example, "Command Prompt" uses the legacy code page by default.
 But the Windows Subsystem for Linux (WSL) changes the active code page to
 65001, and ``python.exe`` can be executed from the WSL.  So ``python.exe``
 executed from the legacy console and from the WSL cannot read text files
 written by each other.
 But many Windows users don't understand which code page is active.
 So changing the default text file encoding based on the active code page
 causes confusion.
 A consistent default text encoding will make Python's behavior more predictable
 and easier to learn.
 Using UTF-8 by default is easier on new programmers
@@ -59,77 +45,104 @@ Using UTF-8 by default is easier on new programmers
 Python is one of the most popular first programming languages.
-New programmers may not know about encoding.  When they download text data
+New programmers may not know about encoding.  When they download text
-written in UTF-8 from the Internet, they are forced to learn about encoding.
+data written in UTF-8 from the Internet, they are forced to learn
 about encoding.
 Popular text editors like VS Code or Atom use UTF-8 by default.
-Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May 2019
+Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May
-Update.  (Note that Python 3.9 will be released in 2021.)
+2019 Update.  (Note that Python 3.9 will be released in 2021.)
 Additionally, the default encoding of Python source files is UTF-8.
 We can assume new Python programmers who don't know about encoding
 use editors which use UTF-8 by default.
-It would be nice if new programmers are not forced to learn about encoding
+It would be nice if new programmers are not forced to learn about
-until they need to handle text files encoded in encoding other than UTF-8.
+encoding until they need to handle text files encoded in an encoding
 other than UTF-8.
 People assume it is always UTF-8
 --------------------------------
 Package authors using macOS or Linux may forget that the default
 encoding is not always UTF-8.
 For example, ``long_description = open("README.md").read()`` in
 ``setup.py`` is a common mistake.  If there is at least one emoji or
 any other non-ASCII character in the ``README.md`` file, many Windows
 users cannot install the package due to a ``UnicodeDecodeError``.
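The fix for this mistake can be sketched as follows.  This is a minimal
illustration, not code from any particular package: the helper name
``read_long_description`` is hypothetical, and ``"ascii"`` is used only
as a stand-in for a non-UTF-8 locale encoding so that the failure is
reproducible on any platform.

```python
import tempfile
from pathlib import Path


def read_long_description(path):
    """Read a README with an explicit encoding (hypothetical helper)."""
    # encoding="utf-8" makes the result independent of the locale
    # encoding or the active code page.
    with open(path, encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    readme = Path(tmp) / "README.md"
    # A README containing one non-ASCII character, stored as UTF-8.
    readme.write_bytes("# Demo \N{SNOWMAN}\n".encode("utf-8"))
    long_description = read_long_description(readme)
    raw = readme.read_bytes()

# Decoding the same bytes with an encoding that cannot represent them
# fails, which is what users see when the default is not UTF-8.
try:
    raw.decode("ascii")  # stand-in for a non-UTF-8 locale encoding
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

With the explicit ``encoding`` argument, ``setup.py`` reads the file the
same way on every platform.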
 Consistent with default encoding
 --------------------------------
 Python has ``sys.getdefaultencoding()`` which is always "UTF-8".
 ``str.encode()`` uses "UTF-8" when encoding is omitted.
 Using "UTF-8" for text files is consistent with this.  It makes Python
 an easier language to learn.
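The existing behavior referred to here can be checked directly.  The
snippet below only uses ``sys.getdefaultencoding()`` and
``str.encode()``; the sample text is arbitrary.

```python
import sys

text = "caf\u00e9 \u3053\u3093\u306b\u3061\u306f"

# str.encode() without arguments uses UTF-8.
default_encoded = text.encode()
explicit_encoded = text.encode("utf-8")
same_bytes = default_encoded == explicit_encoded

# The default string encoding reported by the interpreter.
default_name = sys.getdefaultencoding()
```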
 Specification
 =============
-From Python 3.9, the default encoding of ``TextIOWrapper`` and ``open()`` is
+``PYTHONTEXTENCODING`` environment variable
-changed from ``locale.getpreferredencoding(False)`` to "UTF-8".
+-------------------------------------------
-When there is device encoding (``os.device_encoding(buffer.fileno())``),
+The ``PYTHONTEXTENCODING`` environment variable can be used to specify the
-it still supersedes the default encoding.
+default text encoding.
 Unlike ``PYTHONIOENCODING``, it doesn't accept an error handler.
 ``PYTHONIOENCODING`` supports one because changing the error handler of
 stdio was difficult.  But that is not true for regular files.
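For regular files, the error handler can already be chosen per call with
the ``errors`` argument of ``open()``, as the following small sketch
shows.  The file name and contents are made up for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "latin1.txt")
    # b"\xe9" is "e acute" in Latin-1 but is not valid UTF-8 here.
    with open(path, "wb") as f:
        f.write(b"caf\xe9\n")
    # The error handler is selected when the file is opened, so no
    # environment variable is needed for it.
    with open(path, encoding="utf-8", errors="replace") as f:
        replaced = f.read()
```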
-Unaffected areas
+``sys.gettextencoding()``
----------------
+-------------------------
-Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respects
+When ``PYTHONTEXTENCODING`` is specified, this function returns its value.
 locale encoding.
-``stdin``, ``stdout``, and ``stderr`` continue to respect locale encoding
+When it is not specified, this function returns
-as well.  For example, these commands do not cause mojibake regardless of the
+``locale.getpreferredencoding(False)``.
 active code page::
   > python -c "print('こんにちは')" | more
   こんにちは
   > python -c "print('こんにちは')" > temp.txt
   > type temp.txt
   こんにちは
 Pipes and TTY should use the locale encoding:
 * ``subprocess`` and ``os.popen`` use the locale encoding because the
  subprocess will use the locale encoding.
 * ``getpass.getpass`` uses the locale encoding when using TTY.
-Affected APIs
+``encoding="locale"`` option
-------------
+----------------------------
-All other code using the default encoding of ``TextIOWrapper`` or ``open`` are
+``TextIOWrapper`` now accepts ``encoding="locale"`` option.
-affected.  This is an incomplete list of APIs affected by this PEP:
+
 "locale" is not a real encoding or an alias.
 It is just a shortcut for
 ``encoding=locale.getpreferredencoding(False)``.
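The lookup described in this section can be emulated in pure Python as a
rough sketch.  ``resolve_text_encoding`` is a hypothetical helper written
for this illustration; it is not the reference implementation, and the
``environ`` parameter exists only to make the sketch easy to test.

```python
import locale
import os


def resolve_text_encoding(encoding=None, environ=None):
    """Hypothetical emulation of the proposed encoding lookup."""
    if environ is None:
        environ = os.environ
    if encoding == "locale":
        # "locale" is a shortcut, looked up on every call.
        return locale.getpreferredencoding(False)
    if encoding is not None:
        # An explicit encoding always wins.
        return encoding
    # encoding omitted: this mirrors the proposed sys.gettextencoding().
    override = environ.get("PYTHONTEXTENCODING")
    if override:
        return override
    return locale.getpreferredencoding(False)
```

For example, with ``PYTHONTEXTENCODING=utf-8`` set, omitting ``encoding``
resolves to ``"utf-8"``, while ``encoding="locale"`` still resolves to the
locale encoding.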
 Changes in stdlibs
 ------------------
 ``TextIOWrapper`` uses ``sys.gettextencoding()`` where
 ``locale.getpreferredencoding(False)`` is currently used.
 But ``stdin``, ``stdout``, and ``stderr`` continue to respect
 the locale encoding.  ``PYTHONIOENCODING`` can be used to
 override their encoding.
 Pipes and TTY should use the "locale" encoding.  UTF-8 mode [1]_
 can be used to override these encodings:
 * ``subprocess`` and ``os.popen`` use the "locale" encoding because
  the subprocess will use the locale encoding.
 * ``getpass.getpass`` uses the "locale" encoding when using TTY.
 All other code using the default encoding is not modified.
 Its encoding can be overridden by ``PYTHONTEXTENCODING``.
 This is an incomplete list:
 * ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``
 * ``socket.makefile``
 * ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
 * ``trace.CoverageResults.write_results_file``
 These APIs will always use "UTF-8" when opening text files.
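Code calling the text-mode APIs listed above can avoid depending on the
default by passing ``encoding`` explicitly.  A small sketch with
``gzip.open``; the file name and text are arbitrary.

```python
import gzip
import os
import tempfile

text = "\u3053\u3093\u306b\u3061\u306f\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "greeting.txt.gz")
    # Text mode ("wt"/"rt") wraps the stream in a TextIOWrapper, so the
    # encoding argument has the same meaning as for open().
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(text)
    with gzip.open(path, "rt", encoding="utf-8") as f:
        round_tripped = f.read()
```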
 Deprecation Warning
 -------------------
 From 3.8 onwards, ``DeprecationWarning`` is shown when encoding is omitted and
 the locale encoding is not UTF-8.  This helps not only when writing
 forward-compatible code, but also when investigating an unexpected
 ``UnicodeDecodeError`` caused by assuming the default text encoding is UTF-8.
 (See `People assume it is always UTF-8`_ above.)
 Rationale
 =========
@@ -139,12 +152,22 @@ Why not just enable UTF-8 mode by default?
 This PEP is not mutually exclusive with UTF-8 mode.
-If we enable UTF-8 mode by default, even people using Windows will forget
+If we enable UTF-8 mode by default, even people using Windows will
-the default encoding is not always UTF-8.  More scripts will be written
+forget the default encoding is not always UTF-8.  More scripts will
-assuming the default encoding is UTF-8.
+be written assuming the default encoding is UTF-8.
-So changing the default encoding of text files to UTF-8 would be better
+So changing the default encoding of text files to UTF-8 would be
-even if UTF-8 mode is enabled by default at some point.
+better even if UTF-8 mode is enabled by default at some point.
 Why is "locale" not an alias codec?
 -----------------------------------
 For backward compatibility, ``io.TextIOWrapper`` calls
 ``locale.getpreferredencoding(False)`` every time
 ``encoding="locale"`` is specified.
 This way it respects locale changes made after Python startup.
 Why not change std(in|out|err) encoding too?
@@ -158,55 +181,10 @@ On the other hand, terminal encoding is assumed to be the same as
 locale encoding.  And other tools are assumed to read and write the
 locale encoding as well.
-std(in|out|err) are likely to be connected to a terminal or other tools.
+std(in|out|err) are likely to be connected to a terminal or other
-So the locale encoding should be respected.
+tools. So the locale encoding should be respected.
-
+Anyway, ``PYTHONIOENCODING`` can be used to change these encodings.
 Why not always warn when encoding is omitted?
 ---------------------------------------------
 Omitting encoding is a common mistake when writing portable code.
 But when portability does not matter, assuming UTF-8 is not so bad because
 Python already implements locale coercion (:pep:`538`) and UTF-8 mode
 (:pep:`540`).
 And these scripts will become portable when the default encoding is changed
 to UTF-8.
 Backward compatibility
 ======================
 There may be scripts relying on the locale encoding or active code page not
 being UTF-8.  They must be rewritten to specify ``encoding`` explicitly.
 * If the script assumes ``latin1`` or ``cp932``, ``encoding="latin1"``
  or ``encoding="cp932"`` should be used.
 * If the script is designed to respect locale encoding,
  ``locale.getpreferredencoding(False)`` should be used.
  There are non-portable short forms of
  ``locale.getpreferredencoding(False)``.
  * On Windows, ``"mbcs"`` can be used instead.
  * On Unix, ``sys.getfilesystemencoding()`` can be used instead.
 Note that such scripts will be broken even without upgrading Python, such as
 when:
 * Upgrading Windows
 * Changing the language setting
 * Changing terminal from legacy console to a modern one
 * Using tools which do ``chcp 65001``
 How to Teach This
 =================
 When opening text files, "UTF-8" is used by default.  It is consistent with
 the default encoding used for ``text.encode()``.
 Reference Implementation
@@ -218,35 +196,74 @@ To be written.
 Rejected Ideas
 ==============
-To be discussed.
+Change the default text encoding
 --------------------------------
 A previous version of this PEP tried to change the default encoding
 to UTF-8.
 But that requires a sufficiently long deprecation period.  During the
 deprecation period, users cannot change the default text encoding.
 And there are many difficulties:
 * Omitting ``encoding`` option is very common.
  * If we raise ``DeprecationWarning`` always, it will be too noisy.
  * We cannot know how users rely on it.  Complicated heuristics may be
    needed to raise ``DeprecationWarning`` only when it is really
    needed.
 * Users of legacy systems may dismiss the warning.
  * They may not check the warning.
  * They may upgrade Python from 2.7 after 2020.
 Additionally, Microsoft has recently been improving UTF-8 support in
 Windows 10.
 There is no public plan for future UTF-8 support yet.  But Python may
 be able to change the default encoding without a painful deprecation
 period in the future.
 Open Issues
 ===========
-Alias for locale encoding
+Easy way to set ``PYTHONTEXTENCODING``
-------------------------
+--------------------------------------
-``encoding=locale.getpreferredencoding(False)`` is too long, and
+UTF-8 is the best encoding for new users.  But setting environment
-``"mbcs"`` and ``os.fsencoding()`` are not portable.
+variables is not easy enough for new users.
-It may be possible to add a new "locale" encoding alias as an easy and
+It would be helpful if Python on Windows could provide an easy way to set
-portable version of ``locale.getpreferredencoding(False)``.
+``PYTHONTEXTENCODING=UTF-8`` even after Python is installed.
 The difficulty of this is uncertain because ``encodings`` is currently
 imported prior to ``_bootlocale``.
-Another option is for ``TextIOWrapper`` to treat `"locale"` as a special
+Commandline option
-case::
+------------------
-   if encoding == "locale":
+If there is a reasonable use case for changing the default text encoding
-       encoding = locale.getpreferredencoding(False)
+per process, a command line option should be considered.
 C-API
 -----
 The default text encoding should be configurable from C.
 This will be considered when writing the reference implementation.
 Additionally, a C API like ``PySys_GetTextEncoding()`` should be
 considered too.
 References
 ==========
 .. [1] PEP 540, Add a new UTF-8 Mode
   (https://www.python.org/dev/peps/pep-0540/)
 Copyright
 =========
@@ -261,4 +278,3 @@ This document has been placed in the public domain.
   fill-column: 70
   coding: utf-8
   End: