PEP 597: Add PYTHONTEXTENCODING envvar (#1102)

2019-06-12 12:22:09 +09:00 · 2019-06-12 12:22:09 +09:00 · 988e3acf2b
parent d3b10faf50
commit 988e3acf2b
1 changed files with 160 additions and 144 deletions
--- a/pep-0597.rst
+++ b/pep-0597.rst
@@ -1,5 +1,5 @@
 PEP: 597
-Title: Use UTF-8 for default text file encoding
+Title: Add PYTHONTEXTENCODING environment variable
 Author: Inada Naoki  <songofacandy@gmail.com>
 Status: Draft
 Type: Standards Track
@@ -12,46 +12,32 @@ Abstract
 ========

 Currently, ``TextIOWrapper`` uses ``locale.getpreferredencoding(False)``
-(hereinafter called "locale encoding") when ``encoding`` is not specified.
+(hereinafter called "locale encoding") when ``encoding`` is not
+specified.

-This PEP proposes changing the default text encoding to "UTF-8"
-regardless of platform or locale.
+This PEP proposes adding a ``PYTHONTEXTENCODING`` environment
+variable to override the default text encoding, from Python 3.9 on.
+
+The goal of this PEP is to provide a "UTF-8 by default" experience
+to Windows users, because macOS, Linux, Android, and iOS users
+already use UTF-8 by default.


 Motivation
 ==========

-People assume it is always UTF-8
--------------------------------
+UTF-8 is the best encoding for saving Unicode text
+--------------------------------------------------

-Package authors using macOS or Linux may forget that the default encoding
-is not always UTF-8.
+Strings in Python 3 are Unicode.  Encoding a valid Unicode string
+with UTF-8 should not fail.

-For example, ``long_description = open("README.md").read()`` in
-``setup.py`` is a common mistake.  If there is at least one emoji or any
-other non-ASCII character in the ``README.md`` file, many Windows users
-cannot install the package due to a ``UnicodeDecodeError``.
+On the other hand, most locale encodings used on Windows cannot
+represent every valid Unicode string.  Writing text may raise
+``UnicodeEncodeError`` or fail to round-trip, and users may lose data.

-
-Active code page is not stable
------------------------------
-
-Some tools on Windows change the active code page to 65001 (UTF-8), and
-Microsoft is using UTF-8 and cp65001 more widely in recent versions of
-Windows 10.
-
-For example, "Command Prompt" uses the legacy code page by default.
-But the Windows Subsystem for Linux (WSL) changes the active code page to
-65001, and ``python.exe`` can be executed from the WSL.  So ``python.exe``
-executed from the legacy console and from the WSL cannot read text files
-written by each other.
-
-But many Windows users don't understand which code page is active.
-So changing the default text file encoding based on the active code page
-causes confusion.
-
-Consistent default text encoding will make Python behavior more expectable
-and easier to learn.
+UTF-8 is the best encoding for saving text when the user does not
+specify an encoding.


 Using UTF-8 by default is easier on new programmers
@@ -59,77 +45,104 @@ Using UTF-8 by default is easier on new programmers

 Python is one of the most popular first programming languages.

-New programmers may not know about encoding.  When they download text data
-written in UTF-8 from the Internet, they are forced to learn about encoding.
+New programmers may not know about encoding.  When they download text
+data written in UTF-8 from the Internet, they are forced to learn
+about encoding.

 Popular text editors like VS Code or Atom use UTF-8 by default.
-Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May 2019
-Update.  (Note that Python 3.9 will be released in 2021.)
+Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May
+2019 Update.  (Note that Python 3.9 will be released in 2021.)

 Additionally, the default encoding of Python source files is UTF-8.
 We can assume new Python programmers who don't know about encoding
 use editors which use UTF-8 by default.

-It would be nice if new programmers are not forced to learn about encoding
-until they need to handle text files encoded in encoding other than UTF-8.
+It would be nice if new programmers were not forced to learn about
+encoding until they need to handle text files in an encoding
+other than UTF-8.
+
+
+People assume it is always UTF-8
+--------------------------------
+
+Package authors using macOS or Linux may forget that the default
+encoding is not always UTF-8.
+
+For example, ``long_description = open("README.md").read()`` in
+``setup.py`` is a common mistake.  If there is at least one emoji or
+any other non-ASCII character in the ``README.md`` file, many Windows
+users cannot install the package due to a ``UnicodeDecodeError``.
+
+
+Consistent with default encoding
+--------------------------------
+
+Python has ``sys.getdefaultencoding()``, which always returns
+"utf-8".  ``str.encode()`` uses "UTF-8" when encoding is omitted.
+
+Using "UTF-8" for text files is consistent with this.  It makes
+Python an easier language to learn.


 Specification
 =============

-From Python 3.9, the default encoding of ``TextIOWrapper`` and ``open()`` is
-changed from ``locale.getpreferredencoding(False)`` to "UTF-8".
+``PYTHONTEXTENCODING`` environment variable
+-------------------------------------------

-When there is device encoding (``os.device_encoding(buffer.fileno())``),
-it still supersedes the default encoding.
+The ``PYTHONTEXTENCODING`` environment variable can be used to
+specify the default text encoding.
+
+Unlike ``PYTHONIOENCODING``, it doesn't accept an error handler.
+``PYTHONIOENCODING`` supports one because changing the error handler
+of stdio was difficult.  But that is not true for regular files.


-Unaffected areas
----------------
+``sys.gettextencoding()``
+-------------------------

-Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respects
-locale encoding.
+When ``PYTHONTEXTENCODING`` is specified, this function returns it.

-``stdin``, ``stdout``, and ``stderr`` continue to respect locale encoding
-as well.  For example, these commands do not cause mojibake regardless of the
-active code page::
-
-   > python -c "print('こんにちは')" | more
-   こんにちは
-   > python -c "print('こんにちは')" > temp.txt
-   > type temp.txt
-   こんにちは
-
-Pipes and TTY should use the locale encoding:
-
-* ``subprocess`` and ``os.popen`` use the locale encoding because the
-  subprocess will use the locale encoding.
-* ``getpass.getpass`` uses the locale encoding when using TTY.
+When it is not specified, this function returns
+``locale.getpreferredencoding(False)``.


-Affected APIs
-------------
+``encoding="locale"`` option
+----------------------------

-All other code using the default encoding of ``TextIOWrapper`` or ``open`` are
-affected.  This is an incomplete list of APIs affected by this PEP:
+``TextIOWrapper`` now accepts an ``encoding="locale"`` option.
+
+"locale" is not a real encoding or an alias.
+It is just a shortcut for
+``encoding=locale.getpreferredencoding(False)``.
+
+
+Changes in the stdlib
+---------------------
+
+``TextIOWrapper`` uses ``sys.gettextencoding()`` where
+``locale.getpreferredencoding(False)`` is currently used.
+
+But ``stdin``, ``stdout``, and ``stderr`` continue to respect
+the locale encoding.  ``PYTHONIOENCODING`` can be used to
+override their encoding.
+
+Pipes and TTY should use the "locale" encoding.  UTF-8 mode [1]_
+can be used to override these encodings:
+
+* ``subprocess`` and ``os.popen`` use the "locale" encoding because
+  the subprocess will use the locale encoding.
+* ``getpass.getpass`` uses the "locale" encoding when using TTY.
+
+All other code using the default encoding is not modified.
+Its encoding can be overridden by ``PYTHONTEXTENCODING``.
+This is an incomplete list:

 * ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``
 * ``socket.makefile``
 * ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
 * ``trace.CoverageResults.write_results_file``

-These APIs will always use "UTF-8" when opening text files.
-
-
-Deprecation Warning
-------------------
-
-From 3.8 onwards, ``DeprecationWarning`` is shown when encoding is omitted and
-the locale encoding is not UTF-8.  This helps not only when writing
-forward-compatible code, but also when investigating an unexpected
-``UnicodeDecodeError`` caused by assuming the default text encoding is UTF-8.
-(See `People assume it is always UTF-8`_ above.)
-

 Rationale
 =========
@@ -139,12 +152,22 @@ Why not just enable UTF-8 mode by default?

 This PEP is not mutually exclusive to UTF-8 mode.

-If we enable UTF-8 mode by default, even people using Windows will forget
-the default encoding is not always UTF-8.  More scripts will be written
-assuming the default encoding is UTF-8.
+If we enable UTF-8 mode by default, even people using Windows will
+forget the default encoding is not always UTF-8.  More scripts will
+be written assuming the default encoding is UTF-8.

-So changing the default encoding of text files to UTF-8 would be better
-even if UTF-8 mode is enabled by default at some point.
+So changing the default encoding of text files to UTF-8 would be
+better even if UTF-8 mode is enabled by default at some point.
+
+
+Why is "locale" not an alias codec?
+-----------------------------------
+
+For backward compatibility, ``io.TextIOWrapper`` calls
+``locale.getpreferredencoding(False)`` every time
+``encoding="locale"`` is specified.
+
+This respects locale changes made after Python startup.


 Why not change std(in|out|err) encoding too?
@@ -158,55 +181,10 @@ On the other hand, terminal encoding is assumed to be the same as
 locale encoding.  And other tools are assumed to read and write the
 locale encoding as well.

-std(in|out|err) are likely to be connected to a terminal or other tools.
-So the locale encoding should be respected.
+std(in|out|err) are likely to be connected to a terminal or other
+tools. So the locale encoding should be respected.

-
-Why not always warn when encoding is omitted?
---------------------------------------------
-
-Omitting encoding is a common mistake when writing portable code.
-
-But when portability does not matter, assuming UTF-8 is not so bad because
-Python already implements locale coercion (:pep:`538`) and UTF-8 mode
-(:pep:`540`).
-
-And these scripts will become portable when the default encoding is changed
-to UTF-8.
-
-
-Backward compatibility
-======================
-
-There may be scripts relying on the locale encoding or active code page not
-being UTF-8.  They must be rewritten to specify ``encoding`` explicitly.
-
-* If the script assumes ``latin1`` or ``cp932``, ``encoding="latin1"``
-  or ``encoding="cp932"`` should be used.
-
-* If the script is designed to respect locale encoding,
-  ``locale.getpreferredencoding(False)`` should be used.
-
-  There are non-portable short forms of
-  ``locale.getpreferredencoding(False)``.
-
-  * On Windows, ``"mbcs"`` can be used instead.
-  * On Unix, ``os.fsencoding()`` can be used instead.
-
-Note that such scripts will be broken even without upgrading Python, such as
-when:
-
-* Upgrading Windows
-* Changing the language setting
-* Changing terminal from legacy console to a modern one
-* Using tools which do ``chcp 65001``
-
-
-How to Teach This
-=================
-
-When opening text files, "UTF-8" is used by default.  It is consistent with
-the default encoding used for ``text.encode()``.
+Anyway, ``PYTHONIOENCODING`` can be used to change these encodings.


 Reference Implementation
@@ -218,35 +196,74 @@ To be written.
 Rejected Ideas
 ==============

-To be discussed.
+Change the default text encoding
+--------------------------------
+
+A previous version of this PEP proposed changing the default
+encoding to UTF-8.
+
+But that would require a sufficiently long deprecation period.
+During that period, users cannot change the default text encoding.
+
+And there are many difficulties:
+
+* Omitting the ``encoding`` option is very common.
+
+  * If we always raise ``DeprecationWarning``, it will be too noisy.
+  * We cannot assume how users use it.  Complicated heuristics may
+    be needed to raise ``DeprecationWarning`` only when it is really
+    needed.
+
+* Users of legacy systems may dismiss the warning.
+
+  * They may not check the warning.
+  * They may upgrade Python from 2.7 after 2020.
+
+
+Additionally, Microsoft has recently been improving UTF-8 support
+in Windows 10.
+
+There is no public plan for future UTF-8 support yet.  But Python may
+be able to change the default encoding without a painful deprecation
+period in the future.


 Open Issues
 ===========

-Alias for locale encoding
-------------------------
+Easy way to set ``PYTHONTEXTENCODING``
+--------------------------------------

-``encoding=locale.getpreferredencoding(False)`` is too long, and
-``"mbcs"`` and ``os.fsencoding()`` are not portable.
+UTF-8 is the best encoding for new users.  But setting environment
+variables is not easy enough for new users.

-It may be possible to add a new "locale" encoding alias as an easy and
-portable version of ``locale.getpreferredencoding(False)``.
+It would be helpful if Python on Windows could provide an easy way
+to set ``PYTHONTEXTENCODING=UTF-8`` even after Python is installed.

-The difficulty of this is uncertain because ``encodings`` is currently
-imported prior to ``_bootlocale``.

-Another option is for ``TextIOWrapper`` to treat `"locale"` as a special
-case::
+Command line option
+-------------------

-   if encoding == "locale":
-       encoding = locale.getpreferredencoding(False)
+If there is a reasonable use case for changing the default text
+encoding per process, a command line option should be considered.


+C-API
+-----
+
+It should be possible to configure the default text encoding from C.
+This will be considered when writing the reference implementation.
+
+Additionally, a C-API like ``PySys_GetTextEncoding()`` should be
+considered too.
+

 References
 ==========

+.. [1] PEP 540, Add a new UTF-8 Mode
+   (https://www.python.org/dev/peps/pep-0540/)
+

 Copyright
 =========
@@ -261,4 +278,3 @@ This document has been placed in the public domain.
   fill-column: 70
   coding: utf-8
   End:
-
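
The lookup order described in the Specification hunk of this diff can be sketched in Python. This is only an illustration under stated assumptions, not the reference implementation (which the PEP lists as "To be written"). ``get_text_encoding`` and ``resolve_encoding`` are hypothetical names standing in for the proposed ``sys.gettextencoding()`` and for the way ``TextIOWrapper`` might pick its encoding. Treating an empty ``PYTHONTEXTENCODING`` as unset is also an assumption; the PEP does not address that case.

```python
import locale
import os


def get_text_encoding(environ=None):
    """Hypothetical stand-in for the proposed sys.gettextencoding()."""
    if environ is None:
        environ = os.environ
    # Assumption: an empty value is treated the same as an unset variable.
    override = environ.get("PYTHONTEXTENCODING")
    if override:
        return override
    # Not specified: fall back to the locale encoding, as the PEP states.
    return locale.getpreferredencoding(False)


def resolve_encoding(encoding=None, environ=None):
    """Hypothetical sketch of how TextIOWrapper might choose an encoding."""
    if encoding is None:
        # Encoding omitted: use the (possibly overridden) default.
        return get_text_encoding(environ)
    if encoding == "locale":
        # "locale" is a shortcut, not a codec alias.  It is looked up on
        # every call, so locale changes after startup are respected.
        return locale.getpreferredencoding(False)
    # Any other explicit encoding is used unchanged.
    return encoding
```

In this sketch, ``resolve_encoding(None, {"PYTHONTEXTENCODING": "utf-8"})`` yields the override, while ``resolve_encoding("locale", ...)`` ignores the variable and always reflects the current locale.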