diff --git a/pep-0597.rst b/pep-0597.rst
index e7ead2615..2fd9ac58b 100644
--- a/pep-0597.rst
+++ b/pep-0597.rst
@@ -28,69 +28,71 @@ Package authors using macOS or Linux may forget that the default encoding
 is not always UTF-8.
 
 For example, ``long_description = open("README.md").read()`` in
-``setup.py`` is a common mistake. If there are at least one emoji or any
-other non-ASCII characters in the ``README.md`` file, many Windows users
-cannot install the package by ``UnicodeDecodeError``.
+``setup.py`` is a common mistake. If there is at least one emoji or any
+other non-ASCII character in the ``README.md`` file, many Windows users
+cannot install the package due to a ``UnicodeDecodeError``.
 
 
-Code page is not stable
------------------------
+Active code page is not stable
+------------------------------
 
-Some tools on Windows change code page to 65001 (UTF-8), and Microsoft
-is using UTF-8 and cp65001 more widely in recent Windows 10.
+Some tools on Windows change the active code page to 65001 (UTF-8), and
+Microsoft is using UTF-8 and cp65001 more widely in recent versions of
+Windows 10.
 
-For example, "Command Prompt" uses legacy code page by default.
-But WSL changes the code page to 65001, and ``python.exe`` on Windows
-can be executed from WSL. So ``python.exe`` executed from legacy
-console and from WSL cannot read text files written by each other.
+For example, "Command Prompt" uses the legacy code page by default.
+But the Windows Subsystem for Linux (WSL) changes the active code page to
+65001, and ``python.exe`` can be executed from the WSL. So ``python.exe``
+executed from the legacy console and from the WSL cannot read text files
+written by each other.
 
-But many Windows users don't understand which code page is currently used.
-So changing default text file encoding based on current code page will
-cause confusion.
+But many Windows users don't understand which code page is active.
+So changing the default text file encoding based on the active code page
+causes confusion.
 
 Consistent default text encoding will make Python behavior more expectable
-and easy to learn.
+and easier to learn.
 
 
-Use UTF-8 by default is easier to new programmers
--------------------------------------------------
+Using UTF-8 by default is easier on new programmers
+---------------------------------------------------
 
 Python is one of the most popular first programming languages.
 
 New programmers may not know about encoding. When they download text data
-written in UTF-8 from the internet, they are forced to know encoding.
+written in UTF-8 from the Internet, they are forced to learn about encoding.
 
 Popular text editors like VS Code or Atom use UTF-8 by default.
-Even notepad.exe uses UTF-8 by default from Windows 10 2019 may update.
-(Note that Python 3.9 will be released in 2021.)
+Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May 2019
+Update. (Note that Python 3.9 will be released in 2021.)
 
-Additionally, the default encoding of Python source file is UTF-8.
+Additionally, the default encoding of Python source files is UTF-8.
 We can assume new Python programmers who don't know about encoding
 use editors which use UTF-8 by default.
 
-It would be nice if new programmers are not forced to know about encoding
+It would be nice if new programmers are not forced to learn about encoding
 until they need to handle text files encoded in encoding other than UTF-8.
 
 
 Specification
 =============
 
-From Python 3.9, default encoding of ``TextIOWrapper`` and ``open()`` is
-changed from ``locale.getpreferredencoding(False)`` (called "locale encoding"
-in this PEP) to "UTF-8".
+From Python 3.9, the default encoding of ``TextIOWrapper`` and ``open()`` is
+changed from ``locale.getpreferredencoding(False)`` to "UTF-8".
 
 When there is device encoding (``os.device_encoding(buffer.fileno())``),
-it still precedes than the default encoding.
+it still supersedes the default encoding.
 
 
-Not affected areas
-------------------
+Unaffected areas
+----------------
 
-Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respect
+Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respects
 locale encoding.
 
-``stdin``, ``stdout``, and ``stderr`` keep respecting locale too. For example,
-these commands don't cause mojibake regardless code page::
+``stdin``, ``stdout``, and ``stderr`` continue to respect locale encoding
+as well. For example, these commands do not cause mojibake regardless of the
+active code page::
 
     > python -c "print('こんにちは')" | more
     こんにちは
@@ -98,35 +100,35 @@ these commands don't cause mojibake regardless code page::
     > type temp.txt
     こんにちは
 
-Pipes and TTY should use locale encoding:
+Pipes and TTY should use the locale encoding:
 
-* ``subprocess`` and ``os.popen`` use locale encoding because subprocess
-  will use locale encoding.
-* ``getpass.getpass`` uses locale encoding when using TTY.
+* ``subprocess`` and ``os.popen`` use the locale encoding because the
+  subprocess will use the locale encoding.
+* ``getpass.getpass`` uses the locale encoding when using TTY.
 
 
 Affected APIs
---------------
+-------------
 
-All other code using default encoding of ``TextIOWrapper`` or ``open`` are
-affected. This is incomplete list of APIs affected by this PEP:
+All other code using the default encoding of ``TextIOWrapper`` or ``open`` are
+affected. This is an incomplete list of APIs affected by this PEP:
 
-* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``.
+* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``
 * ``socket.makefile``
 * ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
 * ``trace.CoverageResults.write_results_file``
 
-These APIs will use always "UTF-8" when opening text files.
+These APIs will always use "UTF-8" when opening text files.
 
 
 Deprecation Warning
 -------------------
 
-From 3.8, ``DeprecationWarning`` is shown when encoding is omitted and
-locale encoding is not UTF-8. This helps not only
-writing forward compatible code, but also investigating unexpected
-``UnicodeDecodeError`` caused by assuming default text encoding is
-UTF-8. (See `People assume it is always UTF-8`_ above.)
+From 3.8 onwards, ``DeprecationWarning`` is shown when encoding is omitted and
+the locale encoding is not UTF-8. This helps not only when writing
+forward-compatible code, but also when investigating an unexpected
+``UnicodeDecodeError`` caused by assuming the default text encoding is UTF-8.
+(See `People assume it is always UTF-8`_ above.)
 
 
 Rationale
@@ -141,69 +143,70 @@ If we enable UTF-8 mode by default, even people using Windows will forget
 the default encoding is not always UTF-8. More scripts will be written
 assuming the default encoding is UTF-8.
 
-So changing default encoding of text files to always UTF-8 would be
-better even if UTF-8 mode is enabled by default at some point.
+So changing the default encoding of text files to UTF-8 would be better
+even if UTF-8 mode is enabled by default at some point.
 
 
 Why not change std(in|out|err) encoding too?
 --------------------------------------------
 
-Even when locale encoding is not UTF-8, there will be many UTF-8
-text files. These files are downloaded from the internet, or
-written by modern text editor same to editing Python source.
+Even when the locale encoding is not UTF-8, there can be many UTF-8
+text files. These files could be downloaded from the Internet or
+written by modern text editors.
 
-On the other hand, terminal encoding is assumed to be equal to
-locale encoding. And other tools are assumed to read and write
-locale encoding too.
+On the other hand, terminal encoding is assumed to be the same as
+locale encoding. And other tools are assumed to read and write the
+locale encoding as well.
 
-std(in|out|err) are likely to be connected to a terminal or other
-tools. So locale encoding should be respected.
+std(in|out|err) are likely to be connected to a terminal or other tools.
+So the locale encoding should be respected.
 
 
-Why not warn always when encoding is omitted?
-----------------------------------------------
+Why not always warn when encoding is omitted?
+---------------------------------------------
 
-Omitting default encoding is a common mistake when writing portable code.
+Omitting encoding is a common mistake when writing portable code.
 
 But when portability does not matter, assuming UTF-8 is not so bad because
-Python already implemented locale coercion (:pep:`538`) and UTF-8 mode
+Python already implements locale coercion (:pep:`538`) and UTF-8 mode
 (:pep:`540`).
 
-And these scripts will become portable when default encoding is changed
-to always UTF-8.
-
+And these scripts will become portable when the default encoding is changed
+to UTF-8.
 
 
 Backward compatibility
 ======================
 
-There may be scripts relying on locale or code page which is not UTF-8.
-They must be rewritten to specify ``encoding`` explicitly.
+There may be scripts relying on the locale encoding or active code page not
+being UTF-8. They must be rewritten to specify ``encoding`` explicitly.
 
-* If the script assumed ``latin1`` or ``cp932``, use ``encoding="latin1"``
+* If the script assumes ``latin1`` or ``cp932``, ``encoding="latin1"``
   or ``encoding="cp932"`` should be used.
 
 * If the script is designed to respect locale encoding,
   ``locale.getpreferredencoding(False)`` should be used.
 
-  There are non-portable short forms of ``locale.getpreferredencoding(False)``.
+  There are non-portable short forms of
+  ``locale.getpreferredencoding(False)``.
 
-  * On Windows, ``"mbcs"`` can be used instead.
-  * On Unix, ``os.fsencoding()`` can be used instead.
+  * On Windows, ``"mbcs"`` can be used instead.
+  * On Unix, ``os.fsencoding()`` can be used instead.
 
-Note that such scripts will be broken even without upgrading Python:
+Note that such scripts will be broken even without upgrading Python, such as
+when:
 
 * Upgrading Windows
 * Changing the language setting
 * Changing terminal from legacy console to a modern one
-* Using tools which does ``chcp 65001``
+* Using tools which do ``chcp 65001``
 
 
 How to Teach This
 =================
 
-When opening text files, "UTF-8" is used by default. It is consistent
-with default encoding used for ``text.encode()``.
+When opening text files, "UTF-8" is used by default. It is consistent with
+the default encoding used for ``text.encode()``.
 
 
 Reference Implementation
@@ -222,18 +225,19 @@ Open Issues
 ===========
 
 Alias for locale encoding
---------------------------
+-------------------------
 
 ``encoding=locale.getpreferredencoding(False)`` is too long, and
-``"mbcs"`` or ``os.fsencoding()`` are not portable.
+``"mbcs"`` and ``os.fsencoding()`` are not portable.
 
-We may be possible to add new alias encoding "locale" for easy and
+It may be possible to add a new "locale" encoding alias as an easy and
 portable version of ``locale.getpreferredencoding(False)``.
 
-I'm not sure this is easy enough because ``encodings`` is imported
-before ``_bootlocale`` currently.
+The difficulty of this is uncertain because ``encodings`` is currently
+imported prior to ``_bootlocale``.
 
-Another option is ``TextIOWrapper`` treats `"locale"` as special case::
+Another option is for ``TextIOWrapper`` to treat `"locale"` as a special
+case::
 
     if encoding == "locale":
         encoding = locale.getpreferredencoding(False)
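
Outside the patch itself, the special case in the final hunk can be tried as runnable Python. This is only a minimal sketch: ``resolve_encoding`` and ``open_text`` are hypothetical helpers invented for illustration, not part of ``io``, ``TextIOWrapper``, or the PEP. It assumes the behavior the PEP text describes, namely UTF-8 when ``encoding`` is omitted and an explicit ``"locale"`` string as the opt-in to ``locale.getpreferredencoding(False)``.

```python
import locale
import os
import tempfile


def resolve_encoding(encoding=None):
    """Hypothetical helper: pick the encoding the way the PEP text describes."""
    if encoding is None:
        # Assumption: an omitted encoding means UTF-8, as the PEP proposes.
        return "utf-8"
    if encoding == "locale":
        # The special case from the last hunk: explicit opt-in to the locale.
        return locale.getpreferredencoding(False)
    # Any other explicit encoding (e.g. "latin1", "cp932") is passed through.
    return encoding


def open_text(path, mode="r", encoding=None):
    """Hypothetical wrapper around open() that applies resolve_encoding()."""
    return open(path, mode, encoding=resolve_encoding(encoding))


# Round-trip non-ASCII text without depending on the active code page.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "temp.txt")
    with open_text(path, "w") as f:
        f.write("こんにちは")
    with open_text(path) as f:
        round_tripped = f.read()
```

Scripts that must keep following the locale would call ``open_text(path, encoding="locale")``; everything else gets the same bytes on every platform. The patch text names ``os.fsencoding()``; the closest standard-library call appears to be ``sys.getfilesystemencoding()``, so the sketch avoids both and uses ``locale.getpreferredencoding(False)`` directly.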