PEP: 597 Title: Use UTF-8 for default text file encoding Author: Inada Naoki Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 05-Jun-2019 Python-Version: 3.9 Abstract ======== Currently, ``TextIOWrapper`` uses ``locale.getpreferredencoding(False)`` (hereinafter called "locale encoding") when ``encoding`` is not specified. This PEP proposes changing the default text encoding to "UTF-8" regardless of platform or locale. Motivation ========== People assume it is always UTF-8 -------------------------------- Package authors using macOS or Linux may forget that the default encoding is not always UTF-8. For example, ``long_description = open("README.md").read()`` in ``setup.py`` is a common mistake. If there are at least one emoji or any other non-ASCII characters in the ``README.md`` file, many Windows users cannot install the package by ``UnicodeDecodeError``. Code page is not stable ----------------------- Some tools on Windows change code page to 65001 (UTF-8), and Microsoft is using UTF-8 and cp65001 more widely in recent Windows 10. For example, "Command Prompt" uses legacy code page by default. But WSL changes the code page to 65001, and ``python.exe`` on Windows can be executed from WSL. So ``python.exe`` executed from legacy console and from WSL cannot read text files written by each other. But many Windows users don't understand which code page is currently used. So changing default text file encoding based on current code page will cause confusion. Consistent default text encoding will make Python behavior more expectable and easy to learn. Use UTF-8 by default is easier to new programmers ------------------------------------------------- Python is one of the most popular first programming languages. New programmers may not know about encoding. When they download text data written in UTF-8 from the internet, they are forced to know encoding. Popular text editors like VS Code or Atom use UTF-8 by default. Even notepad.exe uses UTF-8 by default from Windows 10 2019 may update. (Note that Python 3.9 will be released in 2021.) Additionally, the default encoding of Python source file is UTF-8. We can assume new Python programmers who don't know about encoding use editors which use UTF-8 by default. It would be nice if new programmers are not forced to know about encoding until they need to handle text files encoded in encoding other than UTF-8. Specification ============= From Python 3.9, default encoding of ``TextIOWrapper`` and ``open()`` is changed from ``locale.getpreferredencoding(False)`` (called "locale encoding" in this PEP) to "UTF-8". When there is device encoding (``os.device_encoding(buffer.fileno())``), it still precedes than the default encoding. Not affected areas ------------------ Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respect locale encoding. ``stdin``, ``stdout``, and ``stderr`` keep respecting locale too. For example, these commands don't cause mojibake regardless code page:: > python -c "print('こんにちは')" | more こんにちは > python -c "print('こんにちは')" > temp.txt > type temp.txt こんにちは Pipes and TTY should use locale encoding: * ``subprocess`` and ``os.popen`` use locale encoding because subprocess will use locale encoding. * ``getpass.getpass`` uses locale encoding when using TTY. Affected APIs -------------- All other code using default encoding of ``TextIOWrapper`` or ``open`` are affected. This is incomplete list of APIs affected by this PEP: * ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``. * ``socket.makefile`` * ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile`` * ``trace.CoverageResults.write_results_file`` These APIs will use always "UTF-8" when opening text files. Deprecation Warning ------------------- From 3.8, ``DeprecationWarning`` is shown when encoding is omitted and locale encoding is not UTF-8. This helps not only writing forward compatible code, but also investigating unexpected ``UnicodeDecodeError`` caused by assuming default text encoding is UTF-8. (See `People assume it is always UTF-8`_ above.) Rationale ========= Why not just enable UTF-8 mode by default? ------------------------------------------ This PEP is not mutually exclusive to UTF-8 mode. If we enable UTF-8 mode by default, even people using Windows will forget the default encoding is not always UTF-8. More scripts will be written assuming the default encoding is UTF-8. So changing default encoding of text files to always UTF-8 would be better even if UTF-8 mode is enabled by default at some point. Why not change std(in|out|err) encoding too? -------------------------------------------- Even when locale encoding is not UTF-8, there will be many UTF-8 text files. These files are downloaded from the internet, or written by modern text editor same to editing Python source. On the other hand, terminal encoding is assumed to be equal to locale encoding. And other tools are assumed to read and write locale encoding too. std(in|out|err) are likely to be connected to a terminal or other tools. So locale encoding should be respected. Why not warn always when encoding is omitted? ---------------------------------------------- Omitting default encoding is a common mistake when writing portable code. But when portability does not matter, assuming UTF-8 is not so bad because Python already implemented locale coercion (:pep:`538`) and UTF-8 mode (:pep:`540`). And these scripts will become portable when default encoding is changed to always UTF-8. Backward compatibility ====================== There may be scripts relying on locale or code page which is not UTF-8. They must be rewritten to specify ``encoding`` explicitly. * If the script assumed ``latin1`` or ``cp932``, use ``encoding="latin1"`` or ``encoding="cp932"`` should be used. * If the script is designed to respect locale encoding, ``locale.getpreferredencoding(False)`` should be used. There are non-portable short forms of ``locale.getpreferredencoding(False)``. * On Windows, ``"mbcs"`` can be used instead. * On Unix, ``os.fsencoding()`` can be used instead. Note that such scripts will be broken even without upgrading Python: * Upgrading Windows * Changing the language setting * Changing terminal from legacy console to a modern one * Using tools which does ``chcp 65001`` How to Teach This ================= When opening text files, "UTF-8" is used by default. It is consistent with default encoding used for ``text.encode()``. Reference Implementation ======================== To be written. Rejected Ideas ============== To be discussed. Open Issues =========== Alias for locale encoding -------------------------- ``encoding=locale.getpreferredencoding(False)`` is too long, and ``"mbcs"`` or ``os.fsencoding()`` are not portable. We may be possible to add new alias encoding "locale" for easy and portable version of ``locale.getpreferredencoding(False)``. I'm not sure this is easy enough because ``encodings`` is imported before ``_bootlocale`` currently. Another option is ``TextIOWrapper`` treats `"locale"` as special case:: if encoding == "locale": encoding = locale.getpreferredencoding(False) References ========== Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: