PEP: 597 Title: Add optional EncodingWarning Last-Modified: 30-Jan-2021 Author: Inada Naoki Discussions-To: https://discuss.python.org/t/3880 Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 05-Jun-2019 Python-Version: 3.10 Abstract ======== Add a new warning category ``EncodingWarning``. It is emitted when ``encoding`` option is omitted and the default encoding is a locale encoding. The warning is disabled by default. New ``-X warn_encoding`` command-line option and ``PYTHONWARNENCODING`` environment variable are used to enable the warnings. ``encoding="locale"`` option is added too. It is used to specify locale encoding explicitly. Motivation ========== Using the default encoding is a common mistake ---------------------------------------------- Developers using macOS or Linux may forget that the default encoding is not always UTF-8. For example, ``long_description = open("README.md").read()`` in ``setup.py`` is a common mistake. Many Windows users can not install the package if there is at least one non-ASCII character (e.g. emoji) in the ``README.md`` file which is encoded in UTF-8. For example, 489 packages of the 4000 most downloaded packages from PyPI used non-ASCII characters in README. And 82 packages of them can not be installed from source package when locale encoding is ASCII. [1]_ They used the default encoding to read README or TOML file. Another example is ``logging.basicConfig(filename="log.txt")``. Some users expect UTF-8 is used by default, but locale encoding is used actually. [2]_ Even Python experts assume that default encoding is UTF-8. It creates bugs that happen only on Windows. See [3]_, [4]_, [5]_, and [6]_ for example. Emitting a warning when the ``encoding`` option is omitted will help to find such mistakes. Explicit way to use locale-specific encoding -------------------------------------------- ``open(filename)`` isn't explicit about which encoding is expected: * Expects ASCII (not a bug, but inefficient on Windows) * Expects UTF-8 (bug or platform specific script) * Expects the locale encoding. In this point of view, ``open(filename)`` is not readable. ``encoding=locale.getpreferredencoding(False)`` can be used to specify the locale encoding explicitly. But it is too long and easy to misuse. (e.g. forget to pass ``False`` to its parameter) This PEP provides an explicit way to specify the locale encoding. Prepare to change the default encoding to UTF-8 ----------------------------------------------- Since UTF-8 becomes de-facto standard text encoding, we might change the default text encoding to UTF-8 in the future. But this change will affect many applications and libraries. If we start emitting ``DeprecationWarning`` everywhere ``encoding`` option is omitted by default, it will be too noisy and painful. Although this PEP doesn't propose to change the default encoding, this PEP will the change: * Reduce the number of omitted ``encoding`` option in many libraries before emitting the warning by default. * Users will be able to use ``encoding="locale"`` option to suppress the warning without dropping Python 3.10 support. Specification ============= ``EncodingWarning`` -------------------- Add a new ``EncodingWarning`` warning class which is a subclass of ``Warning``. It is used to warn when the ``encoding`` option is omitted and the default encoding is locale-specific. Options to enable the warning ------------------------------ ``-X warn_encoding`` option and the ``PYTHONWARNENCODING`` environment variable are added. They are used to enable the ``EncodingWarning``. ``sys.flags.encoding_warning`` is also added. The flag represents ``EncodingWarning`` is enabled. When the option is enabled, ``io.TextIOWrapper()``, ``open()``, and other modules using them will emit ``EncodingWarning`` when ``encoding`` is omitted. Since ``EncodingWarning`` is a subclass of ``Warning``, they are shown by default, unlike ``DeprecationWarning``. ``encoding="locale"`` option ---------------------------- ``io.TextIOWrapper`` accepts ``encoding="locale"`` option. It means same to current ``encoding=None``. But ``io.TextIOWrapper`` doesn't emit ``EncodingWarning`` when ``encoding="locale"`` is specified. ``io.text_encoding()`` ----------------------- ``io.text_encoding()`` is a helper function for functions having ``encoding=None`` option and passing it to ``io.TextIOWrapper()`` or ``open()``. Pure Python implementation will be like this:: def text_encoding(encoding, stacklevel=1): """Helper function to choose the text encoding. When *encoding* is not None, just return it. Otherwise, return the default text encoding (i.e., "locale"). This function emits EncodingWarning if *encoding* is None and sys.flags.encoding_warning is true. This function can be used in APIs having encoding=None option and pass it to TextIOWrapper or open. But please consider using encoding="utf-8" for new APIs. """ if encoding is None: if sys.flags.encoding_warning: import warnings warnings.warn("'encoding' option is omitted", EncodingWarning, stacklevel + 2) encoding = "locale" return encoding For example, ``pathlib.Path.read_text()`` can use the function like: .. code-block:: def read_text(self, encoding=None, errors=None): encoding = io.text_encoding(encoding) with self.open(mode='r', encoding=encoding, errors=errors) as f: return f.read() By using ``io.text_encoding()``, ``EncodingWarning`` is emitted for the caller of ``read_text()`` instead of ``read_text()`` itself. Affected stdlibs ----------------- Many stdlibs will be affected by this change. Most APIs accepting ``encoding=None`` will use ``io.text_encoding()`` as written in the previous section. Where using locale encoding as the default encoding is reasonable, ``encoding="locale"`` will be used instead. For example, the ``subprocess`` module will use locale encoding for the default encoding of the pipes. Many tests use ``open()`` without ``encoding`` specified to read ASCII text files. They should be rewritten with ``encoding="ascii"``. Rationale ========= Opt-in warning --------------- Although ``DeprecationWarning`` is suppressed by default, emitting ``DeprecationWarning`` always when the ``encoding`` option is omitted would be too noisy. Noisy warnings may lead developers to dismiss the ``DeprecationWarning``. "locale" is not a codec alias ----------------------------- We don't add the "locale" to the codec alias because locale can be changed in runtime. Additionally, ``TextIOWrapper`` checks ``os.device_encoding()`` when ``encoding=None``. This behavior can not be implemented in the codec. Backward Compatibility ====================== The new warning is not emitted by default. So this PEP is 100% backward compatible. Forward Compatibility ===================== ``encoding="locale"`` option is not forward compatible. Codes using the option will not work on Python older than 3.10. It will raise ``LookupError: unknown encoding: locale``. Until developers can drop Python 3.9 support, ``EncodingWarning`` can be used only for finding missing ``encoding="utf-8"`` options. How to teach this ================= For new users ------------- Since ``EncodingWarning`` is used to write a cross-platform code, no need to teach it to new users. We can just recommend using UTF-8 for text files and use ``encoding="utf-8"`` when opening test files. For experienced users --------------------- Using ``open(filename)`` to read text files encoded in UTF-8 is a common mistake. It may not work on Windows because UTF-8 is not the default encoding. You can use ``-X warn_encoding`` or ``PYTHONWARNENCODING=1`` to find this type of mistake. Omitting ``encoding`` option is not a bug when opening text files encoded in locale encoding. But ``encoding="locale"`` is recommended after Python 3.10 because it is more explicit. Reference Implementation ======================== https://github.com/python/cpython/pull/19481 Discussions =========== * Why not implement this in linters? * ``encoding="locale"`` and ``io.text_encoding()`` must be in Python. * It is difficult to find all caller of functions wrapping ``open()`` or ``TextIOWrapper()``. (See ``io.text_encoding()`` section.) * Many developers will not use the option. * Some developers use the option and report the warnings to libraries they use. So the option is worth enough even though many developers won't use it. * For example, I find [7]_ and [8]_ by running ``pip install -U pip`` and find [9]_ by running ``tox`` with the reference implementation. It demonstrates how this option find potential issues. References ========== .. [1] "Packages can't be installed when encoding is not UTF-8" (https://github.com/methane/pep597-pypi-ascii) .. [2] "Logging - Inconsistent behaviour when handling unicode" (https://bugs.python.org/issue37111) .. [3] Packaging tutorial in packaging.python.org didn't specify encoding to read a ``README.md`` (https://github.com/pypa/packaging.python.org/pull/682) .. [4] ``json.tool`` had used locale encoding to read JSON files. (https://bugs.python.org/issue33684) .. [5] site: Potential UnicodeDecodeError when handling pth file (https://bugs.python.org/issue33684) .. [6] pypa/pip: "Installing packages fails if Python 3 installed into path with non-ASCII characters" (https://github.com/pypa/pip/issues/9054) .. [7] "site: Potential UnicodeDecodeError when handling pth file" (https://bugs.python.org/issue43214) .. [8] "[pypa/pip] Use ``encoding`` option or binary mode for open()" (https://github.com/pypa/pip/pull/9608) .. [9] "Possible UnicodeError caused by missing encoding="utf-8"" (https://github.com/tox-dev/tox/issues/1908) Copyright ========= This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: