PEP: 597
Title: Add optional EncodingWarning
Last-Modified: 07-Aug-2021
Author: Inada Naoki
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 05-Jun-2019
Python-Version: 3.10


Abstract
========

Add a new warning category ``EncodingWarning``. It is emitted when the
``encoding`` argument to ``open()`` is omitted and the default
locale-specific encoding is used.

The warning is disabled by default. A new ``-X warn_default_encoding``
command-line option and a new ``PYTHONWARNDEFAULTENCODING`` environment
variable can be used to enable it.

A ``"locale"`` argument value for ``encoding`` is added too. It
explicitly specifies that the locale encoding should be used, silencing
the warning.


Motivation
==========

Using the default encoding is a common mistake
----------------------------------------------

Developers using macOS or Linux may forget that the default encoding
is not always UTF-8.

For example, using ``long_description = open("README.md").read()`` in
``setup.py`` is a common mistake. Many Windows users cannot install
such packages if there is at least one non-ASCII character
(e.g. emoji, author names, copyright symbols, and the like)
in their UTF-8-encoded ``README.md`` file.

Of the 4000 most downloaded packages from PyPI, 489 use non-ASCII
characters in their README, and 82 fail to install from source on
non-UTF-8 locales due to not specifying an encoding for a non-ASCII
file. [1]_

Another example is ``logging.basicConfig(filename="log.txt")``.
Some users might expect it to use UTF-8 by default, but the locale
encoding is actually what is used. [2]_

Even Python experts may assume that the default encoding is UTF-8.
This creates bugs that only happen on Windows; see [3]_, [4]_, [5]_,
and [6]_ for example.

Emitting a warning when the ``encoding`` argument is omitted will help
find such mistakes.

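
The failure mode described above can be reproduced on any platform by
passing an explicit non-UTF-8 encoding, which stands in for a Windows
locale encoding. The following is a minimal sketch rather than code
from any real package: the helper name ``read_readme`` and the sample
text are hypothetical, and ``cp1252`` is only an assumed example of a
locale encoding.

.. code-block:: python

    import os
    import tempfile


    def read_readme(path, encoding):
        """Read a text file with an explicitly chosen encoding."""
        with open(path, encoding=encoding) as f:
            return f.read()


    # Write a UTF-8 README that contains one non-ASCII character.
    original = "Copyright \u00a9 Example Author\n"
    fd, path = tempfile.mkstemp(suffix=".md")
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as f:
        f.write(original)

    try:
        # Stands in for open("README.md") under a cp1252 locale: no error
        # is raised, but the copyright sign is decoded incorrectly.
        garbled = read_readme(path, "cp1252")
        # Being explicit gives the same result on every platform.
        correct = read_readme(path, "utf-8")
        # An ASCII-only locale fails outright instead.
        try:
            read_readme(path, "ascii")
            ascii_failed = False
        except UnicodeDecodeError:
            ascii_failed = True
    finally:
        os.remove(path)

Depending on the locale encoding, the mistake shows up either as an
exception or as silently corrupted text, which is part of why it is
easy to miss on a UTF-8 development machine.
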

Explicit way to use locale-specific encoding
--------------------------------------------

``open(filename)`` isn't explicit about which encoding is expected:

* If ASCII is assumed, this isn't a bug, but may result in decreased
  performance on Windows, particularly with non-Latin-1 locale encodings
* If UTF-8 is assumed, this may be a bug or a platform-specific script
* If the locale encoding is assumed, the behavior is as expected
  (but could change if future versions of Python modify the default)

From this point of view, ``open(filename)`` is not readable code.

``encoding=locale.getpreferredencoding(False)`` can be used to
specify the locale encoding explicitly, but it is too long and easy
to misuse (e.g. one can forget to pass ``False`` as its argument).

This PEP provides an explicit way to specify the locale encoding.


Prepare to change the default encoding to UTF-8
-----------------------------------------------

Since UTF-8 has become the de-facto standard text encoding,
we might default to it for opening files in the future.

However, such a change will affect many applications and libraries.
If we start emitting ``DeprecationWarning`` everywhere the ``encoding``
argument is omitted, it will be too noisy and painful.

Although this PEP doesn't propose changing the default encoding,
it will help enable that change by:

* Reducing the number of omitted ``encoding`` arguments in libraries
  before we start emitting a ``DeprecationWarning`` by default.

* Allowing users to pass ``encoding="locale"`` to suppress
  the current warning and any ``DeprecationWarning`` added in the
  future, as well as retaining consistent behavior if later Python
  versions change the default, ensuring support for any Python version
  >=3.10.


Specification
=============

``EncodingWarning``
-------------------

Add a new ``EncodingWarning`` warning class as a subclass of
``Warning``. It is emitted when the ``encoding`` argument is omitted and
the default locale-specific encoding is used.

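
As a rough illustration of this behavior, the sketch below starts a
child interpreter with ``-X warn_default_encoding`` and collects what
it writes to standard error. It assumes Python 3.10 or later; the names
``CHILD_SOURCE`` and ``run_child`` are hypothetical and exist only for
this example. A child process is used because the option has to be
given when the interpreter starts.

.. code-block:: python

    import os
    import subprocess
    import sys
    import tempfile

    # Source for the child interpreter. The first open() omits the
    # encoding argument; the other two pass it explicitly.
    CHILD_SOURCE = """\
    import sys
    path = sys.argv[1]
    with open(path) as f:
        f.read()
    with open(path, encoding="utf-8") as f:
        f.read()
    with open(path, encoding="locale") as f:
        f.read()
    """


    def run_child(path):
        """Run CHILD_SOURCE with the warning enabled; return its stderr."""
        result = subprocess.run(
            [sys.executable, "-X", "warn_default_encoding",
             "-c", CHILD_SOURCE, path],
            capture_output=True,
            encoding="utf-8",
            errors="replace",
        )
        return result.stderr


    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="ascii") as f:
        f.write("hello\n")

    try:
        # Older interpreters neither emit the warning nor accept "locale",
        # so the child is only started on 3.10+.
        stderr = run_child(path) if sys.version_info >= (3, 10) else ""
    finally:
        os.remove(path)

    print(stderr)

On a supporting interpreter, the captured output should mention
``EncodingWarning`` for the call that omits ``encoding``; the calls
that pass ``encoding`` explicitly are not expected to warn.
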

Options to enable the warning
-----------------------------

The ``-X warn_default_encoding`` option and the
``PYTHONWARNDEFAULTENCODING`` environment variable are added. They
are used to enable ``EncodingWarning``.

``sys.flags.warn_default_encoding`` is also added. The flag is true when
``EncodingWarning`` is enabled.

When the flag is set, ``io.TextIOWrapper()``, ``open()`` and other
modules using them will emit ``EncodingWarning`` when the ``encoding``
argument is omitted.

Since ``EncodingWarning`` is a subclass of ``Warning``, it is
shown by default (if the ``warn_default_encoding`` flag is set), unlike
``DeprecationWarning``.


``encoding="locale"``
---------------------

``io.TextIOWrapper`` will accept ``"locale"`` as a valid argument to
``encoding``. It has the same meaning as the current ``encoding=None``,
except that ``io.TextIOWrapper`` doesn't emit ``EncodingWarning`` when
``encoding="locale"`` is specified.


``io.text_encoding()``
----------------------

``io.text_encoding()`` is a helper for functions with an
``encoding=None`` parameter that pass it to ``io.TextIOWrapper()`` or
``open()``.

A pure Python implementation will look like this::

    import sys

    def text_encoding(encoding, stacklevel=1):
        """A helper function to choose the text encoding.

        When *encoding* is not None, just return it.
        Otherwise, return the default text encoding (i.e. "locale").

        This function emits an EncodingWarning if *encoding* is None and
        sys.flags.warn_default_encoding is true.

        This function can be used in APIs with an encoding=None parameter
        that pass it to TextIOWrapper or open.
        However, please consider using encoding="utf-8" for new APIs.
        """
        if encoding is None:
            if sys.flags.warn_default_encoding:
                import warnings
                warnings.warn(
                    "'encoding' argument not specified.",
                    EncodingWarning, stacklevel + 2)
            encoding = "locale"
        return encoding

For example, ``pathlib.Path.read_text()`` can use it like this:

.. code-block::

    def read_text(self, encoding=None, errors=None):
        encoding = io.text_encoding(encoding)
        with self.open(mode='r', encoding=encoding, errors=errors) as f:
            return f.read()

By using ``io.text_encoding()``, ``EncodingWarning`` is emitted for
the caller of ``read_text()`` instead of ``read_text()`` itself.


Affected standard library modules
---------------------------------

Many standard library modules will be affected by this change.

Most APIs accepting ``encoding=None`` will use ``io.text_encoding()``
as written in the previous section.

Where using the locale encoding as the default encoding is reasonable,
``encoding="locale"`` will be used instead. For example,
the ``subprocess`` module will use the locale encoding as the default
for pipes.

Many tests use ``open()`` without ``encoding`` specified to read
ASCII text files. They should be rewritten with ``encoding="ascii"``.


Rationale
=========

Opt-in warning
--------------

Although ``DeprecationWarning`` is suppressed by default, always
emitting ``DeprecationWarning`` when the ``encoding`` argument is
omitted would be too noisy.

Noisy warnings may lead developers to dismiss the
``DeprecationWarning``.


"locale" is not a codec alias
-----------------------------

We don't add "locale" as a codec alias because the locale can be
changed at runtime.

Additionally, ``TextIOWrapper`` checks ``os.device_encoding()``
when ``encoding=None``. This behavior cannot be implemented in
a codec.


Backward Compatibility
======================

The new warning is not emitted by default, so this PEP is 100%
backwards-compatible.


Forward Compatibility
=====================

Passing ``"locale"`` as the argument to ``encoding`` is not
forward-compatible. Code using it will not work on Python older than
3.10, and will instead raise ``LookupError: unknown encoding: locale``.

Until developers can drop Python 3.9 support, ``EncodingWarning``
can only be used for finding missing ``encoding="utf-8"`` arguments.

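
Code that still has to run on older interpreters could, as one possible
workaround, choose the argument at runtime. The sketch below is not
part of this PEP: the helper name ``locale_encoding_arg`` is
hypothetical, and it relies on the statement above that
``encoding="locale"`` has the same meaning as ``encoding=None``.

.. code-block:: python

    import os
    import sys
    import tempfile


    def locale_encoding_arg():
        """Hypothetical helper: an encoding value meaning "locale encoding".

        On Python 3.10+ return "locale", which is explicit and does not
        trigger EncodingWarning. On older versions return None, because
        "locale" would raise LookupError there.
        """
        if sys.version_info >= (3, 10):
            return "locale"
        return None


    # Round-trip a small ASCII file using the locale encoding explicitly.
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding=locale_encoding_arg()) as f:
        f.write("plain ASCII text\n")

    try:
        with open(path, encoding=locale_encoding_arg()) as f:
            content = f.read()
    finally:
        os.remove(path)

The version check keeps the call sites unchanged, so it can be removed
in one place once support for Python 3.9 is dropped.
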

How to Teach This
=================

For new users
-------------

Since ``EncodingWarning`` is used to write cross-platform code,
there is no need to teach it to new users.

We can just recommend using UTF-8 for text files and using
``encoding="utf-8"`` when opening them.


For experienced users
---------------------

Using ``open(filename)`` to read text files encoded in UTF-8 is a
common mistake. It may not work on Windows because UTF-8 is not the
default encoding.

You can use ``-X warn_default_encoding`` or
``PYTHONWARNDEFAULTENCODING=1`` to find this type of mistake.

Omitting the ``encoding`` argument is not a bug when opening text files
encoded in the locale encoding, but ``encoding="locale"`` is recommended
in Python 3.10 and later because it is more explicit.


Reference Implementation
========================

https://github.com/python/cpython/pull/19481


Discussions
===========

The latest discussion thread is:
https://mail.python.org/archives/list/python-dev@python.org/thread/SFYUP2TWD5JZ5KDLVSTZ44GWKVY4YNCV/

* Why not implement this in linters?

  * ``encoding="locale"`` and ``io.text_encoding()`` must be
    implemented in Python.

  * It is difficult to find all callers of functions wrapping
    ``open()`` or ``TextIOWrapper()`` (see the ``io.text_encoding()``
    section).

* Many developers will not use the option.

  * Some will, and report the warnings to libraries they use,
    so the option is worth it even if many developers don't enable it.

  * For example, I found [7]_ and [8]_ by running
    ``pip install -U pip``, and [9]_ by running ``tox``
    with the reference implementation. This demonstrates how this
    option can be used to find potential issues.


References
==========

.. [1] "Packages can't be installed when encoding is not UTF-8"
       (https://github.com/methane/pep597-pypi-ascii)

.. [2] "Logging - Inconsistent behaviour when handling unicode"
       (https://bugs.python.org/issue37111)

.. [3] Packaging tutorial in packaging.python.org didn't specify
       encoding to read a ``README.md``
       (https://github.com/pypa/packaging.python.org/pull/682)

.. [4] ``json.tool`` had used locale encoding to read JSON files.
       (https://bugs.python.org/issue33684)

.. [5] site: Potential UnicodeDecodeError when handling pth file
       (https://bugs.python.org/issue33684)

.. [6] pypa/pip: "Installing packages fails if Python 3 installed
       into path with non-ASCII characters"
       (https://github.com/pypa/pip/issues/9054)

.. [7] "site: Potential UnicodeDecodeError when handling pth file"
       (https://bugs.python.org/issue43214)

.. [8] "[pypa/pip] Use ``encoding`` option or binary mode for open()"
       (https://github.com/pypa/pip/pull/9608)

.. [9] "Possible UnicodeError caused by missing encoding="utf-8""
       (https://github.com/tox-dev/tox/issues/1908)


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   fill-column: 70
   coding: utf-8
   End: