python-peps/pep-0597.rst

352 lines
11 KiB
ReStructuredText

PEP: 597
Title: Add optional EncodingWarning
Last-Modified: 07-Aug-2021
Author: Inada Naoki <songofacandy@gmail.com>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 05-Jun-2019
Python-Version: 3.10
Abstract
========
Add a new warning category ``EncodingWarning``. It is emitted when the
``encoding`` argument to ``open()`` is omitted and the default
locale-specific encoding is used.
The warning is disabled by default. A new ``-X warn_default_encoding``
command-line option and a new ``PYTHONWARNDEFAULTENCODING`` environment
variable can be used to enable it.
A ``"locale"`` argument value for ``encoding`` is added too. It
explicitly specifies that the locale encoding should be used, silencing
the warning.
Motivation
==========
Using the default encoding is a common mistake
----------------------------------------------
Developers using macOS or Linux may forget that the default encoding
is not always UTF-8.
For example, using ``long_description = open("README.md").read()`` in
``setup.py`` is a common mistake. Many Windows users cannot install
such packages if there is at least one non-ASCII character
(e.g. emoji, author names, copyright symbols, and the like)
in their UTF-8-encoded ``README.md`` file.
Of the 4000 most downloaded packages from PyPI, 489 use non-ASCII
characters in their README, and 82 fail to install from source on
non-UTF-8 locales due to not specifying an encoding for a non-ASCII
file. [1]_
Another example is ``logging.basicConfig(filename="log.txt")``.
Some users might expect it to use UTF-8 by default, but the locale
encoding is actually what is used. [2]_
Even Python experts may assume that the default encoding is UTF-8.
This creates bugs that only happen on Windows; see [3]_, [4]_, [5]_,
and [6]_ for example.
Emitting a warning when the ``encoding`` argument is omitted will help
find such mistakes.
Explicit way to use locale-specific encoding
--------------------------------------------
``open(filename)`` isn't explicit about which encoding is expected:
* If ASCII is assumed, this isn't a bug, but may result in decreased
performance on Windows, particularly with non-Latin-1 locale encodings
* If UTF-8 is assumed, this may be a bug or a platform-specific script
* If the locale encoding is assumed, the behavior is as expected
(but could change if future versions of Python modify the default)
From this point of view, ``open(filename)`` is not readable code.
``encoding=locale.getpreferredencoding(False)`` can be used to
specify the locale encoding explicitly, but it is too long and easy
to misuse (e.g. one can forget to pass ``False`` as its argument).
This PEP provides an explicit way to specify the locale encoding.
Prepare to change the default encoding to UTF-8
-----------------------------------------------
Since UTF-8 has become the de-facto standard text encoding,
we might default to it for opening files in the future.
However, such a change will affect many applications and libraries.
If we start emitting ``DeprecationWarning`` everywhere the ``encoding``
argument is omitted, it will be too noisy and painful.
Although this PEP doesn't propose changing the default encoding,
it will help enable that change by:
* Reducing the number of omitted ``encoding`` arguments in libraries
before we start emitting a ``DeprecationWarning`` by default.
* Allowing users to pass ``encoding="locale"`` to suppress
the current warning and any ``DeprecationWarning`` added in the future,
as well as retaining consistent behavior if later Python versions
change the default, ensuring support for any Python version >=3.10.
Specification
=============
``EncodingWarning``
-------------------
Add a new ``EncodingWarning`` warning class as a subclass of
``Warning``. It is emitted when the ``encoding`` argument is omitted and
the default locale-specific encoding is used.
Options to enable the warning
-----------------------------
The ``-X warn_default_encoding`` option and the
``PYTHONWARNDEFAULTENCODING`` environment variable are added. They
are used to enable ``EncodingWarning``.
``sys.flags.warn_default_encoding`` is also added. The flag is true when
``EncodingWarning`` is enabled.
When the flag is set, ``io.TextIOWrapper()``, ``open()`` and other
modules using them will emit ``EncodingWarning`` when the ``encoding``
argument is omitted.
Since ``EncodingWarning`` is a subclass of ``Warning``, they are
shown by default (if the ``warn_default_encoding`` flag is set), unlike
``DeprecationWarning``.
``encoding="locale"``
---------------------
``io.TextIOWrapper`` will accept ``"locale"`` as a valid argument to
``encoding``. It has the same meaning as the current ``encoding=None``,
except that ``io.TextIOWrapper`` doesn't emit ``EncodingWarning`` when
``encoding="locale"`` is specified.
``io.text_encoding()``
----------------------
``io.text_encoding()`` is a helper for functions with an
``encoding=None`` parameter that pass it to ``io.TextIOWrapper()`` or
``open()``.
A pure Python implementation will look like this::
def text_encoding(encoding, stacklevel=1):
"""A helper function to choose the text encoding.
When *encoding* is not None, just return it.
Otherwise, return the default text encoding (i.e. "locale").
This function emits an EncodingWarning if *encoding* is None and
sys.flags.warn_default_encoding is true.
This function can be used in APIs with an encoding=None parameter
that pass it to TextIOWrapper or open.
However, please consider using encoding="utf-8" for new APIs.
"""
if encoding is None:
if sys.flags.warn_default_encoding:
import warnings
warnings.warn(
"'encoding' argument not specified.",
EncodingWarning, stacklevel + 2)
encoding = "locale"
return encoding
For example, ``pathlib.Path.read_text()`` can use it like this:
.. code-block::
def read_text(self, encoding=None, errors=None):
encoding = io.text_encoding(encoding)
with self.open(mode='r', encoding=encoding, errors=errors) as f:
return f.read()
By using ``io.text_encoding()``, ``EncodingWarning`` is emitted for
the caller of ``read_text()`` instead of ``read_text()`` itself.
Affected standard library modules
---------------------------------
Many standard library modules will be affected by this change.
Most APIs accepting ``encoding=None`` will use ``io.text_encoding()``
as written in the previous section.
Where using the locale encoding as the default encoding is reasonable,
``encoding="locale"`` will be used instead. For example,
the ``subprocess`` module will use the locale encoding as the default
for pipes.
Many tests use ``open()`` without ``encoding`` specified to read
ASCII text files. They should be rewritten with ``encoding="ascii"``.
Rationale
=========
Opt-in warning
--------------
Although ``DeprecationWarning`` is suppressed by default, always
emitting ``DeprecationWarning`` when the ``encoding`` argument is
omitted would be too noisy.
Noisy warnings may lead developers to dismiss the
``DeprecationWarning``.
"locale" is not a codec alias
-----------------------------
We don't add "locale" as a codec alias because the locale can be
changed at runtime.
Additionally, ``TextIOWrapper`` checks ``os.device_encoding()``
when ``encoding=None``. This behavior cannot be implemented in
a codec.
Backward Compatibility
======================
The new warning is not emitted by default, so this PEP is 100%
backwards-compatible.
Forward Compatibility
=====================
Passing ``"locale"`` as the argument to ``encoding`` is not
forward-compatible. Code using it will not work on Python older than
3.10, and will instead raise ``LookupError: unknown encoding: locale``.
Until developers can drop Python 3.9 support, ``EncodingWarning``
can only be used for finding missing ``encoding="utf-8"`` arguments.
How to Teach This
=================
For new users
-------------
Since ``EncodingWarning`` is used to write cross-platform code,
there is no need to teach it to new users.
We can just recommend using UTF-8 for text files and using
``encoding="utf-8"`` when opening them.
For experienced users
---------------------
Using ``open(filename)`` to read text files encoded in UTF-8 is a
common mistake. It may not work on Windows because UTF-8 is not the
default encoding.
You can use ``-X warn_default_encoding`` or
``PYTHONWARNDEFAULTENCODING=1`` to find this type of mistake.
Omitting the ``encoding`` argument is not a bug when opening text files
encoded in the locale encoding, but ``encoding="locale"`` is recommended
in Python 3.10 and later because it is more explicit.
Reference Implementation
========================
https://github.com/python/cpython/pull/19481
Discussions
===========
The latest discussion thread is:
https://mail.python.org/archives/list/python-dev@python.org/thread/SFYUP2TWD5JZ5KDLVSTZ44GWKVY4YNCV/
* Why not implement this in linters?
* ``encoding="locale"`` and ``io.text_encoding()`` must be implemented
in Python.
* It is difficult to find all callers of functions wrapping
``open()`` or ``TextIOWrapper()`` (see the ``io.text_encoding()``
section).
* Many developers will not use the option.
* Some will, and report the warnings to libraries they use,
so the option is worth it even if many developers don't enable it.
* For example, I found [7]_ and [8]_ by running
``pip install -U pip``, and [9]_ by running ``tox``
with the reference implementation. This demonstrates how this
option can be used to find potential issues.
References
==========
.. [1] "Packages can't be installed when encoding is not UTF-8"
(https://github.com/methane/pep597-pypi-ascii)
.. [2] "Logging - Inconsistent behaviour when handling unicode"
(https://bugs.python.org/issue37111)
.. [3] Packaging tutorial in packaging.python.org didn't specify
encoding to read a ``README.md``
(https://github.com/pypa/packaging.python.org/pull/682)
.. [4] ``json.tool`` had used locale encoding to read JSON files.
(https://bugs.python.org/issue33684)
.. [5] site: Potential UnicodeDecodeError when handling pth file
(https://bugs.python.org/issue33684)
.. [6] pypa/pip: "Installing packages fails if Python 3 installed
into path with non-ASCII characters"
(https://github.com/pypa/pip/issues/9054)
.. [7] "site: Potential UnicodeDecodeError when handling pth file"
(https://bugs.python.org/issue43214)
.. [8] "[pypa/pip] Use ``encoding`` option or binary mode for open()"
(https://github.com/pypa/pip/pull/9608)
.. [9] "Possible UnicodeError caused by missing encoding="utf-8""
(https://github.com/tox-dev/tox/issues/1908)
Copyright
=========
This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
fill-column: 70
coding: utf-8
End: