PEP: 597
Title: Add optional EncodingWarning
Last-Modified: 30-Jan-2021
Author: Inada Naoki <songofacandy@gmail.com>
Discussions-To: https://discuss.python.org/t/3880
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 05-Jun-2019
Python-Version: 3.10


Abstract
========

Add a new warning category, ``EncodingWarning``.  It is emitted when
the ``encoding`` option is omitted and the default encoding is the
locale encoding.

The warning is disabled by default.  A new ``-X warn_encoding``
command-line option and a new ``PYTHONWARNENCODING`` environment
variable are used to enable it.

An ``encoding="locale"`` option is added too.  It is used to specify
the locale encoding explicitly.


Motivation
==========

Using the default encoding is a common mistake
----------------------------------------------

Developers using macOS or Linux may forget that the default encoding
is not always UTF-8.

For example, ``long_description = open("README.md").read()`` in
``setup.py`` is a common mistake.  Many Windows users cannot install
such a package if the UTF-8 encoded ``README.md`` file contains at
least one non-ASCII character (e.g. an emoji).

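
The following is a minimal sketch of the corrected pattern, not code
taken from any particular package.  The helper name
``read_long_description`` and the temporary ``README.md`` are
assumptions made only for illustration; the point is that passing
``encoding="utf-8"`` makes the result independent of the locale.

.. code-block:: python

   import os
   import tempfile

   def read_long_description(path):
       """Read a README as UTF-8, regardless of the locale encoding."""
       # Hypothetical helper: the explicit encoding is the whole fix.
       with open(path, encoding="utf-8") as f:
           return f.read()

   # Usage: write a README containing a non-ASCII character, then read
   # it back the way a setup.py would.
   with tempfile.TemporaryDirectory() as tmp:
       readme = os.path.join(tmp, "README.md")
       with open(readme, "w", encoding="utf-8") as f:
           f.write("Fast \u26a1 package\n")
       long_description = read_long_description(readme)
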
Of the 4000 most downloaded packages on PyPI, 489 used non-ASCII
characters in their README, and 82 of those could not be installed
from the source package when the locale encoding was ASCII. [1]_
They used the default encoding to read a README or TOML file.

Another example is ``logging.basicConfig(filename="log.txt")``.
Some users expect UTF-8 to be used by default, but the locale
encoding is actually used. [2]_

Even Python experts assume that the default encoding is UTF-8.
This creates bugs that happen only on Windows.  See [3]_, [4]_, [5]_,
and [6]_ for examples.

Emitting a warning when the ``encoding`` option is omitted will help
to find such mistakes.


Explicit way to use locale-specific encoding
--------------------------------------------

``open(filename)`` isn't explicit about which encoding is expected:

* Expects ASCII (not a bug, but inefficient on Windows)
* Expects UTF-8 (a bug, or a platform-specific script)
* Expects the locale encoding.

From this point of view, ``open(filename)`` is not readable.

``encoding=locale.getpreferredencoding(False)`` can be used to
specify the locale encoding explicitly, but it is too long and easy
to misuse (e.g. by forgetting to pass ``False``).

This PEP provides an explicit way to specify the locale encoding.

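
For comparison, the existing long form mentioned above can be
sketched as follows.  The wrapper name ``open_in_locale_encoding`` is
hypothetical and exists only for this illustration.

.. code-block:: python

   import locale
   import os
   import tempfile

   def open_in_locale_encoding(path, mode="r"):
       """Open *path* in text mode using the current locale encoding."""
       # Passing False (do_setlocale) asks the function not to call
       # setlocale(); this is the argument that is easy to forget.
       encoding = locale.getpreferredencoding(False)
       return open(path, mode, encoding=encoding)

   # Usage: round-trip a small ASCII file through the helper.
   with tempfile.TemporaryDirectory() as tmp:
       path = os.path.join(tmp, "example.txt")
       with open_in_locale_encoding(path, "w") as f:
           f.write("hello\n")
       with open_in_locale_encoding(path) as f:
           text = f.read()
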

Prepare to change the default encoding to UTF-8
-----------------------------------------------

Since UTF-8 has become the de facto standard text encoding, we might
change the default text encoding to UTF-8 in the future.

But this change will affect many applications and libraries.  If we
start emitting ``DeprecationWarning`` everywhere the ``encoding``
option is omitted, it will be too noisy and painful.

Although this PEP doesn't propose changing the default encoding, it
will help with that change:

* It reduces the number of omitted ``encoding`` options in many
  libraries before the warning is emitted by default.

* Users will be able to use the ``encoding="locale"`` option to
  suppress the warning without dropping Python 3.10 support.


Specification
=============

``EncodingWarning``
--------------------

Add a new ``EncodingWarning`` warning class, which is a subclass of
``Warning``.  It is used to warn when the ``encoding`` option is
omitted and the default encoding is locale-specific.


Options to enable the warning
------------------------------

The ``-X warn_encoding`` option and the ``PYTHONWARNENCODING``
environment variable are added.  They are used to enable
``EncodingWarning``.

``sys.flags.encoding_warning`` is also added.  The flag indicates
whether ``EncodingWarning`` is enabled.

When the option is enabled, ``io.TextIOWrapper()``, ``open()``, and
other modules using them will emit ``EncodingWarning`` when
``encoding`` is omitted.

Since ``EncodingWarning`` is a subclass of ``Warning``, the warnings
are shown by default once emitted, unlike ``DeprecationWarning``.


``encoding="locale"`` option
----------------------------

``io.TextIOWrapper`` accepts the ``encoding="locale"`` option.  It
means the same as the current ``encoding=None``, except that
``io.TextIOWrapper`` doesn't emit ``EncodingWarning`` when
``encoding="locale"`` is specified.


``io.text_encoding()``
-----------------------

``io.text_encoding()`` is a helper function for functions that have
an ``encoding=None`` option and pass it to ``io.TextIOWrapper()`` or
``open()``.

A pure Python implementation will look like this::

    import sys

    def text_encoding(encoding, stacklevel=1):
        """Helper function to choose the text encoding.

        When *encoding* is not None, just return it.
        Otherwise, return the default text encoding (i.e., "locale").

        This function emits EncodingWarning if *encoding* is None and
        sys.flags.encoding_warning is true.

        This function can be used in APIs that have an encoding=None
        option and pass it to TextIOWrapper or open.
        But please consider using encoding="utf-8" for new APIs.
        """
        if encoding is None:
            if sys.flags.encoding_warning:
                import warnings
                # stacklevel + 2 skips this function and its direct
                # caller, so the warning points at the caller's caller.
                warnings.warn("'encoding' option is omitted",
                              EncodingWarning, stacklevel + 2)
            encoding = "locale"
        return encoding

For example, ``pathlib.Path.read_text()`` can use the function like
this:

.. code-block::

   def read_text(self, encoding=None, errors=None):
       encoding = io.text_encoding(encoding)
       with self.open(mode='r', encoding=encoding, errors=errors) as f:
           return f.read()

By using ``io.text_encoding()``, ``EncodingWarning`` is emitted for
the caller of ``read_text()`` instead of ``read_text()`` itself.

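
The effect of ``stacklevel + 2`` can be checked with a self-contained
sketch.  Two stand-ins are assumptions made so that it runs on any
Python 3 version: the module-level ``ENCODING_WARNING_ENABLED`` flag
replaces ``sys.flags.encoding_warning``, and ``EncodingWarningSketch``
replaces the proposed ``EncodingWarning``.  ``read_config`` and
``caller`` are hypothetical functions.

.. code-block:: python

   import warnings

   # Assumption: stands in for sys.flags.encoding_warning.
   ENCODING_WARNING_ENABLED = True

   class EncodingWarningSketch(Warning):
       """Stand-in for the proposed EncodingWarning class."""

   def text_encoding(encoding, stacklevel=1):
       if encoding is None:
           if ENCODING_WARNING_ENABLED:
               # Skip this frame and the wrapper's frame.
               warnings.warn("'encoding' option is omitted",
                             EncodingWarningSketch, stacklevel + 2)
           encoding = "locale"
       return encoding

   def read_config(path, encoding=None):
       # Hypothetical wrapper; a real one would pass the result on to
       # open() or TextIOWrapper instead of returning it.
       return text_encoding(encoding)

   def caller():
       return read_config("settings.ini")

   with warnings.catch_warnings(record=True) as caught:
       warnings.simplefilter("always")
       result = caller()

The recorded warning carries the line number of the
``read_config(...)`` call inside ``caller()``, not a line inside
``read_config()`` or ``text_encoding()``.
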

Affected stdlibs
-----------------

Many standard library modules will be affected by this change.

Most APIs accepting ``encoding=None`` will use ``io.text_encoding()``
as described in the previous section.

Where using the locale encoding as the default encoding is
reasonable, ``encoding="locale"`` will be used instead.  For example,
the ``subprocess`` module will use the locale encoding as the default
encoding of the pipes.

Many tests use ``open()`` without ``encoding`` specified to read
ASCII text files.  They should be rewritten with ``encoding="ascii"``.


Rationale
=========

Opt-in warning
---------------

Although ``DeprecationWarning`` is suppressed by default, always
emitting ``DeprecationWarning`` when the ``encoding`` option is
omitted would be too noisy.

Noisy warnings may lead developers to dismiss
``DeprecationWarning`` altogether.


"locale" is not a codec alias
-----------------------------

We don't add "locale" as a codec alias because the locale can be
changed at runtime.

Additionally, ``TextIOWrapper`` checks ``os.device_encoding()``
when ``encoding=None``.  This behavior cannot be implemented in
a codec.

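
One consequence can be illustrated with a short sketch: because
"locale" is not registered as a codec, a lookup through the ``codecs``
module is expected to fail, while ordinary codec names resolve.  The
helper name ``is_codec_name`` is hypothetical.

.. code-block:: python

   import codecs

   def is_codec_name(name):
       """Return True if *name* resolves through the codec registry."""
       try:
           codecs.lookup(name)
       except LookupError:
           return False
       return True

   utf8_known = is_codec_name("utf-8")
   # Expected to be False: "locale" is handled by TextIOWrapper,
   # not by the codec registry.
   locale_known = is_codec_name("locale")
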

Backward Compatibility
======================

The new warning is not emitted by default, so this PEP is 100%
backward compatible.


Forward Compatibility
=====================

The ``encoding="locale"`` option is not forward compatible.  Code
using the option will not work on Python versions older than 3.10;
it will raise ``LookupError: unknown encoding: locale``.

Until developers can drop Python 3.9 support, ``EncodingWarning``
can be used only for finding missing ``encoding="utf-8"`` options.

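
Code that must keep running on older versions could guard the option
behind a version check.  The sketch below is only one possible
workaround and is not part of this proposal: the helper name
``locale_encoding_argument`` is hypothetical, and it assumes the
option lands in Python 3.10 as this PEP targets.

.. code-block:: python

   import os
   import sys
   import tempfile

   def locale_encoding_argument():
       """Return a value for ``encoding=`` selecting the locale encoding."""
       if sys.version_info >= (3, 10):
           # Explicit, and does not trigger the opt-in warning.
           return "locale"
       # Older versions only understand the implicit default.
       return None

   # Usage: round-trip a small ASCII file.
   with tempfile.TemporaryDirectory() as tmp:
       path = os.path.join(tmp, "notes.txt")
       with open(path, "w", encoding=locale_encoding_argument()) as f:
           f.write("plain ASCII text\n")
       with open(path, encoding=locale_encoding_argument()) as f:
           content = f.read()
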

How to teach this
=================

For new users
-------------

Since ``EncodingWarning`` is used to write cross-platform code,
there is no need to teach it to new users.

We can just recommend using UTF-8 for text files and using
``encoding="utf-8"`` when opening them.

For experienced users
---------------------

Using ``open(filename)`` to read text files encoded in UTF-8 is a
common mistake.  It may not work on Windows because UTF-8 is not the
default encoding.

You can use ``-X warn_encoding`` or ``PYTHONWARNENCODING=1`` to find
this type of mistake.

Omitting the ``encoding`` option is not a bug when opening text files
encoded in the locale encoding, but ``encoding="locale"`` is
recommended from Python 3.10 onward because it is more explicit.


Reference Implementation
========================

https://github.com/python/cpython/pull/19481


Discussions
===========

* Why not implement this in linters?

  * ``encoding="locale"`` and ``io.text_encoding()`` must be in
    Python.

  * It is difficult to find all callers of functions wrapping
    ``open()`` or ``TextIOWrapper()``.  (See the
    ``io.text_encoding()`` section.)

* Many developers will not use the option.

  * Some developers will use the option and report the warnings to
    the libraries they use, so the option is worthwhile even though
    many developers won't use it.

  * For example, I found [7]_ and [8]_ by running
    ``pip install -U pip``, and found [9]_ by running ``tox``
    with the reference implementation.  This demonstrates how the
    option can find potential issues.


References
==========

.. [1] "Packages can't be installed when encoding is not UTF-8"
       (https://github.com/methane/pep597-pypi-ascii)

.. [2] "Logging - Inconsistent behaviour when handling unicode"
       (https://bugs.python.org/issue37111)

.. [3] Packaging tutorial in packaging.python.org didn't specify
       encoding to read a ``README.md``
       (https://github.com/pypa/packaging.python.org/pull/682)

.. [4] ``json.tool`` had used locale encoding to read JSON files.
       (https://bugs.python.org/issue33684)

.. [5] site: Potential UnicodeDecodeError when handling pth file
       (https://bugs.python.org/issue33684)

.. [6] pypa/pip: "Installing packages fails if Python 3 installed
       into path with non-ASCII characters"
       (https://github.com/pypa/pip/issues/9054)

.. [7] "site: Potential UnicodeDecodeError when handling pth file"
       (https://bugs.python.org/issue43214)

.. [8] "[pypa/pip] Use ``encoding`` option or binary mode for open()"
       (https://github.com/pypa/pip/pull/9608)

.. [9] "Possible UnicodeError caused by missing encoding="utf-8""
       (https://github.com/tox-dev/tox/issues/1908)


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: