PEP: 686
Title: Make UTF-8 mode default
Author: Inada Naoki <songofacandy@gmail.com>
Discussions-To: https://discuss.python.org/t/14435
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 18-Mar-2022
Python-Version: 3.12
Post-History: `18-Mar-2022 <https://discuss.python.org/t/14435>`__


Abstract
========

This PEP proposes enabling :pep:`UTF-8 mode <540>` by default.

With this change, Python consistently uses UTF-8 as the default encoding for
files, stdio, and pipes.


Motivation
==========

UTF-8 has become the de facto standard text encoding.

* The default encoding of Python source files is UTF-8.
* JSON, TOML, and YAML use UTF-8.
* Most text editors, including VS Code and Windows Notepad, use UTF-8 by
  default.
* Most websites and text data on the internet use UTF-8.
* Many other popular programming languages, including Node.js, Go, Rust,
  Ruby, and Java, use UTF-8 by default.

Changing the default encoding to UTF-8 makes it easier for Python to
interoperate with them.

Additionally, many Python developers using Unix forget that the default
encoding is platform dependent. They omit ``encoding="utf-8"`` when they read
text files encoded in UTF-8 (e.g. JSON, TOML, Markdown, and Python source
files). This inconsistent default encoding has caused many bugs.


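As a rough illustration of the kind of bug described above, the sketch below
writes a UTF-8 file and reads it back twice: once with an explicit
``encoding="utf-8"``, and once with ``cp1252``, which is used here only as an
assumed stand-in for a non-UTF-8 locale encoding that ``open()`` might pick
by default on some Windows systems. The helper name ``read_both_ways`` and the
sample text are hypothetical.

```python
import tempfile
from pathlib import Path


def read_both_ways(text="café"):
    """Return (misread, correct) for a UTF-8 file read with two encodings."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"
        # The file on disk is UTF-8, like most JSON, TOML, or Markdown files.
        path.write_text(text, encoding="utf-8")
        # Assumption: cp1252 stands in for a legacy locale encoding.
        misread = path.read_text(encoding="cp1252")
        # Passing the encoding explicitly gives the same result everywhere.
        correct = path.read_text(encoding="utf-8")
    return misread, correct


misread, correct = read_both_ways()
print(repr(misread), repr(correct))
```

No exception is raised for the misread text in this case; the accented
character is silently turned into mojibake, which is why such bugs tend to go
unnoticed on the developer's own machine.
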
Specification
=============

Changes to UTF-8 mode
---------------------

Currently, UTF-8 mode affects ``locale.getpreferredencoding()``.

This PEP proposes removing this override. UTF-8 mode will no longer affect
the ``locale`` module.

After this change, UTF-8 mode affects:

* stdin, stdout, stderr

  * Users can override it with ``PYTHONIOENCODING``.

* filesystem encoding

* ``TextIOWrapper`` and APIs using it, including ``open()``,
  ``Path.read_text()``, ``subprocess.Popen(cmd, text=True)``, etc.

This change will be introduced in Python 3.11 if possible.


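To see which encodings a given interpreter currently uses for the items
listed above, something like the following sketch may help. It only reads
existing settings through standard library calls; the function name
``describe_default_encodings`` and the dictionary keys are made up for this
example, and the values it reports depend on the platform, locale, and Python
version.

```python
import io
import locale
import sys


def describe_default_encodings():
    """Collect the encodings relevant to UTF-8 mode for this interpreter."""
    # A throwaway TextIOWrapper created without ``encoding`` picks up the
    # same default that open() would use when no encoding is given.
    wrapper = io.TextIOWrapper(io.BytesIO())
    try:
        text_io_default = wrapper.encoding
    finally:
        wrapper.close()
    return {
        "utf8_mode": sys.flags.utf8_mode,
        # sys.stdout may be replaced or missing, so read it defensively.
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "text_io_default": text_io_default,
        "locale_preferred": locale.getpreferredencoding(False),
    }


report = describe_default_encodings()
for name, value in report.items():
    print(f"{name}: {value}")
```

Comparing the output with and without UTF-8 mode enabled shows which of these
values the mode changes on a particular system.
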
Enable UTF-8 mode by default
----------------------------

Python will enable UTF-8 mode by default.

Users can still disable UTF-8 mode by setting ``PYTHONUTF8=0`` or passing
``-X utf8=0``.


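The two opt-out switches can be checked from a script by starting a child
interpreter and reading ``sys.flags.utf8_mode`` there. This is a minimal
sketch; the helper name ``utf8_mode_in_child`` and its parameters are
hypothetical, and it removes any inherited ``PYTHONUTF8`` value so that only
the setting under test applies.

```python
import os
import subprocess
import sys


def utf8_mode_in_child(extra_args=(), extra_env=None):
    """Start a child interpreter and return its sys.flags.utf8_mode value."""
    env = dict(os.environ)
    # Drop any inherited setting so only the option under test applies.
    env.pop("PYTHONUTF8", None)
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, *extra_args, "-c",
         "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        capture_output=True,
        encoding="ascii",
        check=True,
    )
    return int(result.stdout.strip())


# Disable UTF-8 mode with the command line option.
print(utf8_mode_in_child(["-X", "utf8=0"]))
# Disable UTF-8 mode with the environment variable.
print(utf8_mode_in_child(extra_env={"PYTHONUTF8": "0"}))
```
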
Backward Compatibility
======================

Most Unix systems use a UTF-8 locale, and Python enables UTF-8 mode when its
locale is C or POSIX. So this change mostly affects Windows users.

When a Python program depends on the default encoding, this change may cause
``UnicodeError``, mojibake, or even silent data corruption. So this change
should be announced very loudly.

To resolve this backward incompatibility, users can:

* Disable UTF-8 mode.
* Use ``EncodingWarning`` to find where the default encoding is used, and use
  the ``encoding="locale"`` option to keep using the locale encoding
  (as defined in :pep:`597`).


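The second option can be tried out roughly as follows, assuming Python 3.10
or later, where :pep:`597` added ``EncodingWarning``, the
``-X warn_default_encoding`` option, and ``encoding="locale"``. The sketch
runs two tiny child programs with the warning enabled and returns what they
print to stderr; the helper name ``stderr_with_encoding_warning`` and the two
code strings are hypothetical.

```python
import subprocess
import sys

# Child program that relies on the default encoding.
IMPLICIT_CODE = "import os; open(os.devnull).close()"
# Child program that explicitly asks for the locale encoding (PEP 597).
LOCALE_CODE = 'import os; open(os.devnull, encoding="locale").close()'


def stderr_with_encoding_warning(code):
    """Run ``code`` in a child interpreter with EncodingWarning enabled."""
    result = subprocess.run(
        [sys.executable, "-X", "warn_default_encoding", "-c", code],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    return result.stderr


# The implicit open() should be reported; the explicit one should not.
print(stderr_with_encoding_warning(IMPLICIT_CODE))
print(stderr_with_encoding_warning(LOCALE_CODE))
```

Each reported location is a place where the program either needs an explicit
``encoding="utf-8"`` or, if it really depends on the system's legacy encoding,
``encoding="locale"``.
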
Preceding examples
==================

* Ruby `changed <Feature #16604_>`__ the default ``external_encoding``
  to UTF-8 on Windows in Ruby 3.0 (2020).
* Java `changed <JEP 400_>`__ the default text encoding
  to UTF-8 in JDK 18 (2022).

Both Ruby and Java have an option for backward compatibility.
Neither provides a warning for use of the default encoding comparable to
:pep:`597`'s ``EncodingWarning`` in Python.


Rejected Alternative
====================

Deprecate implicit encoding
---------------------------

Deprecating the use of the default encoding was considered.

But there are many cases where users rely on the default encoding when they
only need ASCII. And some users use Python only on Unix with a UTF-8 locale.

So forcing users to specify the ``encoding`` option everywhere would be too
painful.

Java also rejected this idea in `JEP 400`_.


How to teach this
=================

For new users, this change reduces the number of things that need to be
taught.

Users can delay learning about text encoding until they need to handle
non-UTF-8 text files.

For existing users, see the `Backward Compatibility`_ section.


References
==========

.. _Feature #16604: https://bugs.ruby-lang.org/issues/16604

.. _JEP 400: https://openjdk.java.net/jeps/400


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.