PEP: 597
Title: Use UTF-8 for default text file encoding
Author: Inada Naoki  <songofacandy@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 05-Jun-2019
Python-Version: 3.9


Abstract
========

Currently, ``TextIOWrapper`` uses ``locale.getpreferredencoding(False)``
(hereinafter called "locale encoding") when ``encoding`` is not specified.

This PEP proposes changing the default text encoding to "UTF-8"
regardless of platform or locale.


Motivation
==========

People assume it is always UTF-8
--------------------------------

Package authors using macOS or Linux may forget that the default encoding
is not always UTF-8.

For example, ``long_description = open("README.md").read()`` in
``setup.py`` is a common mistake.  If there is at least one emoji or any
other non-ASCII character in the ``README.md`` file, many Windows users
cannot install the package due to a ``UnicodeDecodeError``.
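
The failure can be reproduced on any platform by decoding explicitly with a
legacy code page.  The following is a minimal sketch, not taken from any real
``setup.py``: the file contents and the choice of ``cp1252`` as the stand-in
locale encoding are assumptions made for illustration.

.. code-block:: python

   import os
   import tempfile

   # U+1F40D is encoded as the bytes F0 9F 90 8D in UTF-8.
   text = "Hello \U0001F40D\n"

   with tempfile.TemporaryDirectory() as tmp:
       path = os.path.join(tmp, "README.md")
       with open(path, "w", encoding="utf-8") as f:
           f.write(text)

       # Simulate a reader whose default encoding is a legacy code page.
       # cp1252 has no mapping for byte 0x90, so decoding fails.
       try:
           with open(path, encoding="cp1252") as f:
               f.read()
           legacy_failed = False
       except UnicodeDecodeError:
           legacy_failed = True

       # Specifying the encoding explicitly works regardless of the locale.
       with open(path, encoding="utf-8") as f:
           long_description = f.read()

Passing ``encoding="utf-8"`` explicitly is the portable form of the
``setup.py`` idiom under the current default.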


Active code page is not stable
------------------------------

Some tools on Windows change the active code page to 65001 (UTF-8), and
Microsoft is using UTF-8 and cp65001 more widely in recent versions of
Windows 10.

For example, "Command Prompt" uses the legacy code page by default.
But the Windows Subsystem for Linux (WSL) changes the active code page to
65001, and ``python.exe`` can be executed from the WSL.  So ``python.exe``
executed from the legacy console and from the WSL cannot read text files
written by each other.

But many Windows users don't understand which code page is active.
So changing the default text file encoding based on the active code page
causes confusion.

A consistent default text encoding will make Python's behavior more
predictable and easier to learn.


Using UTF-8 by default is easier on new programmers
---------------------------------------------------

Python is one of the most popular first programming languages.

New programmers may not know about encoding.  When they download text data
written in UTF-8 from the Internet, they are forced to learn about encoding.

Popular text editors like VS Code or Atom use UTF-8 by default.
Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May 2019
Update.  (Note that Python 3.9 will be released in 2021.)

Additionally, the default encoding of Python source files is UTF-8.
We can assume new Python programmers who don't know about encoding
use editors which use UTF-8 by default.

It would be nice if new programmers were not forced to learn about encoding
until they need to handle text files in an encoding other than UTF-8.


Specification
=============

From Python 3.9, the default encoding of ``TextIOWrapper`` and ``open()`` is
changed from ``locale.getpreferredencoding(False)`` to "UTF-8".

When a device encoding is available
(``os.device_encoding(buffer.fileno())``), it still supersedes the default
encoding.
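
The selection rule can be sketched as follows.  ``default_text_encoding`` is
a hypothetical helper written for this illustration, not part of the
``io`` module or of any reference implementation.

.. code-block:: python

   import os
   import tempfile


   def default_text_encoding(fd):
       """Return the encoding that would be used when none is specified.

       Hypothetical helper: the device encoding wins when one exists,
       otherwise the proposed default of UTF-8 is used.
       """
       # os.device_encoding() returns None unless fd is connected to a
       # terminal.
       return os.device_encoding(fd) or "utf-8"


   # A regular file has no device encoding, so the default is UTF-8.
   with tempfile.TemporaryFile() as raw:
       file_default = default_text_encoding(raw.fileno())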


Unaffected areas
----------------

Unlike UTF-8 mode, ``locale.getpreferredencoding(False)`` still respects
locale encoding.

``stdin``, ``stdout``, and ``stderr`` continue to respect locale encoding
as well.  For example, these commands do not cause mojibake regardless of the
active code page::

   > python -c "print('こんにちは')" | more
   こんにちは
   > python -c "print('こんにちは')" > temp.txt
   > type temp.txt
   こんにちは

Pipes and TTY should use the locale encoding:

* ``subprocess`` and ``os.popen`` use the locale encoding because the
  subprocess will use the locale encoding.
* ``getpass.getpass`` uses the locale encoding when using TTY.


Affected APIs
-------------

All other code using the default encoding of ``TextIOWrapper`` or ``open`` is
affected.  This is an incomplete list of APIs affected by this PEP:

* ``lzma.open``, ``gzip.open``, ``bz2.open``, ``ZipFile.read_text``
* ``socket.makefile``
* ``tempfile.TemporaryFile``, ``tempfile.NamedTemporaryFile``
* ``trace.CoverageResults.write_results_file``

These APIs will always use "UTF-8" when opening text files.
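
Code that calls these APIs in text mode can already pin the encoding
explicitly, which behaves the same before and after the change.  A short
sketch using ``gzip.open``; the file name and contents are made up for the
example.

.. code-block:: python

   import gzip
   import os
   import tempfile

   text = "caf\u00e9\n"

   with tempfile.TemporaryDirectory() as tmp:
       path = os.path.join(tmp, "notes.txt.gz")

       # Text mode ("wt"/"rt") wraps the compressed stream in a
       # TextIOWrapper, so the encoding argument applies here as well.
       with gzip.open(path, "wt", encoding="utf-8") as f:
           f.write(text)
       with gzip.open(path, "rt", encoding="utf-8") as f:
           round_tripped = f.read()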


Deprecation Warning
-------------------

From 3.8 onwards, ``DeprecationWarning`` is shown when encoding is omitted and
the locale encoding is not UTF-8.  This helps not only when writing
forward-compatible code, but also when investigating an unexpected
``UnicodeDecodeError`` caused by assuming the default text encoding is UTF-8.
(See `People assume it is always UTF-8`_ above.)
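
The condition for the warning might look like the sketch below.  The function
name, its ``locale_encoding`` parameter, and the message text are assumptions
made for this illustration; the real check would live inside
``TextIOWrapper``.

.. code-block:: python

   import codecs
   import locale
   import warnings


   def check_omitted_encoding(encoding=None, locale_encoding=None):
       """Warn when encoding is omitted and the locale is not UTF-8.

       Hypothetical helper.  Returns True if a warning was issued.
       """
       if locale_encoding is None:
           locale_encoding = locale.getpreferredencoding(False)
       # codecs.lookup() normalizes spellings such as "UTF-8" and "utf8".
       is_utf8 = codecs.lookup(locale_encoding).name == "utf-8"
       if encoding is None and not is_utf8:
           warnings.warn(
               "encoding is not specified; the default will change to UTF-8",
               DeprecationWarning,
               stacklevel=2,
           )
           return True
       return False

Because ``DeprecationWarning`` is hidden by default outside ``__main__``,
running with ``-W default`` or ``-W error`` may be needed to see it.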


Rationale
=========

Why not just enable UTF-8 mode by default?
------------------------------------------

This PEP is not mutually exclusive with UTF-8 mode.

If we enable UTF-8 mode by default, even people using Windows will forget
the default encoding is not always UTF-8.  More scripts will be written
assuming the default encoding is UTF-8.

So changing the default encoding of text files to UTF-8 would be better
even if UTF-8 mode is enabled by default at some point.


Why not change std(in|out|err) encoding too?
--------------------------------------------

Even when the locale encoding is not UTF-8, there can be many UTF-8
text files.  These files could be downloaded from the Internet or
written by modern text editors.

On the other hand, terminal encoding is assumed to be the same as
locale encoding.  And other tools are assumed to read and write the
locale encoding as well.

std(in|out|err) are likely to be connected to a terminal or other tools.
So the locale encoding should be respected.


Why not always warn when encoding is omitted?
---------------------------------------------

Omitting encoding is a common mistake when writing portable code.

But when portability does not matter, assuming UTF-8 is not so bad because
Python already implements locale coercion (:pep:`538`) and UTF-8 mode
(:pep:`540`).

And these scripts will become portable when the default encoding is changed
to UTF-8.


Backward compatibility
======================

There may be scripts relying on the locale encoding or active code page not
being UTF-8.  They must be rewritten to specify ``encoding`` explicitly.

* If the script assumes ``latin1`` or ``cp932``, ``encoding="latin1"``
  or ``encoding="cp932"`` should be used.

* If the script is designed to respect locale encoding,
  ``locale.getpreferredencoding(False)`` should be used.

  There are non-portable short forms of
  ``locale.getpreferredencoding(False)``.

  * On Windows, ``"mbcs"`` can be used instead.
  * On Unix, ``sys.getfilesystemencoding()`` can be used instead.

Note that such scripts can break even without upgrading Python, such as
when:

* Upgrading Windows
* Changing the language setting
* Changing terminal from legacy console to a modern one
* Using tools which do ``chcp 65001``
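
A script that deliberately depends on the locale encoding could make that
choice explicit as in the sketch below.  The file name and contents are
placeholders; only ASCII text is written so that the example works under any
locale.

.. code-block:: python

   import locale
   import os
   import tempfile

   # Ask for the locale encoding explicitly instead of relying on the
   # default of open().
   locale_encoding = locale.getpreferredencoding(False)

   with tempfile.TemporaryDirectory() as tmp:
       path = os.path.join(tmp, "legacy.txt")
       with open(path, "w", encoding=locale_encoding) as f:
           f.write("plain ASCII text\n")
       with open(path, encoding=locale_encoding) as f:
           content = f.read()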


How to Teach This
=================

When opening text files, "UTF-8" is used by default.  It is consistent with
the default encoding used for ``text.encode()``.
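
For instance, ``str.encode()`` and ``bytes.decode()`` already default to
UTF-8 when no encoding is given:

.. code-block:: python

   text = "caf\u00e9"

   # No encoding argument: both calls use UTF-8.
   encoded = text.encode()
   decoded = encoded.decode()

Here ``encoded`` is ``b"caf\xc3\xa9"``, the UTF-8 form of the string, and
``decoded`` equals the original text.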


Reference Implementation
========================

To be written.


Rejected Ideas
==============

To be discussed.


Open Issues
===========

Alias for locale encoding
-------------------------

``encoding=locale.getpreferredencoding(False)`` is too long, and
``"mbcs"`` and ``sys.getfilesystemencoding()`` are not portable.

It may be possible to add a new "locale" encoding alias as an easy and
portable version of ``locale.getpreferredencoding(False)``.

The difficulty of this is uncertain because ``encodings`` is currently
imported prior to ``_bootlocale``.

Another option is for ``TextIOWrapper`` to treat ``"locale"`` as a special
case::

   if encoding == "locale":
       encoding = locale.getpreferredencoding(False)
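
Combined with the new default, the special case could be sketched as a
standalone function.  ``resolve_encoding`` is a hypothetical name used only
for this illustration and is not a proposed API.

.. code-block:: python

   import locale


   def resolve_encoding(encoding=None):
       """Map the ``encoding`` argument to a concrete codec name.

       Hypothetical helper mirroring the special case described above.
       """
       if encoding is None:
           # Proposed default, regardless of platform or locale.
           return "utf-8"
       if encoding == "locale":
           # Opt in to the current behavior explicitly.
           return locale.getpreferredencoding(False)
       return encoding

With such a rule, ``open(path, encoding="locale")`` would be the portable
spelling for scripts that want the locale encoding.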


References
==========


Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: