183 lines
6.8 KiB
ReStructuredText
183 lines
6.8 KiB
ReStructuredText
PEP: 528
|
|
Title: Change Windows console encoding to UTF-8
|
|
Version: $Revision$
|
|
Last-Modified: $Date$
|
|
Author: Steve Dower <steve.dower@python.org>
|
|
Status: Final
|
|
Type: Standards Track
|
|
Content-Type: text/x-rst
|
|
Created: 27-Aug-2016
|
|
Python-Version: 3.6
|
|
Post-History: 01-Sep-2016, 04-Sep-2016
|
|
Resolution: https://mail.python.org/pipermail/python-dev/2016-September/146278.html
|
|
|
|
Abstract
|
|
========
|
|
|
|
Historically, Python uses the ANSI APIs for interacting with the Windows
|
|
operating system, often via C Runtime functions. However, these have been long
|
|
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
|
|
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
|
|
the active code page.
|
|
|
|
This PEP proposes changing the default standard stream implementation on Windows
|
|
to use the Unicode APIs. This will allow users to print and input the full range
|
|
of Unicode characters at the default Windows console. This also requires a
|
|
subtle change to how the tokenizer parses text from readline hooks.
|
|
|
|
Specific Changes
|
|
================
|
|
|
|
Add _io.WindowsConsoleIO
|
|
------------------------
|
|
|
|
Currently an instance of ``_io.FileIO`` is used to wrap the file descriptors
|
|
representing standard input, output and error. We add a new class (implemented
|
|
in C) ``_io.WindowsConsoleIO`` that acts as a raw IO object using the Windows
|
|
console functions, specifically, ``ReadConsoleW`` and ``WriteConsoleW``.
|
|
|
|
This class will be used when the legacy-mode flag is not in effect, when opening
|
|
a standard stream by file descriptor and the stream is a console buffer rather
|
|
than a redirected file. Otherwise, ``_io.FileIO`` will be used as it is today.
|
|
|
|
This is a raw (bytes) IO class that requires text to be passed encoded with
|
|
utf-8, which will be decoded to utf-16-le and passed to the Windows APIs.
|
|
Similarly, bytes read from the class will be provided by the operating system as
|
|
utf-16-le and converted into utf-8 when returned to Python.
|
|
|
|
The use of an ASCII compatible encoding is required to maintain compatibility
|
|
with code that bypasses the ``TextIOWrapper`` and directly writes ASCII bytes to
|
|
the standard streams (for example, `Twisted's process_stdinreader.py`_). Code that assumes
|
|
a particular encoding for the standard streams other than ASCII will likely
|
|
break.
|
|
|
|
Add _PyOS_WindowsConsoleReadline
|
|
--------------------------------
|
|
|
|
To allow Unicode entry at the interactive prompt, a new readline hook is
|
|
required. The existing ``PyOS_StdioReadline`` function will delegate to the new
|
|
``_PyOS_WindowsConsoleReadline`` function when reading from a file descriptor
|
|
that is a console buffer and the legacy-mode flag is not in effect (the logic
|
|
should be identical to above).
|
|
|
|
Since the readline interface is required to return an 8-bit encoded string with
|
|
no embedded nulls, the ``_PyOS_WindowsConsoleReadline`` function transcodes from
|
|
utf-16-le as read from the operating system into utf-8.
|
|
|
|
The function ``PyRun_InteractiveOneObject`` which currently obtains the encoding
|
|
from ``sys.stdin`` will select utf-8 unless the legacy-mode flag is in effect.
|
|
This may require readline hooks to change their encodings to utf-8, or to
|
|
require legacy-mode for correct behaviour.
|
|
|
|
Add legacy mode
|
|
---------------
|
|
|
|
Launching Python with the environment variable ``PYTHONLEGACYWINDOWSSTDIO`` set
|
|
will enable the legacy-mode flag, which completely restores the previous
|
|
behaviour.
|
|
|
|
Alternative Approaches
|
|
======================
|
|
|
|
The `win_unicode_console package`_ is a pure-Python alternative to changing the
|
|
default behaviour of the console. It implements essentially the same
|
|
modifications as described here using pure Python code.
|
|
|
|
Code that may break
|
|
===================
|
|
|
|
The following code patterns may break or see different behaviour as a result of
|
|
this change. All of these code samples require explicitly choosing to use a raw
|
|
file object in place of a more convenient wrapper that would prevent any visible
|
|
change.
|
|
|
|
Assuming stdin/stdout encoding
|
|
------------------------------
|
|
|
|
Code that assumes that the encoding required by ``sys.stdin.buffer`` or
|
|
``sys.stdout.buffer`` is ``'mbcs'`` or a more specific encoding may currently be
|
|
working by chance, but could encounter issues under this change. For example::
|
|
|
|
>>> sys.stdout.buffer.write(text.encode('mbcs'))
|
|
>>> r = sys.stdin.buffer.read(16).decode('cp437')
|
|
|
|
To correct this code, the encoding specified on the ``TextIOWrapper`` should be
|
|
used, either implicitly or explicitly::
|
|
|
|
>>> # Fix 1: Use wrapper correctly
|
|
>>> sys.stdout.write(text)
|
|
>>> r = sys.stdin.read(16)
|
|
|
|
>>> # Fix 2: Use encoding explicitly
|
|
>>> sys.stdout.buffer.write(text.encode(sys.stdout.encoding))
|
|
>>> r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)
|
|
|
|
Incorrectly using the raw object
|
|
--------------------------------
|
|
|
|
Code that uses the raw IO object and does not correctly handle partial reads and
|
|
writes may be affected. This is particularly important for reads, where the
|
|
number of characters read will never exceed one-fourth of the number of bytes
|
|
allowed, as there is no feasible way to prevent input from encoding as much
|
|
longer utf-8 strings::
|
|
|
|
>>> raw_stdin = sys.stdin.buffer.raw
|
|
>>> data = raw_stdin.read(15)
|
|
abcdefghijklm
|
|
b'abc'
|
|
# data contains at most 3 characters, and never more than 12 bytes
|
|
# error, as "defghijklm\r\n" is passed to the interactive prompt
|
|
|
|
To correct this code, the buffered reader/writer should be used, or the caller
|
|
should continue reading until its buffer is full::
|
|
|
|
>>> # Fix 1: Use the buffered reader/writer
|
|
>>> stdin = sys.stdin.buffer
|
|
>>> data = stdin.read(15)
|
|
abcedfghijklm
|
|
b'abcdefghijklm\r\n'
|
|
|
|
>>> # Fix 2: Loop until enough bytes have been read
|
|
>>> raw_stdin = sys.stdin.buffer.raw
|
|
>>> b = b''
|
|
>>> while len(b) < 15:
|
|
... b += raw_stdin.read(15)
|
|
abcedfghijklm
|
|
b'abcdefghijklm\r\n'
|
|
|
|
Using the raw object with small buffers
|
|
---------------------------------------
|
|
|
|
Code that uses the raw IO object and attempts to read less than four characters
|
|
will now receive an error. Because it's possible that any single character may
|
|
require up to four bytes when represented in utf-8, requests must fail::
|
|
|
|
>>> raw_stdin = sys.stdin.buffer.raw
|
|
>>> data = raw_stdin.read(3)
|
|
Traceback (most recent call last):
|
|
File "<stdin>", line 1, in <module>
|
|
ValueError: must read at least 4 bytes
|
|
|
|
The only workaround is to pass a larger buffer::
|
|
|
|
>>> # Fix: Request at least four bytes
|
|
>>> raw_stdin = sys.stdin.buffer.raw
|
|
>>> data = raw_stdin.read(4)
|
|
a
|
|
b'a'
|
|
>>> >>>
|
|
|
|
(The extra ``>>>`` is due to the newline remaining in the input buffer and is
|
|
expected in this situation.)
|
|
|
|
Copyright
|
|
=========
|
|
|
|
This document has been placed in the public domain.
|
|
|
|
References
|
|
==========
|
|
|
|
.. _Twisted's process_stdinreader.py: https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py
|
|
.. _win_unicode_console package: https://pypi.org/project/win_unicode_console/
|