python-peps/pep-0528.txt

PEP: 528
Title: Change Windows console encoding to UTF-8
Version: $Revision$
Last-Modified: $Date$
Author: Steve Dower <steve.dower@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 27-Aug-2016
Post-History: 01-Sep-2016

Abstract
========

Historically, Python uses the ANSI APIs for interacting with the Windows
operating system, often via C Runtime functions. However, these have been long
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
the active code page.

This PEP proposes changing the default standard stream implementation on Windows
to use the Unicode APIs. This will allow users to print and input the full range
of Unicode characters at the default Windows console. This also requires a
subtle change to how the tokenizer parses text from readline hooks.

Specific Changes
================

Add _io.WindowsConsoleIO
------------------------

Currently an instance of ``_io.FileIO`` is used to wrap the file descriptors
representing standard input, output and error. We add a new class (implemented
in C) ``_io.WindowsConsoleIO`` that acts as a raw IO object using the Windows
console functions, specifically, ``ReadConsoleW`` and ``WriteConsoleW``.

This class will be used when the legacy-mode flag is not in effect, when opening
a standard stream by file descriptor and the stream is a console buffer rather
than a redirected file. Otherwise, ``_io.FileIO`` will be used as it is today.

This is a raw (bytes) IO class that requires text to be passed encoded with
utf-8, which will be decoded to utf-16-le and passed to the Windows APIs.
Similarly, bytes read from the class will be provided by the operating system as
utf-16-le and converted into utf-8 when returned to Python.

The use of an ASCII compatible encoding is required to maintain compatibility
with code that bypasses the ``TextIOWrapper`` and directly writes ASCII bytes to
the standard streams (for example, `Twisted's process_stdinreader.py`_). Code that assumes
a particular encoding for the standard streams other than ASCII will likely
break.

Add _PyOS_WindowsConsoleReadline
--------------------------------

To allow Unicode entry at the interactive prompt, a new readline hook is
required. The existing ``PyOS_StdioReadline`` function will delegate to the new
``_PyOS_WindowsConsoleReadline`` function when reading from a file descriptor
that is a console buffer and the legacy-mode flag is not in effect (the logic
should be identical to above).

Since the readline interface is required to return an 8-bit encoded string with
no embedded nulls, the ``_PyOS_WindowsConsoleReadline`` function transcodes from
utf-16-le as read from the operating system into utf-8.

The function ``PyRun_InteractiveOneObject`` which currently obtains the encoding
from ``sys.stdin`` will select utf-8 unless the legacy-mode flag is in effect.
This may require readline hooks to change their encodings to utf-8, or to
require legacy-mode for correct behaviour.

Add legacy mode
---------------

Launching Python with the environment variable ``PYTHONLEGACYWINDOWSSTDIO`` set
will enable the legacy-mode flag, which completely restores the previous
behaviour.

Alternative Approaches
======================

The `win_unicode_console package`_ is a pure-Python alternative to changing the
default behaviour of the console. It implements essentially the same
modifications as described here using pure Python code.

Code that may break
===================

The following code patterns may break or see different behaviour as a result of
this change. All of these code samples require explicitly choosing to use a raw
file object in place of a more convenient wrapper that would prevent any visible
change.

Assuming stdin/stdout encoding
------------------------------

Code that assumes that the encoding required by ``sys.stdin.buffer`` or
``sys.stdout.buffer`` is ``'mbcs'`` or a more specific encoding may currently be
working by chance, but could encounter issues under this change. For example::

    >>> sys.stdout.buffer.write(text.encode('mbcs'))
    >>> r = sys.stdin.buffer.read(16).decode('cp437')

To correct this code, the encoding specified on the ``TextIOWrapper`` should be
used, either implicitly or explicitly::

    >>> # Fix 1: Use wrapper correctly
    >>> sys.stdout.write(text)
    >>> r = sys.stdin.read(16)

    >>> # Fix 2: Use encoding explicitly
    >>> sys.stdout.buffer.write(text.encode(sys.stdout.encoding))
    >>> r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)

Incorrectly using the raw object
--------------------------------

Code that uses the raw IO object and does not correctly handle partial reads and
writes may be affected. This is particularly important for reads, where the
number of characters read will never exceed one-fourth of the number of bytes
allowed, as there is no feasible way to prevent input from encoding as much
longer utf-8 strings::

    >>> raw_stdin = sys.stdin.buffer.raw
    >>> data = raw_stdin.read(15)
    abcdefghijklm
    b'abc'
    # data contains at most 3 characters, and never more than 12 bytes
    # error, as "defghijklm\r\n" is passed to the interactive prompt

To correct this code, the buffered reader/writer should be used, or the caller
should continue reading until its buffer is full::

    >>> # Fix 1: Use the buffered reader/writer
    >>> stdin = sys.stdin.buffer
    >>> data = stdin.read(15)
    abcedfghijklm
    b'abcdefghijklm\r\n'

    >>> # Fix 2: Loop until enough bytes have been read
    >>> raw_stdin = sys.stdin.buffer.raw
    >>> b = b''
    >>> while len(b) < 15:
    ...     b += raw_stdin.read(15)
    abcedfghijklm
    b'abcdefghijklm\r\n'

Using the raw object with small buffers
---------------------------------------

Code that uses the raw IO object and attempts to read less than four characters
will now receive an error. Because it's possible that any single character may
require up to four bytes when represented in utf-8, requests must fail::

    >>> raw_stdin = sys.stdin.buffer.raw
    >>> data = raw_stdin.read(3)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: must read at least 4 bytes

The only workaround is to pass a larger buffer::

    >>> # Fix: Request at least four bytes
    >>> raw_stdin = sys.stdin.buffer.raw
    >>> data = raw_stdin.read(4)
    a
    b'a'
    >>> >>>

(The extra ``>>>`` is due to the newline remaining in the input buffer and is
expected in this situation.)

Copyright
=========

This document has been placed in the public domain.

References
==========

.. _Twisted's process_stdinreader.py: https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py
.. _win_unicode_console package: https://pypi.org/project/win_unicode_console/
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00			`PEP: 528`
			`Title: Change Windows console encoding to UTF-8`
			`Version: $Revision$`
			`Last-Modified: $Date$`
			`Author: Steve Dower <steve.dower@python.org>`
			`Status: Draft`
			`Type: Standards Track`
			`Content-Type: text/x-rst`
			`Created: 27-Aug-2016`
			`Post-History: 01-Sep-2016`

			`Abstract`
			`========`

			`Historically, Python uses the ANSI APIs for interacting with the Windows`
			`operating system, often via C Runtime functions. However, these have been long`
			`discouraged in favor of the UTF-16 APIs. Within the operating system, all text`
			`is represented as UTF-16, and the ANSI APIs perform encoding and decoding using`
			`the active code page.`

			`This PEP proposes changing the default standard stream implementation on Windows`
			`to use the Unicode APIs. This will allow users to print and input the full range`
			`of Unicode characters at the default Windows console. This also requires a`
PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00			`subtle change to how the tokenizer parses text from readline hooks.`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00
			`Specific Changes`
			`================`

			`Add _io.WindowsConsoleIO`
			`------------------------`

			Currently an instance of ``_io.FileIO`` is used to wrap the file descriptors
			`representing standard input, output and error. We add a new class (implemented`
			in C) ``_io.WindowsConsoleIO`` that acts as a raw IO object using the Windows
			console functions, specifically, ``ReadConsoleW`` and ``WriteConsoleW``.

			`This class will be used when the legacy-mode flag is not in effect, when opening`
			`a standard stream by file descriptor and the stream is a console buffer rather`
			than a redirected file. Otherwise, ``_io.FileIO`` will be used as it is today.

			`This is a raw (bytes) IO class that requires text to be passed encoded with`
			`utf-8, which will be decoded to utf-16-le and passed to the Windows APIs.`
			`Similarly, bytes read from the class will be provided by the operating system as`
			`utf-16-le and converted into utf-8 when returned to Python.`

			`The use of an ASCII compatible encoding is required to maintain compatibility`
			with code that bypasses the ``TextIOWrapper`` and directly writes ASCII bytes to
PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00			the standard streams (for example, `Twisted's process_stdinreader.py`_). Code that assumes
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00			`a particular encoding for the standard streams other than ASCII will likely`
			`break.`

			`Add _PyOS_WindowsConsoleReadline`
			`--------------------------------`

			`To allow Unicode entry at the interactive prompt, a new readline hook is`
			required. The existing ``PyOS_StdioReadline`` function will delegate to the new
			``_PyOS_WindowsConsoleReadline`` function when reading from a file descriptor
			`that is a console buffer and the legacy-mode flag is not in effect (the logic`
			`should be identical to above).`

			`Since the readline interface is required to return an 8-bit encoded string with`
			no embedded nulls, the ``_PyOS_WindowsConsoleReadline`` function transcodes from
			`utf-16-le as read from the operating system into utf-8.`

			The function ``PyRun_InteractiveOneObject`` which currently obtains the encoding
			from ``sys.stdin`` will select utf-8 unless the legacy-mode flag is in effect.
			`This may require readline hooks to change their encodings to utf-8, or to`
			`require legacy-mode for correct behaviour.`

			`Add legacy mode`
			`---------------`

			Launching Python with the environment variable ``PYTHONLEGACYWINDOWSSTDIO`` set
			`will enable the legacy-mode flag, which completely restores the previous`
			`behaviour.`

			`Alternative Approaches`
			`======================`

PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00			The `win_unicode_console package`_ is a pure-Python alternative to changing the
			`default behaviour of the console. It implements essentially the same`
			`modifications as described here using pure Python code.`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00
			`Code that may break`
			`===================`

			`The following code patterns may break or see different behaviour as a result of`
			`this change. All of these code samples require explicitly choosing to use a raw`
			`file object in place of a more convenient wrapper that would prevent any visible`
			`change.`

			`Assuming stdin/stdout encoding`
			`------------------------------`

			Code that assumes that the encoding required by ``sys.stdin.buffer`` or
			``sys.stdout.buffer`` is ``'mbcs'`` or a more specific encoding may currently be
PEP 528, 529: Formatting fixes 2016-09-06 14:03:28 -04:00			`working by chance, but could encounter issues under this change. For example::`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00
PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00			`>>> sys.stdout.buffer.write(text.encode('mbcs'))`
			`>>> r = sys.stdin.buffer.read(16).decode('cp437')`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00
			To correct this code, the encoding specified on the ``TextIOWrapper`` should be
PEP 528, 529: Formatting fixes 2016-09-06 14:03:28 -04:00			`used, either implicitly or explicitly::`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00
PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00			`>>> # Fix 1: Use wrapper correctly`
			`>>> sys.stdout.write(text)`
			`>>> r = sys.stdin.read(16)`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00
PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00			`>>> # Fix 2: Use encoding explicitly`
			`>>> sys.stdout.buffer.write(text.encode(sys.stdout.encoding))`
			`>>> r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00
			`Incorrectly using the raw object`
			`--------------------------------`

			`Code that uses the raw IO object and does not correctly handle partial reads and`
			`writes may be affected. This is particularly important for reads, where the`
			`number of characters read will never exceed one-fourth of the number of bytes`
			`allowed, as there is no feasible way to prevent input from encoding as much`
PEP 528, 529: Formatting fixes 2016-09-06 14:03:28 -04:00			`longer utf-8 strings::`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00
PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00			`>>> raw_stdin = sys.stdin.buffer.raw`
			`>>> data = raw_stdin.read(15)`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00			`abcdefghijklm`
			`b'abc'`
			`# data contains at most 3 characters, and never more than 12 bytes`
			`# error, as "defghijklm\r\n" is passed to the interactive prompt`

			`To correct this code, the buffered reader/writer should be used, or the caller`
PEP 528, 529: Formatting fixes 2016-09-06 14:03:28 -04:00			`should continue reading until its buffer is full::`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00
PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00			`>>> # Fix 1: Use the buffered reader/writer`
			`>>> stdin = sys.stdin.buffer`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00			`>>> data = stdin.read(15)`
			`abcedfghijklm`
			`b'abcdefghijklm\r\n'`

PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00			`>>> # Fix 2: Loop until enough bytes have been read`
			`>>> raw_stdin = sys.stdin.buffer.raw`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00			`>>> b = b''`
			`>>> while len(b) < 15:`
PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00			`... b += raw_stdin.read(15)`
Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00			`abcedfghijklm`
			`b'abcdefghijklm\r\n'`

PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00			`Using the raw object with small buffers`
			`---------------------------------------`

			`Code that uses the raw IO object and attempts to read less than four characters`
			`will now receive an error. Because it's possible that any single character may`
PEP 528, 529: Formatting fixes 2016-09-06 14:03:28 -04:00			`require up to four bytes when represented in utf-8, requests must fail::`
PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00
			`>>> raw_stdin = sys.stdin.buffer.raw`
			`>>> data = raw_stdin.read(3)`
			`Traceback (most recent call last):`
			`File "<stdin>", line 1, in <module>`
			`ValueError: must read at least 4 bytes`

PEP 528, 529: Formatting fixes 2016-09-06 14:03:28 -04:00			`The only workaround is to pass a larger buffer::`
PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00
			`>>> # Fix: Request at least four bytes`
			`>>> raw_stdin = sys.stdin.buffer.raw`
			`>>> data = raw_stdin.read(4)`
			`a`
			`b'a'`
			`>>> >>>`

			(The extra ``>>>`` is due to the newline remaining in the input buffer and is
			`expected in this situation.)`

Add PEP 528 and PEP 529 drafts. 2016-09-01 18:26:27 -04:00			`Copyright`
			`=========`

			`This document has been placed in the public domain.`

			`References`
			`==========`

PEP 528: Readability and style updates 2016-09-05 01:43:56 -04:00			`.. _Twisted's process_stdinreader.py: https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py`
			`.. _win_unicode_console package: https://pypi.org/project/win_unicode_console/`