Add PEP 528 and PEP 529 drafts.
commit 846dfcfab8 (parent fa600a0819)
@ -0,0 +1,157 @@
PEP: 528
Title: Change Windows console encoding to UTF-8
Version: $Revision$
Last-Modified: $Date$
Author: Steve Dower <steve.dower@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 27-Aug-2016
Post-History: 01-Sep-2016

Abstract
========

Historically, Python uses the ANSI APIs for interacting with the Windows
operating system, often via C Runtime functions. However, these have long been
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
the active code page.

This PEP proposes changing the default standard stream implementation on Windows
to use the Unicode APIs. This will allow users to print and input the full range
of Unicode characters at the default Windows console. This also requires a
subtle change to how the tokenizer parses text from readline hooks, which should
have no backwards compatibility issues.

Specific Changes
================

Add _io.WindowsConsoleIO
------------------------

Currently an instance of ``_io.FileIO`` is used to wrap the file descriptors
representing standard input, output and error. We add a new class (implemented
in C) ``_io.WindowsConsoleIO`` that acts as a raw IO object using the Windows
console functions, specifically, ``ReadConsoleW`` and ``WriteConsoleW``.

This class will be used when the legacy-mode flag is not in effect, a standard
stream is being opened by file descriptor, and the stream is a console buffer
rather than a redirected file. Otherwise, ``_io.FileIO`` will be used as it is
today.

This is a raw (bytes) IO class that requires text to be passed encoded with
utf-8, which will be transcoded to utf-16-le and passed to the Windows APIs.
Similarly, text read through the class will be provided by the operating system
as utf-16-le and transcoded into utf-8 before being returned to Python.
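
The transcoding described above can be sketched in pure Python. This is only an
illustration: it assumes the standard codecs behave like the proposed C
implementation, and the helper names ``to_console`` and ``from_console`` are
hypothetical, not part of this proposal.

```python
def to_console(data: bytes) -> bytes:
    """Transcode utf-8 bytes written by Python into utf-16-le for the console."""
    return data.decode('utf-8').encode('utf-16-le')


def from_console(data: bytes) -> bytes:
    """Transcode utf-16-le text read from the console into utf-8 bytes."""
    return data.decode('utf-16-le').encode('utf-8')


written = 'caf\u00e9'.encode('utf-8')
wide = to_console(written)

# The two directions are inverses of each other for valid text.
assert from_console(wide) == written

# Pure ASCII text is byte-for-byte identical in utf-8, which is why code that
# writes raw ASCII bytes to the standard streams keeps working.
assert 'abc'.encode('utf-8') == b'abc'
```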

The use of an ASCII-compatible encoding is required to maintain compatibility
with code that bypasses the ``TextIOWrapper`` and directly writes ASCII bytes to
the standard streams (for example, [process_stdinreader.py]_). Code that assumes
a particular encoding for the standard streams other than ASCII will likely
break.

Add _PyOS_WindowsConsoleReadline
--------------------------------

To allow Unicode entry at the interactive prompt, a new readline hook is
required. The existing ``PyOS_StdioReadline`` function will delegate to the new
``_PyOS_WindowsConsoleReadline`` function when reading from a file descriptor
that is a console buffer and the legacy-mode flag is not in effect (the logic
should be identical to that described above for ``_io.WindowsConsoleIO``).

Since the readline interface is required to return an 8-bit encoded string with
no embedded nulls, the ``_PyOS_WindowsConsoleReadline`` function transcodes from
utf-16-le as read from the operating system into utf-8.

The function ``PyRun_InteractiveOneObject``, which currently obtains the
encoding from ``sys.stdin``, will select utf-8 unless the legacy-mode flag is in
effect. This may require readline hooks to change their encodings to utf-8, or
to require legacy-mode for correct behaviour.

Add legacy mode
---------------

Launching Python with the environment variable ``PYTHONLEGACYWINDOWSSTDIO`` set
will enable the legacy-mode flag, which completely restores the previous
behaviour.

Alternative Approaches
======================

The ``win_unicode_console`` package [win_unicode_console]_ is a pure-Python
alternative to changing the default behaviour of the console.

Code that may break
===================

The following code patterns may break or see different behaviour as a result of
this change. All of these code samples require explicitly choosing to use a raw
file object in place of a more convenient wrapper that would prevent any visible
change.

Assuming stdin/stdout encoding
------------------------------

Code that assumes that the encoding required by ``sys.stdin.buffer`` or
``sys.stdout.buffer`` is ``'mbcs'`` or a more specific encoding may currently be
working by chance, but could encounter issues under this change. For example::

    sys.stdout.buffer.write(text.encode('mbcs'))
    r = sys.stdin.buffer.read(16).decode('cp437')

To correct this code, the encoding specified on the ``TextIOWrapper`` should be
used, either implicitly or explicitly::

    # Fix 1: Use wrapper correctly
    sys.stdout.write(text)
    r = sys.stdin.read(16)

    # Fix 2: Use encoding explicitly
    sys.stdout.buffer.write(text.encode(sys.stdout.encoding))
    r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)
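
Both fixes can be tried without a console. The sketch below uses an in-memory
``io.BytesIO`` wrapped in a ``TextIOWrapper`` as a stand-in for a standard
stream; the choice of utf-8 for the wrapper is an assumption made for the
example.

```python
import io

# In-memory stand-in for a standard stream (assumption: utf-8 wrapper).
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding='utf-8')

# Fix 1: let the wrapper perform the encoding.
stream.write('caf\u00e9 ')
stream.flush()  # flush before writing to the underlying buffer directly

# Fix 2: encode explicitly, using the encoding the wrapper reports.
stream.buffer.write('na\u00efve'.encode(stream.encoding))
stream.buffer.flush()

result = raw.getvalue()
```

Either way the bytes that reach the underlying object match the wrapper's
encoding, so the code does not depend on which encoding is in effect.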

Incorrectly using the raw object
--------------------------------

Code that uses the raw IO object and does not correctly handle partial reads and
writes may be affected. This is particularly important for reads, where the
number of characters read will never exceed one-fourth of the number of bytes
allowed, as there is no feasible way to prevent input from encoding as much
longer utf-8 strings::

    >>> stdin = open(sys.stdin.fileno(), 'rb')
    >>> data = stdin.raw.read(15)
    abcdefghijklm
    b'abc'
    # data contains at most 3 characters, and never more than 12 bytes
    # error, as "defghijklm\r\n" is passed to the interactive prompt

To correct this code, the buffered reader/writer should be used, or the caller
should continue reading until its buffer is full::

    # Fix 1: Use the buffered reader/writer
    >>> stdin = open(sys.stdin.fileno(), 'rb')
    >>> data = stdin.read(15)
    abcdefghijklm
    b'abcdefghijklm\r\n'

    # Fix 2: Loop until enough bytes have been read
    >>> stdin = open(sys.stdin.fileno(), 'rb')
    >>> b = b''
    >>> while len(b) < 15:
    ...     b += stdin.raw.read(15)
    abcdefghijklm
    b'abcdefghijklm\r\n'
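
The looping fix generalises to any raw stream that returns short reads. The
following sketch is not console-specific: ``ShortReads`` is a hypothetical
in-memory stream that returns at most three bytes per call, and ``read_exactly``
is a hypothetical helper. Unlike the loop above, it requests only the bytes
still missing; given the one-fourth limit described earlier, a real console
stream may need a request of at least four bytes to return a character, so treat
this as an illustration of the pattern only.

```python
import io


class ShortReads(io.RawIOBase):
    """Hypothetical raw stream that returns at most 3 bytes per read."""

    def __init__(self, payload):
        self._source = io.BytesIO(payload)

    def readable(self):
        return True

    def readinto(self, buffer):
        chunk = self._source.read(min(len(buffer), 3))
        buffer[:len(chunk)] = chunk
        return len(chunk)


def read_exactly(raw, size):
    """Keep reading until size bytes arrive or the stream is exhausted."""
    parts = []
    remaining = size
    while remaining > 0:
        chunk = raw.read(remaining)
        if not chunk:  # b'' means end of stream; None means no data available
            break
        parts.append(chunk)
        remaining -= len(chunk)
    return b''.join(parts)


data = read_exactly(ShortReads(b'abcdefghijklm\r\n'), 15)
```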

Copyright
=========

This document has been placed in the public domain.

References
==========

.. [process_stdinreader.py] Twisted's process_stdinreader.py
   (https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py)

.. [win_unicode_console] win_unicode_console package
   (https://pypi.org/project/win_unicode_console/)
@ -0,0 +1,293 @@
PEP: 529
Title: Change Windows filesystem encoding to UTF-8
Version: $Revision$
Last-Modified: $Date$
Author: Steve Dower <steve.dower@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 27-Aug-2016
Post-History: 01-Sep-2016

Abstract
========

Historically, Python uses the ANSI APIs for interacting with the Windows
operating system, often via C Runtime functions. However, these have long been
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
the active code page.

This PEP proposes changing the default filesystem encoding on Windows to utf-8,
and changing all filesystem functions to use the Unicode APIs for filesystem
paths. This will not affect code that uses strings to represent paths; however,
code that uses bytes for paths will now be able to correctly round-trip all
valid paths in Windows filesystems. Currently, the conversions between Unicode
(in the OS) and bytes (in Python) are lossy and fail to round-trip characters
outside of the user's active code page.

Notably, this does not impact the encoding of the contents of files. These will
continue to default to ``locale.getpreferredencoding()`` (for text files) or
plain bytes (for binary files). This only affects the encoding used when users
pass a bytes object to Python where it is then passed to the operating system as
a path name.

Background
==========

File system paths are almost universally represented as text with an encoding
determined by the file system. In Python, we expose these paths via a number of
interfaces, such as the ``os`` and ``io`` modules. Paths may be passed in either
direction across these interfaces, that is, from the filesystem to the
application (for example, ``os.listdir()``), or from the application to the
filesystem (for example, ``os.unlink()``).

When paths are passed between the filesystem and the application, they are
either passed through as a bytes blob or converted to/from str using
``os.fsencode()`` or ``sys.getfilesystemencoding()``. The result of encoding a
string with ``sys.getfilesystemencoding()`` is a blob of bytes in the native
format for the default file system.

On Windows, the native format for the filesystem is utf-16-le. The recommended
platform APIs for accessing the filesystem all accept and return text encoded in
this format. However, prior to Windows NT (and possibly further back), the
native format was a configurable machine option and a separate set of APIs
existed to accept this format. The option (the "active code page") and these
APIs (the ``*A`` functions) still exist in recent versions of Windows for
backwards compatibility, though new functionality often only has a utf-16-le API
(the ``*W`` functions).

In Python, str is recommended because it can correctly round-trip all characters
used in paths (on POSIX with surrogateescape handling; on Windows because str
maps to the native representation). On Windows bytes cannot round-trip all
characters used in paths, as Python internally uses the ``*A`` functions and
hence the encoding is "whatever the active code page is". Since the active code
page cannot represent all Unicode characters, the conversion of a path into
bytes can lose information without warning or any available indication.

As a demonstration of this::

    >>> open('test\uAB00.txt', 'wb').close()
    >>> import glob
    >>> glob.glob('test*')
    ['test\uab00.txt']
    >>> glob.glob(b'test*')
    [b'test?.txt']

The Unicode character in the result of the second call to glob has been replaced
by a '?', which means passing the path back into the filesystem will result in a
``FileNotFoundError``. The same results may be observed with ``os.listdir()`` or
any function that matches the return type to the parameter type.
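
The same loss can be reproduced on any platform with the codecs alone. In this
sketch ``cp1252`` is only an assumed stand-in for the user's active code page,
and the ``'replace'`` error handler mimics the substitution shown in the glob
output above.

```python
name = 'test\uab00.txt'

# Assumption: cp1252 plays the role of the active code page.
lossy = name.encode('cp1252', errors='replace')
print(lossy)  # b'test?.txt'

# Decoding the bytes again cannot recover the original character.
roundtrip = lossy.decode('cp1252')
print(roundtrip == name)  # False
```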

While one user-accessible fix is to use str everywhere, POSIX systems generally
do not suffer from data loss when using bytes exclusively as the bytes are the
canonical representation. Even if the encoding is "incorrect" by some standard,
the file system will still map the bytes back to the file. Making use of this
avoids the cost of decoding and reencoding, such that (theoretically, and only
on POSIX), code such as this may be faster because of the use of ``b'.'``
compared to using ``'.'``::

    >>> for f in os.listdir(b'.'):
    ...     os.stat(f)
    ...

As a result, POSIX-focused library authors prefer to use bytes to represent
paths. For some authors it is also a convenience, as their code may receive
bytes already known to be encoded correctly, while others are attempting to
simplify porting their code from Python 2. However, the correctness assumptions
do not carry over to Windows where Unicode is the canonical representation, and
errors may result. This potential data loss is why the use of bytes paths on
Windows was deprecated in Python 3.3 - all of the above code snippets produce
deprecation warnings on Windows.

Proposal
========

Currently the default filesystem encoding is 'mbcs', which is a meta-encoder
that uses the active code page. However, when bytes are passed to the filesystem
they go through the ``*A`` APIs and the operating system handles encoding. In
this case, paths are always encoded using the equivalent of 'mbcs:replace' - we
have no ability to change this (though there is a user/machine configuration
option to change the encoding from CP_ACP to CP_OEM, so it won't necessarily
always match mbcs).

This proposal would remove all use of the ``*A`` APIs and only ever call the
``*W`` APIs. When Windows returns paths to Python as str, they will be decoded
from utf-16-le and returned as text (in whatever the minimal representation is).
When Windows returns paths to Python as bytes, they will be transcoded from
utf-16-le to utf-8 using surrogatepass (Windows does not validate surrogate
pairs, so it is possible to have invalid surrogates in filenames). Equally, when
paths are provided as bytes, they are transcoded from utf-8 into utf-16-le and
passed to the ``*W`` APIs.
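
The effect of surrogatepass can be seen with the standard codecs. The sketch
below assumes the Python-level codecs mirror what the proposed C code would do;
the byte string is a made-up name containing a lone low surrogate.

```python
# utf-16-le bytes for 'a', a lone low surrogate U+DC80, and 'b'.
wide = b'a\x00\x80\xdcb\x00'

# Operating system -> Python str -> utf-8 bytes handed to the caller.
text = wide.decode('utf-16-le', 'surrogatepass')
narrow = text.encode('utf-8', 'surrogatepass')

# utf-8 bytes from the caller -> utf-16-le for the operating system.
restored = narrow.decode('utf-8', 'surrogatepass').encode(
    'utf-16-le', 'surrogatepass')

# The default strict error mode rejects the lone surrogate.
try:
    text.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

With surrogatepass the bytes survive the full round trip unchanged, whereas the
strict mode refuses to encode the name at all.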

The use of utf-8 will not be configurable, with the possible exception of a
"legacy mode" environment variable or X-flag.

surrogateescape does not apply here, as the concern is not about retaining
nonsensical bytes. Any path returned from the operating system will be valid
Unicode, while bytes paths created by the user may raise a decoding error
(currently these would raise ``OSError`` or a subclass).

The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure the
ability to round-trip without breaking the functionality of the ``os.path``
module, which assumes an ASCII-compatible encoding. Using utf-16-le as the
encoding is more pure, but will cause more issues than are resolved.
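
A small experiment shows why an ASCII-compatible encoding matters for
bytes-based path handling. ``ntpath`` is used so that the sketch behaves the
same on every platform; the file name is made up, and U+5C00 is chosen because
its utf-16-le encoding contains the byte 0x5C, the backslash.

```python
import ntpath

name = 'dir\\\u5c00.txt'  # U+5C00 is b'\x00\x5c' in utf-16-le
utf8_path = name.encode('utf-8')
utf16_path = name.encode('utf-16-le')

# utf-8 never reuses ASCII byte values inside a multi-byte sequence, so a
# bytes-based split on the separator finds the real file name.
utf8_base = ntpath.basename(utf8_path)

# In utf-16-le the byte 0x5C also occurs inside U+5C00, so the same split
# cuts the character in half and returns a damaged name.
utf16_base = ntpath.basename(utf16_path)

print(utf8_base.decode('utf-8') == '\u5c00.txt')         # True
print(utf16_base == '\u5c00.txt'.encode('utf-16-le'))    # False
```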

This change would also undeprecate the use of bytes paths on Windows. No change
to the semantics of using bytes as a path is required - as before, they must be
encoded with the encoding specified by ``sys.getfilesystemencoding()``.

Specific Changes
================

Update sys.getfilesystemencoding
--------------------------------

Remove the default value for ``Py_FileSystemDefaultEncoding`` and set it in
``initfsencoding()`` to utf-8, or, if the legacy-mode switch is enabled, to
mbcs.

Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize`` and
``PyUnicode_EncodeFSDefault`` to use the standard utf-8 codec with surrogatepass
error mode, or, if the legacy-mode switch is enabled, the code page codec with
replace error mode.

Update path_converter
---------------------

Update the path converter to always decode bytes or buffer objects into text
using ``PyUnicode_DecodeFSDefaultAndSize``.

Change the ``narrow`` field from a ``char*`` string into a flag that indicates
whether the original object was bytes. This is required for functions that need
to return paths using the same type as was originally provided.
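
A Python-level analogue of this flag is sketched below. ``with_suffix`` is a
hypothetical helper written for this illustration, not an existing API; it
decodes bytes with ``os.fsdecode()``, works on text internally, and uses a
boolean to decide which type to hand back.

```python
import os


def with_suffix(path, suffix):
    """Hypothetical helper: append a suffix and return the caller's type."""
    was_bytes = isinstance(path, bytes)  # plays the role of the flag
    text = os.fsdecode(path)             # always work on text internally
    result = text + suffix
    return os.fsencode(result) if was_bytes else result


print(with_suffix('report', '.bak'))   # report.bak
print(with_suffix(b'report', '.bak'))  # b'report.bak'
```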

Remove unused ANSI code
-----------------------

Remove all code paths using the ``narrow`` field, as these will no longer be
reachable by any caller. These are only used within ``posixmodule.c``. Other
uses of paths should have use of bytes paths replaced with decoding and use of
the ``*W`` APIs.

Add legacy mode
---------------

Add a legacy mode flag, enabled by the environment variable
``PYTHONLEGACYWINDOWSFSENCODING``. When this flag is set, the default filesystem
encoding is set to mbcs rather than utf-8, and the error mode is set to
'replace' rather than 'surrogatepass'. The ``path_converter`` will continue to
decode to wide characters and only ``*W`` APIs will be called; however, the
bytes passed in and received from Python will be encoded the same as prior to
this change.
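
One way to observe the flag is to launch a child interpreter with and without
the variable set. This sketch assumes an interpreter that implements this
proposal on Windows; elsewhere the variable is expected to have no effect and
both calls report the same encoding. ``child_fs_encoding`` is a hypothetical
helper name.

```python
import os
import subprocess
import sys


def child_fs_encoding(legacy):
    """Report sys.getfilesystemencoding() from a child interpreter."""
    env = dict(os.environ)
    env.pop('PYTHONLEGACYWINDOWSFSENCODING', None)
    if legacy:
        env['PYTHONLEGACYWINDOWSFSENCODING'] = '1'
    out = subprocess.check_output(
        [sys.executable, '-c',
         'import sys; print(sys.getfilesystemencoding())'],
        env=env)
    return out.decode('ascii').strip()


default_encoding = child_fs_encoding(legacy=False)
legacy_encoding = child_fs_encoding(legacy=True)
```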

Undeprecate bytes paths on Windows
----------------------------------

Using bytes as paths on Windows is currently deprecated. We would announce that
this is no longer the case, and that paths when encoded as bytes should use
whatever is returned from ``sys.getfilesystemencoding()`` rather than the user's
active code page.

Rejected Alternatives
=====================

Use strict mbcs decoding
------------------------

This is essentially the same as the proposed change, but instead of changing
``sys.getfilesystemencoding()`` to utf-8 it is changed to mbcs (which
dynamically maps to the active code page).

This approach allows the use of new functionality that is only available as
``*W`` APIs and also detection of encoding/decoding errors. For example, rather
than silently replacing Unicode characters with '?', it would be possible to
warn or fail the operation.
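
The difference between the two error modes can be illustrated with the codecs
directly. As before, ``cp1252`` is only an assumed stand-in for the active code
page.

```python
name = 'test\uab00.txt'

# 'replace' silently substitutes the character that cannot be encoded.
replaced = name.encode('cp1252', errors='replace')

# 'strict' (the default) raises, so the failure can be detected and reported.
try:
    name.encode('cp1252')
    failed = False
except UnicodeEncodeError:
    failed = True
```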

Compared to the proposed fix, this could enable some new functionality but does
not fix any of the problems described initially. New runtime errors may cause
some problems to be more obvious and lead to fixes, provided library maintainers
are interested in supporting Windows and adding a separate code path to treat
filesystem paths as strings.

Making the encoding mbcs without strict errors is equivalent to the legacy-mode
switch being enabled by default. This is a possible course of action if there is
significant breakage of actual code and a need to extend the deprecation period,
but still a desire to have the simplifications to the CPython source.

Make bytes paths an error on Windows
------------------------------------

By preventing the use of bytes paths on Windows completely we prevent users from
hitting encoding issues.

However, the motivation for this PEP is to increase the likelihood that code
written on POSIX will also work correctly on Windows. This alternative would
move in the other direction and make such code completely incompatible. As this
does not benefit users in any way, we reject it.

Make bytes paths an error on all platforms
------------------------------------------

By deprecating and then disabling the use of bytes paths on all platforms we
prevent users from hitting encoding issues regardless of where the code was
originally written. This would require a full deprecation cycle, as there are
currently no warnings on platforms other than Windows.

This is likely to be seen as a hostile action against Python developers in
general, and as such is rejected at this time.

Code that may break
===================

The following code patterns may break or see different behaviour as a result of
this change.

Note that all of these examples produce deprecation warnings on Python 3.3 and
later.

Not managing encodings across boundaries
----------------------------------------

Code that does not manage encodings when crossing protocol boundaries may
currently be working by chance, but could encounter issues when either encoding
changes. For example::

    filename = open('filename_in_mbcs.txt', 'rb').read()
    text = open(filename, 'r').read()

To correct this code, the encoding of the bytes in ``filename`` should be
specified, either when reading from the file or before using the value::

    # Fix 1: Open file as text
    filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read()
    text = open(filename, 'r').read()

    # Fix 2: Decode path
    filename = open('filename_in_mbcs.txt', 'rb').read()
    text = open(filename.decode('mbcs'), 'r').read()

Explicitly using 'mbcs'
-----------------------

Code that explicitly encodes text using 'mbcs' before passing it to file system
APIs may break, as the resulting bytes will no longer match the filesystem
encoding. For example::

    # rstrip('\n') drops the line ending that readline() keeps
    filename = open('files.txt', 'r').readline().rstrip('\n')
    text = open(filename.encode('mbcs'), 'r')

To correct this code, the string should be passed without explicit encoding, or
should use ``os.fsencode()``::

    # Fix 1: Do not encode the string
    filename = open('files.txt', 'r').readline().rstrip('\n')
    text = open(filename, 'r')

    # Fix 2: Use correct encoding
    filename = open('files.txt', 'r').readline().rstrip('\n')
    text = open(os.fsencode(filename), 'r')

Copyright
=========

This document has been placed in the public domain.