Add PEP 528 and PEP 529 drafts.

2016-09-01 15:26:27 -07:00 · 2016-09-01 15:26:27 -07:00 · 846dfcfab8
parent fa600a0819
commit 846dfcfab8
2 changed files with 450 additions and 0 deletions
--- a/pep-0528.txt
+++ b/pep-0528.txt
@ -0,0 +1,157 @@
+PEP: 528
+Title: Change Windows console encoding to UTF-8
+Version: $Revision$
+Last-Modified: $Date$
+Author: Steve Dower <steve.dower@python.org>
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 27-Aug-2016
+Post-History: 01-Sep-2016
+
+Abstract
+========
+
+Historically, Python uses the ANSI APIs for interacting with the Windows
+operating system, often via C Runtime functions. However, these have long been
+discouraged in favor of the UTF-16 APIs. Within the operating system, all text
+is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
+the active code page.
+
+This PEP proposes changing the default standard stream implementation on Windows
+to use the Unicode APIs. This will allow users to print and input the full range
+of Unicode characters at the default Windows console. This also requires a
+subtle change to how the tokenizer parses text from readline hooks, which should
+have no backwards compatibility issues.
+
+Specific Changes
+================
+
+Add _io.WindowsConsoleIO
+------------------------
+
+Currently an instance of ``_io.FileIO`` is used to wrap the file descriptors
+representing standard input, output and error. We add a new class (implemented
+in C) ``_io.WindowsConsoleIO`` that acts as a raw IO object using the Windows
+console functions, specifically, ``ReadConsoleW`` and ``WriteConsoleW``.
+
+This class will be used when the legacy-mode flag is not in effect, when opening
+a standard stream by file descriptor and the stream is a console buffer rather
+than a redirected file. Otherwise, ``_io.FileIO`` will be used as it is today.
+
+This is a raw (bytes) IO class that requires text to be passed encoded with
+utf-8, which will be transcoded to utf-16-le and passed to the Windows APIs.
+Similarly, bytes read from the class will be provided by the operating system as
+utf-16-le and converted into utf-8 when returned to Python.
+
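The byte-level conversion described above can be sketched in pure Python. This is only an illustration of the transcoding step, not the C implementation of ``_io.WindowsConsoleIO``; the helper names ``utf8_to_utf16le`` and ``utf16le_to_utf8`` are hypothetical.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Convert UTF-8 bytes (as written by Python code) to UTF-16-LE."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16-LE bytes (as provided by the console) to UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


# Round-trip a string containing a non-ASCII and a non-BMP character.
text = "caf\u00e9 \U0001F600"
written = utf8_to_utf16le(text.encode("utf-8"))
read_back = utf16le_to_utf8(written)
assert read_back.decode("utf-8") == text
```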
+The use of an ASCII compatible encoding is required to maintain compatibility
+with code that bypasses the ``TextIOWrapper`` and directly writes ASCII bytes to
+the standard streams (for example, [process_stdinreader.py]_). Code that assumes
+a particular encoding for the standard streams other than ASCII will likely
+break.
+
+Add _PyOS_WindowsConsoleReadline
+--------------------------------
+
+To allow Unicode entry at the interactive prompt, a new readline hook is
+required. The existing ``PyOS_StdioReadline`` function will delegate to the new
+``_PyOS_WindowsConsoleReadline`` function when reading from a file descriptor
+that is a console buffer and the legacy-mode flag is not in effect (the logic
+should be identical to above).
+
+Since the readline interface is required to return an 8-bit encoded string with
+no embedded nulls, the ``_PyOS_WindowsConsoleReadline`` function transcodes from
+utf-16-le as read from the operating system into utf-8.
+
+The function ``PyRun_InteractiveOneObject`` which currently obtains the encoding
+from ``sys.stdin`` will select utf-8 unless the legacy-mode flag is in effect.
+This may require readline hooks to change their encodings to utf-8, or to
+require legacy-mode for correct behaviour.
+
+Add legacy mode
+---------------
+
+Launching Python with the environment variable ``PYTHONLEGACYWINDOWSSTDIO`` set
+will enable the legacy-mode flag, which completely restores the previous
+behaviour.
+
+Alternative Approaches
+======================
+
+The ``win_unicode_console`` package [win_unicode_console]_ is a pure-Python
+alternative to changing the default behaviour of the console.
+
+Code that may break
+===================
+
+The following code patterns may break or see different behaviour as a result of
+this change. All of these code samples require explicitly choosing to use a raw
+file object in place of a more convenient wrapper that would prevent any visible
+change.
+
+Assuming stdin/stdout encoding
+------------------------------
+
+Code that assumes that the encoding required by ``sys.stdin.buffer`` or
+``sys.stdout.buffer`` is ``'mbcs'`` or a more specific encoding may currently be
+working by chance, but could encounter issues under this change. For example::
+
+    sys.stdout.buffer.write(text.encode('mbcs'))
+    r = sys.stdin.buffer.read(16).decode('cp437')
+
+To correct this code, the encoding specified on the ``TextIOWrapper`` should be
+used, either implicitly or explicitly::
+
+    # Fix 1: Use wrapper correctly
+    sys.stdout.write(text)
+    r = sys.stdin.read(16)
+
+    # Fix 2: Use encoding explicitly
+    sys.stdout.buffer.write(text.encode(sys.stdout.encoding))
+    r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)
+
+Incorrectly using the raw object
+--------------------------------
+
+Code that uses the raw IO object and does not correctly handle partial reads and
+writes may be affected. This is particularly important for reads, where the
+number of characters read will never exceed one-fourth of the number of bytes
+allowed, as there is no feasible way to prevent input from encoding as much
+longer utf-8 strings::
+
+    >>> import sys
+    >>> stdin = open(sys.stdin.fileno(), 'rb')
+    >>> data = stdin.raw.read(15)
+    abcdefghijklm
+    b'abc'
+    # data contains at most 3 characters, and never more than 12 bytes
+    # error, as "defghijklm\r\n" is passed to the interactive prompt
+
+To correct this code, the buffered reader/writer should be used, or the caller
+should continue reading until its buffer is full::
+
+    # Fix 1: Use the buffered reader/writer
+    >>> stdin = open(sys.stdin.fileno(), 'rb')
+    >>> data = stdin.read(15)
+    abcdefghijklm
+    b'abcdefghijklm\r\n'
+
+    # Fix 2: Loop until enough bytes have been read
+    >>> stdin = open(sys.stdin.fileno(), 'rb')
+    >>> b = b''
+    >>> while len(b) < 15:
+    ...     b += stdin.raw.read(15)
+    ...
+    abcdefghijklm
+    b'abcdefghijklm\r\n'
+
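The looping approach can be sketched as a reusable helper. This is a minimal illustration, not code from the proposal: ``read_exactly`` and ``ShortReader`` are hypothetical names, and ``ShortReader`` only simulates a raw stream that returns short reads so the example runs without a console.

```python
import io


def read_exactly(raw, size):
    """Read up to `size` bytes from a raw stream, tolerating short reads."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = raw.read(remaining)
        if not chunk:  # b'' means EOF; None means no data on a non-blocking stream
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class ShortReader(io.RawIOBase):
    """Hypothetical raw stream that returns at most 3 bytes per read."""

    def __init__(self, data):
        self._data = data

    def readable(self):
        return True

    def readinto(self, b):
        n = min(len(b), 3, len(self._data))
        b[:n] = self._data[:n]
        self._data = self._data[n:]
        return n


raw = ShortReader(b"abcdefghijklm\r\n")
data = read_exactly(raw, 15)
assert data == b"abcdefghijklm\r\n"
```

Unlike the interactive loop above, this helper requests only the bytes still missing, so it never returns more than ``size`` bytes.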
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+References
+==========
+
+.. [process_stdinreader.py] Twisted's process_stdinreader.py
+   (https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py)
+.. [win_unicode_console] win_unicode_console package
+   (https://pypi.org/project/win_unicode_console/)
--- a/pep-0529.txt
+++ b/pep-0529.txt
@ -0,0 +1,293 @@
+PEP: 529
+Title: Change Windows filesystem encoding to UTF-8
+Version: $Revision$
+Last-Modified: $Date$
+Author: Steve Dower <steve.dower@python.org>
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 27-Aug-2016
+Post-History: 01-Sep-2016
+
+Abstract
+========
+
+Historically, Python uses the ANSI APIs for interacting with the Windows
+operating system, often via C Runtime functions. However, these have long been
+discouraged in favor of the UTF-16 APIs. Within the operating system, all text
+is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
+the active code page.
+
+This PEP proposes changing the default filesystem encoding on Windows to utf-8,
+and changing all filesystem functions to use the Unicode APIs for filesystem
+paths. This will not affect code that uses strings to represent paths; however,
+code that uses bytes for paths will now be able to correctly round-trip all
+valid paths in Windows filesystems. Currently, the conversions between Unicode
+(in the OS) and bytes (in Python) are lossy and fail to round-trip
+characters outside of the user's active code page.
+
+Notably, this does not impact the encoding of the contents of files. These will
+continue to default to ``locale.getpreferredencoding()`` (for text files) or plain
+bytes (for binary files). This only affects the encoding used when users pass a
+bytes object to Python where it is then passed to the operating system as a path
+name.
+
+Background
+==========
+
+File system paths are almost universally represented as text with an encoding
+determined by the file system. In Python, we expose these paths via a number of
+interfaces, such as the ``os`` and ``io`` modules. Paths may be passed in either
+direction across these interfaces, that is, from the filesystem to the
+application (for example, ``os.listdir()``), or from the application to the
+filesystem (for example, ``os.unlink()``).
+
+When paths are passed between the filesystem and the application, they are
+either passed through as a bytes blob or converted to/from str using
+``os.fsencode()`` or ``sys.getfilesystemencoding()``. The result of encoding a
+string with ``sys.getfilesystemencoding()`` is a blob of bytes in the native
+format for the default file system.
+
+On Windows, the native format for the filesystem is utf-16-le. The recommended
+platform APIs for accessing the filesystem all accept and return text encoded in
+this format. However, prior to Windows NT (and possibly further back), the
+native format was a configurable machine option and a separate set of APIs
+existed to accept this format. The option (the "active code page") and these
+APIs (the "*A functions") still exist in recent versions of Windows for
+backwards compatibility, though new functionality often only has a utf-16-le API
+(the "*W functions").
+
+In Python, str is recommended because it can correctly round-trip all characters
+used in paths (on POSIX with surrogateescape handling; on Windows because str
+maps to the native representation). On Windows bytes cannot round-trip all
+characters used in paths, as Python internally uses the *A functions and hence
+the encoding is "whatever the active code page is". Since the active code page
+cannot represent all Unicode characters, the conversion of a path into bytes can
+lose information without warning or any available indication.
+
+As a demonstration of this::
+
+    >>> open('test\uAB00.txt', 'wb').close()
+    >>> import glob
+    >>> glob.glob('test*')
+    ['test\uab00.txt']
+    >>> glob.glob(b'test*')
+    [b'test?.txt']
+
+The Unicode character in the second call to glob has been replaced by a '?',
+which means passing the path back into the filesystem will result in a
+``FileNotFoundError``. The same results may be observed with ``os.listdir()`` or
+any function that matches the return type to the parameter type.
+
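The same loss can be reproduced on any platform at the codec level. In this sketch ``cp1252`` stands in for the active code page, which is an assumption made so the example runs everywhere (``'mbcs'`` is only available on Windows).

```python
name = "test\uAB00.txt"

# Assumption: cp1252 stands in for the active code page ('mbcs' is Windows-only).
lossy = name.encode("cp1252", errors="replace")
print(lossy)  # b'test?.txt'

# UTF-8 can represent the character, so the name survives a round trip.
utf8 = name.encode("utf-8")
assert utf8.decode("utf-8") == name

# The code page bytes cannot be turned back into the original name.
assert lossy.decode("cp1252") != name
```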
+While one user-accessible fix is to use str everywhere, POSIX systems generally
+do not suffer from data loss when using bytes exclusively as the bytes are the
+canonical representation. Even if the encoding is "incorrect" by some standard,
+the file system will still map the bytes back to the file. Making use of this
+avoids the cost of decoding and reencoding, such that (theoretically, and only
+on POSIX), code such as this may be faster because of the use of ``b'.'`` compared
+to using ``'.'``::
+
+    >>> for f in os.listdir(b'.'):
+    ...     os.stat(f)
+    ...
+
+As a result, POSIX-focused library authors prefer to use bytes to represent
+paths. For some authors it is also a convenience, as their code may receive
+bytes already known to be encoded correctly, while others are attempting to
+simplify porting their code from Python 2. However, the correctness assumptions
+do not carry over to Windows where Unicode is the canonical representation, and
+errors may result. This potential data loss is why the use of bytes paths on
+Windows was deprecated in Python 3.3 - all of the above code snippets produce
+deprecation warnings on Windows.
+
+Proposal
+========
+
+Currently the default filesystem encoding is 'mbcs', which is a meta-encoder
+that uses the active code page. However, when bytes are passed to the filesystem
+they go through the *A APIs and the operating system handles encoding. In this
+case, paths are always encoded using the equivalent of 'mbcs:replace' - we have
+no ability to change this (though there is a user/machine configuration option
+to change the encoding from CP_ACP to CP_OEM, so it won't necessarily always
+match mbcs...)
+
+This proposal would remove all use of the *A APIs and only ever call the *W
+APIs. When Windows returns paths to Python as str, they will be decoded from
+utf-16-le and returned as text (in whatever the minimal representation is). When
+Windows returns paths to Python as bytes, they will be decoded from utf-16-le to
+utf-8 using surrogatepass (Windows does not validate surrogate pairs, so it is
+possible to have invalid surrogates in filenames). Equally, when paths are
+provided as bytes, they are decoded from utf-8 into utf-16-le and passed to the
+*W APIs.
+
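The effect of the ``surrogatepass`` error handler can be seen with the standard codecs. This sketch only illustrates the codec behaviour for a name with an unpaired surrogate; the file name is made up and no filesystem call is involved.

```python
lone = "file\ud800.txt"  # contains an unpaired high surrogate

# The default (strict) handler refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False

# surrogatepass encodes it anyway and restores it when decoding.
encoded = lone.encode("utf-8", errors="surrogatepass")
decoded = encoded.decode("utf-8", errors="surrogatepass")
assert strict_failed
assert decoded == lone
```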
+The use of utf-8 will not be configurable, with the possible exception of a
+"legacy mode" environment variable or X-flag.
+
+surrogateescape does not apply here, as the concern is not about retaining
+nonsensical bytes. Any path returned from the operating system will be valid
+Unicode, while bytes paths created by the user may raise a decoding error
+(currently these would raise ``OSError`` or a subclass).
+
+The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure the
+ability to round-trip without breaking the functionality of the ``os.path``
+module, which assumes an ASCII-compatible encoding. Using utf-16-le as the
+encoding is more pure, but will cause more issues than are resolved.
+
+This change would also undeprecate the use of bytes paths on Windows. No change
+to the semantics of using bytes as a path is required - as before, they must be
+encoded with the encoding specified by ``sys.getfilesystemencoding()``.
+
+Specific Changes
+================
+
+Update sys.getfilesystemencoding
+--------------------------------
+
+Remove the default value for ``Py_FileSystemDefaultEncoding`` and set it in
+``initfsencoding()`` to utf-8, or if the legacy-mode switch is enabled to mbcs.
+
+Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize`` and
+``PyUnicode_EncodeFSDefault`` to use the standard utf-8 codec with surrogatepass
+error mode, or if the legacy-mode switch is enabled the code page codec with
+replace error mode.
+
+Update path_converter
+---------------------
+
+Update the path converter to always decode bytes or buffer objects into text
+using ``PyUnicode_DecodeFSDefaultAndSize``.
+
+Change the ``narrow`` field from a ``char*`` string into a flag that indicates
+whether the original object was bytes. This is required for functions that need
+to return paths using the same type as was originally provided.
+
+Remove unused ANSI code
+-----------------------
+
+Remove all code paths using the ``narrow`` field, as these will no longer be
+reachable by any caller. These are only used within ``posixmodule.c``. Other
+uses of paths should have use of bytes paths replaced with decoding and use of
+the *W APIs.
+
+Add legacy mode
+---------------
+
+Add a legacy mode flag, enabled by the environment variable
+``PYTHONLEGACYWINDOWSFSENCODING``. When this flag is set, the default filesystem
+encoding is set to mbcs rather than utf-8, and the error mode is set to
+'replace' rather than 'surrogatepass'. The ``path_converter`` will continue to decode
+to wide characters and only *W APIs will be called, however, the bytes passed in
+and received from Python will be encoded the same as prior to this change.
+
+Undeprecate bytes paths on Windows
+----------------------------------
+
+Using bytes as paths on Windows is currently deprecated. We would announce that
+this is no longer the case, and that paths when encoded as bytes should use
+whatever is returned from ``sys.getfilesystemencoding()`` rather than the user's
+active code page.
+
+
+Rejected Alternatives
+=====================
+
+Use strict mbcs decoding
+------------------------
+
+This is essentially the same as the proposed change, but instead of changing
+``sys.getfilesystemencoding()`` to utf-8 it is changed to mbcs (which
+dynamically maps to the active code page).
+
+This approach allows the use of new functionality that is only available as *W
+APIs and also detection of encoding/decoding errors. For example, rather than
+silently replacing Unicode characters with '?', it would be possible to warn or
+fail the operation.
+
+Compared to the proposed fix, this could enable some new functionality but does
+not fix any of the problems described initially. New runtime errors may cause
+some problems to be more obvious and lead to fixes, provided library maintainers
+are interested in supporting Windows and adding a separate code path to treat
+filesystem paths as strings.
+
+Making the encoding mbcs without strict errors is equivalent to the legacy-mode
+switch being enabled by default. This is a possible course of action if there is
+significant breakage of actual code and a need to extend the deprecation period,
+but still a desire to have the simplifications to the CPython source.
+
+Make bytes paths an error on Windows
+------------------------------------
+
+By preventing the use of bytes paths on Windows completely we prevent users from
+hitting encoding issues.
+
+However, the motivation for this PEP is to increase the likelihood that code
+written on POSIX will also work correctly on Windows. This alternative would
+move in the other direction and make such code completely incompatible. As this
+does not benefit users in any way, we reject it.
+
+Make bytes paths an error on all platforms
+------------------------------------------
+
+By deprecating and then disabling the use of bytes paths on all platforms we
+prevent users from hitting encoding issues regardless of where the code was
+originally written. This would require a full deprecation cycle, as there are
+currently no warnings on platforms other than Windows.
+
+This is likely to be seen as a hostile action against Python developers in
+general, and as such is rejected at this time.
+
+Code that may break
+===================
+
+The following code patterns may break or see different behaviour as a result of
+this change.
+
+Note that all of these examples produce deprecation warnings on Python 3.3 and
+later.
+
+Not managing encodings across boundaries
+----------------------------------------
+
+Code that does not manage encodings when crossing protocol boundaries may
+currently be working by chance, but could encounter issues when either encoding
+changes. For example::
+
+    filename = open('filename_in_mbcs.txt', 'rb').read()
+    text = open(filename, 'r').read()
+
+To correct this code, the encoding of the bytes in ``filename`` should be
+specified, either when reading from the file or before using the value::
+
+    # Fix 1: Open file as text
+    filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read()
+    text = open(filename, 'r').read()
+
+    # Fix 2: Decode path
+    filename = open('filename_in_mbcs.txt', 'rb').read()
+    text = open(filename.decode('mbcs'), 'r').read()
+
+
+Explicitly using 'mbcs'
+-----------------------
+
+Code that explicitly encodes text using 'mbcs' before passing it to file system
+APIs may break, as the bytes will no longer match the filesystem encoding. For
+example::
+
+    filename = open('files.txt', 'r').readline().rstrip()
+    text = open(filename.encode('mbcs'), 'r')
+
+To correct this code, the string should be passed without explicit encoding, or
+should use ``os.fsencode()``::
+
+    # Fix 1: Do not encode the string
+    filename = open('files.txt', 'r').readline().rstrip()
+    text = open(filename, 'r')
+
+    # Fix 2: Use correct encoding
+    filename = open('files.txt', 'r').readline().rstrip()
+    text = open(os.fsencode(filename), 'r')
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.