PEP 529: Readability, style and detail updates

Steve Dower 2016-09-04 22:44:12 -07:00
parent 37567a1d72
commit 71d8c4c69d
1 changed files with 137 additions and 60 deletions

@ -16,7 +16,8 @@ Historically, Python uses the ANSI APIs for interacting with the Windows
operating system, often via C Runtime functions. However, these have been long operating system, often via C Runtime functions. However, these have been long
discouraged in favor of the UTF-16 APIs. Within the operating system, all text discouraged in favor of the UTF-16 APIs. Within the operating system, all text
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
the active code page. the active code page. See `Naming Files, Paths, and Namespaces`_ for
more details.
This PEP proposes changing the default filesystem encoding on Windows to utf-8, This PEP proposes changing the default filesystem encoding on Windows to utf-8,
and changing all filesystem functions to use the Unicode APIs for filesystem and changing all filesystem functions to use the Unicode APIs for filesystem
@ -27,10 +28,10 @@ valid paths in Windows filesystems. Currently, the conversions between Unicode
characters outside of the user's active code page. characters outside of the user's active code page.
Notably, this does not impact the encoding of the contents of files. These will Notably, this does not impact the encoding of the contents of files. These will
continue to default to locale.getpreferredencoding (for text files) or plain continue to default to ``locale.getpreferredencoding()`` (for text files) or
bytes (for binary files). This only affects the encoding used when users pass a plain bytes (for binary files). This only affects the encoding used when users
bytes object to Python where it is then passed to the operating system as a path pass a bytes object to Python where it is then passed to the operating system as
name. a path name.
Background Background
========== ==========
@ -44,9 +45,10 @@ filesystem (for example, ``os.unlink()``).
When paths are passed between the filesystem and the application, they are When paths are passed between the filesystem and the application, they are
either passed through as a bytes blob or converted to/from str using either passed through as a bytes blob or converted to/from str using
``os.fsencode()`` or ``sys.getfilesystemencoding()``. The result of encoding a ``os.fsencode()`` and ``os.fsdecode()`` or explicit encoding using
string with ``sys.getfilesystemencoding()`` is a blob of bytes in the native ``sys.getfilesystemencoding()``. The result of encoding a string with
format for the default file system. ``sys.getfilesystemencoding()`` is a blob of bytes in the native format for the
default file system.
On Windows, the native format for the filesystem is utf-16-le. The recommended On Windows, the native format for the filesystem is utf-16-le. The recommended
platform APIs for accessing the filesystem all accept and return text encoded in platform APIs for accessing the filesystem all accept and return text encoded in
@ -83,8 +85,8 @@ do not suffer from data loss when using bytes exclusively as the bytes are the
canonical representation. Even if the encoding is "incorrect" by some standard, canonical representation. Even if the encoding is "incorrect" by some standard,
the file system will still map the bytes back to the file. Making use of this the file system will still map the bytes back to the file. Making use of this
avoids the cost of decoding and reencoding, such that (theoretically, and only avoids the cost of decoding and reencoding, such that (theoretically, and only
on POSIX), code such as this may be faster because of the use of `b'.'` compared on POSIX), code such as this may be faster because of the use of ``b'.'``
to using `'.'`:: compared to using ``'.'``::
>>> for f in os.listdir(b'.'): >>> for f in os.listdir(b'.'):
... os.stat(f) ... os.stat(f)
@ -105,32 +107,31 @@ Proposal
Currently the default filesystem encoding is 'mbcs', which is a meta-encoder Currently the default filesystem encoding is 'mbcs', which is a meta-encoder
that uses the active code page. However, when bytes are passed to the filesystem that uses the active code page. However, when bytes are passed to the filesystem
they go through the \*A APIs and the operating system handles encoding. In this they go through the \*A APIs and the operating system handles encoding. In this
case, paths are always encoded using the equivalent of 'mbcs:replace' - we have case, paths are always encoded using the equivalent of 'mbcs:replace' with no
no ability to change this (though there is a user/machine configuration option opportunity for Python to override or change this.
to change the encoding from CP_ACP to CP_OEM, so it won't necessarily always
match mbcs...)
This proposal would remove all use of the \*A APIs and only ever call the \*W This proposal would remove all use of the \*A APIs and only ever call the \*W
APIs. When Windows returns paths to Python as str, they will be decoded from APIs. When Windows returns paths to Python as ``str``, they will be decoded from
utf-16-le and returned as text (in whatever the minimal representation is). When utf-16-le and returned as text (in whatever the minimal representation is). When
Windows returns paths to Python as bytes, they will be decoded from utf-16-le to Python code requests paths as ``bytes``, the paths will be transcoded from
utf-8 using surrogatepass (Windows does not validate surrogate pairs, so it is utf-16-le into utf-8 using surrogatepass (Windows does not validate surrogate
possible to have invalid surrogates in filenames). Equally, when paths are pairs, so it is possible to have invalid surrogates in filenames). Equally, when
provided as bytes, they are decoded from utf-8 into utf-16-le and passed to the paths are provided as ``bytes``, they are transcoded from utf-8 into utf-16-le
\*W APIs. and passed to the \*W APIs.
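As a rough illustration of the transcoding described in this paragraph, the following sketch uses the standard codecs from Python code. It is not the C implementation the PEP modifies, and the file name is made up. A name containing an unpaired surrogate survives the utf-16-le to utf-8 round trip only when ``surrogatepass`` is used:

```python
# Hypothetical file name containing a lone high surrogate, which Windows permits.
name = "bad\ud800name.txt"

# Roughly what the *W APIs deal in: utf-16-le code units.
wide = name.encode("utf-16-le", "surrogatepass")

# Transcode utf-16-le -> str -> utf-8, keeping the lone surrogate intact.
as_utf8 = wide.decode("utf-16-le", "surrogatepass").encode("utf-8", "surrogatepass")

# And back again: utf-8 -> str -> utf-16-le, ready for the *W APIs.
back_to_wide = as_utf8.decode("utf-8", "surrogatepass").encode(
    "utf-16-le", "surrogatepass"
)

# The default (strict) utf-8 error mode refuses the same name.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The point of the sketch is only that ``surrogatepass`` makes the conversion lossless in both directions for names that are not valid Unicode.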
The use of utf-8 will not be configurable, with the possible exception of a The use of utf-8 will not be configurable, except for the provision of a
"legacy mode" environment variable or X-flag. "legacy mode" flag to revert to the previous behaviour.
surrogateescape does not apply here, as the concern is not about retaining The ``surrogateescape`` error mode does not apply here, as the concern is not
non-sensical bytes. Any path returned from the operating system will be valid about retaining non-sensical bytes. Any path returned from the operating system
Unicode, while bytes paths created by the user may raise a decoding error will be valid Unicode, while invalid paths created by the user should raise a
(currently these would raise ``OSError`` or a subclass). decoding error (currently these would raise ``OSError`` or a subclass).
The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure the The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure the
ability to round-trip without breaking the functionality of the ``os.path`` ability to round-trip path names and allow basic manipulation (for example,
module, which assumes an ASCII-compatible encoding. Using utf-16-le as the using the ``os.path`` module) when assuming an ASCII-compatible encoding. Using
encoding is more pure, but will cause more issues than are resolved. utf-16-le as the encoding is more pure, but will cause more issues than are
resolved.
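To make the ASCII-compatibility argument concrete, here is a minimal sketch (the path is hypothetical, and a plain ``bytes.rsplit()`` stands in for the kind of byte-level manipulation ``os.path`` performs). Splitting on an ASCII separator works for utf-8 bytes but cuts a utf-16-le code unit in half:

```python
# Hypothetical path with one non-ASCII character.
path = "dir\\caf\u00e9.txt"
sep = b"\\"  # ASCII separator, as byte-oriented path code would use

# utf-8: the separator byte never appears inside another character,
# so the last component decodes cleanly.
utf8_name = path.encode("utf-8").rsplit(sep, 1)[1]
utf8_ok = utf8_name.decode("utf-8") == "caf\u00e9.txt"

# utf-16-le: the separator is really b"\\\x00", so splitting on b"\\"
# leaves a stray NUL byte at the start of the last component.
utf16_name = path.encode("utf-16-le").rsplit(sep, 1)[1]

try:
    utf16_name.decode("utf-16-le")
    utf16_broke = False
except UnicodeDecodeError:
    # Odd number of bytes: the final code unit is truncated.
    utf16_broke = True
```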
This change would also undeprecate the use of bytes paths on Windows. No change This change would also undeprecate the use of bytes paths on Windows. No change
to the semantics of using bytes as a path is required - as before, they must be to the semantics of using bytes as a path is required - as before, they must be
@ -145,16 +146,38 @@ Update sys.getfilesystemencoding
Remove the default value for ``Py_FileSystemDefaultEncoding`` and set it in Remove the default value for ``Py_FileSystemDefaultEncoding`` and set it in
``initfsencoding()`` to utf-8, or if the legacy-mode switch is enabled to mbcs. ``initfsencoding()`` to utf-8, or if the legacy-mode switch is enabled to mbcs.
Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize`` and Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and
``PyUnicode_EncodeFSDefault`` to use the standard utf-8 codec with surrogatepass ``PyUnicode_EncodeFSDefault()`` to use the utf-8 codec, or if the legacy-mode
error mode, or if the legacy-mode switch is enabled the code page codec with switch is enabled the existing mbcs codec.
replace error mode.
Add sys.getfilesystemencodeerrors
---------------------------------
As the error mode may now change between ``surrogatepass`` and ``replace``,
Python code that manually performs encoding also needs access to the current
error mode. This includes the implementation of ``os.fsencode()`` and
``os.fsdecode()``, which currently assume an error mode based on the codec.
Add a public ``Py_FileSystemDefaultEncodeErrors``, similar to the existing
``Py_FileSystemDefaultEncoding``. The default value on Windows will be
``surrogatepass`` or in legacy mode, ``replace``. The default value on all other
platforms will be ``surrogateescape``.
Add a public ``sys.getfilesystemencodeerrors()`` function that returns the
current error mode.
Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and
``PyUnicode_EncodeFSDefault()`` to use the variable for error mode rather than
constant strings.
Update the implementations of ``os.fsencode()`` and ``os.fsdecode()`` to use
``sys.getfilesystemencodeerrors()`` instead of assuming the mode.
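The updated helpers could look roughly like the sketch below. It assumes Python 3.6 or later, where ``sys.getfilesystemencodeerrors()`` is available; the ``*_sketch`` names are hypothetical, and the real ``os.fsencode()`` and ``os.fsdecode()`` also handle path-like objects and reject other types:

```python
import sys


def fsencode_sketch(path):
    """Encode str to bytes with the filesystem encoding and error mode."""
    if isinstance(path, bytes):
        return path
    return path.encode(
        sys.getfilesystemencoding(),
        sys.getfilesystemencodeerrors(),  # queried at run time, not assumed
    )


def fsdecode_sketch(path):
    """Decode bytes to str with the filesystem encoding and error mode."""
    if isinstance(path, str):
        return path
    return path.decode(
        sys.getfilesystemencoding(),
        sys.getfilesystemencodeerrors(),
    )
```

Asking ``sys`` for the error mode, rather than inferring it from the codec name, is what lets the same code work in both the default and the legacy configuration.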
Update path_converter Update path_converter
--------------------- ---------------------
Update the path converter to always decode bytes or buffer objects into text Update the path converter to always decode bytes or buffer objects into text
using ``PyUnicode_DecodeFSDefaultAndSize``. using ``PyUnicode_DecodeFSDefaultAndSize()``.
Change the ``narrow`` field from a ``char*`` string into a flag that indicates Change the ``narrow`` field from a ``char*`` string into a flag that indicates
whether the original object was bytes. This is required for functions that need whether the original object was bytes. This is required for functions that need
@ -172,11 +195,13 @@ Add legacy mode
--------------- ---------------
Add a legacy mode flag, enabled by the environment variable Add a legacy mode flag, enabled by the environment variable
``PYTHONLEGACYWINDOWSFSENCODING``. When this flag is set, the default filesystem ``PYTHONLEGACYWINDOWSFSENCODING``.
encoding is set to mbcs rather than utf-8, and the error mode is set to
'replace' rather than 'strict'. The ``path_converter`` will continue to decode When this flag is set, the default filesystem encoding is set to mbcs rather
to wide characters and only \*W APIs will be called, however, the bytes passed in than utf-8, and the error mode is set to ``replace`` rather than
and received from Python will be encoded the same as prior to this change. ``surrogatepass``. Paths will continue to decode to wide characters and only \*W
APIs will be called, however, the bytes passed in and received from Python will
be encoded the same as prior to this change.
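One way to observe the effect of the legacy switch is to start a child interpreter with and without the variable set, as in this sketch. The helper name is hypothetical, Python 3.6 or later is assumed for the child, and on non-Windows platforms the variable is expected to make no difference:

```python
import os
import subprocess
import sys


def report_fs_encoding(legacy=False):
    """Return [encoding, errors] as reported by a fresh interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONLEGACYWINDOWSFSENCODING", None)
    if legacy:
        env["PYTHONLEGACYWINDOWSFSENCODING"] = "1"
    code = (
        "import sys; "
        "print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.split()
```

Calling ``report_fs_encoding()`` and ``report_fs_encoding(legacy=True)`` on Windows and comparing the two results is a quick check that the flag is being honoured.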
Undeprecate bytes paths on Windows Undeprecate bytes paths on Windows
---------------------------------- ----------------------------------
@ -186,6 +211,52 @@ this is no longer the case, and that paths when encoded as bytes should use
whatever is returned from ``sys.getfilesystemencoding()`` rather than the user's whatever is returned from ``sys.getfilesystemencoding()`` rather than the user's
active code page. active code page.
Affected Modules
----------------
This PEP implicitly includes all modules within Python that either pass path
names to the operating system, or otherwise use ``sys.getfilesystemencoding()``.
As of 3.6.0a4, the following modules require modification:
* ``os``
* ``_overlapped``
* ``_socket``
* ``subprocess``
* ``zipimport``
The following modules use ``sys.getfilesystemencoding()`` but do not need
modification:
* ``gc`` (already assumes bytes are utf-8)
* ``grp`` (not compiled for Windows)
* ``http.server`` (correctly includes codec name with transmitted data)
* ``idlelib.editor`` (should not be needed; has fallback handling)
* ``nis`` (not compiled for Windows)
* ``pwd`` (not compiled for Windows)
* ``spwd`` (not compiled for Windows)
* ``_ssl`` (only used for ASCII constants)
* ``tarfile`` (code unused on Windows)
* ``_tkinter`` (already assumes bytes are utf-8)
* ``wsgiref`` (assumed as the default encoding for unknown environments)
* ``zipapp`` (code unused on Windows)
The following native code uses one of the encoding or decoding functions, but
does not require any modification:
* ``Parser/parsetok.c`` (docs already specify ``sys.getfilesystemencoding()``)
* ``Python/ast.c`` (docs already specify ``sys.getfilesystemencoding()``)
* ``Python/compile.c`` (undocumented, but Python filesystem encoding implied)
* ``Python/errors.c`` (docs already specify ``os.fsdecode()``)
* ``Python/fileutils.c`` (code unused on Windows)
* ``Python/future.c`` (undocumented, but Python filesystem encoding implied)
* ``Python/import.c`` (docs already specify utf-8)
* ``Python/importdl.c`` (code unused on Windows)
* ``Python/pythonrun.c`` (docs already specify ``sys.getfilesystemencoding()``)
* ``Python/symtable.c`` (undocumented, but Python filesystem encoding implied)
* ``Python/thread.c`` (code unused on Windows)
* ``Python/traceback.c`` (encodes correctly for comparing strings)
* ``Python/_warnings.c`` (docs already specify ``os.fsdecode()``)
Rejected Alternatives Rejected Alternatives
===================== =====================
@ -249,44 +320,50 @@ Not managing encodings across boundaries
Code that does not manage encodings when crossing protocol boundaries may Code that does not manage encodings when crossing protocol boundaries may
currently be working by chance, but could encounter issues when either encoding currently be working by chance, but could encounter issues when either encoding
changes. For example:: changes. For example:
filename = open('filename_in_mbcs.txt', 'rb').read() >>> filename = open('filename_in_mbcs.txt', 'rb').read()
text = open(filename, 'r').read() >>> text = open(filename, 'r').read()
To correct this code, the encoding of the bytes in ``filename`` should be To correct this code, the encoding of the bytes in ``filename`` should be
specified, either when reading from the file or before using the value:: specified, either when reading from the file or before using the value:
# Fix 1: Open file as text >>> # Fix 1: Open file as text
filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read() >>> filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read()
text = open(filename, 'r').read() >>> text = open(filename, 'r').read()
# Fix 2: Decode path >>> # Fix 2: Decode path
filename = open('filename_in_mbcs.txt', 'rb').read() >>> filename = open('filename_in_mbcs.txt', 'rb').read()
text = open(filename.decode('mbcs'), 'r').read() >>> text = open(filename.decode('mbcs'), 'r').read()
Explicitly using 'mbcs' Explicitly using 'mbcs'
----------------------- -----------------------
Code that explicitly encodes text using 'mbcs' before passing to file system Code that explicitly encodes text using 'mbcs' before passing to file system
APIs. For example:: APIs is now passing incorrectly encoded bytes. For example:
filename = open('files.txt', 'r').readline() >>> filename = open('files.txt', 'r').readline()
text = open(filename.encode('mbcs'), 'r') >>> text = open(filename.encode('mbcs'), 'r')
To correct this code, the string should be passed without explicit encoding, or To correct this code, the string should be passed without explicit encoding, or
should use ``os.fsencode()``:: should use ``os.fsencode()``:
# Fix 1: Do not encode the string >>> # Fix 1: Do not encode the string
filename = open('files.txt', 'r').readline() >>> filename = open('files.txt', 'r').readline()
text = open(filename, 'r') >>> text = open(filename, 'r')
# Fix 2: Use correct encoding >>> # Fix 2: Use correct encoding
filename = open('files.txt', 'r').readline() >>> filename = open('files.txt', 'r').readline()
text = open(os.fsencode(filename), 'r') >>> text = open(os.fsencode(filename), 'r')
References
==========
.. _Naming Files, Paths, and Namespaces:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247.aspx
Copyright Copyright
========= =========