PEP 529: Readability, style and detail updates
Parent: 37567a1d72
Commit: 71d8c4c69d
File: pep-0529.txt (197 lines changed)
@@ -16,7 +16,8 @@ Historically, Python uses the ANSI APIs for interacting with the Windows
operating system, often via C Runtime functions. However, these have been long
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
the active code page.
the active code page. See `Naming Files, Paths, and Namespaces`_ for
more details.

This PEP proposes changing the default filesystem encoding on Windows to utf-8,
and changing all filesystem functions to use the Unicode APIs for filesystem
@@ -27,10 +28,10 @@ valid paths in Windows filesystems. Currently, the conversions between Unicode
characters outside of the user's active code page.

Notably, this does not impact the encoding of the contents of files. These will
continue to default to locale.getpreferredencoding (for text files) or plain
bytes (for binary files). This only affects the encoding used when users pass a
bytes object to Python where it is then passed to the operating system as a path
name.
continue to default to ``locale.getpreferredencoding()`` (for text files) or
plain bytes (for binary files). This only affects the encoding used when users
pass a bytes object to Python where it is then passed to the operating system as
a path name.

Background
==========
@@ -44,9 +45,10 @@ filesystem (for example, ``os.unlink()``).

When paths are passed between the filesystem and the application, they are
either passed through as a bytes blob or converted to/from str using
``os.fsencode()`` or ``sys.getfilesystemencoding()``. The result of encoding a
string with ``sys.getfilesystemencoding()`` is a blob of bytes in the native
format for the default file system.
``os.fsencode()`` and ``os.fsdecode()`` or explicit encoding using
``sys.getfilesystemencoding()``. The result of encoding a string with
``sys.getfilesystemencoding()`` is a blob of bytes in the native format for the
default file system.
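
As an illustration only (this sketch is not part of the PEP text), the conversion described above can be tried by round-tripping a bytes path through ``os.fsdecode()`` and ``os.fsencode()``. The file name is hypothetical, and the exact ``str`` produced depends on the platform's filesystem encoding and error mode.

```python
import os

# Hypothetical file name supplied as bytes (the utf-8 encoding of "café.txt").
raw_path = b'caf\xc3\xa9.txt'

# Decode to str using the filesystem encoding and its error handler.
text_path = os.fsdecode(raw_path)

# Encode back to bytes with the same codec and error handler.
round_tripped = os.fsencode(text_path)

print(type(text_path).__name__, type(round_tripped).__name__)
```

On platforms where the filesystem encoding is utf-8, or where the ``surrogateescape`` error mode is in use, the bytes are expected to survive the round trip unchanged.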

On Windows, the native format for the filesystem is utf-16-le. The recommended
platform APIs for accessing the filesystem all accept and return text encoded in
@@ -83,11 +85,11 @@ do not suffer from data loss when using bytes exclusively as the bytes are the
canonical representation. Even if the encoding is "incorrect" by some standard,
the file system will still map the bytes back to the file. Making use of this
avoids the cost of decoding and reencoding, such that (theoretically, and only
on POSIX), code such as this may be faster because of the use of `b'.'` compared
to using `'.'`::
on POSIX), code such as this may be faster because of the use of ``b'.'``
compared to using ``'.'``::

    >>> for f in os.listdir(b'.'):
    ...     os.stat(f)
    ...     os.stat(f)
    ...

As a result, POSIX-focused library authors prefer to use bytes to represent
@@ -105,32 +107,31 @@ Proposal
Currently the default filesystem encoding is 'mbcs', which is a meta-encoder
that uses the active code page. However, when bytes are passed to the filesystem
they go through the \*A APIs and the operating system handles encoding. In this
case, paths are always encoded using the equivalent of 'mbcs:replace' - we have
no ability to change this (though there is a user/machine configuration option
to change the encoding from CP_ACP to CP_OEM, so it won't necessarily always
match mbcs...)
case, paths are always encoded using the equivalent of 'mbcs:replace' with no
opportunity for Python to override or change this.

This proposal would remove all use of the \*A APIs and only ever call the \*W
APIs. When Windows returns paths to Python as str, they will be decoded from
APIs. When Windows returns paths to Python as ``str``, they will be decoded from
utf-16-le and returned as text (in whatever the minimal representation is). When
Windows returns paths to Python as bytes, they will be decoded from utf-16-le to
utf-8 using surrogatepass (Windows does not validate surrogate pairs, so it is
possible to have invalid surrogates in filenames). Equally, when paths are
provided as bytes, they are decoded from utf-8 into utf-16-le and passed to the
\*W APIs.
Python code requests paths as ``bytes``, the paths will be transcoded from
utf-16-le into utf-8 using surrogatepass (Windows does not validate surrogate
pairs, so it is possible to have invalid surrogates in filenames). Equally, when
paths are provided as ``bytes``, they are transcoded from utf-8 into utf-16-le
and passed to the \*W APIs.
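
The transcoding step can be sketched in pure Python. This is an illustration under stated assumptions, not the actual C implementation: the helper names ``utf16le_to_utf8`` and ``utf8_to_utf16le`` and the file name are hypothetical. It shows that a lone surrogate, which Windows permits in file names, survives the trip in both directions when ``surrogatepass`` is used.

```python
def utf16le_to_utf8(raw):
    """Transcode utf-16-le bytes to utf-8, letting lone surrogates through."""
    text = raw.decode('utf-16-le', 'surrogatepass')
    return text.encode('utf-8', 'surrogatepass')


def utf8_to_utf16le(raw):
    """Transcode utf-8 bytes back to utf-16-le with the same error mode."""
    text = raw.decode('utf-8', 'surrogatepass')
    return text.encode('utf-16-le', 'surrogatepass')


# Hypothetical file name containing an unpaired low surrogate (U+DC80).
name = 'spam_\udc80.txt'
wide = name.encode('utf-16-le', 'surrogatepass')

narrow = utf16le_to_utf8(wide)
restored = utf8_to_utf16le(narrow)
```

With the default ``strict`` error mode, the same ``name.encode('utf-8')`` call would raise ``UnicodeEncodeError``, which is why ``surrogatepass`` is needed to represent every name the filesystem can return.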

The use of utf-8 will not be configurable, with the possible exception of a
"legacy mode" environment variable or X-flag.
The use of utf-8 will not be configurable, except for the provision of a
"legacy mode" flag to revert to the previous behaviour.

surrogateescape does not apply here, as the concern is not about retaining
non-sensical bytes. Any path returned from the operating system will be valid
Unicode, while bytes paths created by the user may raise a decoding error
(currently these would raise ``OSError`` or a subclass).
The ``surrogateescape`` error mode does not apply here, as the concern is not
about retaining non-sensical bytes. Any path returned from the operating system
will be valid Unicode, while invalid paths created by the user should raise a
decoding error (currently these would raise ``OSError`` or a subclass).
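
To make the difference between the two error modes concrete, here is a small sketch using only the standard codecs. The byte string is a hypothetical user-supplied path containing ``0xff``, which never occurs in valid utf-8. It is meant as an illustration of the codec behaviour, not of the exact exception type the filesystem functions raise.

```python
# Hypothetical bytes path that is not valid utf-8.
invalid = b'spam_\xff.txt'

# surrogateescape keeps the bad byte by mapping it to a lone surrogate.
escaped = invalid.decode('utf-8', 'surrogateescape')

# surrogatepass only lets encoded surrogates through; other invalid
# bytes still produce a decoding error.
try:
    invalid.decode('utf-8', 'surrogatepass')
except UnicodeDecodeError as exc:
    error = exc
else:
    error = None
```

Here ``escaped`` carries the stray byte as ``'\udcff'``, while the ``surrogatepass`` attempt fails, matching the statement that invalid paths created by the user should raise a decoding error.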

The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure the
ability to round-trip without breaking the functionality of the ``os.path``
module, which assumes an ASCII-compatible encoding. Using utf-16-le as the
encoding is more pure, but will cause more issues than are resolved.
ability to round-trip path names and allow basic manipulation (for example,
using the ``os.path`` module) when assuming an ASCII-compatible encoding. Using
utf-16-le as the encoding is more pure, but will cause more issues than are
resolved.

This change would also undeprecate the use of bytes paths on Windows. No change
to the semantics of using bytes as a path is required - as before, they must be
@@ -145,16 +146,38 @@ Update sys.getfilesystemencoding
Remove the default value for ``Py_FileSystemDefaultEncoding`` and set it in
``initfsencoding()`` to utf-8, or if the legacy-mode switch is enabled to mbcs.

Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize`` and
``PyUnicode_EncodeFSDefault`` to use the standard utf-8 codec with surrogatepass
error mode, or if the legacy-mode switch is enabled the code page codec with
replace error mode.
Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and
``PyUnicode_EncodeFSDefault()`` to use the utf-8 codec, or if the legacy-mode
switch is enabled the existing mbcs codec.

Add sys.getfilesystemencodeerrors
---------------------------------

As the error mode may now change between ``surrogatepass`` and ``replace``,
Python code that manually performs encoding also needs access to the current
error mode. This includes the implementation of ``os.fsencode()`` and
``os.fsdecode()``, which currently assume an error mode based on the codec.

Add a public ``Py_FileSystemDefaultEncodeErrors``, similar to the existing
``Py_FileSystemDefaultEncoding``. The default value on Windows will be
``surrogatepass`` or in legacy mode, ``replace``. The default value on all other
platforms will be ``surrogateescape``.

Add a public ``sys.getfilesystemencodeerrors()`` function that returns the
current error mode.

Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and
``PyUnicode_EncodeFSDefault()`` to use the variable for error mode rather than
constant strings.

Update the implementations of ``os.fsencode()`` and ``os.fsdecode()`` to use
``sys.getfilesystemencodeerrors()`` instead of assuming the mode.
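
For code that encodes paths manually, the proposed function would be used roughly as follows. This is a minimal sketch assuming Python 3.6 or later, where ``sys.getfilesystemencodeerrors()`` is available. The helper names ``encode_path`` and ``decode_path`` and the file name are hypothetical, and real code would normally just call ``os.fsencode()`` and ``os.fsdecode()``.

```python
import os
import sys


def encode_path(path):
    """Encode a str path with the current filesystem codec and error mode."""
    return path.encode(sys.getfilesystemencoding(),
                       sys.getfilesystemencodeerrors())


def decode_path(raw):
    """Decode a bytes path with the current filesystem codec and error mode."""
    return raw.decode(sys.getfilesystemencoding(),
                      sys.getfilesystemencodeerrors())


# Hypothetical ASCII-only file name, so the result is platform independent.
example = 'example.txt'
encoded = encode_path(example)
```

Querying the error mode at call time, rather than hard-coding ``surrogateescape`` or ``strict``, keeps such helpers consistent with ``os.fsencode()`` whether or not legacy mode is enabled.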

Update path_converter
---------------------

Update the path converter to always decode bytes or buffer objects into text
using ``PyUnicode_DecodeFSDefaultAndSize``.
using ``PyUnicode_DecodeFSDefaultAndSize()``.

Change the ``narrow`` field from a ``char*`` string into a flag that indicates
whether the original object was bytes. This is required for functions that need
@@ -172,11 +195,13 @@ Add legacy mode
---------------

Add a legacy mode flag, enabled by the environment variable
``PYTHONLEGACYWINDOWSFSENCODING``. When this flag is set, the default filesystem
encoding is set to mbcs rather than utf-8, and the error mode is set to
'replace' rather than 'strict'. The ``path_converter`` will continue to decode
to wide characters and only \*W APIs will be called, however, the bytes passed in
and received from Python will be encoded the same as prior to this change.
``PYTHONLEGACYWINDOWSFSENCODING``.

When this flag is set, the default filesystem encoding is set to mbcs rather
than utf-8, and the error mode is set to ``replace`` rather than
``surrogatepass``. Paths will continue to decode to wide characters and only \*W
APIs will be called, however, the bytes passed in and received from Python will
be encoded the same as prior to this change.

Undeprecate bytes paths on Windows
----------------------------------
@@ -186,6 +211,52 @@ this is no longer the case, and that paths when encoded as bytes should use
whatever is returned from ``sys.getfilesystemencoding()`` rather than the user's
active code page.

Affected Modules
----------------

This PEP implicitly includes all modules within Python that either pass path
names to the operating system, or otherwise use ``sys.getfilesystemencoding()``.

As of 3.6.0a4, the following modules require modification:

* ``os``
* ``_overlapped``
* ``_socket``
* ``subprocess``
* ``zipimport``

The following modules use ``sys.getfilesystemencoding()`` but do not need
modification:

* ``gc`` (already assumes bytes are utf-8)
* ``grp`` (not compiled for Windows)
* ``http.server`` (correctly includes codec name with transmitted data)
* ``idlelib.editor`` (should not be needed; has fallback handling)
* ``nis`` (not compiled for Windows)
* ``pwd`` (not compiled for Windows)
* ``spwd`` (not compiled for Windows)
* ``_ssl`` (only used for ASCII constants)
* ``tarfile`` (code unused on Windows)
* ``_tkinter`` (already assumes bytes are utf-8)
* ``wsgiref`` (assumed as the default encoding for unknown environments)
* ``zipapp`` (code unused on Windows)

The following native code uses one of the encoding or decoding functions, but
does not require any modification:

* ``Parser/parsetok.c`` (docs already specify ``sys.getfilesystemencoding()``)
* ``Python/ast.c`` (docs already specify ``sys.getfilesystemencoding()``)
* ``Python/compile.c`` (undocumented, but Python filesystem encoding implied)
* ``Python/errors.c`` (docs already specify ``os.fsdecode()``)
* ``Python/fileutils.c`` (code unused on Windows)
* ``Python/future.c`` (undocumented, but Python filesystem encoding implied)
* ``Python/import.c`` (docs already specify utf-8)
* ``Python/importdl.c`` (code unused on Windows)
* ``Python/pythonrun.c`` (docs already specify ``sys.getfilesystemencoding()``)
* ``Python/symtable.c`` (undocumented, but Python filesystem encoding implied)
* ``Python/thread.c`` (code unused on Windows)
* ``Python/traceback.c`` (encodes correctly for comparing strings)
* ``Python/_warnings.c`` (docs already specify ``os.fsdecode()``)

Rejected Alternatives
=====================
@@ -249,44 +320,50 @@ Not managing encodings across boundaries

Code that does not manage encodings when crossing protocol boundaries may
currently be working by chance, but could encounter issues when either encoding
changes. For example::
changes. For example:

    filename = open('filename_in_mbcs.txt', 'rb').read()
    text = open(filename, 'r').read()
    >>> filename = open('filename_in_mbcs.txt', 'rb').read()
    >>> text = open(filename, 'r').read()

To correct this code, the encoding of the bytes in ``filename`` should be
specified, either when reading from the file or before using the value::
specified, either when reading from the file or before using the value:

    # Fix 1: Open file as text
    filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read()
    text = open(filename, 'r').read()
    >>> # Fix 1: Open file as text
    >>> filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read()
    >>> text = open(filename, 'r').read()

    # Fix 2: Decode path
    filename = open('filename_in_mbcs.txt', 'rb').read()
    text = open(filename.decode('mbcs'), 'r').read()
    >>> # Fix 2: Decode path
    >>> filename = open('filename_in_mbcs.txt', 'rb').read()
    >>> text = open(filename.decode('mbcs'), 'r').read()


Explicitly using 'mbcs'
-----------------------

Code that explicitly encodes text using 'mbcs' before passing to file system
APIs. For example::
APIs is now passing incorrectly encoded bytes. For example:

    filename = open('files.txt', 'r').readline()
    text = open(filename.encode('mbcs'), 'r')
    >>> filename = open('files.txt', 'r').readline()
    >>> text = open(filename.encode('mbcs'), 'r')

To correct this code, the string should be passed without explicit encoding, or
should use ``os.fsencode()``::
should use ``os.fsencode()``:

    # Fix 1: Do not encode the string
    filename = open('files.txt', 'r').readline()
    text = open(filename, 'r')
    >>> # Fix 1: Do not encode the string
    >>> filename = open('files.txt', 'r').readline()
    >>> text = open(filename, 'r')

    # Fix 2: Use correct encoding
    filename = open('files.txt', 'r').readline()
    text = open(os.fsencode(filename), 'r')
    >>> # Fix 2: Use correct encoding
    >>> filename = open('files.txt', 'r').readline()
    >>> text = open(os.fsencode(filename), 'r')


References
==========

.. _Naming Files, Paths, and Namespaces:
   https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247.aspx

Copyright
=========