Convert PEPs 519, 528 and 529 from CRLF to LF line endings. (#236)

Serhiy Storchaka 2017-04-02 00:04:46 +03:00 committed by GitHub
parent 425a46fb20
commit d675175520
3 changed files with 1192 additions and 1192 deletions

File diff suppressed because it is too large

@ -1,182 +1,182 @@
PEP: 528
Title: Change Windows console encoding to UTF-8
Version: $Revision$
Last-Modified: $Date$
Author: Steve Dower <steve.dower@python.org>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 27-Aug-2016
Python-Version: 3.6
Post-History: 01-Sep-2016, 04-Sep-2016
Resolution: https://mail.python.org/pipermail/python-dev/2016-September/146278.html

Abstract
========

Historically, Python uses the ANSI APIs for interacting with the Windows
operating system, often via C Runtime functions. However, these have long been
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
the active code page.

This PEP proposes changing the default standard stream implementation on Windows
to use the Unicode APIs. This will allow users to print and input the full range
of Unicode characters at the default Windows console. This also requires a
subtle change to how the tokenizer parses text from readline hooks.

Specific Changes
================

Add _io.WindowsConsoleIO
------------------------

Currently an instance of ``_io.FileIO`` is used to wrap the file descriptors
representing standard input, output and error. We add a new class (implemented
in C) ``_io.WindowsConsoleIO`` that acts as a raw IO object using the Windows
console functions, specifically, ``ReadConsoleW`` and ``WriteConsoleW``.

This class will be used when the legacy-mode flag is not in effect, when opening
a standard stream by file descriptor and the stream is a console buffer rather
than a redirected file. Otherwise, ``_io.FileIO`` will be used as it is today.

This is a raw (bytes) IO class that requires text to be passed encoded with
utf-8, which will be transcoded to utf-16-le and passed to the Windows APIs.
Similarly, bytes read from the class will be provided by the operating system as
utf-16-le and converted into utf-8 when returned to Python.
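
The transcoding described above can be sketched in pure Python. This is only an
illustration of the byte-level conversion, not the C implementation of
``_io.WindowsConsoleIO``; the helper names ``to_console`` and ``from_console``
are hypothetical.

```python
def to_console(data: bytes) -> bytes:
    """UTF-8 bytes written by Python -> UTF-16-LE, as WriteConsoleW expects (sketch)."""
    return data.decode("utf-8").encode("utf-16-le")


def from_console(wide: bytes) -> bytes:
    """UTF-16-LE as produced by ReadConsoleW -> UTF-8 bytes returned to Python (sketch)."""
    return wide.decode("utf-16-le").encode("utf-8")


text = "caf\u00e9 \U0001F600"
utf8 = text.encode("utf-8")

# The conversion is lossless in both directions for valid text.
assert from_console(to_console(utf8)) == utf8

# Plain ASCII bytes are accepted as-is on the Python side and widened
# to two-byte code units before they reach the console API.
assert to_console(b"ok") == b"o\x00k\x00"
```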

The use of an ASCII compatible encoding is required to maintain compatibility
with code that bypasses the ``TextIOWrapper`` and directly writes ASCII bytes to
the standard streams (for example, `Twisted's process_stdinreader.py`_). Code that assumes
a particular encoding for the standard streams other than ASCII will likely
break.

Add _PyOS_WindowsConsoleReadline
--------------------------------

To allow Unicode entry at the interactive prompt, a new readline hook is
required. The existing ``PyOS_StdioReadline`` function will delegate to the new
``_PyOS_WindowsConsoleReadline`` function when reading from a file descriptor
that is a console buffer and the legacy-mode flag is not in effect (the logic
should be identical to above).

Since the readline interface is required to return an 8-bit encoded string with
no embedded nulls, the ``_PyOS_WindowsConsoleReadline`` function transcodes from
utf-16-le as read from the operating system into utf-8.

The function ``PyRun_InteractiveOneObject``, which currently obtains the encoding
from ``sys.stdin``, will select utf-8 unless the legacy-mode flag is in effect.
This may require readline hooks to change their encodings to utf-8, or to
require legacy-mode for correct behaviour.

Add legacy mode
---------------

Launching Python with the environment variable ``PYTHONLEGACYWINDOWSSTDIO`` set
will enable the legacy-mode flag, which completely restores the previous
behaviour.

Alternative Approaches
======================

The `win_unicode_console package`_ is a pure-Python alternative to changing the
default behaviour of the console. It implements essentially the same
modifications as described here using pure Python code.

Code that may break
===================

The following code patterns may break or see different behaviour as a result of
this change. All of these code samples require explicitly choosing to use a raw
file object in place of a more convenient wrapper that would prevent any visible
change.

Assuming stdin/stdout encoding
------------------------------

Code that assumes that the encoding required by ``sys.stdin.buffer`` or
``sys.stdout.buffer`` is ``'mbcs'`` or a more specific encoding may currently be
working by chance, but could encounter issues under this change. For example::

    >>> sys.stdout.buffer.write(text.encode('mbcs'))
    >>> r = sys.stdin.buffer.read(16).decode('cp437')

To correct this code, the encoding specified on the ``TextIOWrapper`` should be
used, either implicitly or explicitly::

    >>> # Fix 1: Use wrapper correctly
    >>> sys.stdout.write(text)
    >>> r = sys.stdin.read(16)

    >>> # Fix 2: Use encoding explicitly
    >>> sys.stdout.buffer.write(text.encode(sys.stdout.encoding))
    >>> r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)

Incorrectly using the raw object
--------------------------------

Code that uses the raw IO object and does not correctly handle partial reads and
writes may be affected. This is particularly important for reads, where the
number of characters read will never exceed one-fourth of the number of bytes
allowed, as there is no feasible way to prevent input from encoding as much
longer utf-8 strings::

    >>> raw_stdin = sys.stdin.buffer.raw
    >>> data = raw_stdin.read(15)
    abcdefghijklm
    b'abc'
    # data contains at most 3 characters, and never more than 12 bytes
    # error, as "defghijklm\r\n" is passed to the interactive prompt

To correct this code, the buffered reader/writer should be used, or the caller
should continue reading until its buffer is full::

    >>> # Fix 1: Use the buffered reader/writer
    >>> stdin = sys.stdin.buffer
    >>> data = stdin.read(15)
    abcdefghijklm
    b'abcdefghijklm\r\n'

    >>> # Fix 2: Loop until enough bytes have been read
    >>> raw_stdin = sys.stdin.buffer.raw
    >>> b = b''
    >>> while len(b) < 15:
    ...     b += raw_stdin.read(15)
    abcdefghijklm
    b'abcdefghijklm\r\n'
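
The "loop until enough bytes have been read" fix can also be written as a small
helper that stops at end of input instead of looping forever. The sketch below
is a generic illustration: ``read_exactly`` and ``ShortReadRaw`` are
hypothetical names, and ``ShortReadRaw`` merely imitates a raw stream that
returns short reads; it does not model the console or its four-byte minimum
request size.

```python
import io


class ShortReadRaw(io.RawIOBase):
    """Hypothetical raw stream that returns at most 3 bytes per read() call."""

    def __init__(self, payload):
        self._inner = io.BytesIO(payload)

    def readable(self):
        return True

    def readinto(self, buf):
        chunk = self._inner.read(min(len(buf), 3))
        buf[:len(chunk)] = chunk
        return len(chunk)


def read_exactly(raw, size):
    """Read from a raw stream until `size` bytes arrive or EOF is reached."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = raw.read(remaining)
        if not chunk:  # b'' signals EOF; None means no data is available yet
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# A single raw read is partial; the helper keeps going until it has 15 bytes.
assert ShortReadRaw(b"abcdefghijklm\r\n").read(15) == b"abc"
assert read_exactly(ShortReadRaw(b"abcdefghijklm\r\n"), 15) == b"abcdefghijklm\r\n"
```

Stopping on an empty read is a design choice for the sketch: it returns
whatever was collected rather than raising, so callers that need exactly
``size`` bytes should check the length of the result.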

Using the raw object with small buffers
---------------------------------------

Code that uses the raw IO object and attempts to read fewer than four bytes
will now receive an error. Because it's possible that any single character may
require up to four bytes when represented in utf-8, such requests must fail::

    >>> raw_stdin = sys.stdin.buffer.raw
    >>> data = raw_stdin.read(3)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: must read at least 4 bytes

The only workaround is to pass a larger buffer::

    >>> # Fix: Request at least four bytes
    >>> raw_stdin = sys.stdin.buffer.raw
    >>> data = raw_stdin.read(4)
    a
    b'a'
    >>> >>>

(The extra ``>>>`` is due to the newline remaining in the input buffer and is
expected in this situation.)
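
The four-byte figure follows from the UTF-8 encoding itself, and can be checked
directly. The snippet below is a small illustration rather than part of the
proposed implementation; ``max_chars_for`` is a hypothetical helper that
expresses the "one-fourth of the number of bytes" rule from the previous
section.

```python
import sys


def max_chars_for(nbytes: int) -> int:
    """Characters guaranteed to fit in `nbytes` of UTF-8 (hypothetical helper)."""
    return nbytes // 4


# One sample per UTF-8 width, ending with the highest code point.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600", chr(sys.maxunicode)]
widths = [len(ch.encode("utf-8")) for ch in samples]

assert widths == [1, 2, 3, 4, 4]   # no code point needs more than four bytes
assert max_chars_for(15) == 3      # read(15) can safely return 3 characters
assert max_chars_for(3) == 0       # read(3) cannot guarantee even one
```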

Copyright
=========

This document has been placed in the public domain.

References
==========

.. _Twisted's process_stdinreader.py: https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py
.. _win_unicode_console package: https://pypi.org/project/win_unicode_console/

@ -1,453 +1,453 @@
PEP: 529
Title: Change Windows filesystem encoding to UTF-8
Version: $Revision$
Last-Modified: $Date$
Author: Steve Dower <steve.dower@python.org>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 27-Aug-2016
Python-Version: 3.6
Post-History: 01-Sep-2016, 04-Sep-2016
Resolution: https://mail.python.org/pipermail/python-dev/2016-September/146277.html

Abstract
========

Historically, Python uses the ANSI APIs for interacting with the Windows
operating system, often via C Runtime functions. However, these have long been
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
the active code page. See `Naming Files, Paths, and Namespaces`_ for
more details.

This PEP proposes changing the default filesystem encoding on Windows to utf-8,
and changing all filesystem functions to use the Unicode APIs for filesystem
paths. This will not affect code that uses strings to represent paths; however,
code that uses bytes for paths will now be able to correctly round-trip all
valid paths in Windows filesystems. Currently, the conversions between Unicode
(in the OS) and bytes (in Python) are lossy and fail to round-trip
characters outside of the user's active code page.

Notably, this does not impact the encoding of the contents of files. These will
continue to default to ``locale.getpreferredencoding()`` (for text files) or
plain bytes (for binary files). This only affects the encoding used when users
pass a bytes object to Python where it is then passed to the operating system as
a path name.

Background
==========

File system paths are almost universally represented as text with an encoding
determined by the file system. In Python, we expose these paths via a number of
interfaces, such as the ``os`` and ``io`` modules. Paths may be passed in either
direction across these interfaces, that is, from the filesystem to the
application (for example, ``os.listdir()``), or from the application to the
filesystem (for example, ``os.unlink()``).

When paths are passed between the filesystem and the application, they are
either passed through as a bytes blob or converted to/from str using
``os.fsencode()`` and ``os.fsdecode()`` or explicit encoding using
``sys.getfilesystemencoding()``. The result of encoding a string with
``sys.getfilesystemencoding()`` is a blob of bytes in the native format for the
default file system.

On Windows, the native format for the filesystem is utf-16-le. The recommended
platform APIs for accessing the filesystem all accept and return text encoded in
this format. However, prior to Windows NT (and possibly further back), the
native format was a configurable machine option and a separate set of APIs
existed to accept this format. The option (the "active code page") and these
APIs (the "\*A functions") still exist in recent versions of Windows for
backwards compatibility, though new functionality often only has a utf-16-le API
(the "\*W functions").

In Python, str is recommended because it can correctly round-trip all characters
used in paths (on POSIX with surrogateescape handling; on Windows because str
maps to the native representation). On Windows, bytes cannot round-trip all
characters used in paths, as Python internally uses the \*A functions and hence
the encoding is "whatever the active code page is". Since the active code page
cannot represent all Unicode characters, the conversion of a path into bytes can
lose information without warning or any available indication.

As a demonstration of this::

    >>> open('test\uAB00.txt', 'wb').close()
    >>> import glob
    >>> glob.glob('test*')
    ['test\uab00.txt']
    >>> glob.glob(b'test*')
    [b'test?.txt']

The Unicode character in the second call to glob has been replaced by a '?',
which means passing the path back into the filesystem will result in a
``FileNotFoundError``. The same results may be observed with ``os.listdir()`` or
any function that matches the return type to the parameter type.
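
The information loss can be reproduced on any platform by encoding the file
name with a single-byte code page and the ``replace`` error handler. In this
sketch ``cp1252`` is only an assumed stand-in for "the active code page"; the
actual code page depends on the machine's configuration.

```python
original = "test\uab00.txt"

# Assumption: cp1252 plays the role of the active code page here.
as_bytes = original.encode("cp1252", errors="replace")
round_tripped = as_bytes.decode("cp1252")

assert as_bytes == b"test?.txt"      # the unrepresentable character became '?'
assert round_tripped != original     # so the original name cannot be recovered
```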

While one user-accessible fix is to use str everywhere, POSIX systems generally
do not suffer from data loss when using bytes exclusively as the bytes are the
canonical representation. Even if the encoding is "incorrect" by some standard,
the file system will still map the bytes back to the file. Making use of this
avoids the cost of decoding and reencoding, such that (theoretically, and only
on POSIX), code such as this may be faster because of the use of ``b'.'``
compared to using ``'.'``::

    >>> for f in os.listdir(b'.'):
    ...     os.stat(f)
    ...

As a result, POSIX-focused library authors prefer to use bytes to represent
paths. For some authors it is also a convenience, as their code may receive
bytes already known to be encoded correctly, while others are attempting to
simplify porting their code from Python 2. However, the correctness assumptions
do not carry over to Windows where Unicode is the canonical representation, and
errors may result. This potential data loss is why the use of bytes paths on
Windows was deprecated in Python 3.3 - all of the above code snippets produce
deprecation warnings on Windows.

Proposal
========

Currently the default filesystem encoding is 'mbcs', which is a meta-encoder
that uses the active code page. However, when bytes are passed to the filesystem
they go through the \*A APIs and the operating system handles encoding. In this
case, paths are always encoded using the equivalent of 'mbcs:replace' with no
opportunity for Python to override or change this.

This proposal would remove all use of the \*A APIs and only ever call the \*W
APIs. When Windows returns paths to Python as ``str``, they will be decoded from
utf-16-le and returned as text (in whatever the minimal representation is). When
Python code requests paths as ``bytes``, the paths will be transcoded from
utf-16-le into utf-8 using surrogatepass (Windows does not validate surrogate
pairs, so it is possible to have invalid surrogates in filenames). Equally, when
paths are provided as ``bytes``, they are transcoded from utf-8 into utf-16-le
and passed to the \*W APIs.
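
The role of ``surrogatepass`` can be illustrated with Python's own codecs. This
is a sketch of the byte-level transcoding only, not of the C code path; the
helper names ``wide_to_bytes_path`` and ``bytes_to_wide_path`` are hypothetical.

```python
def wide_to_bytes_path(wide: bytes) -> bytes:
    """utf-16-le from the OS -> utf-8 bytes for Python, keeping lone surrogates."""
    return wide.decode("utf-16-le", "surrogatepass").encode("utf-8", "surrogatepass")


def bytes_to_wide_path(path: bytes) -> bytes:
    """utf-8 bytes from Python -> utf-16-le for the *W APIs, keeping lone surrogates."""
    return path.decode("utf-8", "surrogatepass").encode("utf-16-le", "surrogatepass")


# A file name containing an unpaired surrogate, which Windows does not reject.
wide_name = "bad\ud800name".encode("utf-16-le", "surrogatepass")

utf8_name = wide_to_bytes_path(wide_name)
assert utf8_name == b"bad\xed\xa0\x80name"
assert bytes_to_wide_path(utf8_name) == wide_name   # the round trip is exact

# Without surrogatepass the same name cannot be encoded at all.
try:
    "bad\ud800name".encode("utf-8")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False
assert strict_failed
```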

The use of utf-8 will not be configurable, except for the provision of a
"legacy mode" flag to revert to the previous behaviour.

The ``surrogateescape`` error mode does not apply here, as the concern is not
about retaining nonsensical bytes. Any path returned from the operating system
will be valid Unicode, while invalid paths created by the user should raise a
decoding error (currently these would raise ``OSError`` or a subclass).

The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure the
ability to round-trip path names and allow basic manipulation (for example,
using the ``os.path`` module) when assuming an ASCII-compatible encoding. Using
utf-16-le as the encoding is more pure, but will cause more issues than are
resolved.

This change would also undeprecate the use of bytes paths on Windows. No change
to the semantics of using bytes as a path is required - as before, they must be
encoded with the encoding specified by ``sys.getfilesystemencoding()``.

Specific Changes
================

Update sys.getfilesystemencoding
--------------------------------

Remove the default value for ``Py_FileSystemDefaultEncoding`` and set it in
``initfsencoding()`` to utf-8, or, if the legacy-mode switch is enabled, to mbcs.

Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and
``PyUnicode_EncodeFSDefault()`` to use the utf-8 codec, or, if the legacy-mode
switch is enabled, the existing mbcs codec.

Add sys.getfilesystemencodeerrors
---------------------------------

As the error mode may now change between ``surrogatepass`` and ``replace``,
Python code that manually performs encoding also needs access to the current
error mode. This includes the implementation of ``os.fsencode()`` and
``os.fsdecode()``, which currently assume an error mode based on the codec.

Add a public ``Py_FileSystemDefaultEncodeErrors``, similar to the existing
``Py_FileSystemDefaultEncoding``. The default value on Windows will be
``surrogatepass`` or, in legacy mode, ``replace``. The default value on all other
platforms will be ``surrogateescape``.

Add a public ``sys.getfilesystemencodeerrors()`` function that returns the
current error mode.

Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and
``PyUnicode_EncodeFSDefault()`` to use the variable for error mode rather than
constant strings.

Update the implementations of ``os.fsencode()`` and ``os.fsdecode()`` to use
``sys.getfilesystemencodeerrors()`` instead of assuming the mode.
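
For code that encodes paths by hand, the two ``sys`` functions can be combined
as shown below. This is a minimal sketch assuming Python 3.6 or later; the
helper names ``manual_fsencode`` and ``manual_fsdecode`` are hypothetical and
ignore the path-like object handling that ``os.fsencode()`` and
``os.fsdecode()`` also perform.

```python
import os
import sys


def manual_fsencode(name: str) -> bytes:
    """Encode a str path with the current filesystem codec and error mode."""
    return name.encode(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())


def manual_fsdecode(data: bytes) -> str:
    """Decode a bytes path with the current filesystem codec and error mode."""
    return data.decode(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())


# For plain str and bytes inputs the helpers agree with the os functions.
assert manual_fsencode("example.txt") == os.fsencode("example.txt")
assert manual_fsdecode(b"example.txt") == os.fsdecode(b"example.txt")
```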

Update path_converter
---------------------

Update the path converter to always decode bytes or buffer objects into text
using ``PyUnicode_DecodeFSDefaultAndSize()``.

Change the ``narrow`` field from a ``char*`` string into a flag that indicates
whether the original object was bytes. This is required for functions that need
to return paths using the same type as was originally provided.

Remove unused ANSI code
-----------------------

Remove all code paths using the ``narrow`` field, as these will no longer be
reachable by any caller. These are only used within ``posixmodule.c``. Other
uses of paths should have use of bytes paths replaced with decoding and use of
the \*W APIs.

Add legacy mode
---------------

Add a legacy mode flag, enabled by the environment variable
``PYTHONLEGACYWINDOWSFSENCODING`` or by a function call to
``sys._enablelegacywindowsfsencoding()``. The function call can only be
used to enable the flag and should be used by programs as close to
initialization as possible. Legacy mode cannot be disabled while Python is
running.

When this flag is set, the default filesystem encoding is set to mbcs rather
than utf-8, and the error mode is set to ``replace`` rather than
``surrogatepass``. Paths will continue to decode to wide characters and only \*W
APIs will be called; however, the bytes passed in and received from Python will
be encoded the same as prior to this change.
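
The practical difference between the two modes can be seen with plain
``str.encode()`` calls. This is only a sketch: ``cp1252`` stands in for the
active code page as an assumption, because the ``mbcs`` codec is available only
on Windows.

```python
# Assumption: cp1252 plays the role of the active code page ("mbcs").
LEGACY = ("cp1252", "replace")
DEFAULT = ("utf-8", "surrogatepass")

name = "\u6587\u4ef6.txt"  # two CJK characters followed by ".txt"

legacy_bytes = name.encode(*LEGACY)
default_bytes = name.encode(*DEFAULT)

# Legacy mode is lossy: characters outside the code page become "?".
print(legacy_bytes)
# The default mode round-trips the original name.
print(default_bytes.decode(*DEFAULT) == name)

# A lone surrogate is not valid UTF-16, yet surrogatepass still round-trips it.
lone = "\ud800.txt"
print(lone.encode(*DEFAULT).decode(*DEFAULT) == lone)
```

Once a name has been encoded in the legacy style, the original characters cannot
be recovered from the bytes, which is why legacy mode is offered only for
compatibility.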

Undeprecate bytes paths on Windows
----------------------------------

Using bytes as paths on Windows is currently deprecated. We would announce that
this is no longer the case, and that paths when encoded as bytes should use
whatever is returned from ``sys.getfilesystemencoding()`` rather than the user's
active code page.
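
A minimal sketch of what this means for calling code: bytes paths are produced
with ``os.fsencode()``, and bytes results are decoded with the encoding reported
by ``sys``. The temporary directory and the file name ``notes.txt`` are
arbitrary choices for the example.

```python
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("hello")

    # Passing a bytes path returns bytes names.
    names = os.listdir(os.fsencode(tmp))

    # Decode with the filesystem encoding, not a hard-coded code page.
    decoded = [
        n.decode(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
        for n in names
    ]

print(names, decoded)
```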

Beta experiment
---------------

To assist with determining the impact of this change, we propose applying it to
3.6.0b1 provisionally, with the intent of making a final decision before
3.6.0b4.

During the experiment period, decoding and encoding exception messages will be
expanded to include a link to an active online discussion and encourage
reporting of problems.

If it is decided to revert the functionality for 3.6.0b4, the implementation
change would be to permanently enable the legacy mode flag, and to change the
environment variable to ``PYTHONWINDOWSUTF8FSENCODING`` and the function to
``sys._enablewindowsutf8fsencoding()``, so that the functionality can be enabled
on a case-by-case basis rather than disabled.

It is expected that if we cannot feasibly make the change for 3.6 due to
compatibility concerns, it will not be possible to make the change at any later
time in Python 3.x.

Affected Modules
----------------

This PEP implicitly includes all modules within Python that either pass path
names to the operating system, or otherwise use ``sys.getfilesystemencoding()``.

As of 3.6.0a4, the following modules require modification:

* ``os``
* ``_overlapped``
* ``_socket``
* ``subprocess``
* ``zipimport``

The following modules use ``sys.getfilesystemencoding()`` but do not need
modification:

* ``gc`` (already assumes bytes are utf-8)
* ``grp`` (not compiled for Windows)
* ``http.server`` (correctly includes codec name with transmitted data)
* ``idlelib.editor`` (should not be needed; has fallback handling)
* ``nis`` (not compiled for Windows)
* ``pwd`` (not compiled for Windows)
* ``spwd`` (not compiled for Windows)
* ``_ssl`` (only used for ASCII constants)
* ``tarfile`` (code unused on Windows)
* ``_tkinter`` (already assumes bytes are utf-8)
* ``wsgiref`` (assumed as the default encoding for unknown environments)
* ``zipapp`` (code unused on Windows)

The following native code uses one of the encoding or decoding functions, but
does not require any modification:

* ``Parser/parsetok.c`` (docs already specify ``sys.getfilesystemencoding()``)
* ``Python/ast.c`` (docs already specify ``sys.getfilesystemencoding()``)
* ``Python/compile.c`` (undocumented, but Python filesystem encoding implied)
* ``Python/errors.c`` (docs already specify ``os.fsdecode()``)
* ``Python/fileutils.c`` (code unused on Windows)
* ``Python/future.c`` (undocumented, but Python filesystem encoding implied)
* ``Python/import.c`` (docs already specify utf-8)
* ``Python/importdl.c`` (code unused on Windows)
* ``Python/pythonrun.c`` (docs already specify ``sys.getfilesystemencoding()``)
* ``Python/symtable.c`` (undocumented, but Python filesystem encoding implied)
* ``Python/thread.c`` (code unused on Windows)
* ``Python/traceback.c`` (encodes correctly for comparing strings)
* ``Python/_warnings.c`` (docs already specify ``os.fsdecode()``)


Rejected Alternatives
=====================

Use strict mbcs decoding
------------------------

This is essentially the same as the proposed change, but instead of changing
``sys.getfilesystemencoding()`` to utf-8 it is changed to mbcs (which
dynamically maps to the active code page).

This approach allows the use of new functionality that is only available as \*W
APIs and also detection of encoding/decoding errors. For example, rather than
silently replacing Unicode characters with '?', it would be possible to warn or
fail the operation.
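
To make the "warn or fail" behaviour of this rejected alternative concrete, here
is a small sketch. The function ``encode_path_strict`` is hypothetical, and
``cp1252`` is assumed as a stand-in for the active code page because ``mbcs`` is
available only on Windows.

```python
import warnings


def encode_path_strict(path: str, encoding: str = "cp1252",
                       fail: bool = True) -> bytes:
    """Encode a path, failing or warning when characters cannot be encoded."""
    try:
        return path.encode(encoding, "strict")
    except UnicodeEncodeError:
        if fail:
            raise
        # Warn instead of failing, then fall back to lossy replacement.
        warnings.warn(
            f"{path!r} is not representable in {encoding}; using '?'",
            UnicodeWarning,
        )
        return path.encode(encoding, "replace")
```

Either choice only makes the data loss visible; the set of paths that can be
represented as bytes stays limited to the code page.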

Compared to the proposed fix, this could enable some new functionality but does
not fix any of the problems described initially. New runtime errors may cause
some problems to be more obvious and lead to fixes, provided library maintainers
are interested in supporting Windows and adding a separate code path to treat
filesystem paths as strings.

Making the encoding mbcs without strict errors is equivalent to the legacy-mode
switch being enabled by default. This is a possible course of action if there is
significant breakage of actual code and a need to extend the deprecation period,
but still a desire to have the simplifications to the CPython source.

Make bytes paths an error on Windows
------------------------------------

By preventing the use of bytes paths on Windows completely we prevent users from
hitting encoding issues.

However, the motivation for this PEP is to increase the likelihood that code
written on POSIX will also work correctly on Windows. This alternative would
move in the other direction and make such code completely incompatible. As this
does not benefit users in any way, we reject it.

Make bytes paths an error on all platforms
------------------------------------------

By deprecating and then disabling the use of bytes paths on all platforms we
prevent users from hitting encoding issues regardless of where the code was
originally written. This would require a full deprecation cycle, as there are
currently no warnings on platforms other than Windows.

This is likely to be seen as a hostile action against Python developers in
general, and as such is rejected at this time.


Code that may break
===================

The following code patterns may break or see different behaviour as a result of
this change. Each of these examples would have been fragile in code intended for
cross-platform use. The suggested fixes demonstrate the most compatible way to
handle path encoding issues across all platforms and across multiple Python
versions.

Note that all of these examples produce deprecation warnings on Python 3.3 and
later.

Not managing encodings across boundaries
----------------------------------------

Code that does not manage encodings when crossing protocol boundaries may
currently be working by chance, but could encounter issues when either encoding
changes. Note that the source of ``filename`` may be any function that returns
a bytes object, as illustrated in a second example below::

    >>> filename = open('filename_in_mbcs.txt', 'rb').read()
    >>> text = open(filename, 'r').read()

To correct this code, the encoding of the bytes in ``filename`` should be
specified, either when reading from the file or before using the value::

    >>> # Fix 1: Open file as text (default encoding)
    >>> filename = open('filename_in_mbcs.txt', 'r').read()
    >>> text = open(filename, 'r').read()

    >>> # Fix 2: Open file as text (explicit encoding)
    >>> filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read()
    >>> text = open(filename, 'r').read()

    >>> # Fix 3: Explicitly decode the path
    >>> filename = open('filename_in_mbcs.txt', 'rb').read()
    >>> text = open(filename.decode('mbcs'), 'r').read()

Where the creator of ``filename`` is separated from the user of ``filename``,
the encoding is important information to include::

    >>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('mbcs')

    >>> filename = some_object.filename
    >>> type(filename)
    <class 'bytes'>
    >>> text = open(filename, 'r').read()

To fix this code for best compatibility across operating systems and Python
versions, the filename should be exposed as str::

    >>> # Fix 1: Expose as str
    >>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'

    >>> filename = some_object.filename
    >>> type(filename)
    <class 'str'>
    >>> text = open(filename, 'r').read()

Alternatively, the encoding used for the path needs to be made available to the
user. Specifying ``os.fsencode()`` (or ``sys.getfilesystemencoding()``) is an
acceptable choice, or a new attribute could be added with the exact encoding::

    >>> # Fix 2: Use fsencode
    >>> some_object.filename = os.fsencode(r'C:\Users\Steve\Documents\my_file.txt')

    >>> filename = some_object.filename
    >>> type(filename)
    <class 'bytes'>
    >>> text = open(filename, 'r').read()

    >>> # Fix 3: Expose as explicit encoding
    >>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('cp437')
    >>> some_object.filename_encoding = 'cp437'

    >>> filename = some_object.filename
    >>> type(filename)
    <class 'bytes'>
    >>> filename = filename.decode(some_object.filename_encoding)
    >>> type(filename)
    <class 'str'>
    >>> text = open(filename, 'r').read()

Explicitly using 'mbcs'
-----------------------

Code that explicitly encodes text using 'mbcs' before passing to file system
APIs is now passing incorrectly encoded bytes. Note that the source of
``filename`` in this example is not relevant, provided that it is a str::

    >>> filename = open('files.txt', 'r').readline().rstrip()
    >>> text = open(filename.encode('mbcs'), 'r')

To correct this code, the string should be passed without explicit encoding, or
should use ``os.fsencode()``::

    >>> # Fix 1: Do not encode the string
    >>> filename = open('files.txt', 'r').readline().rstrip()
    >>> text = open(filename, 'r')

    >>> # Fix 2: Use correct encoding
    >>> filename = open('files.txt', 'r').readline().rstrip()
    >>> text = open(os.fsencode(filename), 'r')


References
==========

.. _Naming Files, Paths, and Namespaces:
   https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247.aspx


Copyright
=========

This document has been placed in the public domain.