Convert PEPs 519, 528 and 529 from CRLF to LF line endings. (#236)
parent
425a46fb20
commit
d675175520
1114
pep-0519.txt
File diff suppressed because it is too large
364
pep-0528.txt
|
@@ -1,182 +1,182 @@
|
||||||
PEP: 528
|
PEP: 528
|
||||||
Title: Change Windows console encoding to UTF-8
|
Title: Change Windows console encoding to UTF-8
|
||||||
Version: $Revision$
|
Version: $Revision$
|
||||||
Last-Modified: $Date$
|
Last-Modified: $Date$
|
||||||
Author: Steve Dower <steve.dower@python.org>
|
Author: Steve Dower <steve.dower@python.org>
|
||||||
Status: Final
|
Status: Final
|
||||||
Type: Standards Track
|
Type: Standards Track
|
||||||
Content-Type: text/x-rst
|
Content-Type: text/x-rst
|
||||||
Created: 27-Aug-2016
|
Created: 27-Aug-2016
|
||||||
Python-Version: 3.6
|
Python-Version: 3.6
|
||||||
Post-History: 01-Sep-2016, 04-Sep-2016
|
Post-History: 01-Sep-2016, 04-Sep-2016
|
||||||
Resolution: https://mail.python.org/pipermail/python-dev/2016-September/146278.html
|
Resolution: https://mail.python.org/pipermail/python-dev/2016-September/146278.html
|
||||||
|
|
||||||
Abstract
|
Abstract
|
||||||
========
|
========
|
||||||
|
|
||||||
Historically, Python uses the ANSI APIs for interacting with the Windows
|
Historically, Python uses the ANSI APIs for interacting with the Windows
|
||||||
operating system, often via C Runtime functions. However, these have been long
|
operating system, often via C Runtime functions. However, these have been long
|
||||||
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
|
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
|
||||||
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
|
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
|
||||||
the active code page.
|
the active code page.
|
||||||
|
|
||||||
This PEP proposes changing the default standard stream implementation on Windows
|
This PEP proposes changing the default standard stream implementation on Windows
|
||||||
to use the Unicode APIs. This will allow users to print and input the full range
|
to use the Unicode APIs. This will allow users to print and input the full range
|
||||||
of Unicode characters at the default Windows console. This also requires a
|
of Unicode characters at the default Windows console. This also requires a
|
||||||
subtle change to how the tokenizer parses text from readline hooks.
|
subtle change to how the tokenizer parses text from readline hooks.
|
||||||
|
|
||||||
Specific Changes
|
Specific Changes
|
||||||
================
|
================
|
||||||
|
|
||||||
Add _io.WindowsConsoleIO
|
Add _io.WindowsConsoleIO
|
||||||
------------------------
|
------------------------
|
||||||
|
|
||||||
Currently an instance of ``_io.FileIO`` is used to wrap the file descriptors
|
Currently an instance of ``_io.FileIO`` is used to wrap the file descriptors
|
||||||
representing standard input, output and error. We add a new class (implemented
|
representing standard input, output and error. We add a new class (implemented
|
||||||
in C) ``_io.WindowsConsoleIO`` that acts as a raw IO object using the Windows
|
in C) ``_io.WindowsConsoleIO`` that acts as a raw IO object using the Windows
|
||||||
console functions, specifically, ``ReadConsoleW`` and ``WriteConsoleW``.
|
console functions, specifically, ``ReadConsoleW`` and ``WriteConsoleW``.
|
||||||
|
|
||||||
This class will be used when the legacy-mode flag is not in effect, when opening
|
This class will be used when the legacy-mode flag is not in effect, when opening
|
||||||
a standard stream by file descriptor and the stream is a console buffer rather
|
a standard stream by file descriptor and the stream is a console buffer rather
|
||||||
than a redirected file. Otherwise, ``_io.FileIO`` will be used as it is today.
|
than a redirected file. Otherwise, ``_io.FileIO`` will be used as it is today.
|
||||||
|
|
||||||
This is a raw (bytes) IO class that requires text to be passed encoded with
|
This is a raw (bytes) IO class that requires text to be passed encoded with
|
||||||
utf-8, which will be decoded to utf-16-le and passed to the Windows APIs.
|
utf-8, which will be decoded to utf-16-le and passed to the Windows APIs.
|
||||||
Similarly, bytes read from the class will be provided by the operating system as
|
Similarly, bytes read from the class will be provided by the operating system as
|
||||||
utf-16-le and converted into utf-8 when returned to Python.
|
utf-16-le and converted into utf-8 when returned to Python.
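The byte-level conversion described above can be sketched in pure Python. This is only an illustration of the transcoding step, not the C implementation; the helper names `utf8_to_console` and `console_to_utf8` are hypothetical.

```python
def utf8_to_console(data: bytes) -> bytes:
    """Transcode UTF-8 bytes written by Python into UTF-16-LE for the console."""
    return data.decode("utf-8").encode("utf-16-le")


def console_to_utf8(data: bytes) -> bytes:
    """Transcode UTF-16-LE bytes read from the console into UTF-8 for Python."""
    return data.decode("utf-16-le").encode("utf-8")


written = "h\u00e9llo".encode("utf-8")  # what a caller passes to write()
wide = utf8_to_console(written)         # what would reach WriteConsoleW
print(len(written), len(wide))
print(console_to_utf8(wide) == written)
```

ASCII bytes pass through the UTF-8 side unchanged, which is why code that writes plain ASCII directly to the raw stream keeps working.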
|
||||||
|
|
||||||
The use of an ASCII compatible encoding is required to maintain compatibility
|
The use of an ASCII compatible encoding is required to maintain compatibility
|
||||||
with code that bypasses the ``TextIOWrapper`` and directly writes ASCII bytes to
|
with code that bypasses the ``TextIOWrapper`` and directly writes ASCII bytes to
|
||||||
the standard streams (for example, `Twisted's process_stdinreader.py`_). Code that assumes
|
the standard streams (for example, `Twisted's process_stdinreader.py`_). Code that assumes
|
||||||
a particular encoding for the standard streams other than ASCII will likely
|
a particular encoding for the standard streams other than ASCII will likely
|
||||||
break.
|
break.
|
||||||
|
|
||||||
Add _PyOS_WindowsConsoleReadline
|
Add _PyOS_WindowsConsoleReadline
|
||||||
--------------------------------
|
--------------------------------
|
||||||
|
|
||||||
To allow Unicode entry at the interactive prompt, a new readline hook is
|
To allow Unicode entry at the interactive prompt, a new readline hook is
|
||||||
required. The existing ``PyOS_StdioReadline`` function will delegate to the new
|
required. The existing ``PyOS_StdioReadline`` function will delegate to the new
|
||||||
``_PyOS_WindowsConsoleReadline`` function when reading from a file descriptor
|
``_PyOS_WindowsConsoleReadline`` function when reading from a file descriptor
|
||||||
that is a console buffer and the legacy-mode flag is not in effect (the logic
|
that is a console buffer and the legacy-mode flag is not in effect (the logic
|
||||||
should be identical to above).
|
should be identical to above).
|
||||||
|
|
||||||
Since the readline interface is required to return an 8-bit encoded string with
|
Since the readline interface is required to return an 8-bit encoded string with
|
||||||
no embedded nulls, the ``_PyOS_WindowsConsoleReadline`` function transcodes from
|
no embedded nulls, the ``_PyOS_WindowsConsoleReadline`` function transcodes from
|
||||||
utf-16-le as read from the operating system into utf-8.
|
utf-16-le as read from the operating system into utf-8.
|
||||||
|
|
||||||
The function ``PyRun_InteractiveOneObject`` which currently obtains the encoding
|
The function ``PyRun_InteractiveOneObject`` which currently obtains the encoding
|
||||||
from ``sys.stdin`` will select utf-8 unless the legacy-mode flag is in effect.
|
from ``sys.stdin`` will select utf-8 unless the legacy-mode flag is in effect.
|
||||||
This may require readline hooks to change their encodings to utf-8, or to
|
This may require readline hooks to change their encodings to utf-8, or to
|
||||||
require legacy-mode for correct behaviour.
|
require legacy-mode for correct behaviour.
|
||||||
|
|
||||||
Add legacy mode
|
Add legacy mode
|
||||||
---------------
|
---------------
|
||||||
|
|
||||||
Launching Python with the environment variable ``PYTHONLEGACYWINDOWSSTDIO`` set
|
Launching Python with the environment variable ``PYTHONLEGACYWINDOWSSTDIO`` set
|
||||||
will enable the legacy-mode flag, which completely restores the previous
|
will enable the legacy-mode flag, which completely restores the previous
|
||||||
behaviour.
|
behaviour.
|
||||||
|
|
||||||
Alternative Approaches
|
Alternative Approaches
|
||||||
======================
|
======================
|
||||||
|
|
||||||
The `win_unicode_console package`_ is a pure-Python alternative to changing the
|
The `win_unicode_console package`_ is a pure-Python alternative to changing the
|
||||||
default behaviour of the console. It implements essentially the same
|
default behaviour of the console. It implements essentially the same
|
||||||
modifications as described here using pure Python code.
|
modifications as described here using pure Python code.
|
||||||
|
|
||||||
Code that may break
|
Code that may break
|
||||||
===================
|
===================
|
||||||
|
|
||||||
The following code patterns may break or see different behaviour as a result of
|
The following code patterns may break or see different behaviour as a result of
|
||||||
this change. All of these code samples require explicitly choosing to use a raw
|
this change. All of these code samples require explicitly choosing to use a raw
|
||||||
file object in place of a more convenient wrapper that would prevent any visible
|
file object in place of a more convenient wrapper that would prevent any visible
|
||||||
change.
|
change.
|
||||||
|
|
||||||
Assuming stdin/stdout encoding
|
Assuming stdin/stdout encoding
|
||||||
------------------------------
|
------------------------------
|
||||||
|
|
||||||
Code that assumes that the encoding required by ``sys.stdin.buffer`` or
|
Code that assumes that the encoding required by ``sys.stdin.buffer`` or
|
||||||
``sys.stdout.buffer`` is ``'mbcs'`` or a more specific encoding may currently be
|
``sys.stdout.buffer`` is ``'mbcs'`` or a more specific encoding may currently be
|
||||||
working by chance, but could encounter issues under this change. For example::
|
working by chance, but could encounter issues under this change. For example::
|
||||||
|
|
||||||
>>> sys.stdout.buffer.write(text.encode('mbcs'))
|
>>> sys.stdout.buffer.write(text.encode('mbcs'))
|
||||||
>>> r = sys.stdin.buffer.read(16).decode('cp437')
|
>>> r = sys.stdin.buffer.read(16).decode('cp437')
|
||||||
|
|
||||||
To correct this code, the encoding specified on the ``TextIOWrapper`` should be
|
To correct this code, the encoding specified on the ``TextIOWrapper`` should be
|
||||||
used, either implicitly or explicitly::
|
used, either implicitly or explicitly::
|
||||||
|
|
||||||
>>> # Fix 1: Use wrapper correctly
|
>>> # Fix 1: Use wrapper correctly
|
||||||
>>> sys.stdout.write(text)
|
>>> sys.stdout.write(text)
|
||||||
>>> r = sys.stdin.read(16)
|
>>> r = sys.stdin.read(16)
|
||||||
|
|
||||||
>>> # Fix 2: Use encoding explicitly
|
>>> # Fix 2: Use encoding explicitly
|
||||||
>>> sys.stdout.buffer.write(text.encode(sys.stdout.encoding))
|
>>> sys.stdout.buffer.write(text.encode(sys.stdout.encoding))
|
||||||
>>> r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)
|
>>> r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)
|
||||||
|
|
||||||
Incorrectly using the raw object
|
Incorrectly using the raw object
|
||||||
--------------------------------
|
--------------------------------
|
||||||
|
|
||||||
Code that uses the raw IO object and does not correctly handle partial reads and
|
Code that uses the raw IO object and does not correctly handle partial reads and
|
||||||
writes may be affected. This is particularly important for reads, where the
|
writes may be affected. This is particularly important for reads, where the
|
||||||
number of characters read will never exceed one-fourth of the number of bytes
|
number of characters read will never exceed one-fourth of the number of bytes
|
||||||
allowed, as there is no feasible way to prevent input from encoding as much
|
allowed, as there is no feasible way to prevent input from encoding as much
|
||||||
longer utf-8 strings::
|
longer utf-8 strings::
|
||||||
|
|
||||||
>>> raw_stdin = sys.stdin.buffer.raw
|
>>> raw_stdin = sys.stdin.buffer.raw
|
||||||
>>> data = raw_stdin.read(15)
|
>>> data = raw_stdin.read(15)
|
||||||
abcdefghijklm
|
abcdefghijklm
|
||||||
b'abc'
|
b'abc'
|
||||||
# data contains at most 3 characters, and never more than 12 bytes
|
# data contains at most 3 characters, and never more than 12 bytes
|
||||||
# error, as "defghijklm\r\n" is passed to the interactive prompt
|
# error, as "defghijklm\r\n" is passed to the interactive prompt
|
||||||
|
|
||||||
To correct this code, the buffered reader/writer should be used, or the caller
|
To correct this code, the buffered reader/writer should be used, or the caller
|
||||||
should continue reading until its buffer is full::
|
should continue reading until its buffer is full::
|
||||||
|
|
||||||
>>> # Fix 1: Use the buffered reader/writer
|
>>> # Fix 1: Use the buffered reader/writer
|
||||||
>>> stdin = sys.stdin.buffer
|
>>> stdin = sys.stdin.buffer
|
||||||
>>> data = stdin.read(15)
|
>>> data = stdin.read(15)
|
||||||
abcdefghijklm
|
abcdefghijklm
|
||||||
b'abcdefghijklm\r\n'
|
b'abcdefghijklm\r\n'
|
||||||
|
|
||||||
>>> # Fix 2: Loop until enough bytes have been read
|
>>> # Fix 2: Loop until enough bytes have been read
|
||||||
>>> raw_stdin = sys.stdin.buffer.raw
|
>>> raw_stdin = sys.stdin.buffer.raw
|
||||||
>>> b = b''
|
>>> b = b''
|
||||||
>>> while len(b) < 15:
|
>>> while len(b) < 15:
|
||||||
... b += raw_stdin.read(15)
|
... b += raw_stdin.read(15)
|
||||||
abcdefghijklm
|
abcdefghijklm
|
||||||
b'abcdefghijklm\r\n'
|
b'abcdefghijklm\r\n'
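The looping fix can also be tried without a Windows console. The sketch below assumes a hypothetical `ShortReader` class that stands in for a raw stream returning short reads, and a hypothetical `read_exactly` helper. Unlike the snippet above, the helper requests only the bytes still missing and stops at end of input. It does not account for the four-byte minimum read size described in the next section.

```python
import io


class ShortReader(io.RawIOBase):
    """Hypothetical raw stream that returns at most 3 bytes per read call."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload
        self._pos = 0

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Copy at most 3 bytes, fewer if the buffer or the payload is smaller.
        count = min(len(buffer), 3, len(self._payload) - self._pos)
        buffer[:count] = self._payload[self._pos:self._pos + count]
        self._pos += count
        return count


def read_exactly(raw, size: int) -> bytes:
    """Keep reading until `size` bytes arrive or the stream has no more data."""
    chunks = bytearray()
    while len(chunks) < size:
        chunk = raw.read(size - len(chunks))
        if not chunk:  # b'' signals end of input; None signals no data available
            break
        chunks += chunk
    return bytes(chunks)


data = read_exactly(ShortReader(b"abcdefghijklm\r\n"), 15)
print(data)
```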
|
||||||
|
|
||||||
Using the raw object with small buffers
|
Using the raw object with small buffers
|
||||||
---------------------------------------
|
---------------------------------------
|
||||||
|
|
||||||
Code that uses the raw IO object and attempts to read fewer than four bytes
|
Code that uses the raw IO object and attempts to read fewer than four bytes
|
||||||
will now receive an error. Because it's possible that any single character may
|
will now receive an error. Because it's possible that any single character may
|
||||||
require up to four bytes when represented in utf-8, requests must fail::
|
require up to four bytes when represented in utf-8, requests must fail::
|
||||||
|
|
||||||
>>> raw_stdin = sys.stdin.buffer.raw
|
>>> raw_stdin = sys.stdin.buffer.raw
|
||||||
>>> data = raw_stdin.read(3)
|
>>> data = raw_stdin.read(3)
|
||||||
Traceback (most recent call last):
|
Traceback (most recent call last):
|
||||||
File "<stdin>", line 1, in <module>
|
File "<stdin>", line 1, in <module>
|
||||||
ValueError: must read at least 4 bytes
|
ValueError: must read at least 4 bytes
|
||||||
|
|
||||||
The only workaround is to pass a larger buffer::
|
The only workaround is to pass a larger buffer::
|
||||||
|
|
||||||
>>> # Fix: Request at least four bytes
|
>>> # Fix: Request at least four bytes
|
||||||
>>> raw_stdin = sys.stdin.buffer.raw
|
>>> raw_stdin = sys.stdin.buffer.raw
|
||||||
>>> data = raw_stdin.read(4)
|
>>> data = raw_stdin.read(4)
|
||||||
a
|
a
|
||||||
b'a'
|
b'a'
|
||||||
>>> >>>
|
>>> >>>
|
||||||
|
|
||||||
(The extra ``>>>`` is due to the newline remaining in the input buffer and is
|
(The extra ``>>>`` is due to the newline remaining in the input buffer and is
|
||||||
expected in this situation.)
|
expected in this situation.)
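The four-byte minimum follows from the UTF-8 encoding itself. As a quick check (the sample characters are arbitrary choices for illustration), the encoded length grows from one to four bytes across the Unicode range and never exceeds four:

```python
# One sample character from each UTF-8 length class.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")

# The highest code point, U+10FFFF, still fits in four bytes.
longest = len(chr(0x10FFFF).encode("utf-8"))
print(longest)
```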
|
||||||
|
|
||||||
Copyright
|
Copyright
|
||||||
=========
|
=========
|
||||||
|
|
||||||
This document has been placed in the public domain.
|
This document has been placed in the public domain.
|
||||||
|
|
||||||
References
|
References
|
||||||
==========
|
==========
|
||||||
|
|
||||||
.. _Twisted's process_stdinreader.py: https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py
|
.. _Twisted's process_stdinreader.py: https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py
|
||||||
.. _win_unicode_console package: https://pypi.org/project/win_unicode_console/
|
.. _win_unicode_console package: https://pypi.org/project/win_unicode_console/
|
||||||
|
|
906
pep-0529.txt
|
@@ -1,453 +1,453 @@
|
||||||
PEP: 529
|
PEP: 529
|
||||||
Title: Change Windows filesystem encoding to UTF-8
|
Title: Change Windows filesystem encoding to UTF-8
|
||||||
Version: $Revision$
|
Version: $Revision$
|
||||||
Last-Modified: $Date$
|
Last-Modified: $Date$
|
||||||
Author: Steve Dower <steve.dower@python.org>
|
Author: Steve Dower <steve.dower@python.org>
|
||||||
Status: Final
|
Status: Final
|
||||||
Type: Standards Track
|
Type: Standards Track
|
||||||
Content-Type: text/x-rst
|
Content-Type: text/x-rst
|
||||||
Created: 27-Aug-2016
|
Created: 27-Aug-2016
|
||||||
Python-Version: 3.6
|
Python-Version: 3.6
|
||||||
Post-History: 01-Sep-2016, 04-Sep-2016
|
Post-History: 01-Sep-2016, 04-Sep-2016
|
||||||
Resolution: https://mail.python.org/pipermail/python-dev/2016-September/146277.html
|
Resolution: https://mail.python.org/pipermail/python-dev/2016-September/146277.html
|
||||||
|
|
||||||
Abstract
|
Abstract
|
||||||
========
|
========
|
||||||
|
|
||||||
Historically, Python uses the ANSI APIs for interacting with the Windows
|
Historically, Python uses the ANSI APIs for interacting with the Windows
|
||||||
operating system, often via C Runtime functions. However, these have been long
|
operating system, often via C Runtime functions. However, these have been long
|
||||||
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
|
discouraged in favor of the UTF-16 APIs. Within the operating system, all text
|
||||||
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
|
is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
|
||||||
the active code page. See `Naming Files, Paths, and Namespaces`_ for
|
the active code page. See `Naming Files, Paths, and Namespaces`_ for
|
||||||
more details.
|
more details.
|
||||||
|
|
||||||
This PEP proposes changing the default filesystem encoding on Windows to utf-8,
|
This PEP proposes changing the default filesystem encoding on Windows to utf-8,
|
||||||
and changing all filesystem functions to use the Unicode APIs for filesystem
|
and changing all filesystem functions to use the Unicode APIs for filesystem
|
||||||
paths. This will not affect code that uses strings to represent paths; however,
|
paths. This will not affect code that uses strings to represent paths; however,
|
||||||
code that uses bytes for paths will now be able to correctly round-trip all
|
code that uses bytes for paths will now be able to correctly round-trip all
|
||||||
valid paths in Windows filesystems. Currently, the conversions between Unicode
|
valid paths in Windows filesystems. Currently, the conversions between Unicode
|
||||||
(in the OS) and bytes (in Python) are lossy and fail to round-trip
|
(in the OS) and bytes (in Python) are lossy and fail to round-trip
|
||||||
characters outside of the user's active code page.
|
characters outside of the user's active code page.
|
||||||
|
|
||||||
Notably, this does not impact the encoding of the contents of files. These will
|
Notably, this does not impact the encoding of the contents of files. These will
|
||||||
continue to default to ``locale.getpreferredencoding()`` (for text files) or
|
continue to default to ``locale.getpreferredencoding()`` (for text files) or
|
||||||
plain bytes (for binary files). This only affects the encoding used when users
|
plain bytes (for binary files). This only affects the encoding used when users
|
||||||
pass a bytes object to Python where it is then passed to the operating system as
|
pass a bytes object to Python where it is then passed to the operating system as
|
||||||
a path name.
|
a path name.
|
||||||
|
|
||||||
Background
|
Background
|
||||||
==========
|
==========
|
||||||
|
|
||||||
File system paths are almost universally represented as text with an encoding
|
File system paths are almost universally represented as text with an encoding
|
||||||
determined by the file system. In Python, we expose these paths via a number of
|
determined by the file system. In Python, we expose these paths via a number of
|
||||||
interfaces, such as the ``os`` and ``io`` modules. Paths may be passed either
|
interfaces, such as the ``os`` and ``io`` modules. Paths may be passed either
|
||||||
direction across these interfaces, that is, from the filesystem to the
|
direction across these interfaces, that is, from the filesystem to the
|
||||||
application (for example, ``os.listdir()``), or from the application to the
|
application (for example, ``os.listdir()``), or from the application to the
|
||||||
filesystem (for example, ``os.unlink()``).
|
filesystem (for example, ``os.unlink()``).
|
||||||
|
|
||||||
When paths are passed between the filesystem and the application, they are
|
When paths are passed between the filesystem and the application, they are
|
||||||
either passed through as a bytes blob or converted to/from str using
|
either passed through as a bytes blob or converted to/from str using
|
||||||
``os.fsencode()`` and ``os.fsdecode()`` or explicit encoding using
|
``os.fsencode()`` and ``os.fsdecode()`` or explicit encoding using
|
||||||
``sys.getfilesystemencoding()``. The result of encoding a string with
|
``sys.getfilesystemencoding()``. The result of encoding a string with
|
||||||
``sys.getfilesystemencoding()`` is a blob of bytes in the native format for the
|
``sys.getfilesystemencoding()`` is a blob of bytes in the native format for the
|
||||||
default file system.
|
default file system.
|
||||||
|
|
||||||
On Windows, the native format for the filesystem is utf-16-le. The recommended
|
On Windows, the native format for the filesystem is utf-16-le. The recommended
|
||||||
platform APIs for accessing the filesystem all accept and return text encoded in
|
platform APIs for accessing the filesystem all accept and return text encoded in
|
||||||
this format. However, prior to Windows NT (and possibly further back), the
|
this format. However, prior to Windows NT (and possibly further back), the
|
||||||
native format was a configurable machine option and a separate set of APIs
|
native format was a configurable machine option and a separate set of APIs
|
||||||
existed to accept this format. The option (the "active code page") and these
|
existed to accept this format. The option (the "active code page") and these
|
||||||
APIs (the "\*A functions") still exist in recent versions of Windows for
|
APIs (the "\*A functions") still exist in recent versions of Windows for
|
||||||
backwards compatibility, though new functionality often only has a utf-16-le API
|
backwards compatibility, though new functionality often only has a utf-16-le API
|
||||||
(the "\*W functions").
|
(the "\*W functions").
|
||||||
|
|
||||||
In Python, str is recommended because it can correctly round-trip all characters
|
In Python, str is recommended because it can correctly round-trip all characters
|
||||||
used in paths (on POSIX with surrogateescape handling; on Windows because str
|
used in paths (on POSIX with surrogateescape handling; on Windows because str
|
||||||
maps to the native representation). On Windows bytes cannot round-trip all
|
maps to the native representation). On Windows bytes cannot round-trip all
|
||||||
characters used in paths, as Python internally uses the \*A functions and hence
|
characters used in paths, as Python internally uses the \*A functions and hence
|
||||||
the encoding is "whatever the active code page is". Since the active code page
|
the encoding is "whatever the active code page is". Since the active code page
|
||||||
cannot represent all Unicode characters, the conversion of a path into bytes can
|
cannot represent all Unicode characters, the conversion of a path into bytes can
|
||||||
lose information without warning or any available indication.
|
lose information without warning or any available indication.
|
||||||
|
|
||||||
As a demonstration of this::
|
As a demonstration of this::
|
||||||
|
|
||||||
>>> open('test\uAB00.txt', 'wb').close()
|
>>> open('test\uAB00.txt', 'wb').close()
|
||||||
>>> import glob
|
>>> import glob
|
||||||
>>> glob.glob('test*')
|
>>> glob.glob('test*')
|
||||||
['test\uab00.txt']
|
['test\uab00.txt']
|
||||||
>>> glob.glob(b'test*')
|
>>> glob.glob(b'test*')
|
||||||
[b'test?.txt']
|
[b'test?.txt']
|
||||||
|
|
||||||
The Unicode character in the second call to glob has been replaced by a '?',
|
The Unicode character in the second call to glob has been replaced by a '?',
|
||||||
which means passing the path back into the filesystem will result in a
|
which means passing the path back into the filesystem will result in a
|
||||||
``FileNotFoundError``. The same results may be observed with ``os.listdir()`` or
|
``FileNotFoundError``. The same results may be observed with ``os.listdir()`` or
|
||||||
any function that matches the return type to the parameter type.
|
any function that matches the return type to the parameter type.
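The same loss can be reproduced on any platform by encoding the name by hand. This sketch assumes the active code page is cp1252, which is only one possible configuration, and uses the ``replace`` error handler to mimic what the \*A functions do:

```python
name = "test\uAB00.txt"

# U+AB00 has no cp1252 representation, so it is replaced with '?'.
lossy = name.encode("cp1252", errors="replace")
print(lossy)

# Decoding the bytes again cannot recover the original character.
roundtrip = lossy.decode("cp1252")
print(roundtrip == name)
```

Once the character has been replaced, nothing in the bytes records what it originally was.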
|
||||||
|
|
||||||
While one user-accessible fix is to use str everywhere, POSIX systems generally
|
While one user-accessible fix is to use str everywhere, POSIX systems generally
|
||||||
do not suffer from data loss when using bytes exclusively as the bytes are the
|
do not suffer from data loss when using bytes exclusively as the bytes are the
|
||||||
canonical representation. Even if the encoding is "incorrect" by some standard,
|
canonical representation. Even if the encoding is "incorrect" by some standard,
|
||||||
the file system will still map the bytes back to the file. Making use of this
|
the file system will still map the bytes back to the file. Making use of this
|
||||||
avoids the cost of decoding and reencoding, such that (theoretically, and only
|
avoids the cost of decoding and reencoding, such that (theoretically, and only
|
||||||
on POSIX), code such as this may be faster because of the use of ``b'.'``
|
on POSIX), code such as this may be faster because of the use of ``b'.'``
|
||||||
compared to using ``'.'``::
|
compared to using ``'.'``::
|
||||||
|
|
||||||
>>> for f in os.listdir(b'.'):
|
>>> for f in os.listdir(b'.'):
|
||||||
... os.stat(f)
|
... os.stat(f)
|
||||||
...
|
...
|
||||||
|
|
||||||
As a result, POSIX-focused library authors prefer to use bytes to represent
|
As a result, POSIX-focused library authors prefer to use bytes to represent
|
||||||
paths. For some authors it is also a convenience, as their code may receive
|
paths. For some authors it is also a convenience, as their code may receive
|
||||||
bytes already known to be encoded correctly, while others are attempting to
|
bytes already known to be encoded correctly, while others are attempting to
|
||||||
simplify porting their code from Python 2. However, the correctness assumptions
|
simplify porting their code from Python 2. However, the correctness assumptions
|
||||||
do not carry over to Windows where Unicode is the canonical representation, and
|
do not carry over to Windows where Unicode is the canonical representation, and
|
||||||
errors may result. This potential data loss is why the use of bytes paths on
|
errors may result. This potential data loss is why the use of bytes paths on
|
||||||
Windows was deprecated in Python 3.3 - all of the above code snippets produce
|
Windows was deprecated in Python 3.3 - all of the above code snippets produce
|
||||||
deprecation warnings on Windows.
|
deprecation warnings on Windows.
|
||||||
|
|
||||||
Proposal
|
Proposal
|
||||||
========
|
========
|
||||||
|
|
||||||
Currently the default filesystem encoding is 'mbcs', which is a meta-encoder
|
Currently the default filesystem encoding is 'mbcs', which is a meta-encoder
|
||||||
that uses the active code page. However, when bytes are passed to the filesystem
|
that uses the active code page. However, when bytes are passed to the filesystem
|
||||||
they go through the \*A APIs and the operating system handles encoding. In this
|
they go through the \*A APIs and the operating system handles encoding. In this
|
||||||
case, paths are always encoded using the equivalent of 'mbcs:replace' with no
|
case, paths are always encoded using the equivalent of 'mbcs:replace' with no
|
||||||
opportunity for Python to override or change this.
|
opportunity for Python to override or change this.
|
||||||
|
|
||||||
This proposal would remove all use of the \*A APIs and only ever call the \*W
|
This proposal would remove all use of the \*A APIs and only ever call the \*W
|
||||||
APIs. When Windows returns paths to Python as ``str``, they will be decoded from
|
APIs. When Windows returns paths to Python as ``str``, they will be decoded from
|
||||||
utf-16-le and returned as text (in whatever the minimal representation is). When
|
utf-16-le and returned as text (in whatever the minimal representation is). When
|
||||||
Python code requests paths as ``bytes``, the paths will be transcoded from
|
Python code requests paths as ``bytes``, the paths will be transcoded from
|
||||||
utf-16-le into utf-8 using surrogatepass (Windows does not validate surrogate
|
utf-16-le into utf-8 using surrogatepass (Windows does not validate surrogate
|
||||||
pairs, so it is possible to have invalid surrogates in filenames). Equally, when
|
pairs, so it is possible to have invalid surrogates in filenames). Equally, when
|
||||||
paths are provided as ``bytes``, they are transcoded from utf-8 into utf-16-le
|
paths are provided as ``bytes``, they are transcoded from utf-8 into utf-16-le
|
||||||
and passed to the \*W APIs.
|
and passed to the \*W APIs.
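The effect of ``surrogatepass`` can be illustrated with the standard codecs. The file name below is a made-up example containing a lone high surrogate, which Windows would accept in a path but which strict UTF-8 encoding rejects:

```python
name = "bad\ud800name"  # hypothetical file name with an unpaired surrogate

# What the operating system would hold: UTF-16-LE code units.
wide = name.encode("utf-16-le", errors="surrogatepass")

# Transcode to UTF-8 bytes for Python code that asked for a bytes path.
as_bytes = wide.decode("utf-16-le", errors="surrogatepass").encode(
    "utf-8", errors="surrogatepass"
)

# The bytes decode back to the original name, so the path round-trips.
restored = as_bytes.decode("utf-8", errors="surrogatepass")
print(restored == name)

# Without surrogatepass the same name cannot be encoded at all.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```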
|
||||||
|
|
||||||
The use of utf-8 will not be configurable, except for the provision of a
|
The use of utf-8 will not be configurable, except for the provision of a
|
||||||
"legacy mode" flag to revert to the previous behaviour.
|
"legacy mode" flag to revert to the previous behaviour.
|
||||||
|
|
||||||
The ``surrogateescape`` error mode does not apply here, as the concern is not
|
The ``surrogateescape`` error mode does not apply here, as the concern is not
|
||||||
about retaining non-sensical bytes. Any path returned from the operating system
|
about retaining non-sensical bytes. Any path returned from the operating system
|
||||||
will be valid Unicode, while invalid paths created by the user should raise a
|
will be valid Unicode, while invalid paths created by the user should raise a
|
||||||
decoding error (currently these would raise ``OSError`` or a subclass).
|
decoding error (currently these would raise ``OSError`` or a subclass).
|
||||||
|
|
||||||
The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure the
|
The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure the
|
||||||
ability to round-trip path names and allow basic manipulation (for example,
|
ability to round-trip path names and allow basic manipulation (for example,
|
||||||
using the ``os.path`` module) when assuming an ASCII-compatible encoding. Using
|
using the ``os.path`` module) when assuming an ASCII-compatible encoding. Using
|
||||||
utf-16-le as the encoding is more pure, but will cause more issues than are
|
utf-16-le as the encoding is more pure, but will cause more issues than are
|
||||||
resolved.
|
resolved.
|
||||||
|
|
||||||
This change would also undeprecate the use of bytes paths on Windows. No change
|
This change would also undeprecate the use of bytes paths on Windows. No change
|
||||||
to the semantics of using bytes as a path is required - as before, they must be
|
to the semantics of using bytes as a path is required - as before, they must be
|
||||||
encoded with the encoding specified by ``sys.getfilesystemencoding()``.
|
encoded with the encoding specified by ``sys.getfilesystemencoding()``.
|
||||||
|
|
||||||
Specific Changes
|
Specific Changes
|
||||||
================
|
================
|
||||||
|
|
||||||
Update sys.getfilesystemencoding
|
Update sys.getfilesystemencoding
|
||||||
--------------------------------
|
--------------------------------
|
||||||
|
|
||||||
Remove the default value for ``Py_FileSystemDefaultEncoding`` and set it in
|
Remove the default value for ``Py_FileSystemDefaultEncoding`` and set it in
|
||||||
``initfsencoding()`` to utf-8, or if the legacy-mode switch is enabled to mbcs.
|
``initfsencoding()`` to utf-8, or if the legacy-mode switch is enabled to mbcs.
|
||||||
|
|
||||||
Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and
|
Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and
|
||||||
``PyUnicode_EncodeFSDefault()`` to use the utf-8 codec, or if the legacy-mode
|
``PyUnicode_EncodeFSDefault()`` to use the utf-8 codec, or if the legacy-mode
|
||||||
switch is enabled the existing mbcs codec.
|
switch is enabled the existing mbcs codec.
|
||||||
|
|
||||||
Add sys.getfilesystemencodeerrors
|
Add sys.getfilesystemencodeerrors
|
||||||
---------------------------------
|
---------------------------------
|
||||||
|
|
||||||
As the error mode may now change between ``surrogatepass`` and ``replace``,
|
As the error mode may now change between ``surrogatepass`` and ``replace``,
|
||||||
Python code that manually performs encoding also needs access to the current
|
Python code that manually performs encoding also needs access to the current
|
||||||
error mode. This includes the implementation of ``os.fsencode()`` and
|
error mode. This includes the implementation of ``os.fsencode()`` and
|
||||||
``os.fsdecode()``, which currently assume an error mode based on the codec.
|
``os.fsdecode()``, which currently assume an error mode based on the codec.
|
||||||
|
|
||||||
Add a public ``Py_FileSystemDefaultEncodeErrors``, similar to the existing
|
Add a public ``Py_FileSystemDefaultEncodeErrors``, similar to the existing
|
||||||
``Py_FileSystemDefaultEncoding``. The default value on Windows will be
|
``Py_FileSystemDefaultEncoding``. The default value on Windows will be
|
||||||
``surrogatepass`` or in legacy mode, ``replace``. The default value on all other
|
``surrogatepass`` or in legacy mode, ``replace``. The default value on all other
|
||||||
platforms will be ``surrogateescape``.
|
platforms will be ``surrogateescape``.
|
||||||
|
|
||||||
Add a public ``sys.getfilesystemencodeerrors()`` function that returns the
|
Add a public ``sys.getfilesystemencodeerrors()`` function that returns the
|
||||||
current error mode.
|
current error mode.
|
||||||
|
|
||||||
Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and
|
Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and
|
||||||
``PyUnicode_EncodeFSDefault()`` to use the variable for error mode rather than
|
``PyUnicode_EncodeFSDefault()`` to use the variable for error mode rather than
|
||||||
constant strings.
|
constant strings.
|
||||||
|
|
||||||
Update the implementations of ``os.fsencode()`` and ``os.fsdecode()`` to use
|
Update the implementations of ``os.fsencode()`` and ``os.fsdecode()`` to use
|
||||||
``sys.getfilesystemencodeerrors()`` instead of assuming the mode.
|
``sys.getfilesystemencodeerrors()`` instead of assuming the mode.
|
||||||
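A minimal sketch of what the updated helpers could look like in pure Python, assuming Python 3.6 or later where ``sys.getfilesystemencodeerrors()`` is available. The names ``fsencode_sketch`` and ``fsdecode_sketch`` are hypothetical stand-ins; the real ``os.fsencode()`` and ``os.fsdecode()`` also accept path-like objects, which is omitted here:

```python
import sys


def fsencode_sketch(filename):
    """Encode str to bytes using the current filesystem codec and error mode."""
    if isinstance(filename, bytes):
        return filename
    if isinstance(filename, str):
        return filename.encode(sys.getfilesystemencoding(),
                               sys.getfilesystemencodeerrors())
    raise TypeError("expected str or bytes, not %s" % type(filename).__name__)


def fsdecode_sketch(filename):
    """Decode bytes to str using the current filesystem codec and error mode."""
    if isinstance(filename, str):
        return filename
    if isinstance(filename, bytes):
        return filename.decode(sys.getfilesystemencoding(),
                               sys.getfilesystemencodeerrors())
    raise TypeError("expected str or bytes, not %s" % type(filename).__name__)


name = "example.txt"
encoded = fsencode_sketch(name)
decoded = fsdecode_sketch(encoded)
```

Querying the error mode at call time, rather than inferring it from the codec name, is what lets the same code work in both the default and legacy modes.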
|
|
||||||
Update path_converter
|
Update path_converter
|
||||||
---------------------
|
---------------------
|
||||||
|
|
||||||
Update the path converter to always decode bytes or buffer objects into text
|
Update the path converter to always decode bytes or buffer objects into text
|
||||||
using ``PyUnicode_DecodeFSDefaultAndSize()``.
|
using ``PyUnicode_DecodeFSDefaultAndSize()``.
|
||||||
|
|
||||||
Change the ``narrow`` field from a ``char*`` string into a flag that indicates
|
Change the ``narrow`` field from a ``char*`` string into a flag that indicates
|
||||||
whether the original object was bytes. This is required for functions that need
|
whether the original object was bytes. This is required for functions that need
|
||||||
to return paths using the same type as was originally provided.
|
to return paths using the same type as was originally provided.
|
||||||
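The actual path converter is C code in ``posixmodule.c``. The following Python sketch only illustrates the idea of decoding up front and remembering the original type. ``convert_path`` and ``return_like_input`` are hypothetical names, and ``os.fsdecode()`` stands in for ``PyUnicode_DecodeFSDefaultAndSize()``:

```python
import os


def convert_path(path):
    """Return (text_path, was_bytes), always decoding bytes-like input."""
    if isinstance(path, (bytes, bytearray, memoryview)):
        return os.fsdecode(bytes(path)), True
    return path, False


def return_like_input(text_path, was_bytes):
    """Give the result back in the same type the caller passed in."""
    return os.fsencode(text_path) if was_bytes else text_path


text_path, was_bytes = convert_path(b"dir/file.txt")
result = return_like_input(text_path, was_bytes)
```

With this shape, every operating system call receives text, and the flag is consulted only when a path has to be handed back to the caller.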
|
|
||||||
Remove unused ANSI code
|
Remove unused ANSI code
|
||||||
-----------------------
|
-----------------------
|
||||||
|
|
||||||
Remove all code paths using the ``narrow`` field, as these will no longer be
|
Remove all code paths using the ``narrow`` field, as these will no longer be
|
||||||
reachable by any caller. These are only used within ``posixmodule.c``. Other
|
reachable by any caller. These are only used within ``posixmodule.c``. Other
|
||||||
uses of paths should have use of bytes paths replaced with decoding and use of
|
uses of paths should have use of bytes paths replaced with decoding and use of
|
||||||
the \*W APIs.
|
the \*W APIs.
|
||||||
|
|
||||||
Add legacy mode
|
Add legacy mode
|
||||||
---------------
|
---------------
|
||||||
|
|
||||||
Add a legacy mode flag, enabled by the environment variable
|
Add a legacy mode flag, enabled by the environment variable
|
||||||
``PYTHONLEGACYWINDOWSFSENCODING`` or by a function call to
|
``PYTHONLEGACYWINDOWSFSENCODING`` or by a function call to
|
||||||
``sys._enablelegacywindowsfsencoding()``. The function call can only be
|
``sys._enablelegacywindowsfsencoding()``. The function call can only be
|
||||||
used to enable the flag and should be used by programs as close to
|
used to enable the flag and should be used by programs as close to
|
||||||
initialization as possible. Legacy mode cannot be disabled while Python is
|
initialization as possible. Legacy mode cannot be disabled while Python is
|
||||||
running.
|
running.
|
||||||
|
|
||||||
When this flag is set, the default filesystem encoding is set to mbcs rather
|
When this flag is set, the default filesystem encoding is set to mbcs rather
|
||||||
than utf-8, and the error mode is set to ``replace`` rather than
|
than utf-8, and the error mode is set to ``replace`` rather than
|
||||||
``surrogatepass``. Paths will continue to decode to wide characters and only \*W
|
``surrogatepass``. Paths will continue to decode to wide characters and only \*W
|
||||||
APIs will be called; however, the bytes passed in and received from Python will
|
APIs will be called; however, the bytes passed in and received from Python will
|
||||||
be encoded the same as prior to this change.
|
be encoded the same as prior to this change.
|
||||||
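As a rough sketch of how a program might opt in to legacy mode early in startup: the helper name ``enable_legacy_fs_encoding`` is hypothetical, and the guard assumes the function exists only on Windows builds that implement this PEP. The environment variable route needs no code, but is assumed to take effect only if it is set before the interpreter starts:

```python
import sys


def enable_legacy_fs_encoding():
    """Switch to legacy mode where supported; return True if it was enabled."""
    if sys.platform == "win32" and hasattr(sys, "_enablelegacywindowsfsencoding"):
        # One-way switch: there is no matching call to turn it off again.
        sys._enablelegacywindowsfsencoding()
        return True
    return False


if __name__ == "__main__":
    # Call as early as possible, before any paths have been encoded or decoded.
    enabled = enable_legacy_fs_encoding()
    print(enabled, sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
```

On platforms other than Windows the helper does nothing and returns ``False``.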
|
|
||||||
Undeprecate bytes paths on Windows
|
Undeprecate bytes paths on Windows
|
||||||
----------------------------------
|
----------------------------------
|
||||||
|
|
||||||
Using bytes as paths on Windows is currently deprecated. We would announce that
|
Using bytes as paths on Windows is currently deprecated. We would announce that
|
||||||
this is no longer the case, and that paths when encoded as bytes should use
|
this is no longer the case, and that paths when encoded as bytes should use
|
||||||
whatever is returned from ``sys.getfilesystemencoding()`` rather than the user's
|
whatever is returned from ``sys.getfilesystemencoding()`` rather than the user's
|
||||||
active code page.
|
active code page.
|
||||||
|
|
||||||
Beta experiment
|
Beta experiment
|
||||||
---------------
|
---------------
|
||||||
|
|
||||||
To assist with determining the impact of this change, we propose applying it to
|
To assist with determining the impact of this change, we propose applying it to
|
||||||
3.6.0b1 provisionally with the intent being to make a final decision before
|
3.6.0b1 provisionally with the intent being to make a final decision before
|
||||||
3.6.0b4.
|
3.6.0b4.
|
||||||
|
|
||||||
During the experiment period, decoding and encoding exception messages will be
|
During the experiment period, decoding and encoding exception messages will be
|
||||||
expanded to include a link to an active online discussion and encourage
|
expanded to include a link to an active online discussion and encourage
|
||||||
reporting of problems.
|
reporting of problems.
|
||||||
|
|
||||||
If it is decided to revert the functionality for 3.6.0b4, the implementation
|
If it is decided to revert the functionality for 3.6.0b4, the implementation
|
||||||
change would be to permanently enable the legacy mode flag, change the
|
change would be to permanently enable the legacy mode flag, change the
|
||||||
environment variable to ``PYTHONWINDOWSUTF8FSENCODING`` and function to
|
environment variable to ``PYTHONWINDOWSUTF8FSENCODING`` and function to
|
||||||
``sys._enablewindowsutf8fsencoding()`` to allow enabling the functionality
|
``sys._enablewindowsutf8fsencoding()`` to allow enabling the functionality
|
||||||
on a case-by-case basis, as opposed to disabling it.
|
on a case-by-case basis, as opposed to disabling it.
|
||||||
|
|
||||||
It is expected that if we cannot feasibly make the change for 3.6 due to
|
It is expected that if we cannot feasibly make the change for 3.6 due to
|
||||||
compatibility concerns, it will not be possible to make the change at any later
|
compatibility concerns, it will not be possible to make the change at any later
|
||||||
time in Python 3.x.
|
time in Python 3.x.
|
||||||
|
|
||||||
Affected Modules
|
Affected Modules
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
This PEP implicitly includes all modules within Python that either pass path
|
This PEP implicitly includes all modules within Python that either pass path
|
||||||
names to the operating system, or otherwise use ``sys.getfilesystemencoding()``.
|
names to the operating system, or otherwise use ``sys.getfilesystemencoding()``.
|
||||||
|
|
||||||
As of 3.6.0a4, the following modules require modification:
|
As of 3.6.0a4, the following modules require modification:
|
||||||
|
|
||||||
* ``os``
|
* ``os``
|
||||||
* ``_overlapped``
|
* ``_overlapped``
|
||||||
* ``_socket``
|
* ``_socket``
|
||||||
* ``subprocess``
|
* ``subprocess``
|
||||||
* ``zipimport``
|
* ``zipimport``
|
||||||
|
|
||||||
The following modules use ``sys.getfilesystemencoding()`` but do not need
|
The following modules use ``sys.getfilesystemencoding()`` but do not need
|
||||||
modification:
|
modification:
|
||||||
|
|
||||||
* ``gc`` (already assumes bytes are utf-8)
|
* ``gc`` (already assumes bytes are utf-8)
|
||||||
* ``grp`` (not compiled for Windows)
|
* ``grp`` (not compiled for Windows)
|
||||||
* ``http.server`` (correctly includes codec name with transmitted data)
|
* ``http.server`` (correctly includes codec name with transmitted data)
|
||||||
* ``idlelib.editor`` (should not be needed; has fallback handling)
|
* ``idlelib.editor`` (should not be needed; has fallback handling)
|
||||||
* ``nis`` (not compiled for Windows)
|
* ``nis`` (not compiled for Windows)
|
||||||
* ``pwd`` (not compiled for Windows)
|
* ``pwd`` (not compiled for Windows)
|
||||||
* ``spwd`` (not compiled for Windows)
|
* ``spwd`` (not compiled for Windows)
|
||||||
* ``_ssl`` (only used for ASCII constants)
|
* ``_ssl`` (only used for ASCII constants)
|
||||||
* ``tarfile`` (code unused on Windows)
|
* ``tarfile`` (code unused on Windows)
|
||||||
* ``_tkinter`` (already assumes bytes are utf-8)
|
* ``_tkinter`` (already assumes bytes are utf-8)
|
||||||
* ``wsgiref`` (assumed as the default encoding for unknown environments)
|
* ``wsgiref`` (assumed as the default encoding for unknown environments)
|
||||||
* ``zipapp`` (code unused on Windows)
|
* ``zipapp`` (code unused on Windows)
|
||||||
|
|
||||||
The following native code uses one of the encoding or decoding functions, but
|
The following native code uses one of the encoding or decoding functions, but
|
||||||
does not require any modification:
|
does not require any modification:
|
||||||
|
|
||||||
* ``Parser/parsetok.c`` (docs already specify ``sys.getfilesystemencoding()``)
|
* ``Parser/parsetok.c`` (docs already specify ``sys.getfilesystemencoding()``)
|
||||||
* ``Python/ast.c`` (docs already specify ``sys.getfilesystemencoding()``)
|
* ``Python/ast.c`` (docs already specify ``sys.getfilesystemencoding()``)
|
||||||
* ``Python/compile.c`` (undocumented, but Python filesystem encoding implied)
|
* ``Python/compile.c`` (undocumented, but Python filesystem encoding implied)
|
||||||
* ``Python/errors.c`` (docs already specify ``os.fsdecode()``)
|
* ``Python/errors.c`` (docs already specify ``os.fsdecode()``)
|
||||||
* ``Python/fileutils.c`` (code unused on Windows)
|
* ``Python/fileutils.c`` (code unused on Windows)
|
||||||
* ``Python/future.c`` (undocumented, but Python filesystem encoding implied)
|
* ``Python/future.c`` (undocumented, but Python filesystem encoding implied)
|
||||||
* ``Python/import.c`` (docs already specify utf-8)
|
* ``Python/import.c`` (docs already specify utf-8)
|
||||||
* ``Python/importdl.c`` (code unused on Windows)
|
* ``Python/importdl.c`` (code unused on Windows)
|
||||||
* ``Python/pythonrun.c`` (docs already specify ``sys.getfilesystemencoding()``)
|
* ``Python/pythonrun.c`` (docs already specify ``sys.getfilesystemencoding()``)
|
||||||
* ``Python/symtable.c`` (undocumented, but Python filesystem encoding implied)
|
* ``Python/symtable.c`` (undocumented, but Python filesystem encoding implied)
|
||||||
* ``Python/thread.c`` (code unused on Windows)
|
* ``Python/thread.c`` (code unused on Windows)
|
||||||
* ``Python/traceback.c`` (encodes correctly for comparing strings)
|
* ``Python/traceback.c`` (encodes correctly for comparing strings)
|
||||||
* ``Python/_warnings.c`` (docs already specify ``os.fsdecode()``)
|
* ``Python/_warnings.c`` (docs already specify ``os.fsdecode()``)
|
||||||
|
|
||||||
Rejected Alternatives
|
Rejected Alternatives
|
||||||
=====================
|
=====================
|
||||||
|
|
||||||
Use strict mbcs decoding
|
Use strict mbcs decoding
|
||||||
------------------------
|
------------------------
|
||||||
|
|
||||||
This is essentially the same as the proposed change, but instead of changing
|
This is essentially the same as the proposed change, but instead of changing
|
||||||
``sys.getfilesystemencoding()`` to utf-8 it is changed to mbcs (which
|
``sys.getfilesystemencoding()`` to utf-8 it is changed to mbcs (which
|
||||||
dynamically maps to the active code page).
|
dynamically maps to the active code page).
|
||||||
|
|
||||||
This approach allows the use of new functionality that is only available as \*W
|
This approach allows the use of new functionality that is only available as \*W
|
||||||
APIs and also detection of encoding/decoding errors. For example, rather than
|
APIs and also detection of encoding/decoding errors. For example, rather than
|
||||||
silently replacing Unicode characters with '?', it would be possible to warn or
|
silently replacing Unicode characters with '?', it would be possible to warn or
|
||||||
fail the operation.
|
fail the operation.
|
||||||
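The difference between silent replacement and strict errors can be sketched as follows. Because the ``mbcs`` codec exists only on Windows, this example assumes ``cp1252`` as a stand-in for the active code page, and the file name is made up:

```python
# Hypothetical file name that cannot be represented in cp1252.
filename = "\u65e5\u672c\u8a9e.txt"

# 'replace' silently substitutes '?' for each unencodable character,
# producing bytes that name a different (probably nonexistent) file.
replaced = filename.encode("cp1252", "replace")

# 'strict' surfaces the problem so the caller can warn or fail instead.
try:
    filename.encode("cp1252", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Here ``replaced`` is ``b'???.txt'``, while the strict call raises ``UnicodeEncodeError``.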
|
|
||||||
Compared to the proposed fix, this could enable some new functionality but does
|
Compared to the proposed fix, this could enable some new functionality but does
|
||||||
not fix any of the problems described initially. New runtime errors may cause
|
not fix any of the problems described initially. New runtime errors may cause
|
||||||
some problems to be more obvious and lead to fixes, provided library maintainers
|
some problems to be more obvious and lead to fixes, provided library maintainers
|
||||||
are interested in supporting Windows and adding a separate code path to treat
|
are interested in supporting Windows and adding a separate code path to treat
|
||||||
filesystem paths as strings.
|
filesystem paths as strings.
|
||||||
|
|
||||||
Making the encoding mbcs without strict errors is equivalent to the legacy-mode
|
Making the encoding mbcs without strict errors is equivalent to the legacy-mode
|
||||||
switch being enabled by default. This is a possible course of action if there is
|
switch being enabled by default. This is a possible course of action if there is
|
||||||
significant breakage of actual code and a need to extend the deprecation period,
|
significant breakage of actual code and a need to extend the deprecation period,
|
||||||
but still a desire to have the simplifications to the CPython source.
|
but still a desire to have the simplifications to the CPython source.
|
||||||
|
|
||||||
Make bytes paths an error on Windows
|
Make bytes paths an error on Windows
|
||||||
------------------------------------
|
------------------------------------
|
||||||
|
|
||||||
By preventing the use of bytes paths on Windows completely we prevent users from
|
By preventing the use of bytes paths on Windows completely we prevent users from
|
||||||
hitting encoding issues.
|
hitting encoding issues.
|
||||||
|
|
||||||
However, the motivation for this PEP is to increase the likelihood that code
|
However, the motivation for this PEP is to increase the likelihood that code
|
||||||
written on POSIX will also work correctly on Windows. This alternative would
|
written on POSIX will also work correctly on Windows. This alternative would
|
||||||
move in the other direction and make such code completely incompatible. As this
|
move in the other direction and make such code completely incompatible. As this
|
||||||
does not benefit users in any way, we reject it.
|
does not benefit users in any way, we reject it.
|
||||||
|
|
||||||
Make bytes paths an error on all platforms
|
Make bytes paths an error on all platforms
|
||||||
------------------------------------------
|
------------------------------------------
|
||||||
|
|
||||||
By deprecating and then disabling the use of bytes paths on all platforms we
|
By deprecating and then disabling the use of bytes paths on all platforms we
|
||||||
prevent users from hitting encoding issues regardless of where the code was
|
prevent users from hitting encoding issues regardless of where the code was
|
||||||
originally written. This would require a full deprecation cycle, as there are
|
originally written. This would require a full deprecation cycle, as there are
|
||||||
currently no warnings on platforms other than Windows.
|
currently no warnings on platforms other than Windows.
|
||||||
|
|
||||||
This is likely to be seen as a hostile action against Python developers in
|
This is likely to be seen as a hostile action against Python developers in
|
||||||
general, and as such is rejected at this time.
|
general, and as such is rejected at this time.
|
||||||
|
|
||||||
Code that may break
|
Code that may break
|
||||||
===================
|
===================
|
||||||
|
|
||||||
The following code patterns may break or see different behaviour as a result of
|
The following code patterns may break or see different behaviour as a result of
|
||||||
this change. Each of these examples would have been fragile in code intended for
|
this change. Each of these examples would have been fragile in code intended for
|
||||||
cross-platform use. The suggested fixes demonstrate the most compatible way to
|
cross-platform use. The suggested fixes demonstrate the most compatible way to
|
||||||
handle path encoding issues across all platforms and across multiple Python
|
handle path encoding issues across all platforms and across multiple Python
|
||||||
versions.
|
versions.
|
||||||
|
|
||||||
Note that all of these examples produce deprecation warnings on Python 3.3 and
|
Note that all of these examples produce deprecation warnings on Python 3.3 and
|
||||||
later.
|
later.
|
||||||
|
|
||||||
Not managing encodings across boundaries
|
Not managing encodings across boundaries
|
||||||
----------------------------------------
|
----------------------------------------
|
||||||
|
|
||||||
Code that does not manage encodings when crossing protocol boundaries may
|
Code that does not manage encodings when crossing protocol boundaries may
|
||||||
currently be working by chance, but could encounter issues when either encoding
|
currently be working by chance, but could encounter issues when either encoding
|
||||||
changes. Note that the source of ``filename`` may be any function that returns
|
changes. Note that the source of ``filename`` may be any function that returns
|
||||||
a bytes object, as illustrated in a second example below::
|
a bytes object, as illustrated in a second example below::
|
||||||
|
|
||||||
>>> filename = open('filename_in_mbcs.txt', 'rb').read()
|
>>> filename = open('filename_in_mbcs.txt', 'rb').read()
|
||||||
>>> text = open(filename, 'r').read()
|
>>> text = open(filename, 'r').read()
|
||||||
|
|
||||||
To correct this code, the encoding of the bytes in ``filename`` should be
|
To correct this code, the encoding of the bytes in ``filename`` should be
|
||||||
specified, either when reading from the file or before using the value::
|
specified, either when reading from the file or before using the value::
|
||||||
|
|
||||||
>>> # Fix 1: Open file as text (default encoding)
|
>>> # Fix 1: Open file as text (default encoding)
|
||||||
>>> filename = open('filename_in_mbcs.txt', 'r').read()
|
>>> filename = open('filename_in_mbcs.txt', 'r').read()
|
||||||
>>> text = open(filename, 'r').read()
|
>>> text = open(filename, 'r').read()
|
||||||
|
|
||||||
>>> # Fix 2: Open file as text (explicit encoding)
|
>>> # Fix 2: Open file as text (explicit encoding)
|
||||||
>>> filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read()
|
>>> filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read()
|
||||||
>>> text = open(filename, 'r').read()
|
>>> text = open(filename, 'r').read()
|
||||||
|
|
||||||
>>> # Fix 3: Explicitly decode the path
|
>>> # Fix 3: Explicitly decode the path
|
||||||
>>> filename = open('filename_in_mbcs.txt', 'rb').read()
|
>>> filename = open('filename_in_mbcs.txt', 'rb').read()
|
||||||
>>> text = open(filename.decode('mbcs'), 'r').read()
|
>>> text = open(filename.decode('mbcs'), 'r').read()
|
||||||
|
|
||||||
Where the creator of ``filename`` is separated from the user of ``filename``,
|
Where the creator of ``filename`` is separated from the user of ``filename``,
|
||||||
the encoding is important information to include::
|
the encoding is important information to include::
|
||||||
|
|
||||||
>>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('mbcs')
|
>>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('mbcs')
|
||||||
|
|
||||||
>>> filename = some_object.filename
|
>>> filename = some_object.filename
|
||||||
>>> type(filename)
|
>>> type(filename)
|
||||||
<class 'bytes'>
|
<class 'bytes'>
|
||||||
>>> text = open(filename, 'r').read()
|
>>> text = open(filename, 'r').read()
|
||||||
|
|
||||||
To fix this code for best compatibility across operating systems and Python
|
To fix this code for best compatibility across operating systems and Python
|
||||||
versions, the filename should be exposed as str::
|
versions, the filename should be exposed as str::
|
||||||
|
|
||||||
>>> # Fix 1: Expose as str
|
>>> # Fix 1: Expose as str
|
||||||
>>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'
|
>>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'
|
||||||
|
|
||||||
>>> filename = some_object.filename
|
>>> filename = some_object.filename
|
||||||
>>> type(filename)
|
>>> type(filename)
|
||||||
<class 'str'>
|
<class 'str'>
|
||||||
>>> text = open(filename, 'r').read()
|
>>> text = open(filename, 'r').read()
|
||||||
|
|
||||||
Alternatively, the encoding used for the path needs to be made available to the
|
Alternatively, the encoding used for the path needs to be made available to the
|
||||||
user. Specifying ``os.fsencode()`` (or ``sys.getfilesystemencoding()``) is an
|
user. Specifying ``os.fsencode()`` (or ``sys.getfilesystemencoding()``) is an
|
||||||
acceptable choice, or a new attribute could be added with the exact encoding::
|
acceptable choice, or a new attribute could be added with the exact encoding::
|
||||||
|
|
||||||
>>> # Fix 2: Use fsencode
|
>>> # Fix 2: Use fsencode
|
||||||
>>> some_object.filename = os.fsencode(r'C:\Users\Steve\Documents\my_file.txt')
|
>>> some_object.filename = os.fsencode(r'C:\Users\Steve\Documents\my_file.txt')
|
||||||
|
|
||||||
>>> filename = some_object.filename
|
>>> filename = some_object.filename
|
||||||
>>> type(filename)
|
>>> type(filename)
|
||||||
<class 'bytes'>
|
<class 'bytes'>
|
||||||
>>> text = open(filename, 'r').read()
|
>>> text = open(filename, 'r').read()
|
||||||
|
|
||||||
|
|
||||||
>>> # Fix 3: Expose as explicit encoding
|
>>> # Fix 3: Expose as explicit encoding
|
||||||
>>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('cp437')
|
>>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('cp437')
|
||||||
>>> some_object.filename_encoding = 'cp437'
|
>>> some_object.filename_encoding = 'cp437'
|
||||||
|
|
||||||
>>> filename = some_object.filename
|
>>> filename = some_object.filename
|
||||||
>>> type(filename)
|
>>> type(filename)
|
||||||
<class 'bytes'>
|
<class 'bytes'>
|
||||||
>>> filename = filename.decode(some_object.filename_encoding)
|
>>> filename = filename.decode(some_object.filename_encoding)
|
||||||
>>> type(filename)
|
>>> type(filename)
|
||||||
<class 'str'>
|
<class 'str'>
|
||||||
>>> text = open(filename, 'r').read()
|
>>> text = open(filename, 'r').read()
|
||||||
|
|
||||||
|
|
||||||
Explicitly using 'mbcs'
|
Explicitly using 'mbcs'
|
||||||
-----------------------
|
-----------------------
|
||||||
|
|
||||||
Code that explicitly encodes text using 'mbcs' before passing to file system
|
Code that explicitly encodes text using 'mbcs' before passing to file system
|
||||||
APIs is now passing incorrectly encoded bytes. Note that the source of
|
APIs is now passing incorrectly encoded bytes. Note that the source of
|
||||||
``filename`` in this example is not relevant, provided that it is a str::
|
``filename`` in this example is not relevant, provided that it is a str::
|
||||||
|
|
||||||
>>> filename = open('files.txt', 'r').readline().rstrip()
|
>>> filename = open('files.txt', 'r').readline().rstrip()
|
||||||
>>> text = open(filename.encode('mbcs'), 'r')
|
>>> text = open(filename.encode('mbcs'), 'r')
|
||||||
|
|
||||||
To correct this code, the string should be passed without explicit encoding, or
|
To correct this code, the string should be passed without explicit encoding, or
|
||||||
should use ``os.fsencode()``::
|
should use ``os.fsencode()``::
|
||||||
|
|
||||||
>>> # Fix 1: Do not encode the string
|
>>> # Fix 1: Do not encode the string
|
||||||
>>> filename = open('files.txt', 'r').readline().rstrip()
|
>>> filename = open('files.txt', 'r').readline().rstrip()
|
||||||
>>> text = open(filename, 'r')
|
>>> text = open(filename, 'r')
|
||||||
|
|
||||||
>>> # Fix 2: Use correct encoding
|
>>> # Fix 2: Use correct encoding
|
||||||
>>> filename = open('files.txt', 'r').readline().rstrip()
|
>>> filename = open('files.txt', 'r').readline().rstrip()
|
||||||
>>> text = open(os.fsencode(filename), 'r')
|
>>> text = open(os.fsencode(filename), 'r')
|
||||||
|
|
||||||
|
|
||||||
References
|
References
|
||||||
==========
|
==========
|
||||||
|
|
||||||
.. _Naming Files, Paths, and Namespaces:
|
.. _Naming Files, Paths, and Namespaces:
|
||||||
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247.aspx
|
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247.aspx
|
||||||
|
|
||||||
Copyright
|
Copyright
|
||||||
=========
|
=========
|
||||||
|
|
||||||
This document has been placed in the public domain.
|
This document has been placed in the public domain.
|
||||||
|
|