diff --git a/pep-0519.txt b/pep-0519.txt index 20a9c3717..205f9eb69 100644 --- a/pep-0519.txt +++ b/pep-0519.txt @@ -1,557 +1,557 @@ -PEP: 519 -Title: Adding a file system path protocol -Version: $Revision$ -Last-Modified: $Date$ -Author: Brett Cannon , - Koos Zevenhoven -Status: Final -Type: Standards Track -Content-Type: text/x-rst -Created: 11-May-2016 -Python-Version: 3.6 -Post-History: 11-May-2016, - 12-May-2016, - 13-May-2016 -Resolution: https://mail.python.org/pipermail/python-dev/2016-May/144646.html - - -Abstract -======== - -This PEP proposes a protocol for classes which represent a file system -path to be able to provide a ``str`` or ``bytes`` representation. -Changes to Python's standard library are also proposed to utilize this -protocol where appropriate to facilitate the use of path objects where -historically only ``str`` and/or ``bytes`` file system paths are -accepted. The goal is to facilitate the migration of users towards -rich path objects while providing an easy way to work with code -expecting ``str`` or ``bytes``. - - -Rationale -========= - -Historically in Python, file system paths have been represented as -strings or bytes. This choice of representation has stemmed from C's -own decision to represent file system paths as -``const char *`` [#libc-open]_. While that is a totally serviceable -format to use for file system paths, it's not necessarily optimal. At -issue is the fact that while all file system paths can be represented -as strings or bytes, not all strings or bytes represent a file system -path. This can lead to issues where any e.g. string duck-types to a -file system path whether it actually represents a path or not. - -To help elevate the representation of file system paths from their -representation as strings and bytes to a richer object representation, -the pathlib module [#pathlib]_ was provisionally introduced in -Python 3.4 through PEP 428. 
While considered by some as an improvement -over strings and bytes for file system paths, it has suffered from a -lack of adoption. Typically the key issue listed for the low adoption -rate has been the lack of support in the standard library. This lack -of support required users of pathlib to manually convert path objects -to strings by calling ``str(path)`` which many found error-prone. - -One issue in converting path objects to strings comes from -the fact that the only generic way to get a string representation of -the path was to pass the object to ``str()``. This can pose a -problem when done blindly as nearly all Python objects have some -string representation whether they are a path or not, e.g. -``str(None)`` will give a result that -``builtins.open()`` [#builtins-open]_ will happily use to create a new -file. - -Exacerbating this whole situation is the -``DirEntry`` object [#os-direntry]_. While path objects have a -representation that can be extracted using ``str()``, ``DirEntry`` -objects expose a ``path`` attribute instead. Having no common -interface between path objects, ``DirEntry``, and any other -third-party path library has become an issue. A solution that allows -any path-representing object to declare that it is a path and a way -to extract a low-level representation that all path objects could -support is desired. - -This PEP then proposes to introduce a new protocol to be followed by -objects which represent file system paths. Providing a protocol allows -for explicit signaling of what objects represent file system paths as -well as a way to extract a lower-level representation that can be used -with older APIs which only support strings or bytes. - -Discussions regarding path objects that led to this PEP can be found -in multiple threads on the python-ideas mailing list archive -[#python-ideas-archive]_ for the months of March and April 2016 and on -the python-dev mailing list archives [#python-dev-archive]_ during -April 2016. 
- - -Proposal -======== - -This proposal is split into two parts. One part is the proposal of a -protocol for objects to declare and provide support for exposing a -file system path representation. The other part deals with changes to -Python's standard library to support the new protocol. These changes -will also lead to the pathlib module dropping its provisional status. - -Protocol --------- - -The following abstract base class defines the protocol for an object -to be considered a path object:: - - import abc - import typing as t - - - class PathLike(abc.ABC): - - """Abstract base class for implementing the file system path protocol.""" - - @abc.abstractmethod - def __fspath__(self) -> t.Union[str, bytes]: - """Return the file system path representation of the object.""" - raise NotImplementedError - - -Objects representing file system paths will implement the -``__fspath__()`` method which will return the ``str`` or ``bytes`` -representation of the path. The ``str`` representation is the -preferred low-level path representation as it is human-readable and -what people historically represent paths as. - - -Standard library changes ------------------------- - -It is expected that most APIs in Python's standard library that -currently accept a file system path will be updated appropriately to -accept path objects (whether that requires code or simply an update -to documentation will vary). The modules mentioned below, though, -deserve specific details as they have either fundamental changes that -empower the ability to use path objects, or entail additions/removal -of APIs. - - -builtins -'''''''' - -``open()`` [#builtins-open]_ will be updated to accept path objects as -well as continue to accept ``str`` and ``bytes``. - - -os -''' - -The ``fspath()`` function will be added with the following semantics:: - - import typing as t - - - def fspath(path: t.Union[PathLike, str, bytes]) -> t.Union[str, bytes]: - """Return the string representation of the path. 
- - If str or bytes is passed in, it is returned unchanged. If __fspath__() - returns something other than str or bytes then TypeError is raised. If - this function is given something that is not str, bytes, or os.PathLike - then TypeError is raised. - """ - if isinstance(path, (str, bytes)): - return path - - # Work from the object's type to match method resolution of other magic - # methods. - path_type = type(path) - try: - path = path_type.__fspath__(path) - except AttributeError: - if hasattr(path_type, '__fspath__'): - raise - else: - if isinstance(path, (str, bytes)): - return path - else: - raise TypeError("expected __fspath__() to return str or bytes, " - "not " + type(path).__name__) - - raise TypeError("expected str, bytes or os.PathLike object, not " - + path_type.__name__) - -The ``os.fsencode()`` [#os-fsencode]_ and -``os.fsdecode()`` [#os-fsdecode]_ functions will be updated to accept -path objects. As both functions coerce their arguments to -``bytes`` and ``str``, respectively, they will be updated to call -``__fspath__()`` if present to convert the path object to a ``str`` or -``bytes`` representation, and then perform their appropriate -coercion operations as if the return value from ``__fspath__()`` had -been the original argument to the coercion function in question. - -The addition of ``os.fspath()``, the updates to -``os.fsencode()``/``os.fsdecode()``, and the current semantics of -``pathlib.PurePath`` provide the semantics necessary to -get the path representation one prefers. For a path object, -``pathlib.PurePath``/``Path`` can be used. To obtain the ``str`` or -``bytes`` representation without any coersion, then ``os.fspath()`` -can be used. If a ``str`` is desired and the encoding of ``bytes`` -should be assumed to be the default file system encoding, then -``os.fsdecode()`` should be used. If a ``bytes`` representation is -desired and any strings should be encoded using the default file -system encoding, then ``os.fsencode()`` is used. 
This PEP recommends -using path objects when possible and falling back to string paths as -necessary and using ``bytes`` as a last resort. - -Another way to view this is as a hierarchy of file system path -representations (highest- to lowest-level): path → str → bytes. The -functions and classes under discussion can all accept objects on the -same level of the hierarchy, but they vary in whether they promote or -demote objects to another level. The ``pathlib.PurePath`` class can -promote a ``str`` to a path object. The ``os.fspath()`` function can -demote a path object to a ``str`` or ``bytes`` instance, depending -on what ``__fspath__()`` returns. -The ``os.fsdecode()`` function will demote a path object to -a string or promote a ``bytes`` object to a ``str``. The -``os.fsencode()`` function will demote a path or string object to -``bytes``. There is no function that provides a way to demote a path -object directly to ``bytes`` while bypassing string demotion. - -The ``DirEntry`` object [#os-direntry]_ will gain an ``__fspath__()`` -method. It will return the same value as currently found on the -``path`` attribute of ``DirEntry`` instances. - -The Protocol_ ABC will be added to the ``os`` module under the name -``os.PathLike``. - - -os.path -''''''' - -The various path-manipulation functions of ``os.path`` [#os-path]_ -will be updated to accept path objects. For polymorphic functions that -accept both bytes and strings, they will be updated to simply use -``os.fspath()``. - -During the discussions leading up to this PEP it was suggested that -``os.path`` not be updated using an "explicit is better than implicit" -argument. The thinking was that since ``__fspath__()`` is polymorphic -itself it may be better to have code working with ``os.path`` extract -the path representation from path objects explicitly. 
There is also -the consideration that adding support this deep into the low-level OS -APIs will lead to code magically supporting path objects without -requiring any documentation updated, leading to potential complaints -when it doesn't work, unbeknownst to the project author. - -But it is the view of this PEP that "practicality beats purity" in -this instance. To help facilitate the transition to supporting path -objects, it is better to make the transition as easy as possible than -to worry about unexpected/undocumented duck typing support for -path objects by projects. - -There has also been the suggestion that ``os.path`` functions could be -used in a tight loop and the overhead of checking or calling -``__fspath__()`` would be too costly. In this scenario only -path-consuming APIs would be directly updated and path-manipulating -APIs like the ones in ``os.path`` would go unmodified. This would -require library authors to update their code to support path objects -if they performed any path manipulations, but if the library code -passed the path straight through then the library wouldn't need to be -updated. It is the view of this PEP and Guido, though, that this is an -unnecessary worry and that performance will still be acceptable. - - -pathlib -''''''' - -The constructor for ``pathlib.PurePath`` and ``pathlib.Path`` will be -updated to accept ``PathLike`` objects. Both ``PurePath`` and ``Path`` -will continue to not accept ``bytes`` path representations, and so if -``__fspath__()`` returns ``bytes`` it will raise an exception. - -The ``path`` attribute will be removed as this PEP makes it -redundant (it has not been included in any released version of Python -and so is not a backwards-compatibility concern). - - -C API -''''' - -The C API will gain an equivalent function to ``os.fspath()``:: - - /* - Return the file system path representation of the object. - - If the object is str or bytes, then allow it to pass through with - an incremented refcount. 
If the object defines __fspath__(), then - return the result of that method. All other types raise a TypeError. - */ - PyObject * - PyOS_FSPath(PyObject *path) - { - _Py_IDENTIFIER(__fspath__); - PyObject *func = NULL; - PyObject *path_repr = NULL; - - if (PyUnicode_Check(path) || PyBytes_Check(path)) { - Py_INCREF(path); - return path; - } - - func = _PyObject_LookupSpecial(path, &PyId___fspath__); - if (NULL == func) { - return PyErr_Format(PyExc_TypeError, - "expected str, bytes or os.PathLike object, " - "not %S", - path->ob_type); - } - - path_repr = PyObject_CallFunctionObjArgs(func, NULL); - Py_DECREF(func); - if (!PyUnicode_Check(path_repr) && !PyBytes_Check(path_repr)) { - Py_DECREF(path_repr); - return PyErr_Format(PyExc_TypeError, - "expected __fspath__() to return str or bytes, " - "not %S", - path_repr->ob_type); - } - - return path_repr; - } - - - - -Backwards compatibility -======================= - -There are no explicit backwards-compatibility concerns. Unless an -object incidentally already defines a ``__fspath__()`` method there is -no reason to expect the pre-existing code to break or expect to have -its semantics implicitly changed. - -Libraries wishing to support path objects and a version of Python -prior to Python 3.6 and the existence of ``os.fspath()`` can use the -idiom of -``path.__fspath__() if hasattr(path, "__fspath__") else path``. - - -Implementation -============== - -This is the task list for what this PEP proposes to be changed in -Python 3.6: - -#. Remove the ``path`` attribute from pathlib - (`done `__) -#. Remove the provisional status of pathlib - (`done `__) -#. Add ``os.PathLike`` - (`code `__ and - `docs `__ done) -#. Add ``PyOS_FSPath()`` - (`code `__ and - `docs `__ done) -#. Add ``os.fspath()`` - (`done `__) -#. Update ``os.fsencode()`` - (`done `__) -#. Update ``os.fsdecode()`` - (`done `__) -#. Update ``pathlib.PurePath`` and ``pathlib.Path`` - (`done `__) - - #. Add ``__fspath__()`` - #. 
Add ``os.PathLike`` support to the constructors - -#. Add ``__fspath__()`` to ``DirEntry`` - (`done `__) - -#. Update ``builtins.open()`` - (`done `__) -#. Update ``os.path`` - (`done `__) -#. Add a `glossary `__ entry for "path-like" - (`done `__) -#. Update `"What's New" `_ - (`done `__) - - -Rejected Ideas -============== - -Other names for the protocol's method -------------------------------------- - -Various names were proposed during discussions leading to this PEP, -including ``__path__``, ``__pathname__``, and ``__fspathname__``. In -the end people seemed to gravitate towards ``__fspath__`` for being -unambiguous without being unnecessarily long. - - -Separate str/bytes methods --------------------------- - -At one point it was suggested that ``__fspath__()`` only return -strings and another method named ``__fspathb__()`` be introduced to -return bytes. The thinking is that by making ``__fspath__()`` not be -polymorphic it could make dealing with the potential string or bytes -representations easier. But the general consensus was that returning -bytes will more than likely be rare and that the various functions in -the os module are the better abstraction to promote over direct -calls to ``__fspath__()``. - - -Providing a ``path`` attribute ------------------------------- - -To help deal with the issue of ``pathlib.PurePath`` not inheriting -from ``str``, originally it was proposed to introduce a ``path`` -attribute to mirror what ``os.DirEntry`` provides. In the end, -though, it was determined that a protocol would provide the same -result while not directly exposing an API that most people will never -need to interact with directly. - - -Have ``__fspath__()`` only return strings ------------------------------------------- - -Much of the discussion that led to this PEP revolved around whether -``__fspath__()`` should be polymorphic and return ``bytes`` as well as -``str`` or only return ``str``. 
The general sentiment for this view -was that ``bytes`` are difficult to work with due to their -inherent lack of information about their encoding and PEP 383 makes -it possible to represent all file system paths using ``str`` with the -``surrogateescape`` handler. Thus, it would be better to forcibly -promote the use of ``str`` as the low-level path representation for -high-level path objects. - -In the end, it was decided that using ``bytes`` to represent paths is -simply not going to go away and thus they should be supported to some -degree. The hope is that people will gravitate towards path objects -like pathlib and that will move people away from operating directly -with ``bytes``. - - -A generic string encoding mechanism ------------------------------------ - -At one point there was a discussion of developing a generic mechanism -to extract a string representation of an object that had semantic -meaning (``__str__()`` does not necessarily return anything of -semantic significance beyond what may be helpful for debugging). In -the end, it was deemed to lack a motivating need beyond the one this -PEP is trying to solve in a specific fashion. - - -Have __fspath__ be an attribute -------------------------------- - -It was briefly considered to have ``__fspath__`` be an attribute -instead of a method. This was rejected for two reasons. One, -historically protocols have been implemented as "magic methods" and -not "magic methods and attributes". Two, there is no guarantee that -the lower-level representation of a path object will be pre-computed, -potentially misleading users that there was no expensive computation -behind the scenes in case the attribute was implemented as a property. - -This also indirectly ties into the idea of introducing a ``path`` -attribute to accomplish the same thing. This idea has an added issue, -though, of accidentally having any object with a ``path`` attribute -meet the protocol's duck typing. 
Introducing a new magic method for -the protocol helpfully avoids any accidental opting into the protocol. - - -Provide specific type hinting support -------------------------------------- - -There was some consideration to provdinga generic ``typing.PathLike`` -class which would allow for e.g. ``typing.PathLike[str]`` to specify -a type hint for a path object which returned a string representation. -While potentially beneficial, the usefulness was deemed too small to -bother adding the type hint class. - -This also removed any desire to have a class in the ``typing`` module -which represented the union of all acceptable path-representing types -as that can be represented with -``typing.Union[str, bytes, os.PathLike]`` easily enough and the hope -is users will slowly gravitate to path objects only. - - -Provide ``os.fspathb()`` ------------------------- - -It was suggested that to mirror the structure of e.g. -``os.getcwd()``/``os.getcwdb()``, that ``os.fspath()`` only return -``str`` and that another function named ``os.fspathb()`` be -introduced that only returned ``bytes``. This was rejected as the -purposes of the ``*b()`` functions are tied to querying the file -system where there is a need to get the raw bytes back. As this PEP -does not work directly with data on a file system (but which *may* -be), the view was taken this distinction is unnecessary. It's also -believed that the need for only bytes will not be common enough to -need to support in such a specific manner as ``os.fsencode()`` will -provide similar functionality. - - -Call ``__fspath__()`` off of the instance ------------------------------------------ - -An earlier draft of this PEP had ``os.fspath()`` calling -``path.__fspath__()`` instead of ``type(path).__fspath__(path)``. The -changed to be consistent with how other magic methods in Python are -resolved. 
- - -Acknowledgements -================ - -Thanks to everyone who participated in the various discussions related -to this PEP that spanned both python-ideas and python-dev. Special -thanks to Stephen Turnbull for direct feedback on early drafts of this -PEP. More special thanks to Koos Zevenhoven and Ethan Furman for not -only feedback on early drafts of this PEP but also helping to drive -the overall discussion on this topic across the two mailing lists. - - -References -========== - -.. [#python-ideas-archive] The python-ideas mailing list archive - (https://mail.python.org/pipermail/python-ideas/) - -.. [#python-dev-archive] The python-dev mailing list archive - (https://mail.python.org/pipermail/python-dev/) - -.. [#libc-open] ``open()`` documention for the C standard library - (http://www.gnu.org/software/libc/manual/html_node/Opening-and-Closing-Files.html) - -.. [#pathlib] The ``pathlib`` module - (https://docs.python.org/3/library/pathlib.html#module-pathlib) - -.. [#builtins-open] The ``builtins.open()`` function - (https://docs.python.org/3/library/functions.html#open) - -.. [#os-fsencode] The ``os.fsencode()`` function - (https://docs.python.org/3/library/os.html#os.fsencode) - -.. [#os-fsdecode] The ``os.fsdecode()`` function - (https://docs.python.org/3/library/os.html#os.fsdecode) - -.. [#os-direntry] The ``os.DirEntry`` class - (https://docs.python.org/3/library/os.html#os.DirEntry) - -.. [#os-path] The ``os.path`` module - (https://docs.python.org/3/library/os.path.html#module-os.path) - - -Copyright -========= - -This document has been placed in the public domain. - - - -.. 
- Local Variables: - mode: indented-text - indent-tabs-mode: nil - sentence-end-double-space: t - fill-column: 70 - coding: utf-8 - End: +PEP: 519 +Title: Adding a file system path protocol +Version: $Revision$ +Last-Modified: $Date$ +Author: Brett Cannon , + Koos Zevenhoven +Status: Final +Type: Standards Track +Content-Type: text/x-rst +Created: 11-May-2016 +Python-Version: 3.6 +Post-History: 11-May-2016, + 12-May-2016, + 13-May-2016 +Resolution: https://mail.python.org/pipermail/python-dev/2016-May/144646.html + + +Abstract +======== + +This PEP proposes a protocol for classes which represent a file system +path to be able to provide a ``str`` or ``bytes`` representation. +Changes to Python's standard library are also proposed to utilize this +protocol where appropriate to facilitate the use of path objects where +historically only ``str`` and/or ``bytes`` file system paths are +accepted. The goal is to facilitate the migration of users towards +rich path objects while providing an easy way to work with code +expecting ``str`` or ``bytes``. + + +Rationale +========= + +Historically in Python, file system paths have been represented as +strings or bytes. This choice of representation has stemmed from C's +own decision to represent file system paths as +``const char *`` [#libc-open]_. While that is a totally serviceable +format to use for file system paths, it's not necessarily optimal. At +issue is the fact that while all file system paths can be represented +as strings or bytes, not all strings or bytes represent a file system +path. This can lead to issues where any e.g. string duck-types to a +file system path whether it actually represents a path or not. + +To help elevate the representation of file system paths from their +representation as strings and bytes to a richer object representation, +the pathlib module [#pathlib]_ was provisionally introduced in +Python 3.4 through PEP 428. 
While considered by some as an improvement +over strings and bytes for file system paths, it has suffered from a +lack of adoption. Typically the key issue listed for the low adoption +rate has been the lack of support in the standard library. This lack +of support required users of pathlib to manually convert path objects +to strings by calling ``str(path)`` which many found error-prone. + +One issue in converting path objects to strings comes from +the fact that the only generic way to get a string representation of +the path was to pass the object to ``str()``. This can pose a +problem when done blindly as nearly all Python objects have some +string representation whether they are a path or not, e.g. +``str(None)`` will give a result that +``builtins.open()`` [#builtins-open]_ will happily use to create a new +file. + +Exacerbating this whole situation is the +``DirEntry`` object [#os-direntry]_. While path objects have a +representation that can be extracted using ``str()``, ``DirEntry`` +objects expose a ``path`` attribute instead. Having no common +interface between path objects, ``DirEntry``, and any other +third-party path library has become an issue. A solution that allows +any path-representing object to declare that it is a path and a way +to extract a low-level representation that all path objects could +support is desired. + +This PEP then proposes to introduce a new protocol to be followed by +objects which represent file system paths. Providing a protocol allows +for explicit signaling of what objects represent file system paths as +well as a way to extract a lower-level representation that can be used +with older APIs which only support strings or bytes. + +Discussions regarding path objects that led to this PEP can be found +in multiple threads on the python-ideas mailing list archive +[#python-ideas-archive]_ for the months of March and April 2016 and on +the python-dev mailing list archives [#python-dev-archive]_ during +April 2016. 
+ + +Proposal +======== + +This proposal is split into two parts. One part is the proposal of a +protocol for objects to declare and provide support for exposing a +file system path representation. The other part deals with changes to +Python's standard library to support the new protocol. These changes +will also lead to the pathlib module dropping its provisional status. + +Protocol +-------- + +The following abstract base class defines the protocol for an object +to be considered a path object:: + + import abc + import typing as t + + + class PathLike(abc.ABC): + + """Abstract base class for implementing the file system path protocol.""" + + @abc.abstractmethod + def __fspath__(self) -> t.Union[str, bytes]: + """Return the file system path representation of the object.""" + raise NotImplementedError + + +Objects representing file system paths will implement the +``__fspath__()`` method which will return the ``str`` or ``bytes`` +representation of the path. The ``str`` representation is the +preferred low-level path representation as it is human-readable and +what people historically represent paths as. + + +Standard library changes +------------------------ + +It is expected that most APIs in Python's standard library that +currently accept a file system path will be updated appropriately to +accept path objects (whether that requires code or simply an update +to documentation will vary). The modules mentioned below, though, +deserve specific details as they have either fundamental changes that +empower the ability to use path objects, or entail additions/removal +of APIs. + + +builtins +'''''''' + +``open()`` [#builtins-open]_ will be updated to accept path objects as +well as continue to accept ``str`` and ``bytes``. + + +os +''' + +The ``fspath()`` function will be added with the following semantics:: + + import typing as t + + + def fspath(path: t.Union[PathLike, str, bytes]) -> t.Union[str, bytes]: + """Return the string representation of the path. 
+ + If str or bytes is passed in, it is returned unchanged. If __fspath__() + returns something other than str or bytes then TypeError is raised. If + this function is given something that is not str, bytes, or os.PathLike + then TypeError is raised. + """ + if isinstance(path, (str, bytes)): + return path + + # Work from the object's type to match method resolution of other magic + # methods. + path_type = type(path) + try: + path = path_type.__fspath__(path) + except AttributeError: + if hasattr(path_type, '__fspath__'): + raise + else: + if isinstance(path, (str, bytes)): + return path + else: + raise TypeError("expected __fspath__() to return str or bytes, " + "not " + type(path).__name__) + + raise TypeError("expected str, bytes or os.PathLike object, not " + + path_type.__name__) + +The ``os.fsencode()`` [#os-fsencode]_ and +``os.fsdecode()`` [#os-fsdecode]_ functions will be updated to accept +path objects. As both functions coerce their arguments to +``bytes`` and ``str``, respectively, they will be updated to call +``__fspath__()`` if present to convert the path object to a ``str`` or +``bytes`` representation, and then perform their appropriate +coercion operations as if the return value from ``__fspath__()`` had +been the original argument to the coercion function in question. + +The addition of ``os.fspath()``, the updates to +``os.fsencode()``/``os.fsdecode()``, and the current semantics of +``pathlib.PurePath`` provide the semantics necessary to +get the path representation one prefers. For a path object, +``pathlib.PurePath``/``Path`` can be used. To obtain the ``str`` or +``bytes`` representation without any coercion, then ``os.fspath()`` +can be used. If a ``str`` is desired and the encoding of ``bytes`` +should be assumed to be the default file system encoding, then +``os.fsdecode()`` should be used. If a ``bytes`` representation is +desired and any strings should be encoded using the default file +system encoding, then ``os.fsencode()`` is used. 
This PEP recommends +using path objects when possible and falling back to string paths as +necessary and using ``bytes`` as a last resort. + +Another way to view this is as a hierarchy of file system path +representations (highest- to lowest-level): path → str → bytes. The +functions and classes under discussion can all accept objects on the +same level of the hierarchy, but they vary in whether they promote or +demote objects to another level. The ``pathlib.PurePath`` class can +promote a ``str`` to a path object. The ``os.fspath()`` function can +demote a path object to a ``str`` or ``bytes`` instance, depending +on what ``__fspath__()`` returns. +The ``os.fsdecode()`` function will demote a path object to +a string or promote a ``bytes`` object to a ``str``. The +``os.fsencode()`` function will demote a path or string object to +``bytes``. There is no function that provides a way to demote a path +object directly to ``bytes`` while bypassing string demotion. + +The ``DirEntry`` object [#os-direntry]_ will gain an ``__fspath__()`` +method. It will return the same value as currently found on the +``path`` attribute of ``DirEntry`` instances. + +The Protocol_ ABC will be added to the ``os`` module under the name +``os.PathLike``. + + +os.path +''''''' + +The various path-manipulation functions of ``os.path`` [#os-path]_ +will be updated to accept path objects. For polymorphic functions that +accept both bytes and strings, they will be updated to simply use +``os.fspath()``. + +During the discussions leading up to this PEP it was suggested that +``os.path`` not be updated using an "explicit is better than implicit" +argument. The thinking was that since ``__fspath__()`` is polymorphic +itself it may be better to have code working with ``os.path`` extract +the path representation from path objects explicitly. 
There is also +the consideration that adding support this deep into the low-level OS +APIs will lead to code magically supporting path objects without +requiring any documentation updated, leading to potential complaints +when it doesn't work, unbeknownst to the project author. + +But it is the view of this PEP that "practicality beats purity" in +this instance. To help facilitate the transition to supporting path +objects, it is better to make the transition as easy as possible than +to worry about unexpected/undocumented duck typing support for +path objects by projects. + +There has also been the suggestion that ``os.path`` functions could be +used in a tight loop and the overhead of checking or calling +``__fspath__()`` would be too costly. In this scenario only +path-consuming APIs would be directly updated and path-manipulating +APIs like the ones in ``os.path`` would go unmodified. This would +require library authors to update their code to support path objects +if they performed any path manipulations, but if the library code +passed the path straight through then the library wouldn't need to be +updated. It is the view of this PEP and Guido, though, that this is an +unnecessary worry and that performance will still be acceptable. + + +pathlib +''''''' + +The constructor for ``pathlib.PurePath`` and ``pathlib.Path`` will be +updated to accept ``PathLike`` objects. Both ``PurePath`` and ``Path`` +will continue to not accept ``bytes`` path representations, and so if +``__fspath__()`` returns ``bytes`` it will raise an exception. + +The ``path`` attribute will be removed as this PEP makes it +redundant (it has not been included in any released version of Python +and so is not a backwards-compatibility concern). + + +C API +''''' + +The C API will gain an equivalent function to ``os.fspath()``:: + + /* + Return the file system path representation of the object. + + If the object is str or bytes, then allow it to pass through with + an incremented refcount. 
If the object defines __fspath__(), then + return the result of that method. All other types raise a TypeError. + */ + PyObject * + PyOS_FSPath(PyObject *path) + { + _Py_IDENTIFIER(__fspath__); + PyObject *func = NULL; + PyObject *path_repr = NULL; + + if (PyUnicode_Check(path) || PyBytes_Check(path)) { + Py_INCREF(path); + return path; + } + + func = _PyObject_LookupSpecial(path, &PyId___fspath__); + if (NULL == func) { + return PyErr_Format(PyExc_TypeError, + "expected str, bytes or os.PathLike object, " + "not %S", + path->ob_type); + } + + path_repr = PyObject_CallFunctionObjArgs(func, NULL); + Py_DECREF(func); + if (!PyUnicode_Check(path_repr) && !PyBytes_Check(path_repr)) { + Py_DECREF(path_repr); + return PyErr_Format(PyExc_TypeError, + "expected __fspath__() to return str or bytes, " + "not %S", + path_repr->ob_type); + } + + return path_repr; + } + + + + +Backwards compatibility +======================= + +There are no explicit backwards-compatibility concerns. Unless an +object incidentally already defines a ``__fspath__()`` method there is +no reason to expect the pre-existing code to break or expect to have +its semantics implicitly changed. + +Libraries wishing to support path objects and a version of Python +prior to Python 3.6 and the existence of ``os.fspath()`` can use the +idiom of +``path.__fspath__() if hasattr(path, "__fspath__") else path``. + + +Implementation +============== + +This is the task list for what this PEP proposes to be changed in +Python 3.6: + +#. Remove the ``path`` attribute from pathlib + (`done `__) +#. Remove the provisional status of pathlib + (`done `__) +#. Add ``os.PathLike`` + (`code `__ and + `docs `__ done) +#. Add ``PyOS_FSPath()`` + (`code `__ and + `docs `__ done) +#. Add ``os.fspath()`` + (`done `__) +#. Update ``os.fsencode()`` + (`done `__) +#. Update ``os.fsdecode()`` + (`done `__) +#. Update ``pathlib.PurePath`` and ``pathlib.Path`` + (`done `__) + + #. Add ``__fspath__()`` + #. 
Add ``os.PathLike`` support to the constructors + +#. Add ``__fspath__()`` to ``DirEntry`` + (`done `__) + +#. Update ``builtins.open()`` + (`done `__) +#. Update ``os.path`` + (`done `__) +#. Add a `glossary `__ entry for "path-like" + (`done `__) +#. Update `"What's New" `_ + (`done `__) + + +Rejected Ideas +============== + +Other names for the protocol's method +------------------------------------- + +Various names were proposed during discussions leading to this PEP, +including ``__path__``, ``__pathname__``, and ``__fspathname__``. In +the end people seemed to gravitate towards ``__fspath__`` for being +unambiguous without being unnecessarily long. + + +Separate str/bytes methods +-------------------------- + +At one point it was suggested that ``__fspath__()`` only return +strings and another method named ``__fspathb__()`` be introduced to +return bytes. The thinking is that by making ``__fspath__()`` not be +polymorphic it could make dealing with the potential string or bytes +representations easier. But the general consensus was that returning +bytes will more than likely be rare and that the various functions in +the os module are the better abstraction to promote over direct +calls to ``__fspath__()``. + + +Providing a ``path`` attribute +------------------------------ + +To help deal with the issue of ``pathlib.PurePath`` not inheriting +from ``str``, originally it was proposed to introduce a ``path`` +attribute to mirror what ``os.DirEntry`` provides. In the end, +though, it was determined that a protocol would provide the same +result while not directly exposing an API that most people will never +need to interact with directly. + + +Have ``__fspath__()`` only return strings +------------------------------------------ + +Much of the discussion that led to this PEP revolved around whether +``__fspath__()`` should be polymorphic and return ``bytes`` as well as +``str`` or only return ``str``. 
The general sentiment for this view +was that ``bytes`` are difficult to work with due to their +inherent lack of information about their encoding and PEP 383 makes +it possible to represent all file system paths using ``str`` with the +``surrogateescape`` handler. Thus, it would be better to forcibly +promote the use of ``str`` as the low-level path representation for +high-level path objects. + +In the end, it was decided that using ``bytes`` to represent paths is +simply not going to go away and thus they should be supported to some +degree. The hope is that people will gravitate towards path objects +like pathlib and that will move people away from operating directly +with ``bytes``. + + +A generic string encoding mechanism +----------------------------------- + +At one point there was a discussion of developing a generic mechanism +to extract a string representation of an object that had semantic +meaning (``__str__()`` does not necessarily return anything of +semantic significance beyond what may be helpful for debugging). In +the end, it was deemed to lack a motivating need beyond the one this +PEP is trying to solve in a specific fashion. + + +Have __fspath__ be an attribute +------------------------------- + +It was briefly considered to have ``__fspath__`` be an attribute +instead of a method. This was rejected for two reasons. One, +historically protocols have been implemented as "magic methods" and +not "magic methods and attributes". Two, there is no guarantee that +the lower-level representation of a path object will be pre-computed, +potentially misleading users that there was no expensive computation +behind the scenes in case the attribute was implemented as a property. + +This also indirectly ties into the idea of introducing a ``path`` +attribute to accomplish the same thing. This idea has an added issue, +though, of accidentally having any object with a ``path`` attribute +meet the protocol's duck typing. 
Introducing a new magic method for +the protocol helpfully avoids any accidental opting into the protocol. + + +Provide specific type hinting support +------------------------------------- + +There was some consideration to providing a generic ``typing.PathLike`` +class which would allow for e.g. ``typing.PathLike[str]`` to specify +a type hint for a path object which returned a string representation. +While potentially beneficial, the usefulness was deemed too small to +bother adding the type hint class. + +This also removed any desire to have a class in the ``typing`` module +which represented the union of all acceptable path-representing types +as that can be represented with +``typing.Union[str, bytes, os.PathLike]`` easily enough and the hope +is users will slowly gravitate to path objects only. + + +Provide ``os.fspathb()`` +------------------------ + +It was suggested that to mirror the structure of e.g. +``os.getcwd()``/``os.getcwdb()``, that ``os.fspath()`` only return +``str`` and that another function named ``os.fspathb()`` be +introduced that only returned ``bytes``. This was rejected as the +purposes of the ``*b()`` functions are tied to querying the file +system where there is a need to get the raw bytes back. As this PEP +does not work directly with data on a file system (but which *may* +be), the view was taken this distinction is unnecessary. It's also +believed that the need for only bytes will not be common enough to +need to support in such a specific manner as ``os.fsencode()`` will +provide similar functionality. + + +Call ``__fspath__()`` off of the instance +----------------------------------------- + +An earlier draft of this PEP had ``os.fspath()`` calling +``path.__fspath__()`` instead of ``type(path).__fspath__(path)``. This was +changed to be consistent with how other magic methods in Python are +resolved.
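The type-based lookup described above can be illustrated with a minimal sketch. The class names ``Report`` and ``Impostor``, the helper ``describe()``, and the paths are hypothetical and exist only for this illustration; the sketch assumes Python 3.6 or later, where ``os.fspath()`` is available.

```python
import os


class Report:
    """Hypothetical path-like class: __fspath__ is defined on the type."""

    def __init__(self, location):
        self._location = location

    def __fspath__(self):
        return self._location


class Impostor:
    """Hypothetical class that defines no __fspath__ on the type."""


def describe(obj):
    """Return os.fspath(obj), or the name of the exception it raises."""
    try:
        return os.fspath(obj)
    except TypeError as exc:
        return type(exc).__name__


report = Report("/tmp/report.txt")

# Attaching __fspath__ to an instance does not opt it into the protocol,
# because the method is looked up on type(obj) rather than on the instance.
impostor = Impostor()
impostor.__fspath__ = lambda: "/tmp/impostor.txt"

print(describe(report))
print(describe(impostor))
print(describe("already/a/str"))
```

Here ``describe(impostor)`` reports a ``TypeError`` even though the instance carries a callable ``__fspath__`` attribute, which mirrors how other magic methods are resolved.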
+ + +Acknowledgements +================ + +Thanks to everyone who participated in the various discussions related +to this PEP that spanned both python-ideas and python-dev. Special +thanks to Stephen Turnbull for direct feedback on early drafts of this +PEP. More special thanks to Koos Zevenhoven and Ethan Furman for not +only feedback on early drafts of this PEP but also helping to drive +the overall discussion on this topic across the two mailing lists. + + +References +========== + +.. [#python-ideas-archive] The python-ideas mailing list archive + (https://mail.python.org/pipermail/python-ideas/) + +.. [#python-dev-archive] The python-dev mailing list archive + (https://mail.python.org/pipermail/python-dev/) + +.. [#libc-open] ``open()`` documentation for the C standard library + (http://www.gnu.org/software/libc/manual/html_node/Opening-and-Closing-Files.html) + +.. [#pathlib] The ``pathlib`` module + (https://docs.python.org/3/library/pathlib.html#module-pathlib) + +.. [#builtins-open] The ``builtins.open()`` function + (https://docs.python.org/3/library/functions.html#open) + +.. [#os-fsencode] The ``os.fsencode()`` function + (https://docs.python.org/3/library/os.html#os.fsencode) + +.. [#os-fsdecode] The ``os.fsdecode()`` function + (https://docs.python.org/3/library/os.html#os.fsdecode) + +.. [#os-direntry] The ``os.DirEntry`` class + (https://docs.python.org/3/library/os.html#os.DirEntry) + +.. [#os-path] The ``os.path`` module + (https://docs.python.org/3/library/os.path.html#module-os.path) + + +Copyright +========= + +This document has been placed in the public domain. + + + +..
+ Local Variables: + mode: indented-text + indent-tabs-mode: nil + sentence-end-double-space: t + fill-column: 70 + coding: utf-8 + End: diff --git a/pep-0528.txt b/pep-0528.txt index 85d0d46d5..ad26401ef 100644 --- a/pep-0528.txt +++ b/pep-0528.txt @@ -1,182 +1,182 @@ -PEP: 528 -Title: Change Windows console encoding to UTF-8 -Version: $Revision$ -Last-Modified: $Date$ -Author: Steve Dower -Status: Final -Type: Standards Track -Content-Type: text/x-rst -Created: 27-Aug-2016 -Python-Version: 3.6 -Post-History: 01-Sep-2016, 04-Sep-2016 -Resolution: https://mail.python.org/pipermail/python-dev/2016-September/146278.html - -Abstract -======== - -Historically, Python uses the ANSI APIs for interacting with the Windows -operating system, often via C Runtime functions. However, these have been long -discouraged in favor of the UTF-16 APIs. Within the operating system, all text -is represented as UTF-16, and the ANSI APIs perform encoding and decoding using -the active code page. - -This PEP proposes changing the default standard stream implementation on Windows -to use the Unicode APIs. This will allow users to print and input the full range -of Unicode characters at the default Windows console. This also requires a -subtle change to how the tokenizer parses text from readline hooks. - -Specific Changes -================ - -Add _io.WindowsConsoleIO ------------------------- - -Currently an instance of ``_io.FileIO`` is used to wrap the file descriptors -representing standard input, output and error. We add a new class (implemented -in C) ``_io.WindowsConsoleIO`` that acts as a raw IO object using the Windows -console functions, specifically, ``ReadConsoleW`` and ``WriteConsoleW``. - -This class will be used when the legacy-mode flag is not in effect, when opening -a standard stream by file descriptor and the stream is a console buffer rather -than a redirected file. Otherwise, ``_io.FileIO`` will be used as it is today. 
- -This is a raw (bytes) IO class that requires text to be passed encoded with -utf-8, which will be decoded to utf-16-le and passed to the Windows APIs. -Similarly, bytes read from the class will be provided by the operating system as -utf-16-le and converted into utf-8 when returned to Python. - -The use of an ASCII compatible encoding is required to maintain compatibility -with code that bypasses the ``TextIOWrapper`` and directly writes ASCII bytes to -the standard streams (for example, `Twisted's process_stdinreader.py`_). Code that assumes -a particular encoding for the standard streams other than ASCII will likely -break. - -Add _PyOS_WindowsConsoleReadline --------------------------------- - -To allow Unicode entry at the interactive prompt, a new readline hook is -required. The existing ``PyOS_StdioReadline`` function will delegate to the new -``_PyOS_WindowsConsoleReadline`` function when reading from a file descriptor -that is a console buffer and the legacy-mode flag is not in effect (the logic -should be identical to above). - -Since the readline interface is required to return an 8-bit encoded string with -no embedded nulls, the ``_PyOS_WindowsConsoleReadline`` function transcodes from -utf-16-le as read from the operating system into utf-8. - -The function ``PyRun_InteractiveOneObject`` which currently obtains the encoding -from ``sys.stdin`` will select utf-8 unless the legacy-mode flag is in effect. -This may require readline hooks to change their encodings to utf-8, or to -require legacy-mode for correct behaviour. - -Add legacy mode ---------------- - -Launching Python with the environment variable ``PYTHONLEGACYWINDOWSSTDIO`` set -will enable the legacy-mode flag, which completely restores the previous -behaviour. - -Alternative Approaches -====================== - -The `win_unicode_console package`_ is a pure-Python alternative to changing the -default behaviour of the console. 
It implements essentially the same -modifications as described here using pure Python code. - -Code that may break -=================== - -The following code patterns may break or see different behaviour as a result of -this change. All of these code samples require explicitly choosing to use a raw -file object in place of a more convenient wrapper that would prevent any visible -change. - -Assuming stdin/stdout encoding ------------------------------- - -Code that assumes that the encoding required by ``sys.stdin.buffer`` or -``sys.stdout.buffer`` is ``'mbcs'`` or a more specific encoding may currently be -working by chance, but could encounter issues under this change. For example:: - - >>> sys.stdout.buffer.write(text.encode('mbcs')) - >>> r = sys.stdin.buffer.read(16).decode('cp437') - -To correct this code, the encoding specified on the ``TextIOWrapper`` should be -used, either implicitly or explicitly:: - - >>> # Fix 1: Use wrapper correctly - >>> sys.stdout.write(text) - >>> r = sys.stdin.read(16) - - >>> # Fix 2: Use encoding explicitly - >>> sys.stdout.buffer.write(text.encode(sys.stdout.encoding)) - >>> r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding) - -Incorrectly using the raw object --------------------------------- - -Code that uses the raw IO object and does not correctly handle partial reads and -writes may be affected. 
This is particularly important for reads, where the -number of characters read will never exceed one-fourth of the number of bytes -allowed, as there is no feasible way to prevent input from encoding as much -longer utf-8 strings:: - - >>> raw_stdin = sys.stdin.buffer.raw - >>> data = raw_stdin.read(15) - abcdefghijklm - b'abc' - # data contains at most 3 characters, and never more than 12 bytes - # error, as "defghijklm\r\n" is passed to the interactive prompt - -To correct this code, the buffered reader/writer should be used, or the caller -should continue reading until its buffer is full:: - - >>> # Fix 1: Use the buffered reader/writer - >>> stdin = sys.stdin.buffer - >>> data = stdin.read(15) - abcdefghijklm - b'abcdefghijklm\r\n' - - >>> # Fix 2: Loop until enough bytes have been read - >>> raw_stdin = sys.stdin.buffer.raw - >>> b = b'' - >>> while len(b) < 15: - ... b += raw_stdin.read(15) - abcdefghijklm - b'abcdefghijklm\r\n' - -Using the raw object with small buffers ---------------------------------------- - -Code that uses the raw IO object and attempts to read fewer than four bytes -will now receive an error. Because it's possible that any single character may -require up to four bytes when represented in utf-8, requests must fail:: - - >>> raw_stdin = sys.stdin.buffer.raw - >>> data = raw_stdin.read(3) - Traceback (most recent call last): - File "<stdin>", line 1, in <module> - ValueError: must read at least 4 bytes - -The only workaround is to pass a larger buffer:: - - >>> # Fix: Request at least four bytes - >>> raw_stdin = sys.stdin.buffer.raw - >>> data = raw_stdin.read(4) - a - b'a' - >>> >>> - -(The extra ``>>>`` is due to the newline remaining in the input buffer and is -expected in this situation.) - -Copyright -========= - -This document has been placed in the public domain. - -References -========== - -.. _Twisted's process_stdinreader.py: https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py -..
_win_unicode_console package: https://pypi.org/project/win_unicode_console/ +PEP: 528 +Title: Change Windows console encoding to UTF-8 +Version: $Revision$ +Last-Modified: $Date$ +Author: Steve Dower +Status: Final +Type: Standards Track +Content-Type: text/x-rst +Created: 27-Aug-2016 +Python-Version: 3.6 +Post-History: 01-Sep-2016, 04-Sep-2016 +Resolution: https://mail.python.org/pipermail/python-dev/2016-September/146278.html + +Abstract +======== + +Historically, Python uses the ANSI APIs for interacting with the Windows +operating system, often via C Runtime functions. However, these have been long +discouraged in favor of the UTF-16 APIs. Within the operating system, all text +is represented as UTF-16, and the ANSI APIs perform encoding and decoding using +the active code page. + +This PEP proposes changing the default standard stream implementation on Windows +to use the Unicode APIs. This will allow users to print and input the full range +of Unicode characters at the default Windows console. This also requires a +subtle change to how the tokenizer parses text from readline hooks. + +Specific Changes +================ + +Add _io.WindowsConsoleIO +------------------------ + +Currently an instance of ``_io.FileIO`` is used to wrap the file descriptors +representing standard input, output and error. We add a new class (implemented +in C) ``_io.WindowsConsoleIO`` that acts as a raw IO object using the Windows +console functions, specifically, ``ReadConsoleW`` and ``WriteConsoleW``. + +This class will be used when the legacy-mode flag is not in effect, when opening +a standard stream by file descriptor and the stream is a console buffer rather +than a redirected file. Otherwise, ``_io.FileIO`` will be used as it is today. + +This is a raw (bytes) IO class that requires text to be passed encoded with +utf-8, which will be decoded to utf-16-le and passed to the Windows APIs. 
+Similarly, bytes read from the class will be provided by the operating system as +utf-16-le and converted into utf-8 when returned to Python. + +The use of an ASCII compatible encoding is required to maintain compatibility +with code that bypasses the ``TextIOWrapper`` and directly writes ASCII bytes to +the standard streams (for example, `Twisted's process_stdinreader.py`_). Code that assumes +a particular encoding for the standard streams other than ASCII will likely +break. + +Add _PyOS_WindowsConsoleReadline +-------------------------------- + +To allow Unicode entry at the interactive prompt, a new readline hook is +required. The existing ``PyOS_StdioReadline`` function will delegate to the new +``_PyOS_WindowsConsoleReadline`` function when reading from a file descriptor +that is a console buffer and the legacy-mode flag is not in effect (the logic +should be identical to above). + +Since the readline interface is required to return an 8-bit encoded string with +no embedded nulls, the ``_PyOS_WindowsConsoleReadline`` function transcodes from +utf-16-le as read from the operating system into utf-8. + +The function ``PyRun_InteractiveOneObject`` which currently obtains the encoding +from ``sys.stdin`` will select utf-8 unless the legacy-mode flag is in effect. +This may require readline hooks to change their encodings to utf-8, or to +require legacy-mode for correct behaviour. + +Add legacy mode +--------------- + +Launching Python with the environment variable ``PYTHONLEGACYWINDOWSSTDIO`` set +will enable the legacy-mode flag, which completely restores the previous +behaviour. + +Alternative Approaches +====================== + +The `win_unicode_console package`_ is a pure-Python alternative to changing the +default behaviour of the console. It implements essentially the same +modifications as described here using pure Python code. 
+ +Code that may break +=================== + +The following code patterns may break or see different behaviour as a result of +this change. All of these code samples require explicitly choosing to use a raw +file object in place of a more convenient wrapper that would prevent any visible +change. + +Assuming stdin/stdout encoding +------------------------------ + +Code that assumes that the encoding required by ``sys.stdin.buffer`` or +``sys.stdout.buffer`` is ``'mbcs'`` or a more specific encoding may currently be +working by chance, but could encounter issues under this change. For example:: + + >>> sys.stdout.buffer.write(text.encode('mbcs')) + >>> r = sys.stdin.buffer.read(16).decode('cp437') + +To correct this code, the encoding specified on the ``TextIOWrapper`` should be +used, either implicitly or explicitly:: + + >>> # Fix 1: Use wrapper correctly + >>> sys.stdout.write(text) + >>> r = sys.stdin.read(16) + + >>> # Fix 2: Use encoding explicitly + >>> sys.stdout.buffer.write(text.encode(sys.stdout.encoding)) + >>> r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding) + +Incorrectly using the raw object +-------------------------------- + +Code that uses the raw IO object and does not correctly handle partial reads and +writes may be affected. 
This is particularly important for reads, where the +number of characters read will never exceed one-fourth of the number of bytes +allowed, as there is no feasible way to prevent input from encoding as much +longer utf-8 strings:: + + >>> raw_stdin = sys.stdin.buffer.raw + >>> data = raw_stdin.read(15) + abcdefghijklm + b'abc' + # data contains at most 3 characters, and never more than 12 bytes + # error, as "defghijklm\r\n" is passed to the interactive prompt + +To correct this code, the buffered reader/writer should be used, or the caller +should continue reading until its buffer is full:: + + >>> # Fix 1: Use the buffered reader/writer + >>> stdin = sys.stdin.buffer + >>> data = stdin.read(15) + abcdefghijklm + b'abcdefghijklm\r\n' + + >>> # Fix 2: Loop until enough bytes have been read + >>> raw_stdin = sys.stdin.buffer.raw + >>> b = b'' + >>> while len(b) < 15: + ... b += raw_stdin.read(15) + abcdefghijklm + b'abcdefghijklm\r\n' + +Using the raw object with small buffers +--------------------------------------- + +Code that uses the raw IO object and attempts to read fewer than four bytes +will now receive an error. Because it's possible that any single character may +require up to four bytes when represented in utf-8, requests must fail:: + + >>> raw_stdin = sys.stdin.buffer.raw + >>> data = raw_stdin.read(3) + Traceback (most recent call last): + File "<stdin>", line 1, in <module> + ValueError: must read at least 4 bytes + +The only workaround is to pass a larger buffer:: + + >>> # Fix: Request at least four bytes + >>> raw_stdin = sys.stdin.buffer.raw + >>> data = raw_stdin.read(4) + a + b'a' + >>> >>> + +(The extra ``>>>`` is due to the newline remaining in the input buffer and is +expected in this situation.) + +Copyright +========= + +This document has been placed in the public domain. + +References +========== + +.. _Twisted's process_stdinreader.py: https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py +..
_win_unicode_console package: https://pypi.org/project/win_unicode_console/ diff --git a/pep-0529.txt b/pep-0529.txt index 611522552..2e7264c1e 100644 --- a/pep-0529.txt +++ b/pep-0529.txt @@ -1,453 +1,453 @@ -PEP: 529 -Title: Change Windows filesystem encoding to UTF-8 -Version: $Revision$ -Last-Modified: $Date$ -Author: Steve Dower -Status: Final -Type: Standards Track -Content-Type: text/x-rst -Created: 27-Aug-2016 -Python-Version: 3.6 -Post-History: 01-Sep-2016, 04-Sep-2016 -Resolution: https://mail.python.org/pipermail/python-dev/2016-September/146277.html - -Abstract -======== - -Historically, Python uses the ANSI APIs for interacting with the Windows -operating system, often via C Runtime functions. However, these have been long -discouraged in favor of the UTF-16 APIs. Within the operating system, all text -is represented as UTF-16, and the ANSI APIs perform encoding and decoding using -the active code page. See `Naming Files, Paths, and Namespaces`_ for -more details. - -This PEP proposes changing the default filesystem encoding on Windows to utf-8, -and changing all filesystem functions to use the Unicode APIs for filesystem -paths. This will not affect code that uses strings to represent paths, however -those that use bytes for paths will now be able to correctly round-trip all -valid paths in Windows filesystems. Currently, the conversions between Unicode -(in the OS) and bytes (in Python) were lossy and would fail to round-trip -characters outside of the user's active code page. - -Notably, this does not impact the encoding of the contents of files. These will -continue to default to ``locale.getpreferredencoding()`` (for text files) or -plain bytes (for binary files). This only affects the encoding used when users -pass a bytes object to Python where it is then passed to the operating system as -a path name. - -Background -========== - -File system paths are almost universally represented as text with an encoding -determined by the file system. 
In Python, we expose these paths via a number of -interfaces, such as the ``os`` and ``io`` modules. Paths may be passed either -direction across these interfaces, that is, from the filesystem to the -application (for example, ``os.listdir()``), or from the application to the -filesystem (for example, ``os.unlink()``). - -When paths are passed between the filesystem and the application, they are -either passed through as a bytes blob or converted to/from str using -``os.fsencode()`` and ``os.fsdecode()`` or explicit encoding using -``sys.getfilesystemencoding()``. The result of encoding a string with -``sys.getfilesystemencoding()`` is a blob of bytes in the native format for the -default file system. - -On Windows, the native format for the filesystem is utf-16-le. The recommended -platform APIs for accessing the filesystem all accept and return text encoded in -this format. However, prior to Windows NT (and possibly further back), the -native format was a configurable machine option and a separate set of APIs -existed to accept this format. The option (the "active code page") and these -APIs (the "\*A functions") still exist in recent versions of Windows for -backwards compatibility, though new functionality often only has a utf-16-le API -(the "\*W functions"). - -In Python, str is recommended because it can correctly round-trip all characters -used in paths (on POSIX with surrogateescape handling; on Windows because str -maps to the native representation). On Windows bytes cannot round-trip all -characters used in paths, as Python internally uses the \*A functions and hence -the encoding is "whatever the active code page is". Since the active code page -cannot represent all Unicode characters, the conversion of a path into bytes can -lose information without warning or any available indication. 
- -As a demonstration of this:: - - >>> open('test\uAB00.txt', 'wb').close() - >>> import glob - >>> glob.glob('test*') - ['test\uab00.txt'] - >>> glob.glob(b'test*') - [b'test?.txt'] - -The Unicode character in the second call to glob has been replaced by a '?', -which means passing the path back into the filesystem will result in a -``FileNotFoundError``. The same results may be observed with ``os.listdir()`` or -any function that matches the return type to the parameter type. - -While one user-accessible fix is to use str everywhere, POSIX systems generally -do not suffer from data loss when using bytes exclusively as the bytes are the -canonical representation. Even if the encoding is "incorrect" by some standard, -the file system will still map the bytes back to the file. Making use of this -avoids the cost of decoding and reencoding, such that (theoretically, and only -on POSIX), code such as this may be faster because of the use of ``b'.'`` -compared to using ``'.'``:: - - >>> for f in os.listdir(b'.'): - ... os.stat(f) - ... - -As a result, POSIX-focused library authors prefer to use bytes to represent -paths. For some authors it is also a convenience, as their code may receive -bytes already known to be encoded correctly, while others are attempting to -simplify porting their code from Python 2. However, the correctness assumptions -do not carry over to Windows where Unicode is the canonical representation, and -errors may result. This potential data loss is why the use of bytes paths on -Windows was deprecated in Python 3.3 - all of the above code snippets produce -deprecation warnings on Windows. - -Proposal -======== - -Currently the default filesystem encoding is 'mbcs', which is a meta-encoder -that uses the active code page. However, when bytes are passed to the filesystem -they go through the \*A APIs and the operating system handles encoding. 
In this -case, paths are always encoded using the equivalent of 'mbcs:replace' with no -opportunity for Python to override or change this. - -This proposal would remove all use of the \*A APIs and only ever call the \*W -APIs. When Windows returns paths to Python as ``str``, they will be decoded from -utf-16-le and returned as text (in whatever the minimal representation is). When -Python code requests paths as ``bytes``, the paths will be transcoded from -utf-16-le into utf-8 using surrogatepass (Windows does not validate surrogate -pairs, so it is possible to have invalid surrogates in filenames). Equally, when -paths are provided as ``bytes``, they are transcoded from utf-8 into utf-16-le -and passed to the \*W APIs. - -The use of utf-8 will not be configurable, except for the provision of a -"legacy mode" flag to revert to the previous behaviour. - -The ``surrogateescape`` error mode does not apply here, as the concern is not -about retaining non-sensical bytes. Any path returned from the operating system -will be valid Unicode, while invalid paths created by the user should raise a -decoding error (currently these would raise ``OSError`` or a subclass). - -The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure the -ability to round-trip path names and allow basic manipulation (for example, -using the ``os.path`` module) when assuming an ASCII-compatible encoding. Using -utf-16-le as the encoding is more pure, but will cause more issues than are -resolved. - -This change would also undeprecate the use of bytes paths on Windows. No change -to the semantics of using bytes as a path is required - as before, they must be -encoded with the encoding specified by ``sys.getfilesystemencoding()``. 
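The transcoding step in the proposal above can be approximated in pure Python. This is only a sketch of the idea, not the C-level implementation: the helper names ``wide_to_bytes()`` and ``bytes_to_wide()`` and the sample file name are hypothetical, and the only facts relied on are that the ``utf-8`` and ``utf-16-le`` codecs both accept the ``surrogatepass`` error handler.

```python
def wide_to_bytes(wide):
    """Transcode utf-16-le data into utf-8 bytes, keeping lone surrogates."""
    text = wide.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def bytes_to_wide(path):
    """Transcode utf-8 bytes back into utf-16-le data, keeping lone surrogates."""
    text = path.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")


# A hypothetical file name containing a lone surrogate, which Windows
# permits because it does not validate surrogate pairs.
name = "test\uab00\ud800.txt"
wide = name.encode("utf-16-le", "surrogatepass")

as_bytes = wide_to_bytes(wide)
round_tripped = bytes_to_wide(as_bytes)

# The strict utf-8 codec rejects the same name outright.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(round_tripped == wide, strict_ok)
```

The round trip preserves the lone surrogate, whereas a strict utf-8 encode of the same name fails, which is the reason the proposal selects ``surrogatepass`` for this conversion.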
- -Specific Changes -================ - -Update sys.getfilesystemencoding --------------------------------- - -Remove the default value for ``Py_FileSystemDefaultEncoding`` and set it in -``initfsencoding()`` to utf-8, or if the legacy-mode switch is enabled to mbcs. - -Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and -``PyUnicode_EncodeFSDefault()`` to use the utf-8 codec, or if the legacy-mode -switch is enabled the existing mbcs codec. - -Add sys.getfilesystemencodeerrors ---------------------------------- - -As the error mode may now change between ``surrogatepass`` and ``replace``, -Python code that manually performs encoding also needs access to the current -error mode. This includes the implementation of ``os.fsencode()`` and -``os.fsdecode()``, which currently assume an error mode based on the codec. - -Add a public ``Py_FileSystemDefaultEncodeErrors``, similar to the existing -``Py_FileSystemDefaultEncoding``. The default value on Windows will be -``surrogatepass`` or in legacy mode, ``replace``. The default value on all other -platforms will be ``surrogateescape``. - -Add a public ``sys.getfilesystemencodeerrors()`` function that returns the -current error mode. - -Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and -``PyUnicode_EncodeFSDefault()`` to use the variable for error mode rather than -constant strings. - -Update the implementations of ``os.fsencode()`` and ``os.fsdecode()`` to use -``sys.getfilesystemencodeerrors()`` instead of assuming the mode. - -Update path_converter ---------------------- - -Update the path converter to always decode bytes or buffer objects into text -using ``PyUnicode_DecodeFSDefaultAndSize()``. - -Change the ``narrow`` field from a ``char*`` string into a flag that indicates -whether the original object was bytes. This is required for functions that need -to return paths using the same type as was originally provided. 
- -Remove unused ANSI code ------------------------ - -Remove all code paths using the ``narrow`` field, as these will no longer be -reachable by any caller. These are only used within ``posixmodule.c``. Other -uses of paths should have use of bytes paths replaced with decoding and use of -the \*W APIs. - -Add legacy mode ---------------- - -Add a legacy mode flag, enabled by the environment variable -``PYTHONLEGACYWINDOWSFSENCODING`` or by a function call to -``sys._enablelegacywindowsfsencoding()``. The function call can only be -used to enable the flag and should be used by programs as close to -initialization as possible. Legacy mode cannot be disabled while Python is -running. - -When this flag is set, the default filesystem encoding is set to mbcs rather -than utf-8, and the error mode is set to ``replace`` rather than -``surrogatepass``. Paths will continue to decode to wide characters and only \*W -APIs will be called, however, the bytes passed in and received from Python will -be encoded the same as prior to this change. - -Undeprecate bytes paths on Windows ----------------------------------- - -Using bytes as paths on Windows is currently deprecated. We would announce that -this is no longer the case, and that paths when encoded as bytes should use -whatever is returned from ``sys.getfilesystemencoding()`` rather than the user's -active code page. - -Beta experiment ---------------- - -To assist with determining the impact of this change, we propose applying it to -3.6.0b1 provisionally with the intent being to make a final decision before -3.6.0b4. - -During the experiment period, decoding and encoding exception messages will be -expanded to include a link to an active online discussion and encourage -reporting of problems. 
- -If it is decided to revert the functionality for 3.6.0b4, the implementation -change would be to permanently enable the legacy mode flag, change the -environment variable to ``PYTHONWINDOWSUTF8FSENCODING`` and function to -``sys._enablewindowsutf8fsencoding()`` to allow enabling the functionality -on a case-by-case basis, as opposed to disabling it. - -It is expected that if we cannot feasibly make the change for 3.6 due to -compatibility concerns, it will not be possible to make the change at any later -time in Python 3.x. - -Affected Modules ----------------- - -This PEP implicitly includes all modules within the Python that either pass path -names to the operating system, or otherwise use ``sys.getfilesystemencoding()``. - -As of 3.6.0a4, the following modules require modification: - -* ``os`` -* ``_overlapped`` -* ``_socket`` -* ``subprocess`` -* ``zipimport`` - -The following modules use ``sys.getfilesystemencoding()`` but do not need -modification: - -* ``gc`` (already assumes bytes are utf-8) -* ``grp`` (not compiled for Windows) -* ``http.server`` (correctly includes codec name with transmitted data) -* ``idlelib.editor`` (should not be needed; has fallback handling) -* ``nis`` (not compiled for Windows) -* ``pwd`` (not compiled for Windows) -* ``spwd`` (not compiled for Windows) -* ``_ssl`` (only used for ASCII constants) -* ``tarfile`` (code unused on Windows) -* ``_tkinter`` (already assumes bytes are utf-8) -* ``wsgiref`` (assumed as the default encoding for unknown environments) -* ``zipapp`` (code unused on Windows) - -The following native code uses one of the encoding or decoding functions, but do -not require any modification: - -* ``Parser/parsetok.c`` (docs already specify ``sys.getfilesystemencoding()``) -* ``Python/ast.c`` (docs already specify ``sys.getfilesystemencoding()``) -* ``Python/compile.c`` (undocumented, but Python filesystem encoding implied) -* ``Python/errors.c`` (docs already specify ``os.fsdecode()``) -* ``Python/fileutils.c`` 
(code unused on Windows) -* ``Python/future.c`` (undocumented, but Python filesystem encoding implied) -* ``Python/import.c`` (docs already specify utf-8) -* ``Python/importdl.c`` (code unused on Windows) -* ``Python/pythonrun.c`` (docs already specify ``sys.getfilesystemencoding()``) -* ``Python/symtable.c`` (undocumented, but Python filesystem encoding implied) -* ``Python/thread.c`` (code unused on Windows) -* ``Python/traceback.c`` (encodes correctly for comparing strings) -* ``Python/_warnings.c`` (docs already specify ``os.fsdecode()``) - -Rejected Alternatives -===================== - -Use strict mbcs decoding ------------------------- - -This is essentially the same as the proposed change, but instead of changing -``sys.getfilesystemencoding()`` to utf-8 it is changed to mbcs (which -dynamically maps to the active code page). - -This approach allows the use of new functionality that is only available as \*W -APIs and also detection of encoding/decoding errors. For example, rather than -silently replacing Unicode characters with '?', it would be possible to warn or -fail the operation. - -Compared to the proposed fix, this could enable some new functionality but does -not fix any of the problems described initially. New runtime errors may cause -some problems to be more obvious and lead to fixes, provided library maintainers -are interested in supporting Windows and adding a separate code path to treat -filesystem paths as strings. - -Making the encoding mbcs without strict errors is equivalent to the legacy-mode -switch being enabled by default. This is a possible course of action if there is -significant breakage of actual code and a need to extend the deprecation period, -but still a desire to have the simplifications to the CPython source. - -Make bytes paths an error on Windows ------------------------------------- - -By preventing the use of bytes paths on Windows completely we prevent users from -hitting encoding issues. 
- -However, the motivation for this PEP is to increase the likelihood that code -written on POSIX will also work correctly on Windows. This alternative would -move the other direction and make such code completely incompatible. As this -does not benefit users in any way, we reject it. - -Make bytes paths an error on all platforms ------------------------------------------- - -By deprecating and then disable the use of bytes paths on all platforms we -prevent users from hitting encoding issues regardless of where the code was -originally written. This would require a full deprecation cycle, as there are -currently no warnings on platforms other than Windows. - -This is likely to be seen as a hostile action against Python developers in -general, and as such is rejected at this time. - -Code that may break -=================== - -The following code patterns may break or see different behaviour as a result of -this change. Each of these examples would have been fragile in code intended for -cross-platform use. The suggested fixes demonstrate the most compatible way to -handle path encoding issues across all platforms and across multiple Python -versions. - -Note that all of these examples produce deprecation warnings on Python 3.3 and -later. - -Not managing encodings across boundaries ----------------------------------------- - -Code that does not manage encodings when crossing protocol boundaries may -currently be working by chance, but could encounter issues when either encoding -changes. 
Note that the source of ``filename`` may be any function that returns -a bytes object, as illustrated in a second example below:: - - >>> filename = open('filename_in_mbcs.txt', 'rb').read() - >>> text = open(filename, 'r').read() - -To correct this code, the encoding of the bytes in ``filename`` should be -specified, either when reading from the file or before using the value:: - - >>> # Fix 1: Open file as text (default encoding) - >>> filename = open('filename_in_mbcs.txt', 'r').read() - >>> text = open(filename, 'r').read() - - >>> # Fix 2: Open file as text (explicit encoding) - >>> filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read() - >>> text = open(filename, 'r').read() - - >>> # Fix 3: Explicitly decode the path - >>> filename = open('filename_in_mbcs.txt', 'rb').read() - >>> text = open(filename.decode('mbcs'), 'r').read() - -Where the creator of ``filename`` is separated from the user of ``filename``, -the encoding is important information to include:: - - >>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('mbcs') - - >>> filename = some_object.filename - >>> type(filename) - <class 'bytes'> - >>> text = open(filename, 'r').read() - -To fix this code for best compatibility across operating systems and Python -versions, the filename should be exposed as str:: - - >>> # Fix 1: Expose as str - >>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt' - - >>> filename = some_object.filename - >>> type(filename) - <class 'str'> - >>> text = open(filename, 'r').read() - -Alternatively, the encoding used for the path needs to be made available to the -user.
Specifying ``os.fsencode()`` (or ``sys.getfilesystemencoding()``) is an -acceptable choice, or a new attribute could be added with the exact encoding:: - - >>> # Fix 2: Use fsencode - >>> some_object.filename = os.fsencode(r'C:\Users\Steve\Documents\my_file.txt') - - >>> filename = some_object.filename - >>> type(filename) - <class 'bytes'> - >>> text = open(filename, 'r').read() - - - >>> # Fix 3: Expose as explicit encoding - >>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('cp437') - >>> some_object.filename_encoding = 'cp437' - - >>> filename = some_object.filename - >>> type(filename) - <class 'bytes'> - >>> filename = filename.decode(some_object.filename_encoding) - >>> type(filename) - <class 'str'> - >>> text = open(filename, 'r').read() - - -Explicitly using 'mbcs' ------------------------ - -Code that explicitly encodes text using 'mbcs' before passing to file system -APIs is now passing incorrectly encoded bytes. Note that the source of -``filename`` in this example is not relevant, provided that it is a str:: - - >>> filename = open('files.txt', 'r').readline().rstrip() - >>> text = open(filename.encode('mbcs'), 'r') - -To correct this code, the string should be passed without explicit encoding, or -should use ``os.fsencode()``:: - - >>> # Fix 1: Do not encode the string - >>> filename = open('files.txt', 'r').readline().rstrip() - >>> text = open(filename, 'r') - - >>> # Fix 2: Use correct encoding - >>> filename = open('files.txt', 'r').readline().rstrip() - >>> text = open(os.fsencode(filename), 'r') - - -References -========== - -.. _Naming Files, Paths, and Namespaces: - https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247.aspx - -Copyright -========= - -This document has been placed in the public domain.
+PEP: 529 +Title: Change Windows filesystem encoding to UTF-8 +Version: $Revision$ +Last-Modified: $Date$ +Author: Steve Dower +Status: Final +Type: Standards Track +Content-Type: text/x-rst +Created: 27-Aug-2016 +Python-Version: 3.6 +Post-History: 01-Sep-2016, 04-Sep-2016 +Resolution: https://mail.python.org/pipermail/python-dev/2016-September/146277.html + +Abstract +======== + +Historically, Python has used the ANSI APIs for interacting with the Windows +operating system, often via C Runtime functions. However, these have long been +discouraged in favor of the UTF-16 APIs. Within the operating system, all text +is represented as UTF-16, and the ANSI APIs perform encoding and decoding using +the active code page. See `Naming Files, Paths, and Namespaces`_ for +more details. + +This PEP proposes changing the default filesystem encoding on Windows to utf-8, +and changing all filesystem functions to use the Unicode APIs for filesystem +paths. This will not affect code that uses strings to represent paths; however, +code that uses bytes for paths will now be able to correctly round-trip all +valid paths in Windows filesystems. Currently, the conversions between Unicode +(in the OS) and bytes (in Python) are lossy and fail to round-trip +characters outside of the user's active code page. + +Notably, this does not impact the encoding of the contents of files. These will +continue to default to ``locale.getpreferredencoding()`` (for text files) or +plain bytes (for binary files). This only affects the encoding used when users +pass a bytes object to Python where it is then passed to the operating system as +a path name. + +Background +========== + +File system paths are almost universally represented as text with an encoding +determined by the file system. In Python, we expose these paths via a number of +interfaces, such as the ``os`` and ``io`` modules.
Paths may be passed either +direction across these interfaces, that is, from the filesystem to the +application (for example, ``os.listdir()``), or from the application to the +filesystem (for example, ``os.unlink()``). + +When paths are passed between the filesystem and the application, they are +either passed through as a bytes blob or converted to/from str using +``os.fsencode()`` and ``os.fsdecode()`` or explicit encoding using +``sys.getfilesystemencoding()``. The result of encoding a string with +``sys.getfilesystemencoding()`` is a blob of bytes in the native format for the +default file system. + +On Windows, the native format for the filesystem is utf-16-le. The recommended +platform APIs for accessing the filesystem all accept and return text encoded in +this format. However, prior to Windows NT (and possibly further back), the +native format was a configurable machine option and a separate set of APIs +existed to accept this format. The option (the "active code page") and these +APIs (the "\*A functions") still exist in recent versions of Windows for +backwards compatibility, though new functionality often only has a utf-16-le API +(the "\*W functions"). + +In Python, str is recommended because it can correctly round-trip all characters +used in paths (on POSIX with surrogateescape handling; on Windows because str +maps to the native representation). On Windows bytes cannot round-trip all +characters used in paths, as Python internally uses the \*A functions and hence +the encoding is "whatever the active code page is". Since the active code page +cannot represent all Unicode characters, the conversion of a path into bytes can +lose information without warning or any available indication. 
+ +As a demonstration of this:: + + >>> open('test\uAB00.txt', 'wb').close() + >>> import glob + >>> glob.glob('test*') + ['test\uab00.txt'] + >>> glob.glob(b'test*') + [b'test?.txt'] + +The Unicode character in the second call to glob has been replaced by a '?', +which means passing the path back into the filesystem will result in a +``FileNotFoundError``. The same results may be observed with ``os.listdir()`` or +any function that matches the return type to the parameter type. + +While one user-accessible fix is to use str everywhere, POSIX systems generally +do not suffer from data loss when using bytes exclusively as the bytes are the +canonical representation. Even if the encoding is "incorrect" by some standard, +the file system will still map the bytes back to the file. Making use of this +avoids the cost of decoding and reencoding, such that (theoretically, and only +on POSIX), code such as this may be faster because of the use of ``b'.'`` +compared to using ``'.'``:: + + >>> for f in os.listdir(b'.'): + ... os.stat(f) + ... + +As a result, POSIX-focused library authors prefer to use bytes to represent +paths. For some authors it is also a convenience, as their code may receive +bytes already known to be encoded correctly, while others are attempting to +simplify porting their code from Python 2. However, the correctness assumptions +do not carry over to Windows where Unicode is the canonical representation, and +errors may result. This potential data loss is why the use of bytes paths on +Windows was deprecated in Python 3.3 - all of the above code snippets produce +deprecation warnings on Windows. + +Proposal +======== + +Currently the default filesystem encoding is 'mbcs', which is a meta-encoder +that uses the active code page. However, when bytes are passed to the filesystem +they go through the \*A APIs and the operating system handles encoding. 
In this +case, paths are always encoded using the equivalent of 'mbcs:replace' with no +opportunity for Python to override or change this. + +This proposal would remove all use of the \*A APIs and only ever call the \*W +APIs. When Windows returns paths to Python as ``str``, they will be decoded from +utf-16-le and returned as text (in whatever the minimal representation is). When +Python code requests paths as ``bytes``, the paths will be transcoded from +utf-16-le into utf-8 using surrogatepass (Windows does not validate surrogate +pairs, so it is possible to have invalid surrogates in filenames). Equally, when +paths are provided as ``bytes``, they are transcoded from utf-8 into utf-16-le +and passed to the \*W APIs. + +The use of utf-8 will not be configurable, except for the provision of a +"legacy mode" flag to revert to the previous behaviour. + +The ``surrogateescape`` error mode does not apply here, as the concern is not +about retaining non-sensical bytes. Any path returned from the operating system +will be valid Unicode, while invalid paths created by the user should raise a +decoding error (currently these would raise ``OSError`` or a subclass). + +The choice of utf-8 bytes (as opposed to utf-16-le bytes) is to ensure the +ability to round-trip path names and allow basic manipulation (for example, +using the ``os.path`` module) when assuming an ASCII-compatible encoding. Using +utf-16-le as the encoding is more pure, but will cause more issues than are +resolved. + +This change would also undeprecate the use of bytes paths on Windows. No change +to the semantics of using bytes as a path is required - as before, they must be +encoded with the encoding specified by ``sys.getfilesystemencoding()``. 
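The transcoding rules in the Proposal section above can be illustrated with a small sketch. It is not taken from the PEP's implementation; it only uses the standard ``utf-8`` codec with the ``surrogatepass`` error handler, and the file name is a made-up example containing a lone surrogate of the kind Windows permits.

```python
# Hypothetical file name containing a lone (unpaired) surrogate.
name = "report_\ud800.txt"

# Strict utf-8 refuses to encode a lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogatepass encodes the surrogate code point as a three-byte sequence
# and decodes it back to the identical str, so the name round-trips.
encoded = name.encode("utf-8", "surrogatepass")
decoded = encoded.decode("utf-8", "surrogatepass")

# The ASCII parts stay byte-for-byte ASCII, so bytes-based manipulation
# (here a plain split on b".") still works on the encoded form.
stem, ext = encoded.rsplit(b".", 1)

# By contrast, utf-16-le interleaves NUL bytes with ASCII characters.
utf16_sample = "a.b".encode("utf-16-le")
```

The last line hints at why utf-16-le bytes would be awkward for code that assumes an ASCII-compatible encoding: every ASCII character is followed by a zero byte.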
+ +Specific Changes +================ + +Update sys.getfilesystemencoding +-------------------------------- + +Remove the default value for ``Py_FileSystemDefaultEncoding`` and set it in +``initfsencoding()`` to utf-8, or if the legacy-mode switch is enabled to mbcs. + +Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and +``PyUnicode_EncodeFSDefault()`` to use the utf-8 codec, or if the legacy-mode +switch is enabled the existing mbcs codec. + +Add sys.getfilesystemencodeerrors +--------------------------------- + +As the error mode may now change between ``surrogatepass`` and ``replace``, +Python code that manually performs encoding also needs access to the current +error mode. This includes the implementation of ``os.fsencode()`` and +``os.fsdecode()``, which currently assume an error mode based on the codec. + +Add a public ``Py_FileSystemDefaultEncodeErrors``, similar to the existing +``Py_FileSystemDefaultEncoding``. The default value on Windows will be +``surrogatepass`` or in legacy mode, ``replace``. The default value on all other +platforms will be ``surrogateescape``. + +Add a public ``sys.getfilesystemencodeerrors()`` function that returns the +current error mode. + +Update the implementations of ``PyUnicode_DecodeFSDefaultAndSize()`` and +``PyUnicode_EncodeFSDefault()`` to use the variable for error mode rather than +constant strings. + +Update the implementations of ``os.fsencode()`` and ``os.fsdecode()`` to use +``sys.getfilesystemencodeerrors()`` instead of assuming the mode. + +Update path_converter +--------------------- + +Update the path converter to always decode bytes or buffer objects into text +using ``PyUnicode_DecodeFSDefaultAndSize()``. + +Change the ``narrow`` field from a ``char*`` string into a flag that indicates +whether the original object was bytes. This is required for functions that need +to return paths using the same type as was originally provided. 
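As a rough illustration of the "Add sys.getfilesystemencodeerrors" change described above, the following sketch shows how Python code that encodes paths by hand could look up both the codec and the error mode instead of assuming them. It assumes Python 3.6 or later, where ``sys.getfilesystemencodeerrors()`` is available, and the path is a made-up ASCII example so the result does not depend on the platform.

```python
import os
import sys

# Look up the current filesystem codec and its error mode rather than
# hard-coding either one.
encoding = sys.getfilesystemencoding()
errors = sys.getfilesystemencodeerrors()

path = "example_dir/data.txt"  # hypothetical path, ASCII only

# Manual encoding with the looked-up settings...
manual = path.encode(encoding, errors)

# ...is expected to match what os.fsencode() produces, and os.fsdecode()
# should reverse it.
via_os = os.fsencode(path)
round_trip = os.fsdecode(via_os)
```

In most code, calling ``os.fsencode()`` and ``os.fsdecode()`` directly is simpler; the manual form is shown only to make the role of the error mode visible.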
+ +Remove unused ANSI code +----------------------- + +Remove all code paths using the ``narrow`` field, as these will no longer be +reachable by any caller. These are only used within ``posixmodule.c``. Other +uses of paths should have use of bytes paths replaced with decoding and use of +the \*W APIs. + +Add legacy mode +--------------- + +Add a legacy mode flag, enabled by the environment variable +``PYTHONLEGACYWINDOWSFSENCODING`` or by a function call to +``sys._enablelegacywindowsfsencoding()``. The function call can only be +used to enable the flag and should be used by programs as close to +initialization as possible. Legacy mode cannot be disabled while Python is +running. + +When this flag is set, the default filesystem encoding is set to mbcs rather +than utf-8, and the error mode is set to ``replace`` rather than +``surrogatepass``. Paths will continue to decode to wide characters and only \*W +APIs will be called, however, the bytes passed in and received from Python will +be encoded the same as prior to this change. + +Undeprecate bytes paths on Windows +---------------------------------- + +Using bytes as paths on Windows is currently deprecated. We would announce that +this is no longer the case, and that paths when encoded as bytes should use +whatever is returned from ``sys.getfilesystemencoding()`` rather than the user's +active code page. + +Beta experiment +--------------- + +To assist with determining the impact of this change, we propose applying it to +3.6.0b1 provisionally with the intent being to make a final decision before +3.6.0b4. + +During the experiment period, decoding and encoding exception messages will be +expanded to include a link to an active online discussion and encourage +reporting of problems. 
+ +If it is decided to revert the functionality for 3.6.0b4, the implementation +change would be to permanently enable the legacy mode flag, change the +environment variable to ``PYTHONWINDOWSUTF8FSENCODING`` and function to +``sys._enablewindowsutf8fsencoding()`` to allow enabling the functionality +on a case-by-case basis, as opposed to disabling it. + +It is expected that if we cannot feasibly make the change for 3.6 due to +compatibility concerns, it will not be possible to make the change at any later +time in Python 3.x. + +Affected Modules +---------------- + +This PEP implicitly includes all modules within Python that either pass path +names to the operating system, or otherwise use ``sys.getfilesystemencoding()``. + +As of 3.6.0a4, the following modules require modification: + +* ``os`` +* ``_overlapped`` +* ``_socket`` +* ``subprocess`` +* ``zipimport`` + +The following modules use ``sys.getfilesystemencoding()`` but do not need +modification: + +* ``gc`` (already assumes bytes are utf-8) +* ``grp`` (not compiled for Windows) +* ``http.server`` (correctly includes codec name with transmitted data) +* ``idlelib.editor`` (should not be needed; has fallback handling) +* ``nis`` (not compiled for Windows) +* ``pwd`` (not compiled for Windows) +* ``spwd`` (not compiled for Windows) +* ``_ssl`` (only used for ASCII constants) +* ``tarfile`` (code unused on Windows) +* ``_tkinter`` (already assumes bytes are utf-8) +* ``wsgiref`` (assumed as the default encoding for unknown environments) +* ``zipapp`` (code unused on Windows) + +The following native code uses one of the encoding or decoding functions, but does +not require any modification: + +* ``Parser/parsetok.c`` (docs already specify ``sys.getfilesystemencoding()``) +* ``Python/ast.c`` (docs already specify ``sys.getfilesystemencoding()``) +* ``Python/compile.c`` (undocumented, but Python filesystem encoding implied) +* ``Python/errors.c`` (docs already specify ``os.fsdecode()``) +* ``Python/fileutils.c``
(code unused on Windows) +* ``Python/future.c`` (undocumented, but Python filesystem encoding implied) +* ``Python/import.c`` (docs already specify utf-8) +* ``Python/importdl.c`` (code unused on Windows) +* ``Python/pythonrun.c`` (docs already specify ``sys.getfilesystemencoding()``) +* ``Python/symtable.c`` (undocumented, but Python filesystem encoding implied) +* ``Python/thread.c`` (code unused on Windows) +* ``Python/traceback.c`` (encodes correctly for comparing strings) +* ``Python/_warnings.c`` (docs already specify ``os.fsdecode()``) + +Rejected Alternatives +===================== + +Use strict mbcs decoding +------------------------ + +This is essentially the same as the proposed change, but instead of changing +``sys.getfilesystemencoding()`` to utf-8 it is changed to mbcs (which +dynamically maps to the active code page). + +This approach allows the use of new functionality that is only available as \*W +APIs and also detection of encoding/decoding errors. For example, rather than +silently replacing Unicode characters with '?', it would be possible to warn or +fail the operation. + +Compared to the proposed fix, this could enable some new functionality but does +not fix any of the problems described initially. New runtime errors may cause +some problems to be more obvious and lead to fixes, provided library maintainers +are interested in supporting Windows and adding a separate code path to treat +filesystem paths as strings. + +Making the encoding mbcs without strict errors is equivalent to the legacy-mode +switch being enabled by default. This is a possible course of action if there is +significant breakage of actual code and a need to extend the deprecation period, +but still a desire to have the simplifications to the CPython source. + +Make bytes paths an error on Windows +------------------------------------ + +By preventing the use of bytes paths on Windows completely we prevent users from +hitting encoding issues. 
+ +However, the motivation for this PEP is to increase the likelihood that code +written on POSIX will also work correctly on Windows. This alternative would +move in the other direction and make such code completely incompatible. As this +does not benefit users in any way, we reject it. + +Make bytes paths an error on all platforms +------------------------------------------ + +By deprecating and then disabling the use of bytes paths on all platforms, we +prevent users from hitting encoding issues regardless of where the code was +originally written. This would require a full deprecation cycle, as there are +currently no warnings on platforms other than Windows. + +This is likely to be seen as a hostile action against Python developers in +general, and as such is rejected at this time. + +Code that may break +=================== + +The following code patterns may break or see different behaviour as a result of +this change. Each of these examples would have been fragile in code intended for +cross-platform use. The suggested fixes demonstrate the most compatible way to +handle path encoding issues across all platforms and across multiple Python +versions. + +Note that all of these examples produce deprecation warnings on Python 3.3 and +later. + +Not managing encodings across boundaries +---------------------------------------- + +Code that does not manage encodings when crossing protocol boundaries may +currently be working by chance, but could encounter issues when either encoding +changes.
Note that the source of ``filename`` may be any function that returns +a bytes object, as illustrated in a second example below:: + + >>> filename = open('filename_in_mbcs.txt', 'rb').read() + >>> text = open(filename, 'r').read() + +To correct this code, the encoding of the bytes in ``filename`` should be +specified, either when reading from the file or before using the value:: + + >>> # Fix 1: Open file as text (default encoding) + >>> filename = open('filename_in_mbcs.txt', 'r').read() + >>> text = open(filename, 'r').read() + + >>> # Fix 2: Open file as text (explicit encoding) + >>> filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read() + >>> text = open(filename, 'r').read() + + >>> # Fix 3: Explicitly decode the path + >>> filename = open('filename_in_mbcs.txt', 'rb').read() + >>> text = open(filename.decode('mbcs'), 'r').read() + +Where the creator of ``filename`` is separated from the user of ``filename``, +the encoding is important information to include:: + + >>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('mbcs') + + >>> filename = some_object.filename + >>> type(filename) + <class 'bytes'> + >>> text = open(filename, 'r').read() + +To fix this code for best compatibility across operating systems and Python +versions, the filename should be exposed as str:: + + >>> # Fix 1: Expose as str + >>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt' + + >>> filename = some_object.filename + >>> type(filename) + <class 'str'> + >>> text = open(filename, 'r').read() + +Alternatively, the encoding used for the path needs to be made available to the +user.
Specifying ``os.fsencode()`` (or ``sys.getfilesystemencoding()``) is an +acceptable choice, or a new attribute could be added with the exact encoding:: + + >>> # Fix 2: Use fsencode + >>> some_object.filename = os.fsencode(r'C:\Users\Steve\Documents\my_file.txt') + + >>> filename = some_object.filename + >>> type(filename) + <class 'bytes'> + >>> text = open(filename, 'r').read() + + + >>> # Fix 3: Expose as explicit encoding + >>> some_object.filename = r'C:\Users\Steve\Documents\my_file.txt'.encode('cp437') + >>> some_object.filename_encoding = 'cp437' + + >>> filename = some_object.filename + >>> type(filename) + <class 'bytes'> + >>> filename = filename.decode(some_object.filename_encoding) + >>> type(filename) + <class 'str'> + >>> text = open(filename, 'r').read() + + +Explicitly using 'mbcs' +----------------------- + +Code that explicitly encodes text using 'mbcs' before passing to file system +APIs is now passing incorrectly encoded bytes. Note that the source of +``filename`` in this example is not relevant, provided that it is a str:: + + >>> filename = open('files.txt', 'r').readline().rstrip() + >>> text = open(filename.encode('mbcs'), 'r') + +To correct this code, the string should be passed without explicit encoding, or +should use ``os.fsencode()``:: + + >>> # Fix 1: Do not encode the string + >>> filename = open('files.txt', 'r').readline().rstrip() + >>> text = open(filename, 'r') + + >>> # Fix 2: Use correct encoding + >>> filename = open('files.txt', 'r').readline().rstrip() + >>> text = open(os.fsencode(filename), 'r') + + +References +========== + +.. _Naming Files, Paths, and Namespaces: + https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247.aspx + +Copyright +========= + +This document has been placed in the public domain.