PEP 471: Ben Hoyt updates

2014-07-18 18:25:41 +02:00 · 89ae8bb813
parent 06c61b9447
commit 89ae8bb813
1 changed files with 205 additions and 113 deletions
--- a/pep-0471.txt
+++ b/pep-0471.txt
@@ -8,7 +8,7 @@ Type: Standards Track
 Content-Type: text/x-rst
 Created: 30-May-2014
 Python-Version: 3.5
-Post-History: 27-Jun-2014, 8-Jul-2014
+Post-History: 27-Jun-2014, 8-Jul-2014, 14-Jul-2014, 18-Jul-2014


 Abstract
@@ -16,9 +16,9 @@ Abstract

 This PEP proposes including a new directory iteration function,
 ``os.scandir()``, in the standard library. This new function adds
-useful functionality and increases the speed of ``os.walk()`` by 2-10
-times (depending on the platform and file system) by significantly
-reducing the number of times ``stat()`` needs to be called.
+useful functionality and increases the speed of ``os.walk()`` by 2-20
+times (depending on the platform and file system) by avoiding calls to
+``os.stat()`` in most cases.


 Rationale
@@ -34,8 +34,8 @@ But the underlying system calls -- ``FindFirstFile`` /
 ``FindNextFile`` on Windows and ``readdir`` on POSIX systems --
 already tell you whether the files returned are directories or not, so
 no further system calls are needed. Further, the Windows system calls
-return all the information for a ``stat_result`` object, such as file
-size and last modification time.
+return all the information for a ``stat_result`` object on the directory
+entry, such as file size and last modification time.

 In short, you can reduce the number of system calls required for a
 tree function like ``os.walk()`` from approximately 2N to N, where N
@@ -56,7 +56,7 @@ iterates instead of returning them as one big list. This improves
 memory efficiency for iterating very large directories.

 So, as well as providing a ``scandir()`` iterator function for calling
-directly, Python's existing ``os.walk()`` function could be sped up a
+directly, Python's existing ``os.walk()`` function can be sped up a
 huge amount.

 .. _`Issue 11406`: http://bugs.python.org/issue11406
@@ -67,7 +67,8 @@ Implementation

 The implementation of this proposal was written by Ben Hoyt (initial
 version) and Tim Golden (who helped a lot with the C extension
-module). It lives on GitHub at `benhoyt/scandir`_.
+module). It lives on GitHub at `benhoyt/scandir`_. (The implementation
+may lag behind the updates to this PEP a little.)

 .. _`benhoyt/scandir`: https://github.com/benhoyt/scandir

@@ -82,67 +83,83 @@ the standard library, as well as integration into ``posixmodule.c``.
 Specifics of proposal
 =====================

+os.scandir()
+------------
+
 Specifically, this PEP proposes adding a single function to the ``os``
 module in the standard library, ``scandir``, that takes a single,
 optional string as its argument::

-    scandir(path='.') -> generator of DirEntry objects
+    scandir(directory='.') -> generator of DirEntry objects

 Like ``listdir``, ``scandir`` calls the operating system's directory
-iteration system calls to get the names of the files in the ``path``
-directory, but it's different from ``listdir`` in two ways:
+iteration system calls to get the names of the files in the given
+``directory``, but it's different from ``listdir`` in two ways:

 * Instead of returning bare filename strings, it returns lightweight
  ``DirEntry`` objects that hold the filename string and provide
  simple methods that allow access to the additional data the
-  operating system returned.
+  operating system may have returned.

 * It returns a generator instead of a list, so that ``scandir`` acts
  as a true iterator instead of returning the full list immediately.

-``scandir()`` yields a ``DirEntry`` object for each file and directory
-in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
-pseudo-directories are skipped, and the entries are yielded in
-system-dependent order. Each ``DirEntry`` object has the following
-attributes and methods:
+``scandir()`` yields a ``DirEntry`` object for each file and
+sub-directory in ``directory``. Just like ``listdir``, the ``'.'``
+and ``'..'`` pseudo-directories are skipped, and the entries are
+yielded in system-dependent order. Each ``DirEntry`` object has the
+following attributes and methods:

-* ``name``: the entry's filename, relative to the ``path`` argument
-  (corresponds to the return values of ``os.listdir``)
+* ``name``: the entry's filename, relative to the ``directory``
+  argument (corresponds to the return values of ``os.listdir``)

-* ``full_name``: the entry's full path name -- the equivalent of
-  ``os.path.join(path, entry.name)``
+* ``path``: the entry's full path name (not necessarily an absolute
+  path) -- the equivalent of ``os.path.join(directory, entry.name)``

-* ``is_dir()``: like ``os.path.isdir()``, but much cheaper -- it never
-  requires a system call on Windows, and usually doesn't on POSIX
-  systems
+* ``is_dir(*, follow_symlinks=True)``: similar to
+  ``pathlib.Path.is_dir()``, but the return value is cached on the
+  ``DirEntry`` object; doesn't require a system call in most cases;
+  doesn't follow symbolic links if ``follow_symlinks`` is False

-* ``is_file()``: like ``os.path.isfile()``, but much cheaper -- it
-  never requires a system call on Windows, and usually doesn't on
-  POSIX systems
+* ``is_file(*, follow_symlinks=True)``: similar to
+  ``pathlib.Path.is_file()``, but the return value is cached on the
+  ``DirEntry`` object; doesn't require a system call in most cases;
+  doesn't follow symbolic links if ``follow_symlinks`` is False

-* ``is_symlink()``: like ``os.path.islink()``, but much cheaper -- it
-  never requires a system call on Windows, and usually doesn't on
-  POSIX systems
+* ``is_symlink()``: similar to ``pathlib.Path.is_symlink()``, but the
+  return value is cached on the ``DirEntry`` object; doesn't require a
+  system call in most cases

-* ``lstat()``: like ``os.lstat()``, but much cheaper on some systems
-  -- it only requires a system call on POSIX systems
+* ``stat(*, follow_symlinks=True)``: like ``os.stat()``, but the
+  return value is cached on the ``DirEntry`` object; does not require a
+  system call on Windows (except for symlinks); doesn't follow symbolic links
+  (like ``os.lstat()``) if ``follow_symlinks`` is False

-The ``is_X`` methods may perform a ``stat()`` call under certain
-conditions (for example, on certain file systems on POSIX systems),
-and therefore possibly raise ``OSError``. The ``lstat()`` method will
-call ``stat()`` on POSIX systems and therefore also possibly raise
-``OSError``. See the "Notes on exception handling" section for more
-details.
+All *methods* may perform system calls in some cases and therefore
+possibly raise ``OSError`` -- see the "Notes on exception handling"
+section for more details.

 The ``DirEntry`` attribute and method names were chosen to be the same
-as those in the new ``pathlib`` module for consistency.
+as those in the new ``pathlib`` module where possible, for
+consistency. The only difference in functionality is that the
+``DirEntry`` methods cache their values on the entry object after the
+first call.

 Like the other functions in the ``os`` module, ``scandir()`` accepts
-either a bytes or str object for the ``path`` parameter, and returns
-the ``DirEntry.name`` and ``DirEntry.full_name`` attributes with the
-same type as ``path``. However, it is *strongly recommended* to use
-the str type, as this ensures cross-platform support for Unicode
-filenames.
+either a bytes or str object for the ``directory`` parameter, and
+returns the ``DirEntry.name`` and ``DirEntry.path`` attributes with
+the same type as ``directory``. However, it is *strongly recommended*
+to use the str type, as this ensures cross-platform support for
+Unicode filenames. (On Windows, bytes filenames have been deprecated
+since Python 3.3.)
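
A minimal sketch of the str/bytes behaviour described above, assuming the ``os.scandir()`` that ships with Python 3.5 and later (the directory is passed positionally, so the parameter name does not matter). ``entry_name_types`` is a hypothetical helper used only for illustration, not part of the proposal:

```python
import os


def entry_name_types(directory):
    """Return the set of types seen for DirEntry.name and DirEntry.path."""
    types = set()
    for entry in os.scandir(directory):
        # Both attributes should have the same type as the argument.
        types.add(type(entry.name))
        types.add(type(entry.path))
    return types
```

Calling it with a ``str`` directory should yield only ``str``, and calling it with the ``os.fsencode()``-d form of the same directory should yield only ``bytes``.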
+
+os.walk()
+---------
+
+As part of this proposal, ``os.walk()`` will also be modified to use
+``scandir()`` rather than ``listdir()`` and ``os.path.isdir()``. This
+will increase the speed of ``os.walk()`` very significantly (as
+mentioned above, by 2-20 times, depending on the system).
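
As a rough illustration of the change described above, here is a simplified top-down walk built on ``scandir()``. It is a sketch under stated assumptions, not the standard library's implementation: ``walk_sketch`` is a hypothetical name, errors are silently skipped, and the ``topdown``, ``onerror`` and ``followlinks`` options of the real ``os.walk()`` are omitted:

```python
import os


def walk_sketch(top):
    """Yield (dirpath, dirnames, filenames) top-down, like a pared-down os.walk()."""
    try:
        entries = list(os.scandir(top))
    except OSError:
        return  # unreadable directory: skip it (assumption: no onerror hook)

    dirnames, filenames = [], []
    for entry in entries:
        try:
            # Usually answered from the directory listing, with no extra stat().
            is_dir = entry.is_dir()
        except OSError:
            is_dir = False
        (dirnames if is_dir else filenames).append(entry.name)

    yield top, dirnames, filenames

    for name in dirnames:
        path = os.path.join(top, name)
        # Symlinks to directories are listed but not descended into.
        if not os.path.islink(path):
            yield from walk_sketch(path)
```

The point of the sketch is that classifying each entry needs no separate ``os.path.isdir()`` call in the common case, which is where the speedup comes from.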


 Examples
@@ -154,7 +171,7 @@ uses it::

    dirs = []
    non_dirs = []
-    for entry in os.scandir(path):
+    for entry in os.scandir(directory):
        if entry.is_dir():
            dirs.append(entry)
        else:
@@ -165,19 +182,25 @@ scandir than ``os.listdir()`` and ``os.path.isdir()`` on both Windows
 and POSIX systems.

 Or, for getting the total size of files in a directory tree, showing
-use of the ``DirEntry.lstat()`` method and ``DirEntry.full_name``
+use of the ``DirEntry.stat()`` method and ``DirEntry.path``
 attribute::

-    def get_tree_size(path):
-        """Return total size of files in path and subdirs."""
+    def get_tree_size(directory):
+        """Return total size of files in directory and subdirs."""
        total = 0
-        for entry in os.scandir(path):
-            if entry.is_dir():
-                total += get_tree_size(entry.full_name)
+        for entry in os.scandir(directory):
+            if entry.is_dir(follow_symlinks=False):
+                total += get_tree_size(entry.path)
            else:
-                total += entry.lstat().st_size
+                total += entry.stat(follow_symlinks=False).st_size
        return total

+This also shows the use of the ``follow_symlinks`` parameter to
+``is_dir()`` -- in a recursive function like this, we probably don't
+want to follow links. (To properly follow links in a recursive
+function like this we'd want special handling for the case where
+following a symlink leads to a recursive loop.)
+
 Note that ``get_tree_size()`` will get a huge speed boost on Windows,
 because no extra stat calls are needed, but on POSIX systems the size
 information is not returned by the directory iteration functions, so
@@ -188,10 +211,10 @@ Notes on caching
 ----------------

 The ``DirEntry`` objects are relatively dumb -- the ``name`` and
-``full_name`` attributes are obviously always cached, and the ``is_X``
-and ``lstat`` methods cache their values (immediately on Windows via
+``path`` attributes are obviously always cached, and the ``is_X``
+and ``stat`` methods cache their values (immediately on Windows via
 ``FindNextFile``, and on first use on POSIX systems via a ``stat``
-call) and never refetch from the system.
+system call) and never refetch from the system.

 For this reason, ``DirEntry`` objects are intended to be used and
 thrown away after iteration, not stored in long-lived data structures
@@ -199,50 +222,61 @@ and the methods called again and again.

 If developers want "refresh" behaviour (for example, for watching a
 file's size change), they can simply use ``pathlib.Path`` objects,
-or call the regular ``os.lstat()`` or ``os.path.getsize()`` functions
+or call the regular ``os.stat()`` or ``os.path.getsize()`` functions
 which get fresh data from the operating system every call.
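
The difference between cached and fresh data can be seen with a small experiment. This is only a sketch, assuming the ``os.scandir()`` available in Python 3.5 and later; ``cached_and_fresh_sizes`` is a hypothetical helper, and it modifies the named file by appending to it:

```python
import os


def cached_and_fresh_sizes(directory, name, extra=b"more"):
    """Return (size before, cached size after, fresh size after) an append."""
    entries = {entry.name: entry for entry in os.scandir(directory)}
    entry = entries[name]

    before = entry.stat().st_size        # first call; the result is cached
    with open(entry.path, "ab") as f:    # grow the file behind the entry's back
        f.write(extra)

    cached = entry.stat().st_size        # served from the cache, so stale
    fresh = os.stat(entry.path).st_size  # asks the operating system again
    return before, cached, fresh
```

For a 3-byte file and the default 4-byte ``extra``, the cached size should stay at 3 while the fresh size reports 7.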


 Notes on exception handling
 ---------------------------

-``DirEntry.is_X()`` and ``DirEntry.lstat()`` are explicitly methods
+``DirEntry.is_X()`` and ``DirEntry.stat()`` are explicitly methods
 rather than attributes or properties, to make it clear that they may
-not be cheap operations, and they may do a system call. As a result,
-these methods may raise ``OSError``.
+not be cheap operations (although they often are), and they may do a
+system call. As a result, these methods may raise ``OSError``.

-For example, ``DirEntry.lstat()`` will always make a system call on
+For example, ``DirEntry.stat()`` will always make a system call on
 POSIX-based systems, and the ``DirEntry.is_X()`` methods will make a
-``stat()`` system call on such systems if ``readdir()`` returns a
-``d_type`` with a value of ``DT_UNKNOWN``, which can occur under
-certain conditions or on certain file systems.
+``stat()`` system call on such systems if ``readdir()`` does not
+support ``d_type`` or returns a ``d_type`` with a value of
+``DT_UNKNOWN``, which can occur under certain conditions or on
+certain file systems.

-For this reason, when a user requires fine-grained error handling,
-it's good to catch ``OSError`` around these method calls and then
-handle as appropriate.
+Often this does not matter -- for example, ``os.walk()`` as defined in
+the standard library only catches errors around the ``listdir()``
+calls.
+
+Also, because the exception-raising behaviour of the ``DirEntry.is_X``
+methods matches that of ``pathlib`` -- which only raises ``OSError``
+in the case of permissions or other fatal errors, but returns False
+if the path doesn't exist or is a broken symlink -- it's often
+not necessary to catch errors around the ``is_X()`` calls.
+
+However, when a user requires fine-grained error handling, it may be
+desirable to catch ``OSError`` around all method calls and handle as
+appropriate.

 For example, below is a version of the ``get_tree_size()`` example
-shown above, but with basic error handling added::
+shown above, but with fine-grained error handling added::

-    def get_tree_size(path):
-        """Return total size of files in path and subdirs. If
-        is_dir() or lstat() fails, print an error message to stderr
+    def get_tree_size(directory):
+        """Return total size of files in directory and subdirs. If
+        is_dir() or stat() fails, print an error message to stderr
        and assume zero size (for example, file has been deleted).
        """
        total = 0
-        for entry in os.scandir(path):
+        for entry in os.scandir(directory):
            try:
-                is_dir = entry.is_dir()
+                is_dir = entry.is_dir(follow_symlinks=False)
            except OSError as error:
                print('Error calling is_dir():', error, file=sys.stderr)
                continue
            if is_dir:
-                total += get_tree_size(entry.full_name)
+                total += get_tree_size(entry.path)
            else:
                try:
-                    total += entry.lstat().st_size
+                    total += entry.stat(follow_symlinks=False).st_size
                except OSError as error:
-                    print('Error calling lstat():', error, file=sys.stderr)
+                    print('Error calling stat():', error, file=sys.stderr)
        return total


@@ -316,6 +350,12 @@ For example:
  Seems pretty solid, so first thing, just want to say nice work!"
  [via personal email]

+* Matt Z: "I used scandir to dump the contents of a network dir in
+  under 15 seconds. 13 root dirs, 60,000 files in the structure. This
+  will replace some old VBA code embedded in a spreadsheet that was
+  taking 15-20 minutes to do the exact same thing." [via personal
+  email]
+
 Others have `requested a PyPI package`_ for it, which has been
 created. See `PyPI package`_.

@@ -331,13 +371,11 @@ of July 7, 2014:
 * Forks: 20
 * Issues: 4 open, 26 closed

-**However, the much larger point is this:**, if this PEP is accepted,
-``os.walk()`` can easily be reimplemented using ``scandir`` rather
-than ``listdir`` and ``stat``, increasing the speed of ``os.walk()``
-very significantly. There are thousands of developers, scripts, and
-production code that would benefit from this large speedup of
-``os.walk()``. For example, on GitHub, there are almost as many uses
-of ``os.walk`` (194,000) as there are of ``os.mkdir`` (230,000).
+Also, because this PEP will increase the speed of ``os.walk()``
+significantly, there are thousands of developers and scripts, and a lot
+of production code, that would benefit from it. For example, on GitHub,
+there are almost as many uses of ``os.walk`` (194,000) as there are of
+``os.mkdir`` (230,000).


 Rejected ideas
@@ -392,12 +430,51 @@ and this `June 2014 python-dev thread on PEP 471
 <https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_.


+Methods not following symlinks by default
+-----------------------------------------
+
+There was much debate on python-dev (see messages in `this thread
+<https://mail.python.org/pipermail/python-dev/2014-July/135485.html>`_)
+over whether the ``DirEntry`` methods should follow symbolic links or
+not (when the ``is_X()`` methods had no ``follow_symlinks`` parameter).
+
+Initially they did not (see previous versions of this PEP and the
+scandir.py module), but Victor Stinner made a pretty compelling case on
+python-dev that following symlinks by default is a better idea, because:
+
+* following links is usually what you want (in 92% of cases in the
+  standard library, functions using ``os.listdir()`` and
+  ``os.path.isdir()`` do follow symlinks)
+
+* that's the precedent set by the similar functions
+  ``os.path.isdir()`` and ``pathlib.Path.is_dir()``, so to do
+  otherwise would be confusing
+
+* with the non-link-following approach, if you wanted to follow links
+  you'd have to say something like ``if (entry.is_symlink() and
+  os.path.isdir(entry.path)) or entry.is_dir()``, which is clumsy
+
+As a case in point that shows the non-symlink-following version is
+error prone, this PEP's author had a bug caused by getting this
+exact test wrong in his initial implementation of ``scandir.walk()``
+in scandir.py (see `Issue #4 here
+<https://github.com/benhoyt/scandir/issues/4>`_).
+
+In the end there was not total agreement that the methods should
+follow symlinks, but there was basic consensus among the most involved
+participants, and this PEP's author believes that the above case is
+strong enough to warrant following symlinks by default.
+
+In addition, it's straightforward to call the relevant methods with
+``follow_symlinks=False`` if the other behaviour is desired.
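
To make the two behaviours concrete, the sketch below reports, for each entry, what ``is_dir()`` returns with and without following links. It assumes the ``os.scandir()`` available in Python 3.5 and later; ``describe_entries`` is a hypothetical helper for illustration only:

```python
import os


def describe_entries(directory):
    """Map each name to (is_dir following links, is_dir not following, is_symlink)."""
    return {
        entry.name: (
            entry.is_dir(),                       # default: follows symlinks
            entry.is_dir(follow_symlinks=False),  # looks at the link itself
            entry.is_symlink(),
        )
        for entry in os.scandir(directory)
    }
```

For a real directory all but the last value should be true, while a symlink pointing at a directory should report ``(True, False, True)``.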
+
+
 DirEntry attributes being properties
 ------------------------------------

 In some ways it would be nicer for the ``DirEntry`` ``is_X()`` and
-``lstat()`` to be properties instead of methods, to indicate they're
-very cheap or free. However, this isn't quite the case, as ``lstat()``
+``stat()`` to be properties instead of methods, to indicate they're
+very cheap or free. However, this isn't quite the case, as ``stat()``
 will require an OS call on POSIX-based systems but not on Windows.
 Even ``is_dir()`` and friends may perform an OS call on POSIX-based
 systems if the ``dirent.d_type`` value is ``DT_UNKNOWN`` (on certain
@@ -422,8 +499,8 @@ In `this July 2014 python-dev message
 <https://mail.python.org/pipermail/python-dev/2014-July/135303.html>`_,
 Paul Moore suggested a solution that was a "thin wrapper round the OS
 feature", where the ``DirEntry`` object had only static attributes:
-``name``, ``full_name``, and ``is_X``, with the ``st_X`` attributes
-only present on Windows. The idea was to use this simpler, lower-level
+``name``, ``path``, and ``is_X``, with the ``st_X`` attributes only
+present on Windows. The idea was to use this simpler, lower-level
 function as a building block for higher-level functions.

 At first there was general agreement that simplifying in this way was
@@ -459,19 +536,24 @@ because ``stat()`` will be called (and hence potentially raise
 ``OSError``) during iteration, leading to a rather ugly, hand-made
 iteration loop::

-    it = os.scandir(path)
+    it = os.scandir(directory)
    while True:
        try:
            entry = next(it)
        except OSError as error:
-            handle_error(path, error)
+            handle_error(directory, error)
        except StopIteration:
            break

 Or it means that ``scandir()`` would have to accept an ``onerror``
 argument -- a function to call when ``stat()`` errors occur during
 iteration. This seems to this PEP's author neither as direct nor as
-Pythonic as ``try``/``except`` around a ``DirEntry.lstat()`` call.
+Pythonic as ``try``/``except`` around a ``DirEntry.stat()`` call.
+
+Another drawback is that ``os.scandir()`` is written to make code faster.
+Always calling ``os.lstat()`` on POSIX would not bring any speedup. In most
+cases, you don't need the full ``stat_result`` object -- the ``is_X()``
+methods are enough and this information is already known.

 See `Ben Hoyt's July 2014 reply
 <https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
@@ -513,7 +595,7 @@ Return values being overloaded stat_result objects
 --------------------------------------------------

 Another alternative discussed was making the return values
-overloaded ``stat_result`` objects with ``name`` and ``full_name``
+overloaded ``stat_result`` objects with ``name`` and ``path``
 attributes. However, apart from this being a strange (and strained!)
 kind of overloading, this has the same problems mentioned above --
 most of the ``stat_result`` information is not fetched by
@@ -526,15 +608,15 @@ Return values being pathlib.Path objects
 With Antoine Pitrou's new standard library ``pathlib`` module, it
 at first seems like a great idea for ``scandir()`` to return instances
 of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and
-``lstat()`` functions are explicitly not cached, whereas ``scandir``
+``stat()`` functions are explicitly not cached, whereas ``scandir``
 has to cache them by design, because it's (often) returning values
 from the original directory iteration system call.

 And if the ``pathlib.Path`` instances returned by ``scandir`` cached
-lstat values, but the ordinary ``pathlib.Path`` objects explicitly
+stat values, but the ordinary ``pathlib.Path`` objects explicitly
 don't, that would be more than a little confusing.

-Guido van Rossum explicitly rejected ``pathlib.Path`` caching lstat in
+Guido van Rossum explicitly rejected ``pathlib.Path`` caching stat in
 the context of scandir `here
 <https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_,
 making ``pathlib.Path`` objects a bad choice for scandir return
@@ -564,35 +646,45 @@ here is a short list of some this PEP's author has in mind:
 Previous discussion
 ===================

-* `Original thread Ben Hoyt started on python-ideas`_ about speeding
-  up ``os.walk()``
+* `Original November 2012 thread Ben Hoyt started on python-ideas
+  <https://mail.python.org/pipermail/python-ideas/2012-November/017770.html>`_
+  about speeding up ``os.walk()``

 * Python `Issue 11406`_, which includes the original proposal for a
  scandir-like function

-* `Further thread Ben Hoyt started on python-dev`_ that refined the
-  ``scandir()`` API, including Nick Coghlan's suggestion of scandir
-  yielding ``DirEntry``-like objects
+* `Further May 2013 thread Ben Hoyt started on python-dev
+  <https://mail.python.org/pipermail/python-dev/2013-May/126119.html>`_
+  that refined the ``scandir()`` API, including Nick Coghlan's
+  suggestion of scandir yielding ``DirEntry``-like objects

-* `Another thread Ben Hoyt started on python-dev`_ to discuss the
-  interaction between scandir and the new ``pathlib`` module
+* `November 2013 thread Ben Hoyt started on python-dev
+  <https://mail.python.org/pipermail/python-dev/2013-November/130572.html>`_
+  to discuss the interaction between scandir and the new ``pathlib``
+  module

-* `Final thread Ben Hoyt started on python-dev`_ to discuss the first
-  version of this PEP, with extensive discussion about the API.
+* `June 2014 thread Ben Hoyt started on python-dev
+  <https://mail.python.org/pipermail/python-dev/2014-June/135215.html>`_
+  to discuss the first version of this PEP, with extensive discussion
+  about the API

-* `Question on StackOverflow`_ about why ``os.walk()`` is slow and
-  pointers on how to fix it (this inspired the author of this PEP
-  early on)
+* `First July 2014 thread Ben Hoyt started on python-dev
+  <https://mail.python.org/pipermail/python-dev/2014-July/135377.html>`_
+  to discuss his updates to PEP 471

-* `BetterWalk`_, this PEP's author's previous attempt at this, on
-  which the scandir code is based
+* `Second July 2014 thread Ben Hoyt started on python-dev
+  <https://mail.python.org/pipermail/python-dev/2014-July/135485.html>`_
+  to discuss the remaining decisions needed to finalize PEP 471,
+  specifically whether the ``DirEntry`` methods should follow symlinks
+  by default

-.. _`Original thread Ben Hoyt started on python-ideas`: https://mail.python.org/pipermail/python-ideas/2012-November/017770.html
-.. _`Further thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-May/126119.html
-.. _`Another thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-November/130572.html
-.. _`Final thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2014-June/135215.html
-.. _`Question on StackOverflow`: http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder
-.. _`BetterWalk`: https://github.com/benhoyt/betterwalk
+* `Question on StackOverflow
+  <http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder>`_
+  about why ``os.walk()`` is slow and pointers on how to fix it (this
+  inspired the author of this PEP early on)
+
+* `BetterWalk <https://github.com/benhoyt/betterwalk>`_, this PEP's
+  author's previous attempt at this, on which the scandir code is based


 Copyright