PEP 471: update by Ben Hoyt

After the significant discussion on python-dev about PEP 471, I've now made the relevant updates and improved a few things.
2014-07-08 10:59:42 +02:00 · 689e1bff5e
parent 4b500b691d
commit 689e1bff5e
1 changed file with 335 additions and 99 deletions
--- a/pep-0471.txt
+++ b/pep-0471.txt
@@ -8,6 +8,7 @@ Type: Standards Track
 Content-Type: text/x-rst
 Created: 30-May-2014
 Python-Version: 3.5
+Post-History: 27-Jun-2014, 8-Jul-2014


 Abstract
@@ -25,32 +26,36 @@ Rationale

 Python's built-in ``os.walk()`` is significantly slower than it needs
 to be, because -- in addition to calling ``os.listdir()`` on each
-directory -- it executes the system call ``os.stat()`` or
+directory -- it executes the ``stat()`` system call or
 ``GetFileAttributes()`` on each file to determine whether the entry is
 a directory or not.

 But the underlying system calls -- ``FindFirstFile`` /
-``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
+``FindNextFile`` on Windows and ``readdir`` on POSIX systems --
 already tell you whether the files returned are directories or not, so
-no further system calls are needed. In short, you can reduce the
-number of system calls from approximately 2N to N, where N is the
-total number of files and directories in the tree. (And because
-directory trees are usually much wider than they are deep, it's often
-much better than this.)
+no further system calls are needed. Further, the Windows system calls
+return all the information for a ``stat_result`` object, such as file
+size and last modification time.
+
+In short, you can reduce the number of system calls required for a
+tree function like ``os.walk()`` from approximately 2N to N, where N
+is the total number of files and directories in the tree. (And because
+directory trees are usually wider than they are deep, it's often much
+better than this.)

 In practice, removing all those extra system calls makes ``os.walk()``
 about **8-9 times as fast on Windows**, and about **2-3 times as fast
-on Linux and Mac OS X**. So we're not talking about micro-
-optimizations. See more `benchmarks`_.
+on POSIX systems**. So we're not talking about micro-
+optimizations. See more `benchmarks here`_.

-.. _`benchmarks`: https://github.com/benhoyt/scandir#benchmarks
+.. _`benchmarks here`: https://github.com/benhoyt/scandir#benchmarks

 Somewhat relatedly, many people (see Python `Issue 11406`_) are also
 keen on a version of ``os.listdir()`` that yields filenames as it
 iterates instead of returning them as one big list. This improves
 memory efficiency for iterating very large directories.

-So as well as providing a ``scandir()`` iterator function for calling
+So, as well as providing a ``scandir()`` iterator function for calling
 directly, Python's existing ``os.walk()`` function could be sped up a
 huge amount.

@@ -70,7 +75,7 @@ Note that this module has been used and tested (see "Use in the wild"
 section in this PEP), so it's more than a proof-of-concept. However,
 it is marked as beta software and is not extensively battle-tested.
 It will need some cleanup and more thorough testing before going into
-the standard library, as well as integration into `posixmodule.c`.
+the standard library, as well as integration into ``posixmodule.c``.



@@ -87,10 +92,10 @@ Like ``listdir``, ``scandir`` calls the operating system's directory
 iteration system calls to get the names of the files in the ``path``
 directory, but it's different from ``listdir`` in two ways:

-* Instead of bare filename strings, it returns lightweight
+* Instead of returning bare filename strings, it returns lightweight
  ``DirEntry`` objects that hold the filename string and provide
-  simple methods that allow access to the stat-like data the operating
-  system returned.
+  simple methods that allow access to the additional data the
+  operating system returned.

 * It returns a generator instead of a list, so that ``scandir`` acts
  as a true iterator instead of returning the full list immediately.
@@ -101,82 +106,146 @@ pseudo-directories are skipped, and the entries are yielded in
 system-dependent order. Each ``DirEntry`` object has the following
 attributes and methods:

-* ``name``: the entry's filename, relative to ``path`` (corresponds to
-  the return values of ``os.listdir``)
+* ``name``: the entry's filename, relative to the ``path`` argument
+  (corresponds to the return values of ``os.listdir``)

-* ``is_dir()``: like ``os.path.isdir()``, but requires no system calls
-  on most systems (Linux, Windows, OS X)
+* ``full_name``: the entry's full path name -- the equivalent of
+  ``os.path.join(path, entry.name)``

-* ``is_file()``: like ``os.path.isfile()``, but requires no system
-  calls on most systems (Linux, Windows, OS X)
+* ``is_dir()``: like ``os.path.isdir()``, but much cheaper -- it never
+  requires a system call on Windows, and usually doesn't on POSIX
+  systems

-* ``is_symlink()``: like ``os.path.islink()``, but requires no system
-  calls on most systems (Linux, Windows, OS X)
+* ``is_file()``: like ``os.path.isfile()``, but much cheaper -- it
+  never requires a system call on Windows, and usually doesn't on
+  POSIX systems

-* ``lstat()``: like ``os.lstat()``, but requires no system calls on
-  Windows
+* ``is_symlink()``: like ``os.path.islink()``, but much cheaper -- it
+  never requires a system call on Windows, and usually doesn't on
+  POSIX systems
+
+* ``lstat()``: like ``os.lstat()``, but much cheaper on some systems
+  -- it only requires a system call on POSIX systems
+
+The ``is_X`` methods may perform a ``stat()`` call under certain
+conditions (for example, on certain file systems on POSIX systems),
+and therefore possibly raise ``OSError``. The ``lstat()`` method will
+call ``stat()`` on POSIX systems and therefore also possibly raise
+``OSError``. See the "Notes on exception handling" section for more
+details.

 The ``DirEntry`` attribute and method names were chosen to be the same
 as those in the new ``pathlib`` module for consistency.

-
-Notes on caching
----------------
-
-The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
-is obviously always cached, and the ``is_X`` and ``lstat`` methods
-cache their values (immediately on Windows via ``FindNextFile``, and
-on first use on Linux / OS X via a ``stat`` call) and never refetch
-from the system.
-
-For this reason, ``DirEntry`` objects are intended to be used and
-thrown away after iteration, not stored in long-lived data structured
-and the methods called again and again.
-
-If a user wants to do that (for example, for watching a file's size
-change), they'll need to call the regular ``os.lstat()`` or
-``os.path.getsize()`` functions which force a new system call each
-time.
+Like the other functions in the ``os`` module, ``scandir()`` accepts
+either a bytes or str object for the ``path`` parameter, and returns
+the ``DirEntry.name`` and ``DirEntry.full_name`` attributes with the
+same type as ``path``. However, it is *strongly recommended* to use
+the str type, as this ensures cross-platform support for Unicode
+filenames.


 Examples
 ========

-Here's a good usage pattern for ``scandir``. This is in fact almost
+Below is a good usage pattern for ``scandir``. This is in fact almost
 exactly how the scandir module's faster ``os.walk()`` implementation
 uses it::

    dirs = []
    non_dirs = []
-    for entry in scandir(path):
+    for entry in os.scandir(path):
        if entry.is_dir():
            dirs.append(entry)
        else:
            non_dirs.append(entry)

-The above ``os.walk()``-like code will be significantly using scandir
-on both Windows and Linux or OS X.
+The above ``os.walk()``-like code will be significantly faster with
+scandir than ``os.listdir()`` and ``os.path.isdir()`` on both Windows
+and POSIX systems.

-Or, for getting the total size of files in a directory tree -- showing
-use of the ``DirEntry.lstat()`` method::
+Or, for getting the total size of files in a directory tree, showing
+use of the ``DirEntry.lstat()`` method and ``DirEntry.full_name``
+attribute::

    def get_tree_size(path):
        """Return total size of files in path and subdirs."""
-        size = 0
-        for entry in scandir(path):
+        total = 0
+        for entry in os.scandir(path):
            if entry.is_dir():
-                sub_path = os.path.join(path, entry.name)
-                size += get_tree_size(sub_path)
+                total += get_tree_size(entry.full_name)
            else:
-                size += entry.lstat().st_size
-        return size
+                total += entry.lstat().st_size
+        return total

 Note that ``get_tree_size()`` will get a huge speed boost on Windows,
-because no extra stat call are needed, but on Linux and OS X the size
+because no extra stat calls are needed, but on POSIX systems the size
 information is not returned by the directory iteration functions, so
 this function won't gain anything there.
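
For readers who want to try the ``dirs`` / ``non_dirs`` pattern above on a current interpreter, here is a minimal sketch. It assumes the ``os.scandir()`` that eventually shipped in Python 3.5 and later, where the ``full_name`` attribute proposed in this revision is spelled ``path``; the helper name ``split_entries`` is hypothetical and not part of the PEP.

```python
import os


def split_entries(path):
    """Partition the entries of path into (dirs, non_dirs) name lists."""
    dirs = []
    non_dirs = []
    # The with-statement form needs Python 3.6+; it closes the
    # underlying directory handle as soon as the loop finishes.
    with os.scandir(path) as entries:
        for entry in entries:
            # is_dir() usually answers from data the directory scan
            # already returned, so no extra stat() call is made.
            if entry.is_dir():
                dirs.append(entry.name)
            else:
                non_dirs.append(entry.name)
    # Entries arrive in system-dependent order; sort for stable output.
    return sorted(dirs), sorted(non_dirs)
```

This is the same split that an ``os.walk()``-style function performs per directory before recursing into ``dirs``.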


+Notes on caching
+----------------
+
+The ``DirEntry`` objects are relatively dumb -- the ``name`` and
+``full_name`` attributes are obviously always cached, and the ``is_X``
+and ``lstat`` methods cache their values (immediately on Windows via
+``FindNextFile``, and on first use on POSIX systems via a ``stat``
+call) and never refetch from the system.
+
+For this reason, ``DirEntry`` objects are intended to be used and
+thrown away after iteration, not stored in long-lived data structures
+and the methods called again and again.
+
+If developers want "refresh" behaviour (for example, for watching a
+file's size change), they can simply use ``pathlib.Path`` objects,
+or call the regular ``os.lstat()`` or ``os.path.getsize()`` functions
+which get fresh data from the operating system every call.
+
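The caching behaviour described above can be observed directly. The sketch below is illustrative only and assumes the API as it later shipped in Python 3.5+, where the proposed ``lstat()`` method became ``stat(follow_symlinks=False)``; the function name ``cached_vs_fresh_size`` is hypothetical.

```python
import os


def cached_vs_fresh_size(directory, filename):
    """Return (cached_size, fresh_size) after the file grows by one byte."""
    file_path = os.path.join(directory, filename)
    with os.scandir(directory) as entries:
        entry = next(e for e in entries if e.name == filename)
        # First call: the result is fetched (or taken from the
        # directory scan on Windows) and cached on the entry.
        entry.stat(follow_symlinks=False)
        # Grow the file behind the entry's back.
        with open(file_path, "ab") as f:
            f.write(b"x")
        # Second call: served from the cache, so the size is stale.
        cached = entry.stat(follow_symlinks=False).st_size
    # os.lstat() always asks the operating system again.
    fresh = os.lstat(file_path).st_size
    return cached, fresh
```

For a file that starts at three bytes, the cached size stays at three while the fresh size reports four, which is why long-lived code should not keep ``DirEntry`` objects around.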
+
+Notes on exception handling
+---------------------------
+
+``DirEntry.is_X()`` and ``DirEntry.lstat()`` are explicitly methods
+rather than attributes or properties, to make it clear that they may
+not be cheap operations, and they may do a system call. As a result,
+these methods may raise ``OSError``.
+
+For example, ``DirEntry.lstat()`` will always make a system call on
+POSIX-based systems, and the ``DirEntry.is_X()`` methods will make a
+``stat()`` system call on such systems if ``readdir()`` returns a
+``d_type`` with a value of ``DT_UNKNOWN``, which can occur under
+certain conditions or on certain file systems.
+
+For this reason, when a user requires fine-grained error handling,
+it's good to catch ``OSError`` around these method calls and then
+handle as appropriate.
+
+For example, below is a version of the ``get_tree_size()`` example
+shown above, but with basic error handling added::
+
+    def get_tree_size(path):
+        """Return total size of files in path and subdirs. If
+        is_dir() or lstat() fails, print an error message to stderr
+        and assume zero size (for example, file has been deleted).
+        """
+        total = 0
+        for entry in os.scandir(path):
+            try:
+                is_dir = entry.is_dir()
+            except OSError as error:
+                print('Error calling is_dir():', error, file=sys.stderr)
+                continue
+            if is_dir:
+                total += get_tree_size(entry.full_name)
+            else:
+                try:
+                    total += entry.lstat().st_size
+                except OSError as error:
+                    print('Error calling lstat():', error, file=sys.stderr)
+        return total
+
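As a runnable counterpart to the example above, here is a sketch written against the ``os.scandir()`` that shipped in Python 3.5 and later rather than the API of this revision: ``full_name`` is spelled ``path`` there, and ``lstat()`` is spelled ``stat(follow_symlinks=False)``. Passing ``follow_symlinks=False`` to ``is_dir()`` is a choice made for this sketch (it avoids recursing through symlink loops), not something this revision of the PEP prescribes.

```python
import os
import sys


def get_tree_size(path):
    """Return total size of files in path and subdirs.

    If is_dir() or stat() fails, print an error message to stderr
    and assume zero size (for example, the file has been deleted).
    """
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            try:
                # May need a stat() call if the OS gave no type info.
                is_dir = entry.is_dir(follow_symlinks=False)
            except OSError as error:
                print('Error calling is_dir():', error, file=sys.stderr)
                continue
            if is_dir:
                total += get_tree_size(entry.path)
            else:
                try:
                    total += entry.stat(follow_symlinks=False).st_size
                except OSError as error:
                    print('Error calling stat():', error, file=sys.stderr)
    return total
```

Errors raised by ``os.scandir(path)`` itself (for example, a subdirectory that cannot be opened) are left unhandled here, as in the original example.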
+
 Support
 =======

@@ -185,6 +254,10 @@ The scandir module on GitHub has been forked and used quite a bit (see
 direct support for a scandir-like function from core developers and
 others on the python-dev and python-ideas mailing lists. A sampling:

+* **python-dev**: a good number of +1's and very few negatives for
+  scandir and PEP 471 on `this June 2014 python-dev thread
+  <https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_
+
 * **Nick Coghlan**, a core Python developer: "I've had the local Red
  Hat release engineering team express their displeasure at having to
  stat every file in a network mounted directory tree for info that is
@@ -225,9 +298,10 @@ specific os.scandir API would be a good thing."
 Use in the wild
 ===============

-To date, ``scandir`` is definitely useful, but has been clearly marked
-"beta", so it's uncertain how much use of it there is in the wild. Ben
-Hoyt has had several reports from people using it. For example:
+To date, the ``scandir`` implementation is definitely useful, but has
+been clearly marked "beta", so it's uncertain how much use of it there
+is in the wild. Ben Hoyt has had several reports from people using it.
+For example:

 * Chris F: "I am processing some pretty large directories and was half
  expecting to have to modify getdents. So thanks for saving me the
@@ -250,12 +324,12 @@ created. See `PyPI package`_.

 GitHub stats don't mean too much, but scandir does have several
 watchers, issues, forks, etc. Here's the run-down of the stats as
-of June 5, 2014:
+of July 7, 2014:

 * Watchers: 17
-* Stars: 48
-* Forks: 15
-* Issues: 2 open, 19 closed
+* Stars: 57
+* Forks: 20
+* Issues: 4 open, 26 closed

 **However, the much larger point is this:**, if this PEP is accepted,
 ``os.walk()`` can easily be reimplemented using ``scandir`` rather
@@ -266,53 +340,205 @@ production code that would benefit from this large speedup of
 of ``os.walk`` (194,000) as there are of ``os.mkdir`` (230,000).


-Open issues and optional things
-===============================
-
-There are a few open issues or optional additions:
+Rejected ideas
+==============


-Should scandir be in its own module?
+Naming
+------
+
+The only other real contender for this function's name was
+``iterdir()``. However, ``iterX()`` functions in Python (mostly found
+in Python 2) tend to be simple iterator equivalents of their
+non-iterator counterparts. For example, ``dict.iterkeys()`` is just an
+iterator version of ``dict.keys()``, but the objects returned are
+identical. In ``scandir()``'s case, however, the return values are
+quite different objects (``DirEntry`` objects vs filename strings), so
+this should probably be reflected by a difference in name -- hence
+``scandir()``.
+
+See some `relevant discussion on python-dev
+<https://mail.python.org/pipermail/python-dev/2014-June/135228.html>`_.
+
+
+Wildcard support
+----------------
+
+``FindFirstFile``/``FindNextFile`` on Windows support passing a
+"wildcard" like ``*.jpg``, so at first folks (this PEP's author
+included) felt it would be a good idea to include a
+``windows_wildcard`` keyword argument to the ``scandir`` function so
+users could pass this in.
+
+However, on further thought and discussion it was decided that this
+would be a bad idea, *unless it could be made cross-platform* (a
+``pattern`` keyword argument or similar). This seems easy enough at
+first -- just use the OS wildcard support on Windows, and something
+like ``fnmatch`` or ``re`` afterwards on POSIX-based systems.
+
+Unfortunately the exact Windows wildcard matching rules aren't really
+documented anywhere by Microsoft, and they're quite quirky (see this
+`blog post
+<http://blogs.msdn.com/b/oldnewthing/archive/2007/12/17/6785519.aspx>`_),
+meaning it's very problematic to emulate using ``fnmatch`` or regexes.
+
+So the consensus was that Windows wildcard support was a bad idea.
+It would be possible to add at a later date if there's a
+cross-platform way to achieve it, but not for the initial version.
+
+Read more on `this Nov 2012 python-ideas thread
+<https://mail.python.org/pipermail/python-ideas/2012-November/017770.html>`_
+and this `June 2014 python-dev thread on PEP 471
+<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_.
+
+
+DirEntry attributes being properties
 ------------------------------------

-Should the function be included in the standard library in a new
-module, ``scandir.scandir()``, or just as ``os.scandir()`` as
-discussed? The preference of this PEP's author (Ben Hoyt) would be
-``os.scandir()``, as it's just a single function.
+In some ways it would be nicer for the ``DirEntry`` ``is_X()`` and
+``lstat()`` to be properties instead of methods, to indicate they're
+very cheap or free. However, this isn't quite the case, as ``lstat()``
+will require an OS call on POSIX-based systems but not on Windows.
+Even ``is_dir()`` and friends may perform an OS call on POSIX-based
+systems if the ``dirent.d_type`` value is ``DT_UNKNOWN`` (on certain
+file systems).
+
+Also, people would expect the attribute access ``entry.is_dir`` to
+only ever raise ``AttributeError``, not ``OSError`` in the case it
+makes a system call under the covers. Calling code would have to have
+a ``try``/``except`` around what looks like a simple attribute access,
+and so it's much better to make them *methods*.
+
+See `this May 2013 python-dev thread
+<https://mail.python.org/pipermail/python-dev/2013-May/126184.html>`_
+where this PEP's author makes this case and there's agreement from
+core developers.


-Should there be a way to access the full path?
----------------------------------------------
+DirEntry fields being "static" attribute-only objects
+-----------------------------------------------------

-Should ``DirEntry``'s have a way to get the full path without using
-``os.path.join(path, entry.name)``? This is a pretty common pattern,
-and it may be useful to add pathlib-like ``str(entry)`` functionality.
-This functionality has also been requested in `issue 13`_ on GitHub.
+In `this July 2014 python-dev message
+<https://mail.python.org/pipermail/python-dev/2014-July/135303.html>`_,
+Paul Moore suggested a solution that was a "thin wrapper round the OS
+feature", where the ``DirEntry`` object had only static attributes:
+``name``, ``full_name``, and ``is_X``, with the ``st_X`` attributes
+only present on Windows. The idea was to use this simpler, lower-level
+function as a building block for higher-level functions.

-.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
+At first there was general agreement that simplifying in this way was
+a good thing. However, there were two problems with this approach.
+First, the assumption is that the ``is_dir`` and similar attributes are
+always present on POSIX, which isn't the case (if ``d_type`` is not
+present or is ``DT_UNKNOWN``). Second, it's a much harder-to-use API
+in practice, as even the ``is_dir`` attributes aren't always present
+on POSIX, and would need to be tested with ``hasattr()`` and then
+``os.stat()`` called if they weren't present.
+
+See `this July 2014 python-dev response
+<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
+from this PEP's author detailing why this option is a non-ideal
+solution, and the subsequent reply from Paul Moore voicing agreement.


-Should it expose Windows wildcard functionality?
------------------------------------------------
+DirEntry fields being static with an ensure_lstat option
+--------------------------------------------------------

-Should ``scandir()`` have a way of exposing the wildcard functionality
-in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The
-scandir module on GitHub exposes this as a ``windows_wildcard``
-keyword argument, allowing Windows power users the option to pass a
-custom wildcard to ``FindFirstFile``, which may avoid the need to use
-``fnmatch`` or similar on the resulting names. It is named the
-unwieldly ``windows_wildcard`` to remind you you're writing power-
-user, Windows-only code if you use it.
+Another seemingly simpler and attractive option was suggested by
+Nick Coghlan in this `June 2014 python-dev message
+<https://mail.python.org/pipermail/python-dev/2014-June/135261.html>`_:
+make ``DirEntry.is_X`` and ``DirEntry.lstat_result`` properties, and
+populate ``DirEntry.lstat_result`` at iteration time, but only if
+the new argument ``ensure_lstat=True`` was specified on the
+``scandir()`` call.

-This boils down to whether ``scandir`` should be about exposing all of
-the system's directory iteration features, or simply providing a fast,
-simple, cross-platform directory iteration API.
+This does have the advantage over the above in that you can easily get
+the stat result from ``scandir()`` if you need it. However, it has the
+serious disadvantage that fine-grained error handling is messy,
+because ``stat()`` will be called (and hence potentially raise
+``OSError``) during iteration, leading to a rather ugly, hand-made
+iteration loop::

-This PEP's author votes for not including ``windows_wildcard`` in the
-standard library version, because even though it could be useful in
-rare cases (say the Windows Dropbox client?), it'd be too easy to use
-it just because you're a Windows developer, and create code that is
-not cross-platform.
+    it = os.scandir(path)
+    while True:
+        try:
+            entry = next(it)
+        except OSError as error:
+            handle_error(path, error)
+        except StopIteration:
+            break
+
+Or it means that ``scandir()`` would have to accept an ``onerror``
+argument -- a function to call when ``stat()`` errors occur during
+iteration. This seems to this PEP's author neither as direct nor as
+Pythonic as ``try``/``except`` around a ``DirEntry.lstat()`` call.
+
+See `Ben Hoyt's July 2014 reply
+<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
+to the discussion summarizing this and detailing why he thinks the
+original PEP 471 proposal is "the right one" after all.
+
+
+Return values being (name, stat_result) two-tuples
+--------------------------------------------------
+
+Initially this PEP's author proposed this concept as a function called
+``iterdir_stat()`` which yielded two-tuples of (name, stat_result).
+This does have the advantage that there are no new types introduced.
+However, the ``stat_result`` is only partially filled on POSIX-based
+systems (most fields set to ``None`` and other quirks), so they're not
+really ``stat_result`` objects at all, and this would have to be
+thoroughly documented as different from ``os.stat()``.
+
+Also, Python has good support for proper objects with attributes and
+methods, which makes for a saner and simpler API than two-tuples. It
+also makes the ``DirEntry`` objects more extensible and future-proof
+as operating systems add functionality and we want to include this in
+``DirEntry``.
+
+See also some previous discussion:
+
+* `May 2013 python-dev thread
+  <https://mail.python.org/pipermail/python-dev/2013-May/126148.html>`_
+  where Nick Coghlan makes the original case for a ``DirEntry``-style
+  object.
+
+* `June 2014 python-dev thread
+  <https://mail.python.org/pipermail/python-dev/2014-June/135244.html>`_
+  where Nick Coghlan makes (another) good case against the two-tuple
+  approach.
+
+
+Return values being overloaded stat_result objects
+--------------------------------------------------
+
+Another alternative discussed was making the return values
+overloaded ``stat_result`` objects with ``name`` and ``full_name``
+attributes. However, apart from this being a strange (and strained!)
+kind of overloading, this has the same problems mentioned above --
+most of the ``stat_result`` information is not fetched by
+``readdir()`` on POSIX systems, only (part of) the ``st_mode`` value.
+
+
+Return values being pathlib.Path objects
+----------------------------------------
+
+With Antoine Pitrou's new standard library ``pathlib`` module, it
+at first seems like a great idea for ``scandir()`` to return instances
+of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and
+``lstat()`` functions are explicitly not cached, whereas ``scandir``
+has to cache them by design, because it's (often) returning values
+from the original directory iteration system call.
+
+And if the ``pathlib.Path`` instances returned by ``scandir`` cached
+lstat values, but the ordinary ``pathlib.Path`` objects explicitly
+don't, that would be more than a little confusing.
+
+Guido van Rossum explicitly rejected ``pathlib.Path`` caching lstat in
+the context of scandir `here
+<https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_,
+making ``pathlib.Path`` objects a bad choice for scandir return
+values.


 Possible improvements
@@ -328,6 +554,12 @@ here is a short list of some this PEP's author has in mind:
  was suggested on Issue 11406 by Antoine Pitrou.
  [`source9 <http://bugs.python.org/msg130125>`_]

+* scandir could use a free list to avoid the cost of memory allocation
+  for each iteration -- a short free list of 10 or maybe even 1 may help.
+  Suggested by Victor Stinner on a `python-dev thread on June 27`_.
+
+.. _`python-dev thread on June 27`: https://mail.python.org/pipermail/python-dev/2014-June/135232.html
+

 Previous discussion
 ===================
@@ -342,9 +574,12 @@ Previous discussion
  ``scandir()`` API, including Nick Coghlan's suggestion of scandir
  yielding ``DirEntry``-like objects

-* `Final thread Ben Hoyt started on python-dev`_ to discuss the
+* `Another thread Ben Hoyt started on python-dev`_ to discuss the
  interaction between scandir and the new ``pathlib`` module

+* `Final thread Ben Hoyt started on python-dev`_ to discuss the first
+  version of this PEP, with extensive discussion about the API.
+
 * `Question on StackOverflow`_ about why ``os.walk()`` is slow and
  pointers on how to fix it (this inspired the author of this PEP
  early on)
@@ -354,7 +589,8 @@ Previous discussion

 .. _`Original thread Ben Hoyt started on python-ideas`: https://mail.python.org/pipermail/python-ideas/2012-November/017770.html
 .. _`Further thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-May/126119.html
-.. _`Final thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-November/130572.html
+.. _`Another thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-November/130572.html
+.. _`Final thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2014-June/135215.html
 .. _`Question on StackOverflow`: http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder
 .. _`BetterWalk`: https://github.com/benhoyt/betterwalk