PEP 471: update by Ben Hoyt

After the significant discussion on python-dev about PEP 471, I've now made the
relevant updates and improved a few things.
Victor Stinner 2014-07-08 10:59:42 +02:00
parent 4b500b691d
commit 689e1bff5e
1 changed files with 335 additions and 99 deletions


@ -8,6 +8,7 @@ Type: Standards Track
Content-Type: text/x-rst
Created: 30-May-2014
Python-Version: 3.5
Post-History: 27-Jun-2014, 8-Jul-2014
Abstract
@ -25,32 +26,36 @@ Rationale
Python's built-in ``os.walk()`` is significantly slower than it needs
to be, because -- in addition to calling ``os.listdir()`` on each
directory -- it executes the system call ``os.stat()`` or
directory -- it executes the ``stat()`` system call or
``GetFileAttributes()`` on each file to determine whether the entry is
a directory or not.
But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
``FindNextFile`` on Windows and ``readdir`` on POSIX systems --
already tell you whether the files returned are directories or not, so
no further system calls are needed. In short, you can reduce the
number of system calls from approximately 2N to N, where N is the
total number of files and directories in the tree. (And because
directory trees are usually much wider than they are deep, it's often
much better than this.)
no further system calls are needed. Further, the Windows system calls
return all the information for a ``stat_result`` object, such as file
size and last modification time.
In short, you can reduce the number of system calls required for a
tree function like ``os.walk()`` from approximately 2N to N, where N
is the total number of files and directories in the tree. (And because
directory trees are usually wider than they are deep, it's often much
better than this.)
In practice, removing all those extra system calls makes ``os.walk()``
about **8-9 times as fast on Windows**, and about **2-3 times as fast
on Linux and Mac OS X**. So we're not talking about micro-
optimizations. See more `benchmarks`_.
on POSIX systems**. So we're not talking about micro-
optimizations. See more `benchmarks here`_.
.. _`benchmarks`: https://github.com/benhoyt/scandir#benchmarks
.. _`benchmarks here`: https://github.com/benhoyt/scandir#benchmarks
Somewhat relatedly, many people (see Python `Issue 11406`_) are also
keen on a version of ``os.listdir()`` that yields filenames as it
iterates instead of returning them as one big list. This improves
memory efficiency for iterating very large directories.
So as well as providing a ``scandir()`` iterator function for calling
So, as well as providing a ``scandir()`` iterator function for calling
directly, Python's existing ``os.walk()`` function could be sped up a
huge amount.
@ -70,7 +75,7 @@ Note that this module has been used and tested (see "Use in the wild"
section in this PEP), so it's more than a proof-of-concept. However,
it is marked as beta software and is not extensively battle-tested.
It will need some cleanup and more thorough testing before going into
the standard library, as well as integration into `posixmodule.c`.
the standard library, as well as integration into ``posixmodule.c``.
@ -87,10 +92,10 @@ Like ``listdir``, ``scandir`` calls the operating system's directory
iteration system calls to get the names of the files in the ``path``
directory, but it's different from ``listdir`` in two ways:
* Instead of bare filename strings, it returns lightweight
* Instead of returning bare filename strings, it returns lightweight
``DirEntry`` objects that hold the filename string and provide
simple methods that allow access to the stat-like data the operating
system returned.
simple methods that allow access to the additional data the
operating system returned.
* It returns a generator instead of a list, so that ``scandir`` acts
as a true iterator instead of returning the full list immediately.
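The two differences above can be seen in a minimal sketch. It assumes ``os.scandir()`` as it is available in Python 3.5 and later; the helper name ``list_entries`` is purely illustrative and not part of the proposal.

```python
import os
import tempfile


def list_entries(path):
    """Return sorted (name, is_dir) pairs for the entries directly under path."""
    pairs = []
    # os.scandir() yields DirEntry objects one at a time rather than
    # building a full list of name strings the way os.listdir() does.
    for entry in os.scandir(path):
        # entry.name matches what os.listdir() would have returned;
        # entry.is_dir() usually needs no extra system call.
        pairs.append((entry.name, entry.is_dir()))
    return sorted(pairs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "sub"))
        open(os.path.join(tmp, "a.txt"), "w").close()
        print(list_entries(tmp))
```

The result is sorted only to make the output stable, since entries are yielded in system-dependent order.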
@ -101,82 +106,146 @@ pseudo-directories are skipped, and the entries are yielded in
system-dependent order. Each ``DirEntry`` object has the following
attributes and methods:
* ``name``: the entry's filename, relative to ``path`` (corresponds to
the return values of ``os.listdir``)
* ``name``: the entry's filename, relative to the ``path`` argument
(corresponds to the return values of ``os.listdir``)
* ``is_dir()``: like ``os.path.isdir()``, but requires no system calls
on most systems (Linux, Windows, OS X)
* ``full_name``: the entry's full path name -- the equivalent of
``os.path.join(path, entry.name)``
* ``is_file()``: like ``os.path.isfile()``, but requires no system
calls on most systems (Linux, Windows, OS X)
* ``is_dir()``: like ``os.path.isdir()``, but much cheaper -- it never
requires a system call on Windows, and usually doesn't on POSIX
systems
* ``is_symlink()``: like ``os.path.islink()``, but requires no system
calls on most systems (Linux, Windows, OS X)
* ``is_file()``: like ``os.path.isfile()``, but much cheaper -- it
never requires a system call on Windows, and usually doesn't on
POSIX systems
* ``lstat()``: like ``os.lstat()``, but requires no system calls on
Windows
* ``is_symlink()``: like ``os.path.islink()``, but much cheaper -- it
never requires a system call on Windows, and usually doesn't on
POSIX systems
* ``lstat()``: like ``os.lstat()``, but much cheaper on some systems
-- it only requires a system call on POSIX systems
The ``is_X`` methods may perform a ``stat()`` call under certain
conditions (for example, on certain file systems on POSIX systems),
and therefore possibly raise ``OSError``. The ``lstat()`` method will
call ``stat()`` on POSIX systems and therefore also possibly raise
``OSError``. See the "Notes on exception handling" section for more
details.
The ``DirEntry`` attribute and method names were chosen to be the same
as those in the new ``pathlib`` module for consistency.
Notes on caching
----------------
The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
is obviously always cached, and the ``is_X`` and ``lstat`` methods
cache their values (immediately on Windows via ``FindNextFile``, and
on first use on Linux / OS X via a ``stat`` call) and never refetch
from the system.
For this reason, ``DirEntry`` objects are intended to be used and
thrown away after iteration, not stored in long-lived data structures
and the methods called again and again.
If a user wants to do that (for example, for watching a file's size
change), they'll need to call the regular ``os.lstat()`` or
``os.path.getsize()`` functions which force a new system call each
time.
Like the other functions in the ``os`` module, ``scandir()`` accepts
either a bytes or str object for the ``path`` parameter, and returns
the ``DirEntry.name`` and ``DirEntry.full_name`` attributes with the
same type as ``path``. However, it is *strongly recommended* to use
the str type, as this ensures cross-platform support for Unicode
filenames.
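As a rough illustration of the type rule above, the sketch below lists the same directory twice, once with a str path and once with a bytes path. It assumes ``os.scandir()`` as shipped in Python 3.5 and later; ``entry_names`` is a hypothetical helper, not part of the proposal.

```python
import os
import tempfile


def entry_names(path):
    """Return the sorted names in path; their type follows the type of path."""
    return sorted(entry.name for entry in os.scandir(path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "data.txt"), "w").close()
        print(entry_names(tmp))               # names are str
        print(entry_names(os.fsencode(tmp)))  # names are bytes
```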
Examples
========
Here's a good usage pattern for ``scandir``. This is in fact almost
Below is a good usage pattern for ``scandir``. This is in fact almost
exactly how the scandir module's faster ``os.walk()`` implementation
uses it::
dirs = []
non_dirs = []
for entry in scandir(path):
for entry in os.scandir(path):
if entry.is_dir():
dirs.append(entry)
else:
non_dirs.append(entry)
The above ``os.walk()``-like code will be significantly using scandir
on both Windows and Linux or OS X.
The above ``os.walk()``-like code will be significantly faster with
scandir than ``os.listdir()`` and ``os.path.isdir()`` on both Windows
and POSIX systems.
Or, for getting the total size of files in a directory tree -- showing
use of the ``DirEntry.lstat()`` method::
Or, for getting the total size of files in a directory tree, showing
use of the ``DirEntry.lstat()`` method and ``DirEntry.full_name``
attribute::
def get_tree_size(path):
"""Return total size of files in path and subdirs."""
size = 0
for entry in scandir(path):
total = 0
for entry in os.scandir(path):
if entry.is_dir():
sub_path = os.path.join(path, entry.name)
size += get_tree_size(sub_path)
total += get_tree_size(entry.full_name)
else:
size += entry.lstat().st_size
return size
total += entry.lstat().st_size
return total
Note that ``get_tree_size()`` will get a huge speed boost on Windows,
because no extra stat calls are needed, but on Linux and OS X the size
because no extra stat calls are needed, but on POSIX systems the size
information is not returned by the directory iteration functions, so
this function won't gain anything there.
Notes on caching
----------------
The ``DirEntry`` objects are relatively dumb -- the ``name`` and
``full_name`` attributes are obviously always cached, and the ``is_X``
and ``lstat`` methods cache their values (immediately on Windows via
``FindNextFile``, and on first use on POSIX systems via a ``stat``
call) and never refetch from the system.
For this reason, ``DirEntry`` objects are intended to be used and
thrown away after iteration, not stored in long-lived data structures
and the methods called again and again.
If developers want "refresh" behaviour (for example, for watching a
file's size change), they can simply use ``pathlib.Path`` objects,
or call the regular ``os.lstat()`` or ``os.path.getsize()`` functions
which get fresh data from the operating system every call.
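The caching behaviour described here can be observed with a small experiment. This is a sketch under one assumption worth stating loudly: it uses ``os.scandir()`` as shipped in Python 3.5 and later, where this draft's ``DirEntry.lstat()`` is spelled ``DirEntry.stat(follow_symlinks=False)``. The function name is illustrative only.

```python
import os
import tempfile


def cached_vs_fresh_size(directory, filename):
    """Return (cached size before write, cached size after write, fresh size)."""
    file_path = os.path.join(directory, filename)
    # Exhaust the iterator so the directory handle is released promptly.
    entries = {entry.name: entry for entry in os.scandir(directory)}
    entry = entries[filename]

    # Shipped spelling of this draft's entry.lstat().
    cached_before = entry.stat(follow_symlinks=False).st_size
    with open(file_path, "ab") as f:
        f.write(b"x" * 100)
    # The DirEntry never refetches, so this still reports the old size.
    cached_after = entry.stat(follow_symlinks=False).st_size
    # os.lstat() asks the operating system again on every call.
    fresh = os.lstat(file_path).st_size
    return cached_before, cached_after, fresh


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "log.txt"), "w").close()
        print(cached_vs_fresh_size(tmp, "log.txt"))
```

Starting from an empty file, the two cached sizes stay at zero while the fresh ``os.lstat()`` call sees the 100 bytes that were appended.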
Notes on exception handling
---------------------------
``DirEntry.is_X()`` and ``DirEntry.lstat()`` are explicitly methods
rather than attributes or properties, to make it clear that they may
not be cheap operations, and they may do a system call. As a result,
these methods may raise ``OSError``.
For example, ``DirEntry.lstat()`` will always make a system call on
POSIX-based systems, and the ``DirEntry.is_X()`` methods will make a
``stat()`` system call on such systems if ``readdir()`` returns a
``d_type`` with a value of ``DT_UNKNOWN``, which can occur under
certain conditions or on certain file systems.
For this reason, when a user requires fine-grained error handling,
it's good to catch ``OSError`` around these method calls and then
handle as appropriate.
For example, below is a version of the ``get_tree_size()`` example
shown above, but with basic error handling added::
def get_tree_size(path):
"""Return total size of files in path and subdirs. If
is_dir() or lstat() fails, print an error message to stderr
and assume zero size (for example, file has been deleted).
"""
total = 0
for entry in os.scandir(path):
try:
is_dir = entry.is_dir()
except OSError as error:
print('Error calling is_dir():', error, file=sys.stderr)
continue
if is_dir:
total += get_tree_size(entry.full_name)
else:
try:
total += entry.lstat().st_size
except OSError as error:
print('Error calling lstat():', error, file=sys.stderr)
return total
Support
=======
@ -185,6 +254,10 @@ The scandir module on GitHub has been forked and used quite a bit (see
direct support for a scandir-like function from core developers and
others on the python-dev and python-ideas mailing lists. A sampling:
* **python-dev**: a good number of +1's and very few negatives for
scandir and PEP 471 on `this June 2014 python-dev thread
<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_
* **Nick Coghlan**, a core Python developer: "I've had the local Red
Hat release engineering team express their displeasure at having to
stat every file in a network mounted directory tree for info that is
@ -225,9 +298,10 @@ specific os.scandir API would be a good thing."
Use in the wild
===============
To date, ``scandir`` is definitely useful, but has been clearly marked
"beta", so it's uncertain how much use of it there is in the wild. Ben
Hoyt has had several reports from people using it. For example:
To date, the ``scandir`` implementation is definitely useful, but has
been clearly marked "beta", so it's uncertain how much use of it there
is in the wild. Ben Hoyt has had several reports from people using it.
For example:
* Chris F: "I am processing some pretty large directories and was half
expecting to have to modify getdents. So thanks for saving me the
@ -250,12 +324,12 @@ created. See `PyPI package`_.
GitHub stats don't mean too much, but scandir does have several
watchers, issues, forks, etc. Here's the run-down of the stats as
of June 5, 2014:
of July 7, 2014:
* Watchers: 17
* Stars: 48
* Forks: 15
* Issues: 2 open, 19 closed
* Stars: 57
* Forks: 20
* Issues: 4 open, 26 closed
**However, the much larger point is this:** if this PEP is accepted,
``os.walk()`` can easily be reimplemented using ``scandir`` rather
@ -266,53 +340,205 @@ production code that would benefit from this large speedup of
of ``os.walk`` (194,000) as there are of ``os.mkdir`` (230,000).
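To make the reimplementation idea concrete, here is a deliberately simplified, top-down walk built on ``os.scandir()`` (Python 3.5 and later). It is a sketch, not the standard library's actual implementation: ``simple_walk`` is a hypothetical name, it has no ``onerror``, ``topdown`` or ``followlinks`` options, and it silently skips directories it cannot read.

```python
import os


def simple_walk(top):
    """Yield (dirpath, dirnames, filenames) tuples, top-down, like os.walk(top)."""
    dirnames, filenames = [], []
    try:
        # One directory iteration call replaces listdir() plus one stat() per entry.
        entries = list(os.scandir(top))
    except OSError:
        return  # unreadable directory: skip it (assumption made for brevity)
    for entry in entries:
        try:
            is_dir = entry.is_dir()
        except OSError:
            is_dir = False  # treat entries that cannot be inspected as files
        (dirnames if is_dir else filenames).append(entry.name)
    yield top, dirnames, filenames
    for name in dirnames:
        new_path = os.path.join(top, name)
        # Do not descend into symlinked directories.
        if not os.path.islink(new_path):
            yield from simple_walk(new_path)
```

A caller would use it exactly like ``os.walk()``, for example ``for dirpath, dirnames, filenames in simple_walk('.'): ...``.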
Open issues and optional things
===============================
There are a few open issues or optional additions:
Rejected ideas
==============
Should scandir be in its own module?
Naming
------
The only other real contender for this function's name was
``iterdir()``. However, ``iterX()`` functions in Python (mostly found
in Python 2) tend to be simple iterator equivalents of their
non-iterator counterparts. For example, ``dict.iterkeys()`` is just an
iterator version of ``dict.keys()``, but the objects returned are
identical. In ``scandir()``'s case, however, the return values are
quite different objects (``DirEntry`` objects vs filename strings), so
this should probably be reflected by a difference in name -- hence
``scandir()``.
See some `relevant discussion on python-dev
<https://mail.python.org/pipermail/python-dev/2014-June/135228.html>`_.
Wildcard support
----------------
``FindFirstFile``/``FindNextFile`` on Windows support passing a
"wildcard" like ``*.jpg``, so at first folks (this PEP's author
included) felt it would be a good idea to include a
``windows_wildcard`` keyword argument to the ``scandir`` function so
users could pass this in.
However, on further thought and discussion it was decided that this
would be a bad idea, *unless it could be made cross-platform* (a
``pattern`` keyword argument or similar). This seems easy enough at
first -- just use the OS wildcard support on Windows, and something
like ``fnmatch`` or ``re`` afterwards on POSIX-based systems.
Unfortunately the exact Windows wildcard matching rules aren't really
documented anywhere by Microsoft, and they're quite quirky (see this
`blog post
<http://blogs.msdn.com/b/oldnewthing/archive/2007/12/17/6785519.aspx>`_),
meaning it's very problematic to emulate using ``fnmatch`` or regexes.
So the consensus was that Windows wildcard support was a bad idea.
It would be possible to add at a later date if there's a
cross-platform way to achieve it, but not for the initial version.
Read more on `this Nov 2012 python-ideas thread
<https://mail.python.org/pipermail/python-ideas/2012-November/017770.html>`_
and this `June 2014 python-dev thread on PEP 471
<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_.
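Without a wildcard argument, callers can still filter names themselves after iteration. The sketch below does this with ``fnmatch`` on top of ``os.scandir()`` (Python 3.5 and later). The helper ``scandir_matching`` is hypothetical and not part of the proposal, and ``fnmatch`` rules are shell-style, so they do not reproduce the Windows wildcard quirks mentioned above.

```python
import fnmatch
import os
import tempfile


def scandir_matching(path, pattern):
    """Yield DirEntry objects in path whose names match a shell-style pattern."""
    for entry in os.scandir(path):
        # fnmatch.fnmatch() normalizes case according to the platform's rules.
        if fnmatch.fnmatch(entry.name, pattern):
            yield entry


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.jpg", "b.txt"):
            open(os.path.join(tmp, name), "w").close()
        print([entry.name for entry in scandir_matching(tmp, "*.jpg")])
```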
DirEntry attributes being properties
------------------------------------
Should the function be included in the standard library in a new
module, ``scandir.scandir()``, or just as ``os.scandir()`` as
discussed? The preference of this PEP's author (Ben Hoyt) would be
``os.scandir()``, as it's just a single function.
In some ways it would be nicer for the ``DirEntry`` ``is_X()`` and
``lstat()`` to be properties instead of methods, to indicate they're
very cheap or free. However, this isn't quite the case, as ``lstat()``
will require an OS call on POSIX-based systems but not on Windows.
Even ``is_dir()`` and friends may perform an OS call on POSIX-based
systems if the ``dirent.d_type`` value is ``DT_UNKNOWN`` (on certain
file systems).
Also, people would expect the attribute access ``entry.is_dir`` to
only ever raise ``AttributeError``, not ``OSError`` in the case it
makes a system call under the covers. Calling code would have to have
a ``try``/``except`` around what looks like a simple attribute access,
and so it's much better to make them *methods*.
See `this May 2013 python-dev thread
<https://mail.python.org/pipermail/python-dev/2013-May/126184.html>`_
where this PEP's author makes this case and there's agreement from
core developers.
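The concern about properties can be illustrated with a toy class. ``PropertyEntry`` below is entirely hypothetical and is not the proposed ``DirEntry``; it only shows that a property backed by a system call forces callers to wrap plain-looking attribute access in ``try``/``except OSError``.

```python
import os
import stat
import tempfile


class PropertyEntry:
    """Hypothetical entry type where is_dir is a property, for illustration only."""

    def __init__(self, path):
        self.path = path

    @property
    def is_dir(self):
        # Looks like simple attribute access to the caller, but makes a
        # system call and can therefore raise OSError.
        return stat.S_ISDIR(os.lstat(self.path).st_mode)


def describe(path):
    """Classify path, reporting the exception type if the hidden call fails."""
    entry = PropertyEntry(path)
    try:
        return "directory" if entry.is_dir else "not a directory"
    except OSError as error:
        return "error: " + type(error).__name__


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(describe(tmp))
        print(describe(os.path.join(tmp, "missing")))
```

With a method call such as ``entry.is_dir()``, the parentheses at least signal that work may happen and that an exception is possible.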
Should there be a way to access the full path?
----------------------------------------------
DirEntry fields being "static" attribute-only objects
-----------------------------------------------------
Should ``DirEntry``'s have a way to get the full path without using
``os.path.join(path, entry.name)``? This is a pretty common pattern,
and it may be useful to add pathlib-like ``str(entry)`` functionality.
This functionality has also been requested in `issue 13`_ on GitHub.
In `this July 2014 python-dev message
<https://mail.python.org/pipermail/python-dev/2014-July/135303.html>`_,
Paul Moore suggested a solution that was a "thin wrapper round the OS
feature", where the ``DirEntry`` object had only static attributes:
``name``, ``full_name``, and ``is_X``, with the ``st_X`` attributes
only present on Windows. The idea was to use this simpler, lower-level
function as a building block for higher-level functions.
.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
At first there was general agreement that simplifying in this way was
a good thing. However, there were two problems with this approach.
First, the assumption is that the ``is_dir`` and similar attributes are
always present on POSIX, which isn't the case (if ``d_type`` is not
present or is ``DT_UNKNOWN``). Second, it's a much harder-to-use API
in practice, as even the ``is_dir`` attributes aren't always present
on POSIX, and would need to be tested with ``hasattr()`` and then
``os.stat()`` called if they weren't present.
See `this July 2014 python-dev response
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
from this PEP's author detailing why this option is a non-ideal
solution, and the subsequent reply from Paul Moore voicing agreement.
Should it expose Windows wildcard functionality?
------------------------------------------------
DirEntry fields being static with an ensure_lstat option
--------------------------------------------------------
Should ``scandir()`` have a way of exposing the wildcard functionality
in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The
scandir module on GitHub exposes this as a ``windows_wildcard``
keyword argument, allowing Windows power users the option to pass a
custom wildcard to ``FindFirstFile``, which may avoid the need to use
``fnmatch`` or similar on the resulting names. It is named the
unwieldy ``windows_wildcard`` to remind you you're writing power-user,
Windows-only code if you use it.
Another seemingly simpler and attractive option was suggested by
Nick Coghlan in this `June 2014 python-dev message
<https://mail.python.org/pipermail/python-dev/2014-June/135261.html>`_:
make ``DirEntry.is_X`` and ``DirEntry.lstat_result`` properties, and
populate ``DirEntry.lstat_result`` at iteration time, but only if
the new argument ``ensure_lstat=True`` was specified on the
``scandir()`` call.
This boils down to whether ``scandir`` should be about exposing all of
the system's directory iteration features, or simply providing a fast,
simple, cross-platform directory iteration API.
This does have the advantage over the above in that you can easily get
the stat result from ``scandir()`` if you need it. However, it has the
serious disadvantage that fine-grained error handling is messy,
because ``stat()`` will be called (and hence potentially raise
``OSError``) during iteration, leading to a rather ugly, hand-made
iteration loop::
This PEP's author votes for not including ``windows_wildcard`` in the
standard library version, because even though it could be useful in
rare cases (say the Windows Dropbox client?), it'd be too easy to use
it just because you're a Windows developer, and create code that is
not cross-platform.
it = os.scandir(path)
while True:
try:
entry = next(it)
except OSError as error:
handle_error(path, error)
except StopIteration:
break
Or it means that ``scandir()`` would have to accept an ``onerror``
argument -- a function to call when ``stat()`` errors occur during
iteration. This seems to this PEP's author neither as direct nor as
Pythonic as ``try``/``except`` around a ``DirEntry.lstat()`` call.
See `Ben Hoyt's July 2014 reply
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
to the discussion summarizing this and detailing why he thinks the
original PEP 471 proposal is "the right one" after all.
Return values being (name, stat_result) two-tuples
--------------------------------------------------
Initially this PEP's author proposed this concept as a function called
``iterdir_stat()`` which yielded two-tuples of (name, stat_result).
This does have the advantage that there are no new types introduced.
However, the ``stat_result`` is only partially filled on POSIX-based
systems (most fields set to ``None`` and other quirks), so they're not
really ``stat_result`` objects at all, and this would have to be
thoroughly documented as different from ``os.stat()``.
Also, Python has good support for proper objects with attributes and
methods, which makes for a saner and simpler API than two-tuples. It
also makes the ``DirEntry`` objects more extensible and future-proof
as operating systems add functionality and we want to include this in
``DirEntry``.
See also some previous discussion:
* `May 2013 python-dev thread
<https://mail.python.org/pipermail/python-dev/2013-May/126148.html>`_
where Nick Coghlan makes the original case for a ``DirEntry``-style
object.
* `June 2014 python-dev thread
<https://mail.python.org/pipermail/python-dev/2014-June/135244.html>`_
where Nick Coghlan makes (another) good case against the two-tuple
approach.
Return values being overloaded stat_result objects
--------------------------------------------------
Another alternative discussed was making the return values
overloaded ``stat_result`` objects with ``name`` and ``full_name``
attributes. However, apart from this being a strange (and strained!)
kind of overloading, this has the same problems mentioned above --
most of the ``stat_result`` information is not fetched by
``readdir()`` on POSIX systems, only (part of) the ``st_mode`` value.
Return values being pathlib.Path objects
----------------------------------------
With Antoine Pitrou's new standard library ``pathlib`` module, it
at first seems like a great idea for ``scandir()`` to return instances
of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and
``lstat()`` functions are explicitly not cached, whereas ``scandir``
has to cache them by design, because it's (often) returning values
from the original directory iteration system call.
And if the ``pathlib.Path`` instances returned by ``scandir`` cached
lstat values, but the ordinary ``pathlib.Path`` objects explicitly
don't, that would be more than a little confusing.
Guido van Rossum explicitly rejected ``pathlib.Path`` caching lstat in
the context of scandir `here
<https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_,
making ``pathlib.Path`` objects a bad choice for scandir return
values.
Possible improvements
@ -328,6 +554,12 @@ here is a short list of some this PEP's author has in mind:
was suggested on Issue 11406 by Antoine Pitrou.
[`source9 <http://bugs.python.org/msg130125>`_]
* scandir could use a free list to avoid the cost of memory allocation
for each iteration -- a short free list of 10 or maybe even 1 may help.
Suggested by Victor Stinner on a `python-dev thread on June 27`_.
.. _`python-dev thread on June 27`: https://mail.python.org/pipermail/python-dev/2014-June/135232.html
Previous discussion
===================
@ -342,9 +574,12 @@ Previous discussion
``scandir()`` API, including Nick Coghlan's suggestion of scandir
yielding ``DirEntry``-like objects
* `Final thread Ben Hoyt started on python-dev`_ to discuss the
* `Another thread Ben Hoyt started on python-dev`_ to discuss the
interaction between scandir and the new ``pathlib`` module
* `Final thread Ben Hoyt started on python-dev`_ to discuss the first
version of this PEP, with extensive discussion about the API.
* `Question on StackOverflow`_ about why ``os.walk()`` is slow and
pointers on how to fix it (this inspired the author of this PEP
early on)
@ -354,7 +589,8 @@ Previous discussion
.. _`Original thread Ben Hoyt started on python-ideas`: https://mail.python.org/pipermail/python-ideas/2012-November/017770.html
.. _`Further thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-May/126119.html
.. _`Final thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-November/130572.html
.. _`Another thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-November/130572.html
.. _`Final thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2014-June/135215.html
.. _`Question on StackOverflow`: http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder
.. _`BetterWalk`: https://github.com/benhoyt/betterwalk