PEP 471: Ben Hoyt updates

Victor Stinner 2014-07-18 18:25:41 +02:00
parent 06c61b9447
commit 89ae8bb813
1 changed files with 205 additions and 113 deletions


@ -8,7 +8,7 @@ Type: Standards Track
Content-Type: text/x-rst
Created: 30-May-2014
Python-Version: 3.5
Post-History: 27-Jun-2014, 8-Jul-2014
Post-History: 27-Jun-2014, 8-Jul-2014, 14-Jul-2014, 18-Jul-2014
Abstract
@ -16,9 +16,9 @@ Abstract
This PEP proposes including a new directory iteration function,
``os.scandir()``, in the standard library. This new function adds
useful functionality and increases the speed of ``os.walk()`` by 2-10
times (depending on the platform and file system) by significantly
reducing the number of times ``stat()`` needs to be called.
useful functionality and increases the speed of ``os.walk()`` by 2-20
times (depending on the platform and file system) by avoiding calls to
``os.stat()`` in most cases.
Rationale
@ -34,8 +34,8 @@ But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on POSIX systems --
already tell you whether the files returned are directories or not, so
no further system calls are needed. Further, the Windows system calls
return all the information for a ``stat_result`` object, such as file
size and last modification time.
return all the information for a ``stat_result`` object on the directory
entry, such as file size and last modification time.
In short, you can reduce the number of system calls required for a
tree function like ``os.walk()`` from approximately 2N to N, where N
@ -56,7 +56,7 @@ iterates instead of returning them as one big list. This improves
memory efficiency for iterating very large directories.
So, as well as providing a ``scandir()`` iterator function for calling
directly, Python's existing ``os.walk()`` function could be sped up a
directly, Python's existing ``os.walk()`` function can be sped up a
huge amount.
.. _`Issue 11406`: http://bugs.python.org/issue11406
@ -67,7 +67,8 @@ Implementation
The implementation of this proposal was written by Ben Hoyt (initial
version) and Tim Golden (who helped a lot with the C extension
module). It lives on GitHub at `benhoyt/scandir`_.
module). It lives on GitHub at `benhoyt/scandir`_. (The implementation
may lag behind the updates to this PEP a little.)
.. _`benhoyt/scandir`: https://github.com/benhoyt/scandir
@ -82,67 +83,83 @@ the standard library, as well as integration into ``posixmodule.c``.
Specifics of proposal
=====================
os.scandir()
------------
Specifically, this PEP proposes adding a single function to the ``os``
module in the standard library, ``scandir``, that takes a single,
optional string as its argument::
    scandir(path='.') -> generator of DirEntry objects
    scandir(directory='.') -> generator of DirEntry objects
Like ``listdir``, ``scandir`` calls the operating system's directory
iteration system calls to get the names of the files in the ``path``
directory, but it's different from ``listdir`` in two ways:
iteration system calls to get the names of the files in the given
``directory``, but it's different from ``listdir`` in two ways:
* Instead of returning bare filename strings, it returns lightweight
``DirEntry`` objects that hold the filename string and provide
simple methods that allow access to the additional data the
operating system returned.
operating system may have returned.
* It returns a generator instead of a list, so that ``scandir`` acts
as a true iterator instead of returning the full list immediately.
``scandir()`` yields a ``DirEntry`` object for each file and directory
in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
pseudo-directories are skipped, and the entries are yielded in
system-dependent order. Each ``DirEntry`` object has the following
attributes and methods:
``scandir()`` yields a ``DirEntry`` object for each file and
sub-directory in ``directory``. Just like ``listdir``, the ``'.'``
and ``'..'`` pseudo-directories are skipped, and the entries are
yielded in system-dependent order. Each ``DirEntry`` object has the
following attributes and methods:
* ``name``: the entry's filename, relative to the ``path`` argument
(corresponds to the return values of ``os.listdir``)
* ``name``: the entry's filename, relative to the ``directory``
argument (corresponds to the return values of ``os.listdir``)
* ``full_name``: the entry's full path name -- the equivalent of
``os.path.join(path, entry.name)``
* ``path``: the entry's full path name (not necessarily an absolute
path) -- the equivalent of ``os.path.join(directory, entry.name)``
* ``is_dir()``: like ``os.path.isdir()``, but much cheaper -- it never
requires a system call on Windows, and usually doesn't on POSIX
systems
* ``is_dir(*, follow_symlinks=True)``: similar to
``pathlib.Path.is_dir()``, but the return value is cached on the
``DirEntry`` object; doesn't require a system call in most cases;
does not follow symbolic links if ``follow_symlinks`` is False
* ``is_file()``: like ``os.path.isfile()``, but much cheaper -- it
never requires a system call on Windows, and usually doesn't on
POSIX systems
* ``is_file(*, follow_symlinks=True)``: similar to
``pathlib.Path.is_file()``, but the return value is cached on the
``DirEntry`` object; doesn't require a system call in most cases;
does not follow symbolic links if ``follow_symlinks`` is False
* ``is_symlink()``: like ``os.path.islink()``, but much cheaper -- it
never requires a system call on Windows, and usually doesn't on
POSIX systems
* ``is_symlink()``: similar to ``pathlib.Path.is_symlink()``, but the
return value is cached on the ``DirEntry`` object; doesn't require a
system call in most cases
* ``lstat()``: like ``os.lstat()``, but much cheaper on some systems
-- it only requires a system call on POSIX systems
* ``stat(*, follow_symlinks=True)``: like ``os.stat()``, but the
return value is cached on the ``DirEntry`` object; does not require a
system call on Windows (except for symlinks); does not follow symbolic
links (like ``os.lstat()``) if ``follow_symlinks`` is False
The ``is_X`` methods may perform a ``stat()`` call under certain
conditions (for example, on certain file systems on POSIX systems),
and therefore possibly raise ``OSError``. The ``lstat()`` method will
call ``stat()`` on POSIX systems and therefore also possibly raise
``OSError``. See the "Notes on exception handling" section for more
details.
All *methods* may perform system calls in some cases and therefore
possibly raise ``OSError`` -- see the "Notes on exception handling"
section for more details.
The ``DirEntry`` attribute and method names were chosen to be the same
as those in the new ``pathlib`` module for consistency.
as those in the new ``pathlib`` module where possible, for
consistency. The only difference in functionality is that the
``DirEntry`` methods cache their values on the entry object after the
first call.
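As a rough illustration of the attributes and methods listed above, the sketch below collects one tuple per directory entry. It assumes a Python version in which ``os.scandir()`` is available as proposed here; ``describe()`` is a hypothetical helper name used only for this example.

```python
import os


def describe(directory='.'):
    """Return (name, path, kind, size) tuples for entries in directory.

    describe() is a hypothetical helper, not part of the proposal.
    """
    rows = []
    for entry in os.scandir(directory):
        try:
            if entry.is_symlink():
                kind = 'symlink'
            elif entry.is_dir():
                kind = 'dir'
            elif entry.is_file():
                kind = 'file'
            else:
                kind = 'other'
            # Size of the entry itself, without following symlinks.
            size = entry.stat(follow_symlinks=False).st_size
        except OSError:
            # Any DirEntry method may need a system call and fail.
            kind, size = 'unknown', None
        rows.append((entry.name, entry.path, kind, size))
    return rows
```

Catching ``OSError`` around the whole group of method calls is a deliberately coarse choice for this sketch; finer-grained handling is equally possible.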
Like the other functions in the ``os`` module, ``scandir()`` accepts
either a bytes or str object for the ``path`` parameter, and returns
the ``DirEntry.name`` and ``DirEntry.full_name`` attributes with the
same type as ``path``. However, it is *strongly recommended* to use
the str type, as this ensures cross-platform support for Unicode
filenames.
either a bytes or str object for the ``directory`` parameter, and
returns the ``DirEntry.name`` and ``DirEntry.path`` attributes with
the same type as ``directory``. However, it is *strongly recommended*
to use the str type, as this ensures cross-platform support for
Unicode filenames. (On Windows, bytes filenames have been deprecated
since Python 3.3.)
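The type rule described above can be checked with a small sketch. It assumes ``os.scandir()`` behaves as proposed; ``entry_types()`` is a hypothetical helper name, and ``os.fsencode()`` is used only to build a bytes version of the same directory name.

```python
import os


def entry_types(directory):
    """Return the set of (type(name), type(path)) pairs seen in directory.

    entry_types() is a hypothetical helper used for illustration only.
    """
    return {(type(entry.name), type(entry.path))
            for entry in os.scandir(directory)}
```

Calling ``entry_types('some_dir')`` should report only ``str`` attributes, and ``entry_types(os.fsencode('some_dir'))`` only ``bytes`` attributes, where ``some_dir`` stands for any non-empty directory.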
os.walk()
---------
As part of this proposal, ``os.walk()`` will also be modified to use
``scandir()`` rather than ``listdir()`` and ``os.path.isdir()``. This
will increase the speed of ``os.walk()`` very significantly (as
mentioned above, by 2-20 times, depending on the system).
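To give a feel for how a tree walker can be built on ``scandir()``, here is a minimal sketch. It is not the standard library implementation: ``walk_sketch()`` is a hypothetical name, and the ``topdown``, ``onerror`` and ``followlinks`` parameters of ``os.walk()`` are left out for brevity.

```python
import os


def walk_sketch(top):
    """Yield (dirpath, dirnames, filenames) like a top-down os.walk().

    Simplified illustration: no onerror, topdown or followlinks options.
    """
    dirs, non_dirs = [], []
    try:
        for entry in os.scandir(top):
            try:
                is_dir = entry.is_dir()
            except OSError:
                # Treat entries we cannot inspect as non-directories.
                is_dir = False
            (dirs if is_dir else non_dirs).append(entry.name)
    except OSError:
        # Like os.walk() without onerror, skip unreadable directories.
        return
    yield top, dirs, non_dirs
    for name in dirs:
        path = os.path.join(top, name)
        # Like os.walk() with followlinks=False, do not descend into
        # directories reached through symlinks.
        if not os.path.islink(path):
            yield from walk_sketch(path)
```

The type of each entry comes from ``entry.is_dir()`` rather than a separate ``os.path.isdir()`` call per name, which is where the saving in system calls is expected to come from.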
Examples
@ -154,7 +171,7 @@ uses it::
    dirs = []
    non_dirs = []
    for entry in os.scandir(path):
    for entry in os.scandir(directory):
        if entry.is_dir():
            dirs.append(entry)
        else:
@ -165,19 +182,25 @@ scandir than ``os.listdir()`` and ``os.path.isdir()`` on both Windows
and POSIX systems.
Or, for getting the total size of files in a directory tree, showing
use of the ``DirEntry.lstat()`` method and ``DirEntry.full_name``
use of the ``DirEntry.stat()`` method and ``DirEntry.path``
attribute::
    def get_tree_size(path):
        """Return total size of files in path and subdirs."""
    def get_tree_size(directory):
        """Return total size of files in directory and subdirs."""
        total = 0
        for entry in os.scandir(path):
            if entry.is_dir():
                total += get_tree_size(entry.full_name)
        for entry in os.scandir(directory):
            if entry.is_dir(follow_symlinks=False):
                total += get_tree_size(entry.path)
            else:
                total += entry.lstat().st_size
                total += entry.stat(follow_symlinks=False).st_size
        return total
This also shows the use of the ``follow_symlinks`` parameter to
``is_dir()`` -- in a recursive function like this, we probably don't
want to follow links. (To properly follow links in a recursive
function like this we'd want special handling for the case where
following a symlink leads to a recursive loop.)
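One possible form of the special handling mentioned above is sketched here. It is only an assumption about how loop detection could be done: directories are identified by ``os.path.realpath()``, and the function name and the ``_seen`` parameter are hypothetical.

```python
import os


def get_tree_size_following_links(directory, _seen=None):
    """Return total size of files under directory, following symlinks.

    Each directory is identified by its resolved path; a directory that
    has already been visited is skipped, which breaks symlink loops.
    """
    if _seen is None:
        _seen = set()
    real = os.path.realpath(directory)
    if real in _seen:
        return 0
    _seen.add(real)
    total = 0
    for entry in os.scandir(directory):
        if entry.is_dir():  # follows symlinks by default
            total += get_tree_size_following_links(entry.path, _seen)
        else:
            try:
                total += entry.stat().st_size
            except OSError:
                pass  # for example, a broken symlink
    return total
```

Resolving every directory costs extra system calls, so this variant gives up some of the speed advantage in exchange for safe link following.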
Note that ``get_tree_size()`` will get a huge speed boost on Windows,
because no extra stat calls are needed, but on POSIX systems the size
information is not returned by the directory iteration functions, so
@ -188,10 +211,10 @@ Notes on caching
----------------
The ``DirEntry`` objects are relatively dumb -- the ``name`` and
``full_name`` attributes are obviously always cached, and the ``is_X``
and ``lstat`` methods cache their values (immediately on Windows via
``path`` attributes are obviously always cached, and the ``is_X``
and ``stat`` methods cache their values (immediately on Windows via
``FindNextFile``, and on first use on POSIX systems via a ``stat``
call) and never refetch from the system.
system call) and never refetch from the system.
For this reason, ``DirEntry`` objects are intended to be used and
thrown away after iteration, not stored in long-lived data structures
@ -199,50 +222,61 @@ and the methods called again and again.
If developers want "refresh" behaviour (for example, for watching a
file's size change), they can simply use ``pathlib.Path`` objects,
or call the regular ``os.lstat()`` or ``os.path.getsize()`` functions
or call the regular ``os.stat()`` or ``os.path.getsize()`` functions
which get fresh data from the operating system every call.
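The difference between the cached ``DirEntry`` values and a fresh ``os.stat()`` call can be seen in a small experiment. This is a sketch that assumes the caching behaviour described in this section; ``cached_vs_fresh_size()`` and the file name ``data.bin`` are made up for the example.

```python
import os
import tempfile


def cached_vs_fresh_size():
    """Return (cached_size, fresh_size) for a file that grew after scanning."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'data.bin')
        with open(path, 'wb') as f:
            f.write(b'x' * 10)
        entry = list(os.scandir(tmp))[0]
        entry.stat()  # first call: the result is cached on the entry
        with open(path, 'ab') as f:
            f.write(b'y' * 5)  # the file grows to 15 bytes
        # The entry keeps reporting the old size; os.stat() asks the OS again.
        return entry.stat().st_size, os.stat(entry.path).st_size
```

Code that needs the current size should therefore go through ``os.stat()`` (or rescan the directory) rather than keep old ``DirEntry`` objects around.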
Notes on exception handling
---------------------------
``DirEntry.is_X()`` and ``DirEntry.lstat()`` are explicitly methods
``DirEntry.is_X()`` and ``DirEntry.stat()`` are explicitly methods
rather than attributes or properties, to make it clear that they may
not be cheap operations, and they may do a system call. As a result,
these methods may raise ``OSError``.
not be cheap operations (although they often are), and they may do a
system call. As a result, these methods may raise ``OSError``.
For example, ``DirEntry.lstat()`` will always make a system call on
For example, ``DirEntry.stat()`` will always make a system call on
POSIX-based systems, and the ``DirEntry.is_X()`` methods will make a
``stat()`` system call on such systems if ``readdir()`` returns a
``d_type`` with a value of ``DT_UNKNOWN``, which can occur under
certain conditions or on certain file systems.
``stat()`` system call on such systems if ``readdir()`` does not
support ``d_type`` or returns a ``d_type`` with a value of
``DT_UNKNOWN``, which can occur under certain conditions or on
certain file systems.
For this reason, when a user requires fine-grained error handling,
it's good to catch ``OSError`` around these method calls and then
handle as appropriate.
Often this does not matter -- for example, ``os.walk()`` as defined in
the standard library only catches errors around the ``listdir()``
calls.
Also, because the exception-raising behaviour of the ``DirEntry.is_X``
methods matches that of ``pathlib`` -- which only raises ``OSError``
in the case of permissions or other fatal errors, but returns False
if the path doesn't exist or is a broken symlink -- it's often
not necessary to catch errors around the ``is_X()`` calls.
However, when a user requires fine-grained error handling, it may be
desirable to catch ``OSError`` around all method calls and handle as
appropriate.
For example, below is a version of the ``get_tree_size()`` example
shown above, but with basic error handling added::
shown above, but with fine-grained error handling added::
    def get_tree_size(path):
        """Return total size of files in path and subdirs. If
        is_dir() or lstat() fails, print an error message to stderr
    def get_tree_size(directory):
        """Return total size of files in directory and subdirs. If
        is_dir() or stat() fails, print an error message to stderr
        and assume zero size (for example, file has been deleted).
        """
        total = 0
        for entry in os.scandir(path):
        for entry in os.scandir(directory):
            try:
                is_dir = entry.is_dir()
                is_dir = entry.is_dir(follow_symlinks=False)
            except OSError as error:
                print('Error calling is_dir():', error, file=sys.stderr)
                continue
            if is_dir:
                total += get_tree_size(entry.full_name)
                total += get_tree_size(entry.path)
            else:
                try:
                    total += entry.lstat().st_size
                    total += entry.stat(follow_symlinks=False).st_size
                except OSError as error:
                    print('Error calling lstat():', error, file=sys.stderr)
                    print('Error calling stat():', error, file=sys.stderr)
        return total
@ -316,6 +350,12 @@ For example:
Seems pretty solid, so first thing, just want to say nice work!"
[via personal email]
* Matt Z: "I used scandir to dump the contents of a network dir in
under 15 seconds. 13 root dirs, 60,000 files in the structure. This
will replace some old VBA code embedded in a spreadsheet that was
taking 15-20 minutes to do the exact same thing." [via personal
email]
Others have `requested a PyPI package`_ for it, which has been
created. See `PyPI package`_.
@ -331,13 +371,11 @@ of July 7, 2014:
* Forks: 20
* Issues: 4 open, 26 closed
**However, the much larger point is this:**, if this PEP is accepted,
``os.walk()`` can easily be reimplemented using ``scandir`` rather
than ``listdir`` and ``stat``, increasing the speed of ``os.walk()``
very significantly. There are thousands of developers, scripts, and
production code that would benefit from this large speedup of
``os.walk()``. For example, on GitHub, there are almost as many uses
of ``os.walk`` (194,000) as there are of ``os.mkdir`` (230,000).
Also, because this PEP will increase the speed of ``os.walk()``
significantly, there are thousands of developers and scripts, and a lot
of production code, that would benefit from it. For example, on GitHub,
there are almost as many uses of ``os.walk`` (194,000) as there are of
``os.mkdir`` (230,000).
Rejected ideas
@ -392,12 +430,51 @@ and this `June 2014 python-dev thread on PEP 471
<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_.
Methods not following symlinks by default
-----------------------------------------
There was much debate on python-dev (see messages in `this thread
<https://mail.python.org/pipermail/python-dev/2014-July/135485.html>`_)
over whether the ``DirEntry`` methods should follow symbolic links or
not (when the ``is_X()`` methods had no ``follow_symlinks`` parameter).
Initially they did not (see previous versions of this PEP and the
scandir.py module), but Victor Stinner made a pretty compelling case on
python-dev that following symlinks by default is a better idea, because:
* following links is usually what you want (in 92% of cases in the
standard library, functions using ``os.listdir()`` and
``os.path.isdir()`` do follow symlinks)
* that's the precedent set by the similar functions
``os.path.isdir()`` and ``pathlib.Path.is_dir()``, so to do
otherwise would be confusing
* with the non-link-following approach, if you wanted to follow links
you'd have to say something like ``if (entry.is_symlink() and
os.path.isdir(entry.path)) or entry.is_dir()``, which is clumsy
As a case in point that shows the non-symlink-following version is
error prone, this PEP's author had a bug caused by getting this
exact test wrong in his initial implementation of ``scandir.walk()``
in scandir.py (see `Issue #4 here
<https://github.com/benhoyt/scandir/issues/4>`_).
In the end there was not total agreement that the methods should
follow symlinks, but there was basic consensus among the most involved
participants, and this PEP's author believes that the above case is
strong enough to warrant following symlinks by default.
In addition, it's straightforward to call the relevant methods with
``follow_symlinks=False`` if the other behaviour is desired.
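The two behaviours can be compared directly on a symlink that points at a directory. The sketch below assumes the ``follow_symlinks`` parameter works as specified in this PEP and that the platform allows creating symlinks; ``symlink_checks()`` and the names ``target`` and ``link`` are hypothetical.

```python
import os
import tempfile


def symlink_checks():
    """Return (is_symlink, is_dir, is_dir_nofollow) for a link to a dir."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, 'target')
        os.mkdir(target)
        os.symlink(target, os.path.join(tmp, 'link'),
                   target_is_directory=True)
        entries = {entry.name: entry for entry in os.scandir(tmp)}
        link = entries['link']
        return (link.is_symlink(),
                link.is_dir(),                        # follows the link
                link.is_dir(follow_symlinks=False))   # looks at the link itself
```

With the default, the link counts as a directory because its target is one; with ``follow_symlinks=False`` it does not, since the entry itself is a symlink.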
DirEntry attributes being properties
------------------------------------
In some ways it would be nicer for the ``DirEntry`` ``is_X()`` and
``lstat()`` to be properties instead of methods, to indicate they're
very cheap or free. However, this isn't quite the case, as ``lstat()``
``stat()`` to be properties instead of methods, to indicate they're
very cheap or free. However, this isn't quite the case, as ``stat()``
will require an OS call on POSIX-based systems but not on Windows.
Even ``is_dir()`` and friends may perform an OS call on POSIX-based
systems if the ``dirent.d_type`` value is ``DT_UNKNOWN`` (on certain
@ -422,8 +499,8 @@ In `this July 2014 python-dev message
<https://mail.python.org/pipermail/python-dev/2014-July/135303.html>`_,
Paul Moore suggested a solution that was a "thin wrapper round the OS
feature", where the ``DirEntry`` object had only static attributes:
``name``, ``full_name``, and ``is_X``, with the ``st_X`` attributes
only present on Windows. The idea was to use this simpler, lower-level
``name``, ``path``, and ``is_X``, with the ``st_X`` attributes only
present on Windows. The idea was to use this simpler, lower-level
function as a building block for higher-level functions.
At first there was general agreement that simplifying in this way was
@ -459,19 +536,24 @@ because ``stat()`` will be called (and hence potentially raise
``OSError``) during iteration, leading to a rather ugly, hand-made
iteration loop::
    it = os.scandir(path)
    it = os.scandir(directory)
    while True:
        try:
            entry = next(it)
        except OSError as error:
            handle_error(path, error)
            handle_error(directory, error)
        except StopIteration:
            break
Or it means that ``scandir()`` would have to accept an ``onerror``
argument -- a function to call when ``stat()`` errors occur during
iteration. This seems to this PEP's author neither as direct nor as
Pythonic as ``try``/``except`` around a ``DirEntry.lstat()`` call.
Pythonic as ``try``/``except`` around a ``DirEntry.stat()`` call.
Another drawback is that ``os.scandir()`` is written to make code faster.
Always calling ``os.lstat()`` on POSIX would not bring any speedup. In most
cases, you don't need the full ``stat_result`` object -- the ``is_X()``
methods are enough and this information is already known.
See `Ben Hoyt's July 2014 reply
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
@ -513,7 +595,7 @@ Return values being overloaded stat_result objects
--------------------------------------------------
Another alternative discussed was making the return values
overloaded ``stat_result`` objects with ``name`` and ``full_name``
overloaded ``stat_result`` objects with ``name`` and ``path``
attributes. However, apart from this being a strange (and strained!)
kind of overloading, this has the same problems mentioned above --
most of the ``stat_result`` information is not fetched by
@ -526,15 +608,15 @@ Return values being pathlib.Path objects
With Antoine Pitrou's new standard library ``pathlib`` module, it
at first seems like a great idea for ``scandir()`` to return instances
of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and
``lstat()`` functions are explicitly not cached, whereas ``scandir``
``stat()`` functions are explicitly not cached, whereas ``scandir``
has to cache them by design, because it's (often) returning values
from the original directory iteration system call.
And if the ``pathlib.Path`` instances returned by ``scandir`` cached
lstat values, but the ordinary ``pathlib.Path`` objects explicitly
stat values, but the ordinary ``pathlib.Path`` objects explicitly
don't, that would be more than a little confusing.
Guido van Rossum explicitly rejected ``pathlib.Path`` caching lstat in
Guido van Rossum explicitly rejected ``pathlib.Path`` caching stat in
the context of scandir `here
<https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_,
making ``pathlib.Path`` objects a bad choice for scandir return
@ -564,35 +646,45 @@ here is a short list of some this PEP's author has in mind:
Previous discussion
===================
* `Original thread Ben Hoyt started on python-ideas`_ about speeding
up ``os.walk()``
* `Original November 2012 thread Ben Hoyt started on python-ideas
<https://mail.python.org/pipermail/python-ideas/2012-November/017770.html>`_
about speeding up ``os.walk()``
* Python `Issue 11406`_, which includes the original proposal for a
scandir-like function
* `Further thread Ben Hoyt started on python-dev`_ that refined the
``scandir()`` API, including Nick Coghlan's suggestion of scandir
yielding ``DirEntry``-like objects
* `Further May 2013 thread Ben Hoyt started on python-dev
<https://mail.python.org/pipermail/python-dev/2013-May/126119.html>`_
that refined the ``scandir()`` API, including Nick Coghlan's
suggestion of scandir yielding ``DirEntry``-like objects
* `Another thread Ben Hoyt started on python-dev`_ to discuss the
interaction between scandir and the new ``pathlib`` module
* `November 2013 thread Ben Hoyt started on python-dev
<https://mail.python.org/pipermail/python-dev/2013-November/130572.html>`_
to discuss the interaction between scandir and the new ``pathlib``
module
* `Final thread Ben Hoyt started on python-dev`_ to discuss the first
version of this PEP, with extensive discussion about the API.
* `June 2014 thread Ben Hoyt started on python-dev
<https://mail.python.org/pipermail/python-dev/2014-June/135215.html>`_
to discuss the first version of this PEP, with extensive discussion
about the API
* `Question on StackOverflow`_ about why ``os.walk()`` is slow and
pointers on how to fix it (this inspired the author of this PEP
early on)
* `First July 2014 thread Ben Hoyt started on python-dev
<https://mail.python.org/pipermail/python-dev/2014-July/135377.html>`_
to discuss his updates to PEP 471
* `BetterWalk`_, this PEP's author's previous attempt at this, on
which the scandir code is based
* `Second July 2014 thread Ben Hoyt started on python-dev
<https://mail.python.org/pipermail/python-dev/2014-July/135485.html>`_
to discuss the remaining decisions needed to finalize PEP 471,
specifically whether the ``DirEntry`` methods should follow symlinks
by default
.. _`Original thread Ben Hoyt started on python-ideas`: https://mail.python.org/pipermail/python-ideas/2012-November/017770.html
.. _`Further thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-May/126119.html
.. _`Another thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-November/130572.html
.. _`Final thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2014-June/135215.html
.. _`Question on StackOverflow`: http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder
.. _`BetterWalk`: https://github.com/benhoyt/betterwalk
* `Question on StackOverflow
<http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder>`_
about why ``os.walk()`` is slow and pointers on how to fix it (this
inspired the author of this PEP early on)
* `BetterWalk <https://github.com/benhoyt/betterwalk>`_, this PEP's
author's previous attempt at this, on which the scandir code is based
Copyright