PEP: 471
Title: os.scandir() function -- a better and faster directory iterator
Version: $Revision$
Last-Modified: $Date$
Author: Ben Hoyt <benhoyt@gmail.com>
BDFL-Delegate: Victor Stinner <victor.stinner@gmail.com>
Status: Accepted
Type: Standards Track
Content-Type: text/x-rst
Created: 30-May-2014
Python-Version: 3.5
Post-History: 27-Jun-2014, 8-Jul-2014, 14-Jul-2014


Abstract
========

This PEP proposes including a new directory iteration function,
``os.scandir()``, in the standard library. This new function adds
useful functionality and increases the speed of ``os.walk()`` by 2-20
times (depending on the platform and file system) by avoiding calls to
``os.stat()`` in most cases.


Rationale
=========

Python's built-in ``os.walk()`` is significantly slower than it needs
to be, because -- in addition to calling ``os.listdir()`` on each
directory -- it executes the ``stat()`` system call or
``GetFileAttributes()`` on each file to determine whether the entry is
a directory or not.

But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on POSIX systems --
already tell you whether the files returned are directories or not, so
no further system calls are needed. Further, the Windows system calls
return all the information for a ``stat_result`` object on the directory
entry, such as file size and last modification time.

In short, you can reduce the number of system calls required for a
tree function like ``os.walk()`` from approximately 2N to N, where N
is the total number of files and directories in the tree. (And because
directory trees are usually wider than they are deep, it's often much
better than this.)

In practice, removing all those extra system calls makes ``os.walk()``
about **8-9 times as fast on Windows**, and about **2-3 times as fast
on POSIX systems**. So we're not talking about
micro-optimizations. See more `benchmarks here`_.

.. _`benchmarks here`: https://github.com/benhoyt/scandir#benchmarks

Somewhat relatedly, many people (see Python `Issue 11406`_) are also
keen on a version of ``os.listdir()`` that yields filenames as it
iterates instead of returning them as one big list. This improves
memory efficiency for iterating very large directories.

So, as well as providing a ``scandir()`` iterator function for calling
directly, Python's existing ``os.walk()`` function can be sped up a
huge amount.

.. _`Issue 11406`: http://bugs.python.org/issue11406


Implementation
==============

The implementation of this proposal was written by Ben Hoyt (initial
version) and Tim Golden (who helped a lot with the C extension
module). It lives on GitHub at `benhoyt/scandir`_. (The implementation
may lag behind the updates to this PEP a little.)

.. _`benhoyt/scandir`: https://github.com/benhoyt/scandir

Note that this module has been used and tested (see "Use in the wild"
section in this PEP), so it's more than a proof-of-concept. However,
it is marked as beta software and is not extensively battle-tested.
It will need some cleanup and more thorough testing before going into
the standard library, as well as integration into ``posixmodule.c``.



Specifics of proposal
=====================

os.scandir()
------------

Specifically, this PEP proposes adding a single function to the ``os``
module in the standard library, ``scandir``, that takes a single,
optional string as its argument::

    scandir(directory='.') -> generator of DirEntry objects

Like ``listdir``, ``scandir`` calls the operating system's directory
iteration system calls to get the names of the files in the given
``directory``, but it's different from ``listdir`` in two ways:

* Instead of returning bare filename strings, it returns lightweight
  ``DirEntry`` objects that hold the filename string and provide
  simple methods that allow access to the additional data the
  operating system may have returned.

* It returns a generator instead of a list, so that ``scandir`` acts
  as a true iterator instead of returning the full list immediately.

``scandir()`` yields a ``DirEntry`` object for each file and
sub-directory in ``directory``. Just like ``listdir``, the ``'.'``
and ``'..'`` pseudo-directories are skipped, and the entries are
yielded in system-dependent order. Each ``DirEntry`` object has the
following attributes and methods:

* ``name``: the entry's filename, relative to the ``directory``
  argument (corresponds to the return values of ``os.listdir``)

* ``path``: the entry's full path name (not necessarily an absolute
  path) -- the equivalent of ``os.path.join(directory, entry.name)``

* ``is_dir(*, follow_symlinks=True)``: similar to
  ``pathlib.Path.is_dir()``, but the return value is cached on the
  ``DirEntry`` object; doesn't require a system call in most cases;
  does not follow symbolic links if ``follow_symlinks`` is False

* ``is_file(*, follow_symlinks=True)``: similar to
  ``pathlib.Path.is_file()``, but the return value is cached on the
  ``DirEntry`` object; doesn't require a system call in most cases;
  does not follow symbolic links if ``follow_symlinks`` is False

* ``is_symlink()``: similar to ``pathlib.Path.is_symlink()``, but the
  return value is cached on the ``DirEntry`` object; doesn't require a
  system call in most cases

* ``stat(*, follow_symlinks=True)``: like ``os.stat()``, but the
  return value is cached on the ``DirEntry`` object; does not require a
  system call on Windows (except for symlinks); does not follow symbolic
  links (like ``os.lstat()``) if ``follow_symlinks`` is False

All *methods* may perform system calls in some cases and therefore
possibly raise ``OSError`` -- see the "Notes on exception handling"
section for more details.

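As a rough illustration of the attributes and methods listed above,
the following sketch classifies each entry of a directory. The helper
name ``describe_entries()`` is hypothetical and not part of this
proposal, and the sketch assumes a Python version that provides
``os.scandir()``::

    import os

    def describe_entries(directory='.'):
        """Return a sorted list of (name, kind) tuples for a directory."""
        result = []
        for entry in os.scandir(directory):
            # is_symlink() is checked first because is_dir() and
            # is_file() follow symbolic links by default.
            if entry.is_symlink():
                kind = 'symlink'
            elif entry.is_dir():
                kind = 'dir'
            elif entry.is_file():
                kind = 'file'
            else:
                kind = 'other'
            result.append((entry.name, kind))
        # Entries arrive in system-dependent order, so sort them.
        return sorted(result)

Each method is called at most once per entry here, so the caching
behaviour does not matter for this particular function.
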
The ``DirEntry`` attribute and method names were chosen to be the same
as those in the new ``pathlib`` module where possible, for
consistency. The only difference in functionality is that the
``DirEntry`` methods cache their values on the entry object after the
first call.

Like the other functions in the ``os`` module, ``scandir()`` accepts
either a bytes or str object for the ``directory`` parameter, and
returns the ``DirEntry.name`` and ``DirEntry.path`` attributes with
the same type as ``directory``. However, it is *strongly recommended*
to use the str type, as this ensures cross-platform support for
Unicode filenames. (On Windows, bytes filenames have been deprecated
since Python 3.3.)

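The type rule described above can be checked with a small sketch. The
helper ``entry_name_types()`` is a hypothetical name used only for
illustration::

    import os

    def entry_name_types(directory):
        """Return the set of types used for entry.name and entry.path."""
        types = set()
        for entry in os.scandir(directory):
            types.add(type(entry.name))
            types.add(type(entry.path))
        return types

For a non-empty directory, passing a str should give ``{str}``, and
passing the same path encoded with ``os.fsencode()`` should give
``{bytes}``.
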
os.walk()
---------

As part of this proposal, ``os.walk()`` will also be modified to use
``scandir()`` rather than ``listdir()`` and ``os.path.isdir()``. This
will increase the speed of ``os.walk()`` very significantly (as
mentioned above, by 2-20 times, depending on the system).

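To make the idea concrete, here is a minimal sketch of a top-down
walk built on ``scandir()``. It is not the standard library
implementation: ``simple_walk()`` is a hypothetical name, and unlike
``os.walk()`` it has no ``topdown``, ``onerror`` or ``followlinks``
arguments and does not honour in-place edits to ``dirnames``::

    import os

    def simple_walk(top):
        """Yield (dirpath, dirnames, filenames), top-down."""
        dirnames = []
        filenames = []
        subdirs = []  # paths of real directories to recurse into
        try:
            for entry in os.scandir(top):
                # is_dir() usually needs no extra system call here.
                if entry.is_dir():
                    dirnames.append(entry.name)
                    # Report symlinks to directories, but do not
                    # recurse into them.
                    if not entry.is_symlink():
                        subdirs.append(entry.path)
                else:
                    filenames.append(entry.name)
        except OSError:
            # Skip directories that cannot be read.
            return
        yield top, dirnames, filenames
        for path in subdirs:
            yield from simple_walk(path)

The directory test comes from the ``DirEntry`` object rather than
from a separate ``os.path.isdir()`` call per name, which is where the
speed-up comes from.

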
Examples
========

First, a very simple example of ``scandir()`` showing use of the
``DirEntry.name`` attribute and the ``DirEntry.is_dir()`` method::

    import os

    def subdirs(path):
        """Yield directory names not starting with '.' under given path."""
        for entry in os.scandir(path):
            if not entry.name.startswith('.') and entry.is_dir():
                yield entry.name

This ``subdirs()`` function will be significantly faster with scandir
than ``os.listdir()`` and ``os.path.isdir()`` on both Windows and POSIX
systems, especially on medium-sized or large directories.

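For comparison, the same function written with ``os.listdir()`` and
``os.path.isdir()`` might look like the sketch below
(``subdirs_listdir()`` is a hypothetical name). It performs a
``stat()`` call for every non-hidden name, which is the overhead that
``scandir()`` avoids in most cases::

    import os

    def subdirs_listdir(path):
        """Yield directory names not starting with '.' under given path."""
        for name in os.listdir(path):
            if name.startswith('.'):
                continue
            # os.path.isdir() performs a stat() call for each name.
            if os.path.isdir(os.path.join(path, name)):
                yield name
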
Or, for getting the total size of files in a directory tree, showing
use of the ``DirEntry.stat()`` method and ``DirEntry.path``
attribute::

    import os

    def get_tree_size(directory):
        """Return total size of files in directory and subdirs."""
        total = 0
        for entry in os.scandir(directory):
            if entry.is_dir(follow_symlinks=False):
                total += get_tree_size(entry.path)
            else:
                total += entry.stat(follow_symlinks=False).st_size
        return total

This also shows the use of the ``follow_symlinks`` parameter to
``is_dir()`` -- in a recursive function like this, we probably don't
want to follow links. (To properly follow links in a recursive
function like this we'd want special handling for the case where
following a symlink leads to a recursive loop.)

Note that ``get_tree_size()`` will get a huge speed boost on Windows,
because no extra stat calls are needed, but on POSIX systems the size
information is not returned by the directory iteration functions, so
this function won't gain anything there.


Notes on caching
----------------

The ``DirEntry`` objects are relatively dumb -- the ``name`` and
``path`` attributes are obviously always cached, and the ``is_X``
and ``stat`` methods cache their values (immediately on Windows via
``FindNextFile``, and on first use on POSIX systems via a ``stat``
system call) and never refetch from the system.

For this reason, ``DirEntry`` objects are intended to be used and
thrown away after iteration, not stored in long-lived data structures
and the methods called again and again.

If developers want "refresh" behaviour (for example, for watching a
file's size change), they can simply use ``pathlib.Path`` objects,
or call the regular ``os.stat()`` or ``os.path.getsize()`` functions
which get fresh data from the operating system every call.

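The difference between cached and fresh values can be seen in a small
sketch. ``stale_and_fresh_size()`` is a hypothetical helper that
appends to the entry's file, so it should only be pointed at a
scratch file::

    import os

    def stale_and_fresh_size(entry, extra=b'more data'):
        """Append to the entry's file; return (cached, fresh) sizes."""
        entry.stat()  # make sure the stat result is cached
        with open(entry.path, 'ab') as f:
            f.write(extra)  # the file grows on disk
        cached = entry.stat().st_size        # from the DirEntry cache
        fresh = os.stat(entry.path).st_size  # re-read from the OS
        return cached, fresh

The first value should still be the size from before the write, while
the second reflects the appended bytes.

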
Notes on exception handling
---------------------------

``DirEntry.is_X()`` and ``DirEntry.stat()`` are explicitly methods
rather than attributes or properties, to make it clear that they may
not be cheap operations (although they often are), and they may do a
system call. As a result, these methods may raise ``OSError``.

For example, ``DirEntry.stat()`` will always make a system call on
POSIX-based systems, and the ``DirEntry.is_X()`` methods will make a
``stat()`` system call on such systems if ``readdir()`` does not
support ``d_type`` or returns a ``d_type`` with a value of
``DT_UNKNOWN``, which can occur under certain conditions or on
certain file systems.

Often this does not matter -- for example, ``os.walk()`` as defined in
the standard library only catches errors around the ``listdir()``
calls.

Also, because the exception-raising behaviour of the ``DirEntry.is_X``
methods matches that of ``pathlib`` -- which only raises ``OSError``
in the case of permissions or other fatal errors, but returns False
if the path doesn't exist or is a broken symlink -- it's often
not necessary to catch errors around the ``is_X()`` calls.

However, when a user requires fine-grained error handling, it may be
desirable to catch ``OSError`` around all method calls and handle as
appropriate.

For example, below is a version of the ``get_tree_size()`` example
shown above, but with fine-grained error handling added::

    import os
    import sys

    def get_tree_size(directory):
        """Return total size of files in directory and subdirs. If
        is_dir() or stat() fails, print an error message to stderr
        and assume zero size (for example, file has been deleted).
        """
        total = 0
        for entry in os.scandir(directory):
            try:
                is_dir = entry.is_dir(follow_symlinks=False)
            except OSError as error:
                print('Error calling is_dir():', error, file=sys.stderr)
                continue
            if is_dir:
                total += get_tree_size(entry.path)
            else:
                try:
                    total += entry.stat(follow_symlinks=False).st_size
                except OSError as error:
                    print('Error calling stat():', error, file=sys.stderr)
        return total


Support
=======

The scandir module on GitHub has been forked and used quite a bit (see
"Use in the wild" in this PEP), but there's also been a fair bit of
direct support for a scandir-like function from core developers and
others on the python-dev and python-ideas mailing lists. A sampling:

* **python-dev**: a good number of +1's and very few negatives for
  scandir and PEP 471 on `this June 2014 python-dev thread
  <https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_

* **Nick Coghlan**, a core Python developer: "I've had the local Red
  Hat release engineering team express their displeasure at having to
  stat every file in a network mounted directory tree for info that is
  present in the dirent structure, so a definite +1 to os.scandir from
  me, so long as it makes that info available."
  [`source1 <http://bugs.python.org/issue11406>`_]

* **Tim Golden**, a core Python developer, supports scandir enough to
  have spent time refactoring and significantly improving scandir's C
  extension module.
  [`source2 <https://github.com/tjguk/scandir>`_]

* **Christian Heimes**, a core Python developer: "+1 for something
  like yielddir()"
  [`source3 <https://mail.python.org/pipermail/python-ideas/2012-November/017772.html>`_]
  and "Indeed! I'd like to see the feature in 3.4 so I can remove my
  own hack from our code base."
  [`source4 <http://bugs.python.org/issue11406>`_]

* **Gregory P. Smith**, a core Python developer: "As 3.4beta1 happens
  tonight, this isn't going to make 3.4 so i'm bumping this to 3.5.
  I really like the proposed design outlined above."
  [`source5 <http://bugs.python.org/issue11406>`_]

* **Guido van Rossum** on the possibility of adding scandir to Python
  3.5 (as it was too late for 3.4): "The ship has likewise sailed for
  adding scandir() (whether to os or pathlib). By all means experiment
  and get it ready for consideration for 3.5, but I don't want to add
  it to 3.4."
  [`source6 <https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_]

Support for this PEP itself (meta-support?) was given by Nick Coghlan
on python-dev: "A PEP reviewing all this for 3.5 and proposing a
specific os.scandir API would be a good thing."
[`source7 <https://mail.python.org/pipermail/python-dev/2013-November/130588.html>`_]


Use in the wild
===============

To date, the ``scandir`` implementation is definitely useful, but has
been clearly marked "beta", so it's uncertain how much use of it there
is in the wild. Ben Hoyt has had several reports from people using it.
For example:

* Chris F: "I am processing some pretty large directories and was half
  expecting to have to modify getdents. So thanks for saving me the
  effort." [via personal email]

* bschollnick: "I wanted to let you know about this, since I am using
  Scandir as a building block for this code. Here's a good example of
  scandir making a radical performance improvement over os.listdir."
  [`source8 <https://github.com/benhoyt/scandir/issues/19>`_]

* Avram L: "I'm testing our scandir for a project I'm working on.
  Seems pretty solid, so first thing, just want to say nice work!"
  [via personal email]

* Matt Z: "I used scandir to dump the contents of a network dir in
  under 15 seconds. 13 root dirs, 60,000 files in the structure. This
  will replace some old VBA code embedded in a spreadsheet that was
  taking 15-20 minutes to do the exact same thing." [via personal
  email]

Others have `requested a PyPI package`_ for it, which has been
created. See `PyPI package`_.

.. _`requested a PyPI package`: https://github.com/benhoyt/scandir/issues/12
.. _`PyPI package`: https://pypi.python.org/pypi/scandir

GitHub stats don't mean too much, but scandir does have several
watchers, issues, forks, etc. Here's the run-down as of
July 7, 2014:

* Watchers: 17
* Stars: 57
* Forks: 20
* Issues: 4 open, 26 closed

Also, because this PEP will increase the speed of ``os.walk()``
significantly, there are thousands of developers and scripts, and a lot
of production code, that would benefit from it. For example, on GitHub,
there are almost as many uses of ``os.walk`` (194,000) as there are of
``os.mkdir`` (230,000).


Rejected ideas
==============


Naming
------

The only other real contender for this function's name was
``iterdir()``. However, ``iterX()`` functions in Python (mostly found
in Python 2) tend to be simple iterator equivalents of their
non-iterator counterparts. For example, ``dict.iterkeys()`` is just an
iterator version of ``dict.keys()``, but the objects returned are
identical. In ``scandir()``'s case, however, the return values are
quite different objects (``DirEntry`` objects vs filename strings), so
this should probably be reflected by a difference in name -- hence
``scandir()``.

See some `relevant discussion on python-dev
<https://mail.python.org/pipermail/python-dev/2014-June/135228.html>`_.


Wildcard support
----------------

``FindFirstFile``/``FindNextFile`` on Windows support passing a
"wildcard" like ``*.jpg``, so at first folks (this PEP's author
included) felt it would be a good idea to include a
``windows_wildcard`` keyword argument to the ``scandir`` function so
users could pass this in.

However, on further thought and discussion it was decided that this
would be a bad idea, *unless it could be made cross-platform* (a
``pattern`` keyword argument or similar). This seems easy enough at
first -- just use the OS wildcard support on Windows, and something
like ``fnmatch`` or ``re`` afterwards on POSIX-based systems.

Unfortunately the exact Windows wildcard matching rules aren't really
documented anywhere by Microsoft, and they're quite quirky (see this
`blog post
<http://blogs.msdn.com/b/oldnewthing/archive/2007/12/17/6785519.aspx>`_),
meaning it's very problematic to emulate using ``fnmatch`` or regexes.

So the consensus was that Windows wildcard support was a bad idea.
It would be possible to add at a later date if there's a
cross-platform way to achieve it, but not for the initial version.

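Without built-in wildcard support, callers can still filter entries
themselves. The following is a minimal sketch, assuming that
``fnmatch`` rules (rather than the Windows wildcard rules) are
acceptable; ``scandir_matching()`` is a hypothetical helper and not
part of this proposal::

    import fnmatch
    import os

    def scandir_matching(directory, pattern):
        """Yield DirEntry objects whose names match a shell pattern."""
        for entry in os.scandir(directory):
            # fnmatch() normalizes case on Windows; fnmatchcase()
            # would compare case-sensitively on every platform.
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry

For instance, ``scandir_matching(path, '*.jpg')`` would yield only
the entries whose names end in ``.jpg``.
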
Read more on `this Nov 2012 python-ideas thread
<https://mail.python.org/pipermail/python-ideas/2012-November/017770.html>`_
and this `June 2014 python-dev thread on PEP 471
<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_.


Methods not following symlinks by default
-----------------------------------------

There was much debate on python-dev (see messages in `this thread
<https://mail.python.org/pipermail/python-dev/2014-July/135485.html>`_)
over whether the ``DirEntry`` methods should follow symbolic links or
not (when the ``is_X()`` methods had no ``follow_symlinks`` parameter).

Initially they did not (see previous versions of this PEP and the
scandir.py module), but Victor Stinner made a pretty compelling case on
python-dev that following symlinks by default is a better idea, because:

* following links is usually what you want (in 92% of cases in the
  standard library, functions using ``os.listdir()`` and
  ``os.path.isdir()`` do follow symlinks)

* that's the precedent set by the similar functions
  ``os.path.isdir()`` and ``pathlib.Path.is_dir()``, so to do
  otherwise would be confusing

* with the non-link-following approach, if you wanted to follow links
  you'd have to say something like ``if (entry.is_symlink() and
  os.path.isdir(entry.path)) or entry.is_dir()``, which is clumsy

As a case in point that shows the non-symlink-following version is
error prone, this PEP's author had a bug caused by getting this
exact test wrong in his initial implementation of ``scandir.walk()``
in scandir.py (see `Issue #4 here
<https://github.com/benhoyt/scandir/issues/4>`_).

In the end there was not total agreement that the methods should
follow symlinks, but there was basic consensus among the most involved
participants, and this PEP's author believes that the above case is
strong enough to warrant following symlinks by default.

In addition, it's straightforward to call the relevant methods with
``follow_symlinks=False`` if the other behaviour is desired.

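The two behaviours can be compared side by side with a short sketch.
``classify_dirs()`` is a hypothetical helper; creating the symlinks
needed to try it may require extra privileges on Windows::

    import os

    def classify_dirs(directory):
        """Map each name to (is_dir() following links, not following)."""
        return {
            entry.name: (entry.is_dir(),
                         entry.is_dir(follow_symlinks=False))
            for entry in os.scandir(directory)
        }

A real directory should map to ``(True, True)``, a symlink pointing
to a directory to ``(True, False)``, and a regular file to
``(False, False)``.

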
DirEntry attributes being properties
------------------------------------

In some ways it would be nicer for the ``DirEntry`` ``is_X()`` and
``stat()`` to be properties instead of methods, to indicate they're
very cheap or free. However, this isn't quite the case, as ``stat()``
will require an OS call on POSIX-based systems but not on Windows.
Even ``is_dir()`` and friends may perform an OS call on POSIX-based
systems if the ``dirent.d_type`` value is ``DT_UNKNOWN`` (on certain
file systems).

Also, people would expect the attribute access ``entry.is_dir`` to
only ever raise ``AttributeError``, not ``OSError`` in the case it
makes a system call under the covers. Calling code would have to have
a ``try``/``except`` around what looks like a simple attribute access,
and so it's much better to make them *methods*.

See `this May 2013 python-dev thread
<https://mail.python.org/pipermail/python-dev/2013-May/126184.html>`_
where this PEP's author makes this case and there's agreement from
core developers.


DirEntry fields being "static" attribute-only objects
-----------------------------------------------------

In `this July 2014 python-dev message
<https://mail.python.org/pipermail/python-dev/2014-July/135303.html>`_,
Paul Moore suggested a solution that was a "thin wrapper round the OS
feature", where the ``DirEntry`` object had only static attributes:
``name``, ``path``, and ``is_X``, with the ``st_X`` attributes only
present on Windows. The idea was to use this simpler, lower-level
function as a building block for higher-level functions.

At first there was general agreement that simplifying in this way was
a good thing. However, there were two problems with this approach.
First, the assumption is that the ``is_dir`` and similar attributes are
always present on POSIX, which isn't the case (if ``d_type`` is not
present or is ``DT_UNKNOWN``). Second, it's a much harder-to-use API
in practice, as even the ``is_dir`` attributes aren't always present
on POSIX, and would need to be tested with ``hasattr()`` and then
``os.stat()`` called if they weren't present.

See `this July 2014 python-dev response
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
from this PEP's author detailing why this option is a non-ideal
solution, and the subsequent reply from Paul Moore voicing agreement.


DirEntry fields being static with an ensure_lstat option
--------------------------------------------------------

Another seemingly simpler and attractive option was suggested by
Nick Coghlan in this `June 2014 python-dev message
<https://mail.python.org/pipermail/python-dev/2014-June/135261.html>`_:
make ``DirEntry.is_X`` and ``DirEntry.lstat_result`` properties, and
populate ``DirEntry.lstat_result`` at iteration time, but only if
the new argument ``ensure_lstat=True`` was specified on the
``scandir()`` call.

This does have the advantage over the above in that you can easily get
the stat result from ``scandir()`` if you need it. However, it has the
serious disadvantage that fine-grained error handling is messy,
because ``stat()`` will be called (and hence potentially raise
``OSError``) during iteration, leading to a rather ugly, hand-made
iteration loop::

    it = os.scandir(directory)
    while True:
        try:
            entry = next(it)
        except OSError as error:
            handle_error(directory, error)
        except StopIteration:
            break

Or it means that ``scandir()`` would have to accept an ``onerror``
argument -- a function to call when ``stat()`` errors occur during
iteration. This seems to this PEP's author neither as direct nor as
Pythonic as ``try``/``except`` around a ``DirEntry.stat()`` call.

Another drawback is that ``os.scandir()`` is written to make code faster.
Always calling ``os.lstat()`` on POSIX would not bring any speedup. In most
cases, you don't need the full ``stat_result`` object -- the ``is_X()``
methods are enough and this information is already known.

See `Ben Hoyt's July 2014 reply
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
to the discussion summarizing this and detailing why he thinks the
original PEP 471 proposal is "the right one" after all.


Return values being (name, stat_result) two-tuples
--------------------------------------------------

Initially this PEP's author proposed this concept as a function called
``iterdir_stat()`` which yielded two-tuples of (name, stat_result).
This does have the advantage that there are no new types introduced.
However, the ``stat_result`` is only partially filled on POSIX-based
systems (most fields set to ``None`` and other quirks), so they're not
really ``stat_result`` objects at all, and this would have to be
thoroughly documented as different from ``os.stat()``.

Also, Python has good support for proper objects with attributes and
methods, which makes for a saner and simpler API than two-tuples. It
also makes the ``DirEntry`` objects more extensible and future-proof
as operating systems add functionality and we want to include this in
``DirEntry``.

See also some previous discussion:

* `May 2013 python-dev thread
  <https://mail.python.org/pipermail/python-dev/2013-May/126148.html>`_
  where Nick Coghlan makes the original case for a ``DirEntry``-style
  object.

* `June 2014 python-dev thread
  <https://mail.python.org/pipermail/python-dev/2014-June/135244.html>`_
  where Nick Coghlan makes (another) good case against the two-tuple
  approach.


Return values being overloaded stat_result objects
--------------------------------------------------

Another alternative discussed was making the return values
overloaded ``stat_result`` objects with ``name`` and ``path``
attributes. However, apart from this being a strange (and strained!)
kind of overloading, this has the same problems mentioned above --
most of the ``stat_result`` information is not fetched by
``readdir()`` on POSIX systems, only (part of) the ``st_mode`` value.


Return values being pathlib.Path objects
----------------------------------------

With Antoine Pitrou's new standard library ``pathlib`` module, it
at first seems like a great idea for ``scandir()`` to return instances
of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and
``stat()`` functions are explicitly not cached, whereas ``scandir``
has to cache them by design, because it's (often) returning values
from the original directory iteration system call.

And if the ``pathlib.Path`` instances returned by ``scandir`` cached
stat values, but the ordinary ``pathlib.Path`` objects explicitly
don't, that would be more than a little confusing.

Guido van Rossum explicitly rejected ``pathlib.Path`` caching stat in
the context of scandir `here
<https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_,
making ``pathlib.Path`` objects a bad choice for scandir return
values.


Possible improvements
=====================

There are many possible improvements one could make to scandir, but
here is a short list of some this PEP's author has in mind:

* scandir could potentially be further sped up by calling ``readdir``
  / ``FindNextFile`` say 50 times per ``Py_BEGIN_ALLOW_THREADS`` block
  so that it stays in the C extension module for longer, and may be
  somewhat faster as a result. This approach hasn't been tested, but
  was suggested on Issue 11406 by Antoine Pitrou.
  [`source9 <http://bugs.python.org/msg130125>`_]

* scandir could use a free list to avoid the cost of memory allocation
  for each iteration -- a short free list of 10 or maybe even 1 may help.
  Suggested by Victor Stinner on a `python-dev thread on June 27`_.

.. _`python-dev thread on June 27`: https://mail.python.org/pipermail/python-dev/2014-June/135232.html


Previous discussion
===================

* `Original November 2012 thread Ben Hoyt started on python-ideas
  <https://mail.python.org/pipermail/python-ideas/2012-November/017770.html>`_
  about speeding up ``os.walk()``

* Python `Issue 11406`_, which includes the original proposal for a
  scandir-like function

* `Further May 2013 thread Ben Hoyt started on python-dev
  <https://mail.python.org/pipermail/python-dev/2013-May/126119.html>`_
  that refined the ``scandir()`` API, including Nick Coghlan's
  suggestion of scandir yielding ``DirEntry``-like objects

* `November 2013 thread Ben Hoyt started on python-dev
  <https://mail.python.org/pipermail/python-dev/2013-November/130572.html>`_
  to discuss the interaction between scandir and the new ``pathlib``
  module

* `June 2014 thread Ben Hoyt started on python-dev
  <https://mail.python.org/pipermail/python-dev/2014-June/135215.html>`_
  to discuss the first version of this PEP, with extensive discussion
  about the API

* `First July 2014 thread Ben Hoyt started on python-dev
  <https://mail.python.org/pipermail/python-dev/2014-July/135377.html>`_
  to discuss his updates to PEP 471

* `Second July 2014 thread Ben Hoyt started on python-dev
  <https://mail.python.org/pipermail/python-dev/2014-July/135485.html>`_
  to discuss the remaining decisions needed to finalize PEP 471,
  specifically whether the ``DirEntry`` methods should follow symlinks
  by default

* `Question on StackOverflow
  <http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder>`_
  about why ``os.walk()`` is slow and pointers on how to fix it (this
  inspired the author of this PEP early on)

* `BetterWalk <https://github.com/benhoyt/betterwalk>`_, this PEP's
  author's previous attempt at this, on which the scandir code is based


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: