PEP: 471
Title: os.scandir() function -- a better and faster directory iterator
Version: $Revision$
Last-Modified: $Date$
Author: Ben Hoyt <benhoyt@gmail.com>
BDFL-Delegate: Victor Stinner <vstinner@python.org>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 30-May-2014
Python-Version: 3.5
Post-History: 27-Jun-2014, 08-Jul-2014, 14-Jul-2014
Abstract
========
This PEP proposes including a new directory iteration function,
``os.scandir()``, in the standard library. This new function adds
useful functionality and increases the speed of ``os.walk()`` by 2-20
times (depending on the platform and file system) by avoiding calls to
``os.stat()`` in most cases.
Rationale
=========
Python's built-in ``os.walk()`` is significantly slower than it needs
to be, because -- in addition to calling ``os.listdir()`` on each
directory -- it executes the ``stat()`` system call or
``GetFileAttributes()`` on each file to determine whether the entry is
a directory or not.
But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on POSIX systems --
already tell you whether the files returned are directories or not, so
no further system calls are needed. Further, the Windows system calls
return all the information for a ``stat_result`` object on the directory
entry, such as file size and last modification time.
In short, you can reduce the number of system calls required for a
tree function like ``os.walk()`` from approximately 2N to N, where N
is the total number of files and directories in the tree. (And because
directory trees are usually wider than they are deep, it's often much
better than this.)
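To make the difference concrete, here is a minimal sketch (not taken from the reference implementation) that finds subdirectories both ways. The helper names ``subdirs_listdir`` and ``subdirs_scandir`` and the temporary tree are invented for illustration only:

```python
import os
import tempfile

def subdirs_listdir(path):
    """Two-step approach: list the names, then stat each one via os.path.isdir()."""
    return sorted(name for name in os.listdir(path)
                  if os.path.isdir(os.path.join(path, name)))

def subdirs_scandir(path):
    """One-pass approach: the directory entry usually already knows its type."""
    return sorted(entry.name for entry in os.scandir(path) if entry.is_dir())

# Build a small throwaway tree (hypothetical names) and compare the results.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'a'))
    os.mkdir(os.path.join(root, 'b'))
    with open(os.path.join(root, 'file.txt'), 'w') as f:
        f.write('hello')
    listdir_result = subdirs_listdir(root)
    scandir_result = subdirs_scandir(root)
```

Both helpers return the same names; the ``scandir`` version simply avoids the per-entry ``stat()`` call whenever the operating system reports the entry type during iteration.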
In practice, removing all those extra system calls makes ``os.walk()``
about **8-9 times as fast on Windows**, and about **2-3 times as fast
on POSIX systems**. So we're not talking about micro-optimizations. See
more `benchmarks here`_.
.. _`benchmarks here`: https://github.com/benhoyt/scandir#benchmarks
Somewhat relatedly, many people (see Python `Issue 11406`_) are also
keen on a version of ``os.listdir()`` that yields filenames as it
iterates instead of returning them as one big list. This improves
memory efficiency for iterating very large directories.
So, as well as providing a ``scandir()`` iterator function for calling
directly, Python's existing ``os.walk()`` function can be sped up a
huge amount.
.. _`Issue 11406`: http://bugs.python.org/issue11406
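As a rough illustration of the memory-efficiency point, the sketch below takes only the first few entries of a directory without building the full list. The helper name ``first_names`` is hypothetical, and the ``with`` form assumes Python 3.6 or later, where the ``scandir()`` iterator gained context-manager support (that is later than this PEP):

```python
import itertools
import os
import tempfile

def first_names(path, limit):
    """Return at most `limit` entry names without listing the whole directory."""
    # Assumption: Python 3.6+, where the scandir() iterator is a context manager.
    with os.scandir(path) as it:
        return [entry.name for entry in itertools.islice(it, limit)]

with tempfile.TemporaryDirectory() as root:
    for i in range(20):
        open(os.path.join(root, 'f%02d' % i), 'w').close()
    sample = first_names(root, 5)
```

Which five names come back depends on the system-dependent iteration order; the point is only that the remaining entries are never turned into Python objects.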
Implementation
==============
The implementation of this proposal was written by Ben Hoyt (initial
version) and Tim Golden (who helped a lot with the C extension
module). It lives on GitHub at `benhoyt/scandir`_. (The implementation
may lag behind the updates to this PEP a little.)
.. _`benhoyt/scandir`: https://github.com/benhoyt/scandir
Note that this module has been used and tested (see "Use in the wild"
section in this PEP), so it's more than a proof-of-concept. However,
it is marked as beta software and is not extensively battle-tested.
It will need some cleanup and more thorough testing before going into
the standard library, as well as integration into ``posixmodule.c``.
Specifics of proposal
=====================
os.scandir()
------------
Specifically, this PEP proposes adding a single function to the ``os``
module in the standard library, ``scandir``, that takes a single,
optional string as its argument::

    scandir(path='.') -> generator of DirEntry objects

Like ``listdir``, ``scandir`` calls the operating system's directory
iteration system calls to get the names of the files in the given
``path``, but it's different from ``listdir`` in two ways:
* Instead of returning bare filename strings, it returns lightweight
``DirEntry`` objects that hold the filename string and provide
simple methods that allow access to the additional data the
operating system may have returned.
* It returns a generator instead of a list, so that ``scandir`` acts
as a true iterator instead of returning the full list immediately.
``scandir()`` yields a ``DirEntry`` object for each file and
sub-directory in ``path``. Just like ``listdir``, the ``'.'``
and ``'..'`` pseudo-directories are skipped, and the entries are
yielded in system-dependent order. Each ``DirEntry`` object has the
following attributes and methods:
* ``name``: the entry's filename, relative to the scandir ``path``
argument (corresponds to the return values of ``os.listdir``)
* ``path``: the entry's full path name (not necessarily an absolute
path) -- the equivalent of ``os.path.join(scandir_path,
entry.name)``
* ``inode()``: return the inode number of the entry. The result is cached on
the ``DirEntry`` object; use ``os.stat(entry.path,
follow_symlinks=False).st_ino`` to fetch up-to-date information.
On Unix, no system call is required.
* ``is_dir(*, follow_symlinks=True)``: similar to
``pathlib.Path.is_dir()``, but the return value is cached on the
``DirEntry`` object; doesn't require a system call in most cases;
does not follow symbolic links if ``follow_symlinks`` is False
* ``is_file(*, follow_symlinks=True)``: similar to
``pathlib.Path.is_file()``, but the return value is cached on the
``DirEntry`` object; doesn't require a system call in most cases;
does not follow symbolic links if ``follow_symlinks`` is False
* ``is_symlink()``: similar to ``pathlib.Path.is_symlink()``, but the
return value is cached on the ``DirEntry`` object; doesn't require a
system call in most cases
* ``stat(*, follow_symlinks=True)``: like ``os.stat()``, but the
return value is cached on the ``DirEntry`` object; does not require a
system call on Windows (except for symlinks); does not follow symbolic
links (like ``os.lstat()``) if ``follow_symlinks`` is False
All *methods* may perform system calls in some cases and therefore
possibly raise ``OSError`` -- see the "Notes on exception handling"
section for more details.
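The following sketch exercises each attribute and method listed above on a small temporary directory. The file names and the ``info`` dictionary are invented for the example; the calls themselves are the ones described in this section:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'sub'))
    with open(os.path.join(root, 'data.bin'), 'wb') as f:
        f.write(b'x' * 10)

    # Collect what each DirEntry reports, keyed by entry name.
    info = {}
    for entry in os.scandir(root):
        info[entry.name] = {
            'path': entry.path,            # os.path.join(root, entry.name)
            'is_dir': entry.is_dir(),
            'is_file': entry.is_file(),
            'is_symlink': entry.is_symlink(),
            'size': entry.stat().st_size,  # may need a system call on POSIX
            'inode': entry.inode(),
        }
```

Note that only ``name`` and ``path`` are plain attributes; everything else is a method call, which is why the parentheses are needed.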
The ``DirEntry`` attribute and method names were chosen to be the same
as those in the new ``pathlib`` module where possible, for
consistency. The only difference in functionality is that the
``DirEntry`` methods cache their values on the entry object after the
first call.
Like the other functions in the ``os`` module, ``scandir()`` accepts
either a bytes or str object for the ``path`` parameter, and
returns the ``DirEntry.name`` and ``DirEntry.path`` attributes with
the same type as ``path``. However, it is *strongly recommended*
to use the str type, as this ensures cross-platform support for
Unicode filenames. (On Windows, bytes filenames have been deprecated
since Python 3.3.)
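A small sketch of the type-matching rule: the same directory is scanned once with a str path and once with a bytes path. It assumes a platform where bytes paths are accepted without complaint (for example, a typical POSIX system), and the file name is made up for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, 'example.txt'), 'w').close()

    # str path in, str name and path out.
    str_entry = list(os.scandir(root))[0]
    str_name, str_path = str_entry.name, str_entry.path

    # bytes path in, bytes name and path out.
    bytes_entry = list(os.scandir(os.fsencode(root)))[0]
    bytes_name, bytes_path = bytes_entry.name, bytes_entry.path
```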
os.walk()
---------
As part of this proposal, ``os.walk()`` will also be modified to use
``scandir()`` rather than ``listdir()`` and ``os.path.isdir()``. This
will increase the speed of ``os.walk()`` very significantly (as
mentioned above, by 2-20 times, depending on the system).
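To show the idea, here is a deliberately simplified, top-down walk built on ``scandir()``. It is a sketch only, not the actual ``os.walk()`` implementation: the name ``simple_walk`` is hypothetical, and it omits ``onerror``, ``followlinks``, bottom-up traversal, and in-place pruning of the directory list:

```python
import os
import tempfile

def simple_walk(top):
    """Simplified top-down walk built on scandir (sketch, not os.walk itself)."""
    dirs, files, walk_into = [], [], []
    for entry in os.scandir(top):
        # is_dir() normally needs no extra system call here.
        if entry.is_dir():
            dirs.append(entry.name)
            # Like os.walk() by default, list symlinked directories
            # but do not descend into them.
            if not entry.is_symlink():
                walk_into.append(entry.path)
        else:
            files.append(entry.name)
    yield top, dirs, files
    for path in walk_into:
        yield from simple_walk(path)

def normalized(walker, top):
    """Sort results so that system-dependent ordering does not matter."""
    return sorted((dirpath, sorted(dirs), sorted(files))
                  for dirpath, dirs, files in walker(top))

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'a', 'b'))
    os.mkdir(os.path.join(root, 'c'))
    for rel in ('top.txt', os.path.join('a', 'inner.txt')):
        open(os.path.join(root, rel), 'w').close()
    mine = normalized(simple_walk, root)
    reference = normalized(os.walk, root)
```

On this small tree the sketch and ``os.walk()`` report the same directories and files once ordering is normalized.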
Examples
========
First, a very simple example of ``scandir()`` showing use of the
``DirEntry.name`` attribute and the ``DirEntry.is_dir()`` method::

    import os

    def subdirs(path):
        """Yield directory names not starting with '.' under given path."""
        for entry in os.scandir(path):
            if not entry.name.startswith('.') and entry.is_dir():
                yield entry.name

This ``subdirs()`` function will be significantly faster with scandir
than ``os.listdir()`` and ``os.path.isdir()`` on both Windows and POSIX
systems, especially on medium-sized or large directories.
Or, for getting the total size of files in a directory tree, showing
use of the ``DirEntry.stat()`` method and ``DirEntry.path``
attribute::

    import os

    def get_tree_size(path):
        """Return total size of files in given path and subdirs."""
        total = 0
        for entry in os.scandir(path):
            if entry.is_dir(follow_symlinks=False):
                total += get_tree_size(entry.path)
            else:
                total += entry.stat(follow_symlinks=False).st_size
        return total

This also shows the use of the ``follow_symlinks`` parameter to
``is_dir()`` -- in a recursive function like this, we probably don't
want to follow links. (To properly follow links in a recursive
function like this we'd want special handling for the case where
following a symlink leads to a recursive loop.)
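One possible form of that special handling, sketched under stated assumptions: remember each visited directory by its ``(st_dev, st_ino)`` pair and skip any directory seen before. The function name and the ``_seen`` parameter are hypothetical, the approach assumes the platform reports meaningful device and inode numbers, and creating the symlink in the demonstration may require extra privileges on Windows:

```python
import os
import tempfile

def get_tree_size_following_links(path, _seen=None):
    """Total size of files under path, following symlinks but
    visiting each directory at most once (loop protection)."""
    seen = set() if _seen is None else _seen
    st = os.stat(path)                 # follows symlinks
    key = (st.st_dev, st.st_ino)       # identifies the real directory
    if key in seen:
        return 0                       # already visited: break the loop
    seen.add(key)
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir():             # follows symlinks by default
            total += get_tree_size_following_links(entry.path, seen)
        elif entry.is_file():
            total += entry.stat().st_size
    return total

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, 'five.txt'), 'w') as f:
        f.write('12345')
    os.mkdir(os.path.join(root, 'sub'))
    # A symlink pointing back at the root creates a cycle.
    os.symlink(root, os.path.join(root, 'sub', 'loop'),
               target_is_directory=True)
    size = get_tree_size_following_links(root)
```

This sketch only protects against directory loops; a file reachable through several links would still be counted once per link.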
Note that ``get_tree_size()`` will get a huge speed boost on Windows,
because no extra stat calls are needed, but on POSIX systems the size
information is not returned by the directory iteration functions, so
this function won't gain anything there.
Notes on caching
----------------
The ``DirEntry`` objects are relatively dumb -- the ``name`` and
``path`` attributes are obviously always cached, and the ``is_X``
and ``stat`` methods cache their values (immediately on Windows via
``FindNextFile``, and on first use on POSIX systems via a ``stat``
system call) and never refetch from the system.
For this reason, ``DirEntry`` objects are intended to be used and
thrown away after iteration, not stored in long-lived data structures
with their methods called again and again.
If developers want "refresh" behaviour (for example, for watching a
file's size change), they can simply use ``pathlib.Path`` objects,
or call the regular ``os.stat()`` or ``os.path.getsize()`` functions
which get fresh data from the operating system every call.
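The effect of the caching can be seen in a short sketch (file name and variable names are illustrative only): the entry's ``stat()`` result is captured, the file then grows, and the cached value stays the same while ``os.stat()`` sees the new size:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, 'log.txt')
    with open(target, 'w') as f:
        f.write('abc')

    entry = list(os.scandir(root))[0]
    cached_before = entry.stat().st_size   # first call populates the cache

    with open(target, 'a') as f:
        f.write('defg')                    # the file grows to 7 bytes

    cached_after = entry.stat().st_size    # served from the DirEntry cache
    fresh = os.stat(entry.path).st_size    # asks the operating system again
```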
Notes on exception handling
---------------------------
``DirEntry.is_X()`` and ``DirEntry.stat()`` are explicitly methods
rather than attributes or properties, to make it clear that they may
not be cheap operations (although they often are), and they may do a
system call. As a result, these methods may raise ``OSError``.
For example, ``DirEntry.stat()`` will always make a system call on
POSIX-based systems, and the ``DirEntry.is_X()`` methods will make a
``stat()`` system call on such systems if ``readdir()`` does not
support ``d_type`` or returns a ``d_type`` with a value of
``DT_UNKNOWN``, which can occur under certain conditions or on
certain file systems.
Often this does not matter -- for example, ``os.walk()`` as defined in
the standard library only catches errors around the ``listdir()``
calls.
Also, because the exception-raising behaviour of the ``DirEntry.is_X``
methods matches that of ``pathlib`` -- which only raises ``OSError``
in the case of permissions or other fatal errors, but returns False
if the path doesn't exist or is a broken symlink -- it's often
not necessary to catch errors around the ``is_X()`` calls.
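A broken symlink makes a convenient test case for this behaviour. In the sketch below (names are invented, and creating symlinks may need extra privileges on Windows), the ``is_X()`` methods quietly return False for the missing target, whereas ``stat()`` raises ``OSError``:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # 'dangling' points at a path that does not exist.
    os.symlink(os.path.join(root, 'missing'), os.path.join(root, 'dangling'))
    entry = list(os.scandir(root))[0]

    # The is_X() methods report False instead of raising.
    results = (entry.is_symlink(), entry.is_dir(), entry.is_file())

    # stat() follows the link by default, so it fails on the missing target.
    try:
        entry.stat()
        stat_failed = False
    except OSError:
        stat_failed = True
```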
However, when a user requires fine-grained error handling, it may be
desirable to catch ``OSError`` around all method calls and handle as
appropriate.
For example, below is a version of the ``get_tree_size()`` example
shown above, but with fine-grained error handling added::

    import os
    import sys

    def get_tree_size(path):
        """Return total size of files in path and subdirs. If
        is_dir() or stat() fails, print an error message to stderr
        and assume zero size (for example, file has been deleted).
        """
        total = 0
        for entry in os.scandir(path):
            try:
                is_dir = entry.is_dir(follow_symlinks=False)
            except OSError as error:
                print('Error calling is_dir():', error, file=sys.stderr)
                continue
            if is_dir:
                total += get_tree_size(entry.path)
            else:
                try:
                    total += entry.stat(follow_symlinks=False).st_size
                except OSError as error:
                    print('Error calling stat():', error, file=sys.stderr)
        return total

Support
=======
The scandir module on GitHub has been forked and used quite a bit (see
"Use in the wild" in this PEP), but there's also been a fair bit of
direct support for a scandir-like function from core developers and
others on the python-dev and python-ideas mailing lists. A sampling:
* **python-dev**: a good number of +1's and very few negatives for
scandir and :pep:`471` on `this June 2014 python-dev thread
<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_
* **Nick Coghlan**, a core Python developer: "I've had the local Red
Hat release engineering team express their displeasure at having to
stat every file in a network mounted directory tree for info that is
present in the dirent structure, so a definite +1 to os.scandir from
me, so long as it makes that info available."
[`source1 <http://bugs.python.org/issue11406>`_]
* **Tim Golden**, a core Python developer, supports scandir enough to
have spent time refactoring and significantly improving scandir's C
extension module.
[`source2 <https://github.com/tjguk/scandir>`_]
* **Christian Heimes**, a core Python developer: "+1 for something
like yielddir()"
[`source3 <https://mail.python.org/pipermail/python-ideas/2012-November/017772.html>`_]
and "Indeed! I'd like to see the feature in 3.4 so I can remove my
own hack from our code base."
[`source4 <http://bugs.python.org/issue11406>`_]
* **Gregory P. Smith**, a core Python developer: "As 3.4beta1 happens
tonight, this isn't going to make 3.4 so i'm bumping this to 3.5.
I really like the proposed design outlined above."
[`source5 <http://bugs.python.org/issue11406>`_]
* **Guido van Rossum** on the possibility of adding scandir to Python
3.5 (as it was too late for 3.4): "The ship has likewise sailed for
adding scandir() (whether to os or pathlib). By all means experiment
and get it ready for consideration for 3.5, but I don't want to add
it to 3.4."
[`source6 <https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_]
Support for this PEP itself (meta-support?) was given by Nick Coghlan
on python-dev: "A PEP reviewing all this for 3.5 and proposing a
specific os.scandir API would be a good thing."
[`source7 <https://mail.python.org/pipermail/python-dev/2013-November/130588.html>`_]
Use in the wild
===============
To date, the ``scandir`` implementation is definitely useful, but has
been clearly marked "beta", so it's uncertain how much use of it there
is in the wild. Ben Hoyt has had several reports from people using it.
For example:
* Chris F: "I am processing some pretty large directories and was half
expecting to have to modify getdents. So thanks for saving me the
effort." [via personal email]
* bschollnick: "I wanted to let you know about this, since I am using
Scandir as a building block for this code. Here's a good example of
scandir making a radical performance improvement over os.listdir."
[`source8 <https://github.com/benhoyt/scandir/issues/19>`_]
* Avram L: "I'm testing our scandir for a project I'm working on.
Seems pretty solid, so first thing, just want to say nice work!"
[via personal email]
* Matt Z: "I used scandir to dump the contents of a network dir in
under 15 seconds. 13 root dirs, 60,000 files in the structure. This
will replace some old VBA code embedded in a spreadsheet that was
taking 15-20 minutes to do the exact same thing." [via personal
email]
Others have `requested a PyPI package`_ for it, which has been
created. See `PyPI package`_.
.. _`requested a PyPI package`: https://github.com/benhoyt/scandir/issues/12
.. _`PyPI package`: https://pypi.python.org/pypi/scandir
GitHub stats don't mean too much, but scandir does have several
watchers, issues, forks, etc. Here's the run-down of the stats as
of July 7, 2014:
* Watchers: 17
* Stars: 57
* Forks: 20
* Issues: 4 open, 26 closed
Also, because this PEP will increase the speed of ``os.walk()``
significantly, there are thousands of developers and scripts, and a lot
of production code, that would benefit from it. For example, on GitHub,
there are almost as many uses of ``os.walk`` (194,000) as there are of
``os.mkdir`` (230,000).
Rejected ideas
==============
Naming
------
The only other real contender for this function's name was
``iterdir()``. However, ``iterX()`` functions in Python (mostly found
in Python 2) tend to be simple iterator equivalents of their
non-iterator counterparts. For example, ``dict.iterkeys()`` is just an
iterator version of ``dict.keys()``, but the objects returned are
identical. In ``scandir()``'s case, however, the return values are
quite different objects (``DirEntry`` objects vs filename strings), so
this should probably be reflected by a difference in name -- hence
``scandir()``.
See some `relevant discussion on python-dev
<https://mail.python.org/pipermail/python-dev/2014-June/135228.html>`_.
Wildcard support
----------------
``FindFirstFile``/``FindNextFile`` on Windows support passing a
"wildcard" like ``*.jpg``, so at first folks (this PEP's author
included) felt it would be a good idea to include a
``windows_wildcard`` keyword argument to the ``scandir`` function so
users could pass this in.
However, on further thought and discussion it was decided that this
would be a bad idea, *unless it could be made cross-platform* (a
``pattern`` keyword argument or similar). This seems easy enough at
first -- just use the OS wildcard support on Windows, and something
like ``fnmatch`` or ``re`` afterwards on POSIX-based systems.
Unfortunately the exact Windows wildcard matching rules aren't really
documented anywhere by Microsoft, and they're quite quirky (see this
`blog post
<http://blogs.msdn.com/b/oldnewthing/archive/2007/12/17/6785519.aspx>`_),
meaning it's very problematic to emulate using ``fnmatch`` or regexes.
So the consensus was that Windows wildcard support was a bad idea.
It would be possible to add at a later date if there's a
cross-platform way to achieve it, but not for the initial version.
Read more on `this Nov 2012 python-ideas thread
<https://mail.python.org/pipermail/python-ideas/2012-November/017770.html>`_
and this `June 2014 python-dev thread on PEP 471
<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_.
Methods not following symlinks by default
-----------------------------------------
There was much debate on python-dev (see messages in `this thread
<https://mail.python.org/pipermail/python-dev/2014-July/135485.html>`_)
over whether the ``DirEntry`` methods should follow symbolic links or
not (when the ``is_X()`` methods had no ``follow_symlinks`` parameter).
Initially they did not (see previous versions of this PEP and the
scandir.py module), but Victor Stinner made a pretty compelling case on
python-dev that following symlinks by default is a better idea, because:
* following links is usually what you want (in 92% of cases in the
standard library, functions using ``os.listdir()`` and
``os.path.isdir()`` do follow symlinks)
* that's the precedent set by the similar functions
``os.path.isdir()`` and ``pathlib.Path.is_dir()``, so to do
otherwise would be confusing
* with the non-link-following approach, if you wanted to follow links
you'd have to say something like ``if (entry.is_symlink() and
os.path.isdir(entry.path)) or entry.is_dir()``, which is clumsy
As a case in point that shows the non-symlink-following version is
error prone, this PEP's author had a bug caused by getting this
exact test wrong in his initial implementation of ``scandir.walk()``
in scandir.py (see `Issue #4 here
<https://github.com/benhoyt/scandir/issues/4>`_).
In the end there was not total agreement that the methods should
follow symlinks, but there was basic consensus among the most involved
participants, and this PEP's author believes that the above case is
strong enough to warrant following symlinks by default.
In addition, it's straightforward to call the relevant methods with
``follow_symlinks=False`` if the other behaviour is desired.
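The difference between the two behaviours, and the clumsy test the non-following default would have required, can be sketched as follows. The directory and link names are made up, and creating the symlink may need extra privileges on Windows:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'real'))
    os.symlink(os.path.join(root, 'real'), os.path.join(root, 'link'),
               target_is_directory=True)

    entries = {entry.name: entry for entry in os.scandir(root)}
    link = entries['link']

    followed = link.is_dir()                           # default: follow the link
    not_followed = link.is_dir(follow_symlinks=False)  # the link itself is not a dir

    # What callers would have had to write if following were not the default:
    clumsy = ((link.is_symlink() and os.path.isdir(link.path))
              or link.is_dir(follow_symlinks=False))
```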
DirEntry attributes being properties
------------------------------------
In some ways it would be nicer for the ``DirEntry`` ``is_X()`` and
``stat()`` to be properties instead of methods, to indicate they're
very cheap or free. However, this isn't quite the case, as ``stat()``
will require an OS call on POSIX-based systems but not on Windows.
Even ``is_dir()`` and friends may perform an OS call on POSIX-based
systems if the ``dirent.d_type`` value is ``DT_UNKNOWN`` (on certain
file systems).
Also, people would expect the attribute access ``entry.is_dir`` to
only ever raise ``AttributeError``, not ``OSError`` in the case it
makes a system call under the covers. Calling code would have to have
a ``try``/``except`` around what looks like a simple attribute access,
and so it's much better to make them *methods*.
See `this May 2013 python-dev thread
<https://mail.python.org/pipermail/python-dev/2013-May/126184.html>`_
where this PEP's author makes this case and there's agreement from
core developers.
DirEntry fields being "static" attribute-only objects
-----------------------------------------------------
In `this July 2014 python-dev message
<https://mail.python.org/pipermail/python-dev/2014-July/135303.html>`_,
Paul Moore suggested a solution that was a "thin wrapper round the OS
feature", where the ``DirEntry`` object had only static attributes:
``name``, ``path``, and ``is_X``, with the ``st_X`` attributes only
present on Windows. The idea was to use this simpler, lower-level
function as a building block for higher-level functions.
At first there was general agreement that simplifying in this way was
a good thing. However, there were two problems with this approach.
First, the assumption is that the ``is_dir`` and similar attributes are
always present on POSIX, which isn't the case (if ``d_type`` is not
present or is ``DT_UNKNOWN``). Second, it's a much harder-to-use API
in practice, as even the ``is_dir`` attributes aren't always present
on POSIX, and would need to be tested with ``hasattr()`` and then
``os.stat()`` called if they weren't present.
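To illustrate the usability problem, here is a purely hypothetical sketch of the fallback every caller would have needed under that rejected design. Nothing here is a real API: ``types.SimpleNamespace`` objects stand in for the attribute-only entries, and ``is_dir_static`` is an invented helper:

```python
import os
import tempfile
from types import SimpleNamespace

def is_dir_static(entry, dirpath):
    """Fallback logic callers would need under the rejected
    attribute-only design (hypothetical, for illustration)."""
    if hasattr(entry, 'is_dir'):
        return entry.is_dir            # the OS reported the type
    # Attribute absent (d_type missing or DT_UNKNOWN): stat the path ourselves.
    return os.path.isdir(os.path.join(dirpath, entry.name))

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'sub'))
    known = SimpleNamespace(name='sub', is_dir=True)   # type was available
    unknown = SimpleNamespace(name='sub')              # type was not available
    answers = (is_dir_static(known, root), is_dir_static(unknown, root))
```

With ``DirEntry.is_dir()`` as a method, this branching lives inside the method rather than in every caller.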
See `this July 2014 python-dev response
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
from this PEP's author detailing why this option is a non-ideal
solution, and the subsequent reply from Paul Moore voicing agreement.
DirEntry fields being static with an ensure_lstat option
--------------------------------------------------------
Another seemingly simpler and attractive option was suggested by
Nick Coghlan in this `June 2014 python-dev message
<https://mail.python.org/pipermail/python-dev/2014-June/135261.html>`_:
make ``DirEntry.is_X`` and ``DirEntry.lstat_result`` properties, and
populate ``DirEntry.lstat_result`` at iteration time, but only if
the new argument ``ensure_lstat=True`` was specified on the
``scandir()`` call.
This does have the advantage over the above in that you can easily get
the stat result from ``scandir()`` if you need it. However, it has the
serious disadvantage that fine-grained error handling is messy,
because ``stat()`` will be called (and hence potentially raise
``OSError``) during iteration, leading to a rather ugly, hand-made
iteration loop::

    it = os.scandir(path)
    while True:
        try:
            entry = next(it)
        except OSError as error:
            handle_error(path, error)
        except StopIteration:
            break

Or it means that ``scandir()`` would have to accept an ``onerror``
argument -- a function to call when ``stat()`` errors occur during
iteration. This seems to this PEP's author neither as direct nor as
Pythonic as ``try``/``except`` around a ``DirEntry.stat()`` call.
Another drawback is that ``os.scandir()`` is written to make code faster.
Always calling ``os.lstat()`` on POSIX would not bring any speedup. In most
cases, you don't need the full ``stat_result`` object -- the ``is_X()``
methods are enough and this information is already known.
See `Ben Hoyt's July 2014 reply
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
to the discussion summarizing this and detailing why he thinks the
original :pep:`471` proposal is "the right one" after all.
Return values being (name, stat_result) two-tuples
--------------------------------------------------
Initially this PEP's author proposed this concept as a function called
``iterdir_stat()`` which yielded two-tuples of (name, stat_result).
This does have the advantage that there are no new types introduced.
However, the ``stat_result`` is only partially filled on POSIX-based
systems (most fields set to ``None`` and other quirks), so they're not
really ``stat_result`` objects at all, and this would have to be
thoroughly documented as different from ``os.stat()``.
Also, Python has good support for proper objects with attributes and
methods, which makes for a saner and simpler API than two-tuples. It
also makes the ``DirEntry`` objects more extensible and future-proof
as operating systems add functionality and we want to include this in
``DirEntry``.
See also some previous discussion:
* `May 2013 python-dev thread
<https://mail.python.org/pipermail/python-dev/2013-May/126148.html>`_
where Nick Coghlan makes the original case for a ``DirEntry``-style
object.
* `June 2014 python-dev thread
<https://mail.python.org/pipermail/python-dev/2014-June/135244.html>`_
where Nick Coghlan makes (another) good case against the two-tuple
approach.
Return values being overloaded stat_result objects
--------------------------------------------------
Another alternative discussed was making the return values
overloaded ``stat_result`` objects with ``name`` and ``path``
attributes. However, apart from this being a strange (and strained!)
kind of overloading, this has the same problems mentioned above --
most of the ``stat_result`` information is not fetched by
``readdir()`` on POSIX systems, only (part of) the ``st_mode`` value.
Return values being pathlib.Path objects
----------------------------------------
With Antoine Pitrou's new standard library ``pathlib`` module, it
at first seems like a great idea for ``scandir()`` to return instances
of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and
``stat()`` functions are explicitly not cached, whereas ``scandir``
has to cache them by design, because it's (often) returning values
from the original directory iteration system call.
And if the ``pathlib.Path`` instances returned by ``scandir`` cached
stat values, but the ordinary ``pathlib.Path`` objects explicitly
don't, that would be more than a little confusing.
Guido van Rossum explicitly rejected ``pathlib.Path`` caching stat in
the context of scandir `here
<https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_,
making ``pathlib.Path`` objects a bad choice for scandir return
values.
Possible improvements
=====================
There are many possible improvements one could make to scandir, but
here is a short list of some this PEP's author has in mind:
* scandir could potentially be further sped up by calling ``readdir``
/ ``FindNextFile`` say 50 times per ``Py_BEGIN_ALLOW_THREADS`` block
so that it stays in the C extension module for longer, and may be
somewhat faster as a result. This approach hasn't been tested, but
was suggested on Issue 11406 by Antoine Pitrou.
[`source9 <http://bugs.python.org/msg130125>`_]
* scandir could use a free list to avoid the cost of memory allocation
for each iteration -- a short free list of 10 or maybe even 1 may help.
Suggested by Victor Stinner on a `python-dev thread on June 27`_.
.. _`python-dev thread on June 27`: https://mail.python.org/pipermail/python-dev/2014-June/135232.html
Previous discussion
===================
* `Original November 2012 thread Ben Hoyt started on python-ideas
<https://mail.python.org/pipermail/python-ideas/2012-November/017770.html>`_
about speeding up ``os.walk()``
* Python `Issue 11406`_, which includes the original proposal for a
scandir-like function
* `Further May 2013 thread Ben Hoyt started on python-dev
<https://mail.python.org/pipermail/python-dev/2013-May/126119.html>`_
that refined the ``scandir()`` API, including Nick Coghlan's
suggestion of scandir yielding ``DirEntry``-like objects
* `November 2013 thread Ben Hoyt started on python-dev
<https://mail.python.org/pipermail/python-dev/2013-November/130572.html>`_
to discuss the interaction between scandir and the new ``pathlib``
module
* `June 2014 thread Ben Hoyt started on python-dev
<https://mail.python.org/pipermail/python-dev/2014-June/135215.html>`_
to discuss the first version of this PEP, with extensive discussion
about the API
* `First July 2014 thread Ben Hoyt started on python-dev
<https://mail.python.org/pipermail/python-dev/2014-July/135377.html>`_
to discuss his updates to :pep:`471`
* `Second July 2014 thread Ben Hoyt started on python-dev
<https://mail.python.org/pipermail/python-dev/2014-July/135485.html>`_
to discuss the remaining decisions needed to finalize :pep:`471`,
specifically whether the ``DirEntry`` methods should follow symlinks
by default
* `Question on StackOverflow
<http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder>`_
about why ``os.walk()`` is slow and pointers on how to fix it (this
inspired the author of this PEP early on)
* `BetterWalk <https://github.com/benhoyt/betterwalk>`_, this PEP's
author's previous attempt at this, on which the scandir code is based
Copyright
=========
This document has been placed in the public domain.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: