PEP: 471
Title: os.scandir() function -- a better and faster directory iterator
Version: $Revision$
Last-Modified: $Date$
Author: Ben Hoyt <benhoyt@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-May-2014
Python-Version: 3.5
Post-History: 27-Jun-2014, 8-Jul-2014


Abstract
========

This PEP proposes including a new directory iteration function,
``os.scandir()``, in the standard library. This new function adds
useful functionality and increases the speed of ``os.walk()`` by 2-10
times (depending on the platform and file system) by significantly
reducing the number of times ``stat()`` needs to be called.


Rationale
=========

Python's built-in ``os.walk()`` is significantly slower than it needs
to be, because -- in addition to calling ``os.listdir()`` on each
directory -- it executes the ``stat()`` system call or
``GetFileAttributes()`` on each file to determine whether the entry is
a directory or not.

But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on POSIX systems --
already tell you whether the files returned are directories or not, so
no further system calls are needed. Further, the Windows system calls
return all the information for a ``stat_result`` object, such as file
size and last modification time.

In short, you can reduce the number of system calls required for a
tree function like ``os.walk()`` from approximately 2N to N, where N
is the total number of files and directories in the tree. (And because
directory trees are usually wider than they are deep, it's often much
better than this.)
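
To make the arithmetic above concrete, here is a rough sketch that counts
requests instead of timing them. It models the current approach with a
hypothetical helper, ``count_calls_listdir_walk()``, which is not part of
this proposal: one ``os.listdir()`` per directory, plus one ``lstat()``
per entry purely to find out whether the entry is a directory.

```python
import os
import stat
import tempfile


def count_calls_listdir_walk(top):
    """Walk ``top`` the listdir-plus-stat way, counting requests.

    Returns (listdir_calls, stat_calls). Hypothetical helper, for
    illustration only.
    """
    listdir_calls = 0
    stat_calls = 0
    pending = [top]
    while pending:
        directory = pending.pop()
        listdir_calls += 1
        for name in os.listdir(directory):
            full = os.path.join(directory, name)
            stat_calls += 1  # one extra call per entry, just to classify it
            if stat.S_ISDIR(os.lstat(full).st_mode):
                pending.append(full)
    return listdir_calls, stat_calls


# Build a small throwaway tree: 3 subdirectories with 4 files each.
with tempfile.TemporaryDirectory() as top:
    for d in range(3):
        subdir = os.path.join(top, 'dir%d' % d)
        os.mkdir(subdir)
        for f in range(4):
            with open(os.path.join(subdir, 'file%d.txt' % f), 'w'):
                pass
    listdir_calls, stat_calls = count_calls_listdir_walk(top)

print(listdir_calls, stat_calls)  # prints: 4 15
```

The 15 per-entry calls (N, the number of files and directories in the
tree) are the ones a ``scandir``-based walk can avoid whenever the
operating system reports the entry type during directory iteration.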

In practice, removing all those extra system calls makes ``os.walk()``
about **8-9 times as fast on Windows**, and about **2-3 times as fast
on POSIX systems**. So we're not talking about micro-optimizations.
See more `benchmarks here`_.

.. _`benchmarks here`: https://github.com/benhoyt/scandir#benchmarks

Somewhat relatedly, many people (see Python `Issue 11406`_) are also
keen on a version of ``os.listdir()`` that yields filenames as it
iterates instead of returning them as one big list. This improves
memory efficiency for iterating very large directories.

So, as well as providing a ``scandir()`` iterator function for calling
directly, Python's existing ``os.walk()`` function could be sped up a
huge amount.

.. _`Issue 11406`: http://bugs.python.org/issue11406


Implementation
==============

The implementation of this proposal was written by Ben Hoyt (initial
version) and Tim Golden (who helped a lot with the C extension
module). It lives on GitHub at `benhoyt/scandir`_.

.. _`benhoyt/scandir`: https://github.com/benhoyt/scandir

Note that this module has been used and tested (see "Use in the wild"
section in this PEP), so it's more than a proof-of-concept. However,
it is marked as beta software and is not extensively battle-tested.
It will need some cleanup and more thorough testing before going into
the standard library, as well as integration into ``posixmodule.c``.


Specifics of proposal
=====================

Specifically, this PEP proposes adding a single function to the ``os``
module in the standard library, ``scandir``, that takes a single,
optional string as its argument::

    scandir(path='.') -> generator of DirEntry objects

Like ``listdir``, ``scandir`` calls the operating system's directory
iteration system calls to get the names of the files in the ``path``
directory, but it's different from ``listdir`` in two ways:

* Instead of returning bare filename strings, it returns lightweight
  ``DirEntry`` objects that hold the filename string and provide
  simple methods that allow access to the additional data the
  operating system returned.

* It returns a generator instead of a list, so that ``scandir`` acts
  as a true iterator instead of returning the full list immediately.

``scandir()`` yields a ``DirEntry`` object for each file and directory
in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
pseudo-directories are skipped, and the entries are yielded in
system-dependent order. Each ``DirEntry`` object has the following
attributes and methods:

* ``name``: the entry's filename, relative to the ``path`` argument
  (corresponds to the return values of ``os.listdir``)

* ``full_name``: the entry's full path name -- the equivalent of
  ``os.path.join(path, entry.name)``

* ``is_dir()``: like ``os.path.isdir()``, but much cheaper -- it never
  requires a system call on Windows, and usually doesn't on POSIX
  systems

* ``is_file()``: like ``os.path.isfile()``, but much cheaper -- it
  never requires a system call on Windows, and usually doesn't on
  POSIX systems

* ``is_symlink()``: like ``os.path.islink()``, but much cheaper -- it
  never requires a system call on Windows, and usually doesn't on
  POSIX systems

* ``lstat()``: like ``os.lstat()``, but much cheaper on some systems
  -- it only requires a system call on POSIX systems

The ``is_X`` methods may perform a ``stat()`` call under certain
conditions (for example, on certain file systems on POSIX systems),
and therefore possibly raise ``OSError``. The ``lstat()`` method will
call ``stat()`` on POSIX systems and therefore also possibly raise
``OSError``. See the "Notes on exception handling" section for more
details.

The ``DirEntry`` attribute and method names were chosen to be the same
as those in the new ``pathlib`` module for consistency.

Like the other functions in the ``os`` module, ``scandir()`` accepts
either a bytes or str object for the ``path`` parameter, and returns
the ``DirEntry.name`` and ``DirEntry.full_name`` attributes with the
same type as ``path``. However, it is *strongly recommended* to use
the str type, as this ensures cross-platform support for Unicode
filenames.
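
As a rough illustration of the interface described above, the sketch
below emulates it in pure Python. The names ``SketchDirEntry`` and
``scandir_sketch()`` are hypothetical stand-ins, not the proposed
implementation. The sketch mirrors only the shape of the API: it always
calls ``os.lstat()`` on first use (so it has none of the speed benefit),
it classifies entries without following symlinks (an assumption), and it
is not truly lazy because it is built on ``os.listdir()``.

```python
import os
import stat
import tempfile


class SketchDirEntry:
    """Hypothetical pure-Python stand-in for the proposed DirEntry."""

    def __init__(self, directory, name):
        self.name = name
        self.full_name = os.path.join(directory, name)
        self._lstat = None  # filled in on first use, then never refreshed

    def lstat(self):
        if self._lstat is None:
            self._lstat = os.lstat(self.full_name)  # may raise OSError
        return self._lstat

    def is_dir(self):
        return stat.S_ISDIR(self.lstat().st_mode)

    def is_file(self):
        return stat.S_ISREG(self.lstat().st_mode)

    def is_symlink(self):
        return stat.S_ISLNK(self.lstat().st_mode)


def scandir_sketch(path='.'):
    """Yield a SketchDirEntry for each entry in path ('.' and '..' are
    already skipped by os.listdir), in system-dependent order."""
    for name in os.listdir(path):
        yield SketchDirEntry(path, name)


# Try it on a throwaway directory holding one subdirectory and one file.
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, 'sub'))
    with open(os.path.join(top, 'data.txt'), 'w') as f:
        f.write('hello')
    entries = {entry.name: entry for entry in scandir_sketch(top)}
    kinds = {name: entry.is_dir() for name, entry in entries.items()}
    size = entries['data.txt'].lstat().st_size

print(kinds, size)
```

Because ``os.path.join()`` keeps the type of its arguments, passing a
bytes ``path`` to this sketch yields bytes ``name`` and ``full_name``
values, matching the rule stated above.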


Examples
========

Below is a good usage pattern for ``scandir``. This is in fact almost
exactly how the scandir module's faster ``os.walk()`` implementation
uses it::

    dirs = []
    non_dirs = []
    for entry in os.scandir(path):
        if entry.is_dir():
            dirs.append(entry)
        else:
            non_dirs.append(entry)

The above ``os.walk()``-like code will be significantly faster with
scandir than ``os.listdir()`` and ``os.path.isdir()`` on both Windows
and POSIX systems.
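
Extending that pattern, a simplified ``os.walk()`` lookalike might be
sketched as follows. This is not the scandir module's actual code:
``walk_sketch()`` is a hypothetical name, it has no ``topdown``,
``onerror`` or ``followlinks`` options, and it assumes an
``os.scandir()`` that behaves as proposed here. (The function of that
name in Python 3.5 and later provides the ``name``, ``is_dir()`` and
``is_symlink()`` members the sketch relies on, so it runs there as-is.)

```python
import os
import tempfile


def walk_sketch(top):
    """Yield (dirpath, dirnames, filenames) tuples, top-down."""
    dirs = []
    non_dirs = []
    for entry in os.scandir(top):
        if entry.is_dir():
            dirs.append(entry)
        else:
            non_dirs.append(entry)
    yield top, [e.name for e in dirs], [e.name for e in non_dirs]
    for entry in dirs:
        # Don't descend into symlinked directories (avoids link loops).
        if not entry.is_symlink():
            yield from walk_sketch(os.path.join(top, entry.name))


# Walk a tiny throwaway tree: top/a/b/ and top/a/f.txt
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, 'a', 'b'))
    with open(os.path.join(top, 'a', 'f.txt'), 'w'):
        pass
    walked = [(os.path.relpath(d, top), sorted(ds), sorted(fs))
              for d, ds, fs in walk_sketch(top)]

print(walked)
```

The only per-entry work is the ``is_dir()`` call, which is where the
saving over ``os.listdir()`` plus ``os.path.isdir()`` comes from.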

Or, for getting the total size of files in a directory tree, showing
use of the ``DirEntry.lstat()`` method and ``DirEntry.full_name``
attribute::

    import os

    def get_tree_size(path):
        """Return total size of files in path and subdirs."""
        total = 0
        for entry in os.scandir(path):
            if entry.is_dir():
                total += get_tree_size(entry.full_name)
            else:
                total += entry.lstat().st_size
        return total

Note that ``get_tree_size()`` will get a huge speed boost on Windows,
because no extra stat calls are needed, but on POSIX systems the size
information is not returned by the directory iteration functions, so
this function won't gain anything there.


Notes on caching
----------------

The ``DirEntry`` objects are relatively dumb -- the ``name`` and
``full_name`` attributes are obviously always cached, and the ``is_X``
and ``lstat`` methods cache their values (immediately on Windows via
``FindNextFile``, and on first use on POSIX systems via a ``stat``
call) and never refetch from the system.

For this reason, ``DirEntry`` objects are intended to be used and
thrown away after iteration, not stored in long-lived data structures
with their methods called again and again.

If developers want "refresh" behaviour (for example, for watching a
file's size change), they can simply use ``pathlib.Path`` objects,
or call the regular ``os.lstat()`` or ``os.path.getsize()`` functions,
which get fresh data from the operating system on every call.
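
The difference between cached and fresh data can be seen in a small
sketch. ``CachedEntry`` is a hypothetical stand-in that copies only the
caching rule described above (fetch once, never refetch); it is not the
proposed ``DirEntry`` itself.

```python
import os
import tempfile


class CachedEntry:
    """Hypothetical holder that fetches lstat() data once and never
    refreshes it, as described for DirEntry above."""

    def __init__(self, full_name):
        self.full_name = full_name
        self._lstat = None

    def lstat(self):
        if self._lstat is None:
            self._lstat = os.lstat(self.full_name)
        return self._lstat


with tempfile.TemporaryDirectory() as top:
    full_name = os.path.join(top, 'log.txt')
    with open(full_name, 'w') as f:
        f.write('abc')

    entry = CachedEntry(full_name)
    first = entry.lstat().st_size

    with open(full_name, 'a') as f:
        f.write('defgh')  # the file grows after the first lstat() call

    cached = entry.lstat().st_size       # still the old, cached value
    fresh = os.lstat(full_name).st_size  # asks the operating system again

print(first, cached, fresh)  # prints: 3 3 8
```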


Notes on exception handling
---------------------------

``DirEntry.is_X()`` and ``DirEntry.lstat()`` are explicitly methods
rather than attributes or properties, to make it clear that they may
not be cheap operations, and they may do a system call. As a result,
these methods may raise ``OSError``.

For example, ``DirEntry.lstat()`` will always make a system call on
POSIX-based systems, and the ``DirEntry.is_X()`` methods will make a
``stat()`` system call on such systems if ``readdir()`` returns a
``d_type`` with a value of ``DT_UNKNOWN``, which can occur under
certain conditions or on certain file systems.

For this reason, when a user requires fine-grained error handling,
it's good to catch ``OSError`` around these method calls and then
handle as appropriate.

For example, below is a version of the ``get_tree_size()`` example
shown above, but with basic error handling added::

    import os
    import sys

    def get_tree_size(path):
        """Return total size of files in path and subdirs. If
        is_dir() or lstat() fails, print an error message to stderr
        and assume zero size (for example, file has been deleted).
        """
        total = 0
        for entry in os.scandir(path):
            try:
                is_dir = entry.is_dir()
            except OSError as error:
                print('Error calling is_dir():', error, file=sys.stderr)
                continue
            if is_dir:
                total += get_tree_size(entry.full_name)
            else:
                try:
                    total += entry.lstat().st_size
                except OSError as error:
                    print('Error calling lstat():', error, file=sys.stderr)
        return total


Support
=======

The scandir module on GitHub has been forked and used quite a bit (see
"Use in the wild" in this PEP), but there's also been a fair bit of
direct support for a scandir-like function from core developers and
others on the python-dev and python-ideas mailing lists. A sampling:

* **python-dev**: a good number of +1's and very few negatives for
  scandir and PEP 471 on `this June 2014 python-dev thread
  <https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_

* **Nick Coghlan**, a core Python developer: "I've had the local Red
  Hat release engineering team express their displeasure at having to
  stat every file in a network mounted directory tree for info that is
  present in the dirent structure, so a definite +1 to os.scandir from
  me, so long as it makes that info available."
  [`source1 <http://bugs.python.org/issue11406>`_]

* **Tim Golden**, a core Python developer, supports scandir enough to
  have spent time refactoring and significantly improving scandir's C
  extension module.
  [`source2 <https://github.com/tjguk/scandir>`_]

* **Christian Heimes**, a core Python developer: "+1 for something
  like yielddir()"
  [`source3 <https://mail.python.org/pipermail/python-ideas/2012-November/017772.html>`_]
  and "Indeed! I'd like to see the feature in 3.4 so I can remove my
  own hack from our code base."
  [`source4 <http://bugs.python.org/issue11406>`_]

* **Gregory P. Smith**, a core Python developer: "As 3.4beta1 happens
  tonight, this isn't going to make 3.4 so i'm bumping this to 3.5.
  I really like the proposed design outlined above."
  [`source5 <http://bugs.python.org/issue11406>`_]

* **Guido van Rossum** on the possibility of adding scandir to Python
  3.5 (as it was too late for 3.4): "The ship has likewise sailed for
  adding scandir() (whether to os or pathlib). By all means experiment
  and get it ready for consideration for 3.5, but I don't want to add
  it to 3.4."
  [`source6 <https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_]

Support for this PEP itself (meta-support?) was given by Nick Coghlan
on python-dev: "A PEP reviewing all this for 3.5 and proposing a
specific os.scandir API would be a good thing."
[`source7 <https://mail.python.org/pipermail/python-dev/2013-November/130588.html>`_]


Use in the wild
===============

To date, the ``scandir`` implementation is definitely useful, but has
been clearly marked "beta", so it's uncertain how much use of it there
is in the wild. Ben Hoyt has had several reports from people using it.
For example:

* Chris F: "I am processing some pretty large directories and was half
  expecting to have to modify getdents. So thanks for saving me the
  effort." [via personal email]

* bschollnick: "I wanted to let you know about this, since I am using
  Scandir as a building block for this code. Here's a good example of
  scandir making a radical performance improvement over os.listdir."
  [`source8 <https://github.com/benhoyt/scandir/issues/19>`_]

* Avram L: "I'm testing our scandir for a project I'm working on.
  Seems pretty solid, so first thing, just want to say nice work!"
  [via personal email]

Others have `requested a PyPI package`_ for it, which has been
created. See `PyPI package`_.

.. _`requested a PyPI package`: https://github.com/benhoyt/scandir/issues/12
.. _`PyPI package`: https://pypi.python.org/pypi/scandir

GitHub stats don't mean too much, but scandir does have several
watchers, issues, forks, etc. Here's the run-down of the stats as of
July 7, 2014:

* Watchers: 17
* Stars: 57
* Forks: 20
* Issues: 4 open, 26 closed

**However, the much larger point is this:** if this PEP is accepted,
``os.walk()`` can easily be reimplemented using ``scandir`` rather
than ``listdir`` and ``stat``, increasing the speed of ``os.walk()``
very significantly. There are thousands of developers, scripts, and
production code bases that would benefit from this large speedup of
``os.walk()``. For example, on GitHub, there are almost as many uses
of ``os.walk`` (194,000) as there are of ``os.mkdir`` (230,000).


Rejected ideas
==============


Naming
------

The only other real contender for this function's name was
``iterdir()``. However, ``iterX()`` functions in Python (mostly found
in Python 2) tend to be simple iterator equivalents of their
non-iterator counterparts. For example, ``dict.iterkeys()`` is just an
iterator version of ``dict.keys()``, but the objects returned are
identical. In ``scandir()``'s case, however, the return values are
quite different objects (``DirEntry`` objects vs filename strings), so
this should probably be reflected by a difference in name -- hence
``scandir()``.

See some `relevant discussion on python-dev
<https://mail.python.org/pipermail/python-dev/2014-June/135228.html>`_.


Wildcard support
----------------

``FindFirstFile``/``FindNextFile`` on Windows support passing a
"wildcard" like ``*.jpg``, so at first folks (this PEP's author
included) felt it would be a good idea to include a
``windows_wildcard`` keyword argument to the ``scandir`` function so
users could pass this in.

However, on further thought and discussion it was decided that this
would be a bad idea, *unless it could be made cross-platform* (a
``pattern`` keyword argument or similar). This seems easy enough at
first -- just use the OS wildcard support on Windows, and something
like ``fnmatch`` or ``re`` afterwards on POSIX-based systems.

Unfortunately the exact Windows wildcard matching rules aren't really
documented anywhere by Microsoft, and they're quite quirky (see this
`blog post
<http://blogs.msdn.com/b/oldnewthing/archive/2007/12/17/6785519.aspx>`_),
meaning it's very problematic to emulate using ``fnmatch`` or regexes.

So the consensus was that Windows wildcard support was a bad idea.
It would be possible to add it at a later date if there's a
cross-platform way to achieve it, but not for the initial version.

Read more on `this Nov 2012 python-ideas thread
<https://mail.python.org/pipermail/python-ideas/2012-November/017770.html>`_
and this `June 2014 python-dev thread on PEP 471
<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_.
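
Callers who want pattern matching can still filter the entries
themselves. The sketch below shows one way, using a hypothetical helper
``scandir_matching()`` that is not part of this proposal. It assumes an
``os.scandir()`` that behaves as proposed (as in Python 3.5 and later),
applies ``fnmatch`` rules in Python after each entry has been read, and
makes no attempt to reproduce the Windows wildcard rules.

```python
import fnmatch
import os
import tempfile


def scandir_matching(path, pattern):
    """Yield entries of path whose name matches an fnmatch-style pattern.

    Hypothetical helper: the filtering happens in Python, not in the
    operating system's directory iteration call.
    """
    for entry in os.scandir(path):
        if fnmatch.fnmatch(entry.name, pattern):
            yield entry


# Pick the .jpg files out of a throwaway directory.
with tempfile.TemporaryDirectory() as top:
    for name in ('a.jpg', 'b.png', 'c.jpg'):
        with open(os.path.join(top, name), 'w'):
            pass
    matched = sorted(entry.name for entry in scandir_matching(top, '*.jpg'))

print(matched)  # prints: ['a.jpg', 'c.jpg']
```

Because the matching is done by ``fnmatch`` on every platform, the same
pattern gives the same kind of result everywhere, which is the
cross-platform property the rejected keyword argument could not
guarantee.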


DirEntry attributes being properties
------------------------------------

In some ways it would be nicer for the ``DirEntry`` ``is_X()`` and
``lstat()`` to be properties instead of methods, to indicate they're
very cheap or free. However, this isn't quite the case, as ``lstat()``
will require an OS call on POSIX-based systems but not on Windows.
Even ``is_dir()`` and friends may perform an OS call on POSIX-based
systems if the ``dirent.d_type`` value is ``DT_UNKNOWN`` (on certain
file systems).

Also, people would expect the attribute access ``entry.is_dir`` to
only ever raise ``AttributeError``, not ``OSError`` in the case it
makes a system call under the covers. Calling code would have to have
a ``try``/``except`` around what looks like a simple attribute access,
and so it's much better to make them *methods*.

See `this May 2013 python-dev thread
<https://mail.python.org/pipermail/python-dev/2013-May/126184.html>`_
where this PEP's author makes this case and there's agreement from
core developers.


DirEntry fields being "static" attribute-only objects
-----------------------------------------------------

In `this July 2014 python-dev message
<https://mail.python.org/pipermail/python-dev/2014-July/135303.html>`_,
Paul Moore suggested a solution that was a "thin wrapper round the OS
feature", where the ``DirEntry`` object had only static attributes:
``name``, ``full_name``, and ``is_X``, with the ``st_X`` attributes
only present on Windows. The idea was to use this simpler, lower-level
function as a building block for higher-level functions.

At first there was general agreement that simplifying in this way was
a good thing. However, there were two problems with this approach.
First, the assumption is that the ``is_dir`` and similar attributes are
always present on POSIX, which isn't the case (if ``d_type`` is not
present or is ``DT_UNKNOWN``). Second, it's a much harder-to-use API
in practice, as even the ``is_dir`` attributes aren't always present
on POSIX, and would need to be tested with ``hasattr()`` and then
``os.stat()`` called if they weren't present.

See `this July 2014 python-dev response
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
from this PEP's author detailing why this option is a non-ideal
solution, and the subsequent reply from Paul Moore voicing agreement.
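
To show what the ``hasattr()`` dance would look like for callers, here
is a small sketch. The attribute-only entries are faked with
``types.SimpleNamespace``, and ``entry_is_dir()`` is a hypothetical
helper; neither reflects code from the rejected proposal itself.

```python
import os
import stat
import tempfile
from types import SimpleNamespace


def entry_is_dir(entry):
    """Classify a hypothetical attribute-only entry.

    The is_dir attribute is only present when the OS supplied the
    file type; otherwise the caller has to fall back to a stat call.
    """
    if hasattr(entry, 'is_dir'):
        return entry.is_dir
    return stat.S_ISDIR(os.stat(entry.full_name).st_mode)  # may raise OSError


with tempfile.TemporaryDirectory() as top:
    sub = os.path.join(top, 'sub')
    os.mkdir(sub)
    # Type known from the directory listing: attribute present.
    known = SimpleNamespace(name='sub', full_name=sub, is_dir=True)
    # Type unknown (the DT_UNKNOWN case): attribute absent.
    unknown = SimpleNamespace(name='sub', full_name=sub)
    results = (entry_is_dir(known), entry_is_dir(unknown))

print(results)  # prints: (True, True)
```

Every caller would need a helper like this, whereas with ``is_dir()`` as
a method the fallback lives inside ``DirEntry``.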


DirEntry fields being static with an ensure_lstat option
--------------------------------------------------------

Another seemingly simpler and attractive option was suggested by
Nick Coghlan in this `June 2014 python-dev message
<https://mail.python.org/pipermail/python-dev/2014-June/135261.html>`_:
make ``DirEntry.is_X`` and ``DirEntry.lstat_result`` properties, and
populate ``DirEntry.lstat_result`` at iteration time, but only if
the new argument ``ensure_lstat=True`` was specified on the
``scandir()`` call.

This does have the advantage over the above in that you can easily get
the stat result from ``scandir()`` if you need it. However, it has the
serious disadvantage that fine-grained error handling is messy,
because ``stat()`` will be called (and hence potentially raise
``OSError``) during iteration, leading to a rather ugly, hand-made
iteration loop::

    it = os.scandir(path)
    while True:
        try:
            entry = next(it)
        except OSError as error:
            handle_error(path, error)
        except StopIteration:
            break

Or it means that ``scandir()`` would have to accept an ``onerror``
argument -- a function to call when ``stat()`` errors occur during
iteration. This seems to this PEP's author neither as direct nor as
Pythonic as ``try``/``except`` around a ``DirEntry.lstat()`` call.

See `Ben Hoyt's July 2014 reply
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
to the discussion summarizing this and detailing why he thinks the
original PEP 471 proposal is "the right one" after all.


Return values being (name, stat_result) two-tuples
--------------------------------------------------

Initially this PEP's author proposed this concept as a function called
``iterdir_stat()`` which yielded two-tuples of (name, stat_result).
This does have the advantage that there are no new types introduced.
However, the ``stat_result`` is only partially filled on POSIX-based
systems (most fields set to ``None`` and other quirks), so they're not
really ``stat_result`` objects at all, and this would have to be
thoroughly documented as different from ``os.stat()``.

Also, Python has good support for proper objects with attributes and
methods, which makes for a saner and simpler API than two-tuples. It
also makes the ``DirEntry`` objects more extensible and future-proof
as operating systems add functionality and we want to include this in
``DirEntry``.

See also some previous discussion:

* `May 2013 python-dev thread
  <https://mail.python.org/pipermail/python-dev/2013-May/126148.html>`_
  where Nick Coghlan makes the original case for a ``DirEntry``-style
  object.

* `June 2014 python-dev thread
  <https://mail.python.org/pipermail/python-dev/2014-June/135244.html>`_
  where Nick Coghlan makes (another) good case against the two-tuple
  approach.


Return values being overloaded stat_result objects
--------------------------------------------------

Another alternative discussed was making the return values
overloaded ``stat_result`` objects with ``name`` and ``full_name``
attributes. However, apart from this being a strange (and strained!)
kind of overloading, this has the same problems mentioned above --
most of the ``stat_result`` information is not fetched by
``readdir()`` on POSIX systems, only (part of) the ``st_mode`` value.


Return values being pathlib.Path objects
----------------------------------------

With Antoine Pitrou's new standard library ``pathlib`` module, it
at first seems like a great idea for ``scandir()`` to return instances
of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and
``lstat()`` functions are explicitly not cached, whereas ``scandir``
has to cache them by design, because it's (often) returning values
from the original directory iteration system call.

And if the ``pathlib.Path`` instances returned by ``scandir`` cached
lstat values, but the ordinary ``pathlib.Path`` objects explicitly
don't, that would be more than a little confusing.

Guido van Rossum explicitly rejected ``pathlib.Path`` caching lstat in
the context of scandir `here
<https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_,
making ``pathlib.Path`` objects a bad choice for scandir return
values.


Possible improvements
=====================

There are many possible improvements one could make to scandir, but
here is a short list of some this PEP's author has in mind:

* scandir could potentially be further sped up by calling ``readdir``
  / ``FindNextFile`` say 50 times per ``Py_BEGIN_ALLOW_THREADS`` block
  so that it stays in the C extension module for longer, and may be
  somewhat faster as a result. This approach hasn't been tested, but
  was suggested on Issue 11406 by Antoine Pitrou.
  [`source9 <http://bugs.python.org/msg130125>`_]

* scandir could use a free list to avoid the cost of memory allocation
  for each iteration -- a short free list of 10 or maybe even 1 may help.
  Suggested by Victor Stinner on a `python-dev thread on June 27`_.

.. _`python-dev thread on June 27`: https://mail.python.org/pipermail/python-dev/2014-June/135232.html


Previous discussion
===================

* `Original thread Ben Hoyt started on python-ideas`_ about speeding
  up ``os.walk()``

* Python `Issue 11406`_, which includes the original proposal for a
  scandir-like function

* `Further thread Ben Hoyt started on python-dev`_ that refined the
  ``scandir()`` API, including Nick Coghlan's suggestion of scandir
  yielding ``DirEntry``-like objects

* `Another thread Ben Hoyt started on python-dev`_ to discuss the
  interaction between scandir and the new ``pathlib`` module

* `Final thread Ben Hoyt started on python-dev`_ to discuss the first
  version of this PEP, with extensive discussion about the API.

* `Question on StackOverflow`_ about why ``os.walk()`` is slow and
  pointers on how to fix it (this inspired the author of this PEP
  early on)

* `BetterWalk`_, this PEP's author's previous attempt at this, on
  which the scandir code is based

.. _`Original thread Ben Hoyt started on python-ideas`: https://mail.python.org/pipermail/python-ideas/2012-November/017770.html
.. _`Further thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-May/126119.html
.. _`Another thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-November/130572.html
.. _`Final thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2014-June/135215.html
.. _`Question on StackOverflow`: http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder
.. _`BetterWalk`: https://github.com/benhoyt/betterwalk


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: