PEP 471: update by Ben Hoyt
After the significant discussion on python-dev about PEP 471, I've now made the relevant updates and improved a few things.
parent
4b500b691d
commit
689e1bff5e
434
pep-0471.txt
|
@ -8,6 +8,7 @@ Type: Standards Track
|
|||
Content-Type: text/x-rst
|
||||
Created: 30-May-2014
|
||||
Python-Version: 3.5
|
||||
Post-History: 27-Jun-2014, 8-Jul-2014
|
||||
|
||||
|
||||
Abstract
|
||||
|
@ -25,32 +26,36 @@ Rationale
|
|||
|
||||
Python's built-in ``os.walk()`` is significantly slower than it needs
|
||||
to be, because -- in addition to calling ``os.listdir()`` on each
|
||||
directory -- it executes the system call ``os.stat()`` or
|
||||
directory -- it executes the ``stat()`` system call or
|
||||
``GetFileAttributes()`` on each file to determine whether the entry is
|
||||
a directory or not.
|
||||
|
||||
But the underlying system calls -- ``FindFirstFile`` /
|
||||
``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
|
||||
``FindNextFile`` on Windows and ``readdir`` on POSIX systems --
|
||||
already tell you whether the files returned are directories or not, so
|
||||
no further system calls are needed. In short, you can reduce the
|
||||
number of system calls from approximately 2N to N, where N is the
|
||||
total number of files and directories in the tree. (And because
|
||||
directory trees are usually much wider than they are deep, it's often
|
||||
much better than this.)
|
||||
no further system calls are needed. Further, the Windows system calls
|
||||
return all the information for a ``stat_result`` object, such as file
|
||||
size and last modification time.
|
||||
|
||||
In short, you can reduce the number of system calls required for a
|
||||
tree function like ``os.walk()`` from approximately 2N to N, where N
|
||||
is the total number of files and directories in the tree. (And because
|
||||
directory trees are usually wider than they are deep, it's often much
|
||||
better than this.)
|
||||
|
||||
In practice, removing all those extra system calls makes ``os.walk()``
|
||||
about **8-9 times as fast on Windows**, and about **2-3 times as fast
|
||||
on Linux and Mac OS X**. So we're not talking about micro-
|
||||
optimizations. See more `benchmarks`_.
|
||||
on POSIX systems**. So we're not talking about micro-
|
||||
optimizations. See more `benchmarks here`_.
|
||||
|
||||
.. _`benchmarks`: https://github.com/benhoyt/scandir#benchmarks
|
||||
.. _`benchmarks here`: https://github.com/benhoyt/scandir#benchmarks
|
||||
|
||||
Somewhat relatedly, many people (see Python `Issue 11406`_) are also
|
||||
keen on a version of ``os.listdir()`` that yields filenames as it
|
||||
iterates instead of returning them as one big list. This improves
|
||||
memory efficiency for iterating very large directories.
|
||||
|
||||
So as well as providing a ``scandir()`` iterator function for calling
|
||||
So, as well as providing a ``scandir()`` iterator function for calling
|
||||
directly, Python's existing ``os.walk()`` function could be sped up a
|
||||
huge amount.
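
As a rough illustration of the point above, a simplified top-down walk can be built directly on ``scandir``. This is only a sketch, not the implementation proposed for ``os.walk()``: the name ``walk_sketch`` is hypothetical, error handling is omitted, and it uses ``DirEntry.path`` from the ``os.scandir()`` API as released in Python 3.5+ (this draft calls that attribute ``full_name``).

```python
import os


def walk_sketch(top):
    """Yield (dirpath, dirnames, filenames) like a simplified os.walk()."""
    dirs, files = [], []
    for entry in os.scandir(top):
        # is_dir() can usually be answered from the directory read itself,
        # so no per-entry stat() call is needed in the common case.
        if entry.is_dir():
            dirs.append(entry)
        else:
            files.append(entry)
    yield top, [d.name for d in dirs], [f.name for f in files]
    for d in dirs:
        # Like os.walk() with followlinks=False, do not descend into
        # directories reached through a symlink.
        if not d.is_symlink():
            yield from walk_sketch(d.path)
```

Entries are yielded in system-dependent order, so callers that need stable output should sort the name lists themselves.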
|
||||
|
||||
|
@ -70,7 +75,7 @@ Note that this module has been used and tested (see "Use in the wild"
|
|||
section in this PEP), so it's more than a proof-of-concept. However,
|
||||
it is marked as beta software and is not extensively battle-tested.
|
||||
It will need some cleanup and more thorough testing before going into
|
||||
the standard library, as well as integration into `posixmodule.c`.
|
||||
the standard library, as well as integration into ``posixmodule.c``.
|
||||
|
||||
|
||||
|
||||
|
@ -87,10 +92,10 @@ Like ``listdir``, ``scandir`` calls the operating system's directory
|
|||
iteration system calls to get the names of the files in the ``path``
|
||||
directory, but it's different from ``listdir`` in two ways:
|
||||
|
||||
* Instead of bare filename strings, it returns lightweight
|
||||
* Instead of returning bare filename strings, it returns lightweight
|
||||
``DirEntry`` objects that hold the filename string and provide
|
||||
simple methods that allow access to the stat-like data the operating
|
||||
system returned.
|
||||
simple methods that allow access to the additional data the
|
||||
operating system returned.
|
||||
|
||||
* It returns a generator instead of a list, so that ``scandir`` acts
|
||||
as a true iterator instead of returning the full list immediately.
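
The second difference can be seen in a small sketch: because ``scandir`` is an iterator, a caller can stop early without the whole directory listing being built first. The helper name ``first_entries`` is hypothetical, and the explicit ``close()`` call assumes Python 3.6 or later, where the ``scandir`` iterator gained that method.

```python
import itertools
import os


def first_entries(path, limit):
    """Return the names of at most `limit` entries from `path`."""
    it = os.scandir(path)
    try:
        # islice() stops pulling from the iterator after `limit` items,
        # so the remaining directory entries are never materialized.
        return [entry.name for entry in itertools.islice(it, limit)]
    finally:
        it.close()  # release the underlying directory handle promptly
```

By contrast, ``os.listdir(path)[:limit]`` always reads the full directory into a list before slicing.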
|
||||
|
@ -101,82 +106,146 @@ pseudo-directories are skipped, and the entries are yielded in
|
|||
system-dependent order. Each ``DirEntry`` object has the following
|
||||
attributes and methods:
|
||||
|
||||
* ``name``: the entry's filename, relative to ``path`` (corresponds to
|
||||
the return values of ``os.listdir``)
|
||||
* ``name``: the entry's filename, relative to the ``path`` argument
|
||||
(corresponds to the return values of ``os.listdir``)
|
||||
|
||||
* ``is_dir()``: like ``os.path.isdir()``, but requires no system calls
|
||||
on most systems (Linux, Windows, OS X)
|
||||
* ``full_name``: the entry's full path name -- the equivalent of
|
||||
``os.path.join(path, entry.name)``
|
||||
|
||||
* ``is_file()``: like ``os.path.isfile()``, but requires no system
|
||||
calls on most systems (Linux, Windows, OS X)
|
||||
* ``is_dir()``: like ``os.path.isdir()``, but much cheaper -- it never
|
||||
requires a system call on Windows, and usually doesn't on POSIX
|
||||
systems
|
||||
|
||||
* ``is_symlink()``: like ``os.path.islink()``, but requires no system
|
||||
calls on most systems (Linux, Windows, OS X)
|
||||
* ``is_file()``: like ``os.path.isfile()``, but much cheaper -- it
|
||||
never requires a system call on Windows, and usually doesn't on
|
||||
POSIX systems
|
||||
|
||||
* ``lstat()``: like ``os.lstat()``, but requires no system calls on
|
||||
Windows
|
||||
* ``is_symlink()``: like ``os.path.islink()``, but much cheaper -- it
|
||||
never requires a system call on Windows, and usually doesn't on
|
||||
POSIX systems
|
||||
|
||||
* ``lstat()``: like ``os.lstat()``, but much cheaper on some systems
|
||||
-- it only requires a system call on POSIX systems
|
||||
|
||||
The ``is_X`` methods may perform a ``stat()`` call under certain
|
||||
conditions (for example, on certain file systems on POSIX systems),
|
||||
and therefore possibly raise ``OSError``. The ``lstat()`` method will
|
||||
call ``stat()`` on POSIX systems and therefore also possibly raise
|
||||
``OSError``. See the "Notes on exception handling" section for more
|
||||
details.
|
||||
|
||||
The ``DirEntry`` attribute and method names were chosen to be the same
|
||||
as those in the new ``pathlib`` module for consistency.
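
A short sketch of how these attributes and methods fit together is shown below. It is written against ``os.scandir()`` as released in Python 3.5+, so it uses ``entry.path`` and ``entry.stat(follow_symlinks=False)`` where this draft says ``full_name`` and ``lstat()``. The helper name ``describe_entries`` is hypothetical.

```python
import os


def describe_entries(path):
    """Return sorted (name, full path, kind, size) tuples for `path`."""
    rows = []
    for entry in os.scandir(path):
        # Check is_symlink() first: is_dir() and is_file() follow symlinks
        # by default, so a link to a directory would also report is_dir().
        if entry.is_symlink():
            kind = 'symlink'
        elif entry.is_dir():
            kind = 'dir'
        elif entry.is_file():
            kind = 'file'
        else:
            kind = 'other'
        # Released counterpart of the draft's entry.lstat().
        size = entry.stat(follow_symlinks=False).st_size
        # entry.path is the released counterpart of the draft's full_name.
        rows.append((entry.name, entry.path, kind, size))
    return sorted(rows)
```

Any of the ``is_X()`` or ``stat()`` calls here may raise ``OSError``, so production code would wrap them as described in the exception handling notes.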
|
||||
|
||||
|
||||
Notes on caching
|
||||
----------------
|
||||
|
||||
The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
|
||||
is obviously always cached, and the ``is_X`` and ``lstat`` methods
|
||||
cache their values (immediately on Windows via ``FindNextFile``, and
|
||||
on first use on Linux / OS X via a ``stat`` call) and never refetch
|
||||
from the system.
|
||||
|
||||
For this reason, ``DirEntry`` objects are intended to be used and
|
||||
thrown away after iteration, not stored in long-lived data structures
|
||||
and the methods called again and again.
|
||||
|
||||
If a user wants to do that (for example, for watching a file's size
|
||||
change), they'll need to call the regular ``os.lstat()`` or
|
||||
``os.path.getsize()`` functions which force a new system call each
|
||||
time.
|
||||
Like the other functions in the ``os`` module, ``scandir()`` accepts
|
||||
either a bytes or str object for the ``path`` parameter, and returns
|
||||
the ``DirEntry.name`` and ``DirEntry.full_name`` attributes with the
|
||||
same type as ``path``. However, it is *strongly recommended* to use
|
||||
the str type, as this ensures cross-platform support for Unicode
|
||||
filenames.
|
||||
|
||||
|
||||
Examples
|
||||
========
|
||||
|
||||
Here's a good usage pattern for ``scandir``. This is in fact almost
|
||||
Below is a good usage pattern for ``scandir``. This is in fact almost
|
||||
exactly how the scandir module's faster ``os.walk()`` implementation
|
||||
uses it::
|
||||
|
||||
dirs = []
|
||||
non_dirs = []
|
||||
for entry in scandir(path):
|
||||
for entry in os.scandir(path):
|
||||
if entry.is_dir():
|
||||
dirs.append(entry)
|
||||
else:
|
||||
non_dirs.append(entry)
|
||||
|
||||
The above ``os.walk()``-like code will be significantly using scandir
|
||||
on both Windows and Linux or OS X.
|
||||
The above ``os.walk()``-like code will be significantly faster with
|
||||
scandir than ``os.listdir()`` and ``os.path.isdir()`` on both Windows
|
||||
and POSIX systems.
|
||||
|
||||
Or, for getting the total size of files in a directory tree -- showing
|
||||
use of the ``DirEntry.lstat()`` method::
|
||||
Or, for getting the total size of files in a directory tree, showing
|
||||
use of the ``DirEntry.lstat()`` method and ``DirEntry.full_name``
|
||||
attribute::
|
||||
|
||||
def get_tree_size(path):
|
||||
"""Return total size of files in path and subdirs."""
|
||||
size = 0
|
||||
for entry in scandir(path):
|
||||
total = 0
|
||||
for entry in os.scandir(path):
|
||||
if entry.is_dir():
|
||||
sub_path = os.path.join(path, entry.name)
|
||||
size += get_tree_size(sub_path)
|
||||
total += get_tree_size(entry.full_name)
|
||||
else:
|
||||
size += entry.lstat().st_size
|
||||
return size
|
||||
total += entry.lstat().st_size
|
||||
return total
|
||||
|
||||
Note that ``get_tree_size()`` will get a huge speed boost on Windows,
|
||||
because no extra stat calls are needed, but on Linux and OS X the size
|
||||
because no extra stat calls are needed, but on POSIX systems the size
|
||||
information is not returned by the directory iteration functions, so
|
||||
this function won't gain anything there.
|
||||
|
||||
|
||||
Notes on caching
|
||||
----------------
|
||||
|
||||
The ``DirEntry`` objects are relatively dumb -- the ``name`` and
|
||||
``full_name`` attributes are obviously always cached, and the ``is_X``
|
||||
and ``lstat`` methods cache their values (immediately on Windows via
|
||||
``FindNextFile``, and on first use on POSIX systems via a ``stat``
|
||||
call) and never refetch from the system.
|
||||
|
||||
For this reason, ``DirEntry`` objects are intended to be used and
|
||||
thrown away after iteration, not stored in long-lived data structures
|
||||
and the methods called again and again.
|
||||
|
||||
If developers want "refresh" behaviour (for example, for watching a
|
||||
file's size change), they can simply use ``pathlib.Path`` objects,
|
||||
or call the regular ``os.lstat()`` or ``os.path.getsize()`` functions
|
||||
which get fresh data from the operating system every call.
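
The caching behaviour can be observed with a small experiment, sketched here against the released Python 3.5+ API (``entry.stat(follow_symlinks=False)`` and ``entry.path`` rather than the draft's ``lstat()`` and ``full_name``). The helper name ``cached_and_fresh_size`` is hypothetical.

```python
import os


def cached_and_fresh_size(directory, name, extra=b'more'):
    """Grow a file and return (size before, cached size, fresh size)."""
    entries = {entry.name: entry for entry in os.scandir(directory)}
    entry = entries[name]
    # First call: fetches the stat data (if not already known) and caches it.
    before = entry.stat(follow_symlinks=False).st_size
    with open(entry.path, 'ab') as f:
        f.write(extra)
    # Second call: answered from the DirEntry's cache, so it is stale.
    cached = entry.stat(follow_symlinks=False).st_size
    # os.lstat() always asks the operating system again.
    fresh = os.lstat(entry.path).st_size
    return before, cached, fresh
```

The stale middle value is the reason ``DirEntry`` objects are meant to be used during iteration and then discarded.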
|
||||
|
||||
|
||||
Notes on exception handling
|
||||
---------------------------
|
||||
|
||||
``DirEntry.is_X()`` and ``DirEntry.lstat()`` are explicitly methods
|
||||
rather than attributes or properties, to make it clear that they may
|
||||
not be cheap operations, and they may do a system call. As a result,
|
||||
these methods may raise ``OSError``.
|
||||
|
||||
For example, ``DirEntry.lstat()`` will always make a system call on
|
||||
POSIX-based systems, and the ``DirEntry.is_X()`` methods will make a
|
||||
``stat()`` system call on such systems if ``readdir()`` returns a
|
||||
``d_type`` with a value of ``DT_UNKNOWN``, which can occur under
|
||||
certain conditions or on certain file systems.
|
||||
|
||||
For this reason, when a user requires fine-grained error handling,
|
||||
it's good to catch ``OSError`` around these method calls and then
|
||||
handle as appropriate.
|
||||
|
||||
For example, below is a version of the ``get_tree_size()`` example
|
||||
shown above, but with basic error handling added::
|
||||
|
||||
    def get_tree_size(path):
        """Return total size of files in path and subdirs. If
        is_dir() or lstat() fails, print an error message to stderr
        and assume zero size (for example, file has been deleted).
        """
        total = 0
        for entry in os.scandir(path):
            try:
                is_dir = entry.is_dir()
            except OSError as error:
                print('Error calling is_dir():', error, file=sys.stderr)
                continue
            if is_dir:
                total += get_tree_size(entry.full_name)
            else:
                try:
                    total += entry.lstat().st_size
                except OSError as error:
                    print('Error calling lstat():', error, file=sys.stderr)
        return total
|
||||
|
||||
|
||||
Support
|
||||
=======
|
||||
|
||||
|
@ -185,6 +254,10 @@ The scandir module on GitHub has been forked and used quite a bit (see
|
|||
direct support for a scandir-like function from core developers and
|
||||
others on the python-dev and python-ideas mailing lists. A sampling:
|
||||
|
||||
* **python-dev**: a good number of +1's and very few negatives for
|
||||
scandir and PEP 471 on `this June 2014 python-dev thread
|
||||
<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_
|
||||
|
||||
* **Nick Coghlan**, a core Python developer: "I've had the local Red
|
||||
Hat release engineering team express their displeasure at having to
|
||||
stat every file in a network mounted directory tree for info that is
|
||||
|
@ -225,9 +298,10 @@ specific os.scandir API would be a good thing."
|
|||
Use in the wild
|
||||
===============
|
||||
|
||||
To date, ``scandir`` is definitely useful, but has been clearly marked
|
||||
"beta", so it's uncertain how much use of it there is in the wild. Ben
|
||||
Hoyt has had several reports from people using it. For example:
|
||||
To date, the ``scandir`` implementation is definitely useful, but has
|
||||
been clearly marked "beta", so it's uncertain how much use of it there
|
||||
is in the wild. Ben Hoyt has had several reports from people using it.
|
||||
For example:
|
||||
|
||||
* Chris F: "I am processing some pretty large directories and was half
|
||||
expecting to have to modify getdents. So thanks for saving me the
|
||||
|
@ -250,12 +324,12 @@ created. See `PyPI package`_.
|
|||
|
||||
GitHub stats don't mean too much, but scandir does have several
|
||||
watchers, issues, forks, etc. Here's the run-down of the stats as
|
||||
of June 5, 2014:
|
||||
of July 7, 2014:
|
||||
|
||||
* Watchers: 17
|
||||
* Stars: 48
|
||||
* Forks: 15
|
||||
* Issues: 2 open, 19 closed
|
||||
* Stars: 57
|
||||
* Forks: 20
|
||||
* Issues: 4 open, 26 closed
|
||||
|
||||
**However, the much larger point is this:** if this PEP is accepted,
|
||||
``os.walk()`` can easily be reimplemented using ``scandir`` rather
|
||||
|
@ -266,53 +340,205 @@ production code that would benefit from this large speedup of
|
|||
of ``os.walk`` (194,000) as there are of ``os.mkdir`` (230,000).
|
||||
|
||||
|
||||
Open issues and optional things
|
||||
===============================
|
||||
|
||||
There are a few open issues or optional additions:
|
||||
Rejected ideas
|
||||
==============
|
||||
|
||||
|
||||
Should scandir be in its own module?
|
||||
Naming
|
||||
------
|
||||
|
||||
The only other real contender for this function's name was
|
||||
``iterdir()``. However, ``iterX()`` functions in Python (mostly found
|
||||
in Python 2) tend to be simple iterator equivalents of their
|
||||
non-iterator counterparts. For example, ``dict.iterkeys()`` is just an
|
||||
iterator version of ``dict.keys()``, but the objects returned are
|
||||
identical. In ``scandir()``'s case, however, the return values are
|
||||
quite different objects (``DirEntry`` objects vs filename strings), so
|
||||
this should probably be reflected by a difference in name -- hence
|
||||
``scandir()``.
|
||||
|
||||
See some `relevant discussion on python-dev
|
||||
<https://mail.python.org/pipermail/python-dev/2014-June/135228.html>`_.
|
||||
|
||||
|
||||
Wildcard support
|
||||
----------------
|
||||
|
||||
``FindFirstFile``/``FindNextFile`` on Windows support passing a
|
||||
"wildcard" like ``*.jpg``, so at first folks (this PEP's author
|
||||
included) felt it would be a good idea to include a
|
||||
``windows_wildcard`` keyword argument to the ``scandir`` function so
|
||||
users could pass this in.
|
||||
|
||||
However, on further thought and discussion it was decided that this
|
||||
would be a bad idea, *unless it could be made cross-platform* (a
|
||||
``pattern`` keyword argument or similar). This seems easy enough at
|
||||
first -- just use the OS wildcard support on Windows, and something
|
||||
like ``fnmatch`` or ``re`` afterwards on POSIX-based systems.
|
||||
|
||||
Unfortunately the exact Windows wildcard matching rules aren't really
|
||||
documented anywhere by Microsoft, and they're quite quirky (see this
|
||||
`blog post
|
||||
<http://blogs.msdn.com/b/oldnewthing/archive/2007/12/17/6785519.aspx>`_),
|
||||
meaning it's very problematic to emulate using ``fnmatch`` or regexes.
|
||||
|
||||
So the consensus was that Windows wildcard support was a bad idea.
|
||||
It would be possible to add at a later date if there's a
|
||||
cross-platform way to achieve it, but not for the initial version.
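
In the meantime, name filtering can be layered on top of ``scandir`` in user code. The sketch below uses ``fnmatch``, so it follows ``fnmatch``'s shell-style rules (including its platform-dependent case handling) and makes no attempt to reproduce the Windows wildcard rules. The helper name ``scandir_matching`` is hypothetical and not part of this proposal.

```python
import fnmatch
import os


def scandir_matching(path, pattern):
    """Yield DirEntry objects in `path` whose names match `pattern`."""
    for entry in os.scandir(path):
        # fnmatch.fnmatch() normalizes case the way the host OS does;
        # use fnmatch.fnmatchcase() for a strictly case-sensitive match.
        if fnmatch.fnmatch(entry.name, pattern):
            yield entry
```

For example, ``[e.name for e in scandir_matching(path, '*.jpg')]`` collects the matching names while still giving access to the cached ``DirEntry`` data for each match.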
|
||||
|
||||
Read more on `this Nov 2012 python-ideas thread
|
||||
<https://mail.python.org/pipermail/python-ideas/2012-November/017770.html>`_
|
||||
and this `June 2014 python-dev thread on PEP 471
|
||||
<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_.
|
||||
|
||||
|
||||
DirEntry attributes being properties
|
||||
------------------------------------
|
||||
|
||||
Should the function be included in the standard library in a new
|
||||
module, ``scandir.scandir()``, or just as ``os.scandir()`` as
|
||||
discussed? The preference of this PEP's author (Ben Hoyt) would be
|
||||
``os.scandir()``, as it's just a single function.
|
||||
In some ways it would be nicer for the ``DirEntry`` ``is_X()`` and
|
||||
``lstat()`` to be properties instead of methods, to indicate they're
|
||||
very cheap or free. However, this isn't quite the case, as ``lstat()``
|
||||
will require an OS call on POSIX-based systems but not on Windows.
|
||||
Even ``is_dir()`` and friends may perform an OS call on POSIX-based
|
||||
systems if the ``dirent.d_type`` value is ``DT_UNKNOWN`` (on certain
|
||||
file systems).
|
||||
|
||||
Also, people would expect the attribute access ``entry.is_dir`` to
|
||||
only ever raise ``AttributeError``, not ``OSError`` in the case it
|
||||
makes a system call under the covers. Calling code would have to have
|
||||
a ``try``/``except`` around what looks like a simple attribute access,
|
||||
and so it's much better to make them *methods*.
|
||||
|
||||
See `this May 2013 python-dev thread
|
||||
<https://mail.python.org/pipermail/python-dev/2013-May/126184.html>`_
|
||||
where this PEP's author makes this case and there's agreement from
|
||||
core developers.
|
||||
|
||||
|
||||
Should there be a way to access the full path?
|
||||
----------------------------------------------
|
||||
DirEntry fields being "static" attribute-only objects
|
||||
-----------------------------------------------------
|
||||
|
||||
Should ``DirEntry``'s have a way to get the full path without using
|
||||
``os.path.join(path, entry.name)``? This is a pretty common pattern,
|
||||
and it may be useful to add pathlib-like ``str(entry)`` functionality.
|
||||
This functionality has also been requested in `issue 13`_ on GitHub.
|
||||
In `this July 2014 python-dev message
|
||||
<https://mail.python.org/pipermail/python-dev/2014-July/135303.html>`_,
|
||||
Paul Moore suggested a solution that was a "thin wrapper round the OS
|
||||
feature", where the ``DirEntry`` object had only static attributes:
|
||||
``name``, ``full_name``, and ``is_X``, with the ``st_X`` attributes
|
||||
only present on Windows. The idea was to use this simpler, lower-level
|
||||
function as a building block for higher-level functions.
|
||||
|
||||
.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
|
||||
At first there was general agreement that simplifying in this way was
|
||||
a good thing. However, there were two problems with this approach.
|
||||
First, the assumption is that the ``is_dir`` and similar attributes are
|
||||
always present on POSIX, which isn't the case (if ``d_type`` is not
|
||||
present or is ``DT_UNKNOWN``). Second, it's a much harder-to-use API
|
||||
in practice, as even the ``is_dir`` attributes aren't always present
|
||||
on POSIX, and would need to be tested with ``hasattr()`` and then
|
||||
``os.stat()`` called if they weren't present.
|
||||
|
||||
See `this July 2014 python-dev response
|
||||
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
|
||||
from this PEP's author detailing why this option is a non-ideal
|
||||
solution, and the subsequent reply from Paul Moore voicing agreement.
|
||||
|
||||
|
||||
Should it expose Windows wildcard functionality?
|
||||
------------------------------------------------
|
||||
DirEntry fields being static with an ensure_lstat option
|
||||
--------------------------------------------------------
|
||||
|
||||
Should ``scandir()`` have a way of exposing the wildcard functionality
|
||||
in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The
|
||||
scandir module on GitHub exposes this as a ``windows_wildcard``
|
||||
keyword argument, allowing Windows power users the option to pass a
|
||||
custom wildcard to ``FindFirstFile``, which may avoid the need to use
|
||||
``fnmatch`` or similar on the resulting names. It is named the
|
||||
unwieldly ``windows_wildcard`` to remind you you're writing power-
|
||||
user, Windows-only code if you use it.
|
||||
Another seemingly simpler and attractive option was suggested by
|
||||
Nick Coghlan in this `June 2014 python-dev message
|
||||
<https://mail.python.org/pipermail/python-dev/2014-June/135261.html>`_:
|
||||
make ``DirEntry.is_X`` and ``DirEntry.lstat_result`` properties, and
|
||||
populate ``DirEntry.lstat_result`` at iteration time, but only if
|
||||
the new argument ``ensure_lstat=True`` was specified on the
|
||||
``scandir()`` call.
|
||||
|
||||
This boils down to whether ``scandir`` should be about exposing all of
|
||||
the system's directory iteration features, or simply providing a fast,
|
||||
simple, cross-platform directory iteration API.
|
||||
This does have the advantage over the above in that you can easily get
|
||||
the stat result from ``scandir()`` if you need it. However, it has the
|
||||
serious disadvantage that fine-grained error handling is messy,
|
||||
because ``stat()`` will be called (and hence potentially raise
|
||||
``OSError``) during iteration, leading to a rather ugly, hand-made
|
||||
iteration loop::
|
||||
|
||||
This PEP's author votes for not including ``windows_wildcard`` in the
|
||||
standard library version, because even though it could be useful in
|
||||
rare cases (say the Windows Dropbox client?), it'd be too easy to use
|
||||
it just because you're a Windows developer, and create code that is
|
||||
not cross-platform.
|
||||
    it = os.scandir(path)
    while True:
        try:
            entry = next(it)
        except OSError as error:
            handle_error(path, error)
        except StopIteration:
            break
|
||||
|
||||
Or it means that ``scandir()`` would have to accept an ``onerror``
|
||||
argument -- a function to call when ``stat()`` errors occur during
|
||||
iteration. This seems to this PEP's author neither as direct nor as
|
||||
Pythonic as ``try``/``except`` around a ``DirEntry.lstat()`` call.
|
||||
|
||||
See `Ben Hoyt's July 2014 reply
|
||||
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
|
||||
to the discussion summarizing this and detailing why he thinks the
|
||||
original PEP 471 proposal is "the right one" after all.
|
||||
|
||||
|
||||
Return values being (name, stat_result) two-tuples
|
||||
--------------------------------------------------
|
||||
|
||||
Initially this PEP's author proposed this concept as a function called
|
||||
``iterdir_stat()`` which yielded two-tuples of (name, stat_result).
|
||||
This does have the advantage that there are no new types introduced.
|
||||
However, the ``stat_result`` is only partially filled on POSIX-based
|
||||
systems (most fields set to ``None`` and other quirks), so they're not
|
||||
really ``stat_result`` objects at all, and this would have to be
|
||||
thoroughly documented as different from ``os.stat()``.
|
||||
|
||||
Also, Python has good support for proper objects with attributes and
|
||||
methods, which makes for a saner and simpler API than two-tuples. It
|
||||
also makes the ``DirEntry`` objects more extensible and future-proof
|
||||
as operating systems add functionality and we want to include this in
|
||||
``DirEntry``.
|
||||
|
||||
See also some previous discussion:
|
||||
|
||||
* `May 2013 python-dev thread
|
||||
<https://mail.python.org/pipermail/python-dev/2013-May/126148.html>`_
|
||||
where Nick Coghlan makes the original case for a ``DirEntry``-style
|
||||
object.
|
||||
|
||||
* `June 2014 python-dev thread
|
||||
<https://mail.python.org/pipermail/python-dev/2014-June/135244.html>`_
|
||||
where Nick Coghlan makes (another) good case against the two-tuple
|
||||
approach.
|
||||
|
||||
|
||||
Return values being overloaded stat_result objects
|
||||
--------------------------------------------------
|
||||
|
||||
Another alternative discussed was making the return values
|
||||
overloaded ``stat_result`` objects with ``name`` and ``full_name``
|
||||
attributes. However, apart from this being a strange (and strained!)
|
||||
kind of overloading, this has the same problems mentioned above --
|
||||
most of the ``stat_result`` information is not fetched by
|
||||
``readdir()`` on POSIX systems, only (part of) the ``st_mode`` value.
|
||||
|
||||
|
||||
Return values being pathlib.Path objects
|
||||
----------------------------------------
|
||||
|
||||
With Antoine Pitrou's new standard library ``pathlib`` module, it
|
||||
at first seems like a great idea for ``scandir()`` to return instances
|
||||
of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and
|
||||
``lstat()`` functions are explicitly not cached, whereas ``scandir``
|
||||
has to cache them by design, because it's (often) returning values
|
||||
from the original directory iteration system call.
|
||||
|
||||
And if the ``pathlib.Path`` instances returned by ``scandir`` cached
|
||||
lstat values, but the ordinary ``pathlib.Path`` objects explicitly
|
||||
don't, that would be more than a little confusing.
|
||||
|
||||
Guido van Rossum explicitly rejected ``pathlib.Path`` caching lstat in
|
||||
the context of scandir `here
|
||||
<https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_,
|
||||
making ``pathlib.Path`` objects a bad choice for scandir return
|
||||
values.
|
||||
|
||||
|
||||
Possible improvements
|
||||
|
@ -328,6 +554,12 @@ here is a short list of some this PEP's author has in mind:
|
|||
was suggested on Issue 11406 by Antoine Pitrou.
|
||||
[`source9 <http://bugs.python.org/msg130125>`_]
|
||||
|
||||
* scandir could use a free list to avoid the cost of memory allocation
|
||||
for each iteration -- a short free list of 10 or maybe even 1 may help.
|
||||
Suggested by Victor Stinner on a `python-dev thread on June 27`_.
|
||||
|
||||
.. _`python-dev thread on June 27`: https://mail.python.org/pipermail/python-dev/2014-June/135232.html
|
||||
|
||||
|
||||
Previous discussion
|
||||
===================
|
||||
|
@ -342,9 +574,12 @@ Previous discussion
|
|||
``scandir()`` API, including Nick Coghlan's suggestion of scandir
|
||||
yielding ``DirEntry``-like objects
|
||||
|
||||
* `Final thread Ben Hoyt started on python-dev`_ to discuss the
|
||||
* `Another thread Ben Hoyt started on python-dev`_ to discuss the
|
||||
interaction between scandir and the new ``pathlib`` module
|
||||
|
||||
* `Final thread Ben Hoyt started on python-dev`_ to discuss the first
|
||||
version of this PEP, with extensive discussion about the API.
|
||||
|
||||
* `Question on StackOverflow`_ about why ``os.walk()`` is slow and
|
||||
pointers on how to fix it (this inspired the author of this PEP
|
||||
early on)
|
||||
|
@ -354,7 +589,8 @@ Previous discussion
|
|||
|
||||
.. _`Original thread Ben Hoyt started on python-ideas`: https://mail.python.org/pipermail/python-ideas/2012-November/017770.html
|
||||
.. _`Further thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-May/126119.html
|
||||
.. _`Final thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-November/130572.html
|
||||
.. _`Another thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2013-November/130572.html
|
||||
.. _`Final thread Ben Hoyt started on python-dev`: https://mail.python.org/pipermail/python-dev/2014-June/135215.html
|
||||
.. _`Question on StackOverflow`: http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder
|
||||
.. _`BetterWalk`: https://github.com/benhoyt/betterwalk
|
||||
|
||||