PEP: 234
Title: Iterators
Author: Ka-Ping Yee <ping@zesty.ca>, Guido van Rossum <guido@python.org>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Jan-2001
Python-Version: 2.1
Post-History: 30-Apr-2001


Abstract
========

This document proposes an iteration interface that objects can provide to
control the behaviour of ``for`` loops. Looping is customized by providing a
method that produces an iterator object. The iterator provides a *get next
value* operation that produces the next item in the sequence each time it is
called, raising an exception when no more items are available.

In addition, specific iterators over the keys of a dictionary and over the
lines of a file are proposed, and a proposal is made to allow spelling
``dict.has_key(key)`` as ``key in dict``.

Note: this is an almost complete rewrite of this PEP by the second author,
describing the actual implementation checked into the trunk of the Python 2.2
CVS tree. It is still open for discussion. Some of the more esoteric
proposals in the original version of this PEP have been withdrawn for now;
these may be the subject of a separate PEP in the future.
|
|
|
|
|
|
C API Specification
===================

A new exception is defined, ``StopIteration``, which can be used to signal the
end of an iteration.

A new slot named ``tp_iter`` for requesting an iterator is added to the type
object structure. This should be a function of one ``PyObject *`` argument
returning a ``PyObject *``, or ``NULL``. To use this slot, a new C API
function ``PyObject_GetIter()`` is added, with the same signature as the
``tp_iter`` slot function.

Another new slot, named ``tp_iternext``, is added to the type structure, for
obtaining the next value in the iteration. To use this slot, a new C API
function ``PyIter_Next()`` is added. The signature for both the slot and the
API function is as follows, although the ``NULL`` return conditions differ:
the argument is a ``PyObject *`` and so is the return value. When the return
value is non-``NULL``, it is the next value in the iteration. When it is
``NULL``, then for the ``tp_iternext`` slot there are three possibilities:

- No exception is set; this implies the end of the iteration.

- The ``StopIteration`` exception (or a derived exception class) is set; this
  implies the end of the iteration.

- Some other exception is set; this means that an error occurred that should be
  propagated normally.

The higher-level ``PyIter_Next()`` function clears the ``StopIteration``
exception (or derived exception) when it occurs, so its ``NULL`` return
conditions are simpler:

- No exception is set; this means iteration has ended.

- Some exception is set; this means an error occurred that should be
  propagated normally.

Iterators implemented in C should *not* implement a ``next()`` method with
semantics similar to the ``tp_iternext`` slot! When the type's dictionary is
initialized (by ``PyType_Ready()``), the presence of a ``tp_iternext`` slot
causes a method ``next()`` wrapping that slot to be added to the type's
``tp_dict``. (Exception: if the type doesn't use ``PyObject_GenericGetAttr()``
to access instance attributes, the ``next()`` method in the type's ``tp_dict``
may not be seen.) (Due to a misunderstanding in the original text of this PEP,
in Python 2.2, all iterator types implemented a ``next()`` method that was
overridden by the wrapper; this has been fixed in Python 2.3.)

To ensure binary backwards compatibility, a new flag ``Py_TPFLAGS_HAVE_ITER``
is added to the set of flags in the ``tp_flags`` field, and to the default
flags macro. This flag must be tested before accessing the ``tp_iter`` or
``tp_iternext`` slots. The macro ``PyIter_Check()`` tests whether an object
has the appropriate flag set and has a non-``NULL`` ``tp_iternext`` slot.
There is no such macro for the ``tp_iter`` slot (since the only place where
this slot is referenced should be ``PyObject_GetIter()``, and this can check
for the ``Py_TPFLAGS_HAVE_ITER`` flag directly).

(Note: the ``tp_iter`` slot can be present on any object; the ``tp_iternext``
slot should only be present on objects that act as iterators.)

For backwards compatibility, the ``PyObject_GetIter()`` function implements
fallback semantics when its argument is a sequence that does not implement a
``tp_iter`` function: a lightweight sequence iterator object is constructed in
that case which iterates over the items of the sequence in the natural order.
|
|
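The sequence fallback described above can be observed from Python. The
following is a minimal sketch, assuming a hypothetical ``Squares`` class (not
part of this PEP) that implements only ``__getitem__()``. ``iter()`` still
produces a working iterator, which indexes from 0 upward until ``IndexError``
is raised.

```python
class Squares:
    """Hypothetical old-style sequence: supports indexing only, no __iter__()."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        # Raising IndexError marks the end of the sequence.
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index * index


# iter() falls back to a sequence iterator that calls __getitem__(0),
# __getitem__(1), ... in order.
collected = list(iter(Squares(4)))
```

Because the fallback lives in ``iter()`` itself, existing sequence-like
classes need no changes to keep working in ``for`` loops.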
|
|
The Python bytecode generated for ``for`` loops is changed to use new opcodes,
``GET_ITER`` and ``FOR_ITER``, that use the iterator protocol rather than the
sequence protocol to get the next value for the loop variable. This makes it
possible to use a ``for`` loop to loop over non-sequence objects that support
the ``tp_iter`` slot. Other places where the interpreter loops over the values
of a sequence should also be changed to use iterators.
|
|
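In Python terms, the effect of the new opcodes can be approximated as follows.
This is only an illustrative sketch, not the actual bytecode: the helper name
``for_each`` is hypothetical, and the built-in ``next()`` used to fetch a value
exists only in later Python versions (this PEP itself specifies a ``next()``
method on the iterator).

```python
def for_each(iterable, body):
    """Hypothetical helper mimicking `for x in iterable: body(x)`."""
    it = iter(iterable)          # roughly what GET_ITER does
    while True:
        try:
            x = next(it)         # roughly what FOR_ITER does
        except StopIteration:
            break                # end of iteration: leave the loop
        body(x)                  # errors raised by the body propagate normally


seen = []
for_each("abc", seen.append)
```

Only the fetch of the next value sits inside the ``try`` block, so an
exception raised by the loop body is never mistaken for the end of the
iteration.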
|
|
Iterators ought to implement the ``tp_iter`` slot as returning a reference to
themselves; this is needed to make it possible to use an iterator (as opposed
to a sequence) in a ``for`` loop.

Iterator implementations (in C or in Python) should guarantee that once the
iterator has signalled its exhaustion, subsequent calls to ``tp_iternext`` or
to the ``next()`` method will continue to do so. It is not specified whether
an iterator should enter the exhausted state when an exception (other than
``StopIteration``) is raised. Note that Python cannot guarantee that
user-defined or third-party iterators implement this requirement correctly.
|
|
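A minimal sketch of the exhaustion guarantee, using a built-in list iterator
and the built-in ``next()`` of later Python versions (an assumption relative
to this PEP, which specifies a ``next()`` method). The helper ``is_exhausted``
is hypothetical. Once the iterator has signalled exhaustion it keeps doing so,
even if the underlying list grows afterwards.

```python
items = [1, 2]
it = iter(items)
consumed = [next(it), next(it)]


def is_exhausted(iterator):
    """Return True if fetching a value raises StopIteration."""
    try:
        next(iterator)
    except StopIteration:
        return True
    return False


first_check = is_exhausted(it)    # the list has no more items
items.append(3)                   # growing the list does not revive the iterator
second_check = is_exhausted(it)
```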
|
|
|
|
Python API Specification
========================

The ``StopIteration`` exception is made visible as one of the standard
exceptions. It is derived from ``Exception``.

A new built-in function is defined, ``iter()``, which can be called in two
ways:

- ``iter(obj)`` calls ``PyObject_GetIter(obj)``.

- ``iter(callable, sentinel)`` returns a special kind of iterator that calls
  the callable to produce a new value, and compares the return value to the
  sentinel value. If the return value equals the sentinel, this signals the
  end of the iteration and ``StopIteration`` is raised rather than returning
  normally; if the return value does not equal the sentinel, it is returned as
  the next value from the iterator. If the callable raises an exception, this
  is propagated normally; in particular, the function is allowed to raise
  ``StopIteration`` as an alternative way to end the iteration. (This
  functionality is available from the C API as
  ``PyCallIter_New(callable, sentinel)``.)
|
|
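As a small illustration of the two-argument form, the sketch below reads lines
through ``readline()`` until it returns the empty string. ``io.StringIO`` is
used here only as an in-memory stand-in for an open text file; it is an
assumption of this example, not something the PEP specifies.

```python
import io

stream = io.StringIO("alpha\nbeta\n")   # stand-in for an open text file

# readline() returns "" at end of input, so "" serves as the sentinel.
lines = list(iter(stream.readline, ""))
```

The sentinel itself is never produced by the iterator; it only marks the end.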
|
|
Iterator objects returned by either form of ``iter()`` have a ``next()``
method. This method either returns the next value in the iteration, or raises
``StopIteration`` (or a derived exception class) to signal the end of the
iteration. Any other exception should be considered to signify an error and
should be propagated normally, not taken to mean the end of the iteration.

Classes can define how they are iterated over by defining an ``__iter__()``
method; this should take no additional arguments and return a valid iterator
object. A class that wants to be an iterator should implement two methods: a
``next()`` method that behaves as described above, and an ``__iter__()`` method
that returns ``self``.
|
|
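A class-based iterator along these lines might look as follows. This is a
sketch under stated assumptions: the ``Countdown`` class is hypothetical, and
because later Python versions look up ``__next__()`` rather than ``next()``,
the method is aliased under both names so the example runs there too.

```python
class Countdown:
    """Hypothetical iterator counting down from `start` to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself, so it can be used directly in a for loop.
        return self

    def next(self):
        # Spelling used by this PEP.
        if self.current <= 0:
            raise StopIteration   # stays true on every later call
        value = self.current
        self.current -= 1
        return value

    # Later Python versions call __next__() instead; aliasing keeps the
    # sketch usable there as well.
    __next__ = next


result = [n for n in Countdown(3)]
```

Since ``current`` never rises above zero again, the iterator keeps raising
``StopIteration`` once exhausted, as the C API section requires.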
|
|
The two methods correspond to two distinct protocols:

1. An object can be iterated over with ``for`` if it implements ``__iter__()``
   or ``__getitem__()``.

2. An object can function as an iterator if it implements ``next()``.

Container-like objects usually support protocol 1. Iterators are currently
required to support both protocols. The semantics of iteration come only from
protocol 2; protocol 1 is present to make iterators behave like sequences, in
particular so that code receiving an iterator can use a for-loop over the
iterator.
|
|
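The difference between the two protocols can be sketched as follows. The
``Deck`` class is a hypothetical container that supports protocol 1 only; the
iterator it hands out supports both protocols but can be traversed just once.

```python
class Deck:
    """Hypothetical container: supports protocol 1 only."""

    def __init__(self, cards):
        self.cards = list(cards)

    def __iter__(self):
        # Hand out a fresh, independent iterator on every call.
        return iter(self.cards)


deck = Deck(["ace", "king"])
first_pass = list(deck)
second_pass = list(deck)      # the container can be traversed again

it = iter(deck)               # an iterator supports both protocols...
once = list(it)
again = list(it)              # ...but is used up after one traversal
```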
|
|
|
|
Dictionary Iterators
====================

- Dictionaries implement a ``sq_contains`` slot that implements the same test
  as the ``has_key()`` method. This means that we can write

  ::

      if k in dict: ...

  which is equivalent to

  ::

      if dict.has_key(k): ...

- Dictionaries implement a ``tp_iter`` slot that returns an efficient iterator
  that iterates over the keys of the dictionary. During such an iteration, the
  dictionary should not be modified, except that setting the value for an
  existing key is allowed (deletions or additions are not, nor is the
  ``update()`` method). This means that we can write

  ::

      for k in dict: ...

  which is equivalent to, but much faster than

  ::

      for k in dict.keys(): ...

  as long as the restriction on modifications to the dictionary (either by the
  loop or by another thread) is not violated.

- Add methods to dictionaries that return different kinds of iterators
  explicitly::

      for key in dict.iterkeys(): ...

      for value in dict.itervalues(): ...

      for key, value in dict.iteritems(): ...

  This means that ``for x in dict`` is shorthand for
  ``for x in dict.iterkeys()``.

Other mappings, if they support iterators at all, should also iterate over the
keys. However, this should not be taken as an absolute rule; specific
applications may have different requirements.
|
|
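The dictionary behaviour described in this section can be sketched as below.
The example avoids ``has_key()`` and the ``iter*()`` methods so that it also
runs on later Python versions, where those methods no longer exist. The
``RuntimeError`` on a size change is what CPython raises; the PEP itself only
says that such modifications are not allowed.

```python
ages = {"ann": 31, "bob": 27}

has_ann = "ann" in ages            # membership tests the keys

# Iterating over a dictionary yields its keys; assigning to an existing
# key during the loop is permitted.
for name in ages:
    ages[name] += 1

# Adding (or deleting) a key during iteration is not permitted; CPython
# detects the size change and raises RuntimeError.
try:
    for name in ages:
        ages[name + "_copy"] = 0
except RuntimeError as exc:
    error = exc
else:
    error = None
```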
|
|
|
|
File Iterators
==============

The following proposal is useful because it provides us with a good answer to
the complaint that the common idiom to iterate over the lines of a file is ugly
and slow.

- Files implement a ``tp_iter`` slot that is equivalent to
  ``iter(f.readline, "")``. This means that we can write

  ::

      for line in file:
          ...

  as a shorthand for

  ::

      for line in iter(file.readline, ""):
          ...

  which is equivalent to, but faster than

  ::

      while 1:
          line = file.readline()
          if not line:
              break
          ...

This also shows that some iterators are destructive: they consume all the
values and a second iterator cannot easily be created that iterates
independently over the same values. You could open the file for a second time,
or ``seek()`` to the beginning, but these solutions don't work for all file
types, e.g. they don't work when the open file object really represents a pipe
or a stream socket.

Because the file iterator uses an internal buffer, mixing this with other file
operations (e.g. ``file.readline()``) doesn't work right. Also, the following
code::

    for line in file:
        if line == "\n":
            break
    for line in file:
        print line,

doesn't work as you might expect, because the iterator created by the second
for-loop doesn't take the buffer read-ahead by the first for-loop into account.
A correct way to write this is::

    it = iter(file)
    for line in it:
        if line == "\n":
            break
    for line in it:
        print line,

(The rationale for these restrictions is that ``for line in file`` ought to
become the recommended, standard way to iterate over the lines of a file, and
this should be as fast as can be. The iterator version is considerably faster
than calling ``readline()``, due to the internal buffer in the iterator.)
|
|
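The equivalence and the shared-iterator pattern from this section can be tried
out with an in-memory file. This is a sketch that assumes ``io.StringIO`` as a
stand-in for a real file object; in later Python versions a file object is its
own iterator, so the read-ahead pitfall with two separate loops may not
reproduce there, but sharing one explicit iterator remains the clearest way to
express the intent.

```python
import io

text = "header\n\nbody 1\nbody 2\n"

# for-loop over the file object versus the explicit sentinel form
direct = [line for line in io.StringIO(text)]
via_sentinel = list(iter(io.StringIO(text).readline, ""))

# Sharing one iterator between two loops: the second loop resumes
# where the first one stopped.
it = iter(io.StringIO(text))
for line in it:
    if line == "\n":
        break
rest = [line for line in it]
```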
|
|
|
|
Rationale
=========

If all the parts of the proposal are included, this addresses many concerns in
a consistent and flexible fashion. Among its chief virtues are the following
four -- no, five -- no, six -- points:

1. It provides an extensible iterator interface.

2. It allows performance enhancements to list iteration.

3. It allows big performance enhancements to dictionary iteration.

4. It allows one to provide an interface for just iteration without pretending
   to provide random access to elements.

5. It is backward-compatible with all existing user-defined classes and
   extension objects that emulate sequences and mappings, even mappings that
   only implement a subset of {``__getitem__``, ``keys``, ``values``,
   ``items``}.

6. It makes code iterating over non-sequence collections more concise and
   readable.
|
|
|
|
|
|
Resolved Issues
===============

The following topics have been decided by consensus or BDFL pronouncement.

- Two alternative spellings for ``next()`` have been proposed but rejected:
  ``__next__()``, because it corresponds to a type object slot
  (``tp_iternext``); and ``__call__()``, because this is the only operation.

  Arguments against ``__next__()``: while many iterators are used in for loops,
  it is expected that user code will also call ``next()`` directly, so having
  to write ``__next__()`` is ugly; also, a possible extension of the protocol
  would be to allow for ``prev()``, ``current()`` and ``reset()`` operations;
  surely we don't want to use ``__prev__()``, ``__current__()``,
  ``__reset__()``.

  Arguments against ``__call__()`` (the original proposal): taken out of
  context, ``x()`` is not very readable, while ``x.next()`` is clear; there's a
  danger that every special-purpose object wants to use ``__call__()`` for its
  most common operation, causing more confusion than clarity.

  (In retrospect, it might have been better to go for ``__next__()`` and have a
  new built-in, ``next(it)``, which calls ``it.__next__()``. But alas, it's too
  late; this has been deployed in Python 2.2 since December 2001.)

- Some folks have requested the ability to restart an iterator. This should be
  dealt with by calling ``iter()`` on a sequence repeatedly, not by the
  iterator protocol itself. (See also requested extensions below.)

- It has been questioned whether an exception to signal the end of the
  iteration isn't too expensive. Several alternatives for the
  ``StopIteration`` exception have been proposed: a special value ``End`` to
  signal the end, a function ``end()`` to test whether the iterator is
  finished, even reusing the ``IndexError`` exception.

  - A special value has the problem that if a sequence ever contains that
    special value, a loop over that sequence will end prematurely without any
    warning. If the experience with null-terminated C strings hasn't taught us
    the problems this can cause, imagine the trouble a Python introspection
    tool would have iterating over a list of all built-in names, assuming that
    the special ``End`` value was a built-in name!

  - Calling an ``end()`` function would require two calls per iteration. Two
    calls are much more expensive than one call plus a test for an exception.
    The time-critical for loop in particular can test very cheaply for an
    exception.

  - Reusing ``IndexError`` can cause confusion because it can be a genuine
    error, which would be masked by ending the loop prematurely.
|
|
|
|
- Some have asked for a standard iterator type. Presumably all iterators would
  have to be derived from this type. But this is not the Python way:
  dictionaries are mappings because they support ``__getitem__()`` and a
  handful of other operations, not because they are derived from an abstract
  mapping type.

- Regarding ``if key in dict``: there is no doubt that the ``dict.has_key(x)``
  interpretation of ``x in dict`` is by far the most useful interpretation,
  probably the only useful one. There has been resistance against this because
  ``x in list`` checks whether *x* is present among the values, while the
  proposal makes ``x in dict`` check whether *x* is present among the keys.
  Given that the symmetry between lists and dictionaries is very weak, this
  argument does not have much weight.

- The name ``iter()`` is an abbreviation. Alternatives proposed include
  ``iterate()``, ``traverse()``, but these appear too long. Python has a
  history of using abbrs for common builtins, e.g. ``repr()``, ``str()``,
  ``len()``.

  Resolution: ``iter()`` it is.

- Using the same name for two different operations (getting an iterator from an
  object and making an iterator for a function with a sentinel value) is
  somewhat ugly. I haven't seen a better name for the second operation though,
  and since they both return an iterator, it's easy to remember.

  Resolution: the builtin ``iter()`` takes an optional argument, which is the
  sentinel to look for.

- Once a particular iterator object has raised ``StopIteration``, will it also
  raise ``StopIteration`` on all subsequent ``next()`` calls? Some say that it
  would be useful to require this, others say that it is useful to leave this
  open to individual iterators. Note that this may require an additional state
  bit for some iterator implementations (e.g. function-wrapping iterators).

  Resolution: once ``StopIteration`` is raised, calling ``it.next()`` continues
  to raise ``StopIteration``.

  Note: this was in fact not implemented in Python 2.2; there are many cases
  where an iterator's ``next()`` method can raise ``StopIteration`` on one call
  but not on the next. This has been remedied in Python 2.3.

- It has been proposed that a file object should be its own iterator, with a
  ``next()`` method returning the next line. This has certain advantages, and
  makes it even clearer that this iterator is destructive. The disadvantage is
  that this would make it even more painful to implement the "sticky
  StopIteration" feature proposed in the previous bullet.

  Resolution: tentatively rejected (though there are still people arguing for
  this).

- Some folks have requested extensions of the iterator protocol, e.g.
  ``prev()`` to get the previous item, ``current()`` to get the current item
  again, ``finished()`` to test whether the iterator is finished, and maybe
  even others, like ``rewind()``, ``__len__()``, ``position()``.

  While some of these are useful, many of these cannot easily be implemented
  for all iterator types without adding arbitrary buffering, and sometimes they
  can't be implemented at all (or not reasonably). E.g. anything to do with
  reversing directions can't be done when iterating over a file or function.
  Maybe a separate PEP can be drafted to standardize the names for such
  operations when they are implementable.

  Resolution: rejected.
|
|
|
|
- There has been a long discussion about whether

  ::

      for x in dict: ...

  should assign *x* the successive keys, values, or items of the dictionary.
  The symmetry between ``if x in y`` and ``for x in y`` suggests that it should
  iterate over keys. This symmetry has been observed by many independently and
  has even been used to "explain" one using the other. This is because for
  sequences, ``if x in y`` iterates over *y* comparing the iterated values to
  *x*. If we adopt both of the above proposals, this will also hold for
  dictionaries.

  The argument against making ``for x in dict`` iterate over the keys comes
  mostly from a practicality point of view: scans of the standard library show
  that there are about as many uses of ``for x in dict.items()`` as there are
  of ``for x in dict.keys()``, with the ``items()`` version having a small
  majority. Presumably many of the loops using ``keys()`` use the
  corresponding value anyway, by writing ``dict[x]``, so (the argument goes) by
  making both the key and value available, we could support the largest number
  of cases. While this is true, I (Guido) find the correspondence between
  ``for x in dict`` and ``if x in dict`` too compelling to break, and there's
  not much overhead in having to write ``dict[x]`` to explicitly get the value.

  For fast iteration over items, use ``for key, value in dict.iteritems()``.
  I've timed the difference between

  ::

      for key in dict: dict[key]

  and

  ::

      for key, value in dict.iteritems(): pass

  and found that the latter is only about 7% faster.

  Resolution: By BDFL pronouncement, ``for x in dict`` iterates over the keys,
  and dictionaries have ``iteritems()``, ``iterkeys()``, and ``itervalues()``
  to return the different flavors of dictionary iterators.
|
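The objection to a special end value, discussed among the resolved issues
above, can be made concrete with a short sketch. The names ``END`` and
``take_until_end`` are hypothetical and exist only for this illustration.

```python
END = "End"   # hypothetical special value marking the end of iteration


def take_until_end(values):
    """Sentinel-style loop: stops at the first END it sees."""
    out = []
    for v in values:
        if v == END:
            break
        out.append(v)
    return out


names = ["len", "End", "repr"]            # data that happens to contain END
with_sentinel = take_until_end(names + [END])
with_exception = list(iter(names))        # StopIteration-based iteration
```

The sentinel-style loop silently drops everything after the first ``"End"``,
while the exception-based protocol places no restriction on the values that
may appear in the data.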
|
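To illustrate why the requested ``prev()`` extension was rejected as part of
the protocol, here is a sketch of what it costs for an arbitrary iterator. The
``Rewindable`` wrapper is hypothetical; it has to remember every item it has
produced, which is exactly the "arbitrary buffering" mentioned above. In this
sketch ``prev()`` steps back one position and returns the item there.

```python
class Rewindable:
    """Hypothetical wrapper that supports prev() by remembering every item."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._history = []     # grows without bound: the cost of prev()
        self._pos = 0          # index of the next item to return

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos == len(self._history):
            # May raise StopIteration, which simply propagates.
            self._history.append(next(self._it))
        value = self._history[self._pos]
        self._pos += 1
        return value

    next = __next__            # spelling used by this PEP

    def prev(self):
        """Step back one item and return it."""
        if self._pos == 0:
            raise IndexError("no previous item")
        self._pos -= 1
        return self._history[self._pos]


r = Rewindable(iter("abc"))
a = next(r)
b = next(r)
back = r.prev()        # steps back onto "b"
remaining = list(r)    # "b" is produced again, then "c"
```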
|
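The timing comparison mentioned in the last resolved issue can be repeated
with the standard ``timeit`` module. This is a rough sketch rather than a
benchmark: the dictionary size and repetition count are arbitrary assumptions,
``items()`` stands in for ``iteritems()`` on later Python versions, and the
ratio you observe will depend on the interpreter and machine.

```python
import timeit

data = dict.fromkeys(range(1000), 0)


def by_key():
    # Iterate over keys and look each value up explicitly.
    for key in data:
        data[key]


def by_item():
    # Iterate over (key, value) pairs directly.
    for key, value in data.items():   # iteritems() in the Python of this PEP
        pass


key_time = timeit.timeit(by_key, number=200)
item_time = timeit.timeit(by_item, number=200)
```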
|
|
|
Mailing Lists
=============

The iterator protocol has been discussed extensively in a mailing list on
SourceForge:

    http://lists.sourceforge.net/lists/listinfo/python-iterators

Initially, some of the discussion was carried out at Yahoo; archives are still
accessible:

    http://groups.yahoo.com/group/python-iter


Copyright
=========

This document is in the public domain.