494 lines
20 KiB
Plaintext
494 lines
20 KiB
Plaintext
PEP: 234
|
||
Title: Iterators
|
||
Version: $Revision$
|
||
Last-Modified: $Date$
|
||
Author: ping@lfw.org (Ka-Ping Yee), guido@python.org (Guido van Rossum)
|
||
Status: Final
|
||
Type: Standards Track
|
||
Python-Version: 2.1
|
||
Created: 30-Jan-2001
|
||
Post-History: 30-Apr-2001
|
||
|
||
Abstract
|
||
|
||
This document proposes an iteration interface that objects can
|
||
provide to control the behaviour of 'for' loops. Looping is
|
||
customized by providing a method that produces an iterator object.
|
||
The iterator provides a 'get next value' operation that produces
|
||
the next item in the sequence each time it is called, raising an
|
||
exception when no more items are available.
|
||
|
||
In addition, specific iterators over the keys of a dictionary and
|
||
over the lines of a file are proposed, and a proposal is made to
|
||
allow spelling dict.has_key(key) as "key in dict".
|
||
|
||
Note: this is an almost complete rewrite of this PEP by the second
|
||
author, describing the actual implementation checked into the
|
||
trunk of the Python 2.2 CVS tree. It is still open for
|
||
discussion. Some of the more esoteric proposals in the original
|
||
version of this PEP have been withdrawn for now; these may be the
|
||
subject of a separate PEP in the future.
|
||
|
||
|
||
C API Specification
|
||
|
||
A new exception is defined, StopIteration, which can be used to
|
||
signal the end of an iteration.
|
||
|
||
A new slot named tp_iter for requesting an iterator is added to
|
||
the type object structure. This should be a function of one
|
||
PyObject * argument returning a PyObject *, or NULL. To use this
|
||
slot, a new C API function PyObject_GetIter() is added, with the
|
||
same signature as the tp_iter slot function.
|
||
|
||
Another new slot, named tp_iternext, is added to the type
|
||
structure, for obtaining the next value in the iteration. To use
|
||
this slot, a new C API function PyIter_Next() is added. The
|
||
signature for both the slot and the API function is as follows,
|
||
although the NULL return conditions differ: the argument is a
|
||
PyObject * and so is the return value. When the return value is
|
||
non-NULL, it is the next value in the iteration. When it is NULL,
|
||
then for the tp_iternext slot there are three possibilities:
|
||
|
||
- No exception is set; this implies the end of the iteration.
|
||
|
||
- The StopIteration exception (or a derived exception class) is
|
||
set; this implies the end of the iteration.
|
||
|
||
- Some other exception is set; this means that an error occurred
|
||
that should be propagated normally.
|
||
|
||
The higher-level PyIter_Next() function clears the StopIteration
|
||
exception (or derived exception) when it occurs, so its NULL return
|
||
conditions are simpler:
|
||
|
||
- No exception is set; this means iteration has ended.
|
||
|
||
- Some exception is set; this means an error occurred, and should
|
||
be propagated normally.
|
||
|
||
Iterators implemented in C should *not* implement a next() method
|
||
with similar semantics as the tp_iternext slot! When the type's
|
||
dictionary is initialized (by PyType_Ready()), the presence of a
|
||
tp_iternext slot causes a method next() wrapping that slot to be
|
||
added to the type's tp_dict. (Exception: if the type doesn't use
|
||
PyObject_GenericGetAttr() to access instance attributes, the
|
||
next() method in the type's tp_dict may not be seen.) (Due to a
|
||
misunderstanding in the original text of this PEP, in Python 2.2,
|
||
all iterator types implemented a next() method that was overridden
|
||
by the wrapper; this has been fixed in Python 2.3.)
|
||
|
||
To ensure binary backwards compatibility, a new flag
|
||
Py_TPFLAGS_HAVE_ITER is added to the set of flags in the tp_flags
|
||
field, and to the default flags macro. This flag must be tested
|
||
before accessing the tp_iter or tp_iternext slots. The macro
|
||
PyIter_Check() tests whether an object has the appropriate flag
|
||
set and has a non-NULL tp_iternext slot. There is no such macro
|
||
for the tp_iter slot (since the only place where this slot is
|
||
referenced should be PyObject_GetIter(), and this can check for
|
||
the Py_TPFLAGS_HAVE_ITER flag directly).
|
||
|
||
(Note: the tp_iter slot can be present on any object; the
|
||
tp_iternext slot should only be present on objects that act as
|
||
iterators.)
|
||
|
||
For backwards compatibility, the PyObject_GetIter() function
|
||
implements fallback semantics when its argument is a sequence that
|
||
does not implement a tp_iter function: a lightweight sequence
|
||
iterator object is constructed in that case which iterates over
|
||
the items of the sequence in the natural order.
|
||
|
||
The Python bytecode generated for 'for' loops is changed to use
|
||
new opcodes, GET_ITER and FOR_ITER, that use the iterator protocol
|
||
rather than the sequence protocol to get the next value for the
|
||
loop variable. This makes it possible to use a 'for' loop to loop
|
||
over non-sequence objects that support the tp_iter slot. Other
|
||
places where the interpreter loops over the values of a sequence
|
||
should also be changed to use iterators.
|
||
|
||
Iterators ought to implement the tp_iter slot as returning a
|
||
reference to themselves; this is needed to make it possible to
|
||
use an iterator (as opposed to a sequence) in a for loop.
|
||
|
||
Iterator implementations (in C or in Python) should guarantee that
|
||
once the iterator has signalled its exhaustion, subsequent calls
|
||
to tp_iternext or to the next() method will continue to do so. It
|
||
is not specified whether an iterator should enter the exhausted
|
||
state when an exception (other than StopIteration) is raised.
|
||
Note that Python cannot guarantee that user-defined or 3rd party
|
||
iterators implement this requirement correctly.
|
||
|
||
|
||
Python API Specification
|
||
|
||
The StopIteration exception is made visible as one of the
|
||
standard exceptions. It is derived from Exception.
|
||
|
||
A new built-in function is defined, iter(), which can be called in
|
||
two ways:
|
||
|
||
- iter(obj) calls PyObject_GetIter(obj).
|
||
|
||
- iter(callable, sentinel) returns a special kind of iterator that
|
||
calls the callable to produce a new value, and compares the
|
||
return value to the sentinel value. If the return value equals
|
||
the sentinel, this signals the end of the iteration and
|
||
StopIteration is raised rather than returning normal; if the
|
||
return value does not equal the sentinel, it is returned as the
|
||
next value from the iterator. If the callable raises an
|
||
exception, this is propagated normally; in particular, the
|
||
function is allowed to raise StopIteration as an alternative way
|
||
to end the iteration. (This functionality is available from the
|
||
C API as PyCallIter_New(callable, sentinel).)
|
||
|
||
Iterator objects returned by either form of iter() have a next()
|
||
method. This method either returns the next value in the
|
||
iteration, or raises StopIteration (or a derived exception class)
|
||
to signal the end of the iteration. Any other exception should be
|
||
considered to signify an error and should be propagated normally,
|
||
not taken to mean the end of the iteration.
|
||
|
||
Classes can define how they are iterated over by defining an
|
||
__iter__() method; this should take no additional arguments and
|
||
return a valid iterator object. A class that wants to be an
|
||
iterator should implement two methods: a next() method that behaves
|
||
as described above, and an __iter__() method that returns self.
|
||
|
||
The two methods correspond to two distinct protocols:
|
||
|
||
1. An object can be iterated over with "for" if it implements
|
||
__iter__() or __getitem__().
|
||
|
||
2. An object can function as an iterator if it implements next().
|
||
|
||
Container-like objects usually support protocol 1. Iterators are
|
||
currently required to support both protocols. The semantics of
|
||
iteration come only from protocol 2; protocol 1 is present to make
|
||
iterators behave like sequences; in particular so that code
|
||
receiving an iterator can use a for-loop over the iterator.
|
||
|
||
|
||
Dictionary Iterators
|
||
|
||
- Dictionaries implement a sq_contains slot that implements the
|
||
same test as the has_key() method. This means that we can write
|
||
|
||
if k in dict: ...
|
||
|
||
which is equivalent to
|
||
|
||
if dict.has_key(k): ...
|
||
|
||
- Dictionaries implement a tp_iter slot that returns an efficient
|
||
iterator that iterates over the keys of the dictionary. During
|
||
such an iteration, the dictionary should not be modified, except
|
||
that setting the value for an existing key is allowed (deletions
|
||
or additions are not, nor is the update() method). This means
|
||
that we can write
|
||
|
||
for k in dict: ...
|
||
|
||
which is equivalent to, but much faster than
|
||
|
||
for k in dict.keys(): ...
|
||
|
||
as long as the restriction on modifications to the dictionary
|
||
(either by the loop or by another thread) are not violated.
|
||
|
||
- Add methods to dictionaries that return different kinds of
|
||
iterators explicitly:
|
||
|
||
for key in dict.iterkeys(): ...
|
||
|
||
for value in dict.itervalues(): ...
|
||
|
||
for key, value in dict.iteritems(): ...
|
||
|
||
This means that "for x in dict" is shorthand for "for x in
|
||
dict.iterkeys()".
|
||
|
||
Other mappings, if they support iterators at all, should also
|
||
iterate over the keys. However, this should not be taken as an
|
||
absolute rule; specific applications may have different
|
||
requirements.
|
||
|
||
|
||
File Iterators
|
||
|
||
The following proposal is useful because it provides us with a
|
||
good answer to the complaint that the common idiom to iterate over
|
||
the lines of a file is ugly and slow.
|
||
|
||
- Files implement a tp_iter slot that is equivalent to
|
||
iter(f.readline, ""). This means that we can write
|
||
|
||
for line in file:
|
||
...
|
||
|
||
as a shorthand for
|
||
|
||
for line in iter(file.readline, ""):
|
||
...
|
||
|
||
which is equivalent to, but faster than
|
||
|
||
while 1:
|
||
line = file.readline()
|
||
if not line:
|
||
break
|
||
...
|
||
|
||
This also shows that some iterators are destructive: they consume
|
||
all the values and a second iterator cannot easily be created that
|
||
iterates independently over the same values. You could open the
|
||
file for a second time, or seek() to the beginning, but these
|
||
solutions don't work for all file types, e.g. they don't work when
|
||
the open file object really represents a pipe or a stream socket.
|
||
|
||
Because the file iterator uses an internal buffer, mixing this
|
||
with other file operations (e.g. file.readline()) doesn't work
|
||
right. Also, the following code:
|
||
|
||
for line in file:
|
||
if line == "\n":
|
||
break
|
||
for line in file:
|
||
print line,
|
||
|
||
doesn't work as you might expect, because the iterator created by
|
||
the second for-loop doesn't take the buffer read-ahead by the
|
||
first for-loop into account. A correct way to write this is:
|
||
|
||
it = iter(file)
|
||
for line in it:
|
||
if line == "\n":
|
||
break
|
||
for line in it:
|
||
print line,
|
||
|
||
(The rationale for these restrictions are that "for line in file"
|
||
ought to become the recommended, standard way to iterate over the
|
||
lines of a file, and this should be as fast as can be. The
|
||
iterator version is considerable faster than calling readline(),
|
||
due to the internal buffer in the iterator.)
|
||
|
||
|
||
Rationale
|
||
|
||
If all the parts of the proposal are included, this addresses many
|
||
concerns in a consistent and flexible fashion. Among its chief
|
||
virtues are the following four -- no, five -- no, six -- points:
|
||
|
||
1. It provides an extensible iterator interface.
|
||
|
||
2. It allows performance enhancements to list iteration.
|
||
|
||
3. It allows big performance enhancements to dictionary iteration.
|
||
|
||
4. It allows one to provide an interface for just iteration
|
||
without pretending to provide random access to elements.
|
||
|
||
5. It is backward-compatible with all existing user-defined
|
||
classes and extension objects that emulate sequences and
|
||
mappings, even mappings that only implement a subset of
|
||
{__getitem__, keys, values, items}.
|
||
|
||
6. It makes code iterating over non-sequence collections more
|
||
concise and readable.
|
||
|
||
|
||
Resolved Issues
|
||
|
||
The following topics have been decided by consensus or BDFL
|
||
pronouncement.
|
||
|
||
- Two alternative spellings for next() have been proposed but
|
||
rejected: __next__(), because it corresponds to a type object
|
||
slot (tp_iternext); and __call__(), because this is the only
|
||
operation.
|
||
|
||
Arguments against __next__(): while many iterators are used in
|
||
for loops, it is expected that user code will also call next()
|
||
directly, so having to write __next__() is ugly; also, a
|
||
possible extension of the protocol would be to allow for prev(),
|
||
current() and reset() operations; surely we don't want to use
|
||
__prev__(), __current__(), __reset__().
|
||
|
||
Arguments against __call__() (the original proposal): taken out
|
||
of context, x() is not very readable, while x.next() is clear;
|
||
there's a danger that every special-purpose object wants to use
|
||
__call__() for its most common operation, causing more confusion
|
||
than clarity.
|
||
|
||
(In retrospect, it might have been better to go for __next__()
|
||
and have a new built-in, next(it), which calls it.__next__().
|
||
But alas, it's too late; this has been deployed in Python 2.2
|
||
since December 2001.)
|
||
|
||
- Some folks have requested the ability to restart an iterator.
|
||
This should be dealt with by calling iter() on a sequence
|
||
repeatedly, not by the iterator protocol itself. (See also
|
||
requested extensions below.)
|
||
|
||
- It has been questioned whether an exception to signal the end of
|
||
the iteration isn't too expensive. Several alternatives for the
|
||
StopIteration exception have been proposed: a special value End
|
||
to signal the end, a function end() to test whether the iterator
|
||
is finished, even reusing the IndexError exception.
|
||
|
||
- A special value has the problem that if a sequence ever
|
||
contains that special value, a loop over that sequence will
|
||
end prematurely without any warning. If the experience with
|
||
null-terminated C strings hasn't taught us the problems this
|
||
can cause, imagine the trouble a Python introspection tool
|
||
would have iterating over a list of all built-in names,
|
||
assuming that the special End value was a built-in name!
|
||
|
||
- Calling an end() function would require two calls per
|
||
iteration. Two calls is much more expensive than one call
|
||
plus a test for an exception. Especially the time-critical
|
||
for loop can test very cheaply for an exception.
|
||
|
||
- Reusing IndexError can cause confusion because it can be a
|
||
genuine error, which would be masked by ending the loop
|
||
prematurely.
|
||
|
||
- Some have asked for a standard iterator type. Presumably all
|
||
iterators would have to be derived from this type. But this is
|
||
not the Python way: dictionaries are mappings because they
|
||
support __getitem__() and a handful other operations, not
|
||
because they are derived from an abstract mapping type.
|
||
|
||
- Regarding "if key in dict": there is no doubt that the
|
||
dict.has_key(x) interpretation of "x in dict" is by far the
|
||
most useful interpretation, probably the only useful one. There
|
||
has been resistance against this because "x in list" checks
|
||
whether x is present among the values, while the proposal makes
|
||
"x in dict" check whether x is present among the keys. Given
|
||
that the symmetry between lists and dictionaries is very weak,
|
||
this argument does not have much weight.
|
||
|
||
- The name iter() is an abbreviation. Alternatives proposed
|
||
include iterate(), traverse(), but these appear too long.
|
||
Python has a history of using abbrs for common builtins,
|
||
e.g. repr(), str(), len().
|
||
|
||
Resolution: iter() it is.
|
||
|
||
- Using the same name for two different operations (getting an
|
||
iterator from an object and making an iterator for a function
|
||
with an sentinel value) is somewhat ugly. I haven't seen a
|
||
better name for the second operation though, and since they both
|
||
return an iterator, it's easy to remember.
|
||
|
||
Resolution: the builtin iter() takes an optional argument, which
|
||
is the sentinel to look for.
|
||
|
||
- Once a particular iterator object has raised StopIteration, will
|
||
it also raise StopIteration on all subsequent next() calls?
|
||
Some say that it would be useful to require this, others say
|
||
that it is useful to leave this open to individual iterators.
|
||
Note that this may require an additional state bit for some
|
||
iterator implementations (e.g. function-wrapping iterators).
|
||
|
||
Resolution: once StopIteration is raised, calling it.next()
|
||
continues to raise StopIteration.
|
||
|
||
Note: this was in fact not implemented in Python 2.2; there are
|
||
many cases where an iterator's next() method can raise
|
||
StopIteration on one call but not on the next. This has been
|
||
remedied in Python 2.3.
|
||
|
||
- It has been proposed that a file object should be its own
|
||
iterator, with a next() method returning the next line. This
|
||
has certain advantages, and makes it even clearer that this
|
||
iterator is destructive. The disadvantage is that this would
|
||
make it even more painful to implement the "sticky
|
||
StopIteration" feature proposed in the previous bullet.
|
||
|
||
Resolution: tentatively rejected (though there are still people
|
||
arguing for this).
|
||
|
||
- Some folks have requested extensions of the iterator protocol,
|
||
e.g. prev() to get the previous item, current() to get the
|
||
current item again, finished() to test whether the iterator is
|
||
finished, and maybe even others, like rewind(), __len__(),
|
||
position().
|
||
|
||
While some of these are useful, many of these cannot easily be
|
||
implemented for all iterator types without adding arbitrary
|
||
buffering, and sometimes they can't be implemented at all (or
|
||
not reasonably). E.g. anything to do with reversing directions
|
||
can't be done when iterating over a file or function. Maybe a
|
||
separate PEP can be drafted to standardize the names for such
|
||
operations when the are implementable.
|
||
|
||
Resolution: rejected.
|
||
|
||
- There has been a long discussion about whether
|
||
|
||
for x in dict: ...
|
||
|
||
should assign x the successive keys, values, or items of the
|
||
dictionary. The symmetry between "if x in y" and "for x in y"
|
||
suggests that it should iterate over keys. This symmetry has been
|
||
observed by many independently and has even been used to "explain"
|
||
one using the other. This is because for sequences, "if x in y"
|
||
iterates over y comparing the iterated values to x. If we adopt
|
||
both of the above proposals, this will also hold for
|
||
dictionaries.
|
||
|
||
The argument against making "for x in dict" iterate over the keys
|
||
comes mostly from a practicality point of view: scans of the
|
||
standard library show that there are about as many uses of "for x
|
||
in dict.items()" as there are of "for x in dict.keys()", with the
|
||
items() version having a small majority. Presumably many of the
|
||
loops using keys() use the corresponding value anyway, by writing
|
||
dict[x], so (the argument goes) by making both the key and value
|
||
available, we could support the largest number of cases. While
|
||
this is true, I (Guido) find the correspondence between "for x in
|
||
dict" and "if x in dict" too compelling to break, and there's not
|
||
much overhead in having to write dict[x] to explicitly get the
|
||
value.
|
||
|
||
For fast iteration over items, use "for key, value in
|
||
dict.iteritems()". I've timed the difference between
|
||
|
||
for key in dict: dict[key]
|
||
|
||
and
|
||
|
||
for key, value in dict.iteritems(): pass
|
||
|
||
and found that the latter is only about 7% faster.
|
||
|
||
Resolution: By BDFL pronouncement, "for x in dict" iterates over
|
||
the keys, and dictionaries have iteritems(), iterkeys(), and
|
||
itervalues() to return the different flavors of dictionary
|
||
iterators.
|
||
|
||
|
||
Mailing Lists
|
||
|
||
The iterator protocol has been discussed extensively in a mailing
|
||
list on SourceForge:
|
||
|
||
http://lists.sourceforge.net/lists/listinfo/python-iterators
|
||
|
||
Initially, some of the discussion was carried out at Yahoo;
|
||
archives are still accessible:
|
||
|
||
http://groups.yahoo.com/group/python-iter
|
||
|
||
|
||
Copyright
|
||
|
||
This document is in the public domain.
|
||
|
||
|
||
|
||
Local Variables:
|
||
mode: indented-text
|
||
indent-tabs-mode: nil
|
||
End:
|