417 lines
17 KiB
Plaintext
417 lines
17 KiB
Plaintext
PEP: 234
|
||
Title: Iterators
|
||
Version: $Revision$
|
||
Author: ping@lfw.org (Ka-Ping Yee), guido@python.org (Guido van Rossum)
|
||
Status: Draft
|
||
Type: Standards Track
|
||
Python-Version: 2.1
|
||
Created: 30-Jan-2001
|
||
Post-History: 30-Apr-2001
|
||
|
||
Abstract
|
||
|
||
This document proposes an iteration interface that objects can
|
||
provide to control the behaviour of 'for' loops. Looping is
|
||
customized by providing a method that produces an iterator object.
|
||
The iterator provides a 'get next value' operation that produces
|
||
the next item in the sequence each time it is called, raising an
|
||
exception when no more items are available.
|
||
|
||
In addition, specific iterators over the keys of a dictionary and
|
||
over the lines of a file are proposed, and a proposal is made to
|
||
allow spelling dict.has_key(key) as "key in dict".
|
||
|
||
Note: this is an almost complete rewrite of this PEP by the second
|
||
author, describing the actual implementation checked into the
|
||
trunk of the Python 2.2 CVS tree. It is still open for
|
||
discussion. Some of the more esoteric proposals in the original
|
||
version of this PEP have been withdrawn for now; these may be the
|
||
subject of a separate PEP in the future.
|
||
|
||
|
||
C API Specification
|
||
|
||
A new exception is defined, StopIteration, which can be used to
|
||
signal the end of an iteration.
|
||
|
||
A new slot named tp_iter for requesting an iterator is added to
|
||
the type object structure. This should be a function of one
|
||
PyObject * argument returning a PyObject *, or NULL. To use this
|
||
slot, a new C API function PyObject_GetIter() is added, with the
|
||
same signature as the tp_iter slot function.
|
||
|
||
Another new slot, named tp_iternext, is added to the type
|
||
structure, for obtaining the next value in the iteration. To use
|
||
this slot, a new C API function PyIter_Next() is added. The
|
||
signature for both the slot and the API function is as follows:
|
||
the argument is a PyObject * and so is the return value. When the
|
||
return value is non-NULL, it is the next value in the iteration.
|
||
When it is NULL, there are three possibilities:
|
||
|
||
- No exception is set; this implies the end of the iteration.
|
||
|
||
- The StopIteration exception (or a derived exception class) is
|
||
set; this implies the end of the iteration.
|
||
|
||
- Some other exception is set; this means that an error occurred
|
||
that should be propagated normally.
|
||
|
||
In addition to the tp_iternext slot, every iterator object must
|
||
also implement a next() method, callable without arguments. This
|
||
should have the same semantics as the tp_iternext slot function,
|
||
except that the only way to signal the end of the iteration is to
|
||
raise StopIteration. The iterator object should not care whether
|
||
its tp_iternext slot function is called or its next() method, and
|
||
the caller may mix calls arbitrarily. (The next() method is for
|
||
the benefit of Python code using iterators directly; the
|
||
tp_iternext slot is added to make 'for' loops more efficient.)
|
||
|
||
To ensure binary backwards compatibility, a new flag
|
||
Py_TPFLAGS_HAVE_ITER is added to the set of flags in the tp_flags
|
||
field, and to the default flags macro. This flag must be tested
|
||
before accessing the tp_iter or tp_iternext slots. The macro
|
||
PyIter_Check() tests whether an object has the appropriate flag
|
||
set and has a non-NULL tp_iternext slot. There is no such macro
|
||
for the tp_iter slot (since the only place where this slot is
|
||
referenced should be PyObject_GetIter()).
|
||
|
||
(Note: the tp_iter slot can be present on any object; the
|
||
tp_iternext slot should only be present on objects that act as
|
||
iterators.)
|
||
|
||
For backwards compatibility, the PyObject_GetIter() function
|
||
implements fallback semantics when its argument is a sequence that
|
||
does not implement a tp_iter function: a lightweight sequence
|
||
iterator object is constructed in that case which iterates over
|
||
the items of the sequence in the natural order.
|
||
|
||
The Python bytecode generated for 'for' loops is changed to use
|
||
new opcodes, GET_ITER and FOR_ITER, that use the iterator protocol
|
||
rather than the sequence protocol to get the next value for the
|
||
loop variable. This makes it possible to use a 'for' loop to loop
|
||
over non-sequence objects that support the tp_iter slot. Other
|
||
places where the interpreter loops over the values of a sequence
|
||
should also be changed to use iterators.
|
||
|
||
Iterators ought to implement the tp_iter slot as returning a
|
||
reference to themselves; this is needed to make it possible to
|
||
use an iterator (as opposed to a sequence) in a for loop.
|
||
|
||
|
||
Python API Specification
|
||
|
||
The StopIteration exception is made visible as one of the
|
||
standard exceptions. It is derived from Exception.
|
||
|
||
A new built-in function is defined, iter(), which can be called in
|
||
two ways:
|
||
|
||
- iter(obj) calls PyObject_GetIter(obj).
|
||
|
||
- iter(callable, sentinel) returns a special kind of iterator that
|
||
calls the callable to produce a new value, and compares the
|
||
return value to the sentinel value. If the return value equals
|
||
the sentinel, this signals the end of the iteration and
|
||
StopIteration is raised rather than returning normal; if the
|
||
return value does not equal the sentinel, it is returned as the
|
||
next value from the iterator. If the callable raises an
|
||
exception, this is propagated normally; in particular, the
|
||
function is allowed to raise StopIteration as an alternative way
|
||
to end the iteration. (This functionality is available from the
|
||
C API as PyCallIter_New(callable, sentinel).)
|
||
|
||
Iterator objects returned by either form of iter() have a next()
|
||
method. This method either returns the next value in the
|
||
iteration, or raises StopIteration (or a derived exception class)
|
||
to signal the end of the iteration. Any other exception should be
|
||
considered to signify an error and should be propagated normally,
|
||
not taken to mean the end of the iteration.
|
||
|
||
Classes can define how they are iterated over by defining an
|
||
__iter__() method; this should take no additional arguments and
|
||
return a valid iterator object. A class is a valid iterator
|
||
object when it defines a next() method that behaves as described
|
||
above. A class that wants to be an iterator also ought to
|
||
implement __iter__() returning itself.
|
||
|
||
|
||
Dictionary Iterators
|
||
|
||
The following two proposals are somewhat controversial. They are
|
||
also independent from the main iterator implementation. However,
|
||
they are both very useful.
|
||
|
||
- Dictionaries implement a sq_contains slot that implements the
|
||
same test as the has_key() method. This means that we can write
|
||
|
||
if k in dict: ...
|
||
|
||
which is equivalent to
|
||
|
||
if dict.has_key(k): ...
|
||
|
||
- Dictionaries implement a tp_iter slot that returns an efficient
|
||
iterator that iterates over the keys of the dictionary. During
|
||
such an iteration, the dictionary should not be modified, except
|
||
that setting the value for an existing key is allowed (deletions
|
||
or additions are not, nor is the update() method). This means
|
||
that we can write
|
||
|
||
for k in dict: ...
|
||
|
||
which is equivalent to, but much faster than
|
||
|
||
for k in dict.keys(): ...
|
||
|
||
as long as the restriction on modifications to the dictionary
|
||
(either by the loop or by another thread) are not violated.
|
||
|
||
- Add methods to dictionaries that return different kinds of
|
||
iterators explicitly:
|
||
|
||
for key in dict.iterkeys(): ...
|
||
|
||
for value in dict.itervalues(): ...
|
||
|
||
for key, value in dict.iteritems(): ...
|
||
|
||
This means that "for x in dict" is shorthand for "for x in
|
||
dict.iterkeys()".
|
||
|
||
If this proposal is accepted, it makes sense to recommend that
|
||
other mappings, if they support iterators at all, should also
|
||
iterate over the keys. However, this should not be taken as an
|
||
absolute rule; specific applications may have different
|
||
requirements.
|
||
|
||
|
||
File Iterators
|
||
|
||
The following proposal is not controversial, but should be
|
||
considered a separate step after introducing the iterator
|
||
framework described above. It is useful because it provides us
|
||
with a good answer to the complaint that the common idiom to
|
||
iterate over the lines of a file is ugly and slow.
|
||
|
||
- Files implement a tp_iter slot that is equivalent to
|
||
iter(f.readline, ""). This means that we can write
|
||
|
||
for line in file:
|
||
...
|
||
|
||
as a shorthand for
|
||
|
||
for line in iter(file.readline, ""):
|
||
...
|
||
|
||
which is equivalent to, but faster than
|
||
|
||
while 1:
|
||
line = file.readline()
|
||
if not line:
|
||
break
|
||
...
|
||
|
||
This also shows that some iterators are destructive: they consume
|
||
all the values and a second iterator cannot easily be created that
|
||
iterates independently over the same values. You could open the
|
||
file for a second time, or seek() to the beginning, but these
|
||
solutions don't work for all file types, e.g. they don't work when
|
||
the open file object really represents a pipe or a stream socket.
|
||
|
||
|
||
Rationale
|
||
|
||
If all the parts of the proposal are included, this addresses many
|
||
concerns in a consistent and flexible fashion. Among its chief
|
||
virtues are the following four -- no, five -- no, six -- points:
|
||
|
||
1. It provides an extensible iterator interface.
|
||
|
||
2. It allows performance enhancements to list iteration.
|
||
|
||
3. It allows big performance enhancements to dictionary iteration.
|
||
|
||
4. It allows one to provide an interface for just iteration
|
||
without pretending to provide random access to elements.
|
||
|
||
5. It is backward-compatible with all existing user-defined
|
||
classes and extension objects that emulate sequences and
|
||
mappings, even mappings that only implement a subset of
|
||
{__getitem__, keys, values, items}.
|
||
|
||
6. It makes code iterating over non-sequence collections more
|
||
concise and readable.
|
||
|
||
|
||
Open Issues
|
||
|
||
The following questions are still open.
|
||
|
||
- The name iter() is an abbreviation. Alternatives proposed
|
||
include iterate(), traverse(), but these appear too long.
|
||
Python has a history of using abbrs for common builtins,
|
||
e.g. repr(), str(), len().
|
||
|
||
- Using the same name for two different operations (getting an
|
||
iterator from an object and making an iterator for a function
|
||
with an sentinel value) is somewhat ugly. I haven't seen a
|
||
better name for the second operation though, and since they both
|
||
return an iterator, it's easy to remember.
|
||
|
||
- Once a particular iterator object has raised StopIteration, will
|
||
it also raise StopIteration on all subsequent next() calls?
|
||
Some say that it would be useful to require this, others say
|
||
that it is useful to leave this open to individual iterators.
|
||
Note that this may require an additional state bit for some
|
||
iterator implementations (e.g. function-wrapping iterators).
|
||
|
||
- It has been proposed that a file object should be its own
|
||
iterator, with a next() method returning the next line. This
|
||
has certain advantages, and makes it even clearer that this
|
||
iterator is destructive. The disadvantage is that this would
|
||
make it even more painful to implement the "sticky
|
||
StopIteration" feature proposed in the previous bullet.
|
||
|
||
- Some folks have requested extensions of the iterator protocol,
|
||
e.g. prev() to get the previous item, current() to get the
|
||
current item again, finished() to test whether the iterator is
|
||
finished, and maybe even others, like rewind(), __len__(),
|
||
position().
|
||
|
||
While some of these are useful, many of these cannot easily be
|
||
implemented for all iterator types without adding arbitrary
|
||
buffering, and sometimes they can't be implemented at all (or
|
||
not reasonably). E.g. anything to do with reversing directions
|
||
can't be done when iterating over a file or function. Maybe a
|
||
separate PEP can be drafted to standardize the names for such
|
||
operations when the are implementable.
|
||
|
||
- There is still discussion about whether
|
||
|
||
for x in dict: ...
|
||
|
||
should assign x the successive keys, values, or items of the
|
||
dictionary. The symmetry between "if x in y" and "for x in y"
|
||
suggests that it should iterate over keys. This symmetry has been
|
||
observed by many independently and has even been used to "explain"
|
||
one using the other. This is because for sequences, "if x in y"
|
||
iterates over y comparing the iterated values to x. If we adopt
|
||
both of the above proposals, this will also hold for
|
||
dictionaries.
|
||
|
||
The argument against making "for x in dict" iterate over the keys
|
||
comes mostly from a practicality point of view: scans of the
|
||
standard library show that there are about as many uses of "for x
|
||
in dict.items()" as there are of "for x in dict.keys()", with the
|
||
items() version having a small majority. Presumably many of the
|
||
loops using keys() use the corresponding value anyway, by writing
|
||
dict[x], so (the argument goes) by making both the key and value
|
||
available, we could support the largest number of cases. While
|
||
this is true, I (Guido) find the correspondence between "for x in
|
||
dict" and "if x in dict" too compelling to break, and there's not
|
||
much overhead in having to write dict[x] to explicitly get the
|
||
value.
|
||
|
||
For fast iteration over items, use "for key, value in
|
||
dict.iteritems()". I've timed the difference between
|
||
|
||
for key in dict: dict[key]
|
||
|
||
and
|
||
|
||
for key, value in dict.iteritems(): pass
|
||
|
||
and found that the latter is only about 7% faster.
|
||
|
||
|
||
Resolved Issues
|
||
|
||
The following topics have been decided by consensus or BDFL
|
||
pronouncement.
|
||
|
||
- Two alternative spellings for next() have been proposed but
|
||
rejected: __next__(), because it corresponds to a type object
|
||
slot (tp_iternext); and __call__(), because this is the only
|
||
operation.
|
||
|
||
Arguments against __next__(): while many iterators are used in
|
||
for loops, it is expected that user code will also call next()
|
||
directly, so having to write __next__() is ugly; also, a
|
||
possible extension of the protocol would be to allow for prev(),
|
||
current() and reset() operations; surely we don't want to use
|
||
__prev__(), __current__(), __reset__().
|
||
|
||
Arguments against __call__() (the original proposal): taken out
|
||
of context, x() is not very readable, while x.next() is clear;
|
||
there's a danger that every special-purpose object wants to use
|
||
__call__() for its most common operation, causing more confusion
|
||
than clarity.
|
||
|
||
- Some folks have requested the ability to restart an iterator.
|
||
This should be dealt with by calling iter() on a sequence
|
||
repeatedly, not by the iterator protocol itself.
|
||
|
||
- It has been questioned whether an exception to signal the end of
|
||
the iteration isn't too expensive. Several alternatives for the
|
||
StopIteration exception have been proposed: a special value End
|
||
to signal the end, a function end() to test whether the iterator
|
||
is finished, even reusing the IndexError exception.
|
||
|
||
- A special value has the problem that if a sequence ever
|
||
contains that special value, a loop over that sequence will
|
||
end prematurely without any warning. If the experience with
|
||
null-terminated C strings hasn't taught us the problems this
|
||
can cause, imagine the trouble a Python introspection tool
|
||
would have iterating over a list of all built-in names,
|
||
assuming that the special End value was a built-in name!
|
||
|
||
- Calling an end() function would require two calls per
|
||
iteration. Two calls is much more expensive than one call
|
||
plus a test for an exception. Especially the time-critical
|
||
for loop can test very cheaply for an exception.
|
||
|
||
- Reusing IndexError can cause confusion because it can be a
|
||
genuine error, which would be masked by ending the loop
|
||
prematurely.
|
||
|
||
- Some have asked for a standard iterator type. Presumably all
|
||
iterators would have to be derived from this type. But this is
|
||
not the Python way: dictionaries are mappings because they
|
||
support __getitem__() and a handful other operations, not
|
||
because they are derived from an abstract mapping type.
|
||
|
||
- Regarding "if key in dict": there is no doubt that the
|
||
dict.has_keys(x) interpretation of "x in dict" is by far the
|
||
most useful interpretation, probably the only useful one. There
|
||
has been resistance against this because "x in list" checks
|
||
whether x is present among the values, while the proposal makes
|
||
"x in dict" check whether x is present among the keys. Given
|
||
that the symmetry between lists and dictionaries is very weak,
|
||
this argument does not have much weight.
|
||
|
||
|
||
Mailing Lists
|
||
|
||
The iterator protocol has been discussed extensively in a mailing
|
||
list on SourceForge:
|
||
|
||
http://lists.sourceforge.net/lists/listinfo/python-iterators
|
||
|
||
Initially, some of the discussion was carried out at Yahoo;
|
||
archives are still accessible:
|
||
|
||
http://groups.yahoo.com/group/python-iter
|
||
|
||
|
||
Copyright
|
||
|
||
This document is in the public domain.
|
||
|
||
|
||
|
||
Local Variables:
|
||
mode: indented-text
|
||
indent-tabs-mode: nil
|
||
End:
|