
PEP: 234
Title: Iterators
Version: $Revision$
Last-Modified: $Date$
Author: ping@zesty.ca (Ka-Ping Yee), guido@python.org (Guido van Rossum)
Status: Final
Type: Standards Track
Created: 30-Jan-2001
Python-Version: 2.1
Post-History: 30-Apr-2001
Abstract
This document proposes an iteration interface that objects can
provide to control the behaviour of 'for' loops. Looping is
customized by providing a method that produces an iterator object.
The iterator provides a 'get next value' operation that produces
the next item in the sequence each time it is called, raising an
exception when no more items are available.
In addition, specific iterators over the keys of a dictionary and
over the lines of a file are proposed, and a proposal is made to
allow spelling dict.has_key(key) as "key in dict".
Note: this is an almost complete rewrite of this PEP by the second
author, describing the actual implementation checked into the
trunk of the Python 2.2 CVS tree. It is still open for
discussion. Some of the more esoteric proposals in the original
version of this PEP have been withdrawn for now; these may be the
subject of a separate PEP in the future.
C API Specification
A new exception is defined, StopIteration, which can be used to
signal the end of an iteration.
A new slot named tp_iter for requesting an iterator is added to
the type object structure. This should be a function of one
PyObject * argument returning a PyObject *, or NULL. To use this
slot, a new C API function PyObject_GetIter() is added, with the
same signature as the tp_iter slot function.
Another new slot, named tp_iternext, is added to the type
structure, for obtaining the next value in the iteration. To use
this slot, a new C API function PyIter_Next() is added. The
signature for both the slot and the API function is as follows,
although the NULL return conditions differ: the argument is a
PyObject * and so is the return value. When the return value is
non-NULL, it is the next value in the iteration. When it is NULL,
then for the tp_iternext slot there are three possibilities:
- No exception is set; this implies the end of the iteration.
- The StopIteration exception (or a derived exception class) is
set; this implies the end of the iteration.
- Some other exception is set; this means that an error occurred
that should be propagated normally.
The higher-level PyIter_Next() function clears the StopIteration
exception (or derived exception) when it occurs, so its NULL return
conditions are simpler:
- No exception is set; this means iteration has ended.
- Some exception is set; this means an error occurred, and should
be propagated normally.
Iterators implemented in C should *not* implement a next() method
with similar semantics as the tp_iternext slot! When the type's
dictionary is initialized (by PyType_Ready()), the presence of a
tp_iternext slot causes a method next() wrapping that slot to be
added to the type's tp_dict. (Exception: if the type doesn't use
PyObject_GenericGetAttr() to access instance attributes, the
next() method in the type's tp_dict may not be seen.) (Due to a
misunderstanding in the original text of this PEP, in Python 2.2,
all iterator types implemented a next() method that was overridden
by the wrapper; this has been fixed in Python 2.3.)
To ensure binary backwards compatibility, a new flag
Py_TPFLAGS_HAVE_ITER is added to the set of flags in the tp_flags
field, and to the default flags macro. This flag must be tested
before accessing the tp_iter or tp_iternext slots. The macro
PyIter_Check() tests whether an object has the appropriate flag
set and has a non-NULL tp_iternext slot. There is no such macro
for the tp_iter slot (since the only place where this slot is
referenced should be PyObject_GetIter(), and this can check for
the Py_TPFLAGS_HAVE_ITER flag directly).
(Note: the tp_iter slot can be present on any object; the
tp_iternext slot should only be present on objects that act as
iterators.)
For backwards compatibility, the PyObject_GetIter() function
implements fallback semantics when its argument is a sequence that
does not implement a tp_iter function: a lightweight sequence
iterator object is constructed in that case which iterates over
the items of the sequence in the natural order.
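Seen from Python, this fallback means that a class providing only
__getitem__() can still be looped over. The following is a minimal
sketch, not part of the specification; the class name Squares is
hypothetical, and the sketch assumes the fallback iterator indexes
from 0 upward until IndexError is raised.

```python
class Squares:
    """Hypothetical sequence-like class: __getitem__() only, no __iter__()."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        # Raising IndexError is how an old-style sequence signals its end.
        if index < 0 or index >= self.n:
            raise IndexError(index)
        return index * index


# iter() finds no __iter__() and falls back to indexing 0, 1, 2, ...
squares = list(iter(Squares(4)))
```

Because no __iter__() is defined, existing sequence emulations of this
kind need no changes to keep working in 'for' loops.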
The Python bytecode generated for 'for' loops is changed to use
new opcodes, GET_ITER and FOR_ITER, that use the iterator protocol
rather than the sequence protocol to get the next value for the
loop variable. This makes it possible to use a 'for' loop to loop
over non-sequence objects that support the tp_iter slot. Other
places where the interpreter loops over the values of a sequence
should also be changed to use iterators.
Iterators ought to implement the tp_iter slot as returning a
reference to themselves; this is needed to make it possible to
use an iterator (as opposed to a sequence) in a for loop.
Iterator implementations (in C or in Python) should guarantee that
once the iterator has signalled its exhaustion, subsequent calls
to tp_iternext or to the next() method will continue to do so. It
is not specified whether an iterator should enter the exhausted
state when an exception (other than StopIteration) is raised.
Note that Python cannot guarantee that user-defined or 3rd party
iterators implement this requirement correctly.
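As an illustration of the exhaustion guarantee, here is a small sketch
of an iterator written in Python that keeps raising StopIteration once
it has finished. The names CountDown and raises_stop are hypothetical.
The __next__ alias is an assumption added so that the sketch also runs
on interpreters that spell the method __next__(); this PEP itself only
specifies next().

```python
class CountDown:
    """Hypothetical iterator that stays exhausted once it has ended."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def next(self):
        if self.current <= 0:
            # No state is changed here, so every later call raises again.
            raise StopIteration
        self.current -= 1
        return self.current + 1

    # Assumption: alias for interpreters that look up __next__() instead.
    __next__ = next


def raises_stop(it):
    """Call it.next() once; return True if that raised StopIteration."""
    try:
        it.next()
    except StopIteration:
        return True
    return False


it = CountDown(2)
values = list(it)  # consumes 2, then 1
```

After the list() call, raises_stop(it) can be called any number of
times and reports exhaustion each time.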
Python API Specification
The StopIteration exception is made visible as one of the
standard exceptions. It is derived from Exception.
A new built-in function is defined, iter(), which can be called in
two ways:
- iter(obj) calls PyObject_GetIter(obj).
- iter(callable, sentinel) returns a special kind of iterator that
calls the callable to produce a new value, and compares the
return value to the sentinel value. If the return value equals
the sentinel, this signals the end of the iteration and
StopIteration is raised rather than returning normally; if the
return value does not equal the sentinel, it is returned as the
next value from the iterator. If the callable raises an
exception, this is propagated normally; in particular, the
function is allowed to raise StopIteration as an alternative way
to end the iteration. (This functionality is available from the
C API as PyCallIter_New(callable, sentinel).)
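The two-argument form can be sketched as follows. This is only an
illustration: io.StringIO is used as a stand-in for an open text file,
and make_counter is a hypothetical helper.

```python
import io


def make_counter():
    """Return a callable that yields 1, 2, 3, ... on successive calls."""
    state = {"n": 0}

    def step():
        state["n"] += 1
        return state["n"]

    return step


# The iterator stops as soon as the callable returns the sentinel (4);
# the sentinel itself is never produced.
numbers = list(iter(make_counter(), 4))

# readline() returns "" at end of input, so "" works as the sentinel.
stream = io.StringIO("alpha\nbeta\n")
lines = list(iter(stream.readline, ""))
```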
Iterator objects returned by either form of iter() have a next()
method. This method either returns the next value in the
iteration, or raises StopIteration (or a derived exception class)
to signal the end of the iteration. Any other exception should be
considered to signify an error and should be propagated normally,
not taken to mean the end of the iteration.
Classes can define how they are iterated over by defining an
__iter__() method; this should take no additional arguments and
return a valid iterator object. A class that wants to be an
iterator should implement two methods: a next() method that behaves
as described above, and an __iter__() method that returns self.
The two methods correspond to two distinct protocols:
1. An object can be iterated over with "for" if it implements
__iter__() or __getitem__().
2. An object can function as an iterator if it implements next().
Container-like objects usually support protocol 1. Iterators are
currently required to support both protocols. The semantics of
iteration come only from protocol 2; protocol 1 is present to make
iterators behave like sequences; in particular so that code
receiving an iterator can use a for-loop over the iterator.
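The split between the two protocols might look like this in a
user-defined class. The names Deck and DeckIterator are hypothetical,
and the __next__ alias is an assumption added so that the sketch also
runs on interpreters that spell the method __next__().

```python
class Deck:
    """Hypothetical container: supports protocol 1 only."""

    def __init__(self, cards):
        self.cards = list(cards)

    def __iter__(self):
        # A fresh iterator per call, so the container can be looped over
        # repeatedly.
        return DeckIterator(self.cards)


class DeckIterator:
    """Hypothetical iterator: supports protocols 1 and 2."""

    def __init__(self, cards):
        self.cards = cards
        self.pos = 0

    def __iter__(self):
        return self

    def next(self):
        if self.pos >= len(self.cards):
            raise StopIteration
        card = self.cards[self.pos]
        self.pos += 1
        return card

    # Assumption: alias for interpreters that look up __next__() instead.
    __next__ = next


deck = Deck(["ace", "king", "queen"])
first_pass = [card for card in deck]
second_pass = [card for card in deck]

it = iter(deck)
head = it.next()               # protocol 2, called directly
rest = [card for card in it]   # protocol 1 on the iterator itself
```

Because DeckIterator.__iter__() returns self, the for-loop over it
continues where the explicit next() call left off.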
Dictionary Iterators
- Dictionaries implement a sq_contains slot that implements the
same test as the has_key() method. This means that we can write
if k in dict: ...
which is equivalent to
if dict.has_key(k): ...
- Dictionaries implement a tp_iter slot that returns an efficient
iterator that iterates over the keys of the dictionary. During
such an iteration, the dictionary should not be modified, except
that setting the value for an existing key is allowed (deletions
or additions are not, nor is the update() method). This means
that we can write
for k in dict: ...
which is equivalent to, but much faster than
for k in dict.keys(): ...
as long as the restriction on modifications to the dictionary
(either by the loop or by another thread) is not violated.
- Add methods to dictionaries that return different kinds of
iterators explicitly:
for key in dict.iterkeys(): ...
for value in dict.itervalues(): ...
for key, value in dict.iteritems(): ...
This means that "for x in dict" is shorthand for "for x in
dict.iterkeys()".
Other mappings, if they support iterators at all, should also
iterate over the keys. However, this should not be taken as an
absolute rule; specific applications may have different
requirements.
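A short sketch of the key-based behaviour described in this section;
the dictionary contents are made up for illustration. It relies only
on "key in dict" and "for key in dict", and on the rule that assigning
to an existing key during iteration is allowed.

```python
ages = {"ann": 31, "bob": 27}

has_ann = "ann" in ages  # membership tests the keys
has_31 = 31 in ages      # values are not searched

keys_seen = []
for name in ages:
    keys_seen.append(name)
    # Setting the value for an existing key is permitted while iterating;
    # adding or deleting keys here would violate the restriction.
    ages[name] += 1
```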
File Iterators
The following proposal is useful because it provides us with a
good answer to the complaint that the common idiom to iterate over
the lines of a file is ugly and slow.
- Files implement a tp_iter slot that is equivalent to
iter(f.readline, ""). This means that we can write
    for line in file:
        ...
as a shorthand for
    for line in iter(file.readline, ""):
        ...
which is equivalent to, but faster than
    while 1:
        line = file.readline()
        if not line:
            break
        ...
This also shows that some iterators are destructive: they consume
all the values and a second iterator cannot easily be created that
iterates independently over the same values. You could open the
file for a second time, or seek() to the beginning, but these
solutions don't work for all file types, e.g. they don't work when
the open file object really represents a pipe or a stream socket.
Because the file iterator uses an internal buffer, mixing this
with other file operations (e.g. file.readline()) doesn't work
right. Also, the following code:
    for line in file:
        if line == "\n":
            break
    for line in file:
        print line,
doesn't work as you might expect, because the iterator created by
the second for-loop doesn't take the buffer read-ahead by the
first for-loop into account. A correct way to write this is:
    it = iter(file)
    for line in it:
        if line == "\n":
            break
    for line in it:
        print line,
(The rationale for these restrictions is that "for line in file"
ought to become the recommended, standard way to iterate over the
lines of a file, and this should be as fast as can be. The
iterator version is considerably faster than calling readline(),
due to the internal buffer in the iterator.)
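The shared-iterator pattern can be tried without a real file. In this
sketch io.StringIO is an assumed stand-in for an open text file, and
the sample text is made up; the point is only that both loops draw
from the same iterator object.

```python
import io

f = io.StringIO("header: 1\n\nbody line 1\nbody line 2\n")

it = iter(f)  # one iterator, shared by both loops

headers = []
for line in it:
    if line == "\n":
        break  # blank line ends the header block
    headers.append(line)

# The second loop resumes exactly where the first one stopped.
body = [line for line in it]
```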
Rationale
If all the parts of the proposal are included, this addresses many
concerns in a consistent and flexible fashion. Among its chief
virtues are the following four -- no, five -- no, six -- points:
1. It provides an extensible iterator interface.
2. It allows performance enhancements to list iteration.
3. It allows big performance enhancements to dictionary iteration.
4. It allows one to provide an interface for just iteration
without pretending to provide random access to elements.
5. It is backward-compatible with all existing user-defined
classes and extension objects that emulate sequences and
mappings, even mappings that only implement a subset of
{__getitem__, keys, values, items}.
6. It makes code iterating over non-sequence collections more
concise and readable.
Resolved Issues
The following topics have been decided by consensus or BDFL
pronouncement.
- Two alternative spellings for next() have been proposed but
rejected: __next__(), because it corresponds to a type object
slot (tp_iternext); and __call__(), because this is the only
operation.
Arguments against __next__(): while many iterators are used in
for loops, it is expected that user code will also call next()
directly, so having to write __next__() is ugly; also, a
possible extension of the protocol would be to allow for prev(),
current() and reset() operations; surely we don't want to use
__prev__(), __current__(), __reset__().
Arguments against __call__() (the original proposal): taken out
of context, x() is not very readable, while x.next() is clear;
there's a danger that every special-purpose object wants to use
__call__() for its most common operation, causing more confusion
than clarity.
(In retrospect, it might have been better to go for __next__()
and have a new built-in, next(it), which calls it.__next__().
But alas, it's too late; this has been deployed in Python 2.2
since December 2001.)
- Some folks have requested the ability to restart an iterator.
This should be dealt with by calling iter() on a sequence
repeatedly, not by the iterator protocol itself. (See also
requested extensions below.)
- It has been questioned whether an exception to signal the end of
the iteration isn't too expensive. Several alternatives for the
StopIteration exception have been proposed: a special value End
to signal the end, a function end() to test whether the iterator
is finished, even reusing the IndexError exception.
- A special value has the problem that if a sequence ever
contains that special value, a loop over that sequence will
end prematurely without any warning. If the experience with
null-terminated C strings hasn't taught us the problems this
can cause, imagine the trouble a Python introspection tool
would have iterating over a list of all built-in names,
assuming that the special End value was a built-in name!
- Calling an end() function would require two calls per
iteration. Two calls is much more expensive than one call
plus a test for an exception. Especially the time-critical
for loop can test very cheaply for an exception.
- Reusing IndexError can cause confusion because it can be a
genuine error, which would be masked by ending the loop
prematurely.
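The problem with a special end value can be shown in a few lines. This
is a sketch under assumptions: END, read_until_end and read_all are
hypothetical names, None plays the role of the special value, and the
built-in next(it) of later Python versions is used in place of
it.next().

```python
END = None  # hypothetical in-band end marker


def read_until_end(items):
    """Collect items up to the first END marker (in-band signalling)."""
    out = []
    for item in items + [END]:
        if item is END:
            break
        out.append(item)
    return out


def read_all(items):
    """Collect items until StopIteration (out-of-band signalling)."""
    it = iter(items)
    out = []
    while True:
        try:
            out.append(next(it))
        except StopIteration:
            break
    return out


data = [1, None, 3]
truncated = read_until_end(data)  # stops early at the embedded None
complete = read_all(data)         # every element is seen
```

The in-band version silently drops everything after the embedded
marker, while the exception-based version cannot confuse a value with
the end of the iteration.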
- Some have asked for a standard iterator type. Presumably all
iterators would have to be derived from this type. But this is
not the Python way: dictionaries are mappings because they
support __getitem__() and a handful of other operations, not
because they are derived from an abstract mapping type.
- Regarding "if key in dict": there is no doubt that the
dict.has_key(x) interpretation of "x in dict" is by far the
most useful interpretation, probably the only useful one. There
has been resistance against this because "x in list" checks
whether x is present among the values, while the proposal makes
"x in dict" check whether x is present among the keys. Given
that the symmetry between lists and dictionaries is very weak,
this argument does not have much weight.
- The name iter() is an abbreviation. Alternatives proposed
include iterate(), traverse(), but these appear too long.
Python has a history of using abbrs for common builtins,
e.g. repr(), str(), len().
Resolution: iter() it is.
- Using the same name for two different operations (getting an
iterator from an object and making an iterator for a function
with a sentinel value) is somewhat ugly. I haven't seen a
better name for the second operation though, and since they both
return an iterator, it's easy to remember.
Resolution: the builtin iter() takes an optional argument, which
is the sentinel to look for.
- Once a particular iterator object has raised StopIteration, will
it also raise StopIteration on all subsequent next() calls?
Some say that it would be useful to require this, others say
that it is useful to leave this open to individual iterators.
Note that this may require an additional state bit for some
iterator implementations (e.g. function-wrapping iterators).
Resolution: once StopIteration is raised, calling it.next()
continues to raise StopIteration.
Note: this was in fact not implemented in Python 2.2; there are
many cases where an iterator's next() method can raise
StopIteration on one call but not on the next. This has been
remedied in Python 2.3.
- It has been proposed that a file object should be its own
iterator, with a next() method returning the next line. This
has certain advantages, and makes it even clearer that this
iterator is destructive. The disadvantage is that this would
make it even more painful to implement the "sticky
StopIteration" feature proposed in the previous bullet.
Resolution: tentatively rejected (though there are still people
arguing for this).
- Some folks have requested extensions of the iterator protocol,
e.g. prev() to get the previous item, current() to get the
current item again, finished() to test whether the iterator is
finished, and maybe even others, like rewind(), __len__(),
position().
While some of these are useful, many of these cannot easily be
implemented for all iterator types without adding arbitrary
buffering, and sometimes they can't be implemented at all (or
not reasonably). E.g. anything to do with reversing directions
can't be done when iterating over a file or function. Maybe a
separate PEP can be drafted to standardize the names for such
operations when they are implementable.
Resolution: rejected.
- There has been a long discussion about whether
for x in dict: ...
should assign x the successive keys, values, or items of the
dictionary. The symmetry between "if x in y" and "for x in y"
suggests that it should iterate over keys. This symmetry has been
observed by many independently and has even been used to "explain"
one using the other. This is because for sequences, "if x in y"
iterates over y comparing the iterated values to x. If we adopt
both of the above proposals, this will also hold for
dictionaries.
The argument against making "for x in dict" iterate over the keys
comes mostly from a practicality point of view: scans of the
standard library show that there are about as many uses of "for x
in dict.items()" as there are of "for x in dict.keys()", with the
items() version having a small majority. Presumably many of the
loops using keys() use the corresponding value anyway, by writing
dict[x], so (the argument goes) by making both the key and value
available, we could support the largest number of cases. While
this is true, I (Guido) find the correspondence between "for x in
dict" and "if x in dict" too compelling to break, and there's not
much overhead in having to write dict[x] to explicitly get the
value.
For fast iteration over items, use "for key, value in
dict.iteritems()". I've timed the difference between
for key in dict: dict[key]
and
for key, value in dict.iteritems(): pass
and found that the latter is only about 7% faster.
Resolution: By BDFL pronouncement, "for x in dict" iterates over
the keys, and dictionaries have iteritems(), iterkeys(), and
itervalues() to return the different flavors of dictionary
iterators.
Mailing Lists
The iterator protocol has been discussed extensively in a mailing
list on SourceForge:
http://lists.sourceforge.net/lists/listinfo/python-iterators
Initially, some of the discussion was carried out at Yahoo;
archives are still accessible:
http://groups.yahoo.com/group/python-iter
Copyright
This document is in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: