PEP: 234
Title: Iterators
Version: $Revision$
Author: ping@lfw.org (Ka-Ping Yee), guido@python.org (Guido van Rossum)
Status: Draft
Type: Standards Track
Python-Version: 2.1
Created: 30-Jan-2001
Post-History:

Abstract

This document proposes an iteration interface that objects can
provide to control the behaviour of 'for' loops. Looping is
customized by providing a method that produces an iterator object.
The iterator provides a 'get next value' operation that produces
the next item in the sequence each time it is called, raising an
exception when no more items are available.

In addition, specific iterators over the keys of a dictionary and
over the lines of a file are proposed, and a proposal is made to
allow spelling dict.has_key(key) as "key in dict".

Note: this is an almost complete rewrite of this PEP by the second
author, describing the actual implementation checked into the
trunk of the Python 2.2 CVS tree. It is still open for
discussion. Some of the more esoteric proposals in the original
version of this PEP have been withdrawn for now; these may be the
subject of a separate PEP in the future.


C API Specification

A new exception is defined, StopIteration, which can be used to
signal the end of an iteration.

A new slot named tp_iter for requesting an iterator is added to
the type object structure. This should be a function of one
PyObject * argument returning a PyObject *, or NULL. To use this
slot, a new C API function PyObject_GetIter() is added, with the
same signature as the tp_iter slot function.

Another new slot, named tp_iternext, is added to the type
structure, for obtaining the next value in the iteration. To use
this slot, a new C API function PyIter_Next() is added. The
signature for both the slot and the API function is as follows:
the argument is a PyObject * and so is the return value. When the
return value is non-NULL, it is the next value in the iteration.
When it is NULL, there are three possibilities:

- No exception is set; this implies the end of the iteration.

- The StopIteration exception (or a derived exception class) is
  set; this implies the end of the iteration.

- Some other exception is set; this means that an error occurred
  that should be propagated normally.

In addition to the tp_iternext slot, every iterator object must
also implement a next() method, callable without arguments. This
should have the same semantics as the tp_iternext slot function,
except that the only way to signal the end of the iteration is to
raise StopIteration. The iterator object should not care whether
its tp_iternext slot function is called or its next() method, and
the caller may mix calls arbitrarily. (The next() method is for
the benefit of Python code using iterators directly; the
tp_iternext slot is added to make 'for' loops more efficient.)

To ensure binary backwards compatibility, a new flag
Py_TPFLAGS_HAVE_ITER is added to the set of flags in the tp_flags
field, and to the default flags macro. This flag must be tested
before accessing the tp_iter or tp_iternext slots. The macro
PyIter_Check() tests whether an object has the appropriate flag
set and has a non-NULL tp_iternext slot. There is no such macro
for the tp_iter slot (since the only place where this slot is
referenced should be PyObject_GetIter()).

(Note: the tp_iter slot can be present on any object; the
tp_iternext slot should only be present on objects that act as
iterators.)

For backwards compatibility, the PyObject_GetIter() function
implements fallback semantics when its argument is a sequence that
does not implement a tp_iter function: a lightweight sequence
iterator object is constructed in that case which iterates over
the items of the sequence in the natural order.

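The fallback can be observed from Python code. The following is a
minimal sketch, assuming a present-day Python 3 interpreter; the
class Squares is hypothetical and implements only __getitem__(),
with no __iter__() method.

```python
class Squares:
    """Hypothetical old-style sequence: defines __getitem__() only."""

    def __init__(self, count):
        self.count = count

    def __getitem__(self, index):
        # Raising IndexError tells the fallback iterator the sequence has ended.
        if not 0 <= index < self.count:
            raise IndexError(index)
        return index * index


# iter() builds a sequence iterator that asks for items 0, 1, 2, ... in order.
squares = list(iter(Squares(4)))
```

The class never mentions iteration, yet iter() accepts it because
the sequence iterator drives __getitem__() on its behalf.
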
The Python bytecode generated for 'for' loops is changed to use
new opcodes, GET_ITER and FOR_ITER, that use the iterator protocol
rather than the sequence protocol to get the next value for the
loop variable. This makes it possible to use a 'for' loop to loop
over non-sequence objects that support the tp_iter slot. Other
places where the interpreter loops over the values of a sequence
should also be changed to use iterators.

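What these opcodes do can be approximated in Python itself. This
is only a sketch of the idea, not the interpreter's actual code:
the helper run_for_loop() is hypothetical, and it assumes Python 3,
where the 'get next value' step is spelled next(iterator) rather
than iterator.next().

```python
def run_for_loop(iterable, body):
    """Hypothetical helper: roughly 'for value in iterable: body(value)'."""
    iterator = iter(iterable)       # the GET_ITER step
    while True:
        try:
            value = next(iterator)  # the FOR_ITER step
        except StopIteration:
            break                   # end of iteration: leave the loop quietly
        body(value)


collected = []
run_for_loop("abc", collected.append)
```

Any exception other than StopIteration is not caught here, so it
propagates to the caller, matching the third case listed above.
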
Iterators ought to implement the tp_iter slot as returning a
reference to themselves; this is needed to make it possible to
use an iterator (as opposed to a sequence) in a for loop.

Discussion: should the next() method be renamed to __next__()?
Every other method corresponding to a tp_<something> slot has a
special name. On the other hand, this would suggest that there
should also be a primitive operation next(x) that would call
x.__next__(), and this just looks like adding complexity without
benefit. So I think it's better to stick with next(). On the
other hand, Marc-Andre Lemburg points out: "Even though .next()
reads better, I think that we should stick to the convention that
interpreter APIs use the __xxx__ naming scheme. Otherwise, people
will have a hard time differentiating between user-level protocols
and interpreter-level ones. AFAIK, .next() would be the first
low-level API not using this convention." My (BDFL's) response:
there are other important protocols with a user-level name
(e.g. keys()), and I don't see the importance of this particular
rule. BDFL pronouncement: this topic is closed. next() it is.


Python API Specification

The StopIteration exception is made visible as one of the
standard exceptions. It is derived from Exception.

A new built-in function is defined, iter(), which can be called in
two ways:

- iter(obj) calls PyObject_GetIter(obj).

- iter(callable, sentinel) returns a special kind of iterator that
  calls the callable to produce a new value, and compares the
  return value to the sentinel value. If the return value equals
  the sentinel, this signals the end of the iteration and
  StopIteration is raised rather than returning normally; if the
  return value does not equal the sentinel, it is returned as the
  next value from the iterator. If the callable raises an
  exception, this is propagated normally; in particular, the
  function is allowed to raise StopIteration as an alternative way
  to end the iteration. (This functionality is available from the C
  API as PyCallIter_New(callable, sentinel).)

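The two-argument form is easiest to see with a small example. In
this sketch the list pending and the function take_next() are
hypothetical; any callable taking no arguments would do.

```python
pending = [3, 1, 4, 0, 5]


def take_next():
    """Hypothetical callable: returns the next queued number on each call."""
    return pending.pop(0)


# Calls take_next() until it returns the sentinel 0; the sentinel is not yielded.
before_sentinel = list(iter(take_next, 0))
```

The value 5 is left in pending, because the iterator stops calling
take_next() as soon as the sentinel has been seen.
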
Iterator objects returned by either form of iter() have a next()
method. This method either returns the next value in the
iteration, or raises StopIteration (or a derived exception class)
to signal the end of the iteration. Any other exception should be
considered to signify an error and should be propagated normally,
not taken to mean the end of the iteration.

Classes can define how they are iterated over by defining an
__iter__() method; this should take no additional arguments and
return a valid iterator object. A class is a valid iterator
object when it defines a next() method that behaves as described
above. A class that wants to be an iterator also ought to
implement __iter__() returning itself.

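A class following these rules might look as follows. This is a
sketch only: the class Countdown is hypothetical, and the
__next__ alias is an assumption added so that the example also
runs on Python 3, which looks the method up under that name.

```python
class Countdown:
    """Hypothetical iterator that counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself, so it can be used directly in a 'for' loop.
        return self

    def next(self):
        # The method name used in this document.
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

    # Assumption: Python 3 calls __next__(), so make it the same function.
    __next__ = next


counted = [n for n in Countdown(3)]
```
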
Discussion:

- The name iter() is an abbreviation. Alternatives proposed
  include iterate(), harp(), traverse(), narrate().

- Using the same name for two different operations (getting an
  iterator from an object and making an iterator for a function
  with a sentinel value) is somewhat ugly. I haven't seen a
  better name for the second operation though.

- There's a bit of undefined behavior for iterators: once a
  particular iterator object has raised StopIteration, will it
  also raise StopIteration on all subsequent next() calls? Some
  say that it would be useful to require this, others say that it
  is useful to leave this open to individual iterators. Note that
  this may require an additional state bit for some iterator
  implementations (e.g. function-wrapping iterators).

- Some folks have requested the ability to restart an iterator. I
  believe this should be dealt with by calling iter() on a
  sequence repeatedly, not by the iterator protocol itself.

- It was originally proposed that rather than having a next()
  method, an iterator object should simply be callable. This was
  rejected in favor of an explicit next() method. The reason is
  clarity: if you don't know the code very well, "x = s()" does
  not give a hint about what it does; but "x = s.next()" is pretty
  clear. BDFL pronouncement: this topic is closed. next() it is.

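The points about exhausted iterators and restarting can be tried
out on a list. This sketch shows how the built-in list iterator
happens to behave in current CPython; it is not a statement about
what the protocol described here requires of other iterators.

```python
letters = ["a", "b", "c"]

first_pass = iter(letters)
consumed = list(first_pass)    # drains the iterator
leftover = list(first_pass)    # the exhausted list iterator keeps reporting the end

# "Restarting" means asking the sequence for a fresh iterator.
second_pass = list(iter(letters))
```
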

Dictionary Iterators

The following two proposals are somewhat controversial. They are
also independent from the main iterator implementation. However,
they are both very useful.

- Dictionaries implement a sq_contains slot that implements the
  same test as the has_key() method. This means that we can write

      if k in dict: ...

  which is equivalent to

      if dict.has_key(k): ...

- Dictionaries implement a tp_iter slot that returns an efficient
  iterator that iterates over the keys of the dictionary. During
  such an iteration, the dictionary should not be modified, except
  that setting the value for an existing key is allowed (deletions
  or additions are not, nor is the update() method). This means
  that we can write

      for k in dict: ...

  which is equivalent to, but much faster than

      for k in dict.keys(): ...

  as long as the restriction on modifications to the dictionary
  (either by the loop or by another thread) is not violated.

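Both proposals, and the restriction on modification, can be
illustrated together. This is a sketch assuming current CPython 3,
where has_key() no longer exists and "k in dict" is the only
spelling; the RuntimeError shown is what that interpreter happens
to do when a key is added during iteration, not something this
document specifies. The dictionary inventory is hypothetical.

```python
inventory = {"apples": 3, "pears": 0}

has_apples = "apples" in inventory      # membership tests the keys
has_three = 3 in inventory              # values are not consulted

keys_seen = []
for name in inventory:                  # iterates over the keys
    inventory[name] += 1                # updating an existing key is allowed
    keys_seen.append(name)

try:
    for name in inventory:
        inventory[name + "_copy"] = 0   # adding a key while iterating is not allowed
except RuntimeError:
    resize_rejected = True
else:
    resize_rejected = False
```
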
There is no doubt that the dict.has_key(x) interpretation of "x
in dict" is by far the most useful interpretation, probably the
only useful one. There has been resistance against this because
"x in list" checks whether x is present among the values, while
the proposal makes "x in dict" check whether x is present among
the keys. Given that the symmetry between lists and dictionaries
is very weak, this argument does not have much weight.

The main discussion focuses on whether

    for x in dict: ...

should assign x the successive keys, values, or items of the
dictionary. The symmetry between "if x in y" and "for x in y"
suggests that it should iterate over keys. This symmetry has been
observed by many independently and has even been used to "explain"
one using the other. This is because for sequences, "if x in y"
iterates over y comparing the iterated values to x. If we adopt
both of the above proposals, this will also hold for
dictionaries.

The argument against making "for x in dict" iterate over the keys
comes mostly from a practicality point of view: scans of the
standard library show that there are about as many uses of "for x
in dict.items()" as there are of "for x in dict.keys()", with the
items() version having a small majority. Presumably many of the
loops using keys() use the corresponding value anyway, by writing
dict[x], so (the argument goes) by making both the key and value
available, we could support the largest number of cases. While
this is true, I (Guido) find the correspondence between "for x in
dict" and "if x in dict" too compelling to break, and there's not
much overhead in having to write dict[x] to explicitly get the
value. We could also add methods to dictionaries that return
different kinds of iterators, e.g.

    for key, value in dict.iteritems(): ...

    for value in dict.itervalues(): ...

    for key in dict.iterkeys(): ...

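The two styles weighed in this argument can be put side by side.
This sketch assumes Python 3, where the pair iterator is spelled
items() rather than the iteritems() suggested above; the
dictionary ages is hypothetical.

```python
ages = {"ada": 36, "grace": 45}

# Key iteration, looking the value up explicitly with ages[name].
via_keys = [(name, ages[name]) for name in ages]

# Assumption: Python 3 spelling of the key/value iterator.
via_items = [(name, age) for name, age in ages.items()]
```

Both loops visit the same pairs; the first pays for one extra
lookup per key, which is the overhead referred to above.
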

File Iterators

The following proposal is not controversial, but should be
considered a separate step after introducing the iterator
framework described above. It is useful because it provides us
with a good answer to the complaint that the common idiom to
iterate over the lines of a file is ugly and slow.

- Files implement a tp_iter slot that is equivalent to
  iter(f.readline, ""). This means that we can write

      for line in file:
          ...

  as a shorthand for

      for line in iter(file.readline, ""):
          ...

  which is equivalent to, but faster than

      while 1:
          line = file.readline()
          if not line:
              break
          ...

This also shows that some iterators are destructive: they consume
all the values and a second iterator cannot easily be created that
iterates independently over the same values. You could open the
file for a second time, or seek() to the beginning, but these
solutions don't work for all file types, e.g. they don't work when
the open file object really represents a pipe or a stream socket.

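The destructive behaviour is easy to reproduce. In this sketch an
io.StringIO object stands in for a real text file, which is an
assumption made to keep the example self-contained.

```python
import io

# Assumption: io.StringIO stands in for a real text file.
stream = io.StringIO("first\nsecond\n")

direct = [line for line in stream]   # 'for line in file'
again = [line for line in stream]    # the lines are already consumed

stream.seek(0)                       # works here, but not for a pipe or socket
via_readline = list(iter(stream.readline, ""))  # the two-argument iter() spelling
```
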

Rationale

If all the parts of the proposal are included, this addresses many
concerns in a consistent and flexible fashion. Among its chief
virtues are the following three -- no, four -- no, five -- points:

1. It provides an extensible iterator interface.

2. It allows performance enhancements to list iteration.

3. It allows big performance enhancements to dictionary iteration.

4. It allows one to provide an interface for just iteration
   without pretending to provide random access to elements.

5. It is backward-compatible with all existing user-defined
   classes and extension objects that emulate sequences and
   mappings, even mappings that only implement a subset of
   {__getitem__, keys, values, items}.


Mailing Lists

The iterator protocol has been discussed extensively in a mailing
list on SourceForge:

    http://lists.sourceforge.net/lists/listinfo/python-iterators

Initially, some of the discussion was carried out at Yahoo;
archives are still accessible:

    http://groups.yahoo.com/group/python-iter

Copyright

This document is in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: