Almost completely rewritten, focusing on documenting the current state
of affairs, filling in some things still under discussion. Ping, I hope this is okay with you. If you want to revive "for keys:values in dict" etc., you'll write a separate PEP, right?
parent
bad454ef15
commit
31a363c4f5
482
pep-0234.txt
|
@@ -1,7 +1,7 @@
|
|||
PEP: 234
|
||||
Title: Iterators
|
||||
Version: $Revision$
|
||||
Author: ping@lfw.org (Ka-Ping Yee)
|
||||
Author: ping@lfw.org (Ka-Ping Yee), guido@python.org (Guido van Rossum)
|
||||
Status: Draft
|
||||
Type: Standards Track
|
||||
Python-Version: 2.1
|
||||
|
@@ -13,228 +13,244 @@ Abstract
|
|||
This document proposes an iteration interface that objects can
|
||||
provide to control the behaviour of 'for' loops. Looping is
|
||||
customized by providing a method that produces an iterator object.
|
||||
The iterator should be a callable object that returns the next
|
||||
item in the sequence each time it is called, raising an exception
|
||||
when no more items are available.
|
||||
The iterator provides a 'get next value' operation that produces
|
||||
the next item in the sequence each time it is called, raising an
|
||||
exception when no more items are available.
|
||||
|
||||
In addition, specific iterators over the keys of a dictionary and
|
||||
over the lines of a file are proposed, and a proposal is made to
|
||||
allow spelling dict.has_key(key) as "key in dict".
|
||||
|
||||
Note: this is an almost complete rewrite of this PEP by the second
|
||||
author, describing the actual implementation checked into the
|
||||
trunk of the Python 2.2 CVS tree. It is still open for
|
||||
discussion. Some of the more esoteric proposals in the original
|
||||
version of this PEP have been withdrawn for now; these may be the
|
||||
subject of a separate PEP in the future.
|
||||
|
||||
|
||||
Copyright
|
||||
C API Specification
|
||||
|
||||
This document is in the public domain.
|
||||
A new exception is defined, StopIteration, which can be used to
|
||||
signal the end of an iteration.
|
||||
|
||||
A new slot named tp_iter for requesting an iterator is added to
|
||||
the type object structure. This should be a function of one
|
||||
PyObject * argument returning a PyObject *, or NULL. To use this
|
||||
slot, a new C API function PyObject_GetIter() is added, with the
|
||||
same signature as the tp_iter slot function.
|
||||
|
||||
Another new slot, named tp_iternext, is added to the type
|
||||
structure, for obtaining the next value in the iteration. To use
|
||||
this slot, a new C API function PyIter_Next() is added. The
|
||||
signature for both the slot and the API function is as follows:
|
||||
the argument is a PyObject * and so is the return value. When the
|
||||
return value is non-NULL, it is the next value in the iteration.
|
||||
When it is NULL, there are three possibilities:
|
||||
|
||||
- No exception is set; this implies the end of the iteration.
|
||||
|
||||
- The StopIteration exception (or a derived exception class) is
|
||||
set; this implies the end of the iteration.
|
||||
|
||||
- Some other exception is set; this means that an error occurred
|
||||
that should be propagated normally.
|
||||
|
||||
In addition to the tp_iternext slot, every iterator object must
|
||||
also implement a next() method, callable without arguments. This
|
||||
should have the same semantics as the tp_iternext slot function,
|
||||
except that the only way to signal the end of the iteration is to
|
||||
raise StopIteration. The iterator object should not care whether
|
||||
its tp_iternext slot function is called or its next() method, and
|
||||
the caller may mix calls arbitrarily. (The next() method is for
|
||||
the benefit of Python code using iterators directly; the
|
||||
tp_iternext slot is added to make 'for' loops more efficient.)
|
||||
|
||||
To ensure binary backwards compatibility, a new flag
|
||||
Py_TPFLAGS_HAVE_ITER is added to the set of flags in the tp_flags
|
||||
field, and to the default flags macro. This flag must be tested
|
||||
before accessing the tp_iter or tp_iternext slots. The macro
|
||||
PyIter_Check() tests whether an object has the appropriate flag
|
||||
set and has a non-NULL tp_iternext slot. There is no such macro
|
||||
for the tp_iter slot (since the only place where this slot is
|
||||
referenced should be PyObject_GetIter()).
|
||||
|
||||
(Note: the tp_iter slot can be present on any object; the
|
||||
tp_iternext slot should only be present on objects that act as
|
||||
iterators.)
|
||||
|
||||
For backwards compatibility, the PyObject_GetIter() function
|
||||
implements fallback semantics when its argument is a sequence that
|
||||
does not implement a tp_iter function: a lightweight sequence
|
||||
iterator object is constructed in that case which iterates over
|
||||
the items of the sequence in the natural order.
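The fallback described above can be observed from Python code. The
following is a minimal sketch, written for a modern Python 3
interpreter rather than the 2.2 tree this PEP describes; the class
name Countdown is hypothetical. The class defines only __getitem__,
so iter() falls back to the sequence iterator, which indexes from 0
upwards until IndexError is raised.

```python
class Countdown:
    """Hypothetical sequence-like class that defines no __iter__ method."""

    def __init__(self, start):
        self.start = start

    def __getitem__(self, index):
        # Raising IndexError tells the fallback sequence iterator to stop.
        if index < 0 or index >= self.start:
            raise IndexError(index)
        return self.start - index


# iter() finds no __iter__ and builds a sequence iterator instead.
items = list(iter(Countdown(3)))
print(items)
```

Existing sequence classes therefore keep working in 'for' loops
without any change.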
|
||||
|
||||
The Python bytecode generated for 'for' loops is changed to use
|
||||
new opcodes, GET_ITER and FOR_ITER, that use the iterator protocol
|
||||
rather than the sequence protocol to get the next value for the
|
||||
loop variable. This makes it possible to use a 'for' loop to loop
|
||||
over non-sequence objects that support the tp_iter slot. Other
|
||||
places where the interpreter loops over the values of a sequence
|
||||
should also be changed to use iterators.
|
||||
|
||||
Iterators ought to implement the tp_iter slot as returning a
|
||||
reference to themselves; this is needed to make it possible to
|
||||
use an iterator (as opposed to a sequence) in a for loop.
|
||||
|
||||
|
||||
Sequence Iterators
|
||||
Python API Specification
|
||||
|
||||
A new field named 'sq_iter' for requesting an iterator is added
|
||||
to the PySequenceMethods table. Upon an attempt to iterate over
|
||||
an object with a loop such as
|
||||
The StopIteration exception is made visible as one of the
|
||||
standard exceptions. It is derived from Exception.
|
||||
|
||||
for item in sequence:
|
||||
...body...
|
||||
A new built-in function is defined, iter(), which can be called in
|
||||
two ways:
|
||||
|
||||
the interpreter looks for the 'sq_iter' of the 'sequence' object.
|
||||
If the method exists, it is called to get an iterator; it should
|
||||
return a callable object. If the method does not exist, the
|
||||
interpreter produces a built-in iterator object in the following
|
||||
manner (described in Python here, but implemented in the core):
|
||||
- iter(obj) calls PyObject_GetIter(obj).
|
||||
|
||||
def make_iterator(sequence):
|
||||
def iterator(sequence=sequence, index=[0]):
|
||||
item = sequence[index[0]]
|
||||
index[0] += 1
|
||||
return item
|
||||
return iterator
|
||||
- iter(callable, sentinel) returns a special kind of iterator that
|
||||
calls the callable to produce a new value, and compares the
|
||||
return value to the sentinel value. If the return value equals
|
||||
the sentinel, this signals the end of the iteration and
|
||||
StopIteration is raised rather than returning normally; if the
|
||||
return value does not equal the sentinel, it is returned as the
|
||||
next value from the iterator. If the callable raises an
|
||||
exception, this is propagated normally; in particular, the
|
||||
function is allowed to raise StopIteration as an alternative way to
|
||||
end the iteration. (This functionality is available from the C
|
||||
API as PyCallIter_New(callable, sentinel).)
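Both ways of ending a two-argument iter() can be sketched as follows.
This is an illustration for a modern Python 3 interpreter, not code
from the implementation; io.StringIO stands in for a real file, and
make_counter is a hypothetical helper.

```python
import io

stream = io.StringIO("alpha\nbeta\n")

# readline() returns "" at end of input, so "" serves as the sentinel.
lines = list(iter(stream.readline, ""))


def make_counter(limit):
    """Hypothetical helper: returns a callable that ends the iteration itself."""
    state = {"n": 0}

    def step():
        if state["n"] >= limit:
            # Alternative way to end the iteration: raise StopIteration.
            raise StopIteration
        state["n"] += 1
        return state["n"]

    return step


# The sentinel None is never returned, so only StopIteration ends the loop.
counts = list(iter(make_counter(3), None))
print(lines, counts)
```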
|
||||
|
||||
To execute the above 'for' loop, the interpreter would proceed as
|
||||
follows, where 'iterator' is the iterator that was obtained:
|
||||
Iterator objects returned by either form of iter() have a next()
|
||||
method. This method either returns the next value in the
|
||||
iteration, or raises StopIteration (or a derived exception class) to
|
||||
signal the end of the iteration. Any other exception should be
|
||||
considered to signify an error and should be propagated normally,
|
||||
not taken to mean the end of the iteration.
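In terms of this protocol, a 'for' loop behaves roughly like the
sketch below. It is an approximation for a modern Python 3
interpreter, where the built-in next(iterator) takes the place of the
iterator.next() method call described here; run_loop and failing are
hypothetical names.

```python
def run_loop(iterable, body):
    """Hypothetical helper that mimics 'for item in iterable: body(item)'."""
    iterator = iter(iterable)
    while True:
        try:
            item = next(iterator)  # spelled iterator.next() in this PEP
        except StopIteration:
            break  # normal end of the iteration
        body(item)


seen = []
run_loop("abc", seen.append)


def failing():
    raise ValueError("not the end of the iteration")


# Any exception other than StopIteration propagates out of the loop.
try:
    run_loop(iter(failing, None), seen.append)
    propagated = False
except ValueError:
    propagated = True
print(seen, propagated)
```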
|
||||
|
||||
while 1:
|
||||
try:
|
||||
item = iterator()
|
||||
except IndexError:
|
||||
break
|
||||
...body...
|
||||
Classes can define how they are iterated over by defining an
|
||||
__iter__() method; this should take no additional arguments and
|
||||
return a valid iterator object. A class is a valid iterator
|
||||
object when it defines a next() method that behaves as described
|
||||
above. A class that wants to be an iterator also ought to
|
||||
implement __iter__() returning itself.
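A class following these rules might look like the sketch below. It
assumes a modern Python 3 interpreter, where the method is spelled
__next__() rather than next(); the alias at the end of the class
keeps the spelling used in this PEP available as well. The class name
Squares is hypothetical.

```python
class Squares:
    """Hypothetical iterator over the squares 0, 1, 4, ... below a limit."""

    def __init__(self, limit):
        self.limit = limit
        self.n = 0

    def __iter__(self):
        # An iterator returns itself, so it can be used in a 'for' loop.
        return self

    def __next__(self):
        value = self.n * self.n
        if value >= self.limit:
            raise StopIteration  # the only way to signal the end
        self.n += 1
        return value

    next = __next__  # the method name used in this PEP


it = Squares(20)
same_object = iter(it) is it
squares = list(it)
print(same_object, squares)
```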
|
||||
|
||||
(Note that the 'break' above doesn't translate to a "real" Python
|
||||
break, since it would go to the 'else:' clause of the loop whereas
|
||||
a "real" break in the body would skip the 'else:' clause.)
|
||||
There is some controversy here:
|
||||
|
||||
The list() and tuple() built-in functions would be updated to use
|
||||
this same iterator logic to retrieve the items in their argument.
|
||||
- The name iter() is an abbreviation. Alternatives proposed
|
||||
include iterate(), harp(), traverse(), narrate().
|
||||
|
||||
List and tuple objects would implement the 'sq_iter' method by
|
||||
calling the built-in make_iterator() routine just described.
|
||||
|
||||
Instance objects would implement the 'sq_iter' method as follows:
|
||||
|
||||
if hasattr(self, '__iter__'):
|
||||
return self.__iter__()
|
||||
elif hasattr(self, '__getitem__'):
|
||||
return make_iterator(self)
|
||||
else:
|
||||
raise TypeError, self.__class__.__name__ + \
|
||||
' instance does not support iteration'
|
||||
|
||||
Extension objects can implement 'sq_iter' however they wish, as
|
||||
long as they return a callable object.
|
||||
- Using the same name for two different operations (getting an
|
||||
iterator from an object and making an iterator for a function
|
||||
with a sentinel value) is somewhat ugly. I haven't seen a
|
||||
better name for the second operation though.
|
||||
|
||||
|
||||
Mapping Iterators
|
||||
Dictionary Iterators
|
||||
|
||||
An additional proposal from Guido is to provide special syntax
|
||||
for iterating over mappings. The loop:
|
||||
The following two proposals are somewhat controversial. They are
|
||||
also independent from the main iterator implementation. However,
|
||||
they are both very useful.
|
||||
|
||||
for key:value in mapping:
|
||||
- Dictionaries implement a sq_contains slot that implements the
|
||||
same test as the has_key() method. This means that we can write
|
||||
|
||||
would bind both 'key' and 'value' to a key-value pair from the
|
||||
mapping on each iteration. Tim Peters suggested that similarly,
|
||||
if k in dict: ...
|
||||
|
||||
for key: in mapping:
|
||||
which is equivalent to
|
||||
|
||||
could iterate over just the keys and
|
||||
if dict.has_key(k): ...
|
||||
|
||||
for :value in mapping:
|
||||
- Dictionaries implement a tp_iter slot that returns an efficient
|
||||
iterator that iterates over the keys of the dictionary. During
|
||||
such an iteration, the dictionary should not be modified, except
|
||||
that setting the value for an existing key is allowed (deletions
|
||||
or additions are not, nor is the update() method). This means
|
||||
that we can write
|
||||
|
||||
could iterate over just the values.
|
||||
for k in dict: ...
|
||||
|
||||
The syntax is unambiguous since the new colon is currently not
|
||||
permitted in this position in the grammar.
|
||||
which is equivalent to, but much faster than
|
||||
|
||||
This behaviour would be provided by additional methods in the
|
||||
PyMappingMethods table: 'mp_iteritems', 'mp_iterkeys', and
|
||||
'mp_itervalues' respectively. 'mp_iteritems' is expected to
|
||||
produce a callable object that returns a (key, value) tuple;
|
||||
'mp_iterkeys' and 'mp_itervalues' are expected to produce a
|
||||
callable object that returns a single key or value.
|
||||
for k in dict.keys(): ...
|
||||
|
||||
The implementations of these methods on instance objects would
|
||||
then check for and call the '__iteritems__', '__iterkeys__',
|
||||
and '__itervalues__' methods respectively.
|
||||
as long as the restriction on modifications to the dictionary
|
||||
(either by the loop or by another thread) is not violated.
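The two dictionary proposals can be illustrated together. The sketch
below assumes a modern Python 3 interpreter; the RuntimeError raised
when a key is added during iteration is the behaviour of current
CPython and is not something this text specifies.

```python
ages = {"ann": 31, "bob": 27}

# 'in' tests the keys, like has_key(), not the values.
has_ann = "ann" in ages
has_31 = 31 in ages

# Iterating over the dictionary yields its keys.
keys = sorted(k for k in ages)

# Setting the value for an existing key during iteration is allowed.
for name in ages:
    ages[name] += 1

# Adding a key during iteration violates the restriction.
try:
    for name in ages:
        ages[name + "_copy"] = 0
    size_error = False
except RuntimeError:
    size_error = True
print(has_ann, has_31, keys, size_error)
```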
|
||||
|
||||
When 'mp_iteritems', 'mp_iterkeys', or 'mp_itervalues' is missing,
|
||||
the default behaviour is to do make_iterator(mapping.items()),
|
||||
make_iterator(mapping.keys()), or make_iterator(mapping.values())
|
||||
respectively, using the definition of make_iterator() above.
|
||||
There is no doubt that the dict.has_key(x) interpretation of "x
|
||||
in dict" is by far the most useful interpretation, probably the
|
||||
only useful one. There has been resistance against this because
|
||||
"x in list" checks whether x is present among the values, while
|
||||
the proposal makes "x in dict" check whether x is present among
|
||||
the keys. Given that the symmetry between lists and dictionaries
|
||||
is very weak, this argument does not have much weight.
|
||||
|
||||
The main discussion focuses on whether
|
||||
|
||||
for x in dict: ...
|
||||
|
||||
should assign x the successive keys, values, or items of the
|
||||
dictionary. The symmetry between "if x in y" and "for x in y"
|
||||
suggests that it should iterate over keys. This symmetry has been
|
||||
observed by many independently and has even been used to "explain"
|
||||
one using the other. This is because for sequences, "if x in y"
|
||||
iterates over y comparing the iterated values to x. If we adopt
|
||||
both of the above proposals, this will also hold for
|
||||
dictionaries.
|
||||
|
||||
The argument against making "for x in dict" iterate over the keys
|
||||
comes mostly from a practicality point of view: scans of the
|
||||
standard library show that there are about as many uses of "for x
|
||||
in dict.items()" as there are of "for x in dict.keys()", with the
|
||||
items() version having a small majority. Presumably many of the
|
||||
loops using keys() use the corresponding value anyway, by writing
|
||||
dict[x], so (the argument goes) by making both the key and value
|
||||
available, we could support the largest number of cases. While
|
||||
this is true, I (Guido) find the correspondence between "for x in
|
||||
dict" and "if x in dict" too compelling to break, and there's not
|
||||
much overhead in having to write dict[x] to explicitly get the
|
||||
value. We could also add methods to dictionaries that return
|
||||
different kinds of iterators, e.g.
|
||||
|
||||
for key, value in dict.iteritems(): ...
|
||||
|
||||
for value in dict.itervalues(): ...
|
||||
|
||||
for key in dict.iterkeys(): ...
|
||||
|
||||
|
||||
Indexing Sequences
|
||||
File Iterators
|
||||
|
||||
The special syntax described above can be applied to sequences
|
||||
as well, to provide the long-hoped-for ability to obtain the
|
||||
indices of a sequence without the strange-looking 'range(len(x))'
|
||||
expression.
|
||||
The following proposal is not controversial, but should be
|
||||
considered a separate step after introducing the iterator
|
||||
framework described above. It is useful because it provides us
|
||||
with a good answer to the complaint that the common idiom to
|
||||
iterate over the lines of a file is ugly and slow.
|
||||
|
||||
for index:item in sequence:
|
||||
- Files implement a tp_iter slot that is equivalent to
|
||||
iter(f.readline, ""). This means that we can write
|
||||
|
||||
causes 'index' to be bound to the index of each item as 'item' is
|
||||
bound to the items of the sequence in turn, and
|
||||
for line in file:
|
||||
...
|
||||
|
||||
for index: in sequence:
|
||||
as a shorthand for
|
||||
|
||||
simply causes 'index' to start at 0 and increment until an attempt
|
||||
to get sequence[index] produces an IndexError. For completeness,
|
||||
for line in iter(file.readline, ""):
|
||||
...
|
||||
|
||||
for :item in sequence:
|
||||
which is equivalent to, but faster than
|
||||
|
||||
is equivalent to
|
||||
while 1:
|
||||
line = file.readline()
|
||||
if not line:
|
||||
break
|
||||
...
|
||||
|
||||
for item in sequence:
|
||||
|
||||
In each case we try to request an appropriate iterator from the
|
||||
sequence. In summary:
|
||||
|
||||
for k:v in x looks for mp_iteritems, then sq_iter
|
||||
for k: in x looks for mp_iterkeys, then sq_iter
|
||||
for :v in x looks for mp_itervalues, then sq_iter
|
||||
for v in x looks for sq_iter
|
||||
|
||||
If we fall back to sq_iter in the first two cases, we generate
|
||||
indices for k as needed, by starting at 0 and incrementing.
|
||||
|
||||
The implementation of the mp_iter* methods on instance objects
|
||||
then checks for methods in the following order:
|
||||
|
||||
mp_iteritems __iteritems__, __iter__, items, __getitem__
|
||||
mp_iterkeys __iterkeys__, __iter__, keys, __getitem__
|
||||
mp_itervalues __itervalues__, __iter__, values, __getitem__
|
||||
sq_iter __iter__, __getitem__
|
||||
|
||||
If a __iteritems__, __iterkeys__, or __itervalues__ method is
|
||||
found, we just call it and use the resulting iterator. If a
|
||||
mp_* function finds no such method but finds __iter__ instead,
|
||||
we generate indices as needed.
|
||||
|
||||
Upon finding an items(), keys(), or values() method, we use
|
||||
make_iterator(x.items()), make_iterator(x.keys()), or
|
||||
make_iterator(x.values()) respectively. Upon finding a
|
||||
__getitem__ method, we use it and generate indices as needed.
|
||||
|
||||
For example, the complete implementation of the mp_iteritems
|
||||
method for instances can be roughly described as follows:
|
||||
|
||||
def mp_iteritems(thing):
|
||||
if hasattr(thing, '__iteritems__'):
|
||||
return thing.__iteritems__()
|
||||
if hasattr(thing, '__iter__'):
|
||||
def iterator(sequence=thing, index=[0]):
|
||||
item = (index[0], sequence.__iter__())
|
||||
index[0] += 1
|
||||
return item
|
||||
return iterator
|
||||
if hasattr(thing, 'items'):
|
||||
return make_iterator(thing.items())
|
||||
if hasattr(thing, '__getitem__'):
|
||||
def iterator(sequence=thing, index=[0]):
|
||||
item = (index[0], sequence[index[0]])
|
||||
index[0] += 1
|
||||
return item
|
||||
return iterator
|
||||
raise TypeError, thing.__class__.__name__ + \
|
||||
' instance does not support iteration over items'
|
||||
|
||||
|
||||
Examples
|
||||
|
||||
Here is a class written in Python that represents the sequence of
|
||||
lines in a file.
|
||||
|
||||
class FileLines:
|
||||
def __init__(self, filename):
|
||||
self.file = open(filename)
|
||||
def __iter__(self):
|
||||
def iter(self=self):
|
||||
line = self.file.readline()
|
||||
if line: return line
|
||||
else: raise IndexError
|
||||
return iter
|
||||
|
||||
for line in FileLines('spam.txt'):
|
||||
print line
|
||||
|
||||
And here's an interactive session demonstrating the proposed new
|
||||
looping syntax:
|
||||
|
||||
>>> for i:item in ['a', 'b', 'c']:
|
||||
... print i, item
|
||||
...
|
||||
0 a
|
||||
1 b
|
||||
2 c
|
||||
>>> for i: in 'abcdefg': # just the indices, please
|
||||
... print i,
|
||||
... print
|
||||
...
|
||||
0 1 2 3 4 5 6
|
||||
>>> for k:v in os.environ: # os.environ is an instance, but
|
||||
... print k, v # this still works because we fall
|
||||
... # back to calling items()
|
||||
MAIL /var/spool/mail/ping
|
||||
HOME /home/ping
|
||||
DISPLAY :0.0
|
||||
TERM xterm
|
||||
.
|
||||
.
|
||||
.
|
||||
This also shows that some iterators are destructive: they consume
|
||||
all the values and a second iterator cannot easily be created that
|
||||
iterates independently over the same values. You could open the
|
||||
file for a second time, or seek() to the beginning, but these
|
||||
solutions don't work for all file types, e.g. they don't work when
|
||||
the open file object really represents a pipe or a stream socket.
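The equivalence of the three spellings, and the destructive nature of
the file iterator, can be checked with the sketch below. It assumes a
modern Python 3 interpreter and uses io.StringIO as a stand-in for an
open file, which is seekable; a pipe or socket would not support the
seek() call.

```python
import io

text = "one\ntwo\nthree\n"

# 'for line in file' and the explicit iter(file.readline, "") form.
direct = list(io.StringIO(text))
via_sentinel = list(iter(io.StringIO(text).readline, ""))

# The traditional readline() loop.
loop = []
f = io.StringIO(text)
while True:
    line = f.readline()
    if not line:
        break
    loop.append(line)

# The iterator is destructive: a second pass finds nothing left.
f = io.StringIO(text)
first = list(f)
second = list(f)
f.seek(0)  # works here only because StringIO is seekable
third = list(f)
print(direct, second)
```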
|
||||
|
||||
|
||||
Rationale
|
||||
|
@@ -245,9 +261,9 @@ Rationale
|
|||
|
||||
1. It provides an extensible iterator interface.
|
||||
|
||||
2. It resolves the endless "i indexing sequence" debate.
|
||||
2. It allows performance enhancements to list iteration.
|
||||
|
||||
3. It allows performance enhancements to dictionary iteration.
|
||||
3. It allows big performance enhancements to dictionary iteration.
|
||||
|
||||
4. It allows one to provide an interface for just iteration
|
||||
without pretending to provide random access to elements.
|
||||
|
@@ -258,95 +274,9 @@ Rationale
|
|||
{__getitem__, keys, values, items}.
|
||||
|
||||
|
||||
Errors
|
||||
Copyright
|
||||
|
||||
Errors that occur during sq_iter, mp_iter*, or the __iter*__
|
||||
methods are allowed to propagate normally to the surface.
|
||||
|
||||
An attempt to do
|
||||
|
||||
for item in dict:
|
||||
|
||||
over a dictionary object still produces:
|
||||
|
||||
TypeError: loop over non-sequence
|
||||
|
||||
An attempt to iterate over an instance that provides neither
|
||||
__iter__ nor __getitem__ produces:
|
||||
|
||||
TypeError: instance does not support iteration
|
||||
|
||||
Similarly, an attempt to do mapping-iteration over an instance
|
||||
that doesn't provide the right methods should produce one of the
|
||||
following errors:
|
||||
|
||||
TypeError: instance does not support iteration over items
|
||||
TypeError: instance does not support iteration over keys
|
||||
TypeError: instance does not support iteration over values
|
||||
|
||||
It's an error for the iterator produced by __iteritems__ or
|
||||
mp_iteritems to return an object whose length is not 2:
|
||||
|
||||
TypeError: item iterator did not return a 2-tuple
|
||||
|
||||
|
||||
Open Issues
|
||||
|
||||
We could introduce a new exception type such as IteratorExit just
|
||||
for terminating loops rather than using IndexError. In this case,
|
||||
the implementation of make_iterator() would catch and translate an
|
||||
IndexError into an IteratorExit for backward compatibility.
|
||||
|
||||
We could provide access to the logic that calls either 'sq_item'
|
||||
or make_iterator() with an iter() function in the built-in module
|
||||
(just as the getattr() function provides access to 'tp_getattr').
|
||||
One possible motivation for this is to make it easier for the
|
||||
implementation of __iter__ to delegate iteration to some other
|
||||
sequence. Presumably we would then have to consider adding
|
||||
iteritems(), iterkeys(), and itervalues() as well.
|
||||
|
||||
An alternative way to let __iter__ delegate iteration to another
|
||||
sequence is for it to return another sequence. Upon detecting
|
||||
that the object returned by __iter__ is not callable, the
|
||||
interpreter could repeat the process of looking for an iterator
|
||||
on the new object. However, this process seems potentially
|
||||
convoluted and likely to produce more confusing error messages.
|
||||
|
||||
If we decide to add "freezing" ability to lists and dictionaries,
|
||||
it is suggested that the implementation of make_iterator
|
||||
automatically freeze any list or dictionary argument for the
|
||||
duration of the loop, and produce an error complaining about any
|
||||
attempt to modify it during iteration. Since it is relatively
|
||||
rare to actually want to modify it during iteration, this is
|
||||
likely to catch mistakes earlier. If a programmer wants to
|
||||
modify a list or dictionary during iteration, they should
|
||||
explicitly make a copy to iterate over using x[:], x.clone(),
|
||||
x.keys(), x.values(), or x.items().
|
||||
|
||||
For consistency with the 'key in dict' expression, we could
|
||||
support 'for key in dict' as equivalent to 'for key: in dict'.
|
||||
|
||||
|
||||
BDFL Pronouncements
|
||||
|
||||
The "parallel expression" to 'for key:value in mapping':
|
||||
|
||||
if key:value in mapping:
|
||||
|
||||
is infeasible since the first colon ends the "if" condition.
|
||||
The following compromise is technically feasible:
|
||||
|
||||
if (key:value) in mapping:
|
||||
|
||||
but the BDFL has pronounced a solid -1 on this.
|
||||
|
||||
The BDFL gave a +0.5 to:
|
||||
|
||||
for key:value in mapping:
|
||||
for index:item in sequence:
|
||||
|
||||
and a +0.2 to the variations where the part before or after
|
||||
the first colon is missing.
|
||||
This document is in the public domain.
|
||||
|
||||
|
||||
|
||||
|
|