PEP: 234
Title: Iterators
Version: $Revision$
Author: ping@lfw.org (Ka-Ping Yee), guido@python.org (Guido van Rossum)
Status: Draft
Type: Standards Track
Python-Version: 2.1
Created: 30-Jan-2001
Post-History:


Abstract

    This document proposes an iteration interface that objects can
    provide to control the behaviour of 'for' loops. Looping is
    customized by providing a method that produces an iterator object.
    The iterator provides a 'get next value' operation that produces
    the next item in the sequence each time it is called, raising an
    exception when no more items are available.

    In addition, specific iterators over the keys of a dictionary and
    over the lines of a file are proposed, and a proposal is made to
    allow spelling dict.has_key(key) as "key in dict".

    Note: this is an almost complete rewrite of this PEP by the second
    author, describing the actual implementation checked into the
    trunk of the Python 2.2 CVS tree. It is still open for
    discussion. Some of the more esoteric proposals in the original
    version of this PEP have been withdrawn for now; these may be the
    subject of a separate PEP in the future.


C API Specification

    A new exception is defined, StopIteration, which can be used to
    signal the end of an iteration.

    A new slot named tp_iter for requesting an iterator is added to
    the type object structure. This should be a function of one
    PyObject * argument returning a PyObject *, or NULL. To use this
    slot, a new C API function PyObject_GetIter() is added, with the
    same signature as the tp_iter slot function.

    Another new slot, named tp_iternext, is added to the type
    structure, for obtaining the next value in the iteration. To use
    this slot, a new C API function PyIter_Next() is added. The
    signature for both the slot and the API function is as follows:
    the argument is a PyObject * and so is the return value. When the
    return value is non-NULL, it is the next value in the iteration.
    When it is NULL, there are three possibilities:

    - No exception is set; this implies the end of the iteration.

    - The StopIteration exception (or a derived exception class) is
      set; this implies the end of the iteration.

    - Some other exception is set; this means that an error occurred
      that should be propagated normally.

    In addition to the tp_iternext slot, every iterator object must
    also implement a next() method, callable without arguments. This
    should have the same semantics as the tp_iternext slot function,
    except that the only way to signal the end of the iteration is to
    raise StopIteration. The iterator object should not care whether
    its tp_iternext slot function is called or its next() method, and
    the caller may mix calls arbitrarily. (The next() method is for
    the benefit of Python code using iterators directly; the
    tp_iternext slot is added to make 'for' loops more efficient.)

    To ensure binary backwards compatibility, a new flag
    Py_TPFLAGS_HAVE_ITER is added to the set of flags in the tp_flags
    field, and to the default flags macro. This flag must be tested
    before accessing the tp_iter or tp_iternext slots. The macro
    PyIter_Check() tests whether an object has the appropriate flag
    set and has a non-NULL tp_iternext slot. There is no such macro
    for the tp_iter slot (since the only place where this slot is
    referenced should be PyObject_GetIter()).

    (Note: the tp_iter slot can be present on any object; the
    tp_iternext slot should only be present on objects that act as
    iterators.)

    For backwards compatibility, the PyObject_GetIter() function
    implements fallback semantics when its argument is a sequence that
    does not implement a tp_iter function: a lightweight sequence
    iterator object is constructed in that case which iterates over
    the items of the sequence in the natural order.
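    The fallback for sequences and the next() contract described above
    can be illustrated with a minimal pure-Python sketch. The names
    SequenceIterator and Squares are hypothetical and exist only for
    this illustration; the real fallback iterator is implemented in C
    and may differ in detail. The __next__ alias is an assumption
    added so that the sketch also runs on later Python versions, where
    the method is spelled __next__().

```python
class SequenceIterator:
    """Hypothetical stand-in for the lightweight sequence iterator."""

    def __init__(self, seq):
        self.seq = seq
        self.index = 0

    def __iter__(self):
        # An iterator returns itself so it can be used in a 'for' loop.
        return self

    def next(self):
        try:
            item = self.seq[self.index]
        except IndexError:
            # Running off the end of the sequence ends the iteration.
            raise StopIteration
        self.index += 1
        return item

    # Assumption: later Python versions look up __next__ instead of next.
    __next__ = next


class Squares:
    """Sequence-like class that defines only __getitem__ (no __iter__)."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i * i


# Explicit use of the sketch iterator.
explicit = list(SequenceIterator(Squares(4)))

# The built-in fallback: iter() accepts an object with only __getitem__.
implicit = list(iter(Squares(4)))
```

    Both lists should hold the same items in the same order, which is
    the backwards-compatibility property the fallback is meant to
    provide.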
    The Python bytecode generated for 'for' loops is changed to use
    new opcodes, GET_ITER and FOR_ITER, that use the iterator protocol
    rather than the sequence protocol to get the next value for the
    loop variable. This makes it possible to use a 'for' loop to loop
    over non-sequence objects that support the tp_iter slot. Other
    places where the interpreter loops over the values of a sequence
    should also be changed to use iterators.

    Iterators ought to implement the tp_iter slot as returning a
    reference to themselves; this is needed to make it possible to use
    an iterator (as opposed to a sequence) in a for loop.

    Discussion: should the next() method be renamed to __next__()?
    Every other method corresponding to a tp_ slot has a special name.
    On the other hand, this would suggest that there should also be a
    primitive operation next(x) that would call x.__next__(), and this
    just looks like adding complexity without benefit. So I think
    it's better to stick with next().


Python API Specification

    The StopIteration exception is made visible as one of the
    standard exceptions. It is derived from Exception.

    A new built-in function is defined, iter(), which can be called in
    two ways:

    - iter(obj) calls PyObject_GetIter(obj).

    - iter(callable, sentinel) returns a special kind of iterator that
      calls the callable to produce a new value, and compares the
      return value to the sentinel value. If the return value equals
      the sentinel, this signals the end of the iteration and
      StopIteration is raised rather than returning normally; if the
      return value does not equal the sentinel, it is returned as the
      next value from the iterator. If the callable raises an
      exception, this is propagated normally; in particular, the
      function is allowed to raise StopIteration as an alternative way
      to end the iteration. (This functionality is available from the
      C API as PyCallIter_New(callable, sentinel).)

    Iterator objects returned by either form of iter() have a next() method.
    This method either returns the next value in the iteration, or
    raises StopIteration (or a derived exception class) to signal the
    end of the iteration. Any other exception should be considered to
    signify an error and should be propagated normally, not taken to
    mean the end of the iteration.

    Classes can define how they are iterated over by defining an
    __iter__() method; this should take no additional arguments and
    return a valid iterator object. A class is a valid iterator
    object when it defines a next() method that behaves as described
    above. A class that wants to be an iterator also ought to
    implement __iter__() returning itself.

    There is some controversy here:

    - The name iter() is an abbreviation. Alternatives proposed
      include iterate(), harp(), traverse(), narrate().

    - Using the same name for two different operations (getting an
      iterator from an object and making an iterator for a function
      with a sentinel value) is somewhat ugly. I haven't seen a
      better name for the second operation though.


Dictionary Iterators

    The following two proposals are somewhat controversial. They are
    also independent from the main iterator implementation. However,
    they are both very useful.

    - Dictionaries implement a sq_contains slot that implements the
      same test as the has_key() method. This means that we can write

          if k in dict: ...

      which is equivalent to

          if dict.has_key(k): ...

    - Dictionaries implement a tp_iter slot that returns an efficient
      iterator that iterates over the keys of the dictionary. During
      such an iteration, the dictionary should not be modified, except
      that setting the value for an existing key is allowed (deletions
      or additions are not, nor is the update() method). This means
      that we can write

          for k in dict: ...

      which is equivalent to, but much faster than

          for k in dict.keys(): ...

      as long as the restriction on modifications to the dictionary
      (either by the loop or by another thread) is not violated.
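    The two dictionary proposals above can be sketched as follows.
    The dictionary contents are made up for illustration. The final
    part relies on an assumption about current CPython behaviour
    (raising RuntimeError when a key is added during iteration); the
    text only says such modifications are not allowed and does not
    specify what happens if the restriction is violated.

```python
inventory = {"apples": 3, "pears": 5}

# Membership tests the keys, as has_key() does.
has_apples = "apples" in inventory
has_three = 3 in inventory  # values are not searched

# Iterating over the dictionary itself yields its keys.
keys_seen = sorted(k for k in inventory)

# Setting the value for an existing key during iteration is allowed.
for name in inventory:
    inventory[name] += 1

# Adding a key during iteration violates the restriction. Assumption:
# current CPython detects the size change and raises RuntimeError;
# this is an implementation detail, not part of the proposal text.
try:
    for name in inventory:
        inventory[name + "_copy"] = 0
    size_change_detected = False
except RuntimeError:
    size_change_detected = True
```

    Code that needs to add or delete keys while looping can iterate
    over a copy of the keys instead, for example the list returned by
    list(inventory).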
    There is no doubt that the dict.has_key(x) interpretation of "x
    in dict" is by far the most useful interpretation, probably the
    only useful one. There has been resistance against this because
    "x in list" checks whether x is present among the values, while
    the proposal makes "x in dict" check whether x is present among
    the keys. Given that the symmetry between lists and dictionaries
    is very weak, this argument does not have much weight.

    The main discussion focuses on whether

        for x in dict: ...

    should assign x the successive keys, values, or items of the
    dictionary. The symmetry between "if x in y" and "for x in y"
    suggests that it should iterate over keys. This symmetry has been
    observed by many independently and has even been used to "explain"
    one using the other. This is because for sequences, "if x in y"
    iterates over y comparing the iterated values to x. If we adopt
    both of the above proposals, this will also hold for
    dictionaries.

    The argument against making "for x in dict" iterate over the keys
    comes mostly from a practicality point of view: scans of the
    standard library show that there are about as many uses of "for x
    in dict.items()" as there are of "for x in dict.keys()", with the
    items() version having a small majority. Presumably many of the
    loops using keys() use the corresponding value anyway, by writing
    dict[x], so (the argument goes) by making both the key and value
    available, we could support the largest number of cases. While
    this is true, I (Guido) find the correspondence between "for x in
    dict" and "if x in dict" too compelling to break, and there's not
    much overhead in having to write dict[x] to explicitly get the
    value. We could also add methods to dictionaries that return
    different kinds of iterators, e.g.

        for key, value in dict.iteritems(): ...

        for value in dict.itervalues(): ...

        for key in dict.iterkeys(): ...
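    The symmetry argument between "if x in y" and "for x in y" can be
    made concrete with a small sketch. The helper contains() is
    hypothetical; it defines membership purely in terms of iteration,
    which is how the text describes "if x in y" for sequences. With
    key iteration for dictionaries, the same helper should agree with
    the built-in test for both lists and dictionaries.

```python
def contains(container, x):
    """Hypothetical helper: membership by iterating and comparing."""
    for item in container:
        if item == x:
            return True
    return False


letters = ["a", "b"]
ages = {"ann": 31, "bob": 27}

# For a list, iteration-based membership matches the 'in' operator.
list_agrees = all(
    contains(letters, x) == (x in letters) for x in ["a", "z"]
)

# For a dictionary, the two agree because iteration yields keys:
# a key is found, a value (31) or a missing key is not.
dict_agrees = all(
    contains(ages, x) == (x in ages) for x in ["ann", 31, "zed"]
)
```

    Had dictionary iteration produced values or items instead, the
    helper and the "in" test would disagree for keys such as "ann".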


File Iterators

    The following proposal is not controversial, but should be
    considered a separate step after introducing the iterator
    framework described above. It is useful because it provides us
    with a good answer to the complaint that the common idiom to
    iterate over the lines of a file is ugly and slow.

    - Files implement a tp_iter slot that is equivalent to
      iter(f.readline, ""). This means that we can write

          for line in file:
              ...

      as a shorthand for

          for line in iter(file.readline, ""):
              ...

      which is equivalent to, but faster than

          while 1:
              line = file.readline()
              if not line:
                  break
              ...

    This also shows that some iterators are destructive: they consume
    all the values and a second iterator cannot easily be created that
    iterates independently over the same values. You could open the
    file for a second time, or seek() to the beginning, but these
    solutions don't work for all file types, e.g. they don't work when
    the open file object really represents a pipe or a stream socket.


Rationale

    If all the parts of the proposal are included, this addresses many
    concerns in a consistent and flexible fashion. Among its chief
    virtues are the following three -- no, four -- no, five -- points:

    1. It provides an extensible iterator interface.

    2. It allows performance enhancements to list iteration.

    3. It allows big performance enhancements to dictionary iteration.

    4. It allows one to provide an interface for just iteration
       without pretending to provide random access to elements.

    5. It is backward-compatible with all existing user-defined
       classes and extension objects that emulate sequences and
       mappings, even mappings that only implement a subset of
       {__getitem__, keys, values, items}.


Copyright

    This document is in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: