diff --git a/pep-0234.txt b/pep-0234.txt index cfdbe679a..fa24c7321 100644 --- a/pep-0234.txt +++ b/pep-0234.txt @@ -1,7 +1,7 @@ PEP: 234 Title: Iterators Version: $Revision$ -Author: ping@lfw.org (Ka-Ping Yee) +Author: ping@lfw.org (Ka-Ping Yee), guido@python.org (Guido van Rossum) Status: Draft Type: Standards Track Python-Version: 2.1 @@ -13,228 +13,244 @@ Abstract This document proposes an iteration interface that objects can provide to control the behaviour of 'for' loops. Looping is customized by providing a method that produces an iterator object. - The iterator should be a callable object that returns the next - item in the sequence each time it is called, raising an exception - when no more items are available. + The iterator provides a 'get next value' operation that produces + the next item in the sequence each time it is called, raising an + exception when no more items are available. + + In addition, specific iterators over the keys of a dictionary and + over the lines of a file are proposed, and a proposal is made to + allow spelling dict.has_key(key) as "key in dict". + + Note: this is an almost complete rewrite of this PEP by the second + author, describing the actual implementation checked into the + trunk of the Python 2.2 CVS tree. It is still open for + discussion. Some of the more esoteric proposals in the original + version of this PEP have been withdrawn for now; these may be the + subject of a separate PEP in the future. -Copyright +C API Specification - This document is in the public domain. + A new exception is defined, StopIteration, which can be used to + signal the end of an iteration. + + A new slot named tp_iter for requesting an iterator is added to + the type object structure. This should be a function of one + PyObject * argument returning a PyObject *, or NULL. To use this + slot, a new C API function PyObject_GetIter() is added, with the + same signature as the tp_iter slot function.
+ + Another new slot, named tp_iternext, is added to the type + structure, for obtaining the next value in the iteration. To use + this slot, a new C API function PyIter_Next() is added. The + signature for both the slot and the API function is as follows: + the argument is a PyObject * and so is the return value. When the + return value is non-NULL, it is the next value in the iteration. + When it is NULL, there are three possibilities: + + - No exception is set; this implies the end of the iteration. + + - The StopIteration exception (or a derived exception class) is + set; this implies the end of the iteration. + + - Some other exception is set; this means that an error occurred + that should be propagated normally. + + In addition to the tp_iternext slot, every iterator object must + also implement a next() method, callable without arguments. This + should have the same semantics as the tp_iternext slot function, + except that the only way to signal the end of the iteration is to + raise StopIteration. The iterator object should not care whether + its tp_iternext slot function is called or its next() method, and + the caller may mix calls arbitrarily. (The next() method is for + the benefit of Python code using iterators directly; the + tp_iternext slot is added to make 'for' loops more efficient.) + + To ensure binary backwards compatibility, a new flag + Py_TPFLAGS_HAVE_ITER is added to the set of flags in the tp_flags + field, and to the default flags macro. This flag must be tested + before accessing the tp_iter or tp_iternext slots. The macro + PyIter_Check() tests whether an object has the appropriate flag + set and has a non-NULL tp_iternext slot. There is no such macro + for the tp_iter slot (since the only place where this slot is + referenced should be PyObject_GetIter()). + + (Note: the tp_iter slot can be present on any object; the + tp_iternext slot should only be present on objects that act as + iterators.) 
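The end-of-iteration rule described above can be sketched in Python. This is an illustration only, not the interpreter's implementation: the helper drive() and the class Broken are hypothetical names, and the sketch assumes a current Python, where the built-in next() function and the __next__() method play the role of the next() method described in this PEP.

```python
def drive(iterable):
    """Collect the values of `iterable` by driving the iterator protocol by hand.

    Hypothetical helper; it mirrors the steps a 'for' loop performs.
    """
    iterator = iter(iterable)       # request an iterator (the tp_iter step)
    values = []
    while True:
        try:
            value = next(iterator)  # obtain the next value (the tp_iternext step)
        except StopIteration:
            break                   # end of the iteration, not an error
        values.append(value)
    return values


class Broken:
    """Hypothetical iterator that fails instead of producing a value."""

    def __iter__(self):
        return self

    def __next__(self):
        # Any exception other than StopIteration is a real error.
        raise ValueError("boom")

    next = __next__  # the method name used in this PEP
```

Calling drive([1, 2, 3]) ends quietly when StopIteration is raised, while drive(Broken()) lets the ValueError propagate to the caller, matching the third case in the list above.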
+ + For backwards compatibility, the PyObject_GetIter() function + implements fallback semantics when its argument is a sequence that + does not implement a tp_iter function: a lightweight sequence + iterator object is constructed in that case which iterates over + the items of the sequence in the natural order. + + The Python bytecode generated for 'for' loops is changed to use + new opcodes, GET_ITER and FOR_ITER, that use the iterator protocol + rather than the sequence protocol to get the next value for the + loop variable. This makes it possible to use a 'for' loop to loop + over non-sequence objects that support the tp_iter slot. Other + places where the interpreter loops over the values of a sequence + should also be changed to use iterators. + + Iterators ought to implement the tp_iter slot as returning a + reference to themselves; this is needed to make it possible to + use an iterator (as opposed to a sequence) in a for loop. -Sequence Iterators +Python API Specification - A new field named 'sq_iter' for requesting an iterator is added - to the PySequenceMethods table. Upon an attempt to iterate over - an object with a loop such as + The StopIteration exception is made visible as one of the + standard exceptions. It is derived from Exception. - for item in sequence: - ...body... + A new built-in function is defined, iter(), which can be called in + two ways: - the interpreter looks for the 'sq_iter' of the 'sequence' object. - If the method exists, it is called to get an iterator; it should - return a callable object. If the method does not exist, the - interpreter produces a built-in iterator object in the following - manner (described in Python here, but implemented in the core): + - iter(obj) calls PyObject_GetIter(obj).
- def make_iterator(sequence): - def iterator(sequence=sequence, index=[0]): - item = sequence[index[0]] - index[0] += 1 - return item - return iterator + - iter(callable, sentinel) returns a special kind of iterator that + calls the callable to produce a new value, and compares the + return value to the sentinel value. If the return value equals + the sentinel, this signals the end of the iteration and + StopIteration is raised rather than returning normally; if the + return value does not equal the sentinel, it is returned as the + next value from the iterator. If the callable raises an + exception, this is propagated normally; in particular, the + function is allowed to raise StopIteration as an alternative way to + end the iteration. (This functionality is available from the C + API as PyCallIter_New(callable, sentinel).) - To execute the above 'for' loop, the interpreter would proceed as - follows, where 'iterator' is the iterator that was obtained: + Iterator objects returned by either form of iter() have a next() + method. This method either returns the next value in the + iteration, or raises StopIteration (or a derived exception class) to + signal the end of the iteration. Any other exception should be + considered to signify an error and should be propagated normally, + not taken to mean the end of the iteration. - while 1: - try: - item = iterator() - except IndexError: - break - ...body... + Classes can define how they are iterated over by defining an + __iter__() method; this should take no additional arguments and + return a valid iterator object. A class is a valid iterator + object when it defines a next() method that behaves as described + above. A class that wants to be an iterator also ought to + implement __iter__() returning itself. - (Note that the 'break' above doesn't translate to a "real" Python - break, since it would go to the 'else:' clause of the loop whereas - a "real" break in the body would skip the 'else:' clause.)
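As a rough sketch of the Python API described above, the following shows a class that acts as its own iterator and a use of the two-argument form of iter(). The class Countdown and the list named pending are hypothetical and exist only for illustration. The sketch assumes a current Python, where the method is spelled __next__(); the alias next is added to match the name used in this PEP.

```python
class Countdown:
    """Iterator that counts down from `start` to 1 (hypothetical example)."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself, so it can be used directly in a 'for' loop.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # the only way a method can signal the end
        value = self.current
        self.current -= 1
        return value

    next = __next__  # the method name used in this PEP


# One-argument form: iter(obj) asks the object for its iterator.
countdown_values = list(iter(Countdown(3)))

# Two-argument form: call the callable until it returns the sentinel (0 here).
pending = [10, 20, 0, 30]
until_zero = list(iter(lambda: pending.pop(0), 0))
```

The second loop stops as soon as the callable returns the sentinel, so the value after it is never fetched and remains in pending.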
+ There is some controversy here: - The list() and tuple() built-in functions would be updated to use - this same iterator logic to retrieve the items in their argument. + - The name iter() is an abbreviation. Alternatives proposed + include iterate(), harp(), traverse(), narrate(). - List and tuple objects would implement the 'sq_iter' method by - calling the built-in make_iterator() routine just described. - - Instance objects would implement the 'sq_iter' method as follows: - - if hasattr(self, '__iter__'): - return self.__iter__() - elif hasattr(self, '__getitem__'): - return make_iterator(self) - else: - raise TypeError, thing.__class__.__name__ + \ - ' instance does not support iteration' - - Extension objects can implement 'sq_iter' however they wish, as - long as they return a callable object. + - Using the same name for two different operations (getting an + iterator from an object and making an iterator for a function + with a sentinel value) is somewhat ugly. I haven't seen a + better name for the second operation though. -Mapping Iterators +Dictionary Iterators - An additional proposal from Guido is to provide special syntax - for iterating over mappings. The loop: + The following two proposals are somewhat controversial. They are + also independent from the main iterator implementation. However, + they are both very useful. - for key:value in mapping: + - Dictionaries implement a sq_contains slot that implements the + same test as the has_key() method. This means that we can write - would bind both 'key' and 'value' to a key-value pair from the - mapping on each iteration. Tim Peters suggested that similarly, + if k in dict: ... - for key: in mapping: + which is equivalent to - could iterate over just the keys and + if dict.has_key(k): ... - for :value in mapping: + - Dictionaries implement a tp_iter slot that returns an efficient + iterator that iterates over the keys of the dictionary.
During + such an iteration, the dictionary should not be modified, except + that setting the value for an existing key is allowed (deletions + or additions are not, nor is the update() method). This means + that we can write - could iterate over just the values. + for k in dict: ... - The syntax is unambiguous since the new colon is currently not - permitted in this position in the grammar. + which is equivalent to, but much faster than - This behaviour would be provided by additional methods in the - PyMappingMethods table: 'mp_iteritems', 'mp_iterkeys', and - 'mp_itervalues' respectively. 'mp_iteritems' is expected to - produce a callable object that returns a (key, value) tuple; - 'mp_iterkeys' and 'mp_itervalues' are expected to produce a - callable object that returns a single key or value. + for k in dict.keys(): ... - The implementations of these methods on instance objects would - then check for and call the '__iteritems__', '__iterkeys__', - and '__itervalues__' methods respectively. + as long as the restriction on modifications to the dictionary + (either by the loop or by another thread) is not violated. - When 'mp_iteritems', 'mp_iterkeys', or 'mp_itervalues' is missing, - the default behaviour is to do make_iterator(mapping.items()), - make_iterator(mapping.keys()), or make_iterator(mapping.values()) - respectively, using the definition of make_iterator() above. + There is no doubt that the dict.has_key(x) interpretation of "x + in dict" is by far the most useful interpretation, probably the + only useful one. There has been resistance against this because + "x in list" checks whether x is present among the values, while + the proposal makes "x in dict" check whether x is present among + the keys. Given that the symmetry between lists and dictionaries + is very weak, this argument does not have much weight. + + The main discussion focuses on whether + + for x in dict: ... + + should assign x the successive keys, values, or items of the + dictionary.
The symmetry between "if x in y" and "for x in y" + suggests that it should iterate over keys. This symmetry has been + observed by many independently and has even been used to "explain" + one using the other. This is because for sequences, "if x in y" + iterates over y comparing the iterated values to x. If we adopt + both of the above proposals, this will also hold for + dictionaries. + + The argument against making "for x in dict" iterate over the keys + comes mostly from a practicality point of view: scans of the + standard library show that there are about as many uses of "for x + in dict.items()" as there are of "for x in dict.keys()", with the + items() version having a small majority. Presumably many of the + loops using keys() use the corresponding value anyway, by writing + dict[x], so (the argument goes) by making both the key and value + available, we could support the largest number of cases. While + this is true, I (Guido) find the correspondence between "for x in + dict" and "if x in dict" too compelling to break, and there's not + much overhead in having to write dict[x] to explicitly get the + value. We could also add methods to dictionaries that return + different kinds of iterators, e.g. + + for key, value in dict.iteritems(): ... + + for value in dict.itervalues(): ... + + for key in dict.iterkeys(): ... -Indexing Sequences +File Iterators - The special syntax described above can be applied to sequences - as well, to provide the long-hoped-for ability to obtain the - indices of a sequence without the strange-looking 'range(len(x))' - expression. + The following proposal is not controversial, but should be + considered a separate step after introducing the iterator + framework described above. It is useful because it provides us + with a good answer to the complaint that the common idiom to + iterate over the lines of a file is ugly and slow. 
- for index:item in sequence: + - Files implement a tp_iter slot that is equivalent to + iter(f.readline, ""). This means that we can write - causes 'index' to be bound to the index of each item as 'item' is - bound to the items of the sequence in turn, and + for line in file: + ... - for index: in sequence: + as a shorthand for - simply causes 'index' to start at 0 and increment until an attempt - to get sequence[index] produces an IndexError. For completeness, + for line in iter(file.readline, ""): + ... - for :item in sequence: + which is equivalent to, but faster than - is equivalent to + while 1: + line = file.readline() + if not line: + break + ... - for item in sequence: - - In each case we try to request an appropriate iterator from the - sequence. In summary: - - for k:v in x looks for mp_iteritems, then sq_iter - for k: in x looks for mp_iterkeys, then sq_iter - for :v in x looks for mp_itervalues, then sq_iter - for v in x looks for sq_iter - - If we fall back to sq_iter in the first two cases, we generate - indices for k as needed, by starting at 0 and incrementing. - - The implementation of the mp_iter* methods on instance objects - then checks for methods in the following order: - - mp_iteritems __iteritems__, __iter__, items, __getitem__ - mp_iterkeys __iterkeys__, __iter__, keys, __getitem__ - mp_itervalues __itervalues__, __iter__, values, __getitem__ - sq_iter __iter__, __getitem__ - - If a __iteritems__, __iterkeys__, or __itervalues__ method is - found, we just call it and use the resulting iterator. If a - mp_* function finds no such method but finds __iter__ instead, - we generate indices as needed. - - Upon finding an items(), keys(), or values() method, we use - make_iterator(x.items()), make_iterator(x.keys()), or - make_iterator(x.values()) respectively. Upon finding a - __getitem__ method, we use it and generate indices as needed. 
- - For example, the complete implementation of the mp_iteritems - method for instances can be roughly described as follows: - - def mp_iteritems(thing): - if hasattr(thing, '__iteritems__'): - return thing.__iteritems__() - if hasattr(thing, '__iter__'): - def iterator(sequence=thing, index=[0]): - item = (index[0], sequence.__iter__()) - index[0] += 1 - return item - return iterator - if hasattr(thing, 'items'): - return make_iterator(thing.items()) - if hasattr(thing, '__getitem__'): - def iterator(sequence=thing, index=[0]): - item = (index[0], sequence[index[0]]) - index[0] += 1 - return item - return iterator - raise TypeError, thing.__class__.__name__ + \ - ' instance does not support iteration over items' - - -Examples - - Here is a class written in Python that represents the sequence of - lines in a file. - - class FileLines: - def __init__(self, filename): - self.file = open(filename) - def __iter__(self): - def iter(self=self): - line = self.file.readline() - if line: return line - else: raise IndexError - return iter - - for line in FileLines('spam.txt'): - print line - - And here's an interactive session demonstrating the proposed new - looping syntax: - - >>> for i:item in ['a', 'b', 'c']: - ... print i, item - ... - 0 a - 1 b - 2 c - >>> for i: in 'abcdefg': # just the indices, please - ... print i, - ... print - ... - 0 1 2 3 4 5 6 - >>> for k:v in os.environ: # os.environ is an instance, but - ... print k, v # this still works because we fall - ... # back to calling items() - MAIL /var/spool/mail/ping - HOME /home/ping - DISPLAY :0.0 - TERM xterm - . - . - . + This also shows that some iterators are destructive: they consume + all the values and a second iterator cannot easily be created that + iterates independently over the same values. You could open the + file for a second time, or seek() to the beginning, but these + solutions don't work for all file types, e.g. 
they don't work when + the open file object really represents a pipe or a stream socket. Rationale @@ -245,9 +261,9 @@ Rationale 1. It provides an extensible iterator interface. - 2. It resolves the endless "i indexing sequence" debate. + 2. It allows performance enhancements to list iteration. - 3. It allows performance enhancements to dictionary iteration. + 3. It allows big performance enhancements to dictionary iteration. 4. It allows one to provide an interface for just iteration without pretending to provide random access to elements. @@ -258,95 +274,9 @@ Rationale {__getitem__, keys, values, items}. -Errors +Copyright - Errors that occur during sq_iter, mp_iter*, or the __iter*__ - methods are allowed to propagate normally to the surface. - - An attempt to do - - for item in dict: - - over a dictionary object still produces: - - TypeError: loop over non-sequence - - An attempt to iterate over an instance that provides neither - __iter__ nor __getitem__ produces: - - TypeError: instance does not support iteration - - Similarly, an attempt to do mapping-iteration over an instance - that doesn't provide the right methods should produce one of the - following errors: - - TypeError: instance does not support iteration over items - TypeError: instance does not support iteration over keys - TypeError: instance does not support iteration over values - - It's an error for the iterator produced by __iteritems__ or - mp_iteritems to return an object whose length is not 2: - - TypeError: item iterator did not return a 2-tuple - - -Open Issues - - We could introduce a new exception type such as IteratorExit just - for terminating loops rather than using IndexError. In this case, - the implementation of make_iterator() would catch and translate an - IndexError into an IteratorExit for backward compatibility.
- - We could provide access to the logic that calls either 'sq_item' - or make_iterator() with an iter() function in the built-in module - (just as the getattr() function provides access to 'tp_getattr'). - One possible motivation for this is to make it easier for the - implementation of __iter__ to delegate iteration to some other - sequence. Presumably we would then have to consider adding - iteritems(), iterkeys(), and itervalues() as well. - - An alternative way to let __iter__ delegate iteration to another - sequence is for it to return another sequence. Upon detecting - that the object returned by __iter__ is not callable, the - interpreter could repeat the process of looking for an iterator - on the new object. However, this process seems potentially - convoluted and likely to produce more confusing error messages. - - If we decide to add "freezing" ability to lists and dictionaries, - it is suggested that the implementation of make_iterator - automatically freeze any list or dictionary argument for the - duration of the loop, and produce an error complaining about any - attempt to modify it during iteration. Since it is relatively - rare to actually want to modify it during iteration, this is - likely to catch mistakes earlier. If a programmer wants to - modify a list or dictionary during iteration, they should - explicitly make a copy to iterate over using x[:], x.clone(), - x.keys(), x.values(), or x.items(). - - For consistency with the 'key in dict' expression, we could - support 'for key in dict' as equivalent to 'for key: in dict'. - - -BDFL Pronouncements - - The "parallel expression" to 'for key:value in mapping': - - if key:value in mapping: - - is infeasible since the first colon ends the "if" condition. - The following compromise is technically feasible: - - if (key:value) in mapping: - - but the BDFL has pronounced a solid -1 on this. 
- - The BDFL gave a +0.5 to: - - for key:value in mapping: - for index:item in sequence: - - and a +0.2 to the variations where the part before or after - the first colon is missing. + This document is in the public domain.