From 36e7a69be3ac444ba18e463ef94f653ab1723840 Mon Sep 17 00:00:00 2001 From: Alex Martelli Date: Wed, 29 Oct 2003 13:39:51 +0000 Subject: [PATCH] Rewritten, with removal of deep-copying ideas, but otherwise many additions of details and examples, as per feedback received on python-dev. --- pep-0323.txt | 442 +++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 357 insertions(+), 85 deletions(-) diff --git a/pep-0323.txt b/pep-0323.txt index fc771f0b7..5715dc21c 100644 --- a/pep-0323.txt +++ b/pep-0323.txt @@ -8,46 +8,107 @@ Type: Standards Track Content-Type: text/plain Created: 25-Oct-2003 Python-Version: 2.4 -Post-History: +Post-History: 29-Oct-2003 Abstract - This PEP suggests that some iterator types should support being copied - (shallowly or deeply), by exposing suitable methods such as __copy__ or - __deepcopy__. + This PEP suggests that some iterator types should support shallow + copies of their instances by exposing a __copy__ method which meets + some specific requirements, and indicates how code using an iterator + might exploit such a __copy__ method when present. Motivation - Some iterators, such as those which built-in function iter builds over - sequences, should be intrinsically easy to copy (just get another - reference to the same sequence and a copy of the index) or deepcopy - (ditto but with a deepcopy of the underlying sequence). + In Python up to 2.3, most built-in iterator types don't let the user + copy their instances. User-coded iterators that do let their clients + call copy.copy on their instances may, or may not, happen to return, + as a result of the copy, a separate iterator object that may be + iterated upon independently from the original. - However, in Python up to 2.3, those iterators don't expose their state: - user code which is using such an iterator X is not able to just "save" a - copy of X (for future iteration from the current point) with a simple - "saved_X = copy.copy(X)". + Currently, "support" for copy.copy in a user-coded iterator type is + almost invariably "accidental" -- i.e., the standard machinery of the + copy method in Python's standard library's copy module does build and + return a copy. However, the copy will be independently iterable with + respect to the original only if calling .next() on an instance of that + class happens to change instance state solely by rebinding some + attributes to new values, and not by mutating some attributes' + existing values. - To "tee" a non-copyable iterator into two independently iterable ones, - subtle code [such as that which will go into itertools.tee in 2.4] is - needed (a Python version thereof is currently in the itertools docs). + For example, an iterator whose "index" state is held as an integer + attribute will probably give usable copies, since (integers being + immutable) .next() presumably just rebinds that attribute. On the + other hand, another iterator whose "index" state is held as a list + attribute will probably mutate the same list object when .next() + executes, and therefore copies of such an iterator will not be + iterable separately and independently from the original. - In any case, such code will inevitably have to consume memory to keep - a copy of the subsequence which has been iterated on by one but not - both tee'd iterators. This is a waste of memory in those cases in - which the iterator being tee'd already _has_ an underlying sequence - (or other iterable) and index. + Given this existing situation, copy.copy(it) on some iterator object + isn't very useful, nor, therefore, is it at all widely used. However, + there are many cases in which being able to get a "snapshot" of an + iterator, as a "bookmark", so as to be able to keep iterating along + the sequence but later iterate again on the same sequence from the + bookmark onwards, is useful. To support such "bookmarking", module + itertools, in 2.4, has grown a 'tee' function, to be used as: - Having iterators expose a suitable __copy__ when feasible allows easy - optimization of itertools.tee and similar user code, as in: + it, bookmark = itertools.tee(it) + + The previous value of 'it' must not be used again, which is why this + typical usage idiom rebinds the name. After this call, 'it' and + 'bookmark' are independently-iterable iterators on the same underlying + sequence as the original value of 'it': this satisfies application + needs for "iterator copying". + + However, when itertools.tee can make no hypotheses about the nature of + the iterator it is passed as an argument, it must save in memory all + items through which one of the two 'teed' iterators, but not yet both, + have stepped. This can be quite costly in terms of memory, if the two + iterators get very far from each other in their stepping; indeed, in + some cases it may be preferable to make a list from the iterator so as + to be able to step repeatedly through the subsequence, or, if that is + too costy in terms of memory, save items to disk, again in order to be + able to iterate through them repeatedly. + + This PEP proposes another idea that will, in some important cases, + allow itertools.tee to do its job with minimal cost in terms of + memory; user code may also occasionally be able to exploit the idea in + order to decide whether to copy an iterator, make a list from it, or + use an auxiliary disk file. + + The key consideration is that some important iterators, such as those + which built-in function iter builds over sequences, would be + intrinsically easy to copy: just get another reference to the same + sequence, and a copy of the integer index. However, in Python 2.3, + those iterators don't expose the state, and don't support copy.copy. + + The purpose of this PEP, therefore, is to have those iterator types + expose a suitable __copy__ method. Similarly, user-coded iterator + types that can provide copies of their instances, suitable for + separate and independent iteration, with limited costs in time and + space, should also expose a suitable __copy__ method. While + copy.copy also supports other ways to let a type control the way + its instances are copied, it is suggested, for simplicity, that + iterator types that support copying always do so by exposing a + __copy__ method, and not in the other ways copy.copy supports. + + Having iterators expose a suitable __copy__ when feasible will afford + easy optimization of itertools.tee and similar user code, as in: def tee(it): it = iter(it) - try: return it, copy.copy(it) - except TypeError: + try: copier = it.__copy__ + except AttributeError: # non-copyable iterator, do all the needed hard work + # [snipped!] + else: + return it, copier() + + Note that this function does NOT call "copy.copy(it)", which (even + after this PEP is implemented) might well still "just happen to + succeed". for some iterator type that is implemented as a user-coded + class. without really supplying an adequate "independently iterable" + copy object as its result. Specification @@ -55,11 +116,10 @@ Specification Any iterator type X may expose a method __copy__ that is callable without arguments on any instance x of X. The method should be exposed if and only if the iterator type can provide copyability with - reasonably little computational and memory effort. Similarly, X may - expose a method __deepcopy__ that is callable with one argument (a memo - dictionary) on instances of X. The (very concise...) specs for - these methods are at [2] -- for more details, see also file copy.py - in the Python standard library. + reasonably little computational and memory effort. Furthermore, the + new object y returned by method __copy__ should be a new instance + of X that is iterable independently and separately from x, stepping + along the same "underlying sequence" of items. For example, suppose a class Iter essentially duplicated the functionality of the iter builtin for iterating on a sequence: @@ -75,84 +135,133 @@ Specification def next(self): try: result = self.sequence[self.index] - except IndexError: raise StopIteration + except IndexError: raise StopIteration self.index += 1 return result To make this Iter class compliant with this PEP, the following - additions to the body of class Iter would suffice: + addition to the body of class Iter would suffice: def __copy__(self): result = self.__class__(sequence) result.index = self.index return result - def __deepcopy__(self, memo): - result = self.__copy__() - result.sequence = deepcopy(self.sequence, memo) - return result + Note that __copy__, in this case, does not even try to copy the + sequence; if the sequence is altered while either or both of the + original and copied iterators are still stepping on it, the iteration + behavior is quite likely to go awry anyway -- it is not __copy__'s + responsibility to change this normal Python behavior for iterators + which iterate on mutable sequences (that might, perhaps, be the + specification for a __deepcopy__ method of iterators, which, however, + this PEP does not deal with). + + Consider also a "random iterator", which provides a nonterminating + sequence of results from some method of a random instance, called + with given arguments: + + class RandomIterator(object): + + def __init__(self, bound_method, *args): + self.call = bound_method + self.args = args + + def __iter__(self): + return self + + def next(self): + return self.call(*self.args) + + def __copy__(self): + import copy, new + im_self = copy.copy(self.call.im_self) + method = new.instancemethod(self.call.im_func, im_self) + return self.__class__(method, *self.args) + + This iterator type is slightly more general than its name implies, as + it supports calls to any bound method (or other callable, but if the + callable is not a bound method, then method __copy__ will fail). But + the use case is for the purpose of generating random streams, as in: + + import random + + def show5(it): + for i, result in enumerate(it): + print '%6.3f'%result, + if i==4: break + print + + normit = RandomIterator(random.Random().gauss, 0, 1) + show5(normit) + copit = normit.__copy__() + show5(normit) + show5(copit) + + which will display some output such as: + + -0.536 1.936 -1.182 -1.690 -1.184 + 0.666 -0.701 1.214 0.348 1.373 + 0.666 -0.701 1.214 0.348 1.373 + + the key point being that the second and third lines are equal, because + the normit and copit iterators will step along the same "underlying + sequence". (As an aside, note that to get a copy of self.call.im_self + we must use copy.copy, NOT try getting at a __copy__ method directly, + because for example instances of random.Random support copying via + __getstate__ and __setstate__, NOT via __copy__; indeed, using + copy.copy is the normal way to get a shallow copy of any object -- + copyable iterators are different because of the already-mentioned + uncertainty about the result of copy.copy supporting these "copyable + iterator" specs). Details Besides adding to the Python docs a recommendation that user-coded - iterators be made copyable when feasible, this PEP's implementation - will specifically include the addition of copyability to the iterators - over sequences that built-in iter returns, and also to the iterators - over a dictionary returned by the methods __iter__, iterkeys, itervalues, - and iteritems of built-in type dict. + iterator types support a __copy__ method (if and only if it can be + implemented with small costs in memory and runtime, and produce an + independently-iterable copy of an iterator object), this PEP's + implementation will specifically include the addition of copyability + to the iterators over sequences that built-in iter returns, and also + to the iterators over a dictionary returned by the methods __iter__, + iterkeys, itervalues, and iteritems of built-in type dict. - Iterators produced by generators will not be copyable (the BDFL deems - shallow copy impossible, and deep copy too much trouble). However, - iterators produced by the new "generator expressions" of Python 2.4 - (PEP 289 [3]) should be copyable if their underlying iterator[s] are; - the strict limitations on what is possible in a generator expression, - compared to the much vaster generality of a generator, should make - that feasible. Similarly, the iterators produced by the built-in - function enumerate should be copyable if the underlying iterator is. + Iterators produced by generator functions will not be copyable. + However, iterators produced by the new "generator expressions" of + Python 2.4 (PEP 289 [3]) should be copyable if their underlying + iterator[s] are; the strict limitations on what is possible in a + generator expression, compared to the much vaster generality of a + generator, should make that feasible. Similarly, the iterators + produced by the built-in function enumerate, and certain functions + suppiled by module itertools, should be copyable if the underlying + iterators are. - The implementation of this PEP will also include the optimization - of the new itertools.tee function mentioned in the Motivation section. + The implementation of this PEP will also include the optimization of + the new itertools.tee function mentioned in the Motivation section. Rationale - Being able to copy iterators will allow copying of user-coded classes - that have copyable iterators as attributes. This applies to both - shallow and deep copying. + The main use case for (shallow) copying of an iterator is the same as + for the function itertools.tee (new in 2.4). User code will not + directly attempt to copy an iterator, because it would have to deal + separately with uncopyable cases; calling itertools.tee will + internally perform the copy when appropriate, and implicitly fallback + to a maximally efficient non-copying strategy for iterators that are + not copyable. (Occasionally, user code may want more direct control, + specifically in order to deal with non-copyable iterators by other + strategies, such as making a list or saving the sequence to disk). - Deep copyability of suitable iterators will allow "decoupling" from - an underlying mutable sequence, and, according to the BDFL, may in - the future allow certain iterators to become picklable (however, this - PEP, in itself, does not include picklability of iterators). - - The main use case for (shallow) copying of an iterator is the same - as for the function itertools.tee (new in 2.4). Indeed, we assume - that user code will typically not directly attempt to copy.copy an - iterator, because it would have to deal with uncopyable cases; just - calling itertools.tee will enable copying when feasible, with an - implicit fallback to a maximally efficient non-copying strategy for - iterators that are not copyable. - - A tee'd iterator may serve as a "reference point", allowing processing - of a sequence to continue or resume from a known point, while the other - independent iterator can be freely advanced to "explore" a further part - of the sequence as needed. To fully exploit this pattern, it should - ideally also be possible, given two iterators originally produced by - iterator.tee, to check if they are "equal" (have been stepped the same - number of times); however, this PEP, in itself, does not include - comparability of such iterators. Therefore, code using this pattern - may also need to count the number of forward steps taken after tee by - the iterators, so as to be able to tell which one is "behind", and by - how much. Built-in function enumerate is one way to make such counting - very easy in many situations. However, such needs do not always arise. - - Here is a simpler example: a generator which, given an iterator of - numbers (assumed to be positive), returns a corresponding iterator - each of whose items is the fraction of the total corresponding to each - corresponding item of the input iterator. The caller may pass the - total as a value, if known in advance; otherwise, the iterator - returned by calling the generator will first compute the total. + A tee'd iterator may serve as a "reference point", allowing processing + of a sequence to continue or resume from a known point, while the + other independent iterator can be freely advanced to "explore" a + further part of the sequence as needed. A simple example: a generator + function which, given an iterator of numbers (assumed to be positive), + returns a corresponding iterator, each of whose items is the fraction + of the total corresponding to each corresponding item of the input + iterator. The caller may pass the total as a value, if known in + advance; otherwise, the iterator returned by calling this generator + function will first compute the total. def fractions(numbers, total=None): if total is None: @@ -165,7 +274,170 @@ Rationale The ability to tee the numbers iterator allows this generator to precompute the total, if needed, without necessarily requiring O(N) auxiliary memory if the numbers iterator is copyable. + + As another example of "iterator bookmarking", consider a stream of + numbers with an occasional string as a "postfix operator" now and + then. By far most frequent such operator is a '+', whereupon we must + sum all previous numbers (since the last previous operator if any, or + else since the start) and yield the result. Sometimes we find a '*' + instead, which is the same except that the previous numbers must + instead be multiplied, not summed. + + def filter_weird_stream(stream): + it = iter(stream) + while True: + it, bookmark = itertools.tee(it) + total = 0 + for item in it: + if item=='+': + yield total + break + elif item=='*': + product = 1 + for item in bookmark: + if item=='*': + yield product + break + else: + product *= item + else: + total += item + + Similar use cases of itertools.tee can support such tasks as + "undo" on a stream of commands represented by an iterator, + "backtracking" on the parse of a stream of tokens, and so on. + (Of course, in each case, one should also consider simpler + possibilities such as saving relevant portions of the sequence + into lists while stepping on the sequence with just one iterator, + depending on the details of one's task). + + + Here is an example, in pure Python, of how the 'enumerate' + built-in could be extended to support __copy__ if its underlying + iterator also supported __copy__: + + class enumerate(object): + + def __init__(self, it): + self.it = iter(it) + self.i = -1 + + def __iter__(self): + return self + + def next(self): + self.i += 1 + return self.i, self.it.next() + + def __copy__(self): + result = self.__class__.new() + result.it = self.it.__copy__() + result.i = self.i + return result + + + Here is an example of the kind of "fragility" produced by "accidental + copyability" of an iterator -- the reason why one must NOT use + copy.copy expecting, if it succeeds, to receive as a result an + iterator which is iterable-on independently from the original. Here + is an iterator class that iterates (in preorder) on "trees" which, for + simplicity, are just nested lists -- any item that's a list is treated + as a subtree, any other item as a leaf. + + class ListreeIter(object): + + def __init__(self, tree): + self.tree = [tree] + self.indx = [-1] + + def __iter__(self): + return self + + def next(self): + if not self.indx: + raise StopIteration + self.indx[-1] += 1 + try: + result = self.tree[-1][self.indx[-1]] + except IndexError: + self.tree.pop() + self.indx.pop() + return self.next() + if type(result) is not list: + return result + self.tree.append(result) + self.indx.append(-1) + return self.next() + + Now, for example, the following code: + + import copy + x = [ [1,2,3], [4, 5, [6, 7, 8], 9], 10, 11, [12] ] + + print 'showing all items:', + it = ListreeIter(x) + for i in it: + print i, + if i==6: cop = copy.copy(it) + print + + print 'showing items >6 again:' + for i in cop: print i, + print + + does NOT work as intended -- the "cop" iterator gets consumed, and + exhausted, step by step as the original "it" iterator is, because + the accidental (rather than deliberate) copying performed by + copy.copy shares, rather than duplicating the "index" list, which + is the mutable attribute it.indx (a list of numerical indices). + Thus, this "client code" of the iterator, which attemps to iterate + twice over a portion of the sequence via a copy.copy on the + iterator, is NOT correct. + + Some correct solutions include using itertools.tee, i.e., changing + the first for loop into: + + for i in it: + print i, + if i==6: + it, cop = itertools.tee(it) + break + for i in it: print i, + + (note that we MUST break the loop in two, otherwise we'd still + be looping on the ORIGINAL value of it, which must NOT be used + further after the call to tee!!!); or making a list, i.e.: + + for i in it: + print i, + if i==6: + cop = lit = list(it) + break + for i in lit: print i, + (again, the loop must be broken in two, since iterator 'it' + gets exhausted by the call list(it)). + + Finally, all of these solutions would work if Listiter supplied + a suitable __copy__ method, as this PEP recommends: + + def __copy__(self): + result = self.__class__.new() + result.tree = copy.copy(self.tree) + result.indx = copy.copy(self.indx) + return result + + There is no need to get any "deeper" in the copy, but the two + mutable "index state" attributes must indeed be copied in order + to achieve a "proper" (independently iterable) iterator-copy. + + The recommended solution is to have class Listiter supply this + __copy__ method AND have client code use itertools.tee (with + the split-in-two-parts loop as shown above). This will make + client code maximally tolerant of different iterator types it + might be using AND achieve good performance for tee'ing of this + specific iterator type at the same time. + References