Rewritten, with removal of deep-copying ideas, but otherwise many additions of

details and examples, as per feedback received on python-dev.
This commit is contained in:
Alex Martelli 2003-10-29 13:39:51 +00:00
parent f7f0a4d4ea
commit 36e7a69be3
1 changed files with 357 additions and 85 deletions

View File

@ -8,46 +8,107 @@ Type: Standards Track
Content-Type: text/plain Content-Type: text/plain
Created: 25-Oct-2003 Created: 25-Oct-2003
Python-Version: 2.4 Python-Version: 2.4
Post-History: Post-History: 29-Oct-2003
Abstract Abstract
This PEP suggests that some iterator types should support being copied This PEP suggests that some iterator types should support shallow
(shallowly or deeply), by exposing suitable methods such as __copy__ or copies of their instances by exposing a __copy__ method which meets
__deepcopy__. some specific requirements, and indicates how code using an iterator
might exploit such a __copy__ method when present.
Motivation Motivation
Some iterators, such as those which built-in function iter builds over In Python up to 2.3, most built-in iterator types don't let the user
sequences, should be intrinsically easy to copy (just get another copy their instances. User-coded iterators that do let their clients
reference to the same sequence and a copy of the index) or deepcopy call copy.copy on their instances may, or may not, happen to return,
(ditto but with a deepcopy of the underlying sequence). as a result of the copy, a separate iterator object that may be
iterated upon independently from the original.
However, in Python up to 2.3, those iterators don't expose their state: Currently, "support" for copy.copy in a user-coded iterator type is
user code which is using such an iterator X is not able to just "save" a almost invariably "accidental" -- i.e., the standard machinery of the
copy of X (for future iteration from the current point) with a simple copy method in Python's standard library's copy module does build and
"saved_X = copy.copy(X)". return a copy. However, the copy will be independently iterable with
respect to the original only if calling .next() on an instance of that
class happens to change instance state solely by rebinding some
attributes to new values, and not by mutating some attributes'
existing values.
To "tee" a non-copyable iterator into two independently iterable ones, For example, an iterator whose "index" state is held as an integer
subtle code [such as that which will go into itertools.tee in 2.4] is attribute will probably give usable copies, since (integers being
needed (a Python version thereof is currently in the itertools docs). immutable) .next() presumably just rebinds that attribute. On the
other hand, another iterator whose "index" state is held as a list
attribute will probably mutate the same list object when .next()
executes, and therefore copies of such an iterator will not be
iterable separately and independently from the original.
In any case, such code will inevitably have to consume memory to keep Given this existing situation, copy.copy(it) on some iterator object
a copy of the subsequence which has been iterated on by one but not isn't very useful, nor, therefore, is it at all widely used. However,
both tee'd iterators. This is a waste of memory in those cases in there are many cases in which being able to get a "snapshot" of an
which the iterator being tee'd already _has_ an underlying sequence iterator, as a "bookmark", so as to be able to keep iterating along
(or other iterable) and index. the sequence but later iterate again on the same sequence from the
bookmark onwards, is useful. To support such "bookmarking", module
itertools, in 2.4, has grown a 'tee' function, to be used as:
Having iterators expose a suitable __copy__ when feasible allows easy it, bookmark = itertools.tee(it)
optimization of itertools.tee and similar user code, as in:
The previous value of 'it' must not be used again, which is why this
typical usage idiom rebinds the name. After this call, 'it' and
'bookmark' are independently-iterable iterators on the same underlying
sequence as the original value of 'it': this satisfies application
needs for "iterator copying".
However, when itertools.tee can make no hypotheses about the nature of
the iterator it is passed as an argument, it must save in memory all
items through which one of the two 'teed' iterators, but not yet both,
have stepped. This can be quite costly in terms of memory, if the two
iterators get very far from each other in their stepping; indeed, in
some cases it may be preferable to make a list from the iterator so as
to be able to step repeatedly through the subsequence, or, if that is
too costy in terms of memory, save items to disk, again in order to be
able to iterate through them repeatedly.
This PEP proposes another idea that will, in some important cases,
allow itertools.tee to do its job with minimal cost in terms of
memory; user code may also occasionally be able to exploit the idea in
order to decide whether to copy an iterator, make a list from it, or
use an auxiliary disk file.
The key consideration is that some important iterators, such as those
which built-in function iter builds over sequences, would be
intrinsically easy to copy: just get another reference to the same
sequence, and a copy of the integer index. However, in Python 2.3,
those iterators don't expose the state, and don't support copy.copy.
The purpose of this PEP, therefore, is to have those iterator types
expose a suitable __copy__ method. Similarly, user-coded iterator
types that can provide copies of their instances, suitable for
separate and independent iteration, with limited costs in time and
space, should also expose a suitable __copy__ method. While
copy.copy also supports other ways to let a type control the way
its instances are copied, it is suggested, for simplicity, that
iterator types that support copying always do so by exposing a
__copy__ method, and not in the other ways copy.copy supports.
Having iterators expose a suitable __copy__ when feasible will afford
easy optimization of itertools.tee and similar user code, as in:
def tee(it): def tee(it):
it = iter(it) it = iter(it)
try: return it, copy.copy(it) try: copier = it.__copy__
except TypeError: except AttributeError:
# non-copyable iterator, do all the needed hard work # non-copyable iterator, do all the needed hard work
# [snipped!]
else:
return it, copier()
Note that this function does NOT call "copy.copy(it)", which (even
after this PEP is implemented) might well still "just happen to
succeed". for some iterator type that is implemented as a user-coded
class. without really supplying an adequate "independently iterable"
copy object as its result.
Specification Specification
@ -55,11 +116,10 @@ Specification
Any iterator type X may expose a method __copy__ that is callable Any iterator type X may expose a method __copy__ that is callable
without arguments on any instance x of X. The method should be without arguments on any instance x of X. The method should be
exposed if and only if the iterator type can provide copyability with exposed if and only if the iterator type can provide copyability with
reasonably little computational and memory effort. Similarly, X may reasonably little computational and memory effort. Furthermore, the
expose a method __deepcopy__ that is callable with one argument (a memo new object y returned by method __copy__ should be a new instance
dictionary) on instances of X. The (very concise...) specs for of X that is iterable independently and separately from x, stepping
these methods are at [2] -- for more details, see also file copy.py along the same "underlying sequence" of items.
in the Python standard library.
For example, suppose a class Iter essentially duplicated the For example, suppose a class Iter essentially duplicated the
functionality of the iter builtin for iterating on a sequence: functionality of the iter builtin for iterating on a sequence:
@ -75,84 +135,133 @@ Specification
def next(self): def next(self):
try: result = self.sequence[self.index] try: result = self.sequence[self.index]
except IndexError: raise StopIteration except IndexError: raise StopIteration
self.index += 1 self.index += 1
return result return result
To make this Iter class compliant with this PEP, the following To make this Iter class compliant with this PEP, the following
additions to the body of class Iter would suffice: addition to the body of class Iter would suffice:
def __copy__(self): def __copy__(self):
result = self.__class__(sequence) result = self.__class__(sequence)
result.index = self.index result.index = self.index
return result return result
def __deepcopy__(self, memo): Note that __copy__, in this case, does not even try to copy the
result = self.__copy__() sequence; if the sequence is altered while either or both of the
result.sequence = deepcopy(self.sequence, memo) original and copied iterators are still stepping on it, the iteration
return result behavior is quite likely to go awry anyway -- it is not __copy__'s
responsibility to change this normal Python behavior for iterators
which iterate on mutable sequences (that might, perhaps, be the
specification for a __deepcopy__ method of iterators, which, however,
this PEP does not deal with).
Consider also a "random iterator", which provides a nonterminating
sequence of results from some method of a random instance, called
with given arguments:
class RandomIterator(object):
def __init__(self, bound_method, *args):
self.call = bound_method
self.args = args
def __iter__(self):
return self
def next(self):
return self.call(*self.args)
def __copy__(self):
import copy, new
im_self = copy.copy(self.call.im_self)
method = new.instancemethod(self.call.im_func, im_self)
return self.__class__(method, *self.args)
This iterator type is slightly more general than its name implies, as
it supports calls to any bound method (or other callable, but if the
callable is not a bound method, then method __copy__ will fail). But
the use case is for the purpose of generating random streams, as in:
import random
def show5(it):
for i, result in enumerate(it):
print '%6.3f'%result,
if i==4: break
print
normit = RandomIterator(random.Random().gauss, 0, 1)
show5(normit)
copit = normit.__copy__()
show5(normit)
show5(copit)
which will display some output such as:
-0.536 1.936 -1.182 -1.690 -1.184
0.666 -0.701 1.214 0.348 1.373
0.666 -0.701 1.214 0.348 1.373
the key point being that the second and third lines are equal, because
the normit and copit iterators will step along the same "underlying
sequence". (As an aside, note that to get a copy of self.call.im_self
we must use copy.copy, NOT try getting at a __copy__ method directly,
because for example instances of random.Random support copying via
__getstate__ and __setstate__, NOT via __copy__; indeed, using
copy.copy is the normal way to get a shallow copy of any object --
copyable iterators are different because of the already-mentioned
uncertainty about the result of copy.copy supporting these "copyable
iterator" specs).
Details Details
Besides adding to the Python docs a recommendation that user-coded Besides adding to the Python docs a recommendation that user-coded
iterators be made copyable when feasible, this PEP's implementation iterator types support a __copy__ method (if and only if it can be
will specifically include the addition of copyability to the iterators implemented with small costs in memory and runtime, and produce an
over sequences that built-in iter returns, and also to the iterators independently-iterable copy of an iterator object), this PEP's
over a dictionary returned by the methods __iter__, iterkeys, itervalues, implementation will specifically include the addition of copyability
and iteritems of built-in type dict. to the iterators over sequences that built-in iter returns, and also
to the iterators over a dictionary returned by the methods __iter__,
iterkeys, itervalues, and iteritems of built-in type dict.
Iterators produced by generators will not be copyable (the BDFL deems Iterators produced by generator functions will not be copyable.
shallow copy impossible, and deep copy too much trouble). However, However, iterators produced by the new "generator expressions" of
iterators produced by the new "generator expressions" of Python 2.4 Python 2.4 (PEP 289 [3]) should be copyable if their underlying
(PEP 289 [3]) should be copyable if their underlying iterator[s] are; iterator[s] are; the strict limitations on what is possible in a
the strict limitations on what is possible in a generator expression, generator expression, compared to the much vaster generality of a
compared to the much vaster generality of a generator, should make generator, should make that feasible. Similarly, the iterators
that feasible. Similarly, the iterators produced by the built-in produced by the built-in function enumerate, and certain functions
function enumerate should be copyable if the underlying iterator is. suppiled by module itertools, should be copyable if the underlying
iterators are.
The implementation of this PEP will also include the optimization The implementation of this PEP will also include the optimization of
of the new itertools.tee function mentioned in the Motivation section. the new itertools.tee function mentioned in the Motivation section.
Rationale Rationale
Being able to copy iterators will allow copying of user-coded classes The main use case for (shallow) copying of an iterator is the same as
that have copyable iterators as attributes. This applies to both for the function itertools.tee (new in 2.4). User code will not
shallow and deep copying. directly attempt to copy an iterator, because it would have to deal
separately with uncopyable cases; calling itertools.tee will
internally perform the copy when appropriate, and implicitly fallback
to a maximally efficient non-copying strategy for iterators that are
not copyable. (Occasionally, user code may want more direct control,
specifically in order to deal with non-copyable iterators by other
strategies, such as making a list or saving the sequence to disk).
Deep copyability of suitable iterators will allow "decoupling" from A tee'd iterator may serve as a "reference point", allowing processing
an underlying mutable sequence, and, according to the BDFL, may in of a sequence to continue or resume from a known point, while the
the future allow certain iterators to become picklable (however, this other independent iterator can be freely advanced to "explore" a
PEP, in itself, does not include picklability of iterators). further part of the sequence as needed. A simple example: a generator
function which, given an iterator of numbers (assumed to be positive),
The main use case for (shallow) copying of an iterator is the same returns a corresponding iterator, each of whose items is the fraction
as for the function itertools.tee (new in 2.4). Indeed, we assume of the total corresponding to each corresponding item of the input
that user code will typically not directly attempt to copy.copy an iterator. The caller may pass the total as a value, if known in
iterator, because it would have to deal with uncopyable cases; just advance; otherwise, the iterator returned by calling this generator
calling itertools.tee will enable copying when feasible, with an function will first compute the total.
implicit fallback to a maximally efficient non-copying strategy for
iterators that are not copyable.
A tee'd iterator may serve as a "reference point", allowing processing
of a sequence to continue or resume from a known point, while the other
independent iterator can be freely advanced to "explore" a further part
of the sequence as needed. To fully exploit this pattern, it should
ideally also be possible, given two iterators originally produced by
iterator.tee, to check if they are "equal" (have been stepped the same
number of times); however, this PEP, in itself, does not include
comparability of such iterators. Therefore, code using this pattern
may also need to count the number of forward steps taken after tee by
the iterators, so as to be able to tell which one is "behind", and by
how much. Built-in function enumerate is one way to make such counting
very easy in many situations. However, such needs do not always arise.
Here is a simpler example: a generator which, given an iterator of
numbers (assumed to be positive), returns a corresponding iterator
each of whose items is the fraction of the total corresponding to each
corresponding item of the input iterator. The caller may pass the
total as a value, if known in advance; otherwise, the iterator
returned by calling the generator will first compute the total.
def fractions(numbers, total=None): def fractions(numbers, total=None):
if total is None: if total is None:
@ -165,7 +274,170 @@ Rationale
The ability to tee the numbers iterator allows this generator to The ability to tee the numbers iterator allows this generator to
precompute the total, if needed, without necessarily requiring precompute the total, if needed, without necessarily requiring
O(N) auxiliary memory if the numbers iterator is copyable. O(N) auxiliary memory if the numbers iterator is copyable.
As another example of "iterator bookmarking", consider a stream of
numbers with an occasional string as a "postfix operator" now and
then. By far most frequent such operator is a '+', whereupon we must
sum all previous numbers (since the last previous operator if any, or
else since the start) and yield the result. Sometimes we find a '*'
instead, which is the same except that the previous numbers must
instead be multiplied, not summed.
def filter_weird_stream(stream):
it = iter(stream)
while True:
it, bookmark = itertools.tee(it)
total = 0
for item in it:
if item=='+':
yield total
break
elif item=='*':
product = 1
for item in bookmark:
if item=='*':
yield product
break
else:
product *= item
else:
total += item
Similar use cases of itertools.tee can support such tasks as
"undo" on a stream of commands represented by an iterator,
"backtracking" on the parse of a stream of tokens, and so on.
(Of course, in each case, one should also consider simpler
possibilities such as saving relevant portions of the sequence
into lists while stepping on the sequence with just one iterator,
depending on the details of one's task).
Here is an example, in pure Python, of how the 'enumerate'
built-in could be extended to support __copy__ if its underlying
iterator also supported __copy__:
class enumerate(object):
def __init__(self, it):
self.it = iter(it)
self.i = -1
def __iter__(self):
return self
def next(self):
self.i += 1
return self.i, self.it.next()
def __copy__(self):
result = self.__class__.new()
result.it = self.it.__copy__()
result.i = self.i
return result
Here is an example of the kind of "fragility" produced by "accidental
copyability" of an iterator -- the reason why one must NOT use
copy.copy expecting, if it succeeds, to receive as a result an
iterator which is iterable-on independently from the original. Here
is an iterator class that iterates (in preorder) on "trees" which, for
simplicity, are just nested lists -- any item that's a list is treated
as a subtree, any other item as a leaf.
class ListreeIter(object):
def __init__(self, tree):
self.tree = [tree]
self.indx = [-1]
def __iter__(self):
return self
def next(self):
if not self.indx:
raise StopIteration
self.indx[-1] += 1
try:
result = self.tree[-1][self.indx[-1]]
except IndexError:
self.tree.pop()
self.indx.pop()
return self.next()
if type(result) is not list:
return result
self.tree.append(result)
self.indx.append(-1)
return self.next()
Now, for example, the following code:
import copy
x = [ [1,2,3], [4, 5, [6, 7, 8], 9], 10, 11, [12] ]
print 'showing all items:',
it = ListreeIter(x)
for i in it:
print i,
if i==6: cop = copy.copy(it)
print
print 'showing items >6 again:'
for i in cop: print i,
print
does NOT work as intended -- the "cop" iterator gets consumed, and
exhausted, step by step as the original "it" iterator is, because
the accidental (rather than deliberate) copying performed by
copy.copy shares, rather than duplicating the "index" list, which
is the mutable attribute it.indx (a list of numerical indices).
Thus, this "client code" of the iterator, which attemps to iterate
twice over a portion of the sequence via a copy.copy on the
iterator, is NOT correct.
Some correct solutions include using itertools.tee, i.e., changing
the first for loop into:
for i in it:
print i,
if i==6:
it, cop = itertools.tee(it)
break
for i in it: print i,
(note that we MUST break the loop in two, otherwise we'd still
be looping on the ORIGINAL value of it, which must NOT be used
further after the call to tee!!!); or making a list, i.e.:
for i in it:
print i,
if i==6:
cop = lit = list(it)
break
for i in lit: print i,
(again, the loop must be broken in two, since iterator 'it'
gets exhausted by the call list(it)).
Finally, all of these solutions would work if Listiter supplied
a suitable __copy__ method, as this PEP recommends:
def __copy__(self):
result = self.__class__.new()
result.tree = copy.copy(self.tree)
result.indx = copy.copy(self.indx)
return result
There is no need to get any "deeper" in the copy, but the two
mutable "index state" attributes must indeed be copied in order
to achieve a "proper" (independently iterable) iterator-copy.
The recommended solution is to have class Listiter supply this
__copy__ method AND have client code use itertools.tee (with
the split-in-two-parts loop as shown above). This will make
client code maximally tolerant of different iterator types it
might be using AND achieve good performance for tee'ing of this
specific iterator type at the same time.
References References