PEP: 201
Title: Parallel Iteration
Version: $Revision$
Author: bwarsaw@beopen.com (Barry A. Warsaw)
Python-Version: 2.0
Status: Draft
Created: 13-Jul-2000
Post-History:


Introduction

    This PEP describes the `parallel iteration' proposal for Python
    2.0, previously known as `parallel for loops'. This PEP tracks
    the status and ownership of this feature, slated for introduction
    in Python 2.0. It contains a description of the feature and
    outlines changes necessary to support the feature. This PEP
    summarizes discussions held in mailing list forums, and provides
    URLs for further information, where appropriate. The CVS revision
    history of this file contains the definitive historical record.


Motivation

    Standard for-loops in Python iterate over every element in a
    sequence until the sequence is exhausted[1]. However, for-loops
    iterate over only a single sequence, and it is often desirable to
    loop over more than one sequence, in a lock-step, "Chinese Menu"
    type of way.

    The common idioms used to accomplish this are unintuitive and
    inflexible. This PEP proposes a standard way of performing such
    iterations by introducing a new builtin function called `zip'.


Parallel For-Loops

    Parallel for-loops are non-nested iterations over two or more
    sequences, such that at each pass through the loop, one element
    from each sequence is taken to compose the target. This behavior
    can already be accomplished in Python through the use of the map()
    built-in function:

        >>> a = (1, 2, 3)
        >>> b = (4, 5, 6)
        >>> for i in map(None, a, b): print i
        ...
        (1, 4)
        (2, 5)
        (3, 6)
        >>> map(None, a, b)
        [(1, 4), (2, 5), (3, 6)]

    The for-loop simply iterates over this list as normal.

    While the map() idiom is a common one in Python, it has several
    disadvantages:

    - It is non-obvious to programmers without a functional
      programming background.

    - The use of the magic `None' first argument is non-obvious.

    - It has arbitrary, often unintended, and inflexible semantics
      when the sequences are not of the same length: the shorter
      sequences are padded with `None'.

        >>> c = (4, 5, 6, 7)
        >>> map(None, a, c)
        [(1, 4), (2, 5), (3, 6), (None, 7)]

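    On a present-day Python 3 interpreter, map(None, ...) is no
    longer accepted, so the padding behavior criticized above cannot
    be reproduced verbatim. As an illustration only (this is not part
    of the proposal), itertools.zip_longest gives the same result:

```python
from itertools import zip_longest

a = (1, 2, 3)
c = (4, 5, 6, 7)

# zip_longest fills the exhausted sequence with None by default,
# the same result map(None, a, c) produced in the session above.
padded = list(zip_longest(a, c))
print(padded)  # [(1, 4), (2, 5), (3, 6), (None, 7)]
```
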
    For these reasons, several proposals were floated in the Python
    2.0 beta time frame for providing a better spelling of parallel
    for-loops. The initial proposals centered around syntactic
    changes to the for statement, but conflicts and problems with the
    syntax were unresolvable, especially when parallel for-loops were
    combined with another proposed feature called `list
    comprehensions' (see pep-0202.txt).


The Proposed Solution

    The proposed solution is to introduce a new built-in sequence
    generator function, available in the __builtin__ module. This
    function is to be called `zip' and has the following signature:

        zip(seqa, [seqb, [...]], [pad=<value>])

    zip() takes one or more sequences and weaves their elements
    together, just as map(None, ...) does with sequences of equal
    length. The optional keyword argument `pad', if supplied, is a
    value used to pad all shorter sequences to the length of the
    longest sequence. If `pad' is omitted, then weaving stops when
    the shortest sequence is exhausted.

    It is not possible to pad short lists with different pad values,
    nor will zip() ever raise an exception with lists of different
    lengths. To accomplish either behavior, the sequences must be
    checked and processed before the call to zip() -- but see the Open
    Issues below for more discussion.


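    The semantics described above can be sketched eagerly in
    present-day Python 3. The names zip_with_pad, checked_zip and
    _NO_PAD are hypothetical and chosen for illustration; the
    behavior for zero sequences is an arbitrary choice here, since
    the PEP leaves that case open. The second function shows one way
    to check lengths before zipping, since zip() itself never raises
    on a mismatch:

```python
_NO_PAD = object()  # hypothetical sentinel meaning "pad was not supplied"

def zip_with_pad(*seqs, pad=_NO_PAD):
    """Eager sketch of the proposed semantics (name is hypothetical)."""
    if not seqs:
        return []  # arbitrary choice; the PEP leaves this case open
    lengths = [len(s) for s in seqs]
    # Without pad, stop at the shortest sequence; with pad, run to the longest.
    n = min(lengths) if pad is _NO_PAD else max(lengths)
    return [tuple(s[i] if i < len(s) else pad for s in seqs) for i in range(n)]

def checked_zip(*seqs):
    """Reject unequal lengths up front, then zip as usual."""
    if len(set(len(s) for s in seqs)) > 1:
        raise ValueError("sequences differ in length")
    return zip_with_pad(*seqs)

a = (1, 2, 3, 4)
d = (12, 13)
print(zip_with_pad(a, d))         # [(1, 12), (2, 13)]
print(zip_with_pad(a, d, pad=0))  # [(1, 12), (2, 13), (3, 0), (4, 0)]
```

    Using a private sentinel rather than None as the default lets
    pad=None remain a legitimate pad value, as in the PEP's examples.
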
Lazy Execution

    For performance purposes, zip() does not construct the list of
    tuples immediately. Instead it instantiates an object that
    implements a __getitem__() method and conforms to the informal
    for-loop protocol. This method constructs the individual tuples
    on demand.

    Guido is strongly opposed to lazy execution. See Open Issues.


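    The informal for-loop protocol mentioned above (index with 0, 1,
    2, ... until IndexError is raised) is still honored by
    present-day Python 3 for objects that define only __getitem__().
    A minimal sketch, using a hypothetical class LazyZip with an
    extra counter added purely to make the on-demand construction
    visible:

```python
class LazyZip:
    """Hypothetical lazy zipper: tuples are built only when indexed."""

    def __init__(self, *seqs):
        self._seqs = seqs
        self.built = 0  # number of tuples constructed so far

    def __getitem__(self, i):
        if not self._seqs:
            raise IndexError(i)
        # An IndexError from any underlying sequence ends the iteration.
        item = tuple(s[i] for s in self._seqs)
        self.built += 1
        return item

z = LazyZip((1, 2, 3), (4, 5, 6))
print(z.built)   # 0: nothing has been constructed yet
for pair in z:   # the loop asks for z[0], z[1], ... until IndexError
    print(pair)
print(z.built)   # 3
```
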
Examples

    Here are some examples, based on the reference implementation
    below.

        >>> a = (1, 2, 3, 4)
        >>> b = (5, 6, 7, 8)
        >>> c = (9, 10, 11)
        >>> d = (12, 13)

        >>> zip(a, b)
        [(1, 5), (2, 6), (3, 7), (4, 8)]

        >>> zip(a, d)
        [(1, 12), (2, 13)]

        >>> zip(a, d, pad=0)
        [(1, 12), (2, 13), (3, 0), (4, 0)]

        >>> zip(a, d, pid=0)
        Traceback (most recent call last):
          File "<stdin>", line 1, in ?
          File "/usr/tmp/python-iKAOxR", line 11, in zip
        TypeError: unexpected keyword arguments

        >>> zip(a, b, c, d)
        [(1, 5, 9, 12), (2, 6, 10, 13)]

        >>> zip(a, b, c, d, pad=None)
        [(1, 5, 9, 12), (2, 6, 10, 13), (3, 7, 11, None), (4, 8, None, None)]
        >>> map(None, a, b, c, d)
        [(1, 5, 9, 12), (2, 6, 10, 13), (3, 7, 11, None), (4, 8, None, None)]

    Note that when the sequences are of the same length, zip() is
    reversible:

        >>> a = (1, 2, 3)
        >>> b = (4, 5, 6)
        >>> x = zip(a, b)
        >>> y = zip(*x) # alternatively, apply(zip, x)
        >>> z = zip(*y) # alternatively, apply(zip, y)
        >>> x
        [(1, 4), (2, 5), (3, 6)]
        >>> y
        [(1, 2, 3), (4, 5, 6)]
        >>> z
        [(1, 4), (2, 5), (3, 6)]
        >>> x == z
        1

    It is not possible to reverse zip this way when the sequences are
    not all the same length.


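    The loss of reversibility for unequal lengths can be seen with a
    short session. This sketch uses the zip() built-in as it exists
    in present-day Python 3, which truncates to the shortest input
    and returns an iterator (hence the list() calls):

```python
a = (1, 2, 3, 4)
d = (12, 13)

x = list(zip(a, d))  # truncated to the shorter sequence
y = list(zip(*x))    # attempt to reverse the zip

print(x)          # [(1, 12), (2, 13)]
print(y)          # [(1, 2), (12, 13)]
print(y[0] == a)  # False: elements 3 and 4 of a were dropped
```
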
Reference Implementation

    Here is a reference implementation, in Python, of the zip()
    built-in function and helper class. These would ultimately be
    replaced by equivalent C code.

        class _Zipper:
            def __init__(self, args, kws):
                # Defaults
                self.__padgiven = 0
                if kws.has_key('pad'):
                    self.__padgiven = 1
                    self.__pad = kws['pad']
                    del kws['pad']
                # Assert no unknown arguments are left
                if kws:
                    raise TypeError('unexpected keyword arguments')
                self.__sequences = args
                self.__seqlen = len(args)

            def __getitem__(self, i):
                if not self.__sequences:
                    raise IndexError
                ret = []
                exhausted = 0
                for s in self.__sequences:
                    try:
                        ret.append(s[i])
                    except IndexError:
                        # Without padding, the first exhausted
                        # sequence ends the iteration.
                        if not self.__padgiven:
                            raise
                        # With padding, stop only once every
                        # sequence is exhausted.
                        exhausted = exhausted + 1
                        if exhausted == self.__seqlen:
                            raise
                        ret.append(self.__pad)
                return tuple(ret)

            def __len__(self):
                # If we're padding, then len is the length of the longest sequence,
                # otherwise it's the length of the shortest sequence.
                if not self.__padgiven:
                    shortest = -1
                    for s in self.__sequences:
                        slen = len(s)
                        if shortest < 0 or slen < shortest:
                            shortest = slen
                    if shortest < 0:
                        return 0
                    return shortest
                longest = 0
                for s in self.__sequences:
                    slen = len(s)
                    if slen > longest:
                        longest = slen
                return longest

            def __cmp__(self, other):
                # Element-wise comparison; the shorter object sorts first.
                i = 0
                smore = 1
                omore = 1
                while 1:
                    try:
                        si = self[i]
                    except IndexError:
                        smore = 0
                    try:
                        oi = other[i]
                    except IndexError:
                        omore = 0
                    if not smore and not omore:
                        return 0
                    elif not smore:
                        return -1
                    elif not omore:
                        return 1
                    test = cmp(si, oi)
                    if test:
                        return test
                    i = i + 1

            def __str__(self):
                ret = []
                i = 0
                while 1:
                    try:
                        ret.append(self[i])
                    except IndexError:
                        break
                    i = i + 1
                return str(ret)
            __repr__ = __str__


        def zip(*args, **kws):
            return _Zipper(args, kws)


Rejected Elaborations

    Some people have suggested that the user be able to specify the
    type of the inner and outer containers for the zipped sequence.
    This would be specified by additional keyword arguments to zip(),
    named `inner' and `outer'.

    This elaboration is rejected for several reasons. First, there
    really is no outer container, even though there appears to be an
    outer list container in the examples above. This is simply an
    artifact of the repr() of the zipped object. User code can do its
    own looping over the zipped object via __getitem__(), and build
    any type of outer container for the fully evaluated, concrete
    sequence. For example, to build a zipped object with a list as
    the outer container, use

        >>> list(zip(sequence_a, sequence_b, sequence_c))

    and for a tuple outer container, use

        >>> tuple(zip(sequence_a, sequence_b, sequence_c))

    This type of construction will usually not be necessary though,
    since it is expected that zipped objects will most often appear in
    for-loops.

    Second, allowing the user to specify the inner container
    introduces needless complexity and arbitrary decisions. You might
    imagine that instead of the default tuple inner container, the
    user could prefer a list, or a dictionary, or instances of some
    sequence-like class.

    One problem is the API. Should the argument to `inner' be a type
    or a template object? For flexibility, the argument should
    probably be a type object (e.g. TupleType, ListType, DictType), or
    a class. For classes, the implementation could just pass the zip
    element to the constructor. But what about built-in types that
    don't have constructors? They would have to be special-cased in
    the implementation (e.g. what is the constructor for TupleType?
    The tuple() built-in).

    Another problem that arises is for zips greater than length two.
    Say you had three sequences and you wanted the inner type to be a
    dictionary. What would the semantics of the following be?

        >>> zip(sequence_a, sequence_b, sequence_c, inner=DictType)

    Would the key be (element_a, element_b) and the value be
    element_c, or would the key be element_a and the value be
    (element_b, element_c)? Or should an exception be thrown?

    This suggests that the specification of the inner container type
    is needless complexity. It isn't likely that the inner container
    will need to be specified very often, and it is easy to roll your
    own should you need it. Tuples are chosen for the inner container
    type due to their (slight) memory footprint and performance
    advantages.


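    As an illustration of "rolling your own" inner container, the
    conversion can be done by the caller as each tuple is produced.
    This sketch is written for present-day Python 3 with made-up
    sample data. The dictionary layout is one arbitrary answer to the
    key/value question raised above, not one the PEP endorses:

```python
sequence_a = ("x", "y", "z")
sequence_b = (1, 2, 3)
sequence_c = (True, False, True)

# Lists as the inner container: convert each tuple as it is produced.
as_lists = [list(t) for t in zip(sequence_a, sequence_b, sequence_c)]

# Dictionaries as the inner container: the caller states explicitly
# which element is the key, so no guessing is needed inside zip().
as_dicts = [{ea: (eb, ec)}
            for ea, eb, ec in zip(sequence_a, sequence_b, sequence_c)]

print(as_lists[0])  # ['x', 1, True]
print(as_dicts[0])  # {'x': (1, True)}
```
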
Open Issues

    - Guido opposes lazy evaluation for zip(). He believes zip()
      should return a real list, with an xzip() lazy evaluator added
      later if necessary.

    - What should "zip(a)" do? Given

          a = (1, 2, 3); zip(a)

      three outcomes are possible.

      1) Returns [(1,), (2,), (3,)]

         Pros: no special casing in the implementation or in user
         code, and is more consistent with the description of its
         semantics. Cons: this isn't what map(None, a) would return,
         and may be counter to user expectations.

      2) Returns [1, 2, 3]

         Pros: consistency with map(None, a), and simpler code for
         for-loops, e.g.

             for i in zip(a):

         instead of

             for (i,) in zip(a):

         Cons: too much complexity and special casing for what should
         be a relatively rare usage pattern.

      3) Raises TypeError

         Pros: zip(a) doesn't make much sense and could be confusing
         to explain.

         Cons: needless restriction

      Current scoring seems to generally favor outcome 1.

    - What should "zip()" do?

      Along similar lines, zip() with no arguments (or zip() with just
      a pad argument) can have ambiguous semantics. Should this
      return no elements or an infinite number? For these reasons,
      raising a TypeError exception in this case makes the most
      sense.

    - The name of the built-in `zip' may cause some initial confusion
      with the zip compression algorithm. Other suggestions include
      (but are not limited to!): marry, weave, parallel, lace, braid,
      interlace, permute, furl, tuples, lists, stitch, collate, knit,
      plait, fold, with, mktuples, maketuples, totuples, gentuples,
      tupleorama.

      All have disadvantages, and there is no clear unanimous choice,
      therefore the decision was made to go with `zip' because the
      same functionality is available in other languages
      (e.g. Haskell) under the name `zip'[2].

    - Should zip() be included in the builtins module or should it be
      in a separate generators module (possibly with other candidate
      functions like irange())?

    - Padding short sequences with different values. A suggestion has
      been made to allow a `padtuple' (probably better called `pads'
      or `padseq') argument similar to `pad'. This sequence must have
      a length equal to the number of sequences given. It is a
      sequence of the individual pad values to use for each sequence,
      should it be shorter than the maximum length.

      One problem is what to do if `padtuple' itself isn't of the
      right length. A TypeError seems to be the only choice here.

      How do `pad' and `padtuple' interact? Perhaps if padtuple
      were too short, it could use pad as a fallback. padtuple would
      always override pad if both were given.


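    The `padtuple' suggestion in the last item can be sketched as a
    stand-alone helper in present-day Python 3. The name zip_padseq
    and its signature are hypothetical. The sketch adopts the
    TypeError option for a pad sequence of the wrong length, and does
    not attempt the `pad' fallback, which remains an open question:

```python
def zip_padseq(seqs, pads):
    """Hypothetical helper: pad each sequence with its own value."""
    if len(pads) != len(seqs):
        raise TypeError("need exactly one pad value per sequence")
    longest = max((len(s) for s in seqs), default=0)
    return [
        tuple(s[i] if i < len(s) else p for s, p in zip(seqs, pads))
        for i in range(longest)
    ]

a = (1, 2, 3, 4)
d = (12, 13)
print(zip_padseq([a, d], pads=[0, -1]))
# [(1, 12), (2, 13), (3, -1), (4, -1)]
```
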
References

    [1] http://www.python.org/doc/current/ref/for.html
    [2] http://www.haskell.org/onlinereport/standard-prelude.html#$vzip

    TBD: URL to python-dev archives

Copyright

    This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: