PEP: 255
Title: Simple Generators
Version: $Revision$
Author: nas@python.ca (Neil Schemenauer),
        tim.one@home.com (Tim Peters),
        magnus@hetland.org (Magnus Lie Hetland)
Discussion-To: python-iterators@lists.sourceforge.net
Status: Draft
Type: Standards Track
Requires: 234
Created: 18-May-2001
Python-Version: 2.2
Post-History:


Abstract

    This PEP introduces the concept of generators to Python, as well as a
    new statement used in conjunction with them, the "yield" statement.


Motivation

    When a producer function has a hard enough job that it requires
    maintaining state between values produced, most programming languages
    offer no pleasant and efficient solution beyond adding a callback
    function to the producer's argument list, to be called with each value
    produced.

    For example, tokenize.py in the standard library takes this approach:
    the caller must pass a "tokeneater" function to tokenize(), called
    whenever tokenize() finds the next token.  This allows tokenize to be
    coded in a natural way, but programs calling tokenize are typically
    convoluted by the need to remember between callbacks which token(s)
    were seen last.  The tokeneater function in tabnanny.py is a good
    example of that, maintaining a state machine in global variables, to
    remember across callbacks what it has already seen and what it hopes
    to see next.  This was difficult to get working correctly, and is
    still difficult for people to understand.  Unfortunately, that's
    typical of this approach.

    An alternative would have been for tokenize to produce an entire parse
    of the Python program at once, in a large list.  Then tokenize clients
    could be written in a natural way, using local variables and local
    control flow (such as loops and nested if statements) to keep track of
    their state.  But this isn't practical:  programs can be very large,
    so no a priori bound can be placed on the memory needed to materialize
    the whole parse; and some tokenize clients only want to see whether
    something specific appears early in the program (e.g., a future
    statement, or, as is done in IDLE, just the first indented statement),
    and then parsing the whole program first is a severe waste of time.

    Another alternative would be to make tokenize an iterator[1],
    delivering the next token whenever its .next() method is invoked.
    This is pleasant for the caller in the same way a large list of
    results would be, but without the memory and "what if I want to get
    out early?" drawbacks.  However, this shifts the burden on tokenize to
    remember *its* state between .next() invocations, and the reader need
    only glance at tokenize.tokenize_loop() to realize what a horrid chore
    that would be.  Or picture a recursive algorithm for producing the
    nodes of a general tree structure:  to cast that into an iterator
    framework requires removing the recursion manually and maintaining the
    state of the traversal by hand.

    A fourth option is to run the producer and consumer in separate
    threads.  This allows both to maintain their states in natural ways,
    and so is pleasant for both.  Indeed, Demo/threads/Generator.py in the
    Python source distribution provides a usable synchronized-
    communication class for doing that in a general way.  This doesn't
    work on platforms without threads, though, and is very slow on
    platforms that do (compared to what is achievable without threads).

    A final option is to use the Stackless[2][3] variant implementation of
    Python instead, which supports lightweight coroutines.  This has much
    the same programmatic benefits as the thread option, but is much more
    efficient.  However, Stackless is a radical and controversial
    rethinking of the Python core, and it may not be possible for Jython
    to implement the same semantics.  This PEP isn't the place to debate
    that, so suffice it to say here that generators provide a useful
    subset of Stackless functionality in a way that fits easily into the
    current Python implementation.

    That exhausts the current alternatives.  Some other high-level
    languages provide pleasant solutions, notably iterators in Sather[4],
    which were inspired by iterators in CLU; and generators in Icon[5], a
    novel language where every expression "is a generator".  There are
    differences among these, but the basic idea is the same:  provide a
    kind of function that can return an intermediate result ("the next
    value") to its caller, while maintaining the function's local state so
    that the function can be resumed again right where it left off.

    A very simple example:

        def fib():
            a, b = 0, 1
            while 1:
                yield b
                a, b = b, a+b

    When fib() is first invoked, it sets a to 0 and b to 1, then yields b
    back to its caller.  The caller sees 1.  When fib is resumed, from its
    point of view the yield statement is really the same as, say, a print
    statement:  fib continues after the yield with all local state intact.
    a and b then become 1 and 1, and fib loops back to the yield, yielding
    1 to its invoker.  And so on.

    From fib's point of view it's just delivering a sequence of results,
    as if via callback.  But from its caller's point of view, the fib
    invocation is an iterable object that can be resumed at will.  As in
    the thread approach, this allows both sides to be coded in the most
    natural ways; but unlike the thread approach, this can be done
    efficiently and on all platforms.  Indeed, resuming a generator should
    be no more expensive than a function call.
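    For concreteness, here is a small sketch (not part of the original
    example) showing one way a caller might consume fib(), using the
    .next() method described in the Specification below:

        # Drive the generator by hand; each .next() call resumes fib's
        # body at the point of its last yield.
        g = fib()           # no code in fib's body has run yet
        for i in range(6):
            print g.next(),
        # prints:  1 1 2 3 5 8

    Because a generator-iterator also conforms to the iterator protocol,
    the caller could just as well iterate over fib() directly in a
    for-loop, provided it breaks out at some point, since fib() never
    returns.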
    The same kind of approach applies to many producer/consumer functions.
    For example, tokenize.py could yield the next token instead of
    invoking a callback function with it as argument, and tokenize clients
    could iterate over the tokens in a natural way:  a Python generator is
    a kind of Python iterator[1], but of an especially powerful kind.


Specification

    A new statement is introduced:

        yield_stmt:    "yield" expression_list

    "yield" is a new keyword, so a future statement[8] is needed to phase
    this in.  [XXX spell this out]

    The yield statement may only be used inside functions.  A function
    that contains a yield statement is called a generator function.

    When a generator function is called, the actual arguments are bound to
    function-local formal argument names in the usual way, but no code in
    the body of the function is executed.  Instead a generator-iterator
    object is returned; this conforms to the iterator protocol[6], so in
    particular can be used in for-loops in a natural way.  Note that when
    the intent is clear from context, the unqualified name "generator" may
    be used to refer either to a generator-function or a
    generator-iterator.

    Each time the .next() method of a generator-iterator is invoked, the
    code in the body of the generator-function is executed until a yield
    or return statement (see below) is encountered, or until the end of
    the body is reached.

    If a yield statement is encountered, the state of the function is
    frozen, and the value of expression_list is returned to .next()'s
    caller.  By "frozen" we mean that all local state is retained,
    including the current bindings of local variables, the instruction
    pointer, and the internal evaluation stack:  enough information is
    saved so that the next time .next() is invoked, the function can
    proceed exactly as if the yield statement were just another external
    call.

    A generator function can also contain return statements of the form:

        "return"

    Note that an expression_list is not allowed on return statements in
    the body of a generator (although, of course, they may appear in the
    bodies of non-generator functions nested within the generator).

    When a return statement is encountered, nothing is returned, but a
    StopIteration exception is raised, signalling that the iterator is
    exhausted.  The same is true if control flows off the end of the
    function.  Note that return means "I'm done, and have nothing
    interesting to return", for both generator functions and non-generator
    functions.
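    To make the machinery above concrete, here is a small sketch that is
    not part of the specification proper; the generator name is only
    illustrative.  The point is where the body runs and where
    StopIteration appears:

        def pair():
            yield 1
            yield 2

        it = pair()      # arguments (none here) are bound; no body code runs
        print it.next()  # runs the body up to the first yield; prints 1
        print it.next()  # resumes after the first yield; prints 2
        it.next()        # control flows off the end: raises StopIteration

    A for-loop over pair() catches the StopIteration itself, which is why
    generator-iterators can be used in for-loops without any special
    handling by the caller.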
Example

        # A binary tree class.
        class Tree:

            def __init__(self, label, left=None, right=None):
                self.label = label
                self.left = left
                self.right = right

            def __repr__(self, level=0, indent="    "):
                s = level*indent + `self.label`
                if self.left:
                    s = s + "\n" + self.left.__repr__(level+1, indent)
                if self.right:
                    s = s + "\n" + self.right.__repr__(level+1, indent)
                return s

            def __iter__(self):
                return inorder(self)

        # Create a Tree from a list.
        def tree(list):
            n = len(list)
            if n == 0:
                return []
            i = n / 2
            return Tree(list[i], tree(list[:i]), tree(list[i+1:]))

        # A recursive generator that generates Tree leaves in in-order.
        def inorder(t):
            if t:
                for x in inorder(t.left):
                    yield x
                yield t.label
                for x in inorder(t.right):
                    yield x

        # Show it off: create a tree.
        t = tree("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
        # Print the nodes of the tree in in-order.
        for x in t:
            print x,
        print

        # A non-recursive generator.
        def inorder(node):
            stack = []
            while node:
                while node.left:
                    stack.append(node)
                    node = node.left
                yield node.label
                while not node.right:
                    try:
                        node = stack.pop()
                    except IndexError:
                        return
                    yield node.label
                node = node.right

        # Exercise the non-recursive generator.
        for x in t:
            print x,
        print


Reference Implementation

    A preliminary patch against the CVS Python source is available[7].


Footnotes and References

    [1] PEP 234, http://python.sf.net/peps/pep-0234.html

    [2] http://www.stackless.com/

    [3] PEP 219, http://python.sf.net/peps/pep-0219.html

    [4] "Iteration Abstraction in Sather"
        Murer, Omohundro, Stoutamire and Szyperski
        http://www.icsi.berkeley.edu/~sather/Publications/toplas.html

    [5] http://www.cs.arizona.edu/icon/

    [6] The concept of iterators is described in PEP 234
        http://python.sf.net/peps/pep-0234.html

    [7] http://python.ca/nas/python/generator.diff

    [8] http://python.sf.net/peps/pep-0236.html


Copyright

    This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: