PEP: 279
Title: Enhanced Generators
Version: $Revision$
Last-Modified: $Date$
Author: python@rcn.com (Raymond D. Hettinger)
Status: Draft
Type: Standards Track
Created: 30-Jan-2002
Python-Version: 2.3
Post-History:


Abstract

    This PEP introduces two orthogonal (not mutually exclusive) ideas
    for enhancing the generators introduced in Python version 2.2 [1].
    The goal is to increase the convenience, utility, and power
    of generators.


Rationale

    Python 2.2 introduced the concept of an iterable interface as proposed
    in PEP 234 [4].  The iter() factory function was provided as a common
    calling convention, and deep changes were made to use iterators as a
    unifying theme throughout Python.  The unification came in the form of
    establishing a common iterable interface for mappings, sequences,
    and file objects.

    Generators, as proposed in PEP 255 [1], were introduced as a means for
    making it easier to create iterators, especially ones with complex
    internal execution or variable states.  When I created new programs,
    generators were often the tool of choice for creating an iterator.

    However, when updating existing programs, I found that the tool had
    another use, one that improved program function as well as structure.
    Some programs exhibited a pattern of creating large lists and then
    looping over them.  As data sizes increased, the programs encountered
    scalability limitations owing to excessive memory consumption (and
    malloc time) for the intermediate lists.  Generators were found to be
    directly substitutable for the lists while eliminating the memory
    issues through lazy evaluation, a.k.a. just-in-time manufacturing.
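
    As an illustration of that substitution, the sketch below computes
    the same total once through an intermediate list and once through a
    generator.  The helper names squares_list and squares_gen are
    hypothetical and are not taken from any of the programs mentioned
    above; the code assumes a current Python 3 interpreter.

```python
def squares_list(n):
    """Build the whole intermediate list before returning it."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_gen(n):
    """Produce the same values one at a time, on demand."""
    for i in range(n):
        yield i * i


# Both feed sum() (or a for-loop) identically; only the generator
# avoids holding all n values in memory at once.
total_from_list = sum(squares_list(1000))
total_from_gen = sum(squares_gen(1000))
```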

    Python itself encountered similar issues.  As a result, xrange() and
    xreadlines() were introduced.  And, in the case of file objects and
    mappings, lazy evaluation became the norm.  Generators provide a tool
    to program memory conserving for-loops whenever complete evaluation is
    not desired because of memory restrictions or availability of data.

    The next steps in the evolution of generators are:

    1. Add a new builtin function, iterindexed(), which was made possible
       once iterators and generators became available.  It provides
       all iterables with the same advantage that iteritems() affords
       to dictionaries -- a compact, readable, reliable index notation.

    2. Establish a generator alternative to list comprehensions [3]
       that provides a simple way to convert a list comprehension into
       a generator whenever memory issues arise.

    All of the suggestions are designed to take advantage of the
    existing implementation and require little additional effort to
    incorporate.  Each is backward compatible and requires no new
    keywords.  The two generator tools go into Python 2.3 when
    generators become final and are not imported from __future__.


BDFL Pronouncements

    1.  The new built-in function is ACCEPTED.  There needs to be further
    discussion on the best name for the function.

    2.  Generator comprehensions are REJECTED.  The rationale is that
    the benefits are marginal since generators can already be coded directly
    and the costs are high because implementation and maintenance require
    major efforts with the parser.


Reference Implementation

    There is not currently a CPython implementation; however, a simulation
    module written in pure Python is available on SourceForge [5].  The
    simulation covers every feature proposed in this PEP and is meant
    to allow direct experimentation with the proposals.

    There is also a module [6] with working source code for all of the
    examples used in this PEP.  It serves as a test suite for the simulator
    and it documents how each of the new features works in practice.

    The authors and implementers of PEP 255 [1] were contacted to provide
    their assessment of whether these enhancements were going to be
    straightforward to implement and require only minor modification
    of the existing generator code.  Neil felt the assertion was correct.
    Ka-Ping thought so also.  GvR said he could believe that it was true.
    Tim did not have an opportunity to give an assessment.


Specification for a new builtin [ACCEPTED PROPOSAL]:


    def iterindexed(collection):
        'Generates an indexed series:  (0,coll[0]), (1,coll[1]) ...'
        i = 0
        it = iter(collection)
        while 1:
            yield (i, it.next())
            i += 1
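
    A short usage sketch follows.  It restates the function with a
    for-loop so that it runs on a current Python 3 interpreter (an
    assumption; the listing above uses the Python 2.2 iterator
    protocol), and the sample data are made up for illustration.

```python
def iterindexed(collection):
    """Yield (index, item) pairs for any iterable.

    Assumption: a for-loop restatement of the PEP's function for Python 3.
    """
    i = 0
    for item in collection:
        yield (i, item)
        i += 1


# Works on a list ...
pairs = list(iterindexed(['a', 'b', 'c']))

# ... and unpacks naturally in a for-loop header over any iterable.
labels = []
for i, ch in iterindexed('xyz'):
    labels.append('%d:%s' % (i, ch))
```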


    Note A: PEP 212 Loop Counter Iteration [2] discussed several
    proposals for achieving indexing.  Some of the proposals only work
    for lists, unlike the above function, which works for any generator,
    xrange, sequence, or iterable object.  Also, those proposals were
    presented and evaluated in the world prior to Python 2.2 which did
    not include generators.  As a result, the non-generator version in
    PEP 212 had the disadvantage of consuming memory with a giant list
    of tuples.  The generator version presented here is fast and light,
    works with all iterables, and allows users to abandon the sequence
    in mid-stream with no loss of computation effort.
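
    The point about abandoning the sequence in mid-stream can be
    sketched with an endless source, which could never be stored as a
    list of tuples.  The generator name squares is hypothetical, and
    iterindexed() is restated with a for-loop on the assumption of a
    current Python 3 interpreter.

```python
def iterindexed(collection):
    """Yield (index, item) pairs lazily (Python 3 restatement, an assumption)."""
    i = 0
    for item in collection:
        yield (i, item)
        i += 1


def squares():
    """Hypothetical endless stream: 1, 4, 9, 16, ..."""
    n = 1
    while True:
        yield n * n
        n += 1


# Stop at the first square above 50; nothing past it is ever computed.
found = None
for i, sq in iterindexed(squares()):
    if sq > 50:
        found = (i, sq)
        break
```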

    There are other PEPs which touch on related issues:  integer iterators,
    integer for-loops, and one for modifying the arguments to range and
    xrange.  The iterindexed() proposal does not preclude the other proposals
    and it still meets an important need even if those are adopted -- the need
    to count items in any iterable.  The other proposals give a means of
    producing an index but not the corresponding value.  This is especially
    problematic if a sequence is given which doesn't support random access
    such as a file object, generator, or sequence defined with __getitem__.


    Note B:  Almost all of the PEP reviewers welcomed the function but were
    divided as to whether there should be any builtins.  The main argument
    for a separate module was to slow the rate of language inflation.  The
    main argument for a builtin was that the function is destined to be
    part of a core programming style, applicable to any object with an
    iterable interface.  Just as zip() solves the problem of looping
    over multiple sequences, the iterindexed() function solves the loop
    counter problem.

    If only one builtin is allowed, then iterindexed() is the most important
    general purpose tool, solving the broadest class of problems while
    improving program brevity, clarity and reliability.


    Note C:  Various alternative names have been proposed:

        iterindexed()-- five syllables is a mouthful
        index()      -- nice verb but could be confused with the .index() method
        indexed()    -- widely liked however adjectives should be avoided
        count()      -- direct and explicit but often used in other contexts
        itercount()  -- direct, explicit and hated by more than one person
        enumerate()  -- a contender but doesn't mention iteration or indices
        iteritems()  -- conflicts with key:value concept for dictionaries


    Note D:  This function was originally proposed with optional start and
    stop arguments.  GvR pointed out that the function call
    iterindexed(seqn,4,6) had an alternate, plausible interpretation as a
    slice that would return the fourth and fifth elements of the sequence.
    To avoid the ambiguity, the optional arguments were dropped even though
    it meant losing flexibility as a loop counter.  That flexibility was most
    important for the common case of counting from one, as in:
        for linenum, line in iterindexed(source, 1):  print linenum, line
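
    With the single-argument form, counting from one can still be done
    by adjusting the index inside the loop body.  A minimal sketch,
    assuming a current Python 3 interpreter, a for-loop restatement of
    iterindexed(), and made-up sample lines:

```python
def iterindexed(collection):
    """Yield (index, item) pairs (Python 3 restatement, an assumption)."""
    i = 0
    for item in collection:
        yield (i, item)
        i += 1


source = ["first line", "second line", "third line"]  # made-up sample data

numbered = []
for i, line in iterindexed(source):
    linenum = i + 1  # shift the zero-based index so counting starts at one
    numbered.append("%d %s" % (linenum, line))
```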


    Comments from GvR:  filter and map should die and be subsumed into list
        comprehensions, not grow more variants. I'd rather introduce builtins
        that do iterator algebra (e.g. the iterzip that I've often used as
        an example).

        I like the idea of having some way to iterate over a sequence and
        its index set in parallel.  It's fine for this to be a builtin.

        I don't like the name "indexed"; adjectives do not make good
        function names.  Maybe iterindexed()?

    Comments from Ka-Ping Yee:  I'm also quite happy with everything you
        proposed ... and the extra builtins (really 'indexed' in particular)
        are things I have wanted for a long time.

    Comments from Neil Schemenauer:  The new builtins sound okay.  Guido
        may be concerned with increasing the number of builtins too much.  You
        might be better off selling them as part of a module.  If you use a
        module then you can add lots of useful functions (Haskell has lots of
        them that we could steal).

    Comments from Magnus Lie Hetland:  I think indexed would be a useful and
        natural built-in function. I would certainly use it a lot.
        I like indexed() a lot; +1. I'm quite happy to have it make PEP 281
        obsolete. Adding a separate module for iterator utilities seems like
        a good idea.

    Comments from the Community:  The response to the iterindexed() proposal
        has been close to 100% favorable.  Almost everyone loves the idea.

    Author response:  Prior to these comments, four builtins were proposed.
        After the comments, xmap, xfilter, and xzip were withdrawn.  The one
        that remains is vital for the language and is proposed by itself.
        iterindexed() is trivially easy to implement and can be documented in
        minutes.  More importantly, it is useful in everyday programming
        which does not otherwise involve explicit use of generators.

        Though withdrawn from the proposal, I still secretly covet xzip()
        a.k.a. iterzip() but think that it will happen on its own someday.


Specification for Generator Comprehensions [REJECTED PROPOSAL]:

    If a list comprehension starts with a 'yield' keyword, then
    express the comprehension with a generator.  For example:

        g = [yield (len(line),line)  for line in file  if len(line)>5]

    This would be implemented as if it had been written:

        def __temp():
            for line in file:
                if len(line) > 5:
                    yield (len(line), line)
        g = __temp()
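
    Because the bracketed yield form is only a proposal, it cannot be
    run as written.  The sketch below compares an ordinary list
    comprehension with a hand-written generator of the same shape as
    the expansion above.  The sample lines and the name long_lines are
    hypothetical, and the code assumes a current Python 3 interpreter.

```python
# Made-up sample data standing in for the lines of a file.
lines = ["ab\n", "a longer line\n", "xyz\n", "another long one\n"]

# Eager: the list comprehension builds every tuple immediately.
eager = [(len(line), line) for line in lines if len(line) > 5]


def long_lines(source):
    """Lazy: same loop and filter, but values are produced on demand."""
    for line in source:
        if len(line) > 5:
            yield (len(line), line)


g = long_lines(lines)
first = next(g)   # runs only as far as the first qualifying line
rest = list(g)    # drains whatever remains
```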


    Note A: There is some discussion about whether the enclosing brackets
    should be part of the syntax for generator comprehensions.  On the
    plus side, it neatly parallels list comprehensions and would be
    immediately recognizable as a similar form with similar internal
    syntax (taking maximum advantage of what people already know).
    More importantly, it sets off the generator comprehension from the
    rest of the function so as to not suggest that the enclosing
    function is a generator (currently the only cue that a function is
    really a generator is the presence of the yield keyword).  On the
    minus side, the brackets may falsely suggest that the whole
    expression returns a list.  Most of the feedback received to date
    indicates that brackets are helpful and not misleading. Unfortunately,
    the one dissent is from GvR.

    A key advantage of the generator comprehension syntax is that it
    makes it trivially easy to transform existing list comprehension
    code to a generator by adding yield.  Likewise, it can be converted
    back to a list by deleting yield.  This makes it easy to scale up
    programs from small datasets to ones large enough to warrant
    just-in-time evaluation.


    Note B: List comprehensions expose their looping variable and
    leave that variable in the enclosing scope.  The code, [str(i) for
    i in range(8)] leaves 'i' set to 7 in the scope where the
    comprehension appears.  This behavior is by design and reflects an
    intent to duplicate the result of coding a for-loop instead of a
    list comprehension.  Further, the variable 'i' is in a defined and
    potentially useful state on the line immediately following the
    list comprehension.

    In contrast, generator comprehensions do not expose the looping
    variable to the enclosing scope.  The code, [yield str(i) for i in
    range(8)] leaves 'i' untouched in the scope where the
    comprehension appears.  This is also by design and reflects an
    intent to duplicate the result of coding a generator directly
    instead of a generator comprehension.  Further, the variable 'i'
    is not in a defined state on the line immediately following the
    generator comprehension.  It does not come into existence until
    iteration starts (possibly never).
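
    The generator half of this contrast can be observed with a
    hand-written generator, which is how the proposal defines the
    comprehension's behavior.  A minimal sketch, assuming a current
    Python 3 interpreter; the names strs, trace, and demo are
    hypothetical.  It does not demonstrate the list comprehension
    half, because Python 3 list comprehensions no longer leave their
    loop variable in the enclosing scope.

```python
trace = []  # records when the generator body actually begins to run


def strs(n):
    """Hand-written stand-in for the proposed [yield str(i) for i in range(n)]."""
    trace.append('started')
    for i in range(n):
        yield str(i)


def demo():
    g = strs(8)
    ran_early = bool(trace)   # has the body run just by creating g?
    values = list(g)          # iteration starts (and finishes) here
    leaked = 'i' in locals()  # did the loop variable reach this scope?
    return ran_early, leaked, values


ran_early, leaked, values = demo()
```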


    Comments from GvR:  Cute hack, but I think the use of the [] syntax
        strongly suggests that it would return a list, not an iterator. I
        also think that this is trying to turn Python into a functional
        language, where most algorithms use lazy infinite sequences, and I
        just don't think that's where its future lies.

        I don't think it's worth the trouble.  I expect it will take a lot
        of work to hack it into the code generator: it has to create a
        separate code object in order to be a generator.  List
        comprehensions are inlined, so I expect that the generator
        comprehension code generator can't share much with the list
        comprehension code generator.  And this for something that's not
        that common and easily done by writing a 2-line helper function.
        IOW the ROI isn't high enough.

    Comments from Ka-Ping Yee:  I am very happy with the things you have
        proposed in this PEP.  I feel quite positive about generator
        comprehensions and have no reservations.  So a +1 on that.

    Comments from Neil Schemenauer:  I'm -0 on the generator list
        comprehensions.  They don't seem to add much.  You could easily use
        a nested generator to do the same thing.  They smell like lambda.

    Comments from Magnus Lie Hetland:  Generator comprehensions seem mildly
        useful, but I vote +0. Defining a separate, named generator would
        probably be my preference. On the other hand, I do see the advantage
        of "scaling up" from list comprehensions.

    Comments from the Community:  The response to the generator comprehension
        proposal has been mostly favorable.  There were some 0 votes from
        people who didn't see a real need or who were not energized by the
        idea.  Some of the 0 votes were tempered by comments that the reviewer
        did not even like list comprehensions or did not have any use for
        generators in any form.  The +1 votes outnumbered the 0 votes by about
        two to one.

    Author response:  I've studied several syntactical variations and
        concluded that the brackets are essential for:
        - teachability (it's like a list comprehension)
        - set-off (yield applies to the comprehension not the enclosing
          function)
        - substitutability (list comprehensions can be made lazy just by
          adding yield)

        What I like best about generator comprehensions is that I can design
        using list comprehensions and then easily switch to a generator (by
        adding yield) in response to scalability requirements (when the list
        comprehension produces too large an intermediate result).  Had
        generators already been in-place when list comprehensions were
        accepted, the yield option might have been incorporated from the
        start.  For certain, the mathematical style notation is explicit and
        readable as compared to a separate function definition with an
        embedded yield.


References

    [1] PEP 255 Simple Generators
        http://python.sourceforge.net/peps/pep-0255.html

    [2] PEP 212 Loop Counter Iteration
        http://python.sourceforge.net/peps/pep-0212.html

    [3] PEP 202 List Comprehensions
        http://python.sourceforge.net/peps/pep-0202.html

    [4] PEP 234 Iterators
        http://python.sourceforge.net/peps/pep-0234.html

    [5] A pure Python simulation of every feature in this PEP is at:
        http://sourceforge.net/tracker/download.php?group_id=5470&atid=305470&file_id=17348&aid=513752

    [6] The full, working source code for each of the examples in this PEP
        along with other examples and tests is at:
        http://sourceforge.net/tracker/download.php?group_id=5470&atid=305470&file_id=17412&aid=513756


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
fill-column: 70
End: