python-peps/pep-0234.txt

484 lines
20 KiB
Plaintext

PEP: 234
Title: Iterators
Author: Ka-Ping Yee <ping@zesty.ca>, Guido van Rossum <guido@python.org>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Jan-2001
Python-Version: 2.1
Post-History: 30-Apr-2001
Abstract
========
This document proposes an iteration interface that objects can provide to
control the behaviour of ``for`` loops. Looping is customized by providing a
method that produces an iterator object. The iterator provides a *get next
value* operation that produces the next item in the sequence each time it is
called, raising an exception when no more items are available.
In addition, specific iterators over the keys of a dictionary and over the
lines of a file are proposed, and a proposal is made to allow spelling
``dict.has_key(key)`` as ``key in dict``.
Note: this is an almost complete rewrite of this PEP by the second author,
describing the actual implementation checked into the trunk of the Python 2.2
CVS tree. It is still open for discussion. Some of the more esoteric
proposals in the original version of this PEP have been withdrawn for now;
these may be the subject of a separate PEP in the future.
C API Specification
===================
A new exception is defined, ``StopIteration``, which can be used to signal the
end of an iteration.
A new slot named ``tp_iter`` for requesting an iterator is added to the type
object structure. This should be a function of one ``PyObject *`` argument
returning a ``PyObject *``, or ``NULL``. To use this slot, a new C API
function ``PyObject_GetIter()`` is added, with the same signature as the
``tp_iter`` slot function.
Another new slot, named ``tp_iternext``, is added to the type structure, for
obtaining the next value in the iteration. To use this slot, a new C API
function ``PyIter_Next()`` is added. The signature for both the slot and the
API function is as follows, although the ``NULL`` return conditions differ:
the argument is a ``PyObject *`` and so is the return value. When the return
value is non-``NULL``, it is the next value in the iteration. When it is
``NULL``, then for the ``tp_iternext slot`` there are three possibilities:
- No exception is set; this implies the end of the iteration.
- The ``StopIteration`` exception (or a derived exception class) is set; this
implies the end of the iteration.
- Some other exception is set; this means that an error occurred that should be
propagated normally.
The higher-level ``PyIter_Next()`` function clears the ``StopIteration``
exception (or derived exception) when it occurs, so its ``NULL`` return
conditions are simpler:
- No exception is set; this means iteration has ended.
- Some exception is set; this means an error occurred, and should be propagated
normally.
Iterators implemented in C should *not* implement a ``next()`` method with
similar semantics as the ``tp_iternext`` slot! When the type's dictionary is
initialized (by ``PyType_Ready()``), the presence of a ``tp_iternext`` slot
causes a method ``next()`` wrapping that slot to be added to the type's
``tp_dict``. (Exception: if the type doesn't use ``PyObject_GenericGetAttr()``
to access instance attributes, the ``next()`` method in the type's ``tp_dict``
may not be seen.) (Due to a misunderstanding in the original text of this PEP,
in Python 2.2, all iterator types implemented a ``next()`` method that was
overridden by the wrapper; this has been fixed in Python 2.3.)
To ensure binary backwards compatibility, a new flag ``Py_TPFLAGS_HAVE_ITER``
is added to the set of flags in the ``tp_flags`` field, and to the default
flags macro. This flag must be tested before accessing the ``tp_iter`` or
``tp_iternext`` slots. The macro ``PyIter_Check()`` tests whether an object
has the appropriate flag set and has a non-``NULL`` ``tp_iternext`` slot.
There is no such macro for the ``tp_iter`` slot (since the only place where
this slot is referenced should be ``PyObject_GetIter()``, and this can check
for the ``Py_TPFLAGS_HAVE_ITER`` flag directly).
(Note: the ``tp_iter`` slot can be present on any object; the ``tp_iternext``
slot should only be present on objects that act as iterators.)
For backwards compatibility, the ``PyObject_GetIter()`` function implements
fallback semantics when its argument is a sequence that does not implement a
``tp_iter`` function: a lightweight sequence iterator object is constructed in
that case which iterates over the items of the sequence in the natural order.
The Python bytecode generated for ``for`` loops is changed to use new opcodes,
``GET_ITER`` and ``FOR_ITER``, that use the iterator protocol rather than the
sequence protocol to get the next value for the loop variable. This makes it
possible to use a ``for`` loop to loop over non-sequence objects that support
the ``tp_iter`` slot. Other places where the interpreter loops over the values
of a sequence should also be changed to use iterators.
Iterators ought to implement the ``tp_iter`` slot as returning a reference to
themselves; this is needed to make it possible to use an iterator (as opposed
to a sequence) in a ``for`` loop.
Iterator implementations (in C or in Python) should guarantee that once the
iterator has signalled its exhaustion, subsequent calls to ``tp_iternext`` or
to the ``next()`` method will continue to do so. It is not specified whether
an iterator should enter the exhausted state when an exception (other than
``StopIteration``) is raised. Note that Python cannot guarantee that
user-defined or 3rd party iterators implement this requirement correctly.
Python API Specification
========================
The ``StopIteration`` exception is made visible as one of the standard
exceptions. It is derived from ``Exception``.
A new built-in function is defined, ``iter()``, which can be called in two
ways:
- ``iter(obj)`` calls ``PyObject_GetIter(obj)``.
- ``iter(callable, sentinel)`` returns a special kind of iterator that calls
the callable to produce a new value, and compares the return value to the
sentinel value. If the return value equals the sentinel, this signals the
end of the iteration and ``StopIteration`` is raised rather than returning
normal; if the return value does not equal the sentinel, it is returned as
the next value from the iterator. If the callable raises an exception, this
is propagated normally; in particular, the function is allowed to raise
``StopIteration`` as an alternative way to end the iteration. (This
functionality is available from the C API as
``PyCallIter_New(callable, sentinel)``.)
Iterator objects returned by either form of ``iter()`` have a ``next()``
method. This method either returns the next value in the iteration, or raises
``StopIteration`` (or a derived exception class) to signal the end of the
iteration. Any other exception should be considered to signify an error and
should be propagated normally, not taken to mean the end of the iteration.
Classes can define how they are iterated over by defining an ``__iter__()``
method; this should take no additional arguments and return a valid iterator
object. A class that wants to be an iterator should implement two methods: a
``next()`` method that behaves as described above, and an ``__iter__()`` method
that returns ``self``.
The two methods correspond to two distinct protocols:
1. An object can be iterated over with ``for`` if it implements ``__iter__()``
or ``__getitem__()``.
2. An object can function as an iterator if it implements ``next()``.
Container-like objects usually support protocol 1. Iterators are currently
required to support both protocols. The semantics of iteration come only from
protocol 2; protocol 1 is present to make iterators behave like sequences; in
particular so that code receiving an iterator can use a for-loop over the
iterator.
Dictionary Iterators
====================
- Dictionaries implement a ``sq_contains`` slot that implements the same test
as the ``has_key()`` method. This means that we can write
::
if k in dict: ...
which is equivalent to
::
if dict.has_key(k): ...
- Dictionaries implement a ``tp_iter`` slot that returns an efficient iterator
that iterates over the keys of the dictionary. During such an iteration, the
dictionary should not be modified, except that setting the value for an
existing key is allowed (deletions or additions are not, nor is the
``update()`` method). This means that we can write
::
for k in dict: ...
which is equivalent to, but much faster than
::
for k in dict.keys(): ...
as long as the restriction on modifications to the dictionary (either by the
loop or by another thread) are not violated.
- Add methods to dictionaries that return different kinds of iterators
explicitly::
for key in dict.iterkeys(): ...
for value in dict.itervalues(): ...
for key, value in dict.iteritems(): ...
This means that ``for x in dict`` is shorthand for
``for x in dict.iterkeys()``.
Other mappings, if they support iterators at all, should also iterate over the
keys. However, this should not be taken as an absolute rule; specific
applications may have different requirements.
File Iterators
==============
The following proposal is useful because it provides us with a good answer to
the complaint that the common idiom to iterate over the lines of a file is ugly
and slow.
- Files implement a ``tp_iter`` slot that is equivalent to
``iter(f.readline, "")``. This means that we can write
::
for line in file:
...
as a shorthand for
::
for line in iter(file.readline, ""):
...
which is equivalent to, but faster than
::
while 1:
line = file.readline()
if not line:
break
...
This also shows that some iterators are destructive: they consume all the
values and a second iterator cannot easily be created that iterates
independently over the same values. You could open the file for a second time,
or ``seek()`` to the beginning, but these solutions don't work for all file
types, e.g. they don't work when the open file object really represents a pipe
or a stream socket.
Because the file iterator uses an internal buffer, mixing this with other file
operations (e.g. ``file.readline()``) doesn't work right. Also, the following
code::
for line in file:
if line == "\n":
break
for line in file:
print line,
doesn't work as you might expect, because the iterator created by the second
for-loop doesn't take the buffer read-ahead by the first for-loop into account.
A correct way to write this is::
it = iter(file)
for line in it:
if line == "\n":
break
for line in it:
print line,
(The rationale for these restrictions are that ``for line in file`` ought to
become the recommended, standard way to iterate over the lines of a file, and
this should be as fast as can be. The iterator version is considerable faster
than calling ``readline()``, due to the internal buffer in the iterator.)
Rationale
=========
If all the parts of the proposal are included, this addresses many concerns in
a consistent and flexible fashion. Among its chief virtues are the following
four -- no, five -- no, six -- points:
1. It provides an extensible iterator interface.
2. It allows performance enhancements to list iteration.
3. It allows big performance enhancements to dictionary iteration.
4. It allows one to provide an interface for just iteration without pretending
to provide random access to elements.
5. It is backward-compatible with all existing user-defined classes and
extension objects that emulate sequences and mappings, even mappings that
only implement a subset of {``__getitem__``, ``keys``, ``values``,
``items``}.
6. It makes code iterating over non-sequence collections more concise and
readable.
Resolved Issues
===============
The following topics have been decided by consensus or BDFL pronouncement.
- Two alternative spellings for ``next()`` have been proposed but rejected:
``__next__()``, because it corresponds to a type object slot
(``tp_iternext``); and ``__call__()``, because this is the only operation.
Arguments against ``__next__()``: while many iterators are used in for loops,
it is expected that user code will also call ``next()`` directly, so having
to write ``__next__()`` is ugly; also, a possible extension of the protocol
would be to allow for ``prev()``, ``current()`` and ``reset()`` operations;
surely we don't want to use ``__prev__()``, ``__current__()``,
``__reset__()``.
Arguments against ``__call__()`` (the original proposal): taken out of
context, ``x()`` is not very readable, while ``x.next()`` is clear; there's a
danger that every special-purpose object wants to use ``__call__()`` for its
most common operation, causing more confusion than clarity.
(In retrospect, it might have been better to go for ``__next__()`` and have a
new built-in, ``next(it)``, which calls ``it.__next__()``. But alas, it's too
late; this has been deployed in Python 2.2 since December 2001.)
- Some folks have requested the ability to restart an iterator. This should be
dealt with by calling ``iter()`` on a sequence repeatedly, not by the
iterator protocol itself. (See also requested extensions below.)
- It has been questioned whether an exception to signal the end of the
iteration isn't too expensive. Several alternatives for the
``StopIteration`` exception have been proposed: a special value ``End`` to
signal the end, a function ``end()`` to test whether the iterator is
finished, even reusing the ``IndexError`` exception.
- A special value has the problem that if a sequence ever contains that
special value, a loop over that sequence will end prematurely without any
warning. If the experience with null-terminated C strings hasn't taught us
the problems this can cause, imagine the trouble a Python introspection
tool would have iterating over a list of all built-in names, assuming that
the special ``End`` value was a built-in name!
- Calling an ``end()`` function would require two calls per iteration. Two
calls is much more expensive than one call plus a test for an exception.
Especially the time-critical for loop can test very cheaply for an
exception.
- Reusing ``IndexError`` can cause confusion because it can be a genuine
error, which would be masked by ending the loop prematurely.
- Some have asked for a standard iterator type. Presumably all iterators would
have to be derived from this type. But this is not the Python way:
dictionaries are mappings because they support ``__getitem__()`` and a
handful other operations, not because they are derived from an abstract
mapping type.
- Regarding ``if key in dict``: there is no doubt that the ``dict.has_key(x)``
interpretation of ``x in dict`` is by far the most useful interpretation,
probably the only useful one. There has been resistance against this because
``x in list`` checks whether *x* is present among the values, while the
proposal makes ``x in dict`` check whether *x* is present among the keys.
Given that the symmetry between lists and dictionaries is very weak, this
argument does not have much weight.
- The name ``iter()`` is an abbreviation. Alternatives proposed include
``iterate()``, ``traverse()``, but these appear too long. Python has a
history of using abbrs for common builtins, e.g. ``repr()``, ``str()``,
``len()``.
Resolution: ``iter()`` it is.
- Using the same name for two different operations (getting an iterator from an
object and making an iterator for a function with a sentinel value) is
somewhat ugly. I haven't seen a better name for the second operation though,
and since they both return an iterator, it's easy to remember.
Resolution: the builtin ``iter()`` takes an optional argument, which is the
sentinel to look for.
- Once a particular iterator object has raised ``StopIteration``, will it also
raise ``StopIteration`` on all subsequent ``next()`` calls? Some say that it
would be useful to require this, others say that it is useful to leave this
open to individual iterators. Note that this may require an additional state
bit for some iterator implementations (e.g. function-wrapping iterators).
Resolution: once ``StopIteration`` is raised, calling ``it.next()`` continues
to raise ``StopIteration``.
Note: this was in fact not implemented in Python 2.2; there are many cases
where an iterator's ``next()`` method can raise ``StopIteration`` on one call
but not on the next. This has been remedied in Python 2.3.
- It has been proposed that a file object should be its own iterator, with a
``next()`` method returning the next line. This has certain advantages, and
makes it even clearer that this iterator is destructive. The disadvantage is
that this would make it even more painful to implement the "sticky
StopIteration" feature proposed in the previous bullet.
Resolution: tentatively rejected (though there are still people arguing for
this).
- Some folks have requested extensions of the iterator protocol, e.g.
``prev()`` to get the previous item, ``current()`` to get the current item
again, ``finished()`` to test whether the iterator is finished, and maybe
even others, like ``rewind()``, ``__len__()``, ``position()``.
While some of these are useful, many of these cannot easily be implemented
for all iterator types without adding arbitrary buffering, and sometimes they
can't be implemented at all (or not reasonably). E.g. anything to do with
reversing directions can't be done when iterating over a file or function.
Maybe a separate PEP can be drafted to standardize the names for such
operations when they are implementable.
Resolution: rejected.
- There has been a long discussion about whether
::
for x in dict: ...
should assign *x* the successive keys, values, or items of the dictionary.
The symmetry between ``if x in y`` and ``for x in y`` suggests that it should
iterate over keys. This symmetry has been observed by many independently and
has even been used to "explain" one using the other. This is because for
sequences, ``if x in y`` iterates over *y* comparing the iterated values to
*x*. If we adopt both of the above proposals, this will also hold for
dictionaries.
The argument against making ``for x in dict`` iterate over the keys comes
mostly from a practicality point of view: scans of the standard library show
that there are about as many uses of ``for x in dict.items()`` as there are
of ``for x in dict.keys()``, with the ``items()`` version having a small
majority. Presumably many of the loops using ``keys()`` use the
corresponding value anyway, by writing ``dict[x]``, so (the argument goes) by
making both the key and value available, we could support the largest number
of cases. While this is true, I (Guido) find the correspondence between
``for x in dict`` and ``if x in dict`` too compelling to break, and there's
not much overhead in having to write ``dict[x]`` to explicitly get the value.
For fast iteration over items, use ``for key, value in dict.iteritems()``.
I've timed the difference between
::
for key in dict: dict[key]
and
::
for key, value in dict.iteritems(): pass
and found that the latter is only about 7% faster.
Resolution: By BDFL pronouncement, ``for x in dict`` iterates over the keys,
and dictionaries have ``iteritems()``, ``iterkeys()``, and ``itervalues()``
to return the different flavors of dictionary iterators.
Mailing Lists
=============
The iterator protocol has been discussed extensively in a mailing list on
SourceForge:
http://lists.sourceforge.net/lists/listinfo/python-iterators
Initially, some of the discussion was carried out at Yahoo; archives are still
accessible:
http://groups.yahoo.com/group/python-iter
Copyright
=========
This document is in the public domain.