reSTify PEP225, PEP234, PEP255, PEP450 (#293)

csabella 2017-06-16 13:20:28 -04:00 committed by Brett Cannon
parent 82ea79a94a
commit ad339c1e27
4 changed files with 1629 additions and 1602 deletions



@@ -5,489 +5,489 @@ Last-Modified: $Date$
Author: ping@zesty.ca (Ka-Ping Yee), guido@python.org (Guido van Rossum)
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Jan-2001
Python-Version: 2.1
Post-History: 30-Apr-2001
Abstract
========
This document proposes an iteration interface that objects can
provide to control the behaviour of 'for' loops. Looping is
customized by providing a method that produces an iterator object.
The iterator provides a 'get next value' operation that produces
the next item in the sequence each time it is called, raising an
exception when no more items are available.
This document proposes an iteration interface that objects can provide to
control the behaviour of ``for`` loops. Looping is customized by providing a
method that produces an iterator object. The iterator provides a *get next
value* operation that produces the next item in the sequence each time it is
called, raising an exception when no more items are available.
In addition, specific iterators over the keys of a dictionary and
over the lines of a file are proposed, and a proposal is made to
allow spelling dict.has_key(key) as "key in dict".
In addition, specific iterators over the keys of a dictionary and over the
lines of a file are proposed, and a proposal is made to allow spelling
``dict.has_key(key)`` as ``key in dict``.
Note: this is an almost complete rewrite of this PEP by the second
author, describing the actual implementation checked into the
trunk of the Python 2.2 CVS tree. It is still open for
discussion. Some of the more esoteric proposals in the original
version of this PEP have been withdrawn for now; these may be the
subject of a separate PEP in the future.
Note: this is an almost complete rewrite of this PEP by the second author,
describing the actual implementation checked into the trunk of the Python 2.2
CVS tree. It is still open for discussion. Some of the more esoteric
proposals in the original version of this PEP have been withdrawn for now;
these may be the subject of a separate PEP in the future.
C API Specification
===================
A new exception is defined, StopIteration, which can be used to
signal the end of an iteration.
A new exception is defined, ``StopIteration``, which can be used to signal the
end of an iteration.
A new slot named tp_iter for requesting an iterator is added to
the type object structure. This should be a function of one
PyObject * argument returning a PyObject *, or NULL. To use this
slot, a new C API function PyObject_GetIter() is added, with the
same signature as the tp_iter slot function.
A new slot named ``tp_iter`` for requesting an iterator is added to the type
object structure. This should be a function of one ``PyObject *`` argument
returning a ``PyObject *``, or ``NULL``. To use this slot, a new C API
function ``PyObject_GetIter()`` is added, with the same signature as the
``tp_iter`` slot function.
Another new slot, named tp_iternext, is added to the type
structure, for obtaining the next value in the iteration. To use
this slot, a new C API function PyIter_Next() is added. The
signature for both the slot and the API function is as follows,
although the NULL return conditions differ: the argument is a
PyObject * and so is the return value. When the return value is
non-NULL, it is the next value in the iteration. When it is NULL,
then for the tp_iternext slot there are three possibilities:
Another new slot, named ``tp_iternext``, is added to the type structure, for
obtaining the next value in the iteration. To use this slot, a new C API
function ``PyIter_Next()`` is added. The signature for both the slot and the
API function is as follows, although the ``NULL`` return conditions differ:
the argument is a ``PyObject *`` and so is the return value. When the return
value is non-``NULL``, it is the next value in the iteration. When it is
``NULL``, then for the ``tp_iternext`` slot there are three possibilities:
- No exception is set; this implies the end of the iteration.
- No exception is set; this implies the end of the iteration.
- The StopIteration exception (or a derived exception class) is
set; this implies the end of the iteration.
- The ``StopIteration`` exception (or a derived exception class) is set; this
implies the end of the iteration.
- Some other exception is set; this means that an error occurred
that should be propagated normally.
- Some other exception is set; this means that an error occurred that should be
propagated normally.
The higher-level PyIter_Next() function clears the StopIteration
exception (or derived exception) when it occurs, so its NULL return
conditions are simpler:
The higher-level ``PyIter_Next()`` function clears the ``StopIteration``
exception (or derived exception) when it occurs, so its ``NULL`` return
conditions are simpler:
- No exception is set; this means iteration has ended.
- No exception is set; this means iteration has ended.
- Some exception is set; this means an error occurred, and should
be propagated normally.
- Some exception is set; this means an error occurred, and should be propagated
normally.
Iterators implemented in C should *not* implement a next() method
with similar semantics as the tp_iternext slot! When the type's
dictionary is initialized (by PyType_Ready()), the presence of a
tp_iternext slot causes a method next() wrapping that slot to be
added to the type's tp_dict. (Exception: if the type doesn't use
PyObject_GenericGetAttr() to access instance attributes, the
next() method in the type's tp_dict may not be seen.) (Due to a
misunderstanding in the original text of this PEP, in Python 2.2,
all iterator types implemented a next() method that was overridden
by the wrapper; this has been fixed in Python 2.3.)
Iterators implemented in C should *not* implement a ``next()`` method with
similar semantics as the ``tp_iternext`` slot! When the type's dictionary is
initialized (by ``PyType_Ready()``), the presence of a ``tp_iternext`` slot
causes a method ``next()`` wrapping that slot to be added to the type's
``tp_dict``. (Exception: if the type doesn't use ``PyObject_GenericGetAttr()``
to access instance attributes, the ``next()`` method in the type's ``tp_dict``
may not be seen.) (Due to a misunderstanding in the original text of this PEP,
in Python 2.2, all iterator types implemented a ``next()`` method that was
overridden by the wrapper; this has been fixed in Python 2.3.)
To ensure binary backwards compatibility, a new flag
Py_TPFLAGS_HAVE_ITER is added to the set of flags in the tp_flags
field, and to the default flags macro. This flag must be tested
before accessing the tp_iter or tp_iternext slots. The macro
PyIter_Check() tests whether an object has the appropriate flag
set and has a non-NULL tp_iternext slot. There is no such macro
for the tp_iter slot (since the only place where this slot is
referenced should be PyObject_GetIter(), and this can check for
the Py_TPFLAGS_HAVE_ITER flag directly).
To ensure binary backwards compatibility, a new flag ``Py_TPFLAGS_HAVE_ITER``
is added to the set of flags in the ``tp_flags`` field, and to the default
flags macro. This flag must be tested before accessing the ``tp_iter`` or
``tp_iternext`` slots. The macro ``PyIter_Check()`` tests whether an object
has the appropriate flag set and has a non-``NULL`` ``tp_iternext`` slot.
There is no such macro for the ``tp_iter`` slot (since the only place where
this slot is referenced should be ``PyObject_GetIter()``, and this can check
for the ``Py_TPFLAGS_HAVE_ITER`` flag directly).
(Note: the tp_iter slot can be present on any object; the
tp_iternext slot should only be present on objects that act as
iterators.)
(Note: the ``tp_iter`` slot can be present on any object; the ``tp_iternext``
slot should only be present on objects that act as iterators.)
For backwards compatibility, the PyObject_GetIter() function
implements fallback semantics when its argument is a sequence that
does not implement a tp_iter function: a lightweight sequence
iterator object is constructed in that case which iterates over
the items of the sequence in the natural order.
For backwards compatibility, the ``PyObject_GetIter()`` function implements
fallback semantics when its argument is a sequence that does not implement a
``tp_iter`` function: a lightweight sequence iterator object is constructed in
that case which iterates over the items of the sequence in the natural order.
The Python bytecode generated for 'for' loops is changed to use
new opcodes, GET_ITER and FOR_ITER, that use the iterator protocol
rather than the sequence protocol to get the next value for the
loop variable. This makes it possible to use a 'for' loop to loop
over non-sequence objects that support the tp_iter slot. Other
places where the interpreter loops over the values of a sequence
should also be changed to use iterators.
The Python bytecode generated for ``for`` loops is changed to use new opcodes,
``GET_ITER`` and ``FOR_ITER``, that use the iterator protocol rather than the
sequence protocol to get the next value for the loop variable. This makes it
possible to use a ``for`` loop to loop over non-sequence objects that support
the ``tp_iter`` slot. Other places where the interpreter loops over the values
of a sequence should also be changed to use iterators.
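As a rough illustration of the protocol described above, a ``for`` loop can be approximated in pure Python. This is only a sketch: ``for_each`` and ``body`` are hypothetical names, and the built-in ``next(iterator)`` used here is the modern spelling of the ``iterator.next()`` method this PEP describes.

```python
def for_each(iterable, body):
    """Approximate what ``for item in iterable: body(item)`` does."""
    iterator = iter(iterable)          # roughly the GET_ITER step
    while True:
        try:
            item = next(iterator)      # roughly one FOR_ITER step
        except StopIteration:
            break                      # exhaustion ends the loop quietly
        body(item)


collected = []
# A dictionary is not a sequence, yet it can be looped over via its iterator.
for_each({"a": 1, "b": 2}, collected.append)
```

Any other exception raised while fetching the next item is not caught by the sketch, so it propagates to the caller as an error.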
Iterators ought to implement the tp_iter slot as returning a
reference to themselves; this is needed to make it possible to
use an iterator (as opposed to a sequence) in a for loop.
Iterators ought to implement the ``tp_iter`` slot as returning a reference to
themselves; this is needed to make it possible to use an iterator (as opposed
to a sequence) in a ``for`` loop.
Iterator implementations (in C or in Python) should guarantee that
once the iterator has signalled its exhaustion, subsequent calls
to tp_iternext or to the next() method will continue to do so. It
is not specified whether an iterator should enter the exhausted
state when an exception (other than StopIteration) is raised.
Note that Python cannot guarantee that user-defined or 3rd party
iterators implement this requirement correctly.
Iterator implementations (in C or in Python) should guarantee that once the
iterator has signalled its exhaustion, subsequent calls to ``tp_iternext`` or
to the ``next()`` method will continue to do so. It is not specified whether
an iterator should enter the exhausted state when an exception (other than
``StopIteration``) is raised. Note that Python cannot guarantee that
user-defined or 3rd party iterators implement this requirement correctly.
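The exhaustion guarantee can be observed with a built-in list iterator. This is a minimal sketch assuming a current CPython, where the built-in ``next()`` function stands in for the ``next()`` method; the variable names are illustrative only.

```python
data = [1, 2]
it = iter(data)
values = [next(it), next(it)]   # consume both items

signals = []
for _ in range(2):
    try:
        next(it)
    except StopIteration:
        # Each further call keeps signalling exhaustion.
        signals.append("StopIteration")

# Growing the underlying list does not revive the exhausted iterator.
data.append(3)
try:
    next(it)
    revived = True
except StopIteration:
    revived = False
```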
Python API Specification
========================
The StopIteration exception is made visible as one of the
standard exceptions. It is derived from Exception.
The ``StopIteration`` exception is made visible as one of the standard
exceptions. It is derived from ``Exception``.
A new built-in function is defined, iter(), which can be called in
two ways:
A new built-in function is defined, ``iter()``, which can be called in two
ways:
- iter(obj) calls PyObject_GetIter(obj).
- ``iter(obj)`` calls ``PyObject_GetIter(obj)``.
- iter(callable, sentinel) returns a special kind of iterator that
calls the callable to produce a new value, and compares the
return value to the sentinel value. If the return value equals
the sentinel, this signals the end of the iteration and
StopIteration is raised rather than returning normally; if the
return value does not equal the sentinel, it is returned as the
next value from the iterator. If the callable raises an
exception, this is propagated normally; in particular, the
function is allowed to raise StopIteration as an alternative way
to end the iteration. (This functionality is available from the
C API as PyCallIter_New(callable, sentinel).)
- ``iter(callable, sentinel)`` returns a special kind of iterator that calls
the callable to produce a new value, and compares the return value to the
sentinel value. If the return value equals the sentinel, this signals the
end of the iteration and ``StopIteration`` is raised rather than returning
normally; if the return value does not equal the sentinel, it is returned as
the next value from the iterator. If the callable raises an exception, this
is propagated normally; in particular, the function is allowed to raise
``StopIteration`` as an alternative way to end the iteration. (This
functionality is available from the C API as
``PyCallIter_New(callable, sentinel)``.)
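The two calling forms of ``iter()`` can be sketched as follows. The example assumes Python 3 and uses ``io.StringIO`` as a stand-in for a file object; the variable names are illustrative.

```python
import io

# Form 1: iter(obj) asks the object itself for an iterator.
letters = list(iter("abc"))

# Form 2: iter(callable, sentinel) calls the callable repeatedly and
# stops as soon as the return value equals the sentinel.
stream = io.StringIO("first\nsecond\n")
lines = list(iter(stream.readline, ""))
```

Here ``readline()`` returns an empty string at end of input, which matches the sentinel and ends the iteration.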
Iterator objects returned by either form of iter() have a next()
method. This method either returns the next value in the
iteration, or raises StopIteration (or a derived exception class)
to signal the end of the iteration. Any other exception should be
considered to signify an error and should be propagated normally,
not taken to mean the end of the iteration.
Iterator objects returned by either form of ``iter()`` have a ``next()``
method. This method either returns the next value in the iteration, or raises
``StopIteration`` (or a derived exception class) to signal the end of the
iteration. Any other exception should be considered to signify an error and
should be propagated normally, not taken to mean the end of the iteration.
Classes can define how they are iterated over by defining an
__iter__() method; this should take no additional arguments and
return a valid iterator object. A class that wants to be an
iterator should implement two methods: a next() method that behaves
as described above, and an __iter__() method that returns self.
Classes can define how they are iterated over by defining an ``__iter__()``
method; this should take no additional arguments and return a valid iterator
object. A class that wants to be an iterator should implement two methods: a
``next()`` method that behaves as described above, and an ``__iter__()`` method
that returns ``self``.
The two methods correspond to two distinct protocols:
The two methods correspond to two distinct protocols:
1. An object can be iterated over with "for" if it implements
__iter__() or __getitem__().
1. An object can be iterated over with ``for`` if it implements ``__iter__()``
or ``__getitem__()``.
2. An object can function as an iterator if it implements next().
2. An object can function as an iterator if it implements ``next()``.
Container-like objects usually support protocol 1. Iterators are
currently required to support both protocols. The semantics of
iteration come only from protocol 2; protocol 1 is present to make
iterators behave like sequences; in particular so that code
receiving an iterator can use a for-loop over the iterator.
Container-like objects usually support protocol 1. Iterators are currently
required to support both protocols. The semantics of iteration come only from
protocol 2; protocol 1 is present to make iterators behave like sequences; in
particular so that code receiving an iterator can use a for-loop over the
iterator.
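A class that supports both protocols might look like the sketch below. ``Countdown`` is a hypothetical example class; it defines ``next()`` as specified here and also aliases it to ``__next__``, the name Python 3 expects, so that the sketch runs on a current interpreter.

```python
class Countdown:
    """Iterator yielding start, start - 1, ..., 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # Protocol 1: an iterator returns itself.
        return self

    def next(self):
        # Protocol 2: return the next value or signal exhaustion.
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

    __next__ = next  # Python 3 spelling of the same method


result = [n for n in Countdown(3)]
```

Because ``current`` stays at zero once the countdown finishes, every later call keeps raising ``StopIteration``.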
Dictionary Iterators
====================
- Dictionaries implement a sq_contains slot that implements the
same test as the has_key() method. This means that we can write
- Dictionaries implement a ``sq_contains`` slot that implements the same test
as the ``has_key()`` method. This means that we can write
if k in dict: ...
::
which is equivalent to
if k in dict: ...
if dict.has_key(k): ...
which is equivalent to
- Dictionaries implement a tp_iter slot that returns an efficient
iterator that iterates over the keys of the dictionary. During
such an iteration, the dictionary should not be modified, except
that setting the value for an existing key is allowed (deletions
or additions are not, nor is the update() method). This means
that we can write
::
for k in dict: ...
if dict.has_key(k): ...
which is equivalent to, but much faster than
- Dictionaries implement a ``tp_iter`` slot that returns an efficient iterator
that iterates over the keys of the dictionary. During such an iteration, the
dictionary should not be modified, except that setting the value for an
existing key is allowed (deletions or additions are not, nor is the
``update()`` method). This means that we can write
for k in dict.keys(): ...
::
as long as the restriction on modifications to the dictionary
(either by the loop or by another thread) is not violated.
for k in dict: ...
- Add methods to dictionaries that return different kinds of
iterators explicitly:
which is equivalent to, but much faster than
for key in dict.iterkeys(): ...
::
for value in dict.itervalues(): ...
for k in dict.keys(): ...
for key, value in dict.iteritems(): ...
as long as the restriction on modifications to the dictionary (either by the
loop or by another thread) is not violated.
This means that "for x in dict" is shorthand for "for x in
dict.iterkeys()".
- Add methods to dictionaries that return different kinds of iterators
explicitly::
Other mappings, if they support iterators at all, should also
iterate over the keys. However, this should not be taken as an
absolute rule; specific applications may have different
requirements.
for key in dict.iterkeys(): ...
for value in dict.itervalues(): ...
for key, value in dict.iteritems(): ...
This means that ``for x in dict`` is shorthand for
``for x in dict.iterkeys()``.
Other mappings, if they support iterators at all, should also iterate over the
keys. However, this should not be taken as an absolute rule; specific
applications may have different requirements.
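The dictionary behaviour described above can be sketched as follows. The example assumes Python 3, where ``has_key()`` and the ``iter*()`` methods named in this PEP no longer exist, so it uses only the ``in`` operator and direct iteration; ``ages`` is made-up sample data.

```python
ages = {"ada": 36, "grace": 45}

has_key = "ada" in ages      # membership tests the keys...
has_value = 36 in ages       # ...not the values

keys = [name for name in ages]   # iterating a dict yields its keys

# Setting the value for an existing key during iteration is allowed;
# adding or deleting keys is not.
for name in ages:
    ages[name] += 1
```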
File Iterators
==============
The following proposal is useful because it provides us with a
good answer to the complaint that the common idiom to iterate over
the lines of a file is ugly and slow.
The following proposal is useful because it provides us with a good answer to
the complaint that the common idiom to iterate over the lines of a file is ugly
and slow.
- Files implement a tp_iter slot that is equivalent to
iter(f.readline, ""). This means that we can write
- Files implement a ``tp_iter`` slot that is equivalent to
``iter(f.readline, "")``. This means that we can write
for line in file:
...
as a shorthand for
for line in iter(file.readline, ""):
...
which is equivalent to, but faster than
while 1:
line = file.readline()
if not line:
break
...
This also shows that some iterators are destructive: they consume
all the values and a second iterator cannot easily be created that
iterates independently over the same values. You could open the
file for a second time, or seek() to the beginning, but these
solutions don't work for all file types, e.g. they don't work when
the open file object really represents a pipe or a stream socket.
Because the file iterator uses an internal buffer, mixing this
with other file operations (e.g. file.readline()) doesn't work
right. Also, the following code:
::
for line in file:
if line == "\n":
...
as a shorthand for
::
for line in iter(file.readline, ""):
...
which is equivalent to, but faster than
::
while 1:
line = file.readline()
if not line:
break
for line in file:
print line,
...
doesn't work as you might expect, because the iterator created by
the second for-loop doesn't take the buffer read-ahead by the
first for-loop into account. A correct way to write this is:
This also shows that some iterators are destructive: they consume all the
values and a second iterator cannot easily be created that iterates
independently over the same values. You could open the file for a second time,
or ``seek()`` to the beginning, but these solutions don't work for all file
types, e.g. they don't work when the open file object really represents a pipe
or a stream socket.
it = iter(file)
for line in it:
if line == "\n":
break
for line in it:
print line,
Because the file iterator uses an internal buffer, mixing this with other file
operations (e.g. ``file.readline()``) doesn't work right. Also, the following
code::
(The rationale for these restrictions is that "for line in file"
ought to become the recommended, standard way to iterate over the
lines of a file, and this should be as fast as can be. The
iterator version is considerably faster than calling readline(),
due to the internal buffer in the iterator.)
for line in file:
if line == "\n":
break
for line in file:
print line,
doesn't work as you might expect, because the iterator created by the second
for-loop doesn't take the buffer read-ahead by the first for-loop into account.
A correct way to write this is::
it = iter(file)
for line in it:
if line == "\n":
break
for line in it:
print line,
(The rationale for these restrictions is that ``for line in file`` ought to
become the recommended, standard way to iterate over the lines of a file, and
this should be as fast as can be. The iterator version is considerably faster
than calling ``readline()``, due to the internal buffer in the iterator.)
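The shared-iterator pattern can be tried without a real file. This sketch assumes Python 3 and uses ``io.StringIO`` as a stand-in for an open file; the sample text and variable names are illustrative.

```python
import io

stream = io.StringIO("header line\n\nbody 1\nbody 2\n")

it = iter(stream)        # one iterator shared by both loops
header = []
for line in it:
    if line == "\n":     # a blank line ends the header
        break
    header.append(line)

# The second loop resumes exactly where the first one stopped.
body = [line for line in it]
```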
Rationale
=========
If all the parts of the proposal are included, this addresses many
concerns in a consistent and flexible fashion. Among its chief
virtues are the following four -- no, five -- no, six -- points:
If all the parts of the proposal are included, this addresses many concerns in
a consistent and flexible fashion. Among its chief virtues are the following
four -- no, five -- no, six -- points:
1. It provides an extensible iterator interface.
1. It provides an extensible iterator interface.
2. It allows performance enhancements to list iteration.
2. It allows performance enhancements to list iteration.
3. It allows big performance enhancements to dictionary iteration.
3. It allows big performance enhancements to dictionary iteration.
4. It allows one to provide an interface for just iteration
without pretending to provide random access to elements.
4. It allows one to provide an interface for just iteration without pretending
to provide random access to elements.
5. It is backward-compatible with all existing user-defined
classes and extension objects that emulate sequences and
mappings, even mappings that only implement a subset of
{__getitem__, keys, values, items}.
5. It is backward-compatible with all existing user-defined classes and
extension objects that emulate sequences and mappings, even mappings that
only implement a subset of {``__getitem__``, ``keys``, ``values``,
``items``}.
6. It makes code iterating over non-sequence collections more
concise and readable.
6. It makes code iterating over non-sequence collections more concise and
readable.
Resolved Issues
===============
The following topics have been decided by consensus or BDFL
pronouncement.
The following topics have been decided by consensus or BDFL pronouncement.
- Two alternative spellings for next() have been proposed but
rejected: __next__(), because it corresponds to a type object
slot (tp_iternext); and __call__(), because this is the only
operation.
- Two alternative spellings for ``next()`` have been proposed but rejected:
``__next__()``, because it corresponds to a type object slot
(``tp_iternext``); and ``__call__()``, because this is the only operation.
Arguments against __next__(): while many iterators are used in
for loops, it is expected that user code will also call next()
directly, so having to write __next__() is ugly; also, a
possible extension of the protocol would be to allow for prev(),
current() and reset() operations; surely we don't want to use
__prev__(), __current__(), __reset__().
Arguments against ``__next__()``: while many iterators are used in for loops,
it is expected that user code will also call ``next()`` directly, so having
to write ``__next__()`` is ugly; also, a possible extension of the protocol
would be to allow for ``prev()``, ``current()`` and ``reset()`` operations;
surely we don't want to use ``__prev__()``, ``__current__()``,
``__reset__()``.
Arguments against __call__() (the original proposal): taken out
of context, x() is not very readable, while x.next() is clear;
there's a danger that every special-purpose object wants to use
__call__() for its most common operation, causing more confusion
than clarity.
Arguments against ``__call__()`` (the original proposal): taken out of
context, ``x()`` is not very readable, while ``x.next()`` is clear; there's a
danger that every special-purpose object wants to use ``__call__()`` for its
most common operation, causing more confusion than clarity.
(In retrospect, it might have been better to go for __next__()
and have a new built-in, next(it), which calls it.__next__().
But alas, it's too late; this has been deployed in Python 2.2
since December 2001.)
(In retrospect, it might have been better to go for ``__next__()`` and have a
new built-in, ``next(it)``, which calls ``it.__next__()``. But alas, it's too
late; this has been deployed in Python 2.2 since December 2001.)
- Some folks have requested the ability to restart an iterator.
This should be dealt with by calling iter() on a sequence
repeatedly, not by the iterator protocol itself. (See also
requested extensions below.)
- Some folks have requested the ability to restart an iterator. This should be
dealt with by calling ``iter()`` on a sequence repeatedly, not by the
iterator protocol itself. (See also requested extensions below.)
- It has been questioned whether an exception to signal the end of
the iteration isn't too expensive. Several alternatives for the
StopIteration exception have been proposed: a special value End
to signal the end, a function end() to test whether the iterator
is finished, even reusing the IndexError exception.
- It has been questioned whether an exception to signal the end of the
iteration isn't too expensive. Several alternatives for the
``StopIteration`` exception have been proposed: a special value ``End`` to
signal the end, a function ``end()`` to test whether the iterator is
finished, even reusing the ``IndexError`` exception.
- A special value has the problem that if a sequence ever
contains that special value, a loop over that sequence will
end prematurely without any warning. If the experience with
null-terminated C strings hasn't taught us the problems this
can cause, imagine the trouble a Python introspection tool
would have iterating over a list of all built-in names,
assuming that the special End value was a built-in name!
- A special value has the problem that if a sequence ever contains that
special value, a loop over that sequence will end prematurely without any
warning. If the experience with null-terminated C strings hasn't taught us
the problems this can cause, imagine the trouble a Python introspection
tool would have iterating over a list of all built-in names, assuming that
the special ``End`` value was a built-in name!
- Calling an end() function would require two calls per
iteration. Two calls is much more expensive than one call
plus a test for an exception. Especially the time-critical
for loop can test very cheaply for an exception.
- Calling an ``end()`` function would require two calls per iteration. Two
calls is much more expensive than one call plus a test for an exception.
Especially the time-critical for loop can test very cheaply for an
exception.
- Reusing IndexError can cause confusion because it can be a
genuine error, which would be masked by ending the loop
prematurely.
- Reusing ``IndexError`` can cause confusion because it can be a genuine
error, which would be masked by ending the loop prematurely.
- Some have asked for a standard iterator type. Presumably all
iterators would have to be derived from this type. But this is
not the Python way: dictionaries are mappings because they
support __getitem__() and a handful of other operations, not
because they are derived from an abstract mapping type.
- Some have asked for a standard iterator type. Presumably all iterators would
have to be derived from this type. But this is not the Python way:
dictionaries are mappings because they support ``__getitem__()`` and a
handful of other operations, not because they are derived from an abstract
mapping type.
- Regarding "if key in dict": there is no doubt that the
dict.has_key(x) interpretation of "x in dict" is by far the
most useful interpretation, probably the only useful one. There
has been resistance against this because "x in list" checks
whether x is present among the values, while the proposal makes
"x in dict" check whether x is present among the keys. Given
that the symmetry between lists and dictionaries is very weak,
this argument does not have much weight.
- Regarding ``if key in dict``: there is no doubt that the ``dict.has_key(x)``
interpretation of ``x in dict`` is by far the most useful interpretation,
probably the only useful one. There has been resistance against this because
``x in list`` checks whether *x* is present among the values, while the
proposal makes ``x in dict`` check whether *x* is present among the keys.
Given that the symmetry between lists and dictionaries is very weak, this
argument does not have much weight.
- The name iter() is an abbreviation. Alternatives proposed
include iterate(), traverse(), but these appear too long.
Python has a history of using abbrs for common builtins,
e.g. repr(), str(), len().
- The name ``iter()`` is an abbreviation. Alternatives proposed include
``iterate()``, ``traverse()``, but these appear too long. Python has a
history of using abbrs for common builtins, e.g. ``repr()``, ``str()``,
``len()``.
Resolution: iter() it is.
Resolution: ``iter()`` it is.
- Using the same name for two different operations (getting an
iterator from an object and making an iterator for a function
with a sentinel value) is somewhat ugly. I haven't seen a
better name for the second operation though, and since they both
return an iterator, it's easy to remember.
- Using the same name for two different operations (getting an iterator from an
object and making an iterator for a function with a sentinel value) is
somewhat ugly. I haven't seen a better name for the second operation though,
and since they both return an iterator, it's easy to remember.
Resolution: the builtin iter() takes an optional argument, which
is the sentinel to look for.
Resolution: the builtin ``iter()`` takes an optional argument, which is the
sentinel to look for.
- Once a particular iterator object has raised StopIteration, will
it also raise StopIteration on all subsequent next() calls?
Some say that it would be useful to require this, others say
that it is useful to leave this open to individual iterators.
Note that this may require an additional state bit for some
iterator implementations (e.g. function-wrapping iterators).
- Once a particular iterator object has raised ``StopIteration``, will it also
raise ``StopIteration`` on all subsequent ``next()`` calls? Some say that it
would be useful to require this, others say that it is useful to leave this
open to individual iterators. Note that this may require an additional state
bit for some iterator implementations (e.g. function-wrapping iterators).
Resolution: once StopIteration is raised, calling it.next()
continues to raise StopIteration.
Resolution: once ``StopIteration`` is raised, calling ``it.next()`` continues
to raise ``StopIteration``.
Note: this was in fact not implemented in Python 2.2; there are
many cases where an iterator's next() method can raise
StopIteration on one call but not on the next. This has been
remedied in Python 2.3.
Note: this was in fact not implemented in Python 2.2; there are many cases
where an iterator's ``next()`` method can raise ``StopIteration`` on one call
but not on the next. This has been remedied in Python 2.3.
- It has been proposed that a file object should be its own
iterator, with a next() method returning the next line. This
has certain advantages, and makes it even clearer that this
iterator is destructive. The disadvantage is that this would
make it even more painful to implement the "sticky
StopIteration" feature proposed in the previous bullet.
- It has been proposed that a file object should be its own iterator, with a
``next()`` method returning the next line. This has certain advantages, and
makes it even clearer that this iterator is destructive. The disadvantage is
that this would make it even more painful to implement the "sticky
StopIteration" feature proposed in the previous bullet.
Resolution: tentatively rejected (though there are still people
arguing for this).
Resolution: tentatively rejected (though there are still people arguing for
this).
- Some folks have requested extensions of the iterator protocol,
e.g. prev() to get the previous item, current() to get the
current item again, finished() to test whether the iterator is
finished, and maybe even others, like rewind(), __len__(),
position().
- Some folks have requested extensions of the iterator protocol, e.g.
``prev()`` to get the previous item, ``current()`` to get the current item
again, ``finished()`` to test whether the iterator is finished, and maybe
even others, like ``rewind()``, ``__len__()``, ``position()``.
While some of these are useful, many of these cannot easily be
implemented for all iterator types without adding arbitrary
buffering, and sometimes they can't be implemented at all (or
not reasonably). E.g. anything to do with reversing directions
can't be done when iterating over a file or function. Maybe a
separate PEP can be drafted to standardize the names for such
operations when the are implementable.
While some of these are useful, many of these cannot easily be implemented
for all iterator types without adding arbitrary buffering, and sometimes they
can't be implemented at all (or not reasonably). E.g. anything to do with
reversing directions can't be done when iterating over a file or function.
Maybe a separate PEP can be drafted to standardize the names for such
operations when they are implementable.
Resolution: rejected.
Resolution: rejected.
- There has been a long discussion about whether
- There has been a long discussion about whether
for x in dict: ...
::
should assign x the successive keys, values, or items of the
dictionary. The symmetry between "if x in y" and "for x in y"
suggests that it should iterate over keys. This symmetry has been
observed by many independently and has even been used to "explain"
one using the other. This is because for sequences, "if x in y"
iterates over y comparing the iterated values to x. If we adopt
both of the above proposals, this will also hold for
dictionaries.
for x in dict: ...
The argument against making "for x in dict" iterate over the keys
comes mostly from a practicality point of view: scans of the
standard library show that there are about as many uses of "for x
in dict.items()" as there are of "for x in dict.keys()", with the
items() version having a small majority. Presumably many of the
loops using keys() use the corresponding value anyway, by writing
dict[x], so (the argument goes) by making both the key and value
available, we could support the largest number of cases. While
this is true, I (Guido) find the correspondence between "for x in
dict" and "if x in dict" too compelling to break, and there's not
much overhead in having to write dict[x] to explicitly get the
value.
should assign *x* the successive keys, values, or items of the dictionary.
The symmetry between ``if x in y`` and ``for x in y`` suggests that it should
iterate over keys. This symmetry has been observed by many independently and
has even been used to "explain" one using the other. This is because for
sequences, ``if x in y`` iterates over *y* comparing the iterated values to
*x*. If we adopt both of the above proposals, this will also hold for
dictionaries.
For fast iteration over items, use "for key, value in
dict.iteritems()". I've timed the difference between
The argument against making ``for x in dict`` iterate over the keys comes
mostly from a practicality point of view: scans of the standard library show
that there are about as many uses of ``for x in dict.items()`` as there are
of ``for x in dict.keys()``, with the ``items()`` version having a small
majority. Presumably many of the loops using ``keys()`` use the
corresponding value anyway, by writing ``dict[x]``, so (the argument goes) by
making both the key and value available, we could support the largest number
of cases. While this is true, I (Guido) find the correspondence between
``for x in dict`` and ``if x in dict`` too compelling to break, and there's
not much overhead in having to write ``dict[x]`` to explicitly get the value.
for key in dict: dict[key]
For fast iteration over items, use ``for key, value in dict.iteritems()``.
I've timed the difference between
and
::
for key, value in dict.iteritems(): pass
for key in dict: dict[key]
and found that the latter is only about 7% faster.
and
Resolution: By BDFL pronouncement, "for x in dict" iterates over
the keys, and dictionaries have iteritems(), iterkeys(), and
itervalues() to return the different flavors of dictionary
iterators.
::
for key, value in dict.iteritems(): pass
and found that the latter is only about 7% faster.
Resolution: By BDFL pronouncement, ``for x in dict`` iterates over the keys,
and dictionaries have ``iteritems()``, ``iterkeys()``, and ``itervalues()``
to return the different flavors of dictionary iterators.
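The resolutions above can be illustrated with a short sketch. It assumes a modern Python 3 interpreter, where ``it.next()`` is spelled ``next(it)`` and ``dict.iteritems()`` has become ``dict.items()``; the sample data and variable names are hypothetical and are not part of the PEP.

```python
# Assumes Python 3: next(it) replaces it.next(), and dict.items()
# replaces dict.iteritems(). All names and data here are illustrative.
ages = {"ann": 31, "bob": 27}

# "for x in dict" visits keys, mirroring "if x in dict".
keys_seen = [name for name in ages]
assert all(name in ages for name in keys_seen)

# The value is one subscript away, or use items() for (key, value) pairs.
pairs_by_key = [(name, ages[name]) for name in ages]
pairs_by_items = [(name, age) for name, age in ages.items()]

# Sticky StopIteration: an exhausted iterator stays exhausted.
it = iter([1])
first = next(it)
stops = 0
for _ in range(3):
    try:
        next(it)
    except StopIteration:
        stops += 1

# Two-argument iter(): call a function until it returns the sentinel.
readings = [3, 1, 4, 0, 5]  # 0 plays the role of the sentinel
source = iter(readings)
until_zero = list(iter(lambda: next(source), 0))
```

Both spellings of the loop produce the same pairs, which is the point of the pronouncement: iterating over keys loses nothing, since the value can always be looked up.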
Mailing Lists
=============
The iterator protocol has been discussed extensively in a mailing
list on SourceForge:
The iterator protocol has been discussed extensively in a mailing list on
SourceForge:
http://lists.sourceforge.net/lists/listinfo/python-iterators
http://lists.sourceforge.net/lists/listinfo/python-iterators
Initially, some of the discussion was carried out at Yahoo;
archives are still accessible:
Initially, some of the discussion was carried out at Yahoo; archives are still
accessible:
http://groups.yahoo.com/group/python-iter
http://groups.yahoo.com/group/python-iter
Copyright
=========
This document is in the public domain.
This document is in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:


@ -3,11 +3,12 @@ Title: Simple Generators
Version: $Revision$
Last-Modified: $Date$
Author: nas@arctrix.com (Neil Schemenauer),
tim.peters@gmail.com (Tim Peters),
magnus@hetland.org (Magnus Lie Hetland)
tim.peters@gmail.com (Tim Peters),
magnus@hetland.org (Magnus Lie Hetland)
Discussions-To: python-iterators@lists.sourceforge.net
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Requires: 234
Created: 18-May-2001
Python-Version: 2.2
@ -15,230 +16,222 @@ Post-History: 14-Jun-2001, 23-Jun-2001
Abstract
========
This PEP introduces the concept of generators to Python, as well
as a new statement used in conjunction with them, the "yield"
statement.
This PEP introduces the concept of generators to Python, as well as a new
statement used in conjunction with them, the ``yield`` statement.
Motivation
==========
When a producer function has a hard enough job that it requires
maintaining state between values produced, most programming languages
offer no pleasant and efficient solution beyond adding a callback
function to the producer's argument list, to be called with each value
produced.
When a producer function has a hard enough job that it requires maintaining
state between values produced, most programming languages offer no pleasant and
efficient solution beyond adding a callback function to the producer's argument
list, to be called with each value produced.
For example, tokenize.py in the standard library takes this approach:
the caller must pass a "tokeneater" function to tokenize(), called
whenever tokenize() finds the next token. This allows tokenize to be
coded in a natural way, but programs calling tokenize are typically
convoluted by the need to remember between callbacks which token(s)
were seen last. The tokeneater function in tabnanny.py is a good
example of that, maintaining a state machine in global variables, to
remember across callbacks what it has already seen and what it hopes to
see next. This was difficult to get working correctly, and is still
difficult for people to understand. Unfortunately, that's typical of
this approach.
For example, ``tokenize.py`` in the standard library takes this approach: the
caller must pass a *tokeneater* function to ``tokenize()``, called whenever
``tokenize()`` finds the next token. This allows tokenize to be coded in a
natural way, but programs calling tokenize are typically convoluted by the need
to remember between callbacks which token(s) were seen last. The *tokeneater*
function in ``tabnanny.py`` is a good example of that, maintaining a state
machine in global variables, to remember across callbacks what it has already
seen and what it hopes to see next. This was difficult to get working
correctly, and is still difficult for people to understand. Unfortunately,
that's typical of this approach.
An alternative would have been for tokenize to produce an entire parse
of the Python program at once, in a large list. Then tokenize clients
could be written in a natural way, using local variables and local
control flow (such as loops and nested if statements) to keep track of
their state. But this isn't practical: programs can be very large, so
no a priori bound can be placed on the memory needed to materialize the
whole parse; and some tokenize clients only want to see whether
something specific appears early in the program (e.g., a future
statement, or, as is done in IDLE, just the first indented statement),
and then parsing the whole program first is a severe waste of time.
An alternative would have been for tokenize to produce an entire parse of the
Python program at once, in a large list. Then tokenize clients could be
written in a natural way, using local variables and local control flow (such as
loops and nested if statements) to keep track of their state. But this isn't
practical: programs can be very large, so no a priori bound can be placed on
the memory needed to materialize the whole parse; and some tokenize clients
only want to see whether something specific appears early in the program (e.g.,
a future statement, or, as is done in IDLE, just the first indented statement),
and then parsing the whole program first is a severe waste of time.
Another alternative would be to make tokenize an iterator[1],
delivering the next token whenever its .next() method is invoked. This
is pleasant for the caller in the same way a large list of results
would be, but without the memory and "what if I want to get out early?"
drawbacks. However, this shifts the burden on tokenize to remember
*its* state between .next() invocations, and the reader need only
glance at tokenize.tokenize_loop() to realize what a horrid chore that
would be. Or picture a recursive algorithm for producing the nodes of
a general tree structure: to cast that into an iterator framework
requires removing the recursion manually and maintaining the state of
the traversal by hand.
Another alternative would be to make tokenize an iterator [1]_, delivering the
next token whenever its ``.next()`` method is invoked. This is pleasant for the
caller in the same way a large list of results would be, but without the memory
and "what if I want to get out early?" drawbacks. However, this shifts the
burden on tokenize to remember *its* state between ``.next()`` invocations, and
the reader need only glance at ``tokenize.tokenize_loop()`` to realize what a
horrid chore that would be. Or picture a recursive algorithm for producing the
nodes of a general tree structure: to cast that into an iterator framework
requires removing the recursion manually and maintaining the state of the
traversal by hand.
A fourth option is to run the producer and consumer in separate
threads. This allows both to maintain their states in natural ways,
and so is pleasant for both. Indeed, Demo/threads/Generator.py in the
Python source distribution provides a usable synchronized-communication
class for doing that in a general way. This doesn't work on platforms
without threads, though, and is very slow on platforms that do
(compared to what is achievable without threads).
A fourth option is to run the producer and consumer in separate threads. This
allows both to maintain their states in natural ways, and so is pleasant for
both. Indeed, ``Demo/threads/Generator.py`` in the Python source distribution
provides a usable synchronized-communication class for doing that in a general
way. This doesn't work on platforms without threads, though, and is very slow
on platforms that do (compared to what is achievable without threads).
A final option is to use the Stackless[2][3] variant implementation of
Python instead, which supports lightweight coroutines. This has much
the same programmatic benefits as the thread option, but is much more
efficient. However, Stackless is a controversial rethinking of the
Python core, and it may not be possible for Jython to implement the
same semantics. This PEP isn't the place to debate that, so suffice it
to say here that generators provide a useful subset of Stackless
functionality in a way that fits easily into the current CPython
implementation, and is believed to be relatively straightforward for
other Python implementations.
A final option is to use the Stackless [2]_ [3]_ variant implementation of Python
instead, which supports lightweight coroutines. This has much the same
programmatic benefits as the thread option, but is much more efficient.
However, Stackless is a controversial rethinking of the Python core, and it may
not be possible for Jython to implement the same semantics. This PEP isn't the
place to debate that, so suffice it to say here that generators provide a
useful subset of Stackless functionality in a way that fits easily into the
current CPython implementation, and is believed to be relatively
straightforward for other Python implementations.
That exhausts the current alternatives. Some other high-level
languages provide pleasant solutions, notably iterators in Sather[4],
which were inspired by iterators in CLU; and generators in Icon[5], a
novel language where every expression "is a generator". There are
differences among these, but the basic idea is the same: provide a
kind of function that can return an intermediate result ("the next
value") to its caller, but maintaining the function's local state so
that the function can be resumed again right where it left off. A
very simple example:
That exhausts the current alternatives. Some other high-level languages
provide pleasant solutions, notably iterators in Sather [4]_, which were
inspired by iterators in CLU; and generators in Icon [5]_, a novel language
where every expression *is a generator*. There are differences among these,
but the basic idea is the same: provide a kind of function that can return an
intermediate result ("the next value") to its caller, but maintaining the
function's local state so that the function can be resumed again right where it
left off. A very simple example::
def fib():
a, b = 0, 1
while 1:
yield b
a, b = b, a+b
def fib():
a, b = 0, 1
while 1:
yield b
a, b = b, a+b
When fib() is first invoked, it sets a to 0 and b to 1, then yields b
back to its caller. The caller sees 1. When fib is resumed, from its
point of view the yield statement is really the same as, say, a print
statement: fib continues after the yield with all local state intact.
a and b then become 1 and 1, and fib loops back to the yield, yielding
1 to its invoker. And so on. From fib's point of view it's just
delivering a sequence of results, as if via callback. But from its
caller's point of view, the fib invocation is an iterable object that
can be resumed at will. As in the thread approach, this allows both
sides to be coded in the most natural ways; but unlike the thread
approach, this can be done efficiently and on all platforms. Indeed,
resuming a generator should be no more expensive than a function call.
When ``fib()`` is first invoked, it sets *a* to 0 and *b* to 1, then yields *b*
back to its caller. The caller sees 1. When ``fib`` is resumed, from its
point of view the ``yield`` statement is really the same as, say, a ``print``
statement: ``fib`` continues after the yield with all local state intact. *a*
and *b* then become 1 and 1, and ``fib`` loops back to the ``yield``, yielding
1 to its invoker. And so on. From ``fib``'s point of view it's just
delivering a sequence of results, as if via callback. But from its caller's
point of view, the ``fib`` invocation is an iterable object that can be resumed
at will. As in the thread approach, this allows both sides to be coded in the
most natural ways; but unlike the thread approach, this can be done efficiently
and on all platforms. Indeed, resuming a generator should be no more expensive
than a function call.
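As a rough illustration of the caller's side, the sketch below consumes ``fib()`` lazily. It assumes Python 3 (``next(gen)`` instead of ``gen.next()``, ``while True`` instead of ``while 1``), and the use of ``itertools.islice`` to bound the infinite sequence is a choice made here, not something the PEP prescribes.

```python
from itertools import islice


def fib():
    a, b = 0, 1
    while True:
        yield b            # hand b to the caller and pause here
        a, b = b, a + b    # runs only when the caller asks for more


# The generator is infinite, so take a bounded prefix.
first_eight = list(islice(fib(), 8))

# Manual resumption: each next() runs the body up to the next yield.
gen = fib()
one = next(gen)
one_again = next(gen)
two = next(gen)
```

Neither side keeps explicit state: ``fib`` uses ordinary local variables, and the caller uses an ordinary loop or ``next()`` calls.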
The same kind of approach applies to many producer/consumer functions.
For example, tokenize.py could yield the next token instead of invoking
a callback function with it as argument, and tokenize clients could
iterate over the tokens in a natural way: a Python generator is a kind
of Python iterator[1], but of an especially powerful kind.
The same kind of approach applies to many producer/consumer functions. For
example, ``tokenize.py`` could yield the next token instead of invoking a
callback function with it as argument, and tokenize clients could iterate over
the tokens in a natural way: a Python generator is a kind of Python
iterator [1]_, but of an especially powerful kind.
Specification: Yield
=====================
A new statement is introduced:
A new statement is introduced::
yield_stmt: "yield" expression_list
yield_stmt: "yield" expression_list
"yield" is a new keyword, so a future statement[8] is needed to phase
this in: in the initial release, a module desiring to use generators
must include the line
``yield`` is a new keyword, so a ``future`` statement [8]_ is needed to phase
this in: in the initial release, a module desiring to use generators must
include the line::
from __future__ import generators
from __future__ import generators
near the top (see PEP 236[8]) for details). Modules using the
identifier "yield" without a future statement will trigger warnings.
In the following release, yield will be a language keyword and the
future statement will no longer be needed.
near the top (see PEP 236 [8]_ for details). Modules using the identifier
``yield`` without a ``future`` statement will trigger warnings. In the
following release, ``yield`` will be a language keyword and the ``future``
statement will no longer be needed.
The yield statement may only be used inside functions. A function that
contains a yield statement is called a generator function. A generator
function is an ordinary function object in all respects, but has the
new CO_GENERATOR flag set in the code object's co_flags member.
The ``yield`` statement may only be used inside functions. A function that
contains a ``yield`` statement is called a generator function. A generator
function is an ordinary function object in all respects, but has the new
``CO_GENERATOR`` flag set in the code object's ``co_flags`` member.
When a generator function is called, the actual arguments are bound to
function-local formal argument names in the usual way, but no code in
the body of the function is executed. Instead a generator-iterator
object is returned; this conforms to the iterator protocol[6], so in
particular can be used in for-loops in a natural way. Note that when
the intent is clear from context, the unqualified name "generator" may
be used to refer either to a generator-function or a generator-
iterator.
When a generator function is called, the actual arguments are bound to
function-local formal argument names in the usual way, but no code in the body
of the function is executed. Instead a generator-iterator object is returned;
this conforms to the iterator protocol [6]_, so in particular can be used in
for-loops in a natural way. Note that when the intent is clear from context,
the unqualified name "generator" may be used to refer either to a
generator-function or a generator-iterator.
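A small sketch of the call semantics described above, assuming Python 3. The ``countdown`` function and the ``log`` list are hypothetical names used only to make the deferred execution observable; ``inspect.CO_GENERATOR`` is the standard-library constant for the flag mentioned earlier.

```python
import inspect

log = []  # records when the body actually starts running


def countdown(n):
    log.append("body started")
    while n > 0:
        yield n
        n -= 1


# Calling the generator function binds n but runs none of the body.
gen = countdown(3)
started_at_call = list(log)

# The code object carries the CO_GENERATOR flag.
has_flag = bool(countdown.__code__.co_flags & inspect.CO_GENERATOR)

# The returned generator-iterator follows the iterator protocol,
# so it is its own iterator and works in for-loops and list().
is_own_iterator = iter(gen) is gen
values = list(gen)
```

The body starts only when the first value is requested, which is why ``log`` is still empty immediately after the call.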
Each time the .next() method of a generator-iterator is invoked, the
code in the body of the generator-function is executed until a yield
or return statement (see below) is encountered, or until the end of
the body is reached.
Each time the ``.next()`` method of a generator-iterator is invoked, the code
in the body of the generator-function is executed until a ``yield`` or
``return`` statement (see below) is encountered, or until the end of the body
is reached.
If a yield statement is encountered, the state of the function is
frozen, and the value of expression_list is returned to .next()'s
caller. By "frozen" we mean that all local state is retained,
including the current bindings of local variables, the instruction
pointer, and the internal evaluation stack: enough information is
saved so that the next time .next() is invoked, the function can
proceed exactly as if the yield statement were just another external
call.
If a ``yield`` statement is encountered, the state of the function is frozen,
and the value of *expression_list* is returned to ``.next()``'s caller. By
"frozen" we mean that all local state is retained, including the current
bindings of local variables, the instruction pointer, and the internal
evaluation stack: enough information is saved so that the next time
``.next()`` is invoked, the function can proceed exactly as if the ``yield``
statement were just another external call.
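The "frozen" state can be seen in a minimal sketch like the following, assuming Python 3; ``running_total`` is a hypothetical example function. Local variables and the position inside the loop survive between resumptions, and each generator-iterator has its own copy of that state.

```python
def running_total():
    total = 0
    for step in (10, 20, 30):
        total += step
        yield total  # total and the loop position are kept until resumed


gen = running_total()
a = next(gen)
b = next(gen)

# A second generator-iterator starts with fresh, independent state.
other = running_total()
other_first = next(other)

c = next(gen)
```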
Restriction: A yield statement is not allowed in the try clause of a
try/finally construct. The difficulty is that there's no guarantee
the generator will ever be resumed, hence no guarantee that the finally
block will ever get executed; that's too much a violation of finally's
purpose to bear.
Restriction: A ``yield`` statement is not allowed in the ``try`` clause of a
``try/finally`` construct. The difficulty is that there's no guarantee the
generator will ever be resumed, hence no guarantee that the finally block will
ever get executed; that's too much a violation of finally's purpose to bear.
Restriction: A generator cannot be resumed while it is actively
running:
Restriction: A generator cannot be resumed while it is actively running::
>>> def g():
... i = me.next()
... yield i
>>> me = g()
>>> me.next()
Traceback (most recent call last):
...
File "<string>", line 2, in g
ValueError: generator already executing
>>> def g():
... i = me.next()
... yield i
>>> me = g()
>>> me.next()
Traceback (most recent call last):
...
File "<string>", line 2, in g
ValueError: generator already executing
Specification: Return
======================
A generator function can also contain return statements of the form:
A generator function can also contain return statements of the form::
"return"
return
Note that an expression_list is not allowed on return statements
in the body of a generator (although, of course, they may appear in
the bodies of non-generator functions nested within the generator).
Note that an *expression_list* is not allowed on return statements in the body
of a generator (although, of course, they may appear in the bodies of
non-generator functions nested within the generator).
When a return statement is encountered, control proceeds as in any
function return, executing the appropriate finally clauses (if any
exist). Then a StopIteration exception is raised, signalling that the
iterator is exhausted. A StopIteration exception is also raised if
control flows off the end of the generator without an explicit return.
When a return statement is encountered, control proceeds as in any function
return, executing the appropriate ``finally`` clauses (if any exist). Then a
``StopIteration`` exception is raised, signalling that the iterator is
exhausted. A ``StopIteration`` exception is also raised if control flows off
the end of the generator without an explicit return.
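A brief sketch of both ways a generator can finish, assuming Python 3; ``take_until_negative`` is a hypothetical helper. A bare ``return`` ends the iteration early, and running off the end of the body has the same effect.

```python
def take_until_negative(numbers):
    for n in numbers:
        if n < 0:
            return  # "I'm done": the caller sees StopIteration
        yield n
    # Falling off the end here also raises StopIteration in the caller.


stopped_early = list(take_until_negative([4, 7, -1, 9]))
ran_to_end = list(take_until_negative([1, 2]))

# With next(), the StopIteration is visible directly.
gen = take_until_negative([])
try:
    next(gen)
    ended = False
except StopIteration:
    ended = True
```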
Note that return means "I'm done, and have nothing interesting to
return", for both generator functions and non-generator functions.
Note that return means "I'm done, and have nothing interesting to return", for
both generator functions and non-generator functions.
Note that return isn't always equivalent to raising StopIteration: the
difference lies in how enclosing try/except constructs are treated.
For example,
Note that return isn't always equivalent to raising ``StopIteration``: the
difference lies in how enclosing ``try/except`` constructs are treated. For
example::
>>> def f1():
... try:
... return
... except:
... yield 1
>>> print list(f1())
[]
>>> def f1():
... try:
... return
... except:
... yield 1
>>> print list(f1())
[]
because, as in any function, return simply exits, but
because, as in any function, ``return`` simply exits, but::
>>> def f2():
... try:
... raise StopIteration
... except:
... yield 42
>>> print list(f2())
[42]
>>> def f2():
... try:
... raise StopIteration
... except:
... yield 42
>>> print list(f2())
[42]
because StopIteration is captured by a bare "except", as is any
exception.
because ``StopIteration`` is captured by a bare ``except``, as is any
exception.
Specification: Generators and Exception Propagation
====================================================
If an unhandled exception-- including, but not limited to,
StopIteration --is raised by, or passes through, a generator function,
then the exception is passed on to the caller in the usual way, and
subsequent attempts to resume the generator function raise
StopIteration. In other words, an unhandled exception terminates a
generator's useful life.
If an unhandled exception (including, but not limited to, ``StopIteration``)
is raised by, or passes through, a generator function, then the exception is
passed on to the caller in the usual way, and subsequent attempts to resume the
generator function raise ``StopIteration``. In other words, an unhandled
exception terminates a generator's useful life.
Example (not idiomatic but to illustrate the point):
Example (not idiomatic but to illustrate the point)::
>>> def f():
... return 1/0
@ -260,12 +253,13 @@ Specification: Generators and Exception Propagation
Specification: Try/Except/Finally
==================================
As noted earlier, yield is not allowed in the try clause of a try/
finally construct. A consequence is that generators should allocate
critical resources with great care. There is no restriction on yield
otherwise appearing in finally clauses, except clauses, or in the try
clause of a try/except construct:
As noted earlier, ``yield`` is not allowed in the ``try`` clause of a
``try/finally`` construct. A consequence is that generators should allocate
critical resources with great care. There is no restriction on ``yield``
otherwise appearing in ``finally`` clauses, ``except`` clauses, or in the
``try`` clause of a ``try/except`` construct::
>>> def f():
... try:
@ -287,7 +281,7 @@ Specification: Try/Except/Finally
... try:
... x = 12
... finally:
... yield 10
... yield 10
... yield 11
>>> print list(f())
[1, 2, 4, 5, 8, 9, 10, 11]
@ -295,217 +289,236 @@ Specification: Try/Except/Finally
Example
=======
# A binary tree class.
class Tree:
::
def __init__(self, label, left=None, right=None):
self.label = label
self.left = left
self.right = right
# A binary tree class.
class Tree:
def __repr__(self, level=0, indent=" "):
s = level*indent + `self.label`
if self.left:
s = s + "\n" + self.left.__repr__(level+1, indent)
if self.right:
s = s + "\n" + self.right.__repr__(level+1, indent)
return s
def __init__(self, label, left=None, right=None):
self.label = label
self.left = left
self.right = right
def __iter__(self):
return inorder(self)
def __repr__(self, level=0, indent=" "):
s = level*indent + `self.label`
if self.left:
s = s + "\n" + self.left.__repr__(level+1, indent)
if self.right:
s = s + "\n" + self.right.__repr__(level+1, indent)
return s
# Create a Tree from a list.
def tree(list):
n = len(list)
if n == 0:
return []
i = n / 2
return Tree(list[i], tree(list[:i]), tree(list[i+1:]))
def __iter__(self):
return inorder(self)
# A recursive generator that generates Tree labels in in-order.
def inorder(t):
if t:
for x in inorder(t.left):
yield x
yield t.label
for x in inorder(t.right):
yield x
# Create a Tree from a list.
def tree(list):
n = len(list)
if n == 0:
return []
i = n / 2
return Tree(list[i], tree(list[:i]), tree(list[i+1:]))
# Show it off: create a tree.
t = tree("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
# Print the nodes of the tree in in-order.
for x in t:
print x,
print
# A recursive generator that generates Tree labels in in-order.
def inorder(t):
if t:
for x in inorder(t.left):
yield x
yield t.label
for x in inorder(t.right):
yield x
# A non-recursive generator.
def inorder(node):
stack = []
while node:
while node.left:
stack.append(node)
node = node.left
# Show it off: create a tree.
t = tree("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
# Print the nodes of the tree in in-order.
for x in t:
print x,
print
# A non-recursive generator.
def inorder(node):
stack = []
while node:
while node.left:
stack.append(node)
node = node.left
yield node.label
while not node.right:
try:
node = stack.pop()
except IndexError:
return
yield node.label
while not node.right:
try:
node = stack.pop()
except IndexError:
return
yield node.label
node = node.right
node = node.right
# Exercise the non-recursive generator.
for x in t:
print x,
print
# Exercise the non-recursive generator.
for x in t:
print x,
print
Both output blocks display:
Both output blocks display::
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
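For readers on a current interpreter, the recursive variant can be sketched in Python 3 as follows. This is an adaptation rather than the PEP's own code: it uses ``//`` for integer division, returns ``None`` instead of ``[]`` for an empty subtree, renames the ``list`` parameter to ``items``, and omits ``__repr__``.

```python
class Tree:
    """A binary tree node; iterating a Tree yields labels in in-order."""

    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right

    def __iter__(self):
        return inorder(self)


def tree(items):
    """Build a balanced Tree from a sequence (None for an empty one)."""
    n = len(items)
    if n == 0:
        return None
    i = n // 2  # integer division; plain "/" gives a float in Python 3
    return Tree(items[i], tree(items[:i]), tree(items[i + 1:]))


def inorder(t):
    """Recursive generator yielding labels in in-order."""
    if t:
        for x in inorder(t.left):
            yield x
        yield t.label
        for x in inorder(t.right):
            yield x


labels = " ".join(tree("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
```

The generator keeps the recursion that a hand-written iterator class would have to unroll into an explicit stack.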
Q & A
=====
Q. Why not a new keyword instead of reusing "def"?
Why not a new keyword instead of reusing ``def``?
-------------------------------------------------
A. See BDFL Pronouncements section below.
See BDFL Pronouncements section below.
Q. Why a new keyword for "yield"? Why not a builtin function instead?
Why a new keyword for ``yield``? Why not a builtin function instead?
---------------------------------------------------------------------
A. Control flow is much better expressed via keyword in Python, and
yield is a control construct. It's also believed that efficient
implementation in Jython requires that the compiler be able to
determine potential suspension points at compile-time, and a new
keyword makes that easy. The CPython reference implementation also
exploits it heavily, to detect which functions *are* generator-
functions (although a new keyword in place of "def" would solve that
for CPython -- but people asking the "why a new keyword?" question
don't want any new keyword).
Control flow is much better expressed via keyword in Python, and ``yield`` is a
control construct. It's also believed that efficient implementation in Jython
requires that the compiler be able to determine potential suspension points at
compile-time, and a new keyword makes that easy. The CPython reference
implementation also exploits it heavily, to detect which functions *are*
generator-functions (although a new keyword in place of ``def`` would solve
that for CPython -- but people asking the "why a new keyword?" question don't
want any new keyword).
Q: Then why not some other special syntax without a new keyword? For
example, one of these instead of "yield 3":
Then why not some other special syntax without a new keyword?
-------------------------------------------------------------
return 3 and continue
return and continue 3
return generating 3
continue return 3
return >> , 3
from generator return 3
return >> 3
return << 3
>> 3
<< 3
* 3
For example, one of these instead of ``yield 3``::
return 3 and continue
return and continue 3
return generating 3
continue return 3
return >> , 3
from generator return 3
return >> 3
return << 3
>> 3
<< 3
* 3
A: Did I miss one <wink>? Out of hundreds of messages, I counted three
suggesting such an alternative, and extracted the above from them.
It would be nice not to need a new keyword, but nicer to make yield
very clear -- I don't want to have to *deduce* that a yield is
occurring from making sense of a previously senseless sequence of
keywords or operators. Still, if this attracts enough interest,
proponents should settle on a single consensus suggestion, and Guido
will Pronounce on it.
Did I miss one <wink>? Out of hundreds of messages, I counted three
suggesting such an alternative, and extracted the above from them. It would be
nice not to need a new keyword, but nicer to make ``yield`` very clear -- I
don't want to have to *deduce* that a yield is occurring from making sense of a
previously senseless sequence of keywords or operators. Still, if this
attracts enough interest, proponents should settle on a single consensus
suggestion, and Guido will Pronounce on it.
Q. Why allow "return" at all? Why not force termination to be spelled
"raise StopIteration"?
Why allow ``return`` at all? Why not force termination to be spelled ``raise StopIteration``?
----------------------------------------------------------------------------------------------
A. The mechanics of StopIteration are low-level details, much like the
mechanics of IndexError in Python 2.1: the implementation needs to
do *something* well-defined under the covers, and Python exposes
these mechanisms for advanced users. That's not an argument for
forcing everyone to work at that level, though. "return" means "I'm
done" in any kind of function, and that's easy to explain and to use.
Note that "return" isn't always equivalent to "raise StopIteration"
in try/except construct, either (see the "Specification: Return"
section).
The mechanics of ``StopIteration`` are low-level details, much like the
mechanics of ``IndexError`` in Python 2.1: the implementation needs to do
*something* well-defined under the covers, and Python exposes these mechanisms
for advanced users. That's not an argument for forcing everyone to work at
that level, though. ``return`` means "I'm done" in any kind of function, and
that's easy to explain and to use. Note that ``return`` isn't always equivalent
to ``raise StopIteration`` in a try/except construct, either (see the
"Specification: Return" section).
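As a small illustration of the point above, here is a sketch in which a bare ``return`` ends a generator. The function name ``countdown`` is invented for this example and does not appear in the PEP.

```python
def countdown(n):
    """Yield n, n-1, ..., 1; produce nothing at all if n is not positive."""
    if n <= 0:
        return  # "I'm done": the consumer just sees the iteration end
    while n > 0:
        yield n
        n -= 1


print(list(countdown(3)))  # [3, 2, 1]
print(list(countdown(0)))  # []
```

The caller never has to mention ``StopIteration``; the ``for`` loop or ``list()`` call deals with that detail under the covers.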
Q. Then why not allow an expression on "return" too?
Then why not allow an expression on ``return`` too?
---------------------------------------------------
A. Perhaps we will someday. In Icon, "return expr" means both "I'm
done", and "but I have one final useful value to return too, and
this is it". At the start, and in the absence of compelling uses
for "return expr", it's simply cleaner to use "yield" exclusively
for delivering values.
Perhaps we will someday. In Icon, ``return expr`` means both "I'm done", and
"but I have one final useful value to return too, and this is it". At the
start, and in the absence of compelling uses for ``return expr``, it's simply
cleaner to use ``yield`` exclusively for delivering values.
BDFL Pronouncements
===================
Issue: Introduce another new keyword (say, "gen" or "generator") in
place of "def", or otherwise alter the syntax, to distinguish
generator-functions from non-generator functions.
Issue
-----
Con: In practice (how you think about them), generators *are*
functions, but with the twist that they're resumable. The mechanics of
how they're set up is a comparatively minor technical issue, and
introducing a new keyword would unhelpfully overemphasize the
mechanics of how generators get started (a vital but tiny part of a
generator's life).
Introduce another new keyword (say, ``gen`` or ``generator``) in place
of ``def``, or otherwise alter the syntax, to distinguish generator-functions
from non-generator functions.
Pro: In reality (how you think about them), generator-functions are
actually factory functions that produce generator-iterators as if by
magic. In this respect they're radically different from non-generator
functions, acting more like a constructor than a function, so reusing
"def" is at best confusing. A "yield" statement buried in the body is
not enough warning that the semantics are so different.
Con
---
BDFL: "def" it stays. No argument on either side is totally
convincing, so I have consulted my language designer's intuition. It
tells me that the syntax proposed in the PEP is exactly right - not too
hot, not too cold. But, like the Oracle at Delphi in Greek mythology,
it doesn't tell me why, so I don't have a rebuttal for the arguments
against the PEP syntax. The best I can come up with (apart from
agreeing with the rebuttals ... already made) is "FUD". If this had
been part of the language from day one, I very much doubt it would have
made Andrew Kuchling's "Python Warts" page.
In practice (how you think about them), generators *are* functions, but
with the twist that they're resumable. The mechanics of how they're set up
is a comparatively minor technical issue, and introducing a new keyword would
unhelpfully overemphasize the mechanics of how generators get started (a vital
but tiny part of a generator's life).
Pro
---
In reality (how you think about them), generator-functions are actually
factory functions that produce generator-iterators as if by magic. In this
respect they're radically different from non-generator functions, acting more
like a constructor than a function, so reusing ``def`` is at best confusing.
A ``yield`` statement buried in the body is not enough warning that the
semantics are so different.
BDFL
----
``def`` it stays. No argument on either side is totally convincing, so I
have consulted my language designer's intuition. It tells me that the syntax
proposed in the PEP is exactly right - not too hot, not too cold. But, like
the Oracle at Delphi in Greek mythology, it doesn't tell me why, so I don't
have a rebuttal for the arguments against the PEP syntax. The best I can come
up with (apart from agreeing with the rebuttals ... already made) is "FUD".
If this had been part of the language from day one, I very much doubt it would
have made Andrew Kuchling's "Python Warts" page.
Reference Implementation
========================
The current implementation, in a preliminary state (no docs, but well
tested and solid), is part of Python's CVS development tree[9]. Using
this requires that you build Python from source.
The current implementation, in a preliminary state (no docs, but well tested
and solid), is part of Python's CVS development tree [9]_. Using this requires
that you build Python from source.
This was derived from an earlier patch by Neil Schemenauer[7].
This was derived from an earlier patch by Neil Schemenauer [7]_.
Footnotes and References
========================
[1] PEP 234, Iterators, Yee, Van Rossum
http://www.python.org/dev/peps/pep-0234/
.. [1] PEP 234, Iterators, Yee, Van Rossum
http://www.python.org/dev/peps/pep-0234/
[2] http://www.stackless.com/
.. [2] http://www.stackless.com/
[3] PEP 219, Stackless Python, McMillan
http://www.python.org/dev/peps/pep-0219/
.. [3] PEP 219, Stackless Python, McMillan
http://www.python.org/dev/peps/pep-0219/
[4] "Iteration Abstraction in Sather"
Murer, Omohundro, Stoutamire and Szyperski
http://www.icsi.berkeley.edu/~sather/Publications/toplas.html
.. [4] "Iteration Abstraction in Sather"
Murer, Omohundro, Stoutamire and Szyperski
http://www.icsi.berkeley.edu/~sather/Publications/toplas.html
[5] http://www.cs.arizona.edu/icon/
.. [5] http://www.cs.arizona.edu/icon/
[6] The concept of iterators is described in PEP 234. See [1] above.
.. [6] The concept of iterators is described in PEP 234. See [1] above.
[7] http://python.ca/nas/python/generator.diff
.. [7] http://python.ca/nas/python/generator.diff
[8] PEP 236, Back to the __future__, Peters
http://www.python.org/dev/peps/pep-0236/
.. [8] PEP 236, Back to the __future__, Peters
http://www.python.org/dev/peps/pep-0236/
[9] To experiment with this implementation, check out Python from CVS
according to the instructions at
http://sf.net/cvs/?group_id=5470
Note that the std test Lib/test/test_generators.py contains many
examples, including all those in this PEP.
.. [9] To experiment with this implementation, check out Python from CVS
according to the instructions at http://sf.net/cvs/?group_id=5470
Note that the std test ``Lib/test/test_generators.py`` contains many
examples, including all those in this PEP.
Copyright
=========
This document has been placed in the public domain.
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:

@ -5,554 +5,577 @@ Last-Modified: $Date$
Author: Steven D'Aprano <steve@pearwood.info>
Status: Final
Type: Standards Track
Content-Type: text/plain
Content-Type: text/x-rst
Created: 01-Aug-2013
Python-Version: 3.4
Post-History: 13-Sep-2013
Abstract
========
This PEP proposes the addition of a module for common statistics functions
such as mean, median, variance and standard deviation to the Python
standard library. See also http://bugs.python.org/issue18606
This PEP proposes the addition of a module for common statistics functions such
as mean, median, variance and standard deviation to the Python standard
library. See also http://bugs.python.org/issue18606
Rationale
=========
The proposed statistics module is motivated by the "batteries included"
philosophy towards the Python standard library. Raymond Hettinger and
other senior developers have requested a quality statistics library that
falls somewhere in between high-end statistics libraries and ad hoc
code.[1] Statistical functions such as mean, standard deviation and others
are obvious and useful batteries, familiar to any Secondary School student.
Even cheap scientific calculators typically include multiple statistical
functions such as:
The proposed statistics module is motivated by the "batteries included"
philosophy towards the Python standard library. Raymond Hettinger and other
senior developers have requested a quality statistics library that falls
somewhere in between high-end statistics libraries and ad hoc code. [1]_
Statistical functions such as mean, standard deviation and others are obvious
and useful batteries, familiar to any Secondary School student. Even cheap
scientific calculators typically include multiple statistical functions such
as:
- mean
- population and sample variance
- population and sample standard deviation
- linear regression
- correlation coefficient
- mean
- population and sample variance
- population and sample standard deviation
- linear regression
- correlation coefficient
Graphing calculators aimed at Secondary School students typically
include all of the above, plus some or all of:
Graphing calculators aimed at Secondary School students typically include all
of the above, plus some or all of:
- median
- mode
- functions for calculating the probability of random variables
from the normal, t, chi-squared, and F distributions
- inference on the mean
- median
- mode
- functions for calculating the probability of random variables from the
normal, t, chi-squared, and F distributions
- inference on the mean
and others[2]. Likewise spreadsheet applications such as Microsoft Excel,
LibreOffice and Gnumeric include rich collections of statistical
functions[3].
and others [2]_. Likewise spreadsheet applications such as Microsoft Excel,
LibreOffice and Gnumeric include rich collections of statistical
functions [3]_.
In contrast, Python currently has no standard way to calculate even the
simplest and most obvious statistical functions such as mean. For those
who need statistical functions in Python, there are two obvious solutions:
In contrast, Python currently has no standard way to calculate even the
simplest and most obvious statistical functions such as mean. For those who
need statistical functions in Python, there are two obvious solutions:
- install numpy and/or scipy[4];
- install numpy and/or scipy [4]_;
- or use a Do It Yourself solution.
- or use a Do It Yourself solution.
Numpy is perhaps the most full-featured solution, but it has a few
disadvantages:
Numpy is perhaps the most full-featured solution, but it has a few
disadvantages:
- It may be overkill for many purposes. The documentation for numpy even
warns
- It may be overkill for many purposes. The documentation for numpy even warns
"It can be hard to know what functions are available in
numpy. This is not a complete list, but it does cover
most of them."[5]
"It can be hard to know what functions are available in numpy. This is
not a complete list, but it does cover most of them." [5]_
and then goes on to list over 270 functions, only a small number of
which are related to statistics.
and then goes on to list over 270 functions, only a small number of which are
related to statistics.
- Numpy is aimed at those doing heavy numerical work, and may be
intimidating to those who don't have a background in computational
mathematics and computer science. For example, numpy.mean takes four
arguments:
- Numpy is aimed at those doing heavy numerical work, and may be intimidating
to those who don't have a background in computational mathematics and
computer science. For example, ``numpy.mean`` takes four arguments::
mean(a, axis=None, dtype=None, out=None)
mean(a, axis=None, dtype=None, out=None)
although fortunately for the beginner or casual numpy user, three are
optional and numpy.mean does the right thing in simple cases:
although fortunately for the beginner or casual numpy user, three are
optional and ``numpy.mean`` does the right thing in simple cases::
>>> numpy.mean([1, 2, 3, 4])
2.5
>>> numpy.mean([1, 2, 3, 4])
2.5
- For many people, installing numpy may be difficult or impossible. For
example, people in corporate environments may have to go through a
difficult, time-consuming process before being permitted to install
third-party software. For the casual Python user, having to learn about
installing third-party packages in order to average a list of numbers is
unfortunate.
- For many people, installing numpy may be difficult or impossible. For
example, people in corporate environments may have to go through a difficult,
time-consuming process before being permitted to install third-party
software. For the casual Python user, having to learn about installing
third-party packages in order to average a list of numbers is unfortunate.
This leads to option number 2, DIY statistics functions. At first glance,
this appears to be an attractive option, due to the apparent simplicity of
common statistical functions. For example:
This leads to option number 2, DIY statistics functions. At first glance, this
appears to be an attractive option, due to the apparent simplicity of common
statistical functions. For example::
def mean(data):
    return sum(data)/len(data)
def mean(data):
    return sum(data)/len(data)
def variance(data):
    # Use the Computational Formula for Variance.
    n = len(data)
    ss = sum(x**2 for x in data) - (sum(data)**2)/n
    return ss/(n-1)
def variance(data):
    # Use the Computational Formula for Variance.
    n = len(data)
    ss = sum(x**2 for x in data) - (sum(data)**2)/n
    return ss/(n-1)
def standard_deviation(data):
    return math.sqrt(variance(data))
def standard_deviation(data):
    return math.sqrt(variance(data))
The above appears to be correct with a casual test:
The above appears to be correct with a casual test::
>>> data = [1, 2, 4, 5, 8]
>>> variance(data)
7.5
>>> data = [1, 2, 4, 5, 8]
>>> variance(data)
7.5
But adding a constant to every data point should not change the variance:
But adding a constant to every data point should not change the variance::
>>> data = [x+1e12 for x in data]
>>> variance(data)
0.0
>>> data = [x+1e12 for x in data]
>>> variance(data)
0.0
And variance should *never* be negative:
And variance should *never* be negative::
>>> variance(data*100)
-1239429440.1282566
>>> variance(data*100)
-1239429440.1282566
By contrast, the proposed reference implementation gets the exactly correct
answer 7.5 for the first two examples, and a reasonably close answer for
the third: 6.012. numpy does no better[6].
By contrast, the proposed reference implementation gets the exactly correct
answer 7.5 for the first two examples, and a reasonably close answer for the
third: 6.012. numpy does no better [6]_.
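For comparison, here is a minimal sketch of a two-pass calculation that avoids the cancellation shown above. It is not the reference implementation; the name ``two_pass_variance`` and the choice of ``math.fsum`` for both passes are assumptions made for this illustration.

```python
import math


def two_pass_variance(data):
    """Sample variance: compute the mean first, then sum squared deviations."""
    data = list(data)
    n = len(data)
    if n < 2:
        raise ValueError("sample variance needs at least two data points")
    mean = math.fsum(data) / n                     # first pass over the data
    ss = math.fsum((x - mean) ** 2 for x in data)  # second pass over the data
    return ss / (n - 1)


data = [1, 2, 4, 5, 8]
shifted = [x + 1e12 for x in data]
print(two_pass_variance(data))     # 7.5
print(two_pass_variance(shifted))  # 7.5
```

Subtracting the mean before squaring keeps the intermediate values small, so adding a large constant to every data point no longer wipes out the result.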
Even simple statistical calculations contain traps for the unwary, starting
with the Computational Formula itself. Despite the name, it is numerically
unstable and can be extremely inaccurate, as can be seen above. It is
completely unsuitable for computation by computer[7]. This problem plagues
users of many programming language, not just Python[8], as coders reinvent
the same numerically inaccurate code over and over again[9], or advise
others to do so[10].
Even simple statistical calculations contain traps for the unwary, starting
with the Computational Formula itself. Despite the name, it is numerically
unstable and can be extremely inaccurate, as can be seen above. It is
completely unsuitable for computation by computer [7]_. This problem plagues
users of many programming languages, not just Python [8]_, as coders reinvent
the same numerically inaccurate code over and over again [9]_, or advise others
to do so [10]_.
It isn't just the variance and standard deviation. Even the mean is not
quite as straightforward as it might appear. The above implementation
seems too simple to have problems, but it does:
It isn't just the variance and standard deviation. Even the mean is not quite
as straightforward as it might appear. The above implementation seems too
simple to have problems, but it does:
- The built-in sum can lose accuracy when dealing with floats of wildly
differing magnitude. Consequently, the above naive mean fails this
"torture test":
- The built-in ``sum`` can lose accuracy when dealing with floats of wildly
differing magnitude. Consequently, the above naive ``mean`` fails this
"torture test"::
assert mean([1e30, 1, 3, -1e30]) == 1
assert mean([1e30, 1, 3, -1e30]) == 1
returning 0 instead of 1, a purely computational error of 100%.
returning 0 instead of 1, a purely computational error of 100%.
- Using math.fsum inside mean will make it more accurate with float data,
but it also has the side-effect of converting any arguments to float
even when unnecessary. E.g. we should expect the mean of a list of
Fractions to be a Fraction, not a float.
- Using ``math.fsum`` inside ``mean`` will make it more accurate with float
data, but it also has the side-effect of converting any arguments to float
even when unnecessary. E.g. we should expect the mean of a list of Fractions
to be a Fraction, not a float.
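Both points can be checked directly with the standard library. This is only a sketch, and the variable names are chosen for the example.

```python
import math
from fractions import Fraction

values = [1e30, 1, 3, -1e30]
naive_mean = sum(values) / len(values)
fsum_mean = math.fsum(values) / len(values)
print(naive_mean)  # 0.0, the small terms are swallowed by 1e30
print(fsum_mean)   # 1.0

thirds = [Fraction(1, 3), Fraction(2, 3)]
exact_mean = sum(thirds) / len(thirds)
fsum_total = math.fsum(thirds)
print(exact_mean)                 # 1/2, still a Fraction
print(type(fsum_total).__name__)  # float
```

So ``math.fsum`` repairs the float case, but at the price of discarding the exact type of the input.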
While the above mean implementation does not fail quite as catastrophically
as the naive variance does, a standard library function can do much better
than the DIY versions.
While the above mean implementation does not fail quite as catastrophically as
the naive variance does, a standard library function can do much better than
the DIY versions.
The example above involves an especially bad set of data, but even for
more realistic data sets accuracy is important. The first step in
interpreting variation in data (including dealing with ill-conditioned
data) is often to standardize it to a series with variance 1 (and often
mean 0). This standardization requires accurate computation of the mean
and variance of the raw series. Naive computation of mean and variance
can lose precision very quickly. Because precision bounds accuracy, it is
important to use the most precise algorithms for computing mean and
variance that are practical, or the results of standardization are
themselves useless.
The example above involves an especially bad set of data, but even for more
realistic data sets accuracy is important. The first step in interpreting
variation in data (including dealing with ill-conditioned data) is often to
standardize it to a series with variance 1 (and often mean 0). This
standardization requires accurate computation of the mean and variance of the
raw series. Naive computation of mean and variance can lose precision very
quickly. Because precision bounds accuracy, it is important to use the most
precise algorithms for computing mean and variance that are practical, or the
results of standardization are themselves useless.
Comparison To Other Languages/Packages
======================================
The proposed statistics library is not intended to be a competitor to such
third-party libraries as numpy/scipy, or of proprietary full-featured
statistics packages aimed at professional statisticians such as Minitab,
SAS and Matlab. It is aimed at the level of graphing and scientific
calculators.
The proposed statistics library is not intended to be a competitor to such
third-party libraries as numpy/scipy, or to proprietary full-featured
statistics packages aimed at professional statisticians such as Minitab, SAS
and Matlab. It is aimed at the level of graphing and scientific calculators.
Most programming languages have little or no built-in support for
statistics functions. Some exceptions:
Most programming languages have little or no built-in support for statistics
functions. Some exceptions:
R
R (and its proprietary cousin, S) is a programming language designed
for statistics work. It is extremely popular with statisticians and
is extremely feature-rich[11].
R
-
C#
R (and its proprietary cousin, S) is a programming language designed for
statistics work. It is extremely popular with statisticians and is extremely
feature-rich [11]_.
The C# LINQ package includes extension methods to calculate the
average of enumerables[12].
C#
--
Ruby
The C# LINQ package includes extension methods to calculate the average of
enumerables [12]_.
Ruby does not ship with a standard statistics module, despite some
apparent demand[13]. Statsample appears to be a feature-rich third-
party library, aiming to compete with R[14].
Ruby
----
PHP
Ruby does not ship with a standard statistics module, despite some apparent
demand [13]_. Statsample appears to be a feature-rich third-party library,
aiming to compete with R [14]_.
PHP has an extremely feature-rich (although mostly undocumented) set
of advanced statistical functions[15].
PHP
---
Delphi
PHP has an extremely feature-rich (although mostly undocumented) set of
advanced statistical functions [15]_.
Delphi includes standard statistical functions including Mean, Sum,
Variance, TotalVariance, MomentSkewKurtosis in its Math library[16].
Delphi
------
GNU Scientific Library
Delphi includes standard statistical functions including Mean, Sum,
Variance, TotalVariance, MomentSkewKurtosis in its Math library [16]_.
The GNU Scientific Library includes standard statistical functions,
percentiles, median and others[17]. One innovation I have borrowed
from the GSL is to allow the caller to optionally specify the pre-
calculated mean of the sample (or an a priori known population mean)
when calculating the variance and standard deviation[18].
GNU Scientific Library
----------------------
The GNU Scientific Library includes standard statistical functions,
percentiles, median and others [17]_. One innovation I have borrowed from the
GSL is to allow the caller to optionally specify the pre-calculated mean of
the sample (or an a priori known population mean) when calculating the variance
and standard deviation [18]_.
Design Decisions Of The Module
==============================
My intention is to start small and grow the library as needed, rather than
try to include everything from the start. Consequently, the current
reference implementation includes only a small number of functions: mean,
variance, standard deviation, median, mode. (See the reference
implementation for a full list.)
My intention is to start small and grow the library as needed, rather than try
to include everything from the start. Consequently, the current reference
implementation includes only a small number of functions: mean, variance,
standard deviation, median, mode. (See the reference implementation for a full
list.)
I have aimed for the following design features:
I have aimed for the following design features:
- Correctness over speed. It is easier to speed up a correct but slow
function than to correct a fast but buggy one.
- Correctness over speed. It is easier to speed up a correct but slow function
than to correct a fast but buggy one.
- Concentrate on data in sequences, allowing two-passes over the data,
rather than potentially compromise on accuracy for the sake of a one-pass
algorithm. Functions expect data will be passed as a list or other
sequence; if given an iterator, they may internally convert to a list.
- Concentrate on data in sequences, allowing two passes over the data, rather
than potentially compromise on accuracy for the sake of a one-pass algorithm.
Functions expect data will be passed as a list or other sequence; if given an
iterator, they may internally convert to a list.
- Functions should, as much as possible, honour any type of numeric data.
E.g. the mean of a list of Decimals should be a Decimal, not a float.
When this is not possible, treat float as the "lowest common data type".
- Functions should, as much as possible, honour any type of numeric data. E.g.
the mean of a list of Decimals should be a Decimal, not a float. When this is
not possible, treat float as the "lowest common data type".
- Although functions support data sets of floats, Decimals or Fractions,
there is no guarantee that *mixed* data sets will be supported. (But on
the other hand, they aren't explicitly rejected either.)
- Although functions support data sets of floats, Decimals or Fractions, there
is no guarantee that *mixed* data sets will be supported. (But on the other
hand, they aren't explicitly rejected either.)
- Plenty of documentation, aimed at readers who understand the basic
concepts but may not know (for example) which variance they should use
(population or sample?). Mathematicians and statisticians have a terrible
habit of being inconsistent with both notation and terminology[19], and
having spent many hours making sense of the contradictory/confusing
definitions in use, it is only fair that I do my best to clarify rather
than obfuscate the topic.
- Plenty of documentation, aimed at readers who understand the basic concepts
but may not know (for example) which variance they should use (population or
sample?). Mathematicians and statisticians have a terrible habit of being
inconsistent with both notation and terminology [19]_, and having spent many
hours making sense of the contradictory/confusing definitions in use, it is
only fair that I do my best to clarify rather than obfuscate the topic.
- But avoid going into tedious[20] mathematical detail.
- But avoid going into tedious [20]_ mathematical detail.
API
===
The initial version of the library will provide univariate (single
variable) statistics functions. The general API will be based on a
functional model ``function(data, ...) -> result``, where ``data``
is a mandatory iterable of (usually) numeric data.
The initial version of the library will provide univariate (single variable)
statistics functions. The general API will be based on a functional model
``function(data, ...) -> result``, where ``data`` is a mandatory iterable of
(usually) numeric data.
The author expects that lists will be the most common data type used,
but any iterable type should be acceptable. Where necessary, functions
may convert to lists internally. Where possible, functions are
expected to conserve the type of the data values, for example, the mean
of a list of Decimals should be a Decimal rather than float.
The author expects that lists will be the most common data type used, but any
iterable type should be acceptable. Where necessary, functions may convert to
lists internally. Where possible, functions are expected to conserve the type
of the data values, for example, the mean of a list of Decimals should be a
Decimal rather than float.
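The type-conservation goal can be tried out with the ``statistics`` module as it ships in Python 3.4 and later. This is a sketch only: the shipped module may differ in detail from the text of this PEP, and the variable names are invented for the example.

```python
import statistics
from decimal import Decimal
from fractions import Fraction

decimal_mean = statistics.mean([Decimal("1.5"), Decimal("2.5")])
fraction_mean = statistics.mean([Fraction(1, 2), Fraction(3, 2)])

# Each result keeps the type of its input rather than falling back to float.
print(type(decimal_mean).__name__, decimal_mean)
print(type(fraction_mean).__name__, fraction_mean)
```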
Calculating mean, median and mode
Calculating mean, median and mode
---------------------------------
The ``mean``, ``median*`` and ``mode`` functions take a single
mandatory argument and return the appropriate statistic, e.g.:
The ``mean``, ``median*`` and ``mode`` functions take a single mandatory
argument and return the appropriate statistic, e.g.::
>>> mean([1, 2, 3])
2.0
>>> mean([1, 2, 3])
2.0
Functions provided are:
Functions provided are:
* mean(data) -> arithmetic mean of data.
* ``mean(data)``
arithmetic mean of *data*.
* median(data) -> median (middle value) of data, taking the
average of the two middle values when there are an even
number of values.
* ``median(data)``
median (middle value) of *data*, taking the average of the two
middle values when there are an even number of values.
* median_high(data) -> high median of data, taking the
larger of the two middle values when the number of items
is even.
* ``median_high(data)``
high median of *data*, taking the larger of the two middle
values when the number of items is even.
* median_low(data) -> low median of data, taking the smaller
of the two middle values when the number of items is even.
* ``median_low(data)``
low median of *data*, taking the smaller of the two middle
values when the number of items is even.
* median_grouped(data, interval=1) -> 50th percentile of
grouped data, using interpolation.
* ``median_grouped(data, interval=1)``
50th percentile of grouped *data*, using interpolation.
* mode(data) -> most common data point.
* ``mode(data)``
most common *data* point.
``mode`` is the sole exception to the rule that the data argument
must be numeric. It will also accept an iterable of nominal data,
such as strings.
``mode`` is the sole exception to the rule that the data argument must be
numeric. It will also accept an iterable of nominal data, such as strings.
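A short sketch of how the median variants and ``mode`` behave, using the ``statistics`` module from Python 3.4 and later. The sample data is made up for the example.

```python
import statistics

data = [1, 3, 5, 7]
print(statistics.median(data))       # 4.0, the average of 3 and 5
print(statistics.median_low(data))   # 3
print(statistics.median_high(data))  # 5

# mode also accepts nominal (non-numeric) data such as strings.
print(statistics.mode(["red", "blue", "red"]))  # red
```

``median_low`` and ``median_high`` always return a value that actually occurs in the data, which matters when the data is discrete.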
Calculating variance and standard deviation
Calculating variance and standard deviation
-------------------------------------------
In order to be similar to scientific calculators, the statistics
module will include separate functions for population and sample
variance and standard deviation. All four functions have similar
signatures, with a single mandatory argument, an iterable of
numeric data, e.g.:
In order to be similar to scientific calculators, the statistics module will
include separate functions for population and sample variance and standard
deviation. All four functions have similar signatures, with a single mandatory
argument, an iterable of numeric data, e.g.::
>>> variance([1, 2, 2, 2, 3])
0.5
>>> variance([1, 2, 2, 2, 3])
0.5
All four functions also accept a second, optional, argument, the
mean of the data. This is modelled on a similar API provided by
the GNU Scientific Library[18]. There are three use-cases for
using this argument, in no particular order:
All four functions also accept a second, optional, argument, the mean of the
data. This is modelled on a similar API provided by the GNU Scientific
Library [18]_. There are three use-cases for using this argument, in no
particular order:
1) The value of the mean is known *a priori*.
1) The value of the mean is known *a priori*.
2) You have already calculated the mean, and wish to avoid
calculating it again.
2) You have already calculated the mean, and wish to avoid calculating
it again.
3) You wish to (ab)use the variance functions to calculate
the second moment about some given point other than the
mean.
3) You wish to (ab)use the variance functions to calculate the second
moment about some given point other than the mean.
In each case, it is the caller's responsibility to ensure that
given argument is meaningful.
In each case, it is the caller's responsibility to ensure that the given
argument is meaningful.
Functions provided are:
Functions provided are:
* variance(data, xbar=None) -> sample variance of data,
optionally using xbar as the sample mean.
* ``variance(data, xbar=None)``
sample variance of *data*, optionally using *xbar* as the sample mean.
* stdev(data, xbar=None) -> sample standard deviation of
data, optionally using xbar as the sample mean.
* ``stdev(data, xbar=None)``
sample standard deviation of *data*, optionally using *xbar* as the
sample mean.
* pvariance(data, mu=None) -> population variance of data,
optionally using mu as the population mean.
* ``pvariance(data, mu=None)``
population variance of *data*, optionally using *mu* as the population
mean.
* pstdev(data, mu=None) -> population standard deviation of
data, optionally using mu as the population mean.
* ``pstdev(data, mu=None)``
population standard deviation of *data*, optionally using *mu* as the
population mean.
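
The four signatures listed above can be exercised as follows. This is a minimal sketch that assumes the ``statistics`` module as it ships with Python 3.4 and later; the variable names are illustrative only, and the optional second argument is always given the true mean of the data.

```python
import statistics

data = [1, 2, 2, 2, 3]

# Sample statistics divide by n - 1.
sample_var = statistics.variance(data)        # 0.5
sample_sd = statistics.stdev(data)

# Population statistics divide by n.
pop_var = statistics.pvariance(data)          # 0.4
pop_sd = statistics.pstdev(data)

# Use case 2 from the list above: reuse a mean that was already computed.
xbar = statistics.mean(data)
sample_var_reused = statistics.variance(data, xbar)
pop_var_reused = statistics.pvariance(data, xbar)
```

Passing the mean explicitly only saves recomputing it; the caller remains responsible for supplying a meaningful value.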
Other functions
Other functions
---------------
There is one other public function:
There is one other public function:
* sum(data, start=0) -> high-precision sum of numeric data.
* ``sum(data, start=0)``
high-precision sum of numeric *data*.
Specification
=============
As the proposed reference implementation is in pure Python,
other Python implementations can easily make use of the module
unchanged, or adapt it as they see fit.
As the proposed reference implementation is in pure Python, other Python
implementations can easily make use of the module unchanged, or adapt it as
they see fit.
What Should Be The Name Of The Module?
======================================
This will be a top-level module "statistics".
This will be a top-level module ``statistics``.
There was some interest in turning math into a package, and making this a
sub-module of math, but the general consensus eventually agreed on a
top-level module. Other potential but rejected names included "stats" (too
much risk of confusion with existing "stat" module), and "statslib"
(described as "too C-like").
There was some interest in turning ``math`` into a package, and making this a
sub-module of ``math``, but the general consensus eventually agreed on a
top-level module. Other potential but rejected names included ``stats`` (too
much risk of confusion with existing ``stat`` module), and ``statslib``
(described as "too C-like").
Discussion And Resolved Issues
==============================
This proposal has been previously discussed here[21].
This proposal has been previously discussed here [21]_.
A number of design issues were resolved during the discussion on
Python-Ideas and the initial code review. There was a lot of concern
about the addition of yet another ``sum`` function to the standard
library, see the FAQs below for more details. In addition, the
initial implementation of ``sum`` suffered from some rounding issues
and other design problems when dealing with Decimals. Oscar
Benjamin's assistance in resolving this was invaluable.
A number of design issues were resolved during the discussion on Python-Ideas
and the initial code review. There was a lot of concern about the addition of
yet another ``sum`` function to the standard library; see the FAQs below for
more details. In addition, the initial implementation of ``sum`` suffered from
some rounding issues and other design problems when dealing with Decimals.
Oscar Benjamin's assistance in resolving this was invaluable.
Another issue was the handling of data in the form of iterators. The
first implementation of variance silently swapped between a one- and
two-pass algorithm, depending on whether the data was in the form of
an iterator or sequence. This proved to be a design mistake, as the
calculated variance could differ slightly depending on the algorithm
used, and ``variance`` etc. were changed to internally generate a list
and always use the more accurate two-pass implementation.
Another issue was the handling of data in the form of iterators. The first
implementation of variance silently swapped between a one- and two-pass
algorithm, depending on whether the data was in the form of an iterator or
sequence. This proved to be a design mistake, as the calculated variance could
differ slightly depending on the algorithm used, and ``variance`` etc. were
changed to internally generate a list and always use the more accurate two-pass
implementation.
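
The difference between the two approaches can be sketched as below. Both helper functions are hypothetical illustrations, not the module's code: the one-pass version uses the textbook sum-of-squares formula, which is an assumption about what a one-pass algorithm might look like, and the two-pass version copies its input into a list first so that iterators and sequences are treated identically.

```python
def two_pass_variance(data):
    """Sample variance via two passes over a materialised list."""
    values = list(data)          # works for iterators and sequences alike
    n = len(values)
    if n < 2:
        raise ValueError("variance requires at least two data points")
    mean = sum(values) / n                                   # first pass
    return sum((x - mean) ** 2 for x in values) / (n - 1)    # second pass


def naive_one_pass_variance(data):
    """Textbook sum-of-squares formula; prone to cancellation error."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


# Large offset, small spread: the true sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
two_pass = two_pass_variance(iter(data))
one_pass = naive_one_pass_variance(iter(data))
print(two_pass, one_pass)
```

On data like this the one-pass result can drift away from 30 because two very large, nearly equal numbers are subtracted, which is the kind of discrepancy that motivated always using the two-pass form.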
One controversial design involved the functions to calculate median,
which were implemented as attributes on the ``median`` callable, e.g.
``median``, ``median.low``, ``median.high`` etc. Although there is
at least one existing use of this style in the standard library, in
``unittest.mock``, the code reviewers felt that this was too unusual
for the standard library. Consequently, the design has been changed
to a more traditional design of separate functions with a pseudo-
namespace naming convention, ``median_low``, ``median_high``, etc.
One controversial design involved the functions to calculate median, which were
implemented as attributes on the ``median`` callable, e.g. ``median``,
``median.low``, ``median.high`` etc. Although there is at least one existing
use of this style in the standard library, in ``unittest.mock``, the code
reviewers felt that this was too unusual for the standard library.
Consequently, the design has been changed to a more traditional design of
separate functions with a pseudo-namespace naming convention, ``median_low``,
``median_high``, etc.
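
A short sketch of the resulting naming convention, assuming the ``statistics`` module as shipped in Python 3.4 and later:

```python
import statistics

data = [1, 3, 5, 7]          # even number of points, so no single middle value

middle = statistics.median(data)        # 4.0, the average of 3 and 5
low = statistics.median_low(data)       # 3, the smaller of the two middle values
high = statistics.median_high(data)     # 5, the larger of the two middle values
```

Unlike ``median``, the ``_low`` and ``_high`` variants always return a value that actually occurs in the data.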
Another issue that was of concern to code reviewers was the existence
of a function calculating the sample mode of continuous data, with
some people questioning the choice of algorithm, and whether it was
a sufficiently common need to be included. So it was dropped from
the API, and ``mode`` now implements only the basic schoolbook
algorithm based on counting unique values.
Another issue that was of concern to code reviewers was the existence of a
function calculating the sample mode of continuous data, with some people
questioning the choice of algorithm, and whether it was a sufficiently common
need to be included. So it was dropped from the API, and ``mode`` now
implements only the basic schoolbook algorithm based on counting unique values.
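
The schoolbook approach of counting unique values might look roughly like the sketch below. ``schoolbook_mode`` is a hypothetical helper written for illustration, not the module's implementation, and its handling of ties (first value seen wins) is an assumption.

```python
from collections import Counter


def schoolbook_mode(data):
    """Return the most frequent value in data (hypothetical helper)."""
    counts = Counter(data)
    if not counts:
        raise ValueError("no mode for empty data")
    # most_common(1) yields a single (value, count) pair.
    value, _count = counts.most_common(1)[0]
    return value


most_frequent = schoolbook_mode([1, 1, 2, 3, 3, 3, 3, 4])
```

Because it only counts equal values, this approach also works for non-numeric (nominal) data such as strings.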
Another significant point of discussion was calculating statistics of
timedelta objects. Although the statistics module will not directly
support timedelta objects, it is possible to support this use-case by
converting them to numbers first using the ``timedelta.total_seconds``
method.
Another significant point of discussion was calculating statistics of
``timedelta`` objects. Although the statistics module will not directly
support ``timedelta`` objects, it is possible to support this use-case by
converting them to numbers first using the ``timedelta.total_seconds`` method.
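
For example, the conversion described above could be done like this; the sample durations are made up for illustration.

```python
from datetime import timedelta
import statistics

durations = [timedelta(minutes=1), timedelta(minutes=2), timedelta(minutes=6)]

# Convert to plain numbers, compute the statistic, then convert back.
seconds = [d.total_seconds() for d in durations]
mean_seconds = statistics.mean(seconds)
mean_duration = timedelta(seconds=mean_seconds)
```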
Frequently Asked Questions
==========================
Q: Shouldn't this module spend time on PyPI before being considered for
the standard library?
Shouldn't this module spend time on PyPI before being considered for the standard library?
------------------------------------------------------------------------------------------
A: Older versions of this module have been available on PyPI[22] since
2010. Being much simpler than numpy, it does not require many years of
external development.
Older versions of this module have been available on PyPI [22]_ since 2010.
Being much simpler than numpy, it does not require many years of external
development.
Q: Does the standard library really need yet another version of ``sum``?
Does the standard library really need yet another version of ``sum``?
---------------------------------------------------------------------
A: This proved to be the most controversial part of the reference
implementation. In one sense, clearly three sums is two too many. But
in another sense, yes. The reasons why the two existing versions are
unsuitable are described here[23] but the short summary is:
This proved to be the most controversial part of the reference implementation.
In one sense, clearly three sums is two too many. But in another sense, yes.
The reasons why the two existing versions are unsuitable are described
here [23]_ but the short summary is:
- the built-in sum can lose precision with floats;
- the built-in sum can lose precision with floats;
- the built-in sum accepts any non-numeric data type that supports
the + operator, apart from strings and bytes;
- the built-in sum accepts any non-numeric data type that supports the ``+``
operator, apart from strings and bytes;
- math.fsum is high-precision, but coerces all arguments to float.
- ``math.fsum`` is high-precision, but coerces all arguments to float.
There was some interest in "fixing" one or the other of the existing
sums. If this occurs before 3.4 feature-freeze, the decision to keep
statistics.sum can be re-considered.
There was some interest in "fixing" one or the other of the existing sums. If
this occurs before 3.4 feature-freeze, the decision to keep ``statistics.sum``
can be re-considered.
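
The points in the summary above can be illustrated with the two existing sums. This is only a sketch; whether the built-in ``sum`` loses precision on the float example depends on the interpreter version, so no particular result is claimed for that comparison.

```python
import math
from fractions import Fraction

# math.fsum is accurate for floats ...
floats = [0.1] * 10
fsum_total = math.fsum(floats)                 # 1.0
builtin_matches_fsum = sum(floats) == fsum_total   # version dependent

# ... but it coerces every argument to float, discarding exact types.
parts = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
exact_total = sum(parts)                       # stays a Fraction
coerced_total = math.fsum(parts)               # always a float

# The built-in sum also accepts non-numeric data that supports "+".
joined = sum([[1], [2, 3]], [])                # list concatenation
```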
Q: Will this module be backported to older versions of Python?
Will this module be backported to older versions of Python?
-----------------------------------------------------------
A: The module currently targets 3.3, and I will make it available on PyPI
for 3.3 for the foreseeable future. Backporting to older versions of
the 3.x series is likely (but not yet decided). Backporting to 2.7 is
less likely but not ruled out.
The module currently targets 3.3, and I will make it available on PyPI for
3.3 for the foreseeable future. Backporting to older versions of the 3.x
series is likely (but not yet decided). Backporting to 2.7 is less likely but
not ruled out.
Q: Is this supposed to replace numpy?
Is this supposed to replace numpy?
----------------------------------
A: No. While it is likely to grow over the years (see open issues below)
it is not aimed to replace, or even compete directly with, numpy. Numpy
is a full-featured numeric library aimed at professionals, the nuclear
reactor of numeric libraries in the Python ecosystem. This is just a
battery, as in "batteries included", and is aimed at an intermediate
level somewhere between "use numpy" and "roll your own version".
No. While it is likely to grow over the years (see Future Work below), it
does not aim to replace, or even compete directly with, numpy. Numpy is a
full-featured numeric library aimed at professionals, the nuclear reactor of
numeric libraries in the Python ecosystem. This is just a battery, as in
"batteries included", and is aimed at an intermediate level somewhere between
"use numpy" and "roll your own version".
Future Work
===========
- At this stage, I am unsure of the best API for multivariate statistical
functions such as linear regression, correlation coefficient, and
covariance. Possible APIs include:
- At this stage, I am unsure of the best API for multivariate statistical
functions such as linear regression, correlation coefficient, and covariance.
Possible APIs include:
* Separate arguments for x and y data:
function([x0, x1, ...], [y0, y1, ...])
* Separate arguments for x and y data::
* A single argument for (x, y) data:
function([(x0, y0), (x1, y1), ...])
function([x0, x1, ...], [y0, y1, ...])
This API is preferred by GvR[24].
* A single argument for (x, y) data::
* Selecting arbitrary columns from a 2D array:
function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)
function([(x0, y0), (x1, y1), ...])
* Some combination of the above.
This API is preferred by GvR [24]_.
In the absence of a consensus of preferred API for multivariate stats,
I will defer including such multivariate functions until Python 3.5.
* Selecting arbitrary columns from a 2D array::
- Likewise, functions for calculating probability of random variables and
inference testing (e.g. Student's t-test) will be deferred until 3.5.
function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)
- There is considerable interest in including one-pass functions that can
calculate multiple statistics from data in iterator form, without having
to convert to a list. The experimental "stats" package on PyPI includes
co-routine versions of statistics functions. Including these will be
deferred to 3.5.
* Some combination of the above.
In the absence of a consensus on the preferred API for multivariate stats, I will
defer including such multivariate functions until Python 3.5.
- Likewise, functions for calculating probability of random variables and
inference testing (e.g. Student's t-test) will be deferred until 3.5.
- There is considerable interest in including one-pass functions that can
calculate multiple statistics from data in iterator form, without having to
convert to a list. The experimental ``stats`` package on PyPI includes
co-routine versions of statistics functions. Including these will be deferred
to 3.5.
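
To make the candidate call shapes for multivariate functions listed in the first item above more concrete, here is a sketch using sample covariance as the example statistic. All three function names are hypothetical and chosen only for this illustration; no API is being proposed here.

```python
def covariance_xy(xs, ys):
    """Separate arguments for x and y data (hypothetical name)."""
    xs, ys = list(xs), list(ys)
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length inputs with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def covariance_pairs(pairs):
    """A single argument of (x, y) pairs (hypothetical name)."""
    xs, ys = zip(*pairs)
    return covariance_xy(xs, ys)


def covariance_columns(rows, x, y):
    """Select columns x and y from 2D data (hypothetical name)."""
    rows = list(rows)
    return covariance_xy([row[x] for row in rows], [row[y] for row in rows])


by_args = covariance_xy([1, 2, 3], [2, 4, 6])
by_pairs = covariance_pairs([(1, 2), (2, 4), (3, 6)])
by_columns = covariance_columns([[0, 1, 2, 9], [0, 2, 4, 9], [0, 3, 6, 9]], x=1, y=2)
```

The three shapes compute the same value; they differ only in how the caller is expected to arrange the data.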
References
==========
[1] https://mail.python.org/pipermail/python-dev/2010-October/104721.html
.. [1] https://mail.python.org/pipermail/python-dev/2010-October/104721.html
[2] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf
.. [2] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf
[3] Gnumeric:
https://projects.gnome.org/gnumeric/functions.shtml
.. [3] Gnumeric::
https://projects.gnome.org/gnumeric/functions.shtml
LibreOffice:
https://help.libreoffice.org/Calc/Statistical_Functions_Part_One
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five
LibreOffice:
https://help.libreoffice.org/Calc/Statistical_Functions_Part_One
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five
[4] Scipy: http://scipy-central.org/
Numpy: http://www.numpy.org/
.. [4] Scipy: http://scipy-central.org/
Numpy: http://www.numpy.org/
[5] http://wiki.scipy.org/Numpy_Functions_by_Category
.. [5] http://wiki.scipy.org/Numpy_Functions_by_Category
[6] Tested with numpy 1.6.1 and Python 2.7.
.. [6] Tested with numpy 1.6.1 and Python 2.7.
[7] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
.. [7] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
[8] http://rosettacode.org/wiki/Standard_deviation
.. [8] http://rosettacode.org/wiki/Standard_deviation
[9] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py
.. [9] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py
[10] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration
.. [10] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration
[11] http://www.r-project.org/
.. [11] http://www.r-project.org/
[12] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx
.. [12] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx
[13] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation
.. [13] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation
[14] http://ruby-statsample.rubyforge.org/
.. [14] http://ruby-statsample.rubyforge.org/
[15] http://www.php.net/manual/en/ref.stats.php
.. [15] http://www.php.net/manual/en/ref.stats.php
[16] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.
.. [16] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.
[17] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html
.. [17] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html
[18] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html
.. [18] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html
[19] http://mathworld.wolfram.com/Skewness.html
.. [19] http://mathworld.wolfram.com/Skewness.html
[20] At least, tedious to those who don't like this sort of thing.
.. [20] At least, tedious to those who don't like this sort of thing.
[21] https://mail.python.org/pipermail/python-ideas/2011-September/011524.html
.. [21] https://mail.python.org/pipermail/python-ideas/2011-September/011524.html
[22] https://pypi.python.org/pypi/stats/
.. [22] https://pypi.python.org/pypi/stats/
[23] https://mail.python.org/pipermail/python-ideas/2013-August/022630.html
.. [23] https://mail.python.org/pipermail/python-ideas/2013-August/022630.html
[24] https://mail.python.org/pipermail/python-dev/2013-September/128429.html
.. [24] https://mail.python.org/pipermail/python-dev/2013-September/128429.html
Copyright
=========
This document has been placed in the public domain.
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: