reSTify PEP225, PEP234, PEP255, PEP450 (#293)
parent 82ea79a94a
commit ad339c1e27
765
pep-0225.txt
File diff suppressed because it is too large
622
pep-0234.txt
|
@@ -5,198 +5,200 @@ Last-Modified: $Date$
|
|||
Author: ping@zesty.ca (Ka-Ping Yee), guido@python.org (Guido van Rossum)
|
||||
Status: Final
|
||||
Type: Standards Track
|
||||
Content-Type: text/x-rst
|
||||
Created: 30-Jan-2001
|
||||
Python-Version: 2.1
|
||||
Post-History: 30-Apr-2001
|
||||
|
||||
|
||||
Abstract
|
||||
========
|
||||
|
||||
This document proposes an iteration interface that objects can
|
||||
provide to control the behaviour of 'for' loops. Looping is
|
||||
customized by providing a method that produces an iterator object.
|
||||
The iterator provides a 'get next value' operation that produces
|
||||
the next item in the sequence each time it is called, raising an
|
||||
exception when no more items are available.
|
||||
This document proposes an iteration interface that objects can provide to
|
||||
control the behaviour of ``for`` loops. Looping is customized by providing a
|
||||
method that produces an iterator object. The iterator provides a *get next
|
||||
value* operation that produces the next item in the sequence each time it is
|
||||
called, raising an exception when no more items are available.
|
||||
|
||||
In addition, specific iterators over the keys of a dictionary and
|
||||
over the lines of a file are proposed, and a proposal is made to
|
||||
allow spelling dict.has_key(key) as "key in dict".
|
||||
In addition, specific iterators over the keys of a dictionary and over the
|
||||
lines of a file are proposed, and a proposal is made to allow spelling
|
||||
``dict.has_key(key)`` as ``key in dict``.
|
||||
|
||||
Note: this is an almost complete rewrite of this PEP by the second
|
||||
author, describing the actual implementation checked into the
|
||||
trunk of the Python 2.2 CVS tree. It is still open for
|
||||
discussion. Some of the more esoteric proposals in the original
|
||||
version of this PEP have been withdrawn for now; these may be the
|
||||
subject of a separate PEP in the future.
|
||||
Note: this is an almost complete rewrite of this PEP by the second author,
|
||||
describing the actual implementation checked into the trunk of the Python 2.2
|
||||
CVS tree. It is still open for discussion. Some of the more esoteric
|
||||
proposals in the original version of this PEP have been withdrawn for now;
|
||||
these may be the subject of a separate PEP in the future.
|
||||
|
||||
|
||||
C API Specification
|
||||
===================
|
||||
|
||||
A new exception is defined, StopIteration, which can be used to
|
||||
signal the end of an iteration.
|
||||
A new exception is defined, ``StopIteration``, which can be used to signal the
|
||||
end of an iteration.
|
||||
|
||||
A new slot named tp_iter for requesting an iterator is added to
|
||||
the type object structure. This should be a function of one
|
||||
PyObject * argument returning a PyObject *, or NULL. To use this
|
||||
slot, a new C API function PyObject_GetIter() is added, with the
|
||||
same signature as the tp_iter slot function.
|
||||
A new slot named ``tp_iter`` for requesting an iterator is added to the type
|
||||
object structure. This should be a function of one ``PyObject *`` argument
|
||||
returning a ``PyObject *``, or ``NULL``. To use this slot, a new C API
|
||||
function ``PyObject_GetIter()`` is added, with the same signature as the
|
||||
``tp_iter`` slot function.
|
||||
|
||||
Another new slot, named tp_iternext, is added to the type
|
||||
structure, for obtaining the next value in the iteration. To use
|
||||
this slot, a new C API function PyIter_Next() is added. The
|
||||
signature for both the slot and the API function is as follows,
|
||||
although the NULL return conditions differ: the argument is a
|
||||
PyObject * and so is the return value. When the return value is
|
||||
non-NULL, it is the next value in the iteration. When it is NULL,
|
||||
then for the tp_iternext slot there are three possibilities:
|
||||
Another new slot, named ``tp_iternext``, is added to the type structure, for
|
||||
obtaining the next value in the iteration. To use this slot, a new C API
|
||||
function ``PyIter_Next()`` is added. The signature for both the slot and the
|
||||
API function is as follows, although the ``NULL`` return conditions differ:
|
||||
the argument is a ``PyObject *`` and so is the return value. When the return
|
||||
value is non-``NULL``, it is the next value in the iteration. When it is
|
||||
``NULL``, then for the ``tp_iternext`` slot there are three possibilities:
|
||||
|
||||
- No exception is set; this implies the end of the iteration.
|
||||
- No exception is set; this implies the end of the iteration.
|
||||
|
||||
- The StopIteration exception (or a derived exception class) is
|
||||
set; this implies the end of the iteration.
|
||||
- The ``StopIteration`` exception (or a derived exception class) is set; this
|
||||
implies the end of the iteration.
|
||||
|
||||
- Some other exception is set; this means that an error occurred
|
||||
that should be propagated normally.
|
||||
- Some other exception is set; this means that an error occurred that should be
|
||||
propagated normally.
|
||||
|
||||
The higher-level PyIter_Next() function clears the StopIteration
|
||||
exception (or derived exception) when it occurs, so its NULL return
|
||||
conditions are simpler:
|
||||
The higher-level ``PyIter_Next()`` function clears the ``StopIteration``
|
||||
exception (or derived exception) when it occurs, so its ``NULL`` return
|
||||
conditions are simpler:
|
||||
|
||||
- No exception is set; this means iteration has ended.
|
||||
- No exception is set; this means iteration has ended.
|
||||
|
||||
- Some exception is set; this means an error occurred, and should
|
||||
be propagated normally.
|
||||
- Some exception is set; this means an error occurred, and should be propagated
|
||||
normally.
|
||||
|
||||
Iterators implemented in C should *not* implement a next() method
|
||||
with similar semantics as the tp_iternext slot! When the type's
|
||||
dictionary is initialized (by PyType_Ready()), the presence of a
|
||||
tp_iternext slot causes a method next() wrapping that slot to be
|
||||
added to the type's tp_dict. (Exception: if the type doesn't use
|
||||
PyObject_GenericGetAttr() to access instance attributes, the
|
||||
next() method in the type's tp_dict may not be seen.) (Due to a
|
||||
misunderstanding in the original text of this PEP, in Python 2.2,
|
||||
all iterator types implemented a next() method that was overridden
|
||||
by the wrapper; this has been fixed in Python 2.3.)
|
||||
Iterators implemented in C should *not* implement a ``next()`` method with
|
||||
similar semantics as the ``tp_iternext`` slot! When the type's dictionary is
|
||||
initialized (by ``PyType_Ready()``), the presence of a ``tp_iternext`` slot
|
||||
causes a method ``next()`` wrapping that slot to be added to the type's
|
||||
``tp_dict``. (Exception: if the type doesn't use ``PyObject_GenericGetAttr()``
|
||||
to access instance attributes, the ``next()`` method in the type's ``tp_dict``
|
||||
may not be seen.) (Due to a misunderstanding in the original text of this PEP,
|
||||
in Python 2.2, all iterator types implemented a ``next()`` method that was
|
||||
overridden by the wrapper; this has been fixed in Python 2.3.)
|
||||
|
||||
To ensure binary backwards compatibility, a new flag
|
||||
Py_TPFLAGS_HAVE_ITER is added to the set of flags in the tp_flags
|
||||
field, and to the default flags macro. This flag must be tested
|
||||
before accessing the tp_iter or tp_iternext slots. The macro
|
||||
PyIter_Check() tests whether an object has the appropriate flag
|
||||
set and has a non-NULL tp_iternext slot. There is no such macro
|
||||
for the tp_iter slot (since the only place where this slot is
|
||||
referenced should be PyObject_GetIter(), and this can check for
|
||||
the Py_TPFLAGS_HAVE_ITER flag directly).
|
||||
To ensure binary backwards compatibility, a new flag ``Py_TPFLAGS_HAVE_ITER``
|
||||
is added to the set of flags in the ``tp_flags`` field, and to the default
|
||||
flags macro. This flag must be tested before accessing the ``tp_iter`` or
|
||||
``tp_iternext`` slots. The macro ``PyIter_Check()`` tests whether an object
|
||||
has the appropriate flag set and has a non-``NULL`` ``tp_iternext`` slot.
|
||||
There is no such macro for the ``tp_iter`` slot (since the only place where
|
||||
this slot is referenced should be ``PyObject_GetIter()``, and this can check
|
||||
for the ``Py_TPFLAGS_HAVE_ITER`` flag directly).
|
||||
|
||||
(Note: the tp_iter slot can be present on any object; the
|
||||
tp_iternext slot should only be present on objects that act as
|
||||
iterators.)
|
||||
(Note: the ``tp_iter`` slot can be present on any object; the ``tp_iternext``
|
||||
slot should only be present on objects that act as iterators.)
|
||||
|
||||
For backwards compatibility, the PyObject_GetIter() function
|
||||
implements fallback semantics when its argument is a sequence that
|
||||
does not implement a tp_iter function: a lightweight sequence
|
||||
iterator object is constructed in that case which iterates over
|
||||
the items of the sequence in the natural order.
|
||||
For backwards compatibility, the ``PyObject_GetIter()`` function implements
|
||||
fallback semantics when its argument is a sequence that does not implement a
|
||||
``tp_iter`` function: a lightweight sequence iterator object is constructed in
|
||||
that case which iterates over the items of the sequence in the natural order.
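The fallback for sequences can be observed from Python. The following is a minimal sketch, assuming a current Python 3 interpreter; the ``Squares`` class is hypothetical and defines only ``__getitem__()``, so ``iter()`` has to construct the lightweight sequence iterator, which stops when the sequence raises ``IndexError``.

```python
class Squares:
    """Hypothetical old-style sequence: defines __getitem__ but no __iter__."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        # Raising IndexError past the end is what ends the fallback iteration.
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index * index


squares = Squares(4)
it = iter(squares)      # no __iter__ available, so a sequence iterator is built
collected = list(it)    # items are requested as squares[0], squares[1], ...
```

The items come back in index order, which is the "natural order" referred to above.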
|
||||
|
||||
The Python bytecode generated for 'for' loops is changed to use
|
||||
new opcodes, GET_ITER and FOR_ITER, that use the iterator protocol
|
||||
rather than the sequence protocol to get the next value for the
|
||||
loop variable. This makes it possible to use a 'for' loop to loop
|
||||
over non-sequence objects that support the tp_iter slot. Other
|
||||
places where the interpreter loops over the values of a sequence
|
||||
should also be changed to use iterators.
|
||||
The Python bytecode generated for ``for`` loops is changed to use new opcodes,
|
||||
``GET_ITER`` and ``FOR_ITER``, that use the iterator protocol rather than the
|
||||
sequence protocol to get the next value for the loop variable. This makes it
|
||||
possible to use a ``for`` loop to loop over non-sequence objects that support
|
||||
the ``tp_iter`` slot. Other places where the interpreter loops over the values
|
||||
of a sequence should also be changed to use iterators.
|
||||
|
||||
Iterators ought to implement the tp_iter slot as returning a
|
||||
reference to themselves; this is needed to make it possible to
|
||||
use an iterator (as opposed to a sequence) in a for loop.
|
||||
Iterators ought to implement the ``tp_iter`` slot as returning a reference to
|
||||
themselves; this is needed to make it possible to use an iterator (as opposed
|
||||
to a sequence) in a ``for`` loop.
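What a ``for`` loop does with the iterator protocol can be approximated in Python itself. This is only a sketch, assuming a current Python 3 interpreter where the built-in ``next(it)`` is used in place of the ``it.next()`` method described in this PEP; the helper name ``for_each`` is hypothetical.

```python
def for_each(iterable, body):
    """Roughly what ``for value in iterable: body(value)`` does."""
    it = iter(iterable)          # corresponds to the GET_ITER step
    while True:
        try:
            value = next(it)     # corresponds to the FOR_ITER step
        except StopIteration:
            break                # end of iteration, not an error
        body(value)


seen = []
for_each("abc", seen.append)

# An iterator returns itself from iter(), so it can be used in a loop directly.
letters = iter("xyz")
same_object = iter(letters) is letters
```

Because ``iter(letters)`` hands back ``letters`` itself, passing an iterator to ``for_each()`` (or to a real ``for`` loop) simply continues from wherever that iterator currently stands.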
|
||||
|
||||
Iterator implementations (in C or in Python) should guarantee that
|
||||
once the iterator has signalled its exhaustion, subsequent calls
|
||||
to tp_iternext or to the next() method will continue to do so. It
|
||||
is not specified whether an iterator should enter the exhausted
|
||||
state when an exception (other than StopIteration) is raised.
|
||||
Note that Python cannot guarantee that user-defined or 3rd party
|
||||
iterators implement this requirement correctly.
|
||||
Iterator implementations (in C or in Python) should guarantee that once the
|
||||
iterator has signalled its exhaustion, subsequent calls to ``tp_iternext`` or
|
||||
to the ``next()`` method will continue to do so. It is not specified whether
|
||||
an iterator should enter the exhausted state when an exception (other than
|
||||
``StopIteration``) is raised. Note that Python cannot guarantee that
|
||||
user-defined or 3rd party iterators implement this requirement correctly.
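The exhaustion guarantee can be checked for a built-in iterator with a few lines of Python. This sketch assumes a current Python 3 interpreter and uses the built-in ``next()`` rather than the ``next()`` method.

```python
it = iter([1, 2])
values = [next(it), next(it)]    # consume both items

# Every further request should signal exhaustion again.
stops = 0
for _ in range(3):
    try:
        next(it)
    except StopIteration:
        stops += 1
```

A user-defined iterator would need to be written so that it passes the same check; nothing in the interpreter enforces it.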
|
||||
|
||||
|
||||
Python API Specification
|
||||
========================
|
||||
|
||||
The StopIteration exception is made visible as one of the
|
||||
standard exceptions. It is derived from Exception.
|
||||
The ``StopIteration`` exception is made visible as one of the standard
|
||||
exceptions. It is derived from ``Exception``.
|
||||
|
||||
A new built-in function is defined, iter(), which can be called in
|
||||
two ways:
|
||||
A new built-in function is defined, ``iter()``, which can be called in two
|
||||
ways:
|
||||
|
||||
- iter(obj) calls PyObject_GetIter(obj).
|
||||
- ``iter(obj)`` calls ``PyObject_GetIter(obj)``.
|
||||
|
||||
- iter(callable, sentinel) returns a special kind of iterator that
|
||||
calls the callable to produce a new value, and compares the
|
||||
return value to the sentinel value. If the return value equals
|
||||
the sentinel, this signals the end of the iteration and
|
||||
StopIteration is raised rather than returning normally; if the
|
||||
return value does not equal the sentinel, it is returned as the
|
||||
next value from the iterator. If the callable raises an
|
||||
exception, this is propagated normally; in particular, the
|
||||
function is allowed to raise StopIteration as an alternative way
|
||||
to end the iteration. (This functionality is available from the
|
||||
C API as PyCallIter_New(callable, sentinel).)
|
||||
- ``iter(callable, sentinel)`` returns a special kind of iterator that calls
|
||||
the callable to produce a new value, and compares the return value to the
|
||||
sentinel value. If the return value equals the sentinel, this signals the
|
||||
end of the iteration and ``StopIteration`` is raised rather than returning
|
||||
normally; if the return value does not equal the sentinel, it is returned as
|
||||
the next value from the iterator. If the callable raises an exception, this
|
||||
is propagated normally; in particular, the function is allowed to raise
|
||||
``StopIteration`` as an alternative way to end the iteration. (This
|
||||
functionality is available from the C API as
|
||||
``PyCallIter_New(callable, sentinel)``.)
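The two-argument form can be illustrated as follows. This is a small sketch, assuming a current Python 3 interpreter; ``make_countdown`` and ``stop_early`` are hypothetical helpers, not part of the proposal.

```python
def make_countdown(start):
    """Hypothetical helper: returns a zero-argument callable that counts down."""
    state = {"n": start}

    def step():
        state["n"] -= 1
        return state["n"]

    return step


# The callable returns 2, 1, 0; iteration ends when the sentinel 0 is returned,
# and the sentinel itself is not produced as a value.
countdown = list(iter(make_countdown(3), 0))


def stop_early():
    # A callable may also end the iteration by raising StopIteration itself.
    raise StopIteration


ended_by_exception = list(iter(stop_early, None))
```

The second case never compares anything with the sentinel; the exception alone ends the iteration.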
|
||||
|
||||
Iterator objects returned by either form of iter() have a next()
|
||||
method. This method either returns the next value in the
|
||||
iteration, or raises StopIteration (or a derived exception class)
|
||||
to signal the end of the iteration. Any other exception should be
|
||||
considered to signify an error and should be propagated normally,
|
||||
not taken to mean the end of the iteration.
|
||||
Iterator objects returned by either form of ``iter()`` have a ``next()``
|
||||
method. This method either returns the next value in the iteration, or raises
|
||||
``StopIteration`` (or a derived exception class) to signal the end of the
|
||||
iteration. Any other exception should be considered to signify an error and
|
||||
should be propagated normally, not taken to mean the end of the iteration.
|
||||
|
||||
Classes can define how they are iterated over by defining an
|
||||
__iter__() method; this should take no additional arguments and
|
||||
return a valid iterator object. A class that wants to be an
|
||||
iterator should implement two methods: a next() method that behaves
|
||||
as described above, and an __iter__() method that returns self.
|
||||
Classes can define how they are iterated over by defining an ``__iter__()``
|
||||
method; this should take no additional arguments and return a valid iterator
|
||||
object. A class that wants to be an iterator should implement two methods: a
|
||||
``next()`` method that behaves as described above, and an ``__iter__()`` method
|
||||
that returns ``self``.
|
||||
|
||||
The two methods correspond to two distinct protocols:
|
||||
The two methods correspond to two distinct protocols:
|
||||
|
||||
1. An object can be iterated over with "for" if it implements
|
||||
__iter__() or __getitem__().
|
||||
1. An object can be iterated over with ``for`` if it implements ``__iter__()``
|
||||
or ``__getitem__()``.
|
||||
|
||||
2. An object can function as an iterator if it implements next().
|
||||
2. An object can function as an iterator if it implements ``next()``.
|
||||
|
||||
Container-like objects usually support protocol 1. Iterators are
|
||||
currently required to support both protocols. The semantics of
|
||||
iteration come only from protocol 2; protocol 1 is present to make
|
||||
iterators behave like sequences; in particular so that code
|
||||
receiving an iterator can use a for-loop over the iterator.
|
||||
Container-like objects usually support protocol 1. Iterators are currently
|
||||
required to support both protocols. The semantics of iteration come only from
|
||||
protocol 2; protocol 1 is present to make iterators behave like sequences; in
|
||||
particular so that code receiving an iterator can use a for-loop over the
|
||||
iterator.
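The two protocols can be shown side by side. This is a sketch under one loud assumption: it targets a current Python 3 interpreter, where the method this PEP calls ``next()`` is spelled ``__next__()``. The class names ``Countdown`` and ``Launch`` are hypothetical.

```python
class Countdown:
    """Hypothetical iterator: supports both protocols."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # Protocol 1: an iterator returns itself.
        return self

    def __next__(self):
        # Protocol 2: spelled next() in the Python 2.2 API described here.
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


class Launch:
    """Hypothetical container: supports protocol 1 only."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # A fresh iterator per call, so the container can be looped over again.
        return Countdown(self.start)


launch = Launch(3)
first_pass = list(launch)
second_pass = list(launch)

once = Countdown(2)
drained = (list(once), list(once))   # the second pass finds nothing left
```

The container can be traversed repeatedly because each loop gets its own iterator, while the iterator itself is used up after one pass.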
|
||||
|
||||
|
||||
Dictionary Iterators
|
||||
====================
|
||||
|
||||
- Dictionaries implement a sq_contains slot that implements the
|
||||
same test as the has_key() method. This means that we can write
|
||||
- Dictionaries implement a ``sq_contains`` slot that implements the same test
|
||||
as the ``has_key()`` method. This means that we can write
|
||||
|
||||
::
|
||||
|
||||
if k in dict: ...
|
||||
|
||||
which is equivalent to
|
||||
|
||||
::
|
||||
|
||||
if dict.has_key(k): ...
|
||||
|
||||
- Dictionaries implement a tp_iter slot that returns an efficient
|
||||
iterator that iterates over the keys of the dictionary. During
|
||||
such an iteration, the dictionary should not be modified, except
|
||||
that setting the value for an existing key is allowed (deletions
|
||||
or additions are not, nor is the update() method). This means
|
||||
that we can write
|
||||
- Dictionaries implement a ``tp_iter`` slot that returns an efficient iterator
|
||||
that iterates over the keys of the dictionary. During such an iteration, the
|
||||
dictionary should not be modified, except that setting the value for an
|
||||
existing key is allowed (deletions or additions are not, nor is the
|
||||
``update()`` method). This means that we can write
|
||||
|
||||
::
|
||||
|
||||
for k in dict: ...
|
||||
|
||||
which is equivalent to, but much faster than
|
||||
|
||||
::
|
||||
|
||||
for k in dict.keys(): ...
|
||||
|
||||
as long as the restriction on modifications to the dictionary
|
||||
(either by the loop or by another thread) is not violated.
|
||||
as long as the restriction on modifications to the dictionary (either by the
|
||||
loop or by another thread) is not violated.
|
||||
|
||||
- Add methods to dictionaries that return different kinds of
|
||||
iterators explicitly:
|
||||
- Add methods to dictionaries that return different kinds of iterators
|
||||
explicitly::
|
||||
|
||||
for key in dict.iterkeys(): ...
|
||||
|
||||
|
@@ -204,50 +206,56 @@ Dictionary Iterators
|
|||
|
||||
for key, value in dict.iteritems(): ...
|
||||
|
||||
This means that "for x in dict" is shorthand for "for x in
|
||||
dict.iterkeys()".
|
||||
This means that ``for x in dict`` is shorthand for
|
||||
``for x in dict.iterkeys()``.
|
||||
|
||||
Other mappings, if they support iterators at all, should also
|
||||
iterate over the keys. However, this should not be taken as an
|
||||
absolute rule; specific applications may have different
|
||||
requirements.
|
||||
Other mappings, if they support iterators at all, should also iterate over the
|
||||
keys. However, this should not be taken as an absolute rule; specific
|
||||
applications may have different requirements.
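The dictionary behaviour described in this section can be tried out as follows. This is a sketch assuming a current CPython 3 interpreter, where ``has_key()`` and ``iterkeys()`` no longer exist and plain ``in`` and ``for`` are used instead. The ``RuntimeError`` on insertion is what current CPython happens to raise; this PEP only says the dictionary should not be modified.

```python
ages = {"ada": 36, "alan": 41}

has_ada = "ada" in ages            # membership test on the keys
keys = [name for name in ages]     # iterating a dict yields its keys

# Setting the value for an existing key during iteration is allowed.
for name in ages:
    ages[name] += 1

# Adding a key during iteration is not; current CPython detects the size change.
try:
    for name in ages:
        ages[name + "_copy"] = 0
    error = None
except RuntimeError as exc:
    error = exc
```

Code that really needs to add or delete keys while looping can iterate over a snapshot such as ``list(ages)`` instead.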
|
||||
|
||||
|
||||
File Iterators
|
||||
==============
|
||||
|
||||
The following proposal is useful because it provides us with a
|
||||
good answer to the complaint that the common idiom to iterate over
|
||||
the lines of a file is ugly and slow.
|
||||
The following proposal is useful because it provides us with a good answer to
|
||||
the complaint that the common idiom to iterate over the lines of a file is ugly
|
||||
and slow.
|
||||
|
||||
- Files implement a tp_iter slot that is equivalent to
|
||||
iter(f.readline, ""). This means that we can write
|
||||
- Files implement a ``tp_iter`` slot that is equivalent to
|
||||
``iter(f.readline, "")``. This means that we can write
|
||||
|
||||
::
|
||||
|
||||
for line in file:
|
||||
...
|
||||
|
||||
as a shorthand for
|
||||
|
||||
::
|
||||
|
||||
for line in iter(file.readline, ""):
|
||||
...
|
||||
|
||||
which is equivalent to, but faster than
|
||||
|
||||
::
|
||||
|
||||
while 1:
|
||||
line = file.readline()
|
||||
if not line:
|
||||
break
|
||||
...
|
||||
|
||||
This also shows that some iterators are destructive: they consume
|
||||
all the values and a second iterator cannot easily be created that
|
||||
iterates independently over the same values. You could open the
|
||||
file for a second time, or seek() to the beginning, but these
|
||||
solutions don't work for all file types, e.g. they don't work when
|
||||
the open file object really represents a pipe or a stream socket.
|
||||
This also shows that some iterators are destructive: they consume all the
|
||||
values and a second iterator cannot easily be created that iterates
|
||||
independently over the same values. You could open the file for a second time,
|
||||
or ``seek()`` to the beginning, but these solutions don't work for all file
|
||||
types, e.g. they don't work when the open file object really represents a pipe
|
||||
or a stream socket.
|
||||
|
||||
Because the file iterator uses an internal buffer, mixing this
|
||||
with other file operations (e.g. file.readline()) doesn't work
|
||||
right. Also, the following code:
|
||||
Because the file iterator uses an internal buffer, mixing this with other file
|
||||
operations (e.g. ``file.readline()``) doesn't work right. Also, the following
|
||||
code::
|
||||
|
||||
for line in file:
|
||||
if line == "\n":
|
||||
|
@@ -255,9 +263,9 @@ File Iterators
|
|||
for line in file:
|
||||
print line,
|
||||
|
||||
doesn't work as you might expect, because the iterator created by
|
||||
the second for-loop doesn't take the buffer read-ahead by the
|
||||
first for-loop into account. A correct way to write this is:
|
||||
doesn't work as you might expect, because the iterator created by the second
|
||||
for-loop doesn't take the buffer read-ahead by the first for-loop into account.
|
||||
A correct way to write this is::
|
||||
|
||||
it = iter(file)
|
||||
for line in it:
|
||||
|
@@ -266,228 +274,220 @@ File Iterators
|
|||
for line in it:
|
||||
print line,
|
||||
|
||||
(The rationale for these restrictions is that "for line in file"
|
||||
ought to become the recommended, standard way to iterate over the
|
||||
lines of a file, and this should be as fast as can be. The
|
||||
iterator version is considerably faster than calling readline(),
|
||||
due to the internal buffer in the iterator.)
|
||||
(The rationale for these restrictions is that ``for line in file`` ought to
|
||||
become the recommended, standard way to iterate over the lines of a file, and
|
||||
this should be as fast as can be. The iterator version is considerably faster
|
||||
than calling ``readline()``, due to the internal buffer in the iterator.)
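The file idioms from this section can be exercised without touching the filesystem. In this sketch an in-memory ``io.StringIO`` object stands in for a real file, and a current Python 3 interpreter is assumed; the sample text and variable names are hypothetical.

```python
import io

text = "header\n\nbody 1\nbody 2\n"

# Looping over the file object yields the same lines as iter(readline, "").
direct = list(io.StringIO(text))
via_sentinel = list(iter(io.StringIO(text).readline, ""))

# Sharing one explicit iterator between two loops: the second loop
# continues exactly where the first one stopped.
stream = io.StringIO(text)
it = iter(stream)
header = []
for line in it:
    if line == "\n":
        break                     # the blank separator line is consumed here
    header.append(line)
body = [line for line in it]
```

Because both loops draw from the same ``it``, no line is read twice and none is skipped.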
|
||||
|
||||
|
||||
Rationale
|
||||
=========
|
||||
|
||||
If all the parts of the proposal are included, this addresses many
|
||||
concerns in a consistent and flexible fashion. Among its chief
|
||||
virtues are the following four -- no, five -- no, six -- points:
|
||||
If all the parts of the proposal are included, this addresses many concerns in
|
||||
a consistent and flexible fashion. Among its chief virtues are the following
|
||||
four -- no, five -- no, six -- points:
|
||||
|
||||
1. It provides an extensible iterator interface.
|
||||
1. It provides an extensible iterator interface.
|
||||
|
||||
2. It allows performance enhancements to list iteration.
|
||||
2. It allows performance enhancements to list iteration.
|
||||
|
||||
3. It allows big performance enhancements to dictionary iteration.
|
||||
3. It allows big performance enhancements to dictionary iteration.
|
||||
|
||||
4. It allows one to provide an interface for just iteration
|
||||
without pretending to provide random access to elements.
|
||||
4. It allows one to provide an interface for just iteration without pretending
|
||||
to provide random access to elements.
|
||||
|
||||
5. It is backward-compatible with all existing user-defined
|
||||
classes and extension objects that emulate sequences and
|
||||
mappings, even mappings that only implement a subset of
|
||||
{__getitem__, keys, values, items}.
|
||||
5. It is backward-compatible with all existing user-defined classes and
|
||||
extension objects that emulate sequences and mappings, even mappings that
|
||||
only implement a subset of {``__getitem__``, ``keys``, ``values``,
|
||||
``items``}.
|
||||
|
||||
6. It makes code iterating over non-sequence collections more
|
||||
concise and readable.
|
||||
6. It makes code iterating over non-sequence collections more concise and
|
||||
readable.
|
||||
|
||||
|
||||
Resolved Issues
|
||||
===============
|
||||
|
||||
The following topics have been decided by consensus or BDFL
|
||||
pronouncement.
|
||||
The following topics have been decided by consensus or BDFL pronouncement.
|
||||
|
||||
- Two alternative spellings for next() have been proposed but
|
||||
rejected: __next__(), because it corresponds to a type object
|
||||
slot (tp_iternext); and __call__(), because this is the only
|
||||
operation.
|
||||
- Two alternative spellings for ``next()`` have been proposed but rejected:
|
||||
``__next__()``, because it corresponds to a type object slot
|
||||
(``tp_iternext``); and ``__call__()``, because this is the only operation.
|
||||
|
||||
Arguments against __next__(): while many iterators are used in
|
||||
for loops, it is expected that user code will also call next()
|
||||
directly, so having to write __next__() is ugly; also, a
|
||||
possible extension of the protocol would be to allow for prev(),
|
||||
current() and reset() operations; surely we don't want to use
|
||||
__prev__(), __current__(), __reset__().
|
||||
Arguments against ``__next__()``: while many iterators are used in for loops,
|
||||
it is expected that user code will also call ``next()`` directly, so having
|
||||
to write ``__next__()`` is ugly; also, a possible extension of the protocol
|
||||
would be to allow for ``prev()``, ``current()`` and ``reset()`` operations;
|
||||
surely we don't want to use ``__prev__()``, ``__current__()``,
|
||||
``__reset__()``.
|
||||
|
||||
Arguments against __call__() (the original proposal): taken out
|
||||
of context, x() is not very readable, while x.next() is clear;
|
||||
there's a danger that every special-purpose object wants to use
|
||||
__call__() for its most common operation, causing more confusion
|
||||
than clarity.
|
||||
Arguments against ``__call__()`` (the original proposal): taken out of
|
||||
context, ``x()`` is not very readable, while ``x.next()`` is clear; there's a
|
||||
danger that every special-purpose object wants to use ``__call__()`` for its
|
||||
most common operation, causing more confusion than clarity.
|
||||
|
||||
(In retrospect, it might have been better to go for __next__()
|
||||
and have a new built-in, next(it), which calls it.__next__().
|
||||
But alas, it's too late; this has been deployed in Python 2.2
|
||||
since December 2001.)
|
||||
(In retrospect, it might have been better to go for ``__next__()`` and have a
|
||||
new built-in, ``next(it)``, which calls ``it.__next__()``. But alas, it's too
|
||||
late; this has been deployed in Python 2.2 since December 2001.)
|
||||
|
||||
- Some folks have requested the ability to restart an iterator.
|
||||
This should be dealt with by calling iter() on a sequence
|
||||
repeatedly, not by the iterator protocol itself. (See also
|
||||
requested extensions below.)
|
||||
- Some folks have requested the ability to restart an iterator. This should be
|
||||
dealt with by calling ``iter()`` on a sequence repeatedly, not by the
|
||||
iterator protocol itself. (See also requested extensions below.)
|
||||
|
||||
- It has been questioned whether an exception to signal the end of
|
||||
the iteration isn't too expensive. Several alternatives for the
|
||||
StopIteration exception have been proposed: a special value End
|
||||
to signal the end, a function end() to test whether the iterator
|
||||
is finished, even reusing the IndexError exception.
|
||||
- It has been questioned whether an exception to signal the end of the
|
||||
iteration isn't too expensive. Several alternatives for the
|
||||
``StopIteration`` exception have been proposed: a special value ``End`` to
|
||||
signal the end, a function ``end()`` to test whether the iterator is
|
||||
finished, even reusing the ``IndexError`` exception.
|
||||
|
||||
- A special value has the problem that if a sequence ever
|
||||
contains that special value, a loop over that sequence will
|
||||
end prematurely without any warning. If the experience with
|
||||
null-terminated C strings hasn't taught us the problems this
|
||||
can cause, imagine the trouble a Python introspection tool
|
||||
would have iterating over a list of all built-in names,
|
||||
assuming that the special End value was a built-in name!
|
||||
- A special value has the problem that if a sequence ever contains that
|
||||
special value, a loop over that sequence will end prematurely without any
|
||||
warning. If the experience with null-terminated C strings hasn't taught us
|
||||
the problems this can cause, imagine the trouble a Python introspection
|
||||
tool would have iterating over a list of all built-in names, assuming that
|
||||
the special ``End`` value was a built-in name!
|
||||
|
||||
- Calling an end() function would require two calls per
|
||||
iteration. Two calls is much more expensive than one call
|
||||
plus a test for an exception. Especially the time-critical
|
||||
for loop can test very cheaply for an exception.
|
||||
- Calling an ``end()`` function would require two calls per iteration. Two
|
||||
calls is much more expensive than one call plus a test for an exception.
|
||||
Especially the time-critical for loop can test very cheaply for an
|
||||
exception.
|
||||
|
||||
- Reusing IndexError can cause confusion because it can be a
|
||||
genuine error, which would be masked by ending the loop
|
||||
prematurely.
|
||||
- Reusing ``IndexError`` can cause confusion because it can be a genuine
|
||||
error, which would be masked by ending the loop prematurely.
|
||||
|
||||
- Some have asked for a standard iterator type. Presumably all
|
||||
iterators would have to be derived from this type. But this is
|
||||
not the Python way: dictionaries are mappings because they
|
||||
support __getitem__() and a handful of other operations, not
|
||||
because they are derived from an abstract mapping type.
|
||||
- Some have asked for a standard iterator type. Presumably all iterators would
|
||||
have to be derived from this type. But this is not the Python way:
|
||||
dictionaries are mappings because they support ``__getitem__()`` and a
|
||||
handful of other operations, not because they are derived from an abstract
|
||||
mapping type.
|
||||
|
||||
- Regarding "if key in dict": there is no doubt that the
|
||||
dict.has_key(x) interpretation of "x in dict" is by far the
|
||||
most useful interpretation, probably the only useful one. There
|
||||
has been resistance against this because "x in list" checks
|
||||
whether x is present among the values, while the proposal makes
|
||||
"x in dict" check whether x is present among the keys. Given
|
||||
that the symmetry between lists and dictionaries is very weak,
|
||||
this argument does not have much weight.
|
||||
- Regarding ``if key in dict``: there is no doubt that the ``dict.has_key(x)``
|
||||
interpretation of ``x in dict`` is by far the most useful interpretation,
|
||||
probably the only useful one. There has been resistance against this because
|
||||
``x in list`` checks whether *x* is present among the values, while the
|
||||
proposal makes ``x in dict`` check whether *x* is present among the keys.
|
||||
Given that the symmetry between lists and dictionaries is very weak, this
|
||||
argument does not have much weight.
|
||||
|
||||
- The name iter() is an abbreviation. Alternatives proposed
|
||||
include iterate(), traverse(), but these appear too long.
|
||||
Python has a history of using abbrs for common builtins,
|
||||
e.g. repr(), str(), len().
|
||||
- The name ``iter()`` is an abbreviation. Alternatives proposed include
|
||||
``iterate()``, ``traverse()``, but these appear too long. Python has a
|
||||
history of using abbrs for common builtins, e.g. ``repr()``, ``str()``,
|
||||
``len()``.
|
||||
|
||||
Resolution: iter() it is.
|
||||
Resolution: ``iter()`` it is.
|
||||
|
||||
- Using the same name for two different operations (getting an
|
||||
iterator from an object and making an iterator for a function
|
||||
with a sentinel value) is somewhat ugly. I haven't seen a
|
||||
better name for the second operation though, and since they both
|
||||
return an iterator, it's easy to remember.
|
||||
- Using the same name for two different operations (getting an iterator from an
|
||||
object and making an iterator for a function with a sentinel value) is
|
||||
somewhat ugly. I haven't seen a better name for the second operation though,
|
||||
and since they both return an iterator, it's easy to remember.
|
||||
|
||||
Resolution: the builtin iter() takes an optional argument, which
|
||||
is the sentinel to look for.
|
||||
Resolution: the builtin ``iter()`` takes an optional argument, which is the
|
||||
sentinel to look for.
|
||||
|
||||
- Once a particular iterator object has raised StopIteration, will
|
||||
it also raise StopIteration on all subsequent next() calls?
|
||||
Some say that it would be useful to require this, others say
|
||||
that it is useful to leave this open to individual iterators.
|
||||
Note that this may require an additional state bit for some
|
||||
iterator implementations (e.g. function-wrapping iterators).
|
||||
- Once a particular iterator object has raised ``StopIteration``, will it also
|
||||
raise ``StopIteration`` on all subsequent ``next()`` calls? Some say that it
|
||||
would be useful to require this, others say that it is useful to leave this
|
||||
open to individual iterators. Note that this may require an additional state
|
||||
bit for some iterator implementations (e.g. function-wrapping iterators).
|
||||
|
||||
Resolution: once StopIteration is raised, calling it.next()
|
||||
continues to raise StopIteration.
|
||||
Resolution: once ``StopIteration`` is raised, calling ``it.next()`` continues
|
||||
to raise ``StopIteration``.
|
||||
|
||||
Note: this was in fact not implemented in Python 2.2; there are
|
||||
many cases where an iterator's next() method can raise
|
||||
StopIteration on one call but not on the next. This has been
|
||||
remedied in Python 2.3.
|
||||
Note: this was in fact not implemented in Python 2.2; there are many cases
|
||||
where an iterator's ``next()`` method can raise ``StopIteration`` on one call
|
||||
but not on the next. This has been remedied in Python 2.3.
|
||||
|
||||
- It has been proposed that a file object should be its own
|
||||
iterator, with a next() method returning the next line. This
|
||||
has certain advantages, and makes it even clearer that this
|
||||
iterator is destructive. The disadvantage is that this would
|
||||
make it even more painful to implement the "sticky
|
||||
- It has been proposed that a file object should be its own iterator, with a
|
||||
``next()`` method returning the next line. This has certain advantages, and
|
||||
makes it even clearer that this iterator is destructive. The disadvantage is
|
||||
that this would make it even more painful to implement the "sticky
|
||||
StopIteration" feature proposed in the previous bullet.
|
||||
|
||||
Resolution: tentatively rejected (though there are still people
|
||||
arguing for this).
|
||||
Resolution: tentatively rejected (though there are still people arguing for
|
||||
this).
|
||||
|
||||
- Some folks have requested extensions of the iterator protocol,
|
||||
e.g. prev() to get the previous item, current() to get the
|
||||
current item again, finished() to test whether the iterator is
|
||||
finished, and maybe even others, like rewind(), __len__(),
|
||||
position().
|
||||
- Some folks have requested extensions of the iterator protocol, e.g.
|
||||
``prev()`` to get the previous item, ``current()`` to get the current item
|
||||
again, ``finished()`` to test whether the iterator is finished, and maybe
|
||||
even others, like ``rewind()``, ``__len__()``, ``position()``.
|
||||
|
||||
While some of these are useful, many of these cannot easily be
|
||||
implemented for all iterator types without adding arbitrary
|
||||
buffering, and sometimes they can't be implemented at all (or
|
||||
not reasonably). E.g. anything to do with reversing directions
|
||||
can't be done when iterating over a file or function. Maybe a
|
||||
separate PEP can be drafted to standardize the names for such
|
||||
operations when the are implementable.
|
||||
While some of these are useful, many of these cannot easily be implemented
|
||||
for all iterator types without adding arbitrary buffering, and sometimes they
|
||||
can't be implemented at all (or not reasonably). E.g. anything to do with
|
||||
reversing directions can't be done when iterating over a file or function.
|
||||
Maybe a separate PEP can be drafted to standardize the names for such
|
||||
operations when they are implementable.
|
||||
|
||||
Resolution: rejected.
|
||||
|
||||
- There has been a long discussion about whether
|
||||
- There has been a long discussion about whether
|
||||
|
||||
::
|
||||
|
||||
for x in dict: ...
|
||||
|
||||
should assign x the successive keys, values, or items of the
|
||||
dictionary. The symmetry between "if x in y" and "for x in y"
|
||||
suggests that it should iterate over keys. This symmetry has been
|
||||
observed by many independently and has even been used to "explain"
|
||||
one using the other. This is because for sequences, "if x in y"
|
||||
iterates over y comparing the iterated values to x. If we adopt
|
||||
both of the above proposals, this will also hold for
|
||||
should assign *x* the successive keys, values, or items of the dictionary.
|
||||
The symmetry between ``if x in y`` and ``for x in y`` suggests that it should
|
||||
iterate over keys. This symmetry has been observed by many independently and
|
||||
has even been used to "explain" one using the other. This is because for
|
||||
sequences, ``if x in y`` iterates over *y* comparing the iterated values to
|
||||
*x*. If we adopt both of the above proposals, this will also hold for
|
||||
dictionaries.
|
||||
|
||||
The argument against making "for x in dict" iterate over the keys
|
||||
comes mostly from a practicality point of view: scans of the
|
||||
standard library show that there are about as many uses of "for x
|
||||
in dict.items()" as there are of "for x in dict.keys()", with the
|
||||
items() version having a small majority. Presumably many of the
|
||||
loops using keys() use the corresponding value anyway, by writing
|
||||
dict[x], so (the argument goes) by making both the key and value
|
||||
available, we could support the largest number of cases. While
|
||||
this is true, I (Guido) find the correspondence between "for x in
|
||||
dict" and "if x in dict" too compelling to break, and there's not
|
||||
much overhead in having to write dict[x] to explicitly get the
|
||||
value.
|
||||
The argument against making ``for x in dict`` iterate over the keys comes
|
||||
mostly from a practicality point of view: scans of the standard library show
|
||||
that there are about as many uses of ``for x in dict.items()`` as there are
|
||||
of ``for x in dict.keys()``, with the ``items()`` version having a small
|
||||
majority. Presumably many of the loops using ``keys()`` use the
|
||||
corresponding value anyway, by writing ``dict[x]``, so (the argument goes) by
|
||||
making both the key and value available, we could support the largest number
|
||||
of cases. While this is true, I (Guido) find the correspondence between
|
||||
``for x in dict`` and ``if x in dict`` too compelling to break, and there's
|
||||
not much overhead in having to write ``dict[x]`` to explicitly get the value.
|
||||
|
||||
For fast iteration over items, use "for key, value in
|
||||
dict.iteritems()". I've timed the difference between
|
||||
For fast iteration over items, use ``for key, value in dict.iteritems()``.
|
||||
I've timed the difference between
|
||||
|
||||
::
|
||||
|
||||
for key in dict: dict[key]
|
||||
|
||||
and
|
||||
|
||||
::
|
||||
|
||||
for key, value in dict.iteritems(): pass
|
||||
|
||||
and found that the latter is only about 7% faster.
|
||||
|
||||
Resolution: By BDFL pronouncement, "for x in dict" iterates over
|
||||
the keys, and dictionaries have iteritems(), iterkeys(), and
|
||||
itervalues() to return the different flavors of dictionary
|
||||
iterators.
|
||||
Resolution: By BDFL pronouncement, ``for x in dict`` iterates over the keys,
|
||||
and dictionaries have ``iteritems()``, ``iterkeys()``, and ``itervalues()``
|
||||
to return the different flavors of dictionary iterators.
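
Several of the resolutions above can be illustrated together. The following is a minimal sketch, not part of the original PEP text: it uses Python 3 spelling (``next(it)`` instead of ``it.next()``, ``dict.items()`` instead of ``dict.iteritems()``), and the names ``pending``, ``read_line`` and ``ages`` are hypothetical, introduced only for this example.

```python
# Python 3 spelling: next(it) replaces it.next(); dict.items() replaces
# dict.iteritems().

# 1. iter(callable, sentinel): call the function until it returns the sentinel.
pending = ["spam", "eggs", "", "ignored"]   # hypothetical data; "" is the sentinel

def read_line():
    # Hypothetical zero-argument callable, standing in for e.g. a readline method.
    return pending.pop(0)

lines = list(iter(read_line, ""))

# 2. Once exhausted, an iterator keeps raising StopIteration on every call.
it = iter([1, 2])
consumed = [next(it), next(it)]
stops = 0
for _ in range(3):
    try:
        next(it)
    except StopIteration:
        stops += 1

# 3. "for x in dict" and "x in dict" both refer to the keys.
ages = {"ann": 31, "bob": 27}               # hypothetical dictionary
keys_seen = sorted(key for key in ages)
has_ann = "ann" in ages                     # key lookup
has_31 = 31 in ages                         # values are not searched
pairs = sorted(ages.items())                # key/value pairs when both are needed
```

The sentinel form stops as soon as ``read_line`` returns the empty string, so the entry after it is never consumed.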
|
||||
|
||||
|
||||
Mailing Lists
|
||||
=============
|
||||
|
||||
The iterator protocol has been discussed extensively in a mailing
|
||||
list on SourceForge:
|
||||
The iterator protocol has been discussed extensively in a mailing list on
|
||||
SourceForge:
|
||||
|
||||
http://lists.sourceforge.net/lists/listinfo/python-iterators
|
||||
|
||||
Initially, some of the discussion was carried out at Yahoo;
|
||||
archives are still accessible:
|
||||
Initially, some of the discussion was carried out at Yahoo; archives are still
|
||||
accessible:
|
||||
|
||||
http://groups.yahoo.com/group/python-iter
|
||||
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
This document is in the public domain.
|
||||
This document is in the public domain.
|
||||
|
||||
|
||||
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
indent-tabs-mode: nil
|
||||
End:
|
||||
|
||||
..
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
indent-tabs-mode: nil
|
||||
End:
|
||||
|
|
489
pep-0255.txt
|
@ -8,6 +8,7 @@ Author: nas@arctrix.com (Neil Schemenauer),
|
|||
Discussions-To: python-iterators@lists.sourceforge.net
|
||||
Status: Final
|
||||
Type: Standards Track
|
||||
Content-Type: text/x-rst
|
||||
Requires: 234
|
||||
Created: 18-May-2001
|
||||
Python-Version: 2.2
|
||||
|
@ -15,83 +16,77 @@ Post-History: 14-Jun-2001, 23-Jun-2001
|
|||
|
||||
|
||||
Abstract
|
||||
========
|
||||
|
||||
This PEP introduces the concept of generators to Python, as well
|
||||
as a new statement used in conjunction with them, the "yield"
|
||||
statement.
|
||||
This PEP introduces the concept of generators to Python, as well as a new
|
||||
statement used in conjunction with them, the ``yield`` statement.
|
||||
|
||||
|
||||
Motivation
|
||||
==========
|
||||
|
||||
When a producer function has a hard enough job that it requires
|
||||
maintaining state between values produced, most programming languages
|
||||
offer no pleasant and efficient solution beyond adding a callback
|
||||
function to the producer's argument list, to be called with each value
|
||||
produced.
|
||||
When a producer function has a hard enough job that it requires maintaining
|
||||
state between values produced, most programming languages offer no pleasant and
|
||||
efficient solution beyond adding a callback function to the producer's argument
|
||||
list, to be called with each value produced.
|
||||
|
||||
For example, tokenize.py in the standard library takes this approach:
|
||||
the caller must pass a "tokeneater" function to tokenize(), called
|
||||
whenever tokenize() finds the next token. This allows tokenize to be
|
||||
coded in a natural way, but programs calling tokenize are typically
|
||||
convoluted by the need to remember between callbacks which token(s)
|
||||
were seen last. The tokeneater function in tabnanny.py is a good
|
||||
example of that, maintaining a state machine in global variables, to
|
||||
remember across callbacks what it has already seen and what it hopes to
|
||||
see next. This was difficult to get working correctly, and is still
|
||||
difficult for people to understand. Unfortunately, that's typical of
|
||||
this approach.
|
||||
For example, ``tokenize.py`` in the standard library takes this approach: the
|
||||
caller must pass a *tokeneater* function to ``tokenize()``, called whenever
|
||||
``tokenize()`` finds the next token. This allows tokenize to be coded in a
|
||||
natural way, but programs calling tokenize are typically convoluted by the need
|
||||
to remember between callbacks which token(s) were seen last. The *tokeneater*
|
||||
function in ``tabnanny.py`` is a good example of that, maintaining a state
|
||||
machine in global variables, to remember across callbacks what it has already
|
||||
seen and what it hopes to see next. This was difficult to get working
|
||||
correctly, and is still difficult for people to understand. Unfortunately,
|
||||
that's typical of this approach.
|
||||
|
||||
An alternative would have been for tokenize to produce an entire parse
|
||||
of the Python program at once, in a large list. Then tokenize clients
|
||||
could be written in a natural way, using local variables and local
|
||||
control flow (such as loops and nested if statements) to keep track of
|
||||
their state. But this isn't practical: programs can be very large, so
|
||||
no a priori bound can be placed on the memory needed to materialize the
|
||||
whole parse; and some tokenize clients only want to see whether
|
||||
something specific appears early in the program (e.g., a future
|
||||
statement, or, as is done in IDLE, just the first indented statement),
|
||||
and then parsing the whole program first is a severe waste of time.
|
||||
An alternative would have been for tokenize to produce an entire parse of the
|
||||
Python program at once, in a large list. Then tokenize clients could be
|
||||
written in a natural way, using local variables and local control flow (such as
|
||||
loops and nested if statements) to keep track of their state. But this isn't
|
||||
practical: programs can be very large, so no a priori bound can be placed on
|
||||
the memory needed to materialize the whole parse; and some tokenize clients
|
||||
only want to see whether something specific appears early in the program (e.g.,
|
||||
a future statement, or, as is done in IDLE, just the first indented statement),
|
||||
and then parsing the whole program first is a severe waste of time.
|
||||
|
||||
Another alternative would be to make tokenize an iterator[1],
|
||||
delivering the next token whenever its .next() method is invoked. This
|
||||
is pleasant for the caller in the same way a large list of results
|
||||
would be, but without the memory and "what if I want to get out early?"
|
||||
drawbacks. However, this shifts the burden on tokenize to remember
|
||||
*its* state between .next() invocations, and the reader need only
|
||||
glance at tokenize.tokenize_loop() to realize what a horrid chore that
|
||||
would be. Or picture a recursive algorithm for producing the nodes of
|
||||
a general tree structure: to cast that into an iterator framework
|
||||
requires removing the recursion manually and maintaining the state of
|
||||
the traversal by hand.
|
||||
Another alternative would be to make tokenize an iterator [1]_, delivering the
|
||||
next token whenever its ``.next()`` method is invoked. This is pleasant for the
|
||||
caller in the same way a large list of results would be, but without the memory
|
||||
and "what if I want to get out early?" drawbacks. However, this shifts the
|
||||
burden on tokenize to remember *its* state between ``.next()`` invocations, and
|
||||
the reader need only glance at ``tokenize.tokenize_loop()`` to realize what a
|
||||
horrid chore that would be. Or picture a recursive algorithm for producing the
|
||||
nodes of a general tree structure: to cast that into an iterator framework
|
||||
requires removing the recursion manually and maintaining the state of the
|
||||
traversal by hand.
|
||||
|
||||
A fourth option is to run the producer and consumer in separate
|
||||
threads. This allows both to maintain their states in natural ways,
|
||||
and so is pleasant for both. Indeed, Demo/threads/Generator.py in the
|
||||
Python source distribution provides a usable synchronized-communication
|
||||
class for doing that in a general way. This doesn't work on platforms
|
||||
without threads, though, and is very slow on platforms that do
|
||||
(compared to what is achievable without threads).
|
||||
A fourth option is to run the producer and consumer in separate threads. This
|
||||
allows both to maintain their states in natural ways, and so is pleasant for
|
||||
both. Indeed, Demo/threads/Generator.py in the Python source distribution
|
||||
provides a usable synchronized-communication class for doing that in a general
|
||||
way. This doesn't work on platforms without threads, though, and is very slow
|
||||
on platforms that do (compared to what is achievable without threads).
|
||||
|
||||
A final option is to use the Stackless[2][3] variant implementation of
|
||||
Python instead, which supports lightweight coroutines. This has much
|
||||
the same programmatic benefits as the thread option, but is much more
|
||||
efficient. However, Stackless is a controversial rethinking of the
|
||||
Python core, and it may not be possible for Jython to implement the
|
||||
same semantics. This PEP isn't the place to debate that, so suffice it
|
||||
to say here that generators provide a useful subset of Stackless
|
||||
functionality in a way that fits easily into the current CPython
|
||||
implementation, and is believed to be relatively straightforward for
|
||||
other Python implementations.
|
||||
A final option is to use the Stackless [2]_ [3]_ variant implementation of Python
|
||||
instead, which supports lightweight coroutines. This has much the same
|
||||
programmatic benefits as the thread option, but is much more efficient.
|
||||
However, Stackless is a controversial rethinking of the Python core, and it may
|
||||
not be possible for Jython to implement the same semantics. This PEP isn't the
|
||||
place to debate that, so suffice it to say here that generators provide a
|
||||
useful subset of Stackless functionality in a way that fits easily into the
|
||||
current CPython implementation, and is believed to be relatively
|
||||
straightforward for other Python implementations.
|
||||
|
||||
That exhausts the current alternatives. Some other high-level
|
||||
languages provide pleasant solutions, notably iterators in Sather[4],
|
||||
which were inspired by iterators in CLU; and generators in Icon[5], a
|
||||
novel language where every expression "is a generator". There are
|
||||
differences among these, but the basic idea is the same: provide a
|
||||
kind of function that can return an intermediate result ("the next
|
||||
value") to its caller, but maintaining the function's local state so
|
||||
that the function can be resumed again right where it left off. A
|
||||
very simple example:
|
||||
That exhausts the current alternatives. Some other high-level languages
|
||||
provide pleasant solutions, notably iterators in Sather [4]_, which were
|
||||
inspired by iterators in CLU; and generators in Icon [5]_, a novel language
|
||||
where every expression *is a generator*. There are differences among these,
|
||||
but the basic idea is the same: provide a kind of function that can return an
|
||||
intermediate result ("the next value") to its caller, but maintaining the
|
||||
function's local state so that the function can be resumed again right where it
|
||||
left off. A very simple example::
|
||||
|
||||
def fib():
|
||||
a, b = 0, 1
|
||||
|
@ -99,79 +94,76 @@ Motivation
|
|||
yield b
|
||||
a, b = b, a+b
|
||||
|
||||
When fib() is first invoked, it sets a to 0 and b to 1, then yields b
|
||||
back to its caller. The caller sees 1. When fib is resumed, from its
|
||||
point of view the yield statement is really the same as, say, a print
|
||||
statement: fib continues after the yield with all local state intact.
|
||||
a and b then become 1 and 1, and fib loops back to the yield, yielding
|
||||
1 to its invoker. And so on. From fib's point of view it's just
|
||||
delivering a sequence of results, as if via callback. But from its
|
||||
caller's point of view, the fib invocation is an iterable object that
|
||||
can be resumed at will. As in the thread approach, this allows both
|
||||
sides to be coded in the most natural ways; but unlike the thread
|
||||
approach, this can be done efficiently and on all platforms. Indeed,
|
||||
resuming a generator should be no more expensive than a function call.
|
||||
When ``fib()`` is first invoked, it sets *a* to 0 and *b* to 1, then yields *b*
|
||||
back to its caller. The caller sees 1. When ``fib`` is resumed, from its
|
||||
point of view the ``yield`` statement is really the same as, say, a ``print``
|
||||
statement: ``fib`` continues after the yield with all local state intact. *a*
|
||||
and *b* then become 1 and 1, and ``fib`` loops back to the ``yield``, yielding
|
||||
1 to its invoker. And so on. From ``fib``'s point of view it's just
|
||||
delivering a sequence of results, as if via callback. But from its caller's
|
||||
point of view, the ``fib`` invocation is an iterable object that can be resumed
|
||||
at will. As in the thread approach, this allows both sides to be coded in the
|
||||
most natural ways; but unlike the thread approach, this can be done efficiently
|
||||
and on all platforms. Indeed, resuming a generator should be no more expensive
|
||||
than a function call.
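
To make the resumption behaviour concrete, here is a minimal sketch of a Fibonacci generator in the spirit of the one above, driven by hand and then through ``itertools.islice``. It is written in Python 3, where ``next(gen)`` replaces ``gen.next()``; the variable names ``gen``, ``first_two`` and ``next_four`` are assumptions made for this illustration.

```python
from itertools import islice

def fib():
    # Local state (a, b) survives across every yield.
    a, b = 0, 1
    while True:
        yield b
        a, b = b, a + b

gen = fib()                         # no body code has run yet
first_two = [next(gen), next(gen)]  # resume twice by hand
next_four = list(islice(gen, 4))    # keep resuming the same generator
```

Because the generator is infinite, the caller decides when to stop; ``islice`` simply stops asking for values after four resumptions.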
|
||||
|
||||
The same kind of approach applies to many producer/consumer functions.
|
||||
For example, tokenize.py could yield the next token instead of invoking
|
||||
a callback function with it as argument, and tokenize clients could
|
||||
iterate over the tokens in a natural way: a Python generator is a kind
|
||||
of Python iterator[1], but of an especially powerful kind.
|
||||
The same kind of approach applies to many producer/consumer functions. For
|
||||
example, ``tokenize.py`` could yield the next token instead of invoking a
|
||||
callback function with it as argument, and tokenize clients could iterate over
|
||||
the tokens in a natural way: a Python generator is a kind of Python
|
||||
iterator [1]_, but of an especially powerful kind.
|
||||
|
||||
|
||||
Specification: Yield
|
||||
=====================
|
||||
|
||||
A new statement is introduced:
|
||||
A new statement is introduced::
|
||||
|
||||
yield_stmt: "yield" expression_list
|
||||
|
||||
"yield" is a new keyword, so a future statement[8] is needed to phase
|
||||
this in: in the initial release, a module desiring to use generators
|
||||
must include the line
|
||||
``yield`` is a new keyword, so a ``future`` statement [8]_ is needed to phase
|
||||
this in: in the initial release, a module desiring to use generators must
|
||||
include the line::
|
||||
|
||||
from __future__ import generators
|
||||
|
||||
near the top (see PEP 236[8]) for details). Modules using the
|
||||
identifier "yield" without a future statement will trigger warnings.
|
||||
In the following release, yield will be a language keyword and the
|
||||
future statement will no longer be needed.
|
||||
near the top (see PEP 236 [8]_ for details). Modules using the identifier
|
||||
``yield`` without a ``future`` statement will trigger warnings. In the
|
||||
following release, ``yield`` will be a language keyword and the ``future``
|
||||
statement will no longer be needed.
|
||||
|
||||
The yield statement may only be used inside functions. A function that
|
||||
contains a yield statement is called a generator function. A generator
|
||||
function is an ordinary function object in all respects, but has the
|
||||
new CO_GENERATOR flag set in the code object's co_flags member.
|
||||
The ``yield`` statement may only be used inside functions. A function that
|
||||
contains a ``yield`` statement is called a generator function. A generator
|
||||
function is an ordinary function object in all respects, but has the new
|
||||
``CO_GENERATOR`` flag set in the code object's ``co_flags`` member.
|
||||
|
||||
When a generator function is called, the actual arguments are bound to
|
||||
function-local formal argument names in the usual way, but no code in
|
||||
the body of the function is executed. Instead a generator-iterator
|
||||
object is returned; this conforms to the iterator protocol[6], so in
|
||||
particular can be used in for-loops in a natural way. Note that when
|
||||
the intent is clear from context, the unqualified name "generator" may
|
||||
be used to refer either to a generator-function or a generator-
|
||||
iterator.
|
||||
When a generator function is called, the actual arguments are bound to
|
||||
function-local formal argument names in the usual way, but no code in the body
|
||||
of the function is executed. Instead a generator-iterator object is returned;
|
||||
this conforms to the iterator protocol [6]_, so in particular can be used in
|
||||
for-loops in a natural way. Note that when the intent is clear from context,
|
||||
the unqualified name "generator" may be used to refer either to a
|
||||
generator-function or a generator-iterator.
|
||||
|
||||
Each time the .next() method of a generator-iterator is invoked, the
|
||||
code in the body of the generator-function is executed until a yield
|
||||
or return statement (see below) is encountered, or until the end of
|
||||
the body is reached.
|
||||
Each time the ``.next()`` method of a generator-iterator is invoked, the code
|
||||
in the body of the generator-function is executed until a ``yield`` or
|
||||
``return`` statement (see below) is encountered, or until the end of the body
|
||||
is reached.
|
||||
|
||||
If a yield statement is encountered, the state of the function is
|
||||
frozen, and the value of expression_list is returned to .next()'s
|
||||
caller. By "frozen" we mean that all local state is retained,
|
||||
including the current bindings of local variables, the instruction
|
||||
pointer, and the internal evaluation stack: enough information is
|
||||
saved so that the next time .next() is invoked, the function can
|
||||
proceed exactly as if the yield statement were just another external
|
||||
call.
|
||||
If a ``yield`` statement is encountered, the state of the function is frozen,
|
||||
and the value of *expression_list* is returned to ``.next()``'s caller. By
|
||||
"frozen" we mean that all local state is retained, including the current
|
||||
bindings of local variables, the instruction pointer, and the internal
|
||||
evaluation stack: enough information is saved so that the next time
|
||||
``.next()`` is invoked, the function can proceed exactly as if the ``yield``
|
||||
statement were just another external call.
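
The two points above (no body code runs when the generator function is called, and local state is frozen at each ``yield``) can be observed directly. This is a minimal sketch in Python 3 spelling; ``countdown`` and ``log`` are hypothetical names used only for illustration.

```python
log = []

def countdown(n):
    log.append("started")   # runs on the first next(), not at call time
    while n > 0:
        yield n             # state is frozen here until the next resumption
        n -= 1              # the local n is still intact when execution resumes

gen = countdown(3)
ran_at_call_time = bool(log)   # the body has not executed yet
first = next(gen)              # runs the body up to the first yield
rest = list(gen)               # resumes until the body finishes
```

The side effect recorded in ``log`` appears only after the first ``next()`` call, which is the observable difference between calling a generator function and calling an ordinary one.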
|
||||
|
||||
Restriction: A yield statement is not allowed in the try clause of a
|
||||
try/finally construct. The difficulty is that there's no guarantee
|
||||
the generator will ever be resumed, hence no guarantee that the finally
|
||||
block will ever get executed; that's too much a violation of finally's
|
||||
purpose to bear.
|
||||
Restriction: A ``yield`` statement is not allowed in the ``try`` clause of a
|
||||
``try/finally`` construct. The difficulty is that there's no guarantee the
|
||||
generator will ever be resumed, hence no guarantee that the finally block will
|
||||
ever get executed; that's too much a violation of finally's purpose to bear.
|
||||
|
||||
Restriction: A generator cannot be resumed while it is actively
|
||||
running:
|
||||
Restriction: A generator cannot be resumed while it is actively running::
|
||||
|
||||
>>> def g():
|
||||
... i = me.next()
|
||||
|
@ -185,27 +177,28 @@ Specification: Yield
|
|||
|
||||
|
||||
Specification: Return
|
||||
======================
|
||||
|
||||
A generator function can also contain return statements of the form:
|
||||
A generator function can also contain return statements of the form::
|
||||
|
||||
"return"
|
||||
return
|
||||
|
||||
Note that an expression_list is not allowed on return statements
|
||||
in the body of a generator (although, of course, they may appear in
|
||||
the bodies of non-generator functions nested within the generator).
|
||||
Note that an *expression_list* is not allowed on return statements in the body
|
||||
of a generator (although, of course, they may appear in the bodies of
|
||||
non-generator functions nested within the generator).
|
||||
|
||||
When a return statement is encountered, control proceeds as in any
|
||||
function return, executing the appropriate finally clauses (if any
|
||||
exist). Then a StopIteration exception is raised, signalling that the
|
||||
iterator is exhausted. A StopIteration exception is also raised if
|
||||
control flows off the end of the generator without an explicit return.
|
||||
When a return statement is encountered, control proceeds as in any function
|
||||
return, executing the appropriate ``finally`` clauses (if any exist). Then a
|
||||
``StopIteration`` exception is raised, signalling that the iterator is
|
||||
exhausted. A ``StopIteration`` exception is also raised if control flows off
|
||||
the end of the generator without an explicit return.
|
||||
|
||||
Note that return means "I'm done, and have nothing interesting to
|
||||
return", for both generator functions and non-generator functions.
|
||||
Note that return means "I'm done, and have nothing interesting to return", for
|
||||
both generator functions and non-generator functions.
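
As a minimal sketch of a bare ``return`` ending a generator, consider the hypothetical helper ``first_n`` below (Python 3 spelling, with ``next(gen)`` in place of ``gen.next()``). It is an illustration under those assumptions, not code from the PEP.

```python
def first_n(items, n):
    # Yield at most n items, then finish with a bare return.
    for index, item in enumerate(items):
        if index >= n:
            return          # the caller sees this as StopIteration
        yield item

letters = list(first_n("abcdef", 3))

gen = first_n("xy", 1)
only = next(gen)
try:
    next(gen)               # the generator hits the bare return
    exhausted = False
except StopIteration:
    exhausted = True
```

A ``for`` loop or ``list()`` absorbs the ``StopIteration`` silently; it only becomes visible when the generator is driven by hand.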
|
||||
|
||||
Note that return isn't always equivalent to raising StopIteration: the
|
||||
difference lies in how enclosing try/except constructs are treated.
|
||||
For example,
|
||||
Note that return isn't always equivalent to raising ``StopIteration``: the
|
||||
difference lies in how enclosing ``try/except`` constructs are treated. For
|
||||
example::
|
||||
|
||||
>>> def f1():
|
||||
... try:
|
||||
|
@ -215,7 +208,7 @@ Specification: Return
|
|||
>>> print list(f1())
|
||||
[]
|
||||
|
||||
because, as in any function, return simply exits, but
|
||||
because, as in any function, ``return`` simply exits, but::
|
||||
|
||||
>>> def f2():
|
||||
... try:
|
||||
|
@ -225,20 +218,20 @@ Specification: Return
|
|||
>>> print list(f2())
|
||||
[42]
|
||||
|
||||
because StopIteration is captured by a bare "except", as is any
|
||||
exception.
|
||||
because ``StopIteration`` is captured by a bare ``except``, as is any
|
||||
exception.
|
||||
|
||||
|
||||
Specification: Generators and Exception Propagation
|
||||
====================================================
|
||||
|
||||
If an unhandled exception-- including, but not limited to,
|
||||
StopIteration --is raised by, or passes through, a generator function,
|
||||
then the exception is passed on to the caller in the usual way, and
|
||||
subsequent attempts to resume the generator function raise
|
||||
StopIteration. In other words, an unhandled exception terminates a
|
||||
generator's useful life.
|
||||
If an unhandled exception -- including, but not limited to, ``StopIteration``
|
||||
-- is raised by, or passes through, a generator function, then the exception is
|
||||
passed on to the caller in the usual way, and subsequent attempts to resume the
|
||||
generator function raise ``StopIteration``. In other words, an unhandled
|
||||
exception terminates a generator's useful life.
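
As a hedged illustration in Python 3 spelling (``next(gen)`` rather than ``gen.next()``), the sketch below uses a hypothetical generator named ``failing`` that is not part of the PEP. It shows an unhandled ``ZeroDivisionError`` reaching the caller, after which the generator can only raise ``StopIteration``:

```python
def failing():
    # Hypothetical generator: yields once, then hits an unhandled error.
    yield 1
    yield 1 / 0
    yield 3  # never reached


gen = failing()
first = next(gen)

# The exception propagates to the caller in the usual way.
try:
    next(gen)
    error_seen = None
except ZeroDivisionError as exc:
    error_seen = type(exc).__name__

# The generator's useful life is over; resuming it raises StopIteration.
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True

print(first, error_seen, exhausted)
```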
|
||||
|
||||
Example (not idiomatic but to illustrate the point):
|
||||
Example (not idiomatic but to illustrate the point)::
|
||||
|
||||
>>> def f():
|
||||
... return 1/0
|
||||
|
@ -260,12 +253,13 @@ Specification: Generators and Exception Propagation
|
|||
|
||||
|
||||
Specification: Try/Except/Finally
|
||||
==================================
|
||||
|
||||
As noted earlier, yield is not allowed in the try clause of a try/
|
||||
finally construct. A consequence is that generators should allocate
|
||||
critical resources with great care. There is no restriction on yield
|
||||
otherwise appearing in finally clauses, except clauses, or in the try
|
||||
clause of a try/except construct:
|
||||
As noted earlier, ``yield`` is not allowed in the ``try`` clause of a
|
||||
``try/finally`` construct. A consequence is that generators should allocate
|
||||
critical resources with great care. There is no restriction on ``yield``
|
||||
otherwise appearing in ``finally`` clauses, ``except`` clauses, or in the
|
||||
``try`` clause of a ``try/except`` construct::
|
||||
|
||||
>>> def f():
|
||||
... try:
|
||||
|
@ -295,6 +289,9 @@ Specification: Try/Except/Finally
|
|||
|
||||
|
||||
Example
|
||||
=======
|
||||
|
||||
::
|
||||
|
||||
# A binary tree class.
|
||||
class Tree:
|
||||
|
@ -360,31 +357,35 @@ Example
|
|||
print x,
|
||||
print
|
||||
|
||||
Both output blocks display:
|
||||
Both output blocks display::
|
||||
|
||||
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
|
||||
|
||||
|
||||
Q & A
|
||||
=====
|
||||
|
||||
Q. Why not a new keyword instead of reusing "def"?
|
||||
Why not a new keyword instead of reusing ``def``?
|
||||
-------------------------------------------------
|
||||
|
||||
A. See BDFL Pronouncements section below.
|
||||
See BDFL Pronouncements section below.
|
||||
|
||||
Q. Why a new keyword for "yield"? Why not a builtin function instead?
|
||||
Why a new keyword for ``yield``? Why not a builtin function instead?
|
||||
---------------------------------------------------------------------
|
||||
|
||||
A. Control flow is much better expressed via keyword in Python, and
|
||||
yield is a control construct. It's also believed that efficient
|
||||
implementation in Jython requires that the compiler be able to
|
||||
determine potential suspension points at compile-time, and a new
|
||||
keyword makes that easy. The CPython reference implementation also
|
||||
exploits it heavily, to detect which functions *are* generator-
|
||||
functions (although a new keyword in place of "def" would solve that
|
||||
for CPython -- but people asking the "why a new keyword?" question
|
||||
don't want any new keyword).
|
||||
Control flow is much better expressed via keyword in Python, and ``yield`` is a
|
||||
control construct. It's also believed that efficient implementation in Jython
|
||||
requires that the compiler be able to determine potential suspension points at
|
||||
compile-time, and a new keyword makes that easy. The CPython reference
|
||||
implementation also exploits it heavily, to detect which functions *are*
|
||||
generator-functions (although a new keyword in place of ``def`` would solve
|
||||
that for CPython -- but people asking the "why a new keyword?" question don't
|
||||
want any new keyword).
|
||||
|
||||
Q: Then why not some other special syntax without a new keyword? For
|
||||
example, one of these instead of "yield 3":
|
||||
Then why not some other special syntax without a new keyword?
|
||||
-------------------------------------------------------------
|
||||
|
||||
For example, one of these instead of ``yield 3``::
|
||||
|
||||
return 3 and continue
|
||||
return and continue 3
|
||||
|
@ -398,114 +399,126 @@ Q & A
|
|||
<< 3
|
||||
* 3
|
||||
|
||||
A: Did I miss one <wink>? Out of hundreds of messages, I counted three
|
||||
suggesting such an alternative, and extracted the above from them.
|
||||
It would be nice not to need a new keyword, but nicer to make yield
|
||||
very clear -- I don't want to have to *deduce* that a yield is
|
||||
occurring from making sense of a previously senseless sequence of
|
||||
keywords or operators. Still, if this attracts enough interest,
|
||||
proponents should settle on a single consensus suggestion, and Guido
|
||||
will Pronounce on it.
|
||||
Did I miss one <wink>? Out of hundreds of messages, I counted three
|
||||
suggesting such an alternative, and extracted the above from them. It would be
|
||||
nice not to need a new keyword, but nicer to make ``yield`` very clear -- I
|
||||
don't want to have to *deduce* that a yield is occurring from making sense of a
|
||||
previously senseless sequence of keywords or operators. Still, if this
|
||||
attracts enough interest, proponents should settle on a single consensus
|
||||
suggestion, and Guido will Pronounce on it.
|
||||
|
||||
Q. Why allow "return" at all? Why not force termination to be spelled
|
||||
"raise StopIteration"?
|
||||
Why allow ``return`` at all? Why not force termination to be spelled ``raise StopIteration``?
|
||||
----------------------------------------------------------------------------------------------
|
||||
|
||||
A. The mechanics of StopIteration are low-level details, much like the
|
||||
mechanics of IndexError in Python 2.1: the implementation needs to
|
||||
do *something* well-defined under the covers, and Python exposes
|
||||
these mechanisms for advanced users. That's not an argument for
|
||||
forcing everyone to work at that level, though. "return" means "I'm
|
||||
done" in any kind of function, and that's easy to explain and to use.
|
||||
Note that "return" isn't always equivalent to "raise StopIteration"
|
||||
in try/except construct, either (see the "Specification: Return"
|
||||
section).
|
||||
The mechanics of ``StopIteration`` are low-level details, much like the
|
||||
mechanics of ``IndexError`` in Python 2.1: the implementation needs to do
|
||||
*something* well-defined under the covers, and Python exposes these mechanisms
|
||||
for advanced users. That's not an argument for forcing everyone to work at
|
||||
that level, though. ``return`` means "I'm done" in any kind of function, and
|
||||
that's easy to explain and to use. Note that ``return`` isn't always equivalent
|
||||
to ``raise StopIteration`` in a ``try/except`` construct, either (see the
|
||||
"Specification: Return" section).
|
||||
|
||||
Q. Then why not allow an expression on "return" too?
|
||||
Then why not allow an expression on ``return`` too?
|
||||
---------------------------------------------------
|
||||
|
||||
A. Perhaps we will someday. In Icon, "return expr" means both "I'm
|
||||
done", and "but I have one final useful value to return too, and
|
||||
this is it". At the start, and in the absence of compelling uses
|
||||
for "return expr", it's simply cleaner to use "yield" exclusively
|
||||
for delivering values.
|
||||
Perhaps we will someday. In Icon, ``return expr`` means both "I'm done", and
|
||||
"but I have one final useful value to return too, and this is it". At the
|
||||
start, and in the absence of compelling uses for ``return expr``, it's simply
|
||||
cleaner to use ``yield`` exclusively for delivering values.
|
||||
|
||||
|
||||
BDFL Pronouncements
|
||||
===================
|
||||
|
||||
Issue: Introduce another new keyword (say, "gen" or "generator") in
|
||||
place of "def", or otherwise alter the syntax, to distinguish
|
||||
generator-functions from non-generator functions.
|
||||
Issue
|
||||
-----
|
||||
|
||||
Con: In practice (how you think about them), generators *are*
|
||||
functions, but with the twist that they're resumable. The mechanics of
|
||||
how they're set up is a comparatively minor technical issue, and
|
||||
introducing a new keyword would unhelpfully overemphasize the
|
||||
mechanics of how generators get started (a vital but tiny part of a
|
||||
generator's life).
|
||||
Introduce another new keyword (say, ``gen`` or ``generator``) in place
|
||||
of ``def``, or otherwise alter the syntax, to distinguish generator-functions
|
||||
from non-generator functions.
|
||||
|
||||
Pro: In reality (how you think about them), generator-functions are
|
||||
actually factory functions that produce generator-iterators as if by
|
||||
magic. In this respect they're radically different from non-generator
|
||||
functions, acting more like a constructor than a function, so reusing
|
||||
"def" is at best confusing. A "yield" statement buried in the body is
|
||||
not enough warning that the semantics are so different.
|
||||
Con
|
||||
---
|
||||
|
||||
BDFL: "def" it stays. No argument on either side is totally
|
||||
convincing, so I have consulted my language designer's intuition. It
|
||||
tells me that the syntax proposed in the PEP is exactly right - not too
|
||||
hot, not too cold. But, like the Oracle at Delphi in Greek mythology,
|
||||
it doesn't tell me why, so I don't have a rebuttal for the arguments
|
||||
against the PEP syntax. The best I can come up with (apart from
|
||||
agreeing with the rebuttals ... already made) is "FUD". If this had
|
||||
been part of the language from day one, I very much doubt it would have
|
||||
made Andrew Kuchling's "Python Warts" page.
|
||||
In practice (how you think about them), generators *are* functions, but
|
||||
with the twist that they're resumable. The mechanics of how they're set up
|
||||
is a comparatively minor technical issue, and introducing a new keyword would
|
||||
unhelpfully overemphasize the mechanics of how generators get started (a vital
|
||||
but tiny part of a generator's life).
|
||||
|
||||
Pro
|
||||
---
|
||||
|
||||
In reality (how you think about them), generator-functions are actually
|
||||
factory functions that produce generator-iterators as if by magic. In this
|
||||
respect they're radically different from non-generator functions, acting more
|
||||
like a constructor than a function, so reusing ``def`` is at best confusing.
|
||||
A ``yield`` statement buried in the body is not enough warning that the
|
||||
semantics are so different.
|
||||
|
||||
BDFL
|
||||
----
|
||||
|
||||
``def`` it stays. No argument on either side is totally convincing, so I
|
||||
have consulted my language designer's intuition. It tells me that the syntax
|
||||
proposed in the PEP is exactly right - not too hot, not too cold. But, like
|
||||
the Oracle at Delphi in Greek mythology, it doesn't tell me why, so I don't
|
||||
have a rebuttal for the arguments against the PEP syntax. The best I can come
|
||||
up with (apart from agreeing with the rebuttals ... already made) is "FUD".
|
||||
If this had been part of the language from day one, I very much doubt it would
|
||||
have made Andrew Kuchling's "Python Warts" page.
|
||||
|
||||
|
||||
Reference Implementation
|
||||
========================
|
||||
|
||||
The current implementation, in a preliminary state (no docs, but well
|
||||
tested and solid), is part of Python's CVS development tree[9]. Using
|
||||
this requires that you build Python from source.
|
||||
The current implementation, in a preliminary state (no docs, but well tested
|
||||
and solid), is part of Python's CVS development tree [9]_. Using this requires
|
||||
that you build Python from source.
|
||||
|
||||
This was derived from an earlier patch by Neil Schemenauer[7].
|
||||
This was derived from an earlier patch by Neil Schemenauer [7]_.
|
||||
|
||||
|
||||
Footnotes and References
|
||||
========================
|
||||
|
||||
[1] PEP 234, Iterators, Yee, Van Rossum
|
||||
.. [1] PEP 234, Iterators, Yee, Van Rossum
|
||||
http://www.python.org/dev/peps/pep-0234/
|
||||
|
||||
[2] http://www.stackless.com/
|
||||
.. [2] http://www.stackless.com/
|
||||
|
||||
[3] PEP 219, Stackless Python, McMillan
|
||||
.. [3] PEP 219, Stackless Python, McMillan
|
||||
http://www.python.org/dev/peps/pep-0219/
|
||||
|
||||
[4] "Iteration Abstraction in Sather"
|
||||
.. [4] "Iteration Abstraction in Sather"
|
||||
Murer, Omohundro, Stoutamire and Szyperski
|
||||
http://www.icsi.berkeley.edu/~sather/Publications/toplas.html
|
||||
|
||||
[5] http://www.cs.arizona.edu/icon/
|
||||
.. [5] http://www.cs.arizona.edu/icon/
|
||||
|
||||
[6] The concept of iterators is described in PEP 234. See [1] above.
|
||||
.. [6] The concept of iterators is described in PEP 234. See [1] above.
|
||||
|
||||
[7] http://python.ca/nas/python/generator.diff
|
||||
.. [7] http://python.ca/nas/python/generator.diff
|
||||
|
||||
[8] PEP 236, Back to the __future__, Peters
|
||||
.. [8] PEP 236, Back to the __future__, Peters
|
||||
http://www.python.org/dev/peps/pep-0236/
|
||||
|
||||
[9] To experiment with this implementation, check out Python from CVS
|
||||
according to the instructions at
|
||||
http://sf.net/cvs/?group_id=5470
|
||||
Note that the std test Lib/test/test_generators.py contains many
|
||||
.. [9] To experiment with this implementation, check out Python from CVS
|
||||
according to the instructions at http://sf.net/cvs/?group_id=5470
|
||||
Note that the std test ``Lib/test/test_generators.py`` contains many
|
||||
examples, including all those in this PEP.
|
||||
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
This document has been placed in the public domain.
|
||||
This document has been placed in the public domain.
|
||||
|
||||
|
||||
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
indent-tabs-mode: nil
|
||||
End:
|
||||
|
||||
..
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
indent-tabs-mode: nil
|
||||
End:
|
||||
|
|
681
pep-0450.txt
|
@ -5,93 +5,91 @@ Last-Modified: $Date$
|
|||
Author: Steven D'Aprano <steve@pearwood.info>
|
||||
Status: Final
|
||||
Type: Standards Track
|
||||
Content-Type: text/plain
|
||||
Content-Type: text/x-rst
|
||||
Created: 01-Aug-2013
|
||||
Python-Version: 3.4
|
||||
Post-History: 13-Sep-2013
|
||||
|
||||
|
||||
Abstract
|
||||
========
|
||||
|
||||
This PEP proposes the addition of a module for common statistics functions
|
||||
such as mean, median, variance and standard deviation to the Python
|
||||
standard library. See also http://bugs.python.org/issue18606
|
||||
This PEP proposes the addition of a module for common statistics functions such
|
||||
as mean, median, variance and standard deviation to the Python standard
|
||||
library. See also http://bugs.python.org/issue18606
|
||||
|
||||
|
||||
Rationale
|
||||
=========
|
||||
|
||||
The proposed statistics module is motivated by the "batteries included"
|
||||
philosophy towards the Python standard library. Raymond Hettinger and
|
||||
other senior developers have requested a quality statistics library that
|
||||
falls somewhere in between high-end statistics libraries and ad hoc
|
||||
code.[1] Statistical functions such as mean, standard deviation and others
|
||||
are obvious and useful batteries, familiar to any Secondary School student.
|
||||
Even cheap scientific calculators typically include multiple statistical
|
||||
functions such as:
|
||||
The proposed statistics module is motivated by the "batteries included"
|
||||
philosophy towards the Python standard library. Raymond Hettinger and other
|
||||
senior developers have requested a quality statistics library that falls
|
||||
somewhere in between high-end statistics libraries and ad hoc code. [1]_
|
||||
Statistical functions such as mean, standard deviation and others are obvious
|
||||
and useful batteries, familiar to any Secondary School student. Even cheap
|
||||
scientific calculators typically include multiple statistical functions such
|
||||
as:
|
||||
|
||||
- mean
|
||||
- population and sample variance
|
||||
- population and sample standard deviation
|
||||
- linear regression
|
||||
- correlation coefficient
|
||||
- mean
|
||||
- population and sample variance
|
||||
- population and sample standard deviation
|
||||
- linear regression
|
||||
- correlation coefficient
|
||||
|
||||
Graphing calculators aimed at Secondary School students typically
|
||||
include all of the above, plus some or all of:
|
||||
Graphing calculators aimed at Secondary School students typically include all
|
||||
of the above, plus some or all of:
|
||||
|
||||
- median
|
||||
- mode
|
||||
- functions for calculating the probability of random variables
|
||||
from the normal, t, chi-squared, and F distributions
|
||||
- inference on the mean
|
||||
- median
|
||||
- mode
|
||||
- functions for calculating the probability of random variables from the
|
||||
normal, t, chi-squared, and F distributions
|
||||
- inference on the mean
|
||||
|
||||
and others[2]. Likewise spreadsheet applications such as Microsoft Excel,
|
||||
LibreOffice and Gnumeric include rich collections of statistical
|
||||
functions[3].
|
||||
and others [2]_. Likewise spreadsheet applications such as Microsoft Excel,
|
||||
LibreOffice and Gnumeric include rich collections of statistical
|
||||
functions [3]_.
|
||||
|
||||
In contrast, Python currently has no standard way to calculate even the
|
||||
simplest and most obvious statistical functions such as mean. For those
|
||||
who need statistical functions in Python, there are two obvious solutions:
|
||||
In contrast, Python currently has no standard way to calculate even the
|
||||
simplest and most obvious statistical functions such as mean. For those who
|
||||
need statistical functions in Python, there are two obvious solutions:
|
||||
|
||||
- install numpy and/or scipy[4];
|
||||
- install numpy and/or scipy [4]_;
|
||||
|
||||
- or use a Do It Yourself solution.
|
||||
- or use a Do It Yourself solution.
|
||||
|
||||
Numpy is perhaps the most full-featured solution, but it has a few
|
||||
disadvantages:
|
||||
Numpy is perhaps the most full-featured solution, but it has a few
|
||||
disadvantages:
|
||||
|
||||
- It may be overkill for many purposes. The documentation for numpy even
|
||||
warns
|
||||
- It may be overkill for many purposes. The documentation for numpy even warns
|
||||
|
||||
"It can be hard to know what functions are available in
|
||||
numpy. This is not a complete list, but it does cover
|
||||
most of them."[5]
|
||||
"It can be hard to know what functions are available in numpy. This is
|
||||
not a complete list, but it does cover most of them." [5]_
|
||||
|
||||
and then goes on to list over 270 functions, only a small number of
|
||||
which are related to statistics.
|
||||
and then goes on to list over 270 functions, only a small number of which are
|
||||
related to statistics.
|
||||
|
||||
- Numpy is aimed at those doing heavy numerical work, and may be
|
||||
intimidating to those who don't have a background in computational
|
||||
mathematics and computer science. For example, numpy.mean takes four
|
||||
arguments:
|
||||
- Numpy is aimed at those doing heavy numerical work, and may be intimidating
|
||||
to those who don't have a background in computational mathematics and
|
||||
computer science. For example, ``numpy.mean`` takes four arguments::
|
||||
|
||||
mean(a, axis=None, dtype=None, out=None)
|
||||
|
||||
although fortunately for the beginner or casual numpy user, three are
|
||||
optional and numpy.mean does the right thing in simple cases:
|
||||
optional and ``numpy.mean`` does the right thing in simple cases::
|
||||
|
||||
>>> numpy.mean([1, 2, 3, 4])
|
||||
2.5
|
||||
|
||||
- For many people, installing numpy may be difficult or impossible. For
|
||||
example, people in corporate environments may have to go through a
|
||||
difficult, time-consuming process before being permitted to install
|
||||
third-party software. For the casual Python user, having to learn about
|
||||
installing third-party packages in order to average a list of numbers is
|
||||
unfortunate.
|
||||
- For many people, installing numpy may be difficult or impossible. For
|
||||
example, people in corporate environments may have to go through a difficult,
|
||||
time-consuming process before being permitted to install third-party
|
||||
software. For the casual Python user, having to learn about installing
|
||||
third-party packages in order to average a list of numbers is unfortunate.
|
||||
|
||||
This leads to option number 2, DIY statistics functions. At first glance,
|
||||
this appears to be an attractive option, due to the apparent simplicity of
|
||||
common statistical functions. For example:
|
||||
This leads to option number 2, DIY statistics functions. At first glance, this
|
||||
appears to be an attractive option, due to the apparent simplicity of common
|
||||
statistical functions. For example::
|
||||
|
||||
def mean(data):
|
||||
return sum(data)/len(data)
|
||||
|
@ -105,391 +103,414 @@ Rationale
|
|||
def standard_deviation(data):
|
||||
return math.sqrt(variance(data))
|
||||
|
||||
The above appears to be correct with a casual test:
|
||||
The above appears to be correct with a casual test::
|
||||
|
||||
>>> data = [1, 2, 4, 5, 8]
|
||||
>>> variance(data)
|
||||
7.5
|
||||
|
||||
But adding a constant to every data point should not change the variance:
|
||||
But adding a constant to every data point should not change the variance::
|
||||
|
||||
>>> data = [x+1e12 for x in data]
|
||||
>>> variance(data)
|
||||
0.0
|
||||
|
||||
And variance should *never* be negative:
|
||||
And variance should *never* be negative::
|
||||
|
||||
>>> variance(data*100)
|
||||
-1239429440.1282566
|
||||
|
||||
By contrast, the proposed reference implementation gets the exactly correct
|
||||
answer 7.5 for the first two examples, and a reasonably close answer for
|
||||
the third: 6.012. numpy does no better[6].
|
||||
By contrast, the proposed reference implementation gets the exactly correct
|
||||
answer 7.5 for the first two examples, and a reasonably close answer for the
|
||||
third: 6.012. numpy does no better [6]_.
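
The comparison can be tried with the ``statistics`` module that later shipped in the standard library (Python 3.4 and newer). This is a sketch under assumptions: the DIY ``variance`` is elided in this hunk, so ``naive_variance`` below is an assumed one-pass "computational formula", not necessarily the exact code the PEP shows:

```python
import statistics


def naive_variance(data):
    # Assumed one-pass "computational formula"; subtracting two huge,
    # nearly equal sums makes it prone to catastrophic cancellation.
    n = len(data)
    ss = sum(x ** 2 for x in data) - (sum(data) ** 2) / n
    return ss / (n - 1)


data = [1, 2, 4, 5, 8]
shifted = [x + 1e12 for x in data]

naive_small = naive_variance(data)
naive_shifted = naive_variance(shifted)  # typically far from 7.5

stable_small = statistics.variance(data)
stable_shifted = statistics.variance(shifted)
stable_repeated = statistics.variance(shifted * 100)

print(naive_small, naive_shifted)
print(stable_small, stable_shifted, stable_repeated)
```

The last value should be close to 3000/499, which is about 6.012 and matches the figure quoted above.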
|
||||
|
||||
Even simple statistical calculations contain traps for the unwary, starting
|
||||
with the Computational Formula itself. Despite the name, it is numerically
|
||||
unstable and can be extremely inaccurate, as can be seen above. It is
|
||||
completely unsuitable for computation by computer[7]. This problem plagues
|
||||
users of many programming languages, not just Python[8], as coders reinvent
|
||||
the same numerically inaccurate code over and over again[9], or advise
|
||||
others to do so[10].
|
||||
Even simple statistical calculations contain traps for the unwary, starting
|
||||
with the Computational Formula itself. Despite the name, it is numerically
|
||||
unstable and can be extremely inaccurate, as can be seen above. It is
|
||||
completely unsuitable for computation by computer [7]_. This problem plagues
|
||||
users of many programming languages, not just Python [8]_, as coders reinvent
|
||||
the same numerically inaccurate code over and over again [9]_, or advise others
|
||||
to do so [10]_.
|
||||
|
||||
It isn't just the variance and standard deviation. Even the mean is not
|
||||
quite as straightforward as it might appear. The above implementation
|
||||
seems too simple to have problems, but it does:
|
||||
It isn't just the variance and standard deviation. Even the mean is not quite
|
||||
as straightforward as it might appear. The above implementation seems too
|
||||
simple to have problems, but it does:
|
||||
|
||||
- The built-in sum can lose accuracy when dealing with floats of wildly
|
||||
differing magnitude. Consequently, the above naive mean fails this
|
||||
"torture test":
|
||||
- The built-in ``sum`` can lose accuracy when dealing with floats of wildly
|
||||
differing magnitude. Consequently, the above naive ``mean`` fails this
|
||||
"torture test"::
|
||||
|
||||
assert mean([1e30, 1, 3, -1e30]) == 1
|
||||
|
||||
returning 0 instead of 1, a purely computational error of 100%.
|
||||
|
||||
- Using math.fsum inside mean will make it more accurate with float data,
|
||||
but it also has the side-effect of converting any arguments to float
|
||||
even when unnecessary. E.g. we should expect the mean of a list of
|
||||
Fractions to be a Fraction, not a float.
|
||||
- Using ``math.fsum`` inside ``mean`` will make it more accurate with float
|
||||
data, but it also has the side-effect of converting any arguments to float
|
||||
even when unnecessary. E.g. we should expect the mean of a list of Fractions
|
||||
to be a Fraction, not a float.
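
Both points can be checked with a short sketch. The helper names ``naive_mean`` and ``fsum_mean`` are hypothetical, and ``statistics.mean`` is the function that eventually shipped in Python 3.4, used here on the assumption that it behaves like the reference implementation discussed in this PEP:

```python
import math
import statistics
from fractions import Fraction


def naive_mean(data):
    # Plain sum: small terms can be absorbed by the huge ones.
    return sum(data) / len(data)


def fsum_mean(data):
    # math.fsum tracks partial sums exactly, but always returns a float.
    return math.fsum(data) / len(data)


torture = [1e30, 1, 3, -1e30]
fracs = [Fraction(1, 2), Fraction(3, 4)]

naive_result = naive_mean(torture)       # the PEP reports 0 for this input
fsum_result = fsum_mean(torture)
fsum_fraction_result = fsum_mean(fracs)  # a float, not a Fraction
stats_result = statistics.mean(torture)
stats_fraction_result = statistics.mean(fracs)

print(naive_result, fsum_result, fsum_fraction_result)
print(stats_result, stats_fraction_result)
```

The ``math.fsum`` version fixes the torture test but returns a float for the Fraction input, while ``statistics.mean`` keeps the Fraction type.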
|
||||
|
||||
While the above mean implementation does not fail quite as catastrophically
|
||||
as the naive variance does, a standard library function can do much better
|
||||
than the DIY versions.
|
||||
While the above mean implementation does not fail quite as catastrophically as
|
||||
the naive variance does, a standard library function can do much better than
|
||||
the DIY versions.
|
||||
|
||||
The example above involves an especially bad set of data, but even for
|
||||
more realistic data sets accuracy is important. The first step in
|
||||
interpreting variation in data (including dealing with ill-conditioned
|
||||
data) is often to standardize it to a series with variance 1 (and often
|
||||
mean 0). This standardization requires accurate computation of the mean
|
||||
and variance of the raw series. Naive computation of mean and variance
|
||||
can lose precision very quickly. Because precision bounds accuracy, it is
|
||||
important to use the most precise algorithms for computing mean and
|
||||
variance that are practical, or the results of standardization are
|
||||
themselves useless.
|
||||
The example above involves an especially bad set of data, but even for more
|
||||
realistic data sets accuracy is important. The first step in interpreting
|
||||
variation in data (including dealing with ill-conditioned data) is often to
|
||||
standardize it to a series with variance 1 (and often mean 0). This
|
||||
standardization requires accurate computation of the mean and variance of the
|
||||
raw series. Naive computation of mean and variance can lose precision very
|
||||
quickly. Because precision bounds accuracy, it is important to use the most
|
||||
precise algorithms for computing mean and variance that are practical, or the
|
||||
results of standardization are themselves useless.
|
||||
|
||||
|
||||
Comparison To Other Languages/Packages
|
||||
======================================
|
||||
|
||||
The proposed statistics library is not intended to be a competitor to such
|
||||
third-party libraries as numpy/scipy, or of proprietary full-featured
|
||||
statistics packages aimed at professional statisticians such as Minitab,
|
||||
SAS and Matlab. It is aimed at the level of graphing and scientific
|
||||
calculators.
|
||||
The proposed statistics library is not intended to be a competitor to such
|
||||
third-party libraries as numpy/scipy, or to proprietary full-featured
|
||||
statistics packages aimed at professional statisticians such as Minitab, SAS
|
||||
and Matlab. It is aimed at the level of graphing and scientific calculators.
|
||||
|
||||
Most programming languages have little or no built-in support for
|
||||
statistics functions. Some exceptions:
|
||||
Most programming languages have little or no built-in support for statistics
|
||||
functions. Some exceptions:
|
||||
|
||||
R
|
||||
R (and its proprietary cousin, S) is a programming language designed
|
||||
for statistics work. It is extremely popular with statisticians and
|
||||
is extremely feature-rich[11].
|
||||
R
|
||||
-
|
||||
|
||||
C#
|
||||
R (and its proprietary cousin, S) is a programming language designed for
|
||||
statistics work. It is extremely popular with statisticians and is extremely
|
||||
feature-rich [11]_.
|
||||
|
||||
The C# LINQ package includes extension methods to calculate the
|
||||
average of enumerables[12].
|
||||
C#
|
||||
--
|
||||
|
||||
Ruby
|
||||
The C# LINQ package includes extension methods to calculate the average of
|
||||
enumerables [12]_.
|
||||
|
||||
Ruby does not ship with a standard statistics module, despite some
|
||||
apparent demand[13]. Statsample appears to be a feature-rich third-
|
||||
party library, aiming to compete with R[14].
|
||||
Ruby
|
||||
----
|
||||
|
||||
PHP
|
||||
Ruby does not ship with a standard statistics module, despite some apparent
|
||||
demand [13]_. Statsample appears to be a feature-rich third-party library,
|
||||
aiming to compete with R [14]_.
|
||||
|
||||
PHP has an extremely feature-rich (although mostly undocumented) set
|
||||
of advanced statistical functions[15].
|
||||
PHP
|
||||
---
|
||||
|
||||
Delphi
|
||||
PHP has an extremely feature-rich (although mostly undocumented) set of
|
||||
advanced statistical functions [15]_.
|
||||
|
||||
Delphi includes standard statistical functions including Mean, Sum,
|
||||
Variance, TotalVariance, MomentSkewKurtosis in its Math library[16].
|
||||
Delphi
|
||||
------
|
||||
|
||||
GNU Scientific Library
|
||||
Delphi includes standard statistical functions including Mean, Sum,
|
||||
Variance, TotalVariance, MomentSkewKurtosis in its Math library [16]_.
|
||||
|
||||
The GNU Scientific Library includes standard statistical functions,
|
||||
percentiles, median and others[17]. One innovation I have borrowed
|
||||
from the GSL is to allow the caller to optionally specify the pre-
|
||||
calculated mean of the sample (or an a priori known population mean)
|
||||
when calculating the variance and standard deviation[18].
|
||||
GNU Scientific Library
|
||||
----------------------
|
||||
|
||||
The GNU Scientific Library includes standard statistical functions,
|
||||
percentiles, median and others [17]_. One innovation I have borrowed from the
|
||||
GSL is to allow the caller to optionally specify the pre-calculated mean of
|
||||
the sample (or an a priori known population mean) when calculating the variance
|
||||
and standard deviation [18]_.
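
In the ``statistics`` module as it shipped in Python 3.4 and later, this optional argument is spelled ``xbar`` for the sample functions and ``mu`` for the population functions. A minimal sketch, with made-up data values chosen only for illustration:

```python
import statistics

data = [1.0, 2.0, 4.0, 5.0, 8.0]

# Compute the mean once and reuse it, instead of letting each function
# recompute it internally.
m = statistics.mean(data)

sample_var = statistics.variance(data, xbar=m)
default_var = statistics.variance(data)
pop_var = statistics.pvariance(data, mu=m)
sample_sd = statistics.stdev(data, xbar=m)

print(m, sample_var, pop_var, sample_sd)
```

Passing a value that is not the true mean of the data gives a meaningless result, so the argument is best treated as an optimisation for callers who already have the mean at hand.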
|
||||
|
||||
|
||||
Design Decisions Of The Module
|
||||
==============================
|
||||
|
||||
My intention is to start small and grow the library as needed, rather than
|
||||
try to include everything from the start. Consequently, the current
|
||||
reference implementation includes only a small number of functions: mean,
|
||||
variance, standard deviation, median, mode. (See the reference
|
||||
implementation for a full list.)
|
||||
My intention is to start small and grow the library as needed, rather than try
|
||||
to include everything from the start. Consequently, the current reference
|
||||
implementation includes only a small number of functions: mean, variance,
|
||||
standard deviation, median, mode. (See the reference implementation for a full
|
||||
list.)
|
||||
|
||||
I have aimed for the following design features:
|
||||
I have aimed for the following design features:
|
||||
|
||||
- Correctness over speed. It is easier to speed up a correct but slow
|
||||
function than to correct a fast but buggy one.
|
||||
- Correctness over speed. It is easier to speed up a correct but slow function
|
||||
than to correct a fast but buggy one.
|
||||
|
||||
- Concentrate on data in sequences, allowing two-passes over the data,
|
||||
rather than potentially compromise on accuracy for the sake of a one-pass
|
||||
algorithm. Functions expect data will be passed as a list or other
|
||||
sequence; if given an iterator, they may internally convert to a list.
|
||||
- Concentrate on data in sequences, allowing two passes over the data, rather
|
||||
than potentially compromise on accuracy for the sake of a one-pass algorithm.
|
||||
Functions expect data will be passed as a list or other sequence; if given an
|
||||
iterator, they may internally convert to a list.
|
||||
|
||||
- Functions should, as much as possible, honour any type of numeric data.
|
||||
E.g. the mean of a list of Decimals should be a Decimal, not a float.
|
||||
When this is not possible, treat float as the "lowest common data type".
|
||||
- Functions should, as much as possible, honour any type of numeric data. E.g.
|
||||
the mean of a list of Decimals should be a Decimal, not a float. When this is
|
||||
not possible, treat float as the "lowest common data type".
|
||||
|
||||
- Although functions support data sets of floats, Decimals or Fractions,
|
||||
there is no guarantee that *mixed* data sets will be supported. (But on
|
||||
the other hand, they aren't explicitly rejected either.)
|
||||
- Although functions support data sets of floats, Decimals or Fractions, there
|
||||
is no guarantee that *mixed* data sets will be supported. (But on the other
|
||||
hand, they aren't explicitly rejected either.)
|
||||
|
||||
- Plenty of documentation, aimed at readers who understand the basic
|
||||
concepts but may not know (for example) which variance they should use
|
||||
(population or sample?). Mathematicians and statisticians have a terrible
|
||||
habit of being inconsistent with both notation and terminology[19], and
|
||||
having spent many hours making sense of the contradictory/confusing
|
||||
definitions in use, it is only fair that I do my best to clarify rather
|
||||
than obfuscate the topic.
|
||||
- Plenty of documentation, aimed at readers who understand the basic concepts
|
||||
but may not know (for example) which variance they should use (population or
|
||||
sample?). Mathematicians and statisticians have a terrible habit of being
|
||||
inconsistent with both notation and terminology [19]_, and having spent many
|
||||
hours making sense of the contradictory/confusing definitions in use, it is
|
||||
only fair that I do my best to clarify rather than obfuscate the topic.
|
||||
|
||||
- But avoid going into tedious[20] mathematical detail.
|
||||
- But avoid going into tedious [20]_ mathematical detail.
|
||||
|
||||
|
||||
API
|
||||
===
|
||||
|
||||
The initial version of the library will provide univariate (single
|
||||
variable) statistics functions. The general API will be based on a
|
||||
functional model ``function(data, ...) -> result``, where ``data``
|
||||
is a mandatory iterable of (usually) numeric data.
|
||||
The initial version of the library will provide univariate (single variable)
|
||||
statistics functions. The general API will be based on a functional model
|
||||
``function(data, ...) -> result``, where ``data`` is a mandatory iterable of
|
||||
(usually) numeric data.
|
||||
|
||||
The author expects that lists will be the most common data type used,
|
||||
but any iterable type should be acceptable. Where necessary, functions
|
||||
may convert to lists internally. Where possible, functions are
|
||||
expected to conserve the type of the data values, for example, the mean
|
||||
of a list of Decimals should be a Decimal rather than float.
|
||||
The author expects that lists will be the most common data type used, but any
|
||||
iterable type should be acceptable. Where necessary, functions may convert to
|
||||
lists internally. Where possible, functions are expected to conserve the type
|
||||
of the data values, for example, the mean of a list of Decimals should be a
|
||||
Decimal rather than float.
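
As a rough illustration of the type-conservation goal described above, the
following sketch assumes the ``statistics`` module as it later shipped in the
standard library (Python 3.4 and newer). The sample values are invented for
the example.

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# The mean of Decimals stays a Decimal.
decimal_mean = mean(
    [Decimal("0.5"), Decimal("0.75"), Decimal("0.625"), Decimal("0.375")]
)

# The mean of Fractions stays a Fraction.
fraction_mean = mean(
    [Fraction(3, 7), Fraction(1, 21), Fraction(5, 3), Fraction(1, 3)]
)

# Any iterable is accepted, not only lists.
iterator_mean = mean(iter([1.5, 2.5, 3.5]))

print(decimal_mean, fraction_mean, iterator_mean)
```

Mixing these types in a single data set is deliberately left out of the
sketch, since the design notes give no guarantee for mixed input.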
|
||||
|
||||
|
||||
Calculating mean, median and mode
|
||||
Calculating mean, median and mode
|
||||
---------------------------------
|
||||
|
||||
The ``mean``, ``median*`` and ``mode`` functions take a single
|
||||
mandatory argument and return the appropriate statistic, e.g.:
|
||||
The ``mean``, ``median*`` and ``mode`` functions take a single mandatory
|
||||
argument and return the appropriate statistic, e.g.::
|
||||
|
||||
>>> mean([1, 2, 3])
|
||||
2.0
|
||||
|
||||
Functions provided are:
|
||||
Functions provided are:
|
||||
|
||||
* mean(data) -> arithmetic mean of data.
|
||||
* ``mean(data)``
|
||||
arithmetic mean of *data*.
|
||||
|
||||
* median(data) -> median (middle value) of data, taking the
|
||||
average of the two middle values when there are an even
|
||||
number of values.
|
||||
* ``median(data)``
|
||||
median (middle value) of *data*, taking the average of the two
|
||||
middle values when there are an even number of values.
|
||||
|
||||
* median_high(data) -> high median of data, taking the
|
||||
larger of the two middle values when the number of items
|
||||
is even.
|
||||
* ``median_high(data)``
|
||||
high median of *data*, taking the larger of the two middle
|
||||
values when the number of items is even.
|
||||
|
||||
* median_low(data) -> low median of data, taking the smaller
|
||||
of the two middle values when the number of items is even.
|
||||
* ``median_low(data)``
|
||||
low median of *data*, taking the smaller of the two middle
|
||||
values when the number of items is even.
|
||||
|
||||
* median_grouped(data, interval=1) -> 50th percentile of
|
||||
grouped data, using interpolation.
|
||||
* ``median_grouped(data, interval=1)``
|
||||
50th percentile of grouped *data*, using interpolation.
|
||||
|
||||
* mode(data) -> most common data point.
|
||||
* ``mode(data)``
|
||||
most common *data* point.
|
||||
|
||||
``mode`` is the sole exception to the rule that the data argument
|
||||
must be numeric. It will also accept an iterable of nominal data,
|
||||
such as strings.
|
||||
``mode`` is the sole exception to the rule that the data argument must be
|
||||
numeric. It will also accept an iterable of nominal data, such as strings.
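
The differences between the median variants listed above are easiest to see
on a data set with an even number of values. This is a small sketch, assuming
the function names in this section match the ``statistics`` module shipped
with Python 3.4 and newer; the data values are made up.

```python
from statistics import median, median_low, median_high, median_grouped, mode

data = [1, 3, 5, 7]  # even length, so the two middle values are 3 and 5

middle = median(data)      # average of the two middle values
low = median_low(data)     # smaller of the two middle values
high = median_high(data)   # larger of the two middle values

# 50th percentile of grouped data, interpolated within a class of width 1.
grouped = median_grouped([52, 52, 53, 54], interval=1)

# mode also accepts nominal (non-numeric) data such as strings.
colour = mode(["red", "blue", "blue", "green"])

print(middle, low, high, grouped, colour)
```

``median_low`` and ``median_high`` always return a value that is actually
present in the data, which can matter when the data are discrete.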
|
||||
|
||||
|
||||
Calculating variance and standard deviation
|
||||
Calculating variance and standard deviation
|
||||
-------------------------------------------
|
||||
|
||||
In order to be similar to scientific calculators, the statistics
|
||||
module will include separate functions for population and sample
|
||||
variance and standard deviation. All four functions have similar
|
||||
signatures, with a single mandatory argument, an iterable of
|
||||
numeric data, e.g.:
|
||||
In order to be similar to scientific calculators, the statistics module will
|
||||
include separate functions for population and sample variance and standard
|
||||
deviation. All four functions have similar signatures, with a single mandatory
|
||||
argument, an iterable of numeric data, e.g.::
|
||||
|
||||
>>> variance([1, 2, 2, 2, 3])
|
||||
0.5
|
||||
|
||||
All four functions also accept a second, optional, argument, the
|
||||
mean of the data. This is modelled on a similar API provided by
|
||||
the GNU Scientific Library[18]. There are three use-cases for
|
||||
using this argument, in no particular order:
|
||||
All four functions also accept a second, optional, argument, the mean of the
|
||||
data. This is modelled on a similar API provided by the GNU Scientific
|
||||
Library [18]_. There are three use-cases for using this argument, in no
|
||||
particular order:
|
||||
|
||||
1) The value of the mean is known *a priori*.
|
||||
1) The value of the mean is known *a priori*.
|
||||
|
||||
2) You have already calculated the mean, and wish to avoid
|
||||
calculating it again.
|
||||
2) You have already calculated the mean, and wish to avoid calculating
|
||||
it again.
|
||||
|
||||
3) You wish to (ab)use the variance functions to calculate
|
||||
the second moment about some given point other than the
|
||||
3) You wish to (ab)use the variance functions to calculate the second
|
||||
moment about some given point other than the mean.
|
||||
|
||||
In each case, it is the caller's responsibility to ensure that the given
|
||||
argument is meaningful.
|
||||
|
||||
Functions provided are:
|
||||
|
||||
* ``variance(data, xbar=None)``
|
||||
sample variance of *data*, optionally using *xbar* as the sample mean.
|
||||
|
||||
* ``stdev(data, xbar=None)``
|
||||
sample standard deviation of *data*, optionally using *xbar* as the
|
||||
sample mean.
|
||||
|
||||
* ``pvariance(data, mu=None)``
|
||||
population variance of *data*, optionally using *mu* as the population
|
||||
mean.
|
||||
|
||||
In each case, it is the caller's responsibility to ensure that
|
||||
the given argument is meaningful.
|
||||
* ``pstdev(data, mu=None)``
|
||||
population standard deviation of *data*, optionally using *mu* as the
|
||||
population mean.
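
The second use-case above (reusing a mean that has already been calculated)
might look like the following. This is a sketch that assumes the signatures
listed in this section, as found in the ``statistics`` module of Python 3.4
and newer; the variable names are illustrative only.

```python
from statistics import mean, variance, stdev, pvariance, pstdev

data = [1, 2, 2, 2, 3]
centre = mean(data)  # calculated once, then passed to the other functions

sample_var = variance(data)                   # mean computed internally
sample_var_reused = variance(data, centre)    # xbar supplied by the caller
sample_sd = stdev(data, centre)

population_var = pvariance(data, mu=centre)   # divides by n instead of n - 1
population_sd = pstdev(data, mu=centre)

print(sample_var, sample_var_reused, population_var)
```

The sketch only passes the true mean. Passing some other value is the
"(ab)use" case, and the caller is responsible for checking that the result is
meaningful.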
|
||||
|
||||
Functions provided are:
|
||||
Other functions
|
||||
---------------
|
||||
|
||||
* variance(data, xbar=None) -> sample variance of data,
|
||||
optionally using xbar as the sample mean.
|
||||
There is one other public function:
|
||||
|
||||
* stdev(data, xbar=None) -> sample standard deviation of
|
||||
data, optionally using xbar as the sample mean.
|
||||
|
||||
* pvariance(data, mu=None) -> population variance of data,
|
||||
optionally using mu as the population mean.
|
||||
|
||||
* pstdev(data, mu=None) -> population standard deviation of
|
||||
data, optionally using mu as the population mean.
|
||||
|
||||
Other functions
|
||||
|
||||
There is one other public function:
|
||||
|
||||
* sum(data, start=0) -> high-precision sum of numeric data.
|
||||
* ``sum(data, start=0)``
|
||||
high-precision sum of numeric *data*.
|
||||
|
||||
|
||||
Specification
|
||||
=============
|
||||
|
||||
As the proposed reference implementation is in pure Python,
|
||||
other Python implementations can easily make use of the module
|
||||
unchanged, or adapt it as they see fit.
|
||||
As the proposed reference implementation is in pure Python, other Python
|
||||
implementations can easily make use of the module unchanged, or adapt it as
|
||||
they see fit.
|
||||
|
||||
|
||||
What Should Be The Name Of The Module?
|
||||
======================================
|
||||
|
||||
This will be a top-level module "statistics".
|
||||
This will be a top-level module ``statistics``.
|
||||
|
||||
There was some interest in turning math into a package, and making this a
|
||||
sub-module of math, but the general consensus eventually agreed on a
|
||||
top-level module. Other potential but rejected names included "stats" (too
|
||||
much risk of confusion with existing "stat" module), and "statslib"
|
||||
(described as "too C-like").
|
||||
There was some interest in turning ``math`` into a package, and making this a
|
||||
sub-module of ``math``, but the general consensus eventually agreed on a
|
||||
top-level module. Other potential but rejected names included ``stats`` (too
|
||||
much risk of confusion with existing ``stat`` module), and ``statslib``
|
||||
(described as "too C-like").
|
||||
|
||||
|
||||
Discussion And Resolved Issues
|
||||
==============================
|
||||
|
||||
This proposal has been previously discussed here[21].
|
||||
This proposal has been previously discussed here [21]_.
|
||||
|
||||
A number of design issues were resolved during the discussion on
|
||||
Python-Ideas and the initial code review. There was a lot of concern
|
||||
about the addition of yet another ``sum`` function to the standard
|
||||
library, see the FAQs below for more details. In addition, the
|
||||
initial implementation of ``sum`` suffered from some rounding issues
|
||||
and other design problems when dealing with Decimals. Oscar
|
||||
Benjamin's assistance in resolving this was invaluable.
|
||||
A number of design issues were resolved during the discussion on Python-Ideas
|
||||
and the initial code review. There was a lot of concern about the addition of
|
||||
yet another ``sum`` function to the standard library, see the FAQs below for
|
||||
more details. In addition, the initial implementation of ``sum`` suffered from
|
||||
some rounding issues and other design problems when dealing with Decimals.
|
||||
Oscar Benjamin's assistance in resolving this was invaluable.
|
||||
|
||||
Another issue was the handling of data in the form of iterators. The
|
||||
first implementation of variance silently swapped between a one- and
|
||||
two-pass algorithm, depending on whether the data was in the form of
|
||||
an iterator or sequence. This proved to be a design mistake, as the
|
||||
calculated variance could differ slightly depending on the algorithm
|
||||
used, and ``variance`` etc. were changed to internally generate a list
|
||||
and always use the more accurate two-pass implementation.
|
||||
Another issue was the handling of data in the form of iterators. The first
|
||||
implementation of variance silently swapped between a one- and two-pass
|
||||
algorithm, depending on whether the data was in the form of an iterator or
|
||||
sequence. This proved to be a design mistake, as the calculated variance could
|
||||
differ slightly depending on the algorithm used, and ``variance`` etc. were
|
||||
changed to internally generate a list and always use the more accurate two-pass
|
||||
implementation.
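
The resolution described above can be sketched as follows. This is not the
reference implementation, only a minimal illustration of the idea: the input
is first materialised as a list, so that iterators and sequences both take
the same two-pass path. The function name is hypothetical.

```python
def two_pass_variance(data):
    """Sample variance using two passes over a materialised copy of data."""
    values = list(data)  # works for iterators and sequences alike
    n = len(values)
    if n < 2:
        raise ValueError("sample variance requires at least two data points")
    # First pass: the mean.
    centre = sum(values) / n
    # Second pass: the sum of squared deviations from that mean.
    squared_deviations = sum((x - centre) ** 2 for x in values)
    return squared_deviations / (n - 1)


from_list = two_pass_variance([1, 2, 2, 2, 3])
from_iterator = two_pass_variance(iter([1, 2, 2, 2, 3]))
print(from_list, from_iterator)
```

Because both calls run the same algorithm, the result no longer depends on
whether the caller passed a list or an iterator.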
|
||||
|
||||
One controversial design involved the functions to calculate median,
|
||||
which were implemented as attributes on the ``median`` callable, e.g.
|
||||
``median``, ``median.low``, ``median.high`` etc. Although there is
|
||||
at least one existing use of this style in the standard library, in
|
||||
``unittest.mock``, the code reviewers felt that this was too unusual
|
||||
for the standard library. Consequently, the design has been changed
|
||||
to a more traditional design of separate functions with a pseudo-
|
||||
namespace naming convention, ``median_low``, ``median_high``, etc.
|
||||
One controversial design involved the functions to calculate median, which were
|
||||
implemented as attributes on the ``median`` callable, e.g. ``median``,
|
||||
``median.low``, ``median.high`` etc. Although there is at least one existing
|
||||
use of this style in the standard library, in ``unittest.mock``, the code
|
||||
reviewers felt that this was too unusual for the standard library.
|
||||
Consequently, the design has been changed to a more traditional design of
|
||||
separate functions with a pseudo-namespace naming convention, ``median_low``,
|
||||
``median_high``, etc.
|
||||
|
||||
Another issue that was of concern to code reviewers was the existence
|
||||
of a function calculating the sample mode of continuous data, with
|
||||
some people questioning the choice of algorithm, and whether it was
|
||||
a sufficiently common need to be included. So it was dropped from
|
||||
the API, and ``mode`` now implements only the basic schoolbook
|
||||
algorithm based on counting unique values.
|
||||
Another issue that was of concern to code reviewers was the existence of a
|
||||
function calculating the sample mode of continuous data, with some people
|
||||
questioning the choice of algorithm, and whether it was a sufficiently common
|
||||
need to be included. So it was dropped from the API, and ``mode`` now
|
||||
implements only the basic schoolbook algorithm based on counting unique values.
|
||||
|
||||
Another significant point of discussion was calculating statistics of
|
||||
timedelta objects. Although the statistics module will not directly
|
||||
support timedelta objects, it is possible to support this use-case by
|
||||
converting them to numbers first using the ``timedelta.total_seconds``
|
||||
method.
|
||||
Another significant point of discussion was calculating statistics of
|
||||
``timedelta`` objects. Although the statistics module will not directly
|
||||
support ``timedelta`` objects, it is possible to support this use-case by
|
||||
converting them to numbers first using the ``timedelta.total_seconds`` method.
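
A minimal sketch of that conversion, assuming the ``statistics`` module of
Python 3.4 and newer; the durations are invented for the example.

```python
from datetime import timedelta
from statistics import mean

durations = [timedelta(minutes=1), timedelta(minutes=2), timedelta(minutes=6)]

# Convert to plain numbers (seconds as floats) before calling the module.
seconds = [d.total_seconds() for d in durations]

# Convert the numeric result back to a timedelta if one is wanted.
average = timedelta(seconds=mean(seconds))

print(average)
```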
|
||||
|
||||
|
||||
Frequently Asked Questions
|
||||
==========================
|
||||
|
||||
Q: Shouldn't this module spend time on PyPI before being considered for
|
||||
the standard library?
|
||||
Shouldn't this module spend time on PyPI before being considered for the standard library?
|
||||
------------------------------------------------------------------------------------------
|
||||
|
||||
A: Older versions of this module have been available on PyPI[22] since
|
||||
2010. Being much simpler than numpy, it does not require many years of
|
||||
external development.
|
||||
Older versions of this module have been available on PyPI [22]_ since 2010.
|
||||
Being much simpler than numpy, it does not require many years of external
|
||||
development.
|
||||
|
||||
Q: Does the standard library really need yet another version of ``sum``?
|
||||
Does the standard library really need yet another version of ``sum``?
|
||||
---------------------------------------------------------------------
|
||||
|
||||
A: This proved to be the most controversial part of the reference
|
||||
implementation. In one sense, clearly three sums is two too many. But
|
||||
in another sense, yes. The reasons why the two existing versions are
|
||||
unsuitable are described here[23] but the short summary is:
|
||||
This proved to be the most controversial part of the reference implementation.
|
||||
In one sense, clearly three sums is two too many. But in another sense, yes.
|
||||
The reasons why the two existing versions are unsuitable are described
|
||||
here [23]_ but the short summary is:
|
||||
|
||||
- the built-in sum can lose precision with floats;
|
||||
- the built-in sum can lose precision with floats;
|
||||
|
||||
- the built-in sum accepts any non-numeric data type that supports
|
||||
the + operator, apart from strings and bytes;
|
||||
- the built-in sum accepts any non-numeric data type that supports the ``+``
|
||||
operator, apart from strings and bytes;
|
||||
|
||||
- math.fsum is high-precision, but coerces all arguments to float.
|
||||
- ``math.fsum`` is high-precision, but coerces all arguments to float.
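
The three points above can be demonstrated with the standard library alone.
In this sketch a hand-written accumulation loop stands in for plain
left-to-right float addition, so that the result does not depend on how a
particular interpreter version implements the built-in ``sum``.

```python
import math
from fractions import Fraction

# Plain left-to-right float addition drifts away from the exact answer.
naive_total = 0.0
for value in [0.1] * 10:
    naive_total += value

# math.fsum tracks the lost low-order bits and returns a correctly rounded sum.
precise_total = math.fsum([0.1] * 10)

# The built-in sum will "add" non-numeric values such as lists.
joined = sum([[1], [2, 3]], [])

# math.fsum always returns a float, even for exact rational input,
# whereas the built-in sum keeps the Fraction type.
thirds = [Fraction(1, 3)] * 3
coerced_total = math.fsum(thirds)
exact_total = sum(thirds)

print(naive_total, precise_total, joined, coerced_total, exact_total)
```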
|
||||
|
||||
There was some interest in "fixing" one or the other of the existing
|
||||
sums. If this occurs before 3.4 feature-freeze, the decision to keep
|
||||
statistics.sum can be re-considered.
|
||||
There was some interest in "fixing" one or the other of the existing sums. If
|
||||
this occurs before 3.4 feature-freeze, the decision to keep ``statistics.sum``
|
||||
can be re-considered.
|
||||
|
||||
Q: Will this module be backported to older versions of Python?
|
||||
Will this module be backported to older versions of Python?
|
||||
-----------------------------------------------------------
|
||||
|
||||
A: The module currently targets 3.3, and I will make it available on PyPI
|
||||
for 3.3 for the foreseeable future. Backporting to older versions of
|
||||
the 3.x series is likely (but not yet decided). Backporting to 2.7 is
|
||||
less likely but not ruled out.
|
||||
The module currently targets 3.3, and I will make it available on PyPI for
|
||||
3.3 for the foreseeable future. Backporting to older versions of the 3.x
|
||||
series is likely (but not yet decided). Backporting to 2.7 is less likely but
|
||||
not ruled out.
|
||||
|
||||
Q: Is this supposed to replace numpy?
|
||||
Is this supposed to replace numpy?
|
||||
----------------------------------
|
||||
|
||||
A: No. While it is likely to grow over the years (see open issues below)
|
||||
it is not aimed to replace, or even compete directly with, numpy. Numpy
|
||||
is a full-featured numeric library aimed at professionals, the nuclear
|
||||
reactor of numeric libraries in the Python ecosystem. This is just a
|
||||
battery, as in "batteries included", and is aimed at an intermediate
|
||||
level somewhere between "use numpy" and "roll your own version".
|
||||
No. While it is likely to grow over the years (see open issues below) it is
|
||||
not aimed to replace, or even compete directly with, numpy. Numpy is a
|
||||
full-featured numeric library aimed at professionals, the nuclear reactor of
|
||||
numeric libraries in the Python ecosystem. This is just a battery, as in
|
||||
"batteries included", and is aimed at an intermediate level somewhere between
|
||||
"use numpy" and "roll your own version".
|
||||
|
||||
|
||||
Future Work
|
||||
===========
|
||||
|
||||
- At this stage, I am unsure of the best API for multivariate statistical
|
||||
functions such as linear regression, correlation coefficient, and
|
||||
covariance. Possible APIs include:
|
||||
- At this stage, I am unsure of the best API for multivariate statistical
|
||||
functions such as linear regression, correlation coefficient, and covariance.
|
||||
Possible APIs include:
|
||||
|
||||
* Separate arguments for x and y data::
|
||||
|
||||
* Separate arguments for x and y data:
|
||||
function([x0, x1, ...], [y0, y1, ...])
|
||||
|
||||
* A single argument for (x, y) data:
|
||||
* A single argument for (x, y) data::
|
||||
|
||||
function([(x0, y0), (x1, y1), ...])
|
||||
|
||||
This API is preferred by GvR[24].
|
||||
This API is preferred by GvR [24]_.
|
||||
|
||||
* Selecting arbitrary columns from a 2D array::
|
||||
|
||||
* Selecting arbitrary columns from a 2D array:
|
||||
function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)
|
||||
|
||||
* Some combination of the above.
|
||||
|
||||
In the absence of a consensus on a preferred API for multivariate stats,
|
||||
I will defer including such multivariate functions until Python 3.5.
|
||||
In the absence of a consensus on a preferred API for multivariate stats, I will
|
||||
defer including such multivariate functions until Python 3.5.
|
||||
|
||||
- Likewise, functions for calculating probability of random variables and
|
||||
- Likewise, functions for calculating probability of random variables and
|
||||
inference testing (e.g. Student's t-test) will be deferred until 3.5.
|
||||
|
||||
- There is considerable interest in including one-pass functions that can
|
||||
calculate multiple statistics from data in iterator form, without having
|
||||
to convert to a list. The experimental "stats" package on PyPI includes
|
||||
co-routine versions of statistics functions. Including these will be
|
||||
deferred to 3.5.
|
||||
- There is considerable interest in including one-pass functions that can
|
||||
calculate multiple statistics from data in iterator form, without having to
|
||||
convert to a list. The experimental ``stats`` package on PyPI includes
|
||||
co-routine versions of statistics functions. Including these will be deferred
|
||||
to 3.5.
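
To make the three candidate call shapes for multivariate functions more
concrete, here is a sketch using sample covariance. All three function names
are hypothetical and are not part of this proposal; the point is only that
each shape can be reduced to the same underlying calculation.

```python
def covariance_xy(xs, ys):
    """Hypothetical: separate arguments for x and y data."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def covariance_pairs(pairs):
    """Hypothetical: a single argument of (x, y) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no data")
    xs, ys = zip(*pairs)  # transpose the pairs into two columns
    return covariance_xy(xs, ys)


def covariance_columns(rows, x, y):
    """Hypothetical: pick columns x and y out of a 2D array of rows."""
    rows = list(rows)
    return covariance_xy([row[x] for row in rows], [row[y] for row in rows])


a = covariance_xy([1, 2, 3], [2, 4, 6])
b = covariance_pairs([(1, 2), (2, 4), (3, 6)])
c = covariance_columns([[0, 1, 2, 9], [0, 2, 4, 9], [0, 3, 6, 9]], x=1, y=2)
print(a, b, c)
```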
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
[1] https://mail.python.org/pipermail/python-dev/2010-October/104721.html
|
||||
.. [1] https://mail.python.org/pipermail/python-dev/2010-October/104721.html
|
||||
|
||||
[2] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf
|
||||
.. [2] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf
|
||||
|
||||
[3] Gnumeric:
|
||||
.. [3] Gnumeric::
|
||||
https://projects.gnome.org/gnumeric/functions.shtml
|
||||
|
||||
LibreOffice:
|
||||
|
@ -499,60 +520,62 @@ References
|
|||
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
|
||||
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five
|
||||
|
||||
[4] Scipy: http://scipy-central.org/
|
||||
.. [4] Scipy: http://scipy-central.org/
|
||||
Numpy: http://www.numpy.org/
|
||||
|
||||
[5] http://wiki.scipy.org/Numpy_Functions_by_Category
|
||||
.. [5] http://wiki.scipy.org/Numpy_Functions_by_Category
|
||||
|
||||
[6] Tested with numpy 1.6.1 and Python 2.7.
|
||||
.. [6] Tested with numpy 1.6.1 and Python 2.7.
|
||||
|
||||
[7] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
|
||||
.. [7] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
|
||||
|
||||
[8] http://rosettacode.org/wiki/Standard_deviation
|
||||
.. [8] http://rosettacode.org/wiki/Standard_deviation
|
||||
|
||||
[9] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py
|
||||
.. [9] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py
|
||||
|
||||
[10] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration
|
||||
.. [10] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration
|
||||
|
||||
[11] http://www.r-project.org/
|
||||
.. [11] http://www.r-project.org/
|
||||
|
||||
[12] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx
|
||||
.. [12] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx
|
||||
|
||||
[13] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation
|
||||
.. [13] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation
|
||||
|
||||
[14] http://ruby-statsample.rubyforge.org/
|
||||
.. [14] http://ruby-statsample.rubyforge.org/
|
||||
|
||||
[15] http://www.php.net/manual/en/ref.stats.php
|
||||
.. [15] http://www.php.net/manual/en/ref.stats.php
|
||||
|
||||
[16] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.
|
||||
.. [16] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.
|
||||
|
||||
[17] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html
|
||||
.. [17] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html
|
||||
|
||||
[18] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html
|
||||
.. [18] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html
|
||||
|
||||
[19] http://mathworld.wolfram.com/Skewness.html
|
||||
.. [19] http://mathworld.wolfram.com/Skewness.html
|
||||
|
||||
[20] At least, tedious to those who don't like this sort of thing.
|
||||
.. [20] At least, tedious to those who don't like this sort of thing.
|
||||
|
||||
[21] https://mail.python.org/pipermail/python-ideas/2011-September/011524.html
|
||||
.. [21] https://mail.python.org/pipermail/python-ideas/2011-September/011524.html
|
||||
|
||||
[22] https://pypi.python.org/pypi/stats/
|
||||
.. [22] https://pypi.python.org/pypi/stats/
|
||||
|
||||
[23] https://mail.python.org/pipermail/python-ideas/2013-August/022630.html
|
||||
.. [23] https://mail.python.org/pipermail/python-ideas/2013-August/022630.html
|
||||
|
||||
[24] https://mail.python.org/pipermail/python-dev/2013-September/128429.html
|
||||
.. [24] https://mail.python.org/pipermail/python-dev/2013-September/128429.html
|
||||
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
This document has been placed in the public domain.
|
||||
This document has been placed in the public domain.
|
||||
|
||||
|
||||
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
indent-tabs-mode: nil
|
||||
sentence-end-double-space: t
|
||||
fill-column: 70
|
||||
coding: utf-8
|
||||
End:
|
||||
|
||||
..
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
indent-tabs-mode: nil
|
||||
sentence-end-double-space: t
|
||||
fill-column: 70
|
||||
coding: utf-8
|
||||
End:
|
||||
|
|