PEP 450 update from Steven D'Aprano.

Guido van Rossum 2013-09-08 21:21:27 -07:00
parent e67cea86ef
commit 8747d13f25
1 changed files with 102 additions and 2 deletions


@ -15,7 +15,7 @@ Abstract
This PEP proposes the addition of a module for common statistics functions
such as mean, median, variance and standard deviation to the Python
standard library.
standard library. See also http://bugs.python.org/issue18606
Rationale
@ -250,6 +250,102 @@ Design Decisions Of The Module
- But avoid going into tedious[20] mathematical detail.
API
The initial version of the library will provide univariate (single
variable) statistics functions. The general API will be based on a
functional model ``function(data, ...) -> result``, where ``data``
is a mandatory iterable of (usually) numeric data.
The author expects that lists will be the most common data type used,
but any iterable type should be acceptable. Where necessary, functions
may convert to lists internally. Where possible, functions are
expected to preserve the type of the data values: for example, the mean
of a list of Decimals should be a Decimal rather than a float.
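As a rough illustration of the type-preservation goal, the sketch below assumes that the ``statistics`` module later shipped in the standard library (Python 3.4 and newer) behaves as this proposal describes. The sample values are made up for the example.

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# Illustrative inputs only; any iterable of numbers should be acceptable.
decimal_mean = mean([Decimal("1.5"), Decimal("2.5"), Decimal("3.5")])
fraction_mean = mean([Fraction(1, 2), Fraction(3, 2)])
iterator_mean = mean(iter([1.0, 2.0, 3.0]))  # a one-shot iterator, not a list

# The result type follows the type of the data values.
print(type(decimal_mean).__name__, decimal_mean)
print(type(fraction_mean).__name__, fraction_mean)
print(type(iterator_mean).__name__, iterator_mean)
```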
Calculating mean, median and mode
The ``mean``, ``median*`` and ``mode`` functions take a single
mandatory argument and return the appropriate statistic, e.g.:
>>> mean([1, 2, 3])
2.0
Functions provided are:
* mean(data) -> arithmetic mean of data.
* median(data) -> median (middle value) of data, taking the
average of the two middle values when there are an even
number of values.
* median_high(data) -> high median of data, taking the
larger of the two middle values when the number of items
is even.
* median_low(data) -> low median of data, taking the smaller
of the two middle values when the number of items is even.
* median_grouped(data, interval=1) -> 50th percentile of
grouped data, using interpolation.
* mode(data) -> most common data point.
``mode`` is the sole exception to the rule that the data argument
must be numeric. It will also accept an iterable of nominal data,
such as strings.
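The differences between the median variants are easiest to see on data with an even number of items. The following sketch assumes the ``statistics`` module as later shipped in the standard library; the data values are illustrative only.

```python
from statistics import median, median_grouped, median_high, median_low, mode

even_data = [1, 3, 5, 7]  # even count, so the two middle values are 3 and 5

mid = median(even_data)        # average of the two middle values
low = median_low(even_data)    # smaller of the two middle values
high = median_high(even_data)  # larger of the two middle values

# Grouped data: each value is treated as the midpoint of a class
# of width `interval`, and the median is found by interpolation.
grouped = median_grouped([52, 52, 53, 54], interval=1)

# mode also accepts nominal (non-numeric) data such as strings.
most_common = mode(["red", "blue", "blue", "green"])

print(mid, low, high, grouped, most_common)
```

Note that ``median_low`` and ``median_high`` always return a value that is actually present in the data, which may matter for discrete data.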
Calculating variance and standard deviation
In order to be similar to scientific calculators, the statistics
module will include separate functions for population and sample
variance and standard deviation. All four functions have similar
signatures, with a single mandatory argument, an iterable of
numeric data, e.g.:
>>> variance([1, 2, 2, 2, 3])
0.5
All four functions also accept a second, optional, argument, the
mean of the data. This is modelled on a similar API provided by
the GNU Scientific Library[18]. There are three use-cases for
using this argument, in no particular order:
1) The value of the mean is known *a priori*.
2) You have already calculated the mean, and wish to avoid
calculating it again.
3) You wish to (ab)use the variance functions to calculate
the second moment about some given point other than the
mean.
In each case, it is the caller's responsibility to ensure that
the given argument is meaningful.
Functions provided are:
* variance(data, xbar=None) -> sample variance of data,
optionally using xbar as the sample mean.
* stdev(data, xbar=None) -> sample standard deviation of
data, optionally using xbar as the sample mean.
* pvariance(data, mu=None) -> population variance of data,
optionally using mu as the population mean.
* pstdev(data, mu=None) -> population standard deviation of
data, optionally using mu as the population mean.
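A minimal sketch of the second use-case (reusing a mean that has already been calculated), assuming the ``statistics`` module as later shipped in the standard library. The data set is the one from the ``variance`` example above.

```python
import math
from statistics import mean, pstdev, pvariance, stdev, variance

data = [1, 2, 2, 2, 3]
xbar = mean(data)  # computed once, then passed to each function below

sample_var = variance(data)              # mean is computed internally
sample_var_reuse = variance(data, xbar)  # mean is supplied by the caller
sample_sd = stdev(data, xbar)

# For the population functions the optional argument is named mu.
population_var = pvariance(data, mu=xbar)
population_sd = pstdev(data, mu=xbar)

print(sample_var, sample_var_reuse, sample_sd)
print(population_var, population_sd)
print(math.isclose(sample_sd ** 2, sample_var))
```

The third use-case is deliberately left out of the sketch: the result for a point other than the mean depends on how the implementation treats the argument, so it should be checked against the implementation in use.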
Other functions
There is one other public function:
* sum(data, start=0) -> high-precision sum of numeric data.
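This section does not say how the high-precision sum is computed. To illustrate the motivation only, the sketch below uses ``math.fsum`` from the standard library as a stand-in; it is an assumption made for the example, not the function proposed here.

```python
import math

values = [0.1] * 10  # ten copies of a value with no exact binary representation

# The built-in sum adds floats left to right and, depending on the
# Python version, may accumulate rounding error along the way.
naive_total = sum(values)

# math.fsum tracks intermediate partial sums to avoid loss of precision.
precise_total = math.fsum(values)

print(naive_total, precise_total)
```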
Specification
As the proposed reference implementation is in pure Python,
@ -317,7 +413,7 @@ Frequently Asked Questions
level somewhere between "use numpy" and "roll your own version".
Open and Deferred Issues
Future Work
- At this stage, I am unsure of the best API for multivariate statistical
functions such as linear regression, correlation coefficient, and
@ -329,6 +425,8 @@ Open and Deferred Issues
* A single argument for (x, y) data:
function([(x0, y0), (x1, y1), ...])
This API is preferred by GvR[24].
* Selecting arbitrary columns from a 2D array:
function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)
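To make the two calling conventions above concrete, here is a hypothetical sketch using sample covariance as the statistic. The names ``covariance_pairs`` and ``covariance_columns`` are invented for this example and are not part of the proposal.

```python
def covariance_pairs(pairs):
    """Hypothetical: sample covariance from [(x0, y0), (x1, y1), ...]."""
    pairs = list(pairs)  # allow any iterable, including generators
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)


def covariance_columns(rows, x, y):
    """Hypothetical: same statistic, selecting columns x and y from 2D rows."""
    return covariance_pairs((row[x], row[y]) for row in rows)


# Single argument of (x, y) pairs.
pairs_result = covariance_pairs([(1, 2), (2, 4), (3, 6)])

# Arbitrary columns selected from a 2D array; columns 0 and 3 are ignored.
rows = [[0, 1, 2, 9], [0, 2, 4, 9], [0, 3, 6, 9]]
columns_result = covariance_columns(rows, x=1, y=2)

print(pairs_result, columns_result)
```

In this sketch the column-selecting form is a thin wrapper around the pairs form, which suggests the two conventions need not be mutually exclusive.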
@ -404,6 +502,8 @@ References
[23] http://mail.python.org/pipermail/python-ideas/2013-August/022630.html
[24] https://mail.python.org/pipermail/python-dev/2013-September/128429.html
Copyright