PEP 450 update from Steven D'Aprano.
parent e67cea86ef
commit 8747d13f25
pep-0450.txt | 104
@@ -15,7 +15,7 @@ Abstract

This PEP proposes the addition of a module for common statistics functions
such as mean, median, variance and standard deviation to the Python
-standard library.
+standard library. See also http://bugs.python.org/issue18606


Rationale

@@ -250,6 +250,102 @@ Design Decisions Of The Module
- But avoid going into tedious[20] mathematical detail.


API

The initial version of the library will provide univariate (single
variable) statistics functions. The general API will be based on a
functional model ``function(data, ...) -> result``, where ``data``
is a mandatory iterable of (usually) numeric data.

The author expects that lists will be the most common data type used,
but any iterable type should be acceptable. Where necessary, functions
may convert to lists internally. Where possible, functions are
expected to conserve the type of the data values; for example, the mean
of a list of Decimals should be a Decimal rather than a float.
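
As a minimal sketch of the type conservation described above, assuming
the proposed functions are importable from a module named ``statistics``
(the name under which the module ships in Python 3.4 and later), the
mean of Decimals or Fractions keeps the type of its input, and any
iterable is accepted:

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean  # assumption: module is available as `statistics`

# The mean of Decimals is a Decimal, not a float.
decimal_mean = mean([Decimal("1.5"), Decimal("2.5"), Decimal("3.5")])

# The same applies to Fractions.
fraction_mean = mean([Fraction(1, 2), Fraction(3, 2)])

# Any iterable is accepted, not only lists; here, an iterator.
iterator_mean = mean(iter([1.0, 2.0, 3.0]))

print(repr(decimal_mean))   # Decimal('2.5')
print(repr(fraction_mean))  # Fraction(1, 1)
print(iterator_mean)        # 2.0
```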


Calculating mean, median and mode

The ``mean``, ``median*`` and ``mode`` functions take a single
mandatory argument and return the appropriate statistic, e.g.:

    >>> mean([1, 2, 3])
    2.0

Functions provided are:

* mean(data) -> arithmetic mean of data.

* median(data) -> median (middle value) of data, taking the
  average of the two middle values when there are an even
  number of values.

* median_high(data) -> high median of data, taking the
  larger of the two middle values when the number of items
  is even.

* median_low(data) -> low median of data, taking the smaller
  of the two middle values when the number of items is even.

* median_grouped(data, interval=1) -> 50th percentile of
  grouped data, using interpolation.

* mode(data) -> most common data point.

``mode`` is the sole exception to the rule that the data argument
must be numeric. It will also accept an iterable of nominal data,
such as strings.
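
The differences between the median variants listed above are easiest to
see on data with an even number of values. The following is a sketch
only, again assuming the functions are importable from a module named
``statistics``; the sample data is made up for illustration:

```python
from statistics import median, median_grouped, median_high, median_low, mode

even = [1, 3, 5, 7]  # even number of values: the two middle values are 3 and 5

print(median(even))       # 4.0, the average of 3 and 5
print(median_low(even))   # 3, the smaller of the two middle values
print(median_high(even))  # 5, the larger of the two middle values

# median_grouped treats each value as the midpoint of a class of width
# `interval` and interpolates within the class containing the median.
print(median_grouped([1, 3, 3, 5, 7], interval=1))  # 3.25

# mode also accepts nominal (non-numeric) data.
print(mode(["red", "blue", "blue", "green"]))  # blue
```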


Calculating variance and standard deviation

In order to be similar to scientific calculators, the statistics
module will include separate functions for population and sample
variance and standard deviation. All four functions have similar
signatures, with a single mandatory argument, an iterable of
numeric data, e.g.:

    >>> variance([1, 2, 2, 2, 3])
    0.5

All four functions also accept a second, optional, argument, the
mean of the data. This is modelled on a similar API provided by
the GNU Scientific Library[18]. There are three use-cases for
using this argument, in no particular order:

1) The value of the mean is known *a priori*.

2) You have already calculated the mean, and wish to avoid
   calculating it again.

3) You wish to (ab)use the variance functions to calculate
   the second moment about some given point other than the
   mean.

In each case, it is the caller's responsibility to ensure that
the given argument is meaningful.
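
Use-case 2 can be sketched as follows, assuming the functions are
importable from a module named ``statistics``. The mean is calculated
once and passed to both sample functions. Only the case where the
argument really is the mean of the data is shown here:

```python
from statistics import mean, stdev, variance

data = [1, 2, 2, 2, 3]

# The mean has already been calculated, so pass it in rather than
# have each function work it out again.
xbar = mean(data)
sample_variance = variance(data, xbar)
sample_stdev = stdev(data, xbar)

print(sample_variance)  # 0.5
```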

Functions provided are:

* variance(data, xbar=None) -> sample variance of data,
  optionally using xbar as the sample mean.

* stdev(data, xbar=None) -> sample standard deviation of
  data, optionally using xbar as the sample mean.

* pvariance(data, mu=None) -> population variance of data,
  optionally using mu as the population mean.

* pstdev(data, mu=None) -> population standard deviation of
  data, optionally using mu as the population mean.
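
The sample and population functions differ in the divisor applied to
the sum of squared deviations: ``n`` for the population, ``n - 1`` for
a sample. The sketch below checks the textbook formulas by hand against
the functions, assuming they are importable from a module named
``statistics``; the variable names are illustrative only:

```python
from statistics import pvariance, variance

data = [1, 2, 2, 2, 3]
n = len(data)
mu = sum(data) / n  # 2.0

# Sum of squared deviations from the mean: 1 + 0 + 0 + 0 + 1 = 2.
ss = sum((x - mu) ** 2 for x in data)

print(ss / n, pvariance(data))       # 0.4 0.4  (population: divide by n)
print(ss / (n - 1), variance(data))  # 0.5 0.5  (sample: divide by n - 1)
```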

Other functions

There is one other public function:

* sum(data, start=0) -> high-precision sum of numeric data.
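
The proposed ``sum`` is not reproduced here. As a stand-in, and purely
to illustrate why a high-precision sum matters for floats, the sketch
below compares the built-in ``sum`` with the existing ``math.fsum``.
It makes no claim about how the proposed function is implemented:

```python
import math

values = [0.1] * 10

# The built-in sum accumulates a rounding error on each addition.
print(sum(values))        # 0.9999999999999999

# math.fsum tracks the lost low-order bits and returns the correctly
# rounded result.
print(math.fsum(values))  # 1.0
```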


Specification

As the proposed reference implementation is in pure Python,
@@ -317,7 +413,7 @@ Frequently Asked Questions
level somewhere between "use numpy" and "roll your own version".


-Open and Deferred Issues
+Future Work

- At this stage, I am unsure of the best API for multivariate statistical
  functions such as linear regression, correlation coefficient, and
@@ -329,6 +425,8 @@ Open and Deferred Issues
* A single argument for (x, y) data:
    function([(x0, y0), (x1, y1), ...])

  This API is preferred by GvR[24].

* Selecting arbitrary columns from a 2D array:
    function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)

@@ -404,6 +502,8 @@ References

[23] http://mail.python.org/pipermail/python-ideas/2013-August/022630.html

[24] https://mail.python.org/pipermail/python-dev/2013-September/128429.html


Copyright