559 lines
22 KiB
Plaintext
559 lines
22 KiB
Plaintext
PEP: 450
|
||
Title: Adding A Statistics Module To The Standard Library
|
||
Version: $Revision$
|
||
Last-Modified: $Date$
|
||
Author: Steven D'Aprano <steve@pearwood.info>
|
||
Status: Final
|
||
Type: Standards Track
|
||
Content-Type: text/plain
|
||
Created: 01-Aug-2013
|
||
Python-Version: 3.4
|
||
Post-History: 13-Sep-2013
|
||
|
||
|
||
Abstract
|
||
|
||
This PEP proposes the addition of a module for common statistics functions
|
||
such as mean, median, variance and standard deviation to the Python
|
||
standard library. See also http://bugs.python.org/issue18606
|
||
|
||
|
||
Rationale
|
||
|
||
The proposed statistics module is motivated by the "batteries included"
|
||
philosophy towards the Python standard library. Raymond Hettinger and
|
||
other senior developers have requested a quality statistics library that
|
||
falls somewhere in between high-end statistics libraries and ad hoc
|
||
code.[1] Statistical functions such as mean, standard deviation and others
|
||
are obvious and useful batteries, familiar to any Secondary School student.
|
||
Even cheap scientific calculators typically include multiple statistical
|
||
functions such as:
|
||
|
||
- mean
|
||
- population and sample variance
|
||
- population and sample standard deviation
|
||
- linear regression
|
||
- correlation coefficient
|
||
|
||
Graphing calculators aimed at Secondary School students typically
|
||
include all of the above, plus some or all of:
|
||
|
||
- median
|
||
- mode
|
||
- functions for calculating the probability of random variables
|
||
from the normal, t, chi-squared, and F distributions
|
||
- inference on the mean
|
||
|
||
and others[2]. Likewise spreadsheet applications such as Microsoft Excel,
|
||
LibreOffice and Gnumeric include rich collections of statistical
|
||
functions[3].
|
||
|
||
In contrast, Python currently has no standard way to calculate even the
|
||
simplest and most obvious statistical functions such as mean. For those
|
||
who need statistical functions in Python, there are two obvious solutions:
|
||
|
||
- install numpy and/or scipy[4];
|
||
|
||
- or use a Do It Yourself solution.
|
||
|
||
Numpy is perhaps the most full-featured solution, but it has a few
|
||
disadvantages:
|
||
|
||
- It may be overkill for many purposes. The documentation for numpy even
|
||
warns
|
||
|
||
"It can be hard to know what functions are available in
|
||
numpy. This is not a complete list, but it does cover
|
||
most of them."[5]
|
||
|
||
and then goes on to list over 270 functions, only a small number of
|
||
which are related to statistics.
|
||
|
||
- Numpy is aimed at those doing heavy numerical work, and may be
|
||
intimidating to those who don't have a background in computational
|
||
mathematics and computer science. For example, numpy.mean takes four
|
||
arguments:
|
||
|
||
mean(a, axis=None, dtype=None, out=None)
|
||
|
||
although fortunately for the beginner or casual numpy user, three are
|
||
optional and numpy.mean does the right thing in simple cases:
|
||
|
||
>>> numpy.mean([1, 2, 3, 4])
|
||
2.5
|
||
|
||
- For many people, installing numpy may be difficult or impossible. For
|
||
example, people in corporate environments may have to go through a
|
||
difficult, time-consuming process before being permitted to install
|
||
third-party software. For the casual Python user, having to learn about
|
||
installing third-party packages in order to average a list of numbers is
|
||
unfortunate.
|
||
|
||
This leads to option number 2, DIY statistics functions. At first glance,
|
||
this appears to be an attractive option, due to the apparent simplicity of
|
||
common statistical functions. For example:
|
||
|
||
def mean(data):
|
||
return sum(data)/len(data)
|
||
|
||
def variance(data):
|
||
# Use the Computational Formula for Variance.
|
||
n = len(data)
|
||
ss = sum(x**2 for x in data) - (sum(data)**2)/n
|
||
return ss/(n-1)
|
||
|
||
def standard_deviation(data):
|
||
return math.sqrt(variance(data))
|
||
|
||
The above appears to be correct with a casual test:
|
||
|
||
>>> data = [1, 2, 4, 5, 8]
|
||
>>> variance(data)
|
||
7.5
|
||
|
||
But adding a constant to every data point should not change the variance:
|
||
|
||
>>> data = [x+1e12 for x in data]
|
||
>>> variance(data)
|
||
0.0
|
||
|
||
And variance should *never* be negative:
|
||
|
||
>>> variance(data*100)
|
||
-1239429440.1282566
|
||
|
||
By contrast, the proposed reference implementation gets the exactly correct
|
||
answer 7.5 for the first two examples, and a reasonably close answer for
|
||
the third: 6.012. numpy does no better[6].
|
||
|
||
Even simple statistical calculations contain traps for the unwary, starting
|
||
with the Computational Formula itself. Despite the name, it is numerically
|
||
unstable and can be extremely inaccurate, as can be seen above. It is
|
||
completely unsuitable for computation by computer[7]. This problem plagues
|
||
users of many programming language, not just Python[8], as coders reinvent
|
||
the same numerically inaccurate code over and over again[9], or advise
|
||
others to do so[10].
|
||
|
||
It isn't just the variance and standard deviation. Even the mean is not
|
||
quite as straight-forward as it might appear. The above implementation
|
||
seems too simple to have problems, but it does:
|
||
|
||
- The built-in sum can lose accuracy when dealing with floats of wildly
|
||
differing magnitude. Consequently, the above naive mean fails this
|
||
"torture test":
|
||
|
||
assert mean([1e30, 1, 3, -1e30]) == 1
|
||
|
||
returning 0 instead of 1, a purely computational error of 100%.
|
||
|
||
- Using math.fsum inside mean will make it more accurate with float data,
|
||
but it also has the side-effect of converting any arguments to float
|
||
even when unnecessary. E.g. we should expect the mean of a list of
|
||
Fractions to be a Fraction, not a float.
|
||
|
||
While the above mean implementation does not fail quite as catastrophically
|
||
as the naive variance does, a standard library function can do much better
|
||
than the DIY versions.
|
||
|
||
The example above involves an especially bad set of data, but even for
|
||
more realistic data sets accuracy is important. The first step in
|
||
interpreting variation in data (including dealing with ill-conditioned
|
||
data) is often to standardize it to a series with variance 1 (and often
|
||
mean 0). This standardization requires accurate computation of the mean
|
||
and variance of the raw series. Naive computation of mean and variance
|
||
can lose precision very quickly. Because precision bounds accuracy, it is
|
||
important to use the most precise algorithms for computing mean and
|
||
variance that are practical, or the results of standardization are
|
||
themselves useless.
|
||
|
||
|
||
Comparison To Other Languages/Packages
|
||
|
||
The proposed statistics library is not intended to be a competitor to such
|
||
third-party libraries as numpy/scipy, or of proprietary full-featured
|
||
statistics packages aimed at professional statisticians such as Minitab,
|
||
SAS and Matlab. It is aimed at the level of graphing and scientific
|
||
calculators.
|
||
|
||
Most programming languages have little or no built-in support for
|
||
statistics functions. Some exceptions:
|
||
|
||
R
|
||
R (and its proprietary cousin, S) is a programming language designed
|
||
for statistics work. It is extremely popular with statisticians and
|
||
is extremely feature-rich[11].
|
||
|
||
C#
|
||
|
||
The C# LINQ package includes extension methods to calculate the
|
||
average of enumerables[12].
|
||
|
||
Ruby
|
||
|
||
Ruby does not ship with a standard statistics module, despite some
|
||
apparent demand[13]. Statsample appears to be a feature-rich third-
|
||
party library, aiming to compete with R[14].
|
||
|
||
PHP
|
||
|
||
PHP has an extremely feature-rich (although mostly undocumented) set
|
||
of advanced statistical functions[15].
|
||
|
||
Delphi
|
||
|
||
Delphi includes standard statistical functions including Mean, Sum,
|
||
Variance, TotalVariance, MomentSkewKurtosis in its Math library[16].
|
||
|
||
GNU Scientific Library
|
||
|
||
The GNU Scientific Library includes standard statistical functions,
|
||
percentiles, median and others[17]. One innovation I have borrowed
|
||
from the GSL is to allow the caller to optionally specify the pre-
|
||
calculated mean of the sample (or an a priori known population mean)
|
||
when calculating the variance and standard deviation[18].
|
||
|
||
|
||
Design Decisions Of The Module
|
||
|
||
My intention is to start small and grow the library as needed, rather than
|
||
try to include everything from the start. Consequently, the current
|
||
reference implementation includes only a small number of functions: mean,
|
||
variance, standard deviation, median, mode. (See the reference
|
||
implementation for a full list.)
|
||
|
||
I have aimed for the following design features:
|
||
|
||
- Correctness over speed. It is easier to speed up a correct but slow
|
||
function than to correct a fast but buggy one.
|
||
|
||
- Concentrate on data in sequences, allowing two-passes over the data,
|
||
rather than potentially compromise on accuracy for the sake of a one-pass
|
||
algorithm. Functions expect data will be passed as a list or other
|
||
sequence; if given an iterator, they may internally convert to a list.
|
||
|
||
- Functions should, as much as possible, honour any type of numeric data.
|
||
E.g. the mean of a list of Decimals should be a Decimal, not a float.
|
||
When this is not possible, treat float as the "lowest common data type".
|
||
|
||
- Although functions support data sets of floats, Decimals or Fractions,
|
||
there is no guarantee that *mixed* data sets will be supported. (But on
|
||
the other hand, they aren't explicitly rejected either.)
|
||
|
||
- Plenty of documentation, aimed at readers who understand the basic
|
||
concepts but may not know (for example) which variance they should use
|
||
(population or sample?). Mathematicians and statisticians have a terrible
|
||
habit of being inconsistent with both notation and terminology[19], and
|
||
having spent many hours making sense of the contradictory/confusing
|
||
definitions in use, it is only fair that I do my best to clarify rather
|
||
than obfuscate the topic.
|
||
|
||
- But avoid going into tedious[20] mathematical detail.
|
||
|
||
|
||
API
|
||
|
||
The initial version of the library will provide univariate (single
|
||
variable) statistics functions. The general API will be based on a
|
||
functional model ``function(data, ...) -> result``, where ``data``
|
||
is a mandatory iterable of (usually) numeric data.
|
||
|
||
The author expects that lists will be the most common data type used,
|
||
but any iterable type should be acceptable. Where necessary, functions
|
||
may convert to lists internally. Where possible, functions are
|
||
expected to conserve the type of the data values, for example, the mean
|
||
of a list of Decimals should be a Decimal rather than float.
|
||
|
||
|
||
Calculating mean, median and mode
|
||
|
||
The ``mean``, ``median*`` and ``mode`` functions take a single
|
||
mandatory argument and return the appropriate statistic, e.g.:
|
||
|
||
>>> mean([1, 2, 3])
|
||
2.0
|
||
|
||
Functions provided are:
|
||
|
||
* mean(data) -> arithmetic mean of data.
|
||
|
||
* median(data) -> median (middle value) of data, taking the
|
||
average of the two middle values when there are an even
|
||
number of values.
|
||
|
||
* median_high(data) -> high median of data, taking the
|
||
larger of the two middle values when the number of items
|
||
is even.
|
||
|
||
* median_low(data) -> low median of data, taking the smaller
|
||
of the two middle values when the number of items is even.
|
||
|
||
* median_grouped(data, interval=1) -> 50th percentile of
|
||
grouped data, using interpolation.
|
||
|
||
* mode(data) -> most common data point.
|
||
|
||
``mode`` is the sole exception to the rule that the data argument
|
||
must be numeric. It will also accept an iterable of nominal data,
|
||
such as strings.
|
||
|
||
|
||
Calculating variance and standard deviation
|
||
|
||
In order to be similar to scientific calculators, the statistics
|
||
module will include separate functions for population and sample
|
||
variance and standard deviation. All four functions have similar
|
||
signatures, with a single mandatory argument, an iterable of
|
||
numeric data, e.g.:
|
||
|
||
>>> variance([1, 2, 2, 2, 3])
|
||
0.5
|
||
|
||
All four functions also accept a second, optional, argument, the
|
||
mean of the data. This is modelled on a similar API provided by
|
||
the GNU Scientific Library[18]. There are three use-cases for
|
||
using this argument, in no particular order:
|
||
|
||
1) The value of the mean is known *a priori*.
|
||
|
||
2) You have already calculated the mean, and wish to avoid
|
||
calculating it again.
|
||
|
||
3) You wish to (ab)use the variance functions to calculate
|
||
the second moment about some given point other than the
|
||
mean.
|
||
|
||
In each case, it is the caller's responsibility to ensure that
|
||
given argument is meaningful.
|
||
|
||
Functions provided are:
|
||
|
||
* variance(data, xbar=None) -> sample variance of data,
|
||
optionally using xbar as the sample mean.
|
||
|
||
* stdev(data, xbar=None) -> sample standard deviation of
|
||
data, optionally using xbar as the sample mean.
|
||
|
||
* pvariance(data, mu=None) -> population variance of data,
|
||
optionally using mu as the population mean.
|
||
|
||
* pstdev(data, mu=None) -> population standard deviation of
|
||
data, optionally using mu as the population mean.
|
||
|
||
Other functions
|
||
|
||
There is one other public function:
|
||
|
||
* sum(data, start=0) -> high-precision sum of numeric data.
|
||
|
||
|
||
Specification
|
||
|
||
As the proposed reference implementation is in pure Python,
|
||
other Python implementations can easily make use of the module
|
||
unchanged, or adapt it as they see fit.
|
||
|
||
|
||
What Should Be The Name Of The Module?
|
||
|
||
This will be a top-level module "statistics".
|
||
|
||
There was some interest in turning math into a package, and making this a
|
||
sub-module of math, but the general consensus eventually agreed on a
|
||
top-level module. Other potential but rejected names included "stats" (too
|
||
much risk of confusion with existing "stat" module), and "statslib"
|
||
(described as "too C-like").
|
||
|
||
|
||
Discussion And Resolved Issues
|
||
|
||
This proposal has been previously discussed here[21].
|
||
|
||
A number of design issues were resolved during the discussion on
|
||
Python-Ideas and the initial code review. There was a lot of concern
|
||
about the addition of yet another ``sum`` function to the standard
|
||
library, see the FAQs below for more details. In addition, the
|
||
initial implementation of ``sum`` suffered from some rounding issues
|
||
and other design problems when dealing with Decimals. Oscar
|
||
Benjamin's assistance in resolving this was invaluable.
|
||
|
||
Another issue was the handling of data in the form of iterators. The
|
||
first implementation of variance silently swapped between a one- and
|
||
two-pass algorithm, depending on whether the data was in the form of
|
||
an iterator or sequence. This proved to be a design mistake, as the
|
||
calculated variance could differ slightly depending on the algorithm
|
||
used, and ``variance`` etc. were changed to internally generate a list
|
||
and always use the more accurate two-pass implementation.
|
||
|
||
One controversial design involved the functions to calculate median,
|
||
which were implemented as attributes on the ``median`` callable, e.g.
|
||
``median``, ``median.low``, ``median.high`` etc. Although there is
|
||
at least one existing use of this style in the standard library, in
|
||
``unittest.mock``, the code reviewers felt that this was too unusual
|
||
for the standard library. Consequently, the design has been changed
|
||
to a more traditional design of separate functions with a pseudo-
|
||
namespace naming convention, ``median_low``, ``median_high``, etc.
|
||
|
||
Another issue that was of concern to code reviewers was the existence
|
||
of a function calculating the sample mode of continuous data, with
|
||
some people questioning the choice of algorithm, and whether it was
|
||
a sufficiently common need to be included. So it was dropped from
|
||
the API, and ``mode`` now implements only the basic schoolbook
|
||
algorithm based on counting unique values.
|
||
|
||
Another significant point of discussion was calculating statistics of
|
||
timedelta objects. Although the statistics module will not directly
|
||
support timedelta objects, it is possible to support this use-case by
|
||
converting them to numbers first using the ``timedelta.total_seconds``
|
||
method.
|
||
|
||
|
||
Frequently Asked Questions
|
||
|
||
Q: Shouldn't this module spend time on PyPI before being considered for
|
||
the standard library?
|
||
|
||
A: Older versions of this module have been available on PyPI[22] since
|
||
2010. Being much simpler than numpy, it does not require many years of
|
||
external development.
|
||
|
||
Q: Does the standard library really need yet another version of ``sum``?
|
||
|
||
A: This proved to be the most controversial part of the reference
|
||
implementation. In one sense, clearly three sums is two too many. But
|
||
in another sense, yes. The reasons why the two existing versions are
|
||
unsuitable are described here[23] but the short summary is:
|
||
|
||
- the built-in sum can lose precision with floats;
|
||
|
||
- the built-in sum accepts any non-numeric data type that supports
|
||
the + operator, apart from strings and bytes;
|
||
|
||
- math.fsum is high-precision, but coerces all arguments to float.
|
||
|
||
There was some interest in "fixing" one or the other of the existing
|
||
sums. If this occurs before 3.4 feature-freeze, the decision to keep
|
||
statistics.sum can be re-considered.
|
||
|
||
Q: Will this module be backported to older versions of Python?
|
||
|
||
A: The module currently targets 3.3, and I will make it available on PyPI
|
||
for 3.3 for the foreseeable future. Backporting to older versions of
|
||
the 3.x series is likely (but not yet decided). Backporting to 2.7 is
|
||
less likely but not ruled out.
|
||
|
||
Q: Is this supposed to replace numpy?
|
||
|
||
A: No. While it is likely to grow over the years (see open issues below)
|
||
it is not aimed to replace, or even compete directly with, numpy. Numpy
|
||
is a full-featured numeric library aimed at professionals, the nuclear
|
||
reactor of numeric libraries in the Python ecosystem. This is just a
|
||
battery, as in "batteries included", and is aimed at an intermediate
|
||
level somewhere between "use numpy" and "roll your own version".
|
||
|
||
|
||
Future Work
|
||
|
||
- At this stage, I am unsure of the best API for multivariate statistical
|
||
functions such as linear regression, correlation coefficient, and
|
||
covariance. Possible APIs include:
|
||
|
||
* Separate arguments for x and y data:
|
||
function([x0, x1, ...], [y0, y1, ...])
|
||
|
||
* A single argument for (x, y) data:
|
||
function([(x0, y0), (x1, y1), ...])
|
||
|
||
This API is preferred by GvR[24].
|
||
|
||
* Selecting arbitrary columns from a 2D array:
|
||
function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)
|
||
|
||
* Some combination of the above.
|
||
|
||
In the absence of a consensus of preferred API for multivariate stats,
|
||
I will defer including such multivariate functions until Python 3.5.
|
||
|
||
- Likewise, functions for calculating probability of random variables and
|
||
inference testing (e.g. Student's t-test) will be deferred until 3.5.
|
||
|
||
- There is considerable interest in including one-pass functions that can
|
||
calculate multiple statistics from data in iterator form, without having
|
||
to convert to a list. The experimental "stats" package on PyPI includes
|
||
co-routine versions of statistics functions. Including these will be
|
||
deferred to 3.5.
|
||
|
||
|
||
References
|
||
|
||
[1] http://mail.python.org/pipermail/python-dev/2010-October/104721.html
|
||
|
||
[2] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf
|
||
|
||
[3] Gnumeric:
|
||
https://projects.gnome.org/gnumeric/functions.shtml
|
||
|
||
LibreOffice:
|
||
https://help.libreoffice.org/Calc/Statistical_Functions_Part_One
|
||
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two
|
||
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three
|
||
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
|
||
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five
|
||
|
||
[4] Scipy: http://scipy-central.org/
|
||
Numpy: http://www.numpy.org/
|
||
|
||
[5] http://wiki.scipy.org/Numpy_Functions_by_Category
|
||
|
||
[6] Tested with numpy 1.6.1 and Python 2.7.
|
||
|
||
[7] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
|
||
|
||
[8] http://rosettacode.org/wiki/Standard_deviation
|
||
|
||
[9] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py
|
||
|
||
[10] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration
|
||
|
||
[11] http://www.r-project.org/
|
||
|
||
[12] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx
|
||
|
||
[13] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation
|
||
|
||
[14] http://ruby-statsample.rubyforge.org/
|
||
|
||
[15] http://www.php.net/manual/en/ref.stats.php
|
||
|
||
[16] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.
|
||
|
||
[17] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html
|
||
|
||
[18] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html
|
||
|
||
[19] http://mathworld.wolfram.com/Skewness.html
|
||
|
||
[20] At least, tedious to those who don't like this sort of thing.
|
||
|
||
[21] http://mail.python.org/pipermail/python-ideas/2011-September/011524.html
|
||
|
||
[22] https://pypi.python.org/pypi/stats/
|
||
|
||
[23] http://mail.python.org/pipermail/python-ideas/2013-August/022630.html
|
||
|
||
[24] https://mail.python.org/pipermail/python-dev/2013-September/128429.html
|
||
|
||
|
||
Copyright
|
||
|
||
This document has been placed in the public domain.
|
||
|
||
|
||
|
||
Local Variables:
|
||
mode: indented-text
|
||
indent-tabs-mode: nil
|
||
sentence-end-double-space: t
|
||
fill-column: 70
|
||
coding: utf-8
|
||
End:
|