582 lines
21 KiB
Plaintext
582 lines
21 KiB
Plaintext
PEP: 450
|
|
Title: Adding A Statistics Module To The Standard Library
|
|
Version: $Revision$
|
|
Last-Modified: $Date$
|
|
Author: Steven D'Aprano <steve@pearwood.info>
|
|
Status: Final
|
|
Type: Standards Track
|
|
Content-Type: text/x-rst
|
|
Created: 01-Aug-2013
|
|
Python-Version: 3.4
|
|
Post-History: 13-Sep-2013
|
|
|
|
|
|
Abstract
|
|
========
|
|
|
|
This PEP proposes the addition of a module for common statistics functions such
|
|
as mean, median, variance and standard deviation to the Python standard
|
|
library. See also http://bugs.python.org/issue18606
|
|
|
|
|
|
Rationale
|
|
=========
|
|
|
|
The proposed statistics module is motivated by the "batteries included"
|
|
philosophy towards the Python standard library. Raymond Hettinger and other
|
|
senior developers have requested a quality statistics library that falls
|
|
somewhere in between high-end statistics libraries and ad hoc code. [1]_
|
|
Statistical functions such as mean, standard deviation and others are obvious
|
|
and useful batteries, familiar to any Secondary School student. Even cheap
|
|
scientific calculators typically include multiple statistical functions such
|
|
as:
|
|
|
|
- mean
|
|
- population and sample variance
|
|
- population and sample standard deviation
|
|
- linear regression
|
|
- correlation coefficient
|
|
|
|
Graphing calculators aimed at Secondary School students typically include all
|
|
of the above, plus some or all of:
|
|
|
|
- median
|
|
- mode
|
|
- functions for calculating the probability of random variables from the
|
|
normal, t, chi-squared, and F distributions
|
|
- inference on the mean
|
|
|
|
and others [2]_. Likewise spreadsheet applications such as Microsoft Excel,
|
|
LibreOffice and Gnumeric include rich collections of statistical
|
|
functions [3]_.
|
|
|
|
In contrast, Python currently has no standard way to calculate even the
|
|
simplest and most obvious statistical functions such as mean. For those who
|
|
need statistical functions in Python, there are two obvious solutions:
|
|
|
|
- install numpy and/or scipy [4]_;
|
|
|
|
- or use a Do It Yourself solution.
|
|
|
|
Numpy is perhaps the most full-featured solution, but it has a few
|
|
disadvantages:
|
|
|
|
- It may be overkill for many purposes. The documentation for numpy even warns
|
|
|
|
"It can be hard to know what functions are available in numpy. This is
|
|
not a complete list, but it does cover most of them."[5]_
|
|
|
|
and then goes on to list over 270 functions, only a small number of which are
|
|
related to statistics.
|
|
|
|
- Numpy is aimed at those doing heavy numerical work, and may be intimidating
|
|
to those who don't have a background in computational mathematics and
|
|
computer science. For example, ``numpy.mean`` takes four arguments::
|
|
|
|
mean(a, axis=None, dtype=None, out=None)
|
|
|
|
although fortunately for the beginner or casual numpy user, three are
|
|
optional and ``numpy.mean`` does the right thing in simple cases::
|
|
|
|
>>> numpy.mean([1, 2, 3, 4])
|
|
2.5
|
|
|
|
- For many people, installing numpy may be difficult or impossible. For
|
|
example, people in corporate environments may have to go through a difficult,
|
|
time-consuming process before being permitted to install third-party
|
|
software. For the casual Python user, having to learn about installing
|
|
third-party packages in order to average a list of numbers is unfortunate.
|
|
|
|
This leads to option number 2, DIY statistics functions. At first glance, this
|
|
appears to be an attractive option, due to the apparent simplicity of common
|
|
statistical functions. For example::
|
|
|
|
def mean(data):
|
|
return sum(data)/len(data)
|
|
|
|
def variance(data):
|
|
# Use the Computational Formula for Variance.
|
|
n = len(data)
|
|
ss = sum(x**2 for x in data) - (sum(data)**2)/n
|
|
return ss/(n-1)
|
|
|
|
def standard_deviation(data):
|
|
return math.sqrt(variance(data))
|
|
|
|
The above appears to be correct with a casual test::
|
|
|
|
>>> data = [1, 2, 4, 5, 8]
|
|
>>> variance(data)
|
|
7.5
|
|
|
|
But adding a constant to every data point should not change the variance::
|
|
|
|
>>> data = [x+1e12 for x in data]
|
|
>>> variance(data)
|
|
0.0
|
|
|
|
And variance should *never* be negative::
|
|
|
|
>>> variance(data*100)
|
|
-1239429440.1282566
|
|
|
|
By contrast, the proposed reference implementation gets the exactly correct
|
|
answer 7.5 for the first two examples, and a reasonably close answer for the
|
|
third: 6.012. numpy does no better [6]_.
|
|
|
|
Even simple statistical calculations contain traps for the unwary, starting
|
|
with the Computational Formula itself. Despite the name, it is numerically
|
|
unstable and can be extremely inaccurate, as can be seen above. It is
|
|
completely unsuitable for computation by computer [7]_. This problem plagues
|
|
users of many programming language, not just Python [8]_, as coders reinvent
|
|
the same numerically inaccurate code over and over again [9]_, or advise others
|
|
to do so [10]_.
|
|
|
|
It isn't just the variance and standard deviation. Even the mean is not quite
|
|
as straightforward as it might appear. The above implementation seems too
|
|
simple to have problems, but it does:
|
|
|
|
- The built-in ``sum`` can lose accuracy when dealing with floats of wildly
|
|
differing magnitude. Consequently, the above naive ``mean`` fails this
|
|
"torture test"::
|
|
|
|
assert mean([1e30, 1, 3, -1e30]) == 1
|
|
|
|
returning 0 instead of 1, a purely computational error of 100%.
|
|
|
|
- Using ``math.fsum`` inside ``mean`` will make it more accurate with float
|
|
data, but it also has the side-effect of converting any arguments to float
|
|
even when unnecessary. E.g. we should expect the mean of a list of Fractions
|
|
to be a Fraction, not a float.
|
|
|
|
While the above mean implementation does not fail quite as catastrophically as
|
|
the naive variance does, a standard library function can do much better than
|
|
the DIY versions.
|
|
|
|
The example above involves an especially bad set of data, but even for more
|
|
realistic data sets accuracy is important. The first step in interpreting
|
|
variation in data (including dealing with ill-conditioned data) is often to
|
|
standardize it to a series with variance 1 (and often mean 0). This
|
|
standardization requires accurate computation of the mean and variance of the
|
|
raw series. Naive computation of mean and variance can lose precision very
|
|
quickly. Because precision bounds accuracy, it is important to use the most
|
|
precise algorithms for computing mean and variance that are practical, or the
|
|
results of standardization are themselves useless.
|
|
|
|
|
|
Comparison To Other Languages/Packages
|
|
======================================
|
|
|
|
The proposed statistics library is not intended to be a competitor to such
|
|
third-party libraries as numpy/scipy, or of proprietary full-featured
|
|
statistics packages aimed at professional statisticians such as Minitab, SAS
|
|
and Matlab. It is aimed at the level of graphing and scientific calculators.
|
|
|
|
Most programming languages have little or no built-in support for statistics
|
|
functions. Some exceptions:
|
|
|
|
R
|
|
-
|
|
|
|
R (and its proprietary cousin, S) is a programming language designed for
|
|
statistics work. It is extremely popular with statisticians and is extremely
|
|
feature-rich [11]_.
|
|
|
|
C#
|
|
--
|
|
|
|
The C# LINQ package includes extension methods to calculate the average of
|
|
enumerables [12]_.
|
|
|
|
Ruby
|
|
----
|
|
|
|
Ruby does not ship with a standard statistics module, despite some apparent
|
|
demand [13]_. Statsample appears to be a feature-rich third-party library,
|
|
aiming to compete with R [14]_.
|
|
|
|
PHP
|
|
---
|
|
|
|
PHP has an extremely feature-rich (although mostly undocumented) set of
|
|
advanced statistical functions [15]_.
|
|
|
|
Delphi
|
|
------
|
|
|
|
Delphi includes standard statistical functions including Mean, Sum,
|
|
Variance, TotalVariance, MomentSkewKurtosis in its Math library [16]_.
|
|
|
|
GNU Scientific Library
|
|
----------------------
|
|
|
|
The GNU Scientific Library includes standard statistical functions,
|
|
percentiles, median and others [17]_. One innovation I have borrowed from the
|
|
GSL is to allow the caller to optionally specify the pre-calculated mean of
|
|
the sample (or an a priori known population mean) when calculating the variance
|
|
and standard deviation [18]_.
|
|
|
|
|
|
Design Decisions Of The Module
|
|
==============================
|
|
|
|
My intention is to start small and grow the library as needed, rather than try
|
|
to include everything from the start. Consequently, the current reference
|
|
implementation includes only a small number of functions: mean, variance,
|
|
standard deviation, median, mode. (See the reference implementation for a full
|
|
list.)
|
|
|
|
I have aimed for the following design features:
|
|
|
|
- Correctness over speed. It is easier to speed up a correct but slow function
|
|
than to correct a fast but buggy one.
|
|
|
|
- Concentrate on data in sequences, allowing two-passes over the data, rather
|
|
than potentially compromise on accuracy for the sake of a one-pass algorithm.
|
|
Functions expect data will be passed as a list or other sequence; if given an
|
|
iterator, they may internally convert to a list.
|
|
|
|
- Functions should, as much as possible, honour any type of numeric data. E.g.
|
|
the mean of a list of Decimals should be a Decimal, not a float. When this is
|
|
not possible, treat float as the "lowest common data type".
|
|
|
|
- Although functions support data sets of floats, Decimals or Fractions, there
|
|
is no guarantee that *mixed* data sets will be supported. (But on the other
|
|
hand, they aren't explicitly rejected either.)
|
|
|
|
- Plenty of documentation, aimed at readers who understand the basic concepts
|
|
but may not know (for example) which variance they should use (population or
|
|
sample?). Mathematicians and statisticians have a terrible habit of being
|
|
inconsistent with both notation and terminology [19]_, and having spent many
|
|
hours making sense of the contradictory/confusing definitions in use, it is
|
|
only fair that I do my best to clarify rather than obfuscate the topic.
|
|
|
|
- But avoid going into tedious [20]_ mathematical detail.
|
|
|
|
|
|
API
|
|
===
|
|
|
|
The initial version of the library will provide univariate (single variable)
|
|
statistics functions. The general API will be based on a functional model
|
|
``function(data, ...) -> result``, where ``data`` is a mandatory iterable of
|
|
(usually) numeric data.
|
|
|
|
The author expects that lists will be the most common data type used, but any
|
|
iterable type should be acceptable. Where necessary, functions may convert to
|
|
lists internally. Where possible, functions are expected to conserve the type
|
|
of the data values, for example, the mean of a list of Decimals should be a
|
|
Decimal rather than float.
|
|
|
|
|
|
Calculating mean, median and mode
|
|
---------------------------------
|
|
|
|
The ``mean``, ``median*`` and ``mode`` functions take a single mandatory
|
|
argument and return the appropriate statistic, e.g.::
|
|
|
|
>>> mean([1, 2, 3])
|
|
2.0
|
|
|
|
Functions provided are:
|
|
|
|
* ``mean(data)``
|
|
arithmetic mean of *data*.
|
|
|
|
* ``median(data)``
|
|
median (middle value) of *data*, taking the average of the two
|
|
middle values when there are an even number of values.
|
|
|
|
* ``median_high(data)``
|
|
high median of *data*, taking the larger of the two middle
|
|
values when the number of items is even.
|
|
|
|
* ``median_low(data)``
|
|
low median of *data*, taking the smaller of the two middle
|
|
values when the number of items is even.
|
|
|
|
* ``median_grouped(data, interval=1)``
|
|
50th percentile of grouped *data*, using interpolation.
|
|
|
|
* ``mode(data)``
|
|
most common *data* point.
|
|
|
|
``mode`` is the sole exception to the rule that the data argument must be
|
|
numeric. It will also accept an iterable of nominal data, such as strings.
|
|
|
|
|
|
Calculating variance and standard deviation
|
|
-------------------------------------------
|
|
|
|
In order to be similar to scientific calculators, the statistics module will
|
|
include separate functions for population and sample variance and standard
|
|
deviation. All four functions have similar signatures, with a single mandatory
|
|
argument, an iterable of numeric data, e.g.::
|
|
|
|
>>> variance([1, 2, 2, 2, 3])
|
|
0.5
|
|
|
|
All four functions also accept a second, optional, argument, the mean of the
|
|
data. This is modelled on a similar API provided by the GNU Scientific
|
|
Library [18]_. There are three use-cases for using this argument, in no
|
|
particular order:
|
|
|
|
1) The value of the mean is known *a priori*.
|
|
|
|
2) You have already calculated the mean, and wish to avoid calculating
|
|
it again.
|
|
|
|
3) You wish to (ab)use the variance functions to calculate the second
|
|
moment about some given point other than the mean.
|
|
|
|
In each case, it is the caller's responsibility to ensure that given
|
|
argument is meaningful.
|
|
|
|
Functions provided are:
|
|
|
|
* ``variance(data, xbar=None)``
|
|
sample variance of *data*, optionally using *xbar* as the sample mean.
|
|
|
|
* ``stdev(data, xbar=None)``
|
|
sample standard deviation of *data*, optionally using *xbar* as the
|
|
sample mean.
|
|
|
|
* ``pvariance(data, mu=None)``
|
|
population variance of *data*, optionally using *mu* as the population
|
|
mean.
|
|
|
|
* ``pstdev(data, mu=None)``
|
|
population standard deviation of *data*, optionally using *mu* as the
|
|
population mean.
|
|
|
|
Other functions
|
|
---------------
|
|
|
|
There is one other public function:
|
|
|
|
* ``sum(data, start=0)``
|
|
high-precision sum of numeric *data*.
|
|
|
|
|
|
Specification
|
|
=============
|
|
|
|
As the proposed reference implementation is in pure Python, other Python
|
|
implementations can easily make use of the module unchanged, or adapt it as
|
|
they see fit.
|
|
|
|
|
|
What Should Be The Name Of The Module?
|
|
======================================
|
|
|
|
This will be a top-level module ``statistics``.
|
|
|
|
There was some interest in turning ``math`` into a package, and making this a
|
|
sub-module of ``math``, but the general consensus eventually agreed on a
|
|
top-level module. Other potential but rejected names included ``stats`` (too
|
|
much risk of confusion with existing ``stat`` module), and ``statslib``
|
|
(described as "too C-like").
|
|
|
|
|
|
Discussion And Resolved Issues
|
|
==============================
|
|
|
|
This proposal has been previously discussed here [21]_.
|
|
|
|
A number of design issues were resolved during the discussion on Python-Ideas
|
|
and the initial code review. There was a lot of concern about the addition of
|
|
yet another ``sum`` function to the standard library, see the FAQs below for
|
|
more details. In addition, the initial implementation of ``sum`` suffered from
|
|
some rounding issues and other design problems when dealing with Decimals.
|
|
Oscar Benjamin's assistance in resolving this was invaluable.
|
|
|
|
Another issue was the handling of data in the form of iterators. The first
|
|
implementation of variance silently swapped between a one- and two-pass
|
|
algorithm, depending on whether the data was in the form of an iterator or
|
|
sequence. This proved to be a design mistake, as the calculated variance could
|
|
differ slightly depending on the algorithm used, and ``variance`` etc. were
|
|
changed to internally generate a list and always use the more accurate two-pass
|
|
implementation.
|
|
|
|
One controversial design involved the functions to calculate median, which were
|
|
implemented as attributes on the ``median`` callable, e.g. ``median``,
|
|
``median.low``, ``median.high`` etc. Although there is at least one existing
|
|
use of this style in the standard library, in ``unittest.mock``, the code
|
|
reviewers felt that this was too unusual for the standard library.
|
|
Consequently, the design has been changed to a more traditional design of
|
|
separate functions with a pseudo-namespace naming convention, ``median_low``,
|
|
``median_high``, etc.
|
|
|
|
Another issue that was of concern to code reviewers was the existence of a
|
|
function calculating the sample mode of continuous data, with some people
|
|
questioning the choice of algorithm, and whether it was a sufficiently common
|
|
need to be included. So it was dropped from the API, and ``mode`` now
|
|
implements only the basic schoolbook algorithm based on counting unique values.
|
|
|
|
Another significant point of discussion was calculating statistics of
|
|
``timedelta`` objects. Although the statistics module will not directly
|
|
support ``timedelta`` objects, it is possible to support this use-case by
|
|
converting them to numbers first using the ``timedelta.total_seconds`` method.
|
|
|
|
|
|
Frequently Asked Questions
|
|
==========================
|
|
|
|
Shouldn't this module spend time on PyPI before being considered for the standard library?
|
|
------------------------------------------------------------------------------------------
|
|
|
|
Older versions of this module have been available on PyPI [22]_ since 2010.
|
|
Being much simpler than numpy, it does not require many years of external
|
|
development.
|
|
|
|
Does the standard library really need yet another version of ``sum``?
|
|
---------------------------------------------------------------------
|
|
|
|
This proved to be the most controversial part of the reference implementation.
|
|
In one sense, clearly three sums is two too many. But in another sense, yes.
|
|
The reasons why the two existing versions are unsuitable are described
|
|
here [23]_ but the short summary is:
|
|
|
|
- the built-in sum can lose precision with floats;
|
|
|
|
- the built-in sum accepts any non-numeric data type that supports the ``+``
|
|
operator, apart from strings and bytes;
|
|
|
|
- ``math.fsum`` is high-precision, but coerces all arguments to float.
|
|
|
|
There was some interest in "fixing" one or the other of the existing sums. If
|
|
this occurs before 3.4 feature-freeze, the decision to keep ``statistics.sum``
|
|
can be re-considered.
|
|
|
|
Will this module be backported to older versions of Python?
|
|
-----------------------------------------------------------
|
|
|
|
The module currently targets 3.3, and I will make it available on PyPI for
|
|
3.3 for the foreseeable future. Backporting to older versions of the 3.x
|
|
series is likely (but not yet decided). Backporting to 2.7 is less likely but
|
|
not ruled out.
|
|
|
|
Is this supposed to replace numpy?
|
|
----------------------------------
|
|
|
|
No. While it is likely to grow over the years (see open issues below) it is
|
|
not aimed to replace, or even compete directly with, numpy. Numpy is a
|
|
full-featured numeric library aimed at professionals, the nuclear reactor of
|
|
numeric libraries in the Python ecosystem. This is just a battery, as in
|
|
"batteries included", and is aimed at an intermediate level somewhere between
|
|
"use numpy" and "roll your own version".
|
|
|
|
|
|
Future Work
|
|
===========
|
|
|
|
- At this stage, I am unsure of the best API for multivariate statistical
|
|
functions such as linear regression, correlation coefficient, and covariance.
|
|
Possible APIs include:
|
|
|
|
* Separate arguments for x and y data::
|
|
|
|
function([x0, x1, ...], [y0, y1, ...])
|
|
|
|
* A single argument for (x, y) data::
|
|
|
|
function([(x0, y0), (x1, y1), ...])
|
|
|
|
This API is preferred by GvR [24]_.
|
|
|
|
* Selecting arbitrary columns from a 2D array::
|
|
|
|
function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)
|
|
|
|
* Some combination of the above.
|
|
|
|
In the absence of a consensus of preferred API for multivariate stats, I will
|
|
defer including such multivariate functions until Python 3.5.
|
|
|
|
- Likewise, functions for calculating probability of random variables and
|
|
inference testing (e.g. Student's t-test) will be deferred until 3.5.
|
|
|
|
- There is considerable interest in including one-pass functions that can
|
|
calculate multiple statistics from data in iterator form, without having to
|
|
convert to a list. The experimental ``stats`` package on PyPI includes
|
|
co-routine versions of statistics functions. Including these will be deferred
|
|
to 3.5.
|
|
|
|
|
|
References
|
|
==========
|
|
|
|
.. [1] https://mail.python.org/pipermail/python-dev/2010-October/104721.html
|
|
|
|
.. [2] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf
|
|
|
|
.. [3] Gnumeric::
|
|
https://projects.gnome.org/gnumeric/functions.shtml
|
|
|
|
LibreOffice:
|
|
https://help.libreoffice.org/Calc/Statistical_Functions_Part_One
|
|
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two
|
|
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three
|
|
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four
|
|
https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five
|
|
|
|
.. [4] Scipy: http://scipy-central.org/
|
|
Numpy: http://www.numpy.org/
|
|
|
|
.. [5] http://wiki.scipy.org/Numpy_Functions_by_Category
|
|
|
|
.. [6] Tested with numpy 1.6.1 and Python 2.7.
|
|
|
|
.. [7] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
|
|
|
|
.. [8] http://rosettacode.org/wiki/Standard_deviation
|
|
|
|
.. [9] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py
|
|
|
|
.. [10] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration
|
|
|
|
.. [11] http://www.r-project.org/
|
|
|
|
.. [12] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx
|
|
|
|
.. [13] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation
|
|
|
|
.. [14] http://ruby-statsample.rubyforge.org/
|
|
|
|
.. [15] http://www.php.net/manual/en/ref.stats.php
|
|
|
|
.. [16] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions.
|
|
|
|
.. [17] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html
|
|
|
|
.. [18] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html
|
|
|
|
.. [19] http://mathworld.wolfram.com/Skewness.html
|
|
|
|
.. [20] At least, tedious to those who don't like this sort of thing.
|
|
|
|
.. [21] https://mail.python.org/pipermail/python-ideas/2011-September/011524.html
|
|
|
|
.. [22] https://pypi.python.org/pypi/stats/
|
|
|
|
.. [23] https://mail.python.org/pipermail/python-ideas/2013-August/022630.html
|
|
|
|
.. [24] https://mail.python.org/pipermail/python-dev/2013-September/128429.html
|
|
|
|
|
|
Copyright
|
|
=========
|
|
|
|
This document has been placed in the public domain.
|
|
|
|
|
|
|
|
..
|
|
Local Variables:
|
|
mode: indented-text
|
|
indent-tabs-mode: nil
|
|
sentence-end-double-space: t
|
|
fill-column: 70
|
|
coding: utf-8
|
|
End:
|