PEP: 450 Title: Adding A Statistics Module To The Standard Library Version: $Revision$ Last-Modified: $Date$ Author: Steven D'Aprano Status: Draft Type: Standards Track Content-Type: text/plain Created: 01-Aug-2013 Python-Version: 3.4 Post-History: Abstract This PEP proposes the addition of a module for common statistics functions such as mean, median, variance and standard deviation to the Python standard library. Rationale The proposed statistics module is motivated by the "batteries included" philosophy towards the Python standard library. Raymond Hettinger and other senior developers have requested a quality statistics library that falls somewhere in between high-end statistics libraries and ad hoc code.[1] Statistical functions such as mean, standard deviation and others are obvious and useful batteries, familiar to any Secondary School student. Even cheap scientific calculators typically include multiple statistical functions such as: - mean - population and sample variance - population and sample standard deviation - linear regression - correlation coefficient Graphing calculators aimed at Secondary School students typically include all of the above, plus some or all of: - median - mode - functions for calculating the probability of random variables from the normal, t, chi-squared, and F distributions - inference on the mean and others[2]. Likewise spreadsheet applications such as Microsoft Excel, LibreOffice and Gnumeric include rich collections of statistical functions[3]. In contrast, Python currently has no standard way to calculate even the simplest and most obvious statistical functions such as mean. For those who need statistical functions in Python, there are two obvious solutions: - install numpy and/or scipy[4]; - or use a Do It Yourself solution. Numpy is perhaps the most full-featured solution, but it has a few disadvantages: - It may be overkill for many purposes. The documentation for numpy even warns "It can be hard to know what functions are available in numpy. This is not a complete list, but it does cover most of them."[5] and then goes on to list over 270 functions, only a small number of which are related to statistics. - Numpy is aimed at those doing heavy numerical work, and may be intimidating to those who don't have a background in computational mathematics and computer science. For example, numpy.mean takes four arguments: mean(a, axis=None, dtype=None, out=None) although fortunately for the beginner or casual numpy user, three are optional and numpy.mean does the right thing in simple cases: >>> numpy.mean([1, 2, 3, 4]) 2.5 - For many people, installing numpy may be difficult or impossible. For example, people in corporate environments may have to go through a difficult, time-consuming process before being permitted to install third-party software. For the casual Python user, having to learn about installing third-party packages in order to average a list of numbers is unfortunate. This leads to option number 2, DIY statistics functions. At first glance, this appears to be an attractive option, due to the apparent simplicity of common statistical functions. For example: def mean(data): return sum(data)/len(data) def variance(data): # Use the Computational Formula for Variance. n = len(data) ss = sum(x**2 for x in data) - (sum(data)**2)/n return ss/(n-1) def standard_deviation(data): return math.sqrt(variance(data)) The above appears to be correct with a casual test: >>> data = [1, 2, 4, 5, 8] >>> variance(data) 7.5 But adding a constant to every data point should not change the variance: >>> data = [x+1e12 for x in data] >>> variance(data) 0.0 And variance should *never* be negative: >>> variance(data*100) -1239429440.1282566 By contrast, the proposed reference implementation gets the exactly correct answer 7.5 for the first two examples, and a reasonably close answer for the third: 6.012. numpy does no better[6]. Even simple statistical calculations contain traps for the unwary, starting with the Computational Formula itself. Despite the name, it is numerically unstable and can be extremely inaccurate, as can be seen above. It is completely unsuitable for computation by computer[7]. This problem plagues users of many programming language, not just Python[8], as coders reinvent the same numerically inaccurate code over and over again[9], or advise others to do so[10]. It isn't just the variance and standard deviation. Even the mean is not quite as straight-forward as it might appear. The above implementation seems too simple to have problems, but it does: - The built-in sum can lose accuracy when dealing with floats of wildly differing magnitude. Consequently, the above naive mean fails this "torture test": assert mean([1e30, 1, 3, -1e30]) == 1 returning 0 instead of 1, a purely computational error of 100%. - Using math.fsum inside mean will make it more accurate with float data, but it also has the side-effect of converting any arguments to float even when unnecessary. E.g. we should expect the mean of a list of Fractions to be a Fraction, not a float. While the above mean implementation does not fail quite as catastrophically as the naive variance does, a standard library function can do much better than the DIY versions. The example above involves an especially bad set of data, but even for more realistic data sets accuracy is important. The first step in interpreting variation in data (including dealing with ill-conditioned data) is often to standardize it to a series with variance 1 (and often mean 0). This standardization requires accurate computation of the mean and variance of the raw series. Naive computation of mean and variance can lose precision very quickly. Because precision bounds accuracy, it is important to use the most precise algorithms for computing mean and variance that are practical, or the results of standardization are themselves useless. Comparison To Other Languages/Packages The proposed statistics library is not intended to be a competitor to such third-party libraries as numpy/scipy, or of proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators. Most programming languages have little or no built-in support for statistics functions. Some exceptions: R R (and its proprietary cousin, S) is a programming language designed for statistics work. It is extremely popular with statisticians and is extremely feature-rich[11]. C# The C# LINQ package includes extension methods to calculate the average of enumerables[12]. Ruby Ruby does not ship with a standard statistics module, despite some apparent demand[13]. Statsample appears to be a feature-rich third- party library, aiming to compete with R[14]. PHP PHP has an extremely feature-rich (although mostly undocumented) set of advanced statistical functions[15]. Delphi Delphi includes standard statistical functions including Mean, Sum, Variance, TotalVariance, MomentSkewKurtosis in its Math library[16]. GNU Scientific Library The GNU Scientific Library includes standard statistical functions, percentiles, median and others[17]. One innovation I have borrowed from the GSL is to allow the caller to optionally specify the pre- calculated mean of the sample (or an a priori known population mean) when calculating the variance and standard deviation[18]. Design Decisions Of The Module My intention is to start small and grow the library as needed, rather than try to include everything from the start. Consequently, the current reference implementation includes only a small number of functions: mean, variance, standard deviation, median, mode. (See the reference implementation for a full list.) I have aimed for the following design features: - Correctness over speed. It is easier to speed up a correct but slow function than to correct a fast but buggy one. - Concentrate on data in sequences, allowing two-passes over the data, rather than potentially compromise on accuracy for the sake of a one-pass algorithm. Functions expect data will be passed as a list or other sequence; if given an iterator, they may internally convert to a list. - Functions should, as much as possible, honour any type of numeric data. E.g. the mean of a list of Decimals should be a Decimal, not a float. When this is not possible, treat float as the "lowest common data type". - Although functions support data sets of floats, Decimals or Fractions, there is no guarantee that *mixed* data sets will be supported. (But on the other hand, they aren't explicitly rejected either.) - Plenty of documentation, aimed at readers who understand the basic concepts but may not know (for example) which variance they should use (population or sample?). Mathematicians and statisticians have a terrible habit of being inconsistent with both notation and terminology[19], and having spent many hours making sense of the contradictory/confusing definitions in use, it is only fair that I do my best to clarify rather than obfuscate the topic. - But avoid going into tedious[20] mathematical detail. Specification As the proposed reference implementation is in pure Python, other Python implementations can easily make use of the module unchanged, or adapt it as they see fit. What Should Be The Name Of The Module? This will be a top-level module "statistics". There was some interest in turning math into a package, and making this a sub-module of math, but the general consensus eventually agreed on a top-level module. Other potential but rejected names included "stats" (too much risk of confusion with existing "stat" module), and "statslib" (described as "too C-like"). Previous Discussions This proposal has been previously discussed here[21]. Frequently Asked Questions Q: Shouldn't this module spend time on PyPI before being considered for the standard library? A: Older versions of this module have been available on PyPI[22] since 2010. Being much simpler than numpy, it does not require many years of external development. Q: Does the standard library really need yet another version of ``sum``? A: This proved to be the most controversial part of the reference implementation. In one sense, clearly three sums is two too many. But in another sense, yes. The reasons why the two existing versions are unsuitable are described here[23] but the short summary is: - the built-in sum can lose precision with floats; - the built-in sum accepts any non-numeric data type that supports the + operator, apart from strings and bytes; - math.fsum is high-precision, but coerces all arguments to float. There is some interest in "fixing" one or the other of the existing sums. If this occurs before 3.4 feature-freeze, the decision to keep statistics.sum can be re-considered. Q: Will this module be backported to older versions of Python? A: The module currently targets 3.3, and I will make it available on PyPI for 3.3 for the foreseeable future. Backporting to older versions of the 3.x series is likely (but not yet decided). Backporting to 2.7 is less likely but not ruled out. Q: Is this supposed to replace numpy? A: No. While it is likely to grow over the years (see open issues below) it is not aimed to replace, or even compete directly with, numpy. Numpy is a full-featured numeric library aimed at professionals, the nuclear reactor of numeric libraries in the Python ecosystem. This is just a battery, as in "batteries included", and is aimed at an intermediate level somewhere between "use numpy" and "roll your own version". Open and Deferred Issues - At this stage, I am unsure of the best API for multivariate statistical functions such as linear regression, correlation coefficient, and covariance. Possible APIs include: * Separate arguments for x and y data: function([x0, x1, ...], [y0, y1, ...]) * A single argument for (x, y) data: function([(x0, y0), (x1, y1), ...]) * Selecting arbitrary columns from a 2D array: function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2) * Some combination of the above. In the absence of a consensus of preferred API for multivariate stats, I will defer including such multivariate functions until Python 3.5. - Likewise, functions for calculating probability of random variables and inference testing (e.g. Student's t-test) will be deferred until 3.5. - There is considerable interest in including one-pass functions that can calculate multiple statistics from data in iterator form, without having to convert to a list. The experimental "stats" package on PyPI includes co-routine versions of statistics functions. Including these will be deferred to 3.5. References [1] http://mail.python.org/pipermail/python-dev/2010-October/104721.html [2] http://support.casio.com/pdf/004/CP330PLUSver310_Soft_E.pdf [3] Gnumeric: https://projects.gnome.org/gnumeric/functions.shtml LibreOffice: https://help.libreoffice.org/Calc/Statistical_Functions_Part_One https://help.libreoffice.org/Calc/Statistical_Functions_Part_Two https://help.libreoffice.org/Calc/Statistical_Functions_Part_Three https://help.libreoffice.org/Calc/Statistical_Functions_Part_Four https://help.libreoffice.org/Calc/Statistical_Functions_Part_Five [4] Scipy: http://scipy-central.org/ Numpy: http://www.numpy.org/ [5] http://wiki.scipy.org/Numpy_Functions_by_Category [6] Tested with numpy 1.6.1 and Python 2.7. [7] http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/ [8] http://rosettacode.org/wiki/Standard_deviation [9] https://bitbucket.org/larsyencken/simplestats/src/c42e048a6625/src/basic.py [10] http://stackoverflow.com/questions/2341340/calculate-mean-and-variance-with-one-iteration [11] http://www.r-project.org/ [12] http://msdn.microsoft.com/en-us/library/system.linq.enumerable.average.aspx [13] https://www.bcg.wisc.edu/webteam/support/ruby/standard_deviation [14] http://ruby-statsample.rubyforge.org/ [15] http://www.php.net/manual/en/ref.stats.php [16] http://www.ayton.id.au/gary/it/Delphi/D_maths.htm#Delphi%20Statistical%20functions. [17] http://www.gnu.org/software/gsl/manual/html_node/Statistics.html [18] http://www.gnu.org/software/gsl/manual/html_node/Mean-and-standard-deviation-and-variance.html [19] http://mathworld.wolfram.com/Skewness.html [20] At least, tedious to those who don't like this sort of thing. [21] http://mail.python.org/pipermail/python-ideas/2011-September/011524.html [22] https://pypi.python.org/pypi/stats/ [23] http://mail.python.org/pipermail/python-ideas/2013-August/022630.html Copyright This document has been placed in the public domain. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: