diff --git a/pep-0450.txt b/pep-0450.txt
index 5de7cfa29..448f466d4 100644
--- a/pep-0450.txt
+++ b/pep-0450.txt
@@ -15,7 +15,7 @@ Abstract
     This PEP proposes the addition of a module for common statistics functions
     such as mean, median, variance and standard deviation to the Python
-    standard library.
+    standard library.  See also http://bugs.python.org/issue18606
 Rationale
@@ -250,6 +250,102 @@ Design Decisions Of The Module
     - But avoid going into tedious[20] mathematical detail.
+API
+
+    The initial version of the library will provide univariate (single
+    variable) statistics functions.  The general API will be based on a
+    functional model ``function(data, ...) -> result``, where ``data``
+    is a mandatory iterable of (usually) numeric data.
+
+    The author expects that lists will be the most common data type used,
+    but any iterable type should be acceptable.  Where necessary, functions
+    may convert to lists internally.  Where possible, functions are
+    expected to conserve the type of the data values, for example, the mean
+    of a list of Decimals should be a Decimal rather than float.
+
+
+    Calculating mean, median and mode
+
+        The ``mean``, ``median*`` and ``mode`` functions take a single
+        mandatory argument and return the appropriate statistic, e.g.:
+
+        >>> mean([1, 2, 3])
+        2.0
+
+        Functions provided are:
+
+            * mean(data) -> arithmetic mean of data.
+
+            * median(data) -> median (middle value) of data, taking the
+              average of the two middle values when there are an even
+              number of values.
+
+            * median_high(data) -> high median of data, taking the
+              larger of the two middle values when the number of items
+              is even.
+
+            * median_low(data) -> low median of data, taking the smaller
+              of the two middle values when the number of items is even.
+
+            * median_grouped(data, interval=1) -> 50th percentile of
+              grouped data, using interpolation.
+
+            * mode(data) -> most common data point.
+
+        ``mode`` is the sole exception to the rule that the data argument
+        must be numeric.  It will also accept an iterable of nominal data,
+        such as strings.
+
+
+    Calculating variance and standard deviation
+
+        In order to be similar to scientific calculators, the statistics
+        module will include separate functions for population and sample
+        variance and standard deviation.  All four functions have similar
+        signatures, with a single mandatory argument, an iterable of
+        numeric data, e.g.:
+
+        >>> variance([1, 2, 2, 2, 3])
+        0.5
+
+        All four functions also accept a second, optional, argument, the
+        mean of the data.  This is modelled on a similar API provided by
+        the GNU Scientific Library[18].  There are three use-cases for
+        using this argument, in no particular order:
+
+            1)  The value of the mean is known *a priori*.
+
+            2)  You have already calculated the mean, and wish to avoid
+                calculating it again.
+
+            3)  You wish to (ab)use the variance functions to calculate
+                the second moment about some given point other than the
+                mean.
+
+        In each case, it is the caller's responsibility to ensure that
+        given argument is meaningful.
+
+        Functions provided are:
+
+            * variance(data, xbar=None) -> sample variance of data,
+              optionally using xbar as the sample mean.
+
+            * stdev(data, xbar=None) -> sample standard deviation of
+              data, optionally using xbar as the sample mean.
+
+            * pvariance(data, mu=None) -> population variance of data,
+              optionally using mu as the population mean.
+
+            * pstdev(data, mu=None) -> population standard deviation of
+              data, optionally using mu as the population mean.
+
+    Other functions
+
+        There is one other public function:
+
+            * sum(data, start=0) -> high-precision sum of numeric data.
+
+
 Specification
     As the proposed reference implementation is in pure Python,
@@ -317,7 +413,7 @@ Frequently Asked Questions
         level somewhere between "use numpy" and "roll your own version".
-Open and Deferred Issues
+Future Work
     - At this stage, I am unsure of the best API for multivariate statistical
       functions such as linear regression, correlation coefficient, and
@@ -329,6 +425,8 @@ Open and Deferred Issues
         * A single argument for (x, y) data:
             function([(x0, y0), (x1, y1), ...])
+          This API is preferred by GvR[24].
+
         * Selecting arbitrary columns from a 2D array:
             function([[a0, x0, y0, z0], [a1, x1, y1, z1], ...], x=1, y=2)
@@ -404,6 +502,8 @@ References
     [23] http://mail.python.org/pipermail/python-ideas/2013-August/022630.html
+    [24] https://mail.python.org/pipermail/python-dev/2013-September/128429.html
+
 Copyright
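
Note on the API hunk: the ``function(data, ...) -> result`` model added by this patch can be exercised with a short sketch. It assumes the proposed functions behave like the ``statistics`` module that later shipped in the Python standard library; that mapping is an assumption made for illustration and is not part of the patch. The proposed public ``sum(data, start=0)`` is left out because the sketch relies only on names known to exist in the shipped module. The variable names are hypothetical.

```python
# Sketch only: assumes the API described in the patch matches the
# standard-library `statistics` module (an assumption, not part of the diff).
from decimal import Decimal
import statistics

data = [1, 2, 2, 2, 3]

# The median family differs only in how an even-length sample is handled.
even = [1, 3, 5, 7]
mid = statistics.median(even)         # average of the two middle values
low = statistics.median_low(even)     # smaller of the two middle values
high = statistics.median_high(even)   # larger of the two middle values

# mode is the one function that also accepts nominal (non-numeric) data.
favourite = statistics.mode(["red", "blue", "red"])

# Type conservation: the mean of Decimals should itself be a Decimal.
dec_mean = statistics.mean([Decimal("0.5"), Decimal("0.75"), Decimal("1.0")])

# Use-case 2 from the patch: reuse a mean that has already been calculated.
xbar = statistics.mean(data)
sample_var = statistics.variance(data, xbar)    # divides by n - 1
pop_var = statistics.pvariance(data, mu=xbar)   # divides by n
```

Passing the true mean, as here, gives the same result as omitting the argument. What happens when a different point is passed (use-case 3) is left to the caller, as the patch text itself says.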