The statistics and distributions packages provide frameworks and implementations for basic univariate statistics, frequency distributions, bivariate regression, t- and chi-square test statistics and some commonly used probability distributions.
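The stat package includes a framework and default implementations for univariate statistics, including min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance, percentiles, skewness, kurtosis and median.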
With the exception of percentiles and the median, all of these statistics can be computed without maintaining the full list of input data values in memory. The stat package provides interfaces and implementations that do not require value storage as well as implementations that operate on arrays of stored values.
The top level interface is org.apache.commons.math.stat.univariate.UnivariateStatistic. This interface, implemented by all statistics, consists of evaluate() methods that take double[] arrays as arguments and return the value of the statistic. This interface is extended by StorelessUnivariateStatistic, which adds increment(), getResult() and associated methods to support "storageless" implementations that maintain counters, sums or other state information as values are added using the increment() method.
Abstract implementations of the top level interfaces are provided in AbstractUnivariateStatistic and AbstractStorelessUnivariateStatistic respectively.
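The split between evaluate() and increment()/getResult() can be sketched in plain Java. The interfaces and class below (SimpleStatistic, SimpleStorelessStatistic, SketchMean) are hypothetical, simplified stand-ins written for illustration; they are not the Commons Math types and their exact signatures are assumptions.

```java
/** Hypothetical stand-in for the array-based side of the API. */
interface SimpleStatistic {
    double evaluate(double[] values);
}

/** Hypothetical stand-in for the storeless extension. */
interface SimpleStorelessStatistic extends SimpleStatistic {
    void increment(double value);
    double getResult();
}

/** Arithmetic mean that keeps only a count and a running sum. */
class SketchMean implements SimpleStorelessStatistic {
    private long n = 0;
    private double sum = 0.0;

    public void increment(double value) {
        n++;
        sum += value;
    }

    public double getResult() {
        // No values added yet: the mean is undefined.
        return n == 0 ? Double.NaN : sum / n;
    }

    public double evaluate(double[] values) {
        // Uses a fresh instance so this object's counters are not disturbed.
        SketchMean m = new SketchMean();
        for (double v : values) {
            m.increment(v);
        }
        return m.getResult();
    }
}
```

Because the state is just a counter and a sum, the statistic can be updated one value at a time without ever holding the input array.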
Each statistic is implemented as a separate class, in one of the subpackages (moment, rank, summary) and each extends one of the abstract classes above (depending on whether or not value storage is required to compute the statistic). There are several ways to instantiate and use statistics. Statistics can be instantiated and used directly, but it is generally more convenient (and efficient) to access them using the provided aggregates, DescriptiveStatistics and SummaryStatistics.
DescriptiveStatistics maintains the input data in memory and can produce "rolling" statistics computed from a "window" consisting of the most recently added values. SummaryStatistics does not store the input data values in memory, so the statistics included in this aggregate are limited to those that can be computed in one pass through the data without access to the full array of values.
Aggregate | Statistics Included | Values stored? | "Rolling" capability? |
---|---|---|---|
DescriptiveStatistics | min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance, percentiles, skewness, kurtosis, median | Yes | Yes |
SummaryStatistics | min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance | No | No |
There is also a utility class, StatUtils, that provides static methods for computing statistics directly from double[] arrays.
Here are some examples showing how to compute univariate statistics.
Using the DescriptiveStatistics aggregate (values are stored in memory):
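The idea behind a storing aggregate can be sketched in plain Java without the library. The class below is hypothetical (StoredValuesSketch and its method names are not part of Commons Math); it keeps every value so that an order statistic such as the median can be computed on demand.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for an aggregate that keeps every value. */
class StoredValuesSketch {
    private final List<Double> values = new ArrayList<>();

    void addValue(double v) {
        values.add(v);
    }

    double getMean() {
        if (values.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** The median needs the full, sorted data set, hence the storage. */
    double getMedian() {
        if (values.isEmpty()) {
            return Double.NaN;
        }
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int mid = sorted.size() / 2;
        return sorted.size() % 2 == 1
                ? sorted.get(mid)
                : (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;
    }
}
```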
Using the SummaryStatistics aggregate (values are not stored in memory):
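A one-pass aggregate can be sketched as follows. OnePassSummarySketch is a hypothetical class, not the library's implementation. It updates the mean and the sum of squared deviations with Welford's recurrence, a technique chosen here for illustration, and returns NaN for undefined results, which is also an assumption.

```java
/** Hypothetical one-pass aggregate; no input value is retained. */
class OnePassSummarySketch {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the running mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    void addValue(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // Welford update
        min = Math.min(min, x);
        max = Math.max(max, x);
    }

    long getN() { return n; }

    double getMean() { return n == 0 ? Double.NaN : mean; }

    double getMin() { return n == 0 ? Double.NaN : min; }

    double getMax() { return n == 0 ? Double.NaN : max; }

    /** Bias-corrected sample variance (divides by n - 1). */
    double getVariance() { return n < 2 ? Double.NaN : m2 / (n - 1); }

    double getStandardDeviation() { return Math.sqrt(getVariance()); }
}
```

Memory use stays constant no matter how many values are added, which is why percentiles and the median cannot be offered by this kind of aggregate.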
Using the StatUtils utility class:
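The static, array-based style can be illustrated with a small helper class. ArrayStatsSketch is hypothetical and only mirrors the idea of computing statistics directly from a double[]; the method names and the NaN conventions are assumptions, not the StatUtils API.

```java
/** Hypothetical static helpers that work directly on a double[]. */
final class ArrayStatsSketch {
    private ArrayStatsSketch() { }

    static double mean(double[] values) {
        if (values.length == 0) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Two-pass, bias-corrected sample variance (divides by n - 1). */
    static double variance(double[] values) {
        if (values.length < 2) {
            return Double.NaN;
        }
        double m = mean(values);
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += (v - m) * (v - m);
        }
        return sumSq / (values.length - 1);
    }
}
```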
Using a DescriptiveStatistics instance with window size set to 100:
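A "rolling" computation can be sketched with a bounded queue that discards the oldest value once the window is full. RollingWindowSketch is a hypothetical class written for illustration; how the library manages its window internally is not shown here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical rolling aggregate over the most recently added values. */
class RollingWindowSketch {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();

    RollingWindowSketch(int windowSize) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    void addValue(double v) {
        if (window.size() == windowSize) {
            window.removeFirst(); // drop the oldest value
        }
        window.addLast(v);
    }

    long getN() {
        return window.size();
    }

    /** Mean of the values currently in the window. */
    double getMean() {
        if (window.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double v : window) {
            sum += v;
        }
        return sum / window.size();
    }
}
```

With a window of 100, adding the values 1 through 150 leaves only 51 through 150 in the window, so the reported mean reflects the most recent 100 values.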
org.apache.commons.math.stat.univariate.Frequency provides a simple interface for maintaining counts and percentages of discrete values. Strings, integers, longs and chars are all supported as value types, as well as instances of any class that implements Comparable. The ordering of values used in computing cumulative frequencies is by default the natural ordering, but this can be overridden by supplying a Comparator to the constructor. Adding values that are not comparable to those that have already been added results in an IllegalArgumentException.
Here are some examples.
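As a minimal sketch of the bookkeeping involved, the class below keeps counts in a sorted map so that cumulative frequencies follow the natural ordering of the values. FrequencySketch and its method names are hypothetical and written in plain Java; they are not the Frequency class itself, and percentages are returned as fractions between 0 and 1 by assumption.

```java
import java.util.TreeMap;

/** Hypothetical frequency table keyed by values in their natural order. */
class FrequencySketch<T extends Comparable<T>> {
    private final TreeMap<T, Long> counts = new TreeMap<>();
    private long total = 0;

    void addValue(T value) {
        counts.merge(value, 1L, Long::sum);
        total++;
    }

    /** Number of times the value was added (0 if never seen). */
    long getCount(T value) {
        return counts.getOrDefault(value, 0L);
    }

    /** Fraction of all added values equal to the given value. */
    double getPct(T value) {
        return total == 0 ? Double.NaN : (double) getCount(value) / total;
    }

    /** Number of added values less than or equal to the given value. */
    long getCumFreq(T value) {
        long sum = 0;
        for (long c : counts.headMap(value, true).values()) {
            sum += c;
        }
        return sum;
    }

    /** Fraction of added values less than or equal to the given value. */
    double getCumPct(T value) {
        return total == 0 ? Double.NaN : (double) getCumFreq(value) / total;
    }
}
```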
The sections on bivariate regression and on t- and chi-square test statistics are yet to be written. Any contributions will be gratefully accepted!
The distribution framework provides the means to compute probability density function (PDF) probabilities and cumulative distribution function (CDF) probabilities for common probability distributions. Along with the direct computation of PDF and CDF probabilities, the framework also allows for the computation of inverse PDF and inverse CDF values.
To use the distribution framework, a distribution object must first be created. All distribution objects should preferably be created via the org.apache.commons.math.stat.distribution.DistributionFactory class. DistributionFactory is a simple factory used to create all of the distribution objects supported by Commons-Math. The typical usage is to obtain a DistributionFactory and call one of its factory methods to create a distribution object.
The distributions that can be instantiated via the DistributionFactory are detailed below:
Distribution | Factory Method | Parameters |
---|---|---|
Binomial | createBinomialDistribution | Number of trials, Probability of success |
Chi-Squared | createChiSquaredDistribution | Degrees of freedom |
Exponential | createExponentialDistribution | Mean |
F | createFDistribution | Numerator degrees of freedom, Denominator degrees of freedom |
Gamma | createGammaDistribution | Alpha, Beta |
Hypergeometric | createHypergeometricDistribution | Population size, Number of successes in population, Sample size |
Normal (Gaussian) | createNormalDistribution | Mean, Standard Deviation |
t | createTDistribution | Degrees of freedom |
Using a distribution object, cumulative probabilities are easily computed using the cumulativeProbability methods. For a distribution X and a domain value x, cumulativeProbability computes P(X <= x) (i.e. the lower tail probability of X).
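To make the lower tail definition concrete, here is a minimal sketch for an exponential distribution parameterised by its mean, using the closed form P(X <= x) = 1 - exp(-x / mean). ExponentialSketch is a hypothetical class in plain Java, not the library's implementation, and the two-argument overload is an assumption added for illustration.

```java
/** Hypothetical exponential distribution parameterised by its mean. */
class ExponentialSketch {
    private final double mean;

    ExponentialSketch(double mean) {
        if (mean <= 0.0) {
            throw new IllegalArgumentException("mean must be positive");
        }
        this.mean = mean;
    }

    /** Lower tail probability P(X <= x). */
    double cumulativeProbability(double x) {
        return x <= 0.0 ? 0.0 : 1.0 - Math.exp(-x / mean);
    }

    /** P(x0 <= X <= x1), obtained as a difference of two lower tails. */
    double cumulativeProbability(double x0, double x1) {
        if (x0 > x1) {
            throw new IllegalArgumentException("x0 must not exceed x1");
        }
        return cumulativeProbability(x1) - cumulativeProbability(x0);
    }
}
```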
Inverse cumulative probabilities are just as easily computed using the inverseCumulativeProbability methods. For a distribution X and a probability p, inverseCumulativeProbability computes the domain value x such that:
P(X <= x) = p, for continuous distributions
P(X <= x) <= p, for discrete distributions
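The two cases can be sketched in plain Java. InverseCdfSketch is a hypothetical helper, not library code. The continuous case inverts the exponential CDF in closed form; the discrete case walks the binomial CDF and reads the condition above as "the largest x with P(X <= x) <= p", which is an assumption, as is returning -1 when no such x exists.

```java
/** Hypothetical helpers illustrating inverse cumulative probabilities. */
final class InverseCdfSketch {
    private InverseCdfSketch() { }

    /** Continuous case: solves 1 - exp(-x / mean) = p for x. */
    static double exponentialInverse(double mean, double p) {
        if (mean <= 0.0 || p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("need mean > 0 and 0 <= p <= 1");
        }
        return -mean * Math.log(1.0 - p); // p == 1 yields +Infinity
    }

    /** Discrete case: largest x with P(X <= x) <= p, or -1 if there is none. */
    static int binomialInverse(int trials, double successProb, double p) {
        if (trials < 0 || successProb <= 0.0 || successProb >= 1.0
                || p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("invalid parameters");
        }
        double pmf = Math.pow(1.0 - successProb, trials); // P(X = 0)
        double cdf = 0.0;
        int result = -1;
        for (int k = 0; k <= trials; k++) {
            cdf += pmf;
            if (cdf > p) {
                break;
            }
            result = k;
            // Recurrence from P(X = k) to P(X = k + 1).
            pmf *= (double) (trials - k) / (k + 1) * successProb / (1.0 - successProb);
        }
        return result;
    }
}
```

In the discrete case the CDF is a step function, so an exact solution of P(X <= x) = p generally does not exist, which is why the condition is an inequality.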