The Commons Math User Guide

The statistics and distributions packages provide frameworks and implementations for basic univariate statistics, frequency distributions, bivariate regression, t- and chi-square test statistics and some commonly used probability distributions.

The stat package includes a framework and default implementations for the following univariate statistics:

arithmetic and geometric means
variance and standard deviation
sum, product, log sum, sum of squared values
minimum, maximum, median, and percentiles
skewness and kurtosis
first, second, third and fourth moments

With the exception of percentiles and the median, all of these statistics can be computed without maintaining the full list of input data values in memory. The stat package provides interfaces and implementations that do not require value storage as well as implementations that operate on arrays of stored values.

The top level interface is org.apache.commons.math.stat.univariate.UnivariateStatistic. This interface, implemented by all statistics, consists of evaluate() methods that take double[] arrays as arguments and return the value of the statistic. This interface is extended by org.apache.commons.math.stat.univariate.StorelessUnivariateStatistic, which adds increment(), getResult() and associated methods to support "storageless" implementations that maintain counters, sums or other state information as values are added using the increment() method.
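The "storageless" idea can be illustrated with a minimal sketch. The class below is hypothetical (RunningVariance is not a Commons Math class, and this is not the library's implementation); it only borrows the method names increment(), getResult() and evaluate() from the interfaces described above, and assumes the bias-corrected sample variance computed with Welford's update.

```java
/** Hypothetical illustration of a storeless statistic; not the Commons Math implementation. */
class RunningVariance {
    private long n = 0;       // number of values seen so far
    private double mean = 0;  // running mean of the values seen so far
    private double m2 = 0;    // running sum of squared deviations from the mean

    /** Folds one value into the running state (Welford's update); the value itself is not stored. */
    void increment(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    /** Returns the sample variance of the values added so far, or NaN if fewer than two were added. */
    double getResult() {
        return n < 2 ? Double.NaN : m2 / (n - 1);
    }

    /** Array-based evaluation, expressed in terms of the incremental update. */
    static double evaluate(double[] values) {
        RunningVariance v = new RunningVariance();
        for (double x : values) {
            v.increment(x);
        }
        return v.getResult();
    }

    public static void main(String[] args) {
        RunningVariance v = new RunningVariance();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            v.increment(x);
        }
        System.out.println("variance = " + v.getResult());
    }
}
```

Only three fields are kept no matter how many values are added, which is why statistics of this kind do not need the full list of input values in memory.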

Abstract implementations of the top level interfaces are provided in org.apache.commons.math.stat.univariate.AbstractUnivariateStatistic and org.apache.commons.math.stat.univariate.AbstractStorelessUnivariateStatistic respectively.

Each statistic is implemented as a separate class, in one of the subpackages (moment, rank, summary) and each extends one of the abstract classes above (depending on whether or not value storage is required to compute the statistic). There are several ways to instantiate and use statistics. Statistics can be instantiated and used directly, but it is generally more convenient to access them using the provided aggregates:

Aggregate	Statistics Included	Values stored?
org.apache.commons.math.stat.DescriptiveStatistics	All	Yes
org.apache.commons.math.stat.SummaryStatistics	min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance	No

There is also a utility class, org.apache.commons.math.stat.StatUtils, that provides static methods for computing statistics from double[] arrays.
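As a rough sketch of what array-based static methods of this kind look like, consider the following. ArrayStats and its method names are hypothetical and are not the StatUtils API; the median convention used here (average of the two middle values for an even count) is also an assumption. The median is included because it is one of the statistics that needs all values at once.

```java
import java.util.Arrays;

/** Hypothetical array-based helpers; not the StatUtils API. */
class ArrayStats {
    /** Arithmetic mean of the values; NaN for an empty array. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double x : values) {
            sum += x;
        }
        return sum / values.length;
    }

    /** Median of the values; NaN for an empty array. Sorts a copy, so the input is left unchanged. */
    static double median(double[] values) {
        if (values.length == 0) {
            return Double.NaN;
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        double[] data = {4, 1, 3, 2};
        System.out.println("mean = " + mean(data));
        System.out.println("median = " + median(data));
    }
}
```

The sort over a full copy of the data is what prevents the median from being computed in the incremental, storeless style.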

This is yet to be written. Any contributions will be gratefully accepted!

The distribution framework provides the means to compute probability density function (PDF) probabilities and cumulative distribution function (CDF) probabilities for common probability distributions. Along with the direct computation of PDF and CDF probabilities, the framework also allows for the computation of inverse PDF and inverse CDF values.

To use the distribution framework, first create a distribution object. Distribution objects should preferably be created via the org.apache.commons.math.stat.distribution.DistributionFactory class, a simple factory that creates all of the distribution objects supported by Commons-Math. Typical usage of DistributionFactory to create a distribution object is:

    DistributionFactory factory = DistributionFactory.newInstance();
    BinomialDistribution binomial = factory.createBinomialDistribution(10, .75);

The distributions that can be instantiated via the DistributionFactory are detailed below:

Distribution	Factory Method	Parameters
Binomial	createBinomialDistribution	Number of trials, Probability of success
Chi-Squared	createChiSquaredDistribution	Degrees of freedom
Exponential	createExponentialDistribution	Mean
F	createFDistribution	Numerator degrees of freedom, Denominator degrees of freedom
Gamma	createGammaDistribution	Alpha, Beta
Hypergeometric	createHypergeometricDistribution	Population size, Number of successes in population, Sample size
Normal (Gaussian)	createNormalDistribution	Mean, Standard Deviation
t	createTDistribution	Degrees of freedom

Using a distribution object, CDF probabilities are computed using the cumulativeProbability methods. For a distribution X and a domain value x, cumulativeProbability computes P(X <= x), i.e. the lower tail probability of X.

    DistributionFactory factory = DistributionFactory.newInstance();
    TDistribution t = factory.createTDistribution(29);
    double lowerTail = t.cumulativeProbability(-2.656);     // P(T <= -2.656)
    double upperTail = 1.0 - t.cumulativeProbability(2.75); // P(T >= 2.75)

Inverse CDF values are computed just as easily using the inverseCumulativeProbability methods. For a distribution X and a probability p, inverseCumulativeProbability computes the domain value x such that:

P(X <= x) = p, for continuous distributions
P(X <= x) <= p, for discrete distributions

Notice the different cases for continuous and discrete distributions. The CDF of a discrete distribution is a step function and is therefore not invertible, so an exact domain value cannot always be returned, only the "best" domain value. For Commons-Math, the "best" domain value is the largest domain value whose cumulative probability is less than or equal to the given probability.
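The two cases can be contrasted with a small self-contained sketch. Everything in it is hypothetical: InverseCdfSketch and its methods are not Commons Math classes, the exponential and binomial formulas are written out by hand, the binomial code assumes a success probability strictly between 0 and 1, and returning -1 when no domain value qualifies is an assumption of this sketch.

```java
/** Hypothetical illustration of inverse cumulative probabilities; not the Commons Math implementation. */
class InverseCdfSketch {
    /** CDF of an exponential distribution with the given mean: P(X <= x). */
    static double exponentialCdf(double mean, double x) {
        return x <= 0 ? 0.0 : 1.0 - Math.exp(-x / mean);
    }

    /** Continuous case: returns x such that P(X <= x) = p, for 0 <= p < 1. */
    static double exponentialInverseCdf(double mean, double p) {
        return -mean * Math.log1p(-p);
    }

    /** Binomial CDF: P(X <= x) for n trials with success probability q (assumes 0 < q < 1). */
    static double binomialCdf(int n, double q, int x) {
        if (x < 0) {
            return 0.0;
        }
        if (x >= n) {
            return 1.0;
        }
        double term = Math.pow(1.0 - q, n); // P(X = 0)
        double sum = term;
        for (int k = 0; k < x; k++) {
            // P(X = k + 1) = P(X = k) * (n - k) / (k + 1) * q / (1 - q)
            term *= (double) (n - k) / (k + 1) * q / (1.0 - q);
            sum += term;
        }
        return sum;
    }

    /** Discrete case: largest x with P(X <= x) <= p, or -1 if no such x exists. */
    static int binomialInverseCdf(int n, double q, double p) {
        int best = -1;
        for (int x = 0; x <= n; x++) {
            if (binomialCdf(n, q, x) <= p) {
                best = x;
            } else {
                break; // the CDF is non-decreasing, so no later x can qualify
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double x = exponentialInverseCdf(2.0, 0.3);
        System.out.println("exponential: x = " + x + ", P(X <= x) = " + exponentialCdf(2.0, x));
        int k = binomialInverseCdf(10, 0.75, 0.5);
        System.out.println("binomial: k = " + k + ", P(X <= k) = " + binomialCdf(10, 0.75, k));
    }
}
```

In the continuous case the round trip recovers p up to rounding error, while in the discrete case the cumulative probability at the returned value is generally strictly below p.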