The Commons Math User Guide - Statistics

The statistics and distributions packages provide frameworks and implementations for basic univariate statistics, frequency distributions, bivariate regression, t- and chi-square test statistics and some commonly used probability distributions.

The stat package includes a framework and default implementations for the following univariate statistics:

  • arithmetic and geometric means
  • variance and standard deviation
  • sum, product, log sum, sum of squared values
  • minimum, maximum, median, and percentiles
  • skewness and kurtosis
  • first, second, third and fourth moments

With the exception of percentiles and the median, all of these statistics can be computed without maintaining the full list of input data values in memory. The stat package provides interfaces and implementations that do not require value storage as well as implementations that operate on arrays of stored values.

The top level interface is org.apache.commons.math.stat.univariate.UnivariateStatistic. This interface, implemented by all statistics, consists of evaluate() methods that take double[] arrays as arguments and return the value of the statistic. This interface is extended by StorelessUnivariateStatistic, which adds increment(), getResult() and associated methods to support "storageless" implementations that maintain counters, sums or other state information as values are added using the increment() method.
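To make the "storageless" idea concrete, here is a minimal sketch of a statistic that updates a few running values on each increment() call and never keeps the data. It is an illustration only: the class name RunningVarianceSketch is hypothetical, and the update rule (Welford's method) is an assumption chosen for the example, not necessarily what the Commons Math implementations use.

```java
/**
 * Hypothetical illustration of a "storageless" statistic: the running mean and
 * sample variance are maintained with Welford's update. Not Commons Math code.
 */
final class RunningVarianceSketch {
    private long n = 0;        // number of values seen so far
    private double mean = 0.0; // running mean
    private double m2 = 0.0;   // running sum of squared deviations from the mean

    /** Updates the internal state with one new value; the value itself is not stored. */
    void increment(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long getN() {
        return n;
    }

    double getMean() {
        return n == 0 ? Double.NaN : mean;
    }

    /** Bias-corrected sample variance (divides by n - 1); NaN if empty, 0 for one value. */
    double getResult() {
        if (n == 0) {
            return Double.NaN;
        }
        if (n == 1) {
            return 0.0;
        }
        return m2 / (n - 1);
    }

    public static void main(String[] args) {
        RunningVarianceSketch variance = new RunningVarianceSketch();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            variance.increment(x);
        }
        System.out.println("mean = " + variance.getMean());
        System.out.println("variance = " + variance.getResult());
    }
}
```

Only three fields are kept no matter how many values are added, which is why such statistics suit large input streams. A median cannot be maintained this way, since it needs access to the ordered data.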

Abstract implementations of the top level interfaces are provided in AbstractUnivariateStatistic and AbstractStorelessUnivariateStatistic respectively.

Each statistic is implemented as a separate class, in one of the subpackages (moment, rank, summary) and each extends one of the abstract classes above (depending on whether or not value storage is required to compute the statistic). There are several ways to instantiate and use statistics. Statistics can be instantiated and used directly, but it is generally more convenient (and efficient) to access them using the provided aggregates, DescriptiveStatistics and SummaryStatistics.

DescriptiveStatistics maintains the input data in memory and has the capability of producing "rolling" statistics computed from a "window" consisting of the most recently added values.

SummaryStatistics does not store the input data values in memory, so the statistics included in this aggregate are limited to those that can be computed in one pass through the data without access to the full array of values.

Aggregate | Statistics Included | Values stored? | "Rolling" capability?
DescriptiveStatistics | min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance, percentiles, skewness, kurtosis, median | Yes | Yes
SummaryStatistics | min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance | No | No

There is also a utility class, StatUtils, that provides static methods for computing statistics directly from double[] arrays.

Here are some examples showing how to compute univariate statistics.

Compute summary statistics for a list of double values


Using the DescriptiveStatistics aggregate (values are stored in memory):

    // Get a DescriptiveStatistics instance using the factory method
    DescriptiveStatistics stats = DescriptiveStatistics.newInstance();

    // Add the data from the array
    for (int i = 0; i < inputArray.length; i++) {
        stats.addValue(inputArray[i]);
    }

    // Compute some statistics
    double mean = stats.getMean();
    double std = stats.getStandardDeviation();
    double median = stats.getMedian();

Using the SummaryStatistics aggregate (values are not stored in memory):

    // Get a SummaryStatistics instance using the factory method
    SummaryStatistics stats = SummaryStatistics.newInstance();

    // Read data from an input stream (in is assumed to be an open BufferedReader),
    // adding values and updating the sums, counters, etc. needed for the statistics
    String line;
    while ((line = in.readLine()) != null) {
        stats.addValue(Double.parseDouble(line.trim()));
    }
    in.close();

    // Compute the statistics
    double mean = stats.getMean();
    double std = stats.getStandardDeviation();
    // double median = stats.getMedian(); <-- NOT AVAILABLE in SummaryStatistics

Using the StatUtils utility class:

    // Compute statistics directly from the array -- assume values is a double[] array
    double mean = StatUtils.mean(values);
    double std = Math.sqrt(StatUtils.variance(values));
    double median = StatUtils.percentile(values, 50);

    // Compute the mean of the first three values in the array
    mean = StatUtils.mean(values, 0, 3);

Maintain a "rolling mean" of the most recent 100 values from an input stream


Use a DescriptiveStatistics instance with window size set to 100:

    // Create a DescriptiveStatistics instance and set the window size to 100
    DescriptiveStatistics stats = DescriptiveStatistics.newInstance();
    stats.setWindowSize(100);

    // Read data from an input stream (in is assumed to be an open BufferedReader),
    // displaying the mean of the most recent 100 observations after every 100 observations
    long nLines = 0;
    String line;
    while ((line = in.readLine()) != null) {
        stats.addValue(Double.parseDouble(line.trim()));
        nLines++;
        if (nLines == 100) {
            nLines = 0;
            System.out.println(stats.getMean()); // "rolling" mean of most recent 100 values
        }
    }
    in.close();
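For readers who want to see what a "window" amounts to, the following self-contained sketch keeps only the most recent values in a queue and drops the oldest one when the window is full. The class RollingMeanSketch and its methods are hypothetical names used for illustration. This is not how DescriptiveStatistics is implemented internally, only one plausible way to get the same behavior.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a fixed-size "window" mean; not the Commons Math implementation. */
final class RollingMeanSketch {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0; // sum of the values currently in the window

    RollingMeanSketch(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Adds a value, discarding the oldest one once the window is full. */
    void addValue(double x) {
        if (window.size() == windowSize) {
            sum -= window.removeFirst();
        }
        window.addLast(x);
        sum += x;
    }

    /** Mean of the values currently in the window, or NaN if the window is empty. */
    double getMean() {
        return window.isEmpty() ? Double.NaN : sum / window.size();
    }

    public static void main(String[] args) {
        RollingMeanSketch rolling = new RollingMeanSketch(3);
        for (double x : new double[] {1, 2, 3, 4, 5}) {
            rolling.addValue(x);
        }
        System.out.println(rolling.getMean()); // mean of the last three values: 3, 4, 5
    }
}
```

Keeping a running sum makes each update constant time, at the cost of possible floating-point drift over very long streams. Recomputing the sum from the stored window avoids the drift but costs time proportional to the window size.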

org.apache.commons.math.stat.univariate.Frequency provides a simple interface for maintaining counts and percentages of discrete values.

Strings, integers, longs and chars are all supported as value types, as well as instances of any class that implements Comparable. The ordering of values used in computing cumulative frequencies is by default the natural ordering, but this can be overridden by supplying a Comparator to the constructor. Adding values that are not comparable to those that have already been added results in an IllegalArgumentException.

Here are some examples.

Compute a frequency distribution based on integer values


Mixing integers, longs, Integers and Longs:

    Frequency f = new Frequency();
    f.addValue(1);
    f.addValue(new Integer(1));
    f.addValue(new Long(1));
    f.addValue(2);
    f.addValue(new Integer(-1));
    System.out.println(f.getCount(1));             // displays 3
    System.out.println(f.getCumPct(0));            // displays 0.2
    System.out.println(f.getPct(new Integer(1)));  // displays 0.6
    System.out.println(f.getCumPct(-2));           // displays 0 -- all values are greater than this
    System.out.println(f.getCumPct(10));           // displays 1 -- all values are less than this

Count string frequencies


Using case-sensitive comparison, alpha sort order (natural comparator):

    Frequency f = new Frequency();
    f.addValue("one");
    f.addValue("One");
    f.addValue("oNe");
    f.addValue("Z");
    System.out.println(f.getCount("one"));  // displays 1
    System.out.println(f.getCumPct("Z"));   // displays 0.5 -- second in sort order
    System.out.println(f.getCumPct("Ot"));  // displays 0.25 -- between first ("One") and second ("Z") value

Using case-insensitive comparator:

    Frequency f = new Frequency(String.CASE_INSENSITIVE_ORDER);
    f.addValue("one");
    f.addValue("One");
    f.addValue("oNe");
    f.addValue("Z");
    System.out.println(f.getCount("one"));  // displays 3
    System.out.println(f.getCumPct("z"));   // displays 1 -- last value
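The role of the ordering in cumulative frequencies can be illustrated with a small stand-alone sketch built on java.util.TreeMap. FrequencySketch is a hypothetical class written for this guide, not the Commons Math Frequency class. It assumes that a sorted map keyed by the chosen Comparator is an adequate model of the behavior described above.

```java
import java.util.Comparator;
import java.util.TreeMap;

/** Hypothetical sketch of counts and cumulative percentages; not the Commons Math Frequency class. */
final class FrequencySketch<T> {
    private final TreeMap<T, Long> counts;
    private long total = 0;

    /** A null comparator selects the natural ordering (values must then be Comparable). */
    FrequencySketch(Comparator<? super T> comparator) {
        this.counts = new TreeMap<>(comparator);
    }

    /** Values that compare as equal under the ordering share one counter. */
    void addValue(T value) {
        counts.merge(value, 1L, Long::sum);
        total++;
    }

    long getCount(T value) {
        return counts.getOrDefault(value, 0L);
    }

    /** Fraction of added values that are less than or equal to the given value. */
    double getCumPct(T value) {
        if (total == 0) {
            return Double.NaN;
        }
        long cumulative = 0;
        for (long c : counts.headMap(value, true).values()) {
            cumulative += c;
        }
        return (double) cumulative / total;
    }

    public static void main(String[] args) {
        FrequencySketch<String> f = new FrequencySketch<>(String.CASE_INSENSITIVE_ORDER);
        for (String s : new String[] {"one", "One", "oNe", "Z"}) {
            f.addValue(s);
        }
        System.out.println(f.getCount("one"));
        System.out.println(f.getCumPct("z"));
    }
}
```

Because the comparator decides both which values count as equal and where a query value falls in the sort order, swapping it changes the counts as well as the cumulative percentages.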

This is yet to be written. Any contributions will be gratefully accepted!

The distribution framework provides the means to compute probability density function (PDF) probabilities and cumulative distribution function (CDF) probabilities for common probability distributions. Along with the direct computation of PDF and CDF probabilities, the framework also allows for the computation of inverse PDF and inverse CDF values.

To use the distribution framework, first create a distribution object. All distribution objects should be created via the org.apache.commons.math.stat.distribution.DistributionFactory class, a simple factory that creates every distribution object supported by Commons-Math. Typical usage of DistributionFactory to create a distribution object is:

    DistributionFactory factory = DistributionFactory.newInstance();
    BinomialDistribution binomial = factory.createBinomialDistribution(10, .75);

The distributions that can be instantiated via the DistributionFactory are detailed below:
Distribution | Factory Method | Parameters
Binomial | createBinomialDistribution | Number of trials; Probability of success
Chi-Squared | createChiSquaredDistribution | Degrees of freedom
Exponential | createExponentialDistribution | Mean
F | createFDistribution | Numerator degrees of freedom; Denominator degrees of freedom
Gamma | createGammaDistribution | Alpha; Beta
Hypergeometric | createHypergeometricDistribution | Population size; Number of successes in population; Sample size
Normal (Gaussian) | createNormalDistribution | Mean; Standard Deviation
t | createTDistribution | Degrees of freedom

Using a distribution object, CDF probabilities are easily computed using the cumulativeProbability methods. For a distribution X, and a domain value, x, cumulativeProbability computes P(X <= x) (i.e. the lower tail probability of X).

    DistributionFactory factory = DistributionFactory.newInstance();
    TDistribution t = factory.createTDistribution(29);
    double lowerTail = t.cumulativeProbability(-2.656);      // P(T <= -2.656)
    double upperTail = 1.0 - t.cumulativeProbability(2.75);  // P(T >= 2.75)
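For a discrete distribution the upper tail needs one extra step, because P(X >= x) equals 1 - P(X <= x - 1) rather than 1 - P(X <= x). The sketch below shows this for a binomial distribution. It is self-contained and does not use the Commons Math classes. BinomialTailSketch and its methods are hypothetical names, and the direct summation of probabilities is an assumption that is only reasonable for small numbers of trials.

```java
/** Hypothetical, self-contained binomial CDF; it does not use the Commons Math classes. */
final class BinomialTailSketch {

    /** P(X = k) for X ~ Binomial(n, p); the coefficient is built iteratively to avoid factorials. */
    static double probability(int n, double p, int k) {
        if (k < 0 || k > n) {
            return 0.0;
        }
        double coefficient = 1.0; // binomial coefficient C(n, k)
        for (int i = 1; i <= k; i++) {
            coefficient = coefficient * (n - k + i) / i;
        }
        return coefficient * Math.pow(p, k) * Math.pow(1.0 - p, n - k);
    }

    /** Lower tail probability P(X <= x). */
    static double cumulativeProbability(int n, double p, int x) {
        if (x < 0) {
            return 0.0;
        }
        if (x >= n) {
            return 1.0;
        }
        double sum = 0.0;
        for (int k = 0; k <= x; k++) {
            sum += probability(n, p, k);
        }
        return Math.min(1.0, sum); // guard against rounding slightly above 1
    }

    public static void main(String[] args) {
        int n = 10;
        double p = 0.75;
        double lowerTail = cumulativeProbability(n, p, 5);        // P(X <= 5)
        double upperTail = 1.0 - cumulativeProbability(n, p, 7);  // P(X >= 8) = 1 - P(X <= 7)
        System.out.println("P(X <= 5) = " + lowerTail);
        System.out.println("P(X >= 8) = " + upperTail);
    }
}
```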

The inverse CDF values are just as easily computed using the inverseCumulativeProbability methods. For a distribution X, and a probability, p, inverseCumulativeProbability computes the domain value x, such that:

  • P(X <= x) = p, for continuous distributions
  • P(X <= x) <= p, for discrete distributions

Notice the different cases for continuous and discrete distributions. The cumulative distribution function of a discrete distribution is a step function and therefore not invertible, so an exact domain value cannot always be returned, only the "best" domain value. For Commons-Math, the "best" domain value is the largest domain value whose cumulative probability is less than or equal to the given probability.
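The discrete rule can be sketched as a simple search for the largest domain value whose cumulative probability does not exceed p. The code below is an illustration under stated assumptions: DiscreteInverseSketch, dieCdf and the linear scan are hypothetical choices made for this example, and returning lower - 1 when no domain value qualifies is a convention of the sketch, not a documented guarantee of the library.

```java
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch of the "largest x with P(X <= x) <= p" rule; not Commons Math code. */
final class DiscreteInverseSketch {

    /** CDF of a fair six-sided die, used here as a simple discrete distribution. */
    static double dieCdf(int x) {
        if (x < 1) {
            return 0.0;
        }
        if (x >= 6) {
            return 1.0;
        }
        return x / 6.0;
    }

    /**
     * Returns the largest x in [lower, upper] with cdf(x) <= p,
     * or lower - 1 if even cdf(lower) exceeds p (a convention of this sketch).
     */
    static int inverseCumulativeProbability(IntToDoubleFunction cdf, int lower, int upper, double p) {
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in [0, 1]");
        }
        int best = lower - 1;
        for (int x = lower; x <= upper; x++) {
            if (cdf.applyAsDouble(x) <= p) {
                best = x;  // still within the allowed probability
            } else {
                break;     // the CDF is non-decreasing, so no later x can qualify
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // P(X <= 2) = 1/3 and P(X <= 3) = 1/2, so p = 0.4 falls between two steps of the CDF.
        int x = inverseCumulativeProbability(DiscreteInverseSketch::dieCdf, 1, 6, 0.4);
        System.out.println(x);
    }
}
```

For p = 0.4 no face of the die has a cumulative probability of exactly 0.4, so the search settles on the last value that stays at or below it. This is the "best" domain value in the sense described above.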