The statistics and distributions packages provide frameworks and implementations for
basic univariate statistics, frequency distributions, bivariate regression, t- and chi-square test
statistics and some commonly used probability distributions.
The stat package includes a framework and default implementations for the following univariate
statistics:
arithmetic and geometric means
variance and standard deviation
sum, product, log sum, sum of squared values
minimum, maximum, median, and percentiles
skewness and kurtosis
first, second, third and fourth moments
With the exception of percentiles and the median, all of these statistics can be computed without
maintaining the full list of input data values in memory. The stat package provides interfaces and
implementations that do not require value storage as well as implementations that operate on arrays
of stored values.
The top level interface is
org.apache.commons.math.stat.univariate.UnivariateStatistic. This interface, implemented by
all statistics, consists of evaluate() methods that take double[] arrays as arguments and return
the value of the statistic. This interface is extended by
StorelessUnivariateStatistic, which adds increment(), getResult() and associated
methods to support "storageless" implementations that
maintain counters, sums or other state information as values are added using the increment()
method.
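The difference between the two styles can be sketched in plain Java. This is a minimal illustration, not the library's code: the interface and class names (ArrayStatistic, IncrementalStatistic, RunningMean) are hypothetical stand-ins, and clear() is only an assumed example of an "associated method".

```java
/** Hypothetical stand-in for a statistic evaluated over a stored array. */
interface ArrayStatistic {
    double evaluate(double[] values);
}

/** Hypothetical stand-in for a statistic that can also be updated one value at a time. */
interface IncrementalStatistic extends ArrayStatistic {
    void increment(double value);
    double getResult();
    void clear(); // assumed helper, resets the internal state
}

/** Arithmetic mean that keeps only a count and the current mean, never the values. */
class RunningMean implements IncrementalStatistic {
    private long n = 0;
    private double mean = Double.NaN;

    @Override
    public void increment(double value) {
        n++;
        if (n == 1) {
            mean = value;
        } else {
            // Shift the mean toward the new value by 1/n of the difference
            mean += (value - mean) / n;
        }
    }

    @Override
    public double getResult() {
        return mean; // NaN until at least one value has been added
    }

    @Override
    public void clear() {
        n = 0;
        mean = Double.NaN;
    }

    @Override
    public double evaluate(double[] values) {
        // Works on a fresh instance so the running state of this one is untouched
        RunningMean m = new RunningMean();
        for (double v : values) {
            m.increment(v);
        }
        return m.getResult();
    }
}
```

Calling increment() repeatedly and then getResult() gives the same answer as evaluate() on the full array, but needs only constant memory.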
Each statistic is implemented as a separate class in one of the subpackages (moment, rank, summary), and
each extends an abstract base class for one of the interfaces above (depending on whether or not value
storage is required to compute the statistic).
There are several ways to instantiate and use statistics. Statistics can be instantiated and used directly, but it is
generally more convenient (and efficient) to access them using the provided aggregates,
DescriptiveStatistics and
SummaryStatistics.
DescriptiveStatistics maintains the input data in memory and has the capability
of producing "rolling" statistics computed from a "window" consisting of the most recently added values.
SummaryStatistics does not store the input data values in memory, so the statistics
included in this aggregate are limited to those that can be computed in one pass through the data
without access to the full array of values.
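To make the "window" idea concrete, here is a minimal sketch of a rolling aggregate written in plain Java. It is not the DescriptiveStatistics implementation; the class name RollingWindowSketch and its methods are hypothetical. It keeps only the most recent values, which is what makes a statistic such as the median available.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Illustrative only: keeps the most recent windowSize values in memory. */
class RollingWindowSketch {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();

    RollingWindowSketch(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Adds a value, discarding the oldest one once the window is full. */
    void addValue(double v) {
        if (window.size() == windowSize) {
            window.removeFirst();
        }
        window.addLast(v);
    }

    long getN() {
        return window.size();
    }

    double getMean() {
        if (window.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double v : window) {
            sum += v;
        }
        return sum / window.size();
    }

    /** The median needs the stored values: it sorts a copy of the window. */
    double getMedian() {
        if (window.isEmpty()) {
            return Double.NaN;
        }
        double[] sorted = new double[window.size()];
        int i = 0;
        for (double v : window) {
            sorted[i++] = v;
        }
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

A storeless aggregate could report the mean in the same way, but it could not offer getMedian(), because the values needed for sorting are no longer available.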
For frequency distributions, strings, integers, longs and chars are all supported as value types, as well as instances
of any class that implements Comparable. The ordering of values
used in computing cumulative frequencies is by default the natural ordering,
but this can be overridden by supplying a Comparator to the constructor.
Adding values that are not comparable to those that have already been added results in an
IllegalArgumentException.
Here are some examples.
Compute a frequency distribution based on integer values
Mixing integers, longs, Integers and Longs:
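A minimal standalone sketch of this kind of counting, written without the library: the class IntegerFrequencySketch and its method names are illustrative assumptions. All whole-number inputs are stored as long keys, so 1, an Integer 1 and a Long 1 land in the same bin.

```java
import java.util.TreeMap;

/** Illustrative only: counts whole-number values, keyed as long in sorted order. */
class IntegerFrequencySketch {
    private final TreeMap<Long, Long> counts = new TreeMap<>();
    private long total = 0;

    /** Accepts int, long, Integer and Long arguments via widening and unboxing. */
    void addValue(long v) {
        counts.merge(v, 1L, Long::sum);
        total++;
    }

    /** Number of times v has been added. */
    long getCount(long v) {
        return counts.getOrDefault(v, 0L);
    }

    /** Share of all added values that are equal to v. */
    double getPct(long v) {
        return total == 0 ? Double.NaN : (double) getCount(v) / total;
    }

    /** Number of added values that are less than or equal to v. */
    long getCumFreq(long v) {
        long sum = 0;
        for (long c : counts.headMap(v, true).values()) {
            sum += c;
        }
        return sum;
    }

    /** Share of all added values that are less than or equal to v. */
    double getCumPct(long v) {
        return total == 0 ? Double.NaN : (double) getCumFreq(v) / total;
    }
}
```

For example, after adding 1, Integer.valueOf(1), Long.valueOf(1), 2 and Integer.valueOf(-1), the count for 1 is 3 and the cumulative percentage at 0 is 0.2, since only one of the five values (-1) is at or below 0.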
Count string frequencies
Using case-sensitive comparison, alpha sort order (natural comparator):
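The effect of the ordering on cumulative frequencies can be sketched as follows. This is a standalone illustration, not the library class: OrderedFrequencySketch and its methods are hypothetical names. With no comparator the natural (case-sensitive) ordering of String applies; passing String.CASE_INSENSITIVE_ORDER changes both which values share a bin and how they are ordered.

```java
import java.util.Comparator;
import java.util.TreeMap;

/** Illustrative only: frequency counts over any ordered value type. */
class OrderedFrequencySketch<T> {
    private final TreeMap<T, Long> counts;
    private long total = 0;

    /** Uses the natural ordering of T, which must then implement Comparable. */
    OrderedFrequencySketch() {
        this(null);
    }

    /** Uses the supplied comparator; a null comparator means natural ordering. */
    OrderedFrequencySketch(Comparator<? super T> comparator) {
        this.counts = new TreeMap<>(comparator);
    }

    void addValue(T v) {
        counts.merge(v, 1L, Long::sum);
        total++;
    }

    long getCount(T v) {
        return counts.getOrDefault(v, 0L);
    }

    /** Share of added values that sort at or before v under the chosen ordering. */
    double getCumPct(T v) {
        if (total == 0) {
            return Double.NaN;
        }
        long sum = 0;
        for (long c : counts.headMap(v, true).values()) {
            sum += c;
        }
        return (double) sum / total;
    }
}
```

With the values "one", "One", "oNe" and "Z", the natural ordering sorts them as "One", "Z", "oNe", "one", so the count for "one" is 1 and the cumulative percentage at "Z" is 0.5. Under the case-insensitive comparator the first three values fall into one bin, so the count for "one" is 3.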
For bivariate regression, standard errors for the intercept and slope are
available, as well as ANOVA, r-square and Pearson's r statistics.
Observations (x,y pairs) can be added to the model one at a time or they
can be provided in a 2-dimensional array. The observations are not stored
in memory, so there is no limit to the number of observations that can be
added to the model.
Usage Notes:
When there are fewer than two observations in the model, or when
there is no variation in the x values (i.e., all x values are the same),
all statistics return NaN. At least two observations with
different x coordinates are required to estimate a bivariate regression
model.
Getters for the statistics always compute values based on the current
set of observations -- i.e., you can get statistics, then add more data
and get updated statistics without using a new instance. There is no
"compute" method that updates all statistics. Each of the getters performs
the necessary computations to return the requested statistic.
Implementation Notes:
As observations are added to the model, the sum of x values, y values,
cross products (x times y), and squared deviations of x and y from their
respective means are updated using updating formulas defined in
"Algorithms for Computing the Sample Variance: Analysis and
Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J.
1983, American Statistician, vol. 37, pp. 242-247, referenced in
Weisberg, S. "Applied Linear Regression". 2nd Ed. 1985. All regression
statistics are computed from these sums.
Inference statistics (confidence intervals, parameter significance levels)
are based on the assumption that the observations included in the model are
drawn from a Bivariate Normal Distribution.
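The notes above can be illustrated with a small standalone sketch. It is not the library's regression class: RunningRegressionSketch and its members are hypothetical, the updating step is one common form of the kind of formula the cited paper analyzes, and running means are kept in place of the raw sums of x and y. No observation is stored, each getter computes its result on demand, and NaN is returned when the model cannot be estimated.

```java
/** Illustrative only: least-squares fit of y = intercept + slope * x from running sums. */
class RunningRegressionSketch {
    private long n = 0;
    private double xbar = 0.0;   // running mean of x
    private double ybar = 0.0;   // running mean of y
    private double sumXX = 0.0;  // sum of squared deviations of x from its mean
    private double sumYY = 0.0;  // sum of squared deviations of y from its mean
    private double sumXY = 0.0;  // sum of cross products of the deviations

    /** Adds one observation and updates the sums without storing it. */
    void addData(double x, double y) {
        if (n == 0) {
            xbar = x;
            ybar = y;
        } else {
            double dx = x - xbar;
            double dy = y - ybar;
            double w = n / (n + 1.0);
            sumXX += dx * dx * w;
            sumYY += dy * dy * w;
            sumXY += dx * dy * w;
            xbar += dx / (n + 1.0);
            ybar += dy / (n + 1.0);
        }
        n++;
    }

    /** Adds every row of data as an (x, y) pair. */
    void addData(double[][] data) {
        for (double[] row : data) {
            addData(row[0], row[1]);
        }
    }

    long getN() {
        return n;
    }

    double getSlope() {
        // Not estimable with fewer than two points or no variation in x
        if (n < 2 || sumXX == 0.0) {
            return Double.NaN;
        }
        return sumXY / sumXX;
    }

    double getIntercept() {
        return ybar - getSlope() * xbar; // NaN propagates from getSlope()
    }

    double getRSquare() {
        if (n < 2 || sumXX == 0.0 || sumYY == 0.0) {
            return Double.NaN;
        }
        return (sumXY * sumXY) / (sumXX * sumYY);
    }

    double predict(double x) {
        return getIntercept() + getSlope() * x;
    }
}
```

Because the getters read the current sums each time they are called, observations can be added one at a time or as a double[][] array in any mix, and the next getter call reflects all of them.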
Here are some examples.
Estimate a model based on observations added one at a time
Instantiate a regression instance and add data points
Compute some statistics based on observations added so far
Use the regression model to predict the y value for a new x value
More data points can be added and subsequent getXxx calls will incorporate
additional data in statistics.
Estimate a model from a double[][] array of data points
Instantiate a regression object and load dataset
Estimate regression model based on data
More data points -- even another double[][] array -- can be added and subsequent
getXxx calls will incorporate additional data in statistics.
This is yet to be written. Any contributions will be gratefully
accepted!