<?xml version="1.0"?>
<!--
Copyright 2003-2004 The Apache Software Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<?xml-stylesheet type="text/xsl" href="./xdoc.xsl"?>
<!-- $Revision: 1.10 $ $Date: 2004/03/03 02:32:25 $ -->
<document url= "stat.html" >
<properties >
<title > The Commons Math User Guide - Statistics</title>
</properties>
<body >
<section name= "1 Statistics and Distributions" >
<subsection name= "1.1 Overview" href= "overview" >
<p >
The statistics and distributions packages provide frameworks and implementations for
basic univariate statistics, frequency distributions, bivariate regression, t- and chi-square test
statistics and some commonly used probability distributions.
</p>
</subsection>
<subsection name= "1.2 Univariate statistics" href= "univariate" >
<p >
The stat package includes a framework and default implementations for the following univariate
statistics:
<ul >
<li > arithmetic and geometric means</li>
<li > variance and standard deviation</li>
<li > sum, product, log sum, sum of squared values</li>
<li > minimum, maximum, median, and percentiles</li>
<li > skewness and kurtosis</li>
<li > first, second, third and fourth moments</li>
</ul>
</p>
<p >
With the exception of percentiles and the median, all of these statistics can be computed without
maintaining the full list of input data values in memory. The stat package provides interfaces and
implementations that do not require value storage as well as implementations that operate on arrays
of stored values.
</p>
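<p >
To illustrate how a statistic can be maintained without storing the input values, here is a
minimal, self-contained sketch of a one-pass accumulator for the mean and sample variance. It uses
Welford's updating formulas, which are an assumption made for this illustration and not necessarily
what the stat package does internally. The class name <code > RunningMoments</code> and its methods
are hypothetical and are not part of Commons-Math.
</p>
<source >
```java
// Hypothetical sketch: one-pass mean and sample variance (Welford's method).
// This is NOT a Commons-Math class; all names here are illustrative only.
final class RunningMoments {
    private long n = 0;        // number of values seen so far
    private double mean = 0.0; // running mean
    private double m2 = 0.0;   // running sum of squared deviations from the mean

    // Update the state with one new value; the value itself is not stored.
    void increment(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long getN() {
        return n;
    }

    // NaN when no values have been added (a choice made for this sketch).
    double getMean() {
        return n > 0 ? mean : Double.NaN;
    }

    // Bias-corrected sample variance (divides by n - 1).
    double getVariance() {
        if (n == 0) {
            return Double.NaN;
        }
        if (n == 1) {
            return 0.0;
        }
        return m2 / (n - 1);
    }

    double getStandardDeviation() {
        return Math.sqrt(getVariance());
    }

    public static void main(String[] args) {
        RunningMoments acc = new RunningMoments();
        double[] data = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        for (double v : data) {
            acc.increment(v);
        }
        System.out.println("n = " + acc.getN());
        System.out.println("mean = " + acc.getMean());
        System.out.println("variance = " + acc.getVariance());
    }
}
```
</source>
<p >
Only three numbers are kept no matter how many values are added, which is why statistics of this
kind do not need the full data set. Percentiles and the median have no comparable fixed-size state,
so they require the stored values.
</p>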
<p >
The top level interface is
<a href= "../apidocs/org/apache/commons/math/stat/univariate/UnivariateStatistic.html" >
org.apache.commons.math.stat.univariate.UnivariateStatistic.</a> This interface, implemented by
all statistics, consists of <code > evaluate()</code> methods that take double[] arrays as arguments and return
the value of the statistic. This interface is extended by
<a href= "../apidocs/org/apache/commons/math/stat/univariate/StorelessUnivariateStatistic.html" >
StorelessUnivariateStatistic,</a> which adds <code > increment(),</code>
<code > getResult()</code> and associated methods to support "storageless" implementations that
maintain counters, sums or other state information as values are added using the <code > increment()</code>
method.
</p>
<p >
Abstract implementations of the top level interfaces are provided in
<a href= "../apidocs/org/apache/commons/math/stat/univariate/AbstractUnivariateStatistic.html" >
AbstractUnivariateStatistic</a> and
<a href= "../apidocs/org/apache/commons/math/stat/univariate/AbstractStorelessUnivariateStatistic.html" >
AbstractStorelessUnivariateStatistic</a> respectively.
</p>
<p >
Each statistic is implemented as a separate class, in one of the subpackages (moment, rank, summary) and
each extends one of the abstract classes above (depending on whether or not value storage is required to
compute the statistic).
There are several ways to instantiate and use statistics. Statistics can be instantiated and used directly, but it is
generally more convenient (and efficient) to access them using the provided aggregates, <a href= "../apidocs/org/apache/commons/math/stat/DescriptiveStatistics.html" >
DescriptiveStatistics</a> and <a href= "../apidocs/org/apache/commons/math/stat/SummaryStatistics.html" >
SummaryStatistics.</a> <code > DescriptiveStatistics</code> maintains the input data in memory and has the capability
of producing "rolling" statistics computed from a "window" consisting of the most recently added values. <code > SummaryStatistics</code>
does not store the input data values in memory, so the statistics included in this aggregate are limited to those that can be
computed in one pass through the data without access to the full array of values.
</p>
<p >
<table >
<tr > <th > Aggregate</th> <th > Statistics Included</th> <th > Values stored?</th> <th > "Rolling" capability?</th> </tr>
<tr > <td > <a href= "../apidocs/org/apache/commons/math/stat/DescriptiveStatistics.html" >
DescriptiveStatistics</a> </td> <td > min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance, percentiles, skewness, kurtosis, median</td> <td > Yes</td> <td > Yes</td> </tr>
<tr > <td > <a href= "../apidocs/org/apache/commons/math/stat/SummaryStatistics.html" >
SummaryStatistics</a> </td> <td > min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance</td> <td > No</td> <td > No</td> </tr>
</table>
</p>
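<p >
The idea of a "rolling" statistic over a window of the most recently added values can be sketched
in plain Java as follows. This is only an illustration under stated assumptions: the class name
<code > RollingMean</code> is hypothetical, the fixed-size ring buffer is one possible design, and
nothing here is claimed to match how <code > DescriptiveStatistics</code> is implemented.
</p>
<source >
```java
// Hypothetical sketch: mean of the most recent windowSize values.
// This is NOT a Commons-Math class; all names here are illustrative only.
final class RollingMean {
    private final double[] window; // ring buffer holding the current window
    private int count = 0;         // number of values currently in the window
    private int next = 0;          // slot that the next value will overwrite
    private double sum = 0.0;      // sum of the values currently in the window

    RollingMean(int windowSize) {
        if (1 > windowSize) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        window = new double[windowSize];
    }

    // Add a value; once the window is full, the oldest value is dropped.
    void addValue(double x) {
        if (count == window.length) {
            sum -= window[next]; // remove the oldest value from the sum
        } else {
            count++;
        }
        window[next] = x;
        sum += x;
        next = (next + 1) % window.length;
    }

    // NaN when the window is empty (a choice made for this sketch).
    double getMean() {
        return count > 0 ? sum / count : Double.NaN;
    }

    public static void main(String[] args) {
        RollingMean rolling = new RollingMean(3);
        double[] data = {1.0, 2.0, 3.0, 4.0, 10.0};
        for (double v : data) {
            rolling.addValue(v);
            System.out.println("rolling mean = " + rolling.getMean());
        }
    }
}
```
</source>
<p >
Keeping a running sum makes each update constant time, at the cost of some accumulated rounding
error over very long streams. Recomputing the sum from the buffer on each call is a simple
alternative when accuracy matters more than speed.
</p>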
<p >
There is also a utility class, <a href= "../apidocs/org/apache/commons/math/stat/StatUtils.html" >
StatUtils,</a> that provides static methods for computing statistics
directly from double[] arrays.
</p>
<p >
Here are some examples showing how to compute univariate statistics.
<dl >
<dt > Compute summary statistics for a list of double values</dt>
<br > </br>
<dd > Using the <code > DescriptiveStatistics</code> aggregate (values are stored in memory):
<source >
// Get a DescriptiveStatistics instance using factory method
DescriptiveStatistics stats = DescriptiveStatistics.newInstance();
// Add the data from the array
for (int i = 0; i &lt; inputArray.length; i++) {
    stats.addValue(inputArray[i]);
}
// Compute some statistics
double mean = stats.getMean();
double std = stats.getStandardDeviation();
double median = stats.getMedian();
</source>
</dd>
<dd > Using the <code > SummaryStatistics</code> aggregate (values are <strong > not</strong> stored in memory):
<source >
// Get a SummaryStatistics instance using factory method
SummaryStatistics stats = SummaryStatistics.newInstance();
// Read data from an input stream, adding values and updating sums, counters, etc. necessary for stats
// (in is assumed to be an open java.io.BufferedReader with one value per line)
String line;
while ((line = in.readLine()) != null) {
    stats.addValue(Double.parseDouble(line.trim()));
}
in.close();
// Compute the statistics
double mean = stats.getMean();
double std = stats.getStandardDeviation();
// double median = stats.getMedian(); &lt;-- NOT AVAILABLE in SummaryStatistics
</source>
</dd>
<dd > Using the <code > StatUtils</code> utility class:
<source >
// Compute statistics directly from the array -- assume values is a double[] array
double mean = StatUtils.mean(values);
double std = Math.sqrt(StatUtils.variance(values));
double median = StatUtils.percentile(values, 50);
// Compute the mean of the first three values in the array
mean = StatUtils.mean(values, 0, 3);
</source>
</dd>
<dt > Maintain a "rolling mean" of the most recent 100 values from an input stream</dt>
<br > </br>
<dd > Use a <code > DescriptiveStatistics</code> instance with window size set to 100
<source >
// Create a DescriptiveStatistics instance and set the window size to 100
DescriptiveStatistics stats = DescriptiveStatistics.newInstance();
stats.setWindowSize(100);
// Read data from an input stream, displaying the mean of the most recent 100 observations
// after every 100 observations (in is assumed to be an open java.io.BufferedReader)
long nLines = 0;
String line;
while ((line = in.readLine()) != null) {
    stats.addValue(Double.parseDouble(line.trim()));
    nLines++;
    if (nLines == 100) {
        nLines = 0;
        System.out.println(stats.getMean()); // "rolling" mean of most recent 100 values
    }
}
in.close();
</source>
</dd>
</dl>
</p>
</subsection>
<subsection name= "1.3 Frequency distributions" href= "frequency" >
<p > This is yet to be written. Any contributions will be gratefully
accepted!</p>
</subsection>
<subsection name= "1.4 Bivariate regression" href= "regression" >
<p > This is yet to be written. Any contributions will be gratefully
accepted!</p>
</subsection>
<subsection name= "1.5 Statistical tests" href= "tests" >
<p > This is yet to be written. Any contributions will be gratefully
accepted!</p>
</subsection>
<subsection name= "1.6 Distribution framework" href= "distributions" >
<p >
The distribution framework provides the means to compute probability density
function (PDF) probabilities and cumulative distribution function (CDF)
probabilities for common probability distributions. Along with the direct
computation of PDF and CDF probabilities, the framework also allows for the
computation of inverse PDF and inverse CDF values.
</p>
<p >
To use the distribution framework, first create a distribution object. All
distribution objects should preferably be created via
the <code > org.apache.commons.math.stat.distribution.DistributionFactory</code>
class. <code > DistributionFactory</code> is a simple factory used to create all
of the distribution objects supported by Commons-Math. The typical usage of
<code > DistributionFactory</code> to create a distribution object would be:
</p>
<source > DistributionFactory factory = DistributionFactory.newInstance();
BinomialDistribution binomial = factory.createBinomialDistribution(10, .75);</source>
<p >
The distributions that can be instantiated via the <code > DistributionFactory</code>
are detailed below:
<table >
<tr > <th > Distribution</th> <th > Factory Method</th> <th > Parameters</th> </tr>
<tr > <td > Binomial</td> <td > createBinomialDistribution</td> <td > <div > Number of trials</div> <div > Probability of success</div> </td> </tr>
<tr > <td > Chi-Squared</td> <td > createChiSquaredDistribution</td> <td > <div > Degrees of freedom</div> </td> </tr>
<tr > <td > Exponential</td> <td > createExponentialDistribution</td> <td > <div > Mean</div> </td> </tr>
<tr > <td > F</td> <td > createFDistribution</td> <td > <div > Numerator degrees of freedom</div> <div > Denominator degrees of freedom</div> </td> </tr>
<tr > <td > Gamma</td> <td > createGammaDistribution</td> <td > <div > Alpha</div> <div > Beta</div> </td> </tr>
<tr > <td > Hypergeometric</td> <td > createHypergeometricDistribution</td> <td > <div > Population size</div> <div > Number of successes in population</div> <div > Sample size</div> </td> </tr>
<tr > <td > Normal (Gaussian)</td> <td > createNormalDistribution</td> <td > <div > Mean</div> <div > Standard Deviation</div> </td> </tr>
<tr > <td > t</td> <td > createTDistribution</td> <td > <div > Degrees of freedom</div> </td> </tr>
</table>
</p>
<p >
Using a distribution object, CDF probabilities are easily computed
using the <code > cumulativeProbability</code> methods. For a distribution <code > X</code>
and a domain value <code > x</code>, <code > cumulativeProbability</code> computes
<code > P(X &lt;= x)</code> (i.e. the lower tail probability of <code > X</code>).
</p>
<source > DistributionFactory factory = DistributionFactory.newInstance();
TDistribution t = factory.createTDistribution(29);
double lowerTail = t.cumulativeProbability(-2.656); // P(T &lt;= -2.656)
double upperTail = 1.0 - t.cumulativeProbability(2.75); // P(T >= 2.75)</source>
<p >
Inverse CDF values are just as easily computed using the
<code > inverseCumulativeProbability</code> methods. For a distribution <code > X</code>
and a probability <code > p</code>, <code > inverseCumulativeProbability</code>
computes the domain value <code > x</code> such that:
<ul >
<li > <code > P(X &lt;= x) = p</code>, for continuous distributions</li>
<li > <code > P(X &lt;= x) &lt;= p</code>, for discrete distributions</li>
</ul>
Notice the different cases for continuous and discrete distributions. The CDF of a
discrete distribution is a step function, so it is not invertible and an exact
domain value cannot always be returned; only the "best" domain value can. For Commons-Math,
the "best" domain value is the largest domain value whose cumulative probability is
less than or equal to the given probability.
</p>
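<p >
The rule for discrete distributions can be made concrete with a small, self-contained sketch for a
binomial distribution. It searches for the largest <code > x</code> whose cumulative probability
does not exceed <code > p</code>. The class <code > BinomialSketch</code>, its method signatures, the
linear search, and the choice to return -1 when no such <code > x</code> exists are all assumptions
made for this illustration; they are not the Commons-Math implementation.
</p>
<source >
```java
// Hypothetical sketch: inverse cumulative probability for a discrete distribution.
// This is NOT Commons-Math code; names and edge-case choices are illustrative only.
// Assumes the success probability is strictly between 0 and 1.
final class BinomialSketch {

    // Probability that a Binomial(n, success) variable is at most x.
    static double cumulativeProbability(int n, double success, int x) {
        if (0 > x) {
            return 0.0;
        }
        if (x >= n) {
            return 1.0;
        }
        double ratio = success / (1.0 - success);
        double term = Math.pow(1.0 - success, n); // P(X = 0)
        double sum = term;
        for (int k = 0; k != x; k++) {
            // Recurrence: P(X = k + 1) = P(X = k) * ratio * (n - k) / (k + 1)
            term *= ratio * (n - k) / (k + 1);
            sum += term;
        }
        return sum;
    }

    // Largest x whose cumulative probability does not exceed p.
    // Returns -1 when even x = 0 has cumulative probability above p.
    static int inverseCumulativeProbability(int n, double success, double p) {
        if (p >= 1.0) {
            return n;
        }
        int best = -1;
        for (int x = 0; n >= x; x++) {
            if (cumulativeProbability(n, success, x) > p) {
                break;
            }
            best = x;
        }
        return best;
    }

    public static void main(String[] args) {
        int x = inverseCumulativeProbability(10, 0.5, 0.5);
        System.out.println("x = " + x);
        System.out.println("cdf(x) = " + cumulativeProbability(10, 0.5, x));
        System.out.println("cdf(x + 1) = " + cumulativeProbability(10, 0.5, x + 1));
    }
}
```
</source>
<p >
In this sketch the cumulative probability at the returned value is at most <code > p</code>, while
the cumulative probability at the next larger value exceeds it, which is the sense in which the
returned value is the "best" one available.
</p>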
</subsection>
</section>
</body>
</document>