<?xml version="1.0"?>

<!--
   Copyright 2003-2004 The Apache Software Foundation

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
  -->

<?xml-stylesheet type="text/xsl" href="./xdoc.xsl"?>
<!-- $Revision: 1.12 $ $Date: 2004/03/07 00:56:14 $ -->
|
|
<document url="stat.html">
|
|
<properties>
|
|
<title>The Commons Math User Guide - Statistics</title>
|
|
</properties>
|
|
<body>
|
|
<section name="1 Statistics and Distributions">
|
|
<subsection name="1.1 Overview" href="overview">
|
|
<p>
|
|
The statistics and distributions packages provide frameworks and implementations for
|
|
basic univariate statistics, frequency distributions, bivariate regression, t- and chi-square test
|
|
statistics and some commonly used probability distributions.
|
|
</p>
|
|
</subsection>
|
|
<subsection name="1.2 Univariate statistics" href="univariate">
|
|
<p>
|
|
The stat package includes a framework and default implementations for the following univariate
|
|
statistics:
|
|
<ul>
|
|
<li>arithmetic and geometric means</li>
|
|
<li>variance and standard deviation</li>
|
|
<li>sum, product, log sum, sum of squared values</li>
|
|
<li>minimum, maximum, median, and percentiles</li>
|
|
<li>skewness and kurtosis</li>
|
|
<li>first, second, third and fourth moments</li>
|
|
</ul>
|
|
</p>
|
|
<p>
|
|
With the exception of percentiles and the median, all of these statistics can be computed without
|
|
maintaining the full list of input data values in memory. The stat package provides interfaces and
|
|
implementations that do not require value storage as well as implementations that operate on arrays
|
|
of stored values.
|
|
</p>
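<p>
As a rough illustration of how a statistic can be maintained without storing the data, the
following standalone sketch keeps a running count, mean and sum of squared deviations using
Welford's update. It is not the Commons Math implementation: the class <code>RunningMoments</code>
and its method names are hypothetical and use only the JDK.
</p>
<source><![CDATA[
```java
/** Illustrative one-pass accumulator; a sketch, not the Commons Math code. */
final class RunningMoments {
    private long n;       // number of values seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Fold one value into the running state (Welford's update). */
    void increment(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long getN() {
        return n;
    }

    double getMean() {
        return n == 0 ? Double.NaN : mean;
    }

    /** Bias-corrected sample variance (divides by n - 1). */
    double getVariance() {
        if (n == 0) {
            return Double.NaN;
        }
        return n == 1 ? 0.0 : m2 / (n - 1);
    }

    public static void main(String[] args) {
        RunningMoments stats = new RunningMoments();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.increment(x);  // each value is used once and then discarded
        }
        System.out.println("n = " + stats.getN());
        System.out.println("mean = " + stats.getMean());
        System.out.println("variance = " + stats.getVariance());
    }
}
```
]]></source>
<p>
For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the bias-corrected variance is 32/7.
The median, by contrast, cannot be recovered from this state, which is why it requires stored values.
</p>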
|
|
<p>
|
|
The top level interface is
|
|
<a href="../apidocs/org/apache/commons/math/stat/univariate/UnivariateStatistic.html">
|
|
org.apache.commons.math.stat.univariate.UnivariateStatistic.</a> This interface, implemented by
|
|
all statistics, consists of <code>evaluate()</code> methods that take double[] arrays as arguments and return
|
|
the value of the statistic. This interface is extended by
|
|
<a href="../apidocs/org/apache/commons/math/stat/univariate/StorelessUnivariateStatistic.html">
|
|
StorelessUnivariateStatistic,</a> which adds <code>increment(),</code>
|
|
<code>getResult()</code> and associated methods to support "storageless" implementations that
|
|
maintain counters, sums or other state information as values are added using the <code>increment()</code>
|
|
method.
|
|
</p>
|
|
<p>
|
|
Abstract implementations of the top level interfaces are provided in
|
|
<a href="../apidocs/org/apache/commons/math/stat/univariate/AbstractUnivariateStatistic.html">
|
|
AbstractUnivariateStatistic</a> and
|
|
<a href="../apidocs/org/apache/commons/math/stat/univariate/AbstractStorelessUnivariateStatistic.html">
|
|
AbstractStorelessUnivariateStatistic</a> respectively.
|
|
</p>
|
|
<p>
|
|
Each statistic is implemented as a separate class, in one of the subpackages (moment, rank, summary) and
|
|
each extends one of the abstract classes above (depending on whether or not value storage is required to
|
|
compute the statistic).
|
|
There are several ways to instantiate and use statistics. Statistics can be instantiated and used directly, but it is
|
|
generally more convenient (and efficient) to access them using the provided aggregates, <a href="../apidocs/org/apache/commons/math/stat/DescriptiveStatistics.html">
|
|
DescriptiveStatistics</a> and <a href="../apidocs/org/apache/commons/math/stat/SummaryStatistics.html">
|
|
SummaryStatistics.</a> <code>DescriptiveStatistics</code> maintains the input data in memory and has the capability
|
|
of producing "rolling" statistics computed from a "window" consisting of the most recently added values. <code>SummaryStatistics</code>
|
|
does not store the input data values in memory, so the statistics included in this aggregate are limited to those that can be
|
|
computed in one pass through the data without access to the full array of values.
|
|
</p>
|
|
<p>
|
|
<table>
|
|
<tr><th>Aggregate</th><th>Statistics Included</th><th>Values stored?</th><th>"Rolling" capability?</th></tr>
|
|
<tr><td><a href="../apidocs/org/apache/commons/math/stat/DescriptiveStatistics.html">
|
|
DescriptiveStatistics</a></td><td>min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance, percentiles, skewness, kurtosis, median</td><td>Yes</td><td>Yes</td></tr>
|
|
<tr><td><a href="../apidocs/org/apache/commons/math/stat/SummaryStatistics.html">
|
|
SummaryStatistics</a></td><td>min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance</td><td>No</td><td>No</td></tr>
|
|
</table>
|
|
</p>
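<p>
The "rolling" capability listed in the table can be pictured as a fixed-size buffer in which each
new value overwrites the oldest one once the window is full. The sketch below shows that idea with
a circular array. It assumes the oldest value is the one discarded, and the class
<code>RollingMean</code> is hypothetical; it is not how <code>DescriptiveStatistics</code> is
necessarily implemented.
</p>
<source><![CDATA[
```java
/** Illustrative fixed-window mean; a sketch, not the Commons Math code. */
final class RollingMean {
    private final double[] window;
    private int next;   // index that the next value will overwrite
    private int count;  // number of values currently held (at most window.length)

    RollingMean(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        window = new double[windowSize];
    }

    /** Add a value, discarding the oldest one if the window is already full. */
    void addValue(double x) {
        window[next] = x;
        next = (next + 1) % window.length;
        if (count < window.length) {
            count++;
        }
    }

    /** Mean of the values currently in the window, or NaN if it is empty. */
    double getMean() {
        if (count == 0) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (int i = 0; i < count; i++) {
            sum += window[i];
        }
        return sum / count;
    }

    public static void main(String[] args) {
        RollingMean rolling = new RollingMean(3);
        for (double x : new double[] {1, 2, 3, 4, 5}) {
            rolling.addValue(x);
        }
        System.out.println(rolling.getMean());  // prints 4.0 (window holds 3, 4, 5)
    }
}
```
]]></source>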
|
|
<p>
|
|
There is also a utility class, <a href="../apidocs/org/apache/commons/math/stat/StatUtils.html">
|
|
StatUtils,</a> that provides static methods for computing statistics
|
|
directly from double[] arrays.
|
|
</p>
|
|
<p>
|
|
Here are some examples showing how to compute univariate statistics.
|
|
<dl>
|
|
<dt>Compute summary statistics for a list of double values</dt>
|
|
<br></br>
|
|
<dd>Using the <code>DescriptiveStatistics</code> aggregate (values are stored in memory):
|
|
<source>
|
|
// Get a DescriptiveStatistics instance using factory method
|
|
DescriptiveStatistics stats = DescriptiveStatistics.newInstance();
|
|
|
|
// Add the data from the array
|
|
for (int i = 0; i &lt; inputArray.length; i++) {
|
|
stats.addValue(inputArray[i]);
|
|
}
|
|
|
|
// Compute some statistics
|
|
double mean = stats.getMean();
|
|
double std = stats.getStandardDeviation();
|
|
double median = stats.getMedian();
|
|
</source>
|
|
</dd>
|
|
<dd>Using the <code>SummaryStatistics</code> aggregate (values are <strong>not</strong> stored in memory):
|
|
<source>
|
|
// Get a SummaryStatistics instance using factory method
|
|
SummaryStatistics stats = SummaryStatistics.newInstance();
|
|
|
|
// Read data from an input stream, adding values and updating the sums, counters, etc. needed for the stats
// (assumes "in" is an open BufferedReader with one numeric value per line)
String line;
while ((line = in.readLine()) != null) {
    stats.addValue(Double.parseDouble(line.trim()));
}
in.close();
|
|
|
|
// Compute the statistics
|
|
double mean = stats.getMean();
|
|
double std = stats.getStandardDeviation();
|
|
//double median = stats.getMedian(); <-- NOT AVAILABLE in SummaryStatistics
|
|
</source>
|
|
</dd>
|
|
<dd>Using the <code>StatUtils</code> utility class:
|
|
<source>
|
|
// Compute statistics directly from the array -- assume values is a double[] array
double mean = StatUtils.mean(values);
double std = Math.sqrt(StatUtils.variance(values));
double median = StatUtils.percentile(values, 50);

// Compute the mean of the first three values in the array
mean = StatUtils.mean(values, 0, 3);
|
|
</source>
|
|
</dd>
|
|
<dt>Maintain a "rolling mean" of the most recent 100 values from an input stream</dt>
|
|
<br></br>
|
|
<dd>Use a <code>DescriptiveStatistics</code> instance with window size set to 100
|
|
<source>
|
|
// Create a DescriptiveStatistics instance and set the window size to 100
DescriptiveStatistics stats = DescriptiveStatistics.newInstance();
stats.setWindowSize(100);

// Read data from an input stream, displaying the mean of the most recent 100 observations
// after every 100 observations (assumes "in" is an open BufferedReader)
long nLines = 0;
String line;
while ((line = in.readLine()) != null) {
    stats.addValue(Double.parseDouble(line.trim()));
    nLines++;
    if (nLines == 100) {
        nLines = 0;
        System.out.println(stats.getMean());  // "rolling" mean of most recent 100 values
    }
}
in.close();
|
|
</source>
|
|
</dd>
|
|
</dl>
|
|
</p>
|
|
</subsection>
|
|
|
|
<subsection name="1.3 Frequency distributions" href="frequency">
|
|
<p>
|
|
<a href="../apidocs/org/apache/commons/math/stat/Frequency.html">
|
|
org.apache.commons.math.stat.Frequency</a>
|
|
provides a simple interface for maintaining counts and percentages of discrete
|
|
values.
|
|
</p>
|
|
<p>
|
|
Strings, integers, longs, chars are all supported as value types, as well as instances
|
|
of any class that implements <code>Comparable.</code> The ordering of values
|
|
used in computing cumulative frequencies is by default the <i>natural ordering,</i>
|
|
but this can be overridden by supplying a <code>Comparator</code> to the constructor.
|
|
Adding values that are not comparable to those that have already been added results in an
|
|
<code>IllegalArgumentException.</code>
|
|
</p>
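<p>
To make the role of the ordering concrete, the sketch below counts values in a sorted map and
derives a cumulative percentage as the share of observations that sort at or before a given value.
It is only an illustration built on <code>java.util.TreeMap</code>: the class
<code>SimpleFrequency</code> and its methods are hypothetical stand-ins, not the
<code>Frequency</code> implementation.
</p>
<source><![CDATA[
```java
import java.util.Comparator;
import java.util.TreeMap;

/** Illustrative frequency table; a sketch, not the Commons Math code. */
final class SimpleFrequency<T> {
    private final TreeMap<T, Long> counts;  // kept sorted by the supplied comparator
    private long total;

    SimpleFrequency(Comparator<? super T> order) {
        counts = new TreeMap<>(order);
    }

    void addValue(T value) {
        counts.merge(value, 1L, Long::sum);
        total++;
    }

    /** Number of observations equal to value under the comparator. */
    long getCount(T value) {
        return counts.getOrDefault(value, 0L);
    }

    /** Share of observations that sort at or before value, or NaN if empty. */
    double getCumPct(T value) {
        if (total == 0) {
            return Double.NaN;
        }
        long cumulative = 0;
        for (long c : counts.headMap(value, true).values()) {
            cumulative += c;
        }
        return (double) cumulative / total;
    }

    public static void main(String[] args) {
        SimpleFrequency<String> f = new SimpleFrequency<>(Comparator.<String>naturalOrder());
        for (String s : new String[] {"one", "One", "oNe", "Z"}) {
            f.addValue(s);
        }
        System.out.println(f.getCumPct("Ot"));  // prints 0.25 (only "One" sorts before "Ot")
    }
}
```
]]></source>
<p>
Passing <code>String.CASE_INSENSITIVE_ORDER</code> instead makes "one", "One" and "oNe" collapse
into a single key, so the comparator determines both which values count as equal and where the
cumulative totals fall.
</p>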
|
|
<p>
|
|
Here are some examples.
|
|
<dl>
|
|
<dt>Compute a frequency distribution based on integer values</dt>
|
|
<br></br>
|
|
<dd>Mixing integers, longs, Integers and Longs:
|
|
<source>
|
|
Frequency f = new Frequency();
|
|
f.addValue(1);
|
|
f.addValue(new Integer(1));
|
|
f.addValue(new Long(1));
|
|
f.addValue(2);
f.addValue(new Integer(-1));
System.out.println(f.getCount(1)); // displays 3
|
|
System.out.println(f.getCumPct(0)); // displays 0.2
|
|
System.out.println(f.getPct(new Integer(1))); // displays 0.6
|
|
System.out.println(f.getCumPct(-2)); // displays 0 -- all values are greater than this
|
|
System.out.println(f.getCumPct(10)); // displays 1 -- all values are less than this
|
|
</source>
|
|
</dd>
|
|
<dt>Count string frequencies</dt>
|
|
<br></br>
|
|
<dd>Using case-sensitive comparison, alpha sort order (natural comparator):
|
|
<source>
|
|
Frequency f = new Frequency();
|
|
f.addValue("one");
|
|
f.addValue("One");
|
|
f.addValue("oNe");
|
|
f.addValue("Z");
|
|
System.out.println(f.getCount("one")); // displays 1
|
|
System.out.println(f.getCumPct("Z")); // displays 0.5 -- second in sort order
|
|
System.out.println(f.getCumPct("Ot")); // displays 0.25 -- between first ("One") and second ("Z") value
|
|
</source>
|
|
</dd>
|
|
<dd>Using case-insensitive comparator:
|
|
<source>
|
|
Frequency f = new Frequency(String.CASE_INSENSITIVE_ORDER);
|
|
f.addValue("one");
|
|
f.addValue("One");
|
|
f.addValue("oNe");
|
|
f.addValue("Z");
|
|
System.out.println(f.getCount("one")); // displays 3
|
|
System.out.println(f.getCumPct("z")); // displays 1 -- last value
|
|
</source>
|
|
</dd>
|
|
</dl>
|
|
</p>
|
|
</subsection>
|
|
<subsection name="1.4 Bivariate regression" href="regression">
|
|
<p>This is yet to be written. Any contributions will be gratefully
|
|
accepted!</p>
|
|
</subsection>
|
|
<subsection name="1.5 Statistical tests" href="tests">
|
|
<p>This is yet to be written. Any contributions will be gratefully
|
|
accepted!</p>
|
|
</subsection>
|
|
<subsection name="1.6 Distribution framework" href="distributions">
|
|
<p>
|
|
The distribution framework provides the means to compute probability density
|
|
function (PDF) probabilities and cumulative distribution function (CDF)
|
|
probabilities for common probability distributions. Along with the direct
|
|
computation of PDF and CDF probabilities, the framework also allows for the
|
|
computation of inverse PDF and inverse CDF values.
|
|
</p>
|
|
<p>
|
|
To use the distribution framework, first create a distribution object. Distribution
objects should be created via the
<code>org.apache.commons.math.stat.distribution.DistributionFactory</code>
class. <code>DistributionFactory</code> is a simple factory used to create all
of the distribution objects supported by Commons-Math. A typical use of
<code>DistributionFactory</code> to create a distribution object is:
|
|
</p>
|
|
<source>DistributionFactory factory = DistributionFactory.newInstance();
|
|
BinomialDistribution binomial = factory.createBinomialDistribution(10, .75);</source>
|
|
<p>
|
|
The distributions that can be instantiated via the <code>DistributionFactory</code>
|
|
are detailed below:
|
|
<table>
|
|
<tr><th>Distribution</th><th>Factory Method</th><th>Parameters</th></tr>
|
|
<tr><td>Binomial</td><td>createBinomialDistribution</td><td><div>Number of trials</div><div>Probability of success</div></td></tr>
|
|
<tr><td>Chi-Squared</td><td>createChiSquaredDistribution</td><td><div>Degrees of freedom</div></td></tr>
|
|
<tr><td>Exponential</td><td>createExponentialDistribution</td><td><div>Mean</div></td></tr>
|
|
<tr><td>F</td><td>createFDistribution</td><td><div>Numerator degrees of freedom</div><div>Denominator degrees of freedom</div></td></tr>
|
|
<tr><td>Gamma</td><td>createGammaDistribution</td><td><div>Alpha</div><div>Beta</div></td></tr>
|
|
<tr><td>Hypergeometric</td><td>createHypergeometricDistribution</td><td><div>Population size</div><div>Number of successes in population</div><div>Sample size</div></td></tr>
|
|
<tr><td>Normal (Gaussian)</td><td>createNormalDistribution</td><td><div>Mean</div><div>Standard Deviation</div></td></tr>
|
|
<tr><td>t</td><td>createTDistribution</td><td><div>Degrees of freedom</div></td></tr>
|
|
</table>
|
|
</p>
|
|
<p>
|
|
Using a distribution object, cumulative (CDF) probabilities are easily computed
using the <code>cumulativeProbability</code> methods. For a distribution <code>X</code>,
and a domain value, <code>x</code>, <code>cumulativeProbability</code> computes
<code>P(X &lt;= x)</code> (i.e. the lower tail probability of <code>X</code>).
|
|
</p>
|
|
<source>DistributionFactory factory = DistributionFactory.newInstance();
|
|
TDistribution t = factory.createTDistribution(29);
double lowerTail = t.cumulativeProbability(-2.656);     // P(T &lt;= -2.656)
double upperTail = 1.0 - t.cumulativeProbability(2.75); // P(T >= 2.75)</source>
|
|
<p>
|
|
The inverse CDF values are just as easily computed using the
<code>inverseCumulativeProbability</code> methods. For a distribution <code>X</code>,
and a probability, <code>p</code>, <code>inverseCumulativeProbability</code>
computes the domain value <code>x</code>, such that:
<ul>
<li><code>P(X &lt;= x) = p</code>, for continuous distributions</li>
<li><code>P(X &lt;= x) &lt;= p</code>, for discrete distributions</li>
</ul>
Notice the different cases for continuous and discrete distributions. The cumulative
distribution function of a discrete distribution is a step function, so it is not
invertible and an exact domain value cannot always be returned, only the "best" domain
value. For Commons-Math, the "best" domain value is the largest domain value whose
cumulative probability is less than or equal to the given probability.
|
|
</p>
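<p>
The discrete case can be sketched by hand for a binomial distribution: step through the domain,
accumulate <code>P(X &lt;= x)</code>, and keep the largest <code>x</code> whose cumulative
probability does not exceed the target. The class <code>BinomialInverse</code> below is a
hypothetical, JDK-only illustration of that rule and assumes a success probability strictly
between 0 and 1. Returning -1 when no domain value qualifies is a choice made for this sketch,
not documented library behavior.
</p>
<source><![CDATA[
```java
/** Illustrative discrete inverse CDF; a sketch, not the Commons Math code. */
final class BinomialInverse {

    /** P(X <= x) for X ~ Binomial(n, p), assuming 0 < p < 1. */
    static double cumulativeProbability(int n, double p, int x) {
        if (x < 0) {
            return 0.0;
        }
        if (x >= n) {
            return 1.0;
        }
        double term = Math.pow(1.0 - p, n);  // P(X = 0)
        double sum = term;
        for (int k = 1; k <= x; k++) {
            // P(X = k) = P(X = k - 1) * (n - k + 1) / k * p / (1 - p)
            term *= (double) (n - k + 1) / k * p / (1.0 - p);
            sum += term;
        }
        return sum;
    }

    /** Largest x with P(X <= x) <= prob, or -1 if even P(X <= 0) exceeds prob. */
    static int inverseCumulativeProbability(int n, double p, double prob) {
        int best = -1;
        for (int x = 0; x <= n; x++) {
            if (cumulativeProbability(n, p, x) > prob) {
                break;  // the CDF is non-decreasing, so no later x can qualify
            }
            best = x;
        }
        return best;
    }

    public static void main(String[] args) {
        // For Binomial(10, 0.5): P(X <= 4) is about 0.377 and P(X <= 5) is about 0.623
        System.out.println(inverseCumulativeProbability(10, 0.5, 0.5));  // prints 4
    }
}
```
]]></source>
<p>
No domain value has a cumulative probability of exactly 0.5 here, so the sketch returns 4, the
last point at which the cumulative probability is still at or below the target.
</p>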
|
|
</subsection>
|
|
|
|
</section>
|
|
</body>
|
|
</document>
|