<?xml version="1.0"?>

<!--
   Copyright 2003-2004 The Apache Software Foundation

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
  -->

<?xml-stylesheet type="text/xsl" href="./xdoc.xsl"?>
<!-- $Revision: 1.9 $ $Date: 2004/02/29 21:25:08 $ -->
<document url="stat.html">
  <properties>
    <title>The Commons Math User Guide - Statistics</title>
  </properties>
  <body>
    <section name="1 Statistics and Distributions">
      <subsection name="1.1 Overview" href="overview">
        <p>
          The statistics and distributions packages provide frameworks and implementations for
          basic univariate statistics, frequency distributions, bivariate regression, t- and chi-square test
          statistics and some commonly used probability distributions.
        </p>
      </subsection>
      <subsection name="1.2 Univariate statistics" href="univariate">
        <p>
          The stat package includes a framework and default implementations for the following univariate
          statistics:
          <ul>
            <li>arithmetic and geometric means</li>
            <li>variance and standard deviation</li>
            <li>sum, product, log sum, sum of squared values</li>
            <li>minimum, maximum, median, and percentiles</li>
            <li>skewness and kurtosis</li>
            <li>first, second, third and fourth moments</li>
          </ul>
        </p>
        <p>
          With the exception of percentiles and the median, all of these statistics can be computed without
          maintaining the full list of input data values in memory. The stat package provides interfaces and
          implementations that do not require value storage as well as implementations that operate on arrays
          of stored values.
        </p>
        <p>
          The top level interface is
          <a href="../apidocs/org/apache/commons/math/stat/univariate/UnivariateStatistic.html">
          org.apache.commons.math.stat.univariate.UnivariateStatistic</a>. This interface, implemented by
          all statistics, consists of <code>evaluate()</code> methods that take <code>double[]</code> arrays as
          arguments and return the value of the statistic. This interface is extended by
          <a href="../apidocs/org/apache/commons/math/stat/univariate/StorelessUnivariateStatistic.html">
          org.apache.commons.math.stat.univariate.StorelessUnivariateStatistic</a>, which adds <code>increment()</code>,
          <code>getResult()</code> and associated methods to support "storageless" implementations that
          maintain counters, sums or other state information as values are added using the <code>increment()</code>
          method.
        </p>
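        <p>
          To make the "storageless" idea concrete, the following is a minimal, self-contained sketch of
          a running mean. The class <code>RunningMean</code> is hypothetical and is not part of
          Commons-Math; it only borrows the method names <code>increment()</code>,
          <code>getResult()</code> and <code>evaluate()</code> from the interfaces described above to
          show how a statistic can be updated one value at a time without keeping the values.
        </p>
        <source><![CDATA[/** Hypothetical illustration only; not a Commons-Math class. */
final class RunningMean {
    private long n = 0;        // number of values seen so far
    private double mean = 0.0; // mean of the values seen so far

    /** Adds one value and updates the state in constant time; the value is not stored. */
    void increment(double value) {
        n++;
        mean += (value - mean) / n;
    }

    /** Returns the mean of all values added so far, or NaN if no values were added. */
    double getResult() {
        return n == 0 ? Double.NaN : mean;
    }

    /** Array-based counterpart: computes the mean of an array of stored values. */
    static double evaluate(double[] values) {
        RunningMean m = new RunningMean();
        for (int i = 0; i < values.length; i++) {
            m.increment(values[i]);
        }
        return m.getResult();
    }
}]]></source>
        <p>
          A median cannot be maintained this way, because its value depends on the ordering of all of
          the data, which is why percentile-based statistics need value storage.
        </p>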
        <p>
          Abstract implementations of the top level interfaces are provided in
          <a href="../apidocs/org/apache/commons/math/stat/univariate/AbstractUnivariateStatistic.html">
          org.apache.commons.math.stat.univariate.AbstractUnivariateStatistic</a> and
          <a href="../apidocs/org/apache/commons/math/stat/univariate/AbstractStorelessUnivariateStatistic.html">
          org.apache.commons.math.stat.univariate.AbstractStorelessUnivariateStatistic</a> respectively.
        </p>
        <p>
          Each statistic is implemented as a separate class in one of the subpackages (moment, rank, summary),
          and each extends one of the abstract classes above, depending on whether or not value storage is
          required to compute the statistic.
          There are several ways to instantiate and use statistics. Statistics can be instantiated and used
          directly, but it is generally more convenient to access them using the provided aggregates:
          <table>
            <tr><th>Aggregate</th><th>Statistics Included</th><th>Values stored?</th></tr>
            <tr><td><a href="../apidocs/org/apache/commons/math/stat/DescriptiveStatistics.html">
            org.apache.commons.math.stat.DescriptiveStatistics</a></td><td>All</td><td>Yes</td></tr>
            <tr><td><a href="../apidocs/org/apache/commons/math/stat/SummaryStatistics.html">
            org.apache.commons.math.stat.SummaryStatistics</a></td><td>min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance</td><td>No</td></tr>
          </table>
          TODO: add code sample
          There is also a utility class, <a href="../apidocs/org/apache/commons/math/stat/StatUtils.html">
          org.apache.commons.math.stat.StatUtils</a>, that provides static methods for computing statistics
          from <code>double[]</code> arrays.
        </p>
      </subsection>
      <subsection name="1.3 Frequency distributions" href="frequency">
        <p>This is yet to be written. Any contributions will be gratefully
          accepted!</p>
      </subsection>
      <subsection name="1.4 Bivariate regression" href="regression">
        <p>This is yet to be written. Any contributions will be gratefully
          accepted!</p>
      </subsection>
      <subsection name="1.5 Statistical tests" href="tests">
        <p>This is yet to be written. Any contributions will be gratefully
          accepted!</p>
      </subsection>
      <subsection name="1.6 Distribution framework" href="distributions">
        <p>
          The distribution framework provides the means to compute probability density
          function (PDF) probabilities and cumulative distribution function (CDF)
          probabilities for common probability distributions. Along with the direct
          computation of PDF and CDF probabilities, the framework also allows for the
          computation of inverse PDF and inverse CDF values.
        </p>
        <p>
          To use the distribution framework, first create a distribution object.
          Creating all distribution objects via the
          <code>org.apache.commons.math.stat.distribution.DistributionFactory</code>
          class is encouraged. <code>DistributionFactory</code> is a simple factory used to create all
          of the distribution objects supported by Commons-Math. Typical usage of
          <code>DistributionFactory</code> to create a distribution object is:
        </p>
        <source>DistributionFactory factory = DistributionFactory.newInstance();
// Binomial distribution with 10 trials and a success probability of 0.75
BinomialDistribution binomial = factory.createBinomialDistribution(10, 0.75);</source>
        <p>
          The distributions that can be instantiated via the <code>DistributionFactory</code>
          are detailed below:
          <table>
            <tr><th>Distribution</th><th>Factory Method</th><th>Parameters</th></tr>
            <tr><td>Binomial</td><td>createBinomialDistribution</td><td><div>Number of trials</div><div>Probability of success</div></td></tr>
            <tr><td>Chi-Squared</td><td>createChiSquaredDistribution</td><td><div>Degrees of freedom</div></td></tr>
            <tr><td>Exponential</td><td>createExponentialDistribution</td><td><div>Mean</div></td></tr>
            <tr><td>F</td><td>createFDistribution</td><td><div>Numerator degrees of freedom</div><div>Denominator degrees of freedom</div></td></tr>
            <tr><td>Gamma</td><td>createGammaDistribution</td><td><div>Alpha</div><div>Beta</div></td></tr>
            <tr><td>Hypergeometric</td><td>createHypergeometricDistribution</td><td><div>Population size</div><div>Number of successes in population</div><div>Sample size</div></td></tr>
            <tr><td>Normal (Gaussian)</td><td>createNormalDistribution</td><td><div>Mean</div><div>Standard Deviation</div></td></tr>
            <tr><td>t</td><td>createTDistribution</td><td><div>Degrees of freedom</div></td></tr>
          </table>
        </p>
        <p>
          Using a distribution object, CDF probabilities are easily computed
          using the <code>cumulativeProbability</code> methods. For a distribution <code>X</code>,
          and a domain value, <code>x</code>, <code>cumulativeProbability</code> computes
          <code>P(X &lt;= x)</code> (i.e. the lower tail probability of <code>X</code>).
        </p>
        <source>DistributionFactory factory = DistributionFactory.newInstance();
// t distribution with 29 degrees of freedom
TDistribution t = factory.createTDistribution(29);
double lowerTail = t.cumulativeProbability(-2.656); // P(T &lt;= -2.656)
double upperTail = 1.0 - t.cumulativeProbability(2.75); // P(T &gt;= 2.75)</source>
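        <p>
          For readers who want to see what a lower tail probability amounts to, the sketch below computes
          <code>P(X &lt;= x)</code> for a binomial distribution by summing point probabilities. The class
          <code>BinomialCdfSketch</code> and its method signatures are hypothetical and written only for
          this illustration; they make no claim about how Commons-Math computes the value internally.
        </p>
        <source><![CDATA[/** Hypothetical illustration only; not a Commons-Math class. */
final class BinomialCdfSketch {

    /** P(X = k) for X ~ Binomial(n, p). */
    static double probability(int n, double p, int k) {
        if (k < 0 || k > n) {
            return 0.0;
        }
        // Binomial coefficient C(n, k), built up as a product of exact ratios.
        double coeff = 1.0;
        for (int i = 1; i <= k; i++) {
            coeff = coeff * (n - k + i) / i;
        }
        return coeff * Math.pow(p, k) * Math.pow(1.0 - p, n - k);
    }

    /** Lower tail probability P(X <= x): the sum of P(X = k) for k = 0..x. */
    static double cumulativeProbability(int n, double p, int x) {
        if (x < 0) {
            return 0.0;
        }
        if (x >= n) {
            return 1.0;
        }
        double sum = 0.0;
        for (int k = 0; k <= x; k++) {
            sum += probability(n, p, k);
        }
        return sum;
    }
}]]></source>
        <p>
          Note that for a discrete distribution, <code>1.0 - cumulativeProbability(x)</code> is
          <code>P(X &gt; x)</code>, which equals <code>P(X &gt;= x + 1)</code> rather than
          <code>P(X &gt;= x)</code>.
        </p>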
        <p>
          Inverse CDF values are just as easily computed using the
          <code>inverseCumulativeProbability</code> methods. For a distribution <code>X</code>,
          and a probability, <code>p</code>, <code>inverseCumulativeProbability</code>
          computes the domain value <code>x</code>, such that:
          <ul>
            <li><code>P(X &lt;= x) = p</code>, for continuous distributions</li>
            <li><code>P(X &lt;= x) &lt;= p</code>, for discrete distributions</li>
          </ul>
          Notice the different cases for continuous and discrete distributions. This is the result
          of the CDFs of discrete distributions not being invertible functions. As such, for discrete
          distributions, an exact domain value cannot always be returned, only the "best" domain value.
          For Commons-Math, the "best" domain value is the largest domain value whose cumulative
          probability is less than or equal to the given probability.
        </p>
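        <p>
          The "best" domain value rule for discrete distributions can be sketched as a simple search.
          The class <code>DiscreteInverseSketch</code> below is hypothetical and self-contained; it
          uses a binomial distribution as the example, and its choice to return <code>-1</code> when no
          domain value qualifies is an assumption made for this illustration, not a statement about the
          library's behavior.
        </p>
        <source><![CDATA[/** Hypothetical illustration only; not a Commons-Math class. */
final class DiscreteInverseSketch {

    /**
     * For X ~ Binomial(n, successProb), returns the largest x in 0..n such that
     * P(X <= x) <= p, or -1 (an assumption of this sketch) if even P(X <= 0) exceeds p.
     */
    static int inverseCumulativeProbability(int n, double successProb, double p) {
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in [0, 1]");
        }
        int best = -1;
        double cdf = 0.0;   // running value of P(X <= x)
        double coeff = 1.0; // binomial coefficient C(n, x), starting at C(n, 0)
        for (int x = 0; x <= n; x++) {
            if (x > 0) {
                coeff = coeff * (n - x + 1) / x; // C(n, x) from C(n, x - 1)
            }
            cdf += coeff * Math.pow(successProb, x) * Math.pow(1.0 - successProb, n - x);
            if (cdf > p) {
                break; // the CDF is non-decreasing, so no larger x can qualify
            }
            best = x;
        }
        return best;
    }
}]]></source>
        <p>
          For example, with 2 trials and a success probability of 0.5 the cumulative probabilities are
          0.25, 0.75 and 1.0, so a probability of 0.5 maps to the domain value 0: no domain value has a
          cumulative probability of exactly 0.5, and 0 is the largest one that does not exceed it.
        </p>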
      </subsection>
    </section>
  </body>
</document>