<?xml version="1.0"?>
<!--
   Copyright 2003-2004 The Apache Software Foundation

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
  -->
<?xml-stylesheet type="text/xsl" href="./xdoc.xsl"?>
<!-- $Revision: 1.23 $ $Date: 2004/10/08 05:08:21 $ -->
<document url="stat.html">
|
|
|
|
<properties>
|
|
|
|
<title>The Commons Math User Guide - Statistics</title>
|
|
|
|
</properties>
|
|
|
|
<body>
|
2004-05-06 00:24:28 -04:00
|
|
|
<section name="1 Statistics">
|
2003-11-15 13:38:16 -05:00
|
|
|
<subsection name="1.1 Overview" href="overview">
|
2004-02-29 16:25:08 -05:00
|
|
|
<p>
|
2004-05-06 00:24:28 -04:00
|
|
|
The statistics package provides frameworks and implementations for
basic descriptive statistics, frequency distributions, bivariate regression,
and t- and chi-square test statistics.
|
2004-05-06 00:24:28 -04:00
|
|
|
</p>
|
|
|
|
<p>
|
2004-10-08 01:08:21 -04:00
|
|
|
<a href="#1.2 Descriptive statistics">Descriptive statistics</a><br></br>
|
2004-05-06 00:24:28 -04:00
|
|
|
<a href="#1.3 Frequency distributions">Frequency distributions</a><br></br>
|
2004-09-01 12:19:21 -04:00
|
|
|
<a href="#1.4 Simple regression">Simple Regression</a><br></br>
|
2004-05-06 00:24:28 -04:00
|
|
|
<a href="#1.5 Statistical tests">Statistical Tests</a><br></br>
|
2004-02-29 16:25:08 -05:00
|
|
|
</p>
|
2003-11-15 13:38:16 -05:00
|
|
|
</subsection>
|
2004-10-08 01:08:21 -04:00
|
|
|
<subsection name="1.2 Descriptive statistics" href="univariate">
|
2004-02-29 16:25:08 -05:00
|
|
|
<p>
|
2004-05-06 00:24:28 -04:00
|
|
|
The stat package includes a framework and default implementations for
the following descriptive statistics:
|
2004-02-29 16:25:08 -05:00
|
|
|
<ul>
|
|
|
|
<li>arithmetic and geometric means</li>
|
|
|
|
<li>variance and standard deviation</li>
|
|
|
|
<li>sum, product, log sum, sum of squared values</li>
|
|
|
|
<li>minimum, maximum, median, and percentiles</li>
|
|
|
|
<li>skewness and kurtosis</li>
|
|
|
|
<li>first, second, third and fourth moments</li>
|
|
|
|
</ul>
|
|
|
|
</p>
|
|
|
|
<p>
|
2004-05-06 00:24:28 -04:00
|
|
|
With the exception of percentiles and the median, all of these
|
|
|
|
statistics can be computed without maintaining the full list of input
|
|
|
|
data values in memory. The stat package provides interfaces and
|
|
|
|
implementations that do not require value storage as well as
|
2004-06-05 15:28:43 -04:00
|
|
|
implementations that operate on arrays of stored values.
|
2004-02-29 16:25:08 -05:00
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The top level interface is
|
2004-10-08 01:08:21 -04:00
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/descriptive/UnivariateStatistic.html">
|
|
|
|
org.apache.commons.math.stat.descriptive.UnivariateStatistic</a>.
|
2004-05-06 00:24:28 -04:00
|
|
|
This interface, implemented by all statistics, consists of
|
|
|
|
<code>evaluate()</code> methods that take double[] arrays as arguments
|
|
|
|
and return the value of the statistic. This interface is extended by
|
2004-10-08 01:08:21 -04:00
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/descriptive/StorelessUnivariateStatistic.html">
|
2004-03-21 15:32:50 -05:00
|
|
|
StorelessUnivariateStatistic</a>, which adds <code>increment()</code>,
<code>getResult()</code> and associated methods to support
"storageless" implementations that maintain counters, sums or other
state information as values are added using the <code>increment()</code>
method.
|
2004-02-29 16:25:08 -05:00
|
|
|
</p>
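<p>
The storeless approach can be illustrated with a minimal, self-contained sketch.
The class below is hypothetical and is not part of the library: the name
<code>RunningStats</code> and the choice of Welford's updating method are
assumptions made for illustration, and the library's own implementations may be
organized differently. It mirrors the <code>increment()</code> idea by updating
a count, a mean and a sum of squared deviations without keeping the values.
</p>

```java
/** Illustrative only: a storeless mean/variance accumulator (Welford's method). */
public class RunningStats {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the current mean

    /** Updates the internal state with one value; the value itself is not stored. */
    public void increment(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    public long getN() {
        return n;
    }

    /** Returns the arithmetic mean, or NaN if no values have been added. */
    public double getMean() {
        return n == 0 ? Double.NaN : mean;
    }

    /** Returns the bias-corrected sample variance (divides by n - 1). */
    public double getVariance() {
        if (n == 0) {
            return Double.NaN;
        }
        if (n == 1) {
            return 0.0;
        }
        return m2 / (n - 1);
    }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double x : new double[] {1d, 2d, 3d, 4d}) {
            stats.increment(x);
        }
        System.out.println(stats.getMean());
        System.out.println(stats.getVariance());
    }
}
```

<p>
Because only three fields are kept, memory use does not grow with the number of
values, which is the property that separates storeless statistics from the
median and percentiles.
</p>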
|
|
|
|
<p>
|
|
|
|
Abstract implementations of the top level interfaces are provided in
|
2004-10-08 01:08:21 -04:00
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/descriptive/AbstractUnivariateStatistic.html">
|
2004-03-02 21:32:25 -05:00
|
|
|
AbstractUnivariateStatistic</a> and
|
2004-10-08 01:08:21 -04:00
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/descriptive/AbstractStorelessUnivariateStatistic.html">
|
2004-03-02 21:32:25 -05:00
|
|
|
AbstractStorelessUnivariateStatistic</a> respectively.
|
2004-02-29 16:25:08 -05:00
|
|
|
</p>
|
|
|
|
<p>
|
2004-05-06 00:24:28 -04:00
|
|
|
Each statistic is implemented as a separate class, in one of the
|
|
|
|
subpackages (moment, rank, summary) and each extends one of the abstract
|
|
|
|
classes above (depending on whether or not value storage is required to
|
|
|
|
compute the statistic). There are several ways to instantiate and use statistics.
|
|
|
|
Statistics can be instantiated and used directly, but it is generally more convenient
|
|
|
|
(and efficient) to access them using the provided aggregates,
|
2004-10-08 01:08:21 -04:00
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html">
|
2004-05-06 00:24:28 -04:00
|
|
|
DescriptiveStatistics</a> and
|
2004-10-08 01:08:21 -04:00
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/descriptive/SummaryStatistics.html">
|
2004-05-06 00:24:28 -04:00
|
|
|
SummaryStatistics</a>.
|
2004-03-21 15:32:50 -05:00
|
|
|
</p>
|
|
|
|
<p>
|
2004-05-06 00:24:28 -04:00
|
|
|
<code>DescriptiveStatistics</code> maintains the input data in memory
|
|
|
|
and has the capability of producing "rolling" statistics computed from a
|
|
|
|
"window" consisting of the most recently added values.
|
2004-03-21 15:32:50 -05:00
|
|
|
</p>
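<p>
The "rolling" behaviour described above can be sketched in plain Java. The
<code>RollingMean</code> class below is a hypothetical illustration, not the
library's implementation: it assumes a fixed window size and keeps a running
sum alongside a queue of the most recent values.
</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative only: mean over a window of the most recently added values. */
public class RollingMean {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<Double>();
    private double sum = 0.0;

    public RollingMean(int windowSize) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("window size must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Adds a value, discarding the oldest one once the window is full. */
    public void addValue(double x) {
        window.addLast(x);
        sum += x;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();
        }
    }

    /** Returns the mean of the values currently in the window, or NaN if empty. */
    public double getMean() {
        return window.isEmpty() ? Double.NaN : sum / window.size();
    }
}
```

<p>
Subtracting from a running sum is fast but can accumulate rounding error over
very long streams; recomputing the sum from the window contents is a simple
alternative when that matters.
</p>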
|
|
|
|
<p>
|
2004-05-06 00:24:28 -04:00
|
|
|
<code>SummaryStatistics</code> does not store the input data values
|
2004-10-08 01:08:21 -04:00
|
|
|
in memory, so the statistics included in this aggregate are limited to those
|
2004-05-06 00:24:28 -04:00
|
|
|
that can be computed in one pass through the data without access to
|
|
|
|
the full array of values.
|
2004-03-02 21:32:25 -05:00
|
|
|
</p>
|
|
|
|
<p>
|
2004-02-29 16:25:08 -05:00
|
|
|
<table>
|
2004-05-06 00:24:28 -04:00
|
|
|
<tr><th>Aggregate</th><th>Statistics Included</th><th>Values stored?</th>
|
|
|
|
<th>"Rolling" capability?</th></tr><tr><td>
|
2004-10-08 01:08:21 -04:00
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html">
|
2004-05-06 00:24:28 -04:00
|
|
|
DescriptiveStatistics</a></td><td>min, max, mean, geometric mean, n,
|
|
|
|
sum, sum of squares, standard deviation, variance, percentiles, skewness,
|
|
|
|
kurtosis, median</td><td>Yes</td><td>Yes</td></tr><tr><td>
|
2004-10-08 01:08:21 -04:00
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/descriptive/SummaryStatistics.html">
|
2004-05-06 00:24:28 -04:00
|
|
|
SummaryStatistics</a></td><td>min, max, mean, geometric mean, n,
|
|
|
|
sum, sum of squares, standard deviation, variance</td><td>No</td><td>No</td></tr>
|
2004-02-29 16:25:08 -05:00
|
|
|
</table>
|
2004-03-02 21:32:25 -05:00
|
|
|
</p>
|
|
|
|
<p>
|
2004-05-06 00:24:28 -04:00
|
|
|
There is also a utility class,
|
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/StatUtils.html">
|
|
|
|
StatUtils</a>, that provides static methods for computing statistics
|
|
|
|
directly from double[] arrays.
|
2004-02-29 16:25:08 -05:00
|
|
|
</p>
|
2004-03-02 21:32:25 -05:00
|
|
|
<p>
|
2004-10-08 01:08:21 -04:00
|
|
|
Here are some examples showing how to compute descriptive statistics.
|
2004-03-02 21:32:25 -05:00
|
|
|
<dl>
|
|
|
|
<dt>Compute summary statistics for a list of double values</dt>
|
|
|
|
<br></br>
|
2004-05-06 00:24:28 -04:00
|
|
|
<dd>Using the <code>DescriptiveStatistics</code> aggregate
|
|
|
|
(values are stored in memory):
|
2004-03-02 21:32:25 -05:00
|
|
|
<source>
|
|
|
|
// Get a DescriptiveStatistics instance using factory method
|
|
|
|
DescriptiveStatistics stats = DescriptiveStatistics.newInstance();
|
|
|
|
|
|
|
|
// Add the data from the array
|
|
|
|
for( int i = 0; i < inputArray.length; i++) {
|
|
|
|
stats.addValue(inputArray[i]);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Compute some statistics
|
|
|
|
double mean = stats.getMean();
|
|
|
|
double std = stats.getStandardDeviation();
|
|
|
|
double median = stats.getMedian();
|
|
|
|
</source>
|
|
|
|
</dd>
|
2004-05-06 00:24:28 -04:00
|
|
|
<dd>Using the <code>SummaryStatistics</code> aggregate (values are
|
|
|
|
<strong>not</strong> stored in memory):
|
2004-03-02 21:32:25 -05:00
|
|
|
<source>
|
|
|
|
// Get a SummaryStatistics instance using factory method
|
|
|
|
SummaryStatistics stats = SummaryStatistics.newInstance();
|
|
|
|
|
2004-05-06 00:24:28 -04:00
|
|
|
// Read data from an input stream,
|
|
|
|
// adding values and updating sums, counters, etc.
|
2004-03-02 21:32:25 -05:00
|
|
|
// Assume in is an open java.io.BufferedReader
String line;
while ((line = in.readLine()) != null) {
    stats.addValue(Double.parseDouble(line.trim()));
}
|
|
|
|
in.close();
|
|
|
|
|
|
|
|
// Compute the statistics
|
|
|
|
double mean = stats.getMean();
|
|
|
|
double std = stats.getStandardDeviation();
|
2004-05-06 00:24:28 -04:00
|
|
|
//double median = stats.getMedian(); <-- NOT AVAILABLE
|
2004-03-02 21:32:25 -05:00
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
<dd>Using the <code>StatUtils</code> utility class:
|
|
|
|
<source>
|
2004-05-06 00:24:28 -04:00
|
|
|
// Compute statistics directly from the array
|
|
|
|
// assume values is a double[] array
|
2004-03-02 21:32:25 -05:00
|
|
|
double mean = StatUtils.mean(values);
|
|
|
|
double std = Math.sqrt(StatUtils.variance(values));
double median = StatUtils.percentile(values, 50);
|
2004-03-21 15:32:50 -05:00
|
|
|
|
2004-03-02 21:32:25 -05:00
|
|
|
// Compute the mean of the first three values in the array
|
|
|
|
mean = StatUtils.mean(values, 0, 3);
|
|
|
|
</source>
|
|
|
|
</dd>
|
2004-05-06 00:24:28 -04:00
|
|
|
<dt>Maintain a "rolling mean" of the most recent 100 values from
|
|
|
|
an input stream</dt>
|
2004-03-02 21:32:25 -05:00
|
|
|
<br></br>
|
2004-05-06 00:24:28 -04:00
|
|
|
<dd>Use a <code>DescriptiveStatistics</code> instance with
|
|
|
|
window size set to 100
|
2004-03-02 21:32:25 -05:00
|
|
|
<source>
|
|
|
|
// Create a DescriptiveStatistics instance and set the window size to 100
|
|
|
|
DescriptiveStatistics stats = DescriptiveStatistics.newInstance();
|
|
|
|
stats.setWindowSize(100);
|
2004-03-21 15:32:50 -05:00
|
|
|
|
2004-05-06 00:24:28 -04:00
|
|
|
// Read data from an input stream,
|
|
|
|
// displaying the mean of the most recent 100 observations
|
2004-03-02 21:32:25 -05:00
|
|
|
// after every 100 observations
|
|
|
|
long nLines = 0;
String line;
while ((line = in.readLine()) != null) {
    stats.addValue(Double.parseDouble(line.trim()));
    nLines++;
    if (nLines == 100) {
        nLines = 0;
        System.out.println(stats.getMean());
    }
}
|
|
|
|
in.close();
|
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
</dl>
|
|
|
|
</p>
|
2004-05-06 00:24:28 -04:00
|
|
|
</subsection>
|
2003-11-15 13:38:16 -05:00
|
|
|
<subsection name="1.3 Frequency distributions" href="frequency">
|
2004-03-06 15:43:00 -05:00
|
|
|
<p>
|
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/Frequency.html">
|
2004-10-08 01:08:21 -04:00
|
|
|
org.apache.commons.math.stat.Frequency</a>
|
2004-03-06 15:43:00 -05:00
|
|
|
provides a simple interface for maintaining counts and percentages of discrete
|
|
|
|
values.
|
|
|
|
</p>
|
|
|
|
<p>
|
2004-05-06 00:24:28 -04:00
|
|
|
Strings, integers, longs and chars are all supported as value types,
as well as instances of any class that implements <code>Comparable</code>.
The ordering of values used in computing cumulative frequencies is by
default the <i>natural ordering</i>, but this can be overridden by supplying a
<code>Comparator</code> to the constructor. Adding values that are not
comparable to those that have already been added results in an
<code>IllegalArgumentException</code>.
|
|
|
|
</p>
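<p>
The role of the ordering in cumulative frequencies can be seen in a small
stand-alone sketch. <code>FrequencySketch</code> is a hypothetical class written
for illustration, assuming a sorted map keyed by value; it is not the library's
<code>Frequency</code> class and omits its type handling and error checks.
</p>

```java
import java.util.Comparator;
import java.util.TreeMap;

/** Illustrative only: counts and cumulative percentages over ordered values. */
public class FrequencySketch<T> {
    private final TreeMap<T, Long> counts;
    private long total = 0;

    /** Uses the natural ordering of the values. */
    public FrequencySketch() {
        counts = new TreeMap<T, Long>();
    }

    /** Uses the supplied comparator to order (and identify) values. */
    public FrequencySketch(Comparator<? super T> comparator) {
        counts = new TreeMap<T, Long>(comparator);
    }

    public void addValue(T v) {
        Long old = counts.get(v);
        counts.put(v, old == null ? 1L : old + 1L);
        total++;
    }

    public long getCount(T v) {
        Long c = counts.get(v);
        return c == null ? 0L : c;
    }

    /** Proportion of added values that are less than or equal to v. */
    public double getCumPct(T v) {
        if (total == 0) {
            return Double.NaN;
        }
        long cum = 0;
        for (long c : counts.headMap(v, true).values()) {
            cum += c;
        }
        return (double) cum / total;
    }
}
```

<p>
With the natural ordering, "one", "One" and "oNe" are three distinct values;
constructed with <code>String.CASE_INSENSITIVE_ORDER</code>, the same sketch
treats them as one value with a count of 3.
</p>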
|
|
|
|
<p>
|
|
|
|
Here are some examples.
|
|
|
|
<dl>
|
|
|
|
<dt>Compute a frequency distribution based on integer values</dt>
|
|
|
|
<br></br>
|
|
|
|
<dd>Mixing integers, longs, Integers and Longs:
|
|
|
|
<source>
|
|
|
|
Frequency f = new Frequency();
|
|
|
|
f.addValue(1);
|
|
|
|
f.addValue(new Integer(1));
|
|
|
|
f.addValue(new Long(1));
|
2004-08-12 11:34:14 -04:00
|
|
|
f.addValue(2);
|
2004-03-06 15:43:00 -05:00
|
|
|
f.addValue(new Integer(-1));
|
2004-06-05 15:28:43 -04:00
|
|
|
System.out.println(f.getCount(1)); // displays 3
|
|
|
|
System.out.println(f.getCumPct(0)); // displays 0.2
|
|
|
|
System.out.println(f.getPct(new Integer(1))); // displays 0.6
|
|
|
|
System.out.println(f.getCumPct(-2)); // displays 0
|
|
|
|
System.out.println(f.getCumPct(10)); // displays 1
|
2004-03-06 15:43:00 -05:00
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
<dt>Count string frequencies</dt>
|
|
|
|
<br></br>
|
|
|
|
<dd>Using case-sensitive comparison, alpha sort order (natural comparator):
|
|
|
|
<source>
|
|
|
|
Frequency f = new Frequency();
|
|
|
|
f.addValue("one");
|
|
|
|
f.addValue("One");
|
|
|
|
f.addValue("oNe");
|
|
|
|
f.addValue("Z");
|
2004-06-05 15:28:43 -04:00
|
|
|
System.out.println(f.getCount("one")); // displays 1
|
|
|
|
System.out.println(f.getCumPct("Z")); // displays 0.5
|
|
|
|
System.out.println(f.getCumPct("Ot")); // displays 0.25
|
2004-03-06 15:43:00 -05:00
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
<dd>Using case-insensitive comparator:
|
|
|
|
<source>
|
|
|
|
Frequency f = new Frequency(String.CASE_INSENSITIVE_ORDER);
|
|
|
|
f.addValue("one");
|
|
|
|
f.addValue("One");
|
|
|
|
f.addValue("oNe");
|
|
|
|
f.addValue("Z");
|
2004-06-05 15:28:43 -04:00
|
|
|
System.out.println(f.getCount("one")); // displays 3
|
|
|
|
System.out.println(f.getCumPct("z")); // displays 1
|
2004-03-06 15:43:00 -05:00
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
</dl>
|
|
|
|
</p>
|
2003-11-15 13:38:16 -05:00
|
|
|
</subsection>
|
2004-09-01 12:19:21 -04:00
|
|
|
<subsection name="1.4 Simple regression" href="regression">
|
2004-04-25 15:16:13 -04:00
|
|
|
<p>
|
2004-10-08 01:08:21 -04:00
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/regression/SimpleRegression.html">
|
|
|
|
org.apache.commons.math.stat.regression.SimpleRegression</a>
|
2004-05-06 00:24:28 -04:00
|
|
|
provides ordinary least squares regression with one independent variable,
|
|
|
|
estimating the linear model:
|
2004-04-25 15:16:13 -04:00
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
<code> y = intercept + slope * x </code>
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Standard errors for <code>intercept</code> and <code>slope</code> are
|
|
|
|
available as well as ANOVA, r-square and Pearson's r statistics.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Observations (x,y pairs) can be added to the model one at a time or they
|
|
|
|
can be provided in a 2-dimensional array. The observations are not stored
|
|
|
|
in memory, so there is no limit to the number of observations that can be
|
|
|
|
added to the model.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
<strong>Usage Notes</strong>: <ul>
|
|
|
|
<li> When there are fewer than two observations in the model, or when
there is no variation in the x values (i.e., all x values are the same),
all statistics return <code>NaN</code>. At least two observations with
different x coordinates are required to estimate a bivariate regression
model.</li>
|
|
|
|
<li> Getters for the statistics always compute values based on the current
|
|
|
|
set of observations -- i.e., you can get statistics, then add more data
|
|
|
|
and get updated statistics without using a new instance. There is no
|
|
|
|
"compute" method that updates all statistics. Each of the getters performs
|
|
|
|
the necessary computations to return the requested statistic.</li>
|
|
|
|
</ul>
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
<strong>Implementation Notes</strong>: <ul>
|
|
|
|
<li> As observations are added to the model, the sum of x values, y values,
|
|
|
|
cross products (x times y), and squared deviations of x and y from their
|
|
|
|
respective means are updated using updating formulas defined in
|
|
|
|
"Algorithms for Computing the Sample Variance: Analysis and
|
|
|
|
Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J.
|
|
|
|
1983, American Statistician, vol. 37, pp. 242-247, referenced in
|
|
|
|
Weisberg, S. "Applied Linear Regression". 2nd Ed. 1985. All regression
|
|
|
|
statistics are computed from these sums.</li>
|
|
|
|
<li> Inference statistics (confidence intervals, parameter significance levels)
are based on the assumption that the observations included in the model are
drawn from a <a href="http://mathworld.wolfram.com/BivariateNormalDistribution.html">
Bivariate Normal Distribution</a>.</li>
|
|
|
|
</ul>
|
|
|
|
</p>
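<p>
The idea of estimating the model from running sums, without storing
observations, can be sketched as follows. <code>RegressionSketch</code> is a
hypothetical class written for illustration; it assumes one common form of the
updating formulas for means, squared deviations and cross products, and it is
not a copy of the library's <code>SimpleRegression</code> code.
</p>

```java
/** Illustrative only: least squares slope and intercept from running sums. */
public class RegressionSketch {
    private long n = 0;
    private double xbar = 0.0;  // running mean of x
    private double ybar = 0.0;  // running mean of y
    private double sumXX = 0.0; // sum of squared deviations of x
    private double sumYY = 0.0; // sum of squared deviations of y
    private double sumXY = 0.0; // sum of cross products of deviations

    /** Adds one (x, y) observation and updates the sums; nothing is stored. */
    public void addData(double x, double y) {
        if (n == 0) {
            xbar = x;
            ybar = y;
        } else {
            double dx = x - xbar;
            double dy = y - ybar;
            double w = (double) n / (n + 1);
            sumXX += dx * dx * w;
            sumYY += dy * dy * w;
            sumXY += dx * dy * w;
            xbar += dx / (n + 1);
            ybar += dy / (n + 1);
        }
        n++;
    }

    /** Returns NaN with fewer than two observations or no variation in x. */
    public double getSlope() {
        if (n < 2 || sumXX == 0.0) {
            return Double.NaN;
        }
        return sumXY / sumXX;
    }

    public double getIntercept() {
        return ybar - getSlope() * xbar;
    }

    /** Coefficient of determination; NaN when either variable has no variation. */
    public double getRSquare() {
        if (n < 2 || sumXX == 0.0 || sumYY == 0.0) {
            return Double.NaN;
        }
        return (sumXY * sumXY) / (sumXX * sumYY);
    }
}
```

<p>
Each getter computes its result from the current sums, so more observations can
be added at any time and the next call reflects them.
</p>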
|
|
|
|
<p>
|
2004-05-06 00:24:28 -04:00
|
|
|
Here are some examples.
|
2004-04-25 15:16:13 -04:00
|
|
|
<dl>
|
|
|
|
<dt>Estimate a model based on observations added one at a time</dt>
|
|
|
|
<br></br>
|
|
|
|
<dd>Instantiate a regression instance and add data points
|
|
|
|
<source>
|
2004-09-01 12:19:21 -04:00
|
|
|
SimpleRegression regression = new SimpleRegression();
|
2004-06-05 15:28:43 -04:00
|
|
|
regression.addData(1d, 2d);
|
|
|
|
// At this point, with only one observation,
|
|
|
|
// all regression statistics will return NaN
|
|
|
|
|
|
|
|
regression.addData(3d, 3d);
|
|
|
|
// With only two observations,
|
|
|
|
// slope and intercept can be computed
|
|
|
|
// but inference statistics will return NaN
|
|
|
|
|
|
|
|
regression.addData(3d, 3d);
|
|
|
|
// Now all statistics are defined.
|
2004-04-25 15:16:13 -04:00
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
<dd>Compute some statistics based on observations added so far
|
|
|
|
<source>
|
2004-05-06 00:24:28 -04:00
|
|
|
System.out.println(regression.getIntercept());
|
|
|
|
// displays intercept of regression line
|
2004-06-05 15:28:43 -04:00
|
|
|
|
2004-05-06 00:24:28 -04:00
|
|
|
System.out.println(regression.getSlope());
|
|
|
|
// displays slope of regression line
|
2004-06-05 15:28:43 -04:00
|
|
|
|
2004-05-06 00:24:28 -04:00
|
|
|
System.out.println(regression.getSlopeStdErr());
|
|
|
|
// displays slope standard error
|
2004-04-25 15:16:13 -04:00
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
<dd>Use the regression model to predict the y value for a new x value
|
|
|
|
<source>
|
2004-05-06 00:24:28 -04:00
|
|
|
System.out.println(regression.predict(1.5d));
|
|
|
|
// displays predicted y value for x = 1.5
|
2004-04-25 15:16:13 -04:00
|
|
|
</source>
|
2004-04-26 01:16:35 -04:00
|
|
|
More data points can be added and subsequent getXxx calls will incorporate
|
2004-04-25 15:16:13 -04:00
|
|
|
additional data in statistics.
|
|
|
|
</dd>
|
|
|
|
<dt>Estimate a model from a double[][] array of data points</dt>
|
|
|
|
<br></br>
|
|
|
|
<dd>Instantiate a regression object and load dataset
|
|
|
|
<source>
|
2004-05-06 00:24:28 -04:00
|
|
|
double[][] data = { { 1, 3 }, {2, 5 }, {3, 7 }, {4, 14 }, {5, 11 }};
|
2004-09-01 12:19:21 -04:00
|
|
|
SimpleRegression regression = new SimpleRegression();
|
2004-05-06 00:24:28 -04:00
|
|
|
regression.addData(data);
|
2004-04-25 15:16:13 -04:00
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
<dd>Estimate regression model based on data
|
|
|
|
<source>
|
2004-05-06 00:24:28 -04:00
|
|
|
System.out.println(regression.getIntercept());
|
|
|
|
// displays intercept of regression line
|
2004-06-05 15:28:43 -04:00
|
|
|
|
2004-05-06 00:24:28 -04:00
|
|
|
System.out.println(regression.getSlope());
|
|
|
|
// displays slope of regression line
|
2004-06-05 15:28:43 -04:00
|
|
|
|
2004-05-06 00:24:28 -04:00
|
|
|
System.out.println(regression.getSlopeStdErr());
|
|
|
|
// displays slope standard error
|
2004-04-25 15:16:13 -04:00
|
|
|
</source>
|
|
|
|
More data points -- even another double[][] array -- can be added and subsequent
|
2004-04-26 01:16:35 -04:00
|
|
|
getXxx calls will incorporate additional data in statistics.
|
2004-04-25 15:16:13 -04:00
|
|
|
</dd>
|
|
|
|
</dl>
|
|
|
|
</p>
|
2003-11-15 13:38:16 -05:00
|
|
|
</subsection>
|
|
|
|
<subsection name="1.5 Statistical tests" href="tests">
|
2004-05-06 00:24:28 -04:00
|
|
|
<p>
|
|
|
|
The interfaces and implementations in the
|
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/inference/">
|
|
|
|
org.apache.commons.math.stat.inference</a> package provide
|
|
|
|
<a href="http://www.itl.nist.gov/div898/handbook/prc/section2/prc22.htm">
|
2004-06-05 15:28:43 -04:00
|
|
|
Student's t</a> and
|
|
|
|
<a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm">
|
|
|
|
Chi-Square</a> test statistics as well as
|
2004-05-06 00:24:28 -04:00
|
|
|
<a href="http://www.cas.lancs.ac.uk/glossary_v1.1/hyptest.html#pvalue">
|
|
|
|
p-values</a> associated with <code>t-</code> and
|
|
|
|
<code>Chi-Square</code> tests.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
<strong>Implementation Notes</strong>
|
|
|
|
<ul>
|
2004-06-05 15:28:43 -04:00
|
|
|
<li>Both one- and two-sample t-tests are supported. Two-sample tests
can be either paired or unpaired, and the unpaired two-sample tests can
be conducted under the assumption of equal subpopulation variances or
without this assumption. When equal variances are assumed, a pooled
variance estimate is used to compute the t-statistic and the degrees
of freedom used in the t-test equals the sum of the sample sizes minus 2.
When equal variances are not assumed, the t-statistic uses both sample
variances and the
<a href="http://www.itl.nist.gov/div898/handbook/prc/section3/gifs/nu3.gif">
Welch-Satterthwaite approximation</a> is used to compute the degrees
of freedom. Methods to return t-statistics and p-values are provided in each
case, as well as boolean-valued methods to perform fixed significance
level tests. The names of methods that assume equal
subpopulation variances always start with "homoscedastic." Test or
test-statistic methods that just start with "t" do not assume equal
variances. See the examples below and the API documentation for
more details.</li>
|
2004-05-06 00:24:28 -04:00
|
|
|
<li>The validity of the p-values returned by the t-test depends on the
|
|
|
|
assumptions of the parametric t-test procedure, as discussed
|
|
|
|
<a href="http://www.basic.nwu.edu/statguidefiles/ttest_unpaired_ass_viol.html">
|
|
|
|
here</a>.</li>
|
|
|
|
<li>p-values returned by both t- and chi-square tests are exact, based
|
|
|
|
on numerical approximations to the t- and chi-square distributions in the
|
|
|
|
<code>distributions</code> package. </li>
|
2004-06-05 15:28:43 -04:00
|
|
|
<li>p-values returned by t-tests are for two-sided tests and the boolean-valued
|
|
|
|
methods supporting fixed significance level tests assume that the hypotheses
|
|
|
|
are two-sided. One-sided tests can be performed by dividing returned p-values
|
|
|
|
(resp. critical values) by 2.</li>
|
2004-05-06 00:24:28 -04:00
|
|
|
<li>Degrees of freedom for chi-square tests are integral values, based on the
|
|
|
|
number of observed or expected counts (number of observed counts - 1)
|
|
|
|
for the goodness-of-fit tests and (number of columns -1) * (number of rows - 1)
|
|
|
|
for independence tests.</li>
|
|
|
|
</ul>
|
|
|
|
</p>
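<p>
The two ways of forming an unpaired two-sample t-statistic described in the
notes above can be written out directly. The sketch below is a hypothetical
helper, not the library's <code>TTestImpl</code>; the class and method names
are assumptions, and the inputs are the sample means, sample variances and
sample sizes.
</p>

```java
/** Illustrative only: two-sample t-statistics and degrees of freedom. */
public class TwoSampleT {

    /** t-statistic that does not assume equal subpopulation variances. */
    public static double t(double m1, double m2, double v1, double v2,
                           double n1, double n2) {
        return (m1 - m2) / Math.sqrt(v1 / n1 + v2 / n2);
    }

    /** Welch-Satterthwaite approximation to the degrees of freedom. */
    public static double df(double v1, double v2, double n1, double n2) {
        double a = v1 / n1;
        double b = v2 / n2;
        return (a + b) * (a + b) / (a * a / (n1 - 1) + b * b / (n2 - 1));
    }

    /** t-statistic using a pooled variance; degrees of freedom are n1 + n2 - 2. */
    public static double homoscedasticT(double m1, double m2, double v1, double v2,
                                        double n1, double n2) {
        double pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2);
        return (m1 - m2) / Math.sqrt(pooled * (1 / n1 + 1 / n2));
    }
}
```

<p>
When the two samples have the same size and the same variance, both statistics
coincide and the approximate degrees of freedom reduce to the sum of the sample
sizes minus 2; they differ as the variances or sizes diverge.
</p>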
|
|
|
|
<p>
|
|
|
|
<strong>Examples:</strong>
|
|
|
|
<dl>
|
2004-06-05 15:28:43 -04:00
|
|
|
<dt><strong>One-sample <code>t</code> tests</strong></dt>
|
2004-05-06 00:24:28 -04:00
|
|
|
<br></br>
|
|
|
|
<dd>To compare the mean of a double[] array to a fixed value:
|
|
|
|
<source>
|
|
|
|
double[] observed = {1d, 2d, 3d};
|
|
|
|
double mu = 2.5d;
|
|
|
|
TTestImpl testStatistic = new TTestImpl();
|
|
|
|
System.out.println(testStatistic.t(mu, observed));
|
|
|
|
</source>
|
|
|
|
The code above will display the t-statistic associated with a one-sample
t-test comparing the mean of the <code>observed</code> values against
<code>mu</code>.
|
|
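<p>
For readers who want to see the arithmetic, the one-sample t-statistic can be
computed by hand as the difference between the sample mean and
<code>mu</code>, divided by the standard error of the mean. The class below is
a hypothetical stand-alone sketch, not the library's code, and its sign
convention (sample mean minus <code>mu</code>) is stated as an assumption.
</p>

```java
/** Illustrative only: one-sample t-statistic computed from an array. */
public class OneSampleT {

    /** Returns (mean - mu) / (s / sqrt(n)), where s is the sample standard deviation. */
    public static double t(double mu, double[] sample) {
        int n = sample.length;
        if (n < 2) {
            throw new IllegalArgumentException("at least two values are required");
        }
        double sum = 0.0;
        for (double x : sample) {
            sum += x;
        }
        double mean = sum / n;
        double ss = 0.0;
        for (double x : sample) {
            ss += (x - mean) * (x - mean);
        }
        double sd = Math.sqrt(ss / (n - 1));
        return (mean - mu) / (sd / Math.sqrt(n));
    }

    public static void main(String[] args) {
        double[] observed = {1d, 2d, 3d};
        System.out.println(t(2.5d, observed));
    }
}
```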
|
|
</dd>
|
|
|
|
<dd>To compare the mean of a dataset described by a
|
2004-10-08 01:08:21 -04:00
|
|
|
<a href="../apidocs/org/apache/commons/math/stat/descriptive/StatisticalSummary.html">
|
|
|
|
org.apache.commons.math.stat.descriptive.StatisticalSummary</a> to a fixed value:
|
2004-05-06 00:24:28 -04:00
|
|
|
<source>
|
|
|
|
double[] observed ={1d, 2d, 3d};
|
|
|
|
double mu = 2.5d;
|
|
|
|
SummaryStatistics sampleStats = null;
|
|
|
|
sampleStats = SummaryStatistics.newInstance();
|
|
|
|
for (int i = 0; i < observed.length; i++) {
|
|
|
|
sampleStats.addValue(observed[i]);
|
|
|
|
}
|
|
|
|
TTestImpl testStatistic = new TTestImpl();
System.out.println(testStatistic.t(mu, sampleStats));
|
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
<dd>To compute the p-value associated with the null hypothesis that the mean
|
|
|
|
of a set of values equals a point estimate, against the two-sided alternative that
|
|
|
|
the mean is different from the target value:
|
|
|
|
<source>
|
|
|
|
double[] observed = {1d, 2d, 3d};
|
|
|
|
double mu = 2.5d;
|
|
|
|
TTestImpl testStatistic = new TTestImpl();
|
|
|
|
System.out.println(testStatistic.tTest(mu, observed));
|
|
|
|
</source>
|
|
|
|
The snippet above will display the p-value associated with the null
|
|
|
|
hypothesis that the mean of the population from which the
|
|
|
|
<code>observed</code> values are drawn equals <code>mu</code>.
|
|
|
|
</dd>
|
2004-06-05 15:28:43 -04:00
|
|
|
<dd>To perform the test using a fixed significance level, use:
|
2004-05-06 00:24:28 -04:00
|
|
|
<source>
|
|
|
|
testStatistic.tTest(mu, observed, alpha);
|
|
|
|
</source>
|
|
|
|
where <code>0 < alpha < 0.5</code> is the significance level of
|
|
|
|
the test. The boolean value returned will be <code>true</code> iff the
|
|
|
|
null hypothesis can be rejected with confidence <code>1 - alpha</code>.
|
|
|
|
To test, for example at the 95% level of confidence, use
|
|
|
|
<code>alpha = 0.05</code>.
|
|
|
|
</dd>
|
2004-06-05 15:28:43 -04:00
|
|
|
<dt><strong>Two-Sample t-tests</strong></dt>
|
|
|
|
<br></br>
|
|
|
|
<dd><strong>Example 1:</strong> Paired test evaluating
|
|
|
|
the null hypothesis that the mean difference between corresponding
|
|
|
|
(paired) elements of the <code>double[]</code> arrays
|
|
|
|
<code>sample1</code> and <code>sample2</code> is zero.
|
|
|
|
<p>
|
|
|
|
To compute the t-statistic:
|
|
|
|
<source>
|
|
|
|
TTestImpl testStatistic = new TTestImpl();
|
|
|
|
testStatistic.pairedT(sample1, sample2);
|
|
|
|
</source>
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
To compute the (two-sided) p-value:
|
|
|
|
<source>
|
|
|
|
testStatistic.pairedTTest(sample1, sample2);
|
|
|
|
</source>
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
To perform a fixed significance level test with alpha = .05:
|
|
|
|
<source>
|
|
|
|
testStatistic.pairedTTest(sample1, sample2, .05);
|
|
|
|
</source>
|
|
|
|
</p>
|
|
|
|
The last example will return <code>true</code> iff the p-value
|
|
|
|
returned by <code>testStatistic.pairedTTest(sample1, sample2)</code>
|
|
|
|
is less than <code>.05</code>.
|
|
|
|
</dd>
|
|
|
|
<dd><strong>Example 2: </strong> unpaired, two-sample t-test using
|
|
|
|
<code>StatisticalSummary</code> instances, without assuming that
|
|
|
|
subpopulation variances are equal.
|
|
|
|
<p>
|
|
|
|
First create the <code>StatisticalSummary</code> instances. Both
|
|
|
|
<code>DescriptiveStatistics</code> and <code>SummaryStatistics</code>
|
|
|
|
implement this interface. Assume that <code>summary1</code> and
|
|
|
|
<code>summary2</code> are <code>SummaryStatistics</code> instances,
|
|
|
|
each of which has had at least 2 values added to the (virtual) dataset that
|
|
|
|
it describes. The sample sizes do not have to be the same -- all that is required
|
|
|
|
is that both samples have at least 2 elements.
|
|
|
|
</p>
|
|
|
|
<p><strong>Note:</strong> The <code>SummaryStatistics</code> class does
|
|
|
|
not store the dataset that it describes in memory, but it does compute all
|
|
|
|
statistics necessary to perform t-tests, so this method can be used to
|
|
|
|
conduct t-tests with very large samples. One-sample tests can also be
|
|
|
|
performed this way.
|
2004-10-08 01:08:21 -04:00
|
|
|
(See <a href="#1.2 Descriptive statistics">Descriptive statistics</a> for details
|
2004-06-05 15:28:43 -04:00
|
|
|
on the <code>SummaryStatistics</code> class.)
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
To compute the t-statistic:
|
|
|
|
<source>
|
|
|
|
TTestImpl testStatistic = new TTestImpl();
|
2004-08-02 00:20:09 -04:00
|
|
|
testStatistic.t(summary1, summary2);
|
2004-06-05 15:28:43 -04:00
|
|
|
</source>
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
To compute the (two-sided) p-value:
|
|
|
|
<source>
|
2004-08-02 00:20:09 -04:00
|
|
|
testStatistic.tTest(summary1, summary2);
|
2004-06-05 15:28:43 -04:00
|
|
|
</source>
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
To perform a fixed significance level test with alpha = .05:
|
|
|
|
<source>
|
2004-08-02 00:20:09 -04:00
|
|
|
testStatistic.tTest(summary1, summary2, .05);
|
2004-06-05 15:28:43 -04:00
|
|
|
</source>
|
|
|
|
</p>
|
|
|
|
<p>
|
2004-08-02 00:20:09 -04:00
|
|
|
In each case above, the test does not assume that the subpopulation
|
|
|
|
variances are equal. To perform the tests under this assumption,
|
|
|
|
replace "t" at the beginning of the method name with "homoscedasticT".
|
2004-06-05 15:28:43 -04:00
|
|
|
</p>
|
|
|
|
</dd>
|
2004-05-06 00:24:28 -04:00
|
|
|
<dt>Computing <code>chi-square</code> test statistics</dt>
|
|
|
|
<br></br>
|
|
|
|
<dd>To compute a chi-square statistic measuring the agreement between a
|
|
|
|
<code>long[]</code> array of observed counts and a <code>double[]</code>
|
|
|
|
array of expected counts, use:
|
|
|
|
<source>
|
|
|
|
ChiSquareTestImpl testStatistic = new ChiSquareTestImpl();
|
|
|
|
long[] observed = {10, 9, 11};
|
|
|
|
double[] expected = {10.1, 9.8, 10.3};
|
|
|
|
System.out.println(testStatistic.chiSquare(expected, observed));
|
|
|
|
</source>
|
|
|
|
The value displayed will be
<code>sum((expected[i] - observed[i])^2 / expected[i])</code>.
|
|
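<p>
The goodness-of-fit statistic is a direct sum over the categories. The class
below is a hypothetical stand-alone sketch written for illustration (it is not
the library's <code>ChiSquareTestImpl</code>) and assumes that all expected
counts are positive.
</p>

```java
/** Illustrative only: chi-square goodness-of-fit statistic. */
public class ChiSquareSketch {

    /** Returns sum((expected[i] - observed[i])^2 / expected[i]). */
    public static double chiSquare(double[] expected, long[] observed) {
        if (expected.length != observed.length || expected.length < 2) {
            throw new IllegalArgumentException(
                "arrays must have the same length, at least 2");
        }
        double sum = 0.0;
        for (int i = 0; i < expected.length; i++) {
            double dev = expected[i] - observed[i];
            sum += dev * dev / expected[i];
        }
        return sum;
    }

    /** Degrees of freedom for the goodness-of-fit test: number of counts - 1. */
    public static int degreesOfFreedom(long[] observed) {
        return observed.length - 1;
    }
}
```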
|
|
</dd>
|
|
|
|
<dd> To get the p-value associated with the null hypothesis that
|
|
|
|
<code>observed</code> conforms to <code>expected</code> use:
|
|
|
|
<source>
|
|
|
|
testStatistic.chiSquareTest(expected, observed);
|
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
<dd> To test the null hypothesis that <code>observed</code> conforms to
|
|
|
|
<code>expected</code> with <code>alpha</code> significance level
|
|
|
|
(equiv. <code>100 * (1-alpha)%</code> confidence) where <code>
|
|
|
|
0 < alpha < 1 </code> use:
|
|
|
|
<source>
|
|
|
|
testStatistic.chiSquareTest(expected, observed, alpha);
|
|
|
|
</source>
|
|
|
|
The boolean value returned will be <code>true</code> iff the null hypothesis
|
|
|
|
can be rejected with confidence <code>1 - alpha</code>.
|
|
|
|
</dd>
|
|
|
|
<dd>To compute a chi-square statistic associated with a
|
|
|
|
<a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc45.htm">
|
|
|
|
chi-square test of independence</a> based on a two-dimensional (long[][])
|
|
|
|
<code>counts</code> array viewed as a two-way table, use:
|
|
|
|
<source>
|
|
|
|
testStatistic.chiSquare(counts);
|
|
|
|
</source>
|
|
|
|
The rows of the 2-way table are
<code>counts[0], ... , counts[counts.length - 1]. </code><br></br>
|
|
|
|
The chi-square statistic returned is
|
|
|
|
<code>sum((counts[i][j] - expected[i][j])^2/expected[i][j])</code>
|
|
|
|
where the sum is taken over all table entries and
|
|
|
|
<code>expected[i][j]</code> is the product of the row and column sums at
|
|
|
|
row <code>i</code>, column <code>j</code> divided by the total count.
|
|
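<p>
The expected counts and the resulting statistic for a two-way table can be
sketched as follows. <code>IndependenceSketch</code> is a hypothetical class
for illustration only; it assumes a rectangular table in which every row sum
and column sum is positive, and it does not reproduce the library's input
validation.
</p>

```java
/** Illustrative only: chi-square statistic for a test of independence. */
public class IndependenceSketch {

    /** Returns sum((counts[i][j] - expected[i][j])^2 / expected[i][j]). */
    public static double chiSquare(long[][] counts) {
        int rows = counts.length;
        int cols = counts[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSum[i] += counts[i][j];
                colSum[j] += counts[i][j];
                total += counts[i][j];
            }
        }
        double sum = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                // Expected count: row sum times column sum over the total count
                double expected = rowSum[i] * colSum[j] / total;
                double dev = counts[i][j] - expected;
                sum += dev * dev / expected;
            }
        }
        return sum;
    }

    /** Degrees of freedom: (number of columns - 1) * (number of rows - 1). */
    public static int degreesOfFreedom(long[][] counts) {
        return (counts[0].length - 1) * (counts.length - 1);
    }
}
```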
|
|
</dd>
|
|
|
|
<dd>To compute the p-value associated with the null hypothesis that
|
|
|
|
the classifications represented by the counts in the columns of the input 2-way
|
|
|
|
table are independent of the rows, use:
|
|
|
|
<source>
|
|
|
|
testStatistic.chiSquareTest(counts);
|
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
<dd>To perform a chi-square test of independence with <code>alpha</code>
|
|
|
|
significance level (equiv. <code>100 * (1-alpha)%</code> confidence)
|
|
|
|
where <code>0 < alpha < 1 </code> use:
|
|
|
|
<source>
|
|
|
|
testStatistic.chiSquareTest(counts, alpha);
|
|
|
|
</source>
|
|
|
|
The boolean value returned will be <code>true</code> iff the null
|
|
|
|
hypothesis can be rejected with confidence <code>1 - alpha</code>.
|
|
|
|
</dd>
|
|
|
|
</dl>
|
|
|
|
</p>
|
2003-11-15 13:38:16 -05:00
|
|
|
</subsection>
|
|
|
|
</section>
|
|
|
|
</body>
|
|
|
|
</document>
|