Add recently added features to the userguide.

git-svn-id: https://svn.apache.org/repos/asf/commons/proper/math/trunk@1538282 13f79535-47bb-0310-9956-ffa450edef68
Thomas Neidhart 2013-11-02 21:02:13 +00:00
parent 280af43635
commit 40a97ba13a
1 changed files with 82 additions and 62 deletions

@ -32,13 +32,13 @@
and t-, chi-square and ANOVA test statistics. and t-, chi-square and ANOVA test statistics.
</p> </p>
<p> <p>
<a href="#a1.2_Descriptive_statistics">Descriptive statistics</a><br></br> <a href="#a1.2_Descriptive_statistics">Descriptive statistics</a><br/>
<a href="#a1.3_Frequency_distributions">Frequency distributions</a><br></br> <a href="#a1.3_Frequency_distributions">Frequency distributions</a><br/>
<a href="#a1.4_Simple_regression">Simple Regression</a><br></br> <a href="#a1.4_Simple_regression">Simple Regression</a><br/>
<a href="#a1.5_Multiple_linear_regression">Multiple Regression</a><br></br> <a href="#a1.5_Multiple_linear_regression">Multiple Regression</a><br/>
<a href="#a1.6_Rank_transformations">Rank transformations</a><br></br> <a href="#a1.6_Rank_transformations">Rank transformations</a><br/>
<a href="#a1.7_Covariance_and_correlation">Covariance and correlation</a><br></br> <a href="#a1.7_Covariance_and_correlation">Covariance and correlation</a><br/>
<a href="#a1.8_Statistical_tests">Statistical Tests</a><br></br> <a href="#a1.8_Statistical_tests">Statistical Tests</a><br/>
</p> </p>
</subsection> </subsection>
<subsection name="1.2 Descriptive statistics"> <subsection name="1.2 Descriptive statistics">
@ -154,7 +154,7 @@
Here are some examples showing how to compute Descriptive statistics. Here are some examples showing how to compute Descriptive statistics.
<dl> <dl>
<dt>Compute summary statistics for a list of double values</dt> <dt>Compute summary statistics for a list of double values</dt>
<br></br> <br/>
<dd>Using the <code>DescriptiveStatistics</code> aggregate <dd>Using the <code>DescriptiveStatistics</code> aggregate
(values are stored in memory): (values are stored in memory):
<source> <source>
@ -206,7 +206,7 @@ mean = StatUtils.mean(values, 0, 3);
</dd> </dd>
<dt>Maintain a "rolling mean" of the most recent 100 values from <dt>Maintain a "rolling mean" of the most recent 100 values from
an input stream</dt> an input stream</dt>
<br></br> <br/>
<dd>Use a <code>DescriptiveStatistics</code> instance with <dd>Use a <code>DescriptiveStatistics</code> instance with
window size set to 100 window size set to 100
<source> <source>
@ -311,7 +311,7 @@ double totalSampleSum = aggregatedStats.getSum();
Here are some examples. Here are some examples.
<dl> <dl>
<dt>Compute a frequency distribution based on integer values</dt> <dt>Compute a frequency distribution based on integer values</dt>
<br></br> <br/>
<dd>Mixing integers, longs, Integers and Longs: <dd>Mixing integers, longs, Integers and Longs:
<source> <source>
Frequency f = new Frequency(); Frequency f = new Frequency();
@ -328,7 +328,7 @@ double totalSampleSum = aggregatedStats.getSum();
</source> </source>
</dd> </dd>
<dt>Count string frequencies</dt> <dt>Count string frequencies</dt>
<br></br> <br/>
<dd>Using case-sensitive comparison, alpha sort order (natural comparator): <dd>Using case-sensitive comparison, alpha sort order (natural comparator):
<source> <source>
Frequency f = new Frequency(); Frequency f = new Frequency();
@ -455,7 +455,7 @@ System.out.println(regression.predict(1.5d)
More data points can be added and subsequent getXxx calls will incorporate More data points can be added and subsequent getXxx calls will incorporate
additional data in statistics. additional data in statistics.
</dd> </dd>
<br></br> <br/>
<dt>Estimate a model from a double[][] array of data points</dt> <dt>Estimate a model from a double[][] array of data points</dt>
<dd>Instantiate a regression object and load dataset <dd>Instantiate a regression object and load dataset
<source> <source>
@ -478,7 +478,7 @@ System.out.println(regression.getSlopeStdErr());
More data points -- even another double[][] array -- can be added and subsequent More data points -- even another double[][] array -- can be added and subsequent
getXxx calls will incorporate additional data in statistics. getXxx calls will incorporate additional data in statistics.
</dd> </dd>
<br></br> <br/>
<dt>Estimate a model from a double[][] array of data points, <em>excluding</em> the intercept</dt> <dt>Estimate a model from a double[][] array of data points, <em>excluding</em> the intercept</dt>
<dd>Instantiate a regression object and load dataset <dd>Instantiate a regression object and load dataset
<source> <source>
@ -558,7 +558,7 @@ System.out.println(regression.getInterceptStdErr() );
Here are some examples. Here are some examples.
<dl> <dl>
<dt>OLS regression</dt> <dt>OLS regression</dt>
<br></br> <br/>
<dd>Instantiate an OLS regression object and load a dataset: <dd>Instantiate an OLS regression object and load a dataset:
<source> <source>
OLSMultipleLinearRegression regression = new OLSMultipleLinearRegression(); OLSMultipleLinearRegression regression = new OLSMultipleLinearRegression();
@ -589,7 +589,7 @@ double sigma = regression.estimateRegressionStandardError();
</source> </source>
</dd> </dd>
<dt>GLS regression</dt> <dt>GLS regression</dt>
<br></br> <br/>
<dd>Instantiate a GLS regression object and load a dataset: <dd>Instantiate a GLS regression object and load a dataset:
<source> <source>
GLSMultipleLinearRegression regression = new GLSMultipleLinearRegression(); GLSMultipleLinearRegression regression = new GLSMultipleLinearRegression();
@ -664,15 +664,17 @@ new NaturalRanking(NaNStrategy.REMOVED,TiesStrategy.SEQUENTIAL).rank(exampleData
<a href="../apidocs/org/apache/commons/math3/stat/correlation/Covariance.html"> <a href="../apidocs/org/apache/commons/math3/stat/correlation/Covariance.html">
Covariance</a> computes covariances, Covariance</a> computes covariances,
<a href="../apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html"> <a href="../apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html">
PearsonsCorrelation</a> provides Pearson's Product-Moment correlation coefficients and PearsonsCorrelation</a> provides Pearson's Product-Moment correlation coefficients,
<a href="../apidocs/org/apache/commons/math3/stat/correlation/SpearmansCorrelation.html"> <a href="../apidocs/org/apache/commons/math3/stat/correlation/SpearmansCorrelation.html">
SpearmansCorrelation</a> computes Spearman's rank correlation. SpearmansCorrelation</a> computes Spearman's rank correlation and
<a href="../apidocs/org/apache/commons/math3/stat/correlation/KendallsCorrelation.html">
KendallsCorrelation</a> computes Kendall's tau rank correlation.
</p> </p>
<p> <p>
<strong>Implementation Notes</strong> <strong>Implementation Notes</strong>
<ul> <ul>
<li> <li>
Unbiased covariances are given by the formula <br></br> Unbiased covariances are given by the formula <br/>
<code>cov(X, Y) = sum [(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / (n - 1)</code> <code>cov(X, Y) = sum [(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / (n - 1)</code>
where <code>E(X)</code> is the mean of <code>X</code> and <code>E(Y)</code> where <code>E(X)</code> is the mean of <code>X</code> and <code>E(Y)</code>
is the mean of the <code>Y</code> values. Non-bias-corrected estimates use is the mean of the <code>Y</code> values. Non-bias-corrected estimates use
@ -682,7 +684,7 @@ new NaturalRanking(NaNStrategy.REMOVED,TiesStrategy.SEQUENTIAL).rank(exampleData
</li> </li>
<li> <li>
<a href="../apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html"> <a href="../apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html">
PearsonsCorrelation</a> computes correlations defined by the formula <br></br> PearsonsCorrelation</a> computes correlations defined by the formula <br/>
<code>cor(X, Y) = sum[(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / [(n - 1)s(X)s(Y)]</code><br/> <code>cor(X, Y) = sum[(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / [(n - 1)s(X)s(Y)]</code><br/>
where <code>E(X)</code> and <code>E(Y)</code> are means of <code>X</code> and <code>Y</code> where <code>E(X)</code> and <code>E(Y)</code> are means of <code>X</code> and <code>Y</code>
and <code>s(X)</code>, <code>s(Y)</code> are standard deviations. and <code>s(X)</code>, <code>s(Y)</code> are standard deviations.
@ -694,13 +696,18 @@ new NaturalRanking(NaNStrategy.REMOVED,TiesStrategy.SEQUENTIAL).rank(exampleData
<a href="../apidocs/org/apache/commons/math3/stat/ranking/NaturalRanking.html"> <a href="../apidocs/org/apache/commons/math3/stat/ranking/NaturalRanking.html">
NaturalRanking</a> with default strategies for handling ties and NaN values is used. NaturalRanking</a> with default strategies for handling ties and NaN values is used.
</li> </li>
<li>
<a href="../apidocs/org/apache/commons/math3/stat/correlation/KendallsCorrelation.html">
KendallsCorrelation</a> computes the association between two measured quantities. A tau test
is a non-parametric hypothesis test for statistical dependence based on the tau coefficient.
</li>
</ul> </ul>
</p> </p>
<p> <p>
<strong>Examples:</strong> <strong>Examples:</strong>
<dl> <dl>
<dt><strong>Covariance of 2 arrays</strong></dt> <dt><strong>Covariance of 2 arrays</strong></dt>
<br></br> <br/>
<dd>To compute the unbiased covariance between 2 double arrays, <dd>To compute the unbiased covariance between 2 double arrays,
<code>x</code> and <code>y</code>, use: <code>x</code> and <code>y</code>, use:
<source> <source>
@ -711,9 +718,9 @@ new Covariance().covariance(x, y)
covariance(x, y, false) covariance(x, y, false)
</source> </source>
</dd> </dd>
<br></br> <br/>
<dt><strong>Covariance matrix</strong></dt> <dt><strong>Covariance matrix</strong></dt>
<br></br> <br/>
<dd> A covariance matrix over the columns of a source matrix <code>data</code> <dd> A covariance matrix over the columns of a source matrix <code>data</code>
can be computed using can be computed using
<source> <source>
@ -726,18 +733,18 @@ new Covariance().computeCovarianceMatrix(data)
computeCovarianceMatrix(data, false) computeCovarianceMatrix(data, false)
</source> </source>
</dd> </dd>
<br></br> <br/>
<dt><strong>Pearson's correlation of 2 arrays</strong></dt> <dt><strong>Pearson's correlation of 2 arrays</strong></dt>
<br></br> <br/>
<dd>To compute the Pearson's product-moment correlation between two double arrays <dd>To compute the Pearson's product-moment correlation between two double arrays
<code>x</code> and <code>y</code>, use: <code>x</code> and <code>y</code>, use:
<source> <source>
new PearsonsCorrelation().correlation(x, y) new PearsonsCorrelation().correlation(x, y)
</source> </source>
</dd> </dd>
<br></br> <br/>
<dt><strong>Pearson's correlation matrix</strong></dt> <dt><strong>Pearson's correlation matrix</strong></dt>
<br></br> <br/>
<dd> A (Pearson's) correlation matrix over the columns of a source matrix <code>data</code> <dd> A (Pearson's) correlation matrix over the columns of a source matrix <code>data</code>
can be computed using can be computed using
<source> <source>
@ -746,9 +753,9 @@ new PearsonsCorrelation().computeCorrelationMatrix(data)
The i-jth entry of the returned matrix is the Pearson's product-moment correlation between the The i-jth entry of the returned matrix is the Pearson's product-moment correlation between the
ith and jth columns of <code>data.</code> ith and jth columns of <code>data.</code>
</dd> </dd>
<br></br> <br/>
<dt><strong>Pearson's correlation significance and standard errors</strong></dt> <dt><strong>Pearson's correlation significance and standard errors</strong></dt>
<br></br> <br/>
<dd> To compute standard errors and/or significances of correlation coefficients <dd> To compute standard errors and/or significances of correlation coefficients
associated with Pearson's correlation coefficients, start by creating a associated with Pearson's correlation coefficients, start by creating a
<code>PearsonsCorrelation</code> instance <code>PearsonsCorrelation</code> instance
@ -771,7 +778,7 @@ correlation.getCorrelationPValues()
</source> </source>
<code>getCorrelationPValues().getEntry(i,j)</code> is the <code>getCorrelationPValues().getEntry(i,j)</code> is the
probability that a random variable distributed as <code>t<sub>n-2</sub></code> takes probability that a random variable distributed as <code>t<sub>n-2</sub></code> takes
a value with absolute value greater than or equal to <br></br> a value with absolute value greater than or equal to <br/>
<code>|r<sub>ij</sub>|((n - 2) / (1 - r<sub>ij</sub><sup>2</sup>))<sup>1/2</sup></code>, <code>|r<sub>ij</sub>|((n - 2) / (1 - r<sub>ij</sub><sup>2</sup>))<sup>1/2</sup></code>,
where <code>r<sub>ij</sub></code> is the estimated correlation between the ith and jth where <code>r<sub>ij</sub></code> is the estimated correlation between the ith and jth
columns of the source array or RealMatrix. This is sometimes referred to as the columns of the source array or RealMatrix. This is sometimes referred to as the
@ -784,9 +791,9 @@ new PearsonsCorrelation(data).getCorrelationPValues().getEntry(0,1)
of <code>data</code>. If this value is less than .01, we can say that the correlation of <code>data</code>. If this value is less than .01, we can say that the correlation
between the two columns of data is significant at the 99% level. between the two columns of data is significant at the 99% level.
</dd> </dd>
<br></br> <br/>
<dt><strong>Spearman's rank correlation coefficient</strong></dt> <dt><strong>Spearman's rank correlation coefficient</strong></dt>
<br></br> <br/>
<dd>To compute the Spearman's rank-moment correlation between two double arrays <dd>To compute the Spearman's rank-moment correlation between two double arrays
<code>x</code> and <code>y</code>: <code>x</code> and <code>y</code>:
<source> <source>
@ -798,7 +805,15 @@ RankingAlgorithm ranking = new NaturalRanking();
new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y)) new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y))
</source> </source>
</dd> </dd>
<br></br> <br/>
<dt><strong>Kendall's tau rank correlation coefficient</strong></dt>
<br/>
<dd>To compute the Kendall's tau rank correlation between two double arrays
<code>x</code> and <code>y</code>:
<source>
new KendallsCorrelation().correlation(x, y)
</source>
</dd>
</dl> </dl>
</p> </p>
</subsection> </subsection>
@ -814,9 +829,11 @@ new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y))
<a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc43.htm"> <a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc43.htm">
One-Way ANOVA</a>, One-Way ANOVA</a>,
<a href="http://www.itl.nist.gov/div898/handbook/prc/section3/prc35.htm"> <a href="http://www.itl.nist.gov/div898/handbook/prc/section3/prc35.htm">
Mann-Whitney U</a> and Mann-Whitney U</a>,
<a href="http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test"> <a href="http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">
Wilcoxon signed rank</a> test statistics as well as Wilcoxon signed rank</a> and
<a href="http://en.wikipedia.org/wiki/Binomial_test">
Binomial</a> test statistics as well as
<a href="http://www.cas.lancs.ac.uk/glossary_v1.1/hyptest.html#pvalue"> <a href="http://www.cas.lancs.ac.uk/glossary_v1.1/hyptest.html#pvalue">
p-values</a> associated with <code>t-</code>, p-values</a> associated with <code>t-</code>,
<code>Chi-Square</code>, <code>G</code>, <code>One-Way ANOVA</code>, <code>Mann-Whitney U</code> <code>Chi-Square</code>, <code>G</code>, <code>One-Way ANOVA</code>, <code>Mann-Whitney U</code>
@ -830,9 +847,11 @@ new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y))
<a href="../apidocs/org/apache/commons/math3/stat/inference/OneWayAnova.html"> <a href="../apidocs/org/apache/commons/math3/stat/inference/OneWayAnova.html">
OneWayAnova</a>, OneWayAnova</a>,
<a href="../apidocs/org/apache/commons/math3/stat/inference/MannWhitneyUTest.html"> <a href="../apidocs/org/apache/commons/math3/stat/inference/MannWhitneyUTest.html">
MannWhitneyUTest</a>, and MannWhitneyUTest</a>,
<a href="../apidocs/org/apache/commons/math3/stat/inference/WilcoxonSignedRankTest.html"> <a href="../apidocs/org/apache/commons/math3/stat/inference/WilcoxonSignedRankTest.html">
WilcoxonSignedRankTest</a>. WilcoxonSignedRankTest</a> and
<a href="../apidocs/org/apache/commons/math3/stat/inference/BinomialTest.html">
BinomialTest</a>.
The <a href="../apidocs/org/apache/commons/math3/stat/inference/TestUtils.html"> The <a href="../apidocs/org/apache/commons/math3/stat/inference/TestUtils.html">
TestUtils</a> class provides static methods to get test instances or TestUtils</a> class provides static methods to get test instances or
to compute test statistics directly. The examples below all use the to compute test statistics directly. The examples below all use the
@ -886,7 +905,7 @@ new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y))
<strong>Examples:</strong> <strong>Examples:</strong>
<dl> <dl>
<dt><strong>One-sample <code>t</code> tests</strong></dt> <dt><strong>One-sample <code>t</code> tests</strong></dt>
<br></br> <br/>
<dd>To compare the mean of a double[] array to a fixed value: <dd>To compare the mean of a double[] array to a fixed value:
<source> <source>
double[] observed = {1d, 2d, 3d}; double[] observed = {1d, 2d, 3d};
@ -932,9 +951,9 @@ TestUtils.tTest(mu, observed, alpha);
To test, for example at the 95% level of confidence, use To test, for example at the 95% level of confidence, use
<code>alpha = 0.05</code> <code>alpha = 0.05</code>
</dd> </dd>
<br></br> <br/>
<dt><strong>Two-Sample t-tests</strong></dt> <dt><strong>Two-Sample t-tests</strong></dt>
<br></br> <br/>
<dd><strong>Example 1:</strong> Paired test evaluating <dd><strong>Example 1:</strong> Paired test evaluating
the null hypothesis that the mean difference between corresponding the null hypothesis that the mean difference between corresponding
(paired) elements of the <code>double[]</code> arrays (paired) elements of the <code>double[]</code> arrays
@ -1005,9 +1024,9 @@ TestUtils.tTest(sample1, sample2, .05);
replace "t" at the beginning of the method name with "homoscedasticT" replace "t" at the beginning of the method name with "homoscedasticT"
</p> </p>
</dd> </dd>
<br></br> <br/>
<dt><strong>Chi-square tests</strong></dt> <dt><strong>Chi-square tests</strong></dt>
<br></br> <br/>
<dd>To compute a chi-square statistic measuring the agreement between a <dd>To compute a chi-square statistic measuring the agreement between a
<code>long[]</code> array of observed counts and a <code>double[]</code> <code>long[]</code> array of observed counts and a <code>double[]</code>
array of expected counts, use: array of expected counts, use:
@ -1043,7 +1062,7 @@ TestUtils.chiSquareTest(expected, observed, alpha);
TestUtils.chiSquareTest(counts); TestUtils.chiSquareTest(counts);
</source> </source>
The rows of the 2-way table are The rows of the 2-way table are
<code>count[0], ... , count[count.length - 1]. </code><br></br> <code>count[0], ... , count[count.length - 1]. </code><br/>
The chi-square statistic returned is The chi-square statistic returned is
<code>sum((counts[i][j] - expected[i][j])^2/expected[i][j])</code> <code>sum((counts[i][j] - expected[i][j])^2/expected[i][j])</code>
where the sum is taken over all table entries and where the sum is taken over all table entries and
@ -1066,9 +1085,9 @@ TestUtils.chiSquareTest(counts, alpha);
The boolean value returned will be <code>true</code> iff the null The boolean value returned will be <code>true</code> iff the null
hypothesis can be rejected with confidence <code>1 - alpha</code>. hypothesis can be rejected with confidence <code>1 - alpha</code>.
</dd> </dd>
<br></br> <br/>
<dt><strong>G tests</strong></dt> <dt><strong>G tests</strong></dt>
<br></br> <br/>
<dd>G tests are an alternative to chi-square tests that are recommended <dd>G tests are an alternative to chi-square tests that are recommended
when observed counts are small and / or incidence probabilities for when observed counts are small and / or incidence probabilities for
some cells are small. See Ted Dunning's paper, some cells are small. See Ted Dunning's paper,
@ -1077,8 +1096,8 @@ TestUtils.chiSquareTest(counts, alpha);
background and an empirical analysis showing how chi-square background and an empirical analysis showing how chi-square
statistics can be misleading in the presence of low incidence probabilities. statistics can be misleading in the presence of low incidence probabilities.
This paper also derives the formulas used in computing G statistics and the This paper also derives the formulas used in computing G statistics and the
root log likelihood ratio provided by the <code>GTest</code> class.</dd> root log likelihood ratio provided by the <code>GTest</code> class.
<dd> </dd>
<dd>To compute a G-test statistic measuring the agreement between a <dd>To compute a G-test statistic measuring the agreement between a
<code>long[]</code> array of observed counts and a <code>double[]</code> <code>long[]</code> array of observed counts and a <code>double[]</code>
array of expected counts, use: array of expected counts, use:
@ -1128,9 +1147,10 @@ new GTest().rootLogLikelihoodRatio(5, 1995, 0, 100000);
returns the root log likelihood associated with the null hypothesis that A returns the root log likelihood associated with the null hypothesis that A
and B are independent. and B are independent.
</dd> </dd>
<br></br> <br/>
<dt><strong>One-Way ANOVA tests</strong></dt> <dt><strong>One-Way ANOVA tests</strong></dt>
<br></br> <br/>
<dd>
<source> <source>
double[] classA = double[] classA =
{93.0, 103.0, 95.0, 101.0, 91.0, 105.0, 96.0, 94.0, 101.0 }; {93.0, 103.0, 95.0, 101.0, 91.0, 105.0, 96.0, 94.0, 101.0 };