Add recently added features to the userguide.

git-svn-id: https://svn.apache.org/repos/asf/commons/proper/math/trunk@1538282 13f79535-47bb-0310-9956-ffa450edef68
Thomas Neidhart 2013-11-02 21:02:13 +00:00
parent 280af43635
commit 40a97ba13a
1 changed file with 82 additions and 62 deletions


@ -32,13 +32,13 @@
and t-, chi-square and ANOVA test statistics.
</p>
<p>
<a href="#a1.2_Descriptive_statistics">Descriptive statistics</a><br></br>
<a href="#a1.3_Frequency_distributions">Frequency distributions</a><br></br>
<a href="#a1.4_Simple_regression">Simple Regression</a><br></br>
<a href="#a1.5_Multiple_linear_regression">Multiple Regression</a><br></br>
<a href="#a1.6_Rank_transformations">Rank transformations</a><br></br>
<a href="#a1.7_Covariance_and_correlation">Covariance and correlation</a><br></br>
<a href="#a1.8_Statistical_tests">Statistical Tests</a><br></br>
<a href="#a1.2_Descriptive_statistics">Descriptive statistics</a><br/>
<a href="#a1.3_Frequency_distributions">Frequency distributions</a><br/>
<a href="#a1.4_Simple_regression">Simple Regression</a><br/>
<a href="#a1.5_Multiple_linear_regression">Multiple Regression</a><br/>
<a href="#a1.6_Rank_transformations">Rank transformations</a><br/>
<a href="#a1.7_Covariance_and_correlation">Covariance and correlation</a><br/>
<a href="#a1.8_Statistical_tests">Statistical Tests</a><br/>
</p>
</subsection>
<subsection name="1.2 Descriptive statistics">
@ -154,7 +154,7 @@
Here are some examples showing how to compute Descriptive statistics.
<dl>
<dt>Compute summary statistics for a list of double values</dt>
<br></br>
<br/>
<dd>Using the <code>DescriptiveStatistics</code> aggregate
(values are stored in memory):
<source>
@ -206,7 +206,7 @@ mean = StatUtils.mean(values, 0, 3);
</dd>
<dt>Maintain a "rolling mean" of the most recent 100 values from
an input stream</dt>
<br></br>
<br/>
<dd>Use a <code>DescriptiveStatistics</code> instance with
window size set to 100
<source>
@ -311,7 +311,7 @@ double totalSampleSum = aggregatedStats.getSum();
Here are some examples.
<dl>
<dt>Compute a frequency distribution based on integer values</dt>
<br></br>
<br/>
<dd>Mixing integers, longs, Integers and Longs:
<source>
Frequency f = new Frequency();
@ -328,7 +328,7 @@ double totalSampleSum = aggregatedStats.getSum();
</source>
</dd>
<dt>Count string frequencies</dt>
<br></br>
<br/>
<dd>Using case-sensitive comparison, alpha sort order (natural comparator):
<source>
Frequency f = new Frequency();
@ -455,7 +455,7 @@ System.out.println(regression.predict(1.5d)
More data points can be added and subsequent getXxx calls will incorporate
additional data in statistics.
</dd>
<br></br>
<br/>
<dt>Estimate a model from a double[][] array of data points</dt>
<dd>Instantiate a regression object and load dataset
<source>
@ -478,7 +478,7 @@ System.out.println(regression.getSlopeStdErr());
More data points -- even another double[][] array -- can be added and subsequent
getXxx calls will incorporate additional data in statistics.
</dd>
<br></br>
<br/>
<dt>Estimate a model from a double[][] array of data points, <em>excluding</em> the intercept</dt>
<dd>Instantiate a regression object and load dataset
<source>
@ -558,7 +558,7 @@ System.out.println(regression.getInterceptStdErr() );
Here are some examples.
<dl>
<dt>OLS regression</dt>
<br></br>
<br/>
<dd>Instantiate an OLS regression object and load a dataset:
<source>
OLSMultipleLinearRegression regression = new OLSMultipleLinearRegression();
@ -589,7 +589,7 @@ double sigma = regression.estimateRegressionStandardError();
</source>
</dd>
<dt>GLS regression</dt>
<br></br>
<br/>
<dd>Instantiate a GLS regression object and load a dataset:
<source>
GLSMultipleLinearRegression regression = new GLSMultipleLinearRegression();
@ -664,17 +664,19 @@ new NaturalRanking(NaNStrategy.REMOVED,TiesStrategy.SEQUENTIAL).rank(exampleData
<a href="../apidocs/org/apache/commons/math3/stat/correlation/Covariance.html">
Covariance</a> computes covariances,
<a href="../apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html">
PearsonsCorrelation</a> provides Pearson's Product-Moment correlation coefficients and
PearsonsCorrelation</a> provides Pearson's Product-Moment correlation coefficients,
<a href="../apidocs/org/apache/commons/math3/stat/correlation/SpearmansCorrelation.html">
SpearmansCorrelation</a> computes Spearman's rank correlation.
SpearmansCorrelation</a> computes Spearman's rank correlation and
<a href="../apidocs/org/apache/commons/math3/stat/correlation/KendallsCorrelation.html">
KendallsCorrelation</a> computes Kendall's tau rank correlation.
</p>
<p>
<strong>Implementation Notes</strong>
<ul>
<li>
Unbiased covariances are given by the formula <br></br>
<code>cov(X, Y) = sum [(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / (n - 1)</code>
where <code>E(X)</code> is the mean of <code>X</code> and <code>E(Y)</code>
Unbiased covariances are given by the formula <br/>
<code>cov(X, Y) = sum [(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / (n - 1)</code>
where <code>E(X)</code> is the mean of <code>X</code> and <code>E(Y)</code>
is the mean of the <code>Y</code> values. Non-bias-corrected estimates use
<code>n</code> in place of <code>n - 1.</code> Whether or not covariances are
bias-corrected is determined by the optional parameter, "biasCorrected," which
@ -682,7 +684,7 @@ new NaturalRanking(NaNStrategy.REMOVED,TiesStrategy.SEQUENTIAL).rank(exampleData
</li>
<li>
<a href="../apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html">
PearsonsCorrelation</a> computes correlations defined by the formula <br></br>
PearsonsCorrelation</a> computes correlations defined by the formula <br/>
<code>cor(X, Y) = sum[(x<sub>i</sub> - E(X))(y<sub>i</sub> - E(Y))] / [(n - 1)s(X)s(Y)]</code><br/>
where <code>E(X)</code> and <code>E(Y)</code> are means of <code>X</code> and <code>Y</code>
and <code>s(X)</code>, <code>s(Y)</code> are standard deviations.
@ -693,6 +695,11 @@ new NaturalRanking(NaNStrategy.REMOVED,TiesStrategy.SEQUENTIAL).rank(exampleData
correlation on the ranked data. The ranking algorithm is configurable. By default,
<a href="../apidocs/org/apache/commons/math3/stat/ranking/NaturalRanking.html">
NaturalRanking</a> with default strategies for handling ties and NaN values is used.
</li>
<li>
<a href="../apidocs/org/apache/commons/math3/stat/correlation/KendallsCorrelation.html">
KendallsCorrelation</a> computes Kendall's tau, a measure of the association between two measured quantities.
A tau test is a non-parametric hypothesis test for statistical dependence based on the tau coefficient.
</li>
</ul>
</p>
@ -700,7 +707,7 @@ new NaturalRanking(NaNStrategy.REMOVED,TiesStrategy.SEQUENTIAL).rank(exampleData
<strong>Examples:</strong>
<dl>
<dt><strong>Covariance of 2 arrays</strong></dt>
<br></br>
<br/>
<dd>To compute the unbiased covariance between 2 double arrays,
<code>x</code> and <code>y</code>, use:
<source>
@ -711,9 +718,9 @@ new Covariance().covariance(x, y)
covariance(x, y, false)
</source>
</dd>
<br></br>
<br/>
<dt><strong>Covariance matrix</strong></dt>
<br></br>
<br/>
<dd> A covariance matrix over the columns of a source matrix <code>data</code>
can be computed using
<source>
@ -726,18 +733,18 @@ new Covariance().computeCovarianceMatrix(data)
computeCovarianceMatrix(data, false)
</source>
</dd>
<br></br>
<br/>
<dt><strong>Pearson's correlation of 2 arrays</strong></dt>
<br></br>
<br/>
<dd>To compute the Pearson's product-moment correlation between two double arrays
<code>x</code> and <code>y</code>, use:
<source>
new PearsonsCorrelation().correlation(x, y)
</source>
</dd>
<br></br>
<br/>
<dt><strong>Pearson's correlation matrix</strong></dt>
<br></br>
<br/>
<dd> A (Pearson's) correlation matrix over the columns of a source matrix <code>data</code>
can be computed using
<source>
@ -746,9 +753,9 @@ new PearsonsCorrelation().computeCorrelationMatrix(data)
The i-jth entry of the returned matrix is the Pearson's product-moment correlation between the
ith and jth columns of <code>data.</code>
</dd>
<br></br>
<br/>
<dt><strong>Pearson's correlation significance and standard errors</strong></dt>
<br></br>
<br/>
<dd>To compute standard errors and/or significances of Pearson's correlation
coefficients, start by creating a
<code>PearsonsCorrelation</code> instance
@ -771,22 +778,22 @@ correlation.getCorrelationPValues()
</source>
<code>getCorrelationPValues().getEntry(i,j)</code> is the
probability that a random variable distributed as <code>t<sub>n-2</sub></code> takes
a value with absolute value greater than or equal to <br></br>
<code>|r<sub>ij</sub>|((n - 2) / (1 - r<sub>ij</sub><sup>2</sup>))<sup>1/2</sup></code>,
where <code>r<sub>ij</sub></code> is the estimated correlation between the ith and jth
columns of the source array or RealMatrix. This is sometimes referred to as the
<i>significance</i> of the coefficient.<br/><br/>
For example, if <code>data</code> is a RealMatrix with 2 columns and 10 rows, then
<source>
a value with absolute value greater than or equal to <br/>
<code>|r<sub>ij</sub>|((n - 2) / (1 - r<sub>ij</sub><sup>2</sup>))<sup>1/2</sup></code>,
where <code>r<sub>ij</sub></code> is the estimated correlation between the ith and jth
columns of the source array or RealMatrix. This is sometimes referred to as the
<i>significance</i> of the coefficient.<br/><br/>
For example, if <code>data</code> is a RealMatrix with 2 columns and 10 rows, then
<source>
new PearsonsCorrelation(data).getCorrelationPValues().getEntry(0,1)
</source>
is the significance of the Pearson's correlation coefficient between the two columns
of <code>data</code>. If this value is less than .01, we can say that the correlation
between the two columns of data is significant at the 99% level.
</source>
is the significance of the Pearson's correlation coefficient between the two columns
of <code>data</code>. If this value is less than .01, we can say that the correlation
between the two columns of data is significant at the 99% level.
</dd>
<br></br>
<br/>
<dt><strong>Spearman's rank correlation coefficient</strong></dt>
<br></br>
<br/>
<dd>To compute the Spearman's rank-moment correlation between two double arrays
<code>x</code> and <code>y</code>:
<source>
@ -798,7 +805,15 @@ RankingAlgorithm ranking = new NaturalRanking();
new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y))
</source>
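The ranking applied to the input data is configurable. As a minimal sketch (assuming the
<code>SpearmansCorrelation(RankingAlgorithm)</code> and <code>NaturalRanking(TiesStrategy)</code>
constructors described in the javadoc), a custom ranking, here resolving ties with the
minimum rank, can be supplied:
<source>
new SpearmansCorrelation(new NaturalRanking(TiesStrategy.MINIMUM)).correlation(x, y)
</source>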
</dd>
<br></br>
<br/>
<dt><strong>Kendall's tau rank correlation coefficient</strong></dt>
<br/>
<dd>To compute the Kendall's tau rank correlation between two double arrays
<code>x</code> and <code>y</code>:
<source>
new KendallsCorrelation().correlation(x, y)
</source>
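A Kendall's correlation matrix over the columns of a <code>RealMatrix</code> named
<code>data</code> can be computed analogously to the Pearson and Spearman cases (a sketch,
assuming <code>KendallsCorrelation</code> exposes the same
<code>computeCorrelationMatrix</code> method as the other correlation classes):
<source>
new KendallsCorrelation().computeCorrelationMatrix(data)
</source>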
</dd>
</dl>
</p>
</subsection>
@ -814,9 +829,11 @@ new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y))
<a href="http://www.itl.nist.gov/div898/handbook/prc/section4/prc43.htm">
One-Way ANOVA</a>,
<a href="http://www.itl.nist.gov/div898/handbook/prc/section3/prc35.htm">
Mann-Whitney U</a> and
Mann-Whitney U</a>,
<a href="http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">
Wilcoxon signed rank</a> test statistics as well as
Wilcoxon signed rank</a> and
<a href="http://en.wikipedia.org/wiki/Binomial_test">
Binomial</a> test statistics as well as
<a href="http://www.cas.lancs.ac.uk/glossary_v1.1/hyptest.html#pvalue">
p-values</a> associated with <code>t-</code>,
<code>Chi-Square</code>, <code>G</code>, <code>One-Way ANOVA</code>, <code>Mann-Whitney U</code>
@ -830,9 +847,11 @@ new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y))
<a href="../apidocs/org/apache/commons/math3/stat/inference/OneWayAnova.html">
OneWayAnova</a>,
<a href="../apidocs/org/apache/commons/math3/stat/inference/MannWhitneyUTest.html">
MannWhitneyUTest</a>, and
MannWhitneyUTest</a>,
<a href="../apidocs/org/apache/commons/math3/stat/inference/WilcoxonSignedRankTest.html">
WilcoxonSignedRankTest</a>.
WilcoxonSignedRankTest</a> and
<a href="../apidocs/org/apache/commons/math3/stat/inference/BinomialTest.html">
BinomialTest</a>.
The <a href="../apidocs/org/apache/commons/math3/stat/inference/TestUtils.html">
TestUtils</a> class provides static methods to get test instances or
to compute test statistics directly. The examples below all use the
@ -886,7 +905,7 @@ new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y))
<strong>Examples:</strong>
<dl>
<dt><strong>One-sample <code>t</code> tests</strong></dt>
<br></br>
<br/>
<dd>To compare the mean of a double[] array to a fixed value:
<source>
double[] observed = {1d, 2d, 3d};
@ -932,9 +951,9 @@ TestUtils.tTest(mu, observed, alpha);
To test, for example at the 95% level of confidence, use
<code>alpha = 0.05</code>
</dd>
<br></br>
<br/>
<dt><strong>Two-Sample t-tests</strong></dt>
<br></br>
<br/>
<dd><strong>Example 1:</strong> Paired test evaluating
the null hypothesis that the mean difference between corresponding
(paired) elements of the <code>double[]</code> arrays
@ -1005,9 +1024,9 @@ TestUtils.tTest(sample1, sample2, .05);
replace "t" at the beginning of the method name with "homoscedasticT"
</p>
</dd>
<br></br>
<br/>
<dt><strong>Chi-square tests</strong></dt>
<br></br>
<br/>
<dd>To compute a chi-square statistic measuring the agreement between a
<code>long[]</code> array of observed counts and a <code>double[]</code>
array of expected counts, use:
@ -1043,7 +1062,7 @@ TestUtils.chiSquareTest(expected, observed, alpha);
TestUtils.chiSquareTest(counts);
</source>
The rows of the 2-way table are
<code>counts[0], ... , counts[counts.length - 1]. </code><br></br>
<code>counts[0], ... , counts[counts.length - 1]. </code><br/>
The chi-square statistic returned is
<code>sum((counts[i][j] - expected[i][j])^2/expected[i][j])</code>
where the sum is taken over all table entries and
@ -1066,9 +1085,9 @@ TestUtils.chiSquareTest(counts, alpha);
The boolean value returned will be <code>true</code> iff the null
hypothesis can be rejected with confidence <code>1 - alpha</code>.
</dd>
<br></br>
<br/>
<dt><strong>G tests</strong></dt>
<br></br>
<br/>
<dd>G tests are an alternative to chi-square tests that are recommended
when observed counts are small and / or incidence probabilities for
some cells are small. See Ted Dunning's paper,
@ -1077,8 +1096,8 @@ TestUtils.chiSquareTest(counts, alpha);
background and an empirical analysis showing how chi-square
statistics can be misleading in the presence of low incidence probabilities.
This paper also derives the formulas used in computing G statistics and the
root log likelihood ratio provided by the <code>GTest</code> class.</dd>
<dd>
root log likelihood ratio provided by the <code>GTest</code> class.
</dd>
<dd>To compute a G-test statistic measuring the agreement between a
<code>long[]</code> array of observed counts and a <code>double[]</code>
array of expected counts, use:
@ -1090,13 +1109,13 @@ System.out.println(TestUtils.g(expected, observed));
the value displayed will be
<code>2 * sum(observed[i] * log(observed[i]/expected[i]))</code>
</dd>
<dd> To get the p-value associated with the null hypothesis that
<dd>To get the p-value associated with the null hypothesis that
<code>observed</code> conforms to <code>expected</code> use:
<source>
TestUtils.gTest(expected, observed);
</source>
</dd>
<dd> To test the null hypothesis that <code>observed</code> conforms to
<dd>To test the null hypothesis that <code>observed</code> conforms to
<code>expected</code> with <code>alpha</code> significance level
(equiv. <code>100 * (1-alpha)%</code> confidence) where <code>
0 &lt; alpha &lt; 1 </code> use:
@ -1128,9 +1147,10 @@ new GTest().rootLogLikelihoodRatio(5, 1995, 0, 100000);
returns the root log likelihood associated with the null hypothesis that A
and B are independent.
</dd>
<br></br>
<br/>
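<dt><strong>Binomial tests</strong></dt>
<br/>
<dd>As a minimal sketch (assuming the <code>BinomialTest</code> methods and the
<code>AlternativeHypothesis</code> enum described in the javadoc), to get the p-value
associated with the null hypothesis that the success probability is 0.5, given 23
successes observed in 100 trials:
<source>
new BinomialTest().binomialTest(100, 23, 0.5, AlternativeHypothesis.TWO_SIDED);
</source>
To test this null hypothesis at the <code>alpha = 0.05</code> significance level, use
the overload that takes <code>alpha</code> and returns a boolean:
<source>
new BinomialTest().binomialTest(100, 23, 0.5, AlternativeHypothesis.TWO_SIDED, 0.05);
</source>
</dd>
<br/>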
<dt><strong>One-Way ANOVA tests</strong></dt>
<br></br>
<br/>
<dd>
<source>
double[] classA =
{93.0, 103.0, 95.0, 101.0, 91.0, 105.0, 96.0, 94.0, 101.0 };