2003-11-14 16:50:39 -05:00
|
|
|
<?xml version="1.0"?>
|
2004-02-28 12:47:37 -05:00
|
|
|
|
|
|
|
<!--
|
2006-11-29 02:06:35 -05:00
|
|
|
Licensed to the Apache Software Foundation (ASF) under one or more
|
|
|
|
contributor license agreements. See the NOTICE file distributed with
|
|
|
|
this work for additional information regarding copyright ownership.
|
|
|
|
The ASF licenses this file to You under the Apache License, Version 2.0
|
|
|
|
(the "License"); you may not use this file except in compliance with
|
|
|
|
the License. You may obtain a copy of the License at
|
2004-02-28 12:47:37 -05:00
|
|
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software
|
|
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
|
See the License for the specific language governing permissions and
|
|
|
|
limitations under the License.
|
|
|
|
-->
|
|
|
|
|
2003-11-15 13:38:16 -05:00
|
|
|
<?xml-stylesheet type="text/xsl" href="./xdoc.xsl"?>
|
2005-02-26 08:11:52 -05:00
|
|
|
<!-- $Revision$ $Date$ -->
|
2003-11-14 16:50:39 -05:00
|
|
|
<document url="random.html">
|
|
|
|
|
|
|
|
<properties>
|
|
|
|
<title>The Commons Math User Guide - Data Generation</title>
|
|
|
|
</properties>
|
|
|
|
|
|
|
|
<body>
|
|
|
|
|
|
|
|
<section name="2 Data Generation">
|
|
|
|
|
|
|
|
<subsection name="2.1 Overview" href="overview">
|
|
|
|
<p>
|
|
|
|
The Commons Math random package includes utilities for
|
|
|
|
<ul>
|
|
|
|
<li>generating random numbers</li>
|
|
|
|
<li>generating random strings</li>
|
2005-06-01 23:06:45 -04:00
|
|
|
<li>generating cryptographically secure sequences of random numbers or
|
|
|
|
strings</li>
|
2003-11-14 16:50:39 -05:00
|
|
|
<li>generating random samples and permutations</li>
|
2005-06-01 23:06:45 -04:00
|
|
|
<li>analyzing distributions of values in an input file and generating
|
|
|
|
values "like" the values in the file</li>
|
|
|
|
<li>generating data for grouped frequency distributions or
|
|
|
|
histograms</li>
|
|
|
|
</ul></p>
|
|
|
|
<p>
|
|
|
|
The source of random data used by the data generation utilities is
|
|
|
|
pluggable. By default, the JDK-supplied Pseudo-Random Number Generator
|
|
|
|
(PRNG) is used, but alternative generators can be "plugged in" using an
|
|
|
|
adaptor framework, which provides a generic facility for replacing
|
|
|
|
<code>java.util.Random</code> with an alternative PRNG.
|
|
|
|
</p>
|
|
|
|
<p>
|
2005-06-04 01:18:17 -04:00
|
|
|
Sections 2.2-2.5 below show how to use the Commons Math API to generate
|
2005-06-01 23:06:45 -04:00
|
|
|
different kinds of random data. The examples all use the default
|
|
|
|
JDK-supplied PRNG. PRNG pluggability is covered in 2.6. The only
|
|
|
|
modification required to the examples to use alternative PRNGs is to
|
|
|
|
replace the argumentless constructor calls with invocations including
|
|
|
|
a <code>RandomGenerator</code> instance as a parameter.
|
|
|
|
</p>
|
2003-11-14 16:50:39 -05:00
|
|
|
</subsection>
|
|
|
|
|
|
|
|
<subsection name="2.2 Random numbers" href="deviates">
|
|
|
|
<p>
|
|
|
|
The <a href="../apidocs/org/apache/commons/math/random/RandomData.html">
|
2005-06-01 23:06:45 -04:00
|
|
|
org.apache.commons.math.random.RandomData</a> interface defines methods for
|
|
|
|
generating random sequences of numbers. The API contracts of these methods
|
|
|
|
use the following concepts:
|
2003-11-14 16:50:39 -05:00
|
|
|
<dl>
|
|
|
|
<dt>Random sequence of numbers from a probability distribution</dt>
|
2005-06-01 23:06:45 -04:00
|
|
|
<dd>There is no such thing as a single "random number." What can be
|
|
|
|
generated are <i>sequences</i> of numbers that appear to be random. When
|
|
|
|
using the built-in JDK function <code>Math.random()</code>, sequences of
|
|
|
|
values generated follow the
|
|
|
|
<a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda3662.htm">
|
|
|
|
Uniform Distribution</a>, which means that the values are evenly spread
|
|
|
|
over the interval between 0 and 1, with no sub-interval having a greater
|
|
|
|
probability of containing generated values than any other interval of the
|
|
|
|
same length. The mathematical concept of a
|
|
|
|
<a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda36.htm">
|
|
|
|
probability distribution</a> basically amounts to asserting that different
|
2005-06-04 01:18:17 -04:00
|
|
|
ranges in the set of possible values of a random variable have
|
2005-06-01 23:06:45 -04:00
|
|
|
different probabilities of containing the value. Commons Math supports
|
|
|
|
generating random sequences from the following probability distributions.
|
|
|
|
The javadoc for the <code>nextXxx</code> methods in
|
|
|
|
<code>RandomDataImpl</code> describes the algorithms used to generate
|
|
|
|
random deviates from each of these distributions.
|
2003-11-14 16:50:39 -05:00
|
|
|
<ul>
|
2005-06-01 23:06:45 -04:00
|
|
|
<li><a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda3662.htm">
|
|
|
|
uniform distribution</a></li>
|
|
|
|
<li><a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda3667.htm">
|
|
|
|
exponential distribution</a></li>
|
|
|
|
<li><a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda366j.htm">
|
|
|
|
Poisson distribution</a></li>
|
|
|
|
<li><a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda3661.htm">
|
|
|
|
Gaussian distribution</a></li>
|
2003-11-14 16:50:39 -05:00
|
|
|
</ul>
|
|
|
|
</dd>
|
|
|
|
<dt>Cryptographically secure random sequences</dt>
|
2005-06-01 23:06:45 -04:00
|
|
|
<dd>It is possible for a sequence of numbers to appear random, but
|
|
|
|
nonetheless to be predictable based on the algorithm used to generate the
|
|
|
|
sequence. If in addition to randomness, strong unpredictability is
|
|
|
|
required, it is best to use a
|
2003-11-14 16:50:39 -05:00
|
|
|
<a href="http://www.wikipedia.org/wiki/Cryptographically_secure_pseudo-random_number_generator">
|
2005-06-01 23:06:45 -04:00
|
|
|
secure random number generator</a> to generate values (or strings). The
|
|
|
|
<code>nextSecureXxx</code> methods in the <code>RandomDataImpl</code> implementation of
|
|
|
|
the <code>RandomData</code> interface use the JDK <code>SecureRandom</code>
|
|
|
|
PRNG to generate cryptographically secure sequences. The
|
|
|
|
<code>setSecureAlgorithm</code> method allows you to change the underlying
|
|
|
|
PRNG. These methods are <strong>much slower</strong> than the corresponding
|
|
|
|
"non-secure" versions, so they should only be used when cryptographic
|
|
|
|
security is required.</dd>
|
2003-11-14 16:50:39 -05:00
|
|
|
<dt>Seeding pseudo-random number generators</dt>
|
2005-06-01 23:06:45 -04:00
|
|
|
<dd>By default, the implementation provided in <code>RandomDataImpl</code>
|
|
|
|
uses the JDK-provided PRNG. Like most other PRNGs, the JDK generator
|
|
|
|
generates sequences of random numbers based on an initial "seed value".
|
|
|
|
For the non-secure methods, starting with the same seed always produces the
|
|
|
|
same sequence of values. Secure sequences started with the same seeds will
|
|
|
|
diverge. When a new <code>RandomDataImpl</code> is created, the underlying
|
|
|
|
random number generators are <strong>not</strong> initialized. The first
|
|
|
|
call to a data generation method, or to a <code>reSeed()</code> method,
|
|
|
|
initializes the appropriate generator. If you do not explicitly seed the
|
|
|
|
generator, it is by default seeded with the current time in milliseconds.
|
|
|
|
Therefore, to generate sequences of random data values, you should always
|
|
|
|
instantiate <strong>one</strong> <code>RandomDataImpl</code> and use it
|
|
|
|
repeatedly instead of creating new instances for subsequent values in the
|
|
|
|
sequence. For example, the following will generate a random sequence of 1000
|
|
|
|
long integers between 1 and 1,000,000, using the current time in
|
|
|
|
milliseconds as the seed for the JDK PRNG:
|
2004-01-11 14:35:22 -05:00
|
|
|
<source>
|
2005-06-04 01:18:17 -04:00
|
|
|
RandomData randomData = new RandomDataImpl();
|
2005-06-01 23:06:45 -04:00
|
|
|
for (int i = 0; i < 1000; i++) {
|
|
|
|
value = randomData.nextLong(1, 1000000);
|
|
|
|
}
|
2004-01-11 14:35:22 -05:00
|
|
|
</source>
|
2005-06-01 23:06:45 -04:00
|
|
|
The following will not in general produce a good random sequence, since the
|
|
|
|
PRNG is reseeded each time through the loop with the current time in
|
|
|
|
milliseconds:
|
2004-01-11 14:35:22 -05:00
|
|
|
<source>
|
2005-06-01 23:06:45 -04:00
|
|
|
for (int i = 0; i < 1000; i++) {
|
|
|
|
RandomDataImpl randomData = new RandomDataImpl();
|
|
|
|
value = randomData.nextLong(1, 1000000);
|
|
|
|
}
|
2004-01-11 14:35:22 -05:00
|
|
|
</source>
|
2005-06-01 23:06:45 -04:00
|
|
|
The following will produce the same random sequence each time it is
|
|
|
|
executed:
|
2004-01-11 14:35:22 -05:00
|
|
|
<source>
|
2005-06-04 01:18:17 -04:00
|
|
|
RandomDataImpl randomData = new RandomDataImpl(); // reSeed() is declared on the implementation class
|
2005-06-01 23:06:45 -04:00
|
|
|
randomData.reSeed(1000);
|
|
|
|
for (int i = 0; i < 1000; i++) {
|
|
|
|
value = randomData.nextLong(1, 1000000);
|
|
|
|
}
|
2004-01-11 14:35:22 -05:00
|
|
|
</source>
|
2005-06-01 23:06:45 -04:00
|
|
|
The following will produce a different random sequence each time it is
|
|
|
|
executed:
|
2004-01-11 14:35:22 -05:00
|
|
|
<source>
|
2005-06-04 01:18:17 -04:00
|
|
|
RandomDataImpl randomData = new RandomDataImpl(); // reSeedSecure() is declared on the implementation class
|
2005-06-01 23:06:45 -04:00
|
|
|
randomData.reSeedSecure(1000);
|
|
|
|
for (int i = 0; i < 1000; i++) {
|
|
|
|
value = randomData.nextSecureLong(1, 1000000);
|
|
|
|
}
|
2004-01-11 14:35:22 -05:00
|
|
|
</source>
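The repeatability of seeded, non-secure sequences can be checked with the JDK
alone. The following is a minimal sketch that uses <code>java.util.Random</code>
directly rather than <code>RandomDataImpl</code>; the class name
<code>SeedingSketch</code> and its <code>draw</code> helper are illustrative
assumptions, not part of Commons Math.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper class, not part of Commons Math.
final class SeedingSketch {

    /** Draws count values from a java.util.Random created with the given seed. */
    static long[] draw(long seed, int count) {
        Random random = new Random(seed);   // one generator for the whole sequence
        long[] values = new long[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextLong();
        }
        return values;
    }

    public static void main(String[] args) {
        long[] first = draw(1000L, 5);
        long[] second = draw(1000L, 5);
        // Same seed, same sequence.
        System.out.println(Arrays.equals(first, second)); // prints "true"
    }
}
```

Whether a seeded <code>SecureRandom</code> repeats or diverges depends on the
installed security provider, so the sketch deliberately covers only the
non-secure case.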
|
2003-11-14 16:50:39 -05:00
|
|
|
</dd></dl>
|
|
|
|
</p>
|
|
|
|
</subsection>
|
|
|
|
|
|
|
|
<subsection name="2.3 Random Strings" href="strings">
|
|
|
|
<p>
|
|
|
|
The methods <code>nextHexString</code> and <code>nextSecureHexString</code>
|
2005-06-04 01:18:17 -04:00
|
|
|
can be used to generate random strings of hexadecimal characters. Both
|
|
|
|
of these methods produce sequences of strings with good dispersion
|
|
|
|
properties. The difference between the two methods is that the second is
|
|
|
|
cryptographically secure. Specifically, the implementation of
|
|
|
|
<code>nextHexString(n)</code> in <code>RandomDataImpl</code> uses the
|
|
|
|
following simple algorithm to generate a string of <code>n</code> hex digits:
|
2003-11-14 16:50:39 -05:00
|
|
|
<ol>
|
|
|
|
<li>n/2+1 binary bytes are generated using the underlying Random</li>
|
|
|
|
<li>Each binary byte is translated into 2 hex digits</li></ol>
|
|
|
|
The <code>RandomDataImpl</code> implementation of the "secure" version,
|
2005-06-04 01:18:17 -04:00
|
|
|
<code>nextSecureHexString</code>, generates hex characters in 40-character
|
|
|
|
"chunks" using a 3-step process:
|
2003-11-14 16:50:39 -05:00
|
|
|
<ol>
|
2005-06-04 01:18:17 -04:00
|
|
|
<li>20 random bytes are generated using the underlying
|
|
|
|
<code>SecureRandom</code>.</li>
|
2003-11-14 16:50:39 -05:00
|
|
|
<li>A SHA-1 hash is applied to yield a 20-byte binary digest.</li>
|
|
|
|
<li>Each byte of the binary digest is converted to 2 hex digits</li></ol>
|
2005-06-04 01:18:17 -04:00
|
|
|
Similarly to the secure random number generation methods,
|
|
|
|
<code>nextSecureHexString</code> is <strong>much slower</strong> than
|
|
|
|
the non-secure version. It should be used only for applications such as
|
|
|
|
generating unique session or transaction ids where predictability of
|
|
|
|
subsequent ids based on observation of previous values is a security
|
|
|
|
concern. If all that is needed is an even distribution of hex characters
|
|
|
|
in the generated strings, the non-secure method should be used.
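The two procedures listed above can be sketched with JDK classes only. This is
an illustration of the described steps, not the <code>RandomDataImpl</code>
source; the class <code>HexStringSketch</code> and its method names are
assumptions made for the example, and truncating the result to
<code>n</code> characters is likewise an assumption about how the surplus
digits are discarded.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Random;

// Hypothetical helper class, not part of Commons Math.
final class HexStringSketch {

    /** Converts each byte to exactly two lowercase hex digits. */
    static String toHex(byte[] bytes) {
        StringBuilder out = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            out.append(Character.forDigit((b >> 4) & 0xF, 16));
            out.append(Character.forDigit(b & 0xF, 16));
        }
        return out.toString();
    }

    /** Non-secure variant: n/2+1 random bytes, two hex digits per byte. */
    static String hexString(Random random, int n) {
        byte[] bytes = new byte[n / 2 + 1];
        random.nextBytes(bytes);
        return toHex(bytes).substring(0, n);   // assumed: drop the surplus digits
    }

    /** Secure variant: 20 random bytes, SHA-1 digest, 40 hex digits per chunk. */
    static String secureHexString(SecureRandom random, int n) {
        MessageDigest sha1;
        try {
            sha1 = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
        StringBuilder out = new StringBuilder();
        while (out.length() < n) {
            byte[] bytes = new byte[20];
            random.nextBytes(bytes);
            out.append(toHex(sha1.digest(bytes)));   // digest() also resets sha1
        }
        return out.substring(0, n);
    }
}
```

For example, <code>HexStringSketch.hexString(new Random(), 16)</code> yields a
16-character string, while the secure variant needs one SHA-1 round for every
40 characters requested, which is one reason it is slower.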
|
2003-11-14 16:50:39 -05:00
|
|
|
</p>
|
|
|
|
</subsection>
|
|
|
|
|
2005-06-04 01:18:17 -04:00
|
|
|
<subsection name="2.4 Random permutations, combinations, sampling"
|
|
|
|
href="combinatorics">
|
2003-11-14 16:50:39 -05:00
|
|
|
<p>
|
|
|
|
To select a random sample of objects in a collection, you can use the
|
2005-06-04 01:18:17 -04:00
|
|
|
<code>nextSample</code> method in the <code>RandomData</code> interface.
|
|
|
|
Specifically, if <code>c</code> is a collection containing at least
|
|
|
|
<code>k</code> objects, and <code>randomData</code> is a
|
|
|
|
<code>RandomData</code> instance, then <code>randomData.nextSample(c, k)</code>
|
|
|
|
will return an <code>Object[]</code> array of length <code>k</code>
|
|
|
|
consisting of elements randomly selected from the collection. If
|
|
|
|
<code>c</code> contains duplicate references, there may be duplicate
|
|
|
|
references in the returned array; otherwise returned elements will be
|
|
|
|
unique -- i.e., the sampling is without replacement among the object
|
|
|
|
references in the collection. </p>
|
2003-11-14 16:50:39 -05:00
|
|
|
<p>
|
2005-06-04 01:18:17 -04:00
|
|
|
If <code>randomData</code> is a <code>RandomData</code> instance, and
|
|
|
|
<code>n</code> and <code>k</code> are integers with
|
|
|
|
<code> k <= n</code>, then
|
|
|
|
<code>randomData.nextPermutation(n, k)</code> returns an <code>int[]</code>
|
2003-11-14 16:50:39 -05:00
|
|
|
array of length <code>k</code> whose entries are selected randomly,
|
2005-06-04 01:18:17 -04:00
|
|
|
without repetition, from the integers <code>0</code> through
|
|
|
|
<code>n-1</code> (inclusive), i.e.,
|
|
|
|
<code>randomData.nextPermutation(n, k)</code> returns a random
|
|
|
|
permutation of <code>n</code> taken <code>k</code> at a time.
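One common way to obtain this behaviour is a partial Fisher-Yates shuffle. The
sketch below is an assumption about a reasonable implementation, not a copy of
the Commons Math code; <code>PermutationSketch</code> and its methods are
hypothetical names. Sampling from a collection is then a matter of choosing
<code>k</code> distinct indices.

```java
import java.util.Collection;
import java.util.Random;

// Hypothetical helper class, not part of Commons Math.
final class PermutationSketch {

    /** Returns k distinct integers drawn at random from 0..n-1. */
    static int[] permutation(Random random, int n, int k) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("require 0 <= k <= n");
        }
        int[] pool = new int[n];
        for (int i = 0; i < n; i++) {
            pool[i] = i;
        }
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            // Swap a random not-yet-chosen entry into position i.
            int j = i + random.nextInt(n - i);
            int tmp = pool[i];
            pool[i] = pool[j];
            pool[j] = tmp;
            result[i] = pool[i];
        }
        return result;
    }

    /** Samples k object references without replacement from the collection. */
    static Object[] sample(Random random, Collection<?> c, int k) {
        Object[] objects = c.toArray();
        int[] index = permutation(random, objects.length, k);
        Object[] result = new Object[k];
        for (int i = 0; i < k; i++) {
            result[i] = objects[index[i]];
        }
        return result;
    }
}
```

Because distinct indices are chosen, duplicates appear in the sample only when
the collection itself holds the same reference more than once.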
|
2003-11-14 16:50:39 -05:00
|
|
|
</p>
|
|
|
|
</subsection>
|
|
|
|
|
2005-05-22 01:25:44 -04:00
|
|
|
<subsection name="2.5 Generating data 'like' an input file" href="empirical">
|
2003-11-14 16:50:39 -05:00
|
|
|
<p>
|
2005-06-04 01:18:17 -04:00
|
|
|
Using the <code>ValueServer</code> class, you can generate data based on
|
|
|
|
the values in an input file in one of two ways:
|
2004-01-15 02:29:38 -05:00
|
|
|
<dl>
|
|
|
|
<dt>Replay Mode</dt>
|
|
|
|
<dd> The following code will read data from <code>url</code>
|
|
|
|
(a <code>java.net.URL</code> instance), cycling through the values in the
|
|
|
|
file in sequence, reopening and starting at the beginning again when all
|
|
|
|
values have been read.
|
|
|
|
<source>
|
|
|
|
ValueServer vs = new ValueServer();
|
|
|
|
vs.setValuesFileURL(url);
|
|
|
|
vs.setMode(ValueServer.REPLAY_MODE);
|
|
|
|
vs.resetReplayFile();
|
|
|
|
double value = vs.getNext();
|
|
|
|
// ...Generate and use more values...
|
|
|
|
vs.closeReplayFile();
|
|
|
|
</source>
|
|
|
|
The values in the file are not stored in memory, so it does not matter
|
2005-06-04 01:18:17 -04:00
|
|
|
how large the file is, but you do need to explicitly close the file
|
|
|
|
as above. The expected file format is newline-delimited (i.e., one value per line)
|
|
|
|
strings representing valid floating point numbers.
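The replay behaviour itself is simple to picture. The following sketch is a
stand-alone illustration, not the <code>ValueServer</code> implementation: the
class <code>ReplaySketch</code> is hypothetical, and it takes a supplier that
reopens the source instead of a <code>URL</code> so that it can be tried
without a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

// Hypothetical class, not part of Commons Math.
final class ReplaySketch {

    private final Supplier<BufferedReader> opener;  // reopens the source on demand
    private BufferedReader reader;

    ReplaySketch(Supplier<BufferedReader> opener) {
        this.opener = opener;
    }

    /** Returns the next value, wrapping around to the start at end of input. */
    double getNext() {
        try {
            if (reader == null) {
                reader = opener.get();
            }
            String line = reader.readLine();
            if (line == null) {
                // End of input: close, reopen and read the first line again.
                reader.close();
                reader = opener.get();
                line = reader.readLine();
                if (line == null) {
                    throw new IllegalStateException("source is empty");
                }
            }
            return Double.parseDouble(line.trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the underlying reader; only one line is ever held in memory. */
    void close() {
        try {
            if (reader != null) {
                reader.close();
                reader = null;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```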
|
2004-01-15 02:29:38 -05:00
|
|
|
</dd>
|
|
|
|
<dt>Digest Mode</dt>
|
|
|
|
<dd>When used in Digest Mode, the ValueServer reads the entire input file
|
|
|
|
and estimates a probability density function based on data from the file.
|
2005-06-04 01:18:17 -04:00
|
|
|
The estimation method is essentially the
|
|
|
|
<a href="http://nedwww.ipac.caltech.edu/level5/March02/Silverman/Silver2_6.html">
|
|
|
|
Variable Kernel Method</a> with Gaussian smoothing. Once the density
|
|
|
|
has been estimated, <code>getNext()</code> returns random values whose
|
|
|
|
probability distribution matches the empirical distribution -- i.e., if
|
|
|
|
you generate a large number of such values, their distribution should
|
|
|
|
"look like" the distribution of the values in the input file. The values
|
|
|
|
are not stored in memory in this case either, so there is no limit to the
|
|
|
|
size of the input file. Here is an example:
|
2004-01-15 02:29:38 -05:00
|
|
|
<source>
|
|
|
|
ValueServer vs = new ValueServer();
|
|
|
|
vs.setValuesFileURL(url);
|
|
|
|
vs.setMode(ValueServer.DIGEST_MODE);
|
|
|
|
vs.computeDistribution(500); //Read file and estimate distribution using 500 bins
|
|
|
|
double value = vs.getNext();
|
|
|
|
// ...Generate and use more values...
|
|
|
|
</source>
|
2005-06-04 01:18:17 -04:00
|
|
|
See the javadoc for <code>ValueServer</code> and
|
|
|
|
<code>EmpiricalDistribution</code> for more details. Note that
|
|
|
|
<code>computeDistribution()</code> opens and closes the input file
|
|
|
|
by itself.
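To make the idea concrete, here is a heavily simplified, in-memory sketch of
binned estimation with Gaussian smoothing. It is not the
<code>EmpiricalDistribution</code> implementation: the class
<code>DigestSketch</code> is hypothetical, it takes an array instead of
streaming a file, and the choice of equal-width bins with a per-bin mean and
population standard deviation is an assumption made for illustration.

```java
import java.util.Random;

// Hypothetical class, not part of Commons Math.
final class DigestSketch {

    private final int[] counts;         // observations per bin
    private final double[] means;       // mean of the observations in each bin
    private final double[] deviations;  // population standard deviation per bin
    private final int total;

    DigestSketch(double[] data, int binCount) {
        if (data.length == 0 || binCount < 1) {
            throw new IllegalArgumentException("need data and at least one bin");
        }
        double min = data[0];
        double max = data[0];
        for (double x : data) {
            min = Math.min(min, x);
            max = Math.max(max, x);
        }
        double width = (max - min) / binCount;
        counts = new int[binCount];
        double[] sums = new double[binCount];
        double[] squares = new double[binCount];
        for (double x : data) {
            // The maximum value falls into the last bin.
            int bin = width == 0 ? 0 : Math.min(binCount - 1, (int) ((x - min) / width));
            counts[bin]++;
            sums[bin] += x;
            squares[bin] += x * x;
        }
        means = new double[binCount];
        deviations = new double[binCount];
        for (int b = 0; b < binCount; b++) {
            if (counts[b] > 0) {
                means[b] = sums[b] / counts[b];
                double variance = squares[b] / counts[b] - means[b] * means[b];
                deviations[b] = Math.sqrt(Math.max(0.0, variance));
            }
        }
        total = data.length;
    }

    /** Picks a bin with probability proportional to its count, then smooths. */
    double getNext(Random random) {
        int target = random.nextInt(total);
        int cumulative = 0;
        for (int b = 0; b < counts.length; b++) {
            cumulative += counts[b];
            if (target < cumulative) {
                return means[b] + deviations[b] * random.nextGaussian();
            }
        }
        throw new IllegalStateException("unreachable: counts sum to total");
    }
}
```

More bins follow the shape of the input more closely; fewer bins smooth it
more heavily.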
|
2004-01-15 02:29:38 -05:00
|
|
|
</dd>
|
|
|
|
</dl>
|
|
|
|
</p>
|
2003-11-14 16:50:39 -05:00
|
|
|
</subsection>
|
|
|
|
|
2005-06-01 23:06:45 -04:00
|
|
|
<subsection name="2.6 PRNG Pluggability" href="pluggability">
|
|
|
|
<p>
|
|
|
|
To enable alternative PRNGs to be "plugged in" to the commons-math data
|
|
|
|
generation utilities and to provide a generic means to replace
|
|
|
|
<code>java.util.Random</code> in applications, a random generator
|
|
|
|
adaptor framework has been added to commons-math. The
|
|
|
|
<a href="../apidocs/org/apache/commons/math/random/RandomGenerator.html">
|
|
|
|
org.apache.commons.math.random.RandomGenerator</a> interface abstracts the public
|
|
|
|
interface of <code>java.util.Random</code> and any implementation of this
|
|
|
|
interface can be used as the source of random data for the commons-math
|
2005-06-04 01:18:17 -04:00
|
|
|
data generation classes. An abstract base class,
|
2005-06-01 23:06:45 -04:00
|
|
|
<a href="../apidocs/org/apache/commons/math/random/AbstractRandomGenerator.html">
|
|
|
|
org.apache.commons.math.random.AbstractRandomGenerator</a> is provided to make
|
|
|
|
implementation easier. This class provides default implementations of
|
|
|
|
"derived" data generation methods based on the primitive,
|
|
|
|
<code>nextDouble()</code>. To support generic replacement of
|
|
|
|
<code>java.util.Random</code>, the
|
|
|
|
<a href="../apidocs/org/apache/commons/math/random/RandomAdaptor.html">
|
|
|
|
org.apache.commons.math.random.RandomAdaptor</a> class is provided, which
|
|
|
|
extends <code>java.util.Random</code> and wraps and delegates calls to
|
|
|
|
a <code>RandomGenerator</code> instance.
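The adaptor idea can be illustrated without Commons Math on the classpath. In
the sketch below, <code>UniformSource</code> is a hypothetical stand-in for
<code>RandomGenerator</code> and <code>DelegatingRandom</code> a hypothetical
stand-in for <code>RandomAdaptor</code>; only three methods are overridden, and
deriving <code>nextInt</code> and <code>nextBoolean</code> from
<code>nextDouble()</code> is an assumption made for the example.

```java
import java.util.Random;

// Hypothetical stand-in for RandomGenerator: a single uniform primitive.
interface UniformSource {
    double nextDouble();   // expected to return values in [0, 1)
}

// Hypothetical stand-in for RandomAdaptor: usable wherever a Random is expected.
final class DelegatingRandom extends Random {

    private static final long serialVersionUID = 1L;

    private final transient UniformSource source;

    DelegatingRandom(UniformSource source) {
        this.source = source;
    }

    @Override
    public double nextDouble() {
        return source.nextDouble();          // delegate to the wrapped source
    }

    @Override
    public int nextInt(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        return (int) (source.nextDouble() * n);   // derived from the primitive
    }

    @Override
    public boolean nextBoolean() {
        return source.nextDouble() < 0.5;         // derived from the primitive
    }
}
```

Methods that are not overridden here, such as <code>nextLong()</code>, still
use the inherited <code>java.util.Random</code> state; a complete adaptor
would delegate every public method.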
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Examples:
|
|
|
|
<dl>
|
|
|
|
<dt>Create a RandomGenerator based on RngPack's Mersenne Twister</dt>
|
|
|
|
<dd>To create a RandomGenerator using the RngPack Mersenne Twister PRNG
|
|
|
|
as the source of randomness, extend <code>AbstractRandomGenerator</code>,
|
|
|
|
overriding the derived methods that the RngPack implementation provides:
|
|
|
|
<source>
|
|
|
|
import edu.cornell.lassp.houle.RngPack.RanMT;
import org.apache.commons.math.random.AbstractRandomGenerator;
|
|
|
|
/**
|
|
|
|
* AbstractRandomGenerator based on RngPack RanMT generator.
|
|
|
|
*/
|
|
|
|
public class RngPackGenerator extends AbstractRandomGenerator {
|
|
|
|
|
|
|
|
private RanMT random = new RanMT();
|
|
|
|
|
|
|
|
public void setSeed(long seed) {
|
|
|
|
random = new RanMT(seed);
|
|
|
|
}
|
|
|
|
|
|
|
|
public double nextDouble() {
|
|
|
|
return random.raw();
|
|
|
|
}
|
|
|
|
|
|
|
|
public double nextGaussian() {
|
|
|
|
return random.gaussian();
|
|
|
|
}
|
|
|
|
|
|
|
|
public int nextInt(int n) {
|
|
|
|
return random.choose(n);
|
|
|
|
}
|
|
|
|
|
|
|
|
public boolean nextBoolean() {
|
|
|
|
return random.coin();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
<dt>Use the Mersenne Twister RandomGenerator in place of
|
|
|
|
<code>java.util.Random</code> in <code>RandomData</code></dt>
|
|
|
|
<dd>
|
|
|
|
<source>
|
|
|
|
RandomData randomData = new RandomDataImpl(new RngPackGenerator());
|
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
<dt>Create an adaptor instance based on the Mersenne Twister generator
|
|
|
|
that can be used in place of a <code>Random</code></dt>
|
|
|
|
<dd>
|
|
|
|
<source>
|
|
|
|
RandomGenerator generator = new RngPackGenerator();
|
|
|
|
Random random = RandomAdaptor.createAdaptor(generator);
|
|
|
|
// random can now be used in place of a Random instance, data generation
|
|
|
|
// calls will be delegated to the wrapped Mersenne Twister
|
|
|
|
</source>
|
|
|
|
</dd>
|
|
|
|
</dl>
|
|
|
|
</p>
|
|
|
|
</subsection>
|
|
|
|
|
2003-11-14 16:50:39 -05:00
|
|
|
</section>
|
|
|
|
|
|
|
|
</body>
|
|
|
|
</document>
|