<?xml version="1.0"?>

<!--
   Copyright 2003-2004 The Apache Software Foundation

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
  -->

<?xml-stylesheet type="text/xsl" href="./xdoc.xsl"?>
<!-- $Revision: 1.8 $ $Date: 2004/02/29 18:50:10 $ -->
<document url="random.html">

<properties>
    <title>The Commons Math User Guide - Data Generation</title>
</properties>

<body>

<section name="2 Data Generation">

<subsection name="2.1 Overview" href="overview">
    <p>
    The Commons Math random package includes utilities for
    <ul>
        <li>generating random numbers</li>
        <li>generating random strings</li>
        <li>generating cryptographically secure sequences of random numbers or strings</li>
        <li>generating random samples and permutations</li>
        <li>analyzing distributions of values in an input file and generating values "like"
            the values in the file</li>
        <li>generating data for grouped frequency distributions or histograms</li>
    </ul></p>
</subsection>

<subsection name="2.2 Random numbers" href="deviates">
    <p>
    The <a href="../apidocs/org/apache/commons/math/random/RandomData.html">
    org.apache.commons.math.random.RandomData</a> interface defines methods for generating
    random sequences of numbers. The API contracts of these methods use the following concepts:
    <dl>
    <dt>Random sequence of numbers from a probability distribution</dt>
    <dd>There is no such thing as a single "random number."  What can be generated
    are <i>sequences</i> of numbers that appear to be random.  When using the
    built-in JDK function <code>Math.random()</code>, sequences of values generated
    follow the <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda3662.htm">
    Uniform Distribution</a>, which means that the values are evenly spread over the interval
    between 0 and 1, with no sub-interval having a greater probability of containing generated
    values than any other interval of the same length.  The mathematical concept of a <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda36.htm">
    probability distribution</a> basically amounts to asserting that different ranges in the set
    of possible values of a random variable have different probabilities of containing the value.
    Commons Math supports generating random sequences from the following probability distributions. The
    javadoc for the <code>nextXxx</code> methods in <code>RandomDataImpl</code> describes the algorithms used
    to generate random deviates from each of these distributions.
    <ul>
    <li><a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda3662.htm">uniform distribution</a></li>
    <li><a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda3667.htm">exponential distribution</a></li>
    <li><a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda366j.htm">Poisson distribution</a></li>
    <li><a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda3661.htm">Gaussian distribution</a></li>
    </ul>
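    To make the four distributions concrete, the following is a minimal sketch that draws deviates
    using only the JDK's <code>java.util.Random</code>.  The class <code>DeviateSketch</code> and its
    methods are hypothetical names chosen for illustration, and the techniques shown (CDF inversion
    for the exponential case, Knuth's multiplication method for the Poisson case) are assumptions of
    this sketch, not necessarily the algorithms used by <code>RandomDataImpl</code>.
    <source><![CDATA[
```java
import java.util.Random;

/** Hypothetical helper, not part of Commons Math: deviates built on java.util.Random. */
final class DeviateSketch {
    private final Random random;

    DeviateSketch(long seed) {
        this.random = new Random(seed);
    }

    /** Uniform deviate between lower (inclusive) and upper. */
    double nextUniform(double lower, double upper) {
        return lower + random.nextDouble() * (upper - lower);
    }

    /** Exponential deviate with the given mean, obtained by inverting the CDF. */
    double nextExponential(double mean) {
        return -mean * Math.log(1.0 - random.nextDouble());
    }

    /** Gaussian deviate with mean mu and standard deviation sigma. */
    double nextGaussian(double mu, double sigma) {
        return mu + sigma * random.nextGaussian();
    }

    /** Poisson deviate via Knuth's multiplication method (only practical for small means). */
    long nextPoisson(double mean) {
        double limit = Math.exp(-mean);
        double product = random.nextDouble();
        long count = 0;
        // Multiply uniforms until the running product drops below e^(-mean).
        while (product >= limit) {
            product *= random.nextDouble();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        DeviateSketch deviates = new DeviateSketch(17L);
        System.out.println("uniform:     " + deviates.nextUniform(0.0, 1.0));
        System.out.println("exponential: " + deviates.nextExponential(2.0));
        System.out.println("gaussian:    " + deviates.nextGaussian(0.0, 1.0));
        System.out.println("poisson:     " + deviates.nextPoisson(3.0));
    }
}
```
    ]]></source>
    Averaging a large number of values from each method should give a result close to the requested mean.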
    </dd>
    <dt>Cryptographically secure random sequences</dt>
    <dd>It is possible for a sequence of numbers to appear random, but nonetheless to be
    predictable based on the algorithm used to generate the sequence. If in addition to
    randomness, strong unpredictability is required, it is best to use a
    <a href="http://www.wikipedia.org/wiki/Cryptographically_secure_pseudo-random_number_generator">
    secure random number generator</a> to generate values (or strings). The <code>nextSecureXxx</code> methods
    in the <code>RandomDataImpl</code> implementation of the <code>RandomData</code> interface use the
    JDK <code>SecureRandom</code> pseudo-random number generator (PRNG)
    to generate cryptographically secure sequences.  The <code>setSecureAlgorithm</code> method
    allows you to change the underlying PRNG. These methods are <strong>much slower</strong> than
    the corresponding "non-secure" versions, so they should only be used when cryptographic security
    is required.</dd>
    <dt>Seeding pseudo-random number generators</dt>
    <dd>By default, the implementation provided in <code>RandomDataImpl</code> uses the JDK-provided
    PRNG.  Like other PRNGs, the JDK generator generates sequences of random numbers based on an initial
    "seed value".  For the non-secure methods, starting with the same seed always produces the same
    sequence of values.  Secure sequences started with the same seeds will diverge. When a new
    <code>RandomDataImpl</code> is created, the underlying random number generators are
    <strong>not</strong> initialized.  The first call to a data generation method, or to a
    <code>reSeed()</code> method, initializes the appropriate generator.  If you do not explicitly
    seed the generator, it is by default seeded with the current time in milliseconds.  Therefore,
    to generate sequences of random data values, you should always instantiate <strong>one</strong>
    <code>RandomDataImpl</code> and use it repeatedly instead of creating new instances for
    subsequent values in the sequence.  For example, the following will generate a random sequence
    of 1000 long integers between 1 and 1,000,000, using the current time in milliseconds as the seed
    for the JDK PRNG:
    <source>
        RandomDataImpl randomData = new RandomDataImpl();
        for (int i = 0; i &lt; 1000; i++) {
            long value = randomData.nextLong(1, 1000000);
        }
    </source>
    The following will not in general produce a good random sequence, since the PRNG is reseeded
    each time through the loop with the current time in milliseconds:
    <source>
        for (int i = 0; i &lt; 1000; i++) {
            RandomDataImpl randomData = new RandomDataImpl();
            long value = randomData.nextLong(1, 1000000);
        }
    </source>
    The following will produce the same random sequence each time it is executed:
    <source>
        RandomDataImpl randomData = new RandomDataImpl();
        randomData.reSeed(1000);
        for (int i = 0; i &lt; 1000; i++) {
            long value = randomData.nextLong(1, 1000000);
        }
    </source>
    The following will produce a different random sequence each time it is executed:
    <source>
        RandomDataImpl randomData = new RandomDataImpl();
        randomData.reSeedSecure(1000);
        for (int i = 0; i &lt; 1000; i++) {
            long value = randomData.nextSecureLong(1, 1000000);
        }
    </source>
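    The repeatability of non-secure sequences can also be checked with the JDK generator alone.
    The sketch below is an illustration only: the class <code>SeedSketch</code> and its
    <code>sequence</code> method are hypothetical names, and it uses <code>java.util.Random</code>
    directly instead of <code>RandomDataImpl</code>.
    <source><![CDATA[
```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical demo, not part of Commons Math. */
final class SeedSketch {
    /** Returns the first n values in [1, 1000000] from a generator started with the given seed. */
    static long[] sequence(long seed, int n) {
        Random random = new Random(seed);
        long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = 1 + random.nextInt(1000000);
        }
        return values;
    }

    public static void main(String[] args) {
        long[] first = sequence(1000L, 5);
        long[] second = sequence(1000L, 5);
        // Two generators started from the same seed yield the same values.
        System.out.println(Arrays.equals(first, second)); // prints "true"
    }
}
```
    ]]></source>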
    </dd></dl>
    </p>
</subsection>

<subsection name="2.3 Random Strings" href="strings">
    <p>
    The methods <code>nextHexString</code> and <code>nextSecureHexString</code>
    can be used to generate random strings of hexadecimal characters.  Both of these
    methods produce sequences of strings with good dispersion properties.
    The difference between the two methods is that the second is cryptographically secure.
    Specifically, the implementation of <code>nextHexString(n)</code> in <code>RandomDataImpl</code>
    uses the following simple algorithm to generate a string of <code>n</code> hex digits:
    <ol>
    <li>n/2+1 binary bytes are generated using the underlying <code>Random</code></li>
    <li>Each binary byte is translated into 2 hex digits</li></ol>
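    The two steps above can be sketched with the JDK alone.  This is a minimal illustration, not the
    <code>RandomDataImpl</code> source: the class name <code>HexStringSketch</code> is hypothetical, and
    truncating the result to exactly <code>n</code> characters is an assumption about how the surplus
    digits are handled.
    <source><![CDATA[
```java
import java.util.Random;

/** Hypothetical sketch of the non-secure hex string algorithm described above. */
final class HexStringSketch {
    static String nextHexString(Random random, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        // Step 1: generate n/2 + 1 random bytes.
        byte[] bytes = new byte[n / 2 + 1];
        random.nextBytes(bytes);
        // Step 2: translate each byte into two hex digits.
        StringBuilder hex = new StringBuilder(2 * bytes.length);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xff));
        }
        // Assumption: surplus digits are dropped so that exactly n remain.
        return hex.substring(0, n);
    }

    public static void main(String[] args) {
        System.out.println(nextHexString(new Random(), 16));
    }
}
```
    ]]></source>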
    The <code>RandomDataImpl</code> implementation of the "secure" version,
    <code>nextSecureHexString</code>, generates hex characters in 40-character "chunks"
    using a 3-step process:
    <ol>
    <li>20 random bytes are generated using the underlying <code>SecureRandom</code>.</li>
    <li>A SHA-1 hash is applied to these bytes to yield a 20-byte binary digest.</li>
    <li>Each byte of the binary digest is converted to 2 hex digits.</li></ol>
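    A rough sketch of this chunked process, using only <code>java.security</code> classes, might look
    as follows.  The class name <code>SecureHexStringSketch</code> is hypothetical, and repeating the
    three steps until <code>n</code> characters are available, then truncating, is an assumption of
    this sketch.
    <source><![CDATA[
```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

/** Hypothetical sketch of the chunked secure hex string process described above. */
final class SecureHexStringSketch {
    static String nextSecureHexString(SecureRandom random, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        MessageDigest sha1;
        try {
            sha1 = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
        StringBuilder hex = new StringBuilder();
        while (hex.length() < n) {
            // Step 1: 20 random bytes from the secure generator.
            byte[] randomBytes = new byte[20];
            random.nextBytes(randomBytes);
            // Step 2: SHA-1 yields a 20-byte digest.
            byte[] digest = sha1.digest(randomBytes);
            // Step 3: each digest byte becomes two hex digits (40 characters per chunk).
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
        }
        // Assumption: the final chunk is truncated so that exactly n characters remain.
        return hex.substring(0, n);
    }

    public static void main(String[] args) {
        System.out.println(nextSecureHexString(new SecureRandom(), 32));
    }
}
```
    ]]></source>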
    Similarly to the secure random number generation methods, <code>nextSecureHexString</code>
    is <strong>much slower</strong> than the non-secure version.  It should be used only for
    applications such as generating unique session or transaction ids where predictability of
    subsequent ids based on observation of previous values is a security concern.  If all
    that is needed is an even distribution of hex characters in the generated strings, the
    non-secure method should be used.
    </p>
</subsection>

<subsection name="2.4 Random permutations, combinations, sampling" href="combinatorics">
    <p>
    To select a random sample of objects in a collection, you can use the
    <code>nextSample</code> method in the <code>RandomData</code> interface.  Specifically,
    if <code>c</code> is a collection containing at least <code>k</code> objects, and
    <code>randomData</code> is a <code>RandomDataImpl</code> instance, then
    <code>randomData.nextSample(c, k)</code>
    will return an <code>Object[]</code> array of length <code>k</code> consisting of
    elements randomly selected from the collection.  If <code>c</code> contains
    duplicate references, there may be duplicate references in the returned array;
    otherwise returned elements will be unique -- i.e., the sampling is without
    replacement among the object references in the collection. </p>
    <p>
    If <code>randomData</code> is a <code>RandomDataImpl</code> instance, and
    <code>n</code> and <code>k</code> are integers with <code>k &lt;= n</code>,
    then <code>randomData.nextPermutation(n, k)</code> returns an <code>int[]</code>
    array of length <code>k</code> whose entries are selected randomly,
    without repetition, from the integers <code>0</code> through <code>n-1</code> (inclusive), i.e.,
    <code>randomData.nextPermutation(n, k)</code> returns a random permutation of
    <code>n</code> taken <code>k</code> at a time.
    </p>
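    <p>
    Both operations in this subsection can be illustrated with a partial Fisher-Yates shuffle.  The
    following is a sketch under stated assumptions, not the <code>RandomDataImpl</code> implementation:
    the class <code>SamplingSketch</code> is a hypothetical name, and its two static methods only
    mirror the behavior described above using <code>java.util.Random</code>.
    <source><![CDATA[
```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Random;

/** Hypothetical sketch of sampling without replacement; not part of Commons Math. */
final class SamplingSketch {
    /** Returns k distinct integers chosen at random from 0..n-1, in random order. */
    static int[] nextPermutation(Random random, int n, int k) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("k must satisfy 0 <= k <= n");
        }
        int[] pool = new int[n];
        for (int i = 0; i < n; i++) {
            pool[i] = i;
        }
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            // Swap a randomly chosen not-yet-used entry into position i.
            int j = i + random.nextInt(n - i);
            int tmp = pool[i];
            pool[i] = pool[j];
            pool[j] = tmp;
            result[i] = pool[i];
        }
        return result;
    }

    /** Returns k elements of c, selected without replacement among its object references. */
    static Object[] nextSample(Random random, Collection<?> c, int k) {
        Object[] items = c.toArray();
        int[] indices = nextPermutation(random, items.length, k);
        Object[] sample = new Object[k];
        for (int i = 0; i < k; i++) {
            int index = indices[i];
            sample[i] = items[index];
        }
        return sample;
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println(Arrays.toString(nextPermutation(random, 10, 4)));
        System.out.println(Arrays.toString(nextSample(random, Arrays.asList("a", "b", "c", "d"), 2)));
    }
}
```
    ]]></source>
    Because the sample is drawn by index, duplicate references in the collection can appear more than
    once in the result, as described above.
    </p>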
</subsection>

<subsection name='2.5 Generating data "like" an input file' href="empirical">
    <p>
    Using the <code>ValueServer</code> class, you can generate data based on the
    values in an input file in one of two ways:
    <dl>
      <dt>Replay Mode</dt>
      <dd> The following code will read data from <code>url</code>
      (a <code>java.net.URL</code> instance), cycling through the values in the
      file in sequence, reopening and starting at the beginning again when all
      values have been read.
      <source>
      ValueServer vs = new ValueServer();
      vs.setValuesFileURL(url);
      vs.setMode(ValueServer.REPLAY_MODE);
      vs.resetReplayFile();
      double value = vs.getNext();
      // ...Generate and use more values...
      vs.closeReplayFile();
      </source>
      The values in the file are not stored in memory, so it does not matter
      how large the file is, but you do need to explicitly close the file as above.
      The expected file format is newline-delimited (i.e., one value per line) strings
      representing valid floating point numbers.
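      The replay behavior itself is simple enough to sketch with JDK classes.  The class
      <code>ReplaySketch</code> below is a hypothetical illustration, not the <code>ValueServer</code>
      implementation, and it reads from a local <code>java.nio.file.Path</code> where
      <code>ValueServer</code> takes a URL.
      <source><![CDATA[
```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical sketch of replay mode: cycles through a file holding one number per line. */
final class ReplaySketch implements AutoCloseable {
    private final Path file;
    private BufferedReader reader;

    ReplaySketch(Path file) {
        this.file = file;
    }

    /** Returns the next value, reopening the file when its end is reached. */
    double getNext() {
        try {
            if (reader == null) {
                reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
            }
            String line = reader.readLine();
            if (line == null) {
                // End of file: close, reopen and start again from the first value.
                reader.close();
                reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
                line = reader.readLine();
                if (line == null) {
                    throw new IllegalStateException("no values in " + file);
                }
            }
            return Double.parseDouble(line.trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Only one line is held in memory at a time, so the open file must be closed explicitly. */
    @Override
    public void close() {
        try {
            if (reader != null) {
                reader.close();
                reader = null;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("values", ".txt");
        Files.write(file, Arrays.asList("1.5", "2.5", "3.5"));
        try (ReplaySketch replay = new ReplaySketch(file)) {
            for (int i = 0; i < 5; i++) {
                System.out.println(replay.getNext()); // 1.5, 2.5, 3.5, then 1.5, 2.5 again
            }
        } finally {
            Files.delete(file);
        }
    }
}
```
      ]]></source>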
      </dd>
      <dt>Digest Mode</dt>
      <dd>When used in Digest Mode, the ValueServer reads the entire input file
      and estimates a probability density function based on data from the file.
      The estimation method is essentially the <a href="http://nedwww.ipac.caltech.edu/level5/March02/Silverman/Silver2_6.html">
      Variable Kernel Method</a> with Gaussian smoothing.  Once the density has been
      estimated, <code>getNext()</code> returns random values whose probability
      distribution matches the empirical distribution -- i.e., if you generate a large
      number of such values, their distribution should "look like" the distribution of
      the values in the input file.  The values are not stored in memory in this case either,
      so there is no limit to the size of the input file.  Here is an example:
      <source>
      ValueServer vs = new ValueServer();
      vs.setValuesFileURL(url);
      vs.setMode(ValueServer.DIGEST_MODE);
      vs.computeDistribution(500); //Read file and estimate distribution using 500 bins
      double value = vs.getNext();
      // ...Generate and use more values...
      </source>
      See the javadoc for <code>ValueServer</code> and <code>EmpiricalDistribution</code>
      for more details.  Note that <code>computeDistribution()</code> opens and closes
      the input file by itself.
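      To give a feel for how binned data can drive value generation, here is a deliberately simplified
      sketch.  It is not the <code>ValueServer</code> or <code>EmpiricalDistribution</code> code: the
      class <code>DigestSketch</code> is hypothetical, it takes its data from an in-memory array for
      brevity instead of streaming a file, and its approach (pick a bin with probability proportional
      to its count, then draw a Gaussian deviate with that bin's mean and standard deviation) is an
      assumption made for illustration.
      <source><![CDATA[
```java
import java.util.Random;

/** Hypothetical, simplified stand-in for digest mode; not the ValueServer implementation. */
final class DigestSketch {
    private final double[] cumulative; // fraction of the data at or below each bin
    private final double[] means;      // per-bin sample mean
    private final double[] stdDevs;    // per-bin sample standard deviation
    private final Random random;

    DigestSketch(double[] data, int binCount, Random random) {
        if (data.length == 0 || binCount < 1) {
            throw new IllegalArgumentException("need data and at least one bin");
        }
        this.random = random;
        double min = data[0];
        double max = data[0];
        for (double x : data) {
            min = Math.min(min, x);
            max = Math.max(max, x);
        }
        double width = (max - min) / binCount;
        long[] counts = new long[binCount];
        double[] sums = new double[binCount];
        double[] sumSquares = new double[binCount];
        for (double x : data) {
            // The maximum value is clamped into the last bin.
            int bin = width > 0 ? Math.min((int) ((x - min) / width), binCount - 1) : 0;
            counts[bin]++;
            sums[bin] += x;
            sumSquares[bin] += x * x;
        }
        cumulative = new double[binCount];
        means = new double[binCount];
        stdDevs = new double[binCount];
        long seen = 0;
        for (int b = 0; b < binCount; b++) {
            seen += counts[b];
            cumulative[b] = (double) seen / data.length;
            if (counts[b] > 0) {
                means[b] = sums[b] / counts[b];
            }
            if (counts[b] > 1) {
                double variance =
                    (sumSquares[b] - counts[b] * means[b] * means[b]) / (counts[b] - 1);
                stdDevs[b] = Math.sqrt(Math.max(variance, 0.0));
            }
        }
    }

    /** Picks a bin in proportion to its share of the data, then smooths within that bin. */
    double getNext() {
        double u = random.nextDouble();
        int bin = 0;
        while (bin < cumulative.length - 1 && cumulative[bin] <= u) {
            bin++;
        }
        return means[bin] + stdDevs[bin] * random.nextGaussian();
    }
}
```
      ]]></source>
      With enough bins, a large number of generated values should be spread across the bins in roughly
      the same proportions as the input data.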
      </dd>
    </dl>
  </p>
</subsection>

</section>

</body>
</document>