mirror of https://github.com/apache/lucene.git
211 lines
8.4 KiB
Plaintext
211 lines
8.4 KiB
Plaintext
= Monte Carlo Simulations
|
|
// Licensed to the Apache Software Foundation (ASF) under one
|
|
// or more contributor license agreements. See the NOTICE file
|
|
// distributed with this work for additional information
|
|
// regarding copyright ownership. The ASF licenses this file
|
|
// to you under the Apache License, Version 2.0 (the
|
|
// "License"); you may not use this file except in compliance
|
|
// with the License. You may obtain a copy of the License at
|
|
//
|
|
// http://www.apache.org/licenses/LICENSE-2.0
|
|
//
|
|
// Unless required by applicable law or agreed to in writing,
|
|
// software distributed under the License is distributed on an
|
|
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
// KIND, either express or implied. See the License for the
|
|
// specific language governing permissions and limitations
|
|
// under the License.
|
|
|
|
|
|
Monte Carlo simulations are commonly used to model the behavior of
|
|
stochastic systems. This section describes
|
|
how to perform both uncorrelated and correlated Monte Carlo simulations
|
|
using the sampling capabilities of the probability distribution framework.
|
|
|
|
== Uncorrelated Simulations
|
|
|
|
Uncorrelated Monte Carlo simulations model stochastic systems with the assumption
|
|
that the underlying random variables move independently of each other.
|
|
A simple example of a Monte Carlo simulation using two independently changing random variables
|
|
is described below.
|
|
|
|
In this example a Monte Carlo simulation is used to determine the probability that a simple hinge assembly will
|
|
fall within a required length specification.
|
|
|
|
The hinge has two components A and B. The combined length of the two components must be less then 5 centimeters
|
|
to fall within specification.
|
|
|
|
A random sampling of lengths for component A has shown that its length conforms to a
|
|
normal distribution with a mean of 2.2 centimeters and a standard deviation of .0195
|
|
centimeters.
|
|
|
|
A random sampling of lengths for component B has shown that its length conforms
|
|
to a normal distribution with a mean of 2.71 centimeters and a standard deviation of .0198 centimeters.
|
|
|
|
[source,text]
|
|
----
|
|
let(componentA=normalDistribution(2.2, .0195), <1>
|
|
componentB=normalDistribution(2.71, .0198), <2>
|
|
simresults=monteCarlo(sampleA=sample(componentA), <3>
|
|
sampleB=sample(componentB),
|
|
add(sampleA, sampleB), <4>
|
|
100000), <5>
|
|
simmodel=empiricalDistribution(simresults), <6>
|
|
prob=cumulativeProbability(simmodel, 5)) <7>
|
|
----
|
|
|
|
The Monte Carlo simulation below performs the following steps:
|
|
|
|
<1> A normal distribution with a mean of 2.2 and a standard deviation of .0195 is created to model the length of `componentA`.
|
|
<2> A normal distribution with a mean of 2.71 and a standard deviation of .0198 is created to model the length of `componentB`.
|
|
<3> The `monteCarlo` function samples from the `componentA` and `componentB` distributions and sets the values to variables `sampleA` and `sampleB`.
|
|
<4> It then calls the `add(sampleA, sampleB)`* function to find the combined lengths of the samples.
|
|
<5> The `monteCarlo` function runs a set number of times, 100000, and collects the results in an array. Each
|
|
time the function is called new samples are drawn from the `componentA`
|
|
and `componentB` distributions. On each run, the `add` function adds the two samples to calculate the combined length.
|
|
The result of each run is collected in an array and assigned to the `simresults` variable.
|
|
<6> An `empiricalDistribution` function is then created from the `simresults` array to model the distribution of the
|
|
simulation results.
|
|
<7> Finally, the `cumulativeProbability` function is called on the `simmodel` to determine the cumulative probability
|
|
that the combined length of the components is 5 or less.
|
|
|
|
Based on the simulation there is .9994371944629039 probability that the combined length of a component pair will
|
|
be 5 or less:
|
|
|
|
[source,json]
|
|
----
|
|
{
|
|
"result-set": {
|
|
"docs": [
|
|
{
|
|
"prob": 0.9994371944629039
|
|
},
|
|
{
|
|
"EOF": true,
|
|
"RESPONSE_TIME": 660
|
|
}
|
|
]
|
|
}
|
|
}
|
|
----
|
|
|
|
== Correlated Simulations
|
|
|
|
The simulation above assumes that the lengths of `componentA` and `componentB` vary independently.
|
|
What would happen to the probability model if there was a correlation between the lengths of
|
|
`componentA` and `componentB`?
|
|
|
|
In the example below a database containing assembled pairs of components is used to determine
|
|
if there is a correlation between the lengths of the components, and how the correlation effects the model.
|
|
|
|
Before performing a simulation of the effects of correlation on the probability model its
|
|
useful to understand what the correlation is between the lengths of `componentA` and `componentB`.
|
|
|
|
[source,text]
|
|
----
|
|
let(a=random(collection5, q="*:*", rows="5000", fl="componentA_d, componentB_d"), <1>
|
|
b=col(a, componentA_d)), <2>
|
|
c=col(a, componentB_d)),
|
|
d=corr(b, c)) <3>
|
|
----
|
|
|
|
<1> In the example, 5000 random samples are selected from a collection of assembled hinges.
|
|
Each sample contains lengths of the components in the fields `componentA_d` and `componentB_d`.
|
|
<2> Both fields are then vectorized. The *componentA_d* vector is stored in
|
|
variable *`b`* and the *componentB_d* variable is stored in variable *`c`*.
|
|
<3> Then the correlation of the two vectors is calculated using the `corr` function.
|
|
|
|
Note from the result that the outcome from `corr` is 0.9996931313216989.
|
|
This means that `componentA_d` and *`componentB_d` are almost perfectly correlated.
|
|
|
|
[source,json]
|
|
----
|
|
{
|
|
"result-set": {
|
|
"docs": [
|
|
{
|
|
"d": 0.9996931313216989
|
|
},
|
|
{
|
|
"EOF": true,
|
|
"RESPONSE_TIME": 309
|
|
}
|
|
]
|
|
}
|
|
}
|
|
----
|
|
|
|
=== Correlation Effects on the Probability Model
|
|
|
|
The example below explores how to use a multivariate normal distribution function
|
|
to model how correlation effects the probability of hinge defects.
|
|
|
|
In this example 5000 random samples are selected from a collection
|
|
containing length data for assembled hinges. Each sample contains
|
|
the fields `componentA_d` and `componentB_d`.
|
|
|
|
Both fields are then vectorized. The `componentA_d` vector is stored in
|
|
variable *`b`* and the `componentB_d` variable is stored in variable *`c`*.
|
|
|
|
An array is created that contains the means of the two vectorized fields.
|
|
|
|
Then both vectors are added to a matrix which is transposed. This creates
|
|
an observation matrix where each row contains one observation of
|
|
`componentA_d` and `componentB_d`. A covariance matrix is then created from the columns of
|
|
the observation matrix with the
|
|
`cov` function. The covariance matrix describes the covariance between `componentA_d` and `componentB_d`.
|
|
|
|
The `multivariateNormalDistribution` function is then called with the
|
|
array of means for the two fields and the covariance matrix. The model
|
|
for the multivariate normal distribution is stored in variable *`g`*.
|
|
|
|
The `monteCarlo` function then calls the function `add(sample(g))` 50000 times
|
|
and collections the results in a vector. Each time the function is called a single sample
|
|
is drawn from the multivariate normal distribution. Each sample is a vector containing
|
|
one `componentA` and `componentB` pair. The `add` function adds the values in the vector to
|
|
calculate the length of the pair. Over the long term the samples drawn from the
|
|
multivariate normal distribution will conform to the covariance matrix used to construct it.
|
|
|
|
Just as in the non-correlated example an empirical distribution is used to model probabilities
|
|
of the simulation vector and the `cumulativeProbability` function is used to compute the cumulative
|
|
probability that the combined component length will be 5 centimeters or less.
|
|
|
|
Notice that the probability of a hinge meeting specification has dropped to 0.9889517439980468.
|
|
This is because the strong correlation
|
|
between the lengths of components means that their lengths rise together causing more hinges to
|
|
fall out of the 5 centimeter specification.
|
|
|
|
[source,text]
|
|
----
|
|
let(a=random(hinges, q="*:*", rows="5000", fl="componentA_d, componentB_d"),
|
|
b=col(a, componentA_d),
|
|
c=col(a, componentB_d),
|
|
cor=corr(b,c),
|
|
d=array(mean(b), mean(c)),
|
|
e=transpose(matrix(b, c)),
|
|
f=cov(e),
|
|
g=multiVariateNormalDistribution(d, f),
|
|
h=monteCarlo(add(sample(g)), 50000),
|
|
i=empiricalDistribution(h),
|
|
j=cumulativeProbability(i, 5))
|
|
----
|
|
|
|
When this expression is sent to the `/stream` handler it responds with:
|
|
|
|
[source,json]
|
|
----
|
|
{
|
|
"result-set": {
|
|
"docs": [
|
|
{
|
|
"j": 0.9889517439980468
|
|
},
|
|
{
|
|
"EOF": true,
|
|
"RESPONSE_TIME": 599
|
|
}
|
|
]
|
|
}
|
|
}
|
|
----
|