mirror of https://github.com/apache/lucene.git
SOLR-12701: format/style consistency fixes for math expression docs; CSS change to make bold monospace appear properly
parent
a1b6db26db
commit
a619038e90
|
@ -885,6 +885,11 @@ h6 strong
|
|||
line-height: 1.45;
|
||||
}
|
||||
|
||||
p strong code,
|
||||
td strong code {
|
||||
font-weight: bold;
|
||||
}
|
||||
|
||||
pre,
|
||||
pre > code
|
||||
{
|
||||
|
|
|
@ -21,11 +21,11 @@
|
|||
|
||||
|
||||
The `polyfit` function is a general purpose curve fitter used to model
|
||||
the *non-linear* relationship between two random variables.
|
||||
the non-linear relationship between two random variables.
|
||||
|
||||
The `polyfit` function is passed *x* and *y* axises and fits a smooth curve to the data.
|
||||
If only a single array is provided it is treated as the *y* axis and a sequence is generated
|
||||
for the *x* axis.
|
||||
The `polyfit` function is passed x- and y-axes and fits a smooth curve to the data.
|
||||
If only a single array is provided it is treated as the y-axis and a sequence is generated
|
||||
for the x-axis.
|
||||
|
||||
The `polyfit` function also has a parameter that specifies the degree of the polynomial. The higher
|
||||
the degree the more curves that can be modeled.
|
||||
|
@ -34,7 +34,7 @@ The example below uses the `polyfit` function to fit a curve to an array using
|
|||
a 3 degree polynomial. The fitted curve is then subtracted from the original curve. The output
|
||||
shows the error between the fitted curve and the original curve, known as the residuals.
|
||||
The output also includes the sum-of-squares of the residuals which provides a measure
|
||||
of how large the error is..
|
||||
of how large the error is.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -45,7 +45,7 @@ let(echo="residuals, sumSqError",
|
|||
sumSqError=sumSq(residuals))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -95,7 +95,7 @@ let(echo="residuals, sumSqError",
|
|||
sumSqError=sumSq(residuals))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -138,10 +138,10 @@ responds with:
|
|||
The `polyfit` function returns a function that can be used with the `predict`
|
||||
function.
|
||||
|
||||
In the example below the x axis is included for clarity.
|
||||
In the example below the x-axis is included for clarity.
|
||||
The `polyfit` function returns a function for the fitted curve.
|
||||
The `predict` function is then used to predict a value along the curve, in this
|
||||
case the prediction is made for the *x* value of 5.
|
||||
case the prediction is made for the *`x`* value of 5.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -151,7 +151,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
|
|||
p=predict(curve, 5))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -185,7 +185,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
|
|||
d=derivative(curve))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -235,7 +235,7 @@ let(x=array(0,1,2,3,4,5,6,7,8,9, 10),
|
|||
f=gaussfit(x, y))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -283,7 +283,7 @@ let(x=array(0,1,2,3,4,5,6,7,8,9, 10),
|
|||
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
|
|
@ -30,17 +30,17 @@ the more advanced DSP functions, its useful to get a better understanding of how
|
|||
The `dotProduct` function can be used to combine two arrays into a single product. A simple example can help
|
||||
illustrate this concept.
|
||||
|
||||
In the example below two arrays are set to variables *a* and *b* and then operated on by the `dotProduct` function.
|
||||
The output of the `dotProduct` function is set to variable *c*.
|
||||
In the example below two arrays are set to variables *`a`* and *`b`* and then operated on by the `dotProduct` function.
|
||||
The output of the `dotProduct` function is set to variable *`c`*.
|
||||
|
||||
Then the `mean` function is then used to compute the mean of the first array which is set to the variable `d`.
|
||||
The `mean` function is then used to compute the mean of the first array, which is set to the variable *`d`*.
|
||||
|
||||
Both the *dot product* and the *mean* are included in the output.
|
||||
Both the dot product and the mean are included in the output.
|
||||
|
||||
When we look at the output of this expression we see that the *dot product* and the *mean* of the first array
|
||||
When we look at the output of this expression we see that the dot product and the mean of the first array
|
||||
are both 30.
|
||||
|
||||
The dot product function *calculated the mean* of the first array.
|
||||
The `dotProduct` function calculated the mean of the first array.
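The same relationship can be checked outside Solr with a minimal Python sketch. The input values are assumed for illustration (this diff does not show the arrays used in the expression), and `dot_product` is a hypothetical helper that only mimics the behavior described for `dotProduct`.

```python
import math


def dot_product(a, b):
    """Return the sum of the element-by-element products of two lists."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


# Assumed example data: five values and five equal weights of 1/5.
a = [10, 20, 30, 40, 50]
b = [0.2] * len(a)

c = dot_product(a, b)  # weighted sum where every weight is 0.2
d = sum(a) / len(a)    # ordinary arithmetic mean

# Compare with a tolerance because 0.2 has no exact binary representation.
print(math.isclose(c, d))
```

Multiplying every element by 1/n and summing is the definition of the mean, which is why the two results agree.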
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -51,7 +51,7 @@ let(echo="c, d",
|
|||
d=mean(a))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -76,9 +76,9 @@ calculation using vector math and look at the output of each step.
|
|||
|
||||
In the example below the `ebeMultiply` function performs an element-by-element multiplication of
|
||||
two arrays. This is the first step of the dot product calculation. The result of the element-by-element
|
||||
multiplication is assigned to variable *c*.
|
||||
multiplication is assigned to variable *`c`*.
|
||||
|
||||
In the next step the `add` function adds all the elements of the array in variable *c*.
|
||||
In the next step the `add` function adds all the elements of the array in variable *`c`*.
|
||||
|
||||
Notice that multiplying each element of the first array by .2 and then adding the results is
|
||||
equivalent to the formula for computing the mean of the first array. The formula for computing the mean
|
||||
|
@ -95,7 +95,7 @@ let(echo="c, d",
|
|||
d=add(c))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -122,11 +122,13 @@ When this expression is sent to the /stream handler it responds with:
|
|||
----
|
||||
|
||||
In the example above two arrays were combined in a way that produced the mean of the first. In the second array
|
||||
each value was set to .2. Another way of looking at this is that each value in the second array has the same weight.
|
||||
By varying the weights in the second array we can produce a different result. For example if the first array represents a time series,
|
||||
each value was set to ".2". Another way of looking at this is that each value in the second array has the same weight.
|
||||
By varying the weights in the second array we can produce a different result.
|
||||
For example if the first array represents a time series,
|
||||
the weights in the second array can be set to add more weight to a particular element in the first array.
|
||||
|
||||
The example below creates a weighted average with the weight decreasing from right to left. Notice that the weighted mean
|
||||
The example below creates a weighted average with the weight decreasing from right to left.
|
||||
Notice that the weighted mean
|
||||
of 36.666 is larger than the previous mean which was 30. This is because more weight was given to the last element in the
|
||||
array.
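A weighted mean of this kind can be sketched in Python as follows. The values and weights are assumptions chosen so that the weights sum to 1 and grow from left to right. They are not taken from the expression in this commit, and `weighted_sum` is a hypothetical helper.

```python
import math


def weighted_sum(values, weights):
    """Dot product of values and weights."""
    return sum(v * w for v, w in zip(values, weights))


# Assumed time series and linearly increasing weights 1/15, 2/15, ..., 5/15.
values = [10, 20, 30, 40, 50]
n = len(values)
total = n * (n + 1) // 2
weights = [(i + 1) / total for i in range(n)]

weighted_mean = weighted_sum(values, weights)
plain_mean = sum(values) / n

# The later (larger) values carry more weight, so the weighted mean is higher.
print(weighted_mean > plain_mean)
```

Because the weights sum to 1, the result stays on the same scale as the input values.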
|
||||
|
||||
|
@ -139,7 +141,7 @@ let(echo="c, d",
|
|||
d=add(c))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -167,13 +169,13 @@ When this expression is sent to the /stream handler it responds with:
|
|||
|
||||
=== Representing Correlation
|
||||
|
||||
Often when we think of correlation, we are thinking of *Pearsons* correlation in the field of statistics. But the definition of
|
||||
Often when we think of correlation, we are thinking of _Pearson correlation_ in the field of statistics. But the definition of
|
||||
correlation is actually more general: a mutual relationship or connection between two or more things.
|
||||
In the field of digital signal processing the dot product is used to represent correlation. The examples below demonstrate
|
||||
how the dot product can be used to represent correlation.
|
||||
|
||||
In the example below the dot product is computed for two vectors. Notice that the vectors have different values that fluctuate
|
||||
together. The output of the dot product is 190, which is hard to reason about because because its not scaled.
|
||||
together. The output of the dot product is 190, which is hard to reason about because it's not scaled.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -183,7 +185,7 @@ let(echo="c, d",
|
|||
c=dotProduct(a, b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -206,9 +208,9 @@ One approach to scaling the dot product is to first scale the vectors so that bo
|
|||
magnitude of 1, also called unit vectors, are used when comparing only the angle between vectors rather than the magnitude.
|
||||
The `unitize` function can be used to unitize the vectors before calculating the dot product.
|
||||
|
||||
Notice in the example below the dot product result, set to variable *e*, is effectively 1. When applied to unit vectors the dot product
|
||||
will be scaled between 1 and -1. Also notice in the example `cosineSimilarity` is calculated on the *unscaled* vectors and the
|
||||
answer is also effectively 1. This is because *cosine similarity* is a scaled *dot product*.
|
||||
Notice in the example below the dot product result, set to variable *`e`*, is effectively 1. When applied to unit vectors the dot product
|
||||
will be scaled between 1 and -1. Also notice in the example `cosineSimilarity` is calculated on the unscaled vectors and the
|
||||
answer is also effectively 1. This is because cosine similarity is a scaled dot product.
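The equivalence between the dot product of unit vectors and cosine similarity can be illustrated with a short Python sketch. The vectors are assumed for illustration, and the helpers `unitize` and `cosine_similarity` are hypothetical stand-ins written for this sketch, not the Solr implementations.

```python
import math


def dot_product(a, b):
    """Sum of the element-by-element products of two lists."""
    return sum(x * y for x, y in zip(a, b))


def unitize(v):
    """Scale a vector to magnitude 1."""
    norm = math.sqrt(dot_product(v, v))
    if norm == 0:
        raise ValueError("cannot unitize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


# Assumed vectors: different magnitudes, same direction.
a = [2, 3, 4, 5, 6]
b = [4, 6, 8, 10, 12]

e = dot_product(unitize(a), unitize(b))  # scaled dot product
f = cosine_similarity(a, b)              # computed on the unscaled vectors

print(math.isclose(e, f))
```

Vectors that point in opposite directions give a value of -1 with the same helpers.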
|
||||
|
||||
|
||||
[source,text]
|
||||
|
@ -222,7 +224,7 @@ let(echo="e, f",
|
|||
f=cosineSimilarity(a, b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -254,7 +256,7 @@ let(echo="c, d",
|
|||
c=cosineSimilarity(a, b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -275,10 +277,10 @@ When this expression is sent to the /stream handler it responds with:
|
|||
|
||||
== Convolution
|
||||
|
||||
The `conv` function calculates the convolution of two vectors. The convolution is calculated by *reversing*
|
||||
the second vector and sliding it across the first vector. The *dot product* of the two vectors
|
||||
The `conv` function calculates the convolution of two vectors. The convolution is calculated by reversing
|
||||
the second vector and sliding it across the first vector. The dot product of the two vectors
|
||||
is calculated at each point as the second vector is slid across the first vector.
|
||||
The dot products are collected in a *third vector* which is the *convolution* of the two vectors.
|
||||
The dot products are collected in a third vector which is the convolution of the two vectors.
|
||||
|
||||
=== Moving Average Function
|
||||
|
||||
|
@ -290,7 +292,7 @@ is syntactic sugar for convolution.
|
|||
Below is an example of a moving average with a window size of 5. Notice that the original vector has 13 elements
|
||||
but the result of the moving average has only 9 elements. This is because the `movingAvg` function
|
||||
only begins generating results when it has a full window. In this case, because the window size is 5, the
|
||||
moving average starts generating results from the 4th index of the original array.
|
||||
moving average starts generating results from the 4^th^ index of the original array.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -298,7 +300,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
|
|||
b=movingAvg(a, 5))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -344,7 +346,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
|
|||
c=conv(a, b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -381,7 +383,7 @@ When this expression is sent to the /stream handler it responds with:
|
|||
}
|
||||
----
|
||||
|
||||
We achieve the same result as the `movingAvg` gunction by using the `copyOfRange` function to copy a range of
|
||||
We achieve the same result as the `movingAvg` function by using the `copyOfRange` function to copy a range of
|
||||
the result that drops the first and last 4 values of
|
||||
the convolution result. In the example below the `precision` function is also used to remove floating point errors from the
|
||||
convolution result. When this is added the output is exactly the same as the `movingAvg` function.
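The relationship between convolution and the moving average can be sketched in plain Python. The input array matches the one shown in the expressions, but the filter of five 0.2 values and the slice bounds are assumptions made for this sketch. `conv` and `moving_avg` are hypothetical helpers, not the Solr functions.

```python
def conv(a, b):
    """Full discrete convolution: out[n] is the sum of a[i] * b[n - i]."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def moving_avg(a, window):
    """Mean of each full window; produces len(a) - window + 1 values."""
    return [sum(a[i:i + window]) / window for i in range(len(a) - window + 1)]


a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [0.2] * 5                    # assumed filter: equal weights of 1/5

c = conv(a, b)                   # 13 + 5 - 1 = 17 values
d = c[4:len(c) - 4]              # drop the first and last 4 partial windows
e = [round(x, 2) for x in d]     # remove floating point noise
m = [round(x, 2) for x in moving_avg(a, 5)]

print(e == m)
```

The filter is symmetric, so reversing it before sliding it across the first vector does not change the result in this case.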
|
||||
|
@ -395,7 +397,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
|
|||
e=precision(d, 2))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -446,7 +448,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
|
|||
c=conv(a, rev(b)))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -504,7 +506,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
|
|||
c=finddelay(a, b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
|
|
@ -26,13 +26,12 @@ Before performing machine learning operations its often necessary to
|
|||
scale the feature vectors so they can be compared at the same scale.
|
||||
|
||||
All the scaling functions operate on vectors and matrices.
|
||||
When operating on a matrix the *rows* of the matrix are scaled.
|
||||
When operating on a matrix the rows of the matrix are scaled.
|
||||
|
||||
=== Min/Max Scaling
|
||||
|
||||
The `minMaxScale` function scales a vector or matrix between a min and
|
||||
max value. By default it will scale between 0 and 1 if min/max values
|
||||
are not provided.
|
||||
The `minMaxScale` function scales a vector or matrix between a minimum and maximum value.
|
||||
By default it will scale between 0 and 1 if min/max values are not provided.
|
||||
|
||||
Below is a simple example of min/max scaling between 0 and 1.
|
||||
Notice that once brought into the same scale the vectors are the same.
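As a rough illustration of why two vectors with the same shape become identical after scaling, here is a minimal Python sketch. The vectors are assumed, and `min_max_scale` is a hypothetical helper that follows the usual min/max formula. It is not the Solr implementation.

```python
import math


def min_max_scale(v, lo=0.0, hi=1.0):
    """Linearly map the values of v onto the range [lo, hi]."""
    vmin, vmax = min(v), max(v)
    if vmin == vmax:
        raise ValueError("cannot scale a constant vector")
    return [lo + (x - vmin) * (hi - lo) / (vmax - vmin) for x in v]


# Assumed vectors: b is a multiplied by 10.
a = [20, 30, 40, 50]
b = [200, 300, 400, 500]

scaled_a = min_max_scale(a)
scaled_b = min_max_scale(b)

# Both vectors map onto the same points between 0 and 1.
print(all(math.isclose(x, y) for x, y in zip(scaled_a, scaled_b)))
```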
|
||||
|
@ -79,10 +78,10 @@ This expression returns the following response:
|
|||
|
||||
=== Standardization
|
||||
|
||||
The `standardize` function scales a vector so that it has a
|
||||
mean of 0 and a standard deviation of 1. Standardization can be
|
||||
used with machine learning algorithms, such as SVM, that
|
||||
perform better when the data has a normal distribution.
|
||||
The `standardize` function scales a vector so that it has a mean of 0 and a standard deviation of 1.
|
||||
Standardization can be used with machine learning algorithms, such as
|
||||
https://en.wikipedia.org/wiki/Support_vector_machine[Support Vector Machine (SVM)], that perform better
|
||||
when the data has a normal distribution.
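A minimal Python sketch of standardization is shown here. It assumes the sample standard deviation (n - 1 denominator); whether `standardize` uses the sample or the population form is not stated in this text, so treat that choice as an assumption. The input values are also assumed.

```python
import math
import statistics


def standardize(v):
    """Shift to mean 0 and scale to standard deviation 1 (sample form)."""
    mean = statistics.mean(v)
    sd = statistics.stdev(v)  # assumption: sample standard deviation
    if sd == 0:
        raise ValueError("cannot standardize a constant vector")
    return [(x - mean) / sd for x in v]


values = [2, 4, 4, 4, 5, 5, 7, 9]  # assumed example data
z = standardize(values)

print(math.isclose(statistics.mean(z), 0.0, abs_tol=1e-12))
print(math.isclose(statistics.stdev(z), 1.0))
```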
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -127,8 +126,7 @@ This expression returns the following response:
|
|||
=== Unit Vectors
|
||||
|
||||
The `unitize` function scales vectors to a magnitude of 1. A vector with a
|
||||
magnitude of 1 is known as a unit vector. Unit vectors are
|
||||
preferred when the vector math deals
|
||||
magnitude of 1 is known as a unit vector. Unit vectors are preferred when the vector math deals
|
||||
with vector direction rather than magnitude.
|
||||
|
||||
[source,text]
|
||||
|
@ -173,24 +171,20 @@ This expression returns the following response:
|
|||
|
||||
== Distance and Distance Measures
|
||||
|
||||
The `distance` function computes the distance for two
|
||||
numeric arrays or a *distance matrix* for the columns of a matrix.
|
||||
The `distance` function computes the distance for two numeric arrays or a distance matrix for the columns of a matrix.
|
||||
|
||||
There are four distance measure functions that return a function
|
||||
that performs the actual distance calculation:
|
||||
There are five distance measure functions that return a function that performs the actual distance calculation:
|
||||
|
||||
* euclidean (default)
|
||||
* manhattan
|
||||
* canberra
|
||||
* earthMovers
|
||||
* haversineMeters (Geospatial distance measure)
|
||||
* `euclidean` (default)
|
||||
* `manhattan`
|
||||
* `canberra`
|
||||
* `earthMovers`
|
||||
* `haversineMeters` (Geospatial distance measure)
|
||||
|
||||
The distance measure functions can be used with all machine learning functions
|
||||
that support different distance measures.
|
||||
|
||||
Below is an example for computing euclidean distance for
|
||||
two numeric arrays:
|
||||
that support distance measures.
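For readers unfamiliar with these measures, three of them can be sketched in Python using their textbook definitions. The function names mirror the list above only for readability, the `distance` wrapper is hypothetical, and the treatment of 0/0 terms in the Canberra distance is an assumption of this sketch.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    """Sum of |x - y| / (|x| + |y|); terms where both values are 0 are skipped."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom != 0:
            total += abs(x - y) / denom
    return total


def distance(a, b, measure=euclidean):
    """Apply a distance measure; Euclidean is the default, as in the list above."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return measure(a, b)


# Assumed example vectors.
a = [1, 2, 3]
b = [4, 6, 3]

print(distance(a, b))
print(distance(a, b, manhattan))
print(distance(a, b, canberra))
```

Passing the measure as a function keeps the calling code unchanged when the measure is swapped.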
|
||||
|
||||
Below is an example for computing Euclidean distance for two numeric arrays:
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -294,48 +288,46 @@ This expression returns the following response:
|
|||
}
|
||||
----
|
||||
|
||||
== K-means Clustering
|
||||
== K-Means Clustering
|
||||
|
||||
The `kmeans` function performs k-means clustering of the rows of a matrix.
|
||||
Once the clustering has been completed there are a number of useful functions available
|
||||
for examining the *clusters* and *centroids*.
|
||||
for examining the clusters and centroids.
|
||||
|
||||
The examples below are clustering *term vectors*.
|
||||
The chapter on <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> should be
|
||||
consulted for a full explanation of these features.
|
||||
The examples below cluster _term vectors_.
|
||||
The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> offers
|
||||
a full explanation of these features.
|
||||
|
||||
=== Centroid Features
|
||||
|
||||
In the example below the `kmeans` function is used to cluster a result set from the Enron email data-set
|
||||
and then the top features are extracted from the cluster centroids.
|
||||
|
||||
Let's look at what data is assigned to each variable:
|
||||
|
||||
* *a*: The `random` function returns a sample of 500 documents from the *enron*
|
||||
collection that match the query *body:oil*. The `select` function selects the *id* and
|
||||
and annotates each tuple with the analyzed bigram terms from the body field.
|
||||
|
||||
* *b*: The `termVectors` function creates a TF-IDF term vector matrix from the
|
||||
tuples stored in variable *a*. Each row in the matrix represents a document. The columns of the matrix
|
||||
are the bigram terms that were attached to each tuple.
|
||||
* *c*: The `kmeans` function clusters the rows of the matrix into 5 clusters. The k-means clustering is performed using the
|
||||
*Euclidean distance* measure.
|
||||
* *d*: The `getCentroids` function returns a matrix of cluster centroids. Each row in the matrix is a centroid
|
||||
from one of the 5 clusters. The columns of the matrix are the same bigrams terms of the term vector matrix.
|
||||
* *e*: The `topFeatures` function returns the column labels for the top 5 features of each centroid in the matrix.
|
||||
This returns the top 5 bigram terms for each centroid.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(a=select(random(enron, q="body:oil", rows="500", fl="id, body"),
|
||||
let(a=select(random(enron, q="body:oil", rows="500", fl="id, body"), <1>
|
||||
id,
|
||||
analyze(body, body_bigram) as terms),
|
||||
b=termVectors(a, maxDocFreq=.10, minDocFreq=.05, minTermLength=14, exclude="_,copyright"),
|
||||
c=kmeans(b, 5),
|
||||
d=getCentroids(c),
|
||||
e=topFeatures(d, 5))
|
||||
b=termVectors(a, maxDocFreq=.10, minDocFreq=.05, minTermLength=14, exclude="_,copyright"),<2>
|
||||
c=kmeans(b, 5), <3>
|
||||
d=getCentroids(c), <4>
|
||||
e=topFeatures(d, 5)) <5>
|
||||
----
|
||||
|
||||
Let's look at what data is assigned to each variable:
|
||||
|
||||
<1> *`a`*: The `random` function returns a sample of 500 documents from the "enron"
|
||||
collection that match the query "body:oil". The `select` function selects the `id`
|
||||
and annotates each tuple with the analyzed bigram terms from the `body` field.
|
||||
<2> *`b`*: The `termVectors` function creates a TF-IDF term vector matrix from the
|
||||
tuples stored in variable *`a`*. Each row in the matrix represents a document. The columns of the matrix
|
||||
are the bigram terms that were attached to each tuple.
|
||||
<3> *`c`*: The `kmeans` function clusters the rows of the matrix into 5 clusters. The k-means clustering is performed using the Euclidean distance measure.
|
||||
<4> *`d`*: The `getCentroids` function returns a matrix of cluster centroids. Each row in the matrix is a centroid
|
||||
from one of the 5 clusters. The columns of the matrix are the same bigram terms of the term vector matrix.
|
||||
<5> *`e`*: The `topFeatures` function returns the column labels for the top 5 features of each centroid in the matrix.
|
||||
This returns the top 5 bigram terms for each centroid.
|
||||
|
||||
This expression returns the following response:
|
||||
|
||||
[source,json]
|
||||
|
@ -396,12 +388,6 @@ This expression returns the following response:
|
|||
The example below examines the top features of a specific cluster. This example uses the same techniques
|
||||
as the centroids example but the top features are extracted from a cluster rather than the centroids.
|
||||
|
||||
The `getCluster` function returns a cluster by its index. Each cluster is a matrix containing term vectors
|
||||
that have been clustered together based on their features.
|
||||
|
||||
In the example below the `topFeatures` function is used to extract the top 4 features from each term vector
|
||||
in the cluster.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
|
||||
|
@ -409,10 +395,15 @@ let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
|
|||
analyze(body, body_bigram) as terms),
|
||||
b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
|
||||
c=kmeans(b, 25),
|
||||
d=getCluster(c, 0),
|
||||
e=topFeatures(d, 4))
|
||||
d=getCluster(c, 0), <1>
|
||||
e=topFeatures(d, 4)) <2>
|
||||
----
|
||||
|
||||
<1> The `getCluster` function returns a cluster by its index. Each cluster is a matrix containing term vectors
|
||||
that have been clustered together based on their features.
|
||||
<2> The `topFeatures` function is used to extract the top 4 features from each term vector
|
||||
in the cluster.
|
||||
|
||||
This expression returns the following response:
|
||||
|
||||
[source,json]
|
||||
|
@ -489,19 +480,17 @@ This expression returns the following response:
|
|||
}
|
||||
----
|
||||
|
||||
== Multi K-means Clustering
|
||||
== Multi K-Means Clustering
|
||||
|
||||
K-means clustering will be produce different results depending on
|
||||
K-means clustering will produce different results depending on
|
||||
the initial placement of the centroids. K-means is fast enough
|
||||
that multiple trials can be performed and the best outcome selected.
|
||||
The `multiKmeans` function runs the K-means
|
||||
clustering algorithm for a gven number of trials and selects the
|
||||
best result based on which trial produces the lowest intra-cluster
|
||||
variance.
|
||||
|
||||
The example below is identical to centroids example except that
|
||||
it uses `multiKmeans` with 100 trials, rather then a single
|
||||
trial of the `kmeans` function.
|
||||
The `multiKmeans` function runs the k-means clustering algorithm for a given number of trials and selects the
|
||||
best result based on which trial produces the lowest intra-cluster variance.
|
||||
|
||||
The example below is identical to the centroids example except that it uses `multiKmeans` with 100 trials,
|
||||
rather than a single trial of the `kmeans` function.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -569,10 +558,10 @@ This expression returns the following response:
|
|||
}
|
||||
----
|
||||
|
||||
== Fuzzy K-means Clustering
|
||||
== Fuzzy K-Means Clustering
|
||||
|
||||
The `fuzzyKmeans` function is a soft clustering algorithm which
|
||||
allows vectors to be assigned to more then one cluster. The *fuzziness* parameter
|
||||
allows vectors to be assigned to more than one cluster. The `fuzziness` parameter
|
||||
is a value between 1 and 2 that determines how fuzzy to make the cluster assignment.
|
||||
|
||||
After the clustering has been performed the `getMembershipMatrix` function can be called
|
||||
|
@ -585,27 +574,26 @@ A simple example will make this more clear. In the example below 300 documents a
|
|||
then turned into a term vector matrix. Then the `fuzzyKmeans` function clusters the
|
||||
term vectors into 12 clusters with a fuzziness factor of 1.25.
|
||||
|
||||
The `getMembershipMatrix` function is used to return the membership matrix and the first row
|
||||
of membership matrix is retrieved with the `rowAt` function. The `precision` function is then applied to the first row
|
||||
of the matrix to make it easier to read.
|
||||
|
||||
The output shows a single vector representing the cluster membership probabilities for the first
|
||||
term vector. Notice that the term vector has the highest association with the 12th cluster,
|
||||
but also has significant associations with the 3rd, 5th, 6th and 7th clusters.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
et(a=select(random(collection3, q="body:oil", rows="300", fl="id, body"),
|
||||
let(a=select(random(collection3, q="body:oil", rows="300", fl="id, body"),
|
||||
id,
|
||||
analyze(body, body_bigram) as terms),
|
||||
b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
|
||||
c=fuzzyKmeans(b, 12, fuzziness=1.25),
|
||||
d=getMembershipMatrix(c),
|
||||
e=rowAt(d, 0),
|
||||
f=precision(e, 5))
|
||||
d=getMembershipMatrix(c), <1>
|
||||
e=rowAt(d, 0), <2>
|
||||
f=precision(e, 5)) <3>
|
||||
----
|
||||
|
||||
This expression returns the following response:
|
||||
<1> The `getMembershipMatrix` function is used to return the membership matrix;
|
||||
<2> and the first row of the membership matrix is retrieved with the `rowAt` function.
|
||||
<3> The `precision` function is then applied to the first row
|
||||
of the matrix to make it easier to read.
|
||||
|
||||
This expression returns a single vector representing the cluster membership probabilities for the first
|
||||
term vector. Notice that the term vector has the highest association with the 12^th^ cluster,
|
||||
but also has significant associations with the 3^rd^, 5^th^, 6^th^ and 7^th^ clusters:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -637,30 +625,21 @@ This expression returns the following response:
|
|||
}
|
||||
----
|
||||
|
||||
== K-nearest Neighbor (KNN)
|
||||
== K-Nearest Neighbor (KNN)
|
||||
|
||||
The `knn` function searches the rows of a matrix for the
|
||||
K-nearest neighbors of a search vector. The `knn` function
|
||||
returns a *matrix* of the K-nearest neighbors. The `knn` function
|
||||
supports changing of the distance measure by providing one of the
|
||||
four distance measure functions as the fourth parameter:
|
||||
k-nearest neighbors of a search vector. The `knn` function
|
||||
returns a matrix of the k-nearest neighbors.
|
||||
|
||||
* euclidean (Default)
|
||||
* manhattan
|
||||
* canberra
|
||||
* earthMovers
|
||||
The `knn` function supports changing of the distance measure by providing one of these
|
||||
distance measure functions as the fourth parameter:
|
||||
|
||||
The example below builds on the clustering examples to demonstrate
|
||||
the `knn` function.
|
||||
* `euclidean` (Default)
|
||||
* `manhattan`
|
||||
* `canberra`
|
||||
* `earthMovers`
|
||||
|
||||
In the example, the centroids matrix is set to variable *d*. The first
|
||||
centroid vector is selected from the matrix with the `rowAt` function.
|
||||
Then the `knn` function is used to find the 3 nearest neighbors
|
||||
to the centroid vector in the term vector matrix (variable b).
|
||||
|
||||
The `knn` function returns a matrix with the 3 nearest neighbors based on the
|
||||
default distance measure which is euclidean. Finally, the top 4 features
|
||||
of the term vectors in the nearest neighbor matrix are returned.
|
||||
The example below builds on the clustering examples to demonstrate the `knn` function.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -669,13 +648,21 @@ let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
|
|||
analyze(body, body_bigram) as terms),
|
||||
b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
|
||||
c=multiKmeans(b, 5, 100),
|
||||
d=getCentroids(c),
|
||||
e=rowAt(d, 0),
|
||||
g=knn(b, e, 3),
|
||||
h=topFeatures(g, 4))
|
||||
d=getCentroids(c), <1>
|
||||
e=rowAt(d, 0), <2>
|
||||
g=knn(b, e, 3), <3>
|
||||
h=topFeatures(g, 4)) <4>
|
||||
----
|
||||
|
||||
This expression returns the following response:
|
||||
<1> In the example, the centroids matrix is set to variable *`d`*.
|
||||
<2> The first centroid vector is selected from the matrix with the `rowAt` function.
|
||||
<3> Then the `knn` function is used to find the 3 nearest neighbors
|
||||
to the centroid vector in the term vector matrix (variable *`b`*).
|
||||
<4> The `topFeatures` function is used to request the top 4 features of the term vectors in the knn matrix.
|
||||
|
||||
The `knn` function returns a matrix with the 3 nearest neighbors based on the
|
||||
default distance measure which is Euclidean. Finally, the top 4 features
|
||||
of the term vectors in the nearest neighbor matrix are returned:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -713,20 +700,18 @@ This expression returns the following response:
|
|||
}
|
||||
----
|
||||
|
||||
== KNN Regression
|
||||
== K-Nearest Neighbor Regression
|
||||
|
||||
KNN regression is a non-linear, multi-variate regression method. Knn regression is a lazy learning
|
||||
K-nearest neighbor regression is a non-linear, multi-variate regression method. KNN regression is a lazy learning
|
||||
technique which means it does not fit a model to the training set in advance. Instead the
|
||||
entire training set of observations and outcomes is held in memory and predictions are made
|
||||
by averaging the outcomes of the k-nearest neighbors.
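
The averaging step can be illustrated outside Solr with a minimal Python sketch. The function name `knn_regress_predict` and the sample data are hypothetical, and this is not the `knnRegress` implementation; it only assumes the Euclidean distance and mean-of-outcomes behavior described above.

```python
import math

def knn_regress_predict(observations, outcomes, query, k):
    """Predict an outcome by averaging the outcomes of the k nearest observations."""
    # Rank every training observation by Euclidean distance to the query point.
    ranked = sorted(
        range(len(observations)),
        key=lambda i: math.dist(observations[i], query),
    )
    nearest = ranked[:k]
    # No model is fitted in advance: the whole training set is consulted per query.
    return sum(outcomes[i] for i in nearest) / k

# Hypothetical training set: four observations with two features each.
observations = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (10.0, 10.0)]
outcomes = [10.0, 20.0, 30.0, 100.0]

prediction = knn_regress_predict(observations, outcomes, (2.0, 2.0), k=3)
print(prediction)  # 20.0
```

The distant observation `(10.0, 10.0)` never enters the average because only the three closest points are used.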
|
||||
|
||||
The `knnRegress` function prepares the training set for use with the `predict` function.
|
||||
|
||||
Below is an example of the `knnRegress` function. In this example 10000 random samples
|
||||
are taken each containing the variables *filesize_d*, *service_d* and *response_d*. The pairs of
|
||||
*filesize_d* and *service_d* will be used to predict the value of *response_d*.
|
||||
|
||||
Notice that `knnRegress` returns a tuple describing the regression inputs.
|
||||
Below is an example of the `knnRegress` function. In this example 10,000 random samples
|
||||
are taken, each containing the variables `filesize_d`, `service_d` and `response_d`. The pairs of
|
||||
`filesize_d` and `service_d` will be used to predict the value of `response_d`.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -738,7 +723,7 @@ let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d
|
|||
lazyModel=knnRegress(observations, outcomes , 5))
|
||||
----
|
||||
|
||||
This expression returns the following response:
|
||||
This expression returns the following response. Notice that `knnRegress` returns a tuple describing the regression inputs:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -767,6 +752,7 @@ This expression returns the following response:
|
|||
=== Prediction and Residuals
|
||||
|
||||
The output of `knnRegress` can be used with the `predict` function like other regression models.
|
||||
|
||||
In the example below the `predict` function is used to predict results for the original training
|
||||
data. The sumSq of the residuals is then calculated.
|
||||
|
||||
|
@ -806,14 +792,15 @@ This expression returns the following response:
|
|||
|
||||
If the features in the observation matrix are not in the same scale then the larger features
|
||||
will carry more weight in the distance calculation than the smaller features. This can greatly
|
||||
impact the accuracy of the prediction. The `knnRegress` function has a *scale* parameter which
|
||||
can be set to *true* to automatically scale the features in the same range.
|
||||
impact the accuracy of the prediction. The `knnRegress` function has a `scale` parameter which
|
||||
can be set to `true` to automatically scale the features in the same range.
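
A small Python sketch shows why scale matters for the neighbor search. It assumes min-max scaling into the range 0 to 1, which may differ from what `knnRegress` does internally; the helper names and the data are hypothetical.

```python
import math

def min_max_scale(rows):
    """Rescale each column of a row-oriented matrix into the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]

def nearest_other(points, index):
    """Return the index of the point closest to points[index], excluding itself."""
    others = [i for i in range(len(points)) if i != index]
    return min(others, key=lambda i: math.dist(points[i], points[index]))

# Hypothetical (filesize, service) pairs: the first feature is far larger.
rows = [(10000.0, 1.0), (10400.0, 9.0), (13000.0, 1.5), (20000.0, 5.0)]

unscaled_neighbor = nearest_other(rows, 0)
scaled_neighbor = nearest_other(min_max_scale(rows), 0)
print(unscaled_neighbor, scaled_neighbor)  # 1 2
```

Without scaling, the large first feature decides the nearest neighbor on its own. After scaling, the big difference in the second feature counts too, and a different neighbor is chosen.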
|
||||
|
||||
The example below shows `knnRegress` with feature scaling turned on.
|
||||
Notice that when feature scaling is turned on the sumSqErr in the output is much lower.
|
||||
|
||||
Notice that when feature scaling is turned on the `sumSqErr` in the output is much lower.
|
||||
This shows how much more accurate the predictions are when feature scaling is turned on in
|
||||
this particular example. This is because the *filesize_d* feature is significantly larger then
|
||||
the *service_d* feature.
|
||||
this particular example. This is because the `filesize_d` feature is significantly larger than
|
||||
the `service_d` feature.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -850,16 +837,15 @@ This expression returns the following response:
|
|||
|
||||
=== Setting Robust Regression
|
||||
|
||||
The default prediction approach is to take the *mean* of the outcomes of the k-nearest
|
||||
neighbors. If the outcomes contain outliers the *mean* value can be skewed. Setting
|
||||
the *robust* parameter to true will take the *median* outcome of the k-nearest neighbors.
|
||||
The default prediction approach is to take the mean of the outcomes of the k-nearest
|
||||
neighbors. If the outcomes contain outliers the mean value can be skewed. Setting
|
||||
the `robust` parameter to `true` will take the median outcome of the k-nearest neighbors.
|
||||
This provides a regression prediction that is robust to outliers.
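
The effect of an outlier on the two approaches can be seen in a short Python sketch. The neighbor outcomes are made-up values, not output from Solr.

```python
from statistics import mean, median

# Hypothetical outcomes of the 5 nearest neighbors; the last one is an outlier.
neighbor_outcomes = [10.0, 11.0, 12.0, 13.0, 500.0]

mean_prediction = mean(neighbor_outcomes)      # pulled upward by the outlier
median_prediction = median(neighbor_outcomes)  # unaffected by the outlier

print(mean_prediction, median_prediction)  # 109.2 12.0
```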
|
||||
|
||||
|
||||
=== Setting the Distance Measure
|
||||
|
||||
The distance measure can be changed for the k-nearest neighbor search by adding a distance measure
|
||||
function to the `knnRegress` parameters. Below is an example using manhattan distance.
|
||||
function to the `knnRegress` parameters. Below is an example using `manhattan` distance.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -892,10 +878,3 @@ This expression returns the following response:
|
|||
}
|
||||
}
|
||||
----
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
|
|
@ -35,7 +35,7 @@ matrix(array(1, 2),
|
|||
array(4, 5))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -80,7 +80,7 @@ let(a=array(1, 2),
|
|||
d=colAt(c, 1))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -129,7 +129,7 @@ let(echo="d, e",
|
|||
e=getColumnLabels(c))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -182,7 +182,7 @@ let(echo="b,c",
|
|||
c=columnCount(a))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -217,7 +217,7 @@ let(a=matrix(array(1, 2),
|
|||
b=transpose(a))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -259,7 +259,7 @@ let(a=matrix(array(1, 2, 3),
|
|||
b=sumRows(a))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -292,7 +292,7 @@ let(a=matrix(array(1, 2, 3),
|
|||
b=grandSum(a))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -326,7 +326,7 @@ let(a=matrix(array(1, 2),
|
|||
b=scalarAdd(10, a))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -370,7 +370,7 @@ let(a=matrix(array(1, 2),
|
|||
b=ebeAdd(a, a))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -413,7 +413,7 @@ let(a=matrix(array(1, 2),
|
|||
c=matrixMult(a, b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
|
|
@ -16,21 +16,20 @@
|
|||
// specific language governing permissions and limitations
|
||||
// under the License.
|
||||
|
||||
This section of the math expression user guide covers *interpolation*, *derivatives* and *integrals*.
|
||||
These three interrelated topics are part of the field of mathematics called *numerical analysis*.
|
||||
Interpolation, derivatives and integrals are three interrelated topics which are part of the field of mathematics called numerical analysis. This section explores the math expressions available for numerical analysis.
|
||||
|
||||
== Interpolation
|
||||
|
||||
Interpolation is used to construct new data points between a set of known control points.
|
||||
The ability to *predict* new data points allows for *sampling* along the curve defined by the
|
||||
The ability to predict new data points allows for sampling along the curve defined by the
|
||||
control points.
|
||||
|
||||
The interpolation functions described below all return an *interpolation model*
|
||||
The interpolation functions described below all return an _interpolation model_
|
||||
that can be passed to other functions which make use of the sampling capability.
|
||||
|
||||
If returned directly the interpolation model returns an array containing predictions for each of the
|
||||
control points. This is useful in the case of `loess` interpolation which first smooths the control points
|
||||
and then interpolates the smoothed points. All other interpolation function simply return the original
|
||||
and then interpolates the smoothed points. All other interpolation functions simply return the original
|
||||
control points because interpolation predicts a curve that passes through the original control points.
|
||||
|
||||
There are different algorithms for interpolation that will result in different predictions
|
||||
|
@ -54,29 +53,25 @@ samples every second. In order to do this the data points between the minutes mu
|
|||
The `predict` function can be used to predict values anywhere within the bounds of the interpolation
|
||||
range. The example below shows a very simple example of upsampling.
|
||||
|
||||
In the example linear interpolation is performed on the arrays in variables *x* and *y*. The *x* variable,
|
||||
which is the x axis, is a sequence from 0 to 20 with a stride of 2. The *y* variable defines the curve
|
||||
along the x axis.
|
||||
|
||||
The `lerp` function performs the interpolation and returns the interpolation model.
|
||||
|
||||
The `u` value is an array from 0 to 20 with a stride of 1. This fills in the gaps of the original x axis.
|
||||
The `predict` function then uses the interpolation function in variable *l* to predict values for
|
||||
every point in the array assigned to variable *u*.
|
||||
|
||||
The variable *p* is the array of predictions, which is the upsampled set of y values.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(x=array(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20),
|
||||
y=array(5, 10, 60, 190, 100, 130, 100, 20, 30, 10, 5),
|
||||
l=lerp(x, y),
|
||||
u=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
|
||||
p=predict(l, u))
|
||||
let(x=array(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20), <1>
|
||||
y=array(5, 10, 60, 190, 100, 130, 100, 20, 30, 10, 5), <2>
|
||||
l=lerp(x, y), <3>
|
||||
u=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), <4>
|
||||
p=predict(l, u)) <5>
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
responds with:
|
||||
<1> In the example linear interpolation is performed on the arrays in variables *`x`* and *`y`*. The *`x`* variable,
|
||||
which is the x-axis, is a sequence from 0 to 20 with a stride of 2.
|
||||
<2> The *`y`* variable defines the curve along the x-axis.
|
||||
<3> The `lerp` function performs the interpolation and returns the interpolation model.
|
||||
<4> The `u` value is an array from 0 to 20 with a stride of 1. This fills in the gaps of the original x-axis.
|
||||
The `predict` function then uses the interpolation function in variable *`l`* to predict values for
|
||||
every point in the array assigned to variable *`u`*.
|
||||
<5> The variable *`p`* is the array of predictions, which is the upsampled set of *`y`* values.
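
As a cross-check outside Solr, the same upsampling can be sketched in plain Python. The function `lerp_predict` is a hypothetical stand-in for the combination of `lerp` and `predict`, assuming ordinary piecewise linear interpolation between neighboring control points.

```python
from bisect import bisect_right

def lerp_predict(xs, ys, u):
    """Linearly interpolate y at position u from sorted control points (xs, ys)."""
    if not xs[0] <= u <= xs[-1]:
        raise ValueError("u is outside the interpolation range")
    # Index of the control point at or left of u, clamped to the last segment.
    i = min(bisect_right(xs, u) - 1, len(xs) - 2)
    fraction = (u - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + fraction * (ys[i + 1] - ys[i])

x = list(range(0, 21, 2))                            # 0 to 20, stride 2
y = [5, 10, 60, 190, 100, 130, 100, 20, 30, 10, 5]   # curve along the x-axis
u = list(range(0, 21))                               # 0 to 20, stride 1

p = [lerp_predict(x, y, value) for value in u]
print(p[:6])  # [5.0, 7.5, 10.0, 35.0, 60.0, 125.0]
```

Every even position reproduces an original control point, and every odd position is the midpoint of its two neighbors.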
|
||||
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -127,21 +122,15 @@ A technique known as local regression is used to compute the smoothed curve. Th
|
|||
neighborhood of the local regression can be adjusted
|
||||
to control how close the new curve conforms to the original control points.
|
||||
|
||||
The `loess` function is passed *x* and *y* axises and fits a smooth curve to the data.
|
||||
If only a single array is provided it is treated as the *y* axis and a sequence is generated
|
||||
for the *x* axis.
|
||||
The `loess` function is passed *`x`*- and *`y`*-axes and fits a smooth curve to the data.
|
||||
If only a single array is provided it is treated as the *`y`*-axis and a sequence is generated
|
||||
for the *`x`*-axis.
|
||||
|
||||
The example below uses the `loess` function to fit a curve to a set of *y* values in an array.
|
||||
The bandwidth parameter defines the percent of data to use for the local
|
||||
The example below uses the `loess` function to fit a curve to a set of *`y`* values in an array.
|
||||
The `bandwidth` parameter defines the percent of data to use for the local
|
||||
regression. The lower the percent the smaller the neighborhood used for the local
|
||||
regression and the closer the curve will be to the original data.
|
||||
|
||||
In the example the fitted curve is subtracted from the original curve using the
|
||||
`ebeSubtract` function. The output shows the error between the
|
||||
fitted curve and the original curve, known as the residuals. The output also includes
|
||||
the sum-of-squares of the residuals which provides a measure
|
||||
of how large the error is.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(echo="residuals, sumSqError",
|
||||
|
@ -151,8 +140,11 @@ let(echo="residuals, sumSqError",
|
|||
sumSqError=sumSq(residuals))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
responds with:
|
||||
In the example the fitted curve is subtracted from the original curve using the
|
||||
`ebeSubtract` function. The output shows the error between the
|
||||
fitted curve and the original curve, known as the residuals. The output also includes
|
||||
the sum-of-squares of the residuals which provides a measure
|
||||
of how large the error is:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -194,9 +186,7 @@ responds with:
|
|||
}
|
||||
----
|
||||
|
||||
In the next example the curve is fit using a bandwidth of .25. Notice that the curve
|
||||
is a closer fit, shown by the smaller residuals and lower value for the sum-of-squares of the
|
||||
residuals.
|
||||
In the next example the curve is fit using a `bandwidth` of `.25`:
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -207,8 +197,8 @@ let(echo="residuals, sumSqError",
|
|||
sumSqError=sumSq(residuals))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
responds with:
|
||||
Notice that the curve is a closer fit, shown by the smaller `residuals` and lower value for the sum-of-squares of the
|
||||
residuals:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -252,11 +242,11 @@ responds with:
|
|||
|
||||
== Derivatives
|
||||
|
||||
The derivative of a function measures the rate of change of the *y* value in respects to the
|
||||
rate of change of the *x* value.
|
||||
The derivative of a function measures the rate of change of the *`y`* value with respect to the
|
||||
rate of change of the *`x`* value.
|
||||
|
||||
The `derivative` function can compute the derivative of any *interpolation* function.
|
||||
The `derivative` function can also compute the derivative of a derivative.
|
||||
The `derivative` function can compute the derivative of any interpolation function.
|
||||
It can also compute the derivative of a derivative.
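
The idea of taking a derivative of a derivative can be sketched numerically in Python. This uses a central finite difference as a stand-in and makes no claim about how the `derivative` function computes its result; the names `central_difference` and `curve` are hypothetical.

```python
def central_difference(f, h=1e-3):
    """Return a function that approximates the derivative of f."""
    def derivative(x):
        return (f(x + h) - f(x - h)) / (2 * h)
    return derivative

def curve(x):
    # Hypothetical smooth curve standing in for an interpolation function.
    return x ** 3

first = central_difference(curve)    # approximates 3 * x**2
second = central_difference(first)   # approximates 6 * x

print(round(first(2.0), 3), round(second(2.0), 3))  # 12.0 12.0
```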
|
||||
|
||||
The example below computes the derivative for a `loess` interpolation function.
|
||||
|
||||
|
@ -268,7 +258,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
|
|||
derivative=derivative(curve))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -327,7 +317,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
|
|||
integral=integrate(curve, 0, 20))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -357,7 +347,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
|
|||
integral=integrate(curve, 0, 10))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -382,18 +372,7 @@ responds with:
|
|||
The `bicubicSpline` function can be used to interpolate and predict values
|
||||
anywhere within a grid of data.
|
||||
|
||||
A simple example will make this more clear.
|
||||
|
||||
In example below a bicubic spline is used to interpolate a matrix of real estate data.
|
||||
Each row of the matrix represents a specific *year*. Each column of the matrix
|
||||
represents a *floor* of the building. The grid of numbers is the average selling price of
|
||||
an apartment for each year and floor. For example in 2002 the average selling price for
|
||||
the 9th floor was 415000 (row 3, column 3).
|
||||
|
||||
The `bicubicSpline` function is then used to
|
||||
interpolate the grid, and the `predict` function is used to predict a value for year 2003, floor 8.
|
||||
Notice that the matrix does not included a data point for year 2003, floor 8. The `bicupicSpline`
|
||||
function creates that data point based on the surrounding data in the matrix.
|
||||
A simple example will make this more clear:
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -408,8 +387,16 @@ let(years=array(1998, 2000, 2002, 2004, 2006),
|
|||
prediction=predict(bspline, 2003, 8))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
responds with:
|
||||
In this example a bicubic spline is used to interpolate a matrix of real estate data.
|
||||
Each row of the matrix represents one of the `years`. Each column of the matrix
|
||||
represents `floors` of the building. The grid of numbers is the average selling price of
|
||||
an apartment for each year and floor. For example in 2002 the average selling price for
|
||||
the 9th floor was `415000` (row 3, column 3).
|
||||
|
||||
The `bicubicSpline` function is then used to
|
||||
interpolate the grid, and the `predict` function is used to predict a value for year 2003, floor 8.
|
||||
Notice that the matrix does not include a data point for year 2003, floor 8. The `bicubicSpline`
|
||||
function creates that data point based on the surrounding data in the matrix:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -427,4 +414,3 @@ responds with:
|
|||
}
|
||||
}
|
||||
----
|
||||
|
||||
|
|
|
@ -17,18 +17,16 @@
|
|||
// under the License.
|
||||
|
||||
This section of the user guide covers the
|
||||
*probability distribution
|
||||
framework* included in the math expressions library.
|
||||
probability distribution
|
||||
framework included in the math expressions library.
|
||||
|
||||
== Probability Distribution Framework
|
||||
|
||||
The probability distribution framework includes
|
||||
many commonly used *real* and *discrete* probability
|
||||
distributions, including support for *empirical* and
|
||||
*enumerated* distributions that model real world data.
|
||||
The probability distribution framework includes many commonly used <<Real Distributions,real>>
|
||||
and <<Discrete,discrete>> probability distributions, including support for <<Empirical Distribution,empirical>>
|
||||
and <<Enumerated Distributions,enumerated>> distributions that model real world data.
|
||||
|
||||
The probability distribution framework also includes a set
|
||||
of functions that use the probability distributions
|
||||
The probability distribution framework also includes a set of functions that use the probability distributions
|
||||
to support probability calculations and sampling.
|
||||
|
||||
=== Real Distributions
|
||||
|
@ -93,18 +91,18 @@ random variable within a specific distribution.
|
|||
Below is an example of calculating the cumulative probability
|
||||
of a random variable within a normal distribution.
|
||||
|
||||
In the example a normal distribution function is created
|
||||
with a mean of 10 and a standard deviation of 5. Then
|
||||
the cumulative probability of the value 12 is calculated for this
|
||||
specific distribution.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(a=normalDistribution(10, 5),
|
||||
b=cumulativeProbability(a, 12))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
In this example a normal distribution function is created
|
||||
with a mean of 10 and a standard deviation of 5. Then
|
||||
the cumulative probability of the value 12 is calculated for this
|
||||
specific distribution.
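
The same quantity can be checked independently with the closed-form normal cumulative distribution function. This Python sketch uses only the standard library; `normal_cdf` is a hypothetical helper name, not part of the math expressions library.

```python
import math

def normal_cdf(x, mean, std_dev):
    """Cumulative probability P(X <= x) for a normal distribution."""
    z = (x - mean) / (std_dev * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

# Mean of 10, standard deviation of 5, evaluated at the value 12.
probability = normal_cdf(12, mean=10, std_dev=5)
print(round(probability, 4))  # 0.6554
```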
|
||||
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -127,10 +125,10 @@ Below is an example of a cumulative probability calculation
|
|||
using an empirical distribution.
|
||||
|
||||
In the example an empirical distribution is created from a random
|
||||
sample taken from the *price_f* field.
|
||||
sample taken from the `price_f` field.
|
||||
|
||||
The cumulative probability of the value .75 is then calculated.
|
||||
The *price_f* field in this example was generated using a
|
||||
The cumulative probability of the value `.75` is then calculated.
|
||||
The `price_f` field in this example was generated using a
|
||||
uniform real distribution between 0 and 1, so the output of the
|
||||
`cumulativeProbability` function is very close to .75.
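
The reasoning can be sketched in Python with a simplified empirical distribution: the cumulative probability of a value is the fraction of the sample at or below it. The real `empiricalDistribution` function may use a more elaborate estimate, and the seeded random data here is a made-up substitute for the `price_f` field.

```python
import random
from bisect import bisect_right

def empirical_cdf(sample):
    """Build a cumulative probability function from observed values."""
    ordered = sorted(sample)
    def cdf(x):
        # Fraction of observations that are less than or equal to x.
        return bisect_right(ordered, x) / len(ordered)
    return cdf

# Stand-in for the random sample: 30000 values, uniform between 0 and 1.
rng = random.Random(42)
prices = [rng.random() for _ in range(30000)]

cdf = empirical_cdf(prices)
probability = cdf(0.75)
print(probability)  # close to 0.75 for uniformly distributed data
```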
|
||||
|
||||
|
@ -142,7 +140,7 @@ let(a=random(collection1, q="*:*", rows="30000", fl="price_f"),
|
|||
d=cumulativeProbability(c, .75))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -171,7 +169,7 @@ Below is an example which calculates the probability
|
|||
of a discrete value within a Poisson distribution.
|
||||
|
||||
In the example a Poisson distribution function is created
|
||||
with a mean of 100. Then the
|
||||
with a mean of `100`. Then the
|
||||
probability of encountering a sample of the discrete value 101 is calculated for this
|
||||
specific distribution.
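
For comparison, the Poisson probability mass function can be evaluated directly in Python. The helper name `poisson_probability` is hypothetical; the formula is the standard one, computed in log space so that the large factorial does not overflow.

```python
import math

def poisson_probability(mean, k):
    """P(X = k) for a Poisson distribution with the given mean."""
    # exp(k*ln(mean) - mean - ln(k!)), using lgamma(k + 1) for ln(k!).
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

# Mean of 100, probability of the discrete value 101.
probability = poisson_probability(100, 101)
print(round(probability, 4))  # 0.0395
```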
|
||||
|
||||
|
@ -181,7 +179,7 @@ let(a=poissonDistribution(100),
|
|||
b=probability(a, 101))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -200,12 +198,10 @@ When this expression is sent to the /stream handler it responds with:
|
|||
}
|
||||
----
|
||||
|
||||
Below is an example of a probability calculation
|
||||
using an enumerated distribution.
|
||||
Below is an example of a probability calculation using an enumerated distribution.
|
||||
|
||||
In the example an enumerated distribution is created from a random
|
||||
sample taken from the *day_i* field, which was created
|
||||
using a uniform integer distribution between 0 and 30.
|
||||
sample taken from the `day_i` field, which was created using a uniform integer distribution between 0 and 30.
|
||||
|
||||
The probability of the discrete value 10 is then calculated.
|
||||
|
||||
|
@ -217,7 +213,7 @@ let(a=random(collection1, q="*:*", rows="30000", fl="day_i"),
|
|||
d=probability(c, 10))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -239,11 +235,9 @@ When this expression is sent to the /stream handler it responds with:
|
|||
=== Sampling
|
||||
|
||||
All probability distributions support sampling. The `sample`
|
||||
function returns 1 or more random samples from a probability
|
||||
distribution.
|
||||
function returns 1 or more random samples from a probability distribution.
|
||||
|
||||
Below is an example drawing a single sample from
|
||||
a normal distribution.
|
||||
Below is an example drawing a single sample from a normal distribution.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -251,7 +245,7 @@ let(a=normalDistribution(10, 5),
|
|||
b=sample(a))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -270,8 +264,7 @@ When this expression is sent to the /stream handler it responds with:
|
|||
}
|
||||
----
|
||||
|
||||
Below is an example drawing 10 samples from a normal
|
||||
distribution.
|
||||
Below is an example drawing 10 samples from a normal distribution.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -279,7 +272,7 @@ let(a=normalDistribution(10, 5),
|
|||
b=sample(a, 10))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -315,14 +308,14 @@ The multivariate normal distribution is a generalization of the
|
|||
univariate normal distribution to higher dimensions.
|
||||
|
||||
The multivariate normal distribution models two or more random
|
||||
variables that are normally distributed. The relationship between
|
||||
the variables is defined by a covariance matrix.
|
||||
variables that are normally distributed. The relationship between the variables is defined by a covariance matrix.
|
||||
|
||||
==== Sampling
|
||||
|
||||
The `sample` function can be used to draw samples
|
||||
from a multivariate normal distribution in much the same
|
||||
way as a univariate normal distribution.
|
||||
|
||||
The difference is that each sample will be an array containing a sample
|
||||
drawn from each of the underlying normal distributions.
|
||||
If multiple samples are drawn, the `sample` function returns a matrix with a
|
||||
|
@ -333,33 +326,25 @@ multivariate normal distribution.
|
|||
The example below demonstrates how to initialize and draw samples
|
||||
from a multivariate normal distribution.
|
||||
|
||||
In this example 5000 random samples are selected from a collection
|
||||
of log records. Each sample contains
|
||||
the fields *filesize_d* and *response_d*. The values of both fields conform
|
||||
to a normal distribution.
|
||||
In this example 5000 random samples are selected from a collection of log records. Each sample contains
|
||||
the fields `filesize_d` and `response_d`. The values of both fields conform to a normal distribution.
|
||||
|
||||
Both fields are then vectorized. The *filesize_d* vector is stored in
|
||||
variable *b* and the *response_d* variable is stored in variable *c*.
|
||||
Both fields are then vectorized. The `filesize_d` vector is stored in
|
||||
variable *`b`* and the `response_d` vector is stored in variable *`c`*.
|
||||
|
||||
An array is created that contains the *means* of the two vectorized fields.
|
||||
An array is created that contains the means of the two vectorized fields.
|
||||
|
||||
Then both vectors are added to a matrix which is transposed. This creates
|
||||
an *observation* matrix where each row contains one observation of
|
||||
*filesize_d* and *response_d*. A covariance matrix is then created from the columns of
|
||||
the observation matrix with the
|
||||
`cov` function. The covariance matrix describes the covariance between
|
||||
*filesize_d* and *response_d*.
|
||||
an observation matrix where each row contains one observation of
|
||||
`filesize_d` and `response_d`. A covariance matrix is then created from the columns of
|
||||
the observation matrix with the `cov` function. The covariance matrix describes the covariance between
|
||||
`filesize_d` and `response_d`.
|
||||
|
||||
The `multivariateNormalDistribution` function is then called with the
|
||||
array of means for the two fields and the covariance matrix. The model for the
|
||||
multivariate normal distribution is assigned to variable *g*.
|
||||
multivariate normal distribution is assigned to variable *`g`*.
|
||||
|
||||
Finally five samples are drawn from the multivariate normal distribution. The samples
|
||||
are returned as a matrix, with each row representing one sample. There are two
|
||||
columns in the matrix. The first column contains samples for *filesize_d* and the second
|
||||
column contains samples for *response_d*. Over the long term the covariance between
|
||||
the columns will conform to the covariance matrix used to instantiate the
|
||||
multivariate normal distribution.
|
||||
Finally five samples are drawn from the multivariate normal distribution.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -373,7 +358,11 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
|
|||
h=sample(g, 5))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
The samples are returned as a matrix, with each row representing one sample. There are two
|
||||
columns in the matrix. The first column contains samples for `filesize_d` and the second
|
||||
column contains samples for `response_d`. Over the long term the covariance between
|
||||
the columns will conform to the covariance matrix used to instantiate the
|
||||
multivariate normal distribution.
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -412,4 +401,3 @@ When this expression is sent to the /stream handler it responds with:
|
|||
}
|
||||
}
|
||||
----
|
||||
|
||||
|
|
|
@ -16,28 +16,23 @@
|
|||
// specific language governing permissions and limitations
|
||||
// under the License.
|
||||
|
||||
|
||||
This section of the math expressions user guide covers simple and multivariate linear regression.
|
||||
|
||||
The math expressions library supports simple and multivariate linear regression.
|
||||
|
||||
== Simple Linear Regression
|
||||
|
||||
The `regress` function is used to build a linear regression model
|
||||
between two random variables. Sample observations are provided with two
|
||||
numeric arrays. The first numeric array is the *independent variable* and
|
||||
the second array is the *dependent variable*.
|
||||
numeric arrays. The first numeric array is the independent variable and
|
||||
the second array is the dependent variable.
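
What a simple linear regression computes from the two arrays can be sketched in Python using ordinary least squares. The dictionary keys are illustrative and do not claim to reproduce the full tuple returned by `regress`; the data is made up.

```python
def simple_regress(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        # Share of the variability in y explained by x.
        "RSquared": (sxy * sxy) / (sxx * syy),
    }

independent = [1, 2, 3, 4, 5]   # first array: independent variable
dependent = [3, 5, 7, 9, 11]    # second array: dependent variable

model = simple_regress(independent, dependent)
print(model)  # {'slope': 2.0, 'intercept': 1.0, 'RSquared': 1.0}
```

A prediction for a new value is then `intercept + slope * value`.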
|
||||
|
||||
In the example below the `random` function selects 5000 random samples each containing
|
||||
the fields *filesize_d* and *response_d*. The two fields are vectorized
|
||||
and stored in variables *b* and *c*. Then the `regress` function performs a regression
|
||||
the fields `filesize_d` and `response_d`. The two fields are vectorized
|
||||
and stored in variables *`b`* and *`c`*. Then the `regress` function performs a regression
|
||||
analysis on the two numeric arrays.
|
||||
|
||||
The `regress` function returns a single tuple with the results of the regression
|
||||
analysis.
|
||||
|
||||
Note that in this regression analysis the value of *RSquared* is *.75*. This means that changes in
|
||||
*filesize_d* explain 75% of the variability of the *response_d* variable.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
|
||||
|
@ -46,7 +41,8 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
|
|||
d=regress(b, c))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
Note that in this regression analysis the value of `RSquared` is `.75`. This means that changes in
|
||||
`filesize_d` explain 75% of the variability of the `response_d` variable:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -81,11 +77,10 @@ When this expression is sent to the /stream handler it responds with:
|
|||
|
||||
The `predict` function uses the regression model to make predictions.
|
||||
Using the example above the regression model can be used to predict the value
|
||||
of *response_d* given a value for *filesize_d*.
|
||||
of `response_d` given a value for `filesize_d`.
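
As a rough illustration of what a simple linear regression model computes, the sketch below fits a least-squares line to two arrays, reports R-squared, and makes a prediction. It is a plain Python approximation, not Solr's implementation; the helper names `fit_line` and `predict_value` and the sample data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variability in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


def predict_value(slope, intercept, x_value):
    """Predict the dependent variable for one value of the independent variable."""
    return intercept + slope * x_value


if __name__ == "__main__":
    # Hypothetical observations: independent variable first, dependent second.
    sizes = [1.0, 2.0, 3.0, 4.0]
    responses = [3.0, 5.0, 7.0, 9.0]
    slope, intercept, r2 = fit_line(sizes, responses)
    print(slope, intercept, r2, predict_value(slope, intercept, 10.0))
```

An R-squared of .75 in this setting would mean the fitted line accounts for 75% of the variance in the dependent array.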
|
||||
|
||||
In the example below the `predict` function uses the regression analysis to predict
|
||||
the value of *response_d* for the *filesize_d* value of 40000.
|
||||
|
||||
the value of `response_d` for the `filesize_d` value of `40000`.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -96,7 +91,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
|
|||
e=predict(d, 40000))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -131,7 +126,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
|
|||
e=predict(d, b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -169,9 +164,9 @@ The difference between the observed value and the predicted value is known as th
|
|||
residual. There isn't a specific function to calculate the residuals but vector
|
||||
math can be used to perform the calculation.
|
||||
|
||||
In the example below the predictions are stored in variable *e*. The `ebeSubtract`
|
||||
In the example below the predictions are stored in variable *`e`*. The `ebeSubtract`
|
||||
function is then used to subtract the predictions
|
||||
from the actual *response_d* values stored in variable *c*. Variable *f* contains
|
||||
from the actual `response_d` values stored in variable *`c`*. Variable *`f`* contains
|
||||
the array of residuals.
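
The element-by-element subtraction described above can be sketched in plain Python as follows. The function name `ebe_subtract` mirrors the Solr function for readability only; this is an assumption-laden illustration with hypothetical data, not the library's code.

```python
def ebe_subtract(observed, predicted):
    """Element-by-element subtraction: observed[i] - predicted[i]."""
    if len(observed) != len(predicted):
        raise ValueError("arrays must have the same length")
    return [o - p for o, p in zip(observed, predicted)]


def sum_sq(values):
    """Sum of squares, a single-number summary of how large the residuals are."""
    return sum(v * v for v in values)


if __name__ == "__main__":
    observed = [3.0, 5.0, 7.0]     # hypothetical actual values
    predicted = [2.5, 5.0, 8.0]    # hypothetical model predictions
    residuals = ebe_subtract(observed, predicted)
    print(residuals, sum_sq(residuals))
```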
|
||||
|
||||
[source,text]
|
||||
|
@ -184,7 +179,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
|
|||
f=ebeSubtract(c, e))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -221,20 +216,17 @@ When this expression is sent to the /stream handler it responds with:
|
|||
== Multivariate Linear Regression
|
||||
|
||||
The `olsRegress` function performs a multivariate linear regression analysis. Multivariate linear
|
||||
regression models the linear relationship between two or more *independent* variables and a *dependent* variable.
|
||||
regression models the linear relationship between two or more independent variables and a dependent variable.
|
||||
|
||||
The example below extends the simple linear regression example by introducing a new independent variable
|
||||
called *service_d*. The *service_d* variable is the service level of the request and it can range from 1 to 4
|
||||
called `service_d`. The `service_d` variable is the service level of the request and it can range from 1 to 4
|
||||
in the data-set. The higher the service level, the higher the bandwidth available for the request.
|
||||
|
||||
Notice that the two independent variables *filesize_d* and *service_d* are vectorized and stored
|
||||
in the variables *b* and *c*. The variables *b* and *c* are then added as rows to a `matrix`. The matrix is
|
||||
then transposed so that each row in the matrix represents one observation with *filesize_d* and *service_d*.
|
||||
Notice that the two independent variables `filesize_d` and `service_d` are vectorized and stored
|
||||
in the variables *`b`* and *`c`*. The variables *`b`* and *`c`* are then added as rows to a `matrix`. The matrix is
|
||||
then transposed so that each row in the matrix represents one observation with `filesize_d` and `service_d`.
|
||||
The `olsRegress` function then performs the multivariate regression analysis using the observation matrix as the
|
||||
independent variables and the *response_d* values, stored in variable *d*, as the dependent variable.
|
||||
|
||||
Notice that the RSquared of the regression analysis is 1. This means that linear relationship between
|
||||
*filesize_d* and *service_d* describe 100% of the variability of the *response_d* variable.
|
||||
independent variables and the `response_d` values, stored in variable *`d`*, as the dependent variable.
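
To make the matrix step concrete, the sketch below shows how two vectors added as rows become an observation matrix once transposed. It is plain Python with hypothetical values, and `transpose` here is an illustrative helper rather than the Solr function itself.

```python
def transpose(rows):
    """Turn a list of equal-length rows into a list of columns."""
    return [list(column) for column in zip(*rows)]


if __name__ == "__main__":
    # Hypothetical vectors: one value per sampled document.
    filesize = [1000.0, 2000.0, 3000.0]
    service = [1.0, 2.0, 4.0]
    # Two rows, three columns before transposing.
    matrix = [filesize, service]
    # Three rows, two columns after: each row is one observation.
    observations = transpose(matrix)
    print(observations)
```

Each row of `observations` pairs one file size with one service level, which is the shape a multivariate regression expects for its independent variables.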
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -246,7 +238,8 @@ let(a=random(collection2, q="*:*", rows="30000", fl="filesize_d, service_d, resp
|
|||
f=olsRegress(e, d))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
Notice in the response that the RSquared of the regression analysis is 1. This means that the linear relationship between
|
||||
`filesize_d` and `service_d` describes 100% of the variability of the `response_d` variable:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -299,10 +292,11 @@ When this expression is sent to the /stream handler it responds with:
|
|||
|
||||
=== Prediction
|
||||
|
||||
The `predict` function can also be used to make predictions for multivariate linear regression. Below is an example
|
||||
of a single prediction using the multivariate linear regression model and a single observation. The observation
|
||||
is an array that matches the structure of the observation matrix used to build the model. In this case
|
||||
the first value represent a *filesize_d* of 40000 and the second value represents a *service_d* of 4.
|
||||
The `predict` function can also be used to make predictions for multivariate linear regression.
|
||||
|
||||
Below is an example of a single prediction using the multivariate linear regression model and a single observation.
|
||||
The observation is an array that matches the structure of the observation matrix used to build the model. In this case
|
||||
the first value represents a `filesize_d` of `40000` and the second value represents a `service_d` of `4`.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -315,7 +309,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, respo
|
|||
g=predict(f, array(40000, 4)))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -335,9 +329,10 @@ When this expression is sent to the /stream handler it responds with:
|
|||
----
|
||||
|
||||
The `predict` function can also make predictions for more than one multivariate observation. In this scenario
|
||||
an observation matrix used. In the example below the observation matrix used to build the multivariate regression model
|
||||
is passed to the `predict` function and it returns an array of predictions.
|
||||
an observation matrix is used.
|
||||
|
||||
In the example below the observation matrix used to build the multivariate regression model
|
||||
is passed to the `predict` function and it returns an array of predictions.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -350,7 +345,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, respo
|
|||
g=predict(f, e))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -388,7 +383,7 @@ Once the predictions are generated the residuals can be calculated using the sam
|
|||
simple linear regression.
|
||||
|
||||
Below is an example of the residuals calculation following a multivariate linear regression. In the example
|
||||
the predictions stored variable *g* are subtracted from observed values stored in variable *d*.
|
||||
the predictions stored in variable *`g`* are subtracted from the observed values stored in variable *`d`*.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -402,7 +397,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, respo
|
|||
h=ebeSubtract(d, g))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -433,7 +428,3 @@ When this expression is sent to the /stream handler it responds with:
|
|||
}
|
||||
}
|
||||
----
|
||||
|
||||
|
||||
|
||||
|
||||
|
|
|
@ -26,7 +26,7 @@ For example the expression below adds two numbers together:
|
|||
add(1, 1)
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -98,7 +98,7 @@ select(search(collection2, q="*:*", fl="price_f", sort="price_f desc", rows="3")
|
|||
mult(price_f, 10) as newPrice)
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
|
|
@ -18,59 +18,59 @@
|
|||
|
||||
|
||||
Monte Carlo simulations are commonly used to model the behavior of
|
||||
stochastic systems. This section of the user guide describes
|
||||
how to perform both *uncorrelated* and *correlated* Monte Carlo simulations
|
||||
using the *sampling* capabilities of the probability distribution framework.
|
||||
stochastic systems. This section describes
|
||||
how to perform both uncorrelated and correlated Monte Carlo simulations
|
||||
using the sampling capabilities of the probability distribution framework.
|
||||
|
||||
== Uncorrelated Simulations
|
||||
|
||||
Uncorrelated Monte Carlo simulations model stochastic systems with the assumption
|
||||
that the underlying random variables move independently of each other.
|
||||
A simple example of a Monte Carlo simulation using two independently changing random variables
|
||||
is described below.
|
||||
that the underlying random variables move independently of each other.
|
||||
A simple example of a Monte Carlo simulation using two independently changing random variables
|
||||
is described below.
|
||||
|
||||
In this example a Monte Carlo simulation is used to determine the probability that a simple hinge assembly will
|
||||
fall within a required length specification.
|
||||
|
||||
The hinge has two components *A* and *B*. The combined length of the two components must be less then 5 centimeters
|
||||
The hinge has two components A and B. The combined length of the two components must be less than 5 centimeters
|
||||
to fall within specification.
|
||||
|
||||
A random sampling of lengths for component *A* has shown that its length conforms to a
|
||||
A random sampling of lengths for component A has shown that its length conforms to a
|
||||
normal distribution with a mean of 2.2 centimeters and a standard deviation of .0195
|
||||
centimeters.
|
||||
|
||||
A random sampling of lengths for component *B* has shown that its length conforms
|
||||
A random sampling of lengths for component B has shown that its length conforms
|
||||
to a normal distribution with a mean of 2.71 centimeters and a standard deviation of .0198 centimeters.
|
||||
|
||||
The Monte Carlo simulation below performs the following steps:
|
||||
|
||||
* A normal distribution with a mean of 2.2 and a standard deviation of .0195 is created to model the length of componentA.
|
||||
* A normal distribution with a mean of 2.71 and a standard deviation of .0198 is created to model the length of componentB.
|
||||
* The `monteCarlo` function samples from the componentA and componentB distributions and sets the values to variables sampleA and sampleB. It then
|
||||
calls the *add(sampleA, sampleB)* function to find the combined lengths of the samples. The `monteCarlo` function runs a set number of times, 100000 in the example below, and collects the results in an array. Each
|
||||
time the function is called new samples are drawn from the componentA
|
||||
and componentB distributions. On each run, the `add` function adds the two samples to calculate the combined length.
|
||||
The result of each run is collected in an array and assigned to the *simresults* variable.
|
||||
* An `empiricalDistribution` function is then created from the *simresults* array to model the distribution of the
|
||||
simulation results.
|
||||
* Finally, the `cumulativeProbability` function is called on the *simmodel* to determine the cumulative probability
|
||||
that the combined length of the components is 5 or less.
|
||||
* Based on the simulation there is .9994371944629039 probability that the combined length of a component pair will
|
||||
be 5 or less.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(componentA=normalDistribution(2.2, .0195),
|
||||
componentB=normalDistribution(2.71, .0198),
|
||||
simresults=monteCarlo(sampleA=sample(componentA),
|
||||
let(componentA=normalDistribution(2.2, .0195), <1>
|
||||
componentB=normalDistribution(2.71, .0198), <2>
|
||||
simresults=monteCarlo(sampleA=sample(componentA), <3>
|
||||
sampleB=sample(componentB),
|
||||
add(sampleA, sampleB),
|
||||
100000),
|
||||
simmodel=empiricalDistribution(simresults),
|
||||
prob=cumulativeProbability(simmodel, 5))
|
||||
add(sampleA, sampleB), <4>
|
||||
100000), <5>
|
||||
simmodel=empiricalDistribution(simresults), <6>
|
||||
prob=cumulativeProbability(simmodel, 5)) <7>
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
The Monte Carlo simulation above performs the following steps:
|
||||
|
||||
<1> A normal distribution with a mean of 2.2 and a standard deviation of .0195 is created to model the length of `componentA`.
|
||||
<2> A normal distribution with a mean of 2.71 and a standard deviation of .0198 is created to model the length of `componentB`.
|
||||
<3> The `monteCarlo` function samples from the `componentA` and `componentB` distributions and sets the values to variables `sampleA` and `sampleB`.
|
||||
<4> It then calls the `add(sampleA, sampleB)` function to find the combined lengths of the samples.
|
||||
<5> The `monteCarlo` function runs a set number of times, 100000, and collects the results in an array. Each
|
||||
time the function is called new samples are drawn from the `componentA`
|
||||
and `componentB` distributions. On each run, the `add` function adds the two samples to calculate the combined length.
|
||||
The result of each run is collected in an array and assigned to the `simresults` variable.
|
||||
<6> An `empiricalDistribution` function is then created from the `simresults` array to model the distribution of the
|
||||
simulation results.
|
||||
<7> Finally, the `cumulativeProbability` function is called on the `simmodel` to determine the cumulative probability
|
||||
that the combined length of the components is 5 or less.
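
The same steps can be sketched outside Solr in plain Python. This is an approximation under stated assumptions: it uses the standard library's `random.gauss` in place of the Solr distribution functions, estimates the cumulative probability as a simple fraction of runs, and the function name `simulate_hinge` and the seed are hypothetical.

```python
import random


def simulate_hinge(runs=100000, limit=5.0, seed=42):
    """Estimate P(lengthA + lengthB <= limit) with independent normal samples."""
    rng = random.Random(seed)
    within_spec = 0
    for _ in range(runs):
        sample_a = rng.gauss(2.2, 0.0195)    # component A: mean, std dev
        sample_b = rng.gauss(2.71, 0.0198)   # component B: mean, std dev
        if sample_a + sample_b <= limit:
            within_spec += 1
    # Fraction of runs at or under the limit, i.e. the empirical
    # cumulative probability at `limit`.
    return within_spec / runs


if __name__ == "__main__":
    print(simulate_hinge())
```

Because the two components are sampled independently, the variance of the combined length is the sum of the two variances.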
|
||||
|
||||
Based on the simulation there is a .9994371944629039 probability that the combined length of a component pair will
|
||||
be 5 or less:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -91,36 +91,32 @@ When this expression is sent to the /stream handler it responds with:
|
|||
|
||||
== Correlated Simulations
|
||||
|
||||
The simulation above assumes that the lengths of *componentA* and *componentB* vary independently.
|
||||
The simulation above assumes that the lengths of `componentA` and `componentB` vary independently.
|
||||
What would happen to the probability model if there was a correlation between the lengths of
|
||||
*componentA* and *componentB*.
|
||||
`componentA` and `componentB`?
|
||||
|
||||
In the example below a database containing assembled pairs of components is used to determine
|
||||
if there is a correlation between the lengths of the components, and how the correlation affects the model.
|
||||
|
||||
Before performing a simulation of the effects of correlation on the probability model it's
|
||||
useful to understand what the correlation is between the lengths of *componentA* and *componentB*.
|
||||
|
||||
In the example below 5000 random samples are selected from a collection
|
||||
of assembled hinges. Each sample contains
|
||||
lengths of the components in the fields *componentA_d* and *componentB_d*.
|
||||
|
||||
Both fields are then vectorized. The *componentA_d* vector is stored in
|
||||
variable *b* and the *componentB_d* variable is stored in variable *c*.
|
||||
|
||||
Then the correlation of the two vectors is calculated using the `corr` function. Note that the outcome
|
||||
from `corr` is 0.9996931313216989. This means that *componentA_d* and *componentB_d* are almost
|
||||
perfectly correlated.
|
||||
useful to understand what the correlation is between the lengths of `componentA` and `componentB`.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(a=random(collection5, q="*:*", rows="5000", fl="componentA_d, componentB_d"),
|
||||
b=col(a, componentA_d)),
|
||||
let(a=random(collection5, q="*:*", rows="5000", fl="componentA_d, componentB_d"), <1>
|
||||
b=col(a, componentA_d)), <2>
|
||||
c=col(a, componentB_d)),
|
||||
d=corr(b, c))
|
||||
d=corr(b, c)) <3>
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
<1> In the example, 5000 random samples are selected from a collection of assembled hinges.
|
||||
Each sample contains lengths of the components in the fields `componentA_d` and `componentB_d`.
|
||||
<2> Both fields are then vectorized. The `componentA_d` vector is stored in
|
||||
variable *`b`* and the `componentB_d` vector is stored in variable *`c`*.
|
||||
<3> Then the correlation of the two vectors is calculated using the `corr` function.
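
For readers unfamiliar with the calculation, the sketch below computes a Pearson correlation coefficient for two vectors in plain Python. It assumes Pearson's method; the helper name `pearson_corr` and the sample data are hypothetical.

```python
import math


def pearson_corr(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-deviation of the two vectors around their means.
    co_dev = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return co_dev / (dev_x * dev_y)


if __name__ == "__main__":
    # Hypothetical component lengths that move together.
    lengths_a = [2.18, 2.20, 2.21, 2.23]
    lengths_b = [2.69, 2.71, 2.72, 2.74]
    print(pearson_corr(lengths_a, lengths_b))
```

A value close to 1 indicates that the two lengths rise and fall together.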
|
||||
|
||||
Note from the result that the outcome from `corr` is 0.9996931313216989.
|
||||
This means that `componentA_d` and `componentB_d` are almost perfectly correlated.
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -139,35 +135,34 @@ When this expression is sent to the /stream handler it responds with:
|
|||
}
|
||||
----
|
||||
|
||||
How does correlation effect the probability model?
|
||||
=== Correlation Effects on the Probability Model
|
||||
|
||||
The example below explores how to use a *multivariate normal distribution* function
|
||||
The example below explores how to use a multivariate normal distribution function
|
||||
to model how correlation affects the probability of hinge defects.
|
||||
|
||||
In this example 5000 random samples are selected from a collection
|
||||
containing length data for assembled hinges. Each sample contains
|
||||
the fields *componentA_d* and *componentB_d*.
|
||||
the fields `componentA_d` and `componentB_d`.
|
||||
|
||||
Both fields are then vectorized. The *componentA_d* vector is stored in
|
||||
variable *b* and the *componentB_d* variable is stored in variable *c*.
|
||||
Both fields are then vectorized. The `componentA_d` vector is stored in
|
||||
variable *`b`* and the `componentB_d` vector is stored in variable *`c`*.
|
||||
|
||||
An array is created that contains the *means* of the two vectorized fields.
|
||||
An array is created that contains the means of the two vectorized fields.
|
||||
|
||||
Then both vectors are added to a matrix which is transposed. This creates
|
||||
an *observation* matrix where each row contains one observation of
|
||||
*componentA_d* and *componentB_d*. A covariance matrix is then created from the columns of
|
||||
an observation matrix where each row contains one observation of
|
||||
`componentA_d` and `componentB_d`. A covariance matrix is then created from the columns of
|
||||
the observation matrix with the
|
||||
`cov` function. The covariance matrix describes the covariance between
|
||||
*componentA_d* and *componentB_d*.
|
||||
`cov` function. The covariance matrix describes the covariance between `componentA_d` and `componentB_d`.
|
||||
|
||||
The `multivariateNormalDistribution` function is then called with the
|
||||
array of means for the two fields and the covariance matrix. The model
|
||||
for the multivariate normal distribution is stored in variable *g*.
|
||||
for the multivariate normal distribution is stored in variable *`g`*.
|
||||
|
||||
The `monteCarlo` function then calls the function *add(sample(g))* 50000 times
|
||||
The `monteCarlo` function then calls the function `add(sample(g))` 50000 times
|
||||
and collects the results in a vector. Each time the function is called a single sample
|
||||
is drawn from the multivariate normal distribution. Each sample is a vector containing
|
||||
one *componentA* and *componentB* pair. the `add` function adds the values in the vector to
|
||||
one `componentA` and `componentB` pair. The `add` function adds the values in the vector to
|
||||
calculate the length of the pair. Over the long term the samples drawn from the
|
||||
multivariate normal distribution will conform to the covariance matrix used to construct it.
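
One way to picture the correlated simulation is the plain Python sketch below. It draws pairs from a two-variable normal distribution by applying a Cholesky factor of the covariance matrix to independent standard normal samples. The means, covariance values, function names, and seed are all hypothetical, and the technique is a generic one, not necessarily what Solr uses internally.

```python
import math
import random


def sample_pair(means, cov, rng):
    """Draw one (A, B) pair from a bivariate normal via a 2x2 Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[0][1] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    return means[0] + l11 * z1, means[1] + l21 * z1 + l22 * z2


def simulate_correlated(means, cov, runs=50000, limit=5.0, seed=42):
    """Estimate P(A + B <= limit) when A and B are correlated."""
    rng = random.Random(seed)
    within_spec = 0
    for _ in range(runs):
        a, b = sample_pair(means, cov, rng)
        if a + b <= limit:
            within_spec += 1
    return within_spec / runs


if __name__ == "__main__":
    # Hypothetical inputs: standard deviations .0195 and .0198, correlation .9997.
    sd_a, sd_b, rho = 0.0195, 0.0198, 0.9997
    cov = [[sd_a ** 2, rho * sd_a * sd_b],
           [rho * sd_a * sd_b, sd_b ** 2]]
    print(simulate_correlated([2.2, 2.71], cov))
```

With a strong positive correlation the two lengths tend to be long or short together, so the combined length varies more and a larger share of pairs exceeds the limit than in the uncorrelated case.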
|
||||
|
||||
|
@ -195,7 +190,7 @@ let(a=random(hinges, q="*:*", rows="5000", fl="componentA_d, componentB_d"),
|
|||
j=cumulativeProbability(i, 5))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
|
|
@ -37,7 +37,7 @@ let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
|
|||
c=describe(b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -90,7 +90,7 @@ let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
|
|||
c=hist(b, 5))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -179,7 +179,7 @@ let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
|
|||
d=col(c, N))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -228,7 +228,7 @@ let(a=random(collection1, q="*:*", rows="15000", fl="day_i"),
|
|||
c=freqTable(b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -302,7 +302,7 @@ let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
|
|||
c=percentile(b, 95))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -344,7 +344,7 @@ let(a=array(1, 2, 3, 4, 5),
|
|||
c=cov(a, b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -380,7 +380,7 @@ let(a=array(1, 2, 3, 4, 5),
|
|||
e=cov(d))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -437,7 +437,7 @@ let(a=array(1, 2, 3, 4, 5),
|
|||
c=corr(a, b, type=spearmans))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -504,7 +504,7 @@ let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
|
|||
e=ttest(c, d))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -552,7 +552,7 @@ let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
|
|||
e=ttest(c, d))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -588,7 +588,7 @@ let(a=array(1,2,3),
|
|||
b=zscores(a))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
|
|
@ -216,8 +216,8 @@ The `nodes` function provides breadth-first graph traversal. For details, see th
|
|||
|
||||
== knnSearch
|
||||
|
||||
The `knnSearch` function returns the K nearest neighbors for a document based on text similarity. Under the covers the `knnSearch` function
|
||||
use the More Like This query parser plugin.
|
||||
The `knnSearch` function returns the k-nearest neighbors for a document based on text similarity. Under the covers the `knnSearch` function
|
||||
uses the More Like This query parser plugin.
|
||||
|
||||
=== knnSearch Parameters
|
||||
|
||||
|
|
|
@ -16,9 +16,9 @@
|
|||
// specific language governing permissions and limitations
|
||||
// under the License.
|
||||
|
||||
TF-IDF term vectors are often used to represent text documents when performing text mining
|
||||
and machine learning operations. This section of the user guide describes how to
|
||||
use math expressions to perform text analysis and create TF-IDF term vectors.
|
||||
Term frequency-inverse document frequency (TF-IDF) term vectors are often used to
|
||||
represent text documents when performing text mining and machine learning operations. The math expressions
|
||||
library can be used to perform text analysis and create TF-IDF term vectors.
|
||||
|
||||
== Text Analysis
|
||||
|
||||
|
@ -26,17 +26,16 @@ The `analyze` function applies a Solr analyzer to a text field and returns the t
|
|||
emitted by the analyzer in an array. Any analyzer chain that is attached to a field in Solr's
|
||||
schema can be used with the `analyze` function.
|
||||
|
||||
In the example below, the text "hello world" is analyzed using the analyzer chain attached to the *subject* field in
|
||||
the schema. The *subject* field is defined as the field type *text_general* and the text is analyzed using the
|
||||
analysis chain configured for the *text_general* field type.
|
||||
In the example below, the text "hello world" is analyzed using the analyzer chain attached to the `subject` field in
|
||||
the schema. The `subject` field is defined as the field type `text_general` and the text is analyzed using the
|
||||
analysis chain configured for the `text_general` field type.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
analyze("hello world", subject)
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -63,13 +62,12 @@ responds with:
|
|||
The `analyze` function can be used inside of a `select` function to annotate documents with the tokens
|
||||
generated by the analysis.
|
||||
|
||||
The example below is performing a `search` in collection1. Each tuple returned by the `search`
|
||||
contains an *id* and *subject*. For each tuple, the
|
||||
`select` function is selecting the *id* field and calling the `analyze` function on the *subject* field.
|
||||
The analyzer chain specified by the *subject_bigram* field is configured to perform a bigram analysis.
|
||||
The example below performs a `search` in "collection1". Each tuple returned by the `search` function
|
||||
contains an `id` and `subject`. For each tuple, the
|
||||
`select` function selects the `id` field and calls the `analyze` function on the `subject` field.
|
||||
The analyzer chain specified by the `subject_bigram` field is configured to perform a bigram analysis.
|
||||
The tokens generated by the `analyze` function are added to each tuple in a field called `terms`.
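
The annotation step can be pictured with the plain Python sketch below. It assumes that the bigram analysis emits lowercased pairs of adjacent words, which may differ from the analyzer chain actually configured in a given schema; the helper names and sample tuples are hypothetical.

```python
def bigrams(text):
    """Split text on whitespace and join each adjacent pair of words."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def annotate(tuples, source_field="subject", target_field="terms"):
    """Keep the id of each tuple and add the bigram tokens as a new field."""
    return [
        {"id": t["id"], target_field: bigrams(t[source_field])}
        for t in tuples
    ]


if __name__ == "__main__":
    # Hypothetical search results.
    results = [
        {"id": "1", "subject": "Hello world again"},
        {"id": "2", "subject": "Solr streaming expressions"},
    ]
    print(annotate(results))
```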
|
||||
|
||||
Notice in the output that an array of bigram terms have been added to the tuples.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -78,8 +76,7 @@ select(search(collection1, q="*:*", fl="id, subject", sort="id asc"),
|
|||
analyze(subject, subject_bigram) as terms)
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
responds with:
|
||||
Notice in the output that an array of bigram terms has been added to the tuples:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -111,42 +108,37 @@ responds with:
|
|||
|
||||
== TF-IDF Term Vectors
|
||||
|
||||
The `termVectors` function can be used to build *TF-IDF*
|
||||
term vectors from the terms generated by the `analyze` function.
|
||||
The `termVectors` function can be used to build TF-IDF term vectors from the terms generated by the `analyze` function.
|
||||
|
||||
The `termVectors` function operates over a list of tuples that contain a field
|
||||
called *id* and a field called *terms*. Notice
|
||||
that this is the exact output structure of the *document annotation* example above.
|
||||
The `termVectors` function operates over a list of tuples that contain a field called `id` and a field called `terms`.
|
||||
Notice that this is the exact output structure of the document annotation example above.
|
||||
|
||||
The `termVectors` function builds a *matrix* from the list of tuples. There is *row* in the
|
||||
matrix for each tuple in the list. There is a *column* in the matrix for each term in the *terms*
|
||||
field.
|
||||
|
||||
The example below builds on the *document annotation* example.
|
||||
The list of tuples are stored in variable *a*. The `termVectors` function
|
||||
operates over variable *a* and builds a matrix with *2 rows* and *4 columns*.
|
||||
|
||||
The `termVectors` function also sets the *row* and *column* labels of the term vectors matrix.
|
||||
The row labels are the document ids and the
|
||||
column labels are the terms.
|
||||
|
||||
In the example below, the `getRowLabels` and `getColumnLabels` functions return
|
||||
the row and column labels which are then stored in variables *c* and *d*.
|
||||
The *echo* parameter is echoing variables *c* and *d*, so the output includes
|
||||
the row and column labels.
|
||||
The `termVectors` function builds a matrix from the list of tuples. There is a row in the
|
||||
matrix for each tuple in the list. There is a column in the matrix for each term in the `terms` field.
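
The row-per-tuple, column-per-term layout can be sketched in plain Python as follows. The weighting shown, term count multiplied by `log(N / docFreq) + 1`, is one common textbook TF-IDF variant and is an assumption here, not a statement of the formula Solr applies; the function name `term_vectors` and the sample data are hypothetical.

```python
import math


def term_vectors(docs):
    """Build a TF-IDF matrix from (id, terms) pairs.

    Returns the matrix plus its row labels (ids) and column labels (terms).
    """
    row_labels = [doc_id for doc_id, _ in docs]
    column_labels = sorted({term for _, terms in docs for term in terms})
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = {
        term: sum(1 for _, terms in docs if term in terms)
        for term in column_labels
    }
    matrix = []
    for _, terms in docs:
        row = []
        for term in column_labels:
            tf = terms.count(term)
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            row.append(tf * idf)
        matrix.append(row)
    return matrix, row_labels, column_labels


if __name__ == "__main__":
    docs = [
        ("doc1", ["hello world", "world peace"]),
        ("doc2", ["hello solr", "solr search"]),
    ]
    matrix, rows, cols = term_vectors(docs)
    print(rows, cols, matrix)
```

With two documents and four distinct terms the result has 2 rows and 4 columns, and a term that is absent from a document contributes a 0 in that row.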
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(echo="c, d",
|
||||
a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
|
||||
let(echo="c, d", <1>
|
||||
a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"), <2>
|
||||
id,
|
||||
analyze(subject, subject_bigram) as terms),
|
||||
b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1),
|
||||
c=getRowLabels(b),
|
||||
b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1), <3>
|
||||
c=getRowLabels(b), <4>
|
||||
d=getColumnLabels(b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
The example below builds on the document annotation example.
|
||||
|
||||
<1> The `echo` parameter will echo variables *`c`* and *`d`*, so the output includes
|
||||
the row and column labels, which will be defined later in the expression.
|
||||
<2> The list of tuples is stored in variable *`a`*. The `termVectors` function
|
||||
operates over variable *`a`* and builds a matrix with 2 rows and 4 columns.
|
||||
<3> The `termVectors` function sets the row and column labels of the term vectors matrix as variable *`b`*.
|
||||
The row labels are the document ids and the column labels are the terms.
|
||||
<4> The `getRowLabels` and `getColumnLabels` functions return
|
||||
the row and column labels which are then stored in variables *`c`* and *`d`*.
|
||||
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -188,7 +180,7 @@ let(a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
|
|||
b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -230,8 +222,15 @@ the noisy terms helps keep the term vector matrix small enough to fit comfortabl
|
|||
|
||||
There are four parameters designed to filter noisy terms from the term vector matrix:
|
||||
|
||||
* *minTermLength*: The minimum term length required to include the term in the matrix.
|
||||
* *minDocFreq*: The minimum *percentage* (0 to 1) of documents the term must appear in to be included in the index.
|
||||
* *maxDocFreq*: The maximum *percentage* (0 to 1) of documents the term can appear in to be included in the index.
|
||||
* *exclude*: A comma delimited list of strings used to exclude terms. If a term contains any of the exclude strings that
|
||||
`minTermLength`::
|
||||
The minimum term length required to include the term in the matrix.
|
||||
|
||||
`minDocFreq`::
|
||||
The minimum percentage, expressed as a number between 0 and 1, of documents the term must appear in to be included in the index.
|
||||
|
||||
`maxDocFreq`::
|
||||
The maximum percentage, expressed as a number between 0 and 1, of documents the term can appear in to be included in the index.
|
||||
|
||||
`exclude`::
|
||||
A comma delimited list of strings used to exclude terms. If a term contains any of the exclude strings that
|
||||
term will be excluded from the term vector.
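The four filters above can be sketched in Python to show how they might combine. This is only an illustration, not Solr's implementation: the function names, the inclusive boundary checks, and the sample documents are all assumptions made for the example.

```python
def keep_term(term, doc_freq, num_docs,
              min_term_length, min_doc_freq, max_doc_freq, exclude=()):
    """Return True if a term passes all four noise filters (hypothetical helper)."""
    # minTermLength: drop terms shorter than the minimum length.
    if len(term) < min_term_length:
        return False
    # minDocFreq / maxDocFreq: fraction (0 to 1) of documents containing the term.
    # Treating both bounds as inclusive is an assumption.
    fraction = doc_freq / num_docs
    if fraction < min_doc_freq or fraction > max_doc_freq:
        return False
    # exclude: drop the term if it contains any of the exclude strings.
    return not any(s in term for s in exclude)


def filter_terms(docs, min_term_length, min_doc_freq, max_doc_freq, exclude=()):
    """docs maps a document id to its list of terms; returns the surviving terms."""
    num_docs = len(docs)
    doc_freq = {}
    for terms in docs.values():
        for term in set(terms):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        t for t, df in doc_freq.items()
        if keep_term(t, df, num_docs, min_term_length,
                     min_doc_freq, max_doc_freq, exclude)
    )


# Hypothetical sample data, not taken from the guide.
sample_docs = {"1": ["text", "analysis", "a"], "2": ["text", "mining_x"]}
kept = filter_terms(sample_docs, min_term_length=4,
                    min_doc_freq=0.0, max_doc_freq=0.5, exclude=("_",))
```

In this sketch a term that appears in every document is removed by `max_doc_freq=0.5`, which mirrors the idea of filtering terms that are too common to be informative.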
|
||||
|
|
|
@ -38,7 +38,7 @@ timeseries(collection1,
|
|||
count(*))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -121,7 +121,7 @@ let(a=timeseries(collection1,
|
|||
b=col(a, count(*)))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -192,7 +192,7 @@ let(a=timeseries(collection1,
|
|||
c=movingAvg(b, 3))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -242,7 +242,7 @@ let(a=timeseries(collection1, q=*:*,
|
|||
c=expMovingAvg(b, 3))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -292,7 +292,7 @@ let(a=timeseries(collection1,
|
|||
c=movingMedian(b, 3))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -353,7 +353,7 @@ let(a=timeseries(collection1,
|
|||
c=diff(b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -403,7 +403,7 @@ let(a=array(1,2,5,2,1,2,5,2,1,2,5),
|
|||
b=diff(a, 4))
|
||||
----
|
||||
|
||||
Expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
|
|
@ -16,19 +16,17 @@
|
|||
// specific language governing permissions and limitations
|
||||
// under the License.
|
||||
|
||||
|
||||
== The Let Expression
|
||||
|
||||
The `let` expression sets variables and returns
|
||||
the value of the last variable by default. The output of any streaming expression
|
||||
or math expression can be set to a variable.
|
||||
the value of the last variable by default. The output of any streaming expression or math expression can be set to a variable.
|
||||
|
||||
Below is a simple example setting three variables *a*, *b*
|
||||
and *c*. Variables *a* and *b* are set to arrays. The variable *c* is set
|
||||
Below is a simple example setting three variables *`a`*, *`b`*
|
||||
and *`c`*. Variables *`a`* and *`b`* are set to arrays. The variable *`c`* is set
|
||||
to the output of the `ebeAdd` function which performs element-by-element
|
||||
addition of the two arrays.
|
||||
|
||||
Notice that the last variable, *c*, is returned.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(a=array(1, 2, 3),
|
||||
|
@ -36,8 +34,7 @@ let(a=array(1, 2, 3),
|
|||
c=ebeAdd(a, b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
responds with:
|
||||
In the response, notice that the last variable, *`c`*, is returned:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -62,7 +59,7 @@ responds with:
|
|||
|
||||
== Echoing Variables
|
||||
|
||||
All variables can be output by setting the *echo* variable to *true*.
|
||||
All variables can be output by setting the `echo` variable to `true`.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -72,7 +69,7 @@ let(echo=true,
|
|||
c=ebeAdd(a, b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -106,8 +103,8 @@ responds with:
|
|||
}
|
||||
----
|
||||
|
||||
A specific set of variables can be echoed by providing a comma delimited
|
||||
list of variables to the echo parameter.
|
||||
A specific set of variables can be echoed by providing a comma delimited list of variables to the echo parameter.
|
||||
Because variables have been provided, the `true` value is assumed.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -117,8 +114,7 @@ let(echo="a,b",
|
|||
c=ebeAdd(a, b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -150,13 +146,13 @@ responds with:
|
|||
|
||||
Variables can be cached in-memory on the Solr node where the math expression
|
||||
was run. A cached variable can then be used in future expressions. Any object
|
||||
that can be set to a variable, including data structures and mathematical models can
|
||||
that can be set to a variable, including data structures and mathematical models, can
|
||||
be cached in-memory for future use.
|
||||
|
||||
The `putCache` function adds a variable to the cache.
|
||||
|
||||
In the example below an array is cached in the *workspace* workspace1
|
||||
and bound to the *key* key1. The workspace allows different users to cache
|
||||
In the example below an array is cached in the `workspace` "workspace1"
|
||||
and bound to the `key` "key1". The workspace allows different users to cache
|
||||
objects in their own workspace. The `putCache` function returns
|
||||
the variable that was added to the cache.
|
||||
|
||||
|
@ -168,8 +164,7 @@ let(a=array(1, 2, 3),
|
|||
d=putCache(workspace1, key1, c))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -192,20 +187,16 @@ responds with:
|
|||
}
|
||||
----
|
||||
|
||||
The `getCache` function retrieves an object from the
|
||||
cache by its workspace and key.
|
||||
|
||||
In the example below the `getCache` function retrieves
|
||||
the array the was cached above and assigns it to variable *a*.
|
||||
The `getCache` function retrieves an object from the cache by its workspace and key.
|
||||
|
||||
In the example below the `getCache` function retrieves the array that was cached above and assigns it to variable *`a`*.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(a=getCache(workspace1, key1))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -228,18 +219,16 @@ responds with:
|
|||
}
|
||||
----
|
||||
|
||||
The `listCache` function can be used to list the workspaces or the
|
||||
keys in a specific workspace.
|
||||
The `listCache` function can be used to list the workspaces or the keys in a specific workspace.
|
||||
|
||||
In the example below `listCache` returns all the workspaces in the cache
|
||||
as an array of strings.
|
||||
In the example below `listCache` returns all the workspaces in the cache as an array of strings.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(a=listCache())
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
When this expression is sent to the `/stream` handler it
|
||||
responds with:
|
||||
|
||||
[source,json]
|
||||
|
@ -264,14 +253,12 @@ responds with:
|
|||
|
||||
In the example below all the keys in a specific workspace are listed:
|
||||
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(a=listCache(workspace1))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -296,17 +283,14 @@ The `removeCache` function can be used to remove a a key from a specific
|
|||
workspace. This `removeCache` function removes the key from the cache
|
||||
and returns the object that was removed.
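The workspace and key model behind `putCache`, `getCache`, `listCache` and `removeCache` can be pictured as a dictionary of dictionaries. The Python sketch below is only a mental model under that assumption; the class name, method names, sorted listing order, and the `None` result for a missing key are hypothetical and are not taken from Solr.

```python
class WorkspaceCache:
    """Toy in-memory cache keyed by workspace, then by key (illustrative only)."""

    def __init__(self):
        self._workspaces = {}

    def put_cache(self, workspace, key, value):
        # Store the value and return it, as putCache returns the cached variable.
        self._workspaces.setdefault(workspace, {})[key] = value
        return value

    def get_cache(self, workspace, key):
        # Returning None for a missing entry is an assumption of this sketch.
        return self._workspaces.get(workspace, {}).get(key)

    def list_cache(self, workspace=None):
        # With no argument list the workspaces, otherwise the keys in one workspace.
        if workspace is None:
            return sorted(self._workspaces)
        return sorted(self._workspaces.get(workspace, {}))

    def remove_cache(self, workspace, key):
        # Remove the key and return the object that was removed.
        return self._workspaces.get(workspace, {}).pop(key, None)


cache = WorkspaceCache()
stored = cache.put_cache("workspace1", "key1", [2, 4, 6])
workspaces = cache.list_cache()
keys = cache.list_cache("workspace1")
removed = cache.remove_cache("workspace1", "key1")
```

Separating workspaces from keys is what lets different users reuse the same key name without overwriting each other's objects.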
|
||||
|
||||
In the example below the array that was cached above is removed from the
|
||||
cache.
|
||||
|
||||
In the example below the array that was cached above is removed from the cache.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
let(a=removeCache(workspace1, key1))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it
|
||||
responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
|
|
@ -16,23 +16,20 @@
|
|||
// specific language governing permissions and limitations
|
||||
// under the License.
|
||||
|
||||
This section of the user guide covers vector math and
|
||||
vector manipulation functions.
|
||||
This section covers vector math and vector manipulation functions.
|
||||
|
||||
== Arrays
|
||||
|
||||
Arrays can be created with the `array` function.
|
||||
|
||||
For example the expression below creates a numeric array with
|
||||
three elements:
|
||||
For example, the expression below creates a numeric array with three elements:
|
||||
|
||||
[source,text]
|
||||
----
|
||||
array(1, 2, 3)
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with
|
||||
a json array.
|
||||
When this expression is sent to the `/stream` handler it responds with a JSON array:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -66,7 +63,7 @@ For example, an array can be reversed with the `rev` function:
|
|||
rev(array(1, 2, 3))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -89,15 +86,14 @@ When this expression is sent to the /stream handler it responds with:
|
|||
}
|
||||
----
|
||||
|
||||
Another example is the `length` function,
|
||||
which returns the length of an array:
|
||||
Another example is the `length` function, which returns the length of an array:
|
||||
|
||||
[source,text]
|
||||
----
|
||||
length(array(1, 2, 3))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -124,7 +120,7 @@ copies elements of an array from a start and end range.
|
|||
copyOfRange(array(1,2,3,4,5,6), 1, 4)
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -149,21 +145,18 @@ When this expression is sent to the /stream handler it responds with:
|
|||
|
||||
== Vector Summarizations and Norms
|
||||
|
||||
There are a set of functions that perform
|
||||
summerizations and return norms of arrays. These functions
|
||||
operate over an array and return a single
|
||||
value. The following vector summarizations and norm functions are available:
|
||||
There is a set of functions that perform summarizations and return norms of arrays. These functions
|
||||
operate over an array and return a single value. The following vector summarizations and norm functions are available:
|
||||
`mult`, `add`, `sumSq`, `mean`, `l1norm`, `l2norm`, `linfnorm`.
|
||||
|
||||
The example below is using the `mult` function,
|
||||
which multiples all the values of an array.
|
||||
The example below shows the `mult` function, which multiplies all the values of an array.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
mult(array(2,4,8))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -184,14 +177,14 @@ When this expression is sent to the /stream handler it responds with:
|
|||
|
||||
The vector norm functions provide different formulas for calculating vector magnitude.
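For reference, the standard mathematical definitions of the three norms can be written in a few lines of Python. This sketch assumes the Solr functions `l1norm`, `l2norm` and `linfnorm` follow those textbook definitions; it does not reproduce Solr's code.

```python
import math


def l1norm(v):
    """Sum of absolute values (Manhattan norm)."""
    return sum(abs(x) for x in v)


def l2norm(v):
    """Square root of the sum of squares (Euclidean norm)."""
    return math.sqrt(sum(x * x for x in v))


def linfnorm(v):
    """Largest absolute value (maximum norm)."""
    return max(abs(x) for x in v)


vector = [2, 4, 8]
norms = (l1norm(vector), l2norm(vector), linfnorm(vector))
```

For the array `2, 4, 8` the sum of squares is 84, so the Euclidean norm under this definition is the square root of 84.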
|
||||
|
||||
The example below calculates the *l2norm* of an array.
|
||||
The example below calculates the `l2norm` of an array.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
l2norm(array(2,4,8))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -212,12 +205,11 @@ When this expression is sent to the /stream handler it responds with:
|
|||
|
||||
== Scalar Vector Math
|
||||
|
||||
Scalar vector math functions add, subtract, multiple or divide a scalar value with every value in a vector.
|
||||
Scalar vector math functions add, subtract, multiply or divide a scalar value with every value in a vector.
|
||||
The following functions perform these operations: `scalarAdd`, `scalarSubtract`, `scalarMultiply`
|
||||
and `scalarDivide`.
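As a rough illustration, scalar addition and multiplication can be sketched in Python as list comprehensions. The Python function names are hypothetical. Subtraction and division are left out because the operand order is not stated in this section.

```python
def scalar_add(scalar, vector):
    """Add the scalar to every value in the vector."""
    return [scalar + x for x in vector]


def scalar_multiply(scalar, vector):
    """Multiply every value in the vector by the scalar."""
    return [scalar * x for x in vector]


tripled = scalar_multiply(3, [1, 2, 3])
shifted = scalar_add(10, [1, 2, 3])
```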
|
||||
|
||||
|
||||
Below is an example of the `scalarMultiply` function, which multiplies the scalar value 3 with
|
||||
Below is an example of the `scalarMultiply` function, which multiplies the scalar value `3` with
|
||||
every value of an array.
|
||||
|
||||
[source,text]
|
||||
|
@ -225,7 +217,7 @@ every value of an array.
|
|||
scalarMultiply(3, array(1,2,3))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -251,7 +243,7 @@ When this expression is sent to the /stream handler it responds with:
|
|||
== Element-By-Element Vector Math
|
||||
|
||||
Two vectors can be added, subtracted, multiplied and divided using element-by-element
|
||||
vector math functions. The following element-by-element vector math functions are:
|
||||
vector math functions. The available element-by-element vector math functions are:
|
||||
`ebeAdd`, `ebeSubtract`, `ebeMultiply`, `ebeDivide`.
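The idea of element-by-element math can be sketched in Python by pairing values at the same index. The helper name `ebe` is hypothetical, and raising an error for arrays of different lengths is an assumption of this sketch, not documented Solr behavior.

```python
import operator


def ebe(op, a, b):
    """Apply a binary operator to values at matching positions of two arrays."""
    if len(a) != len(b):
        # Assumption: mismatched lengths are treated as an error in this sketch.
        raise ValueError("arrays must have the same length")
    return [op(x, y) for x, y in zip(a, b)]


added = ebe(operator.add, [1, 2, 3], [4, 5, 6])
subtracted = ebe(operator.sub, [10, 15, 20], [1, 2, 3])
multiplied = ebe(operator.mul, [1, 2, 3], [4, 5, 6])
divided = ebe(operator.truediv, [10, 20, 30], [2, 4, 5])
```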
|
||||
|
||||
The expression below performs the element-by-element subtraction of two arrays.
|
||||
|
@ -261,7 +253,7 @@ The expression below performs the element-by-element subtraction of two arrays.
|
|||
ebeSubtract(array(10, 15, 20), array(1,2,3))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -297,7 +289,7 @@ Below is an example of the `dotProduct` function:
|
|||
dotProduct(array(2,3,0,0,0,1), array(2,0,1,0,0,3))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -323,7 +315,7 @@ Below is an example of the `cosineSimilarity` function:
|
|||
cosineSimilarity(array(2,3,0,0,0,1), array(2,0,1,0,0,3))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -340,4 +332,4 @@ When this expression is sent to the /stream handler it responds with:
|
|||
]
|
||||
}
|
||||
}
|
||||
----
|
||||
----
|
||||
|
|
|
@ -18,11 +18,10 @@
|
|||
|
||||
This section of the user guide explores techniques
|
||||
for retrieving streams of data from Solr and vectorizing the
|
||||
*numeric* fields.
|
||||
numeric fields.
|
||||
|
||||
The next chapter of the user guide covers
|
||||
Text Analysis and Term Vectors which describes how to
|
||||
vectorize *text* fields.
|
||||
See the section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> which describes how to
|
||||
vectorize text fields.
|
||||
|
||||
== Streams
|
||||
|
||||
|
@ -32,42 +31,42 @@ to vectorize and analyze the results sets.
|
|||
|
||||
Below are some of the key stream sources:
|
||||
|
||||
* *random*: Random sampling is widely used in statistics, probability and machine learning.
|
||||
* *`random`*: Random sampling is widely used in statistics, probability and machine learning.
|
||||
The `random` function returns a random sample of search results that match a
|
||||
query. The random samples can be vectorized and operated on by math expressions and the results
|
||||
can be used to describe and make inferences about the entire population.
|
||||
|
||||
* *timeseries*: The `timeseries`
|
||||
* *`timeseries`*: The `timeseries`
|
||||
expression provides fast distributed time series aggregations, which can be
|
||||
vectorized and analyzed with math expressions.
|
||||
|
||||
* *knnSearch*: K-nearest neighbor is a core machine learning algorithm. The `knnSearch`
|
||||
* *`knnSearch`*: K-nearest neighbor is a core machine learning algorithm. The `knnSearch`
|
||||
function is a specialized knn algorithm optimized to find the k-nearest neighbors of a document in
|
||||
a distributed index. Once the nearest neighbors are retrieved they can be vectorized
|
||||
and operated on by machine learning and text mining algorithms.
|
||||
|
||||
* *sql*: SQL is the primary query language used by data scientists. The `sql` function supports
|
||||
* *`sql`*: SQL is the primary query language used by data scientists. The `sql` function supports
|
||||
data retrieval using a subset of SQL which includes both full text search and
|
||||
fast distributed aggregations. The result sets can then be vectorized and operated
|
||||
on by math expressions.
|
||||
|
||||
* *jdbc*: The `jdbc` function allows data from any JDBC compliant data source to be combined with
|
||||
* *`jdbc`*: The `jdbc` function allows data from any JDBC compliant data source to be combined with
|
||||
streams originating from Solr. Result sets from outside data sources can be vectorized and operated
|
||||
on by math expressions in the same manner as result sets originating from Solr.
|
||||
|
||||
* *topic*: Messaging is an important foundational technology for large scale computing. The `topic`
|
||||
* *`topic`*: Messaging is an important foundational technology for large scale computing. The `topic`
|
||||
function provides publish/subscribe messaging capabilities by treating
|
||||
Solr Cloud as a distributed message queue. Topics are extremely powerful
|
||||
because they allow subscription by query. Topics can be used to support a broad set of
|
||||
use cases including bulk text mining operations and AI alerting.
|
||||
|
||||
* *nodes*: Graph queries are frequently used by recommendation engines and are an important
|
||||
* *`nodes`*: Graph queries are frequently used by recommendation engines and are an important
|
||||
machine learning tool. The `nodes` function provides fast, distributed, breadth
|
||||
first graph traversal over documents in a Solr Cloud collection. The node sets collected
|
||||
by the `nodes` function can be operated on by statistical and machine learning expressions to
|
||||
gain more insight into the graph.
|
||||
|
||||
* *search*: Ranked search results are a powerful tool for finding the most relevant
|
||||
* *`search`*: Ranked search results are a powerful tool for finding the most relevant
|
||||
documents from a large document corpus. The `search` expression
|
||||
returns the top N ranked search results that match any
|
||||
Solr query, including geo-spatial queries. The smaller set of relevant
|
||||
|
@ -79,7 +78,7 @@ text mining expressions to gather insights about the data set.
|
|||
The output of any streaming expression can be set to a variable.
|
||||
Below is a very simple example using the `random` function to fetch
|
||||
three random samples from collection1. The random samples are returned
|
||||
as *tuples*, which contain name/value pairs.
|
||||
as tuples which contain name/value pairs.
|
||||
|
||||
|
||||
[source,text]
|
||||
|
@ -87,7 +86,7 @@ as *tuples*, which contain name/value pairs.
|
|||
let(a=random(collection1, q="*:*", rows="3", fl="price_f"))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -116,10 +115,10 @@ When this expression is sent to the /stream handler it responds with:
|
|||
}
|
||||
----
|
||||
|
||||
== Creating a Vector with the *col* Function
|
||||
== Creating a Vector with the col Function
|
||||
|
||||
The `col` function iterates over a list of tuples and copies the values
|
||||
from a specific column into an *array*.
|
||||
from a specific column into an array.
|
||||
|
||||
The output of the `col` function is a numeric array that can be set to a
|
||||
variable and operated on by math expressions.
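Conceptually, `col` turns one field of a list of tuples into a plain numeric vector. The Python sketch below models tuples as dictionaries; the sample values are invented for illustration, and converting every value to `float` is an assumption.

```python
def col(tuples, field):
    """Copy the values of one field from a list of tuples into a numeric list."""
    return [float(t[field]) for t in tuples]


def mean(vector):
    """Arithmetic mean of a non-empty vector."""
    return sum(vector) / len(vector)


# Hypothetical tuples shaped like name/value pairs returned by a stream.
sample = [{"price_f": 1.5}, {"price_f": 2.5}, {"price_f": 5.0}]
prices = col(sample, "price_f")
average = mean(prices)
```

Once the values are in a flat list, any function that expects a vector, such as the mean above, can be applied to it.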
|
||||
|
@ -157,7 +156,7 @@ let(a=random(collection1, q="*:*", rows="3", fl="price_f"),
|
|||
|
||||
Once a vector has been created any math expression that operates on vectors
|
||||
can be applied. In the example below the `mean` function is applied to
|
||||
the vector assigned to variable *b*.
|
||||
the vector assigned to variable *`b`*.
|
||||
|
||||
[source,text]
|
||||
----
|
||||
|
@ -166,7 +165,7 @@ let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
|
|||
c=mean(b))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -191,13 +190,14 @@ Matrices can be created by vectorizing multiple numeric fields
|
|||
and adding them to a matrix. The matrices can then be operated on by
|
||||
any math expression that operates on matrices.
|
||||
|
||||
[TIP]
|
||||
====
|
||||
Note that this section deals with the creation of matrices
|
||||
from numeric data. The next chapter of the user guide covers
|
||||
Text Analysis and Term Vectors which describes how to build TF-IDF
|
||||
term vector matrices from text fields.
|
||||
from numeric data. The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> describes how to build TF-IDF term vector matrices from text fields.
|
||||
====
|
||||
|
||||
Below is a simple example where four random samples are taken
|
||||
from different sub-populations in the data. The *price_f* field of
|
||||
from different sub-populations in the data. The `price_f` field of
|
||||
each random sample is
|
||||
vectorized and the vectors are added as rows to a matrix.
|
||||
Then the `sumRows`
|
||||
|
@ -218,7 +218,7 @@ let(a=random(collection1, q="market:A", rows="5000", fl="price_f"),
|
|||
j=sumRows(i))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -244,14 +244,14 @@ When this expression is sent to the /stream handler it responds with:
|
|||
|
||||
== Latitude / Longitude Vectors
|
||||
|
||||
The `latlonVectors` function wraps a list of tuples and parses a lat/long location field into
|
||||
The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into
|
||||
a matrix of lat/lon vectors. Each row in the matrix is a vector that contains the lat/lon
|
||||
pair for the corresponding tuple in the list. The row labels for the matrix are
|
||||
automatically set to the *id* field in the tuples. The the lat/lon matrix can then be operated
|
||||
on by distance based machine learning functions using the `haversineMeters` distance measure.
|
||||
automatically set to the `id` field in the tuples. The lat/lon matrix can then be operated
|
||||
on by distance-based machine learning functions using the `haversineMeters` distance measure.
|
||||
|
||||
The `latlonVectors` function takes two parameters: a list of tuples and a named parameter called
|
||||
*field*. The field parameter tells the `latlonVectors` function which field to parse the lat/lon
|
||||
`field`, which tells the `latlonVectors` function which field to parse the lat/lon
|
||||
vectors from.
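The shape of the result can be sketched in Python as a list of row labels plus a matrix of `[lat, lon]` rows. This sketch assumes the location field is stored as a `"lat,lon"` string, which is not stated in this section. The function name and sample tuples are hypothetical.

```python
def latlon_vectors(tuples, field):
    """Parse a 'lat,lon' string field into row labels and a matrix of [lat, lon] rows."""
    row_labels = []
    matrix = []
    for t in tuples:
        # Assumption: the field value looks like "42.906,-78.875".
        lat_text, lon_text = t[field].split(",")
        row_labels.append(t["id"])  # row labels come from the id field
        matrix.append([float(lat_text), float(lon_text)])
    return row_labels, matrix


# Hypothetical tuples, invented for illustration.
sample = [
    {"id": "doc1", "loc_p": "42.9,-78.8"},
    {"id": "doc2", "loc_p": "40.7,-74.0"},
]
labels, rows = latlon_vectors(sample, field="loc_p")
```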
|
||||
|
||||
Below is an example of the `latlonVectors` function.
|
||||
|
@ -262,7 +262,7 @@ let(a=random(collection1, q="*:*", fl="id, loc_p", rows="5"),
|
|||
b=latlonVectors(a, field="loc_p"))
|
||||
----
|
||||
|
||||
When this expression is sent to the /stream handler it responds with:
|
||||
When this expression is sent to the `/stream` handler it responds with:
|
||||
|
||||
[source,json]
|
||||
----
|
||||
|
@ -301,5 +301,3 @@ When this expression is sent to the /stream handler it responds with:
|
|||
}
|
||||
}
|
||||
----
|
||||
|
||||
|
||||
|
|