mirror of https://github.com/apache/lucene.git

SOLR-12701: format/style consistency fixes for math expression docs; CSS change to make bold monospace appear properly

commit a619038e90
parent a1b6db26db
@@ -885,6 +885,11 @@ h6 strong
   line-height: 1.45;
 }
 
+p strong code,
+td strong code {
+  font-weight: bold;
+}
+
 pre,
 pre > code
 {
@@ -21,11 +21,11 @@
 
 The `polyfit` function is a general purpose curve fitter used to model
-the *non-linear* relationship between two random variables.
+the non-linear relationship between two random variables.
 
-The `polyfit` function is passed *x* and *y* axises and fits a smooth curve to the data.
-If only a single array is provided it is treated as the *y* axis and a sequence is generated
-for the *x* axis.
+The `polyfit` function is passed x- and y-axes and fits a smooth curve to the data.
+If only a single array is provided it is treated as the y-axis and a sequence is generated
+for the x-axis.
 
 The `polyfit` function also has a parameter the specifies the degree of the polynomial. The higher
 the degree the more curves that can be modeled.
@@ -34,7 +34,7 @@ The example below uses the `polyfit` function to fit a curve to an array using
 a 3 degree polynomial. The fitted curve is then subtracted from the original curve. The output
 shows the error between the fitted curve and the original curve, known as the residuals.
 The output also includes the sum-of-squares of the residuals which provides a measure
-of how large the error is..
+of how large the error is.
 
 [source,text]
 ----
@@ -45,7 +45,7 @@ let(echo="residuals, sumSqError",
     sumSqError=sumSq(residuals))
 ----
 
-When this expression is sent to the /stream handler it
+When this expression is sent to the `/stream` handler it
 responds with:
 
 [source,json]
@@ -95,7 +95,7 @@ let(echo="residuals, sumSqError",
     sumSqError=sumSq(residuals))
 ----
 
-When this expression is sent to the /stream handler it
+When this expression is sent to the `/stream` handler it
 responds with:
 
 [source,json]
@@ -138,10 +138,10 @@ responds with:
 The `polyfit` function returns a function that can be used with the `predict`
 function.
 
-In the example below the x axis is included for clarity.
+In the example below the x-axis is included for clarity.
 The `polyfit` function returns a function for the fitted curve.
 The `predict` function is then used to predict a value along the curve, in this
-case the prediction is made for the *x* value of 5.
+case the prediction is made for the *`x`* value of 5.
 
 [source,text]
 ----
@@ -151,7 +151,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
     p=predict(curve, 5))
 ----
 
-When this expression is sent to the /stream handler it
+When this expression is sent to the `/stream` handler it
 responds with:
 
 [source,json]
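The `polyfit`/`predict` pair this hunk documents can be sketched in plain Python. This is an illustrative least-squares fit using the normal equations, not Solr's actual (Java) implementation, and the sample cubic and data points are invented for the demonstration:

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (illustrative only)."""
    n = degree + 1
    # a[i][j] = sum of x^(i+j); b[i] = sum of y * x^i
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            f = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= f * a[col][k]
            b[row] -= f * b[col]
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs  # coeffs[i] multiplies x**i

def predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

xs = list(range(10))
ys = [0.5 * x ** 3 - 2 * x ** 2 + 3 * x + 1 for x in xs]  # an exact cubic
curve = polyfit(xs, ys, 3)
residuals = [y - predict(curve, x) for x, y in zip(xs, ys)]
sum_sq_error = sum(r * r for r in residuals)
```

Because the data comes from an exact degree-3 polynomial, the residuals and their sum of squares are near zero, mirroring the `residuals`/`sumSq` output shown in the documentation.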
@@ -185,7 +185,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
     d=derivative(curve))
 ----
 
-When this expression is sent to the /stream handler it
+When this expression is sent to the `/stream` handler it
 responds with:
 
 [source,json]
@@ -235,7 +235,7 @@ let(x=array(0,1,2,3,4,5,6,7,8,9, 10),
     f=gaussfit(x, y))
 ----
 
-When this expression is sent to the /stream handler it
+When this expression is sent to the `/stream` handler it
 responds with:
 
 [source,json]
@@ -283,7 +283,7 @@ let(x=array(0,1,2,3,4,5,6,7,8,9, 10),
 
 ----
 
-When this expression is sent to the /stream handler it
+When this expression is sent to the `/stream` handler it
 responds with:
 
 [source,json]
@@ -30,17 +30,17 @@ the more advanced DSP functions, its useful to get a better understanding of how
 The `dotProduct` function can be used to combine two arrays into a single product. A simple example can help
 illustrate this concept.
 
-In the example below two arrays are set to variables *a* and *b* and then operated on by the `dotProduct` function.
-The output of the `dotProduct` function is set to variable *c*.
+In the example below two arrays are set to variables *`a`* and *`b`* and then operated on by the `dotProduct` function.
+The output of the `dotProduct` function is set to variable *`c`*.
 
-Then the `mean` function is then used to compute the mean of the first array which is set to the variable `d`.
+Then the `mean` function is then used to compute the mean of the first array which is set to the variable *`d`*.
 
-Both the *dot product* and the *mean* are included in the output.
+Both the dot product and the mean are included in the output.
 
-When we look at the output of this expression we see that the *dot product* and the *mean* of the first array
+When we look at the output of this expression we see that the dot product and the mean of the first array
 are both 30.
 
-The dot product function *calculated the mean* of the first array.
+The `dotProduct` function calculated the mean of the first array.
 
 [source,text]
 ----
@@ -51,7 +51,7 @@ let(echo="c, d",
     d=mean(a))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
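The effect this hunk describes (a dot product with equal weights of `1/n` computing the mean) can be checked with a small Python sketch. The arrays here are invented to match the "both 30" outcome described in the docs; this is not Solr's implementation:

```python
def dot_product(a, b):
    # Multiply element-by-element, then sum the products.
    return sum(x * y for x, y in zip(a, b))

a = [10, 20, 30, 40, 50]
b = [0.2] * 5          # equal weights: each element contributes 1/5
c = dot_product(a, b)  # equals the mean of a, up to float rounding
d = sum(a) / len(a)    # the mean computed directly
```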
@@ -76,9 +76,9 @@ calculation using vector math and look at the output of each step.
 
 In the example below the `ebeMultiply` function performs an element-by-element multiplication of
 two arrays. This is the first step of the dot product calculation. The result of the element-by-element
-multiplication is assigned to variable *c*.
+multiplication is assigned to variable *`c`*.
 
-In the next step the `add` function adds all the elements of the array in variable *c*.
+In the next step the `add` function adds all the elements of the array in variable *`c`*.
 
 Notice that multiplying each element of the first array by .2 and then adding the results is
 equivalent to the formula for computing the mean of the first array. The formula for computing the mean
@@ -95,7 +95,7 @@ let(echo="c, d",
     d=add(c))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
@@ -122,11 +122,13 @@ When this expression is sent to the /stream handler it responds with:
 ----
 
 In the example above two arrays were combined in a way that produced the mean of the first. In the second array
-each value was set to .2. Another way of looking at this is that each value in the second array has the same weight.
-By varying the weights in the second array we can produce a different result. For example if the first array represents a time series,
+each value was set to ".2". Another way of looking at this is that each value in the second array has the same weight.
+By varying the weights in the second array we can produce a different result.
+For example if the first array represents a time series,
 the weights in the second array can be set to add more weight to a particular element in the first array.
 
-The example below creates a weighted average with the weight decreasing from right to left. Notice that the weighted mean
+The example below creates a weighted average with the weight decreasing from right to left.
+Notice that the weighted mean
 of 36.666 is larger than the previous mean which was 30. This is because more weight was given to last element in the
 array.
 
@@ -139,7 +141,7 @@ let(echo="c, d",
     d=add(c))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
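The weighted-average idea in the hunk above is just a dot product with unequal weights. A minimal Python sketch (the arrays and weights are invented, so the numbers differ from the 36.666 in the docs):

```python
def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [10, 20, 30, 40, 50]
# Weights sum to 1 and increase toward the last element,
# so later elements pull the weighted mean upward.
w = [0.1, 0.15, 0.2, 0.25, 0.3]
weighted_mean = dot_product(a, w)
plain_mean = sum(a) / len(a)
```

With more weight on the larger trailing values, `weighted_mean` (35) exceeds the plain mean (30), the same effect the documentation describes.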
@@ -167,13 +169,13 @@ When this expression is sent to the /stream handler it responds with:
 
 === Representing Correlation
 
-Often when we think of correlation, we are thinking of *Pearsons* correlation in the field of statistics. But the definition of
+Often when we think of correlation, we are thinking of _Pearson correlation_ in the field of statistics. But the definition of
 correlation is actually more general: a mutual relationship or connection between two or more things.
 In the field of digital signal processing the dot product is used to represent correlation. The examples below demonstrates
 how the dot product can be used to represent correlation.
 
 In the example below the dot product is computed for two vectors. Notice that the vectors have different values that fluctuate
-together. The output of the dot product is 190, which is hard to reason about because because its not scaled.
+together. The output of the dot product is 190, which is hard to reason about because it's not scaled.
 
 [source,text]
 ----
@@ -183,7 +185,7 @@ let(echo="c, d",
     c=dotProduct(a, b))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
@@ -206,9 +208,9 @@ One approach to scaling the dot product is to first scale the vectors so that bo
 magnitude of 1, also called unit vectors, are used when comparing only the angle between vectors rather then the magnitude.
 The `unitize` function can be used to unitize the vectors before calculating the dot product.
 
-Notice in the example below the dot product result, set to variable *e*, is effectively 1. When applied to unit vectors the dot product
-will be scaled between 1 and -1. Also notice in the example `cosineSimilarity` is calculated on the *unscaled* vectors and the
-answer is also effectively 1. This is because *cosine similarity* is a scaled *dot product*.
+Notice in the example below the dot product result, set to variable *`e`*, is effectively 1. When applied to unit vectors the dot product
+will be scaled between 1 and -1. Also notice in the example `cosineSimilarity` is calculated on the unscaled vectors and the
+answer is also effectively 1. This is because cosine similarity is a scaled dot product.
 
 
 [source,text]
@@ -222,7 +224,7 @@ let(echo="e, f",
     f=cosineSimilarity(a, b))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
@@ -254,7 +256,7 @@ let(echo="c, d",
     c=cosineSimilarity(a, b))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
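The claim in these hunks (cosine similarity equals the dot product of unitized vectors) is easy to verify with a Python sketch. The vectors are invented; this mirrors what Solr's `unitize`, `dotProduct`, and `cosineSimilarity` compute but is not their implementation:

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def magnitude(v):
    return math.sqrt(dot_product(v, v))

def unitize(v):
    # Scale the vector to magnitude 1.
    m = magnitude(v)
    return [x / m for x in v]

def cosine_similarity(a, b):
    # Cosine similarity is the dot product scaled by both magnitudes.
    return dot_product(a, b) / (magnitude(a) * magnitude(b))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]                      # same direction, twice the magnitude
e = dot_product(unitize(a), unitize(b))  # dot product of the unit vectors
f = cosine_similarity(a, b)              # computed on the unscaled vectors
```

Both `e` and `f` are effectively 1 because the vectors point in the same direction, regardless of magnitude.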
@@ -275,10 +277,10 @@ When this expression is sent to the /stream handler it responds with:
 
 == Convolution
 
-The `conv` function calculates the convolution of two vectors. The convolution is calculated by *reversing*
-the second vector and sliding it across the first vector. The *dot product* of the two vectors
+The `conv` function calculates the convolution of two vectors. The convolution is calculated by reversing
+the second vector and sliding it across the first vector. The dot product of the two vectors
 is calculated at each point as the second vector is slid across the first vector.
-The dot products are collected in a *third vector* which is the *convolution* of the two vectors.
+The dot products are collected in a third vector which is the convolution of the two vectors.
 
 === Moving Average Function
 
@@ -290,7 +292,7 @@ is syntactic sugar for convolution.
 Below is an example of a moving average with a window size of 5. Notice that original vector has 13 elements
 but the result of the moving average has only 9 elements. This is because the `movingAvg` function
 only begins generating results when it has a full window. In this case because the window size is 5 so the
-moving average starts generating results from the 4th index of the original array.
+moving average starts generating results from the 4^th^ index of the original array.
 
 [source,text]
 ----
@@ -298,7 +300,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
     b=movingAvg(a, 5))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
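The `movingAvg` behavior in this hunk (13 inputs, window 5, 9 outputs) can be sketched in Python using the same array as the documentation. This is an illustrative full-window average, not Solr's implementation:

```python
def moving_avg(a, window):
    # Only full windows produce output, so there are len(a) - window + 1 results.
    return [sum(a[i:i + window]) / window for i in range(len(a) - window + 1)]

a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = moving_avg(a, 5)   # 13 inputs yield 9 outputs
```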
@@ -344,7 +346,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
     c=conv(a, b))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
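The reverse-and-slide description of convolution in the section above can be sketched directly. This is an illustrative full ("same as the docs describe, every overlap position") convolution with invented inputs, not Solr's `conv` implementation:

```python
def conv(a, b):
    # Full convolution: at each output position k, take the dot product of a
    # with b reversed and shifted, counting only the overlapping elements.
    out = []
    for k in range(len(a) + len(b) - 1):
        s = 0.0
        for j, bv in enumerate(b):
            i = k - j
            if 0 <= i < len(a):
                s += a[i] * bv
        out.append(s)
    return out

a = [1, 2, 3]
b = [0.5, 0.5]   # averaging kernel of width 2
c = conv(a, b)   # one output per overlap position: len(a) + len(b) - 1 values
```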
@@ -381,7 +383,7 @@ When this expression is sent to the /stream handler it responds with:
 }
 ----
 
-We achieve the same result as the `movingAvg` gunction by using the `copyOfRange` function to copy a range of
+We achieve the same result as the `movingAvg` function by using the `copyOfRange` function to copy a range of
 the result that drops the first and last 4 values of
 the convolution result. In the example below the `precision` function is also also used to remove floating point errors from the
 convolution result. When this is added the output is exactly the same as the `movingAvg` function.
@@ -395,7 +397,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
     e=precision(d, 2))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
@@ -446,7 +448,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
     c=conv(a, rev(b)))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
@@ -504,7 +506,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
     c=finddelay(a, b))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
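The `conv(a, rev(b))` / `finddelay` pattern in the hunks above is cross-correlation: convolving with the reversed second vector and locating the peak gives the lag at which the two signals best align. A hedged Python sketch with invented signals (the peak-offset convention is an assumption; Solr's `finddelay` is not necessarily implemented this way):

```python
def conv(a, b):
    # Full convolution: reverse b and slide it across a, collecting a dot
    # product over the overlapping elements at each position.
    out = []
    for k in range(len(a) + len(b) - 1):
        s = 0.0
        for j, bv in enumerate(b):
            i = k - j
            if 0 <= i < len(a):
                s += a[i] * bv
        out.append(s)
    return out

def finddelay(a, b):
    # Cross-correlate by convolving a with the reversed b, then locate the
    # peak; subtracting len(b) - 1 converts the peak index into a lag.
    c = conv(a, list(reversed(b)))
    peak = max(range(len(c)), key=lambda i: c[i])
    return peak - (len(b) - 1)

b = [1, 2, 3, 2, 1]
a = [0, 0, 0, 1, 2, 3, 2, 1]   # the same pulse, delayed by 3 samples
delay = finddelay(a, b)
```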
|
@ -26,13 +26,12 @@ Before performing machine learning operations its often necessary to
|
||||||
scale the feature vectors so they can be compared at the same scale.
|
scale the feature vectors so they can be compared at the same scale.
|
||||||
|
|
||||||
All the scaling function operate on vectors and matrices.
|
All the scaling function operate on vectors and matrices.
|
||||||
When operating on a matrix the *rows* of the matrix are scaled.
|
When operating on a matrix the rows of the matrix are scaled.
|
||||||
|
|
||||||
=== Min/Max Scaling
|
=== Min/Max Scaling
|
||||||
|
|
||||||
The `minMaxScale` function scales a vector or matrix between a min and
|
The `minMaxScale` function scales a vector or matrix between a minimum and maximum value.
|
||||||
max value. By default it will scale between 0 and 1 if min/max values
|
By default it will scale between 0 and 1 if min/max values are not provided.
|
||||||
are not provided.
|
|
||||||
|
|
||||||
Below is a simple example of min/max scaling between 0 and 1.
|
Below is a simple example of min/max scaling between 0 and 1.
|
||||||
Notice that once brought into the same scale the vectors are the same.
|
Notice that once brought into the same scale the vectors are the same.
|
||||||
|
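The min/max scaling this hunk documents has a one-line formula. An illustrative Python sketch with invented vectors (not Solr's implementation), showing the doc's point that vectors with the same shape become identical once scaled:

```python
def min_max_scale(v, lo=0.0, hi=1.0):
    # Map min(v) to lo and max(v) to hi, interpolating linearly in between.
    vmin, vmax = min(v), max(v)
    return [lo + (x - vmin) * (hi - lo) / (vmax - vmin) for x in v]

a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
scaled_a = min_max_scale(a)
scaled_b = min_max_scale(b)   # same shape, so the scaled vectors match
```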
@@ -79,10 +78,10 @@ This expression returns the following response:
 
 === Standardization
 
-The `standardize` function scales a vector so that it has a
-mean of 0 and a standard deviation of 1. Standardization can be
-used with machine learning algorithms, such as SVM, that
-perform better when the data has a normal distribution.
+The `standardize` function scales a vector so that it has a mean of 0 and a standard deviation of 1.
+Standardization can be used with machine learning algorithms, such as
+https://en.wikipedia.org/wiki/Support_vector_machine[Support Vector Machine (SVM)], that perform better
+when the data has a normal distribution.
 
 [source,text]
 ----
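Standardization as described here is the classic z-score transform. A hedged Python sketch (the choice of population rather than sample standard deviation is an assumption, and the data is invented):

```python
import math

def standardize(v):
    mean = sum(v) / len(v)
    # Population standard deviation, as typically used for z-scores
    # (assumption: Solr's standardize may differ in this detail).
    std = math.sqrt(sum((x - mean) ** 2 for x in v) / len(v))
    return [(x - mean) / std for x in v]

s = standardize([2, 4, 4, 4, 5, 5, 7, 9])   # this sample has mean 5, std 2
```

After the transform the vector has mean 0 and standard deviation 1, as the documentation states.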
|
@ -127,8 +126,7 @@ This expression returns the following response:
|
||||||
=== Unit Vectors
|
=== Unit Vectors
|
||||||
|
|
||||||
The `unitize` function scales vectors to a magnitude of 1. A vector with a
|
The `unitize` function scales vectors to a magnitude of 1. A vector with a
|
||||||
magnitude of 1 is known as a unit vector. Unit vectors are
|
magnitude of 1 is known as a unit vector. Unit vectors are preferred when the vector math deals
|
||||||
preferred when the vector math deals
|
|
||||||
with vector direction rather than magnitude.
|
with vector direction rather than magnitude.
|
||||||
|
|
||||||
[source,text]
|
[source,text]
|
||||||
|
@@ -173,24 +171,20 @@ This expression returns the following response:
 
 == Distance and Distance Measures
 
-The `distance` function computes the distance for two
-numeric arrays or a *distance matrix* for the columns of a matrix.
+The `distance` function computes the distance for two numeric arrays or a distance matrix for the columns of a matrix.
 
-There are four distance measure functions that return a function
-that performs the actual distance calculation:
+There are five distance measure functions that return a function that performs the actual distance calculation:
 
-* euclidean (default)
-* manhattan
-* canberra
-* earthMovers
-* haversineMeters (Geospatial distance measure)
+* `euclidean` (default)
+* `manhattan`
+* `canberra`
+* `earthMovers`
+* `haversineMeters` (Geospatial distance measure)
 
 The distance measure functions can be used with all machine learning functions
-that support different distance measures.
+that support distance measures.
 
-Below is an example for computing euclidean distance for
-two numeric arrays:
+Below is an example for computing Euclidean distance for two numeric arrays:
 
 [source,text]
 ----
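Three of the distance measures listed in this hunk have simple closed forms. An illustrative Python sketch with invented arrays (earth mover's and haversine distances are omitted; this is not Solr's implementation):

```python
import math

def euclidean(a, b):
    # Straight-line distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    # Weighted version of manhattan; terms with both values zero are skipped.
    return sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) != 0)

a = [20, 30, 40, 50]
b = [21, 31, 41, 51]
d_euclidean = euclidean(a, b)   # sqrt(1 + 1 + 1 + 1) = 2.0
d_manhattan = manhattan(a, b)   # 1 + 1 + 1 + 1 = 4
```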
@@ -294,48 +288,46 @@ This expression returns the following response:
 }
 ----
 
-== K-means Clustering
+== K-Means Clustering
 
 The `kmeans` functions performs k-means clustering of the rows of a matrix.
 Once the clustering has been completed there are a number of useful functions available
-for examining the *clusters* and *centroids*.
+for examining the clusters and centroids.
 
-The examples below are clustering *term vectors*.
-The chapter on <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> should be
-consulted for a full explanation of these features.
+The examples below cluster _term vectors_.
+The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> offers
+a full explanation of these features.
 
 === Centroid Features
 
 In the example below the `kmeans` function is used to cluster a result set from the Enron email data-set
 and then the top features are extracted from the cluster centroids.
 
-Let's look at what data is assigned to each variable:
-
-* *a*: The `random` function returns a sample of 500 documents from the *enron*
-collection that match the query *body:oil*. The `select` function selects the *id* and
-and annotates each tuple with the analyzed bigram terms from the body field.
-
-* *b*: The `termVectors` function creates a TF-IDF term vector matrix from the
-tuples stored in variable *a*. Each row in the matrix represents a document. The columns of the matrix
-are the bigram terms that were attached to each tuple.
-* *c*: The `kmeans` function clusters the rows of the matrix into 5 clusters. The k-means clustering is performed using the
-*Euclidean distance* measure.
-* *d*: The `getCentroids` function returns a matrix of cluster centroids. Each row in the matrix is a centroid
-from one of the 5 clusters. The columns of the matrix are the same bigrams terms of the term vector matrix.
-* *e*: The `topFeatures` function returns the column labels for the top 5 features of each centroid in the matrix.
-This returns the top 5 bigram terms for each centroid.
-
 [source,text]
 ----
-let(a=select(random(enron, q="body:oil", rows="500", fl="id, body"),
+let(a=select(random(enron, q="body:oil", rows="500", fl="id, body"), <1>
              id,
              analyze(body, body_bigram) as terms),
-    b=termVectors(a, maxDocFreq=.10, minDocFreq=.05, minTermLength=14, exclude="_,copyright"),
-    c=kmeans(b, 5),
-    d=getCentroids(c),
-    e=topFeatures(d, 5))
+    b=termVectors(a, maxDocFreq=.10, minDocFreq=.05, minTermLength=14, exclude="_,copyright"), <2>
+    c=kmeans(b, 5), <3>
+    d=getCentroids(c), <4>
+    e=topFeatures(d, 5)) <5>
 ----
 
+Let's look at what data is assigned to each variable:
+
+<1> *`a`*: The `random` function returns a sample of 500 documents from the "enron"
+collection that match the query "body:oil". The `select` function selects the `id` and
+and annotates each tuple with the analyzed bigram terms from the `body` field.
+<2> *`b`*: The `termVectors` function creates a TF-IDF term vector matrix from the
+tuples stored in variable *`a`*. Each row in the matrix represents a document. The columns of the matrix
+are the bigram terms that were attached to each tuple.
+<3> *`c`*: The `kmeans` function clusters the rows of the matrix into 5 clusters. The k-means clustering is performed using the Euclidean distance measure.
+<4> *`d`*: The `getCentroids` function returns a matrix of cluster centroids. Each row in the matrix is a centroid
+from one of the 5 clusters. The columns of the matrix are the same bigrams terms of the term vector matrix.
+<5> *`e`*: The `topFeatures` function returns the column labels for the top 5 features of each centroid in the matrix.
+This returns the top 5 bigram terms for each centroid.
+
 This expression returns the following response:
 
 [source,json]
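The k-means procedure this section relies on (assign rows to the nearest centroid, recompute centroids, repeat) can be sketched in a few lines of Python. This is an illustrative Lloyd's-algorithm toy on invented 2-D points with fixed initial centroids for determinism, not Solr's implementation:

```python
def kmeans(rows, k, centroids, iters=20):
    """Lloyd's k-means with caller-supplied initial centroids (deterministic sketch)."""
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for row in rows:
            # Assign each row to the nearest centroid by squared Euclidean distance.
            nearest = min(range(k),
                          key=lambda c: sum((x - m) ** 2
                                            for x, m in zip(row, centroids[c])))
            clusters[nearest].append(row)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return clusters, centroids

# Two well-separated groups of 2-D "documents"; initial centroids one from each group.
rows = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
        [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]]
clusters, centroids = kmeans(rows, 2, centroids=[rows[0], rows[3]])
```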
@@ -396,12 +388,6 @@ This expression returns the following response:
 The example below examines the top features of a specific cluster. This example uses the same techniques
 as the centroids example but the top features are extracted from a cluster rather then the centroids.
 
-The `getCluster` function returns a cluster by its index. Each cluster is a matrix containing term vectors
-that have been clustered together based on their features.
-
-In the example below the `topFeatures` function is used to extract the top 4 features from each term vector
-in the cluster.
-
 [source,text]
 ----
 let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
@@ -409,10 +395,15 @@ let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
              analyze(body, body_bigram) as terms),
     b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
     c=kmeans(b, 25),
-    d=getCluster(c, 0),
-    e=topFeatures(d, 4))
+    d=getCluster(c, 0), <1>
+    e=topFeatures(d, 4)) <2>
 ----
 
+<1> The `getCluster` function returns a cluster by its index. Each cluster is a matrix containing term vectors
+that have been clustered together based on their features.
+<2> The `topFeatures` function is used to extract the top 4 features from each term vector
+in the cluster.
+
 This expression returns the following response:
 
 [source,json]
@ -489,19 +480,17 @@ This expression returns the following response:
|
||||||
}
|
}
|
||||||
----
|
----
|
||||||
|
|
||||||
== Multi K-means Clustering
|
== Multi K-Means Clustering
|
||||||
|
|
||||||
K-means clustering will be produce different results depending on
|
K-means clustering will produce different results depending on
|
||||||
the initial placement of the centroids. K-means is fast enough
|
the initial placement of the centroids. K-means is fast enough
|
||||||
that multiple trials can be performed and the best outcome selected.
|
that multiple trials can be performed and the best outcome selected.
|
||||||
The `multiKmeans` function runs the K-means
|
|
||||||
clustering algorithm for a gven number of trials and selects the
|
|
||||||
best result based on which trial produces the lowest intra-cluster
|
|
||||||
variance.
|
|
||||||
|
|
||||||
The example below is identical to centroids example except that
|
The `multiKmeans` function runs the k-means clustering algorithm for a given number of trials and selects the
|
||||||
it uses `multiKmeans` with 100 trials, rather then a single
|
best result based on which trial produces the lowest intra-cluster variance.
|
||||||
trial of the `kmeans` function.
|
|
||||||
|
The example below is identical to centroids example except that it uses `multiKmeans` with 100 trials,
|
||||||
|
rather then a single trial of the `kmeans` function.
|
||||||
|
|
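The trial-and-select strategy behind `multiKmeans` can be sketched in plain Python. This is a hedged illustration, not Solr's implementation; the `kmeans` and `multi_kmeans` helpers and the sample points are invented:

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """One k-means trial: returns (centroids, total intra-cluster variance)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared euclidean).
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    variance = sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
        for p in points
    )
    return centroids, variance

def multi_kmeans(points, k, trials):
    """Run several trials with different seeds; keep the lowest-variance result."""
    return min((kmeans(points, k, seed=t) for t in range(trials)),
               key=lambda r: r[1])
```

Because each trial starts from a different random placement, taking the minimum-variance trial is what makes multiple trials worthwhile.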
[source,text]
----
}
----

== Fuzzy K-Means Clustering

The `fuzzyKmeans` function is a soft clustering algorithm which
allows vectors to be assigned to more than one cluster. The `fuzziness` parameter
is a value between 1 and 2 that determines how fuzzy to make the cluster assignment.

After the clustering has been performed the `getMembershipMatrix` function can be called.

A simple example will make this more clear. In the example below 300 documents are
then turned into a term vector matrix. Then the `fuzzyKmeans` function clusters the
term vectors into 12 clusters with a fuzziness factor of 1.25.

[source,text]
----
let(a=select(random(collection3, q="body:oil", rows="300", fl="id, body"),
             id,
             analyze(body, body_bigram) as terms),
    b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
    c=fuzzyKmeans(b, 12, fuzziness=1.25),
    d=getMembershipMatrix(c), <1>
    e=rowAt(d, 0), <2>
    f=precision(e, 5)) <3>
----

<1> The `getMembershipMatrix` function is used to return the membership matrix;
<2> and the first row of the membership matrix is retrieved with the `rowAt` function.
<3> The `precision` function is then applied to the first row
of the matrix to make it easier to read.
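The membership probabilities produced by soft clustering follow the standard fuzzy c-means membership formula. The Python below is a sketch of that formula, not Solr's code; the function name and inputs are invented:

```python
import math

def fuzzy_membership(point, centroids, fuzziness=1.25):
    """Membership probability of one vector in each cluster (fuzzy c-means):
    u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1)), where d_j is the distance
    to centroid j and m is the fuzziness factor."""
    dists = [math.dist(point, c) for c in centroids]
    if 0.0 in dists:  # the point sits exactly on a centroid
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (fuzziness - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]
```

The memberships for any vector sum to 1, which is why each row of the membership matrix can be read as a probability distribution over the clusters.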
This expression returns a single vector representing the cluster membership probabilities for the first
term vector. Notice that the term vector has the highest association with the 12^th^ cluster,
but also has significant associations with the 3^rd^, 5^th^, 6^th^ and 7^th^ clusters:

[source,json]
----
}
----

== K-Nearest Neighbor (KNN)

The `knn` function searches the rows of a matrix for the
k-nearest neighbors of a search vector. The `knn` function
returns a matrix of the k-nearest neighbors.

The `knn` function supports changing of the distance measure by providing one of these
distance measure functions as the fourth parameter:

* `euclidean` (Default)
* `manhattan`
* `canberra`
* `earthMovers`

The example below builds on the clustering examples to demonstrate the `knn` function.
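Conceptually, the k-nearest neighbor search over the rows of a matrix is small enough to sketch directly. This Python sketch uses the default euclidean measure and is an illustration, not Solr's implementation:

```python
import math

def knn(matrix, search_vector, k):
    """Return the k rows of `matrix` nearest to `search_vector` (euclidean)."""
    return sorted(matrix, key=lambda row: math.dist(row, search_vector))[:k]

# Toy usage: find the 3 rows closest to the origin.
rows = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [2.0, 2.0]]
print(knn(rows, [0.0, 0.0], 3))
```

Swapping the distance measure amounts to replacing `math.dist` in the sort key, which mirrors passing a different distance function as the fourth parameter.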
[source,text]
----
let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
             id,
             analyze(body, body_bigram) as terms),
    b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
    c=multiKmeans(b, 5, 100),
    d=getCentroids(c), <1>
    e=rowAt(d, 0), <2>
    g=knn(b, e, 3), <3>
    h=topFeatures(g, 4)) <4>
----

<1> In the example, the centroids matrix is set to variable *`d`*.
<2> The first centroid vector is selected from the matrix with the `rowAt` function.
<3> Then the `knn` function is used to find the 3 nearest neighbors
to the centroid vector in the term vector matrix (variable *`b`*).
<4> The `topFeatures` function is used to request the top 4 features of the term vectors in the knn matrix.

The `knn` function returns a matrix with the 3 nearest neighbors based on the
default distance measure, which is euclidean. Finally, the top 4 features
of the term vectors in the nearest neighbor matrix are returned:

[source,json]
----
}
----

== K-Nearest Neighbor Regression

K-nearest neighbor regression is a non-linear, multi-variate regression method. KNN regression is a lazy learning
technique, which means it does not fit a model to the training set in advance. Instead the
entire training set of observations and outcomes are held in memory and predictions are made
by averaging the outcomes of the k-nearest neighbors.

The `knnRegress` function prepares the training set for use with the `predict` function.

Below is an example of the `knnRegress` function. In this example 10,000 random samples
are taken, each containing the variables `filesize_d`, `service_d` and `response_d`. The pairs of
`filesize_d` and `service_d` will be used to predict the value of `response_d`.
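The lazy-learning prediction step can be sketched in Python: keep the training set in memory, then at query time average the outcomes of the k nearest observations. This is an illustrative sketch, not the `knnRegress` implementation, and the data below is invented:

```python
import math

def knn_predict(observations, outcomes, query, k=5):
    """Lazy-learning prediction: average the outcomes of the k nearest observations."""
    nearest = sorted(range(len(observations)),
                     key=lambda i: math.dist(observations[i], query))[:k]
    return sum(outcomes[i] for i in nearest) / k

# Toy usage: 1-D observations with outcomes equal to the observation value.
obs = [[0.0], [1.0], [2.0], [10.0]]
out = [0.0, 1.0, 2.0, 10.0]
print(knn_predict(obs, out, [1.1], k=2))
```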
[source,text]
----
    lazyModel=knnRegress(observations, outcomes, 5))
----

This expression returns the following response. Notice that `knnRegress` returns a tuple describing the regression inputs:

[source,json]
----
=== Prediction and Residuals

The output of `knnRegress` can be used with the `predict` function like other regression models.

In the example below the `predict` function is used to predict results for the original training
data. The `sumSq` of the residuals is then calculated.
If the features in the observation matrix are not on the same scale, then the larger features
will carry more weight in the distance calculation than the smaller features. This can greatly
impact the accuracy of the prediction. The `knnRegress` function has a `scale` parameter which
can be set to `true` to automatically scale the features in the same range.

The example below shows `knnRegress` with feature scaling turned on.

Notice that when feature scaling is turned on the `sumSqErr` in the output is much lower.
This shows how much more accurate the predictions are when feature scaling is turned on in
this particular example. This is because the `filesize_d` feature is significantly larger than
the `service_d` feature.
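A common way to put features in the same range is min-max scaling, sketched below in Python. This illustrates the idea behind the `scale` parameter; it is not necessarily the exact scaling Solr applies:

```python
def minmax_scale_columns(matrix):
    """Scale each feature (column) into [0, 1] so large features
    don't dominate the distance calculation."""
    scaled_cols = []
    for col in zip(*matrix):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # avoid division by zero for constant columns
        scaled_cols.append([(v - lo) / span for v in col])
    return [list(row) for row in zip(*scaled_cols)]

# A large feature (column 0) and a small one (column 1) end up comparable.
print(minmax_scale_columns([[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]]))
```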
[source,text]
----
=== Setting Robust Regression

The default prediction approach is to take the mean of the outcomes of the k-nearest
neighbors. If the outcomes contain outliers the mean value can be skewed. Setting
the `robust` parameter to `true` will take the median outcome of the k-nearest neighbors.
This provides a regression prediction that is robust to outliers.
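The difference between the default (mean) and robust (median) predictions can be sketched in a few lines of Python. The helper name is invented for illustration:

```python
import statistics

def knn_outcome(neighbor_outcomes, robust=False):
    """Mean of the neighbors' outcomes by default; median when robust=True."""
    if robust:
        return statistics.median(neighbor_outcomes)
    return statistics.mean(neighbor_outcomes)

# One outlier (500) skews the mean but barely moves the median.
outcomes = [10.0, 11.0, 12.0, 500.0]
print(knn_outcome(outcomes), knn_outcome(outcomes, robust=True))
```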
=== Setting the Distance Measure

The distance measure can be changed for the k-nearest neighbor search by adding a distance measure
function to the `knnRegress` parameters. Below is an example using `manhattan` distance.

[source,text]
----
}
}
----
matrix(array(1, 2),
       array(4, 5))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]

let(a=array(1, 2),
    d=colAt(c, 1))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]

let(echo="d, e",
    e=getColumnLabels(c))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]

let(echo="b,c",
    c=columnCount(a))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]

let(a=matrix(array(1, 2),
    b=transpose(a))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]

let(a=matrix(array(1, 2, 3),
    b=sumRows(a))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]

let(a=matrix(array(1, 2, 3),
    b=grandSum(a))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]

let(a=matrix(array(1, 2),
    b=scalarAdd(10, a))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]

let(a=matrix(array(1, 2),
    b=ebeAdd(a, a))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]

let(a=matrix(array(1, 2),
    c=matrixMult(a, b))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]
// specific language governing permissions and limitations
// under the License.

Interpolation, derivatives and integrals are three interrelated topics which are part of the field of mathematics called numerical analysis. This section explores the math expressions available for numerical analysis.

== Interpolation

Interpolation is used to construct new data points between a set of known control points.
The ability to predict new data points allows for sampling along the curve defined by the
control points.

The interpolation functions described below all return an _interpolation model_
that can be passed to other functions which make use of the sampling capability.

If returned directly the interpolation model returns an array containing predictions for each of the
control points. This is useful in the case of `loess` interpolation which first smooths the control points
and then interpolates the smoothed points. All other interpolation functions simply return the original
control points because interpolation predicts a curve that passes through the original control points.

There are different algorithms for interpolation that will result in different predictions
The `predict` function can be used to predict values anywhere within the bounds of the interpolation
range. The example below shows a very simple example of upsampling.

[source,text]
----
let(x=array(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20), <1>
    y=array(5, 10, 60, 190, 100, 130, 100, 20, 30, 10, 5), <2>
    l=lerp(x, y), <3>
    u=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), <4>
    p=predict(l, u)) <5>
----

<1> In the example linear interpolation is performed on the arrays in variables *`x`* and *`y`*. The *`x`* variable,
which is the x-axis, is a sequence from 0 to 20 with a stride of 2.
<2> The *`y`* variable defines the curve along the x-axis.
<3> The `lerp` function performs the interpolation and returns the interpolation model.
<4> The `u` value is an array from 0 to 20 with a stride of 1. This fills in the gaps of the original x-axis.
The `predict` function then uses the interpolation function in variable *`l`* to predict values for
every point in the array assigned to variable *`u`*.
<5> The variable *`p`* is the array of predictions, which is the upsampled set of *`y`* values.
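The piecewise-linear prediction that a `lerp` model performs can be sketched in Python. This is an illustration of linear interpolation, not Solr's implementation; the function name is invented:

```python
from bisect import bisect_right

def lerp_predict(xs, ys, x):
    """Piecewise-linear interpolation of (xs, ys) evaluated at x.
    Assumes xs is sorted ascending and x lies within its bounds."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x outside the interpolation range")
    # Find the segment [xs[i-1], xs[i]] containing x.
    i = min(bisect_right(xs, x), len(xs) - 1)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Upsampling the first points of the example curve above.
xs = [0, 2, 4]
ys = [5, 10, 60]
print([lerp_predict(xs, ys, x) for x in range(5)])
```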
When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
A technique known as local regression is used to compute the smoothed curve. The
neighborhood of the local regression can be adjusted
to control how close the new curve conforms to the original control points.

The `loess` function is passed *`x`*- and *`y`*-axes and fits a smooth curve to the data.
If only a single array is provided it is treated as the *`y`*-axis and a sequence is generated
for the *`x`*-axis.

The example below uses the `loess` function to fit a curve to a set of *`y`* values in an array.
The `bandwidth` parameter defines the percent of data to use for the local
regression. The lower the percent the smaller the neighborhood used for the local
regression and the closer the curve will be to the original data.

[source,text]
----
let(echo="residuals, sumSqError",
    sumSqError=sumSq(residuals))
----

In the example the fitted curve is subtracted from the original curve using the
`ebeSubtract` function. The output shows the error between the
fitted curve and the original curve, known as the residuals. The output also includes
the sum-of-squares of the residuals which provides a measure
of how large the error is:

[source,json]
----
}
----

In the next example the curve is fit using a `bandwidth` of `.25`:

[source,text]
----
let(echo="residuals, sumSqError",
    sumSqError=sumSq(residuals))
----

Notice that the curve is a closer fit, shown by the smaller `residuals` and lower value for the sum-of-squares of the
residuals:

[source,json]
----
== Derivatives

The derivative of a function measures the rate of change of the *`y`* value with respect to the
rate of change of the *`x`* value.

The `derivative` function can compute the derivative of any interpolation function.
It can also compute the derivative of a derivative.
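Numerically, a derivative can be estimated with a central finite difference, as the generic sketch below illustrates (this is standard numerical analysis, not Solr's `derivative` implementation):

```python
def derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x).
    Applying it to an already-estimated derivative gives a second derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)

# The derivative of t^2 at t = 3 is 6.
print(derivative(lambda t: t ** 2, 3.0))
```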
The example below computes the derivative for a `loess` interpolation function.

    derivative=derivative(curve))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]
@ -327,7 +317,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
|
||||||
integral=integrate(curve, 0, 20))
|
integral=integrate(curve, 0, 20))
|
||||||
----
|
----
|
||||||
|
|
||||||
When this expression is sent to the /stream handler it
|
When this expression is sent to the `/stream` handler it
|
||||||
responds with:
|
responds with:
|
||||||
|
|
||||||
[source,json]
|
[source,json]
|
||||||
|
@ -357,7 +347,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
|
||||||
integral=integrate(curve, 0, 10))
|
integral=integrate(curve, 0, 10))
|
||||||
----
|
----
|
||||||
|
|
||||||
When this expression is sent to the /stream handler it
|
When this expression is sent to the `/stream` handler it
|
||||||
responds with:
|
responds with:
|
||||||
|
|
||||||
[source,json]
|
[source,json]
|
||||||
|
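Definite integrals like the ones above can be approximated numerically, for example with the trapezoid rule. The Python below is a generic numerical-analysis sketch, not Solr's `integrate` implementation:

```python
def integrate(f, a, b, n=10_000):
    """Trapezoid-rule estimate of the definite integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))  # endpoints carry half weight
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# Integral of x over [0, 20] is 200; over [0, 10] it is 50.
print(integrate(lambda x: x, 0.0, 20.0), integrate(lambda x: x, 0.0, 10.0))
```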
The `bicubicSpline` function can be used to interpolate and predict values
anywhere within a grid of data.

A simple example will make this more clear:

[source,text]
----
let(years=array(1998, 2000, 2002, 2004, 2006),
    prediction=predict(bspline, 2003, 8))
----

In this example a bicubic spline is used to interpolate a matrix of real estate data.
Each row of the matrix represents one of the `years`. Each column of the matrix
represents one of the `floors` of the building. The grid of numbers is the average selling price of
an apartment for each year and floor. For example in 2002 the average selling price for
the 9th floor was `415000` (row 3, column 3).

The `bicubicSpline` function is then used to
interpolate the grid, and the `predict` function is used to predict a value for year 2003, floor 8.
Notice that the matrix does not include a data point for year 2003, floor 8. The `bicubicSpline`
function creates that data point based on the surrounding data in the matrix:

[source,json]
----
}
}
----
@ -17,18 +17,16 @@
|
||||||
// under the License.
|
// under the License.
|
||||||
|
|
||||||
This section of the user guide covers the
|
This section of the user guide covers the
|
||||||
*probability distribution
|
probability distribution
|
||||||
framework* included in the math expressions library.
|
framework included in the math expressions library.
|
||||||
|
|
||||||
== Probability Distribution Framework
|
== Probability Distribution Framework
|
||||||
|
|
||||||
The probability distribution framework includes
|
The probability distribution framework includes many commonly used <<Real Distributions,real>>
|
||||||
many commonly used *real* and *discrete* probability
|
and <<Discrete,discrete>> probability distributions, including support for <<Empirical Distribution,empirical>>
|
||||||
distributions, including support for *empirical* and
|
and <<Enumerated Distributions,enumerated>> distributions that model real world data.
|
||||||
*enumerated* distributions that model real world data.
|
|
||||||
|
|
||||||
The probability distribution framework also includes a set
|
The probability distribution framework also includes a set of functions that use the probability distributions
|
||||||
of functions that use the probability distributions
|
|
||||||
to support probability calculations and sampling.
|
to support probability calculations and sampling.
|
||||||
|
|
||||||
=== Real Distributions
|
=== Real Distributions
|
||||||
|
@@ -93,18 +91,18 @@ random variable within a specific distribution.
 Below is an example of calculating the cumulative probability
 of a random variable within a normal distribution.
 
-In the example a normal distribution function is created
-with a mean of 10 and a standard deviation of 5. Then
-the cumulative probability of the value 12 is calculated for this
-specific distribution.
-
 [source,text]
 ----
 let(a=normalDistribution(10, 5),
     b=cumulativeProbability(a, 12))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+In this example a normal distribution function is created
+with a mean of 10 and a standard deviation of 5. Then
+the cumulative probability of the value 12 is calculated for this
+specific distribution.
+
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
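The value that `cumulativeProbability` returns for a normal distribution can be checked against the closed-form normal CDF. This sketch uses Python's standard library rather than Solr, but the parameters match the example above.

```python
import math

# Rough equivalent of normalDistribution(10, 5) followed by
# cumulativeProbability(a, 12), using the closed-form normal CDF.
def cumulative_probability(mean, stddev, x):
    """P(X <= x) for X ~ Normal(mean, stddev)."""
    return 0.5 * (1 + math.erf((x - mean) / (stddev * math.sqrt(2))))

print(cumulative_probability(10, 5, 12))  # about 0.655
```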
@@ -127,10 +125,10 @@ Below is an example of a cumulative probability calculation
 using an empirical distribution.
 
 In the example an empirical distribution is created from a random
-sample taken from the *price_f* field.
+sample taken from the `price_f` field.
 
-The cumulative probability of the value .75 is then calculated.
-The *price_f* field in this example was generated using a
+The cumulative probability of the value `.75` is then calculated.
+The `price_f` field in this example was generated using a
 uniform real distribution between 0 and 1, so the output of the
 `cumulativeProbability` function is very close to .75.
 
@@ -142,7 +140,7 @@ let(a=random(collection1, q="*:*", rows="30000", fl="price_f"),
     d=cumulativeProbability(c, .75))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
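An empirical distribution's cumulative probability is simply the fraction of observed samples at or below the query value. The sketch below mimics the `price_f` example with a synthetic uniform sample; the exact output depends on the random seed.

```python
import random

# Sketch of empiricalDistribution + cumulativeProbability: the empirical CDF
# is the fraction of observed samples at or below x. The uniform sample below
# stands in for the price_f field.
random.seed(42)
sample = [random.random() for _ in range(30000)]  # uniform between 0 and 1

def empirical_cumulative_probability(sample, x):
    return sum(1 for v in sample if v <= x) / len(sample)

print(empirical_cumulative_probability(sample, 0.75))  # close to 0.75
```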
@@ -171,7 +169,7 @@ Below is an example which calculates the probability
 of a discrete value within a Poisson distribution.
 
 In the example a Poisson distribution function is created
-with a mean of 100. Then the
+with a mean of `100`. Then the
 probability of encountering a sample of the discrete value 101 is calculated for this
 specific distribution.
 
@@ -181,7 +179,7 @@ let(a=poissonDistribution(100),
     b=probability(a, 101))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
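The Poisson probability in the example above can be reproduced from the probability mass function. This stdlib sketch computes it in log space so the factorial of 101 does not overflow.

```python
import math

# Sketch of poissonDistribution(100) + probability(a, 101): the Poisson
# probability mass function, computed in log space to avoid overflow.
def poisson_probability(mean, k):
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

print(poisson_probability(100, 101))  # about 0.039
```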
@@ -200,12 +198,10 @@ When this expression is sent to the /stream handler it responds with:
 }
 ----
 
-Below is an example of a probability calculation
-using an enumerated distribution.
+Below is an example of a probability calculation using an enumerated distribution.
 
 In the example an enumerated distribution is created from a random
-sample taken from the *day_i* field, which was created
-using a uniform integer distribution between 0 and 30.
+sample taken from the `day_i` field, which was created using a uniform integer distribution between 0 and 30.
 
 The probability of the discrete value 10 is then calculated.
 
@@ -217,7 +213,7 @@ let(a=random(collection1, q="*:*", rows="30000", fl="day_i"),
     d=probability(c, 10))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
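An enumerated distribution assigns each discrete value the frequency observed in the sample. The sketch below mimics the `day_i` example with a synthetic uniform integer sample; since there are 31 possible values, the probability of any one value is close to 1/31.

```python
import random
from collections import Counter

# Sketch of enumeratedDistribution + probability: the probability of a
# discrete value is its observed frequency in the sample. The uniform
# integer sample stands in for the day_i field (values 0-30).
random.seed(7)
sample = [random.randint(0, 30) for _ in range(30000)]
counts = Counter(sample)

def enumerated_probability(counts, total, value):
    return counts[value] / total

print(enumerated_probability(counts, len(sample), 10))  # close to 1/31
```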
@@ -239,11 +235,9 @@ When this expression is sent to the /stream handler it responds with:
 === Sampling
 
 All probability distributions support sampling. The `sample`
-function returns 1 or more random samples from a probability
-distribution.
+function returns 1 or more random samples from a probability distribution.
 
-Below is an example drawing a single sample from
-a normal distribution.
+Below is an example drawing a single sample from a normal distribution.
 
 [source,text]
 ----
@@ -251,7 +245,7 @@ let(a=normalDistribution(10, 5),
     b=sample(a))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
@@ -270,8 +264,7 @@ When this expression is sent to the /stream handler it responds with:
 }
 }
 ----
 
-Below is an example drawing 10 samples from a normal
-distribution.
+Below is an example drawing 10 samples from a normal distribution.
 
 [source,text]
 ----
@@ -279,7 +272,7 @@ let(a=normalDistribution(10, 5),
     b=sample(a, 10))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
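The two sampling calls above, `sample(a)` and `sample(a, 10)`, correspond to drawing one or many values from the distribution. This sketch uses Python's stdlib Gaussian sampler rather than Solr, with the same mean and standard deviation.

```python
import random

# Sketch of sample(a) and sample(a, 10) for normalDistribution(10, 5),
# using Python's stdlib Gaussian sampler.
random.seed(0)
single = random.gauss(10, 5)                      # one sample
ten = [random.gauss(10, 5) for _ in range(10)]    # ten samples

print(single)
print(len(ten))
```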
@@ -315,14 +308,14 @@ The multivariate normal distribution is a generalization of the
 univariate normal distribution to higher dimensions.
 
 The multivariate normal distribution models two or more random
-variables that are normally distributed. The relationship between
-the variables is defined by a covariance matrix.
+variables that are normally distributed. The relationship between the variables is defined by a covariance matrix.
 
 ==== Sampling
 
 The `sample` function can be used to draw samples
 from a multivariate normal distribution in much the same
 way as a univariate normal distribution.
 
 The difference is that each sample will be an array containing a sample
 drawn from each of the underlying normal distributions.
 If multiple samples are drawn, the `sample` function returns a matrix with a
@@ -333,33 +326,25 @@ multivariate normal distribution.
 The example below demonstrates how to initialize and draw samples
 from a multivariate normal distribution.
 
-In this example 5000 random samples are selected from a collection
-of log records. Each sample contains
-the fields *filesize_d* and *response_d*. The values of both fields conform
-to a normal distribution.
+In this example 5000 random samples are selected from a collection of log records. Each sample contains
+the fields `filesize_d` and `response_d`. The values of both fields conform to a normal distribution.
 
-Both fields are then vectorized. The *filesize_d* vector is stored in
-variable *b* and the *response_d* variable is stored in variable *c*.
+Both fields are then vectorized. The `filesize_d` vector is stored in
+variable *`b`* and the `response_d` variable is stored in variable *`c`*.
 
-An array is created that contains the *means* of the two vectorized fields.
+An array is created that contains the means of the two vectorized fields.
 
 Then both vectors are added to a matrix which is transposed. This creates
-an *observation* matrix where each row contains one observation of
-*filesize_d* and *response_d*. A covariance matrix is then created from the columns of
-the observation matrix with the
-`cov` function. The covariance matrix describes the covariance between
-*filesize_d* and *response_d*.
+an observation matrix where each row contains one observation of
+`filesize_d` and `response_d`. A covariance matrix is then created from the columns of
+the observation matrix with the `cov` function. The covariance matrix describes the covariance between
+`filesize_d` and `response_d`.
 
 The `multivariateNormalDistribution` function is then called with the
 array of means for the two fields and the covariance matrix. The model for the
-multivariate normal distribution is assigned to variable *g*.
+multivariate normal distribution is assigned to variable *`g`*.
 
-Finally five samples are drawn from the multivariate normal distribution. The samples
-are returned as a matrix, with each row representing one sample. There are two
-columns in the matrix. The first column contains samples for *filesize_d* and the second
-column contains samples for *response_d*. Over the long term the covariance between
-the columns will conform to the covariance matrix used to instantiate the
-multivariate normal distribution.
+Finally five samples are drawn from the multivariate normal distribution.
 
 [source,text]
 ----
@@ -373,7 +358,11 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
     h=sample(g, 5))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+The samples are returned as a matrix, with each row representing one sample. There are two
+columns in the matrix. The first column contains samples for `filesize_d` and the second
+column contains samples for `response_d`. Over the long term the covariance between
+the columns will conform to the covariance matrix used to instantiate the
+multivariate normal distribution.
 
 [source,json]
 ----
@@ -412,4 +401,3 @@ When this expression is sent to the /stream handler it responds with:
 }
 }
 ----
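The mechanics behind sampling a multivariate normal distribution can be sketched for the two-variable case: factor the covariance matrix (here a hand-rolled 2x2 Cholesky) and transform independent standard normal draws. The means and covariance below are made-up stand-ins for `filesize_d` and `response_d`, not values from the collection.

```python
import math
import random

# Sketch of multivariateNormalDistribution + sample for two variables,
# using a 2x2 Cholesky factorization. Means and covariance are hypothetical.
def sample_bivariate_normal(means, cov, rng):
    """Draw one sample from N(means, cov) for the 2x2 case."""
    # Cholesky factor L of cov, so that L @ z + means has covariance cov.
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return [means[0] + l11 * z1,
            means[1] + l21 * z1 + l22 * z2]

rng = random.Random(1)
means = [80000.0, 0.2]                    # hypothetical filesize_d, response_d means
cov = [[4.0e8, 150.0], [150.0, 0.01]]     # hypothetical covariance matrix
samples = [sample_bivariate_normal(means, cov, rng) for _ in range(5)]
for row in samples:
    print(row)
```

As in the Solr example, each row is one sample and each column corresponds to one of the underlying variables.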
@@ -16,28 +16,23 @@
 // specific language governing permissions and limitations
 // under the License.
 
-This section of the math expressions user guide covers simple and multivariate linear regression.
+The math expressions library supports simple and multivariate linear regression.
 
 == Simple Linear Regression
 
 The `regress` function is used to build a linear regression model
 between two random variables. Sample observations are provided with two
-numeric arrays. The first numeric array is the *independent variable* and
-the second array is the *dependent variable*.
+numeric arrays. The first numeric array is the independent variable and
+the second array is the dependent variable.
 
 In the example below the `random` function selects 5000 random samples each containing
-the fields *filesize_d* and *response_d*. The two fields are vectorized
-and stored in variables *b* and *c*. Then the `regress` function performs a regression
+the fields `filesize_d` and `response_d`. The two fields are vectorized
+and stored in variables *`b`* and *`c`*. Then the `regress` function performs a regression
 analysis on the two numeric arrays.
 
 The `regress` function returns a single tuple with the results of the regression
 analysis.
 
-Note that in this regression analysis the value of *RSquared* is *.75*. This means that changes in
-*filesize_d* explain 75% of the variability of the *response_d* variable.
-
 [source,text]
 ----
 let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
@@ -46,7 +41,8 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
     d=regress(b, c))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+Note that in this regression analysis the value of `RSquared` is `.75`. This means that changes in
+`filesize_d` explain 75% of the variability of the `response_d` variable:
 
 [source,json]
 ----
@@ -81,11 +77,10 @@ When this expression is sent to the /stream handler it responds with:
 
 The `predict` function uses the regression model to make predictions.
 Using the example above the regression model can be used to predict the value
-of *response_d* given a value for *filesize_d*.
+of `response_d` given a value for `filesize_d`.
 
 In the example below the `predict` function uses the regression analysis to predict
-the value of *response_d* for the *filesize_d* value of 40000.
+the value of `response_d` for the `filesize_d` value of `40000`.
 
 [source,text]
 ----
@@ -96,7 +91,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
     e=predict(d, 40000))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
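What `regress` and `predict` compute for one independent variable is ordinary least squares with the closed-form slope and intercept formulas. The tiny data set below is a made-up stand-in for `filesize_d`/`response_d`.

```python
# Sketch of regress + predict: ordinary least squares for one independent
# variable, fit with the closed-form slope/intercept formulas.
def regress(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return slope, intercept

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

x = [10000, 20000, 30000, 50000]   # hypothetical filesize_d values
y = [0.5, 0.9, 1.6, 2.4]           # hypothetical response_d values
model = regress(x, y)
print(predict(model, 40000))
```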
@@ -131,7 +126,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
     e=predict(d, b))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
@@ -169,9 +164,9 @@ The difference between the observed value and the predicted value is known as th
 residual. There isn't a specific function to calculate the residuals but vector
 math can be used to perform the calculation.
 
-In the example below the predictions are stored in variable *e*. The `ebeSubtract`
+In the example below the predictions are stored in variable *`e`*. The `ebeSubtract`
 function is then used to subtract the predictions
-from the actual *response_d* values stored in variable *c*. Variable *f* contains
+from the actual `response_d` values stored in variable *`c`*. Variable *`f`* contains
 the array of residuals.
 
 [source,text]
@@ -184,7 +179,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
     f=ebeSubtract(c, e))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
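The residual calculation that `ebeSubtract` performs is just element-by-element subtraction of the predictions from the observed values; the numbers below are toy values, not data from the collection.

```python
# Sketch of the residual calculation (ebeSubtract): element-by-element
# subtraction of predictions from observed values.
observed = [1.2, 2.5, 3.1]
predicted = [1.0, 2.7, 3.0]
residuals = [o - p for o, p in zip(observed, predicted)]
print(residuals)
```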
@@ -221,20 +216,17 @@ When this expression is sent to the /stream handler it responds with:
 == Multivariate Linear Regression
 
 The `olsRegress` function performs a multivariate linear regression analysis. Multivariate linear
-regression models the linear relationship between two or more *independent* variables and a *dependent* variable.
+regression models the linear relationship between two or more independent variables and a dependent variable.
 
 The example below extends the simple linear regression example by introducing a new independent variable
-called *service_d*. The *service_d* variable is the service level of the request and it can range from 1 to 4
+called `service_d`. The `service_d` variable is the service level of the request and it can range from 1 to 4
 in the data-set. The higher the service level, the higher the bandwidth available for the request.
 
-Notice that the two independent variables *filesize_d* and *service_d* are vectorized and stored
-in the variables *b* and *c*. The variables *b* and *c* are then added as rows to a `matrix`. The matrix is
-then transposed so that each row in the matrix represents one observation with *filesize_d* and *service_d*.
+Notice that the two independent variables `filesize_d` and `service_d` are vectorized and stored
+in the variables *`b`* and *`c`*. The variables *`b`* and *`c`* are then added as rows to a `matrix`. The matrix is
+then transposed so that each row in the matrix represents one observation with `filesize_d` and `service_d`.
 The `olsRegress` function then performs the multivariate regression analysis using the observation matrix as the
-independent variables and the *response_d* values, stored in variable *d*, as the dependent variable.
+independent variables and the `response_d` values, stored in variable *`d`*, as the dependent variable.
 
-Notice that the RSquared of the regression analysis is 1. This means that linear relationship between
-*filesize_d* and *service_d* describe 100% of the variability of the *response_d* variable.
-
 [source,text]
 ----
@@ -246,7 +238,8 @@ let(a=random(collection2, q="*:*", rows="30000", fl="filesize_d, service_d, resp
     f=olsRegress(e, d))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+Notice in the response that the RSquared of the regression analysis is 1. This means that the linear relationship between
+`filesize_d` and `service_d` describes 100% of the variability of the `response_d` variable:
 
 [source,json]
 ----
@@ -299,10 +292,11 @@ When this expression is sent to the /stream handler it responds with:
 
 === Prediction
 
-The `predict` function can also be used to make predictions for multivariate linear regression. Below is an example
-of a single prediction using the multivariate linear regression model and a single observation. The observation
-is an array that matches the structure of the observation matrix used to build the model. In this case
-the first value represent a *filesize_d* of 40000 and the second value represents a *service_d* of 4.
+The `predict` function can also be used to make predictions for multivariate linear regression.
+
+Below is an example of a single prediction using the multivariate linear regression model and a single observation.
+The observation is an array that matches the structure of the observation matrix used to build the model. In this case
+the first value represents a `filesize_d` of `40000` and the second value represents a `service_d` of `4`.
 
 [source,text]
 ----
@@ -315,7 +309,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, respo
     g=predict(f, array(40000, 4)))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
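The multivariate case that `olsRegress` handles can be sketched by solving the normal equations (X^T X) beta = (X^T y) directly. The tiny data set below is a made-up stand-in for `filesize_d`, `service_d` and `response_d`; it is constructed to be exactly linear, so (as in the docs example) the fit is perfect and the prediction is exact.

```python
# Sketch of olsRegress + predict for two independent variables: solve the
# normal equations (X^T X) beta = X^T y with Gaussian elimination.
def ols_regress(rows, y):
    X = [[1.0] + list(r) for r in rows]          # prepend intercept column
    k = len(X[0])
    A = [[sum(X[i][p] * X[i][q] for i in range(len(X))) for q in range(k)]
         for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(len(X))) for p in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta

def predict(beta, obs):
    return beta[0] + sum(w * v for w, v in zip(beta[1:], obs))

# response = 0.00005 * filesize - 0.1 * service + 0.5, exactly linear,
# so RSquared would be 1 and predictions are exact.
rows = [(10000, 1), (20000, 2), (30000, 1), (50000, 4), (40000, 3)]
y = [0.00005 * f - 0.1 * s + 0.5 for f, s in rows]
beta = ols_regress(rows, y)
print(predict(beta, (40000, 4)))
```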
@@ -335,9 +329,10 @@ When this expression is sent to the /stream handler it responds with:
 ----
 
 The `predict` function can also make predictions for more than one multivariate observation. In this scenario
-an observation matrix used. In the example below the observation matrix used to build the multivariate regression model
-is passed to the `predict` function and it returns an array of predictions.
+an observation matrix is used.
+
+In the example below the observation matrix used to build the multivariate regression model
+is passed to the `predict` function and it returns an array of predictions.
 
 [source,text]
 ----
@@ -350,7 +345,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, respo
     g=predict(f, e))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
@@ -388,7 +383,7 @@ Once the predictions are generated the residuals can be calculated using the sam
 simple linear regression.
 
 Below is an example of the residuals calculation following a multivariate linear regression. In the example
-the predictions stored variable *g* are subtracted from observed values stored in variable *d*.
+the predictions stored in variable *`g`* are subtracted from observed values stored in variable *`d`*.
 
 [source,text]
 ----
@@ -402,7 +397,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, respo
     h=ebeSubtract(d, g))
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
@@ -26,7 +26,7 @@ For example the expression below adds two numbers together:
 add(1, 1)
 ----
 
-When this expression is sent to the /stream handler it
+When this expression is sent to the `/stream` handler it
 responds with:
 
 [source,json]
@@ -98,7 +98,7 @@ select(search(collection2, q="*:*", fl="price_f", sort="price_f desc", rows="3")
     mult(price_f, 10) as newPrice)
 ----
 
-When this expression is sent to the /stream handler it responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
@ -18,59 +18,59 @@
|
||||||
|
|
||||||
|
|
||||||
Monte Carlo simulations are commonly used to model the behavior of
|
Monte Carlo simulations are commonly used to model the behavior of
|
||||||
stochastic systems. This section of the user guide describes
|
stochastic systems. This section describes
|
||||||
how to perform both *uncorrelated* and *correlated* Monte Carlo simulations
|
how to perform both uncorrelated and correlated Monte Carlo simulations
|
||||||
using the *sampling* capabilities of the probability distribution framework.
|
using the sampling capabilities of the probability distribution framework.
|
||||||
|
|
||||||
== Uncorrelated Simulations
|
== Uncorrelated Simulations
|
||||||
|
|
||||||
Uncorrelated Monte Carlo simulations model stochastic systems with the assumption
|
Uncorrelated Monte Carlo simulations model stochastic systems with the assumption
|
||||||
that the underlying random variables move independently of each other.
|
that the underlying random variables move independently of each other.
|
||||||
A simple example of a Monte Carlo simulation using two independently changing random variables
|
A simple example of a Monte Carlo simulation using two independently changing random variables
|
||||||
is described below.
|
is described below.
|
||||||
|
|
||||||
In this example a Monte Carlo simulation is used to determine the probability that a simple hinge assembly will
|
In this example a Monte Carlo simulation is used to determine the probability that a simple hinge assembly will
|
||||||
fall within a required length specification.
|
fall within a required length specification.
|
||||||
|
|
||||||
The hinge has two components *A* and *B*. The combined length of the two components must be less then 5 centimeters
|
The hinge has two components A and B. The combined length of the two components must be less then 5 centimeters
|
||||||
to fall within specification.
|
to fall within specification.
|
||||||
|
|
||||||
A random sampling of lengths for component *A* has shown that its length conforms to a
|
A random sampling of lengths for component A has shown that its length conforms to a
normal distribution with a mean of 2.2 centimeters and a standard deviation of .0195
centimeters.

A random sampling of lengths for component B has shown that its length conforms
to a normal distribution with a mean of 2.71 centimeters and a standard deviation of .0198 centimeters.

[source,text]
----
let(componentA=normalDistribution(2.2, .0195), <1>
    componentB=normalDistribution(2.71, .0198), <2>
    simresults=monteCarlo(sampleA=sample(componentA), <3>
                          sampleB=sample(componentB),
                          add(sampleA, sampleB), <4>
                          100000), <5>
    simmodel=empiricalDistribution(simresults), <6>
    prob=cumulativeProbability(simmodel, 5)) <7>
----

The Monte Carlo simulation above performs the following steps:

<1> A normal distribution with a mean of 2.2 and a standard deviation of .0195 is created to model the length of `componentA`.
<2> A normal distribution with a mean of 2.71 and a standard deviation of .0198 is created to model the length of `componentB`.
<3> The `monteCarlo` function samples from the `componentA` and `componentB` distributions and sets the values to variables `sampleA` and `sampleB`.
<4> It then calls the `add(sampleA, sampleB)` function to find the combined lengths of the samples.
<5> The `monteCarlo` function runs a set number of times, 100000, and collects the results in an array. Each time the function is called new samples are drawn from the `componentA` and `componentB` distributions. On each run, the `add` function adds the two samples to calculate the combined length. The result of each run is collected in an array and assigned to the `simresults` variable.
<6> An `empiricalDistribution` function is then created from the `simresults` array to model the distribution of the simulation results.
<7> Finally, the `cumulativeProbability` function is called on the `simmodel` to determine the cumulative probability that the combined length of the components is 5 or less.

Based on the simulation there is a .9994371944629039 probability that the combined length of a component pair will
be 5 or less:

[source,json]
----
...
----
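
Outside of Solr, the arithmetic behind this simulation can be sketched with plain standard-library Python. This is an illustrative re-implementation of the steps above, not Solr code; only the distribution parameters are taken from the example.

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# Model the two component lengths as independent normal distributions,
# draw a sample from each, and record the combined length.
def simulate(runs=100_000):
    results = []
    for _ in range(runs):
        sample_a = random.gauss(2.2, 0.0195)   # componentA length
        sample_b = random.gauss(2.71, 0.0198)  # componentB length
        results.append(sample_a + sample_b)
    return results

simresults = simulate()

# Empirical cumulative probability that a component pair is 5 cm or less.
prob = sum(1 for r in simresults if r <= 5) / len(simresults)
print(round(prob, 4))
```

The printed value lands very close to the .9994 result reported above, since the sum of the two independent normals has mean 4.91 and standard deviation of roughly .0278.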

== Correlated Simulations

The simulation above assumes that the lengths of `componentA` and `componentB` vary independently.
What would happen to the probability model if there was a correlation between the lengths of
`componentA` and `componentB`?

In the example below a database containing assembled pairs of components is used to determine
if there is a correlation between the lengths of the components, and how the correlation affects the model.

Before performing a simulation of the effects of correlation on the probability model it's
useful to understand what the correlation is between the lengths of `componentA` and `componentB`.

[source,text]
----
let(a=random(collection5, q="*:*", rows="5000", fl="componentA_d, componentB_d"), <1>
    b=col(a, componentA_d), <2>
    c=col(a, componentB_d),
    d=corr(b, c)) <3>
----

<1> In the example, 5000 random samples are selected from a collection of assembled hinges. Each sample contains lengths of the components in the fields `componentA_d` and `componentB_d`.
<2> Both fields are then vectorized. The `componentA_d` vector is stored in variable *`b`* and the `componentB_d` vector is stored in variable *`c`*.
<3> Then the correlation of the two vectors is calculated using the `corr` function.

Note from the result that the outcome from `corr` is 0.9996931313216989.
This means that `componentA_d` and `componentB_d` are almost perfectly correlated.

[source,json]
----
...
}
----
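
For intuition, the Pearson correlation coefficient that `corr` computes by default can be sketched in a few lines of standard-library Python. The paired sample vectors here are made up for illustration; they are not the collection data from the example.

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation: the covariance of x and y divided by the
    # product of their standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up paired lengths where each c value is exactly the matching
# b value shifted by a constant, so the correlation is a perfect 1.0.
b = [2.18, 2.20, 2.22, 2.19, 2.21]
c = [2.69, 2.71, 2.73, 2.70, 2.72]
print(pearson(b, c))
```

A value near 1, like the 0.9996931313216989 above, means the second vector is almost a linear function of the first.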

=== Correlation Effects on the Probability Model

The example below explores how to use a multivariate normal distribution function
to model how correlation affects the probability of hinge defects.

In this example 5000 random samples are selected from a collection
containing length data for assembled hinges. Each sample contains
the fields `componentA_d` and `componentB_d`.

Both fields are then vectorized. The `componentA_d` vector is stored in
variable *`b`* and the `componentB_d` vector is stored in variable *`c`*.

An array is created that contains the means of the two vectorized fields.

Then both vectors are added to a matrix which is transposed. This creates
an observation matrix where each row contains one observation of
`componentA_d` and `componentB_d`. A covariance matrix is then created from the columns of
the observation matrix with the
`cov` function. The covariance matrix describes the covariance between `componentA_d` and `componentB_d`.

The `multivariateNormalDistribution` function is then called with the
array of means for the two fields and the covariance matrix. The model
for the multivariate normal distribution is stored in variable *`g`*.

The `monteCarlo` function then calls the function `add(sample(g))` 50000 times
and collects the results in a vector. Each time the function is called a single sample
is drawn from the multivariate normal distribution. Each sample is a vector containing
one `componentA` and `componentB` pair. The `add` function adds the values in the vector to
calculate the length of the pair. Over the long term the samples drawn from the
multivariate normal distribution will conform to the covariance matrix used to construct it.

[source,text]
----
let(a=random(hinges, q="*:*", rows="5000", fl="componentA_d, componentB_d"),
    ...
    j=cumulativeProbability(i, 5))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----
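
To make the effect of correlation concrete, here is a hedged standard-library Python sketch. The 2x2 covariance matrix is factored by hand (a Cholesky step), so correlated pairs can be drawn from two independent unit normals. The means, standard deviations, and near-perfect correlation are taken from the earlier examples; everything else is illustrative.

```python
import random
from math import sqrt

random.seed(7)  # fixed seed so the run is reproducible

def sample_pair(mean_a, sd_a, mean_b, sd_b, rho):
    # Draw one (a, b) pair from a bivariate normal with correlation rho,
    # using the hand-rolled Cholesky factor of the 2x2 covariance matrix.
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    a = mean_a + sd_a * z1
    b = mean_b + sd_b * (rho * z1 + sqrt(1 - rho ** 2) * z2)
    return a, b

# With near-perfect correlation, long components pair with long components,
# which fattens the tails of the combined-length distribution.
runs = 50_000
lengths = [sum(sample_pair(2.2, .0195, 2.71, .0198, 0.9997))
           for _ in range(runs)]
prob = sum(1 for x in lengths if x <= 5) / runs
print(prob)
```

The resulting probability is noticeably lower than the .9994 produced by the independent model, which is the point of the correlated simulation.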

[source,text]
----
let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
    ...
    c=describe(b))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
    ...
    c=hist(b, 5))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----
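
The idea behind a histogram function like `hist` can be sketched with a small Python function. This assumes equal-width bins over the observed range; Solr's exact binning may differ.

```python
def hist(values, bins):
    # Split the value range into equal-width bins and count the values
    # that fall into each bin.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The max value would index one past the end, so clamp it.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    return counts

print(hist([1, 2, 2, 3, 8, 9, 9, 10], 3))  # [4, 0, 4]
```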

[source,text]
----
let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
    ...
    d=col(c, N))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=random(collection1, q="*:*", rows="15000", fl="day_i"),
    ...
    c=freqTable(b))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
    ...
    c=percentile(b, 95))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=array(1, 2, 3, 4, 5),
    ...
    c=cov(a, b))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=array(1, 2, 3, 4, 5),
    ...
    e=cov(d))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=array(1, 2, 3, 4, 5),
    ...
    c=corr(a, b, type=spearmans))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
    ...
    e=ttest(c, d))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
    ...
    e=ttest(c, d))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=array(1,2,3),
    b=zscores(a))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----
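
Standardizing a vector as z-scores can be sketched in standard-library Python. This sketch uses the sample standard deviation (n - 1 denominator), which is an assumption; under it, the array `[1, 2, 3]` standardizes to `[-1, 0, 1]`.

```python
from statistics import mean, stdev

def zscores(values):
    # Express each value as the number of (sample) standard deviations
    # it sits away from the mean.
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

print(zscores([1, 2, 3]))  # [-1.0, 0.0, 1.0]
```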

The `nodes` function provides breadth-first graph traversal.

== knnSearch

The `knnSearch` function returns the k-nearest neighbors for a document based on text similarity. Under the covers the `knnSearch` function
uses the More Like This query parser plugin.

=== knnSearch Parameters

// specific language governing permissions and limitations
// under the License.

Term frequency-inverse document frequency (TF-IDF) term vectors are often used to
represent text documents when performing text mining and machine learning operations. The math expressions
library can be used to perform text analysis and create TF-IDF term vectors.

== Text Analysis

The `analyze` function applies a Solr analyzer to a text field and returns the tokens
emitted by the analyzer in an array. Any analyzer chain that is attached to a field in Solr's
schema can be used with the `analyze` function.

In the example below, the text "hello world" is analyzed using the analyzer chain attached to the `subject` field in
the schema. The `subject` field is defined as the field type `text_general` and the text is analyzed using the
analysis chain configured for the `text_general` field type.

[source,text]
----
analyze("hello world", subject)
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

The `analyze` function can be used inside of a `select` function to annotate documents with the tokens
generated by the analysis.

The example below performs a `search` in "collection1". Each tuple returned by the `search` function
contains an `id` and `subject`. For each tuple, the
`select` function selects the `id` field and calls the `analyze` function on the `subject` field.
The analyzer chain specified by the `subject_bigram` field is configured to perform a bigram analysis.
The tokens generated by the `analyze` function are added to each tuple in a field called `terms`.

[source,text]
----
select(search(collection1, q="*:*", fl="id, subject", sort="id asc"),
       id,
       analyze(subject, subject_bigram) as terms)
----

Notice in the output that an array of bigram terms has been added to the tuples:

[source,json]
----
...
----
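
Conceptually, a bigram analysis pairs each token with its successor. A tiny illustrative sketch of that idea in Python (this is not Solr's analyzer, just the token-pairing step):

```python
def bigrams(tokens):
    # Pair each token with the token that follows it.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(bigrams(["hello", "world", "again"]))  # ['hello world', 'world again']
```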

== TF-IDF Term Vectors

The `termVectors` function can be used to build TF-IDF term vectors from the terms generated by the `analyze` function.

The `termVectors` function operates over a list of tuples that contain a field called `id` and a field called `terms`.
Notice that this is the exact output structure of the document annotation example above.

The `termVectors` function builds a matrix from the list of tuples. There is a row in the
matrix for each tuple in the list. There is a column in the matrix for each term in the `terms` field.

The example below builds on the document annotation example.

[source,text]
----
let(echo="c, d", <1>
    a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"), <2>
             id,
             analyze(subject, subject_bigram) as terms),
    b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1), <3>
    c=getRowLabels(b), <4>
    d=getColumnLabels(b))
----

<1> The `echo` parameter will echo variables *`c`* and *`d`*, so the output includes the row and column labels, which are defined later in the expression.
<2> The list of tuples is stored in variable *`a`*. The `termVectors` function operates over variable *`a`* and builds a matrix with 2 rows and 4 columns.
<3> The `termVectors` function sets the row and column labels of the term vectors matrix as variable *`b`*. The row labels are the document ids and the column labels are the terms.
<4> The `getRowLabels` and `getColumnLabels` functions return the row and column labels, which are then stored in variables *`c`* and *`d`*.

When this expression is sent to the `/stream` handler it
responds with:

[source,json]
----
...
----

[source,text]
----
let(a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
    ...
    b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]
----
...
----

There are four parameters designed to filter noisy terms from the term vector matrix:

`minTermLength`::
The minimum term length required to include the term in the matrix.

`minDocFreq`::
The minimum percentage, expressed as a number between 0 and 1, of documents the term must appear in to be included in the index.

`maxDocFreq`::
The maximum percentage, expressed as a number between 0 and 1, of documents the term can appear in to be included in the index.

`exclude`::
A comma delimited list of strings used to exclude terms. If a term contains any of the exclude strings that
term will be excluded from the term vector.
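
To make the filtering parameters concrete, here is a hypothetical standard-library Python sketch that builds a term/document matrix with TF-IDF-style weights and applies `minTermLength`, `minDocFreq`, and `maxDocFreq` as described above. The weighting formula (term frequency times a log-scaled inverse document frequency) is one common choice, not necessarily Solr's exact one, and the documents are made up.

```python
from math import log

def term_vectors(docs, min_term_length=1, min_doc_freq=0.0, max_doc_freq=1.0):
    # docs maps a document id to its list of analyzed terms.
    n = len(docs)
    df = {}
    for terms in docs.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    # Keep only terms that pass the length and document-frequency filters.
    vocab = sorted(t for t, d in df.items()
                   if len(t) >= min_term_length
                   and min_doc_freq <= d / n <= max_doc_freq)
    # One row per document, one column per surviving term.
    matrix = {doc_id: [terms.count(t) * (log(n / df[t]) + 1) for t in vocab]
              for doc_id, terms in docs.items()}
    return vocab, matrix

docs = {"doc1": ["hello world", "world again"],
        "doc2": ["hello world", "goodbye now"]}
vocab, matrix = term_vectors(docs, min_term_length=4)
print(vocab)  # ['goodbye now', 'hello world', 'world again']
```

Terms that appear in every document get the smallest weight, which is why raising `minDocFreq` or lowering `maxDocFreq` prunes terms that carry little signal.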

[source,text]
----
timeseries(collection1,
    ...
    count(*))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=timeseries(collection1,
    ...
    b=col(a, count(*)))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=timeseries(collection1,
    ...
    c=movingAvg(b, 3))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----
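
A trailing moving average like the one `movingAvg` computes can be sketched in a few lines of Python. The window semantics here are assumed: each output is the mean of the current value and the `window - 1` values before it, so the output series is shorter than the input by `window - 1`.

```python
def moving_avg(values, window):
    # Trailing moving average: one output per full window.
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

print(moving_avg([1, 2, 3, 4, 5, 6], 3))  # [2.0, 3.0, 4.0, 5.0]
```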

[source,text]
----
let(a=timeseries(collection1, q=*:*,
    ...
    c=expMovingAvg(b, 3))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=timeseries(collection1,
    ...
    c=movingMedian(b, 3))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=timeseries(collection1,
    ...
    c=diff(b))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

[source,text]
----
let(a=array(1,2,5,2,1,2,5,2,1,2,5),
    b=diff(a, 4))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----
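
The lagged differencing shown above can be sketched directly: each value is differenced against the value `lag` positions earlier. With the repeating array from the example, differencing at lag 4 removes the period-4 seasonal pattern entirely, leaving all zeros.

```python
def diff(values, lag=1):
    # Difference each value against the value `lag` positions earlier.
    return [values[i] - values[i - lag] for i in range(lag, len(values))]

a = [1, 2, 5, 2, 1, 2, 5, 2, 1, 2, 5]
print(diff(a, 4))  # [0, 0, 0, 0, 0, 0, 0]
```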

== The Let Expression

The `let` expression sets variables and returns
the value of the last variable by default. The output of any streaming expression or math expression can be set to a variable.

Below is a simple example setting three variables *`a`*, *`b`*
and *`c`*. Variables *`a`* and *`b`* are set to arrays. The variable *`c`* is set
to the output of the `ebeAdd` function which performs element-by-element
addition of the two arrays.

[source,text]
----
let(a=array(1, 2, 3),
    ...
    c=ebeAdd(a, b))
----

In the response, notice that the last variable, *`c`*, is returned:

[source,json]
----
...
----

== Echoing Variables

All variables can be output by setting the `echo` variable to `true`.

[source,text]
----
let(echo=true,
    ...
    c=ebeAdd(a, b))
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]
----
...
}
----

A specific set of variables can be echoed by providing a comma delimited list of variables to the `echo` parameter.
Because variables have been provided, the `true` value is assumed.

[source,text]
----
let(echo="a,b",
    ...
    c=ebeAdd(a, b))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

Variables can be cached in-memory on the Solr node where the math expression
was run. A cached variable can then be used in future expressions. Any object
that can be set to a variable, including data structures and mathematical models, can
be cached in-memory for future use.

The `putCache` function adds a variable to the cache.

In the example below an array is cached in the `workspace` "workspace1"
and bound to the `key` "key1". The workspace allows different users to cache
objects in their own workspace. The `putCache` function returns
the variable that was added to the cache.

[source,text]
----
let(a=array(1, 2, 3),
    ...
    d=putCache(workspace1, key1, c))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
}
----
The `getCache` function retrieves an object from the cache by its workspace and key.

In the example below the `getCache` function retrieves the array that was cached above and assigns it to variable *`a`*.

[source,text]
----
let(a=getCache(workspace1, key1))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
}
----

The `listCache` function can be used to list the workspaces or the keys in a specific workspace.

In the example below `listCache` returns all the workspaces in the cache as an array of strings.

[source,text]
----
let(a=listCache())
----

When this expression is sent to the `/stream` handler it
responds with:

[source,json]
----
...
----

In the example below all the keys in a specific workspace are listed:

[source,text]
----
let(a=listCache(workspace1))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

The `removeCache` function can be used to remove a key from a specific
workspace. The `removeCache` function removes the key from the cache
and returns the object that was removed.

In the example below the array that was cached above is removed from the cache.

[source,text]
----
let(a=removeCache(workspace1, key1))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

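Taken together, the four cache functions define a simple two-level map: workspaces contain keys, and keys hold values. The sketch below models that contract in Python — it illustrates the semantics described above and is not Solr's implementation:

```python
class VariableCache:
    """A two-level dict modeling the workspace -> key -> value contract."""

    def __init__(self):
        self._workspaces = {}

    def put(self, workspace, key, value):
        # Like putCache: store the value and return it.
        self._workspaces.setdefault(workspace, {})[key] = value
        return value

    def get(self, workspace, key):
        # Like getCache: retrieve an object by workspace and key.
        return self._workspaces.get(workspace, {}).get(key)

    def list(self, workspace=None):
        # Like listCache: no argument lists workspaces, an argument lists keys.
        if workspace is None:
            return sorted(self._workspaces)
        return sorted(self._workspaces.get(workspace, {}))

    def remove(self, workspace, key):
        # Like removeCache: remove the key and return the removed object.
        return self._workspaces.get(workspace, {}).pop(key, None)

cache = VariableCache()
cache.put("workspace1", "key1", [1, 2, 3])
print(cache.get("workspace1", "key1"))  # [1, 2, 3]
```
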
// specific language governing permissions and limitations
// under the License.

This section covers vector math and vector manipulation functions.

== Arrays

Arrays can be created with the `array` function.

For example, the expression below creates a numeric array with three elements:

[source,text]
----
array(1, 2, 3)
----

When this expression is sent to the `/stream` handler it responds with a JSON array:

[source,json]
----
...
----

For example, an array can be reversed with the `rev` function:

[source,text]
----
rev(array(1, 2, 3))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
}
----

Another example is the `length` function, which returns the length of an array:

[source,text]
----
length(array(1, 2, 3))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

The `copyOfRange` function copies elements of an array from a start and end range.

[source,text]
----
copyOfRange(array(1,2,3,4,5,6), 1, 4)
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

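Assuming `copyOfRange` follows the convention of Java's `Arrays.copyOfRange` (start index inclusive, end index exclusive), the example above returns the elements `2, 3, 4`. A Python slice behaves the same way:

```python
def copy_of_range(arr, start, end):
    # Start inclusive, end exclusive -- the Arrays.copyOfRange convention.
    return arr[start:end]

print(copy_of_range([1, 2, 3, 4, 5, 6], 1, 4))  # [2, 3, 4]
```
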
== Vector Summarizations and Norms

There is a set of functions that perform summarizations and return norms of arrays. These functions
operate over an array and return a single value. The following vector summarizations and norm functions are available:
`mult`, `add`, `sumSq`, `mean`, `l1norm`, `l2norm`, `linfnorm`.

The example below shows the `mult` function, which multiplies all the values of an array.

[source,text]
----
mult(array(2,4,8))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

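For the array above, `mult` should return `2 * 4 * 8 = 64`. A quick sketch of the same summarization in Python:

```python
from functools import reduce
import operator

def mult(arr):
    # Multiply all values of the array together into a single value.
    return reduce(operator.mul, arr, 1)

print(mult([2, 4, 8]))  # 64
```
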
The vector norm functions provide different formulas for calculating vector magnitude.

The example below calculates the `l2norm` of an array.

[source,text]
----
l2norm(array(2,4,8))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

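The `l2norm` is the Euclidean magnitude: the square root of the sum of squares, so for `[2, 4, 8]` it is `sqrt(4 + 16 + 64) = sqrt(84) ≈ 9.17`. Sketches of the three named norms in Python:

```python
import math

def l1norm(arr):
    # Sum of the absolute values.
    return sum(abs(x) for x in arr)

def l2norm(arr):
    # Euclidean magnitude: square root of the sum of squares.
    return math.sqrt(sum(x * x for x in arr))

def linfnorm(arr):
    # Largest absolute value.
    return max(abs(x) for x in arr)

print(round(l2norm([2, 4, 8]), 2))  # 9.17
```
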
== Scalar Vector Math

Scalar vector math functions add, subtract, multiply or divide a scalar value with every value in a vector.
The following functions perform these operations: `scalarAdd`, `scalarSubtract`, `scalarMultiply`
and `scalarDivide`.

Below is an example of the `scalarMultiply` function, which multiplies the scalar value `3` with
every value of an array.

[source,text]
----
scalarMultiply(3, array(1,2,3))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

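`scalarMultiply(3, array(1,2,3))` should therefore produce `[3, 6, 9]`. The scalar operations are one-liners in Python:

```python
def scalar_multiply(s, arr):
    # Multiply every value in the vector by the scalar.
    return [s * x for x in arr]

def scalar_add(s, arr):
    # Add the scalar to every value in the vector.
    return [s + x for x in arr]

print(scalar_multiply(3, [1, 2, 3]))  # [3, 6, 9]
```
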
== Element-By-Element Vector Math

Two vectors can be added, subtracted, multiplied and divided using element-by-element
vector math functions. The available element-by-element vector math functions are:
`ebeAdd`, `ebeSubtract`, `ebeMultiply`, `ebeDivide`.

The expression below performs the element-by-element subtraction of two arrays.

[source,text]
----
ebeSubtract(array(10, 15, 20), array(1,2,3))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

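For the arrays above, `ebeSubtract` yields `[9, 13, 17]`. In Python terms:

```python
def ebe_subtract(a, b):
    # Element-by-element subtraction of two equal-length vectors.
    return [x - y for x, y in zip(a, b)]

print(ebe_subtract([10, 15, 20], [1, 2, 3]))  # [9, 13, 17]
```
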
Below is an example of the `dotProduct` function:

[source,text]
----
dotProduct(array(2,3,0,0,0,1), array(2,0,1,0,0,3))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

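The dot product sums the pairwise products: `2*2 + 3*0 + 0*1 + 0*0 + 0*0 + 1*3 = 7` for the arrays above. As a one-line Python check:

```python
def dot_product(a, b):
    # Sum of the pairwise products of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

print(dot_product([2, 3, 0, 0, 0, 1], [2, 0, 1, 0, 0, 3]))  # 7
```
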
Below is an example of the `cosineSimilarity` function:

[source,text]
----
cosineSimilarity(array(2,3,0,0,0,1), array(2,0,1,0,0,3))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

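Cosine similarity is the dot product divided by the product of the two vector magnitudes. Both arrays above have magnitude `sqrt(14)`, so the similarity works out to `7 / 14 = 0.5`:

```python
import math

def cosine_similarity(a, b):
    # Dot product normalized by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

print(round(cosine_similarity([2, 3, 0, 0, 0, 1], [2, 0, 1, 0, 0, 3]), 6))  # 0.5
```
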
This section of the user guide explores techniques
for retrieving streams of data from Solr and vectorizing the
numeric fields.

See the section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> which describes how to
vectorize text fields.

== Streams

... to vectorize and analyze the result sets.

Below are some of the key stream sources:

* *`random`*: Random sampling is widely used in statistics, probability and machine learning.
The `random` function returns a random sample of search results that match a
query. The random samples can be vectorized and operated on by math expressions and the results
can be used to describe and make inferences about the entire population.

* *`timeseries`*: The `timeseries`
expression provides fast distributed time series aggregations, which can be
vectorized and analyzed with math expressions.

* *`knnSearch`*: K-nearest neighbor is a core machine learning algorithm. The `knnSearch`
function is a specialized knn algorithm optimized to find the k-nearest neighbors of a document in
a distributed index. Once the nearest neighbors are retrieved they can be vectorized
and operated on by machine learning and text mining algorithms.

* *`sql`*: SQL is the primary query language used by data scientists. The `sql` function supports
data retrieval using a subset of SQL which includes both full text search and
fast distributed aggregations. The result sets can then be vectorized and operated
on by math expressions.

* *`jdbc`*: The `jdbc` function allows data from any JDBC compliant data source to be combined with
streams originating from Solr. Result sets from outside data sources can be vectorized and operated
on by math expressions in the same manner as result sets originating from Solr.

* *`topic`*: Messaging is an important foundational technology for large scale computing. The `topic`
function provides publish/subscribe messaging capabilities by treating
Solr Cloud as a distributed message queue. Topics are extremely powerful
because they allow subscription by query. Topics can be used to support a broad set of
use cases including bulk text mining operations and AI alerting.

* *`nodes`*: Graph queries are frequently used by recommendation engines and are an important
machine learning tool. The `nodes` function provides fast, distributed, breadth
first graph traversal over documents in a Solr Cloud collection. The node sets collected
by the `nodes` function can be operated on by statistical and machine learning expressions to
gain more insight into the graph.

* *`search`*: Ranked search results are a powerful tool for finding the most relevant
documents from a large document corpus. The `search` expression
returns the top N ranked search results that match any
Solr query, including geo-spatial queries. The smaller set of relevant ...

... text mining expressions to gather insights about the data set.

The output of any streaming expression can be set to a variable.
Below is a very simple example using the `random` function to fetch
three random samples from collection1. The random samples are returned
as tuples which contain name/value pairs.

[source,text]
----
let(a=random(collection1, q="*:*", rows="3", fl="price_f"))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
}
----

== Creating a Vector with the col Function

The `col` function iterates over a list of tuples and copies the values
from a specific column into an array.

The output of the `col` function is a numeric array that can be set to a
variable and operated on by math expressions.

[source,text]
----
let(a=random(collection1, q="*:*", rows="3", fl="price_f"),
    ...
----

Once a vector has been created any math expression that operates on vectors
can be applied. In the example below the `mean` function is applied to
the vector assigned to variable *`b`*.

[source,text]
----
let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
    ...
    c=mean(b))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

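In Python terms, `col` is a projection of one field across a list of tuples, and `mean` averages the resulting vector. The tuples below are made-up stand-ins for the `random` results:

```python
def col(tuples, field):
    # Copy one field's value from each tuple into an array.
    return [t[field] for t in tuples]

def mean(arr):
    # Arithmetic mean of the vector.
    return sum(arr) / len(arr)

# Illustrative tuples standing in for random() results.
tuples = [{"price_f": 10.0}, {"price_f": 11.0}, {"price_f": 12.0}]
b = col(tuples, "price_f")
print(mean(b))  # 11.0
```
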
Matrices can be created by vectorizing multiple numeric fields
and adding them to a matrix. The matrices can then be operated on by
any math expression that operates on matrices.

[TIP]
====
Note that this section deals with the creation of matrices
from numeric data. The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> describes how to build TF-IDF term vector matrices from text fields.
====

Below is a simple example where four random samples are taken
from different sub-populations in the data. The `price_f` field of
each random sample is
vectorized and the vectors are added as rows to a matrix.
Then the `sumRows` function is applied to the matrix.

[source,text]
----
let(a=random(collection1, q="market:A", rows="5000", fl="price_f"),
    ...
    j=sumRows(i))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
----

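`sumRows` collapses each row of a matrix into a single number, so with one row per random sample the output is one sum per sub-population. A sketch with a made-up matrix:

```python
def sum_rows(matrix):
    # Sum each row of the matrix, returning one value per row.
    return [sum(row) for row in matrix]

# Two illustrative rows standing in for vectorized samples.
matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
print(sum_rows(matrix))  # [6.0, 15.0]
```
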
== Latitude / Longitude Vectors

The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into
a matrix of lat/lon vectors. Each row in the matrix is a vector that contains the lat/lon
pair for the corresponding tuple in the list. The row labels for the matrix are
automatically set to the `id` field in the tuples. The lat/lon matrix can then be operated
on by distance-based machine learning functions using the `haversineMeters` distance measure.

The `latlonVectors` function takes two parameters: a list of tuples and a named parameter called
`field`, which tells the `latlonVectors` function which field to parse the lat/lon
vectors from.

Below is an example of the `latlonVectors` function.

[source,text]
----
let(a=random(collection1, q="*:*", fl="id, loc_p", rows="5"),
    b=latlonVectors(a, field="loc_p"))
----

When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
...
  }
}
----

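The `haversineMeters` measure referenced above is the great-circle distance between two lat/lon points. A standard haversine sketch in Python (the Earth-radius constant here is an assumption; Solr's exact value may differ slightly):

```python
import math

def haversine_meters(lat1, lon1, lat2, lon2, radius=6371000.0):
    # Great-circle distance between two lat/lon points, in meters.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

# A quarter of the equator is about 10,007 km.
print(round(haversine_meters(0, 0, 0, 90)))  # 10007543
```
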
Reference in New Issue