SOLR-12701: format/style consistency fixes for math expression docs; CSS change to make bold monospace appear properly

Cassandra Targett 2018-09-11 08:45:46 -05:00
parent a1b6db26db
commit a619038e90
17 changed files with 498 additions and 579 deletions


@ -885,6 +885,11 @@ h6 strong
line-height: 1.45;
}
p strong code,
td strong code {
font-weight: bold;
}
pre,
pre > code
{


@ -21,11 +21,11 @@
The `polyfit` function is a general purpose curve fitter used to model
the *non-linear* relationship between two random variables.
the non-linear relationship between two random variables.
The `polyfit` function is passed *x* and *y* axises and fits a smooth curve to the data.
If only a single array is provided it is treated as the *y* axis and a sequence is generated
for the *x* axis.
The `polyfit` function is passed x- and y-axes and fits a smooth curve to the data.
If only a single array is provided it is treated as the y-axis and a sequence is generated
for the x-axis.
The `polyfit` function also has a parameter that specifies the degree of the polynomial. The higher
the degree the more curves that can be modeled.
@ -34,7 +34,7 @@ The example below uses the `polyfit` function to fit a curve to an array using
a 3 degree polynomial. The fitted curve is then subtracted from the original curve. The output
shows the error between the fitted curve and the original curve, known as the residuals.
The output also includes the sum-of-squares of the residuals which provides a measure
of how large the error is..
of how large the error is.
[source,text]
----
@ -45,7 +45,7 @@ let(echo="residuals, sumSqError",
sumSqError=sumSq(residuals))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -95,7 +95,7 @@ let(echo="residuals, sumSqError",
sumSqError=sumSq(residuals))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -138,10 +138,10 @@ responds with:
The `polyfit` function returns a function that can be used with the `predict`
function.
In the example below the x axis is included for clarity.
In the example below the x-axis is included for clarity.
The `polyfit` function returns a function for the fitted curve.
The `predict` function is then used to predict a value along the curve, in this
case the prediction is made for the *x* value of 5.
case the prediction is made for the *`x`* value of 5.
[source,text]
----
@ -151,7 +151,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
p=predict(curve, 5))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -185,7 +185,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
d=derivative(curve))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -235,7 +235,7 @@ let(x=array(0,1,2,3,4,5,6,7,8,9, 10),
f=gaussfit(x, y))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -283,7 +283,7 @@ let(x=array(0,1,2,3,4,5,6,7,8,9, 10),
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]


@ -30,17 +30,17 @@ the more advanced DSP functions, its useful to get a better understanding of how
The `dotProduct` function can be used to combine two arrays into a single product. A simple example can help
illustrate this concept.
In the example below two arrays are set to variables *a* and *b* and then operated on by the `dotProduct` function.
The output of the `dotProduct` function is set to variable *c*.
In the example below two arrays are set to variables *`a`* and *`b`* and then operated on by the `dotProduct` function.
The output of the `dotProduct` function is set to variable *`c`*.
Then the `mean` function is then used to compute the mean of the first array which is set to the variable `d`.
The `mean` function is then used to compute the mean of the first array which is set to the variable *`d`*.
Both the *dot product* and the *mean* are included in the output.
Both the dot product and the mean are included in the output.
When we look at the output of this expression we see that the *dot product* and the *mean* of the first array
When we look at the output of this expression we see that the dot product and the mean of the first array
are both 30.
The dot product function *calculated the mean* of the first array.
The `dotProduct` function calculated the mean of the first array.
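As a rough illustration outside Solr, the same relationship can be sketched in plain Python. The array values are assumptions chosen so that the mean is 30, and `dot_product` is a hypothetical helper, not a Solr function.

```python
# Illustrative sketch only: input values are assumed, not taken from the Solr example.
def dot_product(a, b):
    """Multiply the arrays element by element, then sum the products."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return sum(x * y for x, y in zip(a, b))

a = [10, 20, 30, 40, 50]
b = [0.2] * len(a)       # every element gets the same weight, 1/5

c = dot_product(a, b)    # weighted sum with equal weights
d = sum(a) / len(a)      # ordinary arithmetic mean

print(round(c, 6), d)
```

Because each weight is 1/5 and there are five elements, the weighted sum and the arithmetic mean coincide.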
[source,text]
----
@ -51,7 +51,7 @@ let(echo="c, d",
d=mean(a))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -76,9 +76,9 @@ calculation using vector math and look at the output of each step.
In the example below the `ebeMultiply` function performs an element-by-element multiplication of
two arrays. This is the first step of the dot product calculation. The result of the element-by-element
multiplication is assigned to variable *c*.
multiplication is assigned to variable *`c`*.
In the next step the `add` function adds all the elements of the array in variable *c*.
In the next step the `add` function adds all the elements of the array in variable *`c`*.
Notice that multiplying each element of the first array by .2 and then adding the results is
equivalent to the formula for computing the mean of the first array. The formula for computing the mean
@ -95,7 +95,7 @@ let(echo="c, d",
d=add(c))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -122,11 +122,13 @@ When this expression is sent to the /stream handler it responds with:
----
In the example above two arrays were combined in a way that produced the mean of the first. In the second array
each value was set to .2. Another way of looking at this is that each value in the second array has the same weight.
By varying the weights in the second array we can produce a different result. For example if the first array represents a time series,
each value was set to ".2". Another way of looking at this is that each value in the second array has the same weight.
By varying the weights in the second array we can produce a different result.
For example if the first array represents a time series,
the weights in the second array can be set to add more weight to a particular element in the first array.
The example below creates a weighted average with the weight decreasing from right to left. Notice that the weighted mean
The example below creates a weighted average with the weight decreasing from right to left.
Notice that the weighted mean
of 36.666 is larger than the previous mean which was 30. This is because more weight was given to the last element in the
array.
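A minimal sketch of the weighted average in plain Python follows. The input array and the raw weights 1 through 5 are assumptions for illustration; they happen to reproduce a weighted mean of about 36.666.

```python
# Illustrative sketch only: the values and weights are assumed.
a = [10, 20, 30, 40, 50]
raw = [1, 2, 3, 4, 5]                  # weight grows toward the last element
weights = [w / sum(raw) for w in raw]  # normalize so the weights sum to 1

weighted_mean = sum(x * w for x, w in zip(a, weights))
plain_mean = sum(a) / len(a)

print(round(weighted_mean, 3), plain_mean)
```

The weighted mean is pulled above the plain mean because the largest values sit where the weights are largest.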
@ -139,7 +141,7 @@ let(echo="c, d",
d=add(c))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -167,13 +169,13 @@ When this expression is sent to the /stream handler it responds with:
=== Representing Correlation
Often when we think of correlation, we are thinking of *Pearsons* correlation in the field of statistics. But the definition of
Often when we think of correlation, we are thinking of _Pearson correlation_ in the field of statistics. But the definition of
correlation is actually more general: a mutual relationship or connection between two or more things.
In the field of digital signal processing the dot product is used to represent correlation. The examples below demonstrate
how the dot product can be used to represent correlation.
In the example below the dot product is computed for two vectors. Notice that the vectors have different values that fluctuate
together. The output of the dot product is 190, which is hard to reason about because because its not scaled.
together. The output of the dot product is 190, which is hard to reason about because it's not scaled.
[source,text]
----
@ -183,7 +185,7 @@ let(echo="c, d",
c=dotProduct(a, b))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -206,9 +208,9 @@ One approach to scaling the dot product is to first scale the vectors so that bo
magnitude of 1, also called unit vectors, are used when comparing only the angle between vectors rather than the magnitude.
The `unitize` function can be used to unitize the vectors before calculating the dot product.
Notice in the example below the dot product result, set to variable *e*, is effectively 1. When applied to unit vectors the dot product
will be scaled between 1 and -1. Also notice in the example `cosineSimilarity` is calculated on the *unscaled* vectors and the
answer is also effectively 1. This is because *cosine similarity* is a scaled *dot product*.
Notice in the example below the dot product result, set to variable *`e`*, is effectively 1. When applied to unit vectors the dot product
will be scaled between 1 and -1. Also notice in the example `cosineSimilarity` is calculated on the unscaled vectors and the
answer is also effectively 1. This is because cosine similarity is a scaled dot product.
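The relationship between unit vectors and cosine similarity can be checked with a short Python sketch. The helper names and the input vectors are assumptions for illustration; they are not the Solr implementations.

```python
import math

# Illustrative sketch only: helper names and input vectors are assumed.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def magnitude(v):
    return math.sqrt(dot(v, v))

def unitize(v):
    """Scale a vector so that its magnitude is 1."""
    m = magnitude(v)
    if m == 0:
        raise ValueError("cannot unitize a zero vector")
    return [x / m for x in v]

def cosine_similarity(a, b):
    """Dot product divided by the product of the magnitudes."""
    return dot(a, b) / (magnitude(a) * magnitude(b))

a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]   # same direction as a, ten times the magnitude

e = dot(unitize(a), unitize(b))   # dot product of the unit vectors
f = cosine_similarity(a, b)       # computed on the unscaled vectors

print(round(e, 6), round(f, 6))
```

Dividing by the magnitudes before or after taking the dot product gives the same value, which is why both results agree.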
[source,text]
@ -222,7 +224,7 @@ let(echo="e, f",
f=cosineSimilarity(a, b))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -254,7 +256,7 @@ let(echo="c, d",
c=cosineSimilarity(a, b))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -275,10 +277,10 @@ When this expression is sent to the /stream handler it responds with:
== Convolution
The `conv` function calculates the convolution of two vectors. The convolution is calculated by *reversing*
the second vector and sliding it across the first vector. The *dot product* of the two vectors
The `conv` function calculates the convolution of two vectors. The convolution is calculated by reversing
the second vector and sliding it across the first vector. The dot product of the two vectors
is calculated at each point as the second vector is slid across the first vector.
The dot products are collected in a *third vector* which is the *convolution* of the two vectors.
The dot products are collected in a third vector which is the convolution of the two vectors.
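The reverse-and-slide description can be sketched directly in Python. This `conv` is a hypothetical helper written for illustration and the inputs are assumed; it is not the Solr implementation.

```python
# Illustrative sketch only: a direct reverse-and-slide convolution.
def conv(a, b):
    rb = b[::-1]                       # reverse the second vector
    m = len(b)
    out = []
    for n in range(len(a) + m - 1):    # each position of the sliding vector
        total = 0.0
        for j, w in enumerate(rb):
            k = n - (m - 1) + j        # element of a lined up with rb[j]
            if 0 <= k < len(a):        # ignore positions that hang off the ends
                total += a[k] * w
        out.append(total)              # dot product of the overlapping parts
    return out

result = conv([1, 2, 3], [0, 1, 0.5])
print(result)
```

The output has `len(a) + len(b) - 1` elements because the sliding vector also produces partial overlaps at both ends.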
=== Moving Average Function
@ -290,7 +292,7 @@ is syntactic sugar for convolution.
Below is an example of a moving average with a window size of 5. Notice that the original vector has 13 elements
but the result of the moving average has only 9 elements. This is because the `movingAvg` function
only begins generating results when it has a full window. In this case, because the window size is 5, the
moving average starts generating results from the 4th index of the original array.
moving average starts generating results from the 4^th^ index of the original array.
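For comparison, a plain Python sketch of a full-window moving average over the same 13-element array is shown here. The `moving_avg` helper is hypothetical and only mirrors the behavior described above.

```python
# Illustrative sketch only: emit a value once a full window is available.
def moving_avg(values, window):
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))   # first result at index window - 1
    ]

a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = moving_avg(a, 5)
print(len(b), b)
```

With 13 inputs and a window of 5, the first result is produced at index 4, leaving 13 - 5 + 1 = 9 values.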
[source,text]
----
@ -298,7 +300,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
b=movingAvg(a, 5))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -344,7 +346,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
c=conv(a, b))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -381,7 +383,7 @@ When this expression is sent to the /stream handler it responds with:
}
----
We achieve the same result as the `movingAvg` gunction by using the `copyOfRange` function to copy a range of
We achieve the same result as the `movingAvg` function by using the `copyOfRange` function to copy a range of
the result that drops the first and last 4 values of
the convolution result. In the example below the `precision` function is also used to remove floating point errors from the
convolution result. When this is added the output is exactly the same as the `movingAvg` function.
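The trimming step can be sketched in Python as follows. The filter of five equal weights of 0.2 and the helper name `conv` are assumptions for illustration; slicing and `round` stand in for the roles that `copyOfRange` and `precision` play in the expression.

```python
# Illustrative sketch only: moving average expressed as a trimmed convolution.
def conv(a, b):
    """Full convolution written as a direct double sum."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [0.2] * 5                        # assumed filter: five equal weights

c = conv(a, b)                       # 13 + 5 - 1 = 17 values
d = c[4:len(c) - 4]                  # drop the first and last 4 partial overlaps
e = [round(v, 2) for v in d]         # remove floating point noise

print(e)
```

The dropped values are the positions where the filter only partly overlaps the array, so what remains are the 9 full-window averages.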
@ -395,7 +397,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
e=precision(d, 2))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -446,7 +448,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
c=conv(a, rev(b)))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -504,7 +506,7 @@ let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
c=finddelay(a, b))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----


@ -26,13 +26,12 @@ Before performing machine learning operations its often necessary to
scale the feature vectors so they can be compared at the same scale.
All the scaling functions operate on vectors and matrices.
When operating on a matrix the *rows* of the matrix are scaled.
When operating on a matrix the rows of the matrix are scaled.
=== Min/Max Scaling
The `minMaxScale` function scales a vector or matrix between a min and
max value. By default it will scale between 0 and 1 if min/max values
are not provided.
The `minMaxScale` function scales a vector or matrix between a minimum and maximum value.
By default it will scale between 0 and 1 if min/max values are not provided.
Below is a simple example of min/max scaling between 0 and 1.
Notice that once brought into the same scale the vectors are the same.
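A minimal Python sketch of min/max scaling is given here. The helper `min_max_scale`, its parameter names, and the two input vectors are assumptions for illustration only.

```python
# Illustrative sketch only: helper name, parameters and data are assumed.
def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly map values so the smallest becomes lo and the largest becomes hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        raise ValueError("cannot scale a constant vector")
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

a = [20, 30, 40, 50]
b = [200, 300, 400, 500]   # same shape as a at ten times the scale

scaled_a = min_max_scale(a)
scaled_b = min_max_scale(b)
print(scaled_a)
print(scaled_b)
```

Both vectors map to the same scaled values because they differ only by a constant factor.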
@ -79,10 +78,10 @@ This expression returns the following response:
=== Standardization
The `standardize` function scales a vector so that it has a
mean of 0 and a standard deviation of 1. Standardization can be
used with machine learning algorithms, such as SVM, that
perform better when the data has a normal distribution.
The `standardize` function scales a vector so that it has a mean of 0 and a standard deviation of 1.
Standardization can be used with machine learning algorithms, such as
https://en.wikipedia.org/wiki/Support_vector_machine[Support Vector Machine (SVM)], that perform better
when the data has a normal distribution.
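The idea can be sketched in Python with the standard `statistics` module. This sketch assumes the sample standard deviation (n - 1 denominator); whether the `standardize` function uses the sample or population form is not stated here, so treat that choice as an assumption.

```python
import statistics

# Illustrative sketch only: uses the sample standard deviation, which is an assumption.
def standardize(values):
    """Shift to mean 0 and rescale to standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        raise ValueError("cannot standardize a constant vector")
    return [(v - mean) / stdev for v in values]

a = [20, 30, 40, 50, 60]
z = standardize(a)
print([round(v, 4) for v in z])
```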
[source,text]
----
@ -127,8 +126,7 @@ This expression returns the following response:
=== Unit Vectors
The `unitize` function scales vectors to a magnitude of 1. A vector with a
magnitude of 1 is known as a unit vector. Unit vectors are
preferred when the vector math deals
magnitude of 1 is known as a unit vector. Unit vectors are preferred when the vector math deals
with vector direction rather than magnitude.
[source,text]
@ -173,24 +171,20 @@ This expression returns the following response:
== Distance and Distance Measures
The `distance` function computes the distance for two
numeric arrays or a *distance matrix* for the columns of a matrix.
The `distance` function computes the distance for two numeric arrays or a distance matrix for the columns of a matrix.
There are four distance measure functions that return a function
that performs the actual distance calculation:
There are five distance measure functions that return a function that performs the actual distance calculation:
* euclidean (default)
* manhattan
* canberra
* earthMovers
* haversineMeters (Geospatial distance measure)
* `euclidean` (default)
* `manhattan`
* `canberra`
* `earthMovers`
* `haversineMeters` (Geospatial distance measure)
The distance measure functions can be used with all machine learning functions
that support different distance measures.
Below is an example for computing euclidean distance for
two numeric arrays:
that support distance measures.
Below is an example for computing Euclidean distance for two numeric arrays:
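For reference, three of the listed measures can be written out in a few lines of Python. These helpers are hypothetical illustrations of the standard formulas, with assumed input arrays; the earth mover's and haversine measures are omitted.

```python
import math

# Illustrative sketch only: textbook formulas, not the Solr implementations.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    # Terms where both values are zero are skipped to avoid dividing by zero.
    return sum(
        abs(x - y) / (abs(x) + abs(y))
        for x, y in zip(a, b)
        if abs(x) + abs(y) > 0
    )

a = [1, 2, 3, 4]
b = [2, 4, 6, 8]
print(euclidean(a, b), manhattan(a, b), canberra(a, b))
```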
[source,text]
----
@ -294,48 +288,46 @@ This expression returns the following response:
}
----
== K-means Clustering
== K-Means Clustering
The `kmeans` function performs k-means clustering of the rows of a matrix.
Once the clustering has been completed there are a number of useful functions available
for examining the *clusters* and *centroids*.
for examining the clusters and centroids.
The examples below are clustering *term vectors*.
The chapter on <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> should be
consulted for a full explanation of these features.
The examples below cluster _term vectors_.
The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> offers
a full explanation of these features.
=== Centroid Features
In the example below the `kmeans` function is used to cluster a result set from the Enron email data-set
and then the top features are extracted from the cluster centroids.
Let's look at what data is assigned to each variable:
* *a*: The `random` function returns a sample of 500 documents from the *enron*
collection that match the query *body:oil*. The `select` function selects the *id* and
and annotates each tuple with the analyzed bigram terms from the body field.
* *b*: The `termVectors` function creates a TF-IDF term vector matrix from the
tuples stored in variable *a*. Each row in the matrix represents a document. The columns of the matrix
are the bigram terms that were attached to each tuple.
* *c*: The `kmeans` function clusters the rows of the matrix into 5 clusters. The k-means clustering is performed using the
*Euclidean distance* measure.
* *d*: The `getCentroids` function returns a matrix of cluster centroids. Each row in the matrix is a centroid
from one of the 5 clusters. The columns of the matrix are the same bigrams terms of the term vector matrix.
* *e*: The `topFeatures` function returns the column labels for the top 5 features of each centroid in the matrix.
This returns the top 5 bigram terms for each centroid.
[source,text]
----
let(a=select(random(enron, q="body:oil", rows="500", fl="id, body"),
let(a=select(random(enron, q="body:oil", rows="500", fl="id, body"), <1>
id,
analyze(body, body_bigram) as terms),
b=termVectors(a, maxDocFreq=.10, minDocFreq=.05, minTermLength=14, exclude="_,copyright"),
c=kmeans(b, 5),
d=getCentroids(c),
e=topFeatures(d, 5))
b=termVectors(a, maxDocFreq=.10, minDocFreq=.05, minTermLength=14, exclude="_,copyright"),<2>
c=kmeans(b, 5), <3>
d=getCentroids(c), <4>
e=topFeatures(d, 5)) <5>
----
Let's look at what data is assigned to each variable:
<1> *`a`*: The `random` function returns a sample of 500 documents from the "enron"
collection that match the query "body:oil". The `select` function selects the `id` and
annotates each tuple with the analyzed bigram terms from the `body` field.
<2> *`b`*: The `termVectors` function creates a TF-IDF term vector matrix from the
tuples stored in variable *`a`*. Each row in the matrix represents a document. The columns of the matrix
are the bigram terms that were attached to each tuple.
<3> *`c`*: The `kmeans` function clusters the rows of the matrix into 5 clusters. The k-means clustering is performed using the Euclidean distance measure.
<4> *`d`*: The `getCentroids` function returns a matrix of cluster centroids. Each row in the matrix is a centroid
from one of the 5 clusters. The columns of the matrix are the same bigram terms of the term vector matrix.
<5> *`e`*: The `topFeatures` function returns the column labels for the top 5 features of each centroid in the matrix.
This returns the top 5 bigram terms for each centroid.
This expression returns the following response:
[source,json]
@ -396,12 +388,6 @@ This expression returns the following response:
The example below examines the top features of a specific cluster. This example uses the same techniques
as the centroids example but the top features are extracted from a cluster rather than the centroids.
The `getCluster` function returns a cluster by its index. Each cluster is a matrix containing term vectors
that have been clustered together based on their features.
In the example below the `topFeatures` function is used to extract the top 4 features from each term vector
in the cluster.
[source,text]
----
let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
@ -409,10 +395,15 @@ let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
analyze(body, body_bigram) as terms),
b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
c=kmeans(b, 25),
d=getCluster(c, 0),
e=topFeatures(d, 4))
d=getCluster(c, 0), <1>
e=topFeatures(d, 4)) <2>
----
<1> The `getCluster` function returns a cluster by its index. Each cluster is a matrix containing term vectors
that have been clustered together based on their features.
<2> The `topFeatures` function is used to extract the top 4 features from each term vector
in the cluster.
This expression returns the following response:
[source,json]
@ -489,19 +480,17 @@ This expression returns the following response:
}
----
== Multi K-means Clustering
== Multi K-Means Clustering
K-means clustering will be produce different results depending on
K-means clustering will produce different results depending on
the initial placement of the centroids. K-means is fast enough
that multiple trials can be performed and the best outcome selected.
The `multiKmeans` function runs the K-means
clustering algorithm for a gven number of trials and selects the
best result based on which trial produces the lowest intra-cluster
variance.
The example below is identical to centroids example except that
it uses `multiKmeans` with 100 trials, rather then a single
trial of the `kmeans` function.
The `multiKmeans` function runs the k-means clustering algorithm for a given number of trials and selects the
best result based on which trial produces the lowest intra-cluster variance.
The example below is identical to the centroids example except that it uses `multiKmeans` with 100 trials,
rather than a single trial of the `kmeans` function.
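The multiple-trial idea can be sketched in Python with a small Lloyd-style k-means. Everything here is an assumption for illustration: the helper names, the toy two-dimensional rows, and the use of the total within-cluster squared distance as a stand-in for intra-cluster variance.

```python
import random

# Illustrative sketch only: a toy k-means run several times, keeping the best trial.
def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(rows, k, rng, iterations=50):
    centroids = rng.sample(rows, k)          # random initial placement
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows
        ]
        if new_assignment == assignment:     # converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [r for r, a in zip(rows, assignment) if a == c]
            if members:                      # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    cost = sum(sq_dist(r, centroids[a]) for r, a in zip(rows, assignment))
    return centroids, assignment, cost

def multi_kmeans(rows, k, trials, seed=0):
    rng = random.Random(seed)
    runs = (kmeans(rows, k, rng) for _ in range(trials))
    return min(runs, key=lambda run: run[2])  # lowest within-cluster cost wins

rows = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
centroids, assignment, cost = multi_kmeans(rows, k=2, trials=10)
print(assignment, round(cost, 4))
```

Each trial starts from a different random placement, so keeping the lowest-cost run reduces the chance of reporting a poor local optimum.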
[source,text]
----
@ -569,10 +558,10 @@ This expression returns the following response:
}
----
== Fuzzy K-means Clustering
== Fuzzy K-Means Clustering
The `fuzzyKmeans` function is a soft clustering algorithm which
allows vectors to be assigned to more then one cluster. The *fuzziness* parameter
allows vectors to be assigned to more than one cluster. The `fuzziness` parameter
is a value between 1 and 2 that determines how fuzzy to make the cluster assignment.
After the clustering has been performed the `getMembershipMatrix` function can be called
@ -585,27 +574,26 @@ A simple example will make this more clear. In the example below 300 documents a
then turned into a term vector matrix. Then the `fuzzyKmeans` function clusters the
term vectors into 12 clusters with a fuzziness factor of 1.25.
The `getMembershipMatrix` function is used to return the membership matrix and the first row
of membership matrix is retrieved with the `rowAt` function. The `precision` function is then applied to the first row
of the matrix to make it easier to read.
The output shows a single vector representing the cluster membership probabilities for the first
term vector. Notice that the term vector has the highest association with the 12th cluster,
but also has significant associations with the 3rd, 5th, 6th and 7th clusters.
[source,text]
----
et(a=select(random(collection3, q="body:oil", rows="300", fl="id, body"),
let(a=select(random(collection3, q="body:oil", rows="300", fl="id, body"),
id,
analyze(body, body_bigram) as terms),
b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
c=fuzzyKmeans(b, 12, fuzziness=1.25),
d=getMembershipMatrix(c),
e=rowAt(d, 0),
f=precision(e, 5))
d=getMembershipMatrix(c), <1>
e=rowAt(d, 0), <2>
f=precision(e, 5)) <3>
----
This expression returns the following response:
<1> The `getMembershipMatrix` function is used to return the membership matrix;
<2> and the first row of the membership matrix is retrieved with the `rowAt` function.
<3> The `precision` function is then applied to the first row
of the matrix to make it easier to read.
This expression returns a single vector representing the cluster membership probabilities for the first
term vector. Notice that the term vector has the highest association with the 12^th^ cluster,
but also has significant associations with the 3^rd^, 5^th^, 6^th^ and 7^th^ clusters:
[source,json]
----
@ -637,30 +625,21 @@ This expression returns the following response:
}
----
== K-nearest Neighbor (KNN)
== K-Nearest Neighbor (KNN)
The `knn` function searches the rows of a matrix for the
K-nearest neighbors of a search vector. The `knn` function
returns a *matrix* of the K-nearest neighbors. The `knn` function
supports changing of the distance measure by providing one of the
four distance measure functions as the fourth parameter:
k-nearest neighbors of a search vector. The `knn` function
returns a matrix of the k-nearest neighbors.
* euclidean (Default)
* manhattan
* canberra
* earthMovers
The `knn` function supports changing of the distance measure by providing one of these
distance measure functions as the fourth parameter:
The example below builds on the clustering examples to demonstrate
the `knn` function.
* `euclidean` (Default)
* `manhattan`
* `canberra`
* `earthMovers`
In the example, the centroids matrix is set to variable *d*. The first
centroid vector is selected from the matrix with the `rowAt` function.
Then the `knn` function is used to find the 3 nearest neighbors
to the centroid vector in the term vector matrix (variable b).
The `knn` function returns a matrix with the 3 nearest neighbors based on the
default distance measure which is euclidean. Finally, the top 4 features
of the term vectors in the nearest neighbor matrix are returned.
The example below builds on the clustering examples to demonstrate the `knn` function.
[source,text]
----
@ -669,13 +648,21 @@ let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"),
analyze(body, body_bigram) as terms),
b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"),
c=multiKmeans(b, 5, 100),
d=getCentroids(c),
e=rowAt(d, 0),
g=knn(b, e, 3),
h=topFeatures(g, 4))
d=getCentroids(c), <1>
e=rowAt(d, 0), <2>
g=knn(b, e, 3), <3>
h=topFeatures(g, 4)) <4>
----
This expression returns the following response:
<1> In the example, the centroids matrix is set to variable *`d`*.
<2> The first centroid vector is selected from the matrix with the `rowAt` function.
<3> Then the `knn` function is used to find the 3 nearest neighbors
to the centroid vector in the term vector matrix (variable *`b`*).
<4> The `topFeatures` function is used to request the top 4 features of the term vectors in the knn matrix.
The `knn` function returns a matrix with the 3 nearest neighbors based on the
default distance measure which is euclidean. Finally, the top 4 features
of the term vectors in the nearest neighbor matrix are returned:
[source,json]
----
@ -713,20 +700,18 @@ This expression returns the following response:
}
----
== KNN Regression
== K-Nearest Neighbor Regression
KNN regression is a non-linear, multi-variate regression method. Knn regression is a lazy learning
K-nearest neighbor regression is a non-linear, multi-variate regression method. KNN regression is a lazy learning
technique which means it does not fit a model to the training set in advance. Instead the
entire training set of observations and outcomes is held in memory and predictions are made
by averaging the outcomes of the k-nearest neighbors.
The `knnRegress` function prepares the training set for use with the `predict` function.
Below is an example of the `knnRegress` function. In this example 10000 random samples
are taken each containing the variables *filesize_d*, *service_d* and *response_d*. The pairs of
*filesize_d* and *service_d* will be used to predict the value of *response_d*.
Notice that `knnRegress` returns a tuple describing the regression inputs.
Below is an example of the `knnRegress` function. In this example 10,000 random samples
are taken, each containing the variables `filesize_d`, `service_d` and `response_d`. The pairs of
`filesize_d` and `service_d` will be used to predict the value of `response_d`.
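The lazy prediction step can be sketched in plain Python. The helper `knn_predict` and the toy observations and outcomes are assumptions for illustration; they do not reproduce the `knnRegress` or `predict` implementations. The sketch uses `math.dist`, available in Python 3.8 and later.

```python
import math

# Illustrative sketch only: no model is fitted; the training data is searched at prediction time.
def knn_predict(observations, outcomes, query, k):
    """Average the outcomes of the k observations closest to the query."""
    ranked = sorted(
        zip(observations, outcomes),
        key=lambda pair: math.dist(pair[0], query),   # Euclidean distance
    )
    nearest = [outcome for _, outcome in ranked[:k]]
    return sum(nearest) / len(nearest)

observations = [[1, 1], [2, 2], [3, 3], [10, 10], [11, 11]]
outcomes = [10, 20, 30, 100, 110]

prediction = knn_predict(observations, outcomes, [2, 2], k=3)
print(prediction)
```

The three nearest observations to `[2, 2]` have outcomes 10, 20 and 30, so the prediction is their mean.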
[source,text]
----
@ -738,7 +723,7 @@ let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d
lazyModel=knnRegress(observations, outcomes , 5))
----
This expression returns the following response:
This expression returns the following response. Notice that `knnRegress` returns a tuple describing the regression inputs:
[source,json]
----
@ -767,6 +752,7 @@ This expression returns the following response:
=== Prediction and Residuals
The output of `knnRegress` can be used with the `predict` function like other regression models.
In the example below the `predict` function is used to predict results for the original training
data. The sumSq of the residuals is then calculated.
@ -806,14 +792,15 @@ This expression returns the following response:
If the features in the observation matrix are not in the same scale then the larger features
will carry more weight in the distance calculation than the smaller features. This can greatly
impact the accuracy of the prediction. The `knnRegress` function has a *scale* parameter which
can be set to *true* to automatically scale the features in the same range.
impact the accuracy of the prediction. The `knnRegress` function has a `scale` parameter which
can be set to `true` to automatically scale the features in the same range.
The example below shows `knnRegress` with feature scaling turned on.
Notice that when feature scaling is turned on the sumSqErr in the output is much lower.
Notice that when feature scaling is turned on the `sumSqErr` in the output is much lower.
This shows how much more accurate the predictions are when feature scaling is turned on in
this particular example. This is because the *filesize_d* feature is significantly larger then
the *service_d* feature.
this particular example. This is because the `filesize_d` feature is significantly larger than
the `service_d` feature.
[source,text]
----
@ -850,16 +837,15 @@ This expression returns the following response:
=== Setting Robust Regression
The default prediction approach is to take the *mean* of the outcomes of the k-nearest
neighbors. If the outcomes contain outliers the *mean* value can be skewed. Setting
the *robust* parameter to true will take the *median* outcome of the k-nearest neighbors.
The default prediction approach is to take the mean of the outcomes of the k-nearest
neighbors. If the outcomes contain outliers the mean value can be skewed. Setting
the `robust` parameter to `true` will take the median outcome of the k-nearest neighbors.
This provides a regression prediction that is robust to outliers.
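The effect of an outlier on the two approaches can be seen in a short Python sketch. The neighbor outcomes below are assumed values chosen to include one outlier.

```python
import statistics

# Illustrative sketch only: assumed outcomes of five nearest neighbors, one of them an outlier.
neighbor_outcomes = [10, 12, 11, 13, 500]

mean_prediction = statistics.mean(neighbor_outcomes)      # default approach
robust_prediction = statistics.median(neighbor_outcomes)  # median approach

print(mean_prediction, robust_prediction)
```

The single outlier drags the mean far from the typical outcome, while the median stays with the majority of the neighbors.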
=== Setting the Distance Measure
The distance measure can be changed for the k-nearest neighbor search by adding a distance measure
function to the `knnRegress` parameters. Below is an example using manhattan distance.
function to the `knnRegress` parameters. Below is an example using `manhattan` distance.
[source,text]
----
@ -892,10 +878,3 @@ This expression returns the following response:
}
}
----


@ -35,7 +35,7 @@ matrix(array(1, 2),
array(4, 5))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -80,7 +80,7 @@ let(a=array(1, 2),
d=colAt(c, 1))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -129,7 +129,7 @@ let(echo="d, e",
e=getColumnLabels(c))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -182,7 +182,7 @@ let(echo="b,c",
c=columnCount(a))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -217,7 +217,7 @@ let(a=matrix(array(1, 2),
b=transpose(a))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -259,7 +259,7 @@ let(a=matrix(array(1, 2, 3),
b=sumRows(a))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -292,7 +292,7 @@ let(a=matrix(array(1, 2, 3),
b=grandSum(a))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -326,7 +326,7 @@ let(a=matrix(array(1, 2),
b=scalarAdd(10, a))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -370,7 +370,7 @@ let(a=matrix(array(1, 2),
b=ebeAdd(a, a))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -413,7 +413,7 @@ let(a=matrix(array(1, 2),
c=matrixMult(a, b))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]


@ -16,21 +16,20 @@
// specific language governing permissions and limitations
// under the License.
This section of the math expression user guide covers *interpolation*, *derivatives* and *integrals*.
These three interrelated topics are part of the field of mathematics called *numerical analysis*.
Interpolation, derivatives and integrals are three interrelated topics which are part of the field of mathematics called numerical analysis. This section explores the math expressions available for numerical analysis.
== Interpolation
Interpolation is used to construct new data points between a set of known control points.
The ability to *predict* new data points allows for *sampling* along the curve defined by the
The ability to predict new data points allows for sampling along the curve defined by the
control points.
The interpolation functions described below all return an *interpolation model*
The interpolation functions described below all return an _interpolation model_
that can be passed to other functions which make use of the sampling capability.
If returned directly the interpolation model returns an array containing predictions for each of the
control points. This is useful in the case of `loess` interpolation which first smooths the control points
and then interpolates the smoothed points. All other interpolation function simply return the original
and then interpolates the smoothed points. All other interpolation functions simply return the original
control points because interpolation predicts a curve that passes through the original control points.
There are different algorithms for interpolation that will result in different predictions
@ -54,29 +53,25 @@ samples every second. In order to do this the data points between the minutes mu
The `predict` function can be used to predict values anywhere within the bounds of the interpolation
range. The example below shows a very simple example of upsampling.
In the example linear interpolation is performed on the arrays in variables *x* and *y*. The *x* variable,
which is the x axis, is a sequence from 0 to 20 with a stride of 2. The *y* variable defines the curve
along the x axis.
The `lerp` function performs the interpolation and returns the interpolation model.
The `u` value is an array from 0 to 20 with a stride of 1. This fills in the gaps of the original x axis.
The `predict` function then uses the interpolation function in variable *l* to predict values for
every point in the array assigned to variable *u*.
The variable *p* is the array of predictions, which is the upsampled set of y values.
[source,text]
----
let(x=array(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20),
y=array(5, 10, 60, 190, 100, 130, 100, 20, 30, 10, 5),
l=lerp(x, y),
u=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
p=predict(l, u))
let(x=array(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20), <1>
y=array(5, 10, 60, 190, 100, 130, 100, 20, 30, 10, 5), <2>
l=lerp(x, y), <3>
u=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), <4>
p=predict(l, u)) <5>
----
When this expression is sent to the /stream handler it
responds with:
<1> In the example linear interpolation is performed on the arrays in variables *`x`* and *`y`*. The *`x`* variable,
which is the x-axis, is a sequence from 0 to 20 with a stride of 2.
<2> The *`y`* variable defines the curve along the x-axis.
<3> The `lerp` function performs the interpolation and returns the interpolation model.
<4> The `u` value is an array from 0 to 20 with a stride of 1. This fills in the gaps of the original x-axis.
<5> The `predict` function then uses the interpolation function in variable *`l`* to predict values for
every point in the array assigned to variable *`u`*. The variable *`p`* is the array of predictions, which is the upsampled set of *`y`* values.
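
The upsampling steps above can be sketched in plain Python. This is only an illustration of linear interpolation under the same inputs; the helper `lerp_predict` is a hypothetical name and does not reflect how Solr's `lerp` and `predict` functions are implemented.

```python
from bisect import bisect_right


def lerp_predict(xs, ys, u):
    """Linearly interpolate the control points (xs, ys) at position u."""
    if not xs[0] <= u <= xs[-1]:
        raise ValueError("u is outside the interpolation range")
    # Index of the segment [xs[i], xs[i + 1]] that contains u.
    i = min(bisect_right(xs, u) - 1, len(xs) - 2)
    # Fraction of the way from xs[i] to xs[i + 1].
    t = (u - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


x = list(range(0, 21, 2))                        # 0 to 20, stride of 2
y = [5, 10, 60, 190, 100, 130, 100, 20, 30, 10, 5]
u = list(range(0, 21))                           # 0 to 20, stride of 1
p = [lerp_predict(x, y, v) for v in u]           # upsampled y values

print(p)
```

Each even position in `p` reproduces an original control point, and each odd position is the midpoint of its two neighbors.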
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -127,21 +122,15 @@ A technique known as local regression is used to compute the smoothed curve. Th
neighborhood of the local regression can be adjusted
to control how close the new curve conforms to the original control points.
The `loess` function is passed *x* and *y* axises and fits a smooth curve to the data.
If only a single array is provided it is treated as the *y* axis and a sequence is generated
for the *x* axis.
The `loess` function is passed *`x`*- and *`y`*-axes and fits a smooth curve to the data.
If only a single array is provided it is treated as the *`y`*-axis and a sequence is generated
for the *`x`*-axis.
The example below uses the `loess` function to fit a curve to a set of *y* values in an array.
The bandwidth parameter defines the percent of data to use for the local
The example below uses the `loess` function to fit a curve to a set of *`y`* values in an array.
The `bandwidth` parameter defines the percent of data to use for the local
regression. The lower the percent the smaller the neighborhood used for the local
regression and the closer the curve will be to the original data.
In the example the fitted curve is subtracted from the original curve using the
`ebeSubtract` function. The output shows the error between the
fitted curve and the original curve, known as the residuals. The output also includes
the sum-of-squares of the residuals which provides a measure
of how large the error is.
[source,text]
----
let(echo="residuals, sumSqError",
@ -151,8 +140,11 @@ let(echo="residuals, sumSqError",
sumSqError=sumSq(residuals))
----
When this expression is sent to the /stream handler it
responds with:
In the example the fitted curve is subtracted from the original curve using the
`ebeSubtract` function. The output shows the error between the
fitted curve and the original curve, known as the residuals. The output also includes
the sum-of-squares of the residuals which provides a measure
of how large the error is:
[source,json]
----
@ -194,9 +186,7 @@ responds with:
}
----
In the next example the curve is fit using a bandwidth of .25. Notice that the curve
is a closer fit, shown by the smaller residuals and lower value for the sum-of-squares of the
residuals.
In the next example the curve is fit using a `bandwidth` of `.25`:
[source,text]
----
@ -207,8 +197,8 @@ let(echo="residuals, sumSqError",
sumSqError=sumSq(residuals))
----
When this expression is sent to the /stream handler it
responds with:
Notice that the curve is a closer fit, shown by the smaller `residuals` and lower value for the sum-of-squares of the
residuals:
[source,json]
----
@ -252,11 +242,11 @@ responds with:
== Derivatives
The derivative of a function measures the rate of change of the *y* value in respects to the
rate of change of the *x* value.
The derivative of a function measures the rate of change of the *`y`* value with respect to
changes in the *`x`* value.
The `derivative` function can compute the derivative of any *interpolation* function.
The `derivative` function can also compute the derivative of a derivative.
The `derivative` function can compute the derivative of any interpolation function.
It can also compute the derivative of a derivative.
The example below computes the derivative for a `loess` interpolation function.
@ -268,7 +258,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
derivative=derivative(curve))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -327,7 +317,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
integral=integrate(curve, 0, 20))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -357,7 +347,7 @@ let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
integral=integrate(curve, 0, 10))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -382,18 +372,7 @@ responds with:
The `bicubicSpline` function can be used to interpolate and predict values
anywhere within a grid of data.
A simple example will make this more clear.
In example below a bicubic spline is used to interpolate a matrix of real estate data.
Each row of the matrix represents a specific *year*. Each column of the matrix
represents a *floor* of the building. The grid of numbers is the average selling price of
an apartment for each year and floor. For example in 2002 the average selling price for
the 9th floor was 415000 (row 3, column 3).
The `bicubicSpline` function is then used to
interpolate the grid, and the `predict` function is used to predict a value for year 2003, floor 8.
Notice that the matrix does not included a data point for year 2003, floor 8. The `bicupicSpline`
function creates that data point based on the surrounding data in the matrix.
A simple example will make this more clear:
[source,text]
----
@ -408,8 +387,16 @@ let(years=array(1998, 2000, 2002, 2004, 2006),
prediction=predict(bspline, 2003, 8))
----
When this expression is sent to the /stream handler it
responds with:
In this example a bicubic spline is used to interpolate a matrix of real estate data.
Each row of the matrix represents a specific year from `years`. Each column of the matrix
represents one of the `floors` of the building. The grid of numbers is the average selling price of
an apartment for each year and floor. For example in 2002 the average selling price for
the 9th floor was `415000` (row 3, column 3).
The `bicubicSpline` function is then used to
interpolate the grid, and the `predict` function is used to predict a value for year 2003, floor 8.
Notice that the matrix does not include a data point for year 2003, floor 8. The `bicubicSpline`
function creates that data point based on the surrounding data in the matrix:
[source,json]
----
@ -427,4 +414,3 @@ responds with:
}
}
----


@ -17,18 +17,16 @@
// under the License.
This section of the user guide covers the
*probability distribution
framework* included in the math expressions library.
probability distribution
framework included in the math expressions library.
== Probability Distribution Framework
The probability distribution framework includes
many commonly used *real* and *discrete* probability
distributions, including support for *empirical* and
*enumerated* distributions that model real world data.
The probability distribution framework includes many commonly used <<Real Distributions,real>>
and <<Discrete,discrete>> probability distributions, including support for <<Empirical Distribution,empirical>>
and <<Enumerated Distributions,enumerated>> distributions that model real world data.
The probability distribution framework also includes a set
of functions that use the probability distributions
The probability distribution framework also includes a set of functions that use the probability distributions
to support probability calculations and sampling.
=== Real Distributions
@ -93,18 +91,18 @@ random variable within a specific distribution.
Below is an example of calculating the cumulative probability
of a random variable within a normal distribution.
In the example a normal distribution function is created
with a mean of 10 and a standard deviation of 5. Then
the cumulative probability of the value 12 is calculated for this
specific distribution.
[source,text]
----
let(a=normalDistribution(10, 5),
b=cumulativeProbability(a, 12))
----
When this expression is sent to the /stream handler it responds with:
In this example a normal distribution function is created
with a mean of 10 and a standard deviation of 5. Then
the cumulative probability of the value 12 is calculated for this
specific distribution.
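
As a cross-check outside Solr, the same calculation can be sketched with Python's standard library. This assumes nothing about Solr's implementation; it only evaluates the cumulative distribution function of a normal distribution with the same parameters.

```python
from statistics import NormalDist

# Normal distribution with a mean of 10 and a standard deviation of 5.
distribution = NormalDist(mu=10, sigma=5)

# Cumulative probability of the value 12, that is P(X <= 12).
probability = distribution.cdf(12)

print(round(probability, 4))
```

Because 12 lies 0.4 standard deviations above the mean, the result is somewhat above 0.5.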
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -127,10 +125,10 @@ Below is an example of a cumulative probability calculation
using an empirical distribution.
In the example an empirical distribution is created from a random
sample taken from the *price_f* field.
sample taken from the `price_f` field.
The cumulative probability of the value .75 is then calculated.
The *price_f* field in this example was generated using a
The cumulative probability of the value `.75` is then calculated.
The `price_f` field in this example was generated using a
uniform real distribution between 0 and 1, so the output of the
`cumulativeProbability` function is very close to .75.
@ -142,7 +140,7 @@ let(a=random(collection1, q="*:*", rows="30000", fl="price_f"),
d=cumulativeProbability(c, .75))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -171,7 +169,7 @@ Below is an example which calculates the probability
of a discrete value within a Poisson distribution.
In the example a Poisson distribution function is created
with a mean of 100. Then the
with a mean of `100`. Then the
probability of encountering a sample of the discrete value 101 is calculated for this
specific distribution.
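
The quantity being computed is the Poisson probability mass function. The Python sketch below evaluates it directly for the same inputs; the function name `poisson_probability` is hypothetical, and the calculation is done in log space only to avoid overflow in the factorial.

```python
import math


def poisson_probability(mean, k):
    """Probability of observing exactly k events when the expected count is mean."""
    # log(P) = k * log(mean) - mean - log(k!)
    log_p = k * math.log(mean) - mean - math.lgamma(k + 1)
    return math.exp(log_p)


# Probability of the discrete value 101 in a Poisson distribution with a mean of 100.
p = poisson_probability(100, 101)

print(p)
```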
@ -181,7 +179,7 @@ let(a=poissonDistribution(100),
b=probability(a, 101))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -200,12 +198,10 @@ When this expression is sent to the /stream handler it responds with:
}
----
Below is an example of a probability calculation
using an enumerated distribution.
Below is an example of a probability calculation using an enumerated distribution.
In the example an enumerated distribution is created from a random
sample taken from the *day_i* field, which was created
using a uniform integer distribution between 0 and 30.
sample taken from the `day_i` field, which was created using a uniform integer distribution between 0 and 30.
The probability of the discrete value 10 is then calculated.
@ -217,7 +213,7 @@ let(a=random(collection1, q="*:*", rows="30000", fl="day_i"),
d=probability(c, 10))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -239,11 +235,9 @@ When this expression is sent to the /stream handler it responds with:
=== Sampling
All probability distributions support sampling. The `sample`
function returns 1 or more random samples from a probability
distribution.
function returns 1 or more random samples from a probability distribution.
Below is an example drawing a single sample from
a normal distribution.
Below is an example drawing a single sample from a normal distribution.
[source,text]
----
@ -251,7 +245,7 @@ let(a=normalDistribution(10, 5),
b=sample(a))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -270,8 +264,7 @@ When this expression is sent to the /stream handler it responds with:
}
----
Below is an example drawing 10 samples from a normal
distribution.
Below is an example drawing 10 samples from a normal distribution.
[source,text]
----
@ -279,7 +272,7 @@ let(a=normalDistribution(10, 5),
b=sample(a, 10))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -315,14 +308,14 @@ The multivariate normal distribution is a generalization of the
univariate normal distribution to higher dimensions.
The multivariate normal distribution models two or more random
variables that are normally distributed. The relationship between
the variables is defined by a covariance matrix.
variables that are normally distributed. The relationship between the variables is defined by a covariance matrix.
==== Sampling
The `sample` function can be used to draw samples
from a multivariate normal distribution in much the same
way as a univariate normal distribution.
The difference is that each sample will be an array containing a sample
drawn from each of the underlying normal distributions.
If multiple samples are drawn, the `sample` function returns a matrix with a
@ -333,33 +326,25 @@ multivariate normal distribution.
The example below demonstrates how to initialize and draw samples
from a multivariate normal distribution.
In this example 5000 random samples are selected from a collection
of log records. Each sample contains
the fields *filesize_d* and *response_d*. The values of both fields conform
to a normal distribution.
In this example 5000 random samples are selected from a collection of log records. Each sample contains
the fields `filesize_d` and `response_d`. The values of both fields conform to a normal distribution.
Both fields are then vectorized. The *filesize_d* vector is stored in
variable *b* and the *response_d* variable is stored in variable *c*.
Both fields are then vectorized. The `filesize_d` vector is stored in
variable *`b`* and the `response_d` vector is stored in variable *`c`*.
An array is created that contains the *means* of the two vectorized fields.
An array is created that contains the means of the two vectorized fields.
Then both vectors are added to a matrix which is transposed. This creates
an *observation* matrix where each row contains one observation of
*filesize_d* and *response_d*. A covariance matrix is then created from the columns of
the observation matrix with the
`cov` function. The covariance matrix describes the covariance between
*filesize_d* and *response_d*.
an observation matrix where each row contains one observation of
`filesize_d` and `response_d`. A covariance matrix is then created from the columns of
the observation matrix with the `cov` function. The covariance matrix describes the covariance between
`filesize_d` and `response_d`.
The `multivariateNormalDistribution` function is then called with the
array of means for the two fields and the covariance matrix. The model for the
multivariate normal distribution is assigned to variable *g*.
multivariate normal distribution is assigned to variable *`g`*.
Finally five samples are drawn from the multivariate normal distribution. The samples
are returned as a matrix, with each row representing one sample. There are two
columns in the matrix. The first column contains samples for *filesize_d* and the second
column contains samples for *response_d*. Over the long term the covariance between
the columns will conform to the covariance matrix used to instantiate the
multivariate normal distribution.
Finally five samples are drawn from the multivariate normal distribution.
[source,text]
----
@ -373,7 +358,11 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
h=sample(g, 5))
----
When this expression is sent to the /stream handler it responds with:
The samples are returned as a matrix, with each row representing one sample. There are two
columns in the matrix. The first column contains samples for `filesize_d` and the second
column contains samples for `response_d`. Over the long term the covariance between
the columns will conform to the covariance matrix used to instantiate the
multivariate normal distribution.
[source,json]
----
@ -412,4 +401,3 @@ When this expression is sent to the /stream handler it responds with:
}
}
----


@ -16,28 +16,23 @@
// specific language governing permissions and limitations
// under the License.
This section of the math expressions user guide covers simple and multivariate linear regression.
The math expressions library supports simple and multivariate linear regression.
== Simple Linear Regression
The `regress` function is used to build a linear regression model
between two random variables. Sample observations are provided with two
numeric arrays. The first numeric array is the *independent variable* and
the second array is the *dependent variable*.
numeric arrays. The first numeric array is the independent variable and
the second array is the dependent variable.
In the example below the `random` function selects 5000 random samples each containing
the fields *filesize_d* and *response_d*. The two fields are vectorized
and stored in variables *b* and *c*. Then the `regress` function performs a regression
the fields `filesize_d` and `response_d`. The two fields are vectorized
and stored in variables *`b`* and *`c`*. Then the `regress` function performs a regression
analysis on the two numeric arrays.
The `regress` function returns a single tuple with the results of the regression
analysis.
Note that in this regression analysis the value of *RSquared* is *.75*. This means that changes in
*filesize_d* explain 75% of the variability of the *response_d* variable.
[source,text]
----
let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
@ -46,7 +41,8 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
d=regress(b, c))
----
When this expression is sent to the /stream handler it responds with:
Note that in this regression analysis the value of `RSquared` is `.75`. This means that changes in
`filesize_d` explain 75% of the variability of the `response_d` variable:
[source,json]
----
@ -81,11 +77,10 @@ When this expression is sent to the /stream handler it responds with:
The `predict` function uses the regression model to make predictions.
Using the example above the regression model can be used to predict the value
of *response_d* given a value for *filesize_d*.
of `response_d` given a value for `filesize_d`.
In the example below the `predict` function uses the regression analysis to predict
the value of *response_d* for the *filesize_d* value of 40000.
the value of `response_d` for the `filesize_d` value of `40000`.
[source,text]
----
@ -96,7 +91,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
e=predict(d, 40000))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -131,7 +126,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
e=predict(d, b))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -169,9 +164,9 @@ The difference between the observed value and the predicted value is known as th
residual. There isn't a specific function to calculate the residuals but vector
math can be used to perform the calculation.
In the example below the predictions are stored in variable *e*. The `ebeSubtract`
In the example below the predictions are stored in variable *`e`*. The `ebeSubtract`
function is then used to subtract the predictions
from the actual *response_d* values stored in variable *c*. Variable *f* contains
from the actual `response_d` values stored in variable *`c`*. Variable *`f`* contains
the array of residuals.
[source,text]
@ -184,7 +179,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
f=ebeSubtract(c, e))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -221,20 +216,17 @@ When this expression is sent to the /stream handler it responds with:
== Multivariate Linear Regression
The `olsRegress` function performs a multivariate linear regression analysis. Multivariate linear
regression models the linear relationship between two or more *independent* variables and a *dependent* variable.
regression models the linear relationship between two or more independent variables and a dependent variable.
The example below extends the simple linear regression example by introducing a new independent variable
called *service_d*. The *service_d* variable is the service level of the request and it can range from 1 to 4
called `service_d`. The `service_d` variable is the service level of the request and it can range from 1 to 4
in the data-set. The higher the service level, the higher the bandwidth available for the request.
Notice that the two independent variables *filesize_d* and *service_d* are vectorized and stored
in the variables *b* and *c*. The variables *b* and *c* are then added as rows to a `matrix`. The matrix is
then transposed so that each row in the matrix represents one observation with *filesize_d* and *service_d*.
Notice that the two independent variables `filesize_d` and `service_d` are vectorized and stored
in the variables *`b`* and *`c`*. The variables *`b`* and *`c`* are then added as rows to a `matrix`. The matrix is
then transposed so that each row in the matrix represents one observation with `filesize_d` and `service_d`.
The `olsRegress` function then performs the multivariate regression analysis using the observation matrix as the
independent variables and the *response_d* values, stored in variable *d*, as the dependent variable.
Notice that the RSquared of the regression analysis is 1. This means that linear relationship between
*filesize_d* and *service_d* describe 100% of the variability of the *response_d* variable.
independent variables and the `response_d` values, stored in variable *`d`*, as the dependent variable.
[source,text]
----
@ -246,7 +238,8 @@ let(a=random(collection2, q="*:*", rows="30000", fl="filesize_d, service_d, resp
f=olsRegress(e, d))
----
When this expression is sent to the /stream handler it responds with:
Notice in the response that the `RSquared` of the regression analysis is 1. This means that the linear relationship between
`filesize_d` and `service_d` describes 100% of the variability of the `response_d` variable:
[source,json]
----
@ -299,10 +292,11 @@ When this expression is sent to the /stream handler it responds with:
=== Prediction
The `predict` function can also be used to make predictions for multivariate linear regression. Below is an example
of a single prediction using the multivariate linear regression model and a single observation. The observation
is an array that matches the structure of the observation matrix used to build the model. In this case
the first value represent a *filesize_d* of 40000 and the second value represents a *service_d* of 4.
The `predict` function can also be used to make predictions for multivariate linear regression.
Below is an example of a single prediction using the multivariate linear regression model and a single observation.
The observation is an array that matches the structure of the observation matrix used to build the model. In this case
the first value represents a `filesize_d` of `40000` and the second value represents a `service_d` of `4`.
[source,text]
----
@ -315,7 +309,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, respo
g=predict(f, array(40000, 4)))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -335,9 +329,10 @@ When this expression is sent to the /stream handler it responds with:
----
The `predict` function can also make predictions for more than one multivariate observation. In this scenario
an observation matrix used. In the example below the observation matrix used to build the multivariate regression model
is passed to the `predict` function and it returns an array of predictions.
an observation matrix is used.
In the example below the observation matrix used to build the multivariate regression model
is passed to the `predict` function and it returns an array of predictions.
[source,text]
----
@ -350,7 +345,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, respo
g=predict(f, e))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -388,7 +383,7 @@ Once the predictions are generated the residuals can be calculated using the sam
simple linear regression.
Below is an example of the residuals calculation following a multivariate linear regression. In the example
the predictions stored variable *g* are subtracted from observed values stored in variable *d*.
the predictions stored in variable *`g`* are subtracted from the observed values stored in variable *`d`*.
[source,text]
----
@ -402,7 +397,7 @@ let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, respo
h=ebeSubtract(d, g))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -433,7 +428,3 @@ When this expression is sent to the /stream handler it responds with:
}
}
----


@ -26,7 +26,7 @@ For example the expression below adds two numbers together:
add(1, 1)
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -98,7 +98,7 @@ select(search(collection2, q="*:*", fl="price_f", sort="price_f desc", rows="3")
mult(price_f, 10) as newPrice)
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----


@ -18,59 +18,59 @@
Monte Carlo simulations are commonly used to model the behavior of
stochastic systems. This section of the user guide describes
how to perform both *uncorrelated* and *correlated* Monte Carlo simulations
using the *sampling* capabilities of the probability distribution framework.
stochastic systems. This section describes
how to perform both uncorrelated and correlated Monte Carlo simulations
using the sampling capabilities of the probability distribution framework.
== Uncorrelated Simulations
Uncorrelated Monte Carlo simulations model stochastic systems with the assumption
that the underlying random variables move independently of each other.
A simple example of a Monte Carlo simulation using two independently changing random variables
is described below.
that the underlying random variables move independently of each other.
A simple example of a Monte Carlo simulation using two independently changing random variables
is described below.
In this example a Monte Carlo simulation is used to determine the probability that a simple hinge assembly will
fall within a required length specification.
The hinge has two components *A* and *B*. The combined length of the two components must be less then 5 centimeters
The hinge has two components A and B. The combined length of the two components must be less than 5 centimeters
to fall within specification.
A random sampling of lengths for component *A* has shown that its length conforms to a
A random sampling of lengths for component A has shown that its length conforms to a
normal distribution with a mean of 2.2 centimeters and a standard deviation of .0195
centimeters.
A random sampling of lengths for component *B* has shown that its length conforms
A random sampling of lengths for component B has shown that its length conforms
to a normal distribution with a mean of 2.71 centimeters and a standard deviation of .0198 centimeters.
The Monte Carlo simulation below performs the following steps:
* A normal distribution with a mean of 2.2 and a standard deviation of .0195 is created to model the length of componentA.
* A normal distribution with a mean of 2.71 and a standard deviation of .0198 is created to model the length of componentB.
* The `monteCarlo` function samples from the componentA and componentB distributions and sets the values to variables sampleA and sampleB. It then
calls the *add(sampleA, sampleB)* function to find the combined lengths of the samples. The `monteCarlo` function runs a set number of times, 100000 in the example below, and collects the results in an array. Each
time the function is called new samples are drawn from the componentA
and componentB distributions. On each run, the `add` function adds the two samples to calculate the combined length.
The result of each run is collected in an array and assigned to the *simresults* variable.
* An `empiricalDistribution` function is then created from the *simresults* array to model the distribution of the
simulation results.
* Finally, the `cumulativeProbability` function is called on the *simmodel* to determine the cumulative probability
that the combined length of the components is 5 or less.
* Based on the simulation there is .9994371944629039 probability that the combined length of a component pair will
be 5 or less.
[source,text]
----
let(componentA=normalDistribution(2.2, .0195),
componentB=normalDistribution(2.71, .0198),
simresults=monteCarlo(sampleA=sample(componentA),
let(componentA=normalDistribution(2.2, .0195), <1>
componentB=normalDistribution(2.71, .0198), <2>
simresults=monteCarlo(sampleA=sample(componentA), <3>
sampleB=sample(componentB),
add(sampleA, sampleB),
100000),
simmodel=empiricalDistribution(simresults),
prob=cumulativeProbability(simmodel, 5))
add(sampleA, sampleB), <4>
100000), <5>
simmodel=empiricalDistribution(simresults), <6>
prob=cumulativeProbability(simmodel, 5)) <7>
----
When this expression is sent to the /stream handler it responds with:
The Monte Carlo simulation above performs the following steps:
<1> A normal distribution with a mean of 2.2 and a standard deviation of .0195 is created to model the length of `componentA`.
<2> A normal distribution with a mean of 2.71 and a standard deviation of .0198 is created to model the length of `componentB`.
<3> The `monteCarlo` function samples from the `componentA` and `componentB` distributions and sets the values to variables `sampleA` and `sampleB`.
<4> It then calls the `add(sampleA, sampleB)` function to find the combined lengths of the samples.
<5> The `monteCarlo` function runs a set number of times, 100000, and collects the results in an array. Each
time the function is called new samples are drawn from the `componentA`
and `componentB` distributions. On each run, the `add` function adds the two samples to calculate the combined length.
The result of each run is collected in an array and assigned to the `simresults` variable.
<6> An `empiricalDistribution` function is then created from the `simresults` array to model the distribution of the
simulation results.
<7> Finally, the `cumulativeProbability` function is called on the `simmodel` to determine the cumulative probability
that the combined length of the components is 5 or less.
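The steps above can be cross-checked outside Solr with a rough sketch in plain Python. This is not the Solr implementation: the function name `simulate_hinge`, the fixed seed, and the direct counting of in-spec runs (instead of building an empirical distribution first) are assumptions made for illustration.

```python
import random

def simulate_hinge(runs=100000, limit=5.0, seed=42):
    """Estimate P(lengthA + lengthB <= limit) for two independent normal lengths."""
    rng = random.Random(seed)  # fixed seed (an assumption) so reruns are repeatable
    within_spec = 0
    for _ in range(runs):
        sample_a = rng.gauss(2.2, 0.0195)   # component A: mean 2.2, std dev .0195
        sample_b = rng.gauss(2.71, 0.0198)  # component B: mean 2.71, std dev .0198
        if sample_a + sample_b <= limit:    # combined length within specification
            within_spec += 1
    return within_spec / runs

prob = simulate_hinge()
print(prob)
```

The estimate varies slightly with the seed and the number of runs, so it should land near, not exactly on, the value Solr reports.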
Based on the simulation there is a .9994371944629039 probability that the combined length of a component pair will
be 5 or less:
[source,json]
----
@ -91,36 +91,32 @@ When this expression is sent to the /stream handler it responds with:
== Correlated Simulations
The simulation above assumes that the lengths of *componentA* and *componentB* vary independently.
The simulation above assumes that the lengths of `componentA` and `componentB` vary independently.
What would happen to the probability model if there was a correlation between the lengths of
*componentA* and *componentB*.
`componentA` and `componentB`?
In the example below a database containing assembled pairs of components is used to determine
if there is a correlation between the lengths of the components, and how the correlation affects the model.
Before performing a simulation of the effects of correlation on the probability model it's
useful to understand what the correlation is between the lengths of *componentA* and *componentB*.
In the example below 5000 random samples are selected from a collection
of assembled hinges. Each sample contains
lengths of the components in the fields *componentA_d* and *componentB_d*.
Both fields are then vectorized. The *componentA_d* vector is stored in
variable *b* and the *componentB_d* variable is stored in variable *c*.
Then the correlation of the two vectors is calculated using the `corr` function. Note that the outcome
from `corr` is 0.9996931313216989. This means that *componentA_d* and *componentB_d* are almost
perfectly correlated.
useful to understand what the correlation is between the lengths of `componentA` and `componentB`.
[source,text]
----
let(a=random(collection5, q="*:*", rows="5000", fl="componentA_d, componentB_d"),
b=col(a, componentA_d)),
let(a=random(collection5, q="*:*", rows="5000", fl="componentA_d, componentB_d"), <1>
b=col(a, componentA_d), <2>
c=col(a, componentB_d),
d=corr(b, c))
d=corr(b, c)) <3>
----
When this expression is sent to the /stream handler it responds with:
<1> In the example, 5000 random samples are selected from a collection of assembled hinges.
Each sample contains lengths of the components in the fields `componentA_d` and `componentB_d`.
<2> Both fields are then vectorized. The `componentA_d` vector is stored in
variable *`b`* and the `componentB_d` vector is stored in variable *`c`*.
<3> Then the correlation of the two vectors is calculated using the `corr` function.
Note from the result that the outcome from `corr` is 0.9996931313216989.
This means that `componentA_d` and `componentB_d` are almost perfectly correlated.
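As a sketch of what the correlation step computes, the Python below implements Pearson's product-moment correlation by hand. It assumes that `corr` uses Pearson's correlation when no `type` is given; the helper name `pearson_corr` and the sample lengths are hypothetical and are not taken from the collection in the example.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical lengths that rise together in lockstep
component_a = [2.18, 2.19, 2.20, 2.21, 2.22]
component_b = [2.69, 2.70, 2.71, 2.72, 2.73]
print(pearson_corr(component_a, component_b))
```

Values close to 1 mean the two lengths tend to be large or small together, which is the situation the correlated simulation has to account for.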
[source,json]
----
@ -139,35 +135,34 @@ When this expression is sent to the /stream handler it responds with:
}
----
How does correlation effect the probability model?
=== Correlation Effects on the Probability Model
The example below explores how to use a *multivariate normal distribution* function
The example below explores how to use a multivariate normal distribution function
to model how correlation affects the probability of hinge defects.
In this example 5000 random samples are selected from a collection
containing length data for assembled hinges. Each sample contains
the fields *componentA_d* and *componentB_d*.
the fields `componentA_d` and `componentB_d`.
Both fields are then vectorized. The *componentA_d* vector is stored in
variable *b* and the *componentB_d* variable is stored in variable *c*.
Both fields are then vectorized. The `componentA_d` vector is stored in
variable *`b`* and the `componentB_d` vector is stored in variable *`c`*.
An array is created that contains the *means* of the two vectorized fields.
An array is created that contains the means of the two vectorized fields.
Then both vectors are added to a matrix which is transposed. This creates
an *observation* matrix where each row contains one observation of
*componentA_d* and *componentB_d*. A covariance matrix is then created from the columns of
an observation matrix where each row contains one observation of
`componentA_d` and `componentB_d`. A covariance matrix is then created from the columns of
the observation matrix with the
`cov` function. The covariance matrix describes the covariance between
*componentA_d* and *componentB_d*.
`cov` function. The covariance matrix describes the covariance between `componentA_d` and `componentB_d`.
The `multivariateNormalDistribution` function is then called with the
array of means for the two fields and the covariance matrix. The model
for the multivariate normal distribution is stored in variable *g*.
for the multivariate normal distribution is stored in variable *`g`*.
The `monteCarlo` function then calls the function *add(sample(g))* 50000 times
The `monteCarlo` function then calls the function `add(sample(g))` 50000 times
and collects the results in a vector. Each time the function is called a single sample
is drawn from the multivariate normal distribution. Each sample is a vector containing
one *componentA* and *componentB* pair. the `add` function adds the values in the vector to
one `componentA` and `componentB` pair. The `add` function adds the values in the vector to
calculate the length of the pair. Over the long term the samples drawn from the
multivariate normal distribution will conform to the covariance matrix used to construct it.
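The idea of drawing correlated pairs can be sketched in plain Python for the two-variable case. This is only an illustration under stated assumptions: the means and standard deviations are borrowed from the uncorrelated example, the correlation `rho` is a hypothetical value close to the one reported by `corr`, and the pairs are generated with a hand-written 2x2 Cholesky step, not with Solr's `multivariateNormalDistribution`.

```python
import math
import random

def simulate_correlated(runs=50000, limit=5.0, rho=0.9997, seed=7):
    """Estimate P(lengthA + lengthB <= limit) when the lengths are correlated."""
    rng = random.Random(seed)      # fixed seed (an assumption) for repeatability
    mean_a, std_a = 2.2, 0.0195    # assumed, taken from the uncorrelated example
    mean_b, std_b = 2.71, 0.0198   # assumed, taken from the uncorrelated example
    within_spec = 0
    for _ in range(runs):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Cholesky factor of a 2x2 correlation matrix: B shares z1 with A,
        # which gives the pair a correlation of rho
        length_a = mean_a + std_a * z1
        length_b = mean_b + std_b * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        if length_a + length_b <= limit:
            within_spec += 1
    return within_spec / runs

print(simulate_correlated())         # strongly correlated lengths
print(simulate_correlated(rho=0.0))  # independent lengths, for comparison
```

With a positive correlation the two lengths tend to be long at the same time, so the combined length spreads out more and the share of pairs within specification drops compared with the independent case.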
@ -195,7 +190,7 @@ let(a=random(hinges, q="*:*", rows="5000", fl="componentA_d, componentB_d"),
j=cumulativeProbability(i, 5))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----


@ -37,7 +37,7 @@ let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
c=describe(b))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -90,7 +90,7 @@ let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
c=hist(b, 5))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -179,7 +179,7 @@ let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
d=col(c, N))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -228,7 +228,7 @@ let(a=random(collection1, q="*:*", rows="15000", fl="day_i"),
c=freqTable(b))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -302,7 +302,7 @@ let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
c=percentile(b, 95))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -344,7 +344,7 @@ let(a=array(1, 2, 3, 4, 5),
c=cov(a, b))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -380,7 +380,7 @@ let(a=array(1, 2, 3, 4, 5),
e=cov(d))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -437,7 +437,7 @@ let(a=array(1, 2, 3, 4, 5),
c=corr(a, b, type=spearmans))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -504,7 +504,7 @@ let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
e=ttest(c, d))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -552,7 +552,7 @@ let(a=random(collection1, q="*:*", rows="1500", fl="price_f"),
e=ttest(c, d))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -588,7 +588,7 @@ let(a=array(1,2,3),
b=zscores(a))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----


@ -216,8 +216,8 @@ The `nodes` function provides breadth-first graph traversal. For details, see th
== knnSearch
The `knnSearch` function returns the K nearest neighbors for a document based on text similarity. Under the covers the `knnSearch` function
use the More Like This query parser plugin.
The `knnSearch` function returns the k-nearest neighbors for a document based on text similarity. Under the covers the `knnSearch` function
uses the More Like This query parser plugin.
=== knnSearch Parameters


@ -16,9 +16,9 @@
// specific language governing permissions and limitations
// under the License.
TF-IDF term vectors are often used to represent text documents when performing text mining
and machine learning operations. This section of the user guide describes how to
use math expressions to perform text analysis and create TF-IDF term vectors.
Term frequency-inverse document frequency (TF-IDF) term vectors are often used to
represent text documents when performing text mining and machine learning operations. The math expressions
library can be used to perform text analysis and create TF-IDF term vectors.
== Text Analysis
@ -26,17 +26,16 @@ The `analyze` function applies a Solr analyzer to a text field and returns the t
emitted by the analyzer in an array. Any analyzer chain that is attached to a field in Solr's
schema can be used with the `analyze` function.
In the example below, the text "hello world" is analyzed using the analyzer chain attached to the *subject* field in
the schema. The *subject* field is defined as the field type *text_general* and the text is analyzed using the
analysis chain configured for the *text_general* field type.
In the example below, the text "hello world" is analyzed using the analyzer chain attached to the `subject` field in
the schema. The `subject` field is defined as the field type `text_general` and the text is analyzed using the
analysis chain configured for the `text_general` field type.
[source,text]
----
analyze("hello world", subject)
----
When this expression is sent to the /stream handler it
responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -63,13 +62,12 @@ responds with:
The `analyze` function can be used inside of a `select` function to annotate documents with the tokens
generated by the analysis.
The example below is performing a `search` in collection1. Each tuple returned by the `search`
contains an *id* and *subject*. For each tuple, the
`select` function is selecting the *id* field and calling the `analyze` function on the *subject* field.
The analyzer chain specified by the *subject_bigram* field is configured to perform a bigram analysis.
The example below performs a `search` in "collection1". Each tuple returned by the `search` function
contains an `id` and `subject`. For each tuple, the
`select` function selects the `id` field and calls the `analyze` function on the `subject` field.
The analyzer chain specified by the `subject_bigram` field is configured to perform a bigram analysis.
The tokens generated by the `analyze` function are added to each tuple in a field called `terms`.
Notice in the output that an array of bigram terms have been added to the tuples.
[source,text]
----
@ -78,8 +76,7 @@ select(search(collection1, q="*:*", fl="id, subject", sort="id asc"),
analyze(subject, subject_bigram) as terms)
----
When this expression is sent to the /stream handler it
responds with:
Notice in the output that an array of bigram terms has been added to the tuples:
[source,json]
----
@ -111,42 +108,37 @@ responds with:
== TF-IDF Term Vectors
The `termVectors` function can be used to build *TF-IDF*
term vectors from the terms generated by the `analyze` function.
The `termVectors` function can be used to build TF-IDF term vectors from the terms generated by the `analyze` function.
The `termVectors` function operates over a list of tuples that contain a field
called *id* and a field called *terms*. Notice
that this is the exact output structure of the *document annotation* example above.
The `termVectors` function operates over a list of tuples that contain a field called `id` and a field called `terms`.
Notice that this is the exact output structure of the document annotation example above.
The `termVectors` function builds a *matrix* from the list of tuples. There is *row* in the
matrix for each tuple in the list. There is a *column* in the matrix for each term in the *terms*
field.
The example below builds on the *document annotation* example.
The list of tuples are stored in variable *a*. The `termVectors` function
operates over variable *a* and builds a matrix with *2 rows* and *4 columns*.
The `termVectors` function also sets the *row* and *column* labels of the term vectors matrix.
The row labels are the document ids and the
column labels are the terms.
In the example below, the `getRowLabels` and `getColumnLabels` functions return
the row and column labels which are then stored in variables *c* and *d*.
The *echo* parameter is echoing variables *c* and *d*, so the output includes
the row and column labels.
The `termVectors` function builds a matrix from the list of tuples. There is a row in the
matrix for each tuple in the list. There is a column in the matrix for each term in the `terms` field.
[source,text]
----
let(echo="c, d",
a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
let(echo="c, d", <1>
a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"), <2>
id,
analyze(subject, subject_bigram) as terms),
b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1),
c=getRowLabels(b),
b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1), <3>
c=getRowLabels(b), <4>
d=getColumnLabels(b))
----
When this expression is sent to the /stream handler it
The example above builds on the document annotation example.
<1> The `echo` parameter will echo variables *`c`* and *`d`*, so the output includes
the row and column labels, which will be defined later in the expression.
<2> The list of tuples is stored in variable *`a`*.
<3> The `termVectors` function operates over variable *`a`* and builds a matrix with 2 rows and 4 columns, which is stored in variable *`b`*.
The function also sets the row and column labels of the term vectors matrix.
The row labels are the document ids and the column labels are the terms.
<4> The `getRowLabels` and `getColumnLabels` functions return
the row and column labels which are then stored in variables *`c`* and *`d`*.
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -188,7 +180,7 @@ let(a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -230,8 +222,15 @@ the noisy terms helps keep the term vector matrix small enough to fit comfortabl
There are four parameters designed to filter noisy terms from the term vector matrix:
* *minTermLength*: The minimum term length required to include the term in the matrix.
* *minDocFreq*: The minimum *percentage* (0 to 1) of documents the term must appear in to be included in the index.
* *maxDocFreq*: The maximum *percentage* (0 to 1) of documents the term can appear in to be included in the index.
* *exclude*: A comma delimited list of strings used to exclude terms. If a term contains any of the exclude strings that
`minTermLength`::
The minimum term length required to include the term in the matrix.
`minDocFreq`::
The minimum percentage, expressed as a number between 0 and 1, of documents the term must appear in to be included in the index.
`maxDocFreq`::
The maximum percentage, expressed as a number between 0 and 1, of documents the term can appear in to be included in the index.
`exclude`::
A comma delimited list of strings used to exclude terms. If a term contains any of the exclude strings that
term will be excluded from the term vector.
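The effect of the four filters can be sketched in plain Python. This is an illustration only, not Solr's code: the function name `filter_terms` is hypothetical, and treating the `minDocFreq` and `maxDocFreq` bounds as inclusive is an assumption, chosen because the examples pass `minDocFreq=0, maxDocFreq=1` to keep every term.

```python
def filter_terms(docs, min_term_length=4, min_doc_freq=0.0, max_doc_freq=1.0, exclude=()):
    """docs is a list of term lists, one per document; returns the surviving terms, sorted."""
    doc_count = len(docs)
    doc_freq = {}
    for terms in docs:
        for term in set(terms):  # count each term at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    kept = []
    for term, count in doc_freq.items():
        fraction = count / doc_count  # share of documents containing the term
        if len(term) < min_term_length:
            continue  # too short
        if fraction < min_doc_freq or fraction > max_doc_freq:
            continue  # too rare or too common (inclusive bounds are an assumption)
        if any(bad in term for bad in exclude):
            continue  # contains one of the exclude strings
        kept.append(term)
    return sorted(kept)

docs = [["text", "analysis", "is", "fun"], ["text", "mining", "log_1"]]
print(filter_terms(docs, exclude=["_"]))  # ['analysis', 'mining', 'text']
```

Lowering `max_doc_freq` below 1 would additionally drop `text`, which appears in every document in this made-up sample.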


@ -38,7 +38,7 @@ timeseries(collection1,
count(*))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -121,7 +121,7 @@ let(a=timeseries(collection1,
b=col(a, count(*)))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -192,7 +192,7 @@ let(a=timeseries(collection1,
c=movingAvg(b, 3))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -242,7 +242,7 @@ let(a=timeseries(collection1, q=*:*,
c=expMovingAvg(b, 3))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -292,7 +292,7 @@ let(a=timeseries(collection1,
c=movingMedian(b, 3))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -353,7 +353,7 @@ let(a=timeseries(collection1,
c=diff(b))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -403,7 +403,7 @@ let(a=array(1,2,5,2,1,2,5,2,1,2,5),
b=diff(a, 4))
----
Expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----


@ -16,19 +16,17 @@
// specific language governing permissions and limitations
// under the License.
== The Let Expression
The `let` expression sets variables and returns
the value of the last variable by default. The output of any streaming expression
or math expression can be set to a variable.
the value of the last variable by default. The output of any streaming expression or math expression can be set to a variable.
Below is a simple example setting three variables *a*, *b*
and *c*. Variables *a* and *b* are set to arrays. The variable *c* is set
Below is a simple example setting three variables *`a`*, *`b`*
and *`c`*. Variables *`a`* and *`b`* are set to arrays. The variable *`c`* is set
to the output of the `ebeAdd` function which performs element-by-element
addition of the two arrays.
Notice that the last variable, *c*, is returned.
[source,text]
----
let(a=array(1, 2, 3),
@ -36,8 +34,7 @@ let(a=array(1, 2, 3),
c=ebeAdd(a, b))
----
When this expression is sent to the /stream handler it
responds with:
In the response, notice that the last variable, *`c`*, is returned:
[source,json]
----
@ -62,7 +59,7 @@ responds with:
== Echoing Variables
All variables can be output by setting the *echo* variable to *true*.
All variables can be output by setting the `echo` variable to `true`.
[source,text]
----
@ -72,7 +69,7 @@ let(echo=true,
c=ebeAdd(a, b))
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -106,8 +103,8 @@ responds with:
}
----
A specific set of variables can be echoed by providing a comma delimited
list of variables to the echo parameter.
A specific set of variables can be echoed by providing a comma delimited list of variables to the echo parameter.
Because variables have been provided, the `true` value is assumed.
[source,text]
----
@ -117,8 +114,7 @@ let(echo="a,b",
c=ebeAdd(a, b))
----
When this expression is sent to the /stream handler it
responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -150,13 +146,13 @@ responds with:
Variables can be cached in-memory on the Solr node where the math expression
was run. A cached variable can then be used in future expressions. Any object
that can be set to a variable, including data structures and mathematical models can
that can be set to a variable, including data structures and mathematical models, can
be cached in-memory for future use.
The `putCache` function adds a variable to the cache.
In the example below an array is cached in the *workspace* workspace1
and bound to the *key* key1. The workspace allows different users to cache
In the example below an array is cached in the `workspace` "workspace1"
and bound to the `key` "key1". The workspace allows different users to cache
objects in their own workspace. The `putCache` function returns
the variable that was added to the cache.
@ -168,8 +164,7 @@ let(a=array(1, 2, 3),
d=putCache(workspace1, key1, c))
----
When this expression is sent to the /stream handler it
responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -192,20 +187,16 @@ responds with:
}
----
The `getCache` function retrieves an object from the
cache by its workspace and key.
In the example below the `getCache` function retrieves
the array the was cached above and assigns it to variable *a*.
The `getCache` function retrieves an object from the cache by its workspace and key.
In the example below the `getCache` function retrieves the array that was cached above and assigns it to variable *`a`*.
[source,text]
----
let(a=getCache(workspace1, key1))
----
When this expression is sent to the /stream handler it
responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -228,18 +219,16 @@ responds with:
}
----
The `listCache` function can be used to list the workspaces or the
keys in a specific workspace.
The `listCache` function can be used to list the workspaces or the keys in a specific workspace.
In the example below `listCache` returns all the workspaces in the cache
as an array of strings.
In the example below `listCache` returns all the workspaces in the cache as an array of strings.
[source,text]
----
let(a=listCache())
----
When this expression is sent to the /stream handler it
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
@ -264,14 +253,12 @@ responds with:
In the example below all the keys in a specific workspace are listed:
[source,text]
----
let(a=listCache(workspace1))
----
When this expression is sent to the /stream handler it
responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -296,17 +283,14 @@ The `removeCache` function can be used to remove a a key from a specific
workspace. This `removeCache` function removes the key from the cache
and returns the object that was removed.
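The behavior of the four cache functions described in this section can be modeled with a small Python sketch. It is a mental model only: the class `WorkspaceCache` and its method names are hypothetical, and the sorted listings and the `KeyError` on a missing workspace or key are assumptions, not documented Solr behavior.

```python
class WorkspaceCache:
    """Toy in-memory cache keyed by workspace, then by key."""

    def __init__(self):
        self._workspaces = {}

    def put_cache(self, workspace, key, value):
        # Store the value and return it, mirroring how putCache returns the cached variable
        self._workspaces.setdefault(workspace, {})[key] = value
        return value

    def get_cache(self, workspace, key):
        return self._workspaces[workspace][key]

    def list_cache(self, workspace=None):
        # No argument: list the workspaces; with a workspace: list its keys
        if workspace is None:
            return sorted(self._workspaces)
        return sorted(self._workspaces[workspace])

    def remove_cache(self, workspace, key):
        # Remove the key and return the object that was removed
        return self._workspaces[workspace].pop(key)

cache = WorkspaceCache()
cache.put_cache("workspace1", "key1", [5, 7, 9])
print(cache.list_cache())              # ['workspace1']
print(cache.list_cache("workspace1"))  # ['key1']
```

Keeping one dictionary per workspace is what lets different users reuse the same key names without overwriting each other's objects.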
In the example below the array that was cached above is removed from the
cache.
In the example below the array that was cached above is removed from the cache.
[source,text]
----
let(a=removeCache(workspace1, key1))
----
When this expression is sent to the /stream handler it
responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----


@ -16,23 +16,20 @@
// specific language governing permissions and limitations
// under the License.
This section of the user guide covers vector math and
vector manipulation functions.
This section covers vector math and vector manipulation functions.
== Arrays
Arrays can be created with the `array` function.
For example the expression below creates a numeric array with
three elements:
For example, the expression below creates a numeric array with three elements:
[source,text]
----
array(1, 2, 3)
----
When this expression is sent to the /stream handler it responds with
a json array.
When this expression is sent to the `/stream` handler it responds with a JSON array:
[source,json]
----
@ -66,7 +63,7 @@ For example, an array can be reversed with the `rev` function:
rev(array(1, 2, 3))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -89,15 +86,14 @@ When this expression is sent to the /stream handler it responds with:
}
----
Another example is the `length` function,
which returns the length of an array:
Another example is the `length` function, which returns the length of an array:
[source,text]
----
length(array(1, 2, 3))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -124,7 +120,7 @@ copies elements of an array from a start and end range.
copyOfRange(array(1,2,3,4,5,6), 1, 4)
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -149,21 +145,18 @@ When this expression is sent to the /stream handler it responds with:
== Vector Summarizations and Norms
There are a set of functions that perform
summerizations and return norms of arrays. These functions
operate over an array and return a single
value. The following vector summarizations and norm functions are available:
There is a set of functions that perform summarizations and return norms of arrays. These functions
operate over an array and return a single value. The following vector summarizations and norm functions are available:
`mult`, `add`, `sumSq`, `mean`, `l1norm`, `l2norm`, `linfnorm`.
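For readers who want the formulas behind some of these names, the sketch below writes them out in plain Python using their standard mathematical definitions. The Python function names are hypothetical stand-ins chosen for this illustration; they are not Solr APIs.

```python
import math

def mult(values):
    return math.prod(values)                    # product of all values

def sum_sq(values):
    return sum(v * v for v in values)           # sum of squares

def l1norm(values):
    return sum(abs(v) for v in values)          # sum of absolute values

def l2norm(values):
    return math.sqrt(sum_sq(values))            # Euclidean length

def linfnorm(values):
    return max(abs(v) for v in values)          # largest absolute value

vector = [2, 4, 8]
print(mult(vector))      # 64
print(l1norm(vector))    # 14
print(linfnorm(vector))  # 8
```

Each function reduces the whole array to one number, which is what distinguishes these summarizations from the element-wise functions.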
The example below is using the `mult` function,
which multiples all the values of an array.
The example below shows the `mult` function, which multiplies all the values of an array.
[source,text]
----
mult(array(2,4,8))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -184,14 +177,14 @@ When this expression is sent to the /stream handler it responds with:
The vector norm functions provide different formulas for calculating vector magnitude.
The example below calculates the *l2norm* of an array.
The example below calculates the `l2norm` of an array.
[source,text]
----
l2norm(array(2,4,8))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -212,12 +205,11 @@ When this expression is sent to the /stream handler it responds with:
== Scalar Vector Math
Scalar vector math functions add, subtract, multiple or divide a scalar value with every value in a vector.
Scalar vector math functions add, subtract, multiply or divide a scalar value with every value in a vector.
The following functions perform these operations: `scalarAdd`, `scalarSubtract`, `scalarMultiply`
and `scalarDivide`.
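As a rough illustration of the pattern, the Python sketch below applies a scalar to every value of a list. Only addition and multiplication are shown because their result does not depend on operand order; the function names are hypothetical and this is not Solr's implementation.

```python
def scalar_add(scalar, vector):
    """Add the scalar to every value and return a new list."""
    return [scalar + value for value in vector]

def scalar_multiply(scalar, vector):
    """Multiply every value by the scalar and return a new list."""
    return [scalar * value for value in vector]

print(scalar_add(3, [1, 2, 3]))       # [4, 5, 6]
print(scalar_multiply(3, [1, 2, 3]))  # [3, 6, 9]
```

The output always has the same length as the input vector.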
Below is an example of the `scalarMultiply` function, which multiplies the scalar value 3 with
Below is an example of the `scalarMultiply` function, which multiplies the scalar value `3` with
every value of an array.
[source,text]
@ -225,7 +217,7 @@ every value of an array.
scalarMultiply(3, array(1,2,3))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -251,7 +243,7 @@ When this expression is sent to the /stream handler it responds with:
== Element-By-Element Vector Math
Two vectors can be added, subtracted, multiplied and divided using element-by-element
vector math functions. The following element-by-element vector math functions are:
vector math functions. The available element-by-element vector math functions are:
`ebeAdd`, `ebeSubtract`, `ebeMultiply`, `ebeDivide`.
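
Element-by-element math pairs up the values at the same position in two vectors of equal length. A minimal Python sketch of that pairing follows; the `ebe` helper is a hypothetical name, and rejecting vectors of different lengths is an assumption of this sketch rather than documented Solr behavior.

```python
import operator

def ebe(op, a, b):
    """Apply a binary operator to the values at matching positions of a and b."""
    if len(a) != len(b):
        # Assumption of this sketch: vectors of different lengths are rejected.
        raise ValueError("vectors must have the same length")
    return [op(x, y) for x, y in zip(a, b)]

print(ebe(operator.add, [10, 15, 20], [1, 2, 3]))
print(ebe(operator.sub, [10, 15, 20], [1, 2, 3]))
print(ebe(operator.mul, [10, 15, 20], [1, 2, 3]))
print(ebe(operator.truediv, [10, 15, 20], [1, 2, 3]))
```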
The expression below performs the element-by-element subtraction of two arrays.
@ -261,7 +253,7 @@ The expression below performs the element-by-element subtraction of two arrays.
ebeSubtract(array(10, 15, 20), array(1,2,3))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -297,7 +289,7 @@ Below is an example of the `dotProduct` function:
dotProduct(array(2,3,0,0,0,1), array(2,0,1,0,0,3))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -323,7 +315,7 @@ Below is an example of the `cosineSimilarity` function:
cosineSimilarity(array(2,3,0,0,0,1), array(2,0,1,0,0,3))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----

View File

@ -18,11 +18,10 @@
This section of the user guide explores techniques
for retrieving streams of data from Solr and vectorizing the
*numeric* fields.
numeric fields.
The next chapter of the user guide covers
Text Analysis and Term Vectors which describes how to
vectorize *text* fields.
See the section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> which describes how to
vectorize text fields.
== Streams
@ -32,42 +31,42 @@ to vectorize and analyze the results sets.
Below are some of the key stream sources:
* *random*: Random sampling is widely used in statistics, probability and machine learning.
* *`random`*: Random sampling is widely used in statistics, probability and machine learning.
The `random` function returns a random sample of search results that match a
query. The random samples can be vectorized and operated on by math expressions and the results
can be used to describe and make inferences about the entire population.
* *timeseries*: The `timeseries`
* *`timeseries`*: The `timeseries`
expression provides fast distributed time series aggregations, which can be
vectorized and analyzed with math expressions.
* *knnSearch*: K-nearest neighbor is a core machine learning algorithm. The `knnSearch`
* *`knnSearch`*: K-nearest neighbor is a core machine learning algorithm. The `knnSearch`
function is a specialized knn algorithm optimized to find the k-nearest neighbors of a document in
a distributed index. Once the nearest neighbors are retrieved they can be vectorized
and operated on by machine learning and text mining algorithms.
* *sql*: SQL is the primary query language used by data scientists. The `sql` function supports
* *`sql`*: SQL is the primary query language used by data scientists. The `sql` function supports
data retrieval using a subset of SQL which includes both full text search and
fast distributed aggregations. The result sets can then be vectorized and operated
on by math expressions.
* *jdbc*: The `jdbc` function allows data from any JDBC compliant data source to be combined with
* *`jdbc`*: The `jdbc` function allows data from any JDBC compliant data source to be combined with
streams originating from Solr. Result sets from outside data sources can be vectorized and operated
on by math expressions in the same manner as result sets originating from Solr.
* *topic*: Messaging is an important foundational technology for large scale computing. The `topic`
* *`topic`*: Messaging is an important foundational technology for large scale computing. The `topic`
function provides publish/subscribe messaging capabilities by treating
Solr Cloud as a distributed message queue. Topics are extremely powerful
because they allow subscription by query. Topics can be used to support a broad set of
use cases including bulk text mining operations and AI alerting.
* *nodes*: Graph queries are frequently used by recommendation engines and are an important
* *`nodes`*: Graph queries are frequently used by recommendation engines and are an important
machine learning tool. The `nodes` function provides fast, distributed, breadth
first graph traversal over documents in a Solr Cloud collection. The node sets collected
by the `nodes` function can be operated on by statistical and machine learning expressions to
gain more insight into the graph.
* *search*: Ranked search results are a powerful tool for finding the most relevant
* *`search`*: Ranked search results are a powerful tool for finding the most relevant
documents from a large document corpus. The `search` expression
returns the top N ranked search results that match any
Solr query, including geo-spatial queries. The smaller set of relevant
@ -79,7 +78,7 @@ text mining expressions to gather insights about the data set.
The output of any streaming expression can be set to a variable.
Below is a very simple example using the `random` function to fetch
three random samples from collection1. The random samples are returned
as *tuples*, which contain name/value pairs.
as tuples which contain name/value pairs.
[source,text]
@ -87,7 +86,7 @@ as *tuples*, which contain name/value pairs.
let(a=random(collection1, q="*:*", rows="3", fl="price_f"))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -116,10 +115,10 @@ When this expression is sent to the /stream handler it responds with:
}
----
== Creating a Vector with the *col* Function
== Creating a Vector with the col Function
The `col` function iterates over a list of tuples and copies the values
from a specific column into an *array*.
from a specific column into an array.
The output of the `col` function is a numeric array that can be set to a
variable and operated on by math expressions.
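
Conceptually, `col` is a column extraction over a list of name/value records. The Python sketch below illustrates that idea only. The tuples and their `price_f` values are made up for the illustration and are not Solr output, and the `col` helper here is a hypothetical stand-in.

```python
def col(tuples, field):
    """Copy the values of one field from a list of tuples into a numeric array."""
    return [float(t[field]) for t in tuples]

# Made-up sample tuples; real tuples would come from a stream such as `random`.
tuples = [
    {"id": "doc1", "price_f": 12.5},
    {"id": "doc2", "price_f": 7.25},
    {"id": "doc3", "price_f": 30.0},
]

prices = col(tuples, "price_f")
print(prices)
print(sum(prices) / len(prices))  # the resulting vector can feed further math
```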
@ -157,7 +156,7 @@ let(a=random(collection1, q="*:*", rows="3", fl="price_f"),
Once a vector has been created any math expression that operates on vectors
can be applied. In the example below the `mean` function is applied to
the vector assigned to variable *b*.
the vector assigned to variable *`b`*.
[source,text]
----
@ -166,7 +165,7 @@ let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
c=mean(b))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -191,13 +190,14 @@ Matrices can be created by vectorizing multiple numeric fields
and adding them to a matrix. The matrices can then be operated on by
any math expression that operates on matrices.
[TIP]
====
Note that this section deals with the creation of matrices
from numeric data. The next chapter of the user guide covers
Text Analysis and Term Vectors which describes how to build TF-IDF
term vector matrices from text fields.
from numeric data. The section <<term-vectors.adoc#term-vectors,Text Analysis and Term Vectors>> describes how to build TF-IDF term vector matrices from text fields.
====
Below is a simple example where four random samples are taken
from different sub-populations in the data. The *price_f* field of
from different sub-populations in the data. The `price_f` field of
each random sample is
vectorized and the vectors are added as rows to a matrix.
Then the `sumRows`
@ -218,7 +218,7 @@ let(a=random(collection1, q="market:A", rows="5000", fl="price_f"),
j=sumRows(i))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -244,14 +244,14 @@ When this expression is sent to the /stream handler it responds with:
== Latitude / Longitude Vectors
The `latlonVectors` function wraps a list of tuples and parses a lat/long location field into
The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into
a matrix of lat/lon vectors. Each row in the matrix is a vector that contains the lat/lon
pair for the corresponding tuple in the list. The row labels for the matrix are
automatically set to the *id* field in the tuples. The the lat/lon matrix can then be operated
on by distance based machine learning functions using the `haversineMeters` distance measure.
automatically set to the `id` field in the tuples. The lat/lon matrix can then be operated
on by distance-based machine learning functions using the `haversineMeters` distance measure.
The `latlonVectors` function takes two parameters: a list of tuples and a named parameter called
*field*. The field parameter tells the `latlonVectors` function which field to parse the lat/lon
`field`, which tells the `latlonVectors` function which field to parse the lat/lon
vectors from.
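
The shape of the result can be sketched in plain Python. The sketch assumes the location field holds a `"lat,lon"` string, uses made-up tuples, and computes the haversine distance with a mean Earth radius of 6,371,000 meters. All of these are assumptions of the illustration, and the helper names are hypothetical rather than Solr functions.

```python
import math

EARTH_RADIUS_METERS = 6_371_000  # assumed mean Earth radius for this sketch

def latlon_vectors(tuples, field):
    """Return (row_labels, matrix) where each row is a [lat, lon] pair."""
    labels = [t["id"] for t in tuples]
    matrix = [[float(part) for part in t[field].split(",")] for t in tuples]
    return labels, matrix

def haversine_meters(a, b):
    """Great-circle distance in meters between two [lat, lon] vectors."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_METERS * math.asin(math.sqrt(h))

# Made-up tuples; the "lat,lon" string format is an assumption.
tuples = [
    {"id": "a", "loc_p": "0.0,0.0"},
    {"id": "b", "loc_p": "0.0,1.0"},
]

labels, matrix = latlon_vectors(tuples, "loc_p")
print(labels, matrix)
print(haversine_meters(matrix[0], matrix[1]))
```

Keeping the row labels next to the matrix lets each distance be traced back to the documents it was computed from.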
Below is an example of the `latlonVectors` function.
@ -262,7 +262,7 @@ let(a=random(collection1, q="*:*", fl="id, loc_p", rows="5"),
b=latlonVectors(a, field="loc_p"))
----
When this expression is sent to the /stream handler it responds with:
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
@ -301,5 +301,3 @@ When this expression is sent to the /stream handler it responds with:
}
}
----