SOLR-13105 - Visual Guide to Math Expressions (#2227)

* SOLR-13105: The Visual Guide to Streaming Expressions and Math Expressions
Cassandra Targett 2021-01-20 16:14:01 -06:00 committed by GitHub
parent 4da8f08c63
commit e8276e09a1
268 changed files with 4253 additions and 3727 deletions


@ -23,103 +23,82 @@ This section of the math expressions user guide covers computational geometry fu
A convex hull is the smallest convex set of points that encloses a data set. Math expressions has support for computing
the convex hull of a 2D data set. Once a convex hull has been calculated, a set of math expression functions
can be applied to geometrically describe the convex hull.
can be applied to geometrically describe and visualize the convex hull.
The `convexHull` function finds the convex hull of an observation matrix of 2D vectors.
Each row of the matrix is a 2D observation.
=== Visualization
In the example below a convex hull is calculated for a randomly generated set of 100 2D observations.
The `convexHull` function can be used to visualize a border around a
set of 2D points. Border visualizations can be useful for understanding where data points are
in relation to the border.
Then the following functions are called on the convex hull:
In the examples below the `convexHull` function is used
to visualize a border for a set of latitude and longitude points of rat sightings in the NYC311
complaints database. An investigation of the border around the rat sightings can be done
to better understand how rats may be entering or exiting the specific region.
-`getBaryCenter`: Returns the 2D point that is the bary center of the convex hull.
==== Scatter Plot
-`getArea`: Returns the area of the convex hull.
Before visualizing the convex hull it's often useful to visualize the 2D points as a scatter plot.
-`getBoundarySize`: Returns the boundary size of the convex hull.
In this example the `random` function draws a sample of records from the NYC311 (complaints database) collection where
the complaint description matches "rat sighting" and the zip code is 11238. The latitude and longitude fields
are then vectorized and plotted as a scatter plot with longitude on the x-axis and latitude on the
y-axis.
-`getVertices`: Returns a set of 2D points that are the vertices of the convex hull.
image::images/math-expressions/convex0.png[]
Notice from the scatter plot that many of the points appear to lie near the border of the plot.
[source,text]
----
let(echo="baryCenter, area, boundarySize, vertices",
x=sample(normalDistribution(0, 20), 100),
y=sample(normalDistribution(0, 10), 100),
observations=transpose(matrix(x,y)),
chull=convexHull(observations),
baryCenter=getBaryCenter(chull),
area=getArea(chull),
boundarySize=getBoundarySize(chull),
vertices=getVertices(chull))
----
==== Convex Hull Plot
When this expression is sent to the `/stream` handler it responds with:
The `convexHull` function can be used to visualize the border. The example uses the same points
drawn from the NYC311 database. But instead of plotting the points directly the latitude and
longitude points are added as rows to a matrix. The matrix is then transposed with the `transpose`
function so that each row of the matrix contains a single latitude and longitude point.
The `convexHull` function is then used to calculate the convex hull for the matrix of points.
The convex hull is set to a variable called `hull`.
[source,json]
----
{
"result-set": {
"docs": [
{
"baryCenter": [
-3.0969292101230343,
1.2160948182691975
],
"area": 3477.480599967595,
"boundarySize": 267.52419019533664,
"vertices": [
[
-66.17632818958485,
-8.394931552315256
],
[
-47.556667594765216,
-16.940434013651263
],
[
-33.13582183446102,
-17.30914425443977
],
[
-9.97459859015698,
-17.795012801599654
],
[
27.7705917246824,
-14.487224686587767
],
[
54.689432954170236,
-1.3333371984299605
],
[
35.97568654458672,
23.054169251772556
],
[
-15.539456215337585,
19.811330468093704
],
[
-17.05125031092752,
19.53581741341663
],
[
-35.92010024412891,
15.126430698395572
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 3
}
]
}
}
----
Once the convex hull has been created the `getVertices` function can be used to
retrieve the matrix of points in the scatter plot that comprise the convex border around the scatter plot.
The `colAt` function can then be used to retrieve the latitude and longitude vectors from the matrix
so they can be visualized by the `zplot` function. In the example below the convex hull points are
visualized as a scatter plot.
image::images/math-expressions/hullplot.png[]
Notice that the 15 points in the scatter plot describe the latitude and longitude points of the
convex hull.
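
To build intuition for what a hull computation returns, the sketch below finds the hull vertices of a small 2D data set in plain Python. It uses Andrew's monotone chain algorithm purely as an illustration; this is an assumption and not necessarily the algorithm behind `convexHull`. The names `cross`, `convex_hull` and `observations` are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    # Build the lower chain from left to right.
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    # Build the upper chain from right to left.
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Hypothetical observations: four corners of a rectangle plus interior points.
observations = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2), (3, 1)]
hull = convex_hull(observations)
print(hull)
```

Only the four corner points survive as vertices; the interior points are enclosed by the border, which is the same role the vertices returned by `getVertices` play in the plot above.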
==== Projecting and Clustering
Once a convex hull has been calculated the `projectToBorder` function can be used to project
points to the nearest point on the border. In the example below the `projectToBorder` function
is used to project the original scatter plot points to the nearest border.
The `projectToBorder` function returns a matrix of lat/lon points for the border projections. In
the example the matrix of border points is then clustered into 7 clusters using kmeans clustering.
The `zplot` function is then used to plot the clustered border points.
image::images/math-expressions/convex1.png[]
Notice in the visualization it's easy to see which spots along the border have the highest
density of points. In the case of the rat sightings this information is useful in understanding
which border points are closest for the rats to enter or exit from.
==== Plotting the Centroids
Once the border points have been clustered it's very easy to extract the centroids of the clusters
and plot them on a map. The example below extracts the centroids from the clusters using the
`getCentroids` function. `getCentroids` returns the matrix of lat/lon points which represent
the centroids of border clusters. The `colAt` function can then be used to extract the lat/lon
vectors so they can be plotted on a map using `zplot`.
image::images/math-expressions/convex2.png[]
The map above shows the centroids of the border clusters. The centroids from the highest
density clusters can now be zoomed and investigated geo-spatially to determine what might be
the best places to begin an investigation of the border.
== Enclosing Disk
@ -131,11 +110,11 @@ In the example below an enclosing disk is calculated for a randomly generated se
Then the following functions are called on the enclosing disk:
-`getCenter`: Returns the 2D point that is the center of the disk.
* `getCenter`: Returns the 2D point that is the center of the disk.
-`getRadius`: Returns the radius of the disk.
* `getRadius`: Returns the radius of the disk.
-`getSupportPoints`: Returns the support points of the disk.
* `getSupportPoints`: Returns the support points of the disk.
[source,text]
----


@ -16,7 +16,7 @@
// specific language governing permissions and limitations
// under the License.
These functions support constructing a curve.
These functions support constructing a curve through bivariate non-linear data.
== Polynomial Curve Fitting
@ -25,201 +25,86 @@ the non-linear relationship between two random variables.
The `polyfit` function is passed x- and y-axes and fits a smooth curve to the data.
If only a single array is provided it is treated as the y-axis and a sequence is generated
for the x-axis.
The `polyfit` function also has a parameter the specifies the degree of the polynomial. The higher
for the x-axis. A third parameter can be added that specifies the degree of the polynomial. If the degree is
not provided a 3 degree polynomial is used by default. The higher
the degree the more curves that can be modeled.
The example below uses the `polyfit` function to fit a curve to an array using
a 3 degree polynomial. The fitted curve is then subtracted from the original curve. The output
shows the error between the fitted curve and the original curve, known as the residuals.
The output also includes the sum-of-squares of the residuals which provides a measure
of how large the error is.
The `polyfit` function can be visualized in a similar manner to linear regression with
Zeppelin-Solr.
[source,text]
----
let(echo="residuals, sumSqError",
y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 6, 5, 5, 3, 2, 1, 0),
curve=polyfit(y, 3),
residuals=ebeSubtract(y, curve),
sumSqError=sumSq(residuals))
----
The example below uses the `polyfit` function to fit a non-linear curve to a scatter
plot of a random sample. The blue points are the scatter plot of the original observations and the red points
are the predicted curve.
When this expression is sent to the `/stream` handler it
responds with:
image::images/math-expressions/polyfit.png[]
[source,json]
----
{
"result-set": {
"docs": [
{
"residuals": [
0.5886274509803899,
-0.0746078431372561,
-0.49492135315664765,
-0.6689571213100631,
-0.5933591898297781,
0.4352283990519288,
0.32016160310277897,
1.1647963800904968,
0.272488687782805,
-0.3534055160525744,
0.2904697263520779,
-0.7925296272355089,
-0.5990476190476182,
-0.12572829131652274,
0.6307843137254909
],
"sumSqError": 4.7294282482223595
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
In the example above a random sample containing two fields, `filesize_d`
and `response_d`, is drawn from the `logs` collection.
The two fields are vectorized and set to the variables `x` and `y`.
In the next example the curve is fit using a 5 degree polynomial. Notice that the curve
is fit closer, shown by the smaller residuals and lower value for the sum-of-squares of the
residuals. This is because the higher polynomial produced a closer fit.
Then the `polyfit` function is used to fit a non-linear model to the data using a 5 degree
polynomial. The `polyfit` function returns a model that is then directly plotted
by `zplot` along with the original observations.
[source,text]
----
let(echo="residuals, sumSqError",
y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 6, 5, 5, 3, 2, 1, 0),
curve=polyfit(y, 5),
residuals=ebeSubtract(y, curve),
sumSqError=sumSq(residuals))
----
The fitted model can also be used
by the `predict` function in the same manner as linear regression. The example below
uses the fitted model to predict a response time for a file size of 42000.
When this expression is sent to the `/stream` handler it
responds with:
image::images/math-expressions/polyfit-predict.png[]
[source,json]
----
{
"result-set": {
"docs": [
{
"residuals": [
-0.12337461300309674,
0.22708978328173413,
0.12266015718028167,
-0.16502738747320755,
-0.41142804563857105,
0.2603044014808713,
-0.12128970101106162,
0.6234168308471704,
-0.1754692675745293,
-0.5379689969473249,
0.4651616185671843,
-0.288175756132409,
0.027970945463215102,
0.18699690402476687,
-0.09086687306501587
],
"sumSqError": 1.413089480179252
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
If an array of predictor values is provided an array of predictions will be returned.
The `polyfit` model performs both *interpolation* and *extrapolation*,
which means that it can predict results both within the bounds of the data set
and beyond the bounds.
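
The idea of fitting a polynomial and then predicting inside and outside the observed range can be sketched in plain Python. This is a minimal least-squares sketch that solves the normal equations directly, which is an assumption made for illustration and may differ from how `polyfit` works internally. The helper names `polyfit` and `predict` below are hypothetical Python functions, and the data is made up.

```python
def polyfit(xs, ys, degree=3):
    """Least-squares polynomial fit; returns coefficients, lowest power first."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y, where V[i][j] = xs[i] ** j.
    aug = [
        [sum(x ** (r + c) for x in xs) for c in range(n)]
        + [sum(y * x ** r for x, y in zip(xs, ys))]
        for r in range(n)
    ]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aug[r][n] - tail) / aug[r][r]
    return coeffs


def predict(coeffs, x):
    """Evaluate the fitted polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


# Made-up data that follows y = 1 + 2x + 3x^2 exactly.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
model = polyfit(xs, ys, 2)

inside = predict(model, 2.5)   # interpolation: within the observed x range
beyond = predict(model, 10)    # extrapolation: outside the observed x range
print(inside, beyond)
```

Solving the normal equations directly becomes numerically fragile for high degrees or large x values, so this sketch is only suitable for small examples like the one shown.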
=== Residuals
The residuals can be calculated and visualized in the same manner as linear
regression as well. In the example below the `ebeSubtract` function is used
to subtract the fitted model from the observed values, to
calculate a vector of residuals. The residuals are then plotted in a *residual plot*
with the predictions along the x-axis and the model error on the y-axis.
image::images/math-expressions/polyfit-resid.png[]
=== Prediction, Derivatives and Integrals
== Gaussian Curve Fitting
The `polyfit` function returns a function that can be used with the `predict`
function.
The `gaussfit` function fits a smooth curve through a Gaussian peak. The `gaussfit`
function takes an x- and y-axis and fits a smooth gaussian curve to the data. If
only one vector of numbers is passed, `gaussfit` will treat it as the y-axis
and will generate a sequence for the x-axis.
In the example below the x-axis is included for clarity.
The `polyfit` function returns a function for the fitted curve.
The `predict` function is then used to predict a value along the curve, in this
case the prediction is made for the *`x`* value of 5.
One of the interesting use cases for `gaussfit` is to visualize how well a regression
model's residuals fit a normal distribution.
[source,text]
----
let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 6, 5, 5, 3, 2, 1, 0),
curve=polyfit(x, y, 5),
p=predict(curve, 5))
----
One of the characteristics of a well-fit regression model is that its residuals will ideally fit a normal distribution.
We can
test this by building a histogram of the residuals and then fitting a gaussian curve to the curve of the histogram.
When this expression is sent to the `/stream` handler it
responds with:
In the example below the residuals from a `polyfit` regression are modeled with the
`hist` function to return a histogram with 32 bins. The `hist` function returns
a list of tuples with statistics about each bin. In the example the `col` function is
used to return a vector with the `N` column for each bin, which is the count of
observations in the
bin. If the residuals are normally distributed we would expect the bin counts
to roughly follow a gaussian curve.
[source,json]
----
{
"result-set": {
"docs": [
{
"p": 5.439695598519129
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
The bin count vector is then passed to `gaussfit` as the y-axis. `gaussfit` generates
a sequence for the x-axis and then fits the gaussian curve to the data.
The `derivative` and `integrate` functions can be used to compute the derivative
and integrals for the fitted
curve. The example below demonstrates how to compute a derivative
for the fitted curve.
`zplot` is then used to plot the original bin counts and the fitted curve. In the
example below, the blue line is the bin counts, and the smooth yellow line is the
fitted curve. We can see that the binned residuals fit fairly well to a normal
distribution.
image::images/math-expressions/gaussfit.png[]
[source,text]
----
let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 6, 5, 5, 3, 2, 1, 0),
curve=polyfit(x, y, 5),
d=derivative(curve))
----
The second plot shows the two curves overlaid with an area chart:
When this expression is sent to the `/stream` handler it
responds with:
image::images/math-expressions/gaussfit2.png[]
[source,json]
----
{
"result-set": {
"docs": [
{
"d": [
0.3198918573686361,
0.9261492094077225,
1.2374272373653175,
1.30051359631081,
1.1628032287629813,
0.8722983646900058,
0.47760852150945,
0.02795050408827482,
-0.42685159525716865,
-0.8363663967611356,
-1.1495552332084857,
-1.3147721499346892,
-1.2797639048258267,
-0.9916699683185771,
-0.3970225234002308
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
== Harmonic Curve Fitting
@ -232,169 +117,19 @@ The example below shows `harmfit` fitting a single oscillation of a sine wave. T
returns the smoothed values at each control point. The return value is also a model which can be used by
the `predict`, `derivative` and `integrate` functions.
There are also three helper functions that can be used to retrieve the estimated parameters of the fitted model:
* `getAmplitude`: Returns the amplitude of the sine wave.
* `getAngularFrequency`: Returns the angular frequency of the sine wave.
* `getPhase`: Returns the phase of the sine wave.
NOTE: The `harmfit` function works best when run on a single oscillation rather than a long sequence of
oscillations. This is particularly true if the sine wave has noise. After the curve has been fit it can be
extrapolated to any point in time in the past or future.
In the example below the `harmfit` function fits control points, provided as x and y axes, and then the
angular frequency, phase and amplitude are retrieved from the fitted model.
[source,text]
----
let(echo="freq, phase, amp",
x=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19),
y=array(-0.7441113653915925,-0.8997532112139415, -0.9853140681578838, -0.9941296760805463,
-0.9255133950087844, -0.7848096869247675, -0.5829778403072583, -0.33573836075915076,
-0.06234851460699166, 0.215897602691855, 0.47732764497752245, 0.701579055431586,
0.8711850882773975, 0.9729352782968976, 0.9989043923858761, 0.9470697190130273,
0.8214686154479715, 0.631884041542757, 0.39308257356494, 0.12366424851680227),
model=harmfit(x, y),
freq=getAngularFrequency(model),
phase=getPhase(model),
amp=getAmplitude(model))
----
In the example below the original control points are shown in blue and the fitted curve is shown in yellow.
[source,json]
----
{
"result-set": {
"docs": [
{
"freq": 0.28,
"phase": 2.4100000000000006,
"amp": 0.9999999999999999
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
=== Interpolation and Extrapolation
The `harmfit` function returns a fitted model of the sine wave that can be used by the `predict` function to
interpolate or extrapolate the sine wave.
The example below uses the fitted model to extrapolate the sine wave beyond the control points
to the x-axis points 20, 21, 22, 23.
[source,text]
----
let(x=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19),
y=array(-0.7441113653915925,-0.8997532112139415, -0.9853140681578838, -0.9941296760805463,
-0.9255133950087844, -0.7848096869247675, -0.5829778403072583, -0.33573836075915076,
-0.06234851460699166, 0.215897602691855, 0.47732764497752245, 0.701579055431586,
0.8711850882773975, 0.9729352782968976, 0.9989043923858761, 0.9470697190130273,
0.8214686154479715, 0.631884041542757, 0.39308257356494, 0.12366424851680227),
model=harmfit(x, y),
extrapolation=predict(model, array(20, 21, 22, 23)))
----
[source,json]
----
{
"result-set": {
"docs": [
{
"extrapolation": [
-0.1553861764415666,
-0.42233370833176975,
-0.656386037906838,
-0.8393130343914845
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
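
The extrapolated values above can be reproduced by hand from the three estimated parameters. The sketch below assumes the fitted model has the form `amp * cos(freq * x + phase)`; that form is an assumption, chosen because it is consistent with the sample outputs shown in this section. The Python function `predict` is a hypothetical stand-in for evaluating the model.

```python
import math

# Parameters as reported by getAngularFrequency, getPhase and getAmplitude
# in the sample output earlier in this section.
freq = 0.28
phase = 2.4100000000000006
amp = 0.9999999999999999


def predict(x):
    """Evaluate the assumed model amp * cos(freq * x + phase) at x."""
    return amp * math.cos(freq * x + phase)


# Extrapolate beyond the control points (x = 0..19) to x = 20..23.
extrapolation = [predict(x) for x in (20, 21, 22, 23)]
print(extrapolation)
```

Because the model is a closed-form function of `x`, the same call works for any point in the past or future, which is what makes extrapolation from a single fitted oscillation possible.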
== Gaussian Curve Fitting
The `gaussfit` function fits a smooth curve through a Gaussian peak.
This is shown in the example below.
image::images/math-expressions/harmfit.png[]
[source,text]
----
let(x=array(0,1,2,3,4,5,6,7,8,9, 10),
y=array(4,55,1200,3028,12000,18422,13328,6426,1696,239,20),
f=gaussfit(x, y))
----
The output of `harmfit` is a model that can be used by the `predict` function to interpolate and extrapolate
the sine wave. In the example below the `natural` function creates an x-axis from 0 to 127
used to predict results for the model. This extrapolates the sine wave out to 128 points, when
the original model curve had only 19 control points.
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"f": [
2.81764431935644,
61.157417979413424,
684.2328985468831,
3945.9411154167447,
11729.758936952656,
17972.951897338007,
14195.201949425435,
5779.03836032222,
1212.7224502169634,
131.17742331530349,
7.3138931735866946
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
Like the `polyfit` function, the `gaussfit` function returns a function that can
be used directly by the `predict`, `derivative` and `integrate` functions.
The example below demonstrates how to compute an integral for a fitted Gaussian curve.
[source,text]
----
let(x=array(0,1,2,3,4,5,6,7,8,9, 10),
y=array(4,55,1200,3028,12000,18422,13328,6426,1696,239,20),
f=gaussfit(x, y),
i=integrate(f, 0, 5))
----
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"i": 25261.666789766092
},
{
"EOF": true,
"RESPONSE_TIME": 3
}
]
}
}
----
image::images/math-expressions/harmfit2.png[]


@ -19,438 +19,77 @@
This section of the user guide explores functions that are commonly used in the field of
Digital Signal Processing (DSP).
== Dot Product
The `dotProduct` function is used to calculate the dot product of two numeric arrays.
The dot product is a fundamental calculation for the DSP functions discussed in this section. Before diving into
the more advanced DSP functions it's useful to develop a deeper intuition of the dot product.
The dot product operation is performed in two steps:
. Element-by-element multiplication of two vectors which produces a vector of products.
. Sum the vector of products to produce a scalar result.
This simple bit of math has a number of important applications.
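
The two steps can be sketched in plain Python as follows. The function names `ebe_multiply` and `dot_product` are hypothetical Python helpers written for this illustration, not Solr functions, and the sample vectors are assumptions.

```python
def ebe_multiply(a, b):
    """Step 1: element-by-element multiplication, producing a vector of products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


def dot_product(a, b):
    """Step 2: sum the vector of products to produce a scalar."""
    return sum(ebe_multiply(a, b))


a = [10, 20, 30, 40, 50]
weights = [0.2] * 5          # equal weights that sum to 1

products = ebe_multiply(a, weights)
result = dot_product(a, weights)
print(products, result)
```

With equal weights that sum to 1, the scalar result is the mean of the first vector, since multiplying each element by 0.2 and adding is the same as adding the five elements and dividing by 5.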
=== Representing Linear Combinations
The `dotProduct` performs the math of a _linear combination_. A linear combination has the following form:
[source,text]
----
(a1*v1)+(a2*v2)...
----
In the above example `a1` and `a2` are random variables that change. `v1` and `v2` are constant values.
When computing the dot product the elements of two vectors are multiplied together and the results are added.
If the first vector contains random variables and the second vector contains constant values
then the dot product is performing a linear combination.
This scenario comes up again and again in machine learning. For example both linear and logistic regression
solve for a vector of constant weights. In order to perform a prediction, a dot product is calculated
between a random observation vector and the constant weight vector. That dot product is a linear combination because
one of the vectors holds constant weights.
Let's look at a simple example of how a linear combination can be used to find the mean of a vector of numbers.
In the example below two arrays are set to variables *`a`* and *`b`* and then operated on by the `dotProduct` function.
The output of the `dotProduct` function is set to variable *`c`*.
The `mean` function is then used to compute the mean of the first array which is set to the variable *`d`*.
Both the dot product and the mean are included in the output.
When we look at the output of this expression we see that the dot product and the mean of the first array
are both 30.
The `dotProduct` function calculated the mean of the first array.
[source,text]
----
let(echo="c, d",
a=array(10, 20, 30, 40, 50),
b=array(.2, .2, .2, .2, .2),
c=dotProduct(a, b),
d=mean(a))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": 30,
"d": 30
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
To get a better understanding of how the dot product calculated the mean we can perform the steps of the
calculation using vector math and look at the output of each step.
In the example below the `ebeMultiply` function performs an element-by-element multiplication of
two arrays. This is the first step of the dot product calculation. The result of the element-by-element
multiplication is assigned to variable *`c`*.
In the next step the `add` function adds all the elements of the array in variable *`c`*.
Notice that multiplying each element of the first array by .2 and then adding the results is
equivalent to the formula for computing the mean of the first array. The formula for computing the mean
of an array is to add all the elements and divide by the number of elements.
The output includes the output of both the `ebeMultiply` function and the `add` function.
[source,text]
----
let(echo="c, d",
a=array(10, 20, 30, 40, 50),
b=array(.2, .2, .2, .2, .2),
c=ebeMultiply(a, b),
d=add(c))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": [
2,
4,
6,
8,
10
],
"d": 30
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
In the example above two arrays were combined in a way that produced the mean of the first. In the second array
each value was set to .2. Another way of looking at this is that each value in the second array is
applying the same weight to the values in the first array.
By varying the weights in the second array we can produce a different result.
For example if the first array represents a time series,
the weights in the second array can be set to add more weight to a particular element in the first array.
The example below creates a weighted average with the weight decreasing from right to left.
Notice that the weighted mean
of 36.666 is larger than the previous mean which was 30. This is because more weight was given to the last element in the
array.
[source,text]
----
let(echo="c, d",
a=array(10, 20, 30, 40, 50),
b=array(.066666666666666,.133333333333333,.2, .266666666666666, .33333333333333),
c=ebeMultiply(a, b),
d=add(c))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": [
0.66666666666666,
2.66666666666666,
6,
10.66666666666664,
16.6666666666665
],
"d": 36.66666666666646
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
=== Representing Correlation
Often when we think of correlation, we are thinking of _Pearson correlation_ in the field of statistics. But the definition of
correlation is actually more general: a mutual relationship or connection between two or more things.
In the field of digital signal processing the dot product is used to represent correlation. The examples below demonstrate
how the dot product can be used to represent correlation.
In the example below the dot product is computed for two vectors. Notice that the vectors have different values that fluctuate
together. The output of the dot product is 190, which is hard to reason about because it's not scaled.
[source,text]
----
let(echo="c",
a=array(10, 20, 30, 20, 10),
b=array(1, 2, 3, 2, 1),
c=dotProduct(a, b))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": 190
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
One approach to scaling the dot product is to first scale the vectors so that both vectors have a magnitude of 1. Vectors with a
magnitude of 1, also called unit vectors, are used when comparing only the angle between vectors rather than the magnitude.
The `unitize` function can be used to unitize the vectors before calculating the dot product.
Notice in the example below the dot product result, set to variable *`e`*, is effectively 1. When applied to unit vectors the dot product
will be scaled between 1 and -1. Also notice in the example `cosineSimilarity` is calculated on the unscaled vectors and the
answer is also effectively 1. This is because cosine similarity is a scaled dot product.
[source,text]
----
let(echo="e, f",
a=array(10, 20, 30, 20, 10),
b=array(1, 2, 3, 2, 1),
c=unitize(a),
d=unitize(b),
e=dotProduct(c, d),
f=cosineSimilarity(a, b))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"e": 0.9999999999999998,
"f": 0.9999999999999999
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
If we transpose the first two numbers in the first array, so that the vectors
are not perfectly correlated, we see that the cosine similarity drops. This illustrates
how the dot product represents correlation.
[source,text]
----
let(echo="c",
a=array(20, 10, 30, 20, 10),
b=array(1, 2, 3, 2, 1),
c=cosineSimilarity(a, b))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": 0.9473684210526314
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
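
The relationship between the unitized dot product and cosine similarity can be checked with a few lines of plain Python. This is a sketch for illustration; `dot_product`, `unitize` and `cosine_similarity` below are hypothetical Python helpers that mirror the names used in this section, not the Solr implementations.

```python
import math


def dot_product(a, b):
    """Sum of element-by-element products."""
    return sum(x * y for x, y in zip(a, b))


def unitize(v):
    """Scale a vector so that its magnitude is 1."""
    norm = math.sqrt(dot_product(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


a = [10, 20, 30, 20, 10]
b = [1, 2, 3, 2, 1]

unit_dot = dot_product(unitize(a), unitize(b))   # dot product of unit vectors
similarity = cosine_similarity(a, b)             # same quantity, computed directly

# Swap the first two values of a so the vectors no longer move together.
swapped = cosine_similarity([20, 10, 30, 20, 10], b)
print(unit_dot, similarity, swapped)
```

For the swapped vector the result works out to 180 / 190, about 0.947, which agrees with the value in the output above.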
== Convolution
The `conv` function calculates the convolution of two vectors. The convolution is calculated by reversing
The `conv` function calculates the convolution of two vectors. The convolution is calculated by *reversing*
the second vector and sliding it across the first vector. The dot product of the two vectors
is calculated at each point as the second vector is slid across the first vector.
The dot products are collected in a third vector which is the convolution of the two vectors.
=== Moving Average Function
Before looking at an example of convolution its useful to review the `movingAvg` function. The moving average
Before looking at an example of convolution it's useful to review the `movingAvg` function. The moving average
function computes a moving average by sliding a window across a vector and computing
the average of the window at each shift. If that sounds similar to convolution, that's because the `movingAvg` function
is syntactic sugar for convolution.
the average of the window at each shift. If that sounds similar to convolution, that's because the `movingAvg`
function involves a sliding window approach similar to convolution.
Below is an example of a moving average with a window size of 5. Notice that original vector has 13 elements
Below is an example of a moving average with a window size of 5. Notice that the original vector has 13 elements
but the result of the moving average has only 9 elements. This is because the `movingAvg` function
only begins generating results when it has a full window. In this case because the window size is 5 so the
moving average starts generating results from the 4^th^ index of the original array.
only begins generating results when it has a full window. The `ltrim` function is used to trim the
first four elements from the original `y` array to line up with the moving average.
[source,text]
----
let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
b=movingAvg(a, 5))
----
image::images/math-expressions/conv1.png[]
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"b": [
3,
4,
5,
5.6,
5.8,
5.6,
5,
4,
3
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
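
The sliding window behavior can be sketched in plain Python. The function `moving_avg` is a hypothetical helper written for this illustration; it only emits a value once the window is full, which is why the result is shorter than the input.

```python
def moving_avg(values, window):
    """Average of each full window as it slides across the vector."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
averages = moving_avg(a, 5)

# Dropping the first window - 1 = 4 elements lines the original values up
# with the averages, one original value per full window.
aligned = a[4:]
print(averages)
print(aligned)
```

A vector of 13 elements with a window of 5 yields 13 - 5 + 1 = 9 averages, matching the length of the trimmed original.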
=== Convolutional Smoothing
The moving average can also be computed using convolution. In the example
below the `conv` function is used to compute the moving average of the first array
by applying the second array as the filter.
by applying the second array as a filter.
Looking at the result, we see that it is not exactly the same as the result
of the `movingAvg` function. That is because the `conv` pads zeros
Looking at the result, we see that the convolution produced an array with 17 values instead of the 9 values created by the
moving average. That is because the `conv` function pads zeros
to the front and back of the first vector so that the window size is always full.
[source,text]
----
let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
b=array(.2, .2, .2, .2, .2),
c=conv(a, b))
----
image::images/math-expressions/conv2.png[]
When this expression is sent to the `/stream` handler it responds with:
We achieve the same result as the `movingAvg` function by trimming the first and last 4 values of
the convolution result using the `ltrim` and `rtrim` functions.
[source,json]
----
{
"result-set": {
"docs": [
{
"c": [
0.2,
0.6000000000000001,
1.2,
2.0000000000000004,
3.0000000000000004,
4,
5,
5.6000000000000005,
5.800000000000001,
5.6000000000000005,
5.000000000000001,
4,
3,
2,
1.2000000000000002,
0.6000000000000001,
0.2
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
The example below plots both the trimmed convolution and the moving average on the same plot. Notice that
they perfectly overlap.
We achieve the same result as the `movingAvg` function by using the `copyOfRange` function to copy a range of
the result that drops the first and last 4 values of
the convolution result. In the example below the `precision` function is also used to remove floating point errors from the
convolution result. When this is added the output is exactly the same as the `movingAvg` function.
image::images/math-expressions/conv3.png[]
[source,text]
----
let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
b=array(.2, .2, .2, .2, .2),
c=conv(a, b),
d=copyOfRange(c, 4, 13),
e=precision(d, 2))
----
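
The same comparison can be sketched in plain Python. The `conv` function below is a hypothetical full discrete convolution written for this illustration; it is not the Solr implementation, and rounding to two decimal places stands in for removing floating point noise.

```python
def conv(signal, kernel):
    """Full discrete convolution; output length is len(signal) + len(kernel) - 1."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
kernel = [0.2] * 5            # filter length 5, equal weights that sum to 1

full = conv(a, kernel)        # 17 values, including the zero-padded edges

# Drop the first and last 4 values, which come from partially filled windows,
# then round away floating point noise.
trimmed = [round(v, 2) for v in full[4:-4]]
print(len(full), trimmed)
```

The 9 trimmed values are the simple moving average with a window size of 5. Reversing the kernel makes no difference here because all of its weights are equal.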
This demonstrates how convolution can be used to smooth a signal by sliding a filter across the signal and
computing the dot product at each point. The smoothing effect is caused by the design of the filter.
In the example, the filter length is 5 and each value in the filter is .2. This filter calculates a
simple moving average with a window size of 5.
The formula for computing a simple moving average using convolution is to make the filter length the window
size and make the values of the filter all the same and sum to 1. A moving average with a window size of 4
can be computed by changing the filter to a length of 4 with each value being .25.
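The relationship between the filter and the moving average can be sketched in plain Python. This is an illustration under stated assumptions, not Solr's implementation: `conv` and `moving_avg` are hypothetical helpers, slicing stands in for `copyOfRange`, and `round` stands in for `precision`.

```python
def conv(signal, kernel):
    """Full discrete convolution (illustrative helper, not Solr's code)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def moving_avg(values, window):
    """Simple moving average over full windows only."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
window = 5
kernel = [1.0 / window] * window      # equal weights that sum to 1

full = conv(a, kernel)
# Drop the (window - 1) partially overlapping values at each end.
trimmed = full[window - 1:len(full) - (window - 1)]
smoothed = [round(v, 2) for v in trimmed]

print(smoothed)
print(moving_avg(a, window))
```

Changing `window` to 4 builds a filter of four values of .25, which corresponds to a moving average with a window size of 4.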
==== Changing the Weights
The filter, which is sometimes called the *kernel*, can be viewed as a vector of weights. In the initial
example all values in the filter have the same weight (.2). The weights in the filter can be changed to
produce different smoothing effects. This is demonstrated in the example below.
In this example the filter increases in weight from .1 to .3. This places more weight towards the front
of the filter. Notice that the filter is reversed with the `rev` function before the `conv` function applies it.
This is done because convolution will reverse
the filter. In this case we reverse it ahead of time and when convolution reverses it back, it is the same
as the original filter.
The plot shows the effect of the different weights in the filter. The dark blue line is the initial array.
The light blue line is the convolution and the orange line is the moving average. Notice that the convolution
responds more quickly to the movements in the underlying array. This is because more weight has been placed
at the front of the filter.
image::images/math-expressions/conv4.png[]
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"e": [
3,
4,
5,
5.6,
5.8,
5.6,
5,
4,
3
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
== Cross-Correlation
@ -467,54 +106,8 @@ rather than the convolution calculation.
Notice in the result the highest value is 217. This is the point where the two vectors have the highest correlation.
[source,text]
----
let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
b=array(4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
c=conv(a, rev(b)))
----
image::images/math-expressions/crosscorr.png[]
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": [
1,
4,
10,
20,
35,
56,
84,
116,
149,
180,
203,
216,
217,
204,
180,
148,
111,
78,
50,
28,
13,
4
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
== Find Delay
@ -525,67 +118,29 @@ and then computes the delay between the two signals.
Below is an example of the `finddelay` function. Notice that the `finddelay` function reports a 3 period delay between the first
and second signal.
[source,text]
----
let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
b=array(4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
c=finddelay(a, b))
----
image::images/math-expressions/delay.png[]
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": 3
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
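The 3 period delay can be reproduced with a short Python sketch. It assumes one common approach: cross-correlate the signals by convolving the first with the reversed second, then measure how far the peak sits from the zero-lag position. The `conv` and `find_delay` helpers are hypothetical, and the sign convention is an assumption; this is not presented as how `finddelay` is implemented.

```python
def conv(signal, kernel):
    """Full discrete convolution (illustrative helper, not Solr's code)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def find_delay(first, second):
    """Estimate the lag of the cross-correlation peak (assumed convention)."""
    xcorr = conv(first, second[::-1])   # same as conv(a, rev(b))
    peak = max(range(len(xcorr)), key=lambda i: xcorr[i])
    # Index len(second) - 1 is where the two signals start aligned.
    return peak - (len(second) - 1)


a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [4, 5, 6, 7, 6, 5, 4, 3, 2, 1]

xcorr = conv(a, b[::-1])
print(max(xcorr))         # height of the cross-correlation peak
print(find_delay(a, b))   # offset of that peak from zero lag
```

For these two arrays the peak value is 217, the highest value in the cross-correlation result shown earlier, and its offset from the zero-lag position is 3.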
== Oscillate (Sine Wave)
The `oscillate` function generates a periodic oscillating signal which can be used to model and study sine waves.
The `oscillate` function takes three parameters: *amplitude*, *angular frequency*
and *phase* and returns a vector containing the y-axis points of a sine wave.
The `oscillate` function takes three parameters: `amplitude`, `angular frequency`, and `phase` and returns a vector containing the y-axis points of a sine wave.
The y-axis points were generated from an x-axis sequence of 0-127.
Below is an example of the `oscillate` function called with an amplitude of
1, an angular frequency of .28, and a phase of 1.57.
[source,text]
----
oscillate(1, 0.28, 1.57)
----
The result of the `oscillate` function is plotted below:
image::images/math-expressions/sinewave.png[]
=== Sine Wave Interpolation, Extrapolation
=== Sine Wave Interpolation & Extrapolation
The `oscillate` function returns a function which can be used by the `predict` function to interpolate or extrapolate a sine wave.
The example below extrapolates the sine wave to an x-axis sequence of 0-256.
[source,text]
----
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)))
----
The extrapolated sine wave is plotted below:
image::images/math-expressions/sinewave256.png[]
@ -599,11 +154,6 @@ A few examples, with plots, will help to understand the concepts.
The first example simply revisits the example above of an extrapolated sine wave. The result of this
is plotted in the image below. Notice that there is a structure to the plot that is clearly not random.
[source,text]
----
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)))
----
image::images/math-expressions/sinewave256.png[]
@ -612,11 +162,6 @@ In the next example the `sample` function is used to draw 256 samples from a `un
vector of random data. The result of this is plotted in the image below. Notice that there is no clear structure to the
data and the data appears to be random.
[source,text]
----
sample(uniformDistribution(-1.5, 1.5), 256)
----
image::images/math-expressions/noise.png[]
@ -625,13 +170,6 @@ The result of this is plotted in the image below. Notice that the sine wave has
somewhat within the noise. It's difficult to say for sure if there is structure. As plots
become more dense it can become harder to see a pattern hidden within noise.
[source,text]
----
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)),
c=sample(uniformDistribution(-1.5, 1.5), 256),
d=ebeAdd(b,c))
----
image::images/math-expressions/hidden-signal.png[]
@ -649,12 +187,6 @@ intensity as the sine wave slides farther away from being directly lined up.
This is the autocorrelation plot of a pure signal.
[source,text]
----
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)),
c=conv(b, rev(b)))
----
image::images/math-expressions/signal-autocorrelation.png[]
@ -666,11 +198,6 @@ This is followed by another long period of low intensity correlation.
This is the autocorrelation plot of pure noise.
[source,text]
----
let(a=sample(uniformDistribution(-1.5, 1.5), 256),
b=conv(a, rev(a)))
----
image::images/math-expressions/noise-autocorrelation.png[]
@ -680,25 +207,17 @@ Notice that this plot shows very clear signs of structure which is similar to au
pure signal. The correlation is less intense due to noise but the shape of the correlation plot suggests
strongly that there is an underlying signal hidden within the noise.
[source,text]
----
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)),
c=sample(uniformDistribution(-1.5, 1.5), 256),
d=ebeAdd(b, c),
e=conv(d, rev(d)))
----
image::images/math-expressions/hidden-signal-autocorrelation.png[]
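The autocorrelation step, convolving a vector with its own reverse, can be sketched in Python as follows. The setup is illustrative only: the sine formula is a stand-in for the signal and makes no claim about how `oscillate` is defined, the seeded `random.Random` stands in for `sample(uniformDistribution(-1.5, 1.5), 256)`, and `conv` is a hypothetical helper.

```python
import math
import random


def conv(signal, kernel):
    """Full discrete convolution (illustrative helper, not Solr's code)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


rng = random.Random(42)   # fixed seed so the sketch is repeatable
n = 256
signal = [math.sin(0.28 * x) for x in range(n)]       # stand-in sine wave
noise = [rng.uniform(-1.5, 1.5) for _ in range(n)]
d = [s + e for s, e in zip(signal, noise)]            # element-wise add

autocorr = conv(d, d[::-1])   # same shape of call as conv(d, rev(d))
zero_lag = len(d) - 1         # index where the vector lines up with itself

print(len(autocorr))
print(autocorr[zero_lag] == max(autocorr))
```

The value at the zero-lag index is the sum of the squared samples, which is why an autocorrelation plot always peaks where the vector is directly lined up with itself; the structure away from that peak is what hints at a hidden signal.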
== Discrete Fourier Transform
The convolution based functions described above are operating on signals in the time domain. In the time
domain the X axis is time and the Y axis is the quantity of some value at a specific point in time.
The convolution-based functions described above are operating on signals in the time domain. In the time
domain the x-axis is time and the y-axis is the quantity of some value at a specific point in time.
The discrete Fourier Transform translates a time domain signal into the frequency domain.
In the frequency domain the X axis is frequency, and Y axis is the accumulated power at a specific frequency.
In the frequency domain the x-axis is frequency, and the y-axis is the accumulated power at a specific frequency.
The basic principle is that every time domain signal is composed of one or more signals (sine waves)
at different frequencies. The discrete Fourier transform decomposes a time domain signal into its component
@ -711,26 +230,21 @@ to determine if a signal has structure or if it is purely random.
The `fft` function performs the discrete Fourier Transform on a vector of *real* data. The result
of the `fft` function is returned as *complex* numbers. A complex number has two parts, *real* and *imaginary*.
The imaginary part of the complex number is ignored in the examples below, but there
are many tutorials on the FFT that include complex numbers available online.
But before diving into the examples it is important to understand how the `fft` function formats the
complex numbers in the result.
Taken together, the *real* and *imaginary* parts of the result encode the magnitude and the *phase*
of the signal at each frequency. The examples below deal only with the *real*
part of the result.
The `fft` function returns a `matrix` with two rows. The first row in the matrix is the *real*
part of the complex result. The second row in the matrix is the *imaginary* part of the complex result.
The `rowAt` function can be used to access the rows so they can be processed as vectors.
This approach was taken because all of the vector math functions operate on vectors of real numbers.
Rather than introducing a complex number abstraction into the expression language, the `fft` result is
represented as two vectors of real numbers.
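The two-row layout can be illustrated with a naive discrete Fourier transform in Python. This is a sketch under stated assumptions: `dft_rows` is a hypothetical helper that uses the direct O(n²) definition rather than a fast algorithm, and the input is a cosine chosen to fall exactly on one frequency bin. It is not the code behind `fft`.

```python
import cmath
import math


def dft_rows(values):
    """Naive DFT returned as two rows: [real parts, imaginary parts]."""
    n = len(values)
    real, imag = [], []
    for k in range(n):
        total = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(values))
        real.append(total.real)
        imag.append(total.imag)
    return [real, imag]


n = 64
cycles = 8   # the test signal completes 8 cycles over the 64 samples
signal = [math.cos(2 * math.pi * cycles * t / n) for t in range(n)]

rows = dft_rows(signal)
real = rows[0]   # plays the role of rowAt(c, 0)
peaks = [k for k, v in enumerate(real) if v > 1.0]

print(peaks)
```

For this input the real row has two mirrored peaks, at bins 8 and 64 - 8 = 56, and is close to 0 everywhere else, which is the same mirrored shape described for the examples that follow.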
=== Fast Fourier Transform Examples
In the first example the `fft` function is called on the sine wave used in the autocorrelation example.
The result of the `fft` function is a matrix. The `rowAt` function is used to return the first row of
the matrix which is a vector containing the real values of the fft response.
the matrix which is a vector containing the real values of the `fft` response.
The plot of the real values of the `fft` response is shown below. Notice there are two
peaks on opposite sides of the plot. The plot is actually showing a mirrored response. The right side
@ -741,14 +255,6 @@ Also notice that the `fft` has accumulated significant power in a single peak. T
the specific frequency of the sine wave. The vast majority of frequencies in the plot have close to 0 power
associated with them. This `fft` shows a clear signal with very low levels of noise.
[source,text]
----
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)),
c=fft(b),
d=rowAt(c, 0))
----
image::images/math-expressions/signal-fft.png[]
@ -758,17 +264,8 @@ autocorrelation example. The plot of the real values of the `fft` response is sh
Notice that in this response there is no clear peak. Instead all frequencies have accumulated a random level of
power. This `fft` shows no clear sign of signal and appears to be noise.
[source,text]
----
let(a=sample(uniformDistribution(-1.5, 1.5), 256),
b=fft(a),
c=rowAt(b, 0))
----
image::images/math-expressions/noise-fft.png[]
In the third example the `fft` function is called on the same signal hidden within noise that was used for
the autocorrelation example. The plot of the real values of the `fft` response is shown below.
@ -776,14 +273,5 @@ Notice that there are two clear mirrored peaks, at the same locations as the `ff
there is also now considerable noise on the frequencies. The `fft` has found the signal but also
shows that there is considerable noise along with the signal.
[source,text]
----
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)),
c=sample(uniformDistribution(-1.5, 1.5), 256),
d=ebeAdd(b, c),
e=fft(d),
f=rowAt(e, 0))
----
image::images/math-expressions/hidden-signal-fft.png[]
