SOLR-12913: Streaming Expressions / Math Expressions docs for 7.6 release

Joel Bernstein 2018-10-25 09:16:54 -04:00
parent 26e14986af
commit 7952cec99a
10 changed files with 587 additions and 26 deletions


@ -0,0 +1,188 @@
= Computational Geometry
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
This section of the math expressions user guide covers computational geometry
functions.
== Convex Hull
A convex hull is the smallest convex set of points that encloses a data set. Math expressions support computing
the convex hull of a 2D data set. Once a convex hull has been calculated, a set of math expression functions
can be used to describe the convex hull geometrically.
The `convexHull` function finds the convex hull of an observation matrix of 2D vectors.
Each row of the matrix contains a 2D observation.
In the example below a convex hull is calculated for a randomly generated set of 100 2D observations.
Then the following functions are called on the convex hull result:
* `getBaryCenter`: Returns the 2D point that is the barycenter of the convex hull.
* `getArea`: Returns the area of the convex hull.
* `getBoundarySize`: Returns the boundary size of the convex hull.
* `getVertices`: Returns the 2D points that are the vertices of the convex hull.
[source,text]
----
let(echo="baryCenter, area, boundarySize, vertices",
x=sample(normalDistribution(0, 20), 100),
y=sample(normalDistribution(0, 10), 100),
observations=transpose(matrix(x,y)),
chull=convexHull(observations),
baryCenter=getBaryCenter(chull),
area=getArea(chull),
boundarySize=getBoundarySize(chull),
vertices=getVertices(chull))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"baryCenter": [
-3.0969292101230343,
1.2160948182691975
],
"area": 3477.480599967595,
"boundarySize": 267.52419019533664,
"vertices": [
[
-66.17632818958485,
-8.394931552315256
],
[
-47.556667594765216,
-16.940434013651263
],
[
-33.13582183446102,
-17.30914425443977
],
[
-9.97459859015698,
-17.795012801599654
],
[
27.7705917246824,
-14.487224686587767
],
[
54.689432954170236,
-1.3333371984299605
],
[
35.97568654458672,
23.054169251772556
],
[
-15.539456215337585,
19.811330468093704
],
[
-17.05125031092752,
19.53581741341663
],
[
-35.92010024412891,
15.126430698395572
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 3
}
]
}
}
----
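The quantities returned above can be illustrated outside of Solr. The following is a minimal Python sketch, not Solr's implementation: it uses Andrew's monotone chain algorithm for the hull, the shoelace formula for the area, and it assumes that the boundary size is the perimeter of the hull. All function names here are hypothetical.

```python
import math

def cross(o, a, b):
    # Z-component of the cross product of vectors o->a and o->b.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    # Andrew's monotone chain; returns vertices in counter-clockwise order.
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_area(hull):
    # Shoelace formula over consecutive vertex pairs.
    n = len(hull)
    twice = sum(hull[i][0] * hull[(i + 1) % n][1] - hull[(i + 1) % n][0] * hull[i][1]
                for i in range(n))
    return abs(twice) / 2.0

def hull_boundary_size(hull):
    # Assumption: "boundary size" means the perimeter of the hull.
    n = len(hull)
    return sum(math.hypot(hull[(i + 1) % n][0] - hull[i][0],
                          hull[(i + 1) % n][1] - hull[i][1]) for i in range(n))

observations = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
hull = convex_hull(observations)
print(hull, hull_area(hull), hull_boundary_size(hull))
```

The interior point `(1, 1)` is dropped from the hull, which is the behaviour the `getVertices` output above reflects for the random data set.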
== Enclosing Disk
The `enclosingDisk` function finds the smallest circle that encloses a 2D data set.
Once an enclosing disk has been calculated, a set of math expression functions
can be used to geometrically describe the enclosing disk.
In the example below an enclosing disk is calculated for a randomly generated set of 1000 2D observations.
Then the following functions are called on the enclosing disk result:
* `getCenter`: Returns the 2D point that is the center of the disk.
* `getRadius`: Returns the radius of the disk.
* `getSupportPoints`: Returns the support points of the disk.
[source,text]
----
let(echo="center, radius, support",
x=sample(normalDistribution(0, 20), 1000),
y=sample(normalDistribution(0, 20), 1000),
observations=transpose(matrix(x,y)),
disk=enclosingDisk(observations),
center=getCenter(disk),
radius=getRadius(disk),
support=getSupportPoints(disk))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"center": [
-6.668825009733749,
-2.9825450908240025
],
"radius": 72.66109546907208,
"support": [
[
20.350992271739464,
64.46791279377014
],
[
33.02079953093981,
57.880978456420365
],
[
-44.7273247899923,
-64.87911518353323
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 8
}
]
}
}
----
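One way to read this response is that the support points are the observations that lie on the boundary of the disk, so each should sit exactly one radius away from the center. The Python sketch below checks that property using the values copied from the response above; it is a sanity check, not a reimplementation of `enclosingDisk`.

```python
import math

# Values copied from the response above.
center = (-6.668825009733749, -2.9825450908240025)
radius = 72.66109546907208
support = [
    (20.350992271739464, 64.46791279377014),
    (33.02079953093981, 57.880978456420365),
    (-44.7273247899923, -64.87911518353323),
]

def distance(p, q):
    # Euclidean distance between two 2D points.
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Difference between each support point's distance to the center and the radius.
residuals = [abs(distance(center, p) - radius) for p in support]
print(residuals)
```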


@ -221,6 +221,106 @@ responds with:
}
----
== Harmonic Curve Fitting
The `harmonicFit` function (or `harmfit` for short) fits a smooth curve through the control points of a sine wave.
The `harmfit` function is passed x- and y-axes and fits a smooth curve to the data.
If only a single array is provided it is treated as the y-axis and a sequence is generated
for the x-axis.
The example below shows `harmfit` fitting a single oscillation of a sine wave. `harmfit`
returns the smoothed values at each control point. The return value is also a model which can be used by
the `predict`, `derivative` and `integrate` functions. There are also three helper functions that can be used to
retrieve the estimated parameters of the fitted model:
* `getAmplitude`: Returns the amplitude of the sine wave.
* `getAngularFrequency`: Returns the angular frequency of the sine wave.
* `getPhase`: Returns the phase of the sine wave.
*Note*: The `harmfit` function works best when run on a single oscillation rather than a long sequence of
oscillations. This is particularly true if the sine wave has noise. After the curve has been fit it can be
extrapolated to any point in time in the past or future.
In the example below the `harmfit` function fits control points, provided as x- and y-axes, and the
angular frequency, phase and amplitude are retrieved from the fitted model.
[source,text]
----
let(echo="freq, phase, amp",
x=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19),
y=array(-0.7441113653915925,-0.8997532112139415, -0.9853140681578838, -0.9941296760805463,
-0.9255133950087844, -0.7848096869247675, -0.5829778403072583, -0.33573836075915076,
-0.06234851460699166, 0.215897602691855, 0.47732764497752245, 0.701579055431586,
0.8711850882773975, 0.9729352782968976, 0.9989043923858761, 0.9470697190130273,
0.8214686154479715, 0.631884041542757, 0.39308257356494, 0.12366424851680227),
model=harmfit(x, y),
freq=getAngularFrequency(model),
phase=getPhase(model),
amp=getAmplitude(model))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"freq": 0.28,
"phase": 2.4100000000000006,
"amp": 0.9999999999999999
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
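The three parameters can be tied back to the control points. The Python sketch below assumes the fitted model has the form `amp * cos(freq * x + phase)`; this cosine form is an assumption, chosen because it reproduces the control points listed in the example, and it is not a statement about Solr's internals.

```python
import math

# Parameters copied from the response above.
freq = 0.28
phase = 2.4100000000000006
amp = 0.9999999999999999

def model(x):
    # Assumed form of the fitted curve: amp * cos(freq * x + phase).
    return amp * math.cos(freq * x + phase)

# Length of one full oscillation along the x-axis.
period = 2 * math.pi / freq

# First and last control points from the example, for comparison.
print(model(0), -0.7441113653915925)
print(model(19), 0.12366424851680227)
print(period)
```

Under this assumption the period is roughly 22.4 x-axis units, so the 20 control points cover slightly less than one oscillation, in line with the note above about fitting a single oscillation.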
=== Interpolation and Extrapolation
The `harmfit` function returns a fitted model of the sine wave that can be used by the `predict` function to
interpolate or extrapolate the sine wave.
The example below uses the fitted model to extrapolate the sine wave beyond the control points
to the x-axis points 20, 21, 22, 23.
[source,text]
----
let(x=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19),
y=array(-0.7441113653915925,-0.8997532112139415, -0.9853140681578838, -0.9941296760805463,
-0.9255133950087844, -0.7848096869247675, -0.5829778403072583, -0.33573836075915076,
-0.06234851460699166, 0.215897602691855, 0.47732764497752245, 0.701579055431586,
0.8711850882773975, 0.9729352782968976, 0.9989043923858761, 0.9470697190130273,
0.8214686154479715, 0.631884041542757, 0.39308257356494, 0.12366424851680227),
model=harmfit(x, y),
extrapolation=predict(model, array(20, 21, 22, 23)))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"extrapolation": [
-0.1553861764415666,
-0.42233370833176975,
-0.656386037906838,
-0.8393130343914845
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
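Extrapolation amounts to evaluating the fitted curve at x values outside the range of the control points. The Python sketch below assumes the curve has the form `amp * cos(freq * x + phase)`, using the amplitude, angular frequency and phase that the earlier `harmfit` example reports for the same data (rounded here to 1, 0.28 and 2.41). The helper name `predict_harmonic` is hypothetical.

```python
import math

def predict_harmonic(amp, freq, phase, xs):
    # Evaluate the assumed curve amp * cos(freq * x + phase) at each x.
    return [amp * math.cos(freq * x + phase) for x in xs]

# Extrapolated values copied from the response above.
expected = [
    -0.1553861764415666,
    -0.42233370833176975,
    -0.656386037906838,
    -0.8393130343914845,
]

extrapolation = predict_harmonic(1.0, 0.28, 2.41, [20, 21, 22, 23])
for ours, theirs in zip(extrapolation, expected):
    print(ours, theirs)
```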
== Gaussian Curve Fitting


@ -525,6 +525,46 @@ When this expression is sent to the `/stream` handler it responds with:
}
----
== Oscillate (Sine Wave)
The `oscillate` function generates a periodic oscillating signal based
on its parameters. The `oscillate` function can be used to study,
combine and model sine waves.
The `oscillate` function takes three parameters: amplitude, angular frequency
and phase, and returns a vector containing the y-axis points of a sine wave.
The y-axis points are generated from the sequence 0-127.
Below is an example of the `oscillate` function called with an amplitude of
1, an angular frequency of 0.28 and a phase of 1.57.
[source,text]
----
oscillate(1, 0.28, 1.57)
----
The result of the `oscillate` function is plotted below:
image::images/math-expressions/sinewave.png[]
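As a rough illustration of what such a vector contains, the Python sketch below generates 128 points from the x-axis sequence 0-127. It assumes the signal has the form `amp * cos(freq * x + phase)`; that parametrization is an assumption made for this sketch, and the Python function `oscillate` here is a hypothetical stand-in, not Solr's code.

```python
import math

def oscillate(amp, freq, phase, n=128):
    # Assumed form: amp * cos(freq * x + phase) for x = 0 .. n-1.
    return [amp * math.cos(freq * x + phase) for x in range(n)]

wave = oscillate(1, 0.28, 1.57)

# With a phase close to pi/2 the curve starts near zero.
print(len(wave), wave[0])
```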
=== Sine Wave Interpolation, Extrapolation
The `oscillate` function returns a function which can be used by the `predict` function to interpolate or extrapolate a sine wave.
The example below extrapolates the sine wave over a sequence of 256 points (0-255).
[source,text]
----
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)))
----
The extrapolated sine wave is plotted below:
image::images/math-expressions/sinewave256.png[]
== Autocorrelation
Autocorrelation measures the degree to which a signal is correlated with itself. Autocorrelation is used to determine
@ -532,15 +572,16 @@ if a vector contains a signal or is purely random.
A few examples, with plots, will help to understand the concepts.
In the first example the `sin` function is wrapped around a `sequence` function to generate a sine wave. The result of this
The first example simply revisits the example above of an extrapolated sine wave. The result of this
is plotted in the image below. Notice that there is a structure to the plot that is clearly not random.
[source,text]
----
sin(sequence(256, 0, 6))
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)))
----
image::images/math-expressions/signal.png[]
image::images/math-expressions/sinewave256.png[]
In the next example the `sample` function is used to draw 256 samples from a `uniformDistribution` to create a
@ -562,9 +603,10 @@ becomes more dense it can become harder to see a pattern hidden within noise.
[source,text]
----
let(a=sin(sequence(256, 0, 6)),
b=sample(uniformDistribution(-1.5, 1.5), 256),
c=ebeAdd(a, b))
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)),
c=sample(uniformDistribution(-1.5, 1.5), 256),
d=ebeAdd(b,c))
----
image::images/math-expressions/hidden-signal.png[]
@ -585,8 +627,9 @@ This is the autocorrelation plot of a pure signal.
[source,text]
----
let(a=sin(sequence(256, 0, 6)),
b=conv(a, rev(a)),
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)),
c=conv(b, rev(b)))
----
image::images/math-expressions/signal-autocorrelation.png[]
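The `conv(b, rev(b))` pattern convolves a vector with its own reverse. A minimal Python sketch of that idea follows; it assumes a full (untruncated) convolution, and the helper names are hypothetical.

```python
def conv(a, b):
    # Full discrete convolution of two sequences.
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def autocorrelation(signal):
    # Convolve the signal with its reverse.
    return conv(signal, signal[::-1])

result = autocorrelation([1.0, 2.0, 3.0])
print(result)
```

The result is symmetric and peaks in the middle, where the signal is fully aligned with itself; the middle value is the sum of the squared samples.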
@ -615,10 +658,11 @@ strongly that there is an underlying signal hidden within the noise.
[source,text]
----
let(a=sin(sequence(256, 0, 6)),
b=sample(uniformDistribution(-1.5, 1.5), 256),
c=ebeAdd(a, b),
d=conv(c, rev(c))
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)),
c=sample(uniformDistribution(-1.5, 1.5), 256),
d=ebeAdd(b, c),
e=conv(d, rev(d)))
----
image::images/math-expressions/hidden-signal-autocorrelation.png[]
@ -675,9 +719,10 @@ associated with them. This `fft` shows a clear signal with very low levels of no
[source,text]
----
let(a=sin(sequence(256, 0, 6)),
b=fft(a),
c=rowAt(b, 0))
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)),
c=fft(b),
d=rowAt(c, 0))
----
@ -709,11 +754,12 @@ shows that there is considerable noise along with the signal.
[source,text]
----
let(a=sin(sequence(256, 0, 6)),
b=sample(uniformDistribution(-1.5, 1.5), 256),
c=ebeAdd(a, b),
d=fft(c),
e=rowAt(d, 0))
let(a=oscillate(1, 0.28, 1.57),
b=predict(a, sequence(256, 0, 1)),
c=sample(uniformDistribution(-1.5, 1.5), 256),
d=ebeAdd(b, c),
e=fft(d),
f=rowAt(e, 0))
----
image::images/math-expressions/hidden-signal-fft.png[]

Binary file not shown (size: 267 KiB).

Binary file not shown (size: 361 KiB).


@ -1,5 +1,5 @@
= Math Expressions
:page-children: scalar-math, vector-math, variables, matrix-math, vectorization, term-vectors, statistics, probability-distributions, simulations, time-series, regression, numerical-analysis, curve-fitting, dsp, machine-learning
:page-children: scalar-math, vector-math, variables, matrix-math, vectorization, term-vectors, statistics, probability-distributions, simulations, time-series, regression, numerical-analysis, curve-fitting, dsp, machine-learning, computational-geometry
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
@ -44,18 +44,21 @@ record in your Solr Cloud cluster computable.
*<<statistics.adoc#statistics,Statistics>>*: Statistical functions in math expressions.
*<<probability-distributions.adoc#probability-distributions,Probability>>*: Mathematical models for probability.
*<<probability-distributions.adoc#probability-distributions,Probability>>*: Mathematical models of probability.
*<<simulations.adoc#simulations,Monte Carlo Simulations>>*: Performing correlated and uncorrelated Monte Carlo simulations.
*<<simulations.adoc#simulations,Monte Carlo Simulations>>*: Performing uncorrelated and correlated Monte Carlo simulations.
*<<regression.adoc#regression,Linear Regression>>*: Simple and multivariate linear regression.
*<<numerical-analysis.adoc#numerical-analysis,Interpolation, Derivatives and Integrals>>*: Numerical analysis math expressions.
*<<curve-fitting.adoc#curve-fitting,Curve Fitting>>*: Polynomial and Gaussian curve fitting.
*<<dsp.adoc#dsp,Digital Signal Processing>>*: Functions commonly used with digital signal processing.
*<<time-series.adoc#time-series,Time Series>>*: Aggregation, smoothing, differencing, modeling and anomaly detection for time series.
*<<curve-fitting.adoc#curve-fitting,Curve Fitting>>*: Polynomial, Harmonic and Gaussian curve fitting.
*<<time-series.adoc#time-series,Time Series>>*: Aggregation, smoothing and differencing of time series.
*<<machine-learning.adoc#machine-learning,Machine Learning>>*: Functions used in machine learning.
*<<computational-geometry.adoc#computational-geometry,Computational Geometry>>*: Convex Hulls and Enclosing Disks.


@ -103,6 +103,59 @@ responds with:
}
----
== Pair sorting vectors
The `pairSort` function can be used to sort two vectors based on the values in
the first vector. The sorting operation maintains the pairing between
the two vectors during the sort.
The `pairSort` function returns a matrix containing the
pair sorted vectors. The first row in the matrix is the first vector,
the second row in the matrix is the second vector.
The individual vectors can then be accessed using the `rowAt` function.
The example below performs a pair sort on two vectors and returns the
matrix containing the sorted vectors.
[source,text]
----
let(a=array(10, 2, 1),
b=array(100, 200, 300),
c=pairSort(a, b))
----
When this expression is sent to the `/stream` handler it
responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": [
[
1,
2,
10
],
[
300,
200,
100
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 1
}
]
}
}
----
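The pairing behaviour can be sketched in Python: the two vectors are zipped into pairs, the pairs are sorted on the value from the first vector, and the result is split back into two rows. The function name `pair_sort` is hypothetical and this is only an illustration of the semantics described above.

```python
def pair_sort(a, b):
    # Sort (a[i], b[i]) pairs by the value taken from the first vector.
    pairs = sorted(zip(a, b), key=lambda pair: pair[0])
    # Row 0 holds the sorted first vector, row 1 the re-ordered second vector.
    return [[p[0] for p in pairs], [p[1] for p in pairs]]

c = pair_sort([10, 2, 1], [100, 200, 300])
print(c)
```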
== Row and Column Labels
A matrix can have column and row labels. The functions


@ -537,6 +537,8 @@ array.
* `log`: Returns a numeric array with the natural log of each element of the original array.
* `log10`: Returns a numeric array with the base 10 log of each element of the original array.
* `sqrt`: Returns a numeric array with the square root of each element of the original array.
* `cbrt`: Returns a numeric array with the cube root of each element of the original array.
@ -574,6 +576,91 @@ When this expression is sent to the `/stream` handler it responds with:
}
----
== Back Transformations
Vectors that have been transformed with the `log`, `log10`, `sqrt` and `cbrt` functions
can be back transformed using the `pow` function.
The example below shows how to back transform data that has been transformed by the
`sqrt` function.
[source,text]
----
let(echo="b,c",
a=array(100, 200, 300),
b=sqrt(a),
c=pow(b, 2))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"b": [
10,
14.142135623730951,
17.320508075688775
],
"c": [
100,
200.00000000000003,
300.00000000000006
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
The example below shows how to back transform data that has been transformed by the
`log10` function.
[source,text]
----
let(echo="b,c",
a=array(100, 200, 300),
b=log10(a),
c=pow(10, b))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"b": [
2,
2.3010299956639813,
2.4771212547196626
],
"c": [
100,
200.00000000000003,
300.0000000000001
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
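The same idea applies to all four transformations: each is undone by raising to the inverse power, or by raising the base of the logarithm to the transformed value. The Python sketch below shows the arithmetic only; it does not use Solr syntax, and the small floating point errors seen in the responses above are expected here as well.

```python
import math

a = [100.0, 200.0, 300.0]

# sqrt is undone by squaring, cbrt by cubing.
sqrt_back = [math.sqrt(v) ** 2 for v in a]
cbrt_back = [(v ** (1.0 / 3.0)) ** 3 for v in a]

# log10 is undone by 10 ** x, the natural log by e ** x.
log10_back = [10 ** math.log10(v) for v in a]
log_back = [math.e ** math.log(v) for v in a]

def round_trips(original, restored):
    # True when every restored value matches the original within rounding error.
    return all(math.isclose(o, r, rel_tol=1e-9) for o, r in zip(original, restored))

print(round_trips(a, sqrt_back), round_trips(a, cbrt_back),
      round_trips(a, log10_back), round_trips(a, log_back))
```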
== Z-scores
The `zscores` function converts a numeric array to an array of z-scores. The z-score


@ -735,6 +735,29 @@ leftOuterJoin(
)
----
[#list_expression]
== list
The `list` function wraps N Stream Expressions and opens and iterates each stream sequentially.
This has the effect of concatenating the results of multiple Streaming Expressions.
=== list Parameters
* StreamExpressions ...: N Streaming Expressions
=== list Syntax
[source,text]
----
list(tuple(a="hello world"), tuple(a="HELLO WORLD"))
list(search(collection1, q="*:*", fl="id, prod_ss", sort="id asc"),
search(collection2, q="*:*", fl="id, prod_ss", sort="id asc"))
list(tuple(a=search(collection1, q="*:*", fl="id, prod_ss", sort="id asc")),
tuple(a=search(collection2, q="*:*", fl="id, prod_ss", sort="id asc")))
----
== hashJoin
The `hashJoin` function wraps two streams, Left and Right, and for every tuple in Left which exists in Right will emit a tuple containing the fields of both tuples. This supports one-to-one, one-to-many, many-to-one, and many-to-many inner join scenarios. The tuples are emitted in the order in which they appear in the Left stream. The order of the streams does not matter. If both tuples contain a field of the same name then the value from the Right stream will be used in the emitted tuple.
@ -1028,6 +1051,31 @@ The following is a `solrconfig.xml` snippet for 2 workers and "year_i" as the `p
----
====
== plist
The `plist` function wraps N Stream Expressions and opens the streams in parallel
and iterates each stream sequentially. The difference between `list` and `plist` is that
`plist` opens the streams in parallel. Since many streams such as
`facet`, `stats` and `significantTerms` push down heavy operations to Solr when they are opened,
the `plist` function can dramatically improve performance by doing these operations in parallel.
=== plist Parameters
* StreamExpressions ...: N Streaming Expressions
=== plist Syntax
[source,text]
----
plist(tuple(a="hello world"), tuple(a="HELLO WORLD"))
plist(search(collection1, q="*:*", fl="id, prod_ss", sort="id asc"),
search(collection2, q="*:*", fl="id, prod_ss", sort="id asc"))
plist(tuple(a=search(collection1, q="*:*", fl="id, prod_ss", sort="id asc")),
tuple(a=search(collection2, q="*:*", fl="id, prod_ss", sort="id asc")))
----
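The difference between the two functions can be sketched as an analogy in Python. This is not how Solr implements them: `open_stream`, `list_like` and `plist_like` are hypothetical names, and a thread pool stands in for the parallel open step. Both variants emit the same tuples in the same order; only the timing of the open calls differs.

```python
from concurrent.futures import ThreadPoolExecutor

def open_stream(source):
    # Hypothetical stand-in for opening a stream (the potentially expensive step).
    name, rows = source
    return iter([(name, row) for row in rows])

def list_like(sources):
    # Open and iterate each stream one after another.
    for source in sources:
        yield from open_stream(source)

def plist_like(sources):
    # Open all streams in parallel, then iterate them sequentially.
    with ThreadPoolExecutor() as pool:
        opened = list(pool.map(open_stream, sources))
    for stream in opened:
        yield from stream

sources = [("collection1", [1, 2]), ("collection2", [3, 4])]
sequential = list(list_like(sources))
parallel = list(plist_like(sources))
print(sequential == parallel)
```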
== priority
The `priority` function is a simple priority scheduler for the <<executor>> function. The `executor` function doesn't directly have a concept of task prioritization; instead it simply executes tasks in the order that they are read from its underlying stream. The `priority` function provides the ability to schedule a higher priority task ahead of lower priority tasks that were submitted earlier.


@ -143,6 +143,42 @@ When this expression is sent to the `/stream` handler it responds with:
}
----
== Vector Sorting
An array can be sorted in natural ascending order with the `asc` function.
[source,text]
----
asc(array(10,1,2,3,4,5,6))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"return-value": [
1,
2,
3,
4,
5,
6,
10
]
},
{
"EOF": true,
"RESPONSE_TIME": 1
}
]
}
}
----
== Vector Summarizations and Norms
There are a set of functions that perform summarizations and return norms of arrays. These functions