Squashed commit of the following:

commit e5074c3223e394af17f686294a67a1dd3ecdd147
Author: Joel Bernstein <jbernste@apache.org>
Date:   Wed May 9 13:16:34 2018 -0400

    SOLR-12280: WIP

commit 69cdeccf161177d10f4d2407542392aaee3fcfe8
Author: Joel Bernstein <jbernste@apache.org>
Date:   Wed May 9 13:08:02 2018 -0400

    SOLR-12280: WIP

commit c94f0c87c3e57c023d622ad2411e522c4aac491c
Author: Joel Bernstein <jbernste@apache.org>
Date:   Wed May 9 11:54:58 2018 -0400

    SOLR-12280: WIP

commit 68dd1e73355cb84410f2d0ff3a51797ed6194a10
Author: Joel Bernstein <jbernste@apache.org>
Date:   Wed May 9 10:54:32 2018 -0400

    SOLR-12280: WIP

commit 04a010543418a469100fa299c606a7b1eed452e1
Author: Joel Bernstein <jbernste@apache.org>
Date:   Wed May 9 10:47:27 2018 -0400

    SOLR-12280: WIP

commit a6bbfbadaafe33fcdf93d5c72755e30dadadf017
Author: Joel Bernstein <jbernste@apache.org>
Date:   Wed May 9 09:40:08 2018 -0400

    SOLR-12280: WIP

commit 5d27961aa291bcd71527337632981bcdf62369b4
Author: Joel Bernstein <jbernste@apache.org>
Date:   Tue May 8 20:43:33 2018 -0400

    SOLR-12280: WIP

commit e982cf939f429c05b736f6292c68dd96d7ebc027
Author: Joel Bernstein <jbernste@apache.org>
Date:   Tue May 8 13:27:29 2018 -0400

    SOLR-12280: WIP

commit aae78ab6f387c28a080021bc919ef51864540be2
Author: Joel Bernstein <jbernste@apache.org>
Date:   Tue May 8 12:23:52 2018 -0400

    SOLR-12280: WIP

commit 0787ad76f0f4c62c860784b15490d8a988939997
Author: Joel Bernstein <jbernste@apache.org>
Date:   Tue May 8 12:20:38 2018 -0400

    SOLR-12280: WIP

commit 4df098376ba05188702cca8582959c3fe18066f5
Author: Joel Bernstein <jbernste@apache.org>
Date:   Tue May 8 12:12:11 2018 -0400

    SOLR-12280: WIP

commit 5c0be5136bbab7e0c33b3b8a7b0395b1b330e96d
Author: Joel Bernstein <jbernste@apache.org>
Date:   Tue May 8 12:04:57 2018 -0400

    SOLR-12280: WIP

commit 6c6feac4c2e5a49a5eab87a228713d1b93c8fc70
Author: Joel Bernstein <jbernste@apache.org>
Date:   Tue May 8 11:57:49 2018 -0400

    SOLR-12280: WIP

commit 7d46d11c9dd3a51b68600c2c889f586147545294
Author: Joel Bernstein <jbernste@apache.org>
Date:   Tue May 8 11:50:51 2018 -0400

    SOLR-12280: WIP

commit 8b6bf19d0091203ed63b39d070dd02a9bece6a61
Author: Joel Bernstein <jbernste@apache.org>
Date:   Mon May 7 10:53:14 2018 -0400

    SOLR-12280: WIP

commit 5466591999816eaacde6ce18d824d7688e5f4fe8
Author: Joel Bernstein <jbernste@apache.org>
Date:   Fri May 4 15:12:43 2018 -0400

    SOLR-12280: WIP

commit d7fff7d557a7fd26011c21445b7969b2cd81036f
Author: Joel Bernstein <jbernste@apache.org>
Date:   Fri Apr 27 12:50:27 2018 -0400

    SOLR-12280: Initial commit
This commit is contained in:
Joel Bernstein 2018-05-09 13:24:08 -04:00
parent 4177252a10
commit 144f00a1e3
11 changed files with 721 additions and 0 deletions

View File

@ -0,0 +1,719 @@
= Digital Signal Processing
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
This section of the user guide explores functions that are commonly used in the field of
Digital Signal Processing (DSP).
== Dot product
The `dotProduct` function is used to calculate the dot product of two arrays.
The dot product is a fundamental calculation for the DSP functions discussed in this section. Before diving into
the more advanced DSP functions, its useful to get a better understanding of how the dot product calculation works.
=== Combining two arrays
The `dotProduct` function can be used to combine two arrays into a single product. A simple example can help
illustrate this concept.
In the example below two arrays are set to variables *a* and *b* and then operated on by the `dotProduct` function.
The output of the `dotProduct` function is set to variable *c*.
Then the `mean` function is then used to compute the mean of the first array which is set to the variable `d`.
Both the *dot product* and the *mean* are included in the output.
When we look at the output of this expression we see that the *dot product* and the *mean* of the first array
are both 30.
The dot product function *calculated the mean* of the first array.
[source,text]
----
let(echo="c, d",
a=array(10, 20, 30, 40, 50),
b=array(.2, .2, .2, .2, .2),
c=dotProduct(a, b),
d=mean(a))
----
When this expression is sent to the /stream handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": 30,
"d": 30
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
To get a better understanding of how the dot product calculated the mean we can perform the steps of the
calculation using vector math and look at the output of each step.
In the example below the `ebeMultiply` function performs an element-by-element multiplication of
two arrays. This is the first step of the dot product calculation. The result of the element-by-element
multiplication is assigned to variable *c*.
In the next step the `add` function adds all the elements of the array in variable *c*.
Notice that multiplying each element of the first array by .2 and then adding the results is
equivalent to the formula for computing the mean of the first array. The formula for computing the mean
of an array is to add all the elements and divide by the number of elements.
The output includes the output of both the `ebeMultiply` function and the `add` function.
[source,text]
----
let(echo="c, d",
a=array(10, 20, 30, 40, 50),
b=array(.2, .2, .2, .2, .2),
c=ebeMultiply(a, b),
d=add(c))
----
When this expression is sent to the /stream handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": [
2,
4,
6,
8,
10
],
"d": 30
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
In the example above two arrays were combined in a way that produced the mean of the first. In the second array
each value was set to .2. Another way of looking at this is that each value in the second array has the same weight.
By varying the weights in the second array we can produce a different result. For example if the first array represents a time series,
the weights in the second array can be set to add more weight to a particular element in the first array.
The example below creates a weighted average with the weight decreasing from right to left. Notice that the weighted mean
of 36.666 is larger than the previous mean which was 30. This is because more weight was given to last element in the
array.
[source,text]
----
let(echo="c, d",
a=array(10, 20, 30, 40, 50),
b=array(.066666666666666,.133333333333333,.2, .266666666666666, .33333333333333),
c=ebeMultiply(a, b),
d=add(c))
----
When this expression is sent to the /stream handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": [
0.66666666666666,
2.66666666666666,
6,
10.66666666666664,
16.6666666666665
],
"d": 36.66666666666646
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
=== Representing Correlation
Often when we think of correlation, we are thinking of *Pearsons* correlation in the field of statistics. But the definition of
correlation is actually more general: a mutual relationship or connection between two or more things.
In the field of digital signal processing the dot product is used to represent correlation. The examples below demonstrates
how the dot product can be used to represent correlation.
In the example below the dot product is computed for two vectors. Notice that the vectors have different values that fluctuate
together. The output of the dot product is 190, which is hard to reason about because because its not scaled.
[source,text]
----
let(echo="c, d",
a=array(10, 20, 30, 20, 10),
b=array(1, 2, 3, 2, 1),
c=dotProduct(a, b))
----
When this expression is sent to the /stream handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": 190
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
One approach to scaling the dot product is to first scale the vectors so that both vectors have a magnitude of 1. Vectors with a
magnitude of 1, also called unit vectors, are used when comparing only the angle between vectors rather then the magnitude.
The `unitize` function can be used to unitize the vectors before calculating the dot product.
Notice in the example below the dot product result, set to variable *e*, is effectively 1. When applied to unit vectors the dot product
will be scaled between 1 and -1. Also notice in the example `cosineSimilarity` is calculated on the *unscaled* vectors and the
answer is also effectively 1. This is because *cosine similarity* is a scaled *dot product*.
[source,text]
----
let(echo="e, f",
a=array(10, 20, 30, 20, 10),
b=array(1, 2, 3, 2, 1),
c=unitize(a),
d=unitize(b),
e=dotProduct(c, d),
f=cosineSimilarity(a, b))
----
When this expression is sent to the /stream handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"e": 0.9999999999999998,
"f": 0.9999999999999999
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
If we transpose the first two numbers in the first array, so that the vectors
are not perfectly correlated, we see that the cosine similarity drops. This illustrates
how the dot product represents correlation.
[source,text]
----
let(echo="c, d",
a=array(20, 10, 30, 20, 10),
b=array(1, 2, 3, 2, 1),
c=cosineSimilarity(a, b))
----
When this expression is sent to the /stream handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": 0.9473684210526314
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
== Convolution
The `conv` function calculates the convolution of two vectors. The convolution is calculated by *reversing*
the second vector and sliding it across the first vector. The *dot product* of the two vectors
is calculated at each point as the second vector is slid across the first vector.
The dot products are collected in a *third vector* which is the *convolution* of the two vectors.
=== Moving Average
Before looking at an example of convolution its useful to review the `movingAvg` function. The moving average
function computes a moving average by sliding a window across a vector and computing
the average of the window at each shift. If that sounds similar to convolution, that's because the `movingAvg` function
is syntactic sugar for convolution.
Below is an example of a moving average with a window size of 5. Notice that original vector has 13 elements
but the result of the moving average has only 9 elements. This is because the `movingAvg` function
only begins generating results when it has a full window. In this case because the window size is 5 so the
moving average starts generating results from the 4th index of the original array.
[source,text]
----
let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
b=movingAvg(a, 5))
----
When this expression is sent to the /stream handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"b": [
3,
4,
5,
5.6,
5.8,
5.6,
5,
4,
3
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
=== Convolutional Smoothing
The moving average can also be computed using convolution. In the example
below the `conv` function is used to compute the moving average of the first array
by applying the second array as the filter.
Looking at the result, we see that it is not exactly the same as the result
of the `movingAvg` function. That is because the `conv` pads zeros
to the front and back of the first vector so that the window size is always full.
[source,text]
----
let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
b=array(.2, .2, .2, .2, .2),
c=conv(a, b))
----
When this expression is sent to the /stream handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": [
0.2,
0.6000000000000001,
1.2,
2.0000000000000004,
3.0000000000000004,
4,
5,
5.6000000000000005,
5.800000000000001,
5.6000000000000005,
5.000000000000001,
4,
3,
2,
1.2000000000000002,
0.6000000000000001,
0.2
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
We achieve the same result as the `movingAvg` gunction by using the `copyOfRange` function to copy a range of
the result that drops the first and last 4 values of
the convolution result. In the example below the `precision` function is also also used to remove floating point errors from the
convolution result. When this is added the output is exactly the same as the `movingAvg` function.
[source,text]
----
let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
b=array(.2, .2, .2, .2, .2),
c=conv(a, b),
d=copyOfRange(c, 4, 13),
e=precision(d, 2))
----
When this expression is sent to the /stream handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"e": [
3,
4,
5,
5.6,
5.8,
5.6,
5,
4,
3
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
== Cross-Correlation
Cross-correlation is used to determine the delay between two signals. This is accomplished by sliding one signal across another
and calculating the dot product at each shift. The dot products are collected into a vector which represents the correlation
at each shift. The highest dot product in the cross-correlation vector is the point where the two signals are most closely correlated.
The sliding dot product used in convolution can also be used to represent cross-correlation between two vectors. The only
difference in the formula when representing correlation is that the second vector is *not reversed*.
Notice in the example below that the second vector is reversed by the `rev` function before it is operated on by the `conv` function.
The `conv` function reverses the second vector so it will be flipped back to its original order to perform the correlation calculation
rather then the convolution calculation.
Notice in the result the highest value is 217. This is the point where the two vectors have the highest correlation.
[source,text]
----
let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
b=array(4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
c=conv(a, rev(b)))
----
When this expression is sent to the /stream handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": [
1,
4,
10,
20,
35,
56,
84,
116,
149,
180,
203,
216,
217,
204,
180,
148,
111,
78,
50,
28,
13,
4
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
== Find Delay
It is fairly simple to compute the delay from the cross-correlation result, but a convenience function called `finddelay` can
be used to find the delay directly. Under the covers `finddelay` uses convolutional math to compute the cross-correlation vector
and then computes the delay between the two signals.
Below is an example of the `finddelay` function. Notice that the `finddelay` function reports a 3 period delay between the first
and second signal.
[source,text]
----
let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
b=array(4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
c=finddelay(a, b))
----
When this expression is sent to the /stream handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"c": 3
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
== Autocorrelation
Autocorrelation measures the degree to which a signal is correlated with itself. Autocorrelation is used to determine
if a vector contains a signal or is purely random.
A few examples, with plots, will help to understand the concepts.
In the first example the `sin` function is wrapped around a `sequence` function to generate a sine wave. The result of this
is plotted in the image below. Notice that there is a structure to the plot that is clearly not random.
[source,text]
----
sin(sequence(256, 0, 6))
----
image::images/math-expressions/signal.png[]
In the next example the `sample` function is used to draw 256 samples from a `uniformDistribution` to create a
vector of random data. The result of this is plotted in the image below. Notice that there is no clear structure to the
data and the data appears to be random.
[source,text]
----
sample(uniformDistribution(-1.5, 1.5), 256)
----
image::images/math-expressions/noise.png[]
In the next example the random noise is added to the sine wave using the `ebeAdd` function.
The result of this is plotted in the image below. Notice that the sine wave has been hidden
somewhat within the noise. Its difficult to say for sure if there is structure. As plots
becomes more dense it can become harder to see a pattern hidden within noise.
[source,text]
----
let(a=sin(sequence(256, 0, 6)),
b=sample(uniformDistribution(-1.5, 1.5), 256),
c=ebeAdd(a, b))
----
image::images/math-expressions/hidden-signal.png[]
In the next examples autocorrelation is performed with each of the vectors shown above to see what the
autocorrelation plots look like.
In the example below the `conv` function is used to autocorrelate the first vector which is the sine wave.
Notice that the `conv` function is simply correlating the sine wave with itself.
The plot has a very distinct structure to it. As the sine wave is slid across a copy of itself the correlation
moves up and down in increasing intensity until it reaches a peak. This peak is directly in the center and is the
the point where the sine waves are directly lined up. Following the peak the correlation moves up and down in decreasing
intensity as the sine wave slides farther away from being directly lined up.
This is the autocorrelation plot of a pure signal.
[source,text]
----
let(a=sin(sequence(256, 0, 6)),
b=conv(a, rev(a)),
----
image::images/math-expressions/signal-autocorrelation.png[]
In the example below autocorrelation is performed with the vector of pure noise. Notice that the autocorrelation
plot has a very different plot then the sine wave. In this plot there is long period of low intensity correlation that appears
to be random. Then in the center a peak of high intensity correlation where the vectors are directly lined up.
This is followed by another long period of low intensity correlation.
This is the autocorrelation plot of pure noise.
[source,text]
----
let(a=sample(uniformDistribution(-1.5, 1.5), 256),
b=conv(a, rev(a)),
----
image::images/math-expressions/noise-autocorrelation.png[]
In the example below autocorrelation is performed on the vector with the sine wave hidden within the noise.
Notice that this plot shows very clear signs of structure which is similar to autocorrelation plot of the
pure signal. The correlation is less intense due to noise but the shape of the correlation plot suggests
strongly that there is an underlying signal hidden within the noise.
[source,text]
----
let(a=sin(sequence(256, 0, 6)),
b=sample(uniformDistribution(-1.5, 1.5), 256),
c=ebeAdd(a, b),
d=conv(c, rev(c))
----
image::images/math-expressions/hidden-signal-autocorrelation.png[]
== Discrete Fourier Transform
The convolution based functions described above are operating on signals in the time domain. In the time
domain the X axis is time and the Y axis is the quantity of some value at a specific point in time.
The discrete Fourier Transform translates a time domain signal into the frequency domain.
In the frequency domain the X axis is frequency, and Y axis is the accumulated power at a specific frequency.
The basic principle is that every time domain signal is composed of one or more signals (sine waves)
at different frequencies. The discrete Fourier transform decomposes a time domain signal into its component
frequencies and measures the power at each frequency.
The discrete Fourier transform has many important uses. In the example below, the discrete Fourier transform is used
to determine if a signal has structure or if it is purely random.
=== Complex Result
The `fft` function performs the discrete Fourier Transform on a vector of *real* data. The result
of the `fft` function is returned as *complex* numbers. A complex number has two parts, *real* and *imaginary*.
The imaginary part of the complex number is ignored in the examples below, but there
are many tutorials on the FFT and that include complex numbers available online.
But before diving into the examples it is important to understand how the `fft` function formats the
complex numbers in the result.
The `fft` function returns a `matrix` with two rows. The first row in the matrix is the *real*
part of the complex result. The second row in the matrix is the *imaginary* part of the complex result.
The `rowAt` function can be used to access the rows so they can be processed as vectors.
This approach was taken because all of the vector math functions operate on vectors of real numbers.
Rather then introducing a complex number abstraction into the expression language, the `fft` result is
represented as two vectors of real numbers.
=== Fast Fourier Transform Examples
In the first example the `fft` function is called on the sine wave used in the autocorrelation example.
The results of the `fft` function is a matrix. The `rowAt` function is used to return the first row of
the matrix which is a vector containing the real values of the fft response.
The plot of the real values of the `fft` response is shown below. Notice there are two
peaks on opposite sides of the plot. The plot is actually showing a mirrored response. The right side
of the plot is an exact mirror of the left side. This is expected when the `fft` is run on real rather then
complex data.
Also notice that the `fft` has accumulated significant power in a single peak. This is the power associated with
the specific frequency of the sine wave. The vast majority of frequencies in the plot have close to 0 power
associated with them. This `fft` shows a clear signal with very low levels of noise.
[source,text]
----
let(a=sin(sequence(256, 0, 6)),
b=fft(a),
c=rowAt(b, 0))
----
image::images/math-expressions/signal-fft.png[]
In the second example the `fft` function is called on a vector of random data similar to one used in the
autocorrelation example. The plot of the real values of the `fft` response is shown below.
Notice that in is this response there is no clear peak. Instead all frequencies have accumulated a random level of
power. This `fft` shows no clear sign of signal and appears to be noise.
[source,text]
----
let(a=sample(uniformDistribution(-1.5, 1.5), 256),
b=fft(a),
c=rowAt(b, 0))
----
image::images/math-expressions/noise-fft.png[]
In the third example the `fft` function is called on the same signal hidden within noise that was used for
the autocorrelation example. The plot of the real values of the `fft` response is shown below.
Notice that there are two clear mirrored peaks, at the same locations as the `fft` of the pure signal. But
there is also now considerable noise on the frequencies. The `fft` has found the signal and but also
shows that there is considerable noise along with the signal.
[source,text]
----
let(a=sin(sequence(256, 0, 6)),
b=sample(uniformDistribution(-1.5, 1.5), 256),
c=ebeAdd(a, b),
d=fft(c),
e=rowAt(d, 0))
----
image::images/math-expressions/hidden-signal-fft.png[]

Binary file not shown.

After

Width:  |  Height:  |  Size: 253 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 211 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 312 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 200 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 312 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 367 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 315 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 137 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 356 KiB

View File

@ -56,4 +56,6 @@ record in your Solr Cloud cluster computable.
== <<curve-fitting.adoc#curve-fitting,Curve Fitting>>
== <<dsp.adoc#digital-signal-processing, Digital Signal Processing>>
== <<machine-learning.adoc#machine-learning,Machine Learning>>