diff --git a/solr/solr-ref-guide/src/computational-geometry.adoc b/solr/solr-ref-guide/src/computational-geometry.adoc
index e44c08e2fa6..50d7ac63a71 100644
--- a/solr/solr-ref-guide/src/computational-geometry.adoc
+++ b/solr/solr-ref-guide/src/computational-geometry.adoc
@@ -23,103 +23,82 @@ This section of the math expressions user guide covers computational geometry fu
A convex hull is the smallest convex set of points that encloses a data set.
Math expressions has support for computing the convex hull of a 2D data set.
Once a convex hull has been calculated, a set of math expression functions
-can be applied to geometrically describe the convex hull.
+can be applied to geometrically describe and visualize the convex hull.

-The `convexHull` function finds the convex hull of an observation matrix of 2D vectors.
-Each row of the matrix is a 2D observation.
+=== Visualization

-In the example below a convex hull is calculated for a randomly generated set of 100 2D observations.
+The `convexHull` function can be used to visualize a border around a
+set of 2D points. Border visualizations can be useful for understanding where data points are
+in relation to the border.

-Then the following functions are called on the convex hull:
+In the examples below the `convexHull` function is used
+to visualize a border for a set of latitude and longitude points of rat sightings in the NYC311
+complaints database. An investigation of the border around the rat sightings can be done
+to better understand how rats may be entering or exiting the region.

--`getBaryCenter`: Returns the 2D point that is the bary center of the convex hull.
+==== Scatter Plot

--`getArea`: Returns the area of the convex hull.
+Before visualizing the convex hull it's often useful to visualize the 2D points as a scatter plot.

--`getBoundarySize`: Returns the boundary size of the convex hull.
+In this example the `random` function draws a sample of records from the NYC311 collection (the complaints database) where
+the complaint description matches "rat sighting" and the zip code is 11238. The latitude and longitude fields
+are then vectorized and plotted as a scatter plot with longitude on the x-axis and latitude on the
+y-axis.

--`getVertices`: Returns a set of 2D points that are the vertices of the convex hull.
+image::images/math-expressions/convex0.png[]

+Notice from the scatter plot that many of the points appear to lie near the border of the plot.

-[source,text]
----
-let(echo="baryCenter, area, boundarySize, vertices",
-    x=sample(normalDistribution(0, 20), 100),
-    y=sample(normalDistribution(0, 10), 100),
-    observations=transpose(matrix(x,y)),
-    chull=convexHull(observations),
-    baryCenter=getBaryCenter(chull),
-    area=getArea(chull),
-    boundarySize=getBoundarySize(chull),
-    vertices=getVertices(chull))
----
+==== Convex Hull Plot

-When this expression is sent to the `/stream` handler it responds with:
+The `convexHull` function can be used to visualize the border. The example uses the same points
+drawn from the NYC311 database. But instead of plotting the points directly, the latitude and
+longitude points are added as rows to a matrix. The matrix is then transposed with the `transpose`
+function so that each row of the matrix contains a single latitude and longitude point.
+The `convexHull` function is then used to calculate the convex hull for the matrix of points.
+The convex hull is set to a variable called `hull`.
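+A sketch of the kind of expression used to compute the hull is shown below. The collection
+name and the field names (`nyc311`, `complaint_type_s`, `zip_s`, `latitude_d`, `longitude_d`)
+are illustrative and will vary with the schema of the data:

+[source,text]
----
+let(a=random(nyc311, q="complaint_type_s:rat AND zip_s:11238", rows="5000", fl="latitude_d, longitude_d"),
+    lat=col(a, latitude_d),
+    lon=col(a, longitude_d),
+    m=transpose(matrix(lat, lon)),
+    hull=convexHull(m),
+    vertices=getVertices(hull))
----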
-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "baryCenter": [
-          -3.0969292101230343,
-          1.2160948182691975
-        ],
-        "area": 3477.480599967595,
-        "boundarySize": 267.52419019533664,
-        "vertices": [
-          [
-            -66.17632818958485,
-            -8.394931552315256
-          ],
-          [
-            -47.556667594765216,
-            -16.940434013651263
-          ],
-          [
-            -33.13582183446102,
-            -17.30914425443977
-          ],
-          [
-            -9.97459859015698,
-            -17.795012801599654
-          ],
-          [
-            27.7705917246824,
-            -14.487224686587767
-          ],
-          [
-            54.689432954170236,
-            -1.3333371984299605
-          ],
-          [
-            35.97568654458672,
-            23.054169251772556
-          ],
-          [
-            -15.539456215337585,
-            19.811330468093704
-          ],
-          [
-            -17.05125031092752,
-            19.53581741341663
-          ],
-          [
-            -35.92010024412891,
-            15.126430698395572
-          ]
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 3
-      }
-    ]
-  }
-}
----
+Once the convex hull has been created, the `getVertices` function can be used to
+retrieve the matrix of points that comprise the convex border around the scatter plot.
+The `colAt` function can then be used to retrieve the latitude and longitude vectors from the matrix
+so they can be visualized by the `zplot` function. In the example below the convex hull points are
+visualized as a scatter plot.
+
+image::images/math-expressions/hullplot.png[]
+
+Notice that the 15 points in the scatter plot describe the latitude and longitude points of the
+convex hull.
+
+==== Projecting and Clustering
+
+Once a convex hull has been calculated, the `projectToBorder` function can then be used to project
+points to the nearest point on the border. In the example below the `projectToBorder` function
+is used to project the original scatter plot points to the nearest point on the border.
+
+The `projectToBorder` function returns a matrix of lat/lon points for the border projections. In
+the example the matrix of border points is then clustered into 7 clusters using kmeans clustering.
+The `zplot` function is then used to plot the clustered border points.
+
+image::images/math-expressions/convex1.png[]
+
+Notice in the visualization it's easy to see which spots along the border have the highest
+density of points. In the case of the rat sightings this information is useful in understanding
+which border points are the most likely entry and exit points for the rats.
+
+==== Plotting the Centroids
+
+Once the border points have been clustered it's straightforward to extract the centroids of the clusters
+and plot them on a map. The example below extracts the centroids from the clusters using the
+`getCentroids` function. `getCentroids` returns the matrix of lat/lon points which represent
+the centroids of the border clusters. The `colAt` function can then be used to extract the lat/lon
+vectors so they can be plotted on a map using `zplot`.
+
+image::images/math-expressions/convex2.png[]
+
+The map above shows the centroids of the border clusters. The centroids from the highest
+density clusters can now be zoomed and investigated geo-spatially to determine what might be
+the best places to begin an investigation of the border.

== Enclosing Disk

@@ -131,11 +110,11 @@ In the example below an enclosing disk is calculated for a randomly generated se

Then the following functions are called on the enclosing disk:

--`getCenter`: Returns the 2D point that is the center of the disk.
+* `getCenter`: Returns the 2D point that is the center of the disk.

--`getRadius`: Returns the radius of the disk.
+* `getRadius`: Returns the radius of the disk.

--`getSupportPoints`: Returns the support points of the disk.
+* `getSupportPoints`: Returns the support points of the disk.

[source,text]
----
diff --git a/solr/solr-ref-guide/src/curve-fitting.adoc b/solr/solr-ref-guide/src/curve-fitting.adoc
index d0b84522db5..966e8882b93 100644
--- a/solr/solr-ref-guide/src/curve-fitting.adoc
+++ b/solr/solr-ref-guide/src/curve-fitting.adoc
@@ -16,7 +16,7 @@
// specific language governing permissions and limitations
// under the License.

-These functions support constructing a curve.
+These functions support constructing a curve through bivariate non-linear data.

== Polynomial Curve Fitting

@@ -25,201 +25,86 @@ the non-linear relationship between two random variables.

The `polyfit` function is passed x- and y-axes and fits a smooth curve to the data.
If only a single array is provided it is treated as the y-axis and a sequence is generated
-for the x-axis.
-
-The `polyfit` function also has a parameter the specifies the degree of the polynomial. The higher
+for the x-axis. A third parameter can be added that specifies the degree of the polynomial. If the degree is
+not provided, a 3 degree polynomial is used by default. The higher
the degree the more curves that can be modeled.

-The example below uses the `polyfit` function to fit a curve to an array using
-a 3 degree polynomial. The fitted curve is then subtracted from the original curve. The output
-shows the error between the fitted curve and the original curve, known as the residuals.
-The output also includes the sum-of-squares of the residuals which provides a measure
-of how large the error is.
+The `polyfit` output can be visualized with Zeppelin-Solr in a similar manner to linear
+regression.

-[source,text]
----
-let(echo="residuals, sumSqError",
-    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 6, 5, 5, 3, 2, 1, 0),
-    curve=polyfit(y, 3),
-    residuals=ebeSubtract(y, curve),
-    sumSqError=sumSq(residuals))
----
+The example below uses the `polyfit` function to fit a non-linear curve to a scatter
+plot of a random sample. The blue points are the scatter plot of the original observations and the red points
+are the predicted curve.

-When this expression is sent to the `/stream` handler it
-responds with:
+image::images/math-expressions/polyfit.png[]

-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "residuals": [
-          0.5886274509803899,
-          -0.0746078431372561,
-          -0.49492135315664765,
-          -0.6689571213100631,
-          -0.5933591898297781,
-          0.4352283990519288,
-          0.32016160310277897,
-          1.1647963800904968,
-          0.272488687782805,
-          -0.3534055160525744,
-          0.2904697263520779,
-          -0.7925296272355089,
-          -0.5990476190476182,
-          -0.12572829131652274,
-          0.6307843137254909
-        ],
-        "sumSqError": 4.7294282482223595
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----
+In the example above a random sample containing two fields, `filesize_d`
+and `response_d`, is drawn from the `logs` collection.
+The two fields are vectorized and set to the variables `x` and `y`.

-In the next example the curve is fit using a 5 degree polynomial. Notice that the curve
-is fit closer, shown by the smaller residuals and lower value for the sum-of-squares of the
-residuals. This is because the higher polynomial produced a closer fit.
+Then the `polyfit` function is used to fit a non-linear model to the data using a 5 degree
+polynomial. The `polyfit` function returns a model that is then directly plotted
+by `zplot` along with the original observations.
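+A sketch of the kind of expression behind this visualization is shown below; the `rows`
+parameter and the `curve` series name passed to `zplot` are illustrative:

+[source,text]
----
+let(a=random(logs, q="*:*", rows="500", fl="filesize_d, response_d"),
+    x=col(a, filesize_d),
+    y=col(a, response_d),
+    curve=polyfit(x, y, 5),
+    zplot(x=x, y=y, curve=curve))
----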
-[source,text]
----
-let(echo="residuals, sumSqError",
-    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 6, 5, 5, 3, 2, 1, 0),
-    curve=polyfit(y, 5),
-    residuals=ebeSubtract(y, curve),
-    sumSqError=sumSq(residuals))
----
+The fitted model can also be used
+by the `predict` function in the same manner as linear regression. The example below
+uses the fitted model to predict a response time for a file size of 42000.

-When this expression is sent to the `/stream` handler it
-responds with:
+image::images/math-expressions/polyfit-predict.png[]

-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "residuals": [
-          -0.12337461300309674,
-          0.22708978328173413,
-          0.12266015718028167,
-          -0.16502738747320755,
-          -0.41142804563857105,
-          0.2603044014808713,
-          -0.12128970101106162,
-          0.6234168308471704,
-          -0.1754692675745293,
-          -0.5379689969473249,
-          0.4651616185671843,
-          -0.288175756132409,
-          0.027970945463215102,
-          0.18699690402476687,
-          -0.09086687306501587
-        ],
-        "sumSqError": 1.413089480179252
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----
+If an array of predictor values is provided, an array of predictions will be returned.
+
+The `polyfit` model performs both *interpolation* and *extrapolation*,
+which means that it can predict results both within the bounds of the data set
+and beyond the bounds.
+
+=== Residuals
+
+The residuals can be calculated and visualized in the same manner as linear
+regression. In the example below the `ebeSubtract` function is used
+to subtract the fitted model from the observed values to
+calculate a vector of residuals. The residuals are then plotted in a *residual plot*
+with the predictions along the x-axis and the model error on the y-axis.
+
+image::images/math-expressions/polyfit-resid.png[]

-=== Prediction, Derivatives and Integrals
+== Gaussian Curve Fitting

-The `polyfit` function returns a function that can be used with the `predict`
-function.
+The `gaussfit` function fits a smooth curve through a Gaussian peak. The `gaussfit`
+function takes an x- and y-axis and fits a smooth Gaussian curve to the data. If
+only one vector of numbers is passed, `gaussfit` will treat it as the y-axis
+and will generate a sequence for the x-axis.

-In the example below the x-axis is included for clarity.
-The `polyfit` function returns a function for the fitted curve.
-The `predict` function is then used to predict a value along the curve, in this
-case the prediction is made for the *`x`* value of 5.
+One of the interesting use cases for `gaussfit` is to visualize how well a regression
+model's residuals fit a normal distribution.

-[source,text]
----
-let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
-    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 6, 5, 5, 3, 2, 1, 0),
-    curve=polyfit(x, y, 5),
-    p=predict(curve, 5))
----
+One of the characteristics of a well-fit regression model is that its residuals will ideally fit a normal distribution.
+We can
+test this by building a histogram of the residuals and then fitting a Gaussian curve to the shape of the histogram.

-When this expression is sent to the `/stream` handler it
-responds with:
+In the example below the residuals from a `polyfit` regression are modeled with the
+`hist` function to return a histogram with 32 bins. The `hist` function returns
+a list of tuples with statistics about each bin. In the example the `col` function is
+used to return a vector with the `N` column for each bin, which is the count of
+observations in the bin. If the residuals are normally distributed, we would expect the bin counts
+to roughly follow a Gaussian curve.

-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "p": 5.439695598519129
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----
+The bin count vector is then passed to `gaussfit` as the y-axis. `gaussfit` generates
+a sequence for the x-axis and then fits the Gaussian curve to the data.

-The `derivative` and `integrate` functions can be used to compute the derivative
-and integrals for the fitted
-curve. The example below demonstrates how to compute a derivative
-for the fitted curve.
+`zplot` is then used to plot the original bin counts and the fitted curve. In the
+example below, the blue line is the bin counts, and the smooth yellow line is the
+fitted curve. We can see that the binned residuals fit a normal distribution fairly
+well.
+
+image::images/math-expressions/gaussfit.png[]

-[source,text]
----
-let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
-    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 6, 5, 5, 3, 2, 1, 0),
-    curve=polyfit(x, y, 5),
-    d=derivative(curve))
----
+The second plot shows the two curves overlaid with an area chart:

-When this expression is sent to the `/stream` handler it
-responds with:
+image::images/math-expressions/gaussfit2.png[]

-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "d": [
-          0.3198918573686361,
-          0.9261492094077225,
-          1.2374272373653175,
-          1.30051359631081,
-          1.1628032287629813,
-          0.8722983646900058,
-          0.47760852150945,
-          0.02795050408827482,
-          -0.42685159525716865,
-          -0.8363663967611356,
-          -1.1495552332084857,
-          -1.3147721499346892,
-          -1.2797639048258267,
-          -0.9916699683185771,
-          -0.3970225234002308
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----

== Harmonic Curve Fitting

@@ -232,169 +117,19 @@ The example below shows `harmfit` fitting a single oscillation of a sine wave. T
returns the smoothed values at each control point. The return value is also a model which can be used by
the `predict`, `derivative` and `integrate` functions.

-There are also three helper functions that can be used to retrieve the estimated parameters of the fitted model:
-
-* `getAmplitude`: Returns the amplitude of the sine wave.
-* `getAngularFrequency`: Returns the angular frequency of the sine wave.
-* `getPhase`: Returns the phase of the sine wave.
-
NOTE: The `harmfit` function works best when run on a single oscillation rather than a long sequence of oscillations.
This is particularly true if the sine wave has noise. After the curve has been fit it can be extrapolated to
any point in time in the past or future.

-In the example below the `harmfit` function fits control points, provided as x and y axes, and then the
-angular frequency, phase and amplitude are retrieved from the fitted model.
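+A sketch of the kind of `harmfit` expression behind the visualizations that follow is shown
+below; the control point values are rounded for brevity and the series names passed to `zplot`
+are illustrative:

+[source,text]
----
+let(x=sequence(20, 0, 1),
+    y=array(-0.744, -0.9, -0.985, -0.994, -0.926, -0.785, -0.583, -0.336, -0.062, 0.216,
+            0.477, 0.702, 0.871, 0.973, 0.999, 0.947, 0.821, 0.632, 0.393, 0.124),
+    model=harmfit(x, y),
+    zplot(x=x, observations=y, fitted=model))
----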
-[source,text]
----
-let(echo="freq, phase, amp",
-    x=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19),
-    y=array(-0.7441113653915925,-0.8997532112139415, -0.9853140681578838, -0.9941296760805463,
-            -0.9255133950087844, -0.7848096869247675, -0.5829778403072583, -0.33573836075915076,
-            -0.06234851460699166, 0.215897602691855, 0.47732764497752245, 0.701579055431586,
-            0.8711850882773975, 0.9729352782968976, 0.9989043923858761, 0.9470697190130273,
-            0.8214686154479715, 0.631884041542757, 0.39308257356494, 0.12366424851680227),
-    model=harmfit(x, y),
-    freq=getAngularFrequency(model),
-    phase=getPhase(model),
-    amp=getAmplitude(model))
----
+In the example below the original control points are shown in blue and the fitted curve is shown in yellow.

-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "freq": 0.28,
-        "phase": 2.4100000000000006,
-        "amp": 0.9999999999999999
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----
-
-=== Interpolation and Extrapolation
-
-The `harmfit` function returns a fitted model of the sine wave that can used by the `predict` function to
-interpolate or extrapolate the sine wave.
-
-The example below uses the fitted model to extrapolate the sine wave beyond the control points
-to the x-axis points 20, 21, 22, 23.
-
-[source,text]
----
-let(x=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19),
-    y=array(-0.7441113653915925,-0.8997532112139415, -0.9853140681578838, -0.9941296760805463,
-            -0.9255133950087844, -0.7848096869247675, -0.5829778403072583, -0.33573836075915076,
-            -0.06234851460699166, 0.215897602691855, 0.47732764497752245, 0.701579055431586,
-            0.8711850882773975, 0.9729352782968976, 0.9989043923858761, 0.9470697190130273,
-            0.8214686154479715, 0.631884041542757, 0.39308257356494, 0.12366424851680227),
-    model=harmfit(x, y),
-    extrapolation=predict(model, array(20, 21, 22, 23)))
----
-
-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "extrapolation": [
-          -0.1553861764415666,
-          -0.42233370833176975,
-          -0.656386037906838,
-          -0.8393130343914845
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----
-
-== Gaussian Curve Fitting
-
-The `gaussfit` function fits a smooth curve through a Gaussian peak.
-This is shown in the example below.
+image::images/math-expressions/harmfit.png[]

-[source,text]
----
-let(x=array(0,1,2,3,4,5,6,7,8,9, 10),
-    y=array(4,55,1200,3028,12000,18422,13328,6426,1696,239,20),
-    f=gaussfit(x, y))
----
+The output of `harmfit` is a model that can be used by the `predict` function to interpolate and extrapolate
+the sine wave. In the example below the `natural` function creates an x-axis from 0 to 127,
+which is used to predict results from the model. This extrapolates the sine wave out to 128 points, while
+the original model curve had only 19 control points.

-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "f": [
-          2.81764431935644,
-          61.157417979413424,
-          684.2328985468831,
-          3945.9411154167447,
-          11729.758936952656,
-          17972.951897338007,
-          14195.201949425435,
-          5779.03836032222,
-          1212.7224502169634,
-          131.17742331530349,
-          7.3138931735866946
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----
-
-Like the `polyfit` function, the `gaussfit` function returns a function that can
-be used directly by the `predict`, `derivative` and `integrate` functions.
-
-The example below demonstrates how to compute an integral for a fitted Gaussian curve.
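+A sketch of the extrapolation step is shown below; it reuses the rounded control points from the
+sketch above, and the series names passed to `zplot` are illustrative:

+[source,text]
----
+let(x=sequence(20, 0, 1),
+    y=array(-0.744, -0.9, -0.985, -0.994, -0.926, -0.785, -0.583, -0.336, -0.062, 0.216,
+            0.477, 0.702, 0.871, 0.973, 0.999, 0.947, 0.821, 0.632, 0.393, 0.124),
+    model=harmfit(x, y),
+    extended=natural(128),
+    extrapolation=predict(model, extended),
+    zplot(x=extended, extrapolation=extrapolation))
----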
- -[source,text] ----- -let(x=array(0,1,2,3,4,5,6,7,8,9, 10), - y=array(4,55,1200,3028,12000,18422,13328,6426,1696,239,20), - f=gaussfit(x, y), - i=integrate(f, 0, 5)) - ----- - -When this expression is sent to the `/stream` handler it -responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "i": 25261.666789766092 - }, - { - "EOF": true, - "RESPONSE_TIME": 3 - } - ] - } -} ----- +image::images/math-expressions/harmfit2.png[] diff --git a/solr/solr-ref-guide/src/dsp.adoc b/solr/solr-ref-guide/src/dsp.adoc index 38f82d38537..50923e84aad 100644 --- a/solr/solr-ref-guide/src/dsp.adoc +++ b/solr/solr-ref-guide/src/dsp.adoc @@ -19,438 +19,77 @@ This section of the user guide explores functions that are commonly used in the field of Digital Signal Processing (DSP). -== Dot Product - -The `dotProduct` function is used to calculate the dot product of two numeric arrays. -The dot product is a fundamental calculation for the DSP functions discussed in this section. Before diving into -the more advanced DSP functions its useful to develop a deeper intuition of the dot product. - -The dot product operation is performed in two steps: - -. Element-by-element multiplication of two vectors which produces a vector of products. - -. Sum the vector of products to produce a scalar result. - -This simple bit of math has a number of important applications. - -=== Representing Linear Combinations - -The `dotProduct` performs the math of a _linear combination_. A linear combination has the following form: - -[source,text] ----- -(a1*v1)+(a2*v2)... ----- - -In the above example `a1` and `a2` are random variables that change. `v1` and `v2` are constant values. - -When computing the dot product the elements of two vectors are multiplied together and the results are added. -If the first vector contains random variables and the second vector contains constant values -then the dot product is performing a linear combination. - -This scenario comes up again and again in machine learning. For example both linear and logistic regression -solve for a vector of constant weights. In order to perform a prediction, a dot product is calculated -between a random observation vector and the constant weight vector. That dot product is a linear combination because -one of the vectors holds constant weights. - -Lets look at simple example of how a linear combination can be used to find the mean of a vector of numbers. - -In the example below two arrays are set to variables *`a`* and *`b`* and then operated on by the `dotProduct` function. -The output of the `dotProduct` function is set to variable *`c`*. - -The `mean` function is then used to compute the mean of the first array which is set to the variable *`d`*. - -Both the dot product and the mean are included in the output. - -When we look at the output of this expression we see that the dot product and the mean of the first array -are both 30. - -The `dotProduct` function calculated the mean of the first array. - -[source,text] ----- -let(echo="c, d", - a=array(10, 20, 30, 40, 50), - b=array(.2, .2, .2, .2, .2), - c=dotProduct(a, b), - d=mean(a)) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "c": 30, - "d": 30 - }, - { - "EOF": true, - "RESPONSE_TIME": 0 - } - ] - } -} ----- - -To get a better understanding of how the dot product calculated the mean we can perform the steps of the -calculation using vector math and look at the output of each step. 
- -In the example below the `ebeMultiply` function performs an element-by-element multiplication of -two arrays. This is the first step of the dot product calculation. The result of the element-by-element -multiplication is assigned to variable *`c`*. - -In the next step the `add` function adds all the elements of the array in variable *`c`*. - -Notice that multiplying each element of the first array by .2 and then adding the results is -equivalent to the formula for computing the mean of the first array. The formula for computing the mean -of an array is to add all the elements and divide by the number of elements. - -The output includes the output of both the `ebeMultiply` function and the `add` function. - -[source,text] ----- -let(echo="c, d", - a=array(10, 20, 30, 40, 50), - b=array(.2, .2, .2, .2, .2), - c=ebeMultiply(a, b), - d=add(c)) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "c": [ - 2, - 4, - 6, - 8, - 10 - ], - "d": 30 - }, - { - "EOF": true, - "RESPONSE_TIME": 0 - } - ] - } -} ----- - -In the example above two arrays were combined in a way that produced the mean of the first. In the second array -each value was set to .2. Another way of looking at this is that each value in the second array is -applying the same weight to the values in the first array. -By varying the weights in the second array we can produce a different result. -For example if the first array represents a time series, -the weights in the second array can be set to add more weight to a particular element in the first array. - -The example below creates a weighted average with the weight decreasing from right to left. -Notice that the weighted mean -of 36.666 is larger than the previous mean which was 30. This is because more weight was given to last element in the -array. - -[source,text] ----- -let(echo="c, d", - a=array(10, 20, 30, 40, 50), - b=array(.066666666666666,.133333333333333,.2, .266666666666666, .33333333333333), - c=ebeMultiply(a, b), - d=add(c)) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "c": [ - 0.66666666666666, - 2.66666666666666, - 6, - 10.66666666666664, - 16.6666666666665 - ], - "d": 36.66666666666646 - }, - { - "EOF": true, - "RESPONSE_TIME": 0 - } - ] - } -} ----- - -=== Representing Correlation - -Often when we think of correlation, we are thinking of _Pearson correlation_ in the field of statistics. But the definition of -correlation is actually more general: a mutual relationship or connection between two or more things. -In the field of digital signal processing the dot product is used to represent correlation. The examples below demonstrates -how the dot product can be used to represent correlation. - -In the example below the dot product is computed for two vectors. Notice that the vectors have different values that fluctuate -together. The output of the dot product is 190, which is hard to reason about because it's not scaled. - -[source,text] ----- -let(echo="c, d", - a=array(10, 20, 30, 20, 10), - b=array(1, 2, 3, 2, 1), - c=dotProduct(a, b)) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "c": 190 - }, - { - "EOF": true, - "RESPONSE_TIME": 0 - } - ] - } -} ----- - -One approach to scaling the dot product is to first scale the vectors so that both vectors have a magnitude of 1. 
Vectors with a -magnitude of 1, also called unit vectors, are used when comparing only the angle between vectors rather than the magnitude. -The `unitize` function can be used to unitize the vectors before calculating the dot product. - -Notice in the example below the dot product result, set to variable *`e`*, is effectively 1. When applied to unit vectors the dot product -will be scaled between 1 and -1. Also notice in the example `cosineSimilarity` is calculated on the unscaled vectors and the -answer is also effectively 1. This is because cosine similarity is a scaled dot product. - - -[source,text] ----- -let(echo="e, f", - a=array(10, 20, 30, 20, 10), - b=array(1, 2, 3, 2, 1), - c=unitize(a), - d=unitize(b), - e=dotProduct(c, d), - f=cosineSimilarity(a, b)) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "e": 0.9999999999999998, - "f": 0.9999999999999999 - }, - { - "EOF": true, - "RESPONSE_TIME": 0 - } - ] - } -} ----- - -If we transpose the first two numbers in the first array, so that the vectors -are not perfectly correlated, we see that the cosine similarity drops. This illustrates -how the dot product represents correlation. - -[source,text] ----- -let(echo="c, d", - a=array(20, 10, 30, 20, 10), - b=array(1, 2, 3, 2, 1), - c=cosineSimilarity(a, b)) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "c": 0.9473684210526314 - }, - { - "EOF": true, - "RESPONSE_TIME": 0 - } - ] - } -} ----- - == Convolution -The `conv` function calculates the convolution of two vectors. The convolution is calculated by reversing +The `conv` function calculates the convolution of two vectors. The convolution is calculated by *reversing* the second vector and sliding it across the first vector. The dot product of the two vectors is calculated at each point as the second vector is slid across the first vector. The dot products are collected in a third vector which is the convolution of the two vectors. === Moving Average Function -Before looking at an example of convolution its useful to review the `movingAvg` function. The moving average +Before looking at an example of convolution it's useful to review the `movingAvg` function. The moving average function computes a moving average by sliding a window across a vector and computing -the average of the window at each shift. If that sounds similar to convolution, that's because the `movingAvg` function -is syntactic sugar for convolution. +the average of the window at each shift. If that sounds similar to convolution, that's because the `movingAvg` +function involves a sliding window approach similar to convolution. -Below is an example of a moving average with a window size of 5. Notice that original vector has 13 elements +Below is an example of a moving average with a window size of 5. Notice that the original vector has 13 elements but the result of the moving average has only 9 elements. This is because the `movingAvg` function -only begins generating results when it has a full window. In this case because the window size is 5 so the -moving average starts generating results from the 4^th^ index of the original array. +only begins generating results when it has a full window. The `ltrim` function is used to trim the +first four elements from the original `y` array to line up with the moving average. 
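+A sketch of the kind of expression behind this plot is shown below; the `y` values and the
+series names passed to `zplot` are illustrative:

+[source,text]
----
+let(y=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
+    avg=movingAvg(y, 5),
+    yTrimmed=ltrim(y, 4),
+    x=natural(9),
+    zplot(x=x, y=yTrimmed, avg=avg))
----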
-[source,text]
----
-let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
-    b=movingAvg(a, 5))
----
+image::images/math-expressions/conv1.png[]

-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "b": [
-          3,
-          4,
-          5,
-          5.6,
-          5.8,
-          5.6,
-          5,
-          4,
-          3
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----

=== Convolutional Smoothing

The moving average can also be computed using convolution. In the example
below the `conv` function is used to compute the moving average of the first array
-by applying the second array as the filter.
+by applying the second array as a filter.

-Looking at the result, we see that it is not exactly the same as the result
-of the `movingAvg` function. That is because the `conv` pads zeros
+Looking at the result, we see that the convolution produced an array with 17 values instead of the 9 values created by the
+moving average. That is because the `conv` function pads zeros
to the front and back of the first vector so that the window size is always full.

-[source,text]
----
-let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
-    b=array(.2, .2, .2, .2, .2),
-    c=conv(a, b))
----
+image::images/math-expressions/conv2.png[]

-When this expression is sent to the `/stream` handler it responds with:
+We achieve the same result as the `movingAvg` function by trimming the first and last 4 values of
+the convolution result using the `ltrim` and `rtrim` functions.

-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "c": [
-          0.2,
-          0.6000000000000001,
-          1.2,
-          2.0000000000000004,
-          3.0000000000000004,
-          4,
-          5,
-          5.6000000000000005,
-          5.800000000000001,
-          5.6000000000000005,
-          5.000000000000001,
-          4,
-          3,
-          2,
-          1.2000000000000002,
-          0.6000000000000001,
-          0.2
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----
+The example below plots both the trimmed convolution and the moving average on the same plot. Notice that
+they perfectly overlap.

-We achieve the same result as the `movingAvg` function by using the `copyOfRange` function to copy a range of
-the result that drops the first and last 4 values of
-the convolution result. In the example below the `precision` function is also also used to remove floating point errors from the
-convolution result. When this is added the output is exactly the same as the `movingAvg` function.
+image::images/math-expressions/conv3.png[]

-[source,text]
----
-let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
-    b=array(.2, .2, .2, .2, .2),
-    c=conv(a, b),
-    d=copyOfRange(c, 4, 13),
-    e=precision(d, 2))
----
+This demonstrates how convolution can be used to smooth a signal by sliding a filter across the signal and
+computing the dot product at each point. The smoothing effect is caused by the design of the filter.
+In the example, the filter length is 5 and each value in the filter is .2. This filter calculates a
+simple moving average with a window size of 5.
+
+The formula for computing a simple moving average using convolution is to make the filter length the window
+size and to make all of the values in the filter the same, with the values summing to 1. A moving average with a window size of 4
+can be computed by changing the filter to a length of 4 with each value being .25.
+
+==== Changing the Weights
+
+The filter, which is sometimes called the *kernel*, can be viewed as a vector of weights. In the initial
+example all values in the filter have the same weight (.2). The weights in the filter can be changed to
+produce different smoothing effects. This is demonstrated in the example below.
+
+In this example the filter increases in weight from .1 to .3. This places more weight towards the front
+of the filter. Notice that the filter is reversed with the `rev` function before the `conv` function applies it.
+This is done because convolution will reverse
+the filter. In this case we reverse it ahead of time and when convolution reverses it back, it is the same
+as the original filter.
+
+The plot shows the effect of the different weights in the filter. The dark blue line is the initial array.
+The light blue line is the convolution and the orange line is the moving average. Notice that the convolution
+responds more quickly to the movements in the underlying array. This is because more weight has been placed
+at the front of the filter.
+
+image::images/math-expressions/conv4.png[]

-
-When this expression is sent to the `/stream` handler it responds with:
-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "e": [
-          3,
-          4,
-          5,
-          5.6,
-          5.8,
-          5.6,
-          5,
-          4,
-          3
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----

== Cross-Correlation

@@ -467,54 +106,8 @@ rather than the convolution calculation.

Notice in the result the highest value is 217. This is the point where the two vectors have the highest correlation.

-[source,text]
----
-let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
-    b=array(4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
-    c=conv(a, rev(b)))
----
+image::images/math-expressions/crosscorr.png[]

-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "c": [
-          1,
-          4,
-          10,
-          20,
-          35,
-          56,
-          84,
-          116,
-          149,
-          180,
-          203,
-          216,
-          217,
-          204,
-          180,
-          148,
-          111,
-          78,
-          50,
-          28,
-          13,
-          4
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----

== Find Delay

@@ -525,67 +118,29 @@ and then computes the delay between the two signals.

Below is an example of the `finddelay` function. Notice that the `finddelay` function reports a 3 period delay
between the first and second signal.

-[source,text]
----
-let(a=array(1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
-    b=array(4, 5, 6, 7, 6, 5, 4, 3, 2, 1),
-    c=finddelay(a, b))
----
+image::images/math-expressions/delay.png[]

-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "c": 3
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----

== Oscillate (Sine Wave)

The `oscillate` function generates a periodic oscillating signal which can be used to model and study sine waves.
-The `oscillate` function takes three parameters: *amplitude*, *angular frequency*
-and *phase* and returns a vector containing the y-axis points of a sine wave.
+The `oscillate` function takes three parameters, `amplitude`, `angular frequency`, and `phase`, and returns a vector containing the y-axis points of a sine wave.
+The y-axis points were generated from an x-axis sequence of 0-127.

Below is an example of the `oscillate` function called with an amplitude of 1, an angular frequency of .28,
and a phase of 1.57.
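+The call itself is a one-line expression, whose output is rendered by `zplot` in the plot below:

+[source,text]
----
+oscillate(1, 0.28, 1.57)
----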
-[source,text] ----- -oscillate(1, 0.28, 1.57) ----- - -The result of the `oscillate` function is plotted below: image::images/math-expressions/sinewave.png[] -=== Sine Wave Interpolation, Extrapolation +=== Sine Wave Interpolation & Extrapolation The `oscillate` function returns a function which can be used by the `predict` function to interpolate or extrapolate a sine wave. + The example below extrapolates the sine wave to an x-axis sequence of 0-256. - -[source,text] ----- -let(a=oscillate(1, 0.28, 1.57), - b=predict(a, sequence(256, 0, 1))) ----- - -The extrapolated sine wave is plotted below: - image::images/math-expressions/sinewave256.png[] @@ -599,11 +154,6 @@ A few examples, with plots, will help to understand the concepts. The first example simply revisits the example above of an extrapolated sine wave. The result of this is plotted in the image below. Notice that there is a structure to the plot that is clearly not random. -[source,text] ----- -let(a=oscillate(1, 0.28, 1.57), - b=predict(a, sequence(256, 0, 1))) ----- image::images/math-expressions/sinewave256.png[] @@ -612,11 +162,6 @@ In the next example the `sample` function is used to draw 256 samples from a `un vector of random data. The result of this is plotted in the image below. Notice that there is no clear structure to the data and the data appears to be random. -[source,text] ----- -sample(uniformDistribution(-1.5, 1.5), 256) ----- - image::images/math-expressions/noise.png[] @@ -625,13 +170,6 @@ The result of this is plotted in the image below. Notice that the sine wave has somewhat within the noise. Its difficult to say for sure if there is structure. As plots becomes more dense it can become harder to see a pattern hidden within noise. -[source,text] ----- -let(a=oscillate(1, 0.28, 1.57), - b=predict(a, sequence(256, 0, 1)), - c=sample(uniformDistribution(-1.5, 1.5), 256), - d=ebeAdd(b,c)) ----- image::images/math-expressions/hidden-signal.png[] @@ -649,12 +187,6 @@ intensity as the sine wave slides farther away from being directly lined up. This is the autocorrelation plot of a pure signal. -[source,text] ----- -let(a=oscillate(1, 0.28, 1.57), - b=predict(a, sequence(256, 0, 1)), - c=conv(b, rev(b))) ----- image::images/math-expressions/signal-autocorrelation.png[] @@ -666,11 +198,6 @@ This is followed by another long period of low intensity correlation. This is the autocorrelation plot of pure noise. -[source,text] ----- -let(a=sample(uniformDistribution(-1.5, 1.5), 256), - b=conv(a, rev(a)), ----- image::images/math-expressions/noise-autocorrelation.png[] @@ -680,25 +207,17 @@ Notice that this plot shows very clear signs of structure which is similar to au pure signal. The correlation is less intense due to noise but the shape of the correlation plot suggests strongly that there is an underlying signal hidden within the noise. -[source,text] ----- -let(a=oscillate(1, 0.28, 1.57), - b=predict(a, sequence(256, 0, 1)), - c=sample(uniformDistribution(-1.5, 1.5), 256), - d=ebeAdd(b, c), - e=conv(d, rev(d))) ----- image::images/math-expressions/hidden-signal-autocorrelation.png[] == Discrete Fourier Transform -The convolution based functions described above are operating on signals in the time domain. In the time -domain the X axis is time and the Y axis is the quantity of some value at a specific point in time. +The convolution-based functions described above are operating on signals in the time domain. In the time +domain the x-axis is time and the y-axis is the quantity of some value at a specific point in time. 
The discrete Fourier Transform translates a time domain signal into the frequency domain.
-In the frequency domain the X axis is frequency, and Y axis is the accumulated power at a specific frequency.
+In the frequency domain the x-axis is frequency, and the y-axis is the accumulated power at a specific frequency.

The basic principle is that every time domain signal is composed of one or more signals (sine waves)
at different frequencies. The discrete Fourier transform decomposes a time domain signal into its component
frequencies and measures the power at each frequency.

to determine if a signal has structure or if it is purely random.

The `fft` function performs the discrete Fourier Transform on a vector of *real* data. The result
of the `fft` function is returned as *complex* numbers. A complex number has two parts, *real* and *imaginary*.
-The imaginary part of the complex number is ignored in the examples below, but there
-are many tutorials on the FFT and that include complex numbers available online.
-
-But before diving into the examples it is important to understand how the `fft` function formats the
-complex numbers in the result.
+The *real* part of the result describes the magnitude of the signal at different frequencies.
+The *imaginary* part of the result describes the *phase*. The examples below deal only with the *real*
+part of the result.

The `fft` function returns a `matrix` with two rows. The first row in the matrix is the *real*
part of the complex result. The second row in the matrix is the *imaginary* part of the complex result.
-
The `rowAt` function can be used to access the rows so they can be processed as vectors.
-This approach was taken because all of the vector math functions operate on vectors of real numbers.
-Rather then introducing a complex number abstraction into the expression language, the `fft` result is
-represented as two vectors of real numbers.
+

=== Fast Fourier Transform Examples

In the first example the `fft` function is called on the sine wave used in the autocorrelation example.
The result of the `fft` function is a matrix. The `rowAt` function is used to return the first row of
-the matrix which is a vector containing the real values of the fft response.
+the matrix which is a vector containing the real values of the `fft` response.

The plot of the real values of the `fft` response is shown below. Notice there are two
peaks on opposite sides of the plot. The plot is actually showing a mirrored response. The right side
of the plot is an exact mirror of the left side.

Also notice that the `fft` has accumulated significant power in a single peak. This is the power associated with
the specific frequency of the sine wave. The vast majority of frequencies in the plot have close to 0 power
associated with them. This `fft` shows a clear signal with very low levels of noise.

-[source,text]
----
-let(a=oscillate(1, 0.28, 1.57),
-    b=predict(a, sequence(256, 0, 1)),
-    c=fft(b),
-    d=rowAt(c, 0))
----

image::images/math-expressions/signal-fft.png[]

In the second example the `fft` function is called on the same vector of noise used in the
autocorrelation example. The plot of the real values of the `fft` response is shown below.
Notice that in this response there is no clear peak. Instead all frequencies have accumulated a random
level of power. This `fft` shows no clear sign of signal and appears to be noise.

-[source,text]
----
-let(a=sample(uniformDistribution(-1.5, 1.5), 256),
-    b=fft(a),
-    c=rowAt(b, 0))
----

image::images/math-expressions/noise-fft.png[]

In the third example the `fft` function is called on the same signal hidden within noise
that was used for the autocorrelation example.
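+The hidden signal can be constructed and transformed with an expression like the one below;
+plotting the `f` vector produces the figure that follows:

+[source,text]
----
+let(a=oscillate(1, 0.28, 1.57),
+    b=predict(a, sequence(256, 0, 1)),
+    c=sample(uniformDistribution(-1.5, 1.5), 256),
+    d=ebeAdd(b, c),
+    e=fft(d),
+    f=rowAt(e, 0))
----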
The plot of the real values of the `fft` response is shown below. @@ -776,14 +273,5 @@ Notice that there are two clear mirrored peaks, at the same locations as the `ff there is also now considerable noise on the frequencies. The `fft` has found the signal and but also shows that there is considerable noise along with the signal. -[source,text] ----- -let(a=oscillate(1, 0.28, 1.57), - b=predict(a, sequence(256, 0, 1)), - c=sample(uniformDistribution(-1.5, 1.5), 256), - d=ebeAdd(b, c), - e=fft(d), - f=rowAt(e, 0)) ----- image::images/math-expressions/hidden-signal-fft.png[] diff --git a/solr/solr-ref-guide/src/images/math-expressions/2Centroids.png b/solr/solr-ref-guide/src/images/math-expressions/2Centroids.png new file mode 100644 index 00000000000..d536d5c4d92 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/2Centroids.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/2DCluster1.png b/solr/solr-ref-guide/src/images/math-expressions/2DCluster1.png new file mode 100644 index 00000000000..52530b415b8 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/2DCluster1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/2DCluster2.png b/solr/solr-ref-guide/src/images/math-expressions/2DCluster2.png new file mode 100644 index 00000000000..08daacfe990 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/2DCluster2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/2Dcentroids.png b/solr/solr-ref-guide/src/images/math-expressions/2Dcentroids.png new file mode 100644 index 00000000000..27f16b68c54 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/2Dcentroids.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/2Dcluster.png b/solr/solr-ref-guide/src/images/math-expressions/2Dcluster.png new file mode 100644 index 00000000000..13d725b2a5b Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/2Dcluster.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/add.png b/solr/solr-ref-guide/src/images/math-expressions/add.png new file mode 100644 index 00000000000..02fd5126877 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/add.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/aggs.png b/solr/solr-ref-guide/src/images/math-expressions/aggs.png new file mode 100644 index 00000000000..006923f2412 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/aggs.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/anomaly.png b/solr/solr-ref-guide/src/images/math-expressions/anomaly.png new file mode 100644 index 00000000000..a6391410973 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/anomaly.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/array.png b/solr/solr-ref-guide/src/images/math-expressions/array.png new file mode 100644 index 00000000000..23490c6f7ca Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/array.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/autocorr.png b/solr/solr-ref-guide/src/images/math-expressions/autocorr.png new file mode 100644 index 00000000000..9f8c7ff8695 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/autocorr.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/bar.png 
b/solr/solr-ref-guide/src/images/math-expressions/bar.png new file mode 100644 index 00000000000..ffd102b7b89 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/bar.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/beta.png b/solr/solr-ref-guide/src/images/math-expressions/beta.png new file mode 100644 index 00000000000..5b505dc2487 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/beta.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/binomial.png b/solr/solr-ref-guide/src/images/math-expressions/binomial.png new file mode 100644 index 00000000000..c8fed518593 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/binomial.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/bivariate.png b/solr/solr-ref-guide/src/images/math-expressions/bivariate.png new file mode 100644 index 00000000000..364ad04956b Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/bivariate.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/cartesian.png b/solr/solr-ref-guide/src/images/math-expressions/cartesian.png new file mode 100644 index 00000000000..06069ab0551 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/cartesian.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/cartogram.png b/solr/solr-ref-guide/src/images/math-expressions/cartogram.png new file mode 100644 index 00000000000..7945d25eaf7 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/cartogram.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/centroidplot.png b/solr/solr-ref-guide/src/images/math-expressions/centroidplot.png new file mode 100644 index 00000000000..c3472d42e18 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/centroidplot.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/centroidzoom.png b/solr/solr-ref-guide/src/images/math-expressions/centroidzoom.png new file mode 100644 index 00000000000..6dd57af97f0 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/centroidzoom.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/clusters.png b/solr/solr-ref-guide/src/images/math-expressions/clusters.png new file mode 100644 index 00000000000..f34d358e2ea Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/clusters.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/clusterzip.png b/solr/solr-ref-guide/src/images/math-expressions/clusterzip.png new file mode 100644 index 00000000000..6ada385c2d3 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/clusterzip.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/clusterzipplot.png b/solr/solr-ref-guide/src/images/math-expressions/clusterzipplot.png new file mode 100644 index 00000000000..88a2f3bb446 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/clusterzipplot.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/commit-series.png b/solr/solr-ref-guide/src/images/math-expressions/commit-series.png new file mode 100644 index 00000000000..7052f064588 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/commit-series.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/constant.png b/solr/solr-ref-guide/src/images/math-expressions/constant.png new file 
mode 100644 index 00000000000..f64647a7fbe Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/constant.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/conv-smooth.png b/solr/solr-ref-guide/src/images/math-expressions/conv-smooth.png new file mode 100644 index 00000000000..1e928b3c087 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/conv-smooth.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/conv1.png b/solr/solr-ref-guide/src/images/math-expressions/conv1.png new file mode 100644 index 00000000000..d6bd0184a51 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/conv1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/conv2.png b/solr/solr-ref-guide/src/images/math-expressions/conv2.png new file mode 100644 index 00000000000..d0116b36e9c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/conv2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/conv3.png b/solr/solr-ref-guide/src/images/math-expressions/conv3.png new file mode 100644 index 00000000000..f2fdb18c57d Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/conv3.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/conv4.png b/solr/solr-ref-guide/src/images/math-expressions/conv4.png new file mode 100644 index 00000000000..925bac0f969 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/conv4.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/convex.png b/solr/solr-ref-guide/src/images/math-expressions/convex.png new file mode 100644 index 00000000000..e832ad839b6 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/convex.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/convex0.png b/solr/solr-ref-guide/src/images/math-expressions/convex0.png new file mode 100644 index 00000000000..41f1a444337 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/convex0.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/convex1.png b/solr/solr-ref-guide/src/images/math-expressions/convex1.png new file mode 100644 index 00000000000..b4a8ecce802 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/convex1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/convex2.png b/solr/solr-ref-guide/src/images/math-expressions/convex2.png new file mode 100644 index 00000000000..797c176fbf9 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/convex2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/correlation.png b/solr/solr-ref-guide/src/images/math-expressions/correlation.png new file mode 100644 index 00000000000..c6bc9a2a13e Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/correlation.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/corrmatrix.png b/solr/solr-ref-guide/src/images/math-expressions/corrmatrix.png new file mode 100644 index 00000000000..786cb0caf4d Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/corrmatrix.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/corrmatrix2.png b/solr/solr-ref-guide/src/images/math-expressions/corrmatrix2.png new file mode 100644 index 00000000000..029d3019373 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/corrmatrix2.png differ 
diff --git a/solr/solr-ref-guide/src/images/math-expressions/corrsim1.png b/solr/solr-ref-guide/src/images/math-expressions/corrsim1.png new file mode 100644 index 00000000000..4a92e323ad5 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/corrsim1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/corrsim2.png b/solr/solr-ref-guide/src/images/math-expressions/corrsim2.png new file mode 100644 index 00000000000..cea381e15c4 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/corrsim2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/corrsim3.png b/solr/solr-ref-guide/src/images/math-expressions/corrsim3.png new file mode 100644 index 00000000000..b3504385925 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/corrsim3.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/covariance.png b/solr/solr-ref-guide/src/images/math-expressions/covariance.png new file mode 100644 index 00000000000..30ad6c09440 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/covariance.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/crosscorr.png b/solr/solr-ref-guide/src/images/math-expressions/crosscorr.png new file mode 100644 index 00000000000..b0fc7eb6e2b Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/crosscorr.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/csv.png b/solr/solr-ref-guide/src/images/math-expressions/csv.png new file mode 100644 index 00000000000..817e03312e8 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/csv.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/csvselect.png b/solr/solr-ref-guide/src/images/math-expressions/csvselect.png new file mode 100644 index 00000000000..94c8a56f29c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/csvselect.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/csvtable.png b/solr/solr-ref-guide/src/images/math-expressions/csvtable.png new file mode 100644 index 00000000000..e8c042c90f3 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/csvtable.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/cumPct.png b/solr/solr-ref-guide/src/images/math-expressions/cumPct.png new file mode 100644 index 00000000000..173c7a45935 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/cumPct.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/cumProb.png b/solr/solr-ref-guide/src/images/math-expressions/cumProb.png new file mode 100644 index 00000000000..ff93b6125ac Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/cumProb.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/curve-fitting.png b/solr/solr-ref-guide/src/images/math-expressions/curve-fitting.png new file mode 100644 index 00000000000..2f63cfdd9ef Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/curve-fitting.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/custom-hist.png b/solr/solr-ref-guide/src/images/math-expressions/custom-hist.png new file mode 100644 index 00000000000..4b0ce8146bd Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/custom-hist.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/dbscan1.png 
b/solr/solr-ref-guide/src/images/math-expressions/dbscan1.png new file mode 100644 index 00000000000..da8c9204142 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/dbscan1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/dbscan2.png b/solr/solr-ref-guide/src/images/math-expressions/dbscan2.png new file mode 100644 index 00000000000..f476034dc2c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/dbscan2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/delay.png b/solr/solr-ref-guide/src/images/math-expressions/delay.png new file mode 100644 index 00000000000..8211f8f3e3b Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/delay.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/derivative.png b/solr/solr-ref-guide/src/images/math-expressions/derivative.png new file mode 100644 index 00000000000..f91cd95cba7 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/derivative.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/derivative1.png b/solr/solr-ref-guide/src/images/math-expressions/derivative1.png new file mode 100644 index 00000000000..4e60b35c659 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/derivative1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/derivative2.png b/solr/solr-ref-guide/src/images/math-expressions/derivative2.png new file mode 100644 index 00000000000..c565fc5c6a3 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/derivative2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/derivatives.png b/solr/solr-ref-guide/src/images/math-expressions/derivatives.png new file mode 100644 index 00000000000..b6ff637e198 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/derivatives.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/describe.png b/solr/solr-ref-guide/src/images/math-expressions/describe.png new file mode 100644 index 00000000000..7f88facfab3 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/describe.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/diagnostics.png b/solr/solr-ref-guide/src/images/math-expressions/diagnostics.png new file mode 100644 index 00000000000..58a726ce8d1 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/diagnostics.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/diff.png b/solr/solr-ref-guide/src/images/math-expressions/diff.png new file mode 100644 index 00000000000..350b2c9d077 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/diff.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/diff1.png b/solr/solr-ref-guide/src/images/math-expressions/diff1.png new file mode 100644 index 00000000000..4c5cd50bfaf Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/diff1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/diffcorr.png b/solr/solr-ref-guide/src/images/math-expressions/diffcorr.png new file mode 100644 index 00000000000..c187c38738c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/diffcorr.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/diffzoom.png b/solr/solr-ref-guide/src/images/math-expressions/diffzoom.png new file mode 100644 index 00000000000..27730f79e2b 
Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/diffzoom.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/dist.png b/solr/solr-ref-guide/src/images/math-expressions/dist.png new file mode 100644 index 00000000000..f1a004a607b Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/dist.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/distance.png b/solr/solr-ref-guide/src/images/math-expressions/distance.png new file mode 100644 index 00000000000..373cb35d078 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/distance.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/distanceview.png b/solr/solr-ref-guide/src/images/math-expressions/distanceview.png new file mode 100644 index 00000000000..346f640d6a6 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/distanceview.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/empirical.png b/solr/solr-ref-guide/src/images/math-expressions/empirical.png new file mode 100644 index 00000000000..3d8be38fc8f Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/empirical.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/enum1.png b/solr/solr-ref-guide/src/images/math-expressions/enum1.png new file mode 100644 index 00000000000..58d10a5eb3e Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/enum1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/enum2.png b/solr/solr-ref-guide/src/images/math-expressions/enum2.png new file mode 100644 index 00000000000..ab0094a5736 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/enum2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/expmoving.png b/solr/solr-ref-guide/src/images/math-expressions/expmoving.png new file mode 100644 index 00000000000..99b5e302f32 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/expmoving.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/facet2D.png b/solr/solr-ref-guide/src/images/math-expressions/facet2D.png new file mode 100644 index 00000000000..7387e5a191f Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/facet2D.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/facettab1.png b/solr/solr-ref-guide/src/images/math-expressions/facettab1.png new file mode 100644 index 00000000000..1bcf2623315 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/facettab1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/facettab2.png b/solr/solr-ref-guide/src/images/math-expressions/facettab2.png new file mode 100644 index 00000000000..cb990009a4c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/facettab2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/facetviz1.png b/solr/solr-ref-guide/src/images/math-expressions/facetviz1.png new file mode 100644 index 00000000000..634a71b8caf Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/facetviz1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/facetviz2.png b/solr/solr-ref-guide/src/images/math-expressions/facetviz2.png new file mode 100644 index 00000000000..ae480d6b1a9 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/facetviz2.png differ diff --git 
a/solr/solr-ref-guide/src/images/math-expressions/fft.png b/solr/solr-ref-guide/src/images/math-expressions/fft.png new file mode 100644 index 00000000000..0ff81224d1e Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/fft.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/forecast.png b/solr/solr-ref-guide/src/images/math-expressions/forecast.png new file mode 100644 index 00000000000..69af3d30532 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/forecast.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/freqTable.png b/solr/solr-ref-guide/src/images/math-expressions/freqTable.png new file mode 100644 index 00000000000..c15991228eb Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/freqTable.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/freqTable1.png b/solr/solr-ref-guide/src/images/math-expressions/freqTable1.png new file mode 100644 index 00000000000..84eb65e8c05 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/freqTable1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/fuzzyk.png b/solr/solr-ref-guide/src/images/math-expressions/fuzzyk.png new file mode 100644 index 00000000000..34bd9441f03 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/fuzzyk.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/gamma.png b/solr/solr-ref-guide/src/images/math-expressions/gamma.png new file mode 100644 index 00000000000..8833b377f0d Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/gamma.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/gaussfit.png b/solr/solr-ref-guide/src/images/math-expressions/gaussfit.png new file mode 100644 index 00000000000..98348a7e2bf Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/gaussfit.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/gaussfit2.png b/solr/solr-ref-guide/src/images/math-expressions/gaussfit2.png new file mode 100644 index 00000000000..796b6a6b151 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/gaussfit2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/geometric.png b/solr/solr-ref-guide/src/images/math-expressions/geometric.png new file mode 100644 index 00000000000..bce76bbbe87 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/geometric.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/harmfit.png b/solr/solr-ref-guide/src/images/math-expressions/harmfit.png new file mode 100644 index 00000000000..53759a7b20d Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/harmfit.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/harmfit2.png b/solr/solr-ref-guide/src/images/math-expressions/harmfit2.png new file mode 100644 index 00000000000..0b8674dc131 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/harmfit2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/hashRollup.png b/solr/solr-ref-guide/src/images/math-expressions/hashRollup.png new file mode 100644 index 00000000000..2f5d23c4288 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/hashRollup.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/having.png b/solr/solr-ref-guide/src/images/math-expressions/having.png new file 
mode 100644 index 00000000000..ae6e13ced87 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/having.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/having2.png b/solr/solr-ref-guide/src/images/math-expressions/having2.png new file mode 100644 index 00000000000..f55aef6928e Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/having2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/having3.png b/solr/solr-ref-guide/src/images/math-expressions/having3.png new file mode 100644 index 00000000000..5450bbbf4f7 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/having3.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/havingId.png b/solr/solr-ref-guide/src/images/math-expressions/havingId.png new file mode 100644 index 00000000000..1e56a602440 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/havingId.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/havingIsNull.png b/solr/solr-ref-guide/src/images/math-expressions/havingIsNull.png new file mode 100644 index 00000000000..52fccf27e2f Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/havingIsNull.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/havingNotNull.png b/solr/solr-ref-guide/src/images/math-expressions/havingNotNull.png new file mode 100644 index 00000000000..82c6799bd11 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/havingNotNull.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/heat.png b/solr/solr-ref-guide/src/images/math-expressions/heat.png new file mode 100644 index 00000000000..97802e8a474 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/heat.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/hfit.png b/solr/solr-ref-guide/src/images/math-expressions/hfit.png new file mode 100644 index 00000000000..25acf700a7c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/hfit.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/hidden-signal-autocorrelation.png b/solr/solr-ref-guide/src/images/math-expressions/hidden-signal-autocorrelation.png index f741c1821c7..4b9305016a0 100644 Binary files a/solr/solr-ref-guide/src/images/math-expressions/hidden-signal-autocorrelation.png and b/solr/solr-ref-guide/src/images/math-expressions/hidden-signal-autocorrelation.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/hidden-signal-fft.png b/solr/solr-ref-guide/src/images/math-expressions/hidden-signal-fft.png index 58b0c60136f..fb23e683bc9 100644 Binary files a/solr/solr-ref-guide/src/images/math-expressions/hidden-signal-fft.png and b/solr/solr-ref-guide/src/images/math-expressions/hidden-signal-fft.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/hidden-signal.png b/solr/solr-ref-guide/src/images/math-expressions/hidden-signal.png index 9baff48dea5..1d24e43986e 100644 Binary files a/solr/solr-ref-guide/src/images/math-expressions/hidden-signal.png and b/solr/solr-ref-guide/src/images/math-expressions/hidden-signal.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/hist.png b/solr/solr-ref-guide/src/images/math-expressions/hist.png new file mode 100644 index 00000000000..914d0e8b47d Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/hist.png differ diff --git 
a/solr/solr-ref-guide/src/images/math-expressions/histtable.png b/solr/solr-ref-guide/src/images/math-expressions/histtable.png new file mode 100644 index 00000000000..9f3983598a4 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/histtable.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/hullplot.png b/solr/solr-ref-guide/src/images/math-expressions/hullplot.png new file mode 100644 index 00000000000..8e51e808cf4 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/hullplot.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/ifIsNull.png b/solr/solr-ref-guide/src/images/math-expressions/ifIsNull.png new file mode 100644 index 00000000000..984f1ff8442 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/ifIsNull.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/integral.png b/solr/solr-ref-guide/src/images/math-expressions/integral.png new file mode 100644 index 00000000000..7eed7b32c35 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/integral.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/interpolate1.png b/solr/solr-ref-guide/src/images/math-expressions/interpolate1.png new file mode 100644 index 00000000000..40910f63be1 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/interpolate1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/knn.png b/solr/solr-ref-guide/src/images/math-expressions/knn.png new file mode 100644 index 00000000000..d4dd66eb7d6 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/knn.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/knnRegress.png b/solr/solr-ref-guide/src/images/math-expressions/knnRegress.png new file mode 100644 index 00000000000..ddc85ded147 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/knnRegress.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/knnSearch.png b/solr/solr-ref-guide/src/images/math-expressions/knnSearch.png new file mode 100644 index 00000000000..761e1809f17 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/knnSearch.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/line.png b/solr/solr-ref-guide/src/images/math-expressions/line.png new file mode 100644 index 00000000000..d842af5eee6 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/line.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/line1.png b/solr/solr-ref-guide/src/images/math-expressions/line1.png new file mode 100644 index 00000000000..0d8312091a4 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/line1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/linear.png b/solr/solr-ref-guide/src/images/math-expressions/linear.png new file mode 100644 index 00000000000..007d0d666f1 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/linear.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/loess.png b/solr/solr-ref-guide/src/images/math-expressions/loess.png new file mode 100644 index 00000000000..21eeb60b997 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/loess.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/lognormal.png b/solr/solr-ref-guide/src/images/math-expressions/lognormal.png new file mode 100644 
index 00000000000..dff60bb564e Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/lognormal.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/logs-collection.png b/solr/solr-ref-guide/src/images/math-expressions/logs-collection.png new file mode 100644 index 00000000000..b56709b9e7a Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/logs-collection.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/logs-dates.png b/solr/solr-ref-guide/src/images/math-expressions/logs-dates.png new file mode 100644 index 00000000000..1241580b56a Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/logs-dates.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/logs-sample.png b/solr/solr-ref-guide/src/images/math-expressions/logs-sample.png new file mode 100644 index 00000000000..881b92c5179 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/logs-sample.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/logs-time-series.png b/solr/solr-ref-guide/src/images/math-expressions/logs-time-series.png new file mode 100644 index 00000000000..5c4fa6a3d91 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/logs-time-series.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/logs-time-series2.png b/solr/solr-ref-guide/src/images/math-expressions/logs-time-series2.png new file mode 100644 index 00000000000..93d9ada496a Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/logs-time-series2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/logs-time-series3.png b/solr/solr-ref-guide/src/images/math-expressions/logs-time-series3.png new file mode 100644 index 00000000000..ad771482d5f Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/logs-time-series3.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/logs-type-collection.png b/solr/solr-ref-guide/src/images/math-expressions/logs-type-collection.png new file mode 100644 index 00000000000..bdb8a825002 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/logs-type-collection.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/logs-type.png b/solr/solr-ref-guide/src/images/math-expressions/logs-type.png new file mode 100644 index 00000000000..1b9b8e9383d Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/logs-type.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/mad.png b/solr/solr-ref-guide/src/images/math-expressions/mad.png new file mode 100644 index 00000000000..356aa07f728 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/mad.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/maddist.png b/solr/solr-ref-guide/src/images/math-expressions/maddist.png new file mode 100644 index 00000000000..0303f43c024 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/maddist.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/madsort.png b/solr/solr-ref-guide/src/images/math-expressions/madsort.png new file mode 100644 index 00000000000..f54bb90721c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/madsort.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/map.png b/solr/solr-ref-guide/src/images/math-expressions/map.png new file mode 100644 
index 00000000000..abd91f37387 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/map.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/matches.png b/solr/solr-ref-guide/src/images/math-expressions/matches.png new file mode 100644 index 00000000000..0dc5cb5508c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/matches.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/matrix.png b/solr/solr-ref-guide/src/images/math-expressions/matrix.png new file mode 100644 index 00000000000..84684a01efc Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/matrix.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/minmaxscale.png b/solr/solr-ref-guide/src/images/math-expressions/minmaxscale.png new file mode 100644 index 00000000000..1631bec5e47 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/minmaxscale.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/mnorm.png b/solr/solr-ref-guide/src/images/math-expressions/mnorm.png new file mode 100644 index 00000000000..aa142d782d7 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/mnorm.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/mnorm1.png b/solr/solr-ref-guide/src/images/math-expressions/mnorm1.png new file mode 100644 index 00000000000..ad8ef134797 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/mnorm1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/mnorm2.png b/solr/solr-ref-guide/src/images/math-expressions/mnorm2.png new file mode 100644 index 00000000000..8bd9841b6b1 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/mnorm2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/monte-carlo.png b/solr/solr-ref-guide/src/images/math-expressions/monte-carlo.png new file mode 100644 index 00000000000..a507da1b5e5 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/monte-carlo.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/movingMedian.png b/solr/solr-ref-guide/src/images/math-expressions/movingMedian.png new file mode 100644 index 00000000000..cd6cac2b435 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/movingMedian.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/movingavg.png b/solr/solr-ref-guide/src/images/math-expressions/movingavg.png new file mode 100644 index 00000000000..57e47b528a4 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/movingavg.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/multitime1.png b/solr/solr-ref-guide/src/images/math-expressions/multitime1.png new file mode 100644 index 00000000000..335c037ec1c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/multitime1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/nodestab.png b/solr/solr-ref-guide/src/images/math-expressions/nodestab.png new file mode 100644 index 00000000000..350686bc3e8 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/nodestab.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/nodesviz.png b/solr/solr-ref-guide/src/images/math-expressions/nodesviz.png new file mode 100644 index 00000000000..804b7da9c1d Binary files /dev/null and 
b/solr/solr-ref-guide/src/images/math-expressions/nodesviz.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/noise-autocorrelation.png b/solr/solr-ref-guide/src/images/math-expressions/noise-autocorrelation.png index d69a9a2bdae..577be2d4ad0 100644 Binary files a/solr/solr-ref-guide/src/images/math-expressions/noise-autocorrelation.png and b/solr/solr-ref-guide/src/images/math-expressions/noise-autocorrelation.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/noise-fft.png b/solr/solr-ref-guide/src/images/math-expressions/noise-fft.png index cdcfba36509..e4ba044cc72 100644 Binary files a/solr/solr-ref-guide/src/images/math-expressions/noise-fft.png and b/solr/solr-ref-guide/src/images/math-expressions/noise-fft.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/noise.png b/solr/solr-ref-guide/src/images/math-expressions/noise.png index 6b4f9762f67..2624a9423f4 100644 Binary files a/solr/solr-ref-guide/src/images/math-expressions/noise.png and b/solr/solr-ref-guide/src/images/math-expressions/noise.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/num.png b/solr/solr-ref-guide/src/images/math-expressions/num.png new file mode 100644 index 00000000000..d7324da644b Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/num.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/number.png b/solr/solr-ref-guide/src/images/math-expressions/number.png new file mode 100644 index 00000000000..afb76a4c5cd Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/number.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/outliers.png b/solr/solr-ref-guide/src/images/math-expressions/outliers.png new file mode 100644 index 00000000000..3f4fd084166 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/outliers.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/overlay-series.png b/solr/solr-ref-guide/src/images/math-expressions/overlay-series.png new file mode 100644 index 00000000000..176eb380edc Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/overlay-series.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/paging.png b/solr/solr-ref-guide/src/images/math-expressions/paging.png new file mode 100644 index 00000000000..373c256f8cd Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/paging.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/poisson.png b/solr/solr-ref-guide/src/images/math-expressions/poisson.png new file mode 100644 index 00000000000..4f6a023f470 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/poisson.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/polyfit-predict.png b/solr/solr-ref-guide/src/images/math-expressions/polyfit-predict.png new file mode 100644 index 00000000000..58c9f90973c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/polyfit-predict.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/polyfit-resid.png b/solr/solr-ref-guide/src/images/math-expressions/polyfit-resid.png new file mode 100644 index 00000000000..a7130cc5519 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/polyfit-resid.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/polyfit.png b/solr/solr-ref-guide/src/images/math-expressions/polyfit.png new file mode 
100644 index 00000000000..4ce63c098c9 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/polyfit.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/qtime-dist.png b/solr/solr-ref-guide/src/images/math-expressions/qtime-dist.png new file mode 100644 index 00000000000..c077c58ef53 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/qtime-dist.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/qtime-highest-scatter.png b/solr/solr-ref-guide/src/images/math-expressions/qtime-highest-scatter.png new file mode 100644 index 00000000000..de9b2947c52 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/qtime-highest-scatter.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/qtime-scatter.png b/solr/solr-ref-guide/src/images/math-expressions/qtime-scatter.png new file mode 100644 index 00000000000..227d8e4f9e5 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/qtime-scatter.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/qtime-series.png b/solr/solr-ref-guide/src/images/math-expressions/qtime-series.png new file mode 100644 index 00000000000..a4772fbb59c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/qtime-series.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/quantile-plot.png b/solr/solr-ref-guide/src/images/math-expressions/quantile-plot.png new file mode 100644 index 00000000000..5ef4c3937ca Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/quantile-plot.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/quantiles.png b/solr/solr-ref-guide/src/images/math-expressions/quantiles.png new file mode 100644 index 00000000000..85b197d1206 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/quantiles.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/quantiles1.png b/solr/solr-ref-guide/src/images/math-expressions/quantiles1.png new file mode 100644 index 00000000000..106d7d21816 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/quantiles1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/query-ids.png b/solr/solr-ref-guide/src/images/math-expressions/query-ids.png new file mode 100644 index 00000000000..58fdedef7ce Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/query-ids.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/query-qq.png b/solr/solr-ref-guide/src/images/math-expressions/query-qq.png new file mode 100644 index 00000000000..d55f2fea8b1 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/query-qq.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/query-shard-level.png b/solr/solr-ref-guide/src/images/math-expressions/query-shard-level.png new file mode 100644 index 00000000000..02c09f4dffe Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/query-shard-level.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/query-spike.png b/solr/solr-ref-guide/src/images/math-expressions/query-spike.png new file mode 100644 index 00000000000..88c823c46ae Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/query-spike.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/query-top-level.png 
b/solr/solr-ref-guide/src/images/math-expressions/query-top-level.png new file mode 100644 index 00000000000..63f3143bef4 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/query-top-level.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/randomwalk1.png b/solr/solr-ref-guide/src/images/math-expressions/randomwalk1.png new file mode 100644 index 00000000000..04bfda64d94 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/randomwalk1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/randomwalk2.png b/solr/solr-ref-guide/src/images/math-expressions/randomwalk2.png new file mode 100644 index 00000000000..d8c7abc7c13 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/randomwalk2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/randomwalk3.png b/solr/solr-ref-guide/src/images/math-expressions/randomwalk3.png new file mode 100644 index 00000000000..da2a8686b25 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/randomwalk3.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/randomwalk4.png b/solr/solr-ref-guide/src/images/math-expressions/randomwalk4.png new file mode 100644 index 00000000000..3ff89086728 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/randomwalk4.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/randomwalk5.1.png b/solr/solr-ref-guide/src/images/math-expressions/randomwalk5.1.png new file mode 100644 index 00000000000..cc38ee9522f Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/randomwalk5.1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/randomwalk5.png b/solr/solr-ref-guide/src/images/math-expressions/randomwalk5.png new file mode 100644 index 00000000000..0b4656ed92e Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/randomwalk5.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/randomwalk6.png b/solr/solr-ref-guide/src/images/math-expressions/randomwalk6.png new file mode 100644 index 00000000000..f524a513c3c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/randomwalk6.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/ratscatter.png b/solr/solr-ref-guide/src/images/math-expressions/ratscatter.png new file mode 100644 index 00000000000..9e643420e71 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/ratscatter.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/recNum.png b/solr/solr-ref-guide/src/images/math-expressions/recNum.png new file mode 100644 index 00000000000..51a4812b963 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/recNum.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/redwine1.png b/solr/solr-ref-guide/src/images/math-expressions/redwine1.png new file mode 100644 index 00000000000..2b7074a1bfe Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/redwine1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/redwine2.png b/solr/solr-ref-guide/src/images/math-expressions/redwine2.png new file mode 100644 index 00000000000..c876955f30f Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/redwine2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/regression-plot.png 
b/solr/solr-ref-guide/src/images/math-expressions/regression-plot.png new file mode 100644 index 00000000000..e68a7902f00 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/regression-plot.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/residual-plot.png b/solr/solr-ref-guide/src/images/math-expressions/residual-plot.png new file mode 100644 index 00000000000..d39d1cc5d65 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/residual-plot.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/residual-plot2.png b/solr/solr-ref-guide/src/images/math-expressions/residual-plot2.png new file mode 100644 index 00000000000..97b6cb1c818 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/residual-plot2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/residuals.png b/solr/solr-ref-guide/src/images/math-expressions/residuals.png new file mode 100644 index 00000000000..4b9f5e0ee8e Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/residuals.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/sample-overlay.png b/solr/solr-ref-guide/src/images/math-expressions/sample-overlay.png new file mode 100644 index 00000000000..c549594a2df Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/sample-overlay.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/sample-scatter.png b/solr/solr-ref-guide/src/images/math-expressions/sample-scatter.png new file mode 100644 index 00000000000..28d672cdc25 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/sample-scatter.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/sample-scatter1.png b/solr/solr-ref-guide/src/images/math-expressions/sample-scatter1.png new file mode 100644 index 00000000000..b30ec1f1c0a Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/sample-scatter1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/scalar.png b/solr/solr-ref-guide/src/images/math-expressions/scalar.png new file mode 100644 index 00000000000..bad2c7420d7 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/scalar.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/scoring.png b/solr/solr-ref-guide/src/images/math-expressions/scoring.png new file mode 100644 index 00000000000..9d9ae038782 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/scoring.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/search-error.png b/solr/solr-ref-guide/src/images/math-expressions/search-error.png new file mode 100644 index 00000000000..77ff8f31810 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/search-error.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/search-matches.png b/solr/solr-ref-guide/src/images/math-expressions/search-matches.png new file mode 100644 index 00000000000..020b0dd36dd Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/search-matches.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/search-page.png b/solr/solr-ref-guide/src/images/math-expressions/search-page.png new file mode 100644 index 00000000000..24a2f8a905b Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/search-page.png differ diff --git 
a/solr/solr-ref-guide/src/images/math-expressions/search-resort.png b/solr/solr-ref-guide/src/images/math-expressions/search-resort.png new file mode 100644 index 00000000000..094b28cb483 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/search-resort.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/search-sort-plot.png b/solr/solr-ref-guide/src/images/math-expressions/search-sort-plot.png new file mode 100644 index 00000000000..96b0e1dece0 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/search-sort-plot.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/search-sort.png b/solr/solr-ref-guide/src/images/math-expressions/search-sort.png new file mode 100644 index 00000000000..2cdfeacdb1e Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/search-sort.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/search.png b/solr/solr-ref-guide/src/images/math-expressions/search.png new file mode 100644 index 00000000000..ac7db0316fe Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/search.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/search1.png b/solr/solr-ref-guide/src/images/math-expressions/search1.png new file mode 100644 index 00000000000..7126289dfa7 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/search1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/searchiris.png b/solr/solr-ref-guide/src/images/math-expressions/searchiris.png new file mode 100644 index 00000000000..3d79503a2cc Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/searchiris.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/season.png b/solr/solr-ref-guide/src/images/math-expressions/season.png new file mode 100644 index 00000000000..f1cf01bfc0a Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/season.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/seasondiff.png b/solr/solr-ref-guide/src/images/math-expressions/seasondiff.png new file mode 100644 index 00000000000..1f37b5a3cab Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/seasondiff.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/select-math.png b/solr/solr-ref-guide/src/images/math-expressions/select-math.png new file mode 100644 index 00000000000..7d58b750f22 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/select-math.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/select1.png b/solr/solr-ref-guide/src/images/math-expressions/select1.png new file mode 100644 index 00000000000..b9ade0c2893 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/select1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/select2.png b/solr/solr-ref-guide/src/images/math-expressions/select2.png new file mode 100644 index 00000000000..be1ffe0c002 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/select2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/selectconcat.png b/solr/solr-ref-guide/src/images/math-expressions/selectconcat.png new file mode 100644 index 00000000000..f27098b6f55 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/selectconcat.png differ diff --git 
a/solr/solr-ref-guide/src/images/math-expressions/selectupper.png b/solr/solr-ref-guide/src/images/math-expressions/selectupper.png new file mode 100644 index 00000000000..5c0f8a6a7fd Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/selectupper.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/selectuuid.png b/solr/solr-ref-guide/src/images/math-expressions/selectuuid.png new file mode 100644 index 00000000000..c70581f8858 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/selectuuid.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/signal-autocorrelation.png b/solr/solr-ref-guide/src/images/math-expressions/signal-autocorrelation.png index cd24667288d..bb0b2f5dac2 100644 Binary files a/solr/solr-ref-guide/src/images/math-expressions/signal-autocorrelation.png and b/solr/solr-ref-guide/src/images/math-expressions/signal-autocorrelation.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/signal-fft.png b/solr/solr-ref-guide/src/images/math-expressions/signal-fft.png index f70fa467458..5b166c91c23 100644 Binary files a/solr/solr-ref-guide/src/images/math-expressions/signal-fft.png and b/solr/solr-ref-guide/src/images/math-expressions/signal-fft.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/significantTerms2.png b/solr/solr-ref-guide/src/images/math-expressions/significantTerms2.png new file mode 100644 index 00000000000..8d0b990742c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/significantTerms2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/significantTermsCompare.png b/solr/solr-ref-guide/src/images/math-expressions/significantTermsCompare.png new file mode 100644 index 00000000000..4addb4ff011 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/significantTermsCompare.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/sined.png b/solr/solr-ref-guide/src/images/math-expressions/sined.png new file mode 100644 index 00000000000..9e99e0911ca Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/sined.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/sinewave.png b/solr/solr-ref-guide/src/images/math-expressions/sinewave.png index 19d9b93770c..53f77d7d54f 100644 Binary files a/solr/solr-ref-guide/src/images/math-expressions/sinewave.png and b/solr/solr-ref-guide/src/images/math-expressions/sinewave.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/sinewave256.png b/solr/solr-ref-guide/src/images/math-expressions/sinewave256.png index e821057d111..ae221a18417 100644 Binary files a/solr/solr-ref-guide/src/images/math-expressions/sinewave256.png and b/solr/solr-ref-guide/src/images/math-expressions/sinewave256.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/skipping.png b/solr/solr-ref-guide/src/images/math-expressions/skipping.png new file mode 100644 index 00000000000..8de49c682e6 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/skipping.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/slow-nodes.png b/solr/solr-ref-guide/src/images/math-expressions/slow-nodes.png new file mode 100644 index 00000000000..a262c01451f Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/slow-nodes.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/slow-queries.png 
b/solr/solr-ref-guide/src/images/math-expressions/slow-queries.png new file mode 100644 index 00000000000..23d5aa166e1 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/slow-queries.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/spline.png b/solr/solr-ref-guide/src/images/math-expressions/spline.png new file mode 100644 index 00000000000..c9cb5878af9 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/spline.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/sqlagg.png b/solr/solr-ref-guide/src/images/math-expressions/sqlagg.png new file mode 100644 index 00000000000..05486799df1 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/sqlagg.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/stack.png b/solr/solr-ref-guide/src/images/math-expressions/stack.png new file mode 100644 index 00000000000..6aa9533fd5c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/stack.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/standardize.png b/solr/solr-ref-guide/src/images/math-expressions/standardize.png new file mode 100644 index 00000000000..ed4deb6e3a9 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/standardize.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/stats-table.png b/solr/solr-ref-guide/src/images/math-expressions/stats-table.png new file mode 100644 index 00000000000..3e830117abf Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/stats-table.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/stats.png b/solr/solr-ref-guide/src/images/math-expressions/stats.png new file mode 100644 index 00000000000..b0873b685ad Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/stats.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/sterms.png b/solr/solr-ref-guide/src/images/math-expressions/sterms.png new file mode 100644 index 00000000000..da54e6f8cdc Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/sterms.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/stream.png b/solr/solr-ref-guide/src/images/math-expressions/stream.png new file mode 100644 index 00000000000..7a2dbf33820 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/stream.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/striding.png b/solr/solr-ref-guide/src/images/math-expressions/striding.png new file mode 100644 index 00000000000..2be4c950cf0 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/striding.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/table.png b/solr/solr-ref-guide/src/images/math-expressions/table.png new file mode 100644 index 00000000000..e69a92413a4 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/table.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/text-analytics.png b/solr/solr-ref-guide/src/images/math-expressions/text-analytics.png new file mode 100644 index 00000000000..946c4358657 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/text-analytics.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/timecompare.png b/solr/solr-ref-guide/src/images/math-expressions/timecompare.png new file mode 100644 index 00000000000..6262489c815 Binary 
files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/timecompare.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/timemodel.png b/solr/solr-ref-guide/src/images/math-expressions/timemodel.png new file mode 100644 index 00000000000..a9fe5580f62 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/timemodel.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/timeseries.png b/solr/solr-ref-guide/src/images/math-expressions/timeseries.png new file mode 100644 index 00000000000..cf259e13247 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/timeseries.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/timeseries1.png b/solr/solr-ref-guide/src/images/math-expressions/timeseries1.png new file mode 100644 index 00000000000..feb596d6b3c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/timeseries1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/timeseries2.png b/solr/solr-ref-guide/src/images/math-expressions/timeseries2.png new file mode 100644 index 00000000000..7f559cde112 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/timeseries2.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/trend.png b/solr/solr-ref-guide/src/images/math-expressions/trend.png new file mode 100644 index 00000000000..1a3ee030877 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/trend.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/triangular.png b/solr/solr-ref-guide/src/images/math-expressions/triangular.png new file mode 100644 index 00000000000..0c30f6fe66e Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/triangular.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/uniform.png b/solr/solr-ref-guide/src/images/math-expressions/uniform.png new file mode 100644 index 00000000000..c6dfc7ac12c Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/uniform.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/uniformr.png b/solr/solr-ref-guide/src/images/math-expressions/uniformr.png new file mode 100644 index 00000000000..a0bb747ff9b Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/uniformr.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/unitize.png b/solr/solr-ref-guide/src/images/math-expressions/unitize.png new file mode 100644 index 00000000000..c806721d740 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/unitize.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/univariate.png b/solr/solr-ref-guide/src/images/math-expressions/univariate.png new file mode 100644 index 00000000000..c9356394587 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/univariate.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/update.png b/solr/solr-ref-guide/src/images/math-expressions/update.png new file mode 100644 index 00000000000..396e5289af0 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/update.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/valueat.png b/solr/solr-ref-guide/src/images/math-expressions/valueat.png new file mode 100644 index 00000000000..59b9cb0eb7d Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/valueat.png differ diff --git 
a/solr/solr-ref-guide/src/images/math-expressions/variables.png b/solr/solr-ref-guide/src/images/math-expressions/variables.png new file mode 100644 index 00000000000..0e3b65cf9ce Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/variables.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/variables1.png b/solr/solr-ref-guide/src/images/math-expressions/variables1.png new file mode 100644 index 00000000000..154388b6589 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/variables1.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/vector.png b/solr/solr-ref-guide/src/images/math-expressions/vector.png new file mode 100644 index 00000000000..2845ce88fa6 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/vector.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/weibull.png b/solr/solr-ref-guide/src/images/math-expressions/weibull.png new file mode 100644 index 00000000000..1366b42e8e2 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/weibull.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/xy.png b/solr/solr-ref-guide/src/images/math-expressions/xy.png new file mode 100644 index 00000000000..a295b851f5f Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/xy.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/xyscatter.png b/solr/solr-ref-guide/src/images/math-expressions/xyscatter.png new file mode 100644 index 00000000000..8f26f61aba2 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/xyscatter.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/zepconf.png b/solr/solr-ref-guide/src/images/math-expressions/zepconf.png new file mode 100644 index 00000000000..81400ae26f6 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/zepconf.png differ diff --git a/solr/solr-ref-guide/src/images/math-expressions/zipf.png b/solr/solr-ref-guide/src/images/math-expressions/zipf.png new file mode 100644 index 00000000000..8a65093bc71 Binary files /dev/null and b/solr/solr-ref-guide/src/images/math-expressions/zipf.png differ
diff --git a/solr/solr-ref-guide/src/loading.adoc b/solr/solr-ref-guide/src/loading.adoc
new file mode 100644
index 00000000000..86bfa267531
--- /dev/null
+++ b/solr/solr-ref-guide/src/loading.adoc
@@ -0,0 +1,542 @@
+= Loading Data
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+Streaming expressions has support for reading, parsing, transforming, visualizing and loading CSV and TSV formatted data.
+These functions are designed to cut down the time spent on data preparation and allow users to begin data exploration before the data is loaded into Solr.
+
+== Reading Files
+
+The `cat` function can be used to read files under the *userfiles* directory in
+`$SOLR_HOME`. The `cat` function takes two parameters.
+
+The first parameter is a comma-delimited list of paths.
+If the path list contains directories, `cat` will crawl all the files in the directory and sub-directories.
+If the path list contains only files, `cat` will read just the specified files.
+
+The second parameter, `maxLines`, tells `cat` how many lines to read in total.
+If `maxLines` is not provided, `cat` will read all lines from each file it crawls.
+
+The `cat` function reads each line (up to `maxLines`) in the crawled files and for each line emits a tuple with two fields:
+
+* `line`: The text in the line.
+* `file`: The relative path of the file under `$SOLR_HOME`.
+
+Below is an example of `cat` on the `iris.csv` file with a `maxLines` of `5`:
+
+[source,text]
+----
+cat("iris.csv", maxLines="5")
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "line": "sepal_length,sepal_width,petal_length,petal_width,species",
+        "file": "iris.csv"
+      },
+      {
+        "line": "5.1,3.5,1.4,0.2,setosa",
+        "file": "iris.csv"
+      },
+      {
+        "line": "4.9,3,1.4,0.2,setosa",
+        "file": "iris.csv"
+      },
+      {
+        "line": "4.7,3.2,1.3,0.2,setosa",
+        "file": "iris.csv"
+      },
+      {
+        "line": "4.6,3.1,1.5,0.2,setosa",
+        "file": "iris.csv"
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+== Parsing CSV and TSV Files
+
+The `parseCSV` and `parseTSV` functions wrap the `cat` function and parse CSV
+(comma-separated values) and TSV (tab-separated values) files. Both of these functions
+expect a CSV or TSV header record at the beginning of each file.
+
+Both `parseCSV` and `parseTSV` emit tuples with the header values mapped to their
+corresponding values in each line.
+
+[source,text]
+----
+parseCSV(cat("iris.csv", maxLines="5"))
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "sepal_width": "3.5",
+        "species": "setosa",
+        "petal_width": "0.2",
+        "sepal_length": "5.1",
+        "id": "iris.csv_2",
+        "petal_length": "1.4"
+      },
+      {
+        "sepal_width": "3",
+        "species": "setosa",
+        "petal_width": "0.2",
+        "sepal_length": "4.9",
+        "id": "iris.csv_3",
+        "petal_length": "1.4"
+      },
+      {
+        "sepal_width": "3.2",
+        "species": "setosa",
+        "petal_width": "0.2",
+        "sepal_length": "4.7",
+        "id": "iris.csv_4",
+        "petal_length": "1.3"
+      },
+      {
+        "sepal_width": "3.1",
+        "species": "setosa",
+        "petal_width": "0.2",
+        "sepal_length": "4.6",
+        "id": "iris.csv_5",
+        "petal_length": "1.5"
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 1
+      }
+    ]
+  }
+}
+----
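+Only a `parseCSV` example is shown above; `parseTSV` follows the same pattern for tab-delimited files. Below is a minimal sketch, assuming a hypothetical `iris.tsv` file has been uploaded to the *userfiles* directory:
+
+[source,text]
+----
+parseTSV(cat("iris.tsv", maxLines="5"))
+----
+
+The emitted tuples would have the same shape as the `parseCSV` output above, with the header values mapped to the values in each tab-separated line.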
+
+image::images/math-expressions/csv.png[]
+
+== Selecting Fields and Field Types
+
+The `select` function can be used to select specific fields from
+the CSV file and map them to new field names for indexing.
+
+Fields in the CSV file can be mapped to field names with
+dynamic field suffixes. This approach allows for fine-grained
+control over schema field types without having to make any
+changes to schema files.
+
+Below is an example of selecting fields and mapping them
+to specific field types.
+
+image::images/math-expressions/csvselect.png[]
+
+== Loading Data
+
+When the data is ready to load, the `update` function can be used to send the
+data to a SolrCloud collection for indexing.
+The `update` function adds documents to Solr in batches and returns a tuple for each batch with summary information about the batch and load.
+
+In the example below the `update` expression is run using Zeppelin-Solr because the data set is small.
+For larger loads it's best to run the load from a curl command where the output of the `update` function can be spooled to disk.
+
+image::images/math-expressions/update.png[]
+
+== Transforming Data
+
+Streaming expressions and math expressions provide a powerful set of functions
+for transforming data.
+The section below shows some useful transformations that can be applied while analyzing, visualizing, and loading CSV and TSV files.
+
+=== Unique IDs
+
+Both `parseCSV` and `parseTSV` emit an *id* field if one is not present in the data already.
+The *id* field is a concatenation of the file path and the line number. This is a
+convenient way to ensure that records have consistent ids if an id
+is not present in the file.
+
+You can also map any fields in the file to the id field using the `select` function.
+The `concat` function can be used to concatenate two or more fields in the file
+to create an id. Or the `uuid` function can be used to create a random unique id. If
+the `uuid` function is used, the data cannot be reloaded without first deleting
+the data, as the `uuid` function does not produce the same id for each document
+on subsequent loads.
+
+Below is an example using the `concat` function to create a new id.
+
+image::images/math-expressions/selectconcat.png[]
+
+Below is an example using the `uuid` function to create a new id.
+
+image::images/math-expressions/selectuuid.png[]
+
+=== Record Numbers
+
+The `recNum` function can be used inside of a `select` function to add a record number
+to each tuple. The record number is useful for tracking location in the result set
+and can be used for filtering strategies such as skipping, paging, and striding, described in
+the <<filtering-results,Filtering Results>> section below.
+
+The example below shows the syntax of the `recNum` function:
+
+image::images/math-expressions/recNum.png[]
+
+
+=== Parsing Dates
+
+The `dateTime` function can be used to parse dates into the ISO-8601 format
+needed for loading into a Solr date field.
+
+We can first inspect the format of the datetime field in the CSV file:
+
+[source,text]
+----
+select(parseCSV(cat("yr2017.csv", maxLines="2")),
+       id,
+       Created.Date)
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "id": "yr2017.csv_2",
+        "Created.Date": "01/01/2017 12:00:00 AM"
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+Then we can use the `dateTime` function to format the datetime and
+map it to a Solr date field.
+
+The `dateTime` function takes three parameters:
+the field in the data containing the date string; a template to parse the date, using a Java https://docs.oracle.com/javase/9/docs/api/java/text/SimpleDateFormat.html[`SimpleDateFormat` template];
+and an optional time zone.
+
+If the time zone is not present, the time zone defaults to GMT unless
+it's included in the date string itself.
+
+Below is an example of the `dateTime` function applied to the date format
+in the example above.
+
+[source,text]
+----
+select(parseCSV(cat("yr2017.csv", maxLines="2")),
+       id,
+       dateTime(Created.Date, "MM/dd/yyyy hh:mm:ss a", "EST") as cdate_dt)
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "cdate_dt": "2017-01-01T05:00:00Z",
+        "id": "yr2017.csv_2"
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 1
+      }
+    ]
+  }
+}
+----
+
+=== String Manipulation
+
+The `upper`, `lower`, `split`, `valueAt`, `trim`, and `concat` functions can be used to manipulate
+strings inside of the `select` function.
+
+The example below shows the `upper` function used to uppercase the *species*
+field.
+
+image::images/math-expressions/selectupper.png[]
+
+The example below shows the `split` function, which splits a field on
+a delimiter. This can be used to create multi-value fields from fields
+with an internal delimiter.
+
+The example below demonstrates this with a direct call to
+the `/stream` handler:
+
+[source,text]
+----
+select(parseCSV(cat("iris.csv")),
+       id,
+       split(id, "_") as parts_ss,
+       species as species_s,
+       sepal_length as sepal_length_d,
+       sepal_width as sepal_width_d,
+       petal_length as petal_length_d,
+       petal_width as petal_width_d)
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "petal_width_d": "0.2",
+        "sepal_width_d": "3.5",
+        "id": "iris.csv_2",
+        "petal_length_d": "1.4",
+        "species_s": "setosa",
+        "sepal_length_d": "5.1",
+        "parts_ss": [
+          "iris.csv",
+          "2"
+        ]
+      },
+      {
+        "petal_width_d": "0.2",
+        "sepal_width_d": "3",
+        "id": "iris.csv_3",
+        "petal_length_d": "1.4",
+        "species_s": "setosa",
+        "sepal_length_d": "4.9",
+        "parts_ss": [
+          "iris.csv",
+          "3"
+        ]
+      }]}}
+----
+
+The `valueAt` function can be used to select a specific index from
+a split array.
+
+image::images/math-expressions/valueat.png[]
+
+=== Filtering Results
+
+The `having` function can be used to filter records.
+Filtering can be used to systematically explore specific record sets before indexing or to filter records that are sent for indexing.
+The `having` function wraps another stream and applies a boolean function to each tuple.
+If the boolean logic function returns true, the tuple is returned.
+
+The following boolean functions are supported: `eq`, `gt`, `gteq`, `lt`, `lteq`, `matches`, `and`, `or`,
+`not`, `notNull`, `isNull`.
+
+Below are some strategies for using the `having` function to filter records.
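+
+As a point of reference, the sketch below shows the general shape of a `having`
+filter over parsed CSV records. It reuses the `iris.csv` file and the `recNum`
+function from the earlier examples; any of the boolean functions above can take
+the place of `eq`:
+
+[source,text]
+----
+having(select(parseCSV(cat("iris.csv")),
+              id,
+              recNum() as rec),
+       eq(rec, 100))
+----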
+
+==== Finding a Specific Id or Record Number
+
+The `eq` (equals) function can be used with the `having` expression to filter the result set
+to a single record number:
+
+image::images/math-expressions/havingId.png[]
+
+==== Skipping
+
+The `gt` (greater than) function can be used on the `recNum` field to filter the result set to
+records with a `recNum` greater than a specific value:
+
+image::images/math-expressions/skipping.png[]
+
+==== Paging
+
+The `and` function with nested `lt` and `gt` functions can be used to select records within a specific
+record number range:
+
+image::images/math-expressions/paging.png[]
+
+==== Striding
+
+The `eq` function with a nested `mod` function can be used to stride through the data at specific
+record number intervals. This allows for a sample to be taken at different intervals in the data
+in a systematic way.
+
+image::images/math-expressions/striding.png[]
+
+==== Regex Matching
+
+The `matches` function can be used to test if a field in the record matches a specific
+regular expression. This provides a powerful *grep*-like capability over the record set.
+
+image::images/math-expressions/matches.png[]
+
+=== Handling Nulls
+
+In most cases nulls do not need to be handled directly unless there is specific logic needed
+to handle nulls during the load.
+
+The `select` function does not output fields that contain a null value.
+This means that as nulls are encountered in the data, the fields are not included in the tuples.
+
+The string manipulation functions all return null if they encounter a null.
+This means the null will be passed through to the `select` function and the fields with nulls will simply be left off the record.
+
+In certain scenarios it can be important to directly filter or replace nulls.
+The sections below cover these scenarios.
+
+==== Filtering Nulls
+
+The `having` function can be combined with the `isNull` and `notNull` functions to filter records that contain null
+values.
+
+In the example below the `having` function returns zero documents because the `notNull` function is applied to
+*field_a* in each tuple.
+
+image::images/math-expressions/havingNotNull.png[]
+
+In the example below the `having` function returns all documents because the `isNull` function is applied to
+*field_a* in each tuple.
+
+image::images/math-expressions/havingIsNull.png[]
+
+==== Replacing Nulls
+
+The `if` function can be combined with the `isNull` and `notNull` functions to replace null values inside a `select` function.
+
+In the example below the `if` function applies the `isNull` boolean expression to two different fields.
+
+In the first example it replaces null *petal_width* values with 0, and returns the *petal_width* if present.
+In the second example it replaces null *field1* values with the string literal "NA" and returns *field1* if present.
+
+image::images/math-expressions/ifIsNull.png[]
+
+=== Text Analysis
+
+The `analyze` function can be used from inside a `select` function to analyze
+a text field with a Lucene/Solr analyzer.
+The output of `analyze` is a list of analyzed tokens which can be added to each tuple as a multi-valued field.
+
+The multi-valued field can then be sent to Solr for indexing, or the `cartesianProduct`
+function can be used to expand the list of tokens to a stream of tuples.
+
+There are a number of interesting use cases for the `analyze` function:
+
+* Previewing the output of different analyzers before indexing.
+* Annotating documents with NLP-generated tokens (entity extraction, noun phrases, etc.)
+before the documents reach the indexing pipeline.
+This removes heavy NLP processing from the servers that may also be handling queries. It also allows
+more compute resources to be applied to the NLP indexing than is available on the search cluster.
+* Using the `cartesianProduct` function, the analyzed tokens can be indexed as individual documents, which allows
+the analyzed tokens to be searched and analyzed with Solr's aggregation and graph expressions.
+* Also using `cartesianProduct`, the analyzed tokens can be aggregated, analyzed, and visualized using
+streaming expressions directly before indexing occurs.
+
+
+Below is an example of the `analyze` function being applied to the *Resolution.Description*
+field in the tuples. The *\_text_* field's analyzer is used to analyze the text and the
+analyzed tokens are added to the documents in the *tokens_ss* field.
+
+[source,text]
+----
+select(parseCSV(cat("yr2017.csv", maxLines="2")),
+       Resolution.Description,
+       analyze(Resolution.Description, _text_) as tokens_ss)
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "Resolution.Description": "The Department of Health and Mental Hygiene will review your complaint to determine appropriate action. Complaints of this type usually result in an inspection. Please call 311 in 30 days from the date of your complaint for status",
+        "tokens_ss": [
+          "department",
+          "health",
+          "mental",
+          "hygiene",
+          "review",
+          "your",
+          "complaint",
+          "determine",
+          "appropriate",
+          "action",
+          "complaints",
+          "type",
+          "usually",
+          "result",
+          "inspection",
+          "please",
+          "call",
+          "311",
+          "30",
+          "days",
+          "from",
+          "date",
+          "your",
+          "complaint",
+          "status"
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+The example below shows the `cartesianProduct` function expanding the analyzed terms in the `term_s` field into
+their own documents. Notice that the other fields from the document are maintained with each term. This allows each term
+to be indexed in a separate document so the relationships between terms and the other fields can be explored through
+graph expressions or aggregations.
+
+
+image::images/math-expressions/cartesian.png[]
diff --git a/solr/solr-ref-guide/src/logs.adoc b/solr/solr-ref-guide/src/logs.adoc
new file mode 100644
index 00000000000..489b542711b
--- /dev/null
+++ b/solr/solr-ref-guide/src/logs.adoc
@@ -0,0 +1,390 @@
+= Log Analytics
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+This section of the user guide provides an introduction to Solr log analytics.
+
+NOTE: This is an appendix of the math expressions user guide. All the functions described below are covered in detail in the guide.
+See the <> chapter to learn how to get started with visualizations and Apache Zeppelin.
+
+== Loading
+
+The out-of-the-box Solr log format can be loaded into a Solr index using the `bin/postlogs` command line tool
+located in the `bin/` directory of the Solr distribution.
+
+NOTE: If working from the source distribution, the
+distribution must first be built before `postlogs` can be run.
+
+The `postlogs` script is designed to be run from the root directory of the Solr distribution.
+
+The `postlogs` script takes two parameters:
+
+* Solr base URL (with collection): `http://localhost:8983/solr/logs`
+* File path to root of the logs directory: All files found under this directory (including sub-directories) will be indexed.
+If the path points to a single log file, only that log file will be loaded.
+
+Below is a sample execution of the `postlogs` tool:
+
+[source,text]
+----
+./bin/postlogs http://localhost:8983/solr/logs /var/logs/solrlogs
+----
+
+The example above will index all the log files under `/var/logs/solrlogs` to the `logs` collection found at the base url `http://localhost:8983/solr`.
+
+== Exploring
+
+Log exploration is often the first step in log analytics and visualization.
+
+When working with unfamiliar installations, exploration can be used to understand which collections are
+covered in the logs, what shards and cores are in those collections, and the types of operations being
+performed on those collections.
+
+Even with familiar Solr installations, exploration is still extremely
+important while troubleshooting because it will often turn up surprises such as unknown errors or
+unexpected admin or indexing operations.
+
+=== Sampling
+
+The first step in exploration is to take a random sample from the `logs` collection
+with the `random` function.
+
+In the example below the `random` function is run with one
+parameter, which is the name of the collection to sample.
+
+image::images/math-expressions/logs-sample.png[]
+
+The sample contains 500 random records with their full field list. By looking
+at this sample we can quickly learn about the *fields* available in the `logs` collection.
+
+=== Time Period
+
+Each log record contains a timestamp in the `date_dt` field.
+It's often useful to understand what time period the logs cover and how many log records have been
+indexed.
+
+The `stats` function can be run to display this information.
+
+image::images/math-expressions/logs-dates.png[]
+
+
+=== Record Types
+
+One of the key fields in the index is the `type_s` field, which is the type of log
+record.
+
+The `facet` expression can be used to visualize the different types of log records and how many
+records of each type are in the index.
+
+image::images/math-expressions/logs-type.png[]
+
+
+=== Collections
+
+Another important field is the `collection_s` field, which is the collection that the
+log record was generated from.
+
+The `facet` expression can be used to visualize the different collections and how many log records
+they generate.
+
+image::images/math-expressions/logs-collection.png[]
+
+
+=== Record Type by Collection
+
+A two-dimensional `facet` can be run to visualize the record types by collection.
+
+image::images/math-expressions/logs-type-collection.png[]
+
+
+=== Time Series
+
+The `timeseries` function can be used to visualize a time series for a specific time range
+of the logs.
+
+In the example below a time series is used to visualize the log record counts
+at 15 second intervals.
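+
+An expression along the following lines produces such a series. The collection
+name matches the `logs` collection loaded above, but the `start` and `end` times
+are illustrative and should be adjusted to the period covered by your logs:
+
+[source,text]
+----
+timeseries(logs,
+           q="*:*",
+           field="date_dt",
+           start="2021-04-09T21:00:00Z",
+           end="2021-04-09T22:00:00Z",
+           gap="+15SECOND",
+           count(*))
+----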
+
+image::images/math-expressions/logs-time-series.png[]
+
+Notice that there is a very low level of log activity up until hour 21, minute 27.
+Then a burst of log activity occurs from minute 27 to minute 52.
+
+This is then followed by a large spike of log activity.
+
+The example below breaks this down further by adding a query on the `type_s` field to only
+visualize *query* activity in the log.
+
+
+image::images/math-expressions/logs-time-series2.png[]
+
+Notice the query activity accounts for more than half of the burst of log records between
+21:27 and 21:52. But the query activity does not account for the large spike in
+log activity that follows.
+
+We can account for that spike by changing the search to include only *update*, *commit*,
+and *deleteByQuery* records in the logs. We can also narrow by collection
+so we know where these activities are taking place.
+
+
+image::images/math-expressions/logs-time-series3.png[]
+
+Through the various exploratory queries and visualizations we now have a much
+better understanding of what's contained in the logs.
+
+
+== Query Counting
+
+Distributed searches produce more than one log record for each query. There will be one *top level* log
+record for
+the top level distributed query and a *shard level* log record on one replica from each shard. There may also
+be a set of *ids* queries to retrieve fields by id from the shards to complete the page of results.
+
+There are fields in the log index that can be used to differentiate between the three types of query records.
+
+The examples below use the `stats` function to count the different types of query records in the logs.
+The same queries can be used with the `search`, `random`, and `timeseries` functions to return results
+for specific types of query records.
+
+=== Top Level Queries
+
+To find all the top level queries in the logs, add a query to limit results to log records with `distrib_s:true` as follows:
+
+image::images/math-expressions/query-top-level.png[]
+
+
+=== Shard Level Queries
+
+To find all the shard level queries that are not *ids* queries, adjust the query to limit results to logs with `distrib_s:false AND ids_s:false`
+as follows:
+
+image::images/math-expressions/query-shard-level.png[]
+
+
+=== ID Queries
+
+To find all the *ids* queries, adjust the query to limit results to logs with `distrib_s:false AND ids_s:true`
+as follows:
+
+image::images/math-expressions/query-ids.png[]
+
+
+== Query Performance
+
+One of the important tasks of Solr log analytics is understanding how well a Solr cluster is performing.
+
+The `qtime_i` field contains the query time (QTime) in milliseconds
+from the log records.
+There are a number of powerful visualizations and statistical approaches for analyzing query performance.
+
+
+=== QTime Scatter Plot
+
+Scatter plots can be used to visualize random samples of the `qtime_i`
+field.
+The example below demonstrates a scatter plot of 500 random samples
+from the `ptest1` collection of log records.
+
+In this example, `qtime_i` is plotted on the y-axis and the x-axis is simply a sequence to spread the query times out across the plot.
+
+NOTE: The `x` field is included in the field list.
+The `random` function automatically generates a sequence for the x-axis when `x` is included in the field list.
+
+image::images/math-expressions/qtime-scatter.png[]
+
+From this scatter plot we can tell a number of important things about the query times:
+
+* The sample query times range from a low of 122ms to a high of 643ms.
+* The mean appears to be just above 400ms.
+* The query times tend to cluster closer to the mean and become less frequent as they move away
+from the mean.
+
+
+=== Highest QTime Scatter Plot
+
+It's often useful to be able to visualize the highest query times recorded in the log data.
+This can be done by using the `search` function and sorting on `qtime_i desc`.
+
+In the example below the `search` function returns the highest 500 query times from the `ptest1` collection and sets the results to the variable `a`.
+Then the `col` function is used to extract the `qtime_i` column from the result set into a vector, which is set to variable `y`.
+
+Then the `zplot` function is used to plot the query times on the y-axis of the scatter plot.
+
+NOTE: The `rev` function is used to reverse the query times vector so the visualization displays from lowest to highest query times.
+
+image::images/math-expressions/qtime-highest-scatter.png[]
+
+From this plot we can see that the 500 highest query times start at 510ms and slowly move higher, until the last 10 spike upwards, culminating at the highest query time of 2529ms.
+
+
+=== QTime Distribution
+
+In this example a visualization is created which shows the
+distribution of query times rounded to the nearest second.
+
+The example below starts by taking a random sample of 10000 log records with a `type_s` of `query`.
+The results of the `random` function are assigned to the variable `a`.
+
+The `col` function is then used to extract the `qtime_i` field from the results.
+The vector of query times is set to variable `b`.
+
+The `scalarDivide` function is then used to divide all elements of the query time vector by 1000.
+This converts the query times from milliseconds to seconds.
+The result is set to variable `c`.
+
+The `round` function then rounds all elements of the query times vector to the nearest second.
+This means all query times less than 500ms will round to 0.
+
+The `freqTable` function is then applied to the vector of query times rounded to
+the nearest second.
+
+The resulting frequency table is shown in the visualization below.
+The x-axis is the number of seconds.
+The y-axis is the number of query times that rounded to each second.
+
+image::images/math-expressions/qtime-dist.png[]
+
+Notice that roughly 93 percent of the query times rounded to 0, meaning they were under 500ms.
+About 6 percent rounded to 1 and the rest rounded to either 2 or 3 seconds.
+
+
+=== QTime Percentiles Plot
+
+A percentile plot is another powerful tool for understanding the distribution of query times in the logs.
+The example below demonstrates how to create and interpret percentile plots.
+
+In this example an `array` of percentiles is created and set to variable `p`.
+
+Then a random sample of 10000 log records is drawn and set to variable `a`.
+The `col` function is then used to extract the `qtime_i` field from the sample results and this vector is set to variable `b`.
+
+The `percentile` function is then used to calculate the value at each percentile for the vector of query times.
+The array of percentiles set to variable `p` tells the `percentile` function
+which percentiles to calculate.
+
+Then the `zplot` function is used to plot the *percentiles* on the x-axis and
+the *query time* at each percentile on the y-axis.
+
+image::images/math-expressions/query-qq.png[]
+
+From the plot we can see that the 80th percentile has a query time of 464ms.
+This means that 80 percent of queries are below 464ms.
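+
+A sketch of the full expression behind a percentile plot of this kind is shown
+below. The specific percentile values and the query are illustrative:
+
+[source,text]
+----
+let(p=array(1, 5, 10, 25, 50, 75, 80, 90, 95, 99),
+    a=random(logs, q="type_s:query", rows="10000", fl="qtime_i"),
+    b=col(a, qtime_i),
+    percentiles=percentile(b, p),
+    zplot(x=p, y=percentiles))
+----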
+
+=== QTime Time Series
+
+A time series aggregation can also be run to visualize how QTime changes over time.
+
+The example below shows a time series area chart that visualizes *average query time* at 15 second intervals for a 3 minute section of a log.
+
+image::images/math-expressions/qtime-series.png[]
+
+
+== Performance Troubleshooting
+
+If query analysis determines that queries are not performing as expected, then log analysis can also be used to troubleshoot the cause of the slowness.
+The section below demonstrates several approaches for locating the source of query slowness.
+
+=== Slow Nodes
+
+In a distributed search the final search performance is only as fast as the slowest responding shard in the cluster.
+Therefore one slow node can be responsible for slow overall search time.
+
+The fields `core_s`, `replica_s`, and `shard_s` are available in the log records.
+These fields allow average query time to be calculated by *core*, *replica*, or *shard*.
+
+The `core_s` field is particularly useful as it's the most granular element and
+the naming convention often includes the collection, shard, and replica information.
+
+The example below uses the `facet` function to calculate `avg(qtime_i)` by core.
+
+image::images/math-expressions/slow-nodes.png[]
+
+Notice in the results that the `core_s` field contains information about the
+*collection*, *shard*, and *replica*.
+The example also shows that query time seems to be significantly higher for certain cores in the same collection.
+This should trigger a deeper investigation as to why those cores might be performing more slowly.
+
+=== Slow Queries
+
+If query analysis shows that most queries are performing well but there are slow outliers, one possible cause is that certain queries are inherently slow.
+
+The `q_s` and `q_t` fields both hold the value of the *q* parameter from Solr requests.
+The `q_s` field is a string field and the `q_t` field has been tokenized.
+
+The `search` function can be used to return the top N slowest queries in the logs by sorting the results by `qtime_i desc`. The example
+below demonstrates this:
+
+image::images/math-expressions/slow-queries.png[]
+
+Once the queries have been retrieved, they can be inspected and tried individually to determine if the query is consistently slow.
+If the query is shown to be slow, a plan to improve the query performance
+can be devised.
+
+=== Commits
+
+Commits and activities that cause commits, such as full index replications, can result in slower query performance.
+Time series visualization can help to determine if commits are
+related to degraded performance.
+
+The first step is to visualize the query performance issue.
+The time series below limits the log results to records of type `query` and computes the `max(qtime_i)` at ten minute intervals.
+The plot shows the day, hour, and minute on the x-axis and `max(qtime_i)` in milliseconds on the y-axis.
+Notice there are some extreme spikes in max `qtime_i` that need to be understood.
+
+image::images/math-expressions/query-spike.png[]
+
+
+The next step is to generate a time series that counts commits across the same time intervals.
+The time series below uses the same `start`, `end`, and `gap` as the initial time series.
+But this time series is computed for records that have a type of `commit`.
+The count for the commits is calculated and plotted on the y-axis.
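+
+An expression along these lines produces the commit series. The window values
+are hypothetical and should match the `start`, `end`, and `gap` used for the
+query time series:
+
+[source,text]
+----
+timeseries(logs,
+           q="type_s:commit",
+           field="date_dt",
+           start="2021-04-09T00:00:00Z",
+           end="2021-04-10T00:00:00Z",
+           gap="+10MINUTE",
+           count(*))
+----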
+
+image::images/math-expressions/commit-series.png[]
+
+Notice that there are spikes in commit activity that appear near the spikes in max `qtime_i`.
+
+The final step is to overlay the two time series in the same plot.
+
+This is done by performing both time series and setting the results to variables, in this case
+`a` and `b`.
+
+Then the `date_dt` and `max(qtime_i)` fields are extracted as vectors from the first time series and set to variables using the `col` function.
+The `count(*)` field is extracted from the second time series.
+
+The `zplot` function is then used to plot the timestamp vector on the x-axis and the max query time and commit count vectors on the y-axis.
+
+NOTE: The `minMaxScale` function is used to scale both vectors
+between 0 and 1 so they can be visually compared on the same plot.
+
+image::images/math-expressions/overlay-series.png[]
+
+Notice in this plot that the commit count seems to be closely related to spikes
+in max `qtime_i`.
+
+== Errors
+
+The log index will contain any error records found in the logs. Error records will have a `type_s` field value of `error`.
+
+The example below searches for error records:
+
+image::images/math-expressions/search-error.png[]
+
+
+If the error is followed by a stack trace, the stack trace will be present in the searchable field `stack_t`.
+The example below shows a search on the `stack_t` field and the stack trace presented in the result.
+
+image::images/math-expressions/stack.png[]
diff --git a/solr/solr-ref-guide/src/machine-learning.adoc b/solr/solr-ref-guide/src/machine-learning.adoc
index 59a67744dfc..1c9bc8882aa 100644
--- a/solr/solr-ref-guide/src/machine-learning.adoc
+++ b/solr/solr-ref-guide/src/machine-learning.adoc
@@ -20,20 +20,653 @@ This section of the math expressions user guide covers machine learning functions.
 
+== Distance and Distance Matrices
+
+The `distance` function computes the distance for two numeric arrays or a distance matrix for the columns of a matrix.
+
+There are six distance measure functions that return a function that performs the actual distance calculation:
+
+* `euclidean` (default)
+* `manhattan`
+* `canberra`
+* `earthMovers`
+* `cosine`
+* `haversineMeters` (Geospatial distance measure)
+
+The distance measure functions can be used with all machine learning functions
+that support distance measures.
+
+Below is an example for computing Euclidean distance for two numeric arrays:
+
+[source,text]
+----
+let(a=array(20, 30, 40, 50),
+    b=array(21, 29, 41, 49),
+    c=distance(a, b))
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "c": 2
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+Below, the distance is calculated using Manhattan distance.
+
+[source,text]
+----
+let(a=array(20, 30, 40, 50),
+    b=array(21, 29, 41, 49),
+    c=distance(a, b, manhattan()))
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "c": 4
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 1
+      }
+    ]
+  }
+}
+----
+
+=== Distance Matrices
+
+Distance matrices are powerful tools for visualizing the distance
+between two or more
+vectors.
+
+The `distance` function builds a distance matrix
+if a matrix is passed as the parameter. The distance matrix is computed for the *columns*
+of the matrix.
+
+The example below demonstrates the power of distance matrices combined with two-dimensional faceting.
+ +In this example the `facet2D` function is used to generate a two dimensional facet aggregation +over the fields `complaint_type_s` and `zip_s` from the `nyc311` complaints database. +The *top 20* complaint types and the *top 25* zip codes for each complaint type are aggregated. +The result is a stream of tuples each containing the fields `complaint_type_s`, `zip_s` and the count for the pair. + +The `pivot` function is then used to pivot the fields into a *matrix* with the `zip_s` +field as the *rows* and the `complaint_type_s` field as the *columns*. The `count(*)` field populates +the values in the cells of the matrix. + +The `distance` function is then used to compute the distance matrix for the columns +of the matrix using `cosine` distance. This produces a distance matrix +that shows distance between complaint types based on the zip codes they appear in. + +Finally the `zplot` function is used to plot the distance matrix as a heat map. Notice that the +heat map has been configured so that the intensity of color increases as the distance between vectors +decreases. + + +image::images/math-expressions/distance.png[] + +The heat map is interactive, so mousing over one of the cells pops up the values +for the cell. + +image::images/math-expressions/distanceview.png[] + +Notice that HEAT/HOT WATER and UNSANITARY CONDITION complaints have a cosine distance of .1 (rounded to the nearest +tenth). + + +== K-Nearest Neighbor (KNN) + +The `knn` function searches the rows of a matrix with a search vector and +returns a matrix of the k-nearest neighbors. This allows for secondary vector +searches over result sets. + +The `knn` function supports changing of the distance measure by providing one of the following +distance measure functions: + +* `euclidean` (Default) +* `manhattan` +* `canberra` +* `earthMovers` +* `cosine` +* `haversineMeters` (Geospatial distance measure) + +The example below shows how to perform a secondary search over an aggregation +result set. The goal of the example is to find zip codes in the nyc311 complaint +database that have similar complaint types to the zip code 10280. + +The first step in the example is to use the `facet2D` function to perform a two +dimensional aggregation over the `zip_s` and `complaint_type_s` fields. In the example +the top 119 zip codes and top 5 complaint types for each zip code are calculated +for the borough of Manhattan. The result is a list of tuples each containing +the `zip_s`, `complaint_type_s` and the `count(*)` for the combination. + +The list of tuples is then *pivoted* into a matrix with the `pivot` function. +The `pivot` function in this example returns a matrix with rows of zip codes +and columns of complaint types. +The `count(*)` field from the tuples populates the cells of the matrix. +This matrix will be used as the secondary search matrix. + +The next step is to locate the vector for the 10280 zip code. +This is done in three steps in the example. +The first step is to retrieve the row labels from the matrix with the `getRowLabels` function. +The row labels in this case are zip codes which were populated by the `pivot` function. +Then the `indexOf` function is used to find the *index* of the "10280" zip code in the list of row labels. +The `rowAt` function is then used to return the vector at that *index* from the matrix. +This vector is the *search vector*. + +Now that we have a matrix and search vector we can use the `knn` function to perform the search. 
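+
+A sketch of the steps described above is shown below. The parameter values follow
+the description, but the exact query values (such as the borough filter) are
+assumptions:
+
+[source,text]
+----
+let(a=facet2D(nyc311,
+              q="borough_s:Manhattan",
+              x="zip_s",
+              y="complaint_type_s",
+              dimensions="119,5",
+              count(*)),
+    m=pivot(a, zip_s, complaint_type_s, count(*)),
+    labels=getRowLabels(m),
+    i=indexOf(labels, "10280"),
+    searchVector=rowAt(m, i),
+    neighbors=knn(m, searchVector, 5, cosine()))
+----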
+In the example the `knn` function searches the matrix with the search vector with a K of 5, using
+*cosine* distance. Cosine distance is useful for comparing sparse vectors, which is the case in this
+example. The `knn` function returns a matrix with the top 5 nearest neighbors to the search vector.
+
+The `knn` function populates the row and column labels of the returned matrix and
+also adds a vector of *distances* for each row as an attribute to the matrix.
+
+In the example the `zplot` function extracts the row labels and
+the distance vector with the `getRowLabels` and `getAttribute` functions.
+The `topFeatures` function is used to extract
+the top 5 column labels for each zip code vector, based on the counts for each
+column. Then `zplot` outputs the data in a format that can be visualized in
+a table with Zeppelin-Solr.
+
+image::images/math-expressions/knn.png[]
+
+The table above shows each zip code returned by the `knn` function along
+with the list of complaints and distances. These are the zip codes that are most similar
+to the 10280 zip code based on their top 5 complaint types.
+
+== K-Nearest Neighbor Regression
+
+K-nearest neighbor regression is a non-linear, bivariate and multivariate regression method.
+KNN regression is a lazy learning
+technique, which means it does not fit a model to the training set in advance. Instead, the
+entire training set of observations and outcomes is held in memory and predictions are made
+by averaging the outcomes of the k-nearest neighbors.
+
+The `knnRegress` function is used to perform nearest neighbor regression.
+
+
+=== 2D Non-Linear Regression
+
+The example below shows the *regression plot* for KNN regression applied to a 2D scatter plot.
+
+In this example the `random` function is used to draw 500 random samples from the `logs` collection
+containing two fields `filesize_d` and `eresponse_d`. The sample is then vectorized with the
+`filesize_d` field stored in a vector assigned to variable `x` and the `eresponse_d` vector stored in
+variable `y`. The `knnRegress` function is then applied with `20` as the nearest neighbor parameter,
+which returns a KNN function that can be used to predict values.
+The `predict` function is then called on the KNN function to predict values for the original `x` vector.
+Finally, `zplot` is used to plot the original `x` and `y` vectors along with the predictions.
+
+image::images/math-expressions/knnRegress.png[]
+
+Notice that the regression plot shows a non-linear relationship between the `filesize_d`
+field and the `eresponse_d` field. Also note that KNN regression
+plots a non-linear curve through the scatter plot. The larger the size
+of K (nearest neighbors), the smoother the line.
+
+=== Multivariate Non-Linear Regression
+
+The `knnRegress` function is also a powerful and flexible tool for
+multivariate non-linear regression.
+
+In the example below a multivariate regression is performed using
+a database designed for analyzing and predicting wine quality. The
+database contains nearly 1600 records with 9 predictors of wine quality:
+pH, alcohol, fixed_acidity, sulphates, density, free_sulfur_dioxide,
+volatile_acidity, citric_acid, residual_sugar. There is also a field
+called `quality`, ranging from 3 to 8, assigned to each wine.
+
+KNN regression can be used to predict wine quality for vectors containing
+the predictor values.
+
+In the example a search is performed on the `redwine` collection to
+return all the rows in the database of observations.
+Then the quality field and predictor fields are read into vectors and set to variables.
+
+The predictor variables are added as rows to a matrix, which is
+transposed so that each row in the matrix contains one observation with the 9
+predictor values.
+This is our observation matrix, which is assigned to the variable `obs`.
+
+Then the `knnRegress` function regresses the observations with quality outcomes.
+The value for K is set to 5 in the example, so the average quality of the 5
+nearest neighbors will be used to calculate the quality.
+
+The `predict` function is then used to generate a vector of predictions
+for the entire observation set. These predictions will be used to determine
+how well the KNN regression performed over the observation data.
+
+The errors, or *residuals*, for the regression are then calculated by
+subtracting the *predicted* quality from the *observed* quality.
+The `ebeSubtract` function is used to perform the element-by-element
+subtraction between the two vectors.
+
+Finally, the `zplot` function formats the predictions and errors
+for the visualization of the *residual plot*.
+
+image::images/math-expressions/redwine1.png[]
+
+The residual plot plots the *predicted* values on the x-axis and the *error* for the
+prediction on the y-axis. The scatter plot shows how the errors
+are distributed across the full range of predictions.
+
+The residual plot can be interpreted to understand how the KNN regression performed on the
+training data.
+
+* The plot shows the prediction error appears to be fairly evenly distributed
+above and below zero. The density of the errors increases as they approach zero. The
+bubble size reflects the density of errors at the specific point in the plot.
+This provides an intuitive feel for the distribution of the model's error.
+
+* The plot also visualizes the variance of the error across the range of
+predictions. This provides an intuitive understanding of whether the KNN predictions
+will have similar error variance across the full range of predictions.
+
+The residuals can also be visualized using a histogram to better understand
+the shape of the residuals distribution. The example below shows the same KNN
+regression as above with a plot of the distribution of the errors.
+
+In the example the `zplot` function is used to plot the `empiricalDistribution`
+function of the residuals, with an 11-bin histogram.
+
+image::images/math-expressions/redwine2.png[]
+
+Notice that the errors follow a bell curve centered close to 0. From this plot
+we can see the probability of getting prediction errors between -1 and 1 is quite high.
+
+*Additional KNN Regression Parameters*
+
+The `knnRegress` function has three additional parameters that make it suitable for many different regression scenarios.
+
+. Any of the distance measures can be used for the regression simply by adding the function to the call.
+This allows for regression analysis over sparse vectors (`cosine`), dense vectors, and geospatial lat/lon vectors (`haversineMeters`).
++
+Sample syntax:
++
+[source,text]
+----
+r=knnRegress(obs, quality, 5, cosine()),
+----
+
+. The `robust` named parameter can be used to perform a regression analysis that is robust to outliers in the outcomes.
+When the `robust` parameter is used, the median outcome of the k-nearest neighbors is used rather than the average.
++
+Sample syntax:
++
+[source,text]
+----
+r=knnRegress(obs, quality, 5, robust="true"),
+----
+
+. The `scale` named parameter can be used to scale the columns of the observations and search vectors
+at prediction time. This can improve the performance of the KNN regression when the feature columns
+are at different scales, causing the distance calculations to place too much weight on the larger columns.
++
+Sample syntax:
++
+[source,text]
+----
+r=knnRegress(obs, quality, 5, scale="true"),
+----
+
+== knnSearch
+
+The `knnSearch` function returns the k-nearest neighbors
+for a document based on text similarity.
+Under the covers the `knnSearch` function uses Solr's More Like This query parser plugin.
+This capability uses the search engine's query, term statistics, scoring, and ranking capabilities to perform a fast nearest neighbor search for similar documents over large distributed indexes.
+
+The results of this search can be used directly or provide *candidates* for machine learning operations such as a secondary KNN vector search.
+
+The example below shows the `knnSearch` function on a movie reviews data set. The search returns the 50 documents most similar to a specific document ID (`83e9b5b0...`) based on the similarity of the `review_t` field.
+The `mindf` and `maxdf` parameters specify the minimum and maximum document frequency of the terms used to perform the search.
+These parameters can make the query faster by eliminating high frequency terms and also improve accuracy by removing noise terms from the search.
+
+image::images/math-expressions/knnSearch.png[]
+
+NOTE: In this example the `select`
+function is used to truncate the review in the output to 220 characters to make it easier
+to read in a table.
+
+== DBSCAN
+
+DBSCAN clustering is a powerful density-based clustering algorithm that is particularly well suited for geospatial clustering.
+DBSCAN uses two parameters to filter result sets to clusters of specific density:
+
+* `eps` (Epsilon): Defines the distance between points to be considered neighbors.
+
+* `min` points: The minimum number of points needed in a cluster for it to be returned.
+
+
+=== 2D Cluster Visualization
+
+The `zplot` function has direct support for plotting 2D clusters by using the `clusters` named parameter.
+
+The example below uses DBSCAN clustering and cluster visualization to find
+the *hot spots* on a map for rat sightings in the NYC 311 complaints database.
+
+In this example the `random` function draws a sample of records from the `nyc311` collection where
+the complaint description matches "rat sighting" and latitude is populated in the record.
+The latitude and longitude fields are then vectorized and added as rows to a matrix.
+The matrix is transposed so each row contains a single latitude, longitude
+point.
+The `dbscan` function is then used to cluster the latitude and longitude points.
+Notice that the `dbscan` function in the example has four parameters.
+
+* `obs`: The observation matrix of lat/lon points.
+
+* `eps`: The distance between points to be considered a cluster. 100 meters in the example.
+
+* `min` points: The minimum points in a cluster for the cluster to be returned by the function. `5` in the example.
+
+* `distance measure`: An optional distance measure used to determine the
+distance between points. The default is Euclidean distance.
+The example uses `haversineMeters`, which returns the distance in meters, a much more meaningful measure for geospatial use cases.
+
+Finally, the `zplot` function is used to visualize the clusters on a map with Zeppelin-Solr.
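+
+The sketch below shows the shape of this expression. The query and field names
+are assumptions based on the description above:
+
+[source,text]
+----
+let(a=random(nyc311,
+             q="descriptor_s:\"rat sighting\" AND latitude_d:*",
+             rows="5000",
+             fl="latitude_d, longitude_d"),
+    lat=col(a, latitude_d),
+    lon=col(a, longitude_d),
+    obs=transpose(matrix(lat, lon)),
+    clusters=dbscan(obs, 100, 5, haversineMeters()),
+    zplot(clusters=clusters))
+----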
+The map below has been zoomed to a specific area of Brooklyn with a high density of rat sightings.
+
+image::images/math-expressions/dbscan1.png[]
+
+Notice in the visualization that only 1019 points were returned from the 5000 samples.
+This is the power of the DBSCAN algorithm to filter records that don't match the criteria
+of a cluster. The points that are plotted all belong to clearly defined clusters.
+
+The map visualization can be zoomed further to explore the locations of specific clusters.
+The example below shows a zoom into an area of dense clusters.
+
+image::images/math-expressions/dbscan2.png[]
+
+
+== K-Means Clustering
+
+The `kmeans` function performs k-means clustering of the rows of a matrix.
+Once the clustering has been completed, there are a number of useful functions available
+for examining and visualizing the clusters and centroids.
+
+
+=== Clustered Scatter Plot
+
+In this example we'll again be clustering 2D lat/lon points of rat sightings. But unlike the DBSCAN example, k-means clustering
+does not perform any noise reduction on its own.
+So in order to reduce the noise, a smaller random sample is selected from the data than was used
+for the DBSCAN example.
+
+We'll see that sampling itself is a powerful noise reduction tool that helps visualize the cluster density.
+This is because there is a higher probability that samples will be drawn from higher density clusters and a lower
+probability that samples will be drawn from lower density clusters.
+
+In this example the `random` function draws a sample of 1500 records from the `nyc311` (complaints database) collection where
+the complaint description matches "rat sighting" and latitude is populated in the record. The latitude and longitude fields
+are then vectorized and added as rows to a matrix. The matrix is transposed so each row contains a single latitude, longitude
+point. The `kmeans` function is then used to cluster the latitude and longitude points into 21 clusters.
+Finally, the `zplot` function is used to visualize the clusters as a scatter plot.
+
+image::images/math-expressions/2DCluster1.png[]
+
+The scatter plot above shows each lat/lon point plotted on a Euclidean plane with longitude on the
+x-axis and
+latitude on the y-axis. The plot is dense enough that the outlines of the different boroughs are visible
+if you know the boroughs of New York City.
+
+
+Each cluster is shown in a different color. This plot provides interesting
+insight into the densities of rat sightings throughout the five boroughs of New York City. For
+example, it highlights a cluster of dense sightings in Brooklyn at `cluster1`,
+surrounded by less dense but still high-activity clusters.
+
+=== Plotting the Centroids
+
+The centroids of each cluster can then be plotted on a map to visualize the center of the
+clusters. In the example below the centroids are extracted from the clusters using the `getCentroids`
+function, which returns a matrix of the centroids.
+
+The centroids matrix contains 2D lat/lon points. The `colAt` function can then be used
+to extract the latitude and longitude columns by index from the matrix so they can be
+plotted with `zplot`. A map visualization is used below to display the centroids.
+
+
+image::images/math-expressions/centroidplot.png[]
+
+
+The map can then be zoomed to get a closer look at the centroids in the high density areas shown
+in the cluster scatter plot.
+
+image::images/math-expressions/centroidzoom.png[]
+
+
+=== Phrase Extraction
+
+K-means clustering produces centroids or *prototype* vectors, which can be used to represent
+each cluster. In this example the key features of the centroids are extracted
+to represent the key phrases for clusters of TF-IDF term vectors.
+
+NOTE: The example below works with TF-IDF _term vectors_.
+The section <> offers
+a full explanation of these features.
+
+In the example the `search` function returns documents where the `review_t` field matches the phrase "star wars".
+The `select` function is run over the result set and applies the `analyze` function,
+which uses the Lucene/Solr analyzer attached to the schema field `text_bigrams` to re-analyze the `review_t`
+field. This analyzer returns bigrams, which are then annotated to the documents in a field called `terms`.
+
+The `termVectors` function then creates TF-IDF term vectors from the bigrams stored in the `terms` field.
+The `kmeans` function is then used to cluster the bigram term vectors into 5 clusters.
+Finally, the top 5 features are extracted from the centroids and returned.
+Notice that the features are all bigram phrases with semantic significance.
+
+[source,text]
+----
+let(a=select(search(reviews, q="review_t:\"star wars\"", rows="500"),
+             id,
+             analyze(review_t, text_bigrams) as terms),
+    vectors=termVectors(a, maxDocFreq=.10, minDocFreq=.03, minTermLength=13, exclude="_,br,have"),
+    clusters=kmeans(vectors, 5),
+    centroids=getCentroids(clusters),
+    phrases=topFeatures(centroids, 5))
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "phrases": [
+          [
+            "empire strikes",
+            "rebel alliance",
+            "princess leia",
+            "luke skywalker",
+            "phantom menace"
+          ],
+          [
+            "original star",
+            "main characters",
+            "production values",
+            "anakin skywalker",
+            "luke skywalker"
+          ],
+          [
+            "carrie fisher",
+            "original films",
+            "harrison ford",
+            "luke skywalker",
+            "ian mcdiarmid"
+          ],
+          [
+            "phantom menace",
+            "original trilogy",
+            "harrison ford",
+            "john williams",
+            "empire strikes"
+          ],
+          [
+            "science fiction",
+            "fiction films",
+            "forbidden planet",
+            "character development",
+            "worth watching"
+          ]
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 46
+      }
+    ]
+  }
+}
+----
+
+== Multi K-Means Clustering
+
+K-means clustering will produce different outcomes depending on
+the initial placement of the centroids. K-means is fast enough
+that multiple trials can be performed so that the best outcome can be selected.
+
+The `multiKmeans` function runs the k-means clustering algorithm for a given number of trials and selects the
+best result based on which trial produces the lowest intra-cluster variance.
+
+The example below is identical to the phrase extraction example except that it uses `multiKmeans` with 15 trials,
+rather than a single trial of the `kmeans` function.
+
+[source,text]
+----
+let(a=select(search(reviews, q="review_t:\"star wars\"", rows="500"),
+             id,
+             analyze(review_t, text_bigrams) as terms),
+    vectors=termVectors(a, maxDocFreq=.10, minDocFreq=.03, minTermLength=13, exclude="_,br,have"),
+    clusters=multiKmeans(vectors, 5, 15),
+    centroids=getCentroids(clusters),
+    phrases=topFeatures(centroids, 5))
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "phrases": [
+          [
+            "science fiction",
+            "original star",
+            "production values",
+            "fiction films",
+            "forbidden planet"
+          ],
+          [
+            "empire strikes",
+            "princess leia",
+            "luke skywalker",
+            "phantom menace"
+          ],
+          [
+            "carrie fisher",
+            "harrison ford",
+            "luke skywalker",
+            "empire strikes",
+            "original films"
+          ],
+          [
+            "phantom menace",
+            "original trilogy",
+            "harrison ford",
+            "character development",
+            "john williams"
+          ],
+          [
+            "rebel alliance",
+            "empire strikes",
+            "princess leia",
+            "original trilogy",
+            "luke skywalker"
+          ]
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 84
+      }
+    ]
+  }
+}
+----
+
+== Fuzzy K-Means Clustering
+
+The `fuzzyKmeans` function is a soft clustering algorithm that
+allows vectors to be assigned to more than one cluster. The `fuzziness` parameter
+is a value between `1` and `2` that determines how fuzzy to make the cluster assignment.
+
+After the clustering has been performed, the `getMembershipMatrix` function can be called
+on the clustering result to return a matrix describing the probabilities
+of cluster membership for each vector.
+This matrix can be used to understand relationships between clusters.
+
+In the example below `fuzzyKmeans` is used to cluster the movie reviews matching the phrase "star wars".
+But instead of looking at the clusters or centroids, the `getMembershipMatrix` function is used to return the
+membership probabilities for each document. The membership matrix contains a row for each
+vector that was clustered. There is a column in the matrix for each cluster.
+The values in the matrix contain the probability that a specific vector belongs to a specific cluster.
+
+In the example the `distance` function is then used to create a *distance matrix* from the columns of the
+membership matrix. The distance matrix is then visualized with the `zplot` function as a heat map.
+
+In the example `cluster1` and `cluster5` have the shortest distance between them.
+Further analysis of the features in both clusters can be performed to understand
+the relationship between `cluster1` and `cluster5`.
+
+image::images/math-expressions/fuzzyk.png[]
+
+NOTE: The heat map has been configured to increase in color intensity as the distance shortens.
+
 == Feature Scaling
 
 Before performing machine learning operations its often necessary to scale the feature vectors
 so they can be compared at the same scale.
-All the scaling function operate on vectors and matrices.
+All the scaling functions below operate on vectors and matrices.
 When operating on a matrix the rows of the matrix are scaled.
 
 === Min/Max Scaling
 
 The `minMaxScale` function scales a vector or matrix between a minimum and maximum value.
-By default it will scale between 0 and 1 if min/max values are not provided.
+By default it will scale between `0` and `1` if min/max values are not provided.
 
-Below is a simple example of min/max scaling between 0 and 1.
+Below is a plot of a sine wave, with an amplitude of 1, before and
+after it has been scaled between -5 and 5.
+ +image::images/math-expressions/minmaxscale.png[] + + +Below is a simple example of min/max scaling of a matrix between 0 and 1. Notice that once brought into the same scale the vectors are the same. [source,text] @@ -44,7 +677,7 @@ let(a=array(20, 30, 40, 50), d=minMaxScale(c)) ---- -This expression returns the following response: +When this expression is sent to the `/stream` handler it responds with: [source,json] ---- @@ -78,10 +711,16 @@ This expression returns the following response: === Standardization -The `standardize` function scales a vector so that it has a mean of 0 and a standard deviation of 1. -Standardization can be used with machine learning algorithms, such as -https://en.wikipedia.org/wiki/Support_vector_machine[Support Vector Machine (SVM)], that perform better -when the data has a normal distribution. +The `standardize` function scales a vector so that it has a +mean of 0 and a standard deviation of 1. + +Below is a plot of a sine wave, with an amplitude of 1, before and +after it has been standardized. + +image::images/math-expressions/standardize.png[] + +Below is a simple example of of a standardized matrix. +Notice that once brought into the same scale the vectors are the same. [source,text] ---- @@ -91,7 +730,7 @@ let(a=array(20, 30, 40, 50), d=standardize(c)) ---- -This expression returns the following response: +When this expression is sent to the `/stream` handler it responds with: [source,json] ---- @@ -126,8 +765,16 @@ This expression returns the following response: === Unit Vectors The `unitize` function scales vectors to a magnitude of 1. A vector with a -magnitude of 1 is known as a unit vector. Unit vectors are preferred when the vector math deals -with vector direction rather than magnitude. +magnitude of 1 is known as a unit vector. Unit vectors are preferred +when the vector math deals with vector direction rather than magnitude. + +Below is a plot of a sine wave, with an amplitude of 1, before and +after it has been unitized. + +image::images/math-expressions/unitize.png[] + +Below is a simple example of a unitized matrix. +Notice that once brought into the same scale the vectors are the same. [source,text] ---- @@ -137,7 +784,7 @@ let(a=array(20, 30, 40, 50), d=unitize(c)) ---- -This expression returns the following response: +When this expression is sent to the `/stream` handler it responds with: [source,json] ---- @@ -168,713 +815,3 @@ This expression returns the following response: } } ---- - -== Distance and Distance Measures - -The `distance` function computes the distance for two numeric arrays or a distance matrix for the columns of a matrix. - -There are five distance measure functions that return a function that performs the actual distance calculation: - -* `euclidean` (default) -* `manhattan` -* `canberra` -* `earthMovers` -* `haversineMeters` (Geospatial distance measure) - -The distance measure functions can be used with all machine learning functions -that support distance measures. - -Below is an example for computing Euclidean distance for two numeric arrays: - -[source,text] ----- -let(a=array(20, 30, 40, 50), - b=array(21, 29, 41, 49), - c=distance(a, b)) ----- - -This expression returns the following response: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "c": 2 - }, - { - "EOF": true, - "RESPONSE_TIME": 0 - } - ] - } -} ----- - -Below the distance is calculated using *Manahattan* distance. 
- -[source,text] ----- -let(a=array(20, 30, 40, 50), - b=array(21, 29, 41, 49), - c=distance(a, b, manhattan())) ----- - -This expression returns the following response: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "c": 4 - }, - { - "EOF": true, - "RESPONSE_TIME": 1 - } - ] - } -} ----- - - -Below is an example for computing a distance matrix for columns -of a matrix: - -[source,text] ----- -let(a=array(20, 30, 40), - b=array(21, 29, 41), - c=array(31, 40, 50), - d=matrix(a, b, c), - c=distance(d)) ----- - -This expression returns the following response: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "e": [ - [ - 0, - 15.652475842498529, - 34.07345007480164 - ], - [ - 15.652475842498529, - 0, - 18.547236990991408 - ], - [ - 34.07345007480164, - 18.547236990991408, - 0 - ] - ] - }, - { - "EOF": true, - "RESPONSE_TIME": 24 - } - ] - } -} ----- - -== K-Means Clustering - -The `kmeans` functions performs k-means clustering of the rows of a matrix. -Once the clustering has been completed there are a number of useful functions available -for examining the clusters and centroids. - -The examples below cluster _term vectors_. -The section <> offers -a full explanation of these features. - -=== Centroid Features - -In the example below the `kmeans` function is used to cluster a result set from the Enron email data-set -and then the top features are extracted from the cluster centroids. - -[source,text] ----- -let(a=select(random(enron, q="body:oil", rows="500", fl="id, body"), <1> - id, - analyze(body, body_bigram) as terms), - b=termVectors(a, maxDocFreq=.10, minDocFreq=.05, minTermLength=14, exclude="_,copyright"),<2> - c=kmeans(b, 5), <3> - d=getCentroids(c), <4> - e=topFeatures(d, 5)) <5> ----- - -Let's look at what data is assigned to each variable: - -<1> *`a`*: The `random` function returns a sample of 500 documents from the "enron" -collection that match the query "body:oil". The `select` function selects the `id` and -and annotates each tuple with the analyzed bigram terms from the `body` field. -<2> *`b`*: The `termVectors` function creates a TF-IDF term vector matrix from the -tuples stored in variable *`a`*. Each row in the matrix represents a document. The columns of the matrix -are the bigram terms that were attached to each tuple. -<3> *`c`*: The `kmeans` function clusters the rows of the matrix into 5 clusters. The k-means clustering is performed using the Euclidean distance measure. -<4> *`d`*: The `getCentroids` function returns a matrix of cluster centroids. Each row in the matrix is a centroid -from one of the 5 clusters. The columns of the matrix are the same bigrams terms of the term vector matrix. -<5> *`e`*: The `topFeatures` function returns the column labels for the top 5 features of each centroid in the matrix. -This returns the top 5 bigram terms for each centroid. 
- -This expression returns the following response: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "e": [ - [ - "enron enronxgate", - "north american", - "energy services", - "conference call", - "power generation" - ], - [ - "financial times", - "chief financial", - "financial officer", - "exchange commission", - "houston chronicle" - ], - [ - "southern california", - "california edison", - "public utilities", - "utilities commission", - "rate increases" - ], - [ - "rolling blackouts", - "public utilities", - "electricity prices", - "federal energy", - "price controls" - ], - [ - "california edison", - "regulatory commission", - "southern california", - "federal energy", - "power generators" - ] - ] - }, - { - "EOF": true, - "RESPONSE_TIME": 982 - } - ] - } -} ----- - -=== Cluster Features - -The example below examines the top features of a specific cluster. This example uses the same techniques -as the centroids example but the top features are extracted from a cluster rather than the centroids. - -[source,text] ----- -let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"), - id, - analyze(body, body_bigram) as terms), - b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"), - c=kmeans(b, 25), - d=getCluster(c, 0), <1> - e=topFeatures(d, 4)) <2> ----- - -<1> The `getCluster` function returns a cluster by its index. Each cluster is a matrix containing term vectors -that have been clustered together based on their features. -<2> The `topFeatures` function is used to extract the top 4 features from each term vector -in the cluster. - -This expression returns the following response: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "e": [ - [ - "electricity board", - "maharashtra state", - "power purchase", - "state electricity", - "reserved enron" - ], - [ - "electricity board", - "maharashtra state", - "state electricity", - "purchase agreement", - "independent power" - ], - [ - "maharashtra state", - "reserved enron", - "federal government", - "state government", - "dabhol project" - ], - [ - "purchase agreement", - "power purchase", - "electricity board", - "maharashtra state", - "state government" - ], - [ - "investment grade", - "portland general", - "general electric", - "holding company", - "transmission lines" - ], - [ - "state government", - "state electricity", - "purchase agreement", - "electricity board", - "maharashtra state" - ], - [ - "electricity board", - "state electricity", - "energy management", - "maharashtra state", - "energy markets" - ], - [ - "electricity board", - "maharashtra state", - "state electricity", - "state government", - "second quarter" - ] - ] - }, - { - "EOF": true, - "RESPONSE_TIME": 978 - } - ] - } -} ----- - -== Multi K-Means Clustering - -K-means clustering will produce different results depending on -the initial placement of the centroids. K-means is fast enough -that multiple trials can be performed and the best outcome selected. - -The `multiKmeans` function runs the k-means clustering algorithm for a given number of trials and selects the -best result based on which trial produces the lowest intra-cluster variance. - -The example below is identical to centroids example except that it uses `multiKmeans` with 100 trials, -rather than a single trial of the `kmeans` function. 
- -[source,text] ----- -let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"), - id, - analyze(body, body_bigram) as terms), - b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"), - c=multiKmeans(b, 5, 100), - d=getCentroids(c), - e=topFeatures(d, 5)) ----- - -This expression returns the following response: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "e": [ - [ - "enron enronxgate", - "energy trading", - "energy markets", - "energy services", - "unleaded gasoline" - ], - [ - "maharashtra state", - "electricity board", - "state electricity", - "energy trading", - "chief financial" - ], - [ - "price controls", - "electricity prices", - "francisco chronicle", - "wholesale electricity", - "power generators" - ], - [ - "southern california", - "california edison", - "public utilities", - "francisco chronicle", - "utilities commission" - ], - [ - "california edison", - "power purchases", - "system operator", - "term contracts", - "independent system" - ] - ] - }, - { - "EOF": true, - "RESPONSE_TIME": 1182 - } - ] - } -} ----- - -== Fuzzy K-Means Clustering - -The `fuzzyKmeans` function is a soft clustering algorithm which -allows vectors to be assigned to more then one cluster. The `fuzziness` parameter -is a value between 1 and 2 that determines how fuzzy to make the cluster assignment. - -After the clustering has been performed the `getMembershipMatrix` function can be called -on the clustering result to return a matrix describing which clusters each vector belongs to. -There is a row in the matrix for each vector that was clustered. There is a column in the matrix -for each cluster. The values in the columns are the probability that the vector belonged to the specific -cluster. - -A simple example will make this more clear. In the example below 300 documents are analyzed and -then turned into a term vector matrix. Then the `fuzzyKmeans` function clusters the -term vectors into 12 clusters with a fuzziness factor of 1.25. - -[source,text] ----- -let(a=select(random(collection3, q="body:oil", rows="300", fl="id, body"), - id, - analyze(body, body_bigram) as terms), - b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"), - c=fuzzyKmeans(b, 12, fuzziness=1.25), - d=getMembershipMatrix(c), <1> - e=rowAt(d, 0), <2> - f=precision(e, 5)) <3> ----- - -<1> The `getMembershipMatrix` function is used to return the membership matrix; -<2> and the first row of membership matrix is retrieved with the `rowAt` function. -<3> The `precision` function is then applied to the first row -of the matrix to make it easier to read. - -This expression returns a single vector representing the cluster membership probabilities for the first -term vector. Notice that the term vector has the highest association with the 12^th^ cluster, -but also has significant associations with the 3^rd^, 5^th^, 6^th^ and 7^th^ clusters: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "f": [ - 0, - 0, - 0.178, - 0, - 0.17707, - 0.17775, - 0.16214, - 0, - 0, - 0, - 0, - 0.30504 - ] - }, - { - "EOF": true, - "RESPONSE_TIME": 2157 - } - ] - } -} ----- - -== K-Nearest Neighbor (KNN) - -The `knn` function searches the rows of a matrix for the -k-nearest neighbors of a search vector. The `knn` function -returns a matrix of the k-nearest neighbors. 
- -The `knn` function supports changing of the distance measure by providing one of these -distance measure functions as the fourth parameter: - -* `euclidean` (Default) -* `manhattan` -* `canberra` -* `earthMovers` - -The example below builds on the clustering examples to demonstrate the `knn` function. - -[source,text] ----- -let(a=select(random(collection3, q="body:oil", rows="500", fl="id, body"), - id, - analyze(body, body_bigram) as terms), - b=termVectors(a, maxDocFreq=.09, minDocFreq=.03, minTermLength=14, exclude="_,copyright"), - c=multiKmeans(b, 5, 100), - d=getCentroids(c), <1> - e=rowAt(d, 0), <2> - g=knn(b, e, 3), <3> - h=topFeatures(g, 4)) <4> ----- - -<1> In the example, the centroids matrix is set to variable *`d`*. -<2> The first centroid vector is selected from the matrix with the `rowAt` function. -<3> Then the `knn` function is used to find the 3 nearest neighbors -to the centroid vector in the term vector matrix (variable *`b`*). -<4> The `topFeatures` function is used to request the top 4 featurs of the term vectors in the knn matrix. - -The `knn` function returns a matrix with the 3 nearest neighbors based on the -default distance measure which is euclidean. Finally, the top 4 features -of the term vectors in the nearest neighbor matrix are returned: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "h": [ - [ - "california power", - "electricity supply", - "concerned about", - "companies like" - ], - [ - "maharashtra state", - "california power", - "electricity board", - "alternative energy" - ], - [ - "electricity board", - "maharashtra state", - "state electricity", - "houston chronicle" - ] - ] - }, - { - "EOF": true, - "RESPONSE_TIME": 1243 - } - ] - } -} ----- - -== K-Nearest Neighbor Regression - -K-nearest neighbor regression is a non-linear, multi-variate regression method. Knn regression is a lazy learning -technique which means it does not fit a model to the training set in advance. Instead the -entire training set of observations and outcomes are held in memory and predictions are made -by averaging the outcomes of the k-nearest neighbors. - -The `knnRegress` function prepares the training set for use with the `predict` function. - -Below is an example of the `knnRegress` function. In this example 10,000 random samples -are taken, each containing the variables `filesize_d`, `service_d` and `response_d`. The pairs of -`filesize_d` and `service_d` will be used to predict the value of `response_d`. - -[source,text] ----- -let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"), - filesizes=col(samples, filesize_d), - serviceLevels=col(samples, service_d), - outcomes=col(samples, response_d), - observations=transpose(matrix(filesizes, serviceLevels)), - lazyModel=knnRegress(observations, outcomes , 5)) ----- - -This expression returns the following response. Notice that `knnRegress` returns a tuple describing the regression inputs: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "lazyModel": { - "features": 2, - "robust": false, - "distance": "EuclideanDistance", - "observations": 10000, - "scale": false, - "k": 5 - } - }, - { - "EOF": true, - "RESPONSE_TIME": 170 - } - ] - } -} ----- - -=== Prediction and Residuals - -The output of `knnRegress` can be used with the `predict` function like other regression models. - -In the example below the `predict` function is used to predict results for the original training -data. The sumSq of the residuals is then calculated. 
- -[source,text] ----- -let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"), - filesizes=col(samples, filesize_d), - serviceLevels=col(samples, service_d), - outcomes=col(samples, response_d), - observations=transpose(matrix(filesizes, serviceLevels)), - lazyModel=knnRegress(observations, outcomes , 5), - predictions=predict(lazyModel, observations), - residuals=ebeSubtract(outcomes, predictions), - sumSqErr=sumSq(residuals)) ----- - -This expression returns the following response: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "sumSqErr": 1920290.1204126712 - }, - { - "EOF": true, - "RESPONSE_TIME": 3796 - } - ] - } -} ----- - -=== Setting Feature Scaling - -If the features in the observation matrix are not in the same scale then the larger features -will carry more weight in the distance calculation then the smaller features. This can greatly -impact the accuracy of the prediction. The `knnRegress` function has a `scale` parameter which -can be set to `true` to automatically scale the features in the same range. - -The example below shows `knnRegress` with feature scaling turned on. - -Notice that when feature scaling is turned on the `sumSqErr` in the output is much lower. -This shows how much more accurate the predictions are when feature scaling is turned on in -this particular example. This is because the `filesize_d` feature is significantly larger then -the `service_d` feature. - -[source,text] ----- -let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"), - filesizes=col(samples, filesize_d), - serviceLevels=col(samples, service_d), - outcomes=col(samples, response_d), - observations=transpose(matrix(filesizes, serviceLevels)), - lazyModel=knnRegress(observations, outcomes , 5, scale=true), - predictions=predict(lazyModel, observations), - residuals=ebeSubtract(outcomes, predictions), - sumSqErr=sumSq(residuals)) ----- - -This expression returns the following response: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "sumSqErr": 4076.794951120683 - }, - { - "EOF": true, - "RESPONSE_TIME": 3790 - } - ] - } -} ----- - - -=== Setting Robust Regression - -The default prediction approach is to take the mean of the outcomes of the k-nearest -neighbors. If the outcomes contain outliers the mean value can be skewed. Setting -the `robust` parameter to `true` will take the median outcome of the k-nearest neighbors. -This provides a regression prediction that is robust to outliers. - -=== Setting the Distance Measure - -The distance measure can be changed for the k-nearest neighbor search by adding a distance measure -function to the `knnRegress` parameters. Below is an example using `manhattan` distance. 
- -[source,text] ----- -let(samples=random(collection1, q="*:*", rows="10000", fl="filesize_d, service_d, response_d"), - filesizes=col(samples, filesize_d), - serviceLevels=col(samples, service_d), - outcomes=col(samples, response_d), - observations=transpose(matrix(filesizes, serviceLevels)), - lazyModel=knnRegress(observations, outcomes, 5, manhattan(), scale=true), - predictions=predict(lazyModel, observations), - residuals=ebeSubtract(outcomes, predictions), - sumSqErr=sumSq(residuals)) ----- - -This expression returns the following response: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "sumSqErr": 4761.221942288098 - }, - { - "EOF": true, - "RESPONSE_TIME": 3571 - } - ] - } -} ----- diff --git a/solr/solr-ref-guide/src/major-changes-in-solr-8.adoc b/solr/solr-ref-guide/src/major-changes-in-solr-8.adoc index 9bb805dff9c..2ed72aecfa7 100644 --- a/solr/solr-ref-guide/src/major-changes-in-solr-8.adoc +++ b/solr/solr-ref-guide/src/major-changes-in-solr-8.adoc @@ -489,7 +489,7 @@ See the section <>*: The functions that apply to scalar numbers. +== Table of Contents -*<>*: Vector math expressions and vector manipulation. +*<>*: Gallery of streaming expression and math expression visualizations. -*<>*: Assigning and caching variables. +*<>*: Getting started with streaming expressions, math expressions, and visualization. -*<>*: Matrix creation, manipulation, and matrix math. +*<>*: Visualizing, transforming and loading CSV files. -*<>*: Retrieving streams and vectorizing numeric and lat/lon location fields. +*<>*: Searching, sampling, aggregation and visualization of result sets. -*<>*: Using math expressions for text analysis and TF-IDF term vectors. +*<>*: Transforming and filtering result sets. -*<>*: Statistical functions in math expressions. +*<>*: Math functions and visualization applied to numbers. -*<>*: Mathematical models of probability. +*<>*: Vector math, manipulation and visualization. -*<>*: Performing uncorrelated and correlated Monte Carlo simulations. +*<>*: Vectorizing result sets and assigning and visualizing variables. + +*<>*: Matrix math, manipulation and visualization. + +*<>*: Text analysis and TF-IDF term vectors. + +*<>*: Continuous and discrete probability distribution functions. + +*<>*: Descriptive statistics, histograms, percentiles, correlation, inference tests and other stats functions. *<>*: Simple and multivariate linear regression. -*<>*: Numerical analysis math expressions. +*<>*: Polynomial, harmonic and Gaussian curve fitting. -*<>*: Functions commonly used with digital signal processing. +*<>*: Time series aggregation, visualization, smoothing, differencing, anomaly detection and forecasting. -*<>*: Polynomial, Harmonic and Gaussian curve fitting. +*<>*: Interpolation, derivatives and integrals. -*<>*: Aggregation, smoothing and differencing of time series. +*<>*: Convolution, cross-correlation, autocorrelation and fast Fourier transforms. -*<>*: Functions used in machine learning. +*<>*: Monte Carlo simulations and random walks + +*<>*: Distance, KNN, DBSCAN, K-means, fuzzy K-means and other ML functions. *<>*: Convex Hulls and Enclosing Disks. + +*<>*: Solr log analytics and visualization. diff --git a/solr/solr-ref-guide/src/math-start.adoc b/solr/solr-ref-guide/src/math-start.adoc new file mode 100644 index 00000000000..63879babf3c --- /dev/null +++ b/solr/solr-ref-guide/src/math-start.adoc @@ -0,0 +1,128 @@ += Getting Started +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. 
See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+== Language
+
+*Streaming expressions* and *math expressions* are function languages that run
+inside SolrCloud. The languages consist of functions
+that are designed to be *composed* to form programming logic.
+
+*Streaming expressions* are functions that return streams of tuples. Streaming expression functions can be composed to form a transformation pipeline.
+The pipeline starts with a *stream source*, such as `search`, which initiates a stream of tuples.
+One or more *stream decorators*, such as `select`, wrap the stream source and transform the stream of tuples.
+
+*Math expressions* are functions that operate over and return primitives and in-memory
+arrays and matrices. The core use case for math expressions is performing mathematical operations and
+visualization.
+
+Streaming expressions and math expressions can be combined to *search,
+sample, aggregate, transform, analyze* and *visualize* data in SolrCloud collections.
+
+
+== Execution
+
+Solr's `/stream` request handler executes streaming expressions and math expressions.
+This handler compiles the expression, runs the expression logic
+and returns a JSON result.
+
+=== Admin UI Stream Panel
+
+The easiest way to run streaming expressions and math expressions is through
+the *stream* panel on the Solr Admin UI.
+
+A sample `search` streaming expression is shown in the screenshot below:
+
+image::images/math-expressions/search.png[]
+
+A sample `add` math expression is shown in the screenshot below:
+
+image::images/math-expressions/add.png[]
+
+=== Curl Example
+
+The HTTP interface to the `/stream` handler can be used to
+send a streaming expression request and retrieve the response.
+
+Curl is a useful tool for running streaming expressions when the result
+needs to be spooled to disk or is too large for the Solr admin stream panel. Below
+is an example of a curl command to the `/stream` handler. 
+
+[source,bash]
+----
+curl --data-urlencode 'expr=search(enron_emails,
+                                   q="from:1800flowers*",
+                                   fl="from, to",
+                                   sort="from asc")' http://localhost:8983/solr/enron_emails/stream
+
+----
+
+The JSON response from the stream handler for this request is shown below:
+
+[source,json]
+----
+{"result-set":{"docs":[
+    {"from":"1800flowers.133139412@s2u2.com","to":"lcampbel@enron.com"},
+    {"from":"1800flowers.93690065@s2u2.com","to":"jtholt@ect.enron.com"},
+    {"from":"1800flowers.96749439@s2u2.com","to":"alewis@enron.com"},
+    {"from":"1800flowers@1800flowers.flonetwork.com","to":"lcampbel@enron.com"},
+    {"from":"1800flowers@1800flowers.flonetwork.com","to":"lcampbel@enron.com"},
+    {"from":"1800flowers@1800flowers.flonetwork.com","to":"lcampbel@enron.com"},
+    {"from":"1800flowers@1800flowers.flonetwork.com","to":"lcampbel@enron.com"},
+    {"from":"1800flowers@1800flowers.flonetwork.com","to":"lcampbel@enron.com"},
+    {"from":"1800flowers@shop2u.com","to":"ebass@enron.com"},
+    {"from":"1800flowers@shop2u.com","to":"lcampbel@enron.com"},
+    {"from":"1800flowers@shop2u.com","to":"lcampbel@enron.com"},
+    {"from":"1800flowers@shop2u.com","to":"lcampbel@enron.com"},
+    {"from":"1800flowers@shop2u.com","to":"ebass@enron.com"},
+    {"from":"1800flowers@shop2u.com","to":"ebass@enron.com"},
+    {"EOF":true,"RESPONSE_TIME":33}]}
+}
+----
+
+== Visualization
+
+The visualizations in this guide were performed with Apache Zeppelin using the
+Zeppelin-Solr interpreter.
+
+=== Zeppelin-Solr Interpreter
+
+An Apache Zeppelin interpreter for Solr allows streaming expressions and math expressions to be executed and results visualized in Zeppelin.
+The instructions for installing and configuring Zeppelin-Solr can be found on the Github repository for the project:
+https://github.com/lucidworks/zeppelin-solr
+
+Once installed, the Solr interpreter can be configured to connect to your Solr instance.
+The screenshot below shows the panel for configuring Zeppelin-Solr.
+
+image::images/math-expressions/zepconf.png[]
+
+Configure the `solr.baseUrl` and `solr.collection` to point to the location where the streaming
+expressions and math expressions will be sent for execution. The `solr.collection` is
+just the execution collection and does not need to hold data, although it can.
+Streaming expressions can query any of the collections that are attached
+to the same SolrCloud as the execution collection.
+
+=== zplot
+
+Streaming expression result sets can be visualized automatically by Zeppelin-Solr.
+
+Math expression results need to be formatted for visualization using the `zplot` function.
+This function has support for plotting *vectors*, *matrices*, *probability distributions* and
+*2D clustering results*.
+
+There are many examples in the guide which show how to visualize both streaming expressions
+and math expressions.
diff --git a/solr/solr-ref-guide/src/matrix-math.adoc b/solr/solr-ref-guide/src/matrix-math.adoc
index b5cce75a8e6..fd9ec3f5130 100644
--- a/solr/solr-ref-guide/src/matrix-math.adoc
+++ b/solr/solr-ref-guide/src/matrix-math.adoc
@@ -16,12 +16,11 @@
 // specific language governing permissions and limitations
 // under the License.
 
-This section of the user guide covers the
-basics of matrix creation, manipulation and matrix math. Other sections
-of the user guide demonstrate how matrices are used by the statistics,
-probability and machine learning functions.
+Matrices are used as both inputs and outputs of many mathematical functions. 
+This section of the user guide covers the basics of matrix creation,
+manipulation and matrix math.
 
-== Matrix Creation
+== Matrices
 
 A matrix can be created with the `matrix` function.
 The matrix function is passed a list of `arrays` with
@@ -60,9 +59,73 @@ responds with:
         "RESPONSE_TIME": 0
       }
     ]
-  }
+  }}
 ----
 
+== Row and Column Labels
+
+A matrix can have row and column labels. The functions
+`setRowLabels`, `setColumnLabels`, `getRowLabels`, and `getColumnLabels`
+can be used to set and get the labels.
+The label values are set using string arrays.
+
+The example below sets the row and column labels. In other sections of the
+user guide examples are shown where functions return matrices
+with the labels already set.
+
+Below is a simple example of setting and getting row and column labels
+on a matrix.
+
+[source,text]
+----
+let(echo="d, e",
+    a=matrix(array(1, 2),
+             array(4, 5)),
+    b=setRowLabels(a, array("rowA", "rowB")),
+    c=setColumnLabels(b, array("colA", "colB")),
+    d=getRowLabels(c),
+    e=getColumnLabels(c))
+----
+
+When this expression is sent to the `/stream` handler it
+responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "d": [
+          "rowA",
+          "rowB"
+        ],
+        "e": [
+          "colA",
+          "colB"
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+== Visualization
+
+The `zplot` function can plot matrices as a heat map using the `heat` named parameter.
+Heat maps are powerful visualization tools for displaying <> and <> matrices described later in the guide.
+The example below shows a 2x2 matrix visualized using the heat map
+visualization in Apache Zeppelin.
+
+NOTE: In the visualization below the rows are read from the *bottom* up, which is a common convention for heat maps.
+
+image::images/math-expressions/matrix.png[]
+
+
 == Accessing Rows and Columns
 
 The rows and columns of a matrix can be accessed using the `rowAt`
@@ -103,111 +166,8 @@ responds with:
 }
 ----
 
-== Pair Sorting Vectors
-
-The `pairSort` function can be used to sort two vectors based on the values in
-the first vector. The sorting operation maintains the pairing between
-the two vectors during the sort.
-
-The `pairSort` function returns a matrix containing the
-pair sorted vectors. The first row in the matrix is the first vector,
-the second row in the matrix is the second vector.
-
-The individual vectors can then be accessed using the `rowAt` function.
-
-The example below performs a pair sort of two vectors and returns the
-matrix containing the sorted vectors.
-
----
-let(a=array(10, 2, 1),
-    b=array(100, 200, 300),
-    c=pairSort(a, b))
---
-
-When this expression is sent to the `/stream` handler it
-responds with:
-
-[source,json]
---
-{
-  "result-set": {
-    "docs": [
-      {
-        "c": [
-          [
-            1,
-            2,
-            10
-          ],
-          [
-            300,
-            200,
-            100
-          ]
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 1
-      }
-    ]
-  }
-}
---
-== Row and Column Labels
-
-A matrix can have column and rows and labels. The functions
-`setRowLabels`, `setColumnLabels`, `getRowLabels` and `getColumnLabels`
-can be used to set and get the labels. The label values
-are set using string arrays.
-
-The example below sets the row and column labels. In other sections of the
-user guide examples are shown where functions return matrices
-with the labels already set.
-
-Below is a simple example of setting and
-getting row and column labels
-on a matrix. 
-
-[source,text]
---
-let(echo="d, e",
-    a=matrix(array(1, 2),
-             array(4, 5)),
-    b=setRowLabels(a, array("row0", "row1")),
-    c=setColumnLabels(b, array("col0", "col1")),
-    d=getRowLabels(c),
-    e=getColumnLabels(c))
---
-
-When this expression is sent to the `/stream` handler it
-responds with:
-
-[source,json]
---
-{
-  "result-set": {
-    "docs": [
-      {
-        "d": [
-          "row0",
-          "row1"
-        ],
-        "e": [
-          "col0",
-          "col1"
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
---
 
 == Matrix Attributes
 
@@ -368,9 +328,9 @@ responds with:
 == Scalar Matrix Math
 
 The same scalar math functions that apply to vectors can also be applied to matrices: `scalarAdd`, `scalarSubtract`,
-`scalarMultiply`, `scalarDivide`. Below is an example of the `scalarAdd` function
-which adds a scalar value to each element in a matrix.
+`scalarMultiply`, `scalarDivide`.
+Below is an example of the `scalarAdd` function which adds a scalar value to each element in a matrix.
 
 [source,text]
 ----
@@ -379,8 +339,7 @@ let(a=matrix(array(1, 2),
     b=scalarAdd(10, a))
 ----
 
-When this expression is sent to the `/stream` handler it
-responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
@@ -414,7 +373,7 @@ Two matrices can be added and subtracted using the `ebeAdd` and `ebeSubtract` fu
 which perform element-by-element addition and subtraction of matrices.
 
 
-Below is a simple example of an element-by-element addition of a matrix by itself:
+Below is a simple example of an element-by-element addition of a matrix to itself using `ebeAdd`:
 
 [source,text]
 ----
@@ -423,8 +382,7 @@ let(a=matrix(array(1, 2),
     b=ebeAdd(a, a))
 ----
 
-When this expression is sent to the `/stream` handler it
-responds with:
+When this expression is sent to the `/stream` handler it responds with:
 
 [source,json]
 ----
@@ -454,8 +412,8 @@ responds with:
 
 == Matrix Multiplication
 
-Matrix multiplication can be accomplished using the `matrixMult` function. Below is a simple
-example of matrix multiplication:
+Matrix multiplication can be accomplished using the `matrixMult` function.
+Below is a simple example of matrix multiplication:
 
 [source,text]
 ----
@@ -493,4 +451,4 @@ responds with:
 ]
 }
 }
---- \ No newline at end of file
+----
diff --git a/solr/solr-ref-guide/src/numerical-analysis.adoc b/solr/solr-ref-guide/src/numerical-analysis.adoc
index b4e3584616c..dc3af58bccf 100644
--- a/solr/solr-ref-guide/src/numerical-analysis.adoc
+++ b/solr/solr-ref-guide/src/numerical-analysis.adoc
@@ -16,7 +16,7 @@
 // specific language governing permissions and limitations
 // under the License.
 
-Interpolation, derivatives and integrals are three interrelated topics which are part of the field of mathematics called numerical analysis. This section explores the math expressions available for numerical anlysis.
+This section explores the interrelated math expressions for interpolation and numerical calculus.
 
 == Interpolation
 
@@ -24,10 +24,10 @@ Interpolation is used to construct new data points between a set of known contro
 The ability to predict new data points allows for sampling along the curve defined by the control points.
 
-The interpolation functions described below all return an _interpolation model_
+The interpolation functions described below all return an _interpolation function_
 that can be passed to other functions which make use of the sampling capability. 
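+For example, the minimal sketch below (the control points are illustrative
+values, not drawn from a collection) uses `lerp`, the linear interpolation
+function described below, to create an interpolation function and then passes
+it to the `predict` function to sample the curve at a new x-axis point:
+
+[source,text]
+----
+let(x=array(0, 1, 2, 3, 4),
+    y=array(0, 2, 3, 8, 9),
+    f=lerp(x, y),
+    p=predict(f, 2.5))
+----
+
+With linear interpolation the prediction for `2.5` falls halfway between the
+control points `3` and `8`, so `p` is `5.5`.
+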
-
-If returned directly the interpolation model returns an array containing predictions for each of the
+If returned directly the interpolation function returns an array containing predictions for each of the
 control points. This is useful in the case of `loess` interpolation which first smooths the control points
 and then interpolates the smoothed points. All other interpolation functions simply return the original
 control points because interpolation predicts a curve that passes through the original control points.
 
@@ -44,73 +44,29 @@ and form a smooth curve between control points.
 * `loess`: Loess interpolation first performs a non-linear local regression to smooth the original
 control points. Then a spline is used to interpolate the smoothed control points.
 
-=== Upsampling
+=== Sampling Along the Curve
 
-Interpolation can be used to increase the sampling rate along a curve. One example
-of this would be to take a time series with samples every minute and create a data set with
-samples every second. In order to do this the data points between the minutes must be created.
+One way to better understand interpolation is to visualize what it means to sample along a curve. The example
+below zooms in on a specific region of a curve by sampling the curve within a specific x-axis range.
 
-The `predict` function can be used to predict values anywhere within the bounds of the interpolation
-range. The example below shows a very simple example of upsampling.
+image::images/math-expressions/interpolate1.png[]
 
-[source,text]
---
-let(x=array(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20), <1>
-    y=array(5, 10, 60, 190, 100, 130, 100, 20, 30, 10, 5), <2>
-    l=lerp(x, y), <3>
-    u=array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), <4>
-    p=predict(l, u)) <5>
--
+The visualization above first creates two arrays with x and y-axis points. Notice that the x-axis ranges from
+0 to 9. Then the `akima`, `spline` and `lerp`
+functions are applied to the vectors to create three interpolation functions.
 
-<1> In the example linear interpolation is performed on the arrays in variables *`x`* and *`y`*. The *`x`* variable,
-which is the x-axis, is a sequence from 0 to 20 with a stride of 2.
-<2> The *`y`* variable defines the curve along the x-axis.
-<3> The `lerp` function performs the interpolation and returns the interpolation model.
-<4> The `u` value is an array from 0 to 20 with a stride of 1. This fills in the gaps of the original x axis.
-The `predict` function then uses the interpolation function in variable *`l`* to predict values for
-every point in the array assigned to variable *`u`*.
-<5> The variable *`p`* is the array of predictions, which is the upsampled set of *`y`* values.
+Then 500 random samples are drawn from a uniform distribution between 0 and 3. These are
+the new zoomed-in x-axis points. Notice that we are sampling a specific
+area of the curve.
 
-When this expression is sent to the `/stream` handler it responds with:
+Then the `predict` function is used to predict y-axis points for
+the sampled x-axis, for all three interpolation functions. Finally all three prediction vectors
+are plotted with the sampled x-axis points.
+
+The red line is the `lerp` interpolation, the blue line is the `akima` and the purple line is
+the `spline` interpolation. You can see they each produce different curves in between the control
+points. 
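+
+A sketch of the expression behind a visualization like this one is shown below.
+The control point values are illustrative rather than taken from the image, and
+the three prediction vectors are assumed to be plotted as the `y`, `y1` and
+`y2` series of `zplot`:
+
+[source,text]
+----
+let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
+    y=array(0, 5, 2, 8, 3, 9, 4, 7, 1, 6),
+    lfit=lerp(x, y),
+    afit=akima(x, y),
+    sfit=spline(x, y),
+    u=sample(uniformDistribution(0, 3), 500),
+    lpred=predict(lfit, u),
+    apred=predict(afit, u),
+    spred=predict(sfit, u),
+    zplot(x=u, y=lpred, y1=apred, y2=spred))
+----
+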
-
-[source,json]
---
-{
-  "result-set": {
-    "docs": [
-      {
-        "g": [
-          5,
-          7.5,
-          10,
-          35,
-          60,
-          125,
-          190,
-          145,
-          100,
-          115,
-          130,
-          115,
-          100,
-          60,
-          20,
-          25,
-          30,
-          20,
-          10,
-          7.5,
-          5
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
---
 
 === Smoothing Interpolation
 
@@ -118,233 +74,123 @@ The `loess` function is a smoothing interpolator which means it doesn't derive
 a function that passes through the original control points. Instead the `loess` function
 returns a function that smooths the original control points.
 
-A technique known as local regression is used to compute the smoothed curve. The size of the
+A technique known as local regression is used to compute the smoothed curve. The size of the
 neighborhood of the local regression can be adjusted
 to control how close the new curve conforms to the original control points.
 
-The `loess` function is passed *`x`*- and *`y`*-axes and fits a smooth curve to the data.
-If only a single array is provided it is treated as the *`y`*-axis and a sequence is generated
-for the *`x`*-axis.
+The `loess` function is passed x- and y-axes and fits a smooth curve to the data.
+If only a single array is provided it is treated as the y-axis and a sequence is generated
+for the x-axis.
 
-The example below uses the `loess` function to fit a curve to a set of *`y`* values in an array.
-The `bandwidth` parameter defines the percent of data to use for the local
-regression. The lower the percent the smaller the neighborhood used for the local
-regression and the closer the curve will be to the original data.
+The example below shows the `loess` function being used to model a monthly
+time series. In the example the `timeseries` function is used to generate
+a monthly time series of average closing prices for the stock ticker
+*AMZN*. The `date_dt` and `avg(close_d)` fields from the time series
+are then vectorized and stored in variables `x` and `y`. The `loess`
+function is then applied to the `y` vector containing the average closing
+prices. The `bandwidth` named parameter specifies the percentage
+of the data set used to compute the local regression. The `loess` function
+returns the fitted model of smoothed data points, which is stored in variable `y1`.
 
-[source,text]
---
-let(echo="residuals, sumSqError",
-    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7,6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
-    curve=loess(y, bandwidth=.3),
-    residuals=ebeSubtract(y, curve),
-    sumSqError=sumSq(residuals))
--
+The `zplot` function is then used to plot the `x`, `y` and `y1`
+variables.
 
-In the example the fitted curve is subtracted from the original curve using the
-`ebeSubtract` function. The output shows the error between the
-fitted curve and the original curve, known as the residuals. The output also includes
-the sum-of-squares of the residuals which provides a measure
-of how large the error is:
+image::images/math-expressions/loess.png[]
 
-[source,json]
---
-{
-  "result-set": {
-    "docs": [
-      {
-        "residuals": [
-          0,
-          0,
-          0,
-          -0.040524802275866634,
-          -0.10531988096456502,
-          0.5906115002526198,
-          0.004215074334896762,
-          0.4201374330912433,
-          0.09618315578013803,
-          0.012107948556718817,
-          -0.9892939034492398,
-          0.012014364143757561,
-          0.1093830927709325,
-          0.523166271893805,
-          0.09658362075164639,
-          -0.011433819306139625,
-          0.9899403519886416,
-          -0.011707983372932773,
-          -0.004223284004140737,
-          -0.00021462867928434548,
-          0.0018723112875456138
-        ],
-        "sumSqError": 2.8016013870800616
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
---
-
-In the next example the curve is fit using a `bandwidth` of `.25`:
-
-[source,text]
---
-let(echo="residuals, sumSqError",
-    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 6, 5, 5, 3, 2, 1, 0),
-    curve=loess(y, .25),
-    residuals=ebeSubtract(y, curve),
-    sumSqError=sumSq(residuals))
--
-
-Notice that the curve is a closer fit, shown by the smaller `residuals` and lower value for the sum-of-squares of the
-residuals:
-
-[source,json]
---
-{
-  "result-set": {
-    "docs": [
-      {
-        "residuals": [
-          0,
-          0,
-          0,
-          0,
-          -0.19117650587715396,
-          0.442863451538809,
-          -0.18553845993358564,
-          0.29990769020356645,
-          0,
-          0.23761890236245709,
-          -0.7344358765888117,
-          0.2376189023624491,
-          0,
-          0.30373119215254984,
-          -3.552713678800501e-15,
-          -0.23761890236245264,
-          0.7344358765888046,
-          -0.2376189023625095,
-          0,
-          2.842170943040401e-14,
-          -2.4868995751603507e-14
-        ],
-        "sumSqError": 1.7539413576337557
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
---
 
 == Derivatives
 
-The derivative of a function measures the rate of change of the *`y`* value in respects to the
-rate of change of the *`x`* value.
+The derivative of a function measures the rate of change of the `y` value with respect to the
+rate of change of the `x` value.
 
-The `derivative` function can compute the derivative of any interpolation function.
-It can also compute the derivative of a derivative.
+The `derivative` function can compute the derivative for any of the
+interpolation functions described above. Each interpolation function
+will produce different derivatives that match the characteristics
+of the function.
 
-The example below computes the derivative for a `loess` interpolation function.
+=== The First Derivative (Velocity)
 
-[source,text]
---
-let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
-    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7,6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
-    curve=loess(x, y, bandwidth=.3),
-    derivative=derivative(curve))
--
+A simple example shows how the `derivative` function is used to calculate
+the rate of change or *velocity*.
 
-When this expression is sent to the `/stream` handler it
-responds with:
+In the example two vectors are created, one representing hours and
+one representing miles traveled. The `lerp` function is then used to
+create a linear interpolation of the `hours` and `miles` vectors.
+The `derivative` function is then applied to the
+linear interpolation. `zplot` is then used to plot the `hours`
+on the x-axis and `miles` on the y-axis, and the `derivative` as `mph`,
+at each x-axis point.
+
+
+image::images/math-expressions/derivative.png[]
+
+Notice that the *miles_traveled* line has a slope of 10 until the
+5th hour, where
+it changes to a slope of 50. The *mph* line, which is
+the derivative, visualizes the *velocity* of the
+*miles_traveled* line.
+
+Also notice that the derivative is calculated along
+straight lines showing immediate change from one point to the next. This
+is because linear interpolation (`lerp`) is used as the interpolation
+function. If the `spline` or `akima` functions had been used it would have produced
+a derivative with rounded curves.
+
+
+=== The Second Derivative (Acceleration)
+
+While the first derivative represents velocity, the second derivative
+represents *acceleration*. The second derivative is the derivative
+of the first derivative.
+
+The example below builds on the first example and adds the second derivative.
+Notice that the second derivative `d2` is taken by applying the
+derivative function to a linear interpolation of the first derivative.
+
+The second derivative is plotted as *acceleration* on the chart.
+
+image::images/math-expressions/derivatives.png[]
+
+Notice that the acceleration line is 0 until the *mph* line increases from 10 to 50. At this
+point the *acceleration* line moves to 40. As the *mph* line stays at 50, the acceleration
+line drops to 0.
+
+=== Price Velocity
+
+The example below shows how to plot the `derivative` for a time series generated
+by the `timeseries` function. In the example a monthly time series is
+generated for the average closing price for the stock ticker `amzn`.
+The `avg(close)` column is vectorized and interpolated using linear
+interpolation (`lerp`). The `zplot` function is then used to plot the derivative
+of the time series.
+
+image::images/math-expressions/derivative2.png[]
+
+Notice that the derivative plot clearly shows the rates of change in the stock price over time.
 
-[source,json]
---
-{
-  "result-set": {
-    "docs": [
-      {
-        "derivative": [
-          1.0022002675659012,
-          0.9955994648681976,
-          1.0154018729613081,
-          1.0643674501141696,
-          1.0430879694757085,
-          0.9698717643975381,
-          0.7488201070357539,
-          0.44627000894357516,
-          0.19019561285422165,
-          0.01703599324311178,
-          -0.001908408138535126,
-          -0.009121607450087499,
-          -0.2576361507216319,
-          -0.49378951291352746,
-          -0.7288073815664,
-          -0.9871806872210384,
-          -1.0025400632604322,
-          -1.001836567536853,
-          -1.0076227586138085,
-          -1.0021524620888589,
-          -1.0020541789058157
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
---
 
 == Integrals
 
 An integral is a measure of the volume underneath a curve.
-The `integrate` function computes an integral for a specific
-range of an interpolated curve.
+The `integral` function computes the cumulative integrals for a curve or the integral for a specific
+range of an interpolated curve. Like the `derivative` function the `integral` function operates
+over interpolation functions.
 
-In the example below the `integrate` function computes an
-integral for the entire range of the curve, 0 through 20.
+=== Single Integral
+
+If the `integral` function is passed a *start* and *end* range it will compute the volume under the
+curve within that specific range.
+
+In the example below the `integral` function computes an
+integral for the entire range of the curve, 0 through 10. Notice that the `integral` function is passed
+the interpolated curve and the start and end range, and returns the integral for the range. 

 [source,text]
 ----
 let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
     y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7,6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
     curve=loess(x, y, bandwidth=.3),
-    integral=integrate(curve, 0, 20))
---
-
-When this expression is sent to the `/stream` handler it
-responds with:
-
-[source,json]
---
-{
-  "result-set": {
-    "docs": [
-      {
-        "integral": 90.17446104846645
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
---
-
-In the next example an integral is computed for the range of 0 through 10.
-
-[source,text]
---
-let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
-    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7,6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
-    curve=loess(x, y, bandwidth=.3),
-    integral=integrate(curve, 0, 10))
+    integral=integral(curve, 0, 10))
 ----
 
 When this expression is sent to the `/stream` handler it
@@ -367,6 +213,44 @@ responds with:
 }
 ----
 
+=== Cumulative Integral Plot
+
+If the `integral` function is passed a single interpolated curve it returns a vector of the cumulative
+integrals for the curve. The cumulative integrals vector contains a cumulative integral calculation
+for each x-axis point. The cumulative integral is calculated by taking the
+integral of the range between each x-axis point and the *first* x-axis point. In the example above this would
+mean calculating a vector of integrals as such:
+
+[source,text]
+----
+let(x=array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
+    y=array(0, 1, 2, 3, 4, 5.7, 6, 7, 7, 7,6, 7, 7, 7, 6, 5, 5, 3, 2, 1, 0),
+    curve=loess(x, y, bandwidth=.3),
+    integrals=array(0, integral(curve, 0, 1), integral(curve, 0, 2), integral(curve, 0, 3), ...)
+----
+
+The plot of cumulative integrals visualizes how much cumulative volume of the curve is under each
+x-axis point.
+
+The example below shows the cumulative integral plot for a time series generated by
+the `timeseries` function. In the example a monthly time series is
+generated for the average closing price for the stock ticker `amzn`.
+The `avg(close)` column is vectorized and interpolated using a `spline`.
+
+The `zplot` function is then used to plot the cumulative integral
+of the time series.
+
+image::images/math-expressions/integral.png[]
+
+The plot above visualizes the volume under the curve as the *AMZN* stock
+price changes over time. Because this plot is cumulative, the volume under
+a stock price time series which stays the same over time will
+have a positive *linear* slope. A stock that has rising prices will have a *concave* shape and
+a stock with falling prices will have a *convex* shape.
+
+In this particular example the integral plot becomes more *concave* over time,
+showing accelerating increases in stock price.
+
 == Bicubic Spline
 
 The `bicubicSpline` function can be used to interpolate and predict values
@@ -395,7 +279,7 @@ the 9th floor was `415000` (row 3, column 3).
 
 The `bicubicSpline` function is then used to interpolate the grid, and the `predict` function is used
 to predict a value for year 2003, floor 8.
-Notice that the matrix does not include a data point for year 2003, floor 8. The `bicupicSpline`
+Notice that the matrix does not include a data point for year 2003, floor 8. 
The `bicubicSpline` function creates that data point based on the surrounding data in the matrix: [source,json] diff --git a/solr/solr-ref-guide/src/probability-distributions.adoc b/solr/solr-ref-guide/src/probability-distributions.adoc index bcee553dcc6..85e42475bc4 100644 --- a/solr/solr-ref-guide/src/probability-distributions.adoc +++ b/solr/solr-ref-guide/src/probability-distributions.adoc @@ -19,66 +19,145 @@ This section of the user guide covers the probability distribution framework included in the math expressions library. -== Probability Distribution Framework +== Visualization -The probability distribution framework includes many commonly used <> -and <> probability distributions, including support for <> -and <> distributions that model real world data. +Probability distributions can be visualized with Zeppelin-Solr using the +`zplot` function with the `dist` parameter, which visualizes the +probability density function (PDF) of the distribution. -The probability distribution framework also includes a set of functions that use the probability distributions -to support probability calculations and sampling. +Example visualizations are shown with each distribution below. -=== Real Distributions +=== Continuous Distributions -The probability distribution framework has the following functions -which support well known real probability distributions: +Continuous probability distributions work with continuous numbers (floating points). Below +are the supported continuous probability distributions. -* `normalDistribution`: Creates a normal distribution function. +==== empiricalDistribution -* `logNormalDistribution`: Creates a log normal distribution function. +The `empiricalDistribution` function creates a continuous probability +distribution from actual data. -* `gammaDistribution`: Creates a gamma distribution function. +Empirical distributions can be used to conveniently visualize the probability density function of a random sample from a SolrCloud collection. +The example below shows the `zplot` function visualizing the probability +density of a random sample with a 32 bin histogram. -* `betaDistribution`: Creates a beta distribution function. +image::images/math-expressions/empirical.png[] -* `uniformDistribution`: Creates a uniform real distribution function. +==== normalDistribution -* `weibullDistribution`: Creates a Weibull distribution function. +The visualization below shows a normal distribution with a `mean` of 0 and `standard deviation` of 1. -* `triangularDistribution`: Creates a triangular distribution function. +image::images/math-expressions/dist.png[] -* `constantDistribution`: Creates constant real distribution function. -=== Empirical Distribution +==== logNormalDistribution -The `empiricalDistribution` function creates a real probability -distribution from actual data. An empirical distribution -can be used interchangeably with any of the theoretical -real distributions. +The visualization below shows a log normal distribution with a `shape` of .25 and `scale` +of 0. -=== Discrete +image::images/math-expressions/lognormal.png[] -The probability distribution framework has the following functions -which support well known discrete probability distributions: +==== gammaDistribution -* `poissonDistribution`: Creates a Poisson distribution function. +The visualization below shows a gamma distribution with a `shape` of 7.5 and `scale` +of 1. -* `binomialDistribution`: Creates a binomial distribution function. 
+
+image::images/math-expressions/gamma.png[]
 
-* `uniformIntegerDistribution`: Creates a uniform integer distribution function.
+==== betaDistribution
 
-* `geometricDistribution`: Creates a geometric distribution function.
+The visualization below shows a beta distribution with a `shape1` of 2 and `shape2`
+of 2.
 
-* `zipFDistribution`: Creates a Zipf distribution function.
+image::images/math-expressions/beta.png[]
 
-=== Enumerated Distributions
+==== uniformDistribution
+
+The visualization below shows a uniform distribution between 0 and 10.
+
+image::images/math-expressions/uniformr.png[]
+
+==== weibullDistribution
+
+The visualization below shows a Weibull distribution with a `shape` of 5 and `scale`
+of 1.
+
+image::images/math-expressions/weibull.png[]
+
+==== triangularDistribution
+
+The visualization below shows a triangular distribution with a low of 5, a mode of 10,
+and a high value of 20.
+
+image::images/math-expressions/triangular.png[]
+
+==== constantDistribution
+
+The visualization below shows a constant distribution of 10.5.
+
+image::images/math-expressions/constant.png[]
+
+
+
+=== Discrete Distributions
+
+Discrete probability distributions work with discrete numbers (integers). Below are the
+supported discrete probability distributions.
+
+==== enumeratedDistribution
 
 The `enumeratedDistribution` function creates a discrete
-distribution function from a data set of discrete values,
-or from and enumerated list of values and probabilities.
+distribution function
+from an enumerated list of values and probabilities or
+from a data set of discrete values.
+
+The visualization below shows an enumerated distribution created from a list of
+discrete values and probabilities.
+
+image::images/math-expressions/enum1.png[]
+
+The visualization below shows an enumerated distribution generated from a search
+result that has been transformed into a vector of discrete values.
+
+image::images/math-expressions/enum2.png[]
+
+==== poissonDistribution
+
+The visualization below shows a Poisson distribution with a `mean` of 15.
+
+image::images/math-expressions/poisson.png[]
+
+
+==== binomialDistribution
+
+The visualization below shows a binomial distribution with 100 trials and a .15
+probability of success.
+
+image::images/math-expressions/binomial.png[]
+
+
+==== uniformIntegerDistribution
+
+The visualization below shows a uniform integer distribution between 0 and 10.
+
+image::images/math-expressions/uniform.png[]
+
+
+==== geometricDistribution
+
+The visualization below shows a geometric distribution with a probability of success of
+.25.
+
+image::images/math-expressions/geometric.png[]
+
+
+==== zipFDistribution
+
+The visualization below shows a ZipF distribution with a size of 50 and an exponent of 1.
+
+image::images/math-expressions/zipf.png[]
+
 
-Enumerated distribution functions can be used interchangeably
-with any of the theoretical discrete distributions.
 
 === Cumulative Probability
 
@@ -120,23 +199,22 @@ When this expression is sent to the `/stream` handler it responds with:
 }
 ----
 
-Below is an example of a cumulative probability calculation
-using an empirical distribution.
+=== Probability
 
-In the example an empirical distribution is created from a random
-sample taken from the `price_f` field.
+All probability distributions can calculate the probability
+over a range of values.
 
-The cumulative probability of the value `.75` is then calculated.
-The `price_f` field in this example was generated using a
-uniform real distribution between 0 and 1, so the output of the
- `cumulativeProbability` function is very close to .75.
+In the following example an empirical distribution is created
+from a sample of file sizes drawn from the logs collection.
+Then the probability of a file size in the range of 40000
+to 41000 is calculated to be 19%.
 
 [source,text]
 ----
-let(a=random(collection1, q="*:*", rows="30000", fl="price_f"),
-    b=col(a, price_f),
-    c=empiricalDistribution(b),
-    d=cumulativeProbability(c, .75))
+let(a=random(logs, q="*:*", fl="filesize_d", rows="50000"),
+    b=col(a, filesize_d),
+    c=empiricalDistribution(b, 100),
+    d=probability(c, 40000, 41000))
 ----
 
 When this expression is sent to the `/stream` handler it responds with:
 
@@ -147,11 +225,11 @@ When this expression is sent to the `/stream` handler it responds with:
   "result-set": {
     "docs": [
       {
-        "b": 0.7554217416103242
+        "d": 0.19006540560734791
       },
       {
         "EOF": true,
-        "RESPONSE_TIME": 0
+        "RESPONSE_TIME": 550
       }
     ]
   }
 }
 ----
@@ -197,44 +275,11 @@ When this expression is sent to the `/stream` handler it responds with:
 }
 ----
 
-Below is an example of a probability calculation using an enumerated distribution.
-
-In the example an enumerated distribution is created from a random
-sample taken from the `day_i` field, which was created using a uniform integer distribution between 0 and 30.
-
-The probability of the discrete value 10 is then calculated.
-
-[source,text]
---
-let(a=random(collection1, q="*:*", rows="30000", fl="day_i"),
-    b=col(a, day_i),
-    c=enumeratedDistribution(b),
-    d=probability(c, 10))
--
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
---
-{
-  "result-set": {
-    "docs": [
-      {
-        "d": 0.03356666666666666
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 488
-      }
-    ]
-  }
-}
--
 
 === Sampling
 
 All probability distributions support sampling. The `sample`
-function returns 1 or more random samples from a probability distribution.
+function returns one or more random samples from a probability distribution.
 
 Below is an example drawing a single sample from a normal distribution.
 
@@ -263,43 +308,28 @@ When this expression is sent to the `/stream` handler it responds with:
 }
 ----
 
-Below is an example drawing 10 samples from a normal distribution.
+The `sample` function can also return a vector of samples. Vectors of samples
+can be visualized as scatter plots to gain an intuitive understanding
+of the underlying distribution.
+
+The first example shows the scatter plot of a normal distribution with
+a mean of 0 and a standard deviation of 5.
+
+image::images/math-expressions/sample-scatter.png[]
+
+The next example shows a scatter plot of the same distribution with
+an ascending sort applied to the sample vector.
+
+image::images/math-expressions/sample-scatter1.png[]
+
+The next example shows two different distributions overlaid
+in the same scatter plot. 
+ +image::images/math-expressions/sample-overlay.png[] + -[source,text] ----- -let(a=normalDistribution(10, 5), - b=sample(a, 10)) ----- -When this expression is sent to the `/stream` handler it responds with: -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "b": [ - 10.18444709339441, - 9.466947971749377, - 1.2420697166234458, - 11.074501226984806, - 7.659629052136225, - 0.4440887839190708, - 13.710925254778786, - 2.089566359480239, - 0.7907293097654424, - 2.8184587681006734 - ] - }, - { - "EOF": true, - "RESPONSE_TIME": 3 - } - ] - } -} ----- === Multivariate Normal Distribution @@ -329,7 +359,7 @@ In this example 5000 random samples are selected from a collection of log record the fields `filesize_d` and `response_d`. The values of both fields conform to a normal distribution. Both fields are then vectorized. The `filesize_d` vector is stored in -variable *`b`* and the `response_d` variable is stored in variable *`c`*. +variable `b` and the `response_d` variable is stored in variable `c`. An array is created that contains the means of the two vectorized fields. @@ -341,13 +371,13 @@ the observation matrix with the `cov` function. The covariance matrix describes The `multivariateNormalDistribution` function is then called with the array of means for the two fields and the covariance matrix. The model for the -multivariate normal distribution is assigned to variable *`g`*. +multivariate normal distribution is assigned to variable `g`. Finally five samples are drawn from the multivariate normal distribution. [source,text] ---- -let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"), +let(a=random(logs, q="*:*", rows="5000", fl="filesize_d, response_d"), b=col(a, filesize_d), c=col(a, response_d), d=array(mean(b), mean(c)), diff --git a/solr/solr-ref-guide/src/regression.adoc b/solr/solr-ref-guide/src/regression.adoc index 4ec23c3fea8..4e3dcad9cef 100644 --- a/solr/solr-ref-guide/src/regression.adoc +++ b/solr/solr-ref-guide/src/regression.adoc @@ -25,9 +25,9 @@ between two random variables. Sample observations are provided with two numeric arrays. The first numeric array is the independent variable and the second array is the dependent variable. -In the example below the `random` function selects 5000 random samples each containing +In the example below the `random` function selects 50000 random samples each containing the fields `filesize_d` and `response_d`. The two fields are vectorized -and stored in variables *`b`* and *`c`*. Then the `regress` function performs a regression +and stored in variables `x` and `y`. Then the `regress` function performs a regression analysis on the two numeric arrays. The `regress` function returns a single tuple with the results of the regression @@ -35,10 +35,10 @@ analysis. [source,text] ---- -let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"), - b=col(a, filesize_d), - c=col(a, response_d), - d=regress(b, c)) +let(a=random(logs, q="*:*", rows="50000", fl="filesize_d, response_d"), + x=col(a, filesize_d), + y=col(a, response_d), + r=regress(x, y)) ---- Note that in this regression analysis the value of `RSquared` is `.75`. This means that changes in @@ -50,29 +50,32 @@ Note that in this regression analysis the value of `RSquared` is `.75`. 
This means that changes in
 "result-set": {
    "docs": [
      {
-        "d": {
-          "significance": 0,
-          "totalSumSquares": 10564812.895147054,
-          "R": 0.8674822407146515,
-          "RSquared": 0.7525254379553127,
-          "meanSquareError": 523.1137343558588,
-          "intercept": -49.528134913099095,
-          "slopeConfidenceInterval": 0.0003171801710329995,
-          "regressionSumSquares": 7950290.450836472,
-          "slope": 0.019945557923159506,
-          "interceptStdErr": 6.489732340389941,
-          "N": 5000
-        }
+        "significance": 0,
+        "totalSumSquares": 96595678.64838874,
+        "R": 0.9052835767815126,
+        "RSquared": 0.8195383543903288,
+        "meanSquareError": 348.6502485633668,
+        "intercept": 55.64040842391729,
+        "slopeConfidenceInterval": 0.0000822026526346821,
+        "regressionSumSquares": 79163863.52071753,
+        "slope": 0.019984612363694493,
+        "interceptStdErr": 1.6792610845256566,
+        "N": 50000
      },
      {
        "EOF": true,
-        "RESPONSE_TIME": 98
+        "RESPONSE_TIME": 344
      }
    ]
  }
}
----

+The diagnostics can be visualized in a table using Zeppelin-Solr.
+
+image::images/math-expressions/diagnostics.png[]
+
+
=== Prediction

The `predict` function uses the regression model to make predictions.
@@ -84,11 +87,11 @@ the value of `response_d` for the `filesize_d` value of `40000`.

[source,text]
----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
-    b=col(a, filesize_d),
-    c=col(a, response_d),
-    d=regress(b, c),
-    e=predict(d, 40000))
+let(a=random(logs, q="*:*", rows="5000", fl="filesize_d, response_d"),
+    x=col(a, filesize_d),
+    y=col(a, response_d),
+    r=regress(x, y),
+    p=predict(r, 40000))
----

When this expression is sent to the `/stream` handler it responds with:

@@ -99,7 +102,7 @@ When this expression is sent to the `/stream` handler it responds with:
 "result-set": {
    "docs": [
      {
-        "e": 748.079241022975
+        "p": 748.079241022975
      },
      {
        "EOF": true,
@@ -119,11 +122,11 @@ In this case 5000 predictions are returned.

[source,text]
----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
-    b=col(a, filesize_d),
-    c=col(a, response_d),
-    d=regress(b, c),
-    e=predict(d, b))
+let(a=random(logs, q="*:*", rows="5000", fl="filesize_d, response_d"),
+    x=col(a, filesize_d),
+    y=col(a, response_d),
+    r=regress(x, y),
+    p=predict(r, x))
----

When this expression is sent to the `/stream` handler it responds with:

@@ -134,7 +137,7 @@ When this expression is sent to the `/stream` handler it responds with:
 "result-set": {
    "docs": [
      {
-        "e": [
+        "p": [
          742.2525322514165,
          709.6972488729955,
          687.8382568904871,
@@ -145,8 +148,7 @@ When this expression is sent to the `/stream` handler it responds with:
          699.5597256337142,
          742.4738911248204,
          769.0342605881644,
-          746.6740473150268,
-          ...
+          746.6740473150268
        ]
      },
      {
@@ -158,25 +160,34 @@ When this expression is sent to the `/stream` handler it responds with:
 }
----

+=== Regression Plot
+
+Using `zplot` and the Zeppelin-Solr interpreter we can visualize both the observations and the predictions in
+the same scatter plot.
+In the example below `zplot` is plotting the `filesize_d` observations on the
+x-axis, the `response_d` observations on the y-axis and the predictions on the y1-axis.
+
+image::images/math-expressions/linear.png[]
+
=== Residuals

The difference between the observed value and the predicted value is known as the
residual. There isn't a specific function to calculate the residuals but vector
math can be used to perform the calculation.

-In the example below the predictions are stored in variable *`e`*. The `ebeSubtract`
+In the example below the predictions are stored in variable `p`. 
The `ebeSubtract`
function is then used to subtract the predictions
-from the actual `response_d` values stored in variable *`c`*. Variable *`f`* contains
+from the actual `response_d` values stored in variable `y`. Variable `e` contains
the array of residuals.

[source,text]
----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, response_d"),
-    b=col(a, filesize_d),
-    c=col(a, response_d),
-    d=regress(b, c),
-    e=predict(d, b),
-    f=ebeSubtract(c, e))
+let(a=random(logs, q="*:*", rows="500", fl="filesize_d, response_d"),
+    x=col(a, filesize_d),
+    y=col(a, response_d),
+    r=regress(x, y),
+    p=predict(r, x),
+    e=ebeSubtract(y, p))
----

When this expression is sent to the `/stream` handler it responds with:

@@ -200,8 +211,7 @@ When this expression is sent to the `/stream` handler it responds with:
          -30.213178859683012,
          -30.609943619066826,
          10.527700442607625,
-          10.68046928406568,
-          ...
+          10.68046928406568
        ]
      },
      {
@@ -213,32 +223,53 @@ When this expression is sent to the `/stream` handler it responds with:
 }
----

+=== Residual Plot
+
+Using `zplot` and Zeppelin-Solr we can visualize the residuals with
+a residuals plot. The example residual plot below plots the predicted value on the
+x-axis and the error of the prediction on the y-axis.
+
+image::images/math-expressions/residual-plot.png[]
+
+The residual plot can be used to interpret the reliability of the model. Three things to look for are:
+
+. Do the residuals appear to be normally distributed with a mean of 0?
+This makes it easier to interpret the results of the model to determine if the distribution of the errors is acceptable for predictions.
+It also makes it easier to use a model of the residuals for anomaly detection on new predictions.
+
+. Do the residuals appear to be *homoscedastic*?
+That is, is the variance of the residuals the same across the range of predictions?
+By plotting the prediction on the x-axis and the error on the y-axis we can see if the variability stays the same as the predictions get higher.
+If the residuals are homoscedastic we can trust the model's error to be consistent across the range of predictions.
+
+. Is there any pattern to the residuals? If so, there is likely still a signal within the data that needs to be modeled.
+
+
== Multivariate Linear Regression

The `olsRegress` function performs a multivariate linear regression analysis. Multivariate linear
regression models the linear relationship between two or more independent variables and a dependent variable.

The example below extends the simple linear regression example by introducing a new independent variable
-called `service_d`. The `service_d` variable is the service level of the request and it can range from 1 to 4
-in the data-set. The higher the service level, the higher the bandwidth available for the request.
+called `load_d`. The `load_d` variable is the load on the network while the file is being downloaded.

-Notice that the two independent variables `filesize_d` and `service_d` are vectorized and stored
-in the variables *`b`* and *`c`*. The variables *`b`* and *`c`* are then added as rows to a `matrix`. The matrix is
+Notice that the two independent variables `filesize_d` and `load_d` are vectorized and stored
+in the variables `x` and `y`. The variables `x` and `y` are then added as rows to a `matrix`. The matrix is
 then transposed so that each row in the matrix represents one observation with `filesize_d`
and `load_d`. 
The `olsRegress` function then performs the multivariate regression analysis using the observation
matrix as the independent variables and the `response_d` values, stored in variable `z`, as the dependent variable.

[source,text]
----
-let(a=random(collection2, q="*:*", rows="30000", fl="filesize_d, service_d, response_d"),
-    b=col(a, filesize_d),
-    c=col(a, service_d),
-    d=col(a, response_d),
-    e=transpose(matrix(b, c)),
-    f=olsRegress(e, d))
+let(a=random(testapp, q="*:*", rows="30000", fl="filesize_d, load_d, response_d"),
+    x=col(a, filesize_d),
+    y=col(a, load_d),
+    z=col(a, response_d),
+    m=transpose(matrix(x, y)),
+    r=olsRegress(m, z))
----

-Notice in the response that the RSquared of the regression analysis is 1. This means that linear relationship between
+Notice in the response that the `RSquared` of the regression analysis is `.885`. This means that the linear relationship between
`filesize_d` and `load_d` describes 88.5% of the variability of the `response_d` variable:

[source,json]
----
@@ -247,43 +278,41 @@ Notice in the response that the RSquared of the regression analysis is 1. This m
 "result-set": {
    "docs": [
      {
-        "f": {
-          "regressionParametersStandardErrors": [
-            2.0660690430026933e-13,
-            5.1212982077663434e-18,
-            9.10920932555875e-15
+        "regressionParametersStandardErrors": [
+          1.7792032752524236,
+          0.0000429945089590394,
+          0.0008592489428291642
+        ],
+        "RSquared": 0.8850359458670845,
+        "regressionParameters": [
+          0.7318766882597804,
+          0.01998298784650873,
+          0.10982104952105468
+        ],
+        "regressandVariance": 1938.8190758686717,
+        "regressionParametersVariance": [
+          [
+            0.014201127587649602,
+            -3.326633951803927e-7,
+            -0.000001732754417954437
          ],
-          "RSquared": 1,
-          "regressionParameters": [
-            6.553210695971329e-12,
-            0.019999999999999858,
-            -20.49999999999968
+          [
+            -3.326633951803927e-7,
+            8.292732891338694e-12,
+            2.0407522508189773e-12
          ],
-          "regressandVariance": 2124.130825172683,
-          "regressionParametersVariance": [
-            [
-              0.013660174897582315,
-              -3.361258014840509e-7,
-              -0.00006893737578369605
-            ],
-            [
-              -3.361258014840509e-7,
-              8.393183709503206e-12,
-              6.430253229589981e-11
-            ],
-            [
-              -0.00006893737578369605,
-              6.430253229589981e-11,
-              0.000026553878455570856
-            ]
-          ],
-          "adjustedRSquared": 1,
-          "residualSumSquares": 9.373703759269822e-20
-        }
+          [
+            -0.000001732754417954437,
+            2.0407522508189773e-12,
+            3.3121477630934995e-9
+          ]
+        ],
+        "adjustedRSquared": 0.8850282808303053,
+        "residualSumSquares": 6686612.141261716
      },
      {
        "EOF": true,
-        "RESPONSE_TIME": 690
+        "RESPONSE_TIME": 374
      }
    ]
  }
@@ -296,17 +325,17 @@ The `predict` function can also be used to make predictions for multivariate lin

Below is an example of a single prediction using the multivariate linear regression model and a single observation. The observation
is an array that matches the structure of the observation matrix used to build the model. In this case
-the first value represents a `filesize_d` of `40000` and the second value represents a `service_d` of `4`.
+the first value represents a `filesize_d` of `40000` and the second value represents a `load_d` of `4`.
[source,text]
----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, response_d"),
-    b=col(a, filesize_d),
-    c=col(a, service_d),
-    d=col(a, response_d),
-    e=transpose(matrix(b, c)),
-    f=olsRegress(e, d),
-    g=predict(f, array(40000, 4)))
+let(a=random(logs, q="*:*", rows="5000", fl="filesize_d, load_d, response_d"),
+    x=col(a, filesize_d),
+    y=col(a, load_d),
+    z=col(a, response_d),
+    m=transpose(matrix(x, y)),
+    r=olsRegress(m, z),
+    p=predict(r, array(40000, 4)))
----

When this expression is sent to the `/stream` handler it responds with:

@@ -317,11 +346,11 @@ When this expression is sent to the `/stream` handler it responds with:
 "result-set": {
    "docs": [
      {
-        "g": 718.0000000000005
+        "p": 801.7725344814675
      },
      {
        "EOF": true,
-        "RESPONSE_TIME": 117
+        "RESPONSE_TIME": 70
      }
    ]
  }
@@ -336,13 +365,13 @@ is passed to the `predict` function and it returns an array of predictions.

[source,text]
----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, response_d"),
-    b=col(a, filesize_d),
-    c=col(a, service_d),
-    d=col(a, response_d),
-    e=transpose(matrix(b, c)),
-    f=olsRegress(e, d),
-    g=predict(f, e))
+let(a=random(logs, q="*:*", rows="5000", fl="filesize_d, load_d, response_d"),
+    x=col(a, filesize_d),
+    y=col(a, load_d),
+    z=col(a, response_d),
+    m=transpose(matrix(x, y)),
+    r=olsRegress(m, z),
+    p=predict(r, m))
----

When this expression is sent to the `/stream` handler it responds with:

@@ -353,19 +382,19 @@ When this expression is sent to the `/stream` handler it responds with:
 "result-set": {
    "docs": [
      {
-        "e": [
-          685.498283591961,
-          801.2175699959365,
-          776.7638245911025,
-          610.3559852681935,
-          751.0925865965207,
-          787.2914663381897,
-          744.3632053810668,
-          688.3729301599697,
-          765.367783417171,
-          724.9309687628346,
-          834.4350712384264,
-          ...
+        "p": [
+          917.7122088913725,
+          900.5418518783401,
+          871.7805676516689,
+          822.1887964840801,
+          828.0842807117554,
+          785.1262470470162,
+          833.2583851225845,
+          802.016811579941,
+          841.5253327135974,
+          896.9648275225625,
+          858.6511235977382,
+          869.8381475112501
        ]
      },
      {
@@ -383,18 +412,18 @@ When this expression is sent to the `/stream` handler it responds with:
 }
----

Once the predictions are generated the residuals can be calculated using the same approach used with
simple linear regression.

Below is an example of the residuals calculation following a multivariate linear regression. In the example
-the predictions stored variable *`g`* are subtracted from observed values stored in variable *`d`*.
+the predictions stored in variable `p` are subtracted from the observed values stored in variable `z`.

[source,text]
----
-let(a=random(collection2, q="*:*", rows="5000", fl="filesize_d, service_d, response_d"),
-    b=col(a, filesize_d),
-    c=col(a, service_d),
-    d=col(a, response_d),
-    e=transpose(matrix(b, c)),
-    f=olsRegress(e, d),
-    g=predict(f, e),
-    h=ebeSubtract(d, g))
+let(a=random(logs, q="*:*", rows="5000", fl="filesize_d, load_d, response_d"),
+    x=col(a, filesize_d),
+    y=col(a, load_d),
+    z=col(a, response_d),
+    m=transpose(matrix(x, y)),
+    r=olsRegress(m, z),
+    p=predict(r, m),
+    e=ebeSubtract(z, p))
----

When this expression is sent to the `/stream` handler it responds with:

@@ -406,18 +435,17 @@ When this expression is sent to the `/stream` handler it responds with:
    "docs": [
      {
        "e": [
-          1.1368683772161603e-13,
-          1.1368683772161603e-13,
-          0,
-          1.1368683772161603e-13,
-          0,
-          1.1368683772161603e-13,
-          0,
-          2.2737367544323206e-13,
-          1.1368683772161603e-13,
-          2.2737367544323206e-13,
-          1.1368683772161603e-13,
-          ... 
+ 21.452271655340496, + 9.647947283595727, + -23.02328008866334, + -13.533046479596806, + -16.1531952414299, + 4.966514036315402, + 23.70151322413119, + -4.276176642246014, + 10.781062392156628, + 0.00039750380267378205, + -1.8307638852961645 ] }, { @@ -428,3 +456,12 @@ When this expression is sent to the `/stream` handler it responds with: } } ---- + +=== Residual Plot + +The residual plot for multi-variate linear regression is the same as for simple linear regression. +The predictions are plotted on the x-axis and the error is plotted on the y-axis. + +image::images/math-expressions/residual-plot2.png[] + +The residual plot for multi-variate linear regression can be interpreted in the exact same way as simple linear regression. diff --git a/solr/solr-ref-guide/src/scalar-math.adoc b/solr/solr-ref-guide/src/scalar-math.adoc index f5fa74584ac..696aa00de67 100644 --- a/solr/solr-ref-guide/src/scalar-math.adoc +++ b/solr/solr-ref-guide/src/scalar-math.adoc @@ -74,6 +74,17 @@ This expression returns the following response: } ---- +== Visualization + +In the Zeppelin-Solr interpreter you can simply type in scalar math functions and the +result will be shown in a table format. + +image::images/math-expressions/scalar.png[] + +The *Number* visualization can be used to visualize the number with text and icons. + +image::images/math-expressions/num.png[] + == Streaming Scalar Math Scalar math expressions can also be applied to each tuple in a stream @@ -83,19 +94,19 @@ The `select` function can also use math expressions to compute new values and add them to the outgoing tuples. In the example below the `select` expression is wrapping a search -expression. The `select` function is selecting the *price_f* field -and computing a new field called *newPrice* using the `mult` math +expression. The `select` function is selecting the `response_d` field +and computing a new field called `new_response` using the `mult` math expression. -The first parameter of the `mult` expression is the *price_f* field. +The first parameter of the `mult` expression is the `response_d` field. The second parameter is the scalar value 10. This multiplies the value -of the *price_f* field in each tuple by 10. +of the `response_d` field in each tuple by 10. [source,text] ---- -select(search(collection2, q="*:*", fl="price_f", sort="price_f desc", rows="3"), - price_f, - mult(price_f, 10) as newPrice) +select(search(testapp, q="*:*", fl="response_d", sort="response_d desc", rows="3"), + response_d, + mult(response_d, 10) as new_response) ---- When this expression is sent to the `/stream` handler it responds with: @@ -106,26 +117,37 @@ When this expression is sent to the `/stream` handler it responds with: "result-set": { "docs": [ { - "price_f": 0.99999994, - "newPrice": 9.9999994 + "response_d": 1080.3692514541938, + "new_response": 10803.692514541937 }, { - "price_f": 0.99999994, - "newPrice": 9.9999994 + "response_d": 1067.441598608506, + "new_response": 10674.41598608506 }, { - "price_f": 0.9999992, - "newPrice": 9.999992 + "response_d": 1059.8400090891566, + "new_response": 10598.400090891566 }, { "EOF": true, - "RESPONSE_TIME": 3 + "RESPONSE_TIME": 12 } ] } } ---- +== Visualization + +The expression above can be visualized as a table using Zeppelin-Solr. + +image::images/math-expressions/stream.png[] + +By switching to one of the line chart visualizations the two variables can be plotted on the x and y-axis. 
+
+image::images/math-expressions/line.png[]
+
+
== More Scalar Math Functions

The following scalar math functions are available in the math expressions library:
@@ -134,4 +156,3 @@ The following scalar math functions are available in the math expressions librar
`pow`, `mod`, `ceil`, `floor`, `sin`, `asin`,
`sinh`, `cos`, `acos`, `cosh`, `tan`, `atan`,
`tanh`, `round`, `precision`, `recip`, `sqrt`, `cbrt`
-
diff --git a/solr/solr-ref-guide/src/search-sample.adoc b/solr/solr-ref-guide/src/search-sample.adoc
new file mode 100644
index 00000000000..4ef07b066f1
--- /dev/null
+++ b/solr/solr-ref-guide/src/search-sample.adoc
@@ -0,0 +1,288 @@
+= Searching, Sampling and Aggregation
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+Data is the indispensable input for statistical analysis. This section
+provides an overview of the key functions for retrieving data for
+visualization and statistical analysis: searching, sampling
+and aggregation.
+
+== Searching
+
+=== Exploring
+
+The `search` function can be used to search a SolrCloud collection and return a
+result set.
+
+Below is an example of the most basic `search` function called from the Zeppelin-Solr interpreter.
+Zeppelin-Solr sends the `search(logs)` call to the `/stream` handler and displays the results
+in *table* format.
+
+In the example the `search` function is passed only the name of the collection being searched.
+This returns a result set of 10 records with all fields.
+This simple function is useful for exploring the fields in the data and understanding how to start refining the search criteria.
+
+image::images/math-expressions/search1.png[]
+
+=== Searching and Sorting
+
+Once the format of the records is known, parameters can be added to the `search` function to begin analyzing the data.
+
+In the example below a search query, field list, rows and sort have been added to the `search` function.
+Now the search is limited to records within a specific time range and returns
+a maximum result set of 750 records sorted by `tdate_dt` ascending.
+We have also limited the result set to three specific fields.
+
+image::images/math-expressions/search-sort.png[]
+
+Once the data is loaded into the table we can switch to a scatter plot and plot the `filesize_d` column
+on the *x-axis* and the `response_d` column on the *y-axis*.
+
+image::images/math-expressions/search-sort-plot.png[]
+
+This allows us to quickly visualize the relationship between two variables
+selected from a very specific slice of the index.
+
+=== Scoring
+
+The `search` function will score and rank documents when a query is performed on
+a text field. The example below shows the scoring and ranking of results.
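+
+A sketch of the kind of expression behind this visualization; the collection,
+query and field names here are hypothetical:
+
+[source,text]
+----
+search(logs,
+       q="remarks_t:blocked",
+       fl="id, remarks_t, score",
+       sort="score desc",
+       rows="10")
+----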
+
+image::images/math-expressions/scoring.png[]
+
+== Sampling
+
+The `random` function returns a random sample from a distributed search result set.
+This allows for fast visualization, statistical analysis, and modeling of
+samples that can be used to infer information about the larger result set.
+
+The visualization examples below use small random samples, but Solr's random sampling provides sub-second response times on sample sizes of over 200,000.
+These larger samples can be used to build reliable statistical models that describe large data sets (billions of documents) with sub-second performance.
+
+The examples below demonstrate univariate and bivariate scatter
+plots of random samples.
+Statistical modeling with random samples
+is covered in the <>, <>, <>, <>,
+and <> sections.
+
+=== Univariate Scatter Plots
+
+In the example below the `random` function is called in its simplest form with just a collection name as the parameter.
+
+When called with no other parameters the `random` function returns a random sample of 500 records with all fields from the collection.
+When called without the field list parameter (`fl`) the `random` function also generates a sequence, 0-499 in this case, which can be used for plotting the x-axis.
+This sequence is returned in a field called `x`.
+
+The visualization below shows a scatter plot with the `filesize_d` field
+plotted on the y-axis and the `x` sequence plotted on the x-axis.
+The effect of this is to spread the `filesize_d` samples across the length
+of the plot so they can be more easily studied.
+
+By studying the scatter plot we can learn a number of things about the
+distribution of the `filesize_d` variable:
+
+* The sample set ranges from 34,875 to 45,902.
+* The highest density appears to be at about 40,000.
+* The sample seems to have a balanced number of observations above and below
+40,000. Based on this the *mean* and *mode* would appear to be around 40,000.
+* The number of observations tapers off to a small number of outliers on
+the low and high ends of the sample.
+
+This sample can be re-run multiple times to see if the samples
+produce similar plots.
+
+image::images/math-expressions/univariate.png[]
+
+=== Bivariate Scatter Plots
+
+In the next example parameters have been added to the `random` function.
+The field list (`fl`) now specifies two fields to be
+returned with each sample: `filesize_d` and `response_d`.
+The `q` and `rows` parameters are the same as the defaults but are included as an example of how to set these parameters.
+
+By plotting `filesize_d` on the x-axis and `response_d` on the y-axis we can begin to study the relationship between the two variables.
+
+By studying the scatter plot we can learn the following:
+
+* As `filesize_d` rises, `response_d` tends to rise.
+* This relationship appears to be linear, as a straight line put through the data could be used to model the relationship.
+* The points appear to cluster more densely along a straight line through the middle and become less dense as they move away from the line.
+* The variance of the data at each `filesize_d` point seems fairly consistent. This means a predictive model would have consistent error across the range of predictions.
+
+image::images/math-expressions/bivariate.png[]
+
+== Aggregation
+
+Aggregations are a powerful statistical tool for summarizing large data sets and
+surfacing patterns, trends, and correlations within the data.
+Aggregations are also a powerful tool for visualization and provide data sets for further statistical analysis.
+
+=== stats
+
+The simplest aggregation is the `stats` function.
+The `stats` function calculates aggregations for an entire result set that matches a query.
+The `stats` function supports the following aggregation functions: `count(*)`, `sum`, `min`, `max`, and `avg`.
+Any number and combination of statistics can be calculated in a single function call.
+
+The `stats` function can be visualized in Zeppelin-Solr as a table.
+In the example below two statistics are calculated over a result set and are displayed in a table:
+
+image::images/math-expressions/stats-table.png[]
+
+The `stats` function can also be visualized using the *number* visualization which is used to highlight important numbers.
+The example below shows the `count(*)` aggregation displayed in the number visualization:
+
+image::images/math-expressions/stats.png[]
+
+=== facet
+
+The `facet` function performs single and multi-dimension
+aggregations that behave in a similar manner to SQL group by aggregations.
+Under the covers the `facet` function pushes down the aggregations to Solr's
+<> for fast distributed execution.
+
+The example below performs a single dimension aggregation from the
+nyc311 (NYC complaints) dataset.
+The aggregation returns the top five *complaint types* by *count* for records with a status of *Pending*.
+The results are displayed with Zeppelin-Solr in a table.
+
+image::images/math-expressions/facettab1.png[]
+
+The example below shows the table visualized using a pie chart.
+
+image::images/math-expressions/facetviz1.png[]
+
+The next example demonstrates a multi-dimension aggregation.
+Notice that the `buckets` parameter now contains two dimensions: `borough_s` and `complaint_type_s`.
+This returns the top 20 combinations of borough and complaint type by count.
+
+image::images/math-expressions/facettab2.png[]
+
+The example below shows the multi-dimension aggregation visualized as a grouped bar chart.
+
+image::images/math-expressions/facetviz2.png[]
+
+The `facet` function supports any combination of the following aggregate functions: `count(*)`, `sum`, `avg`, `min`, `max`.
+
+
+=== facet2D
+
+The `facet2D` function performs two dimensional aggregations that can be
+visualized as heat maps or pivoted into matrices and operated on by machine learning functions.
+
+`facet2D` has different syntax and behavior than a two dimensional `facet` function, which
+does not control the number of unique facets of each dimension. The `facet2D` function
+has the `dimensions` parameter which controls the number of unique facets
+for the *x* and *y* dimensions.
+
+The example below visualizes the output of the `facet2D` function. In the example `facet2D`
+returns the top 5 boroughs and the top 5 complaint types for each borough. The output is
+then visualized as a heatmap.
+
+image::images/math-expressions/facet2D.png[]
+
+The `facet2D` function supports one of the following aggregate functions: `count(*)`, `sum`, `avg`, `min`, `max`.
+
+=== timeseries
+
+The `timeseries` function performs fast, distributed time
+series aggregation leveraging Solr's built-in faceting and date math capabilities.
+
+The example below performs a monthly time series aggregation over a collection of
+daily stock price data. In this example the average monthly closing price is
+calculated for the stock ticker *amzn* over a specific date range.
+
+The output of the `timeseries` function is then visualized with a line chart.
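+
+A sketch of the underlying expression; the collection name, field names and
+date range here are assumptions:
+
+[source,text]
+----
+timeseries(stocks,
+           q="ticker_s:amzn",
+           field="date_dt",
+           start="2010-01-01T00:00:00Z",
+           end="2012-01-01T00:00:00Z",
+           gap="+1MONTH",
+           avg(close_d))
+----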
+
+image::images/math-expressions/timeseries1.png[]
+
+The `timeseries` function supports any combination of the following aggregate functions: `count(*)`, `sum`, `avg`, `min`, `max`.
+
+
+=== significantTerms
+
+The `significantTerms` function queries a collection, but instead of returning documents, it returns significant terms found in documents in the result set.
+This function scores terms based on how frequently they appear in the result set and how rarely they appear in the entire corpus.
+The `significantTerms` function emits a tuple for each term which contains the term, the score, the foreground count and the background count.
+The foreground count is the number of documents the term appears in within the result set.
+The background count is the number of documents the term appears in within the entire corpus.
+The foreground and background counts are global for the collection.
+
+The `significantTerms` function can often provide insights that cannot be gleaned from other types of aggregations.
+The example below illustrates the difference between the `facet` function and the `significantTerms` function.
+
+In the first example the `facet` function aggregates the top 5 complaint types
+in Brooklyn.
+This returns the five most common complaint types in Brooklyn, but
+it's not clear that these terms appear more frequently in Brooklyn than
+in the other boroughs.
+
+image::images/math-expressions/significantTermsCompare.png[]
+
+In the next example the `significantTerms` function returns the top 5 significant terms in the `complaint_type_s` field for the borough of Brooklyn.
+The highest scoring term, Elder Abuse, has a foreground count of 285 and background count of 298.
+This means that there were 298 Elder Abuse complaints in the entire data set, and 285 of them were in Brooklyn.
+This shows that Elder Abuse complaints have a much higher occurrence rate in Brooklyn than in the other boroughs.
+
+image::images/math-expressions/significantTerms2.png[]
+
+The final example shows a visualization of the `significantTerms` from a
+text field containing movie reviews. The result shows the
+significant terms that appear in movie reviews that have the phrase "sci-fi".
+
+The results are visualized using a bubble chart with the *foreground* count
+plotted on the x-axis and the *background* count on the y-axis. Each term is
+shown in a bubble sized by the *score*.
+
+image::images/math-expressions/sterms.png[]
+
+=== nodes
+
+The `nodes` function performs aggregations of nodes during a breadth first search of a graph.
+This function is covered in detail in the section <>.
+In this example the focus will be on finding correlated nodes in a time series
+graph using the `nodes` expression.
+
+The example below finds stock tickers whose daily movements tend to be correlated with the ticker *jpm* (JP Morgan).
+
+The inner `search` expression finds records within a specific date range
+where the ticker symbol is *jpm* and the `change_d` field (daily change in stock price) is greater than .25.
+This search returns all fields in the index including the `yearMonthDay_s` which is the string representation of the year, month, and day of the matching records.
+
+The `nodes` function wraps the `search` function and operates over its results. The `walk` parameter maps a field from the search results to a field in the index.
+In this case the `yearMonthDay_s` is mapped back to the `yearMonthDay_s` field in the same index.
+This will find records that have the same `yearMonthDay_s` field value returned
+by the initial search, and will return records for all tickers on those days.
+A filter query is applied to the search to restrict it to rows that have a `change_d`
+greater than .25.
+This will find all records on the matching days that have a daily change greater than .25.
+
+The `gather` parameter tells the nodes expression to gather the `ticker_s` symbols during the breadth first search.
+The `count(*)` parameter counts the occurrences of the tickers.
+This will count the number of times each ticker appears in the breadth first search.
+
+Finally the `top` function selects the top 5 tickers by count and returns them.
+
+The result below shows the ticker symbols in the `nodes` field and the counts for each node.
+Notice *jpm* is first, which shows how many days *jpm* had a change greater than .25 in this time
+period.
+The next set of ticker symbols (*mtb*, *slvb*, *gs* and *pnc*) are the symbols with the highest number of days with a change greater than .25 on the same days that *jpm* had a change greater than .25.
+
+image::images/math-expressions/nodestab.png[]
+
+The `nodes` function supports any combination of the following aggregate functions: `count(*)`, `sum`, `avg`, `min`, `max`.
diff --git a/solr/solr-ref-guide/src/simulations.adoc b/solr/solr-ref-guide/src/simulations.adoc
index 14d7c6a7925..349aac6b1f3 100644
--- a/solr/solr-ref-guide/src/simulations.adoc
+++ b/solr/solr-ref-guide/src/simulations.adoc
@@ -16,195 +16,236 @@
 // specific language governing permissions and limitations
 // under the License.

-
 Monte Carlo simulations are commonly used to model the behavior of
-stochastic systems. This section describes
-how to perform both uncorrelated and correlated Monte Carlo simulations
-using the sampling capabilities of the probability distribution framework.
+stochastic (random) systems. This section of the user guide covers
+the basics of performing Monte Carlo simulations with Math Expressions.

-== Uncorrelated Simulations
+== Random Time Series

-Uncorrelated Monte Carlo simulations model stochastic systems with the assumption
-that the underlying random variables move independently of each other.
-A simple example of a Monte Carlo simulation using two independently changing random variables
-is described below.
+The daily movement of stock prices is often described as a "random walk".
+But what does that really mean, and how is this different from a random time series?
+The examples below will use Monte Carlo simulations to explore both "random walks"
+and random time series.

-In this example a Monte Carlo simulation is used to determine the probability that a simple hinge assembly will
-fall within a required length specification.
+A useful first step in understanding the difference is to visualize
+daily stock returns, calculated as closing price minus opening price, as a time series.

-The hinge has two components A and B. The combined length of the two components must be less then 5 centimeters
-to fall within specification.
+The example below uses the `search` function to return 1000 days of daily stock
+returns for the ticker *CVX* (Chevron). The `change_d` field, which is the
+change in price for the day, is then plotted as a time series.

-A random sampling of lengths for component A has shown that its length conforms to a
-normal distribution with a mean of 2.2 centimeters and a standard deviation of .0195
-centimeters.
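+
+A sketch of the kind of `search` expression behind the plot; the collection
+and field names here are assumptions:
+
+[source,text]
+----
+search(stocks,
+       q="ticker_s:cvx",
+       fl="date_dt, change_d",
+       sort="date_dt asc",
+       rows="1000")
+----
+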
+image::images/math-expressions/randomwalk1.png[]

-A random sampling of lengths for component B has shown that its length conforms
-to a normal distribution with a mean of 2.71 centimeters and a standard deviation of .0198 centimeters.
+Notice that the time series of daily price changes moves randomly above and
+below zero. Some days the stock is up, some days it's down, but there
+does not seem to be a noticeable pattern or any dependency between steps. This is a hint
+that this is a *random time series*.
+
+=== Autocorrelation
+
+Autocorrelation measures the degree to which a signal is correlated with itself.
+Autocorrelation can be used to determine
+if a vector contains a signal or if there is dependency between values in a time series. If there is no
+signal and no dependency between values in the time series then the time series is random.
+
+It's useful to plot the autocorrelation of the `change_d` vector to confirm that it is indeed random.
+
+In the example below the search results are set to a variable and then the `change_d` field is vectorized and stored in variable `b`.
+Then the `conv` (convolution) function is used to autocorrelate
+the `change_d` vector.
+Notice that the `conv` function is simply "convolving" the `change_d` vector
+with a reversed copy of itself.
+This is the technique for performing autocorrelation using convolution.
+The <> section
+of the user guide covers both convolution and autocorrelation in detail.
+In this section we'll just discuss the plot.
+
+The plot shows the intensity of correlation that is calculated as the `change_d` vector is slid across itself by the `conv` function.
+Notice in the plot there is a long period of low intensity correlation that appears to be random.
+Then in the center there is a peak of high intensity correlation where the vectors
+are directly lined up.
+This is followed by another long period of low intensity correlation.
+
+This is the autocorrelation plot of pure noise.
+The daily stock changes appear to be a random time series.
+
+image::images/math-expressions/randomwalk2.png[]
+
+=== Visualizing the Distribution
+
+The random daily changes in stock prices cannot be predicted, but they can be modeled with a probability distribution.
+To model the time series we'll start by visualizing the distribution of the `change_d` vector.
+In the example below the `change_d` vector is plotted using the `empiricalDistribution` function to create an 11 bin
+histogram of the data.
+Notice that the distribution appears to be normally distributed.
+Daily stock price changes do tend to be normally distributed although *CVX* was chosen specifically for this example because of this characteristic.
+
+image::images/math-expressions/randomwalk3.png[]
+
+
+=== Fitting the Distribution
+
+The `ks` test can be used to determine if the distribution of a vector of data fits a reference distribution.
+In the example below the `ks` test is performed with a *normal distribution* with the *mean* (`mean`) and *standard deviation* (`stddev`) of the `change_d` vector as the reference distribution.
+The `ks` test is checking the reference distribution against the `change_d` vector itself to see if it fits a normal distribution.
+
+Notice in the example below the `ks` test reports a p-value of .16278.
+A p-value of .05 or less is typically used to invalidate the null hypothesis of the test, which is that the vector could have been drawn from the reference distribution.
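+
+A minimal sketch of the test; the collection and field names are hypothetical,
+and the argument order of `ks` (reference distribution first) is an assumption:
+
+[source,text]
+----
+let(a=search(stocks, q="ticker_s:cvx", fl="date_dt, change_d", sort="date_dt asc", rows="1000"),
+    b=col(a, change_d),
+    c=normalDistribution(mean(b), stddev(b)),
+    d=ks(c, b))
+----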
+
+image::images/math-expressions/randomwalk4.png[]
+
+
+The `ks` test, which tends to be fairly sensitive, has confirmed the visualization which appeared to be normal.
+Because of this the normal distribution with the *mean* and *standard deviation* of the `change_d` vector will be used to represent the daily stock returns for Chevron in the Monte Carlo simulations below.
+
+=== Monte Carlo
+
+Now that we have fit a distribution to the daily stock return data we can use the `monteCarlo` function to run a simulation using the distribution.
+
+The `monteCarlo` function runs a specified number of times.
+On each run it sets a series of variables and runs one final function which returns a single numeric value.
+The `monteCarlo` function collects the results of each run in a vector and returns it.
+The final function typically has one or more variables that are drawn from probability distributions on each run.
+The `sample` function is used to draw the samples.
+
+The simulation's result array can then be treated as an empirical distribution to understand the probabilities of the simulation results.
+
+The example below uses the `monteCarlo` function to simulate a distribution for the total return of 100 days of stock returns.
+
+In the example a `normalDistribution` is created from the *mean* and *standard deviation* of the `change_d` vector.
+The `monteCarlo` function then draws 100 samples from the normal distribution to represent 100 days of stock returns and sets the vector of samples to the variable `d`.
+
+The `add` function then calculates the total return
+from the 100 day sample.
+The output of the `add` function is collected by the `monteCarlo` function.
+This is repeated 50000 times, with each run drawing a different set of samples from the normal distribution.
+
+The result of the simulation is set to variable `s`, which contains
+the total returns from the 50000 runs.
+
+The `empiricalDistribution` function is then used to visualize the output of the simulation as a 50 bin histogram.
+The distribution visualizes the probability of the different total
+returns from 100 days of stock returns for ticker *CVX*.
+
+image::images/math-expressions/randomwalk5.png[]
+
+The `probability` and `cumulativeProbability` functions can then be used to
+learn more about the `empiricalDistribution`.
+For example the `probability` function can be used to calculate the probability of a non-negative return from 100 days of stock returns.
+
+The example below uses the `probability` function to return the probability of a
+return within the range of 0 and 40 from the `empiricalDistribution`
+of the simulation.
+
+image::images/math-expressions/randomwalk5.1.png[]
+
+=== Random Walk
+
+The `monteCarlo` function can also be used to model a random walk of
+daily stock prices from the `normalDistribution` of daily stock returns.
+A random walk is a time series where each step is calculated by adding a random sample to the previous step.
+This creates a time series where each value is dependent on the previous value, which simulates the autocorrelation of stock prices.
+
+In the example below the random walk is achieved by adding a random sample to the variable `v` on each Monte Carlo iteration.
+The variable `v` is maintained between iterations so each iteration uses the previous value of `v`.
+The `double` function is the final function run each iteration, which simply returns the value of `v` as a double.
+The example iterates 1000 times to create a random walk with 1000 steps.
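+
+A rough sketch of the random walk described above; the collection and field
+names are hypothetical, and `v` is assumed to start at 0 on the first iteration:
+
+[source,text]
+----
+let(a=search(stocks, q="ticker_s:cvx", fl="date_dt, change_d", sort="date_dt asc", rows="1000"),
+    b=col(a, change_d),
+    c=normalDistribution(mean(b), stddev(b)),
+    walk=monteCarlo(v=add(v, sample(c)),
+                    double(v),
+                    1000))
+----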
+
+image::images/math-expressions/randomwalk6.png[]
+
+Notice the autocorrelation in the daily stock prices caused by the dependency
+between steps produces a very different plot than the
+random daily change in stock price.
+
+== Multivariate Normal Distribution
+
+The `multiVariateNormalDistribution` function can be used to model and simulate
+two or more normally distributed variables.
+It also incorporates the *correlation* between variables into the model which allows for the study of how correlation affects the possible outcomes.
+
+In the examples below a simulation of the total daily returns of two
+stocks is explored.
+The *ALL* ticker (*Allstate*) is used along with the *CVX* ticker (*Chevron*) from the previous examples.
+
+=== Correlation and Covariance
+
+The multivariate simulations show the effect of correlation on possible
+outcomes.
+Before getting started with actual simulations it's useful to first understand the correlation and covariance between the Allstate and Chevron stock returns.
+
+The example below runs two searches to retrieve the daily stock returns
+for Allstate and Chevron.
+The `change_d` vectors from both returns are read into variables (`all` and `cvx`) and Pearson's correlation is calculated for the two vectors with the `corr` function.
+
+image::images/math-expressions/corrsim1.png[]
+
+Covariance is an unscaled measure of correlation.
+Covariance is the measure used by the multivariate simulations so it's useful to also compute the covariance for the two stock returns.
+The example below computes the covariance.
+
+image::images/math-expressions/corrsim2.png[]
+
+=== Covariance Matrix
+
+A covariance matrix is actually what's needed by the
+`multiVariateNormalDistribution` as it contains both the variance of the
+two stock return vectors and the covariance between the two
+vectors.
+The `cov` function will compute the covariance matrix for
+the columns of a matrix.
+
+The example below demonstrates how to compute the covariance matrix by adding the `all` and `cvx` vectors as rows to a matrix.
+The matrix is then transposed with the `transpose` function so that the `all` vector is the first column and the `cvx` vector is the second column.
+
+The `cov` function then computes the covariance matrix for the columns of the matrix and returns the result.
+
+image::images/math-expressions/corrsim3.png[]
+
+The covariance matrix is a square matrix which contains the
+variance of each vector and the covariance between the
+vectors as follows:

[source,text]
----
-let(componentA=normalDistribution(2.2, .0195), <1>
-    componentB=normalDistribution(2.71, .0198), <2>
-    simresults=monteCarlo(sampleA=sample(componentA), <3>
-                          sampleB=sample(componentB),
-                          add(sampleA, sampleB), <4>
-                          100000), <5>
-    simmodel=empiricalDistribution(simresults), <6>
-    prob=cumulativeProbability(simmodel, 5)) <7>
+                 all                  cvx
+all [0.12294442137237226, 0.13106056985285258],
+cvx [0.13106056985285258, 0.7409729840230235]
----

-The Monte Carlo simulation below performs the following steps:
+=== Multivariate Simulation

-<1> A normal distribution with a mean of 2.2 and a standard deviation of .0195 is created to model the length of `componentA`.
-<2> A normal distribution with a mean of 2.71 and a standard deviation of .0198 is created to model the length of `componentB`.
-<3> The `monteCarlo` function samples from the `componentA` and `componentB` distributions and sets the values to variables `sampleA` and `sampleB`. 
-<4> It then calls the `add(sampleA, sampleB)`* function to find the combined lengths of the samples. -<5> The `monteCarlo` function runs a set number of times, 100000, and collects the results in an array. Each - time the function is called new samples are drawn from the `componentA` - and `componentB` distributions. On each run, the `add` function adds the two samples to calculate the combined length. - The result of each run is collected in an array and assigned to the `simresults` variable. -<6> An `empiricalDistribution` function is then created from the `simresults` array to model the distribution of the - simulation results. -<7> Finally, the `cumulativeProbability` function is called on the `simmodel` to determine the cumulative probability - that the combined length of the components is 5 or less. +The example below demonstrates a Monte Carlo simulation with two stock tickers using the +`multiVariateNormalDistribution`. -Based on the simulation there is .9994371944629039 probability that the combined length of a component pair will -be 5 or less: +In the example, result sets with the `change_d` field for both stock tickers, `all` (Allstate) and `cvx` (Chevron), +are retrieved and read into vectors. -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "prob": 0.9994371944629039 - }, - { - "EOF": true, - "RESPONSE_TIME": 660 - } - ] - } -} ----- +A matrix is then created from the two vectors and is transposed so +the matrix contains two columns, one with the `all` vector and one with the `cvx` vector. -== Correlated Simulations +Then the `multiVariateNormalDistribution` is created with two parameters. The first parameter is an array of `mean` values. +In this case the means for the `all` vector and the `cvx` vector. +The second parameter is the covariance matrix which was created from the 2-column matrix of the two vectors. -The simulation above assumes that the lengths of `componentA` and `componentB` vary independently. -What would happen to the probability model if there was a correlation between the lengths of -`componentA` and `componentB`? +The `monteCarlo` function then performs the simulation by drawing 100 samples from the `multiVariateNormalDistribution` on each iteration. +Each sample set is a matrix with 100 rows and 2 columns containing stock return samples from the `all` and `cvx` distributions. +The distributions of the columns will match the normal distributions used to create the `multiVariateNormalDistribution`. +The covariance of the sample columns will match the covariance matrix. -In the example below a database containing assembled pairs of components is used to determine -if there is a correlation between the lengths of the components, and how the correlation effects the model. +On each iteration the `grandSum` function is used to sum all the values of the sample matrix to get the total stock returns for both stocks. -Before performing a simulation of the effects of correlation on the probability model its -useful to understand what the correlation is between the lengths of `componentA` and `componentB`. +The output of the simulation is a vector which can be treated as an empirical distribution in exactly the same manner as the single stock ticker simulation. +In this example it is plotted as a 50 bin histogram which visualizes the probability of the different total returns from 100 days of stock returns for the tickers `all` and `cvx`. 
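+
+A sketch of the simulation described above; the collection and field names
+are hypothetical:
+
+[source,text]
+----
+let(a=search(stocks, q="ticker_s:all", fl="date_dt, change_d", sort="date_dt asc", rows="1000"),
+    all=col(a, change_d),
+    b=search(stocks, q="ticker_s:cvx", fl="date_dt, change_d", sort="date_dt asc", rows="1000"),
+    cvx=col(b, change_d),
+    m=transpose(matrix(all, cvx)),
+    dist=multiVariateNormalDistribution(array(mean(all), mean(cvx)), cov(m)),
+    s=monteCarlo(x=sample(dist, 100),
+                 grandSum(x),
+                 50000),
+    e=empiricalDistribution(s, 50))
+----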
-[source,text] ----- -let(a=random(collection5, q="*:*", rows="5000", fl="componentA_d, componentB_d"), <1> - b=col(a, componentA_d)), <2> - c=col(a, componentB_d)), - d=corr(b, c)) <3> ----- -<1> In the example, 5000 random samples are selected from a collection of assembled hinges. -Each sample contains lengths of the components in the fields `componentA_d` and `componentB_d`. -<2> Both fields are then vectorized. The *componentA_d* vector is stored in -variable *`b`* and the *componentB_d* variable is stored in variable *`c`*. -<3> Then the correlation of the two vectors is calculated using the `corr` function. +image::images/math-expressions/mnorm.png[] -Note from the result that the outcome from `corr` is 0.9996931313216989. -This means that `componentA_d` and *`componentB_d` are almost perfectly correlated. +=== The Effect of Correlation -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "d": 0.9996931313216989 - }, - { - "EOF": true, - "RESPONSE_TIME": 309 - } - ] - } -} ----- +The covariance matrix can be changed to study the effect on the simulation. +The example below demonstrates this by providing a hard coded covariance matrix with a higher covariance value for the two vectors. +This results is a simulated outcome distribution with a higher standard deviation or larger spread from the mean. +This measures the degree that higher correlation produces higher volatility +in the random walk. -=== Correlation Effects on the Probability Model - -The example below explores how to use a multivariate normal distribution function -to model how correlation effects the probability of hinge defects. - -In this example 5000 random samples are selected from a collection -containing length data for assembled hinges. Each sample contains -the fields `componentA_d` and `componentB_d`. - -Both fields are then vectorized. The `componentA_d` vector is stored in -variable *`b`* and the `componentB_d` variable is stored in variable *`c`*. - -An array is created that contains the means of the two vectorized fields. - -Then both vectors are added to a matrix which is transposed. This creates -an observation matrix where each row contains one observation of -`componentA_d` and `componentB_d`. A covariance matrix is then created from the columns of -the observation matrix with the -`cov` function. The covariance matrix describes the covariance between `componentA_d` and `componentB_d`. - -The `multivariateNormalDistribution` function is then called with the -array of means for the two fields and the covariance matrix. The model -for the multivariate normal distribution is stored in variable *`g`*. - -The `monteCarlo` function then calls the function `add(sample(g))` 50000 times -and collections the results in a vector. Each time the function is called a single sample -is drawn from the multivariate normal distribution. Each sample is a vector containing -one `componentA` and `componentB` pair. The `add` function adds the values in the vector to -calculate the length of the pair. Over the long term the samples drawn from the -multivariate normal distribution will conform to the covariance matrix used to construct it. - -Just as in the non-correlated example an empirical distribution is used to model probabilities -of the simulation vector and the `cumulativeProbability` function is used to compute the cumulative -probability that the combined component length will be 5 centimeters or less. - -Notice that the probability of a hinge meeting specification has dropped to 0.9889517439980468. 
-This is because the strong correlation -between the lengths of components means that their lengths rise together causing more hinges to -fall out of the 5 centimeter specification. - -[source,text] ----- -let(a=random(hinges, q="*:*", rows="5000", fl="componentA_d, componentB_d"), - b=col(a, componentA_d), - c=col(a, componentB_d), - cor=corr(b,c), - d=array(mean(b), mean(c)), - e=transpose(matrix(b, c)), - f=cov(e), - g=multiVariateNormalDistribution(d, f), - h=monteCarlo(add(sample(g)), 50000), - i=empiricalDistribution(h), - j=cumulativeProbability(i, 5)) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "j": 0.9889517439980468 - }, - { - "EOF": true, - "RESPONSE_TIME": 599 - } - ] - } -} ----- +image::images/math-expressions/mnorm2.png[] diff --git a/solr/solr-ref-guide/src/statistics.adoc b/solr/solr-ref-guide/src/statistics.adoc index 426576b09a5..d391e4b5e77 100644 --- a/solr/solr-ref-guide/src/statistics.adoc +++ b/solr/solr-ref-guide/src/statistics.adoc @@ -16,24 +16,23 @@ // specific language governing permissions and limitations // under the License. - This section of the user guide covers the core statistical functions available in math expressions. == Descriptive Statistics -The `describe` function can be used to return descriptive statistics about a +The `describe` function returns descriptive statistics for a numeric array. The `describe` function returns a single *tuple* with name/value -pairs containing descriptive statistics. +pairs containing the descriptive statistics. -Below is a simple example that selects a random sample of documents, -vectorizes the *price_f* field in the result set and uses the `describe` function to -return descriptive statistics about the vector: +Below is a simple example that selects a random sample of documents from the *logs* collection, +vectorizes the `response_d` field in the result set and uses the `describe` function to +return descriptive statistics about the vector. [source,text] ---- -let(a=random(collection1, q="*:*", rows="1500", fl="price_f"), - b=col(a, price_f), +let(a=random(logs, q="*:*", fl="response_d", rows="50000"), + b=col(a, response_d), c=describe(b)) ---- @@ -45,33 +44,42 @@ When this expression is sent to the `/stream` handler it responds with: "result-set": { "docs": [ { - "c": { - "sumsq": 4999.041975263254, - "max": 0.99995726, - "var": 0.08344429493940454, - "geometricMean": 0.36696588922559575, - "sum": 7497.460565552007, - "kurtosis": -1.2000739963006035, - "N": 15000, - "min": 0.00012338161, - "mean": 0.49983070437013266, - "popVar": 0.08343873198640858, - "skewness": -0.001735537500095477, - "stdev": 0.28886726179926403 - } + "sumsq": 36674200601.78738, + "max": 1068.854686837548, + "var": 1957.9752647562789, + "geometricMean": 854.1445499569674, + "sum": 42764648.83319176, + "kurtosis": 0.013189848821424377, + "N": 50000, + "min": 656.023249311864, + "mean": 855.2929766638425, + "popVar": 1957.936105250984, + "skewness": 0.0014560741802307174, + "stdev": 44.24901428005237 }, { "EOF": true, - "RESPONSE_TIME": 305 + "RESPONSE_TIME": 430 } ] } } ---- +Notice that the random sample contains 50,000 records and the response +time is only 430 milliseconds. Samples of this size can be used to +reliably estimate the statistics for very large underlying +data sets with sub-second performance. 
+ + +The `describe` function can also be visualized in a table with Zeppelin-Solr: + +image::images/math-expressions/describe.png[] + + == Histograms and Frequency Tables -Histograms and frequency tables are are tools for understanding the distribution +Histograms and frequency tables are tools for visualizing the distribution of a random variable. The `hist` function creates a histogram designed for usage with continuous data. The @@ -79,227 +87,207 @@ The `hist` function creates a histogram designed for usage with continuous data. === histograms -Below is an example that selects a random sample, creates a vector from the -result set and uses the `hist` function to return a histogram with 5 bins. -The `hist` function returns a list of tuples with summary statistics for each bin. +In the example below a histogram is used to visualize a random sample of +response times from the logs collection. The example retrieves the +random sample with the `random` function and creates a vector from the `response_d` field +in the result set. Then the `hist` function is applied to the vector +to return a histogram with 22 bins. The `hist` function returns a +list of tuples with summary statistics for each bin. [source,text] ---- -let(a=random(collection1, q="*:*", rows="15000", fl="price_f"), - b=col(a, price_f), - c=hist(b, 5)) +let(a=random(logs, q="*:*", fl="response_d", rows="50000"), + b=col(a, response_d), + c=hist(b, 22)) ---- When this expression is sent to the `/stream` handler it responds with: -[source,json] +[source,text] ---- { "result-set": { "docs": [ { - "c": [ - { - "prob": 0.2057939717603699, - "min": 0.000010371208, - "max": 0.19996578, - "mean": 0.10010319358402578, - "var": 0.003366805016271609, - "cumProb": 0.10293732468049072, - "sum": 309.0185585938884, - "stdev": 0.058024176136086666, - "N": 3087 - }, - { - "prob": 0.19381868629885585, - "min": 0.20007741, - "max": 0.3999073, - "mean": 0.2993590803885827, - "var": 0.003401644034068929, - "cumProb": 0.3025295802728267, - "sum": 870.5362057700005, - "stdev": 0.0583236147205309, - "N": 2908 - }, - { - "prob": 0.20565789836690007, - "min": 0.39995712, - "max": 0.5999038, - "mean": 0.4993620963792545, - "var": 0.0033158364923609046, - "cumProb": 0.5023006239697967, - "sum": 1540.5320673300018, - "stdev": 0.05758330046429177, - "N": 3085 - }, - { - "prob": 0.19437108496008693, - "min": 0.6000449, - "max": 0.79973197, - "mean": 0.7001752711861512, - "var": 0.0033895105082360185, - "cumProb": 0.7026537198687285, - "sum": 2042.4112660500066, - "stdev": 0.058219502816805456, - "N": 2917 - }, - { - "prob": 0.20019582213899467, - "min": 0.7999126, - "max": 0.99987316, - "mean": 0.8985428275824184, - "var": 0.003312360017780078, - "cumProb": 0.899450457219298, - "sum": 2698.3241112299997, - "stdev": 0.05755310606544253, - "N": 3003 - } - ] + "prob": 0.00004896007228311655, + "min": 675.573084576817, + "max": 688.3309631697003, + "mean": 683.805542728906, + "var": 50.9974629924082, + "cumProb": 0.000030022417162809913, + "sum": 2051.416628186718, + "stdev": 7.141250800273591, + "N": 3 }, { - "EOF": true, - "RESPONSE_TIME": 322 - } - ] - } -} ----- - -The `col` function can be used to *vectorize* a column of data from the list of tuples -returned by the `hist` function. - -In the example below, the *N* field, -which is the number of observations in the each bin, is returned as a vector. 
- -[source,text] ----- -let(a=random(collection1, q="*:*", rows="15000", fl="price_f"), - b=col(a, price_f), - c=hist(b, 11), - d=col(c, N)) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "d": [ - 1387, - 1396, - 1391, - 1357, - 1384, - 1360, - 1367, - 1375, - 1307, - 1310, - 1366 - ] + "prob": 0.00029607514624062624, + "min": 696.2875238591652, + "max": 707.9706315779541, + "mean": 702.1110569558929, + "var": 14.136444379466969, + "cumProb": 0.00022705264963879807, + "sum": 11233.776911294284, + "stdev": 3.759846323916307, + "N": 16 }, { - "EOF": true, - "RESPONSE_TIME": 307 - } - ] - } -} + "prob": 0.0011491235433157194, + "min": 709.1574910598678, + "max": 724.9027194369135, + "mean": 717.8554290699951, + "var": 20.6935845290122, + "cumProb": 0.0009858515418689757, + "sum": 41635.61488605971, + "stdev": 4.549020172412098, + "N": 58 + }, + ... + ]}} ---- +With Zeppelin-Solr the histogram can be first visualized as a table: + +image::images/math-expressions/histtable.png[] + +Then the histogram can be visualized with an area chart by plotting the *mean* of +the bins on the *x-axis* and the *prob* (probability) on the *y-axis*: + +image::images/math-expressions/hist.png[] + +The cumulative probability can be plotted by switching the *y-axis* to the *cumProb* column: + +image::images/math-expressions/cumProb.png[] + +=== Custom Histograms + +Custom histograms can be defined and visualized by combining the output from multiple +`stats` functions into a single histogram. Instead of automatically binning a numeric +field the custom histogram allows for comparison of bins based on queries. + +NOTE: The `stats` function is first discussed in the *Searching, Sampling and Aggregation* section of the +user guide. + +A simple example will illustrate how to define and visualize a custom histogram. + +In this example, three `stats` functions are wrapped in a `plist` function. The +`plist` (parallel list) function executes each of its internal functions in parallel +and concatenates the results into a single stream. `plist` also maintains the order +of the outputs from each of the sub-functions. In this example each `stats` function +computes the count of documents that match a specific query. In this case they count the +number of documents that contain the terms copper, gold and silver. The list of tuples +with the counts is then stored in variable *a*. + +Then an `array` of labels is created and set to variable *l*. + +Finally the `zplot` function is used to plot the labels vector and the `count(*)` column. +Notice the `col` function is used inside of the `zplot` function to extract the +counts from the `stats` results. + +image::images/math-expressions/custom-hist.png[] + + === Frequency Tables The `freqTable` function returns a frequency distribution for a discrete data set. The `freqTable` function doesn't create bins like the histogram. Instead it counts the occurrence of each discrete data value and returns a list of tuples with the -frequency statistics for each value. Fields from a frequency table can be vectorized using -using the `col` function in the same manner as a histogram. +frequency statistics for each value. -Below is a simple example of a frequency table built from a random sample of -a discrete variable. +Below is an example of a frequency table built from a result set +of rounded *differences* in daily opening stock prices for the stock ticker *amzn*. 
+
+This example is interesting because it shows a multi-step process to arrive
+at the result. The first step is to *search* for records in the *stocks*
+collection with a ticker of *amzn*. Notice that the result set is sorted by
+date ascending and it returns the `open_d` field which is the opening price for
+the day.
+
+The `open_d` field is then vectorized and set to variable *b*, which now contains
+a vector of opening prices ordered by date ascending.
+
+The `diff` function is then used to calculate the *first difference* for the
+vector of opening prices. The first difference simply subtracts the previous value
+from each value in the array. This provides an array of price differences
+for each day, which shows the daily change in opening price.
+
+Then the `round` function is used to round the price differences to the nearest
+integer to create a vector of discrete values. The `round` function in this
+example is effectively *binning* continuous data at integer boundaries.
+
+Finally the `freqTable` function is run on the discrete values to calculate
+the frequency table.

[source,text]
----
-let(a=random(collection1, q="*:*", rows="15000", fl="day_i"),
-    b=col(a, day_i),
-    c=freqTable(b))
+let(a=search(stocks,
+             q="ticker_s:amzn",
+             fl="open_d, date_dt",
+             sort="date_dt asc",
+             rows=25000),
+    b=col(a, open_d),
+    c=diff(b),
+    d=round(c),
+    e=freqTable(d))
----

When this expression is sent to the `/stream` handler it responds with:

-[source,json]
+[source,text]
----
-  "result-set": {
-    "docs": [
-      {
-        "c": [
-          {
-            "pct": 0.0318,
-            "count": 477,
-            "cumFreq": 477,
-            "cumPct": 0.0318,
-            "value": 0
-          },
-          {
-            "pct": 0.033133333333333334,
-            "count": 497,
-            "cumFreq": 974,
-            "cumPct": 0.06493333333333333,
-            "value": 1
-          },
-          {
-            "pct": 0.03426666666666667,
-            "count": 514,
-            "cumFreq": 1488,
-            "cumPct": 0.0992,
-            "value": 2
-          },
-          {
-            "pct": 0.0346,
-            "count": 519,
-            "cumFreq": 2007,
-            "cumPct": 0.1338,
-            "value": 3
-          },
-          {
-            "pct": 0.03133333333333333,
-            "count": 470,
-            "cumFreq": 2477,
-            "cumPct": 0.16513333333333333,
-            "value": 4
-          },
-          {
-            "pct": 0.03333333333333333,
-            "count": 500,
-            "cumFreq": 2977,
-            "cumPct": 0.19846666666666668,
-            "value": 5
-          }
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 281
-      }
-    ]
-  }
-}
+  {
+    "result-set": {
+      "docs": [
+        {
+          "pct": 0.00019409937888198756,
+          "count": 1,
+          "cumFreq": 1,
+          "cumPct": 0.00019409937888198756,
+          "value": -57
+        },
+        {
+          "pct": 0.00019409937888198756,
+          "count": 1,
+          "cumFreq": 2,
+          "cumPct": 0.00038819875776397513,
+          "value": -51
+        },
+        {
+          "pct": 0.00019409937888198756,
+          "count": 1,
+          "cumFreq": 3,
+          "cumPct": 0.0005822981366459627,
+          "value": -49
+        },
+        ...
+  ]}}
----

+With Zeppelin-Solr the frequency table can first be visualized as a table:
+
+image::images/math-expressions/freqTable.png[]
+
+The frequency table can then be plotted by switching to a scatter chart and selecting
+the *value* column for the *x-axis* and the *count* column for the *y-axis*.
+
+image::images/math-expressions/freqTable1.png[]
+
+Notice that the visualization nicely displays the frequency of daily change in stock prices
+rounded to integers. The most frequently occurring value is 0 with 1494 occurrences, followed
+by -1 and 1 with around 700 occurrences.
+
+
== Percentiles

The `percentile` function returns the estimated value for a specific percentile in
-a sample set. The example below returns the estimation for the 95th percentile
-of the *price_f* field.
+a sample set.
The example below returns a random sample containing the `response_d` field
+from the logs collection. The `response_d` field is vectorized and the 20th percentile
+is calculated for the vector:

[source,text]
----
-let(a=random(collection1, q="*:*", rows="15000", fl="price_f"),
-    b=col(a, price_f),
-    c=percentile(b, 95))
+let(a=random(logs, q="*:*", rows="15000", fl="response_d"),
+    b=col(a, response_d),
+    c=percentile(b, 20))
----

When this expression is sent to the `/stream` handler it responds with:

@@ -310,7 +298,7 @@ When this expression is sent to the `/stream` handler it responds with:
   "result-set": {
     "docs": [
       {
-        "c": 312.94
+        "c": 818.073554
       },
       {
         "EOF": true,
@@ -321,13 +309,13 @@ When this expression is sent to the `/stream` handler it responds with:
 }
 ----

-The `percentile` function also operates on an array of percentile values.
+The `percentile` function can also compute an array of percentile values.

The example below is computing the 20th, 40th, 60th and 80th percentiles for a random sample
-of the *response_d* field:
+of the `response_d` field:

[source,text]
----
-let(a=random(collection2, q="*:*", rows="15000", fl="response_d"),
+let(a=random(logs, q="*:*", rows="15000", fl="response_d"),
    b=col(a, response_d),
    c=percentile(b, array(20,40,60,80)))
----

@@ -356,103 +344,46 @@ When this expression is sent to the `/stream` handler it responds with:
 }
 ----

-== Covariance and Correlation
+=== Quantile Plots

-Covariance and Correlation measure how random variables move
+Quantile plots or QQ plots are powerful tools for visually comparing two or more distributions.
+
+A quantile plot plots the percentiles from two or more distributions in the same visualization. This allows
+for visual comparison of the distributions at each percentile. A simple example will help illustrate the power
+of quantile plots.
+
+In this example the distributions of daily stock price changes for two stock tickers, *goog* and
+*amzn*, are visualized with a quantile plot.
+
+The example first creates an array of values representing the percentiles that will be calculated and sets this array
+to variable *p*. Then random samples of the `change_d` field are drawn for the tickers *amzn* and *goog*. The `change_d` field
+represents the change in stock price for one day. Then the `change_d` field is vectorized for both samples and placed
+in the variables *amzn* and *goog*. The `percentile` function is then used to calculate the percentiles for both vectors. Notice that
+the variable *p* is used to specify the list of percentiles that are calculated.
+
+Finally `zplot` is used to plot the percentile sequence on the *x-axis* and the calculated
+percentile values for both distributions on the *y-axis*. A line plot is used
+to visualize the QQ plot.
+
+image::images/math-expressions/quantile-plot.png[]
+
+This quantile plot provides a clear picture of the distributions of daily price changes for *amzn*
+and *goog*. In the plot the *x-axis* is the percentiles and the *y-axis* is the percentile value calculated.
+
+Notice that the *goog* percentile value starts lower and ends higher than the *amzn* plot and that the slope
+is steeper. This shows the greater variability in the *goog* price change distribution. The plot gives a clear picture
+of the difference
+in the distributions across the full range of percentiles.
+
+
+== Correlation and Covariance
+
+Correlation and covariance measure how random variables fluctuate together.
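+
+As a quick sketch of the basic calls before the detailed examples below, both functions
+can also be applied directly to literal arrays (the values here are made up for illustration):
+
+[source,text]
+----
+let(a=array(1, 2, 3, 4, 5),
+    b=array(10, 25, 28, 42, 48),
+    correlation=corr(a, b),
+    covariance=cov(a, b))
+----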
-
-=== Covariance and Covariance Matrices
-
-The `cov` function calculates the covariance of two sample sets of data.
-
-In the example below covariance is calculated for two numeric
-arrays.
-
-The example below uses arrays created by the `array` function. Its important to note that
-vectorized data from SolrCloud collections can be used with any function that
-operates on arrays.
-
-[source,text]
-----
-let(a=array(1, 2, 3, 4, 5),
-    b=array(100, 200, 300, 400, 500),
-    c=cov(a, b))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
- {
-   "result-set": {
-     "docs": [
-       {
-         "c": 0.9484775349999998
-       },
-       {
-         "EOF": true,
-         "RESPONSE_TIME": 286
-       }
-     ]
-   }
- }
-----
-
-If a matrix is passed to the `cov` function it will automatically compute a covariance
-matrix for the columns of the matrix.
-
-Notice in the example three numeric arrays are added as rows
-in a matrix. The matrix is then transposed to turn the rows into
-columns, and the covariance matrix is computed for the columns of the
-matrix.
-
-[source,text]
-----
-let(a=array(1, 2, 3, 4, 5),
-    b=array(100, 200, 300, 400, 500),
-    c=array(30, 40, 80, 90, 110),
-    d=transpose(matrix(a, b, c)),
-    e=cov(d))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
- {
-   "result-set": {
-     "docs": [
-       {
-         "e": [
-           [
-             2.5,
-             250,
-             52.5
-           ],
-           [
-             250,
-             25000,
-             5250
-           ],
-           [
-             52.5,
-             5250,
-             1150
-           ]
-         ]
-       },
-       {
-         "EOF": true,
-         "RESPONSE_TIME": 2
-       }
-     ]
-   }
- }
-----
-
=== Correlation and Correlation Matrices

-Correlation is measure of covariance that has been scaled between
+Correlation is a measure of the linear relationship between
+two vectors. Correlation is scaled between
-1 and 1.

Three correlation types are supported:

* *pearsons* (default)
* *kendalls*
* *spearmans*

The type of correlation is specified by adding the *type* named parameter in the
-function call. The example below demonstrates the use of the *type*
-named parameter.
+function call.
+
+In the example below a random sample containing two fields, `filesize_d` and `response_d`, is drawn from
+the logs collection using the `random` function. The fields are vectorized into the
+variables *x* and *y* and then *Spearman's* correlation for
+the two vectors is calculated using the `corr` function.
+
+image::images/math-expressions/correlation.png[]
+
+==== Correlation Matrices
+
+Correlation matrices are powerful tools for visualizing the correlation between two or more
+vectors.
+
+The `corr` function builds a correlation matrix
+if a matrix is passed as the parameter. The correlation matrix is computed by correlating the *columns*
+of the matrix.
+
+The example below demonstrates the power of correlation matrices combined with two-dimensional faceting.
+
+In this example the `facet2D` function is used to generate a two-dimensional facet aggregation
+over the fields `complaint_type_s` and `zip_s` from the *nyc311* complaints database.
+The *top 20* complaint types and the *top 25* zip codes for each complaint type are aggregated.
+The result is a stream of tuples each containing the fields `complaint_type_s`, `zip_s` and
+the count for the pair.
+
+The `pivot` function is then used to pivot the fields into a *matrix* with the `zip_s`
+field as the *rows* and the `complaint_type_s` field as the *columns*. The `count(*)` field populates
+the values in the cells of the matrix.
+
+The `corr` function is then used to correlate the *columns* of the matrix.
This produces a correlation matrix
+that shows how complaint types are correlated based on the zip codes they appear in. Another way to look at this
+is that it shows how the different complaint types tend to co-occur across zip codes.
+
+Finally the `zplot` function is used to plot the correlation matrix as a heat map.
+
+image::images/math-expressions/corrmatrix.png[]
+
+Notice in the example the correlation matrix is square with complaint types shown on both
+the *x* and *y* axes. The color of the cells in the heat map shows the
+intensity of the correlation between the complaint types.
+
+The heat map is interactive, so mousing over one of the cells pops up the values
+for the cell.
+
+image::images/math-expressions/corrmatrix2.png[]
+
+Notice that HEAT/HOT WATER and UNSANITARY CONDITION complaints have a correlation of .8 (rounded to the nearest
+tenth).
+
+=== Covariance and Covariance Matrices
+
+Covariance is an unscaled measure of correlation.
+
+The `cov` function calculates the covariance of two vectors of data.
+
+In the example below a random sample containing two fields, `filesize_d` and `response_d`, is drawn from
+the logs collection using the `random` function. The fields are vectorized into the
+variables *x* and *y* and then the covariance for
+the two vectors is calculated using the `cov` function.
+
+image::images/math-expressions/covariance.png[]
+
+If a matrix is passed to the `cov` function it will automatically compute a covariance
+matrix for the *columns* of the matrix.
+
+Notice in the example below that the *x* and *y* vectors are added to a matrix.
+The matrix is then transposed to turn the rows into columns,
+and the covariance matrix is computed for the columns of the matrix.

[source,text]
----
-let(a=array(1, 2, 3, 4, 5),
-    b=array(100, 200, 300, 400, 5000),
-    c=corr(a, b, type=spearmans))
+let(a=random(logs, q="*:*", fl="filesize_d, response_d", rows=50000),
+    x=col(a, filesize_d),
+    y=col(a, response_d),
+    m=transpose(matrix(x, y)),
+    covariance=cov(m))
----

When this expression is sent to the `/stream` handler it responds with:

@@ -480,20 +480,40 @@ When this expression is sent to the `/stream` handler it responds with:
   "result-set": {
     "docs": [
       {
-        "c": 0.7432941462471664
+        "covariance": [
+          [
+            4018404.072532102,
+            80243.3948172242
+          ],
+          [
+            80243.3948172242,
+            1948.3216661122592
+          ]
+        ]
       },
       {
         "EOF": true,
-        "RESPONSE_TIME": 0
+        "RESPONSE_TIME": 534
       }
     ]
   }
 }
----

-Like the `cov` function, the `corr` function automatically builds a correlation matrix
-if a matrix is passed as a parameter. The correlation matrix is built by correlating the columns
-of the matrix passed in.
+The covariance matrix contains both the variance for the two vectors and the covariance between the vectors
+in the following format:
+
+[source,text]
+----
+          x                    y
+   x [4018404.072532102,  80243.3948172242],
+   y [80243.3948172242,   1948.3216661122592]
+----
+
+The covariance matrix is always square, so a covariance matrix created from 3 vectors will produce a 3 x 3 matrix.
+
+
== Statistical Inference Tests

@@ -518,7 +538,7 @@ from the same population.
 drawn from the same population.
 
 * `mannWhitney`: The Mann-Whitney test is a non-parametric test that tests if two
-samples of continuous were pulled
+samples of continuous data were pulled
from the same population. The Mann-Whitney test is often used instead of the T-test when the underlying
assumptions of the T-test are not met.
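+
+As an illustration of how these tests are called, the sketch below runs a two sample
+t-test on two synthetic samples drawn from the same normal distribution (the mean,
+standard deviation and sample sizes are made up for illustration). The `ttest` function
+returns a tuple containing the t-statistic and p-value:
+
+[source,text]
+----
+let(a=sample(normalDistribution(100, 10), 500),
+    b=sample(normalDistribution(100, 10), 500),
+    c=ttest(a, b))
+----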
diff --git a/solr/solr-ref-guide/src/stream-decorator-reference.adoc b/solr/solr-ref-guide/src/stream-decorator-reference.adoc index ff7d95b4e77..9c5063beae9 100644 --- a/solr/solr-ref-guide/src/stream-decorator-reference.adoc +++ b/solr/solr-ref-guide/src/stream-decorator-reference.adoc @@ -637,8 +637,8 @@ Users who anticipate concurrent updates, and wish to "skip" any failed deletes, The `eval` function allows for use cases where new streaming expressions are generated on the fly and then evaluated. The `eval` function wraps a streaming expression and reads a single tuple from the underlying stream. -The `eval` function then retrieves a string Streaming Expressions from the `expr_s` field of the tuple. -The `eval` function then compiles the string Streaming Expression and emits the tuples. +The `eval` function then retrieves a string streaming expression from the `expr_s` field of the tuple. +The `eval` function then compiles the string streaming expression and emits the tuples. === eval Parameters @@ -652,7 +652,7 @@ eval(expr) ---- In the example above the `eval` expression reads the first tuple from the underlying expression. It then compiles and -executes the string Streaming Expression in the expr_s field. +executes the string streaming expression in the `expr_s` field. == executor @@ -665,7 +665,7 @@ This model allows for asynchronous execution of jobs where the output is stored === executor Parameters * `threads`: (Optional) The number of threads in the executors thread pool for executing expressions. -* `StreamExpression`: (Mandatory) The stream source which contains the Streaming Expressions to execute. +* `StreamExpression`: (Mandatory) The stream source which contains the streaming expressions to execute. === executor Syntax diff --git a/solr/solr-ref-guide/src/stream-evaluator-reference.adoc b/solr/solr-ref-guide/src/stream-evaluator-reference.adoc index b1ef9caaae9..44efd5d6690 100644 --- a/solr/solr-ref-guide/src/stream-evaluator-reference.adoc +++ b/solr/solr-ref-guide/src/stream-evaluator-reference.adoc @@ -293,7 +293,7 @@ if(gt(fieldA,fieldB),ceil(fieldA),ceil(fieldB)) // if fieldA > fieldB then retur == col -The `col` function returns a numeric array from a list of Tuples. The `col` +The `col` function returns a numeric array from a list of tuples. The `col` function is used to create numeric arrays from stream sources. === col Parameters @@ -1020,11 +1020,11 @@ string array: The labels for each row in the matrix == getValue -The `getValue` function returns the value of a single Tuple entry by key. +The `getValue` function returns the value of a single tuple entry by key. === getValue Parameters -* `tuple`: The Tuple to return the entry from. +* `tuple`: The tuple to return the entry from. * `key`: The key of the entry to return the value for. === getValue Syntax @@ -1033,7 +1033,7 @@ getValue(tuple, key) === getValue Returns -object: Returns an object of the same type as the Tuple entry. +object: Returns an object of the same type as the tuple entry. == grandSum @@ -1589,7 +1589,7 @@ not(eq(fieldA,fieldB)) // true if fieldA != fieldB The `olsRegress` function performs https://en.wikipedia.org/wiki/Ordinary_least_squares[ordinary least squares], multivariate, linear regression. -The `olsRegress` function returns a single Tuple containing the regression model with estimated regression parameters, RSquared and regression diagnostics. 
+The `olsRegress` function returns a single tuple containing the regression model with estimated regression parameters, RSquared and regression diagnostics. The output of `olsRegress` can be used with the <> function to predict values based on the regression model. @@ -2082,11 +2082,11 @@ matrix: The matrix with the labels set. == setValue -The `setValue` function sets a new value for a Tuple entry. +The `setValue` function sets a new value for a tuple entry. === setValue Parameters -* `tuple`: The Tuple to return the entry from. +* `tuple`: The tuple to return the entry from. * `key`: The key of the entry to set. * `value`: The value to set. diff --git a/solr/solr-ref-guide/src/stream-source-reference.adoc b/solr/solr-ref-guide/src/stream-source-reference.adoc index d591534b8dc..e5352b4eed9 100644 --- a/solr/solr-ref-guide/src/stream-source-reference.adoc +++ b/solr/solr-ref-guide/src/stream-source-reference.adoc @@ -110,11 +110,11 @@ jdbc( == drill -The `drill` function is designed to support efficient high cardinality aggregation. The `drill` -function sends a request to the `export` handler in a specific collection which includes a Streaming -Expression that the `export` handler applies to the sorted result set. The `export` handler then emits the aggregated tuples. -The `drill` function reads and emits the aggregated tuples fromn each shard maintaining the sort order, -but does not merge the aggregations. Streaming Expression functions can be wrapped around the `drill` function to +The `drill` function is designed to support efficient high cardinality aggregation. +The `drill` function sends a request to the `export` handler in a specific collection which includes a streaming expression that the `export` handler applies to the sorted result set. +The `export` handler then emits the aggregated tuples. +The `drill` function reads and emits the aggregated tuples from each shard maintaining the sort order, but does not merge the aggregations. +Streaming expression functions can be wrapped around the `drill` function to merge the aggregates. === drill Parameters @@ -154,8 +154,8 @@ rollup(drill(articles, == echo -The `echo` function returns a single Tuple echoing its text parameter. `Echo` is the simplest stream source designed to provide text -to a text analyzing stream decorator. +The `echo` function returns a single tuple echoing its text parameter. +`Echo` is the simplest stream source designed to provide text to a text analyzing stream decorator. === echo Syntax @@ -606,10 +606,12 @@ topic(checkpointCollection, == tuple -The `tuple` function emits a single Tuple with name/value pairs. The values can be set to variables assigned in a `let` expression, literals, Stream Evaluators or -Stream Expressions. In the case of Stream Evaluators the tuple will output the return value from the evaluator. -This could be a numeric, list or map. If a value is set to a Stream Expression, the `tuple` function will flatten -the tuple stream from the Stream Expression into a list of Tuples. +The `tuple` function emits a single tuple with name/value pairs. +The values can be set to variables assigned in a `let` expression, literals, stream evaluators or stream expressions. +In the case of stream evaluators the tuple will output the return value from the evaluator. +This could be a numeric, list, or map. +If a value is set to a stream expression, the `tuple` function will flatten +the tuple stream from the stream expression into a list of tuples. 
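+
+For example, a minimal sketch of a `tuple` call that emits one tuple containing a string
+literal, an evaluator result, and a vector (the names are illustrative only):
+
+[source,text]
+----
+tuple(id="sample-tuple",
+      total=add(3, 4),
+      values=array(1.5, 2.5, 3.5))
+----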
=== tuple Parameters

diff --git a/solr/solr-ref-guide/src/streaming-expressions.adoc b/solr/solr-ref-guide/src/streaming-expressions.adoc
index eb7c8bf5978..6d8c3f91d1b 100644
--- a/solr/solr-ref-guide/src/streaming-expressions.adoc
+++ b/solr/solr-ref-guide/src/streaming-expressions.adoc
@@ -17,39 +17,27 @@
 // specific language governing permissions and limitations
 // under the License.

-Streaming Expressions provide a simple yet powerful stream processing language for SolrCloud.
+Streaming expressions expose the capabilities of SolrCloud as composable functions.
+These functions provide a system for searching, transforming, analyzing, and visualizing data stored in SolrCloud collections.

-Streaming expressions are a suite of functions that can be combined to perform many different parallel computing tasks. These functions are the basis for the <>.
+At a high level there are four main capabilities that will be explored in the documentation:

-There is a growing library of functions that can be combined to implement:
+* *Searching*, sampling and aggregating results from Solr.

-* Request/response stream processing
-* Batch stream processing
-* Fast interactive MapReduce
-* Aggregations (Both pushed down faceted and shuffling MapReduce)
-* Parallel relational algebra (distributed joins, intersections, unions, complements)
-* Publish/subscribe messaging
-* Distributed graph traversal
-* Machine learning and parallel iterative model training
-* Anomaly detection
-* Recommendation systems
-* Retrieve and rank services
-* Text classification and feature extraction
-* Streaming NLP
-* Statistical Programming
+* *Transforming* result sets after they are retrieved from Solr.

-Streams from outside systems can be joined with streams originating from Solr and users can add their own stream functions by following Solr's {solr-javadocs}/solrj/org/apache/solr/client/solrj/io/stream/package-summary.html[Java streaming API].
+* *Analyzing* and modeling result sets using probability, statistics, and machine learning libraries.
+
+* *Visualizing* result sets, aggregations and statistical models of the data.

-[IMPORTANT]
-====
-Both streaming expressions and the streaming API are considered experimental, and the APIs are subject to change.
-====

== Stream Language Basics

-Streaming Expressions are comprised of streaming functions which work with a Solr collection. They emit a stream of tuples (key/value Maps).
+Streaming expressions are composed of streaming functions which work with a Solr collection.
+They emit a stream of tuples (key/value Maps).

-Many of the provided streaming functions are designed to work with entire result sets rather than the top N results like normal search. This is supported by the <>.
+Some of the provided streaming functions are designed to work with entire result sets rather than the top N results like normal search.
+This is supported by the <>.

Some streaming functions act as stream sources to originate the stream flow. Other streaming functions act as stream decorators to wrap other stream functions and perform operations on the stream of tuples. Many streams functions can be parallelized across a worker collection. This can be particularly powerful for relational algebra functions.
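+
+As a small sketch of how these pieces compose, the `top` stream decorator below wraps a
+`search` stream source and re-sorts its tuples (the collection and field names are
+illustrative only):
+
+[source,text]
+----
+top(n=10,
+    search(logs, q="*:*", fl="id, response_d", sort="response_d asc"),
+    sort="response_d desc")
+----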
@@ -64,8 +52,7 @@ The `/stream` request handler takes one parameter, `expr`, which is used to spec

curl --data-urlencode 'expr=search(enron_emails,
                                   q="from:1800flowers*",
                                   fl="from, to",
-                                   sort="from asc",
-                                   qt="/export")' http://localhost:8983/solr/enron_emails/stream
+                                   sort="from asc")' http://localhost:8983/solr/enron_emails/stream
----

Details of the parameters for each function are included below.
@@ -95,76 +82,42 @@ For the above example the `/stream` handler responded with the following JSON re

Note the last tuple in the above example stream is `{"EOF":true,"RESPONSE_TIME":33}`. The `EOF` indicates the end of the stream. To process the JSON response, you'll need to use a streaming JSON implementation because streaming expressions are designed to return the entire result set which may have millions of records. In your JSON client you'll need to iterate each doc (tuple) and check for the EOF tuple to determine the end of stream.

-The {solr-javadocs}/solrj/org/apache/solr/client/solrj/io/package-summary.html[`org.apache.solr.client.solrj.io`] package provides Java classes that compile streaming expressions into streaming API objects. These classes can be used to execute streaming expressions from inside a Java application. For example:
-[source,java]
----
- StreamFactory streamFactory = new DefaultStreamFactory().withCollectionZkHost("collection1", zkServer.getZkAddress());
- InjectionDefense defense = new InjectionDefense("parallel(collection1, group(search(collection1, q=\"*:*\", fl=\"id,a_s,a_i,a_f\", sort=\"a_s asc,a_f asc\", partitionKeys=\"a_s\"), by=\"a_s asc\"), workers=\"2\", zkHost=\"?$?\", sort=\"a_s asc\")");
- defense.addParameter(zkhost);
- ParallelStream pstream = (ParallelStream)streamFactory.constructStream(defense.safeExpressionString());
----
+== Elements of the Language

-Note that InjectionDefense need only be used if the string being inserted could contain user supplied data. See the
-javadoc for `InjectionDefense` for usage details and SOLR-12891 for an example of the potential risks.
-Also note that for security reasons normal parameter substitution no longer applies to the expr parameter
-unless the jvm has been started with `-DStreamingExpressionMacros=true` (usually via `solr.in.sh`)
+=== Stream Sources

-=== Data Requirements
-
-Because streaming expressions relies on the `/export` handler, many of the field and field type requirements to use `/export` are also requirements for `/stream`, particularly for `sort` and `fl` parameters. Please see the section <> for details.
-
-=== Local Execution
-
-In certain special cases such as joining documents on a value that is 1:1 with the portion of the id used in
-composite routing, the relevant data is always co-located on the same node. When this happens, fanning out requests
-among many nodes and waiting for a response from all nodes is inefficient. In cases where data co-location holds true
-for the entire expression, it may be faster for the client to send the expression to each slice with
-`&streamLocalOnly=true` and handle merging of the results (if required) locally. This is an advanced option, relying
-on a convenient organization of the index, and should only be considered if normal usage poses a performance issue.
-
-=== Request Routing
-
-Streaming Expressions respect the <> for any call to Solr.
-
-The value of `shards.preference` that is used to route requests is determined in the following order. The first option available is used.
-- Provided as a parameter in the streaming expression (e.g., `search(...., shards.preference="replica.type:PULL")`)
-- Provided in the URL Params of the streaming expression (e.g., `http://solr_url:8983/solr/stream?expr=....&shards.preference=replica.type:PULL`)
-- Set as a default in the Cluster properties.
-
-=== Adding Custom Expressions
-
-Creating your own custom expressions can be easily done by implementing the {solr-javadocs}/solrj/org/apache/solr/client/solrj/io/stream/expr/Expressible.html[Expressible] interface. To add a custom expression to the
-list of known mappings for the `/stream` and `/graph` handlers, you just need to declare it as a plugin in `solrconfig.xml` via:
-
-[source,xml]
-
-
-
-== Types of Streaming Expressions
-
-=== About Stream Sources
-
-Stream sources originate streams. The most commonly used one of these is `search`, which does a query.
+Stream sources originate streams. There is a rich set of searching, sampling, and aggregation stream sources to choose from.

A full reference to all available source expressions is available in <>.

-=== About Stream Decorators
-Stream decorators wrap other stream functions or perform operations on a stream.
+
+=== Stream Decorators
+
+Stream decorators wrap stream sources and other stream decorators to transform a stream.

A full reference to all available decorator expressions is available in <>.

-=== About Stream Evaluators
+=== Math Expressions

-Stream Evaluators can be used to evaluate (calculate) new values based on other values in a tuple. That newly evaluated value can be put into the tuple (as part of a `select(...)` clause), used to filter streams (as part of a `having(...)` clause), and for other things. Evaluators can contain field names, raw values, or other evaluators, giving you the ability to create complex evaluation logic, including conditional if/then choices.
+Math expressions are a vector and matrix math library that can be combined with streaming expressions to perform analysis and build mathematical models
+of the result sets.
+From a language standpoint math expressions are a sub-language of streaming expressions that don't return streams of tuples.
+Instead they operate on and return numbers, vectors, matrices and mathematical models.
+The documentation will show how to combine streaming expressions and math
+expressions.

-In cases where you want to use raw values as part of an evaluation you will need to consider the order of how evaluators are parsed.
+The math expressions user guide is available in <<>>.

-1. If the parameter can be parsed into a valid number, then it is considered a number. For example, `add(3,4.5)`
-2. If the parameter can be parsed into a valid boolean, then it is considered a boolean. For example, `eq(true,false)`
-3. If the parameter can be parsed into a valid evaluator, then it is considered an evaluator. For example, `eq(add(10,4),add(7,7))`
-4. The parameter is considered a field name, even if it quoted. For example, `eq(fieldA,"fieldB")`
-
-If you wish to use a raw string as part of an evaluation, you will want to consider using the `raw(string)` evaluator. This will always return the raw value, no matter what is entered.
+From a language standpoint math expressions are referred to as *stream evaluators*. A full reference to all available evaluator expressions is available in <>.
+
+=== Visualization
+
+Visualization of both streaming expressions and math expressions is done using Apache Zeppelin and the Zeppelin-Solr Interpreter.
+
+Visualizing streaming expressions and setting up Apache Zeppelin is documented in <>.
+
+The <> has in-depth coverage of visualization techniques.
diff --git a/solr/solr-ref-guide/src/taking-solr-to-production.adoc b/solr/solr-ref-guide/src/taking-solr-to-production.adoc
index 7e79b12d42d..4bf92e77a65 100644
--- a/solr/solr-ref-guide/src/taking-solr-to-production.adoc
+++ b/solr/solr-ref-guide/src/taking-solr-to-production.adoc
@@ -240,7 +240,7 @@ Setting the hostname of the Solr server is recommended, especially when running

=== Environment Banner in Admin UI

-To guard against accidentally doing changes to the wrong cluster, you may configure a visual indication in the Admin UI of whether you currently work with a production environment or not. To do this, edit your `solr.in.sh` or `solr.in.cmd` file with a `-Dsolr.environment=prod` setting, or set the cluster property named `environment`. To specify label and/or color, use a comma delimited format as below. The `+` character can be used instead of space to avoid quoting. Colors may be valid CSS colors or numeric, e.g., `#ff0000` for bright red. Examples of valid environment configs:
+To guard against accidentally making changes to the wrong cluster, you may configure a visual indication in the Admin UI of whether you currently work with a production environment or not. To do this, edit your `solr.in.sh` or `solr.in.cmd` file with a `-Dsolr.environment=prod` setting, or set the cluster property named `environment`. To specify label and/or color, use a comma-delimited format as below. The `+` character can be used instead of space to avoid quoting. Colors may be valid CSS colors or numeric, e.g., `#ff0000` for bright red. Examples of valid environment configs:

* `prod`
* `test,label=Functional+test`
diff --git a/solr/solr-ref-guide/src/term-vectors.adoc b/solr/solr-ref-guide/src/term-vectors.adoc
index 25100991c37..84047267ece 100644
--- a/solr/solr-ref-guide/src/term-vectors.adoc
+++ b/solr/solr-ref-guide/src/term-vectors.adoc
@@ -16,9 +16,8 @@
 // specific language governing permissions and limitations
 // under the License.

-Term frequency-inverse document frequency (TF-IDF) term vectors are often used to
-represent text documents when performing text mining and machine learning operations. The math expressions
-library can be used to perform text analysis and create TF-IDF term vectors.
+This section of the user guide presents an overview of the text analysis, text analytics
+and TF-IDF term vector functions in math expressions.

== Text Analysis

@@ -57,14 +56,14 @@ When this expression is sent to the `/stream` handler it responds with:
 }
 ----

+
=== Annotating Documents

-The `analyze` function can be used inside of a `select` function to annotate documents with the tokens
-generated by the analysis.
+The `analyze` function can be used inside of a `select` function to annotate documents with the tokens generated by the analysis.

-The example below performs a `search` in "collection1". Each tuple returned by the `search` function
-contains an `id` and `subject`. For each tuple, the
-`select` function selects the `id` field and calls the `analyze` function on the `subject` field.
+The example below performs a `search` in "collection1".
+Each tuple returned by the `search` function contains an `id` and `subject`.
+For each tuple, the `select` function selects the `id` field and calls the `analyze` function on the `subject` field.
The analyzer chain specified by the `subject_bigram` field is configured to perform a bigram analysis.
The tokens generated by the `analyze` function are added to each tuple in a field called `terms`.
@@ -106,6 +105,44 @@ Notice in the output that an array of bigram terms have been added to the tuples
 }
 ----

+=== Text Analytics
+
+The `cartesianProduct` function can be used in conjunction
+with the `analyze` function to perform a wide range
+of text analytics.
+
+The `cartesianProduct` function explodes a multivalued field into a stream of tuples.
+When the `analyze` function is used to create the multivalued field, the `cartesianProduct` function will explode the analyzed tokens into a stream of tuples.
+This allows analytics to be performed over the stream of analyzed tokens and the result to be visualized with Zeppelin-Solr.
+
+*Example: Phrase Aggregation*
+
+An example performing phrase aggregation is used to illustrate the power of combining `cartesianProduct` and `analyze`.
+
+In this example the `search` expression is performed over a collection of movie reviews.
+The phrase query "Man on Fire" is searched for and the top 5000 results, by score, are returned.
+A single field is returned from the results: the `review_t` field, which
+contains the text of the movie review.
+
+Then the `cartesianProduct` function is run over the search results.
+The `cartesianProduct` function applies the `analyze` function, which takes the `review_t` field and analyzes it with the Lucene/Solr analyzer attached to the `text_bigrams` schema field.
+This analyzer emits the bigrams found in the text field.
+The `cartesianProduct` function explodes each bigram into its own tuple with the bigram stored in the field `term`.
+
+The stream of tuples, each containing a bigram, is then filtered by the `having` function
+using regular expressions to select bigrams with a length of 12 or greater and to filter
+out bigrams that contain specific characters.
+
+The `hashRollup` function then aggregates the bigrams and the `top` function emits the top 10 bigrams by count.
+
+Then Zeppelin-Solr is used to visualize the top 10 bigrams.
+
+image::images/math-expressions/text-analytics.png[]
+
+Lucene/Solr analyzers can be configured in many different ways to support
+aggregations over NLP entities (people, places, companies, etc.) as well as
+tokens extracted with regular expressions or dictionaries.
+
== TF-IDF Term Vectors

The `termVectors` function can be used to build TF-IDF term vectors from the terms generated by the `analyze` function.
@@ -113,8 +150,9 @@ The `termVectors` function can be used to build TF-IDF term vectors from the ter
 The `termVectors` function operates over a list of tuples that contain a field called `id`
 and a field called `terms`. Notice that this is the exact output structure of the document annotation example above.

-The `termVectors` function builds a matrix from the list of tuples. There is row in the
-matrix for each tuple in the list. There is a column in the matrix for each term in the `terms` field.
+The `termVectors` function builds a matrix from the list of tuples.
+There is a row in the matrix for each tuple in the list.
+There is a column in the matrix for each term in the `terms` field.

[source,text]
----
@@ -129,17 +167,16 @@ let(echo="c, d", <1>

The example below builds on the document annotation example.

-<1> The `echo` parameter will echo variables *`c`* and *`d`*, so the output includes
+<1> The `echo` parameter will echo variables `c` and `d`, so the output includes
the row and column labels, which will be defined later in the expression.
-<2> The list of tuples are stored in variable *`a`*. The `termVectors` function -operates over variable *`a`* and builds a matrix with 2 rows and 4 columns. -<3> The `termVectors` function sets the row and column labels of the term vectors matrix as variable *`b`*. +<2> The list of tuples are stored in variable `a`. The `termVectors` function +operates over variable `a` and builds a matrix with 2 rows and 4 columns. +<3> The `termVectors` function sets the row and column labels of the term vectors matrix as variable `b`. The row labels are the document ids and the column labels are the terms. <4> The `getRowLabels` and `getColumnLabels` functions return the row and column labels which are then stored in variables *`c`* and *`d`*. -When this expression is sent to the `/stream` handler it -responds with: +When this expression is sent to the `/stream` handler it responds with: [source,json] ---- @@ -169,8 +206,8 @@ responds with: === TF-IDF Values -The values within the term vectors matrix are the TF-IDF values for each term in each document. The -example below shows the values of the matrix. +The values within the term vectors matrix are the TF-IDF values for each term in each document. +The example below shows the values of the matrix. [source,text] ---- @@ -180,8 +217,7 @@ let(a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"), b=termVectors(a, minTermLength=4, minDocFreq=0, maxDocFreq=1)) ---- -When this expression is sent to the `/stream` handler it -responds with: +When this expression is sent to the `/stream` handler it responds with: [source,json] ---- @@ -215,22 +251,21 @@ responds with: === Limiting the Noise -One of the key challenges when with working term vectors is that text often has a significant amount of noise -which can obscure the important terms in the data. The `termVectors` function has several parameters -designed to filter out the less meaningful terms. This is also important because eliminating -the noisy terms helps keep the term vector matrix small enough to fit comfortably in memory. +One of the key challenges when working with term vectors is that text often has a significant amount of noise which can obscure the important terms in the data. +The `termVectors` function has several parameters designed to filter out the less meaningful terms. +This is also important because eliminating the noisy terms helps keep the term vector matrix small enough to fit comfortably in memory. There are four parameters designed to filter noisy terms from the term vector matrix: `minTermLength`:: The minimum term length required to include the term in the matrix. -minDocFreq:: +`minDocFreq`:: The minimum percentage, expressed as a number between 0 and 1, of documents the term must appear in to be included in the index. -maxDocFreq:: +`maxDocFreq`:: The maximum percentage, expressed as a number between 0 and 1, of documents the term can appear in to be included in the index. -exclude:: -A comma delimited list of strings used to exclude terms. If a term contains any of the exclude strings that +`exclude`:: +A comma-delimited list of strings used to exclude terms. If a term contains any of the excluded strings that term will be excluded from the term vector. 
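+
+A sketch combining these noise-filtering parameters, building on the earlier examples
+(the `.01`/`.95` thresholds and the excluded strings are illustrative only):
+
+[source,text]
+----
+let(a=select(search(collection3, q="*:*", fl="id, subject", sort="id asc"),
+             id,
+             analyze(subject, subject_bigram) as terms),
+    b=termVectors(a, minTermLength=4, minDocFreq=.01, maxDocFreq=.95, exclude="re:,fwd:"))
+----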
diff --git a/solr/solr-ref-guide/src/the-terms-component.adoc b/solr/solr-ref-guide/src/the-terms-component.adoc
index 87d7993aa99..cf27d2224e0 100644
--- a/solr/solr-ref-guide/src/the-terms-component.adoc
+++ b/solr/solr-ref-guide/src/the-terms-component.adoc
@@ -53,7 +53,7 @@ Specifies the field from which to retrieve terms. This parameter is required if
 Example: `terms.fl=title`

`terms.list`::
-Fetches the document frequency for a comma delimited list of terms. Terms are always returned in index order. If `terms.ttf` is set to true, also returns their total term frequency. If multiple `terms.fl` are defined, these statistics will be returned for each term in each requested field.
+Fetches the document frequency for a comma-delimited list of terms. Terms are always returned in index order. If `terms.ttf` is set to true, also returns their total term frequency. If multiple `terms.fl` are defined, these statistics will be returned for each term in each requested field.
+
Example: `terms.list=termA,termB,termC`
+
@@ -353,7 +353,7 @@ The `shards` parameter is subject to a host whitelist that has to be configured
+
By default the whitelist will be populated with all live nodes when running in SolrCloud mode. If you need to disable this feature for backwards compatibility, you can set the system property `solr.disable.shardsWhitelist=true`.
+
-See the section <> for more information about how the whitelist works.
+See the section <> for more information about how the whitelist works.

`shards.qt`::
Specifies the request handler Solr uses for requests to shards.
diff --git a/solr/solr-ref-guide/src/time-series.adoc b/solr/solr-ref-guide/src/time-series.adoc
index ff90a749a86..abd920c9dbc 100644
--- a/solr/solr-ref-guide/src/time-series.adoc
+++ b/solr/solr-ref-guide/src/time-series.adoc
@@ -16,26 +16,28 @@
 // specific language governing permissions and limitations
 // under the License.

-This section of the user guide provides an overview of time series *aggregation*,
-*smoothing* and *differencing*.
+This section of the user guide provides an overview of some of the time series capabilities available
+in streaming expressions and math expressions.

== Time Series Aggregation

The `timeseries` function performs fast, distributed time
-series aggregation leveraging Solr's builtin faceting and date math capabilities.
+series aggregation leveraging Solr's built-in faceting and date math capabilities.

-The example below performs a monthly time series aggregation:
+The example below performs a monthly time series aggregation over a collection of daily stock price data.
+In this example the average monthly closing price is calculated for the stock
+ticker *AMZN* over a specific date range.
[source,text] ---- -timeseries(collection1, - q=*:*, - field="recdate_dt", - start="2012-01-20T17:33:18Z", - end="2012-12-20T17:33:18Z", +timeseries(stocks, + q=ticker_s:amzn, + field="date_dt", + start="2010-01-01T00:00:00Z", + end="2017-11-01T00:00:00Z", gap="+1MONTH", format="YYYY-MM", - count(*)) + avg(close_d)) ---- When this expression is sent to the `/stream` handler it responds with: @@ -46,342 +48,146 @@ When this expression is sent to the `/stream` handler it responds with: "result-set": { "docs": [ { - "recdate_dt": "2012-01", - "count(*)": 8703 + "date_dt": "2010-01", + "avg(close_d)": 127.42315789473685 }, { - "recdate_dt": "2012-02", - "count(*)": 8648 + "date_dt": "2010-02", + "avg(close_d)": 118.02105263157895 }, { - "recdate_dt": "2012-03", - "count(*)": 8621 + "date_dt": "2010-03", + "avg(close_d)": 130.89739130434782 }, { - "recdate_dt": "2012-04", - "count(*)": 8533 + "date_dt": "2010-04", + "avg(close_d)": 141.07 }, { - "recdate_dt": "2012-05", - "count(*)": 8792 + "date_dt": "2010-05", + "avg(close_d)": 127.606 }, { - "recdate_dt": "2012-06", - "count(*)": 8598 + "date_dt": "2010-06", + "avg(close_d)": 121.66681818181816 }, { - "recdate_dt": "2012-07", - "count(*)": 8679 - }, - { - "recdate_dt": "2012-08", - "count(*)": 8469 - }, - { - "recdate_dt": "2012-09", - "count(*)": 8637 - }, - { - "recdate_dt": "2012-10", - "count(*)": 8536 - }, - { - "recdate_dt": "2012-11", - "count(*)": 8785 - }, - { - "EOF": true, - "RESPONSE_TIME": 16 + "date_dt": "2010-07", + "avg(close_d)": 117.5190476190476 } - ] - } -} +]}} ---- +Using Zeppelin-Solr this time series can be visualized using a line chart. + +image::images/math-expressions/timeseries1.png[] + + == Vectorizing the Time Series -Before a time series result can be operated on by math expressions - the data will need to be vectorized. Specifically -in the example above, the aggregation field count(*) will need to by moved into an array. -As described in the Streams and Vectorization section of the user guide, the `col` function can be used -to copy a numeric column from a list of tuples into an array. +Before a time series can be smoothed or modeled the data will need to be vectorized. +The `col` function can be used +to copy a column of data from a list of tuples into an array. -The expression below demonstrates the vectorization of the count(*) field. +The expression below demonstrates the vectorization of the `date_dt` and `avg(close_d)` fields. +The `zplot` function is then used to plot the months on the x-axis and the average closing prices on the y-axis. -[source,text] ----- -let(a=timeseries(collection1, - q=*:*, - field="test_dt", - start="2012-01-20T17:33:18Z", - end="2012-12-20T17:33:18Z", - gap="+1MONTH", - format="YYYY-MM", - count(*)), - b=col(a, count(*))) ----- +image::images/math-expressions/timeseries2.png[] -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "b": [ - 8703, - 8648, - 8621, - 8533, - 8792, - 8598, - 8679, - 8469, - 8637, - 8536, - 8785 - ] - }, - { - "EOF": true, - "RESPONSE_TIME": 5 - } - ] - } -} ----- == Smoothing -Time series smoothing is often used to remove the noise from a time series and help -spot the underlying trends. +Time series smoothing is often used to remove the noise from a time series and help spot the underlying trend. The math expressions library has three *sliding window* approaches -for time series smoothing. 
The *sliding window* approaches use a summary value
-from a sliding window of the data to calculate a new set of smoothed data points.
+for time series smoothing.
+These approaches use a summary value from a sliding window of the data to calculate a new set of smoothed data points.

The three *sliding window* functions are lagging indicators, which means they don't
start to move in the direction of the trend until the trend effects
-the summary value of the sliding window. Because of this lagging quality these smoothing
-functions are often used to confirm the direction of the trend.
+the summary value of the sliding window.
+Because of this lagging quality these smoothing functions are often used to confirm the direction of the trend.

=== Moving Average

The `movingAvg` function computes a simple moving average over a sliding window of data.

-The example below generates a time series, vectorizes the count(*) field and computes the
-moving average with a window size of 3.
+The example below generates a time series, vectorizes the `avg(close_d)` field and computes the
+moving average with a window size of 5.

The moving average function returns an array that is of shorter length
-then the original data set. This is because results are generated only when a full window of data
-is available for computing the average. With a window size of three the moving average will
-begin generating results at the 3rd value. The prior values are not included in the result.
+than the original vector. This is because results are generated only when a full window of data
+is available for computing the average. With a window size of five the moving average will
+begin generating results at the 5th value. The prior values are not included in the result.

-This is true for all the sliding window functions.
+The `zplot` function is then used to plot the months on the x-axis, and the average close and moving
+average on the y-axis. Notice that the `ltrim` function is used to trim the first 4 values from
+the x-axis and the average closing prices. This is done to line up the three arrays so they start
+from the 5th value.

-[source,text]
-----
-let(a=timeseries(collection1,
-                 q=*:*,
-                 field="test_dt",
-                 start="2012-01-20T17:33:18Z",
-                 end="2012-12-20T17:33:18Z",
-                 gap="+1MONTH",
-                 format="YYYY-MM",
-                 count(*)),
-    b=col(a, count(*)),
-    c=movingAvg(b, 3))
-----
-
-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
-----
-{
-  "result-set": {
-    "docs": [
-      {
-        "c": [
-          8657.333333333334,
-          8600.666666666666,
-          8648.666666666666,
-          8641,
-          8689.666666666666,
-          8582,
-          8595,
-          8547.333333333334,
-          8652.666666666666
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 7
-      }
-    ]
-  }
-}
-----
+image::images/math-expressions/movingavg.png[]

=== Exponential Moving Average

The `expMovingAvg` function uses a different formula for computing the moving average that
responds faster to changes in the underlying data. This means that it is
-less of a lagging indicator then the simple moving average.
+less of a lagging indicator than the simple moving average.

-Below is an example that computes an exponential moving average:
+Below is an example that computes a moving average and exponential moving average and plots them
+along with the original y values. Notice how the exponential moving average is more sensitive
+to changes in the y values.
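+
+A sketch of the kind of expression behind the plot below, assuming the same *stocks*
+collection used earlier (the window size and date range mirror the description above):
+
+[source,text]
+----
+let(a=timeseries(stocks, q=ticker_s:amzn,
+                 field="date_dt",
+                 start="2010-01-01T00:00:00Z",
+                 end="2017-11-01T00:00:00Z",
+                 gap="+1MONTH",
+                 format="YYYY-MM",
+                 avg(close_d)),
+    x=col(a, date_dt),
+    y=col(a, avg(close_d)),
+    sma=movingAvg(y, 5),
+    ema=expMovingAvg(y, 5),
+    zplot(x=ltrim(x, 4), y=ltrim(y, 4), sma=sma, ema=ema))
+----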
-[source,text]
----
-let(a=timeseries(collection1, q=*:*,
-                 field="test_dt",
-                 start="2012-01-20T17:33:18Z",
-                 end="2012-12-20T17:33:18Z",
-                 gap="+1MONTH",
-                 format="YYYY-MM",
-                 count(*)),
-    b=col(a, count(*)),
-    c=expMovingAvg(b, 3))
----
+image::images/math-expressions/expmoving.png[]

-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "c": [
-          8657.333333333334,
-          8595.166666666668,
-          8693.583333333334,
-          8645.791666666668,
-          8662.395833333334,
-          8565.697916666668,
-          8601.348958333334,
-          8568.674479166668,
-          8676.837239583334
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 5
-      }
-    ]
-  }
-}
----

=== Moving Median

The `movingMedian` function uses the median of the sliding window rather than the average.
-In many cases the moving median will be more *robust* to outliers then moving averages.
+In many cases the moving median will be more *robust* to outliers than moving averages.

Below is an example computing the moving median:

-[source,text]
----
-let(a=timeseries(collection1,
-                 q=*:*,
-                 field="test_dt",
-                 start="2012-01-20T17:33:18Z",
-                 end="2012-12-20T17:33:18Z",
-                 gap="+1MONTH",
-                 format="YYYY-MM",
-                 count(*)),
-    b=col(a, count(*)),
-    c=movingMedian(b, 3))
----
+image::images/math-expressions/movingMedian.png[]

-When this expression is sent to the `/stream` handler it responds with:
-
-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "c": [
-          8648,
-          8621,
-          8621,
-          8598,
-          8679,
-          8598,
-          8637,
-          8536,
-          8637
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 7
-      }
-    ]
-  }
-}
----

== Differencing

-Differencing is often used to remove the
-trend or seasonality from a time series. This is known as making a time series
-*stationary*.
+Differencing can be used to make
+a time series stationary by removing the trend or seasonality from the series.

=== First Difference

-The actual technique of differencing is to use the difference between values rather than the
+The technique used in differencing is to use the difference between values rather than the
original values. The *first difference* takes the difference between a value and the value
that came directly before it. The first difference is often used to remove the trend
from a time series.

-In the example below, the `diff` function computes the first difference of a time series.
-The result array length is one value smaller then the original array.
-This is because the `diff` function only returns a result for values
-where the prior value has been subtracted.
+The examples below use the first difference to make two time series stationary so they can be compared
+without the trend.

-[source,text]
----
-let(a=timeseries(collection1,
-                 q=*:*,
-                 field="test_dt",
-                 start="2012-01-20T17:33:18Z",
-                 end="2012-12-20T17:33:18Z",
-                 gap="+1MONTH",
-                 format="YYYY-MM",
-                 count(*)),
-    b=col(a, count(*)),
-    c=diff(b))
----
+In this example we'll be comparing the average monthly closing price for two stocks: Amazon and Google.
+The image below plots both time series before differencing is applied.
+
+image::images/math-expressions/timecompare.png[]
+
+In the next example the `diff` function is applied to both time series inside the `zplot` function.
+The `diff` function can be applied inside the `zplot` function or, like any other function, inside of the `let`
+function.
+
+Notice that both time series now have the trend removed and the monthly movements of the stock price
+can be studied without being influenced by the trend.
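+
+A sketch of the kind of expression behind this visualization is shown below. It assumes a second
+ticker value of `goog` for the Google series; the collection, field, and ticker names are carried
+over from the earlier examples and are illustrative only:
+
+[source,text]
+----
+let(a=timeseries(stocks, q=ticker_s:amzn, field="date_dt",
+                 start="2010-01-01T00:00:00Z", end="2017-11-01T00:00:00Z",
+                 gap="+1MONTH", format="YYYY-MM", avg(close_d)),
+    b=timeseries(stocks, q=ticker_s:goog, field="date_dt",
+                 start="2010-01-01T00:00:00Z", end="2017-11-01T00:00:00Z",
+                 gap="+1MONTH", format="YYYY-MM", avg(close_d)),
+    zplot(amzn=diff(col(a, avg(close_d))),
+          goog=diff(col(b, avg(close_d)))))
+----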
+
+image::images/math-expressions/diff1.png[]
+
+In the next example the `zoom` function of the time series visualization is used to zoom into a specific
+range of months. This allows for closer inspection of the data. At this closer range there appears
+to be some correlation between the monthly movements of the two stocks.
+
+image::images/math-expressions/diffzoom.png[]
+
+In the final example the differenced time series are correlated with the `corr` function.
+
+image::images/math-expressions/diffcorr.png[]

-When this expression is sent to the `/stream` handler it responds with:

-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "c": [
-          -55,
-          -27,
-          -88,
-          259,
-          -194,
-          81,
-          -210,
-          168,
-          -101,
-          249
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 11
-      }
-    ]
-  }
-}
----

=== Lagged Differences

@@ -391,41 +197,132 @@ lag in the past. Lagged differences are often used to remove seasonality from a
time series.

The simple example below demonstrates how lagged differencing works.
Notice that the array in the example follows a simple repeated pattern. This type of pattern
-is often displayed with seasonality. In this example we can remove this pattern using
+is often associated with seasonality.
+
+image::images/math-expressions/season.png[]
+
+In this example we remove this pattern using
the `diff` function with a lag of 4. This will subtract the value lagging four indexes
-behind the current index. Notice that result set size is the original array size minus the lag.
+behind the current index. Notice that the result set size is the original array size minus the lag.
This is because the `diff` function only returns results for values where the lag of 4
is possible to compute.

-[source,text]
----
-let(a=array(1,2,5,2,1,2,5,2,1,2,5),
-    b=diff(a, 4))
----
+image::images/math-expressions/seasondiff.png[]

-Expression is sent to the `/stream` handler it responds with:

-[source,json]
----
-{
-  "result-set": {
-    "docs": [
-      {
-        "b": [
-          0,
-          0,
-          0,
-          0,
-          0,
-          0,
-          0
-        ]
-      },
-      {
-        "EOF": true,
-        "RESPONSE_TIME": 0
-      }
-    ]
-  }
-}
----
+== Anomaly Detection
+
+The `movingMAD` (moving mean absolute deviation) function can be used to surface anomalies
+in a time series by measuring dispersion (deviation from the mean) within a sliding window.
+
+The `movingMAD` function operates in a similar manner to a moving average, except it
+measures the mean absolute deviation within the window rather than the average. By
+looking for unusually high or low dispersion we can find anomalies in the time
+series.
+
+For this example we'll be working with daily stock prices for Amazon over a two-year
+period. The daily stock data will provide a larger data set to study.
+
+In the example below the `search` expression is used to return the daily closing price
+for the ticker *AMZN* over a two-year period.
+
+image::images/math-expressions/anomaly.png[]
+
+The next step is to apply the `movingMAD` function to the data to calculate
+the moving mean absolute deviation over a 10-day window. The example below shows the function being
+applied and visualized.
+
+image::images/math-expressions/mad.png[]
+
+Once the moving MAD has been calculated we can visualize the distribution of dispersion
+with the `empiricalDistribution` function. The example below plots the empirical
+distribution with 10 bins, creating a 10-bin histogram of the dispersion of the
+time series.
+
+This visualization shows that most of the mean absolute deviations fall between 0 and
+9.2 with the mean of the final bin at 11.94.
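+
+As a rough sketch of the steps described above, the expression might look like the following.
+The `rows` value is an assumption, and `zplot` is assumed here to accept a `dist` parameter
+for plotting a probability distribution:
+
+[source,text]
+----
+let(a=search(stocks, q="ticker_s:amzn", fl="date_dt, close_d",
+             sort="date_dt asc", rows=1000),
+    y=col(a, close_d),
+    mad=movingMAD(y, 10),
+    zplot(dist=empiricalDistribution(mad, 10)))
+----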
+
+image::images/math-expressions/maddist.png[]
+
+The final step is to detect outliers in the series using the `outliers` function.
+The `outliers` function uses a probability distribution to find outliers in a numeric vector.
+The `outliers` function takes five parameters:
+
+* Probability distribution
+* Numeric vector
+* Low probability threshold
+* High probability threshold
+* List of results that the numeric vector was selected from
+
+The `outliers` function iterates over the numeric vector and uses the probability
+distribution to calculate the cumulative probability of each value. If the cumulative
+probability is below the low probability threshold or above the high threshold it considers
+the value an outlier. When the `outliers` function encounters an outlier it returns
+the corresponding result from the list of results provided by the fifth parameter.
+It also includes the cumulative probability and the value of the outlier.
+
+The example below shows the `outliers` function applied to the Amazon stock
+price data set. The empirical distribution of the moving mean absolute deviation is
+the first parameter. The vector containing the moving mean absolute
+deviations is the second parameter. `-1` and `.99` are the low and high probability
+thresholds. `-1` means that low outliers will not be considered. The final parameter
+is the original result set containing the `close_d` and `date_dt` fields.
+
+The output of the `outliers` function contains the results where an outlier was detected.
+In this case 5 results above the .99 probability threshold were detected.
+
+
+image::images/math-expressions/outliers.png[]
+
+
+== Modeling
+
+Math expressions support in Solr includes a number of functions that can be used to model a time series.
+These functions include linear regression, polynomial and harmonic curve fitting, loess regression, and KNN regression.
+
+Each of these functions can model a time series and be used for
+interpolation (predicting values within the data set) and several
+can be used for extrapolation (predicting values beyond the data set).
+
+The various regression functions are covered in detail in the Linear Regression, Curve
+Fitting and Machine Learning sections of the user guide.
+
+The example below uses the `polyfit` function (polynomial regression) to
+fit a non-linear model to a time series. The data set being used is the
+monthly average closing price for Amazon over an eight-year period.
+
+In this example the `polyfit` function returns a fitted model for the *y*
+axis, which holds the average monthly closing prices, using a 4th degree polynomial.
+The degree of the polynomial determines the number of curves in the
+model. The fitted model is set to the variable `y1`. The fitted model
+is then directly plotted with `zplot` along with the original `y`
+values.
+
+The visualization shows the smooth line fit through the average closing
+price data.
+
+image::images/math-expressions/timemodel.png[]
+
+
+== Forecasting
+
+The `polyfit` function can also be used to extrapolate a time series to forecast
+future stock prices. The example below demonstrates a 10-month forecast.
+
+In the example the `polyfit` function fits a model to the y-axis and the model
+is set to the variable *`m`*.
+Then to create a forecast, 10 zeros are appended
+to the y-axis to create a new vector called `y10`.
+Then a new x-axis is created using
+the `natural` function, which returns a sequence of whole numbers from 0 up to the length of `y10`.
+The new x-axis is stored in the variable `x10`.
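+
+Putting these steps together, a minimal sketch of the forecasting expression is shown below. The
+`predict` and `zplot` steps are explained next. The `addAll` and `zeros` helpers are assumed here
+for concatenating the vectors and creating the zero vector, and `polyfit` is assumed to accept an
+explicit x-axis and degree; the exact signatures may differ:
+
+[source,text]
+----
+let(a=timeseries(stocks, q=ticker_s:amzn, field="date_dt",
+                 start="2010-01-01T00:00:00Z", end="2017-11-01T00:00:00Z",
+                 gap="+1MONTH", format="YYYY-MM", avg(close_d)),
+    y=col(a, avg(close_d)),
+    x=natural(length(y)),
+    m=polyfit(x, y, 4),
+    y10=addAll(y, zeros(10)),
+    x10=natural(length(y10)),
+    p=predict(m, x10),
+    zplot(x=x10, y=y10, forecast=p))
+----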
+
+The `predict` function uses the fitted model to predict values for the new x-axis stored in
+variable `x10`.
+
+The `zplot` function is then used to plot the `x10` vector on the x-axis and the `y10` vector and the extrapolated
+model on the y-axis. Notice that the `y10` vector drops to zero where the observed data
+ends, but the forecast continues along the fitted curve
+of the model.
+
+image::images/math-expressions/forecast.png[]
diff --git a/solr/solr-ref-guide/src/transform.adoc b/solr/solr-ref-guide/src/transform.adoc
new file mode 100644
index 00000000000..61e5e063e92
--- /dev/null
+++ b/solr/solr-ref-guide/src/transform.adoc
@@ -0,0 +1,129 @@
+= Transforming Data
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+
+Streaming expressions provide a powerful set of functions for transforming result sets.
+This section of the user guide provides an overview of useful transformations applied to result sets.
+
+== Selecting and Adding Fields
+
+The `select` function wraps another streaming expression and can perform the following operations on each tuple
+in the stream:
+
+* *Select* a subset of fields
+* *Map* fields to new names
+* *Compute* new fields
+
+Below is an example showing the `select` function wrapping a `search` function
+and mapping fields to new field names. The `recNum` function is a math expression
+which simply returns the current record number of the tuple. The `select` expression can call
+any math expression to compute new values.
+
+image::images/math-expressions/select1.png[]
+
+Below is an example using the `div` function to compute a new field
+from two existing fields:
+
+image::images/math-expressions/select-math.png[]
+
+
+== Filtering Tuples
+
+The `having` function can be used to filter tuples in the stream based on
+boolean logic.
+
+In the example below the `having` function is filtering the output of the
+`facet` function to only emit tuples that have `count(*)` greater than 20404.
+
+image::images/math-expressions/having.png[]
+
+
+== Paging
+
+The *record number*, added with the `recNum` function,
+can be filtered on to support paging.
+
+In the example below the `and` function with nested `lt` and `gt` functions is
+used to select records within a specific record number range:
+
+image::images/math-expressions/search-page.png[]
+
+
+== Handling Nulls
+
+The `notNull` and `isNull` functions can be used to either replace null values with different values,
+or to filter out tuples with null values.
+
+The example below uses the `isNull` function inside of the `select` function
+to replace null values with -1. The `if` function takes 3 parameters. The first
+is a boolean expression, in this case `isNull`. The `if` function returns
+the second parameter if the boolean function returns true, and the third
+parameter if it returns false. In this case `isNull` is always true because it's
+checking for a field in the tuples that is not included in the result set.
+
+image::images/math-expressions/select2.png[]
+
+`notNull` and `isNull` can also be used with the `having` function to filter out
+tuples with null values.
+
+The example below emits all the documents because it is evaluating `isNull` for
+a field that is not in the result set, which always returns true.
+
+image::images/math-expressions/having2.png[]
+
+The example below emits zero documents because it is evaluating `notNull` for
+a field that is not in the result set, which always returns false.
+
+image::images/math-expressions/having3.png[]
+
+== Regex Matching and Filtering
+
+The `matches` function can be used inside of a `having` function
+to test if a field in the record matches a specific
+regular expression. This allows for sophisticated regex matching over search results.
+
+The example below uses the `matches` function to return all records where
+the `complaint_type_s` field ends with *Commercial*.
+
+image::images/math-expressions/search-matches.png[]
+
+== Sorting
+
+The `sort` and `top` functions can be used to resort a result set in memory. The `sort` function
+sorts and returns the entire result set based on the sort criteria. The `top` function
+can be used to return the top N values in a result set based on the sort criteria.
+
+image::images/math-expressions/search-resort.png[]
+
+== Rollups
+
+The `rollup` and `hashRollup` functions can be used to perform aggregations over result sets. This
+is different than the `facet`, `facet2D` and `timeseries` aggregation functions, which push the aggregations
+into the search engine using the JSON facet API.
+
+The `rollup` function performs map-reduce style rollups, which requires that the result stream be sorted by
+the grouping fields. This allows for aggregations over very high cardinality fields. The `hashRollup` function
+performs rollups keeping all buckets in an in-memory hashmap. This requires enough memory to store all the
+distinct group by fields in memory, but does not require that the underlying stream be sorted.
+
+The example below shows a visualization of the top 5 complaint types
+from a random sample of the `nyc311` complaint database. The `top`
+function is used to select the top 5 complaint types based on
+the `count(*)` field output by the `hashRollup`.
+
+image::images/math-expressions/hashRollup.png[]
diff --git a/solr/solr-ref-guide/src/updating-parts-of-documents.adoc b/solr/solr-ref-guide/src/updating-parts-of-documents.adoc
index d8175a29134..b74a8eeb29d 100644
--- a/solr/solr-ref-guide/src/updating-parts-of-documents.adoc
+++ b/solr/solr-ref-guide/src/updating-parts-of-documents.adoc
@@ -437,7 +437,7 @@ The basic usage of `DocBasedVersionConstraintsProcessorFactory` is to configure

----

-Note that `versionField` is a comma delimited list of fields to check for version numbers.
+Note that `versionField` is a comma-delimited list of fields to check for version numbers.
Once configured, this update processor will reject (HTTP error code 409) any attempt to update an existing document where the value of the `my_version_l` field in the "new" document is not greater then the value of that field in the existing document.
.versionField vs `\_version_` @@ -458,7 +458,7 @@ The value of this option should be the name of a request parameter that the proc + When using this request parameter, any Delete By Id command with a high enough document version number to succeed will be internally converted into an Add Document command that replaces the existing document with a new one which is empty except for the Unique Key and `versionField` to keeping a record of the deleted version so future Add Document commands will fail if their "new" version is not high enough. + -If `versionField` is specified as a list, then this parameter too must be specified as a comma delimited list of the same size so that the parameters correspond with the fields. +If `versionField` is specified as a list, then this parameter too must be specified as a comma-delimited list of the same size so that the parameters correspond with the fields. `supportMissingVersionOnOldDocs`:: This boolean parameter defaults to `false`, but if set to `true` allows any documents written *before* this feature is enabled, and which are missing the `versionField`, to be overwritten. diff --git a/solr/solr-ref-guide/src/uploading-data-with-index-handlers.adoc b/solr/solr-ref-guide/src/uploading-data-with-index-handlers.adoc index b868519708d..920240044f2 100644 --- a/solr/solr-ref-guide/src/uploading-data-with-index-handlers.adoc +++ b/solr/solr-ref-guide/src/uploading-data-with-index-handlers.adoc @@ -521,7 +521,7 @@ Example: `rowidOffset=10` The same feature used to index CSV documents can also be easily used to index tab-delimited files (TSV files) and even handle backslash escaping rather than CSV encapsulation. -For example, one can dump a MySQL table to a tab delimited file with: +For example, one can dump a MySQL table to a tab-delimited file with: [source,sql] ---- diff --git a/solr/solr-ref-guide/src/variables.adoc b/solr/solr-ref-guide/src/variables.adoc index 2c1f905d23a..6cc2bd996d8 100644 --- a/solr/solr-ref-guide/src/variables.adoc +++ b/solr/solr-ref-guide/src/variables.adoc @@ -17,13 +17,16 @@ // under the License. +This section of the user guide describes how to assign and visualize +variables with math expressions. + == The Let Expression The `let` expression sets variables and returns the value of the last variable by default. The output of any streaming expression or math expression can be set to a variable. -Below is a simple example setting three variables *`a`*, *`b`* -and *`c`*. Variables *`a`* and *`b`* are set to arrays. The variable *`c`* is set +Below is a simple example setting three variables `a`, `b`, +and `c`. Variables `a` and `b` are set to arrays. The variable `c` is set to the output of the `ebeAdd` function which performs element-by-element addition of the two arrays. @@ -34,7 +37,7 @@ let(a=array(1, 2, 3), c=ebeAdd(a, b)) ---- -In the response, notice that the last variable, *`c`*, is returned: +In the response, notice that the last variable, `c`, is returned: [source,json] ---- @@ -103,7 +106,7 @@ responds with: } ---- -A specific set of variables can be echoed by providing a comma delimited list of variables to the echo parameter. +A specific set of variables can be echoed by providing a comma-delimited list of variables to the `echo` parameter. Because variables have been provided, the `true` value is assumed. 
[source,text]
@@ -142,6 +145,60 @@ When this expression is sent to the `/stream` handler it responds with:
}
----

+== Visualizing Variables
+
+The `let` expression can also include a `zplot` expression that can be used to visualize the
+variables.
+
+In the example below the variables `a` and `b` are set to arrays. The `zplot` function
+outputs the variables as `x` and `y` fields in the output.
+
+[source,text]
+----
+let(a=array(1, 2, 3),
+    b=array(10, 20, 30),
+    zplot(x=a, y=b))
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "x": 1,
+        "y": 10
+      },
+      {
+        "x": 2,
+        "y": 20
+      },
+      {
+        "x": 3,
+        "y": 30
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+Using this approach variables can be visualized using Zeppelin-Solr. In the example below
+the arrays are shown in table format.
+
+image::images/math-expressions/variables.png[]
+
+Once in table format we can plot the variables using one of the plotting or charting
+visualizations. The example below shows variables plotted on a line chart:
+
+image::images/math-expressions/variables1.png[]
+
+
== Caching Variables

Variables can be cached in-memory on the Solr node where the math expression
@@ -151,10 +208,10 @@ be cached in-memory for future use.

The `putCache` function adds a variable to the cache.

-In the example below an array is cached in the `workspace` "workspace1"
-and bound to the `key` "key1". The workspace allows different users to cache
-objects in their own workspace. The `putCache` function returns
-the variable that was added to the cache.
+In the example below an array is cached in the workspace `workspace1`
+and bound to the key `key1`.
+The workspace allows different users to cache objects in their own workspace.
+The `putCache` function returns the variable that was added to the cache.

[source,text]
----
@@ -189,7 +246,7 @@ When this expression is sent to the `/stream` handler it responds with:

The `getCache` function retrieves an object from the cache by its workspace and key.

-In the example below the `getCache` function retrieves the array that was cached above and assigns it to variable *`a`*.
+In the example below the `getCache` function retrieves the array that was cached above and assigns it to variable `a`.

[source,text]
----
@@ -228,8 +285,7 @@ In the example below `listCache` returns all the workspaces in the cache as an a
let(a=listCache())
----

-When this expression is sent to the `/stream` handler it
-responds with:
+When this expression is sent to the `/stream` handler it responds with:

[source,json]
----
diff --git a/solr/solr-ref-guide/src/vector-math.adoc b/solr/solr-ref-guide/src/vector-math.adoc
index f820ac1d7c6..2c3ee663bc4 100644
--- a/solr/solr-ref-guide/src/vector-math.adoc
+++ b/solr/solr-ref-guide/src/vector-math.adoc
@@ -52,6 +52,105 @@ When this expression is sent to the `/stream` handler it responds with a JSON ar
}
----

+== Visualization
+
+The `zplot` function can be used to visualize vectors using Zeppelin-Solr.
+
+Let's first see what happens when we visualize the output of the `array` function as a table.
+
+image::images/math-expressions/array.png[]
+
+It appears as one row with a comma-delimited list of values. You'll find that you can't visualize this output
+using any of the plotting tools.
+
+To plot the array you need the `zplot` function. Let's first look at what the `zplot` output looks like in JSON format.
+
+[source,text]
+----
+zplot(x=array(1, 2, 3))
+----
+
+When this expression is sent to the `/stream` handler it responds with a JSON array:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "x": 1
+      },
+      {
+        "x": 2
+      },
+      {
+        "x": 3
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+`zplot` has turned the array into three tuples with the field `x`.
+
+Let's add another array:
+
+[source,text]
+----
+zplot(x=array(1, 2, 3), y=array(10, 20, 30))
+----
+
+When this expression is sent to the `/stream` handler it responds with a JSON array:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "x": 1,
+        "y": 10
+      },
+      {
+        "x": 2,
+        "y": 20
+      },
+      {
+        "x": 3,
+        "y": 30
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+Now we have three tuples with `x` and `y` fields.
+
+Let's see how Zeppelin-Solr handles this output in table format:
+
+image::images/math-expressions/xy.png[]
+
+Now that we have `x` and `y` columns defined we can simply switch to one of the line charts
+and plug in the fields to plot using the chart settings:
+
+image::images/math-expressions/line1.png[]
+
+Each chart has settings which can be explored by clicking on *settings*.
+
+You can switch between chart types for different types of visualizations. Below is an example of
+a bar chart:
+
+image::images/math-expressions/bar.png[]
+
+
== Array Operations

Arrays can be passed as parameters to functions that operate on arrays.
@@ -178,6 +277,115 @@ When this expression is sent to the `/stream` handler it responds with:
}
----

+== Getting Values By Index
+
+Values from a vector can be retrieved by index with the `valueAt` function.
+
+[source,text]
+----
+valueAt(array(0,1,2,3,4,5,6), 2)
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": 2
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
+== Sequences
+
+The `sequence` function can be used to generate a sequence of numbers as an array.
+The example below returns a sequence of 10 numbers, starting from 0, with a stride of 2.
+
+[source,text]
+----
+sequence(10, 0, 2)
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": [
+          0,
+          2,
+          4,
+          6,
+          8,
+          10,
+          12,
+          14,
+          16,
+          18
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 7
+      }
+    ]
+  }
+}
+----
+
+The `natural` function can be used to create a sequence of *natural* numbers starting from zero.
+Natural numbers here are the non-negative integers.
+
+The example below creates a sequence starting at zero with all natural numbers up to, but not including,
+10.
+
+[source,text]
+----
+natural(10)
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "return-value": [
+          0,
+          1,
+          2,
+          3,
+          4,
+          5,
+          6,
+          7,
+          8,
+          9
+        ]
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 0
+      }
+    ]
+  }
+}
+----
+
== Vector Sorting

An array can be sorted in natural ascending order with the `asc` function.
@@ -278,8 +486,8 @@ When this expression is sent to the `/stream` handler it responds with:

== Scalar Vector Math

-Scalar vector math functions add, subtract, multiply or divide a scalar value with every value in a vector.
-The following functions perform these operations: `scalarAdd`, `scalarSubtract`, `scalarMultiply` +Scalar vector math functions add, subtract, multiply, or divide a scalar value with every value in a vector. +The following functions perform these operations: `scalarAdd`, `scalarSubtract`, `scalarMultiply`, and `scalarDivide`. Below is an example of the `scalarMultiply` function, which multiplies the scalar value `3` with diff --git a/solr/solr-ref-guide/src/vectorization.adoc b/solr/solr-ref-guide/src/vectorization.adoc deleted file mode 100644 index 26a6f602738..00000000000 --- a/solr/solr-ref-guide/src/vectorization.adoc +++ /dev/null @@ -1,383 +0,0 @@ -= Streams and Vectorization -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -This section of the user guide explores techniques -for retrieving streams of data from Solr and vectorizing the -numeric fields. - -See the section <> which describes how to -vectorize text fields. - -== Streams - -Streaming Expressions has a wide range of stream sources that can be used to -retrieve data from SolrCloud collections. Math expressions can be used -to vectorize and analyze the results sets. - -Below are some of the key stream sources: - -* *`facet`*: Multi-dimensional aggregations are a powerful tool for generating -co-occurrence counts for categorical data. The `facet` function uses the JSON facet API -under the covers to provide fast, distributed, multi-dimension aggregations. With math expressions -the aggregated results can be pivoted into a co-occurance matrix which can be mined for -correlations and hidden similarities within the data. - -* *`random`*: Random sampling is widely used in statistics, probability and machine learning. -The `random` function returns a random sample of search results that match a -query. The random samples can be vectorized and operated on by math expressions and the results -can be used to describe and make inferences about the entire population. - -* *`timeseries`*: The `timeseries` -expression provides fast distributed time series aggregations, which can be -vectorized and analyzed with math expressions. - -* *`knnSearch`*: K-nearest neighbor is a core machine learning algorithm. The `knnSearch` -function is a specialized knn algorithm optimized to find the k-nearest neighbors of a document in -a distributed index. Once the nearest neighbors are retrieved they can be vectorized -and operated on by machine learning and text mining algorithms. - -* *`sql`*: SQL is the primary query language used by data scientists. The `sql` function supports -data retrieval using a subset of SQL which includes both full text search and -fast distributed aggregations. The result sets can then be vectorized and operated -on by math expressions. 
- -* *`jdbc`*: The `jdbc` function allows data from any JDBC compliant data source to be combined with -streams originating from Solr. Result sets from outside data sources can be vectorized and operated -on by math expressions in the same manner as result sets originating from Solr. - -* *`topic`*: Messaging is an important foundational technology for large scale computing. The `topic` -function provides publish/subscribe messaging capabilities by treating -SolrCloud as a distributed message queue. Topics are extremely powerful -because they allow subscription by query. Topics can be use to support a broad set of -use cases including bulk text mining operations and AI alerting. - -* *`nodes`*: Graph queries are frequently used by recommendation engines and are an important -machine learning tool. The `nodes` function provides fast, distributed, breadth -first graph traversal over documents in a SolrCloud collection. The node sets collected -by the `nodes` function can be operated on by statistical and machine learning expressions to -gain more insight into the graph. - -* *`search`*: Ranked search results are a powerful tool for finding the most relevant -documents from a large document corpus. The `search` expression -returns the top N ranked search results that match any -Solr query, including geo-spatial queries. The smaller set of relevant -documents can then be explored with statistical, machine learning and -text mining expressions to gather insights about the data set. - -== Assigning Streams to Variables - -The output of any streaming expression can be set to a variable. -Below is a very simple example using the `random` function to fetch -three random samples from collection1. The random samples are returned -as tuples which contain name/value pairs. - - -[source,text] ----- -let(a=random(collection1, q="*:*", rows="3", fl="price_f")) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "a": [ - { - "price_f": 0.7927976 - }, - { - "price_f": 0.060795486 - }, - { - "price_f": 0.55128294 - } - ] - }, - { - "EOF": true, - "RESPONSE_TIME": 11 - } - ] - } -} ----- - -== Creating a Vector with the col Function - -The `col` function iterates over a list of tuples and copies the values -from a specific column into an array. - -The output of the `col` function is an numeric array that can be set to a -variable and operated on by math expressions. - -Below is an example of the `col` function: - -[source,text] ----- -let(a=random(collection1, q="*:*", rows="3", fl="price_f"), - b=col(a, price_f)) ----- - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "b": [ - 0.42105234, - 0.85237443, - 0.7566981 - ] - }, - { - "EOF": true, - "RESPONSE_TIME": 9 - } - ] - } -} ----- - -== Applying Math Expressions to the Vector - -Once a vector has been created any math expression that operates on vectors -can be applied. In the example below the `mean` function is applied to -the vector assigned to variable *`b`*. - -[source,text] ----- -let(a=random(collection1, q="*:*", rows="15000", fl="price_f"), - b=col(a, price_f), - c=mean(b)) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "c": 0.5016035594638814 - }, - { - "EOF": true, - "RESPONSE_TIME": 306 - } - ] - } -} ----- - -== Creating Matrices - -Matrices can be created by vectorizing multiple numeric fields -and adding them to a matrix. 
The matrices can then be operated on by -any math expression that operates on matrices. - -[TIP] -==== -Note that this section deals with the creation of matrices -from numeric data. The section <> describes how to build TF-IDF term vector matrices from text fields. -==== - -Below is a simple example where four random samples are taken -from different sub-populations in the data. The `price_f` field of -each random sample is -vectorized and the vectors are added as rows to a matrix. -Then the `sumRows` -function is applied to the matrix to return a vector containing -the sum of each row. - -[source,text] ----- -let(a=random(collection1, q="market:A", rows="5000", fl="price_f"), - b=random(collection1, q="market:B", rows="5000", fl="price_f"), - c=random(collection1, q="market:C", rows="5000", fl="price_f"), - d=random(collection1, q="market:D", rows="5000", fl="price_f"), - e=col(a, price_f), - f=col(b, price_f), - g=col(c, price_f), - h=col(d, price_f), - i=matrix(e, f, g, h), - j=sumRows(i)) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "j": [ - 154390.1293375, - 167434.89453, - 159293.258493, - 149773.42769, - ] - }, - { - "EOF": true, - "RESPONSE_TIME": 9 - } - ] - } -} ----- - -== Facet Co-occurrence Matrices - -The `facet` function can be used to quickly perform multi-dimension aggregations of categorical data from -records stored in a SolrCloud collection. These multi-dimension aggregations can represent co-occurrence -counts for the values in the dimensions. The `pivot` function can be used to move two dimensional -aggregations into a co-occurrence matrix. The co-occurrence matrix can then be clustered or analyzed for -correlations to learn about the hidden connections within the data. - -In the example below the `facet` expression is used to generate a two dimensional faceted aggregation. -The first dimension is the US State that a car was purchased in and the second dimension is the car model. -This two dimensional facet generates the co-occurrence counts for the number of times a particular car model -was purchased in a particular state. - - -[source,text] ----- -facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows=5, count(*)) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "state": "NY", - "model": "camry", - "count(*)": 13342 - }, - { - "state": "NJ", - "model": "accord", - "count(*)": 13002 - }, - { - "state": "NY", - "model": "civic", - "count(*)": 12901 - }, - { - "state": "CA", - "model": "focus", - "count(*)": 12892 - }, - { - "state": "TX", - "model": "f150", - "count(*)": 12871 - }, - { - "EOF": true, - "RESPONSE_TIME": 171 - } - ] - } -} ----- - -The `pivot` function can be used to move the facet results into a co-occurrence matrix. In the example below -The `pivot` function is used to create a matrix where the rows of the matrix are the US States (state) and the -columns of the matrix are the car models (model). The values in the matrix are the co-occurrence counts (count(*)) - from the facet results. Once the co-occurrence matrix has been created the US States can be clustered -by car model, or the matrix can be transposed and car models can be clustered by the US States -where they were bought. 
- -[source,text] ----- -let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)), - b=pivot(a, state, model, count(*)), - c=kmeans(b, 7)) ----- - -== Latitude / Longitude Vectors - -The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into -a matrix of lat/long vectors. Each row in the matrix is a vector that contains the lat/long -pair for the corresponding tuple in the list. The row labels for the matrix are -automatically set to the `id` field in the tuples. The lat/lon matrix can then be operated -on by distance-based machine learning functions using the `haversineMeters` distance measure. - -The `latlonVectors` function takes two parameters: a list of tuples and a named parameter called -`field`, which tells the `latlonVectors` function which field to parse the lat/lon -vectors from. - -Below is an example of the `latlonVectors`. - -[source,text] ----- -let(a=random(collection1, q="*:*", fl="id, loc_p", rows="5"), - b=latlonVectors(a, field="loc_p")) ----- - -When this expression is sent to the `/stream` handler it responds with: - -[source,json] ----- -{ - "result-set": { - "docs": [ - { - "b": [ - [ - 42.87183530723629, - 76.74102353397778 - ], - [ - 42.91372904094898, - 76.72874889228416 - ], - [ - 42.911528804897564, - 76.70537292977619 - ], - [ - 42.91143870500213, - 76.74749913047408 - ], - [ - 42.904666267479705, - 76.73933236046092 - ] - ] - }, - { - "EOF": true, - "RESPONSE_TIME": 21 - } - ] - } -} ----- diff --git a/solr/solr-ref-guide/src/visualization.adoc b/solr/solr-ref-guide/src/visualization.adoc new file mode 100644 index 00000000000..1cc02ea6539 --- /dev/null +++ b/solr/solr-ref-guide/src/visualization.adoc @@ -0,0 +1,146 @@ += Visualization +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ + +== Gallery + +=== Result Tables + +image::images/math-expressions/search-sort.png[] + +=== Random Sampling & Scatter Plots + +image::images/math-expressions/bivariate.png[] + +=== Correlation Matrices and Heat Maps + +image::images/math-expressions/corrmatrix.png[] + +=== Visualizing CSV Files + +image::images/math-expressions/csv.png[] + +=== Probability Distributions + +image::images/math-expressions/dist.png[] + +=== Histograms + +image::images/math-expressions/cumProb.png[] + +=== Frequency Tables + +image::images/math-expressions/freqTable1.png[] + +=== Quantile Plots + +image::images/math-expressions/quantiles.png[] + +=== Time Series Aggregation + +image::images/math-expressions/timeseries1.png[] + +=== Time Series With Moving Average + +image::images/math-expressions/movingavg.png[] + +=== Time Series Forecast + +image::images/math-expressions/forecast.png[] + +=== Multiple Time Lines + +image::images/math-expressions/timecompare.png[] + +=== Linear Regression + +image::images/math-expressions/linear.png[] + +=== Knn Regression + +image::images/math-expressions/knnRegress.png[] + +=== Multivariate, Non-linear Regression with Residual Plot + +image::images/math-expressions/redwine1.png[] + +=== Monte Carlo Simulations + +image::images/math-expressions/randomwalk5.png[] + +=== Random Walks + +image::images/math-expressions/randomwalk6.png[] + +=== Distance Matrices + +image::images/math-expressions/distance.png[] + +=== DBSCAN Clustering + +image::images/math-expressions/dbscan1.png[] + +=== KMeans Clustering + +image::images/math-expressions/2DCluster1.png[] + +=== Mapping Cluster Centroids + +image::images/math-expressions/centroidplot.png[] + +=== Convex Hulls + +image::images/math-expressions/convex2.png[] + +=== Significant Terms + +image::images/math-expressions/sterms.png[] + +=== Phrase Aggregation + +image::images/math-expressions/text-analytics.png[] + +=== Curve Fitting + +image::images/math-expressions/hfit.png[] + +=== Interpolation + +image::images/math-expressions/interpolate1.png[] + +=== Derivatives + +image::images/math-expressions/sined.png[] + +=== Integrals + +image::images/math-expressions/integral.png[] + +=== Convolutional Smoothing + +image::images/math-expressions/conv-smooth.png[] + +=== Autocorrelation + +image::images/math-expressions/noise-autocorrelation.png[] + +=== Fourier Transform + +image::images/math-expressions/signal-fft.png[] + +