mirror of https://github.com/apache/lucene.git
SOLR-12701: Remove statistical-programming page which is superceded by the new math-expressions paged
This commit is contained in:
parent
7a3f837a32
commit
08a6d13c92
|
@ -44,18 +44,18 @@ record in your Solr Cloud cluster computable.
|
||||||
|
|
||||||
*<<statistics.adoc#statistics,Statistics>>*: Statistical functions in math expressions.
|
*<<statistics.adoc#statistics,Statistics>>*: Statistical functions in math expressions.
|
||||||
|
|
||||||
*<<probability-distributions.adoc#probability-distributions,Probability>>*: A probability distribution framework.
|
*<<probability-distributions.adoc#probability-distributions,Probability>>*: Mathematical models for probability.
|
||||||
|
|
||||||
*<<simulations.adoc#simulations,Monte Carlo Simulations>>*: Performing correlated and uncorrelated Monte Carlo simulations.
|
*<<simulations.adoc#simulations,Monte Carlo Simulations>>*: Performing correlated and uncorrelated Monte Carlo simulations.
|
||||||
|
|
||||||
*<<time-series.adoc#time-series,Time Series>>*: Aggregation, smoothing, and differencing of time series.
|
|
||||||
|
|
||||||
*<<regression.adoc#regression,Linear Regression>>*: Simple and multivariate linear regression.
|
*<<regression.adoc#regression,Linear Regression>>*: Simple and multivariate linear regression.
|
||||||
|
|
||||||
*<<numerical-analysis.adoc#numerical-analysis,Interpolation, Derivatives and Integrals>>*: Numerical analysis math expressions.
|
*<<numerical-analysis.adoc#numerical-analysis,Interpolation, Derivatives and Integrals>>*: Numerical analysis math expressions.
|
||||||
|
|
||||||
*<<curve-fitting.adoc#curve-fitting,Curve Fitting>>*: Polynomial curve fitting.
|
*<<curve-fitting.adoc#curve-fitting,Curve Fitting>>*: Polynomial and Gaussian curve fitting.
|
||||||
|
|
||||||
*<<dsp.adoc#dsp,Digital Signal Processing>>*: Functions commonly used with digital signal processing.
|
*<<dsp.adoc#dsp,Digital Signal Processing>>*: Functions commonly used with digital signal processing.
|
||||||
|
|
||||||
|
*<<time-series.adoc#time-series,Time Series>>*: Aggregation, smoothing, differencing, modeling and anomaly detection for time series.
|
||||||
|
|
||||||
*<<machine-learning.adoc#machine-learning,Machine Learning>>*: Functions used in machine learning.
|
*<<machine-learning.adoc#machine-learning,Machine Learning>>*: Functions used in machine learning.
|
||||||
|
|
|
@ -1,741 +0,0 @@
|
||||||
= Statistical Programming
|
|
||||||
// Licensed to the Apache Software Foundation (ASF) under one
|
|
||||||
// or more contributor license agreements. See the NOTICE file
|
|
||||||
// distributed with this work for additional information
|
|
||||||
// regarding copyright ownership. The ASF licenses this file
|
|
||||||
// to you under the Apache License, Version 2.0 (the
|
|
||||||
// "License"); you may not use this file except in compliance
|
|
||||||
// with the License. You may obtain a copy of the License at
|
|
||||||
//
|
|
||||||
// http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
//
|
|
||||||
// Unless required by applicable law or agreed to in writing,
|
|
||||||
// software distributed under the License is distributed on an
|
|
||||||
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
||||||
// KIND, either express or implied. See the License for the
|
|
||||||
// specific language governing permissions and limitations
|
|
||||||
// under the License.
|
|
||||||
|
|
||||||
The Streaming Expression language includes a powerful statistical programing syntax with many of the features of a functional programming language.
|
|
||||||
|
|
||||||
The syntax includes variables, data structures and a growing set of mathematical functions.
|
|
||||||
|
|
||||||
Using the statistical programing syntax, Solr's powerful data retrieval
|
|
||||||
capabilities can be combined with in-depth statistical analysis.
|
|
||||||
|
|
||||||
The data retrieval methods include:
|
|
||||||
|
|
||||||
* SQL
|
|
||||||
* time series aggregation
|
|
||||||
* random sampling
|
|
||||||
* faceted aggregation
|
|
||||||
* K-Nearest Neighbor (KNN) searches
|
|
||||||
* `topic` message queues
|
|
||||||
* MapReduce (parallel relational algebra)
|
|
||||||
* JDBC calls to outside databases
|
|
||||||
* Graph Expressions
|
|
||||||
|
|
||||||
Once the data is retrieved, the statistical programming syntax can be used to create arrays from the data so it
|
|
||||||
can be manipulated, transformed and analyzed.
|
|
||||||
|
|
||||||
The statistical function library includes functions that perform:
|
|
||||||
|
|
||||||
* Correlation
|
|
||||||
* Cross-correlation
|
|
||||||
* Covariance
|
|
||||||
* Moving averages
|
|
||||||
* Percentiles
|
|
||||||
* Simple regression and prediction
|
|
||||||
* Analysis of covariance (ANOVA)
|
|
||||||
* Histograms
|
|
||||||
* Convolution
|
|
||||||
* Euclidean distance
|
|
||||||
* Descriptive statistics
|
|
||||||
* Rank transformation
|
|
||||||
* Normalization transformation
|
|
||||||
* Sequences
|
|
||||||
* Array manipulation functions (creation, copying, length, scaling, reverse, etc.)
|
|
||||||
|
|
||||||
The statistical function library is backed by https://commons.apache.org/proper/commons-math/[Apache Commons Math library]. A full discussion of many of the math functions available to streaming expressions is available in the section <<stream-evaluator-reference.adoc#stream-evaluator-reference,Stream Evaluator Reference>>.
|
|
||||||
|
|
||||||
This document provides an overview of the how to apply the variables, data structures and mathematical functions.
|
|
||||||
|
|
||||||
NOTE: Like all streaming expressions, the statistical functions are run by Solr's `/stream` handler. For an overview of this handler, see the section <<streaming-expressions.adoc#streaming-expressions,Streaming Expressions>>.
|
|
||||||
|
|
||||||
== Math Functions
|
|
||||||
|
|
||||||
Streaming expressions contain a suite of mathematical functions which can be called on their own or as part of a larger expression.
|
|
||||||
|
|
||||||
Solr's `/stream` handler evaluates the mathematical expression and returns a result.
|
|
||||||
|
|
||||||
For example, if you send the following expression to the `/stream` handler:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
add(1, 1)
|
|
||||||
----
|
|
||||||
|
|
||||||
You get the following response:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"return-value": 2
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 2
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
You can nest math functions within each other. For example:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
pow(10, add(1,1))
|
|
||||||
----
|
|
||||||
|
|
||||||
Returns the following response:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"return-value": 100
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 0
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
You can also perform math on a stream of Tuples. For example:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
select(search(collection2, q="*:*", fl="price_f", sort="price_f desc", rows="3"),
|
|
||||||
price_f,
|
|
||||||
mult(price_f, 10) as newPrice)
|
|
||||||
----
|
|
||||||
|
|
||||||
Returns the following response:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"price_f": 0.99999994,
|
|
||||||
"newPrice": 9.9999994
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"price_f": 0.99999994,
|
|
||||||
"newPrice": 9.9999994
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"price_f": 0.9999992,
|
|
||||||
"newPrice": 9.999992
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 3
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
== Data Structures
|
|
||||||
|
|
||||||
Several types of data can be manipulated with the statistical programming syntax. The following sections explore arrays, tuples and lists.
|
|
||||||
|
|
||||||
=== Creating Arrays
|
|
||||||
|
|
||||||
The first data structure we'll explore is the array.
|
|
||||||
|
|
||||||
We can create an array with the `array` function:
|
|
||||||
|
|
||||||
For example:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
array(1, 2, 3)
|
|
||||||
----
|
|
||||||
|
|
||||||
Returns the following response:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"return-value": [
|
|
||||||
1,
|
|
||||||
2,
|
|
||||||
3
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 0
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
We can nest arrays within a matrix function to return matrix:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
matrix(array(1, 2, 3),
|
|
||||||
array(4, 5, 6))
|
|
||||||
----
|
|
||||||
|
|
||||||
Returns the following response:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"return-value": [
|
|
||||||
[
|
|
||||||
1,
|
|
||||||
2,
|
|
||||||
3
|
|
||||||
],
|
|
||||||
[
|
|
||||||
4,
|
|
||||||
5,
|
|
||||||
6
|
|
||||||
]
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 0
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
We can manipulate arrays with functions. For example, we can reverse an array with the `rev` function:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
rev(array(1, 2, 3))
|
|
||||||
----
|
|
||||||
|
|
||||||
Returns the following response:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"return-value": [
|
|
||||||
3,
|
|
||||||
2,
|
|
||||||
1
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 0
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
Arrays can also be built and returned by functions. For example, the `sequence` function:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
sequence(5,0,1)
|
|
||||||
----
|
|
||||||
|
|
||||||
This returns an array of size `5` starting from `0` with a stride of `1`.
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"return-value": [
|
|
||||||
0,
|
|
||||||
1,
|
|
||||||
2,
|
|
||||||
3,
|
|
||||||
4
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 4
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
We can perform math on an array. For example, we can scale an array with the `scale` function:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
scale(10, sequence(5,0,1))
|
|
||||||
----
|
|
||||||
|
|
||||||
Returns the following response:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"return-value": [
|
|
||||||
0,
|
|
||||||
10,
|
|
||||||
20,
|
|
||||||
30,
|
|
||||||
40
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 0
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
We can perform statistical analysis on arrays For example, we can correlate two sequences with the `corr` function:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
corr(sequence(5,1,1), sequence(5,10,10))
|
|
||||||
----
|
|
||||||
|
|
||||||
Returns the following response:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"return-value": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 1
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
|
|
||||||
=== Tuples
|
|
||||||
|
|
||||||
The tuple is the next data structure we'll explore.
|
|
||||||
|
|
||||||
The `tuple` function returns a map of name/value pairs. A tuple is a very flexible data structure that can hold values that are strings, numerics, arrays and lists of tuples.
|
|
||||||
|
|
||||||
A tuple can be used to return a complex result from a statistical expression.
|
|
||||||
|
|
||||||
Here is an example:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
tuple(title="hello world",
|
|
||||||
array1=array(1,2,3,4),
|
|
||||||
array2=array(4,5,6,7))
|
|
||||||
|
|
||||||
Returns the following response:
|
|
||||||
|
|
||||||
----
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"title": "hello world",
|
|
||||||
"array1": [
|
|
||||||
1,
|
|
||||||
2,
|
|
||||||
3,
|
|
||||||
4
|
|
||||||
],
|
|
||||||
"array2": [
|
|
||||||
4,
|
|
||||||
5,
|
|
||||||
6,
|
|
||||||
7
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 0
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
=== Lists
|
|
||||||
|
|
||||||
Next we have the list data structure.
|
|
||||||
|
|
||||||
The `list` function is a data structure that wraps streaming expressions and emits all the tuples from the wrapped expressions as a single concatenated stream.
|
|
||||||
|
|
||||||
Below is an example of a list of tuples:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
list(tuple(id=1, data=array(1, 2, 3)),
|
|
||||||
tuple(id=2, data=array(10, 12, 14)))
|
|
||||||
----
|
|
||||||
|
|
||||||
Returns the following response:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"id": "1",
|
|
||||||
"data": [
|
|
||||||
1,
|
|
||||||
2,
|
|
||||||
3
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"id": "2",
|
|
||||||
"data": [
|
|
||||||
10,
|
|
||||||
12,
|
|
||||||
14
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 0
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
== Setting Variables with let
|
|
||||||
|
|
||||||
The `let` function sets variables and returns the last variable. The output of any statistical function can be set to a variable.
|
|
||||||
|
|
||||||
Below is a simple example setting three variables `a`, `b` and `correlation`.
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
let(a=array(1,2,3),
|
|
||||||
b=array(10, 20, 30),
|
|
||||||
correlation=corr(a, b))
|
|
||||||
----
|
|
||||||
|
|
||||||
Here is the output:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"correlation": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 0
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
All variables can be output by setting the `echo` variable to `true`.
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
let(echo=true,
|
|
||||||
a=array(1,2,3),
|
|
||||||
b=array(10, 20, 30),
|
|
||||||
correlation=corr(a, b))
|
|
||||||
----
|
|
||||||
|
|
||||||
Here is the output:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"a": [
|
|
||||||
1,
|
|
||||||
2,
|
|
||||||
3
|
|
||||||
],
|
|
||||||
"b": [
|
|
||||||
10,
|
|
||||||
20,
|
|
||||||
30
|
|
||||||
],
|
|
||||||
"correlation": 1
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 0
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
Streaming expressions can also be used inside of a `let` expression in the following ways:
|
|
||||||
|
|
||||||
* A variable can be set to the output of any streaming expression.
|
|
||||||
* A streaming expression can be executed after all variables have been set. The variables can then be referenced by the streaming expression that is executed. The `let` expression will stream the tuples that are emitted by the final streaming expression.
|
|
||||||
|
|
||||||
Here is a very simple example:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
let(a=random(collection2, q="*:*", rows="3", fl="price_f"),
|
|
||||||
b=random(collection2, q="*:*", rows="3", fl="price_f"),
|
|
||||||
tuple(sample1=a, sample2=b))
|
|
||||||
----
|
|
||||||
|
|
||||||
The `let` expression above is setting variables `a` and `b` to random
|
|
||||||
samples taken from collection2.
|
|
||||||
|
|
||||||
The `let` function then executes the `tuple` streaming expression
|
|
||||||
which references the two variables.
|
|
||||||
|
|
||||||
Here is the output:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"sample1": [
|
|
||||||
{
|
|
||||||
"price_f": 0.39729273
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"price_f": 0.063344836
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"price_f": 0.42020327
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"sample2": [
|
|
||||||
{
|
|
||||||
"price_f": 0.659244
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"price_f": 0.58797807
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"price_f": 0.57520163
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 20
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
== Creating Arrays with col Function
|
|
||||||
|
|
||||||
The `col` function is used to move a column of numbers from a list of tuples into an `array`.
|
|
||||||
|
|
||||||
This is an important function because streaming expressions such as `sql`, `random` and `timeseries` return tuples, but the statistical functions operate on arrays.
|
|
||||||
|
|
||||||
Below is an example of the `col` function:
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
let(a=random(collection2, q="*:*", rows="3", fl="price_f"),
|
|
||||||
b=random(collection2, q="*:*", rows="3", fl="price_f"),
|
|
||||||
c=col(a, price_f),
|
|
||||||
d=col(b, price_f),
|
|
||||||
tuple(sample1=c, sample2=d))
|
|
||||||
----
|
|
||||||
|
|
||||||
The example above is using the `col` function to create arrays from the tuples stored in variables `a` and `b`.
|
|
||||||
|
|
||||||
Variable `c` contains an array of values from the `price_f` field,
|
|
||||||
taken from the tuples stored in variable `a`.
|
|
||||||
|
|
||||||
Variable `d` contains an array of values from the `price_f` field,
|
|
||||||
taken from the tuples stored in variable `b`.
|
|
||||||
|
|
||||||
Also notice inn that the response `tuple` executed by `let` is pointing to the arrays in variables `c` and `d`.
|
|
||||||
|
|
||||||
The response shows the arrays:
|
|
||||||
|
|
||||||
[source,json]
|
|
||||||
----
|
|
||||||
|
|
||||||
{
|
|
||||||
"result-set": {
|
|
||||||
"docs": [
|
|
||||||
{
|
|
||||||
"sample1": [
|
|
||||||
0.06490427,
|
|
||||||
0.6751543,
|
|
||||||
0.07063508
|
|
||||||
],
|
|
||||||
"sample2": [
|
|
||||||
0.8884564,
|
|
||||||
0.8878821,
|
|
||||||
0.3504665
|
|
||||||
]
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"EOF": true,
|
|
||||||
"RESPONSE_TIME": 17
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
----
|
|
||||||
|
|
||||||
== Statistical Programming Example
|
|
||||||
|
|
||||||
We've covered how the data structures, variables and a few statistical functions work. Let's dive into an example that puts these tools to use.
|
|
||||||
|
|
||||||
=== Use Case
|
|
||||||
|
|
||||||
We have an existing hotel in *cityA* that is very profitable.
|
|
||||||
We are contemplating opening up a new hotel in a different city.
|
|
||||||
We're considering 4 different cities: *cityB*, *cityC*, *cityD*, *cityE*.
|
|
||||||
We'd like to open a hotel in a city that has similar room rates to cityA.
|
|
||||||
|
|
||||||
How do we determine which of the 4 cities we're considering has room rates which are most similar to cityA?
|
|
||||||
|
|
||||||
=== The Data
|
|
||||||
|
|
||||||
We have a data set of un-aggregated hotel bookings. Each booking record has a rate and city.
|
|
||||||
|
|
||||||
=== Can We Simply Aggregate?
|
|
||||||
|
|
||||||
One approach would be to aggregate the data from each city and compare the mean room rates. This approach will
|
|
||||||
give us some useful information, but the mean is a summary statistic which loses a significant amount of information
|
|
||||||
about the data. For example, we don't have an understanding of how the distribution of room rates is impacting the
|
|
||||||
mean.
|
|
||||||
|
|
||||||
The median room rate provides another interesting data point but it's still not the entire picture. It's sill just
|
|
||||||
one point of reference.
|
|
||||||
|
|
||||||
Is there a way that we can compare the markets without losing valuable information in the data?
|
|
||||||
|
|
||||||
==== K-Nearest Neighbor
|
|
||||||
|
|
||||||
The use case we're reasoning about can often be approached using a K-Nearest Neighbor (knn) algorithm.
|
|
||||||
|
|
||||||
With knn we use a distance measure to compare vectors of data to find the k nearest neighbors to
|
|
||||||
a specific vector.
|
|
||||||
|
|
||||||
==== Euclidean Distance
|
|
||||||
|
|
||||||
The streaming expression statistical function library has a function called `distance`. The `distance` function
|
|
||||||
computes the Euclidean distance between two vectors. This looks promising for comparing vectors of room rates.
|
|
||||||
|
|
||||||
==== Vectors
|
|
||||||
|
|
||||||
But how to create the vectors from a our data set? Remember we have un-aggregated room rates from each of the cities.
|
|
||||||
How can we vectorize the data so it can be compared using the `distance` function.
|
|
||||||
|
|
||||||
We have a streaming expression that can retrieve a random sample from each of the cities. The name of this
|
|
||||||
expression is `random`. So we could take a random sample of 1000 room rates from each of the five cities.
|
|
||||||
|
|
||||||
But random vectors of room rates are not comparable because the distance algorithm compares values at each index
|
|
||||||
in the vector. How can make these vectors comparable?
|
|
||||||
|
|
||||||
We can make them comparable by sorting them. Then as the distance algorithm moves along the vectors it will be
|
|
||||||
comparing room rates from lowest to highest in both cities.
|
|
||||||
|
|
||||||
=== The Code
|
|
||||||
|
|
||||||
[source,text]
|
|
||||||
----
|
|
||||||
let(cityA=sort(random(bookings, q="city:cityA", rows="1000", fl="rate_d"), by="rate_d asc"),
|
|
||||||
cityB=sort(random(bookings, q="city:cityB", rows="1000", fl="rate_d"), by="rate_d asc"),
|
|
||||||
cityC=sort(random(bookings, q="city:cityC", rows="1000", fl="rate_d"), by="rate_d asc"),
|
|
||||||
cityD=sort(random(bookings, q="city:cityD", rows="1000", fl="rate_d"), by="rate_d asc"),
|
|
||||||
cityE=sort(random(bookings, q="city:cityE", rows="1000", fl="rate_d"), by="rate_d asc"),
|
|
||||||
ratesA=col(cityA, rate_d),
|
|
||||||
ratesB=col(cityB, rate_d),
|
|
||||||
ratesC=col(cityC, rate_d),
|
|
||||||
ratesD=col(cityD, rate_d),
|
|
||||||
ratesE=col(cityE, rate_d),
|
|
||||||
top(n=1,
|
|
||||||
sort="distance asc",
|
|
||||||
list(tuple(city=B, distance=distance(ratesA, ratesB)),
|
|
||||||
tuple(city=C, distance=distance(ratesA, ratesC)),
|
|
||||||
tuple(city=D, distance=distance(ratesA, ratesD)),
|
|
||||||
tuple(city=E, distance=distance(ratesA, ratesE)))))
|
|
||||||
----
|
|
||||||
|
|
||||||
=== The Code Explained
|
|
||||||
|
|
||||||
The `let` expression sets variables first.
|
|
||||||
|
|
||||||
The first 5 variables (cityA, cityB, cityC, cityD, cityE), contain the random samples from the `bookings` collection.
|
|
||||||
The `random` function is pulling 1000 random samples from each city and including the `rate_d` field in the
|
|
||||||
tuples that are returned.
|
|
||||||
|
|
||||||
The `random` function is wrapped by a `sort` function which is sorting the tuples in
|
|
||||||
ascending order based on the `rate_d` field.
|
|
||||||
|
|
||||||
The next five variables (ratesA, ratesB, ratesC, ratesD, ratesE) contain the arrays of room rates for each
|
|
||||||
city. The `col` function is used to move the `rate_d` field from the random sample tuples
|
|
||||||
into an array for each city.
|
|
||||||
|
|
||||||
Now we have five sorted vectors of room rates that we can compare with our `distance` function.
|
|
||||||
|
|
||||||
After the variables are set the `let` expression runs the `top` expression.
|
|
||||||
|
|
||||||
The `top` expression is wrapping a `list` of `tuples`. Inside each tuple the `distance` function is used to compare
|
|
||||||
the rateA vector with one of the other cities. The output of the distance function is stored in the distance field
|
|
||||||
in the tuple.
|
|
||||||
|
|
||||||
The `list` function emits each `tuple` and the `top` function returns only the tuple with the lowest distance.
|
|
|
@ -1,5 +1,5 @@
|
||||||
= Streaming Expressions
|
= Streaming Expressions
|
||||||
:page-children: stream-source-reference, stream-decorator-reference, stream-evaluator-reference, statistical-programming, math-expressions, graph-traversal
|
:page-children: stream-source-reference, stream-decorator-reference, stream-evaluator-reference, math-expressions, graph-traversal
|
||||||
// Licensed to the Apache Software Foundation (ASF) under one
|
// Licensed to the Apache Software Foundation (ASF) under one
|
||||||
// or more contributor license agreements. See the NOTICE file
|
// or more contributor license agreements. See the NOTICE file
|
||||||
// distributed with this work for additional information
|
// distributed with this work for additional information
|
||||||
|
|
Loading…
Reference in New Issue