mirror of https://github.com/apache/lucene.git
SOLR-10651: fix docs ...
parent 3e2d72bf7b
commit 8ff75edd13
@@ -22,10 +22,10 @@ The Streaming Expression language includes a powerful statistical programing syn
 features of a functional programming language. The syntax includes *variables*, *data structures*
 and a growing set of *mathematical functions*.
 
-Using the statistical programing syntax Solr's powerful `data retrieval`
-capabilities can be combined with in-depth `statistical analysis`.
+Using the statistical programming syntax, Solr's powerful *data retrieval*
+capabilities can be combined with in-depth *statistical analysis*.
 
-The `data retrieval` methods include:
+The *data retrieval* methods include:
 
 * SQL
 * time series aggregation
@@ -58,18 +58,18 @@ The statistical function library includes functions that perform:
 * Sequences
 * Array manipulation functions (creation, copying, length, scaling, reverse etc...)
 
-The statistical function library is backed by `Apache Commons Math`.
+The statistical function library is backed by *Apache Commons Math*.
 
 This document provides an overview of how to apply the variables, data structures
 and mathematical functions.
 
 == /stream handler
 
-Like all Streaming Expressions the statistical functions can be run by Solr's /stream handler.
+Like all Streaming Expressions, the statistical functions can be run by Solr's /stream handler.
 
 == Math
 
-Streaming Expressions contain a suite of *mathematical* functions which can be called on
+Streaming Expressions contain a suite of *mathematical functions* which can be called on
 their own or as part of a larger expression.
 
 Solr's /stream handler evaluates the mathematical expression and returns a result.
@@ -241,7 +241,7 @@ Returns the following response:
 
 We can manipulate arrays with functions.
 
-For example we can reverse and array with `rev` function.
+For example we can reverse an array with the `rev` function:
 
 [source,text]
 ----
@@ -271,7 +271,7 @@ Returns the following response:
 }
 ----
 
-Functions can return arrays.
+Arrays can also be built and returned by functions.
 
 For example the sequence function:
 
@@ -313,7 +313,7 @@ Expression:
 
 [source,text]
 ----
-scale(sequence(5,0,1), 10)
+scale(10, sequence(5,0,1))
 ----
 
 Returns the following response:
@@ -341,7 +341,7 @@ Returns the following response:
 }
 ----
 
-We can perform `statistical analysis` on arrays.
+We can perform *statistical analysis* on arrays.
 
 For example we can correlate two sequences with the `corr` function:
 
@@ -421,10 +421,10 @@ Returns the following response:
 
 == List
 
-Next up, we have the *list* data structure.
+Next we have the *list* data structure.
 
-The `list` function is a data structure that wraps Streaming Expressions and emits their tuples as a single
-concatenated stream.
+The `list` function is a data structure that wraps Streaming Expressions and emits all the tuples from the wrapped
+expressions as a single concatenated stream.
 
 Below is an example of a list of tuples:
 
@@ -469,11 +469,10 @@ Returns the following response:
 
 == Let
 
-The `let` function sets variables and runs a streaming expression that references the variables.
+The `let` function sets *variables* and runs a Streaming Expression that references the variables. The `let` function can be used to
+write small statistical programs.
 
-Th output of any Streaming Expression can be stored in a variable.
-
-Lets see how `let` works.
+A *variable* can be set to the output of any Streaming Expression.
 
 Here is a very simple example:
 
@@ -532,8 +531,8 @@ Here is the output:
 
 == Col
 
-The `col` function is used to move a column of numbers from a list of tuples into an array.
-This is an important step because Streaming Expressions such as SQL, random and timeseries return tuples,
+The `col` function is used to move a column of numbers from a list of tuples into an `array`.
+This is an important function because Streaming Expressions such as `sql`, `random` and `timeseries` return tuples,
 but the statistical functions operate on arrays.
 
 Below is an example of the `col` function:
@@ -556,7 +555,7 @@ taken from the tuples stored in variable *a*.
 Variable *d* contains an array of values from the *price_f* field,
 taken from the tuples stored in variable *b*.
 
-Also notice that the response `tuple` is now pointing to the arrays in variables *c* and *d*.
+Also notice that the response `tuple` executed by `let` is pointing to the arrays in variables *c* and *d*.
 
 The response shows the arrays:
 
@@ -597,9 +596,9 @@ Let's dive into an example that puts these tools to use.
 We have an existing hotel in *cityA* that is very profitable.
 We are contemplating opening up a new hotel in a different city.
 We're considering 4 different cities: *cityB*, *cityC*, *cityD*, *cityE*.
-We'd like to open a hotel in a city that has a very similar room rate to *cityA*.
+We'd like to open a hotel in a city that has similar room rates to *cityA*.
 
-How do we determine which of the 4 cities we're considering has the most similar room rates to *cityA*
+How do we determine which of the 4 cities we're considering has room rates which are most similar to *cityA*?
 
 === The Data
 
@@ -609,17 +608,17 @@ We have a data set of un-aggregated hotel *bookings*. Each booking record has a
 
 One approach would be to aggregate the data from each city and compare the *mean* room rates. This approach will
 give us some useful information, but the mean is a summary statistic which loses a significant amount of information
-about the data. For example we don't have an understanding of how the distribution of room rates is effecting the
+about the data. For example we don't have an understanding of how the distribution of room rates is impacting the
 mean.
 
-The *median* room rate provides another interesting data point but it's still not the entire picture. It sill just
+The *median* room rate provides another interesting data point but it's still not the entire picture. It's still just
 one point of reference.
 
 Is there a way that we can compare the markets without losing valuable information in the data?
 
 === K Nearest Neighbor
 
-The use case we're thinking about can often be approached using a k Nearest Neighbor (knn) algorithm.
+The use case we're reasoning about can often be approached using a K Nearest Neighbor (knn) algorithm.
 
 With knn we use a *distance* measure to compare vectors of data to find the k nearest neighbors to
 a specific vector.
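As a rough illustration of the distance-based comparison described in the hunk above, the following is a minimal Python sketch rather than Solr code. The helper names `euclidean_distance` and `k_nearest` and the toy vectors are hypothetical and do not come from the commit; the sketch only assumes that vectors are equal-length lists of numbers.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(target, candidates, k=1):
    """Return the k (name, distance) pairs closest to target, nearest first."""
    scored = [(name, euclidean_distance(target, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical toy vectors, not data from the document.
target = [100.0, 120.0, 150.0]
candidates = {
    "x": [101.0, 119.0, 152.0],
    "y": [80.0, 90.0, 200.0],
}
nearest = k_nearest(target, candidates, k=1)
```

Because the distance compares values index by index, the result is only meaningful when position `i` carries the same meaning in every vector.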
@@ -631,17 +630,17 @@ computes the Euclidean distance between two vectors. This looks promising for co
 
 === Vectors
 
-But how to create the vectors from a our data set. Remember we have un-aggregated room rates from each of the cities.
+But how to create the vectors from our data set? Remember we have un-aggregated room rates from each of the cities.
 How can we vectorize the data so it can be compared using the `distance` function?
 
-We have a Streaming Expression that can take a *random sample* from each of the cities. The name of this
-expression is `random`. So we could take a random sample of 1000 room rates from each of the five markets.
+We have a Streaming Expression that can retrieve a *random sample* from each of the cities. The name of this
+expression is `random`. So we could take a random sample of 1000 room rates from each of the five cities.
 
 But random vectors of room rates are not comparable because the distance algorithm compares values at each index
 in the vector. How can we make these vectors comparable?
 
 We can make them comparable by *sorting* them. Then as the distance algorithm moves along the vectors it will be
-comparing room rates from lowest to highest in both cities.
+comparing room rates from lowest to highest in both cities.
 
 === The code
 
@@ -676,7 +675,7 @@ tuples that are returned.
 The `random` function is wrapped by a `sort` function which is sorting the tuples in
 ascending order based on the rate_d field.
 
-The next five variables (rates1, rates2, rates3, rates4, rates5) contain the arrays of room rates for each
+The next five variables (ratesA, ratesB, ratesC, ratesD, ratesE) contain the arrays of room rates for each
 city. The `col` function is used to move the `rate_d` field from the random sample tuples
 into an array for each city.
 
@@ -684,14 +683,8 @@ Now we have five sorted vectors of room rates that we can compare with our `dist
 
 After the variables are set the `let` expression runs the `top` expression.
 
-The `top` expression is wrapping a `list` of `tuples`. Inside each tuple the distance function is used to compare
+The `top` expression is wrapping a `list` of `tuples`. Inside each tuple the `distance` function is used to compare
 the rateA vector with one of the other cities. The output of the distance function is stored in the distance field
 in the tuple.
 
-The `list` function emits each `tuple` and the `top` function returns only the tuple with lowest distance.
-
-
-
-
-
-
+The `list` function emits each `tuple` and the `top` function returns only the tuple with the lowest distance.
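The workflow these hunks describe (random sample per city, sort ascending, compare with a distance function, keep the lowest distance) can be sketched outside Solr as follows. This is an illustrative Python approximation under stated assumptions, not the Streaming Expression from the documentation: the function names, the seed, and the synthetic room rates are all made up for the example.

```python
import math
import random


def sorted_sample(rates, n, rng):
    """Draw n rates without replacement and sort ascending so indexes line up."""
    return sorted(rng.sample(rates, n))


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar_city(bookings, base_city, n, seed=0):
    """Return (closest city, {city: distance}) relative to base_city."""
    rng = random.Random(seed)
    vectors = {city: sorted_sample(rates, n, rng)
               for city, rates in bookings.items()}
    base = vectors[base_city]
    distances = {city: euclidean_distance(base, vec)
                 for city, vec in vectors.items() if city != base_city}
    best = min(distances, key=distances.get)
    return best, distances


# Synthetic room rates (hypothetical): cityB is built to resemble cityA.
data_rng = random.Random(42)
bookings = {
    "cityA": [data_rng.gauss(150, 20) for _ in range(5000)],
    "cityB": [data_rng.gauss(152, 21) for _ in range(5000)],
    "cityC": [data_rng.gauss(220, 60) for _ in range(5000)],
}
best, distances = most_similar_city(bookings, "cityA", n=1000)
```

Sorting each sample turns it into an approximation of the city's rate distribution, so the index-wise distance compares low rates with low rates and high rates with high rates instead of arbitrary pairs.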