SOLR-10651: fix docs ...

2017-08-17 19:05:02 -04:00 · 2017-08-17 19:05:02 -04:00 · 8ff75edd13
parent 3e2d72bf7b
commit 8ff75edd13
1 changed files with 31 additions and 38 deletions
--- a/solr/solr-ref-guide/src/statistical-programming.adoc
+++ b/solr/solr-ref-guide/src/statistical-programming.adoc
@ -22,10 +22,10 @@ The Streaming Expression language includes a powerful statistical programing syn
 features of a functional programming language. The syntax includes *variables*, *data structures*
 and a growing set of *mathematical functions*.

-Using the statistical programing syntax Solr's powerful `data retrieval`
-capabilities can be combined with in-depth `statistical analysis`.
+Using the statistical programming syntax, Solr's powerful *data retrieval*
+capabilities can be combined with in-depth *statistical analysis*.

-The `data retrieval` methods include:
+The *data retrieval* methods include:

 * SQL
 * time series aggregation
@ -58,18 +58,18 @@ The statistical function library includes functions that perform:
 * Sequences
 * Array manipulation functions (creation, copying, length, scaling, reverse etc...)

-The statistical function library is backed by `Apache Commons Math`.
+The statistical function library is backed by *Apache Commons Math*.

 This document provides an overview of how to apply the variables, data structures
 and mathematical functions.

 == /stream handler

-Like all Streaming Expressions the statistical functions can be run by Solr's /stream handler.
+Like all Streaming Expressions, the statistical functions can be run by Solr's /stream handler.

 == Math

-Streaming Expressions contain a suite of *mathematical* functions which can be called on
+Streaming Expressions contain a suite of *mathematical functions* which can be called on
 their own or as part of a larger expression.

 Solr's /stream handler evaluates the mathematical expression and returns a result.
@ -241,7 +241,7 @@ Returns the following response:

 We can manipulate arrays with functions.

-For example we can reverse and array with `rev` function.
+For example we can reverse an array with the `rev` function:

 [source,text]
 ----
@ -271,7 +271,7 @@ Returns the following response:
 }
 ----

-Functions can return arrays.
+Arrays can also be built and returned by functions.

 For example the sequence function:

@ -313,7 +313,7 @@ Expression:

 [source,text]
 ----
-scale(sequence(5,0,1), 10)
+scale(10, sequence(5,0,1))
 ----

 Returns the following response:
@ -341,7 +341,7 @@ Returns the following response:
 }
 ----

-We can perform `statistical analysis` on arrays.
+We can perform *statistical analysis* on arrays.

 For example we can correlate two sequences with the `corr` function:

@ -421,10 +421,10 @@ Returns the following response:

 == List

-Next up, we have the *list* data structure.
+Next we have the *list* data structure.

-The `list` function is a data structure that wraps Streaming Expressions and emits their tuples as a single
-concatenated stream.
+The `list` function is a data structure that wraps Streaming Expressions and emits all the tuples from the wrapped
+expressions as a single concatenated stream.

 Below is an example of a list of tuples:

@ -469,11 +469,10 @@ Returns the following response:

 == Let

-The `let` function sets variables and runs a streaming expression that references the variables.
+The `let` function sets *variables* and runs a Streaming Expression that references the variables. The `let` function can be used to
+write small statistical programs.

-Th output of any Streaming Expression can be stored in a variable.
-
-Lets see how `let` works.
+A *variable* can be set to the output of any Streaming Expression.

 Here is a very simple example:

@ -532,8 +531,8 @@ Here is the output:

 == Col

-The `col` function is used to move a column of numbers from a list of tuples into an array.
-This is an important step because Streaming Expressions such as SQL, random and timeseries return tuples,
+The `col` function is used to move a column of numbers from a list of tuples into an `array`.
+This is an important function because Streaming Expressions such as `sql`, `random` and `timeseries` return tuples,
 but the statistical functions operate on arrays.

 Below is an example of the `col` function:
@ -556,7 +555,7 @@ taken from the tuples stored in variable *a*.
 Variable *d* contains an array of values from the *price_f* field,
 taken from the tuples stored in variable *b*.

-Also notice that the response `tuple` is now pointing to the arrays in variables *c* and *d*.
+Also notice that the response `tuple` executed by `let` is pointing to the arrays in variables *c* and *d*.

 The response shows the arrays:

@ -597,9 +596,9 @@ Let's dive into an example that puts these tools to use.
 We have an existing hotel in *cityA* that is very profitable.
 We are contemplating opening up a new hotel in a different city.
 We're considering 4 different cities: *cityB*, *cityC*, *cityD*, *cityE*.
-We'd like to open a hotel in a city that has a very similar room rate to *cityA*.
+We'd like to open a hotel in a city that has similar room rates to *cityA*.

-How do we determine which of the 4 cities we're considering has the most similar room rates to *cityA*
+How do we determine which of the 4 cities we're considering has room rates which are most similar to *cityA*?

 === The Data

@ -609,17 +608,17 @@ We have a data set of un-aggregated hotel *bookings*. Each booking record has a

 One approach would be to aggregate the data from each city and compare the *mean* room rates. This approach will
 give us some useful information, but the mean is a summary statistic which loses a significant amount of information
-about the data. For example we don't have an understanding of how the distribution of room rates is effecting the
+about the data. For example we don't have an understanding of how the distribution of room rates is impacting the
 mean.

-The *median* room rate provides another interesting data point but it's still not the entire picture. It sill just
+The *median* room rate provides another interesting data point but it's still not the entire picture. It's still just
 one point of reference.

 Is there a way that we can compare the markets without losing valuable information in the data?

 === K Nearest Neighbor

-The use case we're thinking about can often be approached using a k Nearest Neighbor (knn) algorithm.
+The use case we're reasoning about can often be approached using a K Nearest Neighbor (knn) algorithm.

 With knn we use a *distance* measure to compare vectors of data to find the k nearest neighbors to
 a specific vector.
@ -631,17 +630,17 @@ computes the Euclidean distance between two vectors. This looks promising for co

 === Vectors

-But how to create the vectors from a our data set. Remember we have un-aggregated room rates from each of the cities.
+But how do we create the vectors from our data set? Remember we have un-aggregated room rates from each of the cities.
 How can we vectorize the data so it can be compared using the `distance` function?

-We have a Streaming Expression that can take a *random sample* from each of the cities. The name of this
-expression is `random`. So we could take a random sample of 1000 room rates from each of the five markets.
+We have a Streaming Expression that can retrieve a *random sample* from each of the cities. The name of this
+expression is `random`. So we could take a random sample of 1000 room rates from each of the five cities.

 But random vectors of room rates are not comparable because the distance algorithm compares values at each index
 in the vector. How can we make these vectors comparable?

 We can make them comparable by *sorting* them. Then as the distance algorithm moves along the vectors it will be
-comparing room rates from lowest to highest in both markets.
+comparing room rates from lowest to highest in both cities.

 === The code

@ -676,7 +675,7 @@ tuples that are returned.
 The `random` function is wrapped by a `sort` function which is sorting the tuples in
 ascending order based on the rate_d field.

-The next five variables (rates1, rates2, rates3, rates4, rates5) contain the arrays of room rates for each
+The next five variables (ratesA, ratesB, ratesC, ratesD, ratesE) contain the arrays of room rates for each
 city. The `col` function is used to move the `rate_d` field from the random sample tuples
 into an array for each city.

@ -684,14 +683,8 @@ Now we have five sorted vectors of room rates that we can compare with our `dist

 After the variables are set the `let` expression runs the `top` expression.

-The `top` expression is wrapping a `list` of `tuples`. Inside each tuple the distance function is used to compare
+The `top` expression is wrapping a `list` of `tuples`. Inside each tuple the `distance` function is used to compare
 the ratesA vector with one of the other cities. The output of the distance function is stored in the distance field
 in the tuple.

-The `list` function emits each `tuple` and the `top` function returns only the tuple with lowest distance.
-
-
-
-
-
-
+The `list` function emits each `tuple` and the `top` function returns only the tuple with the lowest distance.
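
The approach described in this hunk (sort each city's sample of room rates, compute the Euclidean distance to the reference city, keep the lowest) can be sketched outside Solr as follows. This is a minimal Python illustration of the idea, not the Streaming Expression from the guide: the function names `euclidean_distance` and `nearest_city` and the sample rates are hypothetical, and it assumes every sample has the same length.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_city(reference_rates, candidate_rates):
    """Return (name, distances) for the candidate closest to the reference.

    Each sample is sorted first so the distance compares room rates
    from lowest to highest, index by index.
    candidate_rates is a dict mapping a city name to its list of rates.
    """
    reference_sorted = sorted(reference_rates)
    distances = {
        name: euclidean_distance(reference_sorted, sorted(rates))
        for name, rates in candidate_rates.items()
    }
    best = min(distances, key=distances.get)
    return best, distances


if __name__ == "__main__":
    # Hypothetical sample data, three rates per city for brevity.
    city_a = [100.0, 120.0, 110.0]
    others = {
        "cityB": [300.0, 310.0, 305.0],
        "cityC": [111.0, 99.0, 121.0],
    }
    best, distances = nearest_city(city_a, others)
    print(best, distances)
```

Sorting before measuring is what makes the vectors comparable: without it, the distance would depend on the arbitrary order in which the random samples were drawn.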