mirror of https://github.com/apache/lucene.git
Ref Guide: clean up typos & other minor issues; last phase of 7.0 pre-release review
parent 9beafb612f
commit 7752c05252
|
@ -51,8 +51,8 @@ An example `security.json` showing both sections is shown below to show how thes
|
||||||
There are several things defined in this file:
|
There are several things defined in this file:
|
||||||
|
|
||||||
<1> Basic authentication and rule-based authorization plugins are enabled.
|
<1> Basic authentication and rule-based authorization plugins are enabled.
|
||||||
<2> A user called 'solr', with a password `'SolrRocks'` has been defined.
|
<2> The parameter `"blockUnknown":true` means that unauthenticated requests are not allowed to pass through.
|
||||||
<3> The parameter `"blockUnknown":true` means that unauthenticated requests are not allowed to pass through.
|
<3> A user called 'solr', with a password `'SolrRocks'` has been defined.
|
||||||
<4> The 'admin' role has been defined, and it has permission to edit security settings.
|
<4> The 'admin' role has been defined, and it has permission to edit security settings.
|
||||||
<5> The 'solr' user has been defined to the 'admin' role.
|
<5> The 'solr' user has been defined to the 'admin' role.
|
||||||
|
|
||||||
|
|
|
@ -43,9 +43,9 @@ keytool -genkeypair -alias solr-ssl -keyalg RSA -keysize 2048 -keypass secret -s
|
||||||
|
|
||||||
The above command will create a keystore file named `solr-ssl.keystore.jks` in the current directory.
|
The above command will create a keystore file named `solr-ssl.keystore.jks` in the current directory.
|
||||||
|
|
||||||
=== Convert the Certificate and Key to PEM Format for Use with cURL
|
=== Convert the Certificate and Key to PEM Format for Use with curl
|
||||||
|
|
||||||
cURL isn't capable of using JKS formatted keystores, so the JKS keystore needs to be converted to PEM format, which cURL understands.
|
curl isn't capable of using JKS formatted keystores, so the JKS keystore needs to be converted to PEM format, which curl understands.
|
||||||
|
|
||||||
First convert the JKS keystore into PKCS12 format using `keytool`:
|
First convert the JKS keystore into PKCS12 format using `keytool`:
|
||||||
|
|
||||||
|
@ -63,7 +63,7 @@ Next convert the PKCS12 format keystore, including both the certificate and the
|
||||||
openssl pkcs12 -in solr-ssl.keystore.p12 -out solr-ssl.pem
|
openssl pkcs12 -in solr-ssl.keystore.p12 -out solr-ssl.pem
|
||||||
----
|
----
|
||||||
|
|
||||||
If you want to use cURL on OS X Yosemite (10.10), you'll need to create a certificate-only version of the PEM format, as follows:
|
If you want to use curl on OS X Yosemite (10.10), you'll need to create a certificate-only version of the PEM format, as follows:
|
||||||
|
|
||||||
[source,bash]
|
[source,bash]
|
||||||
----
|
----
|
||||||
|
@ -230,16 +230,16 @@ bin\solr.cmd -cloud -s cloud\node2 -z localhost:2181 -p 7574
|
||||||
|
|
||||||
[IMPORTANT]
|
[IMPORTANT]
|
||||||
====
|
====
|
||||||
cURL on OS X Mavericks (10.9) has degraded SSL support. For more information and workarounds to allow one-way SSL, see http://curl.haxx.se/mail/archive-2013-10/0036.html. cURL on OS X Yosemite (10.10) is improved - 2-way SSL is possible - see http://curl.haxx.se/mail/archive-2014-10/0053.html .
|
curl on OS X Mavericks (10.9) has degraded SSL support. For more information and workarounds to allow one-way SSL, see http://curl.haxx.se/mail/archive-2013-10/0036.html. curl on OS X Yosemite (10.10) is improved - 2-way SSL is possible - see http://curl.haxx.se/mail/archive-2014-10/0053.html .
|
||||||
|
|
||||||
The cURL commands in the following sections will not work with the system `curl` on OS X Yosemite (10.10). Instead, the certificate supplied with the `-E` param must be in PKCS12 format, and the file supplied with the `--cacert` param must contain only the CA certificate, and no key (see <<Convert the Certificate and Key to PEM Format for Use with cURL,above>> for instructions on creating this file):
|
The curl commands in the following sections will not work with the system `curl` on OS X Yosemite (10.10). Instead, the certificate supplied with the `-E` param must be in PKCS12 format, and the file supplied with the `--cacert` param must contain only the CA certificate, and no key (see <<Convert the Certificate and Key to PEM Format for Use with curl,above>> for instructions on creating this file):
|
||||||
|
|
||||||
[source,bash]
|
[source,bash]
|
||||||
curl -E solr-ssl.keystore.p12:secret --cacert solr-ssl.cacert.pem ...
|
curl -E solr-ssl.keystore.p12:secret --cacert solr-ssl.cacert.pem ...
|
||||||
|
|
||||||
====
|
====
|
||||||
|
|
||||||
NOTE: If your operating system does not include cURL, you can download binaries here: http://curl.haxx.se/download.html
|
NOTE: If your operating system does not include curl, you can download binaries here: http://curl.haxx.se/download.html
|
||||||
|
|
||||||
=== Create a SolrCloud Collection using bin/solr
|
=== Create a SolrCloud Collection using bin/solr
|
||||||
|
|
||||||
|
@ -259,7 +259,7 @@ bin\solr.cmd create -c mycollection -shards 2
|
||||||
|
|
||||||
The `create` action will pass the `SOLR_SSL_*` properties set in your include file to the SolrJ code used to create the collection.
|
The `create` action will pass the `SOLR_SSL_*` properties set in your include file to the SolrJ code used to create the collection.
|
||||||
|
|
||||||
=== Retrieve SolrCloud Cluster Status using cURL
|
=== Retrieve SolrCloud Cluster Status using curl
|
||||||
|
|
||||||
To get the resulting cluster status (again, if you have not enabled client authentication, remove the `-E solr-ssl.pem:secret` option):
|
To get the resulting cluster status (again, if you have not enabled client authentication, remove the `-E solr-ssl.pem:secret` option):
|
||||||
|
|
||||||
|
@ -315,9 +315,9 @@ cd example/exampledocs
|
||||||
java -Djavax.net.ssl.keyStorePassword=secret -Djavax.net.ssl.keyStore=../../server/etc/solr-ssl.keystore.jks -Djavax.net.ssl.trustStore=../../server/etc/solr-ssl.keystore.jks -Djavax.net.ssl.trustStorePassword=secret -Durl=https://localhost:8984/solr/mycollection/update -jar post.jar *.xml
|
java -Djavax.net.ssl.keyStorePassword=secret -Djavax.net.ssl.keyStore=../../server/etc/solr-ssl.keystore.jks -Djavax.net.ssl.trustStore=../../server/etc/solr-ssl.keystore.jks -Djavax.net.ssl.trustStorePassword=secret -Durl=https://localhost:8984/solr/mycollection/update -jar post.jar *.xml
|
||||||
----
|
----
|
||||||
|
|
||||||
=== Query Using cURL
|
=== Query Using curl
|
||||||
|
|
||||||
Use cURL to query the SolrCloud collection created above, from a directory containing the PEM formatted certificate and key created above (e.g. `example/etc/`) - if you have not enabled client authentication (system property `-Djetty.ssl.clientAuth=true`), then you can remove the `-E solr-ssl.pem:secret` option:
|
Use curl to query the SolrCloud collection created above, from a directory containing the PEM formatted certificate and key created above (e.g. `example/etc/`) - if you have not enabled client authentication (system property `-Djetty.ssl.clientAuth=true`), then you can remove the `-E solr-ssl.pem:secret` option:
|
||||||
|
|
||||||
[source,bash]
|
[source,bash]
|
||||||
----
|
----
|
||||||
|
|
|
@ -20,8 +20,12 @@
|
||||||
|
|
||||||
Solr ships with many out-of-the-box RequestHandlers, which are called implicit because they are not configured in `solrconfig.xml`.
|
Solr ships with many out-of-the-box RequestHandlers, which are called implicit because they are not configured in `solrconfig.xml`.
|
||||||
|
|
||||||
|
These handlers have pre-defined default parameters, known as _paramsets_, which can be modified if necessary.
|
||||||
|
|
||||||
== List of Implicitly Available Endpoints
|
== List of Implicitly Available Endpoints
|
||||||
|
|
||||||
|
// TODO 7.1 - this doesn't look great in the PDF, redesign the presentation
|
||||||
|
|
||||||
// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
|
// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
|
||||||
|
|
||||||
[cols="15,20,15,50",options="header"]
|
[cols="15,20,15,50",options="header"]
|
||||||
|
@ -57,17 +61,20 @@ Solr ships with many out-of-the-box RequestHandlers, which are called implicit b
|
||||||
|
|
||||||
== How to View the Configuration
|
== How to View the Configuration
|
||||||
|
|
||||||
You can see configuration for all request handlers, including the implicit request handlers, via the <<config-api.adoc#config-api,Config API>>. E.g. for the `gettingstarted` collection:
|
You can see configuration for all request handlers, including the implicit request handlers, via the <<config-api.adoc#config-api,Config API>>. For the `gettingstarted` collection:
|
||||||
|
|
||||||
`curl http://localhost:8983/solr/gettingstarted/config/requestHandler`
|
[source,text]
|
||||||
|
curl http://localhost:8983/solr/gettingstarted/config/requestHandler
|
||||||
|
|
||||||
To restrict the results to the configuration for a particular request handler, use the `componentName` request param. E.g. to see just the configuration for the `/export` request handler:
|
To restrict the results to the configuration for a particular request handler, use the `componentName` request parameter. To see just the configuration for the `/export` request handler:
|
||||||
|
|
||||||
`curl "http://localhost:8983/solr/gettingstarted/config/requestHandler?componentName=/export"`
|
[source,text]
|
||||||
|
curl "http://localhost:8983/solr/gettingstarted/config/requestHandler?componentName=/export"
|
||||||
|
|
||||||
To include the expanded paramset in the response, as well as the effective parameters from merging the paramset params with the built-in params, use the `expandParams` request param. E.g. for the `/export` request handler:
|
To include the expanded paramset in the response, as well as the effective parameters from merging the paramset parameters with the built-in parameters, use the `expandParams` request param. For the `/export` request handler, you can make a request like this:
|
||||||
|
|
||||||
`curl "http://localhost:8983/solr/gettingstarted/config/requestHandler?componentName=/export&expandParams=true"`
|
[source,text]
|
||||||
|
curl "http://localhost:8983/solr/gettingstarted/config/requestHandler?componentName=/export&expandParams=true"
|
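The three Config API requests above differ only in their query parameters. A minimal Python sketch of how those URLs can be assembled is shown here; the helper name `request_handler_config_url` is hypothetical and not part of Solr, and the host, port, and collection are taken from the curl examples.

```python
from urllib.parse import urlencode

# Host and port as used in the curl examples; adjust for your installation.
SOLR_BASE = "http://localhost:8983/solr"


def request_handler_config_url(collection, component_name=None, expand_params=False):
    """Build a Config API URL for request handler configuration (hypothetical helper)."""
    url = f"{SOLR_BASE}/{collection}/config/requestHandler"
    params = {}
    if component_name is not None:
        params["componentName"] = component_name
    if expand_params:
        params["expandParams"] = "true"
    if params:
        # safe="/" keeps the leading slash of handler names such as /export unescaped
        url += "?" + urlencode(params, safe="/")
    return url


print(request_handler_config_url("gettingstarted"))
print(request_handler_config_url("gettingstarted", "/export"))
print(request_handler_config_url("gettingstarted", "/export", expand_params=True))
```

The resulting strings can be passed to curl or any HTTP client; nothing is sent by the sketch itself.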
||||||
|
|
||||||
== How to Edit the Configuration
|
== How to Edit the Configuration
|
||||||
|
|
||||||
|
|
|
@ -181,10 +181,10 @@ http://localhost:8983/solr/admin/cores?action=LISTSNAPSHOTS&core=techproducts&co
|
||||||
|
|
||||||
The list snapshot request parameters are:
|
The list snapshot request parameters are:
|
||||||
|
|
||||||
core::
|
`core`::
|
||||||
The name of the core whose snapshots we want to list.
|
The name of the core whose snapshots we want to list.
|
||||||
|
|
||||||
async::
|
`async`::
|
||||||
Request ID to track this action which will be processed asynchronously.
|
Request ID to track this action which will be processed asynchronously.
|
||||||
|
|
||||||
=== Delete Snapshot API
|
=== Delete Snapshot API
|
||||||
|
@ -210,7 +210,6 @@ The name of the core whose snapshot we want to delete
|
||||||
`async`::
|
`async`::
|
||||||
Request ID to track this action which will be processed asynchronously
|
Request ID to track this action which will be processed asynchronously
|
||||||
|
|
||||||
|
|
||||||
== Backup/Restore Storage Repositories
|
== Backup/Restore Storage Repositories
|
||||||
|
|
||||||
Solr provides interfaces to plug in different storage systems for backing up and restoring. For example, you can have a Solr cluster running on a local filesystem like EXT3 but you can back up the indexes to an HDFS filesystem or vice versa.
|
Solr provides interfaces to plug in different storage systems for backing up and restoring. For example, you can have a Solr cluster running on a local filesystem like EXT3 but you can back up the indexes to an HDFS filesystem or vice versa.
|
||||||
|
|
|
@ -112,7 +112,7 @@ Now, let’s add a new word to the English stop word list using an HTTP PUT:
|
||||||
curl -X PUT -H 'Content-type:application/json' --data-binary '["foo"]' "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english"
|
curl -X PUT -H 'Content-type:application/json' --data-binary '["foo"]' "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english"
|
||||||
----
|
----
|
||||||
|
|
||||||
Here we’re using cURL to PUT a JSON list containing a single word “foo” to the managed English stop words set. Solr will return 200 if the request was successful. You can also put multiple words in a single PUT request.
|
Here we’re using curl to PUT a JSON list containing a single word “foo” to the managed English stop words set. Solr will return 200 if the request was successful. You can also put multiple words in a single PUT request.
|
||||||
|
|
||||||
You can test to see if a specific word exists by sending a GET request for that word as a child resource of the set, such as:
|
You can test to see if a specific word exists by sending a GET request for that word as a child resource of the set, such as:
|
||||||
|
|
||||||
|
|
|
@ -1,7 +1,7 @@
|
||||||
= Managing Solr
|
= Managing Solr
|
||||||
:page-shortname: managing-solr
|
:page-shortname: managing-solr
|
||||||
:page-permalink: managing-solr.html
|
:page-permalink: managing-solr.html
|
||||||
:page-children: securing-solr, running-solr-on-hdfs, making-and-restoring-backups, configuring-logging, using-jmx-with-solr, mbean-request-handler, performance-statistics-reference, metrics-reporting, v2-api
|
:page-children: securing-solr, running-solr-on-hdfs, making-and-restoring-backups, configuring-logging, metrics-reporting, using-jmx-with-solr, mbean-request-handler, performance-statistics-reference, v2-api
|
||||||
// Licensed to the Apache Software Foundation (ASF) under one
|
// Licensed to the Apache Software Foundation (ASF) under one
|
||||||
// or more contributor license agreements. See the NOTICE file
|
// or more contributor license agreements. See the NOTICE file
|
||||||
// distributed with this work for additional information
|
// distributed with this work for additional information
|
||||||
|
|
|
@ -36,6 +36,7 @@ The output format. This operates the same as the <<response-writers.adoc#respons
|
||||||
|
|
||||||
== MBeanRequestHandler Examples
|
== MBeanRequestHandler Examples
|
||||||
|
|
||||||
|
// TODO 7.1 - replace with link to tutorial
|
||||||
The following examples assume you are running Solr's `techproducts` example configuration:
|
The following examples assume you are running Solr's `techproducts` example configuration:
|
||||||
|
|
||||||
[source,bash]
|
[source,bash]
|
||||||
|
@ -45,16 +46,20 @@ bin/solr start -e techproducts
|
||||||
|
|
||||||
To return information about the CACHE category only:
|
To return information about the CACHE category only:
|
||||||
|
|
||||||
`\http://localhost:8983/solr/techproducts/admin/mbeans?cat=CACHE`
|
[source,text]
|
||||||
|
http://localhost:8983/solr/techproducts/admin/mbeans?cat=CACHE
|
||||||
|
|
||||||
To return information and statistics about the CACHE category only, formatted in XML:
|
To return information and statistics about the CACHE category only, formatted in XML:
|
||||||
|
|
||||||
`\http://localhost:8983/solr/techproducts/admin/mbeans?stats=true&cat=CACHE&wt=xml`
|
[source,text]
|
||||||
|
http://localhost:8983/solr/techproducts/admin/mbeans?stats=true&cat=CACHE&wt=xml
|
||||||
|
|
||||||
To return information for everything, and statistics for everything except the `fieldCache`:
|
To return information for everything, and statistics for everything except the `fieldCache`:
|
||||||
|
|
||||||
`\http://localhost:8983/solr/techproducts/admin/mbeans?stats=true&f.fieldCache.stats=false`
|
[source,text]
|
||||||
|
http://localhost:8983/solr/techproducts/admin/mbeans?stats=true&f.fieldCache.stats=false
|
||||||
|
|
||||||
To return information and statistics for the `fieldCache` only:
|
To return information and statistics for the `fieldCache` only:
|
||||||
|
|
||||||
`\http://localhost:8983/solr/techproducts/admin/mbeans?key=fieldCache&stats=true`
|
[source,text]
|
||||||
|
http://localhost:8983/solr/techproducts/admin/mbeans?key=fieldCache&stats=true
|
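The MBeanRequestHandler examples above are the same endpoint with different parameter combinations. As a rough sketch, they can be generated from a dictionary of parameters; the function name `mbeans_url` is hypothetical, and the core name and port are the ones used in the `techproducts` examples.

```python
from urllib.parse import urlencode


def mbeans_url(core, params=None):
    """Build an /admin/mbeans URL for a core (hypothetical helper, not a Solr API)."""
    base = f"http://localhost:8983/solr/{core}/admin/mbeans"
    if not params:
        return base
    # Render Python booleans the way the examples write them: true / false
    rendered = {
        key: (str(value).lower() if isinstance(value, bool) else value)
        for key, value in params.items()
    }
    return f"{base}?{urlencode(rendered)}"


print(mbeans_url("techproducts", {"cat": "CACHE"}))
print(mbeans_url("techproducts", {"stats": True, "cat": "CACHE", "wt": "xml"}))
print(mbeans_url("techproducts", {"stats": True, "f.fieldCache.stats": False}))
print(mbeans_url("techproducts", {"key": "fieldCache", "stats": True}))
```

A dictionary is used instead of keyword arguments because parameter names such as `f.fieldCache.stats` contain dots.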
||||||
|
|
|
@ -205,7 +205,7 @@ For more information, see the section <<schema-api.adoc#modify-the-schema,Modify
|
||||||
|
|
||||||
More info: http://asciidoctor.org/docs/user-manual/#inter-document-cross-references
|
More info: http://asciidoctor.org/docs/user-manual/#inter-document-cross-references
|
||||||
|
|
||||||
== Lists
|
== Ordered and Unordered Lists
|
||||||
|
|
||||||
AsciiDoc supports three types of lists:
|
AsciiDoc supports three types of lists:
|
||||||
|
|
||||||
|
|
|
@ -79,7 +79,7 @@ Other types of data such as errors and timeouts are also provided. These are ava
|
||||||
|
|
||||||
The table below shows the metric names and attributes to request:
|
The table below shows the metric names and attributes to request:
|
||||||
|
|
||||||
[cols="25,75",options="header"]
|
[cols="30,70",options="header"]
|
||||||
|===
|
|===
|
||||||
|Metric name | Description
|
|Metric name | Description
|
||||||
|`QUERY./select.errors`
|
|`QUERY./select.errors`
|
||||||
|
|
|
@ -40,25 +40,26 @@ This example `security.json` shows how the <<basic-authentication-plugin.adoc#ba
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
"authentication":{
|
"authentication":{
|
||||||
"class":"solr.BasicAuthPlugin",
|
"class":"solr.BasicAuthPlugin", <1>
|
||||||
"blockUnknown": true,
|
"blockUnknown": true, <2>
|
||||||
"credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}
|
"credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="} <3>
|
||||||
},
|
},
|
||||||
"authorization":{
|
"authorization":{
|
||||||
"class":"solr.RuleBasedAuthorizationPlugin",
|
"class":"solr.RuleBasedAuthorizationPlugin", <4>
|
||||||
"permissions":[{"name":"security-edit",
|
"permissions":[{"name":"security-edit",
|
||||||
"role":"admin"}],
|
"role":"admin"}], <5>
|
||||||
"user-role":{"solr":"admin"}
|
"user-role":{"solr":"admin"} <6>
|
||||||
}}
|
}}
|
||||||
----
|
----
|
||||||
|
|
||||||
There are several things defined in this example:
|
There are several things defined in this example:
|
||||||
|
|
||||||
* Basic authentication and rule-based authorization plugins are enabled.
|
<1> Basic authentication and rule-based authorization plugins are enabled.
|
||||||
|
<2> All requests without credentials will be rejected with a 401 error. Set `'blockUnknown'` to false (or remove it altogether) if you wish to let unauthenticated requests go through. However, if a particular resource is protected by a rule, they are rejected anyway with a 401 error.
|
||||||
* A user called 'solr', with a password has been defined.
|
* A user called 'solr', with a password has been defined.
|
||||||
* All requests w/o credentials will be rejected with a 401 error. Set `'blockUnknown'` to false (or remove it altogether) if you wish to let unauthenticated requests to go through. However, if a particular resource is protected by a rule, they are rejected anyway with a 401 error.
|
<4> Basic authentication and rule-based authorization plugins are enabled.
|
||||||
* The 'admin' role has been defined, and it has permission to edit security settings.
|
<5> The 'admin' role has been defined, and it has permission to edit security settings.
|
||||||
* The 'solr' user has been defined to the 'admin' role.
|
<6> The 'solr' user has been defined to the 'admin' role.
|
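To make the relationship between the `permissions` and `user-role` sections concrete, the example `security.json` can be loaded and inspected with a few lines of Python. This is only a reading aid under simple assumptions: the helper `users_with_permission` is hypothetical, and it does not reproduce how Solr itself evaluates permissions.

```python
import json

# The example security.json from this section, without the callout markers.
SECURITY_JSON = """
{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "blockUnknown": true,
    "credentials": {"solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "permissions": [{"name": "security-edit", "role": "admin"}],
    "user-role": {"solr": "admin"}
  }
}
"""


def users_with_permission(config, permission_name):
    """Return users whose role matches the named permission (hypothetical helper)."""
    authorization = config["authorization"]
    roles = {
        perm["role"]
        for perm in authorization["permissions"]
        if perm["name"] == permission_name
    }
    matches = []
    for user, user_roles in authorization["user-role"].items():
        # Treat a single role string and a list of roles the same way
        if isinstance(user_roles, str):
            user_roles = [user_roles]
        if roles.intersection(user_roles):
            matches.append(user)
    return sorted(matches)


config = json.loads(SECURITY_JSON)
print(config["authentication"]["blockUnknown"])
print(users_with_permission(config, "security-edit"))
```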
||||||
|
|
||||||
== Permission Attributes
|
== Permission Attributes
|
||||||
|
|
||||||
|
|
|
@ -136,7 +136,8 @@ Pass the location of HDFS client configuration files - needed for HDFS HA for ex
|
||||||
|
|
||||||
Hadoop can be configured to use the Kerberos protocol to verify user identity when trying to access core services like HDFS. If your HDFS directories are protected using Kerberos, then you need to configure Solr's HdfsDirectoryFactory to authenticate using Kerberos in order to read and write to HDFS. To enable Kerberos authentication from Solr, you need to set the following parameters:
|
Hadoop can be configured to use the Kerberos protocol to verify user identity when trying to access core services like HDFS. If your HDFS directories are protected using Kerberos, then you need to configure Solr's HdfsDirectoryFactory to authenticate using Kerberos in order to read and write to HDFS. To enable Kerberos authentication from Solr, you need to set the following parameters:
|
||||||
|
|
||||||
`solr.hdfs.security.kerberos.enabled`:: false |Set to `true` to enable Kerberos authentication. The default is `false`.
|
`solr.hdfs.security.kerberos.enabled`::
|
||||||
|
Set to `true` to enable Kerberos authentication. The default is `false`.
|
||||||
|
|
||||||
`solr.hdfs.security.kerberos.keytabfile`::
|
`solr.hdfs.security.kerberos.keytabfile`::
|
||||||
A keytab file contains pairs of Kerberos principals and encrypted keys which allows for password-less authentication when Solr attempts to authenticate with secure Hadoop.
|
A keytab file contains pairs of Kerberos principals and encrypted keys which allows for password-less authentication when Solr attempts to authenticate with secure Hadoop.
|
||||||
|
|
|
@ -63,8 +63,7 @@ The client can specify '<<distributed-search-with-index-sharding.adoc#distribute
|
||||||
|
|
||||||
Example response with `partialResults` flag set to 'true':
|
Example response with `partialResults` flag set to 'true':
|
||||||
|
|
||||||
*Solr Response with partialResults*
|
.Solr Response with partialResults
|
||||||
|
|
||||||
[source,json]
|
[source,json]
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
|
|
|
@ -18,27 +18,27 @@
|
||||||
// specific language governing permissions and limitations
|
// specific language governing permissions and limitations
|
||||||
// under the License.
|
// under the License.
|
||||||
|
|
||||||
The Streaming Expression language includes a powerful statistical programing syntax with many of the
|
The Streaming Expression language includes a powerful statistical programming syntax with many of the features of a functional programming language.
|
||||||
features of a functional programming language. The syntax includes *variables*, *data structures*
|
|
||||||
and a growing set of *mathematical functions*.
|
|
||||||
|
|
||||||
Using the statistical programing syntax Solr's powerful *data retrieval*
|
The syntax includes variables, data structures and a growing set of mathematical functions.
|
||||||
capabilities can be combined with in-depth *statistical analysis*.
|
|
||||||
|
|
||||||
The *data retrieval* methods include:
|
Using the statistical programming syntax, Solr's powerful data retrieval
|
||||||
|
capabilities can be combined with in-depth statistical analysis.
|
||||||
|
|
||||||
|
The data retrieval methods include:
|
||||||
|
|
||||||
* SQL
|
* SQL
|
||||||
* time series aggregation
|
* time series aggregation
|
||||||
* random sampling
|
* random sampling
|
||||||
* faceted aggregation
|
* faceted aggregation
|
||||||
* KNN searches
|
* K-Nearest Neighbor (KNN) searches
|
||||||
* topic message queues
|
* `topic` message queues
|
||||||
* MapReduce (parallel relational algebra)
|
* MapReduce (parallel relational algebra)
|
||||||
* JDBC calls to outside databases
|
* JDBC calls to outside databases
|
||||||
* Graph Expressions
|
* Graph Expressions
|
||||||
|
|
||||||
Once the data is retrieved, the statistical programming syntax can be used to create *arrays* from the data so it
|
Once the data is retrieved, the statistical programming syntax can be used to create arrays from the data so it
|
||||||
can be *manipulated*, *transformed* and *analyzed*.
|
can be manipulated, transformed and analyzed.
|
||||||
|
|
||||||
The statistical function library includes functions that perform:
|
The statistical function library includes functions that perform:
|
||||||
|
|
||||||
|
@ -48,7 +48,7 @@ The statistical function library includes functions that perform:
|
||||||
* Moving averages
|
* Moving averages
|
||||||
* Percentiles
|
* Percentiles
|
||||||
* Simple regression and prediction
|
* Simple regression and prediction
|
||||||
* ANOVA
|
* Analysis of variance (ANOVA)
|
||||||
* Histograms
|
* Histograms
|
||||||
* Convolution
|
* Convolution
|
||||||
* Euclidean distance
|
* Euclidean distance
|
||||||
|
@ -56,34 +56,30 @@ The statistical function library includes functions that perform:
|
||||||
* Rank transformation
|
* Rank transformation
|
||||||
* Normalization transformation
|
* Normalization transformation
|
||||||
* Sequences
|
* Sequences
|
||||||
* Array manipulation functions (creation, copying, length, scaling, reverse etc...)
|
* Array manipulation functions (creation, copying, length, scaling, reverse, etc.)
|
||||||
|
|
||||||
The statistical function library is backed by *Apache Commons Math*.
|
The statistical function library is backed by the https://commons.apache.org/proper/commons-math/[Apache Commons Math library]. A full discussion of many of the math functions available to streaming expressions is available in the section <<stream-evaluator-reference.adoc#stream-evaluator-reference,Stream Evaluator Reference>>.
|
||||||
|
|
||||||
This document provides an overview of the how to apply the variables, data structures
|
This document provides an overview of how to apply the variables, data structures and mathematical functions.
|
||||||
and mathematical functions.
|
|
||||||
|
|
||||||
== /stream handler
|
NOTE: Like all streaming expressions, the statistical functions are run by Solr's `/stream` handler. For an overview of this handler, see the section <<streaming-expressions.adoc#streaming-expressions,Streaming Expressions>>.
|
||||||
|
|
||||||
Like all Streaming Expressions, the statistical functions can be run by Solr's /stream handler.
|
== Math Functions
|
||||||
|
|
||||||
== Math
|
Streaming expressions contain a suite of mathematical functions which can be called on their own or as part of a larger expression.
|
||||||
|
|
||||||
Streaming Expressions contain a suite of *mathematical functions* which can be called on
|
Solr's `/stream` handler evaluates the mathematical expression and returns a result.
|
||||||
their own or as part of a larger expression.
|
|
||||||
|
|
||||||
Solr's /stream handler evaluates the mathematical expression and returns a result.
|
For example, if you send the following expression to the `/stream` handler:
|
||||||
|
|
||||||
For example sending the following expression to the /stream handler:
|
|
||||||
|
|
||||||
[source,text]
|
[source,text]
|
||||||
----
|
----
|
||||||
add(1, 1)
|
add(1, 1)
|
||||||
----
|
----
|
||||||
|
|
||||||
Returns the following response:
|
You get the following response:
|
||||||
|
|
||||||
[source,text]
|
[source,json]
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
"result-set": {
|
"result-set": {
|
||||||
|
@ -109,7 +105,7 @@ pow(10, add(1,1))
|
||||||
|
|
||||||
Returns the following response:
|
Returns the following response:
|
||||||
|
|
||||||
[source,text]
|
[source,json]
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
"result-set": {
|
"result-set": {
|
||||||
|
@ -126,9 +122,7 @@ Returns the following response:
|
||||||
}
|
}
|
||||||
----
|
----
|
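Expressions such as `add(1, 1)` and `pow(10, add(1,1))` are plain strings that a client submits to the `/stream` handler. The sketch below prepares such a request with Python's standard library. It assumes, beyond what this page states, that the handler is reached at `/solr/<collection>/stream` and reads the expression from an `expr` form parameter; the collection name `collection2` is borrowed from the later `search` example, and `build_stream_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_stream_request(collection, expr, base_url="http://localhost:8983/solr"):
    """Prepare (but do not send) a POST to the /stream handler.

    Assumption: the expression is passed as the form parameter "expr".
    """
    body = urlencode({"expr": expr}).encode("utf-8")
    return Request(
        f"{base_url}/{collection}/stream",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_stream_request("collection2", "pow(10, add(1,1))")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it to a running Solr instance: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the example runnable without a Solr server.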
||||||
|
|
||||||
You can also perform math on a stream of Tuples.
|
You can also perform math on a stream of Tuples. For example:
|
||||||
|
|
||||||
For example:
|
|
||||||
|
|
||||||
[source,text]
|
[source,text]
|
||||||
----
|
----
|
||||||
|
@ -139,7 +133,7 @@ select(search(collection2, q="*:*", fl="price_f", sort="price_f desc", rows="3")
|
||||||
|
|
||||||
Returns the following response:
|
Returns the following response:
|
||||||
|
|
||||||
[source, text]
|
[source,json]
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
"result-set": {
|
"result-set": {
|
||||||
|
@ -165,9 +159,13 @@ Returns the following response:
|
||||||
}
|
}
|
||||||
----
|
----
|
||||||
|
|
||||||
== Array (data structure)
|
== Data Structures
|
||||||
|
|
||||||
The first data structure we'll explore is the *array*.
|
Several types of data can be manipulated with the statistical programming syntax. The following sections explore <<Arrays,arrays>>, <<Tuples,tuples>>, and <<Lists,lists>>.
|
||||||
|
|
||||||
|
=== Arrays
|
||||||
|
|
||||||
|
The first data structure we'll explore is the array.
|
||||||
|
|
||||||
We can create an array with the `array` function:
|
We can create an array with the `array` function:
|
||||||
|
|
||||||
|
@ -180,7 +178,7 @@ array(1, 2, 3)
|
||||||
|
|
||||||
Returns the following response:
|
Returns the following response:
|
||||||
|
|
||||||
[source,text]
|
[source,json]
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
"result-set": {
|
"result-set": {
|
||||||
|
@ -201,7 +199,7 @@ Returns the following response:
|
||||||
}
|
}
|
||||||
----
|
----
|
||||||
|
|
||||||
We can nest arrays within arrays to form a *matrix*:
|
We can nest arrays within arrays to form a matrix:
|
||||||
|
|
||||||
[source,text]
|
[source,text]
|
||||||
----
|
----
|
||||||
|
@ -211,7 +209,7 @@ array(array(1, 2, 3),
|
||||||
|
|
||||||
Returns the following response:
|
Returns the following response:
|
||||||
|
|
||||||
[source,text]
|
[source,json]
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
"result-set": {
|
"result-set": {
|
||||||
|
@ -239,9 +237,7 @@ Returns the following response:
|
||||||
}
|
}
|
||||||
----
|
----
|
||||||
|
|
||||||
We can manipulate arrays with functions.
|
We can manipulate arrays with functions. For example, we can reverse an array with the `rev` function:
|
||||||
|
|
||||||
For example we can reverse and array with the `rev` function:
|
|
||||||
|
|
||||||
[source,text]
|
[source,text]
|
||||||
----
|
----
|
||||||
|
@ -250,7 +246,7 @@ rev(array(1, 2, 3))
|
||||||
|
|
||||||
Returns the following response:
|
Returns the following response:
|
||||||
|
|
||||||
[source,text]
|
[source,json]
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
"result-set": {
|
"result-set": {
|
||||||
|
@ -271,18 +267,16 @@ Returns the following response:
|
||||||
}
|
}
|
||||||
----
|
----
|
||||||
|
|
||||||
Arrays can also be built and returned by functions.
|
Arrays can also be built and returned by functions. For example, the `sequence` function:
|
||||||
|
|
||||||
For example the sequence function:
|
|
||||||
|
|
||||||
[source,text]
|
[source,text]
|
||||||
----
|
----
|
||||||
sequence(5,0,1)
|
sequence(5,0,1)
|
||||||
----
|
----
|
||||||
|
|
||||||
This returns an array of size *5* starting from *0* with a stride of *1*.
|
This returns an array of size `5` starting from `0` with a stride of `1`.
|
||||||
|
|
||||||
[source,text]
|
[source,json]
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
"result-set": {
|
"result-set": {
|
||||||
|
@ -305,11 +299,7 @@ This returns an array of size *5* starting from *0* with a stride of *1*.
|
||||||
}
|
}
|
||||||
----
|
----
|
||||||
|
|
||||||
We can perform math on an array.
|
We can perform math on an array. For example, we can scale an array with the `scale` function:
|
||||||
|
|
||||||
For example we can scale an array with the `scale` function:
|
|
||||||
|
|
||||||
Expression:
|
|
||||||
|
|
||||||
[source,text]
|
[source,text]
|
||||||
----
|
----
|
||||||
|
@ -318,7 +308,7 @@ scale(10, sequence(5,0,1))
|
||||||
|
|
||||||
Returns the following response:
|
Returns the following response:
|
||||||
|
|
||||||
[source,text]
|
[source,json]
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
"result-set": {
|
"result-set": {
|
||||||
|
@ -341,9 +331,7 @@ Returns the following response:
|
||||||
}
|
}
|
||||||
----
|
----
|
||||||
|
|
||||||
We can perform *statistical analysis* on arrays.
|
We can perform statistical analysis on arrays. For example, we can correlate two sequences with the `corr` function:
|
||||||
|
|
||||||
For example we can correlate two sequences with the `corr` function:
|
|
||||||
|
|
||||||
[source,text]
|
[source,text]
|
||||||
----
|
----
|
||||||
|
@ -352,7 +340,7 @@ corr(sequence(5,1,1), sequence(5,10,10))
|
||||||
|
|
||||||
Returns the following response:
|
Returns the following response:
|
||||||
|
|
||||||
[source,text]
|
[source,json]
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
"result-set": {
|
"result-set": {
|
||||||
|
@ -370,12 +358,11 @@ Returns the following response:
|
||||||
----
|
----
|
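To make the `corr` example above concrete, here is a small Python sketch. It assumes that the correlation in question is Pearson's product-moment correlation, which is an assumption of this sketch rather than something stated in the text. The two input lists correspond to `sequence(5,1,1)` and `sequence(5,10,10)`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


a = [1, 2, 3, 4, 5]        # sequence(5,1,1)
b = [10, 20, 30, 40, 50]   # sequence(5,10,10)
print(pearson(a, b))       # perfectly linear inputs, so the result is 1.0
```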
||||||
|
|
||||||
|
|
||||||
== Tuple (data structure)
|
=== Tuples
|
||||||
|
|
||||||
The *tuple* is the next data structure we'll explore.
|
The tuple is the next data structure we'll explore.
|
||||||
|
|
||||||
The `tuple` function returns a map of name/value pairs. A tuple is a very flexible data structure
|
The `tuple` function returns a map of name/value pairs. A tuple is a very flexible data structure that can hold values that are strings, numerics, arrays and lists of tuples.
|
||||||
that can hold values that are strings, numerics, arrays and lists of tuples.
|
|
||||||
|
|
||||||
A tuple can be used to return a complex result from a statistical expression.
|
A tuple can be used to return a complex result from a statistical expression.
|
||||||
|
|
||||||
|
@ -390,7 +377,7 @@ tuple(title="hello world",
|
||||||
Returns the following response:
|
Returns the following response:
|
||||||
|
|
||||||
----
|
----
|
||||||
[source,text]
|
[source,json]
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
"result-set": {
|
"result-set": {
|
||||||
|
@ -419,12 +406,11 @@ Returns the following response:
|
||||||
}
|
}
|
||||||
----
|
----
|
||||||
|
|
||||||
== List (data structure)
|
=== Lists
|
||||||
|
|
||||||
Next we have the *list* data structure.
|
Next we have the list data structure.
|
||||||
|
|
||||||
The `list` function is a data structure that wraps Streaming Expressions and emits all the tuples from the wrapped
|
The `list` function is a data structure that wraps streaming expressions and emits all the tuples from the wrapped expressions as a single concatenated stream.
|
||||||
expressions as a single concatenated stream.
|
|
||||||
|
|
||||||
Below is an example of a list of tuples:
|
Below is an example of a list of tuples:
|
||||||
|
|
||||||
|
@ -436,7 +422,7 @@ list(tuple(id=1, data=array(1, 2, 3)),
|
||||||
|
|
||||||
Returns the following response:
|
Returns the following response:
|
||||||
|
|
||||||
[source,text]
|
[source,json]
|
||||||
----
|
----
|
||||||
|
|
||||||
{
|
{
|
||||||
|
@ -467,14 +453,12 @@ Returns the following response:
|
||||||
}
|
}
|
||||||
----
|
----
|
||||||
|
|
||||||
== Let (setting variables)
|
== Setting Variables with let
|
||||||
|
|
||||||
The `let` function sets *variables* and runs a Streaming Expression that references the variables. The `let` funtion can be used to
|
The `let` function sets variables and runs a streaming expression that references the variables. The `let` function can be used to
|
||||||
write small statistical programs.
|
write small statistical programs.
|
||||||
|
|
||||||
A *variable* can be set to the output of any Streaming Expression.
|
A variable can be set to the output of any streaming expression. Here is a very simple example:
|
||||||
|
|
||||||
Here is a very simple example:
|
|
||||||
|
|
||||||
[source,text]
|
[source,text]
|
||||||
----
|
----
|
||||||
|
@ -483,7 +467,7 @@ let(a=random(collection2, q="*:*", rows="3", fl="price_f"),
|
||||||
tuple(sample1=a, sample2=b))
|
tuple(sample1=a, sample2=b))
|
||||||
----
|
----
|
||||||
|
|
||||||
The `let` expression above is setting variables *a* and *b* to random
|
The `let` expression above is setting variables `a` and `b` to random
|
||||||
samples taken from collection2.
|
samples taken from collection2.
|
||||||
|
|
||||||
The `let` function then executes the `tuple` streaming expression
|
The `let` function then executes the `tuple` streaming expression
|
||||||
|
@ -491,7 +475,7 @@ which references the two variables.
|
||||||
|
|
||||||
Here is the output:
|
Here is the output:
|
||||||
|
|
||||||
[source,text]
|
[source,json]
|
||||||
----
|
----
|
||||||
{
|
{
|
||||||
"result-set": {
|
"result-set": {
|
||||||
|
@ -529,11 +513,11 @@ Here is the output:
|
||||||
}
|
}
|
||||||
----
|
----
|
||||||
|
|
||||||
== Creating arrays with `col` function
|
== Creating Arrays with col Function
|
||||||
|
|
||||||
The `col` function is used to move a column of numbers from a list of tuples into an `array`.
|
The `col` function is used to move a column of numbers from a list of tuples into an `array`.
|
||||||
This is an important function because Streaming Expressions such as `sql`, `random` and `timeseries` return tuples,
|
|
||||||
but the statistical functions operate on arrays.
|
This is an important function because streaming expressions such as `sql`, `random` and `timeseries` return tuples, but the statistical functions operate on arrays.
|
||||||
|
|
||||||
Below is an example of the `col` function:
|
Below is an example of the `col` function:
|
||||||
|
|
||||||
|
@ -546,20 +530,19 @@ let(a=random(collection2, q="*:*", rows="3", fl="price_f"),
|
||||||
tuple(sample1=c, sample2=d))
|
tuple(sample1=c, sample2=d))
|
||||||
----
|
----
|
||||||
|
|
||||||
The example above is using the `col` function to create arrays from the tuples stored in
|
The example above is using the `col` function to create arrays from the tuples stored in variables `a` and `b`.
|
||||||
variables *a* and *b*.
|
|
||||||
|
|
||||||
Variable *c* contains an array of values from the *price_f* field,
|
Variable `c` contains an array of values from the `price_f` field,
|
||||||
taken from the tuples stored in variable *a*.
|
taken from the tuples stored in variable `a`.
|
||||||
|
|
||||||
Variable *d* contains an array of values from the *price_f* field,
|
Variable `d` contains an array of values from the `price_f` field,
|
||||||
taken from the tuples stored in variable *b*.
|
taken from the tuples stored in variable `b`.
|
||||||
|
|
||||||
Also notice that the response `tuple` executed by `let` is pointing to the arrays in variables *c* and *d*.
|
Also notice that the response `tuple` executed by `let` is pointing to the arrays in variables `c` and `d`.
|
||||||
|
|
||||||
The response shows the arrays:
|
The response shows the arrays:
|
||||||
|
|
||||||
[source,text]
|
[source,json]
|
||||||
----
|
----
|
||||||
|
|
||||||
{
|
{
|
||||||
|
@ -588,61 +571,60 @@ The response shows the arrays:
|
||||||
|
|
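The role of `col` can be pictured with ordinary Python data. In this sketch each tuple is modeled as a dictionary, and the `price_f` values are invented for illustration; only the idea of pulling one field out of a list of tuples into an array comes from the text.

```python
def col(tuples, field):
    """Collect the value of `field` from every tuple into a list."""
    return [t[field] for t in tuples]


# Hypothetical sample, standing in for the tuples stored in variable `a`.
a = [{"price_f": 0.75}, {"price_f": 0.25}, {"price_f": 0.5}]

c = col(a, "price_f")
print(c)  # [0.75, 0.25, 0.5]
```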
||||||
== Statistical Programming Example
|
== Statistical Programming Example
|
||||||
|
|
||||||
We've covered how the *data structures*, *variables* and a few *statistical functions* work.
|
We've covered how the data structures, variables and a few statistical functions work. Let's dive into an example that puts these tools to use.
|
||||||
Let's dive into an example that puts these tools to use.
|
|
||||||
|
|
||||||
=== Use case
|
=== Use Case
|
||||||
|
|
||||||
We have an existing hotel in *cityA* that is very profitable.
|
We have an existing hotel in *cityA* that is very profitable.
|
||||||
We are contemplating opening up a new hotel in a different city.
|
We are contemplating opening up a new hotel in a different city.
|
||||||
We're considering 4 different cities: *cityB*, *cityC*, *cityD*, *cityE*.
|
We're considering 4 different cities: *cityB*, *cityC*, *cityD*, *cityE*.
|
||||||
We'd like to open a hotel in a city that has similar room rates to *cityA*.
|
We'd like to open a hotel in a city that has similar room rates to cityA.
|
||||||
|
|
||||||
How do we determine which of the 4 cities we're considering has room rates which are most similar to *cityA*?
|
How do we determine which of the 4 cities we're considering has room rates which are most similar to cityA?
|
||||||
|
|
||||||
=== The Data
|
=== The Data
|
||||||
|
|
||||||
We have a data set of un-aggregated hotel *bookings*. Each booking record has a rate and city.
|
We have a data set of un-aggregated hotel bookings. Each booking record has a rate and city.
|
||||||
|
|
||||||
=== Can we simply aggregate?
|
=== Can We Simply Aggregate?
|
||||||
|
|
||||||
One approach would be to aggregate the data from each city and compare the *mean* room rates. This approach will
|
One approach would be to aggregate the data from each city and compare the mean room rates. This approach will
|
||||||
give us some useful information, but the mean is a summary statistic which loses a significant amount of information
|
give us some useful information, but the mean is a summary statistic which loses a significant amount of information
|
||||||
about the data. For example we don't have an understanding of how the distribution of room rates is impacting the
|
about the data. For example, we don't have an understanding of how the distribution of room rates is impacting the
|
||||||
mean.
|
mean.
|
||||||
|
|
||||||
The *median* room rate provides another interesting data point but it's still not the entire picture. It's still just
|
The median room rate provides another interesting data point but it's still not the entire picture. It's still just
|
||||||
one point of reference.
|
one point of reference.
|
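A tiny Python sketch shows how much a summary statistic can hide. The two rate lists below are made up for illustration: they share the same mean and the same median, yet one market charges a flat rate while the other ranges widely.

```python
import statistics

# Hypothetical room rates for two markets.
flat_market = [100, 100, 100, 100, 100]
wide_market = [20, 60, 100, 140, 180]

for name, rates in (("flat", flat_market), ("wide", wide_market)):
    print(name, statistics.mean(rates), statistics.median(rates), min(rates), max(rates))
```

Both rows report a mean and median of 100, so neither statistic alone can tell these markets apart.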
||||||
|
|
||||||
Is there a way that we can compare the markets without losing valuable information in the data?
|
Is there a way that we can compare the markets without losing valuable information in the data?
|
||||||
|
|
||||||
=== K Nearest Neighbor
|
==== K-Nearest Neighbor
|
||||||
|
|
||||||
The use case we're reasoning about can often be approached using a K Nearest Neighbor (knn) algorithm.
|
The use case we're reasoning about can often be approached using a K-Nearest Neighbor (knn) algorithm.
|
||||||
|
|
||||||
With knn we use a *distance* measure to compare vectors of data to find the k nearest neighbors to
|
With knn we use a distance measure to compare vectors of data to find the k nearest neighbors to
|
||||||
a specific vector.
|
a specific vector.
|
||||||
|
|
||||||
=== Euclidean Distance
|
==== Euclidean Distance
|
||||||
|
|
||||||
The Streaming Expression statistical function library has a function called `distance`. The `distance` function
|
The streaming expression statistical function library has a function called `distance`. The `distance` function
|
||||||
computes the Euclidean distance between two vectors. This looks promising for comparing vectors of room rates.
|
computes the Euclidean distance between two vectors. This looks promising for comparing vectors of room rates.
|
||||||
|
|
||||||
=== Vectors
|
==== Vectors
|
||||||
|
|
||||||
But how do we create the vectors from our data set? Remember we have un-aggregated room rates from each of the cities.
|
But how do we create the vectors from our data set? Remember we have un-aggregated room rates from each of the cities.
|
||||||
How can we vectorize the data so it can be compared using the `distance` function?
|
How can we vectorize the data so it can be compared using the `distance` function?
|
||||||
|
|
||||||
We have a Streaming Expression that can retrieve a *random sample* from each of the cities. The name of this
|
We have a streaming expression that can retrieve a random sample from each of the cities. The name of this
|
||||||
expression is `random`. So we could take a random sample of 1000 room rates from each of the five cities.
|
expression is `random`. So we could take a random sample of 1000 room rates from each of the five cities.
|
||||||
|
|
||||||
But random vectors of room rates are not comparable because the distance algorithm compares values at each index
|
But random vectors of room rates are not comparable because the distance algorithm compares values at each index
|
||||||
in the vector. How can we make these vectors comparable?
|
in the vector. How can we make these vectors comparable?
|
||||||
|
|
||||||
We can make them comparable by *sorting* them. Then as the distance algorithm moves along the vectors it will be
|
We can make them comparable by sorting them. Then as the distance algorithm moves along the vectors it will be
|
||||||
comparing room rates from lowest to highest in both cities.
|
comparing room rates from lowest to highest in both cities.
|
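The effect of sorting can be seen with a short Python sketch. The `distance` function here is a plain Euclidean distance written for illustration, and the two samples are invented: they contain the same rates in a different order.

```python
import math


def distance(xs, ys):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))


# Two hypothetical samples holding the same rates in different order.
sample_1 = [120.0, 80.0, 100.0]
sample_2 = [80.0, 100.0, 120.0]

unsorted_distance = distance(sample_1, sample_2)
sorted_distance = distance(sorted(sample_1), sorted(sample_2))
print(unsorted_distance)  # greater than zero: the index-by-index comparison is misleading
print(sorted_distance)    # 0.0: the sorted vectors line up exactly
```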
||||||
|
|
||||||
=== The code
|
=== The Code
|
||||||
|
|
||||||
[source,text]
|
[source,text]
|
||||||
----
|
----
|
||||||
|
@ -664,16 +646,16 @@ let(cityA=sort(random(bookings, q="city:cityA", rows="1000", fl="rate_d"), by="r
|
||||||
tuple(city=E, distance=distance(ratesA, ratesE)))))
|
tuple(city=E, distance=distance(ratesA, ratesE)))))
|
||||||
----
|
----
|
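For readers who want to trace the logic outside of Solr, the same steps (sample, sort, extract the rate column, compute distances) can be sketched in Python. Everything below is an assumption made for illustration: the bookings are generated rather than read from a collection, the helper names are hypothetical, and the final ranking step is added here for readability.

```python
import math
import random


def distance(xs, ys):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))


def sorted_sample(bookings, city, rows, rng):
    """Mimic sort(random(...)) followed by col(..., rate_d) for one city."""
    rates = [b["rate_d"] for b in bookings if b["city"] == city]
    return sorted(rng.sample(rates, rows))


# Hypothetical bookings: each city's rates are cityA's rates plus a fixed offset.
offsets = {"cityA": 0.0, "cityB": 40.0, "cityC": 5.0, "cityD": 75.0, "cityE": 20.0}
bookings = [
    {"city": city, "rate_d": 90.0 + 2.0 * i + offset}
    for city, offset in offsets.items()
    for i in range(20)
]

rng = random.Random(42)
rows = 20  # the whole population here, so the result does not depend on the seed
rates_a = sorted_sample(bookings, "cityA", rows, rng)

# Rank the candidate cities by distance to cityA, nearest first.
ranked = sorted(
    (distance(rates_a, sorted_sample(bookings, city, rows, rng)), city)
    for city in offsets
    if city != "cityA"
)
for dist, city in ranked:
    print(city, round(dist, 2))
```

With this invented data the city with the smallest offset from cityA's rates comes out nearest, which is the behavior the distance comparison is meant to capture.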
||||||
|
|
||||||
==== The code explained
|
=== The Code Explained
|
||||||
|
|
||||||
The `let` expression sets variables first.
|
The `let` expression sets variables first.
|
||||||
|
|
||||||
The first 5 variables (cityA, cityB, cityC, cityD, cityE) contain the random samples from the `bookings` collection.
|
The first 5 variables (cityA, cityB, cityC, cityD, cityE) contain the random samples from the `bookings` collection.
|
||||||
the `random` function is pulling 1000 random samples from each city and including the `rate_d` field in the
|
The `random` function is pulling 1000 random samples from each city and including the `rate_d` field in the
|
||||||
tuples that are returned.
|
tuples that are returned.
|
||||||
|
|
||||||
The `random` function is wrapped by a `sort` function which is sorting the tuples in
|
The `random` function is wrapped by a `sort` function which is sorting the tuples in
|
||||||
ascending order based on the rate_d field.
|
ascending order based on the `rate_d` field.
|
||||||
|
|
||||||
The next five variables (ratesA, ratesB, ratesC, ratesD, ratesE) contain the arrays of room rates for each
|
The next five variables (ratesA, ratesB, ratesC, ratesD, ratesE) contain the arrays of room rates for each
|
||||||
city. The `col` function is used to move the `rate_d` field from the random sample tuples
|
city. The `col` function is used to move the `rate_d` field from the random sample tuples
|
||||||
|
|