SOLR-12913: Add new facet expression and pivot docs

Joel Bernstein 2018-11-07 15:07:21 -05:00
parent ff1df8a15c
commit 531b16633a
2 changed files with 90 additions and 4 deletions

@@ -130,8 +130,12 @@ The `facet` function provides aggregations that are rolled up over buckets. Unde
* `collection`: (Mandatory) Collection the facets will be aggregated from.
* `q`: (Mandatory) The query to build the aggregations from.
* `buckets`: (Mandatory) Comma separated list of fields to rollup over. The comma separated list represents the dimensions in a multi-dimensional rollup.
* `bucketSorts`: Comma separated list of sorts to apply to each dimension in the buckets parameters. Sorts can be on the computed metrics or on the bucket values.
* `bucketSizeLimit`: The number of buckets to include. This value is applied to each dimension. '-1' will fetch all the buckets.
* `bucketSorts`: (Mandatory) Comma separated list of sorts to apply to each dimension in the buckets parameters. Sorts can be on the computed metrics or on the bucket values.
* `rows`: (Default 10) The number of rows to return. '-1' will return all rows.
* `offset`: (Default 0) The offset in the result set to start from.
* `overfetch`: (Default 150) Over-fetching is used to provide accurate aggregations over high-cardinality fields.
* `method`: The JSON facet API aggregation method.
* `bucketSizeLimit`: Sets the absolute number of rows to fetch. This is incompatible with the `rows`, `offset`, and `overfetch` parameters. This value is applied to each dimension. '-1' will fetch all the buckets.
* `metrics`: List of metrics to compute for the buckets. Currently supported metrics are `sum(col)`, `avg(col)`, `min(col)`, `max(col)`, `count(*)`.
=== facet Syntax
@@ -144,7 +148,7 @@ facet(collection1,
q="*:*",
buckets="a_s",
bucketSorts="sum(a_i) desc",
bucketSizeLimit=100,
rows=100,
sum(a_i),
sum(a_f),
min(a_i),
@@ -166,7 +170,8 @@ facet(collection1,
q="*:*",
buckets="year_i, month_i, day_i",
bucketSorts="year_i desc, month_i desc, day_i desc",
bucketSizeLimit=100,
rows=10,
offset=20,
sum(a_i),
sum(a_f),
min(a_i),
@@ -179,6 +184,7 @@ facet(collection1,
----
The example above shows a facet function with rollups over three buckets, where the buckets are returned in descending order by bucket value.
The `rows` parameter limits the result to 10 rows, and the `offset` parameter starts the result set at the 20th row.
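The remaining paging parameter, `overfetch`, can be combined with `rows` and `offset` in the same way. A sketch (the parameter values here are illustrative):

[source,text]
----
facet(collection1,
      q="*:*",
      buckets="a_s",
      bucketSorts="count(*) desc",
      rows=10,
      offset=20,
      overfetch=250,
      count(*))
----

Raising `overfetch` above its default of 150 trades extra per-shard work for more accurate aggregations over high-cardinality fields.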
== features

@@ -31,6 +31,12 @@ to vectorize and analyze the results sets.
Below are some of the key stream sources:
* *`facet`*: Multi-dimensional aggregations are a powerful tool for generating
co-occurrence counts for categorical data. The `facet` function uses the JSON facet API
under the covers to provide fast, distributed, multi-dimension aggregations. With math expressions
the aggregated results can be pivoted into a co-occurrence matrix which can be mined for
correlations and hidden similarities within the data.
* *`random`*: Random sampling is widely used in statistics, probability and machine learning.
The `random` function returns a random sample of search results that match a
query. The random samples can be vectorized and operated on by math expressions and the results
@@ -242,6 +248,80 @@ When this expression is sent to the `/stream` handler it responds with:
}
----
== Facet Co-Occurrence Matrices
The `facet` function can be used to quickly perform multi-dimensional aggregations of categorical data from
records stored in a SolrCloud collection. These multi-dimensional aggregations can represent co-occurrence
counts for the values in the dimensions. The `pivot` function can be used to move two-dimensional
aggregations into a co-occurrence matrix. The co-occurrence matrix can then be clustered or analyzed for
correlations to learn about the hidden connections within the data.
In the example below the `facet` expression is used to generate a two-dimensional faceted aggregation.
The first dimension is the US State that a car was purchased in and the second dimension is the car model.
The two-dimensional facet generates the co-occurrence counts for the number of times a particular car model
was purchased in a particular state.
[source,text]
----
facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows=5, count(*))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"state": "NY",
"model": "camry",
"count(*)": 13342
},
{
"state": "NJ",
"model": "accord",
"count(*)": 13002
},
{
"state": "NY",
"model": "civic",
"count(*)": 12901
},
{
"state": "CA",
"model": "focus",
"count(*)": 12892
},
{
"state": "TX",
"model": "f150",
"count(*)": 12871
},
{
"EOF": true,
"RESPONSE_TIME": 171
}
]
}
}
----
The `pivot` function can be used to move the facet results into a co-occurrence matrix. In the example below
the `pivot` function is used to create a matrix where the rows of the matrix are the US States (state) and the
columns of the matrix are the car models (model). The values in the matrix are the co-occurrence counts (count(*))
from the facet results. Once the co-occurrence matrix has been created the US States can be clustered
by car model, or the matrix can be transposed and car models can be clustered by the US States
where they were bought.
[source,text]
----
let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)),
b=pivot(a, state, model, count(*)),
c=kmeans(b, 7))
----
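As described above, the matrix can also be transposed so that the car models become the rows. A sketch using the `transpose` matrix function (the cluster count of 7 is carried over from the example above and is illustrative):

[source,text]
----
let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)),
    b=pivot(a, state, model, count(*)),
    c=transpose(b),
    d=kmeans(c, 7))
----

Here `kmeans` clusters the car models by the US States in which they were bought.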
== Latitude / Longitude Vectors
The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into