diff --git a/solr/solr-ref-guide/src/stream-source-reference.adoc b/solr/solr-ref-guide/src/stream-source-reference.adoc
index c31639a3bf7..c83991e521b 100644
--- a/solr/solr-ref-guide/src/stream-source-reference.adoc
+++ b/solr/solr-ref-guide/src/stream-source-reference.adoc
@@ -130,8 +130,12 @@ The `facet` function provides aggregations that are rolled up over buckets. Unde
 * `collection`: (Mandatory) Collection the facets will be aggregated from.
 * `q`: (Mandatory) The query to build the aggregations from.
 * `buckets`: (Mandatory) Comma separated list of fields to rollup over. The comma separated list represents the dimensions in a multi-dimensional rollup.
-* `bucketSorts`: Comma separated list of sorts to apply to each dimension in the buckets parameters. Sorts can be on the computed metrics or on the bucket values.
-* `bucketSizeLimit`: The number of buckets to include. This value is applied to each dimension. '-1' will fetch all the buckets.
+* `bucketSorts`: (Mandatory) Comma separated list of sorts to apply to each dimension in the buckets parameters. Sorts can be on the computed metrics or on the bucket values.
+* `rows`: (Default 10) The number of rows to return. '-1' will return all rows.
+* `offset`: (Default 0) The offset in the result set to start from.
+* `overfetch`: (Default 150) Over-fetching is used to provide accurate aggregations over high cardinality fields.
+* `method`: The JSON facet API aggregation method.
+* `bucketSizeLimit`: Sets the absolute number of rows to fetch. This is incompatible with rows, offset and overfetch. This value is applied to each dimension. '-1' will fetch all the buckets.
 * `metrics`: List of metrics to compute for the buckets. Currently supported metrics are `sum(col)`, `avg(col)`, `min(col)`, `max(col)`, `count(*)`.
 === facet Syntax
@@ -144,7 +148,7 @@ facet(collection1,
       q="*:*",
       buckets="a_s",
       bucketSorts="sum(a_i) desc",
-      bucketSizeLimit=100,
+      rows=100,
       sum(a_i),
       sum(a_f),
       min(a_i),
@@ -166,7 +170,8 @@ facet(collection1,
       q="*:*",
       buckets="year_i, month_i, day_i",
       bucketSorts="year_i desc, month_i desc, day_i desc",
-      bucketSizeLimit=100,
+      rows=10,
+      offset=20,
       sum(a_i),
       sum(a_f),
       min(a_i),
@@ -179,6 +184,7 @@ facet(collection1,
 ----
 
 The example above shows a facet function with rollups over three buckets, where the buckets are returned in descending order by bucket value.
+The rows param returns 10 rows and the offset param starts returning rows from the 20th row.
 
 == features
diff --git a/solr/solr-ref-guide/src/vectorization.adoc b/solr/solr-ref-guide/src/vectorization.adoc
index 5fdfadc3deb..acd56ec0f8e 100644
--- a/solr/solr-ref-guide/src/vectorization.adoc
+++ b/solr/solr-ref-guide/src/vectorization.adoc
@@ -31,6 +31,12 @@ to vectorize and analyze the results sets.
 
 Below are some of the key stream sources:
 
+* *`facet`*: Multi-dimensional aggregations are a powerful tool for generating
+co-occurrence counts for categorical data. The `facet` function uses the JSON facet API
+under the covers to provide fast, distributed, multi-dimension aggregations. With math expressions
+the aggregated results can be pivoted into a co-occurrence matrix which can be mined for
+correlations and hidden similarities within the data.
+
 * *`random`*: Random sampling is widely used in statistics, probability and machine learning.
 The `random` function returns a random sample of search results that match a query.
 The random samples can be vectorized and operated on by math expressions and the results
@@ -242,6 +248,80 @@ When this expression is sent to the `/stream` handler it responds with:
 }
 ----
 
+== Facet Co-Occurrence Matrices
+
+The `facet` function can be used to quickly perform multi-dimension aggregations of categorical data from
+records stored in a SolrCloud collection.
+These multi-dimension aggregations can represent co-occurrence
+counts for the values in the dimensions. The `pivot` function can be used to move two-dimensional
+aggregations into a co-occurrence matrix. The co-occurrence matrix can then be clustered or analyzed for
+correlations to learn about the hidden connections within the data.
+
+In the example below the `facet` expression is used to generate a two-dimensional faceted aggregation.
+The first dimension is the US State that a car was purchased in and the second dimension is the car model.
+The two-dimensional facet generates the co-occurrence counts for the number of times a particular car model
+was purchased in a particular state.
+
+
+[source,text]
+----
+facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows=5, count(*))
+----
+
+When this expression is sent to the `/stream` handler it responds with:
+
+[source,json]
+----
+{
+  "result-set": {
+    "docs": [
+      {
+        "state": "NY",
+        "model": "camry",
+        "count(*)": 13342
+      },
+      {
+        "state": "NJ",
+        "model": "accord",
+        "count(*)": 13002
+      },
+      {
+        "state": "NY",
+        "model": "civic",
+        "count(*)": 12901
+      },
+      {
+        "state": "CA",
+        "model": "focus",
+        "count(*)": 12892
+      },
+      {
+        "state": "TX",
+        "model": "f150",
+        "count(*)": 12871
+      },
+      {
+        "EOF": true,
+        "RESPONSE_TIME": 171
+      }
+    ]
+  }
+}
+----
+
+The `pivot` function can be used to move the facet results into a co-occurrence matrix. In the example below
+the `pivot` function is used to create a matrix where the rows of the matrix are the US States (state) and the
+columns of the matrix are the car models (model). The values in the matrix are the co-occurrence counts (count(*))
+from the facet results. Once the co-occurrence matrix has been created the US States can be clustered
+by car model, or the matrix can be transposed and car models can be clustered by the US States
+where they were bought.
+
+[source,text]
+----
+let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)),
+    b=pivot(a, state, model, count(*)),
+    c=kmeans(b, 7))
+----
+
 == Latitude / Longitude Vectors
 
 The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into