SOLR-12913: Add new facet expression and pivot docs

Joel Bernstein 2018-11-07 15:07:21 -05:00
parent ff1df8a15c
commit 531b16633a
2 changed files with 90 additions and 4 deletions

@@ -130,8 +130,12 @@ The `facet` function provides aggregations that are rolled up over buckets. Unde
* `collection`: (Mandatory) Collection the facets will be aggregated from.
* `q`: (Mandatory) The query to build the aggregations from.
* `buckets`: (Mandatory) Comma separated list of fields to rollup over. The comma separated list represents the dimensions in a multi-dimensional rollup.
* `bucketSorts`: Comma separated list of sorts to apply to each dimension in the buckets parameters. Sorts can be on the computed metrics or on the bucket values.
* `bucketSizeLimit`: The number of buckets to include. This value is applied to each dimension. '-1' will fetch all the buckets.
* `bucketSorts`: (Mandatory) Comma separated list of sorts to apply to each dimension in the buckets parameters. Sorts can be on the computed metrics or on the bucket values.
* `rows`: (Default 10) The number of rows to return. '-1' will return all rows.
* `offset`: (Default 0) The offset in the result set to start from.
* `overfetch`: (Default 150) Over-fetching is used to provide accurate aggregations over high-cardinality fields.
* `method`: The JSON facet API aggregation method.
* `bucketSizeLimit`: Sets the absolute number of rows to fetch. This is incompatible with the `rows`, `offset`, and `overfetch` parameters. This value is applied to each dimension. '-1' will fetch all the buckets.
* `metrics`: List of metrics to compute for the buckets. Currently supported metrics are `sum(col)`, `avg(col)`, `min(col)`, `max(col)`, `count(*)`.
=== facet Syntax
@@ -144,7 +148,7 @@ facet(collection1,
q="*:*",
buckets="a_s",
bucketSorts="sum(a_i) desc",
bucketSizeLimit=100,
rows=100,
sum(a_i),
sum(a_f),
min(a_i),
@@ -166,7 +170,8 @@ facet(collection1,
q="*:*",
buckets="year_i, month_i, day_i",
bucketSorts="year_i desc, month_i desc, day_i desc",
bucketSizeLimit=100,
rows=10,
offset=20,
sum(a_i),
sum(a_f),
min(a_i),
@@ -179,6 +184,7 @@ facet(collection1,
----
The example above shows a facet function with rollups over three buckets, where the buckets are returned in descending order by bucket value.
The `rows` parameter limits the result to 10 rows, and the `offset` parameter starts the result set at the 20th row.
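The remaining paging parameter, `overfetch`, can be combined with `rows` and `offset` in the same way. A sketch (the parameter values here are illustrative):

[source,text]
----
facet(collection1,
      q="*:*",
      buckets="a_s",
      bucketSorts="count(*) desc",
      rows=10,
      offset=20,
      overfetch=250,
      count(*))
----

Raising `overfetch` above its default of 150 trades extra per-shard work for more accurate aggregations over high-cardinality fields.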
== features

@@ -31,6 +31,12 @@ to vectorize and analyze the results sets.
Below are some of the key stream sources:
* *`facet`*: Multi-dimensional aggregations are a powerful tool for generating
co-occurrence counts for categorical data. The `facet` function uses the JSON facet API
under the covers to provide fast, distributed, multi-dimension aggregations. With math expressions
the aggregated results can be pivoted into a co-occurrence matrix which can be mined for
correlations and hidden similarities within the data.
* *`random`*: Random sampling is widely used in statistics, probability and machine learning.
The `random` function returns a random sample of search results that match a
query. The random samples can be vectorized and operated on by math expressions and the results
@@ -242,6 +248,80 @@ When this expression is sent to the `/stream` handler it responds with:
}
----
== Facet Co-Occurrence Matrices
The `facet` function can be used to quickly perform multi-dimensional aggregations of categorical data from
records stored in a SolrCloud collection. These multi-dimensional aggregations can represent co-occurrence
counts for the values in the dimensions. The `pivot` function can be used to move two-dimensional
aggregations into a co-occurrence matrix. The co-occurrence matrix can then be clustered or analyzed for
correlations to learn about the hidden connections within the data.
In the example below the `facet` expression is used to generate a two-dimensional faceted aggregation.
The first dimension is the US State that a car was purchased in and the second dimension is the car model.
The two-dimensional facet generates the co-occurrence counts for the number of times a particular car model
was purchased in a particular state.
[source,text]
----
facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows=5, count(*))
----
When this expression is sent to the `/stream` handler it responds with:
[source,json]
----
{
"result-set": {
"docs": [
{
"state": "NY",
"model": "camry",
"count(*)": 13342
},
{
"state": "NJ",
"model": "accord",
"count(*)": 13002
},
{
"state": "NY",
"model": "civic",
"count(*)": 12901
},
{
"state": "CA",
"model": "focus",
"count(*)": 12892
},
{
"state": "TX",
"model": "f150",
"count(*)": 12871
},
{
"EOF": true,
"RESPONSE_TIME": 171
}
]
}
}
----
The `pivot` function can be used to move the facet results into a co-occurrence matrix. In the example below
the `pivot` function is used to create a matrix where the rows of the matrix are the US States (state) and the
columns of the matrix are the car models (model). The values in the matrix are the co-occurrence counts (count(*))
from the facet results. Once the co-occurrence matrix has been created the US States can be clustered
by car model, or the matrix can be transposed and car models can be clustered by the US States
where they were bought.
[source,text]
----
let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)),
b=pivot(a, state, model, count(*)),
c=kmeans(b, 7))
----
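As described above, the matrix can also be transposed so that the car models become the rows. A sketch using the `transpose` matrix function (the cluster count of 7 is carried over from the example above and is illustrative):

[source,text]
----
let(a=facet(collection1, q="*:*", buckets="state, model", bucketSorts="count(*) desc", rows="-1", count(*)),
    b=pivot(a, state, model, count(*)),
    c=transpose(b),
    d=kmeans(c, 7))
----

Here `kmeans` clusters the car models by the US States in which they were bought.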
== Latitude / Longitude Vectors
The `latlonVectors` function wraps a list of tuples and parses a lat/lon location field into