lucene/solr/solr-ref-guide/src/result-grouping.adoc

277 lines
12 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

= Result Grouping
:page-shortname: result-grouping
:page-permalink: result-grouping.html
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
Result Grouping groups documents with a common field value into groups and returns the top documents for each group.
For example, if you searched for "DVD" on an electronic retailer's e-commerce site, you might be returned three categories such as "TV and Video", "Movies", and "Computers" with three results per category. In this case, the query term "DVD" appeared in all three categories, so Solr groups them together in order to increase relevancy for the user.
.Prefer Collapse & Expand instead
[NOTE]
====
Solr's <<collapse-and-expand-results.adoc#collapse-and-expand-results,Collapse and Expand>> feature is newer and mostly overlaps with Result Grouping. There are features unique to both, and they have different performance characteristics. That said, in most cases Collapse and Expand is preferable to Result Grouping.
====
Result Grouping is separate from <<faceting.adoc#faceting,Faceting>>. Though it is conceptually similar, faceting returns all relevant results and allows the user to refine the results based on the facet category. For example, if you search for "shoes" on a footwear retailer's e-commerce site, Solr would return all results for that query term, along with selectable facets such as "size," "color," "brand," and so on.
You can however combine grouping with faceting. Grouped faceting supports `facet.field` and `facet.range` but currently doesn't support date and pivot faceting. The facet counts are computed based on the first `group.field` parameter, and other `group.field` parameters are ignored.
Grouped faceting differs from non grouped facets `(sum of all facets) == (total of products with that property)` as shown in the following example:
Object 1
* name: Phaser 4620a
* ppm: 62
* product_range: 6
Object 2
* name: Phaser 4620i
* ppm: 65
* product_range: 6
Object 3
* name: ML6512
* ppm: 62
* product_range: 7
If you ask Solr to group these documents by "product_range", then the total amount of groups is 2, but the facets for ppm are 2 for 62 and 1 for 65.
[[ResultGrouping-RequestParameters]]
== Request Parameters
Result Grouping takes the following request parameters. Any number of these request parameters can be included in a single request:
`group`::
If `true`, query results will be grouped.
`group.field`::
The name of the field by which to group results. The field must be single-valued, and either be indexed or a field type that has a value source and works in a function query, such as `ExternalFileField`. It must also be a string-based field, such as `StrField` or `TextField`
`group.func`::
Group based on the unique values of a function query.
+
NOTE: This option does not work with <<ResultGrouping-DistributedResultGroupingCaveats,distributed searches>>.
`group.query`::
Return a single group of documents that match the given query.
`rows`::
The number of groups to return. The default value is `10`.
`start`::
Specifies an initial offset for the list of groups.
`group.limit`::
Specifies the number of results to return for each group. The default value is `1`.
`group.offset`::
Specifies an initial offset for the document list of each group.
`sort`::
Specifies how Solr sorts the groups relative to each other. For example, `sort=popularity desc` will cause the groups to be sorted according to the highest popularity document in each group. The default value is `score desc`.
`group.sort`::
Specifies how Solr sorts documents within each group. The default behavior if `group.sort` is not specified is to use the same effective value as the `sort` parameter.
`group.format`::
If this parameter is set to `simple`, the grouped documents are presented in a single flat list, and the `start` and `rows` parameters affect the numbers of documents instead of groups. An alternate value for this parameter is `grouped`.
`group.main`::
If `true`, the result of the first field grouping command is used as the main result list in the response, using `group.format=simple`.
`group.ngroups`::
If `true`, Solr includes the number of groups that have matched the query in the results. The default value is false.
+
See below for <<ResultGrouping-DistributedResultGroupingCaveats,Distributed Result Grouping Caveats>> when using sharded indexes
`group.truncate`::
If `true`, facet counts are based on the most relevant document of each group matching the query. The default value is `false`.
`group.facet`::
Determines whether to compute grouped facets for the field facets specified in facet.field parameters. Grouped facets are computed based on the first specified group. As with normal field faceting, fields shouldn't be tokenized (otherwise counts are computed for each token). Grouped faceting supports single and multivalued fields. Default is `false`.
+
WARNING: There can be a heavy performance cost to this option.
+
See below for <<ResultGrouping-DistributedResultGroupingCaveats,Distributed Result Grouping Caveats>> when using sharded indexes.
`group.cache.percent`::
Setting this parameter to a number greater than 0 enables caching for result grouping. Result Grouping executes two searches; this option caches the second search. The default value is `0`. The maximum value is `100`.
+
Testing has shown that group caching only improves search time with Boolean, wildcard, and fuzzy queries. For simple queries like term or "match all" queries, group caching degrades performance.
Any number of group commands (e.g., `group.field`, `group.func`, `group.query`, etc.) may be specified in a single request.
[[ResultGrouping-Examples]]
== Examples
All of the following sample queries work with Solr's "`bin/solr -e techproducts`" example.
[[ResultGrouping-GroupingResultsbyField]]
=== Grouping Results by Field
In this example, we will group results based on the `manu_exact` field, which specifies the manufacturer of the items in the sample dataset.
`\http://localhost:8983/solr/techproducts/select?wt=json&indent=true&fl=id,name&q=solr+memory&group=true&group.field=manu_exact`
[source,json]
----
{
"..."
"grouped":{
"manu_exact":{
"matches":6,
"groups":[{
"groupValue":"Apache Software Foundation",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"SOLR1000",
"name":"Solr, the Enterprise Search Server"}]
}},
{
"groupValue":"Corsair Microsystems Inc.",
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"VS1GB400C3",
"name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail"}]
}},
{
"groupValue":"A-DATA Technology Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"VDBDB1A16",
"name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM"}]
}},
{
"groupValue":"Canon Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"0579B002",
"name":"Canon PIXMA MP500 All-In-One Photo Printer"}]
}},
{
"groupValue":"ASUS Computer Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"EN7800GTX/2DHTV/256M",
"name":"ASUS Extreme N7800GTX/2DHTV (256 MB)"}]
}
}]}}}
----
The response indicates that there are six total matches for our query. For each of the five unique values of `group.field`, Solr returns a `docList` for that `groupValue` such that the `numFound` indicates the total number of documents in that group, and the top documents are returned according to the implicit default `group.limit=1` and `group.sort=score desc` parameters. The resulting groups are then sorted by the score of the top document within each group based on the implicit `sort=score desc`, and the number of groups returned is limited to the implicit `rows=10`.
We can run the same query with the request parameter `group.main=true`. This will format the results as a single flat document list. This flat format does not include as much information as the normal result grouping query results notably the `numFound` in each group but it may be easier for existing Solr clients to parse.
`\http://localhost:8983/solr/techproducts/select?wt=json&indent=true&fl=id,name,manufacturer&q=solr+memory&group=true&group.field=manu_exact&group.main=true`
[source,json]
----
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"fl":"id,name,manufacturer",
"indent":"true",
"q":"solr memory",
"group.field":"manu_exact",
"group.main":"true",
"group":"true",
"wt":"json"}},
"grouped":{},
"response":{"numFound":6,"start":0,"docs":[
{
"id":"SOLR1000",
"name":"Solr, the Enterprise Search Server"},
{
"id":"VS1GB400C3",
"name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail"},
{
"id":"VDBDB1A16",
"name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM"},
{
"id":"0579B002",
"name":"Canon PIXMA MP500 All-In-One Photo Printer"},
{
"id":"EN7800GTX/2DHTV/256M",
"name":"ASUS Extreme N7800GTX/2DHTV (256 MB)"}]
}
}
----
[[ResultGrouping-GroupingbyQuery]]
=== Grouping by Query
In this example, we will use the `group.query` parameter to find the top three results for "memory" in two different price ranges: 0.00 to 99.99, and over 100.
`\http://localhost:8983/solr/techproducts/select?wt=json&indent=true&fl=name,price&q=memory&group=true&group.query=price:[0+TO+99.99]&group.query=price:[100+TO+*]&group.limit=3`
[source,json]
----
{
"responseHeader":{
"status":0,
"QTime":42,
"params":{
"fl":"name,price",
"indent":"true",
"q":"memory",
"group.limit":"3",
"group.query":["price:[0 TO 99.99]",
"price:[100 TO *]"],
"group":"true",
"wt":"json"}},
"grouped":{
"price:[0 TO 99.99]":{
"matches":5,
"doclist":{"numFound":1,"start":0,"docs":[
{
"name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail",
"price":74.99}]
}},
"price:[100 TO *]":{
"matches":5,
"doclist":{"numFound":3,"start":0,"docs":[
{
"name":"CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail",
"price":185.0},
{
"name":"Canon PIXMA MP500 All-In-One Photo Printer",
"price":179.99},
{
"name":"ASUS Extreme N7800GTX/2DHTV (256 MB)",
"price":479.95}]
}
}
}
}
----
In this case, Solr found five matches for "memory," but only returns four results grouped by price. This is because one result for "memory" did not have a price assigned to it.
[[ResultGrouping-DistributedResultGroupingCaveats]]
== Distributed Result Grouping Caveats
Grouping is supported for <<solrcloud.adoc#solrcloud,distributed searches>>, with some caveats:
* Currently `group.func` is is not supported in any distributed searches
* `group.ngroups` and `group.facet` require that all documents in each group must be co-located on the same shard in order for accurate counts to be returned. <<shards-and-indexing-data-in-solrcloud.adoc#shards-and-indexing-data-in-solrcloud,Document routing via composite keys>> can be a useful solution in many situations.