Add compound query types documentation (#4390)

* Add compound query documentation

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Add function score queries

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Add a geospatial example

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Add minimum should match parameter

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Updated boosting response

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Update _query-dsl/query-dsl/compound/constant-score.md

Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Reword example

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
This commit is contained in:
kolchfa-aws 2023-07-07 14:03:30 -04:00 committed by GitHub
parent 07c4019e33
commit 178a7d5301
8 changed files with 1763 additions and 21 deletions

View File

@ -12,23 +12,18 @@ redirect_from:
# Boolean queries
You can perform a Boolean query with the `bool` query type. A Boolean query compounds query clauses so you can combine multiple search queries with Boolean logic. To narrow or broaden your search results, use the `bool` query clause rules.
A Boolean (`bool`) query can combine several query clauses into one advanced query. The clauses are combined with Boolean logic to find matching documents returned in the results.
As a compound query type, `bool` allows you to construct an advanced query by combining several simple queries.
Use the following query clauses within a `bool` query:
Use the following rules to define how to combine multiple sub-query clauses within a `bool` query:
Clause rule | Behavior
Clause | Behavior
:--- | :---
`must` | Logical `and` operator. The results must match the queries in this clause. If you have multiple queries, all of them must match.
`must` | Logical `and` operator. The results must match all queries in this clause.
`must_not` | Logical `not` operator. All matches are excluded from the results.
`should` | Logical `or` operator. The results must match at least one of the queries, but, optionally, they can match more than one query. Each matching `should` clause increases the relevancy score. You can set the minimum number of queries that must match using the `minimum_number_should_match` parameter.
`minimum_number_should_match` | Optional parameter for use with a `should` query clause. Specifies the minimum number of queries that the document must match for it to be returned in the results. The default value is 1.
`filter` | Logical `and` operator that is applied first to reduce your dataset before applying the queries. A query within a filter clause is a yes or no option. If a document matches the query, it is returned in the results; otherwise, it is not. The results of a filter query are generally cached to allow for a faster return. Use the filter query to filter the results based on exact matches, ranges, dates, numbers, and so on.
`should` | Logical `or` operator. The results must match at least one of the queries. Matching more `should` clauses increases the document's relevance score. You can set the minimum number of queries that must match using the [`minimum_should_match`]({{site.url}}{{site.baseurl}}/query-dsl/query-dsl/minimum-should-match/) parameter. If a query contains a `must` or `filter` clause, the default `minimum_should_match` value is 0. Otherwise, the default `minimum_should_match` value is 1.
`filter` | Logical `and` operator that is applied first to reduce your dataset before applying the queries. A query within a filter clause is a yes or no option. If a document matches the query, it is returned in the results; otherwise, it is not. The results of a filter query are generally cached to allow for a faster return. Use the filter query to filter the results based on exact matches, ranges, dates, or numbers.
### Boolean query structure
The structure of a Boolean query contains the `bool` query type followed by clause rules, as follows:
A Boolean query has the following structure:
```json
GET _search
@ -54,9 +49,9 @@ For example, assume you have the complete works of Shakespeare indexed in an Ope
1. The `text_entry` field must contain the word `love` and should contain either `life` or `grace`.
2. The `speaker` field must not contain `ROMEO`.
3. Filter these results to the play `Romeo and Juliet` without affecting the relevancy score.
3. Filter these results to the play `Romeo and Juliet` without affecting the relevance score.
Use the following query:
These requirements can be combined in the following query:
```json
GET shakespeare/_search
@ -100,7 +95,7 @@ GET shakespeare/_search
}
```
#### Sample output
The response contains matching documents:
```json
{
@ -141,7 +136,6 @@ GET shakespeare/_search
If you want to identify which of these clauses actually caused the matching results, name each query with the `_name` parameter.
To add the `_name` parameter, change the field name in the `match` query to an object:
```json
GET shakespeare/_search
{
@ -206,10 +200,10 @@ OpenSearch returns a `matched_queries` array that lists the queries that matched
```
If you remove the queries not in this list, you will still see the exact same result.
By examining which `should` clause matched, you can better understand the relevancy score of the results.
By examining which `should` clause matched, you can better understand the relevance score of the results.
You can also construct complex Boolean expressions by nesting `bool` queries.
For example, to find a `text_entry` field that matches (`love` OR `hate`) AND (`life` OR `grace`) in the play `Romeo and Juliet`:
For example, use the following query to find a `text_entry` field that matches (`love` OR `hate`) AND (`life` OR `grace`) in the play `Romeo and Juliet`:
```json
GET shakespeare/_search
@ -260,7 +254,7 @@ GET shakespeare/_search
}
```
#### Sample output
The response contains matching documents:
```json
{

View File

@ -0,0 +1,158 @@
---
layout: default
title: Boosting queries
parent: Compound queries
grand_parent: Query DSL
nav_order: 30
---
# Boosting queries
If you're searching for the word "pitcher", your results may relate to either baseball players or containers for liquids. For a search in the context of baseball, you might want to completely exclude results that contain the words "glass" or "water" by using the `must_not` clause. However, if you want to keep those results but downgrade them in relevance, you can do so with `boosting` queries.
A `boosting` query returns documents that match a `positive` query. Among those documents, the ones that also match the `negative` query are scored lower in relevance (their relevance score is multiplied by the negative boosting factor).
## Example
Consider an index with two documents that you index as follows:
```json
PUT testindex/_doc/1
{
"article_name": "The greatest pitcher in baseball history"
}
```
```json
PUT testindex/_doc/2
{
"article_name": "The making of a glass pitcher"
}
```
Use the following match query to search for documents containing the word "pitcher":
```json
GET testindex/_search
{
"query": {
"match": {
"article_name": "pitcher"
}
}
}
```
Both returned documents have the same relevance score:
```json
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.18232156,
"hits": [
{
"_index": "testindex",
"_id": "1",
"_score": 0.18232156,
"_source": {
"article_name": "The greatest pitcher in baseball history"
}
},
{
"_index": "testindex",
"_id": "2",
"_score": 0.18232156,
"_source": {
"article_name": "The making of a glass pitcher"
}
}
]
}
}
```
Now use the following `boosting` query to search for documents containing the word "pitcher" but downgrade the documents that contain the words "glass", "crystal", or "water":
```json
GET testindex/_search
{
"query": {
"boosting": {
"positive": {
"match": {
"article_name": "pitcher"
}
},
"negative": {
"match": {
"article_name": "glass crystal water"
}
},
"negative_boost": 0.1
}
}
}
```
{% include copy-curl.html %}
Both documents are still returned, but the document with the word "glass" has a relevance score that is 10 times lower than in the previous case:
```json
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.18232156,
"hits": [
{
"_index": "testindex",
"_id": "1",
"_score": 0.18232156,
"_source": {
"article_name": "The greatest pitcher in baseball history"
}
},
{
"_index": "testindex",
"_id": "2",
"_score": 0.018232157,
"_source": {
"article_name": "The making of a glass pitcher"
}
}
]
}
}
```
## Parameters
The following table lists all top-level parameters supported by `boosting` queries.
Parameter | Description
:--- | :---
`positive` | The query that a document must match to be returned in the results. Required.
`negative` | If a document in the results matches this query, its relevance score is reduced by multiplying its original relevance score (produced by the `positive` query) by the `negative_boost` parameter. Required.
`negative_boost` | A floating-point factor between 0 and 1.0 that the original relevance score is multiplied by in order to reduce the relevance of documents that match the `negative` query. Required.
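
The scoring rule described in the table can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not OpenSearch code: the function name `boosting_score` and its arguments are hypothetical, and the input score is taken from the example response above.

```python
def boosting_score(positive_score: float, matches_negative: bool,
                   negative_boost: float) -> float:
    """Illustrative only: final score of a document that matched `positive`."""
    if not 0.0 <= negative_boost <= 1.0:
        raise ValueError("negative_boost must be between 0 and 1.0")
    # Documents that also match the `negative` query are scaled down;
    # all other documents keep the score produced by the `positive` query.
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


# Scores from the example: both documents start at 0.18232156.
baseball = boosting_score(0.18232156, matches_negative=False, negative_boost=0.1)
glass = boosting_score(0.18232156, matches_negative=True, negative_boost=0.1)
print(baseball, glass)
```

The second value is one tenth of the first, which agrees (up to 32-bit floating-point rounding) with the `_score` values in the response.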

View File

@ -0,0 +1,212 @@
---
layout: default
title: Constant score queries
parent: Compound queries
grand_parent: Query DSL
nav_order: 40
---
# Constant score queries
If you need to return documents that contain a certain word regardless of how many times the word appears, you can use a `constant_score` query. A `constant_score` query wraps a filter query and assigns all documents in the results a relevance score equal to the value of the `boost` parameter. Thus, all returned documents have an equal relevance score, and term frequency/inverse document frequency (TF/IDF) is not considered. Filter queries do not calculate relevance scores. Further, OpenSearch caches frequently used filter queries to improve performance.
## Example
Use the following query to return documents that contain the word "Hamlet" in the `shakespeare` index:
```json
GET shakespeare/_search
{
"query": {
"constant_score": {
"filter": {
"match": {
"text_entry": "Hamlet"
}
},
"boost": 1.2
}
}
}
```
{% include copy-curl.html %}
All documents in the results are assigned a relevance score of 1.2:
<details open markdown="block">
<summary>
Response
</summary>
{: .text-delta }
```json
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 96,
"relation": "eq"
},
"max_score": 1.2,
"hits": [
{
"_index": "shakespeare",
"_id": "32535",
"_score": 1.2,
"_source": {
"type": "line",
"line_id": 32536,
"play_name": "Hamlet",
"speech_number": 48,
"line_number": "1.1.97",
"speaker": "HORATIO",
"text_entry": "Dared to the combat; in which our valiant Hamlet--"
}
},
{
"_index": "shakespeare",
"_id": "32546",
"_score": 1.2,
"_source": {
"type": "line",
"line_id": 32547,
"play_name": "Hamlet",
"speech_number": 48,
"line_number": "1.1.108",
"speaker": "HORATIO",
"text_entry": "His fell to Hamlet. Now, sir, young Fortinbras,"
}
},
{
"_index": "shakespeare",
"_id": "32625",
"_score": 1.2,
"_source": {
"type": "line",
"line_id": 32626,
"play_name": "Hamlet",
"speech_number": 59,
"line_number": "1.1.184",
"speaker": "HORATIO",
"text_entry": "Unto young Hamlet; for, upon my life,"
}
},
{
"_index": "shakespeare",
"_id": "32633",
"_score": 1.2,
"_source": {
"type": "line",
"line_id": 32634,
"play_name": "Hamlet",
"speech_number": 60,
"line_number": "",
"speaker": "MARCELLUS",
"text_entry": "Enter KING CLAUDIUS, QUEEN GERTRUDE, HAMLET, POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords, and Attendants"
}
},
{
"_index": "shakespeare",
"_id": "32634",
"_score": 1.2,
"_source": {
"type": "line",
"line_id": 32635,
"play_name": "Hamlet",
"speech_number": 1,
"line_number": "1.2.1",
"speaker": "KING CLAUDIUS",
"text_entry": "Though yet of Hamlet our dear brothers death"
}
},
{
"_index": "shakespeare",
"_id": "32699",
"_score": 1.2,
"_source": {
"type": "line",
"line_id": 32700,
"play_name": "Hamlet",
"speech_number": 8,
"line_number": "1.2.65",
"speaker": "KING CLAUDIUS",
"text_entry": "But now, my cousin Hamlet, and my son,--"
}
},
{
"_index": "shakespeare",
"_id": "32703",
"_score": 1.2,
"_source": {
"type": "line",
"line_id": 32704,
"play_name": "Hamlet",
"speech_number": 12,
"line_number": "1.2.69",
"speaker": "QUEEN GERTRUDE",
"text_entry": "Good Hamlet, cast thy nighted colour off,"
}
},
{
"_index": "shakespeare",
"_id": "32723",
"_score": 1.2,
"_source": {
"type": "line",
"line_id": 32724,
"play_name": "Hamlet",
"speech_number": 16,
"line_number": "1.2.89",
"speaker": "KING CLAUDIUS",
"text_entry": "Tis sweet and commendable in your nature, Hamlet,"
}
},
{
"_index": "shakespeare",
"_id": "32754",
"_score": 1.2,
"_source": {
"type": "line",
"line_id": 32755,
"play_name": "Hamlet",
"speech_number": 17,
"line_number": "1.2.120",
"speaker": "QUEEN GERTRUDE",
"text_entry": "Let not thy mother lose her prayers, Hamlet:"
}
},
{
"_index": "shakespeare",
"_id": "32759",
"_score": 1.2,
"_source": {
"type": "line",
"line_id": 32760,
"play_name": "Hamlet",
"speech_number": 19,
"line_number": "1.2.125",
"speaker": "KING CLAUDIUS",
"text_entry": "This gentle and unforced accord of Hamlet"
}
}
]
}
}
```
</details>
## Parameters
The following table lists all top-level parameters supported by `constant_score` queries.
Parameter | Description
:--- | :---
`filter` | The filter query that a document must match to be returned in the results. Required.
`boost` | A floating-point value that is assigned as the relevance score to all returned documents. Optional. Default is 1.0.

View File

@ -0,0 +1,101 @@
---
layout: default
title: Disjunction max queries
parent: Compound queries
grand_parent: Query DSL
nav_order: 50
---
# Disjunction max queries
A disjunction max (`dis_max`) query returns any document that matches one or more query clauses. For documents that match multiple query clauses, the relevance score is set to the highest relevance score from all matching query clauses.
When the relevance scores of the returned documents are identical, you can use the `tie_breaker` parameter to give more weight to documents that match multiple query clauses.
## Example
Consider an index with two documents that you index as follows:
```json
PUT testindex1/_doc/1
{
"title": " The Top 10 Shakespeare Poems",
"description": "Top 10 sonnets of England's national poet and the Bard of Avon"
}
```
```json
PUT testindex1/_doc/2
{
"title": "Sonnets of the 16th Century",
"body": "The poems written by various 16-th century poets"
}
```
Use a `dis_max` query to search for documents that contain the words "Shakespeare poems":
```json
GET testindex1/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Shakespeare poems" }},
{ "match": { "body": "Shakespeare poems" }}
]
}
}
}
```
{% include copy-curl.html %}
The response contains both documents:
```json
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.3862942,
"hits": [
{
"_index": "testindex1",
"_id": "1",
"_score": 1.3862942,
"_source": {
"title": " The Top 10 Shakespeare Poems",
"description": "Top 10 sonnets of England's national poet and the Bard of Avon"
}
},
{
"_index": "testindex1",
"_id": "2",
"_score": 0.2876821,
"_source": {
"title": "Sonnets of the 16th Century",
"body": "The poems written by various 16-th century poets"
}
}
]
}
}
```
## Parameters
The following table lists all top-level parameters supported by `dis_max` queries.
Parameter | Description
:--- | :---
`queries` | An array of one or more query clauses that are used to match documents. A document must match at least one query clause to be returned in the results. If a document matches multiple query clauses, the relevance score is set to the highest relevance score from all matching query clauses. Required.
`tie_breaker` | A floating-point factor between 0 and 1.0 that is used to give more weight to documents that match multiple query clauses. In this case, the relevance score of a document is calculated using the following algorithm: Take the highest relevance score from all matching query clauses, multiply the scores from all other matching clauses by the `tie_breaker` value, and add the relevance scores together, normalizing them. Optional. Default is 0 (which means only the highest score counts).
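
The `tie_breaker` calculation can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `dis_max_score` is hypothetical, per-clause scores are passed in directly (with `None` for a clause that did not match), and no additional normalization is applied.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Illustrative only: combine per-clause scores the way the table describes.

    `clause_scores` holds one entry per query clause; None means "no match".
    """
    matching = [s for s in clause_scores if s is not None]
    if not matching:
        return None  # the document matches no clause and is not returned
    best = max(matching)
    # Highest score plus a tie_breaker-weighted share of the other scores.
    return best + tie_breaker * (sum(matching) - best)


print(dis_max_score([1.2, 0.4]))                   # default: only the best clause counts
print(dis_max_score([1.2, 0.4], tie_breaker=0.5))  # other clauses contribute half their score
```

With the default `tie_breaker` of 0, a document that matches several clauses scores the same as one that matches only its best clause; any positive value rewards the additional matches.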

View File

@ -0,0 +1,827 @@
---
layout: default
title: Function score queries
parent: Compound queries
grand_parent: Query DSL
nav_order: 60
has_math: true
---
# Function score queries
Use a `function_score` query if you need to alter the relevance scores of documents returned in the results. A `function_score` query defines a query and one or more functions that can be applied to all results or subsets of the results to recalculate their relevance scores.
## Using one scoring function
The most basic example of a `function_score` query uses one function to recalculate the score. The following query uses a `weight` function to double all relevance scores. This function applies to all documents in the results because there is no `query` parameter specified within `function_score`:
```json
GET shakespeare/_search
{
"query": {
"function_score": {
"weight": "2"
}
}
}
```
{% include copy-curl.html %}
## Applying the scoring function to a subset of documents
To apply the scoring function to a subset of documents, provide a `query` within `function_score`:
```json
GET shakespeare/_search
{
"query": {
"function_score": {
"query": {
"match": {
"play_name": "Hamlet"
}
},
"weight": "2"
}
}
}
```
{% include copy-curl.html %}
## Supported functions
The `function_score` query type supports the following functions:
- Built-in:
- `weight`: Multiplies a document score by a predefined boost factor.
- `random_score`: Provides a random score that is consistent for a single user but different between users.
- `field_value_factor`: Uses the value of the specified document field to recalculate the score.
- Decay functions (`gauss`, `exp`, and `linear`): Recalculates the score using a specified decay function.
- Custom:
- `script_score`: Uses a script to score documents.
## The weight function
When you use the `weight` function, the original relevance score is multiplied by the floating-point value of `weight`:
```json
GET shakespeare/_search
{
"query": {
"function_score": {
"weight": "2"
}
}
}
```
{% include copy-curl.html %}
Unlike the `boost` value, the `weight` function is not normalized.
## The random score function
The `random_score` function provides a random score that is consistent for a single user but different between users. The score is a floating-point number in the [0, 1) range. By default, the `random_score` function uses internal Lucene document IDs as seed values, making random values irreproducible because documents can be renumbered after merges. To achieve consistency in generating random values, you can provide `seed` and `field` parameters. The `field` must be a field for which `fielddata` is enabled (commonly, a numeric field). The score is calculated using the `seed`, the `fielddata` values for the `field`, and a salt calculated using the index name and shard ID. Because the index name and shard ID are the same for documents that reside in the same shard, documents with the same `field` values will be assigned the same score. To ensure different scores for all documents in the same shard, use a `field` that has unique values for all documents. One option is to use the `_seq_no` field. However, if you choose this field, the scores can change if the document is updated because of the corresponding `_seq_no` update.
The following query uses the `random_score` function with a `seed` and `field`:
```json
GET blogs/_search
{
"query": {
"function_score": {
"random_score": {
"seed": 20,
"field": "_seq_no"
}
}
}
}
```
{% include copy-curl.html %}
## The field value factor function
The `field_value_factor` function recalculates the score using the value of the specified document field. If the field is a multi-valued field, only its first value is used for calculations, and the others are not considered.
The `field_value_factor` function supports the following options:
- `field`: The field to use in score calculations.
- `factor`: An optional factor by which the field value is multiplied. Default is 1.
- `modifier`: One of the modifiers to apply to the field value $$v$$. The following table lists all supported modifiers.
Modifier | Formula | Description
:--- | :--- | :---
`log`| $$\log v$$ | Take the base-10 logarithm of the value. Taking a logarithm of a non-positive number is an illegal operation and will result in an error. For values between 0 and 1 (both exclusive), this function returns negative values that will result in an error. We recommend using `log1p` or `log2p` instead of `log`.
`log1p`| $$\log (1 + v)$$ | Take the base-10 logarithm of the sum of 1 and the value.
`log2p`| $$\log (2 + v)$$ | Take the base-10 logarithm of the sum of 2 and the value.
`ln`| $$\ln v$$ | Take the natural logarithm of the value. Taking a logarithm of a non-positive number is an illegal operation and will result in an error. For values between 0 and 1 (both exclusive), this function returns negative values that will result in an error. We recommend using `ln1p` or `ln2p` instead of `ln`.
`ln1p`| $$\ln (1 + v)$$ | Take the natural logarithm of the sum of 1 and the value.
`ln2p`| $$\ln (2 + v)$$ | Take the natural logarithm of the sum of 2 and the value.
`reciprocal`| $$\frac {1}{v}$$ | Take the reciprocal of the value.
`square`| $$v^2$$ | Square the value.
`sqrt`| $$\sqrt v$$ | Take the square root of the value. Taking a square root of a negative number is an illegal operation and will result in an error. Ensure that $$v$$ is non-negative.
`none`| N/A | Do not apply any modifier.
- `missing`: The value to use if the field is missing from the document. The `factor` and `modifier` are applied to this value instead of the missing field value.
For example, the following query uses the `field_value_factor` function to give more weight to the `views` field:
```json
GET blogs/_search
{
"query": {
"function_score": {
"field_value_factor": {
"field": "views",
"factor": 1.5,
"modifier": "log1p",
"missing": 1
}
}
}
}
```
{% include copy-curl.html %}
The preceding query calculates the relevance score using the following formula:
$$ \text{score} = \text{original score} \cdot \log(1 + 1.5 \cdot \text{views}) $$
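
To check the formula by hand, the calculation can be reproduced in Python. This is a minimal sketch, not OpenSearch code: the function name `field_value_factor` and the `MODIFIERS` table are hypothetical, and the sketch assumes that the function result is multiplied by the original score, as in the formula above.

```python
import math

# Modifier names follow the table above; each maps a value v to f(v).
MODIFIERS = {
    "none": lambda v: v,
    "log": math.log10,
    "log1p": lambda v: math.log10(1 + v),
    "log2p": lambda v: math.log10(2 + v),
    "ln": math.log,
    "ln1p": lambda v: math.log(1 + v),
    "ln2p": lambda v: math.log(2 + v),
    "reciprocal": lambda v: 1 / v,
    "square": lambda v: v * v,
    "sqrt": math.sqrt,
}


def field_value_factor(original_score, field_value, factor=1.0,
                       modifier="none", missing=None):
    """Illustrative only: original score times modifier(factor * field value)."""
    if field_value is None:
        if missing is None:
            raise ValueError("field is missing and no `missing` value was given")
        field_value = missing  # factor and modifier are applied to `missing`
    return original_score * MODIFIERS[modifier](factor * field_value)


# A document with 6 views: 1 + 1.5 * 6 = 10, and log10(10) = 1.
print(field_value_factor(2.0, 6, factor=1.5, modifier="log1p"))
# A document without a `views` field falls back to missing = 1.
print(field_value_factor(2.0, None, factor=1.5, modifier="log1p", missing=1))
```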
## The script score function
Using the `script_score` function, you can write a custom script for scoring documents, optionally incorporating values of fields in the document. The original relevance score is accessible in the `_score` variable.
The calculated score cannot be negative. A negative score will result in an error. Document scores have positive 32-bit floating-point values. A score with greater precision is converted to the nearest 32-bit floating-point number.
{: .important}
For example, the following query uses the `script_score` function to calculate the score based on the original score and the number of views and likes for the blog post. To give the number of views and likes a lesser weight, this formula takes the logarithm of the sum of views and likes. To make the logarithm valid even if the number of views and likes is `0`, `1` is added to their sum:
```json
GET blogs/_search
{
"query": {
"function_score": {
"query": {"match": {"name": "opensearch"}},
"script_score": {
"script": "_score * Math.log(1 + doc['likes'].value + doc['views'].value)"
}
}
}
}
```
{% include copy-curl.html %}
Scripts are compiled and cached for faster performance. Thus, it's preferable to reuse the same script and pass any parameters that the script needs:
```json
GET blogs/_search
{
"query": {
"function_score": {
"query": {
"match": { "name": "opensearch" }
},
"script_score": {
"script": {
"params": {
"add": 1
},
"source": "_score * Math.log(params.add + doc['likes'].value + doc['views'].value)"
}
}
}
}
}
```
{% include copy-curl.html %}
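
To see what the script computes, the same expression can be evaluated outside OpenSearch. The following Python sketch is illustrative only: the function name `script_score` is hypothetical, and it relies on `Math.log` in the script being the natural logarithm, which corresponds to `math.log` in Python.

```python
import math


def script_score(original_score, likes, views, add=1):
    """Illustrative only: mirrors _score * Math.log(add + likes + views)."""
    return original_score * math.log(add + likes + views)


# A post with no likes and no views: log(1) = 0, so the score becomes 0
# (valid, because the result is not negative).
print(script_score(1.5, likes=0, views=0))
# A post with 150 likes and 1200 views scales the score by ln(1351).
print(script_score(1.5, likes=150, views=1200))
```

Because `add` is passed as a parameter, changing it does not require a different script source, which is the reuse pattern the preceding request relies on.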
## Decay functions
For many applications, you need to sort the results based on proximity or recency. You can do this with decay functions. Decay functions calculate a document score using one of three decay curves: Gaussian, exponential, or linear.
Decay functions operate only on [numeric]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/numeric/), [date]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/dates/), and [geopoint]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/geo-point/) fields.
{: .important}
Decay functions calculate scores based on the `origin`, `scale`, `offset`, and `decay`, as shown in the following figure.
<img src="{{site.url}}{{site.baseurl}}/images/decay-functions.png" alt="Decay function curves" width="600">
### Example: Geopoint fields
Suppose you're looking for a hotel near your office. You create a `hotels` index that maps the `location` field as a geopoint:
```json
PUT hotels
{
"mappings": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
```
{% include copy-curl.html %}
You index two documents that correspond to nearby hotels:
```json
PUT hotels/_doc/1
{
"name": "Hotel Within 200",
"location": {
"lat": 40.7105,
"lon": 74.00
}
}
```
{% include copy-curl.html %}
```json
PUT hotels/_doc/2
{
"name": "Hotel Outside 500",
"location": {
"lat": 40.7115,
"lon": 74.00
}
}
```
{% include copy-curl.html %}
The `origin` defines the point from which the distance is calculated (the office location). The `offset` specifies the distance from the origin within which documents are given a full score of 1. You can give hotels within 200 ft of the office the same highest score. The `scale` defines the decay rate of the graph, and the `decay` defines the score to assign to a document at the `scale` + `offset` distance from the origin. Once you are outside the 200 ft radius, you may decide that if you have to walk another 300 ft to get to a hotel (`scale` = 300 ft), you'll assign it one quarter of the original score (`decay` = 0.25).
You create the following query with the `origin` at latitude 40.71 and longitude 74.00:
```json
GET hotels/_search
{
"query": {
"function_score": {
"functions": [
{
"exp": {
"location": {
"origin": "40.71,74.00",
"offset": "200ft",
"scale": "300ft",
"decay": 0.25
}
}
}
]
}
}
}
```
{% include copy-curl.html %}
The response contains both hotels. The hotel within 200 ft of the office has a score of 1, and the hotel outside of the 500 ft radius has a score of approximately 0.20, which is less than the `decay` parameter (0.25):
<details open markdown="block">
<summary>
Response
</summary>
{: .text-delta}
```json
{
"took": 854,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "hotels",
"_id": "1",
"_score": 1,
"_source": {
"name": "Hotel Within 200",
"location": {
"lat": 40.7105,
"lon": 74
}
}
},
{
"_index": "hotels",
"_id": "2",
"_score": 0.20099315,
"_source": {
"name": "Hotel Outside 500",
"location": {
"lat": 40.7115,
"lon": 74
}
}
}
]
}
}
```
</details>
### Parameters
The following table lists all parameters supported by the `gauss`, `exp`, and `linear` functions.
Parameter | Description
:--- | :---
`origin` | The point from which to calculate the distance. Must be provided as a number for numeric fields, a date for date fields, or a geopoint for geopoint fields. Required for geopoint and numeric fields. Optional for date fields (defaults to `now`). For date fields, date math is supported (for example, `now-2d`).
`offset` | Defines the distance from the origin within which documents are given a score of 1. Optional. Default is 0.
`scale` | Documents at the distance of `scale` + `offset` from the `origin` are assigned a score of `decay`. Required. <br>For numeric fields, `scale` can be any number. <br>For date fields, `scale` can be defined as a number with [units]({{site.url}}{{site.baseurl}}/api-reference/units/) (`5h`, `1d`). If units are not provided, `scale` defaults to milliseconds. <br>For geopoint fields, `scale` can be defined as a number with [units]({{site.url}}{{site.baseurl}}/api-reference/units/) (`1mi`, `5km`). If units are not provided, `scale` defaults to meters.
`decay` | Defines the score of a document at the distance of `scale` + `offset` from the `origin`. Optional. Default is 0.5.
For fields that are missing from the document, decay functions return a score of 1.
{: .note}
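
The way these parameters interact can be sketched in Python. This is a minimal illustration rather than the OpenSearch implementation: the function name `decay_score` is hypothetical, and the curve formulas are the commonly documented ones, chosen so that each curve returns exactly `decay` at a distance of `scale` + `offset` from the `origin`.

```python
import math


def decay_score(kind, value, origin, scale, offset=0.0, decay=0.5):
    """Illustrative only: score of a numeric field value under a decay curve."""
    # Distance beyond the offset; anything inside the offset counts as 0.
    distance = max(0.0, abs(value - origin) - offset)
    if kind == "exp":
        return math.exp(math.log(decay) / scale * distance)
    if kind == "gauss":
        sigma_sq = -scale ** 2 / (2 * math.log(decay))
        return math.exp(-distance ** 2 / (2 * sigma_sq))
    if kind == "linear":
        s = scale / (1 - decay)
        return max(0.0, (s - distance) / s)
    raise ValueError(f"unknown decay function: {kind}")


# Assumed settings: origin 20, offset 5, scale 10, default decay 0.5.
for comments in (20, 16, 5, 3):
    print(comments, decay_score("exp", comments, origin=20, scale=10, offset=5))
```

Values within the offset of the origin score 1, a value exactly `scale` + `offset` away scores `decay`, and values farther away score less.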
### Example: Numeric fields
The following query uses the exponential decay function to prioritize blog posts by the number of comments:
```json
GET blogs/_search
{
"query": {
"function_score": {
"functions": [
{
"exp": {
"comments": {
"origin": "20",
"offset": "5",
"scale": "10"
}
}
}
]
}
}
}
```
{% include copy-curl.html %}
The first two blog posts in the results have a score of 1 because one has 20 comments (the `origin`) and the other has 16 comments, which is within the `offset` (the range within which documents receive a full score is calculated as 20 $$\pm$$ 5 and is [15, 25]). The third blog post has 5 comments, which is at a distance of `scale` + `offset` from the `origin` (20 &minus; (5 + 10) = 5), so it's given the default `decay` score (0.5):
<details open markdown="block">
<summary>
Response
</summary>
{: .text-delta}
```json
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "blogs",
"_id": "1",
"_score": 1,
"_source": {
"name": "Semantic search in OpenSearch",
"views": 1200,
"likes": 150,
"comments": 16,
"date_posted": "2022-04-17"
}
},
{
"_index": "blogs",
"_id": "2",
"_score": 1,
"_source": {
"name": "Get started with OpenSearch 2.7",
"views": 1400,
"likes": 100,
"comments": 20,
"date_posted": "2022-05-02"
}
},
{
"_index": "blogs",
"_id": "3",
"_score": 0.5,
"_source": {
"name": "Distributed tracing with Data Prepper",
"views": 800,
"likes": 50,
"comments": 5,
"date_posted": "2022-04-25"
}
},
{
"_index": "blogs",
"_id": "4",
"_score": 0.4352753,
"_source": {
"name": "A very old blog",
"views": 100,
"likes": 20,
"comments": 3,
"date_posted": "2000-04-25"
}
}
]
}
}
```
</details>
### Example: Date fields
The following query uses the Gaussian decay function to prioritize blog posts published around 04/24/2022:
```json
GET blogs/_search
{
"query": {
"function_score": {
"functions": [
{
"gauss": {
"date_posted": {
"origin": "2022-04-24",
"offset": "1d",
"scale": "6d",
"decay": 0.25
}
}
}
]
}
}
}
```
{% include copy-curl.html %}
In the results, the first blog post was published within one day of 04/24/2022, so it has the highest score of 1. The second blog post was published on 04/17/2022, which is exactly `offset` + `scale` (`1d` + `6d`) before the `origin`, and therefore has a score equal to `decay` (0.25). The third blog post was published more than 7 days after 04/24/2022, so it has a lower score. The last blog post has a score of 0 because it was published years ago:
<details open markdown="block">
<summary>
Response
</summary>
{: .text-delta}
```json
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "blogs",
"_id": "3",
"_score": 1,
"_source": {
"name": "Distributed tracing with Data Prepper",
"views": 800,
"likes": 50,
"comments": 5,
"date_posted": "2022-04-25"
}
},
{
"_index": "blogs",
"_id": "1",
"_score": 0.25,
"_source": {
"name": "Semantic search in OpenSearch",
"views": 1200,
"likes": 150,
"comments": 16,
"date_posted": "2022-04-17"
}
},
{
"_index": "blogs",
"_id": "2",
"_score": 0.15154076,
"_source": {
"name": "Get started with OpenSearch 2.7",
"views": 1400,
"likes": 100,
"comments": 20,
"date_posted": "2022-05-02"
}
},
{
"_index": "blogs",
"_id": "4",
"_score": 0,
"_source": {
"name": "A very old blog",
"views": 100,
"likes": 20,
"comments": 3,
"date_posted": "2000-04-25"
}
}
]
}
}
```
</details>
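As a rough check of the third score, the Gaussian formula from the [Decay curve calculation](#decay-curve-calculation) section can be evaluated by hand. This sketch assumes that distances are measured in whole days: the post from 05/02/2022 is 8 days from the `origin`, which leaves 7 days after the `1d` offset is subtracted:

$$ \sigma^2 = - \frac {6^2} {2 \ln(0.25)} \approx 12.98, \qquad \text{score} = \exp \left(-\frac {7^2} {2 \cdot 12.98} \right) \approx 0.1515 $$

The result agrees with the `_score` of 0.15154076 in the preceding response.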
### Multi-valued fields
If the field that you specify for decay calculation contains multiple values, you can use the `multi_value_mode` parameter. This parameter specifies one of the following functions to determine the field value that is used for calculations:
- `min`: (Default) The minimum distance from the `origin`.
- `max`: The maximum distance from the `origin`.
- `avg`: The average distance from the `origin`.
- `sum`: The sum of all distances from the `origin`.
For example, you index a document with an array of distances:
```json
PUT testindex/_doc/1
{
"distances": [1, 2, 3, 4, 5]
}
```
The following query uses the `max` distance of a multi-valued field `distances` to calculate decay:
```json
GET testindex/_search
{
"query": {
"function_score": {
"functions": [
{
"exp": {
"distances": {
"origin": "6",
"offset": "5",
"scale": "1"
},
"multi_value_mode": "max"
}
}
]
}
}
}
```
{% include copy-curl.html %}
The document is given a score of 1 because the maximum distance from the `origin` (6 &minus; 1 = 5) is within the `offset` (5):
```json
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "testindex",
"_id": "1",
"_score": 1,
"_source": {
"distances": [
1,
2,
3,
4,
5
]
}
}
]
}
}
```
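To make the four modes concrete, the following Python sketch computes the distance that each `multi_value_mode` would select for the example document. It only illustrates the list of modes above and is not the engine's implementation; the function name `select_distance` is hypothetical.

```python
def select_distance(values, origin, mode="min"):
    """Pick one distance from the origin for a multi-valued field (illustrative only)."""
    distances = [abs(v - origin) for v in values]
    if mode == "min":
        return min(distances)
    if mode == "max":
        return max(distances)
    if mode == "avg":
        return sum(distances) / len(distances)
    if mode == "sum":
        return sum(distances)
    raise ValueError(f"unknown multi_value_mode: {mode}")


# The example document and origin used in this section.
distances_field = [1, 2, 3, 4, 5]
for mode in ("min", "max", "avg", "sum"):
    print(mode, select_distance(distances_field, origin=6, mode=mode))
```

With `max`, the selected distance is 5, which does not exceed the `offset` of 5 used in the preceding query, so no decay is applied.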
### Decay curve calculation
The following formulas define score computation for various decay functions ($$v$$ denotes the document field value).
**Gaussian**
$$ \text{score} = \exp \left(-\frac {(\max(0, \lvert v - \text{origin} \rvert - \text{offset}))^2} {2\sigma^2} \right), $$
where $$\sigma$$ is calculated to ensure that the score is equal to `decay` at the distance `offset` + `scale` from the `origin`:
$$ \sigma^2 = - \frac {\text{scale}^2} {2 \ln(\text{decay})} $$
**Exponential**
$$ \text{score} = \exp (\lambda \cdot \max(0, \lvert v - \text{origin} \rvert - \text{offset})),$$
where $$\lambda$$ is calculated to ensure that the score is equal to `decay` at the distance `offset` + `scale` from the `origin`:
$$\lambda = \frac {\ln(\text{decay})} {\text{scale}} $$
**Linear**
$$ \text{score} = \max \left(\frac {s - \max(0, \lvert v - \text{origin} \rvert - \text{offset})} {s}, 0 \right), $$
where $$s$$ is calculated to ensure that the score is equal to `decay` at the distance `offset` + `scale` from the `origin`:
$$s = \frac {\text{scale}} {1 - \text{decay}}$$
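
The three decay curves can be sketched in Python as follows. This is a minimal illustration of the formulas in this section for numeric fields, not the engine's code; the function names and the plain-number arguments are assumptions made for the example.

```python
import math


def _adjusted_distance(value, origin, offset):
    # Distance from the origin; anything inside the offset band counts as zero.
    return max(0.0, abs(value - origin) - offset)


def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    d = _adjusted_distance(value, origin, offset)
    return math.exp(-(d * d) / (2.0 * sigma_sq))


def exp_decay(value, origin, scale, offset=0.0, decay=0.5):
    lam = math.log(decay) / scale
    return math.exp(lam * _adjusted_distance(value, origin, offset))


def linear_decay(value, origin, scale, offset=0.0, decay=0.5):
    s = scale / (1.0 - decay)
    # The linear score is clamped at 0 so that it never becomes negative.
    return max((s - _adjusted_distance(value, origin, offset)) / s, 0.0)


# A value at distance offset + scale from the origin (here 20 + 5 + 10 = 35).
for fn in (gauss_decay, exp_decay, linear_decay):
    print(fn.__name__, round(fn(35, origin=20, scale=10, offset=5, decay=0.5), 6))
```

Each function returns a value equal to `decay` (up to floating-point rounding) at the distance `offset` + `scale` from the `origin`, which is the property that $$\sigma$$, $$\lambda$$, and $$s$$ are chosen to satisfy.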
## Using multiple scoring functions
You can specify multiple scoring functions in a function score query by listing them in the `functions` array.
### Combining scores from multiple functions
Different functions can use different scales for scoring. For example, the `random_score` function provides a score between 0 and 1, but the `field_value_factor` does not have a specific scale for the score. Additionally, you may want to weigh scores given by different functions differently. To adjust scores for different functions, you can specify the `weight` parameter for each function. The score given by each function is then multiplied by the `weight` to produce the final score for that function. The `weight` parameter must be provided in the `functions` array in order to differentiate it from the [weight function](#the-weight-function).
The scores given by each function are combined using the `score_mode` parameter, which takes one of the following values:
- `multiply`: (Default) Scores are multiplied.
- `sum`: Scores are added.
- `avg`: Scores are averaged. If `weight` is specified, this is a [weighted average](https://en.wikipedia.org/wiki/Weighted_arithmetic_mean). For example, if the first function with the weight $$1$$ returns the score $$10$$, and the second function with the weight $$4$$ returns the score $$20$$, the average is calculated as $$\frac {10 \cdot 1 + 20 \cdot 4}{1 + 4} = 18$$.
- `first`: The score from the first function that has a matching filter is taken.
- `max`: The maximum score is taken.
- `min`: The minimum score is taken.
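
The following Python sketch illustrates how the `score_mode` values listed above could combine per-function scores. It is an approximation for illustration only: the function name is hypothetical, and `first` is simplified by assuming that the input list already contains only the functions whose filters matched.

```python
import math


def combine_function_scores(scores, weights=None, score_mode="multiply"):
    """Combine per-function scores according to score_mode (illustrative only)."""
    if weights is None:
        weights = [1.0] * len(scores)
    # Each function's score is first multiplied by its weight.
    weighted = [s * w for s, w in zip(scores, weights)]
    if score_mode == "multiply":
        return math.prod(weighted)
    if score_mode == "sum":
        return sum(weighted)
    if score_mode == "avg":
        # Weighted average: divide by the sum of the weights.
        return sum(weighted) / sum(weights)
    if score_mode == "first":
        return weighted[0]
    if score_mode == "max":
        return max(weighted)
    if score_mode == "min":
        return min(weighted)
    raise ValueError(f"unknown score_mode: {score_mode}")


# The weighted-average example from the list above: scores 10 and 20, weights 1 and 4.
print(combine_function_scores([10, 20], [1, 4], "avg"))
```

For the scores 10 and 20 with the weights 1 and 4, the `avg` mode reproduces the weighted average of 18 from the list.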
### Specifying an upper limit for a score
You can specify an upper limit for a function score in the `max_boost` parameter. The default upper limit is the maximum magnitude for a `float` value: (2 &minus; 2<sup>&minus;23</sup>) &middot; 2<sup>127</sup>.
### Combining the score for all functions with the query score
You can specify how the score computed using all functions is combined with the query score in the `boost_mode` parameter, which takes one of the following values:
- `multiply`: (Default) Multiply the query score by the function score.
- `replace`: Ignore the query score and use the function score.
- `sum`: Add the query score and the function score.
- `avg`: Average the query score and the function score.
- `max`: Take the greater of the query score and the function score.
- `min`: Take the lesser of the query score and the function score.
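
As a rough illustration of the `boost_mode` values, the following Python sketch combines a query score with a function score. It assumes that the function score is capped at `max_boost` before it is combined; the function name is hypothetical, and `float("inf")` stands in for the default upper limit.

```python
def combine_with_query_score(query_score, function_score,
                             boost_mode="multiply", max_boost=float("inf")):
    """Combine the query score with the function score (illustrative only)."""
    # Assumption: the function score is limited by max_boost before combining.
    capped = min(function_score, max_boost)
    if boost_mode == "multiply":
        return query_score * capped
    if boost_mode == "replace":
        return capped
    if boost_mode == "sum":
        return query_score + capped
    if boost_mode == "avg":
        return (query_score + capped) / 2
    if boost_mode == "max":
        return max(query_score, capped)
    if boost_mode == "min":
        return min(query_score, capped)
    raise ValueError(f"unknown boost_mode: {boost_mode}")


for mode in ("multiply", "replace", "sum", "avg", "max", "min"):
    print(mode, combine_with_query_score(2.0, 3.0, boost_mode=mode))
```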
### Filtering documents that don't meet a threshold
Changing the relevance score does not change the list of matching documents. To exclude some documents that don't meet a threshold, specify the threshold value in the `min_score` parameter. All documents returned by the query are then scored and filtered using the threshold value.
### Example
The following request searches for blog posts that include the words "OpenSearch Data Prepper", preferring the posts published around 04/24/2022. Additionally, the numbers of views and likes are taken into consideration. Finally, the cutoff threshold is set at the score of 10:
```json
GET blogs/_search
{
"query": {
"function_score": {
"boost": "5",
"functions": [
{
"gauss": {
"date_posted": {
"origin": "2022-04-24",
"offset": "1d",
"scale": "6d"
}
},
"weight": 1
},
{
"gauss": {
"likes": {
"origin": 200,
"scale": 200
}
},
"weight": 4
},
{
"gauss": {
"views": {
"origin": 1000,
"scale": 800
}
},
"weight": 2
}
],
"query": {
"match": {
"name": "opensearch data prepper"
}
},
"max_boost": 10,
"score_mode": "max",
"boost_mode": "multiply",
"min_score": 10
}
}
}
```
{% include copy-curl.html %}
The results contain the three matching blog posts:
<details open markdown="block">
<summary>
Response
</summary>
{: .text-delta}
```json
{
"took": 14,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 31.191923,
"hits": [
{
"_index": "blogs",
"_id": "3",
"_score": 31.191923,
"_source": {
"name": "Distributed tracing with Data Prepper",
"views": 800,
"likes": 50,
"comments": 5,
"date_posted": "2022-04-25"
}
},
{
"_index": "blogs",
"_id": "1",
"_score": 13.907352,
"_source": {
"name": "Semantic search in OpenSearch",
"views": 1200,
"likes": 150,
"comments": 16,
"date_posted": "2022-04-17"
}
},
{
"_index": "blogs",
"_id": "2",
"_score": 11.150461,
"_source": {
"name": "Get started with OpenSearch 2.7",
"views": 1400,
"likes": 100,
"comments": 20,
"date_posted": "2022-05-02"
}
}
]
}
}
```
</details>


@ -50,9 +50,9 @@ Broadly, you can classify queries into two categories---*leaf queries* and *comp
## A note on Unicode special characters in text fields
Due to word boundaries associated with Unicode special characters, the Unicode standard analyzer cannot index a [text field type]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/text/) value as a whole value when it includes one of these special characters. As a result, a text field value that includes a special character is parsed by the standard analyzer as multiple values separated by the special character, effectively tokenizing the different elements on either side of it. This can lead to unintentional filtering of documents and potentially compromise control over their access.
Because of word boundaries associated with Unicode special characters, the Unicode standard analyzer cannot index a [text field type]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/text/) value as a whole value when it includes one of these special characters. As a result, a text field value that includes a special character is parsed by the standard analyzer as multiple values separated by the special character, effectively tokenizing the different elements on either side of it. This can lead to unintentional filtering of documents and potentially compromise control over their access.
The examples below illustrate values containing special characters that will be parsed improperly by the standard analyzer. In this example, the existence of the hyphen/minus sign in the value prevents the analyzer from distinguishing between the two different users for `user.id` and interprets them as one and the same:
The following examples illustrate values containing special characters that will be parsed improperly by the standard analyzer. In this example, the existence of the hyphen/minus sign in the value prevents the analyzer from distinguishing between the two different users for `user.id` and interprets them as one and the same:
```json
{


@ -0,0 +1,450 @@
---
layout: default
title: Minimum should match
parent: Query DSL
nav_order: 70
---
# Minimum should match
The `minimum_should_match` parameter can be used for full-text search and specifies the minimum number of terms a document must match to be returned in search results.
The following example requires a document to match at least two out of three search terms in order to be returned as a search result:
```json
GET /shakespeare/_search
{
"query": {
"match": {
"text_entry": {
"query": "prince king star",
"minimum_should_match": "2"
}
}
}
}
```
In this example, the query has three optional clauses (`prince`, `king`, and `star`) that are combined with an `OR`. Because `minimum_should_match` is set to `2`, a document must match at least two of them.
## Valid values
You can specify the `minimum_should_match` parameter as one of the following values.
Value type | Example | Description
:--- | :--- | :---
Non-negative integer | `2` | A document must match this number of optional clauses.
Negative integer | `-1` | A document must match the total number of optional clauses minus this number.
Non-negative percentage | `70%` | A document must match this percentage of the total number of optional clauses. The number of clauses to match is rounded down to the nearest integer.
Negative percentage | `-30%` | A document can have this percentage of the total number of optional clauses that do not match. The number of clauses a document is allowed to not match is rounded down to the nearest integer.
Combination | `2<75%` | Expression in the `n<p%` format. If the number of optional clauses is less than or equal to `n`, the document must match all optional clauses. If the number of optional clauses is greater than `n`, then the document must match the `p` percentage of optional clauses.
Multiple combinations | `3<-1 5<50%` | More than one combination separated by a space. Each condition applies to the number of optional clauses that is greater than the number on the left of the `<` sign. In this example, if there are three or fewer optional clauses, the document must match all of them. If there are four or five optional clauses, the document must match all but one of them. If there are six or more optional clauses, the document must match 50% of them.
Let `n` be the number of optional clauses a document must match. When `n` is calculated as a percentage, if `n` is less than 1, then 1 is used. If `n` is greater than the number of optional clauses, the number of optional clauses is used.
{: .note}
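
The rules in the preceding table and note can be sketched in Python as follows. This is a simplified reading of the documented behavior, not the engine's parser; the function name `required_matches` is hypothetical, and the clamping follows the note above.

```python
def required_matches(num_optional, spec):
    """Return how many optional clauses must match for a given spec (illustrative only)."""
    spec = str(spec).strip()
    if "<" in spec:
        # One or more "n<value" conditions separated by spaces.
        result = num_optional  # match all clauses unless a condition applies
        for condition in spec.split():
            bound, sub_spec = condition.split("<", 1)
            if num_optional <= int(bound):
                return result
            result = required_matches(num_optional, sub_spec)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        if percent >= 0:
            result = num_optional * percent // 100  # rounded down
        else:
            allowed_missing = num_optional * abs(percent) // 100  # rounded down
            result = num_optional - allowed_missing
        result = max(result, 1)  # per the note: a percentage never yields less than 1
    else:
        number = int(spec)
        result = number if number >= 0 else num_optional + number
    # Per the note: never require more clauses than exist (and never fewer than 0).
    return max(0, min(result, num_optional))


for spec in ("2", "-1", "80%", "-20%", "2<75%", "3<-1 5<50%"):
    print(spec, [required_matches(n, spec) for n in (3, 4, 5, 6)])
```

For example, `required_matches(4, "80%")` gives 3, while `required_matches(4, "-20%")` gives 4, because each percentage is rounded down before it is applied.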
## Using the parameter in Boolean queries
A [Boolean query]({{site.url}}{{site.baseurl}}/) lists optional clauses in the `should` clause and required clauses in the `must` clause. Optionally, it can contain a `filter` clause to filter results.
Consider an example index containing the following five documents:
```json
PUT testindex/_doc/1
{
"text": "one OpenSearch"
}
```
{% include copy-curl.html %}
```json
PUT testindex/_doc/2
{
"text": "one two OpenSearch"
}
```
{% include copy-curl.html %}
```json
PUT testindex/_doc/3
{
"text": "one two three OpenSearch"
}
```
{% include copy-curl.html %}
```json
PUT testindex/_doc/4
{
"text": "one two three four OpenSearch"
}
```
{% include copy-curl.html %}
```json
PUT testindex/_doc/5
{
"text": "OpenSearch"
}
```
{% include copy-curl.html %}
The following query contains four optional clauses:
```json
GET testindex/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"text": "OpenSearch"
}
}
],
"should": [
{
"match": {
"text": "one"
}
},
{
"match": {
"text": "two"
}
},
{
"match": {
"text": "three"
}
},
{
"match": {
"text": "four"
}
}
],
"minimum_should_match": "80%"
}
}
}
```
{% include copy-curl.html %}
Because `minimum_should_match` is specified as `80%`, the number of optional clauses to match is calculated as 4 &middot; 0.8 = 3.2 and then rounded down to 3. Therefore, the results contain documents that match at least three clauses:
```json
{
"took": 40,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 2.494999,
"hits": [
{
"_index": "testindex",
"_id": "4",
"_score": 2.494999,
"_source": {
"text": "one two three four OpenSearch"
}
},
{
"_index": "testindex",
"_id": "3",
"_score": 1.5744598,
"_source": {
"text": "one two three OpenSearch"
}
}
]
}
}
```
Now specify `minimum_should_match` as `-20%`:
```json
GET testindex/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"text": "OpenSearch"
}
}
],
"should": [
{
"match": {
"text": "one"
}
},
{
"match": {
"text": "two"
}
},
{
"match": {
"text": "three"
}
},
{
"match": {
"text": "four"
}
}
],
"minimum_should_match": "-20%"
}
}
}
```
{% include copy-curl.html %}
The number of non-matching optional clauses that a document can have is calculated as 4 &middot; 0.2 = 0.8 and rounded down to 0. Thus, the results contain only one document that matches all optional clauses:
```json
{
"took": 41,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 2.494999,
"hits": [
{
"_index": "testindex",
"_id": "4",
"_score": 2.494999,
"_source": {
"text": "one two three four OpenSearch"
}
}
]
}
}
```
Note that specifying a positive percentage (`80%`) and negative percentage (`-20%`) did not result in the same number of optional clauses a document must match because, in both cases, the result was rounded down. If the number of optional clauses were, for example, 5, then both `80%` and `-20%` would have produced the same number of optional clauses a document must match (4).
### Default `minimum_should_match` value
If a query contains a `must` or `filter` clause, the default `minimum_should_match` value is 0. For example, the following query searches for documents that match `OpenSearch` and 0 optional `should` clauses:
```json
GET testindex/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"text": "OpenSearch"
}
}
],
"should": [
{
"match": {
"text": "one"
}
},
{
"match": {
"text": "two"
}
},
{
"match": {
"text": "three"
}
},
{
"match": {
"text": "four"
}
}
]
}
}
}
```
{% include copy-curl.html %}
This query returns all five documents in the index:
```json
{
"took": 34,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 5,
"relation": "eq"
},
"max_score": 2.494999,
"hits": [
{
"_index": "testindex",
"_id": "4",
"_score": 2.494999,
"_source": {
"text": "one two three four OpenSearch"
}
},
{
"_index": "testindex",
"_id": "3",
"_score": 1.5744598,
"_source": {
"text": "one two three OpenSearch"
}
},
{
"_index": "testindex",
"_id": "2",
"_score": 0.91368985,
"_source": {
"text": "one two OpenSearch"
}
},
{
"_index": "testindex",
"_id": "1",
"_score": 0.4338556,
"_source": {
"text": "one OpenSearch"
}
},
{
"_index": "testindex",
"_id": "5",
"_score": 0.11964063,
"_source": {
"text": "OpenSearch"
}
}
]
}
}
```
However, if you omit the `must` clause, the default `minimum_should_match` value is 1, so the query searches for documents that match at least one optional `should` clause:
```json
GET testindex/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"text": "one"
}
},
{
"match": {
"text": "two"
}
},
{
"match": {
"text": "three"
}
},
{
"match": {
"text": "four"
}
}
]
}
}
}
```
{% include copy-curl.html %}
The results contain only four documents that match at least one of the optional clauses:
```json
{
"took": 19,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 2.426633,
"hits": [
{
"_index": "testindex",
"_id": "4",
"_score": 2.426633,
"_source": {
"text": "one two three four OpenSearch"
}
},
{
"_index": "testindex",
"_id": "3",
"_score": 1.4978898,
"_source": {
"text": "one two three OpenSearch"
}
},
{
"_index": "testindex",
"_id": "2",
"_score": 0.8266785,
"_source": {
"text": "one two OpenSearch"
}
},
{
"_index": "testindex",
"_id": "1",
"_score": 0.3331056,
"_source": {
"text": "one OpenSearch"
}
}
]
}
}
```
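
The matching behavior shown in this section can be approximated with a short Python sketch over the five example documents. It is a simplification for illustration: text is split on white space instead of being analyzed, relevance scores are ignored, and the function name `matching_ids` is hypothetical.

```python
# The five example documents indexed earlier in this section.
DOCS = {
    1: "one OpenSearch",
    2: "one two OpenSearch",
    3: "one two three OpenSearch",
    4: "one two three four OpenSearch",
    5: "OpenSearch",
}


def matching_ids(must_terms, should_terms, minimum_should_match=None):
    """Return IDs of documents matching a simplified bool query (illustrative only)."""
    if minimum_should_match is None:
        # Default: 0 when a must clause is present, 1 otherwise.
        minimum_should_match = 0 if must_terms else 1
    hits = []
    for doc_id, text in DOCS.items():
        tokens = set(text.lower().split())
        if not all(term.lower() in tokens for term in must_terms):
            continue
        matched = sum(1 for term in should_terms if term.lower() in tokens)
        if matched >= minimum_should_match:
            hits.append(doc_id)
    return hits


should = ["one", "two", "three", "four"]
print(matching_ids(["OpenSearch"], should, 3))  # at least three optional clauses
print(matching_ids(["OpenSearch"], should))     # default with a must clause
print(matching_ids([], should))                 # default without a must clause
```

The three calls correspond to the `80%` query, the default query with a `must` clause, and the default query without one. Document 5 is excluded only in the last case because it matches none of the optional clauses.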

BIN
images/decay-functions.png Normal file

Binary file not shown.
