diff --git a/_query-dsl/query-dsl/compound/bool.md b/_query-dsl/query-dsl/compound/bool.md index 383f7ad6..40a715c0 100644 --- a/_query-dsl/query-dsl/compound/bool.md +++ b/_query-dsl/query-dsl/compound/bool.md @@ -12,23 +12,18 @@ redirect_from: # Boolean queries -You can perform a Boolean query with the `bool` query type. A Boolean query compounds query clauses so you can combine multiple search queries with Boolean logic. To narrow or broaden your search results, use the `bool` query clause rules. +A Boolean (`bool`) query can combine several query clauses into one advanced query. The clauses are combined with Boolean logic to find matching documents returned in the results. -As a compound query type, `bool` allows you to construct an advanced query by combining several simple queries. +Use the following query clauses within a `bool` query: -Use the following rules to define how to combine multiple sub-query clauses within a `bool` query: - -Clause rule | Behavior +Clause | Behavior :--- | :--- -`must` | Logical `and` operator. The results must match the queries in this clause. If you have multiple queries, all of them must match. +`must` | Logical `and` operator. The results must match all queries in this clause. `must_not` | Logical `not` operator. All matches are excluded from the results. -`should` | Logical `or` operator. The results must match at least one of the queries, but, optionally, they can match more than one query. Each matching `should` clause increases the relevancy score. You can set the minimum number of queries that must match using the `minimum_number_should_match` parameter. -`minimum_number_should_match` | Optional parameter for use with a `should` query clause. Specifies the minimum number of queries that the document must match for it to be returned in the results. The default value is 1. -`filter` | Logical `and` operator that is applied first to reduce your dataset before applying the queries. 
A query within a filter clause is a yes or no option. If a document matches the query, it is returned in the results; otherwise, it is not. The results of a filter query are generally cached to allow for a faster return. Use the filter query to filter the results based on exact matches, ranges, dates, numbers, and so on. +`should` | Logical `or` operator. The results must match at least one of the queries. Matching more `should` clauses increases the document's relevance score. You can set the minimum number of queries that must match using the [`minimum_should_match`]({{site.url}}{{site.baseurl}}/query-dsl/query-dsl/minimum-should-match/) parameter. If a query contains a `must` or `filter` clause, the default `minimum_should_match` value is 0. Otherwise, the default `minimum_should_match` value is 1. +`filter` | Logical `and` operator that is applied first to reduce your dataset before applying the queries. A query within a filter clause is a yes or no option. If a document matches the query, it is returned in the results; otherwise, it is not. The results of a filter query are generally cached to allow for a faster return. Use the filter query to filter the results based on exact matches, ranges, dates, or numbers. -### Boolean query structure - -The structure of a Boolean query contains the `bool` query type followed by clause rules, as follows: +A Boolean query has the following structure: ```json GET _search @@ -54,9 +49,9 @@ For example, assume you have the complete works of Shakespeare indexed in an Ope 1. The `text_entry` field must contain the word `love` and should contain either `life` or `grace`. 2. The `speaker` field must not contain `ROMEO`. -3. Filter these results to the play `Romeo and Juliet` without affecting the relevancy score. +3. Filter these results to the play `Romeo and Juliet` without affecting the relevance score. 
-Use the following query: +These requirements can be combined in the following query: ```json GET shakespeare/_search @@ -100,7 +95,7 @@ GET shakespeare/_search } ``` -#### Sample output +The response contains matching documents: ```json { @@ -141,7 +136,6 @@ GET shakespeare/_search If you want to identify which of these clauses actually caused the matching results, name each query with the `_name` parameter. To add the `_name` parameter, change the field name in the `match` query to an object: - ```json GET shakespeare/_search { @@ -206,10 +200,10 @@ OpenSearch returns a `matched_queries` array that lists the queries that matched ``` If you remove the queries not in this list, you will still see the exact same result. -By examining which `should` clause matched, you can better understand the relevancy score of the results. +By examining which `should` clause matched, you can better understand the relevance score of the results. You can also construct complex Boolean expressions by nesting `bool` queries. -For example, to find a `text_entry` field that matches (`love` OR `hate`) AND (`life` OR `grace`) in the play `Romeo and Juliet`: +For example, use the following query to find a `text_entry` field that matches (`love` OR `hate`) AND (`life` OR `grace`) in the play `Romeo and Juliet`: ```json GET shakespeare/_search @@ -260,7 +254,7 @@ GET shakespeare/_search } ``` -#### Sample output +The response contains matching documents: ```json { diff --git a/_query-dsl/query-dsl/compound/boosting.md b/_query-dsl/query-dsl/compound/boosting.md new file mode 100644 index 00000000..6176b463 --- /dev/null +++ b/_query-dsl/query-dsl/compound/boosting.md @@ -0,0 +1,158 @@ +--- +layout: default +title: Boosting queries +parent: Compound queries +grand_parent: Query DSL +nav_order: 30 +--- + +# Boosting queries + +If you're searching for the word "pitcher", your results may relate to either baseball players or containers for liquids. 
For a search in the context of baseball, you might want to completely exclude results that contain the words "glass" or "water" by using the `must_not` clause. However, if you want to keep those results but downgrade them in relevance, you can do so with `boosting` queries. + +A `boosting` query returns documents that match a `positive` query. Among those documents, the ones that also match the `negative` query are scored lower in relevance (their relevance score is multiplied by the negative boosting factor). + +## Example + +Consider an index with two documents that you index as follows: + +```json +PUT testindex/_doc/1 +{ + "article_name": "The greatest pitcher in baseball history" +} +``` + +```json +PUT testindex/_doc/2 +{ + "article_name": "The making of a glass pitcher" +} +``` + +Use the following match query to search for documents containing the word "pitcher": + +```json +GET testindex/_search +{ + "query": { + "match": { + "article_name": "pitcher" + } + } +} +``` + +Both returned documents have the same relevance score: + +```json +{ + "took": 5, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 0.18232156, + "hits": [ + { + "_index": "testindex", + "_id": "1", + "_score": 0.18232156, + "_source": { + "article_name": "The greatest pitcher in baseball history" + } + }, + { + "_index": "testindex", + "_id": "2", + "_score": 0.18232156, + "_source": { + "article_name": "The making of a glass pitcher" + } + } + ] + } +} +``` + +Now use the following `boosting` query to search for documents containing the word "pitcher" but downgrade the documents that contain the words "glass", "crystal", or "water": + +```json +GET testindex/_search +{ + "query": { + "boosting": { + "positive": { + "match": { + "article_name": "pitcher" + } + }, + "negative": { + "match": { + "article_name": "glass crystal water" + } + }, + "negative_boost": 
0.1 + } + } +} +``` +{% include copy-curl.html %} + +Both documents are still returned, but the document with the word "glass" has a relevance score that is 10 times lower than in the previous case: + +```json +{ + "took": 13, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 0.18232156, + "hits": [ + { + "_index": "testindex", + "_id": "1", + "_score": 0.18232156, + "_source": { + "article_name": "The greatest pitcher in baseball history" + } + }, + { + "_index": "testindex", + "_id": "2", + "_score": 0.018232157, + "_source": { + "article_name": "The making of a glass pitcher" + } + } + ] + } +} +``` + +## Parameters + +The following table lists all top-level parameters supported by `boosting` queries. + +Parameter | Description +:--- | :--- +`positive` | The query that a document must match to be returned in the results. Required. +`negative` | If a document in the results matches this query, its relevance score is reduced by multiplying its original relevance score (produced by the `positive` query) by the `negative_boost` parameter. Required. +`negative_boost` | A floating-point factor between 0 and 1.0 that the original relevance score is multiplied by in order to reduce the relevance of documents that match the `negative` query. Required. diff --git a/_query-dsl/query-dsl/compound/constant-score.md b/_query-dsl/query-dsl/compound/constant-score.md new file mode 100644 index 00000000..4ad45c71 --- /dev/null +++ b/_query-dsl/query-dsl/compound/constant-score.md @@ -0,0 +1,212 @@ +--- +layout: default +title: Constant score queries +parent: Compound queries +grand_parent: Query DSL +nav_order: 40 +--- + +# Constant score queries + +If you need to return documents that contain a certain word regardless of how many times the word appears, you can use a `constant_score` query. 
A `constant_score` query wraps a filter query and assigns all documents in the results a relevance score equal to the value of the `boost` parameter. Thus, all returned documents have an equal relevance score, and term frequency/inverse document frequency (TF/IDF) is not considered. Filter queries do not calculate relevance scores. Further, OpenSearch caches frequently used filter queries to improve performance. + +## Example + +Use the following query to return documents that contain the word "Hamlet" in the `shakespeare` index: + +```json +GET shakespeare/_search +{ + "query": { + "constant_score": { + "filter": { + "match": { + "text_entry": "Hamlet" + } + }, + "boost": 1.2 + } + } +} +``` +{% include copy-curl.html %} + +All documents in the results are assigned a relevance score of 1.2: + +
+ + Response + + {: .text-delta } + +```json +{ + "took": 8, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 96, + "relation": "eq" + }, + "max_score": 1.2, + "hits": [ + { + "_index": "shakespeare", + "_id": "32535", + "_score": 1.2, + "_source": { + "type": "line", + "line_id": 32536, + "play_name": "Hamlet", + "speech_number": 48, + "line_number": "1.1.97", + "speaker": "HORATIO", + "text_entry": "Dared to the combat; in which our valiant Hamlet--" + } + }, + { + "_index": "shakespeare", + "_id": "32546", + "_score": 1.2, + "_source": { + "type": "line", + "line_id": 32547, + "play_name": "Hamlet", + "speech_number": 48, + "line_number": "1.1.108", + "speaker": "HORATIO", + "text_entry": "His fell to Hamlet. Now, sir, young Fortinbras," + } + }, + { + "_index": "shakespeare", + "_id": "32625", + "_score": 1.2, + "_source": { + "type": "line", + "line_id": 32626, + "play_name": "Hamlet", + "speech_number": 59, + "line_number": "1.1.184", + "speaker": "HORATIO", + "text_entry": "Unto young Hamlet; for, upon my life," + } + }, + { + "_index": "shakespeare", + "_id": "32633", + "_score": 1.2, + "_source": { + "type": "line", + "line_id": 32634, + "play_name": "Hamlet", + "speech_number": 60, + "line_number": "", + "speaker": "MARCELLUS", + "text_entry": "Enter KING CLAUDIUS, QUEEN GERTRUDE, HAMLET, POLONIUS, LAERTES, VOLTIMAND, CORNELIUS, Lords, and Attendants" + } + }, + { + "_index": "shakespeare", + "_id": "32634", + "_score": 1.2, + "_source": { + "type": "line", + "line_id": 32635, + "play_name": "Hamlet", + "speech_number": 1, + "line_number": "1.2.1", + "speaker": "KING CLAUDIUS", + "text_entry": "Though yet of Hamlet our dear brothers death" + } + }, + { + "_index": "shakespeare", + "_id": "32699", + "_score": 1.2, + "_source": { + "type": "line", + "line_id": 32700, + "play_name": "Hamlet", + "speech_number": 8, + "line_number": "1.2.65", + "speaker": "KING 
CLAUDIUS", + "text_entry": "But now, my cousin Hamlet, and my son,--" + } + }, + { + "_index": "shakespeare", + "_id": "32703", + "_score": 1.2, + "_source": { + "type": "line", + "line_id": 32704, + "play_name": "Hamlet", + "speech_number": 12, + "line_number": "1.2.69", + "speaker": "QUEEN GERTRUDE", + "text_entry": "Good Hamlet, cast thy nighted colour off," + } + }, + { + "_index": "shakespeare", + "_id": "32723", + "_score": 1.2, + "_source": { + "type": "line", + "line_id": 32724, + "play_name": "Hamlet", + "speech_number": 16, + "line_number": "1.2.89", + "speaker": "KING CLAUDIUS", + "text_entry": "Tis sweet and commendable in your nature, Hamlet," + } + }, + { + "_index": "shakespeare", + "_id": "32754", + "_score": 1.2, + "_source": { + "type": "line", + "line_id": 32755, + "play_name": "Hamlet", + "speech_number": 17, + "line_number": "1.2.120", + "speaker": "QUEEN GERTRUDE", + "text_entry": "Let not thy mother lose her prayers, Hamlet:" + } + }, + { + "_index": "shakespeare", + "_id": "32759", + "_score": 1.2, + "_source": { + "type": "line", + "line_id": 32760, + "play_name": "Hamlet", + "speech_number": 19, + "line_number": "1.2.125", + "speaker": "KING CLAUDIUS", + "text_entry": "This gentle and unforced accord of Hamlet" + } + } + ] + } +} +``` +
+ +## Parameters + +The following table lists all top-level parameters supported by `constant_score` queries. + +Parameter | Description +:--- | :--- +`filter` | The filter query that a document must match to be returned in the results. Required. +`boost` | A floating-point value that is assigned as the relevance score to all returned documents. Optional. Default is 1.0. \ No newline at end of file diff --git a/_query-dsl/query-dsl/compound/disjunction-max.md b/_query-dsl/query-dsl/compound/disjunction-max.md new file mode 100644 index 00000000..a5e42629 --- /dev/null +++ b/_query-dsl/query-dsl/compound/disjunction-max.md @@ -0,0 +1,101 @@ +--- +layout: default +title: Disjunction max queries +parent: Compound queries +grand_parent: Query DSL +nav_order: 50 +--- + +# Disjunction max queries + +A disjunction max (`dis_max`) query returns any document that matches one or more query clauses. For documents that match multiple query clauses, the relevance score is set to the highest relevance score from all matching query clauses. + +When the relevance scores of the returned documents are identical, you can use the `tie_breaker` parameter to give more weight to documents that match multiple query clauses. 
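The combination rule described above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not OpenSearch code: `dis_max_score` and its arguments are hypothetical names, the clause scores are made-up numbers, and any normalization that the engine may apply is ignored.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Combine the scores of the clauses that matched one document.

    Hypothetical helper: takes the best clause score and adds the
    remaining clause scores scaled by tie_breaker.
    """
    if not clause_scores:
        return 0.0  # no clause matched, so there is nothing to score
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others


# A document that matched two clauses with (made-up) scores 1.0 and 0.5.
print(dis_max_score([1.0, 0.5]))                   # only the best clause counts
print(dis_max_score([1.0, 0.5], tie_breaker=0.5))  # the second clause adds 0.5 * 0.5
```

With `tie_breaker` set to 0, a document that matches several clauses scores the same as a document that matches only its best clause. Any value above 0 separates the two.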
+ +## Example + +Consider an index with two documents that you index as follows: + +```json +PUT testindex1/_doc/1 +{ + "title": " The Top 10 Shakespeare Poems", + "description": "Top 10 sonnets of England's national poet and the Bard of Avon" +} +``` + +```json +PUT testindex1/_doc/2 +{ + "title": "Sonnets of the 16th Century", + "body": "The poems written by various 16-th century poets" +} +``` + +Use a `dis_max` query to search for documents that contain the words "Shakespeare poems" in either the `title` or the `body` field: + +```json +GET testindex1/_search +{ + "query": { + "dis_max": { + "queries": [ + { "match": { "title": "Shakespeare poems" }}, + { "match": { "body": "Shakespeare poems" }} + ] + } + } +} +``` +{% include copy-curl.html %} + +The response contains both documents: + +```json +{ + "took": 8, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 1.3862942, + "hits": [ + { + "_index": "testindex1", + "_id": "1", + "_score": 1.3862942, + "_source": { + "title": " The Top 10 Shakespeare Poems", + "description": "Top 10 sonnets of England's national poet and the Bard of Avon" + } + }, + { + "_index": "testindex1", + "_id": "2", + "_score": 0.2876821, + "_source": { + "title": "Sonnets of the 16th Century", + "body": "The poems written by various 16-th century poets" + } + } + ] + } +} +``` + +## Parameters + +The following table lists all top-level parameters supported by `dis_max` queries. + +Parameter | Description +:--- | :--- +`queries` | An array of one or more query clauses that are used to match documents. A document must match at least one query clause to be returned in the results. If a document matches multiple query clauses, the relevance score is set to the highest relevance score from all matching query clauses. Required. 
+`tie_breaker` | A floating-point factor between 0 and 1.0 that is used to give more weight to documents that match multiple query clauses. In this case, the relevance score of a document is calculated using the following algorithm: Take the highest relevance score from all matching query clauses, multiply the scores from all other matching clauses by the `tie_breaker` value, and add the relevance scores together, normalizing them. Optional. Default is 0 (which means only the highest score counts). diff --git a/_query-dsl/query-dsl/compound/function-score.md b/_query-dsl/query-dsl/compound/function-score.md new file mode 100644 index 00000000..0d5f901d --- /dev/null +++ b/_query-dsl/query-dsl/compound/function-score.md @@ -0,0 +1,827 @@ +--- +layout: default +title: Function score queries +parent: Compound queries +grand_parent: Query DSL +nav_order: 60 +has_math: true +--- + +# Function score queries + +Use a `function_score` query if you need to alter the relevance scores of documents returned in the results. A `function_score` query defines a query and one or more functions that can be applied to all results or subsets of the results to recalculate their relevance scores. + +## Using one scoring function + +The most basic example of a `function_score` query uses one function to recalculate the score. The following query uses a `weight` function to double all relevance scores. 
This function applies to all documents in the results because there is no `query` parameter specified within `function_score`: + +```json +GET shakespeare/_search +{ + "query": { + "function_score": { + "weight": "2" + } + } +} +``` +{% include copy-curl.html %} + +## Applying the scoring function to a subset of documents + +To apply the scoring function to a subset of documents, provide a query within the function: + +```json +GET shakespeare/_search +{ + "query": { + "function_score": { + "query": { + "match": { + "play_name": "Hamlet" + } + }, + "weight": "2" + } + } +} +``` +{% include copy-curl.html %} + +## Supported functions + +The `function_score` query type supports the following functions: + +- Built-in: + - `weight`: Multiplies a document score by a predefined boost factor. + - `random_score`: Provides a random score that is consistent for a single user but different between users. + - `field_value_factor`: Uses the value of the specified document field to recalculate the score. + - Decay functions (`gauss`, `exp`, and `linear`): Recalculates the score using a specified decay function. +- Custom: + - `script_score`: Uses a script to score documents. + +## The weight function + +When you use the `weight` function, the original relevance score is multiplied by the floating-point value of `weight`: + +```json +GET shakespeare/_search +{ + "query": { + "function_score": { + "weight": "2" + } + } +} +``` +{% include copy-curl.html %} + +Unlike the `boost` value, the `weight` function is not normalized. + +## The random score function + +The `random_score` function provides a random score that is consistent for a single user but different between users. The score is a floating-point number in the [0, 1) range. By default, the `random_score` function uses internal Lucene document IDs as seed values, making random values irreproducible because documents can be renumbered after merges. 
To achieve consistency in generating random values, you can provide `seed` and `field` parameters. The `field` must be a field for which `fielddata` is enabled (commonly, a numeric field). The score is calculated using the `seed`, the `fielddata` values for the `field`, and a salt calculated using the index name and shard ID. Because the index name and shard ID are the same for documents that reside in the same shard, documents with the same `field` values will be assigned the same score. To ensure different scores for all documents in the same shard, use a `field` that has unique values for all documents. One option is to use the `_seq_no` field. However, if you choose this field, the scores can change if the document is updated because of the corresponding `_seq_no` update. + +The following query uses the `random_score` function with a `seed` and `field`: + +```json +GET blogs/_search +{ + "query": { + "function_score": { + "random_score": { + "seed": 20, + "field": "_seq_no" + } + } + } +} +``` +{% include copy-curl.html %} + +## The field value factor function + +The `field_value_factor` function recalculates the score using the value of the specified document field. If the field is a multi-valued field, only its first value is used for calculations, and the others are not considered. + +The `field_value_factor` function supports the following options: + +- `field`: The field to use in score calculations. + +- `factor`: An optional factor by which the field value is multiplied. Default is 1. + +- `modifier`: One of the modifiers to apply to the field value $$v$$. The following table lists all supported modifiers. + + Modifier | Formula | Description + :--- | :--- | :--- + `log`| $$\log v$$ | Take the base-10 logarithm of the value. Taking a logarithm of a non-positive number is an illegal operation and will result in an error. For values between 0 and 1 (both exclusive), this function returns negative values that will result in an error. 
We recommend using `log1p` or `log2p` instead of `log`. + `log1p`| $$\log (1 + v)$$ | Take the base-10 logarithm of the sum of 1 and the value. + `log2p`| $$\log (2 + v)$$ | Take the base-10 logarithm of the sum of 2 and the value. + `ln`| $$\ln v$$ | Take the natural logarithm of the value. Taking a logarithm of a non-positive number is an illegal operation and will result in an error. For values between 0 and 1 (both exclusive), this function returns negative values that will result in an error. We recommend using `ln1p` or `ln2p` instead of `ln`. + `ln1p`| $$\ln (1 + v)$$ | Take the natural logarithm of the sum of 1 and the value. + `ln2p`| $$\ln (2 + v)$$ | Take the natural logarithm of the sum of 2 and the value. + `reciprocal`| $$\frac {1}{v}$$ | Take the reciprocal of the value. + `square`| $$v^2$$ | Square the value. + `sqrt`| $$\sqrt v$$ | Take the square root of the value. Taking a square root of a negative number is an illegal operation and will result in an error. Ensure that $$v$$ is non-negative. + `none`| N/A | Do not apply any modifier. + +- `missing`: The value to use if the field is missing from the document. The `factor` and `modifier` are applied to this value instead of the missing field value. + +For example, the following query uses the `field_value_factor` function to give more weight to the `views` field: + +```json +GET blogs/_search +{ + "query": { + "function_score": { + "field_value_factor": { + "field": "views", + "factor": 1.5, + "modifier": "log1p", + "missing": 1 + } + } + } +} +``` +{% include copy-curl.html %} + +The preceding query calculates the relevance score using the following formula: + +$$ \text{score} = \text{original score} \cdot \log(1 + 1.5 \cdot \text{views}) $$ + +## The script score function + +Using the `script_score` function, you can write a custom script for scoring documents, optionally incorporating values of fields in the document. 
The original relevance score is accessible in the `_score` variable. + +The calculated score cannot be negative. A negative score will result in an error. Document scores have positive 32-bit floating-point values. A score with greater precision is converted to the nearest 32-bit floating-point number. +{: .important} + +For example, the following query uses the `script_score` function to calculate the score based on the original score and the number of views and likes for the blog post. To give the number of views and likes a lesser weight, this formula takes the logarithm of the sum of views and likes. To make the logarithm valid even if the number of views and likes is `0`, `1` is added to their sum: + +```json +GET blogs/_search +{ + "query": { + "function_score": { + "query": {"match": {"name": "opensearch"}}, + "script_score": { + "script": "_score * Math.log(1 + doc['likes'].value + doc['views'].value)" + } + } + } +} +``` +{% include copy-curl.html %} + +Scripts are compiled and cached for faster performance. Thus, it's preferable to reuse the same script and pass any parameters that the script needs: + +```json +GET blogs/_search +{ + "query": { + "function_score": { + "query": { + "match": { "name": "opensearch" } + }, + "script_score": { + "script": { + "params": { + "add": 1 + }, + "source": "_score * Math.log(params.add + doc['likes'].value + doc['views'].value)" + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Decay functions + +For many applications, you need to sort the results based on proximity or recency. You can do this with decay functions. Decay functions calculate a document score using one of three decay curves: Gaussian, exponential, or linear. 
+ +Decay functions operate only on [numeric]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/numeric/), [date]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/dates/), and [geopoint]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/geo-point/) fields. +{: .important} + +Decay functions calculate scores based on the `origin`, `scale`, `offset`, and `decay`, as shown in the following figure. + +Decay function curves + +### Example: Geopoint fields + +Suppose you're looking for a hotel near your office. You create a `hotels` index that maps the `location` field as a geopoint: + +```json +PUT hotels +{ + "mappings": { + "properties": { + "location": { + "type": "geo_point" + } + } + } +} +``` +{% include copy-curl.html %} + +You index two documents that correspond to nearby hotels: + +```json +PUT hotels/_doc/1 +{ + "name": "Hotel Within 200", + "location": { + "lat": 40.7105, + "lon": 74.00 + } +} +``` +{% include copy-curl.html %} + +```json +PUT hotels/_doc/2 +{ + "name": "Hotel Outside 500", + "location": { + "lat": 40.7115, + "lon": 74.00 + } +} +``` +{% include copy-curl.html %} + +The `origin` defines the point from which the distance is calculated (the office location). The `offset` specifies the distance from the origin within which documents are given a full score of 1. You can give hotels within 200 ft of the office the same highest score. The `scale` defines the decay rate of the graph, and the `decay` defines the score to assign to a document at the `scale` + `offset` distance from the origin. Once you are outside the 200 ft radius, you may decide that if you have to walk another 300 ft to get to a hotel (`scale` = 300 ft), you'll assign it one quarter of the original score (`decay` = 0.25). 
+ +You create the following query with the `origin` at latitude 40.71 and longitude 74.00: + +```json +GET hotels/_search +{ + "query": { + "function_score": { + "functions": [ + { + "exp": { + "location": { + "origin": "40.71,74.00", + "offset": "200ft", + "scale": "300ft", + "decay": 0.25 + } + } + } + ] + } + } +} +``` +{% include copy-curl.html %} + +The response contains both hotels. The hotel within 200 ft of the office has a score of 1, and the hotel outside of the 500 ft radius has a score of approximately 0.20, which is less than the `decay` value of 0.25: + +
+ + Response + + {: .text-delta} + +```json +{ + "took": 854, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 2, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "hotels", + "_id": "1", + "_score": 1, + "_source": { + "name": "Hotel Within 200", + "location": { + "lat": 40.7105, + "lon": 74 + } + } + }, + { + "_index": "hotels", + "_id": "2", + "_score": 0.20099315, + "_source": { + "name": "Hotel Outside 500", + "location": { + "lat": 40.7115, + "lon": 74 + } + } + } + ] + } +} +``` +
+ +### Parameters + +The following table lists all parameters supported by the `gauss`, `exp`, and `linear` functions. + +Parameter | Description +:--- | :--- +`origin` | The point from which to calculate the distance. Must be provided as a number for numeric fields, a date for date fields, or a geopoint for geopoint fields. Required for geopoint and numeric fields. Optional for date fields (defaults to `now`). For date fields, date math is supported (for example, `now-2d`). +`offset` | Defines the distance from the origin within which documents are given a score of 1. Optional. Default is 0. +`scale` | Documents at the distance of `scale` + `offset` from the `origin` are assigned a score of `decay`. Required.
For numeric fields, `scale` can be any number.
For date fields, `scale` can be defined as a number with [units]({{site.url}}{{site.baseurl}}/api-reference/units/) (`5h`, `1d`). If units are not provided, `scale` defaults to milliseconds.
For geopoint fields, `scale` can be defined as a number with [units]({{site.url}}{{site.baseurl}}/api-reference/units/) (`1mi`, `5km`). If units are not provided, `scale` defaults to meters. +`decay` | Defines the score of a document at the distance of `scale` + `offset` from the `origin`. Optional. Default is 0.5. + +For fields that are missing from the document, decay functions return a score of 1. +{: .note} + +### Example: Numeric fields + +The following query uses the exponential decay function to prioritize blog posts by the number of comments: + +```json +GET blogs/_search +{ + "query": { + "function_score": { + "functions": [ + { + "exp": { + "comments": { + "origin": "20", + "offset": "5", + "scale": "10" + } + } + } + ] + } + } +} +``` +{% include copy-curl.html %} + +The first two blog posts in the results have a score of 1 because one is at the origin (20 comments) and the other has 16 comments, which is a distance of 4 from the origin and therefore within the offset (documents in the range 20 $$\pm$$ 5, that is, [15, 25], receive a full score). The third blog post has 5 comments, so it is at a distance of `scale` + `offset` from the `origin` (20 − (5 + 10) = 5) and is given the default `decay` score (0.5): + +
+ + Response + + {: .text-delta} + +```json +{ + "took": 3, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 4, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "blogs", + "_id": "1", + "_score": 1, + "_source": { + "name": "Semantic search in OpenSearch", + "views": 1200, + "likes": 150, + "comments": 16, + "date_posted": "2022-04-17" + } + }, + { + "_index": "blogs", + "_id": "2", + "_score": 1, + "_source": { + "name": "Get started with OpenSearch 2.7", + "views": 1400, + "likes": 100, + "comments": 20, + "date_posted": "2022-05-02" + } + }, + { + "_index": "blogs", + "_id": "3", + "_score": 0.5, + "_source": { + "name": "Distributed tracing with Data Prepper", + "views": 800, + "likes": 50, + "comments": 5, + "date_posted": "2022-04-25" + } + }, + { + "_index": "blogs", + "_id": "4", + "_score": 0.4352753, + "_source": { + "name": "A very old blog", + "views": 100, + "likes": 20, + "comments": 3, + "date_posted": "2000-04-25" + } + } + ] + } +} +``` +
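The scores in the preceding response can be reproduced with a short Python sketch. It assumes the commonly documented form of the exponential decay curve, in which the score is `exp(λ · max(0, |v − origin| − offset))` with `λ = ln(decay) / scale`. The function name `exp_decay` is hypothetical, and the sketch illustrates the arithmetic only, not how OpenSearch implements it.

```python
import math


def exp_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Exponential decay score for a numeric field value (illustrative sketch)."""
    # Distance beyond the offset; values within the offset keep the full score.
    distance = max(0.0, abs(value - origin) - offset)
    # Chosen so that the score equals `decay` at a distance of scale + offset.
    lam = math.log(decay) / scale
    return math.exp(lam * distance)


# Comment counts of the four blog posts in the example response.
for comments in (16, 20, 5, 3):
    score = exp_decay(comments, origin=20, scale=10, offset=5)
    print(comments, round(score, 7))
```

Under this assumption, the posts with 16 and 20 comments score 1, the post with 5 comments scores 0.5, and the post with 3 comments scores about 0.4353, which agrees with the `_score` values in the response.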
+ +### Example: Date fields + +The following query uses the Gaussian decay function to prioritize blog posts published around 04/24/2022: + +```json +GET blogs/_search +{ + "query": { + "function_score": { + "functions": [ + { + "gauss": { + "date_posted": { + "origin": "2022-04-24", + "offset": "1d", + "scale": "6d", + "decay": 0.25 + } + } + } + ] + } + } +} +``` +{% include copy-curl.html %} + +In the results, the first blog post was published within one day of 04/24/2022, so it has the highest score of 1. The second blog post was published on 04/17/2022, which is exactly `offset` + `scale` (`1d` + `6d`) from the `origin`, and therefore has a score equal to `decay` (0.25). The third blog post was published more than 7 days after 04/24/2022, so it has a lower score. The last blog post has a score of 0 because it was published years ago: + +
+ + Response + + {: .text-delta} + +```json +{ + "took": 2, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 4, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "blogs", + "_id": "3", + "_score": 1, + "_source": { + "name": "Distributed tracing with Data Prepper", + "views": 800, + "likes": 50, + "comments": 5, + "date_posted": "2022-04-25" + } + }, + { + "_index": "blogs", + "_id": "1", + "_score": 0.25, + "_source": { + "name": "Semantic search in OpenSearch", + "views": 1200, + "likes": 150, + "comments": 16, + "date_posted": "2022-04-17" + } + }, + { + "_index": "blogs", + "_id": "2", + "_score": 0.15154076, + "_source": { + "name": "Get started with OpenSearch 2.7", + "views": 1400, + "likes": 100, + "comments": 20, + "date_posted": "2022-05-02" + } + }, + { + "_index": "blogs", + "_id": "4", + "_score": 0, + "_source": { + "name": "A very old blog", + "views": 100, + "likes": 20, + "comments": 3, + "date_posted": "2000-04-25" + } + } + ] + } +} +``` +
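As a rough cross-check of these scores, the following Python sketch applies the Gaussian formula from the decay curve calculation section of this page to the publication dates. It is a simplified illustration: distances are measured in whole days (an assumption made for readability), and the names `gauss_decay_score` and `posted` are hypothetical.

```python
import math
from datetime import date


def gauss_decay_score(distance, offset, scale, decay):
    """Gaussian decay: exp(-max(0, distance - offset)^2 / (2 * sigma^2))."""
    # sigma^2 is chosen so that the score equals `decay` at offset + scale.
    sigma_sq = -(scale ** 2) / (2 * math.log(decay))
    effective = max(0.0, distance - offset)
    return math.exp(-(effective ** 2) / (2 * sigma_sq))


origin = date(2022, 4, 24)

# Publication dates per document ID, copied from the example response.
posted = {
    "3": date(2022, 4, 25),
    "1": date(2022, 4, 17),
    "2": date(2022, 5, 2),
    "4": date(2000, 4, 25),
}

date_scores = {
    doc_id: gauss_decay_score(abs((d - origin).days), offset=1, scale=6, decay=0.25)
    for doc_id, d in posted.items()
}

for doc_id, score in date_scores.items():
    print(doc_id, round(score, 8))
```

With these inputs, the post from 04/25/2022 scores 1, the post from 04/17/2022 scores 0.25, the post from 05/02/2022 scores approximately 0.1515408, and the post from 2000 underflows to 0.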
+ +### Multi-valued fields + +If the field that you specify for decay calculation contains multiple values, you can use the `multi_value_mode` parameter. This parameter specifies one of the following functions to determine the field value that is used for calculations: + +- `min`: (Default) The minimum distance from the `origin`. +- `max`: The maximum distance from the `origin`. +- `avg`: The average distance from the `origin`. +- `sum`: The sum of all distances from the `origin`. + +For example, you index a document with an array of distances: + +```json +PUT testindex/_doc/1 +{ + "distances": [1, 2, 3, 4, 5] +} +``` + +The following query uses the `max` distance of a multi-valued field `distances` to calculate decay: + +```json +GET testindex/_search +{ + "query": { + "function_score": { + "functions": [ + { + "exp": { + "distances": { + "origin": "6", + "offset": "5", + "scale": "1" + }, + "multi_value_mode": "max" + } + } + ] + } + } +} +``` +{% include copy-curl.html %} + +The document is given a score of 1 because its maximum distance from the `origin` (5, for the field value 1) is within the `offset` (5) from the `origin`: + +```json +{ + "took": 3, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 1, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "testindex", + "_id": "1", + "_score": 1, + "_source": { + "distances": [ + 1, + 2, + 3, + 4, + 5 + ] + } + } + ] + } +} +``` + +### Decay curve calculation + +The following formulas define score computation for various decay functions ($$v$$ denotes the document field value). 
+ +**Gaussian** + +$$ \text{score} = \exp \left(-\frac {(\max(0, \lvert v - \text{origin} \rvert - \text{offset}))^2} {2\sigma^2} \right), $$ + +where $$\sigma$$ is calculated to ensure that the score is equal to `decay` at the distance `offset` + `scale` from the `origin`: + +$$ \sigma^2 = - \frac {\text{scale}^2} {2 \ln(\text{decay})} $$ + +**Exponential** + +$$ \text{score} = \exp (\lambda \cdot \max(0, \lvert v - \text{origin} \rvert - \text{offset})),$$ + +where $$\lambda$$ is calculated to ensure that the score is equal to `decay` at the distance `offset` + `scale` from the `origin`: + +$$\lambda = \frac {\ln(\text{decay})} {\text{scale}} $$ + +**Linear** + +$$ \text{score} = \max \left(\frac {s - \max(0, \lvert v - \text{origin} \rvert - \text{offset})} {s}, 0 \right), $$ + +where $$s$$ is calculated to ensure that the score is equal to `decay` at the distance `offset` + `scale` from the `origin`: + +$$s = \frac {\text{scale}} {1 - \text{decay}}$$ + +## Using multiple scoring functions + +You can specify multiple scoring functions in a function score query by listing them in the `functions` array. + +### Combining scores from multiple functions + +Different functions can use different scales for scoring. For example, the `random_score` function provides a score between 0 and 1, but the `field_value_factor` function does not have a specific scale for the score. Additionally, you may want to weigh scores given by different functions differently. To adjust scores for different functions, you can specify the `weight` parameter for each function. The score given by each function is then multiplied by the `weight` to produce the final score for that function. The `weight` parameter must be provided in the `functions` array in order to differentiate it from the [weight function](#the-weight-function). + +The scores given by each function are combined using the `score_mode` parameter, which takes one of the following values: + +- `multiply`: (Default) Scores are multiplied. 
+- `sum`: Scores are added. +- `avg`: Scores are averaged. If `weight` is specified, this is a [weighted average](https://en.wikipedia.org/wiki/Weighted_arithmetic_mean). For example, if the first function with the weight $$1$$ returns the score $$10$$, and the second function with the weight $$4$$ returns the score $$20$$, the average is calculated as $$\frac {10 \cdot 1 + 20 \cdot 4}{1 + 4} = 18$$. +- `first`: The score from the first function that has a matching filter is taken. +- `max`: The maximum score is taken. +- `min`: The minimum score is taken. + +### Specifying an upper limit for a score + +You can specify an upper limit for a function score in the `max_boost` parameter. The default upper limit is the maximum magnitude for a `float` value: $$(2 - 2^{-23}) \cdot 2^{127}$$. + +### Combining the score for all functions with the query score + +You can specify how the score computed using all functions is combined with the query score in the `boost_mode` parameter, which takes one of the following values: + +- `multiply`: (Default) Multiply the query score by the function score. +- `replace`: Ignore the query score and use the function score. +- `sum`: Add the query score and the function score. +- `avg`: Average the query score and the function score. +- `max`: Take the greater of the query score and the function score. +- `min`: Take the lesser of the query score and the function score. + +### Filtering documents that don't meet a threshold + +Changing the relevance score does not change the list of matching documents. To exclude some documents that don't meet a threshold, specify the threshold value in the `min_score` parameter. All documents returned by the query are then scored and filtered using the threshold value. + +### Example + +The following request searches for blog posts that include the words "OpenSearch Data Prepper", preferring the posts published around 04/24/2022. Additionally, the number of views and likes are taken into consideration. 
Finally, the cutoff threshold is set at the score of 10: + +```json +GET blogs/_search +{ + "query": { + "function_score": { + "boost": "5", + "functions": [ + { + "gauss": { + "date_posted": { + "origin": "2022-04-24", + "offset": "1d", + "scale": "6d" + } + }, + "weight": 1 + }, + { + "gauss": { + "likes": { + "origin": 200, + "scale": 200 + } + }, + "weight": 4 + }, + { + "gauss": { + "views": { + "origin": 1000, + "scale": 800 + } + }, + "weight": 2 + } + ], + "query": { + "match": { + "name": "opensearch data prepper" + } + }, + "max_boost": 10, + "score_mode": "max", + "boost_mode": "multiply", + "min_score": 10 + } + } +} +``` +{% include copy-curl.html %} + +The results contain the three matching blog posts: + +
+ + Response + + {: .text-delta} + +```json +{ + "took": 14, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 3, + "relation": "eq" + }, + "max_score": 31.191923, + "hits": [ + { + "_index": "blogs", + "_id": "3", + "_score": 31.191923, + "_source": { + "name": "Distributed tracing with Data Prepper", + "views": 800, + "likes": 50, + "comments": 5, + "date_posted": "2022-04-25" + } + }, + { + "_index": "blogs", + "_id": "1", + "_score": 13.907352, + "_source": { + "name": "Semantic search in OpenSearch", + "views": 1200, + "likes": 150, + "comments": 16, + "date_posted": "2022-04-17" + } + }, + { + "_index": "blogs", + "_id": "2", + "_score": 11.150461, + "_source": { + "name": "Get started with OpenSearch 2.7", + "views": 1400, + "likes": 100, + "comments": 20, + "date_posted": "2022-05-02" + } + } + ] + } +} +``` +
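To make the interaction of `weight`, `score_mode`, `max_boost`, and `boost_mode` more concrete, the following Python sketch models the combination rules as they are described on this page. It is a simplified mental model, not the engine's implementation: it ignores per-function filters and the top-level `boost`, it does not attempt to reproduce the exact scores in the preceding response, and the helper names (`combine_function_scores`, `apply_boost_mode`) are hypothetical.

```python
import math

# Default cap described in the text: the largest finite 32-bit float value.
FLOAT_MAX = (2 - 2 ** -23) * 2.0 ** 127


def combine_function_scores(scores, weights, score_mode="multiply"):
    """Combine per-function scores according to `score_mode`."""
    if score_mode == "avg":
        # Weighted average: sum(score * weight) / sum(weight).
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    # For the other modes, each score is first multiplied by its weight.
    weighted = [s * w for s, w in zip(scores, weights)]
    modes = {
        "multiply": math.prod,
        "sum": sum,
        "first": lambda xs: xs[0],  # assumes the first function's filter matches
        "max": max,
        "min": min,
    }
    return modes[score_mode](weighted)


def apply_boost_mode(query_score, function_score,
                     boost_mode="multiply", max_boost=FLOAT_MAX):
    """Cap the function score at `max_boost`, then combine it with the query score."""
    capped = min(function_score, max_boost)
    modes = {
        "multiply": lambda q, f: q * f,
        "replace": lambda q, f: f,
        "sum": lambda q, f: q + f,
        "avg": lambda q, f: (q + f) / 2,
        "max": max,
        "min": min,
    }
    return modes[boost_mode](query_score, capped)


# The weighted average example from the `avg` description on this page.
weighted_avg = combine_function_scores([10, 20], [1, 4], "avg")
print(weighted_avg)  # 18.0

# With score_mode "max", the weighted scores are 10 and 80; max_boost caps 80 at 10.
function_score = combine_function_scores([10, 20], [1, 4], "max")
final_score = apply_boost_mode(2.0, function_score, "multiply", max_boost=10)
print(final_score)  # 20.0
```

In this model, a `min_score` threshold would simply be a final comparison of the combined score against the threshold value.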
\ No newline at end of file diff --git a/_query-dsl/query-dsl/index.md b/_query-dsl/query-dsl/index.md index 520e2bd7..d5d62182 100644 --- a/_query-dsl/query-dsl/index.md +++ b/_query-dsl/query-dsl/index.md @@ -50,9 +50,9 @@ Broadly, you can classify queries into two categories---*leaf queries* and *comp ## A note on Unicode special characters in text fields -Due to word boundaries associated with Unicode special characters, the Unicode standard analyzer cannot index a [text field type]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/text/) value as a whole value when it includes one of these special characters. As a result, a text field value that includes a special character is parsed by the standard analyzer as multiple values separated by the special character, effectively tokenizing the different elements on either side of it. This can lead to unintentional filtering of documents and potentially compromise control over their access. +Because of word boundaries associated with Unicode special characters, the Unicode standard analyzer cannot index a [text field type]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/text/) value as a whole value when it includes one of these special characters. As a result, a text field value that includes a special character is parsed by the standard analyzer as multiple values separated by the special character, effectively tokenizing the different elements on either side of it. This can lead to unintentional filtering of documents and potentially compromise control over their access. -The examples below illustrate values containing special characters that will be parsed improperly by the standard analyzer. 
In this example, the existence of the hyphen/minus sign in the value prevents the analyzer from distinguishing between the two different users for `user.id` and interprets them as one and the same: +The following examples illustrate values containing special characters that will be parsed improperly by the standard analyzer. In this example, the existence of the hyphen/minus sign in the value prevents the analyzer from distinguishing between the two different users for `user.id` and interprets them as one and the same: ```json { diff --git a/_query-dsl/query-dsl/minimum-should-match.md b/_query-dsl/query-dsl/minimum-should-match.md new file mode 100644 index 00000000..90729509 --- /dev/null +++ b/_query-dsl/query-dsl/minimum-should-match.md @@ -0,0 +1,450 @@ +--- +layout: default +title: Minimum should match +parent: Query DSL +nav_order: 70 +--- + +# Minimum should match + +The `minimum_should_match` parameter can be used for full-text search and specifies the minimum number of terms a document must match to be returned in search results. + +The following example requires a document to match at least two out of three search terms in order to be returned as a search result: + +```json +GET /shakespeare/_search +{ + "query": { + "match": { + "text_entry": { + "query": "prince king star", + "minimum_should_match": "2" + } + } + } +} +``` + +In this example, the query has three optional clauses (`prince`, `king`, and `star`) that are combined with an `OR`. Because `minimum_should_match` is set to `2`, the document must match at least two of these three terms. + +## Valid values + +You can specify the `minimum_should_match` parameter as one of the following values. + +Value type | Example | Description +:--- | :--- | :--- +Non-negative integer | `2` | A document must match this number of optional clauses. +Negative integer | `-1` | A document must match the total number of optional clauses minus this number. 
+Non-negative percentage | `70%` | A document must match this percentage of the total number of optional clauses. 
The number of clauses to match is rounded down to the nearest integer. +Negative percentage | `-30%` | A document can have this percentage of the total number of optional clauses that do not match. The number of clauses a document is allowed to not match is rounded down to the nearest integer. +Combination | `2<75%` | Expression in the `n