---
layout: default
title: Multi-match
parent: Full-text queries
grand_parent: Query DSL
nav_order: 50
---
Multi-match queries
A multi-match operation functions similarly to the match operation. You can use a `multi_match` query to search multiple fields.
The `^` character "boosts" certain fields. Boosts are multipliers that weigh matches in one field more heavily than matches in other fields. In the following example, a match for "wind" in the `title` field influences `_score` four times as much as a match in the `plot` field:
GET _search
{
"query": {
"multi_match": {
"query": "wind",
"fields": ["title^4", "plot"]
}
}
}
{% include copy-curl.html %}
The result is that films like The Wind Rises and Gone with the Wind are near the top of the search results, and films like Twister, which presumably have "wind" in their plot summaries, are near the bottom.
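If you assemble request bodies in client code, the `field^boost` syntax is just string formatting. The following is a minimal Python sketch, not part of any OpenSearch client; the helper name `multi_match_body` is hypothetical, and it only builds the JSON body shown in the preceding request:

```python
import json

def multi_match_body(query, boosts):
    """Build a multi_match body from a {field: boost} mapping (hypothetical helper)."""
    # A boost of 1 is the default, so the plain field name is enough.
    fields = [name if boost == 1 else f"{name}^{boost}" for name, boost in boosts.items()]
    return {"query": {"multi_match": {"query": query, "fields": fields}}}

body = multi_match_body("wind", {"title": 4, "plot": 1})
print(json.dumps(body, indent=2))
```

The printed body can be sent to the `_search` endpoint with whichever HTTP client you already use.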
You can use wildcards in the field name. For example, the following query will search the speaker
field and all fields that start with play_
, for example, play_name
or play_title
:
GET _search
{
"query": {
"multi_match": {
"query": "hamlet",
"fields": ["speaker", "play_*"]
}
}
}
{% include copy-curl.html %}
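OpenSearch resolves field name wildcards on the server, so you never need to expand them yourself. Purely as an illustration of which names a pattern such as `play_*` selects, the following Python sketch applies the same glob-style matching to a list of field names. The helper `expand_fields` and the field names `line_number` and `text_entry` are assumptions made for the example:

```python
from fnmatch import fnmatchcase

def expand_fields(patterns, mapped_fields):
    """Return the mapped field names selected by any of the patterns (illustrative only)."""
    return [name for name in mapped_fields
            if any(fnmatchcase(name, pattern) for pattern in patterns)]

# Hypothetical mapping: only speaker, play_name, and play_title come from the text above.
mapped = ["speaker", "play_name", "play_title", "line_number", "text_entry"]
selected = expand_fields(["speaker", "play_*"], mapped)
print(selected)  # ['speaker', 'play_name', 'play_title']
```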
If you don't provide the `fields` parameter, the `multi_match` query searches the fields specified in the `index.query.default_field` setting, which defaults to `*`. The default behavior is to extract all fields in the mapping that are eligible for term-level queries, filter the metadata fields, and combine all extracted fields to build a query.
The maximum number of clauses in a query is defined in the indices.query.bool.max_clause_count
setting, which defaults to 1,024.
{: .note}
Multi-match query types
OpenSearch supports the following multi-match query types, which differ in the way the query is executed internally:
- `best_fields` (default): Returns documents that match any field. Uses the `_score` of the best-matching field.
- `most_fields`: Returns documents that match any field. Uses a combined score of each matching field.
- `cross_fields`: Treats all fields as if they were one field. Processes fields with the same `analyzer` and matches words in any field.
- `phrase`: Runs a `match_phrase` query on each field. Uses the `_score` of the best-matching field.
- `phrase_prefix`: Runs a `match_phrase_prefix` query on each field. Uses the `_score` of the best-matching field.
- `bool_prefix`: Runs a `match_bool_prefix` query on each field. Uses a combined score of each matched field.
Best fields
If you're searching for two words that specify a concept, you want the results where the two words appear in the same field to score higher.
For example, consider an index that contains the following scientific articles:
PUT /articles/_doc/1
{
"title": "Aurora borealis",
"description": "Northern lights, or aurora borealis, explained"
}
{% include copy-curl.html %}
PUT /articles/_doc/2
{
"title": "Sun deprivation in the Northern countries",
"description": "Using fluorescent lights for therapy"
}
{% include copy-curl.html %}
You can search for articles containing northern lights
in the title or description:
GET articles/_search
{
"query": {
"multi_match" : {
"query": "northern lights",
"type": "best_fields",
"fields": [ "title", "description" ],
"tie_breaker": 0.3
}
}
}
{% include copy-curl.html %}
The preceding query is executed as the following dis_max
query with a match
query for each field:
GET /articles/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "northern lights" }},
{ "match": { "description": "northern lights" }}
],
"tie_breaker": 0.3
}
}
}
The results contain both documents, but document 1 is scored higher because both words are in the description
field:
{
"took": 30,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.84407747,
"hits": [
{
"_index": "articles",
"_id": "1",
"_score": 0.84407747,
"_source": {
"title": "Aurora borealis",
"description": "Northern lights, or aurora borealis, explained"
}
},
{
"_index": "articles",
"_id": "2",
"_score": 0.6322521,
"_source": {
"title": "Sun deprivation in the Northern countries",
"description": "Using fluorescent lights for therapy"
}
}
]
}
}
The best_fields
query uses the score of the best-matching field. If you specify a tie_breaker
, the score is calculated using the following algorithm:
Take the score of the best-matching field and add (tie_breaker
* _score
) for all other matching fields.
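The scoring rule above can be written out as a small function. This is a sketch of the arithmetic only; the per-field scores passed in are made-up numbers, not values OpenSearch would return for the example documents:

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores: best score plus tie_breaker times every other score."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie_breaker * others

# Hypothetical per-field scores for one document.
scores = [2.0, 1.0, 0.5]
print(best_fields_score(scores))        # 2.0 (only the best field counts)
print(best_fields_score(scores, 1.0))   # 3.5 (all fields are added)
```

With a `tie_breaker` of 0.3, the same scores combine to approximately 2.0 + 0.3 * (1.0 + 0.5) = 2.45.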
Most fields
Use the most_fields
query for multiple fields that contain the same text that is analyzed in different ways. For example, the original field may contain text analyzed with the standard
analyzer and another field may contain the same text analyzed with the english
analyzer, which performs stemming:
PUT /articles
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english"
}
}
}
}
}
}
{% include copy-curl.html %}
Consider the following two documents that are indexed in the articles
index:
PUT /articles/_doc/1
{
"title": "Buttered toasts"
}
{% include copy-curl.html %}
PUT /articles/_doc/2
{
"title": "Buttering a toast"
}
{% include copy-curl.html %}
The `standard` analyzer analyzes the title `Buttered toasts` into [`buttered`, `toasts`] and the title `Buttering a toast` into [`buttering`, `a`, `toast`]. On the other hand, the `english` analyzer produces the same token list [`butter`, `toast`] for both titles because of stemming and stop word removal.
You can use the most_fields
query in order to return as many documents as possible:
GET /articles/_search
{
"query": {
"multi_match": {
"query": "buttered toasts",
"fields": [
"title",
"title.english"
],
"type": "most_fields"
}
}
}
{% include copy-curl.html %}
The preceding query is executed as the following Boolean query:
GET articles/_search
{
"query": {
"bool": {
"should": [
{ "match": { "title": "buttered toasts" }},
{ "match": { "title.english": "buttered toasts" }}
]
}
}
}
To calculate the relevance score, a document's scores for all match
clauses are added together and then the result is divided by the number of match
clauses.
Including the title.english
field retrieves the second document that matches the stemmed tokens:
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.4418206,
"hits": [
{
"_index": "articles",
"_id": "1",
"_score": 1.4418206,
"_source": {
"title": "Buttered toasts"
}
},
{
"_index": "articles",
"_id": "2",
"_score": 0.09304003,
"_source": {
"title": "Buttering a toast"
}
}
]
}
}
Because both title
and title.english
fields match for the first document, it has a higher relevance score.
Operator and minimum should match
The best_fields
and most_fields
queries generate a match query on a field basis (one per field). Thus, the minimum_should_match
and operator
parameters are applied to each field, which is normally not the desired behavior.
For example, consider a customers
index with the following documents:
PUT customers/_doc/1
{
"first_name": "John",
"last_name": "Doe"
}
{% include copy-curl.html %}
PUT customers/_doc/2
{
"first_name": "Jane",
"last_name": "Doe"
}
{% include copy-curl.html %}
If you're searching for John Doe
in the customers
index, you might construct the following query:
GET customers/_search
{
"query": {
"multi_match" : {
"query": "John Doe",
"type": "best_fields",
"fields": [ "first_name", "last_name" ],
"operator": "and"
}
}
}
{% include copy-curl.html %}
The intent of the and
operator in this query is to find a document that matches John
and Doe
. However, the query does not return any results. You can learn how the query is executed by running the Validate API:
GET customers/_validate/query?explain
{
"query": {
"multi_match" : {
"query": "John Doe",
"type": "best_fields",
"fields": [ "first_name", "last_name" ],
"operator": "and"
}
}
}
{% include copy-curl.html %}
From the response, you can see that the query is trying to match both John
and Doe
to either the first_name
or last_name
field:
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"valid": true,
"explanations": [
{
"index": "customers",
"valid": true,
"explanation": "((+first_name:john +first_name:doe) | (+last_name:john +last_name:doe))"
}
]
}
Because neither field contains both words, no results are returned.
A better alternative for searching across fields is to use the cross_fields
query. Unlike the field-centric best_fields
and most_fields
queries, cross_fields
query is term-centric.
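The difference between the two approaches can be sketched in a few lines of Python. This is a simplified model that assumes lowercase whitespace tokenization and ignores scoring entirely; the function names are hypothetical:

```python
def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def field_centric_and(doc, query, fields):
    """best_fields/most_fields with operator=and: one field must contain every term."""
    terms = tokens(query)
    return any(all(term in tokens(doc[field]) for term in terms) for field in fields)

def term_centric_and(doc, query, fields):
    """cross_fields with operator=and: every term must appear in some field."""
    terms = tokens(query)
    return all(any(term in tokens(doc[field]) for field in fields) for term in terms)

customers = {
    1: {"first_name": "John", "last_name": "Doe"},
    2: {"first_name": "Jane", "last_name": "Doe"},
}
fields = ["first_name", "last_name"]

field_hits = [doc_id for doc_id, doc in customers.items()
              if field_centric_and(doc, "John Doe", fields)]
term_hits = [doc_id for doc_id, doc in customers.items()
             if term_centric_and(doc, "John Doe", fields)]
print(field_hits)  # []
print(term_hits)   # [1]
```

The field-centric check finds nothing because no single field holds both words, while the term-centric check finds document 1.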
Cross fields
Use the cross_fields
query to search for data across multiple fields. For example, if an index contains customer data, the first name and last name of the customer reside in different fields. Yet, when you search for John Doe
, you want to receive documents in which John
is in the first_name
field and Doe
is in the last_name
field.
The most_fields
query does not work in this case because of the following problems:
- The `operator` and `minimum_should_match` parameters are applied on a field basis instead of on a term basis.
- Term frequencies in the `first_name` and `last_name` fields can lead to unexpected results. For example, if someone's first name happens to be `Doe`, a document with this name will be presumed a better match because this first name will not appear in any other documents.
The cross_fields
query analyzes the query string into individual terms and then searches for each of the terms in any of the fields, as if they were one field.
The following is the cross_fields
query for John Doe
:
GET /customers/_search
{
"query": {
"multi_match" : {
"query": "John Doe",
"type": "cross_fields",
"fields": [ "first_name", "last_name" ],
"operator": "and"
}
}
}
{% include copy-curl.html %}
The response contains the only document in which both John
and Doe
are present:
{
"took": 19,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.8754687,
"hits": [
{
"_index": "customers",
"_id": "1",
"_score": 0.8754687,
"_source": {
"first_name": "John",
"last_name": "Doe"
}
}
]
}
}
You can use the Validate API operation to gain insight into how the preceding query is executed:
GET /customers/_validate/query?explain
{
"query": {
"multi_match" : {
"query": "John Doe",
"type": "cross_fields",
"fields": [ "first_name", "last_name" ],
"operator": "and"
}
}
}
{% include copy-curl.html %}
From the response, you can see that the query is searching for all terms in at least one field:
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"valid": true,
"explanations": [
{
"index": "customers",
"valid": true,
"explanation": "+blended(terms:[last_name:john, first_name:john]) +blended(terms:[last_name:doe, first_name:doe])"
}
]
}
Thus, blending the term frequencies for all fields solves the problem of differing term frequencies by correcting for the differences.
The cross_fields
query is usually only useful on short string fields with a boost
of 1. In other cases, the score does not produce a meaningful blend of term statistics because of the way boosts, term frequencies, and length normalization contribute to the score.
{: .note}
The fuzziness
parameter is not supported for cross_fields
queries.
{: .note}
Analysis
The cross_fields
query only works as a term-centric query on fields with the same analyzer. Fields with the same analyzer are grouped together and these groups are combined with a Boolean query.
For example, consider an index where the first_name
and last_name
fields are analyzed with the default standard
analyzer and their .edge
subfields are analyzed with an edge n-gram analyzer:
PUT customers
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10
}
}
}
},
"mappings": {
"properties": {
"first_name": {
"type": "text",
"fields": {
"edge": {
"type": "text",
"analyzer": "my_analyzer"
}
}
},
"last_name": {
"type": "text",
"fields": {
"edge": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
}
{% include copy-curl.html %}
You index one document in the customers
index:
PUT /customers/_doc/1
{
"first_name": "John",
"last_name": "Doe"
}
{% include copy-curl.html %}
You can use a `cross_fields` query to search across the fields for `John`:
GET /customers/_search
{
"query": {
"multi_match" : {
"query": "John",
"type": "cross_fields",
"fields": [
"first_name", "first_name.edge",
"last_name", "last_name.edge"
]
}
}
}
{% include copy-curl.html %}
To see how the query is executed, you can run the Validate API:
GET /customers/_validate/query?explain
{
"query": {
"multi_match" : {
"query": "John",
"type": "cross_fields",
"fields": [
"first_name", "first_name.edge",
"last_name", "last_name.edge"
]
}
}
}
{% include copy-curl.html %}
The response shows that the last_name
and first_name
fields are grouped together and treated as a single field. Similarly, the last_name.edge
and first_name.edge
fields are grouped together and treated as a single field:
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"valid": true,
"explanations": [
{
"index": "customers",
"valid": true,
"explanation": "(blended(terms:[last_name:john, first_name:john]) | (blended(terms:[last_name.edge:Jo, first_name.edge:Jo]) blended(terms:[last_name.edge:Joh, first_name.edge:Joh]) blended(terms:[last_name.edge:John, first_name.edge:John])))"
}
]
}
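The grouping visible in this explanation amounts to bucketing fields by their analyzer. The following Python sketch shows that idea only; it is not how OpenSearch implements it, and the `group_by_analyzer` helper and the hand-written field-to-analyzer mapping are assumptions for illustration:

```python
from collections import defaultdict

def group_by_analyzer(field_analyzers):
    """Bucket field names by analyzer name, preserving the order fields were given in."""
    groups = defaultdict(list)
    for field, analyzer in field_analyzers.items():
        groups[analyzer].append(field)
    return dict(groups)

# Analyzer assignments taken from the index mapping defined earlier in this section.
groups = group_by_analyzer({
    "first_name": "standard",
    "first_name.edge": "my_analyzer",
    "last_name": "standard",
    "last_name.edge": "my_analyzer",
})
print(groups)
# {'standard': ['first_name', 'last_name'], 'my_analyzer': ['first_name.edge', 'last_name.edge']}
```

Each bucket behaves like one combined field, and the buckets are then combined with a Boolean query.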
Using the operator
or minimum_should_match
parameters with multiple field groups like the preceding ones can lead to the problem described in the previous section. To avoid it, you can rewrite the previous query as two cross_fields
subqueries combined with a Boolean query and apply the minimum_should_match
to one of the subqueries:
GET /customers/_search
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "John Doe",
"type": "cross_fields",
"fields": [
"first_name",
"last_name"
],
"minimum_should_match": "1"
}
},
{
"multi_match": {
"query": "John Doe",
"type": "cross_fields",
"fields": [
"first_name.edge",
"last_name.edge"
]
}
}
]
}
}
}
{% include copy-curl.html %}
To create one group for all fields, specify an analyzer in your query:
GET customers/_search
{
"query": {
"multi_match" : {
"query": "John Doe",
"type": "cross_fields",
"analyzer": "standard",
"fields": [ "first_name", "last_name", "*.edge" ]
}
}
}
{% include copy-curl.html %}
Running the Validate API on the previous query shows how the query is executed:
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"valid": true,
"explanations": [
{
"index": "customers",
"valid": true,
"explanation": "blended(terms:[last_name.edge:john, last_name:john, first_name:john, first_name.edge:john]) blended(terms:[last_name.edge:doe, last_name:doe, first_name:doe, first_name.edge:doe])"
}
]
}
Phrase
The phrase
query behaves similarly to the best_fields
query but uses a match_phrase
query instead of a match
query.
The following is an example phrase
query for the index described in the best_fields
section:
GET articles/_search
{
"query": {
"multi_match" : {
"query": "northern lights",
"type": "phrase",
"fields": [ "title", "description" ]
}
}
}
{% include copy-curl.html %}
The preceding query is executed as the following dis_max
query with a match_phrase
query for each field:
GET articles/_search
{
"query": {
"dis_max": {
"queries": [
{ "match_phrase": { "title": "northern lights" }},
{ "match_phrase": { "description": "northern lights" }}
]
}
}
}
Because by default a phrase
query matches text only when the terms appear in the same order, only document 1 is returned in the results:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.84407747,
"hits": [
{
"_index": "articles",
"_id": "1",
"_score": 0.84407747,
"_source": {
"title": "Aurora borealis",
"description": "Northern lights, or aurora borealis, explained"
}
}
]
}
}
You can use the `slop` parameter to allow other words between the words in the query phrase. For example, the following query accepts text as a match if up to two words are between `fluorescent` and `therapy`:
GET articles/_search
{
"query": {
"multi_match" : {
"query": "fluorescent therapy",
"type": "phrase",
"fields": [ "title", "description" ],
"slop": 2
}
}
}
{% include copy-curl.html %}
The response contains document 2:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.7003825,
"hits": [
{
"_index": "articles",
"_id": "2",
"_score": 0.7003825,
"_source": {
"title": "Sun deprivation in the Northern countries",
"description": "Using fluorescent lights for therapy"
}
}
]
}
}
For slop
values less than 2, no documents are returned.
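To see why 2 is the threshold here, you can count the words that separate the two query terms. The following Python sketch is a deliberate simplification: it handles only two terms that appear in query order and does not model the reordering cases that Lucene's slop also covers. The function names are hypothetical:

```python
import re

def words_between(text, first, second):
    """Return the fewest words between `first` and a later `second`, or None if absent."""
    tokens = re.findall(r"\w+", text.lower())
    best = None
    for i, token in enumerate(tokens):
        if token != first:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j] == second:
                gap = j - i - 1
                best = gap if best is None else min(best, gap)
                break
    return best

def matches_with_slop(text, first, second, slop):
    """In-order two-term phrase check: the gap must not exceed the slop."""
    gap = words_between(text, first, second)
    return gap is not None and gap <= slop

description = "Using fluorescent lights for therapy"
print(words_between(description, "fluorescent", "therapy"))          # 2
print(matches_with_slop(description, "fluorescent", "therapy", 2))   # True
print(matches_with_slop(description, "fluorescent", "therapy", 1))   # False
```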
The fuzziness
parameter is not supported for phrase
queries.
{: .note}
Phrase prefix
The phrase_prefix
query behaves similarly to the phrase
query but uses a match_phrase_prefix
query instead of a match_phrase
query.
The following is an example phrase_prefix
query for the index described in the best_fields
section:
GET articles/_search
{
"query": {
"multi_match" : {
"query": "northern light",
"type": "phrase_prefix",
"fields": [ "title", "description" ]
}
}
}
{% include copy-curl.html %}
The preceding query is executed as the following dis_max
query with a match_phrase_prefix
query for each field:
GET articles/_search
{
"query": {
"dis_max": {
"queries": [
{ "match_phrase_prefix": { "title": "northern light" }},
{ "match_phrase_prefix": { "description": "northern light" }}
]
}
}
}
You can use the `slop` parameter to allow other words between the words in the query phrase.
The fuzziness
parameter is not supported for phrase_prefix
queries.
{: .note}
Boolean prefix
The bool_prefix
query scores documents similarly to the most_fields
query but uses a match_bool_prefix
query instead of a match
query.
The following is an example bool_prefix
query for the index described in the best_fields
section:
GET articles/_search
{
"query": {
"multi_match" : {
"query": "li northern",
"type": "bool_prefix",
"fields": [ "title", "description" ]
}
}
}
{% include copy-curl.html %}
The preceding query is executed as the following dis_max
query with a match_bool_prefix
query for each field:
GET articles/_search
{
"query": {
"dis_max": {
"queries": [
{ "match_bool_prefix": { "title": "li northern" }},
{ "match_bool_prefix": { "description": "li northern" }}
]
}
}
}
The fuzziness
, prefix_length
, max_expansions
, fuzzy_rewrite
, and fuzzy_transpositions
parameters are supported for the terms that are used to construct term queries, but they do not have an effect on the prefix query constructed from the final term.
{: .note}
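The way a `match_bool_prefix` query treats its terms can be sketched as follows: every term except the last becomes a term query, and the last term becomes a prefix query. This Python model assumes simple lowercase word tokenization and the default `or` operator, and it ignores scoring; `bool_prefix_match` is a hypothetical name:

```python
import re

def bool_prefix_match(text, query):
    """Match if any leading term is present or any token starts with the final term."""
    doc_tokens = re.findall(r"\w+", text.lower())
    query_terms = re.findall(r"\w+", query.lower())
    if not query_terms:
        return False
    *terms, last = query_terms
    term_hit = any(term in doc_tokens for term in terms)
    prefix_hit = any(token.startswith(last) for token in doc_tokens)
    return term_hit or prefix_hit

description = "Using fluorescent lights for therapy"
print(bool_prefix_match(description, "northern li"))  # True: "li" is a prefix of "lights"
print(bool_prefix_match(description, "li northern"))  # False: "li" is now an exact term
```

The two calls show that term order matters: only the final term is treated as a prefix.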
Parameters
The query accepts the following parameters. All parameters except query
are optional.
Parameter | Data type | Description |
---|---|---|
`query` | String | The query string to use for search. Required. |
`auto_generate_synonyms_phrase_query` | Boolean | Specifies whether to create a match phrase query automatically for multi-term synonyms. For example, if you specify `ba,batting average` as synonyms and search for `ba`, OpenSearch searches for `ba OR "batting average"` (if this option is `true`) or `ba OR (batting AND average)` (if this option is `false`). Default is `true`. |
`analyzer` | String | The analyzer used to tokenize the query string text. Default is the index-time analyzer specified for the `default_field`. If no analyzer is specified for the `default_field`, the analyzer is the default analyzer for the index. |
`boost` | Floating-point | Boosts the clause by the given multiplier. Useful for weighing clauses in compound queries. Values in the [0, 1) range decrease relevance, and values greater than 1 increase relevance. Default is `1`. |
`fields` | Array of strings | The list of fields in which to search. If you don't provide the `fields` parameter, the `multi_match` query searches the fields specified in the `index.query.default_field` setting, which defaults to `*`. |
`fuzziness` | String | The number of character edits (insert, delete, substitute) that it takes to change one word to another when determining whether a term matched a value. For example, the distance between `wined` and `wind` is 1. Valid values are non-negative integers or `AUTO`. The default, `AUTO`, chooses a value based on the length of each term and is a good choice for most use cases. Not supported for `phrase`, `phrase_prefix`, and `cross_fields` queries. |
`fuzzy_rewrite` | String | Determines how OpenSearch rewrites the query. Valid values are `constant_score`, `scoring_boolean`, `constant_score_boolean`, `top_terms_N`, `top_terms_boost_N`, and `top_terms_blended_freqs_N`. If the `fuzziness` parameter is not `0`, the query uses a `fuzzy_rewrite` method of `top_terms_blended_freqs_${max_expansions}` by default. Default is `constant_score`. |
`fuzzy_transpositions` | Boolean | Setting `fuzzy_transpositions` to `true` (default) adds swaps of adjacent characters to the insert, delete, and substitute operations of the `fuzziness` option. For example, the distance between `wind` and `wnid` is 1 if `fuzzy_transpositions` is true (swap "n" and "i") and 2 if it is false (delete "n", insert "n"). If `fuzzy_transpositions` is false, `rewind` and `wnid` have the same distance (2) from `wind`, despite the more human-centric opinion that `wnid` is an obvious typo. The default is a good choice for most use cases. |
`lenient` | Boolean | Setting `lenient` to `true` ignores data type mismatches between the query and the document field. For example, a query string of `"8.2"` could match a field of type `float`. Default is `false`. |
`max_expansions` | Positive integer | The maximum number of terms to which the query can expand. Fuzzy queries “expand to” a number of matching terms that are within the distance specified in `fuzziness`. Then OpenSearch tries to match those terms. Default is `50`. |
`minimum_should_match` | Positive or negative integer, positive or negative percentage, combination | If the query string contains multiple search terms and you use the `or` operator, the number of terms that need to match for the document to be considered a match. For example, if `minimum_should_match` is 2, `wind often rising` does not match `The Wind Rises`. If `minimum_should_match` is `1`, it matches. For details, see Minimum should match. |
`operator` | String | If the query string contains multiple search terms, whether all terms need to match (`AND`) or only one term needs to match (`OR`) for a document to be considered a match. Valid values are `OR` (the string `to be` is interpreted as `to OR be`) and `AND` (the string `to be` is interpreted as `to AND be`). Default is `OR`. |
`prefix_length` | Non-negative integer | The number of leading characters that are not considered in fuzziness. Default is `0`. |
`slop` | `0` (default) or a positive integer | Controls the degree to which words in a query can be misordered and still be considered a match. From the Lucene documentation: "The number of other words permitted between words in query phrase. For example, to switch the order of two words requires two moves (the first move places the words atop one another), so to permit reorderings of phrases, the slop must be at least two. A value of zero requires an exact match." Supported for `phrase` and `phrase_prefix` query types. |
`tie_breaker` | Floating-point | A factor between `0` and `1.0` that is used to give more weight to documents that match multiple query clauses. For more information, see The `tie_breaker` parameter. |
`type` | String | The multi-match query type. Valid values are `best_fields`, `most_fields`, `cross_fields`, `phrase`, `phrase_prefix`, `bool_prefix`. Default is `best_fields`. |
`zero_terms_query` | String | In some cases, the analyzer removes all terms from a query string. For example, the `stop` analyzer removes all terms from the string `an but this`. In those cases, `zero_terms_query` specifies whether to match no documents (`none`) or all documents (`all`). Valid values are `none` and `all`. Default is `none`. |
The fuzziness
parameter is not supported for phrase
, phrase_prefix
, and cross_fields
queries.
{: .note}
The slop
parameter is only supported for phrase
and phrase_prefix
queries.
{: .note}
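The distances quoted in the `fuzziness` and `fuzzy_transpositions` descriptions can be reproduced with a textbook edit-distance routine. The following Python sketch uses plain dynamic programming with an optional adjacent-swap step; it is meant to illustrate the numbers, and OpenSearch does not necessarily compute distances this way internally:

```python
def edit_distance(a, b, transpositions=True):
    """Edit distance with insert, delete, substitute, and optional adjacent swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute or keep
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap neighbors
    return d[-1][-1]

print(edit_distance("wined", "wind"))                       # 1
print(edit_distance("wind", "wnid"))                        # 1
print(edit_distance("wind", "wnid", transpositions=False))  # 2
print(edit_distance("wind", "rewind"))                      # 2
```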
The tie_breaker
parameter
Each term-level blended query calculates the document score as the best score returned by any field in a group. The scores from all blended queries are added together to produce the final score. You can change the way the score is calculated by using the tie_breaker
parameter. The tie_breaker
parameter accepts the following values:
- 0.0 (default for `best_fields`, `cross_fields`, `phrase`, and `phrase_prefix` queries): Take the single best score returned by any field in a group.
- 1.0 (default for `most_fields` and `bool_prefix` queries): Add the scores for all fields in a group.
- A floating-point value in the (0, 1) range: Take the single best score of the best-matching field and add (`tie_breaker` * `_score`) for all other matching fields.