---
layout: default
title: Match
parent: Full-text queries
grand_parent: Query DSL
nav_order: 10
---
# Match query
Use the `match` query for full-text search on a specific document field. If you run a `match` query on a [`text`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/text/) field, the `match` query [analyzes]({{site.url}}{{site.baseurl}}/analyzers/index/) the provided search string and returns documents that match any of the string's terms. If you run a `match` query on an exact-value field, it returns documents that match the exact value. The preferred way to search exact-value fields is to use a filter because, unlike a query, a filter is cached.
The following example shows a basic `match` query for the word `wind` in the `title` field:
```json
GET _search
{
"query": {
"match": {
"title": "wind"
}
}
}
```
{% include copy-curl.html %}
To pass additional parameters, you can use the expanded syntax:
```json
GET _search
{
"query": {
"match": {
"title": {
"query": "wind",
"analyzer": "stop"
}
}
}
}
```
{% include copy-curl.html %}
## Examples
The following examples use an index that contains these documents:
```json
PUT testindex/_doc/1
{
"title": "Let the wind rise"
}
```
{% include copy-curl.html %}
```json
PUT testindex/_doc/2
{
"title": "Gone with the wind"
}
```
{% include copy-curl.html %}
```json
PUT testindex/_doc/3
{
"title": "Rise is gone"
}
```
{% include copy-curl.html %}
## Operator
If a `match` query is run on a `text` field, the text is analyzed with the analyzer specified in the `analyzer` parameter. Then the resulting tokens are combined into a Boolean query using the operator specified in the `operator` parameter. The default operator is `OR`, so the query `wind rise` is changed into `wind OR rise`. In this example, this query returns documents 1--3 because each document has a term that matches the query. To specify the `and` operator, use the following query:
```json
GET testindex/_search
{
"query": {
"match": {
"title": {
"query": "wind rise",
"operator": "and"
}
}
}
}
```
{% include copy-curl.html %}
The query is constructed as `wind AND rise` and returns document 1 as the matching document:
<details closed markdown="block">
<summary>
Response
</summary>
{: .text-delta}
```json
{
"took": 17,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.2667098,
"hits": [
{
"_index": "testindex",
"_id": "1",
"_score": 1.2667098,
"_source": {
"title": "Let the wind rise"
}
}
]
}
}
```
</details>
### Minimum should match
You can control the minimum number of terms that a document must match to be returned in the results by specifying the [`minimum_should_match`]({{site.url}}{{site.baseurl}}/query-dsl/minimum-should-match/) parameter:
```json
GET testindex/_search
{
"query": {
"match": {
"title": {
"query": "wind rise",
"operator": "or",
"minimum_should_match": 2
}
}
}
}
```
{% include copy-curl.html %}
Now documents are required to match both terms, so only document 1 is returned (this is equivalent to the `and` operator):
<details closed markdown="block">
<summary>
Response
</summary>
{: .text-delta}
```json
{
"took": 23,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.2667098,
"hits": [
{
"_index": "testindex",
"_id": "1",
"_score": 1.2667098,
"_source": {
"title": "Let the wind rise"
}
}
]
}
}
```
</details>
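The way query terms are combined can be sketched in a few lines of Python. This is a toy model, not OpenSearch code: the `tokenize` and `match` helpers are hypothetical, the tokenizer only lowercases and splits on white space (the `standard` analyzer does more), and `minimum_should_match` is modeled only as a positive integer.

```python
# Toy model of how a match query combines query terms; not OpenSearch code.
DOCS = {
    1: "Let the wind rise",
    2: "Gone with the wind",
    3: "Rise is gone",
}

def tokenize(text):
    """Rough stand-in for the standard analyzer: lowercase and split on white space."""
    return text.lower().split()

def match(query, operator="or", minimum_should_match=1):
    """Return the IDs of documents whose title satisfies the combined query."""
    terms = set(tokenize(query))
    # With `and`, every term is required; with `or`, at least
    # `minimum_should_match` terms must be present.
    required = len(terms) if operator == "and" else minimum_should_match
    hits = []
    for doc_id, title in DOCS.items():
        matched = len(terms & set(tokenize(title)))
        if matched >= required:
            hits.append(doc_id)
    return hits

print(match("wind rise"))                          # wind OR rise
print(match("wind rise", operator="and"))          # wind AND rise
print(match("wind rise", minimum_should_match=2))  # at least 2 of the terms
```

In this model, the default `or` operator returns all three documents, while both the `and` operator and `minimum_should_match` set to `2` return only document 1.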
## Analyzer
Because the preceding examples didn't explicitly specify an analyzer, the default `standard` analyzer is used. The default analyzer does not perform stemming, so if you run the query `the wind rises` with the `and` operator, you receive no results because the token `rises` does not match the token `rise`. To change the search analyzer, specify it in the `analyzer` parameter. For example, the following query uses the `english` analyzer:
```json
GET testindex/_search
{
"query": {
"match": {
"title": {
"query": "the wind rises",
"operator": "and",
"analyzer": "english"
}
}
}
}
```
{% include copy-curl.html %}
The `english` analyzer removes the stop word `the` and performs stemming, producing the tokens `wind` and `rise`. Both tokens match document 1, which is returned in the results:
<details closed markdown="block">
<summary>
Response
</summary>
{: .text-delta}
```json
{
"took": 19,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.2667098,
"hits": [
{
"_index": "testindex",
"_id": "1",
"_score": 1.2667098,
"_source": {
"title": "Let the wind rise"
}
}
]
}
}
```
</details>
## Empty query
In some cases, an analyzer might remove all tokens from a query. For example, the `english` analyzer removes stop words, so in a query `and OR or`, all tokens are removed. To check the analyzer behavior, you can use the [Analyze API]({{site.url}}{{site.baseurl}}/api-reference/analyze-apis/#apply-a-built-in-analyzer):
```json
GET testindex/_analyze
{
"analyzer" : "english",
"text" : "and OR or"
}
```
{% include copy-curl.html %}
As expected, the query produces no tokens:
```json
{
"tokens": []
}
```
You can specify the behavior for an empty query in the `zero_terms_query` parameter. Setting `zero_terms_query` to `all` returns all documents in the index and setting it to `none` returns no documents:
```json
GET testindex/_search
{
"query": {
"match": {
"title": {
"query": "and OR or",
"analyzer" : "english",
"zero_terms_query": "all"
}
}
}
}
```
{% include copy-curl.html %}
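The effect of `zero_terms_query` can be illustrated with a small Python model. This is only a sketch: the `analyze` and `match` helpers are hypothetical, and `STOP_WORDS` is a short illustrative subset rather than the actual stop word list of the `english` analyzer.

```python
# Toy model of zero_terms_query; not OpenSearch code.
# STOP_WORDS is an illustrative subset, not the real english stop word list.
STOP_WORDS = {"a", "an", "and", "or", "the", "is", "with"}

DOCS = {
    1: "Let the wind rise",
    2: "Gone with the wind",
    3: "Rise is gone",
}

def analyze(text):
    """Lowercase, split on white space, and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]

def match(query, zero_terms_query="none"):
    """Return matching document IDs, handling the case where no terms remain."""
    terms = analyze(query)
    if not terms:
        # The analyzer removed every term: match everything or nothing.
        return list(DOCS) if zero_terms_query == "all" else []
    return [doc_id for doc_id, title in DOCS.items()
            if set(terms) & set(analyze(title))]

print(match("and OR or", zero_terms_query="all"))
print(match("and OR or", zero_terms_query="none"))
```

With `all`, the empty query matches every document in the model; with `none`, it matches no documents.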
## Fuzziness
To account for typos, you can specify `fuzziness` for your query as either of the following:
- An integer that specifies the maximum allowed [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) for this edit.
- `AUTO`:
  - Strings of 0–2 characters must match exactly.
  - Strings of 3–5 characters allow 1 edit.
  - Strings longer than 5 characters allow 2 edits.
Setting `fuzziness` to the default `AUTO` value works best in most cases:
```json
GET testindex/_search
{
"query": {
"match": {
"title": {
"query": "wnid",
"fuzziness": "AUTO"
}
}
}
}
```
{% include copy-curl.html %}
The token `wnid` matches `wind` and the query returns documents 1 and 2:
<details closed markdown="block">
<summary>
Response
</summary>
{: .text-delta}
```json
{
"took": 31,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.47501624,
"hits": [
{
"_index": "testindex",
"_id": "1",
"_score": 0.47501624,
"_source": {
"title": "Let the wind rise"
}
},
{
"_index": "testindex",
"_id": "2",
"_score": 0.47501624,
"_source": {
"title": "Gone with the wind"
}
}
]
}
}
```
</details>
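The `AUTO` length rules listed in this section can be written as a small function. The following Python sketch only restates those rules; the `auto_fuzziness` name is hypothetical and is not part of any OpenSearch API.

```python
def auto_fuzziness(term):
    """Return the number of edits that AUTO allows for a term, based on its length."""
    length = len(term)
    if length <= 2:
        return 0  # 0-2 characters: must match exactly
    if length <= 5:
        return 1  # 3-5 characters: 1 edit allowed
    return 2      # longer than 5 characters: 2 edits allowed

for term in ("to", "wnid", "rising"):
    print(term, auto_fuzziness(term))
```

The 4-character term `wnid` is allowed 1 edit under these rules, which is why it can match `wind` through a single transposition.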
### Prefix length
Misspellings rarely occur at the beginning of words. Thus, you can specify the number of leading characters that must match exactly for a document to be returned in the results. For example, you can change the preceding query to include a `prefix_length`:
```json
GET testindex/_search
{
"query": {
"match": {
"title": {
"query": "wnid",
"fuzziness": "AUTO",
"prefix_length": 2
}
}
}
}
```
{% include copy-curl.html %}
The preceding query returns no results. If you change the `prefix_length` to 1, documents 1 and 2 are returned because the first letter of the token `wnid` is not misspelled.
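To see why `prefix_length` changes the outcome, consider the following Python sketch. It models only the prefix filter, assuming that candidate terms must share the first `prefix_length` characters with the query term before any edit distance is considered. The `candidates` helper and the `INDEXED_TERMS` list are hypothetical and only approximate the terms in the example index.

```python
# Approximate terms from the three example titles (hypothetical list).
INDEXED_TERMS = ["let", "the", "wind", "rise", "gone", "with", "is"]

def candidates(query_term, indexed_terms, prefix_length):
    """Keep only the terms that share the first `prefix_length` characters with the query term."""
    prefix = query_term[:prefix_length]
    return [term for term in indexed_terms if term.startswith(prefix)]

print(candidates("wnid", INDEXED_TERMS, 2))  # prefix "wn"
print(candidates("wnid", INDEXED_TERMS, 1))  # prefix "w"
```

With a prefix of `wn`, no indexed term qualifies, so fuzzy matching has nothing to compare against. With a prefix of `w`, `wind` remains a candidate and can then be matched within the allowed edit distance.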
### Transpositions
In the preceding example, the word `wnid` contained a transposition (`in` was changed to `ni`). By default, transpositions are allowed in fuzzy matching, but you can disallow them by setting `fuzzy_transpositions` to `false`:
```json
GET testindex/_search
{
"query": {
"match": {
"title": {
"query": "wnid",
"fuzziness": "AUTO",
"fuzzy_transpositions": false
}
}
}
}
```
{% include copy-curl.html %}
Now the query returns no results.
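The difference that transpositions make to the edit distance can be checked with a short Python sketch. It uses the optimal string alignment variant of the Damerau-Levenshtein distance as a stand-in; the `edit_distance` function is hypothetical and is not the implementation OpenSearch uses.

```python
def edit_distance(a, b, transpositions=True):
    """Edit distance between a and b; counts adjacent swaps as 1 edit when transpositions=True."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # Swap of two adjacent characters counts as a single edit.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(edit_distance("wind", "wnid", transpositions=True))
print(edit_distance("wind", "wnid", transpositions=False))
```

With transpositions, `wnid` is 1 edit away from `wind`. Without them, the distance is 2, which exceeds the 1 edit that `AUTO` allows for a 4-character term.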
## Synonyms
If you use a `synonym_graph` filter and `auto_generate_synonyms_phrase_query` is set to `true` (default), OpenSearch parses the query into terms and then combines the terms to generate a [phrase query](https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/PhraseQuery.html) for multi-term synonyms. For example, if you specify `ba,batting average` as synonyms and search for `ba`, OpenSearch searches for `ba OR "batting average"`.
To match multi-term synonyms with conjunctions, set `auto_generate_synonyms_phrase_query` to `false`:
```json
GET /testindex/_search
{
"query": {
"match": {
"text": {
"query": "good ba",
"auto_generate_synonyms_phrase_query": false
}
}
}
}
```
{% include copy-curl.html %}
For the `ba` term, the query produced is `ba OR (batting AND average)`.
## Parameters
The query accepts the name of the field (`<field>`) as a top-level parameter:
```json
GET _search
{
"query": {
"match": {
"<field>": {
"query": "text to search for",
...
}
}
}
}
```
{% include copy-curl.html %}
The `<field>` accepts the following parameters. All parameters except `query` are optional.
Parameter | Data type | Description
:--- | :--- | :---
`query` | String | The query string to use for search. Required.
`auto_generate_synonyms_phrase_query` | Boolean | Specifies whether to create a [match phrase query]({{site.url}}{{site.baseurl}}/query-dsl/full-text/match-phrase/) automatically for multi-term synonyms. For example, if you specify `ba,batting average` as synonyms and search for `ba`, OpenSearch searches for `ba OR "batting average"` (if this option is `true`) or `ba OR (batting AND average)` (if this option is `false`). Default is `true`.
`analyzer` | String | The [analyzer]({{site.url}}{{site.baseurl}}/analyzers/index/) used to tokenize the query string text. Default is the index-time analyzer specified for the `<field>`. If no analyzer is specified for the `<field>`, the `analyzer` is the default analyzer for the index.
`boost` | Floating-point | Boosts the clause by the given multiplier. Useful for weighting clauses in compound queries. Values in the [0, 1) range decrease relevance, and values greater than 1 increase relevance. Default is `1`.
`enable_position_increments` | Boolean | When `true`, resulting queries are aware of position increments. This setting is useful when the removal of stop words leaves an unwanted "gap" between terms. Default is `true`.
`fuzziness` | String | The number of character edits (insert, delete, substitute) that it takes to change one word to another when determining whether a term matched a value. For example, the distance between `wined` and `wind` is 1. Valid values are non-negative integers or `AUTO`. The default, `AUTO`, chooses a value based on the length of each term and is a good choice for most use cases.
`fuzzy_rewrite` | String | Determines how OpenSearch rewrites the query. Valid values are `constant_score`, `scoring_boolean`, `constant_score_boolean`, `top_terms_N`, `top_terms_boost_N`, and `top_terms_blended_freqs_N`. If the `fuzziness` parameter is not `0`, the query uses a `fuzzy_rewrite` method of `top_terms_blended_freqs_${max_expansions}` by default. Default is `constant_score`.
`fuzzy_transpositions` | Boolean | Setting `fuzzy_transpositions` to `true` (default) adds swaps of adjacent characters to the insert, delete, and substitute operations of the `fuzziness` option. For example, the distance between `wind` and `wnid` is 1 if `fuzzy_transpositions` is `true` (swap "n" and "i") and 2 if it is `false` (delete "n", insert "n"). If `fuzzy_transpositions` is `false`, `rewind` and `wnid` have the same distance (2) from `wind`, despite the more human-centric opinion that `wnid` is an obvious typo. The default is a good choice for most use cases.
`lenient` | Boolean | Setting `lenient` to `true` ignores data type mismatches between the query and the document field. For example, a query string of `"8.2"` could match a field of type `float`. Default is `false`.
`max_expansions` | Positive integer | The maximum number of terms to which the query can expand. Fuzzy queries “expand to” a number of matching terms that are within the distance specified in `fuzziness`. Then OpenSearch tries to match those terms. Default is `50`.
`minimum_should_match` | Positive or negative integer, positive or negative percentage, combination | If the query string contains multiple search terms and you use the `or` operator, the number of terms that need to match for the document to be considered a match. For example, if `minimum_should_match` is `2`, `wind often rising` does not match `The Wind Rises`. If `minimum_should_match` is `1`, it matches. For details, see [Minimum should match]({{site.url}}{{site.baseurl}}/query-dsl/minimum-should-match/).
`operator` | String | If the query string contains multiple search terms, whether all terms need to match (`AND`) or only one term needs to match (`OR`) for a document to be considered a match. Valid values are:<br>- `OR`: The string `to be` is interpreted as `to OR be`<br>- `AND`: The string `to be` is interpreted as `to AND be`<br> Default is `OR`.
`prefix_length` | Non-negative integer | The number of leading characters that are not considered in fuzziness. Default is `0`.
`zero_terms_query` | String | In some cases, the analyzer removes all terms from a query string. For example, the `stop` analyzer removes all terms from the string `an but this`. In those cases, `zero_terms_query` specifies whether to match no documents (`none`) or all documents (`all`). Valid values are `none` and `all`. Default is `none`.