495 lines
14 KiB
Plaintext
495 lines
14 KiB
Plaintext
[[query-dsl-multi-match-query]]
|
|
=== Multi Match Query
|
|
|
|
The `multi_match` query builds on the <<query-dsl-match-query,`match` query>>
|
|
to allow multi-field queries:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"multi_match" : {
|
|
"query": "this is a test", <1>
|
|
"fields": [ "subject", "message" ] <2>
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
<1> The query string.
|
|
<2> The fields to be queried.
|
|
|
|
[float]
|
|
==== `fields` and per-field boosting
|
|
|
|
Fields can be specified with wildcards, eg:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"multi_match" : {
|
|
"query": "Will Smith",
|
|
"fields": [ "title", "*_name" ] <1>
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
<1> Query the `title`, `first_name` and `last_name` fields.
|
|
|
|
Individual fields can be boosted with the caret (`^`) notation:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"multi_match" : {
|
|
"query" : "this is a test",
|
|
"fields" : [ "subject^3", "message" ] <1>
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
<1> The `subject` field is three times as important as the `message` field.
|
|
|
|
[[multi-match-types]]
|
|
[float]
|
|
==== Types of `multi_match` query:
|
|
|
|
The way the `multi_match` query is executed internally depends on the `type`
|
|
parameter, which can be set to:
|
|
|
|
[horizontal]
|
|
`best_fields`:: (*default*) Finds documents which match any field, but
|
|
uses the `_score` from the best field. See <<type-best-fields>>.
|
|
|
|
`most_fields`:: Finds documents which match any field and combines
|
|
the `_score` from each field. See <<type-most-fields>>.
|
|
|
|
`cross_fields`:: Treats fields with the same `analyzer` as though they
|
|
were one big field. Looks for each word in *any*
|
|
field. See <<type-cross-fields>>.
|
|
|
|
`phrase`:: Runs a `match_phrase` query on each field and combines
|
|
the `_score` from each field. See <<type-phrase>>.
|
|
|
|
`phrase_prefix`:: Runs a `match_phrase_prefix` query on each field and
|
|
combines the `_score` from each field. See <<type-phrase>>.
|
|
|
|
[[type-best-fields]]
|
|
==== `best_fields`
|
|
|
|
The `best_fields` type is most useful when you are searching for multiple
|
|
words best found in the same field. For instance ``brown fox'' in a single
|
|
field is more meaningful than ``brown'' in one field and ``fox'' in the other.
|
|
|
|
The `best_fields` type generates a <<query-dsl-match-query,`match` query>> for
|
|
each field and wraps them in a <<query-dsl-dis-max-query,`dis_max`>> query, to
|
|
find the single best matching field. For instance, this query:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"multi_match" : {
|
|
"query": "brown fox",
|
|
"type": "best_fields",
|
|
"fields": [ "subject", "message" ],
|
|
"tie_breaker": 0.3
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
would be executed as:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"dis_max": {
|
|
"queries": [
|
|
{ "match": { "subject": "brown fox" }},
|
|
{ "match": { "message": "brown fox" }}
|
|
],
|
|
"tie_breaker": 0.3
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
Normally the `best_fields` type uses the score of the *single* best matching
|
|
field, but if `tie_breaker` is specified, then it calculates the score as
|
|
follows:
|
|
|
|
* the score from the best matching field
|
|
* plus `tie_breaker * _score` for all other matching fields
|
|
|
|
Also, accepts `analyzer`, `boost`, `operator`, `minimum_should_match`,
|
|
`fuzziness`, `prefix_length`, `max_expansions`, `rewrite`, `zero_terms_query`
|
|
and `cutoff_frequency`, as explained in <<query-dsl-match-query, match query>>.
|
|
|
|
[IMPORTANT]
|
|
[[operator-min]]
|
|
.`operator` and `minimum_should_match`
|
|
===================================================
|
|
|
|
The `best_fields` and `most_fields` types are _field-centric_ -- they generate
|
|
a `match` query *per field*. This means that the `operator` and
|
|
`minimum_should_match` parameters are applied to each field individually,
|
|
which is probably not what you want.
|
|
|
|
Take this query for example:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"multi_match" : {
|
|
"query": "Will Smith",
|
|
"type": "best_fields",
|
|
"fields": [ "first_name", "last_name" ],
|
|
"operator": "and" <1>
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
<1> All terms must be present.
|
|
|
|
This query is executed as:
|
|
|
|
(+first_name:will +first_name:smith)
|
|
| (+last_name:will +last_name:smith)
|
|
|
|
In other words, *all terms* must be present *in a single field* for a document
|
|
to match.
|
|
|
|
See <<type-cross-fields>> for a better solution.
|
|
|
|
===================================================
|
|
|
|
[[type-most-fields]]
|
|
==== `most_fields`
|
|
|
|
The `most_fields` type is most useful when querying multiple fields that
|
|
contain the same text analyzed in different ways. For instance, the main
|
|
field may contain synonyms, stemming and terms without diacritics. A second
|
|
field may contain the original terms, and a third field might contain
|
|
shingles. By combining scores from all three fields we can match as many
|
|
documents as possible with the main field, but use the second and third fields
|
|
to push the most similar results to the top of the list.
|
|
|
|
This query:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"multi_match" : {
|
|
"query": "quick brown fox",
|
|
"type": "most_fields",
|
|
"fields": [ "title", "title.original", "title.shingles" ]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
would be executed as:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"bool": {
|
|
"should": [
|
|
{ "match": { "title": "quick brown fox" }},
|
|
{ "match": { "title.original": "quick brown fox" }},
|
|
{ "match": { "title.shingles": "quick brown fox" }}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
The score from each `match` clause is added together, then divided by the
|
|
number of `match` clauses.
|
|
|
|
Also, accepts `analyzer`, `boost`, `operator`, `minimum_should_match`,
|
|
`fuzziness`, `prefix_length`, `max_expansions`, `rewrite`, `zero_terms_query`
|
|
and `cutoff_frequency`, as explained in <<query-dsl-match-query,match query>>, but
|
|
*see <<operator-min>>*.
|
|
|
|
[[type-phrase]]
|
|
==== `phrase` and `phrase_prefix`
|
|
|
|
The `phrase` and `phrase_prefix` types behave just like <<type-best-fields>>,
|
|
but they use a `match_phrase` or `match_phrase_prefix` query instead of a
|
|
`match` query.
|
|
|
|
This query:
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"multi_match" : {
|
|
"query": "quick brown f",
|
|
"type": "phrase_prefix",
|
|
"fields": [ "subject", "message" ]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
would be executed as:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"dis_max": {
|
|
"queries": [
|
|
{ "match_phrase_prefix": { "subject": "quick brown f" }},
|
|
{ "match_phrase_prefix": { "message": "quick brown f" }}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
Also, accepts `analyzer`, `boost`, `slop` and `zero_terms_query` as explained
|
|
in <<query-dsl-match-query>>. Type `phrase_prefix` additionally accepts
|
|
`max_expansions`.
|
|
|
|
[[type-cross-fields]]
|
|
==== `cross_fields`
|
|
|
|
The `cross_fields` type is particularly useful with structured documents where
|
|
multiple fields *should* match. For instance, when querying the `first_name`
|
|
and `last_name` fields for ``Will Smith'', the best match is likely to have
|
|
``Will'' in one field and ``Smith'' in the other.
|
|
|
|
****
|
|
|
|
This sounds like a job for <<type-most-fields>> but there are two problems
|
|
with that approach. The first problem is that `operator` and
|
|
`minimum_should_match` are applied per-field, instead of per-term (see
|
|
<<operator-min,explanation above>>).
|
|
|
|
The second problem is to do with relevance: the different term frequencies in
|
|
the `first_name` and `last_name` fields can produce unexpected results.
|
|
|
|
For instance, imagine we have two people: ``Will Smith'' and ``Smith Jones''.
|
|
``Smith'' as a last name is very common (and so is of low importance) but
|
|
``Smith'' as a first name is very uncommon (and so is of great importance).
|
|
|
|
If we do a search for ``Will Smith'', the ``Smith Jones'' document will
|
|
probably appear above the better matching ``Will Smith'' because the score of
|
|
`first_name:smith` has trumped the combined scores of `first_name:will` plus
|
|
`last_name:smith`.
|
|
|
|
****
|
|
|
|
One way of dealing with these types of queries is simply to index the
|
|
`first_name` and `last_name` fields into a single `full_name` field. Of
|
|
course, this can only be done at index time.
|
|
|
|
The `cross_field` type tries to solve these problems at query time by taking a
|
|
_term-centric_ approach. It first analyzes the query string into individual
|
|
terms, then looks for each term in any of the fields, as though they were one
|
|
big field.
|
|
|
|
A query like:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"multi_match" : {
|
|
"query": "Will Smith",
|
|
"type": "cross_fields",
|
|
"fields": [ "first_name", "last_name" ],
|
|
"operator": "and"
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
is executed as:
|
|
|
|
+(first_name:will last_name:will)
|
|
+(first_name:smith last_name:smith)
|
|
|
|
In other words, *all terms* must be present *in at least one field* for a
|
|
document to match. (Compare this to
|
|
<<operator-min,the logic used for `best_fields` and `most_fields`>>.)
|
|
|
|
That solves one of the two problems. The problem of differing term frequencies
|
|
is solved by _blending_ the term frequencies for all fields in order to even
|
|
out the differences.
|
|
|
|
In practice, `first_name:smith` will be treated as though it has the same
|
|
frequencies as `last_name:smith`, plus one. This will make matches on
|
|
`first_name` and `last_name` have comparable scores, with a tiny advantage
|
|
for `last_name` since it is the most likely field that contains `smith`.
|
|
|
|
Note that `cross_fields` is usually only useful on short string fields
|
|
that all have a `boost` of `1`. Otherwise boosts, term freqs and length
|
|
normalization contribute to the score in such a way that the blending of term
|
|
statistics is not meaningful anymore.
|
|
|
|
If you run the above query through the <<search-validate>>, it returns this
|
|
explanation:
|
|
|
|
+blended("will", fields: [first_name, last_name])
|
|
+blended("smith", fields: [first_name, last_name])
|
|
|
|
Also, accepts `analyzer`, `boost`, `operator`, `minimum_should_match`,
|
|
`zero_terms_query` and `cutoff_frequency`, as explained in
|
|
<<query-dsl-match-query, match query>>.
|
|
|
|
===== `cross_field` and analysis
|
|
|
|
The `cross_field` type can only work in term-centric mode on fields that have
|
|
the same analyzer. Fields with the same analyzer are grouped together as in
|
|
the example above. If there are multiple groups, they are combined with a
|
|
`bool` query.
|
|
|
|
For instance, if we have a `first` and `last` field which have
|
|
the same analyzer, plus a `first.edge` and `last.edge` which
|
|
both use an `edge_ngram` analyzer, this query:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"multi_match" : {
|
|
"query": "Jon",
|
|
"type": "cross_fields",
|
|
"fields": [
|
|
"first", "first.edge",
|
|
"last", "last.edge"
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
would be executed as:
|
|
|
|
blended("jon", fields: [first, last])
|
|
| (
|
|
blended("j", fields: [first.edge, last.edge])
|
|
blended("jo", fields: [first.edge, last.edge])
|
|
blended("jon", fields: [first.edge, last.edge])
|
|
)
|
|
|
|
In other words, `first` and `last` would be grouped together and
|
|
treated as a single field, and `first.edge` and `last.edge` would be
|
|
grouped together and treated as a single field.
|
|
|
|
Having multiple groups is fine, but when combined with `operator` or
|
|
`minimum_should_match`, it can suffer from the <<operator-min,same problem>>
|
|
as `most_fields` or `best_fields`.
|
|
|
|
You can easily rewrite this query yourself as two separate `cross_fields`
|
|
queries combined with a `bool` query, and apply the `minimum_should_match`
|
|
parameter to just one of them:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"bool": {
|
|
"should": [
|
|
{
|
|
"multi_match" : {
|
|
"query": "Will Smith",
|
|
"type": "cross_fields",
|
|
"fields": [ "first", "last" ],
|
|
"minimum_should_match": "50%" <1>
|
|
}
|
|
},
|
|
{
|
|
"multi_match" : {
|
|
"query": "Will Smith",
|
|
"type": "cross_fields",
|
|
"fields": [ "*.edge" ]
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
<1> Either `will` or `smith` must be present in either of the `first`
|
|
or `last` fields
|
|
|
|
You can force all fields into the same group by specifying the `analyzer`
|
|
parameter in the query.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET /_search
|
|
{
|
|
"query": {
|
|
"multi_match" : {
|
|
"query": "Jon",
|
|
"type": "cross_fields",
|
|
"analyzer": "standard", <1>
|
|
"fields": [ "first", "last", "*.edge" ]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
|
|
<1> Use the `standard` analyzer for all fields.
|
|
|
|
which will be executed as:
|
|
|
|
blended("will", fields: [first, first.edge, last.edge, last])
|
|
blended("smith", fields: [first, first.edge, last.edge, last])
|
|
|
|
===== `tie_breaker`
|
|
|
|
By default, each per-term `blended` query will use the best score returned by
|
|
any field in a group, then these scores are added together to give the final
|
|
score. The `tie_breaker` parameter can change the default behaviour of the
|
|
per-term `blended` queries. It accepts:
|
|
|
|
[horizontal]
|
|
`0.0`:: Take the single best score out of (eg) `first_name:will`
|
|
and `last_name:will` (*default*)
|
|
`1.0`:: Add together the scores for (eg) `first_name:will` and
|
|
`last_name:will`
|
|
`0.0 < n < 1.0`:: Take the single best score plus +tie_breaker+ multiplied
|
|
by each of the scores from other matching fields.
|