[[query-dsl-common-terms-query]]
=== Common Terms Query

The `common` terms query is a modern alternative to stopwords which
improves the precision and recall of search results (by taking stopwords
into account), without sacrificing performance.

[float]
==== The problem

Every term in a query has a cost. A search for `"The brown fox"`
requires three term queries, one for each of `"the"`, `"brown"` and
`"fox"`, all of which are executed against all documents in the index.
The query for `"the"` is likely to match many documents and thus has a
much smaller impact on relevance than the other two terms.

Previously, the solution to this problem was to ignore terms with high
frequency. By treating `"the"` as a _stopword_, we reduce the index size
and reduce the number of term queries that need to be executed.

The problem with this approach is that, while stopwords have a small
impact on relevance, they are still important. If we remove stopwords,
we lose precision (e.g. we are unable to distinguish between `"happy"`
and `"not happy"`) and we lose recall (e.g. text like `"The The"` or
`"To be or not to be"` would simply not exist in the index).

[float]
==== The solution

The `common` terms query divides the query terms into two groups: more
important (i.e. _low frequency_ terms) and less important (i.e. _high
frequency_ terms which would previously have been stopwords).

First it searches for documents which match the more important terms.
These are the terms which appear in fewer documents and have a greater
impact on relevance.

Then, it executes a second query for the less important terms, that is,
terms which appear frequently and have a low impact on relevance. But
instead of calculating the relevance score for *all* matching documents,
it only calculates the `_score` for documents already matched by the
first query. In this way the high frequency terms can improve the
relevance calculation without paying the cost of poor performance.

If a query consists only of high frequency terms, then a single query is
executed as an `AND` (conjunction) query; in other words, all terms are
required. Even though each individual term will match many documents,
the combination of terms narrows down the result set to only the most
relevant. The single query can also be executed as an `OR` with a
specific
<<query-dsl-minimum-should-match,`minimum_should_match`>>;
in this case, a high enough value should probably be used.

Terms are allocated to the high or low frequency groups based on the
`cutoff_frequency`, which can be specified as an absolute frequency
(`>=1`) or as a relative frequency (`0.0 .. 1.0`). (Remember that document
frequencies are computed per shard, as explained in the blog post
{defguide}/relevance-is-broken.html[Relevance is broken].)

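As a minimal sketch of the absolute form, the query below sets
`cutoff_frequency` to a document count instead of a fraction. The field
name `body` and the value `50` are illustrative assumptions. With this
setting, a term should fall into the high frequency group once it appears
in more than 50 documents on a shard. By comparison, a relative value of
`0.001` on a shard assumed to hold 1,000,000 documents corresponds to a
threshold of roughly 1,000 documents.

[source,js]
--------------------------------------------------
{
  "common": {
    "body": {
      "query": "the brown fox",
      "cutoff_frequency": 50
    }
  }
}
--------------------------------------------------
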
Perhaps the most interesting property of this query is that it adapts to
domain specific stopwords automatically. For example, on a video hosting
site, common terms like `"clip"` or `"video"` will automatically behave
as stopwords without the need to maintain a manual list.

[float]
==== Examples

In this example, words that have a document frequency greater than 0.1%
(e.g. `"this"` and `"is"`) will be treated as _common terms_.

[source,js]
--------------------------------------------------
{
  "common": {
    "body": {
      "query": "this is bonsai cool",
      "cutoff_frequency": 0.001
    }
  }
}
--------------------------------------------------

The number of terms which should match can be controlled with the
<<query-dsl-minimum-should-match,`minimum_should_match`>>
(`high_freq`, `low_freq`), `low_freq_operator` (default `"or"`) and
`high_freq_operator` (default `"or"`) parameters.

For low frequency terms, set the `low_freq_operator` to `"and"` to make
all terms required:

[source,js]
--------------------------------------------------
{
  "common": {
    "body": {
      "query": "nelly the elephant as a cartoon",
      "cutoff_frequency": 0.001,
      "low_freq_operator": "and"
    }
  }
}
--------------------------------------------------

which is roughly equivalent to:

[source,js]
--------------------------------------------------
{
  "bool": {
    "must": [
      { "term": { "body": "nelly"}},
      { "term": { "body": "elephant"}},
      { "term": { "body": "cartoon"}}
    ],
    "should": [
      { "term": { "body": "the"}},
      { "term": { "body": "as"}},
      { "term": { "body": "a"}}
    ]
  }
}
--------------------------------------------------

Alternatively, use
<<query-dsl-minimum-should-match,`minimum_should_match`>>
to specify a minimum number or percentage of low frequency terms which
must be present, for instance:

[source,js]
--------------------------------------------------
{
  "common": {
    "body": {
      "query": "nelly the elephant as a cartoon",
      "cutoff_frequency": 0.001,
      "minimum_should_match": 2
    }
  }
}
--------------------------------------------------

which is roughly equivalent to:

[source,js]
--------------------------------------------------
{
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "term": { "body": "nelly"}},
          { "term": { "body": "elephant"}},
          { "term": { "body": "cartoon"}}
        ],
        "minimum_should_match": 2
      }
    },
    "should": [
      { "term": { "body": "the"}},
      { "term": { "body": "as"}},
      { "term": { "body": "a"}}
    ]
  }
}
--------------------------------------------------

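The requirement on low frequency terms can also be written as a
percentage. The query below is a hedged sketch of that form; the value
`"75%"` is an illustrative assumption. Going by the
<<query-dsl-minimum-should-match,`minimum_should_match`>> rules, a
percentage is rounded down, so with the three low frequency terms assumed
here (`nelly`, `elephant` and `cartoon`) 75% works out to two required
terms, the same as the numeric form.

[source,js]
--------------------------------------------------
{
  "common": {
    "body": {
      "query": "nelly the elephant as a cartoon",
      "cutoff_frequency": 0.001,
      "minimum_should_match": "75%"
    }
  }
}
--------------------------------------------------
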
[float]
===== minimum_should_match

A different
<<query-dsl-minimum-should-match,`minimum_should_match`>>
can be applied for low and high frequency terms with the additional
`low_freq` and `high_freq` parameters. Here is an example when providing
additional parameters (note the change in structure):

[source,js]
--------------------------------------------------
{
  "common": {
    "body": {
      "query": "nelly the elephant not as a cartoon",
      "cutoff_frequency": 0.001,
      "minimum_should_match": {
        "low_freq" : 2,
        "high_freq" : 3
      }
    }
  }
}
--------------------------------------------------

which is roughly equivalent to:

[source,js]
--------------------------------------------------
{
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "term": { "body": "nelly"}},
          { "term": { "body": "elephant"}},
          { "term": { "body": "cartoon"}}
        ],
        "minimum_should_match": 2
      }
    },
    "should": {
      "bool": {
        "should": [
          { "term": { "body": "the"}},
          { "term": { "body": "not"}},
          { "term": { "body": "as"}},
          { "term": { "body": "a"}}
        ],
        "minimum_should_match": 3
      }
    }
  }
}
--------------------------------------------------

In this case it means the high frequency terms only have an impact on
relevance when at least three of them are present. But the most
interesting use of the
<<query-dsl-minimum-should-match,`minimum_should_match`>>
for high frequency terms is when there are only high frequency terms:

[source,js]
--------------------------------------------------
{
  "common": {
    "body": {
      "query": "how not to be",
      "cutoff_frequency": 0.001,
      "minimum_should_match": {
        "low_freq" : 2,
        "high_freq" : 3
      }
    }
  }
}
--------------------------------------------------

which is roughly equivalent to:

[source,js]
--------------------------------------------------
{
  "bool": {
    "should": [
      { "term": { "body": "how"}},
      { "term": { "body": "not"}},
      { "term": { "body": "to"}},
      { "term": { "body": "be"}}
    ],
    "minimum_should_match": 3
  }
}
--------------------------------------------------

The query generated for the high frequency terms is then slightly less
restrictive than an `AND`.

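For comparison, the stricter `AND` behaviour can be requested for the
high frequency group with `high_freq_operator`. The query below is a
sketch, not a recommendation: it reuses the earlier example text, and
which terms count as high frequency depends on the index. Assuming low
frequency terms are present, the high frequency group is expected to
stay optional, so this setting should only affect scoring, and only when
every high frequency term matches. Treat that reading as an assumption
and verify it against the version you run.

[source,js]
--------------------------------------------------
{
  "common": {
    "body": {
      "query": "nelly the elephant not as a cartoon",
      "cutoff_frequency": 0.001,
      "high_freq_operator": "and"
    }
  }
}
--------------------------------------------------
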
The `common` terms query also supports `boost`, `analyzer` and
`disable_coord` as parameters.

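A short sketch of two of these parameters in use follows. The values
`2.0` and `"simple"` are illustrative assumptions, chosen only to show
where the parameters sit. Whichever analyzer is used, it should not strip
stopwords, otherwise the high frequency terms never reach the query.

[source,js]
--------------------------------------------------
{
  "common": {
    "body": {
      "query": "nelly the elephant as a cartoon",
      "cutoff_frequency": 0.001,
      "boost": 2.0,
      "analyzer": "simple"
    }
  }
}
--------------------------------------------------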