[[consistent-scoring]]
=== Getting consistent scoring

The fact that Elasticsearch operates with shards and replicas adds challenges
when it comes to having good scoring.

[discrete]
==== Scores are not reproducible

Say the same user runs the same request twice in a row and documents do not come
back in the same order both times. This is a pretty bad experience, isn't it?
Unfortunately this is something that can happen if you have replicas
(`index.number_of_replicas` is greater than 0). The reason is that Elasticsearch
selects the shards that the query should go to in a round-robin fashion, so it
is quite likely if you run the same query twice in a row that it will go to
different copies of the same shard.

Now why is it a problem? Index statistics are an important part of the score.
And these index statistics may be different across copies of the same shard
due to deleted documents. As you may know, when documents are deleted or updated,
the old document is not immediately removed from the index. It is just marked
as deleted and it will only be removed from disk the next time that the
segment this old document belongs to is merged. However, for practical reasons,
those deleted documents are taken into account for index statistics. So imagine
that the primary shard just finished a large merge that removed lots of deleted
documents. It might then have index statistics that are sufficiently different
from the replica (which still has plenty of deleted documents) that scores
are different too.

The recommended way to work around this issue is to use a string that identifies
the user that is logged in (a user id or session id for instance) as a
<<search-preference,preference>>. This ensures that all queries of a
given user are always going to hit the same shards, so scores remain more
consistent across queries.

This workaround has another benefit: when two documents have the same score,
they will be sorted by their internal Lucene doc id (which is unrelated to the
`_id`) by default. However these doc ids could be different across copies of
the same shard. So by always hitting the same shard, we would get more
consistent ordering of documents that have the same scores.
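
A request that pins a user to the same shard copies could look like the
following sketch. The index name `index`, the `body` field and the preference
value `user-1234` are illustrative assumptions; any stable string that does not
start with `_` can serve as a custom preference value.

[source,console]
--------------------------------------------------
GET index/_search?preference=user-1234
{
  "query": {
    "match": { "body": "elasticsearch" }
  }
}
--------------------------------------------------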

[discrete]
==== Relevancy looks wrong

If you notice that two documents with the same content get different scores or
that an exact match is not ranked first, then the issue might be related to
sharding. By default, Elasticsearch makes each shard responsible for producing
its own scores. However since index statistics are an important contributor to
the scores, this only works well if shards have similar index statistics. The
assumption is that since documents are routed evenly to shards by default, then
index statistics should be very similar and scoring would work as expected.
However in the event that you either:

- use routing at index time,
- query multiple _indices_,
- or have too little data in your index

then there is a good chance that the shards involved in the search
request do not have similar index statistics and relevancy could be bad.

If you have a small dataset, the easiest way to work around this issue is to
index everything into an index that has a single shard
(`index.number_of_shards: 1`), which is the default. Then index statistics
will be the same for all documents and scores will be consistent.
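
As a minimal sketch, the single-shard setting can also be stated explicitly
when the index is created, which guards against a different default being
configured elsewhere. The index name `small-index` is a hypothetical
placeholder.

[source,console]
--------------------------------------------------
PUT small-index
{
  "settings": {
    "index.number_of_shards": 1
  }
}
--------------------------------------------------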

Otherwise the recommended way to work around this issue is to use the
<<dfs-query-then-fetch,`dfs_query_then_fetch`>> search type. This will make
Elasticsearch perform an initial round trip to all involved shards, asking
them for their index statistics relative to the query. The coordinating
node then merges those statistics and sends the merged statistics alongside the
request when asking shards to perform the `query` phase, so that shards can
use these global statistics rather than their own statistics in order to do the
scoring.
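
The search type is passed as a request parameter. The following sketch assumes
an index named `index` with a `body` text field; both names are placeholders
for illustration.

[source,console]
--------------------------------------------------
GET index/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": { "body": "elasticsearch" }
  }
}
--------------------------------------------------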

In most cases, this additional round trip should be very cheap. However in the
event that your query contains a very large number of fields/terms or fuzzy
queries, beware that gathering statistics alone might not be cheap since all
terms have to be looked up in the terms dictionaries in order to look up
statistics.

[[static-scoring-signals]]
=== Incorporating static relevance signals into the score

Many domains have static signals that are known to be correlated with relevance.
For instance https://en.wikipedia.org/wiki/PageRank[PageRank] and URL length are
two commonly used features for web search in order to tune the score of web
pages independently of the query.

There are two main queries that allow combining static score contributions with
textual relevance, e.g. as computed with BM25:

- <<query-dsl-script-score-query,`script_score` query>>
- <<query-dsl-rank-feature-query,`rank_feature` query>>

For instance imagine that you have a `pagerank` field that you wish to
combine with the BM25 score so that the final score is equal to
`score = bm25_score + pagerank / (10 + pagerank)`.

With the <<query-dsl-script-score-query,`script_score` query>> the query would
look like this:

//////////////////////////

[source,console]
--------------------------------------------------
PUT index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text"
      },
      "pagerank": {
        "type": "long"
      }
    }
  }
}
--------------------------------------------------

//////////////////////////

[source,console]
--------------------------------------------------
GET index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match": { "body": "elasticsearch" }
      },
      "script": {
        "source": "_score + saturation(doc['pagerank'].value, 10)" <1>
      }
    }
  }
}
--------------------------------------------------
//TEST[continued]

<1> `pagerank` must be mapped as a <<number>>

With the <<query-dsl-rank-feature-query,`rank_feature` query>> it would
look like this:

//////////////////////////

[source,console]
--------------------------------------------------
PUT index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text"
      },
      "pagerank": {
        "type": "rank_feature"
      }
    }
  }
}
--------------------------------------------------
// TEST

//////////////////////////

[source,console]
--------------------------------------------------
GET _search
{
  "query": {
    "bool": {
      "must": {
        "match": { "body": "elasticsearch" }
      },
      "should": {
        "rank_feature": {
          "field": "pagerank", <1>
          "saturation": {
            "pivot": 10
          }
        }
      }
    }
  }
}
--------------------------------------------------

<1> `pagerank` must be mapped as a <<rank-feature,`rank_feature`>> field

While both options would return similar scores, there are trade-offs:
<<query-dsl-script-score-query,`script_score`>> provides a lot of flexibility,
enabling you to combine the text relevance score with static signals as you
prefer. On the other hand, the
<<query-dsl-rank-feature-query,`rank_feature` query>> only
exposes a couple of ways to incorporate static signals into the score. However,
it relies on the <<rank-feature,`rank_feature`>> and
<<rank-features,`rank_features`>> fields, which index values in a special way
that allows the <<query-dsl-rank-feature-query,`rank_feature` query>> to skip
over non-competitive documents and get the top matches of a query faster.
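
As one illustration of the other functions the `rank_feature` query exposes,
the sketch below swaps `saturation` for `log`. It assumes `pagerank` is mapped
as a `rank_feature` field, and the `scaling_factor` of `4` is an arbitrary
value chosen for the example, not a recommendation.

[source,console]
--------------------------------------------------
GET index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "body": "elasticsearch" }
      },
      "should": {
        "rank_feature": {
          "field": "pagerank",
          "log": {
            "scaling_factor": 4
          }
        }
      }
    }
  }
}
--------------------------------------------------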