Update the how-to section of the docs for 7.0: (#37717)
- new `rank_feature`/`script_score` queries
- new `index_phrases`/`index_prefixes` options
- disabling `_field_names` doesn't help anymore
- adaptive replica selection is on by default
parent 5528554aac
commit 466864710a
@@ -114,13 +114,6 @@ The default is `10%` which is often plenty: for example, if you give the JVM
 10GB of memory, it will give 1GB to the index buffer, which is enough to host
 two shards that are heavily indexing.

-[float]
-=== Disable `_field_names`
-
-The <<mapping-field-names-field,`_field_names` field>> introduces some
-index-time overhead, so you might want to disable it if you never need to
-run `exists` queries.
-
 [float]
 === Additional optimizations

@@ -3,9 +3,9 @@

 This section includes a few recipes to help with common problems:

-* <<mixing-exact-search-with-stemming>>
-* <<consistent-scoring>>
+* <<mixing-exact-search-with-stemming,Mixing exact search with stemming>>
+* <<consistent-scoring,Getting consistent scores>>
+* <<static-scoring-signals,Incorporating static relevance signals into the score>>

 include::recipes/stemming.asciidoc[]
 include::recipes/scoring.asciidoc[]

@@ -60,8 +60,8 @@ request do not have similar index statistics and relevancy could be bad.

 If you have a small dataset, the easiest way to work around this issue is to
 index everything into an index that has a single shard
-(`index.number_of_shards: 1`). Then index statistics will be the same for all
-documents and scores will be consistent.
+(`index.number_of_shards: 1`), which is the default. Then index statistics
+will be the same for all documents and scores will be consistent.

 Otherwise the recommended way to work around this issue is to use the
 <<dfs-query-then-fetch,`dfs_query_then_fetch`>> search type. This will make

@@ -78,3 +78,125 @@ queries, beware that gathering statistics alone might not be cheap since all
 terms have to be looked up in the terms dictionaries in order to look up
 statistics.

+[[static-scoring-signals]]
+=== Incorporating static relevance signals into the score
+
+Many domains have static signals that are known to be correlated with relevance.
+For instance, https://en.wikipedia.org/wiki/PageRank[PageRank] and URL length are
+two commonly used features for web search in order to tune the score of web
+pages independently of the query.
+
+There are two main queries that allow combining static score contributions with
+textual relevance, e.g. as computed with BM25:
+- <<query-dsl-script-score-query,`script_score` query>>
+- <<query-dsl-rank-feature-query,`rank_feature` query>>
+
+For instance, imagine that you have a `pagerank` field that you wish to
+combine with the BM25 score so that the final score is equal to
+`score = bm25_score + pagerank / (10 + pagerank)`.
+
+With the <<query-dsl-script-score-query,`script_score` query>> the query would
+look like this:
+
+//////////////////////////
+
+[source,js]
+--------------------------------------------------
+PUT index
+{
+  "mappings": {
+    "properties": {
+      "body": {
+        "type": "text"
+      },
+      "pagerank": {
+        "type": "long"
+      }
+    }
+  }
+}
+--------------------------------------------------
+// CONSOLE
+// TEST
+
+//////////////////////////
+
+[source,js]
+--------------------------------------------------
+GET index/_search
+{
+  "query" : {
+    "script_score" : {
+      "query" : {
+        "match": { "body": "elasticsearch" }
+      },
+      "script" : {
+        "source" : "_score + saturation(doc['pagerank'].value, 10)" <1>
+      }
+    }
+  }
+}
+--------------------------------------------------
+// CONSOLE
+//TEST[continued]
+<1> `pagerank` must be mapped as a <<number>>
+
+while with the <<query-dsl-rank-feature-query,`rank_feature` query>> it would
+look like below:
+
+//////////////////////////
+
+[source,js]
+--------------------------------------------------
+PUT index
+{
+  "mappings": {
+    "properties": {
+      "body": {
+        "type": "text"
+      },
+      "pagerank": {
+        "type": "rank_feature"
+      }
+    }
+  }
+}
+--------------------------------------------------
+// CONSOLE
+// TEST
+
+//////////////////////////
+
+[source,js]
+--------------------------------------------------
+GET _search
+{
+  "query" : {
+    "bool" : {
+      "must": {
+        "match": { "body": "elasticsearch" }
+      },
+      "should": {
+        "rank_feature": {
+          "field": "pagerank", <1>
+          "saturation": {
+            "pivot": 10
+          }
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
+// CONSOLE
+<1> `pagerank` must be mapped as a <<rank-feature,`rank_feature`>> field
+
+While both options would return similar scores, there are trade-offs:
+<<query-dsl-script-score-query,`script_score`>> provides a lot of flexibility,
+enabling you to combine the text relevance score with static signals as you
+prefer. On the other hand, the <<query-dsl-rank-feature-query,`rank_feature` query>> only
+exposes a couple of ways to incorporate static signals into the score. However,
+it relies on the <<rank-feature,`rank_feature`>> and
+<<rank-features,`rank_features`>> fields, which index values in a special way
+that allows the <<query-dsl-rank-feature-query,`rank_feature` query>> to skip
+over non-competitive documents and get the top matches of a query faster.

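Reviewer sketch (not part of the commit): the formula quoted in the hunk above, `score = bm25_score + pagerank / (10 + pagerank)`, can be checked by hand with a few lines of Python. The `saturation` and `combined_score` helpers below are hypothetical stand-ins written for this note, and the BM25 values are made-up inputs rather than anything returned by Elasticsearch.

```python
def saturation(value: float, pivot: float) -> float:
    """Return value / (value + pivot).

    The result is 0 when value is 0, 0.5 when value equals the pivot,
    and approaches 1 as value grows.
    """
    if value < 0 or pivot <= 0:
        raise ValueError("value must be >= 0 and pivot must be > 0")
    return value / (value + pivot)


def combined_score(bm25_score: float, pagerank: float, pivot: float = 10.0) -> float:
    """Add the saturated static signal to the textual relevance score."""
    return bm25_score + saturation(pagerank, pivot)


if __name__ == "__main__":
    # Hypothetical documents that share the same BM25 score of 2.0.
    for pagerank in (0, 10, 90):
        print(pagerank, combined_score(2.0, pagerank))
```

Because the static contribution is bounded by 1, a very large `pagerank` can only shift a document's score by a limited amount relative to its textual relevance.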
@@ -395,15 +395,6 @@ be able to cope with `max_failures` node failures at once at most, then the
 right number of replicas for you is
 `max(max_failures, ceil(num_nodes / num_primaries) - 1)`.

-[float]
-=== Turn on adaptive replica selection
-
-When multiple copies of data are present, elasticsearch can use a set of
-criteria called <<search-adaptive-replica,adaptive replica selection>> to select
-the best copy of the data based on response time, service time, and queue size
-of the node containing each copy of the shard. This can improve query throughput
-and reduce latency for search-heavy applications.
-
 === Tune your queries with the Profile API

 You can also analyse how expensive each component of your queries and

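Reviewer sketch (not part of the commit): the replica formula kept as context in the hunk above, `max(max_failures, ceil(num_nodes / num_primaries) - 1)`, is easy to evaluate for a concrete cluster. The function name `recommended_replicas` and the sample cluster sizes are assumptions made for this note.

```python
import math


def recommended_replicas(max_failures: int, num_nodes: int, num_primaries: int) -> int:
    """Evaluate max(max_failures, ceil(num_nodes / num_primaries) - 1)."""
    if max_failures < 0 or num_nodes < 1 or num_primaries < 1:
        raise ValueError("max_failures must be >= 0; num_nodes and num_primaries must be >= 1")
    return max(max_failures, math.ceil(num_nodes / num_primaries) - 1)


if __name__ == "__main__":
    # Hypothetical cluster: 6 nodes, 2 primary shards, tolerate 1 node failure.
    print(recommended_replicas(max_failures=1, num_nodes=6, num_primaries=2))
```

With more nodes than primaries, the second term dominates so that every node can hold a shard copy; otherwise the failure tolerance sets the replica count.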
@@ -419,3 +410,17 @@ Some caveats to the Profile API are that:
 - the Profile API as a debugging tool adds significant overhead to search execution and can also have a very verbose output
 - given the added overhead, the resulting took times are not reliable indicators of actual took time, but can be used comparatively between clauses for relative timing differences
 - the Profile API is best for exploring possible reasons behind the most costly clauses of a query but isn't intended for accurately measuring absolute timings of each clause
+
+=== Faster phrase queries with `index_phrases`
+
+The <<text,`text`>> field has an <<index-phrases,`index_phrases`>> option that
+indexes 2-shingles and is automatically leveraged by query parsers to run phrase
+queries that don't have a slop. If your use case involves running lots of phrase
+queries, this can speed up queries significantly.
+
+=== Faster prefix queries with `index_prefixes`
+
+The <<text,`text`>> field has an <<index-prefixes,`index_prefixes`>> option that
+indexes prefixes of all terms and is automatically leveraged by query parsers to
+run prefix queries. If your use case involves running lots of prefix queries,
+this can speed up queries significantly.

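Reviewer sketch (not part of the commit): to make "2-shingles" and "prefixes of all terms" concrete, the snippet below shows, in plain Python, the kind of extra terms such options would add for a tokenized field. It is a simplified illustration only, not how Lucene or Elasticsearch implements either option; the helper names are hypothetical, and the prefix length bounds of 2 and 5 characters are assumed values chosen for the example.

```python
from typing import List


def two_shingles(tokens: List[str]) -> List[str]:
    """Pair each token with its successor, e.g. ["a", "b", "c"] -> ["a b", "b c"]."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def term_prefixes(term: str, min_chars: int = 2, max_chars: int = 5) -> List[str]:
    """Return the prefixes of term whose length lies between min_chars and max_chars."""
    longest = min(len(term), max_chars)
    return [term[:length] for length in range(min_chars, longest + 1)]


if __name__ == "__main__":
    print(two_shingles(["quick", "brown", "fox"]))
    print(term_prefixes("elastic"))
```

A sloppy-free phrase query such as "quick brown" can then be answered by looking up a single pre-built term instead of checking positions, and a prefix query such as `ela*` by looking up one indexed prefix, which is the intuition behind the speed-ups described in the hunk above.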