Update the how-to section of the docs for 7.0: (#37717)

- new `rank_feature`/`script_score` queries
- new `index_phrases`/`index_prefixes` options
- disabling `_field_names` doesn't help anymore
- adaptive replica selection is on by default
Adrien Grand 2019-03-12 08:23:53 +01:00
parent 5528554aac
commit 466864710a
4 changed files with 141 additions and 21 deletions

@ -114,13 +114,6 @@ The default is `10%` which is often plenty: for example, if you give the JVM
10GB of memory, it will give 1GB to the index buffer, which is enough to host
two shards that are heavily indexing.
[float]
=== Disable `_field_names`
The <<mapping-field-names-field,`_field_names` field>> introduces some
index-time overhead, so you might want to disable it if you never need to
run `exists` queries.
[float]
=== Additional optimizations

@ -3,9 +3,9 @@
This section includes a few recipes to help with common problems:
* <<mixing-exact-search-with-stemming>>
* <<consistent-scoring>>
* <<mixing-exact-search-with-stemming,Mixing exact search with stemming>>
* <<consistent-scoring,Getting consistent scores>>
* <<static-scoring-signals,Incorporating static relevance signals into the score>>
include::recipes/stemming.asciidoc[]
include::recipes/scoring.asciidoc[]

@ -60,8 +60,8 @@ request do not have similar index statistics and relevancy could be bad.
If you have a small dataset, the easiest way to work around this issue is to
index everything into an index that has a single shard
(`index.number_of_shards: 1`). Then index statistics will be the same for all
documents and scores will be consistent.
(`index.number_of_shards: 1`), which is the default. Then index statistics
will be the same for all documents and scores will be consistent.
Otherwise the recommended way to work around this issue is to use the
<<dfs-query-then-fetch,`dfs_query_then_fetch`>> search type. This will make
@ -78,3 +78,125 @@ queries, beware that gathering statistics alone might not be cheap since all
terms have to be looked up in the terms dictionaries in order to look up
statistics.
[[static-scoring-signals]]
=== Incorporating static relevance signals into the score
Many domains have static signals that are known to be correlated with relevance.
For instance, https://en.wikipedia.org/wiki/PageRank[PageRank] and URL length
are two features commonly used in web search to tune the score of web pages
independently of the query.
There are two main queries that allow combining static score contributions with
textual relevance, e.g. as computed with BM25:
- <<query-dsl-script-score-query,`script_score` query>>
- <<query-dsl-rank-feature-query,`rank_feature` query>>
For instance, imagine that you have a `pagerank` field that you wish to
combine with the BM25 score so that the final score is equal to
`score = bm25_score + pagerank / (10 + pagerank)`.
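To make the target formula concrete, here is a minimal Python sketch of the
arithmetic only. It is not Elasticsearch code: the helper names
`saturation` and `combined_score` and the sample BM25 scores are made up for
illustration.

```python
def saturation(value: float, pivot: float) -> float:
    """Return value / (pivot + value), which grows from 0 toward 1."""
    if value < 0 or pivot <= 0:
        raise ValueError("value must be >= 0 and pivot must be > 0")
    return value / (pivot + value)


def combined_score(bm25_score: float, pagerank: float, pivot: float = 10.0) -> float:
    """Add the saturated static signal to the textual relevance score."""
    return bm25_score + saturation(pagerank, pivot)


# Hypothetical BM25 scores and pagerank values, for illustration only.
for bm25, pagerank in [(2.0, 0), (2.0, 10), (2.0, 90)]:
    print(bm25, pagerank, combined_score(bm25, pagerank))
```

A document whose `pagerank` equals the pivot (10 here) gets a contribution of
0.5, and the contribution never exceeds 1 however large `pagerank` becomes.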
With the <<query-dsl-script-score-query,`script_score` query>> the query would
look like this:
//////////////////////////
[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text"
      },
      "pagerank": {
        "type": "long"
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST
//////////////////////////
[source,js]
--------------------------------------------------
GET index/_search
{
  "query" : {
    "script_score" : {
      "query" : {
        "match": { "body": "elasticsearch" }
      },
      "script" : {
        "source" : "_score + saturation(doc['pagerank'].value, 10)" <1>
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
//TEST[continued]
<1> `pagerank` must be mapped as a <<number>>
With the <<query-dsl-rank-feature-query,`rank_feature` query>>, it would
look like this:
//////////////////////////
[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text"
      },
      "pagerank": {
        "type": "rank_feature"
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST
//////////////////////////
[source,js]
--------------------------------------------------
GET _search
{
  "query" : {
    "bool" : {
      "must": {
        "match": { "body": "elasticsearch" }
      },
      "should": {
        "rank_feature": {
          "field": "pagerank", <1>
          "saturation": {
            "pivot": 10
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
<1> `pagerank` must be mapped as a <<rank-feature,`rank_feature`>> field
While both options would return similar scores, there are trade-offs:
<<query-dsl-script-score-query,`script_score`>> provides a lot of flexibility,
enabling you to combine the text relevance score with static signals as you
prefer. On the other hand, the
<<query-dsl-rank-feature-query,`rank_feature` query>> only exposes a couple of
ways to incorporate static signals into the score. However,
it relies on the <<rank-feature,`rank_feature`>> and
<<rank-features,`rank_features`>> fields, which index values in a special way
that allows the <<query-dsl-rank-feature-query,`rank_feature` query>> to skip
over non-competitive documents and get the top matches of a query faster.

@ -395,15 +395,6 @@ be able to cope with `max_failures` node failures at once at most, then the
right number of replicas for you is
`max(max_failures, ceil(num_nodes / num_primaries) - 1)`.
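The formula above can be evaluated with a few lines of Python. This is only a
sketch of the arithmetic; the function name `recommended_replicas` is
hypothetical, and the parameter names simply mirror the symbols in the formula.

```python
import math


def recommended_replicas(max_failures: int, num_nodes: int, num_primaries: int) -> int:
    """Evaluate max(max_failures, ceil(num_nodes / num_primaries) - 1)."""
    if max_failures < 0 or num_nodes < 1 or num_primaries < 1:
        raise ValueError("expected max_failures >= 0, num_nodes >= 1, num_primaries >= 1")
    return max(max_failures, math.ceil(num_nodes / num_primaries) - 1)


# Example inputs chosen for illustration only.
print(recommended_replicas(max_failures=1, num_nodes=3, num_primaries=1))
print(recommended_replicas(max_failures=1, num_nodes=2, num_primaries=4))
```

With 3 nodes and a single primary shard the throughput term dominates
(`ceil(3 / 1) - 1 = 2`), while with 2 nodes and 4 primaries the result falls
back to `max_failures`.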
[float]
=== Turn on adaptive replica selection
When multiple copies of data are present, elasticsearch can use a set of
criteria called <<search-adaptive-replica,adaptive replica selection>> to select
the best copy of the data based on response time, service time, and queue size
of the node containing each copy of the shard. This can improve query throughput
and reduce latency for search-heavy applications.
=== Tune your queries with the Profile API
You can also analyse how expensive each component of your queries and
@ -419,3 +410,17 @@ Some caveats to the Profile API are that:
- the Profile API as a debugging tool adds significant overhead to search execution and can also have a very verbose output
- given the added overhead, the resulting took times are not reliable indicators of actual took time, but can be used comparatively between clauses for relative timing differences
- the Profile API is best for exploring possible reasons behind the most costly clauses of a query but isn't intended for accurately measuring absolute timings of each clause
=== Faster phrase queries with `index_phrases`
The <<text,`text`>> field has an <<index-phrases,`index_phrases`>> option that
indexes 2-shingles and is automatically leveraged by query parsers to run phrase
queries that don't have a slop. If your use-case involves running lots of phrase
queries, this can speed up queries significantly.
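As a conceptual illustration of what a 2-shingle is, the Python sketch below
joins adjacent tokens into pairs. It is not how Lucene or Elasticsearch
implement the option; the function name `two_shingles` is hypothetical, and
whitespace splitting stands in for a real analyzer.

```python
from typing import List


def two_shingles(tokens: List[str]) -> List[str]:
    """Join each pair of adjacent tokens into one 2-shingle term."""
    return [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]


# Whitespace splitting is an assumption standing in for a real analyzer.
document_tokens = "the quick brown fox".split()
print(two_shingles(document_tokens))

# A slop-free phrase of n words corresponds to n - 1 shingle terms.
print(two_shingles("quick brown fox".split()))
```

Each shingle term is typically rarer than the individual words it is made of,
which gives an intuition for why phrase matching over such terms can be faster.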
=== Faster prefix queries with `index_prefixes`
The <<text,`text`>> field has an <<index-prefixes,`index_prefixes`>> option that
indexes prefixes of all terms and is automatically leveraged by query parsers to
run prefix queries. If your use-case involves running lots of prefix queries,
this can speed up queries significantly.
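To picture what "indexes prefixes of all terms" means, here is a small Python
sketch that enumerates the prefixes of a term between two length bounds. It is
an illustration only, not the Elasticsearch implementation: the function name
`term_prefixes` is hypothetical, and the bounds used here (2 and 5 characters)
are assumptions of this sketch.

```python
from typing import List


def term_prefixes(term: str, min_chars: int = 2, max_chars: int = 5) -> List[str]:
    """Return the prefixes of term whose length is in [min_chars, max_chars]."""
    if min_chars < 1 or max_chars < min_chars:
        raise ValueError("expected 1 <= min_chars <= max_chars")
    upper = min(len(term), max_chars)
    return [term[:length] for length in range(min_chars, upper + 1)]


# Illustrative terms only.
print(term_prefixes("quick"))
print(term_prefixes("elasticsearch"))
```

A prefix query whose length falls within the bounds can then be answered by
looking up a single indexed prefix term instead of expanding to every matching
term.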