More advice around search speed and disk usage. (#25252)

It adds notes about:

- how preference can help optimize cache usage
- the fact that too many replicas can hurt search performance due to lower utilization of the filesystem cache
- how index sorting can improve _source compression
- how always putting fields in the same order in documents can improve _source compression
2017-06-16 11:23:40 +02:00 · 2017-06-16 11:23:40 +02:00 · 8c869e2a0b
parent ccb3c9aae7
commit 8c869e2a0b
2 changed files with 63 additions and 0 deletions
--- a/docs/reference/how-to/disk-usage.asciidoc
+++ b/docs/reference/how-to/disk-usage.asciidoc
@@ -158,3 +158,24 @@ on disk usage. In particular, integers should be stored using an integer type
 stored in a `scaled_float` if appropriate or in the smallest type that fits the
 use-case: using `float` over `double`, or `half_float` over `float` will help
 save storage.
 [float]
 === Use index sorting to colocate similar documents
 When Elasticsearch stores `_source`, it compresses multiple documents at once
 in order to improve the overall compression ratio. For instance it is very
 common that documents share the same field names, and quite common that they
 share some field values, especially on fields that have a low cardinality or
 a https://en.wikipedia.org/wiki/Zipf%27s_law[zipfian] distribution.
 By default documents are compressed together in the order that they are added
 to the index. If you enable <<index-modules-index-sorting,index sorting>>,
 they are instead compressed in sorted order. Sorting documents with similar
 structure, fields, and values together should improve the compression ratio.
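As a minimal sketch of what enabling index sorting could look like, the Python helper below builds an index-creation body that uses the `index.sort.field` and `index.sort.order` settings. The helper name and the field names `status` and `timestamp` are hypothetical, chosen only for illustration. Note that index sorting has to be configured when the index is created and that sort fields need doc values.

```python
import json


def sorted_index_body(sort_fields, sort_orders):
    """Build an index-creation body that enables index sorting.

    The field names passed in are placeholders; pick low-cardinality
    fields that group similar documents together.
    """
    if len(sort_fields) != len(sort_orders):
        raise ValueError("each sort field needs exactly one sort order")
    return {
        "settings": {
            "index": {
                "sort.field": list(sort_fields),
                "sort.order": list(sort_orders),
            }
        }
    }


# Hypothetical example: group documents by status, then by time.
body = sorted_index_body(["status", "timestamp"], ["asc", "desc"])
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of the index-creation request, together with mappings for the sort fields.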
 [float]
 === Put fields in the same order in documents
 Because multiple documents are compressed together into blocks, those
 `_source` documents are more likely to contain longer duplicate strings if
 fields always occur in the same order.
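One way to get a stable field order, sketched here under the assumption that documents are serialized on the client before indexing, is to emit JSON with sorted keys. The function name `canonical_source` is hypothetical; any deterministic ordering would serve the same purpose.

```python
import json


def canonical_source(doc):
    """Serialize a document with keys in a deterministic (sorted) order.

    sort_keys=True also applies to nested objects, so two documents with
    the same fields always produce the same field sequence.
    """
    return json.dumps(doc, sort_keys=True, separators=(",", ":"))


# Same fields, different insertion order on the client side.
first = {"user": "kim", "status": "ok"}
second = {"status": "ok", "user": "kim"}

print(canonical_source(first))
print(canonical_source(second))
```

Both calls print the same string, so the compressor sees identical field-name sequences across documents.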
--- a/docs/reference/how-to/search-speed.asciidoc
+++ b/docs/reference/how-to/search-speed.asciidoc
@@ -326,3 +326,45 @@ queries, they should be mapped as a `keyword`.
 <<index-modules-index-sorting,Index sorting>> can be useful in order to make
 conjunctions faster at the cost of slightly slower indexing. Read more about it
 in the <<index-modules-index-sorting-conjunctions,index sorting documentation>>.
 [float]
 === Use `preference` to optimize cache utilization
 There are multiple caches that can help with search performance, such as the
 https://en.wikipedia.org/wiki/Page_cache[filesystem cache], the
 <<shard-request-cache,request cache>> or the <<query-cache,query cache>>. Yet
 all these caches are maintained at the node level. If you run the same request
 twice in a row, have one or more <<glossary-replica-shard,replicas>>, and use
 https://en.wikipedia.org/wiki/Round-robin_DNS[round-robin], the default
 routing algorithm, then those two requests will go to different shard copies,
 preventing node-level caches from helping.
 Since it is common for users of a search application to run similar requests
 one after another, for instance in order to analyze a narrower subset of the
 index, using a preference value that identifies the current user or session
 could help optimize usage of the caches.
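A minimal sketch of this idea, assuming the application has some session identifier at hand: derive a stable token from it and pass it as the `preference` search parameter, so that repeated requests from the same session are routed to the same shard copies. The function name, index name, and session IDs are hypothetical. Hashing is just one option; it keeps the value opaque and guarantees that it does not start with `_`, which is reserved for built-in preference values.

```python
import hashlib
from urllib.parse import urlencode


def search_path(index, session_id):
    """Return a search path whose preference value is stable per session."""
    # A hex digest only contains 0-9 and a-f, so it never starts with "_".
    token = hashlib.sha256(session_id.encode("utf-8")).hexdigest()[:16]
    return f"/{index}/_search?" + urlencode({"preference": token})


# Hypothetical index name and session identifier.
print(search_path("my_index", "session-a"))
```

Requests built from the same session ID always carry the same preference value, while different sessions are still spread across shard copies.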
 [float]
 === Replicas might help with throughput, but not always
 In addition to improving resiliency, replicas can help improve throughput. For
 instance if you have a single-shard index and three nodes, you will need to
 set the number of replicas to 2 in order to have 3 copies of your shard in
 total so that all nodes are utilized.
 Now imagine that you have a 2-shard index and two nodes. In one case, the
 number of replicas is 0, meaning that each node holds a single shard. In the
 second case the number of replicas is 1, meaning that each node has two shards.
 Which setup is going to perform best in terms of search performance? Usually,
 the setup that has fewer shards per node in total will perform better. The
 reason for that is that it gives a greater share of the available filesystem
 cache to each shard, and the filesystem cache is probably Elasticsearch's
 number 1 performance factor. At the same time, beware that a setup without
 replicas loses data availability as soon as a single node fails, so
 there is a trade-off between throughput and availability.
 So what is the right number of replicas? If you have a cluster that has
 `num_nodes` nodes, `num_primaries` primary shards _in total_ and if you want to
 be able to cope with at most `max_failures` node failures at once, then the
 right number of replicas for you is
 `max(max_failures, ceil(num_nodes / num_primaries) - 1)`.
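The formula can be written as a small Python function for experimenting with different cluster shapes. This is only a direct transcription of the expression above; the function name is hypothetical.

```python
def right_number_of_replicas(num_nodes, num_primaries, max_failures):
    """Evaluate max(max_failures, ceil(num_nodes / num_primaries) - 1)."""
    if num_nodes < 1 or num_primaries < 1 or max_failures < 0:
        raise ValueError("expected num_nodes >= 1, num_primaries >= 1, max_failures >= 0")
    # Integer ceiling division avoids floating-point rounding.
    copies_to_use_all_nodes = -(-num_nodes // num_primaries)
    return max(max_failures, copies_to_use_all_nodes - 1)


# Single-shard index on three nodes, no failure tolerance required.
print(right_number_of_replicas(num_nodes=3, num_primaries=1, max_failures=0))  # 2

# 2-shard index on two nodes that must survive one node failure.
print(right_number_of_replicas(num_nodes=2, num_primaries=2, max_failures=1))  # 1
```

The first call reproduces the single-shard, three-node example from this section: two replicas give three shard copies, one per node.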