More advice around search speed and disk usage. (#25252)

It adds notes about:
 - how preference can help optimize cache usage
 - the fact that too many replicas can hurt search performance due to lower
   utilization of the filesystem cache
 - how index sorting can improve _source compression
 - how always putting fields in the same order in documents can improve _source
   compression
Adrien Grand 2017-06-16 11:23:40 +02:00 committed by GitHub
parent ccb3c9aae7
commit 8c869e2a0b
2 changed files with 63 additions and 0 deletions

@@ -158,3 +158,24 @@ on disk usage. In particular, integers should be stored using an integer type
stored in a `scaled_float` if appropriate or in the smallest type that fits the
use-case: using `float` over `double`, or `half_float` over `float` will help
save storage.
[float]
=== Use index sorting to colocate similar documents
When Elasticsearch stores `_source`, it compresses multiple documents at once
in order to improve the overall compression ratio. For instance, it is very
common for documents to share the same field names, and quite common for them
to share some field values, especially on fields that have a low cardinality or
a https://en.wikipedia.org/wiki/Zipf%27s_law[Zipfian] distribution.
By default, documents are compressed together in the order that they are added
to the index. If you enable <<index-modules-index-sorting,index sorting>>,
they are instead compressed in sorted order. Sorting documents with similar
structure, fields, and values together should improve the compression ratio.
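
As a sketch of what this looks like, the request below creates a hypothetical
index that is sorted on an assumed low-cardinality `status` field, so that
documents carrying the same value end up stored next to each other:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "index": {
      "sort.field": "status",
      "sort.order": "asc"
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "status": {
          "type": "keyword"
        }
      }
    }
  }
}
--------------------------------------------------
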
[float]
=== Put fields in the same order in documents
Since multiple documents are compressed together into blocks, the compressor
is more likely to find longer duplicate strings across those `_source`
documents if fields always occur in the same order.
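
For instance, the two hypothetical documents below always list `user`, then
`level`, then `message`; keeping that order stable across documents gives the
block compressor identical byte sequences to deduplicate:

[source,js]
--------------------------------------------------
PUT my_index/my_type/1
{
  "user": "alice",
  "level": "INFO",
  "message": "login succeeded"
}

PUT my_index/my_type/2
{
  "user": "bob",
  "level": "INFO",
  "message": "login succeeded"
}
--------------------------------------------------
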

@@ -326,3 +326,45 @@ queries, they should be mapped as a `keyword`.
<<index-modules-index-sorting,Index sorting>> can be useful in order to make
conjunctions faster at the cost of slightly slower indexing. Read more about it
in the <<index-modules-index-sorting-conjunctions,index sorting documentation>>.
[float]
=== Use `preference` to optimize cache utilization
There are multiple caches that can help with search performance, such as the
https://en.wikipedia.org/wiki/Page_cache[filesystem cache], the
<<shard-request-cache,request cache>> or the <<query-cache,query cache>>. Yet
all of these caches are maintained at the node level, meaning that if you run
the same request twice in a row, have one <<glossary-replica-shard,replica>>
or more, and use https://en.wikipedia.org/wiki/Round-robin_DNS[round-robin],
the default routing algorithm, then those two requests will go to different
shard copies, preventing node-level caches from helping.

Since it is common for users of a search application to run similar requests
one after another, for instance in order to analyze a narrower subset of the
index, using a preference value that identifies the current user or session
can help optimize usage of the caches.
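
As a sketch, a hypothetical search application could pass its session id as
the `preference` query string parameter so that consecutive searches from the
same user are routed to the same shard copies:

[source,js]
--------------------------------------------------
GET my_index/_search?preference=user_12345
{
  "query": {
    "match": {
      "message": "error"
    }
  }
}
--------------------------------------------------
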
[float]
=== Replicas might help with throughput, but not always
In addition to improving resiliency, replicas can help improve throughput. For
instance, if you have a single-shard index and three nodes, you will need to
set the number of replicas to 2 in order to have 3 copies of your shard in
total, so that all nodes are utilized.

Now imagine that you have a two-shard index and two nodes. In one case, the
number of replicas is 0, meaning that each node holds a single shard. In the
second case, the number of replicas is 1, meaning that each node holds two
shards. Which setup will perform best in terms of search performance? Usually,
the setup that has fewer shards per node in total will perform better. The
reason is that it gives a greater share of the available filesystem cache to
each shard, and the filesystem cache is probably Elasticsearch's number one
performance factor. At the same time, beware that a setup without replicas
cannot survive a single node failure, so there is a trade-off between
throughput and availability.
So what is the right number of replicas? If you have a cluster that has
`num_nodes` nodes, `num_primaries` primary shards _in total_, and you want to
be able to cope with up to `max_failures` node failures at once, then the
right number of replicas for you is
`max(max_failures, ceil(num_nodes / num_primaries) - 1)`.
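
As a worked example under assumed numbers: with `num_nodes = 3`,
`num_primaries = 2` and `max_failures = 1`, the formula gives
`max(1, ceil(3 / 2) - 1) = max(1, 1) = 1`, so a single replica per primary
both keeps every node busy and survives the loss of one node. The setting can
then be applied to a hypothetical live index like this:

[source,js]
--------------------------------------------------
PUT my_index/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}
--------------------------------------------------
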