More advice on search speed and disk usage. (#25252)

It adds notes about:

- how preference can help optimize cache usage
- the fact that too many replicas can hurt search performance due to lower utilization of the filesystem cache
- how index sorting can improve _source compression
- how always putting fields in the same order in documents can improve _source compression

@@ -158,3 +158,24 @@ on disk usage. In particular, integers should be stored using an integer type
stored in a `scaled_float` if appropriate or in the smallest type that fits the
use-case: using `float` over `double`, or `half_float` over `float` will help
save storage.

[float]
=== Use index sorting to colocate similar documents

When Elasticsearch stores `_source`, it compresses multiple documents at once
in order to improve the overall compression ratio. For instance it is very
common that documents share the same field names, and quite common that they
share some field values, especially on fields that have a low cardinality or
a https://en.wikipedia.org/wiki/Zipf%27s_law[zipfian] distribution.

By default documents are compressed together in the order that they are added
to the index. If you enable <<index-modules-index-sorting,index sorting>>,
they are instead compressed in sorted order. Sorting documents with similar
structure, fields, and values together should improve the compression ratio.
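
As an illustration, enabling index sorting at index creation time could look
roughly like the snippet below. The index name, document type and `username`
sort field are placeholders for this sketch, not values from the original
change; the mapping for the sort field is included because the sort field
needs to be mapped when the index is created.

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "index": {
      "sort.field": "username",
      "sort.order": "asc"
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "username": {
          "type": "keyword"
        }
      }
    }
  }
}
--------------------------------------------------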

[float]
=== Put fields in the same order in documents

Because multiple documents are compressed together into blocks, longer
duplicate strings are more likely to be found in those `_source` documents
if fields always occur in the same order.
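
For instance, two documents indexed with the same field order, like the ones
below, give the compressor more repeated byte sequences to work with than the
same documents with shuffled fields. The index name, type and field values are
made up for this sketch.

[source,js]
--------------------------------------------------
PUT my_index/doc/1
{
  "user": "alice",
  "level": "INFO",
  "message": "first request"
}

PUT my_index/doc/2
{
  "user": "alice",
  "level": "INFO",
  "message": "second request"
}
--------------------------------------------------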

@@ -326,3 +326,45 @@ queries, they should be mapped as a `keyword`.
<<index-modules-index-sorting,Index sorting>> can be useful in order to make
conjunctions faster at the cost of slightly slower indexing. Read more about it
in the <<index-modules-index-sorting-conjunctions,index sorting documentation>>.

[float]
=== Use `preference` to optimize cache utilization

There are multiple caches that can help with search performance, such as the
https://en.wikipedia.org/wiki/Page_cache[filesystem cache], the
<<shard-request-cache,request cache>> or the <<query-cache,query cache>>. Yet
all these caches are maintained at the node level, meaning that if you run the
same request twice in a row, have 1 <<glossary-replica-shard,replica>> or more
and use https://en.wikipedia.org/wiki/Round-robin_DNS[round-robin], the default
routing algorithm, then those two requests will go to different shard copies,
preventing node-level caches from helping.

Since it is common for users of a search application to run similar requests
one after another, for instance in order to analyze a narrower subset of the
index, using a preference value that identifies the current user or session
could help optimize usage of the caches.
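
For example, all queries from the same session could be routed to the same
shard copies by passing a session id as the `preference` parameter. The index
name, session id and query below are placeholders for this sketch.

[source,js]
--------------------------------------------------
GET my_index/_search?preference=xyzabc123
{
  "query": {
    "match": {
      "message": "elasticsearch"
    }
  }
}
--------------------------------------------------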

[float]
=== Replicas might help with throughput, but not always

In addition to improving resiliency, replicas can help improve throughput. For
instance if you have a single-shard index and three nodes, you will need to
set the number of replicas to 2 in order to have 3 copies of your shard in
total so that all nodes are utilized.
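
As a sketch of that first example, the number of replicas is a dynamic index
setting, so it could be raised to 2 on a live single-shard index as shown
below (the index name is a placeholder):

[source,js]
--------------------------------------------------
PUT my_index/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}
--------------------------------------------------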

Now imagine that you have a two-shard index and two nodes. In one case, the
number of replicas is 0, meaning that each node holds a single shard. In the
second case the number of replicas is 1, meaning that each node has two shards.
Which setup is going to perform best in terms of search performance? Usually,
the setup that has fewer shards per node in total will perform better. The
reason is that it gives a greater share of the available filesystem cache to
each shard, and the filesystem cache is probably Elasticsearch's number 1
performance factor. At the same time, beware that a setup that does not have
replicas is subject to failure in case of a single node failure, so there is a
trade-off between throughput and availability.

So what is the right number of replicas? If you have a cluster that has
`num_nodes` nodes, `num_primaries` primary shards _in total_ and you want to
be able to cope with at most `max_failures` node failures at once, then the
right number of replicas for you is
`max(max_failures, ceil(num_nodes / num_primaries) - 1)`.
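
For instance, picking arbitrary numbers, a cluster with `num_nodes = 3`,
`num_primaries = 2` and `max_failures = 1` would need
`max(1, ceil(3 / 2) - 1) = max(1, 1) = 1` replica per shard.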