More advice around search speed and disk usage. (#25252)

It adds notes about:

- how `preference` can help optimize cache usage
- the fact that too many replicas can hurt search performance due to lower utilization of the filesystem cache
- how index sorting can improve `_source` compression
- how always putting fields in the same order in documents can improve `_source` compression
This commit is contained in:
parent ccb3c9aae7
commit 8c869e2a0b
@@ -158,3 +158,24 @@ on disk usage. In particular, integers should be stored using an integer type
stored in a `scaled_float` if appropriate or in the smallest type that fits the
use-case: using `float` over `double`, or `half_float` over `float` will help
save storage.

[float]
=== Use index sorting to colocate similar documents

When Elasticsearch stores `_source`, it compresses multiple documents at once
in order to improve the overall compression ratio. For instance it is very
common that documents share the same field names, and quite common that they
share some field values, especially on fields that have a low cardinality or
a https://en.wikipedia.org/wiki/Zipf%27s_law[zipfian] distribution.

By default, documents are compressed together in the order that they are added
to the index. If you enable <<index-modules-index-sorting,index sorting>>,
they are instead compressed in sorted order. Sorting documents with similar
structure, fields, and values together should improve the compression ratio.
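
As a sketch, index sorting is configured at index creation time via the
`index.sort.field` and `index.sort.order` settings; the index and field names
below (`logs`, `host_name`) are hypothetical:

[source,js]
--------------------------------------------------
PUT logs
{
  "settings": {
    "index.sort.field": "host_name",
    "index.sort.order": "asc"
  }
}
--------------------------------------------------

Sorting on a low-cardinality field such as a host name groups documents that
share values into the same compression blocks.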

[float]
=== Put fields in the same order in documents

Because multiple documents are compressed together into blocks, longer
duplicate strings are more likely to be found across those `_source` documents
if fields always occur in the same order.
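
For instance, indexing documents whose fields always appear in the same order
(all names and values below are hypothetical) gives the compressor long
repeated substrings such as `"status": "active", "age":` to deduplicate:

[source,js]
--------------------------------------------------
PUT index/type/1
{ "user": "alice", "status": "active", "age": 30 }

PUT index/type/2
{ "user": "bob", "status": "active", "age": 25 }
--------------------------------------------------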

@@ -326,3 +326,45 @@ queries, they should be mapped as a `keyword`.
<<index-modules-index-sorting,Index sorting>> can be useful in order to make
conjunctions faster at the cost of slightly slower indexing. Read more about it
in the <<index-modules-index-sorting-conjunctions,index sorting documentation>>.

[float]
=== Use `preference` to optimize cache utilization

There are multiple caches that can help with search performance, such as the
https://en.wikipedia.org/wiki/Page_cache[filesystem cache], the
<<shard-request-cache,request cache>> or the <<query-cache,query cache>>. Yet
all these caches are maintained at the node level, meaning that if you run the
same request twice in a row, have 1 <<glossary-replica-shard,replica>> or more
and use https://en.wikipedia.org/wiki/Round-robin_DNS[round-robin], the default
routing algorithm, then those two requests will go to different shard copies,
preventing node-level caches from helping.

Since it is common for users of a search application to run similar requests
one after another, for instance in order to analyze a narrower subset of the
index, using a preference value that identifies the current user or session
could help optimize usage of the caches.
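
For example, passing a session identifier as the `preference` query-string
parameter routes all requests for that session to the same shard copies (the
index, field, and preference value below are hypothetical):

[source,js]
--------------------------------------------------
GET index/_search?preference=user_12345
{
  "query": {
    "match": { "title": "search speed" }
  }
}
--------------------------------------------------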

[float]
=== Replicas might help with throughput, but not always

In addition to improving resiliency, replicas can help improve throughput. For
instance if you have a single-shard index and three nodes, you will need to
set the number of replicas to 2 in order to have 3 copies of your shard in
total so that all nodes are utilized.

Now imagine that you have a two-shard index and two nodes. In one case, the
number of replicas is 0, meaning that each node holds a single shard. In the
second case the number of replicas is 1, meaning that each node has two shards.
Which setup is going to perform best in terms of search performance? Usually,
the setup that has fewer shards per node in total will perform better. The
reason is that it gives a greater share of the available filesystem
cache to each shard, and the filesystem cache is probably Elasticsearch's
number one performance factor. At the same time, beware that a setup without
replicas is subject to failure if a single node fails, so
there is a trade-off between throughput and availability.
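
The number of replicas can be changed on a live index with the update index
settings API; for the single-shard, three-node example above you would set it
to 2 (the index name below is hypothetical):

[source,js]
--------------------------------------------------
PUT index/_settings
{
  "index.number_of_replicas": 2
}
--------------------------------------------------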

So what is the right number of replicas? If you have a cluster that has
`num_nodes` nodes, `num_primaries` primary shards _in total_, and if you want
to be able to cope with at most `max_failures` node failures at once, then the
right number of replicas for you is
`max(max_failures, ceil(num_nodes / num_primaries) - 1)`.
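
As a sketch of the arithmetic, here is the formula in Python, checked against
the examples above (the function name is ours, not an Elasticsearch API):

```python
import math

def right_number_of_replicas(num_nodes, num_primaries, max_failures):
    """max(max_failures, ceil(num_nodes / num_primaries) - 1)"""
    return max(max_failures, math.ceil(num_nodes / num_primaries) - 1)

# Single-shard index on three nodes, tolerating one node failure:
# ceil(3 / 1) - 1 = 2, so utilization requires 2 replicas.
print(right_number_of_replicas(3, 1, 1))  # → 2

# Two primaries on two nodes, tolerating one node failure:
# ceil(2 / 2) - 1 = 0, so max_failures dominates and you need 1 replica.
print(right_number_of_replicas(2, 2, 1))  # → 1
```

Note that the first term covers availability while the second covers keeping
every node busy, which is why the formula takes the maximum of the two.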