[[tune-for-indexing-speed]]
== Tune for indexing speed

[discrete]
=== Use bulk requests

Bulk requests will yield much better performance than single-document index
requests. To find the optimal size of a bulk request, run a benchmark on a
single node with a single shard. First try to index 100 documents at once,
then 200, then 400, and so on, doubling the number of documents in a bulk
request in every benchmark run. When the indexing speed starts to plateau, you
have reached the optimal size of a bulk request for your data. In case of a
tie, it is better to err in the direction of too few rather than too many
documents. Beware that bulk requests that are too large might put the cluster
under memory pressure when many of them are sent concurrently, so it is
advisable to avoid going beyond a few tens of megabytes per request even if
larger requests seem to perform better.
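
The doubling benchmark can be sketched as follows. This is a minimal
illustration, not a reference implementation: `send_bulk` is a hypothetical
callable that sends one bulk request and blocks until it completes, and the
`min_gain` threshold of 10% is an assumed definition of "plateau" that you
should adjust for your data. The `clock` parameter exists only so the timing
source can be swapped out.

```python
import time
from typing import Callable, Sequence


def find_bulk_size(
    send_bulk: Callable[[Sequence[dict]], None],
    docs: Sequence[dict],
    start: int = 100,
    min_gain: float = 0.10,
    clock: Callable[[], float] = time.perf_counter,
) -> int:
    """Double the batch size until throughput improves by less than `min_gain`."""
    best_size, best_rate = start, 0.0
    size = start
    while size <= len(docs):
        began = clock()
        send_bulk(docs[:size])  # hypothetical: one bulk request, waits for the reply
        elapsed = max(clock() - began, 1e-9)
        rate = size / elapsed  # documents per second for this run
        if rate < best_rate * (1 + min_gain):
            # Plateau reached: on a near tie, keep the smaller request size.
            return best_size
        best_size, best_rate = size, rate
        size *= 2
    return best_size
```

The sketch measures one request per size. A real benchmark would repeat each
run several times and also track request size in bytes, since the memory
pressure described above depends on bytes rather than document count.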

[discrete]
[[multiple-workers-threads]]
=== Use multiple workers/threads to send data to Elasticsearch

A single thread sending bulk requests is unlikely to be able to max out the
indexing capacity of an Elasticsearch cluster. To use all resources of the
cluster, send data from multiple threads or processes. In addition to making
better use of the resources of the cluster, this should help reduce the cost
of each fsync.

Make sure to watch for `TOO_MANY_REQUESTS (429)` response codes
(`EsRejectedExecutionException` with the Java client), which is the way that
Elasticsearch tells you that it cannot keep up with the current indexing rate.
When this happens, pause indexing briefly before trying again, ideally with
randomized exponential backoff.
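
One way to implement randomized exponential backoff is sketched below. The
`send` callable is hypothetical: it is assumed to send one bulk request and
return the HTTP status code. The retry count and delay values are illustrative
defaults, not recommendations. The sketch only reacts to the status of the
whole request and does not inspect per-item results in a bulk response.

```python
import random
import time
from typing import Callable


def send_with_backoff(
    send: Callable[[], int],
    max_retries: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a status other than 429 or retries run out."""
    for attempt in range(max_retries + 1):
        status = send()  # hypothetical: sends one bulk request, returns HTTP status
        if status != 429:
            return status
        if attempt == max_retries:
            break
        # Randomize the wait up to an exponentially growing cap so that
        # concurrent workers do not all retry at the same moment.
        cap = min(max_delay, base_delay * (2 ** attempt))
        sleep(random.uniform(0, cap))
    return 429
```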

As with sizing bulk requests, only testing can tell what the optimal number of
workers is. Test this by progressively increasing the number of workers until
either I/O or CPU is saturated on the cluster.
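
A minimal sketch of sending batches from several threads is shown below, using
Python's standard `ThreadPoolExecutor`. As an assumption for illustration,
`send_bulk` is a hypothetical callable that sends one bulk request and blocks
until it completes. To look for the saturation point, you could rerun the
function with increasing values of `workers` while watching I/O and CPU on the
cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, Sequence


def chunked(docs: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def index_with_workers(
    send_bulk: Callable[[Sequence[dict]], None],
    docs: Sequence[dict],
    bulk_size: int,
    workers: int,
) -> int:
    """Send all batches through a pool of `workers` threads; return docs sent."""

    def send_and_count(batch: Sequence[dict]) -> int:
        send_bulk(batch)  # hypothetical: one bulk request, waits for the reply
        return len(batch)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() queues every batch up front, which is fine for a sketch; a real
        # loader would bound the number of batches held in memory.
        return sum(pool.map(send_and_count, chunked(docs, bulk_size)))
```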

[discrete]
=== Unset or increase the refresh interval

Making changes visible to search, an operation called a
<<indices-refresh,refresh>>, is costly, and calling it often while there is
ongoing indexing activity can hurt indexing speed.

include::{es-repo-dir}/indices/refresh.asciidoc[tag=refresh-interval-default]
This is the optimal configuration if you have no or very little search traffic
(e.g. less than one search request every 5 minutes) and want to optimize for
indexing speed. This behavior aims to automatically optimize bulk indexing in
the default case when no searches are performed. To opt out of this behavior,
set the refresh interval explicitly.

On the other hand, if your index experiences regular search requests, this
default behavior means that Elasticsearch will refresh your index every 1
second. If you can afford to increase the amount of time between when a document
gets indexed and when it becomes visible, increasing the
<<index-refresh-interval-setting,`index.refresh_interval`>> to a larger value, e.g.
`30s`, might help improve indexing speed.
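
As an illustration, the helper below builds a request body for the update index
settings API (`PUT /<index>/_settings`). The function name is hypothetical, and
sending the request is left to whichever client you use. Passing `None`
serializes to `null`, which removes the explicit value and restores the default
behavior.

```python
import json
from typing import Optional


def refresh_interval_settings(interval: Optional[str]) -> str:
    """Return a JSON settings body; None becomes null and unsets the value."""
    return json.dumps({"index": {"refresh_interval": interval}})
```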

[discrete]
=== Disable replicas for initial loads

If you have a large amount of data that you want to load all at once into
Elasticsearch, it may be beneficial to set `index.number_of_replicas` to `0` in
order to speed up indexing. Having no replicas means that losing a single node
may incur data loss, so it is important that the data lives elsewhere so that
this initial load can be retried in case of an issue. Once the initial load is
finished, you can set `index.number_of_replicas` back to its original value.

If `index.refresh_interval` is configured in the index settings, it may further
help to unset it during this initial load and set it back to its original
value once the initial load is finished.
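
The save-and-restore pattern can be sketched as a pair of settings bodies: one
to apply before the load and one to apply afterwards. Here `current` is a
hypothetical dictionary holding the values explicitly configured on the index
(for example, read from the get index settings API beforehand). Both bodies are
meant for the update index settings API; how you send them is up to your
client.

```python
import json
from typing import Any, Dict, Tuple


def initial_load_settings(current: Dict[str, Any]) -> Tuple[str, str]:
    """Return (before_load, after_load) JSON bodies for an initial bulk load."""
    before: Dict[str, Any] = {"index": {"number_of_replicas": 0}}
    after: Dict[str, Any] = {
        "index": {"number_of_replicas": current["number_of_replicas"]}
    }
    if "refresh_interval" in current:
        # null unsets the explicit value for the duration of the load ...
        before["index"]["refresh_interval"] = None
        # ... and the original value is put back once the load is finished.
        after["index"]["refresh_interval"] = current["refresh_interval"]
    return json.dumps(before), json.dumps(after)
```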

[discrete]
=== Disable swapping

You should make sure that the operating system is not swapping out the Java
process by <<setup-configuration-memory,disabling swapping>>.

[discrete]
=== Give memory to the filesystem cache

The filesystem cache is used to buffer I/O operations. You should make sure to
give at least half the memory of the machine running Elasticsearch to the
filesystem cache.

[discrete]
=== Use auto-generated ids

When indexing a document that has an explicit id, Elasticsearch needs to check
whether a document with the same id already exists within the same shard, which
is a costly operation and gets even more costly as the index grows. By using
auto-generated ids, Elasticsearch can skip this check, which makes indexing
faster.
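
In a bulk request, the difference comes down to whether the action line carries
an `_id`. The helper below is an illustrative sketch that builds a
newline-delimited JSON body for the bulk API; the function and its `id_field`
parameter are hypothetical. Leaving `id_field` unset omits `_id`, so
Elasticsearch generates the ids itself.

```python
import json
from typing import Iterable, Optional


def bulk_body(index: str, docs: Iterable[dict], id_field: Optional[str] = None) -> str:
    """Build an NDJSON bulk body; omit `_id` unless `id_field` is given."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index}}
        if id_field is not None:
            # Explicit id: Elasticsearch must check for an existing document.
            action["index"]["_id"] = str(doc[id_field])
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```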

[discrete]
=== Use faster hardware

If indexing is I/O bound, you should investigate giving more memory to the
filesystem cache (see above) or buying faster drives. In particular, SSDs are
known to perform better than spinning disks. Always use local storage; remote
filesystems such as `NFS` or `SMB` should be avoided. Also beware of
virtualized storage such as Amazon's `Elastic Block Store`. Virtualized
storage works very well with Elasticsearch, and it is appealing since it is so
fast and simple to set up, but it is also unfortunately inherently slower on an
ongoing basis when compared to dedicated local storage. If you put an index on
`EBS`, be sure to use provisioned IOPS, otherwise operations could be quickly
throttled.

Stripe your index across multiple SSDs by configuring a RAID 0 array. Remember
that it will increase the risk of failure since the failure of any one SSD
destroys the index. However, this is typically the right tradeoff to make:
optimize single shards for maximum performance, and then add replicas across
different nodes so there's redundancy for any node failures. You can also use
<<modules-snapshots,snapshot and restore>> to back up the index for further
insurance.

[discrete]
=== Indexing buffer size

If your node is doing only heavy indexing, be sure
<<indexing-buffer,`indices.memory.index_buffer_size`>> is large enough to give
each shard doing heavy indexing up to 512 MB of indexing buffer (beyond that,
indexing performance does not typically improve). Elasticsearch takes that
setting (a percentage of the Java heap or an absolute byte size) and
uses it as a shared buffer across all active shards. Very active shards will
naturally use this buffer more than shards that are performing lightweight
indexing.

The default is `10%`, which is often plenty: for example, if you give the JVM
10GB of memory, it will give 1GB to the index buffer, which is enough to host
two shards that are heavily indexing.
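
The arithmetic in this example can be written out as a small helper. The
function name and parameters are hypothetical; the 10% default and the 512 MB
per-shard figure are the values quoted above.

```python
def heavy_shards_supported(
    heap_bytes: int,
    buffer_percent: int = 10,
    per_shard_bytes: int = 512 * 1024**2,
) -> int:
    """Return how many heavily indexing shards the shared buffer can host."""
    buffer_bytes = heap_bytes * buffer_percent // 100  # shared indexing buffer
    return buffer_bytes // per_shard_bytes
```

With a 10GB heap, `heavy_shards_supported(10 * 1024**3)` evaluates to `2`,
matching the two shards mentioned above.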

[discrete]
=== Use {ccr} to prevent searching from stealing resources from indexing

Within a single cluster, indexing and searching can compete for resources. By
setting up two clusters, configuring <<xpack-ccr,{ccr}>> to replicate data from
one cluster to the other one, and routing all searches to the cluster that has
the follower indices, search activity will no longer steal resources from
indexing on the cluster that hosts the leader indices.

[discrete]
=== Additional optimizations

Many of the strategies outlined in <<tune-for-disk-usage>> also
provide an improvement in the speed of indexing.