2016-06-21 10:31:44 -04:00
|
|
|
[[tune-for-indexing-speed]]
|
|
|
|
== Tune for indexing speed
|
|
|
|
|
|
|
|
[float]
|
|
|
|
=== Use bulk requests
|
|
|
|
|
|
|
|
Bulk requests will yield much better performance than single-document index
|
2016-10-10 16:51:47 -04:00
|
|
|
requests. In order to know the optimal size of a bulk request, you should run
|
2016-06-21 10:31:44 -04:00
|
|
|
a benchmark on a single node with a single shard. First try to index 100
|
|
|
|
documents at once, then 200, then 400, etc. doubling the number of documents
|
|
|
|
in a bulk request in every benchmark run. When the indexing speed starts to
|
|
|
|
plateau then you know you reached the optimal size of a bulk request for your
|
|
|
|
data. In case of tie, it is better to err in the direction of too few rather
|
|
|
|
than too many documents. Beware that too large bulk requests might put the
|
|
|
|
cluster under memory pressure when many of them are sent concurrently, so
|
|
|
|
it is advisable to avoid going beyond a couple tens of megabytes per request
|
|
|
|
even if larger requests seem to perform better.
|
|
|
|
|
|
|
|
[float]
|
2017-01-17 12:30:59 -05:00
|
|
|
=== Use multiple workers/threads to send data to Elasticsearch
|
2016-06-21 10:31:44 -04:00
|
|
|
|
|
|
|
A single thread sending bulk requests is unlikely to be able to max out the
|
2017-01-17 12:30:59 -05:00
|
|
|
indexing capacity of an Elasticsearch cluster. In order to use all resources
|
2016-06-21 10:31:44 -04:00
|
|
|
of the cluster, you should send data from multiple threads or processes. In
|
|
|
|
addition to making better use of the resources of the cluster, this should
|
|
|
|
help reduce the cost of each fsync.
|
|
|
|
|
|
|
|
Make sure to watch for `TOO_MANY_REQUESTS (429)` response codes
|
|
|
|
(`EsRejectedExecutionException` with the Java client), which is the way that
|
2017-01-17 12:30:59 -05:00
|
|
|
Elasticsearch tells you that it cannot keep up with the current indexing rate.
|
2016-06-27 08:49:37 -04:00
|
|
|
When it happens, you should pause indexing a bit before trying again, ideally
|
2016-06-21 10:31:44 -04:00
|
|
|
with randomized exponential backoff.
|
|
|
|
|
|
|
|
Similarly to sizing bulk requests, only testing can tell what the optimal
|
2016-10-10 16:51:47 -04:00
|
|
|
number of workers is. This can be tested by progressively increasing the
|
2016-06-21 10:31:44 -04:00
|
|
|
number of workers until either I/O or CPU is saturated on the cluster.
|
|
|
|
|
|
|
|
[float]
|
|
|
|
=== Increase the refresh interval
|
|
|
|
|
|
|
|
The default <<dynamic-index-settings,`index.refresh_interval`>> is `1s`, which
|
2017-01-17 12:30:59 -05:00
|
|
|
forces Elasticsearch to create a new segment every second.
|
2016-06-21 10:31:44 -04:00
|
|
|
Increasing this value (to say, `30s`) will allow larger segments to flush and
|
|
|
|
decreases future merge pressure.
|
|
|
|
|
|
|
|
[float]
|
|
|
|
=== Disable refresh and replicas for initial loads
|
|
|
|
|
|
|
|
If you need to load a large amount of data at once, you should disable refresh
|
|
|
|
by setting `index.refresh_interval` to `-1` and set `index.number_of_replicas`
|
|
|
|
to `0`. This will temporarily put your index at risk since the loss of any shard
|
|
|
|
will cause data loss, but at the same time indexing will be faster since
|
|
|
|
documents will be indexed only once. Once the initial loading is finished, you
|
|
|
|
can set `index.refresh_interval` and `index.number_of_replicas` back to their
|
|
|
|
original values.
|
|
|
|
|
|
|
|
[float]
|
|
|
|
=== Disable swapping
|
|
|
|
|
|
|
|
You should make sure that the operating system is not swapping out the java
|
2016-10-10 16:51:47 -04:00
|
|
|
process by <<setup-configuration-memory,disabling swapping>>.
|
2016-06-21 10:31:44 -04:00
|
|
|
|
|
|
|
[float]
|
|
|
|
=== Give memory to the filesystem cache
|
|
|
|
|
|
|
|
The filesystem cache will be used in order to buffer I/O operations. You should
|
|
|
|
make sure to give at least half the memory of the machine running elasticsearch
|
|
|
|
to the filesystem cache.
|
|
|
|
|
2016-09-02 07:30:36 -04:00
|
|
|
[float]
|
|
|
|
=== Use auto-generated ids
|
|
|
|
|
|
|
|
When indexing a document that has an explicit id, elasticsearch needs to check
|
|
|
|
whether a document with the same id already exists within the same shard, which
|
|
|
|
is a costly operation and gets even more costly as the index grows. By using
|
|
|
|
auto-generated ids, Elasticsearch can skip this check, which makes indexing
|
|
|
|
faster.
|
|
|
|
|
2016-06-21 10:31:44 -04:00
|
|
|
[float]
|
|
|
|
=== Use faster hardware
|
|
|
|
|
|
|
|
If indexing is I/O bound, you should investigate giving more memory to the
|
|
|
|
filesystem cache (see above) or buying faster drives. In particular SSD drives
|
|
|
|
are known to perform better than spinning disks. Always use local storage,
|
|
|
|
remote filesystems such as `NFS` or `SMB` should be avoided. Also beware of
|
|
|
|
virtualized storage such as Amazon's `Elastic Block Storage`. Virtualized
|
|
|
|
storage works very well with Elasticsearch, and it is appealing since it is so
|
|
|
|
fast and simple to set up, but it is also unfortunately inherently slower on an
|
|
|
|
ongoing basis when compared to dedicated local storage. If you put an index on
|
|
|
|
`EBS`, be sure to use provisioned IOPS otherwise operations could be quickly
|
|
|
|
throttled.
|
|
|
|
|
|
|
|
Stripe your index across multiple SSDs by configuring a RAID 0 array. Remember
|
|
|
|
that it will increase the risk of failure since the failure of any one SSD
|
|
|
|
destroys the index. However this is typically the right tradeoff to make:
|
|
|
|
optimize single shards for maximum performance, and then add replicas across
|
|
|
|
different nodes so there's redundancy for any node failures. You can also use
|
|
|
|
<<modules-snapshots,snapshot and restore>> to backup the index for further
|
|
|
|
insurance.
|
|
|
|
|
|
|
|
[float]
|
|
|
|
=== Indexing buffer size
|
|
|
|
|
|
|
|
If your node is doing only heavy indexing, be sure
|
|
|
|
<<indexing-buffer,`indices.memory.index_buffer_size`>> is large enough to give
|
|
|
|
at most 512 MB indexing buffer per shard doing heavy indexing (beyond that
|
|
|
|
indexing performance does not typically improve). Elasticsearch takes that
|
|
|
|
setting (a percentage of the java heap or an absolute byte-size), and
|
|
|
|
uses it as a shared buffer across all active shards. Very active shards will
|
|
|
|
naturally use this buffer more than shards that are performing lightweight
|
|
|
|
indexing.
|
|
|
|
|
|
|
|
The default is `10%` which is often plenty: for example, if you give the JVM
|
|
|
|
10GB of memory, it will give 1GB to the index buffer, which is enough to host
|
|
|
|
two shards that are heavily indexing.
|
2017-02-20 14:30:51 -05:00
|
|
|
|
2017-10-06 10:49:15 -04:00
|
|
|
[float]
|
|
|
|
=== Disable `_field_names`
|
|
|
|
|
|
|
|
The <<mapping-field-names-field,`_field_names` field>> introduces some
|
|
|
|
index-time overhead, so you might want to disable it if you never need to
|
|
|
|
run `exists` queries.
|
|
|
|
|
2017-02-20 14:30:51 -05:00
|
|
|
[float]
|
|
|
|
=== Additional optimizations
|
|
|
|
|
|
|
|
Many of the strategies outlined in <<tune-for-disk-usage>> also
|
|
|
|
provide an improvement in the speed of indexing.
|