Clarify the resiliency trade-off of disabling replicas to speed up indexing. (#52714)

We should be more explicit about the downsides of disabling replicas and
explain that users should be ready to re-do the entire load in case of
issues mid-way.
Adrien Grand 2020-02-25 08:52:33 +01:00
parent 5ce66b8b3c
commit 9b0ddc1c03
1 changed files with 11 additions and 8 deletions

@@ -58,15 +58,18 @@ gets indexed and when it becomes visible, increasing the
`30s`, might help improve indexing speed.
[float]
-=== Disable refresh and replicas for initial loads
+=== Disable replicas for initial loads
-If you need to load a large amount of data at once, you should disable refresh
-by setting `index.refresh_interval` to `-1` and set `index.number_of_replicas`
-to `0`. This will temporarily put your index at risk since the loss of any shard
-will cause data loss, but at the same time indexing will be faster since
-documents will be indexed only once. Once the initial loading is finished, you
-can set `index.refresh_interval` and `index.number_of_replicas` back to their
-original values.
+If you have a large amount of data that you want to load all at once into
+Elasticsearch, it may be beneficial to set `index.number_of_replicas` to `0` in
+order to speed up indexing. Having no replicas means that losing a single node
+may incur data loss, so it is important that the data lives elsewhere so that
+this initial load can be retried in case of an issue. Once the initial load is
+finished, you can set `index.number_of_replicas` back to its original value.
+If `index.refresh_interval` is configured in the index settings, it may further
+help to unset it during this initial load and set it back to its original
+value once the initial load is finished.
[float]
=== Disable swapping
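
For reviewers, the workflow described in the revised "Disable replicas for initial loads" section can be sketched as two settings updates around the load. This is a minimal illustration, not part of the commit: the helper names (`bulk_load_settings`, `restore_settings`, `put_settings`) and the `base_url` and `index` values are hypothetical. It assumes the update index settings endpoint (`PUT /<index>/_settings`) and that sending `null` for a setting resets it to its default.

```python
import json
import urllib.request


def bulk_load_settings(unset_refresh=True):
    """Settings body to apply before the initial load (hypothetical helper)."""
    index = {"number_of_replicas": 0}
    if unset_refresh:
        # Assumption: a null value resets the setting to its default.
        index["refresh_interval"] = None
    return {"index": index}


def restore_settings(original_replicas, original_refresh_interval=None):
    """Settings body to apply once the initial load has finished."""
    index = {"number_of_replicas": original_replicas}
    if original_refresh_interval is not None:
        index["refresh_interval"] = original_refresh_interval
    return {"index": index}


def put_settings(base_url, index, body):
    """Send the body to PUT <base_url>/<index>/_settings and return the reply."""
    request = urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A caller would record the original values first, call `put_settings(base_url, index, bulk_load_settings())`, run the load from a source that can be replayed in full, and then call `put_settings(base_url, index, restore_settings(...))` with the recorded values.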