Add a how-to section to the docs. #18998
This moves the "Performance Considerations for Elasticsearch Indexing" blog post to the reference guide and adds similar recommendations for tuning disk usage and search speed.
This commit is contained in:
parent 148e64d654
commit fbad3af352

[[how-to]]
= How To

[partintro]
--
Elasticsearch ships with defaults which are intended to give a good out-of-the-box
experience. Full-text search, highlighting, aggregations, and indexing should all
just work without the user having to change anything.

Once you better understand how you want to use Elasticsearch, however, there are
a number of optimizations you can make to improve performance for your use case.

This section provides guidance about which changes should and shouldn't be made.
--

include::how-to/indexing-speed.asciidoc[]

include::how-to/search-speed.asciidoc[]

include::how-to/disk-usage.asciidoc[]

[[tune-for-disk-usage]]
== Tune for disk usage

[float]
=== Disable the features you do not need

By default, Elasticsearch indexes and adds doc values to most fields so that they
can be searched and aggregated out of the box. For instance, if you have a numeric
field called `foo` that you need to run histograms on but that you never need to
filter on, you can safely disable indexing on this field in your
<<mappings,mappings>>:

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "integer",
          "index": false
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<<text,`text`>> fields store normalization factors in the index in order to be
able to score documents. If you only need matching capabilities on a `text`
field but do not care about the produced scores, you can configure Elasticsearch
not to write norms to the index:

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "text",
          "norms": false
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<<text,`text`>> fields also store frequencies and positions in the index by
default. Frequencies are used to compute scores and positions are used to run
phrase queries. If you do not need to run phrase queries, you can tell
Elasticsearch not to index positions:

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "text",
          "index_options": "freqs"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

Furthermore, if you do not care about scoring either, you can configure
Elasticsearch to index only matching documents for every term. You will
still be able to search on this field, but phrase queries will raise errors
and scoring will assume that terms appear only once in every document.

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "text",
          "norms": false,
          "index_options": "docs"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

[float]
=== Don't use default dynamic string mappings

The default <<dynamic-mapping,dynamic string mappings>> will index string fields
both as <<text,`text`>> and <<keyword,`keyword`>>. This is wasteful if you only
need one of them. Typically, an `id` field will only need to be indexed as a
`keyword` while a `body` field will only need to be indexed as a `text` field.

This can be disabled by either configuring explicit mappings on string fields
or setting up dynamic templates that will map string fields as either `text`
or `keyword`.

For instance, here is a template that can be used in order to only map string
fields as `keyword`:

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "type": {
      "dynamic_templates": [
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword"
            }
          }
        }
      ]
    }
  }
}
--------------------------------------------------
// CONSOLE

[float]
=== Disable `_all`

The <<mapping-all-field,`_all`>> field indexes the value of all fields of a
document and can use significant space. If you never need to search against all
fields at the same time, it can be disabled.
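
For instance, a minimal sketch of disabling `_all` in the mappings (the index
and type names are placeholders):

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "type": {
      "_all": {
        "enabled": false
      }
    }
  }
}
--------------------------------------------------
// CONSOLE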

[float]
=== Use `best_compression`

The `_source` and stored fields can easily take a non-negligible amount of disk
space. They can be compressed more aggressively by using the `best_compression`
<<index-codec,codec>>.
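
For instance, a sketch of creating an index with the `best_compression` codec
(the index name is a placeholder; the codec is a static setting, so it is set at
index creation time):

[source,js]
--------------------------------------------------
PUT index
{
  "settings": {
    "index": {
      "codec": "best_compression"
    }
  }
}
--------------------------------------------------
// CONSOLE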

[float]
=== Use the smallest numeric type that is sufficient

When storing <<number,numeric data>>, using `float` over `double`, or `half_float`
over `float`, can help save storage. This is also true for integer types, but to a
lesser degree, since Elasticsearch compresses integers based on the number of bits
that they actually need.
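
For instance, if `half_float` precision is sufficient for a field, a sketch
(the `price` field is a hypothetical example):

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "type": {
      "properties": {
        "price": {
          "type": "half_float"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE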

[[tune-for-indexing-speed]]
== Tune for indexing speed

[float]
=== Use bulk requests

Bulk requests will yield much better performance than single-document index
requests. In order to know the optimal size of a bulk request, you should run
a benchmark on a single node with a single shard. First try to index 100
documents at once, then 200, then 400, etc., doubling the number of documents
in a bulk request in every benchmark run. When the indexing speed starts to
plateau, you know you have reached the optimal size of a bulk request for your
data. In case of a tie, it is better to err on the side of too few rather
than too many documents. Beware that bulk requests that are too large might put
the cluster under memory pressure when many of them are sent concurrently, so
it is advisable to avoid going beyond a couple of tens of megabytes per request,
even if larger requests seem to perform better.

[float]
=== Use multiple workers/threads to send data to Elasticsearch

A single thread sending bulk requests is unlikely to be able to max out the
indexing capacity of an Elasticsearch cluster. In order to use all resources
of the cluster, you should send data from multiple threads or processes. In
addition to making better use of the resources of the cluster, this should
help reduce the cost of each fsync.

Make sure to watch for `TOO_MANY_REQUESTS (429)` response codes
(`EsRejectedExecutionException` with the Java client), which is the way that
Elasticsearch tells you that it cannot keep up with the current indexing rate.
When this happens, you should pause indexing a bit before trying again, ideally
with randomized exponential backoff.
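
The randomized exponential backoff mentioned above can be sketched as follows.
This is an illustrative snippet, not part of any client library: `send_bulk` is
a hypothetical callable standing in for your bulk-indexing call.

```python
import random
import time

def backoff_delays(base=0.5, cap=60.0, attempts=5):
    """Yield randomized ("full jitter") exponential backoff delays in seconds."""
    for attempt in range(attempts):
        # Wait somewhere between 0 and min(cap, base * 2^attempt) seconds.
        yield random.uniform(0, min(cap, base * 2 ** attempt))

def index_with_retries(send_bulk, attempts=5):
    """Retry a bulk request while the cluster keeps answering 429.

    `send_bulk` is a hypothetical callable returning True on success and
    False when the request was rejected (HTTP 429 / TOO_MANY_REQUESTS).
    """
    for delay in backoff_delays(attempts=attempts):
        if send_bulk():
            return True
        time.sleep(delay)  # back off before retrying
    return False
```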

Similarly to sizing bulk requests, only testing can tell what the optimal
number of workers is. This can be tested by progressively increasing the
number of workers until either I/O or CPU is saturated on the cluster.

[float]
=== Increase the refresh interval

The default <<dynamic-index-settings,`index.refresh_interval`>> is `1s`, which
forces Elasticsearch to create a new segment every second.
Increasing this value (to, say, `30s`) will allow larger segments to flush and
decreases future merge pressure.
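
For instance, the refresh interval can be updated dynamically on an existing
index (the index name is a placeholder):

[source,js]
--------------------------------------------------
PUT index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
--------------------------------------------------
// CONSOLE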

[float]
=== Disable refresh and replicas for initial loads

If you need to load a large amount of data at once, you should disable refresh
by setting `index.refresh_interval` to `-1` and set `index.number_of_replicas`
to `0`. This will temporarily put your index at risk since the loss of any shard
will cause data loss, but at the same time indexing will be faster since
documents will be indexed only once. Once the initial loading is finished, you
can set `index.refresh_interval` and `index.number_of_replicas` back to their
original values.
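
A sketch of disabling both settings before the load (the index name is a
placeholder):

[source,js]
--------------------------------------------------
PUT index/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}
--------------------------------------------------
// CONSOLE

After the load, the same request with the original values (for example `"1s"`
and `1`) restores normal operation.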

[float]
=== Disable swapping

You should make sure that the operating system is not swapping out the Java
process by <<setup-configuration-memory,disabling swapping>>.

[float]
=== Give memory to the filesystem cache

The filesystem cache will be used in order to buffer I/O operations. You should
make sure to give at least half the memory of the machine running Elasticsearch
to the filesystem cache.

[float]
=== Use faster hardware

If indexing is I/O-bound, you should investigate giving more memory to the
filesystem cache (see above) or buying faster drives. In particular, SSD drives
are known to perform better than spinning disks. Always use local storage;
remote filesystems such as `NFS` or `SMB` should be avoided. Also beware of
virtualized storage such as Amazon's `Elastic Block Storage`. Virtualized
storage works very well with Elasticsearch, and it is appealing since it is so
fast and simple to set up, but it is also unfortunately inherently slower on an
ongoing basis when compared to dedicated local storage. If you put an index on
`EBS`, be sure to use provisioned IOPS, otherwise operations could quickly be
throttled.

Stripe your index across multiple SSDs by configuring a RAID 0 array. Remember
that this will increase the risk of failure, since the failure of any one SSD
destroys the index. However, this is typically the right tradeoff to make:
optimize single shards for maximum performance, and then add replicas across
different nodes so there's redundancy for any node failures. You can also use
<<modules-snapshots,snapshot and restore>> to back up the index for further
insurance.

[float]
=== Indexing buffer size

If your node is doing only heavy indexing, be sure that
<<indexing-buffer,`indices.memory.index_buffer_size`>> is large enough to give
at most 512 MB of indexing buffer per shard doing heavy indexing (beyond that,
indexing performance does not typically improve). Elasticsearch takes that
setting (a percentage of the Java heap or an absolute byte size) and uses it
as a shared buffer across all active shards. Very active shards will
naturally use this buffer more than shards that are performing lightweight
indexing.

The default is `10%`, which is often plenty: for example, if you give the JVM
10GB of memory, it will give 1GB to the index buffer, which is enough to host
two shards that are heavily indexing.
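
For instance, the setting could be raised in `config/elasticsearch.yml` on
nodes that do heavy indexing (the value here is purely illustrative):

[source,yaml]
--------------------------------------------------
indices.memory.index_buffer_size: 20%
--------------------------------------------------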

[[tune-for-search-speed]]
== Tune for search speed

[float]
=== Give memory to the filesystem cache

Elasticsearch heavily relies on the filesystem cache in order to make search
fast. In general, you should make sure that at least half the available memory
goes to the filesystem cache so that Elasticsearch can keep hot regions of the
index in physical memory.

[float]
=== Use faster hardware

If your search is I/O-bound, you should investigate giving more memory to the
filesystem cache (see above) or buying faster drives. In particular, SSD drives
are known to perform better than spinning disks. Always use local storage;
remote filesystems such as `NFS` or `SMB` should be avoided. Also beware of
virtualized storage such as Amazon's `Elastic Block Storage`. Virtualized
storage works very well with Elasticsearch, and it is appealing since it is so
fast and simple to set up, but it is also unfortunately inherently slower on an
ongoing basis when compared to dedicated local storage. If you put an index on
`EBS`, be sure to use provisioned IOPS, otherwise operations could quickly be
throttled.

If your search is CPU-bound, you should investigate buying faster CPUs.

[float]
=== Document modeling

Documents should be modeled so that search-time operations are as cheap as
possible.

In particular, joins should be avoided. <<nested,`nested`>> fields can make
queries several times slower and <<mapping-parent-field,parent-child>> relations
can make queries hundreds of times slower. So if the same questions can be
answered without joins by denormalizing documents, significant speedups can be
expected.

[float]
=== Pre-index data

You should leverage patterns in your queries to optimize the way data is indexed.
For instance, if all your documents have a `price` field and most queries run
<<search-aggregations-bucket-range-aggregation,`range`>> aggregations on a fixed
list of ranges, you could make this aggregation faster by pre-indexing the ranges
into the index and using a <<search-aggregations-bucket-terms-aggregation,`terms`>>
aggregation.

For instance, if documents look like:

[source,js]
--------------------------------------------------
PUT index/type/1
{
  "designation": "spoon",
  "price": 13
}
--------------------------------------------------
// CONSOLE

and search requests look like:

[source,js]
--------------------------------------------------
GET index/_search
{
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 10 },
          { "from": 10, "to": 100 },
          { "from": 100 }
        ]
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

Then documents could be enriched by a `price_range` field at index time, which
should be mapped as a <<keyword,`keyword`>>:

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "type": {
      "properties": {
        "price_range": {
          "type": "keyword"
        }
      }
    }
  }
}

PUT index/type/1
{
  "designation": "spoon",
  "price": 13,
  "price_range": "10-100"
}
--------------------------------------------------
// CONSOLE

And then search requests could aggregate this new field rather than running a
`range` aggregation on the `price` field:

[source,js]
--------------------------------------------------
GET index/_search
{
  "aggs": {
    "price_ranges": {
      "terms": {
        "field": "price_range"
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

[float]
=== Mappings

The fact that some data is numeric does not mean it should always be mapped as a
<<number,numeric field>>. Typically, fields storing identifiers such as an `ISBN`
or any number identifying a record from another database might benefit from
being mapped as <<keyword,`keyword`>> rather than `integer` or `long`.

[float]
=== Avoid scripts

In general, scripts should be avoided. If they are absolutely needed, you
should prefer the `painless` and `expressions` engines.

[float]
=== Force-merge read-only indices

Indices that are read-only would benefit from being
<<indices-forcemerge,merged down to a single segment>>. This is typically the
case with time-based indices: only the index for the current time frame is
getting new documents while older indices are read-only.

IMPORTANT: Don't force-merge indices that are still being written to -- leave
merging to the background merge process.
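
For instance, a read-only index can be merged down to a single segment with the
following request (the index name is a placeholder):

[source,js]
--------------------------------------------------
POST index/_forcemerge?max_num_segments=1
--------------------------------------------------
// CONSOLE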

[float]
=== Warm up global ordinals

Global ordinals are a data structure that is used in order to run
<<search-aggregations-bucket-terms-aggregation,`terms`>> aggregations on
<<keyword,`keyword`>> fields. They are loaded lazily in memory because
Elasticsearch does not know which fields will be used in `terms` aggregations
and which fields won't. You can tell Elasticsearch to load global ordinals
eagerly at refresh time by configuring mappings as described below:

[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",
          "eager_global_ordinals": true
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

[float]
=== Warm up the filesystem cache

If the machine running Elasticsearch is restarted, the filesystem cache will be
empty, so it will take some time before the operating system loads hot regions
of the index into memory so that search operations are fast. You can explicitly
tell the operating system which files should be loaded into memory eagerly,
depending on the file extension, using the <<file-system,`index.store.preload`>>
setting.

WARNING: Loading data into the filesystem cache eagerly on too many indices or
too many files will make search _slower_ if the filesystem cache is not large
enough to hold all the data. Use with caution.

@@ -43,6 +43,8 @@ include::index-modules.asciidoc[]

include::ingest.asciidoc[]

include::how-to.asciidoc[]

include::testing.asciidoc[]

include::glossary.asciidoc[]