[role="xpack"]
[[transform-scale]]
= Working with {transforms} at scale
++++
<titleabbrev>{transforms-cap} at scale</titleabbrev>
++++

{transforms-cap} convert existing {es} indices into summarized indices, which
provide opportunities for new insights and analytics. The search and index
operations performed by {transforms} use standard {es} features, so similar
considerations for working with {es} at scale often apply to {transforms}. If
you experience performance issues, start by identifying the bottleneck areas
(search, indexing, processing, or storage), then review the relevant
considerations in this guide to improve performance. It also helps to understand
how {transforms} work, because different considerations apply depending on
whether your {transform} runs in continuous mode or in batch mode.

In this guide, you’ll learn how to:

* Understand the impact of configuration options on the performance of
{transforms}.

**Prerequisites:**

These guidelines assume you have a {transform} you want to tune, and you’re
already familiar with:

* <<transform-overview,How {transforms} work>>.
* <<transform-setup,How to set up {transforms}>>.
* <<transform-checkpoints,How {transform} checkpoints work in continuous mode>>.

The following considerations are not sequential. The numbers only help you
navigate between the list items; you can act on one or more of them in any
order. Most of the recommendations apply to both continuous and batch
{transforms}. If a list item applies to only one {transform} type, the
description highlights this exception.

The keywords in parentheses at the end of each recommendation title indicate
the bottleneck area that may be improved by following the given recommendation.

[discrete]
[[measure-performance]]
== Measure {transforms} performance

To optimize {transform} performance, start by identifying the areas where most
work is being done. The **Stats** interface of the **{transforms-cap}** page in
{kib} contains information that covers three main areas: indexing, searching,
and processing time (alternatively, you can use the
<<get-transform-stats, {transforms} stats API>>). If, for example, the results
show that the highest proportion of time is spent on search, then prioritize
efforts on optimizing the search query of the {transform}. {transforms-cap} also
have https://esrally.readthedocs.io[Rally support], which makes it possible to
run performance checks on {transform} configurations if required. If you have
optimized the crucial factors and you still experience performance issues, you
may also want to consider improving your hardware.

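As a minimal sketch of the API route, the following request retrieves the
statistics for a single {transform}. The {transform} ID `my-transform` is a
placeholder, not a name used elsewhere in this guide. In the response, compare
the `search_time_in_ms`, `index_time_in_ms`, and `processing_time_in_ms` values
in the `stats` object to see which area takes the largest share of time.

[source,console]
----
GET _transform/my-transform/_stats
----
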
[discrete]
[[frequency]]
== 1. Optimize `frequency` (index)

In a {ctransform}, the `frequency` configuration option sets the interval
between checks for changes in the source indices. If changes are detected, then
the source data is searched and the changes are applied to the destination
index. Depending on your use case, you may wish to reduce the frequency at which
changes are applied. By setting `frequency` to a higher value (the maximum is
one hour), you spread the workload over time at the cost of less up-to-date
data.

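For example, the check interval of an existing {ctransform} could be raised
through the update {transform} API, roughly as follows. The {transform} ID
`my-transform` and the value `10m` are illustrative assumptions; pick an
interval that matches how fresh your destination data needs to be.

[source,console]
----
POST _transform/my-transform/_update
{
  "frequency": "10m"
}
----
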
[discrete]
[[increase-shards-dest-index]]
== 2. Increase the number of shards of the destination index (index)

Depending on the size of the destination index, you may consider increasing its
shard count. {transforms-cap} use one shard by default when creating the
destination index. To override the index settings, create the destination index
before starting the {transform}. For more information about how the number of
shards affects scalability and resilience, refer to <<scalability>>.

TIP: Use the <<preview-transform>> to check the settings that the {transform}
would use to create the destination index. You can copy and adjust these in
order to create the destination index prior to starting the {transform}.

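A minimal sketch of creating the destination index up front with more than one
primary shard might look like this. The index name `my-dest-index` and the shard
count of `3` are assumptions for illustration only. Mappings are omitted here;
in practice you would likely copy them, together with any other settings, from
the preview output and adjust them as needed.

[source,console]
----
PUT my-dest-index
{
  "settings": {
    "index": {
      "number_of_shards": 3
    }
  }
}
----
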
[discrete]
[[search-queries]]
== 3. Profile and optimize your search queries (search)

If you have defined a {transform} source index `query`, ensure it is as
efficient as possible. Use the **Search Profiler** under **Dev Tools** in {kib}
to get detailed timing information about the execution of individual components
in the search request. Alternatively, you can use the <<search-profile>>. The
results give you insight into how search requests are executed at a low level so
that you can understand why certain requests are slow, and take steps to improve
them.

{transforms-cap} execute standard {es} search requests. There are different ways
to write {es} queries, and some of them are more efficient than others. Consult
<<tune-for-search-speed>> to learn more about {es} performance tuning.

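As a rough illustration of the API route, you can run the source `query` of
your {transform} as a regular search with `"profile": true`. The index name
`my-source-index`, the `@timestamp` field, and the range query are placeholders
that stand in for your own source index and query.

[source,console]
----
GET my-source-index/_search
{
  "profile": true,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2020-01-01T00:00:00"
      }
    }
  }
}
----
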
[discrete]
[[limit-source-query]]
== 4. Limit the scope of the source query (search)

Imagine your {ctransform} is configured to group by `IP` and calculate the sum
of `bytes_sent`. For each checkpoint, a {ctransform} detects changes in the
source data since the previous checkpoint, identifying the IPs for which new
data has been ingested. Then it performs a second search, filtered for this
group of IPs, in order to calculate the total `bytes_sent`. If this second
search matches many shards, it can be resource intensive. Consider limiting the
scope that the source index pattern and query match.

Use an absolute time value as a date range filter in your source query (for
example, greater than `2020-01-01T00:00:00`) to limit which historical indices
are accessed. If you use a relative time value (for example, `now-30d`), then
this date range is re-evaluated at each checkpoint execution.

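The scenario above could be sketched as the following {ctransform}, which
restricts the source to an absolute date range. All names are assumptions made
for illustration: the {transform} ID and destination index `bytes-by-ip`, the
index pattern `my-logs-*`, and the fields `ip`, `bytes_sent`, and `@timestamp`.

[source,console]
----
PUT _transform/bytes-by-ip
{
  "source": {
    "index": "my-logs-*",
    "query": {
      "range": {
        "@timestamp": {
          "gte": "2020-01-01T00:00:00"
        }
      }
    }
  },
  "pivot": {
    "group_by": {
      "ip": { "terms": { "field": "ip" } }
    },
    "aggregations": {
      "total_bytes_sent": { "sum": { "field": "bytes_sent" } }
    }
  },
  "dest": {
    "index": "bytes-by-ip"
  },
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  }
}
----
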
[discrete]
[[optimize-shading-strategy]]
== 5. Optimize the sharding strategy for the source index (search)

There is no one-size-fits-all sharding strategy. A strategy that works in one
environment may not scale in another. A good sharding strategy must account for
your infrastructure, use case, and performance expectations.

Too few shards may mean that the benefits of distributing the workload cannot be
realized; however, too many shards may impact your cluster health. To learn more
about sizing your shards, read this <<size-your-shards,guide>>.

[discrete]
[[tune-max-page-search-size]]
== 6. Tune `max_page_search_size` (search)

The `max_page_search_size` {transform} configuration option defines the number
of buckets that are returned for each search request. The default value is 500.
If you increase this value, you get better throughput at the cost of higher
latency and memory usage.

The ideal value of this parameter is highly dependent on your use case. If your
{transform} executes memory-intensive aggregations (for example, cardinality or
percentiles), then increasing `max_page_search_size` requires more available
memory. If memory limits are exceeded, a circuit breaker exception occurs.

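As a sketch, the option lives under `settings` in the {transform} configuration
and can be changed on an existing {transform} through the update API. The
{transform} ID `my-transform` and the value `1000` are assumptions for
illustration; suitable values depend on your aggregations and available memory.

[source,console]
----
POST _transform/my-transform/_update
{
  "settings": {
    "max_page_search_size": 1000
  }
}
----
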
[discrete]
[[indexed-fields-in-source]]
== 7. Use indexed fields in your source indices (search)

Scripted fields are not indexed fields; their values are only computed at search
time. While these fields provide flexibility in how you access your data, they
increase performance costs at search time. If {transform} performance using
scripted fields is a concern, you may wish to consider using indexed fields
instead.

[discrete]
[[index-sorting-group-by-ordering]]
== 8. Use index sorting and `group_by` ordering (search, process)

If you use more than one `group_by` field in your {transform}, then the order of
the fields in conjunction with the use of <<index-modules-index-sorting>> may
improve runtime.

Index sorting enables you to store documents on disk in a specific order, which
can improve query efficiency. The ideal sorting logic depends on your use case,
but as a rule of thumb, sort the fields from high to low cardinality, starting
with the time-based fields. In the `group_by`, put any time-based components
first, then list the remaining fields in the same order as configured for index
sorting. Index sorting can be defined only once, at index creation. If you don't
already have index sorting on the index that you want to use as a source,
consider reindexing it to a new, sorted index.

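A minimal sketch of a sorted source index follows. The index name
`my-sorted-source`, the fields `@timestamp` and `host`, and the sort directions
are assumptions chosen for illustration. A {transform} that reads from this
index would then list a `date_histogram` on `@timestamp` first in its
`group_by`, followed by a `terms` grouping on `host`.

[source,console]
----
PUT my-sorted-source
{
  "settings": {
    "index": {
      "sort.field": ["@timestamp", "host"],
      "sort.order": ["desc", "asc"]
    }
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "host": { "type": "keyword" }
    }
  }
}
----
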
[discrete]
[[disable-source-dest]]
== 9. Disable the `_source` field on the destination index (storage)

The <<mapping-source-field>> contains the original JSON document body that was
passed at index time. The `_source` field itself is not indexed (and thus is not
searchable), but it is still stored in the index and incurs a storage overhead.
Consider disabling `_source` to save storage space if you have a large
destination index. Disabling `_source` is only possible during index creation.

NOTE: When the `_source` field is disabled, a number of features are not
supported. Consult <<disable-source-field>> to understand the consequences
before disabling it.

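Because the setting can only be applied at index creation, this means creating
the destination index before starting the {transform}. A minimal sketch, in
which the index name `my-dest-index` is a placeholder and the field mappings
that a real destination index would normally also define are left out:

[source,console]
----
PUT my-dest-index
{
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}
----
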
[discrete]
== Further reading

* <<tune-for-search-speed>>
* <<tune-for-indexing-speed>>
* <<size-your-shards>>
* <<ilm-index-lifecycle>>