[role="xpack"]
[[transform-scale]]
= Working with {transforms} at scale
++++
<titleabbrev>{transforms-cap} at scale</titleabbrev>
++++

{transforms-cap} convert existing {es} indices into summarized indices, which
provide opportunities for new insights and analytics. The search and index
operations performed by {transforms} use standard {es} features, so the
considerations for working with {es} at scale often apply to {transforms} as
well. If you experience performance issues, start by identifying the bottleneck
area (search, indexing, processing, or storage), then review the relevant
considerations in this guide to improve performance. It also helps to
understand how {transforms} work, because different considerations apply
depending on whether your {transform} runs in continuous mode or in batch mode.

In this guide, you’ll learn how to:

* Understand the impact of configuration options on the performance of
{transforms}.

**Prerequisites:**

These guidelines assume you have a {transform} you want to tune, and you’re
already familiar with:

* <<transform-overview,How {transforms} work>>.
* <<transform-setup,How to set up {transforms}>>.
* <<transform-checkpoints,How {transform} checkpoints work in continuous mode>>.

The following considerations are not sequential. The numbers help you navigate
between the list items; you can take action on one or more of them in any order.
Most of the recommendations apply to both continuous and batch {transforms}. If
a list item only applies to one {transform} type, this exception is highlighted
in the description.

The keywords in parentheses at the end of each recommendation title indicate
the bottleneck area that may be improved by following the given recommendation.

[discrete]
[[measure-performance]]
== Measure {transforms} performance

To optimize {transform} performance, start by identifying the areas
where most work is being done. The **Stats** interface of the
**{transforms-cap}** page in {kib} contains information that covers three main
areas: indexing, searching, and processing time (alternatively, you can use the
<<get-transform-stats, {transforms} stats API>>). If, for example, the results
show that the highest proportion of time is spent on search, then prioritize
efforts on optimizing the search query of the {transform}. {transforms-cap} also
have https://esrally.readthedocs.io[Rally support] that makes it possible to run
performance checks on {transform} configurations if required. If you have
optimized the crucial factors and you still experience performance issues, you
may also want to consider improving your hardware.
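
As a minimal sketch of this triage, the following Python snippet ranks the
three areas by their share of the total time. It assumes a dictionary shaped
like the `stats` object of a {transforms} stats API response, with the
`search_time_in_ms`, `index_time_in_ms`, and `processing_time_in_ms` counters.
The `time_share` helper and the sample numbers are hypothetical.

```python
def time_share(stats):
    """Return (area, fraction of total time) pairs, largest first."""
    areas = {
        "search": stats.get("search_time_in_ms", 0),
        "index": stats.get("index_time_in_ms", 0),
        "processing": stats.get("processing_time_in_ms", 0),
    }
    total = sum(areas.values())
    if total == 0:
        return []  # no measurable work has been recorded yet
    ranked = sorted(areas.items(), key=lambda item: item[1], reverse=True)
    return [(area, ms / total) for area, ms in ranked]


# Hypothetical counters, for illustration only.
sample_stats = {
    "search_time_in_ms": 6000,
    "index_time_in_ms": 3000,
    "processing_time_in_ms": 1000,
}

for area, share in time_share(sample_stats):
    print(f"{area}: {share:.0%}")
```

With these sample counters, search accounts for the largest share, so the
search-related recommendations would be the first ones to review.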

[discrete]
[[frequency]]
== 1. Optimize `frequency` (index)

In a {ctransform}, the `frequency` configuration option sets the interval
between checks for changes in the source indices. If changes are detected, then
the source data is searched and the changes are applied to the destination
index. Depending on your use case, you may wish to reduce the frequency at which
changes are applied. By setting `frequency` to a higher value (the maximum is
one hour), you spread the workload over time at the cost of less up-to-date
data.
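
The following Python sketch builds a request body that changes only
`frequency`, for example to send to the update {transform} API. The
`frequency_update` helper is hypothetical, and as a simplification it accepts
only whole seconds, minutes, or hours; it rejects values above the one-hour
maximum mentioned above.

```python
import json
import re

MAX_FREQUENCY_SECONDS = 3600  # one hour, the maximum mentioned above

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def frequency_update(frequency):
    """Build an update body that changes only `frequency`."""
    match = re.fullmatch(r"(\d+)([smh])", frequency)
    if match is None:
        raise ValueError(f"unsupported frequency: {frequency!r}")
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    if seconds == 0 or seconds > MAX_FREQUENCY_SECONDS:
        raise ValueError("frequency must be above zero and at most 1h")
    return {"frequency": frequency}


# Body for a request such as: POST _transform/<transform_id>/_update
print(json.dumps(frequency_update("30m")))
```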

[discrete]
[[increase-shards-dest-index]]
== 2. Increase the number of shards of the destination index (index)

Depending on the size of the destination index, you may consider increasing its
shard count. {transforms-cap} use one shard by default when creating the
destination index. To override the index settings, create the destination index
before starting the {transform}. For more information about how the number of
shards affects scalability and resilience, refer to <<scalability>>.

TIP: Use the <<preview-transform>> to check the settings that the {transform}
would use to create the destination index. You can copy and adjust these in
order to create the destination index prior to starting the {transform}.
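
One way to follow the tip is sketched below in Python: copy the previewed
index definition, override the shard count, and use the result as the body of
the request that creates the destination index. The shape of `previewed`
(nested `settings.index` keys), the field name, and the `dest_index_body`
helper are assumptions for illustration; check them against your own preview
output.

```python
import copy
import json


def dest_index_body(generated_dest_index, number_of_shards):
    """Copy previewed index settings and override the shard count."""
    body = copy.deepcopy(generated_dest_index)
    index_settings = body.setdefault("settings", {}).setdefault("index", {})
    index_settings["number_of_shards"] = number_of_shards
    return body


# Hypothetical excerpt of a preview response, for illustration only.
previewed = {
    "mappings": {"properties": {"bytes_sent_sum": {"type": "double"}}},
    "settings": {"index": {"number_of_shards": "1"}},
}

# Body for a request such as: PUT <destination_index>
print(json.dumps(dest_index_body(previewed, 3), indent=2))
```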

[discrete]
[[search-queries]]
== 3. Profile and optimize your search queries (search)

If you have defined a {transform} source index `query`, ensure it is as
efficient as possible. Use the **Search Profiler** under **Dev Tools** in {kib}
to get detailed timing information about the execution of individual components
in the search request. Alternatively, you can use the <<search-profile>>. The
results give you insight into how search requests are executed at a low level so
that you can understand why certain requests are slow, and take steps to improve
them.

{transforms-cap} execute standard {es} search requests. There are different ways
to write {es} queries, and some of them are more efficient than others. Consult
<<tune-for-search-speed>> to learn more about {es} performance tuning.
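
As a rough illustration, the Python sketch below wraps a source query in a
search body with profiling turned on, which you could then run against the
source index. The `profiled_search_body` helper and the sample query are
hypothetical.

```python
import json


def profiled_search_body(source_query):
    """Wrap a transform source query in a search body with profiling on."""
    return {"profile": True, "query": source_query}


# Hypothetical source query, for illustration only.
source_query = {"term": {"status": "active"}}

# Body for a request such as: GET <source_index>/_search
print(json.dumps(profiled_search_body(source_query), indent=2))
```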

[discrete]
[[limit-source-query]]
== 4. Limit the scope of the source query (search)

Imagine your {ctransform} is configured to group by `IP` and calculate the sum
of `bytes_sent`. For each checkpoint, a {ctransform} detects changes in the
source data since the previous checkpoint, identifying the IPs for which new
data has been ingested. Then it performs a second search, filtered for this
group of IPs, in order to calculate the total `bytes_sent`. If this second
search matches many shards, it can be resource intensive. Consider limiting the
scope that the source index pattern and query will match.

Use an absolute time value as a date range filter in your source query (for
example, greater than `2020-01-01T00:00:00`) to limit which historical indices
are accessed. If you use a relative time value (for example, `now-30d`), then
this date range is re-evaluated at the point of each checkpoint execution.
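
A minimal Python sketch of such a `source` definition follows. It combines a
narrower index pattern with an absolute lower bound on a time field. The index
pattern, the field name, and the `bounded_source` helper are hypothetical.

```python
import json


def bounded_source(index_pattern, time_field, start):
    """Build a transform `source` with an absolute lower time bound."""
    return {
        "index": index_pattern,
        "query": {"range": {time_field: {"gt": start}}},
    }


# The index pattern and field name are hypothetical.
source = bounded_source("web-logs-2020*", "@timestamp", "2020-01-01T00:00:00")
print(json.dumps(source, indent=2))
```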

[discrete]
[[optimize-shading-strategy]]
== 5. Optimize the sharding strategy for the source index (search)

There is no one-size-fits-all sharding strategy. A strategy that works in one
environment may not scale in another. A good sharding strategy must account for
your infrastructure, use case, and performance expectations.

Too few shards may mean that the benefits of distributing the workload cannot be
realized; however, too many shards may impact your cluster health. To learn more
about sizing your shards, read this <<size-your-shards,guide>>.

[discrete]
[[tune-max-page-search-size]]
== 6. Tune `max_page_search_size` (search)

The `max_page_search_size` {transform} configuration option defines the number
of buckets that are returned for each search request. The default value is 500.
If you increase this value, you get better throughput at the cost of higher
latency and memory usage.

The ideal value of this parameter is highly dependent on your use case. If your
{transform} executes memory-intensive aggregations – for example, cardinality or
percentiles – then increasing `max_page_search_size` requires more available
memory. If memory limits are exceeded, a circuit breaker exception occurs.
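
The Python sketch below builds request bodies that raise the page size in
steps from the default, so that you can observe memory usage and latency after
each change. It assumes the option is set under the `settings` object of the
{transform} configuration, which may differ in older versions; the
`page_size_update` helper and the step factors are hypothetical.

```python
import json

DEFAULT_PAGE_SIZE = 500  # default stated above


def page_size_update(size):
    """Build an update body that changes only `max_page_search_size`."""
    if not isinstance(size, int) or size <= 0:
        raise ValueError("max_page_search_size must be a positive integer")
    return {"settings": {"max_page_search_size": size}}


# Step up gradually from the default instead of jumping to a large value.
candidates = [DEFAULT_PAGE_SIZE * factor for factor in (1, 2, 4)]
for size in candidates:
    print(json.dumps(page_size_update(size)))
```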

[discrete]
[[indexed-fields-in-source]]
== 7. Use indexed fields in your source indices (search)

Scripted fields are not indexed fields; their values are only computed at
search time. While these fields provide flexibility in how you access your
data, they increase performance costs at search time. If {transform}
performance using scripted fields is a concern, you may wish to consider using
indexed fields instead.
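
To spot candidates for this change, a simple heuristic is to walk the
{transform} configuration and list every `script` key. The Python sketch below
does that; the `find_scripts` helper and the sample `pivot` are hypothetical,
and the check would also flag an output field that happens to be named
`script`.

```python
def find_scripts(config, path=""):
    """Return the paths of every `script` key in a transform config."""
    found = []
    if isinstance(config, dict):
        for key, value in config.items():
            child = f"{path}.{key}" if path else key
            if key == "script":
                found.append(child)
            else:
                found.extend(find_scripts(value, child))
    elif isinstance(config, list):
        for position, item in enumerate(config):
            found.extend(find_scripts(item, f"{path}[{position}]"))
    return found


# Hypothetical pivot: a scripted group_by and a sum on an indexed field.
pivot = {
    "group_by": {
        "hour_of_day": {
            "terms": {"script": {"source": "doc['@timestamp'].value.getHour()"}}
        }
    },
    "aggregations": {"bytes_sent_sum": {"sum": {"field": "bytes_sent"}}},
}

print(find_scripts(pivot))
```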

[discrete]
[[index-sorting-group-by-ordering]]
== 8. Use index sorting and `group_by` ordering (search, process)

If you use more than one `group_by` field in your {transform}, then the order of
the fields in conjunction with the use of <<index-modules-index-sorting>> may
improve runtime.

Index sorting enables you to store documents on disk in a specific order, which
can improve query efficiency. The ideal sorting logic depends on your use case,
but as a rule of thumb, sort the fields in descending order (high to low
cardinality), starting with the time-based fields. Then put the time-based
components first in the `group_by` if you have any, and apply the same
order to your `group_by` fields as configured for index sorting. Index sorting
can be defined only once, at index creation. If you don't already have index
sorting on the index that you want to use as a source, consider reindexing it to
a new, sorted index.
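
The Python sketch below shows one way to keep the two orders aligned: a single
list drives both the `index.sort.field` setting of the new source index and
the order of the `group_by` entries. The field names, the one-hour interval,
and the `group_by_for` helper are hypothetical.

```python
import json

# Hypothetical sort order: the time-based field first, then the other fields.
sort_fields = ["@timestamp", "host", "status"]

# Settings for the new, sorted source index (applied at index creation).
sorted_index_settings = {
    "settings": {"index": {"sort.field": sort_fields}}
}


def group_by_for(field):
    """Group the time field by a date histogram and other fields by terms."""
    if field == "@timestamp":
        return {"date_histogram": {"field": field, "calendar_interval": "1h"}}
    return {"terms": {"field": field}}


# Python dicts keep insertion order, so group_by follows the sort order.
group_by = {field: group_by_for(field) for field in sort_fields}

print(json.dumps(sorted_index_settings, indent=2))
print(json.dumps({"pivot": {"group_by": group_by}}, indent=2))
```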

[discrete]
[[disable-source-dest]]
== 9. Disable the `_source` field on the destination index (storage)

The <<mapping-source-field>> contains the original JSON document body that was
passed at index time. The `_source` field itself is not indexed (and thus is not
searchable), but it is still stored in the index and incurs a storage overhead.
Consider disabling `_source` to save storage space if you have a large
destination index. Disabling `_source` is only possible during index creation.

NOTE: When the `_source` field is disabled, a number of features are not
supported. Consult <<disable-source-field>> to understand the consequences
before disabling it.
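
As a minimal sketch, the Python snippet below prints a body for the request
that creates the destination index with `_source` disabled. The field name and
its type are hypothetical placeholders for your own destination mappings.

```python
import json

# Body for a request such as: PUT <destination_index>
# The field name and type are hypothetical.
dest_index_body = {
    "mappings": {
        "_source": {"enabled": False},
        "properties": {"bytes_sent_sum": {"type": "double"}},
    }
}

print(json.dumps(dest_index_body, indent=2))
```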

[discrete]
== Further reading

* <<tune-for-search-speed>>
* <<tune-for-indexing-speed>>
* <<size-your-shards>>
* <<ilm-index-lifecycle>>