parent eaab5c65e0
commit 063db03d17
@@ -2,6 +2,7 @@ include::transforms.asciidoc[leveloffset=+1]
include::overview.asciidoc[leveloffset=+2]
include::setup.asciidoc[leveloffset=+2]
include::usage.asciidoc[leveloffset=+2]
include::transforms-at-scale.asciidoc[leveloffset=+2]
include::checkpoints.asciidoc[leveloffset=+2]
include::api-quickref.asciidoc[leveloffset=+2]
include::ecommerce-tutorial.asciidoc[leveloffset=+2]

@@ -0,0 +1,199 @@
[role="xpack"]
[[transform-scale]]
= Working with {transforms} at scale
++++
<titleabbrev>{transforms-cap} at scale</titleabbrev>
++++

{transforms-cap} convert existing {es} indices into summarized indices, which
provide opportunities for new insights and analytics. The search and index
operations performed by {transforms} use standard {es} features, so similar
considerations for working with {es} at scale are often applicable to
{transforms}. If you experience performance issues, start by identifying the
bottleneck areas (search, indexing, processing, or storage), then review the
relevant considerations in this guide to improve performance. It also helps to
understand how {transforms} work, as different considerations apply depending on
whether your {transform} is running in continuous mode or in batch mode.

In this guide, you’ll learn how to:

* Understand the impact of configuration options on the performance of
{transforms}.

**Prerequisites:**

These guidelines assume you have a {transform} you want to tune, and you’re
already familiar with:

* <<transform-overview,How {transforms} work>>.
* <<transform-setup,How to set up {transforms}>>.
* <<transform-checkpoints,How {transform} checkpoints work in continuous mode>>.

The following considerations are not sequential. The numbers help you navigate
between the list items; you can take action on one or more of them in any order.
Most of the recommendations apply to both continuous and batch {transforms}. If
a list item only applies to one {transform} type, this exception is highlighted
in the description.

The keywords in parentheses at the end of each recommendation title indicate
the bottleneck area that may be improved by following the given recommendation.

[discrete]
[[measure-performance]]
== Measure {transforms} performance

To optimize {transform} performance, start by identifying the areas where most
work is being done. The **Stats** interface of the **{transforms-cap}** page in
{kib} contains information that covers three main areas: indexing, searching,
and processing time (alternatively, you can use the
<<get-transform-stats, {transforms} stats API>>). If, for example, the results
show that the highest proportion of time is spent on search, then prioritize
efforts on optimizing the search query of the {transform}. {transforms-cap} also
have https://esrally.readthedocs.io[Rally support], which makes it possible to
run performance checks on {transform} configurations if required. If you have
optimized the crucial factors and still experience performance issues, you may
also want to consider improving your hardware.

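
As a minimal sketch, the following request retrieves the statistics for a
single {transform}. The {transform} ID `my-transform` is a placeholder used only
for illustration.

[source,console]
----
GET _transform/my-transform/stats
----

In the response, compare `search_time_in_ms`, `index_time_in_ms`, and
`processing_time_in_ms` in the `stats` object to see which area accounts for the
largest share of time.
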
[discrete]
[[frequency]]
== 1. Optimize `frequency` (index)

In a {ctransform}, the `frequency` configuration option sets the interval
between checks for changes in the source indices. If changes are detected, then
the source data is searched and the changes are applied to the destination
index. Depending on your use case, you may wish to reduce the frequency at which
changes are applied. By setting `frequency` to a higher value (maximum is one
hour), the workload can be spread over time at the cost of less up-to-date data.

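
For example, assuming a {ctransform} with the placeholder ID `my-transform`, a
request along the following lines uses the update {transform} API to check for
changes every 30 minutes. The value `30m` is purely illustrative; choose an
interval that matches how fresh your destination data needs to be.

[source,console]
----
POST _transform/my-transform/_update
{
  "frequency": "30m"
}
----
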
[discrete]
[[increase-shards-dest-index]]
== 2. Increase the number of shards of the destination index (index)

Depending on the size of the destination index, you may consider increasing its
shard count. {transforms-cap} use one shard by default when creating the
destination index. To override the index settings, create the destination index
before starting the {transform}. For more information about how the number of
shards affects scalability and resilience, refer to <<scalability>>.

TIP: Use the <<preview-transform>> to check the settings that the {transform}
would use to create the destination index. You can copy and adjust these to
create the destination index prior to starting the {transform}.

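
The following sketch creates a destination index with more than one primary
shard before the {transform} is started. The index name `my-dest-index` and the
shard count of `3` are assumptions made for illustration, and the mappings are
omitted here; in practice, you would likely copy them from the preview response
mentioned in the tip above.

[source,console]
----
PUT my-dest-index
{
  "settings": {
    "index": {
      "number_of_shards": 3
    }
  }
}
----
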
[discrete]
[[search-queries]]
== 3. Profile and optimize your search queries (search)

If you have defined a {transform} source index `query`, ensure it is as
efficient as possible. Use the **Search Profiler** under **Dev Tools** in {kib}
to get detailed timing information about the execution of individual components
in the search request. Alternatively, you can use the <<search-profile>>. The
results give you insight into how search requests are executed at a low level so
that you can understand why certain requests are slow, and take steps to improve
them.

{transforms-cap} execute standard {es} search requests. There are different ways
to write {es} queries, and some of them are more efficient than others. Consult
<<tune-for-search-speed>> to learn more about {es} performance tuning.

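
As a rough illustration, a search request can be profiled by adding
`"profile": true` to the request body. The index name `my-source-index` and the
`@timestamp` range query are placeholders; substitute the source index and
`query` from your own {transform} configuration.

[source,console]
----
GET my-source-index/_search
{
  "profile": true,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2020-01-01T00:00:00"
      }
    }
  }
}
----

The `profile` section of the response breaks down the time spent in each query
component on each shard.
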
[discrete]
[[limit-source-query]]
== 4. Limit the scope of the source query (search)

Imagine your {ctransform} is configured to group by `IP` and calculate the sum
of `bytes_sent`. For each checkpoint, a {ctransform} detects changes in the
source data since the previous checkpoint, identifying the IPs for which new
data has been ingested. Then it performs a second search, filtered for this
group of IPs, in order to calculate the total `bytes_sent`. If this second
search matches many shards, then this could be resource-intensive. Consider
limiting the scope that the source index pattern and query will match.

Use an absolute time value as a date range filter in your source query (for
example, greater than `2020-01-01T00:00:00`) to limit which historical indices
are accessed. If you use a relative time value (for example, `now-30d`), then
this date range is re-evaluated at the point of each checkpoint execution.

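
A sketch of the scenario described above might look like the following
{ctransform}, which groups by `IP`, sums `bytes_sent`, and restricts the source
query with an absolute date range. The {transform} ID, the index names
`network-logs-*` and `bytes-by-ip`, the `@timestamp` field, and the `60s` delay
are all assumptions made for illustration.

[source,console]
----
PUT _transform/bytes-by-ip
{
  "source": {
    "index": "network-logs-*",
    "query": {
      "range": {
        "@timestamp": {
          "gte": "2020-01-01T00:00:00"
        }
      }
    }
  },
  "dest": {
    "index": "bytes-by-ip"
  },
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "ip": {
        "terms": {
          "field": "IP"
        }
      }
    },
    "aggregations": {
      "total_bytes_sent": {
        "sum": {
          "field": "bytes_sent"
        }
      }
    }
  }
}
----
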
[discrete]
[[optimize-shading-strategy]]
== 5. Optimize the sharding strategy for the source index (search)

There is no one-size-fits-all sharding strategy. A strategy that works in one
environment may not scale in another. A good sharding strategy must account for
your infrastructure, use case, and performance expectations.

Too few shards may mean that the benefits of distributing the workload cannot be
realized; however, too many shards may impact your cluster health. To learn more
about sizing your shards, read this <<size-your-shards,guide>>.

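
As a starting point for reviewing the current layout, the cat shards API can
list the shards behind the source indices together with their document counts
and sizes. The index pattern `network-logs-*` is a placeholder for your own
source indices.

[source,console]
----
GET _cat/shards/network-logs-*?v=true&h=index,shard,prirep,docs,store&s=index
----
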
[discrete]
[[tune-max-page-search-size]]
== 6. Tune `max_page_search_size` (search)

The `max_page_search_size` {transform} configuration option defines the number
of buckets that are returned for each search request. The default value is 500.
If you increase this value, you get better throughput at the cost of higher
latency and memory usage.

The ideal value of this parameter is highly dependent on your use case. If your
{transform} executes memory-intensive aggregations – for example, cardinality or
percentiles – then increasing `max_page_search_size` requires more available
memory. If memory limits are exceeded, a circuit breaker exception occurs.

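
As an illustration, the setting can be changed on an existing {transform} with
the update {transform} API. The {transform} ID `my-transform` and the value
`1000` are placeholders, not recommendations; test different values against
your own workload.

[source,console]
----
POST _transform/my-transform/_update
{
  "settings": {
    "max_page_search_size": 1000
  }
}
----
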
[discrete]
[[indexed-fields-in-source]]
== 7. Use indexed fields in your source indices (search)

Scripted fields are not indexed fields; their values are only computed at
search time. While these fields provide flexibility in how you access your data,
they increase performance costs at search time. If {transform} performance using
scripted fields is a concern, you may wish to consider using indexed fields
instead.

[discrete]
[[index-sorting-group-by-ordering]]
== 8. Use index sorting and `group_by` ordering (search, process)

If you use more than one `group_by` field in your {transform}, then the order of
the fields in conjunction with the use of <<index-modules-index-sorting>> may
improve runtime.

Index sorting enables you to store documents on disk in a specific order, which
can improve query efficiency. The ideal sorting logic depends on your use case,
but as a rule of thumb, sort the fields in descending order of cardinality (high
to low), starting with the time-based fields. Then, in the `group_by`, put any
time-based components first and apply the same order to your `group_by` fields
as configured for index sorting. Index sorting can be defined only once at index
creation. If you don't already have index sorting on the index that you want to
use as a source, consider reindexing it to a new, sorted index.

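
The following sketch creates a sorted index and reindexes existing data into
it. The index names `network-logs` and `network-logs-sorted`, the fields
`@timestamp` and `IP`, and their mappings are assumptions made for illustration;
the sort order is left at its default.

[source,console]
----
PUT network-logs-sorted
{
  "settings": {
    "index": {
      "sort.field": ["@timestamp", "IP"]
    }
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "IP": { "type": "keyword" }
    }
  }
}

POST _reindex
{
  "source": { "index": "network-logs" },
  "dest": { "index": "network-logs-sorted" }
}
----

A {transform} reading from this index would then list a `date_histogram` on
`@timestamp` first in its `group_by`, followed by a `terms` grouping on `IP`.
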
[discrete]
[[disable-source-dest]]
== 9. Disable the `_source` field on the destination index (storage)

The <<mapping-source-field>> contains the original JSON document body that was
passed at index time. The `_source` field itself is not indexed (and thus is not
searchable), but it is still stored in the index and incurs a storage overhead.
Consider disabling `_source` to save storage space if you have a large
destination index. Disabling `_source` is only possible during index creation.

NOTE: When the `_source` field is disabled, a number of features are not
supported. Consult <<disable-source-field>> to understand the consequences
before disabling it.

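
For example, a destination index with `_source` disabled could be created
before starting the {transform}, along these lines. The index name
`my-dest-index` is a placeholder, and other settings and mappings are omitted
for brevity.

[source,console]
----
PUT my-dest-index
{
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}
----
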
[discrete]
== Further reading

* <<tune-for-search-speed>>
* <<tune-for-indexing-speed>>
* <<size-your-shards>>
* <<ilm-index-lifecycle>>

@@ -13,6 +13,8 @@ your data.
* <<transform-overview>>
* <<transform-setup>>
* <<transform-usage>>
* <<transform-scale>>
* <<transform-checkpoints>>
* <<transform-api-quickref>>
* <<ecommerce-transforms>>
* <<transform-examples>>