[7.10] [DOCS] Adds Working with transforms at scale to docs (#65726) (#65966)

2020-12-08 07:52:32 +01:00 · 2020-12-08 07:52:32 +01:00 · 063db03d17
parent eaab5c65e0
commit 063db03d17
3 changed files with 202 additions and 0 deletions
--- a/docs/reference/transform/index.asciidoc
+++ b/docs/reference/transform/index.asciidoc
@@ -2,6 +2,7 @@ include::transforms.asciidoc[leveloffset=+1]
 include::overview.asciidoc[leveloffset=+2]
 include::setup.asciidoc[leveloffset=+2]
 include::usage.asciidoc[leveloffset=+2]
+include::transforms-at-scale.asciidoc[leveloffset=+2]
 include::checkpoints.asciidoc[leveloffset=+2]
 include::api-quickref.asciidoc[leveloffset=+2]
 include::ecommerce-tutorial.asciidoc[leveloffset=+2]
--- a/docs/reference/transform/transforms-at-scale.asciidoc
+++ b/docs/reference/transform/transforms-at-scale.asciidoc
@@ -0,0 +1,199 @@
+[role="xpack"]
+[[transform-scale]]
+= Working with {transforms} at scale
++++
+<titleabbrev>{transforms-cap} at scale</titleabbrev>
++++
+
+{transforms-cap} convert existing {es} indices into summarized indices, which
+provide opportunities for new insights and analytics. The search and index
+operations performed by {transforms} use standard {es} features, so the
+considerations for working with {es} at scale often apply to {transforms} as
+well. If you experience performance issues, start by identifying the
+bottleneck areas (search, indexing, processing, or storage), then review the
+relevant considerations in this guide to improve performance. It also helps to
+understand how {transforms} work, because different considerations apply
+depending on whether your {transform} runs in continuous or batch mode.
+
+In this guide, you’ll learn how to:
+
+* Understand the impact of configuration options on the performance of 
+  {transforms}.
+
+**Prerequisites:**
+
+These guidelines assume you have a {transform} you want to tune, and you’re
+already familiar with:
+
+* <<transform-overview,How {transforms} work>>.
+* <<transform-setup,How to set up {transforms}>>.
+* <<transform-checkpoints,How {transform} checkpoints work in continuous mode>>.
+
+The following considerations are not sequential – the numbers help to navigate 
+between the list items; you can take action on one or more of them in any order. 
+Most of the recommendations apply to both continuous and batch {transforms}. If 
+a list item only applies to one {transform} type, this exception is highlighted 
+in the description.
+
+The keywords in parentheses at the end of each recommendation title indicate
+the bottleneck area that may be improved by following the given recommendation.
+
+[discrete]
+[[measure-performance]]
+== Measure {transforms} performance
+
+In order to optimize {transform} performance, start by identifying the areas 
+where most work is being done. The **Stats** interface of the 
+**{transforms-cap}** page in {kib} contains information that covers three main 
+areas: indexing, searching, and processing time (alternatively, you can use the 
+<<get-transform-stats, {transforms} stats API>>). If, for example, the results 
+show that the highest proportion of time is spent on search, then prioritize
+efforts on optimizing the search query of the {transform}. {transforms-cap} also
+have https://esrally.readthedocs.io[Rally support] that makes it possible to run
+performance checks on {transform} configurations if required. If you have
+optimized the crucial factors and still experience performance issues, you
+may also want to consider improving your hardware.
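+
+As a rough illustration, the request below retrieves the statistics for a
+single {transform}. The ID `my-transform` is a hypothetical placeholder.
+Comparing the `search_time_in_ms`, `index_time_in_ms`, and
+`processing_time_in_ms` values in the response is one way to see where most of
+the time is spent.
+
+[source,console]
+----
+# `my-transform` is a placeholder ID, not one defined in this guide
+GET _transform/my-transform/_stats
+----
+// TEST[skip:hypothetical example]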
+
+
+[discrete]
+[[frequency]]
+== 1. Optimize `frequency` (index)
+
+In a {ctransform}, the `frequency` configuration option sets the interval 
+between checks for changes in the source indices. If changes are detected, then 
+the source data is searched and the changes are applied to the destination 
+index. Depending on your use case, you may wish to reduce the frequency at which 
+changes are applied. By setting `frequency` to a higher value (maximum is one 
+hour), the workload can be spread over time at the cost of less up-to-date data.
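+
+A minimal sketch of raising `frequency` on an existing {ctransform} follows.
+The ID `my-transform` and the `10m` interval are illustrative assumptions;
+choose a value that matches how fresh your destination data needs to be.
+
+[source,console]
+----
+# `my-transform` and `10m` are illustrative values
+POST _transform/my-transform/_update
+{
+  "frequency": "10m"
+}
+----
+// TEST[skip:hypothetical example]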
+
+
+[discrete]
+[[increase-shards-dest-index]]
+== 2. Increase the number of shards of the destination index (index)
+
+Depending on the size of the destination index, you may consider increasing its 
+shard count. {transforms-cap} use one shard by default when creating the 
+destination index. To override the index settings, create the destination index 
+before starting the {transform}. For more information about how the number of 
+shards affects scalability and resilience, refer to <<scalability>>.
+
+TIP: Use the <<preview-transform>> to check the settings that the {transform} 
+would use to create the destination index. You can copy and adjust these in 
+order to create the destination index prior to starting the {transform}.
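+
+For example, a destination index with more than one primary shard could be
+created along these lines before the {transform} is started. The index name
+`my-dest-index` and the shard count of `3` are assumptions for illustration
+only. In practice, you would likely also add the mappings reported by the
+preview.
+
+[source,console]
+----
+# `my-dest-index` and the shard count are illustrative values
+PUT my-dest-index
+{
+  "settings": {
+    "index": {
+      "number_of_shards": 3
+    }
+  }
+}
+----
+// TEST[skip:hypothetical example]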
+
+
+[discrete]
+[[search-queries]]
+== 3. Profile and optimize your search queries (search)
+
+If you have defined a {transform} source index `query`, ensure it is as 
+efficient as possible. Use the **Search Profiler** under **Dev Tools** in {kib} 
+to get detailed timing information about the execution of individual components 
+in the search request. Alternatively, you can use the <<search-profile>>. The 
+results give you insight into how search requests are executed at a low level so 
+that you can understand why certain requests are slow, and take steps to improve 
+them.
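+
+The following sketch shows one way to profile a query outside of {kib} by
+setting `profile` to `true` on a search request. The index name
+`my-source-index`, the `timestamp` field, and the range filter are hypothetical;
+substitute the `query` from your own {transform} source configuration.
+
+[source,console]
+----
+# `my-source-index` and the `timestamp` field are placeholders
+GET my-source-index/_search
+{
+  "profile": true,
+  "query": {
+    "range": {
+      "timestamp": {
+        "gte": "now-1d"
+      }
+    }
+  }
+}
+----
+// TEST[skip:hypothetical example]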
+
+{transforms-cap} execute standard {es} search requests. There are different ways 
+to write {es} queries, and some of them are more efficient than others. Consult 
+<<tune-for-search-speed>> to learn more about {es} performance tuning.
+
+
+[discrete]
+[[limit-source-query]]
+== 4. Limit the scope of the source query (search)
+
+Imagine your {ctransform} is configured to group by `IP` and calculate the sum 
+of `bytes_sent`. For each checkpoint, a {ctransform} detects changes in the 
+source data since the previous checkpoint, identifying the IPs for which new 
+data has been ingested. Then it performs a second search, filtered for this 
+group of IPs, in order to calculate the total `bytes_sent`. If this second 
+search matches many shards, then this could be resource intensive. Consider 
+limiting the scope that the source index pattern and query will match.
+
+Use an absolute time value as a date range filter in your source query (for 
+example, greater than `2020-01-01T00:00:00`) to limit which historical indices 
+are accessed. If you use a relative time value (for example, `now-30d`) then 
+this date range is re-evaluated at the point of each checkpoint execution.
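+
+As an illustration of an absolute date range filter, the source of an existing
+{transform} might be restricted as sketched below. The {transform} ID, the
+index pattern `my-logs-2020*`, and the `timestamp` field are assumptions, and
+this approach only fits if older documents do not need to contribute to the
+results.
+
+[source,console]
+----
+# The ID, index pattern, and field name are illustrative values
+POST _transform/my-transform/_update
+{
+  "source": {
+    "index": "my-logs-2020*",
+    "query": {
+      "range": {
+        "timestamp": {
+          "gte": "2020-01-01T00:00:00"
+        }
+      }
+    }
+  }
+}
+----
+// TEST[skip:hypothetical example]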
+
+
+[discrete]
+[[optimize-shading-strategy]]
+== 5. Optimize the sharding strategy for the source index (search)
+
+There is no one-size-fits-all sharding strategy. A strategy that works in one 
+environment may not scale in another. A good sharding strategy must account for 
+your infrastructure, use case, and performance expectations.
+
+Too few shards may mean that the benefits of distributing the workload cannot be
+realized; however, too many shards may impact your cluster health. To learn more
+about sizing your shards, read this <<size-your-shards,guide>>.
+
+
+[discrete]
+[[tune-max-page-search-size]]
+== 6. Tune `max_page_search_size` (search)
+
+The `max_page_search_size` {transform} configuration option defines the number 
+of buckets that are returned for each search request. The default value is 500. 
+If you increase this value, you get better throughput at the cost of higher 
+latency and memory usage.
+
+The ideal value of this parameter is highly dependent on your use case. If your 
+{transform} executes memory-intensive aggregations – for example, cardinality or 
+percentiles – then increasing `max_page_search_size` requires more available 
+memory. If memory limits are exceeded, a circuit breaker exception occurs.
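+
+A minimal sketch of changing this option on an existing {transform} is shown
+below. The ID `my-transform` and the value `1000` are illustrative assumptions,
+not recommendations; adjust the value gradually and watch for circuit breaker
+exceptions.
+
+[source,console]
+----
+# `my-transform` and `1000` are illustrative values
+POST _transform/my-transform/_update
+{
+  "settings": {
+    "max_page_search_size": 1000
+  }
+}
+----
+// TEST[skip:hypothetical example]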
+
+
+[discrete]
+[[indexed-fields-in-source]]
+== 7. Use indexed fields in your source indices (search)
+
+Scripted fields are not indexed fields; their values are only 
+computed at search time. While these fields provide flexibility in 
+how you access your data, they increase performance costs at search time. If 
+{transform} performance using scripted fields is a concern, 
+you may wish to consider using indexed fields instead.
+
+
+[discrete]
+[[index-sorting-group-by-ordering]]
+== 8. Use index sorting and `group_by` ordering (search, process)
+
+If you use more than one `group_by` field in your {transform}, then the order of 
+the fields in conjunction with the use of <<index-modules-index-sorting>> may 
+improve runtime.
+
+Index sorting enables you to store documents on disk in a specific order, which
+can improve query efficiency. The ideal sorting logic depends on your use case,
+but as a rule of thumb, sort on the time-based fields first, followed by the
+remaining fields in descending order of cardinality (high to low). Then apply
+the same order to your `group_by` fields as configured for index sorting, with
+any time-based components first. Index sorting can be defined only once, at
+index creation. If you don't already have index sorting on the index that you
+want to use as a source, consider reindexing it to a new, sorted index.
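+
+The sketch below creates a sorted index and then reindexes existing data into
+it. The index names and the `timestamp` and `host` fields are hypothetical, and
+the ascending sort order is an assumption; use the fields and order that match
+your own `group_by` configuration.
+
+[source,console]
+----
+# Index names, field names, and sort order are illustrative values
+PUT my-sorted-source-index
+{
+  "settings": {
+    "index": {
+      "sort.field": ["timestamp", "host"],
+      "sort.order": ["asc", "asc"]
+    }
+  },
+  "mappings": {
+    "properties": {
+      "timestamp": { "type": "date" },
+      "host": { "type": "keyword" }
+    }
+  }
+}
+
+# Copy the existing documents into the new, sorted index
+POST _reindex
+{
+  "source": { "index": "my-source-index" },
+  "dest": { "index": "my-sorted-source-index" }
+}
+----
+// TEST[skip:hypothetical example]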
+
+
+[discrete]
+[[disable-source-dest]]
+== 9. Disable the `_source` field on the destination index (storage)
+
+The <<mapping-source-field>> contains the original JSON document body that was 
+passed at index time. The `_source` field itself is not indexed (and thus is not 
+searchable), but it is still stored in the index and incurs a storage overhead. 
+Consider disabling `_source` to save storage space if you have a large 
+destination index. Disabling `_source` is only possible during index creation.
+
+NOTE: When the `_source` field is disabled, a number of features are not 
+supported. Consult <<disable-source-field>> to understand the consequences 
+before disabling it.
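+
+For illustration, a destination index with `_source` disabled could be created
+as follows before the {transform} is started. The index name `my-dest-index` is
+a placeholder, and you would normally add the field mappings that your
+{transform} needs as well.
+
+[source,console]
+----
+# `my-dest-index` is a placeholder index name
+PUT my-dest-index
+{
+  "mappings": {
+    "_source": {
+      "enabled": false
+    }
+  }
+}
+----
+// TEST[skip:hypothetical example]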
+
+
+[discrete]
+== Further reading
+
+* <<tune-for-search-speed>>
+* <<tune-for-indexing-speed>>
+* <<size-your-shards>>
+* <<ilm-index-lifecycle>>
--- a/docs/reference/transform/transforms.asciidoc
+++ b/docs/reference/transform/transforms.asciidoc
@@ -13,6 +13,8 @@ your data.
 * <<transform-overview>>
 * <<transform-setup>>
 * <<transform-usage>>
+* <<transform-scale>>
+* <<transform-checkpoints>>
 * <<transform-api-quickref>>
 * <<ecommerce-transforms>>
 * <<transform-examples>>