[role="xpack"]
[[ml-transform-checkpoints]]
== How {dataframe-transform} checkpoints work
++++
<titleabbrev>How checkpoints work</titleabbrev>
++++

beta[]

Each time a {dataframe-transform} examines the source indices and creates or
updates the destination index, it generates a _checkpoint_.

If your {dataframe-transform} runs only once, there is logically only one
checkpoint. If your {dataframe-transform} runs continuously, however, it creates
checkpoints as it ingests and transforms new source data.

To create a checkpoint, the {cdataframe-transform}:

. Checks for changes to source indices.
+
Using a simple periodic timer, the {dataframe-transform} checks for changes to
the source indices. This check is done based on the interval defined in the
transform's `frequency` property.
+
If the source indices remain unchanged or if a checkpoint is already in
progress, the {dataframe-transform} waits for the next timer interval.

. Identifies which entities have changed.
+
The {dataframe-transform} searches to see which entities have changed since the
last time it checked. The transform's `sync` configuration object identifies a
time field in the source indices. The transform uses the values in that field to
synchronize the source and destination indices.

. Updates the destination index (the {dataframe}) with the changed entities.
+
--
The {dataframe-transform} applies changes related to either new or changed
entities to the destination index. The set of changed entities is paginated. For
each page, the {dataframe-transform} performs a composite aggregation using a
`terms` query. After all the pages of changes have been applied, the checkpoint
is complete.
--
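
The three steps above can be sketched as a small in-memory simulation. This is
not the transform implementation: `Doc`, `ToyTransform`, `run_once`, and the
`page_size` value are hypothetical names chosen for illustration, and a plain
Python sum stands in for the composite aggregation.

```python
from dataclasses import dataclass, field
from itertools import islice


@dataclass
class Doc:
    entity: str     # value of the group-by field, for example a customer ID
    timestamp: int  # value of the time field identified by `sync`
    amount: float


@dataclass
class ToyTransform:
    """Hypothetical in-memory model of one continuous transform."""
    page_size: int = 2
    last_checkpoint_time: int = 0
    checkpoint: int = 0
    dest: dict = field(default_factory=dict)  # entity -> aggregated total

    def run_once(self, source, now):
        """Simulate one timer tick. Return True if a checkpoint was created."""
        # Step 1: check for source changes since the last checkpoint.
        new_docs = [d for d in source
                    if self.last_checkpoint_time < d.timestamp <= now]
        if not new_docs:
            return False  # nothing changed, wait for the next tick

        # Step 2: identify which entities have changed.
        changed = sorted({d.entity for d in new_docs})

        # Step 3: update the destination one page of entities at a time.
        remaining = iter(changed)
        while page := list(islice(remaining, self.page_size)):
            for entity in page:
                # Re-aggregate the whole entity, not only its new documents.
                self.dest[entity] = sum(
                    d.amount for d in source
                    if d.entity == entity and d.timestamp <= now)

        self.last_checkpoint_time = now
        self.checkpoint += 1
        return True


source = [Doc("a", 1, 10.0), Doc("b", 2, 5.0)]
transform = ToyTransform()
first = transform.run_once(source, now=5)    # both entities are new
source.append(Doc("a", 7, 2.5))
second = transform.run_once(source, now=10)  # only entity "a" changed
third = transform.run_once(source, now=15)   # no changes, no checkpoint
```

In this sketch only the entities touched by new documents are recomputed, which
is why the second tick rewrites a single destination entry.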

This checkpoint process involves both search and indexing activity on the
cluster. We have attempted to favor control over performance while developing
{dataframe-transforms}. We decided it was preferable for the
{dataframe-transform} to take longer to complete, rather than to finish quickly
and take precedence in resource consumption. That being said, the cluster still
requires enough resources to support both the composite aggregation search and
the indexing of its results.

TIP: If the cluster experiences unacceptable performance degradation due to the
{dataframe-transform}, stop the transform. Consider whether you can apply a
source query to the {dataframe-transform} to reduce the scope of data it
processes. Also consider whether the cluster has sufficient resources in place
to support both the composite aggregation search and the indexing of its
results.
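
As an illustration of narrowing the scope with a source query, the following
sketch builds a transform configuration as a Python dictionary. It assumes the
general layout of the create transform request body (`source`, `dest`, `pivot`,
`frequency`, `sync`). Every index name, field name, and value is hypothetical,
so check the API reference for your version before relying on it.

```python
import json

# All index names, field names, and values below are hypothetical.
transform_body = {
    "source": {
        "index": "orders",
        # Source query: only documents that match it are processed.
        "query": {"range": {"order_date": {"gte": "now-30d"}}},
    },
    "dest": {"index": "orders_by_customer"},
    "pivot": {
        "group_by": {"customer": {"terms": {"field": "customer_id"}}},
        "aggregations": {"total_spent": {"sum": {"field": "amount"}}},
    },
    # How often the transform checks the source indices for changes.
    "frequency": "5m",
    # Time field used to synchronize the source and destination indices.
    "sync": {"time": {"field": "order_date", "delay": "60s"}},
}

request_json = json.dumps(transform_body, indent=2)
print(request_json)
```

A narrower `query` reduces the number of documents each checkpoint has to
search and aggregate, at the cost of excluding older data from the destination
index.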

[discrete]
[[ml-transform-checkpoint-errors]]
==== Error handling

Failures in {dataframe-transforms} tend to be related to searching or indexing.
To increase the resiliency of {dataframe-transforms}, the cursor positions of
the aggregated search and the changed entities search are tracked in memory and
persisted periodically.

Checkpoint failures can be categorized as follows:

* Temporary failures: The checkpoint is retried. If 10 consecutive failures
occur, the {dataframe-transform} has a failed status. For example, this
situation might occur when there are shard failures and queries return only
partial results.
* Irrecoverable failures: The {dataframe-transform} immediately fails. For
example, this situation occurs when the source index is not found.
* Adjustment failures: The {dataframe-transform} retries with adjusted settings.
For example, if parent circuit breaker memory errors occur during the
composite aggregation, the transform receives partial results. The aggregated
search is retried with a smaller number of buckets. This retry is performed at
the interval defined in the transform's `frequency` property. If the search
is retried to the point where it reaches a minimal number of buckets, an
irrecoverable failure occurs.
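
The three failure categories can be modeled as a small state machine. This is a
sketch under stated assumptions, not the actual implementation: the names
`Failure` and `ToyRetryPolicy` are hypothetical, and the minimum page size and
the halving strategy are assumed because the text above does not specify them.
Only the limit of 10 consecutive failures comes from the description.

```python
from enum import Enum, auto

MAX_CONSECUTIVE_FAILURES = 10  # from the description above
MIN_PAGE_SIZE = 10             # assumed minimum, not stated in the text


class Failure(Enum):
    TEMPORARY = auto()
    IRRECOVERABLE = auto()
    ADJUSTMENT = auto()


class ToyRetryPolicy:
    """Hypothetical model of how a transform reacts to checkpoint failures."""

    def __init__(self, page_size=500):
        self.page_size = page_size
        self.consecutive_failures = 0
        self.failed = False

    def on_success(self):
        # Any success resets the count of consecutive temporary failures.
        self.consecutive_failures = 0

    def on_failure(self, kind):
        """Update state after a failure and return "retry" or "failed"."""
        if kind is Failure.IRRECOVERABLE:
            self.failed = True
        elif kind is Failure.TEMPORARY:
            self.consecutive_failures += 1
            if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
                self.failed = True
        elif kind is Failure.ADJUSTMENT:
            if self.page_size <= MIN_PAGE_SIZE:
                self.failed = True  # already at the minimum, cannot shrink
            else:
                # Assumed strategy: halve the number of buckets per page.
                self.page_size = max(MIN_PAGE_SIZE, self.page_size // 2)
        return "failed" if self.failed else "retry"


shrinking = ToyRetryPolicy(page_size=40)
adjust_outcomes = [shrinking.on_failure(Failure.ADJUSTMENT) for _ in range(3)]

flaky = ToyRetryPolicy()
temp_outcomes = [flaky.on_failure(Failure.TEMPORARY) for _ in range(10)]
```

In the sketch, the page size shrinks from 40 to 20 to 10, and the next
adjustment failure at the minimum is treated as irrecoverable.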

If the node running the {dataframe-transform} fails, the transform restarts
from the most recent persisted cursor position. This recovery process might
repeat some of the work the transform had already done, but it ensures data
consistency.
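
A minimal sketch of why restarting from a persisted cursor repeats some work
without corrupting the result, assuming writes to the destination are keyed by
entity and therefore idempotent. `ToyIndexer`, `persist_every`, and the
uppercase "aggregation" are hypothetical stand-ins for illustration.

```python
class ToyIndexer:
    """Hypothetical indexer that persists its cursor every few items."""

    def __init__(self, items, persist_every=3):
        self.items = items
        self.persist_every = persist_every
        self.cursor = 0            # position tracked in memory
        self.persisted_cursor = 0  # last position written to durable state
        self.dest = {}             # stands in for the destination index
        self.steps_run = 0

    def step(self):
        item = self.items[self.cursor]
        self.dest[item] = item.upper()  # idempotent write keyed by entity
        self.cursor += 1
        self.steps_run += 1
        if self.cursor % self.persist_every == 0:
            self.persisted_cursor = self.cursor

    def recover(self):
        # After a node failure, only the persisted position survives.
        self.cursor = self.persisted_cursor

    def done(self):
        return self.cursor >= len(self.items)


indexer = ToyIndexer(["a", "b", "c", "d", "e"])
for _ in range(4):
    indexer.step()      # cursor is 4, but only position 3 was persisted
indexer.recover()       # simulated node failure and restart
while not indexer.done():
    indexer.step()      # item "d" is processed a second time
```

Here six steps run for five items: the repeated step overwrites the same
destination entry, so the final contents match a run with no failure.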