Adding more docs for delayed data detection (#36738)
* Adding more docs for delayed data detection
parent 1d429cf1c9
commit 75f1c79d9f
@@ -65,9 +65,10 @@ A {dfeed} resource has the following properties:
|
||||||
releases earlier than 6.0.0. For more information, see <<removal-of-types>>.
|
releases earlier than 6.0.0. For more information, see <<removal-of-types>>.
|
||||||
|
|
||||||
`delayed_data_check_config`::
|
`delayed_data_check_config`::
|
||||||
(object) Specifies if and with how large a window should the data feed check
|
(object) Specifies whether the data feed checks for missing data and
|
||||||
for missing data. See <<ml-datafeed-delayed-data-check-config>>.
|
the size of the window. For example:
|
||||||
For example: `{"enabled": true, "check_window": "1h"}`
|
`{"enabled": true, "check_window": "1h"}`. See
|
||||||
|
<<ml-datafeed-delayed-data-check-config>>.
|
||||||
|
|
||||||
[[ml-datafeed-chunking-config]]
|
[[ml-datafeed-chunking-config]]
|
||||||
==== Chunking Configuration Objects
|
==== Chunking Configuration Objects
|
||||||
|
@@ -97,7 +98,8 @@ A chunking configuration object has the following properties:
|
||||||
The {dfeed} can optionally search over indices that have already been read in
|
The {dfeed} can optionally search over indices that have already been read in
|
||||||
an effort to find if any data has since been added to the index. If missing data
|
an effort to find if any data has since been added to the index. If missing data
|
||||||
is found, it is a good indication that the `query_delay` option is set too low and
|
is found, it is a good indication that the `query_delay` option is set too low and
|
||||||
the data is being indexed after the {dfeed} has passed that moment in time.
|
the data is being indexed after the {dfeed} has passed that moment in time. See
|
||||||
|
{stack-ov}/ml-delayed-data-detection.html[Working with delayed data].
|
||||||
|
|
||||||
This check only runs on real-time {dfeeds}
|
This check only runs on real-time {dfeeds}
|
||||||
|
|
||||||
|
|
|
@@ -32,9 +32,10 @@ The scenarios in this section describe some best practices for generating useful
|
||||||
* <<ml-configuring-url>>
|
* <<ml-configuring-url>>
|
||||||
* <<ml-configuring-aggregation>>
|
* <<ml-configuring-aggregation>>
|
||||||
* <<ml-configuring-categories>>
|
* <<ml-configuring-categories>>
|
||||||
|
* <<ml-configuring-detector-custom-rules>>
|
||||||
* <<ml-configuring-pop>>
|
* <<ml-configuring-pop>>
|
||||||
* <<ml-configuring-transform>>
|
* <<ml-configuring-transform>>
|
||||||
* <<ml-configuring-detector-custom-rules>>
|
* <<ml-delayed-data-detection>>
|
||||||
|
|
||||||
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/customurl.asciidoc
|
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/customurl.asciidoc
|
||||||
include::customurl.asciidoc[]
|
include::customurl.asciidoc[]
|
||||||
|
@@ -42,6 +43,9 @@ include::customurl.asciidoc[]
|
||||||
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/aggregations.asciidoc
|
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/aggregations.asciidoc
|
||||||
include::aggregations.asciidoc[]
|
include::aggregations.asciidoc[]
|
||||||
|
|
||||||
|
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/detector-custom-rules.asciidoc
|
||||||
|
include::detector-custom-rules.asciidoc[]
|
||||||
|
|
||||||
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/categories.asciidoc
|
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/categories.asciidoc
|
||||||
include::categories.asciidoc[]
|
include::categories.asciidoc[]
|
||||||
|
|
||||||
|
@@ -51,5 +55,5 @@ include::populations.asciidoc[]
|
||||||
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/transforms.asciidoc
|
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/transforms.asciidoc
|
||||||
include::transforms.asciidoc[]
|
include::transforms.asciidoc[]
|
||||||
|
|
||||||
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/detector-custom-rules.asciidoc
|
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/delayed-data-detection.asciidoc
|
||||||
include::detector-custom-rules.asciidoc[]
|
include::delayed-data-detection.asciidoc[]
|
|
@@ -0,0 +1,42 @@
[role="xpack"]
[[ml-delayed-data-detection]]
=== Handling delayed data

Delayed data are documents that are indexed late. That is to say, it is data
related to a time that the {dfeed} has already processed.

When you create a datafeed, you can specify a {ref}/ml-datafeed-resource.html[`query_delay`] setting.
This setting enables the datafeed to wait for some time past real-time, which means any "late" data in this period
is fully indexed before the datafeed tries to gather it. However, if the setting is set too low, the datafeed may query
for data before it has been indexed and consequently miss that document. Conversely, if it is set too high,
analysis drifts farther away from real-time. The balance that is struck depends upon each use case and
the environmental factors of the cluster.
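
As a rough illustration of where these settings live, the following Python sketch assembles a datafeed configuration body that sets both `query_delay` and `delayed_data_check_config`. The helper name `build_datafeed_config`, the job ID `my-job`, the index pattern `my-index-*`, and the `120s` delay are assumptions made up for this example. Only the field names and the `{"enabled": true, "check_window": "1h"}` object come from the datafeed documentation.

```python
import json


def build_datafeed_config(job_id, indices, query_delay, check_window):
    """Return a datafeed configuration body as a plain dict (hypothetical helper)."""
    return {
        "job_id": job_id,
        "indices": list(indices),
        # How far behind real-time the datafeed queries, to let late data land.
        "query_delay": query_delay,
        # Look back over this window for documents indexed after analysis.
        "delayed_data_check_config": {
            "enabled": True,
            "check_window": check_window,
        },
    }


# All values below are placeholders, not recommendations.
config = build_datafeed_config("my-job", ["my-index-*"], "120s", "1h")
print(json.dumps(config, indent=2))
```

The sketch only builds and prints the body; sending it to a cluster is left out on purpose.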

==== Why worry about delayed data?

This is a particularly pertinent question. If data are delayed randomly (and consequently missing from analysis),
the results of certain types of functions are not really affected. It all comes out OK in the end
as the delayed data is distributed randomly. An example would be a `mean` metric for a field in a large collection of data.
In this case, checking for delayed data may not provide much benefit. If data are consistently delayed, however, jobs with a `low_count` function may
provide false positives. In this situation, it would be useful to see if data
comes in after an anomaly is recorded so that you can determine a next course of action.
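
The difference between the two cases can be illustrated with a small simulation. This is only a sketch with synthetic data, not how {ml} jobs compute their results: the distribution, the 20% drop rate, and the variable names are all assumptions chosen for the example. It drops documents at random and compares the mean and the document count before and after.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic field values for one large bucket of documents.
values = [rng.gauss(100.0, 10.0) for _ in range(10_000)]

# Pretend roughly 20% of documents arrive late, chosen at random, and are missed.
kept = [v for v in values if rng.random() >= 0.2]

full_mean = statistics.fmean(values)
kept_mean = statistics.fmean(kept)

print(f"mean with all data:     {full_mean:.2f}")
print(f"mean with missing data: {kept_mean:.2f}")
print(f"count with all data:     {len(values)}")
print(f"count with missing data: {len(kept)}")
```

The mean barely moves because the missing documents are a random sample, while the count falls by about a fifth. A count-based function such as `low_count` would see that drop directly.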

==== How do we detect delayed data?

In addition to the `query_delay` field, there is a
{ref}/ml-datafeed-resource.html#ml-datafeed-delayed-data-check-config[delayed data check config], which enables you to
configure the datafeed to look in the past for delayed data. Every 15 minutes or every `check_window`,
whichever is smaller, the datafeed triggers a document search over the configured indices. This search looks over a
time span with a length of `check_window` ending with the latest finalized bucket. That time span is partitioned into buckets
whose length equals the bucket span of the associated job. The `doc_count` values of those buckets are then compared with the
job's finalized analysis buckets to see whether any data has arrived since the analysis. If there is indeed missing data
due to ingest delay, the end user is notified.
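
The check described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual datafeed implementation: the function names, the use of plain integer seconds for times, and the dictionaries of per-bucket counts are all hypothetical.

```python
FIFTEEN_MINUTES = 15 * 60  # seconds


def check_interval(check_window):
    """How often the check runs: every 15 minutes or every check_window, whichever is smaller."""
    return min(FIFTEEN_MINUTES, check_window)


def bucket_starts(latest_finalized_end, check_window, bucket_span):
    """Start times of the buckets covering the check window, oldest first."""
    window_start = latest_finalized_end - check_window
    return list(range(window_start, latest_finalized_end, bucket_span))


def find_delayed(analyzed_counts, current_counts, starts):
    """Map each bucket start to the number of documents indexed after analysis."""
    missing = {}
    for start in starts:
        extra = current_counts.get(start, 0) - analyzed_counts.get(start, 0)
        if extra > 0:
            missing[start] = extra
    return missing


# Example: 5-minute buckets, 1-hour check window ending at t=7200.
starts = bucket_starts(latest_finalized_end=7200, check_window=3600, bucket_span=300)
analyzed = {3600: 10, 3900: 8}   # doc counts the job saw when it analyzed each bucket
current = {3600: 10, 3900: 11}   # doc counts in the index now
delayed = find_delayed(analyzed, current, starts)
print(check_interval(3600), len(starts), delayed)
```

In this example the bucket starting at 3900 has gained three documents since it was analyzed, which is the situation that would lead to a notification.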

==== What to do about delayed data?

The most common course of action is to simply do nothing. For many functions and situations, ignoring the data is
acceptable. However, if the amount of delayed data is too great or the situation calls for it, the next course
of action to consider is to increase the `query_delay` of the datafeed. This increased delay allows more time for data to be
indexed. If you have real-time constraints, however, an increased delay might not be desirable.
In that case, you would have to {ref}/tune-for-indexing-speed.html[tune for better indexing speed].