[role="xpack"]
[[ml-delayed-data-detection]]
=== Handling delayed data
Delayed data are documents that are indexed late; that is, data that relates to
a time the {dfeed} has already processed.
When you create a {dfeed}, you can specify a
{ref}/ml-put-datafeed.html#ml-put-datafeed-request-body[`query_delay`] setting.
This setting enables the {dfeed} to wait for some time past real-time, which
means any "late" data in this period is fully indexed before the {dfeed} tries
to gather it. However, if the setting is too low, the {dfeed} may query for
data before it has been indexed and consequently miss those documents.
Conversely, if it is set too high, analysis drifts farther away from real-time.
The best balance depends on the use case and the environmental factors of the
cluster.
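
For example, you can set `query_delay` when you create the {dfeed}. The job
name, index name, and delay value below are hypothetical; choose a delay that
matches the typical ingest lag of your cluster:

[source,console]
----
PUT _ml/datafeeds/datafeed-sample
{
  "job_id": "sample-job",
  "indices": ["sample-data"],
  "query_delay": "120s"
}
----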
==== Why worry about delayed data?
This is a particularly pertinent question. If data are delayed randomly (and
consequently are missing from analysis), the results of certain types of
functions are not really affected. In these situations, it all comes out okay
in the end as the delayed data are distributed randomly. An example would be a
`mean` metric for a field in a large collection of data. In this case, checking
for delayed data may not provide much benefit. If data are consistently
delayed, however, {anomaly-jobs} with a `low_count` function may provide false
positives. In this situation, it would be useful to see whether data comes in
after an anomaly is recorded so that you can determine a next course of action.
==== How do we detect delayed data?
In addition to the `query_delay` field, there is a delayed data check config,
which enables you to configure the {dfeed} to look in the past for delayed
data. Every 15 minutes or every `check_window`, whichever is smaller, the
{dfeed} triggers a document search over the configured indices. This search
looks over a time span with a length of `check_window` ending with the latest
finalized bucket. That time span is partitioned into buckets, whose length
equals the bucket span of the associated {anomaly-job}. The `doc_count` of
those buckets is then compared with the job's finalized analysis buckets to see
whether any data has arrived since the analysis. If data is indeed missing due
to ingest delay, the end user is notified. For example, you can see annotations
in {kib} for the periods where these delays occur.
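
The check is configured on the {dfeed} through the `delayed_data_check_config`
object. In the sketch below, the job and index names are hypothetical; a
`check_window` of two hours means each search covers the two hours ending with
the latest finalized bucket:

[source,console]
----
PUT _ml/datafeeds/datafeed-sample
{
  "job_id": "sample-job",
  "indices": ["sample-data"],
  "delayed_data_check_config": {
    "enabled": true,
    "check_window": "2h"
  }
}
----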
==== What to do about delayed data?
The most common course of action is simply to do nothing. For many functions
and situations, ignoring the data is acceptable. However, if the amount of
delayed data is too great or the situation calls for it, the next course of
action to consider is to increase the `query_delay` of the {dfeed}. This
increased delay allows more time for data to be indexed. If you have real-time
constraints, however, an increased delay might not be desirable, in which case
you would have to {ref}/tune-for-indexing-speed.html[tune for better indexing speed].
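
An existing {dfeed} can be given a larger `query_delay` with the update
{dfeed} API. The {dfeed} ID and delay value below are hypothetical:

[source,console]
----
POST _ml/datafeeds/datafeed-sample/_update
{
  "query_delay": "300s"
}
----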