[DOCS] Delayed data annotations (#37939)

Lisa Cawley 2019-01-28 13:04:38 -08:00 committed by GitHub
parent 557fcf915e
commit 19529da2db
1 changed file with 35 additions and 24 deletions


@@ -5,38 +5,49 @@
Delayed data are documents that are indexed late. That is to say, it is data
related to a time that the {dfeed} has already processed.

When you create a datafeed, you can specify a
{ref}/ml-datafeed-resource.html[`query_delay`] setting. This setting enables the
datafeed to wait for some time past real-time, which means any "late" data in
this period is fully indexed before the datafeed tries to gather it. However, if
the setting is set too low, the datafeed may query for data before it has been
indexed and consequently miss that document. Conversely, if it is set too high,
analysis drifts farther away from real-time. The balance that is struck depends
upon each use case and the environmental factors of the cluster.

==== Why worry about delayed data?

This is a particularly pertinent question. If data are delayed randomly (and
consequently are missing from analysis), the results of certain types of
functions are not really affected. In these situations, it all comes out okay in
the end as the delayed data is distributed randomly. An example would be a
`mean` metric for a field in a large collection of data. In this case, checking
for delayed data may not provide much benefit. If data are consistently delayed,
however, jobs with a `low_count` function may provide false positives. In this
situation, it would be useful to see whether data comes in after an anomaly is
recorded so that you can determine a next course of action.
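
A rough sketch of this point, in plain Python with invented numbers (not part
of the {ml} implementation): randomly dropping 10% of documents barely moves a
`mean`, but it consistently shrinks the count — exactly the dip that a
`low_count` function would flag.

[source,python]
----
import random

random.seed(0)

# A "complete" stream of metric values for one bucket (invented data).
values = [random.gauss(100, 10) for _ in range(10_000)]

# Randomly delayed data: each document has a 10% chance of arriving
# only after the bucket was analyzed.
on_time = [v for v in values if random.random() > 0.10]

full_mean = sum(values) / len(values)
seen_mean = sum(on_time) / len(on_time)

# The mean computed from on-time documents is nearly unchanged...
assert abs(full_mean - seen_mean) < 1.0

# ...but the observed document count is persistently ~10% low,
# which a count-based function would misread as an anomaly.
assert len(on_time) < len(values)
----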

==== How do we detect delayed data?

In addition to the `query_delay` field, there is a
{ref}/ml-datafeed-resource.html#ml-datafeed-delayed-data-check-config[delayed data check config],
which enables you to configure the datafeed to look in the past for delayed
data. Every 15 minutes or every `check_window`, whichever is smaller, the
datafeed triggers a document search over the configured indices. This search
looks over a time span with a length of `check_window` ending with the latest
finalized bucket. That time span is partitioned into buckets, whose length
equals the bucket span of the associated job. The `doc_count` of those buckets
is then compared with the job's finalized analysis buckets to see whether any
data has arrived since the analysis. If there is indeed missing data due to
ingest delay, the end user is notified. For example, you can see annotations in
{kib} for the periods where these delays occur.
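
The comparison described above can be sketched in a few lines of plain Python.
This is an illustration only — the bucket timestamps, counts, and helper name
are invented, and the actual {ml} implementation differs.

[source,python]
----
def find_missing_data(analysis_buckets, current_counts):
    """Compare per-bucket doc counts at analysis time against a fresh search.

    analysis_buckets: {bucket_start: doc_count when the bucket was finalized}
    current_counts:   {bucket_start: doc_count from a fresh search now}
    Returns the buckets where documents arrived after the analysis.
    """
    missing = {}
    for bucket_start, analyzed in analysis_buckets.items():
        now = current_counts.get(bucket_start, analyzed)
        if now > analyzed:
            missing[bucket_start] = now - analyzed
    return missing

# Invented example: two buckets received late documents.
analysis = {"10:00": 120, "10:15": 130, "10:30": 0}
fresh    = {"10:00": 120, "10:15": 134, "10:30": 25}

print(find_missing_data(analysis, fresh))
# → {'10:15': 4, '10:30': 25}
----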

==== What to do about delayed data?

The most common course of action is simply to do nothing. For many functions
and situations, ignoring the data is acceptable. However, if the amount of
delayed data is too great or the situation calls for it, the next course of
action to consider is to increase the `query_delay` of the datafeed. This
increased delay allows more time for data to be indexed. If you have real-time
constraints, however, an increased delay might not be desirable. In that case,
you would have to {ref}/tune-for-indexing-speed.html[tune for better indexing speed].