[DOCS] Clarify interval, frequency, and bucket span in ML APIs and example (#51280)
This commit is contained in:
parent 08e9c673e5
commit 4590d4156a
@@ -11,13 +11,28 @@ distributes these calculations across your cluster. You can then feed this
aggregated data into the {ml-features} instead of raw results, which
reduces the volume of data that must be considered while detecting anomalies.

TIP: If you use a terms aggregation and the cardinality of a term is high, the
aggregation might not be effective and you might want to just use the default
search and scroll behavior.

There are some limitations to using aggregations in {dfeeds}. Your aggregation
must include a `date_histogram` aggregation, which in turn must contain a `max`
aggregation on the time field. This requirement ensures that the aggregated data
is a time series and the timestamp of each bucket is the time of the last record
in the bucket.

You must also consider the interval of the date histogram aggregation carefully.
The bucket span of your {anomaly-job} must be divisible by the value of the
`calendar_interval` or `fixed_interval` in your aggregation (with no remainder).
If you specify a `frequency` for your {dfeed}, it must also be divisible by this
interval.

TIP: As a rule of thumb, if your detectors use <<ml-metric-functions,metric>> or
<<ml-sum-functions,sum>> analytical functions, set the date histogram
aggregation interval to a tenth of the bucket span. This suggestion creates
finer, more granular time buckets, which are ideal for this type of analysis. If
your detectors use <<ml-count-functions,count>> or <<ml-rare-functions,rare>>
functions, set the interval to the same value as the bucket span.

When you create or update an {anomaly-job}, you can include the names of
aggregations, for example:
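To make these divisibility rules concrete, the following sketch pairs a `15m` bucket span with a `90s` date histogram interval (900s / 90s = 10, matching the "tenth of the bucket span" rule of thumb) and a `450s` {dfeed} `frequency` (450s / 90s = 5, no remainder). The field name `time` and all values here are hypothetical, and the {anomaly-job} and {dfeed} fragments are shown together for illustration only:

[source,js]
----------------------------------
"analysis_config": {
  "bucket_span": "15m"
}
...
"frequency": "450s",
"aggregations": {
  "buckets": {
    "date_histogram": {
      "field": "time",
      "fixed_interval": "90s"
    },
    "aggregations": {
      "time": {
        "max": { "field": "time" }
      }
    }
  }
}
----------------------------------
// NOTCONSOLE

The `max` aggregation on `time` satisfies the requirement described above: it makes the timestamp of each bucket the time of the last record in that bucket.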
@@ -143,9 +158,9 @@ pipeline aggregation to find the first order derivative of the counter
----------------------------------
// NOTCONSOLE

{dfeeds-cap} support not only multi-bucket aggregations, but also single bucket
aggregations. The following shows two `filter` aggregations, each gathering the
number of unique entries for the `error` field.

[source,js]
----------------------------------
@@ -225,14 +240,15 @@ When you define an aggregation in a {dfeed}, it must have the following form:
----------------------------------
// NOTCONSOLE

The top level aggregation must be either a
{ref}/search-aggregations-bucket.html[bucket aggregation] containing a single
sub-aggregation that is a `date_histogram`, or the top level aggregation is the
required `date_histogram`. There must be exactly one `date_histogram`
aggregation. For more information, see
{ref}/search-aggregations-bucket-datehistogram-aggregation.html[Date histogram aggregation].

NOTE: The `time_zone` parameter in the date histogram aggregation must be set to
`UTC`, which is the default value.

Each histogram bucket has a key, which is the bucket start time. This key cannot
be used for aggregations in {dfeeds}, however, because they need to know the
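One possible shape of the first variant is a `terms` bucket aggregation whose single sub-aggregation is the required `date_histogram`. This is only a sketch; the field names `host` and `time` are hypothetical:

[source,js]
----------------------------------
"aggregations": {
  "hosts": {
    "terms": {
      "field": "host"
    },
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "time",
          "fixed_interval": "5m"
        },
        "aggregations": {
          "time": {
            "max": { "field": "time" }
          }
        }
      }
    }
  }
}
----------------------------------
// NOTCONSOLE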
@@ -269,16 +285,9 @@ By default, {es} limits the maximum number of terms returned to 10000. For high
cardinality fields, the query might not run. It might return errors related to
circuit breaking exceptions that indicate that the data is too large. In such
cases, do not use aggregations in your {dfeed}. For more
information, see
{ref}/search-aggregations-bucket-terms-aggregation.html[Terms aggregation].

You can also optionally specify multiple sub-aggregations. The sub-aggregations
are aggregated for the buckets that were created by their parent aggregation.
For more information, see {ref}/search-aggregations.html[Aggregations].
@@ -26,7 +26,12 @@ cluster privileges to use this API. See
[[ml-put-datafeed-desc]]
==== {api-description-title}

{ml-docs}/ml-dfeeds.html[{dfeeds-cap}] retrieve data from {es} for analysis by
an {anomaly-job}. You can associate only one {dfeed} to each {anomaly-job}.

The {dfeed} contains a query that runs at a defined interval (`frequency`). If
you are concerned about delayed data, you can add a delay (`query_delay`) at
each interval. See {ml-docs}/ml-delayed-data-detection.html[Handling delayed data].

[IMPORTANT]
====
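For example, a {dfeed} whose query runs every 150 seconds and lags 90 seconds behind real time might be created as follows. The {dfeed} ID, job ID, and index name are invented for illustration:

[source,js]
----------------------------------
PUT _ml/datafeeds/datafeed-example
{
  "job_id": "example-job",
  "indices": ["server-metrics"],
  "query": {
    "match_all": {}
  },
  "frequency": "150s",
  "query_delay": "90s"
}
----------------------------------
// NOTCONSOLE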
@@ -64,11 +69,6 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=delayed-data-check-config]
`frequency`::
(Optional, <<time-units, time units>>)
include::{docdir}/ml/ml-shared.asciidoc[tag=frequency]

`indices`::
(Required, array)
@@ -89,11 +89,6 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=query]
`query_delay`::
(Optional, <<time-units, time units>>)
include::{docdir}/ml/ml-shared.asciidoc[tag=query-delay]

`script_fields`::
(Optional, object)
@@ -103,20 +98,6 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=script-fields]
(Optional, unsigned integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=scroll-size]

[[ml-put-datafeed-example]]
==== {api-examples-title}
@@ -49,11 +49,6 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=analysis-config]
`analysis_config`.`bucket_span`:::
(<<time-units,time units>>)
include::{docdir}/ml/ml-shared.asciidoc[tag=bucket-span]

`analysis_config`.`categorization_field_name`:::
(string)
@@ -1,6 +1,6 @@
tag::aggregations[]
If set, the {dfeed} performs aggregation searches. Support for aggregations is
limited and should be used only with low cardinality data. For more information,
see
{ml-docs}/ml-configuring-aggregation.html[Aggregating data for faster performance].
end::aggregations[]
@@ -148,8 +148,10 @@ end::background-persist-interval[]

tag::bucket-span[]
The size of the interval that the analysis is aggregated into, typically between
`5m` and `1h`. The default value is `5m`. If the {anomaly-job} uses a {dfeed}
with {ml-docs}/ml-configuring-aggregation.html[aggregations], this value must be
divisible by the interval of the date histogram aggregation. For more
information, see {ml-docs}/ml-buckets.html[Buckets].
end::bucket-span[]

tag::bucket-span-results[]
@@ -603,7 +605,10 @@ tag::frequency[]
The interval at which scheduled queries are made while the {dfeed} runs in real
time. The default value is either the bucket span for short bucket spans, or,
for longer bucket spans, a sensible fraction of the bucket span. For example:
`150s`. When `frequency` is shorter than the bucket span, interim results for
the last (partial) bucket are written then eventually overwritten by the full
bucket results. If the {dfeed} uses aggregations, this value must be divisible
by the interval of the date histogram aggregation.
end::frequency[]

tag::from[]
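As a worked example of that divisibility requirement, with a date histogram `fixed_interval` of `90s`, a `frequency` of `450s` is valid because 450 / 90 = 5 with no remainder, whereas `100s` is not because 100 / 90 leaves a remainder. The values and the `time` field below are hypothetical:

[source,js]
----------------------------------
"frequency": "450s",
"aggregations": {
  "buckets": {
    "date_histogram": {
      "field": "time",
      "fixed_interval": "90s"
    }
  }
}
----------------------------------
// NOTCONSOLE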
@@ -939,7 +944,8 @@ The number of seconds behind real time that data is queried. For example, if
data from 10:04 a.m. might not be searchable in {es} until 10:06 a.m., set this
property to 120 seconds. The default value is randomly selected between `60s`
and `120s`. This randomness improves the query performance when there are
multiple jobs running on the same node. For more information, see
{ml-docs}/ml-delayed-data-detection.html[Handling delayed data].
end::query-delay[]

tag::randomize-seed[]