[DOCS] Clarify interval, frequency, and bucket span in ML APIs and example (#51280)
This commit is contained in:
parent
08e9c673e5
commit
4590d4156a
@@ -11,13 +11,28 @@ distributes these calculations across your cluster. You can then feed this
aggregated data into the {ml-features} instead of raw results, which
reduces the volume of data that must be considered while detecting anomalies.

There are some limitations to using aggregations in {dfeeds}, however.
Your aggregation must include a `date_histogram` aggregation, which in turn must
contain a `max` aggregation on the time field. This requirement ensures that the
aggregated data is a time series and the timestamp of each bucket is the time
of the last record in the bucket. If you use a terms aggregation and the
cardinality of a term is high, then the aggregation might not be effective and
you might want to just use the default search and scroll behavior.
TIP: If you use a terms aggregation and the cardinality of a term is high, the
aggregation might not be effective and you might want to just use the default
search and scroll behavior.

There are some limitations to using aggregations in {dfeeds}. Your aggregation
must include a `date_histogram` aggregation, which in turn must contain a `max`
aggregation on the time field. This requirement ensures that the aggregated data
is a time series and the timestamp of each bucket is the time of the last record
in the bucket.

You must also consider the interval of the date histogram aggregation carefully.
The bucket span of your {anomaly-job} must be divisible by the value of the
`calendar_interval` or `fixed_interval` in your aggregation (with no remainder).
If you specify a `frequency` for your {dfeed}, it must also be divisible by this
interval.
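As an illustration of these divisibility rules, suppose the {anomaly-job} uses a
`bucket_span` of `10m` (600 seconds) and the {dfeed} specifies a `frequency` of
`150s`. A date histogram with a `fixed_interval` of `30s` satisfies both rules,
because 600 and 150 are each divisible by 30. The aggregation and field names in
this sketch are hypothetical:

[source,js]
----------------------------------
"aggregations": {
  "buckets": {
    "date_histogram": {
      "field": "time",
      "fixed_interval": "30s"
    },
    "aggregations": {
      "time": {
        "max": { "field": "time" }
      }
    }
  }
}
----------------------------------
// NOTCONSOLE

By contrast, a `fixed_interval` of `90s` would not be valid for this job,
because 600 is not divisible by 90.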

TIP: As a rule of thumb, if your detectors use <<ml-metric-functions,metric>> or
<<ml-sum-functions,sum>> analytical functions, set the date histogram
aggregation interval to a tenth of the bucket span. This suggestion creates
finer, more granular time buckets, which are ideal for this type of analysis. If
your detectors use <<ml-count-functions,count>> or <<ml-rare-functions,rare>>
functions, set the interval to the same value as the bucket span.

When you create or update an {anomaly-job}, you can include the names of
aggregations, for example:
@@ -143,9 +158,9 @@ pipeline aggregation to find the first order derivative of the counter
----------------------------------
// NOTCONSOLE

{dfeeds-cap} not only supports multi-bucket aggregations, but also single bucket aggregations.
The following shows two `filter` aggregations, each gathering the number of unique entries for
the `error` field.
{dfeeds-cap} not only supports multi-bucket aggregations, but also single bucket
aggregations. The following shows two `filter` aggregations, each gathering the
number of unique entries for the `error` field.

[source,js]
----------------------------------
@@ -225,14 +240,15 @@ When you define an aggregation in a {dfeed}, it must have the following form:
----------------------------------
// NOTCONSOLE

The top level aggregation must be either a {ref}/search-aggregations-bucket.html[Bucket Aggregation]
containing a single sub-aggregation that is a `date_histogram` or the top level aggregation
is the required `date_histogram`. There must be exactly one `date_histogram` aggregation.
The top level aggregation must be either a
{ref}/search-aggregations-bucket.html[bucket aggregation] containing a single
sub-aggregation that is a `date_histogram` or the top level aggregation is the
required `date_histogram`. There must be exactly one `date_histogram` aggregation.
For more information, see
{ref}/search-aggregations-bucket-datehistogram-aggregation.html[Date Histogram Aggregation].
{ref}/search-aggregations-bucket-datehistogram-aggregation.html[Date histogram aggregation].
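For example, the second permitted shape described above, a top-level bucket
aggregation that wraps the single required `date_histogram`, might look like the
following sketch. The `airlines` terms aggregation and the field names are
hypothetical:

[source,js]
----------------------------------
"aggregations": {
  "airlines": {
    "terms": { "field": "airline" },
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "time",
          "fixed_interval": "60s"
        },
        "aggregations": {
          "time": { "max": { "field": "time" } }
        }
      }
    }
  }
}
----------------------------------
// NOTCONSOLE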

NOTE: The `time_zone` parameter in the date histogram aggregation must be set to `UTC`,
which is the default value.
NOTE: The `time_zone` parameter in the date histogram aggregation must be set to
`UTC`, which is the default value.

Each histogram bucket has a key, which is the bucket start time. This key cannot
be used for aggregations in {dfeeds}, however, because they need to know the

@@ -269,16 +285,9 @@ By default, {es} limits the maximum number of terms returned to 10000. For high
cardinality fields, the query might not run. It might return errors related to
circuit breaking exceptions that indicate that the data is too large. In such
cases, do not use aggregations in your {dfeed}. For more
information, see {ref}/search-aggregations-bucket-terms-aggregation.html[Terms Aggregation].
information, see
{ref}/search-aggregations-bucket-terms-aggregation.html[Terms aggregation].

You can also optionally specify multiple sub-aggregations.
The sub-aggregations are aggregated for the buckets that were created by their
parent aggregation. For more information, see
{ref}/search-aggregations.html[Aggregations].

TIP: If your detectors use metric or sum analytical functions, set the
`interval` of the date histogram aggregation to a tenth of the `bucket_span`
that was defined in the job. This suggestion creates finer, more granular time
buckets, which are ideal for this type of analysis. If your detectors use count
or rare functions, set `interval` to the same value as `bucket_span`. For more
information about analytical functions, see <<ml-functions>>.
You can also optionally specify multiple sub-aggregations. The sub-aggregations
are aggregated for the buckets that were created by their parent aggregation.
For more information, see {ref}/search-aggregations.html[Aggregations].


@@ -26,7 +26,12 @@ cluster privileges to use this API. See
[[ml-put-datafeed-desc]]
==== {api-description-title}

You can associate only one {dfeed} to each {anomaly-job}.
{ml-docs}/ml-dfeeds.html[{dfeeds-cap}] retrieve data from {es} for analysis by
an {anomaly-job}. You can associate only one {dfeed} to each {anomaly-job}.

The {dfeed} contains a query that runs at a defined interval (`frequency`). If
you are concerned about delayed data, you can add a delay (`query_delay`) at
each interval. See {ml-docs}/ml-delayed-data-detection.html[Handling delayed data].
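For instance, a minimal request that sets both of these time-related parameters
explicitly might look like the following sketch. The datafeed ID, job ID, and
index name are hypothetical:

[source,js]
----------------------------------
PUT _ml/datafeeds/datafeed-test-job
{
  "job_id": "test-job",
  "indices": ["server-metrics"],
  "frequency": "150s",
  "query_delay": "90s"
}
----------------------------------
// NOTCONSOLE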

[IMPORTANT]
====

@@ -64,11 +69,6 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=delayed-data-check-config]
`frequency`::
(Optional, <<time-units, time units>>)
include::{docdir}/ml/ml-shared.asciidoc[tag=frequency]
+
--
To learn more about the relationship between time-related settings, see
<<ml-put-datafeed-time-related-settings>>.
--

`indices`::
(Required, array)

@@ -89,11 +89,6 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=query]
`query_delay`::
(Optional, <<time-units, time units>>)
include::{docdir}/ml/ml-shared.asciidoc[tag=query-delay]
+
--
To learn more about the relationship between time-related settings, see
<<ml-put-datafeed-time-related-settings>>.
--

`script_fields`::
(Optional, object)

@@ -103,20 +98,6 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=script-fields]
(Optional, unsigned integer)
include::{docdir}/ml/ml-shared.asciidoc[tag=scroll-size]


[[ml-put-datafeed-time-related-settings]]
===== Interaction between time-related settings

Time-related settings have the following relationships:

* Queries run at `query_delay` after the end of
each `frequency`.

* When `frequency` is shorter than the `bucket_span` of the associated job, interim
results for the last (partial) bucket are written, and then overwritten by the
full bucket results eventually.
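A hypothetical timeline makes these relationships concrete. Assume `bucket_span`
is `1h`, `frequency` is `10m`, and `query_delay` is `60s`; the times below are
illustrative only:

----
10:11:00  query runs (60s after the 10:00-10:10 interval ends)
          -> interim results written for the partial 10:00-11:00 bucket
10:21:00  query runs -> interim results updated
...
11:01:00  query runs -> full results for the 10:00-11:00 bucket
          overwrite the interim results
----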

[[ml-put-datafeed-example]]
==== {api-examples-title}


@@ -49,11 +49,6 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=analysis-config]
`analysis_config`.`bucket_span`:::
(<<time-units,time units>>)
include::{docdir}/ml/ml-shared.asciidoc[tag=bucket-span]
+
--
To learn more about the relationship between time-related settings, see
<<ml-put-datafeed-time-related-settings>>.
--

`analysis_config`.`categorization_field_name`:::
(string)
@@ -1,6 +1,6 @@
tag::aggregations[]
If set, the {dfeed} performs aggregation searches. Support for aggregations is
limited and should only be used with low cardinality data. For more information,
limited and should be used only with low cardinality data. For more information,
see
{ml-docs}/ml-configuring-aggregation.html[Aggregating data for faster performance].
end::aggregations[]

@@ -148,8 +148,10 @@ end::background-persist-interval[]

tag::bucket-span[]
The size of the interval that the analysis is aggregated into, typically between
`5m` and `1h`. The default value is `5m`. For more information about time units,
see <<time-units>>.
`5m` and `1h`. The default value is `5m`. If the {anomaly-job} uses a {dfeed}
with {ml-docs}/ml-configuring-aggregation.html[aggregations], this value must be
divisible by the interval of the date histogram aggregation. For more
information, see {ml-docs}/ml-buckets.html[Buckets].
end::bucket-span[]

tag::bucket-span-results[]

@@ -603,7 +605,10 @@ tag::frequency[]
The interval at which scheduled queries are made while the {dfeed} runs in real
time. The default value is either the bucket span for short bucket spans, or,
for longer bucket spans, a sensible fraction of the bucket span. For example:
`150s`.
`150s`. When `frequency` is shorter than the bucket span, interim results for
the last (partial) bucket are written then eventually overwritten by the full
bucket results. If the {dfeed} uses aggregations, this value must be divisible
by the interval of the date histogram aggregation.
end::frequency[]

tag::from[]

@@ -939,7 +944,8 @@ The number of seconds behind real time that data is queried. For example, if
data from 10:04 a.m. might not be searchable in {es} until 10:06 a.m., set this
property to 120 seconds. The default value is randomly selected between `60s`
and `120s`. This randomness improves the query performance when there are
multiple jobs running on the same node.
multiple jobs running on the same node. For more information, see
{ml-docs}/ml-delayed-data-detection.html[Handling delayed data].
end::query-delay[]

tag::randomize-seed[]