322 lines
10 KiB
Plaintext
322 lines
10 KiB
Plaintext
[role="xpack"]
|
|
[[ml-configuring-aggregation]]
|
|
= Aggregating data for faster performance
|
|
|
|
By default, {dfeeds} fetch data from {es} using search and scroll requests.
|
|
It can be significantly more efficient, however, to aggregate data in {es}
|
|
and to configure your {anomaly-jobs} to analyze aggregated data.
|
|
|
|
One of the benefits of aggregating data this way is that {es} automatically
|
|
distributes these calculations across your cluster. You can then feed this
|
|
aggregated data into the {ml-features} instead of raw results, which
|
|
reduces the volume of data that must be considered while detecting anomalies.
|
|
|
|
TIP: If you use a terms aggregation and the cardinality of a term is high, the
|
|
aggregation might not be effective and you might want to just use the default
|
|
search and scroll behavior.
|
|
|
|
[discrete]
|
|
[[aggs-limits-dfeeds]]
|
|
== Requirements and limitations
|
|
|
|
There are some limitations to using aggregations in {dfeeds}. Your aggregation
|
|
must include a `date_histogram` aggregation, which in turn must contain a `max`
|
|
aggregation on the time field. This requirement ensures that the aggregated data
|
|
is a time series and the timestamp of each bucket is the time of the last record
|
|
in the bucket.
|
|
|
|
IMPORTANT: The name of the aggregation and the name of the field that the agg
|
|
operates on need to match, otherwise the aggregation doesn't work. For example,
|
|
if you use a `max` aggregation on a time field called `responsetime`, the name
|
|
of the aggregation must be also `responsetime`.
|
|
|
|
You must also consider the interval of the date histogram aggregation carefully.
|
|
The bucket span of your {anomaly-job} must be divisible by the value of the
|
|
`calendar_interval` or `fixed_interval` in your aggregation (with no remainder).
|
|
If you specify a `frequency` for your {dfeed}, it must also be divisible by this
|
|
interval. {anomaly-jobs-cap} cannot use date histograms with an interval
|
|
measured in months because the length of the month is not fixed. {dfeeds-cap}
|
|
tolerate weeks or smaller units.
|
|
|
|
TIP: As a rule of thumb, if your detectors use <<ml-metric-functions,metric>> or
|
|
<<ml-sum-functions,sum>> analytical functions, set the date histogram
|
|
aggregation interval to a tenth of the bucket span. This suggestion creates
|
|
finer, more granular time buckets, which are ideal for this type of analysis. If
|
|
your detectors use <<ml-count-functions,count>> or <<ml-rare-functions,rare>>
|
|
functions, set the interval to the same value as the bucket span.
|
|
|
|
|
|
[discrete]
|
|
[[aggs-include-jobs]]
|
|
== Including aggregations in {anomaly-jobs}
|
|
|
|
When you create or update an {anomaly-job}, you can include the names of
|
|
aggregations, for example:
|
|
|
|
[source,console]
|
|
----------------------------------
|
|
PUT _ml/anomaly_detectors/farequote
|
|
{
|
|
"analysis_config": {
|
|
"bucket_span": "60m",
|
|
"detectors": [{
|
|
"function": "mean",
|
|
"field_name": "responsetime", <1>
|
|
"by_field_name": "airline" <1>
|
|
}],
|
|
"summary_count_field_name": "doc_count"
|
|
},
|
|
"data_description": {
|
|
"time_field":"time" <1>
|
|
}
|
|
}
|
|
----------------------------------
|
|
// TEST[skip:setup:farequote_data]
|
|
|
|
<1> The `airline`, `responsetime`, and `time` fields are aggregations. Only the
|
|
aggregated fields defined in the `analysis_config` object are analyzed by the
|
|
{anomaly-job}.
|
|
|
|
NOTE: When the `summary_count_field_name` property is set to a non-null value,
|
|
the job expects to receive aggregated input. The property must be set to the
|
|
name of the field that contains the count of raw data points that have been
|
|
aggregated. It applies to all detectors in the job.
|
|
|
|
The aggregations are defined in the {dfeed} as follows:
|
|
|
|
[source,console]
|
|
----------------------------------
|
|
PUT _ml/datafeeds/datafeed-farequote
|
|
{
|
|
"job_id":"farequote",
|
|
"indices": ["farequote"],
|
|
"aggregations": {
|
|
"buckets": {
|
|
"date_histogram": {
|
|
"field": "time",
|
|
"fixed_interval": "360s",
|
|
"time_zone": "UTC"
|
|
},
|
|
"aggregations": {
|
|
"time": { <1>
|
|
"max": {"field": "time"}
|
|
},
|
|
"airline": { <2>
|
|
"terms": {
|
|
"field": "airline",
|
|
"size": 100
|
|
},
|
|
"aggregations": {
|
|
"responsetime": { <3>
|
|
"avg": {
|
|
"field": "responsetime"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
----------------------------------
|
|
// TEST[skip:setup:farequote_job]
|
|
|
|
<1> The aggregations have names that match the fields that they operate on. The
|
|
`max` aggregation is named `time` and its field also needs to be `time`.
|
|
<2> The `term` aggregation is named `airline` and its field is also named
|
|
`airline`.
|
|
<3> The `avg` aggregation is named `responsetime` and its field is also named
|
|
`responsetime`.
|
|
|
|
Your {dfeed} can contain multiple aggregations, but only the ones with names
|
|
that match values in the job configuration are fed to the job.
|
|
|
|
|
|
[discrete]
|
|
[[aggs-dfeeds]]
|
|
== Nested aggregations in {dfeeds}
|
|
|
|
{dfeeds-cap} support complex nested aggregations. This example uses the
|
|
`derivative` pipeline aggregation to find the first order derivative of the
|
|
counter `system.network.out.bytes` for each value of the field `beat.name`.
|
|
|
|
[source,js]
|
|
----------------------------------
|
|
"aggregations": {
|
|
"beat.name": {
|
|
"terms": {
|
|
"field": "beat.name"
|
|
},
|
|
"aggregations": {
|
|
"buckets": {
|
|
"date_histogram": {
|
|
"field": "@timestamp",
|
|
"fixed_interval": "5m"
|
|
},
|
|
"aggregations": {
|
|
"@timestamp": {
|
|
"max": {
|
|
"field": "@timestamp"
|
|
}
|
|
},
|
|
"bytes_out_average": {
|
|
"avg": {
|
|
"field": "system.network.out.bytes"
|
|
}
|
|
},
|
|
"bytes_out_derivative": {
|
|
"derivative": {
|
|
"buckets_path": "bytes_out_average"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
----------------------------------
|
|
// NOTCONSOLE
|
|
|
|
|
|
[discrete]
|
|
[[aggs-single-dfeeds]]
|
|
== Single bucket aggregations in {dfeeds}
|
|
|
|
{dfeeds-cap} not only supports multi-bucket aggregations, but also single bucket
|
|
aggregations. The following shows two `filter` aggregations, each gathering the
|
|
number of unique entries for the `error` field.
|
|
|
|
[source,js]
|
|
----------------------------------
|
|
{
|
|
"job_id":"servers-unique-errors",
|
|
"indices": ["logs-*"],
|
|
"aggregations": {
|
|
"buckets": {
|
|
"date_histogram": {
|
|
"field": "time",
|
|
"interval": "360s",
|
|
"time_zone": "UTC"
|
|
},
|
|
"aggregations": {
|
|
"time": {
|
|
"max": {"field": "time"}
|
|
}
|
|
"server1": {
|
|
"filter": {"term": {"source": "server-name-1"}},
|
|
"aggregations": {
|
|
"server1_error_count": {
|
|
"value_count": {
|
|
"field": "error"
|
|
}
|
|
}
|
|
}
|
|
},
|
|
"server2": {
|
|
"filter": {"term": {"source": "server-name-2"}},
|
|
"aggregations": {
|
|
"server2_error_count": {
|
|
"value_count": {
|
|
"field": "error"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
----------------------------------
|
|
// NOTCONSOLE
|
|
|
|
|
|
[discrete]
|
|
[[aggs-define-dfeeds]]
|
|
== Defining aggregations in {dfeeds}
|
|
|
|
When you define an aggregation in a {dfeed}, it must have the following form:
|
|
|
|
[source,js]
|
|
----------------------------------
|
|
"aggregations": {
|
|
["bucketing_aggregation": {
|
|
"bucket_agg": {
|
|
...
|
|
},
|
|
"aggregations": {]
|
|
"data_histogram_aggregation": {
|
|
"date_histogram": {
|
|
"field": "time",
|
|
},
|
|
"aggregations": {
|
|
"timestamp": {
|
|
"max": {
|
|
"field": "time"
|
|
}
|
|
},
|
|
[,"<first_term>": {
|
|
"terms":{...
|
|
}
|
|
[,"aggregations" : {
|
|
[<sub_aggregation>]+
|
|
} ]
|
|
}]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
----------------------------------
|
|
// NOTCONSOLE
|
|
|
|
The top level aggregation must be either a
|
|
{ref}/search-aggregations-bucket.html[bucket aggregation] containing as single
|
|
sub-aggregation that is a `date_histogram` or the top level aggregation is the
|
|
required `date_histogram`. There must be exactly one `date_histogram`
|
|
aggregation. For more information, see
|
|
{ref}/search-aggregations-bucket-datehistogram-aggregation.html[Date histogram aggregation].
|
|
|
|
NOTE: The `time_zone` parameter in the date histogram aggregation must be set to
|
|
`UTC`, which is the default value.
|
|
|
|
Each histogram bucket has a key, which is the bucket start time. This key cannot
|
|
be used for aggregations in {dfeeds}, however, because they need to know the
|
|
time of the latest record within a bucket. Otherwise, when you restart a
|
|
{dfeed}, it continues from the start time of the histogram bucket and possibly
|
|
fetches the same data twice. The max aggregation for the time field is therefore
|
|
necessary to provide the time of the latest record within a bucket.
|
|
|
|
You can optionally specify a terms aggregation, which creates buckets for
|
|
different values of a field.
|
|
|
|
IMPORTANT: If you use a terms aggregation, by default it returns buckets for
|
|
the top ten terms. Thus if the cardinality of the term is greater than 10, not
|
|
all terms are analyzed.
|
|
|
|
You can change this behavior by setting the `size` parameter. To
|
|
determine the cardinality of your data, you can run searches such as:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
GET .../_search {
|
|
"aggs": {
|
|
"service_cardinality": {
|
|
"cardinality": {
|
|
"field": "service"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
By default, {es} limits the maximum number of terms returned to 10000. For high
|
|
cardinality fields, the query might not run. It might return errors related to
|
|
circuit breaking exceptions that indicate that the data is too large. In such
|
|
cases, do not use aggregations in your {dfeed}. For more information, see
|
|
{ref}/search-aggregations-bucket-terms-aggregation.html[Terms aggregation].
|
|
|
|
You can also optionally specify multiple sub-aggregations. The sub-aggregations
|
|
are aggregated for the buckets that were created by their parent aggregation.
|
|
For more information, see {ref}/search-aggregations.html[Aggregations].
|