[[ml-configuring-aggregation]]
=== Aggregating Data For Faster Performance

By default, {dfeeds} fetch data from {es} using search and scroll requests.
It can be significantly more efficient, however, to aggregate data in {es}
and to configure your jobs to analyze aggregated data.

One of the benefits of aggregating data this way is that {es} automatically
distributes these calculations across your cluster. You can then feed this
aggregated data into {xpackml} instead of raw results, which
reduces the volume of data that must be considered while detecting anomalies.
//TBD: Are "aggregated" and "summarized" equivalent terms? Are customers more
//familiar with one or the other? If so, I'll use one term throughout.

There are some limitations to using aggregations in {dfeeds}, however.
Your aggregation must include a buckets aggregation, which in turn must contain
a date histogram aggregation. This requirement ensures that the aggregated
data is a time series. If you use a terms aggregation and the cardinality of
the field is high, the aggregation might not be effective. In that case,
consider using the default search and scroll behavior instead.

When you create or update a job, you can include the names of aggregations, for
example:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/farequote
{
  "analysis_config": {
    "bucket_span": "60m",
    "detectors": [{
      "function":"mean",
      "field_name":"responsetime",
      "by_field_name":"airline"
    }],
    "summary_count_field_name": "doc_count"
  },
  "data_description": {
    "time_field":"time"
  }
}
----------------------------------

In this example, `airline`, `responsetime`, and `time` are the names of
aggregations rather than the names of raw fields.

NOTE: When the `summary_count_field_name` property is set to a non-null value,
the job expects to receive aggregated input. The property must be set to the
name of the field that contains the count of raw data points that have been
aggregated. It applies to all detectors in the job.

The aggregations are defined in the {dfeed} as follows:

[source,js]
----------------------------------
PUT _xpack/ml/datafeeds/datafeed-farequote
{
  "job_id":"farequote",
  "indices": ["farequote"],
  "types": ["response"],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "time",
        "interval": "360s",
        "time_zone": "UTC"
      },
      "aggregations": {
        "time": {
          "max": {"field": "time"}
        },
        "airline": {
          "terms": {
            "field": "airline",
            "size": 100
          },
          "aggregations": {
            "responsetime": {
              "avg": {
                "field": "responsetime"
              }
            }
          }
        }
      }
    }
  }
}
----------------------------------

In this example, the aggregations have names that match the fields that they
operate on. That is to say, the `max` aggregation is named `time` and its
field is also `time`. The same is true for the aggregations with the names
`airline` and `responsetime`. Because you must create the job before you can
create the {dfeed}, keeping your aggregation names and field names identical
can simplify these configuration steps.
//TBD: Describe how this would be accomplished in Kibana?

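One way to keep the aggregation names and field names identical is to generate
the `aggregations` object from the same field names that the job uses. The
following Python sketch is illustrative only: the `build_datafeed_aggs` helper
and its parameters are hypothetical and are not part of any {es} client. It
assumes one optional terms aggregation with one `avg` sub-aggregation, as in
the `farequote` example.

```python
def build_datafeed_aggs(time_field, interval, by_field=None,
                        metric_field=None, size=100):
    """Build a datafeed "aggregations" object (hypothetical helper).

    Every aggregation is named after the field it operates on.
    """
    # The max aggregation on the time field shares the field's name.
    inner = {time_field: {"max": {"field": time_field}}}
    if by_field is not None:
        terms = {"terms": {"field": by_field, "size": size}}
        if metric_field is not None:
            terms["aggregations"] = {
                metric_field: {"avg": {"field": metric_field}}
            }
        inner[by_field] = terms
    return {
        "buckets": {
            "date_histogram": {
                "field": time_field,
                "interval": interval,
                "time_zone": "UTC",
            },
            "aggregations": inner,
        }
    }


aggs = build_datafeed_aggs("time", "360s",
                           by_field="airline", metric_field="responsetime")
```

The resulting dictionary has the same shape as the `aggregations` object in the
`datafeed-farequote` request and can be serialized with `json.dumps`.
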
When you define an aggregation in a {dfeed}, it must have the following form:

[source,js]
----------------------------------
"aggregations" : {
  "buckets" : {
    "date_histogram" : {
      "time_zone": "UTC", ...
    },
    "aggregations": {
      "<time_field>": {
        "max": {
          "field":"<time_field>"
        }
      }
      [,"<first_term>": {
        "terms":{...
        }
        [,"aggregations" : {
          [<sub_aggregation>]+
        } ]
      }]
    }
  }
}
----------------------------------

You must specify `buckets` as the aggregation name and `date_histogram` as the
aggregation type. For more information, see
{ref}/search-aggregations-bucket-datehistogram-aggregation.html[Date Histogram Aggregation].

NOTE: The `time_zone` parameter in the date histogram aggregation must be set to
`UTC`, which is the default value.

Each histogram bucket has a key, which is the bucket start time. This key cannot
be used for aggregations in {dfeeds}, however, because {dfeeds} need to know the
time of the latest record within a bucket. Otherwise, when you restart a
{dfeed}, it continues from the start time of the histogram bucket and possibly
fetches the same data twice. The `max` aggregation for the time field is
therefore necessary to provide the time of the latest record within a bucket.

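The structural rules above can be checked before a {dfeed} is created. The
following Python sketch is a minimal illustration, not an official validator:
the `check_datafeed_aggs` function is hypothetical, it inspects only the
`aggregations` key (not the `aggs` shorthand), and it checks only the rules
described in this section.

```python
def check_datafeed_aggs(aggs, time_field):
    """Return a list of problems; an empty list means no problem was found."""
    buckets = aggs.get("buckets")
    if buckets is None:
        return ["the top-level aggregation must be named 'buckets'"]

    problems = []
    histogram = buckets.get("date_histogram")
    if histogram is None:
        problems.append("'buckets' must be a date_histogram aggregation")
    elif histogram.get("time_zone", "UTC") != "UTC":
        # UTC is the default, so an absent time_zone is accepted.
        problems.append("the date_histogram time_zone must be UTC")

    # The max aggregation must be named after the time field and operate on it.
    sub_aggs = buckets.get("aggregations", {})
    max_agg = sub_aggs.get(time_field, {}).get("max", {})
    if max_agg.get("field") != time_field:
        problems.append("a max aggregation on the time field is required")
    return problems


example = {
    "buckets": {
        "date_histogram": {"field": "time", "interval": "360s",
                           "time_zone": "UTC"},
        "aggregations": {"time": {"max": {"field": "time"}}},
    }
}
print(check_datafeed_aggs(example, "time"))  # prints []
```
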
You can optionally specify a terms aggregation, which creates buckets for
different values of a field.

IMPORTANT: If you use a terms aggregation, by default it returns buckets for
the top ten terms. Thus, if the cardinality of the term is greater than 10, not
all terms are analyzed.

You can change this behavior by setting the `size` parameter. To
determine the cardinality of your data, you can run searches such as:

[source,js]
--------------------------------------------------
GET .../_search {
  "aggs": {
    "service_cardinality": {
      "cardinality": {
        "field": "service"
      }
    }
  }
}
--------------------------------------------------

By default, {es} limits the maximum number of terms returned to 10000. For high
cardinality fields, the query might not run. It might return errors related to
circuit breaking exceptions that indicate that the data is too large. In such
cases, do not use aggregations in your {dfeed}. For more
information, see {ref}/search-aggregations-bucket-terms-aggregation.html[Terms Aggregation].

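The decision described above can be sketched in Python. This is an illustrative
helper under stated assumptions, not part of any {es} client: the
`terms_size_for` function is hypothetical, it expects the parsed JSON response
of a cardinality search like the one shown earlier, and it treats the 10000
limit quoted above as a fixed threshold. Keep in mind that the cardinality
aggregation returns an approximate count.

```python
DEFAULT_TERMS_LIMIT = 10000  # the limit quoted in the text above


def terms_size_for(response, agg_name, limit=DEFAULT_TERMS_LIMIT):
    """Pick a `size` for the terms aggregation (hypothetical helper).

    Returns None when the cardinality exceeds the limit, which signals that
    the datafeed should use search and scroll instead of aggregations.
    """
    cardinality = response["aggregations"][agg_name]["value"]
    if cardinality > limit:
        return None
    # A terms aggregation needs a size of at least 1.
    return max(cardinality, 1)


# Assumed shape of a parsed cardinality response; the value is made up.
sample_response = {"aggregations": {"service_cardinality": {"value": 25}}}
print(terms_size_for(sample_response, "service_cardinality"))  # prints 25
```
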
You can also optionally specify multiple sub-aggregations.
The sub-aggregations are aggregated for the buckets that were created by their
parent aggregation. For more information, see
{ref}/search-aggregations.html[Aggregations].

TIP: If your detectors use metric or sum analytical functions, set the
`interval` of the date histogram aggregation to a tenth of the `bucket_span`
that was defined in the job. This setting creates finer, more granular time
buckets, which are ideal for this type of analysis. If your detectors use count
or rare functions, set `interval` to the same value as `bucket_span`. For more
information about analytical functions, see <<ml-functions>>.

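As a worked example of this tip, a `bucket_span` of `60m` is 3600 seconds, so a
detector that uses a metric function such as `mean` gets an `interval` of
`360s`, which is the value used in the `datafeed-farequote` example. The
following Python sketch encodes the same rule. The `suggest_interval` helper
and its `function_kind` labels are hypothetical, and it accepts only simple
time values with an `s`, `m`, `h`, or `d` suffix.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(span):
    """Convert a time value such as "60m" or "360s" to whole seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", span)
    if match is None:
        raise ValueError("unsupported time value: %r" % span)
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def suggest_interval(bucket_span, function_kind):
    """Suggest a date histogram interval for a job's bucket span."""
    seconds = to_seconds(bucket_span)
    if function_kind in ("metric", "sum"):
        # A tenth of the bucket span, rounded down, but never below 1 second.
        seconds = max(seconds // 10, 1)
    elif function_kind not in ("count", "rare"):
        raise ValueError("unknown function kind: %r" % function_kind)
    return "%ds" % seconds


print(suggest_interval("60m", "metric"))  # prints 360s
```
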
//TBD: Add more examples from https://github.com/elastic/prelert-legacy/wiki/Configuring-aggregations-on-a-datafeed