242 lines
9.2 KiB
Plaintext
242 lines
9.2 KiB
Plaintext
[[search-aggregations-pipeline]]
|
|
|
|
== Pipeline Aggregations
|
|
|
|
Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding
|
|
information to the output tree. There are many different types of pipeline aggregation, each computing different information from
|
|
other aggregations, but these types can be broken down into two families:
|
|
|
|
_Parent_::
|
|
A family of pipeline aggregations that is provided with the output of its parent aggregation and is able
|
|
to compute new buckets or new aggregations to add to existing buckets.
|
|
|
|
_Sibling_::
|
|
Pipeline aggregations that are provided with the output of a sibling aggregation and are able to compute a
|
|
new aggregation which will be at the same level as the sibling aggregation.
|
|
|
|
Pipeline aggregations can reference the aggregations they need to perform their computation by using the `buckets_path`
|
|
parameter to indicate the paths to the required metrics. The syntax for defining these paths can be found in the
|
|
<<buckets-path-syntax, `buckets_path` Syntax>> section below.
|
|
|
|
Pipeline aggregations cannot have sub-aggregations but depending on the type it can reference another pipeline in the `buckets_path`
|
|
allowing pipeline aggregations to be chained. For example, you can chain together two derivatives to calculate the second derivative
|
|
(i.e. a derivative of a derivative).
|
|
|
|
NOTE: Because pipeline aggregations only add to the output, when chaining pipeline aggregations the output of each pipeline aggregation
|
|
will be included in the final output.
|
|
|
|
[[buckets-path-syntax]]
|
|
[float]
|
|
=== `buckets_path` Syntax
|
|
|
|
Most pipeline aggregations require another aggregation as their input. The input aggregation is defined via the `buckets_path`
|
|
parameter, which follows a specific format:
|
|
|
|
// https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form
|
|
[source,ebnf]
|
|
--------------------------------------------------
|
|
AGG_SEPARATOR = '>' ;
|
|
METRIC_SEPARATOR = '.' ;
|
|
AGG_NAME = <the name of the aggregation> ;
|
|
METRIC = <the name of the metric (in case of multi-value metrics aggregation)> ;
|
|
PATH = <AGG_NAME> [ <AGG_SEPARATOR>, <AGG_NAME> ]* [ <METRIC_SEPARATOR>, <METRIC> ] ;
|
|
--------------------------------------------------
|
|
|
|
For example, the path `"my_bucket>my_stats.avg"` will path to the `avg` value in the `"my_stats"` metric, which is
|
|
contained in the `"my_bucket"` bucket aggregation.
|
|
|
|
Paths are relative from the position of the pipeline aggregation; they are not absolute paths, and the path cannot go back "up" the
|
|
aggregation tree. For example, this moving average is embedded inside a date_histogram and refers to a "sibling"
|
|
metric `"the_sum"`:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"aggs": {
|
|
"my_date_histo":{
|
|
"date_histogram":{
|
|
"field":"timestamp",
|
|
"calendar_interval":"day"
|
|
},
|
|
"aggs":{
|
|
"the_sum":{
|
|
"sum":{ "field": "lemmings" } <1>
|
|
},
|
|
"the_movavg":{
|
|
"moving_avg":{ "buckets_path": "the_sum" } <2>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[warning:The moving_avg aggregation has been deprecated in favor of the moving_fn aggregation.]
|
|
<1> The metric is called `"the_sum"`
|
|
<2> The `buckets_path` refers to the metric via a relative path `"the_sum"`
|
|
|
|
`buckets_path` is also used for Sibling pipeline aggregations, where the aggregation is "next" to a series of buckets
|
|
instead of embedded "inside" them. For example, the `max_bucket` aggregation uses the `buckets_path` to specify
|
|
a metric embedded inside a sibling aggregation:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"aggs" : {
|
|
"sales_per_month" : {
|
|
"date_histogram" : {
|
|
"field" : "date",
|
|
"calendar_interval" : "month"
|
|
},
|
|
"aggs": {
|
|
"sales": {
|
|
"sum": {
|
|
"field": "price"
|
|
}
|
|
}
|
|
}
|
|
},
|
|
"max_monthly_sales": {
|
|
"max_bucket": {
|
|
"buckets_path": "sales_per_month>sales" <1>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
<1> `buckets_path` instructs this max_bucket aggregation that we want the maximum value of the `sales` aggregation in the
|
|
`sales_per_month` date histogram.
|
|
|
|
[float]
|
|
=== Special Paths
|
|
|
|
Instead of pathing to a metric, `buckets_path` can use a special `"_count"` path. This instructs
|
|
the pipeline aggregation to use the document count as its input. For example, a moving average can be calculated on the document count of each bucket, instead of a specific metric:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"aggs": {
|
|
"my_date_histo": {
|
|
"date_histogram": {
|
|
"field":"timestamp",
|
|
"calendar_interval":"day"
|
|
},
|
|
"aggs": {
|
|
"the_movavg": {
|
|
"moving_avg": { "buckets_path": "_count" } <1>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[warning:The moving_avg aggregation has been deprecated in favor of the moving_fn aggregation.]
|
|
<1> By using `_count` instead of a metric name, we can calculate the moving average of document counts in the histogram
|
|
|
|
The `buckets_path` can also use `"_bucket_count"` and path to a multi-bucket aggregation to use the number of buckets
|
|
returned by that aggregation in the pipeline aggregation instead of a metric. for example a `bucket_selector` can be
|
|
used here to filter out buckets which contain no buckets for an inner terms aggregation:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /sales/_search
|
|
{
|
|
"size": 0,
|
|
"aggs": {
|
|
"histo": {
|
|
"date_histogram": {
|
|
"field": "date",
|
|
"calendar_interval": "day"
|
|
},
|
|
"aggs": {
|
|
"categories": {
|
|
"terms": {
|
|
"field": "category"
|
|
}
|
|
},
|
|
"min_bucket_selector": {
|
|
"bucket_selector": {
|
|
"buckets_path": {
|
|
"count": "categories._bucket_count" <1>
|
|
},
|
|
"script": {
|
|
"source": "params.count != 0"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
<1> By using `_bucket_count` instead of a metric name, we can filter out `histo` buckets where they contain no buckets
|
|
for the `categories` aggregation
|
|
|
|
[[dots-in-agg-names]]
|
|
[float]
|
|
=== Dealing with dots in agg names
|
|
|
|
An alternate syntax is supported to cope with aggregations or metrics which
|
|
have dots in the name, such as the ++99.9++th
|
|
<<search-aggregations-metrics-percentile-aggregation,percentile>>. This metric
|
|
may be referred to as:
|
|
|
|
[source,js]
|
|
---------------
|
|
"buckets_path": "my_percentile[99.9]"
|
|
---------------
|
|
// NOTCONSOLE
|
|
|
|
[[gap-policy]]
|
|
[float]
|
|
=== Dealing with gaps in the data
|
|
|
|
Data in the real world is often noisy and sometimes contains *gaps* -- places where data simply doesn't exist. This can
|
|
occur for a variety of reasons, the most common being:
|
|
|
|
* Documents falling into a bucket do not contain a required field
|
|
* There are no documents matching the query for one or more buckets
|
|
* The metric being calculated is unable to generate a value, likely because another dependent bucket is missing a value.
|
|
Some pipeline aggregations have specific requirements that must be met (e.g. a derivative cannot calculate a metric for the
|
|
first value because there is no previous value, HoltWinters moving average need "warmup" data to begin calculating, etc)
|
|
|
|
Gap policies are a mechanism to inform the pipeline aggregation about the desired behavior when "gappy" or missing
|
|
data is encountered. All pipeline aggregations accept the `gap_policy` parameter. There are currently two gap policies
|
|
to choose from:
|
|
|
|
_skip_::
|
|
This option treats missing data as if the bucket does not exist. It will skip the bucket and continue
|
|
calculating using the next available value.
|
|
|
|
_insert_zeros_::
|
|
This option will replace missing values with a zero (`0`) and pipeline aggregation computation will
|
|
proceed as normal.
|
|
|
|
|
|
|
|
|
|
include::pipeline/avg-bucket-aggregation.asciidoc[]
|
|
include::pipeline/derivative-aggregation.asciidoc[]
|
|
include::pipeline/max-bucket-aggregation.asciidoc[]
|
|
include::pipeline/min-bucket-aggregation.asciidoc[]
|
|
include::pipeline/sum-bucket-aggregation.asciidoc[]
|
|
include::pipeline/stats-bucket-aggregation.asciidoc[]
|
|
include::pipeline/extended-stats-bucket-aggregation.asciidoc[]
|
|
include::pipeline/percentiles-bucket-aggregation.asciidoc[]
|
|
include::pipeline/movavg-aggregation.asciidoc[]
|
|
include::pipeline/movfn-aggregation.asciidoc[]
|
|
include::pipeline/cumulative-sum-aggregation.asciidoc[]
|
|
include::pipeline/bucket-script-aggregation.asciidoc[]
|
|
include::pipeline/bucket-selector-aggregation.asciidoc[]
|
|
include::pipeline/bucket-sort-aggregation.asciidoc[]
|
|
include::pipeline/serial-diff-aggregation.asciidoc[]
|