[[search-aggregations-pipeline]]

== Pipeline Aggregations

coming[2.0.0-beta1]

experimental[]

Pipeline aggregations work on the outputs produced by other aggregations rather than on document sets, adding
information to the output tree. There are many types of pipeline aggregation, each computing different information from
other aggregations, but they fall into two families:

_Parent_::
Pipeline aggregations that are provided with the output of their parent aggregation and can
compute new buckets or new aggregations to add to existing buckets.

_Sibling_::
Pipeline aggregations that are provided with the output of a sibling aggregation and can compute a
new aggregation at the same level as the sibling aggregation.

Pipeline aggregations reference the aggregations they need for their computation through the `buckets_path`
parameter, which gives the paths to the required metrics. The syntax for these paths is described in the
<<buckets-path-syntax, `buckets_path` Syntax>> section below.

Pipeline aggregations cannot have sub-aggregations, but depending on its type a pipeline aggregation can reference another
pipeline aggregation in its `buckets_path`, allowing pipeline aggregations to be chained. For example, you can chain two
derivatives to calculate the second derivative (i.e. a derivative of a derivative).

NOTE: Because pipeline aggregations only add to the output, chaining pipeline aggregations means the output of each
pipeline aggregation in the chain is included in the final output.

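
As a rough illustration of chaining, the following sketch builds a request body in which one `derivative` names another
in its `buckets_path`. The aggregation and field names (`sales_per_month`, `sales`, `sales_deriv`, `sales_2nd_deriv`,
`date`, `price`) are assumptions chosen for the example, not names required by Elasticsearch.

[source,js]
--------------------------------------------------
// Hypothetical request body; all aggregation and field names are illustrative.
const request = {
    aggs: {
        sales_per_month: {
            date_histogram: { field: "date", interval: "month" },
            aggs: {
                sales: { sum: { field: "price" } },
                // First derivative of the "sales" metric.
                sales_deriv: { derivative: { buckets_path: "sales" } },
                // Second derivative: its buckets_path names the first derivative.
                sales_2nd_deriv: { derivative: { buckets_path: "sales_deriv" } }
            }
        }
    }
};

// Print the JSON that would be sent as the search request body.
console.log(JSON.stringify(request, null, 2));
--------------------------------------------------

In line with the note above, a response to a request shaped like this would be expected to carry both `sales_deriv` and
`sales_2nd_deriv` in each bucket where they can be computed.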
[[buckets-path-syntax]]
[float]
=== `buckets_path` Syntax

Most pipeline aggregations require another aggregation as their input. The input aggregation is defined via the `buckets_path`
parameter, which follows a specific format:

--------------------------------------------------
AGG_SEPARATOR       :=  '>'
METRIC_SEPARATOR    :=  '.'
AGG_NAME            :=  <the name of the aggregation>
METRIC              :=  <the name of the metric (in case of multi-value metrics aggregation)>
PATH                :=  <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>]
--------------------------------------------------

For example, the path `"my_bucket>my_stats.avg"` points to the `avg` value in the `"my_stats"` metric, which is
contained in the `"my_bucket"` bucket aggregation.

Paths are relative to the position of the pipeline aggregation; they are not absolute paths, and a path cannot go back "up" the
aggregation tree. For example, this moving average is embedded inside a `date_histogram` and refers to a "sibling"
metric `"the_sum"`:

[source,js]
--------------------------------------------------
{
    "my_date_histo":{
        "date_histogram":{
            "field":"timestamp",
            "interval":"day"
        },
        "aggs":{
            "the_sum":{
                "sum":{ "field": "lemmings" } <1>
            },
            "the_movavg":{
                "moving_avg":{ "buckets_path": "the_sum" } <2>
            }
        }
    }
}
--------------------------------------------------
<1> The metric is called `"the_sum"`
<2> The `buckets_path` refers to the metric via a relative path `"the_sum"`

`buckets_path` is also used for Sibling pipeline aggregations, where the aggregation sits "next to" a series of buckets
instead of being embedded "inside" them. For example, the `max_bucket` aggregation uses `buckets_path` to specify
a metric embedded inside a sibling aggregation:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "max_monthly_sales": {
            "max_bucket": {
                "buckets_path": "sales_per_month>sales" <1>
            }
        }
    }
}
--------------------------------------------------
<1> `buckets_path` tells this `max_bucket` aggregation that we want the maximum value of the `sales` aggregation in the
`sales_per_month` date histogram.

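
The path grammar above can be sketched as a small parser. This is only an illustration under stated assumptions:
`parseBucketsPath` is a hypothetical helper, not part of Elasticsearch, and it assumes that aggregation names contain
neither `>` nor `.` and that the metric separator appears only in the last path segment.

[source,js]
--------------------------------------------------
// Split a buckets_path into aggregation names and an optional metric,
// following PATH := <AGG_NAME>[>AGG_NAME]*[.METRIC].
// parseBucketsPath is a hypothetical helper, not an Elasticsearch API.
function parseBucketsPath(path) {
    if (typeof path !== "string" || path.length === 0) {
        throw new Error("buckets_path must be a non-empty string");
    }
    const aggNames = path.split(">");
    const last = aggNames[aggNames.length - 1];
    let metric = null;
    // Assumption: only the last segment may carry a ".METRIC" suffix.
    const dot = last.indexOf(".");
    if (dot !== -1) {
        metric = last.slice(dot + 1);
        aggNames[aggNames.length - 1] = last.slice(0, dot);
    }
    if (aggNames.some((name) => name.length === 0) || metric === "") {
        throw new Error("malformed buckets_path: " + path);
    }
    return { aggNames, metric };
}

// The three paths used as examples in this section.
console.log(parseBucketsPath("my_bucket>my_stats.avg"));
console.log(parseBucketsPath("the_sum"));
console.log(parseBucketsPath("sales_per_month>sales"));
--------------------------------------------------

Under these assumptions, `"my_bucket>my_stats.avg"` splits into the aggregation names `my_bucket` and `my_stats` with the
metric `avg`, while `"the_sum"` yields a single aggregation name and no metric.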
[float]
==== Special Paths

Instead of pathing to a metric, `buckets_path` can use a special `"_count"` path. This instructs
the pipeline aggregation to use the document count as its input. For example, a moving average can be calculated on the document
count of each bucket, instead of a specific metric:

[source,js]
--------------------------------------------------
{
    "my_date_histo":{
        "date_histogram":{
            "field":"timestamp",
            "interval":"day"
        },
        "aggs":{
            "the_movavg":{
                "moving_avg":{ "buckets_path": "_count" } <1>
            }
        }
    }
}
--------------------------------------------------
<1> By using `_count` instead of a metric name, we can calculate the moving average of document counts in the histogram

[[gap-policy]]
[float]
=== Dealing with gaps in the data

Data in the real world is often noisy and sometimes contains *gaps*: places where data simply doesn't exist. This can
occur for a variety of reasons, the most common being:

* Documents falling into a bucket do not contain a required field.
* There are no documents matching the query for one or more buckets.
* The metric being calculated is unable to generate a value, likely because another dependent bucket is missing a value.
Some pipeline aggregations have specific requirements that must be met (e.g. a derivative cannot calculate a metric for the
first value because there is no previous value, a Holt-Winters moving average needs "warmup" data to begin calculating, etc.).

Gap policies are a mechanism to inform the pipeline aggregation about the desired behavior when "gappy" or missing
data is encountered. All pipeline aggregations accept the `gap_policy` parameter. There are currently two gap policies
to choose from:

_skip_::
This option treats missing data as if the bucket does not exist. It will skip the bucket and continue
calculating using the next available value.

_insert_zeros_::
This option will replace missing values with a zero (`0`) and pipeline aggregation computation will
proceed as normal.

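
The difference between the two policies can be illustrated with a small simulation. This is a sketch of the behavior
described above, not Elasticsearch's implementation: `applyGapPolicy` and `average` are hypothetical helpers, `null`
stands in for a bucket with no value, and a plain average stands in for whatever the pipeline aggregation computes.

[source,js]
--------------------------------------------------
// Illustrative only: applyGapPolicy and average are hypothetical helpers
// that mimic the described behavior; they are not Elasticsearch code.
function applyGapPolicy(values, gapPolicy) {
    switch (gapPolicy) {
        case "skip":
            // Treat buckets with missing data as if they did not exist.
            return values.filter((v) => v !== null);
        case "insert_zeros":
            // Replace each missing value with 0 and keep the bucket.
            return values.map((v) => (v === null ? 0 : v));
        default:
            throw new Error("unknown gap_policy: " + gapPolicy);
    }
}

function average(values) {
    return values.reduce((sum, v) => sum + v, 0) / values.length;
}

const bucketValues = [10, null, 20, 30]; // null marks a gap

console.log(average(applyGapPolicy(bucketValues, "skip")));         // 20
console.log(average(applyGapPolicy(bucketValues, "insert_zeros"))); // 15
--------------------------------------------------

With `skip` the gap is dropped and the average covers three buckets; with `insert_zeros` the gap counts as `0` and the
average covers four. In a request, the policy is chosen by setting `gap_policy` next to `buckets_path` in the pipeline
aggregation's definition.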
include::pipeline/avg-bucket-aggregation.asciidoc[]
include::pipeline/derivative-aggregation.asciidoc[]
include::pipeline/max-bucket-aggregation.asciidoc[]
include::pipeline/min-bucket-aggregation.asciidoc[]
include::pipeline/sum-bucket-aggregation.asciidoc[]
include::pipeline/stats-bucket-aggregation.asciidoc[]
include::pipeline/extended-stats-bucket-aggregation.asciidoc[]
include::pipeline/percentiles-bucket-aggregation.asciidoc[]
include::pipeline/movavg-aggregation.asciidoc[]
include::pipeline/cumulative-sum-aggregation.asciidoc[]
include::pipeline/bucket-script-aggregation.asciidoc[]
include::pipeline/bucket-selector-aggregation.asciidoc[]
include::pipeline/serial-diff-aggregation.asciidoc[]