635 lines
21 KiB
Plaintext
635 lines
21 KiB
Plaintext
[[search-aggregations-pipeline-movfn-aggregation]]
|
|
=== Moving Function Aggregation
|
|
|
|
Given an ordered series of data, the Moving Function aggregation will slide a window across the data and allow the user to specify a custom
|
|
script that is executed on each window of data. For convenience, a number of common functions are predefined such as min/max, moving averages,
|
|
etc.
|
|
|
|
This is conceptually very similar to the <<search-aggregations-pipeline-movavg-aggregation, Moving Average>> pipeline aggregation, except
|
|
it provides more functionality.
|
|
|
|
==== Syntax
|
|
|
|
A `moving_fn` aggregation looks like this in isolation:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"moving_fn": {
|
|
"buckets_path": "the_sum",
|
|
"window": 10,
|
|
"script": "MovingFunctions.min(values)"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
.`moving_avg` Parameters
|
|
|===
|
|
|Parameter Name |Description |Required |Default Value
|
|
|`buckets_path` |Path to the metric of interest (see <<buckets-path-syntax, `buckets_path` Syntax>> for more details |Required |
|
|
|`window` |The size of window to "slide" across the histogram. |Required |
|
|
|`script` |The script that should be executed on each window of data |Required |
|
|
|===
|
|
|
|
`moving_fn` aggregations must be embedded inside of a `histogram` or `date_histogram` aggregation. They can be
|
|
embedded like any other metric aggregation:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"size": 0,
|
|
"aggs": {
|
|
"my_date_histo":{ <1>
|
|
"date_histogram":{
|
|
"field":"date",
|
|
"interval":"1M"
|
|
},
|
|
"aggs":{
|
|
"the_sum":{
|
|
"sum":{ "field": "price" } <2>
|
|
},
|
|
"the_movfn": {
|
|
"moving_fn": {
|
|
"buckets_path": "the_sum", <3>
|
|
"window": 10,
|
|
"script": "MovingFunctions.unweightedAvg(values)"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
|
|
<1> A `date_histogram` named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals
|
|
<2> A `sum` metric is used to calculate the sum of a field. This could be any numeric metric (sum, min, max, etc)
|
|
<3> Finally, we specify a `moving_fn` aggregation which uses "the_sum" metric as its input.
|
|
|
|
Moving averages are built by first specifying a `histogram` or `date_histogram` over a field. You can then optionally
|
|
add numeric metrics, such as a `sum`, inside of that histogram. Finally, the `moving_fn` is embedded inside the histogram.
|
|
The `buckets_path` parameter is then used to "point" at one of the sibling metrics inside of the histogram (see
|
|
<<buckets-path-syntax>> for a description of the syntax for `buckets_path`.
|
|
|
|
An example response from the above aggregation may look like:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"took": 11,
|
|
"timed_out": false,
|
|
"_shards": ...,
|
|
"hits": ...,
|
|
"aggregations": {
|
|
"my_date_histo": {
|
|
"buckets": [
|
|
{
|
|
"key_as_string": "2015/01/01 00:00:00",
|
|
"key": 1420070400000,
|
|
"doc_count": 3,
|
|
"the_sum": {
|
|
"value": 550.0
|
|
},
|
|
"the_movfn": {
|
|
"value": null
|
|
}
|
|
},
|
|
{
|
|
"key_as_string": "2015/02/01 00:00:00",
|
|
"key": 1422748800000,
|
|
"doc_count": 2,
|
|
"the_sum": {
|
|
"value": 60.0
|
|
},
|
|
"the_movfn": {
|
|
"value": 550.0
|
|
}
|
|
},
|
|
{
|
|
"key_as_string": "2015/03/01 00:00:00",
|
|
"key": 1425168000000,
|
|
"doc_count": 2,
|
|
"the_sum": {
|
|
"value": 375.0
|
|
},
|
|
"the_movfn": {
|
|
"value": 305.0
|
|
}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[s/"took": 11/"took": $body.took/]
|
|
// TESTRESPONSE[s/"_shards": \.\.\./"_shards": $body._shards/]
|
|
// TESTRESPONSE[s/"hits": \.\.\./"hits": $body.hits/]
|
|
|
|
|
|
==== Custom user scripting
|
|
|
|
The Moving Function aggregation allows the user to specify any arbitrary script to define custom logic. The script is invoked each time a
|
|
new window of data is collected. These values are provided to the script in the `values` variable. The script should then perform some
|
|
kind of calculation and emit a single `double` as the result. Emitting `null` is not permitted, although `NaN` and +/- `Inf` are allowed.
|
|
|
|
For example, this script will simply return the first value from the window, or `NaN` if no values are available:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"size": 0,
|
|
"aggs": {
|
|
"my_date_histo":{
|
|
"date_histogram":{
|
|
"field":"date",
|
|
"interval":"1M"
|
|
},
|
|
"aggs":{
|
|
"the_sum":{
|
|
"sum":{ "field": "price" }
|
|
},
|
|
"the_movavg": {
|
|
"moving_fn": {
|
|
"buckets_path": "the_sum",
|
|
"window": 10,
|
|
"script": "return values.length > 0 ? values[0] : Double.NaN"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
|
|
==== Pre-built Functions
|
|
|
|
For convenience, a number of functions have been prebuilt and are available inside the `moving_fn` script context:
|
|
|
|
- `max()`
|
|
- `min()`
|
|
- `sum()`
|
|
- `stdDev()`
|
|
- `unweightedAvg()`
|
|
- `linearWeightedAvg()`
|
|
- `ewma()`
|
|
- `holt()`
|
|
- `holtWinters()`
|
|
|
|
The functions are available from the `MovingFunctions` namespace. E.g. `MovingFunctions.max()`
|
|
|
|
===== max Function
|
|
|
|
This function accepts a collection of doubles and returns the maximum value in that window. `null` and `NaN` values are ignored; the maximum
|
|
is only calculated over the real values. If the window is empty, or all values are `null`/`NaN`, `NaN` is returned as the result.
|
|
|
|
.`max(double[] values)` Parameters
|
|
|===
|
|
|Parameter Name |Description
|
|
|`values` |The window of values to find the maximum
|
|
|===
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"size": 0,
|
|
"aggs": {
|
|
"my_date_histo":{
|
|
"date_histogram":{
|
|
"field":"date",
|
|
"interval":"1M"
|
|
},
|
|
"aggs":{
|
|
"the_sum":{
|
|
"sum":{ "field": "price" }
|
|
},
|
|
"the_moving_max": {
|
|
"moving_fn": {
|
|
"buckets_path": "the_sum",
|
|
"window": 10,
|
|
"script": "MovingFunctions.max(values)"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
|
|
===== min Function
|
|
|
|
This function accepts a collection of doubles and returns the minimum value in that window. `null` and `NaN` values are ignored; the minimum
|
|
is only calculated over the real values. If the window is empty, or all values are `null`/`NaN`, `NaN` is returned as the result.
|
|
|
|
.`min(double[] values)` Parameters
|
|
|===
|
|
|Parameter Name |Description
|
|
|`values` |The window of values to find the minimum
|
|
|===
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"size": 0,
|
|
"aggs": {
|
|
"my_date_histo":{
|
|
"date_histogram":{
|
|
"field":"date",
|
|
"interval":"1M"
|
|
},
|
|
"aggs":{
|
|
"the_sum":{
|
|
"sum":{ "field": "price" }
|
|
},
|
|
"the_moving_min": {
|
|
"moving_fn": {
|
|
"buckets_path": "the_sum",
|
|
"window": 10,
|
|
"script": "MovingFunctions.min(values)"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
|
|
===== sum Function
|
|
|
|
This function accepts a collection of doubles and returns the sum of the values in that window. `null` and `NaN` values are ignored;
|
|
the sum is only calculated over the real values. If the window is empty, or all values are `null`/`NaN`, `0.0` is returned as the result.
|
|
|
|
.`sum(double[] values)` Parameters
|
|
|===
|
|
|Parameter Name |Description
|
|
|`values` |The window of values to find the sum of
|
|
|===
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"size": 0,
|
|
"aggs": {
|
|
"my_date_histo":{
|
|
"date_histogram":{
|
|
"field":"date",
|
|
"interval":"1M"
|
|
},
|
|
"aggs":{
|
|
"the_sum":{
|
|
"sum":{ "field": "price" }
|
|
},
|
|
"the_moving_sum": {
|
|
"moving_fn": {
|
|
"buckets_path": "the_sum",
|
|
"window": 10,
|
|
"script": "MovingFunctions.sum(values)"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
|
|
===== stdDev Function
|
|
|
|
This function accepts a collection of doubles and average, then returns the standard deviation of the values in that window.
|
|
`null` and `NaN` values are ignored; the sum is only calculated over the real values. If the window is empty, or all values are
|
|
`null`/`NaN`, `0.0` is returned as the result.
|
|
|
|
.`stdDev(double[] values)` Parameters
|
|
|===
|
|
|Parameter Name |Description
|
|
|`values` |The window of values to find the standard deviation of
|
|
|`avg` |The average of the window
|
|
|===
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"size": 0,
|
|
"aggs": {
|
|
"my_date_histo":{
|
|
"date_histogram":{
|
|
"field":"date",
|
|
"interval":"1M"
|
|
},
|
|
"aggs":{
|
|
"the_sum":{
|
|
"sum":{ "field": "price" }
|
|
},
|
|
"the_moving_sum": {
|
|
"moving_fn": {
|
|
"buckets_path": "the_sum",
|
|
"window": 10,
|
|
"script": "MovingFunctions.stdDev(values, MovingFunctions.unweightedAvg(values))"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
|
|
The `avg` parameter must be provided to the standard deviation function because different styles of averages can be computed on the window
|
|
(simple, linearly weighted, etc). The various moving averages that are detailed below can be used to calculate the average for the
|
|
standard deviation function.
|
|
|
|
===== unweightedAvg Function
|
|
|
|
The `unweightedAvg` function calculates the sum of all values in the window, then divides by the size of the window. It is effectively
|
|
a simple arithmetic mean of the window. The simple moving average does not perform any time-dependent weighting, which means
|
|
the values from a `simple` moving average tend to "lag" behind the real data.
|
|
|
|
`null` and `NaN` values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
|
|
`null`/`NaN`, `NaN` is returned as the result. This means that the count used in the average calculation is count of non-`null`,non-`NaN`
|
|
values.
|
|
|
|
.`unweightedAvg(double[] values)` Parameters
|
|
|===
|
|
|Parameter Name |Description
|
|
|`values` |The window of values to find the sum of
|
|
|===
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"size": 0,
|
|
"aggs": {
|
|
"my_date_histo":{
|
|
"date_histogram":{
|
|
"field":"date",
|
|
"interval":"1M"
|
|
},
|
|
"aggs":{
|
|
"the_sum":{
|
|
"sum":{ "field": "price" }
|
|
},
|
|
"the_movavg": {
|
|
"moving_fn": {
|
|
"buckets_path": "the_sum",
|
|
"window": 10,
|
|
"script": "MovingFunctions.unweightedAvg(values)"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
|
|
==== linearWeightedAvg Function
|
|
|
|
The `linearWeightedAvg` function assigns a linear weighting to points in the series, such that "older" datapoints (e.g. those at
|
|
the beginning of the window) contribute a linearly less amount to the total average. The linear weighting helps reduce
|
|
the "lag" behind the data's mean, since older points have less influence.
|
|
|
|
If the window is empty, or all values are `null`/`NaN`, `NaN` is returned as the result.
|
|
|
|
.`linearWeightedAvg(double[] values)` Parameters
|
|
|===
|
|
|Parameter Name |Description
|
|
|`values` |The window of values to find the sum of
|
|
|===
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"size": 0,
|
|
"aggs": {
|
|
"my_date_histo":{
|
|
"date_histogram":{
|
|
"field":"date",
|
|
"interval":"1M"
|
|
},
|
|
"aggs":{
|
|
"the_sum":{
|
|
"sum":{ "field": "price" }
|
|
},
|
|
"the_movavg": {
|
|
"moving_fn": {
|
|
"buckets_path": "the_sum",
|
|
"window": 10,
|
|
"script": "MovingFunctions.linearWeightedAvg(values)"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
|
|
==== ewma Function
|
|
|
|
The `ewma` function (aka "single-exponential") is similar to the `linearMovAvg` function,
|
|
except older data-points become exponentially less important,
|
|
rather than linearly less important. The speed at which the importance decays can be controlled with an `alpha`
|
|
setting. Small values make the weight decay slowly, which provides greater smoothing and takes into account a larger
|
|
portion of the window. Larger values make the weight decay quickly, which reduces the impact of older values on the
|
|
moving average. This tends to make the moving average track the data more closely but with less smoothing.
|
|
|
|
`null` and `NaN` values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
|
|
`null`/`NaN`, `NaN` is returned as the result. This means that the count used in the average calculation is count of non-`null`,non-`NaN`
|
|
values.
|
|
|
|
.`ewma(double[] values, double alpha)` Parameters
|
|
|===
|
|
|Parameter Name |Description
|
|
|`values` |The window of values to find the sum of
|
|
|`alpha` |Exponential decay
|
|
|===
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"size": 0,
|
|
"aggs": {
|
|
"my_date_histo":{
|
|
"date_histogram":{
|
|
"field":"date",
|
|
"interval":"1M"
|
|
},
|
|
"aggs":{
|
|
"the_sum":{
|
|
"sum":{ "field": "price" }
|
|
},
|
|
"the_movavg": {
|
|
"moving_fn": {
|
|
"buckets_path": "the_sum",
|
|
"window": 10,
|
|
"script": "MovingFunctions.ewma(values, 0.3)"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
|
|
|
|
==== holt Function
|
|
|
|
The `holt` function (aka "double exponential") incorporates a second exponential term which
|
|
tracks the data's trend. Single exponential does not perform well when the data has an underlying linear trend. The
|
|
double exponential model calculates two values internally: a "level" and a "trend".
|
|
|
|
The level calculation is similar to `ewma`, and is an exponentially weighted view of the data. The difference is
|
|
that the previously smoothed value is used instead of the raw value, which allows it to stay close to the original series.
|
|
The trend calculation looks at the difference between the current and last value (e.g. the slope, or trend, of the
|
|
smoothed data). The trend value is also exponentially weighted.
|
|
|
|
Values are produced by multiplying the level and trend components.
|
|
|
|
`null` and `NaN` values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
|
|
`null`/`NaN`, `NaN` is returned as the result. This means that the count used in the average calculation is count of non-`null`,non-`NaN`
|
|
values.
|
|
|
|
.`holt(double[] values, double alpha)` Parameters
|
|
|===
|
|
|Parameter Name |Description
|
|
|`values` |The window of values to find the sum of
|
|
|`alpha` |Level decay value
|
|
|`beta` |Trend decay value
|
|
|===
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"size": 0,
|
|
"aggs": {
|
|
"my_date_histo":{
|
|
"date_histogram":{
|
|
"field":"date",
|
|
"interval":"1M"
|
|
},
|
|
"aggs":{
|
|
"the_sum":{
|
|
"sum":{ "field": "price" }
|
|
},
|
|
"the_movavg": {
|
|
"moving_fn": {
|
|
"buckets_path": "the_sum",
|
|
"window": 10,
|
|
"script": "MovingFunctions.holt(values, 0.3, 0.1)"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
|
|
In practice, the `alpha` value behaves very similarly in `holtMovAvg` as `ewmaMovAvg`: small values produce more smoothing
|
|
and more lag, while larger values produce closer tracking and less lag. The value of `beta` is often difficult
|
|
to see. Small values emphasize long-term trends (such as a constant linear trend in the whole series), while larger
|
|
values emphasize short-term trends.
|
|
|
|
==== holtWinters Function
|
|
|
|
The `holtWinters` function (aka "triple exponential") incorporates a third exponential term which
|
|
tracks the seasonal aspect of your data. This aggregation therefore smooths based on three components: "level", "trend"
|
|
and "seasonality".
|
|
|
|
The level and trend calculation is identical to `holt` The seasonal calculation looks at the difference between
|
|
the current point, and the point one period earlier.
|
|
|
|
Holt-Winters requires a little more handholding than the other moving averages. You need to specify the "periodicity"
|
|
of your data: e.g. if your data has cyclic trends every 7 days, you would set `period = 7`. Similarly if there was
|
|
a monthly trend, you would set it to `30`. There is currently no periodicity detection, although that is planned
|
|
for future enhancements.
|
|
|
|
`null` and `NaN` values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
|
|
`null`/`NaN`, `NaN` is returned as the result. This means that the count used in the average calculation is count of non-`null`,non-`NaN`
|
|
values.
|
|
|
|
.`holtWinters(double[] values, double alpha)` Parameters
|
|
|===
|
|
|Parameter Name |Description
|
|
|`values` |The window of values to find the sum of
|
|
|`alpha` |Level decay value
|
|
|`beta` |Trend decay value
|
|
|`gamma` |Seasonality decay value
|
|
|`period` |The periodicity of the data
|
|
|`multiplicative` |True if you wish to use multiplicative holt-winters, false to use additive
|
|
|===
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
POST /_search
|
|
{
|
|
"size": 0,
|
|
"aggs": {
|
|
"my_date_histo":{
|
|
"date_histogram":{
|
|
"field":"date",
|
|
"interval":"1M"
|
|
},
|
|
"aggs":{
|
|
"the_sum":{
|
|
"sum":{ "field": "price" }
|
|
},
|
|
"the_movavg": {
|
|
"moving_fn": {
|
|
"buckets_path": "the_sum",
|
|
"window": 10,
|
|
"script": "if (values.length > 5*2) {MovingFunctions.holtWinters(values, 0.3, 0.1, 0.1, 5, false)}"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// CONSOLE
|
|
// TEST[setup:sales]
|
|
|
|
[WARNING]
|
|
======
|
|
Multiplicative Holt-Winters works by dividing each data point by the seasonal value. This is problematic if any of
|
|
your data is zero, or if there are gaps in the data (since this results in a divid-by-zero). To combat this, the
|
|
`mult` Holt-Winters pads all values by a very small amount (1*10^-10^) so that all values are non-zero. This affects
|
|
the result, but only minimally. If your data is non-zero, or you prefer to see `NaN` when zero's are encountered,
|
|
you can disable this behavior with `pad: false`
|
|
======
|
|
|
|
===== "Cold Start"
|
|
|
|
Unfortunately, due to the nature of Holt-Winters, it requires two periods of data to "bootstrap" the algorithm. This
|
|
means that your `window` must always be *at least* twice the size of your period. An exception will be thrown if it
|
|
isn't. It also means that Holt-Winters will not emit a value for the first `2 * period` buckets; the current algorithm
|
|
does not backcast.
|
|
|
|
You'll notice in the above example we have an `if ()` statement checking the size of values. This is checking to make sure
|
|
we have two periods worth of data (`5 * 2`, where 5 is the period specified in the `holtWintersMovAvg` function) before calling
|
|
the holt-winters function.
|