[[search-aggregations-pipeline-serialdiff-aggregation]]
=== Serial Differencing Aggregation

Serial differencing is a technique where each value in a time series has an earlier value of the same series subtracted
from it, at a chosen time lag or period. For example, the differenced datapoint is f(x) = f(x~t~) - f(x~t-n~), where n is the period being used.

A period of 1 is equivalent to a derivative with no time normalization: it is simply the change from one point to the
next. Single periods are useful for removing constant, linear trends.

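The definition above can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only, not Elasticsearch code: the function name `serial_diff` and the choice to return `None` for the first `lag` points (which have no lagged value to subtract) are assumptions made for this example.

```python
from typing import List, Optional


def serial_diff(values: List[float], lag: int = 1) -> List[Optional[float]]:
    """Return values[t] - values[t - lag] for each t.

    Positions with no lagged predecessor are reported as None
    (an assumption of this sketch).
    """
    if lag < 1:
        raise ValueError("lag must be a positive, non-zero integer")
    return [
        None if t < lag else values[t] - values[t - lag]
        for t in range(len(values))
    ]


# A series with a constant linear trend: each point is 3 higher than the last.
trend = [10.0, 13.0, 16.0, 19.0, 22.0]

# A period of 1 turns the linear trend into a constant.
print(serial_diff(trend, lag=1))  # [None, 3.0, 3.0, 3.0, 3.0]
```

With a period of 1 the trend collapses to a constant, which is the sense in which single periods remove constant, linear trends.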

Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is
plotted over ~250 days. The raw data is not stationary, which would make it difficult to use with some techniques.

By calculating the first-difference, we de-trend the data (i.e. remove a constant, linear trend). We can see that the
data becomes a stationary series (i.e. the first difference is randomly distributed around zero, and doesn't seem to
exhibit any pattern/behavior). The transformation reveals that the dataset is following a random-walk; the value is the
previous value +/- a random amount. This insight allows selection of further tools for analysis.

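The random-walk observation can be illustrated with a small Python sketch. The data here is synthetic, not Dow Jones data: the starting value, step range, seed, and the helper names `random_walk` and `first_difference` are all hypothetical choices for this example.

```python
import random
from typing import List


def random_walk(steps: int, start: int = 18000, seed: int = 42) -> List[int]:
    """Each value is the previous value plus a random integer step."""
    rng = random.Random(seed)
    series = [start]
    for _ in range(steps):
        series.append(series[-1] + rng.randint(-100, 100))
    return series


def first_difference(values: List[int]) -> List[int]:
    """Change from each point to the next (serial difference with a period of 1)."""
    return [values[t] - values[t - 1] for t in range(1, len(values))]


walk = random_walk(250)
steps = first_difference(walk)

# The differenced series is exactly the sequence of random steps,
# so it stays within the step range instead of wandering like the walk.
print(min(steps), max(steps))
print(sum(steps) / len(steps))  # average step of the differenced series
```

Because the walk is built by adding a random step to the previous value, first-differencing recovers those steps exactly, which is why the differenced series shows no trend.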

[[serialdiff_dow]]
.Dow Jones plotted and made stationary with first-differencing
image::images/pipeline_serialdiff/dow.png[]

Larger periods can be used to remove seasonal / cyclic behavior. In this example, a population of lemmings was
synthetically generated with a sine wave + constant linear trend + random noise. The sine wave has a period of 30 days.

The first-difference removes the constant trend, leaving just a sine wave. The 30th-difference is then applied to the
first-difference to remove the cyclic behavior, leaving a stationary series which is amenable to other analysis.

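The two-step differencing described above can be reproduced with a minimal Python sketch. The amplitude, slope, and series length are assumed values chosen for illustration (the original figure's parameters are not given), and noise is turned off by default so the effect of each step is easy to see.

```python
import math
import random
from typing import List


def difference(values: List[float], lag: int) -> List[float]:
    """Lagged difference; drops the first `lag` points, which have no predecessor."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]


def synthetic_lemmings(days: int, noise: float = 0.0, seed: int = 0) -> List[float]:
    """Sine wave (30-day period) + constant linear trend + optional uniform noise."""
    rng = random.Random(seed)
    return [
        100.0 * math.sin(2 * math.pi * day / 30)  # cyclic component (assumed amplitude)
        + 5.0 * day                                # linear trend (assumed slope)
        + rng.uniform(-noise, noise)               # random noise
        for day in range(days)
    ]


population = synthetic_lemmings(days=120)
detrended = difference(population, lag=1)        # removes the linear trend
deseasonalized = difference(detrended, lag=30)   # removes the 30-day cycle

# With noise set to 0, what remains is zero up to floating-point error.
print(max(abs(v) for v in deseasonalized))
```

Passing a non-zero `noise` leaves only the differenced noise in the final series, which corresponds to the stationary series described above.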

[[serialdiff_lemmings]]
.Lemmings data plotted and made stationary with 1st and 30th difference
image::images/pipeline_serialdiff/lemmings.png[]

==== Syntax

A `serial_diff` aggregation looks like this in isolation:

[source,js]
--------------------------------------------------
{
  "serial_diff": {
    "buckets_path": "the_sum",
    "lag": 7
  }
}
--------------------------------------------------
// NOTCONSOLE

[[serial-diff-params]]
.`serial_diff` Parameters
[options="header"]
|===
|Parameter Name |Description |Required |Default Value
|`buckets_path` |Path to the metric of interest (see <<buckets-path-syntax, `buckets_path` Syntax>> for more details) |Required |
|`lag` |The historical bucket to subtract from the current value. E.g. a lag of 7 will subtract the value 7 buckets ago from
the current value. Must be a positive, non-zero integer |Optional |`1`
|`gap_policy` |Determines what should happen when a gap in the data is encountered. |Optional |`insert_zeros`
|`format` |Format to apply to the output value of this aggregation |Optional | `null`
|===

`serial_diff` aggregations must be embedded inside of a `histogram` or `date_histogram` aggregation:

[source,console]
--------------------------------------------------
POST /_search
{
  "size": 0,
  "aggs": {
    "my_date_histo": { <1>
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "the_sum": {
          "sum": {
            "field": "lemmings" <2>
          }
        },
        "thirtieth_difference": {
          "serial_diff": { <3>
            "buckets_path": "the_sum",
            "lag" : 30
          }
        }
      }
    }
  }
}
--------------------------------------------------

<1> A `date_histogram` named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals
<2> A `sum` metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc)
<3> Finally, we specify a `serial_diff` aggregation which uses "the_sum" metric as its input.

Serial differences are built by first specifying a `histogram` or `date_histogram` over a field. You can then optionally
add normal metrics, such as a `sum`, inside of that histogram. Finally, the `serial_diff` is embedded inside the histogram.
The `buckets_path` parameter is then used to "point" at one of the sibling metrics inside of the histogram (see
<<buckets-path-syntax>> for a description of the syntax for `buckets_path`).

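The nesting described above can also be assembled programmatically. The following Python sketch only builds the request body as a dictionary and prints it as JSON; it does not call any Elasticsearch client. The helper name `serial_diff_request` and the aggregation name `the_diff` are hypothetical, while the field names and parameters mirror the example request.

```python
import json
from typing import Any, Dict


def serial_diff_request(date_field: str, metric_field: str, lag: int = 1) -> Dict[str, Any]:
    """Build a search body: daily date_histogram, a sum metric, and a serial_diff."""
    if lag < 1:
        raise ValueError("lag must be a positive, non-zero integer")
    return {
        "size": 0,
        "aggs": {
            "my_date_histo": {
                "date_histogram": {"field": date_field, "calendar_interval": "day"},
                "aggs": {
                    # Sibling metric that the serial_diff reads from.
                    "the_sum": {"sum": {"field": metric_field}},
                    # buckets_path points at the sibling metric by name.
                    "the_diff": {"serial_diff": {"buckets_path": "the_sum", "lag": lag}},
                },
            }
        },
    }


body = serial_diff_request("timestamp", "lemmings", lag=30)
print(json.dumps(body, indent=2))
```

Keeping the metric and the `serial_diff` as siblings under the histogram's `aggs` is what lets `buckets_path` refer to the metric by its name alone.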