
[[search-aggregations-pipeline-serialdiff-aggregation]]
=== Serial Differencing Aggregation

Serial differencing is a technique where each value in a time series has an earlier value of the same series subtracted
from it, at a chosen time lag or period. For example, the differenced datapoint is f(x) = f(x~t~) - f(x~t-n~), where n is the period being used.

A period of 1 is equivalent to a derivative with no time normalization: it is simply the change from one point to the
next. Single periods are useful for removing constant, linear trends.
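
The definition above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the
aggregation's implementation: the function name `serial_diff` and the choice to return `None` for the first `lag`
positions (which have no earlier value to subtract) are assumptions made for the example.

```python
from typing import List, Optional


def serial_diff(values: List[float], lag: int = 1) -> List[Optional[float]]:
    """Return values[t] - values[t - lag] for every position t.

    The first `lag` positions have no earlier value to subtract, so this
    sketch reports them as None (an assumption made for illustration).
    """
    if lag < 1:
        raise ValueError("lag must be a positive, non-zero integer")
    return [
        None if t < lag else values[t] - values[t - lag]
        for t in range(len(values))
    ]


series = [10.0, 12.0, 15.0, 19.0, 24.0]

# A period of 1 is simply the change from one point to the next.
first_difference = serial_diff(series, lag=1)

# A period of 2 compares each point with the one two steps earlier.
second_lag = serial_diff(series, lag=2)

print(first_difference)
print(second_lag)
```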

Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is
plotted over ~250 days. The raw data is not stationary, which would make it difficult to use with some techniques.

By calculating the first-difference, we de-trend the data (i.e. remove a constant, linear trend). We can see that the
data becomes a stationary series (i.e. the first difference is randomly distributed around zero, and doesn't seem to
exhibit any pattern or behavior). The transformation reveals that the dataset follows a random walk: each value is the
previous value plus or minus a random amount. This insight allows selection of further tools for analysis.

[[serialdiff_dow]]
.Dow Jones plotted and made stationary with first-differencing
image::images/pipeline_serialdiff/dow.png[]

Larger periods can be used to remove seasonal / cyclic behavior. In this example, a population of lemmings was
synthetically generated with a sine wave + constant linear trend + random noise. The sine wave has a period of 30 days.

The first-difference removes the constant trend, leaving just a sine wave. The 30th-difference is then applied to the
first-difference to remove the cyclic behavior, leaving a stationary series which is amenable to other analysis.

[[serialdiff_lemmings]]
.Lemmings data plotted and made stationary with 1st and 30th difference
image::images/pipeline_serialdiff/lemmings.png[]
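
The lemmings example can be reproduced as a rough Python sketch. The baseline, slope, and amplitude below are made-up
values, and the random noise is left out so that the result is deterministic; with noise added, the final series would
scatter around zero instead of being numerically zero. Unlike the aggregation itself, this helper simply drops the
leading points that have no earlier value.

```python
import math


def serial_diff(values, lag=1):
    """Return values[t] - values[t - lag], dropping the first `lag` points."""
    if lag < 1:
        raise ValueError("lag must be a positive, non-zero integer")
    return [values[t] - values[t - lag] for t in range(lag, len(values))]


# Hypothetical synthetic population: baseline + linear trend + 30-day sine wave.
days = range(120)
lemmings = [
    100.0 + 0.5 * t + 20.0 * math.sin(2 * math.pi * t / 30)
    for t in days
]

# The first difference removes the linear trend, leaving the cyclic component.
first = serial_diff(lemmings, lag=1)

# The 30th difference of that series cancels the 30-day cycle.
thirtieth_of_first = serial_diff(first, lag=30)

max_residual = max(abs(v) for v in thirtieth_of_first)
print(len(thirtieth_of_first), max_residual)
```

Without noise, `max_residual` is zero up to floating-point rounding, which is the "stationary series" the text describes.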


==== Syntax

A `serial_diff` aggregation looks like this in isolation:

[source,js]
--------------------------------------------------
{
    "serial_diff": {
        "buckets_path": "the_sum",
        "lag": 7
    }
}
--------------------------------------------------
// NOTCONSOLE

[[serial-diff-params]]
.`serial_diff` Parameters
[options="header"]
|===
|Parameter Name |Description |Required |Default Value
|`buckets_path` |Path to the metric of interest (see <<buckets-path-syntax, `buckets_path` Syntax>> for more details) |Required |
|`lag` |The historical bucket to subtract from the current value. For example, a lag of 7 subtracts the value from
 7 buckets ago from the current value. Must be a positive, non-zero integer |Optional |`1`
|`gap_policy` |Determines what should happen when a gap in the data is encountered. |Optional |`insert_zero`
|`format` |Format to apply to the output value of this aggregation |Optional | `null`
|===
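
As a rough illustration of how these parameters fit together, the sketch below assembles a `serial_diff` clause as a
Python dictionary and rejects a `lag` that is not a positive, non-zero integer before the request is sent. The helper
name `build_serial_diff` and its argument names are hypothetical and are not part of any client library; optional
parameters are only emitted when supplied, so the server-side defaults from the table still apply.

```python
import json


def build_serial_diff(buckets_path, lag=1, gap_policy=None, fmt=None):
    """Build a serial_diff clause (hypothetical helper for illustration)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(lag, bool) or not isinstance(lag, int) or lag < 1:
        raise ValueError("lag must be a positive, non-zero integer")

    body = {"buckets_path": buckets_path, "lag": lag}
    # Leave optional parameters out unless the caller sets them.
    if gap_policy is not None:
        body["gap_policy"] = gap_policy
    if fmt is not None:
        body["format"] = fmt
    return {"serial_diff": body}


clause = build_serial_diff("the_sum", lag=7)
print(json.dumps(clause, indent=4))
```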

`serial_diff` aggregations must be embedded inside of a `histogram` or `date_histogram` aggregation:

[source,js]
--------------------------------------------------
POST /_search
{
   "size": 0,
   "aggs": {
      "my_date_histo": {                  <1>
         "date_histogram": {
            "field": "timestamp",
            "interval": "day"
         },
         "aggs": {
            "the_sum": {
               "sum": {
                  "field": "lemmings"     <2>
               }
            },
            "thirtieth_difference": {
               "serial_diff": {                <3>
                  "buckets_path": "the_sum",
                  "lag" : 30
               }
            }
         }
      }
   }
}
--------------------------------------------------
// CONSOLE

<1> A `date_histogram` named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals
<2> A `sum` metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc.)
<3> Finally, we specify a `serial_diff` aggregation which uses "the_sum" metric as its input.

Serial differences are built by first specifying a `histogram` or `date_histogram` over a field. You can then optionally
add normal metrics, such as a `sum`, inside of that histogram. Finally, the `serial_diff` is embedded inside the histogram.
The `buckets_path` parameter is then used to "point" at one of the sibling metrics inside of the histogram (see
<<buckets-path-syntax>> for a description of the syntax for `buckets_path`).