
[[rollup-getting-started]]
== Getting Started

To use the Rollup feature, you need to create one or more "Rollup Jobs".  These jobs run continuously in the background
and roll up the index or indices that you specify, placing the rolled documents in a secondary index (also of your choosing).

Imagine you have a series of daily indices that hold sensor data (`sensor-2017-01-01`, `sensor-2017-01-02`, etc.).  A sample document might
look like this:

[source,js]
--------------------------------------------------
{
  "timestamp": 1516729294000,
  "temperature": 200,
  "voltage": 5.2,
  "node": "a"
}
--------------------------------------------------
// NOTCONSOLE

[float]
=== Creating a Rollup Job

We'd like to roll up these documents into hourly summaries, which will allow us to generate reports and dashboards with any time interval
of one hour or greater.  A rollup job might look like this:

[source,js]
--------------------------------------------------
PUT _xpack/rollup/job/sensor
{
    "index_pattern": "sensor-*",
    "rollup_index": "sensor_rollup",
    "cron": "*/30 * * * * ?",
    "size" :1000,
    "groups" : {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1h",
        "delay": "7d"
      },
      "terms": {
        "fields": ["node"]
      }
    },
    "metrics": [
        {
            "field": "temperature",
            "metrics": ["min", "max", "sum"]
        },
        {
            "field": "voltage",
            "metrics": ["avg"]
        }
    ]
}
--------------------------------------------------
// CONSOLE

We give the job the ID of "sensor" (in the URL: `PUT _xpack/rollup/job/sensor`), and tell it to roll up the index pattern `"sensor-*"`.
This job will find and roll up any index that matches that pattern. Rollup summaries are then stored in the `"sensor_rollup"` index.

The `cron` parameter controls when and how often the job activates.  When a rollup job's cron schedule triggers, it will begin rolling up
from where it left off after the last activation.  So if you configure the cron to run every 30 seconds, the job will process the last 30
seconds' worth of data that was indexed into the `sensor-*` indices.

If instead the cron was configured to run once a day at midnight, the job would process the last 24 hours' worth of data.  The choice is largely
a matter of preference, based on how "realtime" you want the rollups to be, and whether you wish to process continuously or move the work to off-peak hours.

Next, we define a set of `groups` and `metrics`.  The metrics are fairly straightforward: we want to save the min/max/sum of the `temperature`
field, and the average of the `voltage` field.

The groups are a little more interesting.  Essentially, we are defining the dimensions that we wish to pivot on at a later date when
querying the data.  The grouping in this job allows us to use `date_histogram` aggregations on the `timestamp` field, rolled up at hourly intervals.
It also allows us to run `terms` aggregations on the `node` field.
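
To make the roles of `groups` and `metrics` more concrete, the following Python sketch mimics what an hourly rollup conceptually computes for documents like the sample above.  It is an illustration only, not how the Rollup feature is implemented: the function name `roll_up_hourly`, the extra sample documents, and the output layout (including keeping the average as a sum and a count) are assumptions made for this example.

```python
from collections import defaultdict

HOUR_MS = 60 * 60 * 1000  # one hour in milliseconds


def roll_up_hourly(docs):
    """Group docs by (hour bucket, node) and summarize the metric fields.

    Hypothetical helper for illustration; the stored rollup layout differs.
    """
    buckets = defaultdict(list)
    for doc in docs:
        # Floor the timestamp to the start of its hour (the date_histogram group).
        hour_start = doc["timestamp"] - doc["timestamp"] % HOUR_MS
        # The terms group on `node` adds a second dimension to the key.
        buckets[(hour_start, doc["node"])].append(doc)

    summaries = []
    for (hour_start, node), group in sorted(buckets.items()):
        temps = [d["temperature"] for d in group]
        volts = [d["voltage"] for d in group]
        summaries.append({
            "timestamp": hour_start,
            "node": node,
            "doc_count": len(group),
            "temperature": {"min": min(temps), "max": max(temps), "sum": sum(temps)},
            # This sketch keeps the average as sum and count so it can be re-combined.
            "voltage": {"sum": sum(volts), "count": len(volts)},
        })
    return summaries


# Made-up documents in the shape of the sample document.
docs = [
    {"timestamp": 1516729294000, "temperature": 200, "voltage": 5.2, "node": "a"},
    {"timestamp": 1516729594000, "temperature": 202, "voltage": 5.1, "node": "a"},
    {"timestamp": 1516729294000, "temperature": 198, "voltage": 5.6, "node": "b"},
]
for summary in roll_up_hourly(docs):
    print(summary)
```

Three raw documents collapse into two summaries, one per node for the hour.  Fields that are not named in `groups` or `metrics` do not survive the rollup.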

.Date histogram interval vs cron schedule
**********************************
You'll note that the job's cron is configured to run every 30 seconds, but the date_histogram is configured to
roll up at hourly intervals.  How do these relate?

The date_histogram controls the granularity of the saved data.  Data will be rolled up into hourly intervals, and you will be unable
to query with finer granularity.  The cron simply controls when the process looks for new data to roll up.  Every 30 seconds it will see
if there is a new hour's worth of data and roll it up.  If not, the job goes back to sleep.

Often, it doesn't make sense to define such a small cron (30s) on a large interval (1h), because the majority of the activations will
simply go back to sleep.  But there's nothing wrong with it either; the job will do the right thing.

**********************************
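
The relationship between the cron schedule and the date_histogram interval can be sketched with a small simulation.  This is a simplified model rather than the job's actual scheduling logic: the helper names `complete_hours` and `simulate` are hypothetical, and the sketch ignores the job's `delay` setting.

```python
HOUR_MS = 60 * 60 * 1000  # date_histogram interval of 1h, in milliseconds


def complete_hours(last_rolled_ms, now_ms):
    """Return start times of hourly buckets that are fully in the past.

    `last_rolled_ms` is the start of the first bucket not yet rolled up.
    Hypothetical helper; it ignores the job's `delay` setting for simplicity.
    """
    starts = []
    start = last_rolled_ms
    # A bucket is complete only once its full hour has elapsed.
    while start + HOUR_MS <= now_ms:
        starts.append(start)
        start += HOUR_MS
    return starts


def simulate(first_bucket_ms, activations, period_ms=30_000):
    """Count how many cron activations find work versus go back to sleep."""
    position = first_bucket_ms
    worked = idle = 0
    for i in range(1, activations + 1):
        now = first_bucket_ms + i * period_ms
        ready = complete_hours(position, now)
        if ready:
            worked += 1
            position = ready[-1] + HOUR_MS  # resume after the last rolled bucket
        else:
            idle += 1
    return worked, idle


# 240 activations at 30-second spacing cover two hours of wall-clock time.
print(simulate(0, 240))  # → (2, 238)
```

Only two of the 240 activations find a complete hour to process; the rest return immediately, which is why a short cron on a long interval is harmless but mostly idle.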

For more details about the job syntax, see <<rollup-job-config>>.


After you execute the above command and create the job, you'll receive the following response:

[source,js]
----
{
  "acknowledged": true
}
----
// TESTRESPONSE

[float]
=== Starting the job

After the job is created, it will be sitting in an inactive state.  Jobs need to be started before they begin processing data (this allows
you to stop them later as a way to temporarily pause, without deleting the configuration).

To start the job, execute this command:

[source,js]
--------------------------------------------------
POST _xpack/rollup/job/sensor/_start
--------------------------------------------------
// CONSOLE
// TEST[setup:sensor_rollup_job]

[float]
=== Searching the Rolled results

After the job has run and processed some data, we can use the <<rollup-search>> endpoint to do some searching.  The Rollup feature is designed
so that you can use the same Query DSL syntax that you are accustomed to; it just happens to run on the rolled-up data instead.

For example, take this query:

[source,js]
--------------------------------------------------
GET /sensor_rollup/_rollup_search
{
    "size": 0,
    "aggregations": {
        "max_temperature": {
            "max": {
                "field": "temperature"
            }
        }
    }
}
--------------------------------------------------
// CONSOLE
// TEST[setup:sensor_prefab_data]

It's a simple aggregation that calculates the maximum of the `temperature` field.  But you'll notice that it is being sent to the `sensor_rollup`
index instead of the raw `sensor-*` indices.  And you'll also notice that it is using the `_rollup_search` endpoint.  Otherwise the syntax
is exactly as you'd expect.

If you were to execute that query, you'd receive a result that looks like a normal aggregation response:

[source,js]
----
{
  "took" : 102,
  "timed_out" : false,
  "terminated_early" : false,
  "_shards" : ... ,
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "max_temperature" : {
      "value" : 202.0
    }
  }
}
----
// TESTRESPONSE[s/"took" : 102/"took" : $body.$_path/]
// TESTRESPONSE[s/"_shards" : \.\.\. /"_shards" : $body.$_path/]

The only notable difference is that Rollup search results have zero `hits`, because we aren't really searching the original, live data
anymore.  Otherwise it's identical syntax.

There are a few interesting takeaways here.  First, even though the data was rolled up with hourly intervals and partitioned by
node name, the query we ran is just calculating the max temperature across all documents.  The `groups` that were configured in the job
are not mandatory elements of a query; they are just extra dimensions you can partition on.  Second, the request and response syntax
is nearly identical to normal DSL, making it easy to integrate into dashboards and applications.

Finally, we can use those grouping fields we defined to construct a more complicated query:

[source,js]
--------------------------------------------------
GET /sensor_rollup/_rollup_search
{
    "size": 0,
    "aggregations": {
        "timeline": {
            "date_histogram": {
                "field": "timestamp",
                "interval": "7d"
            },
            "aggs": {
                "nodes": {
                    "terms": {
                        "field": "node"
                    },
                    "aggs": {
                        "max_temperature": {
                            "max": {
                                "field": "temperature"
                            }
                        },
                        "avg_voltage": {
                            "avg": {
                                "field": "voltage"
                            }
                        }
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// CONSOLE
// TEST[setup:sensor_prefab_data]

This returns a corresponding response:

[source,js]
----
{
  "took" : 93,
  "timed_out" : false,
  "terminated_early" : false,
  "_shards" : ... ,
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "timeline" : {
      "meta" : { },
      "buckets" : [
        {
          "key_as_string" : "2018-01-18T00:00:00.000Z",
          "key" : 1516233600000,
          "doc_count" : 6,
          "nodes" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "a",
                "doc_count" : 2,
                "max_temperature" : {
                  "value" : 202.0
                },
                "avg_voltage" : {
                  "value" : 5.1499998569488525
                }
              },
              {
                "key" : "b",
                "doc_count" : 2,
                "max_temperature" : {
                  "value" : 201.0
                },
                "avg_voltage" : {
                  "value" : 5.700000047683716
                }
              },
              {
                "key" : "c",
                "doc_count" : 2,
                "max_temperature" : {
                  "value" : 202.0
                },
                "avg_voltage" : {
                  "value" : 4.099999904632568
                }
              }
            ]
          }
        }
      ]
    }
  }
}
----
// TESTRESPONSE[s/"took" : 93/"took" : $body.$_path/]
// TESTRESPONSE[s/"_shards" : \.\.\. /"_shards" : $body.$_path/]

In addition to being more complicated (a date histogram and a terms aggregation, plus an additional average metric), you'll notice
the date_histogram uses a `7d` interval instead of `1h`.  This is possible because the hourly summaries can serve any interval of one hour or greater.
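
As a rough illustration of why a coarser interval can be answered from hourly summaries, the Python sketch below regroups hourly rows into 7-day buckets.  It is a conceptual model, not the Rollup search implementation: the `regroup` helper, the row layout (`max_temperature`, `voltage_sum`, `voltage_count`), and the sample values are all assumptions made for this example.

```python
from collections import defaultdict

DAY_MS = 24 * 60 * 60 * 1000  # one day in milliseconds


def regroup(hourly, interval_ms):
    """Combine hourly summaries into wider buckets keyed by (bucket start, node).

    Hypothetical helper; `hourly` uses an illustrative layout, not the stored one.
    """
    merged = defaultdict(lambda: {"max_temperature": float("-inf"),
                                  "voltage_sum": 0.0, "voltage_count": 0})
    for row in hourly:
        # Floor the hourly bucket start to the start of the wider bucket.
        start = row["timestamp"] - row["timestamp"] % interval_ms
        bucket = merged[(start, row["node"])]
        # A max of maxes equals the max over the underlying documents.
        bucket["max_temperature"] = max(bucket["max_temperature"], row["max_temperature"])
        # The average is rebuilt from summed totals, not by averaging averages.
        bucket["voltage_sum"] += row["voltage_sum"]
        bucket["voltage_count"] += row["voltage_count"]
    return {
        key: {"max_temperature": b["max_temperature"],
              "avg_voltage": b["voltage_sum"] / b["voltage_count"]}
        for key, b in merged.items()
    }


# Two made-up hourly summaries for node "a", one hour apart.
hourly = [
    {"timestamp": 1516726800000, "node": "a", "max_temperature": 202,
     "voltage_sum": 10.5, "voltage_count": 2},
    {"timestamp": 1516730400000, "node": "a", "max_temperature": 199,
     "voltage_sum": 5.0, "voltage_count": 1},
]
weekly = regroup(hourly, 7 * DAY_MS)
print(weekly)
```

Both hourly rows land in the bucket starting at `1516233600000`, the same key that appears in the response above.  The reverse is not possible: nothing finer than one hour can be recovered from the summaries.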

[float]
=== Conclusion

This quickstart should have provided a concise overview of the core functionality that Rollup exposes.  There are more tips and things
to consider when setting up Rollups, which you can find throughout the rest of this section.  You may also explore the <<rollup-api-quickref,REST API>>
for an overview of what is available.