[Docs] Add quickstart and limitation documentation for Rollups

Original commit: elastic/x-pack-elasticsearch@cb4aaa0992
Zachary Tong 2018-03-30 20:43:33 +00:00
parent e8a6c9f5d1
commit 574ce84885
6 changed files with 408 additions and 10 deletions


@ -143,7 +143,7 @@ The `date_histogram` group has several parameters:
`time_zone`::
Defines what time_zone the rollup documents are stored as. Unlike raw data, which can shift timezones on the fly, rolled documents have
to be stored with a specific timezone. By default, rollup documents are stored in `UTC`, but this can be changed with the `time_zone`
parameter.
===== Terms


@ -20,8 +20,13 @@ and rewrites it back to what a client would expect given the original query.
`index`::
(string) Index, indices or index-pattern to execute a rollup search against. This can include both rollup and non-rollup
indices.
Rules for the `index` parameter:
- At least one index/index-pattern must be specified. This can be either a rollup or non-rollup index. Omitting the index parameter,
or using `_all`, is not permitted.
- Multiple non-rollup indices may be specified.
- Only one rollup index may be specified. If more than one is supplied, an exception will be thrown.
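For example, a search that mixes several non-rollup indices with a single rollup index is valid. A sketch, borrowing the `sensor`
example indices from <<rollup-getting-started,Getting Started>>:
[source,js]
--------------------------------------------------
GET /sensor-2017-01-01,sensor-2017-01-02,sensor_rollup/_rollup_search
{
  "size": 0,
  "aggregations": {
    "max_temperature": {
      "max": {
        "field": "temperature"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE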
==== Request Body


@ -18,7 +18,6 @@ for analysis, but at a fraction of the storage cost of raw data.
* <<rollup-getting-started,Getting Started>>
* <<rollup-api-quickref, API Quick Reference>>
* <<rollup-understanding-groups,Understanding Rollup Grouping>>
* <<rollup-timezones,Dealing with Timezones>>
* <<rollup-search-limitations,Limitations of Rollup Search>>
@ -28,5 +27,4 @@ include::overview.asciidoc[]
include::api-quickref.asciidoc[]
include::rollup-getting-started.asciidoc[]
include::understanding-groups.asciidoc[]
include::timezones.asciidoc[]
include::rollup-search-limitations.asciidoc[]


@ -1,4 +1,293 @@
[[rollup-getting-started]]
== Getting Started
To use the Rollup feature, you need to create one or more "Rollup Jobs". These jobs run continuously in the background
and roll up the index or indices that you specify, placing the rolled documents in a secondary index (also of your choosing).
Imagine you have a series of daily indices that hold sensor data (`sensor-2017-01-01`, `sensor-2017-01-02`, etc.). A sample document might
look like this:
[source,js]
--------------------------------------------------
{
"timestamp": 1516729294000,
"temperature": 200,
"voltage": 5.2,
"node": "a"
}
--------------------------------------------------
// NOTCONSOLE
[float]
=== Creating a Rollup Job
We'd like to roll up these documents into hourly summaries, which will allow us to generate reports and dashboards at any time interval
of one hour or greater. A rollup job might look like this:
[source,js]
--------------------------------------------------
PUT _xpack/rollup/job/sensor
{
"index_pattern": "sensor-*",
"rollup_index": "sensor_rollup",
"cron": "*/30 * * * * ?",
"size" :1000,
"groups" : {
"date_histogram": {
"field": "timestamp",
"interval": "1h",
"delay": "7d"
},
"terms": {
"fields": ["node"]
}
},
"metrics": [
{
"field": "temperature",
"metrics": ["min", "max", "sum"]
},
{
"field": "voltage",
"metrics": ["avg"]
}
]
}
--------------------------------------------------
// CONSOLE
We give the job the ID of "sensor" (in the URL: `PUT _xpack/rollup/job/sensor`), and tell it to roll up the index pattern `"sensor-*"`.
This job will find and rollup any index that matches that pattern. Rollup summaries are then stored in the `"sensor_rollup"` index.
The `cron` parameter controls when and how often the job activates. When a rollup job's cron schedule triggers, it will begin rolling up
from where it left off after the last activation. So if you configure the cron to run every 30 seconds, the job will process the last 30
seconds worth of data that was indexed into the `sensor-*` indices.
If instead the cron was configured to run once a day at midnight, the job would process the last 24 hours worth of data. The choice is largely
a matter of preference, based on how "realtime" you want the rollups to be, and whether you wish to process continuously or move the work to off-peak hours.
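For instance, if you would rather run the job once a day at midnight, the `cron` parameter could be changed accordingly. A sketch,
using the same Quartz-style cron syntax as the job configuration above:
[source,js]
--------------------------------------------------
"cron": "0 0 0 * * ?"
--------------------------------------------------
// NOTCONSOLE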
Next, we define a set of `groups` and `metrics`. The metrics are fairly straightforward: we want to save the min/max/sum of the `temperature`
field, and the average of the `voltage` field.
The groups are a little more interesting. Essentially, we are defining the dimensions that we wish to pivot on at a later date when
querying the data. The grouping in this job allows us to use `date_histogram` aggregations on the `timestamp` field, rolled up at hourly intervals.
It also allows us to run terms aggregations on the `node` field.
.Date histogram interval vs cron schedule
**********************************
You'll note that the job's cron is configured to run every 30 seconds, but the date_histogram is configured to
rollup at hourly intervals. How do these relate?
The date_histogram controls the granularity of the saved data. Data will be rolled up into hourly intervals, and you will be unable
to query with finer granularity. The cron simply controls when the process looks for new data to roll up. Every 30 seconds it will see
if there is a new hour's worth of data and roll it up. If not, the job goes back to sleep.
Often, it doesn't make sense to define such a small cron (30s) on a large interval (1h), because the majority of the activations will
simply go back to sleep. But there's nothing wrong with it either; the job will do the right thing.
**********************************
For more details about the job syntax, see <<rollup-job-config>>.
After you execute the above command and create the job, you'll receive the following response:
[source,js]
----
{
"acknowledged": true
}
----
// TESTRESPONSE
[float]
=== Starting the Job
After the job is created, it will be sitting in an inactive state. Jobs need to be started before they begin processing data (this allows
you to stop them later as a way to temporarily pause, without deleting the configuration).
To start the job, execute this command:
[source,js]
--------------------------------------------------
POST _xpack/rollup/job/sensor/_start
--------------------------------------------------
// CONSOLE
// TEST[setup:sensor_rollup_job]
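Should you need to pause the job later (to temporarily halt processing without deleting the configuration), there is a corresponding
`_stop` endpoint. A sketch:
[source,js]
--------------------------------------------------
POST _xpack/rollup/job/sensor/_stop
--------------------------------------------------
// NOTCONSOLE
A stopped job keeps its configuration and its position in the data, so starting it again should resume processing where it left off.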
[float]
=== Searching the Rolled Results
After the job has run and processed some data, we can use the <<rollup-search>> endpoint to do some searching. The Rollup feature is designed
so that you can use the same Query DSL syntax that you are accustomed to... it just happens to run on the rolled up data instead.
For example, take this query:
[source,js]
--------------------------------------------------
GET /sensor_rollup/_rollup_search
{
"size": 0,
"aggregations": {
"max_temperature": {
"max": {
"field": "temperature"
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:sensor_prefab_data]
It's a simple aggregation that calculates the maximum of the `temperature` field. But you'll notice that it is being sent to the `sensor_rollup`
index instead of the raw `sensor-*` indices. And you'll also notice that it is using the `_rollup_search` endpoint. Otherwise the syntax
is exactly as you'd expect.
If you were to execute that query, you'd receive a result that looks like a normal aggregation response:
[source,js]
----
{
"took" : 102,
"timed_out" : false,
"terminated_early" : false,
"_shards" : ... ,
"hits" : {
"total" : 0,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"max_temperature" : {
"value" : 202.0
}
}
}
----
// TESTRESPONSE[s/"took" : 102/"took" : $body.$_path/]
// TESTRESPONSE[s/"_shards" : \.\.\. /"_shards" : $body.$_path/]
The only notable difference is that Rollup search results have zero `hits`, because we aren't really searching the original, live data any
more. Otherwise it's identical syntax.
There are a few interesting takeaways here. Firstly, even though the data was rolled up with hourly intervals and partitioned by
node name, the query we ran is just calculating the max temperature across all documents. The `groups` that were configured in the job
are not mandatory elements of a query; they are just extra dimensions you can partition on. Second, the request and response syntax
is nearly identical to normal DSL, making it easy to integrate into dashboards and applications.
Finally, we can use those grouping fields we defined to construct a more complicated query:
[source,js]
--------------------------------------------------
GET /sensor_rollup/_rollup_search
{
"size": 0,
"aggregations": {
"timeline": {
"date_histogram": {
"field": "timestamp",
"interval": "7d"
},
"aggs": {
"nodes": {
"terms": {
"field": "node"
},
"aggs": {
"max_temperature": {
"max": {
"field": "temperature"
}
},
"avg_voltage": {
"avg": {
"field": "voltage"
}
}
}
}
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:sensor_prefab_data]
Which returns a corresponding response:
[source,js]
----
{
"took" : 93,
"timed_out" : false,
"terminated_early" : false,
"_shards" : ... ,
"hits" : {
"total" : 0,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"timeline" : {
"meta" : { },
"buckets" : [
{
"key_as_string" : "2018-01-18T00:00:00.000Z",
"key" : 1516233600000,
"doc_count" : 6,
"nodes" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "a",
"doc_count" : 2,
"max_temperature" : {
"value" : 202.0
},
"avg_voltage" : {
"value" : 5.1499998569488525
}
},
{
"key" : "b",
"doc_count" : 2,
"max_temperature" : {
"value" : 201.0
},
"avg_voltage" : {
"value" : 5.700000047683716
}
},
{
"key" : "c",
"doc_count" : 2,
"max_temperature" : {
"value" : 202.0
},
"avg_voltage" : {
"value" : 4.099999904632568
}
}
]
}
}
]
}
}
}
----
// TESTRESPONSE[s/"took" : 93/"took" : $body.$_path/]
// TESTRESPONSE[s/"_shards" : \.\.\. /"_shards" : $body.$_path/]
In addition to being more complicated (date histogram and a terms aggregation, plus an additional average metric), you'll notice
the date_histogram uses a `7d` interval instead of `1h`.
[float]
=== Conclusion
This quickstart should have provided a concise overview of the core functionality that Rollup exposes. There are more tips and things
to consider when setting up Rollups, which you can find throughout the rest of this section. You may also explore the <<rollup-api-quickref,REST API>>
for an overview of what is available.


@ -1,4 +1,114 @@
[[rollup-search-limitations]]
== Rollup Search Limitations
While we feel the Rollup feature is extremely flexible, the nature of summarizing data means there will be some limitations. Once
live data is thrown away, you will always lose some flexibility.
This page highlights the major limitations so that you are aware of them.
[float]
=== Only one Rollup index per search
When using the <<rollup-search>> endpoint, the `index` parameter accepts one or more indices. These can be a mix of regular, non-rollup
indices and rollup indices. However, only one rollup index can be specified. The exact list of rules for the `index` parameter is as
follows:
- At least one index/index-pattern must be specified. This can be either a rollup or non-rollup index. Omitting the index parameter,
or using `_all`, is not permitted.
- Multiple non-rollup indices may be specified.
- Only one rollup index may be specified. If more than one is supplied, an exception will be thrown.
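To illustrate, assume a second, hypothetical rollup index named `other_rollup` exists alongside `sensor_rollup`. The first request
below follows the rules, while the second would throw an exception:
[source,js]
--------------------------------------------------
GET /sensor-2017-01-01,sensor-2017-01-02,sensor_rollup/_rollup_search <1>
GET /sensor_rollup,other_rollup/_rollup_search <2>
--------------------------------------------------
// NOTCONSOLE
<1> Valid: multiple non-rollup indices plus a single rollup index
<2> Invalid: two rollup indices in the same search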
This limitation is driven by the logic that decides which jobs are the "best" for any given query. If you have ten jobs stored in a single
index, which cover the source data with varying degrees of completeness and different intervals, the query needs to determine which set
of jobs to actually search. Incorrect decisions can lead to inaccurate aggregation results (e.g. over-counting doc counts, or bad metrics).
Needless to say, this is a technically challenging piece of code.
To help simplify the problem, we have limited search to just one rollup index at a time (which may contain multiple jobs). In the future we
may be able to open this up to multiple rollup jobs.
[float]
=== Can only aggregate what's been stored
A perhaps obvious limitation, but rollups can only aggregate on data that has been stored in the rollups. If you don't configure the
rollup job to store metrics about the `price` field, you won't be able to use the `price` field in any query or aggregation.
For example, the `temperature` field in the following query has been stored in a rollup job... but not with an `avg` metric. Which means
the usage of `avg` here is not allowed:
[source,js]
--------------------------------------------------
GET sensor_rollup/_rollup_search
{
"size": 0,
"aggregations": {
"avg_temperature": {
"avg": {
"field": "temperature"
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
// TEST[catch:/illegal_argument_exception/]
The response will tell you that the field and aggregation were not possible, because no rollup jobs were found which contained them:
[source,js]
----
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "There is not a rollup job that has a [avg] agg with name [avg_temperature] which also satisfies all requirements of query.",
"stack_trace": ...
}
],
"type" : "illegal_argument_exception",
"reason" : "There is not a rollup job that has a [avg] agg with name [avg_temperature] which also satisfies all requirements of query.",
"stack_trace": ...
},
"status": 400
}
----
// TESTRESPONSE[s/"stack_trace": \.\.\./"stack_trace": $body.$_path/]
[float]
=== Interval Granularity
Rollups are stored at a certain granularity, as defined by the `date_histogram` group in the configuration. If data is rolled up at hourly
intervals, the <<rollup-search>> API can aggregate on any time interval hourly or greater. Intervals that are less than an hour will throw
an exception, since the data simply doesn't exist for finer granularities.
Because the RollupSearch endpoint can "upsample" intervals, there is no need to configure jobs with multiple intervals (hourly, daily, etc.).
It's recommended to just configure a single job with the smallest granularity that is needed, and allow the search endpoint to upsample
as needed.
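For example, against the hourly `sensor` job from <<rollup-getting-started,Getting Started>>, a daily `date_histogram` is a valid
upsampled request, while anything finer than `1h` would throw an exception. A sketch:
[source,js]
--------------------------------------------------
GET /sensor_rollup/_rollup_search
{
  "size": 0,
  "aggregations": {
    "daily_timeline": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1d" <1>
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> `1d` is coarser than the stored `1h` interval, so the endpoint can upsample. An interval such as `30m` would be rejected.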
That said, if multiple jobs are present in a single rollup index with varying intervals, the search endpoint will identify and use the job(s)
with the largest interval to satisfy the search request.
[float]
=== Limited querying components
The Rollup functionality allows a `query` in the search request, but only with a limited subset of components. The queries currently allowed are:
- Term Query
- Terms Query
- Range Query
- MatchAll Query
- Any compound query (Boolean, Boosting, ConstantScore, etc)
Furthermore, these queries can only use fields that were also saved in the rollup job. If you wish to filter on a keyword `hostname` field,
that field must have been configured in the rollup job under a `terms` grouping.
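For example, a `bool` query combining a term filter and a range filter, both on fields configured as groups in the example `sensor`
job, might look like the following sketch:
[source,js]
--------------------------------------------------
GET /sensor_rollup/_rollup_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "node": "a" } }, <1>
        { "range": { "timestamp": { "gte": 1514764800000 } } } <2>
      ]
    }
  },
  "aggregations": {
    "max_temperature": {
      "max": {
        "field": "temperature"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> `node` was saved in the job under a `terms` grouping
<2> `timestamp` was saved in the job under the `date_histogram` grouping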
If you attempt to use an unsupported query, or the query references a field that wasn't configured in the rollup job, an exception will be
thrown. We expect the list of supported queries to grow over time as more are implemented.
[float]
=== Timezones
Rollup documents are stored in the timezone of the `date_histogram` group configuration in the job. If no timezone is specified, the default
is to roll up timestamps in `UTC`.
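As a hypothetical example, a job that stores its rollups in the `America/New_York` timezone would set the `time_zone` parameter in the
`date_histogram` group. A configuration fragment, not a complete job:
[source,js]
--------------------------------------------------
"groups": {
  "date_histogram": {
    "field": "timestamp",
    "interval": "1h",
    "time_zone": "America/New_York" <1>
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> Hypothetical timezone choice; any TZ database identifier should work here
Because rolled documents cannot shift timezones on the fly the way raw data can, queries against this index are effectively fixed to the configured timezone.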


@ -1,4 +0,0 @@
[[rollup-timezones]]
== Dealing with Timezones
todo