[role="xpack"]
[testenv="basic"]
[[rollup-job-config]]
=== Rollup job configuration

experimental[]

The Rollup Job Configuration contains all the details about how the rollup job should run, when it indexes documents,
and what future queries will be able to execute against the rollup index.

There are three main sections to the job configuration: the logistical details about the job (cron schedule, etc.), which fields
should be grouped on, and which metrics to collect for each group.

A full job configuration might look like this:

[source,js]
--------------------------------------------------
PUT _rollup/job/sensor
{
  "index_pattern": "sensor-*",
  "rollup_index": "sensor_rollup",
  "cron": "*/30 * * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "timestamp",
      "interval": "60m",
      "delay": "7d"
    },
    "terms": {
      "fields": ["hostname", "datacenter"]
    },
    "histogram": {
      "fields": ["load", "net_in", "net_out"],
      "interval": 5
    }
  },
  "metrics": [
    {
      "field": "temperature",
      "metrics": ["min", "max", "sum"]
    },
    {
      "field": "voltage",
      "metrics": ["avg"]
    }
  ]
}
--------------------------------------------------
// CONSOLE
// TEST[setup:sensor_index]
==== Logistical Details

In the above example, there are several pieces of logistical configuration for the job itself.

`{job_id}` (required)::
(string) In the endpoint URL, you specify the name of the job (`sensor` in the above example). This can be any alphanumeric string,
and it uniquely identifies the data that is associated with the rollup job. The ID is persistent, in that it is stored with the rolled
up data. So if you create a job, let it run for a while, then delete the job, the data that the job rolled up will still be
associated with this job ID. You will be unable to create a new job with the same ID, as that could lead to problems with mismatched
job configurations.

`index_pattern` (required)::
(string) The index, or index pattern, that you wish to roll up. Supports wildcard-style patterns (`logstash-*`). The job will
attempt to roll up the entire index or index pattern. Once the "backfill" is finished, it will periodically (as defined by the cron)
look for new data and roll that up too.

`rollup_index` (required)::
(string) The index that you wish to store rollup results into. All the rollup data that is generated by the job will be
stored in this index. When searching the rollup data, this index will be used in the <<rollup-search,Rollup Search>> endpoint's URL.
The rollup index can be shared with other rollup jobs. The data is stored so that it doesn't interfere with unrelated jobs.

`cron` (required)::
(string) A cron string which defines when the rollup job should be executed. The cron string defines an interval of when to run
the job's indexer. When the interval triggers, the indexer will attempt to roll up the data in the index pattern. The cron pattern
is unrelated to the time interval of the data being rolled up. For example, you may wish to create hourly rollups of your documents (as
defined in the <<rollup-groups-config,grouping configuration>>) but only run the indexer on a daily basis at midnight, as defined by the cron.
The cron pattern is defined just like Watcher's Cron Schedule.
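For example, keeping the job's hourly `60m` interval but only waking the indexer once per day at midnight could use a schedule like the fragment below (a hypothetical sketch; the six Quartz-style fields are seconds, minutes, hours, day-of-month, month, and day-of-week):

[source,js]
--------------------------------------------------
"cron": "0 0 0 * * ?"
--------------------------------------------------
// NOTCONSOLE

`0 0 0 * * ?` fires once per day at 00:00:00; each run would then roll up all new hourly buckets that have accumulated since the previous run.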
`page_size` (required)::
(int) The number of bucket results that should be processed on each iteration of the rollup indexer. A larger value
will tend to execute faster, but will require more memory during processing. This has no effect on how the data is rolled up; it is
merely used for tweaking the speed/memory cost of the indexer.
[NOTE]
The `index_pattern` cannot be a pattern that would also match the destination `rollup_index`. E.g. the pattern
`"foo-*"` would match the rollup index `"foo-rollup"`. This causes problems because the rollup job would attempt
to roll up its own data at runtime. If you attempt to configure a pattern that matches the `rollup_index`, an exception
will be thrown to prevent this behavior.
[[rollup-groups-config]]
==== Grouping Config

The `groups` section of the configuration is where you decide which fields should be grouped on, and with what aggregations. These
fields will then be available later for aggregating into buckets. For example, this configuration:

[source,js]
--------------------------------------------------
"groups": {
  "date_histogram": {
    "field": "timestamp",
    "interval": "60m",
    "delay": "7d"
  },
  "terms": {
    "fields": ["hostname", "datacenter"]
  },
  "histogram": {
    "fields": ["load", "net_in", "net_out"],
    "interval": 5
  }
}
--------------------------------------------------
// NOTCONSOLE
Allows `date_histogram` aggregations to be used on the `"timestamp"` field, `terms` aggregations to be used on the `"hostname"` and `"datacenter"`
fields, and `histogram` aggregations to be used on any of the `"load"`, `"net_in"`, and `"net_out"` fields.

Importantly, these aggs/fields can be used in any combination. Think of the `groups` configuration as defining a set of tools that can
later be used in aggregations to partition the data. Unlike raw data, where any field can be aggregated ad hoc, with rollups you have to think ahead about which fields and aggregations might be used.
But rollups provide enough flexibility that you simply need to determine _which_ fields are needed, not _in what order_ they are needed.

There are three types of groupings currently available:
===== Date Histogram

A `date_histogram` group aggregates a `date` field into time-based buckets. The `date_histogram` group is *mandatory* -- you currently
cannot roll up documents without a timestamp and a `date_histogram` group.

The `date_histogram` group has several parameters:

`field` (required)::
The date field that is to be rolled up.

`interval` (required)::
The interval of time buckets to be generated when rolling up. E.g. `"60m"` will produce 60-minute (hourly) rollups. This follows standard time formatting
syntax as used elsewhere in Elasticsearch. The `interval` defines the _minimum_ interval that can later be aggregated. If hourly (`"60m"`)
intervals are configured, <<rollup-search,Rollup Search>> can execute aggregations with 60m or greater (weekly, monthly, etc.) intervals.
So define the interval as the smallest unit that you wish to later query.

Note: smaller, more granular intervals take up proportionally more space.
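For instance, a job configured with a `"60m"` interval can later service a daily aggregation. A hypothetical <<rollup-search,Rollup Search>> request body might contain an aggregation like the sketch below (field name follows the example job above):

[source,js]
--------------------------------------------------
"aggregations": {
  "daily_buckets": {
    "date_histogram": {
      "field": "timestamp",
      "interval": "24h"
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

Because `"24h"` is a whole multiple of the configured `"60m"`, this request can be answered from the rollup data; a `"30m"` request could not.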
`delay`::
How long to wait before rolling up new documents. By default, the indexer attempts to roll up all data that is available. However, it
is not uncommon for data to arrive out of order, sometimes even a few days late. The indexer is unable to deal with data that arrives
after a time-span has been rolled up (e.g. there is no provision to update already-existing rollups).

Instead, you should specify a `delay` that matches the longest period of time you expect out-of-order data to arrive. E.g. a `delay` of
`"1d"` will instruct the indexer to roll up documents up to `"now - 1d"`, which provides a day of buffer time for out-of-order documents
to arrive.
`time_zone`::
Defines the time zone in which the rollup documents are stored. Unlike raw data, which can shift timezones on the fly, rolled documents have
to be stored with a specific timezone. By default, rollup documents are stored in `UTC`, but this can be changed with the `time_zone`
parameter.
.Calendar vs Fixed time intervals
**********************************
Elasticsearch understands both "calendar" and "fixed" time intervals. Fixed time intervals are fairly easy to understand;
`"60s"` means sixty seconds. But what does `"1M"` mean? One month of time depends on which month we are talking about;
some months are longer or shorter than others. This is an example of "calendar" time, and the duration of that unit
depends on context. Calendar units are also affected by leap-seconds, leap-years, etc.

This is important because the buckets generated by Rollup will be in either calendar or fixed intervals, which limits
how you can query them later (see <<rollup-search-limitations-intervals, Requests must be multiples of the config>>).

We recommend sticking with "fixed" time intervals, since they are easier to understand and are more flexible at query
time. Fixed intervals will introduce some drift in your data during leap-events, and you will have to think about months as a fixed
quantity (30 days) instead of the actual calendar length... but it is often easier than dealing with calendar units
at query time.

Multiples of units are always "fixed" (e.g. `"2h"` is always the fixed quantity `7200` seconds). Single units can be
fixed or calendar depending on the unit:

[options="header"]
|=======
|Unit        |Calendar |Fixed
|millisecond |NA       |`1ms`, `10ms`, etc
|second      |NA       |`1s`, `10s`, etc
|minute      |`1m`     |`2m`, `10m`, etc
|hour        |`1h`     |`2h`, `10h`, etc
|day         |`1d`     |`2d`, `10d`, etc
|week        |`1w`     |NA
|month       |`1M`     |NA
|quarter     |`1q`     |NA
|year        |`1y`     |NA
|=======

For units that have both fixed and calendar forms, you may need to express the quantity in terms of the next
smaller unit. For example, if you want a fixed day (not a calendar day), you should specify `24h` instead of `1d`.
Similarly, if you want a fixed hour, specify `60m` instead of `1h`. This is because the single quantity implies
calendar time, which limits you to querying by calendar time in the future.

**********************************
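For example, to guarantee fixed (non-calendar) daily buckets, the grouping from the example job above could be written with `24h` rather than `1d` (a fragment of the `groups` section):

[source,js]
--------------------------------------------------
"date_histogram": {
  "field": "timestamp",
  "interval": "24h",
  "delay": "7d"
}
--------------------------------------------------
// NOTCONSOLE

With `24h`, every bucket is exactly 86,400 seconds wide, regardless of daylight-saving transitions or month boundaries.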
===== Terms

The `terms` group can be used on `keyword` or numeric fields, to allow bucketing via the `terms` aggregation at a later point. The `terms`
group is optional. If defined, the indexer will enumerate and store _all_ values of a field for each time-period. This can be potentially
costly for high-cardinality groups such as IP addresses, especially if the time-bucket is particularly sparse.

While it is unlikely that a rollup will ever be larger in size than the raw data, defining `terms` groups on multiple high-cardinality fields
can greatly reduce how well the rollup compresses. For that reason, you should be judicious about which high-cardinality fields are included.

The `terms` group has a single parameter:

`fields` (required)::
The set of fields that you wish to collect terms for. This array can contain fields that are both `keyword` and numerics. Order
does not matter.
===== Histogram

The `histogram` group aggregates one or more numeric fields into numeric histogram intervals. This group is optional.

The `histogram` group has two parameters:

`fields` (required)::
The set of fields that you wish to build histograms for. All fields specified must be some kind of numeric. Order does not matter.

`interval` (required)::
The interval of histogram buckets to be generated when rolling up. E.g. `5` will create buckets that are five units wide
(`0-5`, `5-10`, etc). Note that only one interval can be specified in the `histogram` group, meaning that all fields being grouped via
the histogram must share the same interval.
[[rollup-metrics-config]]
==== Metrics Config

After defining which groups should be generated for the data, you next configure which metrics should be collected. By default, only
the doc_counts are collected for each group. To make rollup useful, you will often add metrics like averages, mins, maxes, etc.

Metrics are defined on a per-field basis, and for each field you configure which metrics should be collected. For example:
[source,js]
--------------------------------------------------
"metrics": [
  {
    "field": "temperature",
    "metrics": ["min", "max", "sum"]
  },
  {
    "field": "voltage",
    "metrics": ["avg"]
  }
]
--------------------------------------------------
// NOTCONSOLE
This configuration defines metrics over two fields, `"temperature"` and `"voltage"`. For the `"temperature"` field, we are collecting
the min, max, and sum of the temperature. For `"voltage"`, we are collecting the average. These metrics are collected in a way that makes
them compatible with any combination of defined groups.

The `metrics` configuration accepts an array of objects, where each object has two parameters:

`field` (required)::
The field to collect metrics for. This must be a numeric of some kind.

`metrics` (required)::
An array of metrics to collect for the field. At least one metric must be configured. Acceptable metrics are min/max/sum/avg/value_count.
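For example, to also record how many voltage readings contributed to each bucket, `value_count` can be added alongside the average:

[source,js]
--------------------------------------------------
"metrics": [
  {
    "field": "voltage",
    "metrics": ["avg", "value_count"]
  }
]
--------------------------------------------------
// NOTCONSOLE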
.Averages aren't composable?!
**********************************
If you've worked with rollups before, you may be cautious around averages. If an average is saved for a 10-minute
interval, it usually isn't useful for larger intervals. You cannot average six 10-minute averages to find an
hourly average (the average of averages is not equal to the total average).

For this reason, other systems tend to either omit the ability to average, or store the average at multiple intervals
to support more flexible querying.

Instead, the Rollup feature saves the `count` and `sum` for the defined time interval. This allows us to reconstruct
the average at any interval greater than or equal to the defined interval. This gives maximum flexibility for
minimal storage costs... and you don't have to worry about average accuracies (no average of averages here!)
**********************************
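A toy calculation illustrates why the sum/count pair composes while averages do not (just arithmetic, not API output):

[source,js]
--------------------------------------------------
// Two 10-minute buckets of readings:
//   bucket A: [1, 2, 3]  ->  sum 6,  count 3,  avg 2.0
//   bucket B: [10]       ->  sum 10, count 1,  avg 10.0
//
// Average of averages: (2.0 + 10.0) / 2    = 6.0  (wrong)
// Sum over count:      (6 + 10) / (3 + 1)  = 4.0  (correct)
--------------------------------------------------
// NOTCONSOLE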