OpenSearch/docs/en/rollup/understanding-groups.asciidoc
Zachary Tong bf1550a0b2 Rollups for Elasticsearch (elastic/x-pack-elasticsearch#4002)
This adds a new Rollup module to XPack, which allows users to configure periodic "rollup jobs" to pre-aggregate data.  That data is then available later for search through a special RollupSearch API, which mimics the DSL and functionality of regular search.

Rollups are used to drastically reduce the on-disk footprint of metric-based data (e.g. timestamped document with numeric and keyword fields).  It can also be used to speed up aggregations over large datasets, since the rolled data will be considerably smaller and fewer documents to search.

The PR adds seven new endpoints to interact with Rollups; create/get/delete job, start/stop job, a capabilities API similar to field-caps, and a Rollup-enabled search.

Original commit: elastic/x-pack-elasticsearch@dcde91aacf
2018-02-23 17:10:37 -05:00

118 lines
3.7 KiB
Plaintext

[[rollup-understanding-groups]]
== Understanding Groups
To preserve flexibility, Rollup Jobs are defined based on how future queries may need to use the data. Traditionally, systems force
the admin to make decisions about what metrics to rollup and on what interval. E.g. The average of `cpu_time` on an hourly basis. This
is limiting; if, at a future date, the admin wishes to see the average of `cpu_time` on an hourly basis _and partitioned by `host_name`_,
they are out of luck.
Of course, the admin can decide to rollup the `[hour, host]` tuple on an hourly basis, but as the number of grouping keys grows, so do the
number of tuples the admin needs to configure. Furthermore, these `[hours, host]` tuples are only useful for hourly rollups... daily, weekly,
or monthly rollups all require new configurations.
Rather than force the admin to decide ahead of time which individual tuples should be rolled up, Elasticsearch's Rollup jobs are configured
based on which groups are potentially useful to future queries. For example, this configuration:
[source,js]
--------------------------------------------------
"groups" : {
"date_histogram": {
"field": "timestamp",
"interval": "1h",
"delay": "7d"
},
"terms": {
"fields": ["hostname", "datacenter"]
},
"histogram": {
"fields": ["load", "net_in", "net_out"],
"interval": 5
}
}
--------------------------------------------------
// NOTCONSOLE
Allows `date_histogram`'s to be used on the `"timestamp"` field, `terms` aggregations to be used on the `"hostname"` and `"datacenter"`
fields, and `histograms` to be used on any of `"load"`, `"net_in"`, `"net_out"` fields.
Importantly, these aggs/fields can be used in any combination. This aggregation:
[source,js]
--------------------------------------------------
"aggs" : {
"hourly": {
"date_histogram": {
"field": "timestamp",
"interval": "1h"
},
"aggs": {
"host_names": {
"terms": {
"field": "hostname"
}
}
}
}
}
--------------------------------------------------
// NOTCONSOLE
is just as valid as this aggregation:
[source,js]
--------------------------------------------------
"aggs" : {
"hourly": {
"date_histogram": {
"field": "timestamp",
"interval": "1h"
},
"aggs": {
"data_center": {
"terms": {
"field": "datacenter"
}
},
"aggs": {
"host_names": {
"terms": {
"field": "hostname"
}
},
"aggs": {
"load_values": {
"histogram": {
"field": "load",
"interval": 5
}
}
}
}
}
}
}
--------------------------------------------------
// NOTCONSOLE
You'll notice that the second aggregation is not only substantially larger, it also swapped the position of the terms aggregation on
`"hostname"`, illustrating how the order of aggregations does not matter to rollups. Similarly, while the `date_histogram` is required
for rolling up data, it isn't required while querying (although often used). For example, this is a valid aggregation for
Rollup Search to execute:
[source,js]
--------------------------------------------------
"aggs" : {
"host_names": {
"terms": {
"field": "hostname"
}
}
}
--------------------------------------------------
// NOTCONSOLE
Ultimately, when configuring `groups` for a job, think in terms of how you might wish to partition data in a query at a future date...
then include those in the config. Because Rollup Search allows any order or combination of the grouped fields, you just need to decide
if a field is useful for aggregating later, and how you might wish to use it (terms, histogram, etc)