[[rollup-understanding-groups]]
== Understanding Groups

To preserve flexibility, Rollup jobs are defined based on how future queries may need to use the data. Traditionally, systems force
the admin to decide ahead of time which metrics to roll up and on what interval, for example the average of `cpu_time` on an hourly basis.
This is limiting; if, at a future date, the admin wishes to see the average of `cpu_time` on an hourly basis _and partitioned by `hostname`_,
they are out of luck.

Of course, the admin can decide to roll up the `[hour, host]` tuple on an hourly basis, but as the number of grouping keys grows, so does the
number of tuples the admin needs to configure. Furthermore, these `[hour, host]` tuples are only useful for hourly rollups; daily, weekly,
or monthly rollups all require new configurations.

Rather than force the admin to decide ahead of time which individual tuples should be rolled up, Elasticsearch's Rollup jobs are configured
based on which groups are potentially useful to future queries. For example, this configuration:

[source,js]
--------------------------------------------------
"groups" : {
  "date_histogram": {
    "field": "timestamp",
    "interval": "1h",
    "delay": "7d"
  },
  "terms": {
    "fields": ["hostname", "datacenter"]
  },
  "histogram": {
    "fields": ["load", "net_in", "net_out"],
    "interval": 5
  }
}
--------------------------------------------------
// NOTCONSOLE

allows `date_histogram` aggregations to be used on the `"timestamp"` field, `terms` aggregations to be used on the `"hostname"` and `"datacenter"`
fields, and `histogram` aggregations to be used on any of the `"load"`, `"net_in"`, and `"net_out"` fields.

Importantly, these aggregations and fields can be used in any combination. This aggregation:

[source,js]
--------------------------------------------------
"aggs" : {
  "hourly": {
    "date_histogram": {
      "field": "timestamp",
      "interval": "1h"
    },
    "aggs": {
      "host_names": {
        "terms": {
          "field": "hostname"
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

is just as valid as this aggregation:

[source,js]
--------------------------------------------------
"aggs" : {
  "hourly": {
    "date_histogram": {
      "field": "timestamp",
      "interval": "1h"
    },
    "aggs": {
      "data_center": {
        "terms": {
          "field": "datacenter"
        },
        "aggs": {
          "host_names": {
            "terms": {
              "field": "hostname"
            },
            "aggs": {
              "load_values": {
                "histogram": {
                  "field": "load",
                  "interval": 5
                }
              }
            }
          }
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

You'll notice that the second aggregation is not only substantially larger, it also moves the `terms` aggregation on `"hostname"`
one level deeper, beneath the `terms` aggregation on `"datacenter"`, illustrating that the order of aggregations does not matter to rollups.
Similarly, while the `date_histogram` is required for rolling up data, it isn't required while querying (although it is often used).
For example, this is a valid aggregation for Rollup Search to execute:

[source,js]
--------------------------------------------------
"aggs" : {
  "host_names": {
    "terms": {
      "field": "hostname"
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

Ultimately, when configuring `groups` for a job, think in terms of how you might wish to partition data in a query at a future date,
then include those fields in the config. Because Rollup Search allows any order or combination of the grouped fields, you just need to decide
whether a field is useful for aggregating later, and how you might wish to use it (`terms`, `histogram`, etc.).
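
To show where `groups` sits in a complete job definition, here is a minimal sketch of a job request. The job name `sensor`, the
index pattern `sensor-*`, the rollup index `sensor_rollup`, the cron schedule, the page size, and the `metrics` section are all
illustrative assumptions, not values taken from the text above. The exact endpoint path and the name of the `date_histogram` interval
parameter vary between Elasticsearch versions, so check the Rollup API reference for your release before using it.

[source,js]
--------------------------------------------------
PUT _rollup/job/sensor
{
  "index_pattern": "sensor-*",
  "rollup_index": "sensor_rollup",
  "cron": "0 0 * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "timestamp",
      "interval": "1h",
      "delay": "7d"
    },
    "terms": {
      "fields": ["hostname", "datacenter"]
    },
    "histogram": {
      "fields": ["load", "net_in", "net_out"],
      "interval": 5
    }
  },
  "metrics": [
    {
      "field": "load",
      "metrics": ["min", "max", "avg"]
    }
  ]
}
--------------------------------------------------
// NOTCONSOLE

Under the same assumptions, a Rollup Search request against the hypothetical `sensor_rollup` index could then combine any of the
grouped fields, for example partitioning by `datacenter` and bucketing `load` without a `date_histogram`:

[source,js]
--------------------------------------------------
GET sensor_rollup/_rollup_search
{
  "size": 0,
  "aggs": {
    "data_center": {
      "terms": {
        "field": "datacenter"
      },
      "aggs": {
        "load_values": {
          "histogram": {
            "field": "load",
            "interval": 5
          }
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE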