412 lines
13 KiB
Plaintext
412 lines
13 KiB
Plaintext
[[rollup-understanding-groups]]
|
|
== Understanding Groups
|
|
|
|
experimental[]
|
|
|
|
To preserve flexibility, Rollup Jobs are defined based on how future queries may need to use the data. Traditionally, systems force
|
|
the admin to make decisions about what metrics to rollup and on what interval. E.g. The average of `cpu_time` on an hourly basis. This
|
|
is limiting; if, at a future date, the admin wishes to see the average of `cpu_time` on an hourly basis _and partitioned by `host_name`_,
|
|
they are out of luck.
|
|
|
|
Of course, the admin can decide to rollup the `[hour, host]` tuple on an hourly basis, but as the number of grouping keys grows, so do the
|
|
number of tuples the admin needs to configure. Furthermore, these `[hours, host]` tuples are only useful for hourly rollups... daily, weekly,
|
|
or monthly rollups all require new configurations.
|
|
|
|
Rather than force the admin to decide ahead of time which individual tuples should be rolled up, Elasticsearch's Rollup jobs are configured
|
|
based on which groups are potentially useful to future queries. For example, this configuration:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
"groups" : {
|
|
"date_histogram": {
|
|
"field": "timestamp",
|
|
"interval": "1h",
|
|
"delay": "7d"
|
|
},
|
|
"terms": {
|
|
"fields": ["hostname", "datacenter"]
|
|
},
|
|
"histogram": {
|
|
"fields": ["load", "net_in", "net_out"],
|
|
"interval": 5
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
Allows `date_histogram`'s to be used on the `"timestamp"` field, `terms` aggregations to be used on the `"hostname"` and `"datacenter"`
|
|
fields, and `histograms` to be used on any of `"load"`, `"net_in"`, `"net_out"` fields.
|
|
|
|
Importantly, these aggs/fields can be used in any combination. This aggregation:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
"aggs" : {
|
|
"hourly": {
|
|
"date_histogram": {
|
|
"field": "timestamp",
|
|
"interval": "1h"
|
|
},
|
|
"aggs": {
|
|
"host_names": {
|
|
"terms": {
|
|
"field": "hostname"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
is just as valid as this aggregation:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
"aggs" : {
|
|
"hourly": {
|
|
"date_histogram": {
|
|
"field": "timestamp",
|
|
"interval": "1h"
|
|
},
|
|
"aggs": {
|
|
"data_center": {
|
|
"terms": {
|
|
"field": "datacenter"
|
|
}
|
|
},
|
|
"aggs": {
|
|
"host_names": {
|
|
"terms": {
|
|
"field": "hostname"
|
|
}
|
|
},
|
|
"aggs": {
|
|
"load_values": {
|
|
"histogram": {
|
|
"field": "load",
|
|
"interval": 5
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
|
|
You'll notice that the second aggregation is not only substantially larger, it also swapped the position of the terms aggregation on
|
|
`"hostname"`, illustrating how the order of aggregations does not matter to rollups. Similarly, while the `date_histogram` is required
|
|
for rolling up data, it isn't required while querying (although often used). For example, this is a valid aggregation for
|
|
Rollup Search to execute:
|
|
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
"aggs" : {
|
|
"host_names": {
|
|
"terms": {
|
|
"field": "hostname"
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
Ultimately, when configuring `groups` for a job, think in terms of how you might wish to partition data in a query at a future date...
|
|
then include those in the config. Because Rollup Search allows any order or combination of the grouped fields, you just need to decide
|
|
if a field is useful for aggregating later, and how you might wish to use it (terms, histogram, etc)
|
|
|
|
=== Grouping Limitations with heterogeneous indices
|
|
|
|
There is a known limitation to Rollup groups, due to some internal implementation details at this time. The Rollup feature leverages
|
|
the `composite` aggregation from Elasticsearch. At the moment, the composite agg only returns buckets when all keys in the tuple are non-null.
|
|
Put another way, if the you request keys `[A,B,C]` in the composite aggregation, the only documents that are aggregated are those that have
|
|
_all_ of the keys `A, B` and `C`.
|
|
|
|
Because Rollup uses the composite agg during the indexing process, it inherits this behavior. Practically speaking, if all of the documents
|
|
in your index are homogeneous (they have the same mapping), you can ignore this limitation and stop reading now.
|
|
|
|
However, if you have a heterogeneous collection of documents that you wish to roll up, you may need to configure two or more jobs to
|
|
accurately cover the original data.
|
|
|
|
As an example, if your index has two types of documents:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"timestamp": 1516729294000,
|
|
"temperature": 200,
|
|
"voltage": 5.2,
|
|
"node": "a"
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
and
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"timestamp": 1516729294000,
|
|
"price": 123,
|
|
"title": "Foo"
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
it may be tempting to create a single, combined rollup job which covers both of these document types, something like this:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
PUT _xpack/rollup/job/combined
|
|
{
|
|
"index_pattern": "data-*",
|
|
"rollup_index": "data_rollup",
|
|
"cron": "*/30 * * * * ?",
|
|
"page_size" :1000,
|
|
"groups" : {
|
|
"date_histogram": {
|
|
"field": "timestamp",
|
|
"interval": "1h",
|
|
"delay": "7d"
|
|
},
|
|
"terms": {
|
|
"fields": ["node", "title"]
|
|
}
|
|
},
|
|
"metrics": [
|
|
{
|
|
"field": "temperature",
|
|
"metrics": ["min", "max", "sum"]
|
|
},
|
|
{
|
|
"field": "price",
|
|
"metrics": ["avg"]
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
You can see that it includes a `terms` grouping on both "node" and "title", fields that are mutually exclusive in the document types.
|
|
*This will not work.* Because the `composite` aggregation (and by extension, Rollup) only returns buckets when all keys are non-null,
|
|
and there are no documents that have both a "node" field and a "title" field, this rollup job will not produce any rollups.
|
|
|
|
Instead, you should configure two independent jobs (sharing the same index, or going to separate indices):
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
PUT _xpack/rollup/job/sensor
|
|
{
|
|
"index_pattern": "data-*",
|
|
"rollup_index": "data_rollup",
|
|
"cron": "*/30 * * * * ?",
|
|
"page_size" :1000,
|
|
"groups" : {
|
|
"date_histogram": {
|
|
"field": "timestamp",
|
|
"interval": "1h",
|
|
"delay": "7d"
|
|
},
|
|
"terms": {
|
|
"fields": ["node"]
|
|
}
|
|
},
|
|
"metrics": [
|
|
{
|
|
"field": "temperature",
|
|
"metrics": ["min", "max", "sum"]
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
PUT _xpack/rollup/job/purchases
|
|
{
|
|
"index_pattern": "data-*",
|
|
"rollup_index": "data_rollup",
|
|
"cron": "*/30 * * * * ?",
|
|
"page_size" :1000,
|
|
"groups" : {
|
|
"date_histogram": {
|
|
"field": "timestamp",
|
|
"interval": "1h",
|
|
"delay": "7d"
|
|
},
|
|
"terms": {
|
|
"fields": ["title"]
|
|
}
|
|
},
|
|
"metrics": [
|
|
{
|
|
"field": "price",
|
|
"metrics": ["avg"]
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
Notice that each job now deals with a single "document type", and will not run into the limitations described above. We are working on changes
|
|
in core Elasticsearch to remove this limitation from the `composite` aggregation, and the documentation will be updated accordingly
|
|
when this particular scenario is fixed.
|
|
|
|
=== Doc counts and overlapping jobs
|
|
|
|
There is an issue with doc counts, related to the above grouping limitation. Imagine you have two Rollup jobs saving to the same index, where
|
|
one job is a "subset" of another job.
|
|
|
|
For example, you might have jobs with these two groupings:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
PUT _xpack/rollup/job/sensor-all
|
|
{
|
|
"groups" : {
|
|
"date_histogram": {
|
|
"field": "timestamp",
|
|
"interval": "1h",
|
|
"delay": "7d"
|
|
},
|
|
"terms": {
|
|
"fields": ["node"]
|
|
}
|
|
},
|
|
"metrics": [
|
|
{
|
|
"field": "price",
|
|
"metrics": ["avg"]
|
|
}
|
|
]
|
|
...
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
and
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
PUT _xpack/rollup/job/sensor-building
|
|
{
|
|
"groups" : {
|
|
"date_histogram": {
|
|
"field": "timestamp",
|
|
"interval": "1h",
|
|
"delay": "7d"
|
|
},
|
|
"terms": {
|
|
"fields": ["node", "building"]
|
|
}
|
|
}
|
|
...
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
|
|
The first job `sensor-all` contains the groupings and metrics that apply to all data in the index. The second job is rolling up a subset
|
|
of data (in different buildings) which also include a building identifier. You did this because combining them would run into the limitation
|
|
described in the previous section.
|
|
|
|
This _mostly_ works, but can sometimes return incorrect `doc_counts` when you search. All metrics will be valid however.
|
|
|
|
The issue arises from the composite agg limitation described before, combined with search-time optimization. Imagine you try to run the
|
|
following aggregation:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
"aggs" : {
|
|
"nodes": {
|
|
"terms": {
|
|
"field": "node"
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
This aggregation could be serviced by either `sensor-all` or `sensor-building` job, since they both group on the node field. So the RollupSearch
|
|
API will search both of them and merge results. This will result in *correct* doc_counts and *correct* metrics. No problem here.
|
|
|
|
The issue arises from an aggregation that can _only_ be serviced by `sensor-building`, like this one:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
"aggs" : {
|
|
"nodes": {
|
|
"terms": {
|
|
"field": "node"
|
|
},
|
|
"aggs": {
|
|
"building": {
|
|
"terms": {
|
|
"field": "building"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
Now we run into a problem. The RollupSearch API will correctly identify that only `sensor-building` job has all the required components
|
|
to answer the aggregation, and will search it exclusively. Unfortunately, due to the composite aggregation limitation, that job only
|
|
rolled up documents that have both a "node" and a "building" field. Meaning that the doc_counts for the `"nodes"` aggregation will not
|
|
include counts for any document that doesn't have `[node, building]` fields.
|
|
|
|
- The `doc_count` for `"nodes"` aggregation will be incorrect because it only contains counts for `nodes` that also have buildings
|
|
- The `doc_count` for `"buildings"` aggregation will be correct
|
|
- Any metrics, on any level, will be correct
|
|
|
|
==== Workarounds
|
|
|
|
There are two main workarounds if you find yourself with a schema like the above.
|
|
|
|
Easiest and most robust method: use separate indices to store your rollups. The limitations arise because you have several document
|
|
schemas co-habitating in a single index, which makes it difficult for rollups to correctly summarize. If you make several rollup
|
|
jobs and store them in separate indices, these sorts of difficulties do not arise. It does, however, keep you from searching across several
|
|
different rollup indices at the same time.
|
|
|
|
The other workaround is to include an "off-target" aggregation in the query, which pulls in the "superset" job and corrects the doc counts.
|
|
The RollupSearch API determines the best job to search for each "leaf node" in the aggregation tree. So if we include a metric agg on `price`,
|
|
which was only defined in the `sensor-all` job, that will "pull in" the other job:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
"aggs" : {
|
|
"nodes": {
|
|
"terms": {
|
|
"field": "node"
|
|
},
|
|
"aggs": {
|
|
"building": {
|
|
"terms": {
|
|
"field": "building"
|
|
}
|
|
},
|
|
"avg_price": {
|
|
"avg": { "field": "price" } <1>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
<1> Adding an avg aggregation here will fix the doc counts
|
|
|
|
Because only `sensor-all` job had an `avg` on the price field, the RollupSearch API is forced to pull in that additional job for searching,
|
|
and will merge/correct the doc_counts as appropriate. This sort of workaround applies to any additional aggregation -- metric or bucketing --
|
|
although it can be tedious to look through the jobs and determine the right one to add.
|
|
|
|
==== Status
|
|
|
|
We realize this is an onerous limitation, and somewhat breaks the rollup contract of "pick the fields to rollup, we do the rest". We are
|
|
actively working to get the limitation to `composite` agg fixed, and the related issues in Rollup. The documentation will be updated when
|
|
the fix is implemented. |