OpenSearch/docs/en/rollup/understanding-groups.asciidoc

410 lines
13 KiB
Plaintext
Raw Normal View History

[[rollup-understanding-groups]]
== Understanding Groups
To preserve flexibility, Rollup Jobs are defined based on how future queries may need to use the data. Traditionally, systems force
the admin to make decisions about what metrics to rollup and on what interval. E.g. The average of `cpu_time` on an hourly basis. This
is limiting; if, at a future date, the admin wishes to see the average of `cpu_time` on an hourly basis _and partitioned by `host_name`_,
they are out of luck.
Of course, the admin can decide to rollup the `[hour, host]` tuple on an hourly basis, but as the number of grouping keys grows, so do the
number of tuples the admin needs to configure. Furthermore, these `[hours, host]` tuples are only useful for hourly rollups... daily, weekly,
or monthly rollups all require new configurations.
Rather than force the admin to decide ahead of time which individual tuples should be rolled up, Elasticsearch's Rollup jobs are configured
based on which groups are potentially useful to future queries. For example, this configuration:
[source,js]
--------------------------------------------------
"groups" : {
"date_histogram": {
"field": "timestamp",
"interval": "1h",
"delay": "7d"
},
"terms": {
"fields": ["hostname", "datacenter"]
},
"histogram": {
"fields": ["load", "net_in", "net_out"],
"interval": 5
}
}
--------------------------------------------------
// NOTCONSOLE
Allows `date_histogram`'s to be used on the `"timestamp"` field, `terms` aggregations to be used on the `"hostname"` and `"datacenter"`
fields, and `histograms` to be used on any of `"load"`, `"net_in"`, `"net_out"` fields.
Importantly, these aggs/fields can be used in any combination. This aggregation:
[source,js]
--------------------------------------------------
"aggs" : {
"hourly": {
"date_histogram": {
"field": "timestamp",
"interval": "1h"
},
"aggs": {
"host_names": {
"terms": {
"field": "hostname"
}
}
}
}
}
--------------------------------------------------
// NOTCONSOLE
is just as valid as this aggregation:
[source,js]
--------------------------------------------------
"aggs" : {
"hourly": {
"date_histogram": {
"field": "timestamp",
"interval": "1h"
},
"aggs": {
"data_center": {
"terms": {
"field": "datacenter"
}
},
"aggs": {
"host_names": {
"terms": {
"field": "hostname"
}
},
"aggs": {
"load_values": {
"histogram": {
"field": "load",
"interval": 5
}
}
}
}
}
}
}
--------------------------------------------------
// NOTCONSOLE
You'll notice that the second aggregation is not only substantially larger, it also swapped the position of the terms aggregation on
`"hostname"`, illustrating how the order of aggregations does not matter to rollups. Similarly, while the `date_histogram` is required
for rolling up data, it isn't required while querying (although often used). For example, this is a valid aggregation for
Rollup Search to execute:
[source,js]
--------------------------------------------------
"aggs" : {
"host_names": {
"terms": {
"field": "hostname"
}
}
}
--------------------------------------------------
// NOTCONSOLE
Ultimately, when configuring `groups` for a job, think in terms of how you might wish to partition data in a query at a future date...
then include those in the config. Because Rollup Search allows any order or combination of the grouped fields, you just need to decide
if a field is useful for aggregating later, and how you might wish to use it (terms, histogram, etc)
=== Grouping Limitations with heterogeneous indices
There is a known limitation to Rollup groups, due to some internal implementation details at this time. The Rollup feature leverages
the `composite` aggregation from Elasticsearch. At the moment, the composite agg only returns buckets when all keys in the tuple are non-null.
Put another way, if the you request keys `[A,B,C]` in the composite aggregation, the only documents that are aggregated are those that have
_all_ of the keys `A, B` and `C`.
Because Rollup uses the composite agg during the indexing process, it inherits this behavior. Practically speaking, if all of the documents
in your index are homogeneous (they have the same mapping), you can ignore this limitation and stop reading now.
However, if you have a heterogeneous collection of documents that you wish to roll up, you may need to configure two or more jobs to
accurately cover the original data.
As an example, if your index has two types of documents:
[source,js]
--------------------------------------------------
{
"timestamp": 1516729294000,
"temperature": 200,
"voltage": 5.2,
"node": "a"
}
--------------------------------------------------
// NOTCONSOLE
and
[source,js]
--------------------------------------------------
{
"timestamp": 1516729294000,
"price": 123,
"title": "Foo"
}
--------------------------------------------------
// NOTCONSOLE
it may be tempting to create a single, combined rollup job which covers both of these document types, something like this:
[source,js]
--------------------------------------------------
PUT _xpack/rollup/job/combined
{
"index_pattern": "data-*",
"rollup_index": "data_rollup",
"cron": "*/30 * * * * ?",
"page_size" :1000,
"groups" : {
"date_histogram": {
"field": "timestamp",
"interval": "1h",
"delay": "7d"
},
"terms": {
"fields": ["node", "title"]
}
},
"metrics": [
{
"field": "temperature",
"metrics": ["min", "max", "sum"]
},
{
"field": "price",
"metrics": ["avg"]
}
]
}
--------------------------------------------------
// NOTCONSOLE
You can see that it includes a `terms` grouping on both "node" and "title", fields that are mutually exclusive in the document types.
*This will not work.* Because the `composite` aggregation (and by extension, Rollup) only returns buckets when all keys are non-null,
and there are no documents that have both a "node" field and a "title" field, this rollup job will not produce any rollups.
Instead, you should configure two independent jobs (sharing the same index, or going to separate indices):
[source,js]
--------------------------------------------------
PUT _xpack/rollup/job/sensor
{
"index_pattern": "data-*",
"rollup_index": "data_rollup",
"cron": "*/30 * * * * ?",
"page_size" :1000,
"groups" : {
"date_histogram": {
"field": "timestamp",
"interval": "1h",
"delay": "7d"
},
"terms": {
"fields": ["node"]
}
},
"metrics": [
{
"field": "temperature",
"metrics": ["min", "max", "sum"]
}
]
}
--------------------------------------------------
// NOTCONSOLE
[source,js]
--------------------------------------------------
PUT _xpack/rollup/job/purchases
{
"index_pattern": "data-*",
"rollup_index": "data_rollup",
"cron": "*/30 * * * * ?",
"page_size" :1000,
"groups" : {
"date_histogram": {
"field": "timestamp",
"interval": "1h",
"delay": "7d"
},
"terms": {
"fields": ["title"]
}
},
"metrics": [
{
"field": "price",
"metrics": ["avg"]
}
]
}
--------------------------------------------------
// NOTCONSOLE
Notice that each job now deals with a single "document type", and will not run into the limitations described above. We are working on changes
in core Elasticsearch to remove this limitation from the `composite` aggregation, and the documentation will be updated accordingly
when this particular scenario is fixed.
=== Doc counts and overlapping jobs
There is an issue with doc counts, related to the above grouping limitation. Imagine you have two Rollup jobs saving to the same index, where
one job is a "subset" of another job.
For example, you might have jobs with these two groupings:
[source,js]
--------------------------------------------------
PUT _xpack/rollup/job/sensor-all
{
"groups" : {
"date_histogram": {
"field": "timestamp",
"interval": "1h",
"delay": "7d"
},
"terms": {
"fields": ["node"]
}
},
"metrics": [
{
"field": "price",
"metrics": ["avg"]
}
]
...
}
--------------------------------------------------
// NOTCONSOLE
and
[source,js]
--------------------------------------------------
PUT _xpack/rollup/job/sensor-building
{
"groups" : {
"date_histogram": {
"field": "timestamp",
"interval": "1h",
"delay": "7d"
},
"terms": {
"fields": ["node", "building"]
}
}
...
}
--------------------------------------------------
// NOTCONSOLE
The first job `sensor-all` contains the groupings and metrics that apply to all data in the index. The second job is rolling up a subset
of data (in different buildings) which also include a building identifier. You did this because combining them would run into the limitation
described in the previous section.
This _mostly_ works, but can sometimes return incorrect `doc_counts` when you search. All metrics will be valid however.
The issue arises from the composite agg limitation described before, combined with search-time optimization. Imagine you try to run the
following aggregation:
[source,js]
--------------------------------------------------
"aggs" : {
"nodes": {
"terms": {
"field": "node"
}
}
}
--------------------------------------------------
// NOTCONSOLE
This aggregation could be serviced by either `sensor-all` or `sensor-building` job, since they both group on the node field. So the RollupSearch
API will search both of them and merge results. This will result in *correct* doc_counts and *correct* metrics. No problem here.
The issue arises from an aggregation that can _only_ be serviced by `sensor-building`, like this one:
[source,js]
--------------------------------------------------
"aggs" : {
"nodes": {
"terms": {
"field": "node"
},
"aggs": {
"building": {
"terms": {
"field": "building"
}
}
}
}
}
--------------------------------------------------
// NOTCONSOLE
Now we run into a problem. The RollupSearch API will correctly identify that only `sensor-building` job has all the required components
to answer the aggregation, and will search it exclusively. Unfortunately, due to the composite aggregation limitation, that job only
rolled up documents that have both a "node" and a "building" field. Meaning that the doc_counts for the `"nodes"` aggregation will not
include counts for any document that doesn't have `[node, building]` fields.
- The `doc_count` for `"nodes"` aggregation will be incorrect because it only contains counts for `nodes` that also have buildings
- The `doc_count` for `"buildings"` aggregation will be correct
- Any metrics, on any level, will be correct
==== Workarounds
There are two main workarounds if you find yourself with a schema like the above.
Easiest and most robust method: use separate indices to store your rollups. The limitations arise because you have several document
schemas co-habitating in a single index, which makes it difficult for rollups to correctly summarize. If you make several rollup
jobs and store them in separate indices, these sorts of difficulties do not arise. It does, however, keep you from searching across several
different rollup indices at the same time.
The other workaround is to include an "off-target" aggregation in the query, which pulls in the "superset" job and corrects the doc counts.
The RollupSearch API determines the best job to search for each "leaf node" in the aggregation tree. So if we include a metric agg on `price`,
which was only defined in the `sensor-all` job, that will "pull in" the other job:
[source,js]
--------------------------------------------------
"aggs" : {
"nodes": {
"terms": {
"field": "node"
},
"aggs": {
"building": {
"terms": {
"field": "building"
}
},
"avg_price": {
"avg": { "field": "price" } <1>
}
}
}
}
--------------------------------------------------
// NOTCONSOLE
<1> Adding an avg aggregation here will fix the doc counts
Because only `sensor-all` job had an `avg` on the price field, the RollupSearch API is forced to pull in that additional job for searching,
and will merge/correct the doc_counts as appropriate. This sort of workaround applies to any additional aggregation -- metric or bucketing --
although it can be tedious to look through the jobs and determine the right one to add.
==== Status
We realize this is an onerous limitation, and somewhat breaks the rollup contract of "pick the fields to rollup, we do the rest". We are
actively working to get the limitation to `composite` agg fixed, and the related issues in Rollup. The documentation will be updated when
the fix is implemented.