[Docs] Document limitations around rolling up heterogeneous schemas

Rolling up indices that contain multiple schemas requires extra considerations for the user, and in some cases, some limitations. This commit tries to enumerate the issues so the user is aware. Original commit: elastic/x-pack-elasticsearch@bf96eeab4e
2018-04-17 19:45:22 +00:00 · 2018-04-17 19:45:22 +00:00 · 97fa56443d
parent ed2c6ead7f
commit 97fa56443d
1 changed files with 293 additions and 1 deletions
--- a/docs/en/rollup/understanding-groups.asciidoc
+++ b/docs/en/rollup/understanding-groups.asciidoc
@ -116,3 +116,295 @@ Rollup Search to execute:
 Ultimately, when configuring `groups` for a job, think in terms of how you might wish to partition data in a query at a future date...
 then include those in the config.  Because Rollup Search allows any order or combination of the grouped fields, you just need to decide
 if a field is useful for aggregating later, and how you might wish to use it (terms, histogram, etc)
 === Grouping Limitations with heterogeneous indices
 There is a known limitation to Rollup groups, due to some internal implementation details at this time.  The Rollup feature leverages
 the `composite` aggregation from Elasticsearch.  At the moment, the composite agg only returns buckets when all keys in the tuple are non-null.
 Put another way, if the you request keys `[A,B,C]` in the composite aggregation, the only documents that are aggregated are those that have
 _all_ of the keys `A, B` and `C`.
 Because Rollup uses the composite agg during the indexing process, it inherits this behavior.  Practically speaking, if all of the documents
 in your index are homogeneous (they have the same mapping), you can ignore this limitation and stop reading now.
 However, if you have a heterogeneous collection of documents that you wish to roll up, you may need to configure two or more jobs to
 accurately cover the original data.
 As an example, if your index has two types of documents:
 [source,js]
 --------------------------------------------------
 {
  "timestamp": 1516729294000,
  "temperature": 200,
  "voltage": 5.2,
  "node": "a"
 }
 --------------------------------------------------
 // NOTCONSOLE
 and
 [source,js]
 --------------------------------------------------
 {
  "timestamp": 1516729294000,
  "price": 123,
  "title": "Foo"
 }
 --------------------------------------------------
 // NOTCONSOLE
 it may be tempting to create a single, combined rollup job which covers both of these document types, something like this:
 [source,js]
 --------------------------------------------------
 PUT _xpack/rollup/job/combined
 {
    "index_pattern": "data-*",
    "rollup_index": "data_rollup",
    "cron": "*/30 * * * * ?",
    "page_size" :1000,
    "groups" : {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1h",
        "delay": "7d"
      },
      "terms": {
        "fields": ["node", "title"]
      }
    },
    "metrics": [
        {
            "field": "temperature",
            "metrics": ["min", "max", "sum"]
        },
        {
            "field": "price",
            "metrics": ["avg"]
        }
    ]
 }
 --------------------------------------------------
 // NOTCONSOLE
 You can see that it includes a `terms` grouping on both "node" and "title", fields that are mutually exclusive in the document types.
 *This will not work.*  Because the `composite` aggregation (and by extension, Rollup) only returns buckets when all keys are non-null,
 and there are no documents that have both a "node" field and a "title" field, this rollup job will not produce any rollups.
 Instead, you should configure two independent jobs (sharing the same index, or going to separate indices):
 [source,js]
 --------------------------------------------------
 PUT _xpack/rollup/job/sensor
 {
    "index_pattern": "data-*",
    "rollup_index": "data_rollup",
    "cron": "*/30 * * * * ?",
    "page_size" :1000,
    "groups" : {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1h",
        "delay": "7d"
      },
      "terms": {
        "fields": ["node"]
      }
    },
    "metrics": [
        {
            "field": "temperature",
            "metrics": ["min", "max", "sum"]
        }
    ]
 }
 --------------------------------------------------
 // NOTCONSOLE
 [source,js]
 --------------------------------------------------
 PUT _xpack/rollup/job/purchases
 {
    "index_pattern": "data-*",
    "rollup_index": "data_rollup",
    "cron": "*/30 * * * * ?",
    "page_size" :1000,
    "groups" : {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1h",
        "delay": "7d"
      },
      "terms": {
        "fields": ["title"]
      }
    },
    "metrics": [
        {
            "field": "price",
            "metrics": ["avg"]
        }
    ]
 }
 --------------------------------------------------
 // NOTCONSOLE
 Notice that each job now deals with a single "document type", and will not run into the limitations described above.  We are working on changes
 in core Elasticsearch to remove this limitation from the `composite` aggregation, and the documentation will be updated accordingly
 when this particular scenario is fixed.
 === Doc counts and overlapping jobs
 There is an issue with doc counts, related to the above grouping limitation.  Imagine you have two Rollup jobs saving to the same index, where
 one job is a "subset" of another job.
 For example, you might have jobs with these two groupings:
 [source,js]
 --------------------------------------------------
 PUT _xpack/rollup/job/sensor-all
 {
    "groups" : {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1h",
        "delay": "7d"
      },
      "terms": {
        "fields": ["node"]
      }
    },
    "metrics": [
        {
            "field": "price",
            "metrics": ["avg"]
        }
    ]
    ...
 }
 --------------------------------------------------
 // NOTCONSOLE
 and
 [source,js]
 --------------------------------------------------
 PUT _xpack/rollup/job/sensor-building
 {
    "groups" : {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1h",
        "delay": "7d"
      },
      "terms": {
        "fields": ["node", "building"]
      }
    }
    ...
 }
 --------------------------------------------------
 // NOTCONSOLE
 The first job `sensor-all` contains the groupings and metrics that apply to all data in the index.  The second job is rolling up a subset
 of data (in different buildings) which also include a building identifier.  You did this because combining them would run into the limitation
 described in the previous section.
 This _mostly_ works, but can sometimes return incorrect `doc_counts` when you search.  All metrics will be valid however.
 The issue arises from the composite agg limitation described before, combined with search-time optimization.  Imagine you try to run the
 following aggregation:
 [source,js]
 --------------------------------------------------
 "aggs" : {
  "nodes": {
    "terms": {
      "field": "node"
    }
  }
 }
 --------------------------------------------------
 // NOTCONSOLE
 This aggregation could be serviced by either `sensor-all` or `sensor-building` job, since they both group on the node field.  So the RollupSearch
 API will search both of them and merge results.  This will result in *correct* doc_counts and *correct* metrics.  No problem here.
 The issue arises from an aggregation that can _only_ be serviced by `sensor-building`, like this one:
 [source,js]
 --------------------------------------------------
 "aggs" : {
  "nodes": {
    "terms": {
      "field": "node"
    },
    "aggs": {
      "building": {
        "terms": {
          "field": "building"
        }
      }
    }
  }
 }
 --------------------------------------------------
 // NOTCONSOLE
 Now we run into a problem.  The RollupSearch API will correctly identify that only `sensor-building` job has all the required components
 to answer the aggregation, and will search it exclusively.  Unfortunately, due to the composite aggregation limitation, that job only
 rolled up documents that have both a "node" and a "building" field.  Meaning that the doc_counts for the `"nodes"` aggregation will not
 include counts for any document that doesn't have `[node, building]` fields.
 - The `doc_count` for `"nodes"` aggregation will be incorrect because it only contains counts for `nodes` that also have buildings
 - The `doc_count` for `"buildings"` aggregation will be correct
 - Any metrics, on any level, will be correct
 ==== Workarounds
 There are two main workarounds if you find yourself with a schema like the above.
 Easiest and most robust method: use separate indices to store your rollups.  The limitations arise because you have several document
 schemas co-habitating in a single index, which makes it difficult for rollups to correctly summarize.  If you make several rollup
 jobs and store them in separate indices, these sorts of difficulties do not arise.  It does, however, keep you from searching across several
 different rollup indices at the same time.
 The other workaround is to include an "off-target" aggregation in the query, which pulls in the "superset" job and corrects the doc counts.
 The RollupSearch API determines the best job to search for each "leaf node" in the aggregation tree.  So if we include a metric agg on `price`,
 which was only defined in the `sensor-all` job, that will "pull in" the other job:
 [source,js]
 --------------------------------------------------
 "aggs" : {
  "nodes": {
    "terms": {
      "field": "node"
    },
    "aggs": {
      "building": {
        "terms": {
          "field": "building"
        }
      },
      "avg_price": {
        "avg": { "field": "price" } <1>
      }
    }
  }
 }
 --------------------------------------------------
 // NOTCONSOLE
 <1> Adding an avg aggregation here will fix the doc counts
 Because only `sensor-all` job had an `avg` on the price field, the RollupSearch API is forced to pull in that additional job for searching,
 and will merge/correct the doc_counts as appropriate.  This sort of workaround applies to any additional aggregation -- metric or bucketing --
 although it can be tedious to look through the jobs and determine the right one to add.
 ==== Status
 We realize this is an onerous limitation, and somewhat breaks the rollup contract of "pick the fields to rollup, we do the rest".  We are
 actively working to get the limitation to `composite` agg fixed, and the related issues in Rollup.  The documentation will be updated when
 the fix is implemented.