From 5bdf25320a7eb2eb0247b3c6c94ba8934ad2d861 Mon Sep 17 00:00:00 2001
From: Mark Tozzi
Date: Tue, 1 Oct 2019 10:58:44 -0400
Subject: [PATCH] Documentation notes for Range field histograms (#46890) (#47366)

---
 docs/reference/aggregations/bucket.asciidoc   |   1 +
 .../bucket/datehistogram-aggregation.asciidoc |   2 +-
 .../bucket/histogram-aggregation.asciidoc     |  25 ++-
 .../bucket/range-field-note.asciidoc          | 181 ++++++++++++++++++
 4 files changed, 202 insertions(+), 7 deletions(-)
 create mode 100644 docs/reference/aggregations/bucket/range-field-note.asciidoc

diff --git a/docs/reference/aggregations/bucket.asciidoc b/docs/reference/aggregations/bucket.asciidoc
index 9f186ef1ffb..ef2ebadd1dc 100644
--- a/docs/reference/aggregations/bucket.asciidoc
+++ b/docs/reference/aggregations/bucket.asciidoc
@@ -67,3 +67,4 @@ include::bucket/significanttext-aggregation.asciidoc[]

 include::bucket/terms-aggregation.asciidoc[]

+include::bucket/range-field-note.asciidoc[]
diff --git a/docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc b/docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc
index 2d47ba9a59b..62cd850a276 100644
--- a/docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc
+++ b/docs/reference/aggregations/bucket/datehistogram-aggregation.asciidoc
@@ -3,7 +3,7 @@

 This multi-bucket aggregation is similar to the normal
 <>, but it can
-only be used with date values. Because dates are represented internally in
+only be used with date or date range values. Because dates are represented internally in
 Elasticsearch as long values, it is possible, but not as accurate, to use the
 normal `histogram` on dates as well. The main difference in the two APIs is
 that here the interval can be specified using date/time expressions. Time-based
diff --git a/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc b/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc
index f7e340af29b..d657a687ff2 100644
--- a/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc
+++ b/docs/reference/aggregations/bucket/histogram-aggregation.asciidoc
@@ -1,12 +1,13 @@
 [[search-aggregations-bucket-histogram-aggregation]]
 === Histogram Aggregation

-A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents.
-It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field
-that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval `5`
-(in case of price it may represent $5). When the aggregation executes, the price field of every document will be
-evaluated and will be rounded down to its closest bucket - for example, if the price is `32` and the bucket size is `5`
-then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the key `30`.
+A multi-bucket values source based aggregation that can be applied on numeric values or numeric range values extracted
+from the documents. It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the
+documents have a field that holds a price (numeric), we can configure this aggregation to dynamically build buckets with
+interval `5` (in case of price it may represent $5). When the aggregation executes, the price field of every document
+will be evaluated and will be rounded down to its closest bucket - for example, if the price is `32` and the bucket size
+is `5` then the rounding will yield `30` and thus the document will "fall" into the bucket that is associated with the
+key `30`.
 To make this more formal, here is the rounding function that is used:
 [source,java]
@@ -14,6 +15,10 @@ To make this more formal, here is the rounding function that is used:
 bucket_key = Math.floor((value - offset) / interval) * interval + offset
 --------------------------------------------------

+For range values, a document can fall into multiple buckets. The first bucket is computed from the lower
+bound of the range in the same way as a bucket for a single value is computed. The final bucket is computed in the same
+way from the upper bound of the range, and the range is counted in all buckets in between and including those two.
+
 The `interval` must be a positive decimal, while the `offset` must be a decimal in `[0, interval)`
 (a decimal greater than or equal to `0` and less than `interval`)

@@ -175,6 +180,14 @@ POST /sales/_search?size=0
 --------------------------------------------------
 // TEST[setup:sales]

+When aggregating ranges, buckets are based on the values of the returned documents. This means the response may include
+buckets outside of a query's range. For example, if your query looks for values greater than 100, and you have a range
+covering 50 to 150, and an interval of 50, that document will land in 3 buckets - 50, 100, and 150. In general, it's
+best to think of the query and aggregation steps as independent - the query selects a set of documents, and then the
+aggregation buckets those documents without regard to how they were selected.
+See <> for more information and an example.
+
 ==== Order

 By default the returned buckets are sorted by their `key` ascending, though the order behaviour can be controlled using
diff --git a/docs/reference/aggregations/bucket/range-field-note.asciidoc b/docs/reference/aggregations/bucket/range-field-note.asciidoc
new file mode 100644
index 00000000000..0e49c0035b1
--- /dev/null
+++ b/docs/reference/aggregations/bucket/range-field-note.asciidoc
@@ -0,0 +1,181 @@
+[[search-aggregations-bucket-range-field-note]]
+=== Subtleties of bucketing range fields
+
+==== Documents are counted for each bucket they land in
+
+Since a range represents multiple values, running a bucket aggregation over a
+range field can result in the same document landing in multiple buckets. This
+can lead to surprising behavior, such as the sum of bucket counts being higher
+than the number of matched documents. For example, consider the following
+index:
+[source, console]
+--------------------------------------------------
+PUT range_index
+{
+  "settings": {
+    "number_of_shards": 2
+  },
+  "mappings": {
+    "properties": {
+      "expected_attendees": {
+        "type": "integer_range"
+      },
+      "time_frame": {
+        "type": "date_range",
+        "format": "yyyy-MM-dd||epoch_millis"
+      }
+    }
+  }
+}
+
+PUT range_index/_doc/1?refresh
+{
+  "expected_attendees" : {
+    "gte" : 10,
+    "lte" : 20
+  },
+  "time_frame" : {
+    "gte" : "2019-10-28",
+    "lte" : "2019-11-04"
+  }
+}
+--------------------------------------------------
+// TESTSETUP
+
+The range is wider than the interval in the following aggregation, and thus the
+document will land in multiple buckets.
+
+[source, console]
+--------------------------------------------------
+POST /range_index/_search?size=0
+{
+  "aggs" : {
+    "range_histo" : {
+      "histogram" : {
+        "field" : "expected_attendees",
+        "interval" : 5
+      }
+    }
+  }
+}
+--------------------------------------------------
+
+Since the interval is `5` (and the offset is `0` by default), we expect buckets `10`,
+`15`, and `20`. Our range document will fall in all three of these buckets.
+
+[source, console-result]
+--------------------------------------------------
+{
+  ...
+  "aggregations" : {
+    "range_histo" : {
+      "buckets" : [
+        {
+          "key" : 10.0,
+          "doc_count" : 1
+        },
+        {
+          "key" : 15.0,
+          "doc_count" : 1
+        },
+        {
+          "key" : 20.0,
+          "doc_count" : 1
+        }
+      ]
+    }
+  }
+}
+--------------------------------------------------
+// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
+
+A document cannot exist partially in a bucket; for example, the above document
+cannot count as one-third in each of the above three buckets. In this example,
+since the document's range landed in multiple buckets, the full value of that
+document would also be counted in any sub-aggregations for each bucket.
+
+==== Query bounds are not aggregation filters
+
+Another unexpected behavior can arise when a query is used to filter on the
+field being aggregated. In this case, a document could match the query but
+still have one or both of the endpoints of the range outside the query.
+Consider the following aggregation on the above document:
+
+[source, console]
+--------------------------------------------------
+POST /range_index/_search?size=0
+{
+  "query": {
+    "range": {
+      "time_frame": {
+        "gte": "2019-11-01",
+        "format": "yyyy-MM-dd"
+      }
+    }
+  },
+  "aggs" : {
+    "november_data" : {
+      "date_histogram" : {
+        "field" : "time_frame",
+        "calendar_interval" : "day"
+      }
+    }
+  }
+}
+--------------------------------------------------
+
+Even though the query only considers days in November, the aggregation
+generates 8 buckets (4 in October, 4 in November) because the aggregation is
+calculated over the ranges of all matching documents.
+
+[source, console-result]
+--------------------------------------------------
+{
+  ...
+  "aggregations" : {
+    "november_data" : {
+      "buckets" : [
+        {
+          "key" : 1572220800000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572307200000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572393600000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572480000000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572566400000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572652800000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572739200000,
+          "doc_count" : 1
+        },
+        {
+          "key" : 1572825600000,
+          "doc_count" : 1
+        }
+      ]
+    }
+  }
+}
+--------------------------------------------------
+// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
+
+Depending on the use case, a `CONTAINS` query could limit the documents to only
+those that fall entirely in the queried range. In this example, the one
+document would not be included and the aggregation would be empty. Filtering
+the buckets after the aggregation is also an option, for use cases where the
+document should be counted but the out of bounds data can be safely ignored.
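
Reviewer note: the bucketing rule this patch documents can be checked with a small Python sketch. It is an illustration only, not Elasticsearch's implementation. The helper names `bucket_key` and `range_bucket_keys` are hypothetical, and the date example assumes fixed UTC days of 86,400,000 ms, which a `calendar_interval` of `day` does not guarantee in other time zones.

```python
import math


def bucket_key(value, interval, offset=0.0):
    """Rounding function quoted in the histogram docs, for a single value."""
    return math.floor((value - offset) / interval) * interval + offset


def range_bucket_keys(gte, lte, interval, offset=0.0):
    """Keys of every bucket that a closed range [gte, lte] is counted in.

    The first key comes from the lower bound, the last key from the upper
    bound, and the range is counted in every bucket between the two.
    """
    first = bucket_key(gte, interval, offset)
    last = bucket_key(lte, interval, offset)
    steps = round((last - first) / interval)
    return [first + i * interval for i in range(steps + 1)]


# expected_attendees: 10..20 with interval 5
attendees = range_bucket_keys(10, 20, interval=5)

# Range 50..150 with interval 50, the query example in the histogram docs
query_example = range_bucket_keys(50, 150, interval=50)

# time_frame: 2019-10-28..2019-11-04 as epoch millis, one bucket per UTC day
days = range_bucket_keys(1572220800000, 1572825600000, interval=86400000)

print(attendees)
print(query_example)
print(len(days))
```

Under those assumptions the sketch yields the three attendee buckets and the eight daily buckets shown in the documented responses. The query plays no part in the computation, which is the point of the "Query bounds are not aggregation filters" section.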