OpenSearch/docs/reference/aggregations/bucket/variablewidthhistogram-aggregation.asciidoc
[[search-aggregations-bucket-variablewidthhistogram-aggregation]]
=== Variable Width Histogram Aggregation

experimental::["We're evaluating the request and response format for this new aggregation.",https://github.com/elastic/elasticsearch/issues/58573]

This is a multi-bucket aggregation similar to <<search-aggregations-bucket-histogram-aggregation>>.
However, the width of each bucket is not specified. Rather, a target number of buckets is provided and bucket intervals
are dynamically determined based on the document distribution. This is done using a simple one-pass document clustering algorithm
that aims to obtain low distances between bucket centroids. Unlike other multi-bucket aggregations, the intervals will not
necessarily have a uniform width.

TIP: The number of buckets returned will always be less than or equal to the target number.

For example, the following request sets a target of 2 buckets:

[source,console]
--------------------------------------------------
POST /sales/_search?size=0
{
  "aggs": {
    "prices": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 2
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:sales]

Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "prices": {
      "buckets": [
        {
          "min": 10.0,
          "key": 30.0,
          "max": 50.0,
          "doc_count": 2
        },
        {
          "min": 150.0,
          "key": 185.0,
          "max": 200.0,
          "doc_count": 5
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

IMPORTANT: This aggregation cannot currently be nested under any aggregation that collects from more than a single bucket.

==== Clustering Algorithm

Each shard fetches the first `initial_buffer` documents and stores them in memory. Once the buffer is full, these documents
are sorted and linearly separated into `3/4 * shard_size` buckets.

Next, each remaining document is either collected into the nearest bucket or placed into a new bucket if it is distant
from all the existing ones. At most `shard_size` total buckets are created.
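
For example, with the defaults `buckets = 10` and `shard_size = buckets * 50 = 500`, each shard buffers
`initial_buffer = min(10 * 500, 50000) = 5000` documents, separates them into `3/4 * 500 = 375` initial buckets, and
creates at most `500` buckets in total.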

In the reduce step, the coordinating node sorts the buckets from all shards by their centroids. Then, the two buckets
with the nearest centroids are repeatedly merged until the target number of buckets is achieved.
This merging procedure is a form of https://en.wikipedia.org/wiki/Hierarchical_clustering[agglomerative hierarchical clustering].

TIP: A shard can return fewer than `shard_size` buckets, but it cannot return more.

==== Shard size

The `shard_size` parameter specifies the number of buckets that the coordinating node will request from each shard.
A higher `shard_size` leads each shard to produce smaller buckets. This reduces the likelihood of buckets overlapping
after the reduction step. Increasing the `shard_size` will improve the accuracy of the histogram, but it will
also make it more expensive to compute the final result because bigger priority queues will have to be managed on a
shard level, and the data transfers between the nodes and the client will be larger.

TIP: Parameters `buckets`, `shard_size`, and `initial_buffer` are optional. By default, `buckets = 10`, `shard_size = buckets * 50`, and `initial_buffer = min(10 * shard_size, 50000)`.
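
For example, the following request targets 5 buckets but raises `shard_size` to `500` (above its default of
`buckets * 50 = 250`) so that each shard forms more, smaller clusters before the reduction step. The values here are
illustrative, not tuning recommendations:

[source,console]
--------------------------------------------------
POST /sales/_search?size=0
{
  "aggs": {
    "prices": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 5,
        "shard_size": 500
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:sales]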

==== Initial Buffer

The `initial_buffer` parameter can be used to specify the number of individual documents that will be stored in memory
on a shard before the initial bucketing algorithm is run. Bucket distribution is determined using this sample
of `initial_buffer` documents. So, although a higher `initial_buffer` will use more memory, it will lead to more representative
clusters.
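
For example, the following request lowers `initial_buffer` so that each shard samples only 1,000 documents before the
initial bucketing algorithm runs, trading some representativeness for memory. Again, the value is illustrative:

[source,console]
--------------------------------------------------
POST /sales/_search?size=0
{
  "aggs": {
    "prices": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 2,
        "initial_buffer": 1000
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:sales]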

==== Bucket bounds are approximate

During the reduce step, the coordinating node continuously merges the two buckets with the nearest centroids. If two buckets have
overlapping bounds but distant centroids, then it is possible that they will not be merged. Because of this, after
reduction the maximum value in some interval (`max`) might be greater than the minimum value in the subsequent
bucket (`min`). To reduce the impact of this error, when such an overlap occurs the bound between these intervals is adjusted to `(max + min) / 2`.
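
For example, if one bucket ends at `max = 100` and the next bucket begins at `min = 90`, the overlapping bound between
them is adjusted to `(100 + 90) / 2 = 95`.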

TIP: Bucket bounds are very sensitive to outliers.