[[search-aggregations-bucket-variablewidthhistogram-aggregation]]
=== Variable Width Histogram Aggregation

experimental::["We're evaluating the request and response format for this new aggregation.",https://github.com/elastic/elasticsearch/issues/58573]

This is a multi-bucket aggregation similar to <<search-aggregations-bucket-histogram-aggregation>>.
However, the width of each bucket is not specified. Rather, a target number of buckets is provided and bucket intervals
are dynamically determined based on the document distribution. This is done using a simple one-pass document clustering algorithm
that aims to obtain low distances between bucket centroids. Unlike other multi-bucket aggregations, the intervals will not
necessarily have a uniform width.

TIP: The number of buckets returned will always be less than or equal to the target number.

Requesting a target of 2 buckets.

[source,console]
--------------------------------------------------
POST /sales/_search?size=0
{
  "aggs" : {
    "prices" : {
      "variable_width_histogram" : {
        "field" : "price",
        "buckets" : 2
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:sales]

Response:

[source,console-result]
--------------------------------------------------
{
  ...
  "aggregations": {
    "prices" : {
      "buckets": [
        {
          "min": 10.0,
          "key": 30.0,
          "max": 50.0,
          "doc_count": 2
        },
        {
          "min": 150.0,
          "key": 185.0,
          "max": 200.0,
          "doc_count": 5
        }
      ]
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]

IMPORTANT: This aggregation cannot currently be nested under any aggregation that collects from more than a single bucket.

==== Clustering Algorithm

Each shard fetches the first `initial_buffer` documents and stores them in memory. Once the buffer is full, these documents
are sorted and linearly separated into `3/4 * shard_size` buckets.
Next, each remaining document is either collected into the nearest bucket, or placed into a new bucket if it is distant
from all the existing ones. At most `shard_size` total buckets are created.
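
To make the shard-side pass concrete, here is a simplified Python sketch of the idea. This is a toy model, not the actual implementation: in particular, the fixed distance threshold for opening a new bucket (`new_bucket_distance`) is an invented simplification.

[source,python]
--------------------------------------------------
def collect_shard(values, shard_size=500, initial_buffer=5000, new_bucket_distance=10.0):
    """Toy model of the per-shard phase: buffer, seed buckets, then one-pass assignment."""
    # Phase 1: buffer the first `initial_buffer` values, sort them, and
    # linearly separate them into 3/4 * shard_size seed buckets.
    buffered, rest = sorted(values[:initial_buffer]), values[initial_buffer:]
    num_seeds = max(1, (3 * shard_size) // 4)
    step = max(1, len(buffered) // num_seeds)
    buckets = [buffered[i:i + step] for i in range(0, len(buffered), step)]

    # Phase 2: send each remaining value to the bucket with the nearest centroid,
    # or open a new bucket if the value is distant from all existing centroids
    # and the `shard_size` cap has not been reached.
    for v in rest:
        centroids = [sum(b) / len(b) for b in buckets]
        i = min(range(len(buckets)), key=lambda j: abs(centroids[j] - v))
        if abs(centroids[i] - v) > new_bucket_distance and len(buckets) < shard_size:
            buckets.append([v])
        else:
            buckets[i].append(v)
    return buckets
--------------------------------------------------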

In the reduce step, the coordinating node sorts the buckets from all shards by their centroids. Then, the two buckets
with the nearest centroids are repeatedly merged until the target number of buckets is achieved.
This merging procedure is a form of https://en.wikipedia.org/wiki/Hierarchical_clustering[agglomerative hierarchical clustering].
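
The merge step can be sketched the same way. Below, each bucket is reduced to a `(centroid, doc_count, min, max)` summary; again, this is an illustrative simplification rather than the actual implementation.

[source,python]
--------------------------------------------------
def reduce_buckets(shard_buckets, target_buckets=10):
    """Toy model of the reduce step: repeatedly merge the two nearest-centroid buckets."""
    # Each entry is (centroid, doc_count, min, max). Sorting orders the buckets
    # by centroid, so the nearest pair is always an adjacent pair.
    buckets = sorted(shard_buckets)
    while len(buckets) > target_buckets:
        i = min(range(len(buckets) - 1),
                key=lambda j: buckets[j + 1][0] - buckets[j][0])
        (c1, n1, lo1, hi1), (c2, n2, lo2, hi2) = buckets[i], buckets[i + 1]
        merged = ((c1 * n1 + c2 * n2) / (n1 + n2),  # doc-count-weighted centroid
                  n1 + n2, min(lo1, lo2), max(hi1, hi2))
        buckets[i:i + 2] = [merged]
    return buckets
--------------------------------------------------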

TIP: A shard can return fewer than `shard_size` buckets, but it cannot return more.

==== Shard size

The `shard_size` parameter specifies the number of buckets that the coordinating node will request from each shard.
A higher `shard_size` leads each shard to produce smaller buckets. This reduces the likelihood of buckets overlapping
after the reduction step. Increasing the `shard_size` will improve the accuracy of the histogram, but it will
also make it more expensive to compute the final result because bigger priority queues will have to be managed on a
shard level, and the data transfers between the nodes and the client will be larger.

TIP: Parameters `buckets`, `shard_size`, and `initial_buffer` are optional. By default, `buckets = 10`, `shard_size = 500` and `initial_buffer = min(50 * shard_size, 50000)`.
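
For example, a request that sets all three parameters explicitly might look like the following (the values are illustrative, not recommendations):

[source,console]
--------------------------------------------------
POST /sales/_search?size=0
{
  "aggs" : {
    "prices" : {
      "variable_width_histogram" : {
        "field" : "price",
        "buckets" : 2,
        "shard_size" : 100,
        "initial_buffer" : 1000
      }
    }
  }
}
--------------------------------------------------
// TEST[setup:sales]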

==== Initial Buffer

The `initial_buffer` parameter can be used to specify the number of individual documents that will be stored in memory
on a shard before the initial bucketing algorithm is run. The bucket distribution is determined using this sample
of `initial_buffer` documents. So, although a higher `initial_buffer` will use more memory, it will lead to more representative
clusters.

==== Bucket bounds are approximate

During the reduce step, the coordinating node continuously merges the two buckets with the nearest centroids. If two buckets have
overlapping bounds but distant centroids, then it is possible that they will not be merged. Because of this, after
reduction the maximum value in some interval (`max`) might be greater than the minimum value in the subsequent
bucket (`min`). To reduce the impact of this error, when such an overlap occurs, the bound between these intervals is adjusted to be `(max + min) / 2`.
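
As a worked example: if one bucket ends at `max = 60.0` and the next begins at `min = 40.0`, the shared bound is moved to `(60.0 + 40.0) / 2 = 50.0`. A minimal sketch of that adjustment:

[source,python]
--------------------------------------------------
def adjust_overlap(prev_max, next_min):
    """If adjacent buckets overlap after reduction, split the difference."""
    if prev_max > next_min:
        bound = (prev_max + next_min) / 2
        return bound, bound  # new prev_max, new next_min
    return prev_max, next_min

# adjust_overlap(60.0, 40.0) -> (50.0, 50.0)
--------------------------------------------------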

TIP: Bucket bounds are very sensitive to outliers.