This commit adds a new field to the response of the terms aggregation called
`sum_other_doc_count` which is equal to the sum of the doc counts of the buckets
that did not make it to the list of top buckets. It is typically useful to have
a sector called eg. `other` when using terms aggregations to build pie charts.
Example query and response:
```json
GET test/_search?search_type=count
{
"aggs": {
"colors": {
"terms": {
"field": "color",
"size": 3
}
}
}
}
```
```json
{
[...],
"aggregations": {
"colors": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 4,
"buckets": [
{
"key": "blue",
"doc_count": 65
},
{
"key": "red",
"doc_count": 14
},
{
"key": "brown",
"doc_count": 3
}
]
}
}
}
```
Close#8213
The terms aggregation can now support sorting on multiple criteria by replacing the sort object with an array or sort object whose order signifies the priority of the sort. The existing syntax for sorting on a single criteria also still works.
Contributes to #6917
Replaces #7588
The terms aggregation can now support sorting on multiple criteria by replacing the sort object with an array or sort object whose order signifies the priority of the sort. The existing syntax for sorting on a single criteria also still works.
Contributes to #6917
The current implementation of 'date_histogram' does not understand
the `factor` parameter. Since the docs shouldn't raise false hopes,
I removed the section.
Closes#7277
A multi-bucket aggregation where multiple filters can be defined (each filter defines a bucket). The buckets will collect all the documents that match their associated filter.
This aggregation can be very useful when one wants to compare analytics between different criterias. It can also be accomplished using multiple definitions of the single filter aggregation, but here, the user will only need to define the sub-aggregations only once.
Closes#6118
This is only applicable when the order is set to _count. The upper bound of the error in the doc count is calculated by summing the doc count of the last term on each shard which did not return the term. The implementation calculates the error by summing the doc count for the last term on each shard for which the term IS returned and then subtracts this value from the sum of the doc counts for the last term from ALL shards.
Closes#6696
This commit adds the infrastructure to allow pluging in different
measures for computing the significance of a term.
Significance measures can be provided externally by overriding
- SignificanceHeuristic
- SignificanceHeuristicBuilder
- SignificanceHeuristicParser
closes#6561
A new "breadth_first" results collection mode allows upper branches of aggregation tree to be calculated and then pruned
to a smaller selection before advancing into executing collection on child branches.
Closes#6128
Significant terms internally maintain a priority queue per shard with a size potentially
lower than the number of terms. This queue uses the score as criterion to determine if
a bucket is kept or not. If many terms with low subsetDF score very high
but the `min_doc_count` is set high, this might result in no terms being
returned because the pq is filled with low frequent terms which are all sorted
out in the end.
This can be avoided by increasing the `shard_size` parameter to a higher value.
However, it is not immediately clear to which value this parameter must be set
because we can not know how many terms with low frequency are scored higher that
the high frequent terms that we are actually interested in.
On the other hand, if there is no routing of docs to shards involved, we can maybe
assume that the documents of classes and also the terms therein are distributed evenly
across shards. In that case it might be easier to not add documents to the pq that have
subsetDF <= `shard_min_doc_count` which can be set to something like
`min_doc_count`/number of shards because we would assume that even when summing up
the subsetDF across shards `min_doc_count` will not be reached.
closes#5998closes#6041
By default the date_/histogram returns all the buckets within the range of the data itself, that is, the documents with the smallest values (on which with histogram) will determine the min bucket (the bucket with the smallest key) and the documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when when requesting empty buckets (min_doc_count : 0), this causes a confusion, specifically, when the data is also filtered.
To understand why, let's look at an example:
Lets say the you're filtering your request to get all docs from the last month, and in the date_histogram aggs you'd like to slice the data per day. You also specify min_doc_count:0 so that you'd still get empty buckets for those days to which no document belongs. By default, if the first document that fall in this last month also happen to fall on the first day of the **second week** of the month, the date_histogram will **not** return empty buckets for all those days prior to that second week. The reason for that is that by default the histogram aggregations only start building buckets when they encounter documents (hence, missing on all the days of the first week in our example).
With extended_bounds, you now can "force" the histogram aggregations to start building buckets on a specific min values and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if the min_doc_count is greater than 0).
Note that (as the name suggest) extended_bounds is **not** filtering buckets. Meaning, if the min bounds is higher than the values extracted from the documents, the documents will still dictate what the min bucket will be (and the same goes to the extended_bounds.max and the max bucket). For filtering buckets, one should nest the histogram agg under a range filter agg with the appropriate min/max.
Closes#5224
Significance is related to the changes in document frequency observed between everyday use in the corpus and
frequency observed in the result set. The asciidocs include extensive details on the applications of this feature.
Closes#5146
Supports sorting on sub-aggs down the current hierarchy. This is supported as long as the aggregation in the specified order path are of a single-bucket type, where the last aggregation in the path points to either a single-bucket aggregation or a metrics one. If it's a single-bucket aggregation, the sort will be applied on the document count in the bucket (i.e. doc_count), and if it is a metrics type, the sort will be applied on the pointed out metric (in case of a single-metric aggregations, such as avg, the sort will be applied on the single metric value)
NOTE: this commit adds a constraint on what should be considered a valid aggregation name. Aggregations names must be alpha-numeric and may contain '-' and '_'.
Closes#5253