- Added docs for the value_count aggregation

- Fixed typos in the terms facets docs
- Fixed aggregation docs layout
- Added docs for shard_size in the terms aggregation
uboness 2013-11-29 12:35:25 +01:00
parent 630641f292
commit afb0d119e4
8 changed files with 112 additions and 22 deletions

View File

@ -27,6 +27,7 @@ The interesting part comes next, since each bucket effectively defines a documen
NOTE: Bucketing aggregations can have sub-aggregations (bucketing or metric). The sub aggregations will be computed for
each of the buckets their parent aggregation generates. There is no hard limit on the level/depth of nested aggregations (one can nest an aggregation under a "parent" aggregation which is itself a sub-aggregation of another higher-level aggregation).
[float]
=== Structuring Aggregations
The following snippet captures the basic structure of aggregations:
@ -46,6 +47,7 @@ The following snippet captures the basic structure of aggregations:
The `aggregations` object (a.k.a `aggs` for short) in the json holds the aggregations to be computed. Each aggregation is associated with a logical name that the user defines (e.g. if the aggregation computes the average price, then it'll make sense to name it `avg_price`). These logical names will also be used to uniquely identify the aggregations in the response. Each aggregation has a specific type (`<aggregation_type>` in the above snippet) and is typically the first key within the named aggregation body. Each type of aggregation defines its own body, depending on the nature of the aggregation (e.g. an `avg` aggregation on a specific field will define the field on which the avg will be calculated). At the same level of the aggregation type definition, one can optionally define a set of additional aggregations, though this only makes sense if the aggregation you defined is of a bucketing nature. In this scenario, the sub-aggregations you define on the bucketing aggregation level will be computed for all the buckets built by the bucketing aggregation. For example, if you define a set of aggregations under the `range` aggregation, the sub-aggregations will be computed for each of the range buckets that are defined.
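For example, a minimal sketch (the `price` field and the aggregation names here are hypothetical) that nests an `avg` sub-aggregation under a `range` bucketing aggregation could look like this:
[source,js]
--------------------------------------------------
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "ranges" : [
                    { "to" : 50 },
                    { "from" : 50, "to" : 100 },
                    { "from" : 100 }
                ]
            },
            "aggs" : {
                "avg_price" : { "avg" : { "field" : "price" } }
            }
        }
    }
}
--------------------------------------------------
Here `avg_price` would be computed once for each of the three range buckets.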
[float]
==== Values Source
Some aggregations work on values extracted from the aggregated documents. Typically, the values will be extracted from a specific document field which is set under the `field` setting for the aggregation. It is also possible to define a <<modules-scripting,script>> that will generate the values (per document).
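As a hedged illustration (the `price` field and the conversion rate are made up), the same metric can source its values either from a field or from a script:
[source,js]
--------------------------------------------------
{
    "aggs" : {
        "avg_price" : { "avg" : { "field" : "price" } },
        "avg_price_usd" : {
            "avg" : {
                "script" : "doc['price'].value * rate",
                "params" : { "rate" : 1.2 }
            }
        }
    }
}
--------------------------------------------------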
@ -59,12 +61,36 @@ When working with scripts, the `script_lang` and `params` settings can also be d
Scripts can generate a single value or multiple values per document. When generating multiple values, one can use the `script_values_sorted` setting to indicate whether these values are sorted or not. Internally, elasticsearch can perform optimizations when dealing with sorted values (for example, with the `min` aggregation, knowing the values are sorted, elasticsearch will skip iterating over all the values and rely on the first value in the list to be the minimum value among all the values associated with the same document).
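A sketch of such a script-based aggregation (assuming a hypothetical multi-valued `prices` field whose per-document values are known to be sorted):
[source,js]
--------------------------------------------------
{
    "aggs" : {
        "min_price" : {
            "min" : {
                "script" : "doc['prices'].values",
                "script_values_sorted" : true
            }
        }
    }
}
--------------------------------------------------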
[float]
=== Metrics Aggregations
The aggregations in this family compute metrics based on values extracted in one way or another from the documents that
are being aggregated. The values are typically extracted from the fields of the document (using the field data), but
can also be generated using scripts. Some aggregations output a single metric (e.g. `avg`) and are called `single-value
metrics aggregations`, others generate multiple metrics (e.g. `stats`) and are called `multi-value metrics aggregations`.
The distinction between single-value and multi-value metrics aggregations plays a role when these aggregations serve as
direct sub-aggregations of some bucket aggregations (some bucket aggregations enable you to sort the returned buckets based
on the metrics in each bucket).
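For example, a hedged sketch (the `gender` and `height` fields are hypothetical) of a `terms` bucket aggregation sorted by a single-value `avg` sub-aggregation:
[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "avg_height" : "desc" }
            },
            "aggs" : {
                "avg_height" : { "avg" : { "field" : "height" } }
            }
        }
    }
}
--------------------------------------------------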
[float]
=== Bucket Aggregations
Bucket aggregations don't calculate metrics over fields like the metrics aggregations do, but instead, they create
buckets of documents. Each bucket is associated with a criterion (depending on the aggregation type) that determines
whether or not a document in the current context "falls" into it. In other words, the buckets effectively define document
sets. In addition to the buckets themselves, the `bucket` aggregations also compute and return the number of documents
that "fell into" each bucket.
Bucket aggregations, as opposed to `metrics` aggregations, can hold sub-aggregations. These sub-aggregations will be
aggregated for each of the buckets created by their "parent" bucket aggregation.
There are different bucket aggregators, each with a different "bucketing" strategy. Some define a single bucket, some
define a fixed number of buckets, and others dynamically create the buckets during the aggregation process.
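As a rough illustration of the three strategies (the field names are made up): `global` defines a single bucket, `range` defines a fixed set of buckets, and `terms` creates buckets dynamically, one per unique term it encounters:
[source,js]
--------------------------------------------------
{
    "aggs" : {
        "everything" : { "global" : {} },
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "ranges" : [ { "to" : 100 }, { "from" : 100 } ]
            }
        },
        "makers" : { "terms" : { "field" : "maker" } }
    }
}
--------------------------------------------------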
include::aggregations/metrics.asciidoc[]
include::aggregations/bucket.asciidoc[]

View File

@ -1,11 +1,4 @@
[[search-aggregations-bucket]]
include::bucket/global-aggregation.asciidoc[]

View File

@ -51,7 +51,9 @@ Response:
--------------------------------------------------
[[date-format-pattern]]
==== Date Format/Pattern
NOTE: this information was copied from http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html[JodaDate]
All ASCII letters are reserved as format pattern letters, which are defined as follows:

View File

@ -40,15 +40,36 @@ Response:
}
--------------------------------------------------
By default, the `terms` aggregation will return the buckets for the top ten terms ordered by the `doc_count`. One can
change this default behaviour by setting the `size` parameter.
==== Size & Shard Size
The `size` parameter can be set to define how many term buckets should be returned out of the overall terms list. By
default, the node coordinating the search process will request each shard to provide its own top `size` term buckets
and once all shards respond, it will reduce the results to the final list that will then be returned to the client.
This means that if the number of unique terms is greater than `size`, the returned list is slightly off and not accurate
(it could be that the term counts are slightly off and it could even be that a term that should have been in the top
`size` buckets was not returned).
The higher the requested `size` is, the more accurate the results will be, but also, the more expensive it will be to
compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data
transfers between the nodes and the client).
The `shard_size` parameter can be used to minimize the extra work that comes with a bigger requested `size`. When defined,
it will determine how many terms the coordinating node will request from each shard. Once all the shards respond, the
coordinating node will then reduce them to a final result which will be based on the `size` parameter - this way,
one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to
the client.
NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, elasticsearch will
override it and reset it to be equal to `size`.
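A minimal sketch (the `product` field is hypothetical) that returns the top 10 terms while asking each shard for its top 50:
[source,js]
--------------------------------------------------
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "size" : 10,
                "shard_size" : 50
            }
        }
    }
}
--------------------------------------------------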
==== Order
The order of the buckets can be customized by setting the `order` parameter. By default, the buckets are ordered by
their `doc_count` descending. It is also possible to change this behaviour as follows:
Ordering the buckets by their `doc_count` in an ascending manner:
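A minimal sketch using the built-in `_count` order key (the `gender` field is hypothetical):
[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "_count" : "asc" }
            }
        }
    }
}
--------------------------------------------------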

View File

@ -1,7 +1,4 @@
[[search-aggregations-metrics]]
include::metrics/min-aggregation.asciidoc[]
@ -14,3 +11,5 @@ include::metrics/avg-aggregation.asciidoc[]
include::metrics/stats-aggregation.asciidoc[]
include::metrics/extendedstats-aggregation.asciidoc[]
include::metrics/valuecount-aggregation.asciidoc[]

View File

@ -3,7 +3,7 @@
A `multi-value` metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
The `extended_stats` aggregation is an extended version of the <<search-aggregations-metrics-stats-aggregation,`stats`>> aggregation, where additional metrics are added such as `sum_of_squares`, `variance` and `std_deviation`.
Assuming the data consists of documents representing exam grades (between 0 and 100) of students
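A hedged sketch of such a request (the `grade` field is assumed from the framing above, and the aggregation name `grades_stats` is illustrative):
[source,js]
--------------------------------------------------
{
    "aggs" : {
        "grades_stats" : { "extended_stats" : { "field" : "grade" } }
    }
}
--------------------------------------------------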

View File

@ -0,0 +1,49 @@
[[search-aggregations-metrics-valuecount-aggregation]]
=== Value Count
A `single-value` metrics aggregation that counts the number of values that are extracted from the aggregated documents.
These values can be extracted either from specific fields in the documents, or be generated by a provided script. Typically,
this aggregator will be used in conjunction with other single-value aggregations. For example, when computing the `avg`,
one might be interested in the number of values the average is computed over (see the combined sketch after the response below).
[source,js]
--------------------------------------------------
{
"aggs" : {
"grades_count" : { "value_count" : { "field" : "grade" } }
}
}
--------------------------------------------------
Response:
[source,js]
--------------------------------------------------
{
...
"aggregations": {
"grades_count": {
"value": 10
}
}
}
--------------------------------------------------
The name of the aggregation (`grades_count` above) also serves as the key by which the aggregation result can be
retrieved from the returned response.
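For instance, a hedged sketch pairing the count with an `avg` over the same `grade` field (the aggregation names are illustrative):
[source,js]
--------------------------------------------------
{
    "aggs" : {
        "avg_grade" : { "avg" : { "field" : "grade" } },
        "grades_count" : { "value_count" : { "field" : "grade" } }
    }
}
--------------------------------------------------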
==== Script
Counting the values generated by a script:
[source,js]
--------------------------------------------------
{
...,
"aggs" : {
"grades_count" : { "value_count" : { "script" : "doc['grade'].value" } }
}
}
--------------------------------------------------

View File

@ -42,12 +42,12 @@ The higher the requested `size` is, the more accurate the results will be,
but also, the more expensive it will be to compute the final results (both
due to bigger priority queues that are managed on a shard level and due to
bigger data transfers between the nodes and the client). In an attempt to
minimize the extra work that comes with bigger requested `size`, the
`shard_size` parameter was introduced. When defined, it will determine
how many terms the coordinating node will request from each shard. Once
all the shards respond, the coordinating node will then reduce them
to a final result which will be based on the `size` parameter - this way,
one can increase the accuracy of the returned terms and avoid the overhead
of streaming a big list of terms back to the client.
Note that `shard_size` cannot be smaller than `size`... if that's the case