[DOCS] Rewrite aggs overview (#64318) (#64410)

- Replaces more abstract docs about object structure and values source with task-based examples.
- Relocates several sections from the current `misc.asciidoc` file.
- Alphabetically sorts agg categories in the nav.
- Removes the matrix agg family. Moves the stats matrix agg under the metric agg family.
James Rodewig 2020-10-30 10:31:16 -04:00 committed by GitHub
parent ed4f11f203
commit 0c28bccbc8
7 changed files with 442 additions and 280 deletions


@ -3,117 +3,432 @@
[partintro]
--
The aggregations framework helps provide aggregated data based on a search query. It is based on simple building blocks
called aggregations, that can be composed in order to build complex summaries of the data.
An aggregation summarizes your data as metrics, statistics, or other analytics.
Aggregations help you answer questions like:
An aggregation can be seen as a _unit-of-work_ that builds analytic information over a set of documents. The context of
the execution defines what this document set is (e.g. a top-level aggregation executes within the context of the executed
query/filters of the search request).
* What's the average load time for my website?
* Who are my most valuable customers based on transaction volume?
* What would be considered a large file on my network?
* How many products are in each product category?
There are many different types of aggregations, each with its own purpose and output. To better understand these types,
it is often easier to break them into four main families:
{es} organizes aggregations into three categories:
<<search-aggregations-bucket, _Bucketing_>>::
A family of aggregations that build buckets, where each bucket is associated with a _key_ and a document
criterion. When the aggregation is executed, all the buckets criteria are evaluated on every document in
the context and when a criterion matches, the document is considered to "fall in" the relevant bucket.
By the end of the aggregation process, we'll end up with a list of buckets - each one with a set of
documents that "belong" to it.
* <<search-aggregations-metrics,Metric>> aggregations that calculate metrics,
such as a sum or average, from field values.
<<search-aggregations-metrics, _Metric_>>::
Aggregations that keep track and compute metrics over a set of documents.
* <<search-aggregations-bucket,Bucket>> aggregations that
group documents into buckets, also called bins, based on field values, ranges,
or other criteria.
<<search-aggregations-matrix, _Matrix_>>::
A family of aggregations that operate on multiple fields and produce a matrix result based on the
values extracted from the requested document fields. Unlike metric and bucket aggregations, this
aggregation family does not yet support scripting.
<<search-aggregations-pipeline, _Pipeline_>>::
Aggregations that aggregate the output of other aggregations and their associated metrics
The interesting part comes next. Since each bucket effectively defines a document set (all documents belonging to
the bucket), one can potentially associate aggregations on the bucket level, and those will execute within the context
of that bucket. This is where the real power of aggregations kicks in: *aggregations can be nested!*
NOTE: Bucketing aggregations can have sub-aggregations (bucketing or metric). The sub-aggregations will be computed for
the buckets which their parent aggregation generates. There is no hard limit on the level/depth of nested
aggregations (one can nest an aggregation under a "parent" aggregation, which is itself a sub-aggregation of
another higher-level aggregation).
NOTE: Aggregations operate on the `double` representation of
the data. As a consequence, the result may be approximate when running on longs
whose absolute value is greater than `2^53`.
* <<search-aggregations-pipeline,Pipeline>> aggregations that take input from
other aggregations instead of documents or fields.
[discrete]
== Structuring Aggregations
[[run-an-agg]]
=== Run an aggregation
The following snippet captures the basic structure of aggregations:
You can run aggregations as part of a <<search-your-data,search>> by specifying the <<search-search,search API>>'s `aggs` parameter. The
following search runs a
<<search-aggregations-bucket-terms-aggregation,terms aggregation>> on
`my-field`:
[source,js]
--------------------------------------------------
"aggregations" : {
"<aggregation_name>" : {
"<aggregation_type>" : {
<aggregation_body>
}
[,"meta" : { [<meta_data_body>] } ]?
[,"aggregations" : { [<sub_aggregation>]+ } ]?
[source,console]
----
GET /my-index-000001/_search
{
"aggs": {
"my-agg-name": {
"terms": {
"field": "my-field"
}
}
[,"<aggregation_name_2>" : { ... } ]*
}
}
--------------------------------------------------
// NOTCONSOLE
----
// TEST[setup:my_index]
// TEST[s/my-field/http.request.method/]
The `aggregations` object (the key `aggs` can also be used) in the JSON holds the aggregations to be computed. Each aggregation
is associated with a logical name that the user defines (e.g. if the aggregation computes the average price, then it would
make sense to name it `avg_price`). These logical names will also be used to uniquely identify the aggregations in the
response. Each aggregation has a specific type (`<aggregation_type>` in the above snippet) and is typically the first
key within the named aggregation body. Each type of aggregation defines its own body, depending on the nature of the
aggregation (e.g. an `avg` aggregation on a specific field will define the field on which the average will be calculated).
At the same level of the aggregation type definition, one can optionally define a set of additional aggregations,
though this only makes sense if the aggregation you defined is of a bucketing nature. In this scenario, the
sub-aggregations you define on the bucketing aggregation level will be computed for all the buckets built by the
bucketing aggregation. For example, if you define a set of aggregations under the `range` aggregation, the
sub-aggregations will be computed for the range buckets that are defined.
Aggregation results are in the response's `aggregations` object:
[source,console-result]
----
{
"took": 78,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 5,
"relation": "eq"
},
"max_score": 1.0,
"hits": [...]
},
"aggregations": {
"my-agg-name": { <1>
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
}
}
----
// TESTRESPONSE[s/"took": 78/"took": "$body.took"/]
// TESTRESPONSE[s/\.\.\.$/"took": "$body.took", "timed_out": false, "_shards": "$body._shards", /]
// TESTRESPONSE[s/"hits": \[\.\.\.\]/"hits": "$body.hits.hits"/]
// TESTRESPONSE[s/"buckets": \[\]/"buckets":\[\{"key":"get","doc_count":5\}\]/]
<1> Results for the `my-agg-name` aggregation.
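As a rough client-side illustration of the request and response shapes above, the following Python sketch builds the same `aggs` body and reads bucket counts out of a response. The helper names `terms_agg_request` and `bucket_counts` are hypothetical and not part of any {es} client. The sample response is hand-written to mirror the example, not fetched from a cluster.

```python
import json


def terms_agg_request(agg_name, field):
    """Build a search request body containing one terms aggregation."""
    return {"aggs": {agg_name: {"terms": {"field": field}}}}


def bucket_counts(response, agg_name):
    """Map each bucket key to its doc_count for a terms-style aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hand-written stand-in for a search response (assumption: one bucket,
# matching the substituted test response in the example above).
sample_response = {
    "aggregations": {
        "my-agg-name": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [{"key": "get", "doc_count": 5}],
        }
    }
}

print(json.dumps(terms_agg_request("my-agg-name", "my-field"), indent=2))
print(bucket_counts(sample_response, "my-agg-name"))
```

The aggregation name chosen in the request is the same key used to look up results in the response's `aggregations` object.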
[discrete]
=== Values Source
[[change-agg-scope]]
=== Change an aggregation's scope
Some aggregations work on values extracted from the aggregated documents. Typically, the values will be extracted from
a specific document field which is set using the `field` key for the aggregations. It is also possible to define a
<<modules-scripting,`script`>> which will generate the values (per document).
Use the `query` parameter to limit the documents on which an aggregation runs:
When both `field` and `script` settings are configured for the aggregation, the script will be treated as a
`value script`. While normal scripts are evaluated on a document level (i.e. the script has access to all the data
associated with the document), value scripts are evaluated on the *value* level. In this mode, the values are extracted
from the configured `field` and the `script` is used to apply a "transformation" over these value/s.
[source,console]
----
GET /my-index-000001/_search
{
"query": {
"range": {
"@timestamp": {
"gte": "now-1d/d",
"lt": "now/d"
}
}
},
"aggs": {
"my-agg-name": {
"terms": {
"field": "my-field"
}
}
}
}
----
// TEST[setup:my_index]
// TEST[s/my-field/http.request.method/]
["NOTE",id="aggs-script-note"]
===============================
When working with scripts, the `lang` and `params` settings can also be defined. The former defines the scripting
language which is used (assuming the proper language is available in Elasticsearch, either by default or as a plugin). The latter
enables defining all the "dynamic" expressions in the script as parameters, which enables the script to keep itself static
between calls (this will ensure the use of the cached compiled scripts in Elasticsearch).
===============================
[discrete]
[[return-only-agg-results]]
=== Return only aggregation results
Elasticsearch uses the type of the field in the mapping in order to figure out
how to run the aggregation and format the response. However there are two cases
in which Elasticsearch cannot figure out this information: unmapped fields (for
instance in the case of a search request across multiple indices, and only some
of them have a mapping for the field) and pure scripts. For those cases, it is
possible to give Elasticsearch a hint using the `value_type` option, which
accepts the following values: `string`, `long` (works for all integer types),
`double` (works for all decimal types like `float` or `scaled_float`), `date`,
`ip` and `boolean`.
By default, searches containing an aggregation return both search hits and
aggregation results. To return only aggregation results, set `size` to `0`:
[source,console]
----
GET /my-index-000001/_search
{
"size": 0,
"aggs": {
"my-agg-name": {
"terms": {
"field": "my-field"
}
}
}
}
----
// TEST[setup:my_index]
// TEST[s/my-field/http.request.method/]
[discrete]
[[run-multiple-aggs]]
=== Run multiple aggregations
You can specify multiple aggregations in the same request:
[source,console]
----
GET /my-index-000001/_search
{
"aggs": {
"my-first-agg-name": {
"terms": {
"field": "my-field"
}
},
"my-second-agg-name": {
"avg": {
"field": "my-other-field"
}
}
}
}
----
// TEST[setup:my_index]
// TEST[s/my-field/http.request.method/]
// TEST[s/my-other-field/http.response.bytes/]
[discrete]
[[run-sub-aggs]]
=== Run sub-aggregations
Bucket aggregations support bucket or metric sub-aggregations. For example, a
terms aggregation with an <<search-aggregations-metrics-avg-aggregation,avg>>
sub-aggregation calculates an average value for each bucket of documents. There
is no level or depth limit for nesting sub-aggregations.
[source,console]
----
GET /my-index-000001/_search
{
"aggs": {
"my-agg-name": {
"terms": {
"field": "my-field"
},
"aggs": {
"my-sub-agg-name": {
"avg": {
"field": "my-other-field"
}
}
}
}
}
}
----
// TEST[setup:my_index]
// TEST[s/_search/_search?size=0/]
// TEST[s/my-field/http.request.method/]
// TEST[s/my-other-field/http.response.bytes/]
The response nests sub-aggregation results under their parent aggregation:
[source,console-result]
----
{
...
"aggregations": {
"my-agg-name": { <1>
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "foo",
"doc_count": 5,
"my-sub-agg-name": { <2>
"value": 75.0
}
}
]
}
}
}
----
// TESTRESPONSE[s/\.\.\./"took": "$body.took", "timed_out": false, "_shards": "$body._shards", "hits": "$body.hits",/]
// TESTRESPONSE[s/"key": "foo"/"key": "get"/]
// TESTRESPONSE[s/"value": 75.0/"value": $body.aggregations.my-agg-name.buckets.0.my-sub-agg-name.value/]
<1> Results for the parent aggregation, `my-agg-name`.
<2> Results for `my-agg-name`'s sub-aggregation, `my-sub-agg-name`.
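To make the nesting concrete, here is a minimal Python sketch that pulls each bucket's sub-aggregation value out of a response shaped like the one above. The function name `sub_agg_values` is hypothetical, and the sample response is hand-written from the example rather than returned by a cluster.

```python
def sub_agg_values(response, parent_name, sub_name):
    """Return {bucket key: sub-aggregation value} for a bucket aggregation."""
    buckets = response["aggregations"][parent_name]["buckets"]
    # Each bucket carries its sub-aggregation results under the sub-agg's name.
    return {bucket["key"]: bucket[sub_name]["value"] for bucket in buckets}


# Hand-written stand-in mirroring the example response above.
sample_response = {
    "aggregations": {
        "my-agg-name": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "foo",
                    "doc_count": 5,
                    "my-sub-agg-name": {"value": 75.0},
                }
            ],
        }
    }
}

print(sub_agg_values(sample_response, "my-agg-name", "my-sub-agg-name"))
```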
[discrete]
[[add-metadata-to-an-agg]]
=== Add custom metadata
Use the `meta` object to associate custom metadata with an aggregation:
[source,console]
----
GET /my-index-000001/_search
{
"aggs": {
"my-agg-name": {
"terms": {
"field": "my-field"
},
"meta": {
"my-metadata-field": "foo"
}
}
}
}
----
// TEST[setup:my_index]
// TEST[s/_search/_search?size=0/]
The response returns the `meta` object in place:
[source,console-result]
----
{
...
"aggregations": {
"my-agg-name": {
"meta": {
"my-metadata-field": "foo"
},
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
}
}
----
// TESTRESPONSE[s/\.\.\./"took": "$body.took", "timed_out": false, "_shards": "$body._shards", "hits": "$body.hits",/]
[discrete]
[[return-agg-type]]
=== Return the aggregation type
By default, aggregation results include the aggregation's name but not its type.
To return the aggregation type, use the `typed_keys` query parameter.
[source,console]
----
GET /my-index-000001/_search?typed_keys
{
"aggs": {
"my-agg-name": {
"histogram": {
"field": "my-field",
"interval": 1000
}
}
}
}
----
// TEST[setup:my_index]
// TEST[s/typed_keys/typed_keys&size=0/]
// TEST[s/my-field/http.response.bytes/]
The response returns the aggregation type as a prefix to the aggregation's name.
IMPORTANT: Some aggregations return a different aggregation type from the
type in the request. For example, the terms,
<<search-aggregations-bucket-significantterms-aggregation,significant terms>>,
and <<search-aggregations-metrics-percentile-aggregation,percentiles>>
aggregations return different aggregation types depending on the data type of
the aggregated field.
[source,console-result]
----
{
...
"aggregations": {
"histogram#my-agg-name": { <1>
"buckets": []
}
}
}
----
// TESTRESPONSE[s/\.\.\./"took": "$body.took", "timed_out": false, "_shards": "$body._shards", "hits": "$body.hits",/]
// TESTRESPONSE[s/"buckets": \[\]/"buckets":\[\{"key":1070000.0,"doc_count":5\}\]/]
<1> The aggregation type, `histogram`, followed by a `#` separator and the aggregation's name, `my-agg-name`.
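A client that requests `typed_keys` needs to separate the type prefix from the aggregation name. The following Python sketch shows one way to do that. The helper `split_typed_key` is hypothetical, and it assumes the type prefix itself never contains a `#`.

```python
def split_typed_key(key):
    """Split 'type#name' into (type, name); return (None, key) if untyped."""
    agg_type, separator, name = key.partition("#")
    if not separator:
        # No '#' found: the key is a plain aggregation name.
        return None, key
    return agg_type, name


print(split_typed_key("histogram#my-agg-name"))
print(split_typed_key("my-agg-name"))
```

Splitting on the first `#` only keeps any later `#` characters as part of the aggregation name.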
[discrete]
[[use-scripts-in-an-agg]]
=== Use scripts in an aggregation
Some aggregations support <<modules-scripting,scripts>>. You can
use a `script` to extract or generate values for the aggregation:
[source,console]
----
GET /my-index-000001/_search
{
"aggs": {
"my-agg-name": {
"histogram": {
"interval": 1000,
"script": {
"source": "doc['my-field'].value.length()"
}
}
}
}
}
----
// TEST[setup:my_index]
// TEST[s/my-field/http.request.method/]
If you also specify a `field`, the `script` modifies the field values used in
the aggregation. The following aggregation uses a script to modify `my-field`
values:
[source,console]
----
GET /my-index-000001/_search
{
"aggs": {
"my-agg-name": {
"histogram": {
"field": "my-field",
"interval": 1000,
"script": "_value / 1000"
}
}
}
}
----
// TEST[setup:my_index]
// TEST[s/my-field/http.response.bytes/]
Some aggregations only work on specific data types. Use the `value_type`
parameter to specify a data type for a script-generated value or an unmapped
field. `value_type` accepts the following values:
* `boolean`
* `date`
* `double`, used for all floating-point numbers
* `long`, used for all integers
* `ip`
* `string`
[source,console]
----
GET /my-index-000001/_search
{
"aggs": {
"my-agg-name": {
"histogram": {
"field": "my-field",
"interval": 1000,
"script": "_value / 1000",
"value_type": "long"
}
}
}
}
----
// TEST[setup:my_index]
// TEST[s/my-field/http.response.bytes/]
[discrete]
[[agg-caches]]
=== Aggregation caches
For faster responses, {es} caches the results of frequently run aggregations in
the <<shard-request-cache,shard request cache>>. To get cached results, use the
same <<shard-and-node-preference,`preference` string>> for each search. If you
don't need search hits, <<return-only-agg-results,set `size` to `0`>> to avoid
filling the cache.
{es} routes searches with the same preference string to the same shards. If the
shards' data doesn't change between searches, the shards return cached
aggregation results.
[discrete]
[[limits-for-long-values]]
=== Limits for `long` values
When running aggregations, {es} uses <<number,`double`>> values to hold and
represent numeric data. As a result, aggregations on <<number,`long`>> numbers
greater than +2^53^+ are approximate.
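The precision limit can be reproduced outside {es}. The Python sketch below relies only on the fact that Python's `float` is an IEEE 754 double. The helper `survives_double` is a hypothetical name used for illustration.

```python
LIMIT = 2 ** 53  # largest power of two below which every integer fits in a double


def survives_double(value):
    """Return True if an integer is unchanged after a round trip through a double."""
    return int(float(value)) == value


print(survives_double(LIMIT))      # 2^53 is exactly representable
print(survives_double(LIMIT + 1))  # 2^53 + 1 is rounded to a neighboring double
```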
--
include::aggregations/metrics.asciidoc[]
include::aggregations/bucket.asciidoc[]
include::aggregations/metrics.asciidoc[]
include::aggregations/pipeline.asciidoc[]
include::aggregations/matrix.asciidoc[]
include::aggregations/misc.asciidoc[]


@ -1,9 +0,0 @@
[[search-aggregations-matrix]]
== Matrix Aggregations
experimental[]
The aggregations in this family operate on multiple fields and produce a matrix result based on the values extracted from
the requested document fields. Unlike metric and bucket aggregations, this aggregation family does not yet support scripting.
include::matrix/stats-aggregation.asciidoc[]


@ -1,5 +1,5 @@
[[search-aggregations-matrix-stats-aggregation]]
=== Matrix Stats
=== Matrix stats aggregation
The `matrix_stats` aggregation is a numeric aggregation that computes the following statistics over a set of document fields:
@ -13,6 +13,9 @@ The `matrix_stats` aggregation is a numeric aggregation that computes the follow
`correlation`:: The covariance matrix scaled to a range of -1 to 1, inclusive. Describes the relationship between field
distributions.
IMPORTANT: Unlike other metric aggregations, the `matrix_stats` aggregation does
not support scripting.
//////////////////////////
[source,js]


@ -27,6 +27,8 @@ include::metrics/geobounds-aggregation.asciidoc[]
include::metrics/geocentroid-aggregation.asciidoc[]
include::matrix/stats-aggregation.asciidoc[]
include::metrics/max-aggregation.asciidoc[]
include::metrics/min-aggregation.asciidoc[]


@ -1,179 +0,0 @@
[[caching-heavy-aggregations]]
== Caching heavy aggregations
Frequently used aggregations (e.g. for display on the home page of a website)
can be cached for faster responses. These cached results are the same results
that would be returned by an uncached aggregation -- you will never get stale
results.
See <<shard-request-cache>> for more details.
[[returning-only-agg-results]]
== Returning only aggregation results
There are many occasions when aggregations are required but search hits are not. For these cases the hits can be ignored by
setting `size=0`. For example:
[source,console]
--------------------------------------------------
GET /my-index-000001/_search
{
"size": 0,
"aggregations": {
"my_agg": {
"terms": {
"field": "user.id"
}
}
}
}
--------------------------------------------------
// TEST[setup:my_index]
Setting `size` to `0` avoids executing the fetch phase of the search making the request more efficient.
[[agg-metadata]]
== Aggregation Metadata
You can associate a piece of metadata with individual aggregations at request time that will be returned in place
at response time.
Consider this example where we want to associate the color blue with our `terms` aggregation.
[source,console]
--------------------------------------------------
GET /my-index-000001/_search
{
"size": 0,
"aggs": {
"titles": {
"terms": {
"field": "title"
},
"meta": {
"color": "blue"
}
}
}
}
--------------------------------------------------
// TEST[setup:my_index]
Then that piece of metadata will be returned in place for our `titles` terms aggregation
[source,console-result]
--------------------------------------------------
{
"aggregations": {
"titles": {
"meta": {
"color": "blue"
},
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
]
}
},
...
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": "$body.took", "timed_out": false, "_shards": "$body._shards", "hits": "$body.hits"/]
[[returning-aggregation-type]]
== Returning the type of the aggregation
Sometimes you need to know the exact type of an aggregation in order to parse its results. The `typed_keys` parameter
can be used to change the aggregation's name in the response so that it will be prefixed by its internal type.
Considering the following <<search-aggregations-bucket-datehistogram-aggregation,`date_histogram` aggregation>> named
`requests_over_time` which has a sub <<search-aggregations-metrics-top-hits-aggregation, `top_hits` aggregation>> named
`top_users`:
[source,console]
--------------------------------------------------
GET /my-index-000001/_search?typed_keys
{
"aggregations": {
"requests_over_time": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "year"
},
"aggregations": {
"top_users": {
"top_hits": {
"size": 1,
"_source": ["user.id", "http.response.bytes", "message"]
}
}
}
}
}
}
--------------------------------------------------
// TEST[setup:my_index]
In the response, the aggregations names will be changed to respectively `date_histogram#requests_over_time` and
`top_hits#top_users`, reflecting the internal types of each aggregation:
[source,console-result]
--------------------------------------------------
{
"aggregations": {
"date_histogram#requests_over_time": { <1>
"buckets": [
{
"key_as_string": "2099-01-01T00:00:00.000Z",
"key": 4070908800000,
"doc_count": 5,
"top_hits#top_users": { <2>
"hits": {
"total": {
"value": 5,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my-index-000001",
"_type": "_doc",
"_id": "0",
"_score": 1.0,
"_source": {
"user": { "id": "kimchy"},
"message": "GET /search HTTP/1.1 200 1070000",
"http": { "response": { "bytes": 1070000 } }
}
}
]
}
}
}
]
}
},
...
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": "$body.took", "timed_out": false, "_shards": "$body._shards", "hits": "$body.hits"/]
<1> The name `requests_over_time` now contains the `date_histogram` prefix.
<2> The name `top_users` now contains the `top_hits` prefix.
NOTE: For some aggregations, it is possible that the returned type is not the same as the one provided with the
request. This is the case for Terms, Significant Terms and Percentiles aggregations, where the returned type
also contains information about the type of the targeted field: `lterms` (for a terms aggregation on a Long field),
`sigsterms` (for a significant terms aggregation on a String field), `tdigest_percentiles` (for a percentile
aggregation based on the TDigest algorithm).
[[indexing-aggregation-results]]
== Indexing aggregation results with {transforms}
<<transforms,{transforms-cap}>> enable you to convert existing {es} indices
into summarized indices, which provide opportunities for new insights and
analytics. You can use {transforms} to persistently index your aggregation
results into entity-centric indices.


@ -1213,3 +1213,33 @@ See <<size-your-shards>>.
=== elasticsearch-croneval parameters
See <<elasticsearch-croneval-parameters>>.
[role="exclude",id="caching-heavy-aggregations"]
=== Caching heavy aggregations
See <<agg-caches>>.
[role="exclude",id="returning-only-agg-results"]
=== Returning only aggregation results
See <<return-only-agg-results>>.
[role="exclude",id="agg-metadata"]
=== Aggregation metadata
See <<add-metadata-to-an-agg>>.
[role="exclude",id="returning-aggregation-type"]
=== Returning the type of the aggregation
See <<return-agg-type>>.
[role="exclude",id="indexing-aggregation-results"]
=== Indexing aggregation results with transforms
See <<transforms>>.
[role="exclude",id="search-aggregations-matrix"]
=== Matrix aggregations
See <<search-aggregations-matrix-stats-aggregation>>.


@ -233,7 +233,7 @@ shards. This is usually slower but more accurate.
Contains parameters for a search request:
`aggregations`:::
(Optional, <<_structuring_aggregations,aggregation object>>)
(Optional, <<search-aggregations,aggregation object>>)
Aggregations you wish to run during the search. See <<search-aggregations>>.
`query`:::