[DOCS] Added "Aggregation" to all aggs titles

Clinton Gormley 2014-05-13 01:35:58 +02:00
parent dfdc183ba6
commit 5b93255ec8
23 changed files with 126 additions and 126 deletions

@ -1,5 +1,5 @@
[[search-aggregations-bucket-datehistogram-aggregation]] [[search-aggregations-bucket-datehistogram-aggregation]]
=== Date Histogram === Date Histogram Aggregation
A multi-bucket aggregation similar to the <<search-aggregations-bucket-histogram-aggregation,histogram>> except it can A multi-bucket aggregation similar to the <<search-aggregations-bucket-histogram-aggregation,histogram>> except it can
only be applied on date values. Since dates are represented in elasticsearch internally as long values, it is possible only be applied on date values. Since dates are represented in elasticsearch internally as long values, it is possible

@ -1,5 +1,5 @@
[[search-aggregations-bucket-daterange-aggregation]] [[search-aggregations-bucket-daterange-aggregation]]
=== Date Range === Date Range Aggregation
A range aggregation that is dedicated for date values. The main difference between this aggregation and the normal <<search-aggregations-bucket-range-aggregation,range>> aggregation is that the `from` and `to` values can be expressed in Date Math expressions, and it is also possible to specify a date format by which the `from` and `to` response fields will be returned. A range aggregation that is dedicated for date values. The main difference between this aggregation and the normal <<search-aggregations-bucket-range-aggregation,range>> aggregation is that the `from` and `to` values can be expressed in Date Math expressions, and it is also possible to specify a date format by which the `from` and `to` response fields will be returned.
Note that this aggregation includes the `from` value and excludes the `to` value for each range. Note that this aggregation includes the `from` value and excludes the `to` value for each range.
@ -93,7 +93,7 @@ All ASCII letters are reserved as format pattern letters, which are defined as f
|' |escape for text |delimiter |' |escape for text |delimiter
|'' |single quote |literal |' |'' |single quote |literal |'
|======= |=======
The count of pattern letters determines the format. The count of pattern letters determines the format.
Text:: If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used if available. Text:: If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used if available.

@ -1,5 +1,5 @@
[[search-aggregations-bucket-filter-aggregation]] [[search-aggregations-bucket-filter-aggregation]]
=== Filter === Filter Aggregation
Defines a single bucket of all the documents in the current document set context that match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents. Defines a single bucket of all the documents in the current document set context that match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents.

@ -1,5 +1,5 @@
[[search-aggregations-bucket-geodistance-aggregation]] [[search-aggregations-bucket-geodistance-aggregation]]
=== Geo Distance === Geo Distance Aggregation
A multi-bucket aggregation that works on `geo_point` fields and conceptually works very similarly to the <<search-aggregations-bucket-range-aggregation,range>> aggregation. The user can define a point of origin and a set of distance range buckets. The aggregation evaluates the distance of each document value from the origin point and determines the buckets it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the origin falls within the distance range of the bucket). A multi-bucket aggregation that works on `geo_point` fields and conceptually works very similarly to the <<search-aggregations-bucket-range-aggregation,range>> aggregation. The user can define a point of origin and a set of distance range buckets. The aggregation evaluates the distance of each document value from the origin point and determines the buckets it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the origin falls within the distance range of the bucket).
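As a rough illustration of the bucketing rule described above, the sketch below computes a great-circle distance and selects the ranges that contain it. The haversine formula, the kilometre unit, the sample coordinates, and the names `haversine_km` and `distance_buckets` are all assumptions made for this example; they are not taken from the Elasticsearch implementation. The sketch also assumes the same `from`-inclusive, `to`-exclusive convention as the `range` aggregation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_buckets(origin, point, ranges):
    """Return every (from, to) range whose interval contains the distance.

    None stands for an open end; `from` is inclusive, `to` is exclusive.
    """
    d = haversine_km(origin[0], origin[1], point[0], point[1])
    matches = []
    for lo, hi in ranges:
        if (lo is None or d >= lo) and (hi is None or d < hi):
            matches.append((lo, hi))
    return matches


# Hypothetical origin and distance rings (in km).
origin = (52.3760, 4.894)
ranges = [(None, 100), (100, 300), (300, None)]
print(distance_buckets(origin, (51.2220, 4.4050), ranges))
```

Because the ranges are checked independently, overlapping ranges would place one document in several buckets, which matches the "buckets it belongs to" wording above.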

@ -1,5 +1,5 @@
[[search-aggregations-bucket-geohashgrid-aggregation]] [[search-aggregations-bucket-geohashgrid-aggregation]]
=== GeoHash grid === GeoHash grid Aggregation
A multi-bucket aggregation that works on `geo_point` fields and groups points into buckets that represent cells in a grid. A multi-bucket aggregation that works on `geo_point` fields and groups points into buckets that represent cells in a grid.
The resulting grid can be sparse and only contains cells that have matching data. Each cell is labeled using a http://en.wikipedia.org/wiki/Geohash[geohash] which is of user-definable precision. The resulting grid can be sparse and only contains cells that have matching data. Each cell is labeled using a http://en.wikipedia.org/wiki/Geohash[geohash] which is of user-definable precision.
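To make the idea of labelling grid cells with geohashes more concrete, here is a minimal sketch of the public geohash encoding (interleaved longitude and latitude bits, five bits per base-32 character) and of counting points per cell. `geohash_encode` and `grid_counts` are hypothetical helper names, and `precision` is taken to mean the geohash string length; treat it as an illustration rather than the code Elasticsearch runs.

```python
from collections import Counter

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision):
    """Encode a lat/lon pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bit_count = 0
    value = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = value * 2 + 1
            rng[0] = mid  # keep the upper half
        else:
            value = value * 2
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[value])
            bit_count = 0
            value = 0
    return "".join(chars)


def grid_counts(points, precision):
    """Count points per geohash cell; cells without points never appear."""
    return Counter(geohash_encode(lat, lon, precision) for lat, lon in points)


points = [(57.64911, 10.40744), (57.6492, 10.4075), (0.0, 0.0)]
print(grid_counts(points, 5))
```

Only cells that received at least one point show up in the result, which is the sparseness the text refers to.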

@ -1,5 +1,5 @@
[[search-aggregations-bucket-global-aggregation]] [[search-aggregations-bucket-global-aggregation]]
=== Global === Global Aggregation
Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you're searching on, but is *not* influenced by the search query itself. Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you're searching on, but is *not* influenced by the search query itself.
@ -40,7 +40,7 @@ The response for the above aggregation:
"aggregations" : { "aggregations" : {
"all_products" : { "all_products" : {
"doc_count" : 100, <1> "doc_count" : 100, <1>
"avg_price" : { "avg_price" : {
"value" : 56.3 "value" : 56.3
} }
} }

@ -1,5 +1,5 @@
[[search-aggregations-bucket-histogram-aggregation]] [[search-aggregations-bucket-histogram-aggregation]]
=== Histogram === Histogram Aggregation
A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents. A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents.
It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field
@ -25,7 +25,7 @@ The following snippet "buckets" the products based on their `price` by interval
{ {
"aggs" : { "aggs" : {
"prices" : { "prices" : {
"histogram" : { "histogram" : {
"field" : "price", "field" : "price",
"interval" : 50 "interval" : 50
} }
@ -70,7 +70,7 @@ and create buckets with zero documents). This can be configured using the `min_d
{ {
"aggs" : { "aggs" : {
"prices" : { "prices" : {
"histogram" : { "histogram" : {
"field" : "price", "field" : "price",
"interval" : 50, "interval" : 50,
"min_doc_count" : 0 "min_doc_count" : 0
@ -172,7 +172,7 @@ Ordering the buckets by their key - descending:
{ {
"aggs" : { "aggs" : {
"prices" : { "prices" : {
"histogram" : { "histogram" : {
"field" : "price", "field" : "price",
"interval" : 50, "interval" : 50,
"order" : { "_key" : "desc" } "order" : { "_key" : "desc" }
@ -189,7 +189,7 @@ Ordering the buckets by their `doc_count` - ascending:
{ {
"aggs" : { "aggs" : {
"prices" : { "prices" : {
"histogram" : { "histogram" : {
"field" : "price", "field" : "price",
"interval" : 50, "interval" : 50,
"order" : { "_count" : "asc" } "order" : { "_count" : "asc" }
@ -206,7 +206,7 @@ If the histogram aggregation has a direct metrics sub-aggregation, the latter ca
{ {
"aggs" : { "aggs" : {
"prices" : { "prices" : {
"histogram" : { "histogram" : {
"field" : "price", "field" : "price",
"interval" : 50, "interval" : 50,
"order" : { "price_stats.min" : "asc" } <1> "order" : { "price_stats.min" : "asc" } <1>
@ -275,7 +275,7 @@ limit through the `min_doc_count` option.
{ {
"aggs" : { "aggs" : {
"prices" : { "prices" : {
"histogram" : { "histogram" : {
"field" : "price", "field" : "price",
"interval" : 50, "interval" : 50,
"min_doc_count": 10 "min_doc_count": 10
@ -336,7 +336,7 @@ instead keyed by the buckets keys:
{ {
"aggs" : { "aggs" : {
"prices" : { "prices" : {
"histogram" : { "histogram" : {
"field" : "price", "field" : "price",
"interval" : 50, "interval" : 50,
"keyed" : true "keyed" : true

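The histogram hunks above all rely on the same idea: each value is rounded down to a multiple of `interval`, and `min_doc_count` decides which buckets are returned. The sketch below shows that idea under stated assumptions: `bucket_key` and `histogram` are hypothetical names, the rounding rule is the usual floor-to-interval one, and the default of `1` for `min_doc_count` is an assumption of this sketch.

```python
import math
from collections import Counter


def bucket_key(value, interval):
    """Round a value down to the start of its fixed-size bucket."""
    return math.floor(value / interval) * interval


def histogram(values, interval, min_doc_count=1):
    """Return {bucket_key: doc_count}, ordered by key ascending."""
    counts = Counter(bucket_key(v, interval) for v in values)
    if min_doc_count == 0 and counts:
        # Fill the gaps between the lowest and highest key with empty buckets.
        key, highest = min(counts), max(counts)
        while key <= highest:
            counts.setdefault(key, 0)
            key += interval
    return {k: c for k, c in sorted(counts.items()) if c >= min_doc_count}


prices = [32, 45, 52, 160]  # made-up sample data
print(histogram(prices, 50))
print(histogram(prices, 50, min_doc_count=0))
```

With `min_doc_count=0` the empty bucket between the populated ones is reported as well; with a higher threshold, sparsely populated buckets are dropped.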
@ -1,5 +1,5 @@
[[search-aggregations-bucket-iprange-aggregation]] [[search-aggregations-bucket-iprange-aggregation]]
=== IPv4 Range === IPv4 Range Aggregation
Just like the dedicated <<search-aggregations-bucket-daterange-aggregation,date>> range aggregation, there is also a dedicated range aggregation for IPv4 typed fields: Just like the dedicated <<search-aggregations-bucket-daterange-aggregation,date>> range aggregation, there is also a dedicated range aggregation for IPv4 typed fields:

@ -1,5 +1,5 @@
[[search-aggregations-bucket-missing-aggregation]] [[search-aggregations-bucket-missing-aggregation]]
=== Missing === Missing Aggregation
A field data based single bucket aggregation, that creates a bucket of all documents in the current document set context that are missing a field value (effectively, missing a field or having the configured NULL value set). This aggregator will often be used in conjunction with other field data bucket aggregators (such as ranges) to return information for all the documents that could not be placed in any of the other buckets due to missing field data values. A field data based single bucket aggregation, that creates a bucket of all documents in the current document set context that are missing a field value (effectively, missing a field or having the configured NULL value set). This aggregator will often be used in conjunction with other field data bucket aggregators (such as ranges) to return information for all the documents that could not be placed in any of the other buckets due to missing field data values.

@ -1,5 +1,5 @@
[[search-aggregations-bucket-nested-aggregation]] [[search-aggregations-bucket-nested-aggregation]]
=== Nested === Nested Aggregation
A special single bucket aggregation that enables aggregating nested documents. A special single bucket aggregation that enables aggregating nested documents.

@ -1,5 +1,5 @@
[[search-aggregations-bucket-range-aggregation]] [[search-aggregations-bucket-range-aggregation]]
=== Range === Range Aggregation
A multi-bucket value source based aggregation that enables the user to define a set of ranges - each representing a bucket. During the aggregation process, the values extracted from each document will be checked against each bucket range and "bucket" the relevant/matching document. A multi-bucket value source based aggregation that enables the user to define a set of ranges - each representing a bucket. During the aggregation process, the values extracted from each document will be checked against each bucket range and "bucket" the relevant/matching document.
Note that this aggregation includes the `from` value and excludes the `to` value for each range. Note that this aggregation includes the `from` value and excludes the `to` value for each range.
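The inclusive/exclusive rule is easiest to see at a boundary value. The following sketch assumes ranges are given as dictionaries with optional `from` and `to` keys, mirroring the request syntax; `range_buckets` and the sample ranges are hypothetical and only illustrate the rule stated above.

```python
def range_buckets(value, ranges):
    """Return the ranges that contain `value`.

    `from` is inclusive and `to` is exclusive; a missing key means the
    range is open on that side.
    """
    matched = []
    for r in ranges:
        lo = r.get("from")
        hi = r.get("to")
        if (lo is None or value >= lo) and (hi is None or value < hi):
            matched.append(r)
    return matched


# Illustrative ranges; only the boundary behaviour matters here.
ranges = [{"to": 50}, {"from": 50, "to": 100}, {"from": 100}]
print(range_buckets(50, ranges))
```

A price of exactly 50 therefore lands in the `50` to `100` range and not in the range ending at 50.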
@ -11,7 +11,7 @@ Example:
{ {
"aggs" : { "aggs" : {
"price_ranges" : { "price_ranges" : {
"range" : { "range" : {
"field" : "price", "field" : "price",
"ranges" : [ "ranges" : [
{ "to" : 50 }, { "to" : 50 },
@ -62,7 +62,7 @@ Setting the `keyed` flag to `true` will associate a unique string key with each
{ {
"aggs" : { "aggs" : {
"price_ranges" : { "price_ranges" : {
"range" : { "range" : {
"field" : "price", "field" : "price",
"keyed" : true, "keyed" : true,
"ranges" : [ "ranges" : [
@ -112,7 +112,7 @@ It is also possible to customize the key for each range:
{ {
"aggs" : { "aggs" : {
"price_ranges" : { "price_ranges" : {
"range" : { "range" : {
"field" : "price", "field" : "price",
"keyed" : true, "keyed" : true,
"ranges" : [ "ranges" : [
@ -155,7 +155,7 @@ Let's say the product prices are in USD but we would like to get the price ranges
{ {
"aggs" : { "aggs" : {
"price_ranges" : { "price_ranges" : {
"range" : { "range" : {
"field" : "price", "field" : "price",
"script" : "_value * conversion_rate", "script" : "_value * conversion_rate",
"params" : { "params" : {
@ -181,7 +181,7 @@ The following example, not only "bucket" the documents to the different buckets
{ {
"aggs" : { "aggs" : {
"price_ranges" : { "price_ranges" : {
"range" : { "range" : {
"field" : "price", "field" : "price",
"ranges" : [ "ranges" : [
{ "to" : 50 }, { "to" : 50 },
@ -190,7 +190,7 @@ The following example, not only "bucket" the documents to the different buckets
] ]
}, },
"aggs" : { "aggs" : {
"price_stats" : { "price_stats" : {
"stats" : { "field" : "price" } "stats" : { "field" : "price" }
} }
} }
@ -254,7 +254,7 @@ If a sub aggregation is also based on the same value source as the range aggrega
{ {
"aggs" : { "aggs" : {
"price_ranges" : { "price_ranges" : {
"range" : { "range" : {
"field" : "price", "field" : "price",
"ranges" : [ "ranges" : [
{ "to" : 50 }, { "to" : 50 },
@ -263,13 +263,13 @@ If a sub aggregation is also based on the same value source as the range aggrega
] ]
}, },
"aggs" : { "aggs" : {
"price_stats" : { "price_stats" : {
"stats" : {} <1> "stats" : {} <1>
} }
} }
} }
} }
} }
-------------------------------------------------- --------------------------------------------------
<1> We don't need to specify the `price` as we "inherit" it by default from the parent `range` aggregation <1> We don't need to specify the `price` as we "inherit" it by default from the parent `range` aggregation

@ -1,5 +1,5 @@
[[search-aggregations-bucket-reverse-nested-aggregation]] [[search-aggregations-bucket-reverse-nested-aggregation]]
=== Reverse nested === Reverse nested Aggregation
A special single bucket aggregation that enables aggregating on parent docs from nested documents. Effectively this A special single bucket aggregation that enables aggregating on parent docs from nested documents. Effectively this
aggregation can break out of the nested block structure and link to other nested structures or the root document, aggregation can break out of the nested block structure and link to other nested structures or the root document,

@ -1,5 +1,5 @@
[[search-aggregations-bucket-significantterms-aggregation]] [[search-aggregations-bucket-significantterms-aggregation]]
=== Significant Terms === Significant Terms Aggregation
An aggregation that returns interesting or unusual occurrences of terms in a set. An aggregation that returns interesting or unusual occurrences of terms in a set.
@ -22,7 +22,7 @@ added[1.1.0]
In all these cases the terms being selected are not simply the most popular terms in a set. In all these cases the terms being selected are not simply the most popular terms in a set.
They are the terms that have undergone a significant change in popularity measured between a _foreground_ and _background_ set. They are the terms that have undergone a significant change in popularity measured between a _foreground_ and _background_ set.
If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user's search results If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user's search results
that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency. that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency.
==== Single-set analysis ==== Single-set analysis
@ -70,15 +70,15 @@ Response:
} }
-------------------------------------------------- --------------------------------------------------
When querying an index of all crimes from all police forces, what these results show is that the British Transport Police force When querying an index of all crimes from all police forces, what these results show is that the British Transport Police force
stands out as a force dealing with a disproportionately large number of bicycle thefts. Ordinarily, bicycle thefts represent only 1% of crimes (66799/5064554) stands out as a force dealing with a disproportionately large number of bicycle thefts. Ordinarily, bicycle thefts represent only 1% of crimes (66799/5064554)
but for the British Transport Police, who handle crime on railways and stations, 7% of crimes (3640/47347) are but for the British Transport Police, who handle crime on railways and stations, 7% of crimes (3640/47347) are
bike thefts. This is a significant seven-fold increase in frequency and so this anomaly was highlighted as the top crime type. bike thefts. This is a significant seven-fold increase in frequency and so this anomaly was highlighted as the top crime type.
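The percentages quoted above can be reproduced directly from the document counts. This is only a worked check of the arithmetic; the variable names are made up for the example.

```python
# Document counts quoted in the text above.
all_crimes, all_bike_thefts = 5064554, 66799   # background: every force
btp_crimes, btp_bike_thefts = 47347, 3640      # foreground: British Transport Police

background_pct = 100 * all_bike_thefts / all_crimes
foreground_pct = 100 * btp_bike_thefts / btp_crimes
uplift = foreground_pct / background_pct

print(f"background: {background_pct:.2f}%")
print(f"foreground: {foreground_pct:.2f}%")
print(f"uplift:     {uplift:.1f}x")
```

The "seven-fold" figure comes from comparing the rounded values of 7% and 1%; the unrounded counts give a somewhat smaller ratio, still a clear multiple of the background rate.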
The problem with using a query to spot anomalies is it only gives us one subset to use for comparisons. The problem with using a query to spot anomalies is it only gives us one subset to use for comparisons.
To discover all the other police forces' anomalies we would have to repeat the query for each of the different forces. To discover all the other police forces' anomalies we would have to repeat the query for each of the different forces.
This can be a tedious way to look for unusual patterns in an index This can be a tedious way to look for unusual patterns in an index
@ -135,7 +135,7 @@ Response:
"doc_count": 47347, "doc_count": 47347,
"significantCrimeTypes": { "significantCrimeTypes": {
"doc_count": 47347, "doc_count": 47347,
"buckets": [ "buckets": [
{ {
"key": "Bicycle theft", "key": "Bicycle theft",
"doc_count": 3640, "doc_count": 3640,
@ -163,7 +163,7 @@ area to identify unusual hot-spots of a particular crime type:
{ {
"aggs": { "aggs": {
"hotspots": { "hotspots": {
"geohash_grid" : { "geohash_grid" : {
"field":"location", "field":"location",
"precision":5, "precision":5,
}, },
@ -177,8 +177,8 @@ area to identify unusual hot-spots of a particular crime type:
} }
-------------------------------------------------- --------------------------------------------------
This example uses the `geohash_grid` aggregation to create result buckets that represent geographic areas, and inside each This example uses the `geohash_grid` aggregation to create result buckets that represent geographic areas, and inside each
bucket we can identify anomalous levels of a crime type in these tightly-focused areas e.g. bucket we can identify anomalous levels of a crime type in these tightly-focused areas e.g.
* Airports exhibit unusual numbers of weapon confiscations * Airports exhibit unusual numbers of weapon confiscations
* Universities show uplifts of bicycle thefts * Universities show uplifts of bicycle thefts
@ -188,16 +188,16 @@ tackling an unusual volume of a particular crime type.
Obviously a time-based top-level segmentation would help identify current trends for each point in time Obviously a time-based top-level segmentation would help identify current trends for each point in time
where a simple `terms` aggregation would typically show the very popular "constants" that persist across all time slots. where a simple `terms` aggregation would typically show the very popular "constants" that persist across all time slots.
.How are the scores calculated? .How are the scores calculated?
********************************** **********************************
The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users. The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users.
The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favour The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favour
common terms whereas the _relative_ change in popularity (foregroundPercent/ backgroundPercent) would favour rare terms. common terms whereas the _relative_ change in popularity (foregroundPercent/ backgroundPercent) would favour rare terms.
Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
********************************** **********************************
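The scoring idea in the sidebar above can be sketched as the product of the absolute and the relative change in frequency. `significance_score` is a hypothetical function name, and returning zero when a term is no more frequent in the foreground than in the background is an assumption of this sketch, not a documented detail.

```python
def significance_score(fg_count, fg_size, bg_count, bg_size):
    """Multiply the absolute and relative change in term frequency.

    fg_* describe the foreground set, bg_* the background set.
    """
    fg_pct = fg_count / fg_size
    bg_pct = bg_count / bg_size
    if bg_pct == 0 or fg_pct <= bg_pct:
        return 0.0  # assumption: no uplift means no significance
    absolute_change = fg_pct - bg_pct   # on its own, favours common terms
    relative_change = fg_pct / bg_pct   # on its own, favours rare terms
    return absolute_change * relative_change


# "H5N1": 4 of 100 results versus 5 of 10 million documents.
print(significance_score(4, 100, 5, 10_000_000))
# "Bicycle theft": 3640 of 47347 versus 66799 of 5064554.
print(significance_score(3640, 47347, 66799, 5064554))
```

When the foreground and background frequencies are identical, as with a `match_all` query, the absolute change is zero and no term stands out.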
@ -207,15 +207,15 @@ Rare vs common is essentially a precision vs recall balance and so the absolute
The significant_terms aggregation can be used effectively on tokenized free-text fields to suggest: The significant_terms aggregation can be used effectively on tokenized free-text fields to suggest:
* keywords for refining end-user searches * keywords for refining end-user searches
* keywords for use in percolator queries * keywords for use in percolator queries
WARNING: Picking a free-text field as the subject of a significant terms analysis can be expensive! It will attempt WARNING: Picking a free-text field as the subject of a significant terms analysis can be expensive! It will attempt
to load every unique word into RAM. It is recommended to only use this on smaller indices. to load every unique word into RAM. It is recommended to only use this on smaller indices.
.Use the _"like this but not this"_ pattern .Use the _"like this but not this"_ pattern
********************************** **********************************
You can spot mis-categorized content by first searching a structured field e.g. `category:adultMovie` and use significant_terms on the You can spot mis-categorized content by first searching a structured field e.g. `category:adultMovie` and use significant_terms on the
free-text "movie_description" field. Take the suggested words (I'll leave them to your imagination) and then search for all movies NOT marked as category:adultMovie but containing these keywords. free-text "movie_description" field. Take the suggested words (I'll leave them to your imagination) and then search for all movies NOT marked as category:adultMovie but containing these keywords.
You now have a ranked list of badly-categorized movies that you should reclassify or at least remove from the "familyFriendly" category. You now have a ranked list of badly-categorized movies that you should reclassify or at least remove from the "familyFriendly" category.
The significance score from each term can also provide a useful `boost` setting to sort matches. The significance score from each term can also provide a useful `boost` setting to sort matches.
@ -224,11 +224,11 @@ a high setting would have a small number of relevant results packed full of keyw
********************************** **********************************
[TIP] [TIP]
============ ============
.Show significant_terms in context .Show significant_terms in context
Free-text significant_terms are much more easily understood when viewed in context. Take the results of `significant_terms` suggestions from a Free-text significant_terms are much more easily understood when viewed in context. Take the results of `significant_terms` suggestions from a
free-text field and use them in a `terms` query on the same field with a `highlight` clause to present users with example snippets of documents. When the terms free-text field and use them in a `terms` query on the same field with a `highlight` clause to present users with example snippets of documents. When the terms
are presented unstemmed, highlighted, with the right case, in the right order and with some context, their significance/meaning is more readily apparent. are presented unstemmed, highlighted, with the right case, in the right order and with some context, their significance/meaning is more readily apparent.
============ ============
@ -239,7 +239,7 @@ are presented unstemmed, highlighted, with the right case, in the right order an
The above examples show how to select the _foreground_ set for analysis using a query or parent aggregation to filter but currently there is no means of specifying The above examples show how to select the _foreground_ set for analysis using a query or parent aggregation to filter but currently there is no means of specifying
a _background_ set other than the index from which all results are ultimately drawn. Sometimes it may prove useful to use a different a _background_ set other than the index from which all results are ultimately drawn. Sometimes it may prove useful to use a different
background set as the basis for comparisons e.g. to first select the tweets for the TV show "XFactor" and then look background set as the basis for comparisons e.g. to first select the tweets for the TV show "XFactor" and then look
for significant terms in a subset of that content which is from this week. for significant terms in a subset of that content which is from this week.
===== Significant terms must be indexed values ===== Significant terms must be indexed values
Unlike the terms aggregation it is currently not possible to use script-generated terms for counting purposes. Unlike the terms aggregation it is currently not possible to use script-generated terms for counting purposes.
@ -250,20 +250,20 @@ Also DocValues are not supported as sources of term data for similar reasons.
===== No analysis of floating point fields ===== No analysis of floating point fields
Floating point fields are currently not supported as the subject of significant_terms analysis. Floating point fields are currently not supported as the subject of significant_terms analysis.
While integer or long fields can be used to represent concepts like bank account numbers or category numbers which While integer or long fields can be used to represent concepts like bank account numbers or category numbers which
can be interesting to track, floating point fields are usually used to represent quantities of something. can be interesting to track, floating point fields are usually used to represent quantities of something.
As such, individual floating point terms are not useful for this form of frequency analysis. As such, individual floating point terms are not useful for this form of frequency analysis.
===== Use as a parent aggregation ===== Use as a parent aggregation
If there is the equivalent of a `match_all` query or no query criteria providing a subset of the index the significant_terms aggregation should not be used as the If there is the equivalent of a `match_all` query or no query criteria providing a subset of the index the significant_terms aggregation should not be used as the
top-most aggregation - in this scenario the _foreground_ set is exactly the same as the _background_ set and top-most aggregation - in this scenario the _foreground_ set is exactly the same as the _background_ set and
so there is no difference in document frequencies to observe and from which to make sensible suggestions. so there is no difference in document frequencies to observe and from which to make sensible suggestions.
Another consideration is that the significant_terms aggregation produces many candidate results at shard level Another consideration is that the significant_terms aggregation produces many candidate results at shard level
that are only later pruned on the reducing node once all statistics from all shards are merged. As a result, that are only later pruned on the reducing node once all statistics from all shards are merged. As a result,
it can be inefficient and costly in terms of RAM to embed large child aggregations under a significant_terms it can be inefficient and costly in terms of RAM to embed large child aggregations under a significant_terms
aggregation that later discards many candidate terms. It is advisable in these cases to perform two searches - the first to provide a rationalized list of aggregation that later discards many candidate terms. It is advisable in these cases to perform two searches - the first to provide a rationalized list of
significant_terms and then add this shortlist of terms to a second query to go back and fetch the required child aggregations. significant_terms and then add this shortlist of terms to a second query to go back and fetch the required child aggregations.
===== Approximate counts ===== Approximate counts
The counts of how many documents contain a term provided in results are based on summing the samples returned from each shard and The counts of how many documents contain a term provided in results are based on summing the samples returned from each shard and
as such may be: as such may be:
@ -272,7 +272,7 @@ as such may be:
* high when considering the background frequency as it may count occurrences found in deleted documents * high when considering the background frequency as it may count occurrences found in deleted documents
Like most design decisions, this is the basis of a trade-off in which we have chosen to provide fast performance at the cost of some (typically small) inaccuracies. Like most design decisions, this is the basis of a trade-off in which we have chosen to provide fast performance at the cost of some (typically small) inaccuracies.
However, the `size` and `shard size` settings covered in the next section provide tools to help control the accuracy levels. However, the `size` and `shard size` settings covered in the next section provide tools to help control the accuracy levels.
==== Parameters ==== Parameters
@ -287,14 +287,14 @@ If the number of unique terms is greater than `size`, the returned list can be s
size buckets was not returned). size buckets was not returned).
To ensure better accuracy a multiple of the final `size` is used as the number of terms to request from each shard To ensure better accuracy a multiple of the final `size` is used as the number of terms to request from each shard
using a heuristic based on the number of shards. To take manual control of this setting the `shard_size` parameter using a heuristic based on the number of shards. To take manual control of this setting the `shard_size` parameter
can be used to control the volumes of candidate terms produced by each shard. can be used to control the volumes of candidate terms produced by each shard.
Low-frequency terms can turn out to be the most interesting ones once all results are combined so the Low-frequency terms can turn out to be the most interesting ones once all results are combined so the
significant_terms aggregation can produce higher-quality results when the `shard_size` parameter is set to significant_terms aggregation can produce higher-quality results when the `shard_size` parameter is set to
values significantly higher than the `size` setting. This ensures that a bigger volume of promising candidate terms is given values significantly higher than the `size` setting. This ensures that a bigger volume of promising candidate terms is given
a consolidated review by the reducing node before the final selection. Obviously large candidate term lists a consolidated review by the reducing node before the final selection. Obviously large candidate term lists
will cause extra network traffic and RAM usage so this is a quality/cost trade-off that needs to be balanced. will cause extra network traffic and RAM usage so this is a quality/cost trade-off that needs to be balanced.
NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, elasticsearch will NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, elasticsearch will
override it and reset it to be equal to `size`. override it and reset it to be equal to `size`.
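
As a rough illustration of why a larger `shard_size` can help, the following Python sketch simulates the two-phase collection described above. It is a simplification under stated assumptions, not Elasticsearch code: the per-shard score is just the local document count (the real significance scoring is more involved), and `collect`, `SHARDS`, and the sample terms are hypothetical names made up for this example.

```python
from collections import Counter

# Hypothetical per-shard term counts; the local count stands in for the score.
SHARDS = [
    {"a": 10, "b": 9, "c": 8},
    {"d": 10, "e": 9, "c": 8},
    {"f": 10, "g": 9, "c": 8},
]

def collect(shards, size, shard_size):
    # Mirror the documented rule: a shard_size below size is reset to size.
    shard_size = max(shard_size, size)
    merged = Counter()
    for counts in shards:
        # Each shard only ships its top shard_size candidates.
        for term, n in Counter(counts).most_common(shard_size):
            merged[term] += n
    # The reducing node picks the final top `size` terms.
    return merged.most_common(size)

print(collect(SHARDS, size=1, shard_size=1))  # "c" never reaches the reducer
print(collect(SHARDS, size=1, shard_size=3))  # "c" wins once all candidates are merged
```

In this toy data the term `c` is never the top term on any single shard, so it only surfaces when each shard is allowed to send more candidates than the final `size`.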
@ -308,7 +308,7 @@ It is possible to only return terms that match more than a configured number of
{ {
"aggs" : { "aggs" : {
"tags" : { "tags" : {
"significant_terms" : { "significant_terms" : {
"field" : "tag", "field" : "tag",
"min_doc_count": 10 "min_doc_count": 10
} }
@ -320,8 +320,8 @@ It is possible to only return terms that match more than a configured number of
The above aggregation would only return tags which have been found in 10 hits or more. The default value is `3`.

Terms that score highly are collected at the shard level and merged with the terms collected from other shards in a second step. However, a shard does not have information about global term frequencies. The decision whether a term is added to the candidate list depends only on the score computed on the shard using local shard frequencies, not on the global frequencies of the term. The `min_doc_count` criterion is applied only after the local term statistics of all shards have been merged. In a way, the decision to add a term as a candidate is made without being very _certain_ that the term will actually reach the required `min_doc_count`. This might cause many (globally) high-frequency terms to be missing from the final result if low-frequency but high-scoring terms populated the candidate lists. To avoid this, the `shard_size` parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.

The `shard_min_doc_count` parameter regulates the _certainty_ a shard has about whether a term should be added to the candidate list with respect to `min_doc_count`. Terms are only considered if their local shard frequency within the set is higher than `shard_min_doc_count`. If your dictionary contains many low-frequency words that you are not interested in (for example, misspellings), you can set the `shard_min_doc_count` parameter to filter out, at the shard level, candidate terms that will with reasonable certainty not reach the required `min_doc_count` even after merging the local frequencies. `shard_min_doc_count` is set to `1` by default and has no effect unless you explicitly set it.
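
The interaction between the two thresholds can be sketched in Python. This is an illustrative model only, not Elasticsearch code: `candidate_terms` and `final_terms` are hypothetical helpers, scoring is ignored entirely, and both thresholds are assumed to be inclusive (consistent with the "10 hits or more" wording and with a default of `1` having no effect).

```python
from collections import Counter

def candidate_terms(shard_counts, shard_min_doc_count=1):
    # Shard-level filter: drop terms whose local frequency is too low.
    # Assumption: the threshold is inclusive, so the default of 1 keeps everything.
    return {t: n for t, n in shard_counts.items() if n >= shard_min_doc_count}

def final_terms(shards, min_doc_count=3, shard_min_doc_count=1):
    merged = Counter()
    for shard_counts in shards:
        # Only the surviving candidates are sent to the reducing node.
        merged.update(candidate_terms(shard_counts, shard_min_doc_count))
    # min_doc_count is applied only after all shard results are merged.
    return {t: n for t, n in merged.items() if n >= min_doc_count}

# Hypothetical local frequencies, including two misspellings.
shards = [
    {"elasticsearch": 6, "elasticsaerch": 1},
    {"elasticsearch": 5, "elastisearch": 1},
]
print(candidate_terms(shards[0], shard_min_doc_count=2))
print(final_terms(shards, min_doc_count=10, shard_min_doc_count=2))
```

With the shard-level threshold raised, the rare misspellings are discarded before they take up a slot in any shard's candidate list, while the final result is still decided by `min_doc_count` on the merged counts.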

@ -1,5 +1,5 @@
[[search-aggregations-bucket-terms-aggregation]] [[search-aggregations-bucket-terms-aggregation]]
=== Terms === Terms Aggregation
A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value. A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.
@ -80,7 +80,7 @@ Ordering the buckets by their `doc_count` in an ascending manner:
{ {
"aggs" : { "aggs" : {
"genders" : { "genders" : {
"terms" : { "terms" : {
"field" : "gender", "field" : "gender",
"order" : { "_count" : "asc" } "order" : { "_count" : "asc" }
} }
@ -96,7 +96,7 @@ Ordering the buckets alphabetically by their terms in an ascending manner:
{ {
"aggs" : { "aggs" : {
"genders" : { "genders" : {
"terms" : { "terms" : {
"field" : "gender", "field" : "gender",
"order" : { "_term" : "asc" } "order" : { "_term" : "asc" }
} }
@ -113,7 +113,7 @@ Ordering the buckets by single value metrics sub-aggregation (identified by the
{ {
"aggs" : { "aggs" : {
"genders" : { "genders" : {
"terms" : { "terms" : {
"field" : "gender", "field" : "gender",
"order" : { "avg_height" : "desc" } "order" : { "avg_height" : "desc" }
}, },
@ -132,7 +132,7 @@ Ordering the buckets by multi value metrics sub-aggregation (identified by the a
{ {
"aggs" : { "aggs" : {
"genders" : { "genders" : {
"terms" : { "terms" : {
"field" : "gender", "field" : "gender",
"order" : { "height_stats.avg" : "desc" } "order" : { "height_stats.avg" : "desc" }
}, },
@ -193,7 +193,7 @@ It is possible to only return terms that match more than a configured number of
{ {
"aggs" : { "aggs" : {
"tags" : { "tags" : {
"terms" : { "terms" : {
"field" : "tag", "field" : "tag",
"min_doc_count": 10 "min_doc_count": 10
} }
@ -221,7 +221,7 @@ Generating the terms using a script:
{ {
"aggs" : { "aggs" : {
"genders" : { "genders" : {
"terms" : { "terms" : {
"script" : "doc['gender'].value" "script" : "doc['gender'].value"
} }
} }
@ -236,7 +236,7 @@ Generating the terms using a script:
{ {
"aggs" : { "aggs" : {
"genders" : { "genders" : {
"terms" : { "terms" : {
"field" : "gender", "field" : "gender",
"script" : "'Gender: ' +_value" "script" : "'Gender: ' +_value"
} }

@ -1,5 +1,5 @@
[[search-aggregations-metrics-avg-aggregation]] [[search-aggregations-metrics-avg-aggregation]]
=== Avg === Avg Aggregation
A `single-value` metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. A `single-value` metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
@ -58,14 +58,14 @@ It turned out that the exam was way above the level of the students and a grade
... ...
"aggs" : { "aggs" : {
"avg_corrected_grade" : { "avg_corrected_grade" : {
"avg" : { "avg" : {
"field" : "grade", "field" : "grade",
"script" : "_value * correction", "script" : "_value * correction",
"params" : { "params" : {
"correction" : 1.2 "correction" : 1.2
} }
} }
} }
} }
} }

@ -1,5 +1,5 @@
[[search-aggregations-metrics-cardinality-aggregation]] [[search-aggregations-metrics-cardinality-aggregation]]
=== Cardinality === Cardinality Aggregation
added[1.1.0] added[1.1.0]
@ -21,8 +21,8 @@ match a query:
-------------------------------------------------- --------------------------------------------------
{ {
"aggs" : { "aggs" : {
"author_count" : { "author_count" : {
"cardinality" : { "cardinality" : {
"field" : "author" "field" : "author"
} }
} }
@ -36,8 +36,8 @@ This aggregation also supports the `precision_threshold` and `rehash` options:
-------------------------------------------------- --------------------------------------------------
{ {
"aggs" : { "aggs" : {
"author_count" : { "author_count" : {
"cardinality" : { "cardinality" : {
"field" : "author_hash", "field" : "author_hash",
"precision_threshold": 100, <1> "precision_threshold": 100, <1>
"rehash": false <2> "rehash": false <2>
@ -76,7 +76,7 @@ properties:
* excellent accuracy on low-cardinality sets, * excellent accuracy on low-cardinality sets,
* fixed memory usage: no matter if there are tens or billions of unique values, * fixed memory usage: no matter if there are tens or billions of unique values,
memory usage only depends on the configured precision. memory usage only depends on the configured precision.
For a precision threshold of `c`, the implementation that we are using requires For a precision threshold of `c`, the implementation that we are using requires
about `c * 8` bytes. about `c * 8` bytes.
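
As a quick worked example of the `c * 8` rule of thumb quoted above, the Python snippet below prints the approximate memory cost for a few thresholds. The helper name `estimated_bytes` and the sample thresholds are arbitrary choices for illustration, and the result is only the rough estimate stated in the text, not a measured figure.

```python
def estimated_bytes(precision_threshold):
    # Rule of thumb from the text: about c * 8 bytes for a threshold of c.
    return precision_threshold * 8

# Arbitrary sample thresholds, chosen only for illustration.
for c in (100, 1000, 10000):
    print(f"precision_threshold={c}: about {estimated_bytes(c)} bytes")
```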
@ -121,8 +121,8 @@ without computing hashes on the fly:
-------------------------------------------------- --------------------------------------------------
{ {
"aggs" : { "aggs" : {
"author_count" : { "author_count" : {
"cardinality" : { "cardinality" : {
"field" : "author.hash" "field" : "author.hash"
} }
} }
@ -149,8 +149,8 @@ however since hashes need to be computed on the fly.
-------------------------------------------------- --------------------------------------------------
{ {
"aggs" : { "aggs" : {
"author_count" : { "author_count" : {
"cardinality" : { "cardinality" : {
"script": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value" "script": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value"
} }
} }

@ -1,5 +1,5 @@
[[search-aggregations-metrics-extendedstats-aggregation]] [[search-aggregations-metrics-extendedstats-aggregation]]
=== Extended Stats === Extended Stats Aggregation
A `multi-value` metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. A `multi-value` metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
@ -68,7 +68,7 @@ It turned out that the exam was way above the level of the students and a grade
"aggs" : { "aggs" : {
"grades_stats" : { "grades_stats" : {
"extended_stats" : { "extended_stats" : {
"field" : "grade", "field" : "grade",
"script" : "_value * correction", "script" : "_value * correction",
"params" : { "params" : {

@ -1,5 +1,5 @@
[[search-aggregations-metrics-max-aggregation]] [[search-aggregations-metrics-max-aggregation]]
=== Max === Max Aggregation
A `single-value` metrics aggregation that keeps track of and returns the maximum value among the numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
@ -53,14 +53,14 @@ Let's say that the prices of the documents in our index are in USD, but we would
-------------------------------------------------- --------------------------------------------------
{ {
"aggs" : { "aggs" : {
"max_price_in_euros" : { "max_price_in_euros" : {
"max" : { "max" : {
"field" : "price", "field" : "price",
"script" : "_value * conversion_rate", "script" : "_value * conversion_rate",
"params" : { "params" : {
"conversion_rate" : 1.2 "conversion_rate" : 1.2
} }
} }
} }
} }
} }

@ -1,5 +1,5 @@
[[search-aggregations-metrics-min-aggregation]] [[search-aggregations-metrics-min-aggregation]]
=== Min === Min Aggregation
A `single-value` metrics aggregation that keeps track of and returns the minimum value among the numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
@ -53,8 +53,8 @@ Let's say that the prices of the documents in our index are in USD, but we would
-------------------------------------------------- --------------------------------------------------
{ {
"aggs" : { "aggs" : {
"min_price_in_euros" : { "min_price_in_euros" : {
"min" : { "min" : {
"field" : "price", "field" : "price",
"script" : "_value * conversion_rate", "script" : "_value * conversion_rate",
"params" : { "params" : {

@ -1,5 +1,5 @@
[[search-aggregations-metrics-percentile-aggregation]] [[search-aggregations-metrics-percentile-aggregation]]
=== Percentiles === Percentiles Aggregation
added[1.1.0] added[1.1.0]
@ -16,15 +16,15 @@ future. If you use this feature, please let us know your experience with it!
===== =====
Percentiles show the point at which a certain percentage of observed values Percentiles show the point at which a certain percentage of observed values
occur. For example, the 95th percentile is the value which is greater than 95% occur. For example, the 95th percentile is the value which is greater than 95%
of the observed values. of the observed values.
Percentiles are often used to find outliers. In normal distributions, the
0.13th and 99.87th percentiles represent three standard deviations from the
mean. Any data which falls outside three standard deviations is often considered
an anomaly.
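
The 0.13 and 99.87 figures can be checked with a few lines of Python using the standard library's `statistics.NormalDist`. This is only a sanity check of the arithmetic for a standard normal distribution; the variable names are chosen for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Percentage of values below -3 and below +3 standard deviations.
lower = standard_normal.cdf(-3.0) * 100
upper = standard_normal.cdf(3.0) * 100

print(f"{lower:.2f} {upper:.2f}")  # prints "0.13 99.87"
```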
When a range of percentiles is retrieved, they can be used to estimate the
data distribution and determine if the data is skewed, bimodal, etc.
Assume your data consists of website load times. The average and median Assume your data consists of website load times. The average and median
@ -37,17 +37,17 @@ Let's look at a range of percentiles representing load time:
-------------------------------------------------- --------------------------------------------------
{ {
"aggs" : { "aggs" : {
"load_time_outlier" : { "load_time_outlier" : {
"percentiles" : { "percentiles" : {
"field" : "load_time" <1> "field" : "load_time" <1>
} }
} }
} }
} }
-------------------------------------------------- --------------------------------------------------
<1> The field `load_time` must be a numeric field <1> The field `load_time` must be a numeric field
By default, the `percentiles` metric will generate a range of
percentiles: `[ 1, 5, 25, 50, 75, 95, 99 ]`. The response will look like this:
[source,js] [source,js]
@ -75,23 +75,23 @@ WARNING: added[1.2.0] The above response structure applies for `1.2.0` and above
missing and all the percentiles were placed directly under the aggregation name object.
As you can see, the aggregation will return a calculated value for each percentile As you can see, the aggregation will return a calculated value for each percentile
in the default range. If we assume response times are in milliseconds, it is in the default range. If we assume response times are in milliseconds, it is
immediately obvious that the webpage normally loads in 15-30ms, but occasionally immediately obvious that the webpage normally loads in 15-30ms, but occasionally
spikes to 60-150ms. spikes to 60-150ms.
Often, administrators are only interested in outliers -- the extreme percentiles.
We can specify just the percents we are interested in (requested percentiles
must be values between 0 and 100 inclusive):
[source,js] [source,js]
-------------------------------------------------- --------------------------------------------------
{ {
"aggs" : { "aggs" : {
"load_time_outlier" : { "load_time_outlier" : {
"percentiles" : { "percentiles" : {
"field" : "load_time", "field" : "load_time",
"percents" : [95, 99, 99.9] <1> "percents" : [95, 99, 99.9] <1>
} }
} }
} }
} }
@ -110,13 +110,13 @@ a script to convert them on-the-fly:
-------------------------------------------------- --------------------------------------------------
{ {
"aggs" : { "aggs" : {
"load_time_outlier" : { "load_time_outlier" : {
"percentiles" : { "percentiles" : {
"script" : "doc['load_time'].value / timeUnit", <1> "script" : "doc['load_time'].value / timeUnit", <1>
"params" : { "params" : {
"timeUnit" : 1000 <2> "timeUnit" : 1000 <2>
} }
} }
} }
} }
} }
@ -127,7 +127,7 @@ script to generate values which percentiles are calculated on
==== Percentiles are (usually) approximate ==== Percentiles are (usually) approximate
There are many different algorithms to calculate percentiles. The naive There are many different algorithms to calculate percentiles. The naive
implementation simply stores all the values in a sorted array. To find the 50th implementation simply stores all the values in a sorted array. To find the 50th
percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`. percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.
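
The naive approach described above can be sketched in Python as follows. This is a minimal illustration, not how Elasticsearch computes percentiles: `naive_percentile` and the sample `load_times` are hypothetical, and clamping the index to the last slot is an added assumption so that the 100th percentile does not run off the end of the array.

```python
def naive_percentile(values, percent):
    # Keep every value in a sorted array, as the naive implementation does.
    my_array = sorted(values)
    if not my_array:
        raise ValueError("cannot compute a percentile of no values")
    # Index from the text: count(my_array) * fraction, clamped to the last slot.
    index = min(int(len(my_array) * percent / 100), len(my_array) - 1)
    return my_array[index]

# Hypothetical load times in milliseconds.
load_times = [15, 20, 23, 25, 30, 60, 150]
print(naive_percentile(load_times, 50))  # prints 25
```

The sketch makes the cost visible: every value has to be held and sorted, so memory grows linearly with the number of values.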
@ -137,7 +137,7 @@ across potentially billions of values in an Elasticsearch cluster, _approximate_
percentiles are calculated. percentiles are calculated.
The algorithm used by the `percentiles` metric is called TDigest (introduced by
Ted Dunning in
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[Computing Accurate Quantiles using T-Digests]).
When using this metric, there are a few guidelines to keep in mind: When using this metric, there are a few guidelines to keep in mind:
@ -147,8 +147,8 @@ are more accurate than less extreme percentiles, such as the median
- For small sets of values, percentiles are highly accurate (and potentially - For small sets of values, percentiles are highly accurate (and potentially
100% accurate if the data is small enough). 100% accurate if the data is small enough).
- As the quantity of values in a bucket grows, the algorithm begins to approximate - As the quantity of values in a bucket grows, the algorithm begins to approximate
the percentiles. It is effectively trading accuracy for memory savings. The the percentiles. It is effectively trading accuracy for memory savings. The
exact level of inaccuracy is difficult to generalize, since it depends on your exact level of inaccuracy is difficult to generalize, since it depends on your
data distribution and volume of data being aggregated data distribution and volume of data being aggregated
The following chart shows the relative error on a uniform distribution depending The following chart shows the relative error on a uniform distribution depending
@ -170,20 +170,20 @@ This balance can be controlled using a `compression` parameter:
-------------------------------------------------- --------------------------------------------------
{ {
"aggs" : { "aggs" : {
"load_time_outlier" : { "load_time_outlier" : {
"percentiles" : { "percentiles" : {
"field" : "load_time", "field" : "load_time",
"compression" : 200 <1> "compression" : 200 <1>
} }
} }
} }
} }
-------------------------------------------------- --------------------------------------------------
<1> Compression controls memory usage and approximation error <1> Compression controls memory usage and approximation error
The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the
more nodes available, the higher the accuracy (and the larger the memory footprint) proportional
to the volume of data. The `compression` parameter limits the maximum number of
nodes to `100 * compression`.
Therefore, by increasing the compression value, you can increase the accuracy of Therefore, by increasing the compression value, you can increase the accuracy of

@ -1,5 +1,5 @@
[[search-aggregations-metrics-stats-aggregation]] [[search-aggregations-metrics-stats-aggregation]]
=== Stats === Stats Aggregation
A `multi-value` metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. A `multi-value` metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
@ -65,7 +65,7 @@ It turned out that the exam was way above the level of the students and a grade
"aggs" : { "aggs" : {
"grades_stats" : { "grades_stats" : {
"stats" : { "stats" : {
"field" : "grade", "field" : "grade",
"script" : "_value * correction", "script" : "_value * correction",
"params" : { "params" : {

@ -1,5 +1,5 @@
[[search-aggregations-metrics-sum-aggregation]] [[search-aggregations-metrics-sum-aggregation]]
=== Sum === Sum Aggregation
A `single-value` metrics aggregation that sums up numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script. A `single-value` metrics aggregation that sums up numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
@ -66,10 +66,10 @@ Computing the sum of squares over all stock tick changes:
... ...
"aggs" : { "aggs" : {
"daytime_return" : { "daytime_return" : {
"sum" : { "sum" : {
"field" : "change", "field" : "change",
"script" : "_value * _value" } "script" : "_value * _value" }
} }
} }
} }

@ -1,5 +1,5 @@
[[search-aggregations-metrics-valuecount-aggregation]] [[search-aggregations-metrics-valuecount-aggregation]]
=== Value Count === Value Count Aggregation
A `single-value` metrics aggregation that counts the number of values that are extracted from the aggregated documents. A `single-value` metrics aggregation that counts the number of values that are extracted from the aggregated documents.
These values can be extracted either from specific fields in the documents, or be generated by a provided script. Typically, These values can be extracted either from specific fields in the documents, or be generated by a provided script. Typically,