[DOCS] Added "Aggregation" to all aggs titles
parent dfdc183ba6
commit 5b93255ec8
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-datehistogram-aggregation]]
|
||||
=== Date Histogram
|
||||
=== Date Histogram Aggregation
|
||||
|
||||
A multi-bucket aggregation similar to the <<search-aggregations-bucket-histogram-aggregation,histogram>> except it can
|
||||
only be applied on date values. Since dates are represented in elasticsearch internally as long values, it is possible
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-daterange-aggregation]]
|
||||
=== Date Range
|
||||
=== Date Range Aggregation
|
||||
|
||||
A range aggregation that is dedicated to date values. The main difference between this aggregation and the normal <<search-aggregations-bucket-range-aggregation,range>> aggregation is that the `from` and `to` values can be expressed in Date Math expressions, and it is also possible to specify a date format in which the `from` and `to` response fields will be returned.
|
||||
Note that this aggregation includes the `from` value and excludes the `to` value for each range.
|
||||
|
@ -93,7 +93,7 @@ All ASCII letters are reserved as format pattern letters, which are defined as f
|
|||
|' |escape for text |delimiter
|
||||
|'' |single quote |literal |'
|
||||
|=======
|
||||
|
||||
|
||||
The count of pattern letters determines the format.
|
||||
|
||||
Text:: If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used if available.
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-filter-aggregation]]
|
||||
=== Filter
|
||||
=== Filter Aggregation
|
||||
|
||||
Defines a single bucket of all the documents in the current document set context that match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents.
|
||||
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-geodistance-aggregation]]
|
||||
=== Geo Distance
|
||||
=== Geo Distance Aggregation
|
||||
|
||||
A multi-bucket aggregation that works on `geo_point` fields and conceptually works very similarly to the <<search-aggregations-bucket-range-aggregation,range>> aggregation. The user can define a point of origin and a set of distance range buckets. The aggregation evaluates the distance of each document value from the origin point and determines the buckets it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the origin falls within the distance range of the bucket).
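The bucketing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the haversine formula on a spherical Earth, the `EARTH_RADIUS_KM` constant, the kilometre unit, and the `from`-inclusive/`to`-exclusive convention are choices made for this sketch, and the function names are hypothetical rather than part of Elasticsearch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_buckets(origin, point, ranges):
    """Return every range whose [from, to) interval contains the distance."""
    d = haversine_km(origin[0], origin[1], point[0], point[1])
    return [
        r for r in ranges
        if ("from" not in r or d >= r["from"]) and ("to" not in r or d < r["to"])
    ]
```

For example, a point one degree of longitude east of an origin on the equator is roughly 111 km away, so with ranges of `to: 100`, `from: 100, to: 300` and `from: 300` it would fall into the middle bucket.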
|
||||
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-geohashgrid-aggregation]]
|
||||
=== GeoHash grid
|
||||
=== GeoHash grid Aggregation
|
||||
|
||||
A multi-bucket aggregation that works on `geo_point` fields and groups points into buckets that represent cells in a grid.
|
||||
The resulting grid can be sparse and only contains cells that have matching data. Each cell is labeled using a http://en.wikipedia.org/wiki/Geohash[geohash] which is of user-definable precision.
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-global-aggregation]]
|
||||
=== Global
|
||||
=== Global Aggregation
|
||||
|
||||
Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you're searching on, but is *not* influenced by the search query itself.
|
||||
|
||||
|
@ -40,7 +40,7 @@ The response for the above aggreation:
|
|||
"aggregations" : {
|
||||
"all_products" : {
|
||||
"doc_count" : 100, <1>
|
||||
"avg_price" : {
|
||||
"avg_price" : {
|
||||
"value" : 56.3
|
||||
}
|
||||
}
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-histogram-aggregation]]
|
||||
=== Histogram
|
||||
=== Histogram Aggregation
|
||||
|
||||
A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents.
|
||||
It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field
|
||||
|
@ -25,7 +25,7 @@ The following snippet "buckets" the products based on their `price` by interval
|
|||
{
|
||||
"aggs" : {
|
||||
"prices" : {
|
||||
"histogram" : {
|
||||
"histogram" : {
|
||||
"field" : "price",
|
||||
"interval" : 50
|
||||
}
|
||||
|
@ -70,7 +70,7 @@ and create buckets with zero documents). This can be configured using the `min_d
|
|||
{
|
||||
"aggs" : {
|
||||
"prices" : {
|
||||
"histogram" : {
|
||||
"histogram" : {
|
||||
"field" : "price",
|
||||
"interval" : 50,
|
||||
"min_doc_count" : 0
|
||||
|
@ -172,7 +172,7 @@ Ordering the buckets by their key - descending:
|
|||
{
|
||||
"aggs" : {
|
||||
"prices" : {
|
||||
"histogram" : {
|
||||
"histogram" : {
|
||||
"field" : "price",
|
||||
"interval" : 50,
|
||||
"order" : { "_key" : "desc" }
|
||||
|
@ -189,7 +189,7 @@ Ordering the buckets by their `doc_count` - ascending:
|
|||
{
|
||||
"aggs" : {
|
||||
"prices" : {
|
||||
"histogram" : {
|
||||
"histogram" : {
|
||||
"field" : "price",
|
||||
"interval" : 50,
|
||||
"order" : { "_count" : "asc" }
|
||||
|
@ -206,7 +206,7 @@ If the histogram aggregation has a direct metrics sub-aggregation, the latter ca
|
|||
{
|
||||
"aggs" : {
|
||||
"prices" : {
|
||||
"histogram" : {
|
||||
"histogram" : {
|
||||
"field" : "price",
|
||||
"interval" : 50,
|
||||
"order" : { "price_stats.min" : "asc" } <1>
|
||||
|
@ -275,7 +275,7 @@ limit through the `min_doc_count` option.
|
|||
{
|
||||
"aggs" : {
|
||||
"prices" : {
|
||||
"histogram" : {
|
||||
"histogram" : {
|
||||
"field" : "price",
|
||||
"interval" : 50,
|
||||
"min_doc_count": 10
|
||||
|
@ -336,7 +336,7 @@ instead keyed by the buckets keys:
|
|||
{
|
||||
"aggs" : {
|
||||
"prices" : {
|
||||
"histogram" : {
|
||||
"histogram" : {
|
||||
"field" : "price",
|
||||
"interval" : 50,
|
||||
"keyed" : true
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-iprange-aggregation]]
|
||||
=== IPv4 Range
|
||||
=== IPv4 Range Aggregation
|
||||
|
||||
Just like the dedicated <<search-aggregations-bucket-daterange-aggregation,date>> range aggregation, there is also a dedicated range aggregation for IPv4 typed fields:
|
||||
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-missing-aggregation]]
|
||||
=== Missing
|
||||
=== Missing Aggregation
|
||||
|
||||
A field data based single bucket aggregation, that creates a bucket of all documents in the current document set context that are missing a field value (effectively, missing a field or having the configured NULL value set). This aggregator will often be used in conjunction with other field data bucket aggregators (such as ranges) to return information for all the documents that could not be placed in any of the other buckets due to missing field data values.
|
||||
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-nested-aggregation]]
|
||||
=== Nested
|
||||
=== Nested Aggregation
|
||||
|
||||
A special single bucket aggregation that enables aggregating nested documents.
|
||||
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-range-aggregation]]
|
||||
=== Range
|
||||
=== Range Aggregation
|
||||
|
||||
A multi-bucket value source based aggregation that enables the user to define a set of ranges - each representing a bucket. During the aggregation process, the values extracted from each document are checked against each bucket range, and the relevant/matching documents are "bucketed" accordingly.
|
||||
Note that this aggregation includes the `from` value and excludes the `to` value for each range.
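To make the boundary behaviour concrete, here is a minimal Python sketch of the `from`-inclusive, `to`-exclusive check. The helper name `matching_ranges` and the sample range list are hypothetical and serve only as illustration; this is not the aggregation's actual implementation.

```python
def matching_ranges(value, ranges):
    """Return the ranges whose half-open interval [from, to) contains value."""
    hits = []
    for r in ranges:
        above = "from" not in r or value >= r["from"]  # `from` is inclusive
        below = "to" not in r or value < r["to"]       # `to` is exclusive
        if above and below:
            hits.append(r)
    return hits


# Hypothetical ranges in the same shape as the request body's `ranges` array.
ranges = [{"to": 50}, {"from": 50, "to": 100}, {"from": 100}]
```

With these ranges, a price of exactly `50` lands only in the second bucket, because the first bucket excludes its `to` value.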
|
||||
|
@ -11,7 +11,7 @@ Example:
|
|||
{
|
||||
"aggs" : {
|
||||
"price_ranges" : {
|
||||
"range" : {
|
||||
"range" : {
|
||||
"field" : "price",
|
||||
"ranges" : [
|
||||
{ "to" : 50 },
|
||||
|
@ -62,7 +62,7 @@ Setting the `keyed` flag to `true` will associate a unique string key with each
|
|||
{
|
||||
"aggs" : {
|
||||
"price_ranges" : {
|
||||
"range" : {
|
||||
"range" : {
|
||||
"field" : "price",
|
||||
"keyed" : true,
|
||||
"ranges" : [
|
||||
|
@ -112,7 +112,7 @@ It is also possible to customize the key for each range:
|
|||
{
|
||||
"aggs" : {
|
||||
"price_ranges" : {
|
||||
"range" : {
|
||||
"range" : {
|
||||
"field" : "price",
|
||||
"keyed" : true,
|
||||
"ranges" : [
|
||||
|
@ -155,7 +155,7 @@ Lets say the product prices are in USD but we would like to get the price ranges
|
|||
{
|
||||
"aggs" : {
|
||||
"price_ranges" : {
|
||||
"range" : {
|
||||
"range" : {
|
||||
"field" : "price",
|
||||
"script" : "_value * conversion_rate",
|
||||
"params" : {
|
||||
|
@ -181,7 +181,7 @@ The following example, not only "bucket" the documents to the different buckets
|
|||
{
|
||||
"aggs" : {
|
||||
"price_ranges" : {
|
||||
"range" : {
|
||||
"range" : {
|
||||
"field" : "price",
|
||||
"ranges" : [
|
||||
{ "to" : 50 },
|
||||
|
@ -190,7 +190,7 @@ The following example, not only "bucket" the documents to the different buckets
|
|||
]
|
||||
},
|
||||
"aggs" : {
|
||||
"price_stats" : {
|
||||
"price_stats" : {
|
||||
"stats" : { "field" : "price" }
|
||||
}
|
||||
}
|
||||
|
@ -254,7 +254,7 @@ If a sub aggregation is also based on the same value source as the range aggrega
|
|||
{
|
||||
"aggs" : {
|
||||
"price_ranges" : {
|
||||
"range" : {
|
||||
"range" : {
|
||||
"field" : "price",
|
||||
"ranges" : [
|
||||
{ "to" : 50 },
|
||||
|
@ -263,13 +263,13 @@ If a sub aggregation is also based on the same value source as the range aggrega
|
|||
]
|
||||
},
|
||||
"aggs" : {
|
||||
"price_stats" : {
|
||||
"price_stats" : {
|
||||
"stats" : {} <1>
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
--------------------------------------------------
|
||||
--------------------------------------------------
|
||||
|
||||
<1> We don't need to specify the `price` as we "inherit" it by default from the parent `range` aggregation
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-reverse-nested-aggregation]]
|
||||
=== Reverse nested
|
||||
=== Reverse nested Aggregation
|
||||
|
||||
A special single bucket aggregation that enables aggregating on parent docs from nested documents. Effectively this
|
||||
aggregation can break out of the nested block structure and link to other nested structures or the root document,
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-significantterms-aggregation]]
|
||||
=== Significant Terms
|
||||
=== Significant Terms Aggregation
|
||||
|
||||
An aggregation that returns interesting or unusual occurrences of terms in a set.
|
||||
|
||||
|
@ -22,7 +22,7 @@ added[1.1.0]
|
|||
|
||||
In all these cases the terms being selected are not simply the most popular terms in a set.
|
||||
They are the terms that have undergone a significant change in popularity measured between a _foreground_ and _background_ set.
|
||||
If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user's search results
|
||||
If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user's search results
|
||||
that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency.
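The size of that swing can be checked with a short calculation. The sketch below uses only the figures quoted above; the function name `frequency_shift` is hypothetical.

```python
from fractions import Fraction


def frequency_shift(fg_hits, fg_total, bg_hits, bg_total):
    """Ratio of the foreground frequency to the background frequency."""
    foreground = Fraction(fg_hits, fg_total)
    background = Fraction(bg_hits, bg_total)
    return foreground / background


# "H5N1": 4 of the 100 search results vs 5 of the 10 million indexed documents.
shift = frequency_shift(4, 100, 5, 10_000_000)
```

Exact fractions are used here simply to avoid floating-point noise when comparing a very small frequency with a much larger one.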
|
||||
|
||||
==== Single-set analysis
|
||||
|
@ -70,15 +70,15 @@ Response:
|
|||
}
|
||||
--------------------------------------------------
|
||||
|
||||
When querying an index of all crimes from all police forces, what these results show is that the British Transport Police force
|
||||
stand out as a force dealing with a disproportionately large number of bicycle thefts. Ordinarily, bicycle thefts represent only 1% of crimes (66799/5064554)
|
||||
When querying an index of all crimes from all police forces, what these results show is that the British Transport Police force
|
||||
stand out as a force dealing with a disproportionately large number of bicycle thefts. Ordinarily, bicycle thefts represent only 1% of crimes (66799/5064554)
|
||||
but for the British Transport Police, who handle crime on railways and stations, 7% of crimes (3640/47347) are
|
||||
bike thefts. This is a significant seven-fold increase in frequency and so this anomaly was highlighted as the top crime type.
|
||||
|
||||
The problem with using a query to spot anomalies is it only gives us one subset to use for comparisons.
|
||||
To discover all the other police forces' anomalies we would have to repeat the query for each of the different forces.
|
||||
|
||||
This can be a tedious way to look for unusual patterns in an index
|
||||
This can be a tedious way to look for unusual patterns in an index
|
||||
|
||||
|
||||
|
||||
|
@ -135,7 +135,7 @@ Response:
|
|||
"doc_count": 47347,
|
||||
"significantCrimeTypes": {
|
||||
"doc_count": 47347,
|
||||
"buckets": [
|
||||
"buckets": [
|
||||
{
|
||||
"key": "Bicycle theft",
|
||||
"doc_count": 3640,
|
||||
|
@ -163,7 +163,7 @@ area to identify unusual hot-spots of a particular crime type:
|
|||
{
|
||||
"aggs": {
|
||||
"hotspots": {
|
||||
"geohash_grid" : {
|
||||
"geohash_grid" : {
|
||||
"field":"location",
|
||||
"precision":5,
|
||||
},
|
||||
|
@ -177,8 +177,8 @@ area to identify unusual hot-spots of a particular crime type:
|
|||
}
|
||||
--------------------------------------------------
|
||||
|
||||
This example uses the `geohash_grid` aggregation to create result buckets that represent geographic areas, and inside each
|
||||
bucket we can identify anomalous levels of a crime type in these tightly-focused areas e.g.
|
||||
This example uses the `geohash_grid` aggregation to create result buckets that represent geographic areas, and inside each
|
||||
bucket we can identify anomalous levels of a crime type in these tightly-focused areas e.g.
|
||||
|
||||
* Airports exhibit unusual numbers of weapon confiscations
|
||||
* Universities show uplifts of bicycle thefts
|
||||
|
@ -188,16 +188,16 @@ tackling an unusual volume of a particular crime type.
|
|||
|
||||
|
||||
Obviously a time-based top-level segmentation would help identify current trends for each point in time
|
||||
where a simple `terms` aggregation would typically show the very popular "constants" that persist across all time slots.
|
||||
where a simple `terms` aggregation would typically show the very popular "constants" that persist across all time slots.
|
||||
|
||||
|
||||
|
||||
.How are the scores calculated?
|
||||
**********************************
|
||||
The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users.
|
||||
The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favour
|
||||
The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users.
|
||||
The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favour
|
||||
common terms whereas the _relative_ change in popularity (foregroundPercent/ backgroundPercent) would favour rare terms.
|
||||
Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
|
||||
Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
|
||||
|
||||
**********************************
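As a rough illustration of the scoring idea in the sidebar above, the following Python sketch multiplies the absolute and relative change in popularity. It follows only the description given here and is an assumption about the shape of the calculation, not the engine's actual scoring code; the function name and the lack of handling for a zero background count are simplifications of this sketch.

```python
def significance_score(fg_count, fg_total, bg_count, bg_total):
    """Multiply absolute and relative change in popularity (illustrative only)."""
    fg_pct = fg_count / fg_total            # foregroundPercent
    bg_pct = bg_count / bg_total            # backgroundPercent (assumed non-zero)
    absolute_change = fg_pct - bg_pct       # favours common terms
    relative_change = fg_pct / bg_pct       # favours rare terms
    return absolute_change * relative_change


# Bicycle theft figures from the British Transport Police example.
score = significance_score(3640, 47347, 66799, 5064554)
```

A term that is equally frequent in both sets gets an absolute change of zero and therefore a score of zero, which matches the intuition that it is not significant.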
|
||||
|
||||
|
@ -207,15 +207,15 @@ Rare vs common is essentially a precision vs recall balance and so the absolute
|
|||
The significant_terms aggregation can be used effectively on tokenized free-text fields to suggest:
|
||||
|
||||
* keywords for refining end-user searches
|
||||
* keywords for use in percolator queries
|
||||
* keywords for use in percolator queries
|
||||
|
||||
WARNING: Picking a free-text field as the subject of a significant terms analysis can be expensive! It will attempt
|
||||
to load every unique word into RAM. It is recommended to only use this on smaller indices.
|
||||
to load every unique word into RAM. It is recommended to only use this on smaller indices.
|
||||
|
||||
.Use the _"like this but not this"_ pattern
|
||||
**********************************
|
||||
You can spot mis-categorized content by first searching a structured field e.g. `category:adultMovie` and use significant_terms on the
|
||||
free-text "movie_description" field. Take the suggested words (I'll leave them to your imagination) and then search for all movies NOT marked as category:adultMovie but containing these keywords.
|
||||
free-text "movie_description" field. Take the suggested words (I'll leave them to your imagination) and then search for all movies NOT marked as category:adultMovie but containing these keywords.
|
||||
You now have a ranked list of badly-categorized movies that you should reclassify or at least remove from the "familyFriendly" category.
|
||||
|
||||
The significance score from each term can also provide a useful `boost` setting to sort matches.
|
||||
|
@ -224,11 +224,11 @@ a high setting would have a small number of relevant results packed full of keyw
|
|||
|
||||
**********************************
|
||||
|
||||
[TIP]
|
||||
[TIP]
|
||||
============
|
||||
.Show significant_terms in context
|
||||
|
||||
Free-text significant_terms are much more easily understood when viewed in context. Take the results of `significant_terms` suggestions from a
|
||||
Free-text significant_terms are much more easily understood when viewed in context. Take the results of `significant_terms` suggestions from a
|
||||
free-text field and use them in a `terms` query on the same field with a `highlight` clause to present users with example snippets of documents. When the terms
|
||||
are presented unstemmed, highlighted, with the right case, in the right order and with some context, their significance/meaning is more readily apparent.
|
||||
============
|
||||
|
@ -239,7 +239,7 @@ are presented unstemmed, highlighted, with the right case, in the right order an
|
|||
The above examples show how to select the _foreground_ set for analysis using a query or parent aggregation to filter but currently there is no means of specifying
|
||||
a _background_ set other than the index from which all results are ultimately drawn. Sometimes it may prove useful to use a different
|
||||
background set as the basis for comparisons e.g. to first select the tweets for the TV show "XFactor" and then look
|
||||
for significant terms in a subset of that content which is from this week.
|
||||
for significant terms in a subset of that content which is from this week.
|
||||
|
||||
===== Significant terms must be indexed values
|
||||
Unlike the terms aggregation it is currently not possible to use script-generated terms for counting purposes.
|
||||
|
@ -250,20 +250,20 @@ Also DocValues are not supported as sources of term data for similar reasons.
|
|||
===== No analysis of floating point fields
|
||||
Floating point fields are currently not supported as the subject of significant_terms analysis.
|
||||
While integer or long fields can be used to represent concepts like bank account numbers or category numbers which
|
||||
can be interesting to track, floating point fields are usually used to represent quantities of something.
|
||||
As such, individual floating point terms are not useful for this form of frequency analysis.
|
||||
can be interesting to track, floating point fields are usually used to represent quantities of something.
|
||||
As such, individual floating point terms are not useful for this form of frequency analysis.
|
||||
|
||||
===== Use as a parent aggregation
|
||||
If there is the equivalent of a `match_all` query or no query criteria providing a subset of the index the significant_terms aggregation should not be used as the
|
||||
If there is the equivalent of a `match_all` query or no query criteria providing a subset of the index the significant_terms aggregation should not be used as the
|
||||
top-most aggregation - in this scenario the _foreground_ set is exactly the same as the _background_ set and
|
||||
so there is no difference in document frequencies to observe and from which to make sensible suggestions.
|
||||
|
||||
Another consideration is that the significant_terms aggregation produces many candidate results at shard level
|
||||
Another consideration is that the significant_terms aggregation produces many candidate results at shard level
|
||||
that are only later pruned on the reducing node once all statistics from all shards are merged. As a result,
|
||||
it can be inefficient and costly in terms of RAM to embed large child aggregations under a significant_terms
|
||||
it can be inefficient and costly in terms of RAM to embed large child aggregations under a significant_terms
|
||||
aggregation that later discards many candidate terms. It is advisable in these cases to perform two searches - the first to provide a rationalized list of
|
||||
significant_terms and then add this shortlist of terms to a second query to go back and fetch the required child aggregations.
|
||||
|
||||
significant_terms and then add this shortlist of terms to a second query to go back and fetch the required child aggregations.
|
||||
|
||||
===== Approximate counts
|
||||
The counts of how many documents contain a term provided in results are based on summing the samples returned from each shard and
|
||||
as such may be:
|
||||
|
@ -272,7 +272,7 @@ as such may be:
|
|||
* high when considering the background frequency as it may count occurrences found in deleted documents
|
||||
|
||||
Like most design decisions, this is the basis of a trade-off in which we have chosen to provide fast performance at the cost of some (typically small) inaccuracies.
|
||||
However, the `size` and `shard size` settings covered in the next section provide tools to help control the accuracy levels.
|
||||
However, the `size` and `shard size` settings covered in the next section provide tools to help control the accuracy levels.
|
||||
|
||||
==== Parameters
|
||||
|
||||
|
@ -287,14 +287,14 @@ If the number of unique terms is greater than `size`, the returned list can be s
|
|||
size buckets was not returned).
|
||||
|
||||
To ensure better accuracy a multiple of the final `size` is used as the number of terms to request from each shard
|
||||
using a heuristic based on the number of shards. To take manual control of this setting the `shard_size` parameter
|
||||
using a heuristic based on the number of shards. To take manual control of this setting the `shard_size` parameter
|
||||
can be used to control the volumes of candidate terms produced by each shard.
|
||||
|
||||
Low-frequency terms can turn out to be the most interesting ones once all results are combined so the
|
||||
significant_terms aggregation can produce higher-quality results when the `shard_size` parameter is set to
|
||||
values significantly higher than the `size` setting. This ensures that a bigger volume of promising candidate terms are given
|
||||
a consolidated review by the reducing node before the final selection. Obviously large candidate term lists
|
||||
will cause extra network traffic and RAM usage so this is a quality/cost trade-off that needs to be balanced.
|
||||
will cause extra network traffic and RAM usage so this is a quality/cost trade-off that needs to be balanced.
|
||||
|
||||
NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, elasticsearch will
|
||||
override it and reset it to be equal to `size`.
|
||||
|
@ -308,7 +308,7 @@ It is possible to only return terms that match more than a configured number of
|
|||
{
|
||||
"aggs" : {
|
||||
"tags" : {
|
||||
"significant_terms" : {
|
||||
"significant_terms" : {
|
||||
"field" : "tag",
|
||||
"min_doc_count": 10
|
||||
}
|
||||
|
@ -320,8 +320,8 @@ It is possible to only return terms that match more than a configured number of
|
|||
The above aggregation would only return tags which have been found in 10 hits or more. Default value is `3`.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
Terms that score highly will be collected on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global term frequencies available. The decision whether a term is added to a candidate list depends only on the score computed on the shard using local shard frequencies, not the global frequencies of the word. The `min_doc_count` criterion is only applied after merging the local term statistics of all shards. In a way, the decision to add the term as a candidate is made without being very _certain_ whether the term will actually reach the required `min_doc_count`. This might cause many (globally) high-frequency terms to be missing in the final result if low-frequency but high-scoring terms populated the candidate lists. To avoid this, the `shard_size` parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.
|
||||
|
||||
The parameter `shard_min_doc_count` regulates the _certainty_ a shard has whether the term should actually be added to the candidate list with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low-frequency words and you are not interested in these (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with reasonable certainty not reach the required `min_doc_count` even after merging the local frequencies. `shard_min_doc_count` is set to `1` by default and has no effect unless you explicitly set it.
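The interaction between `shard_size`, `shard_min_doc_count` and `min_doc_count` can be sketched as a two-phase process. The Python below is a simplified model, not the real implementation: the function names, the per-shard data layout, and the use of `>=` for both thresholds (chosen so that a `shard_min_doc_count` of `1` filters nothing) are assumptions of this sketch.

```python
from collections import Counter


def shard_candidates(local_terms, shard_size, shard_min_doc_count=1):
    """Phase 1 (per shard): keep the top-scoring terms by local score.

    local_terms maps term -> (local_score, local_doc_count).
    """
    eligible = {
        term: stats for term, stats in local_terms.items()
        if stats[1] >= shard_min_doc_count
    }
    top = sorted(eligible, key=lambda t: eligible[t][0], reverse=True)[:shard_size]
    return {term: eligible[term][1] for term in top}


def reduce_candidates(per_shard, min_doc_count):
    """Phase 2 (reducing node): merge counts, then apply min_doc_count."""
    merged = Counter()
    for candidates in per_shard:
        merged.update(candidates)
    return {term: count for term, count in merged.items() if count >= min_doc_count}


# Two hypothetical shards: a globally frequent term and two rare misspellings.
shard_a = {"common": (0.2, 8), "typo1": (0.9, 1)}
shard_b = {"common": (0.2, 7), "typo2": (0.9, 1)}

# Default shard_min_doc_count: the misspellings crowd out "common".
lost = reduce_candidates(
    [shard_candidates(s, shard_size=1) for s in (shard_a, shard_b)],
    min_doc_count=10,
)

# Raising shard_min_doc_count filters the misspellings on each shard.
kept = reduce_candidates(
    [shard_candidates(s, shard_size=1, shard_min_doc_count=2) for s in (shard_a, shard_b)],
    min_doc_count=10,
)
```

In this toy data the first run returns nothing, because each shard spent its single candidate slot on a misspelling, while the second run returns the globally frequent term.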
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-bucket-terms-aggregation]]
|
||||
=== Terms
|
||||
=== Terms Aggregation
|
||||
|
||||
A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.
|
||||
|
||||
|
@ -80,7 +80,7 @@ Ordering the buckets by their `doc_count` in an ascending manner:
|
|||
{
|
||||
"aggs" : {
|
||||
"genders" : {
|
||||
"terms" : {
|
||||
"terms" : {
|
||||
"field" : "gender",
|
||||
"order" : { "_count" : "asc" }
|
||||
}
|
||||
|
@ -96,7 +96,7 @@ Ordering the buckets alphabetically by their terms in an ascending manner:
|
|||
{
|
||||
"aggs" : {
|
||||
"genders" : {
|
||||
"terms" : {
|
||||
"terms" : {
|
||||
"field" : "gender",
|
||||
"order" : { "_term" : "asc" }
|
||||
}
|
||||
|
@ -113,7 +113,7 @@ Ordering the buckets by single value metrics sub-aggregation (identified by the
|
|||
{
|
||||
"aggs" : {
|
||||
"genders" : {
|
||||
"terms" : {
|
||||
"terms" : {
|
||||
"field" : "gender",
|
||||
"order" : { "avg_height" : "desc" }
|
||||
},
|
||||
|
@ -132,7 +132,7 @@ Ordering the buckets by multi value metrics sub-aggregation (identified by the a
|
|||
{
|
||||
"aggs" : {
|
||||
"genders" : {
|
||||
"terms" : {
|
||||
"terms" : {
|
||||
"field" : "gender",
|
||||
"order" : { "height_stats.avg" : "desc" }
|
||||
},
|
||||
|
@ -193,7 +193,7 @@ It is possible to only return terms that match more than a configured number of
|
|||
{
|
||||
"aggs" : {
|
||||
"tags" : {
|
||||
"terms" : {
|
||||
"terms" : {
|
||||
"field" : "tag",
|
||||
"min_doc_count": 10
|
||||
}
|
||||
|
@ -221,7 +221,7 @@ Generating the terms using a script:
|
|||
{
|
||||
"aggs" : {
|
||||
"genders" : {
|
||||
"terms" : {
|
||||
"terms" : {
|
||||
"script" : "doc['gender'].value"
|
||||
}
|
||||
}
|
||||
|
@ -236,7 +236,7 @@ Generating the terms using a script:
|
|||
{
|
||||
"aggs" : {
|
||||
"genders" : {
|
||||
"terms" : {
|
||||
"terms" : {
|
||||
"field" : "gender",
|
||||
"script" : "'Gender: ' +_value"
|
||||
}
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-metrics-avg-aggregation]]
|
||||
=== Avg
|
||||
=== Avg Aggregation
|
||||
|
||||
A `single-value` metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
|
||||
|
||||
|
@ -58,14 +58,14 @@ It turned out that the exam was way above the level of the students and a grade
|
|||
...
|
||||
|
||||
"aggs" : {
|
||||
"avg_corrected_grade" : {
|
||||
"avg" : {
|
||||
"avg_corrected_grade" : {
|
||||
"avg" : {
|
||||
"field" : "grade",
|
||||
"script" : "_value * correction",
|
||||
"params" : {
|
||||
"correction" : 1.2
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-metrics-cardinality-aggregation]]
|
||||
=== Cardinality
|
||||
=== Cardinality Aggregation
|
||||
|
||||
added[1.1.0]
|
||||
|
||||
|
@ -21,8 +21,8 @@ match a query:
|
|||
--------------------------------------------------
|
||||
{
|
||||
"aggs" : {
|
||||
"author_count" : {
|
||||
"cardinality" : {
|
||||
"author_count" : {
|
||||
"cardinality" : {
|
||||
"field" : "author"
|
||||
}
|
||||
}
|
||||
|
@ -36,8 +36,8 @@ This aggregation also supports the `precision_threshold` and `rehash` options:
|
|||
--------------------------------------------------
|
||||
{
|
||||
"aggs" : {
|
||||
"author_count" : {
|
||||
"cardinality" : {
|
||||
"author_count" : {
|
||||
"cardinality" : {
|
||||
"field" : "author_hash",
|
||||
"precision_threshold": 100, <1>
|
||||
"rehash": false <2>
|
||||
|
@ -76,7 +76,7 @@ properties:
|
|||
* excellent accuracy on low-cardinality sets,
|
||||
* fixed memory usage: no matter if there are tens or billions of unique values,
|
||||
memory usage only depends on the configured precision.
|
||||
|
||||
|
||||
For a precision threshold of `c`, the implementation that we are using requires
|
||||
about `c * 8` bytes.
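As a quick back-of-the-envelope aid, the estimate above can be written as a one-line helper. The function name is hypothetical and the result is only the approximate figure quoted in the text, not a measured value.

```python
def estimated_memory_bytes(precision_threshold):
    """Approximate memory use for a given precision threshold (about c * 8 bytes)."""
    return precision_threshold * 8


# A precision_threshold of 100 would need roughly 800 bytes under this estimate.
approx = estimated_memory_bytes(100)
```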
|
||||
|
||||
|
@ -121,8 +121,8 @@ without computing hashes on the fly:
|
|||
--------------------------------------------------
|
||||
{
|
||||
"aggs" : {
|
||||
"author_count" : {
|
||||
"cardinality" : {
|
||||
"author_count" : {
|
||||
"cardinality" : {
|
||||
"field" : "author.hash"
|
||||
}
|
||||
}
|
||||
|
@ -149,8 +149,8 @@ however since hashes need to be computed on the fly.
|
|||
--------------------------------------------------
|
||||
{
|
||||
"aggs" : {
|
||||
"author_count" : {
|
||||
"cardinality" : {
|
||||
"author_count" : {
|
||||
"cardinality" : {
|
||||
"script": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value"
|
||||
}
|
||||
}
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-metrics-extendedstats-aggregation]]
|
||||
=== Extended Stats
|
||||
=== Extended Stats Aggregation
|
||||
|
||||
A `multi-value` metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
|
||||
|
||||
|
@ -68,7 +68,7 @@ It turned out that the exam was way above the level of the students and a grade
|
|||
|
||||
"aggs" : {
|
||||
"grades_stats" : {
|
||||
"extended_stats" : {
|
||||
"extended_stats" : {
|
||||
"field" : "grade",
|
||||
"script" : "_value * correction",
|
||||
"params" : {
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-metrics-max-aggregation]]
|
||||
=== Max
|
||||
=== Max Aggregation
|
||||
|
||||
A `single-value` metrics aggregation that keeps track of and returns the maximum value among the numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
|
||||
|
||||
|
@ -53,14 +53,14 @@ Let's say that the prices of the documents in our index are in USD, but we would
|
|||
--------------------------------------------------
|
||||
{
|
||||
"aggs" : {
|
||||
"max_price_in_euros" : {
|
||||
"max" : {
|
||||
"max_price_in_euros" : {
|
||||
"max" : {
|
||||
"field" : "price",
|
||||
"script" : "_value * conversion_rate",
|
||||
"params" : {
|
||||
"conversion_rate" : 1.2
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-metrics-min-aggregation]]
|
||||
=== Min
|
||||
=== Min Aggregation
|
||||
|
||||
A `single-value` metrics aggregation that keeps track and returns the minimum value among numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
|
||||
|
||||
|
@ -53,8 +53,8 @@ Let's say that the prices of the documents in our index are in USD, but we would
|
|||
--------------------------------------------------
|
||||
{
|
||||
"aggs" : {
|
||||
"min_price_in_euros" : {
|
||||
"min" : {
|
||||
"min_price_in_euros" : {
|
||||
"min" : {
|
||||
"field" : "price",
|
||||
"script" : "_value * conversion_rate",
|
||||
"params" : {
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-metrics-percentile-aggregation]]
|
||||
=== Percentiles
|
||||
=== Percentiles Aggregation
|
||||
|
||||
added[1.1.0]
|
||||
|
||||
|
@ -16,15 +16,15 @@ future. If you use this feature, please let us know your experience with it!
|
|||
=====
|
||||
|
||||
Percentiles show the point at which a certain percentage of observed values
|
||||
occur. For example, the 95th percentile is the value which is greater than 95%
|
||||
occur. For example, the 95th percentile is the value which is greater than 95%
|
||||
of the observed values.
|
||||
|
||||
Percentiles are often used to find outliers. In normal distributions, the
|
||||
0.13th and 99.87th percentiles represent three standard deviations from the
|
||||
Percentiles are often used to find outliers. In normal distributions, the
|
||||
0.13th and 99.87th percentiles represent three standard deviations from the
|
||||
mean. Any data which falls outside three standard deviations is often considered
|
||||
an anomaly.
|
||||
|
||||
When a range of percentiles are retrieved, they can be used to estimate the
|
||||
When a range of percentiles are retrieved, they can be used to estimate the
|
||||
data distribution and determine if the data is skewed, bimodal, etc.
|
||||
|
||||
Assume your data consists of website load times. The average and median
|
||||
|
@ -37,17 +37,17 @@ Let's look at a range of percentiles representing load time:
|
|||
--------------------------------------------------
|
||||
{
|
||||
"aggs" : {
|
||||
"load_time_outlier" : {
|
||||
"percentiles" : {
|
||||
"load_time_outlier" : {
|
||||
"percentiles" : {
|
||||
"field" : "load_time" <1>
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
--------------------------------------------------
|
||||
<1> The field `load_time` must be a numeric field
|
||||
|
||||
By default, the `percentiles` metric will generate a range of
|
||||
By default, the `percentiles` metric will generate a range of
|
||||
percentiles: `[ 1, 5, 25, 50, 75, 95, 99 ]`. The response will look like this:
|
||||
|
||||
[source,js]
|
||||
|
@ -75,23 +75,23 @@ WARNING: added[1.2.0] The above response structure applies for `1.2.0` and above
|
|||
missing and all the percentiles were placed directly under the aggregation name object
|
||||
|
||||
As you can see, the aggregation will return a calculated value for each percentile
|
||||
in the default range. If we assume response times are in milliseconds, it is
|
||||
in the default range. If we assume response times are in milliseconds, it is
|
||||
immediately obvious that the webpage normally loads in 15-30ms, but occasionally
|
||||
spikes to 60-150ms.
|
||||
|
||||
Often, administrators are only interested in outliers -- the extreme percentiles.
|
||||
We can specify just the percents we are interested in (requested percentiles
|
||||
We can specify just the percents we are interested in (requested percentiles
|
||||
must be values between 0 and 100 inclusive):
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
{
|
||||
"aggs" : {
|
||||
"load_time_outlier" : {
|
||||
"percentiles" : {
|
||||
"load_time_outlier" : {
|
||||
"percentiles" : {
|
||||
"field" : "load_time",
|
||||
"percents" : [95, 99, 99.9] <1>
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -110,13 +110,13 @@ a script to convert them on-the-fly:
|
|||
--------------------------------------------------
|
||||
{
|
||||
"aggs" : {
|
||||
"load_time_outlier" : {
|
||||
"percentiles" : {
|
||||
"load_time_outlier" : {
|
||||
"percentiles" : {
|
||||
"script" : "doc['load_time'].value / timeUnit", <1>
|
||||
"params" : {
|
||||
"timeUnit" : 1000 <2>
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -127,7 +127,7 @@ script to generate values which percentiles are calculated on
|
|||
|
||||
==== Percentiles are (usually) approximate
|
||||
|
||||
There are many different algorithms to calculate percentiles. The naive
|
||||
There are many different algorithms to calculate percentiles. The naive
|
||||
implementation simply stores all the values in a sorted array. To find the 50th
|
||||
percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.
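The naive approach described above can be sketched in a few lines. This is only an illustration of the sorted-array lookup, not the algorithm Elasticsearch uses; the function name `naive_percentile` and the sample load times are hypothetical, and the index is clamped so that the 100th percentile stays inside the array.

```python
def naive_percentile(values, percent):
    """Return the value at the given percentile using a fully sorted copy.

    Mirrors the lookup `my_array[count(my_array) * 0.5]` from the text,
    generalised to any percent between 0 and 100 inclusive.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100 inclusive")
    my_array = sorted(values)  # keeps every value in memory: O(n) space
    index = int(len(my_array) * percent / 100)
    index = min(index, len(my_array) - 1)  # clamp so percent=100 is valid
    return my_array[index]


if __name__ == "__main__":
    load_times = [15, 20, 25, 30, 150]  # made-up sample data, in milliseconds
    print(naive_percentile(load_times, 50))
    print(naive_percentile(load_times, 100))
```

The sketch makes the cost visible: every value has to be kept and sorted, which is what becomes impractical across billions of values.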
|
||||
|
||||
|
@ -137,7 +137,7 @@ across potentially billions of values in an Elasticsearch cluster, _approximate_
|
|||
percentiles are calculated.
|
||||
|
||||
The algorithm used by the `percentiles` metric is called TDigest (introduced by
|
||||
Ted Dunning in
|
||||
Ted Dunning in
|
||||
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[Computing Accurate Quantiles using T-Digests]).
|
||||
|
||||
When using this metric, there are a few guidelines to keep in mind:
|
||||
|
@ -147,8 +147,8 @@ are more accurate than less extreme percentiles, such as the median
|
|||
- For small sets of values, percentiles are highly accurate (and potentially
|
||||
100% accurate if the data is small enough).
|
||||
- As the quantity of values in a bucket grows, the algorithm begins to approximate
|
||||
the percentiles. It is effectively trading accuracy for memory savings. The
|
||||
exact level of inaccuracy is difficult to generalize, since it depends on your
|
||||
the percentiles. It is effectively trading accuracy for memory savings. The
|
||||
exact level of inaccuracy is difficult to generalize, since it depends on your
|
||||
data distribution and volume of data being aggregated
|
||||
|
||||
The following chart shows the relative error on a uniform distribution depending
|
||||
|
@ -170,20 +170,20 @@ This balance can be controlled using a `compression` parameter:
|
|||
--------------------------------------------------
|
||||
{
|
||||
"aggs" : {
|
||||
"load_time_outlier" : {
|
||||
"percentiles" : {
|
||||
"load_time_outlier" : {
|
||||
"percentiles" : {
|
||||
"field" : "load_time",
|
||||
"compression" : 200 <1>
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
--------------------------------------------------
|
||||
<1> Compression controls memory usage and approximation error
|
||||
|
||||
The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the
|
||||
The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the
|
||||
more nodes available, the higher the accuracy (and the larger the memory footprint) proportional
|
||||
to the volume of data. The `compression` parameter limits the maximum number of
|
||||
to the volume of data. The `compression` parameter limits the maximum number of
|
||||
nodes to `100 * compression`.
|
||||
|
||||
Therefore, by increasing the compression value, you can increase the accuracy of
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-metrics-stats-aggregation]]
|
||||
=== Stats
|
||||
=== Stats Aggregation
|
||||
|
||||
A `multi-value` metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
|
||||
|
||||
|
@ -65,7 +65,7 @@ It turned out that the exam was way above the level of the students and a grade
|
|||
|
||||
"aggs" : {
|
||||
"grades_stats" : {
|
||||
"stats" : {
|
||||
"stats" : {
|
||||
"field" : "grade",
|
||||
"script" : "_value * correction",
|
||||
"params" : {
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-metrics-sum-aggregation]]
|
||||
=== Sum
|
||||
=== Sum Aggregation
|
||||
|
||||
A `single-value` metrics aggregation that sums up numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
|
||||
|
||||
|
@ -66,10 +66,10 @@ Computing the sum of squares over all stock tick changes:
|
|||
...
|
||||
|
||||
"aggs" : {
|
||||
"daytime_return" : {
|
||||
"sum" : {
|
||||
"daytime_return" : {
|
||||
"sum" : {
|
||||
"field" : "change",
|
||||
"script" : "_value * _value" }
|
||||
"script" : "_value * _value" }
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
[[search-aggregations-metrics-valuecount-aggregation]]
|
||||
=== Value Count
|
||||
=== Value Count Aggregation
|
||||
|
||||
A `single-value` metrics aggregation that counts the number of values that are extracted from the aggregated documents.
|
||||
These values can be extracted either from specific fields in the documents, or be generated by a provided script. Typically,
|
||||
|
|