[[search-aggregations-bucket-terms-aggregation]]
=== Terms

A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.

Example:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : { "field" : "gender" }
        }
    }
}
--------------------------------------------------

Response:

[source,js]
--------------------------------------------------
{
    ...

    "aggregations" : {
        "genders" : {
            "buckets" : [
                {
                    "key" : "male",
                    "doc_count" : 10
                },
                {
                    "key" : "female",
                    "doc_count" : 10
                }
            ]
        }
    }
}
--------------------------------------------------

By default, the `terms` aggregation returns the buckets for the top ten terms, ordered by `doc_count`. This
default can be changed by setting the `size` parameter.

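As a minimal sketch of setting `size`, the snippet below builds a request body as a Python dictionary and serializes it with the standard `json` module. The `genders` and `gender` names are reused from the example above; the value `5` is an arbitrary choice for illustration.

```python
import json

# Request body for a terms aggregation that returns only the top 5 buckets.
# "genders" and "gender" come from the example above; 5 is an arbitrary,
# illustrative value for `size`.
request_body = {
    "aggs": {
        "genders": {
            "terms": {"field": "gender", "size": 5}
        }
    }
}

# Serialize to the JSON that would be sent as the search request body.
print(json.dumps(request_body, indent=4))
```
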

==== Size & Shard Size

The `size` parameter defines how many term buckets are returned out of the overall terms list. By
default, the node coordinating the search asks each shard for its own top `size` term buckets
and, once all shards respond, reduces the results to the final list that is returned to the client.
This means that if the number of unique terms is greater than `size`, the returned list may be inaccurate:
the term counts can be slightly off, and a term that should have been in the top
`size` buckets may even be missing.

The higher the requested `size`, the more accurate the results, but also the more expensive it is to
compute the final results (both because of the bigger priority queues managed at the shard level and because of
bigger data transfers between the nodes and the client).

The `shard_size` parameter can be used to minimize the extra work that comes with a bigger requested `size`. When defined,
it determines how many terms the coordinating node requests from each shard. Once all the shards have responded, the
coordinating node reduces them to a final result based on the `size` parameter. This way,
one can increase the accuracy of the returned terms while avoiding the overhead of streaming a big list of buckets
back to the client.

NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sense). When it is, Elasticsearch
overrides it and resets it to be equal to `size`.

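The effect of `size` and `shard_size` on accuracy can be illustrated with a small simulation. This is a simplified model of the reduce described above, not the actual Elasticsearch implementation: `terms_agg` is a hypothetical helper, and the per-shard term counts are made up for illustration.

```python
from collections import Counter


def terms_agg(shards, size, shard_size=None):
    """Simulate the two-phase reduce for a terms aggregation.

    `shards` is a list of Counters mapping term -> doc_count on that shard.
    """
    # shard_size defaults to size, and is reset to size when it is smaller.
    if shard_size is None or shard_size < size:
        shard_size = size

    merged = Counter()
    for shard in shards:
        # Each shard only reports its own top `shard_size` terms.
        for term, count in shard.most_common(shard_size):
            merged[term] += count

    # The coordinating node reduces the merged lists to the final top `size`.
    return merged.most_common(size)


# Hypothetical data: three shards with different local term distributions.
shards = [
    Counter({"a": 12, "b": 9, "c": 8}),
    Counter({"c": 10, "b": 9, "a": 1}),
    Counter({"d": 10, "b": 9, "c": 7}),
]

exact = sum(shards, Counter()).most_common(1)      # true global top term
approx = terms_agg(shards, size=1)                 # each shard sends 1 term
better = terms_agg(shards, size=1, shard_size=3)   # each shard sends 3 terms

print(exact)   # [('b', 27)]
print(approx)  # [('a', 12)]
print(better)  # [('b', 27)]
```

In this made-up data set, `b` is never the top term on any single shard, so with `shard_size` equal to `size` it is missed entirely, and the count reported for `a` is also too low. Requesting more terms per shard recovers the correct result while still returning a single bucket to the client.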

==== Order

The order of the buckets can be customized by setting the `order` parameter. By default, the buckets are ordered by
their `doc_count` descending. This behaviour can be changed as follows:

Ordering the buckets by their `doc_count` in an ascending manner:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "_count" : "asc" }
            }
        }
    }
}
--------------------------------------------------

Ordering the buckets alphabetically by their terms in an ascending manner:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "_term" : "asc" }
            }
        }
    }
}
--------------------------------------------------

Ordering the buckets by a single-value metrics sub-aggregation (identified by the aggregation name):

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "avg_height" : "desc" }
            },
            "aggs" : {
                "avg_height" : { "avg" : { "field" : "height" } }
            }
        }
    }
}
--------------------------------------------------

Ordering the buckets by a multi-value metrics sub-aggregation (identified by the aggregation name, followed by a dot and the metric name):

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "order" : { "height_stats.avg" : "desc" }
            },
            "aggs" : {
                "height_stats" : { "stats" : { "field" : "height" } }
            }
        }
    }
}
--------------------------------------------------

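To make the `order` keys concrete, the sketch below sorts a list of already-computed buckets in plain Python. It only illustrates how `_count`, `_term` and a dotted key such as `height_stats.avg` select the sort value; `resolve` and `order_buckets` are hypothetical helpers, and the simplified bucket layout and numbers are invented for illustration.

```python
def resolve(bucket, path):
    """Look up the sort value that an `order` key refers to in one bucket."""
    if path == "_count":
        return bucket["doc_count"]
    if path == "_term":
        return bucket["key"]
    # "agg_name.metric": walk into the sub-aggregation result step by step.
    value = bucket
    for part in path.split("."):
        value = value[part]
    return value


def order_buckets(buckets, order):
    # `order` holds a single entry, e.g. {"height_stats.avg": "desc"}.
    (path, direction), = order.items()
    return sorted(buckets, key=lambda b: resolve(b, path),
                  reverse=(direction == "desc"))


# Hypothetical buckets with a simplified multi-value sub-aggregation result.
buckets = [
    {"key": "male", "doc_count": 12,
     "height_stats": {"avg": 178.0, "max": 199.0}},
    {"key": "female", "doc_count": 10,
     "height_stats": {"avg": 165.0, "max": 190.0}},
]

print([b["key"] for b in order_buckets(buckets, {"_count": "asc"})])
# ['female', 'male']
print([b["key"] for b in order_buckets(buckets, {"height_stats.avg": "desc"})])
# ['male', 'female']
```
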

==== Script

Generating the terms using a script:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "script" : "doc['gender'].value"
            }
        }
    }
}
--------------------------------------------------

==== Value Script

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "genders" : {
            "terms" : {
                "field" : "gender",
                "script" : "doc['gender'].value"
            }
        }
    }
}
--------------------------------------------------

==== Filtering Values

It is possible to filter the values for which buckets will be created. This can be done using the `include` and
`exclude` parameters which are based on regular expressions.

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "include" : ".*sport.*",
                "exclude" : "water_.*"
            }
        }
    }
}
--------------------------------------------------

In the above example, buckets will be created for all the tags that have the word `sport` in them, except those starting
with `water_` (so the tag `water_sports` will not be aggregated). The `include` regular expression determines which
values are "allowed" to be aggregated, while the `exclude` determines the values that should not be aggregated. When
both are defined, the `exclude` has precedence: the `include` is evaluated first and only then the `exclude`.

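The precedence rule can be sketched with Python's `re` module. This is an illustration only: `filter_terms` is a hypothetical helper, the tag list is invented, Python's regular expression dialect is not identical to Java's, and the sketch assumes a pattern has to match the whole term (which is consistent with the example patterns being wrapped in `.*`).

```python
import re


def filter_terms(terms, include=None, exclude=None):
    """Keep terms that match `include` and do not match `exclude`.

    Assumption: a pattern must match the entire term, not just a part of it.
    """
    kept = []
    for term in terms:
        if include is not None and not re.fullmatch(include, term):
            continue  # not "allowed" by include
        if exclude is not None and re.fullmatch(exclude, term):
            continue  # exclude is checked last, so it wins over include
        kept.append(term)
    return kept


# Hypothetical tag values.
tags = ["sports", "water_sports", "motorsport", "music"]

print(filter_terms(tags, include=".*sport.*", exclude="water_.*"))
# ['sports', 'motorsport']
```

Here `water_sports` matches the `include` pattern but is still dropped, because the `exclude` pattern is applied afterwards.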

The regular expressions are based on the Java(TM) http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html[Pattern],
and as such, it is also possible to pass in flags that determine how the compiled regular expression will work:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "include" : {
                    "pattern" : ".*sport.*",
                    "flags" : "CANON_EQ|CASE_INSENSITIVE" <1>
                },
                "exclude" : {
                    "pattern" : "water_.*",
                    "flags" : "CANON_EQ|CASE_INSENSITIVE"
                }
            }
        }
    }
}
--------------------------------------------------

<1> the flags are concatenated using the `|` character as a separator

The possible flags that can be used are:
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#CANON_EQ[`CANON_EQ`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#CASE_INSENSITIVE[`CASE_INSENSITIVE`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#COMMENTS[`COMMENTS`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#DOTALL[`DOTALL`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#LITERAL[`LITERAL`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#MULTILINE[`MULTILINE`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CASE[`UNICODE_CASE`],
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS[`UNICODE_CHARACTER_CLASS`] and
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNIX_LINES[`UNIX_LINES`]
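
As an illustration of how a `|`-separated flag string relates to the Java `Pattern` constants, the sketch below combines the named flags into the bit mask that `Pattern.compile(String, int)` accepts. The numeric values are those of the `java.util.regex.Pattern` constants; `parse_flags` is a hypothetical helper, and how Elasticsearch parses the string internally is not described here.

```python
# Bit values of the java.util.regex.Pattern flag constants.
JAVA_PATTERN_FLAGS = {
    "UNIX_LINES": 0x01,
    "CASE_INSENSITIVE": 0x02,
    "COMMENTS": 0x04,
    "MULTILINE": 0x08,
    "LITERAL": 0x10,
    "DOTALL": 0x20,
    "UNICODE_CASE": 0x40,
    "CANON_EQ": 0x80,
    "UNICODE_CHARACTER_CLASS": 0x100,
}


def parse_flags(flags):
    """Combine a `|`-separated flag string into a single bit mask."""
    mask = 0
    for name in flags.split("|"):
        name = name.strip()
        if name not in JAVA_PATTERN_FLAGS:
            raise ValueError(f"unknown regex flag: {name!r}")
        mask |= JAVA_PATTERN_FLAGS[name]
    return mask


print(parse_flags("CANON_EQ|CASE_INSENSITIVE"))  # 130
```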