158 lines
5.5 KiB
Plaintext
158 lines
5.5 KiB
Plaintext
[[search-aggregations-metrics-cardinality-aggregation]]
|
|
=== Cardinality Aggregation
|
|
|
|
A `single-value` metrics aggregation that calculates an approximate count of
|
|
distinct values. Values can be extracted either from specific fields in the
|
|
document or generated by a script.
|
|
|
|
Assume you are indexing books and would like to count the unique authors that
|
|
match a query:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"author_count" : {
|
|
"cardinality" : {
|
|
"field" : "author"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
==== Precision control
|
|
|
|
This aggregation also supports the `precision_threshold` option:
|
|
|
|
experimental[The `precision_threshold` option is specific to the current internal implementation of the `cardinality` agg, which may change in the future]
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"author_count" : {
|
|
"cardinality" : {
|
|
"field" : "author_hash",
|
|
"precision_threshold": 100 <1>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> The `precision_threshold` options allows to trade memory for accuracy, and
|
|
defines a unique count below which counts are expected to be close to
|
|
accurate. Above this value, counts might become a bit more fuzzy. The maximum
|
|
supported value is 40000, thresholds above this number will have the same
|
|
effect as a threshold of 40000.
|
|
Default value depends on the number of parent aggregations that multiple
|
|
create buckets (such as terms or histograms).
|
|
|
|
==== Counts are approximate
|
|
|
|
Computing exact counts requires loading values into a hash set and returning its
|
|
size. This doesn't scale when working on high-cardinality sets and/or large
|
|
values as the required memory usage and the need to communicate those
|
|
per-shard sets between nodes would utilize too many resources of the cluster.
|
|
|
|
This `cardinality` aggregation is based on the
|
|
http://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]
|
|
algorithm, which counts based on the hashes of the values with some interesting
|
|
properties:
|
|
|
|
* configurable precision, which decides on how to trade memory for accuracy,
|
|
* excellent accuracy on low-cardinality sets,
|
|
* fixed memory usage: no matter if there are tens or billions of unique values,
|
|
memory usage only depends on the configured precision.
|
|
|
|
For a precision threshold of `c`, the implementation that we are using requires
|
|
about `c * 8` bytes.
|
|
|
|
The following chart shows how the error varies before and after the threshold:
|
|
|
|
image:images/cardinality_error.png[]
|
|
|
|
For all 3 thresholds, counts have been accurate up to the configured threshold
|
|
(although not guaranteed, this is likely to be the case). Please also note that
|
|
even with a threshold as low as 100, the error remains under 5%, even when
|
|
counting millions of items.
|
|
|
|
==== Pre-computed hashes
|
|
|
|
On string fields that have a high cardinality, it might be faster to store the
|
|
hash of your field values in your index and then run the cardinality aggregation
|
|
on this field. This can either be done by providing hash values from client-side
|
|
or by letting elasticsearch compute hash values for you by using the
|
|
{plugins}/mapper-size.html[`mapper-murmur3`] plugin.
|
|
|
|
NOTE: Pre-computing hashes is usually only useful on very large and/or
|
|
high-cardinality fields as it saves CPU and memory. However, on numeric
|
|
fields, hashing is very fast and storing the original values requires as much
|
|
or less memory than storing the hashes. This is also true on low-cardinality
|
|
string fields, especially given that those have an optimization in order to
|
|
make sure that hashes are computed at most once per unique value per segment.
|
|
|
|
==== Script
|
|
|
|
The `cardinality` metric supports scripting, with a noticeable performance hit
|
|
however since hashes need to be computed on the fly.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"author_count" : {
|
|
"cardinality" : {
|
|
"script": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
This will interpret the `script` parameter as an `inline` script with the default script language and no script parameters. To use a file script use the following syntax:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"author_count" : {
|
|
"cardinality" : {
|
|
"script" : {
|
|
"file": "my_script",
|
|
"params": {
|
|
"first_name_field": "author.first_name",
|
|
"last_name_field": "author.last_name"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
TIP: for indexed scripts replace the `file` parameter with an `id` parameter.
|
|
|
|
==== Missing value
|
|
|
|
The `missing` parameter defines how documents that are missing a value should be treated.
|
|
By default they will be ignored but it is also possible to treat them as if they
|
|
had a value.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"aggs" : {
|
|
"tag_cardinality" : {
|
|
"cardinality" : {
|
|
"field" : "tag",
|
|
"missing": "N/A" <1>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> Documents without a value in the `tag` field will fall into the same bucket as documents that have the value `N/A`.
|