[[search-aggregations-metrics-cardinality-aggregation]]
=== Cardinality Aggregation

A `single-value` metrics aggregation that calculates an approximate count of
distinct values. Values can be extracted either from specific fields in the
document or generated by a script.

Assume you are indexing books and would like to count the unique authors that
match a query:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author"
            }
        }
    }
}
--------------------------------------------------

==== Precision control

This aggregation also supports the `precision_threshold` option:

experimental[The `precision_threshold` option is specific to the current internal implementation of the `cardinality` agg, which may change in the future]

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author_hash",
                "precision_threshold": 100 <1>
            }
        }
    }
}
--------------------------------------------------

<1> The `precision_threshold` option allows you to trade memory for accuracy, and
defines a unique count below which counts are expected to be close to
accurate. Above this value, counts might become a bit more fuzzy. The maximum
supported value is 40000; thresholds above this number will have the same
effect as a threshold of 40000. The default value is +3000+.

==== Counts are approximate

Computing exact counts requires loading values into a hash set and returning its
size. This doesn't scale when working on high-cardinality sets and/or large
values: the memory required, together with the need to communicate those
per-shard sets between nodes, would use too many cluster resources.

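To make the cost of the exact approach concrete, here is a minimal sketch in plain Java. It is not Elasticsearch code: the class name, the method names, and the idea of modelling each shard as a list of strings are assumptions made purely for illustration. Each shard builds a full set of its distinct values, and a coordinating step has to receive and merge all of those sets before it can return a size.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExactDistinctCount {

    /** Builds the set of distinct values seen on one (hypothetical) shard. */
    public static Set<String> shardSet(List<String> values) {
        return new HashSet<>(values);
    }

    /** Merges the per-shard sets and returns the exact distinct count. */
    public static int mergeAndCount(List<Set<String>> shardSets) {
        Set<String> merged = new HashSet<>();
        for (Set<String> shard : shardSets) {
            // Every distinct value of every shard must be transferred and kept in memory.
            merged.addAll(shard);
        }
        return merged.size();
    }

    public static void main(String[] args) {
        Set<String> shard1 = shardSet(Arrays.asList("tolkien", "austen", "tolkien"));
        Set<String> shard2 = shardSet(Arrays.asList("austen", "orwell"));
        System.out.println(mergeAndCount(Arrays.asList(shard1, shard2))); // prints 3
    }
}
```

Memory and network traffic in this sketch grow with the number and size of the distinct values, which is the scaling problem described above.
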
This `cardinality` aggregation is based on the
http://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]
algorithm, which counts based on the hashes of the values with some interesting
properties:

* configurable precision, which decides on how to trade memory for accuracy,
* excellent accuracy on low-cardinality sets,
* fixed memory usage: no matter if there are tens or billions of unique values,
  memory usage only depends on the configured precision.

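The properties listed above can be seen in a much simpler relative of the algorithm. The following is a sketch of plain HyperLogLog with a linear-counting fallback for small counts, not the HyperLogLog++ implementation used by Elasticsearch: it omits the sparse representation and bias correction, and the class name, the `mix64` stand-in hash, and the allowed precision range are all assumptions chosen for illustration.

```java
public class SimpleHyperLogLog {

    private final int p;            // precision: number of hash bits used to pick a register
    private final int m;            // number of registers, 2^p
    private final byte[] registers; // fixed-size state, independent of how many values are added

    public SimpleHyperLogLog(int p) {
        if (p < 7 || p > 18) {
            throw new IllegalArgumentException("p must be in [7, 18] for this sketch");
        }
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    /** 64-bit finalizer of MurmurHash3, used here as a stand-in hash function. */
    public static long mix64(long z) {
        z ^= z >>> 33;
        z *= 0xff51afd7ed558ccdL;
        z ^= z >>> 33;
        z *= 0xc4ceb9fe1a85ec53L;
        z ^= z >>> 33;
        return z;
    }

    /** Records one value, given its 64-bit hash. */
    public void addHash(long hash) {
        int index = (int) (hash >>> (64 - p)); // the first p bits choose the register
        long rest = hash << p;                 // the remaining 64 - p bits
        // Rank = position of the first 1 bit in the remaining bits (1-based).
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - p) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    /** Returns the estimated number of distinct hashes added so far. */
    public long cardinality() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / m); // bias constant, valid for m >= 128
        double estimate = alpha * m * m / sum;
        if (estimate <= 2.5 * m && zeros > 0) {
            // Small-range correction: linear counting on the number of empty registers.
            estimate = m * Math.log((double) m / zeros);
        }
        return Math.round(estimate);
    }

    public static void main(String[] args) {
        SimpleHyperLogLog hll = new SimpleHyperLogLog(14);
        for (long i = 1; i <= 1_000_000; i++) {
            hll.addHash(mix64(i));
        }
        System.out.println("estimate for 1,000,000 distinct values: " + hll.cardinality());
    }
}
```

The state is a single array of `2^p` bytes, so memory depends only on the chosen precision, and adding the same value again never changes the estimate.
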
For a precision threshold of `c`, the implementation that we are using requires
about `c * 8` bytes.

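As a quick worked example of this rule of thumb, the sketch below computes the approximate memory for a few thresholds. The class and method names are hypothetical; the only inputs taken from this page are the `c * 8` estimate and the maximum supported threshold of 40000, above which larger values behave like 40000.

```java
public class CardinalityMemoryEstimate {

    // Maximum supported precision_threshold, as stated on this page.
    static final int MAX_THRESHOLD = 40000;

    /** Approximate memory in bytes for a given precision_threshold, using c * 8. */
    public static long approxBytes(int precisionThreshold) {
        // Thresholds above the maximum have the same effect as the maximum.
        int effective = Math.min(precisionThreshold, MAX_THRESHOLD);
        return (long) effective * 8;
    }

    public static void main(String[] args) {
        for (int c : new int[] {100, 3000, 40000, 100000}) {
            System.out.println(c + " -> about " + approxBytes(c) + " bytes");
        }
    }
}
```

With these inputs the estimate is 800 bytes for a threshold of 100, 24000 bytes for the default of 3000, and 320000 bytes for 40000 and anything above it.
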
The following chart shows how the error varies before and after the threshold:

////
To generate this chart use this gnuplot script:
-------
#!/usr/bin/gnuplot
reset
set terminal png size 1000,400

set xlabel "Actual cardinality"
set logscale x

set ylabel "Relative error (%)"
set yrange [0:8]

set title "Cardinality error"
set grid

set style data lines

plot "test.dat" using 1:2 title "threshold=100", \
     "" using 1:3 title "threshold=1000", \
     "" using 1:4 title "threshold=10000"
#
-------

and generate data in a 'test.dat' file using the below Java code:

-------
private static double error(HyperLogLogPlusPlus h, long expected) {
    double actual = h.cardinality(0);
    return Math.abs(expected - actual) / expected;
}

public static void main(String[] args) {
    HyperLogLogPlusPlus h100 = new HyperLogLogPlusPlus(precisionFromThreshold(100), BigArrays.NON_RECYCLING_INSTANCE, 1);
    HyperLogLogPlusPlus h1000 = new HyperLogLogPlusPlus(precisionFromThreshold(1000), BigArrays.NON_RECYCLING_INSTANCE, 1);
    HyperLogLogPlusPlus h10000 = new HyperLogLogPlusPlus(precisionFromThreshold(10000), BigArrays.NON_RECYCLING_INSTANCE, 1);

    int next = 100;
    int step = 10;

    for (int i = 1; i <= 10000000; ++i) {
        long h = BitMixer.mix64(i);
        h100.collect(0, h);
        h1000.collect(0, h);
        h10000.collect(0, h);

        if (i == next) {
            System.out.println(i + " " + error(h100, i)*100 + " " + error(h1000, i)*100 + " " + error(h10000, i)*100);
            next += step;
            if (next >= 100 * step) {
                step *= 10;
            }
        }
    }
}
-------

////

image:images/cardinality_error.png[]

For all three thresholds, counts are accurate up to the configured threshold.
This is not guaranteed, but it is likely to be the case. Note also that even
with a threshold as low as 100, the error remains very low, even when counting
millions of items.

==== Pre-computed hashes

On string fields that have a high cardinality, it might be faster to store the
hash of your field values in your index and then run the cardinality aggregation
on this field. This can be done either by providing hash values from the client
side or by letting Elasticsearch compute hash values for you by using the
{plugins}/mapper-murmur3.html[`mapper-murmur3`] plugin.

NOTE: Pre-computing hashes is usually only useful on very large and/or
high-cardinality fields as it saves CPU and memory. However, on numeric
fields, hashing is very fast and storing the original values requires as much
or less memory than storing the hashes. This is also true on low-cardinality
string fields, especially given that those have an optimization in order to
make sure that hashes are computed at most once per unique value per segment.

==== Script

The `cardinality` metric supports scripting. Expect a noticeable performance
hit, however, since hashes need to be computed on the fly.

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "script": {
                    "lang": "painless",
                    "inline": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value"
                }
            }
        }
    }
}
--------------------------------------------------

This will interpret the `script` parameter as an `inline` script with the `painless` script language and no script parameters. To use a file script, use the following syntax:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "script" : {
                    "file": "my_script",
                    "params": {
                        "first_name_field": "author.first_name",
                        "last_name_field": "author.last_name"
                    }
                }
            }
        }
    }
}
--------------------------------------------------

TIP: For indexed scripts, replace the `file` parameter with an `id` parameter.

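To illustrate what the inline script in this section counts, here is a plain Java sketch, not Elasticsearch code. It assumes a hypothetical model in which each document is a two-element array holding the first and last name, and it counts exactly with a hash set, whereas the real aggregation hashes each computed value and returns an approximate count.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ScriptedCardinalitySketch {

    /**
     * Counts distinct "first last" strings. Each element of authors is a
     * hypothetical document: {first_name, last_name}.
     */
    public static int distinctFullNames(List<String[]> authors) {
        Set<String> seen = new HashSet<>();
        for (String[] author : authors) {
            // The value is computed per document, as the script does, before it is counted.
            seen.add(author[0] + " " + author[1]);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
            new String[] {"Jane", "Austen"},
            new String[] {"George", "Orwell"},
            new String[] {"Jane", "Austen"});
        System.out.println(distinctFullNames(docs)); // prints 2
    }
}
```

Because the concatenated value only exists once the script has run for a document, nothing about it can be prepared ahead of time, which is consistent with the performance note at the start of this section.
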
==== Missing value

The `missing` parameter defines how documents that are missing a value should be treated.
By default they will be ignored, but it is also possible to treat them as if they
had a value.

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "tag_cardinality" : {
            "cardinality" : {
                "field" : "tag",
                "missing": "N/A" <1>
            }
        }
    }
}
--------------------------------------------------

<1> Documents without a value in the `tag` field will fall into the same bucket as documents that have the value `N/A`.

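The effect of `missing` on the count can be sketched in plain Java. This is an illustration under assumptions, not Elasticsearch code: a document without a `tag` value is modelled as `null`, the class and method names are hypothetical, and the count is exact rather than approximate.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MissingValueSketch {

    /**
     * Counts distinct tags. A null entry stands for a document with no tag.
     * If missing is null, such documents are ignored (the default behaviour);
     * otherwise they are counted as if their tag were the given value.
     */
    public static int distinctTags(List<String> tags, String missing) {
        Set<String> seen = new HashSet<>();
        for (String tag : tags) {
            if (tag != null) {
                seen.add(tag);
            } else if (missing != null) {
                seen.add(missing); // treat the document as if it had this value
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("java", null, "search", null);
        System.out.println(distinctTags(tags, null));  // prints 2
        System.out.println(distinctTags(tags, "N/A")); // prints 3
    }
}
```

If some documents already carry the literal value `N/A`, the substituted documents collapse into that same value and the count does not increase.
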