ea1794832f
This adds a `rare_terms` aggregation. It is an aggregation designed to identify the long-tail of keywords, e.g. terms that are "rare" or have low doc counts. This aggregation is designed to be more memory efficient than the alternative, which is setting a terms aggregation to size: LONG_MAX (or worse, ordering a terms agg by count ascending, which has unbounded error). This aggregation works by maintaining a map of terms that have been seen. A counter associated with each value is incremented when we see the term again. If the counter surpasses a predefined threshold, the term is removed from the map and inserted into a cuckoo filter. If a future term is found in the cuckoo filter we assume it was previously removed from the map and is "common". The map keys are the "rare" terms after collection is done. |
||
---|---|---|
.. | ||
accuracy_0001.png | ||
accuracy_001.png | ||
accuracy_01.png | ||
memory.png |