From 14907645aac85f923df9fc78e04d74c493f80bda Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Xavier=20L=C3=A9aut=C3=A9?=
Date: Fri, 25 Apr 2014 13:23:43 -0700
Subject: [PATCH] add cardinality aggregator docs

---
 docs/content/Aggregations.md | 76 ++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 4 deletions(-)

diff --git a/docs/content/Aggregations.md b/docs/content/Aggregations.md
index 74ad226ff81..a59066940e9 100644
--- a/docs/content/Aggregations.md
+++ b/docs/content/Aggregations.md
@@ -59,7 +59,8 @@ Computes an arbitrary JavaScript function over a set of columns (both metrics an
 All JavaScript functions must return numerical values.
 
 ```json
-{ "type": "javascript", "name": "",
+{ "type": "javascript",
+  "name": "",
   "fieldNames" : [ , , ... ],
   "fnAggregate" : "function(current, column1, column2, ...) {

@@ -83,11 +84,78 @@ All JavaScript functions must return numerical values.
 }
 ```
 
-### Complex aggregators
+### Cardinality aggregator
 
-#### `hyperUnique` aggregator
+Computes the cardinality of a set of Druid dimensions, estimated using HyperLogLog.
 
-`hyperUnique` uses [Hyperloglog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) to compute the estimated cardinality of a dimension.
+```json
+{
+  "type": "cardinality",
+  "name": "",
+  "fieldNames": [ , , ... ],
+  "byRow": # (optional, defaults to false)
+}
+```
+
+#### Cardinality by value
+
+When `byRow` is `false` (the default), the aggregator computes the cardinality of the set formed by the union of all values of all the given dimensions.
+
+* For a single dimension, this is equivalent to
+
+```sql
+SELECT COUNT(DISTINCT(dimension)) FROM
+```
+
+* For multiple dimensions, this is equivalent to something akin to
+
+```sql
+SELECT COUNT(DISTINCT(value)) FROM (
+  SELECT dim_1 as value FROM
+  UNION
+  SELECT dim_2 as value FROM
+  UNION
+  SELECT dim_3 as value FROM
+)
+```
+
+#### Cardinality by row
+
+When `byRow` is `true`, the aggregator computes the cardinality by row, i.e. the number of distinct combinations of dimension values.
+This is equivalent to something akin to
+
+```sql
+SELECT COUNT(*) FROM ( SELECT DIM1, DIM2, DIM3 FROM GROUP BY DIM1, DIM2, DIM3 )
+```
+
+**Example**
+
+Determine the number of distinct categories items are assigned to.
+
+```json
+{
+  "type": "cardinality",
+  "name": "distinct_values",
+  "fieldNames": [ "main_category", "secondary_category" ]
+}
+```
+
+Determine the number of distinct category combinations items are assigned to.
+
+```json
+{
+  "type": "cardinality",
+  "name": "distinct_values",
+  "fieldNames": [ "", "secondary_category" ],
+  "byRow" : true
+}
+```
+
+## Complex Aggregations
+
+### HyperUnique aggregator
+
+Uses [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) to compute the estimated cardinality of a dimension that has been aggregated as a hyperUnique metric at indexing time.
 
 ```json
 { "type" : "hyperUnique", "name" : , "fieldName" : }