diff --git a/docs/content/development/datasketches-aggregators.md b/docs/content/development/datasketches-aggregators.md
new file mode 100644
index 00000000000..90207489f01
--- /dev/null
+++ b/docs/content/development/datasketches-aggregators.md
@@ -0,0 +1,42 @@
+---
+layout: doc_page
+---
+
+## DataSketches aggregator
+Druid aggregators based on the [datasketches](http://datasketches.github.io/) library. At ingestion time, you build sketches from one or more metric columns using either the `sketchBuild` or the `sketchMerge` aggregator. Then, at query time, you use `sketchMerge` with the appropriate post aggregators described below. Note that sketch algorithms are approximate; see details in the [datasketches doc](http://datasketches.github.io/docs/theChallenge.html).
+
+#### aggregators to use at ingestion time
+
+##### non-sketch input data
+You can ingest raw data into Druid and build sketch objects from it; these are later used at query time to compute unique counts.
+
+```json
+{ "type" : "sketchBuild", "name" : , "fieldName" : }
+```
+
+##### sketch input data
+You can also ingest data that already contains sketch objects, produced by a batch pipeline using, for example, [sketches-pig](https://github.com/DataSketches/sketches-pig). In that case you only need to merge those sketches as they are ingested into Druid.
+
+```json
+{ "type" : "sketchMerge", "name" : , "fieldName" : }
+```
+#### aggregator to use at query time
+
+```json
+{ "type" : "sketchMerge", "name" : , "fieldName" : }
+```
+
+Note that you can specify an additional field called "size" in the above aggregators; it defaults to 16384 and must be a power of 2. At query time, the size value must be greater than or equal to the value used at indexing time. Internally, size is the maximum number of entries a sketch object will retain: a larger size gives higher accuracy but requires more space to store the sketches. See [theta-size](http://datasketches.github.io/docs/ThetaSize.html) for more details. In general, we recommend sticking to the default size, which has worked well for us.
+
+### Post Aggregators
+
+#### Sketch Estimator
+```json
+{ "type" : "sketchEstimate", "name": , "fieldName" : }
+```
+
+#### Sketch Operations
+```json
+{ "type" : "sketchSetOp", "name": , "func": , "fields" : }
+```
+
diff --git a/docs/content/toc.textile b/docs/content/toc.textile
index 6e158fa401e..3e695048006 100644
--- a/docs/content/toc.textile
+++ b/docs/content/toc.textile
@@ -89,6 +89,7 @@ h2. Development
 ** "Geographic Queries":../development/geo.html
 ** "Select Query":../development/select-query.html
 ** "Approximate Histograms and Quantiles":../development/approximate-histograms.html
+** "Datasketches based Aggregators":../development/datasketches-aggregators.html
 ** "Router node":../development/router.html
 ** "New Kafka Firehose":../development/kafka-simple-consumer-firehose.html