adding datasketches aggregator to documentation

Himanshu Gupta 2015-10-30 15:53:24 -05:00
parent 9c569be11e
commit 6c6a38cedb
2 changed files with 43 additions and 0 deletions


@ -0,0 +1,42 @@
---
layout: doc_page
---
## DataSketches aggregator
Druid aggregators based on the [datasketches](http://datasketches.github.io/) library. At ingestion time, you ingest one or more metric columns using either the `sketchBuild` or the `sketchMerge` aggregator. At query time, you use `sketchMerge` together with the post aggregators described below. Note that sketch algorithms are approximate; see the [datasketches doc](http://datasketches.github.io/docs/theChallenge.html) for details.
#### aggregators to use at ingestion time
##### non-sketch input data
You can ingest raw data into Druid and build sketch objects from it. These sketches are used later, at query time, to compute unique counts.
```json
{ "type" : "sketchBuild", "name" : <output_name>, "fieldName" : <metric_name> }
```
##### sketch input data
If your data already contains sketch objects produced by a batch pipeline (for example, with [sketches-pig](https://github.com/DataSketches/sketches-pig)), you only need to merge those sketches as they are ingested into Druid.
```json
{ "type" : "sketchMerge", "name" : <output_name>, "fieldName" : <metric_name> }
```
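As a rough illustration of the choice between the two ingestion-time aggregators, the following Python sketch builds the spec for each case. The helper name `ingestion_aggregator` and the column names `user_id` and `user_id_sketch` are hypothetical and not part of Druid.

```python
import json


def ingestion_aggregator(name, field_name, input_is_sketch):
    """Return an aggregator spec for one metric column.

    input_is_sketch=False: the column holds raw values, so use sketchBuild.
    input_is_sketch=True:  the column already holds sketches, so use sketchMerge.
    """
    agg_type = "sketchMerge" if input_is_sketch else "sketchBuild"
    return {"type": agg_type, "name": name, "fieldName": field_name}


# Hypothetical column names, for illustration only.
raw = ingestion_aggregator("user_id_sketch", "user_id", input_is_sketch=False)
pre_built = ingestion_aggregator("user_id_sketch", "user_id_sketch", input_is_sketch=True)

print(json.dumps(raw))
print(json.dumps(pre_built))
```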
#### aggregator to use at query time
```json
{ "type" : "sketchMerge", "name" : <output_name>, "fieldName" : <metric_name> }
```
Note that you can specify an additional field called "size" in the above aggregators. It is 16384 by default and must be a power of 2. At query time, the size value must be greater than or equal to the value used at indexing time. Internally, size is the maximum number of entries a sketch object will retain; a larger size means higher accuracy but more space needed to store the sketches. See [theta-size](http://datasketches.github.io/docs/ThetaSize.html) for more details. In general, I would recommend sticking with the default size, which has worked well for us.
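The two constraints on "size" can be checked before a spec is submitted. This is a minimal client-side sketch, not something Druid provides; the function names `is_power_of_two` and `check_sizes` are made up for illustration.

```python
DEFAULT_SIZE = 16384  # default stated in the text above


def is_power_of_two(n):
    """True for positive integers with exactly one bit set."""
    return isinstance(n, int) and n > 0 and (n & (n - 1)) == 0


def check_sizes(index_size=DEFAULT_SIZE, query_size=DEFAULT_SIZE):
    """Raise ValueError if the sizes violate the rules described above."""
    for label, size in (("index", index_size), ("query", query_size)):
        if not is_power_of_two(size):
            raise ValueError(f"{label} size {size} is not a power of 2")
    if query_size < index_size:
        raise ValueError(
            f"query size {query_size} is smaller than index size {index_size}"
        )
    return True


# A query-time aggregator with an explicit size (names are hypothetical).
check_sizes(index_size=16384, query_size=32768)
query_agg = {
    "type": "sketchMerge",
    "name": "unique_users_sketch",
    "fieldName": "user_id_sketch",
    "size": 32768,
}
```

Passing `query_size=4096` with the default index size would raise a `ValueError`, as would any size that is not a power of 2, such as 10000.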
### Post Aggregators
#### Sketch Estimator
```json
{ "type" : "sketchEstimate", "name": <output name>, "fieldName" : <the name field value of the sketchMerge aggregator>}
```
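To show how the estimator refers back to the aggregator, here is a small Python sketch that assembles a query fragment. It assumes the enclosing query uses the usual `aggregations` and `postAggregations` keys; the names `user_id_sketch`, `unique_users_sketch`, and `unique_users` are hypothetical.

```python
import json

# Hypothetical stored sketch column, for illustration only.
SKETCH_COLUMN = "user_id_sketch"

query_fragment = {
    "aggregations": [
        # Merges the stored sketches at query time.
        {"type": "sketchMerge", "name": "unique_users_sketch", "fieldName": SKETCH_COLUMN}
    ],
    "postAggregations": [
        # fieldName here is the *name* of the sketchMerge aggregator above,
        # not the stored column.
        {"type": "sketchEstimate", "name": "unique_users", "fieldName": "unique_users_sketch"}
    ],
}

# Sanity check: every estimator must point at a declared aggregator name.
agg_names = {agg["name"] for agg in query_fragment["aggregations"]}
for post in query_fragment["postAggregations"]:
    if post["fieldName"] not in agg_names:
        raise ValueError(f"unknown aggregator name: {post['fieldName']}")

print(json.dumps(query_fragment, indent=2))
```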
#### Sketch Operations
```json
{ "type" : "sketchSetOp", "name": <output name>, "func": <UNION|INTERSECT|NOT>, "fields" : <the name field value of the sketchMerge aggregators>}
```
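The following Python sketch builds a `sketchSetOp` spec and, for intuition, applies the same three functions to exact Python sets. Several things here are assumptions rather than documented behavior: that `fields` is a JSON array of aggregator names, that at least two fields are required, and that `NOT` means set difference applied left to right (as in the DataSketches A-not-B operation). The helper names and aggregator names are hypothetical.

```python
ALLOWED_FUNCS = {"UNION", "INTERSECT", "NOT"}


def sketch_set_op(name, func, fields):
    """Build a sketchSetOp post aggregator spec (assumed shape, see lead-in)."""
    if func not in ALLOWED_FUNCS:
        raise ValueError(f"func must be one of {sorted(ALLOWED_FUNCS)}")
    if len(fields) < 2:
        raise ValueError("a set operation needs at least two fields")
    return {"type": "sketchSetOp", "name": name, "func": func, "fields": list(fields)}


def exact_set_op(func, first, *rest):
    """Exact-set analogy of the three functions; sketches only approximate this."""
    result = set(first)
    for other in rest:
        if func == "UNION":
            result |= set(other)
        elif func == "INTERSECT":
            result &= set(other)
        elif func == "NOT":
            result -= set(other)  # assumed meaning: set difference
        else:
            raise ValueError(f"unknown func: {func}")
    return result


# Hypothetical aggregator names, for illustration only.
spec = sketch_set_op("users_in_both", "INTERSECT", ["sketch_a", "sketch_b"])

a, b = {1, 2}, {2, 3}
union = exact_set_op("UNION", a, b)
both = exact_set_op("INTERSECT", a, b)
only_a = exact_set_op("NOT", a, b)
```

With exact sets the results are precise; with sketches the same operations return approximations whose accuracy depends on the configured size.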


@ -89,6 +89,7 @@ h2. Development
** "Geographic Queries":../development/geo.html
** "Select Query":../development/select-query.html
** "Approximate Histograms and Quantiles":../development/approximate-histograms.html
** "Datasketches based Aggregators":../development/datasketches-aggregators.html
** "Router node":../development/router.html
** "New Kafka Firehose":../development/kafka-simple-consumer-firehose.html