---
id: distinctcount
title: "DistinctCount Aggregator"
---

To use this Apache Druid (incubating) extension, make sure to include the `druid-distinctcount` extension in your extensions load list.
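
For example, loading the extension typically means adding it to `druid.extensions.loadList` in the common runtime properties (a minimal sketch; keep whatever other extensions you already load, and the exact file location depends on your deployment):

```properties
# common.runtime.properties (illustrative)
druid.extensions.loadList=["druid-distinctcount"]
```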

Additionally, follow these steps:

1. First, use a single-dimension, hash-based partition spec to partition data by a single dimension, for example `visitor_id`. This ensures that all rows with a particular value for that dimension go into the same segment; otherwise the aggregator may overcount (see the sketch after this list).
2. Second, use distinctCount to calculate the distinct count, and make sure `queryGranularity` is divided exactly by `segmentGranularity`, or else the result will be wrong.
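
As a sketch of step 1, a hash-based partition spec keyed on `visitor_id` could look like the following fragment of an ingestion spec's `tuningConfig` (the `numShards` value is illustrative; pick it based on your data volume and ingestion method):

```json
"partitionsSpec": {
  "type": "hashed",
  "numShards": 10,
  "partitionDimensions": ["visitor_id"]
}
```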

There are some limitations. When used with groupBy, the number of distinct groupBy keys should not exceed `maxIntermediateRows` in any segment; if it does, the result will be wrong. When used with topN, `numValuesPerPass` should not be too big; if it is, distinctCount will use a lot of memory and might cause the JVM to run out of memory.
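
If a groupBy query approaches this limit, `maxIntermediateRows` can be raised per query through the query context (a sketch; this applies to the legacy groupBy v1 strategy, whose default comes from `druid.query.groupBy.maxIntermediateRows`):

```json
"context": {
  "maxIntermediateRows": 100000
}
```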

Example:

## Timeseries query

```json
{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "aggregations": [
    {
      "type": "distinctCount",
      "name": "uv",
      "fieldName": "visitor_id"
    }
  ],
  "intervals": [
    "2016-03-01T00:00:00.000/2016-03-20T00:00:00.000"
  ]
}
```

## TopN query

```json
{
  "queryType": "topN",
  "dataSource": "sample_datasource",
  "dimension": "sample_dim",
  "threshold": 5,
  "metric": "uv",
  "granularity": "all",
  "aggregations": [
    {
      "type": "distinctCount",
      "name": "uv",
      "fieldName": "visitor_id"
    }
  ],
  "intervals": [
    "2016-03-06T00:00:00/2016-03-06T23:59:59"
  ]
}
```

## GroupBy query

```json
{
  "queryType": "groupBy",
  "dataSource": "sample_datasource",
  "dimensions": ["sample_dim"],
  "granularity": "all",
  "aggregations": [
    {
      "type": "distinctCount",
      "name": "uv",
      "fieldName": "visitor_id"
    }
  ],
  "intervals": [
    "2016-03-06T00:00:00/2016-03-06T23:59:59"
  ]
}
```