6.1 KiB

Raw Blame History

layout
doc_page

Tutorial: All About Queries

Hello! This tutorial is meant to provide a more in-depth look into Druid queries. The tutorial is somewhat incomplete right now but we hope to add more content to it in the near future.

Setup

Before we start digging into how to query Druid, make sure you've gone through the other tutorials and are comfortable with spinning up a local cluster and loading data into Druid.

Booting a Druid Cluster

Let's start up a simple Druid cluster so we can query all the things.

Note: If Zookeeper and metadata storage aren't running, you'll have to start them again as described in The Druid Cluster.

To start a Coordinator node:

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/coordinator io.druid.cli.Main server coordinator

To start a Historical node:

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/historical io.druid.cli.Main server historical

To start a Broker node:

java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/broker io.druid.cli.Main server broker

Querying Your Data

Make sure you've completed Loading Your Data so we have some data to query. Having done that, it's time to query our data! For a complete specification of queries, see Querying.

Construct a Query

{
    "queryType": "groupBy",
    "dataSource": "wikipedia",
    "granularity": "all",
    "dimensions": [],
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "longSum", "name": "edit_count", "fieldName": "count"},
        {"type": "doubleSum", "name": "chars_added", "fieldName": "added"}
    ],
    "intervals": ["2010-01-01T00:00/2020-01-01T00"]
}

Query That Data

Run the query against your broker:

curl -X POST "http://localhost:8080/druid/v2/?pretty" -H 'Content-type: application/json' -d @query.body

And get:

[ {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : {
    "chars_added" : 1545.0,
    "edit_count" : 5,
    "rows" : 5
  }
} ]

This result tells us that our query has 5 edits, and we have 5 rows of data as well. In those 5 edits, we have 1545 characters added.

What can I query for?

How are we to know what queries we can run? Although Querying is a helpful index, to get a handle on querying our data we need to look at our ingestion schema. There are a few particular fields we care about in the ingestion schema. All of these fields should in present in the real-time ingestion schema and the batch ingestion schema.

Datasource:

"dataSource":"wikipedia"

Our dataSource tells us the name of the relation/table, or 'source of data'. What we decide to name our data source must match the data source we are going to be querying.

Granularity:

"indexGranularity": "none",

Druid will roll up data at ingestion time unless the index/rollup granularity is specified as "none". Your query granularity cannot be lower than your index granularity.

Aggregators:

"aggregators" : [{
   "type" : "count",
   "name" : "count"
  }, {
   "type" : "doubleSum",
   "name" : "added",
   "fieldName" : "added"
  }, {
   "type" : "doubleSum",
   "name" : "deleted",
   "fieldName" : "deleted"
  }, {
   "type" : "doubleSum",
   "name" : "delta",
   "fieldName" : "delta"
}]

The Aggregations specified at ingestion time correlated directly to the metrics that can be queried.

Dimensions:

"dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]

These specify the dimensions that we can filter our data on. If we added a dimension to our groupBy query, we get:

{
    "queryType": "groupBy",
    "dataSource": "wikipedia",
    "granularity": "all",
    "dimensions": ["namespace"],
    "aggregations": [
        {"type": "longSum", "name": "edit_count", "fieldName": "count"},
        {"type": "doubleSum", "name": "chars_added", "fieldName": "added"}
    ],
    "intervals": ["2010-01-01T00:00/2020-01-01T00"]
}

Which gets us data grouped over the namespace dimension in return!

[ {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : {
    "chars_added" : 180.0,
    "edit_count" : 2,
    "namespace" : "article"
  }
}, {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : {
    "chars_added" : 1365.0,
    "edit_count" : 3,
    "namespace" : "wikipedia"
  }
} ]

Additionally,, we can also filter our query to narrow down our metric values:

{
    "queryType": "groupBy",
    "dataSource": "wikipedia",
    "granularity": "all",
    "filter": { "type": "selector", "dimension": "namespace", "value": "article" },
    "aggregations": [
        {"type": "longSum", "name": "edit_count", "fieldName": "count"},
        {"type": "doubleSum", "name": "chars_added", "fieldName": "added"}
    ],
    "intervals": ["2010-01-01T00:00/2020-01-01T00"]
}

Which gets us metrics about only those edits where the namespace is 'article':

[ {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : {
    "chars_added" : 180.0,
    "edit_count" : 2
  }
} ]

Check out Filters for more information.

What Types of Queries to Use

The types of query you should use depends on your use case. TimeBoundary queries are useful to understand the range of your data. Timeseries queries are useful for aggregates and filters over a time range, and offer significant speed improvements over GroupBy queries. To find the top values for a given dimension, TopN queries should be used over group by queries as well.

Learn More

You can learn more about querying at Querying! If you are ready to evaluate Druid more in depth, check out Booting a production cluster!

6.1 KiB Raw Blame History