---
layout: doc_page
---

## Setup

Before we start querying Druid, we're going to finish setting up a complete cluster on localhost. In [Loading Your Data](Loading Your Data.html) we set up a Realtime, Compute, and Master node. If you've already completed that tutorial, you only need to follow the directions for 'Booting a Broker Node'.

### Booting a Broker Node

  1. Set up a config file at config/broker/runtime.properties that looks like this:

    druid.host=0.0.0.0:8083
    druid.port=8083
    
    com.metamx.emitter.logging=true
    
    druid.processing.formatString=processing_%s
    druid.processing.numThreads=1
    druid.processing.buffer.sizeBytes=10000000
    
    #emitting, opaque marker
    druid.service=example
    
    druid.request.logging.dir=/tmp/example/log
    druid.realtime.specFile=realtime.spec
    com.metamx.emitter.logging.level=debug
    
    # dummy values; a broker node does not use these
    com.metamx.aws.accessKey=dummy_access_key
    com.metamx.aws.secretKey=dummy_secret_key
    druid.pusher.s3.bucket=dummy_s3_bucket
    
    druid.zk.service.host=localhost
    druid.server.maxSize=300000000000
    druid.zk.paths.base=/druid
    druid.database.segmentTable=prod_segments
    druid.database.user=druid
    druid.database.password=diurd
    druid.database.connectURI=jdbc:mysql://localhost:3306/druid
    druid.zk.paths.discoveryPath=/druid/discoveryPath
    druid.database.ruleTable=rules
    druid.database.configTable=config
    
    # Path on local FS for storage of segments; dir will be created if needed
    druid.paths.indexCache=/tmp/druid/indexCache
    # Path on local FS for storage of segment metadata; dir will be created if needed
    druid.paths.segmentInfoCache=/tmp/druid/segmentInfoCache
    druid.pusher.local.storageDirectory=/tmp/druid/localStorage
    druid.pusher.local=true
    
    # thread pool size for servicing queries
    druid.client.http.connections=30
    
  2. Run the broker node:

    java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
    -Ddruid.realtime.specFile=realtime.spec \
    -classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/broker \
    com.metamx.druid.http.BrokerMain
    

### Booting a Master Node

  1. Set up a config file at config/master/runtime.properties that looks like this: https://gist.github.com/rjurney/5818870

  2. Run the master node:

    java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
    -classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/master \
    com.metamx.druid.http.MasterMain
    

### Booting a Realtime Node

  1. Set up a config file at config/realtime/runtime.properties that looks like this: https://gist.github.com/rjurney/5818774

  2. Set up a realtime.spec file like this: https://gist.github.com/rjurney/5818779

  3. Run the realtime node:

    java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
    -Ddruid.realtime.specFile=realtime.spec \
    -classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/realtime \
    com.metamx.druid.realtime.RealtimeMain
    

### Booting a Compute Node

  1. Set up a config file at config/compute/runtime.properties that looks like this: https://gist.github.com/rjurney/5818885

  2. Run the compute node:

    java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
    -classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/compute \
    com.metamx.druid.http.ComputeMain
    

## Querying Your Data

Now that we have a complete cluster set up on localhost, we need to load data. To do so, refer to [Loading Your Data](Loading Your Data.html). Having done that, it's time to query our data! For a complete specification of queries, see Querying.

### Querying Different Nodes

Druid is a shared-nothing system, so there are three places you can send a query: a Realtime node, a Compute node, or the Broker node. Querying a Realtime node returns only realtime data, and querying a Compute node returns only historical segments. Querying the Broker hits both realtime and compute segments and composes an overall result for the query; this is the normal mode of operation for queries in Druid.
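
For reference, the ports used throughout this tutorial map to node types as follows (they match the configs from the setup steps above; the query.body file is created in 'Construct a Query' below):

    # 8080 - Realtime node  (realtime data only)
    # 8082 - Compute node   (historical segments only)
    # 8083 - Broker node    (merges realtime and historical results)
    # All three answer the same POST to /druid/v2/, for example:
    curl -X POST "http://localhost:8083/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body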

### Construct a Query

To see how this query was constructed, refer to 'Querying Against the realtime.spec' below:

    {
        "queryType": "groupBy",
        "dataSource": "druidtest",
        "granularity": "all",
        "dimensions": [],
        "aggregations": [
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "imps", "fieldName": "impressions"},
            {"type": "doubleSum", "name": "wp", "fieldName": "wp"}
        ],
        "intervals": ["2010-01-01T00:00/2020-01-01T00"]
    }
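
The curl commands in the next sections read this query from a file named query.body. If you are following along, save the JSON above to that file first; a minimal way to do it from the shell:

    # Write the groupBy query above into query.body in your working directory
    cat > query.body <<'EOF'
    {
        "queryType": "groupBy",
        "dataSource": "druidtest",
        "granularity": "all",
        "dimensions": [],
        "aggregations": [
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "imps", "fieldName": "impressions"},
            {"type": "doubleSum", "name": "wp", "fieldName": "wp"}
        ],
        "intervals": ["2010-01-01T00:00/2020-01-01T00"]
    }
    EOF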

### Querying the Realtime Node

Run our query against port 8080:

    curl -X POST "http://localhost:8080/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body

See our result:

    [ {
      "version" : "v1",
      "timestamp" : "2010-01-01T00:00:00.000Z",
      "event" : { "imps" : 5, "wp" : 15000.0, "rows" : 5 }
    } ]

### Querying the Compute Node

Run the query against port 8082:

    curl -X POST "http://localhost:8082/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body

And get (similar to):

    [ {
      "version" : "v1",
      "timestamp" : "2010-01-01T00:00:00.000Z",
      "event" : { "imps" : 27, "wp" : 77000.0, "rows" : 9 }
    } ]

### Querying both Nodes via the Broker

Run the query against port 8083:

    curl -X POST "http://localhost:8083/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body

And get:

    [ {
      "version" : "v1",
      "timestamp" : "2010-01-01T00:00:00.000Z",
      "event" : { "imps" : 5, "wp" : 15000.0, "rows" : 5 }
    } ]

Now that we know which nodes can be queried (although you should usually use the Broker node), let's learn how to find out which queries are available.

### Querying Against the realtime.spec

How are we to know what queries we can run? Although Querying is a helpful index, to get a handle on querying our data we need to look at our Realtime node's realtime.spec file:

    [{
      "schema" : { "dataSource":"druidtest",
                   "aggregators":[ {"type":"count", "name":"impressions"},
                                   {"type":"doubleSum","name":"wp","fieldName":"wp"}],
                   "indexGranularity":"minute",
                   "shardSpec" : { "type": "none" } },
      "config" : { "maxRowsInMemory" : 500000,
                   "intermediatePersistPeriod" : "PT10m" },
      "firehose" : { "type" : "kafka-0.7.2",
                     "consumerProps" : { "zk.connect" : "localhost:2181",
                                         "zk.connectiontimeout.ms" : "15000",
                                         "zk.sessiontimeout.ms" : "15000",
                                         "zk.synctime.ms" : "5000",
                                         "groupid" : "topic-pixel-local",
                                         "fetch.size" : "1048586",
                                         "autooffset.reset" : "largest",
                                         "autocommit.enable" : "false" },
                     "feed" : "druidtest",
                     "parser" : { "timestampSpec" : { "column" : "utcdt", "format" : "iso" },
                                  "data" : { "format" : "json" },
                                  "dimensionExclusions" : ["wp"] } },
      "plumber" : { "type" : "realtime",
                    "windowPeriod" : "PT10m",
                    "segmentGranularity":"hour",
                    "basePersistDirectory" : "/tmp/realtime/basePersist",
                    "rejectionPolicy": {"type": "messageTime"} }
    }]

#### dataSource

"dataSource":"druidtest"

Our dataSource names the relation/table, or 'source of data', to query; the value in query.body must match the dataSource declared in realtime.spec!

#### aggregations

Note the Aggregations in our query:

    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "longSum", "name": "imps", "fieldName": "impressions"},
        {"type": "doubleSum", "name": "wp", "fieldName": "wp"}
    ],

This matches up with the aggregators in the schema of our realtime.spec! Note that the ingestion-time count aggregator stores its result in the impressions column, so the query sums that column with longSum to recover the raw event count, while a query-time count aggregator counts Druid rows after roll-up.

"aggregators":[ {"type":"count", "name":"impressions"},
                {"type":"doubleSum","name":"wp","fieldName":"wp"}],

#### dimensions

Let's look back at our actual records (from [Loading Your Data](Loading Your Data.html)):

{"utcdt": "2010-01-01T01:01:01", "wp": 1000, "gender": "male", "age": 100}
{"utcdt": "2010-01-01T01:01:02", "wp": 2000, "gender": "female", "age": 50}
{"utcdt": "2010-01-01T01:01:03", "wp": 3000, "gender": "male", "age": 20}
{"utcdt": "2010-01-01T01:01:04", "wp": 4000, "gender": "female", "age": 30}
{"utcdt": "2010-01-01T01:01:05", "wp": 5000, "gender": "male", "age": 40}

Note that, aside from our primary metric, wp, our data has two dimensions: 'gender' and 'age'. We can specify these in our query. In the query below, we have added the dimension 'age'.

    {
        "queryType": "groupBy",
        "dataSource": "druidtest",
        "granularity": "all",
        "dimensions": ["age"],
        "aggregations": [
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "imps", "fieldName": "impressions"},
            {"type": "doubleSum", "name": "wp", "fieldName": "wp"}
        ],
        "intervals": ["2010-01-01T00:00/2020-01-01T00"]
    }

Which gets us grouped data in return!

    [ {
      "version" : "v1",
      "timestamp" : "2010-01-01T00:00:00.000Z",
      "event" : { "imps" : 1, "age" : "100", "wp" : 1000.0, "rows" : 1 }
    }, {
      "version" : "v1",
      "timestamp" : "2010-01-01T00:00:00.000Z",
      "event" : { "imps" : 1, "age" : "20", "wp" : 3000.0, "rows" : 1 }
    }, {
      "version" : "v1",
      "timestamp" : "2010-01-01T00:00:00.000Z",
      "event" : { "imps" : 1, "age" : "30", "wp" : 4000.0, "rows" : 1 }
    }, {
      "version" : "v1",
      "timestamp" : "2010-01-01T00:00:00.000Z",
      "event" : { "imps" : 1, "age" : "40", "wp" : 5000.0, "rows" : 1 }
    }, {
      "version" : "v1",
      "timestamp" : "2010-01-01T00:00:00.000Z",
      "event" : { "imps" : 1, "age" : "50", "wp" : 2000.0, "rows" : 1 }
    } ]
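
You can also group on more than one dimension at a time. As a sketch (assuming the same five sample records and the Broker on port 8083), a query grouping on both 'gender' and 'age' would look like this; its output is not shown here:

    # Hypothetical variation: group on both dimensions at once
    curl -X POST "http://localhost:8083/druid/v2/?pretty" \
      -H 'content-type: application/json' \
      -d '{
        "queryType": "groupBy",
        "dataSource": "druidtest",
        "granularity": "all",
        "dimensions": ["gender", "age"],
        "aggregations": [
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "imps", "fieldName": "impressions"},
            {"type": "doubleSum", "name": "wp", "fieldName": "wp"}
        ],
        "intervals": ["2010-01-01T00:00/2020-01-01T00"]
    }'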

#### filtering

Now that we've observed our dimensions, we can also filter:

    {
        "queryType": "groupBy",
        "dataSource": "druidtest",
        "granularity": "all",
        "filter": { "type": "selector", "dimension": "gender", "value": "male" },
        "aggregations": [
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "imps", "fieldName": "impressions"},
            {"type": "doubleSum", "name": "wp", "fieldName": "wp"}
        ],
        "intervals": ["2010-01-01T00:00/2020-01-01T00"]
    }

Which gets us only the 'male' rows (wp = 1000 + 3000 + 5000 = 9000 across 3 rows):

    [ {
      "version" : "v1",
      "timestamp" : "2010-01-01T00:00:00.000Z",
      "event" : { "imps" : 3, "wp" : 9000.0, "rows" : 3 }
    } ]
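
Filters also compose with dimensions. As a sketch (again assuming the Broker on port 8083), this query keeps only the 'male' rows and groups them by age; its output is not shown here:

    # Hypothetical variation: filter on gender, then group the matching rows by age
    curl -X POST "http://localhost:8083/druid/v2/?pretty" \
      -H 'content-type: application/json' \
      -d '{
        "queryType": "groupBy",
        "dataSource": "druidtest",
        "granularity": "all",
        "dimensions": ["age"],
        "filter": { "type": "selector", "dimension": "gender", "value": "male" },
        "aggregations": [
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "imps", "fieldName": "impressions"},
            {"type": "doubleSum", "name": "wp", "fieldName": "wp"}
        ],
        "intervals": ["2010-01-01T00:00/2020-01-01T00"]
    }'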

Check out Filters for more.

## Learn More

You can learn more about querying at Querying! Now check out Booting a production cluster!