mirror of https://github.com/apache/druid.git
Make more of the docs look and work correctly. Yay! Almost done with this!
This commit is contained in:
parent
86ddc7da7b
commit
ca5c941560
<html lang="en">
  <head>
    {% include site_head.html %}
    <link rel="stylesheet" href="/css/main.css">
    <link rel="stylesheet" href="/css/header.css">
    <link rel="stylesheet" href="/css/footer.css">
    <link rel="stylesheet" href="css/docs.css">
  </head>
---
layout: doc_page
---
A Druid cluster consists of various node types that need to be set up depending on your use case. See our [Design](Design.html) docs for a description of the different node types.

Setup Scripts
-------------

One of our community members, [housejester](https://github.com/housejester/), contributed some scripts to help with setting up a cluster. Check out the [github](https://github.com/housejester/druid-test-harness) and [wiki](https://github.com/housejester/druid-test-harness/wiki/Druid-Test-Harness).

Minimum Physical Layout: Absolute Minimum
-----------------------------------------

As a special case, the absolute minimum setup is one of the standalone examples for realtime ingestion and querying (see [Examples](Examples.html)), which can easily run on one machine with one core and 1GB RAM. This layout can be set up to try some basic queries with Druid.

Minimum Physical Layout: Experimental Testing with 4GB of RAM
-------------------------------------------------------------

This layout can be used to load some data from deep storage onto a Druid compute node for the first time. A minimal physical layout for a 1 or 2 core machine with 4GB of RAM is:

1. node1: [Master](Master.html) + metadata service + zookeeper + [Compute](Compute.html)
2. transient nodes: indexer

This setup is only reasonable to prove that a configuration works. It would not be worthwhile to use this layout for performance measurement.

Comfortable Physical Layout: Pilot Project with Multiple Machines
-----------------------------------------------------------------

*The machine size "flavors" use AWS/EC2 terminology for descriptive purposes only and are not meant to imply that AWS/EC2 is required or recommended. Another cloud provider or your own hardware can also work.*

A minimal physical layout, not constrained by cores, that demonstrates parallel querying and realtime ingestion, using AWS/EC2 "small"/m1.small (one core, with 1.7GB of RAM) or larger, is:

1. node1: [Master](Master.html) (m1.small)
2. node2: metadata service (m1.small)
3. node3: zookeeper (m1.small)
4. node4: [Broker](Broker.html) (m1.small or m1.medium or m1.large)
5. node5: [Compute](Compute.html) (m1.small or m1.medium or m1.large)
6. node6: [Compute](Compute.html) (m1.small or m1.medium or m1.large)
7. node7: [Realtime](Realtime.html) (m1.small or m1.medium or m1.large)
8. transient nodes: indexer

This layout naturally lends itself to adding more RAM and cores to Compute nodes, and to adding many more Compute nodes. Depending on the actual load, the Master, metadata server, and Zookeeper might need to use larger machines.

High Availability Physical Layout
---------------------------------

*The machine size "flavors" use AWS/EC2 terminology for descriptive purposes only and are not meant to imply that AWS/EC2 is required or recommended. Another cloud provider or your own hardware can also work.*

An HA layout allows full rolling restarts and heavy volume:

1. node1: [Master](Master.html) (m1.small or m1.medium or m1.large)
2. node2: [Master](Master.html) (m1.small or m1.medium or m1.large) (backup)
3. node3: metadata service (c1.medium or m1.large)
4. node4: metadata service (c1.medium or m1.large) (backup)
5. node5: zookeeper (c1.medium)
6. node6: zookeeper (c1.medium)
7. node7: zookeeper (c1.medium)
8. node8: [Broker](Broker.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
9. node9: [Broker](Broker.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge) (backup)
10. node10: [Compute](Compute.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
11. node11: [Compute](Compute.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
12. node12: [Realtime](Realtime.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
13. transient nodes: indexer

Sizing for Cores and RAM
------------------------

The Compute and Broker nodes will use as many cores as are available, depending on usage, so it is best to keep these on dedicated machines. The upper limit of effectively utilized cores is not well characterized yet and would depend on the types of queries, the query load, and the schema. Compute daemons should have a heap size of at least 1GB per core for normal usage, but could be squeezed into a smaller heap for testing. Since in-memory caching is essential for good performance, even more RAM is better. Broker nodes will use RAM for caching, so they do more than just route queries.

The effective utilization of cores by Zookeeper, MySQL, and Master nodes is likely to be between 1 and 2 for each process/daemon, so these could potentially share a machine with lots of cores. These daemons work with a heap size between 500MB and 1GB.

Storage
-------

Indexed segments should be kept in a permanent store accessible by all nodes, like AWS S3, HDFS, or equivalent. Currently Druid supports S3, but this will be extended soon.

Local disk ("ephemeral" on AWS EC2) for caching is recommended over network-mounted storage (an example of mounted storage: AWS EBS, Elastic Block Store) in order to avoid network delays during times of heavy usage. If your data center is suitably provisioned for networked storage, perhaps with separate LAN/NICs just for storage, then mounted storage might work fine.

Setup
-----

Setting up a cluster is essentially just firing up all of the nodes you want with the proper [configuration](configuration.html). One thing to be aware of is that there are a few properties in the configuration that potentially need to be set individually for each process:

```
druid.server.type=historical|realtime
druid.host=someHostOrIPaddrWithPort
druid.port=8080
```

`druid.server.type` should be set to "historical" for your compute nodes and "realtime" for the realtime nodes. The master will only assign segments to a "historical" node, and the broker has some intelligence around its ability to cache results when talking to a realtime node. This does not need to be set for the master or the broker.

`druid.host` should be set to the hostname and port that can be used to talk to the given server process. Basically, someone should be able to send a request to `http://${druid.host}/` and actually talk to the process.

`druid.port` should be set to the port that the server should listen on. In the vast majority of cases, this port should be the same as what is in `druid.host`.
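As a concrete sketch of this per-process setup (the hostnames and directory layout below are hypothetical, not prescribed by Druid), each process gets its own runtime.properties that differs only in these properties:

```shell
# Hypothetical layout: one runtime.properties per process, differing only
# in the per-process properties described above.
mkdir -p config/compute config/realtime

# A compute node is a "historical" server.
cat > config/compute/runtime.properties <<'EOF'
druid.server.type=historical
druid.host=compute1.example.com:8080
druid.port=8080
EOF

# A realtime node advertises itself as "realtime".
cat > config/realtime/runtime.properties <<'EOF'
druid.server.type=realtime
druid.host=realtime1.example.com:8080
druid.port=8080
EOF
```

Any remaining properties can be shared across all processes.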
Build/Run
---------

The simplest way to build and run from the repository is to run `mvn package` from the base directory and then take `druid-services/target/druid-services-*-selfcontained.jar` and push that around to your machines; the jar does not need to be expanded, and since it contains the main() methods for each kind of service, it is **not** invoked with `java -jar`. It can be run from a normal java command line by including it on the classpath and then giving it the main class that you want to run. For example, one instance of the Compute node/service can be started like this:

```
java -Duser.timezone=UTC -Dfile.encoding=UTF-8 -cp compute/:druid-services/target/druid-services-*-selfcontained.jar com.metamx.druid.http.ComputeMain
```

The following table shows the possible services and the fully qualified class for main().

|service|main class|
|-------|----------|
|[Realtime](Realtime.html)|com.metamx.druid.realtime.RealtimeMain|
|[Master](Master.html)|com.metamx.druid.http.MasterMain|
|[Broker](Broker.html)|com.metamx.druid.http.BrokerMain|
|[Compute](Compute.html)|com.metamx.druid.http.ComputeMain|
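The table above lends itself to a small launcher helper. This is only a sketch: the jar path is the assumed output of `mvn package`, and the helper merely prints the command shown earlier rather than running it:

```shell
# Hypothetical helper: map a service name from the table above to its main
# class and print the launch command. SERVICE is one of:
# realtime, master, broker, compute.
SERVICE=compute
JAR="druid-services/target/druid-services-*-selfcontained.jar"  # assumed build output path

case "$SERVICE" in
  realtime) MAIN=com.metamx.druid.realtime.RealtimeMain ;;
  master)   MAIN=com.metamx.druid.http.MasterMain ;;
  broker)   MAIN=com.metamx.druid.http.BrokerMain ;;
  compute)  MAIN=com.metamx.druid.http.ComputeMain ;;
esac

# Print the command; drop the echo to actually launch the service.
echo "java -Duser.timezone=UTC -Dfile.encoding=UTF-8 -cp $SERVICE/:$JAR $MAIN"
```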
The periodic time intervals (like "PT1M") are [ISO8601 intervals](http://en.

An example runtime.properties is as follows:

```
# S3 access
com.metamx.aws.accessKey=<S3 access key>
com.metamx.aws.secretKey=<S3 secret_key>

# thread pool size for servicing queries
druid.client.http.connections=30

# JDBC connection string for metadata database
druid.database.connectURI=
druid.database.user=user
druid.database.password=password
# time between polling for metadata database
druid.database.poll.duration=PT1M
druid.database.segmentTable=prod_segments

# Path on local FS for storage of segments; dir will be created if needed
druid.paths.indexCache=/tmp/druid/indexCache
# Path on local FS for storage of segment metadata; dir will be created if needed
druid.paths.segmentInfoCache=/tmp/druid/segmentInfoCache

druid.request.logging.dir=/tmp/druid/log

druid.server.maxSize=300000000000

# ZK quorum IPs
druid.zk.service.host=
# ZK path prefix for Druid-usage of zookeeper, Druid will create multiple paths underneath this znode
druid.zk.paths.base=/druid
# ZK path for discovery, the only path not to default to anything
druid.zk.paths.discoveryPath=/druid/discoveryPath

# the host:port as advertised to clients
druid.host=someHostOrIPaddrWithPort
# the port on which to listen, this port should line up with the druid.host value
druid.port=8080

com.metamx.emitter.logging=true
com.metamx.emitter.logging.level=debug

druid.processing.formatString=processing_%s
druid.processing.numThreads=3

druid.computation.buffer.size=100000000

# S3 dest for realtime indexer
druid.pusher.s3.bucket=
druid.pusher.s3.baseKey=

druid.bard.cache.sizeInBytes=40000000
druid.master.merger.service=blah_blah
```

Configuration groupings
-----------------------
Before we start querying druid, we're going to finish setting up a complete cluster.

## Booting a Broker Node ##

1. Set up a config file at config/broker/runtime.properties that looks like this:

```
druid.host=0.0.0.0:8083
druid.port=8083

druid.processing.formatString=processing_%s
druid.processing.numThreads=1
druid.processing.buffer.sizeBytes=10000000
com.metamx.emitter.logging=true

#emitting, opaque marker
druid.service=example

druid.request.logging.dir=/tmp/example/log
druid.realtime.specFile=realtime.spec
com.metamx.emitter.logging=true
com.metamx.emitter.logging.level=debug

# below are dummy values when operating a realtime only node
druid.processing.numThreads=3

com.metamx.aws.accessKey=dummy_access_key
com.metamx.aws.secretKey=dummy_secret_key
druid.pusher.s3.bucket=dummy_s3_bucket

druid.zk.service.host=localhost
druid.server.maxSize=300000000000
druid.zk.paths.base=/druid
druid.database.segmentTable=prod_segments
druid.database.user=druid
druid.database.password=diurd
druid.database.connectURI=jdbc:mysql://localhost:3306/druid
druid.zk.paths.discoveryPath=/druid/discoveryPath
druid.database.ruleTable=rules
druid.database.configTable=config

# Path on local FS for storage of segments; dir will be created if needed
druid.paths.indexCache=/tmp/druid/indexCache
# Path on local FS for storage of segment metadata; dir will be created if needed
druid.paths.segmentInfoCache=/tmp/druid/segmentInfoCache
druid.pusher.local.storageDirectory=/tmp/druid/localStorage
druid.pusher.local=true

# thread pool size for servicing queries
druid.client.http.connections=30
```

2. Run the broker node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-Ddruid.realtime.specFile=realtime.spec \
-classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/broker \
com.metamx.druid.http.BrokerMain
```
## Booting a Master Node ##

1. Set up a config file at config/master/runtime.properties that looks like this: [https://gist.github.com/rjurney/5818870](https://gist.github.com/rjurney/5818870)

2. Run the master node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/master \
com.metamx.druid.http.MasterMain
```
## Booting a Realtime Node ##

1. Set up a config file at config/realtime/runtime.properties that looks like this: [https://gist.github.com/rjurney/5818774](https://gist.github.com/rjurney/5818774)

2. Set up a realtime.spec file like this: [https://gist.github.com/rjurney/5818779](https://gist.github.com/rjurney/5818779)

3. Run the realtime node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-Ddruid.realtime.specFile=realtime.spec \
-classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/realtime \
com.metamx.druid.realtime.RealtimeMain
```
## Booting a Compute Node ##

1. Set up a config file at config/compute/runtime.properties that looks like this: [https://gist.github.com/rjurney/5818885](https://gist.github.com/rjurney/5818885)

2. Run the compute node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/compute \
com.metamx.druid.http.ComputeMain
```
# Querying Your Data #

As a shared-nothing system, there are three ways to query druid, against the [Re

### Construct a Query ###

For constructing this query, see: Querying against the realtime.spec

```json
{
    "queryType": "groupBy",
    ...
```
### Querying the Realtime Node ###

Run our query against port 8080:

```bash
curl -X POST "http://localhost:8080/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body
```

See our result:

```json
[ {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : { "imps" : 5, "wp" : 15000.0, "rows" : 5 }
} ]
```
### Querying the Compute Node ###

Run the query against port 8082:

```bash
curl -X POST "http://localhost:8082/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body
```

And get (similar to):

```json
[ {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : { "imps" : 27, "wp" : 77000.0, "rows" : 9 }
} ]
```
### Querying both Nodes via the Broker ###

Run the query against port 8083:

```bash
curl -X POST "http://localhost:8083/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body
```

And get:

```json
[ {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : { "imps" : 5, "wp" : 15000.0, "rows" : 5 }
} ]
```
How are we to know what queries we can run? Although [Querying](Querying.html) i

[{
    "schema" : { "dataSource":"druidtest",
                 "aggregators":[ {"type":"count", "name":"impressions"},
                                 {"type":"doubleSum","name":"wp","fieldName":"wp"}],
                 "indexGranularity":"minute",
                 "shardSpec" : { "type": "none" } },
    "config" : { "maxRowsInMemory" : 500000,
                 "intermediatePersistPeriod" : "PT10m" },
    "firehose" : { "type" : "kafka-0.7.2",
```json
"dataSource":"druidtest"
```

Our dataSource tells us the name of the relation/table, or 'source of data', to query in both our realtime.spec and query.body!
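Since the dataSource must agree between realtime.spec and query.body, a quick consistency check can catch typos. This is only a sketch: the two files written below are minimal stand-ins for your real ones, and the check assumes `python3` is available:

```shell
# Hypothetical stand-ins for realtime.spec and query.body; substitute your
# real files when running this against an actual setup.
printf '%s\n' '[{"schema": {"dataSource": "druidtest"}}]' > realtime.spec
printf '%s\n' '{"queryType": "groupBy", "dataSource": "druidtest"}' > query.body

# Pull the dataSource out of each file and compare.
SPEC_DS=$(python3 -c 'import json; print(json.load(open("realtime.spec"))[0]["schema"]["dataSource"])')
QUERY_DS=$(python3 -c 'import json; print(json.load(open("query.body"))["dataSource"])')

[ "$SPEC_DS" = "$QUERY_DS" ] && echo "dataSource matches: $SPEC_DS"
```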
### aggregations ###

This matches up to the aggregators in the schema of our realtime.spec!

```json
"aggregators":[ {"type":"count", "name":"impressions"},
                {"type":"doubleSum","name":"wp","fieldName":"wp"}],
```

### dimensions ###
Which gets us grouped data in return!

[ {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : { "imps" : 1, "age" : "100", "wp" : 1000.0, "rows" : 1 }
}, {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : { "imps" : 1, "age" : "20", "wp" : 3000.0, "rows" : 1 }
}, {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : { "imps" : 1, "age" : "30", "wp" : 4000.0, "rows" : 1 }
}, {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : { "imps" : 1, "age" : "40", "wp" : 5000.0, "rows" : 1 }
}, {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : { "imps" : 1, "age" : "50", "wp" : 2000.0, "rows" : 1 }
} ]
```
Now that we've observed our dimensions, we can also filter:

"queryType": "groupBy",
"dataSource": "druidtest",
"granularity": "all",
"filter": { "type": "selector", "dimension": "gender", "value": "male" },
"aggregations": [
    {"type": "count", "name": "rows"},
    {"type": "longSum", "name": "imps", "fieldName": "impressions"},
Which gets us just people aged 40:

[ {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : { "imps" : 3, "wp" : 9000.0, "rows" : 3 }
} ]
```
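To try a filtered query yourself, you can write it to query.body and post it as before. This is a sketch: the `intervals` value below is illustrative (the truncated snippet above does not show it), and the curl line assumes the broker from earlier is running on port 8083:

```shell
# Write a filtered groupBy query to query.body. The intervals value here is
# an assumed placeholder, not taken from the text above.
cat > query.body <<'EOF'
{
  "queryType": "groupBy",
  "dataSource": "druidtest",
  "granularity": "all",
  "filter": { "type": "selector", "dimension": "gender", "value": "male" },
  "aggregations": [
    {"type": "count", "name": "rows"},
    {"type": "longSum", "name": "imps", "fieldName": "impressions"}
  ],
  "intervals": ["2010-01-01T00:00/2020-01-01T00"]
}
EOF

# Sanity-check that the file is valid JSON before posting it.
python3 -m json.tool < query.body > /dev/null && echo "query.body is valid JSON"

# With a broker running on port 8083 (as above), post it with:
# curl -X POST "http://localhost:8083/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body
```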
Each event has a timestamp indicating the time of the edit (in UTC time), a list

Specifically, the data schema looks like so:

Dimensions (things to filter on):

```json
"page"
"language"
"user"
"unpatrolled"
"newPage"
"robot"
"anonymous"
"namespace"
"continent"
"country"
"region"
"city"
```
Metrics (things to aggregate over):

```json
"count"
"added"
"delta"
"deleted"
```

These metrics track the number of characters added, deleted, and changed.
Download this file to a directory of your choosing.

You can extract the awesomeness within by issuing:

```
tar -zxvf druid-services-*-bin.tar.gz
```

Not too lost so far, right? That's great! If you cd into the directory:

```
cd druid-services-0.5.54
```

You should see a bunch of files:
* run_example_server.sh
* run_example_client.sh
* LICENSE, config, examples, lib directories
Running Example Scripts
-----------------------

Let's start doing stuff. You can start a Druid [Realtime](Realtime.html) node by issuing:

```
./run_example_server.sh
```

Select "wikipedia".

Once the node starts up you will see a bunch of logs about setting up properties and connecting to the data source. If everything was successful, you should see messages of the form shown below.

```
2013-07-19 21:54:05,154 INFO [main] com.metamx.druid.realtime.RealtimeNode - Starting Jetty
2013-07-19 21:54:05,154 INFO [main] org.mortbay.log - jetty-6.1.x
2013-07-19 21:54:05,171 INFO [chief-wikipedia] com.metamx.druid.realtime.plumber.RealtimePlumberSchool - Expect to run at [2013-07-19T22:03:00.000Z]
2013-07-19 21:54:05,246 INFO [main] org.mortbay.log - Started SelectChannelConnector@0.0.0.0:8083
```
The Druid realtime node ingests events in an in-memory buffer. Periodically, these events will be persisted to disk. If you are interested in the details of our real-time architecture and why we persist indexes to disk, I suggest you read our [White Paper](http://static.druid.io/docs/druid.pdf).

Okay, things are about to get real-time. To query the real-time node you've spun up, you can issue:

```
./run_example_client.sh
```

Select "wikipedia" once again. This script issues [GroupByQuery](GroupByQuery.html)s to the data we've been ingesting. The query looks like this:

```json
{
    "queryType":"groupBy",
    "dataSource":"wikipedia",
    "granularity":"minute",
    "dimensions":[ "page" ],
    "aggregations":[
        {"type":"count", "name":"rows"},
        {"type":"longSum", "fieldName":"edit_count", "name":"count"}
    ],
    "filter":{ "type":"selector", "dimension":"namespace", "value":"article" },
    "intervals":[ "2013-06-01T00:00/2020-01-01T00" ]
}
```

This is a **groupBy** query, which you may be familiar with from SQL. We are grouping, or aggregating, via the `dimensions` field: `["page"]`. We are **filtering** via the `namespace` dimension, to only look at edits on `articles`. Our **aggregations** are what we are calculating: a count of the number of data rows, and a count of the number of edits that have occurred.
The result looks something like this:
|
||||
\`\`\`json
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
[version]() “v1”,
|
||||
[timestamp]() “2013-09-04T21:44:00.000Z”,
|
||||
[event]() {
|
||||
[count]() 0,
|
||||
[page]() “2013\\u201314\_Brentford\_F.C.*season",
|
||||
[rows]() 1
|
||||
}
|
||||
},
|
||||
{
|
||||
[version]() "v1",
|
||||
[timestamp]() "2013-09-04T21:44:00.000Z",
|
||||
[event]() {
|
||||
[count]() 0,
|
||||
[page]() "8e*00e9tape\_du\_Tour\_de\_France\_2013”,
|
||||
[rows]() 1
|
||||
}
|
||||
},
|
||||
{
|
||||
[version]() “v1”,
|
||||
[timestamp]() “2013-09-04T21:44:00.000Z”,
|
||||
[event]() {
|
||||
[count]() 0,
|
||||
[page]() “Agenda\_of\_the\_Tea\_Party\_movement”,
|
||||
[rows]() 1
|
||||
}
|
||||
},
|
||||
…
|
||||
\`\`\`
|
||||
{
|
||||
"version": "v1",
|
||||
"timestamp": "2013-09-04T21:44:00.000Z",
|
||||
"event": { "count": 0, "page": "2013\u201314_Brentford_F.C._season", "rows": 1 }
|
||||
},
|
||||
{
|
||||
"version": "v1",
|
||||
"timestamp": "2013-09-04T21:44:00.000Z",
|
||||
"event": { "count": 0, "page": "8e_\u00e9tape_du_Tour_de_France_2013", "rows": 1 }
|
||||
},
|
||||
{
|
||||
"version": "v1",
|
||||
"timestamp": "2013-09-04T21:44:00.000Z",
|
||||
"event": { "count": 0, "page": "Agenda_of_the_Tea_Party_movement", "rows": 1 }
|
||||
},
|
||||
...
|
||||
```
|
||||
|
||||
This groupBy query is a bit complicated and we’ll return to it later. For the time being, just make sure you are getting some blocks of data back. If you are having problems, make sure you have [curl](http://curl.haxx.se/) installed. Control+C to break out of the client script.
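Since Druid queries are plain JSON, they are also easy to assemble programmatically rather than hand-writing them. A minimal Python sketch that builds the same groupBy query shown above and serializes it to the body you would POST to the broker (the field values mirror the tutorial query; this is an illustration, not part of Druid itself):

```python
import json

# Assemble the tutorial's groupBy query as a plain Python dict.
query = {
    "queryType": "groupBy",
    "dataSource": "wikipedia",
    "granularity": "minute",
    "dimensions": ["page"],
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "longSum", "fieldName": "edit_count", "name": "count"},
    ],
    "filter": {"type": "selector", "dimension": "namespace", "value": "article"},
    "intervals": ["2013-06-01T00:00/2020-01-01T00"],
}

# Serialize to the JSON body that curl (or any HTTP client) can POST.
body = json.dumps(query, indent=2)
print(body)
```

The resulting string is exactly what you would save to a `.body` file for the curl commands used throughout this tutorial.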
|
||||
|
||||
h2. Querying Druid
|
||||
|
||||
In your favorite editor, create the file:
|
||||
\<pre\>time\_boundary\_query.body\</pre\>
|
||||
|
||||
```
|
||||
time_boundary_query.body
|
||||
```
|
||||
|
||||
Druid queries are JSON blobs which are relatively painless to create programmatically, but an absolute pain to write by hand. So anyway, we are going to create a Druid query by hand. Add the following to the file you just created:
|
||||
\<pre\><code>
|
||||
|
||||
```
|
||||
{
|
||||
[queryType]() “timeBoundary”,
|
||||
[dataSource]() “wikipedia”
|
||||
"queryType": "timeBoundary",
|
||||
"dataSource": "wikipedia"
|
||||
}
|
||||
</code>\</pre\>
|
||||
The ] is one of the simplest Druid queries. To run the query, you can issue:
|
||||
\<pre\><code> curl~~X POST ‘http://localhost:8083/druid/v2/?pretty’ ~~H ‘content-type: application/json’~~d ```` time_boundary_query.body</code></pre>
|
||||
```
|
||||
|
||||
The [TimeBoundaryQuery](TimeBoundaryQuery.html) is one of the simplest Druid queries. To run the query, you can issue:
|
||||
|
||||
```
|
||||
curl -X POST 'http://localhost:8083/druid/v2/?pretty' -H 'content-type: application/json' -d @time_boundary_query.body
|
||||
```
|
||||
|
||||
We get something like this JSON back:
|
||||
|
||||
|
@ -171,186 +173,146 @@ We get something like this JSON back:
|
|||
}
|
||||
} ]
|
||||
```
|
||||
|
||||
As you can probably tell, the result is indicating the maximum and minimum timestamps we've seen thus far (summarized to a minutely granularity). Let's explore a bit further.
|
||||
|
||||
Return to your favorite editor and create the file:
|
||||
<pre>timeseries_query.body</pre>
|
||||
|
||||
```
|
||||
timeseries_query.body
|
||||
```
|
||||
|
||||
We are going to make a slightly more complicated query, the [TimeseriesQuery](TimeseriesQuery.html). Copy and paste the following into the file:
|
||||
<pre><code>
|
||||
|
||||
```
|
||||
{
|
||||
"queryType": "timeseries",
|
||||
"dataSource": "wikipedia",
|
||||
"intervals": [
|
||||
"2010-01-01/2020-01-01"
|
||||
],
|
||||
"intervals": [ "2010-01-01/2020-01-01" ],
|
||||
"granularity": "all",
|
||||
"aggregations": [
|
||||
{
|
||||
"type": "longSum",
|
||||
"fieldName": "count",
|
||||
"name": "edit_count"
|
||||
},
|
||||
{
|
||||
"type": "doubleSum",
|
||||
"fieldName": "added",
|
||||
"name": "chars_added"
|
||||
}
|
||||
{"type": "longSum", "fieldName": "count", "name": "edit_count"},
|
||||
{"type": "doubleSum", "fieldName": "added", "name": "chars_added"}
|
||||
]
|
||||
}
|
||||
</code></pre>
|
||||
```
|
||||
|
||||
You are probably wondering, what are these [Granularities](Granularities.html) and [Aggregations](Aggregations.html) things? What the query is doing is aggregating some metrics over some span of time.
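Conceptually, each aggregator folds a metric column across every row that falls in the queried interval. A rough Python sketch of what `longSum` and `doubleSum` compute (the sample rows are invented for illustration; this is not Druid's implementation):

```python
# Hypothetical ingested rows; "count" and "added" are metrics from the tutorial schema.
rows = [
    {"count": 1, "added": 57.0},
    {"count": 3, "added": 459.0},
    {"count": 2, "added": 123.0},
]

# longSum and doubleSum both reduce a column with addition; they differ only in type.
edit_count = sum(row["count"] for row in rows)   # longSum over "count"
chars_added = sum(row["added"] for row in rows)  # doubleSum over "added"

result = {"edit_count": edit_count, "chars_added": chars_added}
print(result)
```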
|
||||
To issue the query and get some results, run the following in your command line:
|
||||
<pre><code>curl -X POST 'http://localhost:8083/druid/v2/?pretty' -H 'content-type: application/json' -d ````timeseries\_query.body</code>
|
||||
|
||||
</pre>
|
||||
```
|
||||
curl -X POST 'http://localhost:8083/druid/v2/?pretty' -H 'content-type: application/json' -d @timeseries_query.body
|
||||
```
|
||||
|
||||
Once again, you should get a JSON blob of text back with your results that looks something like this:
|
||||
|
||||
\`\`\`json
|
||||
```json
|
||||
[ {
|
||||
“timestamp” : “2013-09-04T21:44:00.000Z”,
|
||||
“result” : {
|
||||
“chars\_added” : 312670.0,
|
||||
“edit\_count” : 733
|
||||
}
|
||||
"timestamp" : "2013-09-04T21:44:00.000Z",
|
||||
"result" : { "chars_added" : 312670.0, "edit_count" : 733 }
|
||||
} ]
|
||||
\`\`\`
|
||||
```
|
||||
|
||||
If you issue the query again, you should notice your results updating.
|
||||
|
||||
Right now all the results you are getting back are being aggregated into a single timestamp bucket. What if we wanted to see our aggregations on a per minute basis? What field can we change in the query to accomplish this?
|
||||
|
||||
If you loudly exclaimed “we can change granularity to minute”, you are absolutely correct! We can specify different granularities to bucket our results, like so:
|
||||
If you loudly exclaimed "we can change granularity to minute", you are absolutely correct! We can specify different granularities to bucket our results, like so:
|
||||
|
||||
<code>
|
||||
{
|
||||
"queryType": "timeseries",
|
||||
"dataSource": "wikipedia",
|
||||
"intervals": [
|
||||
"2010-01-01/2020-01-01"
|
||||
],
|
||||
"granularity": "minute",
|
||||
"aggregations": [
|
||||
{
|
||||
"type": "longSum",
|
||||
"fieldName": "count",
|
||||
"name": "edit_count"
|
||||
},
|
||||
{
|
||||
"type": "doubleSum",
|
||||
"fieldName": "added",
|
||||
"name": "chars_added"
|
||||
}
|
||||
]
|
||||
}
|
||||
</code>
|
||||
```
|
||||
{
|
||||
"queryType": "timeseries",
|
||||
"dataSource": "wikipedia",
|
||||
"intervals": [ "2010-01-01/2020-01-01" ],
|
||||
"granularity": "minute",
|
||||
"aggregations": [
|
||||
{"type": "longSum", "fieldName": "count", "name": "edit_count"},
|
||||
{"type": "doubleSum", "fieldName": "added", "name": "chars_added"}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
This gives us something like the following:
|
||||
|
||||
\`\`\`json
|
||||
```json
|
||||
[
|
||||
{
|
||||
“timestamp” : “2013-09-04T21:44:00.000Z”,
|
||||
“result” : {
|
||||
“chars\_added” : 30665.0,
|
||||
“edit\_count” : 128
|
||||
}
|
||||
}, {
|
||||
“timestamp” : “2013-09-04T21:45:00.000Z”,
|
||||
“result” : {
|
||||
“chars\_added” : 122637.0,
|
||||
“edit\_count” : 167
|
||||
}
|
||||
}, {
|
||||
“timestamp” : “2013-09-04T21:46:00.000Z”,
|
||||
“result” : {
|
||||
“chars\_added” : 78938.0,
|
||||
“edit\_count” : 159
|
||||
}
|
||||
"timestamp" : "2013-09-04T21:44:00.000Z",
|
||||
"result" : { "chars_added" : 30665.0, "edit_count" : 128 }
|
||||
},
|
||||
…
|
||||
\`\`\`
|
||||
{
|
||||
"timestamp" : "2013-09-04T21:45:00.000Z",
|
||||
"result" : { "chars_added" : 122637.0, "edit_count" : 167 }
|
||||
},
|
||||
{
|
||||
"timestamp" : "2013-09-04T21:46:00.000Z",
|
||||
"result" : { "chars_added" : 78938.0, "edit_count" : 159 }
|
||||
},
|
||||
...
|
||||
]
|
||||
```
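Minute granularity can be pictured as truncating each event timestamp to the start of its minute and then aggregating within each bucket. A rough sketch of that bucketing (event data invented for illustration, not drawn from the segment):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: (ISO timestamp, chars added by that edit).
events = [
    ("2013-09-04T21:44:12Z", 100.0),
    ("2013-09-04T21:44:55Z", 250.0),
    ("2013-09-04T21:45:03Z", 40.0),
]

buckets = defaultdict(lambda: {"chars_added": 0.0, "edit_count": 0})
for ts, added in events:
    # Truncate the timestamp to the start of its minute: that is the bucket key.
    minute = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(second=0)
    key = minute.strftime("%Y-%m-%dT%H:%M:00.000Z")
    buckets[key]["chars_added"] += added
    buckets[key]["edit_count"] += 1

for timestamp in sorted(buckets):
    print({"timestamp": timestamp, "result": buckets[timestamp]})
```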
|
||||
|
||||
Solving a Problem
|
||||
-----------------
|
||||
|
||||
One of Druid’s main powers is to provide answers to problems, so let’s pose a problem. What if we wanted to know what the top pages in the US are, ordered by the number of edits over the last few minutes you’ve been going through this tutorial? To solve this problem, we have to return to the query we introduced at the very beginning of this tutorial, the [GroupByQuery](GroupByQuery.html). It would be nice if we could group by results by dimension value and somehow sort those results… and it turns out we can!
|
||||
One of Druid’s main powers is to provide answers to problems, so let’s pose a problem. What if we wanted to know what the top pages in the US are, ordered by the number of edits over the last few minutes you’ve been going through this tutorial? To solve this problem, we have to return to the query we introduced at the very beginning of this tutorial, the [GroupByQuery](GroupByQuery.html). It would be nice if we could group results by dimension value and somehow sort those results... and it turns out we can!
|
||||
|
||||
Let’s create the file:
|
||||
|
||||
group_by_query.body</pre>
|
||||
and put the following in there:
|
||||
<pre><code>
|
||||
{
|
||||
"queryType": "groupBy",
|
||||
"dataSource": "wikipedia",
|
||||
"granularity": "all",
|
||||
"dimensions": [
|
||||
"page"
|
||||
],
|
||||
"orderBy": {
|
||||
"type": "default",
|
||||
"columns": [
|
||||
{
|
||||
"dimension": "edit_count",
|
||||
"direction": "DESCENDING"
|
||||
}
|
||||
],
|
||||
"limit": 10
|
||||
},
|
||||
"aggregations": [
|
||||
{
|
||||
"type": "longSum",
|
||||
"fieldName": "count",
|
||||
"name": "edit_count"
|
||||
}
|
||||
],
|
||||
"filter": {
|
||||
"type": "selector",
|
||||
"dimension": "country",
|
||||
"value": "United States"
|
||||
},
|
||||
"intervals": [
|
||||
"2012-10-01T00:00/2020-01-01T00"
|
||||
]
|
||||
}
|
||||
</code>
|
||||
```
|
||||
group_by_query.body
|
||||
```
|
||||
|
||||
and put the following in there:
|
||||
|
||||
```
|
||||
{
|
||||
"queryType": "groupBy",
|
||||
"dataSource": "wikipedia",
|
||||
"granularity": "all",
|
||||
"dimensions": [ "page" ],
|
||||
"orderBy": {
|
||||
"type": "default",
|
||||
"columns": [ { "dimension": "edit_count", "direction": "DESCENDING" } ],
|
||||
"limit": 10
|
||||
},
|
||||
"aggregations": [
|
||||
{"type": "longSum", "fieldName": "count", "name": "edit_count"}
|
||||
],
|
||||
"filter": { "type": "selector", "dimension": "country", "value": "United States" },
|
||||
"intervals": ["2012-10-01T00:00/2020-01-01T00"]
|
||||
}
|
||||
```
|
||||
|
||||
Woah! Our query just got a lot more complicated. Now we have these [Filters](Filters.html) things and this [OrderBy](OrderBy.html) thing. Fear not, it turns out the new objects we’ve introduced to our query can help define the format of our results and provide an answer to our question.
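The `orderBy` clause plays the same role as SQL's `ORDER BY ... LIMIT`: after grouping, rows are sorted on the named column in the given direction and then truncated. Conceptually (plain Python, not Druid code; the grouped rows are invented for illustration):

```python
# Grouped rows as a groupBy query might produce them, before ordering.
grouped = [
    {"page": "RTC_Transit", "edit_count": 6},
    {"page": "User_talk:David_Biddulph", "edit_count": 4},
    {"page": "Some_other_page", "edit_count": 1},
    {"page": "List_of_Deadly_Women_episodes", "edit_count": 4},
]

# orderBy: sort on "edit_count" DESCENDING, then apply the limit.
limit = 3
top = sorted(grouped, key=lambda row: row["edit_count"], reverse=True)[:limit]
print(top)
```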
|
||||
|
||||
If you issue the query:
|
||||
|
||||
<code>curl -X POST 'http://localhost:8083/druid/v2/?pretty' -H 'content-type: application/json' -d @group_by_query.body</code>
|
||||
```
|
||||
curl -X POST 'http://localhost:8083/druid/v2/?pretty' -H 'content-type: application/json' -d @group_by_query.body
|
||||
```
|
||||
|
||||
You should see an answer to our question. As an example, some results are shown below:
|
||||
|
||||
\`\`\`json
|
||||
```json
|
||||
[
|
||||
{
|
||||
“version” : “v1”,
|
||||
“timestamp” : “2012-10-01T00:00:00.000Z”,
|
||||
“event” : {
|
||||
“page” : “RTC\_Transit”,
|
||||
“edit\_count” : 6
|
||||
}
|
||||
}, {
|
||||
“version” : “v1”,
|
||||
“timestamp” : “2012-10-01T00:00:00.000Z”,
|
||||
“event” : {
|
||||
“page” : “List\_of\_Deadly\_Women\_episodes”,
|
||||
“edit\_count” : 4
|
||||
}
|
||||
}, {
|
||||
“version” : “v1”,
|
||||
“timestamp” : “2012-10-01T00:00:00.000Z”,
|
||||
“event” : {
|
||||
“page” : “User\_talk:David\_Biddulph”,
|
||||
“edit\_count” : 4
|
||||
}
|
||||
"version" : "v1",
|
||||
"timestamp" : "2012-10-01T00:00:00.000Z",
|
||||
"event" : { "page" : "RTC_Transit", "edit_count" : 6 }
|
||||
},
|
||||
…
|
||||
\`\`\`
|
||||
{
|
||||
"version" : "v1",
|
||||
"timestamp" : "2012-10-01T00:00:00.000Z",
|
||||
"event" : { "page" : "List_of_Deadly_Women_episodes", "edit_count" : 4 }
|
||||
},
|
||||
{
|
||||
"version" : "v1",
|
||||
"timestamp" : "2012-10-01T00:00:00.000Z",
|
||||
"event" : { "page" : "User_talk:David_Biddulph", "edit_count" : 4 }
|
||||
},
|
||||
...
|
||||
```
|
||||
|
||||
Feel free to tweak other query parameters to answer other questions you may have about the data.
|
||||
|
||||
|
|
|
@ -14,6 +14,7 @@ If you followed the first tutorial, you should already have Druid downloaded. If
|
|||
You can download the latest version of druid [here](http://static.druid.io/artifacts/releases/druid-services-0.5.54-bin.tar.gz)
|
||||
|
||||
and untar the contents within by issuing:
|
||||
|
||||
```bash
|
||||
tar -zxvf druid-services-*-bin.tar.gz
|
||||
cd druid-services-*
|
||||
|
@ -32,15 +33,18 @@ For deep storage, we have made a public S3 bucket (static.druid.io) available wh
|
|||
1. If you don't already have it, download MySQL Community Server here: [http://dev.mysql.com/downloads/mysql/](http://dev.mysql.com/downloads/mysql/)
|
||||
2. Install MySQL
|
||||
3. Create a druid user and database
|
||||
|
||||
```bash
|
||||
mysql -u root
|
||||
```
|
||||
|
||||
```sql
|
||||
GRANT ALL ON druid.* TO 'druid'@'localhost' IDENTIFIED BY 'diurd';
|
||||
CREATE database druid;
|
||||
```
|
||||
|
||||
### Setting up Zookeeper ###
|
||||
|
||||
```bash
|
||||
curl http://www.motorlogy.com/apache/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz -o zookeeper-3.4.5.tar.gz
|
||||
tar xzf zookeeper-3.4.5.tar.gz
|
||||
|
@ -55,6 +59,7 @@ cd ..
|
|||
Similar to the first tutorial, the data we will be loading is based on edits that have occurred on Wikipedia. Every time someone edits a page in Wikipedia, metadata is generated about the editor and edited page. Druid collects each individual event and packages them together in a container known as a [segment](https://github.com/metamx/druid/wiki/Segments). Segments contain data over some span of time. We've prebuilt a segment for this tutorial and will cover making your own segments in other [pages](https://github.com/metamx/druid/wiki/Loading-Your-Data). The segment we are going to work with has the following format:
|
||||
|
||||
Dimensions (things to filter on):
|
||||
|
||||
```json
|
||||
"page"
|
||||
"language"
|
||||
|
@ -71,6 +76,7 @@ Dimensions (things to filter on):
|
|||
```
|
||||
|
||||
Metrics (things to aggregate over):
|
||||
|
||||
```json
|
||||
"count"
|
||||
"added"
|
||||
|
@ -98,7 +104,7 @@ To create the master config file:
|
|||
mkdir config/master
|
||||
```
|
||||
|
||||
Under the directory we just created, create the file ```runtime.properties``` with the following contents:
|
||||
Under the directory we just created, create the file `runtime.properties` with the following contents:
|
||||
|
||||
```
|
||||
druid.host=127.0.0.1:8082
|
||||
|
@ -146,7 +152,8 @@ To create the compute config file:
|
|||
mkdir config/compute
|
||||
```
|
||||
|
||||
Under the directory we just created, create the file ```runtime.properties``` with the following contents:
|
||||
Under the directory we just created, create the file `runtime.properties` with the following contents:
|
||||
|
||||
```
|
||||
druid.host=127.0.0.1:8081
|
||||
druid.port=8081
|
||||
|
@ -219,67 +226,17 @@ To start the broker node:
|
|||
```bash
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/broker com.metamx.druid.http.BrokerMain
|
||||
```
|
||||
<!--
|
||||
### Optional: Start a Realtime Node ###
|
||||
```
|
||||
druid.host=127.0.0.1:8083
|
||||
druid.port=8083
|
||||
druid.service=realtime
|
||||
|
||||
# logging
|
||||
com.metamx.emitter.logging=true
|
||||
com.metamx.emitter.logging.level=info
|
||||
|
||||
# zk
|
||||
druid.zk.service.host=localhost
|
||||
druid.zk.paths.base=/druid
|
||||
druid.zk.paths.discoveryPath=/druid/discoveryPath
|
||||
|
||||
# processing
|
||||
druid.processing.buffer.sizeBytes=10000000
|
||||
|
||||
# schema
|
||||
druid.realtime.specFile=realtime.spec
|
||||
|
||||
# aws
|
||||
com.metamx.aws.accessKey=dummy_access_key
|
||||
com.metamx.aws.secretKey=dummy_secret_key
|
||||
|
||||
# db
|
||||
druid.database.segmentTable=segments
|
||||
druid.database.user=druid
|
||||
druid.database.password=diurd
|
||||
druid.database.connectURI=jdbc:mysql://localhost:3306/druid
|
||||
druid.database.ruleTable=rules
|
||||
druid.database.configTable=config
|
||||
|
||||
# Path on local FS for storage of segments; dir will be created if needed
|
||||
druid.paths.indexCache=/tmp/druid/indexCache
|
||||
|
||||
# handoff
|
||||
druid.pusher.s3.bucket=dummy_s3_bucket
|
||||
druid.pusher.s3.baseKey=dummy_key
|
||||
```
|
||||
|
||||
To start the realtime node:
|
||||
|
||||
```bash
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath services/target/druid-services-*-selfcontained.jar:config/realtime com.metamx.druid.realtime.RealtimeMain
|
||||
```
|
||||
-->
|
||||
## Loading the Data ##
|
||||
|
||||
The MySQL dependency we introduced earlier on contains a 'segments' table that contains entries for segments that should be loaded into our cluster. The Druid master compares this table with segments that already exist in the cluster to determine what should be loaded and dropped. To load our wikipedia segment, we need to create an entry in our MySQL segment table.
|
||||
|
||||
Usually, when new segments are created, these MySQL entries are created directly so you never have to do this by hand. For this tutorial, we can do this manually by going back into MySQL and issuing:
|
||||
|
||||
```
|
||||
``` sql
|
||||
use druid;
|
||||
INSERT INTO segments (id, dataSource, created_date, start, end, partitioned, version, used, payload) VALUES ('wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z', 'wikipedia', '2013-08-08T21:26:23.799Z', '2013-08-01T00:00:00.000Z', '2013-08-02T00:00:00.000Z', '0', '2013-08-08T21:22:48.989Z', '1', '{\"dataSource\":\"wikipedia\",\"interval\":\"2013-08-01T00:00:00.000Z/2013-08-02T00:00:00.000Z\",\"version\":\"2013-08-08T21:22:48.989Z\",\"loadSpec\":{\"type\":\"s3_zip\",\"bucket\":\"static.druid.io\",\"key\":\"data/segments/wikipedia/20130801T000000.000Z_20130802T000000.000Z/2013-08-08T21_22_48.989Z/0/index.zip\"},\"dimensions\":\"dma_code,continent_code,geo,area_code,robot,country_name,network,city,namespace,anonymous,unpatrolled,page,postal_code,language,newpage,user,region_lookup\",\"metrics\":\"count,delta,variation,added,deleted\",\"shardSpec\":{\"type\":\"none\"},\"binaryVersion\":9,\"size\":24664730,\"identifier\":\"wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z\"}');
```
|
||||
|
||||
If you look in your master node logs, you should, after a maximum of a minute or so, see logs of the following form:
|
||||
|
||||
|
@ -294,7 +251,7 @@ When the segment completes downloading and ready for queries, you should see the
|
|||
2013-08-08 22:48:41,959 INFO [ZkCoordinator-0] com.metamx.druid.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z] at path[/druid/segments/127.0.0.1:8081/2013-08-08T22:48:41.959Z]
|
||||
```
|
||||
|
||||
At this point, we can query the segment. For more information on querying, see this[link](https://github.com/metamx/druid/wiki/Querying).
|
||||
At this point, we can query the segment. For more information on querying, see this [link](https://github.com/metamx/druid/wiki/Querying).
|
||||
|
||||
## Next Steps ##
|
||||
|
||||
|
|
|
@ -12,3 +12,40 @@
|
|||
.doc-content img {
|
||||
max-width: 847.5px;
|
||||
}
|
||||
|
||||
.doc-content code {
|
||||
background-color: #e0e0e0;
|
||||
}
|
||||
|
||||
.doc-content pre code {
|
||||
background-color: transparent;
|
||||
}
|
||||
|
||||
.doc-content table,
|
||||
.doc-content table > thead > tr > th,
|
||||
.doc-content table > tbody > tr > th,
|
||||
.doc-content table > tfoot > tr > th,
|
||||
.doc-content table > thead > tr > td,
|
||||
.doc-content table > tbody > tr > td,
|
||||
.doc-content table > tfoot > tr > td {
|
||||
border: 1px solid #dddddd;
|
||||
}
|
||||
|
||||
.doc-content table > thead > tr > th,
|
||||
.doc-content table > thead > tr > td {
|
||||
border-bottom-width: 2px;
|
||||
}
|
||||
|
||||
.doc-content table > tbody > tr:nth-child(odd) > td,
|
||||
.doc-content table > tbody > tr:nth-child(odd) > th {
|
||||
background-color: #f9f9f9;
|
||||
}
|
||||
|
||||
.doc-content table > tbody > tr:hover > td,
|
||||
.doc-content table > tbody > tr:hover > th {
|
||||
background-color: #d5d5d5;
|
||||
}
|
||||
|
||||
.doc-content table code {
|
||||
background-color: transparent;
|
||||
}
|
|
@ -12,11 +12,10 @@ h1. Contents
|
|||
h2. Getting Started
|
||||
* "Tutorial: A First Look at Druid":./Tutorial:-A-First-Look-at-Druid.html
|
||||
* "Tutorial: The Druid Cluster":./Tutorial:-The-Druid-Cluster.html
|
||||
* "Loading Your Data":./Tutorial:-Webstream.html
|
||||
* "Loading Your Data":./Loading-Your-Data.html
|
||||
* "Querying Your Data":./Querying-your-data.html
|
||||
* "Booting a Production Cluster":./Booting-a-production-cluster.html
|
||||
* "Examples":./Examples.html
|
||||
* "Cluster Setup":Cluster-setup.html
|
||||
* "Configuration":Configuration.html
|
||||
|
||||
h2. Data Ingestion
|
||||
|
|
|
@ -1,38 +0,0 @@
|
|||
footer {
|
||||
font-size: 14px;
|
||||
color: #000000;
|
||||
font-weight: 300;
|
||||
}
|
||||
|
||||
footer strong {
|
||||
display: block;
|
||||
font-weight: 400;
|
||||
}
|
||||
|
||||
footer a {
|
||||
color: #000000;
|
||||
}
|
||||
|
||||
footer address {
|
||||
margin: 0 0 30px 30px;
|
||||
}
|
||||
|
||||
footer .soc {
|
||||
text-align:left;
|
||||
margin:5px 0 0 0;
|
||||
}
|
||||
footer .soc a {
|
||||
display:inline-block;
|
||||
width:35px;
|
||||
height:34px;
|
||||
background:url(../img/icons-soc.png) no-repeat;
|
||||
}
|
||||
footer .soc a.github {
|
||||
background-position:0 -34px;
|
||||
}
|
||||
footer .soc a.meet {
|
||||
background-position:0 -68px;
|
||||
}
|
||||
footer .soc a.rss {
|
||||
background-position:0 -102px;
|
||||
}
|
|
@ -1,21 +0,0 @@
|
|||
.navbar {
|
||||
max-width: 1170px;
|
||||
margin: 10px auto 25px;
|
||||
background-color: #eeeeee;
|
||||
border-bottom-width: 0;
|
||||
font-size: 18px;
|
||||
line-height: 20px;
|
||||
font-weight: 300;
|
||||
}
|
||||
|
||||
.container.druid-navbar {
|
||||
background-color: #171717;
|
||||
-webkit-border-radius: 4px;
|
||||
-moz-border-radius: 4px;
|
||||
border-radius: 4px;
|
||||
-webkit-box-shadow: 0 1px 4px rgba(0, 0, 0, 0.065);
|
||||
-moz-box-shadow: 0 1px 4px rgba(0, 0, 0, 0.065);
|
||||
box-shadow: 0 1px 4px rgba(0, 0, 0, 0.065);
|
||||
|
||||
}
|
||||
|
|
@ -1,57 +0,0 @@
|
|||
@font-face {
|
||||
font-family: 'Conv_framd';
|
||||
src: url('../fonts/framd.eot');
|
||||
src: url('../fonts/framd.eot?#iefix') format('embedded-opentype'),
|
||||
url('../fonts/framd.woff') format('woff'),
|
||||
url('../fonts/framd.ttf') format('truetype'),
|
||||
url('../fonts/framd.svg#heroregular') format('svg');
|
||||
font-weight: normal;
|
||||
font-style: normal;
|
||||
}
|
||||
|
||||
html, body {
|
||||
position:relative;
|
||||
height:100%;
|
||||
min-height:100%;
|
||||
height:100%;
|
||||
color:#252525;
|
||||
font:400 18px/26px 'Open Sans', Arial, Helvetica, sans-serif;
|
||||
margin:0;
|
||||
word-wrap:break-word;
|
||||
}
|
||||
|
||||
h1, h2, h3, h4, h5, h6, .h1, .h2, .h3, .h4, .h5, .h6 {
|
||||
font-family: 'Open Sans', Arial, Helvetica, sans-serif;
|
||||
font-weight: 300;
|
||||
}
|
||||
|
||||
h3, .h3 {
|
||||
font-size: 30px;
|
||||
font-weight: 300;
|
||||
}
|
||||
|
||||
.text-indent {
|
||||
padding-left: 50px;
|
||||
}
|
||||
|
||||
.text-indent-2 {
|
||||
padding-left: 100px;
|
||||
}
|
||||
|
||||
.text-indent-p p {
|
||||
padding-left: 50px;
|
||||
}
|
||||
|
||||
code {
|
||||
color: inherit;
|
||||
background-color: transparent;
|
||||
}
|
||||
|
||||
.page-header {
|
||||
margin-bottom: 40px;
|
||||
text-align: center;
|
||||
}
|
||||
|
||||
.easter-egg {
|
||||
color: transparent;
|
||||
}
|