mirror of https://github.com/apache/druid.git
fix docs for 0.6 part 1 of many
parent
de71c14114
commit
af1dbe6eab
2
build.sh
|
@@ -30,4 +30,4 @@ echo "For examples, see: "
|
|||
echo " "
|
||||
ls -1 examples/*/*sh
|
||||
echo " "
|
||||
echo "See also https://github.com/metamx/druid/wiki"
|
||||
echo "See also http://druid.io/docs/0.6.0/Home.html"
|
||||
|
|
|
@@ -1,24 +0,0 @@
|
|||
---
|
||||
layout: post
|
||||
title: "Welcome to Jekyll!"
|
||||
date: 2013-09-16 13:06:49
|
||||
categories: jekyll update
|
||||
---
|
||||
|
||||
You'll find this post in your `_posts` directory - edit this post and re-build (or run with the `-w` switch) to see your changes!
|
||||
To add new posts, simply add a file in the `_posts` directory that follows the convention: YYYY-MM-DD-name-of-post.ext.
|
||||
|
||||
Jekyll also offers powerful support for code snippets:
|
||||
|
||||
{% highlight ruby %}
|
||||
def print_hi(name)
|
||||
puts "Hi, #{name}"
|
||||
end
|
||||
print_hi('Tom')
|
||||
#=> prints 'Hi, Tom' to STDOUT.
|
||||
{% endhighlight %}
|
||||
|
||||
Check out the [Jekyll docs][jekyll] for more info on how to get the most out of Jekyll. File all bugs/feature requests at [Jekyll's GitHub repo][jekyll-gh].
|
||||
|
||||
[jekyll-gh]: https://github.com/mojombo/jekyll
|
||||
[jekyll]: http://jekyllrb.com
|
|
@@ -4,22 +4,22 @@ layout: doc_page
|
|||
Batch Data Ingestion
|
||||
====================
|
||||
|
||||
There are two choices for batch data ingestion to your Druid cluster, you can use the [Indexing service](Indexing-service.html) or you can use the `HadoopDruidIndexerMain`. This page describes how to use the `HadoopDruidIndexerMain`.
|
||||
There are two choices for batch data ingestion to your Druid cluster: you can use the [Indexing service](Indexing-service.html) or you can use the `HadoopDruidIndexer`. This page describes how to use the `HadoopDruidIndexer`.
|
||||
|
||||
Which should I use?
|
||||
-------------------
|
||||
|
||||
The [Indexing service](Indexing-service.html) is a node that can run as part of your Druid cluster and can accomplish a number of different types of indexing tasks. Even if all you care about is batch indexing, it provides for the encapsulation of things like the Database that is used for segment metadata and other things, so that your indexing tasks do not need to include such information. Long-term, the indexing service is going to be the preferred method of ingesting data.
|
||||
|
||||
The `HadoopDruidIndexerMain` runs hadoop jobs in order to separate and index data segments. It takes advantage of Hadoop as a job scheduling and distributed job execution platform. It is a simple method if you already have Hadoop running and don’t want to spend the time configuring and deploying the [Indexing service](Indexing service.html) just yet.
|
||||
The `HadoopDruidIndexer` runs Hadoop jobs in order to separate and index data segments. It takes advantage of Hadoop as a job scheduling and distributed job execution platform. It is a simple method if you already have Hadoop running and don’t want to spend the time configuring and deploying the [Indexing service](Indexing-service.html) just yet.
|
||||
|
||||
HadoopDruidIndexer
|
||||
------------------
|
||||
|
||||
Located at `com.metamx.druid.indexer.HadoopDruidIndexerMain` can be run like
|
||||
The HadoopDruidIndexer can be run like so:
|
||||
|
||||
```
|
||||
java -cp hadoop_config_path:druid_indexer_selfcontained_jar_path com.metamx.druid.indexer.HadoopDruidIndexerMain <config_file>
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath hadoop_config_path:`echo lib/* | tr ' ' ':'` io.druid.cli.Main index hadoop <config_file>
|
||||
```
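A minimal sketch of what the `` `echo lib/* | tr ' ' ':'` `` substitution in the new command does, assuming a `lib` directory of jars whose paths contain no spaces. The `build_classpath` helper name is hypothetical and is not part of Druid.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Druid): join every entry of a directory
# into a colon-separated Java classpath.
build_classpath() {
  local dir="$1"
  # The glob expands to space-separated paths; tr turns each space into ':'.
  # Paths that themselves contain spaces would be split incorrectly.
  echo "$dir"/* | tr ' ' ':'
}
```

With this helper, the classpath argument could be written as `"hadoop_config_path:$(build_classpath lib)"`. Expanding the glob in the shell avoids relying on the JVM's own handling of a `lib/*` classpath wildcard.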
|
||||
|
||||
The interval is the [ISO8601 interval](http://en.wikipedia.org/wiki/ISO_8601#Time_intervals) of the data you are processing. The config\_file is a path to a file (the "specFile") that contains JSON and an example looks like:
|
||||
|
|
|
@@ -3,7 +3,7 @@ layout: doc_page
|
|||
---
|
||||
# Booting a Single Node Cluster #
|
||||
|
||||
[Loading Your Data](Loading-Your-Data.html) and [Querying Your Data](Querying-Your-Data.html) contain recipes to boot a small druid cluster on localhost. Here we will boot a small cluster on EC2. You can checkout the code, or download a tarball from [here](http://static.druid.io/artifacts/druid-services-0.5.51-SNAPSHOT-bin.tar.gz).
|
||||
[Loading Your Data](Loading-Your-Data.html) and [Querying Your Data](Querying-Your-Data.html) contain recipes to boot a small druid cluster on localhost. Here we will boot a small cluster on EC2. You can checkout the code, or download a tarball from [here](http://static.druid.io/artifacts/druid-services-0.6.0-bin.tar.gz).
|
||||
|
||||
The [ec2 run script](https://github.com/metamx/druid/blob/master/examples/bin/run_ec2.sh), run_ec2.sh, is located at 'examples/bin' if you have checked out the code, or at the root of the project if you've downloaded a tarball. The scripts rely on the [Amazon EC2 API Tools](http://aws.amazon.com/developertools/351), and you will need to set three environment variables:
|
||||
|
||||
|
|
|
@@ -104,7 +104,7 @@ The coordinator node exposes several HTTP endpoints for interactions.
|
|||
The Coordinator Console
|
||||
------------------
|
||||
|
||||
The Druid coordinator exposes a web GUI for displaying cluster information and rule configuration. After the coordinator starts, the console can be accessed at http://HOST:PORT/static/. There exists a full cluster view, as well as views for individual historical nodes, datasources and segments themselves. Segment information can be displayed in raw JSON form or as part of a sortable and filterable table.
|
||||
The Druid coordinator exposes a web GUI for displaying cluster information and rule configuration. After the coordinator starts, the console can be accessed at http://<HOST>:<PORT>. There exists a full cluster view, as well as views for individual historical nodes, datasources and segments themselves. Segment information can be displayed in raw JSON form or as part of a sortable and filterable table.
|
||||
|
||||
The coordinator console also exposes an interface to creating and editing rules. All valid datasources configured in the segment database, along with a default datasource, are available for configuration. Rules of different types can be added, deleted or edited.
|
||||
|
||||
|
|
|
@@ -6,7 +6,7 @@ A version may be declared as a release candidate if it has been deployed to a si
|
|||
Release Candidate
|
||||
-----------------
|
||||
|
||||
There is no release candidate at this time.
|
||||
The current release candidate is tagged at version [0.6.0](https://github.com/metamx/druid/tree/druid-0.6.0).
|
||||
|
||||
Stable Release
|
||||
--------------
|
||||
|
|
|
@@ -4,7 +4,7 @@ layout: doc_page
|
|||
Examples
|
||||
========
|
||||
|
||||
The examples on this page are setup in order to give you a feel for what Druid does in practice. They are quick demos of Druid based on [RealtimeStandaloneMain](https://github.com/metamx/druid/blob/master/examples/src/main/java/druid/examples/RealtimeStandaloneMain.java). While you wouldn’t run it this way in production you should be able to see how ingestion works and the kind of exploratory queries that are possible. Everything that can be done on your box here can be scaled out to 10’s of billions of events and terabytes of data per day in a production cluster while still giving the snappy responsive exploratory queries.
|
||||
The examples on this page are set up to give you a feel for what Druid does in practice. They are quick demos of Druid based on [CliRealtimeExample](https://github.com/metamx/druid/blob/master/services/src/main/java/io/druid/cli/CliRealtimeExample.java). While you wouldn’t run it this way in production, you should be able to see how ingestion works and the kind of exploratory queries that are possible. Everything that can be done on your box here can be scaled out to tens of billions of events and terabytes of data per day in a production cluster while still delivering snappy, responsive exploratory queries.
|
||||
|
||||
Installing Standalone Druid
|
||||
---------------------------
|
||||
|
@@ -19,7 +19,7 @@ Clone Druid and build it:
|
|||
git clone https://github.com/metamx/druid.git druid
|
||||
cd druid
|
||||
git fetch --tags
|
||||
git checkout druid-0.4.30
|
||||
git checkout druid-0.6.0
|
||||
./build.sh
|
||||
```
|
||||
|
||||
|
@@ -49,7 +49,7 @@ This Example uses a feature of Twitter that allows for sampling of it’s stream
|
|||
|
||||
### What you’ll do
|
||||
|
||||
See [Tutorial](Tutorial.html)
|
||||
See [Twitter Tutorial](Twitter-Tutorial.html)
|
||||
|
||||
Rand Example
|
||||
------------
|
||||
|
@@ -68,5 +68,4 @@ In another terminal window:
|
|||
./run_example_client.sh # type rand when prompted
|
||||
```
|
||||
|
||||
|
||||
The result of the client query is in JSON format. The client makes a REST request using the program `curl` which is usually installed on Linux, Unix, and OSX by default.
|
||||
|
|
|
@@ -31,7 +31,7 @@ For every query that a historical node services, it will log the query and repor
|
|||
|
||||
Running
|
||||
-------
|
||||
|
||||
p
|
||||
Historical nodes can be run using the `io.druid.cli.Main` class with program arguments "server historical".
|
||||
|
||||
Configuration
|
||||
|
|
|
@@ -23,3 +23,5 @@ Some great folks have written their own libraries to interact with Druid
|
|||
|
||||
* [madvertise/druid-dumbo](https://github.com/madvertise/druid-dumbo) - Scripts to help generate batch configs for the ingestion of data into Druid
|
||||
* [housejester/druid-test-harness](https://github.com/housejester/druid-test-harness) - A set of scripts to simplify standing up some servers and seeing how things work
|
||||
* [mingfang/docker-druid](https://github.com/mingfang/docker-druid) - A Dockerfile to run the entire Druid cluster
|
||||
|
||||
|
|
|
@@ -1,12 +1,12 @@
|
|||
---
|
||||
layout: doc_page
|
||||
---
|
||||
Once you have a realtime node working, it is time to load your own data to see how Druid performs.
|
||||
Once you have a real-time node working, it is time to load your own data to see how Druid performs.
|
||||
|
||||
Druid can ingest data in three ways: via Kafka and a realtime node, via the indexing service, and via the Hadoop batch loader. Data is ingested in realtime using a [Firehose](Firehose.html).
|
||||
Druid can ingest data in three ways: via Kafka and a realtime node, via the indexing service, and via the Hadoop batch loader. Data is ingested in real-time using a [Firehose](Firehose.html).
|
||||
|
||||
## Create Config Directories ##
|
||||
Each type of node needs its own config file and directory, so create them as subdirectories under the druid directory.
|
||||
Each type of node needs its own config file and directory, so create them as subdirectories under the druid directory if they do not already exist.
|
||||
|
||||
```bash
|
||||
mkdir config
|
||||
|
@@ -18,7 +18,7 @@ mkdir config/broker
|
|||
|
||||
## Loading Data with Kafka ##
|
||||
|
||||
[KafkaFirehoseFactory](https://github.com/metamx/druid/blob/druid-0.5.x/realtime/src/main/java/com/metamx/druid/realtime/firehose/KafkaFirehoseFactory.java) is how druid communicates with Kafka. Using this [Firehose](Firehose.html) with the right configuration, we can import data into Druid in realtime without writing any code. To load data to a realtime node via Kafka, we'll first need to initialize Zookeeper and Kafka, and then configure and initialize a [Realtime](Realtime.html) node.
|
||||
[KafkaFirehoseFactory](https://github.com/metamx/druid/blob/druid-0.6.0/realtime/src/main/java/com/metamx/druid/realtime/firehose/KafkaFirehoseFactory.java) is how druid communicates with Kafka. Using this [Firehose](Firehose.html) with the right configuration, we can import data into Druid in realtime without writing any code. To load data to a realtime node via Kafka, we'll first need to initialize Zookeeper and Kafka, and then configure and initialize a [Realtime](Realtime.html) node.
|
||||
|
||||
### Booting Kafka ###
|
||||
|
||||
|
@@ -59,71 +59,90 @@ Instructions for booting a Zookeeper and then Kafka cluster are available [here]
|
|||
1. Create a valid configuration file similar to this called config/realtime/runtime.properties:
|
||||
|
||||
```properties
|
||||
druid.host=0.0.0.0:8080
|
||||
druid.host=localhost
|
||||
druid.service=example
|
||||
druid.port=8080
|
||||
|
||||
com.metamx.emitter.logging=true
|
||||
druid.zk.service.host=localhost
|
||||
|
||||
druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ
|
||||
druid.s3.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
|
||||
|
||||
druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
|
||||
druid.db.connector.user=druid
|
||||
druid.db.connector.password=diurd
|
||||
|
||||
druid.realtime.specFile=config/realtime/realtime.spec
|
||||
|
||||
druid.processing.formatString=processing_%s
|
||||
druid.processing.numThreads=1
|
||||
druid.processing.buffer.sizeBytes=10000000
|
||||
|
||||
#emitting, opaque marker
|
||||
druid.service=example
|
||||
|
||||
druid.request.logging.dir=/tmp/example/log
|
||||
druid.realtime.specFile=realtime.spec
|
||||
com.metamx.emitter.logging=true
|
||||
com.metamx.emitter.logging.level=debug
|
||||
|
||||
# below are dummy values when operating a realtime only node
|
||||
druid.processing.numThreads=3
|
||||
|
||||
com.metamx.aws.accessKey=dummy_access_key
|
||||
com.metamx.aws.secretKey=dummy_secret_key
|
||||
druid.storage.s3.bucket=dummy_s3_bucket
|
||||
|
||||
druid.zk.service.host=localhost
|
||||
druid.server.maxSize=300000000000
|
||||
druid.zk.paths.base=/druid
|
||||
druid.database.segmentTable=prod_segments
|
||||
druid.database.user=user
|
||||
druid.database.password=diurd
|
||||
druid.database.connectURI=
|
||||
druid.host=127.0.0.1:8080
|
||||
```
|
||||
|
||||
2. Create a valid realtime configuration file similar to this called realtime.spec:
|
||||
|
||||
```json
|
||||
[{
|
||||
"schema" : { "dataSource":"druidtest",
|
||||
"aggregators":[ {"type":"count", "name":"impressions"},
|
||||
{"type":"doubleSum","name":"wp","fieldName":"wp"}],
|
||||
"indexGranularity":"minute",
|
||||
"shardSpec" : { "type": "none" } },
|
||||
"config" : { "maxRowsInMemory" : 500000,
|
||||
"intermediatePersistPeriod" : "PT10m" },
|
||||
"firehose" : { "type" : "kafka-0.7.2",
|
||||
"consumerProps" : { "zk.connect" : "localhost:2181",
|
||||
"zk.connectiontimeout.ms" : "15000",
|
||||
"zk.sessiontimeout.ms" : "15000",
|
||||
"zk.synctime.ms" : "5000",
|
||||
"groupid" : "topic-pixel-local",
|
||||
"fetch.size" : "1048586",
|
||||
"autooffset.reset" : "largest",
|
||||
"autocommit.enable" : "false" },
|
||||
"feed" : "druidtest",
|
||||
"parser" : { "timestampSpec" : { "column" : "utcdt", "format" : "iso" },
|
||||
"data" : { "format" : "json" },
|
||||
"dimensionExclusions" : ["wp"] } },
|
||||
"plumber" : { "type" : "realtime",
|
||||
"windowPeriod" : "PT10m",
|
||||
"segmentGranularity":"hour",
|
||||
"basePersistDirectory" : "/tmp/realtime/basePersist",
|
||||
"rejectionPolicy": {"type": "messageTime"} }
|
||||
|
||||
}]
|
||||
[
|
||||
{
|
||||
"schema": {
|
||||
"dataSource": "druidtest",
|
||||
"aggregators": [
|
||||
{
|
||||
"type": "count",
|
||||
"name": "impressions"
|
||||
},
|
||||
{
|
||||
"type": "doubleSum",
|
||||
"name": "wp",
|
||||
"fieldName": "wp"
|
||||
}
|
||||
],
|
||||
"indexGranularity": "minute",
|
||||
"shardSpec": {
|
||||
"type": "none"
|
||||
}
|
||||
},
|
||||
"config": {
|
||||
"maxRowsInMemory": 500000,
|
||||
"intermediatePersistPeriod": "PT10m"
|
||||
},
|
||||
"firehose": {
|
||||
"type": "kafka-0.7.2",
|
||||
"consumerProps": {
|
||||
"zk.connect": "localhost:2181",
|
||||
"zk.connectiontimeout.ms": "15000",
|
||||
"zk.sessiontimeout.ms": "15000",
|
||||
"zk.synctime.ms": "5000",
|
||||
"groupid": "topic-pixel-local",
|
||||
"fetch.size": "1048586",
|
||||
"autooffset.reset": "largest",
|
||||
"autocommit.enable": "false"
|
||||
},
|
||||
"feed": "druidtest",
|
||||
"parser": {
|
||||
"timestampSpec": {
|
||||
"column": "utcdt",
|
||||
"format": "iso"
|
||||
},
|
||||
"data": {
|
||||
"format": "json"
|
||||
},
|
||||
"dimensionExclusions": [
|
||||
"wp"
|
||||
]
|
||||
}
|
||||
},
|
||||
"plumber": {
|
||||
"type": "realtime",
|
||||
"windowPeriod": "PT10m",
|
||||
"segmentGranularity": "hour",
|
||||
"basePersistDirectory": "\/tmp\/realtime\/basePersist",
|
||||
"rejectionPolicy": {
|
||||
"type": "messageTime"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
3. Launch the realtime node
|
||||
|
@@ -131,7 +150,7 @@ Instructions for booting a Zookeeper and then Kafka cluster are available [here]
|
|||
```bash
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
|
||||
-Ddruid.realtime.specFile=config/realtime/realtime.spec \
|
||||
-classpath lib/*:config/realtime com.metamx.druid.realtime.RealtimeMain
|
||||
-classpath lib/*:config/realtime io.druid.cli.Main server realtime
|
||||
```
|
||||
|
||||
4. Paste data into the Kafka console producer
|
||||
|
@@ -239,46 +258,20 @@ If you've already setup a realtime node, be aware that although you can run mult
|
|||
1. Setup a configuration file called config/coordinator/runtime.properties similar to:
|
||||
|
||||
```properties
|
||||
druid.host=0.0.0.0:8081
|
||||
druid.host=localhost
|
||||
druid.service=coordinator
|
||||
druid.port=8081
|
||||
|
||||
com.metamx.emitter.logging=true
|
||||
druid.zk.service.host=localhost
|
||||
|
||||
druid.processing.formatString=processing_%s
|
||||
druid.processing.numThreads=1
|
||||
druid.processing.buffer.sizeBytes=10000000
|
||||
druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ
|
||||
druid.s3.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
|
||||
|
||||
# emitting, opaque marker
|
||||
druid.service=example
|
||||
druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
|
||||
druid.db.connector.user=druid
|
||||
druid.db.connector.password=diurd
|
||||
|
||||
druid.coordinator.startDelay=PT60s
|
||||
druid.request.logging.dir=/tmp/example/log
|
||||
druid.realtime.specFile=realtime.spec
|
||||
com.metamx.emitter.logging=true
|
||||
com.metamx.emitter.logging.level=debug
|
||||
|
||||
# below are dummy values when operating a realtime only node
|
||||
druid.processing.numThreads=3
|
||||
|
||||
com.metamx.aws.accessKey=dummy_access_key
|
||||
com.metamx.aws.secretKey=dummy_secret_key
|
||||
druid.storage.s3.bucket=dummy_s3_bucket
|
||||
|
||||
druid.zk.service.host=localhost
|
||||
druid.server.maxSize=300000000000
|
||||
druid.zk.paths.base=/druid
|
||||
druid.database.segmentTable=prod_segments
|
||||
druid.database.user=druid
|
||||
druid.database.password=diurd
|
||||
druid.database.connectURI=jdbc:mysql://localhost:3306/druid
|
||||
druid.zk.paths.discoveryPath=/druid/discoveryPath
|
||||
druid.database.ruleTable=rules
|
||||
druid.database.configTable=config
|
||||
|
||||
# Path on local FS for storage of segments; dir will be created if needed
|
||||
druid.paths.indexCache=/tmp/druid/indexCache
|
||||
# Path on local FS for storage of segment metadata; dir will be created if needed
|
||||
druid.paths.segmentInfoCache=/tmp/druid/segmentInfoCache
|
||||
```
|
||||
|
||||
2. Launch the [Coordinator](Coordinator.html) node
|
||||
|
@@ -286,7 +279,7 @@ If you've already setup a realtime node, be aware that although you can run mult
|
|||
```bash
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
|
||||
-classpath lib/*:config/coordinator \
|
||||
com.metamx.druid.http.CoordinatorMain
|
||||
io.druid.cli.Main server coordinator
|
||||
```
|
||||
|
||||
### Launch a Historical Node ###
|
||||
|
@@ -294,48 +287,21 @@ If you've already setup a realtime node, be aware that although you can run mult
|
|||
1. Create a configuration file in config/historical/runtime.properties similar to:
|
||||
|
||||
```properties
|
||||
druid.host=0.0.0.0:8082
|
||||
druid.host=localhost
|
||||
druid.service=historical
|
||||
druid.port=8082
|
||||
|
||||
com.metamx.emitter.logging=true
|
||||
druid.zk.service.host=localhost
|
||||
|
||||
druid.s3.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
|
||||
druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ
|
||||
|
||||
druid.server.maxSize=100000000
|
||||
|
||||
druid.processing.formatString=processing_%s
|
||||
druid.processing.numThreads=1
|
||||
druid.processing.buffer.sizeBytes=10000000
|
||||
|
||||
# emitting, opaque marker
|
||||
druid.service=example
|
||||
|
||||
druid.request.logging.dir=/tmp/example/log
|
||||
druid.realtime.specFile=realtime.spec
|
||||
com.metamx.emitter.logging=true
|
||||
com.metamx.emitter.logging.level=debug
|
||||
|
||||
# below are dummy values when operating a realtime only node
|
||||
druid.processing.numThreads=3
|
||||
|
||||
com.metamx.aws.accessKey=dummy_access_key
|
||||
com.metamx.aws.secretKey=dummy_secret_key
|
||||
druid.storage.s3.bucket=dummy_s3_bucket
|
||||
|
||||
druid.zk.service.host=localhost
|
||||
druid.server.maxSize=300000000000
|
||||
druid.zk.paths.base=/druid
|
||||
druid.database.segmentTable=prod_segments
|
||||
druid.database.user=druid
|
||||
druid.database.password=diurd
|
||||
druid.database.connectURI=jdbc:mysql://localhost:3306/druid
|
||||
druid.zk.paths.discoveryPath=/druid/discoveryPath
|
||||
druid.database.ruleTable=rules
|
||||
druid.database.configTable=config
|
||||
|
||||
# Path on local FS for storage of segments; dir will be created if needed
|
||||
druid.paths.indexCache=/tmp/druid/indexCache
|
||||
# Path on local FS for storage of segment metadata; dir will be created if needed
|
||||
druid.paths.segmentInfoCache=/tmp/druid/segmentInfoCache
|
||||
# Setup local storage mode
|
||||
druid.storage.local.storageDirectory=/tmp/druid/localStorage
|
||||
druid.storage.local=true
|
||||
druid.segmentCache.infoPath=/tmp/druid/segmentInfoCache
|
||||
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize"\: 100000000}]
|
||||
```
|
||||
|
||||
2. Launch the historical node:
|
||||
|
@@ -371,31 +337,47 @@ Now its time to run the Hadoop [Batch-ingestion](Batch-ingestion.html) job, Hado
|
|||
"timestampFormat": "iso",
|
||||
"dataSpec": {
|
||||
"format": "json",
|
||||
"dimensions": ["gender", "age"]
|
||||
"dimensions": [
|
||||
"gender",
|
||||
"age"
|
||||
]
|
||||
},
|
||||
"granularitySpec": {
|
||||
"type":"uniform",
|
||||
"intervals":["2010-01-01T01/PT1H"],
|
||||
"gran":"hour"
|
||||
"type": "uniform",
|
||||
"intervals": [
|
||||
"2010-01-01T01\/PT1H"
|
||||
],
|
||||
"gran": "hour"
|
||||
},
|
||||
"pathSpec": { "type": "static",
|
||||
"paths": "/Users/rjurney/Software/druid/records.json" },
|
||||
"rollupSpec": { "aggs":[ {"type":"count", "name":"impressions"},
|
||||
{"type":"doubleSum","name":"wp","fieldName":"wp"}
|
||||
],
|
||||
"rollupGranularity": "minute"},
|
||||
"workingPath": "/tmp/working_path",
|
||||
"segmentOutputPath": "/tmp/segments",
|
||||
"leaveIntermediate": "false",
|
||||
"pathSpec": {
|
||||
"type": "static",
|
||||
"paths": "\/druid\/records.json"
|
||||
},
|
||||
"rollupSpec": {
|
||||
"aggs": [
|
||||
{
|
||||
"type": "count",
|
||||
"name": "impressions"
|
||||
},
|
||||
{
|
||||
"type": "doubleSum",
|
||||
"name": "wp",
|
||||
"fieldName": "wp"
|
||||
}
|
||||
],
|
||||
"rollupGranularity": "minute"
|
||||
},
|
||||
"workingPath": "\/tmp\/working_path",
|
||||
"segmentOutputPath": "\/tmp\/segments",
|
||||
"partitionsSpec": {
|
||||
"targetPartitionSize": 5000000
|
||||
},
|
||||
"updaterJobSpec": {
|
||||
"type":"db",
|
||||
"connectURI":"jdbc:mysql://localhost:3306/druid",
|
||||
"user":"druid",
|
||||
"password":"diurd",
|
||||
"segmentTable":"prod_segments"
|
||||
"type": "db",
|
||||
"connectURI": "jdbc:mysql:\/\/localhost:3306\/druid",
|
||||
"user": "druid",
|
||||
"password": "diurd",
|
||||
"segmentTable": "druid_segments"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
@@ -404,8 +386,8 @@ Now its time to run the Hadoop [Batch-ingestion](Batch-ingestion.html) job, Hado
|
|||
|
||||
```bash
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
|
||||
-Ddruid.realtime.specFile=realtime.spec -classpath lib/* \
|
||||
com.metamx.druid.indexer.HadoopDruidIndexerMain batchConfig.json
|
||||
-classpath `echo lib/* | tr ' ' ':'` \
|
||||
io.druid.cli.Main index hadoop batchConfig.json
|
||||
```
|
||||
|
||||
You can now move on to [Querying Your Data](Querying-Your-Data.html)!
|
||||
|
|
|
@@ -6,7 +6,7 @@ MySQL is an external dependency of Druid. We use it to store various metadata ab
|
|||
Segments Table
|
||||
--------------
|
||||
|
||||
This is dictated by the `druid.database.segmentTable` property (Note that these properties are going to change in the next stable version after 0.4.12).
|
||||
This is dictated by the `druid.db.tables.segments` property.
|
||||
|
||||
This table stores metadata about the segments that are available in the system. The table is polled by the [Coordinator](Coordinator.html) to determine the set of segments that should be available for querying in the system. The table has two main functional columns, the other columns are for indexing purposes.
|
||||
|
||||
|
|
|
@@ -7,101 +7,24 @@ Before we start querying druid, we're going to finish setting up a complete clus
|
|||
|
||||
## Booting a Broker Node ##
|
||||
|
||||
1. Setup a config file at config/broker/runtime.properties that looks like this:
|
||||
1. Setup a config file at config/broker/runtime.properties that looks like this:
|
||||
|
||||
```
|
||||
druid.host=0.0.0.0:8083
|
||||
druid.port=8083
|
||||
|
||||
com.metamx.emitter.logging=true
|
||||
|
||||
druid.processing.formatString=processing_%s
|
||||
druid.processing.numThreads=1
|
||||
druid.processing.buffer.sizeBytes=10000000
|
||||
|
||||
#emitting, opaque marker
|
||||
druid.service=example
|
||||
|
||||
druid.request.logging.dir=/tmp/example/log
|
||||
druid.realtime.specFile=realtime.spec
|
||||
com.metamx.emitter.logging=true
|
||||
com.metamx.emitter.logging.level=debug
|
||||
|
||||
# below are dummy values when operating a realtime only node
|
||||
druid.processing.numThreads=3
|
||||
|
||||
com.metamx.aws.accessKey=dummy_access_key
|
||||
com.metamx.aws.secretKey=dummy_secret_key
|
||||
druid.storage.s3.bucket=dummy_s3_bucket
|
||||
|
||||
druid.host=localhost
|
||||
druid.service=broker
|
||||
druid.port=8080
|
||||
|
||||
druid.zk.service.host=localhost
|
||||
druid.server.maxSize=300000000000
|
||||
druid.zk.paths.base=/druid
|
||||
druid.database.segmentTable=prod_segments
|
||||
druid.database.user=druid
|
||||
druid.database.password=diurd
|
||||
druid.database.connectURI=jdbc:mysql://localhost:3306/druid
|
||||
druid.zk.paths.discoveryPath=/druid/discoveryPath
|
||||
druid.database.ruleTable=rules
|
||||
druid.database.configTable=config
|
||||
|
||||
# Path on local FS for storage of segments; dir will be created if needed
|
||||
druid.paths.indexCache=/tmp/druid/indexCache
|
||||
# Path on local FS for storage of segment metadata; dir will be created if needed
|
||||
druid.paths.segmentInfoCache=/tmp/druid/segmentInfoCache
|
||||
druid.storage.local.storageDirectory=/tmp/druid/localStorage
|
||||
druid.storage.local=true
|
||||
|
||||
# thread pool size for servicing queries
|
||||
druid.client.http.connections=30
|
||||
|
||||
```
|
||||
|
||||
2. Run the broker node:
|
||||
|
||||
```bash
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
|
||||
-Ddruid.realtime.specFile=realtime.spec \
|
||||
-classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/broker \
|
||||
com.metamx.druid.http.BrokerMain
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/broker io.druid.cli.Main server broker
|
||||
```
|
||||
|
||||
## Booting a Coordinator Node ##
|
||||
|
||||
1. Setup a config file at config/coordinator/runtime.properties that looks like this: [https://gist.github.com/rjurney/5818870](https://gist.github.com/rjurney/5818870)
|
||||
|
||||
2. Run the coordinator node:
|
||||
|
||||
```bash
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
|
||||
-classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/coordinator \
|
||||
io.druid.cli.Main server coordinator
|
||||
```
|
||||
|
||||
## Booting a Realtime Node ##
|
||||
|
||||
1. Setup a config file at config/realtime/runtime.properties that looks like this: [https://gist.github.com/rjurney/5818774](https://gist.github.com/rjurney/5818774)
|
||||
|
||||
2. Setup a realtime.spec file like this: [https://gist.github.com/rjurney/5818779](https://gist.github.com/rjurney/5818779)
|
||||
|
||||
3. Run the realtime node:
|
||||
|
||||
```bash
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
|
||||
-Ddruid.realtime.specFile=realtime.spec \
|
||||
-classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/realtime \
|
||||
com.metamx.druid.realtime.RealtimeMain
|
||||
```
|
||||
|
||||
## Booting a historical node ##
|
||||
|
||||
1. Setup a config file at config/historical/runtime.properties that looks like this: [https://gist.github.com/rjurney/5818885](https://gist.github.com/rjurney/5818885)
|
||||
2. Run the historical node:
|
||||
|
||||
```bash
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
|
||||
-classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/historical \
|
||||
io.druid.cli.Main server historical
|
||||
```
|
||||
With the Broker node and the other Druid nodes types up and running, you have a fully functional Druid Cluster and are ready to query your data!
|
||||
|
||||
# Querying Your Data #
|
||||
|
||||
|
@@ -109,7 +32,7 @@ Now that we have a complete cluster setup on localhost, we need to load data. To
|
|||
|
||||
## Querying Different Nodes ##
|
||||
|
||||
As a shared-nothing system, there are three ways to query druid, against the [Realtime](Realtime.html), [Historical](Historical.html) or [Broker](Broker.html) node. Querying a Realtime node returns only realtime data, querying a historical node returns only historical segments. Querying the broker will query both realtime and historical segments and compose an overall result for the query. This is the normal mode of operation for queries in druid.
|
||||
As a shared-nothing system, Druid can be queried in three ways: against the [Realtime](Realtime.html), [Historical](Historical.html), or [Broker](Broker.html) node. Querying a Realtime node returns only realtime data; querying a Historical node returns only historical segments. Querying the Broker may query both realtime and historical segments and compose an overall result for the query. This is the normal mode of operation for queries in Druid.
|
||||
|
||||
### Construct a Query ###
|
||||
|
||||
|
@@ -148,7 +71,7 @@ See our result:
|
|||
} ]
|
||||
```
|
||||
|
||||
### Querying the historical node ###
|
||||
### Querying the Historical node ###
|
||||
Run the query against port 8082:
|
||||
|
||||
```bash
|
||||
|
@@ -165,7 +88,7 @@ And get (similar to):
|
|||
} ]
|
||||
```
|
||||
|
||||
### Querying both Nodes via the Broker ###
|
||||
### Querying the Broker ###
|
||||
Run the query against port 8083:
|
||||
|
||||
```bash
|
||||
|
@@ -184,39 +107,72 @@ And get:
|
|||
|
||||
Now that we know what nodes can be queried (although you should usually use the broker node), let's learn how to find out what queries are available.
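As a rough sketch of how one query can be pointed at different nodes, the helper below builds the query URL for a given host and port. It assumes the `/druid/v2/?pretty` query path and a `query.json` file holding the query body, neither of which is shown in this part of the page. The `druid_query_url` name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the query URL for a Druid node.
# The /druid/v2/ path is an assumption; adjust it to match your deployment.
druid_query_url() {
  local host="$1" port="$2"
  echo "http://${host}:${port}/druid/v2/?pretty"
}

# Example (requires a running node and a query.json file, both assumed):
# curl -X POST "$(druid_query_url localhost 8083)" \
#   -H 'Content-Type: application/json' -d @query.json
```

Only the port changes between nodes, so the same query file can be sent to the Historical node on 8082 or to the Broker on 8083.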
|
||||
|
||||
## Querying Against the realtime.spec ##
|
||||
## Examining the realtime.spec ##
|
||||
|
||||
How are we to know what queries we can run? Although [Querying](Querying.html) is a helpful index, to get a handle on querying our data we need to look at our [Realtime](Realtime.html) node's realtime.spec file:
|
||||
|
||||
```json
|
||||
[{
|
||||
"schema" : { "dataSource":"druidtest",
|
||||
"aggregators":[ {"type":"count", "name":"impressions"},
|
||||
{"type":"doubleSum","name":"wp","fieldName":"wp"}],
|
||||
"indexGranularity":"minute",
|
||||
"shardSpec" : { "type": "none" } },
|
||||
"config" : { "maxRowsInMemory" : 500000,
|
||||
"intermediatePersistPeriod" : "PT10m" },
|
||||
"firehose" : { "type" : "kafka-0.7.2",
|
||||
"consumerProps" : { "zk.connect" : "localhost:2181",
|
||||
"zk.connectiontimeout.ms" : "15000",
|
||||
"zk.sessiontimeout.ms" : "15000",
|
||||
"zk.synctime.ms" : "5000",
|
||||
"groupid" : "topic-pixel-local",
|
||||
"fetch.size" : "1048586",
|
||||
"autooffset.reset" : "largest",
|
||||
"autocommit.enable" : "false" },
|
||||
"feed" : "druidtest",
|
||||
"parser" : { "timestampSpec" : { "column" : "utcdt", "format" : "iso" },
|
||||
"data" : { "format" : "json" },
|
||||
"dimensionExclusions" : ["wp"] } },
|
||||
"plumber" : { "type" : "realtime",
|
||||
"windowPeriod" : "PT10m",
|
||||
"segmentGranularity":"hour",
|
||||
"basePersistDirectory" : "/tmp/realtime/basePersist",
|
||||
"rejectionPolicy": {"type": "messageTime"} }
|
||||
|
||||
}]
|
||||
[
|
||||
{
|
||||
"schema": {
|
||||
"dataSource": "druidtest",
|
||||
"aggregators": [
|
||||
{
|
||||
"type": "count",
|
||||
"name": "impressions"
|
||||
},
|
||||
{
|
||||
"type": "doubleSum",
|
||||
"name": "wp",
|
||||
"fieldName": "wp"
|
||||
}
|
||||
],
|
||||
"indexGranularity": "minute",
|
||||
"shardSpec": {
|
||||
"type": "none"
|
||||
}
|
||||
},
|
||||
"config": {
|
||||
"maxRowsInMemory": 500000,
|
||||
"intermediatePersistPeriod": "PT10m"
|
||||
},
|
||||
"firehose": {
|
||||
"type": "kafka-0.7.2",
|
||||
"consumerProps": {
|
||||
"zk.connect": "localhost:2181",
|
||||
"zk.connectiontimeout.ms": "15000",
|
||||
"zk.sessiontimeout.ms": "15000",
|
||||
"zk.synctime.ms": "5000",
|
||||
"groupid": "topic-pixel-local",
|
||||
"fetch.size": "1048586",
|
||||
"autooffset.reset": "largest",
|
||||
"autocommit.enable": "false"
|
||||
},
|
||||
"feed": "druidtest",
|
||||
"parser": {
|
||||
"timestampSpec": {
|
||||
"column": "utcdt",
|
||||
"format": "iso"
|
||||
},
|
||||
"data": {
|
||||
"format": "json"
|
||||
},
|
||||
"dimensionExclusions": [
|
||||
"wp"
|
||||
]
|
||||
}
|
||||
},
|
||||
"plumber": {
|
||||
"type": "realtime",
|
||||
"windowPeriod": "PT10m",
|
||||
"segmentGranularity": "hour",
|
||||
"basePersistDirectory": "\/tmp\/realtime\/basePersist",
|
||||
"rejectionPolicy": {
|
||||
"type": "messageTime"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
```
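One way to read the spec with querying in mind is to pull out the names a query can refer to. The following is a minimal Python sketch, assuming the spec is available as a JSON string (abbreviated here to its `schema` and `firehose.parser` parts). The helper name `summarize_spec` and the keys of the returned summary are hypothetical.

```python
import json

# Abbreviated copy of the realtime.spec shown above.
SPEC = """
[{
  "schema": {
    "dataSource": "druidtest",
    "aggregators": [
      {"type": "count", "name": "impressions"},
      {"type": "doubleSum", "name": "wp", "fieldName": "wp"}
    ],
    "indexGranularity": "minute"
  },
  "firehose": {
    "parser": {
      "timestampSpec": {"column": "utcdt", "format": "iso"},
      "dimensionExclusions": ["wp"]
    }
  }
}]
"""


def summarize_spec(spec_text):
    """Return, for each stream in the spec, the names a query can refer to."""
    summaries = []
    for stream in json.loads(spec_text):
        schema = stream["schema"]
        parser = stream["firehose"]["parser"]
        summaries.append({
            "dataSource": schema["dataSource"],
            "metrics": [agg["name"] for agg in schema["aggregators"]],
            "timestampColumn": parser["timestampSpec"]["column"],
            "excludedDimensions": parser["dimensionExclusions"],
        })
    return summaries


summary = summarize_spec(SPEC)
print(summary[0])
```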
|
||||
|
||||
### dataSource ###
|
||||
|
@ -330,7 +286,7 @@ Which gets us just people aged 40:
|
|||
} ]
|
||||
```
|
||||
|
||||
Check out [Filters](Filters.html) for more.
|
||||
Check out [Filters](Filters.html) for more information.
|
||||
|
||||
## Learn More ##
|
||||
|
||||
|
|
|
@ -28,32 +28,64 @@ Realtime nodes take a mix of base server configuration and spec files that descr
|
|||
The property `druid.realtime.specFile` has the path of a file (absolute or relative path and file name) with realtime specifications in it. This "specFile" should be a JSON Array of JSON objects like the following:
|
||||
|
||||
```json
|
||||
[{
|
||||
"schema" : { "dataSource":"dataSourceName",
|
||||
"aggregators":[ {"type":"count", "name":"events"},
|
||||
{"type":"doubleSum","name":"outColumn","fieldName":"inColumn"} ],
|
||||
"indexGranularity":"minute",
|
||||
"shardSpec" : { "type": "none" } },
|
||||
"config" : { "maxRowsInMemory" : 500000,
|
||||
"intermediatePersistPeriod" : "PT10m" },
|
||||
"firehose" : { "type" : "kafka-0.7.2",
|
||||
"consumerProps" : { "zk.connect" : "zk_connect_string",
|
||||
"zk.connectiontimeout.ms" : "15000",
|
||||
"zk.sessiontimeout.ms" : "15000",
|
||||
"zk.synctime.ms" : "5000",
|
||||
"groupid" : "consumer-group",
|
||||
"fetch.size" : "1048586",
|
||||
"autooffset.reset" : "largest",
|
||||
"autocommit.enable" : "false" },
|
||||
"feed" : "your_kafka_topic",
|
||||
"parser" : { "timestampSpec" : { "column" : "timestamp", "format" : "iso" },
|
||||
"data" : { "format" : "json" },
|
||||
"dimensionExclusions" : ["value"] } },
|
||||
"plumber" : { "type" : "realtime",
|
||||
"windowPeriod" : "PT10m",
|
||||
"segmentGranularity":"hour",
|
||||
"basePersistDirectory" : "/tmp/realtime/basePersist" }
|
||||
}]
|
||||
[
|
||||
{
|
||||
"schema": {
|
||||
"dataSource": "dataSourceName",
|
||||
"aggregators": [
|
||||
{
|
||||
"type": "count",
|
||||
"name": "events"
|
||||
},
|
||||
{
|
||||
"type": "doubleSum",
|
||||
"name": "outColumn",
|
||||
"fieldName": "inColumn"
|
||||
}
|
||||
],
|
||||
"indexGranularity": "minute",
|
||||
"shardSpec": {
|
||||
"type": "none"
|
||||
}
|
||||
},
|
||||
"config": {
|
||||
"maxRowsInMemory": 500000,
|
||||
"intermediatePersistPeriod": "PT10m"
|
||||
},
|
||||
"firehose": {
|
||||
"type": "kafka-0.7.2",
|
||||
"consumerProps": {
|
||||
"zk.connect": "zk_connect_string",
|
||||
"zk.connectiontimeout.ms": "15000",
|
||||
"zk.sessiontimeout.ms": "15000",
|
||||
"zk.synctime.ms": "5000",
|
||||
"groupid": "consumer-group",
|
||||
"fetch.size": "1048586",
|
||||
"autooffset.reset": "largest",
|
||||
"autocommit.enable": "false"
|
||||
},
|
||||
"feed": "your_kafka_topic",
|
||||
"parser": {
|
||||
"timestampSpec": {
|
||||
"column": "timestamp",
|
||||
"format": "iso"
|
||||
},
|
||||
"data": {
|
||||
"format": "json"
|
||||
},
|
||||
"dimensionExclusions": [
|
||||
"value"
|
||||
]
|
||||
}
|
||||
},
|
||||
"plumber": {
|
||||
"type": "realtime",
|
||||
"windowPeriod": "PT10m",
|
||||
"segmentGranularity": "hour",
|
||||
"basePersistDirectory": "\/tmp\/realtime\/basePersist"
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
This is a JSON Array, so you can give more than one realtime stream to a given node. The number you can put in the same process depends on the exact configuration. In general, it is best to think of each realtime stream handler as requiring two threads: one thread for data consumption and aggregation, and one thread for incremental persists and other background tasks.
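The two-threads-per-stream rule of thumb can be turned into a quick estimate. This is a small Python sketch of that arithmetic only; the function name is hypothetical, and the real limit still depends on the exact configuration.

```python
def threads_for_streams(stream_count, threads_per_stream=2):
    """Estimate threads: one for consumption/aggregation plus one for
    persists and background tasks, per realtime stream."""
    if stream_count < 0:
        raise ValueError("stream_count must be non-negative")
    return stream_count * threads_per_stream


for n in (1, 2, 4):
    print(n, "stream(s) ->", threads_for_streams(n), "threads")
```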
|
||||
|
@ -116,43 +148,9 @@ Extending the code
|
|||
|
||||
Realtime integration is intended to be extended in two ways:
|
||||
|
||||
1. Connect to data streams from varied systems ([Firehose](https://github.com/metamx/druid/blob/druid-0.5.x/realtime/src/main/java/com/metamx/druid/realtime/firehose/FirehoseFactory.java))
|
||||
2. Adjust the publishing strategy to match your needs ([Plumber](https://github.com/metamx/druid/blob/druid-0.5.x/realtime/src/main/java/com/metamx/druid/realtime/plumber/PlumberSchool.java))
|
||||
1. Connect to data streams from varied systems ([Firehose](https://github.com/metamx/druid/blob/druid-0.6.0/realtime/src/main/java/com/metamx/druid/realtime/firehose/FirehoseFactory.java))
|
||||
2. Adjust the publishing strategy to match your needs ([Plumber](https://github.com/metamx/druid/blob/druid-0.6.0/realtime/src/main/java/com/metamx/druid/realtime/plumber/PlumberSchool.java))
|
||||
|
||||
The expectations are that the former will be very common and something that users of Druid will do on a fairly regular basis. Most users will probably never have to deal with the latter form of customization. Indeed, we hope that all potential use cases can be packaged up as part of Druid proper without requiring proprietary customization.
|
||||
|
||||
Given those expectations, adding a firehose is straightforward and completely encapsulated inside of the interface. Adding a plumber is more involved and requires an understanding of how the system works to get right. It’s not impossible, but it’s not intended that individuals new to Druid will be able to do it immediately.
|
||||
|
||||
We will do our best to accept contributions from the community of new Firehoses and Plumbers, but we also understand the requirement for being able to plug in your own proprietary implementations. The model for doing this is by embedding the druid code in another project and writing your own `main()` method that initializes a RealtimeNode object and registers your proprietary objects with it.
|
||||
|
||||
```java
|
||||
public class MyRealtimeMain
|
||||
{
|
||||
private static final Logger log = new Logger(MyRealtimeMain.class);
|
||||
|
||||
public static void main(String[] args) throws Exception
|
||||
{
|
||||
LogLevelAdjuster.register();
|
||||
|
||||
Lifecycle lifecycle = new Lifecycle();
|
||||
|
||||
lifecycle.addManagedInstance(
|
||||
RealtimeNode.builder()
|
||||
.build()
|
||||
.registerJacksonSubtype(foo.bar.MyFirehose.class)
|
||||
);
|
||||
|
||||
try {
|
||||
lifecycle.start();
|
||||
}
|
||||
catch (Throwable t) {
|
||||
log.info(t, "Throwable caught at startup, committing seppuku");
|
||||
System.exit(2);
|
||||
}
|
||||
|
||||
lifecycle.join();
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Pluggable pieces of the system are either handled by a setter on the RealtimeNode object, or they are configuration driven and need to be setup to allow for [Jackson polymorphic deserialization](http://wiki.fasterxml.com/JacksonPolymorphicDeserialization) and registered via the relevant methods on the RealtimeNode object.
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
---
|
||||
layout: doc_page
|
||||
---
|
||||
Numerous backend engineers at [Metamarkets](http://www.metamarkets.com) work on Druid full-time. If you any questions about usage or code, feel free to contact any of us.
|
||||
Numerous backend engineers at [Metamarkets](http://www.metamarkets.com) and other companies work on Druid full-time. If you have any questions about usage or code, feel free to contact any of us.
|
||||
|
||||
Google Groups Mailing List
|
||||
--------------------------
|
||||
|
|
|
@ -47,7 +47,7 @@ There are two ways to setup Druid: download a tarball, or [Build From Source](Bu
|
|||
|
||||
### Download a Tarball
|
||||
|
||||
We've built a tarball that contains everything you'll need. You'll find it [here](http://static.druid.io/artifacts/releases/druid-services-0.5.54-bin.tar.gz)
|
||||
We've built a tarball that contains everything you'll need. You'll find it [here](http://static.druid.io/artifacts/releases/druid-services-0.6.0-bin.tar.gz)
|
||||
Download this file to a directory of your choosing.
|
||||
|
||||
You can extract the awesomeness within by issuing:
|
||||
|
@ -59,7 +59,7 @@ tar -zxvf druid-services-*-bin.tar.gz
|
|||
Not too lost so far right? That's great! If you cd into the directory:
|
||||
|
||||
```
|
||||
cd druid-services-0.5.54
|
||||
cd druid-services-0.6.0
|
||||
```
|
||||
|
||||
You should see a bunch of files:
|
||||
|
@ -82,10 +82,12 @@ Select "wikipedia".
|
|||
Once the node starts up you will see a bunch of logs about setting up properties and connecting to the data source. If everything was successful, you should see messages of the form shown below.
|
||||
|
||||
```
|
||||
2013-07-19 21:54:05,154 INFO [main] com.metamx.druid.realtime.RealtimeNode - Starting Jetty
|
||||
2013-07-19 21:54:05,154 INFO [main] org.mortbay.log - jetty-6.1.x
|
||||
2013-07-19 21:54:05,171 INFO [chief-wikipedia] com.metamx.druid.realtime.plumber.RealtimePlumberSchool - Expect to run at [2013-07-19T22:03:00.000Z]
|
||||
2013-07-19 21:54:05,246 INFO [main] org.mortbay.log - Started SelectChannelConnector@0.0.0.0:8083
|
||||
2013-09-04 19:33:11,922 INFO [main] org.eclipse.jetty.server.AbstractConnector - Started SelectChannelConnector@0.0.0.0:8083
|
||||
2013-09-04 19:33:11,946 INFO [ApiDaemon] io.druid.segment.realtime.firehose.IrcFirehoseFactory - irc connection to server [irc.wikimedia.org] established
|
||||
2013-09-04 19:33:11,946 INFO [ApiDaemon] io.druid.segment.realtime.firehose.IrcFirehoseFactory - Joining channel #en.wikipedia
|
||||
2013-09-04 19:33:11,946 INFO [ApiDaemon] io.druid.segment.realtime.firehose.IrcFirehoseFactory - Joining channel #fr.wikipedia
|
||||
2013-09-04 19:33:11,946 INFO [ApiDaemon] io.druid.segment.realtime.firehose.IrcFirehoseFactory - Joining channel #de.wikipedia
|
||||
2013-09-04 19:33:11,946 INFO [ApiDaemon] io.druid.segment.realtime.firehose.IrcFirehoseFactory - Joining channel #ja.wikipedia
|
||||
```
|
||||
|
||||
The Druid real-time node ingests events into an in-memory buffer. Periodically, these events are persisted to disk. If you are interested in the details of our real-time architecture and why we persist indexes to disk, I suggest you read our [White Paper](http://static.druid.io/docs/druid.pdf).
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
---
|
||||
layout: doc_page
|
||||
---
|
||||
Welcome back! In our first [tutorial](https://github.com/metamx/druid/wiki/Tutorial%3A-A-First-Look-at-Druid), we introduced you to the most basic Druid setup: a single realtime node. We streamed in some data and queried it. Realtime nodes collect very recent data and periodically hand that data off to the rest of the Druid cluster. Some questions about the architecture must naturally come to mind. What does the rest of Druid cluster look like? How does Druid load available static data?
|
||||
Welcome back! In our first [tutorial](Tutorial:-A-First-Look-at-Druid), we introduced you to the most basic Druid setup: a single realtime node. We streamed in some data and queried it. Realtime nodes collect very recent data and periodically hand that data off to the rest of the Druid cluster. Some questions about the architecture must naturally come to mind. What does the rest of the Druid cluster look like? How does Druid load available static data?
|
||||
|
||||
This tutorial will hopefully answer these questions!
|
||||
|
||||
|
@ -11,7 +11,7 @@ In this tutorial, we will set up other types of Druid nodes as well as and exter
|
|||
|
||||
If you followed the first tutorial, you should already have Druid downloaded. If not, let's go back and do that first.
|
||||
|
||||
You can download the latest version of druid [here](http://static.druid.io/artifacts/releases/druid-services-0.5.54-bin.tar.gz)
|
||||
You can download the latest version of druid [here](http://static.druid.io/artifacts/releases/druid-services-0.6.0-bin.tar.gz)
|
||||
|
||||
and untar the contents within by issuing:
|
||||
|
||||
|
@ -26,7 +26,7 @@ You can also [Build From Source](Build-From-Source.html).
|
|||
|
||||
Druid requires 3 external dependencies. A "deep" storage that acts as a backup data repository, a relational database such as MySQL to hold configuration and metadata information, and [Apache Zookeeper](http://zookeeper.apache.org/) for coordination among different pieces of the cluster.
|
||||
|
||||
For deep storage, we have made a public S3 bucket (static.druid.io) available where data for this particular tutorial can be downloaded. More on the data [later](https://github.com/metamx/druid/wiki/Tutorial-Part-2#the-data).
|
||||
For deep storage, we have made a public S3 bucket (static.druid.io) available where data for this particular tutorial can be downloaded. More on the data [later](Tutorial-Part-2.html#the-data).
|
||||
|
||||
### Setting up MySQL ###
|
||||
|
||||
|
@ -56,7 +56,7 @@ cd ..
|
|||
|
||||
## The Data ##
|
||||
|
||||
Similar to the first tutorial, the data we will be loading is based on edits that have occurred on Wikipedia. Every time someone edits a page in Wikipedia, metadata is generated about the editor and edited page. Druid collects each individual event and packages them together in a container known as a [segment](https://github.com/metamx/druid/wiki/Segments). Segments contain data over some span of time. We've prebuilt a segment for this tutorial and will cover making your own segments in other [pages](https://github.com/metamx/druid/wiki/Loading-Your-Data).The segment we are going to work with has the following format:
|
||||
Similar to the first tutorial, the data we will be loading is based on edits that have occurred on Wikipedia. Every time someone edits a page in Wikipedia, metadata is generated about the editor and edited page. Druid collects each individual event and packages them together in a container known as a [segment](https://github.com/metamx/druid/wiki/Segments). Segments contain data over some span of time. We've prebuilt a segment for this tutorial and will cover making your own segments in other [pages](Loading-Your-Data.html). The segment we are going to work with has the following format:
|
||||
|
||||
Dimensions (things to filter on):
|
||||
|
||||
|
@ -92,11 +92,12 @@ Let's start up a few nodes and download our data. First things though, let's cre
|
|||
mkdir config
|
||||
```
|
||||
|
||||
If you are interested in learning more about Druid configuration files, check out this [link](https://github.com/metamx/druid/wiki/Configuration). Many aspects of Druid are customizable. For the purposes of this tutorial, we are going to use default values for most things.
|
||||
If you are interested in learning more about Druid configuration files, check out this [link](Configuration.html). Many aspects of Druid are customizable. For the purposes of this tutorial, we are going to use default values for most things.
|
||||
|
||||
### Start a Coordinator Node ###
|
||||
|
||||
Coordinator nodes are in charge of load assignment and distribution. Coordinator nodes monitor the status of the cluster and command historical nodes to assign and drop segments.
|
||||
For more information about coordinator nodes, see [here](Coordinator.html).
|
||||
|
||||
To create the coordinator config file:
|
||||
|
||||
|
@ -104,36 +105,23 @@ To create the coordinator config file:
|
|||
mkdir config/coordinator
|
||||
```
|
||||
|
||||
Under the directory we just created, create the file `runtime.properties` with the following contents:
|
||||
Under the directory we just created, create the file `runtime.properties` with the following contents if it does not exist:
|
||||
|
||||
```
|
||||
druid.host=127.0.0.1:8082
|
||||
druid.port=8082
|
||||
druid.host=localhost
|
||||
druid.service=coordinator
|
||||
druid.port=8082
|
||||
|
||||
# logging
|
||||
com.metamx.emitter.logging=true
|
||||
com.metamx.emitter.logging.level=info
|
||||
|
||||
# zk
|
||||
druid.zk.service.host=localhost
|
||||
druid.zk.paths.base=/druid
|
||||
druid.zk.paths.discoveryPath=/druid/discoveryPath
|
||||
|
||||
# aws (demo user)
|
||||
com.metamx.aws.accessKey=AKIAIMKECRUYKDQGR6YQ
|
||||
com.metamx.aws.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
|
||||
druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ
|
||||
druid.s3.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
|
||||
|
||||
# db
|
||||
druid.database.segmentTable=segments
|
||||
druid.database.user=druid
|
||||
druid.database.password=diurd
|
||||
druid.database.connectURI=jdbc:mysql://localhost:3306/druid
|
||||
druid.database.ruleTable=rules
|
||||
druid.database.configTable=config
|
||||
druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
|
||||
druid.db.connector.user=druid
|
||||
druid.db.connector.password=diurd
|
||||
|
||||
# coordinator runtime configs
|
||||
druid.coordinator.startDelay=PT60S
|
||||
druid.coordinator.startDelay=PT60s
|
||||
```
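To check a `runtime.properties` file before starting a node, it can help to load it and print the values you care about. The following is a minimal Python sketch, assuming a reduced copy of the coordinator properties above. The parser is a hypothetical helper that handles only `key=value` lines, `#` comments and the `\:` escape, not the full Java properties format.

```python
# Reduced copy of the coordinator runtime.properties shown above.
PROPERTIES = r"""
druid.host=localhost
druid.service=coordinator
druid.port=8082

# db
druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
druid.coordinator.startDelay=PT60s
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Properties files may escape ':' with a backslash; undo that here.
        props[key.strip()] = value.strip().replace("\\:", ":")
    return props


props = parse_properties(PROPERTIES)
print(props["druid.service"], int(props["druid.port"]))
print(props["druid.db.connector.connectURI"])
```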
|
||||
|
||||
To start the coordinator node:
|
||||
|
@ -144,7 +132,8 @@ java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/
|
|||
|
||||
### Start a historical node ###
|
||||
|
||||
historical nodes are the workhorses of a cluster and are in charge of loading historical segments and making them available for queries. Our Wikipedia segment will be downloaded by a historical node.
|
||||
Historical nodes are the workhorses of a cluster and are in charge of loading historical segments and making them available for queries. Our Wikipedia segment will be downloaded by a historical node.
|
||||
For more information about Historical nodes, see [here](Historical.html).
|
||||
|
||||
To create the historical config file:
|
||||
|
||||
|
@ -155,34 +144,21 @@ mkdir config/historical
|
|||
Under the directory we just created, create the file `runtime.properties` with the following contents:
|
||||
|
||||
```
|
||||
druid.host=127.0.0.1:8081
|
||||
druid.port=8081
|
||||
druid.host=localhost
|
||||
druid.service=historical
|
||||
druid.port=8081
|
||||
|
||||
# logging
|
||||
com.metamx.emitter.logging=true
|
||||
com.metamx.emitter.logging.level=info
|
||||
|
||||
# zk
|
||||
druid.zk.service.host=localhost
|
||||
druid.zk.paths.base=/druid
|
||||
druid.zk.paths.discoveryPath=/druid/discoveryPath
|
||||
|
||||
# processing
|
||||
druid.s3.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
|
||||
druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ
|
||||
|
||||
druid.server.maxSize=100000000
|
||||
|
||||
druid.processing.buffer.sizeBytes=10000000
|
||||
|
||||
# aws (demo user)
|
||||
com.metamx.aws.accessKey=AKIAIMKECRUYKDQGR6YQ
|
||||
com.metamx.aws.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
|
||||
|
||||
# Path on local FS for storage of segments; dir will be created if needed
|
||||
druid.paths.indexCache=/tmp/druid/indexCache
|
||||
|
||||
# Path on local FS for storage of segment metadata; dir will be created if needed
|
||||
druid.paths.segmentInfoCache=/tmp/druid/segmentInfoCache
|
||||
|
||||
# server
|
||||
druid.server.maxSize=100000000
|
||||
druid.segmentCache.infoPath=/tmp/druid/segmentInfoCache
|
||||
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize"\: 100000000}]
|
||||
```
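The historical config states a size in two places: `druid.server.maxSize` and the `maxSize` of each entry in `druid.segmentCache.locations`. The following Python sketch compares the two numbers from the config above. Treating "server size fits in the cache" as a requirement is an assumption made for illustration, not a documented rule.

```python
import json

# druid.server.maxSize from the config above.
server_max_size = 100000000

# Value of druid.segmentCache.locations with the properties-file
# backslash escape before ':' removed, so that it is plain JSON.
locations = json.loads(
    '[{"path": "/tmp/druid/indexCache", "maxSize": 100000000}]'
)

cache_capacity = sum(location["maxSize"] for location in locations)
fits = server_max_size <= cache_capacity

print("cache capacity:", cache_capacity)
print("server maxSize:", server_max_size)
print("fits:", fits)
```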
|
||||
|
||||
To start the historical node:
|
||||
|
@ -194,6 +170,7 @@ java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/
|
|||
### Start a Broker Node ###
|
||||
|
||||
Broker nodes are responsible for figuring out which historical and/or realtime nodes correspond to which queries. They also merge partial results from these nodes in a scatter/gather fashion.
|
||||
For more information about Broker nodes, see [here](Broker.html).
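The gather half of scatter/gather can be pictured as combining per-node partial results for the same time buckets. This is a deliberately simplified Python sketch with hypothetical data and a hypothetical helper name. How the Broker actually merges results depends on the query type and is not described in this section.

```python
from collections import defaultdict


def merge_partial_counts(partials):
    """Gather step: sum the per-timestamp counts returned by each node."""
    merged = defaultdict(int)
    for node_result in partials:
        for timestamp, count in node_result.items():
            merged[timestamp] += count
    # Sort by timestamp so the combined result reads chronologically.
    return dict(sorted(merged.items()))


# Hypothetical partial results, keyed by ISO-8601 hour bucket.
historical_partial = {"2013-08-01T00:00:00Z": 120, "2013-08-01T01:00:00Z": 95}
realtime_partial = {"2013-08-01T01:00:00Z": 5, "2013-08-01T02:00:00Z": 40}

merged = merge_partial_counts([historical_partial, realtime_partial])
print(merged)
```

The bucket both nodes report (`01:00`) is summed, while buckets known to only one node pass through unchanged.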
|
||||
|
||||
To create the broker config file:
|
||||
|
||||
|
@ -204,27 +181,17 @@ mkdir config/broker
|
|||
Under the directory we just created, create the file `runtime.properties` with the following contents:
|
||||
|
||||
```
|
||||
druid.host=127.0.0.1:8080
|
||||
druid.port=8080
|
||||
druid.host=localhost
|
||||
druid.service=broker
|
||||
druid.port=8080
|
||||
|
||||
# logging
|
||||
com.metamx.emitter.logging=true
|
||||
com.metamx.emitter.logging.level=info
|
||||
|
||||
# zk
|
||||
druid.zk.service.host=localhost
|
||||
druid.zk.paths.base=/druid
|
||||
druid.zk.paths.discoveryPath=/druid/discoveryPath
|
||||
|
||||
# thread pool size for servicing queries
|
||||
druid.client.http.connections=10
|
||||
```
|
||||
|
||||
To start the broker node:
|
||||
|
||||
```bash
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/broker com.metamx.druid.http.BrokerMain
|
||||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/broker io.druid.cli.Main server broker
|
||||
```
|
||||
|
||||
## Loading the Data ##
|
||||
|
@ -251,9 +218,9 @@ When the segment completes downloading and ready for queries, you should see the
|
|||
2013-08-08 22:48:41,959 INFO [ZkCoordinator-0] com.metamx.druid.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z] at path[/druid/segments/127.0.0.1:8081/2013-08-08T22:48:41.959Z]
|
||||
```
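The segment identifier in the log line above packs several fields into one string. The following is a minimal Python sketch that splits it, assuming the layout `dataSource_start_end_version` inferred from that log line, with no further suffix. The helper name is hypothetical.

```python
def parse_segment_id(segment_id):
    """Split 'dataSource_start_end_version' from the right, so that
    underscores inside the data source name are preserved."""
    data_source, start, end, version = segment_id.rsplit("_", 3)
    return {
        "dataSource": data_source,
        "start": start,
        "end": end,
        "version": version,
    }


segment = parse_segment_id(
    "wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z"
)
print(segment)
```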
|
||||
|
||||
At this point, we can query the segment. For more information on querying, see this [link](https://github.com/metamx/druid/wiki/Querying).
|
||||
At this point, we can query the segment. For more information on querying, see this [link](Querying.html).
|
||||
|
||||
## Next Steps ##
|
||||
|
||||
Now that you have an understanding of what the Druid cluster looks like, why not load some of your own data?
|
||||
Check out the [Loading Your Own Data](https://github.com/metamx/druid/wiki/Loading-Your-Data) section for more info!
|
||||
Check out the [Loading Your Own Data](Loading-Your-Data.html) section for more info!
|
||||
|
|
|
@ -37,7 +37,7 @@ There are two ways to setup Druid: download a tarball, or [Build From Source](Bu
|
|||
|
||||
h3. Download a Tarball
|
||||
|
||||
We've built a tarball that contains everything you'll need. You'll find it [here](http://static.druid.io/artifacts/releases/druid-services-0.5.50-bin.tar.gz)
|
||||
We've built a tarball that contains everything you'll need. You'll find it [here](http://static.druid.io/artifacts/releases/druid-services-0.6.0-bin.tar.gz)
|
||||
Download this file to a directory of your choosing.
|
||||
You can extract the awesomeness within by issuing:
|
||||
|
||||
|
@ -48,7 +48,7 @@ tar zxvf druid-services-*-bin.tar.gz
|
|||
Not too lost so far right? That's great! If you cd into the directory:
|
||||
|
||||
```
|
||||
cd druid-services-0.5.50
|
||||
cd druid-services-0.6.0
|
||||
```
|
||||
|
||||
You should see a bunch of files:
|
||||
|
@ -68,9 +68,8 @@ Select "webstream".
|
|||
Once the node starts up you will see a bunch of logs about setting up properties and connecting to the data source. If everything was successful, you should see messages of the form shown below.
|
||||
|
||||
```
|
||||
2013-07-19 21:54:05,154 INFO com.metamx.druid.realtime.RealtimeNode~~ Starting Jetty
|
||||
2013-07-19 21:54:05,154 INFO org.mortbay.log - jetty-6.1.x
|
||||
2013-07-19 21:54:05,171 INFO com.metamx.druid.realtime.plumber.RealtimePlumberSchool - Expect to run at
|
||||
Jul 19, 2013 21:54:05 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
|
||||
INFO: Binding io.druid.server.StatusResource to GuiceManagedComponentProvider with the scope "Undefined"
|
||||
2013-07-19 21:54:05,246 INFO org.mortbay.log - Started SelectChannelConnector@0.0.0.0:8083
|
||||
```
|
||||
|
||||
|
|
|
@ -9,16 +9,16 @@ There are two ways to setup Druid: download a tarball, or build it from source.
|
|||
|
||||
h3. Download a Tarball
|
||||
|
||||
We've built a tarball that contains everything you'll need. You'll find it "here":http://static.druid.io/data/examples/druid-services-0.4.6.tar.gz.
|
||||
We've built a tarball that contains everything you'll need. You'll find it "here":http://static.druid.io/data/examples/druid-services-0.6.0.tar.gz.
|
||||
Download this bad boy to a directory of your choosing.
|
||||
|
||||
You can extract the awesomeness within by issuing:
|
||||
|
||||
pre. tar -zxvf druid-services-0.4.6.tar.gz
|
||||
pre. tar -zxvf druid-services-0.6.0.tar.gz
|
||||
|
||||
Not too lost so far right? That's great! If you cd into the directory:
|
||||
|
||||
pre. cd druid-services-0.4.6-SNAPSHOT
|
||||
pre. cd druid-services-0.6.0-SNAPSHOT
|
||||
|
||||
You should see a bunch of files:
|
||||
* run_example_server.sh
|
||||
|
@ -31,7 +31,7 @@ The other way to setup Druid is from source via git. To do so, run these command
|
|||
|
||||
<pre><code>git clone git@github.com:metamx/druid.git
|
||||
cd druid
|
||||
git checkout druid-0.4.32-branch
|
||||
git checkout druid-0.6.0
|
||||
./build.sh
|
||||
</code></pre>
|
||||
|
||||
|
|
|
@ -9,4 +9,6 @@ druid.s3.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
|
|||
|
||||
druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
|
||||
druid.db.connector.user=druid
|
||||
druid.db.connector.password=diurd
|
||||
druid.db.connector.password=diurd
|
||||
|
||||
druid.coordinator.startDelay=PT60s
|
|
@ -31,7 +31,7 @@ public abstract class DruidCoordinatorConfig
|
|||
public abstract String getHost();
|
||||
|
||||
@Config("druid.coordinator.startDelay")
|
||||
@Default("PT120s")
|
||||
@Default("PT300s")
|
||||
public abstract Duration getCoordinatorStartDelay();
|
||||
|
||||
@Config("druid.coordinator.period")
|
||||
|
|