---
layout: doc_page
---
# Tutorial: The Druid Cluster
Welcome back! In our first [tutorial](Tutorial%3A-A-First-Look-at-Druid.html), we introduced you to the most basic Druid setup: a single realtime node. We streamed in some data and queried it. Realtime nodes collect very recent data and periodically hand that data off to the rest of the Druid cluster. This naturally raises some questions about the architecture: what does the rest of the Druid cluster look like, and how does Druid load static data that is already available?

This tutorial will hopefully answer these questions!

In this tutorial, we will set up the other types of Druid nodes and the external dependencies needed for a fully functional Druid cluster. The architecture of Druid is very much like the [Megazord](http://www.youtube.com/watch?v=7mQuHh1X4H4) from the popular 90s show Mighty Morphin' Power Rangers. Each Druid node has a specific purpose, and the nodes come together to form a fully functional system.
## Downloading Druid
If you followed the first tutorial, you should already have Druid downloaded. If not, let's go back and do that first.

You can download the latest version of Druid [here](http://static.druid.io/artifacts/releases/druid-services-0.6.143-bin.tar.gz) and untar the contents by issuing:
```bash
tar -zxvf druid-services-*-bin.tar.gz
cd druid-services-*
```
You can also [Build From Source](Build-from-source.html).
## External Dependencies
Druid requires three external dependencies: "deep" storage that acts as a backup data repository, a relational database such as MySQL to hold configuration and metadata, and [Apache Zookeeper](http://zookeeper.apache.org/) for coordination among the different pieces of the cluster.

For deep storage, we have made a public S3 bucket (static.druid.io) available, from which the data for this particular tutorial can be downloaded. More on the data later.
#### Setting up MySQL
1. If you don't already have it, download MySQL Community Server here: [http://dev.mysql.com/downloads/mysql/](http://dev.mysql.com/downloads/mysql/).
2. Install MySQL.
3. Create a druid user and database.
```bash
mysql -u root
```
```sql
GRANT ALL ON druid.* TO 'druid'@'localhost' IDENTIFIED BY 'diurd';
CREATE database druid;
```
#### Setting up Zookeeper
```bash
curl http://apache.osuosl.org/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz -o zookeeper-3.4.5.tar.gz
tar xzf zookeeper-3.4.5.tar.gz
cd zookeeper-3.4.5
cp conf/zoo_sample.cfg conf/zoo.cfg
./bin/zkServer.sh start
cd ..
```
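Before moving on, it can help to confirm that both services are actually listening. The following is a minimal sketch, not part of Druid itself: the helper names `port_open` and `check_dependencies` are made up for illustration, and the ports (2181 for Zookeeper, 3306 for MySQL) are assumed defaults that may differ on your machine.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed default ports; adjust them to match your installation.
DEPENDENCIES = {"ZooKeeper": 2181, "MySQL": 3306}


def check_dependencies(host: str = "localhost") -> dict:
    """Map each dependency name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in DEPENDENCIES.items()}


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

A reachable port only shows that something is listening there; it does not prove that the service is healthy or that the `druid` user can log in.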
## The Data
Similar to the first tutorial, the data we will be loading is based on edits that have occurred on Wikipedia. Every time someone edits a page in Wikipedia, metadata is generated about the editor and the edited page. Druid collects each individual event and packages them together in a container known as a [segment](Segments.html). Segments contain data over some span of time. We've prebuilt a segment for this tutorial and will cover making your own segments in other [pages](Tutorial%3A-Loading-Your-Data-Part-1.html).

The segment we are going to work with has the following format:

Dimensions (things to filter on):
```json
"page"
"language"
"user"
"unpatrolled"
"newPage"
"robot"
"anonymous"
"namespace"
"continent"
"country"
"region"
"city"
```
Metrics (things to aggregate over):
```json
"count"
"added"
"delta"
"deleted"
```
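To make the distinction between dimensions and metrics concrete, here is a small sketch in plain Python. It is not how Druid stores or queries data; the sample events and the `aggregate` helper are invented for illustration, and only the column names come from the lists above.

```python
from collections import defaultdict

# Hypothetical edit events, invented for illustration only.
events = [
    {"page": "Druid", "language": "en", "robot": "false",
     "count": 1, "added": 120, "deleted": 0, "delta": 120},
    {"page": "Druid", "language": "en", "robot": "true",
     "count": 1, "added": 0, "deleted": 30, "delta": -30},
    {"page": "Megazord", "language": "fr", "robot": "false",
     "count": 1, "added": 45, "deleted": 5, "delta": 40},
]


def aggregate(events, metric, filters=None, group_by=None):
    """Sum `metric` over events matching `filters`, per `group_by` value if given."""
    filters = filters or {}
    totals = defaultdict(int)
    for event in events:
        # Dimensions are used to decide which events take part.
        if all(event.get(dim) == value for dim, value in filters.items()):
            key = event[group_by] if group_by else "all"
            # Metrics are the numbers that get combined.
            totals[key] += event[metric]
    return dict(totals)


if __name__ == "__main__":
    print(aggregate(events, "added", filters={"language": "en"}))
    print(aggregate(events, "count", group_by="page"))
```

Filtering on `language` and summing `added` mirrors the "filter on dimensions, aggregate over metrics" split described above.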
## The Cluster
Let's start up a few nodes and download our data. First, let's make sure we have configs in the config directory for our various nodes. Issue the following from the Druid home directory:
```
ls config
```
If you are interested in learning more about Druid configuration files, check out this [link](Configuration.html). Many aspects of Druid are customizable. For the purposes of this tutorial, we are going to use default values for most things.
#### Start a Coordinator Node
Coordinator nodes are in charge of load assignment and distribution. They monitor the status of the cluster and command historical nodes to load and drop segments.
For more information about coordinator nodes, see [here](Coordinator.html).

The coordinator config file should already exist at:
```
config/coordinator
```
In the directory, there should be a `runtime.properties` file with the following contents:
```
druid.host=localhost
druid.service=coordinator
druid.port=8082
druid.zk.service.host=localhost
druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
druid.db.connector.user=druid
druid.db.connector.password=diurd
druid.coordinator.startDelay=PT70s
```
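The backslashes in `connectURI` are escapes from the Java properties format, where `\:` simply stands for `:`. The sketch below shows how such a file reads once unescaped. It is a deliberately simplified parser written for this tutorial (the function name `parse_properties` is hypothetical), not the loader Druid uses, and it ignores line continuations and escapes such as `\n` or `\uXXXX`.

```python
import re


def parse_properties(text: str) -> dict:
    r"""Parse a small subset of the Java .properties format.

    Handles `key=value` lines, `#` comments, and simple backslash escapes
    such as `\:`. Line continuations and `\n` or `\uXXXX` escapes are not
    supported in this sketch.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Treat a backslash followed by any character as that character.
        props[key.strip()] = re.sub(r"\\(.)", r"\1", value.strip())
    return props


SAMPLE = r"""
druid.host=localhost
druid.service=coordinator
druid.port=8082
druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
druid.coordinator.startDelay=PT70s
"""

if __name__ == "__main__":
    for key, value in parse_properties(SAMPLE).items():
        print(f"{key} -> {value}")
```

Read this way, the connection string is an ordinary JDBC URL pointing at the `druid` database created earlier.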
To start the coordinator node:
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/coordinator io.druid.cli.Main server coordinator
```
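The coordinator's job of deciding what historical nodes should load and drop can be pictured as a comparison between two sets of segments. The sketch below is a heavily simplified illustration under that assumption; `plan_assignments` and the short segment names are hypothetical, and the real coordinator does considerably more than a set difference.

```python
def plan_assignments(should_be_loaded: set, currently_served: set) -> dict:
    """Compare the desired segments with the served ones.

    Returns the segments to load (wanted but not served) and the segments
    to drop (served but no longer wanted), each as a sorted list.
    """
    return {
        "load": sorted(should_be_loaded - currently_served),
        "drop": sorted(currently_served - should_be_loaded),
    }


if __name__ == "__main__":
    # Hypothetical segment names, shortened for readability.
    wanted = {"wikipedia_2013-08-01", "wikipedia_2013-08-02"}
    served = {"wikipedia_2013-08-02", "wikipedia_2013-07-31"}
    print(plan_assignments(wanted, served))
```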
#### Start a Historical Node
Historical nodes are the workhorses of a cluster and are in charge of loading historical segments and making them available for queries. Our Wikipedia segment will be downloaded by a historical node.
For more information about historical nodes, see [here](Historical.html).

The historical config file should exist at:
```
config/historical
```
In the directory, there should be a `runtime.properties` file with the following contents:
```
druid.host=localhost
druid.service=historical
druid.port=8081
druid.zk.service.host=localhost
druid.extensions.coordinates=["io.druid.extensions:druid-s3-extensions:0.6.143"]
# Dummy read only AWS account (used to download example data)
druid.s3.secretKey=QyyfVZ7llSiRg6Qcrql1eEUG7buFpAK6T6engr1b
druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ
druid.server.maxSize=10000000000
# Change these to make Druid faster
druid.processing.buffer.sizeBytes=100000000
druid.processing.numThreads=1
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize"\: 10000000000}]
```
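The `druid.segmentCache.locations` value is a JSON list once the properties escape `\:` is removed. As a sanity check, the sketch below sums the cache sizes and compares the total with `druid.server.maxSize`. Treating the two numbers as related is an assumption made here for illustration, and `total_cache_bytes` is a hypothetical helper, not a Druid API.

```python
import json


def total_cache_bytes(locations_value: str) -> int:
    """Sum the maxSize of every segment cache location.

    `locations_value` is the raw property value, which may still contain
    the properties-file escape `\\:` in place of `:`.
    """
    unescaped = locations_value.replace("\\:", ":")
    locations = json.loads(unescaped)
    return sum(location["maxSize"] for location in locations)


# Values copied from the historical runtime.properties above.
LOCATIONS = r'[{"path": "/tmp/druid/indexCache", "maxSize"\: 10000000000}]'
SERVER_MAX_SIZE = 10000000000

if __name__ == "__main__":
    cache_bytes = total_cache_bytes(LOCATIONS)
    print(f"cache capacity: {cache_bytes} bytes")
    print(f"server.maxSize fits in cache: {SERVER_MAX_SIZE <= cache_bytes}")
```

With the values above, both numbers are 10000000000 bytes, so the node advertises exactly as much capacity as its single cache location allows.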
To start the historical node:
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/historical io.druid.cli.Main server historical
```
#### Start a Broker Node
Broker nodes are responsible for figuring out which historical and/or realtime nodes hold the data relevant to each query. They also merge partial results from these nodes in a scatter/gather fashion.
For more information about broker nodes, see [here](Broker.html).

The broker config file should exist at:
```
config/broker
```
In the directory, there should be a `runtime.properties` file with the following contents:
```
druid.host=localhost
druid.service=broker
druid.port=8080
druid.zk.service.host=localhost
```
To start the broker node:
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/broker io.druid.cli.Main server broker
```
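The scatter/gather merging that brokers perform can be illustrated with a toy example. The sketch below assumes each node returns a per-language sum for its own slice of the data and that the partial sums can simply be added together; the node results and the `merge_partial_results` helper are invented, and real brokers handle many more query types than additive sums.

```python
from collections import Counter


def merge_partial_results(partials):
    """Add up per-key partial sums returned by several nodes."""
    merged = Counter()
    for partial in partials:
        # Counter.update adds values for keys that are already present.
        merged.update(partial)
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical partial results: edit counts per language from two nodes.
    from_historical = {"en": 100, "fr": 20}
    from_realtime = {"en": 5, "de": 3}
    print(merge_partial_results([from_historical, from_realtime]))
```

This only works for aggregates that combine by addition, such as counts and sums; an average, for example, would need its numerator and denominator merged separately.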
## Loading the Data
The MySQL database we set up earlier contains a segments table with entries for the segments that should be loaded into our cluster. The Druid coordinator compares this table with the segments that already exist in the cluster to determine what should be loaded and dropped. To load our Wikipedia segment, we need to create an entry in this table.

Usually, when new segments are created, these MySQL entries are created directly, so you never have to do this by hand. For this tutorial, we can do this manually by going back into MySQL and issuing:
```sql
use druid;
INSERT INTO druid_segments (id, dataSource, created_date, start, end, partitioned, version, used, payload) VALUES ('wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z', 'wikipedia', '2013-08-08T21:26:23.799Z', '2013-08-01T00:00:00.000Z', '2013-08-02T00:00:00.000Z', '0', '2013-08-08T21:22:48.989Z', '1', '{\"dataSource\":\"wikipedia\",\"interval\":\"2013-08-01T00:00:00.000Z/2013-08-02T00:00:00.000Z\",\"version\":\"2013-08-08T21:22:48.989Z\",\"loadSpec\":{\"type\":\"s3_zip\",\"bucket\":\"static.druid.io\",\"key\":\"data/segments/wikipedia/20130801T000000.000Z_20130802T000000.000Z/2013-08-08T21_22_48.989Z/0/index.zip\"},\"dimensions\":\"dma_code,continent_code,geo,area_code,robot,country_name,network,city,namespace,anonymous,unpatrolled,page,postal_code,language,newpage,user,region_lookup\",\"metrics\":\"count,delta,variation,added,deleted\",\"shardSpec\":{\"type\":\"none\"},\"binaryVersion\":9,\"size\":24664730,\"identifier\":\"wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z\"}');
```
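The long `id` in that row is not arbitrary: in this example it is the data source, interval start, interval end, and version joined with underscores, and the payload's `interval` is the start and end joined with a slash. The sketch below reproduces both from the values in the statement above. The helper names are hypothetical, and the pattern is only what can be read off this one row, so other segments may be named differently.

```python
def segment_id(data_source: str, start: str, end: str, version: str) -> str:
    """Join the parts with underscores, as seen in the example row."""
    return "_".join([data_source, start, end, version])


def segment_interval(start: str, end: str) -> str:
    """Join start and end with a slash, as in the payload's `interval`."""
    return f"{start}/{end}"


# Values copied from the INSERT statement above.
DATA_SOURCE = "wikipedia"
START = "2013-08-01T00:00:00.000Z"
END = "2013-08-02T00:00:00.000Z"
VERSION = "2013-08-08T21:22:48.989Z"

if __name__ == "__main__":
    print(segment_id(DATA_SOURCE, START, END, VERSION))
    print(segment_interval(START, END))
```

The same identifier shows up again in the coordinator and historical node logs, which makes it easy to follow one segment through the cluster.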
If you look in your coordinator node logs, you should see logs of the following form within a minute or so:
```
2013-08-08 22:48:41,967 INFO [main-EventThread] com.metamx.druid.coordinator.LoadQueuePeon - Server[/druid/loadQueue/127.0.0.1:8081] done processing [/druid/loadQueue/127.0.0.1:8081/wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z]
2013-08-08 22:48:41,969 INFO [ServerInventoryView-0] com.metamx.druid.client.SingleServerInventoryView - Server[127.0.0.1:8081] added segment[wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z]
```
When the segment finishes downloading and is ready for queries, you should see the following message in your historical node logs:
```
2013-08-08 22:48:41,959 INFO [ZkCoordinator-0] com.metamx.druid.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z] at path[/druid/segments/127.0.0.1:8081/2013-08-08T22:48:41.959Z]
```
At this point, we can query the segment. For more information on querying, see this [link](Querying.html).
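As a rough idea of what such a query could look like, the sketch below builds a timeseries query that sums the `count` metric over the day covered by our segment and posts it to the broker. The endpoint path `/druid/v2/?pretty` on the broker's port 8080 is an assumption here, and the helper names are made up for illustration; treat the linked querying page as the authority on query syntax.

```python
import json
import urllib.request

# Assumed broker endpoint; the port comes from config/broker above.
BROKER_URL = "http://localhost:8080/druid/v2/?pretty"


def build_timeseries_query(data_source: str, interval: str, metric: str) -> dict:
    """Build a timeseries query that sums one metric over one interval."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": metric}
        ],
    }


def run_query(query: dict, url: str = BROKER_URL):
    """POST the query as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    query = build_timeseries_query("wikipedia", "2013-08-01/2013-08-02", "count")
    print(json.dumps(query, indent=2))
    # Uncomment once the broker and historical nodes are running:
    # print(run_query(query))
```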
### Bonus Round: Start a Realtime Node
To start the realtime node that was used in our first tutorial, you simply have to issue:
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.realtime.specFile=examples/wikipedia/wikipedia_realtime.spec -classpath lib/*:config/realtime io.druid.cli.Main server realtime
```
The configurations are located in `config/realtime/runtime.properties` and should contain the following:
```
druid.host=localhost
druid.service=realtime
druid.port=8083
druid.zk.service.host=localhost
druid.extensions.coordinates=["io.druid.extensions:druid-examples:0.6.143","io.druid.extensions:druid-kafka-seven:0.6.143"]
# Change this config to db to hand off to the rest of the Druid cluster
druid.publish.type=noop
# These configs are only required for real hand off
# druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
# druid.db.connector.user=druid
# druid.db.connector.password=diurd
druid.processing.buffer.sizeBytes=100000000
druid.processing.numThreads=1
druid.monitoring.monitors=["io.druid.segment.realtime.RealtimeMetricsMonitor"]
```
Next Steps
----------
If you are interested in how data flows through the different Druid components, check out the [Druid data flow architecture](Design.html). Now that you have an understanding of what the Druid cluster looks like, why not load some of your own data?
Check out the next [tutorial](Tutorial%3A-Loading-Your-Data-Part-1.html) section for more info!