Make more of the docs look and work correctly. Yay! Almost done with this!

cheddar 2013-09-27 12:57:08 -05:00
parent 86ddc7da7b
commit ca5c941560
12 changed files with 377 additions and 685 deletions

View File

@ -1 +0,0 @@
cheddar@ChedHeads-MacBook-Pro-2.local.61986

View File

@ -2,9 +2,6 @@
<html lang="en">
<head>
{% include site_head.html %}
<link rel="stylesheet" href="/css/main.css">
<link rel="stylesheet" href="/css/header.css">
<link rel="stylesheet" href="/css/footer.css">
<link rel="stylesheet" href="css/docs.css">
</head>

View File

@ -1,114 +0,0 @@
---
layout: doc_page
---
A Druid cluster consists of various node types that need to be set up depending on your use case. See our [Design](Design.html) docs for a description of the different node types.
Setup Scripts
-------------
One of our community members, [housejester](https://github.com/housejester/), contributed some scripts to help with setting up a cluster. Check out the [GitHub repository](https://github.com/housejester/druid-test-harness) and [wiki](https://github.com/housejester/druid-test-harness/wiki/Druid-Test-Harness).
Minimum Physical Layout: Absolute Minimum
-----------------------------------------
As a special case, the absolute minimum setup is one of the standalone examples for realtime ingestion and querying; see [Examples](Examples.html) that can easily run on one machine with one core and 1GB RAM. This layout can be set up to try some basic queries with Druid.
Minimum Physical Layout: Experimental Testing with 4GB of RAM
-------------------------------------------------------------
This layout can be used to load some data from deep storage onto a Druid compute node for the first time. A minimal physical layout for a 1 or 2 core machine with 4GB of RAM is:
1. node1: [Master](Master.html) + metadata service + zookeeper + [Compute](Compute.html)
2. transient nodes: indexer
This setup is only reasonable to prove that a configuration works. It would not be worthwhile to use this layout for performance measurement.
Comfortable Physical Layout: Pilot Project with Multiple Machines
-----------------------------------------------------------------
*The machine size “flavors” use AWS/EC2 terminology for descriptive purposes only and are not meant to imply that AWS/EC2 is required or recommended. Another cloud provider or your own hardware can also work.*
A minimal physical layout not constrained by cores that demonstrates parallel querying and realtime, using AWS-EC2 “small”/m1.small (one core, with 1.7GB of RAM) or larger, no realtime, is:
1. node1: [Master](Master.html) (m1.small)
2. node2: metadata service (m1.small)
3. node3: zookeeper (m1.small)
4. node4: [Broker](Broker.html) (m1.small or m1.medium or m1.large)
5. node5: [Compute](Compute.html) (m1.small or m1.medium or m1.large)
6. node6: [Compute](Compute.html) (m1.small or m1.medium or m1.large)
7. node7: [Realtime](Realtime.html) (m1.small or m1.medium or m1.large)
8. transient nodes: indexer
This layout naturally lends itself to adding more RAM and core to Compute nodes, and to adding many more Compute nodes. Depending on the actual load, the Master, metadata server, and Zookeeper might need to use larger machines.
High Availability Physical Layout
---------------------------------
*The machine size “flavors” use AWS/EC2 terminology for descriptive purposes only and are not meant to imply that AWS/EC2 is required or recommended. Another cloud provider or your own hardware can also work.*
An HA layout allows full rolling restarts and heavy volume:
1. node1: [Master](Master.html) (m1.small or m1.medium or m1.large)
2. node2: [Master](Master.html) (m1.small or m1.medium or m1.large) (backup)
3. node3: metadata service (c1.medium or m1.large)
4. node4: metadata service (c1.medium or m1.large) (backup)
5. node5: zookeeper (c1.medium)
6. node6: zookeeper (c1.medium)
7. node7: zookeeper (c1.medium)
8. node8: [Broker](Broker.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
9. node9: [Broker](Broker.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge) (backup)
10. node10: [Compute](Compute.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
11. node11: [Compute](Compute.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
12. node12: [Realtime](Realtime.html) (m1.small or m1.medium or m1.large or m2.xlarge or m2.2xlarge or m2.4xlarge)
13. transient nodes: indexer
Sizing for Cores and RAM
------------------------
The Compute and Broker nodes will use as many cores as are available, depending on usage, so it is best to keep these on dedicated machines. The upper limit of effectively utilized cores is not well characterized yet and would depend on types of queries, query load, and the schema. Compute daemons should have a heap size of at least 1GB per core for normal usage, but could be squeezed into a smaller heap for testing. Since in-memory caching is essential for good performance, even more RAM is better. Broker nodes will use RAM for caching, so they do more than just route queries.
The effective utilization of cores by Zookeeper, MySQL, and Master nodes is likely to be between 1 and 2 for each process/daemon, so these could potentially share a machine with lots of cores. These daemons work with a heap size between 500MB and 1GB.
Storage
-------
Indexed segments should be kept in a permanent store accessible by all nodes like AWS S3 or HDFS or equivalent. Currently Druid supports S3, but this will be extended soon.
Local disk (“ephemeral” on AWS EC2) for caching is recommended over network mounted storage (example of mounted: AWS EBS, Elastic Block Store) in order to avoid network delays during times of heavy usage. If your data center is suitably provisioned for networked storage, perhaps with separate LAN/NICs just for storage, then mounted might work fine.
Setup
-----
Setting up a cluster is essentially just firing up all of the nodes you want with the proper [configuration](configuration.html). One thing to be aware of is that there are a few properties in the configuration that potentially need to be set individually for each process:
<code>
druid.server.type=historical|realtime
druid.host=someHostOrIPaddrWithPort
druid.port=8080
</code>
`druid.server.type` should be set to “historical” for your compute nodes and realtime for the realtime nodes. The master will only assign segments to a “historical” node and the broker has some intelligence around its ability to cache results when talking to a realtime node. This does not need to be set for the master or the broker.
`druid.host` should be set to the hostname and port that can be used to talk to the given server process. Basically, someone should be able to send a request to http://\${druid.host}/ and actually talk to the process.
`druid.port` should be set to the port that the server should listen on. In the vast majority of cases, this port should be the same as what is on `druid.host`.
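As a rough illustration of the per-process settings described above, the following Python sketch renders the three properties for one node. The helper name `render_node_properties` and the example hostname are hypothetical and not part of Druid; only the property names and the `historical`/`realtime` values come from the text.

```python
def render_node_properties(server_type, host, port):
    """Render the per-process properties as runtime.properties lines.

    Hypothetical helper for illustration only; it is not part of Druid.
    """
    if server_type not in ("historical", "realtime"):
        raise ValueError("druid.server.type must be 'historical' or 'realtime'")
    lines = [
        f"druid.server.type={server_type}",
        # druid.host carries both the hostname and the port
        f"druid.host={host}:{port}",
        # druid.port normally matches the port used in druid.host
        f"druid.port={port}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # compute1.example.com is a placeholder hostname
    print(render_node_properties("historical", "compute1.example.com", 8080))
```

Generating both `druid.host` and `druid.port` from a single port argument keeps the two values in agreement, which is the common case the text describes.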
Build/Run
---------
The simplest way to build and run from the repository is to run `mvn package` from the base directory and then take `druid-services/target/druid-services-*-selfcontained.jar` and push that around to your machines; the jar does not need to be expanded, and since it contains the main() methods for each kind of service, it is **not** invoked with `java -jar`. It can be run from a normal java command-line by just including it on the classpath and then giving it the main class that you want to run. For example, one instance of the Compute node/service can be started like this:
<pre>
<code>
java -Duser.timezone=UTC -Dfile.encoding=UTF-8 -cp compute/:druid-services/target/druid-services-*-selfcontained.jar com.metamx.druid.http.ComputeMain
</code>
</pre>
The following table shows the possible services and fully qualified class for main().
|service|main class|
|-------|----------|
|[ Realtime ]( Realtime .html)|com.metamx.druid.realtime.RealtimeMain|
|[ Master ]( Master .html)|com.metamx.druid.http.MasterMain|
|[ Broker ]( Broker .html)|com.metamx.druid.http.BrokerMain|
|[ Compute ]( Compute .html)|com.metamx.druid.http.ComputeMain|
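
The table above can be read as a lookup from service name to main class. A minimal Python sketch of that lookup follows, assuming the jar path and config directory are passed in explicitly; the helper `build_command` and its arguments are hypothetical, while the class names and JVM flags are taken from this page.

```python
# Main classes copied from the table above.
MAIN_CLASSES = {
    "realtime": "com.metamx.druid.realtime.RealtimeMain",
    "master": "com.metamx.druid.http.MasterMain",
    "broker": "com.metamx.druid.http.BrokerMain",
    "compute": "com.metamx.druid.http.ComputeMain",
}


def build_command(service, jar_path, config_dir):
    """Return the java argument list for one service (illustrative helper)."""
    main_class = MAIN_CLASSES[service]  # KeyError for an unknown service
    return [
        "java",
        "-Duser.timezone=UTC",
        "-Dfile.encoding=UTF-8",
        # the config directory and the self-contained jar share one classpath
        "-cp", f"{config_dir}:{jar_path}",
        main_class,
    ]


if __name__ == "__main__":
    # The jar file name here is a placeholder, not a real release name.
    print(" ".join(build_command("compute", "druid-services-selfcontained.jar", "compute/")))
```

The result is a list, so it could be handed to `subprocess.run` without shell quoting concerns.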

View File

@ -21,7 +21,7 @@ The periodic time intervals (like “PT1M”) are [ISO8601 intervals](http://en.
An example runtime.properties is as follows:
<code>
```
# S3 access
com.metamx.aws.accessKey=<S3 access key>
com.metamx.aws.secretKey=<S3 secret_key>
@ -73,7 +73,7 @@ An example runtime.properties is as follows:
druid.bard.cache.sizeInBytes=40000000
druid.master.merger.service=blah_blah
</code>
```
Configuration groupings
-----------------------

View File

@ -8,6 +8,7 @@ Before we start querying druid, we're going to finish setting up a complete clus
## Booting a Broker Node ##
1. Setup a config file at config/broker/runtime.properties that looks like this:
```
druid.host=0.0.0.0:8083
druid.port=8083
@ -56,6 +57,7 @@ druid.client.http.connections=30
```
2. Run the broker node:
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-Ddruid.realtime.specFile=realtime.spec \
@ -66,7 +68,9 @@ com.metamx.druid.http.BrokerMain
## Booting a Master Node ##
1. Setup a config file at config/master/runtime.properties that looks like this: [https://gist.github.com/rjurney/5818870](https://gist.github.com/rjurney/5818870)
2. Run the master node:
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/master \
@ -78,7 +82,9 @@ com.metamx.druid.http.MasterMain
1. Setup a config file at config/realtime/runtime.properties that looks like this: [https://gist.github.com/rjurney/5818774](https://gist.github.com/rjurney/5818774)
2. Setup a realtime.spec file like this: [https://gist.github.com/rjurney/5818779](https://gist.github.com/rjurney/5818779)
3. Run the realtime node:
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-Ddruid.realtime.specFile=realtime.spec \
@ -90,6 +96,7 @@ com.metamx.druid.realtime.RealtimeMain
1. Setup a config file at config/compute/runtime.properties that looks like this: [https://gist.github.com/rjurney/5818885](https://gist.github.com/rjurney/5818885)
2. Run the compute node:
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-classpath services/target/druid-services-0.5.50-SNAPSHOT-selfcontained.jar:config/compute \
@ -107,6 +114,7 @@ As a shared-nothing system, there are three ways to query druid, against the [Re
### Construct a Query ###
For constructing this query, see: Querying against the realtime.spec
```json
{
"queryType": "groupBy",
@ -125,57 +133,52 @@ For constructing this query, see: Querying against the realtime.spec
### Querying the Realtime Node ###
Run our query against port 8080:
```bash
curl -X POST "http://localhost:8080/druid/v2/?pretty" \
-H 'content-type: application/json' -d @query.body
curl -X POST "http://localhost:8080/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body
```
See our result:
```json
[ {
"version" : "v1",
"timestamp" : "2010-01-01T00:00:00.000Z",
"event" : {
"imps" : 5,
"wp" : 15000.0,
"rows" : 5
}
"event" : { "imps" : 5, "wp" : 15000.0, "rows" : 5 }
} ]
```
### Querying the Compute Node ###
Run the query against port 8082:
```bash
curl -X POST "http://localhost:8082/druid/v2/?pretty" \
-H 'content-type: application/json' -d @query.body
curl -X POST "http://localhost:8082/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body
```
And get (similar to):
```json
[ {
"version" : "v1",
"timestamp" : "2010-01-01T00:00:00.000Z",
"event" : {
"imps" : 27,
"wp" : 77000.0,
"rows" : 9
}
"event" : { "imps" : 27, "wp" : 77000.0, "rows" : 9 }
} ]
```
### Querying both Nodes via the Broker ###
Run the query against port 8083:
```bash
curl -X POST "http://localhost:8083/druid/v2/?pretty" \
-H 'content-type: application/json' -d @query.body
curl -X POST "http://localhost:8083/druid/v2/?pretty" -H 'content-type: application/json' -d @query.body
```
And get:
```json
[ {
"version" : "v1",
"timestamp" : "2010-01-01T00:00:00.000Z",
"event" : {
"imps" : 5,
"wp" : 15000.0,
"rows" : 5
}
"event" : { "imps" : 5, "wp" : 15000.0, "rows" : 5 }
} ]
```
@ -221,6 +224,7 @@ How are we to know what queries we can run? Although [Querying](Querying.html) i
```json
"dataSource":"druidtest"
```
Our dataSource tells us the name of the relation/table, or 'source of data', to query in both our realtime.spec and query.body!
### aggregations ###
@ -277,48 +281,23 @@ Which gets us grouped data in return!
[ {
"version" : "v1",
"timestamp" : "2010-01-01T00:00:00.000Z",
"event" : {
"imps" : 1,
"age" : "100",
"wp" : 1000.0,
"rows" : 1
}
"event" : { "imps" : 1, "age" : "100", "wp" : 1000.0, "rows" : 1 }
}, {
"version" : "v1",
"timestamp" : "2010-01-01T00:00:00.000Z",
"event" : {
"imps" : 1,
"age" : "20",
"wp" : 3000.0,
"rows" : 1
}
"event" : { "imps" : 1, "age" : "20", "wp" : 3000.0, "rows" : 1 }
}, {
"version" : "v1",
"timestamp" : "2010-01-01T00:00:00.000Z",
"event" : {
"imps" : 1,
"age" : "30",
"wp" : 4000.0,
"rows" : 1
}
"event" : { "imps" : 1, "age" : "30", "wp" : 4000.0, "rows" : 1 }
}, {
"version" : "v1",
"timestamp" : "2010-01-01T00:00:00.000Z",
"event" : {
"imps" : 1,
"age" : "40",
"wp" : 5000.0,
"rows" : 1
}
"event" : { "imps" : 1, "age" : "40", "wp" : 5000.0, "rows" : 1 }
}, {
"version" : "v1",
"timestamp" : "2010-01-01T00:00:00.000Z",
"event" : {
"imps" : 1,
"age" : "50",
"wp" : 2000.0,
"rows" : 1
}
"event" : { "imps" : 1, "age" : "50", "wp" : 2000.0, "rows" : 1 }
} ]
```
@ -331,11 +310,7 @@ Now that we've observed our dimensions, we can also filter:
"queryType": "groupBy",
"dataSource": "druidtest",
"granularity": "all",
"filter": {
"type": "selector",
"dimension": "gender",
"value": "male"
},
"filter": { "type": "selector", "dimension": "gender", "value": "male" },
"aggregations": [
{"type": "count", "name": "rows"},
{"type": "longSum", "name": "imps", "fieldName": "impressions"},
@ -351,11 +326,7 @@ Which gets us just people aged 40:
[ {
"version" : "v1",
"timestamp" : "2010-01-01T00:00:00.000Z",
"event" : {
"imps" : 3,
"wp" : 9000.0,
"rows" : 3
}
"event" : { "imps" : 3, "wp" : 9000.0, "rows" : 3 }
} ]
```

View File

@ -13,28 +13,30 @@ Each event has a timestamp indicating the time of the edit (in UTC time), a list
Specifically, the data schema looks like so:
Dimensions (things to filter on):
\`\`\`json
“page”
“language”
“user”
“unpatrolled”
“newPage”
“robot”
“anonymous”
“namespace”
“continent”
“country”
“region”
“city”
\`\`\`
```json
"page"
"language"
"user"
"unpatrolled"
"newPage"
"robot"
"anonymous"
"namespace"
"continent"
"country"
"region"
"city"
```
Metrics (things to aggregate over):
\`\`\`json
“count”
“added”
“delta”
“deleted”
\`\`\`
```json
"count"
"added"
"delta"
"deleted"
```
These metrics track the number of characters added, deleted, and changed.
@ -50,115 +52,115 @@ Download this file to a directory of your choosing.
You can extract the awesomeness within by issuing:
```
tar -zxvf druid-services-*-bin.tar.gz
```
Not too lost so far, right? That's great! If you cd into the directory:
```
cd druid-services-0.5.54
```
You should see a bunch of files:
\* run\_example\_server.sh
\* run\_example\_client.sh
\* LICENSE, config, examples, lib directories
* run_example_server.sh
* run_example_client.sh
* LICENSE, config, examples, lib directories
Running Example Scripts
-----------------------
Let's start doing stuff. You can start a Druid [Realtime](Realtime.html) node by issuing:
```
./run_example_server.sh
```
Select “wikipedia”.
Select "wikipedia".
Once the node starts up you will see a bunch of logs about setting up properties and connecting to the data source. If everything was successful, you should see messages of the form shown below.
<code>
```
2013-07-19 21:54:05,154 INFO [main] com.metamx.druid.realtime.RealtimeNode - Starting Jetty
2013-07-19 21:54:05,154 INFO [main] org.mortbay.log - jetty-6.1.x
2013-07-19 21:54:05,171 INFO [chief-wikipedia] com.metamx.druid.realtime.plumber.RealtimePlumberSchool - Expect to run at [2013-07-19T22:03:00.000Z]
2013-07-19 21:54:05,246 INFO [main] org.mortbay.log - Started SelectChannelConnector@0.0.0.0:8083
</code>
```
The Druid real-time node ingests events into an in-memory buffer. Periodically, these events will be persisted to disk. If you are interested in the details of our real-time architecture and why we persist indexes to disk, I suggest you read our [White Paper](http://static.druid.io/docs/druid.pdf).
Okay, things are about to get real(~~time). To query the real-time node youve spun up, you can issue:
\<pre\>./run\_example\_client.sh\</pre\>
Select “wikipedia” once again. This script issues ]s to the data weve been ingesting. The query looks like this:
\`\`\`json
Okay, things are about to get real-time. To query the real-time node you've spun up, you can issue:
```
./run_example_client.sh
```
Select "wikipedia" once again. This script issues [GroupByQuery](GroupByQuery.html)s to the data we've been ingesting. The query looks like this:
```json
{
[queryType]("groupBy"),
[dataSource]("wikipedia"),
[granularity]("minute"),
[dimensions]([)
“page”
"queryType":"groupBy",
"dataSource":"wikipedia",
"granularity":"minute",
"dimensions":[ "page" ],
"aggregations":[
{"type":"count", "name":"rows"},
{"type":"longSum", "fieldName":"edit_count", "name":"count"}
],
[aggregations]([)
{
[type]("count"),
[name]("rows")
},
{
[type]("longSum"),
[fieldName]("edit_count"),
[name]("count")
"filter":{ "type":"selector", "dimension":"namespace", "value":"article" },
"intervals":[ "2013-06-01T00:00/2020-01-01T00" ]
}
],
[filter]({)
[type]("selector"),
[dimension]("namespace"),
[value]("article")
},
[intervals]([)
“2013-06-01T00:00/2020-01-01T00”
]
}
\`\`\`
This is a **groupBy** query, which you may be familiar with from SQL. We are grouping, or aggregating, via the **dimensions** field: . We are **filtering** via the **“namespace”** dimension, to only look at edits on **“articles”**. Our **aggregations** are what we are calculating: a count of the number of data rows, and a count of the number of edits that have occurred.
```
This is a **groupBy** query, which you may be familiar with from SQL. We are grouping, or aggregating, via the `dimensions` field: `["page"]`. We are **filtering** via the `namespace` dimension, to only look at edits on `articles`. Our **aggregations** are what we are calculating: a count of the number of data rows, and a count of the number of edits that have occurred.
The result looks something like this:
\`\`\`json
```json
[
{
[version]() “v1”,
[timestamp]() “2013-09-04T21:44:00.000Z”,
[event]() {
[count]() 0,
[page]() “2013\\u201314\_Brentford\_F.C.*season",
[rows]() 1
}
"version": "v1",
"timestamp": "2013-09-04T21:44:00.000Z",
"event": { "count": 0, "page": "2013\u201314_Brentford_F.C._season", "rows": 1 }
},
{
[version]() "v1",
[timestamp]() "2013-09-04T21:44:00.000Z",
[event]() {
[count]() 0,
[page]() "8e*00e9tape\_du\_Tour\_de\_France\_2013”,
[rows]() 1
}
"version": "v1",
"timestamp": "2013-09-04T21:44:00.000Z",
"event": { "count": 0, "page": "8e_\u00e9tape_du_Tour_de_France_2013", "rows": 1 }
},
{
[version]() “v1”,
[timestamp]() “2013-09-04T21:44:00.000Z”,
[event]() {
[count]() 0,
[page]() “Agenda\_of\_the\_Tea\_Party\_movement”,
[rows]() 1
}
"version": "v1",
"timestamp": "2013-09-04T21:44:00.000Z",
"event": { "count": 0, "page": "Agenda_of_the_Tea_Party_movement", "rows": 1 }
},
\`\`\`
...
```
This groupBy query is a bit complicated and we'll return to it later. For the time being, just make sure you are getting some blocks of data back. If you are having problems, make sure you have [curl](http://curl.haxx.se/) installed. Press Control+C to break out of the client script.
Querying Druid
--------------
In your favorite editor, create the file:
\<pre\>time\_boundary\_query.body\</pre\>
```
time_boundary_query.body
```
Druid queries are JSON blobs which are relatively painless to create programmatically, but an absolute pain to write by hand. So anyway, we are going to create a Druid query by hand. Add the following to the file you just created:
\<pre\><code>
```
{
[queryType]() “timeBoundary”,
[dataSource]() “wikipedia”
"queryType": "timeBoundary",
"dataSource": "wikipedia"
}
</code>\</pre\>
The ] is one of the simplest Druid queries. To run the query, you can issue:
\<pre\><code> curl~~X POST http://localhost:8083/druid/v2/?pretty ~~H content-type: application/json~~d ```` time_boundary_query.body</code></pre>
```
The [TimeBoundaryQuery](TimeBoundaryQuery.html) is one of the simplest Druid queries. To run the query, you can issue:
```
curl -X POST http://localhost:8083/druid/v2/?pretty -H 'content-type: application/json' -d @time_boundary_query.body
```
We get something like this JSON back:
@ -171,186 +173,146 @@ We get something like this JSON back:
}
} ]
```
As you can probably tell, the result is indicating the maximum and minimum timestamps we've seen thus far (summarized to a minutely granularity). Let's explore a bit further.
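For readers who prefer scripting to curl, the same POST can be sketched in Python with only the standard library. The helper names `build_query_request` and `run_query` are made up for this example; the URL and the content-type header are copied from the curl command above, and the sketch assumes the example node is listening on localhost:8083.

```python
import json
import urllib.request

# URL taken from the curl command in this tutorial.
DEFAULT_URL = "http://localhost:8083/druid/v2/?pretty"


def build_query_request(query, url=DEFAULT_URL):
    """Build (but do not send) a POST request carrying a JSON query."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, url=DEFAULT_URL):
    """Send the query and decode the JSON response (needs a running node)."""
    with urllib.request.urlopen(build_query_request(query, url)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    time_boundary = {"queryType": "timeBoundary", "dataSource": "wikipedia"}
    print(run_query(time_boundary))
```

Building the query as a dictionary and serializing it with `json.dumps` avoids the quoting mistakes that are easy to make when writing the JSON body by hand.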
Return to your favorite editor and create the file:
<pre>timeseries_query.body</pre>
```
timeseries_query.body
```
We are going to make a slightly more complicated query, the [TimeseriesQuery](TimeseriesQuery.html). Copy and paste the following into the file:
<pre><code>
```
{
"queryType": "timeseries",
"dataSource": "wikipedia",
"intervals": [
"2010-01-01/2020-01-01"
],
"intervals": [ "2010-01-01/2020-01-01" ],
"granularity": "all",
"aggregations": [
{
"type": "longSum",
"fieldName": "count",
"name": "edit_count"
},
{
"type": "doubleSum",
"fieldName": "added",
"name": "chars_added"
}
{"type": "longSum", "fieldName": "count", "name": "edit_count"},
{"type": "doubleSum", "fieldName": "added", "name": "chars_added"}
]
}
</code></pre>
```
You are probably wondering, what are these [Granularities](Granularities.html) and [Aggregations](Aggregations.html) things? What the query is doing is aggregating some metrics over some span of time.
To issue the query and get some results, run the following in your command line:
<pre><code>curl -X POST 'http://localhost:8083/druid/v2/?pretty' -H 'content-type: application/json' -d ````timeseries\_query.body</code>
</pre>
```
curl -X POST 'http://localhost:8083/druid/v2/?pretty' -H 'content-type: application/json' -d @timeseries_query.body
```
Once again, you should get a JSON blob of text back with your results, that looks something like this:
\`\`\`json
```json
[ {
“timestamp” : “2013-09-04T21:44:00.000Z”,
“result” : {
“chars\_added” : 312670.0,
“edit\_count” : 733
}
"timestamp" : "2013-09-04T21:44:00.000Z",
"result" : { "chars_added" : 312670.0, "edit_count" : 733 }
} ]
\`\`\`
```
If you issue the query again, you should notice your results updating.
Right now all the results you are getting back are being aggregated into a single timestamp bucket. What if we wanted to see our aggregations on a per minute basis? What field can we change in the query to accomplish this?
If you loudly exclaimed “we can change granularity to minute”, you are absolutely correct! We can specify different granularities to bucket our results, like so:
If you loudly exclaimed "we can change granularity to minute", you are absolutely correct! We can specify different granularities to bucket our results, like so:
<code>
```
{
"queryType": "timeseries",
"dataSource": "wikipedia",
"intervals": [
"2010-01-01/2020-01-01"
],
"intervals": [ "2010-01-01/2020-01-01" ],
"granularity": "minute",
"aggregations": [
{
"type": "longSum",
"fieldName": "count",
"name": "edit_count"
},
{
"type": "doubleSum",
"fieldName": "added",
"name": "chars_added"
}
{"type": "longSum", "fieldName": "count", "name": "edit_count"},
{"type": "doubleSum", "fieldName": "added", "name": "chars_added"}
]
}
</code>
```
This gives us something like the following:
\`\`\`json
```json
[
{
“timestamp” : “2013-09-04T21:44:00.000Z”,
“result” : {
“chars\_added” : 30665.0,
“edit\_count” : 128
}
}, {
“timestamp” : “2013-09-04T21:45:00.000Z”,
“result” : {
“chars\_added” : 122637.0,
“edit\_count” : 167
}
}, {
“timestamp” : “2013-09-04T21:46:00.000Z”,
“result” : {
“chars\_added” : 78938.0,
“edit\_count” : 159
}
"timestamp" : "2013-09-04T21:44:00.000Z",
"result" : { "chars_added" : 30665.0, "edit_count" : 128 }
},
\`\`\`
{
"timestamp" : "2013-09-04T21:45:00.000Z",
"result" : { "chars_added" : 122637.0, "edit_count" : 167 }
},
{
"timestamp" : "2013-09-04T21:46:00.000Z",
"result" : { "chars_added" : 78938.0, "edit_count" : 159 }
},
...
]
```
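To make the effect of the `granularity` field more concrete, here is a small Python model of a `longSum` aggregation bucketed either by minute or into a single bucket. It is only a conceptual sketch, not how Druid computes results: the function name, the sample events, and the use of the literal key `"all"` for the single bucket are assumptions made for illustration (Druid labels that bucket with a timestamp, as the results above show).

```python
from collections import defaultdict
from datetime import datetime


def long_sum_by_granularity(events, granularity):
    """Sum counts per time bucket.

    events: iterable of (timestamp string, count) pairs.
    granularity: "minute" or "all".
    """
    buckets = defaultdict(int)
    for ts, count in events:
        t = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")
        if granularity == "minute":
            # truncate the timestamp to the start of its minute
            key = t.replace(second=0).isoformat()
        elif granularity == "all":
            # everything falls into a single bucket
            key = "all"
        else:
            raise ValueError(f"unsupported granularity: {granularity}")
        buckets[key] += count
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    # Made-up sample events, not real tutorial data.
    sample = [
        ("2013-09-04T21:44:10", 2),
        ("2013-09-04T21:44:50", 3),
        ("2013-09-04T21:45:05", 1),
    ]
    print(long_sum_by_granularity(sample, "all"))
    print(long_sum_by_granularity(sample, "minute"))
```

With the sample events, the single-bucket call sums everything together, while the per-minute call yields one total for 21:44 and another for 21:45.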
Solving a Problem
-----------------
One of Druids main powers is to provide answers to problems, so lets pose a problem. What if we wanted to know what the top pages in the US are, ordered by the number of edits over the last few minutes youve been going through this tutorial? To solve this problem, we have to return to the query we introduced at the very beginning of this tutorial, the [GroupByQuery](GroupByQuery.html). It would be nice if we could group by results by dimension value and somehow sort those results… and it turns out we can!
One of Druid's main powers is to provide answers to problems, so let's pose a problem. What if we wanted to know what the top pages in the US are, ordered by the number of edits over the last few minutes you've been going through this tutorial? To solve this problem, we have to return to the query we introduced at the very beginning of this tutorial, the [GroupByQuery](GroupByQuery.html). It would be nice if we could group results by dimension value and somehow sort those results... and it turns out we can!
Let's create the file:
group_by_query.body</pre>
```
group_by_query.body
```
and put the following in there:
<pre><code>
```
{
"queryType": "groupBy",
"dataSource": "wikipedia",
"granularity": "all",
"dimensions": [
"page"
],
"dimensions": [ "page" ],
"orderBy": {
"type": "default",
"columns": [
{
"dimension": "edit_count",
"direction": "DESCENDING"
}
],
"columns": [ { "dimension": "edit_count", "direction": "DESCENDING" } ],
"limit": 10
},
"aggregations": [
{
"type": "longSum",
"fieldName": "count",
"name": "edit_count"
}
{"type": "longSum", "fieldName": "count", "name": "edit_count"}
],
"filter": {
"type": "selector",
"dimension": "country",
"value": "United States"
},
"intervals": [
"2012-10-01T00:00/2020-01-01T00"
]
"filter": { "type": "selector", "dimension": "country", "value": "United States" },
"intervals": ["2012-10-01T00:00/2020-01-01T00"]
}
</code>
```
Whoa! Our query just got way more complicated. Now we have these [Filters](Filters.html) things and this [OrderBy](OrderBy.html) thing. Fear not, it turns out the new objects we've introduced to our query can help define the format of our results and provide an answer to our question.
If you issue the query:
<code>curl -X POST 'http://localhost:8083/druid/v2/?pretty' -H 'content-type: application/json' -d @group_by_query.body</code>
```
curl -X POST 'http://localhost:8083/druid/v2/?pretty' -H 'content-type: application/json' -d @group_by_query.body
```
You should see an answer to our question. As an example, some results are shown below:
\`\`\`json
```json
[
{
“version” : “v1”,
“timestamp” : “2012-10-01T00:00:00.000Z”,
“event” : {
“page” : “RTC\_Transit”,
“edit\_count” : 6
}
}, {
“version” : “v1”,
“timestamp” : “2012-10-01T00:00:00.000Z”,
“event” : {
“page” : “List\_of\_Deadly\_Women\_episodes”,
“edit\_count” : 4
}
}, {
“version” : “v1”,
“timestamp” : “2012-10-01T00:00:00.000Z”,
“event” : {
“page” : “User\_talk:David\_Biddulph”,
“edit\_count” : 4
}
"version" : "v1",
"timestamp" : "2012-10-01T00:00:00.000Z",
"event" : { "page" : "RTC_Transit", "edit_count" : 6 }
},
\`\`\`
{
"version" : "v1",
"timestamp" : "2012-10-01T00:00:00.000Z",
"event" : { "page" : "List_of_Deadly_Women_episodes", "edit_count" : 4 }
},
{
"version" : "v1",
"timestamp" : "2012-10-01T00:00:00.000Z",
"event" : { "page" : "User_talk:David_Biddulph", "edit_count" : 4 }
},
...
```
Feel free to tweak other query parameters to answer other questions you may have about the data.
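As a way to reason about what the filter, grouping, ordering, and limit in the query above do together, here is a plain Python model of the same steps applied to a handful of in-memory events. It is a sketch for intuition only, not Druid's implementation; the function name, the sample events, and the tie-breaking by page name are assumptions.

```python
from collections import Counter


def top_pages(events, country, limit):
    """Filter by country, sum edits per page, sort descending, keep `limit`."""
    counts = Counter()
    for event in events:
        if event["country"] == country:            # the selector filter
            counts[event["page"]] += event["count"]  # longSum grouped by page
    # descending by edit count; ties broken by page name (an assumption)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:limit]


if __name__ == "__main__":
    # Made-up events for illustration.
    sample = [
        {"page": "A", "country": "United States", "count": 2},
        {"page": "B", "country": "United States", "count": 5},
        {"page": "A", "country": "Canada", "count": 9},
        {"page": "A", "country": "United States", "count": 1},
    ]
    print(top_pages(sample, "United States", 10))
```

Changing the `country` argument or the `limit` here mirrors tweaking the `filter` value or the `limit` field in the query body.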

View File

@ -14,6 +14,7 @@ If you followed the first tutorial, you should already have Druid downloaded. If
You can download the latest version of druid [here](http://static.druid.io/artifacts/releases/druid-services-0.5.54-bin.tar.gz)
and untar the contents within by issuing:
```bash
tar -zxvf druid-services-*-bin.tar.gz
cd druid-services-*
@ -32,15 +33,18 @@ For deep storage, we have made a public S3 bucket (static.druid.io) available wh
1. If you don't already have it, download MySQL Community Server here: [http://dev.mysql.com/downloads/mysql/](http://dev.mysql.com/downloads/mysql/)
2. Install MySQL
3. Create a druid user and database
```bash
mysql -u root
```
```sql
GRANT ALL ON druid.* TO 'druid'@'localhost' IDENTIFIED BY 'diurd';
CREATE database druid;
```
### Setting up Zookeeper ###
```bash
curl http://www.motorlogy.com/apache/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz -o zookeeper-3.4.5.tar.gz
tar xzf zookeeper-3.4.5.tar.gz
@ -55,6 +59,7 @@ cd ..
Similar to the first tutorial, the data we will be loading is based on edits that have occurred on Wikipedia. Every time someone edits a page in Wikipedia, metadata is generated about the editor and edited page. Druid collects each individual event and packages them together in a container known as a [segment](https://github.com/metamx/druid/wiki/Segments). Segments contain data over some span of time. We've prebuilt a segment for this tutorial and will cover making your own segments in other [pages](https://github.com/metamx/druid/wiki/Loading-Your-Data). The segment we are going to work with has the following format:
Dimensions (things to filter on):
```json
"page"
"language"
@ -71,6 +76,7 @@ Dimensions (things to filter on):
```
Metrics (things to aggregate over):
```json
"count"
"added"
@ -98,7 +104,7 @@ To create the master config file:
mkdir config/master
```
Under the directory we just created, create the file ```runtime.properties``` with the following contents:
Under the directory we just created, create the file `runtime.properties` with the following contents:
```
druid.host=127.0.0.1:8082
@ -146,7 +152,8 @@ To create the compute config file:
mkdir config/compute
```
Under the directory we just created, create the file ```runtime.properties``` with the following contents:
Under the directory we just created, create the file `runtime.properties` with the following contents:
```
druid.host=127.0.0.1:8081
druid.port=8081
@ -219,67 +226,17 @@ To start the broker node:
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/broker com.metamx.druid.http.BrokerMain
```
<!--
### Optional: Start a Realtime Node ###
```
druid.host=127.0.0.1:8083
druid.port=8083
druid.service=realtime
# logging
com.metamx.emitter.logging=true
com.metamx.emitter.logging.level=info
# zk
druid.zk.service.host=localhost
druid.zk.paths.base=/druid
druid.zk.paths.discoveryPath=/druid/discoveryPath
# processing
druid.processing.buffer.sizeBytes=10000000
# schema
druid.realtime.specFile=realtime.spec
# aws
com.metamx.aws.accessKey=dummy_access_key
com.metamx.aws.secretKey=dummy_secret_key
# db
druid.database.segmentTable=segments
druid.database.user=druid
druid.database.password=diurd
druid.database.connectURI=jdbc:mysql://localhost:3306/druid
druid.database.ruleTable=rules
druid.database.configTable=config
# Path on local FS for storage of segments; dir will be created if needed
druid.paths.indexCache=/tmp/druid/indexCache
# handoff
druid.pusher.s3.bucket=dummy_s3_bucket
druid.pusher.s3.baseKey=dummy_key
```
To start the realtime node:
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath services/target/druid-services-*-selfcontained.jar:config/realtime com.metamx.druid.realtime.RealtimeMain
```
-->
## Loading the Data ##
The MySQL dependency we introduced earlier on contains a 'segments' table that contains entries for segments that should be loaded into our cluster. The Druid master compares this table with segments that already exist in the cluster to determine what should be loaded and dropped. To load our wikipedia segment, we need to create an entry in our MySQL segment table.
Usually, when new segments are created, these MySQL entries are created directly so you never have to do this by hand. For this tutorial, we can do this manually by going back into MySQL and issuing:
```
``` sql
use druid;
```
``
INSERT INTO segments (id, dataSource, created_date, start, end, partitioned, version, used, payload) VALUES ('wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z', 'wikipedia', '2013-08-08T21:26:23.799Z', '2013-08-01T00:00:00.000Z', '2013-08-02T00:00:00.000Z', '0', '2013-08-08T21:22:48.989Z', '1', '{\"dataSource\":\"wikipedia\",\"interval\":\"2013-08-01T00:00:00.000Z/2013-08-02T00:00:00.000Z\",\"version\":\"2013-08-08T21:22:48.989Z\",\"loadSpec\":{\"type\":\"s3_zip\",\"bucket\":\"static.druid.io\",\"key\":\"data/segments/wikipedia/20130801T000000.000Z_20130802T000000.000Z/2013-08-08T21_22_48.989Z/0/index.zip\"},\"dimensions\":\"dma_code,continent_code,geo,area_code,robot,country_name,network,city,namespace,anonymous,unpatrolled,page,postal_code,language,newpage,user,region_lookup\",\"metrics\":\"count,delta,variation,added,deleted\",\"shardSpec\":{\"type\":\"none\"},\"binaryVersion\":9,\"size\":24664730,\"identifier\":\"wikipedia_2013-08-01T00:00:00.000Z_2013-08-02T00:00:00.000Z_2013-08-08T21:22:48.989Z\"}');
``
```
Within a minute or so, your master node logs should show entries of the following form:


@ -12,3 +12,40 @@
.doc-content img {
max-width: 847.5px;
}
.doc-content code {
background-color: #e0e0e0;
}
.doc-content pre code {
background-color: transparent;
}
.doc-content table,
.doc-content table > thead > tr > th,
.doc-content table > tbody > tr > th,
.doc-content table > tfoot > tr > th,
.doc-content table > thead > tr > td,
.doc-content table > tbody > tr > td,
.doc-content table > tfoot > tr > td {
border: 1px solid #dddddd;
}
.doc-content table > thead > tr > th,
.doc-content table > thead > tr > td {
border-bottom-width: 2px;
}
.doc-content table > tbody > tr:nth-child(odd) > td,
.doc-content table > tbody > tr:nth-child(odd) > th {
background-color: #f9f9f9;
}
.doc-content table > tbody > tr:hover > td,
.doc-content table > tbody > tr:hover > th {
background-color: #d5d5d5;
}
.doc-content table code {
background-color: transparent;
}


@ -12,11 +12,10 @@ h1. Contents
h2. Getting Started
* "Tutorial: A First Look at Druid":./Tutorial:-A-First-Look-at-Druid.html
* "Tutorial: The Druid Cluster":./Tutorial:-The-Druid-Cluster.html
* "Loading Your Data":./Tutorial:-Webstream.html
* "Loading Your Data":./Loading-Your-Data.html
* "Querying Your Data":./Querying-your-data.html
* "Booting a Production Cluster":./Booting-a-production-cluster.html
* "Examples":./Examples.html
* "Cluster Setup":Cluster-setup.html
* "Configuration":Configuration.html
h2. Data Ingestion


@ -1,38 +0,0 @@
footer {
font-size: 14px;
color: #000000;
font-weight: 300;
}
footer strong {
display: block;
font-weight: 400;
}
footer a {
color: #000000;
}
footer address {
margin: 0 0 30px 30px;
}
footer .soc {
text-align:left;
margin:5px 0 0 0;
}
footer .soc a {
display:inline-block;
width:35px;
height:34px;
background:url(../img/icons-soc.png) no-repeat;
}
footer .soc a.github {
background-position:0 -34px;
}
footer .soc a.meet {
background-position:0 -68px;
}
footer .soc a.rss {
background-position:0 -102px;
}


@ -1,21 +0,0 @@
.navbar {
max-width: 1170px;
margin: 10px auto 25px;
background-color: #eeeeee;
border-bottom-width: 0;
font-size: 18px;
line-height: 20px;
font-weight: 300;
}
.container.druid-navbar {
background-color: #171717;
-webkit-border-radius: 4px;
-moz-border-radius: 4px;
border-radius: 4px;
-webkit-box-shadow: 0 1px 4px rgba(0, 0, 0, 0.065);
-moz-box-shadow: 0 1px 4px rgba(0, 0, 0, 0.065);
box-shadow: 0 1px 4px rgba(0, 0, 0, 0.065);
}


@ -1,57 +0,0 @@
@font-face {
font-family: 'Conv_framd';
src: url('../fonts/framd.eot');
src: url('../fonts/framd.eot?#iefix') format('embedded-opentype'),
url('../fonts/framd.woff') format('woff'),
url('../fonts/framd.ttf') format('truetype'),
url('../fonts/framd.svg#heroregular') format('svg');
font-weight: normal;
font-style: normal;
}
html, body {
position:relative;
height:100%;
min-height:100%;
height:100%;
color:#252525;
font:400 18px/26px 'Open Sans', Arial, Helvetica, sans-serif;
margin:0;
word-wrap:break-word;
}
h1, h2, h3, h4, h5, h6, .h1, .h2, .h3, .h4, .h5, .h6 {
font-family: 'Open Sans', Arial, Helvetica, sans-serif;
font-weight: 300;
}
h3, .h3 {
font-size: 30px;
font-weight: 300;
}
.text-indent {
padding-left: 50px;
}
.text-indent-2 {
padding-left: 100px;
}
.text-indent-p p {
padding-left: 50px;
}
code {
color: inherit;
background-color: transparent;
}
.page-header {
margin-bottom: 40px;
text-align: center;
}
.easter-egg {
color: transparent;
}