Merge pull request #1247 from druid-io/new-docs

Rework the Druid documentation
Xavier Léauté 2015-04-21 10:20:54 -07:00
commit 702e5ceb2d
31 changed files with 563 additions and 489 deletions

View File

@ -11,7 +11,7 @@ petabyte sized data sets. Druid supports a variety of flexible filters, exact
calculations, approximate algorithms, and other useful calculations.
Druid can load both streaming and batch data and integrates with
Samza, Kafka, Storm, and Hadoop among others.
Samza, Kafka, Storm, and Hadoop.
### License
@ -41,8 +41,10 @@ If you find any bugs, please file a [GitHub issue](https://github.com/druid-io/d
### Community
Community support is available on our [mailing
list](https://groups.google.com/forum/#!forum/druid-development)(druid-development@googlegroups.com).
Community support is available on the [druid-user mailing
list](https://groups.google.com/forum/#!forum/druid-user)(druid-user@googlegroups.com).
Development discussions occur on the [druid-development list](https://groups.google.com/forum/#!forum/druid-development)(druid-development@googlegroups.com).
We also have a couple people hanging out on IRC in `#druid-dev` on
`irc.freenode.net`.

View File

@ -104,7 +104,7 @@ When setting `byRow` to `false` (the default) it computes the cardinality of the
* For a single dimension, this is equivalent to
```sql
SELECT COUNT(DISCTINCT(dimension)) FROM <datasource>
SELECT COUNT(DISTINCT(dimension)) FROM <datasource>
```
* For multiple dimensions, this is equivalent to something akin to

View File

@ -21,9 +21,11 @@ Many of Druid's external dependencies can be plugged in as modules. Extensions c
|Property|Description|Default|
|--------|-----------|-------|
|`druid.extensions.remoteRepositories`|If this is not set to '[]', Druid will try to download extensions at the specified remote repository.|["http://repo1.maven.org/maven2/", "https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local"]|
|`druid.extensions.localRepository`|The local maven directory where extensions are installed. If this is set, remoteRepositories is not required.|[]|
|`druid.extensions.coordinates`|The list of extensions to include.|[]|
|`druid.extensions.remoteRepositories`|This is a JSON array of remote repositories to load dependencies from. If this is not set to '[]', Druid will try to download extensions from the specified remote repositories.|["http://repo1.maven.org/maven2/", "https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local"]|
|`druid.extensions.localRepository`|The way maven gets dependencies is that it downloads them to a "local repository" on your local disk and then collects the paths to each of the jars. This property specifies the directory to use as that "local repository". If this is set, remoteRepositories is not required.|`~/.m2/repository`|
|`druid.extensions.coordinates`|This is a JSON array of "groupId:artifactId[:version]" maven coordinates. For artifacts without version specified, Druid will append the default version.|[]|
|`druid.extensions.defaultVersion`|Version to use for extension artifacts without version information.|`druid-server` artifact version.|
|`druid.extensions.searchCurrentClassloader`|This is a boolean flag that determines if Druid will search the main classloader for extensions. It defaults to true but can be turned off if you have reason to not automatically add all modules on the classpath.|true|
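For example, a `druid.extensions.coordinates` value pulling in two extensions might look like the following (the artifact names are illustrative; the first omits a version and would pick up `druid.extensions.defaultVersion`, the second pins one explicitly):

```json
["io.druid.extensions:druid-s3-extensions", "io.druid.extensions:mysql-metadata-storage:0.7.1"]
```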
### Zookeeper
We recommend just setting the base ZK path and the ZK service host, but all ZK paths that Druid uses can be overwritten to absolute paths.

View File

@ -17,6 +17,7 @@ The table data source is the most common type. It's represented by a string, or
```
### Union Data Source
This data source unions two or more table data sources.
```json
@ -28,8 +29,10 @@ This data source unions two or more table data sources.
Note that the data sources being unioned should have the same schema.
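For reference, a union over three hypothetical tables looks like:

```json
{
  "type": "union",
  "dataSources": ["table1", "table2", "table3"]
}
```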
### Query Data Source
This is used for nested groupBys and is only currently supported for groupBys.
```json
{
"type": "query",

View File

@ -2,7 +2,11 @@
layout: doc_page
---
# Data Source Metadata Queries
Data Source Metadata queries return metadata information for a dataSource. It returns the timestamp of latest ingested event for the datasource. The grammar is:
Data Source Metadata queries return metadata information for a dataSource. These queries return information about:
* The timestamp of the latest ingested event for the dataSource. This is the raw ingested event, without any consideration of rollup.
The grammar for these queries is:
```json
{
@ -16,8 +20,8 @@ There are 2 main parts to a Data Source Metadata query:
|property|description|required?|
|--------|-----------|---------|
|queryType|This String should always be "dataSourceMetadata"; this is the first thing Druid looks at to figure out how to interpret the query|yes|
|dataSource|A String defining the data source to query, very similar to a table in a relational database|yes|
|context|An additional JSON Object which can be used to specify certain flags.|no|
|dataSource|A String or Object defining the data source to query, very similar to a table in a relational database. See [DataSource](DataSource.html) for more information.|yes|
|context|See [Context](Context.html)|no|
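Putting the table together, a complete query against a hypothetical datasource is simply:

```json
{
  "queryType": "dataSourceMetadata",
  "dataSource": "sample_datasource"
}
```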
The format of the result is:

View File

@ -4,7 +4,8 @@ layout: doc_page
Data Formats for Ingestion
==========================
Druid can ingest data in JSON, CSV, or custom delimited data such as TSV. While most examples in the documentation use data in JSON format, it is not difficult to configure Druid to ingest CSV or other delimited data.
Druid can ingest denormalized data in JSON, CSV, or a custom delimited form such as TSV. While most examples in the documentation use data in JSON format, it is not difficult to configure Druid to ingest CSV or other delimited data.
We also welcome contributions of new data formats.
## Formatting the Data
The following are three samples of the data used in the [Wikipedia example](Tutorial%3A-Loading-Streaming-Data.html).
@ -29,7 +30,7 @@ _CSV_
2013-08-31T12:41:27Z,"Coyote Tango","ja","cancer","true","false","true","false","wikipedia","Asia","Japan","Kanto","Tokyo",1,10,-9
```
_TSV_
_TSV (Delimited)_
```
2013-08-31T01:02:33Z "Gypsy Danger" "en" "nuclear" "true" "true" "false" "false" "article" "North America" "United States" "Bay Area" "San Francisco" 57 200 -143
@ -43,94 +44,53 @@ Note that the CSV and TSV data do not contain column heads. This becomes importa
## Configuration
All forms of Druid ingestion require some form of schema object. The format of the data to be ingested is specified using the `parseSpec` entry in your `dataSchema`.
### JSON
All forms of Druid ingestion require some form of schema object. An example blob of json pertaining to the data format may look something like this:
```json
"firehose" : {
"type" : "local",
"baseDir" : "examples/indexing",
"filter" : "wikipedia_data.json"
"parseSpec":{
"format" : "json",
"timestampSpec" : {
"column" : "timestamp"
},
"dimensionSpec" : {
"dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
}
}
```
The `parser` entry for the `dataSchema` should be changed to describe the json format as per
```json
"parser" : {
"type":"string",
"parseSpec":{
"timestampSpec" : {
"column" : "timestamp"
},
"format" : "json",
"dimensionSpec" : {
"dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
}
}
}
```
Specified here are the location of the datafile, the timestamp column, the format of the data, and the columns that will become dimensions in Druid.
### CSV
Since the CSV data cannot contain the column names (no header row is allowed), these must be added before the data can be processed:
```json
"firehose" : {
"type" : "local",
"baseDir" : "examples/indexing/",
"filter" : "wikipedia_data.csv"
"parseSpec":{
"format" : "csv",
"timestampSpec" : {
"column" : "timestamp"
},
"columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
"dimensionsSpec" : {
"dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
}
}
```
The `parser` entry for the `dataSchema` should be changed to describe the csv format as per
```json
"parser" : {
"type":"string",
"parseSpec":{
"timestampSpec" : {
"column" : "timestamp"
},
"columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
"type" : "csv",
"dimensionsSpec" : {
"dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
}
}
}
```
Note also that the filename extension and the data type were changed to "csv". The `dimensions` list is a subset of `columns` and indicates which columns should be indexed as dimensions.
### TSV
For the TSV data, the same changes are made but with "tsv" for the filename extension and the data type.
```json
"firehose" : {
"type" : "local",
"baseDir" : "examples/indexing/",
"filter" : "wikipedia_data.tsv"
"parseSpec":{
"format" : "tsv",
"timestampSpec" : {
"column" : "timestamp"
},
"columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
"delimiter":"|",
"dimensionsSpec" : {
"dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
}
```
The `parser` entry for the `dataSchema` should be changed to describe the tsv format as per
```json
"parser" : {
"type":"string",
"parseSpec":{
"timestampSpec" : {
"column" : "timestamp"
},
"columns" : ["timestamp","page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
"type" : "tsv",
"delimiter":"|",
"dimensionsSpec" : {
"dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
}
}
}
}
```
Be sure to change the `delimiter` to the appropriate delimiter for your data. Like CSV, you must specify the columns and which subset of the columns you want indexed.
### Multi-value dimensions
Dimensions can have multiple values for TSV and CSV data. To specify the delimiter for a multi-value dimension, set the `listDelimiter`
Dimensions can have multiple values for TSV and CSV data. To specify the delimiter for a multi-value dimension, set the `listDelimiter` in the `parseSpec`.
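As a sketch, a TSV `parseSpec` with a multi-value `tags` dimension (the column names and delimiters here are illustrative) might look like:

```json
"parseSpec" : {
  "format" : "tsv",
  "timestampSpec" : {
    "column" : "timestamp"
  },
  "delimiter" : "\t",
  "listDelimiter" : "|",
  "columns" : ["timestamp", "page", "tags"],
  "dimensionsSpec" : {
    "dimensions" : ["page", "tags"]
  }
}
```

With this configuration, a row whose `tags` field contains `a|b|c` would produce a multi-value dimension with the three values `a`, `b`, and `c`.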

View File

@ -49,6 +49,21 @@ If you are using the Hadoop indexer, set your output directory to be a location
## Community Contributed Deep Stores
### Azure
[Microsoft Azure Storage](http://azure.microsoft.com/en-us/services/storage/) is another option for deep storage. This requires some additional druid configuration.
```
druid.storage.type=azure
druid.azure.account=<azure storage account>
druid.azure.key=<azure storage account key>
druid.azure.container=<azure storage container>
druid.azure.protocol=<optional; valid options: https or http; default: https>
druid.azure.maxTries=<optional; number of retries before giving up on an Azure operation; default: 3; min: 1>
```
Please note that this is a community contributed module. See [Azure Services](http://azure.microsoft.com/en-us/pricing/free-trial/) for more information.
### Cassandra
[Apache Cassandra](http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra) can also be leveraged for deep storage. This requires some additional druid configuration as well as setting up the necessary schema within a Cassandra keystore.

33
docs/content/Evaluate.md Normal file
View File

@ -0,0 +1,33 @@
---
layout: doc_page
---
Evaluate Druid
==============
This page is meant to help you in evaluating Druid by answering common questions that come up.
## Evaluating on a Single Machine
Most of the tutorials focus on running multiple Druid services on a single machine in an attempt to teach basic Druid concepts, and work out kinks in data ingestion. The configurations in the tutorials are
very poor choices for an actual production cluster.
## Capacity and Cost Planning
The best way to understand what your cluster will cost is to first understand how much data reduction you will get when you create segments.
We recommend indexing and creating segments from 1G of your data and evaluating the resultant segment size. This will show you how much your data rolls up, and how many segments can be loaded
on the hardware you have at your disposal.
Most of the cost of a Druid cluster is in historical nodes, followed by real-time indexing nodes if you have a high data intake. For high availability, you should have backup
coordination nodes (coordinators and overlords). Coordination nodes typically require much cheaper hardware than nodes that serve queries.
## Selecting Hardware
Druid is designed to run on commodity hardware and we've tried to provide some general guidelines on [how things should be tuned]() for various deployments. We've also provided
some [example specs](./Production-Cluster-Configuration.html) for hardware for a production cluster.
## Benchmarking Druid
The best resource to benchmark Druid is to follow the steps outlined in our [blog post](http://druid.io/blog/2014/03/17/benchmarking-druid.html) about the topic.
The code to reproduce the results in the blog post is all open source. The blog post covers Druid queries on TPC-H data, but you should be able to customize
configuration parameters to your data set. The blog post is a little outdated and uses an older version of Druid, but is still mostly relevant to demonstrate performance.

View File

@ -2,7 +2,10 @@
layout: doc_page
---
# groupBy Queries
These types of queries take a groupBy query object and return an array of JSON objects where each object represents a grouping asked for by the query. Note: If you only want to do straight aggregates for some time range, we highly recommend using [TimeseriesQueries](TimeseriesQuery.html) instead. The performance will be substantially better.
These types of queries take a groupBy query object and return an array of JSON objects where each object represents a
grouping asked for by the query. Note: If you only want to do straight aggregates for some time range, we highly recommend
using [TimeseriesQueries](TimeseriesQuery.html) instead. The performance will be substantially better. If you want to
do an ordered groupBy over a single dimension, please look at [TopN](./TopNQuery.html) queries. The performance for that use case is also substantially better.
An example groupBy query object is shown below:
``` json
@ -48,7 +51,7 @@ There are 11 main parts to a groupBy query:
|property|description|required?|
|--------|-----------|---------|
|queryType|This String should always be "groupBy"; this is the first thing Druid looks at to figure out how to interpret the query|yes|
|dataSource|A String defining the data source to query, very similar to a table in a relational database, or a [DataSource](DataSource.html) structure.|yes|
|dataSource|A String or Object defining the data source to query, very similar to a table in a relational database. See [DataSource](DataSource.html) for more information.|yes|
|dimensions|A JSON list of dimensions to do the groupBy over; or see [DimensionSpec](DimensionSpecs.html) for ways to extract dimensions.|yes|
|limitSpec|See [LimitSpec](LimitSpec.html).|no|
|having|See [Having](Having.html).|no|

View File

@ -0,0 +1,34 @@
---
layout: doc_page
---
# Including Extensions
Druid uses a module system that allows for the addition of extensions at runtime.
## Specifying extensions
Druid extensions can be specified in `common.runtime.properties`. There are currently two ways of adding Druid extensions.
### Add to the classpath
If you add your extension jar to the classpath at runtime, Druid will load it into the system. This mechanism is relatively easy to reason about, but it also means that you have to ensure that all dependency jars on the classpath are compatible. That is, Druid makes no provisions while using this method to maintain class loader isolation so you must make sure that the jars on your classpath are mutually compatible.
### Specify maven coordinates
Druid has the ability to automatically load extension jars from maven at runtime. With this mechanism, Druid also loads up the dependencies of the extension jar into an isolated class loader. That means that your extension can depend on a different version of a library that Druid also uses and both can co-exist.
### I want classloader isolation, but I don't want my production machines downloading their own dependencies. What should I do?
If you want to take advantage of the maven-based classloader isolation but you are also rightly frightened by the prospect of each of your production machines downloading their own dependencies on deploy, this section is for you.
The trick to doing this is
1) Specify a local directory for `druid.extensions.localRepository`
2) Run the `tools pull-deps` command to pull all the specified dependencies down into your local repository
3) Bundle up the local repository along with your other Druid stuff into whatever you use for a deployable artifact
4) Run your Druid processes with `druid.extensions.remoteRepositories=[]` and a local repository set to wherever your bundled "local" repository is located
The Druid processes will then only load up jars from the local repository and will not try to go out onto the internet to find the maven dependencies.

View File

@ -0,0 +1,82 @@
---
layout: doc_page
---
# Ingestion Overview
There are a couple of different ways to get data into Druid. We hope to unify things in the near future, but for the time being
the method you choose to ingest your data into Druid should be driven by your use case.
## Streaming Data
If you have a continuous stream of data, there are a few options to get your data into Druid. It should be noted that the current state of real-time ingestion in Druid does not guarantee exactly once processing. The real-time pipeline is meant to surface insights on
events as they are occurring. For an accurate copy of ingested data, an accompanying batch pipeline is required. We are working towards a streaming-only world, but for
the time being, we recommend running a lambda architecture.
### Ingest from a Stream Processor
If you process your data using a stream processor such as Apache Samza or Apache Storm, you can use the [Tranquility](https://github.com/metamx/tranquility) library to manage
your real-time ingestion. This setup requires using the indexing service for ingestion, which is what is used in production by many organizations that use Druid.
### Ingest from Apache Kafka
If you wish to ingest directly from Kafka using Tranquility, you will have to write a consumer that reads from Kafka and passes the data to Tranquility.
The other option is to use [standalone Realtime nodes](./Realtime.html).
It should be noted that standalone realtime nodes use the Kafka high level consumer, which imposes a few restrictions.
Druid replicates segments such that logically equivalent data segments are concurrently hosted on N nodes. If N-1 nodes go down,
the data will still be available for querying. On real-time nodes, this process depends on maintaining logically equivalent
data segments on each of the N nodes, which is not possible with standard Kafka consumer groups if your Kafka topic requires more than one consumer
(because consumers in different consumer groups will split up the data differently).
For example, let's say your topic is split across Kafka partitions 1, 2, & 3 and you have 2 real-time nodes with linear shard specs 1 & 2.
Both of the real-time nodes are in the same consumer group. Real-time node 1 may consume data from partitions 1 & 3, and real-time node 2 may consume data from partition 2.
Querying for your data through the broker will yield correct results.
The problem arises if you want to replicate your data by creating real-time nodes 3 & 4. These new real-time nodes also
have linear shard specs 1 & 2, and they will consume data from Kafka using a different consumer group. In this case,
real-time node 3 may consume data from partitions 1 & 2, and real-time node 4 may consume data from partition 2.
From Druid's perspective, the segments hosted by real-time nodes 1 and 3 are the same, and the data hosted by real-time nodes
2 and 4 are the same, although they are reading from different Kafka partitions. Querying for the data will yield inconsistent
results.
Is this always a problem? No. If your data is small enough to fit on a single Kafka partition, you can replicate without issues.
Otherwise, you can run real-time nodes without replication.
## Large Batch of Static Data
If you have a large batch of historical data that you want to load all at once into Druid, you should use Druid's built in support for
Hadoop-based indexing. Hadoop-based indexing of large (> 1G) batches of data is the fastest way to load data into Druid. If you wish to avoid
the Hadoop dependency, or if you do not have a Hadoop cluster present, you can look at using the [index task](). The index task will be much slower
than Hadoop indexing for ingesting batch data.
One pattern that we've seen is to store raw events (or processed events) in deep storage (S3, HDFS, etc) and periodically run batch processing jobs over these raw events.
You can, for example, create a directory structure for your raw data, such as the following:
```
/prod/<dataSource>/v=1/y=2015/m=03/d=21/H=20/data.gz
/prod/<dataSource>/v=1/y=2015/m=03/d=21/H=21/data.gz
/prod/<dataSource>/v=1/y=2015/m=03/d=21/H=22/data.gz
```
In this example, hourly raw events are stored in individual gzipped files. Periodic batch processing jobs can then run over these files.
## Lambda Architecture
We recommend running a streaming real-time pipeline to run queries over events as they are occurring and a batch pipeline to perform periodic
cleanups of data.
## Sharding
Multiple segments may exist for the same interval of time for the same datasource. These segments form a `block` for an interval.
Depending on the type of `shardSpec` that is used to shard the data, Druid queries may only complete if a `block` is complete. That is to say, if a block consists of 3 segments, such as:
`sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_0`
`sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_1`
`sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_2`
All 3 segments must be loaded before a query for the interval `2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z` completes.
The exception to this rule is with using linear shard specs. Linear shard specs do not force 'completeness' and queries can complete even if shards are not loaded in the system.
For example, if your real-time ingestion creates 3 segments that were sharded with linear shard spec, and only two of the segments were loaded in the system, queries would return results only for those 2 segments.
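For illustration, a linear shard spec in an ingestion spec is just a type and a partition number; the sketch below assumes the standard linear shard spec fields:

```json
"shardSpec" : {
  "type" : "linear",
  "partitionNum" : 0
}
```

A second node ingesting a different slice of the same stream would use `"partitionNum": 1`, while a replica of the first node would reuse `"partitionNum": 0`.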

View File

@ -6,49 +6,6 @@ layout: doc_page
Druid uses a module system that allows for the addition of extensions at runtime.
## Specifying extensions
There are two ways of adding druid extensions currently.
### Add to the classpath
If you add your extension jar to the classpath at runtime, Druid will load it into the system. This mechanism is relatively easy to reason about, but it also means that you have to ensure that all dependency jars on the classpath are compatible. That is, Druid makes no provisions while using this method to maintain class loader isolation so you must make sure that the jars on your classpath are mutually compatible.
### Specify maven coordinates
Druid has the ability to automatically load extension jars from maven at runtime. With this mechanism, Druid also loads up the dependencies of the extension jar into an isolated class loader. That means that your extension can depend on a different version of a library that Druid also uses and both can co-exist.
## Configuring the extensions
Druid provides the following settings to configure the loading of extensions:
* `druid.extensions.coordinates`
This is a JSON array of "groupId:artifactId[:version]" maven coordinates. For artifacts without version specified, Druid will append the default version. Defaults to `[]`
* `druid.extensions.defaultVersion`
Version to use for extension artifacts without version information. Defaults to the `druid-server` artifact version.
* `druid.extensions.localRepository`
This specifies where to look for the "local repository". The way maven gets dependencies is that it downloads them to a "local repository" on your local disk and then collects the paths to each of the jars. This specifies the directory to consider the "local repository". Defaults to `~/.m2/repository`
* `druid.extensions.remoteRepositories`
This is a JSON Array list of remote repositories to load dependencies from. Defaults to `["http://repo1.maven.org/maven2/", "https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local"]`
* `druid.extensions.searchCurrentClassloader`
This is a boolean flag that determines if Druid will search the main classloader for extensions. It defaults to true but can be turned off if you have reason to not automatically add all modules on the classpath.
### I want classloader isolation, but I don't want my production machines downloading their own dependencies. What should I do?
If you want to take advantage of the maven-based classloader isolation but you are also rightly frightened by the prospect of each of your production machines downloading their own dependencies on deploy, this section is for you.
The trick to doing this is
1) Specify a local directory for `druid.extensions.localRepository`
2) Run the `tools pull-deps` command to pull all the specified dependencies down into your local repository
3) Bundle up the local repository along with your other Druid stuff into whatever you use for a deployable artifact
4) Run Your druid processes with `druid.extensions.remoteRepositories=[]` and a local repository set to wherever your bundled "local" repository is located
The Druid processes will then only load up jars from the local repository and will not try to go out onto the internet to find the maven dependencies.
## Writing your own extensions
Druid's extensions leverage Guice in order to add things at runtime. Basically, Guice is a framework for Dependency Injection, but we use it to hold the expected object graph of the Druid process. Extensions can make any changes they want/need to the object graph via adding Guice bindings. While the extensions actually give you the capability to change almost anything however you want, in general, we expect people to want to extend one of a few things.
@ -60,7 +17,6 @@ Druid's extensions leverage Guice in order to add things at runtime. Basically,
1. Add new Query types
1. Add new Jersey resources
Extensions are added to the system via an implementation of `io.druid.initialization.DruidModule`.
### Creating a Druid Module
@ -114,6 +70,57 @@ Binders.dataSegmentPusherBinder(binder)
In addition to DataSegmentPusher and DataSegmentPuller, you can also bind:
* DataSegmentKiller: Removes segments, used as part of the Kill Task to delete unused segments, i.e. perform garbage collection of segments that are either superseded by newer versions or that have been dropped from the cluster.
* DataSegmentMover: Allow migrating segments from one place to another, currently this is only used as part of the MoveTask to move unused segments to a different S3 bucket or prefix, typically to reduce storage costs of unused data (e.g. move to glacier or cheaper storage)
* DataSegmentArchiver: Just a wrapper around Mover, but comes with a pre-configured target bucket/path, so it doesn't have to be specified at runtime as part of the ArchiveTask.
### Validating your deep storage implementation
**WARNING!** This is not a formal procedure, but a collection of hints to validate whether your new deep storage implementation is able to push, pull and kill segments.
It's a good idea to use batch ingestion tasks to validate your implementation.
The segment will be automatically handed off to a historical node after ~20 seconds.
In this way, you can validate both pushing segments (at the realtime node) and pulling them (at the historical node).
* DataSegmentPusher
Wherever your data storage (cloud storage service, distributed file system, etc.) is located, you should be able to see two new files, `descriptor.json` and `index.zip`, after your ingestion task ends.
* DataSegmentPuller
About 20 seconds after your ingestion task ends, you should see your historical node trying to load the new segment.
The following example was retrieved from a historical node configured to use Azure for deep storage:
```
2015-04-14T02:42:33,450 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - New request[LOAD: dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00
.000Z_2015-04-14T02:41:09.484Z] with zNode[/druid/dev/loadQueue/192.168.33.104:8081/dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_2015-04-14T02:41:09.
484Z].
2015-04-14T02:42:33,451 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Loading segment dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.0
00Z_2015-04-14T02:41:09.484Z
2015-04-14T02:42:33,463 INFO [ZkCoordinator-0] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.storage.azure.AzureAccountConfig] from props[drui
d.azure.] as [io.druid.storage.azure.AzureAccountConfig@759c9ad9]
2015-04-14T02:49:08,275 INFO [ZkCoordinator-0] com.metamx.common.CompressionUtils - Unzipping file[/opt/druid/tmp/compressionUtilZipCache1263964429587449785.z
ip] to [/opt/druid/zk_druid/dde/2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z/2015-04-14T02:41:09.484Z/0]
2015-04-14T02:49:08,276 INFO [ZkCoordinator-0] io.druid.storage.azure.AzureDataSegmentPuller - Loaded 1196 bytes from [dde/2015-01-02T00:00:00.000Z_2015-01-03
T00:00:00.000Z/2015-04-14T02:41:09.484Z/0/index.zip] to [/opt/druid/zk_druid/dde/2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z/2015-04-14T02:41:09.484Z/0]
2015-04-14T02:49:08,277 WARN [ZkCoordinator-0] io.druid.segment.loading.SegmentLoaderLocalCacheManager - Segment [dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_2015-04-14T02:41:09.484Z] is different than expected size. Expected [0] found [1196]
2015-04-14T02:49:08,282 INFO [ZkCoordinator-0] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_2015-04-14T02:41:09.484Z] at path[/druid/dev/segments/192.168.33.104:8081/192.168.33.104:8081_historical__default_tier_2015-04-14T02:49:08.282Z_7bb87230ebf940188511dd4a53ffd7351]
2015-04-14T02:49:08,292 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Completed request [LOAD: dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_2015-04-14T02:41:09.484Z]
```
* DataSegmentKiller
The easiest way of testing the segment killing is marking a segment as not used and then starting a killing task through the old Coordinator console.
To mark a segment as not used, you need to connect to your metadata storage and update the `used` column to `false` on the segment table rows.
To start a segment killing task, you need to access the old Coordinator console `http://<COORDINATOR_IP>:<COORDINATOR_PORT>/old-console/kill.html`, then select the appropriate datasource and input a time range (e.g. `2000/3000`).
After the killing task ends, both `descriptor.json` and `index.zip` files should be deleted from the data storage.
In addition to DataSegmentPusher and DataSegmentPuller, you can also bind:
* DataSegmentKiller: Removes segments, used as part of the Kill Task to delete unused segments, i.e. perform garbage collection of segments that are either superseded by newer versions or that have been dropped from the cluster.
* DataSegmentMover: Allow migrating segments from one place to another. Currently this is only used as part of the MoveTask to move unused segments to a different S3 bucket or prefix in order to reduce storage costs of unused data (e.g. move to glacier or cheaper storage).
* DataSegmentArchiver: It's a wrapper around Mover which comes with a pre-configured target bucket/path. Thus it's not necessary to specify those parameters at runtime as part of the ArchiveTask.

View File

@ -0,0 +1,19 @@
---
layout: doc_page
---
# Papers
* [Druid: A Real-time Analytical Data Store](http://static.druid.io/docs/druid.pdf) - Discusses the Druid architecture in detail.
* [The RADStack: Open Source Lambda Architecture for Interactive Analytics](http://static.druid.io/docs/radstack.pdf) - Discusses how Druid supports real-time and batch workflows.
# Presentations
* [Introduction to Druid](https://www.youtube.com/watch?v=hgmxVPx4vVw) - Discusses the motivations behind Druid and the architecture of the system.
* [Druid: Interactive Queries Meet Real-Time Data](https://www.youtube.com/watch?v=Dlqj34l2upk) - Discusses how real-time ingestion in Druid works and use cases at Netflix.
* [Not Exactly! Fast Queries via Approximation Algorithms](https://www.youtube.com/watch?v=Hpd3f_MLdXo) - Discusses how approximate algorithms work in Druid.
* [Real-time Analytics with Open Source Technologies](https://www.youtube.com/watch?v=kJMYVpnW_AQ) - Discusses Lambda architectures with Druid.

View File

@ -0,0 +1,20 @@
---
layout: doc_page
---
Query Context
=============
The query context is used for various query configuration parameters.
|property |default | description |
|--------------|---------------------|----------------------|
|timeout | `0` (no timeout) | Query timeout in milliseconds, beyond which unfinished queries will be cancelled. |
|priority | `0` | Query Priority. Queries with higher priority get precedence for computational resources.|
|queryId | auto-generated | Unique identifier given to this query. If a query ID is set or known, this can be used to cancel the query |
|useCache | `true` | Flag indicating whether to leverage the query cache for this query. This may be overridden in the broker or historical node configuration |
|populateCache | `true` | Flag indicating whether to save the results of the query to the query cache. Primarily used for debugging. This may be overridden in the broker or historical node configuration |
|bySegment | `false` | Return "by segment" results. Primarily used for debugging, setting it to `true` returns results associated with the data segment they came from |
|finalize | `true` | Flag indicating whether to "finalize" aggregation results. Primarily used for debugging. For instance, the `hyperUnique` aggregator will return the full HyperLogLog sketch instead of the estimated cardinality when this flag is set to `false` |
|chunkPeriod | `0` (off) | At the broker node level, long interval queries (of any type) may be broken into shorter interval queries, reducing the impact on resources. Use ISO 8601 periods. For example, if this property is set to `P1M` (one month), then a query covering a year would be broken into 12 smaller queries. All the query chunks will be processed asynchronously inside query processing executor service. Make sure "druid.processing.numThreads" is configured appropriately on the broker. |
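The context is passed as a `context` object inside the query JSON. A minimal sketch, assuming a timeseries query against a hypothetical datasource, that raises the priority, sets a one-minute timeout, and bypasses the cache:

```json
{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "all",
  "intervals": ["2013-01-01/2013-01-02"],
  "aggregations": [
    { "type": "count", "name": "rows" }
  ],
  "context": {
    "timeout": 60000,
    "priority": 100,
    "useCache": false
  }
}
```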

View File

@ -5,153 +5,45 @@ layout: doc_page
Querying
========
Queries are made using an HTTP REST style request to a [Broker](Broker.html),
[Historical](Historical.html), or [Realtime](Realtime.html) node. The
Queries are made using an HTTP REST style request to queryable nodes ([Broker](Broker.html),
[Historical](Historical.html), or [Realtime](Realtime.html)). The
query is expressed in JSON and each of these node types expose the same
REST query interface.
REST query interface. For normal Druid operations, queries should be issued to the broker nodes.
We start by describing an example query with additional comments that mention
possible variations. Query operators are also summarized in a table below.
Although Druid's native query language is JSON over HTTP, many members of the community have contributed different [client libraries](./Libraries.html) in other languages to query Druid.
Example Query "rand"
--------------------
Available Queries
-----------------
Here is the query in the examples/rand subproject (file is query.body), followed by a commented version of the same.
Druid has numerous query types for various use cases. Queries are composed of various JSON properties, and the documentation for each query type describes all the JSON properties that can be set.
```javascript
{
"queryType": "groupBy",
"dataSource": "randSeq",
"granularity": "all",
"dimensions": [],
"aggregations": [
{ "type": "count", "name": "rows" },
{ "type": "doubleSum", "fieldName": "events", "name": "e" },
{ "type": "doubleSum", "fieldName": "outColumn", "name": "randomNumberSum" }
],
"postAggregations": [{
"type": "arithmetic",
"name": "avg_random",
"fn": "/",
"fields": [
{ "type": "fieldAccess", "fieldName": "randomNumberSum" },
{ "type": "fieldAccess", "fieldName": "rows" }
]
}],
"intervals": ["2012-10-01T00:00/2020-01-01T00"]
}
```
### Aggregation Queries
This query could be submitted via curl like so (assuming the query object is in a file "query.json").
* [Timeseries](./TimeseriesQuery.html)
* [TopN](./TopNQuery.html)
* [GroupBy](./GroupByQuery.html)
```
curl -X POST "http://host:port/druid/v2/?pretty" -H 'content-type: application/json' -d @query.json
```
### Metadata Queries
The "pretty" query parameter gets the results formatted a bit nicer.
* [Time Boundary](./TimeBoundaryQuery.html)
* [Segment Metadata](./SegmentMetadataQuery.html)
* [Datasource Metadata](./DatasourceMetadataQuery.html)
Details of Example Query "rand"
-------------------------------
### Search Queries
The queryType JSON field identifies which kind of query operator is to be used, in this case it is groupBy, the most frequently used kind (which corresponds to an internal implementation class GroupByQuery registered as "groupBy"), and it has a set of required fields that are also part of this query. The queryType can also be "search" or "timeBoundary" which have similar or different required fields summarized below:
* [Search](./SearchQuery.html)
```javascript
{
"queryType": "groupBy",
```
The dataSource JSON field shown next identifies where to apply the query. In this case, randSeq corresponds to the examples/rand/rand_realtime.spec file schema:
```javascript
"dataSource": "randSeq",
```
The granularity JSON field specifies the bucket size for values. It could be a built-in time interval like "second", "minute", "fifteen_minute", "thirty_minute", "hour" or "day". It can also be an expression like `{"type": "period", "period":"PT6m"}` meaning "6 minute buckets". See [Granularities](Granularities.html) for more information on the different options for this field. In this example, it is set to the special value "all" which means bucket all data points together into the same time bucket.
```javascript
"granularity": "all",
```
The dimensions JSON field value is an array of zero or more fields as defined in the dataSource spec file or defined in the input records and carried forward. These are used to constrain the grouping. If empty, then one value per time granularity bucket is requested in the groupBy:
```javascript
"dimensions": [],
```
A groupBy also requires the JSON field "aggregations" (See [Aggregations](Aggregations.html)), which are applied to the column specified by fieldName and the output of the aggregation will be named according to the value in the "name" field:
```javascript
"aggregations": [
{ "type": "count", "name": "rows" },
{ "type": "doubleSum", "fieldName": "events", "name": "e" },
{ "type": "doubleSum", "fieldName": "outColumn", "name": "randomNumberSum" }
],
```
You can also specify postAggregations, which are applied after data has been aggregated for the current granularity and dimensions bucket. See [Post Aggregations](Post-aggregations.html) for a detailed description. In the rand example, an arithmetic type operation (division, as specified by "fn") is performed with the result "name" of "avg_random". The "fields" field specifies the inputs from the aggregation stage to this expression. Note that identifiers corresponding to "name" JSON field inside the type "fieldAccess" are required but not used outside this expression, so they are prefixed with "dummy" for clarity:
```javascript
"postAggregations": [{
"type": "arithmetic",
"name": "avg_random",
"fn": "/",
"fields": [
{ "type": "fieldAccess", "fieldName": "randomNumberSum" },
{ "type": "fieldAccess", "fieldName": "rows" }
]
}],
```
The time range(s) of the query; data outside the specified intervals will not be used; this example specifies from October 1, 2012 until January 1, 2020:
```javascript
"intervals": ["2012-10-01T00:00/2020-01-01T00"]
}
```
Query Operators
---------------
The following table summarizes query properties.
Properties shared by all query types
|property |description|required?|
|----------|-----------|---------|
|dataSource|query is applied to this data source|yes|
|intervals |range of time series to include in query|yes|
|context |This is a key-value map used to alter some of the behavior of a query. See [Query Context](#query-context) below|no|
|query type|property |description|required?|
|----------|----------|-----------|---------|
|timeseries, topN, groupBy, search|filter|Specifies the filter (the "WHERE" clause in SQL) for the query. See [Filters](Filters.html)|no|
|timeseries, topN, groupBy, search|granularity|the timestamp granularity to bucket results into (i.e. "hour"). See [Granularities](Granularities.html) for more information.|no|
|timeseries, topN, groupBy|aggregations|aggregations that combine values in a bucket. See [Aggregations](Aggregations.html).|yes|
|timeseries, topN, groupBy|postAggregations|aggregations of aggregations. See [Post Aggregations](Post-aggregations.html).|yes|
|groupBy|dimensions|constrains the groupings; if empty, then one value per time granularity bucket|yes|
|search|limit|maximum number of results (default is 1000), a system-level maximum can also be set via `com.metamx.query.search.maxSearchLimit`|no|
|search|searchDimensions|Dimensions to apply the search query to. If not specified, it will search through all dimensions.|no|
|search|query|The query portion of the search query. This is essentially a predicate that specifies if something matches.|yes|
<a name="query-context"></a>Query Context
-------------
|property |default | description |
|--------------|---------------------|----------------------|
|timeout | `0` (no timeout) | Query timeout in milliseconds, beyond which unfinished queries will be cancelled |
|priority | `0` | Query Priority. Queries with higher priority get precedence for computational resources.|
|queryId | auto-generated | Unique identifier given to this query. If a query ID is set or known, this can be used to cancel the query |
|useCache | `true` | Flag indicating whether to leverage the query cache for this query. This may be overriden in the broker or historical node configuration |
|populateCache | `true` | Flag indicating whether to save the results of the query to the query cache. Primarily used for debugging. This may be overriden in the broker or historical node configuration |
|bySegment | `false` | Return "by segment" results. Pimarily used for debugging, setting it to `true` returns results associated with the data segment they came from |
|finalize | `true` | Flag indicating whether to "finalize" aggregation results. Primarily used for debugging. For instance, the `hyperUnique` aggregator will return the full HyperLogLog sketch instead of the estimated cardinality when this flag is set to `false` |
|chunkPeriod | `0` (off) | At broker, Long-interval queries (of any type) may be broken into shorter interval queries, reducing the impact on resources. Use ISO 8601 periods. For example, if this property is set to `P1M` (one month), then a query covering a year would be broken into 12 smaller queries. All the query chunks will be processed asynchronously inside query processing executor service. Make sure "druid.processing.numThreads" is configured appropriately. |
Which Query Should I Use?
-------------------------
Where possible, we recommend using [Timeseries]() and [TopN]() queries instead of [GroupBy](). GroupBy is the most flexible Druid query, but also has the poorest performance.
Timeseries are significantly faster than groupBy queries for aggregations that don't require grouping over dimensions. For grouping and sorting over a single dimension,
topN queries are much more optimized than groupBys.
Query Cancellation
------------------
Queries can be cancelled explicitely using their unique identifier. If the
Queries can be cancelled explicitly using their unique identifier. If the
query identifier is set at the time of query, or is otherwise known, the following
endpoint can be used on the broker or router to cancel the query.

View File

@ -1,12 +1,15 @@
---
layout: doc_page
---
# Configuring Rules for Coordinator Nodes
# Retaining or Automatically Dropping Data
Coordinator nodes use rules to determine what data should be loaded or dropped from the cluster. Rules are used for data retention and are set on the coordinator console (http://coordinator_ip:port).
Rules indicate how segments should be assigned to different historical node tiers and how many replicas of a segment should exist in each tier. Rules may also indicate when segments should be dropped entirely from the cluster. The coordinator loads a set of rules from the metadata storage. Rules may be specific to a certain datasource and/or a default set of rules can be configured. Rules are read in order and hence the ordering of rules is important. The coordinator will cycle through all available segments and match each segment with the first rule that applies. Each segment may only match a single rule.
Note: It is recommended that the coordinator console be used to configure rules. However, the coordinator node does have HTTP endpoints to programmatically configure rules.
When a rule is updated, the change may not be reflected until the next time the coordinator runs. This will be fixed in the near future.
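As a sketch (see the load and drop rule types described below for the exact fields), a rule set that keeps one month of data with a single replica in a `hot` tier and drops everything older would be ordered like this:

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "tieredReplicants": { "hot": 1 }
  },
  {
    "type": "dropForever"
  }
]
```

Because rules are matched in order, segments newer than one month match the load rule first; older segments fall through to the drop rule.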
Load Rules
----------

View File

@ -31,14 +31,14 @@ There are several main parts to a search query:
|property|description|required?|
|--------|-----------|---------|
|queryType|This String should always be "search"; this is the first thing Druid looks at to figure out how to interpret the query.|yes|
|dataSource|A String defining the data source to query, very similar to a table in a relational database.|yes|
|dataSource|A String or Object defining the data source to query, very similar to a table in a relational database. See [DataSource](DataSource.html) for more information.|yes|
|granularity|Defines the granularity of the query. See [Granularities](Granularities.html).|yes|
|filter|See [Filters](Filters.html).|no|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|searchDimensions|The dimensions to run the search over. Excluding this means the search is run over all dimensions.|no|
|query|See [SearchQuerySpec](SearchQuerySpec.html).|yes|
|sort|An object specifying how the results of the search should be sorted. Two possible types here are "lexicographic" (the default sort) and "strlen".|no|
|context|An additional JSON Object which can be used to specify certain flags.|no|
|context|See [Context](Context.html)|no|
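For reference, a minimal search query might look like the following (datasource, dimension names, and search term are illustrative):

```json
{
  "queryType": "search",
  "dataSource": "sample_datasource",
  "granularity": "all",
  "intervals": ["2013-01-01/2013-01-03"],
  "searchDimensions": ["dim1", "dim2"],
  "query": {
    "type": "insensitive_contains",
    "value": "ke"
  }
}
```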
The format of the result is:

View File

@ -24,11 +24,11 @@ There are several main parts to a segment metadata query:
|property|description|required?|
|--------|-----------|---------|
|queryType|This String should always be "segmentMetadata"; this is the first thing Druid looks at to figure out how to interpret the query|yes|
|dataSource|A String defining the data source to query, very similar to a table in a relational database|yes|
|dataSource|A String or Object defining the data source to query, very similar to a table in a relational database. See [DataSource](DataSource.html) for more information.|yes|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|toInclude|A JSON Object representing what columns should be included in the result. Defaults to "all".|no|
|merge|Merge all individual segment metadata results into a single result|no|
|context|An additional JSON Object which can be used to specify certain flags.|no|
|context|See [Context](Context.html)|no|
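For reference, a minimal segment metadata query over a hypothetical datasource and interval:

```json
{
  "queryType": "segmentMetadata",
  "dataSource": "sample_datasource",
  "intervals": ["2013-01-01/2014-01-01"]
}
```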
The format of the result is:

View File

@ -4,7 +4,7 @@ layout: doc_page
Segments
========
The latest Druid segment version is `v9`.
Druid segments contain data for a time interval, stored as separate columns. Dimensions (string columns) have inverted indexes associated with them for each dimension value. Metric columns are LZ4 compressed.
Naming Convention
-----------------
@ -35,6 +35,8 @@ A segment is comprised of several files, listed below.
There is also a special column called `__time` that refers to the time column of the segment. This will hopefully become less and less special as the code evolves, but for now it's as special as my Mommy always told me I am.
In the codebase, segments have an internal format version. The current segment format version is `v9`.
Format of a column
------------------

View File

@ -23,7 +23,7 @@ There are several main parts to a select query:
|property|description|required?|
|--------|-----------|---------|
|queryType|This String should always be "select"; this is the first thing Druid looks at to figure out how to interpret the query|yes|
|dataSource|A String defining the data source to query, very similar to a table in a relational database|yes|
|dataSource|A String or Object defining the data source to query, very similar to a table in a relational database. See [DataSource](DataSource.html) for more information.|yes|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|filter|See [Filters](Filters.html)|no|
|dimensions|A String array of dimensions to select. If left empty, all dimensions are returned.|no|

View File

@ -17,9 +17,9 @@ There are 3 main parts to a time boundary query:
|property|description|required?|
|--------|-----------|---------|
|queryType|This String should always be "timeBoundary"; this is the first thing Druid looks at to figure out how to interpret the query|yes|
|dataSource|A String defining the data source to query, very similar to a table in a relational database|yes|
|dataSource|A String or Object defining the data source to query, very similar to a table in a relational database. See [DataSource](DataSource.html) for more information.|yes|
|bound | Optional, set to `maxTime` or `minTime` to return only the latest or earliest timestamp. Default to returning both if not set| no |
|context|An additional JSON Object which can be used to specify certain flags.|no|
|context|See [Context](Context.html)|no|
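For reference, a minimal time boundary query (hypothetical datasource name) that asks only for the latest timestamp:

```json
{
  "queryType": "timeBoundary",
  "dataSource": "sample_datasource",
  "bound": "maxTime"
}
```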
The format of the result is:

View File

@ -48,13 +48,13 @@ There are 7 main parts to a timeseries query:
|property|description|required?|
|--------|-----------|---------|
|queryType|This String should always be "timeseries"; this is the first thing Druid looks at to figure out how to interpret the query|yes|
|dataSource|A String defining the data source to query, very similar to a table in a relational database|yes|
|granularity|Defines the granularity of the query. See [Granularities](Granularities.html)|yes|
|dataSource|A String or Object defining the data source to query, very similar to a table in a relational database. See [DataSource](DataSource.html) for more information.|yes|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|granularity|Defines the granularity to bucket query results. See [Granularities](Granularities.html)|yes|
|filter|See [Filters](Filters.html)|no|
|aggregations|See [Aggregations](Aggregations.html)|yes|
|postAggregations|See [Post Aggregations](Post-aggregations.html)|no|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|context|An additional JSON Object which can be used to specify certain flags.|no|
|context|See [Context](Context.html)|no|
To pull it all together, the above query would return 2 data points, one for each day between 2012-01-01 and 2012-01-03, from the "sample\_datasource" table. Each data point would be the (long) sum of sample\_fieldName1, the (double) sum of sample\_fieldName2 and the (double) the result of sample\_fieldName1 divided by sample\_fieldName2 for the filter set. The output looks like this:

View File

@ -6,6 +6,8 @@ TopN queries
TopN queries return a sorted set of results for the values in a given dimension according to some criteria. Conceptually, they can be thought of as an approximate [GroupByQuery](GroupByQuery.html) over a single dimension with an [Ordering](LimitSpec.html) spec. TopNs are much faster and resource efficient than GroupBys for this use case. These types of queries take a topN query object and return an array of JSON objects where each object represents a value asked for by the topN query.
TopNs are approximate in that each node will rank their top K results and only return those top K results to the broker. K, by default in Druid, is `max(1000, threshold)`. In practice, this means that if you ask for the top 1000 items ordered, the correctness of the first ~900 items will be 100%, and the ordering of the results after that is not guaranteed. TopNs can be made more accurate by increasing the threshold.
A topN query object looks like:
```json
@ -68,13 +70,21 @@ A topN query object looks like:
}
```
There are 10 parts to a topN query, but 7 of them are shared with [TimeseriesQuery](TimeseriesQuery.html). Please review [TimeseriesQuery](TimeseriesQuery.html) for meanings of fields not defined below.
There are 11 parts to a topN query.
|property|description|required?|
|--------|-----------|---------|
|queryType|This String should always be "topN"; this is the first thing Druid looks at to figure out how to interpret the query|yes|
|dataSource|A String or Object defining the data source to query, very similar to a table in a relational database. See [DataSource](DataSource.html) for more information.|yes|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|granularity|Defines the granularity to bucket query results. See [Granularities](Granularities.html)|yes|
|filter|See [Filters](Filters.html)|no|
|aggregations|See [Aggregations](Aggregations.html)|yes|
|postAggregations|See [Post Aggregations](Post-aggregations.html)|no|
|dimension|A String or JSON object defining the dimension that you want the top taken for. For more info, see [DimensionSpecs](DimensionSpecs.html)|yes|
|threshold|An integer defining the N in the topN (i.e. how many you want in the top list)|yes|
|threshold|An integer defining the N in the topN (i.e. how many results you want in the top list)|yes|
|metric|A String or JSON object specifying the metric to sort by for the top list. For more info, see [TopNMetricSpec](TopNMetricSpec.html).|yes|
|context|See [Context](Context.html)|no|
Please note the context JSON object is also available for topN queries and should be used with the same caution as the timeseries case.
The format of the results would look like so:
@ -211,4 +221,4 @@ Users who can tolerate *approximate rank* topN over a dimension with greater tha
"queryType": "topN",
"threshold": 2
}
```
```

View File

@ -47,24 +47,24 @@ To start, we need to get our hands on a Druid build. There are two ways to get D
### Download a Tarball
We've built a tarball that contains everything you'll need. You'll find it [here](http://static.druid.io/artifacts/releases/druid-0.7.1-bin.tar.gz). Download this file to a directory of your choosing.
We've built a tarball that contains everything you'll need. You'll find it [here](http://druid.io/downloads.html). Download this file to a directory of your choosing.
### Build From Source
Follow the [Build From Source](Build-from-source.html) guide to build from source. Then grab the tarball from services/target/druid-0.7.1-bin.tar.gz.
Follow the [Build From Source](Build-from-source.html) guide to build from source. Then grab the tarball from services/target/druid-<version>-bin.tar.gz.
### Unpack the Tarball
You can extract the content within by issuing:
```
tar -zxvf druid-0.7.1-bin.tar.gz
tar -zxvf druid-<version>-bin.tar.gz
```
If you cd into the directory:
```
cd druid-0.7.1
cd druid-<version>
```
You should see a bunch of files:
@ -80,11 +80,10 @@ Druid requires 3 external dependencies. A "deep storage" that acts as a backup d
#### Set up Zookeeper
```bash
Download zookeeper from [http://www.apache.org/dyn/closer.cgi/zookeeper/](http://www.apache.org/dyn/closer.cgi/zookeeper/)
Install zookeeper.
* Download zookeeper from [http://www.apache.org/dyn/closer.cgi/zookeeper/](http://www.apache.org/dyn/closer.cgi/zookeeper/)
* Install zookeeper.
e.g.
```bash
curl http://www.gtlib.gatech.edu/pub/apache/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz -o zookeeper-3.4.6.tar.gz
tar xzf zookeeper-3.4.6.tar.gz
cd zookeeper-3.4.6
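# A sketch of the remaining steps of a standard ZooKeeper quick start (not necessarily the exact commands used here):
cp conf/zoo_sample.cfg conf/zoo.cfg
./bin/zkServer.sh start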
@ -102,7 +101,7 @@ Let's start doing stuff. You can start an example Druid [Realtime](Realtime.html
./run_example_server.sh
```
Select "2" for the "wikipedia" example.
Select the "wikipedia" example.
Note that the first time you start the example, it may take some extra time due to its fetching various dependencies. Once the node starts up you will see a bunch of logs about setting up properties and connecting to the data source. If everything was successful, you should see messages of the form shown below.

View File

@ -11,7 +11,7 @@ first two tutorials.
## About the Data
We will be working with the same Wikipedia edits data schema [from out previous
We will be working with the same Wikipedia edits data schema [from our previous
tutorials](Tutorial%3A-A-First-Look-at-Druid.html#about-the-data).
## Set Up

View File

@ -45,11 +45,10 @@ CREATE DATABASE druid DEFAULT CHARACTER SET utf8;
#### Set up Zookeeper
```bash
Download zookeeper from [http://www.apache.org/dyn/closer.cgi/zookeeper/](http://www.apache.org/dyn/closer.cgi/zookeeper/)
Install zookeeper.
* Download zookeeper from [http://www.apache.org/dyn/closer.cgi/zookeeper/](http://www.apache.org/dyn/closer.cgi/zookeeper/)
* Install zookeeper.
e.g.
```bash
curl http://www.gtlib.gatech.edu/pub/apache/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz -o zookeeper-3.4.6.tar.gz
tar xzf zookeeper-3.4.6.tar.gz
cd zookeeper-3.4.6

21
docs/content/Tutorials.md Normal file
View File

@ -0,0 +1,21 @@
---
layout: doc_page
---
# Druid Tutorials
We have a series of tutorials to help new users learn to use and operate Druid. We will be adding new tutorials to this list periodically and we encourage the community to contribute tutorials of their own.
## Tutorials
* **[A First Look at Druid](./Tutorial:-A-First-Look-at-Druid.html)**
This tutorial covers a very basic introduction to Druid. You will load some streaming wikipedia data and learn about basic queries.
* **[The Druid Cluster](./Tutorial:-The-Druid-Cluster.html)**
This tutorial goes over the basic operations of the nodes in a Druid cluster and how to start the nodes.
* **[Loading Streaming Data](./Tutorial:-Loading-Streaming-Data.html)**
This tutorial covers loading streaming data into Druid.
* **[Loading Batch Data](./Tutorial:-Loading-Batch-Data.html)**
This tutorial covers loading static (batch) data into Druid.

View File

@ -2,60 +2,132 @@
layout: doc_page
---
# About Druid
# Druid Concepts
Druid is an open-source analytics data store designed for OLAP queries on timeseries data (trillions of events, petabytes of data). Druid provides cost-effective and always-on real-time data ingestion, arbitrary data exploration, and fast data aggregation.
Druid is an open source data store designed for [OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing) queries on time-series data.
This page is meant to provide readers with a high-level overview of how Druid stores data and of the architecture of a Druid cluster.
- Try out Druid with our Getting Started [Tutorial](./Tutorial%3A-A-First-Look-at-Druid.html)
- Learn more by reading the [White Paper](http://static.druid.io/docs/druid.pdf)
## The Data
Key Features
------------
To frame our discussion, let's begin with an example data set (from online advertising):
- **Designed for Analytics** - Druid is built for exploratory analytics for OLAP workflows. It supports a variety of filters, aggregators and query types and provides a framework for plugging in new functionality. Users have leveraged Druid's infrastructure to develop features such as top K queries and histograms.
- **Interactive Queries** - Druid's low-latency data ingestion architecture allows events to be queried milliseconds after they are created. Druid's query latency is optimized by reading and scanning only exactly what is needed. Aggregate and filter on data without sitting around waiting for results. Druid is ideal for powering analytic dashboards.
- **Highly Available** - Druid is used to back SaaS implementations that need to be up all the time. Your data is still available and queryable during system updates. Scale up or down without data loss.
- **Scalable** - Existing Druid deployments handle billions of events and terabytes of data per day. Druid is designed to be petabyte scale.
timestamp publisher advertiser gender country click price
2011-01-01T01:01:35Z bieberfever.com google.com Male USA 0 0.65
2011-01-01T01:03:53Z bieberfever.com google.com Male USA 0 0.62
2011-01-01T01:04:51Z bieberfever.com google.com Male USA 1 0.45
2011-01-01T01:00:00Z ultratrimfast.com google.com Female UK 0 0.87
2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 0 0.99
2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 1 1.53
This data set is composed of three distinct components. If you are acquainted with OLAP terminology, the following concepts should be familiar.
* **Timestamp column**: We treat timestamp separately because all of our queries
center around the time axis.
* **Dimension columns**: We have four dimensions of publisher, advertiser, gender, and country.
They each represent an axis of the data that we've chosen to slice across.
* **Metric columns**: These are clicks and price. These represent values, usually numeric,
which are derived from an aggregation operation such as count, sum, and mean.
They are also known as measures in standard OLAP terminology.
Individually, the events are not very interesting; summarizations of this type of data, however, can yield many useful insights.
Druid summarizes this raw data at ingestion time using a process we refer to as "roll-up".
Roll-up is a first-level aggregation operation over a selected set of dimensions, equivalent to (in pseudocode):
GROUP BY timestamp, publisher, advertiser, gender, country
:: impressions = COUNT(1), clicks = SUM(click), revenue = SUM(price)
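In SQL terms, this roll-up is roughly equivalent to the following sketch (assuming a raw table named `events` and timestamps already truncated to the roll-up granularity):
```sql
SELECT
  timestamp, publisher, advertiser, gender, country,
  COUNT(*)   AS impressions,
  SUM(click) AS clicks,
  SUM(price) AS revenue
FROM events
GROUP BY timestamp, publisher, advertiser, gender, country
```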
The compacted version of our original raw data looks something like this:
timestamp publisher advertiser gender country impressions clicks revenue
2011-01-01T01:00:00Z ultratrimfast.com google.com Male USA 1800 25 15.70
2011-01-01T01:00:00Z bieberfever.com google.com Male USA 2912 42 29.18
2011-01-01T02:00:00Z ultratrimfast.com google.com Male UK 1953 17 17.31
2011-01-01T02:00:00Z bieberfever.com google.com Male UK 3194 170 34.01
In practice, we see that rolling up data can dramatically reduce the size of data that needs to be stored (up to a factor of 100).
This storage reduction does come at a cost; as we roll up data, we lose the ability to query individual events. Phrased another way,
the rollup granularity is the minimum granularity you will be able to query data at. Hence, Druid ingestion specs define this granularity as the `queryGranularity` of the data.
The lowest `queryGranularity` is millisecond.
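For illustration, the relevant portion of an ingestion spec's `granularitySpec` might look like this minimal sketch (`HOUR` and `MINUTE` are example values only):
```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "HOUR",
  "queryGranularity": "MINUTE"
}
```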
Druid is designed to perform single-table operations and does not currently support joins.
Many production setups therefore perform joins at the ETL level, since data must be denormalized before it is loaded into Druid.
## Sharding the Data
Druid shards are called `segments` and Druid always first shards data by time. In our compacted data set, we can create two segments, one for each hour of data.
For example:
Segment `sampleData_2011-01-01T01:00:00:00Z_2011-01-01T02:00:00:00Z_v1_0` contains
2011-01-01T01:00:00Z ultratrimfast.com google.com Male USA 1800 25 15.70
2011-01-01T01:00:00Z bieberfever.com google.com Male USA 2912 42 29.18
Why Druid?
----------
Segment `sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_0` contains
Druid was originally created to resolve query latency issues seen when trying to use Hadoop to power an interactive service. It's especially useful if you are summarizing your data sets and then querying the summarizations. Put your summarizations into Druid and get fast queries from a system that you can be confident will scale up as your data volumes increase. Deployments have ingested and aggregated many TBs of data per hour at peak, in real time.
2011-01-01T02:00:00Z ultratrimfast.com google.com Male UK 1953 17 17.31
2011-01-01T02:00:00Z bieberfever.com google.com Male UK 3194 170 34.01
Druid is a system that you can set up in your organization next to Hadoop. It provides the ability to access your data in an interactive slice-and-dice fashion. It trades off some query flexibility and takes over the storage format in order to provide the speed.
Segments are self-contained containers for the time interval of data they hold. Segments
contain data stored in a compressed, column-oriented format, along with the indexes for those columns. Druid queries only understand how to
scan segments.
We have more details about the general design of the system and why you might want to use it in our [White Paper](http://static.druid.io/docs/druid.pdf) or in our [Design](Design.html) doc.
Segments are uniquely identified by a datasource, interval, version, and an optional partition number.
## Indexing the Data
When Druid?
----------
Druid gets its speed in part from how it stores data. Borrowing ideas from search infrastructure,
Druid creates immutable snapshots of data, stored in data structures highly optimized for analytic queries.
* You need to do interactive aggregations and fast exploration on large amounts of data
* You need ad-hoc analytic queries (not a key-value store)
* You have a lot of data (10s of billions of events added per day, 10s of TB of data added per day)
* You want to do your analysis on data as it's happening (in real time)
* You need a data store that is always available with no single point of failure.
Druid is a column store, which means each individual column is stored separately. Only the columns that pertain to a query are used
in that query, and Druid is pretty good about only scanning exactly what it needs for a query.
Different columns can also employ different compression methods. Different columns can also have different indexes associated with them.
Architecture Overview
---------------------
Druid indexes data on a per shard (segment) level.
Druid is partially inspired by search infrastructure and creates mostly immutable views of data and stores the data in data structures highly optimized for aggregations and filters. A Druid cluster is composed of various types of nodes, each designed to do a small set of things very well.
## Loading the Data
Druid vs…
----------
Druid has two means of ingestion: real-time and batch. Real-time ingestion in Druid is best-effort. Exactly-once semantics are not guaranteed for real-time ingestion, although supporting them is on our roadmap.
Batch ingestion provides exactly-once guarantees, and segments created via batch processing will accurately reflect the ingested data.
One common approach to operating Druid is to have a real-time pipeline for recent insights, and a batch pipeline for the accurate copy of the data.
* [Druid-vs-Impala-or-Shark](Druid-vs-Impala-or-Shark.html)
* [Druid-vs-Redshift](Druid-vs-Redshift.html)
* [Druid-vs-Vertica](Druid-vs-Vertica.html)
* [Druid-vs-Cassandra](Druid-vs-Cassandra.html)
* [Druid-vs-Hadoop](Druid-vs-Hadoop.html)
* [Druid-vs-Spark](Druid-vs-Spark.html)
* [Druid-vs-Elasticsearch](Druid-vs-Elasticsearch.html)
## The Druid Cluster
About This Page
----------
The data infrastructure world is vast, confusing and constantly in flux. This page is meant to help potential evaluators decide whether Druid is a good fit for the problem one needs to solve. If anything about it is incorrect please provide that feedback on the mailing list or via some other means so we can fix it.
A Druid Cluster is composed of several different types of nodes. Each node is designed to do a small set of things very well.
* **Historical Nodes** Historical nodes commonly form the backbone of a Druid cluster. Historical nodes download immutable segments locally and serve queries over those segments.
The nodes have a shared-nothing architecture and know how to load segments, drop segments, and serve queries on segments.
* **Broker Nodes** Broker nodes are what clients and applications query to get data from Druid. Broker nodes are responsible for scattering queries and gathering and merging results.
Broker nodes know what segments live where.
* **Coordinator Nodes** Coordinator nodes manage segments on historical nodes in a cluster. Coordinator nodes tell historical nodes to load new segments, drop old segments, and move segments to load balance.
* **Real-time Processing** Real-time processing in Druid can currently be done using standalone realtime nodes or using the indexing service. The real-time logic is common between these two services.
Real-time processing involves ingesting data, indexing the data (creating segments), and handing segments off to historical nodes. Data is queryable as soon as it is
ingested by the realtime processing logic. The hand-off process is also lossless; data remains queryable throughout the entire process.
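As a rough sketch of how these node types are typically started (assuming the tarball layout with a `config` directory per node type and the Druid jars under `lib`):
```bash
java -cp "config/_common:config/historical:lib/*"  io.druid.cli.Main server historical
java -cp "config/_common:config/broker:lib/*"      io.druid.cli.Main server broker
java -cp "config/_common:config/coordinator:lib/*" io.druid.cli.Main server coordinator
java -cp "config/_common:config/realtime:lib/*"    io.druid.cli.Main server realtime
```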
### External Dependencies
Druid has a few external dependencies for cluster operations.
* **Zookeeper** Druid relies on Zookeeper for intra-cluster communication.
* **Metadata Storage** Druid relies on a metadata storage to store metadata about segments and configuration. Services that create segments write new entries to the metadata store
and the coordinator nodes monitor the metadata store to know when new data needs to be loaded or old data needs to be dropped. The metadata store is not
involved in the query path. MySQL and PostgreSQL are popular metadata stores.
* **Deep Storage** Deep storage acts as a permanent backup of segments. Services that create segments upload segments to deep storage and historical nodes download
segments from deep storage. Deep storage is not involved in the query path. S3 and HDFS are popular deep storages.
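For example, a cluster's common runtime properties typically point at these dependencies with settings along these lines (a sketch only; hosts, credentials, and paths are placeholders):
```
# ZooKeeper
druid.zk.service.host=zk.example.com

# Metadata storage (MySQL in this sketch)
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://db.example.com:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd

# Deep storage (HDFS in this sketch)
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
```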
### High Availability Characteristics
Druid is designed to have no single point of failure. Different node types are able to fail without impacting the services of the other node types. To run a highly available Druid cluster, you should have at least 2 nodes of every node type running.
### Comprehensive Architecture
For a comprehensive look at Druid architecture, please read our [white paper](http://static.druid.io/docs/druid.pdf).

View File

@ -3,95 +3,76 @@
<link rel="stylesheet" href="css/toc.css">
h2. Introduction
* "About Druid":./
* "Design":./Design.html
* "Concepts and Terminology":./Concepts-and-Terminology.html
h2. Getting Started
* "Tutorial: A First Look at Druid":./Tutorial:-A-First-Look-at-Druid.html
* "Tutorial: The Druid Cluster":./Tutorial:-The-Druid-Cluster.html
* "Tutorial: Loading Streaming Data":./Tutorial:-Loading-Streaming-Data.html
* "Tutorial: Loading Batch Data":./Tutorial:-Loading-Batch-Data.html
h2. Booting a Druid Cluster
* "Simple Cluster Configuration":Simple-Cluster-Configuration.html
* "Production Cluster Configuration":Production-Cluster-Configuration.html
* "Production Hadoop Configuration":Hadoop-Configuration.html
* "Rolling Cluster Updates":Rolling-Updates.html
* "Recommendations":Recommendations.html
h2. Configuration
* "Common Configuration":Configuration.html
* "Indexing Service":Indexing-Service-Config.html
* "Coordinator":Coordinator-Config.html
* "Historical":Historical-Config.html
* "Broker":Broker-Config.html
* "Realtime":Realtime-Config.html
* "Configuring Logging":./Logging.html
* "Concepts":./
* "Hello, Druid":./Tutorial:-A-First-Look-at-Druid.html
* "Tutorials":./Tutorials.html
* "Evaluate Druid":./Evaluate.html
h2. Data Ingestion
* "Ingestion FAQ":./Ingestion-FAQ.html
* "Realtime":./Realtime-ingestion.html
* "Batch":./Batch-ingestion.html
** "Different Hadoop Versions":./Other-Hadoop.html
* "Indexing Service":./Indexing-Service.html
** "Tasks":./Tasks.html
* "Overview":./Ingestion-Overview.html
* "Data Formats":./Data_formats.html
h2. Operations
* "Performance FAQ":./Performance-FAQ.html
* "Extending Druid":./Modules.html
* "Druid Metrics":./Metrics.html
* "Realtime Ingestion":./Realtime-ingestion.html
* "Batch Ingestion":./Batch-ingestion.html
* "FAQ":./Ingestion-FAQ.html
h2. Querying
* "Querying":./Querying.html
** "Filters":./Filters.html
** "Aggregations":./Aggregations.html
** "Post Aggregations":./Post-aggregations.html
** "Granularities":./Granularities.html
** "DimensionSpecs":./DimensionSpecs.html
* Query Types
** "DataSource Metadata":./DataSourceMetadataQuery.html
** "GroupBy":./GroupByQuery.html
*** "LimitSpec":./LimitSpec.html
*** "Having":./Having.html
** "Search":./SearchQuery.html
*** "SearchQuerySpec":./SearchQuerySpec.html
** "Segment Metadata":./SegmentMetadataQuery.html
** "Time Boundary":./TimeBoundaryQuery.html
** "Timeseries":./TimeseriesQuery.html
** "TopN":./TopNQuery.html
*** "TopNMetricSpec":./TopNMetricSpec.html
* "Overview":./Querying.html
* "Timeseries":./TimeseriesQuery.html
* "TopN":./TopNQuery.html
* "GroupBy":./GroupByQuery.html
* "Time Boundary":./TimeBoundaryQuery.html
* "Segment Metadata":./SegmentMetadataQuery.html
* "DataSource Metadata":./DataSourceMetadataQuery.html
* "Search":./SearchQuery.html
h2. Architecture
* "Design":./Design.html
* "Segments":./Segments.html
h2. Design
* Storage
** "Segments":./Segments.html
* Node Types
** "Historical":./Historical.html
** "Broker":./Broker.html
** "Coordinator":./Coordinator.html
*** "Rule Configuration":./Rule-Configuration.html
** "Realtime":./Realtime.html
** "Indexing Service":./Indexing-Service.html
*** "Middle Manager":./Middlemanager.html
*** "Peon":./Peons.html
* External Dependencies
** "Realtime":./Realtime.html
* Dependencies
** "Deep Storage":./Deep-Storage.html
** "Metadata Storage":./Metadata-storage.html
** "ZooKeeper":./ZooKeeper.html
h2. Experimental
* "About Experimental Features":./About-Experimental-Features.html
* "Geographic Queries":./GeographicQueries.html
* "Select Query":./SelectQuery.html
* "Approximate Histograms and Quantiles":./ApproxHisto.html
* "Router node":./Router.html
h2. Operations
* "Good Practices":Recommendations.html
* "Including Extensions":./Including-Extensions.html
* "Data Retention":./Rule-Configuration.html
* "Metrics and Monitoring":./Metrics.html
* "Updating the Cluster":./Rolling-Updates.html
* "Different Hadoop Versions":./Other-Hadoop.html
* "Performance FAQ":./Performance-FAQ.html
h2. Configuration
* "Common Configuration":./Configuration.html
* "Indexing Service":./Indexing-Service-Config.html
* "Coordinator":./Coordinator-Config.html
* "Historical":./Historical-Config.html
* "Broker":./Broker-Config.html
* "Realtime":./Realtime-Config.html
* "Configuring Logging":./Logging.html
* "Simple Cluster Configuration":./Simple-Cluster-Configuration.html
* "Production Cluster Configuration":./Production-Cluster-Configuration.html
* "Production Hadoop Configuration":./Hadoop-Configuration.html
h2. Development
* "Versioning":./Versioning.html
* "Build From Source":./Build-from-source.html
* "Libraries":./Libraries.html
* "Extending Druid":./Modules.html
* "Build From Source":./Build-from-source.html
* "Versioning":./Versioning.html
* Experimental Features
** "Overview":./About-Experimental-Features.html
** "Geographic Queries":./GeographicQueries.html
** "Select Query":./SelectQuery.html
** "Approximate Histograms and Quantiles":./ApproxHisto.html
** "Router node":./Router.html
h2. Misc
* "Papers & Talks":./Papers-and-talks.html
* "Thanks":/thanks.html

View File

@ -1,89 +0,0 @@
{
"type" : "index_realtime",
"spec" : {
"dataSchema": {
"dataSource": "wikipedia",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "auto"
},
"dimensionsSpec": {
"dimensions": [
"page",
"language",
"user",
"unpatrolled",
"newPage",
"robot",
"anonymous",
"namespace",
"continent",
"country",
"region",
"city"
],
"dimensionExclusions": [],
"spatialDimensions": []
}
}
},
"metricsSpec": [
{
"type": "count",
"name": "count"
},
{
"type": "doubleSum",
"name": "added",
"fieldName": "added"
},
{
"type": "doubleSum",
"name": "deleted",
"fieldName": "deleted"
},
{
"type": "doubleSum",
"name": "delta",
"fieldName": "delta"
}
],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "DAY",
"queryGranularity": "NONE"
}
},
"ioConfig": {
"type": "realtime",
"firehose": {
"type": "kafka-0.8",
"consumerProps": {
"zookeeper.connect": "localhost:2181",
"zookeeper.connection.timeout.ms" : "15000",
"zookeeper.session.timeout.ms" : "15000",
"zookeeper.sync.time.ms" : "5000",
"group.id": "druid-example",
"fetch.message.max.bytes" : "1048586",
"auto.offset.reset": "largest",
"auto.commit.enable": "false"
},
"feed": "wikipedia"
}
},
"tuningConfig": {
"type": "realtime",
"maxRowsInMemory": 500000,
"intermediatePersistPeriod": "PT10m",
"windowPeriod": "PT10m",
"basePersistDirectory": "\/tmp\/realtime\/basePersist",
"rejectionPolicy": {
"type": "serverTime"
}
}
}
}

View File

@ -26,7 +26,7 @@
"ts"
],
"dimensionExclusions": [
],
"spatialDimensions": [
{
@ -154,4 +154,4 @@
}
}
}
]
]