Merge pull request #2398 from implydata/doc-render

Numerous fixes to enables docs to render correctly
This commit is contained in:
Fangjin Yang 2016-02-04 15:14:18 -08:00
commit 8ac77a9644
28 changed files with 351 additions and 385 deletions

View File

@ -5,14 +5,13 @@ layout: doc_page
# Caching
Caching can optionally be enabled on the broker, historical, and realtime
nodes, as well as realtime index tasks. See [broker](broker.html#caching),
[historical](historical.html#caching), and [realtime](realtime.html#caching)
configuration options for how to enable it for individual node types.
processing. See [broker](broker.html#caching),
[historical](historical.html#caching), and [realtime](realtime.html#caching)
configuration options for how to enable it for different processes.
Druid uses a local in-memory cache by default, unless a diffrent type of cache is specified.
Use the `druid.cache.type` configuration to set a different kind of cache.
## Cache configuration
Cache settings are set globally, so the same configuration can be re-used

View File

@ -6,16 +6,7 @@ Logging
Druid nodes will emit logs that are useful for debugging to the console. Druid nodes also emit periodic metrics about their state. For more about metrics, see [Configuration](../configuration/index.html). Metric logs are printed to the console by default, and can be disabled with `-Ddruid.emitter.logging.logLevel=debug`.
Druid uses [log4j2](http://logging.apache.org/log4j/2.x/) for logging. Logging can be configured with a log4j2.xml file. Add the path to the directory containing the log4j2.xml file (eg a config dir) to your classpath if you want to override default Druid log configuration. Note that this directory should be earlier in the classpath than the druid jars. The easiest way to do this is to prefix the classpath with the config dir. For example, if the log4j2.xml file is in config/_common:
```bash
java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
-Ddruid.realtime.specFile=examples/indexing/wikipedia.spec \
-classpath "config/_common:config/realtime:lib/*" \
io.druid.cli.Main server realtime
```
Note the "-classpath" in this example has the config dir before the jars under lib/*.
Druid uses [log4j2](http://logging.apache.org/log4j/2.x/) for logging. Logging can be configured with a log4j2.xml file. Add the path to the directory containing the log4j2.xml file (e.g. the _common/ dir) to your classpath if you want to override default Druid log configuration. Note that this directory should be earlier in the classpath than the druid jars. The easiest way to do this is to prefix the classpath with the config dir.
To enable java logging to go through log4j2, set the `-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager` server parameter.

View File

@ -4,40 +4,40 @@ layout: doc_page
Production Cluster Configuration
================================
```note-info
This configuration is an example of what a production cluster could look like. Many other hardware combinations are
<div class="note info">
This configuration is an example of what a production cluster could look like. Many other hardware combinations are
possible! Cheaper hardware is absolutely possible.
```
</div>
This production Druid cluster assumes that metadata storage and Zookeeper are already set up. The deep storage that is
This production Druid cluster assumes that metadata storage and Zookeeper are already set up. The deep storage that is
used for examples is [S3](https://aws.amazon.com/s3/) and [memcached](http://memcached.org/) is used for a distributed cache.
```note-info
The nodes in this example do not need to be on their own individual servers. Overlord and Coordinator nodes should be
co-located on the same hardware.
```
<div class="note info">
The nodes in this example do not need to be on their own individual servers. Overlord and Coordinator nodes should be
co-located on the same hardware.
</div>
The nodes that respond to queries (Historical, Broker, and MiddleManager nodes) will use as many cores as are available,
depending on usage, so it is best to keep these on dedicated machines. The upper limit of effectively utilized cores is
not well characterized yet and would depend on types of queries, query load, and the schema. Historical daemons should
have a heap size of at least 1GB per core for normal usage, but could be squeezed into a smaller heap for testing.
Since in-memory caching is essential for good performance, even more RAM is better.
Broker nodes will use RAM for caching, so they do more than just route queries.
The nodes that respond to queries (Historical, Broker, and MiddleManager nodes) will use as many cores as are available,
depending on usage, so it is best to keep these on dedicated machines. The upper limit of effectively utilized cores is
not well characterized yet and would depend on types of queries, query load, and the schema. Historical daemons should
have a heap size of at least 1GB per core for normal usage, but could be squeezed into a smaller heap for testing.
Since in-memory caching is essential for good performance, even more RAM is better.
Broker nodes will use RAM for caching, so they do more than just route queries.
SSDs are highly recommended for Historical nodes when all they have more segments loaded than available memory.
The nodes that are responsible for coordination (Coordinator and Overlord nodes) require much less processing.
The effective utilization of cores by Zookeeper, metadata storage, and Coordinator nodes is likely to be between 1 and 2
for each process/daemon, so these could potentially share a machine with lots of cores. These daemons work with heap
The effective utilization of cores by Zookeeper, metadata storage, and Coordinator nodes is likely to be between 1 and 2
for each process/daemon, so these could potentially share a machine with lots of cores. These daemons work with heap
size between 500MB and 1GB.
We'll use [EC2](https://aws.amazon.com/ec2/) r3.8xlarge nodes for query facing nodes and m1.xlarge nodes for coordination nodes.
The following examples work relatively well in production, however, a more optimized tuning for the nodes we selected and
We'll use [EC2](https://aws.amazon.com/ec2/) r3.8xlarge nodes for query facing nodes and m1.xlarge nodes for coordination nodes.
The following examples work relatively well in production, however, a more optimized tuning for the nodes we selected and
more optimal hardware for a Druid cluster are both definitely possible.
```note-caution
<div class="note caution">
For high availability, there should be at least a redundant copy of every process running on separate hardware.
```
</div>
### Common Configuration (common.runtime.properties)

View File

@ -3,11 +3,6 @@
layout: doc_page
---
```note-caution
If you are doing stream-pull based ingestion, we suggest using [stream-pushed](../ingestion/stream-push.html) based ingestion instead and not
using real-time nodes.
```
Realtime Node Configuration
==============================
For general Realtime Node information, see [here](../design/realtime.html).

View File

@ -154,14 +154,16 @@ an issue).
The `payload` column stores a JSON blob that has all of the metadata for the segment (some of the data stored in this payload is redundant with some of the columns in the table, that is intentional). This looks something like
```
```json
{
"dataSource":"wikipedia",
"interval":"2012-05-23T00:00:00.000Z/2012-05-24T00:00:00.000Z",
"version":"2012-05-24T00:10:00.046Z",
"loadSpec":{"type":"s3_zip",
"bucket":"bucket_for_segment",
"key":"path/to/segment/on/s3"},
"loadSpec":{
"type":"s3_zip",
"bucket":"bucket_for_segment",
"key":"path/to/segment/on/s3"
},
"dimensions":"comma-delimited-list-of-dimension-names",
"metrics":"comma-delimited-list-of-metric-names",
"shardSpec":{"type":"none"},
@ -197,13 +199,12 @@ The Audit table is used to store the audit history for configuration changes
e.g rule changes done by [Coordinator](../design/coordinator.html) and other
config changes.
+
+##Accessed By: ##
+
+The Metadata Storage is accessed only by:
+
+1. Realtime Nodes
2. Indexing Service Nodes (if any)
+3. Coordinator Nodes
+
+Thus you need to give permissions (eg in AWS Security Groups) only for these machines to access the Metadata storage.
##Accessed By: ##
The Metadata Storage is accessed only by:
1. Indexing Service Nodes (if any)
2. Realtime Nodes (if any)
3. Coordinator Nodes
Thus you need to give permissions (eg in AWS Security Groups) only for these machines to access the Metadata storage.

View File

@ -1,45 +0,0 @@
---
layout: doc_page
---
Concepts and Terminology
========================
The following definitions are given with respect to the Druid data store. They are intended to help you better understand the Druid documentation, where these terms and concepts occur.
More definitions are available on the [design page](../design/design.html).
* **Aggregation**&nbsp;&nbsp;The summarizing of data meeting certain specifications. Druid aggregates [timeseries data](#timeseries), which in effect compacts the data. Time intervals (set in configuration) are used to create buckets, while [timestamps](#timestamp) determine which buckets data aggregated in.
* **Aggregators**&nbsp;&nbsp;A mechanism for combining records during realtime incremental indexing, Hadoop batch indexing, and in queries.
* **Compute node**&nbsp;&nbsp;Obsolete name for a [Historical node](../design/historical.html).
* **DataSource**&nbsp;&nbsp;A table-like view of data; specified in [specFiles](#specfile) and in queries. A dataSource specifies the source of data being ingested and ultimately stored in [segments](#segment).
* **Dimensions**&nbsp;&nbsp;Aspects or categories of data, such as languages or locations. For example, with *language* and *country* as the type of dimension, values could be "English" or "Mandarin" for language, or "USA" or "China" for country. In Druid, dimensions can serve as filters for narrowing down hits (for example, language = "English" or country = "China").
* **Ephemeral Node**&nbsp;&nbsp;A Zookeeper node (or "znode") that exists for as long as the session that created the znode is active. More info [here](http://zookeeper.apache.org/doc/r3.2.1/zookeeperProgrammers.html#Ephemeral+Nodes). In a Druid cluster, ephemeral nodes are typically used for commands (such as assigning [segments](#segment) to certain nodes).
* **Granularity**&nbsp;&nbsp;The time interval corresponding to aggregation by time. Druid configuration settings specify the granularity of [timestamp](#timestamp) buckets in a [segment](#segment) (for example, by minute or by hour), as well as the granularity of the segment itself. The latter is essentially the overall range of absolute time covered by the segment. In queries, granularity settings control the summarization of findings.
* **Ingestion**&nbsp;&nbsp;The pulling and initial storing and processing of data. Druid supports realtime and batch ingestion of data, and applies indexing in both cases.
* **Master node**&nbsp;&nbsp;Obsolete name for a [Coordinator node](../design/coordinator.html).
* **Metrics**&nbsp;&nbsp;Countable data that can be aggregated. Metrics, for example, can be the number of visitors to a website, number of tweets per day, or average revenue.
* **Rollup**&nbsp;&nbsp;The aggregation of data that occurs at one or more stages, based on settings in a [configuration file](#specfile).
<a name="segment"></a>
* **Segment**&nbsp;&nbsp;A collection of (internal) records that are stored and processed together. Druid chunks data into segments representing a time interval, and these are stored and manipulated in the cluster.
* **Shard**&nbsp;&nbsp;A sub-partition of the data, allowing multiple [segments](#segment) to represent the data in a certain time interval. Sharding occurs along time partitions to better handle amounts of data that exceed certain limits on segment size, although sharding along dimensions may also occur to optimize efficiency.
<a name="specfile"></a>
* **specFile**&nbsp;&nbsp;The specification for services in JSON format; see [Realtime](../design/realtime.html) and [Batch-ingestion](../ingestion/batch-ingestion.html)
<a name="timeseries"></a>
* **Timeseries Data**&nbsp;&nbsp;Data points which are ordered in time. The closing value of a financial index or the number of tweets per hour with a certain hashtag are examples of timeseries data.
<a name="timestamp"></a>
* **Timestamp**&nbsp;&nbsp;An absolute position on a timeline, given in a standard alpha-numerical format such as with UTC time. [Timeseries data](#timeseries) points can be ordered by timestamp, and in Druid, they are.

View File

@ -31,7 +31,6 @@ Each of the systems, or components, described below also has a dedicated page wi
The node types that currently exist are:
* [**Historical**](../design/historical.html) nodes are the workhorses that handle storage and querying on "historical" data (non-realtime). Historical nodes download segments from deep storage, respond to the queries from broker nodes about these segments, and return results to the broker nodes. They announce themselves and the segments they are serving in Zookeeper, and also use Zookeeper to monitor for signals to load or drop new segments.
* [**Realtime**](../design/realtime.html) nodes ingest data in real time. They are in charge of listening to a stream of incoming data and making it available immediately inside the Druid system. Real-time nodes respond to query requests from Broker nodes, returning query results to those nodes. Aged data is pushed from Realtime nodes to deep storage. Realtime nodes monitor ZooKeeper to discover segments that they've pushed to deep storage have been loaded by Historicals&mdash;if so, they drop those segments.
* [**Coordinator**](../design/coordinator.html) nodes monitor the grouping of historical nodes to ensure that data is available, replicated and in a generally "optimal" configuration. They do this by reading segment metadata information from metadata storage to determine what segments should be loaded in the cluster, using Zookeeper to determine what Historical nodes exist, and creating Zookeeper entries to tell Historical nodes to load and drop new segments.
* [**Broker**](../design/broker.html) nodes receive queries from external clients and forward those queries to Realtime and Historical nodes. When Broker nodes receive results, they merge these results and return them to the caller. For knowing topology, Broker nodes use Zookeeper to determine what Realtime and Historical nodes exist.
* [**Indexing Service**](../design/indexing-service.html) nodes form a cluster of workers to load batch and real-time data into the system as well as allow for alterations to the data stored in the system.

View File

@ -3,8 +3,9 @@ layout: doc_page
---
## DataSketches aggregator
Druid aggregators based on [datasketches]()http://datasketches.github.io/) library. Note that sketch algorithms are approxiate, see details in the "Accuracy" section of the datasketches doc.
At ingestion time, this aggregator creates the theta sketch objects which get stored in Druid segments. Logically speaking, a theta sketch object can be thought of as a Set data structure. At query time, sketches are read and aggregated(set unioned) together. In the end, by default, you receive the estimate of number of unique entries in the sketch object. Also, You can use post aggregators to do union, intersection or difference on sketch columns in the same row.
Druid aggregators based on [datasketches](http://datasketches.github.io/) library. Note that sketch algorithms are approxiate, see details in the "Accuracy" section of the datasketches doc.
At ingestion time, this aggregator creates the theta sketch objects which get stored in Druid segments. Logically speaking, a theta sketch object can be thought of as a Set data structure. At query time, sketches are read and aggregated(set unioned) together. In the end, by default, you receive the estimate of number of unique entries in the sketch object. Also, You can use post aggregators to do union, intersection or difference on sketch columns in the same row.
Note that you can use `thetaSketch` aggregator on columns which were not ingested using same, it will return estimated cardinality of the column. It is recommended to use it at ingestion time as well to make querying faster.
### Aggregators

View File

@ -9,7 +9,7 @@ This page discusses how we can integrate Druid with other technologies.
Event streams can be stored in a distributed message bus such as Kafka and further processed via a distributed stream
processor system such as Storm, Samza, or Spark Streaming. Data processed by the stream processor can feed into Druid using
the [Tranquility](https://github.com/druid-io/tranquility) library. Data can be
the [Tranquility](https://github.com/druid-io/tranquility) library.
<img src="../../img/druid-production.png" width="800"/>

View File

@ -39,26 +39,13 @@ Some great folks have written their own libraries to interact with Druid
#### SQL
* [implydata/plyql](https://github.com/implydata/plyql) - A command line interface for issuing SQL queries to Druid via [plywood](https://github.com/implydata/plywood)
* [srikalyc/Sql4D](https://github.com/srikalyc/Sql4D) - A SQL client for Druid. Used in production at Yahoo.
Community Helper Libraries
--------------------------
* [madvertise/druid-dumbo](https://github.com/madvertise/druid-dumbo) - Scripts to help generate batch configs for the ingestion of data into Druid
* [housejester/druid-test-harness](https://github.com/housejester/druid-test-harness) - A set of scripts to simplify standing up some servers and seeing how things work
* [mingfang/docker-druid](https://github.com/mingfang/docker-druid) - A Dockerfile to run the entire Druid cluster
Other Druid Distributions
-------------------------
* [Imply Analytics Platform](http://imply.io/download) - The Imply Analytics platform repackages Druid, all its dependencies, and an UI and SQL layer.
Tools
---
* [Insert Segments](../../operations/insert-segment-to-db.html) - A tool that can insert segments' metadata into Druid metadata storage.
UIs
---
@ -68,13 +55,20 @@ UIs
* [Metabase](https://github.com/metabase/metabase) - Simple dashboards, charts and query tool for your Druid DB
Tools
---
-----
* [Insert Segments](../../operations/insert-segment-to-db.html) - A tool that can insert segments' metadata into Druid metadata storage.
Other Community Extensions
Community Helper Libraries
--------------------------
* [madvertise/druid-dumbo](https://github.com/madvertise/druid-dumbo) - Scripts to help generate batch configs for the ingestion of data into Druid
* [housejester/druid-test-harness](https://github.com/housejester/druid-test-harness) - A set of scripts to simplify standing up some servers and seeing how things work
* [mingfang/docker-druid](https://github.com/mingfang/docker-druid) - A Dockerfile to run the entire Druid cluster
Community Extensions
--------------------
These are extensions from the community. (If you would like yours listed please speak up!)
* [acesinc/druid-cors-filter-extension](https://github.com/acesinc/druid-cors-filter-extension) - An extension to enable CORS headers in http requests.

View File

@ -70,7 +70,8 @@ of a Druid [overlord](../design/indexing-service.html). A sample task is shown b
"tuningConfig" : {
"type": "hadoop"
}
}
},
"hadoopDependencyCoordinates": <my_hadoop_version>
}
```
@ -261,7 +262,7 @@ with Amazon-specific features such as S3 encryption and consistent views. If you
features, you will need to make the Amazon EMR Hadoop JARs available to Druid through one of the
mechanisms described in the [Using other Hadoop distributions](#using-other-hadoop-distributions) section.
### Using other Hadoop distributions
## Using other Hadoop distributions
Druid works out of the box with many Hadoop distributions.

View File

@ -10,6 +10,13 @@ To run:
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_dir> io.druid.cli.Main index hadoop <spec_file>
```
## Options
- "--coordinate" - provide a version of Hadoop to use. This property will override the default Hadoop coordinates. Once specified, Druid will look for those Hadoop dependencies from the location specified by `druid.extensions.hadoopDependenciesDir`.
- "--no-default-hadoop" - don't pull down the default hadoop version
## Spec file
The spec file needs to contain a JSON object where the contents are the same as the "spec" field in the Hadoop index task.
In addition, the following fields need to be added to the ioConfig:

View File

@ -66,6 +66,8 @@ All forms of Druid ingestion require some form of schema object. The format of t
}
```
If you have nested JSON, [Druid can automatically flatten it for you](flatten-json.html).
### CSV
Since the CSV data cannot contain the column names (no header is allowed), these must be added before that data can be processed:

View File

@ -93,7 +93,7 @@ If `type` is not included, the parser defaults to `string`.
### Avro Stream Parser
This is for realtime ingestion.
This is for realtime ingestion. Make sure to include "io.druid.extensions:druid-avro-extensions" as an extension.
| Field | Type | Description | Required |
|-------|------|-------------|----------|
@ -102,6 +102,7 @@ This is for realtime ingestion.
| parseSpec | JSON Object | Specifies the format of the data. | yes |
For example, using Avro stream parser with schema repo Avro bytes decoder:
```json
"parser" : {
"type" : "avro_stream",
@ -116,11 +117,7 @@ For example, using Avro stream parser with schema repo Avro bytes decoder:
"url" : "${YOUR_SCHEMA_REPO_END_POINT}",
}
},
"parsSpec" : {
"format" : "timeAndDims",
"timestampSpec" : {},
"dimensionsSpec" : {}
}
"parseSpec" : <standard_druid_parseSpec>
}
```
@ -155,7 +152,7 @@ This Avro bytes decoder first extract `subject` and `id` from input message byte
### Avro Hadoop Parser
This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"io.druid.data.input.avro.AvroValueInputFormat"`. You may want to set Avro reader's schema in `jobProperties` in `tuningConfig`, eg: `"avro.schema.path.input.value": "/path/to/your/schema.avsc"` or `"avro.schema.input.value": "your_schema_JSON_object"`, if reader's schema is not set, the schema in Avro object container file will be used, see [Avro specification](http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution).
This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"io.druid.data.input.avro.AvroValueInputFormat"`. You may want to set Avro reader's schema in `jobProperties` in `tuningConfig`, eg: `"avro.schema.path.input.value": "/path/to/your/schema.avsc"` or `"avro.schema.input.value": "your_schema_JSON_object"`, if reader's schema is not set, the schema in Avro object container file will be used, see [Avro specification](http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution). Make sure to include "io.druid.extensions:druid-avro-extensions" as an extension.
| Field | Type | Description | Required |
|-------|------|-------------|----------|
@ -164,20 +161,16 @@ This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `
| fromPigAvroStorage | Boolean | Specifies whether the data file is stored using AvroStorage. | no(default == false) |
For example, using Avro Hadoop parser with custom reader's schema file:
```json
{
"type" : "index_hadoop",
"hadoopDependencyCoordinates" : ["io.druid.extensions:druid-avro-extensions"],
"type" : "index_hadoop",
"spec" : {
"dataSchema" : {
"dataSource" : "",
"parser" : {
"type" : "avro_hadoop",
"parsSpec" : {
"format" : "timeAndDims",
"timestampSpec" : {},
"dimensionsSpec" : {}
}
"parseSpec" : <standard_druid_parseSpec>
}
},
"ioConfig" : {

View File

@ -4,7 +4,7 @@ layout: doc_page
# Loading streams
Streams can be ingested in Druid using either [Tranquility](https://github.com/druid-io/tranquility (a Druid-aware
Streams can be ingested in Druid using either [Tranquility](https://github.com/druid-io/tranquility) (a Druid-aware
client) and the [indexing service](../design/indexing-service.html) or through standalone [Realtime nodes](../design/realtime.html).
The first approach will be more complex to set up, but also offers scalability and high availability characteristics that advanced production
setups may require. The second approach has some known [limitations](../ingestion/stream-pull.html#limitations).

View File

@ -25,7 +25,6 @@ For Real-time Node Configuration, see [Realtime Configuration](../configuration/
For writing your own plugins to the real-time node, see [Firehose](../ingestion/firehose.html).
<a id="realtime-specfile"></a>
## Realtime "specFile"
The property `druid.realtime.specFile` has the path of a file (absolute or relative path and file name) with realtime specifications in it. This "specFile" should be a JSON Array of JSON objects like the following:
@ -166,7 +165,7 @@ The following policies are available:
* `none` &ndash; All events are accepted. Never hands off data unless shutdown() is called on the configured firehose.
####<a id="sharding"></a> Sharding
#### Sharding
Druid uses shards, or segments with partition numbers, to more efficiently handle large amounts of incoming data. In Druid, shards represent the segments that together cover a time interval based on the value of `segmentGranularity`. If, for example, `segmentGranularity` is set to "hour", then a number of shards may be used to store the data for that hour. Sharding along dimensions may also occur to optimize efficiency.

View File

@ -4,24 +4,24 @@ layout: doc_page
## Stream Push
Druid can connect to any streaming data source through
[Tranquility](https://github.com/druid-io/tranquility/blob/master/README.md), a package for pushing
Druid can connect to any streaming data source through
[Tranquility](https://github.com/druid-io/tranquility/blob/master/README.md), a package for pushing
streams to Druid in real-time. Druid does not come bundled with Tranquility, and you will have to download the distribution.
```note-info
<div class="note info">
If you've never loaded streaming data into Druid, we recommend trying out the
[stream loading tutorial](../tutorials/tutorial-streams.html) first and then coming back to this page.
```
<a href="../tutorials/tutorial-streams.html">stream loading tutorial</a> first and then coming back to this page.
</div>
Note that with all streaming ingestion options, you must ensure that incoming data is recent
enough (within a [configurable windowPeriod](#segmentgranularity-and-windowperiod) of the current
time). Older messages will not be processed in real-time. Historical data is best processed with
Note that with all streaming ingestion options, you must ensure that incoming data is recent
enough (within a [configurable windowPeriod](#segmentgranularity-and-windowperiod) of the current
time). Older messages will not be processed in real-time. Historical data is best processed with
[batch ingestion](../ingestion/batch-ingestion.html).
### Server
Druid can use [Tranquility Server](https://github.com/druid-io/tranquility/blob/master/docs/server.md), which
lets you send data to Druid without developing a JVM app. You can run Tranquility server colocated with Druid middleManagers
Druid can use [Tranquility Server](https://github.com/druid-io/tranquility/blob/master/docs/server.md), which
lets you send data to Druid without developing a JVM app. You can run Tranquility server colocated with Druid middleManagers
and historical processes.
Tranquility server is started by issuing:
@ -33,7 +33,7 @@ bin/tranquility server -configFile <path_to_config_file>/server.json
To customize Tranquility Server:
- In `server.json`, customize the `properties` and `dataSources`.
- If you have servers already running Tranquility, stop them (CTRL-C) and start
- If you have servers already running Tranquility, stop them (CTRL-C) and start
them up again.
For tips on customizing `server.json`, see the
@ -42,8 +42,8 @@ For tips on customizing `server.json`, see the
### Kafka
[Tranquility Kafka](https://github.com/druid-io/tranquility/blob/master/docs/kafka.md)
lets you load data from Kafka into Druid without writing any code. You only need a configuration
[Tranquility Kafka](https://github.com/druid-io/tranquility/blob/master/docs/kafka.md)
lets you load data from Kafka into Druid without writing any code. You only need a configuration
file.
Tranquility server is started by issuing:
@ -57,80 +57,80 @@ To customize Tranquility Kafka in the single-machine quickstart configuration:
- In `kafka.json`, customize the `properties` and `dataSources`.
- If you have Tranquility already running, stop it (CTRL-C) and start it up again.
For tips on customizing `kafka.json`, see the
For tips on customizing `kafka.json`, see the
[Tranquility Kafka documentation](https://github.com/druid-io/tranquility/blob/master/docs/kafka.md).
### JVM apps and stream processors
Tranquility can also be embedded in JVM-based applications as a library. You can do this directly
in your own program using the
[Core API](https://github.com/druid-io/tranquility/blob/master/docs/core.md), or you can use
the connectors bundled in Tranquility for popular JVM-based stream processors such as
[Storm](https://github.com/druid-io/tranquility/blob/master/docs/storm.md),
[Samza](https://github.com/druid-io/tranquility/blob/master/docs/samza.md),
[Spark Streaming](https://github.com/druid-io/tranquility/blob/master/docs/spark.md), and
Tranquility can also be embedded in JVM-based applications as a library. You can do this directly
in your own program using the
[Core API](https://github.com/druid-io/tranquility/blob/master/docs/core.md), or you can use
the connectors bundled in Tranquility for popular JVM-based stream processors such as
[Storm](https://github.com/druid-io/tranquility/blob/master/docs/storm.md),
[Samza](https://github.com/druid-io/tranquility/blob/master/docs/samza.md),
[Spark Streaming](https://github.com/druid-io/tranquility/blob/master/docs/spark.md), and
[Flink](https://github.com/druid-io/tranquility/blob/master/docs/flink.md).
## Concepts
### Task creation
Tranquility automates creation of Druid realtime indexing tasks, handling partitioning, replication,
service discovery, and schema rollover for you, seamlessly and without downtime. You never have to
write code to deal with individual tasks directly. But, it can be helpful to understand how
Tranquility automates creation of Druid realtime indexing tasks, handling partitioning, replication,
service discovery, and schema rollover for you, seamlessly and without downtime. You never have to
write code to deal with individual tasks directly. But, it can be helpful to understand how
Tranquility creates tasks.
Tranquility spawns relatively short-lived tasks periodically, and each one handles a small number of
[Druid segments](../design/segments.html). Tranquility coordinates all task
creation through ZooKeeper. You can start up as many Tranquility instances as you like with the same
Tranquility spawns relatively short-lived tasks periodically, and each one handles a small number of
[Druid segments](../design/segments.html). Tranquility coordinates all task
creation through ZooKeeper. You can start up as many Tranquility instances as you like with the same
configuration, even on different machines, and they will send to the same set of tasks.
See the [Tranquility overview](https://github.com/druid-io/tranquility/blob/master/docs/overview.md)
See the [Tranquility overview](https://github.com/druid-io/tranquility/blob/master/docs/overview.md)
for more details about how Tranquility manages tasks.
### segmentGranularity and windowPeriod
The segmentGranularity is the time period covered by the segments produced by each task. For
example, a segmentGranularity of "hour" will spawn tasks that create segments covering one hour
The segmentGranularity is the time period covered by the segments produced by each task. For
example, a segmentGranularity of "hour" will spawn tasks that create segments covering one hour
each.
The windowPeriod is the slack time permitted for events. For example, a windowPeriod of ten minutes
(the default) means that any events with a timestamp older than ten minutes in the past, or more
The windowPeriod is the slack time permitted for events. For example, a windowPeriod of ten minutes
(the default) means that any events with a timestamp older than ten minutes in the past, or more
than ten minutes in the future, will be dropped.
These are important configurations because they influence how long tasks will be alive for, and how
long data stays in the realtime system before being handed off to the historical nodes. For example,
if your configuration has segmentGranularity "hour" and windowPeriod ten minutes, tasks will stay
around listening for events for an hour and ten minutes. For this reason, to prevent excessive
These are important configurations because they influence how long tasks will be alive for, and how
long data stays in the realtime system before being handed off to the historical nodes. For example,
if your configuration has segmentGranularity "hour" and windowPeriod ten minutes, tasks will stay
around listening for events for an hour and ten minutes. For this reason, to prevent excessive
buildup of tasks, it is recommended that your windowPeriod be less than your segmentGranularity.
### Append only
Druid streaming ingestion is *append-only*, meaning you cannot use streaming ingestion to update or
delete individual records after they are inserted. If you need to update or delete individual
records, you need to use a batch reindexing process. See the *[batch ingest](batch-ingestion.html)*
Druid streaming ingestion is *append-only*, meaning you cannot use streaming ingestion to update or
delete individual records after they are inserted. If you need to update or delete individual
records, you need to use a batch reindexing process. See the *[batch ingest](batch-ingestion.html)*
page for more details.
Druid does support efficient deletion of entire time ranges without resorting to batch reindexing.
Druid does support efficient deletion of entire time ranges without resorting to batch reindexing.
This can be done automatically through setting up retention policies.
### Guarantees
Tranquility operates under a best-effort design. It tries reasonably hard to preserve your data, by allowing you to set
up replicas and by retrying failed pushes for a period of time, but it does not guarantee that your events will be
Tranquility operates under a best-effort design. It tries reasonably hard to preserve your data, by allowing you to set
up replicas and by retrying failed pushes for a period of time, but it does not guarantee that your events will be
processed exactly once. In some conditions, it can drop or duplicate events:
- Events with timestamps outside your configured windowPeriod will be dropped.
- If you suffer more Druid Middle Manager failures than your configured replicas count, some
- If you suffer more Druid Middle Manager failures than your configured replicas count, some
partially indexed data may be lost.
- If there is a persistent issue that prevents communication with the Druid indexing service, and
retry policies are exhausted during that period, or the period lasts longer than your windowPeriod,
- If there is a persistent issue that prevents communication with the Druid indexing service, and
retry policies are exhausted during that period, or the period lasts longer than your windowPeriod,
some events will be dropped.
- If there is an issue that prevents Tranquility from receiving an acknowledgement from the indexing
- If there is an issue that prevents Tranquility from receiving an acknowledgement from the indexing
service, it will retry the batch, which can lead to duplicated events.
- If you are using Tranquility inside Storm or Samza, various parts of both architectures have an
- If you are using Tranquility inside Storm or Samza, various parts of both architectures have an
at-least-once design and can lead to duplicated events.
Under normal operation, these risks are minimal. But if you need absolute 100% fidelity for
historical data, we recommend a [hybrid batch/streaming](../tutorials/ingestion.html#hybrid-batch-streaming)
Under normal operation, these risks are minimal. But if you need absolute 100% fidelity for
historical data, we recommend a [hybrid batch/streaming](../tutorials/ingestion.html#hybrid-batch-streaming)
architecture.

View File

@ -25,21 +25,12 @@ To let Druid load your extensions, follow the steps below
Example:
Suppose you specify `druid.extensions.directory=/usr/local/druid/extensions`, and want Druid to load normal extensions ```druid-examples```, ```druid-kafka-eight``` and ```mysql-metadata-storage```.
Suppose you specify `druid.extensions.directory=/usr/local/druid/extensions`, and want Druid to load normal extensions ```druid-kafka-eight``` and ```mysql-metadata-storage```.
Then under ```extensions```, it should look like this,
```
extensions/
├── druid-examples
│   ├── commons-beanutils-1.8.3.jar
│   ├── commons-digester-1.8.jar
│   ├── commons-logging-1.1.1.jar
│   ├── commons-validator-1.4.0.jar
│   ├── druid-examples-0.8.0-rc1.jar
│   ├── twitter4j-async-3.0.3.jar
│   ├── twitter4j-core-3.0.3.jar
│   └── twitter4j-stream-3.0.3.jar
├── druid-kafka-eight
│   ├── druid-kafka-eight-0.7.3.jar
│   ├── jline-0.9.94.jar
@ -61,7 +52,7 @@ extensions/
└── mysql-metadata-storage-0.8.0-rc1.jar
```
As you can see, under ```extensions``` there are three sub-directories ```druid-examples```, ```druid-kafka-eight``` and ```mysql-metadata-storage```, each sub-directory denotes an extension that Druid might load.
As you can see, under ```extensions``` there are two sub-directories ```druid-kafka-eight``` and ```mysql-metadata-storage```, each sub-directory denotes an extension that Druid might load.
3) To have Druid load a specific list of extensions present under the root extension directory, set `druid.extensions.loadList` to the list of extensions to load. Using the example above, if you want Druid to load ```druid-kafka-eight``` and ```mysql-metadata-storage```, you can specify `druid.extensions.loadList=["druid-kafka-eight", "mysql-metadata-storage"]`.

View File

@ -114,9 +114,14 @@ your functionality as a native Java aggregator.
The javascript aggregator is recommended for rapidly prototyping features. This aggregator will be much slower in production
use than a native Java aggregator.
## Approximate Aggregations
### Cardinality aggregator
Computes the cardinality of a set of Druid dimensions, using HyperLogLog to estimate the cardinality.
Computes the cardinality of a set of Druid dimensions, using HyperLogLog to estimate the cardinality. Please note that this
aggregator will be much slower than indexing a column with the hyperUnique aggregator. This aggregator also runs over a dimension column, which
means the string dimension cannot be removed from the dataset to improve rollup. In general, we strongly recommend using the hyperUnique aggregator
instead of the cardinality aggregator if you do not care about the individual values of a dimension.
```json
{
@ -181,10 +186,6 @@ Determine the number of distinct people (i.e. combinations of first and last nam
}
```
## Complex Aggregations
Druid supports complex aggregations such as various types of approximate sketches.
### HyperUnique aggregator
Uses [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) to compute the estimated cardinality of a dimension that has been aggregated as a "hyperUnique" metric at indexing time.
@ -193,6 +194,8 @@ Uses [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) to
{ "type" : "hyperUnique", "name" : <output_name>, "fieldName" : <metric_name> }
```
For more approximate aggregators, please see [theta sketches](../development/datasketches-aggregators.html).
## Miscellaneous Aggregations
### Filtered Aggregator

View File

@ -361,11 +361,13 @@ Then groupBy/topN processing pipeline "explodes" all multi-valued dimensions res
In addition to "query filter" which efficiently selects the rows to be processed, you can use the filtering dimension spec to filter for specific values within the values of a multi-valued dimension. These dimensionSpecs take a delegate DimensionSpec and a filtering criteria. From the "exploded" rows, only rows matching the given filtering criteria are returned in the query result.
The following filtered dimension spec acts as a whitelist or blacklist for values as per the "isWhitelist" attribute value.
```json
{ "type" : "listFiltered", "delegate" : <dimensionSpec>, "values": <array of strings>, "isWhitelist": <optional attribute for true/false, default is true> }
```
Following filtered dimension spec retains only the values matching regex. Note that `listFiltered` is faster than this and one should use that for whitelist or blacklist usecase.
```json
{ "type" : "regexFiltered", "delegate" : <dimensionSpec>, "pattern": <java regex pattern> }
```

View File

@ -5,31 +5,31 @@ layout: doc_page
Druid has limited support for joins through [query-time lookups](../querying/lookups.html). The common use case of
query-time lookups is to replace one dimension value that (e.g. a String ID) with another value (e.g. a human-readable
String value).
String value). This is similar a star-schema join.
Druid does not yet have full support for joins. Although Druids storage format would allow for the implementation
Druid does not yet have full support for joins. Although Druids storage format would allow for the implementation
of joins (there is no loss of fidelity for columns included as dimensions), full support for joins have not yet been implemented yet
for the following reasons:
1. Scaling join queries has been, in our professional experience,
1. Scaling join queries has been, in our professional experience,
a constant bottleneck of working with distributed databases.
2. The incremental gains in functionality are perceived to be
of less value than the anticipated problems with managing
2. The incremental gains in functionality are perceived to be
of less value than the anticipated problems with managing
highly concurrent, join-heavy workloads.
A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary
high-level strategies for join queries we are aware of are a hash-based strategy or a
sorted-merge strategy. The hash-based strategy requires that all but
one data set be available as something that looks like a hash table,
a lookup operation is then performed on this hash table for every
row in the “primary” stream. The sorted-merge strategy assumes
that each stream is sorted by the join key and thus allows for the incremental
joining of the streams. Each of these strategies, however,
requires the materialization of some number of the streams either in
A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary
high-level strategies for join queries we are aware of are a hash-based strategy or a
sorted-merge strategy. The hash-based strategy requires that all but
one data set be available as something that looks like a hash table,
a lookup operation is then performed on this hash table for every
row in the “primary” stream. The sorted-merge strategy assumes
that each stream is sorted by the join key and thus allows for the incremental
joining of the streams. Each of these strategies, however,
requires the materialization of some number of the streams either in
sorted order or in a hash table form.
When all sides of the join are significantly large tables (> 1 billion
records), materializing the pre-join streams requires complex
distributed memory management. The complexity of the memory
management is only amplified by the fact that we are targeting highly
When all sides of the join are significantly large tables (> 1 billion
records), materializing the pre-join streams requires complex
distributed memory management. The complexity of the memory
management is only amplified by the fact that we are targeting highly
concurrent, multi-tenant workloads.

View File

@ -103,7 +103,7 @@ It can be used in a sample calculation as so:
}]
```
#### Example Usage
## Example Usage
In this example, lets calculate a simple percentage using post aggregators. Lets imagine our data set has a metric called "total".

View File

@ -6,32 +6,32 @@ layout: doc_page
Druid is designed to be deployed as a scalable, fault-tolerant cluster.
In this document, we'll set up a simple cluster and discuss how it can be further configured to meet
your needs. This simple cluster will feature scalable, fault-tolerant servers for Historicals and MiddleManagers, and a single
coordination server to host the Coordinator and Overlord processes. In production, we recommend deploying Coordinators and Overlords in a fault-tolerant
In this document, we'll set up a simple cluster and discuss how it can be further configured to meet
your needs. This simple cluster will feature scalable, fault-tolerant servers for Historicals and MiddleManagers, and a single
coordination server to host the Coordinator and Overlord processes. In production, we recommend deploying Coordinators and Overlords in a fault-tolerant
configuration as well.
## Select hardware
The Coordinator and Overlord processes can be co-located on a single server that is responsible for handling the metadata and coordination needs of your cluster.
The equivalent of an AWS [m3.xlarge](https://aws.amazon.com/ec2/instance-types/#M3) is sufficient for most clusters. This
The Coordinator and Overlord processes can be co-located on a single server that is responsible for handling the metadata and coordination needs of your cluster.
The equivalent of an AWS [m3.xlarge](https://aws.amazon.com/ec2/instance-types/#M3) is sufficient for most clusters. This
hardware offers:
- 4 vCPUs
- 15 GB RAM
- 80 GB SSD storage
Historicals and MiddleManagers can be colocated on a single server to handle the actual data in your cluster. These servers benefit greatly from CPU, RAM,
and SSDs. The equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3) is a
Historicals and MiddleManagers can be colocated on a single server to handle the actual data in your cluster. These servers benefit greatly from CPU, RAM,
and SSDs. The equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3) is a
good starting point. This hardware offers:
- 8 vCPUs
- 61 GB RAM
- 160 GB SSD storage
Druid Brokers accept queries and farm them out to the rest of the cluster. They also optionally maintain an
in-memory query cache. These servers benefit greatly from CPU and RAM, and can also be deployed on
the equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3). This hardware
Druid Brokers accept queries and farm them out to the rest of the cluster. They also optionally maintain an
in-memory query cache. These servers benefit greatly from CPU and RAM, and can also be deployed on
the equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3). This hardware
offers:
- 8 vCPUs
@ -48,14 +48,14 @@ We recommend running your favorite Linux distribution. You will also need:
* Java 7 or better
Your OS package manager should be able to help for both Java. If your Ubuntu-based OS
does not have a recent enough version of Java, WebUpd8 offers [packages for those
Your OS package manager should be able to help for both Java. If your Ubuntu-based OS
does not have a recent enough version of Java, WebUpd8 offers [packages for those
OSes](http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html).
## Download the distribution
First, download and unpack the release archive. It's best to do this on a single machine at first,
since you will be editing the configurations and then copying the modified distribution out to all
First, download and unpack the release archive. It's best to do this on a single machine at first,
since you will be editing the configurations and then copying the modified distribution out to all
of your servers.
```bash
@ -80,8 +80,8 @@ We'll be editing the files in `conf/` in order to get things running.
## Configure deep storage
Druid relies on a distributed filesystem or large object (blob) store for data storage. The most
commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if
Druid relies on a distributed filesystem or large object (blob) store for data storage. The most
commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if
you already have a Hadoop deployment).
### S3
@ -148,13 +148,13 @@ druid.indexer.logs.directory=/druid/indexing-logs
Also,
- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml,
mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into
- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml,
mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into
`conf/druid/_common/`.
## Configure Tranquility Server (optional)
Data streams can be sent to Druid through a simple HTTP API powered by Tranquility
Data streams can be sent to Druid through a simple HTTP API powered by Tranquility
Server. If you will be using this functionality, then at this point you should [configure
Tranquility Server](../ingestion/stream-ingestion.html#server).
@ -166,56 +166,56 @@ using this functionality, then at this point you should
## Configure for connecting to Hadoop (optional)
If you will be loading data from a Hadoop cluster, then at this point you should configure Druid to be aware
If you will be loading data from a Hadoop cluster, then at this point you should configure Druid to be aware
of your cluster:
- Update `druid.indexer.task.hadoopWorkingPath` in `conf/middleManager/runtime.properties` to
a path on HDFS that you'd like to use for temporary files required during the indexing process.
- Update `druid.indexer.task.hadoopWorkingPath` in `conf/middleManager/runtime.properties` to
a path on HDFS that you'd like to use for temporary files required during the indexing process.
`druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing` is a common choice.
- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml,
mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into
- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml,
mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into
`conf/druid/_common/core-site.xml`, `conf/druid/_common/hdfs-site.xml`, and so on.
Note that you don't need to use HDFS deep storage in order to load data from Hadoop. For example, if
your cluster is running on Amazon Web Services, we recommend using S3 for deep storage even if you
Note that you don't need to use HDFS deep storage in order to load data from Hadoop. For example, if
your cluster is running on Amazon Web Services, we recommend using S3 for deep storage even if you
are loading data using Hadoop or Elastic MapReduce.
For more info, please see [batch ingestion](../ingestion/batch-ingestion.html).
## Configure addresses for Druid coordination
In this simple cluster, you will deploy a single Druid Coordinator, a
In this simple cluster, you will deploy a single Druid Coordinator, a
single Druid Overlord, a single ZooKeeper instance, and an embedded Derby metadata store on the same server.
In `conf/druid/_common/common.runtime.properties`, replace
In `conf/druid/_common/common.runtime.properties`, replace
"zk.host.ip" with the IP address of the machine that runs your ZK instance:
- `druid.zk.service.host`
In `conf/_common/common.runtime.properties`, replace
In `conf/_common/common.runtime.properties`, replace
"metadata.store.ip" with the IP address of the machine that you will use as your metadata store:
- `druid.metadata.storage.connector.connectURI`
- `druid.metadata.storage.connector.host`
```note-caution
In production, we recommend running 2 servers, each running a Druid Coordinator
and a Druid Overlord. We also recommend running a ZooKeeper cluster on its own dedicated hardware,
as well as replicated [metadata
storage](http://druid.io/docs/latest/dependencies/metadata-storage.html) such as MySQL or
<div class="note caution">
In production, we recommend running 2 servers, each running a Druid Coordinator
and a Druid Overlord. We also recommend running a ZooKeeper cluster on its own dedicated hardware,
as well as replicated [metadata
storage](http://druid.io/docs/latest/dependencies/metadata-storage.html) such as MySQL or
PostgreSQL, on its own dedicated hardware.
```
</div>
## Tune Druid processes that serve queries
Druid Historicals and MiddleManagers can be co-located on the same hardware. Both Druid processes benefit greatly from
being tuned to the hardware they run on. If you are running Tranquility Server or Kafka, you can also colocate Tranquility with these two Druid processes.
If you are using [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3)
EC2 instances, or similar hardware, the configuration in the distribution is a
Druid Historicals and MiddleManagers can be co-located on the same hardware. Both Druid processes benefit greatly from
being tuned to the hardware they run on. If you are running Tranquility Server or Kafka, you can also colocate Tranquility with these two Druid processes.
If you are using [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3)
EC2 instances, or similar hardware, the configuration in the distribution is a
reasonable starting point.
If you are using different hardware, we recommend adjusting configurations for your specific
If you are using different hardware, we recommend adjusting configurations for your specific
hardware. The most commonly adjusted configurations are:
- `-Xmx` and `-Xms`
@ -227,20 +227,20 @@ hardware. The most commonly adjusted configurations are:
- `druid.server.maxSize` and `druid.segmentCache.locations` on Historical Nodes
- `druid.worker.capacity` on MiddleManagers
```note
<div class="note info">
Keep -XX:MaxDirectMemory >= numThreads*sizeBytes, otherwise Druid will fail to start up..
```
</div>
Please see the Druid [configuration documentation](../configuration/index.html) for a full description of all
Please see the Druid [configuration documentation](../configuration/index.html) for a full description of all
possible configuration options.
## Tune Druid Brokers
Druid Brokers also benefit greatly from being tuned to the hardware they
run on. If you are using [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3) EC2 instances,
Druid Brokers also benefit greatly from being tuned to the hardware they
run on. If you are using [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3) EC2 instances,
or similar hardware, the configuration in the distribution is a reasonable starting point.
If you are using different hardware, we recommend adjusting configurations for your specific
If you are using different hardware, we recommend adjusting configurations for your specific
hardware. The most commonly adjusted configurations are:
- `-Xmx` and `-Xms`
@ -251,17 +251,17 @@ hardware. The most commonly adjusted configurations are:
- `druid.query.groupBy.maxIntermediateRows`
- `druid.query.groupBy.maxResults`
```note-caution
Keep -XX:MaxDirectMemory >= numThreads*sizeBytes, otherwise Druid will fail to start up..
```
<div class="note caution">
Keep -XX:MaxDirectMemory >= numThreads*sizeBytes, otherwise Druid will fail to start up.
</div>
Please see the Druid [configuration documentation](../configuration/index.html) for a full description of all
Please see the Druid [configuration documentation](../configuration/index.html) for a full description of all
possible configuration options.
## Start Coordinator, Overlord, Zookeeper, and metadata store
Copy the Druid distribution and your edited configurations to your coordination
server. If you have been editing the configurations on your local machine, you can use *rsync* to
Copy the Druid distribution and your edited configurations to your coordination
server. If you have been editing the configurations on your local machine, you can use *rsync* to
copy them:
```bash
@ -269,18 +269,19 @@ rsync -az druid-0.9.0/ COORDINATION_SERVER:druid-0.9.0/
```
Log on to your coordination server and install Zookeeper:
```bash
curl http://www.gtlib.gatech.edu/pub/apache/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz -o zookeeper-3.4.6.tar.gz
tar -xzf zookeeper-3.4.6.tar.gz
cd zookeeper-3.4.6
cp conf/zoo_sample.cfg conf/zoo.cfg
./bin/zkServer.sh start
```note-caution
In production, we also recommend running a ZooKeeper cluster on its own dedicated hardware.
```
<div class="note caution">
In production, we also recommend running a ZooKeeper cluster on its own dedicated hardware.
</div>
On your coordination server, *cd* into the distribution and start up the coordination services (you should do this in different windows or pipe the log to a file):
```bash
@ -288,7 +289,7 @@ java `cat conf/druid/coordinator/jvm.config | xargs` -cp conf/druid/_common:conf
java `cat conf/druid/overlord/jvm.config | xargs` -cp conf/druid/_common:conf/druid/overlord:lib/* io.druid.cli.Main server overlord
```
You should see a log message printed out for each service that starts up. You can view detailed logs
You should see a log message printed out for each service that starts up. You can view detailed logs
for any service by looking in the `var/log/druid` directory using another terminal.
## Start Historicals and MiddleManagers
@ -304,15 +305,15 @@ java `cat conf/druid/middleManager/jvm.config | xargs` -cp conf/druid/_common:co
You can add more servers with Druid Historicals and MiddleManagers as needed.
```note-info
For clusters with complex resource allocation needs, you can break apart Historicals and MiddleManagers and scale the components individually.
This also allows you take advantage of Druid's built-in MiddleManager
<div class="note info">
For clusters with complex resource allocation needs, you can break apart Historicals and MiddleManagers and scale the components individually.
This also allows you take advantage of Druid's built-in MiddleManager
autoscaling facility.
```
</div>
If you are doing push-based stream ingestion with Kafka or over HTTP, you can also start Tranquility server on the same
hardware that holds MiddleManagers and Historicals. For large scale production, MiddleManagers and Tranquility server
can still be co-located. If you are running Tranquility (not server) with a stream processor, you can co-locate
If you are doing push-based stream ingestion with Kafka or over HTTP, you can also start Tranquility server on the same
hardware that holds MiddleManagers and Historicals. For large scale production, MiddleManagers and Tranquility server
can still be co-located. If you are running Tranquility (not server) with a stream processor, you can co-locate
Tranquility with the stream processor and not require Tranquility server.
```bash
@ -324,9 +325,9 @@ bin/tranquility <server or kafka> -configFile <path_to_druid_distro>/conf/tranqu
## Start Druid Broker
Copy the Druid distribution and your edited configurations to your servers set aside for the Druid Brokers.
Copy the Druid distribution and your edited configurations to your servers set aside for the Druid Brokers.
On each one, *cd* into the distribution and run this command to start a Broker (you want to pipe the output to a log file):
On each one, *cd* into the distribution and run this command to start a Broker (you may want to pipe the output to a log file):
```bash
java `cat conf/druid/broker/jvm.config | xargs` -cp conf/druid/_common:conf/druid/broker:lib/* io.druid.cli.Main server broker
@ -336,5 +337,5 @@ You can add more Brokers as needed based on query load.
## Loading data
Congratulations, you now have a Druid cluster! The next step is to learn about recommended ways to load data into
Congratulations, you now have a Druid cluster! The next step is to learn about recommended ways to load data into
Druid based on your use case. Read more about [loading data](ingestion.html).

View File

@ -9,15 +9,15 @@ layout: doc_page
Druid supports streaming (real-time) and file-based (batch) ingestion methods. The most
popular configurations are:
- [Files](batch-ingestion.html) - Load data from HDFS, S3, local files, or any supported Hadoop
- [Files](../ingestion/batch-ingestion.html) - Load data from HDFS, S3, local files, or any supported Hadoop
filesystem in batches. We recommend this method if your dataset is already in flat files.
- [Stream push](stream-ingestion.html#stream-push) - Push a data stream into Druid in real-time
- [Stream push](../ingestion/stream-ingestion.html#stream-push) - Push a data stream into Druid in real-time
using [Tranquility](http://github.com/druid-io/tranquility), a client library for sending streams
to Druid. We recommend this method if your dataset originates in a streaming system like Kafka,
Storm, Spark Streaming, or your own system.
- [Stream pull](stream-ingestion.html#stream-pull) - Pull a data stream directly from an external
- [Stream pull](../ingestion/stream-ingestion.html#stream-pull) - Pull a data stream directly from an external
data source into Druid using Realtime Nodes.
## Getting started

View File

@ -15,17 +15,17 @@ You will need:
* 8G of RAM
* 2 vCPUs
On Mac OS X, you can use [Oracle's JDK
8](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) to install
On Mac OS X, you can use [Oracle's JDK
8](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) to install
Java.
On Linux, your OS package manager should be able to help for Java. If your Ubuntu-
based OS does not have a recent enough version of Java, WebUpd8 offers [packages for those
based OS does not have a recent enough version of Java, WebUpd8 offers [packages for those
OSes](http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html).
## Getting started
To install Druid, issue the following commands in your terminal:
To install Druid, issue the following commands in your terminal:
```bash
curl -O http://static.druid.io/artifacts/releases/druid-0.9.0-bin.tar.gz
@ -46,7 +46,7 @@ In the package, you should find:
## Start up Zookeeper
Druid currently has a dependency on [Apache ZooKeeper](http://zookeeper.apache.org/) for distributed coordination. You'll
Druid currently has a dependency on [Apache ZooKeeper](http://zookeeper.apache.org/) for distributed coordination. You'll
need to download and run Zookeeper.
```bash
@ -65,8 +65,9 @@ With Zookeeper running, return to the druid-0.9.0 directory. In that directory,
bin/init
```
Next, you can start up the Druid processes in different terminal windows. This tutorial runs every Druid process on the same system. In production,
many of these Druid processes can be colocated even in a distributed cluster.
This will setup up some directories for you. Next, you can start up the Druid processes in different terminal windows.
This tutorial runs every Druid process on the same system. In a large distributed production cluster,
many of these Druid processes can still be co-located together.
```bash
java `cat conf-quickstart/druid/historical/jvm.config | xargs` -cp conf-quickstart/druid/_common:conf-quickstart/druid/historical:lib/* io.druid.cli.Main server historical
@ -78,7 +79,7 @@ java `cat conf-quickstart/druid/middleManager/jvm.config | xargs` -cp conf-quick
You should see a log message printed out for each service that starts up.
Later on, if you'd like to stop the services, CTRL-C to exit from the running java processes. If you
Later on, if you'd like to stop the services, CTRL-C to exit from the running java processes. If you
want a clean start after stopping the services, delete the `var` directory and run the `init` script again.
Once every service has started, you are now ready to load data.
@ -87,13 +88,13 @@ Once every service has started, you are now ready to load data.
We've included a sample of Wikipedia edits from September 12, 2015 to get you started.
```note-info
This section shows you how to load data in batches, but you can skip ahead to learn how to [load
streams in real-time](quickstart.html#load-streaming-data). Druid's streaming ingestion can load data
<div class="note info">
This section shows you how to load data in batches, but you can skip ahead to learn how to <a href="quickstart.html#load-streaming-data">load
streams in real-time</a>. Druid's streaming ingestion can load data
with virtually no delay between events occurring and being available for queries.
```
</div>
The [dimensions](https://en.wikipedia.org/wiki/Dimension_%28data_warehouse%29) (attributes you can
The [dimensions](https://en.wikipedia.org/wiki/Dimension_%28data_warehouse%29) (attributes you can
filter and split on) in the Wikipedia dataset, other than time, are:
* channel
@ -113,7 +114,7 @@ filter and split on) in the Wikipedia dataset, other than time, are:
* regionName
* user
The [measures](https://en.wikipedia.org/wiki/Measure_%28data_warehouse%29), or *metrics* as they are known in Druid (values you can aggregate)
The [measures](https://en.wikipedia.org/wiki/Measure_%28data_warehouse%29), or *metrics* as they are known in Druid (values you can aggregate)
in the Wikipedia dataset are:
* count
@ -122,8 +123,8 @@ in the Wikipedia dataset are:
* delta
* user_unique
To load this data into Druid, you can submit an *ingestion task* pointing to the file. We've included
a task that loads the `wikiticker-2015-09-12-sampled.json` file included in the archive. To submit
To load this data into Druid, you can submit an *ingestion task* pointing to the file. We've included
a task that loads the `wikiticker-2015-09-12-sampled.json` file included in the archive. To submit
this task, POST it to Druid in a new terminal window from the druid-0.9.0 directory:
```bash
@ -132,27 +133,27 @@ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/wikiticker-inde
Which will print the ID of the task if the submission was successful:
```base
```bash
{"task":"index_hadoop_wikipedia_2013-10-09T21:30:32.802Z"}
```
To view the status of your ingestion task, go to your overlord console:
[http://localhost:8090/console.html](http://localhost:8090/console.html). You can refresh the console periodically, and after
To view the status of your ingestion task, go to your overlord console:
[http://localhost:8090/console.html](http://localhost:8090/console.html). You can refresh the console periodically, and after
the task is successful, you should see a "SUCCESS" status for the task.
After your ingestion task finishes, the data will be loaded by historical nodes and available for
querying within a minute or two. You can monitor the progress of loading your data in the
coordinator console, by checking whether there is a datasource "wikiticker" with a blue circle
After your ingestion task finishes, the data will be loaded by historical nodes and available for
querying within a minute or two. You can monitor the progress of loading your data in the
coordinator console, by checking whether there is a datasource "wikiticker" with a blue circle
indicating "fully available": [http://localhost:8081/#/](http://localhost:8081/#/).
Once the data is fully available, you can immediately query it&mdash; to see how, skip to the [Query
Once the data is fully available, you can immediately query it&mdash; to see how, skip to the [Query
data](#query-data) section below. Or, continue to the [Load your own data](#load-your-own-data)
section if you'd like to load a different dataset.
## Load streaming data
To load streaming data, we are going to push events into Druid
over a simple HTTP API. We will do this use [Tranquility], a high level data producer
To load streaming data, we are going to push events into Druid
over a simple HTTP API. We will do this use [Tranquility], a high level data producer
library for Druid.
To download Tranquility, issue the following commands in your terminal:
@ -163,19 +164,19 @@ tar -xzf tranquility-distribution-0.7.2.tgz
cd tranquility-distribution-0.7.2
```
We've included a configuration file in `conf-quickstart/tranquility/server.json` as part of the Druid distribution
for a *metrics* datasource. We're going to start the Tranquility server process, which can be used to push events
We've included a configuration file in `conf-quickstart/tranquility/server.json` as part of the Druid distribution
for a *metrics* datasource. We're going to start the Tranquility server process, which can be used to push events
directly to Druid.
``` bash
bin/tranquility server -configFile <path_to_druid_distro>/conf-quickstart/tranquility/server.json
```
```note-info
<div class="note info">
This section shows you how to load data using Tranquility Server, but Druid also supports a wide
variety of [other streaming ingestion options](ingestion-streams.html#stream-push), including from
variety of <a href="ingestion-streams.html#stream-push">other streaming ingestion options</a>, including from
popular streaming systems like Kafka, Storm, Samza, and Spark Streaming.
```
</div>
The [dimensions](https://en.wikipedia.org/wiki/Dimension_%28data_warehouse%29) (attributes you can
filter and split on) for this datasource are flexible. It's configured for *schemaless dimensions*,
@ -223,17 +224,17 @@ curl -L -H'Content-Type: application/json' -XPOST --data-binary @quickstart/wiki
## Visualizing data
Druid is ideal for power user-facing analytic applications. There are a number of different open source applications to
visualize and explore data in Druid. We recommend trying [Pivot](https://github.com/implydata/pivot),
[Panoramix](https://github.com/mistercrunch/panoramix), or [Metabase](https://github.com/metabase/metabase) to start
Druid is ideal for power user-facing analytic applications. There are a number of different open source applications to
visualize and explore data in Druid. We recommend trying [Pivot](https://github.com/implydata/pivot),
[Panoramix](https://github.com/mistercrunch/panoramix), or [Metabase](https://github.com/metabase/metabase) to start
visualizing the data you just ingested.
If you installed Pivot for example, you should be able to view your data in your browser at [localhost:9090](localhost:9090).
### SQL and other query libraries
There are many more query tools for Druid than we've included here, including SQL
engines, and libraries for various languages like Python and Ruby. Please see [the list of
There are many more query tools for Druid than we've included here, including SQL
engines, and libraries for various languages like Python and Ruby. Please see [the list of
libraries](../development/libraries.html) for more information.
## Clustered setup

View File

@ -2,25 +2,36 @@
layout: doc_page
---
## Load your own batch data
# Tutorial: Load your own batch data
Before you get started with loading your own batch data, you should have first completed the [quickstart](quickstart.html).
## Getting started
You can easily load any timestamped dataset into Druid. For Druid batch loads, the most important
questions are:
This tutorial shows you how to load your own data files into Druid.
For this tutorial, we'll assume you've already downloaded Druid as described in
the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
don't need to have loaded any data yet.
Once that's complete, you can load your own dataset by writing a custom ingestion spec.
## Writing an ingestion spec
When loading files into Druid, you will use Druid's [batch loading](ingestion-batch.html) process.
There's an example batch ingestion spec in `quickstart/wikiticker-index.json` that you can modify
for your own needs.
The most important questions are:
* What should the dataset be called? This is the "dataSource" field of the "dataSchema".
* Where is the dataset located? The file paths belong in the "paths" of the "inputSpec". If you
* Where is the dataset located? The file paths belong in the "paths" of the "inputSpec". If you
want to load multiple files, you can provide them as a comma-separated string.
* Which field should be treated as a timestamp? This belongs in the "column" of the "timestampSpec".
* Which fields should be treated as dimensions? This belongs in the "dimensions" of the "dimensionsSpec".
* Which fields should be treated as metrics? This belongs in the "metricsSpec".
* What time ranges (intervals) are being loaded? This belongs in the "intervals" of the "granularitySpec".
```note-info
If your data does not have a natural sense of time, you can tag each row with the current time.
You can also tag all rows with a fixed timestamp, like "2000-01-01T00:00:00.000Z".
```
Let's use this pageviews dataset as an example. Druid supports TSV, CSV, and JSON out of the box.
Note that nested JSON objects are not supported, so if you do use JSON, you should provide a file
@ -90,19 +101,39 @@ And modify it by altering these sections:
}
```
Finally, fire off the task and indexing will proceed!
## Running the task
To actually run this task, first make sure that the indexing task can read *pageviews.json*:
- If you're running locally (no configuration for connecting to Hadoop; this is the default) then
place it in the root of the Druid distribution.
- If you configured Druid to connect to a Hadoop cluster, upload
the pageviews.json file to HDFS. You may need to adjust the `paths` in the ingestion spec.
To kick off the indexing process, POST your indexing task to the Druid Overlord. In a standard Druid
install, the URL is `http://OVERLORD_IP:8090/druid/indexer/v1/task`.
```bash
curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/pageviews-index.json localhost:8090/druid/indexer/v1/task
curl -X 'POST' -H 'Content-Type:application/json' -d @my-index-task.json OVERLORD_IP:8090/druid/indexer/v1/task
```
If anything goes wrong with this task (e.g. it finishes with status FAILED), you can troubleshoot
If you're running everything on a single machine, you can use localhost:
```bash
curl -X 'POST' -H 'Content-Type:application/json' -d @my-index-task.json localhost:8090/druid/indexer/v1/task
```
If anything goes wrong with this task (e.g. it finishes with status FAILED), you can troubleshoot
by visiting the "Task log" on the [overlord console](http://localhost:8090/console.html).
```note-info
Druid supports a wide variety of data formats, ingestion options, and configurations not
discussed here. For a full explanation of all available features, see the ingestion sections of the Druid
documentation.
```
## Querying your data
Your data should become fully available within a minute or two. You can monitor this process on
your Coordinator console at [http://localhost:8081/#/](http://localhost:8081/#/).
Once your data is fully available, you can query it using any of the
[supported query methods](../querying/querying.html).
## Further reading
For more information on loading batch data, please see [the batch ingestion documentation](../ingestion/batch-ingestion.html).

View File

@ -8,21 +8,21 @@ layout: doc_page
This tutorial shows you how to load data from Kafka into Druid.
For this tutorial, we'll assume you've already downloaded Druid and Tranquility as described in
the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
For this tutorial, we'll assume you've already downloaded Druid and Tranquility as described in
the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
don't need to have loaded any data yet.
```note-info
<div class="note info">
This tutorial will show you how to load data from Kafka into Druid, but Druid additionally supports
a wide variety of batch and streaming loading methods. See the *[Loading files](../ingestion/batch-ingestion.html)*
and *[Loading streams](../ingestion/stream-ingestion.html)* pages for more information about other options,
a wide variety of batch and streaming loading methods. See the <a href="../ingestion/batch-ingestion.html">Loading files</a>
and <a href="../ingestion/stream-ingestion.html">Loading streams</a> pages for more information about other options,
including from Hadoop, HTTP, Storm, Samza, Spark Streaming, and your own JVM apps.
```
</div>
## Start Kafka
[Apache Kafka](http://kafka.apache.org/) is a high throughput message bus that works well with
Druid. For this tutorial, we will use Kafka 0.9.0.0. To download Kafka, issue the following
[Apache Kafka](http://kafka.apache.org/) is a high throughput message bus that works well with
Druid. For this tutorial, we will use Kafka 0.9.0.0. To download Kafka, issue the following
commands in your terminal:
```bash
@ -45,7 +45,7 @@ Run this command to create a Kafka topic called *metrics*, to which we'll send d
## Enable Druid Kafka ingestion
Druid includes configs for [Tranquility Kafka](ingestion-streams.md#kafka) to support loading data from Kafka.
Druid includes configs for [Tranquility Kafka](ingestion-streams.md#kafka) to support loading data from Kafka.
To enable this in the quickstart-based configuration:
- Stop your Tranquility command (CTRL-C) and then start it up again.
@ -66,25 +66,25 @@ In your Kafka directory, run:
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic metrics
```
The *kafka-console-producer* command is now awaiting input. Copy the generated example metrics,
paste them into the *kafka-console-producer* terminal, and press enter. If you like, you can also
The *kafka-console-producer* command is now awaiting input. Copy the generated example metrics,
paste them into the *kafka-console-producer* terminal, and press enter. If you like, you can also
paste more messages into the producer, or you can press CTRL-D to exit the console producer.
You can immediately query this data, or you can skip ahead to the
You can immediately query this data, or you can skip ahead to the
[Loading your own data](#loading-your-own-data) section if you'd like to load your own dataset.
## Querying your data
After sending data, you can immediately query it using any of the
After sending data, you can immediately query it using any of the
[supported query methods](../querying/querying.html).
## Loading your own data
So far, you've loaded data into Druid from Kafka using an ingestion spec that we've included in the
distribution. Each ingestion spec is designed to work with a particular dataset. You load your own
So far, you've loaded data into Druid from Kafka using an ingestion spec that we've included in the
distribution. Each ingestion spec is designed to work with a particular dataset. You load your own
data types into Imply by writing a custom ingestion spec.
You can write a custom ingestion spec by starting from the bundled configuration in
You can write a custom ingestion spec by starting from the bundled configuration in
`conf-quickstart/tranquility/kafka.json` and modifying it for your own needs.
The most important questions are:
@ -111,7 +111,7 @@ Next, edit `conf-quickstart/tranquility/kafka.json`:
* Let's call the dataset "pageviews-kafka".
* The timestamp is the "time" field.
* Good choices for dimensions are the string fields "url" and "user".
* Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
* Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
sum when we load the data will allow us to compute an average at query time as well.
You can edit the existing `conf-quickstart/tranquility/kafka.json` file by altering these
@ -157,7 +157,7 @@ Next, start Druid Kafka ingestion:
bin/tranquility kafka -configFile ../druid-0.9.0-SNAPSHOT/conf-quickstart/tranquility/kafka.json
```
- If your Tranquility server or Kafka is already running, stop it (CTRL-C) and
- If your Tranquility server or Kafka is already running, stop it (CTRL-C) and
start it up again.
Finally, send some data to the Kafka topic. Let's start with these messages:
@ -168,23 +168,23 @@ Finally, send some data to the Kafka topic. Let's start with these messages:
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
```
Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
[windowPeriod](../ingestion/stream-ingestion.html#segmentgranularity-and-windowperiod) value), so you should
replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
[windowPeriod](../ingestion/stream-ingestion.html#segmentgranularity-and-windowperiod) value), so you should
replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
get this by running:
```bash
python -c 'import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))'
```
Update the timestamps in the JSON above, then copy and paste these messages into this console
Update the timestamps in the JSON above, then copy and paste these messages into this console
producer and press enter:
```bash
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic pageviews
```
That's it, your data should now be in Druid. You can immediately query it using any of the
That's it, your data should now be in Druid. You can immediately query it using any of the
[supported query methods](../querying/querying.html).
## Further reading

View File

@ -2,33 +2,33 @@
layout: doc_page
---
## Load your own streaming data
# Tutorial: Load your own streaming data
## Getting started
This tutorial shows you how to load your own streams into Druid.
For this tutorial, we'll assume you've already downloaded Druid and Tranquility as described in
the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
For this tutorial, we'll assume you've already downloaded Druid and Tranquility as described in
the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
don't need to have loaded any data yet.
Once that's complete, you can load your own dataset by writing a custom ingestion spec.
## Writing an ingestion spec
When loading streams into Druid, we recommend using the [stream push](../ingestion/stream-push.html)
process. In this tutorial we'll be using [Tranquility Server](../ingestion/stream-ingestion.html#server) to push
When loading streams into Druid, we recommend using the [stream push](../ingestion/stream-push.html)
process. In this tutorial we'll be using [Tranquility Server](../ingestion/stream-ingestion.html#server) to push
data into Druid over HTTP.
```note-info
This tutorial will show you how to push streams to Druid using HTTP, but Druid additionally supports
a wide variety of batch and streaming loading methods. See the *[Loading files](batch-ingestion.html)*
and *[Loading streams](stream-ingestion.html)* pages for more information about other options,
<div class="note info">
This tutorial will show you how to push streams to Druid using HTTP, but Druid additionally supports
a wide variety of batch and streaming loading methods. See the <a href="../ingestion/batch-ingestion.html">Loading files</a>
and <a href="../ingestion/stream-ingestion.html">Loading streams</a> pages for more information about other options,
including from Hadoop, Kafka, Storm, Samza, Spark Streaming, and your own JVM apps.
```
</div>
You can prepare for loading a new dataset over HTTP by writing a custom Tranquility Server
configuration. The bundled configuration is in `conf-quickstart/tranquility/server.json`, which
You can prepare for loading a new dataset over HTTP by writing a custom Tranquility Server
configuration. The bundled configuration is in `conf-quickstart/tranquility/server.json`, which
you can modify for your own needs.
The most important questions are:
@ -49,10 +49,10 @@ So the answers to the questions above are:
* Let's call the dataset "pageviews".
* The timestamp is the "time" field.
* Good choices for dimensions are the string fields "url" and "user".
* Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
* Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
sum when we load the data will allow us to compute an average at query time as well.
Now, edit the existing `conf-quickstart/tranquility/server.json` file by altering these
Now, edit the existing `conf-quickstart/tranquility/server.json` file by altering these
sections:
1. Change the key `"metrics"` under `"dataSources"` to `"pageviews"`
@ -95,16 +95,16 @@ Let's send some data! We'll start with these three records:
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
```
Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
[windowPeriod](ingestion-streams.html#segmentgranularity-and-windowperiod) value), so you should
replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
[windowPeriod](ingestion-streams.html#segmentgranularity-and-windowperiod) value), so you should
replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
get this by running:
```bash
python -c 'import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))'
```
Update the timestamps in the JSON above, and save it to a file named `pageviews.json`. Then send
Update the timestamps in the JSON above, and save it to a file named `pageviews.json`. Then send
it to Druid by running:
```bash
@ -117,16 +117,16 @@ This will print something like:
{"result":{"received":3,"sent":3}}
```
This indicates that the HTTP server received 3 events from you, and sent 3 to Druid. Note that
this may take a few seconds to finish the first time you run it, as Druid resources must be
This indicates that the HTTP server received 3 events from you, and sent 3 to Druid. Note that
this may take a few seconds to finish the first time you run it, as Druid resources must be
allocated to the ingestion task. Subsequent POSTs should complete quickly.
If you see `"sent":0` this likely means that your timestamps are not recent enough. Try adjusting
If you see `"sent":0` this likely means that your timestamps are not recent enough. Try adjusting
your timestamps and re-sending your data.
## Querying your data
After sending data, you can immediately query it using any of the
After sending data, you can immediately query it using any of the
[supported query methods](../querying/querying.html).
## Further reading