diff --git a/docs/content/configuration/caching.md b/docs/content/configuration/caching.md
index babf8e42390..939842e1497 100644
--- a/docs/content/configuration/caching.md
+++ b/docs/content/configuration/caching.md
@@ -5,14 +5,13 @@ layout: doc_page
# Caching
Caching can optionally be enabled on the broker, historical, and realtime
-nodes, as well as realtime index tasks. See [broker](broker.html#caching),
-[historical](historical.html#caching), and [realtime](realtime.html#caching)
-configuration options for how to enable it for individual node types.
+processes. See [broker](broker.html#caching),
+[historical](historical.html#caching), and [realtime](realtime.html#caching)
+configuration options for how to enable it for different processes.
Druid uses a local in-memory cache by default, unless a different type of cache is specified.
Use the `druid.cache.type` configuration to set a different kind of cache.
-
## Cache configuration
Cache settings are set globally, so the same configuration can be re-used
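As a concrete illustration, a broker cache section in `common.runtime.properties` might look like the following sketch (property names follow the Druid cache configuration docs; the values are illustrative assumptions, not recommendations):

```properties
# Use the local in-memory cache (the default); set to e.g. "memcached" for a distributed cache
druid.cache.type=local
# Maximum size of the local cache in bytes (illustrative value)
druid.cache.sizeInBytes=10000000

# Enable reading from and populating the cache on the broker
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
```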
diff --git a/docs/content/configuration/logging.md b/docs/content/configuration/logging.md
index ca251907d8b..285681399cd 100644
--- a/docs/content/configuration/logging.md
+++ b/docs/content/configuration/logging.md
@@ -6,16 +6,7 @@ Logging
Druid nodes will emit logs that are useful for debugging to the console. Druid nodes also emit periodic metrics about their state. For more about metrics, see [Configuration](../configuration/index.html). Metric logs are printed to the console by default, and can be disabled with `-Ddruid.emitter.logging.logLevel=debug`.
-Druid uses [log4j2](http://logging.apache.org/log4j/2.x/) for logging. Logging can be configured with a log4j2.xml file. Add the path to the directory containing the log4j2.xml file (eg a config dir) to your classpath if you want to override default Druid log configuration. Note that this directory should be earlier in the classpath than the druid jars. The easiest way to do this is to prefix the classpath with the config dir. For example, if the log4j2.xml file is in config/_common:
-
-```bash
-java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
- -Ddruid.realtime.specFile=examples/indexing/wikipedia.spec \
- -classpath "config/_common:config/realtime:lib/*" \
- io.druid.cli.Main server realtime
-```
-
-Note the "-classpath" in this example has the config dir before the jars under lib/*.
+Druid uses [log4j2](http://logging.apache.org/log4j/2.x/) for logging. Logging can be configured with a log4j2.xml file. Add the path to the directory containing the log4j2.xml file (e.g. the _common/ dir) to your classpath if you want to override default Druid log configuration. Note that this directory should be earlier in the classpath than the druid jars. The easiest way to do this is to prefix the classpath with the config dir.
To enable java logging to go through log4j2, set the `-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager` server parameter.
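For reference, a minimal `log4j2.xml` that logs to the console could look like this (a sketch based on standard log4j2 configuration; adjust the pattern and level to taste):

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<Configuration status="WARN">
  <Appenders>
    <!-- Write all log output to stdout -->
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>
```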
diff --git a/docs/content/configuration/production-cluster.md b/docs/content/configuration/production-cluster.md
index 4b659e24162..bc87940dd5a 100644
--- a/docs/content/configuration/production-cluster.md
+++ b/docs/content/configuration/production-cluster.md
@@ -4,40 +4,40 @@ layout: doc_page
Production Cluster Configuration
================================
-```note-info
-This configuration is an example of what a production cluster could look like. Many other hardware combinations are
+
+This configuration is an example of what a production cluster could look like. Many other hardware combinations are
possible! Cheaper hardware is absolutely possible.
-```
+
-This production Druid cluster assumes that metadata storage and Zookeeper are already set up. The deep storage that is
+This production Druid cluster assumes that metadata storage and Zookeeper are already set up. The deep storage that is
used for examples is [S3](https://aws.amazon.com/s3/) and [memcached](http://memcached.org/) is used for a distributed cache.
-```note-info
-The nodes in this example do not need to be on their own individual servers. Overlord and Coordinator nodes should be
-co-located on the same hardware.
-```
+
+The nodes in this example do not need to be on their own individual servers. Overlord and Coordinator nodes should be
+co-located on the same hardware.
+
-The nodes that respond to queries (Historical, Broker, and MiddleManager nodes) will use as many cores as are available,
-depending on usage, so it is best to keep these on dedicated machines. The upper limit of effectively utilized cores is
-not well characterized yet and would depend on types of queries, query load, and the schema. Historical daemons should
-have a heap size of at least 1GB per core for normal usage, but could be squeezed into a smaller heap for testing.
-Since in-memory caching is essential for good performance, even more RAM is better.
-Broker nodes will use RAM for caching, so they do more than just route queries.
+The nodes that respond to queries (Historical, Broker, and MiddleManager nodes) will use as many cores as are available,
+depending on usage, so it is best to keep these on dedicated machines. The upper limit of effectively utilized cores is
+not well characterized yet and would depend on types of queries, query load, and the schema. Historical daemons should
+have a heap size of at least 1GB per core for normal usage, but could be squeezed into a smaller heap for testing.
+Since in-memory caching is essential for good performance, even more RAM is better.
+Broker nodes will use RAM for caching, so they do more than just route queries.
SSDs are highly recommended for Historical nodes when they have more segments loaded than available memory.
The nodes that are responsible for coordination (Coordinator and Overlord nodes) require much less processing.
-The effective utilization of cores by Zookeeper, metadata storage, and Coordinator nodes is likely to be between 1 and 2
-for each process/daemon, so these could potentially share a machine with lots of cores. These daemons work with heap
+The effective utilization of cores by Zookeeper, metadata storage, and Coordinator nodes is likely to be between 1 and 2
+for each process/daemon, so these could potentially share a machine with lots of cores. These daemons work with heap
size between 500MB and 1GB.
-We'll use [EC2](https://aws.amazon.com/ec2/) r3.8xlarge nodes for query facing nodes and m1.xlarge nodes for coordination nodes.
-The following examples work relatively well in production, however, a more optimized tuning for the nodes we selected and
+We'll use [EC2](https://aws.amazon.com/ec2/) r3.8xlarge nodes for query-facing nodes and m1.xlarge nodes for coordination nodes.
+The following examples work relatively well in production; however, a more optimized tuning for the nodes we selected and
more optimal hardware for a Druid cluster are both definitely possible.
-```note-caution
+
For high availability, there should be at least one redundant copy of every process running on separate hardware.
-```
+
### Common Configuration (common.runtime.properties)
diff --git a/docs/content/configuration/realtime.md b/docs/content/configuration/realtime.md
index 45b7c19a9d6..b8a205bafa5 100644
--- a/docs/content/configuration/realtime.md
+++ b/docs/content/configuration/realtime.md
@@ -3,11 +3,6 @@
layout: doc_page
---
-```note-caution
-If you are doing stream-pull based ingestion, we suggest using [stream-pushed](../ingestion/stream-push.html) based ingestion instead and not
-using real-time nodes.
-```
-
Realtime Node Configuration
==============================
For general Realtime Node information, see [here](../design/realtime.html).
diff --git a/docs/content/dependencies/metadata-storage.md b/docs/content/dependencies/metadata-storage.md
index b1470b71986..16add3b7cdb 100644
--- a/docs/content/dependencies/metadata-storage.md
+++ b/docs/content/dependencies/metadata-storage.md
@@ -154,14 +154,16 @@ an issue).
The `payload` column stores a JSON blob that has all of the metadata for the segment (some of the data stored in this payload is redundant with some of the columns in the table, that is intentional). This looks something like
-```
+```json
{
"dataSource":"wikipedia",
"interval":"2012-05-23T00:00:00.000Z/2012-05-24T00:00:00.000Z",
"version":"2012-05-24T00:10:00.046Z",
- "loadSpec":{"type":"s3_zip",
- "bucket":"bucket_for_segment",
- "key":"path/to/segment/on/s3"},
+ "loadSpec":{
+ "type":"s3_zip",
+ "bucket":"bucket_for_segment",
+ "key":"path/to/segment/on/s3"
+ },
"dimensions":"comma-delimited-list-of-dimension-names",
"metrics":"comma-delimited-list-of-metric-names",
"shardSpec":{"type":"none"},
@@ -197,13 +199,12 @@ The Audit table is used to store the audit history for configuration changes
e.g. rule changes done by [Coordinator](../design/coordinator.html) and other
config changes.
-+
-+##Accessed By: ##
-+
-+The Metadata Storage is accessed only by:
-+
-+1. Realtime Nodes
- 2. Indexing Service Nodes (if any)
-+3. Coordinator Nodes
-+
-+Thus you need to give permissions (eg in AWS Security Groups) only for these machines to access the Metadata storage.
+## Accessed By
+
+The Metadata Storage is accessed only by:
+
+1. Indexing Service Nodes (if any)
+2. Realtime Nodes (if any)
+3. Coordinator Nodes
+
+Thus you need to give permissions (e.g. in AWS Security Groups) only for these machines to access the Metadata storage.
diff --git a/docs/content/design/concepts-and-terminology.md b/docs/content/design/concepts-and-terminology.md
deleted file mode 100644
index d64ed375d62..00000000000
--- a/docs/content/design/concepts-and-terminology.md
+++ /dev/null
@@ -1,45 +0,0 @@
----
-layout: doc_page
----
-Concepts and Terminology
-========================
-
-The following definitions are given with respect to the Druid data store. They are intended to help you better understand the Druid documentation, where these terms and concepts occur.
-
-More definitions are available on the [design page](../design/design.html).
-
-* **Aggregation** The summarizing of data meeting certain specifications. Druid aggregates [timeseries data](#timeseries), which in effect compacts the data. Time intervals (set in configuration) are used to create buckets, while [timestamps](#timestamp) determine which buckets data aggregated in.
-
-* **Aggregators** A mechanism for combining records during realtime incremental indexing, Hadoop batch indexing, and in queries.
-
-* **Compute node** Obsolete name for a [Historical node](../design/historical.html).
-
-* **DataSource** A table-like view of data; specified in [specFiles](#specfile) and in queries. A dataSource specifies the source of data being ingested and ultimately stored in [segments](#segment).
-
-* **Dimensions** Aspects or categories of data, such as languages or locations. For example, with *language* and *country* as the type of dimension, values could be "English" or "Mandarin" for language, or "USA" or "China" for country. In Druid, dimensions can serve as filters for narrowing down hits (for example, language = "English" or country = "China").
-
-* **Ephemeral Node** A Zookeeper node (or "znode") that exists for as long as the session that created the znode is active. More info [here](http://zookeeper.apache.org/doc/r3.2.1/zookeeperProgrammers.html#Ephemeral+Nodes). In a Druid cluster, ephemeral nodes are typically used for commands (such as assigning [segments](#segment) to certain nodes).
-
-* **Granularity** The time interval corresponding to aggregation by time. Druid configuration settings specify the granularity of [timestamp](#timestamp) buckets in a [segment](#segment) (for example, by minute or by hour), as well as the granularity of the segment itself. The latter is essentially the overall range of absolute time covered by the segment. In queries, granularity settings control the summarization of findings.
-
-* **Ingestion** The pulling and initial storing and processing of data. Druid supports realtime and batch ingestion of data, and applies indexing in both cases.
-
-* **Master node** Obsolete name for a [Coordinator node](../design/coordinator.html).
-
-* **Metrics** Countable data that can be aggregated. Metrics, for example, can be the number of visitors to a website, number of tweets per day, or average revenue.
-
-* **Rollup** The aggregation of data that occurs at one or more stages, based on settings in a [configuration file](#specfile).
-
-
-* **Segment** A collection of (internal) records that are stored and processed together. Druid chunks data into segments representing a time interval, and these are stored and manipulated in the cluster.
-
-* **Shard** A sub-partition of the data, allowing multiple [segments](#segment) to represent the data in a certain time interval. Sharding occurs along time partitions to better handle amounts of data that exceed certain limits on segment size, although sharding along dimensions may also occur to optimize efficiency.
-
-
-* **specFile** The specification for services in JSON format; see [Realtime](../design/realtime.html) and [Batch-ingestion](../ingestion/batch-ingestion.html)
-
-
-* **Timeseries Data** Data points which are ordered in time. The closing value of a financial index or the number of tweets per hour with a certain hashtag are examples of timeseries data.
-
-
-* **Timestamp** An absolute position on a timeline, given in a standard alpha-numerical format such as with UTC time. [Timeseries data](#timeseries) points can be ordered by timestamp, and in Druid, they are.
diff --git a/docs/content/design/design.md b/docs/content/design/design.md
index a54df280eab..c2a8beaf38d 100644
--- a/docs/content/design/design.md
+++ b/docs/content/design/design.md
@@ -31,7 +31,6 @@ Each of the systems, or components, described below also has a dedicated page wi
The node types that currently exist are:
* [**Historical**](../design/historical.html) nodes are the workhorses that handle storage and querying on "historical" data (non-realtime). Historical nodes download segments from deep storage, respond to the queries from broker nodes about these segments, and return results to the broker nodes. They announce themselves and the segments they are serving in Zookeeper, and also use Zookeeper to monitor for signals to load or drop new segments.
-* [**Realtime**](../design/realtime.html) nodes ingest data in real time. They are in charge of listening to a stream of incoming data and making it available immediately inside the Druid system. Real-time nodes respond to query requests from Broker nodes, returning query results to those nodes. Aged data is pushed from Realtime nodes to deep storage. Realtime nodes monitor ZooKeeper to discover segments that they've pushed to deep storage have been loaded by Historicals—if so, they drop those segments.
* [**Coordinator**](../design/coordinator.html) nodes monitor the grouping of historical nodes to ensure that data is available, replicated and in a generally "optimal" configuration. They do this by reading segment metadata information from metadata storage to determine what segments should be loaded in the cluster, using Zookeeper to determine what Historical nodes exist, and creating Zookeeper entries to tell Historical nodes to load and drop new segments.
* [**Broker**](../design/broker.html) nodes receive queries from external clients and forward those queries to Realtime and Historical nodes. When Broker nodes receive results, they merge these results and return them to the caller. For knowing topology, Broker nodes use Zookeeper to determine what Realtime and Historical nodes exist.
* [**Indexing Service**](../design/indexing-service.html) nodes form a cluster of workers to load batch and real-time data into the system as well as allow for alterations to the data stored in the system.
diff --git a/docs/content/development/datasketches-aggregators.md b/docs/content/development/datasketches-aggregators.md
index ffa3eb0d817..5b59aa94eed 100644
--- a/docs/content/development/datasketches-aggregators.md
+++ b/docs/content/development/datasketches-aggregators.md
@@ -3,8 +3,9 @@ layout: doc_page
---
## DataSketches aggregator
-Druid aggregators based on [datasketches]()http://datasketches.github.io/) library. Note that sketch algorithms are approxiate, see details in the "Accuracy" section of the datasketches doc.
-At ingestion time, this aggregator creates the theta sketch objects which get stored in Druid segments. Logically speaking, a theta sketch object can be thought of as a Set data structure. At query time, sketches are read and aggregated(set unioned) together. In the end, by default, you receive the estimate of number of unique entries in the sketch object. Also, You can use post aggregators to do union, intersection or difference on sketch columns in the same row.
+
+These are Druid aggregators based on the [datasketches](http://datasketches.github.io/) library. Note that sketch algorithms are approximate; see details in the "Accuracy" section of the datasketches doc.
+At ingestion time, this aggregator creates the theta sketch objects which get stored in Druid segments. Logically speaking, a theta sketch object can be thought of as a Set data structure. At query time, sketches are read and aggregated (set unioned) together. By default, the result you receive is the estimated number of unique entries in the sketch object. You can also use post aggregators to do union, intersection, or difference on sketch columns in the same row.
Note that you can use the `thetaSketch` aggregator on columns that were not ingested using the same aggregator; it will return the estimated cardinality of the column. It is still recommended to use it at ingestion time as well to make querying faster.
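A `thetaSketch` aggregator in a query or ingestion spec takes roughly this shape (the `name` and `fieldName` values here are illustrative):

```json
{
  "type": "thetaSketch",
  "name": "unique_users",
  "fieldName": "user_id"
}
```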
### Aggregators
diff --git a/docs/content/development/integrating-druid-with-other-technologies.md b/docs/content/development/integrating-druid-with-other-technologies.md
index 2d634f47beb..b30f9a97c15 100644
--- a/docs/content/development/integrating-druid-with-other-technologies.md
+++ b/docs/content/development/integrating-druid-with-other-technologies.md
@@ -9,7 +9,7 @@ This page discusses how we can integrate Druid with other technologies.
Event streams can be stored in a distributed message bus such as Kafka and further processed via a distributed stream
processor system such as Storm, Samza, or Spark Streaming. Data processed by the stream processor can feed into Druid using
-the [Tranquility](https://github.com/druid-io/tranquility) library. Data can be
+the [Tranquility](https://github.com/druid-io/tranquility) library.
diff --git a/docs/content/development/libraries.md b/docs/content/development/libraries.md
index 7dcecd666f2..e409ff53db2 100644
--- a/docs/content/development/libraries.md
+++ b/docs/content/development/libraries.md
@@ -39,26 +39,13 @@ Some great folks have written their own libraries to interact with Druid
#### SQL
* [implydata/plyql](https://github.com/implydata/plyql) - A command line interface for issuing SQL queries to Druid via [plywood](https://github.com/implydata/plywood)
-* [srikalyc/Sql4D](https://github.com/srikalyc/Sql4D) - A SQL client for Druid. Used in production at Yahoo.
-Community Helper Libraries
---------------------------
-
-* [madvertise/druid-dumbo](https://github.com/madvertise/druid-dumbo) - Scripts to help generate batch configs for the ingestion of data into Druid
-* [housejester/druid-test-harness](https://github.com/housejester/druid-test-harness) - A set of scripts to simplify standing up some servers and seeing how things work
-* [mingfang/docker-druid](https://github.com/mingfang/docker-druid) - A Dockerfile to run the entire Druid cluster
-
Other Druid Distributions
-------------------------
* [Imply Analytics Platform](http://imply.io/download) - The Imply Analytics platform repackages Druid, all its dependencies, and a UI and SQL layer.
-Tools
----
-
-* [Insert Segments](../../operations/insert-segment-to-db.html) - A tool that can insert segments' metadata into Druid metadata storage.
-
UIs
---
@@ -68,13 +55,20 @@ UIs
* [Metabase](https://github.com/metabase/metabase) - Simple dashboards, charts and query tool for your Druid DB
Tools
----
+-----
* [Insert Segments](../../operations/insert-segment-to-db.html) - A tool that can insert segments' metadata into Druid metadata storage.
-Other Community Extensions
+Community Helper Libraries
--------------------------
+* [madvertise/druid-dumbo](https://github.com/madvertise/druid-dumbo) - Scripts to help generate batch configs for the ingestion of data into Druid
+* [housejester/druid-test-harness](https://github.com/housejester/druid-test-harness) - A set of scripts to simplify standing up some servers and seeing how things work
+* [mingfang/docker-druid](https://github.com/mingfang/docker-druid) - A Dockerfile to run the entire Druid cluster
+
+Community Extensions
+--------------------
+
These are extensions from the community. (If you would like yours listed please speak up!)
* [acesinc/druid-cors-filter-extension](https://github.com/acesinc/druid-cors-filter-extension) - An extension to enable CORS headers in http requests.
diff --git a/docs/content/ingestion/batch-ingestion.md b/docs/content/ingestion/batch-ingestion.md
index e3af622cd3c..c9beb5fea2b 100644
--- a/docs/content/ingestion/batch-ingestion.md
+++ b/docs/content/ingestion/batch-ingestion.md
@@ -70,7 +70,8 @@ of a Druid [overlord](../design/indexing-service.html). A sample task is shown b
"tuningConfig" : {
"type": "hadoop"
}
- }
+ },
+  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:<hadoop_version>"]
}
```
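The `hadoopDependencyCoordinates` field, when present, is a JSON array of Maven-style coordinates identifying the Hadoop version to pull in; for example (the exact coordinates here are illustrative):

```json
"hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.3.0"]
```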
@@ -261,7 +262,7 @@ with Amazon-specific features such as S3 encryption and consistent views. If you
features, you will need to make the Amazon EMR Hadoop JARs available to Druid through one of the
mechanisms described in the [Using other Hadoop distributions](#using-other-hadoop-distributions) section.
-### Using other Hadoop distributions
+## Using other Hadoop distributions
Druid works out of the box with many Hadoop distributions.
diff --git a/docs/content/ingestion/command-line-hadoop-indexer.md b/docs/content/ingestion/command-line-hadoop-indexer.md
index eba792d09f8..90422c53380 100644
--- a/docs/content/ingestion/command-line-hadoop-indexer.md
+++ b/docs/content/ingestion/command-line-hadoop-indexer.md
@@ -10,6 +10,13 @@ To run:
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*: io.druid.cli.Main index hadoop
```
+## Options
+
- `--coordinate` - provide a version of Hadoop to use. This property overrides the default Hadoop coordinates. Once specified, Druid will look for those Hadoop dependencies in the location specified by `druid.extensions.hadoopDependenciesDir`.
- `--no-default-hadoop` - don't pull down the default Hadoop version
+
+## Spec file
+
The spec file needs to contain a JSON object where the contents are the same as the "spec" field in the Hadoop index task.
In addition, the following fields need to be added to the ioConfig:
diff --git a/docs/content/ingestion/data-formats.md b/docs/content/ingestion/data-formats.md
index f6e9585b181..598270f4e9e 100644
--- a/docs/content/ingestion/data-formats.md
+++ b/docs/content/ingestion/data-formats.md
@@ -66,6 +66,8 @@ All forms of Druid ingestion require some form of schema object. The format of t
}
```
+If you have nested JSON, [Druid can automatically flatten it for you](flatten-json.html).
+
### CSV
Since the CSV data cannot contain the column names (no header is allowed), these must be added before that data can be processed:
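A CSV `parseSpec` that supplies the column names might look like the following sketch (the column and dimension names are illustrative):

```json
"parseSpec": {
  "format": "csv",
  "timestampSpec": {
    "column": "timestamp"
  },
  "columns": ["timestamp", "page", "language", "user", "added"],
  "dimensionsSpec": {
    "dimensions": ["page", "language", "user"]
  }
}
```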
diff --git a/docs/content/ingestion/index.md b/docs/content/ingestion/index.md
index 9252bbe8db2..df1c8786616 100644
--- a/docs/content/ingestion/index.md
+++ b/docs/content/ingestion/index.md
@@ -93,7 +93,7 @@ If `type` is not included, the parser defaults to `string`.
### Avro Stream Parser
-This is for realtime ingestion.
+This is for realtime ingestion. Make sure to include "io.druid.extensions:druid-avro-extensions" as an extension.
| Field | Type | Description | Required |
|-------|------|-------------|----------|
@@ -102,6 +102,7 @@ This is for realtime ingestion.
| parseSpec | JSON Object | Specifies the format of the data. | yes |
For example, using Avro stream parser with schema repo Avro bytes decoder:
+
```json
"parser" : {
"type" : "avro_stream",
@@ -116,11 +117,7 @@ For example, using Avro stream parser with schema repo Avro bytes decoder:
"url" : "${YOUR_SCHEMA_REPO_END_POINT}",
}
},
- "parsSpec" : {
- "format" : "timeAndDims",
- "timestampSpec" : {},
- "dimensionsSpec" : {}
- }
+        "parseSpec" : {
+          "format" : "timeAndDims",
+          "timestampSpec" : {},
+          "dimensionsSpec" : {}
+        }
}
```
@@ -155,7 +152,7 @@ This Avro bytes decoder first extract `subject` and `id` from input message byte
### Avro Hadoop Parser
-This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"io.druid.data.input.avro.AvroValueInputFormat"`. You may want to set Avro reader's schema in `jobProperties` in `tuningConfig`, eg: `"avro.schema.path.input.value": "/path/to/your/schema.avsc"` or `"avro.schema.input.value": "your_schema_JSON_object"`, if reader's schema is not set, the schema in Avro object container file will be used, see [Avro specification](http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution).
+This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"io.druid.data.input.avro.AvroValueInputFormat"`. You may want to set the Avro reader's schema in `jobProperties` in `tuningConfig`, e.g. `"avro.schema.path.input.value": "/path/to/your/schema.avsc"` or `"avro.schema.input.value": "your_schema_JSON_object"`. If the reader's schema is not set, the schema in the Avro object container file will be used; see the [Avro specification](http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution). Make sure to include "io.druid.extensions:druid-avro-extensions" as an extension.
| Field | Type | Description | Required |
|-------|------|-------------|----------|
@@ -164,20 +161,16 @@ This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `
| fromPigAvroStorage | Boolean | Specifies whether the data file is stored using AvroStorage. | no(default == false) |
For example, using Avro Hadoop parser with custom reader's schema file:
+
```json
{
- "type" : "index_hadoop",
- "hadoopDependencyCoordinates" : ["io.druid.extensions:druid-avro-extensions"],
+ "type" : "index_hadoop",
"spec" : {
"dataSchema" : {
"dataSource" : "",
"parser" : {
"type" : "avro_hadoop",
- "parsSpec" : {
- "format" : "timeAndDims",
- "timestampSpec" : {},
- "dimensionsSpec" : {}
- }
+        "parseSpec" : {
+          "format" : "timeAndDims",
+          "timestampSpec" : {},
+          "dimensionsSpec" : {}
+        }
}
},
"ioConfig" : {
diff --git a/docs/content/ingestion/stream-ingestion.md b/docs/content/ingestion/stream-ingestion.md
index 33d4624fdd2..7d7269dc359 100644
--- a/docs/content/ingestion/stream-ingestion.md
+++ b/docs/content/ingestion/stream-ingestion.md
@@ -4,7 +4,7 @@ layout: doc_page
# Loading streams
-Streams can be ingested in Druid using either [Tranquility](https://github.com/druid-io/tranquility (a Druid-aware
+Streams can be ingested in Druid using either [Tranquility](https://github.com/druid-io/tranquility) (a Druid-aware
client) together with the [indexing service](../design/indexing-service.html), or through standalone [Realtime nodes](../design/realtime.html).
The first approach will be more complex to set up, but also offers scalability and high availability characteristics that advanced production
setups may require. The second approach has some known [limitations](../ingestion/stream-pull.html#limitations).
diff --git a/docs/content/ingestion/stream-pull.md b/docs/content/ingestion/stream-pull.md
index 59701cb5bdc..18937d88344 100644
--- a/docs/content/ingestion/stream-pull.md
+++ b/docs/content/ingestion/stream-pull.md
@@ -25,7 +25,6 @@ For Real-time Node Configuration, see [Realtime Configuration](../configuration/
For writing your own plugins to the real-time node, see [Firehose](../ingestion/firehose.html).
-
## Realtime "specFile"
The property `druid.realtime.specFile` has the path of a file (absolute or relative path and file name) with realtime specifications in it. This "specFile" should be a JSON Array of JSON objects like the following:
@@ -166,7 +165,7 @@ The following policies are available:
* `none` – All events are accepted. Never hands off data unless shutdown() is called on the configured firehose.
-#### Sharding
+#### Sharding
Druid uses shards, or segments with partition numbers, to more efficiently handle large amounts of incoming data. In Druid, shards represent the segments that together cover a time interval based on the value of `segmentGranularity`. If, for example, `segmentGranularity` is set to "hour", then a number of shards may be used to store the data for that hour. Sharding along dimensions may also occur to optimize efficiency.
diff --git a/docs/content/ingestion/stream-push.md b/docs/content/ingestion/stream-push.md
index 31c0934b1dd..327b120b13a 100644
--- a/docs/content/ingestion/stream-push.md
+++ b/docs/content/ingestion/stream-push.md
@@ -4,24 +4,24 @@ layout: doc_page
## Stream Push
-Druid can connect to any streaming data source through
-[Tranquility](https://github.com/druid-io/tranquility/blob/master/README.md), a package for pushing
+Druid can connect to any streaming data source through
+[Tranquility](https://github.com/druid-io/tranquility/blob/master/README.md), a package for pushing
streams to Druid in real-time. Druid does not come bundled with Tranquility, and you will have to download the distribution.
-```note-info
+
If you've never loaded streaming data into Druid, we recommend trying out the
-[stream loading tutorial](../tutorials/tutorial-streams.html) first and then coming back to this page.
-```
+[stream loading tutorial](../tutorials/tutorial-streams.html) first and then coming back to this page.
+
-Note that with all streaming ingestion options, you must ensure that incoming data is recent
-enough (within a [configurable windowPeriod](#segmentgranularity-and-windowperiod) of the current
-time). Older messages will not be processed in real-time. Historical data is best processed with
+Note that with all streaming ingestion options, you must ensure that incoming data is recent
+enough (within a [configurable windowPeriod](#segmentgranularity-and-windowperiod) of the current
+time). Older messages will not be processed in real-time. Historical data is best processed with
[batch ingestion](../ingestion/batch-ingestion.html).
### Server
-Druid can use [Tranquility Server](https://github.com/druid-io/tranquility/blob/master/docs/server.md), which
-lets you send data to Druid without developing a JVM app. You can run Tranquility server colocated with Druid middleManagers
+Druid can use [Tranquility Server](https://github.com/druid-io/tranquility/blob/master/docs/server.md), which
+lets you send data to Druid without developing a JVM app. You can run Tranquility server colocated with Druid middleManagers
and historical processes.
Tranquility server is started by issuing:
@@ -33,7 +33,7 @@ bin/tranquility server -configFile /server.json
To customize Tranquility Server:
- In `server.json`, customize the `properties` and `dataSources`.
-- If you have servers already running Tranquility, stop them (CTRL-C) and start
+- If you have servers already running Tranquility, stop them (CTRL-C) and start
them up again.
For tips on customizing `server.json`, see the
@@ -42,8 +42,8 @@ For tips on customizing `server.json`, see the
### Kafka
-[Tranquility Kafka](https://github.com/druid-io/tranquility/blob/master/docs/kafka.md)
-lets you load data from Kafka into Druid without writing any code. You only need a configuration
+[Tranquility Kafka](https://github.com/druid-io/tranquility/blob/master/docs/kafka.md)
+lets you load data from Kafka into Druid without writing any code. You only need a configuration
file.
Tranquility Kafka is started by issuing:
@@ -57,80 +57,80 @@ To customize Tranquility Kafka in the single-machine quickstart configuration:
- In `kafka.json`, customize the `properties` and `dataSources`.
- If you have Tranquility already running, stop it (CTRL-C) and start it up again.
-For tips on customizing `kafka.json`, see the
+For tips on customizing `kafka.json`, see the
[Tranquility Kafka documentation](https://github.com/druid-io/tranquility/blob/master/docs/kafka.md).
### JVM apps and stream processors
-Tranquility can also be embedded in JVM-based applications as a library. You can do this directly
-in your own program using the
-[Core API](https://github.com/druid-io/tranquility/blob/master/docs/core.md), or you can use
-the connectors bundled in Tranquility for popular JVM-based stream processors such as
-[Storm](https://github.com/druid-io/tranquility/blob/master/docs/storm.md),
-[Samza](https://github.com/druid-io/tranquility/blob/master/docs/samza.md),
-[Spark Streaming](https://github.com/druid-io/tranquility/blob/master/docs/spark.md), and
+Tranquility can also be embedded in JVM-based applications as a library. You can do this directly
+in your own program using the
+[Core API](https://github.com/druid-io/tranquility/blob/master/docs/core.md), or you can use
+the connectors bundled in Tranquility for popular JVM-based stream processors such as
+[Storm](https://github.com/druid-io/tranquility/blob/master/docs/storm.md),
+[Samza](https://github.com/druid-io/tranquility/blob/master/docs/samza.md),
+[Spark Streaming](https://github.com/druid-io/tranquility/blob/master/docs/spark.md), and
[Flink](https://github.com/druid-io/tranquility/blob/master/docs/flink.md).
## Concepts
### Task creation
-Tranquility automates creation of Druid realtime indexing tasks, handling partitioning, replication,
-service discovery, and schema rollover for you, seamlessly and without downtime. You never have to
-write code to deal with individual tasks directly. But, it can be helpful to understand how
+Tranquility automates creation of Druid realtime indexing tasks, handling partitioning, replication,
+service discovery, and schema rollover for you, seamlessly and without downtime. You never have to
+write code to deal with individual tasks directly. But, it can be helpful to understand how
Tranquility creates tasks.
-Tranquility spawns relatively short-lived tasks periodically, and each one handles a small number of
-[Druid segments](../design/segments.html). Tranquility coordinates all task
-creation through ZooKeeper. You can start up as many Tranquility instances as you like with the same
+Tranquility spawns relatively short-lived tasks periodically, and each one handles a small number of
+[Druid segments](../design/segments.html). Tranquility coordinates all task
+creation through ZooKeeper. You can start up as many Tranquility instances as you like with the same
configuration, even on different machines, and they will send to the same set of tasks.
-See the [Tranquility overview](https://github.com/druid-io/tranquility/blob/master/docs/overview.md)
+See the [Tranquility overview](https://github.com/druid-io/tranquility/blob/master/docs/overview.md)
for more details about how Tranquility manages tasks.
### segmentGranularity and windowPeriod
-The segmentGranularity is the time period covered by the segments produced by each task. For
-example, a segmentGranularity of "hour" will spawn tasks that create segments covering one hour
+The segmentGranularity is the time period covered by the segments produced by each task. For
+example, a segmentGranularity of "hour" will spawn tasks that create segments covering one hour
each.
-The windowPeriod is the slack time permitted for events. For example, a windowPeriod of ten minutes
-(the default) means that any events with a timestamp older than ten minutes in the past, or more
+The windowPeriod is the slack time permitted for events. For example, a windowPeriod of ten minutes
+(the default) means that any events with a timestamp older than ten minutes in the past, or more
than ten minutes in the future, will be dropped.
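The drop rule above can be sketched as follows (a simplified illustration with hypothetical names, not Tranquility's actual implementation):

```python
from datetime import datetime, timedelta

def within_window(event_ts, now, window_period=timedelta(minutes=10)):
    """Return True if the event timestamp falls inside [now - window, now + window]."""
    return now - window_period <= event_ts <= now + window_period

now = datetime(2016, 1, 1, 12, 0, 0)
print(within_window(datetime(2016, 1, 1, 11, 55), now))  # 5 minutes old: True
print(within_window(datetime(2016, 1, 1, 11, 45), now))  # 15 minutes old: dropped, False
```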
-These are important configurations because they influence how long tasks will be alive for, and how
-long data stays in the realtime system before being handed off to the historical nodes. For example,
-if your configuration has segmentGranularity "hour" and windowPeriod ten minutes, tasks will stay
-around listening for events for an hour and ten minutes. For this reason, to prevent excessive
+These are important configurations because they influence how long tasks will be alive for, and how
+long data stays in the realtime system before being handed off to the historical nodes. For example,
+if your configuration has segmentGranularity "hour" and windowPeriod ten minutes, tasks will stay
+around listening for events for an hour and ten minutes. For this reason, to prevent excessive
buildup of tasks, it is recommended that your windowPeriod be less than your segmentGranularity.
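The task-lifetime arithmetic above can be sketched like this (a hypothetical helper for illustration, not part of Druid):

```python
from datetime import timedelta

def task_lifetime(segment_granularity, window_period):
    # A task listens for its segmentGranularity period, plus windowPeriod of slack.
    return segment_granularity + window_period

# segmentGranularity "hour" with the default ten-minute windowPeriod:
print(task_lifetime(timedelta(hours=1), timedelta(minutes=10)))  # 1:10:00
```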
### Append only
-Druid streaming ingestion is *append-only*, meaning you cannot use streaming ingestion to update or
-delete individual records after they are inserted. If you need to update or delete individual
-records, you need to use a batch reindexing process. See the *[batch ingest](batch-ingestion.html)*
+Druid streaming ingestion is *append-only*, meaning you cannot use streaming ingestion to update or
+delete individual records after they are inserted. If you need to update or delete individual
+records, you need to use a batch reindexing process. See the *[batch ingest](batch-ingestion.html)*
page for more details.
-Druid does support efficient deletion of entire time ranges without resorting to batch reindexing.
+Druid does support efficient deletion of entire time ranges without resorting to batch reindexing.
This can be done automatically through setting up retention policies.
### Guarantees
-Tranquility operates under a best-effort design. It tries reasonably hard to preserve your data, by allowing you to set
-up replicas and by retrying failed pushes for a period of time, but it does not guarantee that your events will be
+Tranquility operates under a best-effort design. It tries reasonably hard to preserve your data, by allowing you to set
+up replicas and by retrying failed pushes for a period of time, but it does not guarantee that your events will be
processed exactly once. In some conditions, it can drop or duplicate events:
- Events with timestamps outside your configured windowPeriod will be dropped.
-- If you suffer more Druid Middle Manager failures than your configured replicas count, some
+- If you suffer more Druid Middle Manager failures than your configured replicas count, some
partially indexed data may be lost.
-- If there is a persistent issue that prevents communication with the Druid indexing service, and
-retry policies are exhausted during that period, or the period lasts longer than your windowPeriod,
+- If there is a persistent issue that prevents communication with the Druid indexing service, and
+retry policies are exhausted during that period, or the period lasts longer than your windowPeriod,
some events will be dropped.
-- If there is an issue that prevents Tranquility from receiving an acknowledgement from the indexing
+- If there is an issue that prevents Tranquility from receiving an acknowledgement from the indexing
service, it will retry the batch, which can lead to duplicated events.
-- If you are using Tranquility inside Storm or Samza, various parts of both architectures have an
+- If you are using Tranquility inside Storm or Samza, various parts of both architectures have an
at-least-once design and can lead to duplicated events.
-Under normal operation, these risks are minimal. But if you need absolute 100% fidelity for
-historical data, we recommend a [hybrid batch/streaming](../tutorials/ingestion.html#hybrid-batch-streaming)
+Under normal operation, these risks are minimal. But if you need absolute 100% fidelity for
+historical data, we recommend a [hybrid batch/streaming](../tutorials/ingestion.html#hybrid-batch-streaming)
architecture.
diff --git a/docs/content/operations/including-extensions.md b/docs/content/operations/including-extensions.md
index 7081bca1fc7..c539d317ffc 100644
--- a/docs/content/operations/including-extensions.md
+++ b/docs/content/operations/including-extensions.md
@@ -25,21 +25,12 @@ To let Druid load your extensions, follow the steps below
Example:
-Suppose you specify `druid.extensions.directory=/usr/local/druid/extensions`, and want Druid to load normal extensions ```druid-examples```, ```druid-kafka-eight``` and ```mysql-metadata-storage```.
+Suppose you specify `druid.extensions.directory=/usr/local/druid/extensions`, and want Druid to load normal extensions ```druid-kafka-eight``` and ```mysql-metadata-storage```.
Then under ```extensions```, it should look like this,
```
extensions/
-├── druid-examples
-│ ├── commons-beanutils-1.8.3.jar
-│ ├── commons-digester-1.8.jar
-│ ├── commons-logging-1.1.1.jar
-│ ├── commons-validator-1.4.0.jar
-│ ├── druid-examples-0.8.0-rc1.jar
-│ ├── twitter4j-async-3.0.3.jar
-│ ├── twitter4j-core-3.0.3.jar
-│ └── twitter4j-stream-3.0.3.jar
├── druid-kafka-eight
│ ├── druid-kafka-eight-0.7.3.jar
│ ├── jline-0.9.94.jar
@@ -61,7 +52,7 @@ extensions/
└── mysql-metadata-storage-0.8.0-rc1.jar
```
-As you can see, under ```extensions``` there are three sub-directories ```druid-examples```, ```druid-kafka-eight``` and ```mysql-metadata-storage```, each sub-directory denotes an extension that Druid might load.
+As you can see, under ```extensions``` there are two sub-directories, ```druid-kafka-eight``` and ```mysql-metadata-storage```; each sub-directory denotes an extension that Druid might load.
3) To have Druid load a specific list of extensions present under the root extension directory, set `druid.extensions.loadList` to the list of extensions to load. Using the example above, if you want Druid to load ```druid-kafka-eight``` and ```mysql-metadata-storage```, you can specify `druid.extensions.loadList=["druid-kafka-eight", "mysql-metadata-storage"]`.
diff --git a/docs/content/querying/aggregations.md b/docs/content/querying/aggregations.md
index fcce78d45d1..67396140447 100644
--- a/docs/content/querying/aggregations.md
+++ b/docs/content/querying/aggregations.md
@@ -114,9 +114,14 @@ your functionality as a native Java aggregator.
The javascript aggregator is recommended for rapidly prototyping features. This aggregator will be much slower in production
use than a native Java aggregator.
+## Approximate Aggregations
+
### Cardinality aggregator
-Computes the cardinality of a set of Druid dimensions, using HyperLogLog to estimate the cardinality.
+Computes the cardinality of a set of Druid dimensions, using HyperLogLog to estimate the cardinality. Please note that this
+aggregator will be much slower than indexing a column with the hyperUnique aggregator. This aggregator also runs over a dimension column, which
+means the string dimension cannot be removed from the dataset to improve rollup. In general, we strongly recommend using the hyperUnique aggregator
+instead of the cardinality aggregator if you do not care about the individual values of a dimension.
```json
{
@@ -181,10 +186,6 @@ Determine the number of distinct people (i.e. combinations of first and last nam
}
```
-## Complex Aggregations
-
-Druid supports complex aggregations such as various types of approximate sketches.
-
### HyperUnique aggregator
Uses [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) to compute the estimated cardinality of a dimension that has been aggregated as a "hyperUnique" metric at indexing time.
@@ -193,6 +194,8 @@ Uses [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) to
{ "type" : "hyperUnique", "name" : , "fieldName" : }
```
+For more approximate aggregators, please see [theta sketches](../development/datasketches-aggregators.html).
+
## Miscellaneous Aggregations
### Filtered Aggregator
diff --git a/docs/content/querying/dimensionspecs.md b/docs/content/querying/dimensionspecs.md
index 95e23286dc4..5a65a50856a 100644
--- a/docs/content/querying/dimensionspecs.md
+++ b/docs/content/querying/dimensionspecs.md
@@ -361,11 +361,13 @@ Then groupBy/topN processing pipeline "explodes" all multi-valued dimensions res
In addition to "query filter" which efficiently selects the rows to be processed, you can use the filtering dimension spec to filter for specific values within the values of a multi-valued dimension. These dimensionSpecs take a delegate DimensionSpec and a filtering criteria. From the "exploded" rows, only rows matching the given filtering criteria are returned in the query result.
The following filtered dimension spec acts as a whitelist or blacklist for values as per the "isWhitelist" attribute value.
+
```json
{ "type" : "listFiltered", "delegate" : , "values": , "isWhitelist": }
```
Following filtered dimension spec retains only the values matching regex. Note that `listFiltered` is faster than this and one should use that for whitelist or blacklist usecase.
+
```json
{ "type" : "regexFiltered", "delegate" : , "pattern": }
```
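For illustration, a whitelist spec that keeps only the values `t1` and `t3` of a hypothetical multi-valued `tags` dimension might look like (dimension and values here are made up):

```json
{
  "type" : "listFiltered",
  "delegate" : { "type" : "default", "dimension" : "tags", "outputName" : "tags" },
  "values" : ["t1", "t3"],
  "isWhitelist" : true
}
```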
diff --git a/docs/content/querying/joins.md b/docs/content/querying/joins.md
index ec4ffe24ca8..375d0bc065c 100644
--- a/docs/content/querying/joins.md
+++ b/docs/content/querying/joins.md
@@ -5,31 +5,31 @@ layout: doc_page
Druid has limited support for joins through [query-time lookups](../querying/lookups.html). The common use case of
query-time lookups is to replace one dimension value (e.g. a String ID) with another value (e.g. a human-readable
-String value).
+String value). This is similar to a star-schema join.
-Druid does not yet have full support for joins. Although Druid’s storage format would allow for the implementation
+Druid does not yet have full support for joins. Although Druid’s storage format would allow for the implementation
of joins (there is no loss of fidelity for columns included as dimensions), full support for joins has not yet been implemented
for the following reasons:
-1. Scaling join queries has been, in our professional experience,
+1. Scaling join queries has been, in our professional experience,
a constant bottleneck of working with distributed databases.
-2. The incremental gains in functionality are perceived to be
-of less value than the anticipated problems with managing
+2. The incremental gains in functionality are perceived to be
+of less value than the anticipated problems with managing
highly concurrent, join-heavy workloads.
-A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary
-high-level strategies for join queries we are aware of are a hash-based strategy or a
-sorted-merge strategy. The hash-based strategy requires that all but
-one data set be available as something that looks like a hash table,
-a lookup operation is then performed on this hash table for every
-row in the “primary” stream. The sorted-merge strategy assumes
-that each stream is sorted by the join key and thus allows for the incremental
-joining of the streams. Each of these strategies, however,
-requires the materialization of some number of the streams either in
+A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary
+high-level strategies for join queries we are aware of are a hash-based strategy or a
+sorted-merge strategy. The hash-based strategy requires that all but
+one data set be available as something that looks like a hash table,
+a lookup operation is then performed on this hash table for every
+row in the “primary” stream. The sorted-merge strategy assumes
+that each stream is sorted by the join key and thus allows for the incremental
+joining of the streams. Each of these strategies, however,
+requires the materialization of some number of the streams either in
sorted order or in a hash table form.
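The hash-based strategy described above can be sketched in a few lines (illustrative only; Druid does not execute joins this way):

```python
def hash_join(primary, lookup, key):
    """Hash-based join: materialize the lookup stream as a hash table,
    then probe it once per row of the primary stream."""
    table = {row[key]: row for row in lookup}  # materialize one side
    joined = []
    for row in primary:
        match = table.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined

users = [{"user_id": 1, "country": "US"}, {"user_id": 2, "country": "FR"}]
events = [{"user_id": 1, "clicks": 3}, {"user_id": 3, "clicks": 7}]
print(hash_join(events, users, "user_id"))
# → [{'user_id': 1, 'clicks': 3, 'country': 'US'}]
```

The cost that motivates the discussion above is the `table` materialization: one full side of the join must fit in (distributed) memory.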
-When all sides of the join are significantly large tables (> 1 billion
-records), materializing the pre-join streams requires complex
-distributed memory management. The complexity of the memory
-management is only amplified by the fact that we are targeting highly
+When all sides of the join are significantly large tables (> 1 billion
+records), materializing the pre-join streams requires complex
+distributed memory management. The complexity of the memory
+management is only amplified by the fact that we are targeting highly
concurrent, multi-tenant workloads.
diff --git a/docs/content/querying/post-aggregations.md b/docs/content/querying/post-aggregations.md
index a86f17bbd64..266f8184e09 100644
--- a/docs/content/querying/post-aggregations.md
+++ b/docs/content/querying/post-aggregations.md
@@ -103,7 +103,7 @@ It can be used in a sample calculation as so:
}]
```
-#### Example Usage
+## Example Usage
In this example, let’s calculate a simple percentage using post aggregators. Let’s imagine our data set has a metric called "total".
diff --git a/docs/content/tutorials/cluster.md b/docs/content/tutorials/cluster.md
index 9be27ed84ea..b213eaa34d2 100644
--- a/docs/content/tutorials/cluster.md
+++ b/docs/content/tutorials/cluster.md
@@ -6,32 +6,32 @@ layout: doc_page
Druid is designed to be deployed as a scalable, fault-tolerant cluster.
-In this document, we'll set up a simple cluster and discuss how it can be further configured to meet
-your needs. This simple cluster will feature scalable, fault-tolerant servers for Historicals and MiddleManagers, and a single
-coordination server to host the Coordinator and Overlord processes. In production, we recommend deploying Coordinators and Overlords in a fault-tolerant
+In this document, we'll set up a simple cluster and discuss how it can be further configured to meet
+your needs. This simple cluster will feature scalable, fault-tolerant servers for Historicals and MiddleManagers, and a single
+coordination server to host the Coordinator and Overlord processes. In production, we recommend deploying Coordinators and Overlords in a fault-tolerant
configuration as well.
## Select hardware
-The Coordinator and Overlord processes can be co-located on a single server that is responsible for handling the metadata and coordination needs of your cluster.
-The equivalent of an AWS [m3.xlarge](https://aws.amazon.com/ec2/instance-types/#M3) is sufficient for most clusters. This
+The Coordinator and Overlord processes can be co-located on a single server that is responsible for handling the metadata and coordination needs of your cluster.
+The equivalent of an AWS [m3.xlarge](https://aws.amazon.com/ec2/instance-types/#M3) is sufficient for most clusters. This
hardware offers:
- 4 vCPUs
- 15 GB RAM
- 80 GB SSD storage
-Historicals and MiddleManagers can be colocated on a single server to handle the actual data in your cluster. These servers benefit greatly from CPU, RAM,
-and SSDs. The equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3) is a
+Historicals and MiddleManagers can be colocated on a single server to handle the actual data in your cluster. These servers benefit greatly from CPU, RAM,
+and SSDs. The equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3) is a
good starting point. This hardware offers:
- 8 vCPUs
- 61 GB RAM
- 160 GB SSD storage
-Druid Brokers accept queries and farm them out to the rest of the cluster. They also optionally maintain an
-in-memory query cache. These servers benefit greatly from CPU and RAM, and can also be deployed on
-the equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3). This hardware
+Druid Brokers accept queries and farm them out to the rest of the cluster. They also optionally maintain an
+in-memory query cache. These servers benefit greatly from CPU and RAM, and can also be deployed on
+the equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3). This hardware
offers:
- 8 vCPUs
@@ -48,14 +48,14 @@ We recommend running your favorite Linux distribution. You will also need:
* Java 7 or better
-Your OS package manager should be able to help for both Java. If your Ubuntu-based OS
-does not have a recent enough version of Java, WebUpd8 offers [packages for those
+Your OS package manager should be able to help with Java. If your Ubuntu-based OS
+does not have a recent enough version of Java, WebUpd8 offers [packages for those
OSes](http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html).
## Download the distribution
-First, download and unpack the release archive. It's best to do this on a single machine at first,
-since you will be editing the configurations and then copying the modified distribution out to all
+First, download and unpack the release archive. It's best to do this on a single machine at first,
+since you will be editing the configurations and then copying the modified distribution out to all
of your servers.
```bash
@@ -80,8 +80,8 @@ We'll be editing the files in `conf/` in order to get things running.
## Configure deep storage
-Druid relies on a distributed filesystem or large object (blob) store for data storage. The most
-commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if
+Druid relies on a distributed filesystem or large object (blob) store for data storage. The most
+commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if
you already have a Hadoop deployment).
### S3
@@ -148,13 +148,13 @@ druid.indexer.logs.directory=/druid/indexing-logs
Also,
-- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml,
-mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into
+- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml,
+mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into
`conf/druid/_common/`.
## Configure Tranquility Server (optional)
-Data streams can be sent to Druid through a simple HTTP API powered by Tranquility
+Data streams can be sent to Druid through a simple HTTP API powered by Tranquility
Server. If you will be using this functionality, then at this point you should [configure
Tranquility Server](../ingestion/stream-ingestion.html#server).
@@ -166,56 +166,56 @@ using this functionality, then at this point you should
## Configure for connecting to Hadoop (optional)
-If you will be loading data from a Hadoop cluster, then at this point you should configure Druid to be aware
+If you will be loading data from a Hadoop cluster, then at this point you should configure Druid to be aware
of your cluster:
-- Update `druid.indexer.task.hadoopWorkingPath` in `conf/middleManager/runtime.properties` to
-a path on HDFS that you'd like to use for temporary files required during the indexing process.
+- Update `druid.indexer.task.hadoopWorkingPath` in `conf/middleManager/runtime.properties` to
+a path on HDFS that you'd like to use for temporary files required during the indexing process.
`druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing` is a common choice.
-- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml,
-mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into
+- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml,
+mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into
`conf/druid/_common/core-site.xml`, `conf/druid/_common/hdfs-site.xml`, and so on.
-Note that you don't need to use HDFS deep storage in order to load data from Hadoop. For example, if
-your cluster is running on Amazon Web Services, we recommend using S3 for deep storage even if you
+Note that you don't need to use HDFS deep storage in order to load data from Hadoop. For example, if
+your cluster is running on Amazon Web Services, we recommend using S3 for deep storage even if you
are loading data using Hadoop or Elastic MapReduce.
For more info, please see [batch ingestion](../ingestion/batch-ingestion.html).
## Configure addresses for Druid coordination
-In this simple cluster, you will deploy a single Druid Coordinator, a
+In this simple cluster, you will deploy a single Druid Coordinator, a
single Druid Overlord, a single ZooKeeper instance, and an embedded Derby metadata store on the same server.
-In `conf/druid/_common/common.runtime.properties`, replace
+In `conf/druid/_common/common.runtime.properties`, replace
"zk.host.ip" with the IP address of the machine that runs your ZK instance:
- `druid.zk.service.host`
-In `conf/_common/common.runtime.properties`, replace
+In `conf/_common/common.runtime.properties`, replace
"metadata.store.ip" with the IP address of the machine that you will use as your metadata store:
- `druid.metadata.storage.connector.connectURI`
- `druid.metadata.storage.connector.host`
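For example, if your coordination server's IP address were 10.0.0.5 (a hypothetical address), the relevant properties would look like:

```
druid.zk.service.host=10.0.0.5
druid.metadata.storage.connector.connectURI=jdbc:derby://10.0.0.5:1527/var/druid/metadata.db;create=true
druid.metadata.storage.connector.host=10.0.0.5
```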
-```note-caution
-In production, we recommend running 2 servers, each running a Druid Coordinator
-and a Druid Overlord. We also recommend running a ZooKeeper cluster on its own dedicated hardware,
-as well as replicated [metadata
-storage](http://druid.io/docs/latest/dependencies/metadata-storage.html) such as MySQL or
+
+In production, we recommend running 2 servers, each running a Druid Coordinator
+and a Druid Overlord. We also recommend running a ZooKeeper cluster on its own dedicated hardware,
+as well as replicated [metadata
+storage](http://druid.io/docs/latest/dependencies/metadata-storage.html) such as MySQL or
PostgreSQL, on its own dedicated hardware.
-```
+
## Tune Druid processes that serve queries
-Druid Historicals and MiddleManagers can be co-located on the same hardware. Both Druid processes benefit greatly from
-being tuned to the hardware they run on. If you are running Tranquility Server or Kafka, you can also colocate Tranquility with these two Druid processes.
-If you are using [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3)
-EC2 instances, or similar hardware, the configuration in the distribution is a
+Druid Historicals and MiddleManagers can be co-located on the same hardware. Both Druid processes benefit greatly from
+being tuned to the hardware they run on. If you are running Tranquility Server or Kafka, you can also colocate Tranquility with these two Druid processes.
+If you are using [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3)
+EC2 instances, or similar hardware, the configuration in the distribution is a
reasonable starting point.
-If you are using different hardware, we recommend adjusting configurations for your specific
+If you are using different hardware, we recommend adjusting configurations for your specific
hardware. The most commonly adjusted configurations are:
- `-Xmx` and `-Xms`
@@ -227,20 +227,20 @@ hardware. The most commonly adjusted configurations are:
- `druid.server.maxSize` and `druid.segmentCache.locations` on Historical Nodes
- `druid.worker.capacity` on MiddleManagers
-```note
+
Keep -XX:MaxDirectMemorySize >= numThreads*sizeBytes, otherwise Druid will fail to start up.
-```
+
-Please see the Druid [configuration documentation](../configuration/index.html) for a full description of all
+Please see the Druid [configuration documentation](../configuration/index.html) for a full description of all
possible configuration options.
## Tune Druid Brokers
-Druid Brokers also benefit greatly from being tuned to the hardware they
-run on. If you are using [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3) EC2 instances,
+Druid Brokers also benefit greatly from being tuned to the hardware they
+run on. If you are using [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3) EC2 instances,
or similar hardware, the configuration in the distribution is a reasonable starting point.
-If you are using different hardware, we recommend adjusting configurations for your specific
+If you are using different hardware, we recommend adjusting configurations for your specific
hardware. The most commonly adjusted configurations are:
- `-Xmx` and `-Xms`
@@ -251,17 +251,17 @@ hardware. The most commonly adjusted configurations are:
- `druid.query.groupBy.maxIntermediateRows`
- `druid.query.groupBy.maxResults`
-```note-caution
-Keep -XX:MaxDirectMemory >= numThreads*sizeBytes, otherwise Druid will fail to start up..
-```
+
+Keep -XX:MaxDirectMemorySize >= numThreads*sizeBytes, otherwise Druid will fail to start up.
+
-Please see the Druid [configuration documentation](../configuration/index.html) for a full description of all
+Please see the Druid [configuration documentation](../configuration/index.html) for a full description of all
possible configuration options.
## Start Coordinator, Overlord, Zookeeper, and metadata store
-Copy the Druid distribution and your edited configurations to your coordination
-server. If you have been editing the configurations on your local machine, you can use *rsync* to
+Copy the Druid distribution and your edited configurations to your coordination
+server. If you have been editing the configurations on your local machine, you can use *rsync* to
copy them:
```bash
@@ -269,18 +269,19 @@ rsync -az druid-0.9.0/ COORDINATION_SERVER:druid-0.9.0/
```
Log on to your coordination server and install Zookeeper:
-
+
```bash
curl http://www.gtlib.gatech.edu/pub/apache/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz -o zookeeper-3.4.6.tar.gz
tar -xzf zookeeper-3.4.6.tar.gz
cd zookeeper-3.4.6
cp conf/zoo_sample.cfg conf/zoo.cfg
./bin/zkServer.sh start
-
-```note-caution
-In production, we also recommend running a ZooKeeper cluster on its own dedicated hardware.
```
+
+In production, we also recommend running a ZooKeeper cluster on its own dedicated hardware.
+
+
On your coordination server, *cd* into the distribution and start up the coordination services (you should do this in different windows or pipe the log to a file):
```bash
@@ -288,7 +289,7 @@ java `cat conf/druid/coordinator/jvm.config | xargs` -cp conf/druid/_common:conf
java `cat conf/druid/overlord/jvm.config | xargs` -cp conf/druid/_common:conf/druid/overlord:lib/* io.druid.cli.Main server overlord
```
-You should see a log message printed out for each service that starts up. You can view detailed logs
+You should see a log message printed out for each service that starts up. You can view detailed logs
for any service by looking in the `var/log/druid` directory using another terminal.
## Start Historicals and MiddleManagers
@@ -304,15 +305,15 @@ java `cat conf/druid/middleManager/jvm.config | xargs` -cp conf/druid/_common:co
You can add more servers with Druid Historicals and MiddleManagers as needed.
-```note-info
-For clusters with complex resource allocation needs, you can break apart Historicals and MiddleManagers and scale the components individually.
-This also allows you take advantage of Druid's built-in MiddleManager
+
+For clusters with complex resource allocation needs, you can break apart Historicals and MiddleManagers and scale the components individually.
+This also allows you to take advantage of Druid's built-in MiddleManager
autoscaling facility.
-```
+
-If you are doing push-based stream ingestion with Kafka or over HTTP, you can also start Tranquility server on the same
-hardware that holds MiddleManagers and Historicals. For large scale production, MiddleManagers and Tranquility server
-can still be co-located. If you are running Tranquility (not server) with a stream processor, you can co-locate
+If you are doing push-based stream ingestion with Kafka or over HTTP, you can also start Tranquility server on the same
+hardware that holds MiddleManagers and Historicals. For large scale production, MiddleManagers and Tranquility server
+can still be co-located. If you are running Tranquility (not server) with a stream processor, you can co-locate
Tranquility with the stream processor and not require Tranquility server.
```bash
@@ -324,9 +325,9 @@ bin/tranquility -configFile /conf/tranqu
## Start Druid Broker
-Copy the Druid distribution and your edited configurations to your servers set aside for the Druid Brokers.
+Copy the Druid distribution and your edited configurations to your servers set aside for the Druid Brokers.
-On each one, *cd* into the distribution and run this command to start a Broker (you want to pipe the output to a log file):
+On each one, *cd* into the distribution and run this command to start a Broker (you may want to pipe the output to a log file):
```bash
java `cat conf/druid/broker/jvm.config | xargs` -cp conf/druid/_common:conf/druid/broker:lib/* io.druid.cli.Main server broker
@@ -336,5 +337,5 @@ You can add more Brokers as needed based on query load.
## Loading data
-Congratulations, you now have a Druid cluster! The next step is to learn about recommended ways to load data into
+Congratulations, you now have a Druid cluster! The next step is to learn about recommended ways to load data into
Druid based on your use case. Read more about [loading data](ingestion.html).
diff --git a/docs/content/tutorials/ingestion.md b/docs/content/tutorials/ingestion.md
index ac85f24d094..ab2844fe269 100644
--- a/docs/content/tutorials/ingestion.md
+++ b/docs/content/tutorials/ingestion.md
@@ -9,15 +9,15 @@ layout: doc_page
Druid supports streaming (real-time) and file-based (batch) ingestion methods. The most
popular configurations are:
-- [Files](batch-ingestion.html) - Load data from HDFS, S3, local files, or any supported Hadoop
+- [Files](../ingestion/batch-ingestion.html) - Load data from HDFS, S3, local files, or any supported Hadoop
filesystem in batches. We recommend this method if your dataset is already in flat files.
-- [Stream push](stream-ingestion.html#stream-push) - Push a data stream into Druid in real-time
+- [Stream push](../ingestion/stream-ingestion.html#stream-push) - Push a data stream into Druid in real-time
using [Tranquility](http://github.com/druid-io/tranquility), a client library for sending streams
to Druid. We recommend this method if your dataset originates in a streaming system like Kafka,
Storm, Spark Streaming, or your own system.
-- [Stream pull](stream-ingestion.html#stream-pull) - Pull a data stream directly from an external
+- [Stream pull](../ingestion/stream-ingestion.html#stream-pull) - Pull a data stream directly from an external
data source into Druid using Realtime Nodes.
## Getting started
diff --git a/docs/content/tutorials/quickstart.md b/docs/content/tutorials/quickstart.md
index bff14cc33ae..d0bcb3bd08c 100644
--- a/docs/content/tutorials/quickstart.md
+++ b/docs/content/tutorials/quickstart.md
@@ -15,17 +15,17 @@ You will need:
* 8G of RAM
* 2 vCPUs
-On Mac OS X, you can use [Oracle's JDK
-8](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) to install
+On Mac OS X, you can use [Oracle's JDK
+8](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) to install
Java.
On Linux, your OS package manager should be able to help you install Java. If your Ubuntu-
-based OS does not have a recent enough version of Java, WebUpd8 offers [packages for those
+based OS does not have a recent enough version of Java, WebUpd8 offers [packages for those
OSes](http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html).
## Getting started
-To install Druid, issue the following commands in your terminal:
+To install Druid, issue the following commands in your terminal:
```bash
curl -O http://static.druid.io/artifacts/releases/druid-0.9.0-bin.tar.gz
@@ -46,7 +46,7 @@ In the package, you should find:
## Start up Zookeeper
-Druid currently has a dependency on [Apache ZooKeeper](http://zookeeper.apache.org/) for distributed coordination. You'll
+Druid currently has a dependency on [Apache ZooKeeper](http://zookeeper.apache.org/) for distributed coordination. You'll
need to download and run Zookeeper.
```bash
@@ -65,8 +65,9 @@ With Zookeeper running, return to the druid-0.9.0 directory. In that directory,
bin/init
```
-Next, you can start up the Druid processes in different terminal windows. This tutorial runs every Druid process on the same system. In production,
-many of these Druid processes can be colocated even in a distributed cluster.
+This will set up some directories for you. Next, you can start up the Druid processes in different terminal windows.
+This tutorial runs every Druid process on the same system. In a large distributed production cluster,
+many of these Druid processes can still be co-located.
```bash
java `cat conf-quickstart/druid/historical/jvm.config | xargs` -cp conf-quickstart/druid/_common:conf-quickstart/druid/historical:lib/* io.druid.cli.Main server historical
@@ -78,7 +79,7 @@ java `cat conf-quickstart/druid/middleManager/jvm.config | xargs` -cp conf-quick
You should see a log message printed out for each service that starts up.
-Later on, if you'd like to stop the services, CTRL-C to exit from the running java processes. If you
+Later on, if you'd like to stop the services, CTRL-C to exit from the running java processes. If you
want a clean start after stopping the services, delete the `var` directory and run the `init` script again.
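Sketched as shell, the clean start looks like this (run from the druid-0.9.0 directory after stopping the services; the `mkdir` line stands in for `bin/init`, whose exact directory layout may differ):

```shell
# Remove all locally persisted state (segments, task logs, and so on).
rm -rf var
# `bin/init` recreates the working directories; illustrative equivalent:
mkdir -p var/tmp var/druid/segments
ls var
```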
Once every service has started, you are now ready to load data.
@@ -87,13 +88,13 @@ Once every service has started, you are now ready to load data.
We've included a sample of Wikipedia edits from September 12, 2015 to get you started.
-```note-info
-This section shows you how to load data in batches, but you can skip ahead to learn how to [load
-streams in real-time](quickstart.html#load-streaming-data). Druid's streaming ingestion can load data
+
+This section shows you how to load data in batches, but you can skip ahead to learn how to [load
+streams in real-time](quickstart.html#load-streaming-data). Druid's streaming ingestion can load data
with virtually no delay between events occurring and being available for queries.
-```
+
-The [dimensions](https://en.wikipedia.org/wiki/Dimension_%28data_warehouse%29) (attributes you can
+The [dimensions](https://en.wikipedia.org/wiki/Dimension_%28data_warehouse%29) (attributes you can
filter and split on) in the Wikipedia dataset, other than time, are:
* channel
@@ -113,7 +114,7 @@ filter and split on) in the Wikipedia dataset, other than time, are:
* regionName
* user
-The [measures](https://en.wikipedia.org/wiki/Measure_%28data_warehouse%29), or *metrics* as they are known in Druid (values you can aggregate)
+The [measures](https://en.wikipedia.org/wiki/Measure_%28data_warehouse%29), or *metrics* as they are known in Druid (values you can aggregate)
in the Wikipedia dataset are:
* count
@@ -122,8 +123,8 @@ in the Wikipedia dataset are:
* delta
* user_unique
-To load this data into Druid, you can submit an *ingestion task* pointing to the file. We've included
-a task that loads the `wikiticker-2015-09-12-sampled.json` file included in the archive. To submit
+To load this data into Druid, you can submit an *ingestion task* pointing to the file. We've included
+a task that loads the `wikiticker-2015-09-12-sampled.json` file included in the archive. To submit
this task, POST it to Druid in a new terminal window from the druid-0.9.0 directory:
```bash
@@ -132,27 +133,27 @@ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/wikiticker-inde
Which will print the ID of the task if the submission was successful:
-```base
+```bash
{"task":"index_hadoop_wikipedia_2013-10-09T21:30:32.802Z"}
```
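If you want to keep that task ID around for later status checks, you can extract it from the response with a bit of shell. This is a sketch using the example response above; the `sed` expression assumes the response has the single-field shape shown (a JSON-aware tool would be more robust):

```shell
# Capture the task ID from the submission response.
RESPONSE='{"task":"index_hadoop_wikipedia_2013-10-09T21:30:32.802Z"}'
TASK_ID=$(echo "$RESPONSE" | sed 's/.*"task":"\([^"]*\)".*/\1/')
echo "$TASK_ID"
```

With the ID in hand, you can also query the Overlord's task status endpoint at `http://localhost:8090/druid/indexer/v1/task/$TASK_ID/status` instead of watching the console.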
-To view the status of your ingestion task, go to your overlord console:
-[http://localhost:8090/console.html](http://localhost:8090/console.html). You can refresh the console periodically, and after
+To view the status of your ingestion task, go to your overlord console:
+[http://localhost:8090/console.html](http://localhost:8090/console.html). You can refresh the console periodically, and after
the task is successful, you should see a "SUCCESS" status for the task.
-After your ingestion task finishes, the data will be loaded by historical nodes and available for
-querying within a minute or two. You can monitor the progress of loading your data in the
-coordinator console, by checking whether there is a datasource "wikiticker" with a blue circle
+After your ingestion task finishes, the data will be loaded by historical nodes and available for
+querying within a minute or two. You can monitor the progress of loading your data in the
+coordinator console, by checking whether there is a datasource "wikiticker" with a blue circle
indicating "fully available": [http://localhost:8081/#/](http://localhost:8081/#/).
-Once the data is fully available, you can immediately query it— to see how, skip to the [Query
+Once the data is fully available, you can immediately query it. To see how, skip to the [Query
data](#query-data) section below. Or, continue to the [Load your own data](#load-your-own-data)
section if you'd like to load a different dataset.
## Load streaming data
-To load streaming data, we are going to push events into Druid
-over a simple HTTP API. We will do this use [Tranquility], a high level data producer
+To load streaming data, we are going to push events into Druid
+over a simple HTTP API. We will do this using [Tranquility], a high-level data producer
library for Druid.
To download Tranquility, issue the following commands in your terminal:
@@ -163,19 +164,19 @@ tar -xzf tranquility-distribution-0.7.2.tgz
cd tranquility-distribution-0.7.2
```
-We've included a configuration file in `conf-quickstart/tranquility/server.json` as part of the Druid distribution
-for a *metrics* datasource. We're going to start the Tranquility server process, which can be used to push events
+We've included a configuration file in `conf-quickstart/tranquility/server.json` as part of the Druid distribution
+for a *metrics* datasource. We're going to start the Tranquility server process, which can be used to push events
directly to Druid.
``` bash
bin/tranquility server -configFile ../druid-0.9.0/conf-quickstart/tranquility/server.json
```
-```note-info
+
This section shows you how to load data using Tranquility Server, but Druid also supports a wide
-variety of [other streaming ingestion options](ingestion-streams.html#stream-push), including from
+variety of [other streaming ingestion options](ingestion-streams.html#stream-push), including from
popular streaming systems like Kafka, Storm, Samza, and Spark Streaming.
-```
+
The [dimensions](https://en.wikipedia.org/wiki/Dimension_%28data_warehouse%29) (attributes you can
filter and split on) for this datasource are flexible. It's configured for *schemaless dimensions*,
@@ -223,17 +224,17 @@ curl -L -H'Content-Type: application/json' -XPOST --data-binary @quickstart/wiki
## Visualizing data
-Druid is ideal for power user-facing analytic applications. There are a number of different open source applications to
-visualize and explore data in Druid. We recommend trying [Pivot](https://github.com/implydata/pivot),
-[Panoramix](https://github.com/mistercrunch/panoramix), or [Metabase](https://github.com/metabase/metabase) to start
+Druid is ideal for powering user-facing analytic applications. There are a number of different open source applications to
+visualize and explore data in Druid. We recommend trying [Pivot](https://github.com/implydata/pivot),
+[Panoramix](https://github.com/mistercrunch/panoramix), or [Metabase](https://github.com/metabase/metabase) to start
visualizing the data you just ingested.
If you installed Pivot, for example, you should be able to view your data in your browser at [localhost:9090](http://localhost:9090).
### SQL and other query libraries
-There are many more query tools for Druid than we've included here, including SQL
-engines, and libraries for various languages like Python and Ruby. Please see [the list of
+There are many more query tools for Druid than we've included here, including SQL
+engines, and libraries for various languages like Python and Ruby. Please see [the list of
libraries](../development/libraries.html) for more information.
## Clustered setup
diff --git a/docs/content/tutorials/tutorial-batch.md b/docs/content/tutorials/tutorial-batch.md
index 2678a082183..d6b432fa566 100644
--- a/docs/content/tutorials/tutorial-batch.md
+++ b/docs/content/tutorials/tutorial-batch.md
@@ -2,25 +2,36 @@
layout: doc_page
---
-## Load your own batch data
+# Tutorial: Load your own batch data
-Before you get started with loading your own batch data, you should have first completed the [quickstart](quickstart.html).
+## Getting started
-You can easily load any timestamped dataset into Druid. For Druid batch loads, the most important
-questions are:
+This tutorial shows you how to load your own data files into Druid.
+
+For this tutorial, we'll assume you've already downloaded Druid as described in
+the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
+don't need to have loaded any data yet.
+
+Once that's complete, you can load your own dataset by writing a custom ingestion spec.
+
+## Writing an ingestion spec
+
+When loading files into Druid, you will use Druid's [batch loading](../ingestion/batch-ingestion.html) process.
+There's an example batch ingestion spec in `quickstart/wikiticker-index.json` that you can modify
+for your own needs.
+
+The most important questions are:
* What should the dataset be called? This is the "dataSource" field of the "dataSchema".
- * Where is the dataset located? The file paths belong in the "paths" of the "inputSpec". If you
+ * Where is the dataset located? The file paths belong in the "paths" of the "inputSpec". If you
want to load multiple files, you can provide them as a comma-separated string.
* Which field should be treated as a timestamp? This belongs in the "column" of the "timestampSpec".
* Which fields should be treated as dimensions? This belongs in the "dimensions" of the "dimensionsSpec".
* Which fields should be treated as metrics? This belongs in the "metricsSpec".
* What time ranges (intervals) are being loaded? This belongs in the "intervals" of the "granularitySpec".
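Put together, those answers land in specific places in the spec. Here is a heavily abbreviated skeleton with placeholder values only (see the bundled `quickstart/wikiticker-index.json` for the full structure, including parser and tuningConfig details omitted here):

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "your-datasource-name",
      "parser": {
        "parseSpec": {
          "timestampSpec": { "column": "your-time-column", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["your-dim-1", "your-dim-2"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": { "intervals": ["2015-09-01/2015-09-02"] }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": { "type": "static", "paths": "your-file-1.json,your-file-2.json" }
    }
  }
}
```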
-```note-info
If your data does not have a natural sense of time, you can tag each row with the current time.
You can also tag all rows with a fixed timestamp, like "2000-01-01T00:00:00.000Z".
-```
Let's use this pageviews dataset as an example. Druid supports TSV, CSV, and JSON out of the box.
Note that nested JSON objects are not supported, so if you do use JSON, you should provide a file
@@ -90,19 +101,39 @@ And modify it by altering these sections:
}
```
-Finally, fire off the task and indexing will proceed!
+## Running the task
+
+To actually run this task, first make sure that the indexing task can read *pageviews.json*:
+
+- If you're running locally (no configuration for connecting to Hadoop; this is the default) then
+place it in the root of the Druid distribution.
+- If you configured Druid to connect to a Hadoop cluster, upload
+the pageviews.json file to HDFS. You may need to adjust the `paths` in the ingestion spec.
+
+To kick off the indexing process, POST your indexing task to the Druid Overlord. In a standard Druid
+install, the URL is `http://OVERLORD_IP:8090/druid/indexer/v1/task`.
```bash
-curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/pageviews-index.json localhost:8090/druid/indexer/v1/task
+curl -X 'POST' -H 'Content-Type:application/json' -d @my-index-task.json OVERLORD_IP:8090/druid/indexer/v1/task
```
-If anything goes wrong with this task (e.g. it finishes with status FAILED), you can troubleshoot
+If you're running everything on a single machine, you can use localhost:
+
+```bash
+curl -X 'POST' -H 'Content-Type:application/json' -d @my-index-task.json localhost:8090/druid/indexer/v1/task
+```
+
+If anything goes wrong with this task (e.g. it finishes with status FAILED), you can troubleshoot
by visiting the "Task log" on the [overlord console](http://localhost:8090/console.html).
-```note-info
-Druid supports a wide variety of data formats, ingestion options, and configurations not
-discussed here. For a full explanation of all available features, see the ingestion sections of the Druid
-documentation.
-```
+## Querying your data
+
+Your data should become fully available within a minute or two. You can monitor this process on
+your Coordinator console at [http://localhost:8081/#/](http://localhost:8081/#/).
+
+Once your data is fully available, you can query it using any of the
+[supported query methods](../querying/querying.html).
+
+## Further reading
For more information on loading batch data, please see [the batch ingestion documentation](../ingestion/batch-ingestion.html).
diff --git a/docs/content/tutorials/tutorial-kafka.md b/docs/content/tutorials/tutorial-kafka.md
index 741f38257b7..eddd927047b 100644
--- a/docs/content/tutorials/tutorial-kafka.md
+++ b/docs/content/tutorials/tutorial-kafka.md
@@ -8,21 +8,21 @@ layout: doc_page
This tutorial shows you how to load data from Kafka into Druid.
-For this tutorial, we'll assume you've already downloaded Druid and Tranquility as described in
-the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
+For this tutorial, we'll assume you've already downloaded Druid and Tranquility as described in
+the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
don't need to have loaded any data yet.
-```note-info
+
This tutorial will show you how to load data from Kafka into Druid, but Druid additionally supports
-a wide variety of batch and streaming loading methods. See the *[Loading files](../ingestion/batch-ingestion.html)*
-and *[Loading streams](../ingestion/stream-ingestion.html)* pages for more information about other options,
+a wide variety of batch and streaming loading methods. See the *[Loading files](../ingestion/batch-ingestion.html)*
+and *[Loading streams](../ingestion/stream-ingestion.html)* pages for more information about other options,
including from Hadoop, HTTP, Storm, Samza, Spark Streaming, and your own JVM apps.
-```
+
## Start Kafka
-[Apache Kafka](http://kafka.apache.org/) is a high throughput message bus that works well with
-Druid. For this tutorial, we will use Kafka 0.9.0.0. To download Kafka, issue the following
+[Apache Kafka](http://kafka.apache.org/) is a high throughput message bus that works well with
+Druid. For this tutorial, we will use Kafka 0.9.0.0. To download Kafka, issue the following
commands in your terminal:
```bash
@@ -45,7 +45,7 @@ Run this command to create a Kafka topic called *metrics*, to which we'll send d
## Enable Druid Kafka ingestion
-Druid includes configs for [Tranquility Kafka](ingestion-streams.md#kafka) to support loading data from Kafka.
+Druid includes configs for [Tranquility Kafka](ingestion-streams.html#kafka) to support loading data from Kafka.
To enable this in the quickstart-based configuration:
- Stop your Tranquility command (CTRL-C) and then start it up again.
@@ -66,25 +66,25 @@ In your Kafka directory, run:
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic metrics
```
-The *kafka-console-producer* command is now awaiting input. Copy the generated example metrics,
-paste them into the *kafka-console-producer* terminal, and press enter. If you like, you can also
+The *kafka-console-producer* command is now awaiting input. Copy the generated example metrics,
+paste them into the *kafka-console-producer* terminal, and press enter. If you like, you can also
paste more messages into the producer, or you can press CTRL-D to exit the console producer.
-You can immediately query this data, or you can skip ahead to the
+You can immediately query this data, or you can skip ahead to the
[Loading your own data](#loading-your-own-data) section if you'd like to load your own dataset.
## Querying your data
-After sending data, you can immediately query it using any of the
+After sending data, you can immediately query it using any of the
[supported query methods](../querying/querying.html).
## Loading your own data
-So far, you've loaded data into Druid from Kafka using an ingestion spec that we've included in the
-distribution. Each ingestion spec is designed to work with a particular dataset. You load your own
+So far, you've loaded data into Druid from Kafka using an ingestion spec that we've included in the
+distribution. Each ingestion spec is designed to work with a particular dataset. You load your own
data types into Druid by writing a custom ingestion spec.
-You can write a custom ingestion spec by starting from the bundled configuration in
+You can write a custom ingestion spec by starting from the bundled configuration in
`conf-quickstart/tranquility/kafka.json` and modifying it for your own needs.
The most important questions are:
@@ -111,7 +111,7 @@ Next, edit `conf-quickstart/tranquility/kafka.json`:
* Let's call the dataset "pageviews-kafka".
* The timestamp is the "time" field.
* Good choices for dimensions are the string fields "url" and "user".
- * Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
+ * Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
sum when we load the data will allow us to compute an average at query time as well.
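In the spec, that pair of measures would look something like this in the `metricsSpec` (a sketch; the `name` values are illustrative):

```json
[
  { "type": "count", "name": "views" },
  { "type": "longSum", "name": "latencyMs", "fieldName": "latencyMs" }
]
```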
You can edit the existing `conf-quickstart/tranquility/kafka.json` file by altering these
@@ -157,7 +157,7 @@ Next, start Druid Kafka ingestion:
bin/tranquility kafka -configFile ../druid-0.9.0-SNAPSHOT/conf-quickstart/tranquility/kafka.json
```
-- If your Tranquility server or Kafka is already running, stop it (CTRL-C) and
+- If your Tranquility server or Kafka is already running, stop it (CTRL-C) and
start it up again.
Finally, send some data to the Kafka topic. Let's start with these messages:
@@ -168,23 +168,23 @@ Finally, send some data to the Kafka topic. Let's start with these messages:
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
```
-Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
-[windowPeriod](../ingestion/stream-ingestion.html#segmentgranularity-and-windowperiod) value), so you should
-replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
+Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
+[windowPeriod](../ingestion/stream-ingestion.html#segmentgranularity-and-windowperiod) value), so you should
+replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
get this by running:
```bash
python -c 'import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))'
```
-Update the timestamps in the JSON above, then copy and paste these messages into this console
+Update the timestamps in the JSON above, then copy and paste these messages into this console
producer and press enter:
```bash
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic pageviews
```
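Rather than editing the timestamp by hand each time, you can generate a freshly stamped message to paste into the producer. This sketch uses the last example record shown above; `date -u` supports this format on both GNU and BSD systems:

```shell
# Build one pageview event stamped with the current UTC time.
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo "{\"time\": \"$NOW\", \"url\": \"/foo/bar\", \"user\": \"bob\", \"latencyMs\": 45}"
```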
-That's it, your data should now be in Druid. You can immediately query it using any of the
+That's it, your data should now be in Druid. You can immediately query it using any of the
[supported query methods](../querying/querying.html).
## Further reading
diff --git a/docs/content/tutorials/tutorial-streams.md b/docs/content/tutorials/tutorial-streams.md
index f935c7a82c1..6375dd828b9 100644
--- a/docs/content/tutorials/tutorial-streams.md
+++ b/docs/content/tutorials/tutorial-streams.md
@@ -2,33 +2,33 @@
layout: doc_page
---
-## Load your own streaming data
+# Tutorial: Load your own streaming data
## Getting started
This tutorial shows you how to load your own streams into Druid.
-For this tutorial, we'll assume you've already downloaded Druid and Tranquility as described in
-the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
+For this tutorial, we'll assume you've already downloaded Druid and Tranquility as described in
+the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
don't need to have loaded any data yet.
Once that's complete, you can load your own dataset by writing a custom ingestion spec.
## Writing an ingestion spec
-When loading streams into Druid, we recommend using the [stream push](../ingestion/stream-push.html)
-process. In this tutorial we'll be using [Tranquility Server](../ingestion/stream-ingestion.html#server) to push
+When loading streams into Druid, we recommend using the [stream push](../ingestion/stream-push.html)
+process. In this tutorial we'll be using [Tranquility Server](../ingestion/stream-ingestion.html#server) to push
data into Druid over HTTP.
-```note-info
-This tutorial will show you how to push streams to Druid using HTTP, but Druid additionally supports
-a wide variety of batch and streaming loading methods. See the *[Loading files](batch-ingestion.html)*
-and *[Loading streams](stream-ingestion.html)* pages for more information about other options,
+
+This tutorial will show you how to push streams to Druid using HTTP, but Druid additionally supports
+a wide variety of batch and streaming loading methods. See the *[Loading files](batch-ingestion.html)*
+and *[Loading streams](stream-ingestion.html)* pages for more information about other options,
including from Hadoop, Kafka, Storm, Samza, Spark Streaming, and your own JVM apps.
-```
+
-You can prepare for loading a new dataset over HTTP by writing a custom Tranquility Server
-configuration. The bundled configuration is in `conf-quickstart/tranquility/server.json`, which
+You can prepare for loading a new dataset over HTTP by writing a custom Tranquility Server
+configuration. The bundled configuration is in `conf-quickstart/tranquility/server.json`, which
you can modify for your own needs.
The most important questions are:
@@ -49,10 +49,10 @@ So the answers to the questions above are:
* Let's call the dataset "pageviews".
* The timestamp is the "time" field.
* Good choices for dimensions are the string fields "url" and "user".
- * Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
+ * Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
sum when we load the data will allow us to compute an average at query time as well.
-Now, edit the existing `conf-quickstart/tranquility/server.json` file by altering these
+Now, edit the existing `conf-quickstart/tranquility/server.json` file by altering these
sections:
1. Change the key `"metrics"` under `"dataSources"` to `"pageviews"`
@@ -95,16 +95,16 @@ Let's send some data! We'll start with these three records:
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
```
-Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
-[windowPeriod](ingestion-streams.html#segmentgranularity-and-windowperiod) value), so you should
-replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
+Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
+[windowPeriod](ingestion-streams.html#segmentgranularity-and-windowperiod) value), so you should
+replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
get this by running:
```bash
python -c 'import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))'
```
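As a scripted alternative, you can rewrite the placeholder timestamps in a saved file in one step. This sketch first writes the last example record shown above to `pageviews.json` (so it is self-contained), then stamps it with the current UTC time; `date -u` produces the same format as the Python one-liner:

```shell
# Stamp the saved messages with the current UTC time, in place.
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
printf '%s\n' '{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}' > pageviews.json
sed "s/2000-01-01T00:00:00Z/$NOW/g" pageviews.json > pageviews.tmp && mv pageviews.tmp pageviews.json
cat pageviews.json
```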
-Update the timestamps in the JSON above, and save it to a file named `pageviews.json`. Then send
+Update the timestamps in the JSON above, and save it to a file named `pageviews.json`. Then send
it to Druid by running:
```bash
@@ -117,16 +117,16 @@ This will print something like:
{"result":{"received":3,"sent":3}}
```
-This indicates that the HTTP server received 3 events from you, and sent 3 to Druid. Note that
-this may take a few seconds to finish the first time you run it, as Druid resources must be
+This indicates that the HTTP server received 3 events from you, and sent 3 to Druid. Note that
+this may take a few seconds to finish the first time you run it, as Druid resources must be
allocated to the ingestion task. Subsequent POSTs should complete quickly.
-If you see `"sent":0` this likely means that your timestamps are not recent enough. Try adjusting
+If you see `"sent":0` this likely means that your timestamps are not recent enough. Try adjusting
your timestamps and re-sending your data.
## Querying your data
-After sending data, you can immediately query it using any of the
+After sending data, you can immediately query it using any of the
[supported query methods](../querying/querying.html).
## Further reading