druid/docs/content/tutorials/tutorial-loading-streaming-...

---
layout: doc_page
---

# Loading Streaming Data

In our [last tutorial](../tutorials/tutorial-the-druid-cluster.html), we set up a
complete Druid cluster. We created all the Druid dependencies and ingested
streaming data. In this tutorial, we will expand upon what we've done in the
first two tutorials.

## About the Data

We will be working with the same Wikipedia edits data schema [from our previous
tutorials](tutorial-a-first-look-at-druid.html#about-the-data).

## Set Up

At this point, you should already have Druid downloaded and be comfortable
running a Druid cluster locally. If not, [have a look at our second
tutorial](../tutorials/tutorial-the-druid-cluster.html). If Zookeeper and MySQL are not
running, you will have to start them as described in [The Druid
Cluster](../tutorials/tutorial-the-druid-cluster.html).

With real-world data, we recommend having a message bus such as [Apache
Kafka](http://kafka.apache.org/) sit between the data stream and the real-time
node. The message bus provides higher availability for production environments.
[Firehoses](../ingestion/firehose.html) are the key abstraction for real-time ingestion.

### Kafka

Druid communicates with Kafka using the
[KafkaFirehoseFactory](../ingestion/firehose.html). Using this [Firehose](../ingestion/firehose.html)
with the right configuration, we can import data into Druid in real-time
without writing any code. To load data to a real-time node via Kafka, we'll
first need to initialize Zookeeper and Kafka, and then configure and initialize
a [Realtime](../design/realtime.html) node.

The following quick-start instructions for booting a Zookeeper and then Kafka
cluster were adapted from the [Apache Kafka quickstart guide](http://kafka.apache.org/documentation.html#quickstart).

1. Download Kafka

    For this tutorial we will [download Kafka 0.8.2.1]
    (https://www.apache.org/dyn/closer.cgi?path=/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz)

    ```bash
    tar -xzf kafka_2.10-0.8.2.1.tgz
    cd kafka_2.10-0.8.2.1
    ```

1. Start Kafka

    First launch ZooKeeper:

    ```bash
    ./bin/zookeeper-server-start.sh config/zookeeper.properties
    ```

    Then start the Kafka server (in a separate console):

    ```bash
    ./bin/kafka-server-start.sh config/server.properties
    ```

1. Create a topic named `wikipedia`

    ```bash
    ./bin/kafka-topics.sh --create --zookeeper localhost:2181 \
      --replication-factor 1 --partitions 1 --topic wikipedia
    ```

1. Launch a console producer for that topic (so we can paste in kafka
   messages in a bit)

    ```bash
    ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia
    ```

### Druid Realtime Node

The realtime spec for the data source in this tutorial is available under
`examples/indexing/wikipedia.spec` from the [Druid
download](http://static.druid.io/artifacts/releases/druid-services-0.7.1-bin.tar.gz)

1. Launch the realtime node

    ```bash
    java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8         \
         -Ddruid.realtime.specFile=examples/indexing/wikipedia.spec \
         -classpath "config/_common:config/realtime:lib/*"          \
         io.druid.cli.Main server realtime
    ```

1. Copy and paste the following data into the terminal where we launched
   the Kafka console producer above.

    ```json
    {"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
    {"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
    {"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
    {"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
    {"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
    ```

    **Note:** This config uses a [`messageTime` rejection policy](../design/plumber.html)
    which will accept all events and hand off as long as there is a continuous
    stream of events. In this particular example, hand-off will not actually
    occur because we only have a handful of events.

    Disclaimer: We recognize the timestamps of these events aren't actually recent.

1. Watch the events getting ingested and the real-time node announcing a data
   segment

    ```
    ...
    2015-02-17T23:01:50,220 INFO [chief-wikipedia] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_2013-08-31T00:00:00.000Z] at path[/druid/segments/localhost:8084/2015-02-17T23:01:50.219Z0]
    ...
    ```

1. Issue a query

    Issuing a [TimeBoundaryQuery](../querying/timeboundaryquery.html) to the real-time node
    should return some results:

    ```bash
    curl -XPOST -H'Content-type: application/json' \
      "http://localhost:8084/druid/v2/?pretty" \
      -d'{"queryType":"timeBoundary","dataSource":"wikipedia"}'
    ```

    ```json
    [ {
      "timestamp" : "2013-08-31T01:02:33.000Z",
      "result" : {
        "minTime" : "2013-08-31T01:02:33.000Z",
        "maxTime" : "2013-08-31T12:41:27.000Z"
      }
    } ]
    ```

## Advanced Streaming Ingestion

Druid offers an additional method of ingesting streaming data via the indexing service. You may be wondering why a second method is needed. Standalone real-time nodes are sufficient for certain volumes of data and availability tolerances. They pull data from a message queue like Kafka or Rabbit, index data locally, and periodically finalize segments for handoff to historical nodes. They are fairly straightforward to scale, simply taking advantage of the innate scalability of the backing message queue. But they are difficult to make highly available with Kafka, the most popular supported message queue, because its high-level consumer doesn’t provide a way to scale out two replicated consumer groups such that each one gets the same data in the same shard. They also become difficult to manage once you have a lot of them, since every machine needs a unique configuration.

Druid solved the availability problem by switching from a pull-based model to a push-based model; rather than Druid indexers pulling data from Kafka, another process pulls data and pushes the data into Druid. Since with the push based model, we can ensure that the same data makes it into the same shard, we can replicate data. The [indexing service](../design/indexing-service.html) encapsulates this functionality, where a task-and-resources model replaces a standalone machine model. In addition to simplifying machine configuration, the model also allows nodes to run in the cloud with an elastic number of machines. If you are interested in this form of real-time ingestion, please check out the client library [Tranquility](https://github.com/druid-io/tranquility).

Additional Information
----------------------

Getting data into Druid can definitely be difficult for first time users. Please don't hesitate to ask questions in our IRC channel or on our [google groups page](https://groups.google.com/forum/#!forum/druid-user).
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								---
 								layout: doc_page
 								---
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
+								# Loading Streaming Data
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
-												renaming all *.md filenames to only have lowercase and dashes
so that they are editable on case-insensitive os as well

											
										
										
											2015-05-05 17:07:32 -04:00
+								In our [last tutorial](../tutorials/tutorial-the-druid-cluster.html), we set up a
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
+								complete Druid cluster. We created all the Druid dependencies and ingested
 								streaming data. In this tutorial, we will expand upon what we've done in the
 								first two tutorials.
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
+								## About the Data
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
-												Rework the druid docs

											
										
										
											2015-03-26 13:44:11 -04:00
+								We will be working with the same Wikipedia edits data schema [from our previous
-												renaming all *.md filenames to only have lowercase and dashes
so that they are editable on case-insensitive os as well

											
										
										
											2015-05-05 17:07:32 -04:00
+								tutorials](tutorial-a-first-look-at-druid.html#about-the-data).
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
+								## Set Up
 								At this point, you should already have Druid downloaded and be comfortable
 								running a Druid cluster locally. If not, [have a look at our second
-												renaming all *.md filenames to only have lowercase and dashes
so that they are editable on case-insensitive os as well

											
										
										
											2015-05-05 17:07:32 -04:00
+								tutorial](../tutorials/tutorial-the-druid-cluster.html). If Zookeeper and MySQL are not
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
+								running, you will have to start them as described in [The Druid
-												renaming all *.md filenames to only have lowercase and dashes
so that they are editable on case-insensitive os as well

											
										
										
											2015-05-05 17:07:32 -04:00
+								Cluster](../tutorials/tutorial-the-druid-cluster.html).
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
 								With real-world data, we recommend having a message bus such as [Apache
 								Kafka](http://kafka.apache.org/) sit between the data stream and the real-time
 								node. The message bus provides higher availability for production environments.
-												renaming all *.md filenames to only have lowercase and dashes
so that they are editable on case-insensitive os as well

											
										
										
											2015-05-05 17:07:32 -04:00
+								[Firehoses](../ingestion/firehose.html) are the key abstraction for real-time ingestion.
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
 								### Kafka
 								Druid communicates with Kafka using the
-												renaming all *.md filenames to only have lowercase and dashes
so that they are editable on case-insensitive os as well

											
										
										
											2015-05-05 17:07:32 -04:00
+								[KafkaFirehoseFactory](../ingestion/firehose.html). Using this [Firehose](../ingestion/firehose.html)
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
+								with the right configuration, we can import data into Druid in real-time
 								without writing any code. To load data to a real-time node via Kafka, we'll
 								first need to initialize Zookeeper and Kafka, and then configure and initialize
-												renaming all *.md filenames to only have lowercase and dashes
so that they are editable on case-insensitive os as well

											
										
										
											2015-05-05 17:07:32 -04:00
+								a [Realtime](../design/realtime.html) node.
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
 								The following quick-start instructions for booting a Zookeeper and then Kafka
 								cluster were adapted from the [Apache Kafka quickstart guide](http://kafka.apache.org/documentation.html#quickstart).
 . Download Kafka
-												update to kafka 0.8.2.1, because it's better™

											
										
										
											2015-03-12 12:54:52 -04:00
+								    For this tutorial we will [download Kafka 0.8.2.1]
 								    (https://www.apache.org/dyn/closer.cgi?path=/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz)
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
 								    ```bash
-												update to kafka 0.8.2.1, because it's better™

											
										
										
											2015-03-12 12:54:52 -04:00
+								    tar -xzf kafka_2.10-0.8.2.1.tgz
 								    cd kafka_2.10-0.8.2.1
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
+								    ```
 . Start Kafka
 								    First launch ZooKeeper:
 								    ```bash
 								    ./bin/zookeeper-server-start.sh config/zookeeper.properties
 								    ```
 								    Then start the Kafka server (in a separate console):
 								    ```bash
 								    ./bin/kafka-server-start.sh config/server.properties
 								    ```
 . Create a topic named `wikipedia`
 								    ```bash
 								    ./bin/kafka-topics.sh --create --zookeeper localhost:2181 \
 								      --replication-factor 1 --partitions 1 --topic wikipedia
 								    ```
 . Launch a console producer for that topic (so we can paste in kafka
 								   messages in a bit)
 								    ```bash
 								    ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia
 								    ```
 								### Druid Realtime Node
 								The realtime spec for the data source in this tutorial is available under
 								`examples/indexing/wikipedia.spec` from the [Druid
-												update versions to prepare for rc release

											
										
										
											2015-03-19 14:39:38 -04:00
+								download](http://static.druid.io/artifacts/releases/druid-services-0.7.1-bin.tar.gz)
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
 . Launch the realtime node
 								    ```bash
 								    java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8         \
 								         -Ddruid.realtime.specFile=examples/indexing/wikipedia.spec \
 								         -classpath "config/_common:config/realtime:lib/*"          \
 								         io.druid.cli.Main server realtime
 								    ```
 . Copy and paste the following data into the terminal where we launched
 								   the Kafka console producer above.
 								    ```json
 								    {"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
 								    {"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
 								    {"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
 								    {"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
 								    {"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
 								    ```
-												renaming all *.md filenames to only have lowercase and dashes
so that they are editable on case-insensitive os as well

											
										
										
											2015-05-05 17:07:32 -04:00
+								    **Note:** This config uses a [`messageTime` rejection policy](../design/plumber.html)
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
+								    which will accept all events and hand off as long as there is a continuous
 								    stream of events. In this particular example, hand-off will not actually
 								    occur because we only have a handful of events.
 								    Disclaimer: We recognize the timestamps of these events aren't actually recent.
 . Watch the events getting ingested and the real-time node announcing a data
 								   segment
 								    ```
 								    ...
-												Use default ports in examples

											
										
										
											2015-02-18 14:46:27 -05:00
+-02-17T23:01:50,220 INFO [chief-wikipedia] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_2013-08-31T00:00:00.000Z] at path[/druid/segments/localhost:8084/2015-02-17T23:01:50.219Z0]
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
+								    ...
 								    ```
 . Issue a query
-												renaming all *.md filenames to only have lowercase and dashes
so that they are editable on case-insensitive os as well

											
										
										
											2015-05-05 17:07:32 -04:00
+								    Issuing a [TimeBoundaryQuery](../querying/timeboundaryquery.html) to the real-time node
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
+								    should return some results:
 								    ```bash
 								    curl -XPOST -H'Content-type: application/json' \
-												Use default ports in examples

											
										
										
											2015-02-18 14:46:27 -05:00
+								      "http://localhost:8084/druid/v2/?pretty" \
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
+								      -d'{"queryType":"timeBoundary","dataSource":"wikipedia"}'
 								    ```
 								    ```json
 								    [ {
 								      "timestamp" : "2013-08-31T01:02:33.000Z",
 								      "result" : {
 								        "minTime" : "2013-08-31T01:02:33.000Z",
 								        "maxTime" : "2013-08-31T12:41:27.000Z"
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								      }
-												update example + tutorial for kafka 0.8

											
										
										
											2015-02-17 23:26:58 -05:00
+								    } ]
 								    ```
 								## Advanced Streaming Ingestion
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
-												moar doc fixes

											
										
										
											2015-01-22 00:59:16 -05:00
+								Druid offers an additional method of ingesting streaming data via the indexing service. You may be wondering why a second method is needed. Standalone real-time nodes are sufficient for certain volumes of data and availability tolerances. They pull data from a message queue like Kafka or Rabbit, index data locally, and periodically finalize segments for handoff to historical nodes. They are fairly straightforward to scale, simply taking advantage of the innate scalability of the backing message queue. But they are difficult to make highly available with Kafka, the most popular supported message queue, because its high-level consumer doesn’t provide a way to scale out two replicated consumer groups such that each one gets the same data in the same shard. They also become difficult to manage once you have a lot of them, since every machine needs a unique configuration.
-												Change tranquility links

											
										
										
											2015-05-31 13:59:38 -04:00
+								Druid solved the availability problem by switching from a pull-based model to a push-based model; rather than Druid indexers pulling data from Kafka, another process pulls data and pushes the data into Druid. Since with the push based model, we can ensure that the same data makes it into the same shard, we can replicate data. The [indexing service](../design/indexing-service.html) encapsulates this functionality, where a task-and-resources model replaces a standalone machine model. In addition to simplifying machine configuration, the model also allows nodes to run in the cloud with an elastic number of machines. If you are interested in this form of real-time ingestion, please check out the client library [Tranquility](https://github.com/druid-io/tranquility).
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
 								Additional Information
 								----------------------
-												rework the druid docs and fix many mistakes

											
										
										
											2015-03-09 19:14:52 -04:00
+								Getting data into Druid can definitely be difficult for first time users. Please don't hesitate to ask questions in our IRC channel or on our [google groups page](https://groups.google.com/forum/#!forum/druid-user).
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00