<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

---
layout: doc_page
---

## Stream Push

Druid can connect to any streaming data source through
[Tranquility](https://github.com/druid-io/tranquility/blob/master/README.md), a package for pushing
streams to Druid in real-time. Druid does not come bundled with Tranquility, and you will have to download the distribution.

<div class="note info">
If you've never loaded streaming data into Druid with Tranquility before, we recommend trying out the
<a href="../tutorials/tutorial-tranquility.html">stream loading tutorial</a> first and then coming back to this page.
</div>

Note that with all streaming ingestion options, you must ensure that incoming data is recent
enough (within a [configurable windowPeriod](#segmentgranularity-and-windowperiod) of the current
time). Older messages will not be processed in real-time. Historical data is best processed with
[batch ingestion](../ingestion/batch-ingestion.html).

### Server

Druid can use [Tranquility Server](https://github.com/druid-io/tranquility/blob/master/docs/server.md), which
lets you send data to Druid without developing a JVM app. You can run Tranquility Server colocated with Druid middleManagers
and historical processes.

To start Tranquility Server, run:

```bash
bin/tranquility server -configFile <path_to_config_file>/server.json
```

To customize Tranquility Server:

- In `server.json`, customize the `properties` and `dataSources`.
- If you have servers already running Tranquility, stop them (CTRL-C) and start
them up again.

For tips on customizing `server.json`, see the
*[Writing an ingestion spec](../tutorials/tutorial-ingestion-spec.html)* tutorial and the
[Tranquility Server documentation](https://github.com/druid-io/tranquility/blob/master/docs/server.md).
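
As a rough illustration of sending data without a JVM app, the sketch below builds an HTTP POST request for a running Tranquility Server. The host, the port `8200`, the `/v1/post/<dataSource>` path, the newline-delimited JSON body, and the `pageviews` dataSource are all assumptions made for this example; check the Tranquility Server documentation for the endpoint and formats your version actually supports.

```python
import json
from urllib.request import Request

def build_post_request(host, port, datasource, events):
    """Build (but do not send) a POST request carrying newline-delimited JSON events.

    The URL layout is an assumption for illustration, not taken from this page.
    """
    body = "\n".join(json.dumps(event) for event in events).encode("utf-8")
    url = f"http://{host}:{port}/v1/post/{datasource}"
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical events for a hypothetical "pageviews" dataSource.
events = [
    {"time": "2018-11-13T12:00:00Z", "url": "/foo", "user": "alice"},
    {"time": "2018-11-13T12:00:05Z", "url": "/bar", "user": "bob"},
]
request = build_post_request("localhost", 8200, "pageviews", events)
print(request.get_method(), request.full_url)
```

To actually send the request, pass it to `urllib.request.urlopen(request)` while Tranquility Server is running.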

### JVM apps and stream processors

Tranquility can also be embedded in JVM-based applications as a library. You can do this directly
in your own program using the
[Core API](https://github.com/druid-io/tranquility/blob/master/docs/core.md), or you can use
the connectors bundled in Tranquility for popular JVM-based stream processors such as
[Storm](https://github.com/druid-io/tranquility/blob/master/docs/storm.md),
[Samza](https://github.com/druid-io/tranquility/blob/master/docs/samza.md),
[Spark Streaming](https://github.com/druid-io/tranquility/blob/master/docs/spark.md), and
[Flink](https://github.com/druid-io/tranquility/blob/master/docs/flink.md).

### Kafka (Deprecated)

<div class="note info">
NOTE: Tranquility Kafka is deprecated. Please use the <a href="../development/extensions-core/kafka-ingestion.html">Kafka Indexing Service</a> to load data from Kafka instead. 
</div>


[Tranquility Kafka](https://github.com/druid-io/tranquility/blob/master/docs/kafka.md)
lets you load data from Kafka into Druid without writing any code. You only need a configuration
file.

To start Tranquility Kafka, run:

```bash
bin/tranquility kafka -configFile <path_to_config_file>/kafka.json
```

To customize Tranquility Kafka in the single-machine quickstart configuration:

- In `kafka.json`, customize the `properties` and `dataSources`.
- If you have Tranquility already running, stop it (CTRL-C) and start it up again.

For tips on customizing `kafka.json`, see the
[Tranquility Kafka documentation](https://github.com/druid-io/tranquility/blob/master/docs/kafka.md).


## Concepts

### Task creation

Tranquility automates creation of Druid realtime indexing tasks, handling partitioning, replication,
service discovery, and schema rollover for you, seamlessly and without downtime. You never have to
write code to deal with individual tasks directly. However, it can be helpful to understand how
Tranquility creates tasks.

Tranquility spawns relatively short-lived tasks periodically, and each one handles a small number of
[Druid segments](../design/segments.html). Tranquility coordinates all task
creation through ZooKeeper. You can start up as many Tranquility instances as you like with the same
configuration, even on different machines, and they will send to the same set of tasks.

See the [Tranquility overview](https://github.com/druid-io/tranquility/blob/master/docs/overview.md)
for more details about how Tranquility manages tasks.

### segmentGranularity and windowPeriod

The segmentGranularity is the time period covered by the segments produced by each task. For
example, a segmentGranularity of "hour" will spawn tasks that create segments covering one hour
each.

The windowPeriod is the slack time permitted for events. For example, a windowPeriod of ten minutes
(the default) means that any events with a timestamp older than ten minutes in the past, or more
than ten minutes in the future, will be dropped.
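
The windowPeriod rule can be modeled in a few lines. This is a simplified sketch of the rule as stated above, not Tranquility's actual implementation; the function name `within_window` and the decision to accept events exactly on the boundary are assumptions.

```python
from datetime import datetime, timedelta, timezone

def within_window(event_time: datetime, now: datetime, window_period: timedelta) -> bool:
    """Return True if event_time is no more than window_period before or after now."""
    return abs(now - event_time) <= window_period

now = datetime(2018, 11, 13, 12, 0, tzinfo=timezone.utc)
window = timedelta(minutes=10)

print(within_window(now - timedelta(minutes=5), now, window))   # True: recent enough
print(within_window(now - timedelta(minutes=11), now, window))  # False: too old
print(within_window(now + timedelta(minutes=11), now, window))  # False: too far in the future
```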

These are important configurations because they influence how long tasks stay alive, and how
long data stays in the realtime system before being handed off to the historical nodes. For example,
if your configuration has segmentGranularity "hour" and windowPeriod ten minutes, tasks will stay
around listening for events for an hour and ten minutes. For this reason, to prevent excessive
buildup of tasks, it is recommended that your windowPeriod be less than your segmentGranularity.
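
The arithmetic behind this recommendation can be sketched as follows. The model assumes each task listens from the start of its segment interval until the interval end plus the windowPeriod; the function names are hypothetical and the overlap count is only an estimate under that assumption.

```python
import math
from datetime import timedelta

def task_listen_duration(segment_granularity: timedelta, window_period: timedelta) -> timedelta:
    """How long one task keeps listening for events under the simplified model."""
    return segment_granularity + window_period

def max_concurrent_tasks(segment_granularity: timedelta, window_period: timedelta) -> int:
    """Upper bound on tasks listening at once (per partition and replica).

    One task covers the current interval; each earlier task lingers for
    window_period after its interval ends.
    """
    return 1 + math.ceil(window_period / segment_granularity)

hour = timedelta(hours=1)

print(task_listen_duration(hour, timedelta(minutes=10)))  # 1:10:00
print(max_concurrent_tasks(hour, timedelta(minutes=10)))  # 2
print(max_concurrent_tasks(hour, timedelta(hours=3)))     # 4
```

With a windowPeriod shorter than the segmentGranularity, at most two tasks overlap in this model; a windowPeriod of three hours with hourly segments keeps up to four alive at once.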

### Append only

Druid streaming ingestion is *append-only*, meaning you cannot use streaming ingestion to update or
delete individual records after they are inserted. If you need to update or delete individual
records, you need to use a batch reindexing process. See the *[batch ingest](batch-ingestion.html)*
page for more details.

Druid does support efficient deletion of entire time ranges without resorting to batch reindexing.
You can do this automatically by setting up retention policies.

### Guarantees

Tranquility operates under a best-effort design. It tries reasonably hard to preserve your data, by allowing you to set
up replicas and by retrying failed pushes for a period of time, but it does not guarantee that your events will be
processed exactly once. In some conditions, it can drop or duplicate events:

- Events with timestamps outside your configured windowPeriod will be dropped.
- If you suffer more Druid Middle Manager failures than your configured replicas count, some
partially indexed data may be lost.
- If there is a persistent issue that prevents communication with the Druid indexing service, and
retry policies are exhausted during that period, or the period lasts longer than your windowPeriod,
some events will be dropped.
- If there is an issue that prevents Tranquility from receiving an acknowledgement from the indexing
service, it will retry the batch, which can lead to duplicated events.
- If you are using Tranquility inside Storm or Samza, various parts of both architectures have an
at-least-once design and can lead to duplicated events.
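
The duplication case caused by a lost acknowledgement can be illustrated with a toy model. `FlakyIndexer` and `send_with_retry` are hypothetical names invented for this sketch; they do not correspond to Tranquility classes, and the retry policy shown is deliberately minimal.

```python
class FlakyIndexer:
    """Toy receiver that stores every batch it gets but loses some acknowledgements."""

    def __init__(self, acks_to_lose: int):
        self.stored = []
        self.acks_to_lose = acks_to_lose

    def push(self, batch) -> bool:
        self.stored.extend(batch)  # the data is written before the ack is sent
        if self.acks_to_lose > 0:
            self.acks_to_lose -= 1
            return False  # the ack never reaches the sender
        return True

def send_with_retry(indexer: FlakyIndexer, batch, max_attempts: int = 3) -> int:
    """Resend the batch until it is acknowledged; return the number of attempts."""
    for attempt in range(1, max_attempts + 1):
        if indexer.push(batch):
            return attempt
    raise RuntimeError("retries exhausted; the sender gives up on this batch")

indexer = FlakyIndexer(acks_to_lose=1)
attempts = send_with_retry(indexer, ["e1", "e2"])
print(attempts)        # 2
print(indexer.stored)  # ['e1', 'e2', 'e1', 'e2']
```

The sender cannot tell a lost acknowledgement from a lost batch, so retrying is the safe choice for delivery but results in each event being stored twice here.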

Under normal operation, these risks are minimal. But if you need absolute 100% fidelity for
historical data, we recommend a hybrid batch/streaming architecture, described below.

### Hybrid Batch/Streaming

You can combine batch and streaming methods in a hybrid batch/streaming architecture. In a hybrid architecture, you use a streaming method to do initial ingestion, and then periodically re-ingest older data in batch mode (typically every few hours, or nightly). When Druid re-ingests data for a time range, the new data automatically replaces the data from the earlier ingestion.

All streaming ingestion methods currently supported by Druid do introduce the possibility of dropped or duplicated messages in certain failure scenarios, and batch re-ingestion eliminates this potential source of error for historical data.

Batch re-ingestion also gives you the option to re-ingest your data if you need to revise it for any reason.
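
The replace-on-re-ingest behavior described in this section can be pictured with a small in-memory model. The `Datasource` class below is a hypothetical stand-in, not a Druid API: streaming appends rows to a time interval, while batch re-ingestion swaps out everything stored for that interval.

```python
from collections import defaultdict

class Datasource:
    """Toy model: rows grouped by the time interval they belong to."""

    def __init__(self):
        self.rows_by_interval = defaultdict(list)

    def stream_append(self, interval: str, row: str) -> None:
        # Streaming ingestion is append-only.
        self.rows_by_interval[interval].append(row)

    def batch_reingest(self, interval: str, rows) -> None:
        # Batch re-ingestion replaces the earlier data for the whole interval.
        self.rows_by_interval[interval] = list(rows)

ds = Datasource()
hour = "2018-11-13T12:00/2018-11-13T13:00"

# Streaming pass: "b" was duplicated by a retry and "c" was dropped.
for row in ["a", "b", "b"]:
    ds.stream_append(hour, row)
print(ds.rows_by_interval[hour])  # ['a', 'b', 'b']

# Later batch pass over the source of truth for the same interval.
ds.batch_reingest(hour, ["a", "b", "c"])
print(ds.rows_by_interval[hour])  # ['a', 'b', 'c']
```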

### Deployment Notes

Stream ingestion may generate a large number of small segments because it's difficult to optimize the segment size at
ingestion time. The number of segments will increase over time, and this might cause query performance issues.

For details on how to optimize the segment size, see [Segment size optimization](../operations/segment-optimization.html).

## Documentation

Tranquility documentation can be found [here](https://github.com/druid-io/tranquility/blob/master/README.md).

## Configuration

Tranquility configuration can be found [here](https://github.com/druid-io/tranquility/blob/master/docs/configuration.md).

Tranquility's tuningConfig can be found [here](http://static.druid.io/tranquility/api/latest/#com.metamx.tranquility.druid.DruidTuning).