---
layout: doc_page
---
# Loading Data
## Choosing an ingestion method
Druid supports streaming (real-time) and file-based (batch) ingestion methods. The most
popular configurations are:
- [Files](../ingestion/batch-ingestion.html) - Load data from HDFS, S3, local files, or any supported Hadoop
filesystem in batches. We recommend this method if your dataset is already in flat files (see the sketch after this list).
- [Stream push](../ingestion/stream-ingestion.html#stream-push) - Push a data stream into Druid in real-time
using [Tranquility](http://github.com/druid-io/tranquility), a client library for sending streams
to Druid. We recommend this method if your dataset originates in a streaming system like Kafka,
Storm, Spark Streaming, or your own system.
- [Stream pull](../ingestion/stream-ingestion.html#stream-pull) - Pull a data stream directly from an external
data source into Druid using Realtime Nodes.
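As a rough illustration of the file-based method above, the sketch below writes a minimal ingestion task spec and submits it to the Overlord's task endpoint. It is a sketch under assumptions rather than a complete recipe: the datasource name (`pageviews`), column names, interval, file path, and `OVERLORD_IP:8090` are all placeholders to replace with your own values, and the [batch ingestion documentation](../ingestion/batch-ingestion.html) describes the full spec format.

```bash
# Minimal sketch of a file-based (batch) load using a local "index" task.
# The datasource name, columns, interval, and file path are placeholders.
cat > my-index-task.json <<'EOF'
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "pageviews",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "time", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["url", "user"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "views" } ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-01-01/2016-01-02"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/path/to/data",
        "filter": "pageviews.json"
      }
    }
  }
}
EOF

# Submit the task to the Overlord, which runs it and reports a task id.
curl -X POST -H 'Content-Type: application/json' \
  -d @my-index-task.json \
  http://OVERLORD_IP:8090/druid/indexer/v1/task
```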
## Getting started
The easiest ways to get started with loading your own data are the three included tutorials.
- [Files-based tutorial](tutorial-batch.html) showing you how to load files from your local disk.
- [Streams-based tutorial](tutorial-streams.html) showing you how to push data over HTTP (a sketch of such a push follows this list).
- [Kafka-based tutorial](tutorial-kafka.html) showing you how to load data from Kafka.
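To give a feel for the stream-push path before you start the streams tutorial, here is a hedged sketch of pushing a single JSON event over HTTP to a Tranquility Server endpoint. The host, port, datasource name (`pageviews`), and event fields are placeholder assumptions; setting up the server itself is covered in the tutorial.

```bash
# Sketch of pushing one event over HTTP to a Tranquility Server endpoint.
# "pageviews", the host/port, and the event fields are placeholders.
curl -X POST -H 'Content-Type: application/json' \
  -d '{"time": "2016-06-01T12:00:00Z", "url": "/index.html", "user": "alice"}' \
  http://TRANQUILITY_IP:8200/v1/post/pageviews
```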
## Hybrid batch/streaming
You can combine batch and streaming methods in a hybrid batch/streaming architecture. In a hybrid architecture,
you use a streaming method to do initial ingestion, and then periodically re-ingest older data in batch mode
(typically every few hours, or nightly). When Druid re-ingests data for a time range, the new data automatically
replaces the data from the earlier ingestion.

All streaming ingestion methods currently supported by Druid introduce the possibility of dropped or duplicated
messages in certain failure scenarios, and batch re-ingestion eliminates this potential source of error for
historical data.

Batch re-ingestion also gives you the option to re-ingest your data if you need to revise it for any reason.
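As a rough sketch of how the batch side of a hybrid setup can look, the command below re-submits a file-based index task (like the one sketched earlier) whose `intervals` field covers the window being rebuilt. This is an illustrative assumption about your setup rather than a prescribed workflow; the task file name, schedule, and Overlord address are placeholders.

```bash
# Nightly batch re-ingestion sketch: re-run a file-based index task whose
# granularitySpec "intervals" covers the window being rebuilt, for example
# ["2016-01-01/2016-01-02"]. Once the task publishes its segments, queries
# over that interval read the new segments instead of the ones created by
# streaming ingestion.
curl -X POST -H 'Content-Type: application/json' \
  -d @my-index-task.json \
  http://OVERLORD_IP:8090/druid/indexer/v1/task
```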