---
layout: doc_page
title: "Hadoop-based Batch Ingestion VS Native Batch Ingestion"
---

# Comparison of Batch Ingestion Methods

Apache Druid (incubating) supports three types of batch ingestion: Apache Hadoop-based batch ingestion, native parallel batch ingestion, and native local batch ingestion. The table below shows which features each ingestion method supports.

|   | Hadoop-based ingestion | Native parallel ingestion | Native local ingestion |
|---|------------------------|---------------------------|------------------------|
| Parallel indexing | Always parallel | Parallel if firehose is splittable <br> & `maxNumConcurrentSubTasks` > 1 in `tuningConfig` | Always sequential |
| Supported indexing modes | Overwriting mode | Both appending and overwriting modes | Both appending and overwriting modes |
| External dependency | Hadoop (it internally submits Hadoop jobs) | No dependency | No dependency |
| Supported rollup modes | Perfect rollup | Both perfect and best-effort rollup | Both perfect and best-effort rollup |
| Supported partitioning methods | Both hash-based and range partitioning | Hash-based partitioning (when `forceGuaranteedRollup` = true) | Hash-based partitioning (when `forceGuaranteedRollup` = true) |
| Supported input locations | All locations accessible via HDFS client or Druid dataSource | All implemented firehoses | All implemented firehoses |
| Supported file formats | All implemented Hadoop InputFormats | Currently text file formats (CSV, TSV, JSON) by default. Additional formats can be added through a custom extension implementing `FiniteFirehoseFactory` | Currently text file formats (CSV, TSV, JSON) by default. Additional formats can be added through a custom extension implementing `FiniteFirehoseFactory` |
| Saving parse exceptions in ingestion report | Currently not supported | Currently not supported | Supported |
| Custom segment version | Supported, but this is NOT recommended | N/A | N/A |
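
The "Parallel indexing" row can be read as a simple decision rule. The following is a minimal Python sketch of that rule together with an illustrative `tuningConfig` fragment for the native parallel task. The helper `runs_in_parallel`, the method labels (`"hadoop"`, `"native_parallel"`, `"native_local"`), and the chosen values are hypothetical and exist only for illustration; they are not part of Druid. The sketch also assumes that a missing `maxNumConcurrentSubTasks` is treated as 1, which should be checked against the documentation for your Druid version.

```python
import json


def runs_in_parallel(method: str, firehose_splittable: bool, tuning_config: dict) -> bool:
    """Hypothetical helper that mirrors the "Parallel indexing" row of the table."""
    if method == "hadoop":
        # Hadoop-based ingestion is always parallel.
        return True
    if method == "native_parallel":
        # Parallel only if the firehose is splittable AND
        # maxNumConcurrentSubTasks > 1 in tuningConfig.
        # Assumption: an absent value is treated as 1.
        max_sub_tasks = tuning_config.get("maxNumConcurrentSubTasks", 1)
        return firehose_splittable and max_sub_tasks > 1
    if method == "native_local":
        # Native local ingestion is always sequential.
        return False
    raise ValueError(f"unknown ingestion method: {method}")


# Illustrative tuningConfig fragment for a native parallel task.
# The values are examples, not recommendations.
tuning_config = {
    "type": "index_parallel",
    "maxNumConcurrentSubTasks": 4,
    # Per the table, hash-based partitioning (perfect rollup) applies
    # when forceGuaranteedRollup is true.
    "forceGuaranteedRollup": True,
}

if __name__ == "__main__":
    print(json.dumps(tuning_config, indent=2))
    print(runs_in_parallel("native_parallel", True, tuning_config))
```

With this fragment, the native parallel task would run in parallel only when its firehose is also splittable; setting `maxNumConcurrentSubTasks` to 1 makes the same task run sequentially, according to the rule in the table.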