druid/docs/ingestion/index.md

42 KiB

id title
index Ingestion

Overview

All data in Druid is organized into segments, which are data files that generally have up to a few million rows each. Loading data in Druid is called ingestion or indexing and consists of reading data from a source system and creating segments based on that data.

In most ingestion methods, the work of loading data is done by Druid MiddleManager processes (or the Indexer processes). One exception is Hadoop-based ingestion, where this work is instead done using a Hadoop MapReduce job on YARN (although MiddleManager or Indexer processes are still involved in starting and monitoring the Hadoop jobs). Once segments have been generated and stored in deep storage, they will be loaded by Historical processes. For more details on how this works under the hood, see the Storage design section of Druid's design documentation.

How to use this documentation

This page you are currently reading provides information about universal Druid ingestion concepts, and about configurations that are common to all ingestion methods.

The individual pages for each ingestion method provide additional information about concepts and configurations that are unique to each ingestion method.

We recommend reading (or at least skimming) this universal page first, and then referring to the page for the ingestion method or methods that you have chosen.

Ingestion methods

The table below lists Druid's most common data ingestion methods, along with comparisons to help you choose the best one for your situation. Each ingestion method supports its own set of source systems to pull from. For details about how each method works, as well as configuration properties specific to that method, check out its documentation page.

Streaming

The most recommended, and most popular, method of streaming ingestion is the Kafka indexing service that reads directly from Kafka. The Kinesis indexing service also works well if you prefer Kinesis.

This table compares the major available options:

Method Kafka Kinesis Tranquility
Supervisor type kafka kinesis N/A
How it works Druid reads directly from Apache Kafka. Druid reads directly from Amazon Kinesis. Tranquility, a library that ships separately from Druid, is used to push data into Druid.
Can ingest late data? Yes Yes No (late data is dropped based on the windowPeriod config)
Exactly-once guarantees? Yes Yes No

Batch

When doing batch loads from files, you should use one-time tasks, and you have three options: index_parallel (native batch; parallel), index_hadoop (Hadoop-based), or index (native batch; single-task).

In general, we recommend native batch whenever it meets your needs, since the setup is simpler (it does not depend on an external Hadoop cluster). However, there are still scenarios where Hadoop-based batch ingestion might be a better choice, for example when you already have a running Hadoop cluster and want to use the cluster resource of the existing cluster for batch ingestion.

This table compares the three available options:

Method Native batch (parallel) Hadoop-based Native batch (simple)
Task type index_parallel index_hadoop index
Parallel? Yes, if inputFormat is splittable and maxNumConcurrentSubTasks > 1 in tuningConfig. See data format documentation for details. Yes, always. No. Each task is single-threaded.
Can append or overwrite? Yes, both. Overwrite only. Yes, both.
External dependencies None. Hadoop cluster (Druid submits Map/Reduce jobs). None.
Input locations Any inputSource. Any Hadoop FileSystem or Druid datasource. Any inputSource.
File formats Any inputFormat. Any Hadoop InputFormat. Any inputFormat.
Rollup modes Perfect if forceGuaranteedRollup = true in the tuningConfig. Always perfect. Perfect if forceGuaranteedRollup = true in the tuningConfig.
Partitioning options Dynamic, hash-based, and range-based partitioning methods are available. See Partitions Spec for details. Hash-based or range-based partitioning via partitionsSpec. Dynamic and hash-based partitioning methods are available. See Partitions Spec for details.

Druid's data model

Datasources

Druid data is stored in datasources, which are similar to tables in a traditional RDBMS. Druid offers a unique data modeling system that bears similarity to both relational and timeseries models.

Primary timestamp

Druid schemas must always include a primary timestamp. The primary timestamp is used for partitioning and sorting your data. Druid queries are able to rapidly identify and retrieve data corresponding to time ranges of the primary timestamp column. Druid is also able to use the primary timestamp column for time-based data management operations such as dropping time chunks, overwriting time chunks, and time-based retention rules.

The primary timestamp is parsed based on the timestampSpec. In addition, the granularitySpec controls other important operations that are based on the primary timestamp. Regardless of which input field the primary timestamp is read from, it will always be stored as a column named __time in your Druid datasource.

If you have more than one timestamp column, you can store the others as secondary timestamps.

Dimensions

Dimensions are columns that are stored as-is and can be used for any purpose. You can group, filter, or apply aggregators to dimensions at query time in an ad-hoc manner. If you run with rollup disabled, then the set of dimensions is simply treated like a set of columns to ingest, and behaves exactly as you would expect from a typical database that does not support a rollup feature.

Dimensions are configured through the dimensionsSpec.

Metrics

Metrics are columns that are stored in an aggregated form. They are most useful when rollup is enabled. Specifying a metric allows you to choose an aggregation function for Druid to apply to each row during ingestion. This has two benefits:

  1. If rollup is enabled, multiple rows can be collapsed into one row even while retaining summary information. In the rollup tutorial, this is used to collapse netflow data to a single row per (minute, srcIP, dstIP) tuple, while retaining aggregate information about total packet and byte counts.
  2. Some aggregators, especially approximate ones, can be computed faster at query time even on non-rolled-up data if they are partially computed at ingestion time.

Metrics are configured through the metricsSpec.

Rollup

What is rollup?

Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. Rollup is a form of summarization or pre-aggregation. In practice, rolling up data can dramatically reduce the size of data that needs to be stored, reducing row counts by potentially orders of magnitude. This storage reduction does come at a cost: as we roll up data, we lose the ability to query individual events.

When rollup is disabled, Druid loads each row as-is without doing any form of pre-aggregation. This mode is similar to what you would expect from a typical database that does not support a rollup feature.

When rollup is enabled, then any rows that have identical dimensions and timestamp to each other (after queryGranularity-based truncation) can be collapsed, or rolled up, into a single row in Druid.

By default, rollup is enabled.

Enabling or disabling rollup

Rollup is controlled by the rollup setting in the granularitySpec. By default, it is true (enabled). Set this to false if you want Druid to store each record as-is, without any rollup summarization.

Example of rollup

For an example of how to configure rollup, and of how the feature will modify your data, check out the rollup tutorial.

Maximizing rollup ratio

You can measure the rollup ratio of a datasource by comparing the number of rows in Druid with the number of ingested events. The higher this number, the more benefit you are gaining from rollup. One way to do this is with a Druid SQL query like:

SELECT SUM("cnt") / COUNT(*) * 1.0 FROM datasource

In this query, cnt should refer to a "count" type metric specified at ingestion time. See Counting the number of ingested events on the "Schema design" page for more details about how counting works when rollup is enabled.

Tips for maximizing rollup:

  • Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios you will achieve.
  • Use sketches to avoid storing high cardinality dimensions, which harm rollup ratios.
  • Adjusting queryGranularity at ingestion time (for example, using PT5M instead of PT1M) increases the likelihood of two rows in Druid having matching timestamps, and can improve your rollup ratios.
  • It can be beneficial to load the same data into more than one Druid datasource. Some users choose to create a "full" datasource that has rollup disabled (or enabled, but with a minimal rollup ratio) and an "abbreviated" datasource that has fewer dimensions and a higher rollup ratio. When queries only involve dimensions in the "abbreviated" set, using that datasource leads to much faster query times. This can often be done with just a small increase in storage footprint, since abbreviated datasources tend to be substantially smaller.
  • If you are using a best-effort rollup ingestion configuration that does not guarantee perfect rollup, you can potentially improve your rollup ratio by switching to a guaranteed perfect rollup option, or by reindexing or compacting your data in the background after initial ingestion.

Perfect rollup vs Best-effort rollup

Some Druid ingestion methods guarantee perfect rollup, meaning that input data are perfectly aggregated at ingestion time. Others offer best-effort rollup, meaning that input data might not be perfectly aggregated and thus there could be multiple segments holding rows with the same timestamp and dimension values.

In general, ingestion methods that offer best-effort rollup do this because they are either parallelizing ingestion without a shuffling step (which would be required for perfect rollup), or because they are finalizing and publishing segments before all data for a time chunk has been received, which we call incremental publishing. In both of these cases, records that could theoretically be rolled up may end up in different segments. All types of streaming ingestion run in this mode.

Ingestion methods that guarantee perfect rollup do it with an additional preprocessing step to determine intervals and partitioning before the actual data ingestion stage. This preprocessing step scans the entire input dataset, which generally increases the time required for ingestion, but provides information necessary for perfect rollup.

The following table shows how each method handles rollup:

Method How it works
Native batch index_parallel and index type may be either perfect or best-effort, based on configuration.
Hadoop Always perfect.
Kafka indexing service Always best-effort.
Kinesis indexing service Always best-effort.

Partitioning

Why partition?

Optimal partitioning and sorting of segments within your datasources can have substantial impact on footprint and performance.

Druid datasources are always partitioned by time into time chunks, and each time chunk contains one or more segments. This partitioning happens for all ingestion methods, and is based on the segmentGranularity parameter of your ingestion spec's dataSchema.

The segments within a particular time chunk may also be partitioned further, using options that vary based on the ingestion type you have chosen. In general, doing this secondary partitioning using a particular dimension will improve locality, meaning that rows with the same value for that dimension are stored together and can be accessed quickly.

You will usually get the best performance and smallest overall footprint by partitioning your data on some "natural" dimension that you often filter by, if one exists. This will often improve compression - users have reported threefold storage size decreases - and it also tends to improve query performance as well.

Partitioning and sorting are best friends! If you do have a "natural" partitioning dimension, you should also consider placing it first in the dimensions list of your dimensionsSpec, which tells Druid to sort rows within each segment by that column. This will often improve compression even more, beyond the improvement gained by partitioning alone.

However, note that currently, Druid always sorts rows within a segment by timestamp first, even before the first dimension listed in your dimensionsSpec. This can prevent dimension sorting from being maximally effective. If necessary, you can work around this limitation by setting queryGranularity equal to segmentGranularity in your granularitySpec, which will set all timestamps within the segment to the same value, and by saving your "real" timestamp as a secondary timestamp. This limitation may be removed in a future version of Druid.

How to set up partitioning

Not all ingestion methods support an explicit partitioning configuration, and not all have equivalent levels of flexibility. As of current Druid versions, If you are doing initial ingestion through a less-flexible method (like Kafka) then you can use reindexing or compaction to repartition your data after it is initially ingested. This is a powerful technique: you can use it to ensure that any data older than a certain threshold is optimally partitioned, even as you continuously add new data from a stream.

The following table shows how each ingestion method handles partitioning:

Method How it works
Native batch Configured using partitionsSpec inside the tuningConfig.
Hadoop Configured using partitionsSpec inside the tuningConfig.
Kafka indexing service Partitioning in Druid is guided by how your Kafka topic is partitioned. You can also reindex or compact to repartition after initial ingestion.
Kinesis indexing service Partitioning in Druid is guided by how your Kinesis stream is sharded. You can also reindex or compact to repartition after initial ingestion.

Note that, of course, one way to partition data is to load it into separate datasources. This is a perfectly viable approach and works very well when the number of datasources does not lead to excessive per-datasource overheads. If you go with this approach, then you can ignore this section, since it is describing how to set up partitioning within a single datasource.

For more details on splitting data up into separate datasources, and potential operational considerations, refer to the Multitenancy considerations page.

Ingestion specs

No matter what ingestion method you use, data is loaded into Druid using either one-time tasks or ongoing "supervisors" (which run and supervise a set of tasks over time). In any case, part of the task or supervisor definition is an ingestion spec.

Ingestion specs consists of three main components:

Example ingestion spec for task type index_parallel (native batch):

{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": {
        "column": "timestamp",
        "format": "auto"
      },
      "dimensionsSpec": {
        "dimensions": [
          { "type": "string", "page" },
          { "type": "string", "language" },
          { "type": "long", "name": "userId" }
        ]
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
        { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
      ],
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": [
          "2013-08-31/2013-09-01"
        ]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "examples/indexing/",
        "filter": "wikipedia_data.json"
      },
      "inputFormat": {
        "type": "json",
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": [
            { "type": "path", "name": "userId", "expr": "$.user.id" }
          ]
        }
      }
    },
    "tuningConfig": {
      "type": "index_parallel"
    }
  }
}

The specific options supported by these sections will depend on the ingestion method you have chosen. For more examples, refer to the documentation for each ingestion method.

You can also load data visually, without the need to write an ingestion spec, using the "Load data" functionality available in Druid's web console. Druid's visual data loader supports Kafka, Kinesis, and native batch mode.

dataSchema

The dataSchema spec has been changed in 0.17.0. The new spec is supported by all ingestion methods except for Hadoop ingestion. See the Legacy dataSchema spec for the old spec.

The dataSchema is a holder for the following components:

An example dataSchema is:

"dataSchema": {
  "dataSource": "wikipedia",
  "timestampSpec": {
    "column": "timestamp",
    "format": "auto"
  },
  "dimensionsSpec": {
    "dimensions": [
      { "type": "string", "page" },
      { "type": "string", "language" },
      { "type": "long", "name": "userId" }
    ]
  },
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
    { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
  ],
  "granularitySpec": {
    "segmentGranularity": "day",
    "queryGranularity": "none",
    "intervals": [
      "2013-08-31/2013-09-01"
    ]
  }
}

dataSource

The dataSource is located in dataSchemadataSource and is simply the name of the datasource that data will be written to. An example dataSource is:

"dataSource": "my-first-datasource"

timestampSpec

The timestampSpec is located in dataSchematimestampSpec and is responsible for configuring the primary timestamp. An example timestampSpec is:

"timestampSpec": {
  "column": "timestamp",
  "format": "auto"
}

Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order: first flattenSpec (if any), then timestampSpec, then transformSpec, and finally dimensionsSpec and metricsSpec. Keep this in mind when writing your ingestion spec.

A timestampSpec can have the following components:

Field Description Default
column Input row field to read the primary timestamp from.

Regardless of the name of this input field, the primary timestamp will always be stored as a column named __time in your Druid datasource.
timestamp
format Timestamp format. Options are:
  • iso: ISO8601 with 'T' separator, like "2000-01-01T01:02:03.456"
  • posix: seconds since epoch
  • millis: milliseconds since epoch
  • micro: microseconds since epoch
  • nano: nanoseconds since epoch
  • auto: automatically detects ISO (either 'T' or space separator) or millis format
  • any Joda DateTimeFormat string
auto
missingValue Timestamp to use for input records that have a null or missing timestamp column. Should be in ISO8601 format, like "2000-01-01T01:02:03.456", even if you have specified something else for format. Since Druid requires a primary timestamp, this setting can be useful for ingesting datasets that do not have any per-record timestamps at all. none

dimensionsSpec

The dimensionsSpec is located in dataSchemadimensionsSpec and is responsible for configuring dimensions. An example dimensionsSpec is:

"dimensionsSpec" : {
  "dimensions": [
    "page",
    "language",
    { "type": "long", "name": "userId" }
  ],
  "dimensionExclusions" : [],
  "spatialDimensions" : []
}

Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order: first flattenSpec (if any), then timestampSpec, then transformSpec, and finally dimensionsSpec and metricsSpec. Keep this in mind when writing your ingestion spec.

A dimensionsSpec can have the following components:

Field Description Default
dimensions A list of dimension names or objects. Cannot have the same column in both dimensions and dimensionExclusions.

If this and spatialDimensions are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in dimensionExclusions as String-typed dimension columns. See inclusions and exclusions below for details.
[]
dimensionExclusions The names of dimensions to exclude from ingestion. Only names are supported here, not objects.

This list is only used if the dimensions and spatialDimensions lists are both null or empty arrays; otherwise it is ignored. See inclusions and exclusions below for details.
[]
spatialDimensions An array of spatial dimensions. []

Dimension objects

Each dimension in the dimensions list can either be a name or an object. Providing a name is equivalent to providing a string type dimension object with the given name, e.g. "page" is equivalent to {"name": "page", "type": "string"}.

Dimension objects can have the following components:

Field Description Default
type Either string, long, float, or double. string
name The name of the dimension. This will be used as the field name to read from input records, as well as the column name stored in generated segments.

Note that you can use a transformSpec if you want to rename columns during ingestion time.
none (required)
createBitmapIndex For string typed dimensions, whether or not bitmap indexes should be created for the column in generated segments. Creating a bitmap index requires more storage, but speeds up certain kinds of filtering (especially equality and prefix filtering). Only supported for string typed dimensions. true

Inclusions and exclusions

Druid will interpret a dimensionsSpec in two possible ways: normal or schemaless.

Normal interpretation occurs when either dimensions or spatialDimensions is non-empty. In this case, the combination of the two lists will be taken as the set of dimensions to be ingested, and the list of dimensionExclusions will be ignored.

Schemaless interpretation occurs when both dimensions and spatialDimensions are empty or null. In this case, the set of dimensions is determined in the following way:

  1. First, start from the set of all input fields from the inputFormat (or the flattenSpec, if one is being used).
  2. Any field listed in dimensionExclusions is excluded.
  3. The field listed as column in the timestampSpec is excluded.
  4. Any field used as an input to an aggregator from the metricsSpec is excluded.
  5. Any field with the same name as an aggregator from the metricsSpec is excluded.
  6. All other fields are ingested as string typed dimensions with the default settings.

Note: Fields generated by a transformSpec are not currently considered candidates for schemaless dimension interpretation.

metricsSpec

The metricsSpec is located in dataSchemametricsSpec and is a list of aggregators to apply at ingestion time. This is most useful when rollup is enabled, since it's how you configure ingestion-time aggregation.

An example metricsSpec is:

"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "doubleSum", "name": "bytes_added_sum", "fieldName": "bytes_added" },
  { "type": "doubleSum", "name": "bytes_deleted_sum", "fieldName": "bytes_deleted" }
]

Generally, when rollup is disabled, you should have an empty metricsSpec (because without rollup, Druid does not do any ingestion-time aggregation, so there is little reason to include an ingestion-time aggregator). However, in some cases, it can still make sense to define metrics: for example, if you want to create a complex column as a way of pre-computing part of an approximate aggregation, this can only be done by defining a metric in a metricsSpec.

granularitySpec

The granularitySpec is located in dataSchemagranularitySpec and is responsible for configuring the following operations:

  1. Partitioning a datasource into time chunks (via segmentGranularity).
  2. Truncating the timestamp, if desired (via queryGranularity).
  3. Specifying which time chunks of segments should be created, for batch ingestion (via intervals).
  4. Specifying whether ingestion-time rollup should be used or not (via rollup).

Other than rollup, these operations are all based on the primary timestamp.

An example granularitySpec is:

"granularitySpec": {
  "segmentGranularity": "day",
  "queryGranularity": "none",
  "intervals": [
    "2013-08-31/2013-09-01"
  ],
  "rollup": true
}

A granularitySpec can have the following components:

Field Description Default
type Either uniform or arbitrary. In most cases you want to use uniform. uniform
segmentGranularity Time chunking granularity for this datasource. Multiple segments can be created per time chunk. For example, when set to day, the events of the same day fall into the same time chunk which can be optionally further partitioned into multiple segments based on other configurations and input size. Any granularity can be provided here. Note that all segments in the same time chunk should have the same segment granularity.

Ignored if type is set to arbitrary.
day
queryGranularity The resolution of timestamp storage within each segment. This must be equal to, or finer, than segmentGranularity. This will be the finest granularity that you can query at and still receive sensible results, but note that you can still query at anything coarser than this granularity. E.g., a value of minute will mean that records will be stored at minutely granularity, and can be sensibly queried at any multiple of minutes (including minutely, 5-minutely, hourly, etc).

Any granularity can be provided here. Use none to store timestamps as-is, without any truncation. Note that rollup will be applied if it is set even when the queryGranularity is set to none.
none
rollup Whether to use ingestion-time rollup or not. Note that rollup is still effective even when queryGranularity is set to none. Your data will be rolled up if they have the exactly same timestamp. true
intervals A list of intervals describing what time chunks of segments should be created. If type is set to uniform, this list will be broken up and rounded-off based on the segmentGranularity. If type is set to arbitrary, this list will be used as-is.

If null or not provided, batch ingestion tasks will generally determine which time chunks to output based on what timestamps are found in the input data.

If specified, batch ingestion tasks may be able to skip a determining-partitions phase, which can result in faster ingestion. Batch ingestion tasks may also be able to request all their locks up-front instead of one by one. Batch ingestion tasks will throw away any records with timestamps outside of the specified intervals.

Ignored for any form of streaming ingestion.
null

transformSpec

The transformSpec is located in dataSchematransformSpec and is responsible for transforming and filtering records during ingestion time. It is optional. An example transformSpec is:

"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "countryUpper", "expression": "upper(country)" }
  ],
  "filter": {
    "type": "selector",
    "dimension": "country",
    "value": "San Serriffe"
  }
}

Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order: first flattenSpec (if any), then timestampSpec, then transformSpec, and finally dimensionsSpec and metricsSpec. Keep this in mind when writing your ingestion spec.

Transforms

The transforms list allows you to specify a set of expressions to evaluate on top of input data. Each transform has a "name" which can be referred to by your dimensionsSpec, metricsSpec, etc.

If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".

Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular, they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field with another field containing all nulls, which will act similarly to removing the field.

Transforms can refer to the timestamp of an input row by referring to __time as part of the expression. They can also replace the timestamp if you set their "name" to __time. In both cases, __time should be treated as a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied after the timestampSpec.

Druid currently includes one kind of built-in transform, the expression transform. It has the following syntax:

{
  "type": "expression",
  "name": "<output name>",
  "expression": "<expr>"
}

The expression is a Druid query expression.

Conceptually, after input data records are read, Druid applies ingestion spec components in a particular order: first flattenSpec (if any), then timestampSpec, then transformSpec, and finally dimensionsSpec and metricsSpec. Keep this in mind when writing your ingestion spec.

Filter

The filter conditionally filters input rows during ingestion. Only rows that pass the filter will be ingested. Any of Druid's standard query filters can be used. Note that within a transformSpec, the transforms are applied before the filter, so the filter can refer to a transform.

Legacy dataSchema spec

The dataSchema spec has been changed in 0.17.0. The new spec is supported by all ingestion methods except for Hadoop ingestion. See dataSchema for the new spec.

The legacy dataSchema spec has below two more components in addition to the ones listed in the dataSchema section above.

parser (Deprecated)

In legacy dataSchema, the parser is located in the dataSchemaparser and is responsible for configuring a wide variety of items related to parsing input records. The parser is deprecated and it is highly recommended to use inputFormat instead. For details about inputFormat and supported parser types, see the "Data formats" page.

For details about major components of the parseSpec, refer to their subsections:

An example parser is:

"parser": {
  "type": "string",
  "parseSpec": {
    "format": "json",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        { "type": "path", "name": "userId", "expr": "$.user.id" }
      ]
    },
    "timestampSpec": {
      "column": "timestamp",
      "format": "auto"
    },
    "dimensionsSpec": {
      "dimensions": [
        { "type": "string", "page" },
        { "type": "string", "language" },
        { "type": "long", "name": "userId" }
      ]
    }
  }
}

flattenSpec

In the legacy dataSchema, the flattenSpec is located in dataSchemaparserparseSpecflattenSpec and is responsible for bridging the gap between potentially nested input data (such as JSON, Avro, etc) and Druid's flat data model. See Flatten spec for more details.

ioConfig

The ioConfig influences how data is read from a source system, such as Apache Kafka, Amazon S3, a mounted filesystem, or any other supported source system. The inputFormat property applies to all ingestion method except for Hadoop ingestion. The Hadoop ingestion still uses the parser in the legacy dataSchema. The rest of ioConfig is specific to each individual ingestion method. An example ioConfig to read JSON data is:

"ioConfig": {
    "type": "<ingestion-method-specific type code>",
    "inputFormat": {
      "type": "json"
    },
    ...
}

For more details, see the documentation provided by each ingestion method.

tuningConfig

Tuning properties are specified in a tuningConfig, which goes at the top level of an ingestion spec. Some properties apply to all ingestion methods, but most are specific to each individual ingestion method. An example tuningConfig that sets all of the shared, common properties to their defaults is:

"tuningConfig": {
  "type": "<ingestion-method-specific type code>",
  "maxRowsInMemory": 1000000,
  "maxBytesInMemory": <one-sixth of JVM memory>,
  "indexSpec": {
    "bitmap": { "type": "roaring" },
    "dimensionCompression": "lz4",
    "metricCompression": "lz4",
    "longEncoding": "longs"
  },
  <other ingestion-method-specific properties>
}
Field Description Default
type Each ingestion method has its own tuning type code. You must specify the type code that matches your ingestion method. Common options are index, hadoop, kafka, and kinesis.
maxRowsInMemory The maximum number of records to store in memory before persisting to disk. Note that this is the number of rows post-rollup, and so it may not be equal to the number of input records. Ingested records will be persisted to disk when either maxRowsInMemory or maxBytesInMemory are reached (whichever happens first). 1000000
maxBytesInMemory The maximum aggregate size of records, in bytes, to store in the JVM heap before persisting. This is based on a rough estimate of memory usage. Ingested records will be persisted to disk when either maxRowsInMemory or maxBytesInMemory are reached (whichever happens first). maxBytesInMemory also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of maxBytesInMemory until next persist will decreases, and task will fail when the sum of bytes of all intermediary persisted artifacts exceeds maxBytesInMemory.

Setting maxBytesInMemory to -1 disables this check, meaning Druid will rely entirely on maxRowsInMemory to control memory usage. Setting it to zero means the default value will be used (one-sixth of JVM heap size).

Note that the estimate of memory usage is designed to be an overestimate, and can be especially high when using complex ingest-time aggregators, including sketches. If this causes your indexing workloads to persist to disk too often, you can set maxBytesInMemory to -1 and rely on maxRowsInMemory instead.
One-sixth of max JVM heap size
skipBytesInMemoryOverheadCheck The calculation of maxBytesInMemory takes into account overhead objects created during ingestion and each intermediate persist. Setting this to true can exclude the bytes of these overhead objects from maxBytesInMemory check. false
indexSpec Tune how data is indexed. See below for more information. See table below
Other properties Each ingestion method has its own list of additional tuning properties. See the documentation for each method for a full list: Kafka indexing service, Kinesis indexing service, Native batch, and Hadoop-based.

indexSpec

The indexSpec object can include the following properties:

Field Description Default
bitmap Compression format for bitmap indexes. Should be a JSON object with type set to roaring or concise. For type roaring, the boolean property compressRunOnSerialization (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient. {"type": "concise"}
dimensionCompression Compression format for dimension columns. Options are lz4, lzf, or uncompressed. lz4
metricCompression Compression format for primitive type metric columns. Options are lz4, lzf, uncompressed, or none (which is more efficient than uncompressed, but not supported by older versions of Druid). lz4
longEncoding Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are auto or longs. auto encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. longs stores the value as-is with 8 bytes each. longs

Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each ingestion method for details.