druid/docs/ingestion/native-batch-simple-task.md

13 KiB

id title sidebar_label
native-batch-simple-task Native batch simple task indexing Simple task indexing

The simple task (type index) is designed to ingest small data sets into Apache Druid. The task executes within the indexing service. For general information on native batch indexing and parallel task indexing, see Native batch ingestion.

Task syntax

A sample task is shown below:

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "timestampSpec" : {
        "column" : "timestamp",
        "format" : "auto"
      },
      "dimensionsSpec" : {
        "dimensions": ["country", "page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent",,"region","city"],
        "dimensionExclusions" : []
      },
      "metricsSpec" : [
        {
          "type" : "count",
          "name" : "count"
        },
        {
          "type" : "doubleSum",
          "name" : "added",
          "fieldName" : "added"
        },
        {
          "type" : "doubleSum",
          "name" : "deleted",
          "fieldName" : "deleted"
        },
        {
          "type" : "doubleSum",
          "name" : "delta",
          "fieldName" : "delta"
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-08-31/2013-09-01" ]
      }
    },
    "ioConfig" : {
      "type" : "index",
      "inputSource" : {
        "type" : "local",
        "baseDir" : "examples/indexing/",
        "filter" : "wikipedia_data.json"
       },
       "inputFormat": {
         "type": "json"
       }
    },
    "tuningConfig" : {
      "type" : "index",
      "partitionsSpec": {
        "type": "single_dim",
        "partitionDimension": "country",
        "targetRowsPerSegment": 5000000
      }
    }
  }
}
property description required?
type The task type, this should always be index. yes
id The task ID. If this is not explicitly specified, Druid generates the task ID using task type, data source name, interval, and date-time stamp. no
spec The ingestion spec including the data schema, IOConfig, and TuningConfig. See below for more details. yes
context Context containing various task configuration parameters. See below for more details. no

dataSchema

This field is required.

See the dataSchema section of the ingestion docs for details.

If you do not specify intervals explicitly in your dataSchema's granularitySpec, the Local Index Task will do an extra pass over the data to determine the range to lock when it starts up. If you specify intervals explicitly, any rows outside the specified intervals will be thrown away. We recommend setting intervals explicitly if you know the time range of the data because it allows the task to skip the extra pass, and so that you don't accidentally replace data outside that range if there's some stray data with unexpected timestamps.

ioConfig

property description default required?
type The task type, this should always be "index". none yes
inputFormat inputFormat to specify how to parse input data. none yes
appendToExisting Creates segments as additional shards of the latest version, effectively appending to the segment set instead of replacing it. This means that you can append new segments to any datasource regardless of its original partitioning scheme. You must use the dynamic partitioning type for the appended segments. If you specify a different partitioning type, the task fails with an error. false no
dropExisting If true and appendToExisting is false and the granularitySpec contains aninterval, then the ingestion task drops (mark unused) all existing segments fully contained by the specified interval when the task publishes new segments. If ingestion fails, Druid does not drop or mark unused any segments. In the case of misconfiguration where either appendToExisting is true or interval is not specified in granularitySpec, Druid does not drop any segments even if dropExisting is true. WARNING: this functionality is still in beta and can result in temporary data unavailability for data within the specified interval. false no

tuningConfig

The tuningConfig is optional and default parameters will be used if no tuningConfig is specified. See below for more details.

property description default required?
type The task type, this should always be "index". none yes
maxRowsPerSegment Deprecated. Use partitionsSpec instead. Used in sharding. Determines how many rows are in each segment. 5000000 no
maxRowsInMemory Used in determining when intermediate persists to disk should occur. Normally user does not need to set this, but depending on the nature of data, if rows are short in terms of bytes, user may not want to store a million rows in memory and this value should be set. 1000000 no
maxBytesInMemory Used in determining when intermediate persists to disk should occur. Normally this is computed internally and user does not need to set it. This value represents number of bytes to aggregate in heap memory before persisting. This is based on a rough estimate of memory usage and not actual usage. The maximum heap memory usage for indexing is maxBytesInMemory * (2 + maxPendingPersists). Note that maxBytesInMemory also includes heap usage of artifacts created from intermediary persists. This means that after every persist, the amount of maxBytesInMemory until next persist will decreases, and task will fail when the sum of bytes of all intermediary persisted artifacts exceeds maxBytesInMemory. 1/6 of max JVM memory no
maxTotalRows Deprecated. Use partitionsSpec instead. Total number of rows in segments waiting for being pushed. Used in determining when intermediate pushing should occur. 20000000 no
numShards Deprecated. Use partitionsSpec instead. Directly specify the number of shards to create. If this is specified and intervals is specified in the granularitySpec, the index task can skip the determine intervals/partitions pass through the data. numShards cannot be specified if maxRowsPerSegment is set. null no
partitionDimensions Deprecated. Use partitionsSpec instead. The dimensions to partition on. Leave blank to select all dimensions. Only used with forceGuaranteedRollup = true, will be ignored otherwise. null no
partitionsSpec Defines how to partition data in each timeChunk, see PartitionsSpec dynamic if forceGuaranteedRollup = false, hashed if forceGuaranteedRollup = true no
indexSpec Defines segment storage format options to be used at indexing time, see IndexSpec null no
indexSpecForIntermediatePersists Defines segment storage format options to be used at indexing time for intermediate persisted temporary segments. This can be used to disable dimension/metric compression on intermediate segments to reduce memory required for final merging. However, disabling compression on intermediate segments might increase page cache use while they are used before getting merged into final segment published, see IndexSpec for possible values. same as indexSpec no
maxPendingPersists Maximum number of persists that can be pending but not started. If this limit would be exceeded by a new intermediate persist, ingestion will block until the currently-running persist finishes. Maximum heap memory usage for indexing scales with maxRowsInMemory * (2 + maxPendingPersists). 0 (meaning one persist can be running concurrently with ingestion, and none can be queued up) no
forceGuaranteedRollup Forces guaranteeing the perfect rollup. The perfect rollup optimizes the total size of generated segments and querying time while indexing time will be increased. If this is set to true, the index task will read the entire input data twice: one for finding the optimal number of partitions per time chunk and one for generating segments. Note that the result segments would be hash-partitioned. This flag cannot be used with appendToExisting of IOConfig. For more details, see the below Segment pushing modes section. false no
reportParseExceptions DEPRECATED. If true, exceptions encountered during parsing will be thrown and will halt ingestion; if false, unparseable rows and fields will be skipped. Setting reportParseExceptions to true will override existing configurations for maxParseExceptions and maxSavedParseExceptions, setting maxParseExceptions to 0 and limiting maxSavedParseExceptions to no more than 1. false no
pushTimeout Milliseconds to wait for pushing segments. It must be >= 0, where 0 means to wait forever. 0 no
segmentWriteOutMediumFactory Segment write-out medium to use when creating segments. See SegmentWriteOutMediumFactory. Not specified, the value from druid.peon.defaultSegmentWriteOutMediumFactory.type is used no
logParseExceptions If true, log an error message when a parsing exception occurs, containing information about the row where the error occurred. false no
maxParseExceptions The maximum number of parse exceptions that can occur before the task halts ingestion and fails. Overridden if reportParseExceptions is set. unlimited no
maxSavedParseExceptions When a parse exception occurs, Druid can keep track of the most recent parse exceptions. "maxSavedParseExceptions" limits how many exception instances will be saved. These saved exceptions will be made available after the task finishes in the task completion report. Overridden if reportParseExceptions is set. 0 no

partitionsSpec

PartitionsSpec is to describe the secondary partitioning method. You should use different partitionsSpec depending on the rollup mode you want. For perfect rollup, you should use hashed.

property description default required?
type This should always be hashed none yes
maxRowsPerSegment Used in sharding. Determines how many rows are in each segment. 5000000 no
numShards Directly specify the number of shards to create. If this is specified and intervals is specified in the granularitySpec, the index task can skip the determine intervals/partitions pass through the data. numShards cannot be specified if maxRowsPerSegment is set. null no
partitionDimensions The dimensions to partition on. Leave blank to select all dimensions. null no
partitionFunction A function to compute hash of partition dimensions. See Hash partition function murmur3_32_abs no

For best-effort rollup, you should use dynamic.

property description default required?
type This should always be dynamic none yes
maxRowsPerSegment Used in sharding. Determines how many rows are in each segment. 5000000 no
maxTotalRows Total number of rows in segments waiting for being pushed. 20000000 no

segmentWriteOutMediumFactory

Field Type Description Required
type String See Additional Peon Configuration: SegmentWriteOutMediumFactory for explanation and available options. yes

Segment pushing modes

While ingesting data using the simple task indexing, Druid creates segments from the input data and pushes them. For segment pushing, the simple task index supports the following segment pushing modes based upon your type of rollup:

  • Bulk pushing mode: Used for perfect rollup. Druid pushes every segment at the very end of the index task. Until then, Druid stores created segments in memory and local storage of the service running the index task. This mode can cause problems if you have limited storage capacity, and is not recommended to use in production. To enable bulk pushing mode, set forceGuaranteedRollup in your TuningConfig. You can not use bulk pushing with appendToExisting in your IOConfig.
  • Incremental pushing mode: Used for best-effort rollup. Druid pushes segments are incrementally during the course of the indexing task. The index task collects data and stores created segments in the memory and disks of the services running the task until the total number of collected rows exceeds maxTotalRows. At that point the index task immediately pushes all segments created up until that moment, cleans up pushed segments, and continues to ingest the remaining data.