---
layout: doc_page
---

# Batch Data Ingestion
There are two choices for batch data ingestion to your Druid cluster: you can use the [Indexing service](Indexing-Service.html) or the `HadoopDruidIndexer`.

Which should I use?
-------------------

The [Indexing service](Indexing-Service.html) is a set of nodes that runs as part of your Druid cluster and can perform a number of different types of indexing tasks. Even if all you care about is batch indexing, it encapsulates details such as the [metadata store](Metadata-storage.html) used for segment metadata, so that your indexing tasks do not need to include that information. The indexing service was designed so that external systems can interact with it programmatically and run periodic indexing tasks. Long-term, the indexing service is going to be the preferred method of ingesting data.

The `HadoopDruidIndexer` runs Hadoop jobs to separate and index data segments. It takes advantage of Hadoop as a job scheduling and distributed job execution platform. It is a simple method if you already have Hadoop running and don’t want to spend the time configuring and deploying the [Indexing service](Indexing-Service.html) just yet.

## Batch Ingestion using the HadoopDruidIndexer

The HadoopDruidIndexer can be run like so:

```
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_path> io.druid.cli.Main index hadoop <spec_file>
```

## Hadoop "specFile"

The spec\_file is the path to a file that contains JSON. An example looks like:

```json
{
  "dataSchema" : {
    "dataSource" : "wikipedia",
    "parser" : {
      "type" : "string",
      "parseSpec" : {
        "format" : "json",
        "timestampSpec" : {
          "column" : "timestamp",
          "format" : "auto"
        },
        "dimensionsSpec" : {
          "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
          "dimensionExclusions" : [],
          "spatialDimensions" : []
        }
      }
    },
    "metricsSpec" : [
      {
        "type" : "count",
        "name" : "count"
      },
      {
        "type" : "doubleSum",
        "name" : "added",
        "fieldName" : "added"
      },
      {
        "type" : "doubleSum",
        "name" : "deleted",
        "fieldName" : "deleted"
      },
      {
        "type" : "doubleSum",
        "name" : "delta",
        "fieldName" : "delta"
      }
    ],
    "granularitySpec" : {
      "type" : "uniform",
      "segmentGranularity" : "DAY",
      "queryGranularity" : "NONE",
      "intervals" : [ "2013-08-31/2013-09-01" ]
    }
  },
  "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "static",
      "paths" : "/MyDirectory/examples/indexing/wikipedia_data.json"
    },
    "metadataUpdateSpec" : {
      "type":"metadata",
      "connectURI" : "jdbc:mysql://localhost:3306/druid",
      "password" : "diurd",
      "segmentTable" : "druid_segments",
      "user" : "druid"
    },
    "segmentOutputPath" : "/MyDirectory/data/index/output"
  },
  "tuningConfig" : {
    "type" : "hadoop",
    "workingPath": "/tmp",
    "partitionsSpec" : {
      "type" : "dimension",
      "partitionDimension" : null,
      "targetPartitionSize" : 5000000,
      "maxPartitionSize" : 7500000,
      "assumeGrouped" : false,
      "numShards" : -1
    },
    "shardSpecs" : { },
    "leaveIntermediate" : false,
    "cleanupOnFailure" : true,
    "overwriteFiles" : false,
    "ignoreInvalidRows" : false,
    "jobProperties" : { },
    "combineText" : false,
    "persistInHeap" : false,
    "ingestOffheap" : false,
    "bufferSize" : 134217728,
    "aggregationBufferRatio" : 0.5,
    "rowFlushBoundary" : 300000
  }
}
```

### DataSchema

This field is required.

See [Ingestion](Ingestion.html) for details.

### IOConfig

This field is required.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|This should always be 'hadoop'.|yes|
|inputSpec|Object|A specification of where to pull the data in from. See "Path specification" below.|yes|
|segmentOutputPath|String|The path to dump segments into.|yes|
|metadataUpdateSpec|Object|A specification of how to update the metadata for the Druid cluster these segments belong to.|yes|

#### Path specification

There are multiple types of path specification:

##### `static`

A type of data loader that takes a static path to the location of the data files.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|paths|String|A comma-separated String of input paths indicating where the raw data is located.|yes|

For example, using the static input paths:

```
"paths" : "s3n://billy-bucket/the/data/is/here/data.gz, s3n://billy-bucket/the/data/is/here/moredata.gz, s3n://billy-bucket/the/data/is/here/evenmoredata.gz"
```

##### `granularity`

A type of data loader that expects data to be laid out in a specific path format. Specifically, it expects data to be organized by date and time in the directory format `y=XXXX/m=XX/d=XX/H=XX/M=XX/S=XX` (date components are lowercase, time components are uppercase).

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|dataGranularity|Object|Specifies the granularity to expect the data at, e.g. hour means to expect directories `y=XXXX/m=XX/d=XX/H=XX`.|yes|
|inputPath|String|Base path to append the expected time path to.|yes|
|filePattern|String|Pattern that files should match to be included.|yes|

For example, if the sample config were run with the interval 2012-06-01/2012-06-02 and `dataGranularity` set to hour, it would expect data at the paths:

```
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=00
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=01
...
s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23
```
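
The layout above can be sketched in a few lines of Python. This is only an illustration, not code from Druid: the function name `expected_hourly_paths` is made up, and it assumes hourly granularity, day-precision interval bounds, and an exclusive interval end, which matches the paths listed above.

```python
from datetime import datetime, timedelta

def expected_hourly_paths(input_path, interval):
    """List the hourly directories covered by an interval of the form 'start/end'."""
    start_text, end_text = interval.split("/")
    start = datetime.strptime(start_text, "%Y-%m-%d")
    end = datetime.strptime(end_text, "%Y-%m-%d")
    paths = []
    current = start
    while current < end:  # assumption: the end of the interval is exclusive
        paths.append(
            "{}/y={:04d}/m={:02d}/d={:02d}/H={:02d}".format(
                input_path.rstrip("/"),
                current.year, current.month, current.day, current.hour,
            )
        )
        current += timedelta(hours=1)
    return paths

paths = expected_hourly_paths(
    "s3n://billy-bucket/the/data/is/here", "2012-06-01/2012-06-02"
)
print(len(paths))
print(paths[0])
print(paths[-1])
```

A script like this can be handy for checking that the directories exist before starting an indexing job.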

#### Metadata Update Job Spec

This is a specification of the properties that tell the job how to update metadata such that the Druid cluster will see the output segments and load them.


|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|"metadata" is the only value available.|yes|
|connectURI|String|A valid JDBC URL to metadata storage.|yes|
|user|String|Username for the database.|yes|
|password|String|Password for the database.|yes|
|segmentTable|String|Table to use in the database.|yes|

These properties should match what you have configured for your [Coordinator](Coordinator.html).

### TuningConfig

The tuningConfig is optional and default parameters will be used if no tuningConfig is specified.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|workingPath|String|The working path to use for intermediate results (results between Hadoop jobs).|no (default == '/tmp/druid-indexing')|
|version|String|The version of created segments.|no (default == datetime that indexing starts at)|
|leaveIntermediate|Boolean|Leave behind files in the workingPath when the job completes or fails (debugging tool).|no (default == false)|
|partitionsSpec|Object|A specification of how to partition each time bucket into segments. Absence of this property means no partitioning will occur. More details below.|no (default == 'hashed')|
|maxRowsInMemory|Integer|The number of rows to aggregate before persisting. This number is the post-aggregation rows, so it is not equivalent to the number of input events, but the number of aggregated rows that those events result in. This is used to manage the required JVM heap size.|no (default == 5 million)|
|cleanupOnFailure|Boolean|Cleans up intermediate files when the job fails as opposed to leaving them around for debugging.|no (default == true)|
|overwriteFiles|Boolean|Override existing files found during indexing.|no (default == false)|
|ignoreInvalidRows|Boolean|Ignore rows found to have problems.|no (default == false)|
|jobProperties|Object|A map of properties to add to the Hadoop job configuration.|no (default == null)|

### Partitioning specification

Segments are always partitioned based on timestamp (according to the granularitySpec) and may be further partitioned in
some other way depending on partition type. Druid supports two types of partitioning strategies: "hashed" (based on the
hash of all dimensions in each row), and "dimension" (based on ranges of a single dimension).

Hashed partitioning is recommended in most cases, as it will improve indexing performance and create more uniformly
sized data segments relative to single-dimension partitioning.

#### Hash-based partitioning

```json
  "partitionsSpec": {
     "type": "hashed",
     "targetPartitionSize": 5000000
   }
```

Hashed partitioning works by first selecting a number of segments, and then partitioning rows across those segments
according to the hash of all dimensions in each row. The number of segments is determined automatically based on the
cardinality of the input set and a target partition size.

The configuration options are:

|property|description|required?|
|--------|-----------|---------|
|type|Type of partitionsSpec to be used.|"hashed"|
|targetPartitionSize|Target number of rows to include in a partition. Should be a number that targets segments of 500MB\~1GB.|either this or numShards|
|numShards|Specify the number of partitions directly, instead of a target partition size. Ingestion will run faster, since it can skip the step necessary to select a number of partitions automatically.|either this or targetPartitionSize|
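
To make the hashed scheme more concrete, here is a minimal Python sketch of the idea. It is not Druid's implementation: the helper names are hypothetical, the shard count is assumed to be the estimated row count divided by `targetPartitionSize` and rounded up, and MD5 merely stands in for whatever hash function Druid actually uses.

```python
import hashlib
import math

def choose_num_shards(estimated_rows, target_partition_size):
    """Assumed rule: enough shards that each holds about target_partition_size rows."""
    return max(1, math.ceil(estimated_rows / target_partition_size))

def shard_for_row(dimension_values, num_shards):
    """Assign a row to a shard from a hash over all of its dimension values."""
    key = "\x01".join(str(value) for value in dimension_values).encode("utf-8")
    digest = hashlib.md5(key).digest()  # stand-in hash, chosen only for determinism
    return int.from_bytes(digest[:8], "big") % num_shards

num_shards = choose_num_shards(12000000, 5000000)
print(num_shards)
print(shard_for_row(["Druid", "en", "alice"], num_shards))
```

Because the shard depends only on the dimension values, identical rows always land in the same segment, and setting `numShards` directly simply skips the first function.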

#### Single-dimension partitioning

```json
  "partitionsSpec": {
     "type": "dimension",
     "targetPartitionSize": 5000000
   }
```

Single-dimension partitioning works by first selecting a dimension to partition on, and then separating that dimension
into contiguous ranges. Each segment will contain all rows with values of that dimension in that range. For example,
your segments may be partitioned on the dimension "host" using the ranges "a.example.com" to "f.example.com" and
"f.example.com" to "z.example.com". By default, the dimension to use is determined automatically, although you can
override it with a specific dimension.

The configuration options are:

|property|description|required?|
|--------|-----------|---------|
|type|Type of partitionsSpec to be used.|"dimension"|
|targetPartitionSize|Target number of rows to include in a partition. Should be a number that targets segments of 500MB\~1GB.|yes|
|maxPartitionSize|Maximum number of rows to include in a partition. Defaults to 50% larger than the targetPartitionSize.|no|
|partitionDimension|The dimension to partition on. Leave blank to select a dimension automatically.|no|
|assumeGrouped|Assume input data has already been grouped on time and dimensions. Ingestion will run faster, but can choose suboptimal partitions if the assumption is violated.|no|
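
The range-based scheme can be illustrated with a short Python sketch. This is a simplification under stated assumptions, not how Druid chooses partitions: the function names are made up, the input is assumed to be one already-sorted dimension value per row, and `maxPartitionSize` is ignored.

```python
from bisect import bisect_right

def choose_boundaries(sorted_values, target_partition_size):
    """Pick boundaries so each contiguous range holds roughly target_partition_size rows."""
    if not sorted_values:
        return []
    boundaries = []
    last = sorted_values[0]
    for i in range(target_partition_size, len(sorted_values), target_partition_size):
        boundary = sorted_values[i]
        # Only keep boundaries that start a new value, so equal values stay together.
        if boundary > last:
            boundaries.append(boundary)
            last = boundary
    return boundaries

def partition_for_value(value, boundaries):
    """Return the index of the range [previous boundary, next boundary) holding value."""
    return bisect_right(boundaries, value)

hosts = ["a.example.com", "b.example.com", "c.example.com",
         "f.example.com", "g.example.com", "z.example.com"]
boundaries = choose_boundaries(hosts, 3)
print(boundaries)
print([partition_for_value(host, boundaries) for host in hosts])
```

With these six rows and a target of three rows per partition, the single boundary falls at "f.example.com", which mirrors the two ranges in the example above.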

### Remote Hadoop Cluster

If you have a remote Hadoop cluster, make sure to include the folder holding your configuration `*.xml` files in the classpath of the indexer.

Batch Ingestion Using the Indexing Service
------------------------------------------

Batch ingestion for the indexing service is done by submitting an [Index Task](Tasks.html) (for datasets < 1G) or a [Hadoop Index Task](Tasks.html). The indexing service can be started by issuing:

```
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/overlord io.druid.cli.Main server overlord
```

This will start up a very simple local indexing service. For more complex deployments, see the [Indexing service](Indexing-Service.html) documentation.

The schema of the Hadoop Index Task contains a task "type" and a Hadoop Index Config. A sample Hadoop index task is shown below:

```json
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "dimensionsSpec" : {
            "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        {
          "type" : "count",
          "name" : "count"
        },
        {
          "type" : "doubleSum",
          "name" : "added",
          "fieldName" : "added"
        },
        {
          "type" : "doubleSum",
          "name" : "deleted",
          "fieldName" : "deleted"
        },
        {
          "type" : "doubleSum",
          "name" : "delta",
          "fieldName" : "delta"
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-08-31/2013-09-01" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/MyDirectory/examples/indexing/wikipedia_data.json"
      }
    },
    "tuningConfig" : {
      "type": "hadoop"
    }
  }
}
```

### DataSchema

This field is required.

See [Ingestion](Ingestion.html) for details.

### IOConfig

This field is required.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|This should always be 'hadoop'.|yes|
|inputSpec|Object|A specification of where to pull the data in from.|yes|

### TuningConfig

The tuningConfig is optional and default parameters will be used if no tuningConfig is specified. This is the same as the tuningConfig for the standalone Hadoop indexer. See above for more details.

### Running the Task

The Hadoop Index Config submitted as part of a Hadoop Index Task is identical to the Hadoop Index Config used by the `HadoopDruidIndexer`, except that three fields must be omitted: `segmentOutputPath`, `workingPath`, and `metadataUpdateSpec`. The Indexing Service takes care of setting these fields internally.
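
As a rough sketch of that difference, the Python snippet below strips the fields that the Indexing Service sets itself and wraps the rest in an `index_hadoop` task. It is not part of Druid, and the function name `to_index_hadoop_task` is made up for illustration. It assumes the field locations shown in the sample spec earlier on this page, where the metadata updater appears under `ioConfig` as `metadataUpdateSpec`; the name `updaterJobSpec` is removed too, in case it is present.

```python
import copy
import json

def to_index_hadoop_task(standalone_spec):
    """Turn a standalone Hadoop indexer spec into an index_hadoop task payload."""
    spec = copy.deepcopy(standalone_spec)  # leave the caller's spec untouched
    io_config = spec.get("ioConfig", {})
    io_config.pop("segmentOutputPath", None)
    for key in ("metadataUpdateSpec", "updaterJobSpec"):
        io_config.pop(key, None)
    spec.get("tuningConfig", {}).pop("workingPath", None)
    return {"type": "index_hadoop", "spec": spec}

standalone = {
    "dataSchema": {"dataSource": "wikipedia"},
    "ioConfig": {
        "type": "hadoop",
        "inputSpec": {"type": "static", "paths": "/MyDirectory/examples/indexing/wikipedia_data.json"},
        "metadataUpdateSpec": {"type": "metadata"},
        "segmentOutputPath": "/MyDirectory/data/index/output",
    },
    "tuningConfig": {"type": "hadoop", "workingPath": "/tmp"},
}
task = to_index_hadoop_task(standalone)
print(json.dumps(task, indent=2))
```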

To run the task:

```
curl -X 'POST' -H 'Content-Type:application/json' -d @example_index_hadoop_task.json localhost:8090/druid/indexer/v1/task
```
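
If you would rather submit the task from a script, the same POST can be sketched with Python's standard library. The helper names below are hypothetical, the URL is the one from the `curl` command above, and the response is returned as raw text because its format is not described here.

```python
import json
import urllib.request

OVERLORD_TASK_URL = "http://localhost:8090/druid/indexer/v1/task"

def build_task_request(task, url=OVERLORD_TASK_URL):
    """Build the HTTP POST that carries the task JSON."""
    body = json.dumps(task).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def submit_task(task, url=OVERLORD_TASK_URL):
    """Send the task to the indexing service and return the raw response body."""
    with urllib.request.urlopen(build_task_request(task, url)) as response:
        return response.read().decode("utf-8")

# Building the request does not touch the network; submit_task() does.
request = build_task_request({"type": "index_hadoop", "spec": {}})
print(request.get_method(), request.full_url)
```

To send a real task, load `example_index_hadoop_task.json` with `json.load` and pass the result to `submit_task`.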

If the task succeeds, you should see in the logs of the indexing service:

```
2013-10-16 16:38:31,945 INFO [pool-6-thread-1] io.druid.indexing.overlord.exec.TaskConsumer - Task SUCCESS: HadoopIndexTask...
```

### Remote Hadoop Cluster

If you have a remote Hadoop cluster, make sure to include the folder holding your configuration `*.xml` files in the classpath of the middle manager.

Having Problems?
----------------
Getting data into Druid can definitely be difficult for first-time users. Please don't hesitate to ask questions in our IRC channel or on our [Google Groups page](https://groups.google.com/forum/#!forum/druid-development).
-												Added prepend tag to make pages display.

											
										
										
											2013-09-16 17:49:36 -04:00
+								---
-												Docs working

											
										
										
											2013-09-26 19:22:28 -04:00
+								layout: doc_page
-												Added prepend tag to make pages display.

											
										
										
											2013-09-16 17:49:36 -04:00
+								---
-												added title

											
										
										
											2013-12-19 19:04:56 -05:00
 								# Batch Data Ingestion
-												fix the index task and more docs

											
										
										
											2014-01-10 17:47:18 -05:00
+								There are two choices for batch data ingestion to your Druid cluster, you can use the [Indexing service](Indexing-Service.html) or you can use the `HadoopDruidIndexer`.
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
 								Which should I use?
 								-------------------
-												address cr

											
										
										
											2014-12-09 17:19:18 -05:00
+								The [Indexing service](Indexing-Service.html) is a set of nodes that can run as part of your Druid cluster and can accomplish a number of different types of indexing tasks. Even if all you care about is batch indexing, it provides for the encapsulation of things like the [metadata store](Metadata-storage.html) that is used for segment metadata and other things, so that your indexing tasks do not need to include such information. The indexing service was created such that external systems could programmatically interact with it and run periodic indexing tasks. Long-term, the indexing service is going to be the preferred method of ingesting data.
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
-												fix the index task and more docs

											
										
										
											2014-01-10 17:47:18 -05:00
+								The `HadoopDruidIndexer` runs hadoop jobs in order to separate and index data segments. It takes advantage of Hadoop as a job scheduling and distributed job execution platform. It is a simple method if you already have Hadoop running and don’t want to spend the time configuring and deploying the [Indexing service](Indexing-Service.html) just yet.
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								## Batch Ingestion using the HadoopDruidIndexer
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
-												fix docs for 0.6 part 1 of many

											
										
										
											2013-10-07 17:47:04 -04:00
+								The HadoopDruidIndexer can be run like so:
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
-												Finish converting docs over to something that displays properly

											
										
										
											2013-09-27 20:08:34 -04:00
+								```
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_path> io.druid.cli.Main index hadoop <spec_file>
-												Finish converting docs over to something that displays properly

											
										
										
											2013-09-27 20:08:34 -04:00
+								```
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								## Hadoop "specFile"
 								The spec\_file is a path to a file that contains JSON and an example looks like:
-												Finish converting docs over to something that displays properly

											
										
										
											2013-09-27 20:08:34 -04:00
-												port docs over to 0.6 and a bunch of misc fixes

											
										
										
											2013-10-11 21:38:53 -04:00
+								```json
-												Finish converting docs over to something that displays properly

											
										
										
											2013-09-27 20:08:34 -04:00
+								{
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								  "dataSchema" : {
 								    "dataSource" : "wikipedia",
 								    "parser" : {
 								      "type" : "string",
 								      "parseSpec" : {
 								        "format" : "json",
 								        "timestampSpec" : {
 								          "column" : "timestamp",
 								          "format" : "auto"
 								        },
 								        "dimensionsSpec" : {
 								          "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
 								          "dimensionExclusions" : [],
 								          "spatialDimensions" : []
 								        }
 								      }
 								    },
 								    "metricsSpec" : [
-												port docs over to 0.6 and a bunch of misc fixes

											
										
										
											2013-10-11 21:38:53 -04:00
+								      {
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								        "type" : "count",
 								        "name" : "count"
-												port docs over to 0.6 and a bunch of misc fixes

											
										
										
											2013-10-11 21:38:53 -04:00
+								      },
 								      {
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								        "type" : "doubleSum",
 								        "name" : "added",
 								        "fieldName" : "added"
-												port docs over to 0.6 and a bunch of misc fixes

											
										
										
											2013-10-11 21:38:53 -04:00
+								      },
 								      {
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								        "type" : "doubleSum",
 								        "name" : "deleted",
 								        "fieldName" : "deleted"
 								      },
 								      {
 								        "type" : "doubleSum",
 								        "name" : "delta",
 								        "fieldName" : "delta"
-												port docs over to 0.6 and a bunch of misc fixes

											
										
										
											2013-10-11 21:38:53 -04:00
+								      }
 								    ],
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								    "granularitySpec" : {
 								      "type" : "uniform",
 								      "segmentGranularity" : "DAY",
 								      "queryGranularity" : "NONE",
 								      "intervals" : [ "2013-08-31/2013-09-01" ]
 								    }
-												Finish converting docs over to something that displays properly

											
										
										
											2013-09-27 20:08:34 -04:00
+								  },
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								  "ioConfig" : {
 								    "type" : "hadoop",
 								    "inputSpec" : {
 								      "type" : "static",
 								      "paths" : "/MyDirectory/examples/indexing/wikipedia_data.json"
 								    },
 								    "metadataUpdateSpec" : {
-												more fixes to docs

											
										
										
											2014-12-09 17:28:15 -05:00
+								      "type":"metadata",
 								      "connectURI" : "jdbc:metadata storage://localhost:3306/druid",
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								      "password" : "diurd",
 								      "segmentTable" : "druid_segments",
 								      "user" : "druid"
 								    },
 								    "segmentOutputPath" : "/MyDirectory/data/index/output"
-												More explicit example of jobProperties.

											
										
										
											2014-06-25 23:19:03 -04:00
+								  },
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								  "tuningConfig" : {
 								    "type" : "hadoop",
-												update batch ingest docs

											
										
										
											2015-01-08 17:38:10 -05:00
+								    "workingPath": "/tmp",
 								    "partitionsSpec" : {
 								      "type" : "dimension",
 								      "partitionDimension" : null,
 								      "targetPartitionSize" : 5000000,
 								      "maxPartitionSize" : 7500000,
 								      "assumeGrouped" : false,
 								      "numShards" : -1
 								    },
 								    "shardSpecs" : { },
 								    "leaveIntermediate" : false,
 								    "cleanupOnFailure" : true,
 								    "overwriteFiles" : false,
 								    "ignoreInvalidRows" : false,
 								    "jobProperties" : { },
 								    "combineText" : false,
 								    "persistInHeap" : false,
 								    "ingestOffheap" : false,
 								    "bufferSize" : 134217728,
 								    "aggregationBufferRatio" : 0.5,
 								    "rowFlushBoundary" : 300000
-												Finish converting docs over to something that displays properly

											
										
										
											2013-09-27 20:08:34 -04:00
+								  }
 								}
 								```
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								### DataSchema
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								This field is required.
 								See [Ingestion](Ingestion.html)
 								### IOConfig
 								This field is required.
 								|Field|Type|Description|Required|
 								|-----|----|-----------|--------|
 								|type|String|This should always be 'hadoop'.|yes|
 								|pathSpec|Object|a specification of where to pull the data in from|yes|
 								|segmentOutputPath|String|the path to dump segments into.|yes|
 								|metadataUpdateSpec|Object|a specification of how to update the metadata for the druid cluster these segments belong to.|yes|
 								#### Path specification
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
 								There are multiple types of path specification:
-												fix endpoint bugs and more docs

											
										
										
											2014-04-03 20:01:33 -04:00
+								##### `static`
 								Is a type of data loader where a static path to where the data files are located is passed.
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								|Field|Type|Description|Required|
 								|-----|----|-----------|--------|
 								|paths|Array of String|A String of input paths indicating where the raw data is located.|yes|
-												fix endpoint bugs and more docs

											
										
										
											2014-04-03 20:01:33 -04:00
 								For example, using the static input paths:
 								```
 								"paths" : "s3n://billy-bucket/the/data/is/here/data.gz, s3n://billy-bucket/the/data/is/here/moredata.gz, s3n://billy-bucket/the/data/is/here/evenmoredata.gz"
 								```
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
+								##### `granularity`
 								Is a type of data loader that expects data to be laid out in a specific path format. Specifically, it expects it to be segregated by day in this directory format `y=XXXX/m=XX/d=XX/H=XX/M=XX/S=XX` (dates are represented by lowercase, time is represented by uppercase).
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								|Field|Type|Description|Required|
 								|-----|----|-----------|--------|
 								|dataGranularity|Object|specifies the granularity to expect the data at, e.g. hour means to expect directories `y=XXXX/m=XX/d=XX/H=XX`.|yes|
 								|inputPath|String|Base path to append the expected time path to.|yes|
 								|filePattern|String|Pattern that files should match to be included.|yes|
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
 								For example, if the sample config were run with the interval 2012-06-01/2012-06-02, it would expect data at the paths
-												Finish converting docs over to something that displays properly

											
										
										
											2013-09-27 20:08:34 -04:00
+								```
 								s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=00
 								s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=01
 								...
 								s3n://billy-bucket/the/data/is/here/y=2012/m=06/d=01/H=23
 								```
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								#### Metadata Update Job Spec
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								This is a specification of the properties that tell the job how to update metadata such that the Druid cluster will see the output segments and load them.
-												Add docs from github wiki

											
										
										
											2013-09-13 18:20:39 -04:00
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
 								|Field|Type|Description|Required|
 								|-----|----|-----------|--------|
 								|type|String|"metadata" is the only value available.|yes|
-												more fixes to docs

											
										
										
											2014-12-09 17:28:15 -05:00
+								|connectURI|String|A valid JDBC url to metadata storage.|yes|
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								|user|String|Username for db.|yes|
-												address cr

											
										
										
											2014-12-09 17:19:18 -05:00
+								|password|String|password for db.|yes|
-												redocumenting ingestion

											
										
										
											2014-12-08 19:15:46 -05:00
+								|segmentTable|String|Table to use in DB.|yes|
 								These properties should parrot what you have configured for your [Coordinator](Coordinator.html).
 								### TuningConfig
 								The tuningConfig is optional and default parameters will be used if no tuningConfig is specified.
 								|Field|Type|Description|Required|
 								|-----|----|-----------|--------|
 								|workingPath|String|the working path to use for intermediate results (results between Hadoop jobs).|no (default == '/tmp/druid-indexing')|
 								|version|String|The version of created segments.|no (default == datetime that indexing starts at)|
 								|leaveIntermediate|leave behind files in the workingPath when job completes or fails (debugging tool).|no (default == false)|
 								|partitionsSpec|Object|a specification of how to partition each time bucket into segments, absence of this property means no partitioning will occur.More details below.|no (default == 'hashed'|
 								|maxRowsInMemory|Integer|The number of rows to aggregate before persisting. This number is the post-aggregation rows, so it is not equivalent to the number of input events, but the number of aggregated rows that those events result in. This is used to manage the required JVM heap size.|no (default == 5 million)|
 								|cleanupOnFailure|Boolean|Cleans up intermediate files when the job fails as opposed to leaving them around for debugging.|no (default == true)|
 								|overwriteFiles|Boolean|Override existing files found during indexing.|no (default == false)|
 								|ignoreInvalidRows|Boolean|Ignore rows found to have problems.|no (default == false)|
 								|jobProperties|Object|a map of properties to add to the Hadoop job configuration.|no (default == null)|
-												Add docs from github wiki

											
										
										
### Partitioning specification

Segments are always partitioned based on timestamp (according to the granularitySpec) and may be further partitioned in
some other way depending on partition type. Druid supports two types of partitioning strategies: "hashed" (based on the
hash of all dimensions in each row), and "dimension" (based on ranges of a single dimension).

Hashed partitioning is recommended in most cases, as it will improve indexing performance and create more uniformly
sized data segments relative to single-dimension partitioning.

											
										
										
#### Hash-based partitioning

```json
"partitionsSpec": {
  "type": "hashed",
  "targetPartitionSize": 5000000
}
```

Hashed partitioning works by first selecting a number of segments, and then partitioning rows across those segments
according to the hash of all dimensions in each row. The number of segments is determined automatically based on the
cardinality of the input set and a target partition size.
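The two steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Druid's implementation: the function names, the SHA-1 hash, and the simple ceiling division used to pick the segment count are all assumptions made for the example.

```python
import hashlib
import math


def choose_num_shards(estimated_rows: int, target_partition_size: int) -> int:
    """Pick a shard count so each shard holds roughly target_partition_size rows.

    estimated_rows stands in for the cardinality of the input set
    (how it is estimated is outside the scope of this sketch).
    """
    return max(1, math.ceil(estimated_rows / target_partition_size))


def shard_for_row(dimensions: dict, num_shards: int) -> int:
    """Map a row to a shard using a stable hash of all of its dimension values."""
    # Sort by dimension name so the same row always serializes the same way.
    key = "\x01".join(f"{name}={dimensions[name]}" for name in sorted(dimensions))
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


num_shards = choose_num_shards(estimated_rows=12_000_000, target_partition_size=5_000_000)
row = {"page": "Druid", "language": "en", "user": "alice"}
print(num_shards, shard_for_row(row, num_shards))
```

Because the shard depends only on the dimension values, rows with identical dimensions always land in the same segment.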

											
										
										
The configuration options are:

|property|description|required?|
|--------|-----------|---------|
|type|type of partitionsSpec to be used|"hashed"|
|targetPartitionSize|target number of rows to include in a partition; should be a number that targets segments of 500MB\~1GB.|either this or numShards|
|numShards|specify the number of partitions directly, instead of a target partition size. Ingestion will run faster, since it can skip the step necessary to select a number of partitions automatically.|either this or targetPartitionSize|
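As a small illustration of the "either this or" rule in the table, the following Python helper builds a hashed `partitionsSpec` from one of the two sizing options. The helper name is hypothetical, and treating the two options as strictly mutually exclusive is an assumption drawn from the table's wording.

```python
import json


def hashed_partitions_spec(target_partition_size=None, num_shards=None) -> dict:
    """Build a "hashed" partitionsSpec from exactly one of the two sizing options."""
    # Assumption: targetPartitionSize and numShards are mutually exclusive.
    if (target_partition_size is None) == (num_shards is None):
        raise ValueError("set exactly one of targetPartitionSize or numShards")
    spec = {"type": "hashed"}
    if num_shards is not None:
        spec["numShards"] = num_shards
    else:
        spec["targetPartitionSize"] = target_partition_size
    return spec


print(json.dumps({"partitionsSpec": hashed_partitions_spec(num_shards=4)}, indent=2))
```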
#### Single-dimension partitioning

```json
"partitionsSpec": {
  "type": "dimension",
  "targetPartitionSize": 5000000
}
```

Single-dimension partitioning works by first selecting a dimension to partition on, and then separating that dimension
into contiguous ranges. Each segment will contain all rows with values of that dimension in that range. For example,
your segments may be partitioned on the dimension "host" using the ranges "a.example.com" to "f.example.com" and
"f.example.com" to "z.example.com". By default, the dimension to use is determined automatically, although you can
override it with a specific dimension.

The configuration options are:

|property|description|required?|
|--------|-----------|---------|
|type|type of partitionsSpec to be used|"dimension"|
|targetPartitionSize|target number of rows to include in a partition; should be a number that targets segments of 500MB\~1GB.|yes|
|maxPartitionSize|maximum number of rows to include in a partition. Defaults to 50% larger than the targetPartitionSize.|no|
|partitionDimension|the dimension to partition on. Leave blank to select a dimension automatically.|no|
|assumeGrouped|assume input data has already been grouped on time and dimensions. Ingestion will run faster, but can choose suboptimal partitions if the assumption is violated.|no|
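To make the range idea concrete, here is a minimal Python sketch of looking up the partition for a value of the partition dimension, using the "host" example from above, plus the documented default for `maxPartitionSize`. The function names are hypothetical, and the half-open range convention (a value equal to a split point belongs to the range that starts there) is an assumption of the sketch, not something this page specifies.

```python
from bisect import bisect_right


def default_max_partition_size(target_partition_size: int) -> int:
    """Default maxPartitionSize: 50% larger than targetPartitionSize."""
    return target_partition_size + target_partition_size // 2


def partition_for_value(value: str, boundaries: list) -> int:
    """Return the index of the contiguous range that contains value.

    boundaries holds the sorted split points between ranges. Assumption:
    a value equal to a split point goes to the range starting at that point.
    """
    return bisect_right(boundaries, value)


boundaries = ["f.example.com"]  # one split point, so two ranges
for host in ["a.example.com", "f.example.com", "m.example.com"]:
    print(host, "-> partition", partition_for_value(host, boundaries))
print(default_max_partition_size(5_000_000))
```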

											
										
										
### Remote Hadoop Cluster

If you have a remote Hadoop cluster, make sure to include the folder holding your configuration `*.xml` files in the classpath of the indexer.
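As a rough sketch, the Python snippet below assembles the `HadoopDruidIndexer` command line shown earlier on this page with the Hadoop configuration folder appended to the classpath. The helper name and both example paths (`/etc/hadoop/conf`, `my_spec.json`) are hypothetical; substitute the locations used in your environment.

```python
import shlex


def indexer_command(hadoop_conf_dir: str, spec_file: str) -> list:
    """Build the indexer command with the Hadoop config folder on the classpath."""
    classpath = ":".join(["lib/*", hadoop_conf_dir])
    return [
        "java", "-Xmx256m", "-Duser.timezone=UTC", "-Dfile.encoding=UTF-8",
        "-classpath", classpath,
        "io.druid.cli.Main", "index", "hadoop", spec_file,
    ]


# Both paths below are placeholders for illustration only.
cmd = indexer_command("/etc/hadoop/conf", "my_spec.json")
print(" ".join(shlex.quote(arg) for arg in cmd))
```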

											
										
										
Batch Ingestion Using the Indexing Service
------------------------------------------

Batch ingestion for the indexing service is done by submitting an [Index Task](Tasks.html) (for datasets < 1G) or a [Hadoop Index Task](Tasks.html). The indexing service can be started by issuing:

```
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/overlord io.druid.cli.Main server overlord
```

This will start up a very simple local indexing service. For more complex deployments, see the [Indexing Service](Indexing-Service.html) documentation.

											
										
										
The schema of the Hadoop Index Task contains a task "type" and a Hadoop Index Config. A sample Hadoop index task is shown below:

```json
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "dimensionsSpec" : {
            "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        {
          "type" : "count",
          "name" : "count"
        },
        {
          "type" : "doubleSum",
          "name" : "added",
          "fieldName" : "added"
        },
        {
          "type" : "doubleSum",
          "name" : "deleted",
          "fieldName" : "deleted"
        },
        {
          "type" : "doubleSum",
          "name" : "delta",
          "fieldName" : "delta"
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-08-31/2013-09-01" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "/MyDirectory/examples/indexing/wikipedia_data.json"
      }
    },
    "tuningConfig" : {
      "type": "hadoop"
    }
  }
}
```

											
										
										
### DataSchema

This field is required. See [Ingestion](Ingestion.html).

### IOConfig

This field is required.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|This should always be 'hadoop'.|yes|
|inputSpec|Object|A specification of where to pull the data in from.|yes|

### TuningConfig

The tuningConfig is optional and default parameters will be used if no tuningConfig is specified. This is the same as the tuningConfig for the standalone Hadoop indexer. See above for more details.
### Running the Task

The Hadoop Index Config submitted as part of a Hadoop Index Task is identical to the Hadoop Index Config used by the `HadoopDruidIndexer`, except that three fields must be omitted: `segmentOutputPath`, `workingPath`, and `updaterJobSpec`. The Indexing Service sets these fields internally.
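If you already have a spec file for the standalone indexer, one way to reuse it is to drop the three fields and wrap the result in a task. The Python sketch below does this recursively because it does not assume where in the spec the fields live; the function names are hypothetical, and the field placement in the sample input is made up for illustration.

```python
MANAGED_FIELDS = {"segmentOutputPath", "workingPath", "updaterJobSpec"}


def strip_managed_fields(node):
    """Return a copy of node without the fields the Indexing Service sets itself."""
    if isinstance(node, dict):
        return {
            key: strip_managed_fields(value)
            for key, value in node.items()
            if key not in MANAGED_FIELDS
        }
    if isinstance(node, list):
        return [strip_managed_fields(item) for item in node]
    return node


def to_hadoop_index_task(standalone_spec: dict) -> dict:
    """Wrap a standalone indexer spec as an index_hadoop task."""
    return {"type": "index_hadoop", "spec": strip_managed_fields(standalone_spec)}


# Sample input; where the managed fields appear here is an assumption.
standalone = {
    "dataSchema": {"dataSource": "wikipedia"},
    "ioConfig": {"type": "hadoop", "segmentOutputPath": "/tmp/segments"},
    "tuningConfig": {"type": "hadoop", "workingPath": "/tmp/working"},
}
print(to_hadoop_index_task(standalone))
```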

											
										
										
To run the task:

```
curl -X 'POST' -H 'Content-Type:application/json' -d @example_index_hadoop_task.json localhost:8090/druid/indexer/v1/task
```

If the task succeeds, you should see in the logs of the indexing service:

```
-10-16 16:38:31,945 INFO [pool-6-thread-1] io.druid.indexing.overlord.exec.TaskConsumer - Task SUCCESS: HadoopIndexTask...
```
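For programmatic submission, the same POST can be issued from Python's standard library. This is a sketch that mirrors the `curl` call above: the function names are hypothetical, the `http://` scheme is assumed, and the format of the response body is not described on this page, so it is returned as raw text.

```python
import json
import urllib.request

# Same host, port and path as the curl example; the http scheme is assumed.
OVERLORD_TASK_URL = "http://localhost:8090/druid/indexer/v1/task"


def build_task_request(task: dict, url: str = OVERLORD_TASK_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request that submits a task."""
    body = json.dumps(task).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(task: dict) -> str:
    """Send the task to the indexing service and return the raw response body."""
    with urllib.request.urlopen(build_task_request(task)) as response:
        return response.read().decode("utf-8")
```

With a running indexing service, loading `example_index_hadoop_task.json` with `json.load` and passing the result to `submit_task` performs the same request as the `curl` command.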

											
										
										
### Remote Hadoop Cluster

If you have a remote Hadoop cluster, make sure to include the folder holding your configuration `*.xml` files in the classpath of the middle manager.

											
										
										
Having Problems?
----------------

Getting data into Druid can definitely be difficult for first-time users. Please don't hesitate to ask questions in our IRC channel or on our [Google Groups page](https://groups.google.com/forum/#!forum/druid-development).