druid/docs/content/ingestion/command-line-hadoop-indexer.md

---
layout: doc_page
title: "Command Line Hadoop Indexer"
---

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

# Command Line Hadoop Indexer

To run:

```
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_dir> org.apache.druid.cli.Main index hadoop <spec_file>
```

## Options

- "--coordinate" - provide a version of Apache Hadoop to use. This property will override the default Hadoop coordinates. Once specified, Apache Druid (incubating) will look for those Hadoop dependencies from the location specified by `druid.extensions.hadoopDependenciesDir`.
- "--no-default-hadoop" - don't pull down the default hadoop version

## Spec file

The spec file needs to contain a JSON object where the contents are the same as the "spec" field in the Hadoop index task. See [Hadoop Batch Ingestion](../ingestion/hadoop.html) for details on the spec format. 

In addition, a `metadataUpdateSpec` and `segmentOutputPath` field needs to be added to the ioConfig:

```
      "ioConfig" : {
        ...
        "metadataUpdateSpec" : {
          "type":"mysql",
          "connectURI" : "jdbc:mysql://localhost:3306/druid",
          "password" : "diurd",
          "segmentTable" : "druid_segments",
          "user" : "druid"
        },
        "segmentOutputPath" : "/MyDirectory/data/index/output"
      },
```    

and a `workingPath` field needs to be added to the tuningConfig:

```
  "tuningConfig" : {
   ...
    "workingPath": "/tmp",
    ...
  }
```    

#### Metadata Update Job Spec

This is a specification of the properties that tell the job how to update metadata such that the Druid cluster will see the output segments and load them.

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|"metadata" is the only value available.|yes|
|connectURI|String|A valid JDBC url to metadata storage.|yes|
|user|String|Username for db.|yes|
|password|String|password for db.|yes|
|segmentTable|String|Table to use in DB.|yes|

These properties should parrot what you have configured for your [Coordinator](../design/coordinator.html).

#### segmentOutputPath Config

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|segmentOutputPath|String|the path to dump segments into.|yes|

#### workingPath Config

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|workingPath|String|the working path to use for intermediate results (results between Hadoop jobs).|no (default == '/tmp/druid-indexing')|
 
Please note that the command line Hadoop indexer doesn't have the locking capabilities of the indexing service, so if you choose to use it, 
you have to take caution to not override segments created by real-time processing (if you that a real-time pipeline set up).
Front Matter header needs to be on the first line for md to be rendered properly by jekyll (#6733) 2018-12-13 14:47:20 -05:00			`---`
			`layout: doc_page`
			`title: "Command Line Hadoop Indexer"`
			`---`

add missing license headers, in particular to MD files; clean up RAT … (#6563) * add missing license headers, in particular to MD files; clean up RAT exclusions * revert inadvertent doc changes * docs * cr changes * fix modified druid-production.svg 2018-11-13 12:38:37 -05:00			`<!--`
			`~ Licensed to the Apache Software Foundation (ASF) under one`
			`~ or more contributor license agreements. See the NOTICE file`
			`~ distributed with this work for additional information`
			`~ regarding copyright ownership. The ASF licenses this file`
			`~ to you under the Apache License, Version 2.0 (the`
			`~ "License"); you may not use this file except in compliance`
			`~ with the License. You may obtain a copy of the License at`
			`~`
			`~ http://www.apache.org/licenses/LICENSE-2.0`
			`~`
			`~ Unless required by applicable law or agreed to in writing,`
			`~ software distributed under the License is distributed on an`
			`~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY`
			`~ KIND, either express or implied. See the License for the`
			`~ specific language governing permissions and limitations`
			`~ under the License.`
			`-->`

new quickstart 2016-01-06 00:27:52 -05:00			`# Command Line Hadoop Indexer`

			`To run:`

			```
Rename io.druid to org.apache.druid. (#6266) * Rename io.druid to org.apache.druid. * Fix META-INF files and remove some benchmark results. * MonitorsConfig update for metrics package migration. * Reorder some dimensions in inner queries for some reason. * Fix protobuf tests. 2018-08-30 12:56:26 -04:00			`java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_dir> org.apache.druid.cli.Main index hadoop <spec_file>`
new quickstart 2016-01-06 00:27:52 -05:00			```

add doc rendering 2016-02-04 14:53:09 -05:00			`## Options`

Add more Apache branding to docs (#7515) 2019-04-19 18:52:26 -04:00			- "--coordinate" - provide a version of Apache Hadoop to use. This property will override the default Hadoop coordinates. Once specified, Apache Druid (incubating) will look for those Hadoop dependencies from the location specified by `druid.extensions.hadoopDependenciesDir`.
add doc rendering 2016-02-04 14:53:09 -05:00			`- "--no-default-hadoop" - don't pull down the default hadoop version`

			`## Spec file`

Docs consistency cleanup (#6259) 2018-09-04 15:54:41 -04:00			`The spec file needs to contain a JSON object where the contents are the same as the "spec" field in the Hadoop index task. See [Hadoop Batch Ingestion](../ingestion/hadoop.html) for details on the spec format.`

			In addition, a `metadataUpdateSpec` and `segmentOutputPath` field needs to be added to the ioConfig:
new quickstart 2016-01-06 00:27:52 -05:00
			```
			`"ioConfig" : {`
			`...`
			`"metadataUpdateSpec" : {`
			`"type":"mysql",`
			`"connectURI" : "jdbc:mysql://localhost:3306/druid",`
			`"password" : "diurd",`
			`"segmentTable" : "druid_segments",`
			`"user" : "druid"`
			`},`
			`"segmentOutputPath" : "/MyDirectory/data/index/output"`
			`},`
			```

Docs consistency cleanup (#6259) 2018-09-04 15:54:41 -04:00			and a `workingPath` field needs to be added to the tuningConfig:
new quickstart 2016-01-06 00:27:52 -05:00
			```
			`"tuningConfig" : {`
			`...`
			`"workingPath": "/tmp",`
			`...`
			`}`
			```

			`#### Metadata Update Job Spec`

			`This is a specification of the properties that tell the job how to update metadata such that the Druid cluster will see the output segments and load them.`

			`\|Field\|Type\|Description\|Required\|`
			`\|-----\|----\|-----------\|--------\|`
			`\|type\|String\|"metadata" is the only value available.\|yes\|`
			`\|connectURI\|String\|A valid JDBC url to metadata storage.\|yes\|`
			`\|user\|String\|Username for db.\|yes\|`
			`\|password\|String\|password for db.\|yes\|`
			`\|segmentTable\|String\|Table to use in DB.\|yes\|`

			`These properties should parrot what you have configured for your [Coordinator](../design/coordinator.html).`

			`#### segmentOutputPath Config`

			`\|Field\|Type\|Description\|Required\|`
			`\|-----\|----\|-----------\|--------\|`
			`\|segmentOutputPath\|String\|the path to dump segments into.\|yes\|`

			`#### workingPath Config`

			`\|Field\|Type\|Description\|Required\|`
			`\|-----\|----\|-----------\|--------\|`
			`\|workingPath\|String\|the working path to use for intermediate results (results between Hadoop jobs).\|no (default == '/tmp/druid-indexing')\|`

			`Please note that the command line Hadoop indexer doesn't have the locking capabilities of the indexing service, so if you choose to use it,`
			`you have to take caution to not override segments created by real-time processing (if you that a real-time pipeline set up).`