---
id: hdfs
title: "HDFS"
---

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->


To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-hdfs-storage` as an extension.

## Deep Storage

### Configuration for HDFS

|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|hdfs|Type of deep storage to use.|Must be set.|
|`druid.storage.storageDirectory`||Directory for storing segments.|Must be set.|
|`druid.hadoop.security.kerberos.principal`|`druid@EXAMPLE.COM`| Principal user name |empty|
|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty|

If you are using the Hadoop indexer, set your output directory to a location on Hadoop.
To authenticate eagerly against a secured Hadoop/HDFS cluster, set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`. This is an alternative to running the `kinit` command periodically from a cron job.

### Configuration for Google Cloud Storage

The HDFS extension can also be used for GCS as deep storage.

|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|hdfs|Type of deep storage to use.|Must be set.|
|`druid.storage.storageDirectory`|`gs://bucket/example/directory`|Directory for storing segments.|Must be set.|

All services that need to access GCS must have the [GCS connector jar](https://cloud.google.com/hadoop/google-cloud-storage-connector#manualinstallation) in their class path. One option is to place this jar in both `<druid>/lib/` and `<druid>/extensions/druid-hdfs-storage/`.

Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.

<a name="firehose"></a>

## Native batch ingestion

This firehose ingests events from a predefined list of files from a Hadoop filesystem.
This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md#parallel-task).
Since each split represents an HDFS file, each worker task of `index_parallel` reads one file.

Sample spec:

```json
"firehose" : {
    "type" : "hdfs",
    "paths": "/foo/bar,/foo/baz"
}
```

This firehose provides caching and prefetching features. During native batch indexing, a firehose can be read twice if
`intervals` are not specified, and, in this case, caching can be useful. Prefetching is preferred when direct scanning
of files is slow.
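
As an illustration of this trade-off, the following sketch assumes that `intervals` is already specified in the ingestion spec, so the input is expected to be read only once and the cache is switched off by setting `maxCacheCapacityBytes` to 0. Prefetching keeps its default behavior. The paths are placeholders, and the object shown is the value of the `firehose` field.

```json
{
  "type": "hdfs",
  "paths": "/foo/bar,/foo/baz",
  "maxCacheCapacityBytes": 0
}
```

Whether disabling the cache actually helps depends on the workload; treat this as a starting point rather than a recommendation.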

|Property|Description|Default|
|--------|-----------|-------|
|type|This should be `hdfs`.|none (required)|
|paths|HDFS paths. Can be either a JSON array or comma-separated string of paths. Wildcards like `*` are supported in these paths.|none (required)|
|maxCacheCapacityBytes|Maximum size of the cache space in bytes. Set to 0 to disable caching. Cached files are not removed until the ingestion task completes.|1073741824 (1 GiB)|
|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. Set to 0 to disable prefetching. Prefetched files are removed immediately once they are read.|1073741824 (1 GiB)|
|prefetchTriggerBytes|Threshold to trigger prefetching files.|maxFetchCapacityBytes / 2|
|fetchTimeout|Timeout in milliseconds for fetching each file.|60000|
|maxFetchRetry|Maximum number of retries for fetching each file.|3|
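
To show how the properties in the table fit together, here is a sketch of a firehose object that gives `paths` as a JSON array with a wildcard and writes out every optional property at its documented default. The paths are placeholders, and `prefetchTriggerBytes` is set to 536870912, which is the default `maxFetchCapacityBytes` divided by 2. The object shown is the value of the `firehose` field.

```json
{
  "type": "hdfs",
  "paths": ["/foo/bar/*.json", "/foo/baz"],
  "maxCacheCapacityBytes": 1073741824,
  "maxFetchCapacityBytes": 1073741824,
  "prefetchTriggerBytes": 536870912,
  "fetchTimeout": 60000,
  "maxFetchRetry": 3
}
```

Because these values match the defaults, omitting the optional properties should give the same behavior; listing them explicitly only makes the configuration easier to adjust later.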