---
id: hdfs
title: "HDFS"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
To use this Apache Druid extension, [include](../../development/extensions.md#loading-extensions) `druid-hdfs-storage` in the extensions load list and run druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.
## Deep Storage
### Configuration for HDFS
|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|hdfs||Must be set.|
|`druid.storage.storageDirectory`||Directory for storing segments.|Must be set.|
|`druid.hadoop.security.kerberos.principal`|`druid@EXAMPLE.COM`| Principal user name |empty|
|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty|
Besides the settings above, you also need to include all Hadoop configuration files (such as `core-site.xml` and `hdfs-site.xml`)
in the Druid classpath. One way to do this is to copy those files into `${DRUID_HOME}/conf/_common`.
If you are using Hadoop ingestion, set your output directory to a location on Hadoop and it will work.

If you want to eagerly authenticate against a secured Hadoop/HDFS cluster, you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`. This is an alternative to a cron job that runs the `kinit` command periodically.
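
Putting these settings together, the deep storage section of `common.runtime.properties` might look like the following sketch. The namenode host, port, and segment path are illustrative placeholders, not values prescribed by this extension.

```properties
# Load the HDFS deep storage extension
druid.extensions.loadList=["druid-hdfs-storage"]

# Store segments in HDFS (hypothetical namenode address and path)
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode.example.com:8020/druid/segments

# Optional: eager Kerberos authentication instead of a periodic kinit cron job
druid.hadoop.security.kerberos.principal=druid@EXAMPLE.COM
druid.hadoop.security.kerberos.keytab=/etc/security/keytabs/druid.headlessUser.keytab
```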
### Configuration for Cloud Storage
You can also use AWS S3 or Google Cloud Storage as deep storage via HDFS.
#### Configuration for AWS S3
To use AWS S3 as deep storage, you need to configure `druid.storage.storageDirectory` properly.
|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|hdfs||Must be set.|
|`druid.storage.storageDirectory`|`s3a://bucket/example/directory` or `s3n://bucket/example/directory`|Path to the deep storage|Must be set.|
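
For example, a hypothetical configuration pointing deep storage at an S3 bucket through the `s3a` filesystem might look like this (the bucket name and path are placeholders):

```properties
druid.storage.type=hdfs
druid.storage.storageDirectory=s3a://my-druid-bucket/druid/segments
```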

You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/), especially the `hadoop-aws.jar`, in the Druid classpath.
Run the command below to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` on all nodes.
```bash
java -classpath "${DRUID_HOME}/lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}"
cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
```

Finally, add the properties below to `core-site.xml`.
For more configuration options, see the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/).

```xml
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  <description>The implementation class of the S3A Filesystem</description>
</property>
<property>
  <name>fs.AbstractFileSystem.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3A</value>
  <description>The implementation class of the S3A AbstractFileSystem.</description>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
  <value>your access key</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
  <value>your secret key</value>
</property>
```
#### Configuration for Google Cloud Storage
To use Google Cloud Storage as deep storage, you need to configure `druid.storage.storageDirectory` properly.
|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|hdfs||Must be set.|
|`druid.storage.storageDirectory`|`gs://bucket/example/directory`|Path to the deep storage|Must be set.|
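
For example, a hypothetical deep storage configuration using a GCS bucket might look like this (the bucket name and path are placeholders):

```properties
druid.storage.type=hdfs
druid.storage.storageDirectory=gs://my-druid-bucket/druid/segments
```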

All services that need to access GCS must have the [GCS connector jar](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#other_sparkhadoop_clusters) in their classpath.
Read the [install instructions](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
to properly set up the necessary libraries and configurations.
One option is to place this jar in both `${DRUID_HOME}/lib/` and `${DRUID_HOME}/extensions/druid-hdfs-storage/`.

Finally, configure the `core-site.xml` file with the filesystem
and authentication properties needed for GCS. You may want to copy the
example properties below. Follow the instructions at
[https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
for more details.
For more configuration options, see the [GCS core default](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.0.0/gcs/conf/gcs-core-default.xml)
and the [GCS core template](https://github.com/GoogleCloudPlatform/bdutil/blob/master/conf/hadoop2/gcs-core-template.xml).

```xml
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  <description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  <description>The AbstractFileSystem for gs: uris.</description>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
  <description>
    Whether to use a service account for GCS authorization.
    Setting this property to `false` will disable use of service accounts for
    authentication.
  </description>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/keyfile</value>
  <description>
    The JSON key file of the service account used for GCS
    access when google.cloud.auth.service.account.enable is true.
  </description>
</property>
```
Tested with Druid 0.17.0, Hadoop 2.8.5, and gcs-connector jar 2.0.0-hadoop2.

## Reading data from HDFS or Cloud Storage
### Native batch ingestion
The [HDFS input source](../../ingestion/native-batch-input-source.md#hdfs-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md)
to read files directly from HDFS storage. You may be able to read objects from cloud storage
with the HDFS input source, but we highly recommend using the dedicated
[input source](../../ingestion/native-batch-input-source.md) for your storage instead, if one exists, because
it is simpler to set up. For now, only the [S3 input source](../../ingestion/native-batch-input-source.md#s3-input-source)
and the [Google Cloud Storage input source](../../ingestion/native-batch-input-source.md#google-cloud-storage-input-source)
are supported for cloud storage types, so you may still want to use the HDFS input source
to read from cloud storage other than those two.
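
As a sketch, the `ioConfig` portion of a Parallel task spec reading from HDFS might look like the following. The namenode address and data path are hypothetical placeholders, and the input format is assumed to be newline-delimited JSON.

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "hdfs",
    "paths": "hdfs://namenode.example.com:8020/path/to/data/"
  },
  "inputFormat": {
    "type": "json"
  }
}
```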
### Hadoop-based ingestion
If you use the [Hadoop ingestion](../../ingestion/hadoop.md), you can read data from HDFS
by specifying the paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).
See the [Static](../../ingestion/hadoop.md#static) inputSpec for details.
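
For example, a static `inputSpec` pointing at HDFS paths might look like the following sketch, where the namenode address and comma-separated data paths are illustrative placeholders:

```json
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "paths": "hdfs://namenode.example.com:8020/path/to/data1,hdfs://namenode.example.com:8020/path/to/data2"
  }
}
```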