To use this Apache Druid extension, make sure to [include](../../development/extensions.md#loading-extensions) `druid-hdfs-storage` as an extension and run druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.
If you want to eagerly authenticate against a secured hadoop/hdfs cluster you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job method that runs `kinit` command periodically.
You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html), especially the `hadoop-aws.jar` in the Druid classpath.
Run the below command to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` in all nodes.
|`druid.storage.storageDirectory`|gs://bucket/example/directory|Path to the deep storage|Must be set.|
All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#other_sparkhadoop_clusters) in their class path.
Please read the [install instructions](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
to properly set up the necessary libraries and configurations.
One option is to place this jar in `${DRUID_HOME}/lib/` and `${DRUID_HOME}/extensions/druid-hdfs-storage/`.
Finally, you need to configure the `core-site.xml` file with the filesystem
and authentication properties needed for GCS. You may want to copy the below
example properties. Please follow the instructions at
Tested with Druid 0.17.0, Hadoop 2.8.5 and gcs-connector jar 2.0.0-hadoop2.
## Reading data from HDFS or Cloud Storage
### Native batch ingestion
The [HDFS input source](../../ingestion/native-batch.md#hdfs-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md#parallel-task)
to read files directly from the HDFS Storage. You may be able to read objects from cloud storage
with the HDFS input source, but we highly recommend to use a proper
[Input Source](../../ingestion/native-batch.md#input-sources) instead if possible because
it is simple to set up. For now, only the [S3 input source](../../ingestion/native-batch.md#s3-input-source)
and the [Google Cloud Storage input source](../../ingestion/native-batch.md#google-cloud-storage-input-source)
are supported for cloud storage types, and so you may still want to use the HDFS input source
to read from cloud storage other than those two.
### Hadoop-based ingestion
If you use the [Hadoop ingestion](../../ingestion/hadoop.md), you can read data from HDFS
by specifying the paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).
See the [Static](../../ingestion/hadoop.md#static) inputSpec for details.