druid/hdfs.md at f5402169319737381d005ab9a69a84c3857d4ba1

4.0 KiB

Raw Blame History

id	title
hdfs	HDFS

To use this Apache Druid extension, make sure to include druid-hdfs-storage as an extension.

Deep Storage

Configuration for HDFS

Property	Possible Values	Description	Default
`druid.storage.type`	hdfs		Must be set.
`druid.storage.storageDirectory`		Directory for storing segments.	Must be set.
`druid.hadoop.security.kerberos.principal`	`druid@EXAMPLE.COM`	Principal user name	empty
`druid.hadoop.security.kerberos.keytab`	`/etc/security/keytabs/druid.headlessUser.keytab`	Path to keytab file	empty

If you are using the Hadoop indexer, set your output directory to be a location on Hadoop and it will work. If you want to eagerly authenticate against a secured hadoop/hdfs cluster you must set druid.hadoop.security.kerberos.principal and druid.hadoop.security.kerberos.keytab, this is an alternative to the cron job method that runs kinit command periodically.

Configuration for Google Cloud Storage

The HDFS extension can also be used for GCS as deep storage.

Property	Possible Values	Description	Default
`druid.storage.type`	hdfs		Must be set.
`druid.storage.storageDirectory`		gs://bucket/example/directory	Must be set.

All services that need to access GCS need to have the GCS connector jar in their class path. One option is to place this jar in /lib/ and /extensions/druid-hdfs-storage/

Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.

Native batch ingestion

This firehose ingests events from a predefined list of files from a Hadoop filesystem. This firehose is splittable and can be used by native parallel index tasks. Since each split represents an HDFS file, each worker task of index_parallel will read an object.

Sample spec:

"firehose" : {
    "type" : "hdfs",
    "paths": "/foo/bar,/foo/baz"
}

This firehose provides caching and prefetching features. During native batch indexing, a firehose can be read twice if intervals are not specified, and, in this case, caching can be useful. Prefetching is preferred when direct scanning of files is slow.

Property	Description	Default
type	This should be `hdfs`.	none (required)
paths	HDFS paths. Can be either a JSON array or comma-separated string of paths. Wildcards like `*` are supported in these paths.	none (required)
maxCacheCapacityBytes	Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.	1073741824
maxFetchCapacityBytes	Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.	1073741824
prefetchTriggerBytes	Threshold to trigger prefetching files.	maxFetchCapacityBytes / 2
fetchTimeout	Timeout for fetching each file.	60000
maxFetchRetry	Maximum number of retries for fetching each file.	3

4.0 KiB Raw Blame History

Deep Storage

Configuration for HDFS

Configuration for Google Cloud Storage

Native batch ingestion

4.0 KiB

Raw Blame History