id | title |
---|---|
hdfs | HDFS |
To use this Apache Druid extension, make sure to include druid-hdfs-storage as an extension and, if you use it to access Google Cloud Storage, run Druid processes with GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile in the environment.
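As a minimal sketch, assuming a standard deployment layout, enabling the extension means adding it to the load list in common.runtime.properties; the exact list will vary with the other extensions you use:

```properties
# Hypothetical common.runtime.properties entry: add druid-hdfs-storage to the load list.
druid.extensions.loadList=["druid-hdfs-storage"]
```

GOOGLE_APPLICATION_CREDENTIALS is the standard Google service-account variable and only matters when the extension reads from or writes to Google Cloud Storage, as described below.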
## Deep Storage

### Configuration for HDFS
Property | Possible Values | Description | Default |
---|---|---|---|
druid.storage.type | hdfs | | Must be set. |
druid.storage.storageDirectory | | Directory for storing segments. | Must be set. |
druid.hadoop.security.kerberos.principal | druid@EXAMPLE.COM | Principal user name | empty |
druid.hadoop.security.kerberos.keytab | /etc/security/keytabs/druid.headlessUser.keytab | Path to keytab file | empty |
If you are using the Hadoop indexer, set your output directory to a location on Hadoop and it will work. If you want to eagerly authenticate against a secured Hadoop/HDFS cluster, you must set druid.hadoop.security.kerberos.principal and druid.hadoop.security.kerberos.keytab; this is an alternative to the cron job method that runs the kinit command periodically.
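For illustration, a deep-storage configuration for a Kerberized HDFS cluster could look like the following; the storage directory and principal are placeholder values drawn from the table above, not recommendations:

```properties
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
# Eager Kerberos authentication, as an alternative to a periodic kinit cron job.
druid.hadoop.security.kerberos.principal=druid@EXAMPLE.COM
druid.hadoop.security.kerberos.keytab=/etc/security/keytabs/druid.headlessUser.keytab
```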
### Configuration for Google Cloud Storage
The HDFS extension can also be used for GCS as deep storage.
Property | Possible Values | Description | Default |
---|---|---|---|
druid.storage.type | hdfs | | Must be set. |
druid.storage.storageDirectory | gs://bucket/example/directory | | Must be set. |
All services that need to access GCS must have the GCS connector jar on their classpath. One option is to place this jar in the lib/ and extensions/druid-hdfs-storage/ directories of your Druid installation.
Tested with Druid 0.9.0, Hadoop 2.7.2 and gcs-connector jar 1.4.4-hadoop2.
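A corresponding sketch for GCS deep storage, with a placeholder bucket name, differs only in the storage directory scheme; the gs:// path is resolved by the GCS connector on the classpath:

```properties
druid.storage.type=hdfs
druid.storage.storageDirectory=gs://bucket/example/directory
```

Remember that GOOGLE_APPLICATION_CREDENTIALS must point at a service account key file in the environment of every process that touches GCS.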
## Native batch ingestion
This firehose ingests events from a predefined list of files in a Hadoop filesystem. It is splittable and can be used by native parallel index tasks. Since each split represents an HDFS file, each worker task of index_parallel will read one file.
Sample spec:

```json
"firehose" : {
  "type" : "hdfs",
  "paths": "/foo/bar,/foo/baz"
}
```
This firehose provides caching and prefetching features. During native batch indexing, a firehose can be read twice if intervals are not specified; in that case, caching can be useful. Prefetching is preferred when direct scanning of files is slow.
Property | Description | Default |
---|---|---|
type | This should be hdfs. | none (required) |
paths | HDFS paths. Can be either a JSON array or a comma-separated string of paths. Wildcards like * are supported in these paths. | none (required) |
maxCacheCapacityBytes | Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes. | 1073741824 |
maxFetchCapacityBytes | Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read. | 1073741824 |
prefetchTriggerBytes | Threshold to trigger prefetching files. | maxFetchCapacityBytes / 2 |
fetchTimeout | Timeout for fetching each file, in milliseconds. | 60000 |
maxFetchRetry | Maximum number of retries for fetching each file. | 3 |
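Putting the table together with the sample spec, a sketch of an ioConfig for an index_parallel task that tunes caching and prefetching might look like the following; the paths are placeholders and the byte sizes simply restate the defaults, with prefetchTriggerBytes at half the fetch capacity:

```json
"ioConfig": {
  "type": "index_parallel",
  "firehose": {
    "type": "hdfs",
    "paths": "/foo/bar,/foo/baz",
    "maxCacheCapacityBytes": 1073741824,
    "maxFetchCapacityBytes": 1073741824,
    "prefetchTriggerBytes": 536870912,
    "fetchTimeout": 60000,
    "maxFetchRetry": 3
  }
}
```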