mirror of
https://github.com/apache/druid.git
synced 2025-02-18 16:12:23 +00:00
* Claim full support for Java 17. No production code has changed, except the startup scripts. Changes: 1) Allow Java 17 without DRUID_SKIP_JAVA_CHECK. 2) Include the full list of opens and exports on both Java 11 and 17. 3) Document that Java 17 is both supported and preferred. 4) Switch some tests from Java 11 to 17 to get better coverage on the preferred version. * Doc update. * Update errorprone. * Update docker_build_containers.sh. * Update errorprone in licenses.yaml. * Add some more run-javas. * Additional run-javas. * Update errorprone. * Suppress new errorprone error. * Add exports and opens in ForkingTaskRunner for Java 11+. Test, doc changes. * Additional errorprone updates. * Update for errorprone. * Restore old fomatting in LdapCredentialsValidator. * Copy bin/ too. * Fix Java 15, 17 build line in docker_build_containers.sh. * Update busybox image. * One more java command. * Fix interpolation. * IT commandline refinements. * Switch to busybox 1.34.1-glibc. * POM adjustments, build and test one IT on 17. * Additional debugging. * Fix silly thing. * Adjust command line. * Add exports and opens one more place. * Additional harmonization of strong encapsulation parameters.
170 lines
7.8 KiB
Markdown
170 lines
7.8 KiB
Markdown
---
|
|
id: hdfs
|
|
title: "HDFS"
|
|
---
|
|
|
|
<!--
|
|
~ Licensed to the Apache Software Foundation (ASF) under one
|
|
~ or more contributor license agreements. See the NOTICE file
|
|
~ distributed with this work for additional information
|
|
~ regarding copyright ownership. The ASF licenses this file
|
|
~ to you under the Apache License, Version 2.0 (the
|
|
~ "License"); you may not use this file except in compliance
|
|
~ with the License. You may obtain a copy of the License at
|
|
~
|
|
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
~
|
|
~ Unless required by applicable law or agreed to in writing,
|
|
~ software distributed under the License is distributed on an
|
|
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
~ KIND, either express or implied. See the License for the
|
|
~ specific language governing permissions and limitations
|
|
~ under the License.
|
|
-->
|
|
|
|
|
|
To use this Apache Druid extension, [include](../../configuration/extensions.md#loading-extensions) `druid-hdfs-storage` in the extensions load list and run druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.
|
|
|
|
## Deep Storage
|
|
|
|
### Configuration for HDFS
|
|
|
|
|Property|Possible Values|Description|Default|
|
|
|--------|---------------|-----------|-------|
|
|
|`druid.storage.type`|hdfs||Must be set.|
|
|
|`druid.storage.storageDirectory`||Directory for storing segments.|Must be set.|
|
|
|`druid.hadoop.security.kerberos.principal`|`druid@EXAMPLE.COM`| Principal user name |empty|
|
|
|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file|empty|
|
|
|
|
Besides the above settings, you also need to include all Hadoop configuration files (such as `core-site.xml`, `hdfs-site.xml`)
|
|
in the Druid classpath. One way to do this is copying all those files under `${DRUID_HOME}/conf/_common`.
|
|
|
|
If you are using the Hadoop ingestion, set your output directory to be a location on Hadoop and it will work.
|
|
If you want to eagerly authenticate against a secured hadoop/hdfs cluster you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`, this is an alternative to the cron job method that runs `kinit` command periodically.
|
|
|
|
### Configuration for Cloud Storage
|
|
|
|
You can also use the AWS S3 or the Google Cloud Storage as the deep storage via HDFS.
|
|
|
|
#### Configuration for AWS S3
|
|
|
|
To use the AWS S3 as the deep storage, you need to configure `druid.storage.storageDirectory` properly.
|
|
|
|
|Property|Possible Values|Description|Default|
|
|
|--------|---------------|-----------|-------|
|
|
|`druid.storage.type`|hdfs| |Must be set.|
|
|
|`druid.storage.storageDirectory`|s3a://bucket/example/directory or s3n://bucket/example/directory|Path to the deep storage|Must be set.|
|
|
|
|
You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/), especially the `hadoop-aws.jar` in the Druid classpath.
|
|
Run the below command to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` in all nodes.
|
|
|
|
```bash
|
|
${DRUID_HOME}/bin/run-java -classpath "${DRUID_HOME}/lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}";
|
|
cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
|
|
```
|
|
|
|
Finally, you need to add the below properties in the `core-site.xml`.
|
|
For more configurations, see the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/).
|
|
|
|
```xml
|
|
<property>
|
|
<name>fs.s3a.impl</name>
|
|
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
|
|
<description>The implementation class of the S3A Filesystem</description>
|
|
</property>
|
|
|
|
<property>
|
|
<name>fs.AbstractFileSystem.s3a.impl</name>
|
|
<value>org.apache.hadoop.fs.s3a.S3A</value>
|
|
<description>The implementation class of the S3A AbstractFileSystem.</description>
|
|
</property>
|
|
|
|
<property>
|
|
<name>fs.s3a.access.key</name>
|
|
<description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
|
|
<value>your access key</value>
|
|
</property>
|
|
|
|
<property>
|
|
<name>fs.s3a.secret.key</name>
|
|
<description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
|
|
<value>your secret key</value>
|
|
</property>
|
|
```
|
|
|
|
#### Configuration for Google Cloud Storage
|
|
|
|
To use the Google Cloud Storage as the deep storage, you need to configure `druid.storage.storageDirectory` properly.
|
|
|
|
|Property|Possible Values|Description|Default|
|
|
|--------|---------------|-----------|-------|
|
|
|`druid.storage.type`|hdfs||Must be set.|
|
|
|`druid.storage.storageDirectory`|gs://bucket/example/directory|Path to the deep storage|Must be set.|
|
|
|
|
All services that need to access GCS need to have the [GCS connector jar](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#other_sparkhadoop_clusters) in their class path.
|
|
Please read the [install instructions](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
|
|
to properly set up the necessary libraries and configurations.
|
|
One option is to place this jar in `${DRUID_HOME}/lib/` and `${DRUID_HOME}/extensions/druid-hdfs-storage/`.
|
|
|
|
Finally, you need to configure the `core-site.xml` file with the filesystem
|
|
and authentication properties needed for GCS. You may want to copy the below
|
|
example properties. Please follow the instructions at
|
|
[https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
|
|
for more details.
|
|
For more configurations, [GCS core default](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.0.0/gcs/conf/gcs-core-default.xml)
|
|
and [GCS core template](https://github.com/GoogleCloudPlatform/bdutil/blob/master/conf/hadoop2/gcs-core-template.xml).
|
|
|
|
```xml
|
|
<property>
|
|
<name>fs.gs.impl</name>
|
|
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
|
|
<description>The FileSystem for gs: (GCS) uris.</description>
|
|
</property>
|
|
|
|
<property>
|
|
<name>fs.AbstractFileSystem.gs.impl</name>
|
|
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
|
|
<description>The AbstractFileSystem for gs: uris.</description>
|
|
</property>
|
|
|
|
<property>
|
|
<name>google.cloud.auth.service.account.enable</name>
|
|
<value>true</value>
|
|
<description>
|
|
Whether to use a service account for GCS authorization.
|
|
Setting this property to `false` will disable use of service accounts for
|
|
authentication.
|
|
</description>
|
|
</property>
|
|
|
|
<property>
|
|
<name>google.cloud.auth.service.account.json.keyfile</name>
|
|
<value>/path/to/keyfile</value>
|
|
<description>
|
|
The JSON key file of the service account used for GCS
|
|
access when google.cloud.auth.service.account.enable is true.
|
|
</description>
|
|
</property>
|
|
```
|
|
|
|
Tested with Druid 0.17.0, Hadoop 2.8.5 and gcs-connector jar 2.0.0-hadoop2.
|
|
|
|
## Reading data from HDFS or Cloud Storage
|
|
|
|
### Native batch ingestion
|
|
|
|
The [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md)
|
|
to read files directly from the HDFS Storage. You may be able to read objects from cloud storage
|
|
with the HDFS input source, but we highly recommend to use a proper
|
|
[Input Source](../../ingestion/input-sources.md) instead if possible because
|
|
it is simple to set up. For now, only the [S3 input source](../../ingestion/input-sources.md#s3-input-source)
|
|
and the [Google Cloud Storage input source](../../ingestion/input-sources.md#google-cloud-storage-input-source)
|
|
are supported for cloud storage types, and so you may still want to use the HDFS input source
|
|
to read from cloud storage other than those two.
|
|
|
|
### Hadoop-based ingestion
|
|
|
|
If you use the [Hadoop ingestion](../../ingestion/hadoop.md), you can read data from HDFS
|
|
by specifying the paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).
|
|
See the [Static](../../ingestion/hadoop.md#static) inputSpec for details.
|