druid/docs/development/extensions-core/s3.md

152 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
id: s3
title: "S3-compatible"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
## S3 extension
This extension allows you to do 2 things:
* [Ingest data](#reading-data-from-s3) from files stored in S3.
* Write segments to [deep storage](#deep-storage) in S3.
To use this Apache Druid extension, [include](../../development/extensions.md#loading-extensions) `druid-s3-extensions` in the extensions load list.
### Reading data from S3
Use a native batch [Parallel task](../../ingestion/native-batch.md) with an [S3 input source](../../ingestion/native-batch-input-sources.html#s3-input-source) to read objects directly from S3.
Alternatively, use a [Hadoop task](../../ingestion/hadoop.md),
and specify S3 paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).
To read objects from S3, you must supply [connection information](#configuration) in configuration.
### Deep Storage
S3-compatible deep storage means either AWS S3 or a compatible service like Google Storage which exposes the same API as S3.
S3 deep storage needs to be explicitly enabled by setting `druid.storage.type=s3`. **Only after setting the storage type to S3 will any of the settings below take effect.**
To use S3 for Deep Storage, you must supply [connection information](#configuration) in configuration *and* set additional configuration, specific for [Deep Storage](#deep-storage-specific-configuration).
#### Deep storage specific configuration
|Property|Description|Default|
|--------|-----------|-------|
|`druid.storage.bucket`|Bucket to store in.|Must be set.|
|`druid.storage.baseKey`|A prefix string that will be prepended to the object names for the segments published to S3 deep storage|Must be set.|
|`druid.storage.type`|Global deep storage provider. Must be set to `s3` to make use of this extension.|Must be set (likely `s3`).|
|`druid.storage.archiveBucket`|S3 bucket name for archiving when running the *archive task*.|none|
|`druid.storage.archiveBaseKey`|S3 object key prefix for archiving.|none|
|`druid.storage.disableAcl`|Boolean flag for how object permissions are handled. To use ACLs, set this property to `false`. To use Object Ownership, set it to `true`. The permission requirements for ACLs and Object Ownership are different. For more information, see [S3 permissions settings](#s3-permissions-settings).|false|
|`druid.storage.useS3aSchema`|If true, use the "s3a" filesystem when using Hadoop-based ingestion. If false, the "s3n" filesystem will be used. Only affects Hadoop-based ingestion.|false|
## Configuration
### S3 authentication methods
You can provide credentials to connect to S3 in a number of ways, whether for [deep storage](#deep-storage) or as an [ingestion source](#reading-data-from-s3).
The configuration options are listed in order of precedence. For example, if you would like to use profile information given in `~/.aws.credentials`, do not set `druid.s3.accessKey` and `druid.s3.secretKey` in your Druid config file because they would take precedence.
|order|type|details|
|--------|-----------|-------|
|1|Druid config file|Based on your runtime.properties if it contains values `druid.s3.accessKey` and `druid.s3.secretKey` |
|2|Custom properties file| Based on custom properties file where you can supply `sessionToken`, `accessKey` and `secretKey` values. This file is provided to Druid through `druid.s3.fileSessionCredentials` properties|
|3|Environment variables|Based on environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`|
|4|Java system properties|Based on JVM properties `aws.accessKeyId` and `aws.secretKey` |
|5|Profile information|Based on credentials you may have on your druid instance (generally in `~/.aws/credentials`)|
|6|ECS container credentials|Based on environment variables available on AWS ECS (AWS_CONTAINER_CREDENTIALS_RELATIVE_URI or AWS_CONTAINER_CREDENTIALS_FULL_URI) as described in the [EC2ContainerCredentialsProviderWrapper documentation](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/EC2ContainerCredentialsProviderWrapper.html)|
|7|Instance profile information|Based on the instance profile you may have attached to your druid instance|
For more information, refer to the [Amazon Developer Guide](https://docs.aws.amazon.com/fr_fr/sdk-for-java/v1/developer-guide/credentials).
Alternatively, you can bypass this chain by specifying an access key and secret key using a [Properties Object](../../ingestion/native-batch-input-sources.html#s3-input-source) inside your ingestion specification.
Use the property [`druid.startup.logging.maskProperties`](../../configuration/index.html#startup-logging) to mask credentials information in Druid logs. For example, `["password", "secretKey", "awsSecretAccessKey"]`.
### S3 permissions settings
To manage the permissions for objects in an S3 bucket, you can use either ACLs or Object Ownership. The permissions required for each method are different.
By default, Druid uses ACLs. With ACLs, any object that Druid puts into the bucket inherits the ACL settings from the bucket.
You can switch from using ACLs to Object Ownership by setting `druid.storage.disableAcl` to `true`. The bucket owner owns any object that gets created, so you need to use S3's bucket policies to manage permissions.
Note that this setting only affects Druid's behavior. Changing S3 to use Object Ownership requires additional configuration. For more information, see the AWS documentation on [Controlling ownership of objects and disabling ACLs for your bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/about-object-ownership.html).
#### ACL permissions
If you're using ACLs, Druid needs the following permissions:
- `s3:GetObject`
- `s3:PutObject`
- `s3:DeleteObject`
- `s3:GetBucketAcl`
- `s3:PutObjectAcl`
#### Object Ownership permissions
If you're using Object Ownership, Druid needs the following permissions:
- `s3:GetObject`
- `s3:PutObject`
- `s3:DeleteObject`
### AWS region
The AWS SDK requires that a target region be specified. You can set these by using the JVM system property `aws.region` or by setting an environment variable `AWS_REGION`.
For example, to set the region to 'us-east-1' through system properties:
- Add `-Daws.region=us-east-1` to the `jvm.config` file for all Druid services.
- Add `-Daws.region=us-east-1` to `druid.indexer.runner.javaOpts` in [Middle Manager configuration](../../configuration/index.md#middlemanager-configuration) so that the property will be passed to Peon (worker) processes.
### Connecting to S3 configuration
|Property|Description|Default|
|--------|-----------|-------|
|`druid.s3.accessKey`|S3 access key. See [S3 authentication methods](#s3-authentication-methods) for more details|Can be omitted according to authentication methods chosen.|
|`druid.s3.secretKey`|S3 secret key. See [S3 authentication methods](#s3-authentication-methods) for more details|Can be omitted according to authentication methods chosen.|
|`druid.s3.fileSessionCredentials`|Path to properties file containing `sessionToken`, `accessKey` and `secretKey` value. One key/value pair per line (format `key=value`). See [S3 authentication methods](#s3-authentication-methods) for more details |Can be omitted according to authentication methods chosen.|
|`druid.s3.protocol`|Communication protocol type to use when sending requests to AWS. `http` or `https` can be used. This configuration would be ignored if `druid.s3.endpoint.url` is filled with a URL with a different protocol.|`https`|
|`druid.s3.disableChunkedEncoding`|Disables chunked encoding. See [AWS document](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Builder.html#disableChunkedEncoding--) for details.|false|
|`druid.s3.enablePathStyleAccess`|Enables path style access. See [AWS document](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Builder.html#enablePathStyleAccess--) for details.|false|
|`druid.s3.forceGlobalBucketAccessEnabled`|Enables global bucket access. See [AWS document](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Builder.html#setForceGlobalBucketAccessEnabled-java.lang.Boolean-) for details.|false|
|`druid.s3.endpoint.url`|Service endpoint either with or without the protocol.|None|
|`druid.s3.endpoint.signingRegion`|Region to use for SigV4 signing of requests (e.g. us-west-1).|None|
|`druid.s3.proxy.host`|Proxy host to connect through.|None|
|`druid.s3.proxy.port`|Port on the proxy host to connect through.|None|
|`druid.s3.proxy.username`|User name to use when connecting through a proxy.|None|
|`druid.s3.proxy.password`|Password to use when connecting through a proxy.|None|
|`druid.storage.sse.type`|Server-side encryption type. Should be one of `s3`, `kms`, and `custom`. See the below [Server-side encryption section](#server-side-encryption) for more details.|None|
|`druid.storage.sse.kms.keyId`|AWS KMS key ID. This is used only when `druid.storage.sse.type` is `kms` and can be empty to use the default key ID.|None|
|`druid.storage.sse.custom.base64EncodedKey`|Base64-encoded key. Should be specified if `druid.storage.sse.type` is `custom`.|None|
## Server-side encryption
You can enable [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption) by setting
`druid.storage.sse.type` to a supported type of server-side encryption. The current supported types are:
- s3: [Server-side encryption with S3-managed encryption keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption)
- kms: [Server-side encryption with AWS KMSManaged Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption)
- custom: [Server-side encryption with Customer-Provided Encryption Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys)