id	title
s3	S3-compatible

To use this Apache Druid extension, make sure to include druid-s3-extensions as an extension.
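As a minimal sketch, assuming extensions are loaded through the `druid.extensions.loadList` property in the common runtime properties file, the entry might look like the following. Any other extensions your cluster already loads would stay in the same list.

```properties
# Load the S3 extension; append to any extensions already present in the list
druid.extensions.loadList=["druid-s3-extensions"]
```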

Deep Storage

S3-compatible deep storage means either AWS S3 or a compatible service, such as Google Cloud Storage, that exposes the same API as S3.

Configuration

S3 deep storage needs to be explicitly enabled by setting druid.storage.type=s3. Only after setting the storage type to S3 will any of the settings below take effect.

The AWS SDK requires that the target region be specified. Two ways of doing this are by using the JVM system property aws.region or the environment variable AWS_REGION.

As an example, to set the region to 'us-east-1' through system properties:

- Add -Daws.region=us-east-1 to the jvm.config file for all Druid services.
- Add -Daws.region=us-east-1 to druid.indexer.runner.javaOpts in the Middle Manager configuration so that the property is passed to Peon (worker) processes.
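The Middle Manager step above can be sketched as a single line in its runtime properties. The `-server` flag here is only an illustrative assumption standing in for whatever JVM options your deployment already passes; the region flag is appended to them.

```properties
# Middle Manager runtime.properties (sketch)
# Keep your existing Peon JVM options and append the region property
druid.indexer.runner.javaOpts=-server -Daws.region=us-east-1
```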

Property	Description	Default
`druid.s3.accessKey`	S3 access key. See S3 authentication methods for more details.	Can be omitted depending on the authentication method chosen.
`druid.s3.secretKey`	S3 secret key. See S3 authentication methods for more details.	Can be omitted depending on the authentication method chosen.
`druid.s3.fileSessionCredentials`	Path to a properties file containing `sessionToken`, `accessKey` and `secretKey` values. One key/value pair per line (format `key=value`). See S3 authentication methods for more details.	Can be omitted depending on the authentication method chosen.
`druid.s3.protocol`	Communication protocol to use when sending requests to AWS. Either `http` or `https`. This setting is ignored if `druid.s3.endpoint.url` specifies a URL with a different protocol.	`https`
`druid.s3.disableChunkedEncoding`	Disables chunked encoding. See the AWS documentation for details.	false
`druid.s3.enablePathStyleAccess`	Enables path style access. See the AWS documentation for details.	false
`druid.s3.forceGlobalBucketAccessEnabled`	Enables global bucket access. See the AWS documentation for details.	false
`druid.s3.endpoint.url`	Service endpoint either with or without the protocol.	None
`druid.s3.endpoint.signingRegion`	Region to use for SigV4 signing of requests (e.g. us-west-1).	None
`druid.s3.proxy.host`	Proxy host to connect through.	None
`druid.s3.proxy.port`	Port on the proxy host to connect through.	None
`druid.s3.proxy.username`	User name to use when connecting through a proxy.	None
`druid.s3.proxy.password`	Password to use when connecting through a proxy.	None
`druid.storage.bucket`	Bucket to store in.	Must be set.
`druid.storage.baseKey`	Base key prefix to use, i.e., the directory within the bucket.	Must be set.
`druid.storage.archiveBucket`	S3 bucket name for archiving when running the archive task.	None
`druid.storage.archiveBaseKey`	S3 object key prefix for archiving.	None
`druid.storage.disableAcl`	Boolean flag to disable ACL. If this is set to `false`, full control is granted to the bucket owner, which may require setting additional permissions. See S3 permissions settings.	false
`druid.storage.sse.type`	Server-side encryption type. Must be one of `s3`, `kms`, or `custom`. See the Server-side encryption section below for more details.	None
`druid.storage.sse.kms.keyId`	AWS KMS key ID. This is used only when `druid.storage.sse.type` is `kms` and can be empty to use the default key ID.	None
`druid.storage.sse.custom.base64EncodedKey`	Base64-encoded key. Should be specified if `druid.storage.sse.type` is `custom`.	None
`druid.storage.type`	Global deep storage provider. Must be set to `s3` to make use of this extension.	Must be set (likely `s3`).
`druid.storage.useS3aSchema`	If true, use the "s3a" filesystem when using Hadoop-based ingestion. If false, the "s3n" filesystem will be used. Only affects Hadoop-based ingestion.	false
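
Putting the required properties from the table together, a minimal deep storage configuration might look like the sketch below. The bucket name, key prefix, and endpoint URL are hypothetical placeholders, and the commented-out lines are optional settings that only some deployments need.

```properties
# Enable S3 deep storage; the other settings take effect only once this is set
druid.storage.type=s3

# Hypothetical bucket and key prefix; replace with your own
druid.storage.bucket=my-druid-deep-storage
druid.storage.baseKey=druid/segments

# Optional: static credentials. Omit these to fall through to the
# other authentication methods described in this document.
# druid.s3.accessKey=<your-access-key>
# druid.s3.secretKey=<your-secret-key>

# Optional: settings that an S3-compatible service may need
# (hypothetical endpoint; whether path style access is required
# depends on the service)
# druid.s3.endpoint.url=https://storage.example.com
# druid.s3.enablePathStyleAccess=true
```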

S3 permissions settings

s3:GetObject and s3:PutObject are required for pushing segments to and loading segments from S3. If druid.storage.disableAcl is set to false, then s3:GetBucketAcl and s3:PutObjectAcl are additionally required to set ACLs for objects.

S3 authentication methods

To connect to your S3 bucket (whether a deep storage bucket or a source bucket), Druid uses the following credentials provider chain:

order	type	details
1	Druid config file	Based on your runtime.properties, if it contains the values `druid.s3.accessKey` and `druid.s3.secretKey`
2	Custom properties file	Based on a custom properties file in which you can supply `sessionToken`, `accessKey` and `secretKey` values. This file is provided to Druid through the `druid.s3.fileSessionCredentials` property
3	Environment variables	Based on environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`
4	Java system properties	Based on JVM properties `aws.accessKeyId` and `aws.secretKey`
5	Profile information	Based on credentials you may have on your Druid instance (generally in `~/.aws/credentials`)
6	ECS container credentials	Based on environment variables available on AWS ECS (`AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` or `AWS_CONTAINER_CREDENTIALS_FULL_URI`) as described in the EC2ContainerCredentialsProviderWrapper documentation
7	Instance profile information	Based on the instance profile you may have attached to your Druid instance

You can find more information about authentication methods here.

Note: The order is important, as it indicates the precedence of the authentication methods. For example, if you want to use instance profile information, you must not set druid.s3.accessKey and druid.s3.secretKey in your Druid runtime.properties, because they would take precedence.
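
To illustrate the second method in the chain, here is a hedged sketch of a session credentials file and the property that points to it. The file path and all credential values are hypothetical placeholders; only the key names and the `key=value` format come from the configuration table.

```properties
# --- In runtime.properties: point Druid at the credentials file ---
# (hypothetical path)
druid.s3.fileSessionCredentials=/etc/druid/s3-session.properties

# --- In /etc/druid/s3-session.properties: one key/value pair per line ---
# sessionToken=<your-session-token>
# accessKey=<your-access-key>
# secretKey=<your-secret-key>
```

Because the Druid config file is checked first, `druid.s3.accessKey` and `druid.s3.secretKey` would need to be left unset for this file to be used.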

Server-side encryption

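You can enable server-side encryption by setting druid.storage.sse.type to a supported type of server-side encryption. The currently supported types are:

- s3: Server-side encryption with S3-managed encryption keys
- kms: Server-side encryption with AWS KMS-managed keys
- custom: Server-side encryption with customer-provided encryption keys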
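
As a sketch, the three values of `druid.storage.sse.type` named in the configuration table map to properties like the following. Only one option would be active at a time, and the key values shown are placeholders rather than real identifiers.

```properties
# Option 1: S3-managed keys; no further properties are listed for this type
druid.storage.sse.type=s3

# Option 2: AWS KMS; the key ID can be left empty to use the default key
# druid.storage.sse.type=kms
# druid.storage.sse.kms.keyId=<kms-key-id>

# Option 3: custom key, supplied Base64-encoded
# druid.storage.sse.type=custom
# druid.storage.sse.custom.base64EncodedKey=<base64-encoded-key>
```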

S3 batch ingestion input source

This extension also provides an input source for Druid native batch ingestion to support reading objects directly from S3. Objects can be specified via a list of S3 URI strings, a list of S3 objects (bucket and path), or a list of S3 location prefixes, in which case Druid attempts to list the contents of each location and ingest all objects it contains. The S3 input source is splittable and can be used by native parallel index tasks, where each worker task of index_parallel will read a single object.

Sample specs, one for each way of specifying objects (uris, prefixes, and objects):

...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "uris": ["s3://foo/bar/file.json", "s3://bar/foo/file2.json"]
      },
      "inputFormat": {
        "type": "json"
      },
      ...
    },
...

...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": ["s3://foo/bar", "s3://bar/foo"]
      },
      "inputFormat": {
        "type": "json"
      },
      ...
    },
...

...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "objects": [
          { "bucket": "foo", "path": "bar/file1.json"},
          { "bucket": "bar", "path": "foo/file2.json"}
        ]
      },
      "inputFormat": {
        "type": "json"
      },
      ...
    },
...

property	description	default	required?
type	This should be `s3`.	N/A	yes
uris	JSON array of URIs where S3 objects to be ingested are located.	N/A	`uris` or `prefixes` or `objects` must be set
prefixes	JSON array of URI prefixes for the locations of S3 objects to be ingested.	N/A	`uris` or `prefixes` or `objects` must be set
objects	JSON array of S3 Objects to be ingested.	N/A	`uris` or `prefixes` or `objects` must be set

S3 Object:

property	description	default	required?
bucket	Name of the S3 bucket	N/A	yes
path	The path where data is located.	N/A	yes

StaticS3Firehose

This firehose ingests events from a predefined list of S3 objects. This firehose is splittable and can be used by native parallel index tasks. Since each split represents an object in this firehose, each worker task of index_parallel will read an object.