Documentation for bucket ownership features in Data Prepper 2.4 (#4679)

* Adds documentation for new features in Data Prepper 2.4 related to bucket ownership. Includes a section explaining how the bucket ownership works and can be configured for cross-account access. Resolves #4678

Signed-off-by: David Venable <dlv@amazon.com>

* Updates from PR feedback.

Signed-off-by: David Venable <dlv@amazon.com>

* Remove redundant header

Add a name to primary H2 header to section to prevent links from breaking.

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

---------

Signed-off-by: David Venable <dlv@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
David Venable 2023-08-21 08:46:04 -07:00 committed by GitHub
parent 7d077786bc
commit ac3f60d5ca

@@ -45,24 +45,51 @@ In order to use the `s3` source, configure your AWS Identity and Access Manageme
If your S3 objects or Amazon SQS queues do not use [AWS Key Management Service (AWS KMS)](https://aws.amazon.com/kms/), remove the `kms:Decrypt` permission.
## Cross-account S3 access<a name="s3_bucket_ownership"></a>
When Data Prepper fetches data from an S3 bucket, it verifies the ownership of the bucket using the
[bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html).
By default, Data Prepper expects an S3 bucket to be owned by the same account that owns the corresponding Amazon SQS queue.
When no SQS queue is provided, Data Prepper uses the account from the role Amazon Resource Name (ARN) in the `aws` configuration.
If you plan to ingest data from S3 buckets that are owned by accounts other than the one that owns the SQS queue, configure cross-account bucket ownership checks according to the following conditions:
- If all S3 buckets you want data from belong to a single account other than that of the SQS queue, set `default_bucket_owner` to the ID of the account that owns the buckets.
- If your S3 buckets are in multiple accounts, use a `bucket_owners` map.
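As a minimal sketch of the first condition, assume a queue in account `000000000000` and buckets that all belong to one other account. The account ID `123456789012` and the queue URL are placeholders for illustration only:
```
s3:
  sqs:
    queue_url: "https://sqs.us-east-1.amazonaws.com/000000000000/MyQueue"
  # Placeholder: the single account that owns every bucket this source reads.
  # Quoted so that YAML keeps the account ID as a string.
  default_bucket_owner: "123456789012"
```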
In the following example, the SQS queue is owned by account `000000000000`. The queue receives notifications for objects in two S3 buckets: `my-bucket-01` and `my-bucket-02`.
Because `my-bucket-01` is owned by `123456789012` and `my-bucket-02` is owned by `999999999999`, the `bucket_owners` map lists each bucket with the account ID of its owner, as shown in the following configuration:
```
s3:
  sqs:
    queue_url: "https://sqs.us-east-1.amazonaws.com/000000000000/MyQueue"
  bucket_owners:
    my-bucket-01: 123456789012
    my-bucket-02: 999999999999
```
You can use both `bucket_owners` and `default_bucket_owner` together.
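A combined configuration might look like the following sketch. It assumes, without confirmation from this page, that a bucket listed in `bucket_owners` is checked against its mapped account and that any other bucket is checked against `default_bucket_owner`. The bucket name and account IDs are placeholders:
```
s3:
  sqs:
    queue_url: "https://sqs.us-east-1.amazonaws.com/000000000000/MyQueue"
  # Assumed fallback owner for buckets that are not listed in bucket_owners.
  default_bucket_owner: "123456789012"
  # Per-bucket exception: this bucket is owned by a different account.
  bucket_owners:
    my-bucket-02: "999999999999"
```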
## Configuration
You can use the following options to configure the `s3` source.
Option | Required | Type | Description
:--- | :--- |:---------| :---
notification_type | Yes | String | Must be `sqs`.
compression | No | String | The compression algorithm to apply: `none`, `gzip`, or `automatic`. Default value is `none`.
codec | Yes | Codec | The [codec](#codec) to apply.
sqs | Yes | sqs | The SQS configuration. See [sqs](#sqs) for details.
aws | Yes | aws | The AWS configuration. See [aws](#aws) for details.
on_error | No | String | Determines how to handle errors in Amazon SQS, either `retain_messages` or `delete_messages`. `retain_messages` leaves the message in the Amazon SQS queue and tries to send the message again later. This is recommended for dead-letter queues. `delete_messages` deletes any failed messages. Default is `retain_messages`.
buffer_timeout | No | Duration | The amount of time allowed for writing events to the Data Prepper buffer before a timeout occurs. Any events that the Amazon S3 source cannot write to the buffer in this time will be discarded. Default value is 10 seconds.
records_to_accumulate | No | Integer | The number of messages that accumulate before writing to the buffer. Default value is `100`.
metadata_root_key | No | String | The base key for adding S3 metadata to each event. The metadata includes the key and bucket for each S3 object. Defaults to `s3/`.
default_bucket_owner | No | String | An AWS account ID to use as the default account when checking bucket ownership.
bucket_owners | No | Map | A map of S3 bucket names and their AWS account IDs. When provided, the `s3` source validates that the bucket is owned by the account. This allows for the use of buckets from multiple accounts.
disable_bucket_ownership_validation | No | Boolean | When `true`, the S3 source does not attempt to validate that the bucket is owned by the expected account. By default, this is the same account that owns the Amazon SQS queue. For more information, see [bucket ownership](#s3_bucket_ownership). Defaults to `false`.
acknowledgments | No | Boolean | When `true`, enables `s3` sources to receive [end-to-end acknowledgments]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines/#end-to-end-acknowledgments) when events are received by OpenSearch sinks.
## sqs