Data Prepper documentation updates for the s3 sink in 2.4.0 (#4847)

* Data Prepper documentation updates for the s3 sink in 2.4.0. Includes other corrections.

* Updated the Parquet/Avro codecs to include auto_schema as well as some clarifications on the restrictions of the auto-generated schema.

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

---------

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
@@ -44,27 +44,117 @@ Use the following options when customizing the `s3` sink.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`bucket` | Yes | String | The name of the S3 bucket to which objects are written. The `name` must match the name of your object store.
`codec` | Yes | [Codec](#codec) | The codec determining the format of the data written to each S3 object.
`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.
`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3.
`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` and the `file_pattern` of the object store. Defaults to the S3 object `events-%{yyyy-MM-dd'T'hh-mm-ss}` found inside the root directory of the bucket.
`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type.
`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`.
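
For reference, the following is a minimal sketch of a pipeline that writes NDJSON objects to S3. The pipeline name, source, bucket name, and role ARN are placeholders; adjust them to your environment.

```yaml
log-pipeline:
  source:
    http:   # any source; shown only to make the pipeline complete
  sink:
    - s3:
        aws:
          region: us-east-1
          sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
        bucket: sample-bucket
        threshold:
          event_count: 10000
          maximum_size: 50mb
          event_collect_timeout: PT15M
        codec:
          ndjson:
        buffer_type: in_memory
```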

## aws

Option | Required | Type | Description
:--- | :--- | :--- | :---
`region` | No | String | The AWS Region to use for credentials. Defaults to the [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon S3. Defaults to `null`, which uses the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
`sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin.
`sts_external_id` | No | String | The external ID to attach to AssumeRole requests from AWS STS.
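
A sketch of an `aws` block with all four options set; the Region, role ARN, external ID, and header name and value are placeholders:

```yaml
aws:
  region: us-west-2
  sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
  sts_external_id: data-prepper-external-id
  sts_header_overrides:
    x-custom-header: custom-value   # hypothetical header name and value
```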
## Threshold configuration

Use the following options to set ingestion thresholds for the `s3` sink.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`event_count` | Yes | Integer | The maximum number of events to accumulate before writing an object to S3.
`maximum_size` | Yes | String | The maximum number of bytes to accumulate, after compression, before writing an object to S3. Defaults to `50mb`.
`event_collect_timeout` | Yes | String | Sets the time period during which events are collected before being written to S3. All values are strings that represent duration, either an ISO_8601 notation string, such as `PT20.345S`, or a simple notation, such as `60s` or `1500ms`.
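
For example, with the following threshold (the values are illustrative), an object should be written as soon as the first of the three limits is reached:

```yaml
threshold:
  event_count: 10000             # flush after 10,000 events
  maximum_size: 25mb             # or after 25 MB of compressed data
  event_collect_timeout: PT10M   # or after 10 minutes of collection
```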

## Buffer type

`buffer_type` is an optional configuration that records stored events temporarily before flushing them into an S3 bucket. The default value is `in_memory`. Use one of the following options:

- `in_memory`: Stores the record in memory.
- `local_file`: Flushes the record into a file on your machine.
- `multipart`: Writes using the [S3 multipart upload](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html). Every 10 MB is written as a part.
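
For example, to buffer to a file on disk rather than in memory, a sketch such as the following can be used:

```yaml
sink:
  - s3:
      # ... bucket, aws, threshold, and codec as shown previously ...
      buffer_type: local_file
```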
## Object key configuration
Option | Required | Type | Description
:--- | :--- | :--- | :---
`path_prefix` | Yes | String | The S3 key prefix path to use. Accepts date-time formatting. For example, you can use `%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3. By default, events write to the root of the bucket.
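
For example, the following sketch (the `logs/` prefix is illustrative) writes objects into hourly folders:

```yaml
object_key:
  path_prefix: logs/%{yyyy}/%{MM}/%{dd}/%{HH}/
```

With this prefix, an object written on August 22, 2023, at 1 PM would land under `logs/2023/08/22/13/`.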
## codec
The `codec` determines how the `s3` sink formats data written to each S3 object.
### avro codec
The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document.
Because Avro requires a schema, you must either define the schema yourself or let Data Prepper generate one automatically.
In general, you should define your own schema because it will most accurately reflect your needs.
In cases where your data is uniform, you may be able to rely on an automatically generated schema.
Automatically generated schemas are based on the first event received by the codec.
The schema contains only the keys from that first event, so every key must be present in all events for the automatically generated schema to work.
Automatically generated schemas make all fields nullable.
Option | Required | Type | Description
:--- | :--- | :--- | :---
`schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to `true`.
`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration) from the first event.
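
The following is a sketch of an `avro` codec with a hand-written schema; the namespace, record name, and field names are placeholders. Declaring each field as a union with `null`, as shown, matches the nullability behavior of auto-generated schemas:

```yaml
codec:
  avro:
    schema: >
      {
        "type": "record",
        "namespace": "org.example.logs",
        "name": "LogEvent",
        "fields": [
          { "name": "message", "type": ["null", "string"] },
          { "name": "status",  "type": ["null", "int"] }
        ]
      }
```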
### csv codec
The `csv` codec writes events in comma-separated value (CSV) format.
It also supports tab-separated value (TSV) and other delimited formats.
Option | Required | Type | Description
:--- | :--- | :--- | :---
`delimiter` | No | String | The delimiter. By default, this is `,` to support CSV.
`header` | No | String List | The header columns.
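
A sketch of a `csv` codec configured to emit tab-separated values; the column names are placeholders:

```yaml
codec:
  csv:
    delimiter: "\t"   # a tab delimiter produces TSV output
    header:
      - timestamp
      - level
      - message
```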
### ndjson codec
The `ndjson` codec writes each event as a JSON object on its own line.
The `ndjson` codec does not take any configuration options.
### json codec
The `json` codec writes events in a single large JSON file.
Each event is written into an object within a JSON array.
Option | Required | Type | Description
:--- | :--- | :--- | :---
`key_name` | No | String | The name of the key for the JSON array. By default, this is `events`.
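
A sketch of a `json` codec with a custom array key; `log_events` is a placeholder:

```yaml
codec:
  json:
    key_name: log_events   # objects are written as {"log_events": [ ... ]}
```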
### parquet codec
The `parquet` codec writes events into a Parquet file.
You must set the `buffer_type` to `multipart` when using Parquet.
The Parquet codec writes data using the Avro schema.
In general, you should define your own schema because it will most accurately reflect your needs.
In cases where your data is uniform, you may be able to rely on an automatically generated schema.
Automatically generated schemas are based on the first event received by the codec.
The schema contains only the keys from that first event, so every key must be present in all events for the automatically generated schema to work.
Automatically generated schemas make all fields nullable.
Option | Required | Type | Description
:--- | :--- | :--- | :---
`schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to `true`.
`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration) from the first event.
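
Putting this together, the following is a sketch of an `s3` sink writing Parquet with an auto-generated schema; the bucket name and role ARN are placeholders. Note the required `multipart` buffer type:

```yaml
sink:
  - s3:
      aws:
        region: us-east-1
        sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
      bucket: sample-bucket
      threshold:
        event_count: 10000
        maximum_size: 50mb
        event_collect_timeout: PT15M
      buffer_type: multipart   # required for the parquet codec
      codec:
        parquet:
          auto_schema: true    # derive the Avro schema from the first event
```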