From 0b5011409ae004a893989686fec6eef2fed76ae5 Mon Sep 17 00:00:00 2001
From: David Venable
Date: Tue, 22 Aug 2023 13:18:40 -0700
Subject: [PATCH] Data Prepper documentation updates for the s3 sink in 2.4.0
 (#4847)

* Data Prepper documentation updates for the s3 sink in 2.4.0. Includes other corrections.

* Updated the Parquet/Avro codecs to include auto_schema as well as some clarifications on the restrictions of the auto-generated schema.

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

---------

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
---
 .../pipelines/configuration/sinks/s3.md | 110 ++++++++++++++++--
 1 file changed, 100 insertions(+), 10 deletions(-)

diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md
index 8114f76d..70eb81ae 100644
--- a/_data-prepper/pipelines/configuration/sinks/s3.md
+++ b/_data-prepper/pipelines/configuration/sinks/s3.md
@@ -44,27 +44,117 @@ Use the following options when customizing the `s3` sink.

 Option | Required | Type | Description
 :--- | :--- | :--- | :---
 `bucket` | Yes | String | The name of the S3 bucket to which the sink writes. The `name` must match the name of your object store.
-`region` | No | String | The AWS Region to use when connecting to S3. Defaults to the [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
-`sts_role_arn` | No | String | The [AWS Security Token Service](https://docs.aws.amazon.com/STS/latest/APIReference/welcome.html) (AWS STS) role that the `s3` sink assumes when sending a request to S3. Defaults to the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
-`sts_external_id` | No | String | The external ID to attach to AssumeRole requests from AWS STS.
-`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`.
+`codec` | Yes | [Codec](#codec) | The codec determining the format of the output data. See [codec](#codec) for more information.
+`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.
+`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3.
 `object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` and the `file_pattern` of the object store. Defaults to the S3 object `events-%{yyyy-MM-dd'T'hh-mm-ss}` found inside the root directory of the bucket.
+`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
+`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type. Default is `in_memory`.
+`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`.

-## Threshold configuration options
+
+## aws
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`region` | No | String | The AWS Region to use for credentials. Defaults to the [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
+`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon S3. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
+`sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin.
+`sts_external_id` | No | String | The external ID to attach to AssumeRole requests from AWS STS.
+
+
+## Threshold configuration

 Use the following options to set ingestion thresholds for the `s3` sink.
 Option | Required | Type | Description
 :--- | :--- | :--- | :---
 `event_count` | Yes | Integer | The maximum number of events to accumulate before writing an object to S3.
-`maximum_size` | Yes | String | The maximum count or number of bytes that the S3 bucket can ingest. Defaults to `50mb`.
+`maximum_size` | Yes | String | The maximum number of bytes to accumulate, after compression, before writing an object to S3. Defaults to `50mb`.
 `event_collect_timeout` | Yes | String | Sets the time period during which events are collected before ingestion. All values are strings that represent duration, either an ISO_8601 notation string, such as `PT20.345S`, or a simple notation, such as `60s` or `1500ms`.

-## buffer_type
-`buffer_type` is an optional configuration that records stored events temporarily before flushing them into an S3 bucket. Use of one of the following options:
+## Buffer type
+
+`buffer_type` is an optional configuration that determines how events are temporarily stored before they are flushed into an S3 bucket. The default value is `in_memory`. Use one of the following options:

-- `local_file`: Flushes the record into a file on your machine.
 - `in_memory`: Stores the record in memory.
+- `local_file`: Flushes the record into a file on your machine.
+- `multipart`: Writes using the [S3 multipart upload](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html). Every 10 MB is written as a part.
+
+## Object key configuration
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`path_prefix` | No | String | The S3 key prefix path to use. Accepts date-time formatting. For example, you can use `%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3. By default, events write to the root of the bucket.
+
+
+## codec
+
+The `codec` determines how the `s3` sink formats data written to each S3 object.
+
+### avro codec
+
+The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document.
+
+Because Avro requires a schema, you may either define the schema yourself or let Data Prepper automatically generate one.
+In general, you should define your own schema because it will most accurately reflect your needs.
+In cases where your data is uniform, you may be able to automatically generate a schema.
+Automatically generated schemas are based on the first event received by the codec.
+The schema will contain only the keys from this event, so all keys must be present in all events in order for the automatically generated schema to work.
+Automatically generated schemas make all fields nullable.
+
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`schema` | Conditionally | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to `true`.
+`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration) from the first event.
+
+
+### csv codec
+
+The `csv` codec writes events in comma-separated value (CSV) format.
+It also supports tab-separated value (TSV) and other delimited formats.
+
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`delimiter` | No | String | The delimiter. By default, this is `,` to support CSV.
+`header` | No | String list | The header columns.
+
+
+### ndjson codec
+
+The `ndjson` codec writes each event as a JSON object on its own line, producing newline-delimited JSON.
+
+The `ndjson` codec does not take any configuration options.
+
+
+### json codec
+
+The `json` codec writes events to a single large JSON file.
+Each event is written as an object within a JSON array.
+
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`key_name` | No | String | The name of the key for the JSON array. By default, this is `events`.
+
+
+### parquet codec
+
+The `parquet` codec writes events into a Parquet file.
+You must set the `buffer_type` to `multipart` when using Parquet.
+
+The Parquet codec writes data using an Avro schema.
+In general, you should define your own schema because it will most accurately reflect your needs.
+In cases where your data is uniform, you may be able to automatically generate a schema.
+Automatically generated schemas are based on the first event received by the codec.
+The schema will contain only the keys from this event, so all keys must be present in all events in order for the automatically generated schema to work.
+Automatically generated schemas make all fields nullable.
+
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`schema` | Conditionally | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to `true`.
+`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration) from the first event.
-
\ No newline at end of file
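
Taken together, the options documented in this patch describe a sink block like the following sketch. The pipeline name, source, bucket name, and role ARN are hypothetical placeholders; only the `s3` sink keys shown in the tables above are assumed.

```yaml
# Hypothetical pipeline illustrating the s3 sink options documented above.
log-to-s3-pipeline:
  source:
    http:                 # placeholder source; any Data Prepper source works here
  sink:
    - s3:
        bucket: my-example-bucket                              # hypothetical bucket name
        aws:
          region: us-east-1
          sts_role_arn: arn:aws:iam::123456789012:role/example-s3-sink-role  # hypothetical role
        threshold:
          event_count: 10000        # write after 10,000 events...
          maximum_size: 50mb        # ...or 50 MB after compression...
          event_collect_timeout: 60s  # ...or 60 seconds, whichever comes first
        object_key:
          path_prefix: logs/%{yyyy}/%{MM}/%{dd}/   # daily folders
        compression: gzip
        buffer_type: in_memory
        codec:
          ndjson:                   # ndjson takes no configuration options
```

Note that choosing the `parquet` codec instead would also require `buffer_type: multipart`, per the restriction above.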