From 64d59b9bb280eecbed1448d98605746b33929adb Mon Sep 17 00:00:00 2001
From: David Venable
Date: Tue, 29 Aug 2023 08:01:24 -0700
Subject: [PATCH] Behavioral changes for Data Prepper S3 sink (#4897)

* Updates the Data Prepper documentation for S3 sinks based on recent behavior changes.

Signed-off-by: David Venable

* Updates from the PR feedback.

Signed-off-by: David Venable

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

---------

Signed-off-by: David Venable
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
---
 .../pipelines/configuration/sinks/s3.md    | 24 +++++++++++++------
 .../pipelines/configuration/sinks/sinks.md |  4 ++--
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md
index aa9e1797..cb881e81 100644
--- a/_data-prepper/pipelines/configuration/sinks/s3.md
+++ b/_data-prepper/pipelines/configuration/sinks/s3.md
@@ -98,10 +98,21 @@ The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) d
 Because Avro requires a schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema.
 In general, you should define your own schema because it will most accurately reflect your needs.
 
+
+We recommend that you make your Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions).
+Without the null union, each field must be present or the data will fail to write to the sink.
+If you can be certain that each event has a given field, you can make that field non-nullable.
+
+When you provide your own Avro schema, that schema defines the final structure of your data.
+Therefore, any extra values in incoming events that are not mapped in the Avro schema will not be included in the final destination.
+To avoid confusion between a custom Avro schema and the `include_keys` or `exclude_keys` sink configurations, Data Prepper does not allow the use of `include_keys` or `exclude_keys` with a custom schema.
+
 In cases where your data is uniform, you may be able to automatically generate a schema.
 Automatically generated schemas are based on the first event received by the codec.
-The schema will only contain keys from this event. Therefore, you must have all keys present in all events in order for the automatically generated schema to produce a working schema.
+The schema will only contain keys from this event.
+Therefore, you must have all keys present in all events in order for the automatically generated schema to produce a working schema.
 Automatically generated schemas make all fields nullable.
+Use the sink's `include_keys` and `exclude_keys` configurations to control what data is included in the auto-generated schema.
 
 
 Option | Required | Type | Description
@@ -131,14 +142,13 @@
 ### parquet codec
 
 The `parquet` codec writes events into a Parquet file.
-You must set the `buffer_type` to `multipart` when using Parquet.
+When using the Parquet codec, set the `buffer_type` to `in_memory`.
 
-The Parquet codec writes data using the Avro schema. However, we generally recommend that you define your own schema so that it can best meet your needs.
+The Parquet codec writes data using the Avro schema.
+Because Parquet requires an Avro schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema.
+However, we generally recommend that you define your own schema so that it can best meet your needs.
 
-In cases where your data is uniform, you may be able to automatically generate a schema.
-Automatically generated schemas are based on the first event received by the codec.
-The schema will only contain keys from this event. Therefore, you must have all keys present in all events in order for the automatically generated schema to produce a working schema.
-Automatically generated schemas make all fields nullable.
+For details on the Avro schema and recommendations, see the [Avro codec](#avro-codec) documentation.
 
 
 Option | Required | Type | Description
diff --git a/_data-prepper/pipelines/configuration/sinks/sinks.md b/_data-prepper/pipelines/configuration/sinks/sinks.md
index ad99e572..0f3af6ab 100644
--- a/_data-prepper/pipelines/configuration/sinks/sinks.md
+++ b/_data-prepper/pipelines/configuration/sinks/sinks.md
@@ -18,5 +18,5 @@ Option | Required | Type | Description
 :--- | :--- |:------------| :---
 routes | No | String list | A list of routes for which this sink applies. If not provided, this sink receives all events. See [conditional routing]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#conditional-routing) for more information.
 tags_target_key | No | String | When specified, includes event tags in the output of the provided key.
-include_keys | No | String list | When specified, provides the keys in this list in the data sent to the sink.
-exclude_keys | No | String list | When specified, excludes the keys given from the data sent to the sink.
+include_keys | No | String list | When specified, provides the keys in this list in the data sent to the sink. Some codecs and sinks do not allow use of this field.
+exclude_keys | No | String list | When specified, excludes the keys given from the data sent to the sink. Some codecs and sinks do not allow use of this field.
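
The following is a minimal, untested sketch of an S3 sink pipeline configuration that illustrates the behavior described in this patch: an `avro` codec with a user-defined schema whose fields use null unions, and `buffer_type` set to `in_memory`. The pipeline name, `http` source, bucket, Region, role ARN, and schema fields are placeholders, and option names such as `bucket`, `aws`, `threshold`, and `schema` are assumed from the broader S3 sink documentation rather than defined in this patch; verify them against the published documentation for your Data Prepper version.

```yaml
s3-sink-pipeline:
  source:
    http:
  sink:
    - s3:
        aws:
          region: us-east-1                                     # placeholder Region
          sts_role_arn: arn:aws:iam::123456789012:role/Example  # placeholder role
        bucket: example-data-prepper-bucket                     # placeholder bucket
        buffer_type: in_memory                                  # buffer type named in this patch for Parquet; assumed to also apply to avro
        threshold:
          event_count: 1000
        codec:
          avro:
            # Every field uses a null union so that events missing a field still
            # write successfully; make a field non-nullable only if it is present
            # in every event.
            schema: >
              {
                "type": "record",
                "name": "Event",
                "fields": [
                  { "name": "message", "type": ["null", "string"], "default": null },
                  { "name": "status",  "type": ["null", "int"],    "default": null }
                ]
              }
```

Because this sketch supplies a custom Avro schema, the schema alone defines the output structure, so `include_keys` and `exclude_keys` are intentionally omitted from the sink; per this patch, those options apply only when the schema is auto-generated or when the codec in use allows them.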