Behavioral changes for Data Prepper S3 sink (#4897)

* Updates the Data Prepper documentation for S3 sinks based on recent behavior changes.

Signed-off-by: David Venable <dlv@amazon.com>

* Updates from the PR feedback.

Signed-off-by: David Venable <dlv@amazon.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

---------

Signed-off-by: David Venable <dlv@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
This commit is contained in:
David Venable 2023-08-29 08:01:24 -07:00 committed by GitHub
parent dc21de0f80
commit 64d59b9bb2
2 changed files with 19 additions and 9 deletions


@@ -98,10 +98,21 @@ The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) d
Because Avro requires a schema, you may either define the schema yourself or let Data Prepper automatically generate one.
In general, you should define your own schema because it will most accurately reflect your needs.
We recommend that you make your Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions).
Without the null union, each field must be present or the data will fail to write to the sink.
If you can be certain that every event has a given field, you can make it non-nullable.
When you provide your own Avro schema, that schema defines the final structure of your data.
Therefore, any values in incoming events that are not mapped in the Avro schema will not be included in the final destination.
To avoid confusion between a custom Avro schema and the `include_keys` or `exclude_keys` sink configurations, Data Prepper does not allow the use of `include_keys` or `exclude_keys` with a custom schema.
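For example, a custom schema using null unions might look like the following. The record and field names here are illustrative; `timestamp` is left non-nullable on the assumption that every event contains it:

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    { "name": "message", "type": ["null", "string"], "default": null },
    { "name": "status_code", "type": ["null", "int"], "default": null },
    { "name": "timestamp", "type": "string" }
  ]
}
```

With this schema, events missing `message` or `status_code` still write successfully, while an event missing `timestamp` fails to write.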
In cases where your data is uniform, you may be able to automatically generate a schema.
Automatically generated schemas are based on the first event received by the codec.
The schema will only contain keys from this event.
Therefore, all keys must be present in all events in order for the automatically generated schema to work.
Automatically generated schemas make all fields nullable.
Use the sink's `include_keys` and `exclude_keys` configurations to control what data is included in the auto-generated schema.
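As a sketch of automatic generation, suppose the first event the codec receives is `{"message": "hello", "status": 200}`. The generated schema would contain only those two keys, with both fields nullable. The record name shown here is illustrative:

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    { "name": "message", "type": ["null", "string"], "default": null },
    { "name": "status", "type": ["null", "int"], "default": null }
  ]
}
```

Any later event containing keys beyond `message` and `status` would have those extra values dropped, because they are not mapped in the generated schema.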
Option | Required | Type | Description
@@ -131,14 +142,13 @@ Option | Required | Type | Description
### parquet codec
The `parquet` codec writes events into a Parquet file.
When using the Parquet codec, set the `buffer_type` to `in_memory`.
The Parquet codec writes data using the Avro schema.
Because Parquet requires an Avro schema, you may either define the schema yourself or let Data Prepper automatically generate one.
However, we generally recommend that you define your own schema so that it can best meet your needs.
For details on the Avro schema and recommendations, see the [Avro codec](#avro-codec) documentation.
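A hypothetical S3 sink configuration using the `parquet` codec might look like the following. The bucket name and schema are placeholders, and this sketch assumes the schema is supplied inline under the codec's `schema` option; consult the options table for the full set of settings:

```yaml
sink:
  - s3:
      bucket: my-example-bucket
      buffer_type: in_memory
      codec:
        parquet:
          schema: >
            {
              "type": "record",
              "name": "Event",
              "fields": [
                { "name": "message", "type": ["null", "string"], "default": null }
              ]
            }
```

Note that `buffer_type` is set to `in_memory`, as required by the Parquet codec.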
Option | Required | Type | Description


@@ -18,5 +18,5 @@ Option | Required | Type | Description
:--- | :--- |:------------| :---
routes | No | String list | A list of routes for which this sink applies. If not provided, this sink receives all events. See [conditional routing]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#conditional-routing) for more information.
tags_target_key | No | String | When specified, writes event tags to the output under the provided key.
include_keys | No | String list | When specified, provides the keys in this list in the data sent to the sink.
exclude_keys | No | String list | When specified, excludes the keys given from the data sent to the sink.
include_keys | No | String list | When specified, provides the keys in this list in the data sent to the sink. Some codecs and sinks do not allow use of this field.
exclude_keys | No | String list | When specified, excludes the keys given from the data sent to the sink. Some codecs and sinks do not allow use of this field.
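For example, a sink that forwards only selected keys might be configured as follows. This is a sketch; the sink type and key names are placeholders:

```yaml
sink:
  - stdout:
      include_keys: ["status", "message"]
```

With this configuration, only the `status` and `message` keys from each event are sent to the sink; all other keys are omitted.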