From 64d59b9bb280eecbed1448d98605746b33929adb Mon Sep 17 00:00:00 2001
From: David Venable
Date: Tue, 29 Aug 2023 08:01:24 -0700
Subject: [PATCH] Behavioral changes for Data Prepper S3 sink (#4897)

* Updates the Data Prepper documentation for S3 sinks based on recent behavior changes.

Signed-off-by: David Venable

* Updates from the PR feedback.

Signed-off-by: David Venable

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

---------

Signed-off-by: David Venable
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
---
 .../pipelines/configuration/sinks/s3.md    | 24 +++++++++++++------
 .../pipelines/configuration/sinks/sinks.md |  4 ++--
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md
index aa9e1797..cb881e81 100644
--- a/_data-prepper/pipelines/configuration/sinks/s3.md
+++ b/_data-prepper/pipelines/configuration/sinks/s3.md
@@ -98,10 +98,21 @@ The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) d
 Because Avro requires a schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema.
 In general, you should define your own schema because it will most accurately reflect your needs.
 
+
+We recommend that you make your Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions).
+Without the null union, each field must be present or the data will fail to write to the sink.
+If you can be certain that each event has a given field, you can make that field non-nullable.
+
+When you provide your own Avro schema, that schema defines the final structure of your data.
+Therefore, any extra values in incoming events that are not mapped in the Avro schema will not be included in the final destination.
+To avoid confusion between a custom Avro schema and the `include_keys` or `exclude_keys` sink configurations, Data Prepper does not allow the use of `include_keys` or `exclude_keys` with a custom schema.
+
 In cases where your data is uniform, you may be able to automatically generate a schema.
 Automatically generated schemas are based on the first event received by the codec.
-The schema will only contain keys from this event. Therefore, you must have all keys present in all events in order for the automatically generated schema to produce a working schema.
+The schema will only contain keys from this event.
+Therefore, you must have all keys present in all events in order for the automatically generated schema to produce a working schema.
 Automatically generated schemas make all fields nullable.
+Use the sink's `include_keys` and `exclude_keys` configurations to control what data is included in the auto-generated schema.
 
 
 Option | Required | Type | Description
@@ -131,14 +142,13 @@
 ### parquet codec
 
 The `parquet` codec writes events into a Parquet file.
-You must set the `buffer_type` to `multipart` when using Parquet.
+When using the Parquet codec, set the `buffer_type` to `in_memory`.
 
-The Parquet codec writes data using the Avro schema. However, we generally recommend that you define your own schema so that it can best meet your needs.
+The Parquet codec writes data using the Avro schema.
+Because Parquet requires an Avro schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema.
+However, we generally recommend that you define your own schema so that it can best meet your needs.
 
-In cases where your data is uniform, you may be able to automatically generate a schema.
-Automatically generated schemas are based on the first event received by the codec.
-The schema will only contain keys from this event. Therefore, you must have all keys present in all events in order for the automatically generated schema to produce a working schema.
-Automatically generated schemas make all fields nullable.
+For details on the Avro schema and recommendations, see the [Avro codec](#avro-codec) documentation.
 
 
 Option | Required | Type | Description
diff --git a/_data-prepper/pipelines/configuration/sinks/sinks.md b/_data-prepper/pipelines/configuration/sinks/sinks.md
index ad99e572..0f3af6ab 100644
--- a/_data-prepper/pipelines/configuration/sinks/sinks.md
+++ b/_data-prepper/pipelines/configuration/sinks/sinks.md
@@ -18,5 +18,5 @@ Option | Required | Type | Description
 :--- | :--- |:------------| :---
 routes | No | String list | A list of routes for which this sink applies. If not provided, this sink receives all events. See [conditional routing]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#conditional-routing) for more information.
 tags_target_key | No | String | When specified, includes event tags in the output of the provided key.
-include_keys | No | String list | When specified, provides the keys in this list in the data sent to the sink.
-exclude_keys | No | String list | When specified, excludes the keys given from the data sent to the sink.
+include_keys | No | String list | When specified, provides the keys in this list in the data sent to the sink. Some codecs and sinks do not allow use of this field.
+exclude_keys | No | String list | When specified, excludes the keys given from the data sent to the sink. Some codecs and sinks do not allow use of this field.
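
The following is a minimal, untested sketch of an S3 sink pipeline configuration that illustrates the behavior described in this patch: an `avro` codec with a user-defined schema whose fields use null unions, and `buffer_type` set to `in_memory`. The pipeline name, `http` source, bucket, Region, role ARN, and schema fields are placeholders, and option names such as `bucket`, `aws`, `threshold`, and `schema` are assumed from the broader S3 sink documentation rather than defined in this patch; verify them against the published documentation for your Data Prepper version.

```yaml
s3-sink-pipeline:
  source:
    http:
  sink:
    - s3:
        aws:
          region: us-east-1                                     # placeholder Region
          sts_role_arn: arn:aws:iam::123456789012:role/Example  # placeholder role
        bucket: example-data-prepper-bucket                     # placeholder bucket
        buffer_type: in_memory                                  # buffer type named in this patch for Parquet; assumed to also apply to avro
        threshold:
          event_count: 1000
        codec:
          avro:
            # Every field uses a null union so that events missing a field still
            # write successfully; make a field non-nullable only if it is present
            # in every event.
            schema: >
              {
                "type": "record",
                "name": "Event",
                "fields": [
                  { "name": "message", "type": ["null", "string"], "default": null },
                  { "name": "status",  "type": ["null", "int"],    "default": null }
                ]
              }
```

Because this sketch supplies a custom Avro schema, the schema alone defines the output structure, so `include_keys` and `exclude_keys` are intentionally omitted from the sink; per this patch, those options apply only when the schema is auto-generated or when the codec in use allows them.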