Add codec and processor combinations to S3 source (#5288)

* Add codec and processor combinations to S3 source.

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Update s3.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Make codec-processor combinations its own file

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Make Data Prepper codec page

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Fix links

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Fix links

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Reformat article without tables.

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* fix link

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update _data-prepper/common-use-cases/codec-processor-combinations.md

Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

* Update codec-processor-combinations.md

Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>

---------

Signed-off-by: Naarcha-AWS <naarcha@amazon.com>
Signed-off-by: Naarcha-AWS <97990722+Naarcha-AWS@users.noreply.github.com>
Co-authored-by: Melissa Vagi <vagimeli@amazon.com>
Naarcha-AWS 2023-11-17 20:06:04 -06:00 committed by GitHub
parent 826e6771ed
commit 72b3363460
2 changed files with 57 additions and 6 deletions

@@ -0,0 +1,47 @@
---
layout: default
title: Codec processor combinations
parent: Common use cases
nav_order: 25
---
# Codec processor combinations
At ingestion time, data received by the [`s3` source]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3/) can be parsed by [codecs]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#codec). Codecs compress and decompress large data sets in a certain format before ingesting them through a Data Prepper pipeline [processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/processors/).
While most codecs can be used with most processors, the following codec and processor combinations can make your pipeline more efficient for the corresponding input types.
## JSON array
A [JSON array](https://json-schema.org/understanding-json-schema/reference/array) is used to order elements of different types. Because an array is required in JSON, the data contained within the array must be tabular.
The JSON array does not require a processor.
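The following is a minimal pipeline sketch for this case, assuming Amazon SQS notifications; the queue URL, role ARN, and index name are placeholders only.

```yaml
# Sketch: parse JSON arrays from S3 with the json codec and no processor.
# The queue URL, role ARN, and index name are placeholders.
json-array-pipeline:
  source:
    s3:
      notification_type: "sqs"
      codec:
        json:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-s3-notifications"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/my-pipeline-role"
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: "s3-json-index"
```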
## NDJSON
Unlike a JSON array, [NDJSON](https://www.npmjs.com/package/ndjson) allows each row of data to be delimited by a newline, meaning the data is processed per line rather than as an array.
The NDJSON input type is parsed using the [`newline`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#newline-codec) codec, which parses each line as a single log event. The [`parse_json`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/parse-json/) processor then outputs each line as a single event.
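A minimal sketch of this combination follows; the queue URL and index name are placeholders, and the `parse_json` source key assumes the codec's default `message` field.

```yaml
# Sketch: the newline codec emits each line under "message"; parse_json expands it.
ndjson-pipeline:
  source:
    s3:
      notification_type: "sqs"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-s3-notifications"
      aws:
        region: "us-east-1"
  processor:
    - parse_json:
        source: "message"  # key that the newline codec writes each line to
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: "s3-ndjson-index"
```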
## CSV
The CSV data type inputs data as a table. It does not require both a codec and a processor, but it does require one or the other, for example, either the `csv` processor or the `csv` codec.
The CSV input type is most effective when used with the following codec and processor combinations.
### `csv` codec
When the [`csv` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#csv-codec) is used without a processor, it automatically detects headers from the CSV and uses them for index mapping.
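For example, the following sketch relies on the `csv` codec alone; the queue URL, role settings, and index name are placeholders.

```yaml
# Sketch: the csv codec detects the header row and maps columns automatically.
csv-codec-pipeline:
  source:
    s3:
      notification_type: "sqs"
      codec:
        csv:
          detect_header: true
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-s3-notifications"
      aws:
        region: "us-east-1"
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: "s3-csv-index"
```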
### `newline` codec
The [`newline` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#newline-codec) parses each row as a single log event. The codec only detects a header when `header_destination` is configured. The [`csv`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/csv/) processor then outputs the event into columns. The header detected in `header_destination` by the `newline` codec can be used in the `csv` processor under `column_names_source_key`.
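A sketch of this combination follows; the `csv_header` key name is arbitrary, and the other identifiers are placeholders.

```yaml
# Sketch: the newline codec stores the header row in "csv_header";
# the csv processor then uses it to name the output columns.
csv-newline-pipeline:
  source:
    s3:
      notification_type: "sqs"
      codec:
        newline:
          header_destination: "csv_header"
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-s3-notifications"
      aws:
        region: "us-east-1"
  processor:
    - csv:
        column_names_source_key: "csv_header"
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: "s3-csv-index"
```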
## Parquet
[Apache Parquet](https://parquet.apache.org/docs/overview/) is a columnar storage format built for Hadoop. It is most efficient without the use of a codec; however, positive results can be achieved when it is configured with [S3 Select]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#using-s3_select-with-the-s3-source).
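The following sketch pairs the `s3` source with `s3_select` for Parquet input. The sub-option names are assumptions based on the `s3_select` section of the `s3` source documentation, and the expression and identifiers are placeholders; verify them against your Data Prepper version.

```yaml
# Sketch: use S3 Select to read Parquet objects before ingestion.
parquet-pipeline:
  source:
    s3:
      notification_type: "sqs"
      s3_select:
        expression: "SELECT * FROM s3object s"
        input_serialization: parquet
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-s3-notifications"
      aws:
        region: "us-east-1"
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: "s3-parquet-index"
```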
## Avro
[Apache Avro](https://avro.apache.org/) helps streamline streaming data pipelines. It is most efficient when used with the [`avro` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sinks/s3#avro-codec) inside an `s3` sink.
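A sketch of the sink portion follows; the bucket, role ARN, threshold, and Avro schema are illustrative assumptions only.

```yaml
# Sketch: write events to S3 as Avro using the avro codec in an s3 sink.
# Bucket, role ARN, threshold, and schema are placeholders.
sink:
  - s3:
      bucket: "my-output-bucket"
      object_key:
        path_prefix: "events/"
      codec:
        avro:
          schema: >
            {
              "type": "record",
              "name": "Event",
              "fields": [
                { "name": "message", "type": "string" }
              ]
            }
      threshold:
        event_count: 10000
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/my-pipeline-role"
```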

@@ -121,9 +121,9 @@ Option | Required | Type | Description
## codec
The `codec` determines how the `s3` source parses each S3 object.
The `codec` determines how the `s3` source parses each Amazon S3 object. For improved performance and efficiency, you can use [codec combinations]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/codec-processor-combinations/) with certain processors.
### newline codec
### `newline` codec
The `newline` codec parses each line as a single log event. This is ideal for most application logs because each line corresponds to a single event. It is also suitable for S3 objects that contain an individual JSON object on each line, which pairs well with the [`parse_json`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/parse-json/) processor to parse each line.
@@ -147,11 +147,14 @@ Option | Required | Type | Description
`delimiter` | Yes | Integer | The delimiter separating columns. Default is `,`.
`quote_character` | Yes | String | The character used as a text qualifier for CSV data. Default is `"`.
`header` | No | String list | The header containing the column names used to parse CSV data.
`detect_header` | No | Boolean | Whether the first line of the S3 object should be interpreted as a header. Default is `true`.
`detect_header` | No | Boolean | Whether the first line of the Amazon S3 object should be interpreted as a header. Default is `true`.
## Using `s3_select` with the `s3` source<a name="s3_select"></a>
When configuring `s3_select` to parse S3 objects, use the following options.
When configuring `s3_select` to parse Amazon S3 objects, use the following options:
Option | Required | Type | Description
:--- |:-----------------------|:------------| :---
@@ -222,12 +225,13 @@ Option | Required | Type | Description
Option | Required | Type | Description
:--- | :--- | :--- | :---
`interval` | Yes | String | Indicates the minimum interval between each scan. The next scan in the interval will start after the interval duration from the last scan ends and when all the objects from the previous scan are processed. Supports ISO_8601 notation strings, such as `PT20.345S` or `PT15M`, and notation strings for seconds (`60s`) and milliseconds (`1600ms`).
`interval` | Yes | String | Indicates the minimum interval between each scan. The next scan in the interval will start after the interval duration from the last scan ends and when all the objects from the previous scan are processed. Supports ISO 8601 notation strings, such as `PT20.345S` or `PT15M`, and notation strings for seconds (`60s`) and milliseconds (`1600ms`).
`count` | No | Integer | Specifies how many times a bucket will be scanned. Defaults to `Integer.MAX_VALUE`.
## Metrics
The `s3` source includes the following metrics.
The `s3` source includes the following metrics:
### Counters