---
layout: default
title: Codec processor combinations
parent: Common use cases
nav_order: 25
---
# Codec processor combinations
At ingestion time, data received by the `s3` source can be parsed by codecs. Codecs compress and decompress large data sets in a certain format before ingesting them through a Data Prepper pipeline processor.
While most codecs can be used with most processors, the following codec and processor combinations can make your pipeline more efficient for the input types described below.
## JSON array
A JSON array is used to order elements of different types. Because an array is required in JSON, the data contained within the array must be tabular.
The JSON array does not require a processor.
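Because the `json` codec on the `s3` source already splits a JSON array into one event per element, a pipeline for this input type needs no processor. The following is a minimal sketch; the pipeline name, queue URL, hosts, and index are placeholders, and the bucket notification settings are abbreviated:

```yaml
json-array-pipeline:
  source:
    s3:
      codec:
        # Emits one event per element of the incoming JSON array.
        json:
      notification_type: sqs
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder
      aws:
        region: "us-east-1"  # placeholder
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]  # placeholder
        index: "json-array-index"          # placeholder
```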
## NDJSON
Unlike a JSON array, NDJSON delimits each row of data with a newline, meaning data is processed per line instead of as an array.
The NDJSON input type is parsed using the `newline` codec, which parses each line as a single log event. The `parse_json` processor then outputs each line as a single event.
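A minimal sketch of this combination follows; the source settings are abbreviated to the codec, and all names are placeholders:

```yaml
ndjson-pipeline:
  source:
    s3:
      codec:
        # Each newline-delimited line becomes one log event.
        newline:
      # notification, queue, and AWS settings omitted (same shape as above)
  processor:
    # Parses the JSON string in each line into structured fields.
    - parse_json:
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]  # placeholder
        index: "ndjson-index"              # placeholder
```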
## CSV
The CSV data type inputs data as a table. It can be used without a codec or processor, but it does require one or the other, for example, either just the `csv` processor or the `csv` codec.
The CSV input type is most effective when used with the following codec processor combinations.
### `csv` codec
When the `csv` codec is used without a processor, it automatically detects headers from the CSV and uses them for index mapping.
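A sketch of the relevant source fragment follows. It assumes the codec's `detect_header` option, which is enabled by default, so the line is shown only for clarity:

```yaml
source:
  s3:
    codec:
      csv:
        # Use the first line of each file as column names (default behavior).
        detect_header: true
    # notification, queue, and AWS settings omitted
```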
### `newline` codec
The `newline` codec parses each row as a single log event. The codec will only detect a header when `header_destination` is configured. The `csv` processor then outputs the event into columns. The header detected in `header_destination` from the `newline` codec can be used in the `csv` processor under `column_names_source_key`.
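Put together, the combination looks like the following sketch; the pipeline name, key name, hosts, and index are placeholders:

```yaml
csv-s3-pipeline:
  source:
    s3:
      codec:
        newline:
          # Capture each file's header line under this key.
          header_destination: column_names
      # notification, queue, and AWS settings omitted
  processor:
    - csv:
        # Read column names from the header captured by the codec.
        column_names_source_key: column_names
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]  # placeholder
        index: "csv-index"                 # placeholder
```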
## Parquet
Apache Parquet is a columnar storage format built for Hadoop. It is most efficient without the use of a codec. Positive results, however, can be achieved when it's configured with S3 Select.
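For the S3 Select case, the `s3` source can have Amazon S3 scan Parquet objects server side. The following fragment is a sketch assuming the source's `s3_select` option with a SQL expression and Parquet input serialization:

```yaml
source:
  s3:
    s3_select:
      # Let Amazon S3 filter the Parquet object before Data Prepper ingests it.
      expression: "SELECT * FROM s3object s"
      input_serialization: parquet
    # notification, queue, and AWS settings omitted
```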
## Avro
[Apache Avro](https://avro.apache.org/) helps streamline streaming data pipelines. It is most efficient when used with the `avro` codec inside an `s3` sink.
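A sketch of the sink fragment follows. It assumes the `avro` codec's `schema` option, which takes an inline Avro schema; the record definition here is a hypothetical example:

```yaml
sink:
  - s3:
      codec:
        avro:
          # Inline Avro schema describing the outgoing events (hypothetical).
          schema: >
            {
              "type": "record",
              "name": "LogEvent",
              "fields": [
                {"name": "message", "type": "string"}
              ]
            }
      # bucket, object key, and AWS settings omitted
```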