layout	title	parent	grand_parent	nav_order
default	s3	Sinks	Pipelines	55

s3

The s3 sink sends records to an Amazon Simple Storage Service (Amazon S3) bucket using the S3 client.

Usage

The following example creates a pipeline configured with an s3 sink. It sets the event count and size thresholds that determine when the pipeline writes records to S3, and it sets the codec type to ndjson:

pipeline:
  ...
  sink:
    - s3:
        aws:
          region: us-east-1
          sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
          sts_header_overrides:
        max_retries: 5
        bucket:
          name: bucket_name
          object_key:
            path_prefix: my-elb/%{yyyy}/%{MM}/%{dd}/
        threshold:
          event_count: 2000
          maximum_size: 50mb
          event_collect_timeout: 15s
        codec:
          ndjson:
        buffer_type: in_memory

Configuration

Use the following options when customizing the s3 sink.

Option	Required	Type	Description
`bucket`	Yes	String	The S3 bucket to which the sink writes objects. The `name` must match the name of your bucket.
`codec`	Yes	Codec	Determines how the sink formats the data written to each S3 object. See codec for more information.
`aws`	Yes	AWS	The AWS configuration. See aws for more information.
`threshold`	Yes	Threshold	Configures when to write an object to S3.
`object_key`	No	Object key	Sets the `path_prefix` and the `file_pattern` of the objects written to S3. By default, objects are named `events-%{yyyy-MM-dd'T'hh-mm-ss}` and written to the root directory of the bucket.
`compression`	No	String	The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
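`buffer_type`	No	Buffer type	Determines how events are stored temporarily before they are flushed to S3. Default is `in_memory`. See Buffer type for more information.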
`max_retries`	No	Integer	The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`.
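
As a minimal sketch of how the optional settings in the preceding table sit alongside the required ones, the following fragment enables gzip compression and lowers the retry count. The value `3` is an arbitrary illustration, and the required `aws`, `bucket`, `threshold`, and `codec` settings are omitted:

```yaml
sink:
  - s3:
      # Required aws, bucket, threshold, and codec settings omitted for brevity.
      compression: gzip   # one of none, gzip, or snappy
      max_retries: 3      # illustrative value; the default is 5
```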

aws

Option	Required	Type	Description
`region`	No	String	The AWS Region to use for credentials. Defaults to standard SDK behavior to determine the Region.
`sts_role_arn`	No	String	The AWS Security Token Service (AWS STS) role to assume for requests to Amazon S3. Defaults to `null`, which will use the standard SDK behavior for credentials.
`sts_header_overrides`	No	Map	A map of header overrides to apply when assuming the IAM role for the sink plugin.
`sts_external_id`	No	String	The external ID to attach to AssumeRole requests from AWS STS.
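
The following fragment is a sketch of an `aws` block that uses every option in the preceding table. The external ID and the header name and value are hypothetical placeholders, not values required by Data Prepper or AWS:

```yaml
aws:
  region: us-east-1
  sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
  # Hypothetical external ID attached to the AssumeRole request.
  sts_external_id: example-external-id
  # Hypothetical header override; replace with the headers your setup needs.
  sts_header_overrides:
    x-example-header: example-value
```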

Threshold configuration

Use the following options to set ingestion thresholds for the s3 sink.

Option	Required	Type	Description
`event_count`	Yes	Integer	The maximum number of events to collect before writing an object to S3.
`maximum_size`	Yes	String	The maximum number of bytes, measured after compression, to collect before writing an object to S3. Defaults to `50mb`.
`event_collect_timeout`	Yes	String	The maximum amount of time to collect events before writing an object to S3. The value is a string that represents a duration, either in ISO 8601 notation, such as `PT20.345S`, or in simple notation, such as `60s` or `1500ms`.
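
As an illustration of the two duration notations, the following sketch sets the timeout using an ISO 8601 duration. The specific numbers are arbitrary examples, not recommendations:

```yaml
threshold:
  event_count: 5000            # illustrative event limit
  maximum_size: 20mb           # illustrative size limit
  event_collect_timeout: PT2M  # ISO 8601 notation for 2 minutes; 120s is the simple-notation equivalent
```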

Buffer type

buffer_type is an optional configuration that determines how events are stored temporarily before they are flushed into an S3 bucket. The default value is in_memory. Use one of the following options:

in_memory: Stores the record in memory.
local_file: Flushes the record into a file on your machine.
multipart: Writes using the S3 multipart upload. Every 10 MB is written as a part.
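
For example, a sink that buffers events in a local file rather than in memory might be sketched as follows, with the other sink settings omitted:

```yaml
sink:
  - s3:
      # Required aws, bucket, threshold, and codec settings omitted for brevity.
      buffer_type: local_file   # alternatives: in_memory (default) or multipart
```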

Object key configuration

Option	Required	Type	Description
`path_prefix`	Yes	String	The S3 key prefix path to use. Accepts date-time formatting. For example, you can use `%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3. By default, events write to the root of the bucket.
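
The hourly layout described in the preceding table could be sketched as the following fragment. The leading `logs/` segment is a hypothetical prefix, and where the `object_key` block is nested should follow the usage example at the top of this page:

```yaml
object_key:
  # Hypothetical "logs/" prefix followed by year/month/day/hour folders.
  path_prefix: logs/%{yyyy}/%{MM}/%{dd}/%{HH}/
```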

codec

The codec determines how the s3 sink formats data written to each S3 object.

avro codec

The avro codec writes an event as an Apache Avro document.

Because Avro requires a schema, you can either define the schema yourself or have Data Prepper generate one automatically. In general, you should define your own schema because it will most accurately reflect your needs. In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event received by the codec and contain only the keys from that event. Therefore, all keys must be present in all events in order for the automatically generated schema to work. Automatically generated schemas make all fields nullable.

Option	Required	Type	Description
`schema`	Yes	String	The Avro schema declaration. Not required if `auto_schema` is set to true.
`auto_schema`	No	Boolean	When set to `true`, automatically generates the Avro schema declaration from the first event.
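
The following is a minimal sketch of an avro codec with a user-defined schema. The namespace, record name, and field names are hypothetical placeholders. Each field is declared as a union with `null` so that events missing a value can still be written:

```yaml
sink:
  - s3:
      # Required aws, bucket, and threshold settings omitted for brevity.
      codec:
        avro:
          # Hypothetical schema; replace the record and field names with your own.
          schema: >
            {
              "type": "record",
              "namespace": "org.example",
              "name": "AccessLog",
              "fields": [
                { "name": "status", "type": ["null", "int"] },
                { "name": "message", "type": ["null", "string"] }
              ]
            }
```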

ndjson codec

The ndjson codec writes each event as a JSON object on its own line.

The ndjson codec does not take any configurations.

json codec

The json codec writes events in a single large JSON file. Each event is written into an object within a JSON array.

Option	Required	Type	Description
`key_name`	No	String	The name of the key for the JSON array. Default is `events`.
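
As a sketch, the following fragment renames the array key. The name `records` is a hypothetical choice:

```yaml
sink:
  - s3:
      # Required aws, bucket, and threshold settings omitted for brevity.
      codec:
        json:
          key_name: records   # hypothetical name; the default is events
```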

parquet codec

The parquet codec writes events into a Parquet file. You must set the buffer_type to multipart when using Parquet.

The parquet codec writes data using an Avro schema. In general, you should define your own schema because it will most accurately reflect your needs. In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event received by the codec and contain only the keys from that event, so all keys must be present in all events for the automatically generated schema to work. Automatically generated schemas make all fields nullable.

Option	Required	Type	Description
`schema`	Yes	String	The Avro schema declaration. Not required if `auto_schema` is set to true.
`auto_schema`	No	Boolean	When set to `true`, automatically generates the Avro schema declaration from the first event.
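
Because the parquet codec requires the multipart buffer type, the two settings are configured together. The following sketch assumes uniform events and therefore uses an automatically generated schema:

```yaml
sink:
  - s3:
      # Required aws, bucket, and threshold settings omitted for brevity.
      codec:
        parquet:
          auto_schema: true     # schema is generated from the first event
      buffer_type: multipart    # required when using the parquet codec
```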
