---
layout: default
title: s3
parent: Sinks
grand_parent: Pipelines
nav_order: 55
---

# s3

The `s3` sink saves batches of events to [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/) objects.

## Usage

The following example creates a pipeline configured with an `s3` sink. It includes options that customize the event and size thresholds at which the pipeline writes objects to S3 and sets the codec type to `ndjson`:

```
pipeline:
  ...
  sink:
    - s3:
        aws:
          region: us-east-1
          sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
          sts_header_overrides:
        max_retries: 5
        bucket:
          name: bucket_name
        object_key:
          path_prefix: my-elb/%{yyyy}/%{MM}/%{dd}/
        threshold:
          event_count: 2000
          maximum_size: 50mb
          event_collect_timeout: 15s
        codec:
          ndjson:
        buffer_type: in_memory
```

## Configuration

Use the following options when customizing the `s3` sink.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`bucket` | Yes | String | The S3 bucket to which the sink writes objects. The `name` must match the name of your object store.
`codec` | Yes | [Codec](#codec) | The codec that determines how the data is serialized in the S3 object. See [codec](#codec) for more information.
`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.
`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3. See [Threshold configuration](#threshold-configuration) for more information.
`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` and the `file_pattern` of the object store. Defaults to the S3 object `events-%{yyyy-MM-dd'T'hh-mm-ss}` found inside the root directory of the bucket.
`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type. See [Buffer type](#buffer-type) for more information.
`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`.

## aws

Option | Required | Type | Description
:--- | :--- | :--- | :---
`region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon S3. Defaults to `null`, which uses the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
`sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin.
`sts_external_id` | No | String | The external ID to attach to AssumeRole requests from AWS STS.
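
For example, a minimal sketch of an `aws` block that assumes a role using an external ID. The account ID, role name, and external ID values are placeholders:

```
aws:
  region: us-east-1
  # Role to assume for S3 requests; the external ID is passed on AssumeRole calls
  sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
  sts_external_id: my-external-id
```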

## Threshold configuration

Use the following options to set ingestion thresholds for the `s3` sink.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`event_count` | Yes | Integer | The maximum number of events to accumulate before writing an object to S3.
`maximum_size` | Yes | String | The maximum number of bytes, after compression, to accumulate before writing an object to S3. Defaults to `50mb`.
`event_collect_timeout` | Yes | String | Sets the time period during which events are collected before they are written to S3. All values are strings that represent a duration, either in ISO 8601 notation, such as `PT20.345S`, or in simple notation, such as `60s` or `1500ms`.
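
For example, the following threshold sketch uses illustrative values and shows the ISO 8601 notation for the timeout:

```
threshold:
  event_count: 10000             # flush after 10,000 events accumulate
  maximum_size: 25mb             # or after 25 MB of compressed data
  event_collect_timeout: PT30S   # or after 30 seconds (ISO 8601 notation)
```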

## Buffer type

`buffer_type` is an optional configuration that determines how events are stored temporarily before they are flushed to an S3 bucket. The default value is `in_memory`. Use one of the following options:

- `in_memory`: Stores the record in memory.
- `local_file`: Flushes the record into a file on your machine.
- `multipart`: Writes using the [S3 multipart upload](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html). Every 10 MB is written as a part.
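
For example, a minimal sketch that buffers records to local disk instead of memory, assuming batches too large to hold in memory:

```
sink:
  - s3:
      # ... other s3 sink options omitted
      # Accumulate records in a temporary file on disk rather than in memory
      buffer_type: local_file
```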

## Object key configuration

Option | Required | Type | Description
:--- | :--- | :--- | :---
`path_prefix` | No | String | The S3 key prefix path to use. Accepts date-time formatting. For example, you can use `%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3. By default, events are written to the root of the bucket.
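
For example, the following sketch (the `logs/` prefix is illustrative) creates hourly folders:

```
object_key:
  # Objects are written under keys such as logs/<year>/<month>/<day>/<hour>/events-...
  path_prefix: logs/%{yyyy}/%{MM}/%{dd}/%{HH}/
```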

## codec

The `codec` determines how the `s3` sink formats the data written to each S3 object.

### avro codec

The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document.

Because Avro requires a schema, you can either define the schema yourself or let Data Prepper generate one automatically. In general, you should define your own schema because it will most accurately reflect your needs.

We recommend that you make your Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions). Without the null union, each field must be present or the data will fail to write to the sink. If you can be certain that every event has a given field, you can make that field non-nullable.

When you provide your own Avro schema, that schema defines the final structure of your data. Therefore, any values in incoming events that are not mapped in the Avro schema are not included in the final destination. To avoid confusion between a custom Avro schema and the `include_keys` or `exclude_keys` sink configurations, Data Prepper does not allow the use of `include_keys` or `exclude_keys` with a custom schema.

In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event received by the codec and contain only the keys from that event. Therefore, all events must contain those same keys for the automatically generated schema to work. Automatically generated schemas make all fields nullable. Use the sink's `include_keys` and `exclude_keys` configurations to control which data is included in the auto-generated schema.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to `true`.
`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration) from the first event.
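
For example, a minimal `avro` codec sketch with a single field that uses a null union. The record and field names are illustrative:

```
codec:
  avro:
    # A nullable "message" field: events missing it still write successfully
    schema: >
      {
        "type": "record",
        "name": "Event",
        "fields": [
          {"name": "message", "type": ["null", "string"], "default": null}
        ]
      }
```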

### ndjson codec

The `ndjson` codec writes each event as a JSON object on its own line.

The `ndjson` codec does not take any configurations.

### json codec

The `json` codec writes events in a single large JSON file. Each event is written into an object within a JSON array.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`key_name` | No | String | The name of the key for the JSON array. Default is `events`.
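
For example, a sketch that writes the event array under a custom key (the key name is illustrative):

```
codec:
  json:
    # Objects are written as {"my_events": [ {...}, {...} ]}
    key_name: my_events
```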

### parquet codec

The `parquet` codec writes events into a Parquet file. When using the Parquet codec, set the `buffer_type` to `in_memory`.

The Parquet codec writes data using an Avro schema. Because Parquet requires an Avro schema, you can either define the schema yourself or let Data Prepper generate one automatically. However, we generally recommend that you define your own schema so that it can best meet your needs.

For details on the Avro schema and recommendations, see the [avro codec](#avro-codec) documentation.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to `true`.
`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration) from the first event.
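
For example, a sketch of a `parquet` codec configuration that reuses the illustrative schema from the [avro codec](#avro-codec) example and sets the required buffer type:

```
sink:
  - s3:
      # ... bucket, aws, and threshold configuration omitted
      codec:
        parquet:
          schema: >
            {
              "type": "record",
              "name": "Event",
              "fields": [
                {"name": "message", "type": ["null", "string"], "default": null}
              ]
            }
      # The Parquet codec requires the in_memory buffer type
      buffer_type: in_memory
```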