Data Prepper is an independent component, not an OpenSearch plugin, that converts data for use with OpenSearch. It's not bundled with the all-in-one OpenSearch installation packages.
## Install Data Prepper
To use the Docker image, pull it like any other image:
Otherwise, [download](https://opensearch.org/downloads.html) the appropriate archive for your operating system and unzip it.
## Configure pipelines
To use Data Prepper, you define pipelines in a configuration YAML file. Each pipeline is a combination of a source, a buffer, zero or more preppers, and one or more sinks:
```yml
sample-pipeline:
workers: 4 # the number of workers
delay: 100 # in milliseconds, how long workers wait between read attempts
buffer_size: 1024 # max number of records the buffer accepts
batch_size: 256 # max number of records the buffer drains after each read
prepper:
- otel_trace_raw_prepper:
sink:
- opensearch:
hosts: ["https:localhost:9200"]
cert: "config/root-ca.pem"
username: "ta-user"
password: "ta-password"
trace_analytics_raw: true
```
- Sources define where your data comes from. In this case, the source is the OpenTelemetry Collector (`otel_trace_source`) with some optional SSL settings.
- Buffers store data as it passes through the pipeline.
By default, Data Prepper uses its one and only buffer, the `bounded_blocking` buffer, so you can omit this section unless you developed a custom buffer or need to tune the buffer settings.
- Preppers perform some action on your data: filter, transform, enrich, etc.
You can have multiple preppers, which run sequentially from top to bottom, not in parallel. The `otel_trace_raw_prepper` prepper converts OpenTelemetry data into OpenSearch-compatible JSON documents.
- Sinks define where your data goes. In this case, the sink is an OpenSearch cluster.
Pipelines can act as the source for other pipelines. In the following example, a pipeline takes data from the OpenTelemetry Collector and uses two other pipelines as sinks:
Data Prepper itself provides administrative HTTP endpoints such as `/list` to list pipelines and `/metrics/prometheus` to provide Prometheus-compatible metrics data. The port which serves these endpoints, as well as TLS configuration, is specified by a separate YAML file. Example:
For production workloads, you likely want to run Data Prepper on a dedicated machine, which makes connectivity a concern. Data Prepper uses port 21890 and must be able to connect to both the OpenTelemetry Collector and the OpenSearch cluster. In the [sample applications](https://github.com/opensearch-project/Data-Prepper/tree/main/examples), you can see that all components use the same Docker network and expose the appropriate ports.