diff --git a/_clients/data-prepper/pipelines.md b/_clients/data-prepper/pipelines.md deleted file mode 100644 index b433c12d..00000000 --- a/_clients/data-prepper/pipelines.md +++ /dev/null @@ -1,293 +0,0 @@ ---- -layout: default -title: Pipelines -parent: Data Prepper -nav_order: 10 ---- - -# Pipelines - -![Data Prepper Pipeline]({{site.url}}{{site.baseurl}}/images/data-prepper-pipeline.png) - -To use Data Prepper, you define pipelines in a configuration YAML file. Each pipeline is a combination of a source, a buffer, zero or more processors, and one or more sinks. For example: - -```yml -simple-sample-pipeline: - workers: 2 # the number of workers - delay: 5000 # in milliseconds, how long workers wait between read attempts - source: - random: - buffer: - bounded_blocking: - buffer_size: 1024 # max number of records the buffer accepts - batch_size: 256 # max number of records the buffer drains after each read - processor: - - string_converter: - upper_case: true - sink: - - stdout: -``` - -- Sources define where your data comes from. In this case, the source is a random UUID generator (`random`). - -- Buffers store data as it passes through the pipeline. - - By default, Data Prepper uses its one and only buffer, the `bounded_blocking` buffer, so you can omit this section unless you developed a custom buffer or need to tune the buffer settings. - -- Processors perform some action on your data: filter, transform, enrich, etc. - - You can have multiple processors, which run sequentially from top to bottom, not in parallel. The `string_converter` processor transform the strings by making them uppercase. - -- Sinks define where your data goes. In this case, the sink is stdout. - -Starting from Data Prepper 2.0, you can define pipelines across multiple configuration YAML files, where each file contains the configuration for one or more pipelines. This gives you more freedom to organize and chain complex pipeline configurations. For Data Prepper to load your pipeline configuration properly, place your configuration YAML files in the `pipelines` folder under your application's home directory (e.g. `/usr/share/data-prepper`). -{: .note } - -## Conditional Routing - -Pipelines also support **conditional routing** which allows you to route Events to different sinks based on specific conditions. To add conditional routing to a pipeline, specify a list of named routes under the `route` component and add specific routes to sinks under the `routes` property. Any sink with the `routes` property will only accept Events that match at least one of the routing conditions. - -In the following example, `application-logs` is a named route with a condition set to `/log_type == "application"`. The route uses [Data Prepper expressions](https://github.com/opensearch-project/data-prepper/tree/main/examples) to define the conditions. Data Prepper only routes events that satisfy the condition to the first OpenSearch sink. By default, Data Prepper routes all Events to a sink which does not define a route. In the example, all Events route into the third OpenSearch sink. 
- -```yml -conditional-routing-sample-pipeline: - source: - http: - processor: - route: - - application-logs: '/log_type == "application"' - - http-logs: '/log_type == "apache"' - sink: - - opensearch: - hosts: [ "https://opensearch:9200" ] - index: application_logs - routes: [application-logs] - - opensearch: - hosts: [ "https://opensearch:9200" ] - index: http_logs - routes: [http-logs] - - opensearch: - hosts: [ "https://opensearch:9200" ] - index: all_logs -``` - - -## Examples - -This section provides some pipeline examples that you can use to start creating your own pipelines. For more information, see [Data Prepper configuration reference]({{site.url}}{{site.baseurl}}/clients/data-prepper/data-prepper-reference/) guide. - -The Data Prepper repository has several [sample applications](https://github.com/opensearch-project/data-prepper/tree/main/examples) to help you get started. - -### Log ingestion pipeline - -The following example demonstrates how to use HTTP source and Grok prepper plugins to process unstructured log data. - -```yml -log-pipeline: - source: - http: - ssl: false - processor: - - grok: - match: - log: [ "%{COMMONAPACHELOG}" ] - sink: - - opensearch: - hosts: [ "https://opensearch:9200" ] - insecure: true - username: admin - password: admin - index: apache_logs -``` - -This example uses weak security. We strongly recommend securing all plugins which open external ports in production environments. -{: .note} - -### Trace analytics pipeline - -The following example demonstrates how to build a pipeline that supports the [Trace Analytics OpenSearch Dashboards plugin]({{site.url}}{{site.baseurl}}/observability-plugin/trace/ta-dashboards/). This pipeline takes data from the OpenTelemetry Collector and uses two other pipelines as sinks. These two separate pipelines index trace and the service map documents for the dashboard plugin. - -Starting from Data Prepper 2.0, Data Prepper no longer supports `otel_trace_raw_prepper` processor due to the Data Prepper internal data model evolution. -Instead, users should use `otel_trace_raw`. - -```yml -entry-pipeline: - delay: "100" - source: - otel_trace_source: - ssl: false - buffer: - bounded_blocking: - buffer_size: 10240 - batch_size: 160 - sink: - - pipeline: - name: "raw-pipeline" - - pipeline: - name: "service-map-pipeline" -raw-pipeline: - source: - pipeline: - name: "entry-pipeline" - buffer: - bounded_blocking: - buffer_size: 10240 - batch_size: 160 - processor: - - otel_trace_raw: - sink: - - opensearch: - hosts: ["https://localhost:9200"] - insecure: true - username: admin - password: admin - index_type: trace-analytics-raw -service-map-pipeline: - delay: "100" - source: - pipeline: - name: "entry-pipeline" - buffer: - bounded_blocking: - buffer_size: 10240 - batch_size: 160 - processor: - - service_map_stateful: - sink: - - opensearch: - hosts: ["https://localhost:9200"] - insecure: true - username: admin - password: admin - index_type: trace-analytics-service-map -``` - -To maintain similar ingestion throughput and latency, scale the `buffer_size` and `batch_size` by the estimated maximum batch size in the client request payload. -{: .tip} - -### Metrics pipeline - -Data Prepper supports metrics ingestion using OTel. It currently supports the following metric types: - -* Gauge -* Sum -* Summary -* Histogram - -Other types are not supported. Data Prepper drops all other types, including Exponential Histogram and Summary. Additionally, Data Prepper does not support Scope instrumentation. 
- -To set up a metrics pipeline: - -```yml -metrics-pipeline: - source: - otel_metrics_source: - processor: - - otel_metrics_raw_processor: - sink: - - opensearch: - hosts: ["https://localhost:9200"] - username: admin - password: admin -``` - -### S3 log ingestion pipeline - -The following example demonstrates how to use the S3 Source and Grok Processor plugins to process unstructured log data -from [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3). This example uses Application Load -Balancer logs. As the Application Load Balancer writes logs to S3, S3 creates notifications in Amazon SQS. Data Prepper -reads those notifications and reads the S3 objects to get the log data and process it. - -```yml -log-pipeline: - source: - s3: - notification_type: "sqs" - compression: "gzip" - codec: - newline: - sqs: - queue_url: "https://sqs.us-east-1.amazonaws.com/12345678910/ApplicationLoadBalancer" - aws: - region: "us-east-1" - sts_role_arn: "arn:aws:iam::12345678910:role/Data-Prepper" - - processor: - - grok: - match: - message: ["%{DATA:type} %{TIMESTAMP_ISO8601:time} %{DATA:elb} %{DATA:client} %{DATA:target} %{BASE10NUM:request_processing_time} %{DATA:target_processing_time} %{BASE10NUM:response_processing_time} %{BASE10NUM:elb_status_code} %{DATA:target_status_code} %{BASE10NUM:received_bytes} %{BASE10NUM:sent_bytes} \"%{DATA:request}\" \"%{DATA:user_agent}\" %{DATA:ssl_cipher} %{DATA:ssl_protocol} %{DATA:target_group_arn} \"%{DATA:trace_id}\" \"%{DATA:domain_name}\" \"%{DATA:chosen_cert_arn}\" %{DATA:matched_rule_priority} %{TIMESTAMP_ISO8601:request_creation_time} \"%{DATA:actions_executed}\" \"%{DATA:redirect_url}\" \"%{DATA:error_reason}\" \"%{DATA:target_list}\" \"%{DATA:target_status_code_list}\" \"%{DATA:classification}\" \"%{DATA:classification_reason}"] - - grok: - match: - request: ["(%{NOTSPACE:http_method})? (%{NOTSPACE:http_uri})? (%{NOTSPACE:http_version})?"] - - grok: - match: - http_uri: ["(%{WORD:protocol})?(://)?(%{IPORHOST:domain})?(:)?(%{INT:http_port})?(%{GREEDYDATA:request_uri})?"] - - date: - from_time_received: true - destination: "@timestamp" - - - sink: - - opensearch: - hosts: [ "https://localhost:9200" ] - username: "admin" - password: "admin" - index: alb_logs -``` - -## Migrating from Logstash - -Data Prepper supports Logstash configuration files for a limited set of plugins. Simply use the logstash config to run Data Prepper. - -```bash -docker run --name data-prepper \ - -v /full/path/to/logstash.conf:/usr/share/data-prepper/pipelines/pipelines.conf \ - opensearchproject/opensearch-data-prepper:latest -``` - -This feature is limited by feature parity of Data Prepper. As of Data Prepper 1.2 release, the following plugins from the Logstash configuration are supported: - -- HTTP Input plugin -- Grok Filter plugin -- Elasticsearch Output plugin -- Amazon Elasticsearch Output plugin - -## Configure the Data Prepper server - -Data Prepper itself provides administrative HTTP endpoints such as `/list` to list pipelines and `/metrics/prometheus` to provide Prometheus-compatible metrics data. The port that has these endpoints has a TLS configuration and is specified by a separate YAML file. By default, these endpoints are secured by Data Prepper docker images. We strongly recommend providing your own configuration file for securing production environments. 
Here is an example `data-prepper-config.yaml`: - -```yml -ssl: true -keyStoreFilePath: "/usr/share/data-prepper/keystore.jks" -keyStorePassword: "password" -privateKeyPassword: "other_password" -serverPort: 1234 -``` - -To configure the Data Prepper server, run Data Prepper with the additional yaml file. - -```bash -docker run --name data-prepper \ - -v /full/path/to/my-pipelines.yaml:/usr/share/data-prepper/pipelines/my-pipelines.yaml \ - -v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml \ - opensearchproject/data-prepper:latest -``` - -## Configure the peer forwarder - -Data Prepper provides an HTTP service to forward Events between Data Prepper nodes for aggregation. This is required for operating Data Prepper in a clustered deployment. Currently, peer forwarding is supported in `aggregate`, `service_map_stateful`, and `otel_trace_raw` processors. Peer forwarder groups events based on the identification keys provided by the processors. For `service_map_stateful` and `otel_trace_raw` it's `traceId` by default and can not be configured. For `aggregate` processor, it is configurable using `identification_keys` option. - -Peer forwarder supports peer discovery through one of three options: a static list, a DNS record lookup , or AWS Cloud Map. This option can be configured using `discovery_mode` option. Peer forwarder also supports SSL for verification and encrytion, and mTLS for mutual authentication in peer forwarding service. - -To configure the peer forwarder, add configuration options to `data-prepper-config.yaml` mentioned in the previous [Configure the Data Prepper server](#configure-the-data-prepper-server) section: - -```yml -peer_forwarder: - discovery_mode: dns - domain_name: "data-prepper-cluster.my-domain.net" - ssl: true - ssl_certificate_file: "" - ssl_key_file: "" - authentication: - mutual_tls: -``` diff --git a/_config.yml b/_config.yml index 49e3df9e..7392a362 100644 --- a/_config.yml +++ b/_config.yml @@ -64,6 +64,12 @@ collections: clients: permalink: /:collection/:path/ output: true + data-prepper: + permalink: /:collection/:path/ + output: true + tools: + permalink: /:collection/:path/ + output: true api-reference: permalink: /:collection/:path/ output: true @@ -113,7 +119,13 @@ just_the_docs: name: Notifications plugin nav_fold: true clients: - name: Clients and tools + name: Clients + nav_fold: true + data-prepper: + name: Data Prepper + nav_fold: true + tools: + name: Tools nav_fold: true api-reference: name: API reference diff --git a/_clients/data-prepper/configure-logstash-data-prepper.md b/_data-prepper/configure-logstash-data-prepper.md similarity index 92% rename from _clients/data-prepper/configure-logstash-data-prepper.md rename to _data-prepper/configure-logstash-data-prepper.md index b58b531b..df3f66f3 100644 --- a/_clients/data-prepper/configure-logstash-data-prepper.md +++ b/_data-prepper/configure-logstash-data-prepper.md @@ -2,7 +2,11 @@ layout: default title: Configure Logstash for Data Prepper parent: Data Prepper +<<<<<<< HEAD:_data-prepper/configure-logstash-data-prepper.md +nav_order: 12 +======= nav_order: 2 +>>>>>>> main:_clients/data-prepper/configure-logstash-data-prepper.md --- # Configure Logstash for Data Prepper You can run Data Prepper with a Logstash configuration. 
diff --git a/_clients/data-prepper/data-prepper-reference.md b/_data-prepper/data-prepper-reference.md similarity index 99% rename from _clients/data-prepper/data-prepper-reference.md rename to _data-prepper/data-prepper-reference.md index 3f495d9f..48dc9472 100644 --- a/_clients/data-prepper/data-prepper-reference.md +++ b/_data-prepper/data-prepper-reference.md @@ -1,8 +1,9 @@ --- layout: default title: Configuration reference -parent: Data Prepper -nav_order: 3 +nav_order: 30 +redirect_from: + - /clients/data-prepper/data-prepper-reference/ --- # Data Prepper configuration reference diff --git a/_clients/data-prepper/get-started.md b/_data-prepper/get-started.md similarity index 97% rename from _clients/data-prepper/get-started.md rename to _data-prepper/get-started.md index 95d3dc6e..5821fa33 100644 --- a/_clients/data-prepper/get-started.md +++ b/_data-prepper/get-started.md @@ -1,8 +1,9 @@ --- layout: default title: Get Started -parent: Data Prepper -nav_order: 1 +nav_order: 10 +redirect_from: + - /clients/data-prepper/get-started/ --- # Get started with Data Prepper diff --git a/_clients/data-prepper/index.md b/_data-prepper/index.md similarity index 96% rename from _clients/data-prepper/index.md rename to _data-prepper/index.md index 22bf87cb..5987586f 100644 --- a/_clients/data-prepper/index.md +++ b/_data-prepper/index.md @@ -1,9 +1,13 @@ --- layout: default -title: Data Prepper -nav_order: 120 -has_children: true +title: Data Prepper +nav_order: 1 +has_children: false has_toc: false +redirect_from: + - /clients/tools/data-prepper/ + - /clients/data-prepper/ + - /clients/data-prepper/index/ --- # Data Prepper diff --git a/_clients/data-prepper/migrate-open-distro.md b/_data-prepper/migrate-open-distro.md similarity index 89% rename from _clients/data-prepper/migrate-open-distro.md rename to _data-prepper/migrate-open-distro.md index e6f1deca..8e8b98b4 100644 --- a/_clients/data-prepper/migrate-open-distro.md +++ b/_data-prepper/migrate-open-distro.md @@ -2,7 +2,11 @@ layout: default title: Migrating from Open Distro Data Prepper parent: Data Prepper +<<<<<<< HEAD:_data-prepper/migrate-open-distro.md +nav_order: 11 +======= nav_order: 2 +>>>>>>> main:_clients/data-prepper/migrate-open-distro.md --- # Migrating from Open Distro Data Prepper diff --git a/_data-prepper/pipelines.md b/_data-prepper/pipelines.md new file mode 100644 index 00000000..a4edb22a --- /dev/null +++ b/_data-prepper/pipelines.md @@ -0,0 +1,582 @@ +--- +layout: default +title: Pipelines +nav_order: 20 +redirect_from: + - /clients/data-prepper/pipelines/ +--- + +# Pipelines + +![Data Prepper Pipeline]({{site.url}}{{site.baseurl}}/images/data-prepper-pipeline.png) + +To use Data Prepper, you define pipelines in a configuration YAML file. Each pipeline is a combination of a source, a buffer, zero or more processors, and one or more sinks. For example: + +```yml +simple-sample-pipeline: + workers: 2 # the number of workers + delay: 5000 # in milliseconds, how long workers wait between read attempts + source: + random: + buffer: + bounded_blocking: + buffer_size: 1024 # max number of records the buffer accepts + batch_size: 256 # max number of records the buffer drains after each read + processor: + - string_converter: + upper_case: true + sink: + - stdout: +``` + +- Sources define where your data comes from. In this case, the source is a random UUID generator (`random`). + +- Buffers store data as it passes through the pipeline. 
+ + By default, Data Prepper uses its one and only buffer, the `bounded_blocking` buffer, so you can omit this section unless you developed a custom buffer or need to tune the buffer settings. + +- Processors perform some action on your data: filter, transform, enrich, etc. + + You can have multiple processors, which run sequentially from top to bottom, not in parallel. The `string_converter` processor transform the strings by making them uppercase. + +- Sinks define where your data goes. In this case, the sink is stdout. + +Starting from Data Prepper 2.0, you can define pipelines across multiple configuration YAML files, where each file contains the configuration for one or more pipelines. This gives you more freedom to organize and chain complex pipeline configurations. For Data Prepper to load your pipeline configuration properly, place your configuration YAML files in the `pipelines` folder under your application's home directory (e.g. `/usr/share/data-prepper`). +{: .note } + +## Conditional Routing + +Pipelines also support **conditional routing** which allows you to route Events to different sinks based on specific conditions. To add conditional routing to a pipeline, specify a list of named routes under the `route` component and add specific routes to sinks under the `routes` property. Any sink with the `routes` property will only accept Events that match at least one of the routing conditions. + +In the following example, `application-logs` is a named route with a condition set to `/log_type == "application"`. The route uses [Data Prepper expressions](https://github.com/opensearch-project/data-prepper/tree/main/examples) to define the conditions. Data Prepper only routes events that satisfy the condition to the first OpenSearch sink. By default, Data Prepper routes all Events to a sink which does not define a route. In the example, all Events route into the third OpenSearch sink. + +```yml +conditional-routing-sample-pipeline: + source: + http: + processor: + route: + - application-logs: '/log_type == "application"' + - http-logs: '/log_type == "apache"' + sink: + - opensearch: + hosts: [ "https://opensearch:9200" ] + index: application_logs + routes: [application-logs] + - opensearch: + hosts: [ "https://opensearch:9200" ] + index: http_logs + routes: [http-logs] + - opensearch: + hosts: [ "https://opensearch:9200" ] + index: all_logs +``` + + +## Examples + +This section provides some pipeline examples that you can use to start creating your own pipelines. For more information, see [Data Prepper configuration reference]({{site.url}}{{site.baseurl}}/clients/data-prepper/data-prepper-reference/) guide. + +The Data Prepper repository has several [sample applications](https://github.com/opensearch-project/data-prepper/tree/main/examples) to help you get started. + +### Log ingestion pipeline + +The following example demonstrates how to use HTTP source and Grok prepper plugins to process unstructured log data. + +```yml +log-pipeline: + source: + http: + ssl: false + processor: + - grok: + match: + log: [ "%{COMMONAPACHELOG}" ] + sink: + - opensearch: + hosts: [ "https://opensearch:9200" ] + insecure: true + username: admin + password: admin + index: apache_logs +``` + +This example uses weak security. We strongly recommend securing all plugins which open external ports in production environments. 
+{: .note} + +### Trace analytics pipeline + +The following example demonstrates how to build a pipeline that supports the [Trace Analytics OpenSearch Dashboards plugin]({{site.url}}{{site.baseurl}}/observability-plugin/trace/ta-dashboards/). This pipeline takes data from the OpenTelemetry Collector and uses two other pipelines as sinks. These two separate pipelines index trace and the service map documents for the dashboard plugin. + +Starting from Data Prepper 2.0, Data Prepper no longer supports `otel_trace_raw_prepper` processor due to the Data Prepper internal data model evolution. +Instead, users should use `otel_trace_raw`. + +```yml +entry-pipeline: + delay: "100" + source: + otel_trace_source: + ssl: false + buffer: + bounded_blocking: + buffer_size: 10240 + batch_size: 160 + sink: + - pipeline: + name: "raw-pipeline" + - pipeline: + name: "service-map-pipeline" +raw-pipeline: + source: + pipeline: + name: "entry-pipeline" + buffer: + bounded_blocking: + buffer_size: 10240 + batch_size: 160 + processor: + - otel_trace_raw: + sink: + - opensearch: + hosts: ["https://localhost:9200"] + insecure: true + username: admin + password: admin + index_type: trace-analytics-raw +service-map-pipeline: + delay: "100" + source: + pipeline: + name: "entry-pipeline" + buffer: + bounded_blocking: + buffer_size: 10240 + batch_size: 160 + processor: + - service_map_stateful: + sink: + - opensearch: + hosts: ["https://localhost:9200"] + insecure: true + username: admin + password: admin + index_type: trace-analytics-service-map +``` + +To maintain similar ingestion throughput and latency, scale the `buffer_size` and `batch_size` by the estimated maximum batch size in the client request payload. +{: .tip} + +### Metrics pipeline + +Data Prepper supports metrics ingestion using OTel. It currently supports the following metric types: + +* Gauge +* Sum +* Summary +* Histogram + +Other types are not supported. Data Prepper drops all other types, including Exponential Histogram and Summary. Additionally, Data Prepper does not support Scope instrumentation. + +To set up a metrics pipeline: + +```yml +metrics-pipeline: + source: + otel_metrics_source: + processor: + - otel_metrics_raw_processor: + sink: + - opensearch: + hosts: ["https://localhost:9200"] + username: admin + password: admin +``` + +### S3 log ingestion pipeline + +The following example demonstrates how to use the S3 Source and Grok Processor plugins to process unstructured log data +from [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3). This example uses Application Load +Balancer logs. As the Application Load Balancer writes logs to S3, S3 creates notifications in Amazon SQS. Data Prepper +reads those notifications and reads the S3 objects to get the log data and process it. 
+ +```yml +log-pipeline: + source: + s3: + notification_type: "sqs" + compression: "gzip" + codec: + newline: + sqs: + queue_url: "https://sqs.us-east-1.amazonaws.com/12345678910/ApplicationLoadBalancer" + aws: + region: "us-east-1" + sts_role_arn: "arn:aws:iam::12345678910:role/Data-Prepper" + + processor: + - grok: + match: + message: ["%{DATA:type} %{TIMESTAMP_ISO8601:time} %{DATA:elb} %{DATA:client} %{DATA:target} %{BASE10NUM:request_processing_time} %{DATA:target_processing_time} %{BASE10NUM:response_processing_time} %{BASE10NUM:elb_status_code} %{DATA:target_status_code} %{BASE10NUM:received_bytes} %{BASE10NUM:sent_bytes} \"%{DATA:request}\" \"%{DATA:user_agent}\" %{DATA:ssl_cipher} %{DATA:ssl_protocol} %{DATA:target_group_arn} \"%{DATA:trace_id}\" \"%{DATA:domain_name}\" \"%{DATA:chosen_cert_arn}\" %{DATA:matched_rule_priority} %{TIMESTAMP_ISO8601:request_creation_time} \"%{DATA:actions_executed}\" \"%{DATA:redirect_url}\" \"%{DATA:error_reason}\" \"%{DATA:target_list}\" \"%{DATA:target_status_code_list}\" \"%{DATA:classification}\" \"%{DATA:classification_reason}"] + - grok: + match: + request: ["(%{NOTSPACE:http_method})? (%{NOTSPACE:http_uri})? (%{NOTSPACE:http_version})?"] + - grok: + match: + http_uri: ["(%{WORD:protocol})?(://)?(%{IPORHOST:domain})?(:)?(%{INT:http_port})?(%{GREEDYDATA:request_uri})?"] + - date: + from_time_received: true + destination: "@timestamp" + + + sink: + - opensearch: + hosts: [ "https://localhost:9200" ] + username: "admin" + password: "admin" + index: alb_logs +``` + +## Migrating from Logstash + +Data Prepper supports Logstash configuration files for a limited set of plugins. Simply use the logstash config to run Data Prepper. + +```bash +docker run --name data-prepper \ + -v /full/path/to/logstash.conf:/usr/share/data-prepper/pipelines/pipelines.conf \ + opensearchproject/opensearch-data-prepper:latest +``` + +This feature is limited by feature parity of Data Prepper. As of Data Prepper 1.2 release, the following plugins from the Logstash configuration are supported: + +- HTTP Input plugin +- Grok Filter plugin +- Elasticsearch Output plugin +- Amazon Elasticsearch Output plugin + +## Configure the Data Prepper server + +Data Prepper itself provides administrative HTTP endpoints such as `/list` to list pipelines and `/metrics/prometheus` to provide Prometheus-compatible metrics data. The port that has these endpoints has a TLS configuration and is specified by a separate YAML file. By default, these endpoints are secured by Data Prepper docker images. We strongly recommend providing your own configuration file for securing production environments. Here is an example `data-prepper-config.yaml`: + +```yml +ssl: true +keyStoreFilePath: "/usr/share/data-prepper/keystore.jks" +keyStorePassword: "password" +privateKeyPassword: "other_password" +serverPort: 1234 +``` + +To configure the Data Prepper server, run Data Prepper with the additional yaml file. + +```bash +docker run --name data-prepper \ + -v /full/path/to/my-pipelines.yaml:/usr/share/data-prepper/pipelines/my-pipelines.yaml \ + -v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml \ + opensearchproject/data-prepper:latest +``` + +## Configure the peer forwarder + +Data Prepper provides an HTTP service to forward Events between Data Prepper nodes for aggregation. This is required for operating Data Prepper in a clustered deployment. Currently, peer forwarding is supported in `aggregate`, `service_map_stateful`, and `otel_trace_raw` processors. 
Peer forwarder groups events based on the identification keys provided by the processors. For `service_map_stateful` and `otel_trace_raw`, the key is `traceId` by default and cannot be configured. For the `aggregate` processor, the key is configurable using the `identification_keys` option.
+
+Peer forwarder supports peer discovery through one of three options: a static list, a DNS record lookup, or AWS Cloud Map. You can configure the discovery method using the `discovery_mode` option. Peer forwarder also supports SSL for verification and encryption, as well as mTLS for mutual authentication, in the peer forwarding service.
+
+To configure the peer forwarder, add configuration options to the `data-prepper-config.yaml` file described in the previous [Configure the Data Prepper server](#configure-the-data-prepper-server) section:
+
+```yml
+peer_forwarder:
+  discovery_mode: dns
+  domain_name: "data-prepper-cluster.my-domain.net"
+  ssl: true
+  ssl_certificate_file: ""
+  ssl_key_file: ""
+  authentication:
+    mutual_tls:
+```
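+
+Peer forwarder can also use a static list of peers instead of DNS lookup. The following is a minimal sketch of a static discovery configuration; the `static_endpoints` option, host names, and certificate paths shown here are illustrative assumptions, so verify the exact option names against the peer forwarder documentation for your Data Prepper version.
+
+```yml
+peer_forwarder:
+  # Static discovery: each node lists the addresses of all Data Prepper nodes in the cluster.
+  discovery_mode: static
+  # Hypothetical host names; replace them with your own node addresses.
+  static_endpoints: ["data-prepper-node-1.my-domain.net", "data-prepper-node-2.my-domain.net"]
+  ssl: true
+  ssl_certificate_file: "/usr/share/data-prepper/certs/peer-forwarder.crt"
+  ssl_key_file: "/usr/share/data-prepper/certs/peer-forwarder.key"
+```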
diff --git a/_clients/cli.md b/_tools/cli.md similarity index 99% rename from _clients/cli.md rename to _tools/cli.md index e08bcd8c..05088fe7 100644 --- a/_clients/cli.md +++ b/_tools/cli.md @@ -1,7 +1,7 @@ --- layout: default title: OpenSearch CLI -nav_order: 80 +nav_order: 70 has_children: false --- diff --git a/_clients/grafana.md b/_tools/grafana.md similarity index 100% rename from _clients/grafana.md rename to _tools/grafana.md diff --git a/_clients/agents-and-ingestion-tools/index.md b/_tools/index.md similarity index 82% rename from _clients/agents-and-ingestion-tools/index.md rename to _tools/index.md index d8f3dd80..be8c79b7 100644 --- a/_clients/agents-and-ingestion-tools/index.md +++ b/_tools/index.md @@ -1,14 +1,21 @@ --- layout: default -title: Agents and ingestion tools -nav_order: 140 +title: Tools +nav_order: 50 has_children: false -has_toc: false redirect_from: - /clients/agents-and-ingestion-tools/ --- -# Agents and ingestion tools +# OpenSearch tools + +This section provides documentation for OpenSearch-supported tools, including: + +- [Agents and ingestion tools](#agents-and-ingestion-tools) +- [OpenSearch CLI](#opensearch-cli) +- [OpenSearch Kubernetes operator](#opensearch-kubernetes-operator) + +## Agents and ingestion tools Historically, many multiple popular agents and ingestion tools have worked with Elasticsearch OSS, such as Beats, Logstash, Fluentd, FluentBit, and OpenTelemetry.
OpenSearch aims to continue to support a broad set of agents and ingestion tools, but not all have been tested or have explicitly added OpenSearch compatibility. @@ -39,7 +46,7 @@ Logstash OSS 8.0 introduces a breaking change where all plugins run in ECS compa ecs_compatibility => disabled ``` -## Downloads +### Downloads You can download the OpenSearch output plugin for Logstash from [OpenSearch downloads](https://opensearch.org/downloads.html). The Logstash output plugin is compatible with OpenSearch and Elasticsearch OSS (7.10.2 or lower). @@ -92,4 +99,12 @@ Some users report compatibility issues with ingest pipelines on these versions o \*\* Beats OSS includes all Apache 2.0 Beats agents (i.e. Filebeat, Metricbeat, Auditbeat, Heartbeat, Winlogbeat, Packetbeat). Beats versions newer than 7.12.x are not supported by OpenSearch. If you must update the Beats agent(s) in your environment to a newer version, you can work around the incompatibility by directing traffic from Beats to Logstash and using the Logstash Output plugin to ingest the data to OpenSearch. -{: .warning } \ No newline at end of file +{: .warning } + +## OpenSearch CLI + +The OpenSearch CLI command line interface (opensearch-cli) lets you manage your OpenSearch cluster from the command line and automate tasks. For more information on OpenSearch CLI, see [OpenSearch CLI]({{site.url}}{{site.baseurl}}/tools/cli/). + +## OpenSearch Kubernetes operator + +The OpenSearch Kubernetes (K8s) Operator is an open-source kubernetes operator that helps automate the deployment and provisioning of OpenSearch and OpenSearch Dashboards in a containerized environment. For information on how to use the K8s operator, see [OpenSearch Kubernetes operator]({{site.url}}{{site.baseurl}}/tools/k8s-operator/) \ No newline at end of file diff --git a/_clients/k8s-operator.md b/_tools/k8s-operator.md similarity index 99% rename from _clients/k8s-operator.md rename to _tools/k8s-operator.md index b481a6f2..3f9f8512 100644 --- a/_clients/k8s-operator.md +++ b/_tools/k8s-operator.md @@ -1,7 +1,8 @@ --- layout: default title: OpenSearch Kubernetes Operator -nav_order: 180 +nav_order: 80 +has_children: false --- The OpenSearch Kubernetes Operator is an open-source kubernetes operator that helps automate the deployment and provisioning of OpenSearch and OpenSearch Dashboards in a containerized environment. The operator can manage multiple OpenSearch clusters that can be scaled up and down depending on your needs. 
diff --git a/_clients/logstash/advanced-config.md b/_tools/logstash/advanced-config.md similarity index 100% rename from _clients/logstash/advanced-config.md rename to _tools/logstash/advanced-config.md diff --git a/_clients/logstash/common-filters.md b/_tools/logstash/common-filters.md similarity index 100% rename from _clients/logstash/common-filters.md rename to _tools/logstash/common-filters.md diff --git a/_clients/logstash/execution-model.md b/_tools/logstash/execution-model.md similarity index 100% rename from _clients/logstash/execution-model.md rename to _tools/logstash/execution-model.md diff --git a/_clients/logstash/index.md b/_tools/logstash/index.md similarity index 100% rename from _clients/logstash/index.md rename to _tools/logstash/index.md diff --git a/_clients/logstash/read-from-opensearch.md b/_tools/logstash/read-from-opensearch.md similarity index 100% rename from _clients/logstash/read-from-opensearch.md rename to _tools/logstash/read-from-opensearch.md diff --git a/_clients/logstash/ship-to-opensearch.md b/_tools/logstash/ship-to-opensearch.md similarity index 100% rename from _clients/logstash/ship-to-opensearch.md rename to _tools/logstash/ship-to-opensearch.md diff --git a/_upgrade-to/index.md b/_upgrade-to/index.md index b59bc428..71f7e940 100644 --- a/_upgrade-to/index.md +++ b/_upgrade-to/index.md @@ -18,7 +18,7 @@ Three approaches exist: Regardless of your approach, to safeguard against data loss, we recommend that you take a [snapshot]({{site.url}}{{site.baseurl}}/opensearch/snapshots/snapshot-restore) of all indexes prior to any migration. -If your existing clients include a version check, such as recent versions of Logstash OSS and Filebeat OSS, [check compatibility]({{site.url}}{{site.baseurl}}/clients/agents-and-ingestion-tools/index/) before upgrading. +If your existing clients include a version check, such as recent versions of Logstash OSS and Filebeat OSS, [check compatibility]({{site.url}}{{site.baseurl}}/tools/index/#compatibility-matrices) before upgrading. ## Upgrading from Open Distro