diff --git a/_clients/data-prepper/data-prepper-reference.md b/_clients/data-prepper/data-prepper-reference.md
index 3f38252f..79d44531 100644
--- a/_clients/data-prepper/data-prepper-reference.md
+++ b/_clients/data-prepper/data-prepper-reference.md
@@ -17,8 +17,60 @@ ssl | No | Boolean | Indicates whether TLS should be used for server APIs. Defau
 keyStoreFilePath | No | String | Path to a .jks or .p12 keystore file. Required if `ssl` is true.
 keyStorePassword | No | String | Password for keystore. Optional, defaults to empty string.
 privateKeyPassword | No | String | Password for private key within keystore. Optional, defaults to empty string.
-serverPort | No | Integer | Port number to use for server APIs. Defaults to 4900
+serverPort | No | Integer | Port number to use for server APIs. Defaults to 4900.
 metricRegistries | No | List | Metrics registries for publishing the generated metrics. Currently supports Prometheus and CloudWatch. Defaults to Prometheus.
+metricTags | No | Map | Key-value pairs used as common metric tags for metric registries. The maximum number of pairs is three. Note that `serviceName` is a reserved tag key with `DataPrepper` as the default tag value. Alternatively, administrators can set this value through the environment variable `DATAPREPPER_SERVICE_NAME`. If `serviceName` is defined in `metricTags`, that value overwrites those set through the above methods.
+authentication | No | Object | Authentication configuration. The valid option is `http_basic` with `username` and `password` properties. If not defined, the server does not perform authentication.
+processorShutdownTimeout | No | Duration | Time given to processors to clear any in-flight data and gracefully shut down. Default is 30s.
+sinkShutdownTimeout | No | Duration | Time given to sinks to clear any in-flight data and gracefully shut down. Default is 30s.
+peer_forwarder | No | Object | Peer forwarder configuration. See [Peer forwarder options](#peer-forwarder-options) for more details.
+
+### Peer forwarder options
+
+The following section details the configuration options for peer forwarder.
+
+#### General options for peer forwarder
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+port | No | Integer | The port number the peer forwarder server runs on. Valid options are between 0 and 65535. Default is 4994.
+request_timeout | No | Integer | Request timeout in milliseconds for the peer forwarder HTTP server. Default is 10000.
+server_thread_count | No | Integer | Number of threads used by the peer forwarder server. Default is 200.
+client_thread_count | No | Integer | Number of threads used by the peer forwarder client. Default is 200.
+max_connection_count | No | Integer | Maximum number of open connections for the peer forwarder server. Default is 500.
+max_pending_requests | No | Integer | Maximum number of allowed tasks in the ScheduledThreadPool work queue. Default is 1024.
+discovery_mode | No | String | Peer discovery mode to use. Valid options are `local_node`, `static`, `dns`, or `aws_cloud_map`. Defaults to `local_node`, which processes events locally.
+static_endpoints | Conditionally | List | A list containing the endpoints of all Data Prepper instances. Required if `discovery_mode` is set to `static`.
+domain_name | Conditionally | String | A single domain name to query DNS against. Typically used by creating multiple DNS A records for the same domain. Required if `discovery_mode` is set to `dns`.
+aws_cloud_map_namespace_name | Conditionally | String | Cloud Map namespace when using AWS Cloud Map service discovery. Required if `discovery_mode` is set to `aws_cloud_map`.
+aws_cloud_map_service_name | Conditionally | String | Cloud Map service name when using AWS Cloud Map service discovery. Required if `discovery_mode` is set to `aws_cloud_map`.
+aws_cloud_map_query_parameters | No | Map | Key-value pairs used to filter the results based on the custom attributes attached to an instance. Only instances that match all the specified key-value pairs are returned.
+buffer_size | No | Integer | Maximum number of unchecked records the buffer accepts. The number of unchecked records is the sum of the number of records written into the buffer and the number of in-flight records not yet checked by the Checkpointing API. Default is 512.
+batch_size | No | Integer | Maximum number of records the buffer returns on read. Default is 48.
+aws_region | Conditionally | String | AWS region to use for ACM, S3, or AWS Cloud Map. Required if `use_acm_certificate_for_ssl` is set to true, if `ssl_certificate_file` and `ssl_key_file` are AWS S3 paths, or if `discovery_mode` is set to `aws_cloud_map`.
+drain_timeout | No | Duration | Wait time for the peer forwarder to complete processing data before shutdown. Default is 10s.
+
+#### TLS/SSL options for peer forwarder
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+ssl | No | Boolean | Enables TLS/SSL. Default is true.
+ssl_certificate_file | Conditionally | String | SSL certificate chain file path or AWS S3 path. S3 path example `s3:///`. Required if `ssl` is true and `use_acm_certificate_for_ssl` is false. Defaults to `config/default_certificate.pem`, which is the default certificate file. Read more about how the certificate file is generated [here](https://github.com/opensearch-project/data-prepper/tree/main/examples/certificates).
+ssl_key_file | Conditionally | String | SSL key file path or AWS S3 path. S3 path example `s3:///`. Required if `ssl` is true and `use_acm_certificate_for_ssl` is false. Defaults to `config/default_private_key.pem`, which is the default private key file. Read more about how the default private key file is generated [here](https://github.com/opensearch-project/data-prepper/tree/main/examples/certificates).
+ssl_insecure_disable_verification | No | Boolean | Disables verification of the server's TLS certificate chain. Default is false.
+ssl_fingerprint_verification_only | No | Boolean | Disables verification of the server's TLS certificate chain and instead verifies only the certificate fingerprint. Default is false.
+use_acm_certificate_for_ssl | No | Boolean | Enables TLS/SSL using a certificate and private key from AWS Certificate Manager (ACM). Default is false.
+acm_certificate_arn | Conditionally | String | ACM certificate ARN. The ACM certificate takes preference over S3 or a local file system certificate. Required if `use_acm_certificate_for_ssl` is set to true.
+acm_private_key_password | No | String | ACM private key password that decrypts the private key. If not provided, Data Prepper generates a random password.
+acm_certificate_timeout_millis | No | Integer | Timeout in milliseconds for ACM to get certificates. Default is 120000.
+aws_region | Conditionally | String | AWS region to use for ACM, S3, or AWS Cloud Map. Required if `use_acm_certificate_for_ssl` is set to true, if `ssl_certificate_file` and `ssl_key_file` are AWS S3 paths, or if `discovery_mode` is set to `aws_cloud_map`.
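+
+The following is a minimal sketch of how the general and TLS/SSL options above might be combined in the `peer_forwarder` section of `data-prepper-config.yaml`. The endpoint names and certificate paths are hypothetical placeholders, not values from this documentation.
+
+```yml
+peer_forwarder:
+  # Hypothetical static discovery across two Data Prepper hosts; replace with your own endpoints.
+  discovery_mode: static
+  static_endpoints: ["data-prepper-node-1", "data-prepper-node-2"]
+  # TLS using a certificate chain and private key on the local file system (placeholder paths).
+  ssl: true
+  ssl_certificate_file: "/usr/share/data-prepper/certificate.pem"
+  ssl_key_file: "/usr/share/data-prepper/private_key.pem"
+```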
+ +#### Authentication options for peer forwarder + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +authentication | No | Map | Authentication method to use. Valid options are `mutual_tls` (use mTLS) or `unauthenticated` (no authentication). Default is `unauthenticated`. + ## General pipeline options @@ -42,6 +94,7 @@ Option | Required | Type | Description port | No | Integer | The port OTel trace source is running on. Default is `21890`. request_timeout | No | Integer | The request timeout in milliseconds. Default is `10_000`. health_check_service | No | Boolean | Enables a gRPC health check service under `grpc.health.v1/Health/Check`. Default is `false`. +unauthenticated_health_check | No | Boolean | Determines whether or not authentication is required on the health check endpoint. Data Prepper ignores this option if no authentication is defined. Default is `false`. proto_reflection_service | No | Boolean | Enables a reflection service for Protobuf services (see [gRPC reflection](https://github.com/grpc/grpc/blob/master/doc/server-reflection.md) and [gRPC Server Reflection Tutorial](https://github.com/grpc/grpc-java/blob/master/documentation/server-reflection-tutorial.md) docs). Default is `false`. unframed_requests | No | Boolean | Enable requests not framed using the gRPC wire protocol. thread_count | No | Integer | The number of threads to keep in the ScheduledThreadPool. Default is `200`. @@ -53,9 +106,6 @@ useAcmCertForSSL | No | Boolean | Whether to enable TLS/SSL using certificate an acmCertificateArn | Conditionally | String | Represents the ACM certificate ARN. ACM certificate take preference over S3 or local file system certificate. Required if `useAcmCertForSSL` is set to `true`. awsRegion | Conditionally | String | Represents the AWS region to use ACM or S3. Required if `useAcmCertForSSL` is set to `true` or `sslKeyCertChainFile` and `sslKeyFile` are AWS S3 paths. authentication | No | Object | An authentication configuration. By default, an unauthenticated server is created for the pipeline. This parameter uses pluggable authentication for HTTPS. To use basic authentication, define the `http_basic` plugin with a `username` and `password`. To provide customer authentication, use or create a plugin that implements [GrpcAuthenticationProvider](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/armeria-common/src/main/java/com/amazon/dataprepper/armeria/authentication/GrpcAuthenticationProvider.java). -record_type | No | String | A string represents the supported record data type that is written into the buffer plugin. Value options are `otlp` or `event`. Default is `otlp`. -`otlp` | No | String | Otel-trace-source writes each incoming `ExportTraceServiceRequest` request as record data type into the buffer. -`event` | No | String | Otel-trace-source decodes each incoming `ExportTraceServiceRequest` request into a collection of Data Prepper internal spans serving as buffer items. To achieve better performance in this mode, we recommend setting buffer capacity proportional to the estimated number of spans in the incoming request payload. ### http_source @@ -64,11 +114,21 @@ This is a source plugin that supports HTTP protocol. Currently ONLY support Json Option | Required | Type | Description :--- | :--- | :--- | :--- port | No | Integer | The port the source is running on. Default is `2021`. Valid options are between `0` and `65535`. 
+health_check_service | No | Boolean | Enables health check service on `/health` endpoint on the defined port. Default is `false`. +unauthenticated_health_check | No | Boolean | Determines whether or not authentication is required on the health check endpoint. Data Prepper ignores this option if no authentication is defined. Default is `false`. request_timeout | No | Integer | The request timeout in millis. Default is `10_000`. thread_count | No | Integer | The number of threads to keep in the ScheduledThreadPool. Default is `200`. max_connection_count | No | Integer | The maximum allowed number of open connections. Default is `500`. max_pending_requests | No | Integer | The maximum number of allowed tasks in ScheduledThreadPool work queue. Default is `1024`. authentication | No | Object | An authentication configuration. By default, this creates an unauthenticated server for the pipeline. This uses pluggable authentication for HTTPS. To use basic authentication define the `http_basic` plugin with a `username` and `password`. To provide customer authentication, use or create a plugin that implements [ArmeriaHttpAuthenticationProvider](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/armeria-common/src/main/java/com/amazon/dataprepper/armeria/authentication/ArmeriaHttpAuthenticationProvider.java). +ssl | No | Boolean | Enables TLS/SSL. Default is false. +ssl_certificate_file | Conditionally | String | SSL certificate chain file path or AWS S3 path. S3 path example `s3:///`. Required if `ssl` is true and `use_acm_certificate_for_ssl` is false. +ssl_key_file | Conditionally | String | SSL key file path or AWS S3 path. S3 path example `s3:///`. Required if `ssl` is true and `use_acm_certificate_for_ssl` is false. +use_acm_certificate_for_ssl | No | Boolean | Enables TLS/SSL using certificate and private key from AWS Certificate Manager (ACM). Default is false. +acm_certificate_arn | Conditionally | String | ACM certificate ARN. The ACM certificate takes preference over S3 or a local file system certificate. Required if `use_acm_certificate_for_ssl` is set to true. +acm_private_key_password | No | String | ACM private key password that decrypts the private key. If not provided, Data Prepper generates a random password. +acm_certificate_timeout_millis | No | Integer | Timeout in milliseconds for ACM to get certificates. Default is 120000. +aws_region | Conditionally | String | AWS region to use ACM or S3. Required if `use_acm_certificate_for_ssl` is set to true or `ssl_certificate_file` and `ssl_key_file` is AWS S3 path. ### otel_metrics_source @@ -100,12 +160,13 @@ Option | Required | Type | Description :--- | :--- | :--- | :--- notification_type | Yes | String | Must be `sqs` compression | No | String | The compression algorithm to apply: `none`, `gzip`, or `automatic`. Default is `none`. -codec | Yes | Codec | The codec to apply. Must be either `newline` or `json`. +codec | Yes | Codec | The codec to apply. Must be `newline`, `json`, or `csv`. sqs | Yes | sqs | The [Amazon Simple Queue Service](https://aws.amazon.com/sqs/) (Amazon SQS) configuration. See [sqs](#sqs) for details. aws | Yes | aws | The AWS configuration. See [aws](#aws) for details. on_error | No | String | Determines how to handle errors in Amazon SQS. Can be either `retain_messages` or `delete_messages`. If `retain_messages`, then Data Prepper will leave the message in the SQS queue and try again. This is recommended for dead-letter queues. 
If `delete_messages`, then Data Prepper will delete failed messages. Default is `retain_messages`.
 buffer_timeout | No | Duration | The timeout for writing events to the Data Prepper buffer. Any events that the S3 Source cannot write to the buffer in this time will be discarded. Default is 10 seconds.
 records_to_accumulate | No | Integer | The number of messages that accumulate before writing to the buffer. Default is 100.
+metadata_root_key | No | String | Base key for adding S3 metadata to each Event. The metadata includes the key and bucket for each S3 object. Defaults to `s3/`.
 disable_bucket_ownership_validation | No | Boolean | If `true`, then the S3 Source will not attempt to validate that the bucket is owned by the expected account. The only expected account is the same account that owns the SQS queue. Defaults to `false`.
 
 #### sqs
@@ -176,17 +237,17 @@ Prior to Data Prepper 1.3, Processors were named Preppers. Starting in Data Prep
 {: .note }
 
-### otel_trace_raw_prepper
-
-Converts OpenTelemetry data to OpenSearch-compatible JSON documents and fills in trace group related fields in those JSON documents. It requires `record_type` to be set as `otlp` in `otel_trace_source`.
-
-Option | Required | Type | Description
-:--- | :--- | :--- | :---
-trace_flush_interval | No | Integer | Represents the time interval in seconds to flush all the descendant spans without any root span. Default is 180.
-
 ### otel_trace_raw
 
-This processor is a Data Prepper event record type compatible version of `otel_trace_raw_prepper` that fills in trace group related fields into all incoming Data Prepper span records. It requires `record_type` to be set as `event` in `otel_trace_source`.
+This processor is the event record type replacement for `otel_trace_raw_prepper`, which is no longer supported as of Data Prepper 2.0.
+The processor fills in the following trace-group-related fields:
+
+* `traceGroup`: root span name
+* `endTime`: end time of the entire trace in ISO 8601 format
+* `durationInNanos`: duration of the entire trace in nanoseconds
+* `statusCode`: status code for the entire trace
+
+The processor sets these fields in all incoming Data Prepper span records by caching the root span information for each `traceId`.
 
 Option | Required | Type | Description
 :--- | :--- | :--- | :---
@@ -200,26 +261,6 @@ Option | Required | Type | Description
 :--- | :--- | :--- | :---
 window_duration | No | Integer | Represents the fixed time window in seconds to evaluate service-map relationships. Default is 180.
-### peer_forwarder
-
-Forwards ExportTraceServiceRequests via gRPC to other Data Prepper instances. Required for operating Data Prepper in a clustered deployment.
-
-Option | Required | Type | Description
-:--- | :--- | :--- | :---
-time_out | No | Integer | Forwarded request timeout in seconds. Defaults to 3 seconds.
-span_agg_count | No | Integer | Batch size for number of spans per request. Defaults to 48.
-target_port | No | Integer | The destination port to forward requests to. Defaults to `21890`.
-discovery_mode | No | String | Peer discovery mode to be used. Allowable values are `static`, `dns`, and `aws_cloud_map`. Defaults to `static`.
-static_endpoints | No | List | List containing string endpoints of all Data Prepper instances.
-domain_name | No | String | Single domain name to query DNS against. Typically used by creating multiple DNS A Records for the same domain.
-ssl | No | Boolean | Indicates whether to use TLS. Default is true.
-awsCloudMapNamespaceName | Conditionally | String | Name of your CloudMap Namespace. Required if `discovery_mode` is set to `aws_cloud_map`.
-awsCloudMapServiceName | Conditionally | String | Service name within your CloudMap Namespace. Required if `discovery_mode` is set to `aws_cloud_map`.
-sslKeyCertChainFile | Conditionally | String | Represents the SSL certificate chain file path or AWS S3 path. S3 path example `s3:///`. Required if `ssl` is set to `true`.
-useAcmCertForSSL | No | Boolean | Enables TLS/SSL using certificate and private key from AWS Certificate Manager (ACM). Default is `false`.
-awsRegion | Conditionally | String | Represents the AWS Region to use ACM, S3, or CloudMap. Required if `useAcmCertForSSL` is set to `true` or `sslKeyCertChainFile` and `sslKeyFile` are AWS S3 paths.
-acmCertificateArn | Conditionally | String | Represents the ACM certificate ARN. ACM certificate take preference over S3 or local file system certificate. Required if `useAcmCertForSSL` is set to `true`.
-
 ### string_converter
 
 Converts string to uppercase or lowercase. Mostly useful as an example if you want to develop your own processor.
@@ -260,7 +301,7 @@ Option | Required | Type | Description
 drop_when | Yes | String | Accepts a Data Prepper Expression string following the [Data Prepper Expression Syntax](https://github.com/opensearch-project/data-prepper/blob/main/docs/expression_syntax.md). Configuring `drop_events` with `drop_when: true` drops all the events received.
 handle_failed_events | No | Enum | Specifies how exceptions are handled when an exception occurs while evaluating an event. Default value is `drop`, which drops the event so it doesn't get sent to OpenSearch. Available options are `drop`, `drop_silently`, `skip`, `skip_silently`. For more information, see [handle_failed_events](https://github.com/opensearch-project/data-prepper/tree/main/data-prepper-plugins/drop-events-processor#handle_failed_events).
-### grok_prepper
+### grok
 
 Takes unstructured data and utilizes pattern matching to structure and extract important keys and make data more structured and queryable.
@@ -382,10 +423,45 @@ Option | Required | Type | Description
 :--- | :--- | :--- | :---
 with_keys | Yes | List | A list of keys to trim the whitespace from.
+### csv
+
+Takes in an Event and parses its CSV data into columns.
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+source | No | String | The field in the Event that will be parsed. Default is `message`.
+quote_character | No | String | The character used as a text qualifier for a single column of data. Default is double quote `"`.
+delimiter | No | String | The character separating each column. Default is `,`.
+delete_header | No | Boolean | If specified, the header on the Event (`column_names_source_key`) is deleted after the Event is parsed. If there is no header on the Event, no action is taken. Default is true.
+column_names_source_key | No | String | The field in the Event that specifies the CSV column names, which will be autodetected. If extra column names are needed, they are autogenerated according to their index. If `column_names` is also defined, the header in `column_names_source_key` can also be used to generate the Event fields. If too few columns are specified in this field, the remaining column names are autogenerated. If too many column names are specified in this field, the CSV processor omits the extra column names.
+column_names | No | List | User-specified names for the CSV columns.
Default is `[column1, column2, ..., columnN]` if there are N columns of data in the CSV record and `column_names_source_key` is not defined. If `column_names_source_key` is defined, the header in `column_names_source_key` generates the Event fields. If too few columns are specified in this field, the remaining column names will autogenerate. If too many column names are specified in this field, CSV processor omits the extra column names. + +### json + +Takes in an Event and parses its JSON data, including any nested fields. + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +source | No | String | The field in the `Event` that will be parsed. Default is `message`. +destination | No | String | The destination field of the parsed JSON. Defaults to the root of the `Event`. Cannot be `""`, `/`, or any whitespace-only `String` because these are not valid `Event` fields. +pointer | No | String | A JSON Pointer to the field to be parsed. There is no `pointer` by default, meaning the entire `source` is parsed. The `pointer` can access JSON Array indices as well. If the JSON Pointer is invalid then the entire `source` data is parsed into the outgoing `Event`. If the pointed-to key already exists in the `Event` and the `destination` is the root, then the pointer uses the entire path of the key. + + +## Routes + +Routes define conditions that can be used in sinks for conditional routing. Routes are specified at the same level as processors and sinks under the name `route` and consist of a list of key-value pairs, where the key is the name of a route and the value is a Data Prepper expression representing the routing condition. + + ## Sinks Sinks define where Data Prepper writes your data to. +### General options for all sink types + +Option | Required | Type | Description +:--- | :--- | :--- | :--- +routes | No | List | List of routes that the sink accepts. If not specified, the sink accepts all upstream events. + ### opensearch @@ -404,12 +480,10 @@ socket_timeout | No | Integer | the timeout in milliseconds for waiting for data connect_timeout | No | Integer | The timeout in milliseconds used when requesting a connection from the connection manager. A timeout value of zero is interpreted as an infinite timeout. If this timeout value is either negative or not set, the underlying Apache HttpClient would rely on operating system settings for managing connection timeouts. insecure | No | Boolean | Whether to verify SSL certificates. If set to true, CA certificate verification is disabled and insecure HTTP requests are sent instead. Default is `false`. proxy | No | String | The address of a [forward HTTP proxy server](https://en.wikipedia.org/wiki/Proxy_server). The format is "<host name or IP>:<port>". Examples: "example.com:8100", "http://example.com:8100", "112.112.112.112:8100". Port number cannot be omitted. -trace_analytics_raw | No | Boolean | Deprecated in favor of `index_type`. Whether to export as trace data to the `otel-v1-apm-span-*` index pattern (alias `otel-v1-apm-span`) for use with the Trace Analytics OpenSearch Dashboards plugin. Default is `false`. -trace_analytics_service_map | No | Boolean | Deprecated in favor of `index_type`. Whether to export as trace data to the `otel-v1-apm-service-map` index for use with the service map component of the Trace Analytics OpenSearch Dashboards plugin. Default is `false`. -index | No | String | Name of the index to export to. Only required if you don't use the `trace-analytics-raw` or `trace-analytics-service-map` presets. 
In other words, this parameter is applicable and required only if index_type is explicitly `custom` or defaults to `custom`. +index | Conditionally | String | Name of the export index. Applicable and required only when the `index_type` is `custom`. index_type | No | String | This index type tells the Sink plugin what type of data it is handling. Valid values: `custom`, `trace-analytics-raw`, `trace-analytics-service-map`, `management-disabled`. Default is `custom`. -template_file | No | String | Path to a JSON [index template]({{site.url}}{{site.baseurl}}/opensearch/index-templates/) file (e.g. `/your/local/template-file.json` if you do not use the `trace_analytics_raw` or `trace_analytics_service_map`.) See [otel-v1-apm-span-index-template.json](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/opensearch/src/main/resources/otel-v1-apm-span-index-template.json) for an example. -document_id_field | No | String | The field from the source data to use for the OpenSearch document ID (e.g. `"my-field"`) if you don't use the `trace_analytics_raw` or `trace_analytics_service_map` presets. +template_file | No | String | Path to a JSON [index template]({{site.url}}{{site.baseurl}}/opensearch/index-templates/) file (e.g. `/your/local/template-file.json`) if `index_type` is `custom`. See [otel-v1-apm-span-index-template.json](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/opensearch/src/main/resources/otel-v1-apm-span-index-template.json) for an example. +document_id_field | No | String | The field from the source data to use for the OpenSearch document ID (e.g. `"my-field"`) if `index_type` is `custom`. dlq_file | No | String | The path to your preferred dead letter queue file (e.g. `/your/local/dlq-file`). Data Prepper writes to this file when it fails to index a document on the OpenSearch cluster. bulk_size | No | Integer (long) | The maximum size (in MiB) of bulk requests to the OpenSearch cluster. Values below 0 indicate an unlimited size. If a single document exceeds the maximum bulk request size, Data Prepper sends it individually. Default is 5. ism_policy_file | No | String | The absolute file path for an ISM (Index State Management) policy JSON file. This policy file is effective only when there is no built-in policy file for the index type. For example, `custom` index type is currently the only one without a built-in policy file, thus it would use the policy file here if it's provided through this parameter. For more information, see [ISM policies]({{site.url}}{{site.baseurl}}/im-plugin/ism/policies/). diff --git a/_clients/data-prepper/get-started.md b/_clients/data-prepper/get-started.md index 11ef4ea9..95d3dc6e 100644 --- a/_clients/data-prepper/get-started.md +++ b/_clients/data-prepper/get-started.md @@ -19,7 +19,7 @@ docker pull opensearchproject/data-prepper:latest ## 2. Define a pipeline -Create a Data Prepper pipeline file, `pipelines.yaml`, with the following configuration: +Create a Data Prepper pipeline file, `my-pipelines.yaml`, with the following configuration: ```yml simple-sample-pipeline: @@ -37,7 +37,7 @@ Run the following command with your pipeline configuration YAML. 
```bash docker run --name data-prepper \ - -v /full/path/to/pipelines.yaml:/usr/share/data-prepper/pipelines.yaml \ + -v /full/path/to/my-pipelines.yaml:/usr/share/data-prepper/pipelines/my-pipelines.yaml \ opensearchproject/opensearch-data-prepper:latest ``` @@ -45,19 +45,19 @@ This sample pipeline configuration above demonstrates a simple pipeline with a s After starting Data Prepper, you should see log output and some UUIDs after a few seconds: -```yml -2021-09-30T20:19:44,147 [main] INFO com.amazon.dataprepper.pipeline.server.DataPrepperServer - Data Prepper server running at :4900 -2021-09-30T20:19:44,681 [random-source-pool-0] INFO com.amazon.dataprepper.plugins.source.RandomStringSource - Writing to buffer -2021-09-30T20:19:45,183 [random-source-pool-0] INFO com.amazon.dataprepper.plugins.source.RandomStringSource - Writing to buffer -2021-09-30T20:19:45,687 [random-source-pool-0] INFO com.amazon.dataprepper.plugins.source.RandomStringSource - Writing to buffer -2021-09-30T20:19:46,191 [random-source-pool-0] INFO com.amazon.dataprepper.plugins.source.RandomStringSource - Writing to buffer -2021-09-30T20:19:46,694 [random-source-pool-0] INFO com.amazon.dataprepper.plugins.source.RandomStringSource - Writing to buffer -2021-09-30T20:19:47,200 [random-source-pool-0] INFO com.amazon.dataprepper.plugins.source.RandomStringSource - Writing to buffer -2021-09-30T20:19:49,181 [simple-test-pipeline-processor-worker-1-thread-1] INFO com.amazon.dataprepper.pipeline.ProcessWorker - simple-test-pipeline Worker: Processing 6 records from buffer -07dc0d37-da2c-447e-a8df-64792095fb72 -5ac9b10a-1d21-4306-851a-6fb12f797010 -99040c79-e97b-4f1d-a70b-409286f2a671 -5319a842-c028-4c17-a613-3ef101bd2bdd -e51e700e-5cab-4f6d-879a-1c3235a77d18 -b4ed2d7e-cf9c-4e9d-967c-b18e8af35c90 +``` +2021-09-30T20:19:44,147 [main] INFO org.opensearch.dataprepper.pipeline.server.DataPrepperServer - Data Prepper server running at :4900 +2021-09-30T20:19:44,681 [random-source-pool-0] INFO org.opensearch.dataprepper.plugins.source.RandomStringSource - Writing to buffer +2021-09-30T20:19:45,183 [random-source-pool-0] INFO org.opensearch.dataprepper.plugins.source.RandomStringSource - Writing to buffer +2021-09-30T20:19:45,687 [random-source-pool-0] INFO org.opensearch.dataprepper.plugins.source.RandomStringSource - Writing to buffer +2021-09-30T20:19:46,191 [random-source-pool-0] INFO org.opensearch.dataprepper.plugins.source.RandomStringSource - Writing to buffer +2021-09-30T20:19:46,694 [random-source-pool-0] INFO org.opensearch.dataprepper.plugins.source.RandomStringSource - Writing to buffer +2021-09-30T20:19:47,200 [random-source-pool-0] INFO org.opensearch.dataprepper.plugins.source.RandomStringSource - Writing to buffer +2021-09-30T20:19:49,181 [simple-test-pipeline-processor-worker-1-thread-1] INFO org.opensearch.dataprepper.pipeline.ProcessWorker - simple-test-pipeline Worker: Processing 6 records from buffer +{"message":"1043a78e-1312-4341-8c1e-227e34a1fbf3"} +{"message":"b1529b81-1ee1-4cdb-b5d7-11586e570ae6"} +{"message":"56d83593-4c95-4bc4-9c0b-e061d9b23192"} +{"message":"254153df-4534-4f5e-bb31-98b984f2ac29"} +{"message":"ad1430e6-8486-4d84-a2ef-de30315dea07"} +{"message":"81c5e621-79aa-4850-9bf1-68642d70c1ee"} ``` diff --git a/_clients/data-prepper/index.md b/_clients/data-prepper/index.md index 7fb833f4..5fd13dfe 100644 --- a/_clients/data-prepper/index.md +++ b/_clients/data-prepper/index.md @@ -12,4 +12,76 @@ Data Prepper is a server side data collector capable of filtering, enriching, tr Data Prepper lets 
users build custom pipelines to improve the operational view of applications. Two common uses for Data Prepper are trace and log analytics. [Trace analytics]({{site.url}}{{site.baseurl}}/observability-plugin/trace/index/) can help you visualize the flow of events and identify performance problems, and [log analytics]({{site.url}}{{site.baseurl}}/observability-plugin/log-analytics/) can improve searching, analyzing and provide insights into your application.
+## Concepts
+
+Data Prepper is composed of **pipelines** that collect and filter data based on the components set within the pipeline. Each component is pluggable, enabling you to use your own custom implementation of each component. These components include:
+
+- One [source](#source)
+- One or more [sinks](#sink)
+- (Optional) One [buffer](#buffer)
+- (Optional) One or more [processors](#processor)
+
+A single instance of Data Prepper can have one or more pipelines.
+
+Each pipeline definition contains two required components: **source** and **sink**. If buffers and processors are missing from the Data Prepper pipeline, Data Prepper uses the default buffer and a no-op processor.
+
+### Source
+
+Source is the input component of a pipeline that defines the mechanism through which a Data Prepper pipeline consumes events. A pipeline can have only one source. The source can consume events either by receiving them over HTTP or HTTPS or by reading from external endpoints such as the OTel Collector (for traces and metrics) and Amazon S3. Sources have their own configuration options based on the format of the events (such as string, JSON, CloudWatch logs, or OpenTelemetry traces). The source component consumes events and writes them to the buffer component.
+
+### Buffer
+
+The buffer component acts as the layer between the source and the sink. The buffer can be either in-memory or disk based. The default buffer uses an in-memory queue bounded by the number of events, called `bounded_blocking`. If the buffer component is not explicitly mentioned in the pipeline configuration, Data Prepper uses the default `bounded_blocking`.
+
+### Sink
+
+Sink is the output component of a pipeline that defines the destination(s) to which a Data Prepper pipeline publishes events. A sink destination can be a service such as OpenSearch, Amazon S3, or another Data Prepper pipeline. When using another Data Prepper pipeline as the sink, you can chain multiple pipelines together based on the needs of your data. Sinks have their own configuration options based on the destination type.
+
+### Processor
+
+Processors are units within the Data Prepper pipeline that can filter, transform, and enrich events into your desired format before publishing the records to the sink. If a processor is not defined in the pipeline configuration, the events are published in the format defined by the source component. You can have more than one processor within a pipeline. When using multiple processors, the processors are executed in the order they are defined in the pipeline specification.
+
+## Sample pipeline configurations
+
+To understand how all pipeline components function within a Data Prepper configuration, see the following examples. Each pipeline configuration uses a `yaml` file format.
+
+### Minimal component
+
+This pipeline configuration reads from the file source and writes to a file sink. It uses the default options for the buffer and processor.
+
+```yml
+sample-pipeline:
+  source:
+    file:
+      path:
+  sink:
+    - file:
+        path:
+```
+
+### All components
+
+The following pipeline uses a source that reads string events from the `input-file`. The source then pushes the data to a buffer bounded by a maximum size of `1024`. The pipeline is configured to have `4` workers, each reading a maximum of `256` events from the buffer every `100` milliseconds. Each worker executes the `string_converter` processor and writes the output of the processor to the `output-file`.
+
+```yml
+sample-pipeline:
+  workers: 4 # Number of workers
+  delay: 100 # in milliseconds, how often the workers should run
+  source:
+    file:
+      path:
+  buffer:
+    bounded_blocking:
+      buffer_size: 1024 # max number of events the buffer will accept
+      batch_size: 256 # max number of events the buffer will drain for each read
+  processor:
+    - string_converter:
+        upper_case: true
+  sink:
+    - file:
+        path:
+```
+
+## Next steps
+
 To get started building your own custom pipelines with Data Prepper, see the [Get Started]({{site.url}}{{site.baseurl}}/clients/data-prepper/get-started/) guide.
diff --git a/_clients/data-prepper/pipelines.md b/_clients/data-prepper/pipelines.md
index 6763f134..7e40d3b0 100644
--- a/_clients/data-prepper/pipelines.md
+++ b/_clients/data-prepper/pipelines.md
@@ -40,6 +40,38 @@ simple-sample-pipeline:
 - Sinks define where your data goes. In this case, the sink is stdout.
+Starting from Data Prepper 2.0, you can define pipelines across multiple configuration YAML files, where each file contains the configuration for one or more pipelines. This gives you more freedom to organize and chain complex pipeline configurations. For Data Prepper to load your pipeline configuration properly, place your configuration YAML files in the `pipelines` folder under your application's home directory (e.g. `/usr/share/data-prepper`).
+{: .note }
+
+## Conditional routing
+
+Pipelines also support **conditional routing**, which allows you to route Events to different sinks based on specific conditions. To add conditional routing to a pipeline, specify a list of named routes under the `route` component and add specific routes to sinks under the `routes` property. Any sink with the `routes` property will only accept Events that match at least one of the routing conditions.
+
+In the following example, `application-logs` is a named route with a condition set to `/log_type == "application"`. The route uses [Data Prepper expressions](https://github.com/opensearch-project/data-prepper/tree/main/examples) to define the conditions. Data Prepper only routes events that satisfy the condition to the first OpenSearch sink. By default, Data Prepper routes all Events to any sink that does not define a route. In the example, all Events are routed to the third OpenSearch sink.
+
+```yml
+conditional-routing-sample-pipeline:
+  source:
+    http:
+  processor:
+  route:
+    - application-logs: '/log_type == "application"'
+    - http-logs: '/log_type == "apache"'
+  sink:
+    - opensearch:
+        hosts: [ "https://opensearch:9200" ]
+        index: application_logs
+        routes: [application-logs]
+    - opensearch:
+        hosts: [ "https://opensearch:9200" ]
+        index: http_logs
+        routes: [http-logs]
+    - opensearch:
+        hosts: [ "https://opensearch:9200" ]
+        index: all_logs
+```
+
+
 ## Examples
 
 This section provides some pipeline examples that you can use to start creating your own pipelines. For more information, see [Data Prepper configuration reference]({{site.url}}{{site.baseurl}}/clients/data-prepper/data-prepper-reference/) guide.
@@ -75,9 +107,8 @@ This example uses weak security. We strongly recommend securing all plugins whic The following example demonstrates how to build a pipeline that supports the [Trace Analytics OpenSearch Dashboards plugin]({{site.url}}{{site.baseurl}}/observability-plugin/trace/ta-dashboards/). This pipeline takes data from the OpenTelemetry Collector and uses two other pipelines as sinks. These two separate pipelines index trace and the service map documents for the dashboard plugin. -#### Classic - -This pipeline definition will be deprecated in 2.0. Users are recommended to use [Event record type](#event-record-type) pipeline definition. +Starting from Data Prepper 2.0, Data Prepper no longer supports `otel_trace_raw_prepper` processor due to the Data Prepper internal data model evolution. +Instead, users should use `otel_trace_raw`. ```yml entry-pipeline: @@ -85,51 +116,6 @@ entry-pipeline: source: otel_trace_source: ssl: false - sink: - - pipeline: - name: "raw-pipeline" - - pipeline: - name: "service-map-pipeline" -raw-pipeline: - source: - pipeline: - name: "entry-pipeline" - processor: - - otel_trace_raw_prepper: - sink: - - opensearch: - hosts: ["https://localhost:9200"] - insecure: true - username: admin - password: admin - index_type: trace-analytics-raw -service-map-pipeline: - delay: "100" - source: - pipeline: - name: "entry-pipeline" - processor: - - service_map_stateful: - sink: - - opensearch: - hosts: ["https://localhost:9200"] - insecure: true - username: admin - password: admin - index_type: trace-analytics-service-map -``` - -#### Event record type - -Starting from Data Prepper 1.4, Data Prepper supports event record type in trace analytics pipeline source, buffer, and processors. - -```yml -entry-pipeline: - delay: "100" - source: - otel_trace_source: - ssl: false - record_type: event buffer: bounded_blocking: buffer_size: 10240 @@ -176,7 +162,8 @@ service-map-pipeline: index_type: trace-analytics-service-map ``` -Note that it is recommended to scale the `buffer_size` and `batch_size` by the estimated maximum batch size in the client request payload to maintain similar ingestion throughput and latency as in [Classic](#classic). +To maintain similar ingestion throughput and latency, scale the `buffer_size` and `batch_size` by the estimated maximum batch size in the client request payload. +{: .tip} ### Metrics pipeline @@ -211,7 +198,7 @@ from [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3). Th Balancer logs. As the Application Load Balancer writes logs to S3, S3 creates notifications in Amazon SQS. Data Prepper reads those notifications and reads the S3 objects to get the log data and process it. -``` +```yml log-pipeline: source: s3: @@ -254,7 +241,7 @@ Data Prepper supports Logstash configuration files for a limited set of plugins. ```bash docker run --name data-prepper \ - -v /full/path/to/logstash.conf:/usr/share/data-prepper/pipelines.conf \ + -v /full/path/to/logstash.conf:/usr/share/data-prepper/pipelines/pipelines.conf \ opensearchproject/opensearch-data-prepper:latest ``` @@ -280,7 +267,27 @@ serverPort: 1234 To configure the Data Prepper server, run Data Prepper with the additional yaml file. 
 ```bash
-docker run --name data-prepper -v /full/path/to/pipelines.yaml:/usr/share/data-prepper/pipelines.yaml \
-    /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml \
+docker run --name data-prepper \
+    -v /full/path/to/my-pipelines.yaml:/usr/share/data-prepper/pipelines/my-pipelines.yaml \
+    -v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml \
 opensearchproject/data-prepper:latest
-````
+```
+
+## Configure the peer forwarder
+
+Data Prepper provides an HTTP service to forward Events between Data Prepper nodes for aggregation. This is required for operating Data Prepper in a clustered deployment. Currently, peer forwarding is supported by the `aggregate`, `service_map_stateful`, and `otel_trace_raw` processors. Peer forwarder groups events based on the identification keys provided by the processors. For `service_map_stateful` and `otel_trace_raw`, the key is `traceId` by default and cannot be configured. For the `aggregate` processor, the key is configurable using the `identification_keys` option.
+
+Peer forwarder supports peer discovery through one of three options: a static list, a DNS record lookup, or AWS Cloud Map. Configure the discovery method using the `discovery_mode` option. Peer forwarder also supports SSL for verification and encryption, and mTLS for mutual authentication in the peer forwarding service.
+
+To configure the peer forwarder, add the configuration options to the `data-prepper-config.yaml` file described in the previous [Configure the Data Prepper server](#configure-the-data-prepper-server) section:
+
+```yml
+peer_forwarder:
+  discovery_mode: dns
+  domain_name: "data-prepper-cluster.my-domain.net"
+  ssl: true
+  ssl_certificate_file: ""
+  ssl_key_file: ""
+  authentication:
+    mutual_tls:
+```
diff --git a/_opensearch/ux.md b/_opensearch/ux.md
index 5937af5b..5685233c 100644
--- a/_opensearch/ux.md
+++ b/_opensearch/ux.md
@@ -59,7 +59,8 @@ GET shakespeare/_search
 Prefix matching doesn’t require any special mappings. It works with your data as-is. However, it’s a fairly resource-intensive operation. A prefix of `a` could match hundreds of thousands of terms and not be useful to your user.
 
-To limit the impact of prefix expansion, set `max_expansions` to a reasonable number. To learn about the `max_expansions` option, see [Options]({{site.url}}{{site.baseurl}}/opensearch/query-dsl/full-text#other-optional-query-fields).
+
+To limit the impact of prefix expansion, set `max_expansions` to a reasonable number. To learn about the `max_expansions` option, see [Optional query fields]({{site.url}}{{site.baseurl}}/opensearch/query-dsl/full-text#optional-query-fields).
 
 #### Sample Request