updating data prepper documentation for 1.2 release

Signed-off-by: Christopher Manning <cmanning09@users.noreply.github.com>
This commit is contained in:
Christopher Manning 2021-12-03 08:28:34 -06:00 committed by Manning
parent 05b1682f11
commit 2c2d7ffc67
8 changed files with 242 additions and 59 deletions

View File

@ -1,14 +1,13 @@
---
layout: default
title: Configuration reference
parent: Trace analytics
nav_order: 25
parent: Data Prepper
nav_order: 3
---
# Data Prepper configuration reference
This page lists all supported Data Prepper sources, buffers, preppers, and sinks, along with their associated options. For example configuration files, see [Data Prepper]({{site.url}}{{site.baseurl}}/observability-plugins/trace/data-prepper/).
This page lists all supported Data Prepper server, sources, buffers, preppers, and sinks, along with their associated options. For example configuration files, see [Data Prepper]({{site.url}}{{site.baseurl}}/observability-plugins/data-prepper/pipelines/).
## Data Prepper server options
@ -19,7 +18,7 @@ keyStoreFilePath | No | String, path to a .jks or .p12 keystore file. Required i
keyStorePassword | No | String, password for keystore. Optional, defaults to empty string.
privateKeyPassword | No | String, password for private key within keystore. Optional, defaults to empty string.
serverPort | No | Integer, port number to use for server APIs. Defaults to 4900
metricRegistries | No | List, metrics registries for publishing the generated metrics. Defaults to Prometheus; Prometheus and CloudWatch are currently supported.
## General pipeline options
@ -53,7 +52,20 @@ sslKeyFile | Conditionally | String, file-system path or AWS S3 path to the secu
useAcmCertForSSL | No | Boolean, enables TLS/SSL using certificate and private key from AWS Certificate Manager (ACM). Default is `false`.
acmCertificateArn | Conditionally | String, represents the ACM certificate ARN. ACM certificate take preference over S3 or local file system certificate. Required if `useAcmCertForSSL` is set to `true`.
awsRegion | Conditionally | String, represents the AWS region to use ACM or S3. Required if `useAcmCertForSSL` is set to `true` or `sslKeyCertChainFile` and `sslKeyFile` are AWS S3 paths.
authentication | No | An authentication configuration. By default, this runs an unauthenticated server. This uses pluggable authentication for HTTPS. To provide customer authentication use or create a plugin which implements: [GrpcAuthenticationProvider](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/armeria-common/src/main/java/com/amazon/dataprepper/armeria/authentication/GrpcAuthenticationProvider.java)
### http_source
This is a source plugin that supports HTTP protocol. Currently ONLY support Json UTF-8 codec for incoming request, e.g. `[{"key1": "value1"}, {"key2": "value2"}]`.
Option | Required | Description
:--- | :--- | :---
port | No | Integer, the port the source is running on. Default is `2021`. Valid options are between `0` and `65535`.
request_timeout | No | Integer, the request timeout in millis. Default is `10_000`.
thread_count | No | Integer, the number of threads to keep in the ScheduledThreadPool. Default is `200`.
max_connection_count | No | Integer, the maximum allowed number of open connections. Default is `500`.
max_pending_requests | No | Ingeger, the maximum number of allowed tasks in ScheduledThreadPool work queue. Default is `1024.
authentication | No | An authentication configuration. By default, this runs an unauthenticated server. This uses pluggable authentication for HTTPS. To provide customer authentication use or create a plugin which implements: [ArmeriaHttpAuthenticationProvider](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/armeria-common/src/main/java/com/amazon/dataprepper/armeria/authentication/ArmeriaHttpAuthenticationProvider.java)
### file
@ -62,7 +74,8 @@ Source for flat file input.
Option | Required | Description
:--- | :--- | :---
path | Yes | String, path to the input file (e.g. `logs/my-log.log`).
format | No | String, format of each line in the file. Valid options are `json` or `plain`. Default is `plain`.
record_type | No | String, the record type that will be stored. Valid options are `string` or `event`. Default is `string`. If you would like to use the file source for log analytics use cases like grok, change this to `event`.
### pipeline
@ -144,6 +157,22 @@ Option | Required | Description
:--- | :--- | :---
upper_case | No | Boolean, whether to convert to uppercase (`true`) or lowercase (`false`).
### grok_prepper
Takes unstructured data and utilizes pattern matching to structure and extract important keys and make data more structured and queryable.
Option | Required | Description
:--- | :--- | :---
match | No | Map, specifies which keys to match specific patterns against. Default is `{}`.
keep_empty_captures | No | Boolean, enables preserving `null` captures. Defatul value is `false`.
named_captures_only | No | Boolean, enables whether to keep only named captures. Default vaule is `true`.
break_on_match | No | Boolean, specifies wether to match all patterns or stop once the first successful match is found. Default is `true`.
keys_to_overwrite | No | List, specifies which existing keys are to be overwritten if there is a capture with the same key value. Default is `[]`.
pattern_definitions | No | Map, that allows for custom pattern use inline. Default value is `{}`.
patterns_directories | No | List, specifies the path of directories that contain customer pattern files. Default value is `[]`.
pattern_files_glob | No | String, specifies which pattern files to use from the directories specified for `pattern_directories`. Default is `*`.
target_key | No | String, specifies a parent level key to store all captures. Default value is `null`.
timeout_millis | Integer, the maximum amount of time that matching will be performed. Setting to `0` will disable the timeout. Default value is `30,000`.
## Sinks

View File

@ -0,0 +1,58 @@
---
layout: default
title: Get Started
parent: Data Prepper
nav_order: 1
---
# Get started with Data Prepper
Data Prepper is an independent component, not an OpenSearch plugin, that converts data for use with OpenSearch. It's not bundled with the all-in-one OpenSearch installation packages.
## Install Data Prepper
To use the Docker image, pull it like any other image:
```bash
docker pull opensearchproject/data-prepper:latest
```
## Define a pipeline
Build a pipeline by creating a pipeline configuration YAML using any of the built-in Data Prepper plugins.
For more examples and details on pipeline configurations see [Pipelines]({{site.url}}{{site.baseurl}}/observability-plugins/data-prepper/pipelines) guide.
## Start Data Prepper
Run the following command with your pipeline configuration YAML.
```bash
docker run --name data-prepper -v /full/path/to/pipelines.yaml:/usr/share/data-prepper/pipelines.yaml opensearchproject/opensearch-data-prepper:latest
```
### Migrating from Logstash
Data Prepper supports Logstash configuration files for a limited set of plugins. Simply use the logstash config to run Data Prepper.
```bash
docker run --name data-prepper -v /full/path/to/logstash.conf:/usr/share/data-prepper/pipelines.conf opensearchproject/opensearch-data-prepper:latest
```
This feature is limited by feature parity of Data Prepper. As of Data Prepper 1.2 release, the following plugins from the Logstash configuration are supported:
- HTTP Input plugin
- Grok Filter plugin
- Elasticsearch Output plugin
- Amazon Elasticsearch Output plugin
### Configure the Data Prepper server
Data Prepper itself provides administrative HTTP endpoints such as `/list` to list pipelines and `/metrics/prometheus` to provide Prometheus-compatible metrics data. The port which serves these endpoints has a TLS configuration and is specified by a separate YAML file. Data Prepper docker images secures these endpoints by default. We strongly recommend providing your own configuration file for securing production environments. Example:
```yml
ssl: true
keyStoreFilePath: "/usr/share/data-prepper/keystore.jks"
keyStorePassword: "password"
privateKeyPassword: "other_password"
serverPort: 1234
```

View File

@ -0,0 +1,15 @@
---
layout: default
title: Data Prepper
nav_order: 80
has_children: true
has_toc: false
---
# Data Prepper
Data Prepper is a server side data collector with abilities to filter, enrich, transform, normalize and aggregate data for downstream analytics and visualization.
Data Prepper allows users to build custom pipelines to improve their operational view of their own applications. Two common uses for Data Prepper are trace and log analytics. [Trace Anaytics]({{site.url}}{{site.baseurl}}/observability-plugins/trace/index/) can help you visualize this flow of events and identify performance problems. [Log Anayltics]({{site.url}}{{site.baseurl}}/observability-plugins/log-analytics/index) can improve searching, analyzing and provide insights into your application.
To get started building your own custom pipelines with Data Prepper, see the [Get Started]({{site.url}}{{site.baseurl}}/observability-plugins/data-prepper/get-started/) guide.

View File

@ -1,27 +1,11 @@
---
layout: default
title: Data Prepper
parent: Trace analytics
nav_order: 20
title: Pipelines
parent: Data Prepper
nav_order: 2
---
# Data Prepper
Data Prepper is an independent component, not an OpenSearch plugin, that converts data for use with OpenSearch. It's not bundled with the all-in-one OpenSearch installation packages.
## Install Data Prepper
To use the Docker image, pull it like any other image:
```bash
docker pull opensearchproject/data-prepper:latest
```
Otherwise, [download](https://opensearch.org/downloads.html) the appropriate archive for your operating system and unzip it.
## Configure pipelines
# Pipelines
To use Data Prepper, you define pipelines in a configuration YAML file. Each pipeline is a combination of a source, a buffer, zero or more preppers, and one or more sinks:
@ -61,7 +45,39 @@ sample-pipeline:
- Sinks define where your data goes. In this case, the sink is an OpenSearch cluster.
Pipelines can act as the source for other pipelines. In the following example, a pipeline takes data from the OpenTelemetry Collector and uses two other pipelines as sinks:
## Examples
This section provides some pipeline examples that you can use to start creating your own pipelines. For more information, see [Data Prepper configuration reference]({{site.url}}{{site.baseurl}}/observability-plugins/data-prepper/data-prepper-reference/) guide.
The Data Prepper repository has several [sample applications](https://github.com/opensearch-project/data-prepper/tree/main/examples) to help you get started.
### Log ingestion pipeline
The following example demonstrates how to use HTTP source and Grok prepper plugins to process unstructured log data.
```yaml
log-pipeline:
source:
http:
ssl: false
processor:
- grok:
match:
log: [ "%{COMMONAPACHELOG}" ]
sink:
- opensearch:
hosts: [ "https://opensearch:9200" ]
insecure: true
username: admin
password: admin
index: apache_logs
```
Note: This example uses weak security. We strongly recommend securing all plugins which open external ports in production environments.
### Trace Analytics pipeline
The following example demonstrates how to build a pipeline that supports the [Trace Analytics OpenSearch Dashboards plugin]({{site.url}}{{site.baseurl}}/observability-plugins/trace/ta-dashboards/). This pipeline takes data from the OpenTelemetry Collector and uses two other pipelines as sinks. These two separate pipelines index trace and the service map documents for the dashboard plugin.
```yml
entry-pipeline:
@ -104,32 +120,3 @@ service-map-pipeline:
password: "ta-password"
trace_analytics_service_map: true
```
To learn more, see the [Data Prepper configuration reference]({{site.url}}{{site.baseurl}}/observability-plugins/trace/data-prepper-reference/).
## Configure the Data Prepper server
Data Prepper itself provides administrative HTTP endpoints such as `/list` to list pipelines and `/metrics/prometheus` to provide Prometheus-compatible metrics data. The port which serves these endpoints, as well as TLS configuration, is specified by a separate YAML file. Example:
```yml
ssl: true
keyStoreFilePath: "/usr/share/data-prepper/keystore.jks"
keyStorePassword: "password"
privateKeyPassword: "other_password"
serverPort: 1234
```
## Start Data Prepper
**Docker**
```bash
docker run --name data-prepper --expose 21890 -v /full/path/to/pipelines.yaml:/usr/share/data-prepper/pipelines.yaml -v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml opensearchproject/opensearch-data-prepper:latest
```
**macOS and Linux**
```bash
./data-prepper-tar-install.sh config/pipelines.yaml config/data-prepper-config.yaml
```
For production workloads, you likely want to run Data Prepper on a dedicated machine, which makes connectivity a concern. Data Prepper uses port 21890 and must be able to connect to both the OpenTelemetry Collector and the OpenSearch cluster. In the [sample applications](https://github.com/opensearch-project/Data-Prepper/tree/main/examples), you can see that all components use the same Docker network and expose the appropriate ports.

View File

@ -0,0 +1,95 @@
---
layout: default
title: Log analytics
nav_order: 70
---
# Log Ingestion
Log ingestion provides a way to transform unstructured log data into a structure data and ingestion into OpenSearch. This data can help improve your log collection in distributed applications.
Structured log data allows for improved queries and filtering based on the data format when you are searching logs for an event.
## Get started with log ingestion
OpenSearch Log Ingestion consists of three components---[Data Prepper]({{site.url}}{{site.baseurl}}/observability-plugins/data-prepper/index/), [OpenSearch]({{site.url}}{{site.baseurl}}/) and [OpenSearch Dashboards]({{site.url}}{{site.baseurl}}/)---that fit into the OpenSearch ecosystem. The Data Prepper repository has several [sample applications](https://github.com/opensearch-project/data-prepper/tree/main/examples) to help you get started.
### Basic flow of data
![Log data flow diagram from a distributed application to OpenSearch]({{site.url}}{{site.baseurl}}/images/la.png)
1. Log Ingestion relies on you adding log collection to your application's environment to gather and send log data.
(In the [example](#example) below, [FluentBit](https://docs.fluentbit.io/manual/) is used as a log collector that collects log data from a file and sends the log data to Data Prepper).
2. [Data Prepper]({{site.url}}{{site.baseurl}}/observability-plugins/data-prepper/index/) receives the log data, transforms the data into a structure format, and indexes it on an OpenSearch cluster.
3. The data can then be explored through OpenSearch search queries or the **Discover** page in OpenSearch Dashboards.
### Example
This example mimics the writing of log entries to a log file that are then processed by Data Prepper and stored in OpenSearch.
Download or clone the [Data Prepper repository](https://github.com/opensearch-project/data-prepper). Then navigate to `examples/log-ingestion/` and open `docker-compose.yml` in a text editor. This file contains a container for:
- [Fluent Bit](https://docs.fluentbit.io/manual/) (`fluent-bit`)
- Data Prepper (`data-prepper`)
- A single-node OpenSearch cluster (`opensearch`)
- OpenSearch Dashboards (`opensearch-dashboards`).
Close the file and run `docker-compose up --build` to start the containers.
After the containers start, your ingestion pipeline is set up and ready to ingest log data. The `fluent-bit` container is configured to read log data from `test.log`. Run the following command to generate log data to send to the log ingestion pipeline.
```
echo '63.173.168.120 - - [04/Nov/2021:15:07:25 -0500] "GET /search/tag/list HTTP/1.0" 200 5003' >> test.log
```
Fluent-Bit will collect the log data and send it to Data Prepper:
```angular2html
[2021/12/02 15:35:41] [ info] [output:http:http.0] data-prepper:2021, HTTP status=200
200 OK
```
Data Prepper will process the log and index it:
```
2021-12-02T15:35:44,499 [log-pipeline-processor-worker-1-thread-1] INFO com.amazon.dataprepper.pipeline.ProcessWorker - log-pipeline Worker: Processing 1 records from buffer
```
This should result in a single document being written to the OpenSearch cluster in the `apache-logs` index as defined in the `log_pipeline.yaml` file.
Run the following command to see one of the raw documents in the OpenSearch cluster:
```bash
curl -X GET -u 'admin:admin' -k 'https://localhost:9200/apache_logs/_search?pretty&size=1'
```
The response should show the parsed log data:
```
"hits" : [
{
"_index" : "apache_logs",
"_type" : "_doc",
"_id" : "yGrJe30BgI2EWNKtDZ1g",
"_score" : 1.0,
"_source" : {
"date" : 1.638459307042312E9,
"log" : "63.173.168.120 - - [04/Nov/2021:15:07:25 -0500] \"GET /search/tag/list HTTP/1.0\" 200 5003",
"request" : "/search/tag/list",
"auth" : "-",
"ident" : "-",
"response" : "200",
"bytes" : "5003",
"clientip" : "63.173.168.120",
"verb" : "GET",
"httpversion" : "1.0",
"timestamp" : "04/Nov/2021:15:07:25 -0500"
}
}
]
```
The same data can be viewed in OpenSearch Dashboards by visiting the **Discover** page and searching the `apache_logs` index. Remember, you must create the index in OpensSearch Dashboards if this is your first time searching for the index.

View File

@ -9,7 +9,6 @@ nav_order: 1
OpenSearch Trace Analytics consists of two components---Data Prepper and the Trace Analytics OpenSearch Dashboards plugin---that fit into the OpenTelemetry and OpenSearch ecosystems. The Data Prepper repository has several [sample applications](https://github.com/opensearch-project/data-prepper/tree/main/examples) to help you get started.
## Basic flow of data
![Data flow diagram from a distributed application to OpenSearch]({{site.url}}{{site.baseurl}}/images/ta.svg)
@ -20,11 +19,10 @@ OpenSearch Trace Analytics consists of two components---Data Prepper and the Tra
1. The [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/getting-started/) receives data from the application and formats it into OpenTelemetry data.
1. [Data Prepper]({{site.url}}{{site.baseurl}}/observability-plugins/trace/data-prepper/) processes the OpenTelemetry data, transforms it for use in OpenSearch, and indexes it on an OpenSearch cluster.
1. [Data Prepper]({{site.url}}{{site.baseurl}}/observability-plugins/data-prepper/index/) processes the OpenTelemetry data, transforms it for use in OpenSearch, and indexes it on an OpenSearch cluster.
1. The [Trace Analytics OpenSearch Dashboards plugin]({{site.url}}{{site.baseurl}}/observability-plugins/trace/ta-dashboards/) displays the data in near real-time as a series of charts and tables, with an emphasis on service architecture, latency, error rate, and throughput.
## Jaeger HotROD
One Trace Analytics sample application is the Jaeger HotROD demo, which mimics the flow of data through a distributed application.

1
images/la-diagram.drawio Normal file
View File

@ -0,0 +1 @@
<mxfile host="Electron" modified="2021-12-02T23:36:50.889Z" agent="5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/13.6.2 Chrome/83.0.4103.122 Electron/9.2.0 Safari/537.36" etag="TGQ9dVb-OumTgvDfhW7r" version="13.6.2" type="device"><diagram id="raLMG7nUvYtrXQTlhIq5" name="Page-1">3VdNj5swEP01HFciQBxy7CbZrdStulIO7a1y8ADuGgYZk4/++ppgvkSSTdSsmvSC7DdjZvyenxGWO0u2z5Jm8VdkICzHZlvLnVuOM/WIfpbArgKIb1dAJDmroFELLPlvMGCdVnAGeS9RIQrFsz4YYJpCoHoYlRI3/bQQRb9qRiMYAMuAiiH6nTMVG3Rk223gM/AoNqX9sQkktE42QB5ThpsO5C4sdyYRVTVKtjMQJXc1L9W6pyPRpjEJqTpnwY+HsEi+2Nmb8lZ2Aovtz/nzg1u9ZU1FYTb8glEpF1XUtK12NRfANDVmilLFGGFKxaJFHyUWKYOyoK1nbc4LYqbBkQZ/gVI7ozMtFGooVokw0apmWejoLg2UYyEDOLG1+rRQGYE6kec0WugzDJiAkju9zpzXWj0Jgiq+7rdFzeGKmmXNm16R64bbFAzDXLfREUgPOgVbaC/bBRKOBhJ+yjLBA90upgMNW4VKujcxV7DM6J7IjTZwX42QCzFDgXK/1mUU/DDQeK4kvkEnQgIfVmGj3xqkgu1pBYeMmwUN5bUC9VWw6fivzok71iP2cVV6pF/K8Pj/NYlzpkm8OzeJM5DwSY9S9cjVQMF3TNF30D+xiHdzFpkM+P2WQboEKoPYcojQ9R9XUo+icsQwKBK91/zO3eOd6R5y5+7xBurO91ef/Sohy0Be10Jj8Jl3yEK+s3IJuY6FyM1ZaDog+c7dQc50h39YqLPt8Fesk4surkAUubrL8+7f3Hn3TzJvz2ker5BKlh/4ftSxqwoBIy3F5JAQUzJx6QcJ4Y4/Tgg9bf8vq49B+5PuLv4A</diagram></mxfile>

BIN
images/la.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 21 KiB