Add pattern matching (text processing) use case to Data Prepper documentation (#6181)

* Add pattern matching use case to Data Prepper documentation

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Add pattern matching use case to Data Prepper documentation

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Address SME comments

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Copy edits

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

* Address editorial feedback

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

---------

Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
This commit is contained in:
Melissa Vagi 2024-01-30 12:00:58 -07:00 committed by GitHub
parent 8225f0b511
commit d9d5406f93
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
1 changed files with 215 additions and 0 deletions

View File

@ -0,0 +1,215 @@
---
layout: default
title: Text processing
parent: Common use cases
nav_order: 35
---
# Text processing
Data Prepper provides text processing capabilities with the [`grok processor`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/grok/). The `grok` processor is based on the [`java-grok`](https://mvnrepository.com/artifact/io.krakens/java-grok) library and supports all compatible patterns. The `java-grok` library is built using the [`java.util.regex`](https://docs.oracle.com/javase/8/docs/api/java/util/regex/package-summary.html) regular expression library.
You can add custom patterns to your pipelines by using the `patterns_definitions` option. When debugging custom patterns, the [Grok Debugger](https://grokdebugger.com/) can be helpful.
## Basic usage
To get started with text processing, create the following pipeline:
```json
patten-matching-pipeline:
source
...
processor:
- grok:
match:
message: ['%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}']
sink:
- opensearch:
# Provide an OpenSearch cluster endpoint
```
{% include copy-curl.html %}
An incoming message might contain the following contents:
```json
{"message": "127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200"}
```
{% include copy-curl.html %}
In each incoming event, the pipeline will locate the value in the `message` key and attempt to match the pattern. The keywords `IPORHOST`, `HTTPDATE`, and `NUMBER` are built into the plugin.
When an incoming record matches the pattern, it generates an internal event such as the following with identification keys extracted from the original message:
```json
{
"message":"127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200",
"response_status":200,
"clientip":"198.126.12",
"timestamp":"10/Oct/2000:13:55:36 -0700"
}
```
{% include copy-curl.html %}
The `match` configuration for the `grok` processor specifies which record keys to match against which patterns.
In the following example, the `match` configuration checks incoming logs for a `message` key. If the key exists, it matches the key value against the `SYSLOGBASE` pattern and then against the `COMMONAPACHELOG` pattern. It then checks the logs for a `timestamp` key. If that key exists, it attempts to match the key value against the `TIMESTAMP_ISO8601` pattern.
```json
processor:
- grok:
match:
message: ['%{SYSLOGBASE}', "%{COMMONAPACHELOG}"]
timestamp: ["%{TIMESTAMP_ISO8601}"]
```
{% include copy-curl.html %}
By default, the plugin continues until it finds a successful match. For example, if there is a successful match against the value in the `message` key for a `SYSLOGBASE` pattern, the plugin doesn't attempt to match the other patterns. If you want to match logs against every pattern, include the `break_on_match` option.
## Including named and empty captures
Include the `keep_empty_captures` option in your pipeline configuration to include null captures or the `named_captures_only` option to include only named captures. Named captures follow the pattern `%{SYNTAX:SEMANTIC}` while unnamed captures follow the pattern `%{SYNTAX}`.
For example, you can modify the preceding Grok configuration to remove `clientip` from the `%{IPORHOST}` pattern:
```json
processor:
- grok:
match:
message: ['%{IPORHOST} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}']
```
{% include copy-curl.html %}
The resulting grokked log will look like this:
```json
{
"message":"127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200",
"response_status":200,
"timestamp":"10/Oct/2000:13:55:36 -0700"
}
```
{% include copy-curl.html %}
Notice that the `clientip` key no longer exists because the `%{IPORHOST}` pattern is now an unnamed capture.
However, if you set `named_captures_only` to `false`:
```json
processor:
- grok:
match:
named_captures_only: false
message: ['%{IPORHOST} \[%{HTTPDATE:timestamp}\] %{NUMBER:message:int}']
```
{% include copy-curl.html %}
Then the resulting grokked log will look like this:
```json
{
"message":"127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200",
"MONTH":"Oct",
"YEAR":"2000",
"response_status":200,
"HOUR":"13",
"TIME":"13:55:36",
"MINUTE":"55",
"SECOND":"36",
"IPORHOST":"198.126.12",
"MONTHDAY":"10",
"INT":"-0700",
"timestamp":"10/Oct/2000:13:55:36 -0700"
}
```
{% include copy-curl.html %}
Note that the `IPORHOST` capture now shows up as a new key, along with some internal unnamed captures like `MONTH` and `YEAR`. The `HTTPDATE` keyword is currently using these patterns, which you can see in the default patterns file.
## Overwriting keys
Include the `keys_to_overwrite` option to specify which existing record keys to overwrite if there is a capture with the same key value.
For example, you can modify the preceding Grok configuration to replace `%{NUMBER:response_status:int}` with `%{NUMBER:message:int}` and add `message` to the list of keys to overwrite:
```json
processor:
- grok:
match:
keys_to_overwrite: ["message"]
message: ['%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] %{NUMBER:message:int}']
```
{% include copy-curl.html %}
In the resulting grokked log, the original message is overwritten with the number `200`:
```json
{
"message":200,
"clientip":"198.126.12",
"timestamp":"10/Oct/2000:13:55:36 -0700"
}
```
{% include copy-curl.html %}
## Using custom patterns
Include the `pattern_definitions` option in your Grok configuration to specify custom patterns.
The following configuration creates custom regex patterns named `CUSTOM_PATTERN-1` and `CUSTOM_PATTERN-2`. By default, the plugin continues until it finds a successful match.
```json
processor:
- grok:
pattern_definitions:
CUSTOM_PATTERN_1: 'this-is-regex-1'
CUSTOM_PATTERN_2: '%{CUSTOM_PATTERN_1} REGEX'
match:
message: ["%{CUSTOM_PATTERN_2:my_pattern_key}"]
```
{% include copy-curl.html %}
If you specify `break_on_match` as `false`, the pipeline attempts to match all patterns and extract keys from the incoming events:
```json
processor:
- grok:
pattern_definitions:
CUSTOM_PATTERN_1: 'this-is-regex-1'
CUSTOM_PATTERN_2: 'this-is-regex-2'
CUSTOM_PATTERN_3: 'this-is-regex-3'
CUSTOM_PATTERN_4: 'this-is-regex-4'
match:
message: [ "%{PATTERN1}”, "%{PATTERN2}" ]
log: [ "%{PATTERN3}", "%{PATTERN4}" ]
break_on_match: false
```
{% include copy-curl.html %}
You can define your own custom patterns to use for pipeline pattern matching. In the previous example, `my_pattern` will be extracted after matching the custom patterns.
## Storing captures with a parent key
Include the `target_key` option in your Grok configuration to wrap all record captures in an additional outer key value.
For example, you can modify the preceding Grok configuration to add a target key named `grokked`:
```json
processor:
- grok:
target_key: "grokked"
match:
message: ['%{IPORHOST} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}']
```
The resulting grokked log will look like this:
```json
{
"message":"127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200",
"grokked": {
"response_status":200,
"clientip":"198.126.12",
"timestamp":"10/Oct/2000:13:55:36 -0700"
}
}
```