Add pattern matching (text processing) use case to Data Prepper documentation (#6181)
* Add pattern matching use case to Data Prepper documentation
  Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
* Add pattern matching use case to Data Prepper documentation
  Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
* Address SME comments
  Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
* Copy edits
  Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
* Address editorial feedback
  Signed-off-by: Melissa Vagi <vagimeli@amazon.com>
---------
Signed-off-by: Melissa Vagi <vagimeli@amazon.com>

parent 8225f0b511
commit d9d5406f93

@@ -0,0 +1,215 @@

---
layout: default
title: Text processing
parent: Common use cases
nav_order: 35
---

# Text processing

Data Prepper provides text processing capabilities with the [`grok` processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/grok/). The `grok` processor is based on the [`java-grok`](https://mvnrepository.com/artifact/io.krakens/java-grok) library and supports all compatible patterns. The `java-grok` library is built using the [`java.util.regex`](https://docs.oracle.com/javase/8/docs/api/java/util/regex/package-summary.html) regular expression library.

You can add custom patterns to your pipelines by using the `pattern_definitions` option. When debugging custom patterns, the [Grok Debugger](https://grokdebugger.com/) can be helpful.

## Basic usage

To get started with text processing, create the following pipeline:

```yaml
pattern-matching-pipeline:
  source:
    ...
  processor:
    - grok:
        match:
          message: ['%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}']
  sink:
    - opensearch:
        # Provide an OpenSearch cluster endpoint
```
{% include copy-curl.html %}

An incoming message might contain the following contents:

```json
{"message": "127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200"}
```
{% include copy-curl.html %}

In each incoming event, the pipeline locates the value of the `message` key and attempts to match it against the pattern. The keywords `IPORHOST`, `HTTPDATE`, and `NUMBER` are built into the plugin.

When an incoming record matches the pattern, the pipeline generates an internal event, such as the following, that contains the keys extracted from the original message:

```json
{
  "message":"127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200",
  "response_status":200,
  "clientip":"198.126.12",
  "timestamp":"10/Oct/2000:13:55:36 -0700"
}
```
{% include copy-curl.html %}

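To see how a Grok expression turns into extracted keys, the following Python sketch expands `%{SYNTAX:SEMANTIC:TYPE}` tokens into a regular expression with named groups and applies it to the sample message. It is an illustration only, not Data Prepper's implementation: the `PATTERNS` table holds simplified stand-ins for the built-in patterns, and the `compile_grok` and `grok` helpers are hypothetical names.

```python
import re

# Simplified stand-ins for the built-in patterns (assumptions for this sketch);
# the definitions that ship with java-grok are far more thorough.
PATTERNS = {
    "IPORHOST": r"[A-Za-z0-9.\-]+",
    "HTTPDATE": r"\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}",
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
}

# Matches %{SYNTAX}, %{SYNTAX:SEMANTIC}, and %{SYNTAX:SEMANTIC:TYPE}.
TOKEN = re.compile(r"%\{(\w+)(?::(\w+))?(?::(\w+))?\}")


def compile_grok(expression):
    """Expand Grok tokens into a regex and collect requested type casts."""
    types = {}

    def expand(token):
        syntax, semantic, type_name = token.groups()
        if semantic is None:
            # Unnamed capture: match the text but do not keep it.
            return "(?:" + PATTERNS[syntax] + ")"
        if type_name:
            types[semantic] = type_name
        return "(?P<" + semantic + ">" + PATTERNS[syntax] + ")"

    return re.compile(TOKEN.sub(expand, expression)), types


def grok(expression, event, key="message"):
    """Return the event with captured keys added, or unchanged on no match."""
    regex, types = compile_grok(expression)
    found = regex.search(event[key])
    if found is None:
        return event
    captures = found.groupdict()
    for name, type_name in types.items():
        if type_name == "int":
            captures[name] = int(captures[name])
    return {**event, **captures}


expression = r"%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}"
event = {"message": "127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200"}
print(grok(expression, event))
```

Because the expression is searched for anywhere in the message, the leading `127.0.0.1` is skipped and `clientip` is taken from the token that is immediately followed by the bracketed date.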
The `match` configuration for the `grok` processor specifies which record keys to match against which patterns.

In the following example, the `match` configuration checks incoming logs for a `message` key. If the key exists, it matches the key value against the `SYSLOGBASE` pattern and then against the `COMMONAPACHELOG` pattern. It then checks the logs for a `timestamp` key. If that key exists, it attempts to match the key value against the `TIMESTAMP_ISO8601` pattern.

```yaml
processor:
  - grok:
      match:
        message: ['%{SYSLOGBASE}', "%{COMMONAPACHELOG}"]
        timestamp: ["%{TIMESTAMP_ISO8601}"]
```
{% include copy-curl.html %}

By default, the plugin stops at the first successful match. For example, if the value in the `message` key matches the `SYSLOGBASE` pattern, the plugin doesn't attempt to match the other patterns. To match logs against every pattern, set the `break_on_match` option to `false`.

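The effect of `break_on_match` can be sketched in a few lines of Python. This is a simplified model, not the processor's code: it uses plain regular expressions with named groups instead of Grok patterns, the `match_event` helper is a hypothetical name, and it assumes that a successful match stops processing for all remaining keys, not only for the current one.

```python
import re


def match_event(event, match, break_on_match=True):
    """Try each key's patterns in order and collect the named captures."""
    captures = {}
    for key, patterns in match.items():
        if key not in event:
            continue  # Nothing to match if the event lacks this key.
        for pattern in patterns:
            found = re.search(pattern, event[key])
            if found is None:
                continue
            captures.update(found.groupdict())
            if break_on_match:
                # Assumption: stop after the first successful match overall.
                return captures
    return captures


# Plain-regex stand-ins for Grok patterns (illustrative only).
match = {
    "message": [
        r"(?P<program>\w+)\[(?P<pid>\d+)\]",
        r"(?P<verb>GET|POST) (?P<path>\S+)",
    ],
    "timestamp": [r"(?P<year>\d{4})-(?P<month>\d{2})"],
}
event = {"message": "sshd[42] GET /index.html", "timestamp": "2000-10-10"}

print(match_event(event, match))                        # first match only
print(match_event(event, match, break_on_match=False))  # every pattern
```

With the default setting, only `program` and `pid` are captured. With `break_on_match=False`, the second `message` pattern and the `timestamp` pattern also contribute their keys.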
## Including named and empty captures

Include the `keep_empty_captures` option in your pipeline configuration to include null captures or the `named_captures_only` option to include only named captures. Named captures follow the pattern `%{SYNTAX:SEMANTIC}`, while unnamed captures follow the pattern `%{SYNTAX}`.

For example, you can modify the preceding Grok configuration to remove `clientip` from the `%{IPORHOST}` pattern:

```yaml
processor:
  - grok:
      match:
        message: ['%{IPORHOST} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}']
```
{% include copy-curl.html %}

The resulting grokked log will look like this:

```json
{
  "message":"127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200",
  "response_status":200,
  "timestamp":"10/Oct/2000:13:55:36 -0700"
}
```
{% include copy-curl.html %}

Notice that the `clientip` key no longer exists because the `%{IPORHOST}` pattern is now an unnamed capture.

However, if you set `named_captures_only` to `false`:

```yaml
processor:
  - grok:
      named_captures_only: false
      match:
        message: ['%{IPORHOST} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}']
```
{% include copy-curl.html %}

Then the resulting grokked log will look like this:

```json
{
  "message":"127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200",
  "MONTH":"Oct",
  "YEAR":"2000",
  "response_status":200,
  "HOUR":"13",
  "TIME":"13:55:36",
  "MINUTE":"55",
  "SECOND":"36",
  "IPORHOST":"198.126.12",
  "MONTHDAY":"10",
  "INT":"-0700",
  "timestamp":"10/Oct/2000:13:55:36 -0700"
}
```
{% include copy-curl.html %}

Note that the `IPORHOST` capture now appears as a new key, along with some internal unnamed captures, such as `MONTH` and `YEAR`. The `HTTPDATE` pattern is built from these patterns, which you can see in the default patterns file.

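The following Python sketch illustrates why nested keys such as `MONTH` and `TIME` appear when `named_captures_only` is `false`. It is a simplified model rather than the `java-grok` implementation: the `PATTERNS` table contains abbreviated stand-ins for the default pattern definitions, and `expand` and `captures` are hypothetical helper names.

```python
import re

# Abbreviated stand-ins for the default patterns (assumptions for this sketch).
PATTERNS = {
    "MONTHDAY": r"\d{2}",
    "MONTH": r"[A-Za-z]{3}",
    "YEAR": r"\d{4}",
    "HOUR": r"\d{2}",
    "MINUTE": r"\d{2}",
    "SECOND": r"\d{2}",
    "INT": r"[+-]?\d+",
    "TIME": r"%{HOUR}:%{MINUTE}:%{SECOND}",
    "HTTPDATE": r"%{MONTHDAY}/%{MONTH}/%{YEAR}:%{TIME} %{INT}",
}
TOKEN = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def expand(expression, named_captures_only=True):
    """Recursively expand Grok tokens into a regex string."""

    def replace(token):
        syntax, semantic = token.groups()
        body = expand(PATTERNS[syntax], named_captures_only)
        if semantic:
            return "(?P<" + semantic + ">" + body + ")"
        if named_captures_only:
            return "(?:" + body + ")"  # Unnamed capture is discarded.
        return "(?P<" + syntax + ">" + body + ")"  # Keyed by pattern name.

    return TOKEN.sub(replace, expression)


def captures(expression, text, named_captures_only=True):
    found = re.search(expand(expression, named_captures_only), text)
    return found.groupdict() if found else {}


date = "10/Oct/2000:13:55:36 -0700"
print(captures("%{HTTPDATE:timestamp}", date))
print(captures("%{HTTPDATE:timestamp}", date, named_captures_only=False))
```

In this model, every `%{SYNTAX}` token nested inside `HTTPDATE` becomes a capture keyed by its pattern name once unnamed captures are kept, which mirrors the extra keys in the preceding output.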
## Overwriting keys

Include the `keys_to_overwrite` option to specify which existing record keys to overwrite if there is a capture with the same key.

For example, you can modify the preceding Grok configuration to replace `%{NUMBER:response_status:int}` with `%{NUMBER:message:int}` and add `message` to the list of keys to overwrite:

```yaml
processor:
  - grok:
      keys_to_overwrite: ["message"]
      match:
        message: ['%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] %{NUMBER:message:int}']
```
{% include copy-curl.html %}

In the resulting grokked log, the original message is overwritten with the number `200`:

```json
{
  "message":200,
  "clientip":"198.126.12",
  "timestamp":"10/Oct/2000:13:55:36 -0700"
}
```
{% include copy-curl.html %}

## Using custom patterns

Include the `pattern_definitions` option in your Grok configuration to specify custom patterns.

The following configuration creates custom regex patterns named `CUSTOM_PATTERN_1` and `CUSTOM_PATTERN_2`. By default, the plugin continues until it finds a successful match.

```yaml
processor:
  - grok:
      pattern_definitions:
        CUSTOM_PATTERN_1: 'this-is-regex-1'
        CUSTOM_PATTERN_2: '%{CUSTOM_PATTERN_1} REGEX'
      match:
        message: ["%{CUSTOM_PATTERN_2:my_pattern_key}"]
```
{% include copy-curl.html %}

If you specify `break_on_match` as `false`, the pipeline attempts to match all patterns and extract keys from the incoming events:

```yaml
processor:
  - grok:
      pattern_definitions:
        CUSTOM_PATTERN_1: 'this-is-regex-1'
        CUSTOM_PATTERN_2: 'this-is-regex-2'
        CUSTOM_PATTERN_3: 'this-is-regex-3'
        CUSTOM_PATTERN_4: 'this-is-regex-4'
      match:
        message: [ "%{CUSTOM_PATTERN_1}", "%{CUSTOM_PATTERN_2}" ]
        log: [ "%{CUSTOM_PATTERN_3}", "%{CUSTOM_PATTERN_4}" ]
      break_on_match: false
```
{% include copy-curl.html %}

You can define your own custom patterns to use for pipeline pattern matching. In the first example in this section, the `my_pattern_key` key is extracted after the custom patterns are matched.

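Because one custom pattern can reference another, it can help to see how the references resolve. The following Python sketch expands the `CUSTOM_PATTERN_1` and `CUSTOM_PATTERN_2` definitions from this section into a single regular expression. The `to_regex` helper is a hypothetical name, and the expansion is a simplified model of what a Grok library does internally.

```python
import re

# The custom definitions from the first example in this section.
pattern_definitions = {
    "CUSTOM_PATTERN_1": r"this-is-regex-1",
    "CUSTOM_PATTERN_2": r"%{CUSTOM_PATTERN_1} REGEX",
}
TOKEN = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def to_regex(expression, definitions):
    """Resolve %{NAME} and %{NAME:key} references, including nested ones."""

    def replace(token):
        name, key = token.groups()
        body = to_regex(definitions[name], definitions)
        return f"(?P<{key}>{body})" if key else f"(?:{body})"

    return TOKEN.sub(replace, expression)


regex = to_regex("%{CUSTOM_PATTERN_2:my_pattern_key}", pattern_definitions)
print(regex)

found = re.search(regex, "prefix this-is-regex-1 REGEX suffix")
print(found.groupdict())
```

The named group wraps the fully expanded pattern, so `my_pattern_key` holds the whole matched text, including the part contributed by `CUSTOM_PATTERN_1`.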
## Storing captures with a parent key

Include the `target_key` option in your Grok configuration to wrap all record captures in an additional outer key.

For example, you can modify the Grok configuration from the basic usage example to add a target key named `grokked`:

```yaml
processor:
  - grok:
      target_key: "grokked"
      match:
        message: ['%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] %{NUMBER:response_status:int}']
```

The resulting grokked log will look like this:

```json
{
  "message":"127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200",
  "grokked": {
    "response_status":200,
    "clientip":"198.126.12",
    "timestamp":"10/Oct/2000:13:55:36 -0700"
  }
}
```

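The difference between merging captures at the top level and nesting them under a target key can be modeled in Python as follows. This is a sketch of the resulting event shape only; `apply_target_key` is a hypothetical helper, not part of Data Prepper.

```python
def apply_target_key(event, captures, target_key=None):
    """Merge captures into the event, or nest them under target_key."""
    if target_key is None:
        return {**event, **captures}
    return {**event, target_key: dict(captures)}


event = {"message": "127.0.0.1 198.126.12 [10/Oct/2000:13:55:36 -0700] 200"}
captures = {
    "response_status": 200,
    "clientip": "198.126.12",
    "timestamp": "10/Oct/2000:13:55:36 -0700",
}

print(apply_target_key(event, captures))             # captures at the top level
print(apply_target_key(event, captures, "grokked"))  # captures under "grokked"
```

Nesting keeps the original event keys separate from the extracted ones, which can help avoid key collisions when several processors write to the same event.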