347 lines
12 KiB
Plaintext
347 lines
12 KiB
Plaintext
[[grok-processor]]
|
|
=== Grok Processor
|
|
|
|
Extracts structured fields out of a single text field within a document. You choose which field to
|
|
extract matched fields from, as well as the grok pattern you expect will match. A grok pattern is like a regular
|
|
expression that supports aliased expressions that can be reused.
|
|
|
|
This tool is perfect for syslog logs, apache and other webserver logs, mysql logs, and in general, any log format
|
|
that is generally written for humans and not computer consumption.
|
|
This processor comes packaged with many
|
|
https://github.com/elastic/elasticsearch/blob/{branch}/libs/grok/src/main/resources/patterns[reusable patterns].
|
|
|
|
If you need help building patterns to match your logs, you will find the {kibana-ref}/xpack-grokdebugger.html[Grok Debugger] tool quite useful! The Grok Debugger is an {xpack} feature under the Basic License and is therefore *free to use*. The Grok Constructor at <http://grokconstructor.appspot.com/> is also a useful tool.
|
|
|
|
[[grok-basics]]
|
|
==== Grok Basics
|
|
|
|
Grok sits on top of regular expressions, so any regular expressions are valid in grok as well.
|
|
The regular expression library is Oniguruma, and you can see the full supported regexp syntax
|
|
https://github.com/kkos/oniguruma/blob/master/doc/RE[on the Oniguruma site].
|
|
|
|
Grok works by leveraging this regular expression language to allow naming existing patterns and combining them into more
|
|
complex patterns that match your fields.
|
|
|
|
The syntax for reusing a grok pattern comes in three forms: `%{SYNTAX:SEMANTIC}`, `%{SYNTAX}`, `%{SYNTAX:SEMANTIC:TYPE}`.
|
|
|
|
The `SYNTAX` is the name of the pattern that will match your text. For example, `3.44` will be matched by the `NUMBER`
|
|
pattern and `55.3.244.1` will be matched by the `IP` pattern. The syntax is how you match. `NUMBER` and `IP` are both
|
|
patterns that are provided within the default patterns set.
|
|
|
|
The `SEMANTIC` is the identifier you give to the piece of text being matched. For example, `3.44` could be the
|
|
duration of an event, so you could call it simply `duration`. Further, a string `55.3.244.1` might identify
|
|
the `client` making a request.
|
|
|
|
The `TYPE` is the type you wish to cast your named field. `int`, `long`, `double`, `float` and `boolean` are supported types for coercion.
|
|
|
|
For example, you might want to match the following text:
|
|
|
|
[source,txt]
|
|
--------------------------------------------------
|
|
3.44 55.3.244.1
|
|
--------------------------------------------------
|
|
|
|
You may know that the message in the example is a number followed by an IP address. You can match this text by using the following
|
|
Grok expression.
|
|
|
|
[source,txt]
|
|
--------------------------------------------------
|
|
%{NUMBER:duration} %{IP:client}
|
|
--------------------------------------------------
|
|
|
|
[[using-grok]]
|
|
==== Using the Grok Processor in a Pipeline
|
|
|
|
[[grok-options]]
|
|
.Grok Options
|
|
[options="header"]
|
|
|======
|
|
| Name | Required | Default | Description
|
|
| `field` | yes | - | The field to use for grok expression parsing
|
|
| `patterns` | yes | - | An ordered list of grok expression to match and extract named captures with. Returns on the first expression in the list that matches.
|
|
| `pattern_definitions` | no | - | A map of pattern-name and pattern tuples defining custom patterns to be used by the current processor. Patterns matching existing names will override the pre-existing definition.
|
|
| `trace_match` | no | false | when true, `_ingest._grok_match_index` will be inserted into your matched document's metadata with the index into the pattern found in `patterns` that matched.
|
|
| `ignore_missing` | no | false | If `true` and `field` does not exist or is `null`, the processor quietly exits without modifying the document
|
|
include::common-options.asciidoc[]
|
|
|======
|
|
|
|
Here is an example of using the provided patterns to extract out and name structured fields from a string field in
|
|
a document.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
POST _ingest/pipeline/_simulate
|
|
{
|
|
"pipeline": {
|
|
"description" : "...",
|
|
"processors": [
|
|
{
|
|
"grok": {
|
|
"field": "message",
|
|
"patterns": ["%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes:int} %{NUMBER:duration:double}"]
|
|
}
|
|
}
|
|
]
|
|
},
|
|
"docs":[
|
|
{
|
|
"_source": {
|
|
"message": "55.3.244.1 GET /index.html 15824 0.043"
|
|
}
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
|
|
This pipeline will insert these named captures as new fields within the document, like so:
|
|
|
|
[source,console-result]
|
|
--------------------------------------------------
|
|
{
|
|
"docs": [
|
|
{
|
|
"doc": {
|
|
"_index": "_index",
|
|
"_type": "_doc",
|
|
"_id": "_id",
|
|
"_source" : {
|
|
"duration" : 0.043,
|
|
"request" : "/index.html",
|
|
"method" : "GET",
|
|
"bytes" : 15824,
|
|
"client" : "55.3.244.1",
|
|
"message" : "55.3.244.1 GET /index.html 15824 0.043"
|
|
},
|
|
"_ingest": {
|
|
"timestamp": "2016-11-08T19:43:03.850+0000"
|
|
}
|
|
}
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[s/2016-11-08T19:43:03.850\+0000/$body.docs.0.doc._ingest.timestamp/]
|
|
|
|
[[custom-patterns]]
|
|
==== Custom Patterns
|
|
|
|
The Grok processor comes pre-packaged with a base set of patterns. These patterns may not always have
|
|
what you are looking for. Patterns have a very basic format. Each entry has a name and the pattern itself.
|
|
|
|
You can add your own patterns to a processor definition under the `pattern_definitions` option.
|
|
Here is an example of a pipeline specifying custom pattern definitions:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"description" : "...",
|
|
"processors": [
|
|
{
|
|
"grok": {
|
|
"field": "message",
|
|
"patterns": ["my %{FAVORITE_DOG:dog} is colored %{RGB:color}"],
|
|
"pattern_definitions" : {
|
|
"FAVORITE_DOG" : "beagle",
|
|
"RGB" : "RED|GREEN|BLUE"
|
|
}
|
|
}
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
[[trace-match]]
|
|
==== Providing Multiple Match Patterns
|
|
|
|
Sometimes one pattern is not enough to capture the potential structure of a field. Let's assume we
|
|
want to match all messages that contain your favorite pet breeds of either cats or dogs. One way to accomplish
|
|
this is to provide two distinct patterns that can be matched, instead of one really complicated expression capturing
|
|
the same `or` behavior.
|
|
|
|
Here is an example of such a configuration executed against the simulate API:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
POST _ingest/pipeline/_simulate
|
|
{
|
|
"pipeline": {
|
|
"description" : "parse multiple patterns",
|
|
"processors": [
|
|
{
|
|
"grok": {
|
|
"field": "message",
|
|
"patterns": ["%{FAVORITE_DOG:pet}", "%{FAVORITE_CAT:pet}"],
|
|
"pattern_definitions" : {
|
|
"FAVORITE_DOG" : "beagle",
|
|
"FAVORITE_CAT" : "burmese"
|
|
}
|
|
}
|
|
}
|
|
]
|
|
},
|
|
"docs":[
|
|
{
|
|
"_source": {
|
|
"message": "I love burmese cats!"
|
|
}
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
|
|
response:
|
|
|
|
[source,console-result]
|
|
--------------------------------------------------
|
|
{
|
|
"docs": [
|
|
{
|
|
"doc": {
|
|
"_type": "_doc",
|
|
"_index": "_index",
|
|
"_id": "_id",
|
|
"_source": {
|
|
"message": "I love burmese cats!",
|
|
"pet": "burmese"
|
|
},
|
|
"_ingest": {
|
|
"timestamp": "2016-11-08T19:43:03.850+0000"
|
|
}
|
|
}
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[s/2016-11-08T19:43:03.850\+0000/$body.docs.0.doc._ingest.timestamp/]
|
|
|
|
Both patterns will set the field `pet` with the appropriate match, but what if we want to trace which of our
|
|
patterns matched and populated our fields? We can do this with the `trace_match` parameter. Here is the output of
|
|
that same pipeline, but with `"trace_match": true` configured:
|
|
|
|
////
|
|
Hidden setup for example:
|
|
[source,console]
|
|
--------------------------------------------------
|
|
POST _ingest/pipeline/_simulate
|
|
{
|
|
"pipeline": {
|
|
"description" : "parse multiple patterns",
|
|
"processors": [
|
|
{
|
|
"grok": {
|
|
"field": "message",
|
|
"patterns": ["%{FAVORITE_DOG:pet}", "%{FAVORITE_CAT:pet}"],
|
|
"trace_match": true,
|
|
"pattern_definitions" : {
|
|
"FAVORITE_DOG" : "beagle",
|
|
"FAVORITE_CAT" : "burmese"
|
|
}
|
|
}
|
|
}
|
|
]
|
|
},
|
|
"docs":[
|
|
{
|
|
"_source": {
|
|
"message": "I love burmese cats!"
|
|
}
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
////
|
|
|
|
[source,console-result]
|
|
--------------------------------------------------
|
|
{
|
|
"docs": [
|
|
{
|
|
"doc": {
|
|
"_type": "_doc",
|
|
"_index": "_index",
|
|
"_id": "_id",
|
|
"_source": {
|
|
"message": "I love burmese cats!",
|
|
"pet": "burmese"
|
|
},
|
|
"_ingest": {
|
|
"_grok_match_index": "1",
|
|
"timestamp": "2016-11-08T19:43:03.850+0000"
|
|
}
|
|
}
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[s/2016-11-08T19:43:03.850\+0000/$body.docs.0.doc._ingest.timestamp/]
|
|
|
|
In the above response, you can see that the index of the pattern that matched was `"1"`. This is to say that it was the
|
|
second (index starts at zero) pattern in `patterns` to match.
|
|
|
|
This trace metadata enables debugging which of the patterns matched. This information is stored in the ingest
|
|
metadata and will not be indexed.
|
|
|
|
[[grok-processor-rest-get]]
|
|
==== Retrieving patterns from REST endpoint
|
|
|
|
The Grok Processor comes packaged with its own REST endpoint for retrieving which patterns the processor is packaged with.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET _ingest/processor/grok
|
|
--------------------------------------------------
|
|
|
|
The above request will return a response body containing a key-value representation of the built-in patterns dictionary.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"patterns" : {
|
|
"BACULA_CAPACITY" : "%{INT}{1,3}(,%{INT}{3})*",
|
|
"PATH" : "(?:%{UNIXPATH}|%{WINPATH})",
|
|
...
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|
|
|
|
This can be useful to reference as the built-in patterns change across versions.
|
|
|
|
[[grok-watchdog]]
|
|
==== Grok watchdog
|
|
|
|
Grok expressions that take too long to execute are interrupted and
|
|
the grok processor then fails with an exception. The grok
|
|
processor has a watchdog thread that determines when evaluation of
|
|
a grok expression takes too long and is controlled by the following
|
|
settings:
|
|
|
|
[[grok-watchdog-options]]
|
|
.Grok watchdog settings
|
|
[options="header"]
|
|
|======
|
|
| Name | Default | Description
|
|
| `ingest.grok.watchdog.interval` | 1s | How often to check whether there are grok evaluations that take longer than the maximum allowed execution time.
|
|
| `ingest.grok.watchdog.max_execution_time` | 1s | The maximum allowed execution of a grok expression evaluation.
|
|
|======
|
|
|
|
[[grok-debugging]]
|
|
==== Grok debugging
|
|
|
|
It is advised to use the {kibana-ref}/xpack-grokdebugger.html[Grok Debugger] to debug grok patterns. From there you can test one or more
|
|
patterns in the UI against sample data. Under the covers it uses the same engine as ingest node processor.
|
|
|
|
Additionally, it is recommended to enable debug logging for Grok so that any additional messages may also be seen in the Elasticsearch
|
|
server log.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
PUT _cluster/settings
|
|
{
|
|
"transient": {
|
|
"logger.org.elasticsearch.ingest.common.GrokProcessor": "debug"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// NOTCONSOLE
|