[[ingest]]
== Ingest Plugin

TODO

=== Processors

==== Grok Processor

The Grok Processor extracts structured fields out of a single text field within a document. You choose which field to
extract matched fields from, as well as the Grok Pattern you expect will match. A Grok Pattern is like a regular
expression that supports aliased expressions that can be reused.

This tool is perfect for syslog logs, apache and other webserver logs, mysql logs, and in general, any log format
that is generally written for humans and not computer consumption.

The processor comes packaged with over 120 reusable patterns that are located at `$ES_HOME/config/ingest/grok/patterns`.
Here, you can add your own custom grok pattern files with custom grok expressions to be used by the processor.

If you need help building patterns to match your logs, you will find the <http://grokdebug.herokuapp.com> and
<http://grokconstructor.appspot.com/> applications quite useful!

===== Grok Basics

Grok sits on top of regular expressions, so any regular expressions are valid in grok as well.
The regular expression library is Oniguruma, and you can see the full supported regexp syntax
https://github.com/kkos/oniguruma/blob/master/doc/RE[on the Oniguruma site].

Grok works by leveraging this regular expression language to allow naming existing patterns and combining them into more
complex patterns that match your fields.

The syntax for re-using a grok pattern comes in three forms: `%{SYNTAX:SEMANTIC}`, `%{SYNTAX}`, and `%{SYNTAX:SEMANTIC:TYPE}`.

The `SYNTAX` is the name of the pattern that will match your text. For example, `3.44` will be matched by the `NUMBER`
pattern and `55.3.244.1` will be matched by the `IP` pattern. The syntax is how you match. `NUMBER` and `IP` are both
patterns that are provided within the default patterns set.

The `SEMANTIC` is the identifier you give to the piece of text being matched. For example, `3.44` could be the
duration of an event, so you could call it simply `duration`. Further, a string `55.3.244.1` might identify
the `client` making a request.

The `TYPE` is the type to which you wish to cast your named field. `int` and `float` are currently the only types supported for coercion.
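
For instance, combining the stock `NUMBER` pattern with a type, the following sketch would coerce the captured value into an integer instead of leaving it as a string (the `bytes` field name is only an illustration):

[source]
--------------------------------------------------
%{NUMBER:bytes:int}
--------------------------------------------------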

For example, here is a grok pattern that would match the examples given above. Suppose we would like to match text with the following contents:

[source,js]
--------------------------------------------------
3.44 55.3.244.1
--------------------------------------------------

We know that the above message is a number followed by an IP address. We can match this text with the following
Grok expression.

[source,js]
--------------------------------------------------
%{NUMBER:duration} %{IP:client}
--------------------------------------------------

===== Custom Patterns and Pattern Files

The Grok Processor comes pre-packaged with a base set of pattern files. These patterns may not always have
what you are looking for. These pattern files have a very basic format. Each line describes a named pattern with
the following format:

[source,js]
--------------------------------------------------
NAME ' '+ PATTERN '\n'
--------------------------------------------------

You can add patterns in this format to an existing file, or add your own file in the patterns directory here: `$ES_HOME/config/ingest/grok/patterns`.
The Ingest Plugin will pick up files in this directory to be loaded into the grok processor's known patterns. These patterns are loaded
at startup, so you will need to restart your ingest node if you wish to update these files while it is running.

Example snippet of pattern definitions found in the `grok-patterns` patterns file:

[source,js]
--------------------------------------------------
YEAR (?>\d\d){1,2}
HOUR (?:2[0123]|[01]?[0-9])
MINUTE (?:[0-5][0-9])
SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
--------------------------------------------------
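
As a sketch of a custom pattern file, a file of your own dropped into that directory could define new names in the same format and compose the bundled patterns; the `MSISDN` and `CALL_LINE` names below are purely illustrative, and once loaded they can be referenced in grok expressions just like the bundled ones:

[source]
--------------------------------------------------
MSISDN \+?[0-9]{10,15}
CALL_LINE %{MSISDN} %{MSISDN} %{NUMBER}
--------------------------------------------------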

===== Using Grok Processor in a Pipeline

[[grok-options]]
.Grok Options
[options="header"]
|======
| Name | Required | Default | Description
| `match_field` | yes | - | The field to use for grok expression parsing
| `match_pattern` | yes | - | The grok expression to match and extract named captures with
|======

Here is an example of using the provided patterns to extract out and name structured fields from a string field in
a document.

[source,js]
--------------------------------------------------
{
  "message": "55.3.244.1 GET /index.html 15824 0.043"
}
--------------------------------------------------

The pattern for this could be:

[source]
--------------------------------------------------
%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}
--------------------------------------------------

An example pipeline for processing the above document using Grok:

[source,js]
--------------------------------------------------
{
  "description" : "...",
  "processors": [
    {
      "grok": {
        "match_field": "message",
        "match_pattern": "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"
      }
    }
  ]
}
--------------------------------------------------

This pipeline will insert these named captures as new fields within the document, like so:

[source,js]
--------------------------------------------------
{
  "message": "55.3.244.1 GET /index.html 15824 0.043",
  "client": "55.3.244.1",
  "method": "GET",
  "request": "/index.html",
  "bytes": "15824",
  "duration": "0.043"
}
--------------------------------------------------
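
If you would rather have `bytes` and `duration` come back as numbers, a variation of the same pattern could use the type coercion described above; this is a sketch, not output from a real run:

[source]
--------------------------------------------------
%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes:int} %{NUMBER:duration:float}
--------------------------------------------------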

=== Put pipeline API

The put pipeline API adds new pipelines and updates existing pipelines in the cluster.

[source,js]
--------------------------------------------------
PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "simple" : {
        // settings
      }
    },
    // other processors
  ]
}
--------------------------------------------------
// AUTOSENSE
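
As a concrete sketch, a pipeline built around the grok processor documented above might be registered like this (the `my-grok-pipeline` id and the description are only illustrations):

[source,js]
--------------------------------------------------
PUT _ingest/pipeline/my-grok-pipeline
{
  "description" : "extract structured fields from web access log lines",
  "processors" : [
    {
      "grok" : {
        "match_field" : "message",
        "match_pattern" : "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"
      }
    }
  ]
}
--------------------------------------------------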

NOTE: Each ingest node updates its processors asynchronously in the background, so it may take a few seconds for all
nodes to have the latest version of the pipeline.

=== Get pipeline API

The get pipeline API returns pipelines based on id. This API always returns a local reference of the pipeline.

[source,js]
--------------------------------------------------
GET _ingest/pipeline/my-pipeline-id
--------------------------------------------------
// AUTOSENSE

Example response:

[source,js]
--------------------------------------------------
{
  "my-pipeline-id": {
    "_source" : {
      "description": "describe pipeline",
      "processors": [
        {
          "simple" : {
            // settings
          }
        },
        // other processors
      ]
    },
    "_version" : 0
  }
}
--------------------------------------------------

For each returned pipeline, the source and the version are returned.
The version is useful for knowing which version of the pipeline the node has.
Multiple ids can be provided at the same time, and wildcards are also supported.
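
For example, assuming the comma-separated id syntax and `*` wildcards used by other Elasticsearch APIs, requests along these lines could fetch several pipelines at once (the ids are only illustrations):

[source,js]
--------------------------------------------------
GET _ingest/pipeline/my-pipeline-id,my-other-pipeline-id
GET _ingest/pipeline/my-*
--------------------------------------------------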

=== Delete pipeline API

The delete pipeline API deletes pipelines by id.

[source,js]
--------------------------------------------------
DELETE _ingest/pipeline/my-pipeline-id
--------------------------------------------------
// AUTOSENSE