[[ingest]]
== Ingest Plugin
=== Processors
==== Set processor
Sets one or more fields and associates them with the specified values. If a field already exists,
its value will be replaced with the provided one.
[source,js]
--------------------------------------------------
{
"set": {
"fields": {
"field": 582.1
}
}
}
--------------------------------------------------
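Because `fields` is a map, a single set processor can assign several fields at once. A minimal sketch (the field names here are illustrative):

[source,js]
--------------------------------------------------
{
  "set": {
    "fields": {
      "field": 582.1,
      "environment": "production"
    }
  }
}
--------------------------------------------------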
==== Remove processor
Removes one or more existing fields. If a field doesn't exist, nothing will happen.
[source,js]
--------------------------------------------------
{
"remove": {
"fields": [
"field1","field2"
]
}
}
--------------------------------------------------
==== Rename processor
Renames one or more existing fields. An exception will be thrown if a field doesn't exist or if the new field
name is already in use.
[source,js]
--------------------------------------------------
{
"rename": {
"fields": {
"field1": "field2"
}
}
}
--------------------------------------------------
==== Convert processor
Converts one or more field values to a different type, such as turning a string into an integer.
If the field value is an array, all members will be converted.
The supported types are `integer`, `float`, `string`, and `boolean`.
`boolean` sets a field to true if its string value equals `true` (ignoring case), to
false if its string value equals `false` (ignoring case), and throws an exception otherwise.
[source,js]
--------------------------------------------------
{
"convert": {
"fields": {
"field1": "integer",
"field2": "float"
}
}
}
--------------------------------------------------
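For example, the `boolean` conversion described above could be applied like this (the field name is illustrative):

[source,js]
--------------------------------------------------
{
  "convert": {
    "fields": {
      "delete_flag": "boolean"
    }
  }
}
--------------------------------------------------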
==== Gsub processor
Converts a string field by applying a regular expression and a replacement.
If the field is not a string, the processor will throw an exception.
The configuration takes an `expressions` array consisting of objects. Each object
holds three elements: `field` for the field name, `pattern` for the
pattern to be replaced, and `replacement` for the string to replace the matching patterns with.
[source,js]
--------------------------------------------------
{
"gsub": {
"expressions": [
{
"field": "field1",
        "pattern": "\\.",
"replacement": "-"
}
]
}
}
--------------------------------------------------
==== Join processor
Joins each element of an array into a single string using a separator character between each element.
Throws an exception when the field is not an array.
[source,js]
--------------------------------------------------
{
"join": {
"fields": {
"joined_array_field": "other_array_field"
}
}
}
--------------------------------------------------
==== Split processor
Splits a field into an array using a separator character. Only works on string fields.
[source,js]
--------------------------------------------------
{
"split": {
"fields": {
"message": ","
}
}
}
--------------------------------------------------
==== Lowercase processor
Converts a string to its lowercase equivalent.
[source,js]
--------------------------------------------------
{
"lowercase": {
"fields": ["foo", "bar"]
}
}
--------------------------------------------------
==== Uppercase processor
Converts a string to its uppercase equivalent.
[source,js]
--------------------------------------------------
{
"uppercase": {
"fields": ["foo", "bar"]
}
}
--------------------------------------------------
==== Trim processor
Trims whitespace from a field. NOTE: this only removes leading and trailing whitespace.
[source,js]
--------------------------------------------------
{
"trim": {
"fields": ["foo", "bar"]
}
}
--------------------------------------------------
==== Grok Processor
The Grok Processor extracts structured fields out of a single text field within a document. You choose which field to
extract matched fields from, as well as the Grok Pattern you expect will match. A Grok Pattern is like a regular
expression that supports aliased expressions that can be reused.
This tool is perfect for syslog logs, Apache and other web server logs, MySQL logs, and in general, any log format
that is written for humans rather than for computer consumption.
The processor comes packaged with over 120 reusable patterns that are located at `$ES_HOME/config/ingest/grok/patterns`.
Here, you can add your own custom grok pattern files with custom grok expressions to be used by the processor.
If you need help building patterns to match your logs, you will find the http://grokdebug.herokuapp.com and
http://grokconstructor.appspot.com/ applications quite useful!
===== Grok Basics
Grok sits on top of regular expressions, so any regular expressions are valid in grok as well.
The regular expression library is Oniguruma, and you can see the full supported regexp syntax
https://github.com/kkos/oniguruma/blob/master/doc/RE[on the Oniguruma site].
Grok works by leveraging this regular expression language to allow naming existing patterns and combining them into more
complex patterns that match your fields.
The syntax for re-using a grok pattern comes in three forms: `%{SYNTAX:SEMANTIC}`, `%{SYNTAX}`, `%{SYNTAX:SEMANTIC:TYPE}`.
The `SYNTAX` is the name of the pattern that will match your text. For example, `3.44` will be matched by the `NUMBER`
pattern and `55.3.244.1` will be matched by the `IP` pattern. The syntax is how you match. `NUMBER` and `IP` are both
patterns that are provided within the default patterns set.
The `SEMANTIC` is the identifier you give to the piece of text being matched. For example, `3.44` could be the
duration of an event, so you could call it simply `duration`. Further, a string `55.3.244.1` might identify
the `client` making a request.
The `TYPE` is the type you wish to cast your named field to. `int` and `float` are currently the only types supported for coercion.
For example, suppose we would like to match text with the following contents:
[source,js]
--------------------------------------------------
3.44 55.3.244.1
--------------------------------------------------
We know that the above message is a number followed by an IP address. We can match this text with the following
Grok expression:
[source,js]
--------------------------------------------------
%{NUMBER:duration} %{IP:client}
--------------------------------------------------
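With the optional `TYPE` element described above, the same expression can also coerce the captured number, for example casting the duration to a float:

[source,js]
--------------------------------------------------
%{NUMBER:duration:float} %{IP:client}
--------------------------------------------------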
===== Custom Patterns and Pattern Files
The Grok Processor comes pre-packaged with a base set of pattern files. These patterns may not always have
what you are looking for. These pattern files have a very basic format. Each line describes a named pattern with
the following format:
[source,js]
--------------------------------------------------
NAME ' '+ PATTERN '\n'
--------------------------------------------------
You can add this pattern to an existing file, or add your own file in the patterns directory here: `$ES_HOME/config/ingest/grok/patterns`.
The Ingest Plugin picks up files in this directory and loads them into the grok processor's known patterns. These patterns are loaded
at startup, so you will need to restart your ingest node if you wish to update these files while it is running.
Example snippet of pattern definitions found in the `grok-patterns` patterns file:
[source,js]
--------------------------------------------------
YEAR (?>\d\d){1,2}
HOUR (?:2[0123]|[01]?[0-9])
MINUTE (?:[0-5][0-9])
SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
--------------------------------------------------
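A custom pattern file uses the same format. For instance, a hypothetical patterns file could define a pattern for hexadecimal queue ids, which could then be referenced in a grok expression as `%{QUEUE_ID:queue_id}`:

[source,js]
--------------------------------------------------
QUEUE_ID [0-9A-F]{10,11}
--------------------------------------------------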
===== Using Grok Processor in a Pipeline
[[grok-options]]
.Grok Options
[options="header"]
|======
| Name | Required | Default | Description
| `match_field` | yes | - | The field to use for grok expression parsing
| `match_pattern` | yes | - | The grok expression to match and extract named captures with
|======
Here is an example of using the provided patterns to extract out and name structured fields from a string field in
a document.
[source,js]
--------------------------------------------------
{
"message": "55.3.244.1 GET /index.html 15824 0.043"
}
--------------------------------------------------
The pattern for this could be:
[source]
--------------------------------------------------
%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}
--------------------------------------------------
An example pipeline for processing the above document using Grok:
[source,js]
--------------------------------------------------
{
"description" : "...",
"processors": [
{
"grok": {
"match_field": "message",
"match_pattern": "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"
}
}
]
}
--------------------------------------------------
This pipeline will insert these named captures as new fields within the document, like so:
[source,js]
--------------------------------------------------
{
"message": "55.3.244.1 GET /index.html 15824 0.043",
"client": "55.3.244.1",
"method": "GET",
"request": "/index.html",
"bytes": 15824,
"duration": "0.043"
}
--------------------------------------------------
==== Geoip processor
The GeoIP processor adds information about the geographical location of IP addresses, based on data from the Maxmind databases.
This processor adds this information by default under the `geoip` field.
The ingest plugin ships by default with the GeoLite2 City and GeoLite2 Country geoip2 databases from Maxmind, made available
under the CCA-ShareAlike 3.0 license. For more details, see http://dev.maxmind.com/geoip/geoip2/geolite2/
The GeoIP processor can run with other geoip2 databases from Maxmind. The files must be copied into the geoip config directory
and the `database_file` option should be used to specify the filename of the custom database. The geoip config directory
is located at `$ES_HOME/config/ingest/geoip` and holds the shipped databases too.
[[geoip-options]]
.Geoip options
[options="header"]
|======
| Name | Required | Default | Description
| `source_field` | yes | - | The field to get the ip address or hostname from for the geographical lookup.
| `target_field` | no | geoip | The field that will hold the geographical information looked up from the Maxmind database.
| `database_file` | no | GeoLite2-City.mmdb | The database filename in the geoip config directory. The ingest plugin ships with the GeoLite2-City.mmdb and GeoLite2-Country.mmdb files.
| `fields` | no | [`continent_name`, `country_iso_code`, `region_name`, `city_name`, `location`] <1> | Controls what properties are added to the `target_field` based on the geoip lookup.
|======
<1> Depends on what is available in `database_file`:
* If the GeoLite2 City database is used, then the following fields may be added under the `target_field`: `ip`,
`country_iso_code`, `country_name`, `continent_name`, `region_name`, `city_name`, `timezone`, `latitude`, `longitude`
and `location`. The fields actually added depend on what has been found and which fields were configured in `fields`.
* If the GeoLite2 Country database is used, then the following fields may be added under the `target_field`: `ip`,
`country_iso_code`, `country_name` and `continent_name`. The fields actually added depend on what has been found and which fields were configured in `fields`.
An example that uses the default city database and adds the geographical information to the `geoip` field based on the `ip` field:
[source,js]
--------------------------------------------------
{
"description" : "...",
"processors" : [
{
"geoip" : {
"source_field" : "ip"
}
}
]
}
--------------------------------------------------
An example that uses the default country database and adds the geographical information to the `geo` field based on the `ip` field:
[source,js]
--------------------------------------------------
{
"description" : "...",
"processors" : [
{
"geoip" : {
"source_field" : "ip",
"target_field" : "geo",
"database_file" : "GeoLite2-Country.mmdb"
}
}
]
}
--------------------------------------------------
==== Date processor
The date processor parses dates from fields and uses the parsed date or timestamp as the timestamp of the document.
By default, the date processor adds the parsed date as a new field called `@timestamp`; this is configurable by setting the `target_field`
configuration parameter. Multiple date formats are supported as part of the same date processor definition. They will be tried
sequentially when parsing the date field, in the same order they were defined as part of the processor definition.
[[date-options]]
.Date options
[options="header"]
|======
| Name | Required | Default | Description
| `match_field` | yes | - | The field to get the date from.
| `target_field` | no | @timestamp | The field that will hold the parsed date.
| `match_formats` | yes | - | Array of the expected date formats. Can be a joda pattern or one of the following formats: ISO8601, UNIX, UNIX_MS, TAI64N.
| `timezone` | no | UTC | The timezone to use when parsing the date.
| `locale` | no | ENGLISH | The locale to use when parsing the date, relevant when parsing month names or week days.
|======
An example that adds the parsed date to the `timestamp` field based on the `initial_date` field:
[source,js]
--------------------------------------------------
{
"description" : "...",
"processors" : [
{
"date" : {
"match_field" : "initial_date",
"target_field" : "timestamp",
"match_formats" : ["dd/MM/yyyy hh:mm:ss"],
"timezone" : "Europe/Amsterdam"
}
}
]
}
--------------------------------------------------
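Since `match_formats` is an array, several formats can be listed and will be tried in order. A sketch that first tries the joda pattern from the example above and then falls back to ISO8601:

[source,js]
--------------------------------------------------
{
  "description" : "...",
  "processors" : [
    {
      "date" : {
        "match_field" : "initial_date",
        "match_formats" : ["dd/MM/yyyy hh:mm:ss", "ISO8601"]
      }
    }
  ]
}
--------------------------------------------------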
==== Meta processor
The `meta` processor allows you to modify metadata properties of a document being processed.
The following example changes the index of a document to `alternative_index` instead of indexing it into an index
that was specified in the index or bulk request:
[source,js]
--------------------------------------------------
{
"description" : "...",
"processors" : [
{
"meta" : {
"_index" : "alternative_index"
}
}
]
}
--------------------------------------------------
The following metadata attributes can be modified in this processor: `_index`, `_type`, `_id`, `_routing`, `_parent`,
`_timestamp` and `_ttl`. All these metadata attributes can be specified in the body of the `meta` processor.
The metadata settings in this processor are also templatable, which allows metadata field values to be replaced with
field values from the source of the document being indexed. The mustache template language is used, and anything between
`{{` and `}}` can contain a template that points to any field in the source of the document.
In the following example, documents being processed end up being indexed into an index based on the city name resolved
by the `geoip` processor (for example, `city-amsterdam`):
[source,js]
--------------------------------------------------
{
"description" : "...",
"processors" : [
{
"geoip" : {
        "source_field" : "ip"
}
},
{
"meta" : {
"_index" : "city-{{geoip.city_name}}"
}
}
]
}
--------------------------------------------------
=== Put pipeline API
The put pipeline API adds new pipelines and updates existing pipelines in the cluster.
[source,js]
--------------------------------------------------
PUT _ingest/pipeline/my-pipeline-id
{
"description" : "describe pipeline",
"processors" : [
{
"simple" : {
// settings
}
},
// other processors
]
}
--------------------------------------------------
// AUTOSENSE
NOTE: Each ingest node updates its processors asynchronously in the background, so it may take a few seconds for all
nodes to have the latest version of the pipeline.
=== Get pipeline API
The get pipeline API returns pipelines based on id. This API always returns a local reference of the pipeline.
[source,js]
--------------------------------------------------
GET _ingest/pipeline/my-pipeline-id
--------------------------------------------------
// AUTOSENSE
Example response:
[source,js]
--------------------------------------------------
{
"my-pipeline-id": {
"_source" : {
"description": "describe pipeline",
"processors": [
{
"simple" : {
// settings
}
},
// other processors
]
},
"_version" : 0
}
}
--------------------------------------------------
For each returned pipeline, the source and the version are returned.
The version is useful for knowing which version of the pipeline the node has.
Multiple ids can be provided at the same time, and wildcards are also supported.
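For example, several ids and a wildcard pattern can be combined in a single request (the ids here are illustrative):

[source,js]
--------------------------------------------------
GET _ingest/pipeline/my-pipeline-id,other-pipeline-id,my-*
--------------------------------------------------
// AUTOSENSE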
=== Delete pipeline API
The delete pipeline API deletes pipelines by id.
[source,js]
--------------------------------------------------
DELETE _ingest/pipeline/my-pipeline-id
--------------------------------------------------
// AUTOSENSE