[[ingest]]
== Ingest Plugin

TODO

=== Processors

==== Grok Processor

The Grok Processor extracts structured fields out of a single text field within a document. You choose which field to
extract matched fields from, as well as the Grok Pattern you expect will match. A Grok Pattern is like a regular
expression that supports aliased expressions that can be reused.

This tool is well suited to syslog logs, Apache and other webserver logs, MySQL logs, and in general, any log format
that is written for humans rather than for computer consumption.

The processor comes packaged with over 120 reusable patterns that are located at `$ES_HOME/config/ingest/grok/patterns`.
Here, you can add your own custom grok pattern files with custom grok expressions to be used by the processor.

If you need help building patterns to match your logs, you will find the <http://grokdebug.herokuapp.com> and
<http://grokconstructor.appspot.com/> applications quite useful!

===== Grok Basics

Grok sits on top of regular expressions, so any regular expression is valid in grok as well.
The regular expression library is Oniguruma, and you can see the full supported regexp syntax
https://github.com/kkos/oniguruma/blob/master/doc/RE[on the Oniguruma site].

Grok works by leveraging this regular expression language to allow naming existing patterns and combining them into more
complex patterns that match your fields.

The syntax for reusing a grok pattern comes in three forms: `%{SYNTAX:SEMANTIC}`, `%{SYNTAX}`, `%{SYNTAX:SEMANTIC:TYPE}`.

The `SYNTAX` is the name of the pattern that will match your text. For example, `3.44` will be matched by the `NUMBER`
pattern and `55.3.244.1` will be matched by the `IP` pattern. The syntax is how you match. `NUMBER` and `IP` are both
patterns that are provided within the default patterns set.

The `SEMANTIC` is the identifier you give to the piece of text being matched. For example, `3.44` could be the
duration of an event, so you could call it simply `duration`. Further, a string `55.3.244.1` might identify
the `client` making a request.

The `TYPE` is the type you wish to cast your named field to. `int` and `float` are currently the only types supported for coercion.

For example, suppose we want to match text with the following contents:

[source,js]
--------------------------------------------------
3.44 55.3.244.1
--------------------------------------------------

We know that the message above is a number followed by an IP address, so we can match it with the following
Grok expression:

[source,js]
--------------------------------------------------
%{NUMBER:duration} %{IP:client}
--------------------------------------------------
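To make the three reference forms concrete, here is a minimal Python sketch of how a grok expression could be expanded into an ordinary regular expression with named groups. It is an illustration only, not the processor's implementation: the `PATTERNS` entries are simplified, hypothetical stand-ins for the bundled `NUMBER` and `IP` definitions, nested pattern references are not handled, and Python's `re` module is used in place of Oniguruma.

```python
import re

# Hypothetical, simplified stand-ins for the bundled NUMBER and IP patterns;
# the real definitions are considerably more thorough.
PATTERNS = {
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
}

# Matches %{SYNTAX}, %{SYNTAX:SEMANTIC} and %{SYNTAX:SEMANTIC:TYPE}.
GROK_REF = re.compile(r"%\{(\w+)(?::(\w+))?(?::(\w+))?\}")

# The two coercion types named in the text above.
CASTS = {"int": int, "float": float}


def compile_grok(expression):
    """Translate a grok expression into a compiled regex plus a cast table."""
    casts = {}

    def expand(match):
        syntax, semantic, type_name = match.groups()
        body = PATTERNS[syntax]
        if semantic is None:
            return f"(?:{body})"  # unnamed reference: match without capturing
        if type_name is not None:
            casts[semantic] = CASTS[type_name]
        return f"(?P<{semantic}>{body})"

    return re.compile(GROK_REF.sub(expand, expression)), casts


def grok(expression, text):
    """Return the named captures for text, or None if nothing matches."""
    regex, casts = compile_grok(expression)
    match = regex.search(text)
    if match is None:
        return None
    # Captures stay strings unless a TYPE was given for that name.
    return {name: casts.get(name, str)(value)
            for name, value in match.groupdict().items()}


print(grok("%{NUMBER:duration} %{IP:client}", "3.44 55.3.244.1"))
print(grok("%{NUMBER:duration:float} %{IP:client}", "3.44 55.3.244.1"))
```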

===== Custom Patterns and Pattern Files

The Grok Processor comes pre-packaged with a base set of pattern files. These patterns may not always have
what you are looking for, so you can define your own. Pattern files have a very basic format. Each line describes a named pattern with
the following format:

[source,js]
--------------------------------------------------
NAME ' '+ PATTERN '\n'
--------------------------------------------------

You can add a pattern to an existing file, or add your own file in the patterns directory: `$ES_HOME/config/ingest/grok/patterns`.
The Ingest Plugin picks up the files in this directory and loads them into the grok processor's known patterns. These patterns are loaded
at startup, so you will need to restart your ingest node for changes to these files to take effect.

Example snippet of pattern definitions found in the `grok-patterns` patterns file:

[source,js]
--------------------------------------------------
YEAR (?>\d\d){1,2}
HOUR (?:2[0123]|[01]?[0-9])
MINUTE (?:[0-5][0-9])
SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
--------------------------------------------------
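The line format is simple enough to read by hand. The following Python sketch parses `NAME ' '+ PATTERN` lines into a dictionary and inlines bare `%{NAME}` references, as `TIME` uses for `HOUR`, `MINUTE` and `SECOND`. It is only a sketch: the function names are hypothetical, skipping blank and `#` lines is an assumption of this example, and the resulting strings are printed rather than compiled, because the patterns target Oniguruma rather than Python's `re`.

```python
import re

# Bare references such as %{HOUR}; named forms are out of scope for this sketch.
REF = re.compile(r"%\{(\w+)\}")


def parse_pattern_lines(lines):
    """Parse `NAME ' '+ PATTERN` lines into a {name: pattern} dict."""
    patterns = {}
    for line in lines:
        line = line.rstrip("\n")
        # Assumption of this sketch: blank lines and '#' comments are ignored.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        name, _, pattern = line.partition(" ")
        patterns[name] = pattern.lstrip(" ")  # tolerate several separating spaces
    return patterns


def resolve(name, patterns):
    """Recursively inline %{NAME} references (no cycle detection)."""
    return REF.sub(lambda m: resolve(m.group(1), patterns), patterns[name])


SNIPPET = r"""
YEAR (?>\d\d){1,2}
HOUR (?:2[0123]|[01]?[0-9])
MINUTE (?:[0-5][0-9])
SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
"""

patterns = parse_pattern_lines(SNIPPET.splitlines())
print(sorted(patterns))
print(resolve("TIME", patterns))
```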

===== Using Grok Processor in a Pipeline

[[grok-options]]
.Grok Options
[options="header"]
|======
| Name | Required | Default | Description
| `match_field` | yes | - | The field to use for grok expression parsing
| `match_pattern` | yes | - | The grok expression to match and extract named captures with
|======

Here is an example of using the provided patterns to extract and name structured fields from a string field in
a document.

[source,js]
--------------------------------------------------
{
  "message": "55.3.244.1 GET /index.html 15824 0.043"
}
--------------------------------------------------

The pattern for this could be:

[source]
--------------------------------------------------
%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}
--------------------------------------------------

An example pipeline for processing the above document using Grok:

[source,js]
--------------------------------------------------
{
  "description" : "...",
  "processors": [
    {
      "grok": {
        "match_field": "message",
        "match_pattern": "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"
      }
    }
  ]
}
--------------------------------------------------

This pipeline will insert the named captures as new fields within the document. Because the pattern specifies no `TYPE`,
the captured values remain strings:

[source,js]
--------------------------------------------------
{
  "message": "55.3.244.1 GET /index.html 15824 0.043",
  "client": "55.3.244.1",
  "method": "GET",
  "request": "/index.html",
  "bytes": "15824",
  "duration": "0.043"
}
--------------------------------------------------
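Since both grok options are required, it can help to check for them before sending a pipeline definition anywhere. The sketch below builds a pipeline body in Python and rejects a grok processor that lacks `match_field` or `match_pattern`. The helper name `grok_pipeline` and the description text are hypothetical, and the check mirrors only the options table above, not any validation the plugin itself performs.

```python
import json

# Required options, transcribed from the Grok Options table.
REQUIRED_GROK_OPTIONS = ("match_field", "match_pattern")


def grok_pipeline(description, **options):
    """Build a pipeline body with one grok processor, checking required options."""
    missing = [name for name in REQUIRED_GROK_OPTIONS if name not in options]
    if missing:
        raise ValueError("missing required grok options: " + ", ".join(missing))
    return {"description": description, "processors": [{"grok": dict(options)}]}


body = grok_pipeline(
    "parse access log lines",  # hypothetical description
    match_field="message",
    match_pattern="%{IP:client} %{WORD:method} %{URIPATHPARAM:request} "
                  "%{NUMBER:bytes} %{NUMBER:duration}",
)
print(json.dumps(body, indent=2))
```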

==== GeoIP Processor

The GeoIP processor adds information about the geographical location of IP addresses, based on data from the MaxMind databases.
This processor adds this information by default under the `geoip` field.

The ingest plugin ships by default with the GeoLite2 City and GeoLite2 Country geoip2 databases from MaxMind made available
under the CCA-ShareAlike 3.0 license. For more details, see http://dev.maxmind.com/geoip/geoip2/geolite2/

The GeoIP processor can run with other geoip2 databases from MaxMind. The files must be copied into the geoip config directory
and the `database_file` option should be used to specify the filename of the custom database. The geoip config directory
is located at `$ES_HOME/config/ingest/geoip` and holds the shipped databases too.

[[geoip-options]]
.GeoIP Options
[options="header"]
|======
| Name | Required | Default | Description
| `ip_field` | yes | - | The field to get the IP address from for the geographical lookup.
| `target_field` | no | geoip | The field that will hold the geographical information looked up from the MaxMind database.
| `database_file` | no | GeoLite2-City.mmdb | The database filename in the geoip config directory. The ingest plugin ships with the GeoLite2-City.mmdb and GeoLite2-Country.mmdb files.
|======

If the GeoLite2 City database is used, then the following fields will be added under the `target_field`: `ip`,
`country_iso_code`, `country_name`, `continent_name`, `region_name`, `city_name`, `timezone`, `latitude`, `longitude`
and `location`.

If the GeoLite2 Country database is used, then the following fields will be added under the `target_field`: `ip`,
`country_iso_code`, `country_name` and `continent_name`.

An example that uses the default city database and adds the geographical information to the `geoip` field based on the `ip` field:

[source,js]
--------------------------------------------------
{
  "description" : "...",
  "processors" : [
    {
      "geoip" : {
        "ip_field" : "ip"
      }
    }
  ]
}
--------------------------------------------------

An example that uses the shipped country database and adds the geographical information to the `geo` field based on the `ip` field:

[source,js]
--------------------------------------------------
{
  "description" : "...",
  "processors" : [
    {
      "geoip" : {
        "ip_field" : "ip",
        "target_field" : "geo",
        "database_file" : "GeoLite2-Country.mmdb"
      }
    }
  ]
}
--------------------------------------------------
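The interaction between the defaults and the two shipped databases can be summarized in a few lines of Python. This sketch takes a `geoip` processor configuration, applies the documented defaults, and lists the fields the text above says will be added. The function name is hypothetical, and the dotted `target_field.name` notation is just a convenient way for this example to show nesting.

```python
# Defaults and field lists transcribed from the options table and text above.
GEOIP_DEFAULTS = {"target_field": "geoip", "database_file": "GeoLite2-City.mmdb"}

COUNTRY_FIELDS = ["ip", "country_iso_code", "country_name", "continent_name"]
CITY_FIELDS = COUNTRY_FIELDS + ["region_name", "city_name", "timezone",
                                "latitude", "longitude", "location"]
FIELDS_BY_DATABASE = {
    "GeoLite2-City.mmdb": CITY_FIELDS,
    "GeoLite2-Country.mmdb": COUNTRY_FIELDS,
}


def expected_geoip_fields(processor):
    """Return the dotted field paths a geoip processor config is documented to add."""
    options = {**GEOIP_DEFAULTS, **processor}
    if "ip_field" not in options:
        raise ValueError("ip_field is required")
    # Only the two shipped databases are known to this sketch.
    fields = FIELDS_BY_DATABASE[options["database_file"]]
    return [f"{options['target_field']}.{name}" for name in fields]


print(expected_geoip_fields({"ip_field": "ip"}))
print(expected_geoip_fields({
    "ip_field": "ip",
    "target_field": "geo",
    "database_file": "GeoLite2-Country.mmdb",
}))
```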

=== Put pipeline API

The put pipeline API adds pipelines and updates existing pipelines in the cluster.

[source,js]
--------------------------------------------------
PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "simple" : {
        // settings
      }
    },
    // other processors
  ]
}
--------------------------------------------------
// AUTOSENSE
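For readers scripting this call outside of a console, the request could be assembled with Python's standard library roughly as follows. This is a sketch under stated assumptions: the node address `http://localhost:9200` is assumed, the helper name is hypothetical, and a `geoip` processor from the section above stands in for the placeholder processor. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Assumed node address for illustration; adjust for your cluster.
BASE_URL = "http://localhost:9200"


def put_pipeline_request(pipeline_id, body):
    """Build (but do not send) the HTTP request for the put pipeline API."""
    return urllib.request.Request(
        url=f"{BASE_URL}/_ingest/pipeline/{pipeline_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = put_pipeline_request(
    "my-pipeline-id",
    {
        "description": "describe pipeline",
        "processors": [{"geoip": {"ip_field": "ip"}}],
    },
)
print(request.get_method(), request.full_url)
# Sending is left out on purpose; with a reachable node it would be:
# urllib.request.urlopen(request)
```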

NOTE: Each ingest node updates its processors asynchronously in the background, so it may take a few seconds for all
nodes to have the latest version of the pipeline.

=== Get pipeline API

The get pipeline API returns pipelines based on id. This API always returns the node's local copy of the pipeline.

[source,js]
--------------------------------------------------
GET _ingest/pipeline/my-pipeline-id
--------------------------------------------------
// AUTOSENSE

Example response:

[source,js]
--------------------------------------------------
{
  "my-pipeline-id": {
    "_source" : {
      "description": "describe pipeline",
      "processors": [
        {
          "simple" : {
            // settings
          }
        },
        // other processors
      ]
    },
    "_version" : 0
  }
}
--------------------------------------------------
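Each entry in the response is keyed by pipeline id and carries a `_source` and a `_version`, so a client can collect the versions with a short loop. The Python sketch below does this for a sample shaped like the example response; the placeholder comments are removed so that the sample is valid JSON, and the function name is hypothetical.

```python
import json

# Sample shaped like the example response, with the placeholder comments
# removed so that it parses as JSON.
RESPONSE = """
{
  "my-pipeline-id": {
    "_source": {
      "description": "describe pipeline",
      "processors": [{"simple": {}}]
    },
    "_version": 0
  }
}
"""


def pipeline_versions(response_text):
    """Map each returned pipeline id to its _version."""
    return {pipeline_id: entry["_version"]
            for pipeline_id, entry in json.loads(response_text).items()}


print(pipeline_versions(RESPONSE))
```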

For each returned pipeline, the source and the version are returned.
The version is useful for knowing what version of the pipeline the node has.
Multiple ids can be provided at the same time, and wildcards are supported.

=== Delete pipeline API

The delete pipeline API deletes pipelines by id.

[source,js]
--------------------------------------------------
DELETE _ingest/pipeline/my-pipeline-id
--------------------------------------------------
// AUTOSENSE