[[pipe-line]]
== Pipeline Definition
A pipeline is a definition of a series of <<ingest-processors, processors>> that are to be executed
in the same order as they are declared. A pipeline consists of two main fields: a `description`
and a list of `processors`:
[source,js]
--------------------------------------------------
{
"description" : "...",
"processors" : [ ... ]
}
--------------------------------------------------
The `description` field stores a human-readable description of
what the pipeline does.

The `processors` parameter defines a list of processors to be executed in
order.
[[ingest-apis]]
== Ingest APIs
The following ingest APIs are available for managing pipelines:
* <<put-pipeline-api>> to add or update a pipeline
* <<get-pipeline-api>> to return a specific pipeline
* <<delete-pipeline-api>> to delete a pipeline
* <<simulate-pipeline-api>> to simulate a call to a pipeline
[[put-pipeline-api]]
=== Put Pipeline API
The put pipeline API adds pipelines and updates existing pipelines in the cluster.
[source,js]
--------------------------------------------------
PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "simple" : {
        // settings
      }
    },
    // other processors
  ]
}
--------------------------------------------------
// AUTOSENSE
NOTE: The put pipeline API also instructs all ingest nodes to reload their in-memory representation of pipelines, so that
pipeline changes take effect immediately.
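As a rough sketch of how a client might issue this request, the following Python snippet builds (but does not send) the HTTP call using only the standard library. The host `http://localhost:9200`, the helper name `build_put_pipeline_request`, and the `set` processor used as the pipeline body are assumptions made for illustration.

```python
import json
import urllib.request


def build_put_pipeline_request(pipeline_id, pipeline, host="http://localhost:9200"):
    """Build a PUT request for the put pipeline API (the host is an assumed default)."""
    body = json.dumps(pipeline).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/_ingest/pipeline/{pipeline_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical pipeline body used only for this illustration.
pipeline = {
    "description": "describe pipeline",
    "processors": [{"set": {"field": "field2", "value": "_value"}}],
}
request = build_put_pipeline_request("my-pipeline-id", pipeline)
# Sending is left to the caller, for example urllib.request.urlopen(request).
```

Keeping request construction separate from sending makes the sketch easy to inspect without a running cluster.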
[[get-pipeline-api]]
=== Get Pipeline API
The get pipeline API returns pipelines based on ID. This API always returns a local reference of the pipeline.
[source,js]
--------------------------------------------------
GET _ingest/pipeline/my-pipeline-id
--------------------------------------------------
// AUTOSENSE
Example response:
[source,js]
--------------------------------------------------
{
  "my-pipeline-id": {
    "_source" : {
      "description": "describe pipeline",
      "processors": [
        {
          "simple" : {
            // settings
          }
        },
        // other processors
      ]
    },
    "_version" : 0
  }
}
--------------------------------------------------
For each returned pipeline, the source and the version are returned.
The version is useful for knowing which version of the pipeline the node has.
You can specify multiple IDs to return more than one pipeline. Wildcards are also supported.
[[delete-pipeline-api]]
=== Delete Pipeline API
The delete pipeline API deletes pipelines by ID.
[source,js]
--------------------------------------------------
DELETE _ingest/pipeline/my-pipeline-id
--------------------------------------------------
// AUTOSENSE
[[simulate-pipeline-api]]
=== Simulate Pipeline API
The simulate pipeline API executes a specific pipeline against
the set of documents provided in the body of the request.

You can either specify an existing pipeline to execute
against the provided documents, or supply a pipeline definition in
the body of the request.

Here is the structure of a simulate request with a pipeline definition provided
in the body of the request:
[source,js]
--------------------------------------------------
POST _ingest/pipeline/_simulate
{
  "pipeline" : {
    // pipeline definition here
  },
  "docs" : [
    { /** first document **/ },
    { /** second document **/ },
    // ...
  ]
}
--------------------------------------------------
Here is the structure of a simulate request against an existing pipeline:
[source,js]
--------------------------------------------------
POST _ingest/pipeline/my-pipeline-id/_simulate
{
  "docs" : [
    { /** first document **/ },
    { /** second document **/ },
    // ...
  ]
}
--------------------------------------------------
Here is an example of a simulate request with a pipeline defined in the request
and its response:
[source,js]
--------------------------------------------------
POST _ingest/pipeline/_simulate
{
"pipeline" :
{
"description": "_description",
"processors": [
{
"set" : {
"field" : "field2",
"value" : "_value"
}
}
]
},
"docs": [
{
"_index": "index",
"_type": "type",
"_id": "id",
"_source": {
"foo": "bar"
}
},
{
"_index": "index",
"_type": "type",
"_id": "id",
"_source": {
"foo": "rab"
}
}
]
}
--------------------------------------------------
// AUTOSENSE
Response:
[source,js]
--------------------------------------------------
{
"docs": [
{
"doc": {
"_id": "id",
"_ttl": null,
"_parent": null,
"_index": "index",
"_routing": null,
"_type": "type",
"_timestamp": null,
"_source": {
"field2": "_value",
"foo": "bar"
},
"_ingest": {
"timestamp": "2016-01-04T23:53:27.186+0000"
}
}
},
{
"doc": {
"_id": "id",
"_ttl": null,
"_parent": null,
"_index": "index",
"_routing": null,
"_type": "type",
"_timestamp": null,
"_source": {
"field2": "_value",
"foo": "rab"
},
"_ingest": {
"timestamp": "2016-01-04T23:53:27.186+0000"
}
}
}
]
}
--------------------------------------------------
[[ingest-verbose-param]]
==== Viewing Verbose Results
You can use the simulate pipeline API to see how each processor affects the ingest document
as it passes through the pipeline. To see the intermediate results of
each processor in the simulate request, you can add the `verbose` parameter
to the request.
Here is an example of a verbose request and its response:
[source,js]
--------------------------------------------------
POST _ingest/pipeline/_simulate?verbose
{
"pipeline" :
{
"description": "_description",
"processors": [
{
"set" : {
"field" : "field2",
"value" : "_value2"
}
},
{
"set" : {
"field" : "field3",
"value" : "_value3"
}
}
]
},
"docs": [
{
"_index": "index",
"_type": "type",
"_id": "id",
"_source": {
"foo": "bar"
}
},
{
"_index": "index",
"_type": "type",
"_id": "id",
"_source": {
"foo": "rab"
}
}
]
}
--------------------------------------------------
// AUTOSENSE
Response:
[source,js]
--------------------------------------------------
{
"docs": [
{
"processor_results": [
{
"tag": "processor[set]-0",
"doc": {
"_id": "id",
"_ttl": null,
"_parent": null,
"_index": "index",
"_routing": null,
"_type": "type",
"_timestamp": null,
"_source": {
"field2": "_value2",
"foo": "bar"
},
"_ingest": {
"timestamp": "2016-01-05T00:02:51.383+0000"
}
}
},
{
"tag": "processor[set]-1",
"doc": {
"_id": "id",
"_ttl": null,
"_parent": null,
"_index": "index",
"_routing": null,
"_type": "type",
"_timestamp": null,
"_source": {
"field3": "_value3",
"field2": "_value2",
"foo": "bar"
},
"_ingest": {
"timestamp": "2016-01-05T00:02:51.383+0000"
}
}
}
]
},
{
"processor_results": [
{
"tag": "processor[set]-0",
"doc": {
"_id": "id",
"_ttl": null,
"_parent": null,
"_index": "index",
"_routing": null,
"_type": "type",
"_timestamp": null,
"_source": {
"field2": "_value2",
"foo": "rab"
},
"_ingest": {
"timestamp": "2016-01-05T00:02:51.384+0000"
}
}
},
{
"tag": "processor[set]-1",
"doc": {
"_id": "id",
"_ttl": null,
"_parent": null,
"_index": "index",
"_routing": null,
"_type": "type",
"_timestamp": null,
"_source": {
"field3": "_value3",
"field2": "_value2",
"foo": "rab"
},
"_ingest": {
"timestamp": "2016-01-05T00:02:51.384+0000"
}
}
}
]
}
]
}
--------------------------------------------------
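To make the shape of the verbose response more concrete, here is a toy Python model of a simulate call that supports only `set` processors. It is not how the server is implemented: the function name `simulate` and the reduced response (only `_source`, without metadata fields or the `_ingest` timestamp) are assumptions made for illustration.

```python
import copy


def simulate(pipeline, docs, verbose=False):
    """Toy model: apply `set` processors to each document's `_source`."""
    results = []
    for doc in docs:
        source = copy.deepcopy(doc["_source"])
        steps = []
        for index, processor in enumerate(pipeline["processors"]):
            (name, config), = processor.items()
            if name != "set":
                raise ValueError(f"unsupported processor in this sketch: {name}")
            source[config["field"]] = config["value"]
            # Snapshot the document after each processor, as `verbose` reports it.
            steps.append({
                "tag": f"processor[{name}]-{index}",
                "doc": {"_source": copy.deepcopy(source)},
            })
        if verbose:
            results.append({"processor_results": steps})
        else:
            results.append({"doc": {"_source": source}})
    return {"docs": results}
```

Without `verbose`, only the final state of each document is reported; with it, every processor contributes one entry that shows the document as it looked after that step.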
[[accessing-data-in-pipelines]]
== Accessing Data in Pipelines
The processors in a pipeline have read and write access to documents that pass through the pipeline.
The processors can access fields in the source of a document and the document's metadata fields.
[float]
[[accessing-source-fields]]
=== Accessing Fields in the Source
Accessing a field in the source is straightforward: refer to the field by
its name. For example:
[source,js]
--------------------------------------------------
{
  "set": {
    "field": "my_field",
    "value": 582.1
  }
}
--------------------------------------------------
On top of this, fields from the source are always accessible via the `_source` prefix:
[source,js]
--------------------------------------------------
{
  "set": {
    "field": "_source.my_field",
    "value": 582.1
  }
}
--------------------------------------------------
[float]
[[accessing-metadata-fields]]
=== Accessing Metadata Fields
You can access metadata fields in the same way that you access fields in the source. This
is possible because Elasticsearch doesn't allow fields in the source that have the
same name as metadata fields.

The following example sets the `_id` metadata field of a document to `1`:
[source,js]
--------------------------------------------------
{
  "set": {
    "field": "_id",
    "value": "1"
  }
}
--------------------------------------------------
The following metadata fields are accessible by a processor: `_index`, `_type`, `_id`, `_routing`, `_parent`,
`_timestamp`, and `_ttl`.
[float]
[[accessing-ingest-metadata]]
=== Accessing Ingest Metadata Fields
Beyond metadata fields and source fields, ingest also adds ingest metadata to the documents that it processes.
These metadata properties are accessible under the `_ingest` key. Currently ingest adds the ingest timestamp
under the `_ingest.timestamp` key of the ingest metadata. The ingest timestamp is the time when Elasticsearch
received the index or bulk request to pre-process the document.

Any processor can add ingest-related metadata during document processing. Ingest metadata is transient
and is lost after a document has been processed by the pipeline. Therefore, ingest metadata won't be indexed.

The following example adds a field with the name `received`. The value is the ingest timestamp:
[source,js]
--------------------------------------------------
{
  "set": {
    "field": "received",
    "value": "{{_ingest.timestamp}}"
  }
}
--------------------------------------------------
Unlike Elasticsearch metadata fields, the ingest metadata field name `_ingest` can be used as a valid field name
in the source of a document. Use `_source._ingest` to refer to the field in the source document. Otherwise, `_ingest`
will be interpreted as an ingest metadata field.
[float]
[[accessing-template-fields]]
=== Accessing Fields and Metafields in Templates
A number of processor settings also support templating. Settings that support templating can have zero or more
template snippets. A template snippet begins with `{{` and ends with `}}`.
Accessing fields and metafields in templates is exactly the same as via regular processor field settings.

The following example adds a field named `field_c`. Its value is a concatenation of
the values of `field_a` and `field_b`.
[source,js]
--------------------------------------------------
{
  "set": {
    "field": "field_c",
    "value": "{{field_a}} {{field_b}}"
  }
}
--------------------------------------------------
The following example uses the value of the `geoip.country_iso_code` field in the source
to set the index that the document will be indexed into:
[source,js]
--------------------------------------------------
{
  "set": {
    "field": "_index",
    "value": "{{geoip.country_iso_code}}"
  }
}
--------------------------------------------------
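The lookup rules described in this section can be sketched in a few lines of Python. This is a simplified stand-in for whatever template engine the server actually uses: the function names `resolve` and `render` are hypothetical, and only plain `{{path}}` snippets are handled.

```python
import re

_SNIPPET = re.compile(r"\{\{\s*([^{}\s]+)\s*\}\}")


def resolve(path, source, metadata=None, ingest=None):
    """Resolve a dotted path against the source, metadata, or ingest metadata."""
    metadata = metadata or {}
    ingest = ingest or {}
    parts = path.split(".")
    if parts[0] == "_source":
        current, parts = source, parts[1:]   # explicit `_source` prefix
    elif parts[0] == "_ingest":
        current, parts = ingest, parts[1:]   # ingest metadata such as the timestamp
    elif parts[0] in metadata:
        return metadata[parts[0]]            # metadata fields such as `_index` or `_id`
    else:
        current = source                     # plain field names refer to the source
    for part in parts:
        current = current[part]
    return current


def render(template, source, metadata=None, ingest=None):
    """Replace every {{path}} snippet with the value it refers to."""
    return _SNIPPET.sub(
        lambda match: str(resolve(match.group(1), source, metadata, ingest)),
        template,
    )
```

For instance, `render("{{field_a}} {{field_b}}", {"field_a": "x", "field_b": "y"})` concatenates the two source fields with a space between them.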
[[handling-failure-in-pipelines]]
== Handling Failures in Pipelines
In its simplest use case, a pipeline defines a list of processors that
are executed sequentially, and processing halts at the first exception. This
behavior may not be desirable when failures are expected. For example, you may have logs
that don't match the specified grok expression. Instead of halting execution, you may
want to index such documents into a separate index.

To enable this behavior, you can use the `on_failure` parameter. The `on_failure` parameter
defines a list of processors to be executed immediately following the failed processor.

You can specify this parameter at the pipeline level, as well as at the processor
level. If a processor specifies an `on_failure` configuration, whether
it is empty or not, any exceptions that are thrown by the processor are caught, and the
pipeline continues executing the remaining processors. Because you can define further processors
within the scope of an `on_failure` statement, you can nest failure handling.

The following example defines a pipeline that renames the `foo` field in
the processed document to `bar`. If the document does not contain the `foo` field, the processor
attaches an error message to the document for later analysis within
Elasticsearch.
[source,js]
--------------------------------------------------
{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [
    {
      "rename" : {
        "field" : "foo",
        "to" : "bar",
        "on_failure" : [
          {
            "set" : {
              "field" : "error",
              "value" : "field \"foo\" does not exist, cannot rename to \"bar\""
            }
          }
        ]
      }
    }
  ]
}
--------------------------------------------------
The following example defines an `on_failure` block on a whole pipeline to change
the index to which failed documents get sent.
[source,js]
--------------------------------------------------
{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [ ... ],
  "on_failure" : [
    {
      "set" : {
        "field" : "_index",
        "value" : "failed-{{ _index }}"
      }
    }
  ]
}
--------------------------------------------------
[float]
[[accessing-error-metadata]]
=== Accessing Error Metadata From Processors Handling Exceptions
You may want to retrieve the actual error message that was thrown
by a failed processor. To do so you can access metadata fields called
`on_failure_message`, `on_failure_processor_type`, and `on_failure_processor_tag`. These fields are only accessible
from within the context of an `on_failure` block.

Here is an updated version of the example that you
saw earlier. Instead of setting the error message manually, the example uses the `on_failure_message`
metadata field to provide the error message.
[source,js]
--------------------------------------------------
{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [
    {
      "rename" : {
        "field" : "foo",
        "to" : "bar",
        "on_failure" : [
          {
            "set" : {
              "field" : "error",
              "value" : "{{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    }
  ]
}
--------------------------------------------------
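The control flow described in this section can be modeled with a short Python sketch. It is a toy model under stated assumptions, not the server implementation: the function names, the flat dictionary used as a document, the two handlers, and the wording of the error message are all hypothetical.

```python
def run_processors(processors, doc, ingest, handlers):
    """Run processors in order, delegating errors to `on_failure` lists."""
    for processor in processors:
        (name, config), = processor.items()
        try:
            handlers[name](doc, config, ingest)
        except Exception as error:
            if "on_failure" not in config:
                raise  # unhandled here: propagate to the enclosing level
            # The failure metadata is visible only inside the on_failure block.
            failure_ingest = dict(ingest, on_failure_message=str(error))
            run_processors(config["on_failure"], doc, failure_ingest, handlers)


def run_pipeline(pipeline, doc, handlers):
    """Fall back to the pipeline-level `on_failure` for unhandled errors."""
    ingest = {}
    try:
        run_processors(pipeline["processors"], doc, ingest, handlers)
    except Exception as error:
        if "on_failure" not in pipeline:
            raise
        failure_ingest = dict(ingest, on_failure_message=str(error))
        run_processors(pipeline["on_failure"], doc, failure_ingest, handlers)
    return doc


def rename(doc, config, ingest):
    if config["field"] not in doc:
        # Hypothetical message text, chosen for this sketch only.
        raise ValueError(f"field [{config['field']}] does not exist")
    doc[config["to"]] = doc.pop(config["field"])


def set_field(doc, config, ingest):
    # Only the one template used in the example above is substituted here.
    message = ingest.get("on_failure_message", "")
    doc[config["field"]] = config["value"].replace(
        "{{ _ingest.on_failure_message }}", message
    )


HANDLERS = {"rename": rename, "set": set_field}
```

Because `run_processors` calls itself for the `on_failure` list, handlers can carry their own `on_failure` configuration, which mirrors the nesting described above. An empty `on_failure` list simply swallows the error and lets the remaining processors run.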
[[ingest-processors]]
== Processors
All processors are defined in the following way within a pipeline definition:
[source,js]
--------------------------------------------------
{
  "PROCESSOR_NAME" : {
    ... processor configuration options ...
  }
}
--------------------------------------------------
Each processor defines its own configuration parameters, but all processors have
the ability to declare `tag` and `on_failure` fields. These fields are optional.

A `tag` is simply a string identifier of the specific instantiation of a certain
processor in a pipeline. The `tag` field does not affect the processor's behavior,
but is very useful for bookkeeping and tracing errors to specific processors.

See <<handling-failure-in-pipelines>> to learn more about the `on_failure` field and error handling in pipelines.

The <<ingest-info,node info API>> can be used to figure out what processors are available in a cluster.
It provides a per-node list of the available processors.

Custom processors must be installed on all nodes. The put pipeline API will fail if a processor specified in a pipeline
doesn't exist on all nodes. If you rely on custom processor plugins, make sure to mark these plugins as mandatory by adding
the `plugin.mandatory` setting to the `config/elasticsearch.yml` file, for example:
[source,yaml]
--------------------------------------------------
plugin.mandatory: ingest-attachment,ingest-geoip
--------------------------------------------------
A node will not start if either of these plugins is not available.
[[append-procesesor]]
=== Append Processor
Appends one or more values to an existing array if the field already exists and it is an array.
Converts a scalar to an array and appends one or more values to it if the field exists and it is a scalar.
Creates an array containing the provided values if the field doesn't exist.
Accepts a single value or an array of values.
[[append-options]]
.Append Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field to be appended to
| `value` | yes | - | The value to be appended
|======
[source,js]
--------------------------------------------------
{
  "append": {
    "field": "field1",
    "value": ["item2", "item3", "item4"]
  }
}
--------------------------------------------------
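The three cases described above (existing array, existing scalar, missing field) can be sketched in Python as follows. The function name `append` and the use of a plain dictionary as the document are assumptions for illustration; this is not the processor's actual implementation.

```python
def append(doc, field, value):
    """Model the append semantics on a plain dictionary."""
    values = value if isinstance(value, list) else [value]
    if field not in doc:
        doc[field] = list(values)            # missing field: create a new array
    elif isinstance(doc[field], list):
        doc[field].extend(values)            # existing array: append in place
    else:
        doc[field] = [doc[field], *values]   # scalar: convert to an array first
    return doc
```

Normalizing a single value into a one-element list up front keeps the three branches symmetric.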
[[convert-processor]]
=== Convert Processor
Converts an existing field's value to a different type, such as converting a string to an integer.
If the field value is an array, all members will be converted.

The supported types include: `integer`, `float`, `string`, and `boolean`.

Specifying `boolean` will set the field to true if its string value is equal to `true` (ignoring case), to
false if its string value is equal to `false` (ignoring case), or it will throw an exception otherwise.
[[convert-options]]
.Convert Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field whose value is to be converted
| `type` | yes | - | The type to convert the existing value to
|======
[source,js]
--------------------------------------------------
{
  "convert": {
    "field" : "foo",
    "type": "integer"
  }
}
--------------------------------------------------
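A minimal Python sketch of the conversion rules described above follows. The function name `convert_value` is hypothetical, and Python's `int` and `float` parsing only approximates the behavior of the real processor.

```python
def convert_value(value, target_type):
    """Convert a value (or every member of a list) to the requested type."""
    if isinstance(value, list):
        return [convert_value(item, target_type) for item in value]
    text = str(value)
    if target_type == "integer":
        return int(text)
    if target_type == "float":
        return float(text)
    if target_type == "string":
        return text
    if target_type == "boolean":
        lowered = text.lower()
        if lowered == "true":
            return True
        if lowered == "false":
            return False
        raise ValueError(f"[{text}] is not a boolean value")
    raise ValueError(f"unsupported type: {target_type}")
```

The `boolean` branch deliberately rejects anything other than `true` and `false`, so values such as `yes` or `1` raise an error instead of being guessed at.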
[[date-processor]]
=== Date Processor
Parses dates from fields, and then uses the date or timestamp as the timestamp for the document.
By default, the date processor adds the parsed date as a new field called `@timestamp`. You can specify a
different field by setting the `target_field` configuration parameter. Multiple date formats are supported
as part of the same date processor definition. They will be used sequentially to attempt parsing the date field,
in the same order they were defined as part of the processor definition.
[[date-options]]
.Date options
[options="header"]
|======
| Name | Required | Default | Description
| `match_field` | yes | - | The field to get the date from.
| `target_field` | no | @timestamp | The field that will hold the parsed date.
| `match_formats` | yes | - | An array of the expected date formats. Can be a Joda pattern or one of the following formats: ISO8601, UNIX, UNIX_MS, or TAI64N.
| `timezone` | no | UTC | The timezone to use when parsing the date.
| `locale` | no | ENGLISH | The locale to use when parsing the date, relevant when parsing month names or week days.
|======
Here is an example that adds the parsed date to the `timestamp` field based on the `initial_date` field:
[source,js]
--------------------------------------------------
{
"description" : "...",
"processors" : [
{
"date" : {
"match_field" : "initial_date",
"target_field" : "timestamp",
"match_formats" : ["dd/MM/yyyy hh:mm:ss"],
"timezone" : "Europe/Amsterdam"
}
}
]
}
--------------------------------------------------
[[fail-processor]]
=== Fail Processor
Raises an exception. This is useful for when
you expect a pipeline to fail and want to relay a specific message
to the requester.
[[fail-options]]
.Fail Options
[options="header"]
|======
| Name | Required | Default | Description
| `message` | yes | - | The error message of the `FailException` thrown by the processor
|======
[source,js]
--------------------------------------------------
{
"fail": {
"message": "an error message"
}
}
--------------------------------------------------
[[foreach-processor]]
=== Foreach Processor
Processes elements in an array of unknown length.

All processors can operate on elements inside an array, but if all elements of an array need to
be processed in the same way, defining a processor for each element becomes cumbersome and tricky
because it is likely that the number of elements in an array is unknown. For this reason the `foreach`
processor exists. By specifying the field holding array elements and a list of processors that
define what should happen to each element, array fields can easily be preprocessed.

Processors inside the foreach processor work in a different context, and the only valid top-level
field is `_value`, which holds the array element value. Under this field other fields may exist.

If the `foreach` processor fails to process an element inside the array, and no `on_failure` processor has been specified,
then it aborts the execution and leaves the array unmodified.
[[foreach-options]]
.Foreach Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The array field
| `processors` | yes | - | The processors
|======
Assume the following document:
[source,js]
--------------------------------------------------
{
"value" : ["foo", "bar", "baz"]
}
--------------------------------------------------
When this `foreach` processor operates on this sample document:
[source,js]
--------------------------------------------------
{
  "foreach" : {
    "field" : "value",
    "processors" : [
      {
        "uppercase" : {
          "field" : "_value"
        }
      }
    ]
  }
}
--------------------------------------------------
Then the document will look like this after preprocessing:
[source,js]
--------------------------------------------------
{
"value" : ["FOO", "BAR", "BAZ"]
}
--------------------------------------------------
Let's take a look at another example:
[source,js]
--------------------------------------------------
{
"persons" : [
{
"id" : "1",
"name" : "John Doe"
},
{
"id" : "2",
"name" : "Jane Doe"
}
]
}
--------------------------------------------------
In this case, the `id` field needs to be removed,
so the following `foreach` processor is used:
[source,js]
--------------------------------------------------
{
"foreach" : {
"field" : "persons",
"processors" : [
{
"remove" : {
"field" : "_value.id"
}
}
]
}
}
--------------------------------------------------
After preprocessing the result is:
[source,js]
--------------------------------------------------
{
"persons" : [
{
"name" : "John Doe"
},
{
"name" : "Jane Doe"
}
]
}
--------------------------------------------------
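The `_value` context used by the two examples above can be modeled in Python as follows. This is a toy sketch rather than the real implementation: the function names and the two handlers are hypothetical, and `on_failure` handling is left out.

```python
import copy


def uppercase(context, config):
    context[config["field"]] = context[config["field"]].upper()


def remove(context, config):
    *parents, leaf = config["field"].split(".")
    target = context
    for key in parents:
        target = target[key]
    del target[leaf]


HANDLERS = {"uppercase": uppercase, "remove": remove}


def foreach(doc, field, processors, handlers=HANDLERS):
    """Apply the processors to every element, exposing it as `_value`."""
    # Work on a copy so that a failure leaves the original array unmodified.
    elements = copy.deepcopy(doc[field])
    for index, element in enumerate(elements):
        context = {"_value": element}   # the only top-level field in this context
        for processor in processors:
            (name, config), = processor.items()
            handlers[name](context, config)
        elements[index] = context["_value"]
    doc[field] = elements
    return doc
```

Writing the result back only after every element has been processed is what keeps the array untouched when one of the inner processors raises an error.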
As for any processor, you can define `on_failure` processors
in processors that are wrapped inside the `foreach` processor.

For example, the `id` field may not exist on all person objects.
Instead of failing the index request, you can use an `on_failure`
block to send the document to the `failure_index` index for later inspection:
[source,js]
--------------------------------------------------
{
  "foreach" : {
    "field" : "persons",
    "processors" : [
      {
        "remove" : {
          "field" : "_value.id",
          "on_failure" : [
            {
              "set" : {
                "field" : "_index",
                "value" : "failure_index"
              }
            }
          ]
        }
      }
    ]
  }
}
--------------------------------------------------
In this example, if the `remove` processor does fail, then
the array elements that have been processed thus far will
be updated.
[[grok-processor]]
=== Grok Processor
Extracts structured fields out of a single text field within a document. You choose which field to
extract matched fields from, as well as the grok pattern you expect will match. A grok pattern is like a regular
expression that supports aliased expressions that can be reused.

This tool is perfect for syslog logs, Apache and other web server logs, MySQL logs, and in general, any log format
that is written for humans rather than for computer consumption.

The processor comes packaged with over 120 reusable patterns that are located at `$ES_HOME/config/ingest/grok/patterns`.
Here, you can add your own custom grok pattern files with custom grok expressions to be used by the processor.
If you need help building patterns to match your logs, you will find the <http://grokdebug.herokuapp.com> and
<http://grokconstructor.appspot.com/> applications quite useful!
[[grok-basics]]
==== Grok Basics
Grok sits on top of regular expressions, so any regular expressions are valid in grok as well.
The regular expression library is Oniguruma, and you can see the full supported regexp syntax
https://github.com/kkos/oniguruma/blob/master/doc/RE[on the Oniguruma site].

Grok works by leveraging this regular expression language to allow naming existing patterns and combining them into more
complex patterns that match your fields.

The syntax for reusing a grok pattern comes in three forms: `%{SYNTAX:SEMANTIC}`, `%{SYNTAX}`, `%{SYNTAX:SEMANTIC:TYPE}`.

The `SYNTAX` is the name of the pattern that will match your text. For example, `3.44` will be matched by the `NUMBER`
pattern and `55.3.244.1` will be matched by the `IP` pattern. The syntax is how you match. `NUMBER` and `IP` are both
patterns that are provided within the default patterns set.

The `SEMANTIC` is the identifier you give to the piece of text being matched. For example, `3.44` could be the
duration of an event, so you could call it simply `duration`. Further, a string `55.3.244.1` might identify
the `client` making a request.

The `TYPE` is the type you wish to cast your named field to. `int` and `float` are currently the only types supported for coercion.
For example, you might want to match the following text:
[source,js]
--------------------------------------------------
3.44 55.3.244.1
--------------------------------------------------
You may know that the message in the example is a number followed by an IP address. You can match this text by using the following
grok expression:
[source,js]
--------------------------------------------------
%{NUMBER:duration} %{IP:client}
--------------------------------------------------
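To illustrate how such an expression could be expanded into a regular expression with named captures, here is a small Python sketch. It uses Python's `re` module instead of Oniguruma, and the two entries in `PATTERNS` are simplified stand-ins for the bundled `NUMBER` and `IP` patterns; both choices, as well as the function name `grok_match`, are assumptions made for illustration.

```python
import re

# Simplified stand-ins for two of the bundled patterns (assumptions).
PATTERNS = {
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
}
# Matches %{SYNTAX}, %{SYNTAX:SEMANTIC} and %{SYNTAX:SEMANTIC:TYPE}.
_REFERENCE = re.compile(r"%\{(\w+)(?::(\w+))?(?::(\w+))?\}")
_CASTS = {"int": int, "float": float}


def grok_match(expression, text, patterns=PATTERNS):
    """Expand a grok expression and return the named captures, or None."""
    casts = {}

    def expand(match):
        syntax, semantic, type_name = match.groups()
        # Pattern definitions may themselves reference other patterns.
        body = _REFERENCE.sub(expand, patterns[syntax])
        if semantic is None:
            return f"(?:{body})"
        if type_name is not None:
            casts[semantic] = _CASTS[type_name]
        return f"(?P<{semantic}>{body})"

    found = re.search(_REFERENCE.sub(expand, expression), text)
    if found is None:
        return None
    return {
        name: casts.get(name, str)(value)
        for name, value in found.groupdict().items()
        if value is not None
    }
```

Captures without a `TYPE` stay strings in this sketch, while `int` and `float` are applied after matching.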
[[custom-patterns]]
==== Custom Patterns and Pattern Files
The Grok processor comes pre-packaged with a base set of pattern files. These patterns may not always have
what you are looking for. Pattern files have a very basic format. Each line describes a named pattern with
the following format:
[source,js]
--------------------------------------------------
NAME ' '+ PATTERN '\n'
--------------------------------------------------
You can add new patterns to an existing file, or add your own file in the patterns directory here: `$ES_HOME/config/ingest/grok/patterns`.
Ingest node picks up files in this directory and loads the patterns into the grok processor's known patterns.
These patterns are loaded at startup, so you need to restart your ingest node if you want to update these files.
Here is an example snippet of pattern definitions found in the `grok-patterns` patterns file:
[source,js]
--------------------------------------------------
YEAR (?>\d\d){1,2}
HOUR (?:2[0123]|[01]?[0-9])
MINUTE (?:[0-5][0-9])
SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
--------------------------------------------------
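Reading a file in this format takes only a few lines. The following Python sketch parses pattern definitions into a dictionary. The function name `parse_pattern_file` is hypothetical, and skipping blank lines and lines starting with `#` is an assumption of this sketch rather than something stated above.

```python
def parse_pattern_file(text):
    """Parse `NAME PATTERN` lines into a dictionary of pattern definitions."""
    patterns = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are ignored
        # Split on the first space only, because patterns may contain spaces.
        name, pattern = line.split(" ", 1)
        patterns[name] = pattern.lstrip(" ")
    return patterns
```

Splitting on the first space only and then stripping the remaining leading spaces follows the `NAME ' '+ PATTERN` layout.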
[[using-grok]]
==== Using the Grok Processor in a Pipeline
[[grok-options]]
.Grok Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field to use for grok expression parsing
| `pattern` | yes | - | The grok expression to match and extract named captures with
| `pattern_definitions` | no | - | A map of pattern-name and pattern tuples defining custom patterns to be used by the current processor. Patterns matching existing names will override the pre-existing definition.
|======
Here is an example of using the provided patterns to extract out and name structured fields from a string field in
a document.
[source,js]
--------------------------------------------------
{
  "message": "55.3.244.1 GET /index.html 15824 0.043"
}
--------------------------------------------------
The pattern for this could be:
[source,js]
--------------------------------------------------
%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}
--------------------------------------------------
Here is an example pipeline for processing the above document by using Grok:
[source,js]
--------------------------------------------------
{
  "description" : "...",
  "processors": [
    {
      "grok": {
        "field": "message",
        "pattern": "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"
      }
    }
  ]
}
--------------------------------------------------
This pipeline will insert these named captures as new fields within the document, like so:
[source,js]
--------------------------------------------------
{
  "message": "55.3.244.1 GET /index.html 15824 0.043",
  "client": "55.3.244.1",
  "method": "GET",
  "request": "/index.html",
  "bytes": "15824",
  "duration": "0.043"
}
--------------------------------------------------
Here is an example of a pipeline specifying custom pattern definitions:
[source,js]
--------------------------------------------------
{
  "description" : "...",
  "processors": [
    {
      "grok": {
        "field": "message",
        "pattern": "my %{FAVORITE_DOG:dog} is colored %{RGB:color}",
        "pattern_definitions" : {
          "FAVORITE_DOG" : "beagle",
          "RGB" : "RED|GREEN|BLUE"
        }
      }
    }
  ]
}
--------------------------------------------------
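As an illustration of the custom definitions above, consider an incoming document like the following (the message text is made up for this example):
[source,js]
--------------------------------------------------
{
  "message": "my beagle is colored RED"
}
--------------------------------------------------
Because `FAVORITE_DOG` matches `beagle` and `RGB` matches `RED`, the named captures `dog` and `color` would be added to the document:
[source,js]
--------------------------------------------------
{
  "message": "my beagle is colored RED",
  "dog": "beagle",
  "color": "RED"
}
--------------------------------------------------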
[[gsub-processor]]
=== Gsub Processor
Converts a string field by applying a regular expression and a replacement.
If the field is not a string, the processor will throw an exception.
[[gsub-options]]
.Gsub Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field to apply the replacement to
| `pattern` | yes | - | The pattern to be replaced
| `replacement` | yes | - | The string to replace the matching patterns with
|======
[source,js]
--------------------------------------------------
{
  "gsub": {
    "field": "field1",
    "pattern": "\\.",
    "replacement": "-"
  }
}
--------------------------------------------------
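To illustrate the effect of replacing every literal dot with a hyphen, consider a document whose `field1` holds a dotted string (the sample value is made up for this example):
[source,js]
--------------------------------------------------
{
  "field1": "192.168.0.1"
}
--------------------------------------------------
After the processor runs, each match of the pattern has been replaced:
[source,js]
--------------------------------------------------
{
  "field1": "192-168-0-1"
}
--------------------------------------------------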
[[join-processor]]
=== Join Processor
Joins each element of an array into a single string using a separator character between each element.
Throws an error when the field is not an array.
[[join-options]]
.Join Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The array field whose elements are to be joined
| `separator` | yes | - | The separator character
|======
[source,js]
--------------------------------------------------
{
  "join": {
    "field": "joined_array_field",
    "separator": "-"
  }
}
--------------------------------------------------
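As a sketch of the result, assume an incoming document in which `joined_array_field` is an array of strings (the values are made up for this example):
[source,js]
--------------------------------------------------
{
  "joined_array_field": ["a", "b", "c"]
}
--------------------------------------------------
With the `-` separator, the array is replaced by a single string:
[source,js]
--------------------------------------------------
{
  "joined_array_field": "a-b-c"
}
--------------------------------------------------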
[[lowercase-processor]]
=== Lowercase Processor
Converts a string to its lowercase equivalent.
[[lowercase-options]]
.Lowercase Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field to make lowercase
|======
[source,js]
--------------------------------------------------
{
  "lowercase": {
    "field": "foo"
  }
}
--------------------------------------------------
[[remove-processor]]
=== Remove Processor
Removes an existing field. If the field doesn't exist, an exception will be thrown.
[[remove-options]]
.Remove Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field to be removed
|======
[source,js]
--------------------------------------------------
{
  "remove": {
    "field": "foo"
  }
}
--------------------------------------------------
[[rename-processor]]
=== Rename Processor
Renames an existing field. If the field doesn't exist or the new name is already used, an exception will be thrown.
[[rename-options]]
.Rename Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field to be renamed
| `to` | yes | - | The new name of the field
|======
[source,js]
--------------------------------------------------
{
  "rename": {
    "field": "foo",
    "to": "foobar"
  }
}
--------------------------------------------------
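For illustration, assume an incoming document that contains `foo` and does not yet contain `foobar` (the value is made up for this example):
[source,js]
--------------------------------------------------
{
  "foo": "bar"
}
--------------------------------------------------
After the processor runs, the value is kept under the new name and the old field is gone:
[source,js]
--------------------------------------------------
{
  "foobar": "bar"
}
--------------------------------------------------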
[[set-processor]]
=== Set Processor
Sets one field and associates it with the specified value. If the field already exists,
its value will be replaced with the provided one.
[[set-options]]
.Set Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field to insert, upsert, or update
| `value` | yes | - | The value to be set for the field
|======
[source,js]
--------------------------------------------------
{
  "set": {
    "field": "field1",
    "value": 582.1
  }
}
--------------------------------------------------
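As a sketch of the replace behavior, assume an incoming document in which `field1` already has a value (`field2` and both values are made up for this example):
[source,js]
--------------------------------------------------
{
  "field1": "old value",
  "field2": 1
}
--------------------------------------------------
After the processor runs, `field1` holds the configured value and other fields are left untouched:
[source,js]
--------------------------------------------------
{
  "field1": 582.1,
  "field2": 1
}
--------------------------------------------------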
[[split-processor]]
=== Split Processor
Splits a field into an array using a separator character. Only works on string fields.
[[split-options]]
.Split Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field to split
| `separator` | yes | - | The separator character to split the field on
|======
[source,js]
--------------------------------------------------
{
  "split": {
    "field": "my_field",
    "separator": ","
  }
}
--------------------------------------------------
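As a sketch of the effect, assume a processor configured to split a hypothetical `tags` field on commas (the `separator` setting used here is an assumption based on the separator mentioned in the description above):
[source,js]
--------------------------------------------------
{
  "split": {
    "field": "tags",
    "separator": ","
  }
}
--------------------------------------------------
Given an incoming document with `"tags": "a,b,c"`, the string would be replaced by an array:
[source,js]
--------------------------------------------------
{
  "tags": ["a", "b", "c"]
}
--------------------------------------------------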
[[trim-processor]]
=== Trim Processor
Trims whitespace from a field.
NOTE: This only works on leading and trailing whitespace.
[[trim-options]]
.Trim Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The string-valued field to trim whitespace from
|======
[source,js]
--------------------------------------------------
{
  "trim": {
    "field": "foo"
  }
}
--------------------------------------------------
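To illustrate that only leading and trailing whitespace is affected, assume an incoming document like the following (the value is made up for this example):
[source,js]
--------------------------------------------------
{
  "foo": "  bar baz  "
}
--------------------------------------------------
After the processor runs, the surrounding spaces are gone while the space inside the string remains:
[source,js]
--------------------------------------------------
{
  "foo": "bar baz"
}
--------------------------------------------------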
[[uppercase-processor]]
=== Uppercase Processor
Converts a string to its uppercase equivalent.
[[uppercase-options]]
.Uppercase Options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field to make uppercase
|======
[source,js]
--------------------------------------------------
{
  "uppercase": {
    "field": "foo"
  }
}
--------------------------------------------------