[[ingest-attachment]]
=== Ingest Attachment Processor Plugin

The ingest attachment plugin lets Elasticsearch extract text and metadata from file attachments in common formats (such as PPT, XLS, and PDF)
by using the Apache text extraction library https://tika.apache.org/[Tika].

You can use the ingest attachment plugin as a replacement for the mapper attachment plugin.

The source field must be a base64-encoded binary. If you do not want to incur
the overhead of converting to and from base64, you can use the CBOR
format instead of JSON and specify the field as a bytes array instead of a string
representation. The processor then skips the base64 decoding.

:plugin_name: ingest-attachment
include::install_remove.asciidoc[]

[[using-ingest-attachment]]
==== Using the Attachment Processor in a Pipeline

[[ingest-attachment-options]]
.Attachment options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field that contains the base64-encoded binary
| `target_field` | no | attachment | The field that will hold the attachment information
| `indexed_chars` | no | 100000 | The maximum number of chars used for extraction, to prevent huge fields. Use `-1` for no limit.
| `indexed_chars_field` | no | `null` | Name of a document field whose value overrides the number of chars used for extraction. See `indexed_chars`.
| `properties` | no | all properties | Array of properties to store. Can be `content`, `title`, `name`, `author`, `keywords`, `date`, `content_type`, `content_length`, `language`
| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document
|======

For example, this:

[source,console]
--------------------------------------------------
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my-index-00001/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my-index-00001/_doc/my_id
--------------------------------------------------

Returns this:

[source,console-result]
--------------------------------------------------
{
  "found": true,
  "_index": "my-index-00001",
  "_type": "_doc",
  "_id": "my_id",
  "_version": 1,
  "_seq_no": 22,
  "_primary_term": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]

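The `data` value in the request above is the base64 encoding of a small RTF document. As a minimal sketch, assuming a client-side Python script (the helper name `encode_attachment` is hypothetical and not part of the plugin), such a value could be produced like this:

[source,python]
----
import base64


def encode_attachment(raw: bytes) -> str:
    """Return the base64 text to place in the processor's source field."""
    return base64.b64encode(raw).decode("ascii")


# The RTF payload behind the `data` value in the example request.
rtf = b"{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }"

# Request body for the indexing call; sending it is left to the HTTP client.
doc = {"data": encode_attachment(rtf)}
----

The resulting `doc` can be sent as the JSON body of the indexing request with any HTTP client.
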
To specify only some fields to be extracted:

[source,console]
--------------------------------------------------
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "content", "title" ]
      }
    }
  ]
}
--------------------------------------------------

NOTE: Extracting contents from binary data is a resource-intensive operation.
It is highly recommended to run pipelines
that use this processor on a dedicated ingest node.

[[ingest-attachment-cbor]]
==== Use the attachment processor with CBOR

To avoid base64-encoding binary data in JSON and decoding it again, you can instead pass CBOR data to
the attachment processor. For example, the following request creates the
`cbor-attachment` pipeline, which uses the attachment processor.

[source,console]
----
PUT _ingest/pipeline/cbor-attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
----

The following Python script passes CBOR data to an HTTP indexing request that
includes the `cbor-attachment` pipeline. The HTTP request headers use
a `content-type` of `application/cbor`.

NOTE: Not all {es} clients support custom HTTP request headers.

[source,python]
----
import cbor2
import requests

file = 'my-file'
headers = {'content-type': 'application/cbor'}

# Read the raw bytes; CBOR carries them as-is, so no base64 step is needed.
with open(file, 'rb') as f:
    doc = {
        'data': f.read()
    }
    # Index the CBOR-encoded document through the cbor-attachment pipeline.
    requests.put(
        'http://localhost:9200/my-index-000001/_doc/my_id?pipeline=cbor-attachment',
        data=cbor2.dumps(doc),
        headers=headers
    )
----

[[ingest-attachment-extracted-chars]]
==== Limit the number of extracted chars

To prevent extracting too many chars and overloading the node's memory, the number of chars used for extraction
is limited by default to `100000`. You can change this value by setting `indexed_chars`. Use `-1` for no limit, but
make sure that your node has enough heap to extract the content of very big documents.

You can also define this limit per document by reading it from a field of the document. If the document
has that field, its value overrides the `indexed_chars` setting. To name this field, define the `indexed_chars_field`
setting.

For example:

[source,console]
--------------------------------------------------
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : 11,
        "indexed_chars_field" : "max_size"
      }
    }
  ]
}
PUT my-index-00001/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my-index-00001/_doc/my_id
--------------------------------------------------

Returns this:

[source,console-result]
--------------------------------------------------
{
  "found": true,
  "_index": "my-index-00001",
  "_type": "_doc",
  "_id": "my_id",
  "_version": 1,
  "_seq_no": 35,
  "_primary_term": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "content_type": "application/rtf",
      "language": "sl",
      "content": "Lorem ipsum",
      "content_length": 11
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]

If the document sets the `max_size` field, that value overrides `indexed_chars`:

[source,console]
--------------------------------------------------
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : 11,
        "indexed_chars_field" : "max_size"
      }
    }
  ]
}
PUT my-index-00001/_doc/my_id_2?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "max_size": 5
}
GET my-index-00001/_doc/my_id_2
--------------------------------------------------

Returns this:

[source,console-result]
--------------------------------------------------
{
  "found": true,
  "_index": "my-index-00001",
  "_type": "_doc",
  "_id": "my_id_2",
  "_version": 1,
  "_seq_no": 40,
  "_primary_term": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "max_size": 5,
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem",
      "content_length": 5
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]

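The two examples in this section can be read as a simple precedence rule: a per-document value in the field named by `indexed_chars_field` wins over the pipeline-level `indexed_chars`. The following Python sketch only illustrates that documented behavior. It is not the plugin's implementation, and `effective_limit` and `truncate` are hypothetical helper names.

[source,python]
----
def effective_limit(doc, indexed_chars=100000, indexed_chars_field=None):
    """Pick the character limit that applies to one document."""
    if indexed_chars_field is not None and indexed_chars_field in doc:
        # The per-document value overrides the pipeline-level setting.
        return int(doc[indexed_chars_field])
    return indexed_chars


def truncate(content, limit):
    """Cut content to the limit; a limit of -1 means no limit."""
    return content if limit < 0 else content[:limit]


text = "Lorem ipsum dolor sit amet"

# No max_size in the document: the pipeline-level limit of 11 applies.
first = truncate(text, effective_limit({}, 11, "max_size"))

# max_size is 5 in the document: it overrides the pipeline-level limit.
second = truncate(text, effective_limit({"max_size": 5}, 11, "max_size"))
----

Here `first` corresponds to the `content` of `my_id` and `second` to the `content` of `my_id_2`.
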
[[ingest-attachment-with-arrays]]
==== Using the Attachment Processor with arrays

To run the attachment processor on an array of attachments, the
{ref}/foreach-processor.html[foreach processor] is required. It
runs the attachment processor on each individual element
of the array.

For example, given the following source:

[source,js]
--------------------------------------------------
{
  "attachments" : [
    {
      "filename" : "ipsum.txt",
      "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
    },
    {
      "filename" : "test.txt",
      "data" : "VGhpcyBpcyBhIHRlc3QK"
    }
  ]
}
--------------------------------------------------
// NOTCONSOLE

In this case, we want to process the `data` field in each element
of the `attachments` field and insert
the extracted properties into the document, so the following `foreach`
processor is used:

[source,console]
--------------------------------------------------
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information from arrays",
  "processors" : [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.attachment",
            "field": "_ingest._value.data"
          }
        }
      }
    }
  ]
}
PUT my-index-00001/_doc/my_id?pipeline=attachment
{
  "attachments" : [
    {
      "filename" : "ipsum.txt",
      "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
    },
    {
      "filename" : "test.txt",
      "data" : "VGhpcyBpcyBhIHRlc3QK"
    }
  ]
}
GET my-index-00001/_doc/my_id
--------------------------------------------------

Returns this:

[source,console-result]
--------------------------------------------------
{
  "_index" : "my-index-00001",
  "_type" : "_doc",
  "_id" : "my_id",
  "_version" : 1,
  "_seq_no" : 50,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "attachments" : [
      {
        "filename" : "ipsum.txt",
        "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
        "attachment" : {
          "content_type" : "text/plain; charset=ISO-8859-1",
          "language" : "en",
          "content" : "this is\njust some text",
          "content_length" : 24
        }
      },
      {
        "filename" : "test.txt",
        "data" : "VGhpcyBpcyBhIHRlc3QK",
        "attachment" : {
          "content_type" : "text/plain; charset=ISO-8859-1",
          "language" : "en",
          "content" : "This is a test",
          "content_length" : 16
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"_seq_no" : \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]

Note that the `target_field` needs to be set; otherwise the
default value is used, which is the top-level field `attachment`. The
properties on this top-level field will contain the values of the
first attachment only. Setting
`target_field` to a field on `_ingest._value` instead
associates the extracted properties with the correct attachment.

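A source document like the one at the start of this section could be assembled on the client side as in the following sketch. The helper name `build_attachments` and the in-memory file contents are assumptions made for illustration; they are not part of the plugin.

[source,python]
----
import base64


def build_attachments(files):
    """Map {filename: raw bytes} to the `attachments` array structure."""
    return {
        "attachments": [
            {"filename": name, "data": base64.b64encode(raw).decode("ascii")}
            for name, raw in files.items()
        ]
    }


# One entry per file; each element is later handled by the foreach processor.
doc = build_attachments({"test.txt": b"This is a test\n"})
----

Each element keeps its own `filename` and `data`, so the `foreach` processor can write the extracted properties next to them.
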