Merge branch 'docs/mapper-attachments'
commit
5b0e2823b1
|
@ -1,25 +1,46 @@
|
|||
Mapper Attachments Type for Elasticsearch
|
||||
=========================================
|
||||
[[mapper-attachments]]
|
||||
=== Mapper Attachments Plugin
|
||||
|
||||
The mapper attachments plugin lets Elasticsearch index file attachments in common formats (such as PPT, XLS, PDF) using the Apache text extraction library [Tika](http://lucene.apache.org/tika/).
|
||||
The mapper attachments plugin lets Elasticsearch index file attachments in common formats (such as PPT, XLS, PDF)
|
||||
using the Apache text extraction library http://lucene.apache.org/tika/[Tika].
|
||||
|
||||
In practice, the plugin adds the `attachment` type when mapping properties so that documents can be populated with file attachment contents (encoded as `base64`).
|
||||
In practice, the plugin adds the `attachment` type when mapping properties so that documents can be populated with
|
||||
file attachment contents (encoded as `base64`).
|
||||
|
||||
Installation
|
||||
------------
|
||||
[[mapper-attachments-install]]
|
||||
[float]
|
||||
==== Installation
|
||||
|
||||
In order to install the plugin, run:
|
||||
This plugin can be installed using the plugin manager:
|
||||
|
||||
```sh
|
||||
bin/plugin install mapper-attachments
|
||||
```
|
||||
[source,sh]
|
||||
----------------------------------------------------------------
|
||||
sudo bin/plugin install mapper-attachments
|
||||
----------------------------------------------------------------
|
||||
|
||||
Hello, world
|
||||
------------
|
||||
The plugin must be installed on every node in the cluster, and each node must
|
||||
be restarted after installation.
|
||||
|
||||
[[mapper-attachments-remove]]
|
||||
[float]
|
||||
==== Removal
|
||||
|
||||
The plugin can be removed with the following command:
|
||||
|
||||
[source,sh]
|
||||
----------------------------------------------------------------
|
||||
sudo bin/plugin remove mapper-attachments
|
||||
----------------------------------------------------------------
|
||||
|
||||
The node must be stopped before removing the plugin.
|
||||
|
||||
[[mapper-attachments-helloworld]]
|
||||
==== Hello, world
|
||||
|
||||
Create a property mapping using the new type `attachment`:
|
||||
|
||||
```javascript
|
||||
[source,js]
|
||||
--------------------------
|
||||
POST /trying-out-mapper-attachments
|
||||
{
|
||||
"mappings": {
|
||||
|
@ -27,36 +48,42 @@ POST /trying-out-mapper-attachments
|
|||
"properties": {
|
||||
"cv": { "type": "attachment" }
|
||||
}}}}
|
||||
```
|
||||
--------------------------
|
||||
// AUTOSENSE
|
||||
|
||||
Index a new document populated with a `base64`-encoded attachment:
|
||||
|
||||
```javascript
|
||||
[source,js]
|
||||
--------------------------
|
||||
POST /trying-out-mapper-attachments/person/1
|
||||
{
|
||||
"cv": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
|
||||
}
|
||||
```
|
||||
--------------------------
|
||||
// AUTOSENSE
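
The `cv` value in the request above is the Base64 encoding of the raw file bytes. A minimal Python sketch of producing such a value follows. The helper name `encode_attachment` is hypothetical and not part of the plugin; the sample bytes are the small RTF document that the string above decodes to.

```python
import base64
import json


def encode_attachment(raw: bytes) -> str:
    """Return the Base64 text form of raw file bytes for an `attachment` field."""
    return base64.b64encode(raw).decode("ascii")


# A tiny RTF document; the sample `cv` value above decodes to exactly these bytes.
rtf = b"{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }"

# JSON body for: POST /trying-out-mapper-attachments/person/1
body = json.dumps({"cv": encode_attachment(rtf)})
print(body)
```

For a real file, the bytes would typically be read in binary mode (for example `open(path, "rb").read()`) before encoding.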
|
||||
|
||||
Search for the document using words in the attachment:
|
||||
|
||||
```javascript
|
||||
[source,js]
|
||||
--------------------------
|
||||
POST /trying-out-mapper-attachments/person/_search
|
||||
{
|
||||
"query": {
|
||||
"query_string": {
|
||||
"query": "ipsum"
|
||||
}}}
|
||||
```
|
||||
--------------------------
|
||||
// AUTOSENSE
|
||||
|
||||
If you get a hit for your indexed document, the plugin should be installed and working.
|
||||
|
||||
Usage
|
||||
------------------------
|
||||
[[mapper-attachments-usage]]
|
||||
==== Usage
|
||||
|
||||
Using the attachment type is simple: in your mapping JSON, set the type of a JSON element to `attachment`. For example:
|
||||
|
||||
```javascript
|
||||
[source,js]
|
||||
--------------------------
|
||||
PUT /test
|
||||
PUT /test/person/_mapping
|
||||
{
|
||||
|
@ -66,20 +93,24 @@ PUT /test/person/_mapping
|
|||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
--------------------------
|
||||
// AUTOSENSE
|
||||
|
||||
In this case, the JSON to index can be:
|
||||
|
||||
```javascript
|
||||
[source,js]
|
||||
--------------------------
|
||||
PUT /test/person/1
|
||||
{
|
||||
"my_attachment" : "... base64 encoded attachment ..."
|
||||
}
|
||||
```
|
||||
--------------------------
|
||||
// AUTOSENSE
|
||||
|
||||
Alternatively, use a more elaborate JSON object if the content type, resource name, or language needs to be set explicitly:
|
||||
|
||||
```
|
||||
[source,js]
|
||||
--------------------------
|
||||
PUT /test/person/1
|
||||
{
|
||||
"my_attachment" : {
|
||||
|
@ -89,9 +120,10 @@ PUT /test/person/1
|
|||
"_content" : "... base64 encoded attachment ..."
|
||||
}
|
||||
}
|
||||
```
|
||||
--------------------------
|
||||
// AUTOSENSE
|
||||
|
||||
The `attachment` type not only indexes the content of the doc in `content` sub field, but also automatically adds meta
|
||||
The `attachment` type not only indexes the content of the document in the `content` sub-field, but also automatically adds
|
||||
metadata on the attachment (when available).
|
||||
|
||||
The metadata supported are:
|
||||
|
@ -107,10 +139,11 @@ The metadata supported are:
|
|||
|
||||
They can be queried using the "dot notation", for example: `my_attachment.author`.
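
As a sketch of the dot notation, the Python below builds a search body that targets a metadata sub-field. The helper `metadata_query` is hypothetical, and both the choice of a `match` query and the sample author value are assumptions made for illustration only.

```python
import json


def metadata_query(attachment_field: str, metadata_field: str, text: str) -> dict:
    """Build a search body matching `text` against <attachment_field>.<metadata_field>."""
    # Dot notation: attachment field name, a dot, then the metadata field name.
    field = f"{attachment_field}.{metadata_field}"
    return {"query": {"match": {field: text}}}


# JSON body for: GET /test/person/_search (the author value is illustrative)
print(json.dumps(metadata_query("my_attachment", "author", "kimchy")))
```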
|
||||
|
||||
Both the meta data and the actual content are simple core type mappers (string, date, ...), thus, they can be controlled
|
||||
Both the meta data and the actual content are simple core type mappers (string, date, …), thus, they can be controlled
|
||||
in the mappings. For example:
|
||||
|
||||
```javascript
|
||||
[source,js]
|
||||
--------------------------
|
||||
PUT /test/person/_mapping
|
||||
{
|
||||
"person" : {
|
||||
|
@ -131,19 +164,21 @@ PUT /test/person/_mapping
|
|||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
--------------------------
|
||||
// AUTOSENSE
|
||||
|
||||
In the above example, the actual content indexed is mapped under `fields` name `content`, and we decide not to index it, so
|
||||
it will only be available in the `_all` field. The other fields map to their respective metadata names, but there is no
|
||||
need to specify the `type` (like `string` or `date`) since it is already known.
|
||||
|
||||
Copy To feature
|
||||
---------------
|
||||
[[mapper-attachments-copy-to]]
|
||||
==== Copy To feature
|
||||
|
||||
If you want to use [copy_to](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#copy-to)
|
||||
If you want to use http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#copy-to[copy_to]
|
||||
feature, you need to define it on each sub-field you want to copy to another field:
|
||||
|
||||
```javascript
|
||||
[source,js]
|
||||
--------------------------
|
||||
PUT /test/person/_mapping
|
||||
{
|
||||
"person": {
|
||||
|
@ -163,16 +198,18 @@ PUT /test/person/_mapping
|
|||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
--------------------------
|
||||
// AUTOSENSE
|
||||
|
||||
In this example, the extracted content is also copied to the `copy` field.
|
||||
|
||||
Querying or accessing metadata
|
||||
------------------------------
|
||||
[[mapper-attachments-querying-metadata]]
|
||||
==== Querying or accessing metadata
|
||||
|
||||
If you need to query on metadata fields, use the attachment field name, followed by a dot and the metadata field name. For example:
|
||||
|
||||
```
|
||||
[source,js]
|
||||
--------------------------
|
||||
DELETE /test
|
||||
PUT /test
|
||||
PUT /test/person/_mapping
|
||||
|
@ -204,11 +241,13 @@ GET /test/person/_search
|
|||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
--------------------------
|
||||
// AUTOSENSE
|
||||
|
||||
This returns:
|
||||
|
||||
```
|
||||
[source,js]
|
||||
--------------------------
|
||||
{
|
||||
"took": 2,
|
||||
"timed_out": false,
|
||||
|
@ -235,17 +274,18 @@ Will give you:
|
|||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
--------------------------
|
||||
|
||||
Indexed Characters
|
||||
------------------
|
||||
[[mapper-attachments-indexed-characters]]
|
||||
==== Indexed Characters
|
||||
|
||||
By default, `100000` characters are extracted when indexing the content. This default value can be changed by setting
|
||||
the `index.mapping.attachment.indexed_chars` setting. It can also be set per document at index time using the
|
||||
`_indexed_chars` parameter. Set it to `-1` to extract all text, but note that the whole text must then fit
|
||||
in memory:
|
||||
|
||||
```
|
||||
[source,js]
|
||||
--------------------------
|
||||
PUT /test/person/1
|
||||
{
|
||||
"my_attachment" : {
|
||||
|
@ -253,18 +293,19 @@ PUT /test/person/1
|
|||
"_content" : "... base64 encoded attachment ..."
|
||||
}
|
||||
}
|
||||
```
|
||||
--------------------------
|
||||
// AUTOSENSE
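
A sketch of building such a per-document body in Python. The helper `attachment_body` is hypothetical; only the `_indexed_chars` and `_content` keys and the `my_attachment` field name come from the text above, and the validation rule (accept `-1` or a non-negative integer) is an assumption.

```python
import base64
from typing import Optional


def attachment_body(raw: bytes, indexed_chars: Optional[int] = None) -> dict:
    """Build an index request body with an optional per-document extraction limit."""
    attachment = {"_content": base64.b64encode(raw).decode("ascii")}
    if indexed_chars is not None:
        # Assumption: -1 means "extract all text"; other negative values are rejected.
        if indexed_chars < -1:
            raise ValueError("indexed_chars must be -1 (all text) or >= 0")
        attachment["_indexed_chars"] = indexed_chars
    return {"my_attachment": attachment}


# JSON-serializable body for: PUT /test/person/1
print(attachment_body(b"hello", indexed_chars=-1))
```

Leaving `indexed_chars` unset omits the key, so the index-level default applies.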
|
||||
|
||||
Metadata parsing error handling
|
||||
-------------------------------
|
||||
[[mapper-attachments-error-handling]]
|
||||
==== Metadata parsing error handling
|
||||
|
||||
Errors can occur while extracting metadata, for example when parsing dates.
|
||||
By default, parsing errors are ignored so that your document is still indexed.
|
||||
|
||||
You can disable this feature by setting the `index.mapping.attachment.ignore_errors` setting to `false`.
|
||||
|
||||
Language Detection
|
||||
------------------
|
||||
[[mapper-attachments-language-detection]]
|
||||
==== Language Detection
|
||||
|
||||
By default, language detection is disabled (`false`) as it could come with a cost.
|
||||
This default value can be changed by setting the `index.mapping.attachment.detect_language` setting.
|
||||
|
@ -272,22 +313,24 @@ It can also be provided on a per document indexed using the `_detect_language` p
|
|||
|
||||
Note that you can force the language using the `_language` field when sending your actual document:
|
||||
|
||||
```javascript
|
||||
[source,js]
|
||||
--------------------------
|
||||
{
|
||||
"my_attachment" : {
|
||||
"_language" : "en",
|
||||
"_content" : "... base64 encoded attachment ..."
|
||||
}
|
||||
}
|
||||
```
|
||||
--------------------------
|
||||
|
||||
Highlighting attachments
|
||||
------------------------
|
||||
[[mapper-attachments-highlighting]]
|
||||
==== Highlighting attachments
|
||||
|
||||
If you want to highlight your attachment content, you will need to set `"store": true` and `"term_vector":"with_positions_offsets"`
|
||||
for your attachment field. Here is a full script which does it:
|
||||
If you want to highlight your attachment content, you will need to set `"store": true` and
|
||||
`"term_vector":"with_positions_offsets"` for your attachment field. Here is a full script which does it:
|
||||
|
||||
```
|
||||
[source,js]
|
||||
--------------------------
|
||||
DELETE /test
|
||||
PUT /test
|
||||
PUT /test/person/_mapping
|
||||
|
@ -326,11 +369,13 @@ GET /test/person/_search
|
|||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
--------------------------
|
||||
// AUTOSENSE
|
||||
|
||||
It gives back:
|
||||
|
||||
```js
|
||||
[source,js]
|
||||
--------------------------
|
||||
{
|
||||
"took": 9,
|
||||
"timed_out": false,
|
||||
|
@ -357,29 +402,31 @@ It gives back:
|
|||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
--------------------------
|
||||
|
||||
Stand alone runner
|
||||
------------------
|
||||
[[mapper-attachments-standalone]]
|
||||
==== Stand alone runner
|
||||
|
||||
If you want to run some tests within your IDE, you can use the `StandaloneRunner` class.
|
||||
It accepts the following arguments:
|
||||
|
||||
* `-u file://URL/TO/YOUR/DOC`
|
||||
* `--size` set extracted size (default to mapper attachment size)
|
||||
* `BASE64` encoded binary
|
||||
* `-u file://URL/TO/YOUR/DOC`
|
||||
* `--size` set extracted size (default to mapper attachment size)
|
||||
* `BASE64` encoded binary
|
||||
|
||||
Example:
|
||||
|
||||
```sh
|
||||
[source,sh]
|
||||
--------------------------
|
||||
StandaloneRunner BASE64Text
|
||||
StandaloneRunner -u /tmp/mydoc.pdf
|
||||
StandaloneRunner -u /tmp/mydoc.pdf --size 1000000
|
||||
```
|
||||
--------------------------
|
||||
|
||||
It produces something like:
|
||||
|
||||
```
|
||||
[source,text]
|
||||
--------------------------
|
||||
## Extracted text
|
||||
--------------------- BEGIN -----------------------
|
||||
This is the extracted text
|
||||
|
@ -393,4 +440,4 @@ This is the extracted text
|
|||
- language: null
|
||||
- name: null
|
||||
- title: null
|
||||
```
|
||||
--------------------------
|
|
@ -8,11 +8,10 @@ Mapper plugins allow new field datatypes to be added to Elasticsearch.
|
|||
|
||||
The core mapper plugins are:
|
||||
|
||||
https://github.com/elasticsearch/elasticsearch-mapper-attachments[Mapper Attachments Type plugin]::
|
||||
<<mapper-attachments>>::
|
||||
|
||||
Integrates http://lucene.apache.org/tika/[Apache Tika] to provide a new field
|
||||
type `attachment` to allow indexing of documents such as PDFs and Microsoft
|
||||
Word.
|
||||
The mapper-attachments plugin integrates http://lucene.apache.org/tika/[Apache Tika] to provide a new field
|
||||
type `attachment` to allow indexing of documents such as PDFs and Microsoft Word.
|
||||
|
||||
<<mapper-size>>::
|
||||
|
||||
|
@ -25,5 +24,6 @@ indexes the size in bytes of the original
|
|||
The mapper-murmur3 plugin allows hashes to be computed at index-time and stored
|
||||
in the index for later use with the `cardinality` aggregation.
|
||||
|
||||
include::mapper-attachments.asciidoc[]
|
||||
include::mapper-size.asciidoc[]
|
||||
include::mapper-murmur3.asciidoc[]
|
||||
|
|
|
@ -75,13 +75,12 @@ sudo bin/plugin install lmenezes/elasticsearch-kopf/2.x <2>
|
|||
|
||||
When installing from Maven Central/Sonatype, `[org]` should be replaced by
|
||||
the artifact `groupId`, and `[user|component]` by the `artifactId`. For
|
||||
instance, to install the
|
||||
https://github.com/elastic/elasticsearch-mapper-attachments[mapper attachment]
|
||||
instance, to install the {plugins}/mapper-attachments.html[`mapper-attachments`]
|
||||
plugin from Sonatype, run:
|
||||
|
||||
[source,shell]
|
||||
-----------------------------------
|
||||
sudo bin/plugin install org.elasticsearch/elasticsearch-mapper-attachments/2.6.0 <1>
|
||||
sudo bin/plugin install org.elasticsearch.plugin/mapper-attachments/3.0.0 <1>
|
||||
-----------------------------------
|
||||
<1> When installing from `download.elastic.co` or from Maven Central/Sonatype, the
|
||||
version is required.
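
The `[org]/[user|component]/[version]` form described above can be assembled mechanically from Maven coordinates. A minimal Python sketch, in which the helper `plugin_coordinate` is hypothetical and not part of any Elasticsearch tooling:

```python
def plugin_coordinate(group_id: str, artifact_id: str, version: str) -> str:
    """Join Maven coordinates into the identifier passed to `bin/plugin install`."""
    # The version is required when installing from Maven Central/Sonatype.
    if not version:
        raise ValueError("a version is required for Maven Central/Sonatype installs")
    return f"{group_id}/{artifact_id}/{version}"


coordinate = plugin_coordinate("org.elasticsearch.plugin", "mapper-attachments", "3.0.0")
print("sudo bin/plugin install " + coordinate)
```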
|
||||
|
|
|
@ -37,8 +37,8 @@ document:
|
|||
|
||||
Attachment datatype::
|
||||
|
||||
See the https://github.com/elastic/elasticsearch-mapper-attachments[mapper attachment plugin]
|
||||
which supports indexing ``attachments'' like Microsoft Office formats, Open
|
||||
See the {plugins}/mapper-attachments.html[`mapper-attachments`] plugin
|
||||
which supports indexing `attachments` like Microsoft Office formats, Open
|
||||
Document formats, ePub, HTML, etc. into an `attachment` datatype.
|
||||
|
||||
[float]
|
||||
|
|