
Mapper Attachments Type for ElasticSearch
==================================

The mapper attachments plugin adds the `attachment` type to ElasticSearch using Tika.

To install the plugin, run: `bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/1.2.0`.

    -----------------------------------------------
    | Attachment Mapper Plugin | ElasticSearch    |
    -----------------------------------------------
    | master                   | 0.19 -> master   |
    -----------------------------------------------
    | 1.2.0                    | 0.19 -> master   |
    -----------------------------------------------
    | 1.1.0                    | 0.19 -> master   |
    -----------------------------------------------
    | 1.0.0                    | 0.18             |
    -----------------------------------------------


The `attachment` type lets you index fields that hold an attachment (encoded as `base64`), for example Microsoft Office formats, OpenDocument formats, ePub, HTML, and so on (the full list can be found [here](http://lucene.apache.org/tika/0.10/formats.html)).

The `attachment` type is provided as a plugin extension. The plugin is a simple zip file that can be downloaded and placed under the `$ES_HOME/plugins` location. It is detected automatically, and the `attachment` type is then added.

Using the attachment type is simple: in your mapping JSON, set the type of a JSON element to `attachment`, for example:

    {
        "person" : {
            "properties" : {
                "my_attachment" : { "type" : "attachment" }
            }
        }
    }

In this case, the JSON to index can be:

    {
        "my_attachment" : "... base64 encoded attachment ..."
    }
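As a minimal sketch of how such a document body could be produced, the snippet below base64-encodes raw bytes and wraps them in the JSON shape shown above. The helper name `build_attachment_doc` is hypothetical and only for illustration; sending the body to the index is left out.

```python
import base64
import json


def build_attachment_doc(data: bytes, field: str = "my_attachment") -> str:
    """Return a JSON body whose `field` holds the base64-encoded content."""
    # base64.b64encode returns bytes; decode to ASCII so it can be placed in JSON.
    encoded = base64.b64encode(data).decode("ascii")
    return json.dumps({field: encoded})


# Hypothetical usage: in practice `data` would be the bytes read from a file.
print(build_attachment_doc(b"hello attachment"))
```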

Alternatively, use a more elaborate JSON object if the content type or resource name needs to be set explicitly:

    {
        "my_attachment" : {
            "_content_type" : "application/pdf",
            "_name" : "resource/name/of/my.pdf",
            "content" : "... base64 encoded attachment ..."
        }
    }
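The explicit form can be assembled in the same way. The sketch below is an illustration only: the helper `build_detailed_attachment_doc` is hypothetical, and guessing the content type from the file name with Python's `mimetypes` module (with a generic fallback) is an assumption of this example, not something the plugin requires.

```python
import base64
import json
import mimetypes


def build_detailed_attachment_doc(name: str, data: bytes, field: str = "my_attachment") -> str:
    """Return a JSON body that sets the resource name and content type explicitly."""
    # Guess the MIME type from the file name; fall back to a generic binary type.
    content_type, _ = mimetypes.guess_type(name)
    attachment = {
        "_content_type": content_type or "application/octet-stream",
        "_name": name,
        "content": base64.b64encode(data).decode("ascii"),
    }
    return json.dumps({field: attachment})


# Placeholder bytes stand in for a real PDF file.
print(build_detailed_attachment_doc("resource/name/of/my.pdf", b"placeholder bytes"))
```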

The `attachment` type indexes not only the content of the document but also, when available, metadata about the attachment. The supported metadata fields are `date`, `title`, `author`, and `keywords`. They can be queried using dot notation, for example `my_attachment.author`.
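To illustrate the dot notation, here is a sketch that builds a search body targeting the `author` metadata of an attachment field. It assumes a `query_string` query with a `default_field`, which is one possible way to address the field. The helper name `author_query` and the author value are hypothetical.

```python
import json


def author_query(field: str, author: str) -> str:
    """Build a search body that matches on an attachment's `author` metadata."""
    # The metadata field is addressed as "<attachment field>.author".
    body = {
        "query": {
            "query_string": {
                "default_field": f"{field}.author",
                "query": author,
            }
        }
    }
    return json.dumps(body)


# Hypothetical author value, for illustration only.
print(author_query("my_attachment", "kimchy"))
```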

Both the metadata and the actual content are simple core type mappers (string, date, ...), so they can be controlled in the mappings. For example:

    {
        "person" : {
            "properties" : {
                "file" : {
                    "type" : "attachment",
                    "fields" : {
                        "file" : {"index" : "no"},
                        "date" : {"store" : "yes"},
                        "author" : {"analyzer" : "myAnalyzer"}
                    }
                }
            }
        }
    }

In the above example, the actual content is mapped under the `fields` entry named `file`, and we choose not to index it, so it will only be available in the `_all` field. The other entries map to their respective metadata names, and there is no need to specify their `type` (like `string` or `date`) since it is already known.

Indexed Characters
------------------

By default, `100000` characters are extracted when indexing the content. This default can be changed with the `index.mapping.attachment.indexed_chars` setting. It can also be set per indexed document using the `_indexed_chars` parameter. Setting it to `-1` extracts all text, but note that the entire text must then fit in memory.

Note: this feature is supported since version `1.3.0`.
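The two ways of setting the limit can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, and placing `_indexed_chars` next to `content` inside the attachment object (in the same way as `_content_type` and `_name`) is assumed rather than stated above.

```python
import base64
import json


def index_settings(indexed_chars: int) -> dict:
    """Index-level settings that change the default extraction limit."""
    return {"index.mapping.attachment.indexed_chars": indexed_chars}


def doc_with_char_limit(data: bytes, indexed_chars: int, field: str = "my_attachment") -> dict:
    """Document body that overrides the limit for this document only."""
    # Assumption: `_indexed_chars` sits alongside `content` in the attachment object.
    return {
        field: {
            "_indexed_chars": indexed_chars,
            "content": base64.b64encode(data).decode("ascii"),
        }
    }


# -1 requests extraction of all text; the whole text must then fit in memory.
print(json.dumps(index_settings(200000)))
print(json.dumps(doc_with_char_limit(b"some long document", -1)))
```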

The plugin uses [Apache Tika](http://lucene.apache.org/tika/) to parse attachments, so many formats are supported, listed [here](http://lucene.apache.org/tika/0.10/formats.html).