If you want to use [copy_to](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#copy-to)
feature, you need to define it on each sub-field you want to copy to another field:
```javascript
PUT /test/person/_mapping
{
"person": {
"properties": {
"file": {
"type": "attachment",
"path": "full",
"fields": {
"file": {
"type": "string",
"copy_to": "copy"
}
}
},
"copy": {
"type": "string"
}
}
}
}
```
In this example, the extracted content will be copy as well to `copy` field.
Closes#97.
(cherry picked from commit f4f6b57)
(cherry picked from commit 5878a62)
It could be sometime useful to have a stand alone runner to see how exactly Tika extracts content from a given file.
You can run `StandaloneRunner` class using:
* `-u file://URL/TO/YOUR/DOC`
* `--size` set extracted size (default to mapper attachment size)
* `BASE64` encoded binary
Example:
```sh
StandaloneRunner BASE64Text
StandaloneRunner -u /tmp/mydoc.pdf
StandaloneRunner -u /tmp/mydoc.pdf --size 1000000
```
It produces something like:
```
## Extracted text
--------------------- BEGIN -----------------------
This is the extracted text
---------------------- END ------------------------
## Metadata
- author: null
- content_length: null
- content_type: application/pdf
- date: null
- keywords: null
- language: null
- name: null
- title: null
```
Closes#99.
(cherry picked from commit 720b3bf)
(cherry picked from commit 990fa15)
When we want to force some values, we need to set those using `_field` where `field` is the field name we want to force:
```
{
"file": {
"_name": "myfilename.txt"
}
}
```
But to set the content itself, we use `content` field name.
```
{
"file": {
"content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu",
"_name": "myfilename.txt"
}
}
```
For consistency, we set `_content` instead:
```
{
"file": {
"_content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu",
"_name": "myfilename.txt"
}
}
```
Closes#73.
(cherry picked from commit 2e6be20)
When we want to force a language instead of using Tika language detection, we set `language` field in documents.
To be consistent with other forced fields, `_content_type` and `_name`, we should prefix `language` field by an underscore `_`.
So `language` become `_language`.
We first deprecate `language` in version 2.1.0 and we remove it in 2.3.0.
Closes#68.
(cherry picked from commit 2f46343)
We create branches:
* es-0.90 for elasticsearch 0.90
* es-1.0 for elasticsearch 1.0
* es-1.1 for elasticsearch 1.1
* master for elasticsearch master
We also check that before releasing we don't have a dependency to an elasticsearch SNAPSHOT version.
Add links to each version in documentation
Based on PR #45, we add a new language detection option using Language detection feature available in Tika:
https://tika.apache.org/1.4/detection.html#Language_Detection
By default, language detection is disabled (`false`) as it could come with a cost.
This default value can be changed by setting the `index.mapping.attachment.detect_language` setting.
It can also be provided on a per document indexed using the `_detect_language` parameter.
Closes#45.
Closes#44.
From original PR #17 from @fcamblor
If you try to index a document with an invalid metadata, the full document is rejected.
For example:
```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html lang="fr">
<head>
<title>Hello</title>
<meta name="date" content="">
<meta name="Author" content="kimchy">
<meta name="Keywords" content="elasticsearch,cool,bonsai">
</head>
<body>World</body>
</html>
```
has a non parseable date.
This fix add a new option that ignore parsing errors `"index.mapping.attachment.ignore_errors":true` (default to `true`).
Closes#17, #38.