New paragraph
Some abbreviation to 1st paragraph
More concise phrasing
Rename heading
Remove repeated "Now," from Hello World
Person is also a document
Rephrasing of last paragraph in Hello, World
Move installation to being above Hello, world
Accidentally left out moving code backticks. Fixed
Closes #155
We were requesting fields by their short names, but elasticsearch no longer allows short names: fields must be requested by their fully qualified names. So instead of:
```java
SearchResponse response = client().prepareSearch("test")
        .addField("content_type")
        .addField("name")
        .execute().get();
```
we now need to use:
```java
SearchResponse response = client().prepareSearch("test")
        .addField("file.content_type")
        .addField("file.name")
        .execute().get();
```
Closes #102.
If you want to use the [copy_to](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#copy-to)
feature, you need to define it on each sub-field you want to copy to another field:
```javascript
PUT /test/person/_mapping
{
    "person": {
        "properties": {
            "file": {
                "type": "attachment",
                "path": "full",
                "fields": {
                    "file": {
                        "type": "string",
                        "copy_to": "copy"
                    }
                }
            },
            "copy": {
                "type": "string"
            }
        }
    }
}
```
In this example, the extracted content is also copied to the `copy` field.
Closes #97.
(cherry picked from commit f4f6b57)
(cherry picked from commit 5878a62)
It can sometimes be useful to have a standalone runner to see exactly how Tika extracts content from a given file.
You can run the `StandaloneRunner` class with the following options:
* `-u file://URL/TO/YOUR/DOC`: URL of the document to extract
* `--size`: sets the extracted size (defaults to the mapper attachment size)
* `BASE64`: the Base64-encoded binary content
Example:
```sh
StandaloneRunner BASE64Text
StandaloneRunner -u /tmp/mydoc.pdf
StandaloneRunner -u /tmp/mydoc.pdf --size 1000000
```
It produces something like:
```
## Extracted text
--------------------- BEGIN -----------------------
This is the extracted text
---------------------- END ------------------------
## Metadata
- author: null
- content_length: null
- content_type: application/pdf
- date: null
- keywords: null
- language: null
- name: null
- title: null
```
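To pass a document through the `BASE64` option, its content has to be Base64-encoded first. A minimal sketch, assuming Java 8+ (`java.util.Base64`) and assuming the runner expects the standard Base64 alphabet without line breaks. The `Base64Argument` class is hypothetical and not part of the plugin:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

// Hypothetical helper, not part of the plugin.
class Base64Argument {

    // Encodes raw bytes with the standard Base64 alphabet, on a single line.
    static String encode(byte[] content) {
        return Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, encode that file; otherwise encode a small sample text.
        byte[] content = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "This is the extracted text".getBytes(StandardCharsets.UTF_8);
        System.out.println(encode(content));
    }
}
```

The printed string can then be given to `StandaloneRunner` as its single argument.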
Closes #99.
(cherry picked from commit 720b3bf)
(cherry picked from commit 990fa15)
When we want to force some values, we need to set those using `_field` where `field` is the field name we want to force:
```
{
    "file": {
        "_name": "myfilename.txt"
    }
}
```
But to set the content itself, we were using the `content` field name, without the underscore:
```
{
    "file": {
        "content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu",
        "_name": "myfilename.txt"
    }
}
```
For consistency with the other forced fields, we now use `_content` instead:
```
{
    "file": {
        "_content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu",
        "_name": "myfilename.txt"
    }
}
```
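A document using the `_content` and `_name` forced fields could be built on the client side along these lines. This is only a sketch, assuming Java 8+; the `AttachmentDocument` class and its `toJson` method are hypothetical, and the escaping only covers quotes and backslashes (a real client would rather use a JSON library):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical client-side helper, not part of the plugin.
class AttachmentDocument {

    // Builds {"<fieldName>":{"_content":"<base64>","_name":"<name>"}}.
    static String toJson(String fieldName, byte[] content, String name) {
        String encoded = Base64.getEncoder().encodeToString(content);
        return "{\"" + escape(fieldName) + "\":{"
                + "\"_content\":\"" + encoded + "\","
                + "\"_name\":\"" + escape(name) + "\"}}";
    }

    // Minimal escaping: backslashes and double quotes only.
    private static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        byte[] content = "This is an elasticsearch mapper attachment test."
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(toJson("file", content, "myfilename.txt"));
    }
}
```

Run as is, it prints the same document as above on a single line.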
Closes #73.
(cherry picked from commit 2e6be20)
When we want to force a language instead of using Tika language detection, we set the `language` field in documents.
To be consistent with the other forced fields, `_content_type` and `_name`, we should prefix the `language` field with an underscore `_`.
So `language` becomes `_language`.
We first deprecate `language` in version 2.1.0 and remove it in 2.3.0.
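Clients that still send `language` can rename the key before indexing. A minimal sketch, assuming Java 8+ and documents held as maps; the `ForcedFieldMigration` class is hypothetical and not part of the plugin, and letting an explicit `_language` win when both keys are present is an assumption of this sketch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side helper, not part of the plugin.
class ForcedFieldMigration {

    // Returns a copy of the attachment object with `language` renamed to `_language`.
    static Map<String, Object> migrate(Map<String, Object> attachment) {
        Map<String, Object> migrated = new LinkedHashMap<>(attachment);
        if (migrated.containsKey("language")) {
            Object value = migrated.remove("language");
            // Assumption: an explicit `_language` takes precedence over the deprecated key.
            migrated.putIfAbsent("_language", value);
        }
        return migrated;
    }

    public static void main(String[] args) {
        Map<String, Object> attachment = new LinkedHashMap<>();
        attachment.put("_name", "myfilename.txt");
        attachment.put("language", "en");
        System.out.println(migrate(attachment)); // prints {_name=myfilename.txt, _language=en}
    }
}
```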
Closes #68.
(cherry picked from commit 2f46343)
We create branches:
* es-0.90 for elasticsearch 0.90
* es-1.0 for elasticsearch 1.0
* es-1.1 for elasticsearch 1.1
* master for elasticsearch master
We also check before releasing that we don't have a dependency on an elasticsearch SNAPSHOT version.
Add links to each version in documentation
Based on PR #45, we add a new language detection option using the language detection feature available in Tika:
https://tika.apache.org/1.4/detection.html#Language_Detection
By default, language detection is disabled (`false`) as it could come with a cost.
This default value can be changed with the `index.mapping.attachment.detect_language` setting.
It can also be set per indexed document using the `_detect_language` parameter.
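The way these three levels combine can be sketched as follows. This is an illustration only: the `DetectLanguageSetting` class is hypothetical, and the precedence shown (per-document value first, then the index setting, then the `false` default) is an assumption, not taken from the plugin code:

```java
// Hypothetical illustration, not the plugin's implementation.
class DetectLanguageSetting {

    // null means "not provided" at that level.
    static boolean resolve(Boolean indexSetting, Boolean perDocument) {
        if (perDocument != null) {
            return perDocument;   // `_detect_language` given in the document
        }
        if (indexSetting != null) {
            return indexSetting;  // `index.mapping.attachment.detect_language`
        }
        return false;             // detection is disabled by default
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, null)); // prints false
        System.out.println(resolve(true, null)); // prints true
        System.out.println(resolve(true, false)); // prints false
    }
}
```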
Closes #45.
Closes #44.