Some exceptions might not be serializable. Rather than wrapping them as the cause of a `MapperParsingException`, it is safer to create the `MapperParsingException` on its own.
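The difference can be illustrated with a minimal, self-contained sketch. It assumes that plain Java serialization is the concern. `ParsingFailure` and `ParserError` are hypothetical stand-ins invented for this illustration; they are not the Elasticsearch `MapperParsingException` or any real parser exception.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

class SerializableFailureSketch {

    // Hypothetical stand-in for MapperParsingException (NOT the Elasticsearch class).
    static class ParsingFailure extends RuntimeException {
        ParsingFailure(String message) { super(message); }
        ParsingFailure(String message, Throwable cause) { super(message, cause); }
    }

    // Hypothetical third-party exception that holds a non-serializable field.
    static class ParserError extends Exception {
        private final Object handle = new Object(); // java.lang.Object is not Serializable
        ParserError(String message) { super(message); }
    }

    // Returns true if the throwable, including its cause chain, survives Java serialization.
    static boolean canSerialize(Throwable t) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(t);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ParserError cause = new ParserError("boom");

        // Wrapping drags the non-serializable cause along.
        Throwable wrapped = new ParsingFailure("Failed to extract text", cause);
        // Creating a fresh exception keeps only the message.
        Throwable plain = new ParsingFailure("Failed to extract text: " + cause.getMessage());

        System.out.println("wrapped serializable: " + canSerialize(wrapped)); // false
        System.out.println("plain serializable: " + canSerialize(plain));     // true
    }
}
```

The trade-off in this sketch is that the original stack trace is lost, so the message should carry enough detail to diagnose the failure.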
Related to #113.
(cherry picked from commit e58878c)
(cherry picked from commit a673185)
Tika might fail with some locales under some JVMs. We now check that this won't happen before creating a Tika instance.
If the check fails, a `WARN` is generated in the logs, like:
```
Tika can not be initialized with the current Locale [tr] on the current JVM [1.7.0_60]
```
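The idea of such a guard can be sketched as follows. This is an assumption-laden illustration, not the plugin's actual code: `GuardedInitSketch` and `tryCreate` are hypothetical names, and a generic `Callable` stands in for whatever creates the Tika instance. `Throwable` is caught because a locale-related failure may surface as an `Error` rather than an `Exception`.

```java
import java.util.Locale;
import java.util.concurrent.Callable;

class GuardedInitSketch {

    // Runs the factory once; returns its result, or null if initialization fails
    // for any reason (including Errors thrown from static initializers).
    static <T> T tryCreate(Callable<T> factory) {
        try {
            return factory.call();
        } catch (Throwable t) {
            // Message modeled on the WARN line shown above.
            System.err.println("Tika can not be initialized with the current Locale ["
                    + Locale.getDefault().getLanguage() + "] on the current JVM ["
                    + System.getProperty("java.version") + "]");
            return null;
        }
    }

    public static void main(String[] args) {
        // Hypothetical factory that always fails, standing in for a broken parser.
        Object parser = tryCreate(() -> {
            throw new ExceptionInInitializerError("simulated locale failure");
        });
        System.out.println(parser == null ? "extraction disabled" : "extraction enabled");
    }
}
```

Callers would then skip content extraction when the result is `null` instead of failing on every document.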
To check that Tika is not initialized, you can run the test suite with:
```sh
mvn test -Dtests.output=always -Dtests.locale=tr
```
Closes #105.
(cherry picked from commit d6d63f7)
(cherry picked from commit 532bdf7)
Our package naming for tests is inconsistent.
We should move tests from:
* `o.e.index.mapper.xcontent` to `o.e.index.mapper.attachment.test.unit`
* `o.e.plugin.mapper.attachments.test` to `o.e.index.mapper.attachment.test.integration`
* `StandaloneRunner` class to `o.e.index.mapper.attachment.test.standalone`
Also rename resource dirs to match the test names so it's easier to find the mappings used by each test.
Closes #110.
I use Elasticsearch 1.4.3 with the mapper-attachment plugin 2.4.2 (Tika 1.7).
I get an error when indexing a **specific** docx file:
> "[DEBUG][org.elasticsearch.index.mapper.attachment.AttachmentMapper] Failed to extract [-1] characters of text for [null]: [org.apache.poi.xwpf.usermodel.XWPFSDT.getContent()Lorg/apache/poi/xwpf/usermodel/ISDTContent;]"
But if I use the mapper-attachment plugin 2.4.1 (Tika 1.5), there is no error and the content is parsed successfully.
Caused by this change: #94.
Closes #104.
We were asking for fields by their short names, but elasticsearch no longer allows short names, only fully qualified names.
```java
SearchResponse response = client().prepareSearch("test")
        .addField("content_type")
        .addField("name")
        .execute().get();
```
We now need to use:
```java
SearchResponse response = client().prepareSearch("test")
        .addField("file.content_type")
        .addField("file.name")
        .execute().get();
```
Closes #102.
If you want to use the [copy_to](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#copy-to)
feature, you need to define it on each sub-field you want to copy to another field:
```javascript
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "file": {
            "type": "string",
            "copy_to": "copy"
          }
        }
      },
      "copy": {
        "type": "string"
      }
    }
  }
}
```
In this example, the extracted content will also be copied to the `copy` field.
Closes #97.
(cherry picked from commit f4f6b57)
(cherry picked from commit 5878a62)
It can sometimes be useful to have a standalone runner to see exactly how Tika extracts content from a given file.
You can run the `StandaloneRunner` class with:
* `-u file://URL/TO/YOUR/DOC`
* `--size` to set the extracted size (defaults to the mapper attachment size)
* `BASE64` encoded binary
Example:
```sh
StandaloneRunner BASE64Text
StandaloneRunner -u /tmp/mydoc.pdf
StandaloneRunner -u /tmp/mydoc.pdf --size 1000000
```
It produces something like:
```
## Extracted text
--------------------- BEGIN -----------------------
This is the extracted text
---------------------- END ------------------------
## Metadata
- author: null
- content_length: null
- content_type: application/pdf
- date: null
- keywords: null
- language: null
- name: null
- title: null
```
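The `BASE64` argument is the document's bytes encoded as Base64 text. A minimal sketch for producing such a string is given here, assuming the runner expects the standard (not URL-safe) Base64 alphabet and that Java 8 or later is available for `java.util.Base64`. `Base64ArgSketch` is a hypothetical helper, not part of the plugin.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

class Base64ArgSketch {

    // Encodes raw document bytes with the standard Base64 alphabet.
    static String encode(byte[] content) {
        return Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) throws Exception {
        // With a path argument, encode that file; otherwise encode a small sample text.
        byte[] bytes = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "This is the extracted text".getBytes(StandardCharsets.UTF_8);
        System.out.println(encode(bytes));
    }
}
```

The printed string can then be passed as the first argument in place of `BASE64Text` in the examples above.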
Closes #99.
(cherry picked from commit 720b3bf)
(cherry picked from commit 990fa15)
To simplify plugin maintenance and provide more value in the future, we are starting to build an `elasticsearch-parent` project.
This commit is the first step for this plugin to depend on this new `pom` Maven project.
* exclude `*StandaloneTest*.class` from the test target
* add cleanup to MultifieldAttachementMapperTests to terminate the ThreadPool
* modify MapperTestUtils.newMapperService to add a ThreadPool
Closes #88
This tool is a simple main class which can be used to test what is extracted from a given binary file or from its base64 equivalent.
You can pass the BASE64 content as the first argument.
Available options:
* `-u file:/URL/TO/YOUR/DOC` (in place of the BASE64 content)
* `-s` to set the extracted size (defaults to the mapper attachment size)
Examples:
```
StandaloneTest BASE64Text
StandaloneTest BASE64Text -s 1000000
StandaloneTest -u /tmp/mydoc.pdf
StandaloneTest -u /tmp/mydoc.pdf -s 1000000
```
Closes #89.
Sadly, the netcdf library is not compatible with the Apache 2 License, so we should not package it anymore.
Users who want to use it can manually add the [netcdf libraries](http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/) to the `plugins/mapper-attachments` dir to get the support back.
Closes #84.