The content mapper is now a true subfield. Backward-compatibility support
for the previous behavior (indexing and querying under the main field
name) is limited: indexes created before ES 2.0 can still be read, but
the content must now be queried as `FIELDNAME.content`.
Tika 1.8 has been released. See https://dist.apache.org/repos/dist/release/tika/CHANGES-1.8.txt
We can replace:
```java
public static boolean isLocaleCompatible() {
    String language = Locale.getDefault().getLanguage();
    boolean acceptedLocale = true;
    if (
            // We can have issues with JDK7 Patch < 80
            (JVM_MAJOR_VERSION == 1 && JVM_MINOR_VERSION == 7 && JVM_PATCH_MAJOR_VERSION == 0 && JVM_PATCH_MINOR_VERSION < 80) ||
            // We can have issues with JDK8 Patch < 40
            (JVM_MAJOR_VERSION == 1 && JVM_MINOR_VERSION == 8 && JVM_PATCH_MAJOR_VERSION == 0 && JVM_PATCH_MINOR_VERSION < 40)
    ) {
        if (language.equalsIgnoreCase("tr") || language.equalsIgnoreCase("az")) {
            acceptedLocale = false;
        }
    }
    return acceptedLocale;
}
```
with
```java
public static boolean isLocaleCompatible() {
    return true;
}
```
Related to https://issues.apache.org/jira/browse/TIKA-1526 and #105
Note that Content-type has changed a bit and now returns something like `application/xhtml+xml; charset=ISO-8859-1` instead of `application/xhtml+xml`.
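
Code that compared the content type against an exact value may need to ignore the new parameters. The following is a minimal sketch of one way to do that with plain string handling; the class and method names (`ContentTypeSketch`, `baseMediaType`) are hypothetical and are not part of the plugin or of Tika.

```java
class ContentTypeSketch {
    // Returns the media type without parameters such as "; charset=...".
    // Assumption: parameters, if any, follow the first ';' character.
    static String baseMediaType(String contentType) {
        int semicolon = contentType.indexOf(';');
        String base = semicolon < 0 ? contentType : contentType.substring(0, semicolon);
        return base.trim();
    }

    public static void main(String[] args) {
        // Value taken from the note above.
        String returned = "application/xhtml+xml; charset=ISO-8859-1";
        System.out.println(baseMediaType(returned));
    }
}
```

Comparing on the base media type keeps existing checks working whether or not a charset is appended.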
Closes #112.
(cherry picked from commit bf4af47971ed07bfa126409413c435f121444c3c)
Some exceptions might not be serializable. It is safer not to wrap them as the cause of a `MapperParsingException` and instead to create a standalone `MapperParsingException` without the original exception attached.
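
To illustrate the problem, here is a self-contained sketch using only the JDK. `ParserFailure` and `ParsingException` are hypothetical stand-ins (the latter for `MapperParsingException`), and the message text is made up for the example; this is not the plugin's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

class SerializableFailureSketch {
    // Hypothetical third-party exception that holds a non-serializable field.
    static class ParserFailure extends RuntimeException {
        private final Object handle = new Object(); // java.lang.Object is not Serializable
        ParserFailure(String message) { super(message); }
    }

    // Hypothetical stand-in for MapperParsingException.
    static class ParsingException extends RuntimeException {
        ParsingException(String message) { super(message); }
        ParsingException(String message, Throwable cause) { super(message, cause); }
    }

    // Wrapping keeps the original exception as the cause, so it gets serialized too.
    static ParsingException wrap(Throwable t) {
        return new ParsingException("Failed to extract text", t);
    }

    // Creating a standalone exception keeps only the message of the original.
    static ParsingException describe(Throwable t) {
        return new ParsingException("Failed to extract text: " + t.getMessage());
    }

    // Tries Java serialization and reports whether it succeeded.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Throwable original = new ParserFailure("boom");
        System.out.println("wrapped:    " + canSerialize(wrap(original)));
        System.out.println("standalone: " + canSerialize(describe(original)));
    }
}
```

The wrapped variant drags the non-serializable cause along, while the standalone variant only carries strings and a stack trace.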
Related to #113.
(cherry picked from commit e58878c)
(cherry picked from commit a673185)
Tika might fail with some locales under some JVMs. We now check for such a combination before creating a Tika instance.
When the check fails, a `WARN` is written to the logs, like:
```
Tika can not be initialized with the current Locale [tr] on the current JVM [1.7.0_60]
```
To check that Tika is not initialized, you can run the test suite with:
```sh
mvn test -Dtests.output=always -Dtests.locale=tr
```
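
As a rough illustration of such a check, the sketch below parses a legacy version string like `1.7.0_60` and rejects the `tr` and `az` languages on JDK 7 updates below 80 and JDK 8 updates below 40. The class name, the regular expression, and passing the version and language as parameters are assumptions made for the example; the plugin itself relies on JVM version constants.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocaleGuardSketch {
    // Matches legacy version strings such as "1.7.0_60" (assumed format).
    private static final Pattern LEGACY = Pattern.compile("^1\\.(\\d+)\\.0_(\\d+).*$");

    static boolean isLocaleCompatible(String javaVersion, String language) {
        Matcher m = LEGACY.matcher(javaVersion);
        if (!m.matches()) {
            return true; // assumption: unknown version formats are treated as safe
        }
        int minor = Integer.parseInt(m.group(1));
        int update = Integer.parseInt(m.group(2));
        boolean affectedJvm = (minor == 7 && update < 80) || (minor == 8 && update < 40);
        boolean affectedLocale = language.equalsIgnoreCase("tr") || language.equalsIgnoreCase("az");
        return !(affectedJvm && affectedLocale);
    }

    public static void main(String[] args) {
        String version = "1.7.0_60";
        String language = "tr";
        if (!isLocaleCompatible(version, language)) {
            // Same wording as the WARN message quoted above.
            System.out.println(String.format(
                "Tika can not be initialized with the current Locale [%s] on the current JVM [%s]",
                language, version));
        }
    }
}
```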
Closes #105.
(cherry picked from commit d6d63f7)
(cherry picked from commit 532bdf7)
Our package naming for tests is inconsistent.
We should move tests from:
* `o.e.index.mapper.xcontent` to `o.e.index.mapper.attachment.test.unit`
* `o.e.plugin.mapper.attachments.test` to `o.e.index.mapper.attachment.test.integration`
* `StandaloneRunner` class to `o.e.index.mapper.attachment.test.standalone`
Also rename resource dirs to match the test names, so it is easier to find the mappings used by each test.
Closes #110.
I use ElasticSearch 1.4.3 with mapper-attachment plugin 2.4.2 (TIKA 1.7).
I get an error when indexing a **specific** docx file:
> "[DEBUG][org.elasticsearch.index.mapper.attachment.AttachmentMapper] Failed to extract [-1] characters of text for [null]: [org.apache.poi.xwpf.usermodel.XWPFSDT.getContent()Lorg/apache/poi/xwpf/usermodel/ISDTContent;]"
But if I use mapper-attachment plugin 2.4.1 (TIKA 1.5), there is no error and the content is parsed successfully.
Caused by this change: #94.
Closes #104.
We were asking for fields by their short names, but elasticsearch no longer allows short names and requires fully qualified names.
```java
SearchResponse response = client().prepareSearch("test")
        .addField("content_type")
        .addField("name")
        .execute().get();
```
We now need to use:
```java
SearchResponse response = client().prepareSearch("test")
        .addField("file.content_type")
        .addField("file.name")
        .execute().get();
```
Closes #102.
If you want to use the [copy_to](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#copy-to)
feature, you need to define it on each sub-field you want to copy to another field:
```javascript
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "file": {
            "type": "string",
            "copy_to": "copy"
          }
        }
      },
      "copy": {
        "type": "string"
      }
    }
  }
}
```
In this example, the extracted content will also be copied to the `copy` field.
Closes #97.
(cherry picked from commit f4f6b57)
(cherry picked from commit 5878a62)
It can sometimes be useful to have a standalone runner to see exactly how Tika extracts content from a given file.
You can run `StandaloneRunner` class using:
* `-u file://URL/TO/YOUR/DOC`
* `--size` set extracted size (default to mapper attachment size)
* `BASE64` encoded binary
Example:
```sh
StandaloneRunner BASE64Text
StandaloneRunner -u /tmp/mydoc.pdf
StandaloneRunner -u /tmp/mydoc.pdf --size 1000000
```
It produces something like:
```
## Extracted text
--------------------- BEGIN -----------------------
This is the extracted text
---------------------- END ------------------------
## Metadata
- author: null
- content_length: null
- content_type: application/pdf
- date: null
- keywords: null
- language: null
- name: null
- title: null
```
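
To produce the BASE64 argument from a local file, something like the following sketch can be used. It relies only on the JDK (`java.util.Base64`, Java 8+); the class name `Base64ArgumentSketch` is hypothetical, and it assumes the runner expects standard (not URL-safe) Base64.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

class Base64ArgumentSketch {
    // Encodes raw bytes with the standard Base64 alphabet (assumed to be what the runner expects).
    static String encode(byte[] content) {
        return Base64.getEncoder().encodeToString(content);
    }

    // Reads the whole file into memory and encodes it.
    static String encodeFile(Path file) throws IOException {
        return encode(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: Base64ArgumentSketch <file>");
            return;
        }
        // The printed string can be passed as the BASE64 argument.
        System.out.println(encodeFile(Paths.get(args[0])));
    }
}
```

For large documents the `-u` option is likely more convenient, since the whole encoded content has to fit on the command line.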
Closes #99.
(cherry picked from commit 720b3bf)
(cherry picked from commit 990fa15)
To simplify plugin maintenance and provide more value in the future, we are starting to build an `elasticsearch-parent` project.
This commit is the first step toward making this plugin depend on this new `pom` Maven project.
* exclude `*StandaloneTest*.class` from test target
* add cleanup to MultifieldAttachementMapperTests for terminating ThreadPool
* Modify MapperTestUtils.newMapperService for adding ThreadPool
Closes #88.
This tool is a simple main class which can be used to test what is extracted from a given binary file or from its base64 equivalent.
You can give the BASE64 content as the first argument.
Available options:
* `-u file:/URL/TO/YOUR/DOC` (in place of BASE64 content)
* `-s` set extracted size (default to mapper attachment size)
Examples:
```
StandaloneTest BASE64Text
StandaloneTest BASE64Text -s 1000000
StandaloneTest -u /tmp/mydoc.pdf
StandaloneTest -u /tmp/mydoc.pdf -s 1000000
```
Closes #89.