This tool is a simple main class which can be used to test what is extracted from a given binary file or from its base64 equivalent.
Pass the BASE64 content as the first argument.
Available options:
-u file:/URL/TO/YOUR/DOC (in place of BASE64 content)
-s set extracted size (defaults to the mapper attachment size)
Examples:
```
StandaloneTest BASE64Text
StandaloneTest BASE64Text -s 1000000
StandaloneTest -u /tmp/mydoc.pdf
StandaloneTest -u /tmp/mydoc.pdf -s 1000000
```
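The BASE64 argument used in the examples above can be produced from any file. A minimal Python sketch, assuming a hypothetical helper named `file_to_base64` (it is not part of the tool):

```python
import base64
from pathlib import Path


def file_to_base64(path: str) -> str:
    """Return the file's bytes as a single-line BASE64 string."""
    # b64encode (unlike encodebytes) emits no line breaks, so the result
    # can be passed as one command-line argument.
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")
```

The returned string is what would be passed in place of `BASE64Text`; for large files the `-u` option is likely more practical than a very long argument.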
Closes #89.
Unfortunately, the netcdf library is not compatible with the Apache 2 License, so we should no longer package it.
Users who want it can manually add the [netcdf libraries](http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/) to the `plugins/mapper-attachments` directory to get the support back.
Closes #84.
In issue #80 we explicitly removed the Apache POI dependency provided by Tika and replaced it with a more recent one.
Unfortunately, we forgot to add this new dependency to the assembly, so the final ZIP file does not contain the POI-related jars.
Closes #82.
(cherry picked from commit 49793d5)
In #73, we deprecated the `content` field in favor of the `_content` field.
In plugin version 2.4.0, we can now remove the old field name.
Closes #75.
(cherry picked from commit 7a0f838)
Currently, Tika exceptions are swallowed with no log message.
We'd like to know when this occurs and for what reason.
Closes #78.
(cherry picked from commit 36b0117)
When we want to force some values, we need to set those using `_field` where `field` is the field name we want to force:
```
{
    "file": {
        "_name": "myfilename.txt"
    }
}
```
But to set the content itself, we use the `content` field name:
```
{
    "file": {
        "content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu",
        "_name": "myfilename.txt"
    }
}
```
For consistency, we set `_content` instead:
```
{
    "file": {
        "_content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu",
        "_name": "myfilename.txt"
    }
}
```
Closes #73.
(cherry picked from commit 2e6be20)
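A document body using the underscore-prefixed forced fields can be assembled programmatically. The following Python sketch is illustrative only: the helper name `build_attachment_doc` is hypothetical, and the `file` field name is taken from the JSON examples above.

```python
import base64
import json


def build_attachment_doc(raw: bytes, name: str, field: str = "file") -> dict:
    """Build a document body that sets `_content` and `_name` explicitly."""
    return {
        field: {
            # The attachment content is sent BASE64-encoded.
            "_content": base64.b64encode(raw).decode("ascii"),
            "_name": name,
        }
    }


doc = build_attachment_doc(
    b"This is an elasticsearch mapper attachment test.", "myfilename.txt"
)
print(json.dumps(doc, indent=4))
```

The printed JSON should match the `_content` example above, since that BASE64 string is the encoding of the same sentence.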
This is due to the `edu.ucar:netcdf` library, which comes from the `tika-parsers` dependency:
```
[INFO] +- org.apache.tika:tika-parsers:jar:1.5:compile
[INFO] | +- edu.ucar:netcdf:jar:4.2-min:compile
[INFO] | | \- org.slf4j:slf4j-api:jar:1.5.6:compile
```
We can exclude this library from the generated ZIP artifact.
Closes #41.
When we want to force a language instead of using Tika language detection, we set the `language` field in documents.
To be consistent with the other forced fields, `_content_type` and `_name`, we should prefix the `language` field with an underscore `_`.
So `language` becomes `_language`.
We first deprecate `language` in version 2.1.0 and remove it in 2.3.0.
Closes #68.
(cherry picked from commit 2f46343)
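Client code that still sends the deprecated names can rename them before indexing. This is a rough Python sketch, not something shipped with the plugin: `migrate_attachment` and `RENAMES` are hypothetical names, and the sample values (`"en"`, `a.txt`) are purely illustrative.

```python
# Deprecated name -> replacement, as described in the commit messages.
RENAMES = {"language": "_language", "content": "_content"}


def migrate_attachment(attachment: dict) -> dict:
    """Return a copy with deprecated keys renamed.

    If both the old and the new key are present, the new key's value is kept.
    """
    migrated = {k: v for k, v in attachment.items() if k not in RENAMES}
    for old, new in RENAMES.items():
        if old in attachment and new not in migrated:
            migrated[new] = attachment[old]
    return migrated


old_doc = {"file": {"content": "VGhpcw==", "language": "en", "_name": "a.txt"}}
new_doc = {"file": migrate_attachment(old_doc["file"])}
```

The function does not modify its input, so the original document can still be logged or compared after migration.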
We create branches:
* es-0.90 for elasticsearch 0.90
* es-1.0 for elasticsearch 1.0
* es-1.1 for elasticsearch 1.1
* master for elasticsearch master
We also check before releasing that we don't have a dependency on an elasticsearch SNAPSHOT version.
Add links to each version in the documentation.
Based on PR #45, we add a new language detection option using the language detection feature available in Tika:
https://tika.apache.org/1.4/detection.html#Language_Detection
By default, language detection is disabled (`false`) because it can be costly.
This default value can be changed with the `index.mapping.attachment.detect_language` setting.
It can also be set per indexed document using the `_detect_language` parameter.
Closes #45.
Closes #44.
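The two ways of enabling detection could look like this from a client. This Python sketch makes several assumptions: that `_detect_language` sits next to the other forced fields inside the attachment object, that the setting is supplied in flat-key form, and that `file` is the attachment field name; the helper `with_language_detection` is hypothetical.

```python
import json

# Index-level default, using the setting name given above.
index_settings = {"index.mapping.attachment.detect_language": True}


def with_language_detection(attachment: dict, enabled: bool) -> dict:
    """Return a copy of the attachment with a per-document override."""
    return {**attachment, "_detect_language": enabled}


base = {"_name": "myfilename.txt"}
doc = {"file": with_language_detection(base, True)}
print(json.dumps(doc, indent=4))
```

Keeping the index default at `false` and opting in per document limits the detection cost to the documents that need it.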