It can sometimes be useful to have a standalone runner to see exactly how Tika extracts content from a given file.
You can run the `StandaloneRunner` class with:
* `BASE64` encoded binary content as the first argument
* `-u file://URL/TO/YOUR/DOC` (in place of the BASE64 content)
* `--size` to set the extracted size (defaults to the mapper attachment size)
Examples:
```sh
StandaloneRunner BASE64Text
StandaloneRunner -u /tmp/mydoc.pdf
StandaloneRunner -u /tmp/mydoc.pdf --size 1000000
```
It produces something like:
```
## Extracted text
--------------------- BEGIN -----------------------
This is the extracted text
---------------------- END ------------------------
## Metadata
- author: null
- content_length: null
- content_type: application/pdf
- date: null
- keywords: null
- language: null
- name: null
- title: null
```
Closes #99.
(cherry picked from commit 720b3bf)
(cherry picked from commit 990fa15)
To simplify plugin maintenance and provide more value in the future, we are starting to build an `elasticsearch-parent` project.
This commit is the first step: this plugin now depends on this new `pom` Maven project.
* exclude *StandaloneTest*.class from the test target
* add cleanup to MultifieldAttachmentMapperTests to terminate the ThreadPool
* modify MapperTestUtils.newMapperService to add a ThreadPool
Closes #88.
This tool is a simple main class which can be used to test what is extracted from a given binary file or from its base64 equivalent.
You can give the BASE64 content as the first argument.
Available options:
* `-u file:/URL/TO/YOUR/DOC` (in place of the BASE64 content)
* `-s` to set the extracted size (defaults to the mapper attachment size)
Examples:
```sh
StandaloneTest BASE64Text
StandaloneTest BASE64Text -s 1000000
StandaloneTest -u /tmp/mydoc.pdf
StandaloneTest -u /tmp/mydoc.pdf -s 1000000
```
Closes #89.
Sadly, the netcdf library is not Apache2 License compatible, so we should not package it anymore.
Users who want to use it can manually add the [netcdf libraries](http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/) to the `plugins/mapper-attachments` dir to get the support back.
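For example, here is a minimal sketch, assuming you have already downloaded the netcdf jar from the Unidata site (the jar name and version below are illustrative):
```sh
# Copy the manually downloaded netcdf jar into the plugin directory
# (jar name and version are illustrative; use the files you downloaded):
cp ~/Downloads/netcdf-4.2-min.jar plugins/mapper-attachments/
# Restart the Elasticsearch node so the plugin picks up the new jar
```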
Closes #84.
With issue #80, we explicitly removed the Apache POI dependency provided by Tika and replaced it with a more recent one.
Sadly, we forgot to add this new dependency to the assembly, so the final ZIP file does not contain the POI-related jars.
Closes #82.
(cherry picked from commit 49793d5)
In #73, we deprecated the `content` field in favor of the `_content` field.
In plugin version 2.4.0, we can now remove the old field name.
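Documents must now provide the binary content with `_content`, as in this body (mirroring the examples from #73):
```
{
  "file": {
    "_content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu"
  }
}
```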
Closes #75.
(cherry picked from commit 7a0f838)
Currently, Tika exceptions are swallowed with no log message.
We'd like to be able to know when and why this occurs.
Closes #78.
(cherry picked from commit 36b0117)
When we want to force some values, we need to set them using `_field`, where `field` is the field name we want to force:
```
{
  "file": {
    "_name": "myfilename.txt"
  }
}
```
But to set the content itself, we use the `content` field name.
```
{
  "file": {
    "content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu",
    "_name": "myfilename.txt"
  }
}
```
For consistency, we now use `_content` instead:
```
{
  "file": {
    "_content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu",
    "_name": "myfilename.txt"
  }
}
```
Closes #73.
(cherry picked from commit 2e6be20)
This is due to the `edu.ucar:netcdf` lib, which comes from the `tika-parsers` dependency.
```
[INFO] +- org.apache.tika:tika-parsers:jar:1.5:compile
[INFO] | +- edu.ucar:netcdf:jar:4.2-min:compile
[INFO] | | \- org.slf4j:slf4j-api:jar:1.5.6:compile
```
We can exclude this library from the generated ZIP artifact.
Closes #41.
When we want to force a language instead of using Tika language detection, we set the `language` field in documents.
To be consistent with the other forced fields, `_content_type` and `_name`, we should prefix the `language` field with an underscore `_`.
So `language` becomes `_language`.
We first deprecate `language` in version 2.1.0 and will remove it in 2.3.0.
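For example, to force French instead of letting Tika detect the language (the field values here are illustrative):
```
{
  "file": {
    "_language": "fr",
    "_name": "myfilename.txt"
  }
}
```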
Closes #68.
(cherry picked from commit 2f46343)