Commit Graph

202 Commits

Author SHA1 Message Date
Robert Muir 51da1fe2a1 parse java.specification.version not java.version, so that it is robust 2015-03-30 14:50:18 +02:00
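A minimal sketch of the idea behind commit 51da1fe2a1, assuming the goal is to get a major Java version number. The `java.version` property can carry update and build suffixes (for example `1.7.0_60`), while `java.specification.version` is a short dotted number such as `1.7`. The class `JavaSpecVersion` and the method `parseMajor` are hypothetical names used for illustration; this is not the plugin's actual code.

```java
// Hypothetical helper for illustration only; not taken from the plugin.
final class JavaSpecVersion {
    private JavaSpecVersion() {}

    /** Returns the major Java version, e.g. "1.7" gives 7 and "11" gives 11. */
    static int parseMajor(String specVersion) {
        String[] parts = specVersion.trim().split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Up to Java 8 the specification version is reported as "1.x".
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("Java major version: " + parseMajor(spec));
    }
}
```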
David Pilato d20c8861ca Update owner to elastic
(cherry picked from commit c4d60ed)
(cherry picked from commit 450d088)
2015-03-30 11:36:10 +02:00
David Pilato 9f6519f84a Move parent after artifact coordinates 2015-03-30 11:35:54 +02:00
Robert Muir 021483626c Merge pull request #117 from rmuir/exclude-jhighlight
Exclude jhighlight dependency, which contains LGPL-only files
2015-03-20 14:55:00 -04:00
Robert Muir 977a7247c7 Exclude jhighlight dependency, which contains LGPL-only files 2015-03-20 14:42:55 -04:00
David Pilato e08ebe9efa create `es-1.5` branch 2015-03-16 16:52:08 -07:00
David Pilato 208c76e45e [Test] Fix remaining static objects after running tests
Test framework detects when static objects are not released when running tests.
This commit removes the use of static objects where possible.
2015-02-23 17:46:28 +01:00
David Pilato d4d54fe744 update documentation with release 2.4.3 2015-02-23 16:56:39 +01:00
David Pilato cfd83443f1 Add test for asciidoc format
Related to #29.
2015-02-23 16:43:45 +01:00
David Pilato 4f65664916 Tika might fail depending on the Locale
Tika might fail with some Locale under some JVMs. We now check that won't happen before creating a Tika instance.
That will generate a `WARN` in logs like:

```
Tika can not be initialized with the current Locale [tr] on the current JVM [1.7.0_60]
```

To check that Tika is not initialized, you can run the test suite with:

```sh
mvn test -Dtests.output=always -Dtests.locale=tr
```

Closes #105.
(cherry picked from commit d6d63f7)
(cherry picked from commit 532bdf7)
2015-02-23 14:49:08 +01:00
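The check described in commit 4f65664916 could be sketched roughly as follows: run the initialization once inside a guard, and log a warning instead of failing if it throws. `LocaleGuard` and `canInitialize` are hypothetical names, and the initializer passed in `main` is a stand-in, since the plugin's real check is not shown in the commit message.

```java
import java.util.Locale;
import java.util.concurrent.Callable;

// Hypothetical guard for illustration only; not the plugin's actual code.
final class LocaleGuard {
    private LocaleGuard() {}

    /**
     * Runs the given initializer once and reports whether it succeeded.
     * On failure, prints a warning modelled on the one quoted in the commit message.
     */
    static boolean canInitialize(Callable<?> initializer) {
        try {
            initializer.call();
            return true;
        } catch (Throwable t) {
            System.err.println("WARN Tika can not be initialized with the current Locale ["
                    + Locale.getDefault().getLanguage() + "] on the current JVM ["
                    + System.getProperty("java.version") + "]");
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in initializers; a real caller would construct its Tika instance here.
        boolean ok = canInitialize(() -> "initialized");
        boolean failed = canInitialize(() -> {
            throw new IllegalStateException("simulated failure");
        });
        System.out.println(ok + " " + failed);
    }
}
```

A caller would keep the returned flag and skip content extraction when it is `false`.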
David Pilato a10c35f0eb [Test] move test package to o.e.index.mapper.attachment
Our package naming for tests is inconsistent.
We should move tests from:

* `o.e.index.mapper.xcontent` to  `o.e.index.mapper.attachment.test.unit`
* `o.e.plugin.mapper.attachments.test` to  `o.e.index.mapper.attachment.test.integration`
* `StandaloneRunner` class to  `o.e.index.mapper.attachment.test.standalone`

Also rename resource dirs to match the test name so it's definitely easier to find mappings used for each test.

Closes #110.
2015-02-23 11:54:38 +01:00
David Pilato 4ffa06d773 [Doc] highlighting example is incorrect
Closes #107.
2015-02-23 11:10:50 +01:00
David Pilato 6d77b085eb [Test] Add highlighting test
Closes #108.

(cherry picked from commit 2c96550)
(cherry picked from commit 440e534)
2015-02-23 11:10:49 +01:00
David Pilato 36344ac8b8 [Internal] Fix field mappers to always pass through index settings
Because of https://github.com/elasticsearch/elasticsearch/pull/9780, we now need to pass index settings instead of empty settings.

Closes #109.
2015-02-23 11:04:35 +01:00
David Pilato c3c9f66d0d Indexing docx file fails
I use ElasticSearch 1.4.3 with mapper-attachment plugin 2.4.2 (TIKA 1.7).

I get an error when indexing a **specific** docx file:
> "[DEBUG][org.elasticsearch.index.mapper.attachment.AttachmentMapper] Failed to extract [-1] characters of text for [null]: [org.apache.poi.xwpf.usermodel.XWPFSDT.getContent()Lorg/apache/poi/xwpf/usermodel/ISDTContent;]"

But if I use mapper-attachment plugin 2.4.1 (TIKA 1.5), there is no error and the content is parsed successfully.

Caused by this change #94.

Closes #104.
2015-02-20 19:02:43 +01:00
David Pilato 1e0f03bc90 Remove DocValuesFormatService and PostingsFormatService
Related to elasticsearch/elasticsearch#9741

Closes #103.
2015-02-19 19:05:59 +01:00
David Pilato ec0de9c57d [Test] Now use fully qualified names for fields
We were asking for fields by short name, but elasticsearch no longer allows short names and requires fully qualified names.

```java
SearchResponse response = client().prepareSearch("test")
        .addField("content_type")
        .addField("name")
        .execute().get();
```

We now need to use:

```java
SearchResponse response = client().prepareSearch("test")
        .addField("file.content_type")
        .addField("file.name")
        .execute().get();
```

Closes #102.
2015-02-18 20:36:25 +01:00
David Pilato 400910e53e update documentation with release 2.4.2 2015-02-11 23:22:02 +01:00
David Pilato 77081e3dbf [Doc] copy_to using attachment field type
If you want to use the [copy_to](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#copy-to)
feature, you need to define it on each sub-field you want to copy to another field:

```javascript
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "file": {
            "type": "string",
            "copy_to": "copy"
          }
        }
      },
      "copy": {
        "type": "string"
      }
    }
  }
}
```

In this example, the extracted content will also be copied to the `copy` field.

Closes #97.
(cherry picked from commit f4f6b57)
(cherry picked from commit 5878a62)
2015-02-11 23:13:56 +01:00
David Pilato ec59d381b8 Upgrade Tika to 1.7
Closes #94.
(cherry picked from commit 0ab38f3)
(cherry picked from commit 96c7bb1)
2015-02-11 17:17:41 +01:00
David Pilato 931be57da9 [test] Add standalone runner
It can sometimes be useful to have a standalone runner to see exactly how Tika extracts content from a given file.

You can run the `StandaloneRunner` class with:

*  `-u file://URL/TO/YOUR/DOC`
*  `--size` to set the extracted size (defaults to the mapper attachment size)
*  `BASE64` encoded binary

Example:

```sh
StandaloneRunner BASE64Text
StandaloneRunner -u /tmp/mydoc.pdf
StandaloneRunner -u /tmp/mydoc.pdf --size 1000000
```

It produces something like:

```
## Extracted text
--------------------- BEGIN -----------------------
This is the extracted text
---------------------- END ------------------------
## Metadata
- author: null
- content_length: null
- content_type: application/pdf
- date: null
- keywords: null
- language: null
- name: null
- title: null
```

Closes #99.
(cherry picked from commit 720b3bf)
(cherry picked from commit 990fa15)
2015-02-09 17:45:07 +01:00
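The argument handling described in commit 931be57da9 could look roughly like the sketch below. It is an assumption-laden illustration, not the real `StandaloneRunner`: the class `RunnerArgs`, its fields, and the use of `-1` to mean "use the default size" are all hypothetical.

```java
import java.util.Base64;

// Hypothetical sketch of the argument handling; not the real StandaloneRunner.
final class RunnerArgs {
    final String url;     // value of -u, or null
    final String base64;  // positional BASE64 payload, or null
    final int size;       // value of --size, or -1 meaning "use the default" (assumption)

    private RunnerArgs(String url, String base64, int size) {
        this.url = url;
        this.base64 = base64;
        this.size = size;
    }

    static RunnerArgs parse(String... args) {
        String url = null;
        String base64 = null;
        int size = -1;
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-u":
                    url = requireValue(args, ++i, "-u");
                    break;
                case "--size":
                    size = Integer.parseInt(requireValue(args, ++i, "--size"));
                    break;
                default:
                    // Anything that is not an option is treated as the BASE64 payload.
                    base64 = args[i];
            }
        }
        // Exactly one input source must be given.
        if ((url == null) == (base64 == null)) {
            throw new IllegalArgumentException("pass either -u <url> or a BASE64 payload");
        }
        return new RunnerArgs(url, base64, size);
    }

    private static String requireValue(String[] args, int index, String option) {
        if (index >= args.length) {
            throw new IllegalArgumentException("missing value for " + option);
        }
        return args[index];
    }

    /** Decodes the BASE64 payload, or returns null when a URL was given instead. */
    byte[] decodedPayload() {
        return base64 == null ? null : Base64.getDecoder().decode(base64);
    }
}
```

For example, `RunnerArgs.parse("-u", "/tmp/mydoc.pdf", "--size", "1000000")` yields a URL input with an explicit size, mirroring the third example invocation in the commit message.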
David Pilato c353936b58 Add sonatype snapshot repository 2015-01-02 19:05:18 +01:00
David Pilato 33c9828385 Depend on elasticsearch-parent
To simplify plugin maintenance and provide more value in the future, we are starting to build an `elasticsearch-parent` project.
This commit is the first step for this plugin to depend on this new `pom` maven project.
2014-12-14 19:59:15 +01:00
David Pilato c338ae0dbe [Test] copyToByteArray has been removed in master 2014-12-03 18:42:14 +01:00
David Pilato e3d80af54e Test: Fix removed queryString -> queryStringQuery 2014-12-03 18:31:53 +01:00
Adrien Grand 11b1287610 Upgrade to Lucene 5.0.0-snapshot-1642891 2014-12-02 18:16:59 +01:00
Colin Goodheart-Smithe bbd4a62e50 Updated AttachmentMapper to work with new validation in ES 2.0 2014-11-28 16:04:31 +00:00
Michael McCandless abb03dc3d9 Upgrade to Lucene 5.0.0-snapshot-1641343 2014-11-24 05:51:40 -05:00
Michael McCandless 55042f0f23 Upgrade to Lucene 5.0.0-snapshot-1637347 2014-11-10 16:45:44 -05:00
Robert Muir 4c1b27f544 upgrade to lucene 5 snapshot 2014-11-05 16:48:10 -05:00
tlrx a5ed51533c update documentation with release 2.4.1 2014-11-05 20:38:24 +01:00
Jun Ohtani 94880aae3e Tests: thread leaks detected
* exclude *StandaloneTest*.class from test target
* add cleanup to MultifieldAttachementMapperTests for terminating ThreadPool
* Modify MapperTestUtils.newMapperService for adding ThreadPool

Closes #88
2014-11-03 02:22:45 +09:00
Jun Ohtani d3f2df6d62 Tests: Fix randomizedtest fail
Closes #90
2014-11-03 02:15:59 +09:00
Michael McCandless 4dae1879ad Upgrade to Lucene 4.10.2 2014-10-30 05:55:35 -04:00
David Pilato a0d7aafdac Fix test
Related to #89
2014-10-27 22:18:50 +01:00
David Pilato 92bdc23c78 Fix test
Related to #89
2014-10-27 22:13:15 +01:00
David Pilato faf34d745d Fix test
Related to #89
2014-10-27 22:08:41 +01:00
David Pilato d08e9c7080 Test: add a standalone tool which processes content
This tool is a simple main class which can be used to test what is extracted from a given binary file or from its base64 equivalent.

You can pass the BASE64 content as the first argument.

Available options:

 -u file:/URL/TO/YOUR/DOC (in place of BASE64 content)
 -s set extracted size (default to mapper attachment size)

Examples:

```
StandaloneTest BASE64Text
StandaloneTest BASE64Text -s 1000000
StandaloneTest -u /tmp/mydoc.pdf
StandaloneTest -u /tmp/mydoc.pdf -s 1000000
```

Closes #89.
2014-10-27 22:01:22 +01:00
David Pilato c3bf3b1ce9 Tests: AnalysisService constructor signature change
Due to this [change](https://github.com/elasticsearch/elasticsearch/pull/8018), we need to fix our tests for elasticsearch 1.4.0 and above.

Closes #87.

(cherry picked from commit b3b0d34)
2014-10-15 13:05:41 +02:00
David Pilato 03b47d5a4c update documentation with release 2.4.0 2014-10-08 18:50:20 +02:00
mikemccand 2ff4eb58d6 Upgrade to Lucene 4.10.1 2014-09-28 17:57:06 -04:00
Michael McCandless 67a2548441 Upgrade to Lucene 4.10.1 snapshot 2014-09-24 17:10:08 -04:00
David Pilato eef6b61806 Create branch es-1.4 for elasticsearch 1.4.0 2014-09-12 16:08:59 +02:00
David Pilato ba74fc2b5e Remove netcdf support
Sadly, the netcdf library is not compatible with the Apache2 License, so we should not package it anymore.

Users who want to use it can manually add the [netcdf libraries](http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/) to the `plugins/mapper-attachments` dir to get the support back.

Closes #84.
2014-09-08 23:51:01 +02:00
David Pilato 888d79075e Update to Lucene 4.10.0
Closes #85.
2014-09-08 23:47:15 +02:00
David Pilato 20ee711436 parseMultiField() method signature change in es 1.4 and master
As seen with https://github.com/elasticsearch/elasticsearch/pull/7474, we need to update mapper attachment plugin with this new signature.

Closes #83.
2014-09-04 11:23:09 +02:00
David Pilato c0d053d283 Update to elasticsearch 1.4
Related to #77

(cherry picked from commit ad1742a)
2014-09-01 10:26:38 +02:00
David Pilato 34fe111a2b update documentation with release 2.3.2 2014-09-01 09:53:26 +02:00
David Pilato 87b38c54eb Unable to extract text from Word documents
With issue #80 we explicitly removed the Apache POI dependency provided by Tika and replaced it with a more recent one.
Sadly we forgot to add this new dependency to the assembly so the final ZIP file does not contain POI related jars.

Closes #82.

(cherry picked from commit 49793d5)
2014-09-01 09:41:57 +02:00
David Pilato cc1a43b5c3 update documentation with release 2.3.1 2014-08-18 21:52:53 +02:00