162 Commits

Author SHA1 Message Date
Ryan Ernst 765afb655e Fix attachment mapper to expose subfields.
The content mapper is now a true subfield. There is limited backcompat
support for the previous behavior of indexing and querying as the main
field name. While indexes created before ES 2.0 can still be read, the
content must now be queried with `FIELDNAME.content`.
2015-05-11 23:21:40 -07:00
David Pilato 1c030f6f75 Update to Tika 1.8
Tika 1.8 has been released. See https://dist.apache.org/repos/dist/release/tika/CHANGES-1.8.txt

We can replace:

```java
public static boolean isLocaleCompatible() {
    String language = Locale.getDefault().getLanguage();
    boolean acceptedLocale = true;

    if (
            // We can have issues with JDK7 Patch < 80
            (JVM_MAJOR_VERSION == 1 && JVM_MINOR_VERSION == 7 && JVM_PATCH_MAJOR_VERSION == 0 && JVM_PATCH_MINOR_VERSION < 80) ||
            // We can have issues with JDK8 Patch < 40
            (JVM_MAJOR_VERSION == 1 && JVM_MINOR_VERSION == 8 && JVM_PATCH_MAJOR_VERSION == 0 && JVM_PATCH_MINOR_VERSION < 40)
    ) {
        if (language.equalsIgnoreCase("tr") || language.equalsIgnoreCase("az")) {
            acceptedLocale = false;
        }
    }

    return acceptedLocale;
}
```

by

```java
public static boolean isLocaleCompatible() {
    return true;
}
```

Related to https://issues.apache.org/jira/browse/TIKA-1526 and #105

Note that Content-type has changed a bit and now returns something like `application/xhtml+xml; charset=ISO-8859-1` instead of `application/xhtml+xml`.
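
Code that compares the content type against a fixed value may need to ignore the new parameters. The following is a minimal sketch, not part of this commit: the class `ContentTypes` and the method `baseType` are hypothetical names introduced only for illustration.

```java
import java.util.Locale;

public class ContentTypes {

    /**
     * Returns the media type without parameters such as "; charset=...".
     * Hypothetical helper, not part of the plugin or of Tika.
     */
    public static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        // Media type names are case-insensitive, so normalize with a fixed locale.
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(baseType("application/xhtml+xml; charset=ISO-8859-1"));
        System.out.println(baseType("application/xhtml+xml"));
    }
}
```

Both calls in `main` yield the same base type, so a comparison written against the old value keeps working.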

Closes #112.
(cherry picked from commit bf4af47971ed07bfa126409413c435f121444c3c)
2015-05-07 10:17:59 +02:00
Robert Muir 2f457111f9 Merge pull request #128 from elastic/get_past_mapping_changes
Fix the build and try to migrate past mappings changes
2015-05-05 15:10:39 -07:00
Ryan Ernst a83f98d018 fix standalone and remove unnecessary override 2015-05-05 15:08:30 -07:00
Robert Muir 7ce86d95fe Fix the build and try to migrate past mappings changes, but there
is an @AwaitsFix test remaining with regards to copyTo behavior.
2015-05-05 13:43:31 -04:00
David Pilato 65a83e63d3 Mappings: Simplified mapper lookups
Due to https://github.com/elastic/elasticsearch/pull/10705

We need to adapt the mapper attachment plugin to version 2.0.0

Closes #125.
2015-04-25 16:29:35 +02:00
David Pilato 7e2a9dbf0c update documentation with release 2.5.0 2015-03-31 17:59:01 +02:00
David Pilato d2c02b19fc Don't wrap exceptions in `MapperParsingException`
Some exceptions might not be serializable. It would be safer not to wrap them in a `MapperParsingException` but just create the `MapperParsingException`.
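
The difference can be sketched as follows, assuming exceptions are passed through Java serialization. `ParsingException` is a hypothetical stand-in for `MapperParsingException`, and `LibraryException` is an invented example of an exception that holds a non-serializable field; neither is the plugin's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class ExceptionWrapping {

    /** Hypothetical stand-in for MapperParsingException. */
    public static class ParsingException extends RuntimeException {
        public ParsingException(String message) { super(message); }
        public ParsingException(String message, Throwable cause) { super(message, cause); }
    }

    /** Invented example: an exception that carries a non-serializable field. */
    public static class LibraryException extends RuntimeException {
        private final Object handle = new Object(); // java.lang.Object is not Serializable
        public LibraryException(String message) { super(message); }
    }

    /** Wrapping keeps the original exception as the cause, so it is serialized too. */
    public static ParsingException wrap(Throwable t) {
        return new ParsingException("Failed to parse", t);
    }

    /** Creating a fresh exception keeps only the message of the original one. */
    public static ParsingException describe(Throwable t) {
        return new ParsingException("Failed to parse: " + t.getMessage());
    }

    /** Returns true if the object can be written with Java serialization. */
    public static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        LibraryException original = new LibraryException("boom");
        System.out.println("wrapped:   " + canSerialize(wrap(original)));
        System.out.println("described: " + canSerialize(describe(original)));
    }
}
```

The trade-off is that the original stack trace is lost, so the message has to carry enough detail on its own.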

Related to #113.
(cherry picked from commit e58878c)
(cherry picked from commit a673185)
2015-03-31 14:41:47 +02:00
David Pilato cbad7dce76 Cleanup: Remove unsafe field in BytesStreamInput
Related to https://github.com/elastic/elasticsearch/pull/10157

BytesStreamInput no longer supports `BytesStreamInput(byte[], boolean)`

Closes #120.
2015-03-30 15:00:50 +02:00
David Pilato 3154510fad Update owner to elastic
Fix typo in previous commit
(cherry picked from commit 5303bc0)
(cherry picked from commit d3dab9b)
(cherry picked from commit 3ace2bb)
2015-03-30 14:53:58 +02:00
Robert Muir 51da1fe2a1 parse java.specification.version not java.version, so that it is robust 2015-03-30 14:50:18 +02:00
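
As an illustration of why `java.specification.version` is easier to parse: it holds a short value such as `1.7`, `1.8` or `11`, while `java.version` can carry update and build suffixes such as `1.7.0_60`. The sketch below is an assumption about how such parsing could look, not the code from this commit; `JavaSpecVersion` and `feature` are hypothetical names.

```java
public class JavaSpecVersion {

    /**
     * Maps a specification version string to a single feature number:
     * "1.7" -> 7, "1.8" -> 8, "9" -> 9, "11" -> 11.
     * Hypothetical helper, not the plugin's implementation.
     */
    public static int feature(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Older JVMs report "1.x"; newer ones report the feature number directly.
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println(spec + " -> " + feature(spec));
    }
}
```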
David Pilato d20c8861ca Update owner to elastic
(cherry picked from commit c4d60ed)
(cherry picked from commit 450d088)
2015-03-30 11:36:10 +02:00
David Pilato 9f6519f84a Move parent after artifact coordinates 2015-03-30 11:35:54 +02:00
Robert Muir 021483626c Merge pull request #117 from rmuir/exclude-jhighlight
Exclude jhighlight dependency, which contains LGPL-only files
2015-03-20 14:55:00 -04:00
Robert Muir 977a7247c7 Exclude jhighlight dependency, which contains LGPL-only files 2015-03-20 14:42:55 -04:00
David Pilato e08ebe9efa create `es-1.5` branch 2015-03-16 16:52:08 -07:00
David Pilato 208c76e45e [Test] Fix remaining static objects after running tests
Test framework detects when static objects are not released when running tests.
This commit removes usage of static objects when possible.
2015-02-23 17:46:28 +01:00
David Pilato d4d54fe744 update documentation with release 2.4.3 2015-02-23 16:56:39 +01:00
David Pilato cfd83443f1 Add test for asciidoc format
Related to #29.
2015-02-23 16:43:45 +01:00
David Pilato 4f65664916 Tika might fail depending on the Locale
Tika might fail with some Locale under some JVMs. We now check that this won't happen before creating a Tika instance.
If the check fails, it generates a `WARN` in the logs like:

```
Tika can not be initialized with the current Locale [tr] on the current JVM [1.7.0_60]
```

To check that Tika is not initialized, you can run the test suite with:

```sh
mvn test -Dtests.output=always -Dtests.locale=tr
```

Closes #105.
(cherry picked from commit d6d63f7)
(cherry picked from commit 532bdf7)
2015-02-23 14:49:08 +01:00
David Pilato a10c35f0eb [Test] move test package to o.e.index.mapper.attachment
Our package naming for tests is inconsistent.
We should move tests from:

* `o.e.index.mapper.xcontent` to  `o.e.index.mapper.attachment.test.unit`
* `o.e.plugin.mapper.attachments.test` to  `o.e.index.mapper.attachment.test.integration`
* `StandaloneRunner` class to  `o.e.index.mapper.attachment.test.standalone`

Also rename resource dirs to match the test name so it's definitely easier to find mappings used for each test.

Closes #110.
2015-02-23 11:54:38 +01:00
David Pilato 4ffa06d773 [Doc] highlighting example is incorrect
Closes #107.
2015-02-23 11:10:50 +01:00
David Pilato 6d77b085eb [Test] Add highlighting test
Closes #108.

(cherry picked from commit 2c96550)
(cherry picked from commit 440e534)
2015-02-23 11:10:49 +01:00
David Pilato 36344ac8b8 [Internal] Fix field mappers to always pass through index settings
Because of https://github.com/elasticsearch/elasticsearch/pull/9780, we now need to pass index settings instead of empty settings.

Closes #109.
2015-02-23 11:04:35 +01:00
David Pilato c3c9f66d0d Indexing docx file fails
I use ElasticSearch 1.4.3 with mapper-attachment plugin 2.4.2 (TIKA 1.7).

I get an error when indexing a **specific** docx file:
> "[DEBUG][org.elasticsearch.index.mapper.attachment.AttachmentMapper] Failed to extract [-1] characters of text for [null]: [org.apache.poi.xwpf.usermodel.XWPFSDT.getContent()Lorg/apache/poi/xwpf/usermodel/ISDTContent;]"

But if I use mapper-attachment plugin 2.4.1 (TIKA 1.5), there is no error and the content is parsed successfully.

Caused by this change #94.

Closes #104.
2015-02-20 19:02:43 +01:00
David Pilato 1e0f03bc90 Remove DocValuesFormatService and PostingsFormatService
Related to elasticsearch/elasticsearch#9741

Closes #103.
2015-02-19 19:05:59 +01:00
David Pilato ec0de9c57d [Test] Use fully qualified names for fields
We were asking for fields by short name, but elasticsearch no longer allows short names and requires fully qualified names.

```java
SearchResponse response = client().prepareSearch("test")
        .addField("content_type")
        .addField("name")
        .execute().get();
```

We now need to use:

```java
SearchResponse response = client().prepareSearch("test")
        .addField("file.content_type")
        .addField("file.name")
        .execute().get();
```

Closes #102.
2015-02-18 20:36:25 +01:00
David Pilato 400910e53e update documentation with release 2.4.2 2015-02-11 23:22:02 +01:00
David Pilato 77081e3dbf [Doc] copy_to using attachment field type
If you want to use the [copy_to](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#copy-to)
feature, you need to define it on each sub-field you want to copy to another field:

```javascript
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "file": {
            "type": "string",
            "copy_to": "copy"
          }
        }
      },
      "copy": {
        "type": "string"
      }
    }
  }
}
```

In this example, the extracted content will also be copied to the `copy` field.

Closes #97.
(cherry picked from commit f4f6b57)
(cherry picked from commit 5878a62)
2015-02-11 23:13:56 +01:00
David Pilato ec59d381b8 Upgrade Tika to 1.7
Closes #94.
(cherry picked from commit 0ab38f3)
(cherry picked from commit 96c7bb1)
2015-02-11 17:17:41 +01:00
David Pilato 931be57da9 [test] Add standalone runner
It can sometimes be useful to have a standalone runner to see exactly how Tika extracts content from a given file.

You can run `StandaloneRunner` class using:

*  `-u file://URL/TO/YOUR/DOC`
*  `--size` set extracted size (defaults to mapper attachment size)
*  `BASE64` encoded binary

Example:

```sh
StandaloneRunner BASE64Text
StandaloneRunner -u /tmp/mydoc.pdf
StandaloneRunner -u /tmp/mydoc.pdf --size 1000000
```

It produces something like:

```
## Extracted text
--------------------- BEGIN -----------------------
This is the extracted text
---------------------- END ------------------------
## Metadata
- author: null
- content_length: null
- content_type: application/pdf
- date: null
- keywords: null
- language: null
- name: null
- title: null
```
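
The `BASE64` argument can be produced from any binary file. The sketch below is one way to do that with the JDK's `java.util.Base64` (Java 8 or later, which is an assumption about the environment); the class name `Base64Arg` is hypothetical and not part of the plugin.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class Base64Arg {

    /** Encodes raw bytes as a standard Base64 string. */
    public static String encode(byte[] content) {
        return Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) throws IOException {
        // Read the file given as first argument, or fall back to a small sample text.
        byte[] content = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "This is the extracted text".getBytes(StandardCharsets.UTF_8);
        System.out.println(encode(content));
    }
}
```

The printed string can then be passed as the single positional argument to `StandaloneRunner`.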

Closes #99.
(cherry picked from commit 720b3bf)
(cherry picked from commit 990fa15)
2015-02-09 17:45:07 +01:00
David Pilato c353936b58 Add sonatype snapshot repository 2015-01-02 19:05:18 +01:00
David Pilato 33c9828385 Depend on elasticsearch-parent
To simplify plugins maintenance and provide more value in the future, we are starting to build an `elasticsearch-parent` project.
This commit is the first step for this plugin to depend on this new `pom` maven project.
2014-12-14 19:59:15 +01:00
David Pilato c338ae0dbe [Test] copyToByteArray has been removed in master 2014-12-03 18:42:14 +01:00
David Pilato e3d80af54e Test: Fix removed queryString -> queryStringQuery 2014-12-03 18:31:53 +01:00
Adrien Grand 11b1287610 Upgrade to Lucene 5.0.0-snapshot-1642891 2014-12-02 18:16:59 +01:00
Colin Goodheart-Smithe bbd4a62e50 Updated AttachmentMapper to work with new validation in ES 2.0 2014-11-28 16:04:31 +00:00
Michael McCandless abb03dc3d9 Upgrade to Lucene 5.0.0-snapshot-1641343 2014-11-24 05:51:40 -05:00
Michael McCandless 55042f0f23 Upgrade to Lucene 5.0.0-snapshot-1637347 2014-11-10 16:45:44 -05:00
Robert Muir 4c1b27f544 upgrade to lucene 5 snapshot 2014-11-05 16:48:10 -05:00
tlrx a5ed51533c update documentation with release 2.4.1 2014-11-05 20:38:24 +01:00
Jun Ohtani 94880aae3e Tests: thread leaks detected
* exclude *StandaloneTest*.class from test target
* add cleanup to MultifieldAttachementMapperTests for terminating ThreadPool
* Modify MapperTestUtils.newMapperService for adding ThreadPool

Closes #88
2014-11-03 02:22:45 +09:00
Jun Ohtani d3f2df6d62 Tests: Fix randomizedtest fail
Closes #90
2014-11-03 02:15:59 +09:00
Michael McCandless 4dae1879ad Upgrade to Lucene 4.10.2 2014-10-30 05:55:35 -04:00
David Pilato a0d7aafdac Fix test
Related to #89
2014-10-27 22:18:50 +01:00
David Pilato 92bdc23c78 Fix test
Related to #89
2014-10-27 22:13:15 +01:00
David Pilato faf34d745d Fix test
Related to #89
2014-10-27 22:08:41 +01:00
David Pilato d08e9c7080 Test: add a standalone tool which processes content
This tool is a simple main class which can be used to test what is extracted from a given binary file or from its base64 equivalent.

You can pass the BASE64 content as the first argument.

Available options:

 -u file:/URL/TO/YOUR/DOC (in place of BASE64 content)
 -s set extracted size (defaults to mapper attachment size)

Examples:

```
StandaloneTest BASE64Text
StandaloneTest BASE64Text -s 1000000
StandaloneTest -u /tmp/mydoc.pdf
StandaloneTest -u /tmp/mydoc.pdf -s 1000000
```

Closes #89.
2014-10-27 22:01:22 +01:00
David Pilato c3bf3b1ce9 Tests: AnalysisService constructor signature change
Due to this [change](https://github.com/elasticsearch/elasticsearch/pull/8018), we need to fix our tests for elasticsearch 1.4.0 and above.

Closes #87.

(cherry picked from commit b3b0d34)
2014-10-15 13:05:41 +02:00
David Pilato 03b47d5a4c update documentation with release 2.4.0 2014-10-08 18:50:20 +02:00