132 Commits

Author SHA1 Message Date
David Pilato 931be57da9 [test] Add standalone runner
It can sometimes be useful to have a standalone runner to see exactly how Tika extracts content from a given file.

You can run the `StandaloneRunner` class with:

*  `-u file://URL/TO/YOUR/DOC`
*  `--size` sets the extracted size (defaults to the mapper attachment size)
*  `BASE64` encoded binary

Example:

```sh
StandaloneRunner BASE64Text
StandaloneRunner -u /tmp/mydoc.pdf
StandaloneRunner -u /tmp/mydoc.pdf --size 1000000
```

It produces something like:

```
## Extracted text
--------------------- BEGIN -----------------------
This is the extracted text
---------------------- END ------------------------
## Metadata
- author: null
- content_length: null
- content_type: application/pdf
- date: null
- keywords: null
- language: null
- name: null
- title: null
```
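
The `BASE64` argument is the base64 text form of the document's raw bytes. A minimal Python sketch of producing it, assuming the runner accepts standard single-line base64 (this is an assumption, and `encode_for_runner` is a hypothetical helper name, not part of the plugin):

```python
import base64


def encode_for_runner(data: bytes) -> str:
    """Return the base64 text form of raw document bytes.

    Assumption: the runner expects standard base64 on a single line.
    """
    return base64.b64encode(data).decode("ascii")


if __name__ == "__main__":
    # Illustrative payload; in practice, read the bytes of the file to test.
    print(encode_for_runner(b"This is the extracted text"))
```

The printed string could then be passed as the first argument in place of `BASE64Text`.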

Closes #99.
(cherry picked from commit 720b3bf)
(cherry picked from commit 990fa15)
2015-02-09 17:45:07 +01:00
David Pilato c353936b58 Add sonatype snapshot repository 2015-01-02 19:05:18 +01:00
David Pilato 33c9828385 Depend on elasticsearch-parent
To simplify plugin maintenance and provide more value in the future, we are starting to build an `elasticsearch-parent` project.
This commit is the first step for this plugin to depend on this new `pom` maven project.
2014-12-14 19:59:15 +01:00
David Pilato c338ae0dbe [Test] copyToByteArray has been removed in master 2014-12-03 18:42:14 +01:00
David Pilato e3d80af54e Test: Fix removed queryString -> queryStringQuery 2014-12-03 18:31:53 +01:00
Adrien Grand 11b1287610 Upgrade to Lucene 5.0.0-snapshot-1642891 2014-12-02 18:16:59 +01:00
Colin Goodheart-Smithe bbd4a62e50 Updated AttachmentMapper to work with new validation in ES 2.0 2014-11-28 16:04:31 +00:00
Michael McCandless abb03dc3d9 Upgrade to Lucene 5.0.0-snapshot-1641343 2014-11-24 05:51:40 -05:00
Michael McCandless 55042f0f23 Upgrade to Lucene 5.0.0-snapshot-1637347 2014-11-10 16:45:44 -05:00
Robert Muir 4c1b27f544 upgrade to lucene 5 snapshot 2014-11-05 16:48:10 -05:00
tlrx a5ed51533c update documentation with release 2.4.1 2014-11-05 20:38:24 +01:00
Jun Ohtani 94880aae3e Tests: thread leaks detected
* exclude *StandaloneTest*.class from test target
* add cleanup to MultifieldAttachementMapperTests for terminating ThreadPool
* Modify MapperTestUtils.newMapperService for adding ThreadPool

Closes #88
2014-11-03 02:22:45 +09:00
Jun Ohtani d3f2df6d62 Tests: Fix randomizedtest fail
Closes #90
2014-11-03 02:15:59 +09:00
Michael McCandless 4dae1879ad Upgrade to Lucene 4.10.2 2014-10-30 05:55:35 -04:00
David Pilato a0d7aafdac Fix test
Related to #89
2014-10-27 22:18:50 +01:00
David Pilato 92bdc23c78 Fix test
Related to #89
2014-10-27 22:13:15 +01:00
David Pilato faf34d745d Fix test
Related to #89
2014-10-27 22:08:41 +01:00
David Pilato d08e9c7080 Test: add a standalone tool which processes content
This tool is a simple main class which can be used to test what is extracted from a given binary file or from its base64 equivalent.

You can pass the BASE64 content as the first argument.

Available options:

 -u file:/URL/TO/YOUR/DOC (in place of BASE64 content)
 -s sets the extracted size (defaults to the mapper attachment size)

Examples:

```
StandaloneTest BASE64Text
StandaloneTest BASE64Text -s 1000000
StandaloneTest -u /tmp/mydoc.pdf
StandaloneTest -u /tmp/mydoc.pdf -s 1000000
```

Closes #89.
2014-10-27 22:01:22 +01:00
David Pilato c3bf3b1ce9 Tests: AnalysisService constructor signature change
Due to this [change](https://github.com/elasticsearch/elasticsearch/pull/8018), we need to fix our tests for elasticsearch 1.4.0 and above.

Closes #87.

(cherry picked from commit b3b0d34)
2014-10-15 13:05:41 +02:00
David Pilato 03b47d5a4c update documentation with release 2.4.0 2014-10-08 18:50:20 +02:00
mikemccand 2ff4eb58d6 Upgrade to Lucene 4.10.1 2014-09-28 17:57:06 -04:00
Michael McCandless 67a2548441 Upgrade to Lucene 4.10.1 snapshot 2014-09-24 17:10:08 -04:00
David Pilato eef6b61806 Create branch es-1.4 for elasticsearch 1.4.0 2014-09-12 16:08:59 +02:00
David Pilato ba74fc2b5e Remove netcdf support
Sadly, the netcdf library is not compatible with the Apache 2 License, so we should no longer package it.

Users who want to use it can manually add the [netcdf libraries](http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/) to the `plugins/mapper-attachments` dir to get the support back.

Closes #84.
2014-09-08 23:51:01 +02:00
David Pilato 888d79075e Update to Lucene 4.10.0
Closes #85.
2014-09-08 23:47:15 +02:00
David Pilato 20ee711436 parseMultiField() method signature change in es 1.4 and master
As seen with https://github.com/elasticsearch/elasticsearch/pull/7474, we need to update the mapper attachment plugin to use this new signature.

Closes #83.
2014-09-04 11:23:09 +02:00
David Pilato c0d053d283 Update to elasticsearch 1.4
Related to #77

(cherry picked from commit ad1742a)
2014-09-01 10:26:38 +02:00
David Pilato 34fe111a2b update documentation with release 2.3.2 2014-09-01 09:53:26 +02:00
David Pilato 87b38c54eb Unable to extract text from Word documents
With issue #80 we explicitly removed the Apache POI dependency provided by Tika and replaced it with a more recent one.
Sadly, we forgot to add this new dependency to the assembly, so the final ZIP file does not contain the POI-related jars.

Closes #82.

(cherry picked from commit 49793d5)
2014-09-01 09:41:57 +02:00
David Pilato cc1a43b5c3 update documentation with release 2.3.1 2014-08-18 21:52:53 +02:00
David Pilato 08454d72f6 update documentation with release 2.2.1 2014-08-18 21:39:31 +02:00
David Pilato 2b172f8ff6 Update a few dependencies
Related to #80.
2014-08-18 17:49:36 +02:00
David Pilato 5cf20331a8 Update to elasticsearch 1.4.0
Related to #77.

(cherry picked from commit 7e65cfb)
2014-08-18 15:39:19 +02:00
David Pilato 75d03621aa Update a few dependencies
Related to #80.

(cherry picked from commit 89d5460)
2014-08-18 15:37:03 +02:00
David Pilato 587e6d3da2 Docs: make the welcome page more obvious
Closes #79.
2014-08-18 12:38:03 +02:00
David Pilato f8d2975946 Update a few dependencies
Closes #80.

(cherry picked from commit 930c8be)
2014-08-18 12:27:23 +02:00
David Pilato 6edf3447b1 Remove old `content` deprecated field
In #73, we deprecated the `content` field in favor of the `_content` field.

In plugin version 2.4.0, we can now remove the old field name.

Closes #75.

(cherry picked from commit 7a0f838)
2014-07-26 00:33:50 +02:00
David Pilato e704f68525 Log tika exceptions
Currently tika exceptions are swallowed with no log message.
We'd like to be able to know when/if this occurs and for what reason.

Closes #78.

(cherry picked from commit 36b0117)
2014-07-26 00:27:49 +02:00
David Pilato ad986eb2fc Add support for multi-fields
Now that https://github.com/elasticsearch/elasticsearch/pull/6867 is merged in elasticsearch core code (branch 1.x - es 1.4),
we can support multi-fields in the mapper attachment plugin.

```
DELETE /test
PUT /test
{
  "settings": {
    "number_of_shards": 1
  }
}
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "file": {
            "type": "string",
            "fields": {
              "store": {
                "type": "string",
                "store": true
              }
            }
          },
          "content_type": {
            "type": "string",
            "fields": {
              "store": {
                "type": "string",
                "store": true
              },
              "untouched": {
                "type": "string",
                "index": "not_analyzed",
                "store": true
              }
            }
          }
        }
      }
    }
  }
}

PUT /test/person/1?refresh=true
{
  "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}

GET /test/person/_search
{
  "fields": [
    "file.store",
    "file.content_type.store"
  ],
  "aggs": {
    "store": {
      "terms": {
        "field": "file.content_type.store"
      }
    },
    "untouched": {
      "terms": {
        "field": "file.content_type.untouched"
      }
    }
  }
}
```

It gives:

```js
{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "test",
            "_type": "person",
            "_id": "1",
            "_score": 1,
            "fields": {
               "file.store": [
                  "\"God Save the Queen\" (alternatively \"God Save the King\"\n"
               ],
               "file.content_type.store": [
                  "text/plain; charset=ISO-8859-1"
               ]
            }
         }
      ]
   },
   "aggregations": {
      "store": {
         "doc_count_error_upper_bound": 0,
         "buckets": [
            {
               "key": "1",
               "doc_count": 1
            },
            {
               "key": "8859",
               "doc_count": 1
            },
            {
               "key": "charset",
               "doc_count": 1
            },
            {
               "key": "iso",
               "doc_count": 1
            },
            {
               "key": "plain",
               "doc_count": 1
            },
            {
               "key": "text",
               "doc_count": 1
            }
         ]
      },
      "untouched": {
         "doc_count_error_upper_bound": 0,
         "buckets": [
            {
               "key": "text/plain; charset=ISO-8859-1",
               "doc_count": 1
            }
         ]
      }
   }
}
```
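
The two aggregations differ because `file.content_type.store` is an analyzed `string` field, so the terms aggregation buckets its individual tokens, while `file.content_type.untouched` is `not_analyzed` and keeps the whole value as a single term. The Python sketch below only approximates the default analysis (lowercase, split on non-alphanumeric characters); it is not Lucene's actual tokenizer, and `approximate_tokens` is a hypothetical name:

```python
import re
from typing import List


def approximate_tokens(value: str) -> List[str]:
    """Rough stand-in for an analyzed string field.

    Assumption: lowercasing and splitting on non-alphanumeric characters is
    close enough to the default analyzer for this particular value.
    """
    return re.findall(r"[a-z0-9]+", value.lower())


content_type = "text/plain; charset=ISO-8859-1"

# Analyzed field: one bucket per distinct token (sorted for easy comparison).
print(sorted(set(approximate_tokens(content_type))))

# not_analyzed field: the whole value is a single term.
print([content_type])
```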

Note that using a shorter definition works as well:

```
DELETE /test
PUT /test
{
  "settings": {
    "number_of_shards": 1
  }
}
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment"
      }
    }
  }
}
PUT /test/person/1?refresh=true
{
  "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}

GET /test/person/_search
{
  "query": {
    "match": {
      "file": "king"
    }
  }
}
```

It gives:

```js
{
   "took": 53,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.095891505,
      "hits": [
         {
            "_index": "test",
            "_type": "person",
            "_id": "1",
            "_score": 0.095891505,
            "_source": {
               "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
            }
         }
      ]
   }
}
```
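
To see why the `match` query on `king` finds the document, the base64 payload can be decoded and tokenized locally. This is only a sketch: the real text extraction is done by Tika and the real tokenization by the field's analyzer, both of which are approximated here with plain Python.

```python
import base64
import re

payload = "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="

# Decode the attachment back to the text that was indexed.
text = base64.b64decode(payload).decode("utf-8")

# Assumption: lowercasing and splitting on non-alphanumerics approximates the analyzer.
tokens = re.findall(r"[a-z0-9]+", text.lower())

print(text)
print("king" in tokens)
```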

Closes #57.

(cherry picked from commit 432d7c0)
2014-07-26 00:27:28 +02:00
David Pilato 663d4eaddb Update to elasticsearch 1.4.0
Closes #77.

(cherry picked from commit c58516f)
2014-07-26 00:26:41 +02:00
David Pilato eaccd4383d Deprecate `content` in favor of `_content`
To force a value, we set it using `_field`, where `field` is the name of the field we want to force:

```
{
  "file": {
    "_name": "myfilename.txt"
  }
}
```

But to set the content itself, we use the `content` field name.

```
{
  "file": {
    "content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu",
    "_name": "myfilename.txt"
  }
}
```

For consistency, we set `_content` instead:

```
{
  "file": {
    "_content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu",
    "_name": "myfilename.txt"
  }
}
```
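
A minimal Python sketch of building such a document from raw bytes, assuming the attachment field is named `file` as in the example above (`attachment_document` is a hypothetical helper, not part of the plugin):

```python
import base64
import json


def attachment_document(field, raw, name):
    """Build an indexing body that passes the content and forces the file name."""
    return {
        field: {
            # The content is sent base64-encoded under `_content`.
            "_content": base64.b64encode(raw).decode("ascii"),
            # Forced value for the `name` field.
            "_name": name,
        }
    }


doc = attachment_document(
    "file", b"This is an elasticsearch mapper attachment test.", "myfilename.txt"
)
print(json.dumps(doc, indent=2))
```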

Closes #73.

(cherry picked from commit 2e6be20)
2014-07-25 18:15:37 +02:00
David Pilato 1d1225b87c Update to Lucene 4.9.0
Update to elasticsearch 1.3.0
Move to java 1.7

Related to #67.
Closed #76.

(cherry picked from commit 2303932)
2014-07-25 18:15:28 +02:00
David Pilato 310df36bfa SLF4J dependency version problem
This is due to `edu.ucar:netcdf` lib which comes from `tika-parsers` dependency.

```
[INFO] +- org.apache.tika:tika-parsers:jar:1.5:compile
[INFO] |  +- edu.ucar:netcdf:jar:4.2-min:compile
[INFO] |  |  \- org.slf4j:slf4j-api:jar:1.5.6:compile
```

We can exclude this library from the generated ZIP artifact.

Closes #41.
2014-06-14 18:56:14 +02:00
David Pilato 51a8f6f1a0 Fix doc typo
(cherry picked from commit f70eb1d)
2014-06-03 10:13:12 +02:00
David Pilato a3bb103297 Remove deprecated `language` forced field
With #68 we replaced the `language` field with `_language`.

We can now remove the old deprecated name.

Closes #69.
(cherry picked from commit e39f144)
2014-06-03 10:11:13 +02:00
David Pilato 94cf141108 Use `_language` field instead of `language`
When we want to force a language instead of using Tika language detection, we set the `language` field in documents.

To be consistent with other forced fields, `_content_type` and `_name`, we should prefix the `language` field with an underscore `_`.

So `language` becomes `_language`.

We first deprecate `language` in version 2.1.0 and we remove it in 2.3.0.
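
For client code that still sends the deprecated key, the rename could be applied before indexing. The helper below is a hypothetical client-side sketch, not something the plugin provides, and the `en` value is only illustrative:

```python
def use_underscore_language(attachment: dict) -> dict:
    """Return a copy with the deprecated `language` key renamed to `_language`."""
    migrated = dict(attachment)
    if "language" in migrated:
        value = migrated.pop("language")
        # An explicit `_language`, if already present, takes precedence.
        migrated.setdefault("_language", value)
    return migrated


print(use_underscore_language({"_name": "myfilename.txt", "language": "en"}))
```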

Closes #68.

(cherry picked from commit 2f46343)
2014-06-03 10:10:49 +02:00
David Pilato 7c1c2011bc Update to elasticsearch 1.3.0
Closes #67.
(cherry picked from commit d3eaac9)
2014-06-03 09:49:41 +02:00
David Pilato c0e7795f1f Update to elasticsearch 1.2.0
Closes #66.
(cherry picked from commit fb3b288)
2014-06-03 09:49:13 +02:00
David Pilato 4b35501cf3 Setting "_content_type" in indexing request has no effect
Example below. I set the type as text/plain but it is identified as text/html.

```sh
#!/bin/sh

echo "\n\n Delete testidx \n"
curl -XDELETE "http://localhost:9200/testidx"

echo "\n\n Create index and mapping \n"
curl -XPUT "http://localhost:9200/testidx" -d'
{
  "mappings": {
    "session": {
      "properties": {
        "Content": {
          "properties": {
            "content": {
              "type": "attachment",
              "path": "full",
              "store": "yes",
              "fields": {
                "content": {
                  "type": "string",
                  "store": "yes"
                },
                "author": {
                  "type": "string",
                  "store": "yes"
                },
                "title": {
                  "type": "string",
                  "store": "yes"
                },
                "name": {
                  "type": "string",
                  "store": "yes"
                },
                "date": {
                  "type": "date",
                  "format": "dateOptionalTime",
                  "store": "yes"
                },
                "keywords": {
                  "type": "string",
                  "store": "yes"
                },
                "content_type": {
                  "type": "string",
                  "store": "yes"
                },
                "content_length": {
                  "type": "integer",
                  "store": "yes"
                }
              }
            }
          }
        }
      }
    }
  }
}'

echo "\n\n Index document \n"
curl -XPOST "http://localhost:9200/_bulk" -d'
  {"index":{"_index":"testidx","_type":"session"}}
  {"Content":[{"_content_type":"text/plain","content":"BASE64ENCODED_CONTENT"}]}
'

echo "\n\n Refresh \n"
curl -XPOST "http://localhost:9200/testidx/_refresh"

echo "\n\n Get doc type \n"
curl -XPOST "http://localhost:9200/testidx/_search?pretty" -d'
{
  "fields": ["Content.content.content_type","Content.content.content_length","Content.content"]
}'
```
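
The `_bulk` body in the script is newline-delimited JSON: one action line followed by one source line per document, with a newline after the last line. A small Python sketch of building the same body (`bulk_body` is a hypothetical helper; the index, type and placeholder content are taken from the script above):

```python
import json


def bulk_body(index, doc_type, documents):
    """Serialize documents as newline-delimited JSON for the `_bulk` endpoint."""
    lines = []
    for document in documents:
        # Action line, then the document source on its own line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(document))
    # The bulk API requires the final line to end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body(
    "testidx",
    "session",
    [{"Content": [{"_content_type": "text/plain", "content": "BASE64ENCODED_CONTENT"}]}],
)
print(body, end="")
```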

Closes #65.
(cherry picked from commit 38075dc)
2014-06-03 09:36:10 +02:00
David Pilato 7f8143ff12 Add highlighting documentation
Closes #54.
(cherry picked from commit efdf8ef)
2014-06-03 09:35:05 +02:00