Commit Graph

44 Commits

Author SHA1 Message Date
David Pilato 20ee711436 parseMultiField() method signature change in es 1.4 and master
As seen in https://github.com/elasticsearch/elasticsearch/pull/7474, we need to update the mapper attachment plugin to match this new signature.

 Closes #83.
2014-09-04 11:23:09 +02:00
David Pilato c0d053d283 Update to elasticsearch 1.4
Related to #77

(cherry picked from commit ad1742a)
2014-09-01 10:26:38 +02:00
David Pilato 87b38c54eb Unable to extract text from Word documents
With issue #80 we explicitly removed the Apache POI dependency provided by Tika and replaced it with a more recent one.
Sadly, we forgot to add this new dependency to the assembly, so the final ZIP file does not contain the POI-related jars.

Closes #82.

(cherry picked from commit 49793d5)
2014-09-01 09:41:57 +02:00
David Pilato 5cf20331a8 Update to elasticsearch 1.4.0
Related to #77.

(cherry picked from commit 7e65cfb)
2014-08-18 15:39:19 +02:00
David Pilato 6edf3447b1 Remove old deprecated `content` field
In #73, we deprecated the `content` field in favor of the `_content` field.

In plugin version 2.4.0, we can now remove the old field name.

Closes #75.

(cherry picked from commit 7a0f838)
2014-07-26 00:33:50 +02:00
David Pilato e704f68525 Log tika exceptions
Currently, Tika exceptions are swallowed with no log message.
We'd like to be able to know when/if this occurs and for what reason.
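For illustration, the change amounts to something like the following (a sketch, assuming the mapper holds an `ESLogger`; names are illustrative, not the exact committed code):

```java
String parsedContent = "";
try {
    parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
} catch (Throwable e) {
    // Previously the failure left no trace; log it before moving on.
    logger.debug("Failed to extract text for [{}]", e, name);
}
```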

Closes #78.

(cherry picked from commit 36b0117)
2014-07-26 00:27:49 +02:00
David Pilato ad986eb2fc Add support for multi-fields
Now that https://github.com/elasticsearch/elasticsearch/pull/6867 is merged into elasticsearch core (branch 1.x, es 1.4),
we can support multi-fields in the mapper attachment plugin.

```
DELETE /test
PUT /test
{
  "settings": {
    "number_of_shards": 1
  }
}
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "file": {
            "type": "string",
            "fields": {
              "store": {
                "type": "string",
                "store": true
              }
            }
          },
          "content_type": {
            "type": "string",
            "fields": {
              "store": {
                "type": "string",
                "store": true
              },
              "untouched": {
                "type": "string",
                "index": "not_analyzed",
                "store": true
              }
            }
          }
        }
      }
    }
  }
}

PUT /test/person/1?refresh=true
{
  "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}

GET /test/person/_search
{
  "fields": [
    "file.store",
    "file.content_type.store"
  ],
  "aggs": {
    "store": {
      "terms": {
        "field": "file.content_type.store"
      }
    },
    "untouched": {
      "terms": {
        "field": "file.content_type.untouched"
      }
    }
  }
}
```

It gives:

```js
{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "test",
            "_type": "person",
            "_id": "1",
            "_score": 1,
            "fields": {
               "file.store": [
                  "\"God Save the Queen\" (alternatively \"God Save the King\"\n"
               ],
               "file.content_type.store": [
                  "text/plain; charset=ISO-8859-1"
               ]
            }
         }
      ]
   },
   "aggregations": {
      "store": {
         "doc_count_error_upper_bound": 0,
         "buckets": [
            {
               "key": "1",
               "doc_count": 1
            },
            {
               "key": "8859",
               "doc_count": 1
            },
            {
               "key": "charset",
               "doc_count": 1
            },
            {
               "key": "iso",
               "doc_count": 1
            },
            {
               "key": "plain",
               "doc_count": 1
            },
            {
               "key": "text",
               "doc_count": 1
            }
         ]
      },
      "untouched": {
         "doc_count_error_upper_bound": 0,
         "buckets": [
            {
               "key": "text/plain; charset=ISO-8859-1",
               "doc_count": 1
            }
         ]
      }
   }
}
```

Note that using a shorter definition works as well:

```
DELETE /test
PUT /test
{
  "settings": {
    "number_of_shards": 1
  }
}
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment"
      }
    }
  }
}
PUT /test/person/1?refresh=true
{
  "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}

GET /test/person/_search
{
  "query": {
    "match": {
      "file": "king"
    }
  }
}
```

which gives:

```js
{
   "took": 53,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.095891505,
      "hits": [
         {
            "_index": "test",
            "_type": "person",
            "_id": "1",
            "_score": 0.095891505,
            "_source": {
               "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
            }
         }
      ]
   }
}
```

Closes #57.

(cherry picked from commit 432d7c0)
2014-07-26 00:27:28 +02:00
David Pilato eaccd4383d Deprecate `content` in favor of `_content`
When we want to force some values, we set them using `_field`, where `field` is the name of the field we want to force:

```
{
  "file": {
    "_name": "myfilename.txt"
  }
}
```

But to set the content itself, we use the `content` field name:

```
{
  "file": {
    "content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu",
    "_name": "myfilename.txt"
  }
}
```

For consistency, we set `_content` instead:

```
{
  "file": {
    "_content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu",
    "_name": "myfilename.txt"
  }
}
```

Closes #73.

(cherry picked from commit 2e6be20)
2014-07-25 18:15:37 +02:00
David Pilato 1d1225b87c Update to Lucene 4.9.0
Update to elasticsearch 1.3.0
Move to java 1.7

Related to #67.
Closes #76.

(cherry picked from commit 2303932)
2014-07-25 18:15:28 +02:00
David Pilato 310df36bfa SLF4J dependency version problem
This is due to the `edu.ucar:netcdf` lib, which comes from the `tika-parsers` dependency.

```
[INFO] +- org.apache.tika:tika-parsers:jar:1.5:compile
[INFO] |  +- edu.ucar:netcdf:jar:4.2-min:compile
[INFO] |  |  \- org.slf4j:slf4j-api:jar:1.5.6:compile
```

We can exclude this library from the generated ZIP artifact.

Closes #41.
2014-06-14 18:56:14 +02:00
David Pilato a3bb103297 Remove deprecated `language` forced field
With #68 we replaced the `language` field with `_language`.

We can now remove the old deprecated name.

Closes #69.
(cherry picked from commit e39f144)
2014-06-03 10:11:13 +02:00
David Pilato 94cf141108 Use `_language` field instead of `language`
When we want to force a language instead of using Tika language detection, we set the `language` field in documents.

To be consistent with the other forced fields, `_content_type` and `_name`, we should prefix the `language` field with an underscore `_`.

So `language` becomes `_language`.

We first deprecate `language` in version 2.1.0 and will remove it in 2.3.0.
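For illustration, forcing the language from the Java API could look like this (a sketch using the standard `XContentBuilder`; only the `_language` field name comes from this change):

```java
XContentBuilder doc = XContentFactory.jsonBuilder()
    .startObject()
        .startObject("file")
            .field("_language", "en")        // was "language" before this change
            .field("_content", base64Content)
        .endObject()
    .endObject();
```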

 Closes #68.

(cherry picked from commit 2f46343)
2014-06-03 10:10:49 +02:00
David Pilato 4b35501cf3 Setting "_content_type" in indexing request has no effect
Example below: I set the type to text/plain, but it is identified as text/html.

```sh
#!/bin/sh

echo "\n\n Delete testidx \n"
curl -XDELETE "http://localhost:9200/testidx"

echo "\n\n Create index and mapping \n"
curl -XPUT "http://localhost:9200/testidx" -d'
{
  "mappings": {
    "session": {
      "properties": {
        "Content": {
          "properties": {
            "content": {
              "type": "attachment",
              "path": "full",
              "store": "yes",
              "fields": {
                "content": {
                  "type": "string",
                  "store": "yes"
                },
                "author": {
                  "type": "string",
                  "store": "yes"
                },
                "title": {
                  "type": "string",
                  "store": "yes"
                },
                "name": {
                  "type": "string",
                  "store": "yes"
                },
                "date": {
                  "type": "date",
                  "format": "dateOptionalTime",
                  "store": "yes"
                },
                "keywords": {
                  "type": "string",
                  "store": "yes"
                },
                "content_type": {
                  "type": "string",
                  "store": "yes"
                },
                "content_length": {
                  "type": "integer",
                  "store": "yes"
                }
              }
            }
          }
        }
      }
    }
  }
}'

echo "\n\n Index document \n"
curl -XPOST "http://localhost:9200/_bulk" -d'
  {"index":{"_index":"testidx","_type":"session"}}
  {"Content":[{"_content_type":"text/plain","content":"BASE64ENCODED_CONTENT"}]}
'

echo "\n\n Refresh \n"
curl -XPOST "http://localhost:9200/testidx/_refresh"

echo "\n\n Get doc type \n"
curl -XPOST "http://localhost:9200/testidx/_search?pretty" -d'
{
  "fields": ["Content.content.content_type","Content.content.content_length","Content.content"]
}'
```

Closes #65.
(cherry picked from commit 38075dc)
2014-06-03 09:36:10 +02:00
David Pilato 4d63130a23 Update to elasticsearch 2.0.0 / Lucene 4.8.1 2014-06-03 09:34:31 +02:00
David Pilato e95bb18edb Create branches according to elasticsearch versions
We create branches:

* es-0.90 for elasticsearch 0.90
* es-1.0 for elasticsearch 1.0
* es-1.1 for elasticsearch 1.1
* master for elasticsearch master

We also check before releasing that we don't have a dependency on an elasticsearch SNAPSHOT version.

Add links to each version in the documentation.
2014-03-28 17:47:38 +01:00
Richard Louapre 3d15cb0484 Add language detection option
Based on PR #45, we add a new language detection option using the language detection feature available in Tika:
https://tika.apache.org/1.4/detection.html#Language_Detection

By default, language detection is disabled (`false`), as it comes with a cost.
This default can be changed with the `index.mapping.attachment.detect_language` setting.
It can also be provided per indexed document using the `_detect_language` parameter.
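Under the hood this relies on Tika's `LanguageIdentifier`; a minimal sketch of the detection step (illustrative, not the exact plugin wiring):

```java
import org.apache.tika.language.LanguageIdentifier;

String detectLanguage(String parsedContent) {
    // Returns an ISO 639-1 code such as "en" or "fr"
    return new LanguageIdentifier(parsedContent).getLanguage();
}
```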

Closes #45.
Closes #44.
2014-03-25 18:26:09 +01:00
David Pilato b8d7f17951 Update to elasticsearch 1.0.0
Closes #60.
2014-03-19 23:14:39 +01:00
David Pilato 1b7daafeac Add plugin version in es-plugin.properties
Closes #59.
2014-03-19 23:09:37 +01:00
David Pilato 054d1acf3a Update to elasticsearch 1.0.0.RC1
Related to #48.
2014-01-15 23:37:43 +01:00
David Pilato b877f1bd4f Update to elasticsearch 1.0.0.RC1
Closes #48.
2014-01-14 14:51:32 +01:00
David Pilato 2b4f875731 Move tests to elasticsearch test framework
Closes #49.
2014-01-13 23:18:04 +01:00
David Pilato f8f647dea9 update headers 2014-01-13 22:31:14 +01:00
David Pilato b35ad804df Ignore encrypted documents
Original request:
        I am sending multiple pdf, word etc. attachments in one document to be indexed.

        Some of them (pdf) are encrypted and I am getting a MapperParsingException caused by org.apache.tika.exception.TikaException: Unable to extract PDF content, caused by
        org.apache.pdfbox.exceptions.WrappedIOException: Error decrypting document.

        I was wondering if the attachment mapper could expose some switch to ignore the documents it can not extract?

As we now have the `ignore_errors` option, we can support this. See #38 for details on this option.
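A sketch of how the `ignore_errors` switch can cover this case (illustrative; `ignoreErrors` would be driven by the `index.mapping.attachment.ignore_errors` setting):

```java
String parsedContent;
try {
    parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
} catch (Throwable e) {
    if (ignoreErrors) {
        return; // skip the attachment fields but keep the rest of the document
    }
    throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
}
```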

Closes #18.
2013-08-20 18:31:06 +02:00
David Pilato d6aa2f0615 Tika may "hang up" client application
Sometimes Tika may crash while parsing some files. In this case it may generate plain runtime errors (Throwable), not a TikaException.
But there is no `catch` clause for Throwable in AttachmentMapper.java:

```java
String parsedContent;
try {
    // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
    parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
} catch (TikaException e) {
    throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
}
```

As a result, `tika()` may "hang up" the whole application.
(We have some pdf files that "hang up" the Elasticsearch client if you try to parse them using the mapper-attachment plugin.)

We propose the following fix:

```java
String parsedContent;
try {
    // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
    parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
} catch (Throwable e) {
    throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
}
```

(just replace `TikaException` with `Throwable`; it works for our cases)

Thank you!
Closes #21.
2013-08-20 17:28:11 +02:00
David Pilato 0fff26f2bf Mapper plugin overwrite date mapping
If you define some specific mapping for your file content, such as the following:

```javascript
{
    "person": {
        "properties": {
            "file": {
                "type": "attachment",
                "path": "full",
                "fields": {
                    "date": { "type": "string" }
                }
            }
        }
    }
}
```

And then, if you ask back the mapping, you get:

```javascript
{
   "person":{
      "properties":{
         "file":{
            "type":"attachment",
            "path":"full",
            "fields":{
               "file":{
                  "type":"string"
               },
               "author":{
                  "type":"string"
               },
               "title":{
                  "type":"string"
               },
               "name":{
                  "type":"string"
               },
               "date":{
                  "type":"date",
                  "format":"dateOptionalTime"
               },
               "keywords":{
                  "type":"string"
               },
               "content_type":{
                  "type":"string"
               }
            }
         }
      }
   }
}
```

All your settings have been overwritten by the mapper plugin.

See also issue #22 where the issue was found.

Closes #39.
2013-08-20 16:35:52 +02:00
David Pilato 8c340535d2 Add content_length metadata
We now generate a `content_length` field based on the file size.
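A sketch of the idea, assuming Tika's standard `Metadata.CONTENT_LENGTH` key (illustrative, not necessarily the exact committed code):

```java
// Fall back to the raw byte length when the parser did not report one
if (metadata.get(Metadata.CONTENT_LENGTH) == null) {
    metadata.set(Metadata.CONTENT_LENGTH, Integer.toString(content.length));
}
```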
Closes #26.
2013-08-20 16:03:31 +02:00
David Pilato 406e295c6c In the test for #38, we should check the real file name as we have it :-). 2013-08-20 12:34:33 +02:00
Frédéric Camblor 019d0f9a26 Don't reject full document in case of invalid metadata
From the original PR #17 by @fcamblor.

If you try to index a document with invalid metadata, the full document is rejected.

For example:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html lang="fr">
<head>
<title>Hello</title>
<meta name="date" content="">
<meta name="Author" content="kimchy">
<meta name="Keywords" content="elasticsearch,cool,bonsai">
</head>
<body>World</body>
</html>
```

has a non-parseable date.

This fix adds a new option that ignores parsing errors: `"index.mapping.attachment.ignore_errors": true` (defaults to `true`).
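Reading such a setting from the plugin side typically looks like this (a hedged sketch; the variable name is illustrative):

```java
boolean ignoreErrors = settings.getAsBoolean("index.mapping.attachment.ignore_errors", true);
```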

Closes #17, #38.
2013-08-20 12:26:49 +02:00
David Pilato d7a2e7e2ff Mapper plugin overwrites multifield mapping
If you define some specific mapping for your file content, such as the following:

```javascript
{
    "person": {
        "properties": {
            "file": {
                "type": "attachment",
                "path": "full",
                "fields": {
                    "file": {
                        "type": "multifield",
                        "fields": {
                            "file": { "type": "string" },
                            "suggest": { "type": "string" }
                        }
                    }
                }
            }
        }
    }
}
```

And then, if you ask back the mapping, you get:

```javascript
{
   "person":{
      "properties":{
         "file":{
            "type":"attachment",
            "path":"full",
            "fields":{
               "file":{
                  "type":"string"
               },
               "author":{
                  "type":"string"
               },
               "title":{
                  "type":"string"
               },
               "name":{
                  "type":"string"
               },
               "date":{
                  "type":"date",
                  "format":"dateOptionalTime"
               },
               "keywords":{
                  "type":"string"
               },
               "content_type":{
                  "type":"string"
               }
            }
         }
      }
   }
}
```

All your settings have been overwritten by the mapper plugin.

Closes #37.
2013-08-19 11:01:02 +02:00
David Pilato d054f9a1e7 Mapper 1.7.0 does not work with elasticsearch 0.90.3
FastByteArrayInputStream has been removed in 0.90.3.
Closes #34.
2013-08-07 09:47:12 +02:00
David Pilato 942b87b763 Move to Elasticsearch 0.21.0.Beta1
Due to refactoring in 0.21.x, we have to update this plugin.
Closes #24.
2013-02-23 12:13:51 +01:00
David Pilato eba4da7086 NPE if "content" is missing in mapper-attachment plugin
Curl recreation:

```sh
curl -X DELETE "localhost:9200/test"

curl -X PUT "localhost:9200/test" -d '{
  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}
}'

curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&pretty=1&timeout=5s"

curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{
  "attachment" : {
    "properties" : {
      "file" : {
        "type" : "attachment"
      }
    }
  }
}'

curl -X PUT "localhost:9200/test/attachment/1" -d '{
    "file" : {
        "_content_type" : "application/pdf",
        "_name" : "resource/name/of/my.pdf"
    }
}
'
```

Produces:

```js
{"error":"NullPointerException[null]","status":500}
```

And in ES logs:

```
[2013-02-20 12:49:04,445][DEBUG][action.index             ] [Drake, Frank] [test][0], node[LI6crwNKQmu1ue1u7mlqGA], [P], s[STARTED]: Failed to execute [index {[test][attachment][1], source[{
    "file" : {
        "_content_type" : "application/pdf",
        "_name" : "resource/name/of/my.pdf"
    }
}
]}]
java.lang.NullPointerException
	at org.elasticsearch.common.io.FastByteArrayInputStream.<init>(FastByteArrayInputStream.java:90)
	at org.elasticsearch.index.mapper.attachment.AttachmentMapper.parse(AttachmentMapper.java:309)
	at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:507)
	at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:449)
	at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:486)
	at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:430)
	at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:318)
	at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:203)
	at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:531)
	at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:429)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:680)
```
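
One plausible guard against this NPE (a sketch, not necessarily the exact fix that was committed):

```java
if (content == null) {
    throw new MapperParsingException("No content is provided.");
}
```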

Closes #23
2013-02-23 09:40:07 +01:00
David Pilato 30e425e209 Plugin must be 0.20.x compatible (tests fail) 2012-12-21 15:16:04 +01:00
David Pilato 17ae816a6a Move resources in /src/test/resources 2012-12-21 15:12:15 +01:00
David Pilato 00d87de418 #13 : Fix dependency to tika-app 2012-09-28 10:00:45 +02:00
Martijn van Groningen 64254e621b Removed ExtendedTika class. The maxLength change made it into Tika 1.2, which we use now and which was the reason for having this class. 2012-09-19 12:44:48 +02:00
Shay Banon c1df26e4e9 upgrade to tika 1.1 2012-03-25 20:00:45 +02:00
Shay Banon 4292512f8e rename fileName to name 2012-03-17 11:54:40 +02:00
alheim d9a822dba8 Add a fileName field to index the attachment fileName 2012-03-17 11:52:35 +02:00
Shay Banon 0352c1436e change the per-doc parameter to `_indexed_chars`, and add the `index.mapping.attachment.indexed_chars` setting to change it globally (per index) 2012-03-07 21:53:41 +02:00
Henac 9a26458862 Fixed an issue where setting maxStringLength applied globally to the Tika instance.
I have extended the Tika class to allow setting how much text to
extract from a document on a per-call basis.
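The idea in a nutshell (a sketch; a later commit notes that this per-call overload made it into Tika 1.2):

```java
Tika tika = new Tika();
// maxLength caps the extracted text for this call only; -1 means no limit
String parsed = tika.parseToString(stream, metadata, indexedChars);
```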
2012-03-06 22:20:04 +11:00
Henac 6a08ca673a Added the ability to specify the amount of text to extract and index from an attachment. 2012-03-04 16:09:21 +11:00
Shay Banon dfd0e2cc41 latest assembly 2012-02-26 10:22:12 +02:00
Shay Banon c4a1275475 first commit 2011-12-05 14:05:14 +02:00