62 Commits

Author SHA1 Message Date
David Pilato b35ad804df Ignore encrypted documents
Original request:
        I am sending multiple PDF, Word, etc. attachments in one document to be indexed.

        Some of them (PDF) are encrypted and I am getting a MapperParsingException caused by org.apache.tika.exception.TikaException: Unable to extract PDF content caused by
        org.apache.pdfbox.exceptions.WrappedIOException: Error decrypting document.

        I was wondering if the attachment mapper could expose some switch to ignore the documents it cannot extract?

Now that the `ignore_errors` option exists, we can support this. See #38 for details on that option.

Closes #18.
2013-08-20 18:31:06 +02:00
David Pilato d6aa2f0615 Tika may "hang up" client application
Sometimes Tika may crash while parsing some files. In this case it may throw plain runtime errors (Throwable) rather than a TikaException.
But there is no “catch” clause for Throwable in AttachmentMapper.java:

        String parsedContent;
        try {
            // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
            parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
        } catch (TikaException e) {
            throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
        }

As a result, tika() may “hang up” the whole application.
(We have some PDF files that "hang up" the Elastic client if you try to parse them using the mapper-attachment plugin.)

We propose the following fix:

        String parsedContent;
        try {
            // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
            parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
        } catch (Throwable e) {
            throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
        }

(just replace “TikaException” with “Throwable” – it works for our cases)

Thank you!
Closes #21.
2013-08-20 17:28:11 +02:00
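
The difference between the two catch clauses in the commit above can be illustrated with a minimal, self-contained sketch. It does not use Tika or Elasticsearch: `ExtractionFailedException` and `ThrowableGuardDemo` are hypothetical names standing in for `MapperParsingException` and the mapper, and the failing parser is simulated with a lambda.

```java
import java.util.concurrent.Callable;

/** Hypothetical stand-in for the plugin's MapperParsingException. */
class ExtractionFailedException extends RuntimeException {
    ExtractionFailedException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class ThrowableGuardDemo {
    /**
     * Runs an extraction step and converts any failure, including
     * java.lang.Error subclasses, into a single exception type.
     */
    static String extract(String name, Callable<String> extractor) {
        try {
            return extractor.call();
        } catch (Throwable t) {
            // A narrower catch (for one checked exception type only)
            // would let errors such as NoClassDefFoundError escape.
            throw new ExtractionFailedException(
                    "Failed to extract text for [" + name + "]", t);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulated parser failure that is an Error, not an Exception.
            extract("broken.pdf", () -> {
                throw new NoClassDefFoundError("simulated parser failure");
            });
        } catch (ExtractionFailedException e) {
            System.out.println(e.getMessage() + " caused by " + e.getCause());
        }
    }
}
```

One trade-off worth keeping in mind: catching `Throwable` also intercepts errors such as `OutOfMemoryError`, so the wrapped exception should always keep the original cause, as the sketch does.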
David Pilato 0fff26f2bf Mapper plugin overwrite date mapping
If you define some specific mapping for your file content, such as the following:

```javascript
{
    "person": {
        "properties": {
            "file": {
                "type": "attachment",
                "path": "full",
                "fields": {
                    "date": { "type": "string" }
                }
            }
        }
    }
}
```

If you then retrieve the mapping, you get:

```javascript
{
   "person":{
      "properties":{
         "file":{
            "type":"attachment",
            "path":"full",
            "fields":{
               "file":{
                  "type":"string"
               },
               "author":{
                  "type":"string"
               },
               "title":{
                  "type":"string"
               },
               "name":{
                  "type":"string"
               },
               "date":{
                  "type":"date",
                  "format":"dateOptionalTime"
               },
               "keywords":{
                  "type":"string"
               },
               "content_type":{
                  "type":"string"
               }
            }
         }
      }
   }
}
```

All your settings have been overwritten by the mapper plugin.

See also #22, where this problem was first reported.

Closes #39.
2013-08-20 16:35:52 +02:00
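
The behaviour users expect from the commit above is that their own field definitions take precedence over the plugin defaults. A minimal sketch of that merge rule follows; it is not the plugin's code, and `FieldMappingMergeDemo` and its flat `Map` representation of a mapping are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldMappingMergeDemo {
    /**
     * Returns the default field mappings with every user-defined
     * field replacing the default of the same name.
     */
    static Map<String, Map<String, String>> merge(
            Map<String, Map<String, String>> defaults,
            Map<String, Map<String, String>> userDefined) {
        Map<String, Map<String, String>> merged = new LinkedHashMap<>(defaults);
        merged.putAll(userDefined); // user settings win on conflict
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> defaults = new LinkedHashMap<>();
        defaults.put("title", Map.of("type", "string"));
        defaults.put("date", Map.of("type", "date", "format", "dateOptionalTime"));

        Map<String, Map<String, String>> user = new LinkedHashMap<>();
        user.put("date", Map.of("type", "string"));

        System.out.println(merge(defaults, user));
    }
}
```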
David Pilato 62cc54a7c8 Update readme with release dates 2013-08-20 16:15:18 +02:00
David Pilato 8c340535d2 Add content_length metadata
We now generate a `content_length` field based on file size.
Closes #26.
2013-08-20 16:03:31 +02:00
David Pilato 406e295c6c In test for #38, we should check the real file name as we have it :-). 2013-08-20 12:34:33 +02:00
Frédéric Camblor 019d0f9a26 Don't reject full document in case of invalid metadata
Based on the original PR #17 by @fcamblor.

If you try to index a document with invalid metadata, the full document is rejected.

For example:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html lang="fr">
<head>
<title>Hello</title>
<meta name="date" content="">
<meta name="Author" content="kimchy">
<meta name="Keywords" content="elasticsearch,cool,bonsai">
</head>
<body>World</body>
</html>
```

has a non-parseable date.

This fix adds a new option that ignores parsing errors: `"index.mapping.attachment.ignore_errors":true` (defaults to `true`).

Closes #17, #38.
2013-08-20 12:26:49 +02:00
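
The idea behind `ignore_errors` in the commit above can be sketched as follows: a metadata field that fails to parse is dropped, while the rest of the document is kept. This is an illustrative sketch only, not the plugin's implementation. `LenientMetadataDemo` is a hypothetical name, and `java.time.LocalDate` is used here purely as a convenient date parser.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashMap;
import java.util.Map;

final class LenientMetadataDemo {
    /**
     * Copies metadata into a typed map. The "date" field is parsed;
     * if parsing fails, the field is skipped when ignoreErrors is
     * true, otherwise the whole document is rejected.
     */
    static Map<String, Object> parseMetadata(Map<String, String> raw,
                                             boolean ignoreErrors) {
        Map<String, Object> parsed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String key = entry.getKey();
            String value = entry.getValue();
            if (!"date".equals(key)) {
                parsed.put(key, value);
                continue;
            }
            try {
                parsed.put(key, LocalDate.parse(value));
            } catch (DateTimeParseException e) {
                if (!ignoreErrors) {
                    throw new IllegalArgumentException(
                            "Invalid date metadata: [" + value + "]", e);
                }
                // Lenient mode: drop only the bad field.
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("title", "Hello");
        raw.put("date", ""); // empty date, as in the HTML example
        raw.put("author", "kimchy");
        System.out.println(parseMetadata(raw, true));
    }
}
```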
David Pilato d7a2e7e2ff Mapper plugin overwrites multifield mapping
If you define some specific mapping for your file content, such as the following:

```javascript
{
    "person": {
        "properties": {
            "file": {
                "type": "attachment",
                "path": "full",
                "fields": {
                    "file": {
                        "type": "multifield",
                        "fields": {
                            "file": { "type": "string" },
                            "suggest": { "type": "string" }
                        }
                    }
                }
            }
        }
    }
}
```

If you then retrieve the mapping, you get:

```javascript
{
   "person":{
      "properties":{
         "file":{
            "type":"attachment",
            "path":"full",
            "fields":{
               "file":{
                  "type":"string"
               },
               "author":{
                  "type":"string"
               },
               "title":{
                  "type":"string"
               },
               "name":{
                  "type":"string"
               },
               "date":{
                  "type":"date",
                  "format":"dateOptionalTime"
               },
               "keywords":{
                  "type":"string"
               },
               "content_type":{
                  "type":"string"
               }
            }
         }
      }
   }
}
```

All your settings have been overwritten by the mapper plugin.

Closes #37.
2013-08-19 11:01:02 +02:00
David Pilato d2e2fb5cdf Upgrade Tika to 1.4.
Closes #36.
2013-08-14 16:57:42 +02:00
David Pilato c0663277bc prepare for next development iteration 2013-08-07 10:02:02 +02:00
David Pilato 0a454efe18 prepare release elasticsearch-mapper-attachments-1.8.0 2013-08-07 09:52:29 +02:00
David Pilato d054f9a1e7 Mapper 1.7.0 does not work with elasticsearch 0.90.3
FastByteArrayInputStream has been removed in 0.90.3.
Closes #34.
2013-08-07 09:47:12 +02:00
Shay Banon 690779cf2f move to 1.8 snap 2013-02-26 16:06:53 +01:00
Shay Banon 7e58416506 release 1.7 2013-02-26 16:06:39 +01:00
David Pilato 942b87b763 Move to Elasticsearch 0.21.0.Beta1
Due to refactoring in 0.21.x we have to update this plugin
Closes #24.
2013-02-23 12:13:51 +01:00
David Pilato eba4da7086 NPE if "content" is missing in mapper-attachment plugin
Curl recreation:

        curl -X DELETE "localhost:9200/test"

        curl -X PUT "localhost:9200/test" -d '{
          "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}
        }'

        curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&pretty=1&timeout=5s"

        curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{
          "attachment" : {
            "properties" : {
              "file" : {
                "type" : "attachment"
              }
            }
          }
        }'

        curl -X PUT "localhost:9200/test/attachment/1" -d '{
            "file" : {
                "_content_type" : "application/pdf",
                "_name" : "resource/name/of/my.pdf"
            }
        }
        '

Produces a:

        {"error":"NullPointerException[null]","status":500}

And in ES logs:

      [2013-02-20 12:49:04,445][DEBUG][action.index             ] [Drake, Frank] [test][0], node[LI6crwNKQmu1ue1u7mlqGA], [P], s[STARTED]: Failed to execute [index {[test][attachment][1], source[{
          "file" : {
              "_content_type" : "application/pdf",
              "_name" : "resource/name/of/my.pdf"
          }
      }
      ]}]
      java.lang.NullPointerException
      	at org.elasticsearch.common.io.FastByteArrayInputStream.<init>(FastByteArrayInputStream.java:90)
      	at org.elasticsearch.index.mapper.attachment.AttachmentMapper.parse(AttachmentMapper.java:309)
      	at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:507)
      	at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:449)
      	at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:486)
      	at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:430)
      	at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:318)
      	at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:203)
      	at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:531)
      	at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:429)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	at java.lang.Thread.run(Thread.java:680)

Closes #23
2013-02-23 09:40:07 +01:00
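
The stack trace above shows the null content reaching a stream constructor. A guard of the following shape would turn that into a descriptive error. This is a sketch under assumptions, not the actual fix: `MissingContentDemo` and the error message are hypothetical, and the standard `ByteArrayInputStream` stands in for `FastByteArrayInputStream`.

```java
import java.io.ByteArrayInputStream;

final class MissingContentDemo {
    /**
     * Wraps attachment bytes in a stream, failing with a clear
     * message instead of a NullPointerException when the document
     * carries no content.
     */
    static ByteArrayInputStream openContent(String fieldName, byte[] content) {
        if (content == null) {
            throw new IllegalArgumentException(
                    "No content provided for attachment field [" + fieldName + "]");
        }
        return new ByteArrayInputStream(content);
    }

    public static void main(String[] args) {
        try {
            openContent("file", null);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```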
Martijn van Groningen 69f8bdea03 Master is now 0.20 2012-12-21 15:17:02 +01:00
David Pilato 30e425e209 Plugin must be 0.20.x compatible (tests fail) 2012-12-21 15:16:04 +01:00
David Pilato 17ae816a6a Move resources in /src/test/resources 2012-12-21 15:12:15 +01:00
Martijn van Groningen 15248a9d52 Set next development version 2012-09-28 12:09:44 +02:00
Martijn van Groningen a163fdad0f Prepare 1.6.0 release 2012-09-28 12:00:12 +02:00
David Pilato 00d87de418 #13 : Fix dependency to tika-app 2012-09-28 10:00:45 +02:00
Martijn van Groningen 64254e621b Removed ExtendedTika class. The maxLength change, which was the reason for having this class, made it into Tika 1.2, which we now use. 2012-09-19 12:44:48 +02:00
Martijn van Groningen ab159355ed Set new development version 2012-09-19 11:42:34 +02:00
Martijn van Groningen 0a17fe2e44 Release 1.5 2012-09-19 11:33:06 +02:00
Martijn van Groningen 5c649ad226 Upgraded Tika, Testng, hamcrest, log4j and surefire plugin.
Closes #12
2012-09-19 10:55:58 +02:00
Shay Banon 65043c0692 add license and repo 2012-06-10 22:14:18 +02:00
Shay Banon 0ae4c73386 move to 1.5.0 snap 2012-03-25 20:11:07 +02:00
Shay Banon 66b96cb994 release 1.4.0 2012-03-25 20:10:46 +02:00
Shay Banon c1df26e4e9 upgrade to tika 1.1 2012-03-25 20:00:45 +02:00
Shay Banon 4292512f8e rename fileName to name 2012-03-17 11:54:40 +02:00
alheim d9a822dba8 Add a fileName field to index the attachment fileName 2012-03-17 11:52:35 +02:00
Shay Banon 911fa246d0 move to 1.4.0 snap 2012-03-07 22:03:45 +02:00
Shay Banon 4482a5de67 release 1.3.0 2012-03-07 22:02:49 +02:00
Shay Banon 744e3772a5 update readme 2012-03-07 21:56:48 +02:00
Shay Banon 0352c1436e change to _indexed_chars the parameter per doc, and add index.mapping.attachment.indexed_chars setting to globally change it (per index) 2012-03-07 21:53:41 +02:00
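
A plausible way to combine the per-document `_indexed_chars` parameter with the per-index `index.mapping.attachment.indexed_chars` setting is sketched below. The precedence (document value first, then index setting, then a default) and the default of 100000 are assumptions for illustration. The only detail taken from this log is that -1 means no limit. `IndexedCharsDemo` is a hypothetical name.

```java
final class IndexedCharsDemo {
    /** Assumed fallback when neither level sets a limit. */
    static final int DEFAULT_INDEXED_CHARS = 100_000;

    /** Picks the most specific limit that is set (assumed precedence). */
    static int resolveLimit(Integer perDocument, Integer perIndex) {
        if (perDocument != null) {
            return perDocument;
        }
        if (perIndex != null) {
            return perIndex;
        }
        return DEFAULT_INDEXED_CHARS;
    }

    /** Truncates extracted text to the limit; a negative limit means no limit. */
    static String truncate(String text, int limit) {
        if (limit < 0 || text.length() <= limit) {
            return text;
        }
        return text.substring(0, limit);
    }

    public static void main(String[] args) {
        int limit = resolveLimit(null, 5);
        System.out.println(truncate("elasticsearch", limit));
    }
}
```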
Shay Banon 59f38ff576 Merge branch 'master' of https://github.com/Henac/elasticsearch-mapper-attachments 2012-03-07 21:44:54 +02:00
Henac 9a26458862 Fixed issue with setting of maxStringLength applying globally to the tika instance.
I have extended the Tika class to allow for setting of how much text to
extract from a document to be on a per call basis.
2012-03-06 22:20:04 +11:00
Shay Banon 9882a2937b update readme 2012-03-04 11:59:22 +02:00
Shay Banon 3a72b6b2c4 update to 0.19.0 2012-03-04 11:52:47 +02:00
Henac 6a08ca673a Added the ability to specify the amount of text to extract and index from an attachment. 2012-03-04 16:09:21 +11:00
Shay Banon bf12b2be21 Merge pull request #6 from dadoonet/master
Update maven assembly plugin version to 2.3
2012-02-26 15:39:45 -08:00
David Pilato 509b467658 Update maven assembly plugin to latest version : 2.3 2012-02-26 22:34:02 +01:00
David Pilato 79d7860a72 Merge remote-tracking branch 'elasticsearch/master'
Conflicts:
	pom.xml
2012-02-26 17:19:12 +01:00
Shay Banon dfd0e2cc41 latest assembly 2012-02-26 10:22:12 +02:00
Shay Banon 8ffa86bb31 latest elasticsearch 2012-02-26 10:21:53 +02:00
David Pilato 623952f839 Ignore eclipse files 2012-02-25 13:38:44 +01:00
David Pilato 1bee1b3d0c Update to elasticsearch : 0.19.0.RC3 (fix dependencies issues) 2012-02-25 13:38:15 +01:00
David Pilato ee3c17ec8b We should indicate each plugin version. Assembly plugin 2.2-beta-5
(default) works but the latest release (2.2.1) won't as we don't set the
assembly id
2012-02-25 13:34:07 +01:00
Shay Banon c13da65bea move to 1.3.0 snap 2012-02-15 22:44:12 +02:00