David Pilato
d6aa2f0615
Tika may "hang up" client application
...
Sometimes Tika crashes while parsing certain files. In that case it may throw a runtime error (any `Throwable`) rather than a `TikaException`.
But `AttachmentMapper.java` has no `catch` clause for `Throwable`:
String parsedContent;
try {
    // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
    parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
} catch (TikaException e) {
    throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
}
As a result, `tika()` may "hang up" the whole application.
(We have some PDF files that "hang up" the Elasticsearch client when they are parsed with the mapper-attachment plugin.)
We propose the following fix:
String parsedContent;
try {
    // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
    parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
} catch (Throwable e) {
    throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
}
(Just replace `TikaException` with `Throwable`; it works for our cases.)
Thank you!
Closes #21 .
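The difference between the two `catch` clauses can be reproduced without Tika. The following is a minimal sketch, assuming hypothetical stand-in types (`SketchTikaException`, `SketchMapperParsingException`, `SketchParser`) that are not the real Tika or Elasticsearch classes. It only illustrates that a narrow `catch` lets unchecked failures escape, while `catch (Throwable e)` wraps them:

```java
// Hypothetical stand-ins for illustration only; NOT the real Tika or Elasticsearch classes.
class SketchTikaException extends Exception {
    SketchTikaException(String message) { super(message); }
}

class SketchMapperParsingException extends RuntimeException {
    SketchMapperParsingException(String message, Throwable cause) { super(message, cause); }
}

interface SketchParser {
    String parseToString(byte[] content) throws SketchTikaException;
}

class ThrowableCatchSketch {
    // Narrow catch: only the checked parser exception is wrapped,
    // so a RuntimeException or Error thrown by the parser escapes unchanged.
    static String parseNarrow(SketchParser parser, byte[] content) {
        try {
            return parser.parseToString(content);
        } catch (SketchTikaException e) {
            throw new SketchMapperParsingException("Failed to extract text", e);
        }
    }

    // Broad catch: unchecked exceptions and errors are wrapped as well.
    static String parseBroad(SketchParser parser, byte[] content) {
        try {
            return parser.parseToString(content);
        } catch (Throwable e) {
            throw new SketchMapperParsingException("Failed to extract text", e);
        }
    }
}
```

Note that catching `Throwable` also intercepts errors such as `OutOfMemoryError`, so whether that trade-off is acceptable depends on the caller.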
2013-08-20 17:28:11 +02:00
David Pilato
0fff26f2bf
Mapper plugin overwrite date mapping
...
If you define some specific mapping for your file content, such as the following:
```javascript
{
"person": {
"properties": {
"file": {
"type": "attachment",
"path": "full",
"fields": {
"date": { "type": "string" }
}
}
}
}
}
```
Then, if you retrieve the mapping, you get:
```javascript
{
"person":{
"properties":{
"file":{
"type":"attachment",
"path":"full",
"fields":{
"file":{
"type":"string"
},
"author":{
"type":"string"
},
"title":{
"type":"string"
},
"name":{
"type":"string"
},
"date":{
"type":"date",
"format":"dateOptionalTime"
},
"keywords":{
"type":"string"
},
"content_type":{
"type":"string"
}
}
}
}
}
}
```
All your settings have been overwritten by the mapper plugin.
See also issue #22, where the problem was found.
Closes #39 .
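The intended behaviour can be pictured as a merge in which user-defined sub-field settings take precedence over the plugin defaults. This is a minimal sketch using plain maps; `defaultFieldTypes` and `mergeFieldTypes` are hypothetical names, and the code is not the plugin's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FieldMappingMergeSketch {
    // Assumed defaults, loosely modelled on the mapping returned above.
    static Map<String, String> defaultFieldTypes() {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("file", "string");
        defaults.put("author", "string");
        defaults.put("title", "string");
        defaults.put("name", "string");
        defaults.put("date", "date");
        defaults.put("keywords", "string");
        defaults.put("content_type", "string");
        return defaults;
    }

    // Start from the defaults, then let every user-defined field type win.
    static Map<String, String> mergeFieldTypes(Map<String, String> userDefined) {
        Map<String, String> merged = defaultFieldTypes();
        merged.putAll(userDefined);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> user = new LinkedHashMap<>();
        user.put("date", "string");
        // The user-supplied type for "date" replaces the default one.
        System.out.println(mergeFieldTypes(user));
    }
}
```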
2013-08-20 16:35:52 +02:00
David Pilato
62cc54a7c8
Update readme with release dates
2013-08-20 16:15:18 +02:00
David Pilato
8c340535d2
Add content_length metadata
...
We now generate a `content_length` field based on file size.
Closes #26 .
2013-08-20 16:03:31 +02:00
David Pilato
406e295c6c
In test for #38 , we should check the real file name as we have it :-).
2013-08-20 12:34:33 +02:00
Frédéric Camblor
019d0f9a26
Don't reject full document in case of invalid metadata
...
From the original PR #17 by @fcamblor.
If you try to index a document with invalid metadata, the full document is rejected.
For example:
```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd ">
<html lang="fr">
<head>
<title>Hello</title>
<meta name="date" content="">
<meta name="Author" content="kimchy">
<meta name="Keywords" content="elasticsearch,cool,bonsai">
</head>
<body>World</body>
</html>
```
has a non-parseable date.
This fix adds a new option that ignores parsing errors: `"index.mapping.attachment.ignore_errors":true` (defaults to `true`).
Closes #17 , #38 .
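The effect of such an option can be sketched as follows. This is a simplified illustration, not the plugin's code: the `extract` method and its `ignoreErrors` flag are hypothetical, and dates are parsed with `java.time.LocalDate` in ISO format purely for brevity:

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashMap;
import java.util.Map;

class LenientMetadataSketch {
    // With ignoreErrors set, a metadata field that fails to parse is skipped;
    // otherwise the failure propagates and the whole document is rejected.
    static Map<String, Object> extract(Map<String, String> rawMetadata, boolean ignoreErrors) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : rawMetadata.entrySet()) {
            try {
                if ("date".equals(entry.getKey())) {
                    fields.put("date", LocalDate.parse(entry.getValue()));
                } else {
                    fields.put(entry.getKey(), entry.getValue());
                }
            } catch (DateTimeParseException e) {
                if (!ignoreErrors) {
                    throw new IllegalArgumentException(
                            "Failed to parse metadata field [" + entry.getKey() + "]", e);
                }
                // ignoreErrors is true: drop only this field and keep the rest.
            }
        }
        return fields;
    }
}
```

With the HTML example above, the empty `date` value would be dropped while `Author` and `Keywords` are still kept.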
2013-08-20 12:26:49 +02:00
David Pilato
d7a2e7e2ff
Mapper plugin overwrites multifield mapping
...
If you define some specific mapping for your file content, such as the following:
```javascript
{
"person": {
"properties": {
"file": {
"type": "attachment",
"path": "full",
"fields": {
"file": {
"type": "multifield",
"fields": {
"file": { "type": "string" },
"suggest": { "type": "string" }
}
}
}
}
}
}
}
```
Then, if you retrieve the mapping, you get:
```javascript
{
"person":{
"properties":{
"file":{
"type":"attachment",
"path":"full",
"fields":{
"file":{
"type":"string"
},
"author":{
"type":"string"
},
"title":{
"type":"string"
},
"name":{
"type":"string"
},
"date":{
"type":"date",
"format":"dateOptionalTime"
},
"keywords":{
"type":"string"
},
"content_type":{
"type":"string"
}
}
}
}
}
}
```
All your settings have been overwritten by the mapper plugin.
Closes #37 .
2013-08-19 11:01:02 +02:00
David Pilato
d2e2fb5cdf
Upgrade Tika to 1.4.
...
Closes #36 .
2013-08-14 16:57:42 +02:00
David Pilato
c0663277bc
prepare for next development iteration
2013-08-07 10:02:02 +02:00
David Pilato
0a454efe18
prepare release elasticsearch-mapper-attachments-1.8.0
2013-08-07 09:52:29 +02:00
David Pilato
d054f9a1e7
Mapper 1.7.0 does not work with elasticsearch 0.90.3
...
FastByteArrayInputStream has been removed in 0.90.3.
Closes #34 .
2013-08-07 09:47:12 +02:00
Shay Banon
690779cf2f
move to 1.8 snap
2013-02-26 16:06:53 +01:00
Shay Banon
7e58416506
release 1.7
2013-02-26 16:06:39 +01:00
David Pilato
942b87b763
Move to Elasticsearch 0.21.0.Beta1
...
Due to refactoring in 0.21.x we have to update this plugin
Closes #24 .
2013-02-23 12:13:51 +01:00
David Pilato
eba4da7086
NPE if "content" is missing in mapper-attachment plugin
...
Curl recreation:
curl -X DELETE "localhost:9200/test"
curl -X PUT "localhost:9200/test" -d '{
"settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}
}'
curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&pretty=1&timeout=5s"
curl -X PUT "localhost:9200/test/attachment/_mapping" -d '{
"attachment" : {
"properties" : {
"file" : {
"type" : "attachment"
}
}
}
}'
curl -X PUT "localhost:9200/test/attachment/1" -d '{
"file" : {
"_content_type" : "application/pdf",
"_name" : "resource/name/of/my.pdf"
}
}
'
Produces a:
{"error":"NullPointerException[null]","status":500}
And in ES logs:
[2013-02-20 12:49:04,445][DEBUG][action.index ] [Drake, Frank] [test][0], node[LI6crwNKQmu1ue1u7mlqGA], [P], s[STARTED]: Failed to execute [index {[test][attachment][1], source[{
"file" : {
"_content_type" : "application/pdf",
"_name" : "resource/name/of/my.pdf"
}
}
]}]
java.lang.NullPointerException
at org.elasticsearch.common.io.FastByteArrayInputStream.<init>(FastByteArrayInputStream.java:90)
at org.elasticsearch.index.mapper.attachment.AttachmentMapper.parse(AttachmentMapper.java:309)
at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:507)
at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:449)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:486)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:430)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:318)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:203)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:531)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:429)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
Closes #23
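One way to avoid this kind of failure is to check for missing content before the bytes reach a stream constructor. The sketch below is an assumption about the general shape of such a guard, not the actual fix; `openContent` is a hypothetical helper, and `java.io.ByteArrayInputStream` stands in for `FastByteArrayInputStream`:

```java
import java.io.ByteArrayInputStream;

class MissingContentSketch {
    // Hypothetical guard: fail with a descriptive message instead of a bare
    // NullPointerException when the document carries no "content" value.
    static ByteArrayInputStream openContent(byte[] content, String name) {
        if (content == null) {
            throw new IllegalArgumentException("No content is provided for [" + name + "]");
        }
        return new ByteArrayInputStream(content);
    }
}
```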
2013-02-23 09:40:07 +01:00
Martijn van Groningen
69f8bdea03
Master is now 0.20
2012-12-21 15:17:02 +01:00
David Pilato
30e425e209
Plugin must be 0.20.x compatible (tests fail)
2012-12-21 15:16:04 +01:00
David Pilato
17ae816a6a
Move resources in /src/test/resources
2012-12-21 15:12:15 +01:00
Martijn van Groningen
15248a9d52
Set next development version
2012-09-28 12:09:44 +02:00
Martijn van Groningen
a163fdad0f
Prepare 1.6.0 release
2012-09-28 12:00:12 +02:00
David Pilato
00d87de418
#13 : Fix dependency to tika-app
2012-09-28 10:00:45 +02:00
Martijn van Groningen
64254e621b
Removed the ExtendedTika class. The maxLength change, which was the reason for having this class, made it into Tika 1.2, which we now use.
2012-09-19 12:44:48 +02:00
Martijn van Groningen
ab159355ed
Set new development version
2012-09-19 11:42:34 +02:00
Martijn van Groningen
0a17fe2e44
Release 1.5
2012-09-19 11:33:06 +02:00
Martijn van Groningen
5c649ad226
Upgraded Tika, Testng, hamcrest, log4j and surefire plugin.
...
Closes #12
2012-09-19 10:55:58 +02:00
Shay Banon
65043c0692
add license and repo
2012-06-10 22:14:18 +02:00
Shay Banon
0ae4c73386
move to 1.5.0 snap
2012-03-25 20:11:07 +02:00
Shay Banon
66b96cb994
release 1.4.0
2012-03-25 20:10:46 +02:00
Shay Banon
c1df26e4e9
upgrade to tika 1.1
2012-03-25 20:00:45 +02:00
Shay Banon
4292512f8e
rename fileName to name
2012-03-17 11:54:40 +02:00
alheim
d9a822dba8
Add a fileName field to index the attachment fileName
2012-03-17 11:52:35 +02:00
Shay Banon
911fa246d0
move to 1.4.0 snap
2012-03-07 22:03:45 +02:00
Shay Banon
4482a5de67
release 1.3.0
2012-03-07 22:02:49 +02:00
Shay Banon
744e3772a5
update readme
2012-03-07 21:56:48 +02:00
Shay Banon
0352c1436e
Rename the per-document parameter to _indexed_chars, and add the index.mapping.attachment.indexed_chars setting to change it globally (per index)
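A plausible way to combine the two levels is sketched below. The precedence (per-document value first, then the index setting) and the fallback constant are assumptions made for illustration; `resolve` is a hypothetical helper, not the plugin's code:

```java
class IndexedCharsSketch {
    // Assumed fallback, used in this sketch only when neither level sets a value.
    static final int ASSUMED_DEFAULT = 100_000;

    // A per-document _indexed_chars value wins, then the index-level
    // index.mapping.attachment.indexed_chars setting, then the assumed default.
    static int resolve(Integer perDocument, Integer perIndex) {
        if (perDocument != null) {
            return perDocument;
        }
        if (perIndex != null) {
            return perIndex;
        }
        return ASSUMED_DEFAULT;
    }
}
```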
2012-03-07 21:53:41 +02:00
Shay Banon
59f38ff576
Merge branch 'master' of https://github.com/Henac/elasticsearch-mapper-attachments
2012-03-07 21:44:54 +02:00
Henac
9a26458862
Fixed issue with setting of maxStringLength applying globally to the tika instance.
...
I have extended the Tika class to allow for setting of how much text to
extract from a document to be on a per call basis.
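The problem with an instance-wide limit, and the per-call alternative, can be illustrated with a small sketch. The classes below are hypothetical and do not reproduce the extended Tika class; they only contrast a limit stored on a shared object with a limit passed as an argument:

```java
class PerCallLimitSketch {
    // Shared-state variant: the limit lives on the instance,
    // so changing it affects every caller that uses the same object.
    static class SharedLimitExtractor {
        private int maxLength = -1; // -1 means no limit

        void setMaxLength(int maxLength) {
            this.maxLength = maxLength;
        }

        String extract(String text) {
            return truncate(text, maxLength);
        }
    }

    // Per-call variant: the limit is an argument,
    // so one caller's choice cannot leak into another call.
    static class PerCallExtractor {
        String extract(String text, int maxLength) {
            return truncate(text, maxLength);
        }
    }

    // Returns the text unchanged when maxLength is negative or not exceeded.
    static String truncate(String text, int maxLength) {
        if (maxLength < 0 || text.length() <= maxLength) {
            return text;
        }
        return text.substring(0, maxLength);
    }
}
```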
2012-03-06 22:20:04 +11:00
Shay Banon
9882a2937b
update readme
2012-03-04 11:59:22 +02:00
Shay Banon
3a72b6b2c4
update to 0.19.0
2012-03-04 11:52:47 +02:00
Henac
6a08ca673a
Added the ability to specify the amount of text to extract and index from an attachment.
2012-03-04 16:09:21 +11:00
Shay Banon
bf12b2be21
Merge pull request #6 from dadoonet/master
...
Update maven assembly plugin version to 2.3
2012-02-26 15:39:45 -08:00
David Pilato
509b467658
Update maven assembly plugin to latest version : 2.3
2012-02-26 22:34:02 +01:00
David Pilato
79d7860a72
Merge remote-tracking branch 'elasticsearch/master'
...
Conflicts:
pom.xml
2012-02-26 17:19:12 +01:00
Shay Banon
dfd0e2cc41
latest assembly
2012-02-26 10:22:12 +02:00
Shay Banon
8ffa86bb31
latest elasticsearch
2012-02-26 10:21:53 +02:00
David Pilato
623952f839
Ignore eclipse files
2012-02-25 13:38:44 +01:00
David Pilato
1bee1b3d0c
Update to elasticsearch : 0.19.0.RC3 (fix dependencies issues)
2012-02-25 13:38:15 +01:00
David Pilato
ee3c17ec8b
We should indicate each plugin version. Assembly plugin 2.2-beta-5
...
(default) works but the latest release (2.2.1) won't as we don't set the
assembly id
2012-02-25 13:34:07 +01:00
Shay Banon
c13da65bea
move to 1.3.0 snap
2012-02-15 22:44:12 +02:00
Shay Banon
8d2a02e7d1
release 1.2.0
2012-02-15 22:43:48 +02:00