Original request:
I am sending multiple PDF, Word, etc. attachments in one document to be indexed.
Some of them (PDF) are encrypted and I am getting a MapperParsingException caused by org.apache.tika.exception.TikaException: Unable to extract PDF content, caused by
org.apache.pdfbox.exceptions.WrappedIOException: Error decrypting document.
I was wondering if the attachment mapper could expose some switch to ignore the documents it cannot extract?
Since we now have the `ignore_errors` option, we can support this. See #38 for details on this option.
Closes #18.
Sometimes Tika may crash while parsing some files. In that case it may throw generic runtime errors (`Throwable`) rather than `TikaException`.
However, there is no `catch` clause for `Throwable` in `AttachmentMapper.java`:
```java
String parsedContent;
try {
    // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
    parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
} catch (TikaException e) {
    throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
}
```
As a result, `tika()` may hang the whole application.
(We have some PDF files that hang the Elastic client when you try to parse them with the mapper-attachment plugin.)
We propose the following fix:
```java
String parsedContent;
try {
    // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
    parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
} catch (Throwable e) {
    throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
}
```
(Just replace `TikaException` with `Throwable`; this works for our cases.)
Thank you!
Closes #21.
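The change proposed in #21 can be illustrated with a minimal, self-contained sketch. The names `ExtractionSketch`, `Extractor`, and `MappingFailure` are hypothetical stand-ins (for the mapper, the Tika call, and `MapperParsingException` respectively) and are not part of the plugin or of Tika.

```java
// Hypothetical stand-ins only: none of these names exist in the plugin or in Tika.
class ExtractionSketch {

    /** Stand-in for the mapper-level exception (MapperParsingException in the plugin). */
    static class MappingFailure extends RuntimeException {
        MappingFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Stand-in for a parser call that may fail with any kind of Throwable. */
    interface Extractor {
        String extract(byte[] content) throws Exception;
    }

    /**
     * Runs the extractor and converts every failure, including Errors such as
     * NoClassDefFoundError, into a single mapper-level exception.
     */
    static String parse(Extractor extractor, byte[] content, String name) {
        try {
            return extractor.extract(content);
        } catch (Throwable t) {
            // Catching Throwable (not only the parser's own checked exception)
            // keeps unexpected errors from escaping the mapper unwrapped.
            throw new MappingFailure("Failed to extract text for [" + name + "]", t);
        }
    }
}
```

Note that a `catch` clause only helps when the parser actually throws; it cannot interrupt a parser that loops forever. Catching `Throwable` also intercepts errors such as `OutOfMemoryError`, which is a trade-off to keep in mind.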
If you define some specific mapping for your file content, such as the following:
```javascript
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "date": { "type": "string" }
        }
      }
    }
  }
}
```
Then, if you retrieve the mapping, you get:
```javascript
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "file": {
            "type": "string"
          },
          "author": {
            "type": "string"
          },
          "title": {
            "type": "string"
          },
          "name": {
            "type": "string"
          },
          "date": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "keywords": {
            "type": "string"
          },
          "content_type": {
            "type": "string"
          }
        }
      }
    }
  }
}
```
All your settings have been overwritten by the mapper plugin.
See also issue #22, where the problem was found.
Closes #39.
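The behavior expected in #39 (user-defined sub-field mappings taking precedence over the plugin's defaults) can be sketched as follows. This is only an illustration under the assumption that a mapping is reduced to a field-name-to-type map; `FieldMappingSketch` and `mergeFieldTypes` are hypothetical names, and this is not the plugin's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: mappings are simplified to "field name -> type".
class FieldMappingSketch {

    /**
     * Starts from the default field types and lets every user-defined entry
     * replace the default for the same field name.
     */
    static Map<String, String> mergeFieldTypes(Map<String, String> defaults,
                                               Map<String, String> userDefined) {
        Map<String, String> merged = new LinkedHashMap<>(defaults);
        merged.putAll(userDefined); // user settings win over defaults
        return merged;
    }
}
```

With defaults of `date -> date` and `title -> string`, and a user-defined `date -> string`, the merged result keeps `date` as `string` instead of reverting it to `date`.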
From the original PR #17 by @fcamblor.
If you try to index a document with invalid metadata, the whole document is rejected.
For example:
```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html lang="fr">
<head>
    <title>Hello</title>
    <meta name="date" content="">
    <meta name="Author" content="kimchy">
    <meta name="Keywords" content="elasticsearch,cool,bonsai">
</head>
<body>World</body>
</html>
```
This document has a date that cannot be parsed (the `date` meta tag is empty).
This fix adds a new option that ignores parsing errors: `"index.mapping.attachment.ignore_errors": true` (defaults to `true`).
Closes #17, #38.
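The effect of an `ignore_errors`-style switch on the example above can be sketched in a few lines. This is a simplified illustration, not the plugin's implementation: `IgnoreErrorsSketch` and `mapMetadata` are hypothetical names, metadata is assumed to arrive as a plain string map, and `java.time.LocalDate` stands in for the real date parsing.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of an "ignore parsing errors" switch.
class IgnoreErrorsSketch {

    /**
     * Copies metadata into typed fields. The "date" entry is parsed as an
     * ISO date; when parsing fails, the entry is either skipped
     * (ignoreErrors = true) or the whole document is rejected.
     */
    static Map<String, Object> mapMetadata(Map<String, String> raw, boolean ignoreErrors) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            if ("date".equals(entry.getKey())) {
                try {
                    fields.put("date", LocalDate.parse(entry.getValue()));
                } catch (DateTimeParseException e) {
                    if (!ignoreErrors) {
                        throw new IllegalArgumentException("Failed to parse [date]", e);
                    }
                    // ignoreErrors is set: drop only the bad field, keep the rest
                }
            } else {
                fields.put(entry.getKey(), entry.getValue());
            }
        }
        return fields;
    }
}
```

For metadata like that of the HTML document above (an empty `date` and an `author` of `kimchy`), calling `mapMetadata` with `ignoreErrors` set to `true` keeps the author and omits the date, while `false` rejects the whole document with an exception.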