When we want to force a language instead of relying on Tika language detection, we set the `language` field in documents.
To be consistent with the other forced fields, `_content_type` and `_name`, we should prefix the `language` field with an underscore `_`.
So `language` becomes `_language`.
We first deprecate `language` in version 2.1.0 and remove it in 2.3.0.
Closes #68.
(cherry picked from commit 2f46343)
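As an illustration of the renamed field, here is a minimal sketch that builds the JSON source of a document with a forced language. The attachment field name `file` and the `content` key are assumptions made for the example; only `_language`, `_name` and `_content_type` come from the description above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class ForcedLanguageSketch {
    // Builds a JSON source string with a forced language.
    // Assumptions: the attachment field is called "file" and the payload key is "content".
    // Values are concatenated as-is, so they must not contain characters that need JSON escaping.
    static String buildSource(byte[] raw, String name, String contentType, String language) {
        String encoded = Base64.getEncoder().encodeToString(raw);
        return "{\"file\":{"
                + "\"_content_type\":\"" + contentType + "\","
                + "\"_name\":\"" + name + "\","
                + "\"_language\":\"" + language + "\","   // underscore-prefixed, replaces "language"
                + "\"content\":\"" + encoded + "\"}}";
    }

    public static void main(String[] args) {
        byte[] raw = "Bonjour le monde".getBytes(StandardCharsets.UTF_8);
        System.out.println(buildSource(raw, "hello.txt", "text/plain", "fr"));
    }
}
```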
We create branches:
* es-0.90 for elasticsearch 0.90
* es-1.0 for elasticsearch 1.0
* es-1.1 for elasticsearch 1.1
* master for elasticsearch master
We also check before releasing that we don't have a dependency on an elasticsearch SNAPSHOT version.
Add links to each version in the documentation.
Based on PR #45, we add a new language detection option using the language detection feature available in Tika:
https://tika.apache.org/1.4/detection.html#Language_Detection
By default, language detection is disabled (`false`) as it could come with a cost.
This default value can be changed by setting the `index.mapping.attachment.detect_language` setting.
It can also be set per indexed document using the `_detect_language` parameter.
Closes #45.
Closes #44.
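The interplay between the index-level default and the per-document parameter can be sketched as follows. This is not the plugin's code: the class and method names are hypothetical, and the precedence (a per-document `_detect_language` wins over the index setting) is an assumption of the sketch.

```java
import java.util.Collections;
import java.util.Map;

final class DetectLanguageSketch {
    static final String INDEX_SETTING = "index.mapping.attachment.detect_language";
    static final String DOC_PARAM = "_detect_language";

    // Returns whether language detection should run for one attachment.
    // Assumption: the per-document parameter, when present, overrides the index setting.
    static boolean shouldDetect(Map<String, String> indexSettings, Map<String, Object> attachment) {
        Object perDocument = attachment.get(DOC_PARAM);
        if (perDocument != null) {
            return Boolean.parseBoolean(perDocument.toString());
        }
        // Disabled by default, as stated above.
        return Boolean.parseBoolean(indexSettings.getOrDefault(INDEX_SETTING, "false"));
    }

    public static void main(String[] args) {
        Map<String, String> noSettings = Collections.emptyMap();
        Map<String, Object> plainDoc = Collections.emptyMap();
        Map<String, Object> optInDoc = Collections.singletonMap(DOC_PARAM, (Object) Boolean.TRUE);
        System.out.println(shouldDetect(noSettings, plainDoc));
        System.out.println(shouldDetect(noSettings, optInDoc));
    }
}
```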
Original request:
I am sending multiple PDF, Word, etc. attachments in one document to be indexed.
Some of them (PDF) are encrypted and I am getting a MapperParsingException caused by org.apache.tika.exception.TikaException: Unable to extract PDF content caused by
org.apache.pdfbox.exceptions.WrappedIOException: Error decrypting document.
I was wondering if the attachment mapper could expose some switch to ignore the documents it cannot extract?
As we now have the `ignore_errors` option, we can support it. See #38 for details on this option.
Closes #18.
Sometimes Tika may crash while parsing some files. In this case it may throw generic runtime errors (Throwable) rather than a TikaException.
But there is no “catch” clause for Throwable in AttachmentMapper.java:
```java
String parsedContent;
try {
    // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
    parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
} catch (TikaException e) {
    throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
}
```
As a result, tika() may “hang up” the whole application.
(We have some PDF files that "hang up" the Elastic client if you try to parse them using the mapper-attachment plugin.)
We propose the following fix:
```java
String parsedContent;
try {
    // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
    parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
} catch (Throwable e) {
    throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
}
```
(just replace “TikaException” with “Throwable” – it works for our cases)
Thank you!
Closes #21.
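The difference between the two catch clauses in this report can be reproduced without Tika. The sketch below is standalone: `ParserException` and `MappingFailure` are hypothetical stand-ins for `TikaException` and `MapperParsingException`, and the crash is simulated with a `NoClassDefFoundError`.

```java
final class CatchThrowableSketch {
    // Hypothetical stand-in for the parser's checked exception (TikaException in the report).
    static class ParserException extends Exception {
        ParserException(String message) { super(message); }
    }

    // Hypothetical stand-in for MapperParsingException.
    static class MappingFailure extends RuntimeException {
        MappingFailure(String message, Throwable cause) { super(message, cause); }
    }

    interface Parser {
        String parse() throws ParserException;
    }

    // Only the declared parser exception is wrapped; anything else propagates unchanged.
    static String narrowCatch(Parser parser) {
        try {
            return parser.parse();
        } catch (ParserException e) {
            throw new MappingFailure("Failed to extract text", e);
        }
    }

    // Every Throwable is wrapped, including Errors and unchecked exceptions.
    static String broadCatch(Parser parser) {
        try {
            return parser.parse();
        } catch (Throwable e) {
            throw new MappingFailure("Failed to extract text", e);
        }
    }

    public static void main(String[] args) {
        Parser crashing = () -> { throw new NoClassDefFoundError("simulated parser crash"); };
        try {
            narrowCatch(crashing);
        } catch (MappingFailure e) {
            System.out.println("narrow catch wrapped: " + e.getCause());
        } catch (Error e) {
            System.out.println("narrow catch let an Error escape: " + e);
        }
        try {
            broadCatch(crashing);
        } catch (MappingFailure e) {
            System.out.println("broad catch wrapped: " + e.getCause());
        }
    }
}
```

Note that catching `Throwable` also intercepts errors such as `OutOfMemoryError`, so this is a trade-off between robustness of indexing and hiding serious JVM failures.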
If you define some specific mapping for your file content, such as the following:
```javascript
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "date": { "type": "string" }
        }
      }
    }
  }
}
```
Then, if you request the mapping back, you get:
```javascript
{
  "person":{
    "properties":{
      "file":{
        "type":"attachment",
        "path":"full",
        "fields":{
          "file":{
            "type":"string"
          },
          "author":{
            "type":"string"
          },
          "title":{
            "type":"string"
          },
          "name":{
            "type":"string"
          },
          "date":{
            "type":"date",
            "format":"dateOptionalTime"
          },
          "keywords":{
            "type":"string"
          },
          "content_type":{
            "type":"string"
          }
        }
      }
    }
  }
}
```
All your settings have been overwritten by the mapper plugin.
See also issue #22, where the problem was found.
Closes #39.
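The expected behaviour, where user-defined sub-field mappings take precedence over the plugin defaults, can be pictured with a plain map merge. This is only a sketch of the idea with hypothetical names, not the way the plugin resolves mappings internally.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldMappingMergeSketch {
    // Starts from the default sub-field mappings and lets user-defined entries replace them.
    static Map<String, Map<String, String>> merge(Map<String, Map<String, String>> defaults,
                                                  Map<String, Map<String, String>> userDefined) {
        Map<String, Map<String, String>> merged = new LinkedHashMap<>(defaults);
        merged.putAll(userDefined); // user settings win on key collisions
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> defaults = new LinkedHashMap<>();
        defaults.put("title", Collections.singletonMap("type", "string"));
        defaults.put("date", Collections.singletonMap("type", "date"));

        Map<String, Map<String, String>> userDefined = new LinkedHashMap<>();
        userDefined.put("date", Collections.singletonMap("type", "string"));

        // "date" keeps the user's "string" type; "title" keeps its default.
        System.out.println(merge(defaults, userDefined));
    }
}
```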
From the original PR #17 by @fcamblor.
If you try to index a document with invalid metadata, the full document is rejected.
For example:
```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html lang="fr">
<head>
<title>Hello</title>
<meta name="date" content="">
<meta name="Author" content="kimchy">
<meta name="Keywords" content="elasticsearch,cool,bonsai">
</head>
<body>World</body>
</html>
```
has a non-parseable date.
This fix adds a new option that ignores parsing errors: `"index.mapping.attachment.ignore_errors":true` (defaults to `true`).
Closes #17, #38.
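The effect of such a flag can be sketched with the metadata from the HTML example above. The class and method names are hypothetical, and `java.time.LocalDate` stands in for whatever date parsing the mapper really uses; with the flag on, an unparseable field is skipped, and with it off the whole document is rejected.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashMap;
import java.util.Map;

final class IgnoreErrorsSketch {
    // Parses raw metadata; "date" must be an ISO date, other fields are kept as strings.
    // When ignoreErrors is true, an unparseable date is dropped instead of failing the document.
    static Map<String, Object> parseMetadata(Map<String, String> raw, boolean ignoreErrors) {
        Map<String, Object> parsed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            if ("date".equals(entry.getKey())) {
                try {
                    parsed.put("date", LocalDate.parse(entry.getValue()));
                } catch (DateTimeParseException e) {
                    if (!ignoreErrors) {
                        throw new IllegalArgumentException("Failed to parse [date]", e);
                    }
                    // ignore_errors behaviour: skip only the offending field
                }
            } else {
                parsed.put(entry.getKey(), entry.getValue());
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("date", "");            // empty, as in <meta name="date" content="">
        raw.put("Author", "kimchy");
        System.out.println(parseMetadata(raw, true));
    }
}
```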