Analysis: Add additional Analyzers, Tokenizers, and TokenFilters from Lucene
* Add `irish` analyzer
* Add `sorani` analyzer (Kurdish)
* Add `classic` tokenizer: specific to English text; tries to recognize hostnames, companies, acronyms, etc.
* Add `thai` tokenizer: segments Thai text into words
* Add `classic` token filter: cleans up acronyms and possessives from the `classic` tokenizer
* Add `apostrophe` token filter: removes text after an apostrophe, and the apostrophe itself
* Add `german_normalization` token filter: umlaut/sharp-s normalization
* Add `hindi_normalization` token filter: accounts for Hindi spelling differences
* Add `indic_normalization` token filter: accounts for different Unicode representations in Indian languages
* Add `sorani_normalization` token filter: normalizes Kurdish text
* Add `scandinavian_normalization` token filter: normalizes Norwegian, Danish, and Swedish text
* Add `scandinavian_folding` token filter: a much more aggressive form of `scandinavian_normalization`
* Add additional languages to the `stemmer` token filter: `galician`, `minimal_galician`, `irish`, `sorani`, `light_nynorsk`, `minimal_nynorsk`
* Add access to the default Thai stopword set `_thai_`
* Fix some bugs and broken links in the documentation

Closes #5935
parent 9ddfaf3aaf
commit b9a09c2b06
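For a quick feel for the new pieces, here is a sketch of index settings that wire several of them together (the analyzer name `classic_demo` is invented for illustration; the tokenizer and filter names are the ones registered by this commit):

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "analyzer": {
        "classic_demo": {
          "tokenizer": "classic",
          "filter": [ "classic", "apostrophe", "lowercase" ]
        }
      }
    }
  }
}
----------------------------------------------------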
@@ -23,12 +23,14 @@ following types are supported:
 <<hindi-analyzer,`hindi`>>,
 <<hungarian-analyzer,`hungarian`>>,
 <<indonesian-analyzer,`indonesian`>>,
+<<irish-analyzer,`irish`>>,
 <<italian-analyzer,`italian`>>,
 <<norwegian-analyzer,`norwegian`>>,
 <<persian-analyzer,`persian`>>,
 <<portuguese-analyzer,`portuguese`>>,
 <<romanian-analyzer,`romanian`>>,
 <<russian-analyzer,`russian`>>,
+<<sorani-analyzer,`sorani`>>,
 <<spanish-analyzer,`spanish`>>,
 <<swedish-analyzer,`swedish`>>,
 <<turkish-analyzer,`turkish`>>,
@@ -42,8 +44,8 @@ more details.
 The following analyzers support setting custom `stem_exclusion` list:
 `arabic`, `armenian`, `basque`, `catalan`, `bulgarian`, `catalan`,
 `czech`, `finnish`, `dutch`, `english`, `finnish`, `french`, `galician`,
-`german`, `hindi`, `hungarian`, `indonesian`, `italian`, `norwegian`,
-`portuguese`, `romanian`, `russian`, `spanish`, `swedish`, `turkish`.
+`german`, `irish`, `hindi`, `hungarian`, `indonesian`, `italian`, `norwegian`,
+`portuguese`, `romanian`, `russian`, `sorani`, `spanish`, `swedish`, `turkish`.

 [[arabic-analyzer]]
 ==== `arabic` analyzer
@@ -720,7 +722,7 @@ The `german` analyzer could be reimplemented as a `custom` analyzer as follows:
           "lowercase",
           "german_stop",
           "german_keywords",
-          "ascii_folding", <3>
+          "german_normalization",
           "german_stemmer"
         ]
       }
@@ -733,9 +735,6 @@ The `german` analyzer could be reimplemented as a `custom` analyzer as follows:
     or `stopwords_path` parameters.
 <2> Words can be excluded from stemming with the `stem_exclusion`
     parameter.
-<3> The `german` analyzer actually uses the GermanNormalizationFilter,
-    which isn't exposed in Elasticsearch. The `ascii_folding` filter
-    does a similar job but is more extensive.

 [[greek-analyzer]]
 ==== `greek` analyzer
@@ -752,6 +751,10 @@ The `greek` analyzer could be reimplemented as a `custom` analyzer as follows:
           "type":      "stop",
           "stopwords": "_greek_" <1>
         },
+        "greek_lowercase": {
+          "type":     "lowercase",
+          "language": "greek"
+        },
         "greek_keywords": {
           "type":     "keyword_marker",
           "keywords": [] <2>
@@ -765,7 +768,7 @@ The `greek` analyzer could be reimplemented as a `custom` analyzer as follows:
         "greek": {
           "tokenizer": "standard",
           "filter": [
-            "lowercase",
+            "greek_lowercase",
             "greek_stop",
             "greek_keywords",
             "greek_stemmer"
@@ -784,9 +787,48 @@ The `greek` analyzer could be reimplemented as a `custom` analyzer as follows:
 [[hindi-analyzer]]
 ==== `hindi` analyzer

-The `hindi` analyzer cannot currently be implemented as a `custom` analyzer
-as it depends on the IndicNormalizationFilter and HindiNormalizationFilter
-which are not yet exposed by Elasticsearch. Instead, see the <<analysis-icu-plugin>>.
+The `hindi` analyzer could be reimplemented as a `custom` analyzer as follows:
+
+[source,js]
+----------------------------------------------------
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "hindi_stop": {
+          "type":      "stop",
+          "stopwords": "_hindi_" <1>
+        },
+        "hindi_keywords": {
+          "type":     "keyword_marker",
+          "keywords": [] <2>
+        },
+        "hindi_stemmer": {
+          "type":     "stemmer",
+          "language": "hindi"
+        }
+      },
+      "analyzer": {
+        "hindi": {
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "indic_normalization",
+            "hindi_normalization",
+            "hindi_stop",
+            "hindi_keywords",
+            "hindi_stemmer"
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+<1> The default stopwords can be overridden with the `stopwords`
+    or `stopwords_path` parameters.
+<2> Words can be excluded from stemming with the `stem_exclusion`
+    parameter.

 [[hungarian-analyzer]]
 ==== `hungarian` analyzer
@@ -877,6 +919,59 @@ The `indonesian` analyzer could be reimplemented as a `custom` analyzer as follows:
 <2> Words can be excluded from stemming with the `stem_exclusion`
     parameter.

+[[irish-analyzer]]
+==== `irish` analyzer
+
+The `irish` analyzer could be reimplemented as a `custom` analyzer as follows:
+
+[source,js]
+----------------------------------------------------
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "irish_elision": {
+          "type":     "elision",
+          "articles": [ "h", "n", "t" ]
+        },
+        "irish_stop": {
+          "type":      "stop",
+          "stopwords": "_irish_" <1>
+        },
+        "irish_lowercase": {
+          "type":     "lowercase",
+          "language": "irish"
+        },
+        "irish_keywords": {
+          "type":     "keyword_marker",
+          "keywords": [] <2>
+        },
+        "irish_stemmer": {
+          "type":     "stemmer",
+          "language": "irish"
+        }
+      },
+      "analyzer": {
+        "irish": {
+          "tokenizer": "standard",
+          "filter": [
+            "irish_stop",
+            "irish_elision",
+            "irish_lowercase",
+            "irish_keywords",
+            "irish_stemmer"
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+<1> The default stopwords can be overridden with the `stopwords`
+    or `stopwords_path` parameters.
+<2> Words can be excluded from stemming with the `stem_exclusion`
+    parameter.
+
 [[italian-analyzer]]
 ==== `italian` analyzer

@@ -1150,6 +1245,51 @@ The `russian` analyzer could be reimplemented as a `custom` analyzer as follows:
 <2> Words can be excluded from stemming with the `stem_exclusion`
     parameter.

+[[sorani-analyzer]]
+==== `sorani` analyzer
+
+The `sorani` analyzer could be reimplemented as a `custom` analyzer as follows:
+
+[source,js]
+----------------------------------------------------
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "sorani_stop": {
+          "type":      "stop",
+          "stopwords": "_sorani_" <1>
+        },
+        "sorani_keywords": {
+          "type":     "keyword_marker",
+          "keywords": [] <2>
+        },
+        "sorani_stemmer": {
+          "type":     "stemmer",
+          "language": "sorani"
+        }
+      },
+      "analyzer": {
+        "sorani": {
+          "tokenizer": "standard",
+          "filter": [
+            "sorani_normalization",
+            "lowercase",
+            "sorani_stop",
+            "sorani_keywords",
+            "sorani_stemmer"
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+<1> The default stopwords can be overridden with the `stopwords`
+    or `stopwords_path` parameters.
+<2> Words can be excluded from stemming with the `stem_exclusion`
+    parameter.
+
 [[spanish-analyzer]]
 ==== `spanish` analyzer

@@ -1241,14 +1381,80 @@ The `swedish` analyzer could be reimplemented as a `custom` analyzer as follows:
 [[turkish-analyzer]]
 ==== `turkish` analyzer

-The `turkish` analyzer cannot currently be implemented as a `custom` analyzer
-because it depends on the TurkishLowerCaseFilter and the ApostropheFilter
-which are not exposed in Elasticsearch. Instead, see the <<analysis-icu-plugin>>.
+The `turkish` analyzer could be reimplemented as a `custom` analyzer as follows:
+
+[source,js]
+----------------------------------------------------
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "turkish_stop": {
+          "type":      "stop",
+          "stopwords": "_turkish_" <1>
+        },
+        "turkish_lowercase": {
+          "type":     "lowercase",
+          "language": "turkish"
+        },
+        "turkish_keywords": {
+          "type":     "keyword_marker",
+          "keywords": [] <2>
+        },
+        "turkish_stemmer": {
+          "type":     "stemmer",
+          "language": "turkish"
+        }
+      },
+      "analyzer": {
+        "turkish": {
+          "tokenizer": "standard",
+          "filter": [
+            "apostrophe",
+            "turkish_lowercase",
+            "turkish_stop",
+            "turkish_keywords",
+            "turkish_stemmer"
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+<1> The default stopwords can be overridden with the `stopwords`
+    or `stopwords_path` parameters.
+<2> Words can be excluded from stemming with the `stem_exclusion`
+    parameter.

 [[thai-analyzer]]
 ==== `thai` analyzer

-The `thai` analyzer cannot currently be implemented as a `custom` analyzer
-because it depends on the ThaiTokenizer which is not exposed in Elasticsearch.
-Instead, see the <<analysis-icu-plugin>>.
+The `thai` analyzer could be reimplemented as a `custom` analyzer as follows:
+
+[source,js]
+----------------------------------------------------
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "thai_stop": {
+          "type":      "stop",
+          "stopwords": "_thai_" <1>
+        }
+      },
+      "analyzer": {
+        "thai": {
+          "tokenizer": "thai",
+          "filter": [
+            "lowercase",
+            "thai_stop"
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+<1> The default stopwords can be overridden with the `stopwords`
+    or `stopwords_path` parameters.
@@ -78,3 +78,7 @@ include::tokenfilters/cjk-bigram-tokenfilter.asciidoc[]
 include::tokenfilters/delimited-payload-tokenfilter.asciidoc[]

 include::tokenfilters/keep-words-tokenfilter.asciidoc[]
+
+include::tokenfilters/classic-tokenfilter.asciidoc[]
+
+include::tokenfilters/apostrophe-tokenfilter.asciidoc[]
@@ -0,0 +1,7 @@
+[[analysis-apostrophe-tokenfilter]]
+=== Apostrophe Token Filter
+
+coming[1.3.0]
+
+The `apostrophe` token filter strips all characters after an apostrophe,
+including the apostrophe itself.
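As an illustration, with the sketch below (the analyzer name `turkish_plain` is invented), a token such as `Türkiye'den` should be reduced to `Türkiye`:

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "analyzer": {
        "turkish_plain": {
          "tokenizer": "standard",
          "filter": [ "apostrophe" ]
        }
      }
    }
  }
}
----------------------------------------------------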
@@ -0,0 +1,11 @@
+[[analysis-classic-tokenfilter]]
+=== Classic Token Filter
+
+coming[1.3.0]
+
+The `classic` token filter does optional post-processing of
+terms that are generated by the <<analysis-classic-tokenizer,`classic` tokenizer>>.
+
+This filter removes the English possessive from the end of words, and
+it removes dots from acronyms.
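A minimal sketch of the intended pairing (the analyzer name `classic_english` is invented): with this configuration, `I.B.M.` should come out as `IBM`, and a possessive such as `John's` as `John`.

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "analyzer": {
        "classic_english": {
          "tokenizer": "classic",
          "filter": [ "classic" ]
        }
      }
    }
  }
}
----------------------------------------------------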
@@ -4,7 +4,7 @@
 A token filter of type `lowercase` that normalizes token text to lower
 case.

-Lowercase token filter supports Greek and Turkish lowercase token
+Lowercase token filter supports Greek, Irish coming[1.3.0], and Turkish lowercase token
 filters through the `language` parameter. Below is a usage example in a
 custom analyzer

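For the newly added Irish variant, the configuration looks like this sketch (the filter name `irish_lowercase` matches the one used in the `irish` analyzer example above):

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "irish_lowercase": {
          "type":     "lowercase",
          "language": "irish"
        }
      }
    }
  }
}
----------------------------------------------------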
@@ -4,12 +4,33 @@
 There are several token filters available which try to normalize special
 characters of a certain language.

-You can currently choose between `arabic_normalization` and
-`persian_normalization` normalization in your token filter
-configuration. For more information check the
-http://lucene.apache.org/core/4_3_1/analyzers-common/org/apache/lucene/analysis/ar/ArabicNormalizer.html[ArabicNormalizer]
-or the
-http://lucene.apache.org/core/4_3_1/analyzers-common/org/apache/lucene/analysis/fa/PersianNormalizer.html[PersianNormalizer]
-documentation.
-
-*Note:* These filters are available since `0.90.2`
+[horizontal]
+Arabic::
+
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ar/ArabicNormalizer.html[`arabic_normalization`]
+
+German::
+
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html[`german_normalization`] coming[1.3.0]
+
+Hindi::
+
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/hi/HindiNormalizer.html[`hindi_normalization`] coming[1.3.0]
+
+Indic::
+
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/in/IndicNormalizer.html[`indic_normalization`] coming[1.3.0]
+
+Kurdish (Sorani)::
+
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ckb/SoraniNormalizer.html[`sorani_normalization`] coming[1.3.0]
+
+Persian::
+
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/fa/PersianNormalizer.html[`persian_normalization`]
+
+Scandinavian::
+
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html[`scandinavian_normalization`] coming[1.3.0]
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html[`scandinavian_folding`] coming[1.3.0]

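These are ordinary token filters, so enabling one is just a matter of naming it in an analyzer's filter chain. A sketch for the aggressive Scandinavian variant (the analyzer name `swedish_folded` is invented):

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "analyzer": {
        "swedish_folded": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "scandinavian_folding" ]
        }
      }
    }
  }
}
----------------------------------------------------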
@@ -32,7 +32,7 @@ available values (the preferred filters are marked in *bold*):
 [horizontal]
 Arabic::

-http://lucene.apache.org/core/4_3_0/analyzers-common/index.html?org%2Fapache%2Flucene%2Fanalysis%2Far%2FArabicStemmer.html[*`arabic`*]
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ar/ArabicStemmer.html[*`arabic`*]

 Armenian::

|
@ -44,7 +44,7 @@ http://snowball.tartarus.org/algorithms/basque/stemmer.html[*`basque`*]
|
||||||
|
|
||||||
Brazilian Portuguese::
|
Brazilian Portuguese::
|
||||||
|
|
||||||
http://lucene.apache.org/core/4_3_0/analyzers-common/index.html?org%2Fapache%2Flucene%2Fanalysis%2Fbr%2FBrazilianStemmer.html[*`brazilian`*]
|
http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/br/BrazilianStemmer.html[*`brazilian`*]
|
||||||
|
|
||||||
Bulgarian::
|
Bulgarian::
|
||||||
|
|
||||||
|
@@ -72,7 +72,7 @@ English::
 http://snowball.tartarus.org/algorithms/porter/stemmer.html[*`english`*] coming[1.3.0,Returns the <<analysis-porterstem-tokenfilter,`porter_stem`>> instead of the <<analysis-snowball-tokenfilter,`english` Snowball token filter>>],
 http://ciir.cs.umass.edu/pubfiles/ir-35.pdf[`light_english`] coming[1.3.0,Returns the <<analysis-kstem-tokenfilter,`kstem` token filter>>],
 http://www.medialab.tfe.umu.se/courses/mdm0506a/material/fulltext_ID%3D10049387%26PLACEBO%3DIE.pdf[`minimal_english`],
-http://lucene.apache.org/core/4_3_0/analyzers-common/index.html?org%2Fapache%2Flucene%2Fanalysis%2Fen%2FEnglishPossessiveFilter.html[`possessive_english`],
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/en/EnglishPossessiveFilter.html[`possessive_english`],
 http://snowball.tartarus.org/algorithms/english/stemmer.html[`porter2`] coming[1.3.0,Returns the <<analysis-snowball-tokenfilter,`english` Snowball token filter>> instead of the <<analysis-snowball-tokenfilter,`porter` Snowball token filter>>],
 http://snowball.tartarus.org/algorithms/lovins/stemmer.html[`lovins`]

@@ -87,6 +87,11 @@ http://snowball.tartarus.org/algorithms/french/stemmer.html[`french`],
 http://dl.acm.org/citation.cfm?id=1141523[*`light_french`*],
 http://dl.acm.org/citation.cfm?id=318984[`minimal_french`]

+Galician::
+
+http://bvg.udc.es/recursos_lingua/stemming.jsp[*`galician`*] coming[1.3.0],
+http://bvg.udc.es/recursos_lingua/stemming.jsp[`minimal_galician`] (Plural step only) coming[1.3.0]
+
 German::

 http://snowball.tartarus.org/algorithms/german/stemmer.html[`german`],
@@ -111,19 +116,33 @@ Indonesian::

 http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf[*`indonesian`*]

+Irish::
+
+http://snowball.tartarus.org/otherapps/oregan/intro.html[*`irish`*]
+
 Italian::

 http://snowball.tartarus.org/algorithms/italian/stemmer.html[`italian`],
 http://www.ercim.eu/publication/ws-proceedings/CLEF2/savoy.pdf[*`light_italian`*]

+Kurdish (Sorani)::
+
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ckb/SoraniStemmer.html[*`sorani`*] coming[1.3.0]
+
 Latvian::

-http://lucene.apache.org/core/4_3_0/analyzers-common/index.html?org%2Fapache%2Flucene%2Fanalysis%2Flv%2FLatvianStemmer.html[*`latvian`*]
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/lv/LatvianStemmer.html[*`latvian`*]

-Norwegian::
+Norwegian (Bokmål)::

 http://snowball.tartarus.org/algorithms/norwegian/stemmer.html[*`norwegian`*],
-http://lucene.apache.org/core/4_3_0/analyzers-common/index.html?org%2Fapache%2Flucene%2Fanalysis%2Fno%2FNorwegianMinimalStemFilter.html[`minimal_norwegian`]
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/no/NorwegianLightStemmer.html[*`light_norwegian`*] coming[1.3.0]
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/no/NorwegianMinimalStemmer.html[`minimal_norwegian`]
+
+Norwegian (Nynorsk)::
+
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/no/NorwegianLightStemmer.html[*`light_nynorsk`*] coming[1.3.0]
+http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/no/NorwegianMinimalStemmer.html[`minimal_nynorsk`] coming[1.3.0]

 Portuguese::

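The new languages plug into the existing `stemmer` token filter configuration. For instance (the filter name `nynorsk_stemmer` is invented):

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "filter": {
        "nynorsk_stemmer": {
          "type":     "stemmer",
          "language": "light_nynorsk"
        }
      }
    }
  }
}
----------------------------------------------------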
@@ -132,7 +151,6 @@ http://dl.acm.org/citation.cfm?id=1141523&dl=ACM&coll=DL&CFID=179095584&CFTOKEN=
 http://www.inf.ufrgs.br/\~buriol/papers/Orengo_CLEF07.pdf[`minimal_portuguese`],
 http://www.inf.ufrgs.br/\~viviane/rslp/index.htm[`portuguese_rslp`] coming[1.3.0]

-
 Romanian::

 http://snowball.tartarus.org/algorithms/romanian/stemmer.html[*`romanian`*]
@@ -28,3 +28,7 @@ include::tokenizers/uaxurlemail-tokenizer.asciidoc[]

 include::tokenizers/pathhierarchy-tokenizer.asciidoc[]
+
+include::tokenizers/classic-tokenizer.asciidoc[]
+
+include::tokenizers/thai-tokenizer.asciidoc[]

@@ -0,0 +1,21 @@
+[[analysis-classic-tokenizer]]
+=== Classic Tokenizer
+
+coming[1.3.0]
+
+A tokenizer of type `classic` providing a grammar-based tokenizer that is
+a good fit for English-language documents. This tokenizer has
+heuristics for special treatment of acronyms, company names, email addresses,
+and internet host names. However, these rules don't always work, and
+the tokenizer doesn't work well for most languages other than English.
+
+The following settings can be set for a `classic` tokenizer
+type:
+
+[cols="<,<",options="header",]
+|=======================================================================
+|Setting |Description
+|`max_token_length` |The maximum token length. If a token is seen that
+exceeds this length then it is discarded. Defaults to `255`.
+|=======================================================================
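For example, to lower the cap to 5 characters (the tokenizer name `my_classic` is invented):

[source,js]
----------------------------------------------------
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_classic": {
          "type": "classic",
          "max_token_length": 5
        }
      }
    }
  }
}
----------------------------------------------------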
@@ -0,0 +1,9 @@
+[[analysis-thai-tokenizer]]
+=== Thai Tokenizer
+
+coming[1.3.0]
+
+A tokenizer of type `thai` that segments Thai text into words. This tokenizer
+uses the built-in Thai segmentation algorithm included with Java to divide
+up Thai text. Text in other languages in general will be treated the same
+as `standard`.
@@ -28,6 +28,7 @@ import org.apache.lucene.analysis.ar.ArabicAnalyzer;
 import org.apache.lucene.analysis.bg.BulgarianAnalyzer;
 import org.apache.lucene.analysis.br.BrazilianAnalyzer;
 import org.apache.lucene.analysis.ca.CatalanAnalyzer;
+import org.apache.lucene.analysis.ckb.SoraniAnalyzer;
 import org.apache.lucene.analysis.cz.CzechAnalyzer;
 import org.apache.lucene.analysis.da.DanishAnalyzer;
 import org.apache.lucene.analysis.de.GermanAnalyzer;
@@ -38,6 +39,7 @@ import org.apache.lucene.analysis.eu.BasqueAnalyzer;
 import org.apache.lucene.analysis.fa.PersianAnalyzer;
 import org.apache.lucene.analysis.fi.FinnishAnalyzer;
 import org.apache.lucene.analysis.fr.FrenchAnalyzer;
+import org.apache.lucene.analysis.ga.IrishAnalyzer;
 import org.apache.lucene.analysis.gl.GalicianAnalyzer;
 import org.apache.lucene.analysis.hi.HindiAnalyzer;
 import org.apache.lucene.analysis.hu.HungarianAnalyzer;
@@ -50,6 +52,7 @@ import org.apache.lucene.analysis.pt.PortugueseAnalyzer;
 import org.apache.lucene.analysis.ro.RomanianAnalyzer;
 import org.apache.lucene.analysis.ru.RussianAnalyzer;
 import org.apache.lucene.analysis.sv.SwedishAnalyzer;
+import org.apache.lucene.analysis.th.ThaiAnalyzer;
 import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
 import org.apache.lucene.analysis.tr.TurkishAnalyzer;
 import org.apache.lucene.analysis.util.CharArraySet;
@@ -134,14 +137,17 @@ public class Analysis {
             .put("_hindi_", HindiAnalyzer.getDefaultStopSet())
             .put("_hungarian_", HungarianAnalyzer.getDefaultStopSet())
             .put("_indonesian_", IndonesianAnalyzer.getDefaultStopSet())
+            .put("_irish_", IrishAnalyzer.getDefaultStopSet())
             .put("_italian_", ItalianAnalyzer.getDefaultStopSet())
             .put("_norwegian_", NorwegianAnalyzer.getDefaultStopSet())
             .put("_persian_", PersianAnalyzer.getDefaultStopSet())
             .put("_portuguese_", PortugueseAnalyzer.getDefaultStopSet())
             .put("_romanian_", RomanianAnalyzer.getDefaultStopSet())
             .put("_russian_", RussianAnalyzer.getDefaultStopSet())
+            .put("_sorani_", SoraniAnalyzer.getDefaultStopSet())
             .put("_spanish_", SpanishAnalyzer.getDefaultStopSet())
             .put("_swedish_", SwedishAnalyzer.getDefaultStopSet())
+            .put("_thai_", ThaiAnalyzer.getDefaultStopSet())
             .put("_turkish_", TurkishAnalyzer.getDefaultStopSet())
             .immutableMap();

@@ -503,18 +503,29 @@ public class AnalysisModule extends AbstractModule {
         tokenFiltersBindings.processTokenFilter("stemmer_override", StemmerOverrideTokenFilterFactory.class);

         tokenFiltersBindings.processTokenFilter("arabic_normalization", ArabicNormalizationFilterFactory.class);
+        tokenFiltersBindings.processTokenFilter("german_normalization", GermanNormalizationFilterFactory.class);
+        tokenFiltersBindings.processTokenFilter("hindi_normalization", HindiNormalizationFilterFactory.class);
+        tokenFiltersBindings.processTokenFilter("indic_normalization", IndicNormalizationFilterFactory.class);
+        tokenFiltersBindings.processTokenFilter("sorani_normalization", SoraniNormalizationFilterFactory.class);
         tokenFiltersBindings.processTokenFilter("persian_normalization", PersianNormalizationFilterFactory.class);
+        tokenFiltersBindings.processTokenFilter("scandinavian_normalization", ScandinavianNormalizationFilterFactory.class);
+        tokenFiltersBindings.processTokenFilter("scandinavian_folding", ScandinavianFoldingFilterFactory.class);

         tokenFiltersBindings.processTokenFilter("hunspell", HunspellTokenFilterFactory.class);
         tokenFiltersBindings.processTokenFilter("cjk_bigram", CJKBigramFilterFactory.class);
         tokenFiltersBindings.processTokenFilter("cjk_width", CJKWidthFilterFactory.class);
+
+        tokenFiltersBindings.processTokenFilter("apostrophe", ApostropheFilterFactory.class);
+        tokenFiltersBindings.processTokenFilter("classic", ClassicFilterFactory.class);

     }

     @Override
     public void processTokenizers(TokenizersBindings tokenizersBindings) {
         tokenizersBindings.processTokenizer("pattern", PatternTokenizerFactory.class);
+        tokenizersBindings.processTokenizer("classic", ClassicTokenizerFactory.class);
+        tokenizersBindings.processTokenizer("thai", ThaiTokenizerFactory.class);
     }

     @Override
@@ -542,6 +553,7 @@ public class AnalysisModule extends AbstractModule {
         analyzersBindings.processAnalyzer("hindi", HindiAnalyzerProvider.class);
         analyzersBindings.processAnalyzer("hungarian", HungarianAnalyzerProvider.class);
         analyzersBindings.processAnalyzer("indonesian", IndonesianAnalyzerProvider.class);
+        analyzersBindings.processAnalyzer("irish", IrishAnalyzerProvider.class);
         analyzersBindings.processAnalyzer("italian", ItalianAnalyzerProvider.class);
         analyzersBindings.processAnalyzer("latvian", LatvianAnalyzerProvider.class);
         analyzersBindings.processAnalyzer("norwegian", NorwegianAnalyzerProvider.class);
@@ -549,6 +561,7 @@ public class AnalysisModule extends AbstractModule {
         analyzersBindings.processAnalyzer("portuguese", PortugueseAnalyzerProvider.class);
         analyzersBindings.processAnalyzer("romanian", RomanianAnalyzerProvider.class);
         analyzersBindings.processAnalyzer("russian", RussianAnalyzerProvider.class);
+        analyzersBindings.processAnalyzer("sorani", SoraniAnalyzerProvider.class);
         analyzersBindings.processAnalyzer("spanish", SpanishAnalyzerProvider.class);
         analyzersBindings.processAnalyzer("swedish", SwedishAnalyzerProvider.class);
         analyzersBindings.processAnalyzer("turkish", TurkishAnalyzerProvider.class);
@@ -0,0 +1,44 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.elasticsearch.index.analysis;
+
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.tr.ApostropheFilter;
+import org.elasticsearch.common.inject.Inject;
+import org.elasticsearch.common.inject.assistedinject.Assisted;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.settings.IndexSettings;
+
+/**
+ * Factory for {@link ApostropheFilter}
+ */
+public class ApostropheFilterFactory extends AbstractTokenFilterFactory {
+
+    @Inject
+    public ApostropheFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
+        super(index, indexSettings, name, settings);
+    }
+
+    @Override
+    public TokenStream create(TokenStream tokenStream) {
+        return new ApostropheFilter(tokenStream);
+    }
+
+}
@@ -0,0 +1,44 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.elasticsearch.index.analysis;
+
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.standard.ClassicFilter;
+import org.elasticsearch.common.inject.Inject;
+import org.elasticsearch.common.inject.assistedinject.Assisted;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.settings.IndexSettings;
+
+/**
+ * Factory for {@link ClassicFilter}
+ */
+public class ClassicFilterFactory extends AbstractTokenFilterFactory {
+
+    @Inject
+    public ClassicFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
+        super(index, indexSettings, name, settings);
+    }
+
+    @Override
+    public TokenStream create(TokenStream tokenStream) {
+        return new ClassicFilter(tokenStream);
+    }
+
+}
@@ -0,0 +1,52 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.index.analysis;
+
+import org.apache.lucene.analysis.Tokenizer;
+import org.apache.lucene.analysis.standard.ClassicTokenizer;
+import org.apache.lucene.analysis.standard.StandardAnalyzer;
+import org.elasticsearch.common.inject.Inject;
+import org.elasticsearch.common.inject.assistedinject.Assisted;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.settings.IndexSettings;
+
+import java.io.Reader;
+
+/**
+ * Factory for {@link ClassicTokenizer}
+ */
+public class ClassicTokenizerFactory extends AbstractTokenizerFactory {
+
+    private final int maxTokenLength;
+
+    @Inject
+    public ClassicTokenizerFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
+        super(index, indexSettings, name, settings);
+        maxTokenLength = settings.getAsInt("max_token_length", StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
+    }
+
+    @Override
+    public Tokenizer create(Reader reader) {
+        ClassicTokenizer tokenizer = new ClassicTokenizer(version, reader);
+        tokenizer.setMaxTokenLength(maxTokenLength);
+        return tokenizer;
+    }
+}
@@ -0,0 +1,44 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.elasticsearch.index.analysis;
+
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.de.GermanNormalizationFilter;
+import org.elasticsearch.common.inject.Inject;
+import org.elasticsearch.common.inject.assistedinject.Assisted;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.settings.IndexSettings;
+
+/**
+ * Factory for {@link GermanNormalizationFilter}
+ */
+public class GermanNormalizationFilterFactory extends AbstractTokenFilterFactory {
+
+    @Inject
+    public GermanNormalizationFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
+        super(index, indexSettings, name, settings);
+    }
+
+    @Override
+    public TokenStream create(TokenStream tokenStream) {
+        return new GermanNormalizationFilter(tokenStream);
+    }
+
+}
@@ -0,0 +1,44 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.elasticsearch.index.analysis;
+
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.hi.HindiNormalizationFilter;
+import org.elasticsearch.common.inject.Inject;
+import org.elasticsearch.common.inject.assistedinject.Assisted;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.settings.IndexSettings;
+
+/**
+ * Factory for {@link HindiNormalizationFilter}
+ */
+public class HindiNormalizationFilterFactory extends AbstractTokenFilterFactory {
+
+    @Inject
+    public HindiNormalizationFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
+        super(index, indexSettings, name, settings);
+    }
+
+    @Override
+    public TokenStream create(TokenStream tokenStream) {
+        return new HindiNormalizationFilter(tokenStream);
+    }
+
+}
@@ -0,0 +1,44 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.elasticsearch.index.analysis;
+
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.in.IndicNormalizationFilter;
+import org.elasticsearch.common.inject.Inject;
+import org.elasticsearch.common.inject.assistedinject.Assisted;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.settings.IndexSettings;
+
+/**
+ * Factory for {@link IndicNormalizationFilter}
+ */
+public class IndicNormalizationFilterFactory extends AbstractTokenFilterFactory {
+
+    @Inject
+    public IndicNormalizationFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
+        super(index, indexSettings, name, settings);
+    }
+
+    @Override
+    public TokenStream create(TokenStream tokenStream) {
+        return new IndicNormalizationFilter(tokenStream);
+    }
+
+}
@@ -0,0 +1,50 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.index.analysis;
+
+import org.apache.lucene.analysis.ga.IrishAnalyzer;
+import org.apache.lucene.analysis.util.CharArraySet;
+import org.elasticsearch.common.inject.Inject;
+import org.elasticsearch.common.inject.assistedinject.Assisted;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.env.Environment;
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.settings.IndexSettings;
+
+/**
+ * Provider for {@link IrishAnalyzer}
+ */
+public class IrishAnalyzerProvider extends AbstractIndexAnalyzerProvider<IrishAnalyzer> {
+
+    private final IrishAnalyzer analyzer;
+
+    @Inject
+    public IrishAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
+        super(index, indexSettings, name, settings);
+        analyzer = new IrishAnalyzer(version,
+                Analysis.parseStopWords(env, settings, IrishAnalyzer.getDefaultStopSet(), version),
+                Analysis.parseStemExclusion(settings, CharArraySet.EMPTY_SET, version));
+    }
+
+    @Override
+    public IrishAnalyzer get() {
+        return this.analyzer;
+    }
+}
@@ -22,6 +22,7 @@ package org.elasticsearch.index.analysis;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.core.LowerCaseFilter;
 import org.apache.lucene.analysis.el.GreekLowerCaseFilter;
+import org.apache.lucene.analysis.ga.IrishLowerCaseFilter;
 import org.apache.lucene.analysis.tr.TurkishLowerCaseFilter;
 import org.elasticsearch.ElasticsearchIllegalArgumentException;
 import org.elasticsearch.common.inject.Inject;
@@ -31,7 +32,13 @@ import org.elasticsearch.index.Index;
 import org.elasticsearch.index.settings.IndexSettings;

 /**
- *
+ * Factory for {@link LowerCaseFilter} and some language-specific variants
+ * supported by the {@code language} parameter:
+ * <ul>
+ *   <li>greek: {@link GreekLowerCaseFilter}
+ *   <li>irish: {@link IrishLowerCaseFilter}
+ *   <li>turkish: {@link TurkishLowerCaseFilter}
+ * </ul>
  */
 public class LowerCaseTokenFilterFactory extends AbstractTokenFilterFactory {

@@ -49,6 +56,8 @@ public class LowerCaseTokenFilterFactory extends AbstractTokenFilterFactory {
             return new LowerCaseFilter(version, tokenStream);
         } else if (lang.equalsIgnoreCase("greek")) {
             return new GreekLowerCaseFilter(version, tokenStream);
+        } else if (lang.equalsIgnoreCase("irish")) {
+            return new IrishLowerCaseFilter(tokenStream);
         } else if (lang.equalsIgnoreCase("turkish")) {
             return new TurkishLowerCaseFilter(tokenStream);
         } else {
@@ -0,0 +1,44 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.elasticsearch.index.analysis;
+
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.miscellaneous.ScandinavianFoldingFilter;
+import org.elasticsearch.common.inject.Inject;
+import org.elasticsearch.common.inject.assistedinject.Assisted;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.settings.IndexSettings;
+
+/**
+ * Factory for {@link ScandinavianFoldingFilter}
+ */
+public class ScandinavianFoldingFilterFactory extends AbstractTokenFilterFactory {
+
+    @Inject
+    public ScandinavianFoldingFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
+        super(index, indexSettings, name, settings);
+    }
+
+    @Override
+    public TokenStream create(TokenStream tokenStream) {
+        return new ScandinavianFoldingFilter(tokenStream);
+    }
+
+}
ScandinavianNormalizationFilterFactory.java (new file):

@@ -0,0 +1,44 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.index.analysis;
+
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.miscellaneous.ScandinavianNormalizationFilter;
+import org.elasticsearch.common.inject.Inject;
+import org.elasticsearch.common.inject.assistedinject.Assisted;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.settings.IndexSettings;
+
+/**
+ * Factory for {@link ScandinavianNormalizationFilter}
+ */
+public class ScandinavianNormalizationFilterFactory extends AbstractTokenFilterFactory {
+
+    @Inject
+    public ScandinavianNormalizationFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
+        super(index, indexSettings, name, settings);
+    }
+
+    @Override
+    public TokenStream create(TokenStream tokenStream) {
+        return new ScandinavianNormalizationFilter(tokenStream);
+    }
+}
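Design note: the two Scandinavian factories are intentionally identical apart from the wrapped Lucene class. ScandinavianNormalizationFilter is the conservative sibling — it only unifies the interchangeable Norwegian/Danish (æ, ø) and Swedish (ä, ö) spellings of the same vowels, so cross-language matching works without flattening everything to ASCII. Swapping the filter into the sketch above shows the difference (same assumptions; expected form per Lucene's javadoc):

    TokenStream ts = new ScandinavianNormalizationFilter(
            new WhitespaceTokenizer(Version.LUCENE_48, new StringReader("blåbærsyltetøj")));
    // expected: blåbärsyltetöj — å/ä/ö survive, unlike with the folding variant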
SoraniAnalyzerProvider.java (new file):

@@ -0,0 +1,50 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.index.analysis;
+
+import org.apache.lucene.analysis.ckb.SoraniAnalyzer;
+import org.apache.lucene.analysis.util.CharArraySet;
+import org.elasticsearch.common.inject.Inject;
+import org.elasticsearch.common.inject.assistedinject.Assisted;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.env.Environment;
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.settings.IndexSettings;
+
+/**
+ * Provider for {@link SoraniAnalyzer}
+ */
+public class SoraniAnalyzerProvider extends AbstractIndexAnalyzerProvider<SoraniAnalyzer> {
+
+    private final SoraniAnalyzer analyzer;
+
+    @Inject
+    public SoraniAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
+        super(index, indexSettings, name, settings);
+        analyzer = new SoraniAnalyzer(version,
+                Analysis.parseStopWords(env, settings, SoraniAnalyzer.getDefaultStopSet(), version),
+                Analysis.parseStemExclusion(settings, CharArraySet.EMPTY_SET, version));
+    }
+
+    @Override
+    public SoraniAnalyzer get() {
+        return this.analyzer;
+    }
+}
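The provider follows the standard pattern: stopwords are overridable via settings (defaulting to Lucene's Sorani set), with an empty stem-exclusion list. To see what the wrapped analyzer does end to end, here is a hedged standalone sketch against Lucene's SoraniAnalyzer (Lucene 4.x assumed; the sample word and its stem are illustrative):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ckb.SoraniAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class SoraniAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new SoraniAnalyzer(Version.LUCENE_48);
            // tokens come out normalized, lowercased, stopword-filtered and stemmed
            TokenStream ts = analyzer.tokenStream("field", new StringReader("پیاوان"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term); // e.g. the plural suffix is stripped: پیاو
            }
            ts.end();
            ts.close();
        }
    }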
SoraniNormalizationFilterFactory.java (new file):

@@ -0,0 +1,44 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.index.analysis;
+
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.ckb.SoraniNormalizationFilter;
+import org.elasticsearch.common.inject.Inject;
+import org.elasticsearch.common.inject.assistedinject.Assisted;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.settings.IndexSettings;
+
+/**
+ * Factory for {@link SoraniNormalizationFilter}
+ */
+public class SoraniNormalizationFilterFactory extends AbstractTokenFilterFactory {
+
+    @Inject
+    public SoraniNormalizationFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
+        super(index, indexSettings, name, settings);
+    }
+
+    @Override
+    public TokenStream create(TokenStream tokenStream) {
+        return new SoraniNormalizationFilter(tokenStream);
+    }
+}
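Unlike the Scandinavian pair, this filter is mostly Unicode hygiene rather than linguistics: Sorani text in the wild mixes visually identical codepoints (Arabic ي U+064A vs Farsi ی U+06CC, Arabic ك U+0643 vs Farsi ک U+06A9, plus stray tatweel and zero-width non-joiners), and the normalizer rewrites them to one canonical form so the stopword set and stemmer see consistent input — which is why SoraniAnalyzer above runs it before lowercasing and stemming. Reusing the harness from the earlier sketches (the exact mapping table lives in Lucene's SoraniNormalizer; the sample word is illustrative):

    TokenStream ts = new SoraniNormalizationFilter(
            new WhitespaceTokenizer(Version.LUCENE_48, new StringReader("كوردي")));
    // expected: کوردی — Arabic-block kaf and yeh mapped to their Farsi-block forms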
StemmerTokenFilterFactory.java:

@@ -23,6 +23,7 @@ import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.ar.ArabicStemFilter;
 import org.apache.lucene.analysis.bg.BulgarianStemFilter;
 import org.apache.lucene.analysis.br.BrazilianStemFilter;
+import org.apache.lucene.analysis.ckb.SoraniStemFilter;
 import org.apache.lucene.analysis.cz.CzechStemFilter;
 import org.apache.lucene.analysis.de.GermanLightStemFilter;
 import org.apache.lucene.analysis.de.GermanMinimalStemFilter;
@@ -35,11 +36,15 @@ import org.apache.lucene.analysis.es.SpanishLightStemFilter;
 import org.apache.lucene.analysis.fi.FinnishLightStemFilter;
 import org.apache.lucene.analysis.fr.FrenchLightStemFilter;
 import org.apache.lucene.analysis.fr.FrenchMinimalStemFilter;
+import org.apache.lucene.analysis.gl.GalicianMinimalStemFilter;
+import org.apache.lucene.analysis.gl.GalicianStemFilter;
 import org.apache.lucene.analysis.hi.HindiStemFilter;
 import org.apache.lucene.analysis.hu.HungarianLightStemFilter;
 import org.apache.lucene.analysis.id.IndonesianStemFilter;
 import org.apache.lucene.analysis.it.ItalianLightStemFilter;
 import org.apache.lucene.analysis.lv.LatvianStemFilter;
+import org.apache.lucene.analysis.no.NorwegianLightStemFilter;
+import org.apache.lucene.analysis.no.NorwegianLightStemmer;
 import org.apache.lucene.analysis.no.NorwegianMinimalStemFilter;
 import org.apache.lucene.analysis.pt.PortugueseLightStemFilter;
 import org.apache.lucene.analysis.pt.PortugueseMinimalStemFilter;
@@ -138,6 +143,12 @@ public class StemmerTokenFilterFactory extends AbstractTokenFilterFactory {
         } else if ("minimal_french".equalsIgnoreCase(language) || "minimalFrench".equalsIgnoreCase(language)) {
             return new FrenchMinimalStemFilter(tokenStream);

+        // Galician stemmers
+        } else if ("galician".equalsIgnoreCase(language)) {
+            return new GalicianStemFilter(tokenStream);
+        } else if ("minimal_galician".equalsIgnoreCase(language)) {
+            return new GalicianMinimalStemFilter(tokenStream);
+
         // German stemmers
         } else if ("german".equalsIgnoreCase(language)) {
             return new SnowballFilter(tokenStream, new GermanStemmer());
@@ -162,6 +173,10 @@ public class StemmerTokenFilterFactory extends AbstractTokenFilterFactory {
         } else if ("indonesian".equalsIgnoreCase(language)) {
             return new IndonesianStemFilter(tokenStream);

+        // Irish stemmer
+        } else if ("irish".equalsIgnoreCase(language)) {
+            return new SnowballFilter(tokenStream, new IrishStemmer());
+
         // Italian stemmers
         } else if ("italian".equalsIgnoreCase(language)) {
             return new SnowballFilter(tokenStream, new ItalianStemmer());
@@ -171,12 +186,20 @@ public class StemmerTokenFilterFactory extends AbstractTokenFilterFactory {
         } else if ("latvian".equalsIgnoreCase(language)) {
             return new LatvianStemFilter(tokenStream);

-        // Norwegian stemmers
+        // Norwegian (Bokmål) stemmers
         } else if ("norwegian".equalsIgnoreCase(language)) {
             return new SnowballFilter(tokenStream, new NorwegianStemmer());
+        } else if ("light_norwegian".equalsIgnoreCase(language) || "lightNorwegian".equalsIgnoreCase(language)) {
+            return new NorwegianLightStemFilter(tokenStream);
         } else if ("minimal_norwegian".equalsIgnoreCase(language) || "minimalNorwegian".equals(language)) {
             return new NorwegianMinimalStemFilter(tokenStream);
+
+        // Norwegian (Nynorsk) stemmers
+        } else if ("light_nynorsk".equalsIgnoreCase(language) || "lightNynorsk".equalsIgnoreCase(language)) {
+            return new NorwegianLightStemFilter(tokenStream, NorwegianLightStemmer.NYNORSK);
+        } else if ("minimal_nynorsk".equalsIgnoreCase(language) || "minimalNynorsk".equalsIgnoreCase(language)) {
+            return new NorwegianMinimalStemFilter(tokenStream, NorwegianLightStemmer.NYNORSK);
+
         // Portuguese stemmers
         } else if ("portuguese".equalsIgnoreCase(language)) {
             return new SnowballFilter(tokenStream, new PortugueseStemmer());
@@ -202,6 +225,10 @@ public class StemmerTokenFilterFactory extends AbstractTokenFilterFactory {
         } else if ("light_spanish".equalsIgnoreCase(language) || "lightSpanish".equalsIgnoreCase(language)) {
             return new SpanishLightStemFilter(tokenStream);

+        // Sorani Kurdish stemmer
+        } else if ("sorani".equalsIgnoreCase(language)) {
+            return new SoraniStemFilter(tokenStream);
+
         // Swedish stemmers
         } else if ("swedish".equalsIgnoreCase(language)) {
             return new SnowballFilter(tokenStream, new SwedishStemmer());
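The nynorsk entries are the first in this dispatch to pass flags to a stemmer rather than selecting a different class: Lucene's NorwegianLightStemFilter and NorwegianMinimalStemFilter take a flags argument (Bokmål being the default), and the `light_nynorsk`/`minimal_nynorsk` names set NorwegianLightStemmer.NYNORSK. A hedged sketch of what the `light_nynorsk` branch wires up (Lucene 4.x assumed; the sample word is illustrative and no particular stem is asserted):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.no.NorwegianLightStemFilter;
    import org.apache.lucene.analysis.no.NorwegianLightStemmer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class NynorskStemDemo {
        public static void main(String[] args) throws Exception {
            // same filter class as "light_norwegian", but with the NYNORSK flag set
            TokenStream ts = new NorwegianLightStemFilter(
                    new WhitespaceTokenizer(Version.LUCENE_48, new StringReader("bilane")),
                    NorwegianLightStemmer.NYNORSK);
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // nynorsk definite plural; stemmed toward the noun root
                System.out.println(term);
            }
            ts.end();
            ts.close();
        }
    }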
ThaiTokenizerFactory.java (new file):

@@ -0,0 +1,46 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.index.analysis;
+
+import org.apache.lucene.analysis.Tokenizer;
+import org.apache.lucene.analysis.th.ThaiTokenizer;
+import org.elasticsearch.common.inject.Inject;
+import org.elasticsearch.common.inject.assistedinject.Assisted;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.settings.IndexSettings;
+
+import java.io.Reader;
+
+/**
+ * Factory for {@link ThaiTokenizer}
+ */
+public class ThaiTokenizerFactory extends AbstractTokenizerFactory {
+
+    @Inject
+    public ThaiTokenizerFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
+        super(index, indexSettings, name, settings);
+    }
+
+    @Override
+    public Tokenizer create(Reader reader) {
+        return new ThaiTokenizer(reader);
+    }
+}
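Unlike the deprecated ThaiWordFilter it supersedes, ThaiTokenizer segments at the tokenizer stage and produces correct offsets, which is the motivation for exposing it. A hedged sketch (Lucene 4.8+ assumed; the sample sentence is the one used in Lucene's own Thai tests, and segmentation requires a JRE with Thai BreakIterator support):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.th.ThaiTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public class ThaiTokenizerDemo {
        public static void main(String[] args) throws Exception {
            // Thai is written without spaces; the tokenizer finds word boundaries
            Tokenizer ts = new ThaiTokenizer(new StringReader("การที่ได้ต้องแสดงว่างานดี"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // expected: การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี — with real offsets
                System.out.println(term + " [" + offset.startOffset() + "," + offset.endOffset() + "]");
            }
            ts.end();
            ts.close();
        }
    }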
PreBuiltAnalyzers.java:

@@ -24,6 +24,7 @@ import org.apache.lucene.analysis.bg.BulgarianAnalyzer;
 import org.apache.lucene.analysis.br.BrazilianAnalyzer;
 import org.apache.lucene.analysis.ca.CatalanAnalyzer;
 import org.apache.lucene.analysis.cjk.CJKAnalyzer;
+import org.apache.lucene.analysis.ckb.SoraniAnalyzer;
 import org.apache.lucene.analysis.cn.ChineseAnalyzer;
 import org.apache.lucene.analysis.core.KeywordAnalyzer;
 import org.apache.lucene.analysis.core.SimpleAnalyzer;
@@ -349,6 +350,13 @@ public enum PreBuiltAnalyzers {
         }
     },

+    SORANI {
+        @Override
+        protected Analyzer create(Version version) {
+            return new SoraniAnalyzer(version.luceneVersion);
+        }
+    },
+
     SPANISH {
         @Override
         protected Analyzer create(Version version) {
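A note on how this surfaces to users: PreBuiltAnalyzers backs the analyzers that work with no index configuration at all, so `"analyzer": "sorani"` now resolves out of the box. A hedged sketch of the lookup path (the getAnalyzer(Version) accessor and its per-version caching are assumptions based on the surrounding ES 1.x code, not shown in this diff):

    import org.apache.lucene.analysis.Analyzer;
    import org.elasticsearch.Version;
    import org.elasticsearch.indices.analysis.PreBuiltAnalyzers;

    public class PreBuiltLookupDemo {
        public static void main(String[] args) {
            // "sorani" in index settings resolves to the enum constant by name
            PreBuiltAnalyzers preBuilt = PreBuiltAnalyzers.valueOf("sorani".toUpperCase());
            // assumed accessor: builds (or reuses) a SoraniAnalyzer for this version
            Analyzer analyzer = preBuilt.getAnalyzer(Version.CURRENT);
            System.out.println(analyzer.getClass().getSimpleName()); // SoraniAnalyzer
        }
    }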
AnalysisFactoryTests.java:

@@ -43,6 +43,7 @@ public class AnalysisFactoryTests extends ElasticsearchTestCase {
         put("russianletter", Deprecated.class);

         // exposed in ES
+        put("classic", ClassicTokenizerFactory.class);
         put("edgengram", EdgeNGramTokenizerFactory.class);
         put("keyword", KeywordTokenizerFactory.class);
         put("letter", LetterTokenizerFactory.class);
@@ -51,16 +52,10 @@ public class AnalysisFactoryTests extends ElasticsearchTestCase {
         put("pathhierarchy", PathHierarchyTokenizerFactory.class);
         put("pattern", PatternTokenizerFactory.class);
         put("standard", StandardTokenizerFactory.class);
+        put("thai", ThaiTokenizerFactory.class);
         put("uax29urlemail", UAX29URLEmailTokenizerFactory.class);
         put("whitespace", WhitespaceTokenizerFactory.class);

         // TODO: these tokenizers are not yet exposed: useful?
-
-        // historical version of standardtokenizer... tries to recognize
-        // company names and a few other things. not good for asian languages etc.
-        put("classic", Void.class);
-        // we should add this, the thaiwordfilter is deprecated. this one has correct offsets
-        put("thai", Void.class);
         // this one "seems to mess up offsets". probably shouldn't be a tokenizer...
         put("wikipedia", Void.class);
     }};
@@ -80,6 +75,7 @@ public class AnalysisFactoryTests extends ElasticsearchTestCase {


         // exposed in ES
+        put("apostrophe", ApostropheFilterFactory.class);
         put("arabicnormalization", ArabicNormalizationFilterFactory.class);
         put("arabicstem", ArabicStemTokenFilterFactory.class);
         put("asciifolding", ASCIIFoldingTokenFilterFactory.class);
@@ -87,6 +83,7 @@ public class AnalysisFactoryTests extends ElasticsearchTestCase {
         put("bulgarianstem", StemmerTokenFilterFactory.class);
         put("cjkbigram", CJKBigramFilterFactory.class);
         put("cjkwidth", CJKWidthFilterFactory.class);
+        put("classic", ClassicFilterFactory.class);
         put("commongrams", CommonGramsTokenFilterFactory.class);
         put("commongramsquery", CommonGramsTokenFilterFactory.class);
         put("czechstem", CzechStemTokenFilterFactory.class);
@@ -99,16 +96,21 @@ public class AnalysisFactoryTests extends ElasticsearchTestCase {
         put("finnishlightstem", StemmerTokenFilterFactory.class);
         put("frenchlightstem", StemmerTokenFilterFactory.class);
         put("frenchminimalstem", StemmerTokenFilterFactory.class);
+        put("galicianminimalstem", StemmerTokenFilterFactory.class);
+        put("galicianstem", StemmerTokenFilterFactory.class);
         put("germanstem", GermanStemTokenFilterFactory.class);
         put("germanlightstem", StemmerTokenFilterFactory.class);
         put("germanminimalstem", StemmerTokenFilterFactory.class);
+        put("germannormalization", GermanNormalizationFilterFactory.class);
         put("greeklowercase", LowerCaseTokenFilterFactory.class);
         put("greekstem", StemmerTokenFilterFactory.class);
-        put("hindistem", StemmerTokenFilterFactory.class);
+        put("hindinormalization", HindiNormalizationFilterFactory.class);
         put("hindistem", StemmerTokenFilterFactory.class);
         put("hungarianlightstem", StemmerTokenFilterFactory.class);
         put("hunspellstem", HunspellTokenFilterFactory.class);
         put("hyphenationcompoundword", HyphenationCompoundWordTokenFilterFactory.class);
+        put("indicnormalization", IndicNormalizationFilterFactory.class);
+        put("irishlowercase", LowerCaseTokenFilterFactory.class);
         put("indonesianstem", StemmerTokenFilterFactory.class);
         put("italianlightstem", StemmerTokenFilterFactory.class);
         put("keepword", KeepWordFilterFactory.class);
@@ -119,17 +121,23 @@ public class AnalysisFactoryTests extends ElasticsearchTestCase {
         put("limittokencount", LimitTokenCountFilterFactory.class);
         put("lowercase", LowerCaseTokenFilterFactory.class);
         put("ngram", NGramTokenFilterFactory.class);
+        put("norwegianlightstem", StemmerTokenFilterFactory.class);
         put("norwegianminimalstem", StemmerTokenFilterFactory.class);
         put("patterncapturegroup", PatternCaptureGroupTokenFilterFactory.class);
         put("patternreplace", PatternReplaceTokenFilterFactory.class);
         put("persiannormalization", PersianNormalizationFilterFactory.class);
         put("porterstem", PorterStemTokenFilterFactory.class);
+        put("portuguesestem", StemmerTokenFilterFactory.class);
         put("portugueselightstem", StemmerTokenFilterFactory.class);
         put("portugueseminimalstem", StemmerTokenFilterFactory.class);
         put("reversestring", ReverseTokenFilterFactory.class);
         put("russianlightstem", StemmerTokenFilterFactory.class);
+        put("scandinavianfolding", ScandinavianFoldingFilterFactory.class);
+        put("scandinaviannormalization", ScandinavianNormalizationFilterFactory.class);
         put("shingle", ShingleTokenFilterFactory.class);
         put("snowballporter", SnowballTokenFilterFactory.class);
+        put("soraninormalization", SoraniNormalizationFilterFactory.class);
+        put("soranistem", StemmerTokenFilterFactory.class);
         put("spanishlightstem", StemmerTokenFilterFactory.class);
         put("standard", StandardTokenFilterFactory.class);
         put("stemmeroverride", StemmerOverrideTokenFilterFactory.class);
@@ -144,46 +152,20 @@ public class AnalysisFactoryTests extends ElasticsearchTestCase {

         // TODO: these tokenfilters are not yet exposed: useful?

-        // useful for turkish language
-        put("apostrophe", Void.class);
         // capitalizes tokens
         put("capitalization", Void.class);
-        // cleans up after classic tokenizer
-        put("classic", Void.class);
         // like length filter (but codepoints)
         put("codepointcount", Void.class);
-        // galician language stemmers
-        put("galicianminimalstem", Void.class);
-        put("galicianstem", Void.class);
-        // o+umlaut=oe type normalization for german
-        put("germannormalization", Void.class);
-        // hindi text normalization
-        put("hindinormalization", Void.class);
         // puts hyphenated words back together
         put("hyphenatedwords", Void.class);
-        // unicode normalization for indian languages
-        put("indicnormalization", Void.class);
-        // lowercasing for irish: add to LowerCase (has a stemmer, too)
-        put("irishlowercase", Void.class);
         // repeats anything marked as keyword
         put("keywordrepeat", Void.class);
         // like limittokencount, but by position
         put("limittokenposition", Void.class);
         // ???
         put("numericpayload", Void.class);
-        // RSLP stemmer for portuguese
-        put("portuguesestem", Void.class);
-        // light stemming for norwegian (has nb/nn options too)
-        put("norwegianlightstem", Void.class);
         // removes duplicates at the same position (this should be used by the existing factory)
         put("removeduplicates", Void.class);
-        // accent handling for scandinavian languages
-        put("scandinavianfolding", Void.class);
-        // less aggressive accent handling for scandinavian languages
-        put("scandinaviannormalization", Void.class);
-        // kurdish language support
-        put("soraninormalization", Void.class);
-        put("soranistem", Void.class);
         // ???
         put("tokenoffsetpayload", Void.class);
         // like a stop filter but by token-type