OpenSearch/README.md

116 lines
3.6 KiB
Markdown
Raw Normal View History

2011-12-05 06:31:59 -05:00
ICU Analysis for ElasticSearch
==================================
The ICU Analysis plugin integrates Lucene ICU module into elasticsearch, adding ICU relates analysis components.
2012-02-07 08:23:36 -05:00
In order to install the plugin, simply run: `bin/plugin -install elasticsearch/elasticsearch-analysis-icu/1.2.0`.
2011-12-05 06:31:59 -05:00
2011-12-10 18:40:31 -05:00
----------------------------------------
| ICU Analysis Plugin | ElasticSearch |
----------------------------------------
2012-02-07 08:23:36 -05:00
| master | 0.19 -> master |
----------------------------------------
| 1.2.0 | 0.19 -> master |
2011-12-10 18:40:31 -05:00
----------------------------------------
2011-12-13 08:27:47 -05:00
| 1.1.0 | 0.18 -> master |
----------------------------------------
2011-12-10 18:40:31 -05:00
| 1.0.0 | 0.18 -> master |
----------------------------------------
2011-12-05 06:31:59 -05:00
2011-12-13 08:27:47 -05:00
ICU Normalization
-----------------
Normalizes characters as explained "here":http://userguide.icu-project.org/transforms/normalization. It registers itself by default under @icu_normalizer@ or @icuNormalizer@ using the default settings. Allows for the name parameter to be provided which can include the following values: @nfc@, @nfkc@, and @nfkc_cf@. Here is a sample settings:
{
"index" : {
"analysis" : {
"analyzer" : {
"collation" : {
"tokenizer" : "keyword",
"filter" : ["icu_normalizer"]
}
}
}
}
}
ICU Folding
-----------
Folding of unicode characters based on @UTR#30@. It registers itself under @icu_folding@ and @icuFolding@ names. Sample setting:
{
"index" : {
"analysis" : {
"analyzer" : {
"collation" : {
"tokenizer" : "keyword",
"filter" : ["icu_folding"]
}
}
}
}
}
ICU Collation
-------------
Uses collation token filter. Allows to either specify the rules for collation (defined "here":http://www.icu-project.org/userguide/Collate_Customization.html) using the @rules@ parameter (can point to a location or expressed in the settings, location can be relative to config location), or using the @language@ parameter (further specialized by country and variant). By default registers under @icu_collation@ or @icuCollation@ and uses the default locale.
Here is a sample settings:
{
"index" : {
"analysis" : {
"analyzer" : {
"collation" : {
"tokenizer" : "keyword",
"filter" : ["icu_collation"]
}
}
}
}
}
And here is a sample of custom collation:
{
"index" : {
"analysis" : {
"analyzer" : {
"collation" : {
"tokenizer" : "keyword",
"filter" : ["myCollator"]
}
},
"filter" : {
"myCollator" : {
"type" : "icu_collation",
"language" : "en"
}
}
}
}
}
ICU Tokenizer
-------------
Breaks text into words according to UAX #29: Unicode Text Segmentation ((http://www.unicode.org/reports/tr29/)).
{
"index" : {
"analysis" : {
"analyzer" : {
"collation" : {
"tokenizer" : "icu_tokenizer",
}
}
}
}
}