ICU Analysis for ElasticSearch
==================================

The ICU Analysis plugin integrates the Lucene ICU module into ElasticSearch, adding ICU-related analysis components.

To install the plugin, run: `bin/plugin -install elasticsearch/elasticsearch-analysis-icu/1.5.0`.

    ----------------------------------------
    | ICU Analysis Plugin | ElasticSearch  |
    ----------------------------------------
    | master              | 0.19 -> master |
    ----------------------------------------
    | 1.5.0               | 0.19 -> master |
    ----------------------------------------
    | 1.4.0               | 0.19 -> master |
    ----------------------------------------
    | 1.3.0               | 0.19 -> master |
    ----------------------------------------
    | 1.2.0               | 0.19 -> master |
    ----------------------------------------
    | 1.1.0               | 0.18           |
    ----------------------------------------
    | 1.0.0               | 0.18           |
    ----------------------------------------

ICU Normalization
-----------------

Normalizes characters as explained "here":http://userguide.icu-project.org/transforms/normalization. It registers itself by default under @icu_normalizer@ or @icuNormalizer@ using the default settings. The optional @name@ parameter selects the normalization form and accepts the following values: @nfc@, @nfkc@, and @nfkc_cf@. Sample settings:

    {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "collation" : {
                        "tokenizer" : "keyword",
                        "filter" : ["icu_normalizer"]
                    }
                }
            }
        }
    }

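The @name@ parameter can be set by declaring a custom filter of type @icu_normalizer@. The following is a minimal sketch rather than a reference configuration: it assumes the filter is declared under the @filter@ section in the same way as other custom filters, and the names @myNormalizer@ and @normalized@ are hypothetical.

```json
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalized" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myNormalizer"]
                }
            },
            "filter" : {
                "myNormalizer" : {
                    "type" : "icu_normalizer",
                    "name" : "nfkc"
                }
            }
        }
    }
}
```
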
ICU Folding
-----------

Folds Unicode characters based on @UTR#30@. It registers itself under the @icu_folding@ and @icuFolding@ names. Sample settings:

    {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "collation" : {
                        "tokenizer" : "keyword",
                        "filter" : ["icu_folding"]
                    }
                }
            }
        }
    }

ICU Collation
-------------

Uses the collation token filter. The collation can be specified in one of two ways: through the @rules@ parameter, which holds custom collation rules (defined "here":http://www.icu-project.org/userguide/Collate_Customization.html) given either inline in the settings or as a file location (which can be relative to the config location), or through the @language@ parameter (further specialized by country and variant). By default it registers under @icu_collation@ or @icuCollation@ and uses the default locale.

Sample settings:

    {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "collation" : {
                        "tokenizer" : "keyword",
                        "filter" : ["icu_collation"]
                    }
                }
            }
        }
    }

And here is a sample of a custom collation filter:

    {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "collation" : {
                        "tokenizer" : "keyword",
                        "filter" : ["myCollator"]
                    }
                },
                "filter" : {
                    "myCollator" : {
                        "type" : "icu_collation",
                        "language" : "en"
                    }
                }
            }
        }
    }

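The remaining collation options can be sketched in the same shape. This is an illustration only, under stated assumptions: @country@ is assumed to be the setting key for the country specialization, the inline rule string @&c < ch@ is a made-up tailoring (it sorts "ch" after "c"), and the filter names @germanCollator@ and @customRulesCollator@ are hypothetical.

```json
{
    "index" : {
        "analysis" : {
            "filter" : {
                "germanCollator" : {
                    "type" : "icu_collation",
                    "language" : "de",
                    "country" : "DE"
                },
                "customRulesCollator" : {
                    "type" : "icu_collation",
                    "rules" : "&c < ch"
                }
            }
        }
    }
}
```

Either filter would then be referenced by name from an analyzer's @filter@ list, as @myCollator@ is in the sample of a custom collation filter.
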
ICU Tokenizer
-------------

Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/). Sample settings:

    {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "collation" : {
                        "tokenizer" : "icu_tokenizer"
                    }
                }
            }
        }
    }

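The tokenizer can also be combined with the token filters described in the earlier sections. A minimal sketch, assuming the ICU components chain like any other tokenizer and filter; the analyzer name @icu_text@ is hypothetical:

```json
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "icu_text" : {
                    "tokenizer" : "icu_tokenizer",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
```
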
License
-------

This software is licensed under the Apache 2 license, quoted below.

Copyright 2009-2011 Shay Banon and ElasticSearch <http://www.elasticsearch.org>

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.