[[analysis-icu-plugin]]
== ICU Analysis Plugin

The http://icu-project.org/[ICU] analysis plugin adds Unicode
normalization, collation, and folding. The plugin is called
https://github.com/elasticsearch/elasticsearch-analysis-icu[elasticsearch-analysis-icu].

The plugin includes the following analysis components:

[float]
=== ICU Normalization

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. It
registers itself by default under `icu_normalizer` or `icuNormalizer`
using the default settings. The optional `name` parameter selects the
normalization form and accepts the following values: `nfc`, `nfkc`, and
`nfkc_cf`. Here is a sample configuration:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
--------------------------------------------------

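The sample above uses the default registration only. As a minimal sketch
of how the `name` parameter might be set, the following settings define a
custom filter of type `icu_normalizer`. The filter name
`my_nfkc_normalizer` and the analyzer name `my_nfkc_analyzer` are
hypothetical, chosen only for illustration:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "my_nfkc_analyzer" : {
                    "tokenizer" : "keyword",
                    "filter" : ["my_nfkc_normalizer"]
                }
            },
            "filter" : {
                "my_nfkc_normalizer" : {
                    "type" : "icu_normalizer",
                    "name" : "nfkc"
                }
            }
        }
    }
}
--------------------------------------------------
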
[float]
=== ICU Folding

Folds Unicode characters based on `UTR#30`. It registers itself
under the names `icu_folding` and `icuFolding`.
The filter also lowercases, so the `lowercase` filter can
normally be left out. Sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
--------------------------------------------------

[float]
==== Filtering

Folding can be restricted to a set of Unicode characters with the
`unicodeSetFilter` parameter. This is useful for a non-internationalized
search engine that needs to retain a set of national characters that are
primary letters in a specific language. See the UnicodeSet syntax
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here].

The following example exempts Swedish characters from folding. Note
that the filtered characters are NOT lowercased, which is why the
`lowercase` filter is added after the folding filter.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            },
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
=== ICU Collation

Uses the collation token filter. The collation can be specified either
through the `rules` parameter, with rules as defined
http://www.icu-project.org/userguide/Collate_Customization.html[here]
(the value can be given inline in the settings or point to a location,
which may be relative to the config location), or through the
`language` parameter (further specialized by country and variant). By
default it registers under `icu_collation` or `icuCollation` and uses the
default locale.

Here is a sample configuration:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_collation"]
                }
            }
        }
    }
}
--------------------------------------------------

Here is a sample of a custom collation:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en"
                }
            }
        }
    }
}
--------------------------------------------------

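The collation description mentions that `language` can be further
specialized by country and variant. A sketch of what that could look
like follows. The parameter names `country` and `variant`, the
`@collation=phonebook` value, and the names `my_german_collation` and
`my_german_collator` are assumptions for illustration and should be
checked against the plugin version in use:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "my_german_collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["my_german_collator"]
                }
            },
            "filter" : {
                "my_german_collator" : {
                    "type" : "icu_collation",
                    "language" : "de",
                    "country" : "DE",
                    "variant" : "@collation=phonebook"
                }
            }
        }
    }
}
--------------------------------------------------
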