OpenSearch/docs/reference/analysis/icu-plugin.asciidoc

247 lines
7.8 KiB
Plaintext

[[analysis-icu-plugin]]
== ICU Analysis Plugin
The http://icu-project.org/[ICU] analysis plugin allows for unicode
normalization, collation and folding. The plugin is called
https://github.com/elasticsearch/elasticsearch-analysis-icu[elasticsearch-analysis-icu].
The plugin includes the following analysis components:
[float]
[[icu-normalization]]
=== ICU Normalization
Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. It
registers itself by default under `icu_normalizer` or `icuNormalizer`
using the default settings. Allows for the name parameter to be provided
which can include the following values: `nfc`, `nfkc`, and `nfkc_cf`.
Here is a sample settings:
[source,js]
--------------------------------------------------
{
"index" : {
"analysis" : {
"analyzer" : {
"normalization" : {
"tokenizer" : "keyword",
"filter" : ["icu_normalizer"]
}
}
}
}
}
--------------------------------------------------
[float]
[[icu-folding]]
=== ICU Folding
Folding of unicode characters based on `UTR#30`. It registers itself
under `icu_folding` and `icuFolding` names.
The filter also does lowercasing, which means the lowercase filter can
normally be left out. Sample setting:
[source,js]
--------------------------------------------------
{
"index" : {
"analysis" : {
"analyzer" : {
"folding" : {
"tokenizer" : "keyword",
"filter" : ["icu_folding"]
}
}
}
}
}
--------------------------------------------------
[float]
[[icu-filtering]]
==== Filtering
The folding can be filtered by a set of unicode characters with the
parameter `unicodeSetFilter`. This is useful for a non-internationalized
search engine where retaining a set of national characters which are
primary letters in a specific language is wanted. See syntax for the
UnicodeSet
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here].
The Following example exempts Swedish characters from the folding. Note
that the filtered characters are NOT lowercased which is why we add that
filter below.
[source,js]
--------------------------------------------------
{
"index" : {
"analysis" : {
"analyzer" : {
"folding" : {
"tokenizer" : "standard",
"filter" : ["my_icu_folding", "lowercase"]
}
}
"filter" : {
"my_icu_folding" : {
"type" : "icu_folding"
"unicodeSetFilter" : "[^åäöÅÄÖ]"
}
}
}
}
}
--------------------------------------------------
[float]
[[icu-collation]]
=== ICU Collation
Uses collation token filter. Allows to either specify the rules for
collation (defined
http://www.icu-project.org/userguide/Collate_Customization.html[here])
using the `rules` parameter (can point to a location or expressed in the
settings, location can be relative to config location), or using the
`language` parameter (further specialized by country and variant). By
default registers under `icu_collation` or `icuCollation` and uses the
default locale.
Here is a sample settings:
[source,js]
--------------------------------------------------
{
"index" : {
"analysis" : {
"analyzer" : {
"collation" : {
"tokenizer" : "keyword",
"filter" : ["icu_collation"]
}
}
}
}
}
--------------------------------------------------
And here is a sample of custom collation:
[source,js]
--------------------------------------------------
{
"index" : {
"analysis" : {
"analyzer" : {
"collation" : {
"tokenizer" : "keyword",
"filter" : ["myCollator"]
}
},
"filter" : {
"myCollator" : {
"type" : "icu_collation",
"language" : "en"
}
}
}
}
}
--------------------------------------------------
[float]
==== Options
[horizontal]
`strength`::
The strength property determines the minimum level of difference considered significant during comparison.
The default strength for the Collator is `tertiary`, unless specified otherwise by the locale used to create the Collator.
Possible values: `primary`, `secondary`, `tertiary`, `quaternary` or `identical`.
+
See http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation] documentation for a more detailed
explanation for the specific values.
`decomposition`::
Possible values: `no` or `canonical`. Defaults to `no`. Setting this decomposition property with
`canonical` allows the Collator to handle un-normalized text properly, producing the same results as if the text were
normalized. If `no` is set, it is the user's responsibility to insure that all text is already in the appropriate form
before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between
faster and more complete collation behavior. Since a great many of the world's languages do not require text
normalization, most locales set `no` as the default decomposition mode.
[float]
==== Expert options:
[horizontal]
`alternate`::
Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for strength `quaternary`
to be either shifted or non-ignorable. What boils down to ignoring punctuation and whitespace.
`caseLevel`::
Possible values: `true` or `false`. Default is `false`. Whether case level sorting is required. When
strength is set to `primary` this will ignore accent differences.
`caseFirst`::
Possible values: `lower` or `upper`. Useful to control which case is sorted first when case is not ignored
for strength `tertiary`.
`numeric`::
Possible values: `true` or `false`. Whether digits are sorted according to numeric representation. For
example the value `egg-9` is sorted before the value `egg-21`. Defaults to `false`.
`variableTop`::
Single character or contraction. Controls what is variable for `alternate`.
`hiraganaQuaternaryMode`::
Possible values: `true` or `false`. Defaults to `false`. Distinguishing between Katakana and
Hiragana characters in `quaternary` strength .
[float]
=== ICU Tokenizer
Breaks text into words according to UAX #29: Unicode Text Segmentation ((http://www.unicode.org/reports/tr29/)).
[source,js]
--------------------------------------------------
{
"index" : {
"analysis" : {
"analyzer" : {
"collation" : {
"tokenizer" : "icu_tokenizer",
}
}
}
}
}
--------------------------------------------------
[float]
=== ICU Normalization CharFilter
Normalizes characters as explained http://userguide.icu-project.org/transforms/normalization[here].
It registers itself by default under `icu_normalizer` or `icuNormalizer` using the default settings.
Allows for the name parameter to be provided which can include the following values: `nfc`, `nfkc`, and `nfkc_cf`.
Allows for the mode parameter to be provided which can include the following values: `compose` and `decompose`.
Use `decompose` with `nfc` or `nfkc`, to get `nfd` or `nfkd`, respectively.
Here is a sample settings:
[source,js]
--------------------------------------------------
{
"index" : {
"analysis" : {
"analyzer" : {
"collation" : {
"tokenizer" : "keyword",
"char_filter" : ["icu_normalizer"]
}
}
}
}
}
--------------------------------------------------