
[[analysis-icu-plugin]]
== ICU Analysis Plugin

The http://icu-project.org/[ICU] analysis plugin provides Unicode
normalization, collation, and folding. The plugin is called
https://github.com/elasticsearch/elasticsearch-analysis-icu[elasticsearch-analysis-icu].

The plugin includes the following analysis components:

[float]
[[icu-normalization]]
=== ICU Normalization

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. The
filter is registered by default under the names `icu_normalizer` and
`icuNormalizer`, using the default settings. The optional `name`
parameter selects the normalization form and accepts the values `nfc`,
`nfkc`, and `nfkc_cf`. Sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
--------------------------------------------------
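
As a minimal sketch of the `name` parameter described above, a custom
filter can select a specific normalization form. The filter name
`nfc_normalizer` and the analyzer name `nfc_normalization` are
hypothetical, chosen for illustration only:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "nfc_normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["nfc_normalizer"]
                }
            },
            "filter" : {
                "nfc_normalizer" : {
                    "type" : "icu_normalizer",
                    "name" : "nfc"
                }
            }
        }
    }
}
--------------------------------------------------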

[float]
[[icu-folding]]
=== ICU Folding

Folds Unicode characters based on `UTR#30`. The filter is registered
under the names `icu_folding` and `icuFolding`.
It also lowercases, which means the `lowercase` filter can
normally be left out. Sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
--------------------------------------------------

[float]
[[icu-filtering]]
==== Filtering

Folding can be restricted to a set of Unicode characters with the
`unicodeSetFilter` parameter. This is useful for a non-internationalized
search engine that needs to retain national characters which are
primary letters in a specific language. See the UnicodeSet syntax
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here].

The following example exempts Swedish characters from folding. Note
that the exempted characters are NOT lowercased, which is why the
`lowercase` filter is added after the folding filter.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            },
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
[[icu-collation]]
=== ICU Collation

Uses the collation token filter. Collation can be configured either by
specifying rules (defined
http://www.icu-project.org/userguide/Collate_Customization.html[here])
with the `rules` parameter, or by using the `language` parameter
(further specialized by country and variant). The `rules` value can be
given inline in the settings or point to a file location, which may be
relative to the config location. By default the filter is registered
under `icu_collation` and `icuCollation` and uses the default locale.

Sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_collation"]
                }
            }
        }
    }
}
--------------------------------------------------

And here is a sample of custom collation:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
==== Options

[horizontal]
`strength`::
    The strength property determines the minimum level of difference considered significant during comparison.
     The default strength for the Collator is `tertiary`, unless specified otherwise by the locale used to create the Collator.
     Possible values: `primary`, `secondary`, `tertiary`, `quaternary` or `identical`.
 +
 See http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation] documentation for a more detailed
 explanation for the specific values.

`decomposition`::
    Possible values: `no` or `canonical`. Defaults to `no`. Setting this decomposition property with
    `canonical` allows the Collator to handle un-normalized text properly, producing the same results as if the text were
    normalized. If `no` is set, it is the user's responsibility to ensure that all text is already in the appropriate form
    before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between
    faster and more complete collation behavior. Since a great many of the world's languages do not require text
    normalization, most locales set `no` as the default decomposition mode.
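
As a hedged illustration of the two options above, the following sketch
configures a collation filter that compares at `primary` strength and
handles un-normalized input. The filter name `primaryCollator` and the
analyzer name are hypothetical:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "primary_collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["primaryCollator"]
                }
            },
            "filter" : {
                "primaryCollator" : {
                    "type" : "icu_collation",
                    "language" : "en",
                    "strength" : "primary",
                    "decomposition" : "canonical"
                }
            }
        }
    }
}
--------------------------------------------------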

[float]
==== Expert options

[horizontal]
`alternate`::
     Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for strength `quaternary`
     to be either shifted or non-ignorable. This boils down to whether punctuation and whitespace are ignored.

`caseLevel`::
    Possible values: `true` or `false`. Default is `false`. Whether case level sorting is required. When
     strength is set to `primary` this will ignore accent differences.

`caseFirst`::
    Possible values: `lower` or `upper`. Useful to control which case is sorted first when case is not ignored
    for strength `tertiary`.

`numeric`::
    Possible values: `true` or `false`. Defaults to `false`. Whether digits are sorted according to their numeric
    value. For example, when set to `true` the value `egg-9` is sorted before the value `egg-21`.

`variableTop`::
    Single character or contraction. Controls what is variable for `alternate`.

`hiraganaQuaternaryMode`::
    Possible values: `true` or `false`. Defaults to `false`. Whether to distinguish between Katakana and
    Hiragana characters at `quaternary` strength.
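
Expert options are set on the filter definition in the same way as the
other parameters. As a sketch, assuming a hypothetical filter named
`numericCollator`, the following enables numeric sorting so that `egg-9`
sorts before `egg-21`:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "numeric_collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["numericCollator"]
                }
            },
            "filter" : {
                "numericCollator" : {
                    "type" : "icu_collation",
                    "language" : "en",
                    "numeric" : true
                }
            }
        }
    }
}
--------------------------------------------------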

[float]
=== ICU Tokenizer

Breaks text into words according to http://www.unicode.org/reports/tr29/[UAX #29: Unicode Text Segmentation].

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "icu_tokenizer"
                }
            }
        }
    }
}
--------------------------------------------------
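
The tokenizer can be combined with the token filters this plugin
provides. The following is only a sketch, with a hypothetical analyzer
name `icu_text`, that tokenizes with `icu_tokenizer` and then applies
`icu_folding`:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "icu_text" : {
                    "tokenizer" : "icu_tokenizer",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
--------------------------------------------------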


[float]
=== ICU Normalization CharFilter

Normalizes characters as explained http://userguide.icu-project.org/transforms/normalization[here].
The char filter is registered by default under the names `icu_normalizer` and `icuNormalizer`, using the default settings.
The optional `name` parameter accepts the values `nfc`, `nfkc`, and `nfkc_cf`.
The optional `mode` parameter accepts the values `compose` and `decompose`.
Use `decompose` with `nfc` or `nfkc` to get `nfd` or `nfkd`, respectively.
Sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "char_filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
--------------------------------------------------
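
To illustrate the `name` and `mode` parameters together, the following
sketch defines a char filter that should yield `nfkd` normalization by
combining `nfkc` with `decompose`. The char filter name
`nfkd_normalizer` and the analyzer name are hypothetical:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "nfkd_normalization" : {
                    "tokenizer" : "keyword",
                    "char_filter" : ["nfkd_normalizer"]
                }
            },
            "char_filter" : {
                "nfkd_normalizer" : {
                    "type" : "icu_normalizer",
                    "name" : "nfkc",
                    "mode" : "decompose"
                }
            }
        }
    }
}
--------------------------------------------------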