[[analysis-icu-plugin]]
== ICU Analysis Plugin

The http://icu-project.org/[ICU] analysis plugin provides Unicode
normalization, collation, and folding. The plugin is called
https://github.com/elasticsearch/elasticsearch-analysis-icu[elasticsearch-analysis-icu].

The plugin includes the following analysis components:

[float]
=== ICU Normalization

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. The
filter registers itself under `icu_normalizer` or `icuNormalizer` with
the default settings. The optional `name` parameter selects the
normalization form and accepts the following values: `nfc`, `nfkc`, and
`nfkc_cf`. Here are sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
--------------------------------------------------
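
To choose a specific normalization form, a custom filter of type
`icu_normalizer` can be defined with the `name` parameter described
above. The following is a minimal sketch, not a recommended
configuration: the filter name `my_icu_normalizer` is a hypothetical
label chosen for illustration, and `nfc` is just one of the accepted
values.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalization" : {
                    "tokenizer" : "keyword",
                    "filter" : ["my_icu_normalizer"]
                }
            },
            "filter" : {
                "my_icu_normalizer" : {
                    "type" : "icu_normalizer",
                    "name" : "nfc"
                }
            }
        }
    }
}
--------------------------------------------------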

[float]
=== ICU Folding

Folds Unicode characters based on `UTR#30`. It registers itself
under the `icu_folding` and `icuFolding` names.
The filter also lowercases, which means the lowercase filter can
normally be left out. Sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
--------------------------------------------------

[float]
==== Filtering

Folding can be restricted to a set of Unicode characters with the
`unicodeSetFilter` parameter. This is useful for a non-internationalized
search engine that needs to retain a set of national characters which
are primary letters in a specific language. See the UnicodeSet syntax
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here].

The following example exempts Swedish characters from folding. Note
that the exempted characters are NOT lowercased, which is why the
`lowercase` filter is added after the folding filter.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            },
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
=== ICU Collation

Uses the collation token filter. Collation can be configured in one of
two ways: by specifying collation rules (defined
http://www.icu-project.org/userguide/Collate_Customization.html[here])
with the `rules` parameter, which can either hold the rules inline in
the settings or point to a location (optionally relative to the config
location), or by using the `language` parameter (further specialized by
country and variant). By default the filter registers under
`icu_collation` or `icuCollation` and uses the default locale.

Here are sample settings:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_collation"]
                }
            }
        }
    }
}
--------------------------------------------------

And here is a sample of custom collation:

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "en"
                }
            }
        }
    }
}
--------------------------------------------------
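
The `language` parameter can be further specialized by country and
variant, as noted above. The following sketch assumes the parameters are
named `country` and `variant`; the German phonebook collation is an
illustrative choice, and the `variant` value shown should be checked
against the plugin version in use.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "collation" : {
                    "tokenizer" : "keyword",
                    "filter" : ["myCollator"]
                }
            },
            "filter" : {
                "myCollator" : {
                    "type" : "icu_collation",
                    "language" : "de",
                    "country" : "DE",
                    "variant" : "@collation=phonebook"
                }
            }
        }
    }
}
--------------------------------------------------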