[[analysis-icu-plugin]] == ICU Analysis Plugin The http://icu-project.org/[ICU] analysis plugin allows for unicode normalization, collation and folding. The plugin is called https://github.com/elasticsearch/elasticsearch-analysis-icu[elasticsearch-analysis-icu]. The plugin includes the following analysis components: [float] [[icu-normalization]] === ICU Normalization Normalizes characters as explained http://userguide.icu-project.org/transforms/normalization[here]. It registers itself by default under `icu_normalizer` or `icuNormalizer` using the default settings. Allows for the name parameter to be provided which can include the following values: `nfc`, `nfkc`, and `nfkc_cf`. Here is a sample settings: [source,js] -------------------------------------------------- { "index" : { "analysis" : { "analyzer" : { "normalization" : { "tokenizer" : "keyword", "filter" : ["icu_normalizer"] } } } } } -------------------------------------------------- [float] [[icu-folding]] === ICU Folding Folding of unicode characters based on `UTR#30`. It registers itself under `icu_folding` and `icuFolding` names. The filter also does lowercasing, which means the lowercase filter can normally be left out. Sample setting: [source,js] -------------------------------------------------- { "index" : { "analysis" : { "analyzer" : { "folding" : { "tokenizer" : "keyword", "filter" : ["icu_folding"] } } } } } -------------------------------------------------- [float] [[icu-filtering]] ==== Filtering The folding can be filtered by a set of unicode characters with the parameter `unicodeSetFilter`. This is useful for a non-internationalized search engine where retaining a set of national characters which are primary letters in a specific language is wanted. See syntax for the UnicodeSet http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here]. The Following example exempts Swedish characters from the folding. Note that the filtered characters are NOT lowercased which is why we add that filter below. [source,js] -------------------------------------------------- { "index" : { "analysis" : { "analyzer" : { "folding" : { "tokenizer" : "standard", "filter" : ["my_icu_folding", "lowercase"] } } "filter" : { "my_icu_folding" : { "type" : "icu_folding" "unicodeSetFilter" : "[^åäöÅÄÖ]" } } } } } -------------------------------------------------- [float] [[icu-collation]] === ICU Collation Uses collation token filter. Allows to either specify the rules for collation (defined http://www.icu-project.org/userguide/Collate_Customization.html[here]) using the `rules` parameter (can point to a location or expressed in the settings, location can be relative to config location), or using the `language` parameter (further specialized by country and variant). By default registers under `icu_collation` or `icuCollation` and uses the default locale. Here is a sample settings: [source,js] -------------------------------------------------- { "index" : { "analysis" : { "analyzer" : { "collation" : { "tokenizer" : "keyword", "filter" : ["icu_collation"] } } } } } -------------------------------------------------- And here is a sample of custom collation: [source,js] -------------------------------------------------- { "index" : { "analysis" : { "analyzer" : { "collation" : { "tokenizer" : "keyword", "filter" : ["myCollator"] } }, "filter" : { "myCollator" : { "type" : "icu_collation", "language" : "en" } } } } } -------------------------------------------------- [float] ==== Options [horizontal] `strength`:: The strength property determines the minimum level of difference considered significant during comparison. The default strength for the Collator is `tertiary`, unless specified otherwise by the locale used to create the Collator. Possible values: `primary`, `secondary`, `tertiary`, `quaternary` or `identical`. + See http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation] documentation for a more detailed explanation for the specific values. `decomposition`:: Possible values: `no` or `canonical`. Defaults to `no`. Setting this decomposition property with `canonical` allows the Collator to handle un-normalized text properly, producing the same results as if the text were normalized. If `no` is set, it is the user's responsibility to insure that all text is already in the appropriate form before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between faster and more complete collation behavior. Since a great many of the world's languages do not require text normalization, most locales set `no` as the default decomposition mode. [float] ==== Expert options: [horizontal] `alternate`:: Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for strength `quaternary` to be either shifted or non-ignorable. What boils down to ignoring punctuation and whitespace. `caseLevel`:: Possible values: `true` or `false`. Default is `false`. Whether case level sorting is required. When strength is set to `primary` this will ignore accent differences. `caseFirst`:: Possible values: `lower` or `upper`. Useful to control which case is sorted first when case is not ignored for strength `tertiary`. `numeric`:: Possible values: `true` or `false`. Whether digits are sorted according to numeric representation. For example the value `egg-9` is sorted before the value `egg-21`. Defaults to `false`. `variableTop`:: Single character or contraction. Controls what is variable for `alternate`. `hiraganaQuaternaryMode`:: Possible values: `true` or `false`. Defaults to `false`. Distinguishing between Katakana and Hiragana characters in `quaternary` strength . [float] === ICU Tokenizer Breaks text into words according to UAX #29: Unicode Text Segmentation ((http://www.unicode.org/reports/tr29/)). [source,js] -------------------------------------------------- { "index" : { "analysis" : { "analyzer" : { "collation" : { "tokenizer" : "icu_tokenizer", } } } } } --------------------------------------------------