diff --git a/docs/reference/analysis/icu-plugin.asciidoc b/docs/reference/analysis/icu-plugin.asciidoc
index c89639fa943..c1be21618fa 100644
--- a/docs/reference/analysis/icu-plugin.asciidoc
+++ b/docs/reference/analysis/icu-plugin.asciidoc
@@ -39,7 +39,7 @@ Here is a sample settings:
 === ICU Folding
 
 Folding of unicode characters based on `UTR#30`. It registers itself
-under `icu_folding` and `icuFolding` names.
+under `icu_folding` and `icuFolding` names. The filter also does lowercasing, which means the lowercase filter can normally be left out.
 
 Sample setting:
 
@@ -70,7 +70,7 @@ primary letters in a specific language is wanted.
 See syntax for the UnicodeSet
 http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here].
 
-The Following example excempt Swedish characters from the folding. Note
+The following example exempts Swedish characters from the folding. Note
 that the filtered characters are NOT lowercased which is why we add that
 filter below.
 
@@ -148,5 +148,73 @@ And here is a sample of custom collation:
             }
         }
     }
-}
+}
 --------------------------------------------------
+
+[float]
+==== Options
+
+[horizontal]
+`strength`::
+    The strength property determines the minimum level of difference considered significant during comparison.
+    The default strength for the Collator is `tertiary`, unless specified otherwise by the locale used to create the Collator.
+    Possible values: `primary`, `secondary`, `tertiary`, `quaternary` or `identical`.
+    +
+    See the http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation] documentation for a more detailed
+    explanation of the specific values.
+
+`decomposition`::
+    Possible values: `no` or `canonical`. Defaults to `no`. Setting this decomposition property to
+    `canonical` allows the Collator to handle un-normalized text properly, producing the same results as if the text were
+    normalized. If `no` is set, it is the user's responsibility to ensure that all text is already in the appropriate form
+    before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between
+    faster and more complete collation behavior. Since a great many of the world's languages do not require text
+    normalization, most locales set `no` as the default decomposition mode.
+
+[float]
+==== Expert options:
+
+[horizontal]
+`alternate`::
+    Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for strength `quaternary`
+    to be either shifted or non-ignorable. This boils down to ignoring punctuation and whitespace.
+
+`caseLevel`::
+    Possible values: `true` or `false`. Default is `false`. Whether case level sorting is required. When
+    strength is set to `primary` this will ignore accent differences.
+
+`caseFirst`::
+    Possible values: `lower` or `upper`. Useful to control which case is sorted first when case is not ignored
+    for strength `tertiary`.
+
+`numeric`::
+    Possible values: `true` or `false`. Whether digits are sorted according to numeric representation. For
+    example the value `egg-9` is sorted before the value `egg-21`. Defaults to `false`.
+
+`variableTop`::
+    Single character or contraction. Controls what is variable for `alternate`.
+
+`hiraganaQuaternaryMode`::
+    Possible values: `true` or `false`. Defaults to `false`. Whether to distinguish between Katakana and
+    Hiragana characters at `quaternary` strength.
+
+[float]
+=== ICU Tokenizer
+
+Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).
+
+[source,js]
+--------------------------------------------------
+{
+    "index" : {
+        "analysis" : {
+            "analyzer" : {
+                "collation" : {
+                    "tokenizer" : "icu_tokenizer"
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
+