558 lines
16 KiB
Plaintext
558 lines
16 KiB
Plaintext
[[analysis-icu]]
|
|
=== ICU Analysis Plugin
|
|
|
|
The ICU Analysis plugin integrates the Lucene ICU module into {es},
|
|
adding extended Unicode support using the http://site.icu-project.org/[ICU]
|
|
libraries, including better analysis of Asian languages, Unicode
|
|
normalization, Unicode-aware case folding, collation support, and
|
|
transliteration.
|
|
|
|
[IMPORTANT]
|
|
.ICU analysis and backwards compatibility
|
|
================================================
|
|
|
|
From time to time, the ICU library receives updates such as adding new
|
|
characters and emojis, and improving collation (sort) orders. These changes
|
|
may or may not affect search and sort orders, depending on which characters
|
|
sets you are using.
|
|
|
|
While we restrict ICU upgrades to major versions, you may find that an index
|
|
created in the previous major version will need to be reindexed in order to
|
|
return correct (and correctly ordered) results, and to take advantage of new
|
|
characters.
|
|
|
|
================================================
|
|
|
|
:plugin_name: analysis-icu
|
|
include::install_remove.asciidoc[]
|
|
|
|
[[analysis-icu-analyzer]]
|
|
==== ICU Analyzer
|
|
|
|
The `icu_analyzer` analyzer performs basic normalization, tokenization and character folding, using the
|
|
`icu_normalizer` char filter, `icu_tokenizer` and `icu_folding` token filter
|
|
|
|
The following parameters are accepted:
|
|
|
|
[horizontal]
|
|
|
|
`method`::
|
|
|
|
Normalization method. Accepts `nfkc`, `nfc` or `nfkc_cf` (default)
|
|
|
|
`mode`::
|
|
|
|
Normalization mode. Accepts `compose` (default) or `decompose`.
|
|
|
|
[[analysis-icu-normalization-charfilter]]
|
|
==== ICU Normalization Character Filter
|
|
|
|
Normalizes characters as explained
|
|
http://userguide.icu-project.org/transforms/normalization[here].
|
|
It registers itself as the `icu_normalizer` character filter, which is
|
|
available to all indices without any further configuration. The type of
|
|
normalization can be specified with the `name` parameter, which accepts `nfc`,
|
|
`nfkc`, and `nfkc_cf` (default). Set the `mode` parameter to `decompose` to
|
|
convert `nfc` to `nfd` or `nfkc` to `nfkd` respectively:
|
|
|
|
Which letters are normalized can be controlled by specifying the
|
|
`unicode_set_filter` parameter, which accepts a
|
|
https://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].
|
|
|
|
Here are two examples, the default usage and a customised character filter:
|
|
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT icu_sample
|
|
{
|
|
"settings": {
|
|
"index": {
|
|
"analysis": {
|
|
"analyzer": {
|
|
"nfkc_cf_normalized": { <1>
|
|
"tokenizer": "icu_tokenizer",
|
|
"char_filter": [
|
|
"icu_normalizer"
|
|
]
|
|
},
|
|
"nfd_normalized": { <2>
|
|
"tokenizer": "icu_tokenizer",
|
|
"char_filter": [
|
|
"nfd_normalizer"
|
|
]
|
|
}
|
|
},
|
|
"char_filter": {
|
|
"nfd_normalizer": {
|
|
"type": "icu_normalizer",
|
|
"name": "nfc",
|
|
"mode": "decompose"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> Uses the default `nfkc_cf` normalization.
|
|
<2> Uses the customized `nfd_normalizer` token filter, which is set to use `nfc` normalization with decomposition.
|
|
|
|
[[analysis-icu-tokenizer]]
|
|
==== ICU Tokenizer
|
|
|
|
Tokenizes text into words on word boundaries, as defined in
|
|
https://www.unicode.org/reports/tr29/[UAX #29: Unicode Text Segmentation].
|
|
It behaves much like the {ref}/analysis-standard-tokenizer.html[`standard` tokenizer],
|
|
but adds better support for some Asian languages by using a dictionary-based
|
|
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
|
|
using custom rules to break Myanmar and Khmer text into syllables.
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT icu_sample
|
|
{
|
|
"settings": {
|
|
"index": {
|
|
"analysis": {
|
|
"analyzer": {
|
|
"my_icu_analyzer": {
|
|
"tokenizer": "icu_tokenizer"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
===== Rules customization
|
|
|
|
experimental[This functionality is marked as experimental in Lucene]
|
|
|
|
You can customize the `icu-tokenizer` behavior by specifying per-script rule files, see the
|
|
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules[RBBI rules syntax reference]
|
|
for a more detailed explanation.
|
|
|
|
To add icu tokenizer rules, set the `rule_files` settings, which should contain a comma-separated list of
|
|
`code:rulefile` pairs in the following format:
|
|
https://unicode.org/iso15924/iso15924-codes.html[four-letter ISO 15924 script code],
|
|
followed by a colon, then a rule file name. Rule files are placed `ES_HOME/config` directory.
|
|
|
|
As a demonstration of how the rule files can be used, save the following user file to `$ES_HOME/config/KeywordTokenizer.rbbi`:
|
|
|
|
[source,text]
|
|
-----------------------
|
|
.+ {200};
|
|
-----------------------
|
|
|
|
Then create an analyzer to use this rule file as follows:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT icu_sample
|
|
{
|
|
"settings": {
|
|
"index": {
|
|
"analysis": {
|
|
"tokenizer": {
|
|
"icu_user_file": {
|
|
"type": "icu_tokenizer",
|
|
"rule_files": "Latn:KeywordTokenizer.rbbi"
|
|
}
|
|
},
|
|
"analyzer": {
|
|
"my_analyzer": {
|
|
"type": "custom",
|
|
"tokenizer": "icu_user_file"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
GET icu_sample/_analyze
|
|
{
|
|
"analyzer": "my_analyzer",
|
|
"text": "Elasticsearch. Wow!"
|
|
}
|
|
--------------------------------------------------
|
|
|
|
The above `analyze` request returns the following:
|
|
|
|
[source,console-result]
|
|
--------------------------------------------------
|
|
{
|
|
"tokens": [
|
|
{
|
|
"token": "Elasticsearch. Wow!",
|
|
"start_offset": 0,
|
|
"end_offset": 19,
|
|
"type": "<ALPHANUM>",
|
|
"position": 0
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
|
|
|
|
[[analysis-icu-normalization]]
|
|
==== ICU Normalization Token Filter
|
|
|
|
Normalizes characters as explained
|
|
http://userguide.icu-project.org/transforms/normalization[here]. It registers
|
|
itself as the `icu_normalizer` token filter, which is available to all indices
|
|
without any further configuration. The type of normalization can be specified
|
|
with the `name` parameter, which accepts `nfc`, `nfkc`, and `nfkc_cf`
|
|
(default).
|
|
|
|
Which letters are normalized can be controlled by specifying the
|
|
`unicode_set_filter` parameter, which accepts a
|
|
https://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].
|
|
|
|
You should probably prefer the <<analysis-icu-normalization-charfilter,Normalization character filter>>.
|
|
|
|
Here are two examples, the default usage and a customised token filter:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT icu_sample
|
|
{
|
|
"settings": {
|
|
"index": {
|
|
"analysis": {
|
|
"analyzer": {
|
|
"nfkc_cf_normalized": { <1>
|
|
"tokenizer": "icu_tokenizer",
|
|
"filter": [
|
|
"icu_normalizer"
|
|
]
|
|
},
|
|
"nfc_normalized": { <2>
|
|
"tokenizer": "icu_tokenizer",
|
|
"filter": [
|
|
"nfc_normalizer"
|
|
]
|
|
}
|
|
},
|
|
"filter": {
|
|
"nfc_normalizer": {
|
|
"type": "icu_normalizer",
|
|
"name": "nfc"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> Uses the default `nfkc_cf` normalization.
|
|
<2> Uses the customized `nfc_normalizer` token filter, which is set to use `nfc` normalization.
|
|
|
|
|
|
[[analysis-icu-folding]]
|
|
==== ICU Folding Token Filter
|
|
|
|
Case folding of Unicode characters based on `UTR#30`, like the
|
|
{ref}/analysis-asciifolding-tokenfilter.html[ASCII-folding token filter]
|
|
on steroids. It registers itself as the `icu_folding` token filter and is
|
|
available to all indices:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT icu_sample
|
|
{
|
|
"settings": {
|
|
"index": {
|
|
"analysis": {
|
|
"analyzer": {
|
|
"folded": {
|
|
"tokenizer": "icu_tokenizer",
|
|
"filter": [
|
|
"icu_folding"
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
The ICU folding token filter already does Unicode normalization, so there is
|
|
no need to use Normalize character or token filter as well.
|
|
|
|
Which letters are folded can be controlled by specifying the
|
|
`unicode_set_filter` parameter, which accepts a
|
|
https://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].
|
|
|
|
The following example exempts Swedish characters from folding. It is important
|
|
to note that both upper and lowercase forms should be specified, and that
|
|
these filtered character are not lowercased which is why we add the
|
|
`lowercase` filter as well:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT icu_sample
|
|
{
|
|
"settings": {
|
|
"index": {
|
|
"analysis": {
|
|
"analyzer": {
|
|
"swedish_analyzer": {
|
|
"tokenizer": "icu_tokenizer",
|
|
"filter": [
|
|
"swedish_folding",
|
|
"lowercase"
|
|
]
|
|
}
|
|
},
|
|
"filter": {
|
|
"swedish_folding": {
|
|
"type": "icu_folding",
|
|
"unicode_set_filter": "[^åäöÅÄÖ]"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
|
|
[[analysis-icu-collation]]
|
|
==== ICU Collation Token Filter
|
|
|
|
[WARNING]
|
|
======
|
|
This token filter has been deprecated since Lucene 5.0. Please use
|
|
<<analysis-icu-collation-keyword-field, ICU Collation Keyword Field>>.
|
|
======
|
|
|
|
[[analysis-icu-collation-keyword-field]]
|
|
==== ICU Collation Keyword Field
|
|
|
|
Collations are used for sorting documents in a language-specific word order.
|
|
The `icu_collation_keyword` field type is available to all indices and will encode
|
|
the terms directly as bytes in a doc values field and a single indexed token just
|
|
like a standard {ref}/keyword.html[Keyword Field].
|
|
|
|
Defaults to using {defguide}/sorting-collations.html#uca[DUCET collation],
|
|
which is a best-effort attempt at language-neutral sorting.
|
|
|
|
Below is an example of how to set up a field for sorting German names in
|
|
``phonebook'' order:
|
|
|
|
[source,console]
|
|
--------------------------
|
|
PUT my-index-000001
|
|
{
|
|
"mappings": {
|
|
"properties": {
|
|
"name": { <1>
|
|
"type": "text",
|
|
"fields": {
|
|
"sort": { <2>
|
|
"type": "icu_collation_keyword",
|
|
"index": false,
|
|
"language": "de",
|
|
"country": "DE",
|
|
"variant": "@collation=phonebook"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
GET /my-index-000001/_search <3>
|
|
{
|
|
"query": {
|
|
"match": {
|
|
"name": "Fritz"
|
|
}
|
|
},
|
|
"sort": "name.sort"
|
|
}
|
|
|
|
--------------------------
|
|
|
|
<1> The `name` field uses the `standard` analyzer, and so support full text queries.
|
|
<2> The `name.sort` field is an `icu_collation_keyword` field that will preserve the name as
|
|
a single token doc values, and applies the German ``phonebook'' order.
|
|
<3> An example query which searches the `name` field and sorts on the `name.sort` field.
|
|
|
|
===== Parameters for ICU Collation Keyword Fields
|
|
|
|
The following parameters are accepted by `icu_collation_keyword` fields:
|
|
|
|
[horizontal]
|
|
|
|
`doc_values`::
|
|
|
|
Should the field be stored on disk in a column-stride fashion, so that it
|
|
can later be used for sorting, aggregations, or scripting? Accepts `true`
|
|
(default) or `false`.
|
|
|
|
`index`::
|
|
|
|
Should the field be searchable? Accepts `true` (default) or `false`.
|
|
|
|
`null_value`::
|
|
|
|
Accepts a string value which is substituted for any explicit `null`
|
|
values. Defaults to `null`, which means the field is treated as missing.
|
|
|
|
{ref}/ignore-above.html[`ignore_above`]::
|
|
|
|
Strings longer than the `ignore_above` setting will be ignored.
|
|
Checking is performed on the original string before the collation.
|
|
The `ignore_above` setting can be updated on existing fields
|
|
using the {ref}/indices-put-mapping.html[PUT mapping API].
|
|
By default, there is no limit and all values will be indexed.
|
|
|
|
`store`::
|
|
|
|
Whether the field value should be stored and retrievable separately from
|
|
the {ref}/mapping-source-field.html[`_source`] field. Accepts `true` or `false`
|
|
(default).
|
|
|
|
`fields`::
|
|
|
|
Multi-fields allow the same string value to be indexed in multiple ways for
|
|
different purposes, such as one field for search and a multi-field for
|
|
sorting and aggregations.
|
|
|
|
===== Collation options
|
|
|
|
`strength`::
|
|
|
|
The strength property determines the minimum level of difference considered
|
|
significant during comparison. Possible values are : `primary`, `secondary`,
|
|
`tertiary`, `quaternary` or `identical`. See the
|
|
https://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation documentation]
|
|
for a more detailed explanation for each value. Defaults to `tertiary`
|
|
unless otherwise specified in the collation.
|
|
|
|
`decomposition`::
|
|
|
|
Possible values: `no` (default, but collation-dependent) or `canonical`.
|
|
Setting this decomposition property to `canonical` allows the Collator to
|
|
handle unnormalized text properly, producing the same results as if the text
|
|
were normalized. If `no` is set, it is the user's responsibility to insure
|
|
that all text is already in the appropriate form before a comparison or before
|
|
getting a CollationKey. Adjusting decomposition mode allows the user to select
|
|
between faster and more complete collation behavior. Since a great many of the
|
|
world's languages do not require text normalization, most locales set `no` as
|
|
the default decomposition mode.
|
|
|
|
The following options are expert only:
|
|
|
|
`alternate`::
|
|
|
|
Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for
|
|
strength `quaternary` to be either shifted or non-ignorable. Which boils down
|
|
to ignoring punctuation and whitespace.
|
|
|
|
`case_level`::
|
|
|
|
Possible values: `true` or `false` (default). Whether case level sorting is
|
|
required. When strength is set to `primary` this will ignore accent
|
|
differences.
|
|
|
|
|
|
`case_first`::
|
|
|
|
Possible values: `lower` or `upper`. Useful to control which case is sorted
|
|
first when case is not ignored for strength `tertiary`. The default depends on
|
|
the collation.
|
|
|
|
`numeric`::
|
|
|
|
Possible values: `true` or `false` (default) . Whether digits are sorted
|
|
according to their numeric representation. For example the value `egg-9` is
|
|
sorted before the value `egg-21`.
|
|
|
|
|
|
`variable_top`::
|
|
|
|
Single character or contraction. Controls what is variable for `alternate`.
|
|
|
|
`hiragana_quaternary_mode`::
|
|
|
|
Possible values: `true` or `false`. Distinguishing between Katakana and
|
|
Hiragana characters in `quaternary` strength.
|
|
|
|
|
|
[[analysis-icu-transform]]
|
|
==== ICU Transform Token Filter
|
|
|
|
Transforms are used to process Unicode text in many different ways, such as
|
|
case mapping, normalization, transliteration and bidirectional text handling.
|
|
|
|
You can define which transformation you want to apply with the `id` parameter
|
|
(defaults to `Null`), and specify text direction with the `dir` parameter
|
|
which accepts `forward` (default) for LTR and `reverse` for RTL. Custom
|
|
rulesets are not yet supported.
|
|
|
|
For example:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT icu_sample
|
|
{
|
|
"settings": {
|
|
"index": {
|
|
"analysis": {
|
|
"analyzer": {
|
|
"latin": {
|
|
"tokenizer": "keyword",
|
|
"filter": [
|
|
"myLatinTransform"
|
|
]
|
|
}
|
|
},
|
|
"filter": {
|
|
"myLatinTransform": {
|
|
"type": "icu_transform",
|
|
"id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" <1>
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
GET icu_sample/_analyze
|
|
{
|
|
"analyzer": "latin",
|
|
"text": "你好" <2>
|
|
}
|
|
|
|
GET icu_sample/_analyze
|
|
{
|
|
"analyzer": "latin",
|
|
"text": "здравствуйте" <3>
|
|
}
|
|
|
|
GET icu_sample/_analyze
|
|
{
|
|
"analyzer": "latin",
|
|
"text": "こんにちは" <4>
|
|
}
|
|
|
|
--------------------------------------------------
|
|
|
|
<1> This transforms transliterates characters to Latin, and separates accents
|
|
from their base characters, removes the accents, and then puts the
|
|
remaining text into an unaccented form.
|
|
|
|
<2> Returns `ni hao`.
|
|
<3> Returns `zdravstvujte`.
|
|
<4> Returns `kon'nichiha`.
|
|
|
|
For more documentation, Please see the http://userguide.icu-project.org/transforms/general[user guide of ICU Transform].
|