[[analysis-icu]]
=== ICU Analysis Plugin

The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch,
adding extended Unicode support using the http://site.icu-project.org/[ICU]
libraries, including better analysis of Asian languages, Unicode
normalization, Unicode-aware case folding, collation support, and
transliteration.

[[analysis-icu-install]]
[float]
==== Installation

This plugin can be installed using the plugin manager:

[source,sh]
----------------------------------------------------------------
sudo bin/plugin install analysis-icu
----------------------------------------------------------------

The plugin must be installed on every node in the cluster, and each node must
be restarted after installation.

[[analysis-icu-remove]]
[float]
==== Removal

The plugin can be removed with the following command:

[source,sh]
----------------------------------------------------------------
sudo bin/plugin remove analysis-icu
----------------------------------------------------------------

The node must be stopped before removing the plugin.

[[analysis-icu-normalization-charfilter]]
==== ICU Normalization Character Filter

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here].
It registers itself as the `icu_normalizer` character filter, which is
available to all indices without any further configuration. The type of
normalization can be specified with the `name` parameter, which accepts `nfc`,
`nfkc`, and `nfkc_cf` (default). Set the `mode` parameter to `decompose` to
convert `nfc` to `nfd` or `nfkc` to `nfkd` respectively.

Here are two examples, the default usage and a customized character filter:

[source,json]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { <1>
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "icu_normalizer"
            ]
          },
          "nfd_normalized": { <2>
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfd_normalizer"
            ]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
<1> Uses the default `nfkc_cf` normalization.
<2> Uses the customized `nfd_normalizer` character filter, which is set to use `nfc` normalization with decomposition.
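
As a quick check, here is a hedged sketch using the `_analyze` API (the
analyzer name comes from the example above; the exact output may vary between
versions):

[source,json]
--------------------------------------------------
GET icu_sample/_analyze?analyzer=nfkc_cf_normalized
{
  "text": "Ⅻ" <1>
}
--------------------------------------------------
// AUTOSENSE
<1> The Roman numeral character `Ⅻ` (U+216B) should be compatibility-decomposed
and case folded to `xii` by `nfkc_cf`.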
[[analysis-icu-tokenizer]]
==== ICU Tokenizer

Tokenizes text into words on word boundaries, as defined in
http://www.unicode.org/reports/tr29/[UAX #29: Unicode Text Segmentation].
It behaves much like the {ref}/analysis-standard-tokenizer.html[`standard` tokenizer],
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.

[source,json]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
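
To see the dictionary-based segmentation at work, a hedged sketch (the
analyzer name comes from the example above; the exact tokens depend on the ICU
dictionaries bundled with your version):

[source,json]
--------------------------------------------------
GET icu_sample/_analyze?analyzer=my_icu_analyzer
{
  "text": "東京は日本の首都です" <1>
}
--------------------------------------------------
// AUTOSENSE
<1> Where the `standard` tokenizer would emit one token per ideograph, the ICU
tokenizer should group the characters into dictionary words such as `東京`
and `首都`.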
[[analysis-icu-normalization]]
==== ICU Normalization Token Filter

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. It registers
itself as the `icu_normalizer` token filter, which is available to all indices
without any further configuration. The type of normalization can be specified
with the `name` parameter, which accepts `nfc`, `nfkc`, and `nfkc_cf`
(default).

You should probably prefer the <<analysis-icu-normalization-charfilter,Normalization character filter>>.

Here are two examples, the default usage and a customized token filter:

[source,json]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { <1>
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_normalizer"
            ]
          },
          "nfc_normalized": { <2>
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfc_normalizer"
            ]
          }
        },
        "filter": {
          "nfc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
<1> Uses the default `nfkc_cf` normalization.
<2> Uses the customized `nfc_normalizer` token filter, which is set to use `nfc` normalization.
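
As a hedged illustration of the difference (analyzer names from the example
above), the ligature `ﬁ` (U+FB01) has a compatibility decomposition which
`nfkc_cf` applies but plain `nfc` does not:

[source,json]
--------------------------------------------------
GET icu_sample/_analyze?analyzer=nfkc_cf_normalized
{
  "text": "ﬁnd" <1>
}

GET icu_sample/_analyze?analyzer=nfc_normalized
{
  "text": "ﬁnd" <2>
}
--------------------------------------------------
// AUTOSENSE
<1> Should return `find`.
<2> Should return `ﬁnd`, with the ligature intact.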
[[analysis-icu-folding]]
==== ICU Folding Token Filter

Case folding of Unicode characters based on `UTR#30`, like the
{ref}/analysis-asciifolding-tokenfilter.html[ASCII-folding token filter]
on steroids. It registers itself as the `icu_folding` token filter and is
available to all indices:

[source,json]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "folded": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_folding"
            ]
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE

The ICU folding token filter already does Unicode normalization, so there is
no need to use the normalization character or token filter as well.

Which letters are folded can be controlled by specifying the
`unicodeSetFilter` parameter, which accepts a
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].

The following example exempts Swedish characters from folding. It is important
to note that both upper- and lowercase forms should be specified, and that
these filtered characters are not lowercased, which is why we add the
`lowercase` filter as well:

[source,json]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "swedish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "swedish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "swedish_folding": {
            "type": "icu_folding",
            "unicodeSetFilter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
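
A hedged sketch of the effect (analyzer name from the example above; the
outputs are what this configuration should produce, not verified against a
specific version):

[source,json]
--------------------------------------------------
GET icu_sample/_analyze?analyzer=swedish_analyzer
{
  "text": "Västerås résumé" <1>
}
--------------------------------------------------
// AUTOSENSE
<1> `é` is folded to `e`, giving `resume`, while `å` and `ä` are exempt from
folding and only lowercased by the `lowercase` filter, giving `västerås`.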
[[analysis-icu-collation]]
==== ICU Collation Token Filter

Collations are used for sorting documents in a language-specific word order.
The `icu_collation` token filter is available to all indices and defaults to
using the
https://www.elastic.co/guide/en/elasticsearch/guide/current/sorting-collations.html#uca[DUCET collation],
which is a best-effort attempt at language-neutral sorting.

Below is an example of how to set up a field for sorting German names in
``phonebook'' order:

[source,json]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "german_phonebook": {
          "type": "icu_collation",
          "language": "de",
          "country": "DE",
          "variant": "@collation=phonebook"
        }
      },
      "analyzer": {
        "german_phonebook": {
          "tokenizer": "keyword",
          "filter": [ "german_phonebook" ]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "name": { <1>
          "type": "string",
          "fields": {
            "sort": { <2>
              "type": "string",
              "analyzer": "german_phonebook"
            }
          }
        }
      }
    }
  }
}

GET _search <3>
{
  "query": {
    "match": {
      "name": "Fritz"
    }
  },
  "sort": "name.sort"
}
--------------------------------------------------
// AUTOSENSE
<1> The `name` field uses the `standard` analyzer, and so supports full text queries.
<2> The `name.sort` field uses the `keyword` analyzer to preserve the name as
a single token, and applies the `german_phonebook` token filter to index
the value in German phonebook sort order.
<3> An example query which searches the `name` field and sorts on the `name.sort` field.

===== Collation options

`strength`::
The strength property determines the minimum level of difference considered
significant during comparison. Possible values are: `primary`, `secondary`,
`tertiary`, `quaternary` or `identical`. See the
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation documentation]
for a more detailed explanation of each value. Defaults to `tertiary`
unless otherwise specified in the collation.

`decomposition`::
Possible values: `no` (default, but collation-dependent) or `canonical`.
Setting this decomposition property to `canonical` allows the Collator to
handle unnormalized text properly, producing the same results as if the text
were normalized. If `no` is set, it is the user's responsibility to ensure
that all text is already in the appropriate form before a comparison or before
getting a CollationKey. Adjusting decomposition mode allows the user to select
between faster and more complete collation behavior. Since a great many of the
world's languages do not require text normalization, most locales set `no` as
the default decomposition mode.

The following options are expert only:

`alternate`::
Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for
strength `quaternary` to be either shifted or non-ignorable, which boils down
to ignoring punctuation and whitespace.

`caseLevel`::
Possible values: `true` or `false` (default). Whether case level sorting is
required. When strength is set to `primary`, this will ignore accent
differences.

`caseFirst`::
Possible values: `lower` or `upper`. Useful to control which case is sorted
first when case is not ignored for strength `tertiary`. The default depends on
the collation.

`numeric`::
Possible values: `true` or `false` (default). Whether digits are sorted
according to their numeric representation. For example, the value `egg-9` is
sorted before the value `egg-21`.

`variableTop`::
Single character or contraction. Controls what is variable for `alternate`.

`hiraganaQuaternaryMode`::
Possible values: `true` or `false`. Distinguishes between Katakana and
Hiragana characters in `quaternary` strength.
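
For instance, a hedged sketch of a filter that combines some of these options
(the filter and analyzer names here are invented for illustration):

[source,json]
--------------------------------------------------
PUT collation_sample
{
  "settings": {
    "analysis": {
      "filter": {
        "numeric_sort": {
          "type": "icu_collation",
          "language": "en",
          "numeric": true, <1>
          "strength": "secondary" <2>
        }
      },
      "analyzer": {
        "numeric_sort": {
          "tokenizer": "keyword",
          "filter": [ "numeric_sort" ]
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
<1> With `numeric` enabled, `egg-9` should sort before `egg-21`.
<2> `secondary` strength should ignore case differences while still
distinguishing accents.

Used in a `sort` sub-field exactly like `name.sort` in the German phonebook
example above, this should give a case-insensitive, numerically aware sort
order.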
[[analysis-icu-transform]]
==== ICU Transform Token Filter

Transforms are used to process Unicode text in many different ways, such as
case mapping, normalization, transliteration and bidirectional text handling.

You can define which transformation you want to apply with the `id` parameter
(defaults to `Null`), and specify text direction with the `dir` parameter,
which accepts `forward` (default) for LTR and `reverse` for RTL. Custom
rulesets are not yet supported.

For example:

[source,json]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "latin": {
            "tokenizer": "keyword",
            "filter": [
              "myLatinTransform"
            ]
          }
        },
        "filter": {
          "myLatinTransform": {
            "type": "icu_transform",
            "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" <1>
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze?analyzer=latin
{
  "text": "你好" <2>
}

GET icu_sample/_analyze?analyzer=latin
{
  "text": "здравствуйте" <3>
}

GET icu_sample/_analyze?analyzer=latin
{
  "text": "こんにちは" <4>
}
--------------------------------------------------
// AUTOSENSE
<1> This transform transliterates characters to Latin, separates accents
from their base characters, removes the accents, and then puts the
remaining text into an unaccented form.
<2> Returns `ni hao`.
<3> Returns `zdravstvujte`.
<4> Returns `kon'nichiha`.
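
The `dir` parameter is not exercised above; as a hedged sketch (the filter and
analyzer names are invented, and the behaviour assumes the transform is
invertible), applying the `Katakana-Hiragana` transform with `"dir": "reverse"`
should convert Hiragana input to Katakana:

[source,json]
--------------------------------------------------
PUT transform_reverse_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "katakana": {
            "tokenizer": "keyword",
            "filter": [
              "hiraganaToKatakana"
            ]
          }
        },
        "filter": {
          "hiraganaToKatakana": {
            "type": "icu_transform",
            "id": "Katakana-Hiragana", <1>
            "dir": "reverse"
          }
        }
      }
    }
  }
}

GET transform_reverse_sample/_analyze?analyzer=katakana
{
  "text": "こんにちは" <2>
}
--------------------------------------------------
// AUTOSENSE
<1> `Katakana-Hiragana` normally converts Katakana to Hiragana; `reverse` runs
it in the opposite direction.
<2> Should return `コンニチハ`.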

For more documentation, please see the http://userguide.icu-project.org/transforms/general[ICU Transforms user guide].