[[analysis-icu]]
=== ICU Analysis Plugin

The ICU Analysis plugin integrates the Lucene ICU module into {es},
adding extended Unicode support using the http://site.icu-project.org/[ICU]
libraries, including better analysis of Asian languages, Unicode
normalization, Unicode-aware case folding, collation support, and
transliteration.

[IMPORTANT]
.ICU analysis and backwards compatibility
================================================

From time to time, the ICU library receives updates such as adding new
characters and emojis, and improving collation (sort) orders. These changes
may or may not affect search and sort orders, depending on which character
sets you are using.

While we restrict ICU upgrades to major versions, you may find that an index
created in the previous major version will need to be reindexed in order to
return correct (and correctly ordered) results, and to take advantage of new
characters.

================================================

:plugin_name: analysis-icu
include::install_remove.asciidoc[]

[[analysis-icu-analyzer]]
==== ICU Analyzer

The `icu_analyzer` analyzer performs basic normalization, tokenization and character folding, using the
`icu_normalizer` char filter, `icu_tokenizer` and `icu_folding` token filter.

The following parameters are accepted:

[horizontal]

`method`::

Normalization method. Accepts `nfkc`, `nfc` or `nfkc_cf` (default).

`mode`::

Normalization mode. Accepts `compose` (default) or `decompose`.

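As a rough illustration of where the `method` and `mode` parameters go, the following Python sketch builds and prints a request body that could be sent with `PUT icu_sample`. The analyzer name `my_icu_analyzer` is hypothetical, and the values are simply the documented defaults written out explicitly.

[source,python]
--------------------------------------------------
import json

# Hypothetical analyzer name; "method" and "mode" are the two parameters
# described above, set here to their documented default values.
settings = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "my_icu_analyzer": {
                        "type": "icu_analyzer",
                        "method": "nfkc_cf",
                        "mode": "compose",
                    }
                }
            }
        }
    }
}

body = json.dumps(settings, indent=2)
print(body)
--------------------------------------------------

The printed JSON has the same shape as the `settings` bodies used in the examples throughout this page.
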
[[analysis-icu-normalization-charfilter]]
==== ICU Normalization Character Filter

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here].
It registers itself as the `icu_normalizer` character filter, which is
available to all indices without any further configuration. The type of
normalization can be specified with the `name` parameter, which accepts `nfc`,
`nfkc`, and `nfkc_cf` (default). Set the `mode` parameter to `decompose` to
convert `nfc` to `nfd` or `nfkc` to `nfkd` respectively.

Which letters are normalized can be controlled by specifying the
`unicode_set_filter` parameter, which accepts a
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].

Here are two examples, the default usage and a customized character filter:

[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { <1>
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "icu_normalizer"
            ]
          },
          "nfd_normalized": { <2>
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfd_normalizer"
            ]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}
--------------------------------------------------

<1> Uses the default `nfkc_cf` normalization.
<2> Uses the customized `nfd_normalizer` character filter, which is set to use `nfc` normalization with decomposition.

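To get a feel for how the normalization forms differ, the following sketch uses Python's standard `unicodedata` module as a stand-in for ICU. It is not the plugin's implementation, and the `nfkc_casefold` helper is a hypothetical approximation of `nfkc_cf` (NFKC combined with case folding), not the exact ICU mapping.

[source,python]
--------------------------------------------------
import unicodedata

def nfkc_casefold(text):
    """Rough approximation of `nfkc_cf`: NFKC, case folding, NFKC again."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)

# "fi" ligature (U+FB01), then "ANCE" and a combining acute accent (U+0301)
text = "\ufb01ANCE\u0301"

forms = {
    "nfc": unicodedata.normalize("NFC", text),
    "nfd (nfc + decompose)": unicodedata.normalize("NFD", text),
    "nfkc": unicodedata.normalize("NFKC", text),
    "nfkd (nfkc + decompose)": unicodedata.normalize("NFKD", text),
    "nfkc_cf (approximation)": nfkc_casefold(text),
}

for name, value in forms.items():
    # Print code points so composed and decomposed forms can be told apart
    print(f"{name:26}", " ".join(f"U+{ord(c):04X}" for c in value))
--------------------------------------------------

The canonical forms (`nfc`, `nfd`) leave the ligature alone and only compose or decompose the accent, while the compatibility forms (`nfkc`, `nfkd`) also expand the ligature into `f` and `i`.
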
[[analysis-icu-tokenizer]]
==== ICU Tokenizer

Tokenizes text into words on word boundaries, as defined in
http://www.unicode.org/reports/tr29/[UAX #29: Unicode Text Segmentation].
It behaves much like the {ref}/analysis-standard-tokenizer.html[`standard` tokenizer],
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.

[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
--------------------------------------------------

===== Rules customization

experimental[This functionality is marked as experimental in Lucene]

You can customize the `icu_tokenizer` behavior by specifying per-script rule files. See the
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules[RBBI rules syntax reference]
for a more detailed explanation.

To add ICU tokenizer rules, set the `rule_files` setting, which should contain a comma-separated list of
`code:rulefile` pairs in the following format:
http://unicode.org/iso15924/iso15924-codes.html[four-letter ISO 15924 script code],
followed by a colon, then a rule file name. Rule files are placed in the `ES_HOME/config` directory.

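The `code:rulefile` format can be sketched in a few lines of Python. This is only an illustration of the value's shape, not how the plugin itself reads the setting; the function name `parse_rule_files` and the second rule file `MyCyrillic.rbbi` are hypothetical.

[source,python]
--------------------------------------------------
def parse_rule_files(value):
    """Split a `rule_files` value into {script_code: rule_file_name}."""
    rules = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue
        code, sep, rule_file = pair.partition(":")
        code, rule_file = code.strip(), rule_file.strip()
        # ISO 15924 script codes are four letters, for example "Latn" or "Cyrl"
        if not sep or len(code) != 4 or not code.isalpha() or not rule_file:
            raise ValueError(f"expected 'code:rulefile', got {pair!r}")
        rules[code] = rule_file
    return rules

# "MyCyrillic.rbbi" is a made-up file name used only for this example
print(parse_rule_files("Latn:KeywordTokenizer.rbbi, Cyrl:MyCyrillic.rbbi"))
--------------------------------------------------

Each script code maps to exactly one rule file, so text in a script without an entry keeps the tokenizer's default behavior.
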
As a demonstration of how the rule files can be used, save the following user file to `$ES_HOME/config/KeywordTokenizer.rbbi`:

[source,text]
-----------------------
.+ {200};
-----------------------

Then create an analyzer to use this rule file as follows:

[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_user_file": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "icu_user_file"
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch. Wow!"
}
--------------------------------------------------

The above `analyze` request returns the following:

[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "Elasticsearch. Wow!",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
--------------------------------------------------

[[analysis-icu-normalization]]
==== ICU Normalization Token Filter

Normalizes characters as explained
http://userguide.icu-project.org/transforms/normalization[here]. It registers
itself as the `icu_normalizer` token filter, which is available to all indices
without any further configuration. The type of normalization can be specified
with the `name` parameter, which accepts `nfc`, `nfkc`, and `nfkc_cf`
(default).

Which letters are normalized can be controlled by specifying the
`unicode_set_filter` parameter, which accepts a
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].

You should probably prefer the <<analysis-icu-normalization-charfilter,Normalization character filter>>.

Here are two examples, the default usage and a customized token filter:

[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { <1>
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_normalizer"
            ]
          },
          "nfc_normalized": { <2>
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfc_normalizer"
            ]
          }
        },
        "filter": {
          "nfc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc"
          }
        }
      }
    }
  }
}
--------------------------------------------------

<1> Uses the default `nfkc_cf` normalization.
<2> Uses the customized `nfc_normalizer` token filter, which is set to use `nfc` normalization.

[[analysis-icu-folding]]
==== ICU Folding Token Filter

Case folding of Unicode characters based on `UTR#30`, like the
{ref}/analysis-asciifolding-tokenfilter.html[ASCII-folding token filter]
on steroids. It registers itself as the `icu_folding` token filter and is
available to all indices:

[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "folded": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_folding"
            ]
          }
        }
      }
    }
  }
}
--------------------------------------------------

The ICU folding token filter already does Unicode normalization, so there is
no need to use the normalization character filter or token filter as well.

Which letters are folded can be controlled by specifying the
`unicode_set_filter` parameter, which accepts a
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[UnicodeSet].

The following example exempts Swedish characters from folding. It is important
to note that both upper and lowercase forms should be specified, and that
these filtered characters are not lowercased, which is why we add the
`lowercase` filter as well:

[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "swedish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "swedish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "swedish_folding": {
            "type": "icu_folding",
            "unicode_set_filter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}
--------------------------------------------------

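The interaction between the exemption set and the extra `lowercase` filter can be sketched in Python. This is a simplified stand-in built on the standard `unicodedata` module (strip combining marks, then case-fold), not the `UTR#30` folding that ICU applies; the `fold` helper and its `keep` argument are hypothetical.

[source,python]
--------------------------------------------------
import unicodedata

def fold(text, keep=""):
    """Approximate accent and case folding; characters in `keep` are left as-is."""
    keep = unicodedata.normalize("NFC", keep)
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)  # exempt characters pass through untouched
            continue
        decomposed = unicodedata.normalize("NFKD", ch)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(stripped.casefold())
    return "".join(out)

swedish = "åäöÅÄÖ"
print(fold("Räksmörgås Café"))                 # everything is folded
print(fold("Räksmörgås Café", keep=swedish))   # å, ä and ö survive, é is still folded
print(fold("ÅNGSTRÖM", keep=swedish))          # exempt uppercase letters stay uppercase
print(fold("ÅNGSTRÖM", keep=swedish).lower())  # hence the additional lowercase step
--------------------------------------------------

The third call shows why the example above adds `lowercase`: characters excluded from folding skip the case folding too.
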
[[analysis-icu-collation]]
==== ICU Collation Token Filter

[WARNING]
======
This token filter has been deprecated since Lucene 5.0. Please use
<<analysis-icu-collation-keyword-field, ICU Collation Keyword Field>>.
======

[[analysis-icu-collation-keyword-field]]
==== ICU Collation Keyword Field

Collations are used for sorting documents in a language-specific word order.
The `icu_collation_keyword` field type is available to all indices and will encode
the terms directly as bytes in a doc values field and a single indexed token just
like a standard {ref}/keyword.html[Keyword Field].

Defaults to using {defguide}/sorting-collations.html#uca[DUCET collation],
which is a best-effort attempt at language-neutral sorting.

Below is an example of how to set up a field for sorting German names in
``phonebook'' order:

[source,console]
--------------------------
PUT my_index
{
  "mappings": {
    "properties": {
      "name": { <1>
        "type": "text",
        "fields": {
          "sort": { <2>
            "type": "icu_collation_keyword",
            "index": false,
            "language": "de",
            "country": "DE",
            "variant": "@collation=phonebook"
          }
        }
      }
    }
  }
}

GET /my_index/_search <3>
{
  "query": {
    "match": {
      "name": "Fritz"
    }
  },
  "sort": "name.sort"
}

--------------------------

<1> The `name` field uses the `standard` analyzer, and so supports full-text queries.
<2> The `name.sort` field is an `icu_collation_keyword` field that preserves the name as
    a single token in doc values, and applies the German ``phonebook'' order.
<3> An example query which searches the `name` field and sorts on the `name.sort` field.

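To see why a dedicated collation matters for German names, the sketch below contrasts plain code point ordering with a simplified phonebook-style sort key in Python. It assumes the usual phonebook convention of treating `ä`, `ö` and `ü` like `ae`, `oe` and `ue`, and it ignores the finer tie-breaking levels that ICU applies; the `phonebook_key` helper is hypothetical and is not what the field type does internally.

[source,python]
--------------------------------------------------
# Simplified expansion table: umlauts sort like their two-letter spellings
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def phonebook_key(name):
    # Lowercase first so case does not dominate, then expand the umlauts
    return name.lower().translate(PHONEBOOK)

names = ["Müller", "Mueller", "Muller", "Mutter"]

by_code_point = sorted(names)
by_phonebook = sorted(names, key=phonebook_key)

print(by_code_point)  # ['Mueller', 'Muller', 'Mutter', 'Müller']
print(by_phonebook)   # ['Müller', 'Mueller', 'Muller', 'Mutter']
--------------------------------------------------

With code point ordering `Müller` lands after every unaccented name, whereas the phonebook-style key places it next to `Mueller`.
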
===== Parameters for ICU Collation Keyword Fields

The following parameters are accepted by `icu_collation_keyword` fields:

[horizontal]

`doc_values`::

Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts `true`
(default) or `false`.

`index`::

Should the field be searchable? Accepts `true` (default) or `false`.

`null_value`::

Accepts a string value which is substituted for any explicit `null`
values. Defaults to `null`, which means the field is treated as missing.

{ref}/ignore-above.html[`ignore_above`]::

Strings longer than the `ignore_above` setting will be ignored.
Checking is performed on the original string before the collation.
The `ignore_above` setting can be updated on existing fields
using the {ref}/indices-put-mapping.html[PUT mapping API].
By default, there is no limit and all values will be indexed.

`store`::

Whether the field value should be stored and retrievable separately from
the {ref}/mapping-source-field.html[`_source`] field. Accepts `true` or `false`
(default).

`fields`::

Multi-fields allow the same string value to be indexed in multiple ways for
different purposes, such as one field for search and a multi-field for
sorting and aggregations.

===== Collation options

`strength`::

The strength property determines the minimum level of difference considered
significant during comparison. Possible values are: `primary`, `secondary`,
`tertiary`, `quaternary` or `identical`. See the
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation documentation]
for a more detailed explanation for each value. Defaults to `tertiary`
unless otherwise specified in the collation.

`decomposition`::

Possible values: `no` (default, but collation-dependent) or `canonical`.
Setting this decomposition property to `canonical` allows the Collator to
handle unnormalized text properly, producing the same results as if the text
were normalized. If `no` is set, it is the user's responsibility to ensure
that all text is already in the appropriate form before a comparison or before
getting a CollationKey. Adjusting decomposition mode allows the user to select
between faster and more complete collation behavior. Since a great many of the
world's languages do not require text normalization, most locales set `no` as
the default decomposition mode.

The following options are expert only:

`alternate`::

Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for
strength `quaternary` to be either shifted or non-ignorable, which boils down
to ignoring punctuation and whitespace.

`case_level`::

Possible values: `true` or `false` (default). Whether case level sorting is
required. When strength is set to `primary` this will ignore accent
differences.

`case_first`::

Possible values: `lower` or `upper`. Useful to control which case is sorted
first when case is not ignored for strength `tertiary`. The default depends on
the collation.

`numeric`::

Possible values: `true` or `false` (default). Whether digits are sorted
according to their numeric representation. For example the value `egg-9` is
sorted before the value `egg-21`.

`variable_top`::

Single character or contraction. Controls what is variable for `alternate`.

`hiragana_quaternary_mode`::

Possible values: `true` or `false`. Whether to distinguish between Katakana and
Hiragana characters in `quaternary` strength.

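The effect of the `numeric` option described above can be illustrated with a small Python sketch. It mimics the idea of comparing digit runs by value rather than character by character; it is not the ICU collator, and the `numeric_key` helper is hypothetical.

[source,python]
--------------------------------------------------
import re

def numeric_key(value):
    """Build a sort key in which runs of digits compare as integers."""
    parts = re.split(r"(\d+)", value)
    # With a single capture group, odd positions hold the digit runs
    return [(1, int(p), "") if i % 2 else (0, 0, p) for i, p in enumerate(parts)]

values = ["egg-21", "egg-9", "egg-100"]

plain = sorted(values)
numeric = sorted(values, key=numeric_key)

print(plain)    # ['egg-100', 'egg-21', 'egg-9']
print(numeric)  # ['egg-9', 'egg-21', 'egg-100']
--------------------------------------------------

Without numeric handling `egg-100` sorts first because the character `1` precedes `2` and `9`; with it, the values follow their numeric order.
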
[[analysis-icu-transform]]
==== ICU Transform Token Filter

Transforms are used to process Unicode text in many different ways, such as
case mapping, normalization, transliteration and bidirectional text handling.

You can define which transformation you want to apply with the `id` parameter
(defaults to `Null`), and specify text direction with the `dir` parameter
which accepts `forward` (default) for LTR and `reverse` for RTL. Custom
rulesets are not yet supported.

For example:

[source,console]
--------------------------------------------------
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "latin": {
            "tokenizer": "keyword",
            "filter": [
              "myLatinTransform"
            ]
          }
        },
        "filter": {
          "myLatinTransform": {
            "type": "icu_transform",
            "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" <1>
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "你好" <2>
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "здравствуйте" <3>
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "こんにちは" <4>
}

--------------------------------------------------

<1> This transform transliterates characters to Latin, separates accents
    from their base characters, removes the accents, and then puts the
    remaining text into an unaccented form.

<2> Returns `ni hao`.
<3> Returns `zdravstvujte`.
<4> Returns `kon'nichiha`.

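The last three steps of the transform ID above (`NFD; [:Nonspacing Mark:] Remove; NFC`) can be approximated with Python's standard `unicodedata` module. The sketch below deliberately leaves out the `Any-Latin` transliteration step, which has no standard library equivalent, and the helper name `remove_nonspacing_marks` is hypothetical.

[source,python]
--------------------------------------------------
import unicodedata

def remove_nonspacing_marks(text):
    # NFD: split accented characters into base character + combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # [:Nonspacing Mark:] Remove: drop characters in general category Mn
    kept = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # NFC: recompose whatever is left
    return unicodedata.normalize("NFC", kept)

print(remove_nonspacing_marks("Zürich café"))
print(remove_nonspacing_marks("nǐ hǎo"))
--------------------------------------------------

Text that is already in Latin script with tone marks or accents, such as the pinyin in the second call, comes out in the unaccented form seen in the callouts above.
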
For more documentation, please see the http://userguide.icu-project.org/transforms/general[user guide of ICU Transform].