ICU Analysis for Elasticsearch
==================================

The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch, adding ICU-related analysis components.

To install the plugin, run:

```sh
bin/plugin -install elasticsearch/elasticsearch-analysis-icu/2.4.1
```

You need to install a version matching your Elasticsearch version:

| Elasticsearch | ICU Analysis Plugin | Docs                                                                                                                                  |
|---------------|---------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| master        | Build from source   | See below                                                                                                                             |
| es-1.x        | Build from source   | [2.5.0-SNAPSHOT](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/es-1.x/#version-250-snapshot-for-elasticsearch-1x) |
| es-1.4        | 2.4.1               | [2.4.1](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/v2.4.1/#version-241-for-elasticsearch-14)                   |
| es-1.3        | 2.3.0               | [2.3.0](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/v2.3.0/#icu-analysis-for-elasticsearch)                     |
| es-1.2        | 2.2.0               | [2.2.0](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/v2.2.0/#icu-analysis-for-elasticsearch)                     |
| es-1.1        | 2.1.0               | [2.1.0](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/v2.1.0/#icu-analysis-for-elasticsearch)                     |
| es-1.0        | 2.0.0               | [2.0.0](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/v2.0.0/#icu-analysis-for-elasticsearch)                     |
| es-0.90       | 1.13.0              | [1.13.0](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/v1.13.0/#icu-analysis-for-elasticsearch)                   |

To use a `SNAPSHOT` version, build the plugin with Maven and install the resulting archive:

```bash
mvn clean install
plugin --install analysis-icu \
       --url file:target/releases/elasticsearch-analysis-icu-X.X.X-SNAPSHOT.zip
```


ICU Normalization
-----------------

Normalizes characters as explained [here](http://userguide.icu-project.org/transforms/normalization). It registers itself by default under `icu_normalizer` or `icuNormalizer` using the default settings. The optional `name` parameter accepts the following values: `nfc`, `nfkc`, and `nfkc_cf`. Here are sample settings:

```js
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalized" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
```
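
The `name` parameter described above is set on a custom filter definition rather than on the built-in `icu_normalizer` entry. The following is a minimal sketch, assuming the hypothetical filter name `my_nfkc_normalizer` and analyzer name `nfkc_normalized` (neither is defined by the plugin):

```js
{
    "index" : {
        "analysis" : {
            "filter" : {
                "my_nfkc_normalizer" : {
                    "type" : "icu_normalizer",
                    "name" : "nfkc"
                }
            },
            "analyzer" : {
                "nfkc_normalized" : {
                    "tokenizer" : "keyword",
                    "filter" : ["my_nfkc_normalizer"]
                }
            }
        }
    }
}
```

Any of the values listed for `name` can be substituted for `nfkc` here.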

ICU Folding
-----------

Folds Unicode characters based on `UTR#30`. It registers itself under the names `icu_folding` and `icuFolding`. Sample settings:

```js
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folded" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
```

ICU Filtering
-------------

The folding can be restricted to a set of Unicode characters with the `unicodeSetFilter` parameter. This is useful for a
non-internationalized search engine that needs to retain a set of national characters which are primary letters in a
specific language. See the UnicodeSet syntax [here](http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html).

The following example exempts Swedish characters from folding. Note that the filtered characters are NOT lowercased, which is why the `lowercase` filter is added after the folding filter.

```js
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            },
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}
```

ICU Tokenizer
-------------

Breaks text into words according to [UAX #29: Unicode Text Segmentation](http://www.unicode.org/reports/tr29/). Sample settings:

```js
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "tokenized" : {
                    "tokenizer" : "icu_tokenizer"
                }
            }
        }
    }
}
```


ICU Normalization CharFilter
----------------------------

Normalizes characters as explained [here](http://userguide.icu-project.org/transforms/normalization).
It registers itself by default under `icu_normalizer` or `icuNormalizer` using the default settings.
The optional `name` parameter accepts the following values: `nfc`, `nfkc`, and `nfkc_cf`.
The optional `mode` parameter accepts the following values: `compose` and `decompose`.
Use `decompose` with `nfc` or `nfkc` to get `nfd` or `nfkd`, respectively.
Here are sample settings:

```js
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalized" : {
                    "tokenizer" : "keyword",
                    "char_filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
```
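
The `name` and `mode` parameters described above are set on a custom char filter definition. The following is a minimal sketch of an `nfd` configuration, assuming the hypothetical char filter name `my_nfd_normalizer` and analyzer name `nfd_normalized` (neither is defined by the plugin):

```js
{
    "index" : {
        "analysis" : {
            "char_filter" : {
                "my_nfd_normalizer" : {
                    "type" : "icu_normalizer",
                    "name" : "nfc",
                    "mode" : "decompose"
                }
            },
            "analyzer" : {
                "nfd_normalized" : {
                    "tokenizer" : "keyword",
                    "char_filter" : ["my_nfd_normalizer"]
                }
            }
        }
    }
}
```

Replacing `nfc` with `nfkc` in this sketch would select `nfkd` instead.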

License
-------

    This software is licensed under the Apache 2 license, quoted below.

    Copyright 2009-2014 Elasticsearch <http://www.elasticsearch.org>

    Licensed under the Apache License, Version 2.0 (the "License"); you may not
    use this file except in compliance with the License. You may obtain a copy of
    the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
    WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
    License for the specific language governing permissions and limitations under
    the License.