ICU Analysis for Elasticsearch
==================================
The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch, adding ICU-related analysis components.

To install the plugin, run:

```sh
bin/plugin -install elasticsearch/elasticsearch-analysis-icu/2.4.1
```
Install the plugin version that matches your Elasticsearch version:

| elasticsearch | ICU Analysis Plugin | Docs |
|---------------|-----------------------|------------|
| master | Build from source | See below |
| es-1.x | Build from source | [2.5.0-SNAPSHOT](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/es-1.x/#version-250-snapshot-for-elasticsearch-1x) |
| es-1.4 | 2.4.1 | [2.4.1](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/v2.4.1/#version-241-for-elasticsearch-14) |
| es-1.3 | 2.3.0 | [2.3.0](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/v2.3.0/#icu-analysis-for-elasticsearch) |
| es-1.2 | 2.2.0 | [2.2.0](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/v2.2.0/#icu-analysis-for-elasticsearch) |
| es-1.1 | 2.1.0 | [2.1.0](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/v2.1.0/#icu-analysis-for-elasticsearch) |
| es-1.0 | 2.0.0 | [2.0.0](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/v2.0.0/#icu-analysis-for-elasticsearch) |
| es-0.90 | 1.13.0 | [1.13.0](https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/v1.13.0/#icu-analysis-for-elasticsearch) |

To use a `SNAPSHOT` version, build the plugin with Maven and install the resulting archive:

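```bash
# Build the plugin archive
mvn clean install
# Install the locally built archive (replace X.X.X with the version you built)
plugin --install analysis-icu \
       --url file:target/releases/elasticsearch-analysis-icu-X.X.X-SNAPSHOT.zip
```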
ICU Normalization
-----------------
Normalizes characters as explained [here](http://userguide.icu-project.org/transforms/normalization). It registers itself by default under `icu_normalizer` or `icuNormalizer` using the default settings. It accepts a `name` parameter, which can take the following values: `nfc`, `nfkc`, and `nfkc_cf`. Here are sample settings:

```js
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalized" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
```
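
To get a feel for what the three forms do, the sketch below uses JavaScript's built-in `String.prototype.normalize` rather than the plugin itself. `nfkc_cf` has no direct JavaScript equivalent, so the hypothetical helper `nfkcCasefoldApprox` approximates it with NFKC plus lowercasing. That is an assumption, and it can differ from ICU's real case folding for some characters.

```js
// Illustration only: this runs in Node.js or a browser, not inside Elasticsearch.

const decomposed = "e\u0301"; // "e" followed by a combining acute accent
const ligature = "\uFB01";    // LATIN SMALL LIGATURE FI

// nfc: canonical composition merges the base letter and accent into one code point.
const nfc = decomposed.normalize("NFC");

// nfkc: compatibility composition additionally expands the ligature to "fi".
const nfkc = ligature.normalize("NFKC");

// Hypothetical helper approximating nfkc_cf (NFKC plus case folding).
function nfkcCasefoldApprox(text) {
  return text.normalize("NFKC").toLowerCase();
}

console.log(nfc === "\u00E9");               // true
console.log(nfkc);                           // fi
console.log(nfkcCasefoldApprox("\uFB01SH")); // fish
```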
ICU Folding
-----------
Folds Unicode characters based on `UTR#30`. It registers itself under the names `icu_folding` and `icuFolding`. Sample settings:

```js
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folded" : {
                    "tokenizer" : "keyword",
                    "filter" : ["icu_folding"]
                }
            }
        }
    }
}
```
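
As a rough intuition for what folding does to a token, the sketch below strips accents and case in plain JavaScript. The helper `foldApprox` is hypothetical and only approximates the idea: it decomposes with NFKD, drops combining marks, and lowercases, whereas the real `icu_folding` filter applies the much larger set of foldings derived from `UTR#30`.

```js
// Hypothetical approximation of folding; NOT the ICU implementation.
function foldApprox(text) {
  return text
    .normalize("NFKD")       // split letters from their diacritics
    .replace(/\p{M}/gu, "")  // remove the combining marks
    .toLowerCase();          // fold case
}

console.log(foldApprox("R\u00E9sum\u00E9"));      // resume
console.log(foldApprox("\u00C5ngstr\u00F6m"));    // angstrom
```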
ICU Filtering
-------------
Folding can be restricted to a set of Unicode characters with the `unicodeSetFilter` parameter. This is useful for a
non-internationalized search engine that needs to retain a set of national characters that are primary letters in a
specific language. See the UnicodeSet syntax [here](http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html).

The following example exempts Swedish characters from folding. Note that the filtered characters are NOT lowercased, which is why the `lowercase` filter is added after the folding filter.

```js
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_icu_folding", "lowercase"]
                }
            },
            "filter" : {
                "my_icu_folding" : {
                    "type" : "icu_folding",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            }
        }
    }
}
```
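
The effect of the exemption can be pictured in plain JavaScript: characters outside the exempted set are folded, while the Swedish letters pass through untouched and are only lowercased in a separate step. The names `KEEP`, `foldChar`, and `foldWithExemptions` are hypothetical, and the folding itself is a simplified stand-in (accent stripping plus lowercasing) for the ICU filter.

```js
// Hypothetical sketch: fold every character except the exempted Swedish letters.
const KEEP = new Set([..."åäöÅÄÖ".normalize("NFC")]);

// Simplified stand-in for ICU folding; NOT the ICU implementation.
function foldChar(ch) {
  return ch.normalize("NFKD").replace(/\p{M}/gu, "").toLowerCase();
}

function foldWithExemptions(text) {
  // Iterate by code point; exempted characters are passed through unchanged.
  return [...text.normalize("NFC")]
    .map((ch) => (KEEP.has(ch) ? ch : foldChar(ch)))
    .join("");
}

const folded = foldWithExemptions("\u00C4rlig Caf\u00E9");
console.log(folded);               // Ä is kept (and still uppercase), é becomes e
console.log(folded.toLowerCase()); // the separate lowercase step handles Ä
```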
ICU Tokenizer
-------------
Breaks text into words according to [UAX #29: Unicode Text Segmentation](http://www.unicode.org/reports/tr29/).
```js
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "tokenized" : {
                    "tokenizer" : "icu_tokenizer"
                }
            }
        }
    }
}
```
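
Word segmentation of this kind can be tried outside Elasticsearch with JavaScript's `Intl.Segmenter`, which is backed by ICU in runtimes such as Node.js. This only illustrates the boundary rules: the hypothetical helper `words` is not the plugin's tokenizer, and the exact tokens produced by `icu_tokenizer` may differ.

```js
// Assumes a runtime that provides Intl.Segmenter (for example Node.js 16 or later).
function words(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Keep only word-like segments; whitespace and punctuation are dropped.
  return [...segmenter.segment(text)]
    .filter((part) => part.isWordLike)
    .map((part) => part.segment);
}

console.log(words("Hello, world!"));
```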
ICU Normalization CharFilter
----------------------------
Normalizes characters as explained [here](http://userguide.icu-project.org/transforms/normalization).
It registers itself by default under `icu_normalizer` or `icuNormalizer` using the default settings.
It accepts a `name` parameter, which can take the following values: `nfc`, `nfkc`, and `nfkc_cf`.
It also accepts a `mode` parameter, which can take the following values: `compose` and `decompose`.
Use `decompose` with `nfc` or `nfkc` to get `nfd` or `nfkd`, respectively.
Here are sample settings:

```js
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "normalized" : {
                    "tokenizer" : "keyword",
                    "char_filter" : ["icu_normalizer"]
                }
            }
        }
    }
}
```
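
A char filter that sets both parameters might be declared as in the sketch below, written as a JavaScript object so that it can be serialized into an index settings body. The filter name `nfd_normalizer` and the analyzer name `nfd_normalized` are made up for illustration, and the `type`, `name`, and `mode` keys are assumed to map directly onto the parameters described above. The last lines use `String.prototype.normalize` to show what decomposition means, independent of the plugin.

```js
const settings = {
  index: {
    analysis: {
      char_filter: {
        // Hypothetical filter name; nfc + decompose should yield NFD per the text above.
        nfd_normalizer: { type: "icu_normalizer", name: "nfc", mode: "decompose" },
      },
      analyzer: {
        // Hypothetical analyzer name that wires in the custom char filter.
        nfd_normalized: { tokenizer: "keyword", char_filter: ["nfd_normalizer"] },
      },
    },
  },
};

// JSON body that could be sent as index settings.
console.log(JSON.stringify(settings, null, 2));

// NFD splits U+00E9 into "e" plus a combining acute accent (U+0301).
const nfd = "\u00E9".normalize("NFD");
console.log(nfd.length); // 2
```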
License
-------
This software is licensed under the Apache 2 license, quoted below.

Copyright 2009-2014 Elasticsearch <http://www.elasticsearch.org>

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.