---
layout: default
title: Normalizers
nav_order: 100
---

# Normalizers

A _normalizer_ functions similarly to an analyzer but outputs only a single token. It does not contain a tokenizer and can only include specific types of character and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole. This means that replacing a token with a synonym or stemming is not supported.

A normalizer is useful in keyword search (that is, in term-based queries) because it allows you to run token and character filters on any given input. For instance, it makes it possible to match an incoming query `Naïve` with the index term `naive`.

Consider the following example.

Create a new index with a custom normalizer:

```json
PUT /sample-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "normalized_keyword": {
          "type": "custom",
          "char_filter": [],
          "filter": [ "asciifolding", "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "approach": {
        "type": "keyword",
        "normalizer": "normalized_keyword"
      }
    }
  }
}
```
{% include copy-curl.html %}

Index a document:

```json
POST /sample-index/_doc/
{
  "approach": "naive"
}
```
{% include copy-curl.html %}

The following query matches the document. This is expected:

```json
GET /sample-index/_search
{
  "query": {
    "term": {
      "approach": "naive"
    }
  }
}
```
{% include copy-curl.html %}
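
The document is returned as a hit. An abridged response similar to the following is expected (score and metadata values will vary):

```json
{
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "hits": [
      {
        "_index": "sample-index",
        "_score": 0.2876821,
        "_source": {
          "approach": "naive"
        }
      }
    ]
  }
}
```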

But this query matches the document as well:

```json
GET /sample-index/_search
{
  "query": {
    "term": {
      "approach": "Naïve"
    }
  }
}
```
{% include copy-curl.html %}

To understand why, consider the effect of the normalizer:

```json
GET /sample-index/_analyze
{
  "normalizer": "normalized_keyword",
  "text": "Naïve"
}
```
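{% include copy-curl.html %}

The normalizer reduces `Naïve` to the single token `naive`, which is why both term queries match the indexed value. A response similar to the following is expected (token metadata may vary by version):

```json
{
  "tokens": [
    {
      "token": "naive",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
```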

Internally, a normalizer accepts only filters that are instances of either `NormalizingTokenFilterFactory` or `NormalizingCharFilterFactory`. The following is a list of compatible filters found in modules and plugins that are part of the core OpenSearch repository.

### The `common-analysis` module

This module does not require installation; it is available by default.

Character filters: `pattern_replace`, `mapping`

Token filters: `arabic_normalization`, `asciifolding`, `bengali_normalization`, `cjk_width`, `decimal_digit`, `elision`, `german_normalization`, `hindi_normalization`, `indic_normalization`, `lowercase`, `persian_normalization`, `scandinavian_folding`, `scandinavian_normalization`, `serbian_normalization`, `sorani_normalization`, `trim`, `uppercase`
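
As a sketch of how these components can be combined, the following hypothetical request (the index, character filter, and normalizer names are illustrative and not part of the module) pairs the `mapping` character filter with the `trim` and `lowercase` token filters:

```json
PUT /mapping-normalizer-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "plus_to_word": {
          "type": "mapping",
          "mappings": [ "+ => plus" ]
        }
      },
      "normalizer": {
        "mapped_keyword": {
          "type": "custom",
          "char_filter": [ "plus_to_word" ],
          "filter": [ "trim", "lowercase" ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

With this definition, an input such as ` C+` is normalized to the single token `cplus`: the character filter first rewrites `+` to `plus`, and the `trim` and `lowercase` filters are then applied to the result.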

### The `analysis-icu` plugin

Character filters: `icu_normalizer`

Token filters: `icu_normalizer`, `icu_folding`, `icu_transform`
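
For example, assuming the `analysis-icu` plugin is installed, a normalizer could use the `icu_folding` token filter, which performs Unicode-aware case and accent folding, in place of the `asciifolding` and `lowercase` pair (the index and normalizer names below are hypothetical):

```json
PUT /icu-normalizer-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "icu_keyword": {
          "type": "custom",
          "char_filter": [],
          "filter": [ "icu_folding" ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}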

### The `analysis-kuromoji` plugin

Character filters: `normalize_kanji`, `normalize_kana`

### The `analysis-nori` plugin

Character filters: `normalize_kanji`, `normalize_kana`

These lists of filters include only analysis components found in the [additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins) that are part of the core OpenSearch repository.
{: .note}