Add documentation for Normalizers (#6415)

Improve the documentation for Analyzers (some terminology was
not correct). Also fix some broken links along the way.

Note: this commit is a spin-off of #6252

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Lukáš Vlček 2024-02-15 16:04:47 +01:00 committed by GitHub
parent 1ecb744c17
commit 29f42850fe
5 changed files with 144 additions and 23 deletions

@ -15,16 +15,24 @@ redirect_from:
# Text analysis
When you are searching documents using a full-text search, you want to receive all relevant results and not only exact matches. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking." To facilitate full-text search, OpenSearch uses text analysis.
When you are searching documents using a full-text search, you want to receive all relevant results. If you're looking for "walk", you're interested in results that contain any form of the word, like "Walk", "walked", or "walking". To facilitate full-text search, OpenSearch uses text analysis.
Text analysis consists of the following steps:
The objective of text analysis is to split the unstructured free text content of the source document into a sequence of terms, which are then stored in an inverted index. Subsequently, when a similar text analysis is applied to a user's query, the resulting sequence of terms facilitates the matching of relevant source documents.
1. _Tokenize_ text into terms: For example, after tokenization, the phrase `Actions speak louder than words` is split into tokens `Actions`, `speak`, `louder`, `than`, and `words`.
1. _Normalize_ the terms by converting them into a standard format, for example, converting them to lowercase or performing stemming (reducing the word to its root): For example, after normalization, `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
From a technical point of view, the text analysis process consists of several steps, some of which are optional:
1. Before the free text content can be split into individual words, it may be beneficial to refine the text at the character level. The primary aim of this optional step is to help the tokenizer (the subsequent stage in the analysis process) generate better tokens. This can include removal of markup tags (such as HTML) or handling specific character patterns (like replacing the &#x1F642; emoji with the text `:slightly_smiling_face:`).
2. The next step is to split the free text into individual words---_tokens_. This is performed by a _tokenizer_. For example, after tokenization, the sentence `Actions speak louder than words` is split into tokens `Actions`, `speak`, `louder`, `than`, and `words`.
3. The last step is to process individual tokens by applying a series of token filters. The aim is to convert each token into a predictable form that is directly stored in the index, for example, by converting it to lowercase or performing stemming (reducing the word to its root). After both operations, the token `Actions` becomes `action`, `louder` becomes `loud`, and `words` becomes `word`.
Although the terms ***token*** and ***term*** may sound similar and are occasionally used interchangeably, it is helpful to understand the difference between the two. In the context of Apache Lucene, each holds a distinct role. A ***token*** is created by a tokenizer during text analysis and often undergoes a number of additional modifications as it passes through the chain of token filters. Each token is associated with metadata that can be further used during the text analysis process. A ***term*** is a data value that is directly stored in the inverted index and is associated with much less metadata. During search, matching operates at the term level.
{: .note}
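The three stages described above can be sketched in a few lines of Python. This is only an illustration, not how OpenSearch or Apache Lucene implement analysis: the function names (`strip_html`, `tokenize`, `toy_stem`, `analyze`) are hypothetical, and `toy_stem` is a deliberately naive suffix stripper rather than a real stemming algorithm.

```python
import re


def strip_html(text: str) -> str:
    """Character-level step: remove markup tags before tokenization."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text: str) -> list[str]:
    """Tokenizer step: split the text into word tokens."""
    return re.findall(r"\w+", text)


def toy_stem(token: str) -> str:
    """Token filter: strip a couple of suffixes (toy rule, assumption only)."""
    for suffix in ("er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list[str]:
    """Run the character step, the tokenizer, then the token filters in order."""
    tokens = tokenize(strip_html(text))
    return [toy_stem(token.lower()) for token in tokens]


print(analyze("<p><b>Actions</b> speak louder than <em>words</em></p>"))
# prints: ['action', 'speak', 'loud', 'than', 'word']
```

Applying the same `analyze` function to a query string produces terms in the same form, which is what makes matching against the stored terms possible.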
## Analyzers
In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contains the following sequentially applied components:
In OpenSearch, the abstraction that encompasses text analysis is referred to as an _analyzer_. Each analyzer contains the following sequentially applied components:
1. **Character filters**: First, a character filter receives the original text as a stream of characters and adds, removes, or modifies characters in the text. For example, a character filter can strip HTML characters from a string so that the text `<p><b>Actions</b> speak louder than <em>words</em></p>` becomes `\nActions speak louder than words\n`. The output of a character filter is a stream of characters.
@ -35,6 +43,8 @@ In OpenSearch, text analysis is performed by an _analyzer_. Each analyzer contai
An analyzer must contain exactly one tokenizer and may contain zero or more character filters and zero or more token filters.
{: .note}
There is also a special type of analyzer called a ***normalizer***. A normalizer is similar to an analyzer except that it does not contain a tokenizer and can only include specific types of character filters and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot perform operations on the token as a whole. This means that replacing a token with a synonym or stemming is not supported. See [Normalizers]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) for further details.
## Built-in analyzers
The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `Its fun to contribute a brand-new PR or 2 to OpenSearch!`.

_analyzers/normalizers.md Normal file

@ -0,0 +1,111 @@
---
layout: default
title: Normalizers
nav_order: 100
---
# Normalizers
A _normalizer_ functions similarly to an analyzer but outputs only a single token. It does not contain a tokenizer and can only include specific types of character and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot operate on the token as a whole. This means that replacing a token with a synonym or stemming is not supported.
A normalizer is useful in keyword search (that is, in term-based queries) because it allows you to run token and character filters on any given input. For instance, it makes it possible to match an incoming query `Naïve` with the index term `naive`.
Consider the following example.
Create a new index with a custom normalizer:
```json
PUT /sample-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "normalized_keyword": {
          "type": "custom",
          "char_filter": [],
          "filter": [ "asciifolding", "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "approach": {
        "type": "keyword",
        "normalizer": "normalized_keyword"
      }
    }
  }
}
```
{% include copy-curl.html %}
Index a document:
```json
POST /sample-index/_doc/
{
  "approach": "naive"
}
```
{% include copy-curl.html %}
The following query matches the document. This is expected:
```json
GET /sample-index/_search
{
  "query": {
    "term": {
      "approach": "naive"
    }
  }
}
```
{% include copy-curl.html %}
But this query matches the document as well:
```json
GET /sample-index/_search
{
  "query": {
    "term": {
      "approach": "Naïve"
    }
  }
}
```
{% include copy-curl.html %}
To understand why, consider the effect of the normalizer:
```json
GET /sample-index/_analyze
{
  "normalizer": "normalized_keyword",
  "text": "Naïve"
}
```
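The effect of the `normalized_keyword` normalizer can be approximated outside OpenSearch with a short Python sketch. It is an assumption-laden illustration: the real `asciifolding` filter uses its own character mapping table and covers more characters than Unicode decomposition does, and the function names `ascii_fold` and `normalized_keyword` are hypothetical.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Approximate `asciifolding`: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalized_keyword(text: str) -> str:
    """Apply the filters in the order configured above: asciifolding, then lowercase."""
    return ascii_fold(text).lower()


print(normalized_keyword("Naïve"))  # prints: naive
print(normalized_keyword("naive"))  # prints: naive
```

Because the indexed value and the query term are both reduced to the same single term, the `term` query for `Naïve` matches the document indexed as `naive`.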
Internally, a normalizer accepts only filters that are instances of either `NormalizingTokenFilterFactory` or `NormalizingCharFilterFactory`. The following is a list of compatible filters found in modules and plugins that are part of the core OpenSearch repository.
### The `common-analysis` module
This module does not require installation; it is available by default.
Character filters: `pattern_replace`, `mapping`
Token filters: `arabic_normalization`, `asciifolding`, `bengali_normalization`, `cjk_width`, `decimal_digit`, `elision`, `german_normalization`, `hindi_normalization`, `indic_normalization`, `lowercase`, `persian_normalization`, `scandinavian_folding`, `scandinavian_normalization`, `serbian_normalization`, `sorani_normalization`, `trim`, `uppercase`
### The `analysis-icu` plugin
Character filters: `icu_normalizer`
Token filters: `icu_normalizer`, `icu_folding`, `icu_transform`
### The `analysis-kuromoji` plugin
Character filters: `normalize_kanji`, `normalize_kana`
### The `analysis-nori` plugin
Character filters: `normalize_kanji`, `normalize_kana`
These lists of filters include only analysis components found in the [additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/#additional-plugins) that are part of the core OpenSearch repository.
{: .note}

@ -52,7 +52,7 @@ Parameter | Description
`index` | A Boolean value that specifies whether the field should be searchable. Default is `true`. To reduce disk space, set `index` to `false`.
`index_options` | Information to be stored in the index that will be considered when calculating relevance scores. Can be set to `freqs` for term frequency. Default is `docs`.
`meta` | Accepts metadata for this field.
`normalizer` | Specifies how to preprocess this field before indexing (for example, make it lowercase). Default is `null` (no preprocessing).
[`normalizer`]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) | Specifies how to preprocess this field before indexing (for example, make it lowercase). Default is `null` (no preprocessing).
`norms` | A Boolean value that specifies whether the field length should be used when calculating relevance scores. Default is `false`.
[`null_value`]({{site.url}}{{site.baseurl}}/opensearch/supported-field-types/index#null-value) | A value to be used in place of `null`. Must be of the same type as the field. If this parameter is not specified, the field is treated as missing when its value is `null`. Default is `null`.
`similarity` | The ranking algorithm for calculating relevance scores. Default is `BM25`.

@ -31,12 +31,12 @@ If you are running OpenSearch in a Docker container, plugins must be installed,
Use `list` to see a list of plugins that have already been installed.
#### Usage:
#### Usage
```bash
bin/opensearch-plugin list
```
#### Example:
#### Example
```bash
$ ./opensearch-plugin list
opensearch-alerting
@ -84,20 +84,20 @@ opensearch-node1 opensearch-notifications-core 2.0.1.0
There are three ways to install plugins using the `opensearch-plugin`:
- [Install a plugin by name]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#install-a-plugin-by-name)
- [Install a plugin by from a zip file]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#install-a-plugin-from-a-zip-file)
- [Install a plugin using Maven coordinates]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#install-a-plugin-using-maven-coordinates)
- [Install a plugin by name](#install-a-plugin-by-name).
- [Install a plugin from a ZIP file](#install-a-plugin-from-a-zip-file).
- [Install a plugin using Maven coordinates](#install-a-plugin-using-maven-coordinates).
### Install a plugin by name:
For a list of plugins that can be installed by name, see [Additional plugins]({{site.url}}{{site.baseurl}}/opensearch/install/plugins#additional-plugins).
For a list of plugins that can be installed by name, see [Additional plugins](#additional-plugins).
#### Usage:
#### Usage
```bash
bin/opensearch-plugin install <plugin-name>
```
#### Example:
#### Example
```bash
$ sudo ./opensearch-plugin install analysis-icu
-> Installing analysis-icu
@ -106,16 +106,16 @@ $ sudo ./opensearch-plugin install analysis-icu
-> Installed analysis-icu with folder name analysis-icu
```
### Install a plugin from a zip file:
### Install a plugin from a zip file
Remote zip files can be installed by replacing `<zip-file>` with the URL of the hosted file. The tool only supports downloading over HTTP/HTTPS protocols. For local zip files, replace `<zip-file>` with `file:` followed by the absolute or relative path to the plugin zip file as in the second example below.
#### Usage:
#### Usage
```bash
bin/opensearch-plugin install <zip-file>
```
#### Example:
#### Example
```bash
# Zip file is hosted on a remote server - in this case, Maven central repository.
$ sudo ./opensearch-plugin install https://repo1.maven.org/maven2/org/opensearch/plugin/opensearch-anomaly-detection/2.2.0.0/opensearch-anomaly-detection-2.2.0.0.zip
@ -166,16 +166,16 @@ Continue with installation? [y/N]y
-> Installed opensearch-anomaly-detection with folder name opensearch-anomaly-detection
```
### Install a plugin using Maven coordinates:
### Install a plugin using Maven coordinates
The `opensearch-plugin install` tool also accepts Maven coordinates for available artifacts and versions hosted on [Maven Central](https://search.maven.org/search?q=org.opensearch.plugin). `opensearch-plugin` will parse the Maven coordinates you provide and construct a URL. As a result, the host must be able to connect directly to [Maven Central](https://search.maven.org/search?q=org.opensearch.plugin). The plugin installation will fail if you pass coordinates to a proxy or local repository.
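As a rough illustration of how Maven coordinates map to a download URL, the following Python sketch applies the standard Maven repository layout. It assumes that the tool follows this layout and uses the `repo1.maven.org` host shown in the ZIP file example; the function name `maven_coordinates_to_url` is hypothetical and is not part of `opensearch-plugin`.

```python
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_coordinates_to_url(coordinates: str, base_url: str = MAVEN_CENTRAL) -> str:
    """Build a ZIP artifact URL from <groupId>:<artifactId>:<version> (assumed layout)."""
    parts = coordinates.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected <groupId>:<artifactId>:<version>")
    group_id, artifact_id, version = parts
    # Dots in the group ID become path separators in the repository layout.
    group_path = group_id.replace(".", "/")
    return f"{base_url}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.zip"


print(maven_coordinates_to_url("org.opensearch.plugin:opensearch-anomaly-detection:2.2.0.0"))
```

Under these assumptions, the coordinates resolve to the same URL that is passed explicitly in the remote ZIP file example, which is why the host needs direct access to Maven Central.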
#### Usage:
#### Usage
```bash
bin/opensearch-plugin install <groupId>:<artifactId>:<version>
```
#### Example:
#### Example
```bash
$ sudo ./opensearch-plugin install org.opensearch.plugin:opensearch-anomaly-detection:2.2.0.0
-> Installing org.opensearch.plugin:opensearch-anomaly-detection:2.2.0.0
@ -222,12 +222,12 @@ $ sudo $ ./opensearch-plugin install analysis-nori repository-s3
You can remove a plugin that has already been installed with the `remove` option.
#### Usage:
#### Usage
```bash
bin/opensearch-plugin remove <plugin-name>
```
#### Example:
#### Example
```bash
$ sudo ./opensearch-plugin remove opensearch-anomaly-detection
-> removing [opensearch-anomaly-detection]...

@ -8,7 +8,7 @@ redirect_from:
# Term-level and full-text queries compared
You can use both term-level and full-text queries to search text, but while term-level queries are usually used to search structured data, full-text queries are used for full-text search. The main difference between term-level and full-text queries is that term-level queries search documents for an exact specified term, while full-text queries analyze the query string. The following table summarizes the differences between term-level and full-text queries.
You can use both term-level and full-text queries to search text, but while term-level queries are usually used to search structured data, full-text queries are used for full-text search. The main difference between term-level and full-text queries is that term-level queries search documents for an exact specified term, while full-text queries [analyze]({{site.url}}{{site.baseurl}}/analyzers/) the query string. The following table summarizes the differences between term-level and full-text queries.
| | Term-level queries | Full-text queries
:--- | :--- | :---