From 49767585c48f6cb44935e2f349c9084395d8f418 Mon Sep 17 00:00:00 2001
From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Date: Thu, 18 Jan 2024 15:05:20 -0500
Subject: [PATCH] Add tokenizer list (#6205)

* Add tokenizer list

Signed-off-by: Fanit Kolchina

* Minor rewording

Signed-off-by: Fanit Kolchina

* Apply suggestions from code review

Co-authored-by: Nathan Bower
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Implement editorial comments

Signed-off-by: Fanit Kolchina

---------

Signed-off-by: Fanit Kolchina
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower
---
 _analyzers/index.md            |  2 +-
 _analyzers/tokenizers/index.md | 61 ++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+), 1 deletion(-)
 create mode 100644 _analyzers/tokenizers/index.md

diff --git a/_analyzers/index.md b/_analyzers/index.md
index 6dc0ef0a..332a9821 100644
--- a/_analyzers/index.md
+++ b/_analyzers/index.md
@@ -45,7 +45,7 @@ Analyzer | Analysis performed | Analyzer output
**Simple** | - Parses strings into tokens on any non-letter character
- Removes non-letter characters
- Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`] **Whitespace** | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`] **Stop** | - Parses strings into tokens on any non-letter character
- Removes non-letter characters
- Removes stop words
- Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`] -**Keyword** (noop) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`] +**Keyword** (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`] **Pattern** | - Parses strings into tokens using regular expressions
- Supports converting strings to lowercase
- Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`] [**Language**]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/) | Performs analysis specific to a certain language (for example, `english`). | [`fun`, `contribut`, `brand`, `new`, `pr`, `2`, `opensearch`] **Fingerprint** | - Parses strings on any non-letter character
- Normalizes characters by converting them to ASCII
- Converts tokens to lowercase
- Sorts, deduplicates, and concatenates tokens into a single token
- Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`]
Note that the apostrophe was converted to its ASCII counterpart.

diff --git a/_analyzers/tokenizers/index.md b/_analyzers/tokenizers/index.md
new file mode 100644
index 00000000..1d1752ad
--- /dev/null
+++ b/_analyzers/tokenizers/index.md
@@ -0,0 +1,61 @@
+---
+layout: default
+title: Tokenizers
+nav_order: 60
+has_children: false
+has_toc: false
+---
+
+# Tokenizers
+
+A tokenizer receives a stream of characters and splits the text into individual _tokens_. A token consists of a term (usually, a word) and metadata about this term. For example, a tokenizer can split text on white space so that the text `Actions speak louder than words.` becomes [`Actions`, `speak`, `louder`, `than`, `words.`].
+
+The output of a tokenizer is a stream of tokens. Tokenizers also maintain the following metadata about tokens:
+
+- The **order** or **position** of each token: This information is used for word and phrase proximity queries.
+- The starting and ending positions (**offsets**) of the tokens in the text: This information is used for highlighting search terms.
+- The token **type**: Some tokenizers (for example, `standard`) classify tokens by type, for example, `<ALPHANUM>` or `<NUM>`. Simpler tokenizers (for example, `letter`) only classify tokens as type `word`.
+
+You can use tokenizers to define custom analyzers.
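For example, a custom analyzer pairs a tokenizer with any token filters in the index settings. The following is a minimal sketch, not part of this patch; the index name `my-index` and analyzer name `my_custom_analyzer` are placeholders:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```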
+
+## Built-in tokenizers
+
+The following tables list the built-in tokenizers that OpenSearch provides.
+
+### Word tokenizers
+
+Word tokenizers parse full text into words.
+
+Tokenizer | Description | Example
+:--- | :--- | :---
+`standard` | - Parses strings into tokens at word boundaries
- Removes most punctuation | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`
becomes
[`It’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`] +`letter` | - Parses strings into tokens on any non-letter character
- Removes non-letter characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`
becomes
[`It`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `to`, `OpenSearch`] +`lowercase` | - Parses strings into tokens on any non-letter character
- Removes non-letter characters
- Converts terms to lowercase | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`
becomes
[`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`] +`whitespace` | - Parses strings into tokens at white space characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`
becomes
[`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`] +`uax_url_email` | - Similar to the standard tokenizer
- Unlike the standard tokenizer, leaves URLs and email addresses as single terms | `It’s fun to contribute a brand-new PR or 2 to OpenSearch opensearch-project@github.com!`
becomes
[`It’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`, `opensearch-project@github.com`] +`classic` | - Parses strings into tokens on:
  - Punctuation characters that are followed by a white space character
  - Hyphens if the term does not contain numbers
- Removes punctuation
- Leaves URLs and email addresses as single terms | `Part number PA-35234, single-use product (128.32)`
becomes
[`Part`, `number`, `PA-35234`, `single`, `use`, `product`, `128.32`] +`thai` | - Parses Thai text into terms | `สวัสดีและยินดีต`
becomes
[`สวัสด`, `และ`, `ยินดี`, `ต`]
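To see how a word tokenizer splits a given string, you can call the `_analyze` API with the tokenizer name. The following sketch reuses the sample sentence from the preceding table with the `standard` tokenizer; the response lists each token along with its offsets, position, and type:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "text": "It's fun to contribute a brand-new PR or 2 to OpenSearch!"
}
```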
+
+### Partial word tokenizers
+
+Partial word tokenizers parse text into words and generate fragments of those words for partial word matching.
+
+Tokenizer | Description | Example
+:--- | :--- | :---
+`ngram` | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates n-grams of each word | `My repo`
becomes
[`M`, `My`, `y`, `y `, ` `, ` r`, `r`, `re`, `e`, `ep`, `p`, `po`, `o`]
because the default n-gram length is 1--2 characters +`edge_ngram` | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates edge n-grams of each word (n-grams that start at the beginning of the word) | `My repo`
becomes
[`M`, `My`]
because the default n-gram length is 1--2 characters
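Partial word tokenizers are usually configured with explicit gram lengths rather than used with the defaults. The following is a sketch of a typical autocomplete setup; the index, tokenizer, and analyzer names are placeholders. It registers an `edge_ngram` tokenizer that emits prefixes of 2 to 10 characters:

```json
PUT /autocomplete-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```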
+
+### Structured text tokenizers
+
+Structured text tokenizers parse structured text, such as identifiers, email addresses, paths, or ZIP Codes.
+
+Tokenizer | Description | Example
+:--- | :--- | :---
+`keyword` | - No-op tokenizer
- Outputs the entire string unchanged
- Can be combined with token filters, such as `lowercase`, to normalize terms | `My repo`
becomes
`My repo` +`pattern` | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms
- Uses [Java regular expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) | `https://opensearch.org/forum`
becomes
[`https`, `opensearch`, `org`, `forum`] because, by default, the tokenizer splits terms on non-word characters (`\W+`)
Can be configured with a regex pattern +`simple_pattern` | - Uses a regular expression pattern to return matching text as terms
- Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html)
- Faster than the `pattern` tokenizer because it supports only a restricted subset of regular expression syntax | Returns an empty array by default
Must be configured with a pattern because the pattern defaults to an empty string +`simple_pattern_split` | - Uses a regular expression pattern to split the text at matches rather than returning the matches as terms
- Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html)
- Faster than the `pattern` tokenizer because it supports only a restricted subset of regular expression syntax | No-op by default
Must be configured with a pattern +`char_group` | - Parses on a set of configurable characters
- Faster than tokenizers that run regular expressions | No-op by default
Must be configured with a list of characters +`path_hierarchy` | - Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy | `one/two/three`
becomes
[`one`, `one/two`, `one/two/three`] + +
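Several structured text tokenizers accept parameters that change how they split the input. As a sketch (the sample text is a placeholder), the `path_hierarchy` tokenizer can be given a custom delimiter and tested directly in the `_analyze` API; this request should return the tokens `PA`, `PA-35234`, and `PA-35234-B`:

```json
POST /_analyze
{
  "tokenizer": {
    "type": "path_hierarchy",
    "delimiter": "-"
  },
  "text": "PA-35234-B"
}
```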