diff --git a/docs/reference/analysis/tokenfilters/word-delimiter-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/word-delimiter-tokenfilter.asciidoc
index 25074b2725e..02d6257cb4e 100644
--- a/docs/reference/analysis/tokenfilters/word-delimiter-tokenfilter.asciidoc
+++ b/docs/reference/analysis/tokenfilters/word-delimiter-tokenfilter.asciidoc
@@ -4,84 +4,379 @@
 Word delimiter
 ++++
 
-Named `word_delimiter`, it Splits words into subwords and performs
-optional transformations on subword groups. Words are split into
-subwords with the following rules:
+[WARNING]
+====
+We recommend using the
+<> instead of
+the `word_delimiter` filter.
 
-* split on intra-word delimiters (by default, all non alpha-numeric
-characters): "Wi-Fi" -> "Wi", "Fi"
-* split on case transitions: "PowerShot" -> "Power", "Shot"
-* split on letter-number transitions: "SD500" -> "SD", "500"
-* leading and trailing intra-word delimiters on each subword are
-ignored: "//hello---there, 'dude'" -> "hello", "there", "dude"
-* trailing "'s" are removed for each subword: "O'Neil's" -> "O", "Neil"
+The `word_delimiter` filter can produce invalid token graphs. See
+<>.
 
-Parameters include:
+The `word_delimiter` filter also uses Lucene's
+{lucene-analysis-docs}/miscellaneous/WordDelimiterFilter.html[WordDelimiterFilter],
+which is marked as deprecated.
+====
 
-`generate_word_parts`::
-    If `true` causes parts of words to be
-    generated: "Power-Shot", "(Power,Shot)" -> "Power" "Shot". Defaults to `true`.
+Splits tokens at non-alphanumeric characters. The `word_delimiter` filter
+also performs optional token normalization based on a set of rules. By default,
+the filter uses the following rules:
 
-`generate_number_parts`::
-    If `true` causes number subwords to be
-    generated: "500-42" -> "500" "42". Defaults to `true`.
+* Split tokens at non-alphanumeric characters.
+  The filter uses these characters as delimiters.
+  For example: `Super-Duper` -> `Super`, `Duper`
+* Remove leading or trailing delimiters from each token.
+  For example: `XL---42+'Autocoder'` -> `XL`, `42`, `Autocoder`
+* Split tokens at letter case transitions.
+  For example: `PowerShot` -> `Power`, `Shot`
+* Split tokens at letter-number transitions.
+  For example: `XL500` -> `XL`, `500`
+* Remove the English possessive (`'s`) from the end of each token.
+  For example: `Neil's` -> `Neil`
 
-`catenate_words`::
-    If `true` causes maximum runs of word parts to be
-    catenated: "wi-fi" -> "wifi". Defaults to `false`.
+[TIP]
+====
+The `word_delimiter` filter was designed to remove punctuation from complex
+identifiers, such as product IDs or part numbers. For these use cases, we
+recommend using the `word_delimiter` filter with the
+<> tokenizer.
 
-`catenate_numbers`::
-    If `true` causes maximum runs of number parts to
-    be catenated: "500-42" -> "50042". Defaults to `false`.
+Avoid using the `word_delimiter` filter to split hyphenated words, such as
+`wi-fi`. Because users often search for these words both with and without
+hyphens, we recommend using the
+<> filter instead.
+====
+
+[[analysis-word-delimiter-tokenfilter-analyze-ex]]
+==== Example
+
+The following <> request uses the
+`word_delimiter` filter to split `Neil's-Super-Duper-XL500--42+AutoCoder`
+into normalized tokens using the filter's default rules:
+
+[source,console]
+----
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "filter": [ "word_delimiter" ],
+  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
+}
+----
+
+The filter produces the following tokens:
+
+[source,txt]
+----
+[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
+----
+
+////
+[source,console-result]
+----
+{
+  "tokens": [
+    {
+      "token": "Neil",
+      "start_offset": 0,
+      "end_offset": 4,
+      "type": "word",
+      "position": 0
+    },
+    {
+      "token": "Super",
+      "start_offset": 7,
+      "end_offset": 12,
+      "type": "word",
+      "position": 1
+    },
+    {
+      "token": "Duper",
+      "start_offset": 13,
+      "end_offset": 18,
+      "type": "word",
+      "position": 2
+    },
+    {
+      "token": "XL",
+      "start_offset": 19,
+      "end_offset": 21,
+      "type": "word",
+      "position": 3
+    },
+    {
+      "token": "500",
+      "start_offset": 21,
+      "end_offset": 24,
+      "type": "word",
+      "position": 4
+    },
+    {
+      "token": "42",
+      "start_offset": 26,
+      "end_offset": 28,
+      "type": "word",
+      "position": 5
+    },
+    {
+      "token": "Auto",
+      "start_offset": 29,
+      "end_offset": 33,
+      "type": "word",
+      "position": 6
+    },
+    {
+      "token": "Coder",
+      "start_offset": 33,
+      "end_offset": 38,
+      "type": "word",
+      "position": 7
+    }
+  ]
+}
+----
+////
+
+[[analysis-word-delimiter-tokenfilter-analyzer-ex]]
+==== Add to an analyzer
+
+The following <> request uses the
+`word_delimiter` filter to configure a new
+<>.
+
+[source,console]
+----
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "keyword",
+          "filter": [ "word_delimiter" ]
+        }
+      }
+    }
+  }
+}
+----
+
+[WARNING]
+====
+Avoid using the `word_delimiter` filter with tokenizers that remove punctuation,
+such as the <> tokenizer. This could
+prevent the `word_delimiter` filter from splitting tokens correctly. It can also
+interfere with the filter's configurable parameters, such as `catenate_all` or
+`preserve_original`. We recommend using the
+<> or
+<> tokenizer instead.
+====
+
+[[word-delimiter-tokenfilter-configure-parms]]
+==== Configurable parameters
 
 `catenate_all`::
-    If `true` causes all subword parts to be catenated:
-    "wi-fi-4000" -> "wifi4000". Defaults to `false`.
++
+--
+(Optional, boolean)
+If `true`, the filter produces catenated tokens for chains of alphanumeric
+characters separated by non-alphabetic delimiters. For example:
+`super-duper-xl-500` -> [ `super`, **`superduperxl500`**, `duper`, `xl`, `500`
+]. Defaults to `false`.
 
-`split_on_case_change`::
-    If `true` causes "PowerShot" to be two tokens;
-    ("Power-Shot" remains two parts regards). Defaults to `true`.
+[WARNING]
+====
+When used for search analysis, catenated tokens can cause problems for the
+<> query and other queries that
+rely on token position for matching. Avoid setting this parameter to `true` if
+you plan to use these queries.
+====
+--
+
+`catenate_numbers`::
++
+--
+(Optional, boolean)
+If `true`, the filter produces catenated tokens for chains of numeric characters
+separated by non-alphabetic delimiters. For example: `01-02-03` ->
+[ `01`, **`010203`**, `02`, `03` ]. Defaults to `false`.
+
+[WARNING]
+====
+When used for search analysis, catenated tokens can cause problems for the
+<> query and other queries that
+rely on token position for matching. Avoid setting this parameter to `true` if
+you plan to use these queries.
+====
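+
+As a quick sketch, you can try the parameter with an inline filter definition
+in an `_analyze` request. Using the `01-02-03` text from the example above,
+this should return the catenated token `010203` alongside `01`, `02`, and
+`03`:
+
+[source,console]
+----
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "filter": [
+    {
+      "type": "word_delimiter",
+      "catenate_numbers": true
+    }
+  ],
+  "text": "01-02-03"
+}
+----
+--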
+
+`catenate_words`::
++
+--
+(Optional, boolean)
+If `true`, the filter produces catenated tokens for chains of alphabetical
+characters separated by non-alphabetic delimiters. For example: `super-duper-xl`
+-> [ `super`, **`superduperxl`**, `duper`, `xl` ]. Defaults to `false`.
+
+[WARNING]
+====
+When used for search analysis, catenated tokens can cause problems for the
+<> query and other queries that
+rely on token position for matching. Avoid setting this parameter to `true` if
+you plan to use these queries.
+====
+--
+
+`generate_number_parts`::
+(Optional, boolean)
+If `true`, the filter includes tokens consisting of only numeric characters in
+the output. If `false`, the filter excludes these tokens from the output.
+Defaults to `true`.
+
+`generate_word_parts`::
+(Optional, boolean)
+If `true`, the filter includes tokens consisting of only alphabetical characters
+in the output. If `false`, the filter excludes these tokens from the output.
+Defaults to `true`.
 
 `preserve_original`::
-    If `true` includes original words in subwords:
-    "500-42" -> "500-42" "500" "42". Defaults to `false`.
-
-`split_on_numerics`::
-    If `true` causes "j2se" to be three tokens; "j"
-    "2" "se". Defaults to `true`.
-
-`stem_english_possessive`::
-    If `true` causes trailing "'s" to be
-    removed for each subword: "O'Neil's" -> "O", "Neil". Defaults to `true`.
-
-Advance settings include:
+(Optional, boolean)
+If `true`, the filter includes the original version of any split tokens in the
+output. This original version includes non-alphanumeric delimiters. For example:
+`super-duper-xl-500` -> [ **`super-duper-xl-500`**, `super`, `duper`, `xl`,
+`500` ]. Defaults to `false`.
 
 `protected_words`::
-    A list of protected words from being delimiter.
-    Either an array, or also can set `protected_words_path` which resolved
-    to a file configured with protected words (one on each line).
-    Automatically resolves to `config/` based location if exists.
+(Optional, array of strings)
+Array of tokens the filter won't split.
+
+`protected_words_path`::
++
+--
+(Optional, string)
+Path to a file that contains a list of tokens the filter won't split.
+
+This path must be absolute or relative to the `config` location, and the file
+must be UTF-8 encoded. Each token in the file must be separated by a line
+break.
+--
+
+`split_on_case_change`::
+(Optional, boolean)
+If `true`, the filter splits tokens at letter case transitions. For example:
+`camelCase` -> [ `camel`, `Case` ]. Defaults to `true`.
+
+`split_on_numerics`::
+(Optional, boolean)
+If `true`, the filter splits tokens at letter-number transitions. For example:
+`j2se` -> [ `j`, `2`, `se` ]. Defaults to `true`.
+
+`stem_english_possessive`::
+(Optional, boolean)
+If `true`, the filter removes the English possessive (`'s`) from the end of each
+token. For example: `O'Neil's` -> [ `O`, `Neil` ]. Defaults to `true`.
 
 `type_table`::
-    A custom type mapping table, for example (when configured
-    using `type_table_path`):
++
+--
+(Optional, array of strings)
+Array of custom type mappings for characters. This allows you to map
+non-alphanumeric characters as numeric or alphanumeric to avoid splitting on
+those characters.
 
-[source,type_table]
---------------------------------------------------
-    # Map the $, %, '.', and ',' characters to DIGIT
-    # This might be useful for financial data.
-    $ => DIGIT
-    % => DIGIT
-    . => DIGIT
-    \\u002C => DIGIT
-
-    # in some cases you might not want to split on ZWJ
-    # this also tests the case where we need a bigger byte[]
-    # see http://en.wikipedia.org/wiki/Zero-width_joiner
-    \\u200D => ALPHANUM
--------------------------------------------------
-
-NOTE: Using a tokenizer like the `standard` tokenizer may interfere with
-the `catenate_*` and `preserve_original` parameters, as the original
-string may already have lost punctuation during tokenization. Instead,
-you may want to use the `whitespace` tokenizer.
+For example, the following array maps the plus (`+`) and hyphen (`-`) characters
+as alphabetical (`ALPHA`), which means they won't be treated as delimiters:
+
+`[ "+ => ALPHA", "- => ALPHA" ]`
+
+Supported types include:
+
+* `ALPHA` (Alphabetical)
+* `ALPHANUM` (Alphanumeric)
+* `DIGIT` (Numeric)
+* `LOWER` (Lowercase alphabetical)
+* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
+* `UPPER` (Uppercase alphabetical)
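+
+For a quick sketch, the following `_analyze` request applies these mappings
+inline. Because `+` and `-` are mapped as alphabetical, the text
+`super-duper+xl` should come through as a single token rather than being
+split:
+
+[source,console]
+----
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "filter": [
+    {
+      "type": "word_delimiter",
+      "type_table": [ "+ => ALPHA", "- => ALPHA" ]
+    }
+  ],
+  "text": "super-duper+xl"
+}
+----
+--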
+
+`type_table_path`::
++
+--
+(Optional, string)
+Path to a file that contains custom type mappings for characters. This allows
+you to map non-alphanumeric characters as numeric or alphanumeric to avoid
+splitting on those characters.
+
+For example, the contents of this file may contain the following:
+
+[source,txt]
+----
+# Map the $, %, '.', and ',' characters to DIGIT
+# This might be useful for financial data.
+$ => DIGIT
+% => DIGIT
+. => DIGIT
+\\u002C => DIGIT
+
+# in some cases you might not want to split on ZWJ
+# this also tests the case where we need a bigger byte[]
+# see http://en.wikipedia.org/wiki/Zero-width_joiner
+\\u200D => ALPHANUM
+----
+
+Supported types include:
+
+* `ALPHA` (Alphabetical)
+* `ALPHANUM` (Alphanumeric)
+* `DIGIT` (Numeric)
+* `LOWER` (Lowercase alphabetical)
+* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
+* `UPPER` (Uppercase alphabetical)
+
+This file path must be absolute or relative to the `config` location, and the
+file must be UTF-8 encoded. Each mapping in the file must be separated by a line
+break.
+--
+
+[[analysis-word-delimiter-tokenfilter-customize]]
+==== Customize
+
+To customize the `word_delimiter` filter, duplicate it to create the basis
+for a new custom token filter. You can modify the filter using its configurable
+parameters.
+
+For example, the following request creates a `word_delimiter`
+filter that uses the following rules:
+
+* Split tokens at non-alphanumeric characters, _except_ the hyphen (`-`)
+  character.
+* Remove leading or trailing delimiters from each token.
+* Do _not_ split tokens at letter case transitions.
+* Do _not_ split tokens at letter-number transitions.
+* Remove the English possessive (`'s`) from the end of each token.
+
+[source,console]
+----
+PUT /my_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_analyzer": {
+          "tokenizer": "keyword",
+          "filter": [ "my_custom_word_delimiter_filter" ]
+        }
+      },
+      "filter": {
+        "my_custom_word_delimiter_filter": {
+          "type": "word_delimiter",
+          "type_table": [ "- => ALPHA" ],
+          "split_on_case_change": false,
+          "split_on_numerics": false,
+          "stem_english_possessive": true
+        }
+      }
+    }
+  }
+}
+----
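+
+To spot-check the custom analyzer, you can follow up with a short `_analyze`
+request against the new index (the sample text here is hypothetical). Because
+the hyphen is mapped as alphabetical and the case and letter-number splits are
+disabled, text such as `XL-500+AutoCoder` should only be split at the plus
+sign:
+
+[source,console]
+----
+GET /my_index/_analyze
+{
+  "analyzer": "my_analyzer",
+  "text": "XL-500+AutoCoder"
+}
+----
\ No newline at end of file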