diff --git a/docs/reference/analysis/tokenfilters/word-delimiter-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/word-delimiter-tokenfilter.asciidoc
index 25074b2725e..02d6257cb4e 100644
--- a/docs/reference/analysis/tokenfilters/word-delimiter-tokenfilter.asciidoc
+++ b/docs/reference/analysis/tokenfilters/word-delimiter-tokenfilter.asciidoc
@@ -4,84 +4,379 @@
Word delimiter
++++
-Named `word_delimiter`, it Splits words into subwords and performs
-optional transformations on subword groups. Words are split into
-subwords with the following rules:
+[WARNING]
+====
+We recommend using the
+<<analysis-word-delimiter-graph-tokenfilter,`word_delimiter_graph`>> filter
+instead of the `word_delimiter` filter.
-* split on intra-word delimiters (by default, all non alpha-numeric
-characters): "Wi-Fi" -> "Wi", "Fi"
-* split on case transitions: "PowerShot" -> "Power", "Shot"
-* split on letter-number transitions: "SD500" -> "SD", "500"
-* leading and trailing intra-word delimiters on each subword are
-ignored: "//hello---there, 'dude'" -> "hello", "there", "dude"
-* trailing "'s" are removed for each subword: "O'Neil's" -> "O", "Neil"
+The `word_delimiter` filter can produce invalid token graphs. See
+<<analysis-word-delimiter-graph-differences>>.
-Parameters include:
+The `word_delimiter` filter also uses Lucene's
+{lucene-analysis-docs}/miscellaneous/WordDelimiterFilter.html[WordDelimiterFilter],
+which is marked as deprecated.
+====
-`generate_word_parts`::
- If `true` causes parts of words to be
- generated: "Power-Shot", "(Power,Shot)" -> "Power" "Shot". Defaults to `true`.
+Splits tokens at non-alphanumeric characters. The `word_delimiter` filter
+also performs optional token normalization based on a set of rules. By default,
+the filter uses the following rules:
-`generate_number_parts`::
- If `true` causes number subwords to be
- generated: "500-42" -> "500" "42". Defaults to `true`.
+* Split tokens at non-alphanumeric characters.
+ The filter uses these characters as delimiters.
+ For example: `Super-Duper` -> `Super`, `Duper`
+* Remove leading or trailing delimiters from each token.
+ For example: `XL---42+'Autocoder'` -> `XL`, `42`, `Autocoder`
+* Split tokens at letter case transitions.
+ For example: `PowerShot` -> `Power`, `Shot`
+* Split tokens at letter-number transitions.
+ For example: `XL500` -> `XL`, `500`
+* Remove the English possessive (`'s`) from the end of each token.
+ For example: `Neil's` -> `Neil`
-`catenate_words`::
- If `true` causes maximum runs of word parts to be
- catenated: "wi-fi" -> "wifi". Defaults to `false`.
+[TIP]
+====
+The `word_delimiter` filter was designed to remove punctuation from complex
+identifiers, such as product IDs or part numbers. For these use cases, we
+recommend using the `word_delimiter` filter with the
+<<analysis-keyword-tokenizer,`keyword`>> tokenizer.
-`catenate_numbers`::
- If `true` causes maximum runs of number parts to
- be catenated: "500-42" -> "50042". Defaults to `false`.
+Avoid using the `word_delimiter` filter to split hyphenated words, such as
+`wi-fi`. Because users often search for these words both with and without
+hyphens, we recommend using the
+<<analysis-synonym-graph-tokenfilter,`synonym_graph`>> filter instead.
+====
+
+[[analysis-word-delimiter-tokenfilter-analyze-ex]]
+==== Example
+
+The following <<indices-analyze,analyze API>> request uses the
+`word_delimiter` filter to split `Neil's-Super-Duper-XL500--42+AutoCoder`
+into normalized tokens using the filter's default rules:
+
+[source,console]
+----
+GET /_analyze
+{
+ "tokenizer": "keyword",
+ "filter": [ "word_delimiter" ],
+ "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
+}
+----
+
+The filter produces the following tokens:
+
+[source,txt]
+----
+[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
+----
+
+////
+[source,console-result]
+----
+{
+ "tokens": [
+ {
+ "token": "Neil",
+ "start_offset": 0,
+ "end_offset": 4,
+ "type": "word",
+ "position": 0
+ },
+ {
+ "token": "Super",
+ "start_offset": 7,
+ "end_offset": 12,
+ "type": "word",
+ "position": 1
+ },
+ {
+ "token": "Duper",
+ "start_offset": 13,
+ "end_offset": 18,
+ "type": "word",
+ "position": 2
+ },
+ {
+ "token": "XL",
+ "start_offset": 19,
+ "end_offset": 21,
+ "type": "word",
+ "position": 3
+ },
+ {
+ "token": "500",
+ "start_offset": 21,
+ "end_offset": 24,
+ "type": "word",
+ "position": 4
+ },
+ {
+ "token": "42",
+ "start_offset": 26,
+ "end_offset": 28,
+ "type": "word",
+ "position": 5
+ },
+ {
+ "token": "Auto",
+ "start_offset": 29,
+ "end_offset": 33,
+ "type": "word",
+ "position": 6
+ },
+ {
+ "token": "Coder",
+ "start_offset": 33,
+ "end_offset": 38,
+ "type": "word",
+ "position": 7
+ }
+ ]
+}
+----
+////
+
+[[analysis-word-delimiter-tokenfilter-analyzer-ex]]
+==== Add to an analyzer
+
+The following <<indices-create-index,create index API>> request uses the
+`word_delimiter` filter to configure a new
+<<analysis-custom-analyzer,custom analyzer>>.
+
+[source,console]
+----
+PUT /my_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "my_analyzer": {
+ "tokenizer": "keyword",
+ "filter": [ "word_delimiter" ]
+ }
+ }
+ }
+ }
+}
+----
+
+[WARNING]
+====
+Avoid using the `word_delimiter` filter with tokenizers that remove punctuation,
+such as the <<analysis-standard-tokenizer,`standard`>> tokenizer. This could
+prevent the `word_delimiter` filter from splitting tokens correctly. It can also
+interfere with the filter's configurable parameters, such as `catenate_all` or
+`preserve_original`. We recommend using the
+<<analysis-keyword-tokenizer,`keyword`>> or
+<<analysis-whitespace-tokenizer,`whitespace`>> tokenizer instead.
+====
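+
+For example, the following request is a minimal sketch of an analyzer that
+instead pairs the `word_delimiter` filter with the `whitespace` tokenizer. The
+index name `my-whitespace-index` and analyzer name `my_whitespace_analyzer`
+are hypothetical:
+
+[source,console]
+----
+PUT /my-whitespace-index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "my_whitespace_analyzer": {
+          "tokenizer": "whitespace",
+          "filter": [ "word_delimiter" ]
+        }
+      }
+    }
+  }
+}
+----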
+
+[[word-delimiter-tokenfilter-configure-parms]]
+==== Configurable parameters
`catenate_all`::
- If `true` causes all subword parts to be catenated:
- "wi-fi-4000" -> "wifi4000". Defaults to `false`.
++
+--
+(Optional, boolean)
+If `true`, the filter produces catenated tokens for chains of alphanumeric
+characters separated by non-alphabetic delimiters. For example:
+`super-duper-xl-500` -> [ `super`, **`superduperxl500`**, `duper`, `xl`, `500`
+]. Defaults to `false`.
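+For a sample request, see the sketch after this parameter list.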
-`split_on_case_change`::
- If `true` causes "PowerShot" to be two tokens;
- ("Power-Shot" remains two parts regards). Defaults to `true`.
+[WARNING]
+====
+When used for search analysis, catenated tokens can cause problems for the
+<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
+rely on token position for matching. Avoid setting this parameter to `true` if
+you plan to use these queries.
+====
+--
+
+`catenate_numbers`::
++
+--
+(Optional, boolean)
+If `true`, the filter produces catenated tokens for chains of numeric characters
+separated by non-alphabetic delimiters. For example: `01-02-03` ->
+[ `01`, **`010203`**, `02`, `03` ]. Defaults to `false`.
+
+[WARNING]
+====
+When used for search analysis, catenated tokens can cause problems for the
+<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
+rely on token position for matching. Avoid setting this parameter to `true` if
+you plan to use these queries.
+====
+--
+
+`catenate_words`::
++
+--
+(Optional, boolean)
+If `true`, the filter produces catenated tokens for chains of alphabetical
+characters separated by non-alphabetic delimiters. For example: `super-duper-xl`
+-> [ `super`, **`superduperxl`**, `duper`, `xl` ]. Defaults to `false`.
+
+[WARNING]
+====
+When used for search analysis, catenated tokens can cause problems for the
+<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
+rely on token position for matching. Avoid setting this parameter to `true` if
+you plan to use these queries.
+====
+--
+
+`generate_number_parts`::
+(Optional, boolean)
+If `true`, the filter includes tokens consisting of only numeric characters in
+the output. If `false`, the filter excludes these tokens from the output.
+Defaults to `true`.
+
+`generate_word_parts`::
+(Optional, boolean)
+If `true`, the filter includes tokens consisting of only alphabetical characters
+in the output. If `false`, the filter excludes these tokens from the output.
+Defaults to `true`.
`preserve_original`::
- If `true` includes original words in subwords:
- "500-42" -> "500-42" "500" "42". Defaults to `false`.
-
-`split_on_numerics`::
- If `true` causes "j2se" to be three tokens; "j"
- "2" "se". Defaults to `true`.
-
-`stem_english_possessive`::
- If `true` causes trailing "'s" to be
- removed for each subword: "O'Neil's" -> "O", "Neil". Defaults to `true`.
-
-Advance settings include:
+(Optional, boolean)
+If `true`, the filter includes the original version of any split tokens in the
+output. This original version includes non-alphanumeric delimiters. For example:
+`super-duper-xl-500` -> [ **`super-duper-xl-500`**, `super`, `duper`, `xl`,
+`500` ]. Defaults to `false`.
`protected_words`::
- A list of protected words from being delimiter.
- Either an array, or also can set `protected_words_path` which resolved
- to a file configured with protected words (one on each line).
- Automatically resolves to `config/` based location if exists.
+(Optional, array of strings)
+Array of tokens the filter won't split.
+
+`protected_words_path`::
++
+--
+(Optional, string)
+Path to a file that contains a list of tokens the filter won't split.
+
+This path must be absolute or relative to the `config` location, and the file
+must be UTF-8 encoded. Each token in the file must be separated by a line
+break.
+--
+
+`split_on_case_change`::
+(Optional, boolean)
+If `true`, the filter splits tokens at letter case transitions. For example:
+`camelCase` -> [ `camel`, `Case` ]. Defaults to `true`.
+
+`split_on_numerics`::
+(Optional, boolean)
+If `true`, the filter splits tokens at letter-number transitions. For example:
+`j2se` -> [ `j`, `2`, `se` ]. Defaults to `true`.
+
+`stem_english_possessive`::
+(Optional, boolean)
+If `true`, the filter removes the English possessive (`'s`) from the end of each
+token. For example: `O'Neil's` -> [ `O`, `Neil` ]. Defaults to `true`.
`type_table`::
- A custom type mapping table, for example (when configured
- using `type_table_path`):
++
+--
+(Optional, array of strings)
+Array of custom type mappings for characters. This allows you to map
+non-alphanumeric characters as numeric or alphanumeric to avoid splitting on
+those characters.
-[source,type_table]
---------------------------------------------------
- # Map the $, %, '.', and ',' characters to DIGIT
- # This might be useful for financial data.
- $ => DIGIT
- % => DIGIT
- . => DIGIT
- \\u002C => DIGIT
+For example, the following array maps the plus (`+`) and hyphen (`-`) characters
+as alphanumeric, which means they won't be treated as delimiters:
- # in some cases you might not want to split on ZWJ
- # this also tests the case where we need a bigger byte[]
- # see http://en.wikipedia.org/wiki/Zero-width_joiner
- \\u200D => ALPHANUM
---------------------------------------------------
+`[ "+ => ALPHA", "- => ALPHA" ]`
-NOTE: Using a tokenizer like the `standard` tokenizer may interfere with
-the `catenate_*` and `preserve_original` parameters, as the original
-string may already have lost punctuation during tokenization. Instead,
-you may want to use the `whitespace` tokenizer.
+Supported types include:
+
+* `ALPHA` (Alphabetical)
+* `ALPHANUM` (Alphanumeric)
+* `DIGIT` (Numeric)
+* `LOWER` (Lowercase alphabetical)
+* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
+* `UPPER` (Uppercase alphabetical)
+--
+
+`type_table_path`::
++
+--
+(Optional, string)
+Path to a file that contains custom type mappings for characters. This allows
+you to map non-alphanumeric characters as numeric or alphanumeric to avoid
+splitting on those characters.
+
+For example, the file may contain the following:
+
+[source,txt]
+----
+# Map the $, %, '.', and ',' characters to DIGIT
+# This might be useful for financial data.
+$ => DIGIT
+% => DIGIT
+. => DIGIT
+\\u002C => DIGIT
+
+# in some cases you might not want to split on ZWJ
+# this also tests the case where we need a bigger byte[]
+# see http://en.wikipedia.org/wiki/Zero-width_joiner
+\\u200D => ALPHANUM
+----
+
+Supported types include:
+
+* `ALPHA` (Alphabetical)
+* `ALPHANUM` (Alphanumeric)
+* `DIGIT` (Numeric)
+* `LOWER` (Lowercase alphabetical)
+* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
+* `UPPER` (Uppercase alphabetical)
+
+This file path must be absolute or relative to the `config` location, and the
+file must be UTF-8 encoded. Each mapping in the file must be separated by a line
+break.
+--
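+
+The effect of the `catenate_*` and `preserve_original` parameters can be
+checked with the analyze API. The following request is a minimal sketch that
+enables `catenate_all` for an ad hoc `word_delimiter` filter; based on the
+parameter description above, the output should include the catenated token
+`superduperxl500` alongside `super`, `duper`, `xl`, and `500`:
+
+[source,console]
+----
+GET /_analyze
+{
+  "tokenizer": "keyword",
+  "filter": [
+    {
+      "type": "word_delimiter",
+      "catenate_all": true
+    }
+  ],
+  "text": "super-duper-xl-500"
+}
+----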
+
+[[analysis-word-delimiter-tokenfilter-customize]]
+==== Customize
+
+To customize the `word_delimiter` filter, duplicate it to create the basis
+for a new custom token filter. You can modify the filter using its configurable
+parameters.
+
+For example, the following request creates a `word_delimiter`
+filter that uses the following rules:
+
+* Split tokens at non-alphanumeric characters, _except_ the hyphen (`-`)
+ character.
+* Remove leading or trailing delimiters from each token.
+* Do _not_ split tokens at letter case transitions.
+* Do _not_ split tokens at letter-number transitions.
+* Remove the English possessive (`'s`) from the end of each token.
+
+[source,console]
+----
+PUT /my_index
+{
+ "settings": {
+ "analysis": {
+ "analyzer": {
+ "my_analyzer": {
+ "tokenizer": "keyword",
+ "filter": [ "my_custom_word_delimiter_filter" ]
+ }
+ },
+ "filter": {
+ "my_custom_word_delimiter_filter": {
+ "type": "word_delimiter",
+ "type_table": [ "- => ALPHA" ],
+ "split_on_case_change": false,
+ "split_on_numerics": false,
+ "stem_english_possessive": true
+ }
+ }
+ }
+ }
+}
+----
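+
+To check the resulting tokens, the custom analyzer can be exercised with the
+analyze API. The following request is a minimal sketch that reuses the
+`my_analyzer` analyzer defined above; the exact output depends on the rules
+listed earlier:
+
+[source,console]
+----
+GET /my_index/_analyze
+{
+  "analyzer": "my_analyzer",
+  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
+}
+----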
\ No newline at end of file