[DOCS] Reformat `word_delimiter` token filter (#53387)

Makes the following changes to the `word_delimiter` token filter docs:

* Adds a warning admonition recommending the `word_delimiter_graph`
  filter instead. This warning includes a link to the deprecated Lucene
  `WordDelimiterFilter`.
* Updates the description
* Adds detailed analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates parameter documentation
James Rodewig 2020-03-11 09:03:57 -04:00 committed by GitHub
parent 063957b7d8
commit 933a9c6fca
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed file with 358 additions and 63 deletions


@@ -4,70 +4,313 @@
<titleabbrev>Word delimiter</titleabbrev>
++++

[WARNING]
====
We recommend using the
<<analysis-word-delimiter-graph-tokenfilter,`word_delimiter_graph`>> instead of
the `word_delimiter` filter.

The `word_delimiter` filter can produce invalid token graphs. See
<<analysis-word-delimiter-graph-differences>>.

The `word_delimiter` filter also uses Lucene's
{lucene-analysis-docs}/miscellaneous/WordDelimiterFilter.html[WordDelimiterFilter],
which is marked as deprecated.
====

Splits tokens at non-alphanumeric characters. The `word_delimiter` filter
also performs optional token normalization based on a set of rules. By default,
the filter uses the following rules:

* Split tokens at non-alphanumeric characters.
The filter uses these characters as delimiters.
For example: `Super-Duper` -> `Super`, `Duper`
* Remove leading or trailing delimiters from each token.
For example: `XL---42+'Autocoder'` -> `XL`, `42`, `Autocoder`
* Split tokens at letter case transitions.
For example: `PowerShot` -> `Power`, `Shot`
* Split tokens at letter-number transitions.
For example: `XL500` -> `XL`, `500`
* Remove the English possessive (`'s`) from the end of each token.
For example: `Neil's` -> `Neil`

[TIP]
====
The `word_delimiter` filter was designed to remove punctuation from complex
identifiers, such as product IDs or part numbers. For these use cases, we
recommend using the `word_delimiter` filter with the
<<analysis-keyword-tokenizer,`keyword`>> tokenizer.

Avoid using the `word_delimiter` filter to split hyphenated words, such as
`wi-fi`. Because users often search for these words both with and without
hyphens, we recommend using the
<<analysis-synonym-graph-tokenfilter,`synonym_graph`>> filter instead.
====
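
For example, a hyphenated word such as `wi-fi` could instead be handled with a
custom `synonym_graph` filter. The following request is only a minimal sketch;
the index name, filter name, and synonym rule are placeholders chosen for
illustration.

[source,console]
----
PUT /my_synonym_index
{
  "settings": {
    "analysis": {
      "filter": {
        "wifi_synonyms": {
          "type": "synonym_graph",
          "synonyms": [ "wi-fi, wifi, wi fi" ]
        }
      },
      "analyzer": {
        "my_synonym_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "wifi_synonyms" ]
        }
      }
    }
  }
}
----

With a setup like this, a search for any of the listed variants can match the
others.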
[[analysis-word-delimiter-tokenfilter-analyze-ex]]
==== Example
The following <<indices-analyze,analyze API>> request uses the
`word_delimiter` filter to split `Neil's-Super-Duper-XL500--42+AutoCoder`
into normalized tokens using the filter's default rules:
[source,console]
----
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [ "word_delimiter" ],
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
----
The filter produces the following tokens:
[source,txt]
----
[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
----
////
[source,console-result]
----
{
  "tokens": [
    {
      "token": "Neil",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "Super",
      "start_offset": 7,
      "end_offset": 12,
      "type": "word",
      "position": 1
    },
    {
      "token": "Duper",
      "start_offset": 13,
      "end_offset": 18,
      "type": "word",
      "position": 2
    },
    {
      "token": "XL",
      "start_offset": 19,
      "end_offset": 21,
      "type": "word",
      "position": 3
    },
    {
      "token": "500",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 4
    },
    {
      "token": "42",
      "start_offset": 26,
      "end_offset": 28,
      "type": "word",
      "position": 5
    },
    {
      "token": "Auto",
      "start_offset": 29,
      "end_offset": 33,
      "type": "word",
      "position": 6
    },
    {
      "token": "Coder",
      "start_offset": 33,
      "end_offset": 38,
      "type": "word",
      "position": 7
    }
  ]
}
----
////
[[analysis-word-delimiter-tokenfilter-analyzer-ex]]
==== Add to an analyzer
The following <<indices-create-index,create index API>> request uses the
`word_delimiter` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.
[source,console]
----
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "word_delimiter" ]
        }
      }
    }
  }
}
----
[WARNING]
====
Avoid using the `word_delimiter` filter with tokenizers that remove punctuation,
such as the <<analysis-standard-tokenizer,`standard`>> tokenizer. This could
prevent the `word_delimiter` filter from splitting tokens correctly. It can also
interfere with the filter's configurable parameters, such as `catenate_all` or
`preserve_original`. We recommend using the
<<analysis-keyword-tokenizer,`keyword`>> or
<<analysis-whitespace-tokenizer,`whitespace`>> tokenizer instead.
====
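
To illustrate the issue, the following sketch runs `catenate_all` behind the
`standard` tokenizer. Because the tokenizer has already split `Super-Duper` at
the hyphen, the filter receives two separate tokens and cannot produce a
catenated `SuperDuper` token; the output contains only `Super` and `Duper`. The
request is for illustration only.

[source,console]
----
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "word_delimiter",
      "catenate_all": true
    }
  ],
  "text": "Super-Duper"
}
----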
[[word-delimiter-tokenfilter-configure-parms]]
==== Configurable parameters

`catenate_all`::
+
--
(Optional, boolean)
If `true`, the filter produces catenated tokens for chains of alphanumeric
characters separated by non-alphabetic delimiters. For example:
`super-duper-xl-500` -> [ `super`, **`superduperxl500`**, `duper`, `xl`, `500`
]. Defaults to `false`. For a sketch that combines this parameter with
`preserve_original` and `protected_words`, see the example after this parameter
list.

[WARNING]
====
When used for search analysis, catenated tokens can cause problems for the
<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
rely on token position for matching. Avoid setting this parameter to `true` if
you plan to use these queries.
====
--
`catenate_numbers`::
+
--
(Optional, boolean)
If `true`, the filter produces catenated tokens for chains of numeric characters
separated by non-alphabetic delimiters. For example: `01-02-03` ->
[ `01`, **`010203`**, `02`, `03` ]. Defaults to `false`.
[WARNING]
====
When used for search analysis, catenated tokens can cause problems for the
<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
rely on token position for matching. Avoid setting this parameter to `true` if
you plan to use these queries.
====
--
`catenate_words`::
+
--
(Optional, boolean)
If `true`, the filter produces catenated tokens for chains of alphabetical
characters separated by non-alphabetic delimiters. For example: `super-duper-xl`
-> [ `super`, **`superduperxl`**, `duper`, `xl` ]. Defaults to `false`.
[WARNING]
====
When used for search analysis, catenated tokens can cause problems for the
<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
rely on token position for matching. Avoid setting this parameter to `true` if
you plan to use these queries.
====
--
`generate_number_parts`::
(Optional, boolean)
If `true`, the filter includes tokens consisting of only numeric characters in
the output. If `false`, the filter excludes these tokens from the output.
Defaults to `true`.
`generate_word_parts`::
(Optional, boolean)
If `true`, the filter includes tokens consisting of only alphabetical characters
in the output. If `false`, the filter excludes these tokens from the output.
Defaults to `true`.

`preserve_original`::
(Optional, boolean)
If `true`, the filter includes the original version of any split tokens in the
output. This original version includes non-alphanumeric delimiters. For
example: `super-duper-xl-500` -> [ **`super-duper-xl-500`**, `super`, `duper`,
`xl`, `500` ]. Defaults to `false`.

`protected_words`::
(Optional, array of strings)
Array of tokens the filter won't split.

`protected_words_path`::
+
--
(Optional, string)
Path to a file that contains a list of tokens the filter won't split.
This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each token in the file must be separated by a line
break.
--
`split_on_case_change`::
(Optional, boolean)
If `true`, the filter splits tokens at letter case transitions. For example:
`camelCase` -> [ `camel`, `Case` ]. Defaults to `true`.
`split_on_numerics`::
(Optional, boolean)
If `true`, the filter splits tokens at letter-number transitions. For example:
`j2se` -> [ `j`, `2`, `se` ]. Defaults to `true`.
`stem_english_possessive`::
(Optional, boolean)
If `true`, the filter removes the English possessive (`'s`) from the end of each
token. For example: `O'Neil's` -> [ `O`, `Neil` ]. Defaults to `true`.

`type_table`::
+
--
(Optional, array of strings)
Array of custom type mappings for characters. This allows you to map
non-alphanumeric characters as numeric or alphanumeric to avoid splitting on
those characters.

For example, the following array maps the plus (`+`) and hyphen (`-`)
characters as alphanumeric, which means they won't be treated as delimiters:

`[ "+ => ALPHA", "- => ALPHA" ]`
Supported types include:
* `ALPHA` (Alphabetical)
* `ALPHANUM` (Alphanumeric)
* `DIGIT` (Numeric)
* `LOWER` (Lowercase alphabetical)
* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
* `UPPER` (Uppercase alphabetical)
--
`type_table_path`::
+
--
(Optional, string)
Path to a file that contains custom type mappings for characters. This allows
you to map non-alphanumeric characters as numeric or alphanumeric to avoid
splitting on those characters.
For example, the contents of this file may contain the following:
[source,txt]
----
# Map the $, %, '.', and ',' characters to DIGIT
# This might be useful for financial data.
$ => DIGIT
@@ -79,9 +322,61 @@ Advance settings include:
# this also tests the case where we need a bigger byte[]
# see http://en.wikipedia.org/wiki/Zero-width_joiner
\\u200D => ALPHANUM
----

Supported types include:

* `ALPHA` (Alphabetical)
* `ALPHANUM` (Alphanumeric)
* `DIGIT` (Numeric)
* `LOWER` (Lowercase alphabetical)
* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
* `UPPER` (Uppercase alphabetical)

This file path must be absolute or relative to the `config` location, and the
file must be UTF-8 encoded. Each mapping in the file must be separated by a line
break.
--
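
The following request is a sketch that combines several of the parameters above
in a single transient filter definition; the sample text and parameter values
are illustrative assumptions only.

[source,console]
----
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter",
      "catenate_all": true,
      "preserve_original": true,
      "protected_words": [ "XL500" ]
    }
  ],
  "text": "super-duper XL500"
}
----

With these settings, `super-duper` yields the preserved original `super-duper`,
the parts `super` and `duper`, and the catenated token `superduper`, while
`XL500` passes through unsplit because it is listed in `protected_words`.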
[[analysis-word-delimiter-tokenfilter-customize]]
==== Customize
To customize the `word_delimiter` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.
For example, the following request creates a `word_delimiter`
filter that uses the following rules:
* Split tokens at non-alphanumeric characters, _except_ the hyphen (`-`)
character.
* Remove leading or trailing delimiters from each token.
* Do _not_ split tokens at letter case transitions.
* Do _not_ split tokens at letter-number transitions.
* Remove the English possessive (`'s`) from the end of each token.
[source,console]
----
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_custom_word_delimiter_filter" ]
        }
      },
      "filter": {
        "my_custom_word_delimiter_filter": {
          "type": "word_delimiter",
          "type_table": [ "- => ALPHA" ],
          "split_on_case_change": false,
          "split_on_numerics": false,
          "stem_english_possessive": true
        }
      }
    }
  }
}
----
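
To check the effect of the custom filter, you could follow up with an analyze
request like the one below. The sample text is an assumption chosen to exercise
each configured rule; based on those rules, the expected output is
[ Neil, wi-fi, AutoCoder, j2se ].

[source,console]
----
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Neil's wi-fi AutoCoder j2se"
}
----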