[DOCS] Reformat `pattern_replace` token filter (#57699) (#57995)

Changes:

* Rewrites description and adds Lucene link
* Adds analyze example
* Adds parameter definitions
* Adds custom analyzer example
2020-06-11 12:19:38 -04:00 · 2020-06-11 12:19:38 -04:00 · c36df27730
parent 85b0b540f0
commit c36df27730
1 changed files with 148 additions and 14 deletions
--- a/docs/reference/analysis/tokenfilters/pattern_replace-tokenfilter.asciidoc
+++ b/docs/reference/analysis/tokenfilters/pattern_replace-tokenfilter.asciidoc
@@ -4,23 +4,157 @@
 <titleabbrev>Pattern replace</titleabbrev>
 ++++
-The `pattern_replace` token filter allows to easily handle string
-replacements based on a regular expression. The regular expression is
-defined using the `pattern` parameter, and the replacement string can be
-provided using the `replacement` parameter (supporting referencing the
-original text, as explained
-http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#appendReplacement(java.lang.StringBuffer,%20java.lang.String)[here]).
+Uses a regular expression to match and replace token substrings.
+
+The `pattern_replace` filter uses
+http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java's
+regular expression syntax]. By default, the filter replaces matching
+substrings with an empty substring (`""`).
 Regular expressions cannot be anchored to the
 beginning or end of a token. Replacement substrings can use Java's
 https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuffer-java.lang.String-[`$g` syntax] to reference capture groups
 from the original token text.
 [WARNING]
-.Beware of Pathological Regular Expressions
-========================================
-The pattern replace token filter uses
-http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java Regular Expressions].
-A badly written regular expression could run very slowly or even throw a
-StackOverflowError and cause the node it is running on to exit suddenly.
-Read more about http://www.regular-expressions.info/catastrophic.html[pathological regular expressions and how to avoid them].
-========================================
+====
+A poorly-written regular expression may run slowly or return a
 StackOverflowError, causing the node running the expression to exit suddenly.
+Read more about
+http://www.regular-expressions.info/catastrophic.html[pathological regular
 expressions and how to avoid them].
 ====
+This filter uses Lucene's
+{lucene-analysis-docs}/pattern/PatternReplaceFilter.html[PatternReplaceFilter].
+[[analysis-pattern-replace-tokenfilter-analyze-ex]]
 ==== Example
+The following <<indices-analyze,analyze API>> request uses the `pattern_replace`
 filter to prepend `watch` to the substring `dog` in `foxes jump lazy dogs`.
 [source,console]
 ----
 GET /_analyze
 {
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "(dog)",
      "replacement": "watch$1"
    }
  ],
  "text": "foxes jump lazy dogs"
 }
 ----
 The filter produces the following tokens.
 [source,text]
 ----
 [ foxes, jump, lazy, watchdogs ]
 ----
 ////
 [source,console-result]
 ----
 {
  "tokens": [
    {
      "token": "foxes",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "jump",
      "start_offset": 6,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "lazy",
      "start_offset": 11,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "watchdogs",
      "start_offset": 16,
      "end_offset": 20,
      "type": "word",
      "position": 3
    }
  ]
 }
 ----
 ////
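The `$1` reference in the analyze example follows Java's `Matcher` replacement syntax, so the per-token effect can be approximated in plain Java. The following is a minimal sketch for illustration only, not how Lucene's `PatternReplaceFilter` is implemented. The class name `WatchdogSketch`, the helper `analyze`, and the whitespace split standing in for the `whitespace` tokenizer are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class WatchdogSketch {
    // Hypothetical helper: splits the text on whitespace (a stand-in for the
    // `whitespace` tokenizer) and applies the replacement to each token.
    public static List<String> analyze(String text, String regex, String replacement) {
        Pattern pattern = Pattern.compile(regex);
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // `$1` in the replacement refers to the first capture group.
            tokens.add(pattern.matcher(token).replaceAll(replacement));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [foxes, jump, lazy, watchdogs]
        System.out.println(analyze("foxes jump lazy dogs", "(dog)", "watch$1"));
    }
}
```

Only the token containing `dog` changes; tokens without a match pass through unmodified.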
 [[analysis-pattern-replace-tokenfilter-configure-parms]]
 ==== Configurable parameters
 `all`::
 (Optional, boolean)
 If `true`, all substrings matching the `pattern` parameter's regular expression
 are replaced. If `false`, the filter replaces only the first matching substring
 in each token. Defaults to `true`.
 `pattern`::
 (Required, string)
 Regular expression, written in
 http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java's
 regular expression syntax]. The filter replaces token substrings matching this
 pattern with the substring in the `replacement` parameter.
 `replacement`::
 (Optional, string)
 Replacement substring. Defaults to an empty substring (`""`).
 [[analysis-pattern-replace-tokenfilter-customize]]
 ==== Customize and add to an analyzer
 To customize the `pattern_replace` filter, duplicate it to create the basis
 for a new custom token filter. You can modify the filter using its configurable
 parameters.
 The following <<indices-create-index,create index API>> request
 configures a new <<analysis-custom-analyzer,custom analyzer>> using a custom
 `pattern_replace` filter, `my_pattern_replace_filter`.
 The `my_pattern_replace_filter` filter uses the regular expression `[£|€]` to
 match and remove the currency symbols `£` and `€`. The filter's `all`
 parameter is `false`, meaning only the first matching symbol in each token is
 removed.
 [source,console]
 ----
 PUT /my_index
 {
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "my_pattern_replace_filter"
          ]
        }
      },
      "filter": {
        "my_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "[£|€]",
          "replacement": "",
          "all": false
        }
      }
    }
  }
 }
 ----
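The effect of the `all` parameter in the custom filter can be approximated with `Matcher.replaceFirst` and `Matcher.replaceAll`. The sketch below is an illustration under stated assumptions rather than the filter's actual code: the class name `CurrencySketch` and the helper `replace` are hypothetical, and the input string is treated as a single token, as the `keyword` tokenizer would emit it. The currency symbols are written as Unicode escapes (`\u00A3` for `£`, `\u20AC` for `€`) so the source stays ASCII-only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CurrencySketch {
    // Hypothetical helper mirroring the filter's `all` parameter:
    // all == true replaces every match, all == false only the first one.
    public static String replace(String token, String regex, String replacement, boolean all) {
        Matcher matcher = Pattern.compile(regex).matcher(token);
        return all ? matcher.replaceAll(replacement) : matcher.replaceFirst(replacement);
    }

    public static void main(String[] args) {
        String token = "\u00A310 or \u20AC12";   // pound sign, "10 or ", euro sign, "12"
        String pattern = "[\u00A3|\u20AC]";      // the pattern from the request above

        // all = false: only the leading pound sign is removed.
        System.out.println(replace(token, pattern, "", false));
        // all = true: both currency symbols are removed.
        System.out.println(replace(token, pattern, "", true));
        // Inside a character class `|` is a literal, so the pipe is removed too.
        System.out.println(replace("a|b", pattern, "", true));
    }
}
```

One detail worth noting: in Java regular expressions, `|` inside a character class is an ordinary character rather than alternation, so `[£|€]` also matches a literal `|`. A class without the pipe, `[£€]`, matches only the two currency symbols.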