Docs: Document how to rebuild analyzers (#30498)
Adds documentation for how to rebuild all of the built-in analyzers, and adds tests for that documentation using the mechanism added in #29535. Closes #29499
parent 7f47ff9fcd
commit 9881bfaea5
@@ -9,20 +9,6 @@ Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
1. <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
2. <<analysis-asciifolding-tokenfilter>>
3. <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
4. <<analysis-fingerprint-tokenfilter>>

[float]
=== Example output

@@ -149,3 +135,46 @@ The above example produces the following term:
---------------------------
[ consistent godel said sentence yes ]
---------------------------

[float]
=== Definition

The `fingerprint` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter>>

If you need to customize the `fingerprint` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`fingerprint` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
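
As a quick check, you could run some text through the rebuilt analyzer with
the `_analyze` API (this assumes the `fingerprint_example` index created
above); the response should contain a single lowercased, deduplicated,
sorted token, just as the default `fingerprint` analyzer would produce:

[source,js]
----------------------------------------------------
POST /fingerprint_example/_analyze
{
  "analyzer": "rebuilt_fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------------------------------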
@@ -4,14 +4,6 @@
The `keyword` analyzer is a ``noop'' analyzer which returns the entire input
string as a single token.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-keyword-tokenizer,Keyword Tokenizer>>

[float]
=== Example output

@@ -57,3 +49,40 @@ The above sentence would produce the following single term:
=== Configuration

The `keyword` analyzer is not configurable.

[float]
=== Definition

The `keyword` analyzer consists of:

Tokenizer::
* <<analysis-keyword-tokenizer,Keyword Tokenizer>>

If you need to customize the `keyword` analyzer then you need to
recreate it as a `custom` analyzer and modify it, usually by adding
token filters. You should usually prefer the
<<keyword,Keyword type>> when you want strings that are not split
into tokens, but just in case you need it, this would recreate the
built-in `keyword` analyzer and you can use it as a starting point
for further customization:

[source,js]
----------------------------------------------------
PUT /keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_keyword": {
          "tokenizer": "keyword",
          "filter": [ <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: keyword_example, first: keyword, second: rebuilt_keyword}\nendyaml\n/]
<1> You'd add any token filters here.
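
For example, if you also wanted the single token lowercased, one possible
variation would be to add the `lowercase` filter (the
`keyword_lowercase_example` index and analyzer names are only for this
sketch):

[source,js]
----------------------------------------------------
PUT /keyword_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
----------------------------------------------------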
@@ -19,19 +19,6 @@ Read more about http://www.regular-expressions.info/catastrophic.html[pathologic

========================================

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

[float]
=== Example output

@@ -378,3 +365,51 @@ The regex above is easier to understand as:
    [\p{L}&&[^\p{Lu}]]  # then lower case
  )
--------------------------------------------------

[float]
=== Definition

The `pattern` analyzer consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `pattern` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`pattern` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+" <1>
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/]
<1> The default pattern is `\W+`, which splits on non-word characters,
and this is where you'd change it.
<2> You'd add other token filters after `lowercase`.
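
As an illustration of <1>, the same structure could be adapted to split on
commas instead; the `pattern_on_commas_example` index and the tokenizer and
analyzer names below are only for this sketch:

[source,js]
----------------------------------------------------
PUT /pattern_on_commas_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_commas": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "comma_pattern": {
          "tokenizer": "split_on_commas",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
----------------------------------------------------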
@@ -4,14 +4,6 @@
The `simple` analyzer breaks text into terms whenever it encounters a
character which is not a letter. All terms are lower cased.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

[float]
=== Example output

@@ -127,3 +119,37 @@ The above sentence would produce the following terms:
=== Configuration

The `simple` analyzer is not configurable.

[float]
=== Definition

The `simple` analyzer consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

If you need to customize the `simple` analyzer then you need to recreate
it as a `custom` analyzer and modify it, usually by adding token filters.
This would recreate the built-in `simple` analyzer and you can use it as
a starting point for further customization:

[source,js]
----------------------------------------------------
PUT /simple_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [ <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: simple_example, first: simple, second: rebuilt_simple}\nendyaml\n/]
<1> You'd add any token filters here.
@@ -7,19 +7,6 @@ Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

[float]
=== Example output

@@ -276,3 +263,44 @@ The above example produces the following terms:
---------------------------
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
---------------------------

[float]
=== Definition

The `standard` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `standard` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`standard` analyzer and you can use it as a starting point:

[source,js]
----------------------------------------------------
PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase" <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/]
<1> You'd add any token filters after `lowercase`.
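
For example, to get the same effect as the `stopwords` configuration
parameter in this rebuilt form, you could define a `stop` filter and append
it after `lowercase`; the `standard_with_stop_example` index and the filter
and analyzer names below are only for illustration:

[source,js]
----------------------------------------------------
PUT /standard_with_stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "standard_with_stop": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "english_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------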
@@ -5,17 +5,6 @@ The `stop` analyzer is the same as the <<analysis-simple-analyzer,`simple` analy
but adds support for removing stop words. It defaults to using the
`_english_` stop words.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

Token filters::
* <<analysis-stop-tokenfilter,Stop Token Filter>>

[float]
=== Example output

@@ -239,3 +228,50 @@ The above example produces the following terms:
---------------------------
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
---------------------------

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

Token filters::
* <<analysis-stop-tokenfilter,Stop Token Filter>>

If you need to customize the `stop` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`stop` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_" <1>
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "english_stop" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: stop_example, first: stop, second: rebuilt_stop}\nendyaml\n/]
<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
<2> You'd add any token filters after `english_stop`.
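
For instance, to use your own stopword list instead of `_english_` (as <1>
mentions), you could pass an explicit array to the `stop` filter; the
`stop_custom_words_example` index name, filter name, and word list below are
only for this sketch:

[source,js]
----------------------------------------------------
PUT /stop_custom_words_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": [ "and", "the", "over" ]
        }
      },
      "analyzer": {
        "custom_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "my_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------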
@@ -4,14 +4,6 @@
The `whitespace` analyzer breaks text into terms whenever it encounters a
whitespace character.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-whitespace-tokenizer,Whitespace Tokenizer>>

[float]
=== Example output

@@ -120,3 +112,37 @@ The above sentence would produce the following terms:
=== Configuration

The `whitespace` analyzer is not configurable.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-whitespace-tokenizer,Whitespace Tokenizer>>

If you need to customize the `whitespace` analyzer then you need to
recreate it as a `custom` analyzer and modify it, usually by adding
token filters. This would recreate the built-in `whitespace` analyzer
and you can use it as a starting point for further customization:

[source,js]
----------------------------------------------------
PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_whitespace": {
          "tokenizer": "whitespace",
          "filter": [ <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: whitespace_example, first: whitespace, second: rebuilt_whitespace}\nendyaml\n/]
<1> You'd add any token filters here.
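
A quick way to confirm that the rebuilt analyzer does nothing but split on
whitespace is to run some mixed-case, punctuated text through the `_analyze`
API (this assumes the `whitespace_example` index created above); case and
punctuation should come through unchanged:

[source,js]
----------------------------------------------------
POST /whitespace_example/_analyze
{
  "analyzer": "rebuilt_whitespace",
  "text": "The QUICK brown-foxes jumped!"
}
----------------------------------------------------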