[DOCS] Reformat CJK bigram and CJK width token filter docs (#48210)
Parent: 2751a4ff1b
Commit: a66bb2c7ed
[[analysis-cjk-bigram-tokenfilter]]
=== CJK bigram token filter
++++
<titleabbrev>CJK bigram</titleabbrev>
++++

Forms https://en.wikipedia.org/wiki/Bigram[bigrams] out of CJK (Chinese,
Japanese, and Korean) tokens.

This filter is included in {es}'s built-in <<cjk-analyzer,CJK language
analyzer>>. It uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html[CJKBigramFilter].


[[analysis-cjk-bigram-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request demonstrates how the
CJK bigram token filter works.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_bigram"],
  "text" : "東京都は、日本の首都であり"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ 東京, 京都, 都は, 日本, 本の, の首, 首都, 都で, であ, あり ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "東京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<DOUBLE>",
      "position" : 0
    },
    {
      "token" : "京都",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "<DOUBLE>",
      "position" : 1
    },
    {
      "token" : "都は",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<DOUBLE>",
      "position" : 2
    },
    {
      "token" : "日本",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<DOUBLE>",
      "position" : 3
    },
    {
      "token" : "本の",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<DOUBLE>",
      "position" : 4
    },
    {
      "token" : "の首",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "<DOUBLE>",
      "position" : 5
    },
    {
      "token" : "首都",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<DOUBLE>",
      "position" : 6
    },
    {
      "token" : "都で",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "<DOUBLE>",
      "position" : 7
    },
    {
      "token" : "であ",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "<DOUBLE>",
      "position" : 8
    },
    {
      "token" : "あり",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<DOUBLE>",
      "position" : 9
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-cjk-bigram-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
CJK bigram token filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

[source,console]
--------------------------------------------------
PUT /cjk_bigram_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "standard_cjk_bigram" : {
          "tokenizer" : "standard",
          "filter" : ["cjk_bigram"]
        }
      }
    }
  }
}
--------------------------------------------------


[[analysis-cjk-bigram-tokenfilter-configure-parms]]
==== Configurable parameters

`ignored_scripts`::
+
--
(Optional, array of character scripts)
Array of character scripts for which to disable bigrams.
Possible values:

* `han`
* `hangul`
* `hiragana`
* `katakana`

All non-CJK input is passed through unmodified.
--

`output_unigrams`::
(Optional, boolean)
If `true`, emit tokens in both bigram and
https://en.wikipedia.org/wiki/N-gram[unigram] form. If `false`, a CJK character
is output in unigram form when it has no adjacent characters. Defaults to
`false`.
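As a rough illustration of how these parameters interact, the following Python sketch approximates the filter's behavior over a run of characters. It is not Lucene's `CJKBigramFilter`: the real filter operates on tokenizer output and interleaves unigram and bigram positions, and the `script_of` helper here is a hypothetical name-based approximation added for this sketch.

```python
import unicodedata

def script_of(ch: str) -> str:
    # Rough script detection via Unicode character names (sketch only).
    name = unicodedata.name(ch, "")
    if "CJK UNIFIED" in name:
        return "han"
    for script in ("hiragana", "katakana", "hangul"):
        if script.upper() in name:
            return script
    return "other"

def cjk_bigrams(text, ignored_scripts=frozenset(), output_unigrams=False):
    """Approximate the cjk_bigram filter over a run of characters."""
    out, cjk_run, other_run = [], [], []

    def flush_cjk():
        run = "".join(cjk_run)
        cjk_run.clear()
        if not run:
            return
        if len(run) == 1:
            out.append(run)                 # lone CJK char: unigram fallback
            return
        if output_unigrams:
            out.extend(run)                 # unigrams alongside the bigrams
        out.extend(run[i:i + 2] for i in range(len(run) - 1))

    def flush_other():
        if other_run:
            out.append("".join(other_run))  # non-CJK input passes through
            other_run.clear()

    for ch in text:
        script = script_of(ch)
        if script == "other":
            flush_cjk()
            other_run.append(ch)
        elif script in ignored_scripts:
            flush_cjk()
            flush_other()
            out.append(ch)                  # disabled script: emit a unigram
        else:
            flush_other()
            cjk_run.append(ch)
    flush_cjk()
    flush_other()
    return out
```

Note that the `standard` tokenizer drops punctuation such as `、` before this filter runs, which is why the bigrams in the example above never span the comma (the offsets jump from 4 to 5).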

[[analysis-cjk-bigram-tokenfilter-customize]]
==== Customize

To customize the CJK bigram token filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

[source,console]
--------------------------------------------------
PUT /cjk_bigram_example
{
  ...
      "han_bigrams_filter" : {
        "type" : "cjk_bigram",
        "ignored_scripts": [
          "hangul",
          "hiragana",
          "katakana"
        ],
        "output_unigrams" : true
      }
  ...
}
--------------------------------------------------


[[analysis-cjk-width-tokenfilter]]
=== CJK width token filter
++++
<titleabbrev>CJK width</titleabbrev>
++++

Normalizes width differences in CJK (Chinese, Japanese, and Korean) characters
as follows:

* Folds full-width ASCII character variants into the equivalent basic Latin
characters
* Folds half-width Katakana character variants into the equivalent Kana
characters

This filter is included in {es}'s built-in <<cjk-analyzer,CJK language
analyzer>>. It uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html[CJKWidthFilter].

NOTE: This token filter can be viewed as a subset of NFKC/NFKD Unicode
normalization. See the
{plugins}/analysis-icu-normalization-charfilter.html[`analysis-icu` plugin] for
full normalization support.
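The relationship to NFKC called out in the note above can be seen with Python's standard `unicodedata` module. This is an approximation for illustration only: full NFKC also folds many compatibility characters (circled numbers, ligatures, and so on) that `cjk_width` leaves untouched.

```python
import unicodedata

def fold_width(text: str) -> str:
    # NFKC folds fullwidth ASCII variants to basic Latin and halfwidth
    # katakana to fullwidth kana (composing voiced marks on the way),
    # which covers both foldings cjk_width performs -- plus more.
    return unicodedata.normalize("NFKC", text)
```

For example, `fold_width("ｱ")` yields the fullwidth `ア`, and `fold_width("Ｔｅｓｔ")` yields `Test`.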
[[analysis-cjk-width-tokenfilter-analyze-ex]]
==== Example

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_width"],
  "text" : "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}
--------------------------------------------------

The filter produces the following token:

[source,text]
--------------------------------------------------
シーサイドライナー
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "シーサイドライナー",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<KATAKANA>",
      "position" : 0
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-cjk-width-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
CJK width token filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

[source,console]
--------------------------------------------------
PUT /cjk_width_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "standard_cjk_width" : {
          "tokenizer" : "standard",
          "filter" : ["cjk_width"]
        }
      }
    }
  }
}
--------------------------------------------------