[DOCS] Reformat CJK bigram and CJK width token filter docs (#48210)
Parent: 2751a4ff1b
Commit: a66bb2c7ed
[[analysis-cjk-bigram-tokenfilter]]
=== CJK bigram token filter
++++
<titleabbrev>CJK bigram</titleabbrev>
++++

Forms https://en.wikipedia.org/wiki/Bigram[bigrams] out of CJK (Chinese,
Japanese, and Korean) tokens.

This filter is included in {es}'s built-in <<cjk-analyzer,CJK language
analyzer>>. It uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html[CJKBigramFilter].


[[analysis-cjk-bigram-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request demonstrates how the
CJK bigram token filter works.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_bigram"],
  "text" : "東京都は、日本の首都であり"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ 東京, 京都, 都は, 日本, 本の, の首, 首都, 都で, であ, あり ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "東京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<DOUBLE>",
      "position" : 0
    },
    {
      "token" : "京都",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "<DOUBLE>",
      "position" : 1
    },
    {
      "token" : "都は",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<DOUBLE>",
      "position" : 2
    },
    {
      "token" : "日本",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<DOUBLE>",
      "position" : 3
    },
    {
      "token" : "本の",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<DOUBLE>",
      "position" : 4
    },
    {
      "token" : "の首",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "<DOUBLE>",
      "position" : 5
    },
    {
      "token" : "首都",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<DOUBLE>",
      "position" : 6
    },
    {
      "token" : "都で",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "<DOUBLE>",
      "position" : 7
    },
    {
      "token" : "であ",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "<DOUBLE>",
      "position" : 8
    },
    {
      "token" : "あり",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<DOUBLE>",
      "position" : 9
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-cjk-bigram-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
CJK bigram token filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

[source,console]
--------------------------------------------------
PUT /cjk_bigram_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "standard_cjk_bigram" : {
          "tokenizer" : "standard",
          "filter" : ["cjk_bigram"]
        }
      }
    }
  }
}
--------------------------------------------------


[[analysis-cjk-bigram-tokenfilter-configure-parms]]
==== Configurable parameters

`ignored_scripts`::
+
--
(Optional, array of character scripts)
Array of character scripts for which to disable bigrams.
Possible values:

* `han`
* `hangul`
* `hiragana`
* `katakana`

All non-CJK input is passed through unmodified.
--

`output_unigrams`::
(Optional, boolean)
If `true`, emit tokens in both bigram and
https://en.wikipedia.org/wiki/N-gram[unigram] form. If `false`, a CJK character
is output in unigram form when it has no adjacent characters. Defaults to
`false`.
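As a rough illustration of how these parameters interact, the following Python sketch approximates the filter's behavior over a run of characters. It is not Lucene's `CJKBigramFilter`: the real filter operates on tokenizer output and interleaves unigram and bigram positions, and the `script_of` helper here is a hypothetical name-based approximation added for this sketch.

```python
import unicodedata

def script_of(ch: str) -> str:
    # Rough script detection via Unicode character names (sketch only).
    name = unicodedata.name(ch, "")
    if "CJK UNIFIED" in name:
        return "han"
    for script in ("hiragana", "katakana", "hangul"):
        if script.upper() in name:
            return script
    return "other"

def cjk_bigrams(text, ignored_scripts=frozenset(), output_unigrams=False):
    """Approximate the cjk_bigram filter over a run of characters."""
    out, cjk_run, other_run = [], [], []

    def flush_cjk():
        run = "".join(cjk_run)
        cjk_run.clear()
        if not run:
            return
        if len(run) == 1:
            out.append(run)                 # lone CJK char: unigram fallback
            return
        if output_unigrams:
            out.extend(run)                 # unigrams alongside the bigrams
        out.extend(run[i:i + 2] for i in range(len(run) - 1))

    def flush_other():
        if other_run:
            out.append("".join(other_run))  # non-CJK input passes through
            other_run.clear()

    for ch in text:
        script = script_of(ch)
        if script == "other":
            flush_cjk()
            other_run.append(ch)
        elif script in ignored_scripts:
            flush_cjk()
            flush_other()
            out.append(ch)                  # disabled script: emit a unigram
        else:
            flush_other()
            cjk_run.append(ch)
    flush_cjk()
    flush_other()
    return out
```

Note that the `standard` tokenizer drops punctuation such as `、` before this filter runs, which is why the bigrams in the example above never span the comma (the offsets jump from 4 to 5).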

[[analysis-cjk-bigram-tokenfilter-customize]]
==== Customize

To customize the CJK bigram token filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

[source,console]
--------------------------------------------------
PUT /cjk_bigram_example
{
  ...
      "han_bigrams_filter" : {
        "type" : "cjk_bigram",
        "ignored_scripts": [
          "hangul",
          "hiragana",
          "katakana"
        ],
        "output_unigrams" : true
      }
  ...
}
--------------------------------------------------


[[analysis-cjk-width-tokenfilter]]
=== CJK width token filter
++++
<titleabbrev>CJK width</titleabbrev>
++++

Normalizes width differences in CJK (Chinese, Japanese, and Korean) characters
as follows:

* Folds full-width ASCII character variants into the equivalent basic Latin
characters
* Folds half-width Katakana character variants into the equivalent Kana
characters

This filter is included in {es}'s built-in <<cjk-analyzer,CJK language
analyzer>>. It uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html[CJKWidthFilter].

NOTE: This token filter can be viewed as a subset of NFKC/NFKD Unicode
normalization. See the
{plugins}/analysis-icu-normalization-charfilter.html[`analysis-icu` plugin] for
full normalization support.
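The relationship to NFKC called out in the note above can be seen with Python's standard `unicodedata` module. This is an approximation for illustration only: full NFKC also folds many compatibility characters (circled numbers, ligatures, and so on) that `cjk_width` leaves untouched.

```python
import unicodedata

def fold_width(text: str) -> str:
    # NFKC folds fullwidth ASCII variants to basic Latin and halfwidth
    # katakana to fullwidth kana (composing voiced marks on the way),
    # which covers both foldings cjk_width performs -- plus more.
    return unicodedata.normalize("NFKC", text)
```

For example, `fold_width("ｱ")` yields the fullwidth `ア`, and `fold_width("Ｔｅｓｔ")` yields `Test`.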
[[analysis-cjk-width-tokenfilter-analyze-ex]]
==== Example

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_width"],
  "text" : "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}
--------------------------------------------------

The filter produces the following token:

[source,text]
--------------------------------------------------
シーサイドライナー
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "シーサイドライナー",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<KATAKANA>",
      "position" : 0
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-cjk-width-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
CJK width token filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

[source,console]
--------------------------------------------------
PUT /cjk_width_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "standard_cjk_width" : {
          "tokenizer" : "standard",
          "filter" : ["cjk_width"]
        }
      }
    }
  }
}
--------------------------------------------------