Docs: Documented the cjk_width and cjk_bigram token filters
This commit is contained in:
parent ed5b49a5be
commit bc402d5f87
@ -1,7 +1,7 @@
[[analysis-tokenfilters]]
== Token Filters

Token filters accept a stream of tokens from a
<<analysis-tokenizers,tokenizer>> and can modify tokens
(eg lowercasing), delete tokens (eg remove stopwords)
or add tokens (eg synonyms).
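The modify/delete/add behaviours described above can be sketched as simple functions over a token stream. This is a conceptual Python illustration only, not Elasticsearch's implementation; the function names are invented for this example:

```python
def lowercase_filter(tokens):
    # Modifies tokens (eg lowercasing).
    for token in tokens:
        yield token.lower()

def stop_filter(tokens, stopwords):
    # Deletes tokens (eg removes stopwords).
    for token in tokens:
        if token not in stopwords:
            yield token

# Filters chain: the output stream of one is the input stream of the next.
tokens = ["The", "Quick", "Fox"]
filtered = list(stop_filter(lowercase_filter(tokens), {"the"}))
# filtered == ["quick", "fox"]
```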
@ -71,6 +71,10 @@ include::tokenfilters/common-grams-tokenfilter.asciidoc[]

include::tokenfilters/normalization-tokenfilter.asciidoc[]

include::tokenfilters/cjk-width-tokenfilter.asciidoc[]

include::tokenfilters/cjk-bigram-tokenfilter.asciidoc[]

include::tokenfilters/delimited-payload-tokenfilter.asciidoc[]

include::tokenfilters/keep-words-tokenfilter.asciidoc[]
@ -0,0 +1,42 @@
[[analysis-cjk-bigram-tokenfilter]]
=== CJK Bigram Token Filter

The `cjk_bigram` token filter forms bigrams out of the CJK
terms that are generated by the <<analysis-standard-tokenizer,`standard` tokenizer>>
or the `icu_tokenizer` (see <<icu-analysis-plugin>>).

By default, when a CJK character has no adjacent characters to form a bigram,
it is output in unigram form. If you always want to output both unigrams and
bigrams, set the `output_unigrams` flag to `true`. This can be used for a
combined unigram+bigram approach.

Bigrams are generated for characters in `han`, `hiragana`, `katakana` and
`hangul`, but bigrams can be disabled for particular scripts with the
`ignore_scripts` parameter. All non-CJK input is passed through unmodified.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "han_bigrams" : {
                    "tokenizer" : "standard",
                    "filter" : ["han_bigrams_filter"]
                }
            },
            "filter" : {
                "han_bigrams_filter" : {
                    "type" : "cjk_bigram",
                    "ignore_scripts": [
                        "hiragana",
                        "katakana",
                        "hangul"
                    ],
                    "output_unigrams" : true
                }
            }
        }
    }
}
--------------------------------------------------
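As a rough illustration of the bigramming behaviour described above, overlapping bigrams are formed from each run of CJK characters, with a lone character falling back to unigram form. This is a Python sketch of the concept only, not the Lucene implementation; the function name and signature are invented for this example:

```python
def cjk_bigrams(term, output_unigrams=False):
    """Form overlapping bigrams from a run of CJK characters (sketch only).

    A single character with no adjacent character is emitted in unigram
    form; with output_unigrams=True, unigrams are emitted alongside the
    bigrams (the combined unigram+bigram approach).
    """
    if len(term) < 2:
        return list(term)  # no adjacent character: unigram form
    bigrams = [term[i:i + 2] for i in range(len(term) - 1)]
    if output_unigrams:
        return list(term) + bigrams
    return bigrams

# cjk_bigrams("東京都") -> ["東京", "京都"]
```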
@ -0,0 +1,12 @@
[[analysis-cjk-width-tokenfilter]]
=== CJK Width Token Filter

The `cjk_width` token filter normalizes CJK width differences:

* Folds fullwidth ASCII variants into the equivalent basic Latin
* Folds halfwidth Katakana variants into the equivalent Kana

NOTE: This token filter can be viewed as a subset of NFKC/NFKD
Unicode normalization. See the <<icu-analysis-plugin>>
for full normalization support.
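Since the NOTE above describes these foldings as a subset of NFKC normalization, both can be reproduced with Python's standard library. This illustrates the underlying Unicode mappings, not the filter itself:

```python
import unicodedata

# Fullwidth ASCII variants fold into the equivalent basic Latin under NFKC:
assert unicodedata.normalize("NFKC", "Ｔｏｋｙｏ") == "Tokyo"

# Halfwidth Katakana variants fold into the equivalent (fullwidth) Kana:
assert unicodedata.normalize("NFKC", "ﾄｳｷｮｳ") == "トウキョウ"
```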