Docs: Documented the cjk_width and cjk_bigram token filters

Clinton Gormley 2014-06-09 22:40:58 +02:00
parent ed5b49a5be
commit bc402d5f87
3 changed files with 59 additions and 1 deletions

@ -1,7 +1,7 @@
[[analysis-tokenfilters]]
== Token Filters
Token filters accept a stream of tokens from a
<<analysis-tokenizers,tokenizer>> and can modify tokens
(e.g. lowercasing), delete tokens (e.g. removing stopwords)
or add tokens (e.g. synonyms).
@ -71,6 +71,10 @@ include::tokenfilters/common-grams-tokenfilter.asciidoc[]
include::tokenfilters/normalization-tokenfilter.asciidoc[]
include::tokenfilters/cjk-width-tokenfilter.asciidoc[]
include::tokenfilters/cjk-bigram-tokenfilter.asciidoc[]
include::tokenfilters/delimited-payload-tokenfilter.asciidoc[]
include::tokenfilters/keep-words-tokenfilter.asciidoc[]

@ -0,0 +1,42 @@
[[analysis-cjk-bigram-tokenfilter]]
=== CJK Bigram Token Filter

The `cjk_bigram` token filter forms bigrams out of the CJK
terms that are generated by the <<analysis-standard-tokenizer,`standard` tokenizer>>
or the `icu_tokenizer` (see <<icu-analysis-plugin>>).

By default, when a CJK character has no adjacent characters to form a bigram,
it is output in unigram form. If you always want to output both unigrams and
bigrams, set the `output_unigrams` flag to `true`. This can be used for a
combined unigram+bigram approach.

Bigrams are generated for characters in `han`, `hiragana`, `katakana` and
`hangul`, but bigrams can be disabled for particular scripts with the
`ignore_scripts` parameter. All non-CJK input is passed through unmodified.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "han_bigrams" : {
                    "tokenizer" : "standard",
                    "filter" : ["han_bigrams_filter"]
                }
            },
            "filter" : {
                "han_bigrams_filter" : {
                    "type" : "cjk_bigram",
                    "ignore_scripts": [
                        "hiragana",
                        "katakana",
                        "hangul"
                    ],
                    "output_unigrams" : true
                }
            }
        }
    }
}
--------------------------------------------------

@ -0,0 +1,12 @@
[[analysis-cjk-width-tokenfilter]]
=== CJK Width Token Filter

The `cjk_width` token filter normalizes CJK width differences:

* Folds fullwidth ASCII variants into the equivalent Basic Latin characters
* Folds halfwidth Katakana variants into the equivalent fullwidth Kana

NOTE: This token filter can be viewed as a subset of NFKC/NFKD
Unicode normalization. See the <<icu-analysis-plugin>>
for full normalization support.
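
As a minimal sketch, the settings below wire `cjk_width` into a custom
analyzer, following the same pattern as the `cjk_bigram` example. The names
`width_folded` and `width_filter` are hypothetical placeholders, and the
filter is assumed to take no parameters of its own.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "width_folded" : {
                    "tokenizer" : "standard",
                    "filter" : ["width_filter"]
                }
            },
            "filter" : {
                "width_filter" : {
                    "type" : "cjk_width"
                }
            }
        }
    }
}
--------------------------------------------------

If both filters are used in one analyzer, listing the width filter before the
bigram filter would likely be the sensible order, so that bigrams are formed
from already-normalized characters.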