[[analysis-kuromoji]]
=== Japanese (kuromoji) Analysis Plugin

The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis
module into Elasticsearch.

:plugin_name: analysis-kuromoji
include::install_remove.asciidoc[]

[[analysis-kuromoji-analyzer]]
==== `kuromoji` analyzer

The `kuromoji` analyzer consists of the following tokenizer and token filters:

* <<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>
* <<analysis-kuromoji-baseform,`kuromoji_baseform`>> token filter
* <<analysis-kuromoji-speech,`kuromoji_part_of_speech`>> token filter
* {ref}/analysis-cjk-width-tokenfilter.html[`cjk_width`] token filter
* <<analysis-kuromoji-stop,`ja_stop`>> token filter
* <<analysis-kuromoji-stemmer,`kuromoji_stemmer`>> token filter
* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter

It supports the `mode` and `user_dictionary` settings from
<<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>.

[[analysis-kuromoji-charfilter]]
==== `kuromoji_iteration_mark` character filter

The `kuromoji_iteration_mark` character filter normalizes Japanese horizontal
iteration marks (_odoriji_) to their expanded form. It accepts the following settings:

`normalize_kanji`:: Indicates whether kanji iteration marks should be normalized. Defaults to `true`.

`normalize_kana`:: Indicates whether kana iteration marks should be normalized. Defaults to `true`.

[[analysis-kuromoji-tokenizer]]
==== `kuromoji_tokenizer`

The `kuromoji_tokenizer` accepts the following settings:

`mode`::
+
--

The tokenization mode determines how the tokenizer handles compound and
unknown words. It can be set to:

`normal`:: Normal segmentation, no decomposition for compounds. Example output:

    関西国際空港
    アブラカダブラ

`search`:: Segmentation geared towards search. This includes a decompounding process
for long nouns, also including the full compound token as a synonym.
Example output:

    関西, 関西国際空港, 国際, 空港
    アブラカダブラ

`extended`:: Extended mode outputs unigrams for unknown words.
Example output:

    関西, 関西国際空港, 国際, 空港
    ア, ブ, ラ, カ, ダ, ブ, ラ
--

`discard_punctuation`:: Whether punctuation should be discarded from the output. Defaults to `true`.

`user_dictionary`::
+
--
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A `user_dictionary`
may be appended to the default dictionary. The dictionary should have the following
CSV format:

[source,csv]
-----------------------
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
-----------------------
--

As a demonstration of how the user dictionary can be used, save the following
dictionary to `$ES_HOME/config/userdict_ja.txt`:

[source,csv]
-----------------------
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
-----------------------

--

You can also inline the rules directly in the tokenizer definition using
the `user_dictionary_rules` option:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "user_dictionary_rules": ["東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
--------------------------------------------------
--

`nbest_cost`/`nbest_examples`::
+
--
Additional expert user parameters `nbest_cost` and `nbest_examples` can be used
to include additional tokens that are most likely according to the statistical model.
If both parameters are used, the larger of the two values is applied.

`nbest_cost`:: The `nbest_cost` parameter specifies an additional Viterbi cost.
The KuromojiTokenizer will include all tokens in Viterbi paths that are
within the `nbest_cost` value of the best path.

`nbest_examples`:: The `nbest_examples` parameter can be used to find a `nbest_cost` value based on examples.
For example, a value of /箱根山-箱根/成田空港-成田/ indicates that in the texts,
箱根山 (Mt.
Hakone) and 成田空港 (Narita Airport), we'd like a cost that gives us
箱根 (Hakone) and 成田 (Narita).
--

Then create an analyzer as follows:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "東京スカイツリー"
}
--------------------------------------------------

The above `analyze` request returns the following:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 1
  } ]
}
--------------------------------------------------

`discard_compound_token`:: Whether original compound tokens should be discarded from the output
in `search` mode. Defaults to `false`.
Example output with `search` or `extended` mode and this option set to `true`:

    関西, 国際, 空港

[[analysis-kuromoji-baseform]]
==== `kuromoji_baseform` token filter

The `kuromoji_baseform` token filter replaces terms with their
BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.
Example:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "飲み"
}
--------------------------------------------------

which responds with:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  } ]
}
--------------------------------------------------

[[analysis-kuromoji-speech]]
==== `kuromoji_part_of_speech` token filter

The `kuromoji_part_of_speech` token filter removes tokens that match a set of
part-of-speech tags. It accepts the following setting:

`stoptags`:: An array of part-of-speech tags that should be removed. It defaults to the
`stoptags.txt` file embedded in the `lucene-analyzer-kuromoji.jar`.

For example:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "寿司がおいしいね"
}
--------------------------------------------------

Which responds with:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------

[[analysis-kuromoji-readingform]]
==== `kuromoji_readingform` token filter

The `kuromoji_readingform` token filter replaces the token with its reading
form in either
katakana or romaji. It accepts the following setting:

`use_romaji`:: Whether romaji reading form should be output instead of katakana. Defaults to `false`.

When using the pre-defined `kuromoji_readingform` filter, `use_romaji` is set
to `true`. The default when defining a custom `kuromoji_readingform`, however,
is `false`. The only reason to use the custom form is if you need the
katakana reading form:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "romaji_readingform" ]
          },
          "katakana_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "katakana_readingform" ]
          }
        },
        "filter": {
          "romaji_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": true
          },
          "katakana_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": false
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "katakana_analyzer",
  "text": "寿司" <1>
}

GET kuromoji_sample/_analyze
{
  "analyzer": "romaji_analyzer",
  "text": "寿司" <2>
}
--------------------------------------------------

<1> Returns `スシ`.
<2> Returns `sushi`.

[[analysis-kuromoji-stemmer]]
==== `kuromoji_stemmer` token filter

The `kuromoji_stemmer` token filter normalizes common katakana spelling
variations ending in a long sound character by removing this character
(U+30FC). Only full-width katakana characters are supported.

This token filter accepts the following setting:

`minimum_length`:: Katakana words shorter than `minimum_length` are not stemmed (default is `4`).
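
The stemming rule described above can be approximated in a few lines of Python. This is an illustrative sketch only, not the plugin's code: the names `stem_katakana` and `is_katakana` are hypothetical, and treating the Unicode Katakana block (U+30A0 to U+30FF) as "full-width katakana" is an assumption.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー, the long sound character


def is_katakana(term: str) -> bool:
    # Assumption: "full-width katakana" means the Unicode Katakana block.
    return all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term: str, minimum_length: int = 4) -> str:
    """Drop a trailing long sound mark from a sufficiently long katakana word."""
    if len(term) < minimum_length:
        return term  # too short: left unchanged
    if not is_katakana(term):
        return term  # only all-katakana terms are considered
    if term.endswith(PROLONGED_SOUND_MARK):
        return term[:-1]
    return term


print(stem_katakana("コピー"))    # three characters, below the default minimum
print(stem_katakana("サーバー"))  # four characters, trailing ー removed
```

With the default `minimum_length` of `4`, the three-character word `コピー` is left as-is while `サーバー` loses its final `ー`.
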
[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_katakana_stemmer"
            ]
          }
        },
        "filter": {
          "my_katakana_stemmer": {
            "type": "kuromoji_stemmer",
            "minimum_length": 4
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "コピー" <1>
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "サーバー" <2>
}
--------------------------------------------------

<1> Returns `コピー`.
<2> Returns `サーバ`.

[[analysis-kuromoji-stop]]
==== `ja_stop` token filter

The `ja_stop` token filter removes Japanese stopwords (`_japanese_`) and any
custom stopwords specified by the user. This filter only supports the
predefined `_japanese_` stopwords list. If you want to use a different
predefined list, use the
{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_ja_stop": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "ja_stop"
            ]
          }
        },
        "filter": {
          "ja_stop": {
            "type": "ja_stop",
            "stopwords": [
              "_japanese_",
              "ストップ"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "analyzer_with_ja_stop",
  "text": "ストップは消える"
}
--------------------------------------------------

The above request returns:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------

[[analysis-kuromoji-number]]
==== `kuromoji_number` token filter

The `kuromoji_number` token filter normalizes Japanese numbers (kansūji)
to regular Arabic decimal numbers in half-width characters.
For example:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_number"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "一〇〇〇"
}
--------------------------------------------------

Which results in:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "1000",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 0
  } ]
}
--------------------------------------------------
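
To make the kind of normalization performed by `kuromoji_number` more concrete, here is a simplified Python sketch of converting kanji numerals to an integer. It is an illustration under stated assumptions, not the filter's implementation: the function name `kansuji_to_int` is hypothetical, and the sketch only covers the digits 〇 to 九 and the units 十, 百, 千, 万 and 億. The actual filter may handle more cases and may behave differently on unusual input.

```python
DIGITS = {ch: value for value, ch in enumerate("〇一二三四五六七八九")}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8}


def kansuji_to_int(text: str) -> int:
    """Convert a simple kanji numeral to an int (illustrative subset only)."""
    total = 0    # sum of completed 万/億 groups
    group = 0    # value accumulated inside the current group
    digits = ""  # pending run of positional digits, e.g. 一〇〇〇
    for ch in text:
        if ch in DIGITS:
            digits += str(DIGITS[ch])
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 counts as one times the unit.
            group += (int(digits) if digits else 1) * SMALL_UNITS[ch]
            digits = ""
        elif ch in LARGE_UNITS:
            group += int(digits) if digits else 0
            total += (group or 1) * LARGE_UNITS[ch]
            group, digits = 0, ""
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + group + (int(digits) if digits else 0)


print(kansuji_to_int("一〇〇〇"))  # positional digits
print(kansuji_to_int("三千二百"))  # digits combined with units
```

Positional input such as `一〇〇〇` is read digit by digit, while input with units such as `三千二百` multiplies each digit by the unit that follows it.
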