627 lines
15 KiB
Plaintext
627 lines
15 KiB
Plaintext
[[analysis-kuromoji]]
|
||
=== Japanese (kuromoji) Analysis Plugin
|
||
|
||
The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis
|
||
module into {es}.
|
||
|
||
:plugin_name: analysis-kuromoji
|
||
include::install_remove.asciidoc[]
|
||
|
||
[[analysis-kuromoji-analyzer]]
|
||
==== `kuromoji` analyzer
|
||
|
||
The `kuromoji` analyzer consists of the following tokenizer and token filters:
|
||
|
||
* <<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>
|
||
* <<analysis-kuromoji-baseform,`kuromoji_baseform`>> token filter
|
||
* <<analysis-kuromoji-speech,`kuromoji_part_of_speech`>> token filter
|
||
* {ref}/analysis-cjk-width-tokenfilter.html[`cjk_width`] token filter
|
||
* <<analysis-kuromoji-stop,`ja_stop`>> token filter
|
||
* <<analysis-kuromoji-stemmer,`kuromoji_stemmer`>> token filter
|
||
* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter
|
||
|
||
It supports the `mode` and `user_dictionary` settings from
|
||
<<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>.
|
||
|
||
[discrete]
|
||
[[kuromoji-analyzer-normalize-full-width-characters]]
|
||
==== Normalize full-width characters
|
||
|
||
The `kuromoji_tokenizer` tokenizer uses characters from the MeCab-IPADIC
|
||
dictionary to split text into tokens. The dictionary includes some full-width
|
||
characters, such as `o` and `f`. If a text contains full-width characters,
|
||
the tokenizer can produce unexpected tokens.
|
||
|
||
For example, the `kuromoji_tokenizer` tokenizer converts the text
|
||
`Culture of Japan` to the tokens `[ culture, o, f, japan ]`
|
||
instead of `[ culture, of, japan ]`.
|
||
|
||
To avoid this, add the <<analysis-icu-normalization-charfilter,`icu_normalizer`
|
||
character filter>> to a custom analyzer based on the `kuromoji` analyzer. The
|
||
`icu_normalizer` character filter converts full-width characters to their normal
|
||
equivalents.
|
||
|
||
First, duplicate the `kuromoji` analyzer to create the basis for a custom
|
||
analyzer. Then add the `icu_normalizer` character filter to the custom analyzer.
|
||
For example:
|
||
|
||
[source,console]
|
||
----
|
||
PUT index-00001
|
||
{
|
||
"settings": {
|
||
"index": {
|
||
"analysis": {
|
||
"analyzer": {
|
||
"kuromoji_normalize": { <1>
|
||
"char_filter": [
|
||
"icu_normalizer" <2>
|
||
],
|
||
"tokenizer": "kuromoji_tokenizer",
|
||
"filter": [
|
||
"kuromoji_baseform",
|
||
"kuromoji_part_of_speech",
|
||
"cjk_width",
|
||
"ja_stop",
|
||
"kuromoji_stemmer",
|
||
"lowercase"
|
||
]
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
----
|
||
<1> Creates a new custom analyzer, `kuromoji_normalize`, based on the `kuromoji`
|
||
analyzer.
|
||
<2> Adds the `icu_normalizer` character filter to the analyzer.
|
||
|
||
|
||
[[analysis-kuromoji-charfilter]]
|
||
==== `kuromoji_iteration_mark` character filter
|
||
|
||
The `kuromoji_iteration_mark` normalizes Japanese horizontal iteration marks
|
||
(_odoriji_) to their expanded form. It accepts the following settings:
|
||
|
||
`normalize_kanji`::
|
||
|
||
Indicates whether kanji iteration marks should be normalize. Defaults to `true`.
|
||
|
||
`normalize_kana`::
|
||
|
||
Indicates whether kana iteration marks should be normalized. Defaults to `true`
|
||
|
||
|
||
[[analysis-kuromoji-tokenizer]]
|
||
==== `kuromoji_tokenizer`
|
||
|
||
The `kuromoji_tokenizer` accepts the following settings:
|
||
|
||
`mode`::
|
||
+
|
||
--
|
||
|
||
The tokenization mode determines how the tokenizer handles compound and
|
||
unknown words. It can be set to:
|
||
|
||
`normal`::
|
||
|
||
Normal segmentation, no decomposition for compounds. Example output:
|
||
|
||
関西国際空港
|
||
アブラカダブラ
|
||
|
||
`search`::
|
||
|
||
Segmentation geared towards search. This includes a decompounding process
|
||
for long nouns, also including the full compound token as a synonym.
|
||
Example output:
|
||
|
||
関西, 関西国際空港, 国際, 空港
|
||
アブラカダブラ
|
||
|
||
`extended`::
|
||
|
||
Extended mode outputs unigrams for unknown words. Example output:
|
||
|
||
関西, 関西国際空港, 国際, 空港
|
||
ア, ブ, ラ, カ, ダ, ブ, ラ
|
||
--
|
||
|
||
`discard_punctuation`::
|
||
|
||
Whether punctuation should be discarded from the output. Defaults to `true`.
|
||
|
||
`user_dictionary`::
|
||
+
|
||
--
|
||
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A `user_dictionary`
|
||
may be appended to the default dictionary. The dictionary should have the following CSV format:
|
||
|
||
[source,csv]
|
||
-----------------------
|
||
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
|
||
-----------------------
|
||
--
|
||
|
||
As a demonstration of how the user dictionary can be used, save the following
|
||
dictionary to `$ES_HOME/config/userdict_ja.txt`:
|
||
|
||
[source,csv]
|
||
-----------------------
|
||
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
|
||
-----------------------
|
||
|
||
--
|
||
|
||
You can also inline the rules directly in the tokenizer definition using
|
||
the `user_dictionary_rules` option:
|
||
|
||
[source,console]
|
||
--------------------------------------------------
|
||
PUT nori_sample
|
||
{
|
||
"settings": {
|
||
"index": {
|
||
"analysis": {
|
||
"tokenizer": {
|
||
"kuromoji_user_dict": {
|
||
"type": "kuromoji_tokenizer",
|
||
"mode": "extended",
|
||
"user_dictionary_rules": ["東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"]
|
||
}
|
||
},
|
||
"analyzer": {
|
||
"my_analyzer": {
|
||
"type": "custom",
|
||
"tokenizer": "kuromoji_user_dict"
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
--------------------------------------------------
|
||
--
|
||
|
||
`nbest_cost`/`nbest_examples`::
|
||
+
|
||
--
|
||
Additional expert user parameters `nbest_cost` and `nbest_examples` can be used
|
||
to include additional tokens that most likely according to the statistical model.
|
||
If both parameters are used, the largest number of both is applied.
|
||
|
||
`nbest_cost`::
|
||
|
||
The `nbest_cost` parameter specifies an additional Viterbi cost.
|
||
The KuromojiTokenizer will include all tokens in Viterbi paths that are
|
||
within the nbest_cost value of the best path.
|
||
|
||
`nbest_examples`::
|
||
|
||
The `nbest_examples` can be used to find a `nbest_cost` value based on examples.
|
||
For example, a value of /箱根山-箱根/成田空港-成田/ indicates that in the texts,
|
||
箱根山 (Mt. Hakone) and 成田空港 (Narita Airport) we'd like a cost that gives is us
|
||
箱根 (Hakone) and 成田 (Narita).
|
||
--
|
||
|
||
|
||
Then create an analyzer as follows:
|
||
|
||
[source,console]
|
||
--------------------------------------------------
|
||
PUT kuromoji_sample
|
||
{
|
||
"settings": {
|
||
"index": {
|
||
"analysis": {
|
||
"tokenizer": {
|
||
"kuromoji_user_dict": {
|
||
"type": "kuromoji_tokenizer",
|
||
"mode": "extended",
|
||
"discard_punctuation": "false",
|
||
"user_dictionary": "userdict_ja.txt"
|
||
}
|
||
},
|
||
"analyzer": {
|
||
"my_analyzer": {
|
||
"type": "custom",
|
||
"tokenizer": "kuromoji_user_dict"
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
|
||
GET kuromoji_sample/_analyze
|
||
{
|
||
"analyzer": "my_analyzer",
|
||
"text": "東京スカイツリー"
|
||
}
|
||
--------------------------------------------------
|
||
|
||
The above `analyze` request returns the following:
|
||
|
||
[source,console-result]
|
||
--------------------------------------------------
|
||
{
|
||
"tokens" : [ {
|
||
"token" : "東京",
|
||
"start_offset" : 0,
|
||
"end_offset" : 2,
|
||
"type" : "word",
|
||
"position" : 0
|
||
}, {
|
||
"token" : "スカイツリー",
|
||
"start_offset" : 2,
|
||
"end_offset" : 8,
|
||
"type" : "word",
|
||
"position" : 1
|
||
} ]
|
||
}
|
||
--------------------------------------------------
|
||
|
||
`discard_compound_token`::
|
||
Whether original compound tokens should be discarded from the output with `search` mode. Defaults to `false`.
|
||
Example output with `search` or `extended` mode and this option `true`:
|
||
|
||
関西, 国際, 空港
|
||
|
||
NOTE: If a text contains full-width characters, the `kuromoji_tokenizer`
|
||
tokenizer can produce unexpected tokens. To avoid this, add the
|
||
<<analysis-icu-normalization-charfilter,`icu_normalizer` character filter>> to
|
||
your analyzer. See <<kuromoji-analyzer-normalize-full-width-characters>>.
|
||
|
||
|
||
[[analysis-kuromoji-baseform]]
|
||
==== `kuromoji_baseform` token filter
|
||
|
||
The `kuromoji_baseform` token filter replaces terms with their
|
||
BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives. Example:
|
||
|
||
[source,console]
|
||
--------------------------------------------------
|
||
PUT kuromoji_sample
|
||
{
|
||
"settings": {
|
||
"index": {
|
||
"analysis": {
|
||
"analyzer": {
|
||
"my_analyzer": {
|
||
"tokenizer": "kuromoji_tokenizer",
|
||
"filter": [
|
||
"kuromoji_baseform"
|
||
]
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
|
||
GET kuromoji_sample/_analyze
|
||
{
|
||
"analyzer": "my_analyzer",
|
||
"text": "飲み"
|
||
}
|
||
--------------------------------------------------
|
||
|
||
which responds with:
|
||
|
||
[source,console-result]
|
||
--------------------------------------------------
|
||
{
|
||
"tokens" : [ {
|
||
"token" : "飲む",
|
||
"start_offset" : 0,
|
||
"end_offset" : 2,
|
||
"type" : "word",
|
||
"position" : 0
|
||
} ]
|
||
}
|
||
--------------------------------------------------
|
||
|
||
|
||
[[analysis-kuromoji-speech]]
|
||
==== `kuromoji_part_of_speech` token filter
|
||
|
||
The `kuromoji_part_of_speech` token filter removes tokens that match a set of
|
||
part-of-speech tags. It accepts the following setting:
|
||
|
||
`stoptags`::
|
||
|
||
An array of part-of-speech tags that should be removed. It defaults to the
|
||
`stoptags.txt` file embedded in the `lucene-analyzer-kuromoji.jar`.
|
||
|
||
For example:
|
||
|
||
[source,console]
|
||
--------------------------------------------------
|
||
PUT kuromoji_sample
|
||
{
|
||
"settings": {
|
||
"index": {
|
||
"analysis": {
|
||
"analyzer": {
|
||
"my_analyzer": {
|
||
"tokenizer": "kuromoji_tokenizer",
|
||
"filter": [
|
||
"my_posfilter"
|
||
]
|
||
}
|
||
},
|
||
"filter": {
|
||
"my_posfilter": {
|
||
"type": "kuromoji_part_of_speech",
|
||
"stoptags": [
|
||
"助詞-格助詞-一般",
|
||
"助詞-終助詞"
|
||
]
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
|
||
GET kuromoji_sample/_analyze
|
||
{
|
||
"analyzer": "my_analyzer",
|
||
"text": "寿司がおいしいね"
|
||
}
|
||
--------------------------------------------------
|
||
|
||
Which responds with:
|
||
|
||
[source,console-result]
|
||
--------------------------------------------------
|
||
{
|
||
"tokens" : [ {
|
||
"token" : "寿司",
|
||
"start_offset" : 0,
|
||
"end_offset" : 2,
|
||
"type" : "word",
|
||
"position" : 0
|
||
}, {
|
||
"token" : "おいしい",
|
||
"start_offset" : 3,
|
||
"end_offset" : 7,
|
||
"type" : "word",
|
||
"position" : 2
|
||
} ]
|
||
}
|
||
--------------------------------------------------
|
||
|
||
|
||
[[analysis-kuromoji-readingform]]
|
||
==== `kuromoji_readingform` token filter
|
||
|
||
The `kuromoji_readingform` token filter replaces the token with its reading
|
||
form in either katakana or romaji. It accepts the following setting:
|
||
|
||
`use_romaji`::
|
||
|
||
Whether romaji reading form should be output instead of katakana. Defaults to `false`.
|
||
|
||
When using the pre-defined `kuromoji_readingform` filter, `use_romaji` is set
|
||
to `true`. The default when defining a custom `kuromoji_readingform`, however,
|
||
is `false`. The only reason to use the custom form is if you need the
|
||
katakana reading form:
|
||
|
||
[source,console]
|
||
--------------------------------------------------
|
||
PUT kuromoji_sample
|
||
{
|
||
"settings": {
|
||
"index": {
|
||
"analysis": {
|
||
"analyzer": {
|
||
"romaji_analyzer": {
|
||
"tokenizer": "kuromoji_tokenizer",
|
||
"filter": [ "romaji_readingform" ]
|
||
},
|
||
"katakana_analyzer": {
|
||
"tokenizer": "kuromoji_tokenizer",
|
||
"filter": [ "katakana_readingform" ]
|
||
}
|
||
},
|
||
"filter": {
|
||
"romaji_readingform": {
|
||
"type": "kuromoji_readingform",
|
||
"use_romaji": true
|
||
},
|
||
"katakana_readingform": {
|
||
"type": "kuromoji_readingform",
|
||
"use_romaji": false
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
|
||
GET kuromoji_sample/_analyze
|
||
{
|
||
"analyzer": "katakana_analyzer",
|
||
"text": "寿司" <1>
|
||
}
|
||
|
||
GET kuromoji_sample/_analyze
|
||
{
|
||
"analyzer": "romaji_analyzer",
|
||
"text": "寿司" <2>
|
||
}
|
||
--------------------------------------------------
|
||
|
||
<1> Returns `スシ`.
|
||
<2> Returns `sushi`.
|
||
|
||
[[analysis-kuromoji-stemmer]]
|
||
==== `kuromoji_stemmer` token filter
|
||
|
||
The `kuromoji_stemmer` token filter normalizes common katakana spelling
|
||
variations ending in a long sound character by removing this character
|
||
(U+30FC). Only full-width katakana characters are supported.
|
||
|
||
This token filter accepts the following setting:
|
||
|
||
`minimum_length`::
|
||
|
||
Katakana words shorter than the `minimum length` are not stemmed (default
|
||
is `4`).
|
||
|
||
|
||
[source,console]
|
||
--------------------------------------------------
|
||
PUT kuromoji_sample
|
||
{
|
||
"settings": {
|
||
"index": {
|
||
"analysis": {
|
||
"analyzer": {
|
||
"my_analyzer": {
|
||
"tokenizer": "kuromoji_tokenizer",
|
||
"filter": [
|
||
"my_katakana_stemmer"
|
||
]
|
||
}
|
||
},
|
||
"filter": {
|
||
"my_katakana_stemmer": {
|
||
"type": "kuromoji_stemmer",
|
||
"minimum_length": 4
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
|
||
GET kuromoji_sample/_analyze
|
||
{
|
||
"analyzer": "my_analyzer",
|
||
"text": "コピー" <1>
|
||
}
|
||
|
||
GET kuromoji_sample/_analyze
|
||
{
|
||
"analyzer": "my_analyzer",
|
||
"text": "サーバー" <2>
|
||
}
|
||
--------------------------------------------------
|
||
|
||
<1> Returns `コピー`.
|
||
<2> Return `サーバ`.
|
||
|
||
|
||
[[analysis-kuromoji-stop]]
|
||
==== `ja_stop` token filter
|
||
|
||
The `ja_stop` token filter filters out Japanese stopwords (`_japanese_`), and
|
||
any other custom stopwords specified by the user. This filter only supports
|
||
the predefined `_japanese_` stopwords list. If you want to use a different
|
||
predefined list, then use the
|
||
{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.
|
||
|
||
[source,console]
|
||
--------------------------------------------------
|
||
PUT kuromoji_sample
|
||
{
|
||
"settings": {
|
||
"index": {
|
||
"analysis": {
|
||
"analyzer": {
|
||
"analyzer_with_ja_stop": {
|
||
"tokenizer": "kuromoji_tokenizer",
|
||
"filter": [
|
||
"ja_stop"
|
||
]
|
||
}
|
||
},
|
||
"filter": {
|
||
"ja_stop": {
|
||
"type": "ja_stop",
|
||
"stopwords": [
|
||
"_japanese_",
|
||
"ストップ"
|
||
]
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
|
||
GET kuromoji_sample/_analyze
|
||
{
|
||
"analyzer": "analyzer_with_ja_stop",
|
||
"text": "ストップは消える"
|
||
}
|
||
--------------------------------------------------
|
||
|
||
The above request returns:
|
||
|
||
[source,console-result]
|
||
--------------------------------------------------
|
||
{
|
||
"tokens" : [ {
|
||
"token" : "消える",
|
||
"start_offset" : 5,
|
||
"end_offset" : 8,
|
||
"type" : "word",
|
||
"position" : 2
|
||
} ]
|
||
}
|
||
--------------------------------------------------
|
||
|
||
|
||
[[analysis-kuromoji-number]]
|
||
==== `kuromoji_number` token filter
|
||
|
||
The `kuromoji_number` token filter normalizes Japanese numbers (kansūji)
|
||
to regular Arabic decimal numbers in half-width characters. For example:
|
||
|
||
[source,console]
|
||
--------------------------------------------------
|
||
PUT kuromoji_sample
|
||
{
|
||
"settings": {
|
||
"index": {
|
||
"analysis": {
|
||
"analyzer": {
|
||
"my_analyzer": {
|
||
"tokenizer": "kuromoji_tokenizer",
|
||
"filter": [
|
||
"kuromoji_number"
|
||
]
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
}
|
||
|
||
GET kuromoji_sample/_analyze
|
||
{
|
||
"analyzer": "my_analyzer",
|
||
"text": "一〇〇〇"
|
||
}
|
||
--------------------------------------------------
|
||
|
||
Which results in:
|
||
|
||
[source,console-result]
|
||
--------------------------------------------------
|
||
{
|
||
"tokens" : [ {
|
||
"token" : "1000",
|
||
"start_offset" : 0,
|
||
"end_offset" : 4,
|
||
"type" : "word",
|
||
"position" : 0
|
||
} ]
|
||
}
|
||
--------------------------------------------------
|