[[analysis-kuromoji]]
=== Japanese (kuromoji) Analysis Plugin

The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis
module into {es}.

:plugin_name: analysis-kuromoji
include::install_remove.asciidoc[]

[[analysis-kuromoji-analyzer]]
==== `kuromoji` analyzer

The `kuromoji` analyzer consists of the following tokenizer and token filters:

* <<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>
* <<analysis-kuromoji-baseform,`kuromoji_baseform`>> token filter
* <<analysis-kuromoji-speech,`kuromoji_part_of_speech`>> token filter
* {ref}/analysis-cjk-width-tokenfilter.html[`cjk_width`] token filter
* <<analysis-kuromoji-stop,`ja_stop`>> token filter
* <<analysis-kuromoji-stemmer,`kuromoji_stemmer`>> token filter
* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter

It supports the `mode` and `user_dictionary` settings from
<<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>.
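
As a minimal sketch of these settings, the following request defines an
analyzer of type `kuromoji` and sets `mode` explicitly. The index name
`kuromoji_mode_sample` and the analyzer name `my_kuromoji` are placeholders
chosen for this example.

[source,console]
----
PUT kuromoji_mode_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_kuromoji": {
            "type": "kuromoji",
            "mode": "normal"
          }
        }
      }
    }
  }
}
----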

[discrete]
[[kuromoji-analyzer-normalize-full-width-characters]]
==== Normalize full-width characters

The `kuromoji_tokenizer` tokenizer uses characters from the MeCab-IPADIC
dictionary to split text into tokens. The dictionary includes some full-width
characters, such as `ｏ` and `ｆ`. If a text contains full-width characters,
the tokenizer can produce unexpected tokens.

For example, the `kuromoji_tokenizer` tokenizer converts the text
`Ｃｕｌｔｕｒｅ　ｏｆ　Ｊａｐａｎ` to the tokens `[ culture, o, f, japan ]`
instead of `[ culture, of, japan ]`.

To avoid this, add the <<analysis-icu-normalization-charfilter,`icu_normalizer`
character filter>> to a custom analyzer based on the `kuromoji` analyzer. The
`icu_normalizer` character filter converts full-width characters to their normal
equivalents.

First, duplicate the `kuromoji` analyzer to create the basis for a custom
analyzer. Then add the `icu_normalizer` character filter to the custom analyzer.
For example:

[source,console]
----
PUT index-00001
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "kuromoji_normalize": { <1>
            "char_filter": [
              "icu_normalizer" <2>
            ],
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform",
              "kuromoji_part_of_speech",
              "cjk_width",
              "ja_stop",
              "kuromoji_stemmer",
              "lowercase"
            ]
          }
        }
      }
    }
  }
}
----
<1> Creates a new custom analyzer, `kuromoji_normalize`, based on the `kuromoji`
analyzer.
<2> Adds the `icu_normalizer` character filter to the analyzer.
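
To check the result, a request along the following lines could be run against
the index created above. This is a sketch that assumes the `analysis-icu`
plugin is also installed, since that plugin provides the `icu_normalizer`
character filter. The sample text is written in full-width characters.

[source,console]
----
GET index-00001/_analyze
{
  "analyzer": "kuromoji_normalize",
  "text": "Ｃｕｌｔｕｒｅ　ｏｆ　Ｊａｐａｎ"
}
----

With the character filter in place, the full-width `ｏｆ` is expected to come
back as the single token `of` rather than as the separate tokens `o` and `f`.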


[[analysis-kuromoji-charfilter]]
==== `kuromoji_iteration_mark` character filter

The `kuromoji_iteration_mark` character filter normalizes Japanese horizontal
iteration marks (_odoriji_) to their expanded form. It accepts the following settings:

`normalize_kanji`::

    Indicates whether kanji iteration marks should be normalized. Defaults to `true`.

`normalize_kana`::

    Indicates whether kana iteration marks should be normalized. Defaults to `true`.
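
As an illustrative sketch, the request below defines a custom character filter
of type `kuromoji_iteration_mark` that expands kanji iteration marks only. The
index name `kuromoji_iteration_sample`, the filter name `kanji_iteration_mark`,
and the analyzer name `my_analyzer` are placeholders chosen for this example.

[source,console]
----
PUT kuromoji_iteration_sample
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "kanji_iteration_mark": {
            "type": "kuromoji_iteration_mark",
            "normalize_kanji": true,
            "normalize_kana": false
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "char_filter": [
              "kanji_iteration_mark"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_iteration_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "時々"
}
----

With these settings, the kanji iteration mark `々` in `時々` should be expanded
before tokenization, so the tokenizer receives `時時`.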


[[analysis-kuromoji-tokenizer]]
==== `kuromoji_tokenizer`

The `kuromoji_tokenizer` accepts the following settings:

`mode`::
+
--

The tokenization mode determines how the tokenizer handles compound and
unknown words. It can be set to:

`normal`::

    Normal segmentation, no decomposition for compounds. Example output:

    関西国際空港
    アブラカダブラ

`search`::

    Segmentation geared towards search. Long nouns are decompounded, and the
    full compound token is also included as a synonym.
    Example output:

    関西, 関西国際空港, 国際, 空港
    アブラカダブラ

`extended`::

    Extended mode outputs unigrams for unknown words. Example output:

    関西, 関西国際空港, 国際, 空港
    ア, ブ, ラ, カ, ダ, ブ, ラ
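
To compare the modes on your own text, one option is to pass an inline
tokenizer definition to the `_analyze` API. The request below is a sketch
rather than one of the plugin's reference examples. Change `mode` to `normal`
or `extended` to see the other segmentations.

[source,console]
----
GET _analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search"
  },
  "text": "関西国際空港"
}
----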
--

`discard_punctuation`::

    Whether punctuation should be discarded from the output. Defaults to `true`.

`user_dictionary`::
+
--
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A `user_dictionary`
may be appended to the default dictionary. The dictionary should have the following CSV format:

[source,csv]
-----------------------
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
-----------------------
--

As a demonstration of how the user dictionary can be used, save the following
dictionary to `$ES_HOME/config/userdict_ja.txt`:

[source,csv]
-----------------------
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
-----------------------

--

You can also inline the rules directly in the tokenizer definition using
the `user_dictionary_rules` option:

[source,console]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "user_dictionary_rules": ["東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
--------------------------------------------------
--

`nbest_cost`/`nbest_examples`::
+
--
Additional expert user parameters `nbest_cost` and `nbest_examples` can be used
to include additional tokens that are most likely according to the statistical model.
If both parameters are used, the larger of the two resulting costs is applied.

`nbest_cost`::

    The `nbest_cost` parameter specifies an additional Viterbi cost.
    The KuromojiTokenizer will include all tokens in Viterbi paths that are
    within the `nbest_cost` value of the best path.

`nbest_examples`::

    The `nbest_examples` parameter can be used to find an `nbest_cost` value based on examples.
    For example, a value of `/箱根山-箱根/成田空港-成田/` indicates that, for the texts
    箱根山 (Mt. Hakone) and 成田空港 (Narita Airport), we'd like a cost that gives us
    箱根 (Hakone) and 成田 (Narita).
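
A tokenizer definition that uses `nbest_examples` might look like the following
sketch. The index name `kuromoji_nbest_sample`, the tokenizer name
`kuromoji_nbest`, and the analyzer name `my_analyzer` are placeholders. The
example string is the one described above.

[source,console]
----
PUT kuromoji_nbest_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_nbest": {
            "type": "kuromoji_tokenizer",
            "nbest_examples": "/箱根山-箱根/成田空港-成田/"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_nbest"
          }
        }
      }
    }
  }
}
----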
--


To use the `userdict_ja.txt` user dictionary saved earlier, create an analyzer as follows:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "東京スカイツリー"
}
--------------------------------------------------

The above `analyze` request returns the following:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 1
  } ]
}
--------------------------------------------------

`discard_compound_token`::
    Whether original compound tokens should be discarded from the output with `search` mode. Defaults to `false`.
    Example output with `search` or `extended` mode and this option `true`:

    関西, 国際, 空港
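
The following sketch enables the option on an inline tokenizer definition
passed to the `_analyze` API. For the text `関西国際空港`, the segmentation
listed above is the expected result.

[source,console]
----
GET _analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search",
    "discard_compound_token": true
  },
  "text": "関西国際空港"
}
----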

NOTE: If a text contains full-width characters, the `kuromoji_tokenizer`
tokenizer can produce unexpected tokens. To avoid this, add the
<<analysis-icu-normalization-charfilter,`icu_normalizer` character filter>> to
your analyzer. See <<kuromoji-analyzer-normalize-full-width-characters>>.


[[analysis-kuromoji-baseform]]
==== `kuromoji_baseform` token filter

The `kuromoji_baseform` token filter replaces terms with their
`BaseFormAttribute`. This acts as a lemmatizer for verbs and adjectives. Example:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "飲み"
}
--------------------------------------------------

Which responds with:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  } ]
}
--------------------------------------------------


[[analysis-kuromoji-speech]]
==== `kuromoji_part_of_speech` token filter

The `kuromoji_part_of_speech` token filter removes tokens that match a set of
part-of-speech tags. It accepts the following setting:

`stoptags`::

    An array of part-of-speech tags that should be removed. It defaults to the
    `stoptags.txt` file embedded in the `lucene-analyzer-kuromoji.jar`.

For example:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "寿司がおいしいね"
}
--------------------------------------------------

Which responds with:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------


[[analysis-kuromoji-readingform]]
==== `kuromoji_readingform` token filter

The `kuromoji_readingform` token filter replaces the token with its reading
form in either katakana or romaji. It accepts the following setting:

`use_romaji`::

    Whether romaji reading form should be output instead of katakana. Defaults to `false`.

When using the pre-defined `kuromoji_readingform` filter, `use_romaji` is set
to `true`. The default when defining a custom `kuromoji_readingform`, however,
is `false`. The only reason to use the custom form is if you need the
katakana reading form:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "romaji_readingform" ]
          },
          "katakana_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [ "katakana_readingform" ]
          }
        },
        "filter": {
          "romaji_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": true
          },
          "katakana_readingform": {
            "type": "kuromoji_readingform",
            "use_romaji": false
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "katakana_analyzer",
  "text": "寿司" <1>
}

GET kuromoji_sample/_analyze
{
  "analyzer": "romaji_analyzer",
  "text": "寿司" <2>
}
--------------------------------------------------

<1> Returns `スシ`.
<2> Returns `sushi`.

[[analysis-kuromoji-stemmer]]
==== `kuromoji_stemmer` token filter

The `kuromoji_stemmer` token filter normalizes common katakana spelling
variations ending in a long sound character by removing this character
(U+30FC). Only full-width katakana characters are supported.

This token filter accepts the following setting:

`minimum_length`::

    Katakana words shorter than the `minimum_length` are not stemmed (default
    is `4`).


[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_katakana_stemmer"
            ]
          }
        },
        "filter": {
          "my_katakana_stemmer": {
            "type": "kuromoji_stemmer",
            "minimum_length": 4
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "コピー" <1>
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "サーバー" <2>
}
--------------------------------------------------

<1> Returns `コピー`.
<2> Returns `サーバ`.


[[analysis-kuromoji-stop]]
==== `ja_stop` token filter

The `ja_stop` token filter filters out Japanese stopwords (`_japanese_`), and
any other custom stopwords specified by the user. This filter only supports
the predefined `_japanese_` stopwords list. If you want to use a different
predefined list, then use the
{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_ja_stop": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "ja_stop"
            ]
          }
        },
        "filter": {
          "ja_stop": {
            "type": "ja_stop",
            "stopwords": [
              "_japanese_",
              "ストップ"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "analyzer_with_ja_stop",
  "text": "ストップは消える"
}
--------------------------------------------------

The above request returns:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------


[[analysis-kuromoji-number]]
==== `kuromoji_number` token filter

The `kuromoji_number` token filter normalizes Japanese numbers (kansūji)
to regular Arabic decimal numbers in half-width characters. For example:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_number"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "一〇〇〇"
}
--------------------------------------------------

Which results in:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "1000",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 0
  } ]
}
--------------------------------------------------