OpenSearch/docs/plugins/analysis-kuromoji.asciidoc

[[analysis-kuromoji]]
=== Japanese (kuromoji) Analysis Plugin

The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis
module into elasticsearch.

[[analysis-kuromoji-install]]
[float]
==== Installation

This plugin can be installed using the plugin manager:

[source,sh]
----------------------------------------------------------------
sudo bin/elasticsearch-plugin install analysis-kuromoji
----------------------------------------------------------------

The plugin must be installed on every node in the cluster, and each node must
be restarted after installation.

[[analysis-kuromoji-remove]]
[float]
==== Removal

The plugin can be removed with the following command:

[source,sh]
----------------------------------------------------------------
sudo bin/elasticsearch-plugin remove analysis-kuromoji
----------------------------------------------------------------

The node must be stopped before removing the plugin.

[[analysis-kuromoji-analyzer]]
==== `kuromoji` analyzer

The `kuromoji` analyzer consists of the following tokenizer and token filters:

* <<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>
* <<analysis-kuromoji-baseform,`kuromoji_baseform`>> token filter
* <<analysis-kuromoji-speech,`kuromoji_part_of_speech`>> token filter
* {ref}/analysis-cjk-width-tokenfilter.html[`cjk_width`] token filter
* <<analysis-kuromoji-stop,`ja_stop`>> token filter
* <<analysis-kuromoji-stemmer,`kuromoji_stemmer`>> token filter
* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter

It supports the `mode` and `user_dictionary` settings from
<<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>.

[[analysis-kuromoji-charfilter]]
==== `kuromoji_iteration_mark` character filter

The `kuromoji_iteration_mark` normalizes Japanese horizontal iteration marks
(_odoriji_) to their expanded form. It accepts the following settings:

`normalize_kanji`::

    Indicates whether kanji iteration marks should be normalize. Defaults to `true`.

`normalize_kana`::

    Indicates whether kana iteration marks should be normalized. Defaults to `true`


[[analysis-kuromoji-tokenizer]]
==== `kuromoji_tokenizer`

The `kuromoji_tokenizer` accepts the following settings:

`mode`::
+
--

The tokenization mode determines how the tokenizer handles compound and
unknown words.  It can be set to:

`normal`::

    Normal segmentation, no decomposition for compounds. Example output:

    関西国際空港
    アブラカダブラ

`search`::

    Segmentation geared towards search. This includes a decompounding process
    for long nouns, also including the full compound token as a synonym.
    Example output:

    関西, 関西国際空港, 国際, 空港
    アブラカダブラ

`extended`::

    Extended mode outputs unigrams for unknown words. Example output:

    関西, 国際, 空港
    ア, ブ, ラ, カ, ダ, ブ, ラ
--

`discard_punctuation`::

    Whether punctuation should be discarded from the output. Defaults to `true`.

`user_dictionary`::
+
--
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A `user_dictionary`
may be appended to the default dictionary. The dictionary should have the following CSV format:

[source,csv]
-----------------------
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
-----------------------
--

As a demonstration of how the user dictionary can be used, save the following
dictionary to `$ES_HOME/config/userdict_ja.txt`:

[source,csv]
-----------------------
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
-----------------------

`nbest_cost`/`nbest_examples`::
+
--
Additional expert user parameters `nbest_cost` and `nbest_examples` can be used
to include additional tokens that most likely according to the statistical model.
If both parameters are used, the largest number of both is applied.

`nbest_cost`::

    The `nbest_cost` parameter specifies an additional Viterbi cost.
    The KuromojiTokenizer will include all tokens in Viterbi paths that are
    within the nbest_cost value of the best path.

`nbest_examples`::

    The `nbest_examples` can be used to find a `nbest_cost` value based on examples.
    For example, a value of /箱根山-箱根/成田空港-成田/ indicates that in the texts,
    箱根山 (Mt. Hakone) and 成田空港 (Narita Airport) we'd like a cost that gives is us
    箱根 (Hakone) and 成田 (Narita).
--


Then create an analyzer as follows:

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=東京スカイツリー
--------------------------------------------------
// AUTOSENSE

The above `analyze` request returns the following:

[source,json]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------

[[analysis-kuromoji-baseform]]
==== `kuromoji_baseform` token filter

The `kuromoji_baseform` token filter replaces terms with their
BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform"
            ]
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=飲み
--------------------------------------------------
// AUTOSENSE

[source,text]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}
--------------------------------------------------

[[analysis-kuromoji-speech]]
==== `kuromoji_part_of_speech` token filter

The `kuromoji_part_of_speech` token filter removes tokens that match a set of
part-of-speech tags. It accepts the following setting:

`stoptags`::

    An array of part-of-speech tags that should be removed. It defaults to the
    `stoptags.txt` file embedded in the `lucene-analyzer-kuromoji.jar`.

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=寿司がおいしいね

--------------------------------------------------
// AUTOSENSE

[source,text]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 3
  } ]
}
--------------------------------------------------

[[analysis-kuromoji-readingform]]
==== `kuromoji_readingform` token filter

The `kuromoji_readingform` token filter replaces the token with its reading
form in either katakana or romaji. It accepts the following setting:

`use_romaji`::

    Whether romaji reading form should be output instead of katakana.  Defaults to `false`.

When using the pre-defined `kuromoji_readingform` filter, `use_romaji` is set
to `true`. The default when defining a custom `kuromoji_readingform`, however,
is `false`.  The only reason to use the custom form is if you need the
katakana reading form:

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "romaji_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["romaji_readingform"]
                    },
                    "katakana_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["katakana_readingform"]
                    }
                },
                "filter" : {
                    "romaji_readingform" : {
                        "type" : "kuromoji_readingform",
                        "use_romaji" : true
                    },
                    "katakana_readingform" : {
                        "type" : "kuromoji_readingform",
                        "use_romaji" : false
                    }
                }
            }
        }
    }
}

POST kuromoji_sample/_analyze?analyzer=katakana_analyzer&text=寿司 <1>

POST kuromoji_sample/_analyze?analyzer=romaji_analyzer&text=寿司 <2>

--------------------------------------------------
// AUTOSENSE

<1> Returns `スシ`.
<2> Returns `sushi`.

[[analysis-kuromoji-stemmer]]
==== `kuromoji_stemmer` token filter

The `kuromoji_stemmer` token filter normalizes common katakana spelling
variations ending in a long sound character by removing this character
(U+30FC). Only full-width katakana characters are supported.

This token filter accepts the following setting:

`minimum_length`::

    Katakana words shorter than the `minimum length` are not stemmed (default
    is `4`).


[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_katakana_stemmer"
            ]
          }
        },
        "filter": {
          "my_katakana_stemmer": {
            "type": "kuromoji_stemmer",
            "minimum_length": 4
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=コピー <1>

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=サーバー <2>

--------------------------------------------------
// AUTOSENSE

<1> Returns `コピー`.
<2> Return `サーバ`.


[[analysis-kuromoji-stop]]
===== `ja_stop` token filter

The `ja_stop` token filter filters out Japanese stopwords (`_japanese_`), and
any other custom stopwords specified by the user. This filter only supports
the predefined `_japanese_` stopwords list.  If you want to use a different
predefined list, then use the
{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_ja_stop": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "ja_stop"
            ]
          }
        },
        "filter": {
          "ja_stop": {
            "type": "ja_stop",
            "stopwords": [
              "_japanese_",
              "ストップ"
            ]
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=ストップは消える
--------------------------------------------------
// AUTOSENSE

The above request returns:

[source,text]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 3
  } ]
}
--------------------------------------------------

[[analysis-kuromoji-number]]
===== `kuromoji_number` token filter

The `kuromoji_number` token filter normalizes Japanese numbers (kansūji)
to regular Arabic decimal numbers in half-width characters.

[source,json]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_number"
            ]
          }
        }
      }
    }
  }
}

POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=一〇〇〇

--------------------------------------------------
// AUTOSENSE

[source,text]
--------------------------------------------------
# Result
{
  "tokens" : [ {
    "token" : "1000",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  } ]
}
--------------------------------------------------
Docs: Prepare plugin and integration docs for 2.0 * Centralised plugin docs in docs/plugins/ * Moved integrations into same docs * Moved community clients into the clients section of the docs * Removed docs/community Closes #11734 Closes #11724 Closes #11636 Closes #11635 Closes #11632 Closes #11630 Closes #12046 Closes #12438 Closes #12579 2015-08-15 12:00:55 -04:00			`[[analysis-kuromoji]]`
			`=== Japanese (kuromoji) Analysis Plugin`

			`The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis`
			`module into elasticsearch.`

			`[[analysis-kuromoji-install]]`
			`[float]`
			`==== Installation`

			`This plugin can be installed using the plugin manager:`

			`[source,sh]`
			`----------------------------------------------------------------`
Rename bin/plugin in bin/elasticsearch-plugin 2016-02-04 10:00:55 -05:00			`sudo bin/elasticsearch-plugin install analysis-kuromoji`
Docs: Prepare plugin and integration docs for 2.0 * Centralised plugin docs in docs/plugins/ * Moved integrations into same docs * Moved community clients into the clients section of the docs * Removed docs/community Closes #11734 Closes #11724 Closes #11636 Closes #11635 Closes #11632 Closes #11630 Closes #12046 Closes #12438 Closes #12579 2015-08-15 12:00:55 -04:00			`----------------------------------------------------------------`

			`The plugin must be installed on every node in the cluster, and each node must`
			`be restarted after installation.`

			`[[analysis-kuromoji-remove]]`
			`[float]`
			`==== Removal`

			`The plugin can be removed with the following command:`

			`[source,sh]`
			`----------------------------------------------------------------`
Rename bin/plugin in bin/elasticsearch-plugin 2016-02-04 10:00:55 -05:00			`sudo bin/elasticsearch-plugin remove analysis-kuromoji`
Docs: Prepare plugin and integration docs for 2.0 * Centralised plugin docs in docs/plugins/ * Moved integrations into same docs * Moved community clients into the clients section of the docs * Removed docs/community Closes #11734 Closes #11724 Closes #11636 Closes #11635 Closes #11632 Closes #11630 Closes #12046 Closes #12438 Closes #12579 2015-08-15 12:00:55 -04:00			`----------------------------------------------------------------`

			`The node must be stopped before removing the plugin.`

			`[[analysis-kuromoji-analyzer]]`
			==== `kuromoji` analyzer

			The `kuromoji` analyzer consists of the following tokenizer and token filters:

			* <<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>
			* <<analysis-kuromoji-baseform,`kuromoji_baseform`>> token filter
			* <<analysis-kuromoji-speech,`kuromoji_part_of_speech`>> token filter
			* {ref}/analysis-cjk-width-tokenfilter.html[`cjk_width`] token filter
			* <<analysis-kuromoji-stop,`ja_stop`>> token filter
			* <<analysis-kuromoji-stemmer,`kuromoji_stemmer`>> token filter
			* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter

			It supports the `mode` and `user_dictionary` settings from
			<<analysis-kuromoji-tokenizer,`kuromoji_tokenizer`>>.

			`[[analysis-kuromoji-charfilter]]`
			==== `kuromoji_iteration_mark` character filter

			The `kuromoji_iteration_mark` normalizes Japanese horizontal iteration marks
			`(_odoriji_) to their expanded form. It accepts the following settings:`

			`normalize_kanji`::

			Indicates whether kanji iteration marks should be normalize. Defaults to `true`.

			`normalize_kana`::

			Indicates whether kana iteration marks should be normalized. Defaults to `true`


			`[[analysis-kuromoji-tokenizer]]`
			==== `kuromoji_tokenizer`

			The `kuromoji_tokenizer` accepts the following settings:

			`mode`::
			`+`
			`--`

			`The tokenization mode determines how the tokenizer handles compound and`
			`unknown words. It can be set to:`

			`normal`::

			`Normal segmentation, no decomposition for compounds. Example output:`

			`関西国際空港`
			`アブラカダブラ`

			`search`::

			`Segmentation geared towards search. This includes a decompounding process`
			`for long nouns, also including the full compound token as a synonym.`
			`Example output:`

			`関西, 関西国際空港, 国際, 空港`
			`アブラカダブラ`

			`extended`::

			`Extended mode outputs unigrams for unknown words. Example output:`

			`関西, 国際, 空港`
			`ア, ブ, ラ, カ, ダ, ブ, ラ`
			`--`

			`discard_punctuation`::

			Whether punctuation should be discarded from the output. Defaults to `true`.

			`user_dictionary`::
			`+`
			`--`
			The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A `user_dictionary`
			`may be appended to the default dictionary. The dictionary should have the following CSV format:`

			`[source,csv]`
			`-----------------------`
			`<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>`
			`-----------------------`
			`--`

			`As a demonstration of how the user dictionary can be used, save the following`
			dictionary to `$ES_HOME/config/userdict_ja.txt`:

			`[source,csv]`
			`-----------------------`
			`東京スカイツリー,東京スカイツリー,トウキョウスカイツリー,カスタム名詞`
			`-----------------------`

Analysis Kuromoji: Add nbest option and NumberFilter Add nbest_cost and nbest_examples parameter to KuromojiTokenizerFactory Add KuromojiNumberFilterFactory 2016-03-16 04:43:21 -04:00			`nbest_cost`/`nbest_examples`::
			`+`
			`--`
			Additional expert user parameters `nbest_cost` and `nbest_examples` can be used
			`to include additional tokens that most likely according to the statistical model.`
			`If both parameters are used, the largest number of both is applied.`

			`nbest_cost`::

			The `nbest_cost` parameter specifies an additional Viterbi cost.
			`The KuromojiTokenizer will include all tokens in Viterbi paths that are`
			`within the nbest_cost value of the best path.`

			`nbest_examples`::

			The `nbest_examples` can be used to find a `nbest_cost` value based on examples.
			`For example, a value of /箱根山-箱根/成田空港-成田/ indicates that in the texts,`
			`箱根山 (Mt. Hakone) and 成田空港 (Narita Airport) we'd like a cost that gives is us`
			`箱根 (Hakone) and 成田 (Narita).`
			`--`


Docs: Prepare plugin and integration docs for 2.0 * Centralised plugin docs in docs/plugins/ * Moved integrations into same docs * Moved community clients into the clients section of the docs * Removed docs/community Closes #11734 Closes #11724 Closes #11636 Closes #11635 Closes #11632 Closes #11630 Closes #12046 Closes #12438 Closes #12579 2015-08-15 12:00:55 -04:00			`Then create an analyzer as follows:`

			`[source,json]`
			`--------------------------------------------------`
			`PUT kuromoji_sample`
			`{`
			`"settings": {`
			`"index": {`
			`"analysis": {`
			`"tokenizer": {`
			`"kuromoji_user_dict": {`
			`"type": "kuromoji_tokenizer",`
			`"mode": "extended",`
			`"discard_punctuation": "false",`
			`"user_dictionary": "userdict_ja.txt"`
			`}`
			`},`
			`"analyzer": {`
			`"my_analyzer": {`
			`"type": "custom",`
			`"tokenizer": "kuromoji_user_dict"`
			`}`
			`}`
			`}`
			`}`
			`}`
			`}`

			`POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=東京スカイツリー`
			`--------------------------------------------------`
			`// AUTOSENSE`

			The above `analyze` request returns the following:

			`[source,json]`
			`--------------------------------------------------`
			`# Result`
			`{`
			`"tokens" : [ {`
			`"token" : "東京",`
			`"start_offset" : 0,`
			`"end_offset" : 2,`
			`"type" : "word",`
			`"position" : 1`
			`}, {`
			`"token" : "スカイツリー",`
			`"start_offset" : 2,`
			`"end_offset" : 8,`
			`"type" : "word",`
			`"position" : 2`
			`} ]`
			`}`
			`--------------------------------------------------`

			`[[analysis-kuromoji-baseform]]`
			==== `kuromoji_baseform` token filter

			The `kuromoji_baseform` token filter replaces terms with their
			`BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.`

			`[source,json]`
			`--------------------------------------------------`
			`PUT kuromoji_sample`
			`{`
			`"settings": {`
			`"index": {`
			`"analysis": {`
			`"analyzer": {`
			`"my_analyzer": {`
			`"tokenizer": "kuromoji_tokenizer",`
			`"filter": [`
			`"kuromoji_baseform"`
			`]`
			`}`
			`}`
			`}`
			`}`
			`}`
			`}`

			`POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=飲み`
			`--------------------------------------------------`
			`// AUTOSENSE`

			`[source,text]`
			`--------------------------------------------------`
			`# Result`
			`{`
			`"tokens" : [ {`
			`"token" : "飲む",`
			`"start_offset" : 0,`
			`"end_offset" : 2,`
			`"type" : "word",`
			`"position" : 1`
			`} ]`
			`}`
			`--------------------------------------------------`

			`[[analysis-kuromoji-speech]]`
			==== `kuromoji_part_of_speech` token filter

			The `kuromoji_part_of_speech` token filter removes tokens that match a set of
			`part-of-speech tags. It accepts the following setting:`

			`stoptags`::

			`An array of part-of-speech tags that should be removed. It defaults to the`
			`stoptags.txt` file embedded in the `lucene-analyzer-kuromoji.jar`.

			`[source,json]`
			`--------------------------------------------------`
			`PUT kuromoji_sample`
			`{`
			`"settings": {`
			`"index": {`
			`"analysis": {`
			`"analyzer": {`
			`"my_analyzer": {`
			`"tokenizer": "kuromoji_tokenizer",`
			`"filter": [`
			`"my_posfilter"`
			`]`
			`}`
			`},`
			`"filter": {`
			`"my_posfilter": {`
			`"type": "kuromoji_part_of_speech",`
			`"stoptags": [`
			`"助詞-格助詞-一般",`
			`"助詞-終助詞"`
			`]`
			`}`
			`}`
			`}`
			`}`
			`}`
			`}`

			`POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=寿司がおいしいね`

			`--------------------------------------------------`
			`// AUTOSENSE`

			`[source,text]`
			`--------------------------------------------------`
			`# Result`
			`{`
			`"tokens" : [ {`
			`"token" : "寿司",`
			`"start_offset" : 0,`
			`"end_offset" : 2,`
			`"type" : "word",`
			`"position" : 1`
			`}, {`
			`"token" : "おいしい",`
			`"start_offset" : 3,`
			`"end_offset" : 7,`
			`"type" : "word",`
			`"position" : 3`
			`} ]`
			`}`
			`--------------------------------------------------`

			`[[analysis-kuromoji-readingform]]`
			==== `kuromoji_readingform` token filter

			The `kuromoji_readingform` token filter replaces the token with its reading
			`form in either katakana or romaji. It accepts the following setting:`

			`use_romaji`::

			Whether romaji reading form should be output instead of katakana. Defaults to `false`.

			When using the pre-defined `kuromoji_readingform` filter, `use_romaji` is set
			to `true`. The default when defining a custom `kuromoji_readingform`, however,
			is `false`. The only reason to use the custom form is if you need the
			`katakana reading form:`

			`[source,json]`
			`--------------------------------------------------`
			`PUT kuromoji_sample`
			`{`
			`"settings": {`
			`"index":{`
			`"analysis":{`
			`"analyzer" : {`
			`"romaji_analyzer" : {`
			`"tokenizer" : "kuromoji_tokenizer",`
			`"filter" : ["romaji_readingform"]`
			`},`
			`"katakana_analyzer" : {`
			`"tokenizer" : "kuromoji_tokenizer",`
			`"filter" : ["katakana_readingform"]`
			`}`
			`},`
			`"filter" : {`
			`"romaji_readingform" : {`
			`"type" : "kuromoji_readingform",`
			`"use_romaji" : true`
			`},`
			`"katakana_readingform" : {`
			`"type" : "kuromoji_readingform",`
			`"use_romaji" : false`
			`}`
			`}`
			`}`
			`}`
			`}`
			`}`

			`POST kuromoji_sample/_analyze?analyzer=katakana_analyzer&text=寿司 <1>`

			`POST kuromoji_sample/_analyze?analyzer=romaji_analyzer&text=寿司 <2>`

			`--------------------------------------------------`
			`// AUTOSENSE`

			<1> Returns `スシ`.
			<2> Returns `sushi`.

			`[[analysis-kuromoji-stemmer]]`
			==== `kuromoji_stemmer` token filter

			The `kuromoji_stemmer` token filter normalizes common katakana spelling
			`variations ending in a long sound character by removing this character`
			`(U+30FC). Only full-width katakana characters are supported.`

			`This token filter accepts the following setting:`

			`minimum_length`::

			Katakana words shorter than the `minimum length` are not stemmed (default
			is `4`).


			`[source,json]`
			`--------------------------------------------------`
			`PUT kuromoji_sample`
			`{`
			`"settings": {`
			`"index": {`
			`"analysis": {`
			`"analyzer": {`
			`"my_analyzer": {`
			`"tokenizer": "kuromoji_tokenizer",`
			`"filter": [`
			`"my_katakana_stemmer"`
			`]`
			`}`
			`},`
			`"filter": {`
			`"my_katakana_stemmer": {`
			`"type": "kuromoji_stemmer",`
			`"minimum_length": 4`
			`}`
			`}`
			`}`
			`}`
			`}`
			`}`

			`POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=コピー <1>`

			`POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=サーバー <2>`

			`--------------------------------------------------`
			`// AUTOSENSE`

			<1> Returns `コピー`.
			<2> Return `サーバ`.


			`[[analysis-kuromoji-stop]]`
			===== `ja_stop` token filter

			The `ja_stop` token filter filters out Japanese stopwords (`_japanese_`), and
			`any other custom stopwords specified by the user. This filter only supports`
			the predefined `_japanese_` stopwords list. If you want to use a different
			`predefined list, then use the`
			{ref}/analysis-stop-tokenfilter.html[`stop` token filter] instead.

			`[source,json]`
			`--------------------------------------------------`
			`PUT kuromoji_sample`
			`{`
			`"settings": {`
			`"index": {`
			`"analysis": {`
			`"analyzer": {`
			`"analyzer_with_ja_stop": {`
			`"tokenizer": "kuromoji_tokenizer",`
			`"filter": [`
			`"ja_stop"`
			`]`
			`}`
			`},`
			`"filter": {`
			`"ja_stop": {`
			`"type": "ja_stop",`
			`"stopwords": [`
			`"_japanese_",`
			`"ストップ"`
			`]`
			`}`
			`}`
			`}`
			`}`
			`}`
			`}`

			`POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=ストップは消える`
			`--------------------------------------------------`
			`// AUTOSENSE`

			`The above request returns:`

			`[source,text]`
			`--------------------------------------------------`
			`# Result`
			`{`
			`"tokens" : [ {`
			`"token" : "消える",`
			`"start_offset" : 5,`
			`"end_offset" : 8,`
			`"type" : "word",`
			`"position" : 3`
			`} ]`
			`}`
			`--------------------------------------------------`

Analysis Kuromoji: Add nbest option and NumberFilter Add nbest_cost and nbest_examples parameter to KuromojiTokenizerFactory Add KuromojiNumberFilterFactory 2016-03-16 04:43:21 -04:00			`[[analysis-kuromoji-number]]`
			===== `kuromoji_number` token filter

			The `kuromoji_number` token filter normalizes Japanese numbers (kansūji)
			`to regular Arabic decimal numbers in half-width characters.`

			`[source,json]`
			`--------------------------------------------------`
			`PUT kuromoji_sample`
			`{`
			`"settings": {`
			`"index": {`
			`"analysis": {`
			`"analyzer": {`
			`"my_analyzer": {`
			`"tokenizer": "kuromoji_tokenizer",`
			`"filter": [`
			`"kuromoji_number"`
			`]`
			`}`
			`}`
			`}`
			`}`
			`}`
			`}`

			`POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=一〇〇〇`

			`--------------------------------------------------`
			`// AUTOSENSE`

			`[source,text]`
			`--------------------------------------------------`
			`# Result`
			`{`
			`"tokens" : [ {`
			`"token" : "1000",`
			`"start_offset" : 0,`
			`"end_offset" : 4,`
			`"type" : "word",`
			`"position" : 1`
			`} ]`
			`}`
			`--------------------------------------------------`