Add description and example

2025-02-17 10:25:15 +00:00 · 2013-10-20 06:05:29 +09:00 · 2013-10-20 06:05:29 +09:00 · 4c31dfc37e
commit 4c31dfc37e
parent 578c5acbb3
1 changed files with 348 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -25,6 +25,354 @@ In order to install the plugin, simply run: `bin/plugin -install elasticsearch/e

 The plugin includes the `kuromoji` analyzer.

+Included Analyzer, Tokenizer, and TokenFilters
+----------------------------------------------
+
+The plugin includes the following analyzer, tokenizer, and token filters.
+
+| name                    | type        |
+|-------------------------|-------------|
+| kuromoji                | analyzer    |
+| kuromoji_tokenizer      | tokenizer   |
+| kuromoji_baseform       | tokenfilter |
+| kuromoji_part_of_speech | tokenfilter |
+| kuromoji_readingform    | tokenfilter |
+| kuromoji_stemmer        | tokenfilter |
+
+
+Usage
+-----
+
+## Analyzer : kuromoji
+
+An analyzer of type `kuromoji`.
+This analyzer combines the following tokenizer and token filters:
+
+* `kuromoji_tokenizer` : Kuromoji Tokenizer
+* `kuromoji_baseform` : Kuromoji BaseFormFilter (TokenFilter)
+* `kuromoji_part_of_speech` : Kuromoji Part of Speech Stop Filter (TokenFilter)
+* `cjk_width` : CJK Width Filter (TokenFilter)
+* `stop` : Stop Filter (TokenFilter)
+* `kuromoji_stemmer` : Kuromoji Katakana Stemmer Filter (TokenFilter)
+* `lowercase` : LowerCase Filter (TokenFilter)
+
+## Tokenizer : kuromoji_tokenizer
+
+A tokenizer of type `kuromoji_tokenizer`.
+
+The following are settings that can be set for a `kuromoji_tokenizer` tokenizer type:
+
+| **Setting**         | **Description**                                                                                                              | **Default value** |
+|:--------------------|:-----------------------------------------------------------------------------------------------------------------------------|:------------------|
+| mode                | Tokenization mode, which determines how the tokenizer handles compound and unknown words: `normal`, `search`, or `extended`. | `search`          |
+| discard_punctuation | `true` if punctuation tokens should be dropped from the output.                                                              | `true`            |
+| user_dictionary     | Path to the User Dictionary file.                                                                                            |                   |
+
+### Tokenization mode
+
+There are three tokenization modes:
+
+* `normal` : Ordinary segmentation, with no decomposition of compounds.
+
+* `search` : Segmentation geared towards search. Long nouns are decompounded, and the full compound token is also included as a synonym.
+
+* `extended` : Extended mode, which outputs unigrams for unknown words.
+
+#### Output differences between tokenization modes
+
+The input texts are `関西国際空港` and `アブラカダブラ`.
+
+| **mode**   | `関西国際空港` | `アブラカダブラ` |
+|:-----------|:-------------|:-------|
+| `normal`   | `関西国際空港` | `アブラカダブラ` |
+| `search`   | `関西` `関西国際空港` `国際` `空港` | `アブラカダブラ` |
+| `extended` | `関西` `国際` `空港` | `ア` `ブ` `ラ` `カ` `ダ` `ブ` `ラ` |
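
The `extended` row for `アブラカダブラ` can be illustrated with a minimal Python sketch. It only mimics the unigram splitting of unknown words. The function name `extended_mode_tokens` and the `is_known` flag are hypothetical, since the real decision depends on Kuromoji's dictionary lookup.

```python
def extended_mode_tokens(surface, is_known):
    """Return the word unchanged if it is known, otherwise split it into unigrams.

    `is_known` is a stand-in (an assumption for this sketch) for the
    dictionary lookup that the real tokenizer performs.
    """
    return [surface] if is_known else list(surface)


# An unknown katakana word is emitted one character at a time.
unigrams = extended_mode_tokens("アブラカダブラ", is_known=False)
print(unigrams)
```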
+
+### User Dictionary
+
+The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default.
+Users can also define additional dictionary entries; together these form the User Dictionary.
+User Dictionary entries are defined using the following CSV format:
+
+```
+<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
+```
+
+Dictionary Example
+
+```
+東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
+```
+
+To use a User Dictionary, set its file path in the `user_dictionary` setting.
+Place the User Dictionary file in the `ES_HOME/config` directory.
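
To make the CSV layout concrete, here is a minimal Python sketch that splits one entry into its four fields. It is not the plugin's loader. `UserDictEntry` and `parse_user_dict_line` are hypothetical names, and the only check performed is the one the format itself implies: the number of tokens must equal the number of readings.

```python
import csv
from typing import NamedTuple


class UserDictEntry(NamedTuple):
    text: str       # surface form to match
    tokens: list    # segmentation of the surface form
    readings: list  # one reading per token
    pos_tag: str    # part-of-speech tag


def parse_user_dict_line(line):
    """Parse one '<text>,<tokens>,<readings>,<pos>' line (illustrative only)."""
    text, tokens, readings, pos_tag = next(csv.reader([line]))
    entry = UserDictEntry(text, tokens.split(), readings.split(), pos_tag)
    if len(entry.tokens) != len(entry.readings):
        raise ValueError("token count and reading count differ: %r" % line)
    return entry


entry = parse_user_dict_line(
    "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
)
print(entry)
```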
+
+### example
+
+```
+curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
+{
+    "index":{
+        "analysis":{
+            "tokenizer" : {
+                "kuromoji_user_dict" : {
+                   "type" : "kuromoji_tokenizer",
+                   "mode" : "extended",
+                   "discard_punctuation" : "false",
+                   "user_dictionary" : "userdict_ja.txt"
+                }
+            },
+            "analyzer" : {
+                "my_analyzer" : {
+                    "type" : "custom",
+                    "tokenizer" : "kuromoji_user_dict"
+                }
+            }
+
+        }
+    }
+}
+'
+
+curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '東京スカイツリー'
+{
+  "tokens" : [ {
+    "token" : "東京",
+    "start_offset" : 0,
+    "end_offset" : 2,
+    "type" : "word",
+    "position" : 1
+  }, {
+    "token" : "スカイツリー",
+    "start_offset" : 2,
+    "end_offset" : 8,
+    "type" : "word",
+    "position" : 2
+  } ]
+}
+```
+
+## TokenFilter : kuromoji_baseform
+
+A token filter of type `kuromoji_baseform` that replaces the term text with its base form (BaseFormAttribute).
+This acts as a lemmatizer for verbs and adjectives.
+
+### example
+
+```
+curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
+{
+    "index":{
+        "analysis":{
+            "analyzer" : {
+                "my_analyzer" : {
+                    "tokenizer" : "kuromoji_tokenizer",
+                    "filter" : ["kuromoji_baseform"]
+                }
+            }
+        }
+    }
+}
+'
+
+curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '飲み'
+{
+  "tokens" : [ {
+    "token" : "飲む",
+    "start_offset" : 0,
+    "end_offset" : 2,
+    "type" : "word",
+    "position" : 1
+  } ]
+}
+```
+
+## TokenFilter : kuromoji_part_of_speech
+
+A token filter of type `kuromoji_part_of_speech` that removes tokens that match a set of part-of-speech tags.
+
+The following are settings that can be set for a `kuromoji_part_of_speech` token filter type:
+
+| **Setting** | **Description**                                      |
+|:------------|:-----------------------------------------------------|
+| stoptags    | A list of part-of-speech tags that should be removed |
+
+Note that the default is the `stoptags.txt` file included in `lucene-analyzers-kuromoji.jar`.
+
+### example
+
+```
+curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
+{
+    "index":{
+        "analysis":{
+            "analyzer" : {
+                "my_analyzer" : {
+                    "tokenizer" : "kuromoji_tokenizer",
+                    "filter" : ["my_posfilter"]
+                }
+            },
+            "filter" : {
+                "my_posfilter" : {
+                    "type" : "kuromoji_part_of_speech",
+                    "stoptags" : [
+                        "助詞-格助詞-一般",
+                        "助詞-終助詞"
+                    ]
+                }
+            }
+        }
+    }
+}
+'
+
+curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '寿司がおいしいね'
+{
+  "tokens" : [ {
+    "token" : "寿司",
+    "start_offset" : 0,
+    "end_offset" : 2,
+    "type" : "word",
+    "position" : 1
+  }, {
+    "token" : "おいしい",
+    "start_offset" : 3,
+    "end_offset" : 7,
+    "type" : "word",
+    "position" : 3
+  } ]
+}
+```
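
The filtering step in the example above can be sketched in a few lines of Python. This is an illustration, not the filter's implementation. The part-of-speech tags for `が` and `ね` follow from the `stoptags` in the example, while the tags given for `寿司` and `おいしい` are assumptions made for this sketch. `remove_stoptags` is a hypothetical helper name.

```python
def remove_stoptags(tagged_tokens, stoptags):
    """Keep the surface form of every token whose tag is not a stop tag."""
    stop = set(stoptags)
    return [surface for surface, tag in tagged_tokens if tag not in stop]


# Hand-annotated (surface, part-of-speech tag) pairs for '寿司がおいしいね'.
# The tags for 寿司 and おいしい are assumed for illustration.
tagged = [
    ("寿司", "名詞-一般"),
    ("が", "助詞-格助詞-一般"),
    ("おいしい", "形容詞-自立"),
    ("ね", "助詞-終助詞"),
]

kept = remove_stoptags(tagged, ["助詞-格助詞-一般", "助詞-終助詞"])
print(kept)
```

The real filter also leaves a position gap for each removed token, which is why `おいしい` is reported at position 3 in the output above. The sketch does not model positions.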
+
+## TokenFilter : kuromoji_readingform
+
+A token filter of type `kuromoji_readingform` that replaces the term attribute with the reading of a token in either katakana or romaji form.
+The default reading form is katakana.
+
+The following are settings that can be set for a `kuromoji_readingform` token filter type:
+
+| **Setting** | **Description**                                           | **Default value** |
+|:------------|:----------------------------------------------------------|:------------------|
+| use_romaji  | `true` to output the romaji reading instead of katakana.  | `false`           |
+
+Note that the `kuromoji_readingform` filter built into elasticsearch-analysis-kuromoji sets `use_romaji` to `true` by default.
+
+### example
+
+```
+curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
+{
+    "index":{
+        "analysis":{
+            "analyzer" : {
+                "romaji_analyzer" : {
+                    "tokenizer" : "kuromoji_tokenizer",
+                    "filter" : ["romaji_readingform"]
+                },
+                "katakana_analyzer" : {
+                    "tokenizer" : "kuromoji_tokenizer",
+                    "filter" : ["katakana_readingform"]
+                }
+            },
+            "filter" : {
+                "romaji_readingform" : {
+                    "type" : "kuromoji_readingform",
+                    "use_romaji" : true
+                },
+                "katakana_readingform" : {
+                    "type" : "kuromoji_readingform",
+                    "use_romaji" : false
+                }
+            }
+        }
+    }
+}
+'
+
+curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=katakana_analyzer&pretty' -d '寿司'
+{
+  "tokens" : [ {
+    "token" : "スシ",
+    "start_offset" : 0,
+    "end_offset" : 2,
+    "type" : "word",
+    "position" : 1
+  } ]
+}
+
+curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=romaji_analyzer&pretty' -d '寿司'
+{
+  "tokens" : [ {
+    "token" : "sushi",
+    "start_offset" : 0,
+    "end_offset" : 2,
+    "type" : "word",
+    "position" : 1
+  } ]
+}
+```
+
+## TokenFilter : kuromoji_stemmer
+
+A token filter of type `kuromoji_stemmer` that normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC).
+Only katakana words at least `minimum_length` characters long are stemmed (the default is four).
+
+Note that only full-width katakana characters are supported.
+
+The following are settings that can be set for a `kuromoji_stemmer` token filter type:
+
+| **Setting**     | **Description**            | **Default value** |
+|:----------------|:---------------------------|:------------------|
+| minimum_length  | The minimum length to stem | `4`               |
+
+### example
+
+```
+curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
+{
+    "index":{
+        "analysis":{
+            "analyzer" : {
+                "my_analyzer" : {
+                    "tokenizer" : "kuromoji_tokenizer",
+                    "filter" : ["my_katakana_stemmer"]
+                }
+            },
+            "filter" : {
+                "my_katakana_stemmer" : {
+                    "type" : "kuromoji_stemmer",
+                    "minimum_length" : 4
+                }
+            }
+        }
+    }
+}
+'
+
+curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'コピー'
+{
+  "tokens" : [ {
+    "token" : "コピー",
+    "start_offset" : 0,
+    "end_offset" : 3,
+    "type" : "word",
+    "position" : 1
+  } ]
+}
+
+curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'サーバー'
+{
+  "tokens" : [ {
+    "token" : "サーバ",
+    "start_offset" : 0,
+    "end_offset" : 4,
+    "type" : "word",
+    "position" : 1
+  } ]
+}
+```
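
The two results above can be reproduced with a minimal Python sketch of the stemming rule. It is not the filter's source code. It assumes that "katakana" means the Unicode Katakana block (U+30A0 to U+30FF) and that a word is stemmed when its length is at least `minimum_length`, which is what the `サーバー` example suggests. `stem_katakana` and `is_katakana` are hypothetical names.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # the long sound character 'ー'


def is_katakana(word):
    """True if every character lies in the Unicode Katakana block (assumed range)."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in word)


def stem_katakana(word, minimum_length=4):
    """Drop one trailing long sound mark from a sufficiently long katakana word."""
    if (
        len(word) >= minimum_length
        and is_katakana(word)
        and word.endswith(PROLONGED_SOUND_MARK)
    ):
        return word[:-1]
    return word


print(stem_katakana("コピー"))    # three characters: below the minimum, unchanged
print(stem_katakana("サーバー"))  # four characters: trailing 'ー' removed
```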
+
+
 License
 -------