Add description and example

Jun Ohtani 2013-10-20 06:05:29 +09:00
parent 578c5acbb3
commit 4c31dfc37e
1 changed files with 348 additions and 0 deletions

README.md

@@ -25,6 +25,354 @@ In order to install the plugin, simply run: `bin/plugin -install elasticsearch/e
The plugin includes the `kuromoji` analyzer.
Included Analyzer, Tokenizer, and TokenFilters
----------------------------------------
The plugin includes the following analyzer, tokenizer, and token filters.
| name | type |
|-------------------------|-------------|
| kuromoji | analyzer |
| kuromoji_tokenizer | tokenizer |
| kuromoji_baseform | tokenfilter |
| kuromoji_part_of_speech | tokenfilter |
| kuromoji_readingform | tokenfilter |
| kuromoji_stemmer | tokenfilter |
Usage
-----
## Analyzer : kuromoji
An analyzer of type `kuromoji`.
This analyzer combines the following tokenizer and token filters:
* `kuromoji_tokenizer` : Kuromoji Tokenizer
* `kuromoji_baseform` : Kuromoji BaseFormFilter (TokenFilter)
* `kuromoji_part_of_speech` : Kuromoji Part of Speech Stop Filter (TokenFilter)
* `cjk_width` : CJK Width Filter (TokenFilter)
* `stop` : Stop Filter (TokenFilter)
* `kuromoji_stemmer` : Kuromoji Katakana Stemmer Filter (TokenFilter)
* `lowercase` : LowerCase Filter (TokenFilter)
## Tokenizer : kuromoji_tokenizer
A tokenizer of type `kuromoji_tokenizer`.
The following are settings that can be set for a `kuromoji_tokenizer` tokenizer type:
| **Setting** | **Description** | **Default value** |
|:--------------------|:--------------------------------------------------------------------------------------------------------------------------|:------------------|
| mode                | Tokenization mode: determines how the tokenizer handles compound and unknown words. One of `normal`, `search`, or `extended`. | `search`          |
| discard_punctuation | `true` if punctuation tokens should be dropped from the output.                                                           | `true`            |
| user_dictionary     | Path to the user dictionary file.                                                                                         |                   |
### Tokenization mode
There are three tokenization modes:
* `normal` : Ordinary segmentation: no decomposition for compounds.
* `search` : Segmentation geared towards search: this includes a decompounding process for long nouns, also including the full compound token as a synonym.
* `extended` : Extended mode outputs unigrams for unknown words.
#### Output differences between tokenization modes
The input texts are `関西国際空港` and `アブラカダブラ`.
| **mode** | `関西国際空港` | `アブラカダブラ` |
|:-----------|:-------------|:-------|
| `normal` | `関西国際空港` | `アブラカダブラ` |
| `search` | `関西` `関西国際空港` `国際` `空港` | `アブラカダブラ` |
| `extended` | `関西` `国際` `空港` | `ア` `ブ` `ラ` `カ` `ダ` `ブ` `ラ` |
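To compare the modes side by side, one option is to define one tokenizer per mode in the same index. The following Python snippet is only a sketch of how such index settings could be generated. The names `kuromoji_<mode>` and `<mode>_analyzer` are made up for illustration; only the `type` and `mode` keys come from the settings table above.

```python
import json

# The three modes listed in the settings table above.
MODES = ["normal", "search", "extended"]


def build_settings(modes):
    """Build index settings with one kuromoji tokenizer and one analyzer per mode.

    The tokenizer and analyzer names are hypothetical; pick any names you like.
    """
    tokenizers = {}
    analyzers = {}
    for mode in modes:
        tokenizer_name = f"kuromoji_{mode}"
        tokenizers[tokenizer_name] = {"type": "kuromoji_tokenizer", "mode": mode}
        analyzers[f"{mode}_analyzer"] = {"type": "custom", "tokenizer": tokenizer_name}
    return {"index": {"analysis": {"tokenizer": tokenizers, "analyzer": analyzers}}}


settings = build_settings(MODES)
# Print a JSON document that could serve as the body of the index creation request.
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

The printed JSON can be passed as the `-d` body of a `curl -XPUT` request, after which each analyzer can be tried with the `_analyze` API.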
### User Dictionary
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default.
Users can also add their own entries to the dictionary; this is called the User Dictionary.
User Dictionary entries are defined using the following CSV format:
```
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
```
Dictionary example:
```
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
```
To use a User Dictionary, set the file path in the `user_dictionary` setting.
Place the User Dictionary file in the `ES_HOME/config` directory.
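Before installing a dictionary file, it can help to check each line against the CSV format described above. The following Python snippet is a minimal sketch of such a check. It is not part of the plugin, and the `UserDictEntry` type and `parse_entry` function are hypothetical helpers. The rule that the number of tokens must equal the number of readings is a sanity check assumed by this sketch, based on the format `<token 1> ... <token n>,<reading 1> ... <reading n>`.

```python
from typing import List, NamedTuple


class UserDictEntry(NamedTuple):
    """One parsed user dictionary line (hypothetical helper type)."""
    text: str
    tokens: List[str]
    readings: List[str]
    pos_tag: str


def parse_entry(line: str) -> UserDictEntry:
    """Parse '<text>,<tokens>,<readings>,<part-of-speech tag>' into its parts."""
    fields = line.strip().split(",")
    if len(fields) != 4:
        raise ValueError(f"expected 4 comma-separated fields, got {len(fields)}")
    text, token_field, reading_field, pos_tag = fields
    tokens = token_field.split()      # tokens are separated by spaces
    readings = reading_field.split()  # readings are separated by spaces
    # Assumed sanity check: each token should have exactly one reading.
    if len(tokens) != len(readings):
        raise ValueError("number of tokens and readings differ")
    return UserDictEntry(text, tokens, readings, pos_tag)


entry = parse_entry("東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞")
print(entry.tokens, entry.readings, entry.pos_tag)
```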
### Example
```
# Create an index whose tokenizer uses extended mode and a user dictionary
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "index":{
        "analysis":{
            "tokenizer" : {
                "kuromoji_user_dict" : {
                    "type" : "kuromoji_tokenizer",
                    "mode" : "extended",
                    "discard_punctuation" : "false",
                    "user_dictionary" : "userdict_ja.txt"
                }
            },
            "analyzer" : {
                "my_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "kuromoji_user_dict"
                }
            }
        }
    }
}
'

# Analyze sample text; the response follows the command
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '東京スカイツリー'
{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}
```
## TokenFilter : kuromoji_baseform
A token filter of type `kuromoji_baseform` that replaces the term text with its base form (BaseFormAttribute).
This acts as a lemmatizer for verbs and adjectives.
### Example
```
# Create an index whose analyzer applies the base form filter
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "index":{
        "analysis":{
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "kuromoji_tokenizer",
                    "filter" : ["kuromoji_baseform"]
                }
            }
        }
    }
}
'

# Analyze sample text; the response follows the command
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '飲み'
{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}
```
## TokenFilter : kuromoji_part_of_speech
A token filter of type `kuromoji_part_of_speech` that removes tokens that match a set of part-of-speech tags.
The following are settings that can be set for a `kuromoji_part_of_speech` token filter type:
| **Setting** | **Description** |
|:------------|:-----------------------------------------------------|
| stoptags | A list of part-of-speech tags that should be removed |
Note that the default is the `stoptags.txt` file included in `lucene-analyzers-kuromoji.jar`.
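The filtering rule itself is simple, and the following Python snippet sketches it outside of Elasticsearch. It is an illustration only, not the plugin's implementation. The part-of-speech tags attached to each token of `寿司がおいしいね` are assumed IPADIC-style labels written by hand for this sketch, and `filter_by_pos` is a hypothetical helper. Removed tokens still consume a position, so the surviving tokens keep their original position numbers.

```python
from typing import List, Set, Tuple

# (surface form, part-of-speech tag) pairs for the sentence 寿司がおいしいね.
# The tags are assumptions made for this sketch, not output captured from the tokenizer.
TOKENS = [
    ("寿司", "名詞-一般"),
    ("が", "助詞-格助詞-一般"),
    ("おいしい", "形容詞-自立"),
    ("ね", "助詞-終助詞"),
]

STOPTAGS = {"助詞-格助詞-一般", "助詞-終助詞"}


def filter_by_pos(tokens: List[Tuple[str, str]], stoptags: Set[str]) -> List[Tuple[str, int]]:
    """Drop tokens whose tag is in stoptags; keep each survivor's 1-based position."""
    kept = []
    for position, (surface, tag) in enumerate(tokens, start=1):
        if tag not in stoptags:
            kept.append((surface, position))
    return kept


print(filter_by_pos(TOKENS, STOPTAGS))
```

With these stop tags, the particles `が` and `ね` are dropped while `寿司` and `おいしい` remain at positions 1 and 3.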
### Example
```
# Create an index whose analyzer removes two kinds of particles
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "index":{
        "analysis":{
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "kuromoji_tokenizer",
                    "filter" : ["my_posfilter"]
                }
            },
            "filter" : {
                "my_posfilter" : {
                    "type" : "kuromoji_part_of_speech",
                    "stoptags" : [
                        "助詞-格助詞-一般",
                        "助詞-終助詞"
                    ]
                }
            }
        }
    }
}
'

# Analyze sample text; the response follows the command
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '寿司がおいしいね'
{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 3
  } ]
}
```
## TokenFilter : kuromoji_readingform
A token filter of type `kuromoji_readingform` that replaces the term attribute with the reading of a token in either katakana or romaji form.
The default reading form is katakana.
The following are settings that can be set for a `kuromoji_readingform` token filter type:
| **Setting** | **Description** | **Default value** |
|:------------|:----------------------------------------------------------|:------------------|
| use_romaji  | `true` to output the reading in romaji instead of katakana. | `false`         |
Note that the built-in `kuromoji_readingform` filter of elasticsearch-analysis-kuromoji sets the `use_romaji` attribute to `true` by default.
### Example
```
# Create an index with one analyzer per reading form
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "index":{
        "analysis":{
            "analyzer" : {
                "romaji_analyzer" : {
                    "tokenizer" : "kuromoji_tokenizer",
                    "filter" : ["romaji_readingform"]
                },
                "katakana_analyzer" : {
                    "tokenizer" : "kuromoji_tokenizer",
                    "filter" : ["katakana_readingform"]
                }
            },
            "filter" : {
                "romaji_readingform" : {
                    "type" : "kuromoji_readingform",
                    "use_romaji" : true
                },
                "katakana_readingform" : {
                    "type" : "kuromoji_readingform",
                    "use_romaji" : false
                }
            }
        }
    }
}
'

# Katakana reading; the response follows the command
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=katakana_analyzer&pretty' -d '寿司'
{
  "tokens" : [ {
    "token" : "スシ",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}

# Romaji reading; the response follows the command
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=romaji_analyzer&pretty' -d '寿司'
{
  "tokens" : [ {
    "token" : "sushi",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}
```
## TokenFilter : kuromoji_stemmer
A token filter of type `kuromoji_stemmer` that normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC).
Only katakana words that are at least the minimum length are stemmed (default is four).
Note that only full-width katakana characters are supported.
The following are settings that can be set for a `kuromoji_stemmer` token filter type:
| **Setting** | **Description** | **Default value** |
|:----------------|:---------------------------|:------------------|
| minimum_length | The minimum length to stem | `4` |
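The stemming rule can be sketched in a few lines of Python. This is an illustration written for this README, not the filter's source code, and `stem_katakana` and `is_katakana` are hypothetical helpers. The length comparison is an assumption inferred from the example in this section, where the four-character word `サーバー` is stemmed with `minimum_length` set to `4` while the three-character word `コピー` is left unchanged.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # the long sound character "ー"


def is_katakana(word: str) -> bool:
    """True if every character lies in the full-width Katakana block (U+30A0 to U+30FF)."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in word)


def stem_katakana(word: str, minimum_length: int = 4) -> str:
    """Remove one trailing long sound mark from sufficiently long katakana words."""
    if len(word) < minimum_length:
        return word  # too short: left as-is
    if not is_katakana(word):
        return word  # half-width katakana and other scripts are not handled
    if word.endswith(PROLONGED_SOUND_MARK):
        return word[:-1]
    return word


for sample in ["コピー", "サーバー", "ｻｰﾊﾞｰ"]:
    print(sample, "->", stem_katakana(sample))
```

The half-width input is passed through unchanged, which reflects the note that only full-width katakana is supported.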
### Example
```
# Create an index whose analyzer applies the katakana stemmer
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "index":{
        "analysis":{
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "kuromoji_tokenizer",
                    "filter" : ["my_katakana_stemmer"]
                }
            },
            "filter" : {
                "my_katakana_stemmer" : {
                    "type" : "kuromoji_stemmer",
                    "minimum_length" : 4
                }
            }
        }
    }
}
'

# Three-character word: not stemmed; the response follows the command
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'コピー'
{
  "tokens" : [ {
    "token" : "コピー",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  } ]
}

# Four-character word: the trailing long sound character is removed
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'サーバー'
{
  "tokens" : [ {
    "token" : "サーバ",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  } ]
}
```
License
-------