The plugin includes the `kuromoji` analyzer.

Includes Analyzer, Tokenizer, TokenFilter
----------------------------------------

The plugin includes the following analyzer, tokenizer and token filters:

| name                     | type        |
|--------------------------|-------------|
| kuromoji                 | analyzer    |
| kuromoji_tokenizer       | tokenizer   |
| kuromoji_baseform        | tokenfilter |
| kuromoji_part_of_speech  | tokenfilter |
| kuromoji_readingform     | tokenfilter |
| kuromoji_stemmer         | tokenfilter |

Usage
-----

## Analyzer : kuromoji

An analyzer of type `kuromoji`.
This analyzer combines the following tokenizer and token filters:

* `kuromoji_tokenizer` : Kuromoji Tokenizer
* `kuromoji_baseform` : Kuromoji BaseForm Filter (TokenFilter)
* `kuromoji_part_of_speech` : Kuromoji Part of Speech Stop Filter (TokenFilter)
* `cjk_width` : CJK Width Filter (TokenFilter)
* `stop` : Stop Filter (TokenFilter)
* `kuromoji_stemmer` : Kuromoji Katakana Stemmer Filter (TokenFilter)
* `lowercase` : LowerCase Filter (TokenFilter)
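
The built-in analyzer can be applied like any other analyzer, for example in a field mapping (a minimal sketch; the index name `kuromoji_sample`, type `my_type` and field name `content` are only illustrative):

```
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "mappings" : {
        "my_type" : {
            "properties" : {
                "content" : {
                    "type" : "string",
                    "analyzer" : "kuromoji"
                }
            }
        }
    }
}
'
```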

## Tokenizer : kuromoji_tokenizer

A tokenizer of type `kuromoji_tokenizer`.

The following are settings that can be set for a `kuromoji_tokenizer` tokenizer type:

| **Setting**         | **Description**                                                                                          | **Default value** |
|:--------------------|:---------------------------------------------------------------------------------------------------------|:------------------|
| mode                | Tokenization mode: determines how the tokenizer handles compound and unknown words. One of `normal`, `search` or `extended`. | `search`          |
| discard_punctuation | `true` if punctuation tokens should be dropped from the output.                                           | `true`            |
| user_dictionary     | Path to a User Dictionary file.                                                                           |                   |

### Tokenization mode

There are three tokenization modes:

* `normal` : Ordinary segmentation: no decomposition for compounds.

* `search` : Segmentation geared towards search: this includes a decompounding process for long nouns, also including the full compound token as a synonym.

* `extended` : Extended mode outputs unigrams for unknown words.

#### Differences between tokenization mode outputs

Input text is `関西国際空港` and `アブラカダブラ`.

| **mode**   | `関西国際空港`                        | `アブラカダブラ`                      |
|:-----------|:------------------------------------|:------------------------------------|
| `normal`   | `関西国際空港`                        | `アブラカダブラ`                      |
| `search`   | `関西` `関西国際空港` `国際` `空港`     | `アブラカダブラ`                      |
| `extended` | `関西` `国際` `空港`                  | `ア` `ブ` `ラ` `カ` `ダ` `ブ` `ラ`     |
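
The mode is set per tokenizer instance, so a single index can define several `kuromoji_tokenizer` tokenizers that differ only in mode (a sketch; the tokenizer names `kuromoji_normal` and `kuromoji_search` are illustrative):

```
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "index" : {
        "analysis" : {
            "tokenizer" : {
                "kuromoji_normal" : { "type" : "kuromoji_tokenizer", "mode" : "normal" },
                "kuromoji_search" : { "type" : "kuromoji_tokenizer", "mode" : "search" }
            }
        }
    }
}
'
```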

### User Dictionary

The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default.

In addition, Kuromoji can load user-defined dictionary entries; this is the User Dictionary.
User Dictionary entries are defined using the following CSV format:

```
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
```

Dictionary Example

```
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
```

To use a User Dictionary, set its file path in the `user_dictionary` attribute.
The User Dictionary file should be placed in the `ES_HOME/config` directory.
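
A malformed entry can cause the tokenizer to fail to load, so it may be worth sanity-checking the file before use. A minimal sketch (this field-count check only mirrors the CSV format above; it is not part of the plugin):

```
# Sketch: each User Dictionary CSV line should have exactly four
# comma-separated fields:
# <text>,<tokens>,<readings>,<part-of-speech tag>
line='東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞'
echo "$line" | awk -F',' '{print NF}'
# prints 4 for a well-formed entry
```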

### example

```
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "index":{
        "analysis":{
            "tokenizer" : {
                "kuromoji_user_dict" : {
                    "type" : "kuromoji_tokenizer",
                    "mode" : "extended",
                    "discard_punctuation" : "false",
                    "user_dictionary" : "userdict_ja.txt"
                }
            },
            "analyzer" : {
                "my_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "kuromoji_user_dict"
                }
            }
        }
    }
}
'

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '東京スカイツリー'
{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}
```

## TokenFilter : kuromoji_baseform

A token filter of type `kuromoji_baseform` that replaces the term text with its BaseFormAttribute.
This acts as a lemmatizer for verbs and adjectives.

### example

```
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "index":{
        "analysis":{
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "kuromoji_tokenizer",
                    "filter" : ["kuromoji_baseform"]
                }
            }
        }
    }
}
'

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '飲み'
{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}
```

## TokenFilter : kuromoji_part_of_speech

A token filter of type `kuromoji_part_of_speech` that removes tokens that match a set of part-of-speech tags.

The following are settings that can be set for a `kuromoji_part_of_speech` token filter type:

| **Setting** | **Description**                                       |
|:------------|:------------------------------------------------------|
| stoptags    | A list of part-of-speech tags that should be removed  |

Note that by default the stoptags defined in the `stoptags.txt` file included in the lucene-analyzers-kuromoji jar are used.

### example

```
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "index":{
        "analysis":{
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "kuromoji_tokenizer",
                    "filter" : ["my_posfilter"]
                }
            },
            "filter" : {
                "my_posfilter" : {
                    "type" : "kuromoji_part_of_speech",
                    "stoptags" : [
                        "助詞-格助詞-一般",
                        "助詞-終助詞"
                    ]
                }
            }
        }
    }
}
'

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '寿司がおいしいね'
{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 3
  } ]
}
```

## TokenFilter : kuromoji_readingform

A token filter of type `kuromoji_readingform` that replaces the term attribute with the reading of a token, in either katakana or romaji form.
The default reading form is katakana.

The following are settings that can be set for a `kuromoji_readingform` token filter type:

| **Setting** | **Description**                                                         | **Default value** |
|:------------|:------------------------------------------------------------------------|:------------------|
| use_romaji  | `true` if the romaji reading form should be output instead of katakana. | `false`           |

Note that the `kuromoji_readingform` filter built into elasticsearch-analysis-kuromoji sets `use_romaji` to `true` by default.

### example

```
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "index":{
        "analysis":{
            "analyzer" : {
                "romaji_analyzer" : {
                    "tokenizer" : "kuromoji_tokenizer",
                    "filter" : ["romaji_readingform"]
                },
                "katakana_analyzer" : {
                    "tokenizer" : "kuromoji_tokenizer",
                    "filter" : ["katakana_readingform"]
                }
            },
            "filter" : {
                "romaji_readingform" : {
                    "type" : "kuromoji_readingform",
                    "use_romaji" : true
                },
                "katakana_readingform" : {
                    "type" : "kuromoji_readingform",
                    "use_romaji" : false
                }
            }
        }
    }
}
'

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=katakana_analyzer&pretty' -d '寿司'
{
  "tokens" : [ {
    "token" : "スシ",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=romaji_analyzer&pretty' -d '寿司'
{
  "tokens" : [ {
    "token" : "sushi",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}
```

## TokenFilter : kuromoji_stemmer

A token filter of type `kuromoji_stemmer` that normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC).
Only katakana words longer than a minimum length are stemmed (default is four).

Note that only full-width katakana characters are supported.

The following are settings that can be set for a `kuromoji_stemmer` token filter type:

| **Setting**    | **Description**            | **Default value** |
|:---------------|:---------------------------|:------------------|
| minimum_length | The minimum length to stem | `4`               |

### example

```
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "index":{
        "analysis":{
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "kuromoji_tokenizer",
                    "filter" : ["my_katakana_stemmer"]
                }
            },
            "filter" : {
                "my_katakana_stemmer" : {
                    "type" : "kuromoji_stemmer",
                    "minimum_length" : 4
                }
            }
        }
    }
}
'

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'コピー'
{
  "tokens" : [ {
    "token" : "コピー",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  } ]
}

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'サーバー'
{
  "tokens" : [ {
    "token" : "サーバ",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  } ]
}
```

License
-------