migrate branch for analysis-kuromoji
commit
7294d27e5c
|
@ -0,0 +1,552 @@
|
||||||
|
Japanese (kuromoji) Analysis for Elasticsearch
|
||||||
|
==================================
|
||||||
|
|
||||||
|
The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis module into Elasticsearch.
|
||||||
|
|
||||||
|
In order to install the plugin, run:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
bin/plugin install elasticsearch/elasticsearch-analysis-kuromoji/2.5.0
|
||||||
|
```
|
||||||
|
|
||||||
|
You need to install a version matching your Elasticsearch version:
|
||||||
|
|
||||||
|
| elasticsearch | Kuromoji Analysis Plugin | Docs |
|---------------|--------------------------|------|
| master | Build from source | See below |
| es-1.x | Build from source | [2.6.0-SNAPSHOT](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/es-1.x/#version-260-snapshot-for-elasticsearch-1x) |
| es-1.5 | 2.5.0 | [2.5.0](https://github.com/elastic/elasticsearch-analysis-kuromoji/tree/v2.5.0/#version-250-for-elasticsearch-15) |
| es-1.4 | 2.4.3 | [2.4.3](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.4.3/#version-243-for-elasticsearch-14) |
| < 1.4.5 | 2.4.2 | [2.4.2](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.4.2/#version-242-for-elasticsearch-14) |
| < 1.4.3 | 2.4.1 | [2.4.1](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.4.1/#version-241-for-elasticsearch-14) |
| es-1.3 | 2.3.0 | [2.3.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.3.0/#japanese-kuromoji-analysis-for-elasticsearch) |
| es-1.2 | 2.2.0 | [2.2.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.2.0/#japanese-kuromoji-analysis-for-elasticsearch) |
| es-1.1 | 2.1.0 | [2.1.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.1.0/#japanese-kuromoji-analysis-for-elasticsearch) |
| es-1.0 | 2.0.0 | [2.0.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.0.0/#japanese-kuromoji-analysis-for-elasticsearch) |
| es-0.90 | 1.8.0 | [1.8.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v1.8.0/#japanese-kuromoji-analysis-for-elasticsearch) |
|
||||||
|
|
||||||
|
To build a `SNAPSHOT` version, you need to build it with Maven:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mvn clean install
|
||||||
|
plugin --install analysis-kuromoji \
|
||||||
|
--url file:target/releases/elasticsearch-analysis-kuromoji-X.X.X-SNAPSHOT.zip
|
||||||
|
```
|
||||||
|
|
||||||
|
Includes Analyzer, Tokenizer, TokenFilter, CharFilter
|
||||||
|
-----------------------------------------------
|
||||||
|
|
||||||
|
The plugin includes the following analyzer, tokenizer, token filters, and char filter.
|
||||||
|
|
||||||
|
| name                    | type        |
|-------------------------|-------------|
| kuromoji_iteration_mark | charfilter  |
| kuromoji                | analyzer    |
| kuromoji_tokenizer      | tokenizer   |
| kuromoji_baseform       | tokenfilter |
| kuromoji_part_of_speech | tokenfilter |
| kuromoji_readingform    | tokenfilter |
| kuromoji_stemmer        | tokenfilter |
| ja_stop                 | tokenfilter |
|
||||||
|
|
||||||
|
|
||||||
|
Usage
|
||||||
|
-----
|
||||||
|
|
||||||
|
## Analyzer : kuromoji
|
||||||
|
|
||||||
|
An analyzer of type `kuromoji`.
|
||||||
|
This analyzer combines the following tokenizer and token filters:
|
||||||
|
|
||||||
|
* `kuromoji_tokenizer` : Kuromoji Tokenizer
|
||||||
|
* `kuromoji_baseform` : Kuromoji BaseFormFilter (TokenFilter)
|
||||||
|
* `kuromoji_part_of_speech` : Kuromoji Part of Speech Stop Filter (TokenFilter)
|
||||||
|
* `cjk_width` : CJK Width Filter (TokenFilter)
|
||||||
|
* `stop` : Stop Filter (TokenFilter)
|
||||||
|
* `kuromoji_stemmer` : Kuromoji Katakana Stemmer Filter (TokenFilter)
|
||||||
|
* `lowercase` : LowerCase Filter (TokenFilter)
|
||||||
|
|
||||||
|
## CharFilter : kuromoji_iteration_mark
|
||||||
|
|
||||||
|
A charfilter of type `kuromoji_iteration_mark`.
|
||||||
|
This charfilter normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.
|
||||||
|
|
||||||
|
The following are settings that can be set for a `kuromoji_iteration_mark` charfilter type:
|
||||||
|
|
||||||
|
| **Setting**     | **Description**                                              | **Default value** |
|:----------------|:-------------------------------------------------------------|:------------------|
| normalize_kanji | indicates whether kanji iteration marks should be normalized | `true`            |
| normalize_kana  | indicates whether kana iteration marks should be normalized  | `true`            |
|
||||||
|
|
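The expansion can be illustrated with a small stand-alone sketch. It is not Lucene's `JapaneseIterationMarkCharFilter`: it handles only the kanji iteration mark `々` (U+3005), ignores the kana marks, and assumes that a run of n marks repeats the n characters before it (so `時々` becomes `時時`). The class and method names are made up for this illustration, and the strings use Unicode escapes so the file compiles regardless of source encoding.

```java
// Simplified sketch only: expands the kanji iteration mark (U+3005), not the kana marks.
final class KanjiIterationMarkSketch {
    private static final char KANJI_ITERATION_MARK = '\u3005';

    /** Replaces each run of n iteration marks with the n characters located n positions earlier. */
    static String normalize(String text, boolean normalizeKanji) {
        if (!normalizeKanji) {
            return text; // mirrors normalize_kanji = false
        }
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            if (text.charAt(i) != KANJI_ITERATION_MARK) {
                out.append(text.charAt(i));
                i++;
                continue;
            }
            // Measure the run of consecutive iteration marks starting at i.
            int runLength = 0;
            while (i + runLength < text.length()
                    && text.charAt(i + runLength) == KANJI_ITERATION_MARK) {
                runLength++;
            }
            for (int k = 0; k < runLength; k++) {
                int source = i + k - runLength; // character one run-length earlier
                // Keep the mark itself when there is nothing to copy from.
                out.append(source >= 0 ? text.charAt(source) : KANJI_ITERATION_MARK);
            }
            i += runLength;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "toki-doki" written with one iteration mark: kanji U+6642 followed by U+3005.
        System.out.println(normalize("\u6642\u3005", true));
        // Two kanji followed by a run of two marks, then two hiragana.
        System.out.println(normalize("\u99AC\u9E7F\u3005\u3005\u3057\u3044", true));
    }
}
```

Because this is a character-level rewrite, it runs before tokenization, which is why the plugin exposes it as a charfilter rather than a token filter.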
||||||
|
## Tokenizer : kuromoji_tokenizer
|
||||||
|
|
||||||
|
A tokenizer of type `kuromoji_tokenizer`.
|
||||||
|
|
||||||
|
The following are settings that can be set for a `kuromoji_tokenizer` tokenizer type:
|
||||||
|
|
||||||
|
| **Setting**         | **Description**                                                                                                                    | **Default value** |
|:--------------------|:-----------------------------------------------------------------------------------------------------------------------------------|:------------------|
| mode                | Tokenization mode: determines how the tokenizer handles compound and unknown words. One of `normal`, `search`, or `extended`.       | `search`          |
| discard_punctuation | `true` if punctuation tokens should be dropped from the output.                                                                    | `true`            |
| user_dictionary     | User Dictionary file to load                                                                                                       |                   |
|
||||||
|
|
||||||
|
### Tokenization mode
|
||||||
|
|
||||||
|
There are three modes:
|
||||||
|
|
||||||
|
* `normal` : Ordinary segmentation: no decomposition for compounds
|
||||||
|
|
||||||
|
* `search` : Segmentation geared towards search: this includes a decompounding process for long nouns, also including the full compound token as a synonym.
|
||||||
|
|
||||||
|
* `extended` : Extended mode outputs unigrams for unknown words.
|
||||||
|
|
||||||
|
#### Differences in tokenization mode output
|
||||||
|
|
||||||
|
The input texts are `関西国際空港` and `アブラカダブラ`.
|
||||||
|
|
||||||
|
| **mode**   | `関西国際空港`                      | `アブラカダブラ`                   |
|:-----------|:------------------------------------|:-----------------------------------|
| `normal`   | `関西国際空港`                      | `アブラカダブラ`                   |
| `search`   | `関西` `関西国際空港` `国際` `空港` | `アブラカダブラ`                   |
| `extended` | `関西` `国際` `空港`                | `ア` `ブ` `ラ` `カ` `ダ` `ブ` `ラ` |
|
||||||
|
|
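As a rough illustration of two points above, the sketch below resolves the `mode` setting with `search` as the fallback and shows what "unigrams for unknown words" means for the `アブラカダブラ` column. It is a stand-alone approximation, not the plugin's `KuromojiTokenizerFactory`: the `Mode` enum, `parseMode`, and `unigrams` are hypothetical names, and the real tokenizer decides what counts as an unknown word from its dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: names are illustrative, not the plugin's API.
final class TokenizationModeSketch {
    enum Mode { NORMAL, SEARCH, EXTENDED }

    /** Maps the `mode` setting to a Mode; an absent setting falls back to `search`. */
    static Mode parseMode(String setting) {
        if (setting == null) {
            return Mode.SEARCH;
        }
        // Throws IllegalArgumentException for values other than normal/search/extended.
        return Mode.valueOf(setting.trim().toUpperCase(Locale.ROOT));
    }

    /** Splits a word into one token per code point, as `extended` mode does for unknown words. */
    static List<String> unigrams(String unknownWord) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < unknownWord.length()) {
            int codePoint = unknownWord.codePointAt(i);
            tokens.add(new String(Character.toChars(codePoint)));
            i += Character.charCount(codePoint);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(parseMode(null));
        System.out.println(parseMode("extended"));
        // The katakana word from the table above, written with Unicode escapes.
        System.out.println(unigrams("\u30A2\u30D6\u30E9\u30AB\u30C0\u30D6\u30E9"));
    }
}
```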
||||||
|
### User Dictionary
|
||||||
|
|
||||||
|
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default.
You can also add your own entries to the dictionary; these entries form the User Dictionary.
|
||||||
|
User Dictionary entries are defined using the following CSV format:
|
||||||
|
|
||||||
|
```
|
||||||
|
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
|
||||||
|
```
|
||||||
|
|
||||||
|
Dictionary Example
|
||||||
|
|
||||||
|
```
|
||||||
|
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
|
||||||
|
```
|
||||||
|
|
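To make the four-field layout concrete, here is a minimal parsing sketch. It is not Lucene's `UserDictionary` loader: it assumes plain comma separation with no quoting or comment lines, and the class name and fields are invented for this illustration. The sample line is a romanized stand-in for the entry above, used only to keep the source ASCII.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of one User Dictionary line; assumes no quoted fields.
final class UserDictionaryEntrySketch {
    final String text;            // surface form to match
    final List<String> tokens;    // segmentation of the surface form
    final List<String> readings;  // one reading per token
    final String partOfSpeech;    // part-of-speech tag

    private UserDictionaryEntrySketch(String text, List<String> tokens,
                                      List<String> readings, String partOfSpeech) {
        this.text = text;
        this.tokens = tokens;
        this.readings = readings;
        this.partOfSpeech = partOfSpeech;
    }

    /** Parses "text,token1 ... tokenN,reading1 ... readingN,pos". */
    static UserDictionaryEntrySketch parse(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 comma-separated fields: " + line);
        }
        List<String> tokens = Arrays.asList(fields[1].trim().split("\\s+"));
        List<String> readings = Arrays.asList(fields[2].trim().split("\\s+"));
        if (tokens.size() != readings.size()) {
            throw new IllegalArgumentException("each token needs exactly one reading: " + line);
        }
        return new UserDictionaryEntrySketch(fields[0].trim(), tokens, readings, fields[3].trim());
    }

    public static void main(String[] args) {
        // Romanized stand-in for the README entry.
        UserDictionaryEntrySketch entry =
                parse("tokyoskytree,tokyo skytree,toukyou sukaitsurii,custom-noun");
        System.out.println(entry.text + " -> " + entry.tokens + " / " + entry.readings
                + " (" + entry.partOfSpeech + ")");
    }
}
```

The token and reading lists are positional: the n-th reading belongs to the n-th token, which is why the sketch rejects lines where the counts differ.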
||||||
|
To use a User Dictionary, set its file path in the `user_dictionary` setting.
The User Dictionary file must be placed in the `ES_HOME/config` directory.
|
||||||
|
|
||||||
|
### example
|
||||||
|
|
||||||
|
_Example Settings:_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
|
||||||
|
{
|
||||||
|
"settings": {
|
||||||
|
"index":{
|
||||||
|
"analysis":{
|
||||||
|
"tokenizer" : {
|
||||||
|
"kuromoji_user_dict" : {
|
||||||
|
"type" : "kuromoji_tokenizer",
|
||||||
|
"mode" : "extended",
|
||||||
|
"discard_punctuation" : "false",
|
||||||
|
"user_dictionary" : "userdict_ja.txt"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"analyzer" : {
|
||||||
|
"my_analyzer" : {
|
||||||
|
"type" : "custom",
|
||||||
|
"tokenizer" : "kuromoji_user_dict"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Example Request using `_analyze` API :_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '東京スカイツリー'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Response :_
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tokens" : [ {
|
||||||
|
"token" : "東京",
|
||||||
|
"start_offset" : 0,
|
||||||
|
"end_offset" : 2,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 1
|
||||||
|
}, {
|
||||||
|
"token" : "スカイツリー",
|
||||||
|
"start_offset" : 2,
|
||||||
|
"end_offset" : 8,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 2
|
||||||
|
} ]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## TokenFilter : kuromoji_baseform
|
||||||
|
|
||||||
|
A token filter of type `kuromoji_baseform` that replaces term text with BaseFormAttribute.
|
||||||
|
This acts as a lemmatizer for verbs and adjectives.
|
||||||
|
|
||||||
|
### example
|
||||||
|
|
||||||
|
_Example Settings:_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
|
||||||
|
{
|
||||||
|
"settings": {
|
||||||
|
"index":{
|
||||||
|
"analysis":{
|
||||||
|
"analyzer" : {
|
||||||
|
"my_analyzer" : {
|
||||||
|
"tokenizer" : "kuromoji_tokenizer",
|
||||||
|
"filter" : ["kuromoji_baseform"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Example Request using `_analyze` API :_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '飲み'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Response :_
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tokens" : [ {
|
||||||
|
"token" : "飲む",
|
||||||
|
"start_offset" : 0,
|
||||||
|
"end_offset" : 2,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 1
|
||||||
|
} ]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## TokenFilter : kuromoji_part_of_speech
|
||||||
|
|
||||||
|
A token filter of type `kuromoji_part_of_speech` that removes tokens that match a set of part-of-speech tags.
|
||||||
|
|
||||||
|
The following are settings that can be set for a `kuromoji_part_of_speech` token filter type:
|
||||||
|
|
||||||
|
| **Setting** | **Description**                                      |
|:------------|:-----------------------------------------------------|
| stoptags    | A list of part-of-speech tags that should be removed |
|
||||||
|
|
||||||
|
Note that the default is the `stoptags.txt` file bundled in lucene-analyzers-kuromoji.jar.
|
||||||
|
|
||||||
|
### example
|
||||||
|
|
||||||
|
_Example Settings:_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
|
||||||
|
{
|
||||||
|
"settings": {
|
||||||
|
"index":{
|
||||||
|
"analysis":{
|
||||||
|
"analyzer" : {
|
||||||
|
"my_analyzer" : {
|
||||||
|
"tokenizer" : "kuromoji_tokenizer",
|
||||||
|
"filter" : ["my_posfilter"]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"filter" : {
|
||||||
|
"my_posfilter" : {
|
||||||
|
"type" : "kuromoji_part_of_speech",
|
||||||
|
"stoptags" : [
|
||||||
|
"助詞-格助詞-一般",
|
||||||
|
"助詞-終助詞"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Example Request using `_analyze` API :_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '寿司がおいしいね'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Response :_
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tokens" : [ {
|
||||||
|
"token" : "寿司",
|
||||||
|
"start_offset" : 0,
|
||||||
|
"end_offset" : 2,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 1
|
||||||
|
}, {
|
||||||
|
"token" : "おいしい",
|
||||||
|
"start_offset" : 3,
|
||||||
|
"end_offset" : 7,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 3
|
||||||
|
} ]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
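The response above keeps positions 1 and 3, which suggests that removed tokens leave a gap rather than being renumbered. A minimal sketch of that behaviour follows. It is not Lucene's implementation: the `Token` class is invented here, the terms are romanized, and the tag strings are English stand-ins for the Japanese tags used in the settings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: drops tokens by part-of-speech tag and keeps original positions.
final class PartOfSpeechStopSketch {
    static final class Token {
        final String term;
        final String posTag;
        final int position;

        Token(String term, String posTag, int position) {
            this.term = term;
            this.posTag = posTag;
            this.position = position;
        }
    }

    /** Returns the tokens whose tag is not in stopTags; positions are left untouched. */
    static List<Token> filter(List<Token> tokens, Set<String> stopTags) {
        List<Token> kept = new ArrayList<>();
        for (Token token : tokens) {
            if (!stopTags.contains(token.posTag)) {
                kept.add(token);
            }
        }
        return kept;
    }

    /** Romanized stand-in for the tokenized request text; the tags are illustrative. */
    static List<Token> sampleTokens() {
        return Arrays.asList(
                new Token("sushi", "noun-common", 1),
                new Token("ga", "particle-case-misc", 2),
                new Token("oishii", "adjective-main", 3),
                new Token("ne", "particle-final", 4));
    }

    public static void main(String[] args) {
        Set<String> stopTags = new HashSet<>(Arrays.asList("particle-case-misc", "particle-final"));
        for (Token token : filter(sampleTokens(), stopTags)) {
            System.out.println(token.position + " " + token.term);
        }
    }
}
```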
||||||
|
## TokenFilter : kuromoji_readingform
|
||||||
|
|
||||||
|
A token filter of type `kuromoji_readingform` that replaces the term attribute with the reading of a token in either katakana or romaji form.
|
||||||
|
The default reading form is katakana.
|
||||||
|
|
||||||
|
The following are settings that can be set for a `kuromoji_readingform` token filter type:
|
||||||
|
|
||||||
|
| **Setting** | **Description**                                                | **Default value** |
|:------------|:---------------------------------------------------------------|:------------------|
| use_romaji  | `true` to output the romaji reading form instead of katakana. | `false`           |
|
||||||
|
|
||||||
|
Note that the built-in `kuromoji_readingform` filter provided by elasticsearch-analysis-kuromoji sets `use_romaji` to `true` by default.
|
||||||
|
|
||||||
|
### example
|
||||||
|
|
||||||
|
_Example Settings:_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
|
||||||
|
{
|
||||||
|
"settings": {
|
||||||
|
"index":{
|
||||||
|
"analysis":{
|
||||||
|
"analyzer" : {
|
||||||
|
"romaji_analyzer" : {
|
||||||
|
"tokenizer" : "kuromoji_tokenizer",
|
||||||
|
"filter" : ["romaji_readingform"]
|
||||||
|
},
|
||||||
|
"katakana_analyzer" : {
|
||||||
|
"tokenizer" : "kuromoji_tokenizer",
|
||||||
|
"filter" : ["katakana_readingform"]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"filter" : {
|
||||||
|
"romaji_readingform" : {
|
||||||
|
"type" : "kuromoji_readingform",
|
||||||
|
"use_romaji" : true
|
||||||
|
},
|
||||||
|
"katakana_readingform" : {
|
||||||
|
"type" : "kuromoji_readingform",
|
||||||
|
"use_romaji" : false
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Example Request using `_analyze` API :_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=katakana_analyzer&pretty' -d '寿司'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Response :_
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tokens" : [ {
|
||||||
|
"token" : "スシ",
|
||||||
|
"start_offset" : 0,
|
||||||
|
"end_offset" : 2,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 1
|
||||||
|
} ]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
_Example Request using `_analyze` API :_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=romaji_analyzer&pretty' -d '寿司'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Response :_
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tokens" : [ {
|
||||||
|
"token" : "sushi",
|
||||||
|
"start_offset" : 0,
|
||||||
|
"end_offset" : 2,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 1
|
||||||
|
} ]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## TokenFilter : kuromoji_stemmer
|
||||||
|
|
||||||
|
A token filter of type `kuromoji_stemmer` that normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC).
|
||||||
|
Only katakana words longer than a minimum length are stemmed (default is four).
|
||||||
|
|
||||||
|
Note that only full-width katakana characters are supported.
|
||||||
|
|
||||||
|
The following are settings that can be set for a `kuromoji_stemmer` token filter type:
|
||||||
|
|
||||||
|
| **Setting**     | **Description**            | **Default value** |
|:----------------|:---------------------------|:------------------|
| minimum_length  | The minimum length to stem | `4`               |
|
||||||
|
|
||||||
|
### example
|
||||||
|
|
||||||
|
_Example Settings:_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
|
||||||
|
{
|
||||||
|
"settings": {
|
||||||
|
"index":{
|
||||||
|
"analysis":{
|
||||||
|
"analyzer" : {
|
||||||
|
"my_analyzer" : {
|
||||||
|
"tokenizer" : "kuromoji_tokenizer",
|
||||||
|
"filter" : ["my_katakana_stemmer"]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"filter" : {
|
||||||
|
"my_katakana_stemmer" : {
|
||||||
|
"type" : "kuromoji_stemmer",
|
||||||
|
"minimum_length" : 4
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Example Request using `_analyze` API :_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'コピー'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Response :_
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tokens" : [ {
|
||||||
|
"token" : "コピー",
|
||||||
|
"start_offset" : 0,
|
||||||
|
"end_offset" : 3,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 1
|
||||||
|
} ]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
_Example Request using `_analyze` API :_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'サーバー'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Response :_
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tokens" : [ {
|
||||||
|
"token" : "サーバ",
|
||||||
|
"start_offset" : 0,
|
||||||
|
"end_offset" : 4,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 1
|
||||||
|
} ]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
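The two responses above differ because `コピー` has three characters and `サーバー` has four. The sketch below reproduces that rule in plain Java. It is an approximation written for this README, not the Lucene `JapaneseKatakanaStemFilter` source: the class and method names are made up, and it assumes a word is stemmed when its length is at least `minimum_length` and every character is in the Unicode Katakana block. Strings use Unicode escapes so the file compiles regardless of source encoding.

```java
// Hypothetical sketch of the katakana stemming rule described above.
final class KatakanaStemSketch {
    private static final char PROLONGED_SOUND_MARK = '\u30FC';

    /** Removes a trailing U+30FC from all-katakana terms of at least minimumLength characters. */
    static String stem(String term, int minimumLength) {
        if (term.isEmpty() || term.length() < minimumLength) {
            return term; // too short to stem
        }
        for (int i = 0; i < term.length(); i++) {
            // Full-width katakana only; U+30FC itself is part of this block.
            if (Character.UnicodeBlock.of(term.charAt(i)) != Character.UnicodeBlock.KATAKANA) {
                return term;
            }
        }
        if (term.charAt(term.length() - 1) == PROLONGED_SOUND_MARK) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        // "kopii" (copy): three characters, below the minimum of four.
        System.out.println(stem("\u30B3\u30D4\u30FC", 4));
        // "saabaa" (server): four characters, so the trailing mark is dropped.
        System.out.println(stem("\u30B5\u30FC\u30D0\u30FC", 4));
    }
}
```

Note that the offsets in the response still cover the original four characters, since stemming changes only the term text.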
||||||
|
|
||||||
|
## TokenFilter : ja_stop
|
||||||
|
|
||||||
|
|
||||||
|
A token filter of type `ja_stop` that provides the predefined "_japanese_" stop words.
*Note: It only provides "_japanese_". If you want to use other predefined stop words, you can use the `stop` token filter.*
|
||||||
|
|
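A `stopwords` list can mix the named "_japanese_" set with literal words. The sketch below shows one way such a list could be resolved into a single set. It is an illustration under stated assumptions, not the Elasticsearch `Analysis.parseWords` code: the class name is made up, and the three-word "_japanese_" set in `main` is a tiny stand-in for Lucene's real list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: expands named stop word sets and keeps literal words as-is.
final class NamedStopWordsSketch {
    /** Entries that name a predefined set are replaced by its members; others are taken literally. */
    static Set<String> resolve(List<String> configured, Map<String, Set<String>> namedSets) {
        Set<String> resolved = new LinkedHashSet<>();
        for (String entry : configured) {
            Set<String> named = namedSets.get(entry);
            if (named != null) {
                resolved.addAll(named);
            } else {
                resolved.add(entry);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> namedSets = new HashMap<>();
        // Stand-in for the real list: the particles "no", "wa", and "ga".
        namedSets.put("_japanese_", new HashSet<>(Arrays.asList("\u306E", "\u306F", "\u304C")));
        // "_japanese_" plus the literal katakana word "sutoppu" (stop).
        Set<String> stopWords = resolve(
                Arrays.asList("_japanese_", "\u30B9\u30C8\u30C3\u30D7"), namedSets);
        System.out.println(stopWords.size());
    }
}
```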
||||||
|
### example

_Example Settings:_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
|
||||||
|
{
|
||||||
|
"settings": {
|
||||||
|
"index":{
|
||||||
|
"analysis":{
|
||||||
|
"analyzer" : {
|
||||||
|
"analyzer_with_ja_stop" : {
|
||||||
|
"tokenizer" : "kuromoji_tokenizer",
|
||||||
|
"filter" : ["ja_stop"]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"filter" : {
|
||||||
|
"ja_stop" : {
|
||||||
|
"type" : "ja_stop",
|
||||||
|
"stopwords" : ["_japanese_", "ストップ"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Example Request using `_analyze` API :_
|
||||||
|
|
||||||
|
```sh
|
||||||
|
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=analyzer_with_ja_stop&pretty' -d 'ストップは消える'
|
||||||
|
```
|
||||||
|
|
||||||
|
_Response :_
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"tokens" : [ {
|
||||||
|
"token" : "消える",
|
||||||
|
"start_offset" : 5,
|
||||||
|
"end_offset" : 8,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 3
|
||||||
|
} ]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
License
|
||||||
|
-------
|
||||||
|
|
||||||
|
This software is licensed under the Apache 2 license, quoted below.
|
||||||
|
|
||||||
|
Copyright 2009-2014 Elasticsearch <http://www.elasticsearch.org>
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not
|
||||||
|
use this file except in compliance with the License. You may obtain a copy of
|
||||||
|
the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
|
||||||
|
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
|
||||||
|
License for the specific language governing permissions and limitations under
|
||||||
|
the License.
|
|
@ -0,0 +1,40 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<project xmlns="http://maven.apache.org/POM/4.0.0"
|
||||||
|
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
|
||||||
|
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
|
||||||
|
<modelVersion>4.0.0</modelVersion>
|
||||||
|
|
||||||
|
<groupId>org.elasticsearch.plugin</groupId>
|
||||||
|
<artifactId>elasticsearch-analysis-kuromoji</artifactId>
|
||||||
|
|
||||||
|
<packaging>jar</packaging>
|
||||||
|
<name>Elasticsearch Japanese (kuromoji) Analysis plugin</name>
|
||||||
|
<description>The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis module into elasticsearch.</description>
|
||||||
|
|
||||||
|
<parent>
|
||||||
|
<groupId>org.elasticsearch</groupId>
|
||||||
|
<artifactId>elasticsearch-plugin</artifactId>
|
||||||
|
<version>2.0.0-SNAPSHOT</version>
|
||||||
|
</parent>
|
||||||
|
|
||||||
|
<properties>
|
||||||
|
<!-- You can add any specific project property here -->
|
||||||
|
</properties>
|
||||||
|
|
||||||
|
<dependencies>
|
||||||
|
<dependency>
|
||||||
|
<groupId>org.apache.lucene</groupId>
|
||||||
|
<artifactId>lucene-analyzers-kuromoji</artifactId>
|
||||||
|
</dependency>
|
||||||
|
</dependencies>
|
||||||
|
|
||||||
|
<build>
|
||||||
|
<plugins>
|
||||||
|
<plugin>
|
||||||
|
<groupId>org.apache.maven.plugins</groupId>
|
||||||
|
<artifactId>maven-assembly-plugin</artifactId>
|
||||||
|
</plugin>
|
||||||
|
</plugins>
|
||||||
|
</build>
|
||||||
|
|
||||||
|
</project>
|
|
@ -0,0 +1,26 @@
|
||||||
|
<?xml version="1.0"?>
|
||||||
|
<assembly>
|
||||||
|
<id>plugin</id>
|
||||||
|
<formats>
|
||||||
|
<format>zip</format>
|
||||||
|
</formats>
|
||||||
|
<includeBaseDirectory>false</includeBaseDirectory>
|
||||||
|
<dependencySets>
|
||||||
|
<dependencySet>
|
||||||
|
<outputDirectory>/</outputDirectory>
|
||||||
|
<useProjectArtifact>true</useProjectArtifact>
|
||||||
|
<useTransitiveFiltering>true</useTransitiveFiltering>
|
||||||
|
<excludes>
|
||||||
|
<exclude>org.elasticsearch:elasticsearch</exclude>
|
||||||
|
</excludes>
|
||||||
|
</dependencySet>
|
||||||
|
<dependencySet>
|
||||||
|
<outputDirectory>/</outputDirectory>
|
||||||
|
<useProjectArtifact>true</useProjectArtifact>
|
||||||
|
<useTransitiveFiltering>true</useTransitiveFiltering>
|
||||||
|
<includes>
|
||||||
|
<include>org.apache.lucene:lucene-analyzers-kuromoji</include>
|
||||||
|
</includes>
|
||||||
|
</dependencySet>
|
||||||
|
</dependencySets>
|
||||||
|
</assembly>
|
|
@ -0,0 +1,76 @@
|
||||||
|
/*
|
||||||
|
* Licensed to Elasticsearch under one or more contributor
|
||||||
|
* license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright
|
||||||
|
* ownership. Elasticsearch licenses this file to you under
|
||||||
|
* the Apache License, Version 2.0 (the "License"); you may
|
||||||
|
* not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing,
|
||||||
|
* software distributed under the License is distributed on an
|
||||||
|
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
* KIND, either express or implied. See the License for the
|
||||||
|
* specific language governing permissions and limitations
|
||||||
|
* under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
package org.elasticsearch.index.analysis;
|
||||||
|
|
||||||
|
|
||||||
|
import org.apache.lucene.analysis.TokenStream;
|
||||||
|
import org.apache.lucene.analysis.core.StopFilter;
|
||||||
|
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
|
||||||
|
import org.apache.lucene.analysis.util.CharArraySet;
|
||||||
|
import org.apache.lucene.search.suggest.analyzing.SuggestStopFilter;
|
||||||
|
import org.elasticsearch.common.collect.MapBuilder;
|
||||||
|
import org.elasticsearch.common.inject.Inject;
|
||||||
|
import org.elasticsearch.common.inject.assistedinject.Assisted;
|
||||||
|
import org.elasticsearch.common.settings.Settings;
|
||||||
|
import org.elasticsearch.env.Environment;
|
||||||
|
import org.elasticsearch.index.Index;
|
||||||
|
import org.elasticsearch.index.settings.IndexSettings;
|
||||||
|
|
||||||
|
import java.util.Map;
|
||||||
|
import java.util.Set;
|
||||||
|
|
||||||
|
public class JapaneseStopTokenFilterFactory extends AbstractTokenFilterFactory {
|
||||||
|
|
||||||
|
|
||||||
|
private final CharArraySet stopWords;
|
||||||
|
|
||||||
|
private final boolean ignoreCase;
|
||||||
|
|
||||||
|
private final boolean removeTrailing;
|
||||||
|
|
||||||
|
@Inject
|
||||||
|
public JapaneseStopTokenFilterFactory(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
|
||||||
|
super(index, indexSettings, name, settings);
|
||||||
|
this.ignoreCase = settings.getAsBoolean("ignore_case", false);
|
||||||
|
this.removeTrailing = settings.getAsBoolean("remove_trailing", true);
|
||||||
|
Map<String, Set<?>> namedStopWords = MapBuilder.<String, Set<?>>newMapBuilder()
|
||||||
|
.put("_japanese_", JapaneseAnalyzer.getDefaultStopSet())
|
||||||
|
.immutableMap();
|
||||||
|
this.stopWords = Analysis.parseWords(env, settings, "stopwords", JapaneseAnalyzer.getDefaultStopSet(), namedStopWords, ignoreCase);
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public TokenStream create(TokenStream tokenStream) {
|
||||||
|
if (removeTrailing) {
|
||||||
|
return new StopFilter(tokenStream, stopWords);
|
||||||
|
} else {
|
||||||
|
return new SuggestStopFilter(tokenStream, stopWords);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
public Set<?> stopWords() {
|
||||||
|
return stopWords;
|
||||||
|
}
|
||||||
|
|
||||||
|
public boolean ignoreCase() {
|
||||||
|
return ignoreCase;
|
||||||
|
}
|
||||||
|
|
||||||
|
}
|
|
@ -0,0 +1,56 @@
|
||||||
|
/*
|
||||||
|
* Licensed to Elasticsearch under one or more contributor
|
||||||
|
* license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright
|
||||||
|
* ownership. Elasticsearch licenses this file to you under
|
||||||
|
* the Apache License, Version 2.0 (the "License"); you may
|
||||||
|
* not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing,
|
||||||
|
* software distributed under the License is distributed on an
|
||||||
|
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
* KIND, either express or implied. See the License for the
|
||||||
|
* specific language governing permissions and limitations
|
||||||
|
* under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
package org.elasticsearch.index.analysis;
|
||||||
|
|
||||||
|
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
|
||||||
|
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
|
||||||
|
import org.apache.lucene.analysis.ja.dict.UserDictionary;
|
||||||
|
import org.apache.lucene.analysis.util.CharArraySet;
|
||||||
|
import org.elasticsearch.common.inject.Inject;
|
||||||
|
import org.elasticsearch.common.inject.assistedinject.Assisted;
|
||||||
|
import org.elasticsearch.common.settings.Settings;
|
||||||
|
import org.elasticsearch.env.Environment;
|
||||||
|
import org.elasticsearch.index.Index;
|
||||||
|
import org.elasticsearch.index.settings.IndexSettings;
|
||||||
|
|
||||||
|
import java.util.Set;
|
||||||
|
|
||||||
|
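/**
 * Provides a Lucene {@link JapaneseAnalyzer} configured from the analyzer settings
 * (stop words, tokenization mode, and user dictionary).
 */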
|
||||||
|
public class KuromojiAnalyzerProvider extends AbstractIndexAnalyzerProvider<JapaneseAnalyzer> {
|
||||||
|
|
||||||
|
private final JapaneseAnalyzer analyzer;
|
||||||
|
|
||||||
|
@Inject
|
||||||
|
public KuromojiAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
|
||||||
|
super(index, indexSettings, name, settings);
|
||||||
|
final Set<?> stopWords = Analysis.parseStopWords(env, settings, JapaneseAnalyzer.getDefaultStopSet());
|
||||||
|
final JapaneseTokenizer.Mode mode = KuromojiTokenizerFactory.getMode(settings);
|
||||||
|
final UserDictionary userDictionary = KuromojiTokenizerFactory.getUserDictionary(env, settings);
|
||||||
|
analyzer = new JapaneseAnalyzer(userDictionary, mode, CharArraySet.copy(stopWords), JapaneseAnalyzer.getDefaultStopTags());
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public JapaneseAnalyzer get() {
|
||||||
|
return this.analyzer;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
}
|
|
@ -0,0 +1,41 @@
|
||||||
|
/*
|
||||||
|
* Licensed to Elasticsearch under one or more contributor
|
||||||
|
* license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright
|
||||||
|
* ownership. Elasticsearch licenses this file to you under
|
||||||
|
* the Apache License, Version 2.0 (the "License"); you may
|
||||||
|
* not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing,
|
||||||
|
* software distributed under the License is distributed on an
|
||||||
|
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
* KIND, either express or implied. See the License for the
|
||||||
|
* specific language governing permissions and limitations
|
||||||
|
* under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
package org.elasticsearch.index.analysis;
|
||||||
|
|
||||||
|
import org.apache.lucene.analysis.TokenStream;
|
||||||
|
import org.apache.lucene.analysis.ja.JapaneseBaseFormFilter;
|
||||||
|
import org.elasticsearch.common.inject.Inject;
|
||||||
|
import org.elasticsearch.common.inject.assistedinject.Assisted;
|
||||||
|
import org.elasticsearch.common.settings.Settings;
|
||||||
|
import org.elasticsearch.index.Index;
|
||||||
|
import org.elasticsearch.index.settings.IndexSettings;
|
||||||
|
|
||||||
|
public class KuromojiBaseFormFilterFactory extends AbstractTokenFilterFactory {
|
||||||
|
|
||||||
|
@Inject
|
||||||
|
public KuromojiBaseFormFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
|
||||||
|
super(index, indexSettings, name, settings);
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public TokenStream create(TokenStream tokenStream) {
|
||||||
|
return new JapaneseBaseFormFilter(tokenStream);
|
||||||
|
}
|
||||||
|
}
|
|
@ -0,0 +1,48 @@
|
||||||
|
/*
|
||||||
|
* Licensed to Elasticsearch under one or more contributor
|
||||||
|
* license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright
|
||||||
|
* ownership. Elasticsearch licenses this file to you under
|
||||||
|
* the Apache License, Version 2.0 (the "License"); you may
|
||||||
|
* not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing,
|
||||||
|
* software distributed under the License is distributed on an
|
||||||
|
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
* KIND, either express or implied. See the License for the
|
||||||
|
* specific language governing permissions and limitations
|
||||||
|
* under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
package org.elasticsearch.index.analysis;
|
||||||
|
|
||||||
|
import org.apache.lucene.analysis.ja.JapaneseIterationMarkCharFilter;
|
||||||
|
import org.elasticsearch.common.inject.Inject;
|
||||||
|
import org.elasticsearch.common.inject.assistedinject.Assisted;
|
||||||
|
import org.elasticsearch.common.settings.Settings;
|
||||||
|
import org.elasticsearch.index.Index;
|
||||||
|
import org.elasticsearch.index.settings.IndexSettings;
|
||||||
|
|
||||||
|
import java.io.Reader;
|
||||||
|
|
||||||
|
public class KuromojiIterationMarkCharFilterFactory extends AbstractCharFilterFactory {
|
||||||
|
|
||||||
|
private final boolean normalizeKanji;
|
||||||
|
private final boolean normalizeKana;
|
||||||
|
|
||||||
|
@Inject
|
||||||
|
public KuromojiIterationMarkCharFilterFactory(Index index, @IndexSettings Settings indexSettings,
|
||||||
|
@Assisted String name, @Assisted Settings settings) {
|
||||||
|
super(index, indexSettings, name);
|
||||||
|
normalizeKanji = settings.getAsBoolean("normalize_kanji", JapaneseIterationMarkCharFilter.NORMALIZE_KANJI_DEFAULT);
|
||||||
|
normalizeKana = settings.getAsBoolean("normalize_kana", JapaneseIterationMarkCharFilter.NORMALIZE_KANA_DEFAULT);
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public Reader create(Reader reader) {
|
||||||
|
return new JapaneseIterationMarkCharFilter(reader, normalizeKanji, normalizeKana);
|
||||||
|
}
|
||||||
|
}
|
|
@ -0,0 +1,44 @@
|
||||||
|
/*
|
||||||
|
* Licensed to Elasticsearch under one or more contributor
|
||||||
|
* license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright
|
||||||
|
* ownership. Elasticsearch licenses this file to you under
|
||||||
|
* the Apache License, Version 2.0 (the "License"); you may
|
||||||
|
* not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing,
|
||||||
|
* software distributed under the License is distributed on an
|
||||||
|
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
* KIND, either express or implied. See the License for the
|
||||||
|
* specific language governing permissions and limitations
|
||||||
|
* under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
package org.elasticsearch.index.analysis;
|
||||||
|
|
||||||
|
import org.apache.lucene.analysis.TokenStream;
|
||||||
|
import org.apache.lucene.analysis.ja.JapaneseKatakanaStemFilter;
|
||||||
|
import org.elasticsearch.common.inject.Inject;
|
||||||
|
import org.elasticsearch.common.inject.assistedinject.Assisted;
|
||||||
|
import org.elasticsearch.common.settings.Settings;
|
||||||
|
import org.elasticsearch.index.Index;
|
||||||
|
import org.elasticsearch.index.settings.IndexSettings;
|
||||||
|
|
||||||
|
public class KuromojiKatakanaStemmerFactory extends AbstractTokenFilterFactory {
|
||||||
|
|
||||||
|
private final int minimumLength;
|
||||||
|
|
||||||
|
@Inject
|
||||||
|
public KuromojiKatakanaStemmerFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
|
||||||
|
super(index, indexSettings, name, settings);
|
||||||
|
minimumLength = settings.getAsInt("minimum_length", JapaneseKatakanaStemFilter.DEFAULT_MINIMUM_LENGTH);
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public TokenStream create(TokenStream tokenStream) {
|
||||||
|
return new JapaneseKatakanaStemFilter(tokenStream, minimumLength);
|
||||||
|
}
|
||||||
|
}
|
|
@ -0,0 +1,53 @@
|
||||||
|
/*
|
||||||
|
* Licensed to Elasticsearch under one or more contributor
|
||||||
|
* license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright
|
||||||
|
* ownership. Elasticsearch licenses this file to you under
|
||||||
|
* the Apache License, Version 2.0 (the "License"); you may
|
||||||
|
* not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing,
|
||||||
|
* software distributed under the License is distributed on an
|
||||||
|
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
* KIND, either express or implied. See the License for the
|
||||||
|
* specific language governing permissions and limitations
|
||||||
|
* under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
package org.elasticsearch.index.analysis;
|
||||||
|
|
||||||
|
import org.apache.lucene.analysis.TokenStream;
|
||||||
|
import org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilter;
|
||||||
|
import org.elasticsearch.common.inject.Inject;
|
||||||
|
import org.elasticsearch.common.inject.assistedinject.Assisted;
|
||||||
|
import org.elasticsearch.common.settings.Settings;
|
||||||
|
import org.elasticsearch.env.Environment;
|
||||||
|
import org.elasticsearch.index.Index;
|
||||||
|
import org.elasticsearch.index.settings.IndexSettings;
|
||||||
|
|
||||||
|
import java.util.HashSet;
|
||||||
|
import java.util.List;
|
||||||
|
import java.util.Set;
|
||||||
|
|
||||||
|
public class KuromojiPartOfSpeechFilterFactory extends AbstractTokenFilterFactory {
|
||||||
|
|
||||||
|
private final Set<String> stopTags = new HashSet<String>();
|
||||||
|
|
||||||
|
@Inject
|
||||||
|
public KuromojiPartOfSpeechFilterFactory(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
|
||||||
|
super(index, indexSettings, name, settings);
|
||||||
|
List<String> wordList = Analysis.getWordList(env, settings, "stoptags");
|
||||||
|
if (wordList != null) {
|
||||||
|
stopTags.addAll(wordList);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public TokenStream create(TokenStream tokenStream) {
|
||||||
|
return new JapanesePartOfSpeechStopFilter(tokenStream, stopTags);
|
||||||
|
}
|
||||||
|
|
||||||
|
}
|
|
@ -0,0 +1,44 @@
|
||||||
|
/*
|
||||||
|
* Licensed to Elasticsearch under one or more contributor
|
||||||
|
* license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright
|
||||||
|
* ownership. Elasticsearch licenses this file to you under
|
||||||
|
* the Apache License, Version 2.0 (the "License"); you may
|
||||||
|
* not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing,
|
||||||
|
* software distributed under the License is distributed on an
|
||||||
|
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
* KIND, either express or implied. See the License for the
|
||||||
|
* specific language governing permissions and limitations
|
||||||
|
* under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
package org.elasticsearch.index.analysis;
|
||||||
|
|
||||||
|
import org.apache.lucene.analysis.TokenStream;
|
||||||
|
import org.apache.lucene.analysis.ja.JapaneseReadingFormFilter;
|
||||||
|
import org.elasticsearch.common.inject.Inject;
|
||||||
|
import org.elasticsearch.common.inject.assistedinject.Assisted;
|
||||||
|
import org.elasticsearch.common.settings.Settings;
|
||||||
|
import org.elasticsearch.index.Index;
|
||||||
|
import org.elasticsearch.index.settings.IndexSettings;
|
||||||
|
|
||||||
|
public class KuromojiReadingFormFilterFactory extends AbstractTokenFilterFactory {
|
||||||
|
|
||||||
|
private final boolean useRomaji;
|
||||||
|
|
||||||
|
@Inject
|
||||||
|
public KuromojiReadingFormFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
|
||||||
|
super(index, indexSettings, name, settings);
|
||||||
|
useRomaji = settings.getAsBoolean("use_romaji", false);
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public TokenStream create(TokenStream tokenStream) {
|
||||||
|
return new JapaneseReadingFormFilter(tokenStream, useRomaji);
|
||||||
|
}
|
||||||
|
}
|
|
@ -0,0 +1,93 @@
|
||||||
|
/*
|
||||||
|
* Licensed to Elasticsearch under one or more contributor
|
||||||
|
* license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright
|
||||||
|
* ownership. Elasticsearch licenses this file to you under
|
||||||
|
* the Apache License, Version 2.0 (the "License"); you may
|
||||||
|
* not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing,
|
||||||
|
* software distributed under the License is distributed on an
|
||||||
|
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
* KIND, either express or implied. See the License for the
|
||||||
|
* specific language governing permissions and limitations
|
||||||
|
* under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
package org.elasticsearch.index.analysis;
|
||||||
|
|
||||||
|
import org.apache.lucene.analysis.Tokenizer;
|
||||||
|
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
|
||||||
|
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
|
||||||
|
import org.apache.lucene.analysis.ja.dict.UserDictionary;
|
||||||
|
import org.elasticsearch.ElasticsearchException;
|
||||||
|
import org.elasticsearch.common.inject.Inject;
|
||||||
|
import org.elasticsearch.common.inject.assistedinject.Assisted;
|
||||||
|
import org.elasticsearch.common.settings.Settings;
|
||||||
|
import org.elasticsearch.env.Environment;
|
||||||
|
import org.elasticsearch.index.Index;
|
||||||
|
import org.elasticsearch.index.settings.IndexSettings;
|
||||||
|
|
||||||
|
import java.io.IOException;
|
||||||
|
import java.io.Reader;
|
||||||
|
|
||||||
|
/**
 * Factory for the Lucene {@link JapaneseTokenizer}, configured through the
 * "mode", "user_dictionary" and "discard_punctuation" settings.
 */
public class KuromojiTokenizerFactory extends AbstractTokenizerFactory {

    private static final String USER_DICT_OPTION = "user_dictionary";

    private final UserDictionary userDictionary;
    private final Mode mode;

    private boolean discardPunctuation;

    @Inject
    public KuromojiTokenizerFactory(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        mode = getMode(settings);
        userDictionary = getUserDictionary(env, settings);
        discardPunctuation = settings.getAsBoolean("discard_punctuation", true);
    }

    /**
     * Loads the user dictionary named by the "user_dictionary" setting.
     * Returns null when no reader is available for that setting.
     */
    public static UserDictionary getUserDictionary(Environment env, Settings settings) {
        try {
            final Reader reader = Analysis.getReaderFromFile(env, settings, USER_DICT_OPTION);
            if (reader == null) {
                return null;
            } else {
                try {
                    return UserDictionary.open(reader);
                } finally {
                    reader.close();
                }
            }
        } catch (IOException e) {
            throw new ElasticsearchException("failed to load kuromoji user dictionary", e);
        }
    }

    /**
     * Resolves the "mode" setting case-insensitively. A missing or
     * unrecognized value falls back to {@link JapaneseTokenizer#DEFAULT_MODE}.
     */
    public static JapaneseTokenizer.Mode getMode(Settings settings) {
        JapaneseTokenizer.Mode mode = JapaneseTokenizer.DEFAULT_MODE;
        String modeSetting = settings.get("mode", null);
        if (modeSetting != null) {
            if ("search".equalsIgnoreCase(modeSetting)) {
                mode = JapaneseTokenizer.Mode.SEARCH;
            } else if ("normal".equalsIgnoreCase(modeSetting)) {
                mode = JapaneseTokenizer.Mode.NORMAL;
            } else if ("extended".equalsIgnoreCase(modeSetting)) {
                mode = JapaneseTokenizer.Mode.EXTENDED;
            }
        }
        return mode;
    }

    @Override
    public Tokenizer create() {
        return new JapaneseTokenizer(userDictionary, discardPunctuation, mode);
    }

}
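
The mode lookup in `getMode` can be tried out in isolation. The following is only a minimal sketch under stated assumptions: `ModeSettingSketch`, its nested `Mode` enum, and `parseMode` are hypothetical stand-ins invented for illustration (they are not part of the plugin or of Lucene), a plain `Map` replaces the Elasticsearch `Settings` object, and the default is assumed to be `SEARCH`.

```java
import java.util.Collections;
import java.util.Map;

public class ModeSettingSketch {

    // Hypothetical stand-in for JapaneseTokenizer.Mode, so the sketch needs no Lucene jar.
    enum Mode { NORMAL, SEARCH, EXTENDED }

    // Assumption for this sketch: the default mode is SEARCH.
    static final Mode DEFAULT_MODE = Mode.SEARCH;

    // Mirrors the shape of getMode(): case-insensitive match on the "mode" key,
    // falling back to the default when the key is absent or the value is unknown.
    static Mode parseMode(Map<String, String> settings) {
        Mode mode = DEFAULT_MODE;
        String modeSetting = settings.get("mode");
        if (modeSetting != null) {
            if ("search".equalsIgnoreCase(modeSetting)) {
                mode = Mode.SEARCH;
            } else if ("normal".equalsIgnoreCase(modeSetting)) {
                mode = Mode.NORMAL;
            } else if ("extended".equalsIgnoreCase(modeSetting)) {
                mode = Mode.EXTENDED;
            }
        }
        return mode;
    }

    public static void main(String[] args) {
        // A known value in mixed case, an unknown value, and a missing key.
        System.out.println(parseMode(Collections.singletonMap("mode", "Extended")));
        System.out.println(parseMode(Collections.singletonMap("mode", "bogus")));
        System.out.println(parseMode(Collections.<String, String>emptyMap()));
    }
}
```

One consequence of this shape is that a misspelled mode is accepted silently and behaves like the default, which may be worth keeping in mind when a tokenizer does not split text as expected.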
|
|
@ -0,0 +1,131 @@
|
||||||
|
/*
|
||||||
|
* Licensed to Elasticsearch under one or more contributor
|
||||||
|
* license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright
|
||||||
|
* ownership. Elasticsearch licenses this file to you under
|
||||||
|
* the Apache License, Version 2.0 (the "License"); you may
|
||||||
|
* not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing,
|
||||||
|
* software distributed under the License is distributed on an
|
||||||
|
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
* KIND, either express or implied. See the License for the
|
||||||
|
* specific language governing permissions and limitations
|
||||||
|
* under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
package org.elasticsearch.indices.analysis;
|
||||||
|
|
||||||
|
import org.apache.lucene.analysis.TokenStream;
|
||||||
|
import org.apache.lucene.analysis.Tokenizer;
|
||||||
|
import org.apache.lucene.analysis.ja.*;
|
||||||
|
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
|
||||||
|
import org.elasticsearch.common.component.AbstractComponent;
|
||||||
|
import org.elasticsearch.common.inject.Inject;
|
||||||
|
import org.elasticsearch.common.settings.Settings;
|
||||||
|
import org.elasticsearch.index.analysis.*;
|
||||||
|
|
||||||
|
import java.io.Reader;
|
||||||
|
|
||||||
|
/**
 * Registers indices-level analysis components so that, unless explicitly
 * configured per index, they are shared among all indices.
 */
|
||||||
|
public class KuromojiIndicesAnalysis extends AbstractComponent {
|
||||||
|
|
||||||
|
@Inject
|
||||||
|
public KuromojiIndicesAnalysis(Settings settings,
|
||||||
|
IndicesAnalysisService indicesAnalysisService) {
|
||||||
|
super(settings);
|
||||||
|
|
||||||
|
indicesAnalysisService.analyzerProviderFactories().put("kuromoji",
|
||||||
|
new PreBuiltAnalyzerProviderFactory("kuromoji", AnalyzerScope.INDICES,
|
||||||
|
new JapaneseAnalyzer()));
|
||||||
|
|
||||||
|
indicesAnalysisService.charFilterFactories().put("kuromoji_iteration_mark",
|
||||||
|
new PreBuiltCharFilterFactoryFactory(new CharFilterFactory() {
|
||||||
|
@Override
|
||||||
|
public String name() {
|
||||||
|
return "kuromoji_iteration_mark";
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public Reader create(Reader reader) {
|
||||||
|
return new JapaneseIterationMarkCharFilter(reader,
|
||||||
|
JapaneseIterationMarkCharFilter.NORMALIZE_KANJI_DEFAULT,
|
||||||
|
JapaneseIterationMarkCharFilter.NORMALIZE_KANA_DEFAULT);
|
||||||
|
}
|
||||||
|
}));
|
||||||
|
|
||||||
|
indicesAnalysisService.tokenizerFactories().put("kuromoji_tokenizer",
|
||||||
|
new PreBuiltTokenizerFactoryFactory(new TokenizerFactory() {
|
||||||
|
@Override
|
||||||
|
public String name() {
|
||||||
|
return "kuromoji_tokenizer";
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public Tokenizer create() {
|
||||||
|
return new JapaneseTokenizer(null, true, Mode.SEARCH);
|
||||||
|
}
|
||||||
|
}));
|
||||||
|
|
||||||
|
indicesAnalysisService.tokenFilterFactories().put("kuromoji_baseform",
|
||||||
|
new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
|
||||||
|
@Override
|
||||||
|
public String name() {
|
||||||
|
return "kuromoji_baseform";
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public TokenStream create(TokenStream tokenStream) {
|
||||||
|
return new JapaneseBaseFormFilter(tokenStream);
|
||||||
|
}
|
||||||
|
}));
|
||||||
|
|
||||||
|
indicesAnalysisService.tokenFilterFactories().put(
|
||||||
|
"kuromoji_part_of_speech",
|
||||||
|
new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
|
||||||
|
@Override
|
||||||
|
public String name() {
|
||||||
|
return "kuromoji_part_of_speech";
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public TokenStream create(TokenStream tokenStream) {
|
||||||
|
return new JapanesePartOfSpeechStopFilter(tokenStream, JapaneseAnalyzer
|
||||||
|
.getDefaultStopTags());
|
||||||
|
}
|
||||||
|
}));
|
||||||
|
|
||||||
|
indicesAnalysisService.tokenFilterFactories().put(
|
||||||
|
"kuromoji_readingform",
|
||||||
|
new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
|
||||||
|
@Override
|
||||||
|
public String name() {
|
||||||
|
return "kuromoji_readingform";
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public TokenStream create(TokenStream tokenStream) {
|
||||||
|
return new JapaneseReadingFormFilter(tokenStream, true);
|
||||||
|
}
|
||||||
|
}));
|
||||||
|
|
||||||
|
indicesAnalysisService.tokenFilterFactories().put("kuromoji_stemmer",
|
||||||
|
new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
|
||||||
|
@Override
|
||||||
|
public String name() {
|
||||||
|
return "kuromoji_stemmer";
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public TokenStream create(TokenStream tokenStream) {
|
||||||
|
return new JapaneseKatakanaStemFilter(tokenStream);
|
||||||
|
}
|
||||||
|
}));
|
||||||
|
}
|
||||||
|
}
|
|
@ -0,0 +1,32 @@
|
||||||
|
/*
|
||||||
|
* Licensed to Elasticsearch under one or more contributor
|
||||||
|
* license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright
|
||||||
|
* ownership. Elasticsearch licenses this file to you under
|
||||||
|
* the Apache License, Version 2.0 (the "License"); you may
|
||||||
|
* not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing,
|
||||||
|
* software distributed under the License is distributed on an
|
||||||
|
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
* KIND, either express or implied. See the License for the
|
||||||
|
* specific language governing permissions and limitations
|
||||||
|
* under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
package org.elasticsearch.indices.analysis;
|
||||||
|
|
||||||
|
import org.elasticsearch.common.inject.AbstractModule;
|
||||||
|
|
||||||
|
/**
 * Guice module that binds {@link KuromojiIndicesAnalysis} as an eager singleton.
 */
|
||||||
|
public class KuromojiIndicesAnalysisModule extends AbstractModule {
|
||||||
|
|
||||||
|
@Override
|
||||||
|
protected void configure() {
|
||||||
|
bind(KuromojiIndicesAnalysis.class).asEagerSingleton();
|
||||||
|
}
|
||||||
|
}
|
|
@ -0,0 +1,62 @@
|
||||||
|
/*
|
||||||
|
* Licensed to Elasticsearch under one or more contributor
|
||||||
|
* license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright
|
||||||
|
* ownership. Elasticsearch licenses this file to you under
|
||||||
|
* the Apache License, Version 2.0 (the "License"); you may
|
||||||
|
* not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing,
|
||||||
|
* software distributed under the License is distributed on an
|
||||||
|
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
* KIND, either express or implied. See the License for the
|
||||||
|
* specific language governing permissions and limitations
|
||||||
|
* under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
package org.elasticsearch.plugin.analysis.kuromoji;
|
||||||
|
|
||||||
|
import org.elasticsearch.common.inject.Module;
|
||||||
|
import org.elasticsearch.index.analysis.*;
|
||||||
|
import org.elasticsearch.indices.analysis.KuromojiIndicesAnalysisModule;
|
||||||
|
import org.elasticsearch.plugins.AbstractPlugin;
|
||||||
|
|
||||||
|
import java.util.ArrayList;
|
||||||
|
import java.util.Collection;
|
||||||
|
|
||||||
|
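/**
 * Plugin entry point: contributes {@link KuromojiIndicesAnalysisModule} and
 * registers the kuromoji analyzer, tokenizer, char filter and token filters
 * with the {@link AnalysisModule}.
 */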
|
||||||
|
public class AnalysisKuromojiPlugin extends AbstractPlugin {
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public String name() {
|
||||||
|
return "analysis-kuromoji";
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public String description() {
|
||||||
|
return "Kuromoji analysis support";
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public Collection<Class<? extends Module>> modules() {
|
||||||
|
Collection<Class<? extends Module>> classes = new ArrayList<>();
|
||||||
|
classes.add(KuromojiIndicesAnalysisModule.class);
|
||||||
|
return classes;
|
||||||
|
}
|
||||||
|
|
||||||
|
public void onModule(AnalysisModule module) {
|
||||||
|
module.addCharFilter("kuromoji_iteration_mark", KuromojiIterationMarkCharFilterFactory.class);
|
||||||
|
module.addAnalyzer("kuromoji", KuromojiAnalyzerProvider.class);
|
||||||
|
module.addTokenizer("kuromoji_tokenizer", KuromojiTokenizerFactory.class);
|
||||||
|
module.addTokenFilter("kuromoji_baseform", KuromojiBaseFormFilterFactory.class);
|
||||||
|
module.addTokenFilter("kuromoji_part_of_speech", KuromojiPartOfSpeechFilterFactory.class);
|
||||||
|
module.addTokenFilter("kuromoji_readingform", KuromojiReadingFormFilterFactory.class);
|
||||||
|
module.addTokenFilter("kuromoji_stemmer", KuromojiKatakanaStemmerFactory.class);
|
||||||
|
module.addTokenFilter("ja_stop", JapaneseStopTokenFilterFactory.class);
|
||||||
|
}
|
||||||
|
}
|
|
@ -0,0 +1,3 @@
|
||||||
|
plugin=org.elasticsearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin
|
||||||
|
version=${project.version}
|
||||||
|
lucene=${lucene.version}
|
|
@ -0,0 +1,267 @@
|
||||||
|
/*
|
||||||
|
* Licensed to Elasticsearch under one or more contributor
|
||||||
|
* license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright
|
||||||
|
* ownership. Elasticsearch licenses this file to you under
|
||||||
|
* the Apache License, Version 2.0 (the "License"); you may
|
||||||
|
* not use this file except in compliance with the License.
|
||||||
|
* You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing,
|
||||||
|
* software distributed under the License is distributed on an
|
||||||
|
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||||
|
* KIND, either express or implied. See the License for the
|
||||||
|
* specific language governing permissions and limitations
|
||||||
|
* under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
package org.elasticsearch.index.analysis;
|
||||||
|
|
||||||
|
import org.apache.lucene.analysis.TokenStream;
|
||||||
|
import org.apache.lucene.analysis.Tokenizer;
|
||||||
|
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
|
||||||
|
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
|
||||||
|
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
|
||||||
|
import org.elasticsearch.Version;
|
||||||
|
import org.elasticsearch.cluster.metadata.IndexMetaData;
|
||||||
|
import org.elasticsearch.common.inject.Injector;
|
||||||
|
import org.elasticsearch.common.inject.ModulesBuilder;
|
||||||
|
import org.elasticsearch.common.settings.Settings;
|
||||||
|
import org.elasticsearch.common.settings.SettingsModule;
|
||||||
|
import org.elasticsearch.env.Environment;
|
||||||
|
import org.elasticsearch.env.EnvironmentModule;
|
||||||
|
import org.elasticsearch.index.Index;
|
||||||
|
import org.elasticsearch.index.IndexNameModule;
|
||||||
|
import org.elasticsearch.index.settings.IndexSettingsModule;
|
||||||
|
import org.elasticsearch.indices.analysis.IndicesAnalysisModule;
|
||||||
|
import org.elasticsearch.indices.analysis.IndicesAnalysisService;
|
||||||
|
import org.elasticsearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin;
|
||||||
|
import org.elasticsearch.test.ElasticsearchTestCase;
|
||||||
|
import org.junit.Test;
|
||||||
|
|
||||||
|
import java.io.IOException;
|
||||||
|
import java.io.Reader;
|
||||||
|
import java.io.StringReader;
|
||||||
|
|
||||||
|
import static org.hamcrest.Matchers.*;
|
||||||
|
|
||||||
|
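/**
 * Checks that the kuromoji analysis components are registered and behave as
 * configured in kuromoji_analysis.json.
 */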
|
||||||
|
public class KuromojiAnalysisTests extends ElasticsearchTestCase {
|
||||||
|
|
||||||
|
@Test
|
||||||
|
public void testDefaultsKuromojiAnalysis() throws IOException {
|
||||||
|
AnalysisService analysisService = createAnalysisService();
|
||||||
|
|
||||||
|
TokenizerFactory tokenizerFactory = analysisService.tokenizer("kuromoji_tokenizer");
|
||||||
|
assertThat(tokenizerFactory, instanceOf(KuromojiTokenizerFactory.class));
|
||||||
|
|
||||||
|
TokenFilterFactory filterFactory = analysisService.tokenFilter("kuromoji_part_of_speech");
|
||||||
|
assertThat(filterFactory, instanceOf(KuromojiPartOfSpeechFilterFactory.class));
|
||||||
|
|
||||||
|
filterFactory = analysisService.tokenFilter("kuromoji_readingform");
|
||||||
|
assertThat(filterFactory, instanceOf(KuromojiReadingFormFilterFactory.class));
|
||||||
|
|
||||||
|
filterFactory = analysisService.tokenFilter("kuromoji_baseform");
|
||||||
|
assertThat(filterFactory, instanceOf(KuromojiBaseFormFilterFactory.class));
|
||||||
|
|
||||||
|
filterFactory = analysisService.tokenFilter("kuromoji_stemmer");
|
||||||
|
assertThat(filterFactory, instanceOf(KuromojiKatakanaStemmerFactory.class));
|
||||||
|
|
||||||
|
filterFactory = analysisService.tokenFilter("ja_stop");
|
||||||
|
assertThat(filterFactory, instanceOf(JapaneseStopTokenFilterFactory.class));
|
||||||
|
|
||||||
|
NamedAnalyzer analyzer = analysisService.analyzer("kuromoji");
|
||||||
|
assertThat(analyzer.analyzer(), instanceOf(JapaneseAnalyzer.class));
|
||||||
|
|
||||||
|
analyzer = analysisService.analyzer("my_analyzer");
|
||||||
|
assertThat(analyzer.analyzer(), instanceOf(CustomAnalyzer.class));
|
||||||
|
assertThat(analyzer.analyzer().tokenStream(null, new StringReader("")), instanceOf(JapaneseTokenizer.class));
|
||||||
|
|
||||||
|
CharFilterFactory charFilterFactory = analysisService.charFilter("kuromoji_iteration_mark");
|
||||||
|
assertThat(charFilterFactory, instanceOf(KuromojiIterationMarkCharFilterFactory.class));
|
||||||
|
|
||||||
|
}
|
||||||
|
|
||||||
|
@Test
|
||||||
|
// NOTE: despite its name, this test exercises the "kuromoji_pos" part-of-speech filter.
public void testBaseFormFilterFactory() throws IOException {
|
||||||
|
AnalysisService analysisService = createAnalysisService();
|
||||||
|
TokenFilterFactory tokenFilter = analysisService.tokenFilter("kuromoji_pos");
|
||||||
|
assertThat(tokenFilter, instanceOf(KuromojiPartOfSpeechFilterFactory.class));
|
||||||
|
String source = "私は制限スピードを超える。";
|
||||||
|
String[] expected = new String[]{"私", "は", "制限", "スピード", "を"};
|
||||||
|
Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
|
||||||
|
tokenizer.setReader(new StringReader(source));
|
||||||
|
assertSimpleTSOutput(tokenFilter.create(tokenizer), expected);
|
||||||
|
}
|
||||||
|
|
||||||
|
@Test
|
||||||
|
public void testReadingFormFilterFactory() throws IOException {
|
||||||
|
AnalysisService analysisService = createAnalysisService();
|
||||||
|
TokenFilterFactory tokenFilter = analysisService.tokenFilter("kuromoji_rf");
|
||||||
|
assertThat(tokenFilter, instanceOf(KuromojiReadingFormFilterFactory.class));
|
||||||
|
String source = "今夜はロバート先生と話した";
|
||||||
|
String[] expected_tokens_romaji = new String[]{"kon'ya", "ha", "robato", "sensei", "to", "hanashi", "ta"};
|
||||||
|
|
||||||
|
Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
|
||||||
|
tokenizer.setReader(new StringReader(source));
|
||||||
|
|
||||||
|
assertSimpleTSOutput(tokenFilter.create(tokenizer), expected_tokens_romaji);
|
||||||
|
|
||||||
|
tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
|
||||||
|
tokenizer.setReader(new StringReader(source));
|
||||||
|
String[] expected_tokens_katakana = new String[]{"コンヤ", "ハ", "ロバート", "センセイ", "ト", "ハナシ", "タ"};
|
||||||
|
tokenFilter = analysisService.tokenFilter("kuromoji_readingform");
|
||||||
|
assertThat(tokenFilter, instanceOf(KuromojiReadingFormFilterFactory.class));
|
||||||
|
assertSimpleTSOutput(tokenFilter.create(tokenizer), expected_tokens_katakana);
|
||||||
|
}
|
||||||
|
|
||||||
|
@Test
|
||||||
|
public void testKatakanaStemFilter() throws IOException {
|
||||||
|
AnalysisService analysisService = createAnalysisService();
|
||||||
|
TokenFilterFactory tokenFilter = analysisService.tokenFilter("kuromoji_stemmer");
|
||||||
|
assertThat(tokenFilter, instanceOf(KuromojiKatakanaStemmerFactory.class));
|
||||||
|
String source = "明後日パーティーに行く予定がある。図書館で資料をコピーしました。";
|
||||||
|
|
||||||
|
Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
|
||||||
|
tokenizer.setReader(new StringReader(source));
|
||||||
|
|
||||||
|
// パーティー should be stemmed by default
// コピー is shorter than the default minimum length, so it should not be stemmed
|
||||||
|
String[] expected_tokens_katakana = new String[]{"明後日", "パーティ", "に", "行く", "予定", "が", "ある", "図書館", "で", "資料", "を", "コピー", "し", "まし", "た"};
|
||||||
|
assertSimpleTSOutput(tokenFilter.create(tokenizer), expected_tokens_katakana);
|
||||||
|
|
||||||
|
tokenFilter = analysisService.tokenFilter("kuromoji_ks");
|
||||||
|
assertThat(tokenFilter, instanceOf(KuromojiKatakanaStemmerFactory.class));
|
||||||
|
tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
|
||||||
|
tokenizer.setReader(new StringReader(source));
|
||||||
|
|
||||||
|
// パーティー should not be stemmed since min len == 6
|
||||||
|
// コピー should not be stemmed
|
||||||
|
expected_tokens_katakana = new String[]{"明後日", "パーティー", "に", "行く", "予定", "が", "ある", "図書館", "で", "資料", "を", "コピー", "し", "まし", "た"};
|
||||||
|
assertSimpleTSOutput(tokenFilter.create(tokenizer), expected_tokens_katakana);
|
||||||
|
}
|
||||||
|
@Test
|
||||||
|
public void testIterationMarkCharFilter() throws IOException {
|
||||||
|
AnalysisService analysisService = createAnalysisService();
|
||||||
|
// test only kanji
|
||||||
|
        CharFilterFactory charFilterFactory = analysisService.charFilter("kuromoji_im_only_kanji");
        assertNotNull(charFilterFactory);
        assertThat(charFilterFactory, instanceOf(KuromojiIterationMarkCharFilterFactory.class));

        String source = "ところゞゝゝ、ジヾが、時々、馬鹿々々しい";
        String expected = "ところゞゝゝ、ジヾが、時時、馬鹿馬鹿しい";

        assertCharFilterEquals(charFilterFactory.create(new StringReader(source)), expected);

        // test only kana

        charFilterFactory = analysisService.charFilter("kuromoji_im_only_kana");
        assertNotNull(charFilterFactory);
        assertThat(charFilterFactory, instanceOf(KuromojiIterationMarkCharFilterFactory.class));

        expected = "ところどころ、ジジが、時々、馬鹿々々しい";

        assertCharFilterEquals(charFilterFactory.create(new StringReader(source)), expected);

        // test default

        charFilterFactory = analysisService.charFilter("kuromoji_im_default");
        assertNotNull(charFilterFactory);
        assertThat(charFilterFactory, instanceOf(KuromojiIterationMarkCharFilterFactory.class));

        expected = "ところどころ、ジジが、時時、馬鹿馬鹿しい";

        assertCharFilterEquals(charFilterFactory.create(new StringReader(source)), expected);
    }

    @Test
    public void testJapaneseStopFilterFactory() throws IOException {
        AnalysisService analysisService = createAnalysisService();
        TokenFilterFactory tokenFilter = analysisService.tokenFilter("ja_stop");
        assertThat(tokenFilter, instanceOf(JapaneseStopTokenFilterFactory.class));
        String source = "私は制限スピードを超える。";
        String[] expected = new String[]{"私", "制限", "超える"};
        Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        tokenizer.setReader(new StringReader(source));
        assertSimpleTSOutput(tokenFilter.create(tokenizer), expected);
    }


    public AnalysisService createAnalysisService() {
        Settings settings = Settings.settingsBuilder()
                .put("path.home", createTempDir())
                .loadFromClasspath("org/elasticsearch/index/analysis/kuromoji_analysis.json")
                .put(IndexMetaData.SETTING_VERSION_CREATED, Version.CURRENT)
                .build();

        Index index = new Index("test");

        Injector parentInjector = new ModulesBuilder().add(new SettingsModule(settings),
                new EnvironmentModule(new Environment(settings)),
                new IndicesAnalysisModule())
                .createInjector();

        AnalysisModule analysisModule = new AnalysisModule(settings, parentInjector.getInstance(IndicesAnalysisService.class));
        new AnalysisKuromojiPlugin().onModule(analysisModule);

        Injector injector = new ModulesBuilder().add(
                new IndexSettingsModule(index, settings),
                new IndexNameModule(index),
                analysisModule)
                .createChildInjector(parentInjector);

        return injector.getInstance(AnalysisService.class);
    }

    public static void assertSimpleTSOutput(TokenStream stream,
                                            String[] expected) throws IOException {
        stream.reset();
        CharTermAttribute termAttr = stream.getAttribute(CharTermAttribute.class);
        assertThat(termAttr, notNullValue());
        int i = 0;
        while (stream.incrementToken()) {
            assertThat(expected.length, greaterThan(i));
            assertThat( "expected different term at index " + i, expected[i++], equalTo(termAttr.toString()));
        }
        assertThat("not all tokens produced", i, equalTo(expected.length));
    }

    private void assertCharFilterEquals(Reader filtered,
                                        String expected) throws IOException {
        String actual = readFully(filtered);
        assertThat(actual, equalTo(expected));
    }

    private String readFully(Reader reader) throws IOException {
        StringBuilder buffer = new StringBuilder();
        int ch;
        while((ch = reader.read()) != -1){
            buffer.append((char)ch);
        }
        return buffer.toString();
    }

    @Test
    public void testKuromojiUserDict() throws IOException {
        AnalysisService analysisService = createAnalysisService();
        TokenizerFactory tokenizerFactory = analysisService.tokenizer("kuromoji_user_dict");
        String source = "私は制限スピードを超える。";
        String[] expected = new String[]{"私", "は", "制限スピード", "を", "超える"};

        Tokenizer tokenizer = tokenizerFactory.create();
        tokenizer.setReader(new StringReader(source));
        assertSimpleTSOutput(tokenizer, expected);
    }

    // fix #59
    @Test
    public void testKuromojiEmptyUserDict() {
        AnalysisService analysisService = createAnalysisService();
        TokenizerFactory tokenizerFactory = analysisService.tokenizer("kuromoji_empty_user_dict");
        assertThat(tokenizerFactory, instanceOf(KuromojiTokenizerFactory.class));
    }

}
@ -0,0 +1,90 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
package org.elasticsearch.index.analysis;

import org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.plugins.PluginsService;
import org.elasticsearch.test.ElasticsearchIntegrationTest;
import org.junit.Test;

import java.io.IOException;
import java.util.concurrent.ExecutionException;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.CoreMatchers.notNullValue;

@ElasticsearchIntegrationTest.ClusterScope(scope = ElasticsearchIntegrationTest.Scope.SUITE)
public class KuromojiIntegrationTests extends ElasticsearchIntegrationTest {

    @Override
    protected Settings nodeSettings(int nodeOrdinal) {
        return Settings.builder()
                .put(super.nodeSettings(nodeOrdinal))
                .put("plugins." + PluginsService.LOAD_PLUGIN_FROM_CLASSPATH, true)
                .build();
    }

    @Test
    public void testKuromojiAnalyzer() throws ExecutionException, InterruptedException {
        AnalyzeResponse response = client().admin().indices()
                .prepareAnalyze("JR新宿駅の近くにビールを飲みに行こうか").setAnalyzer("kuromoji")
                .execute().get();

        String[] expectedTokens = {"jr", "新宿", "駅", "近く", "ビール", "飲む", "行く"};

        assertThat(response, notNullValue());
        assertThat(response.getTokens().size(), is(7));

        for (int i = 0; i < expectedTokens.length; i++) {
            assertThat(response.getTokens().get(i).getTerm(), is(expectedTokens[i]));
        }
    }

    @Test
    public void testKuromojiAnalyzerInMapping() throws ExecutionException, InterruptedException, IOException {
        createIndex("test");
        ensureGreen("test");
        final XContentBuilder mapping = jsonBuilder().startObject()
                .startObject("type")
                    .startObject("properties")
                        .startObject("foo")
                            .field("type", "string")
                            .field("analyzer", "kuromoji")
                        .endObject()
                    .endObject()
                .endObject()
            .endObject();

        client().admin().indices().preparePutMapping("test").setType("type").setSource(mapping).get();

        index("test", "type", "1", "foo", "JR新宿駅の近くにビールを飲みに行こうか");
        refresh();

        SearchResponse response = client().prepareSearch("test").setQuery(
                QueryBuilders.matchQuery("foo", "jr")
        ).execute().actionGet();

        assertThat(response.getHits().getTotalHits(), is(1L));
    }
}
@ -0,0 +1,62 @@
{
    "index":{
        "analysis":{
            "filter":{
                "kuromoji_rf":{
                    "type":"kuromoji_readingform",
                    "use_romaji" : "true"
                },
                "kuromoji_pos" : {
                    "type": "kuromoji_part_of_speech",
                    "stoptags" : ["# verb-main:", "動詞-自立"]
                },
                "kuromoji_ks" : {
                    "type": "kuromoji_stemmer",
                    "minimum_length" : 6
                },
                "ja_stop" : {
                    "type": "ja_stop",
                    "stopwords": ["_japanese_", "スピード"]
                }

            },

            "char_filter":{
                "kuromoji_im_only_kanji":{
                    "type":"kuromoji_iteration_mark",
                    "normalize_kanji":true,
                    "normalize_kana":false
                },
                "kuromoji_im_only_kana":{
                    "type":"kuromoji_iteration_mark",
                    "normalize_kanji":false,
                    "normalize_kana":true
                },
                "kuromoji_im_default":{
                    "type":"kuromoji_iteration_mark"
                }
            },

            "tokenizer" : {
                "kuromoji" : {
                    "type":"kuromoji_tokenizer"
                },
                "kuromoji_empty_user_dict" : {
                    "type":"kuromoji_tokenizer",
                    "user_dictionary":"org/elasticsearch/index/analysis/empty_user_dict.txt"
                },
                "kuromoji_user_dict" : {
                    "type":"kuromoji_tokenizer",
                    "user_dictionary":"org/elasticsearch/index/analysis/user_dict.txt"
                }
            },
            "analyzer" : {
                "my_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "kuromoji_tokenizer"
                }
            }

        }
    }
}
@ -0,0 +1 @@
制限スピード,制限スピード,セイゲンスピード,テスト名詞