migrate branch for analysis-kuromoji

Simon Willnauer 2015-06-05 13:12:09 +02:00
commit 7294d27e5c
20 changed files with 1721 additions and 0 deletions

View File

@ -0,0 +1,552 @@
Japanese (kuromoji) Analysis for Elasticsearch
==================================
The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis module into Elasticsearch.
In order to install the plugin, run:
```sh
bin/plugin install elasticsearch/elasticsearch-analysis-kuromoji/2.5.0
```
You need to install a version matching your Elasticsearch version:
| elasticsearch | Kuromoji Analysis Plugin | Docs |
|---------------|-----------------------------|------------|
| master | Build from source | See below |
| es-1.x | Build from source | [2.6.0-SNAPSHOT](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/es-1.x/#version-260-snapshot-for-elasticsearch-1x) |
| es-1.5 | 2.5.0 | [2.5.0](https://github.com/elastic/elasticsearch-analysis-kuromoji/tree/v2.5.0/#version-250-for-elasticsearch-15) |
| es-1.4 | 2.4.3 | [2.4.3](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.4.3/#version-243-for-elasticsearch-14) |
| < 1.4.5 | 2.4.2 | [2.4.2](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.4.2/#version-242-for-elasticsearch-14) |
| < 1.4.3 | 2.4.1 | [2.4.1](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.4.1/#version-241-for-elasticsearch-14) |
| es-1.3 | 2.3.0 | [2.3.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.3.0/#japanese-kuromoji-analysis-for-elasticsearch) |
| es-1.2 | 2.2.0 | [2.2.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.2.0/#japanese-kuromoji-analysis-for-elasticsearch) |
| es-1.1 | 2.1.0 | [2.1.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.1.0/#japanese-kuromoji-analysis-for-elasticsearch) |
| es-1.0 | 2.0.0 | [2.0.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.0.0/#japanese-kuromoji-analysis-for-elasticsearch) |
| es-0.90 | 1.8.0 | [1.8.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v1.8.0/#japanese-kuromoji-analysis-for-elasticsearch) |
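For example, to install the plugin on an Elasticsearch 1.4 node, pick the matching release from the table above (note that the exact `plugin` command syntax differs slightly between Elasticsearch releases, e.g. `install` vs. `--install`):
```sh
# Illustrative: install the 2.4.3 release, which matches Elasticsearch 1.4 in the table above.
bin/plugin install elasticsearch/elasticsearch-analysis-kuromoji/2.4.3
```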
To build a `SNAPSHOT` version, you need to build it with Maven:
```bash
mvn clean install
plugin --install analysis-kuromoji \
--url file:target/releases/elasticsearch-analysis-kuromoji-X.X.X-SNAPSHOT.zip
```
Includes Analyzer, Tokenizer, TokenFilter, CharFilter
-----------------------------------------------
The plugin provides the following analyzer, tokenizer, token filters, and char filter:
| name | type |
|-------------------------|-------------|
| kuromoji_iteration_mark | charfilter |
| kuromoji | analyzer |
| kuromoji_tokenizer | tokenizer |
| kuromoji_baseform | tokenfilter |
| kuromoji_part_of_speech | tokenfilter |
| kuromoji_readingform | tokenfilter |
| kuromoji_stemmer | tokenfilter |
| ja_stop | tokenfilter |
Usage
-----
## Analyzer : kuromoji
An analyzer of type `kuromoji`.
This analyzer combines the following tokenizer and token filters; a quick `_analyze` example follows the list.
* `kuromoji_tokenizer` : Kuromoji Tokenizer
* `kuromoji_baseform` : Kuromoji BaseFormFilter (TokenFilter)
* `kuromoji_part_of_speech` : Kuromoji Part of Speech Stop Filter (TokenFilter)
* `cjk_width` : CJK Width Filter (TokenFilter)
* `stop` : Stop Filter (TokenFilter)
* `kuromoji_stemmer` : Kuromoji Katakana Stemmer Filter(TokenFilter)
* `lowercase` : LowerCase Filter (TokenFilter)
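As a quick, illustrative check, the pre-built `kuromoji` analyzer can be exercised directly through the `_analyze` API. The sentence below is the one used in this plugin's integration test, which expects tokens such as `新宿`, `ビール`, and the base form `飲む`; the exact output depends on the dictionary and plugin version.
```sh
# Analyze a sample sentence with the pre-built kuromoji analyzer (no index-specific settings required).
curl -XPOST 'http://localhost:9200/_analyze?analyzer=kuromoji&pretty' -d 'JR新宿駅の近くにビールを飲みに行こうか'
```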
## CharFilter : kuromoji_iteration_mark
A charfilter of type `kuromoji_iteration_mark`.
This char filter normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.
The following are the settings that can be set for a `kuromoji_iteration_mark` charfilter type (a configuration sketch follows the table):
| **Setting** | **Description** | **Default value** |
|:----------------|:-------------------------------------------------------------|:------------------|
| normalize_kanji | indicates whether kanji iteration marks should be normalized | `true` |
| normalize_kana  | indicates whether kana iteration marks should be normalized  | `true`            |
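A minimal configuration sketch (the index name `kuromoji_sample`, the char filter name `iteration_mark_normalizer`, and the analyzer name `my_analyzer` are illustrative) that wires the char filter into a custom analyzer, in the same style as the other examples in this document:
```sh
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "iteration_mark_normalizer": {
            "type": "kuromoji_iteration_mark",
            "normalize_kanji": true,
            "normalize_kana": true
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "char_filter": ["iteration_mark_normalizer"],
            "tokenizer": "kuromoji_tokenizer"
          }
        }
      }
    }
  }
}
'
```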
## Tokenizer : kuromoji_tokenizer
A tokenizer of type `kuromoji_tokenizer`.
The following are settings that can be set for a `kuromoji_tokenizer` tokenizer type:
| **Setting** | **Description** | **Default value** |
|:--------------------|:--------------------------------------------------------------------------------------------------------------------------|:------------------|
| mode                 | Tokenization mode: determines how the tokenizer handles compound and unknown words. One of `normal`, `search`, or `extended`. | `search`          |
| discard_punctuation | `true` if punctuation tokens should be dropped from the output. | `true` |
| user_dictionary      | Path to the user dictionary file                                                                                             |                   |
### Tokenization mode
Three tokenization modes are available:
* `normal` : Ordinary segmentation: no decomposition for compounds
* `search` : Segmentation geared towards search: this includes a decompounding process for long nouns, also including the full compound token as a synonym.
* `extended` : Extended mode outputs unigrams for unknown words.
#### Differences between tokenization mode outputs
Input text is `関西国際空港` and `アブラカダブラ`.
| **mode** | `関西国際空港` | `アブラカダブラ` |
|:-----------|:-------------|:-------|
| `normal` | `関西国際空港` | `アブラカダブラ` |
| `search` | `関西` `関西国際空港` `国際` `空港` | `アブラカダブラ` |
| `extended` | `関西` `国際` `空港` | `ア` `ブ` `ラ` `カ` `ダ` `ブ` `ラ` |
### User Dictionary
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default.
Additional entries can be defined by the user in a User Dictionary.
User Dictionary entries are defined using the following CSV format:
```
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
```
Dictionary Example
```
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
```
To use a User Dictionary, set its file path in the `user_dictionary` setting.
The User Dictionary file must be placed in the `ES_HOME/config` directory.
### example
_Example Settings:_
```sh
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
  "settings": {
    "index":{
      "analysis":{
        "tokenizer" : {
          "kuromoji_user_dict" : {
            "type" : "kuromoji_tokenizer",
            "mode" : "extended",
            "discard_punctuation" : "false",
            "user_dictionary" : "userdict_ja.txt"
          }
        },
        "analyzer" : {
          "my_analyzer" : {
            "type" : "custom",
            "tokenizer" : "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
'
```
_Example Request using `_analyze` API :_
```sh
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '東京スカイツリー'
```
_Response :_
```json
{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}
```
## TokenFilter : kuromoji_baseform
A token filter of type `kuromoji_baseform` that replaces the term text with the token's base form (its `BaseFormAttribute`).
This acts as a lemmatizer for verbs and adjectives.
### example
_Example Settings:_
```sh
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
  "settings": {
    "index":{
      "analysis":{
        "analyzer" : {
          "my_analyzer" : {
            "tokenizer" : "kuromoji_tokenizer",
            "filter" : ["kuromoji_baseform"]
          }
        }
      }
    }
  }
}
'
```
_Example Request using `_analyze` API :_
```sh
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '飲み'
```
_Response :_
```json
{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}
```
## TokenFilter : kuromoji_part_of_speech
A token filter of type `kuromoji_part_of_speech` that removes tokens that match a set of part-of-speech tags.
The following are settings that can be set for a `kuromoji_part_of_speech` token filter type:
| **Setting** | **Description** |
|:------------|:-----------------------------------------------------|
| stoptags | A list of part-of-speech tags that should be removed |
Note that the default `stoptags` are read from the `stoptags.txt` file bundled in `lucene-analyzers-kuromoji.jar`.
### example
_Example Settings:_
```sh
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
  "settings": {
    "index":{
      "analysis":{
        "analyzer" : {
          "my_analyzer" : {
            "tokenizer" : "kuromoji_tokenizer",
            "filter" : ["my_posfilter"]
          }
        },
        "filter" : {
          "my_posfilter" : {
            "type" : "kuromoji_part_of_speech",
            "stoptags" : [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}
'
```
_Example Request using `_analyze` API :_
```sh
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '寿司がおいしいね'
```
_Response :_
```json
{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 3
  } ]
}
```
## TokenFilter : kuromoji_readingform
A token filter of type `kuromoji_readingform` that replaces the term attribute with the reading of a token in either katakana or romaji form.
The default reading form is katakana.
The following are settings that can be set for a `kuromoji_readingform` token filter type:
| **Setting** | **Description** | **Default value** |
|:------------|:----------------------------------------------------------|:------------------|
| use_romaji  | `true` to output the romaji reading form instead of katakana | `false`           |
Note that the pre-built `kuromoji_readingform` filter registered by this plugin sets `use_romaji` to `true` by default.
### example
_Example Settings:_
```sh
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
  "settings": {
    "index":{
      "analysis":{
        "analyzer" : {
          "romaji_analyzer" : {
            "tokenizer" : "kuromoji_tokenizer",
            "filter" : ["romaji_readingform"]
          },
          "katakana_analyzer" : {
            "tokenizer" : "kuromoji_tokenizer",
            "filter" : ["katakana_readingform"]
          }
        },
        "filter" : {
          "romaji_readingform" : {
            "type" : "kuromoji_readingform",
            "use_romaji" : true
          },
          "katakana_readingform" : {
            "type" : "kuromoji_readingform",
            "use_romaji" : false
          }
        }
      }
    }
  }
}
'
```
_Example Request using `_analyze` API :_
```sh
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=katakana_analyzer&pretty' -d '寿司'
```
_Response :_
```json
{
  "tokens" : [ {
    "token" : "スシ",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}
```
_Example Request using `_analyze` API :_
```sh
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=romaji_analyzer&pretty' -d '寿司'
```
_Response :_
```json
{
  "tokens" : [ {
    "token" : "sushi",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}
```
## TokenFilter : kuromoji_stemmer
A token filter of type `kuromoji_stemmer` that normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC).
Only katakana words longer than a minimum length are stemmed (default is four).
Note that only full-width katakana characters are supported.
The following are settings that can be set for a `kuromoji_stemmer` token filter type:
| **Setting** | **Description** | **Default value** |
|:----------------|:---------------------------|:------------------|
| minimum_length | The minimum length to stem | `4` |
### example
_Example Settings:_
```sh
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
  "settings": {
    "index":{
      "analysis":{
        "analyzer" : {
          "my_analyzer" : {
            "tokenizer" : "kuromoji_tokenizer",
            "filter" : ["my_katakana_stemmer"]
          }
        },
        "filter" : {
          "my_katakana_stemmer" : {
            "type" : "kuromoji_stemmer",
            "minimum_length" : 4
          }
        }
      }
    }
  }
}
'
```
_Example Request using `_analyze` API :_
```sh
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'コピー'
```
_Response :_
```json
{
  "tokens" : [ {
    "token" : "コピー",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  } ]
}
```
_Example Request using `_analyze` API :_
```sh
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'サーバー'
```
_Response :_
```json
{
  "tokens" : [ {
    "token" : "サーバ",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  } ]
}
```
## TokenFilter : ja_stop
A token filter of type `ja_stop` that provides the predefined `_japanese_` stop word list.
*Note: only the `_japanese_` predefined list is provided. If you want to use other predefined stop word lists, use the `stop` token filter.*
### example
_Example Settings:_
```sh
curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
  "settings": {
    "index":{
      "analysis":{
        "analyzer" : {
          "analyzer_with_ja_stop" : {
            "tokenizer" : "kuromoji_tokenizer",
            "filter" : ["ja_stop"]
          }
        },
        "filter" : {
          "ja_stop" : {
            "type" : "ja_stop",
            "stopwords" : ["_japanese_", "ストップ"]
          }
        }
      }
    }
  }
}'
```
_Example Request using `_analyze` API :_
```sh
curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=analyzer_with_ja_stop&pretty' -d 'ストップは消える'
```
_Response :_
```json
{
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 3
  } ]
}
```
License
-------
This software is licensed under the Apache 2 license, quoted below.
Copyright 2009-2014 Elasticsearch <http://www.elasticsearch.org>
Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.

View File

@ -0,0 +1,40 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.elasticsearch.plugin</groupId>
<artifactId>elasticsearch-analysis-kuromoji</artifactId>
<packaging>jar</packaging>
<name>Elasticsearch Japanese (kuromoji) Analysis plugin</name>
<description>The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis module into elasticsearch.</description>
<parent>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-plugin</artifactId>
<version>2.0.0-SNAPSHOT</version>
</parent>
<properties>
<!-- You can add any specific project property here -->
</properties>
<dependencies>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-kuromoji</artifactId>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
</plugin>
</plugins>
</build>
</project>

View File

@ -0,0 +1,26 @@
<?xml version="1.0"?>
<assembly>
<id>plugin</id>
<formats>
<format>zip</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<dependencySets>
<dependencySet>
<outputDirectory>/</outputDirectory>
<useProjectArtifact>true</useProjectArtifact>
<useTransitiveFiltering>true</useTransitiveFiltering>
<excludes>
<exclude>org.elasticsearch:elasticsearch</exclude>
</excludes>
</dependencySet>
<dependencySet>
<outputDirectory>/</outputDirectory>
<useProjectArtifact>true</useProjectArtifact>
<useTransitiveFiltering>true</useTransitiveFiltering>
<includes>
<include>org.apache.lucene:lucene-analyzers-kuromoji</include>
</includes>
</dependencySet>
</dependencySets>
</assembly>

View File

@ -0,0 +1,76 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.index.analysis;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.search.suggest.analyzing.SuggestStopFilter;
import org.elasticsearch.common.collect.MapBuilder;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;
import java.util.Map;
import java.util.Set;
public class JapaneseStopTokenFilterFactory extends AbstractTokenFilterFactory{
private final CharArraySet stopWords;
private final boolean ignoreCase;
private final boolean removeTrailing;
@Inject
public JapaneseStopTokenFilterFactory(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
super(index, indexSettings, name, settings);
this.ignoreCase = settings.getAsBoolean("ignore_case", false);
this.removeTrailing = settings.getAsBoolean("remove_trailing", true);
Map<String, Set<?>> namedStopWords = MapBuilder.<String, Set<?>>newMapBuilder()
.put("_japanese_", JapaneseAnalyzer.getDefaultStopSet())
.immutableMap();
this.stopWords = Analysis.parseWords(env, settings, "stopwords", JapaneseAnalyzer.getDefaultStopSet(), namedStopWords, ignoreCase);
}
@Override
public TokenStream create(TokenStream tokenStream) {
if (removeTrailing) {
return new StopFilter(tokenStream, stopWords);
} else {
return new SuggestStopFilter(tokenStream, stopWords);
}
}
public Set<?> stopWords() {
return stopWords;
}
public boolean ignoreCase() {
return ignoreCase;
}
}

View File

@ -0,0 +1,56 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.index.analysis;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
import org.apache.lucene.analysis.util.CharArraySet;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;
import java.util.Set;
/**
*/
public class KuromojiAnalyzerProvider extends AbstractIndexAnalyzerProvider<JapaneseAnalyzer> {
private final JapaneseAnalyzer analyzer;
@Inject
public KuromojiAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
super(index, indexSettings, name, settings);
final Set<?> stopWords = Analysis.parseStopWords(env, settings, JapaneseAnalyzer.getDefaultStopSet());
final JapaneseTokenizer.Mode mode = KuromojiTokenizerFactory.getMode(settings);
final UserDictionary userDictionary = KuromojiTokenizerFactory.getUserDictionary(env, settings);
analyzer = new JapaneseAnalyzer(userDictionary, mode, CharArraySet.copy(stopWords), JapaneseAnalyzer.getDefaultStopTags());
}
@Override
public JapaneseAnalyzer get() {
return this.analyzer;
}
}

View File

@ -0,0 +1,41 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.index.analysis;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseBaseFormFilter;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;
public class KuromojiBaseFormFilterFactory extends AbstractTokenFilterFactory {
@Inject
public KuromojiBaseFormFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
super(index, indexSettings, name, settings);
}
@Override
public TokenStream create(TokenStream tokenStream) {
return new JapaneseBaseFormFilter(tokenStream);
}
}

View File

@ -0,0 +1,48 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.index.analysis;
import org.apache.lucene.analysis.ja.JapaneseIterationMarkCharFilter;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;
import java.io.Reader;
public class KuromojiIterationMarkCharFilterFactory extends AbstractCharFilterFactory {
private final boolean normalizeKanji;
private final boolean normalizeKana;
@Inject
public KuromojiIterationMarkCharFilterFactory(Index index, @IndexSettings Settings indexSettings,
@Assisted String name, @Assisted Settings settings) {
super(index, indexSettings, name);
normalizeKanji = settings.getAsBoolean("normalize_kanji", JapaneseIterationMarkCharFilter.NORMALIZE_KANJI_DEFAULT);
normalizeKana = settings.getAsBoolean("normalize_kana", JapaneseIterationMarkCharFilter.NORMALIZE_KANA_DEFAULT);
}
@Override
public Reader create(Reader reader) {
return new JapaneseIterationMarkCharFilter(reader, normalizeKanji, normalizeKana);
}
}

View File

@ -0,0 +1,44 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.index.analysis;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseKatakanaStemFilter;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;
public class KuromojiKatakanaStemmerFactory extends AbstractTokenFilterFactory {
private final int minimumLength;
@Inject
public KuromojiKatakanaStemmerFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
super(index, indexSettings, name, settings);
minimumLength = settings.getAsInt("minimum_length", JapaneseKatakanaStemFilter.DEFAULT_MINIMUM_LENGTH);
}
@Override
public TokenStream create(TokenStream tokenStream) {
return new JapaneseKatakanaStemFilter(tokenStream, minimumLength);
}
}

View File

@ -0,0 +1,53 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.index.analysis;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilter;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
public class KuromojiPartOfSpeechFilterFactory extends AbstractTokenFilterFactory {
private final Set<String> stopTags = new HashSet<String>();
@Inject
public KuromojiPartOfSpeechFilterFactory(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
super(index, indexSettings, name, settings);
List<String> wordList = Analysis.getWordList(env, settings, "stoptags");
if (wordList != null) {
stopTags.addAll(wordList);
}
}
@Override
public TokenStream create(TokenStream tokenStream) {
return new JapanesePartOfSpeechStopFilter(tokenStream, stopTags);
}
}

View File

@ -0,0 +1,44 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.index.analysis;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.JapaneseReadingFormFilter;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;
public class KuromojiReadingFormFilterFactory extends AbstractTokenFilterFactory {
private final boolean useRomaji;
@Inject
public KuromojiReadingFormFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
super(index, indexSettings, name, settings);
useRomaji = settings.getAsBoolean("use_romaji", false);
}
@Override
public TokenStream create(TokenStream tokenStream) {
return new JapaneseReadingFormFilter(tokenStream, useRomaji);
}
}

View File

@ -0,0 +1,93 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.index.analysis;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.ja.dict.UserDictionary;
import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;
import java.io.IOException;
import java.io.Reader;
/**
*/
public class KuromojiTokenizerFactory extends AbstractTokenizerFactory {
private static final String USER_DICT_OPTION = "user_dictionary";
private final UserDictionary userDictionary;
private final Mode mode;
private boolean discardPunctuation;
@Inject
public KuromojiTokenizerFactory(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
super(index, indexSettings, name, settings);
mode = getMode(settings);
userDictionary = getUserDictionary(env, settings);
discardPunctuation = settings.getAsBoolean("discard_punctuation", true);
}
public static UserDictionary getUserDictionary(Environment env, Settings settings) {
try {
final Reader reader = Analysis.getReaderFromFile(env, settings, USER_DICT_OPTION);
if (reader == null) {
return null;
} else {
try {
return UserDictionary.open(reader);
} finally {
reader.close();
}
}
} catch (IOException e) {
throw new ElasticsearchException("failed to load kuromoji user dictionary", e);
}
}
public static JapaneseTokenizer.Mode getMode(Settings settings) {
JapaneseTokenizer.Mode mode = JapaneseTokenizer.DEFAULT_MODE;
String modeSetting = settings.get("mode", null);
if (modeSetting != null) {
if ("search".equalsIgnoreCase(modeSetting)) {
mode = JapaneseTokenizer.Mode.SEARCH;
} else if ("normal".equalsIgnoreCase(modeSetting)) {
mode = JapaneseTokenizer.Mode.NORMAL;
} else if ("extended".equalsIgnoreCase(modeSetting)) {
mode = JapaneseTokenizer.Mode.EXTENDED;
}
}
return mode;
}
@Override
public Tokenizer create() {
return new JapaneseTokenizer(userDictionary, discardPunctuation, mode);
}
}

View File

@ -0,0 +1,131 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.indices.analysis;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.*;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.elasticsearch.common.component.AbstractComponent;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.analysis.*;
import java.io.Reader;
/**
* Registers indices level analysis components so, if not explicitly configured,
* will be shared among all indices.
*/
public class KuromojiIndicesAnalysis extends AbstractComponent {
@Inject
public KuromojiIndicesAnalysis(Settings settings,
IndicesAnalysisService indicesAnalysisService) {
super(settings);
indicesAnalysisService.analyzerProviderFactories().put("kuromoji",
new PreBuiltAnalyzerProviderFactory("kuromoji", AnalyzerScope.INDICES,
new JapaneseAnalyzer()));
indicesAnalysisService.charFilterFactories().put("kuromoji_iteration_mark",
new PreBuiltCharFilterFactoryFactory(new CharFilterFactory() {
@Override
public String name() {
return "kuromoji_iteration_mark";
}
@Override
public Reader create(Reader reader) {
return new JapaneseIterationMarkCharFilter(reader,
JapaneseIterationMarkCharFilter.NORMALIZE_KANJI_DEFAULT,
JapaneseIterationMarkCharFilter.NORMALIZE_KANA_DEFAULT);
}
}));
indicesAnalysisService.tokenizerFactories().put("kuromoji_tokenizer",
new PreBuiltTokenizerFactoryFactory(new TokenizerFactory() {
@Override
public String name() {
return "kuromoji_tokenizer";
}
@Override
public Tokenizer create() {
return new JapaneseTokenizer(null, true, Mode.SEARCH);
}
}));
indicesAnalysisService.tokenFilterFactories().put("kuromoji_baseform",
new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
@Override
public String name() {
return "kuromoji_baseform";
}
@Override
public TokenStream create(TokenStream tokenStream) {
return new JapaneseBaseFormFilter(tokenStream);
}
}));
indicesAnalysisService.tokenFilterFactories().put(
"kuromoji_part_of_speech",
new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
@Override
public String name() {
return "kuromoji_part_of_speech";
}
@Override
public TokenStream create(TokenStream tokenStream) {
return new JapanesePartOfSpeechStopFilter(tokenStream, JapaneseAnalyzer
.getDefaultStopTags());
}
}));
indicesAnalysisService.tokenFilterFactories().put(
"kuromoji_readingform",
new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
@Override
public String name() {
return "kuromoji_readingform";
}
@Override
public TokenStream create(TokenStream tokenStream) {
return new JapaneseReadingFormFilter(tokenStream, true);
}
}));
indicesAnalysisService.tokenFilterFactories().put("kuromoji_stemmer",
new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
@Override
public String name() {
return "kuromoji_stemmer";
}
@Override
public TokenStream create(TokenStream tokenStream) {
return new JapaneseKatakanaStemFilter(tokenStream);
}
}));
}
}

View File

@ -0,0 +1,32 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.indices.analysis;
import org.elasticsearch.common.inject.AbstractModule;
/**
*/
public class KuromojiIndicesAnalysisModule extends AbstractModule {
@Override
protected void configure() {
bind(KuromojiIndicesAnalysis.class).asEagerSingleton();
}
}

View File

@ -0,0 +1,62 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.plugin.analysis.kuromoji;
import org.elasticsearch.common.inject.Module;
import org.elasticsearch.index.analysis.*;
import org.elasticsearch.indices.analysis.KuromojiIndicesAnalysisModule;
import org.elasticsearch.plugins.AbstractPlugin;
import java.util.ArrayList;
import java.util.Collection;
/**
*
*/
public class AnalysisKuromojiPlugin extends AbstractPlugin {
@Override
public String name() {
return "analysis-kuromoji";
}
@Override
public String description() {
return "Kuromoji analysis support";
}
@Override
public Collection<Class<? extends Module>> modules() {
Collection<Class<? extends Module>> classes = new ArrayList<>();
classes.add(KuromojiIndicesAnalysisModule.class);
return classes;
}
public void onModule(AnalysisModule module) {
module.addCharFilter("kuromoji_iteration_mark", KuromojiIterationMarkCharFilterFactory.class);
module.addAnalyzer("kuromoji", KuromojiAnalyzerProvider.class);
module.addTokenizer("kuromoji_tokenizer", KuromojiTokenizerFactory.class);
module.addTokenFilter("kuromoji_baseform", KuromojiBaseFormFilterFactory.class);
module.addTokenFilter("kuromoji_part_of_speech", KuromojiPartOfSpeechFilterFactory.class);
module.addTokenFilter("kuromoji_readingform", KuromojiReadingFormFilterFactory.class);
module.addTokenFilter("kuromoji_stemmer", KuromojiKatakanaStemmerFactory.class);
module.addTokenFilter("ja_stop", JapaneseStopTokenFilterFactory.class);
}
}

View File

@ -0,0 +1,3 @@
plugin=org.elasticsearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin
version=${project.version}
lucene=${lucene.version}

View File

@ -0,0 +1,267 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.index.analysis;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.elasticsearch.Version;
import org.elasticsearch.cluster.metadata.IndexMetaData;
import org.elasticsearch.common.inject.Injector;
import org.elasticsearch.common.inject.ModulesBuilder;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.settings.SettingsModule;
import org.elasticsearch.env.Environment;
import org.elasticsearch.env.EnvironmentModule;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.IndexNameModule;
import org.elasticsearch.index.settings.IndexSettingsModule;
import org.elasticsearch.indices.analysis.IndicesAnalysisModule;
import org.elasticsearch.indices.analysis.IndicesAnalysisService;
import org.elasticsearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin;
import org.elasticsearch.test.ElasticsearchTestCase;
import org.junit.Test;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import static org.hamcrest.Matchers.*;
/**
*/
public class KuromojiAnalysisTests extends ElasticsearchTestCase {
@Test
public void testDefaultsKuromojiAnalysis() throws IOException {
AnalysisService analysisService = createAnalysisService();
TokenizerFactory tokenizerFactory = analysisService.tokenizer("kuromoji_tokenizer");
assertThat(tokenizerFactory, instanceOf(KuromojiTokenizerFactory.class));
TokenFilterFactory filterFactory = analysisService.tokenFilter("kuromoji_part_of_speech");
assertThat(filterFactory, instanceOf(KuromojiPartOfSpeechFilterFactory.class));
filterFactory = analysisService.tokenFilter("kuromoji_readingform");
assertThat(filterFactory, instanceOf(KuromojiReadingFormFilterFactory.class));
filterFactory = analysisService.tokenFilter("kuromoji_baseform");
assertThat(filterFactory, instanceOf(KuromojiBaseFormFilterFactory.class));
filterFactory = analysisService.tokenFilter("kuromoji_stemmer");
assertThat(filterFactory, instanceOf(KuromojiKatakanaStemmerFactory.class));
filterFactory = analysisService.tokenFilter("ja_stop");
assertThat(filterFactory, instanceOf(JapaneseStopTokenFilterFactory.class));
NamedAnalyzer analyzer = analysisService.analyzer("kuromoji");
assertThat(analyzer.analyzer(), instanceOf(JapaneseAnalyzer.class));
analyzer = analysisService.analyzer("my_analyzer");
assertThat(analyzer.analyzer(), instanceOf(CustomAnalyzer.class));
assertThat(analyzer.analyzer().tokenStream(null, new StringReader("")), instanceOf(JapaneseTokenizer.class));
CharFilterFactory charFilterFactory = analysisService.charFilter("kuromoji_iteration_mark");
assertThat(charFilterFactory, instanceOf(KuromojiIterationMarkCharFilterFactory.class));
}
@Test
public void testBaseFormFilterFactory() throws IOException {
AnalysisService analysisService = createAnalysisService();
TokenFilterFactory tokenFilter = analysisService.tokenFilter("kuromoji_pos");
assertThat(tokenFilter, instanceOf(KuromojiPartOfSpeechFilterFactory.class));
String source = "私は制限スピードを超える。";
String[] expected = new String[]{"私", "は", "制限", "スピード", "を"};
Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
tokenizer.setReader(new StringReader(source));
assertSimpleTSOutput(tokenFilter.create(tokenizer), expected);
}
@Test
public void testReadingFormFilterFactory() throws IOException {
AnalysisService analysisService = createAnalysisService();
TokenFilterFactory tokenFilter = analysisService.tokenFilter("kuromoji_rf");
assertThat(tokenFilter, instanceOf(KuromojiReadingFormFilterFactory.class));
String source = "今夜はロバート先生と話した";
String[] expected_tokens_romaji = new String[]{"kon'ya", "ha", "robato", "sensei", "to", "hanashi", "ta"};
Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
tokenizer.setReader(new StringReader(source));
assertSimpleTSOutput(tokenFilter.create(tokenizer), expected_tokens_romaji);
tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
tokenizer.setReader(new StringReader(source));
String[] expected_tokens_katakana = new String[]{"コンヤ", "ハ", "ロバート", "センセイ", "ト", "ハナシ", "タ"};
tokenFilter = analysisService.tokenFilter("kuromoji_readingform");
assertThat(tokenFilter, instanceOf(KuromojiReadingFormFilterFactory.class));
assertSimpleTSOutput(tokenFilter.create(tokenizer), expected_tokens_katakana);
}
@Test
public void testKatakanaStemFilter() throws IOException {
AnalysisService analysisService = createAnalysisService();
TokenFilterFactory tokenFilter = analysisService.tokenFilter("kuromoji_stemmer");
assertThat(tokenFilter, instanceOf(KuromojiKatakanaStemmerFactory.class));
String source = "明後日パーティーに行く予定がある。図書館で資料をコピーしました。";
Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
tokenizer.setReader(new StringReader(source));
// パーティー should be stemmed by default
// (min len) コピー should not be stemmed
String[] expected_tokens_katakana = new String[]{"明後日", "パーティ", "に", "行く", "予定", "が", "ある", "図書館", "で", "資料", "を", "コピー", "し", "まし", "た"};
assertSimpleTSOutput(tokenFilter.create(tokenizer), expected_tokens_katakana);
tokenFilter = analysisService.tokenFilter("kuromoji_ks");
assertThat(tokenFilter, instanceOf(KuromojiKatakanaStemmerFactory.class));
tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
tokenizer.setReader(new StringReader(source));
// パーティー should not be stemmed since min len == 6
// コピー should not be stemmed
expected_tokens_katakana = new String[]{"明後日", "パーティー", "に", "行く", "予定", "が", "ある", "図書館", "で", "資料", "を", "コピー", "し", "まし", "た"};
assertSimpleTSOutput(tokenFilter.create(tokenizer), expected_tokens_katakana);
}
@Test
public void testIterationMarkCharFilter() throws IOException {
AnalysisService analysisService = createAnalysisService();
// test only kanji
CharFilterFactory charFilterFactory = analysisService.charFilter("kuromoji_im_only_kanji");
assertNotNull(charFilterFactory);
assertThat(charFilterFactory, instanceOf(KuromojiIterationMarkCharFilterFactory.class));
String source = "ところゞゝゝ、ジヾが、時々、馬鹿々々しい";
String expected = "ところゞゝゝ、ジヾが、時時、馬鹿馬鹿しい";
assertCharFilterEquals(charFilterFactory.create(new StringReader(source)), expected);
// test only kana
charFilterFactory = analysisService.charFilter("kuromoji_im_only_kana");
assertNotNull(charFilterFactory);
assertThat(charFilterFactory, instanceOf(KuromojiIterationMarkCharFilterFactory.class));
expected = "ところどころ、ジジが、時々、馬鹿々々しい";
assertCharFilterEquals(charFilterFactory.create(new StringReader(source)), expected);
// test default
charFilterFactory = analysisService.charFilter("kuromoji_im_default");
assertNotNull(charFilterFactory);
assertThat(charFilterFactory, instanceOf(KuromojiIterationMarkCharFilterFactory.class));
expected = "ところどころ、ジジが、時時、馬鹿馬鹿しい";
assertCharFilterEquals(charFilterFactory.create(new StringReader(source)), expected);
}
@Test
public void testJapaneseStopFilterFactory() throws IOException {
AnalysisService analysisService = createAnalysisService();
TokenFilterFactory tokenFilter = analysisService.tokenFilter("ja_stop");
assertThat(tokenFilter, instanceOf(JapaneseStopTokenFilterFactory.class));
String source = "私は制限スピードを超える。";
String[] expected = new String[]{"私", "制限", "超える"};
Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
tokenizer.setReader(new StringReader(source));
assertSimpleTSOutput(tokenFilter.create(tokenizer), expected);
}
public AnalysisService createAnalysisService() {
Settings settings = Settings.settingsBuilder()
.put("path.home", createTempDir())
.loadFromClasspath("org/elasticsearch/index/analysis/kuromoji_analysis.json")
.put(IndexMetaData.SETTING_VERSION_CREATED, Version.CURRENT)
.build();
Index index = new Index("test");
Injector parentInjector = new ModulesBuilder().add(new SettingsModule(settings),
new EnvironmentModule(new Environment(settings)),
new IndicesAnalysisModule())
.createInjector();
AnalysisModule analysisModule = new AnalysisModule(settings, parentInjector.getInstance(IndicesAnalysisService.class));
new AnalysisKuromojiPlugin().onModule(analysisModule);
Injector injector = new ModulesBuilder().add(
new IndexSettingsModule(index, settings),
new IndexNameModule(index),
analysisModule)
.createChildInjector(parentInjector);
return injector.getInstance(AnalysisService.class);
}
public static void assertSimpleTSOutput(TokenStream stream,
String[] expected) throws IOException {
stream.reset();
CharTermAttribute termAttr = stream.getAttribute(CharTermAttribute.class);
assertThat(termAttr, notNullValue());
int i = 0;
while (stream.incrementToken()) {
assertThat(expected.length, greaterThan(i));
assertThat( "expected different term at index " + i, expected[i++], equalTo(termAttr.toString()));
}
assertThat("not all tokens produced", i, equalTo(expected.length));
}
private void assertCharFilterEquals(Reader filtered,
String expected) throws IOException {
String actual = readFully(filtered);
assertThat(actual, equalTo(expected));
}
private String readFully(Reader reader) throws IOException {
StringBuilder buffer = new StringBuilder();
int ch;
while((ch = reader.read()) != -1){
buffer.append((char)ch);
}
return buffer.toString();
}
@Test
public void testKuromojiUserDict() throws IOException {
AnalysisService analysisService = createAnalysisService();
TokenizerFactory tokenizerFactory = analysisService.tokenizer("kuromoji_user_dict");
String source = "私は制限スピードを超える。";
String[] expected = new String[]{"私", "は", "制限スピード", "を", "超える"};
Tokenizer tokenizer = tokenizerFactory.create();
tokenizer.setReader(new StringReader(source));
assertSimpleTSOutput(tokenizer, expected);
}
// fix #59
@Test
public void testKuromojiEmptyUserDict() {
AnalysisService analysisService = createAnalysisService();
TokenizerFactory tokenizerFactory = analysisService.tokenizer("kuromoji_empty_user_dict");
assertThat(tokenizerFactory, instanceOf(KuromojiTokenizerFactory.class));
}
}

View File

@ -0,0 +1,90 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.index.analysis;
import org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.plugins.PluginsService;
import org.elasticsearch.test.ElasticsearchIntegrationTest;
import org.junit.Test;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.CoreMatchers.notNullValue;
@ElasticsearchIntegrationTest.ClusterScope(scope = ElasticsearchIntegrationTest.Scope.SUITE)
public class KuromojiIntegrationTests extends ElasticsearchIntegrationTest {
@Override
protected Settings nodeSettings(int nodeOrdinal) {
return Settings.builder()
.put(super.nodeSettings(nodeOrdinal))
.put("plugins." + PluginsService.LOAD_PLUGIN_FROM_CLASSPATH, true)
.build();
}
@Test
public void testKuromojiAnalyzer() throws ExecutionException, InterruptedException {
AnalyzeResponse response = client().admin().indices()
.prepareAnalyze("JR新宿駅の近くにビールを飲みに行こうか").setAnalyzer("kuromoji")
.execute().get();
String[] expectedTokens = {"jr", "新宿", "駅", "近く", "ビール", "飲む", "行く"};
assertThat(response, notNullValue());
assertThat(response.getTokens().size(), is(7));
for (int i = 0; i < expectedTokens.length; i++) {
assertThat(response.getTokens().get(i).getTerm(), is(expectedTokens[i]));
}
}
@Test
public void testKuromojiAnalyzerInMapping() throws ExecutionException, InterruptedException, IOException {
createIndex("test");
ensureGreen("test");
final XContentBuilder mapping = jsonBuilder().startObject()
.startObject("type")
.startObject("properties")
.startObject("foo")
.field("type", "string")
.field("analyzer", "kuromoji")
.endObject()
.endObject()
.endObject()
.endObject();
client().admin().indices().preparePutMapping("test").setType("type").setSource(mapping).get();
index("test", "type", "1", "foo", "JR新宿駅の近くにビールを飲みに行こうか");
refresh();
SearchResponse response = client().prepareSearch("test").setQuery(
QueryBuilders.matchQuery("foo", "jr")
).execute().actionGet();
assertThat(response.getHits().getTotalHits(), is(1L));
}
}

View File

@ -0,0 +1,62 @@
{
"index":{
"analysis":{
"filter":{
"kuromoji_rf":{
"type":"kuromoji_readingform",
"use_romaji" : "true"
},
"kuromoji_pos" : {
"type": "kuromoji_part_of_speech",
"stoptags" : ["# verb-main:", "動詞-自立"]
},
"kuromoji_ks" : {
"type": "kuromoji_stemmer",
"minimum_length" : 6
},
"ja_stop" : {
"type": "ja_stop",
"stopwords": ["_japanese_", "スピード"]
}
},
"char_filter":{
"kuromoji_im_only_kanji":{
"type":"kuromoji_iteration_mark",
"normalize_kanji":true,
"normalize_kana":false
},
"kuromoji_im_only_kana":{
"type":"kuromoji_iteration_mark",
"normalize_kanji":false,
"normalize_kana":true
},
"kuromoji_im_default":{
"type":"kuromoji_iteration_mark"
}
},
"tokenizer" : {
"kuromoji" : {
"type":"kuromoji_tokenizer"
},
"kuromoji_empty_user_dict" : {
"type":"kuromoji_tokenizer",
"user_dictionary":"org/elasticsearch/index/analysis/empty_user_dict.txt"
},
"kuromoji_user_dict" : {
"type":"kuromoji_tokenizer",
"user_dictionary":"org/elasticsearch/index/analysis/user_dict.txt"
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "kuromoji_tokenizer"
}
}
}
}
}

View File

@ -0,0 +1 @@
制限スピード,制限スピード,セイゲンスピード,テスト名詞