migrate branch for analysis-kuromoji

2015-06-05 13:12:09 +02:00 · 2015-06-05 13:12:09 +02:00 · 7294d27e5c
parent 96101d3e7e 9b41b94459
commit 7294d27e5c
20 changed files with 1721 additions and 0 deletions
--- a/plugins/analysis-kuromoji/README.md
+++ b/plugins/analysis-kuromoji/README.md
@ -0,0 +1,552 @@
 Japanese (kuromoji) Analysis for Elasticsearch
 ==================================
 The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis module into Elasticsearch.
 To install the plugin, run:
 ```sh
 bin/plugin install elasticsearch/elasticsearch-analysis-kuromoji/2.5.0
 ```
 You need to install a version matching your Elasticsearch version:
 | elasticsearch |  Kuromoji Analysis Plugin   |   Docs     |  
 |---------------|-----------------------------|------------|
 | master        |  Build from source          | See below  |
 | es-1.x        |  Build from source          | [2.6.0-SNAPSHOT](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/es-1.x/#version-260-snapshot-for-elasticsearch-1x)  |
 |    es-1.5              |     2.5.0         | [2.5.0](https://github.com/elastic/elasticsearch-analysis-kuromoji/tree/v2.5.0/#version-250-for-elasticsearch-15)                  |
 |    es-1.4              |     2.4.3         | [2.4.3](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.4.3/#version-243-for-elasticsearch-14)                  |
 | < 1.4.5       |  2.4.2                      | [2.4.2](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.4.2/#version-242-for-elasticsearch-14)                  |
 | < 1.4.3       |  2.4.1                      | [2.4.1](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.4.1/#version-241-for-elasticsearch-14)                  |
 | es-1.3        |  2.3.0                      | [2.3.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.3.0/#japanese-kuromoji-analysis-for-elasticsearch)  |
 | es-1.2        |  2.2.0                      | [2.2.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.2.0/#japanese-kuromoji-analysis-for-elasticsearch)  |
 | es-1.1        |  2.1.0                      | [2.1.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.1.0/#japanese-kuromoji-analysis-for-elasticsearch)  |
 | es-1.0        |  2.0.0                      | [2.0.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v2.0.0/#japanese-kuromoji-analysis-for-elasticsearch)  |
 | es-0.90       |  1.8.0                      | [1.8.0](https://github.com/elasticsearch/elasticsearch-analysis-kuromoji/tree/v1.8.0/#japanese-kuromoji-analysis-for-elasticsearch)  |
 To build a `SNAPSHOT` version, you need to build it with Maven:
 ```bash
 mvn clean install
 plugin --install analysis-kuromoji \
       --url file:target/releases/elasticsearch-analysis-kuromoji-X.X.X-SNAPSHOT.zip
 ```
 Includes Analyzer, Tokenizer, TokenFilter, CharFilter
 -----------------------------------------------
 The plugin includes the following analyzer, tokenizer, token filters, and char filter:
 | name                    | type        |
 |-------------------------|-------------|
 | kuromoji_iteration_mark | charfilter  |
 | kuromoji                | analyzer    |
 | kuromoji_tokenizer      | tokenizer   |
 | kuromoji_baseform       | tokenfilter |
 | kuromoji_part_of_speech | tokenfilter |
 | kuromoji_readingform    | tokenfilter |
 | kuromoji_stemmer        | tokenfilter |
 | ja_stop                 | tokenfilter |
 Usage
 -----
 ## Analyzer : kuromoji
 An analyzer of type `kuromoji`.
 This analyzer combines the following tokenizer and token filters:
 * `kuromoji_tokenizer` : Kuromoji Tokenizer
 * `kuromoji_baseform` : Kuromoji BaseForm Filter (TokenFilter)
 * `kuromoji_part_of_speech` : Kuromoji Part of Speech Stop Filter (TokenFilter)
 * `cjk_width` : CJK Width Filter (TokenFilter)
 * `stop` : Stop Filter (TokenFilter)
 * `kuromoji_stemmer` : Kuromoji Katakana Stemmer Filter (TokenFilter)
 * `lowercase` : LowerCase Filter (TokenFilter)
 ## CharFilter : kuromoji_iteration_mark
 A charfilter of type `kuromoji_iteration_mark`.
 This charfilter normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.
 The following settings can be set for a `kuromoji_iteration_mark` charfilter type:
 | **Setting**     | **Description**                                              | **Default value** |
 |:----------------|:-------------------------------------------------------------|:------------------|
 | normalize_kanji | indicates whether kanji iteration marks should be normalized | `true`            |
 | normalize_kana  | indicates whether kana iteration marks should be normalized  | `true`            |
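To make "expanded form" concrete, here is a minimal Java sketch of the idea, not Lucene's `JapaneseIterationMarkCharFilter`. It assumes that a run of *n* iteration marks repeats the *n* characters immediately before it, so `時々` becomes `時時` and `馬鹿々々しい` becomes `馬鹿馬鹿しい`. The class and method names are hypothetical, and voiced kana marks are deliberately left out.

```java
// Simplified illustration only; names are hypothetical and this is
// not the implementation used by the plugin or by Lucene.
final class IterationMarkSketch {
    private static final char KANJI_MARK = '\u3005';    // kanji iteration mark
    private static final char HIRAGANA_MARK = '\u309D'; // unvoiced hiragana iteration mark
    private static final char KATAKANA_MARK = '\u30FD'; // unvoiced katakana iteration mark

    // True when c is an iteration mark whose normalization is switched on.
    private static boolean isEnabledMark(char c, boolean normalizeKanji, boolean normalizeKana) {
        if (c == KANJI_MARK) {
            return normalizeKanji;
        }
        return (c == HIRAGANA_MARK || c == KATAKANA_MARK) && normalizeKana;
    }

    // Replaces each run of n enabled marks with the n characters preceding the run.
    static String normalize(String text, boolean normalizeKanji, boolean normalizeKana) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (!isEnabledMark(c, normalizeKanji, normalizeKana)) {
                out.append(c);
                i++;
                continue;
            }
            // Find the end of the run of consecutive marks.
            int runEnd = i;
            while (runEnd < text.length()
                    && isEnabledMark(text.charAt(runEnd), normalizeKanji, normalizeKana)) {
                runEnd++;
            }
            int runLength = runEnd - i;
            for (int k = i; k < runEnd; k++) {
                int source = k - runLength;
                // Keep the mark unchanged if there is nothing to repeat.
                out.append(source >= 0 ? text.charAt(source) : text.charAt(k));
            }
            i = runEnd;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(normalize(arg, true, true));
        }
    }
}
```

Passing `false` for either flag leaves the corresponding marks untouched, which mirrors the intent of the `normalize_kanji` and `normalize_kana` settings.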
 ## Tokenizer : kuromoji_tokenizer
 A tokenizer of type `kuromoji_tokenizer`.
 The following are settings that can be set for a `kuromoji_tokenizer` tokenizer type:
 | **Setting**         | **Description**                                                                                                               | **Default value** |
 |:--------------------|:------------------------------------------------------------------------------------------------------------------------------|:------------------|
 | mode                | Tokenization mode: determines how the tokenizer handles compound and unknown words. One of `normal`, `search`, or `extended`. | `search`          |
 | discard_punctuation | `true` if punctuation tokens should be dropped from the output.                                                               | `true`            |
 | user_dictionary     | Path to the user dictionary file.                                                                                             |                   |
 ### Tokenization mode
 There are three tokenization modes:
 * `normal` : Ordinary segmentation: no decomposition for compounds
 * `search` : Segmentation geared towards search: this includes a decompounding process for long nouns, also including the full compound token as a synonym.
 * `extended` : Extended mode outputs unigrams for unknown words.
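As a rough illustration of how a `mode` setting could be mapped to one of the three modes, with `search` as the fallback when nothing is configured, consider the Java sketch below. The enum and `parse` method are hypothetical and are not the plugin's `KuromojiTokenizerFactory` code. Rejecting unknown values with an exception is a choice made for this sketch only.

```java
import java.util.Locale;

// Hypothetical helper; not part of the plugin.
final class TokenizationModeSketch {
    enum Mode { NORMAL, SEARCH, EXTENDED }

    // Maps a configured value to a mode; a missing value falls back to SEARCH.
    static Mode parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Mode.SEARCH;
        }
        switch (value.trim().toLowerCase(Locale.ROOT)) {
            case "normal":
                return Mode.NORMAL;
            case "search":
                return Mode.SEARCH;
            case "extended":
                return Mode.EXTENDED;
            default:
                // Sketch-only behaviour: fail loudly on typos.
                throw new IllegalArgumentException("unknown mode: " + value);
        }
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(arg + " -> " + parse(arg));
        }
    }
}
```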
 #### Differences in output between tokenization modes
 The input texts are `関西国際空港` and `アブラカダブラ`.
 | **mode**   | `関西国際空港` | `アブラカダブラ` |
 |:-----------|:-------------|:-------|
 | `normal`   | `関西国際空港` | `アブラカダブラ` |
 | `search`   | `関西` `関西国際空港` `国際` `空港` | `アブラカダブラ` |
 | `extended` | `関西` `国際` `空港` | `ア` `ブ` `ラ` `カ` `ダ` `ブ` `ラ` |
 ### User Dictionary
 The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default.
 Users can define additional dictionary entries in a User Dictionary.
 User Dictionary entries are defined using the following CSV format:
 ```
 <text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
 ```
 Dictionary Example
 ```
 東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
 ```
 To use a User Dictionary, set the `user_dictionary` setting to the file path.
 The User Dictionary file is placed in the `ES_HOME/config` directory.
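The four-field CSV format can be checked before deployment with a small parser. The Java sketch below is hypothetical and is not Lucene's `UserDictionary` reader: it splits on plain commas without quote handling, and it assumes that the number of tokens must equal the number of readings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical validator for one user dictionary line; not part of the plugin.
final class UserDictionaryEntrySketch {
    final String text;
    final List<String> tokens;
    final List<String> readings;
    final String partOfSpeech;

    private UserDictionaryEntrySketch(String text, List<String> tokens,
                                      List<String> readings, String partOfSpeech) {
        this.text = text;
        this.tokens = tokens;
        this.readings = readings;
        this.partOfSpeech = partOfSpeech;
    }

    // Parses "<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>".
    static UserDictionaryEntrySketch parse(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 comma-separated fields: " + line);
        }
        List<String> tokens = Arrays.asList(fields[1].trim().split("\\s+"));
        List<String> readings = Arrays.asList(fields[2].trim().split("\\s+"));
        // Assumption of this sketch: every token needs exactly one reading.
        if (tokens.size() != readings.size()) {
            throw new IllegalArgumentException("token and reading counts differ: " + line);
        }
        return new UserDictionaryEntrySketch(fields[0].trim(), tokens, readings, fields[3].trim());
    }

    public static void main(String[] args) {
        for (String line : args) {
            UserDictionaryEntrySketch entry = parse(line);
            System.out.println(entry.text + " -> " + entry.tokens + " / " + entry.readings
                    + " (" + entry.partOfSpeech + ")");
        }
    }
}
```

Running it with the dictionary example above as a single quoted argument should report two tokens and two readings for `東京スカイツリー`.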
 ### example
 _Example Settings:_
 ```sh
 curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
 {
    "settings": {
        "index":{
            "analysis":{
                "tokenizer" : {
                    "kuromoji_user_dict" : {
                       "type" : "kuromoji_tokenizer",
                       "mode" : "extended",
                       "discard_punctuation" : "false",
                       "user_dictionary" : "userdict_ja.txt"
                    }
                },
                "analyzer" : {
                    "my_analyzer" : {
                        "type" : "custom",
                        "tokenizer" : "kuromoji_user_dict"
                    }
                }
            }
        }
    }
 }
 '
 ```
 _Example Request using `_analyze` API :_
 ```sh
 curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '東京スカイツリー'
 ```
 _Response :_
 ```json
 {
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
 }
 ```
 ## TokenFilter : kuromoji_baseform
 A token filter of type `kuromoji_baseform` that replaces term text with BaseFormAttribute.
 This acts as a lemmatizer for verbs and adjectives.
 ### example
 _Example Settings:_
 ```sh
 curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
 {
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "my_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["kuromoji_baseform"]
                    }
                }
            }
        }
    }
 }
 '
 ```
 _Example Request using `_analyze` API :_
 ```sh
 curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '飲み'
 ```
 _Response :_
 ```json
 {
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
 }
 ```
 ## TokenFilter : kuromoji_part_of_speech
 A token filter of type `kuromoji_part_of_speech` that removes tokens that match a set of part-of-speech tags.
 The following are settings that can be set for a `kuromoji_part_of_speech` token filter type:
 | **Setting** | **Description**                                      |
 |:------------|:-----------------------------------------------------|
 | stoptags    | A list of part-of-speech tags that should be removed |
 Note that the default setting is the `stoptags.txt` file included in `lucene-analyzers-kuromoji.jar`.
 ### example
 _Example Settings:_
 ```sh
 curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
 {
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "my_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["my_posfilter"]
                    }
                },
                "filter" : {
                    "my_posfilter" : {
                        "type" : "kuromoji_part_of_speech",
                        "stoptags" : [
                            "助詞-格助詞-一般",
                            "助詞-終助詞"
                        ]
                    }
                }
            }
        }
    }
 }
 '
 ```
 _Example Request using `_analyze` API :_
 ```sh
 curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '寿司がおいしいね'
 ```
 _Response :_
 ```json
 {
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 3
  } ]
 }
 ```
 ## TokenFilter : kuromoji_readingform
 A token filter of type `kuromoji_readingform` that replaces the term attribute with the reading of a token in either katakana or romaji form.
 The default reading form is katakana.
 The following are settings that can be set for a `kuromoji_readingform` token filter type:
 | **Setting** | **Description**                                           | **Default value** |
 |:------------|:----------------------------------------------------------|:------------------|
 | use_romaji  | `true` to output the reading in romaji instead of katakana. | `false`           |
 Note that the built-in `kuromoji_readingform` filter provided by elasticsearch-analysis-kuromoji sets `use_romaji` to `true` by default.
 ### example
 _Example Settings:_
 ```sh
 curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
 {
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "romaji_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["romaji_readingform"]
                    },
                    "katakana_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["katakana_readingform"]
                    }
                },
                "filter" : {
                    "romaji_readingform" : {
                        "type" : "kuromoji_readingform",
                        "use_romaji" : true
                    },
                    "katakana_readingform" : {
                        "type" : "kuromoji_readingform",
                        "use_romaji" : false
                    }
                }
            }
        }
    }
 }
 '
 ```
 _Example Request using `_analyze` API :_
 ```sh
 curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=katakana_analyzer&pretty' -d '寿司'
 ```
 _Response :_
 ```json
 {
  "tokens" : [ {
    "token" : "スシ",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
 }
 ```
 _Example Request using `_analyze` API :_
 ```sh
 curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=romaji_analyzer&pretty' -d '寿司'
 ```
 _Response :_
 ```json
 {
  "tokens" : [ {
    "token" : "sushi",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
 }
 ```
 ## TokenFilter : kuromoji_stemmer
 A token filter of type `kuromoji_stemmer` that normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC).
 Only katakana words of at least the minimum length are stemmed (default is four).
 Note that only full-width katakana characters are supported.
 The following are settings that can be set for a `kuromoji_stemmer` token filter type:
 | **Setting**     | **Description**            | **Default value** |
 |:----------------|:---------------------------|:------------------|
 | minimum_length  | The minimum length to stem | `4`               |
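The stemming rule can be sketched in a few lines of Java. This is an illustration with hypothetical names, not Lucene's `JapaneseKatakanaStemFilter`. It treats `minimum_length` as inclusive, which is what the `サーバー` example (four characters, `minimum_length` 4) suggests, and it checks for full-width katakana through the `KATAKANA` Unicode block.

```java
// Hypothetical illustration of the stemming rule; not the plugin's code.
final class KatakanaStemSketch {
    private static final char PROLONGED_SOUND_MARK = '\u30FC';

    // Drops one trailing prolonged sound mark from an all-katakana term
    // whose length is at least minimumLength.
    static String stem(String term, int minimumLength) {
        if (term.isEmpty() || term.length() < minimumLength) {
            return term;
        }
        if (!isAllKatakana(term)) {
            return term;
        }
        if (term.charAt(term.length() - 1) == PROLONGED_SOUND_MARK) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Half-width katakana lives in a different Unicode block and is rejected here.
    private static boolean isAllKatakana(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (Character.UnicodeBlock.of(term.charAt(i)) != Character.UnicodeBlock.KATAKANA) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(arg + " -> " + stem(arg, 4));
        }
    }
}
```

With a minimum length of 4, the three-character `コピー` is returned unchanged, while `サーバー` loses its final long sound character.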
 ### example
 _Example Settings:_
 ```sh
 curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
 {
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "my_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["my_katakana_stemmer"]
                    }
                },
                "filter" : {
                    "my_katakana_stemmer" : {
                        "type" : "kuromoji_stemmer",
                        "minimum_length" : 4
                    }
                }
            }
        }
    }
 }
 '
 ```
 _Example Request using `_analyze` API :_
 ```sh
 curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'コピー'
 ```
 _Response :_
 ```json
 {
  "tokens" : [ {
    "token" : "コピー",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  } ]
 }
 ```
 _Example Request using `_analyze` API :_
 ```sh
 curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'サーバー'
 ```
 _Response :_
 ```json
 {
  "tokens" : [ {
    "token" : "サーバ",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  } ]
 }
 ```
 ## TokenFilter : ja_stop
 A token filter of type `ja_stop` that provides the predefined "_japanese_" stop words.
 *Note: Only "_japanese_" is provided. If you want to use other predefined stop words, use the `stop` token filter.*
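A `stopwords` list can mix the named set `_japanese_` with literal words. The Java sketch below shows one way such a list could be resolved into a single set. It is a simplified stand-in with hypothetical names, not the Elasticsearch `Analysis.parseWords` helper, and the placeholder words in `main` are not the real Japanese stop words.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical resolver; not part of the plugin.
final class StopWordResolutionSketch {
    // Expands named sets such as "_japanese_" and keeps other entries as literal words.
    // When nothing is configured, the defaults are used.
    static Set<String> resolve(List<String> configured,
                               Map<String, Set<String>> namedSets,
                               Set<String> defaults) {
        if (configured == null) {
            return new HashSet<>(defaults);
        }
        Set<String> resolved = new HashSet<>();
        for (String word : configured) {
            Set<String> named = namedSets.get(word);
            if (named != null) {
                resolved.addAll(named);
            } else {
                resolved.add(word);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Placeholder words standing in for the real Japanese stop set.
        Set<String> japanese = new HashSet<>(Arrays.asList("placeholder-1", "placeholder-2"));
        Map<String, Set<String>> named = Collections.singletonMap("_japanese_", japanese);
        System.out.println(new TreeSet<>(resolve(Arrays.asList(args), named, japanese)));
    }
}
```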
 ### example
 _Example Settings:_
 ```sh
 curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
 {
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "analyzer_with_ja_stop" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["ja_stop"]
                    }
                },
                "filter" : {
                    "ja_stop" : {
                        "type" : "ja_stop",
                        "stopwords" : ["_japanese_", "ストップ"]
                    }
                }
            }
        }
    }
 }'
 ```
 _Example Request using `_analyze` API :_
 ```sh
 curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=analyzer_with_ja_stop&pretty' -d 'ストップは消える'
 ```
 _Response :_
 ```json
 {
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 3
  } ]
 }
 ```
 License
 -------
    This software is licensed under the Apache 2 license, quoted below.
    Copyright 2009-2014 Elasticsearch <http://www.elasticsearch.org>
    Licensed under the Apache License, Version 2.0 (the "License"); you may not
    use this file except in compliance with the License. You may obtain a copy of
    the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
    WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
    License for the specific language governing permissions and limitations under
    the License.
--- a/plugins/analysis-kuromoji/pom.xml
+++ b/plugins/analysis-kuromoji/pom.xml
@ -0,0 +1,40 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.elasticsearch.plugin</groupId>
    <artifactId>elasticsearch-analysis-kuromoji</artifactId>
    <packaging>jar</packaging>
    <name>Elasticsearch Japanese (kuromoji) Analysis plugin</name>
    <description>The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis module into elasticsearch.</description>
    <parent>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch-plugin</artifactId>
        <version>2.0.0-SNAPSHOT</version>
    </parent>
    <properties>
        <!-- You can add any specific project property here -->
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-kuromoji</artifactId>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
 </project>
--- a/plugins/analysis-kuromoji/src/main/assemblies/plugin.xml
+++ b/plugins/analysis-kuromoji/src/main/assemblies/plugin.xml
@ -0,0 +1,26 @@
 <?xml version="1.0"?>
 <assembly>
    <id>plugin</id>
    <formats>
        <format>zip</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <outputDirectory>/</outputDirectory>
            <useProjectArtifact>true</useProjectArtifact>
            <useTransitiveFiltering>true</useTransitiveFiltering>
            <excludes>
                <exclude>org.elasticsearch:elasticsearch</exclude>
            </excludes>
        </dependencySet>
        <dependencySet>
            <outputDirectory>/</outputDirectory>
            <useProjectArtifact>true</useProjectArtifact>
            <useTransitiveFiltering>true</useTransitiveFiltering>
            <includes>
                <include>org.apache.lucene:lucene-analyzers-kuromoji</include>
            </includes>
        </dependencySet>
    </dependencySets>
 </assembly>
--- a/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/JapaneseStopTokenFilterFactory.java
+++ b/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/JapaneseStopTokenFilterFactory.java
@ -0,0 +1,76 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.index.analysis;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.core.StopFilter;
 import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
 import org.apache.lucene.analysis.util.CharArraySet;
 import org.apache.lucene.search.suggest.analyzing.SuggestStopFilter;
 import org.elasticsearch.common.collect.MapBuilder;
 import org.elasticsearch.common.inject.Inject;
 import org.elasticsearch.common.inject.assistedinject.Assisted;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.env.Environment;
 import org.elasticsearch.index.Index;
 import org.elasticsearch.index.settings.IndexSettings;
 import java.util.Map;
 import java.util.Set;
 public class JapaneseStopTokenFilterFactory extends AbstractTokenFilterFactory {
    private final CharArraySet stopWords;
    private final boolean ignoreCase;
    private final boolean removeTrailing;
    @Inject
    public JapaneseStopTokenFilterFactory(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        this.ignoreCase = settings.getAsBoolean("ignore_case", false);
        this.removeTrailing = settings.getAsBoolean("remove_trailing", true);
        Map<String, Set<?>> namedStopWords = MapBuilder.<String, Set<?>>newMapBuilder()
            .put("_japanese_", JapaneseAnalyzer.getDefaultStopSet())
            .immutableMap();
        this.stopWords = Analysis.parseWords(env, settings, "stopwords", JapaneseAnalyzer.getDefaultStopSet(), namedStopWords, ignoreCase);
    }
    @Override
    public TokenStream create(TokenStream tokenStream) {
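         if (removeTrailing) {
             // Remove every stop word, including one at the end of the input.
             return new StopFilter(tokenStream, stopWords);
         } else {
             // Keep a trailing stop word that is not followed by a separator (useful for suggesters).
             return new SuggestStopFilter(tokenStream, stopWords);
         }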
    }
    public Set<?> stopWords() {
        return stopWords;
    }
    public boolean ignoreCase() {
        return ignoreCase;
    }
 }
--- a/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiAnalyzerProvider.java
+++ b/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiAnalyzerProvider.java
@ -0,0 +1,56 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.index.analysis;
 import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
 import org.apache.lucene.analysis.ja.JapaneseTokenizer;
 import org.apache.lucene.analysis.ja.dict.UserDictionary;
 import org.apache.lucene.analysis.util.CharArraySet;
 import org.elasticsearch.common.inject.Inject;
 import org.elasticsearch.common.inject.assistedinject.Assisted;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.env.Environment;
 import org.elasticsearch.index.Index;
 import org.elasticsearch.index.settings.IndexSettings;
 import java.util.Set;
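 /**
 * Provides a Lucene {@link JapaneseAnalyzer} configured from the index settings
 * (stop words, tokenization mode and user dictionary).
 */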
 public class KuromojiAnalyzerProvider extends AbstractIndexAnalyzerProvider<JapaneseAnalyzer> {
    private final JapaneseAnalyzer analyzer;
    @Inject
    public KuromojiAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        final Set<?> stopWords = Analysis.parseStopWords(env, settings, JapaneseAnalyzer.getDefaultStopSet());
        final JapaneseTokenizer.Mode mode = KuromojiTokenizerFactory.getMode(settings);
        final UserDictionary userDictionary = KuromojiTokenizerFactory.getUserDictionary(env, settings);
        analyzer = new JapaneseAnalyzer(userDictionary, mode, CharArraySet.copy(stopWords), JapaneseAnalyzer.getDefaultStopTags());
    }
    @Override
    public JapaneseAnalyzer get() {
        return this.analyzer;
    }
 }
--- a/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiBaseFormFilterFactory.java
+++ b/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiBaseFormFilterFactory.java
@ -0,0 +1,41 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.index.analysis;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.ja.JapaneseBaseFormFilter;
 import org.elasticsearch.common.inject.Inject;
 import org.elasticsearch.common.inject.assistedinject.Assisted;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.index.Index;
 import org.elasticsearch.index.settings.IndexSettings;
 public class KuromojiBaseFormFilterFactory extends AbstractTokenFilterFactory {
    @Inject
    public KuromojiBaseFormFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
    }
    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new JapaneseBaseFormFilter(tokenStream);
    }
 }
--- a/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiIterationMarkCharFilterFactory.java
+++ b/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiIterationMarkCharFilterFactory.java
@ -0,0 +1,48 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.index.analysis;
 import org.apache.lucene.analysis.ja.JapaneseIterationMarkCharFilter;
 import org.elasticsearch.common.inject.Inject;
 import org.elasticsearch.common.inject.assistedinject.Assisted;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.index.Index;
 import org.elasticsearch.index.settings.IndexSettings;
 import java.io.Reader;
 public class KuromojiIterationMarkCharFilterFactory extends AbstractCharFilterFactory {
    private final boolean normalizeKanji;
    private final boolean normalizeKana;
    @Inject
    public KuromojiIterationMarkCharFilterFactory(Index index, @IndexSettings Settings indexSettings,
                                                  @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name);
        normalizeKanji = settings.getAsBoolean("normalize_kanji", JapaneseIterationMarkCharFilter.NORMALIZE_KANJI_DEFAULT);
        normalizeKana = settings.getAsBoolean("normalize_kana", JapaneseIterationMarkCharFilter.NORMALIZE_KANA_DEFAULT);
    }
    @Override
    public Reader create(Reader reader) {
        return new JapaneseIterationMarkCharFilter(reader, normalizeKanji, normalizeKana);
    }
 }
--- a/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiKatakanaStemmerFactory.java
+++ b/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiKatakanaStemmerFactory.java
@ -0,0 +1,44 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.index.analysis;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.ja.JapaneseKatakanaStemFilter;
 import org.elasticsearch.common.inject.Inject;
 import org.elasticsearch.common.inject.assistedinject.Assisted;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.index.Index;
 import org.elasticsearch.index.settings.IndexSettings;
 public class KuromojiKatakanaStemmerFactory extends AbstractTokenFilterFactory {
    private final int minimumLength;
    @Inject
    public KuromojiKatakanaStemmerFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        minimumLength = settings.getAsInt("minimum_length", JapaneseKatakanaStemFilter.DEFAULT_MINIMUM_LENGTH);
    }
    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new JapaneseKatakanaStemFilter(tokenStream, minimumLength);
    }
 }
--- a/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiPartOfSpeechFilterFactory.java
+++ b/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiPartOfSpeechFilterFactory.java
@ -0,0 +1,53 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.index.analysis;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilter;
 import org.elasticsearch.common.inject.Inject;
 import org.elasticsearch.common.inject.assistedinject.Assisted;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.env.Environment;
 import org.elasticsearch.index.Index;
 import org.elasticsearch.index.settings.IndexSettings;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Set;
 public class KuromojiPartOfSpeechFilterFactory extends AbstractTokenFilterFactory {
    private final Set<String> stopTags = new HashSet<>();
    @Inject
    public KuromojiPartOfSpeechFilterFactory(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        List<String> wordList = Analysis.getWordList(env, settings, "stoptags");
        if (wordList != null) {
            stopTags.addAll(wordList);
        }
    }
    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new JapanesePartOfSpeechStopFilter(tokenStream, stopTags);
    }
 }
--- a/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiReadingFormFilterFactory.java
+++ b/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiReadingFormFilterFactory.java
@ -0,0 +1,44 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.index.analysis;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.ja.JapaneseReadingFormFilter;
 import org.elasticsearch.common.inject.Inject;
 import org.elasticsearch.common.inject.assistedinject.Assisted;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.index.Index;
 import org.elasticsearch.index.settings.IndexSettings;
 public class KuromojiReadingFormFilterFactory extends AbstractTokenFilterFactory {
    private final boolean useRomaji;
    @Inject
    public KuromojiReadingFormFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        useRomaji = settings.getAsBoolean("use_romaji", false);
    }
    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new JapaneseReadingFormFilter(tokenStream, useRomaji);
    }
 }
--- a/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiTokenizerFactory.java
+++ b/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/index/analysis/KuromojiTokenizerFactory.java
@ -0,0 +1,93 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.index.analysis;
 import org.apache.lucene.analysis.Tokenizer;
 import org.apache.lucene.analysis.ja.JapaneseTokenizer;
 import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
 import org.apache.lucene.analysis.ja.dict.UserDictionary;
 import org.elasticsearch.ElasticsearchException;
 import org.elasticsearch.common.inject.Inject;
 import org.elasticsearch.common.inject.assistedinject.Assisted;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.env.Environment;
 import org.elasticsearch.index.Index;
 import org.elasticsearch.index.settings.IndexSettings;
 import java.io.IOException;
 import java.io.Reader;
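 /**
 * Builds a {@link JapaneseTokenizer} from the {@code mode},
 * {@code user_dictionary} and {@code discard_punctuation} settings.
 */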
 public class KuromojiTokenizerFactory extends AbstractTokenizerFactory {
    private static final String USER_DICT_OPTION = "user_dictionary";
    private final UserDictionary userDictionary;
    private final Mode mode;
    private final boolean discardPunctuation;
    @Inject
    public KuromojiTokenizerFactory(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
        mode = getMode(settings);
        userDictionary = getUserDictionary(env, settings);
        discardPunctuation = settings.getAsBoolean("discard_punctuation", true);
    }
    // Loads the optional user dictionary; returns null when none is configured.
    public static UserDictionary getUserDictionary(Environment env, Settings settings) {
        try {
            final Reader reader = Analysis.getReaderFromFile(env, settings, USER_DICT_OPTION);
            if (reader == null) {
                return null;
            } else {
                try {
                    return UserDictionary.open(reader);
                } finally {
                    reader.close();
                }
            }
        } catch (IOException e) {
            throw new ElasticsearchException("failed to load kuromoji user dictionary", e);
        }
    }
    // Parses the "mode" setting case-insensitively; a missing or unrecognized
    // value silently falls back to JapaneseTokenizer.DEFAULT_MODE.
    public static JapaneseTokenizer.Mode getMode(Settings settings) {
        JapaneseTokenizer.Mode mode = JapaneseTokenizer.DEFAULT_MODE;
        String modeSetting = settings.get("mode", null);
        if (modeSetting != null) {
            if ("search".equalsIgnoreCase(modeSetting)) {
                mode = JapaneseTokenizer.Mode.SEARCH;
            } else if ("normal".equalsIgnoreCase(modeSetting)) {
                mode = JapaneseTokenizer.Mode.NORMAL;
            } else if ("extended".equalsIgnoreCase(modeSetting)) {
                mode = JapaneseTokenizer.Mode.EXTENDED;
            }
        }
        return mode;
    }
    @Override
    public Tokenizer create() {
        return new JapaneseTokenizer(userDictionary, discardPunctuation, mode);
    }
 }
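
Note on `mode`: `getMode` above matches the setting case-insensitively and quietly keeps the default for any value it does not recognize. A minimal sketch of that behavior follows, using a local enum as a stand-in for `JapaneseTokenizer.Mode` so that it compiles without Lucene. The class `ModeParsingSketch` is hypothetical, and treating `SEARCH` as the default is an assumption about `JapaneseTokenizer.DEFAULT_MODE`.

```java
// Illustrative sketch only: mirrors the fallback behavior of getMode without Lucene.
// ModeParsingSketch and its nested Mode enum are hypothetical stand-ins.
final class ModeParsingSketch {
    enum Mode { NORMAL, SEARCH, EXTENDED }

    // Assumption: stands in for JapaneseTokenizer.DEFAULT_MODE.
    static final Mode DEFAULT_MODE = Mode.SEARCH;

    // Returns the mode named by the setting, ignoring case; null or unknown
    // values fall back to the default instead of raising an error.
    static Mode parseMode(String setting) {
        if (setting != null) {
            for (Mode candidate : Mode.values()) {
                if (candidate.name().equalsIgnoreCase(setting)) {
                    return candidate;
                }
            }
        }
        return DEFAULT_MODE;
    }

    public static void main(String[] args) {
        System.out.println(parseMode("Extended")); // prints EXTENDED
        System.out.println(parseMode(null));       // prints SEARCH
        System.out.println(parseMode("serach"));   // prints SEARCH (typo is not reported)
    }
}
```

The last call shows the trade-off of this design: a misspelled mode is accepted without complaint, so a stricter variant might throw on unknown values instead.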
--- a/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/indices/analysis/KuromojiIndicesAnalysis.java
+++ b/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/indices/analysis/KuromojiIndicesAnalysis.java
@ -0,0 +1,131 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.indices.analysis;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.Tokenizer;
 import org.apache.lucene.analysis.ja.*;
 import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
 import org.elasticsearch.common.component.AbstractComponent;
 import org.elasticsearch.common.inject.Inject;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.index.analysis.*;
 import java.io.Reader;
 /**
 * Registers indices-level analysis components so that, unless explicitly
 * configured per index, they are shared among all indices.
 */
 public class KuromojiIndicesAnalysis extends AbstractComponent {
    @Inject
    public KuromojiIndicesAnalysis(Settings settings,
                                   IndicesAnalysisService indicesAnalysisService) {
        super(settings);
        indicesAnalysisService.analyzerProviderFactories().put("kuromoji",
                new PreBuiltAnalyzerProviderFactory("kuromoji", AnalyzerScope.INDICES,
                        new JapaneseAnalyzer()));
        indicesAnalysisService.charFilterFactories().put("kuromoji_iteration_mark",
                new PreBuiltCharFilterFactoryFactory(new CharFilterFactory() {
                    @Override
                    public String name() {
                        return "kuromoji_iteration_mark";
                    }
                    @Override
                    public Reader create(Reader reader) {
                        return new JapaneseIterationMarkCharFilter(reader,
                                JapaneseIterationMarkCharFilter.NORMALIZE_KANJI_DEFAULT,
                                JapaneseIterationMarkCharFilter.NORMALIZE_KANA_DEFAULT);
                    }
                }));
        indicesAnalysisService.tokenizerFactories().put("kuromoji_tokenizer",
                new PreBuiltTokenizerFactoryFactory(new TokenizerFactory() {
                    @Override
                    public String name() {
                        return "kuromoji_tokenizer";
                    }
                    @Override
                    public Tokenizer create() {
                        return new JapaneseTokenizer(null, true, Mode.SEARCH);
                    }
                }));
        indicesAnalysisService.tokenFilterFactories().put("kuromoji_baseform",
                new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
                    @Override
                    public String name() {
                        return "kuromoji_baseform";
                    }
                    @Override
                    public TokenStream create(TokenStream tokenStream) {
                        return new JapaneseBaseFormFilter(tokenStream);
                    }
                }));
        indicesAnalysisService.tokenFilterFactories().put(
                "kuromoji_part_of_speech",
                new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
                    @Override
                    public String name() {
                        return "kuromoji_part_of_speech";
                    }
                    @Override
                    public TokenStream create(TokenStream tokenStream) {
                        return new JapanesePartOfSpeechStopFilter(tokenStream, JapaneseAnalyzer
                                .getDefaultStopTags());
                    }
                }));
        indicesAnalysisService.tokenFilterFactories().put(
                "kuromoji_readingform",
                new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
                    @Override
                    public String name() {
                        return "kuromoji_readingform";
                    }
                    @Override
                    public TokenStream create(TokenStream tokenStream) {
                        return new JapaneseReadingFormFilter(tokenStream, true);
                    }
                }));
        indicesAnalysisService.tokenFilterFactories().put("kuromoji_stemmer",
                new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
                    @Override
                    public String name() {
                        return "kuromoji_stemmer";
                    }
                    @Override
                    public TokenStream create(TokenStream tokenStream) {
                        return new JapaneseKatakanaStemFilter(tokenStream);
                    }
                }));
    }
 }
--- a/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/indices/analysis/KuromojiIndicesAnalysisModule.java
+++ b/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/indices/analysis/KuromojiIndicesAnalysisModule.java
@ -0,0 +1,32 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.indices.analysis;
 import org.elasticsearch.common.inject.AbstractModule;
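 /**
 * Binds {@link KuromojiIndicesAnalysis} as an eager singleton so that the
 * indices-level Kuromoji components are registered when the module loads.
 */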
 public class KuromojiIndicesAnalysisModule extends AbstractModule {
    @Override
    protected void configure() {
        bind(KuromojiIndicesAnalysis.class).asEagerSingleton();
    }
 }
--- a/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/plugin/analysis/kuromoji/AnalysisKuromojiPlugin.java
+++ b/plugins/analysis-kuromoji/src/main/java/org/elasticsearch/plugin/analysis/kuromoji/AnalysisKuromojiPlugin.java
@ -0,0 +1,62 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.plugin.analysis.kuromoji;
 import org.elasticsearch.common.inject.Module;
 import org.elasticsearch.index.analysis.*;
 import org.elasticsearch.indices.analysis.KuromojiIndicesAnalysisModule;
 import org.elasticsearch.plugins.AbstractPlugin;
 import java.util.ArrayList;
 import java.util.Collection;
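 /**
 * Plugin entry point: registers the Kuromoji analyzer, tokenizer,
 * char filter and token filters with the analysis module.
 */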
 public class AnalysisKuromojiPlugin extends AbstractPlugin {
    @Override
    public String name() {
        return "analysis-kuromoji";
    }
    @Override
    public String description() {
        return "Kuromoji analysis support";
    }
    @Override
    public Collection<Class<? extends Module>> modules() {
        Collection<Class<? extends Module>> classes = new ArrayList<>();
        classes.add(KuromojiIndicesAnalysisModule.class);
        return classes;
    }
    public void onModule(AnalysisModule module) {
        module.addCharFilter("kuromoji_iteration_mark", KuromojiIterationMarkCharFilterFactory.class);
        module.addAnalyzer("kuromoji", KuromojiAnalyzerProvider.class);
        module.addTokenizer("kuromoji_tokenizer", KuromojiTokenizerFactory.class);
        module.addTokenFilter("kuromoji_baseform", KuromojiBaseFormFilterFactory.class);
        module.addTokenFilter("kuromoji_part_of_speech", KuromojiPartOfSpeechFilterFactory.class);
        module.addTokenFilter("kuromoji_readingform", KuromojiReadingFormFilterFactory.class);
        module.addTokenFilter("kuromoji_stemmer", KuromojiKatakanaStemmerFactory.class);
        module.addTokenFilter("ja_stop", JapaneseStopTokenFilterFactory.class);
    }
 }
--- a/plugins/analysis-kuromoji/src/main/resources/es-plugin.properties
+++ b/plugins/analysis-kuromoji/src/main/resources/es-plugin.properties
@ -0,0 +1,3 @@
 plugin=org.elasticsearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin
 version=${project.version}
 lucene=${lucene.version}
--- a/plugins/analysis-kuromoji/src/test/java/org/elasticsearch/index/analysis/KuromojiAnalysisTests.java
+++ b/plugins/analysis-kuromoji/src/test/java/org/elasticsearch/index/analysis/KuromojiAnalysisTests.java
@ -0,0 +1,267 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.index.analysis;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.Tokenizer;
 import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
 import org.apache.lucene.analysis.ja.JapaneseTokenizer;
 import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
 import org.elasticsearch.Version;
 import org.elasticsearch.cluster.metadata.IndexMetaData;
 import org.elasticsearch.common.inject.Injector;
 import org.elasticsearch.common.inject.ModulesBuilder;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.common.settings.SettingsModule;
 import org.elasticsearch.env.Environment;
 import org.elasticsearch.env.EnvironmentModule;
 import org.elasticsearch.index.Index;
 import org.elasticsearch.index.IndexNameModule;
 import org.elasticsearch.index.settings.IndexSettingsModule;
 import org.elasticsearch.indices.analysis.IndicesAnalysisModule;
 import org.elasticsearch.indices.analysis.IndicesAnalysisService;
 import org.elasticsearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin;
 import org.elasticsearch.test.ElasticsearchTestCase;
 import org.junit.Test;
 import java.io.IOException;
 import java.io.Reader;
 import java.io.StringReader;
 import static org.hamcrest.Matchers.*;
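 /**
 * Unit tests for the Kuromoji analysis components, configured from
 * {@code kuromoji_analysis.json}.
 */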
 public class KuromojiAnalysisTests extends ElasticsearchTestCase {
    @Test
    public void testDefaultsKuromojiAnalysis() throws IOException {
        AnalysisService analysisService = createAnalysisService();
        TokenizerFactory tokenizerFactory = analysisService.tokenizer("kuromoji_tokenizer");
        assertThat(tokenizerFactory, instanceOf(KuromojiTokenizerFactory.class));
        TokenFilterFactory filterFactory = analysisService.tokenFilter("kuromoji_part_of_speech");
        assertThat(filterFactory, instanceOf(KuromojiPartOfSpeechFilterFactory.class));
        filterFactory = analysisService.tokenFilter("kuromoji_readingform");
        assertThat(filterFactory, instanceOf(KuromojiReadingFormFilterFactory.class));
        filterFactory = analysisService.tokenFilter("kuromoji_baseform");
        assertThat(filterFactory, instanceOf(KuromojiBaseFormFilterFactory.class));
        filterFactory = analysisService.tokenFilter("kuromoji_stemmer");
        assertThat(filterFactory, instanceOf(KuromojiKatakanaStemmerFactory.class));
        filterFactory = analysisService.tokenFilter("ja_stop");
        assertThat(filterFactory, instanceOf(JapaneseStopTokenFilterFactory.class));
        NamedAnalyzer analyzer = analysisService.analyzer("kuromoji");
        assertThat(analyzer.analyzer(), instanceOf(JapaneseAnalyzer.class));
        analyzer = analysisService.analyzer("my_analyzer");
        assertThat(analyzer.analyzer(), instanceOf(CustomAnalyzer.class));
        assertThat(analyzer.analyzer().tokenStream(null, new StringReader("")), instanceOf(JapaneseTokenizer.class));
        CharFilterFactory charFilterFactory = analysisService.charFilter("kuromoji_iteration_mark");
        assertThat(charFilterFactory, instanceOf(KuromojiIterationMarkCharFilterFactory.class));
    }
    // Despite its name, this test exercises the part-of-speech stop filter ("kuromoji_pos").
    @Test
    public void testBaseFormFilterFactory() throws IOException {
        AnalysisService analysisService = createAnalysisService();
        TokenFilterFactory tokenFilter = analysisService.tokenFilter("kuromoji_pos");
        assertThat(tokenFilter, instanceOf(KuromojiPartOfSpeechFilterFactory.class));
        String source = "私は制限スピードを超える。";
        String[] expected = new String[]{"私", "は", "制限", "スピード", "を"};
        Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        tokenizer.setReader(new StringReader(source));
        assertSimpleTSOutput(tokenFilter.create(tokenizer), expected);
    }
    @Test
    public void testReadingFormFilterFactory() throws IOException {
        AnalysisService analysisService = createAnalysisService();
        TokenFilterFactory tokenFilter = analysisService.tokenFilter("kuromoji_rf");
        assertThat(tokenFilter, instanceOf(KuromojiReadingFormFilterFactory.class));
        String source = "今夜はロバート先生と話した";
        String[] expected_tokens_romaji = new String[]{"kon'ya", "ha", "robato", "sensei", "to", "hanashi", "ta"};
        Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        tokenizer.setReader(new StringReader(source));
        assertSimpleTSOutput(tokenFilter.create(tokenizer), expected_tokens_romaji);
        tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        tokenizer.setReader(new StringReader(source));
        String[] expected_tokens_katakana = new String[]{"コンヤ", "ハ", "ロバート", "センセイ", "ト", "ハナシ", "タ"};
        tokenFilter = analysisService.tokenFilter("kuromoji_readingform");
        assertThat(tokenFilter, instanceOf(KuromojiReadingFormFilterFactory.class));
        assertSimpleTSOutput(tokenFilter.create(tokenizer), expected_tokens_katakana);
    }
    @Test
    public void testKatakanaStemFilter() throws IOException {
        AnalysisService analysisService = createAnalysisService();
        TokenFilterFactory tokenFilter = analysisService.tokenFilter("kuromoji_stemmer");
        assertThat(tokenFilter, instanceOf(KuromojiKatakanaStemmerFactory.class));
        String source = "明後日パーティーに行く予定がある。図書館で資料をコピーしました。";
        Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        tokenizer.setReader(new StringReader(source));
        // パーティー should be stemmed by default
        // (min len) コピー should not be stemmed
        String[] expected_tokens_katakana = new String[]{"明後日", "パーティ", "に", "行く", "予定", "が", "ある", "図書館", "で", "資料", "を", "コピー", "し", "まし", "た"};
        assertSimpleTSOutput(tokenFilter.create(tokenizer), expected_tokens_katakana);
        tokenFilter = analysisService.tokenFilter("kuromoji_ks");
        assertThat(tokenFilter, instanceOf(KuromojiKatakanaStemmerFactory.class));
        tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        tokenizer.setReader(new StringReader(source));
        // パーティー should not be stemmed since min len == 6
        // コピー should not be stemmed
        expected_tokens_katakana = new String[]{"明後日", "パーティー", "に", "行く", "予定", "が", "ある", "図書館", "で", "資料", "を", "コピー", "し", "まし", "た"};
        assertSimpleTSOutput(tokenFilter.create(tokenizer), expected_tokens_katakana);
    }
    @Test
    public void testIterationMarkCharFilter() throws IOException {
        AnalysisService analysisService = createAnalysisService();
        // test only kanji
        CharFilterFactory charFilterFactory = analysisService.charFilter("kuromoji_im_only_kanji");
        assertNotNull(charFilterFactory);
        assertThat(charFilterFactory, instanceOf(KuromojiIterationMarkCharFilterFactory.class));
        String source = "ところゞゝゝ、ジヾが、時々、馬鹿々々しい";
        String expected = "ところゞゝゝ、ジヾが、時時、馬鹿馬鹿しい";
        assertCharFilterEquals(charFilterFactory.create(new StringReader(source)), expected);
        // test only kana
        charFilterFactory = analysisService.charFilter("kuromoji_im_only_kana");
        assertNotNull(charFilterFactory);
        assertThat(charFilterFactory, instanceOf(KuromojiIterationMarkCharFilterFactory.class));
        expected = "ところどころ、ジジが、時々、馬鹿々々しい";
        assertCharFilterEquals(charFilterFactory.create(new StringReader(source)), expected);
        // test default
        charFilterFactory = analysisService.charFilter("kuromoji_im_default");
        assertNotNull(charFilterFactory);
        assertThat(charFilterFactory, instanceOf(KuromojiIterationMarkCharFilterFactory.class));
        expected = "ところどころ、ジジが、時時、馬鹿馬鹿しい";
        assertCharFilterEquals(charFilterFactory.create(new StringReader(source)), expected);
    }
    @Test
    public void testJapaneseStopFilterFactory() throws IOException {
        AnalysisService analysisService = createAnalysisService();
        TokenFilterFactory tokenFilter = analysisService.tokenFilter("ja_stop");
        assertThat(tokenFilter, instanceOf(JapaneseStopTokenFilterFactory.class));
        String source = "私は制限スピードを超える。";
        String[] expected = new String[]{"私", "制限", "超える"};
        Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        tokenizer.setReader(new StringReader(source));
        assertSimpleTSOutput(tokenFilter.create(tokenizer), expected);
    }
    public AnalysisService createAnalysisService() {
        Settings settings = Settings.settingsBuilder()
                .put("path.home", createTempDir())
                .loadFromClasspath("org/elasticsearch/index/analysis/kuromoji_analysis.json")
                .put(IndexMetaData.SETTING_VERSION_CREATED, Version.CURRENT)
                .build();
        Index index = new Index("test");
        Injector parentInjector = new ModulesBuilder().add(new SettingsModule(settings),
                new EnvironmentModule(new Environment(settings)),
                new IndicesAnalysisModule())
                .createInjector();
        AnalysisModule analysisModule = new AnalysisModule(settings, parentInjector.getInstance(IndicesAnalysisService.class));
        new AnalysisKuromojiPlugin().onModule(analysisModule);
        Injector injector = new ModulesBuilder().add(
                new IndexSettingsModule(index, settings),
                new IndexNameModule(index),
                analysisModule)
                .createChildInjector(parentInjector);
        return injector.getInstance(AnalysisService.class);
    }
    public static void assertSimpleTSOutput(TokenStream stream,
                                            String[] expected) throws IOException {
        stream.reset();
        CharTermAttribute termAttr = stream.getAttribute(CharTermAttribute.class);
        assertThat(termAttr, notNullValue());
        int i = 0;
        while (stream.incrementToken()) {
            assertThat(expected.length, greaterThan(i));
            // Hamcrest expects the actual value first and the matcher on the expected value second.
            assertThat("expected different term at index " + i, termAttr.toString(), equalTo(expected[i++]));
        }
        assertThat("not all tokens produced", i, equalTo(expected.length));
    }
    private void assertCharFilterEquals(Reader filtered,
                                        String expected) throws IOException {
        String actual = readFully(filtered);
        assertThat(actual, equalTo(expected));
    }
    private String readFully(Reader reader) throws IOException {
        StringBuilder buffer = new StringBuilder();
        int ch;
        while ((ch = reader.read()) != -1) {
            buffer.append((char) ch);
        }
        return buffer.toString();
    }
    @Test
    public void testKuromojiUserDict() throws IOException {
        AnalysisService analysisService = createAnalysisService();
        TokenizerFactory tokenizerFactory = analysisService.tokenizer("kuromoji_user_dict");
        String source = "私は制限スピードを超える。";
        String[] expected = new String[]{"私", "は", "制限スピード", "を", "超える"};
        Tokenizer tokenizer = tokenizerFactory.create();
        tokenizer.setReader(new StringReader(source));
        assertSimpleTSOutput(tokenizer, expected);
    }
    // Regression test for #59: an empty user dictionary file should still yield a tokenizer factory.
    @Test
    public void testKuromojiEmptyUserDict() {
        AnalysisService analysisService = createAnalysisService();
        TokenizerFactory tokenizerFactory = analysisService.tokenizer("kuromoji_empty_user_dict");
        assertThat(tokenizerFactory, instanceOf(KuromojiTokenizerFactory.class));
    }
 }
--- a/plugins/analysis-kuromoji/src/test/java/org/elasticsearch/index/analysis/KuromojiIntegrationTests.java
+++ b/plugins/analysis-kuromoji/src/test/java/org/elasticsearch/index/analysis/KuromojiIntegrationTests.java
@ -0,0 +1,90 @@
 /*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
 package org.elasticsearch.index.analysis;
 import org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse;
 import org.elasticsearch.action.search.SearchResponse;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.common.xcontent.XContentBuilder;
 import org.elasticsearch.index.query.QueryBuilders;
 import org.elasticsearch.plugins.PluginsService;
 import org.elasticsearch.test.ElasticsearchIntegrationTest;
 import org.junit.Test;
 import java.io.IOException;
 import java.util.concurrent.ExecutionException;
 import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
 import static org.hamcrest.CoreMatchers.is;
 import static org.hamcrest.CoreMatchers.notNullValue;
 @ElasticsearchIntegrationTest.ClusterScope(scope = ElasticsearchIntegrationTest.Scope.SUITE)
 public class KuromojiIntegrationTests extends ElasticsearchIntegrationTest {
    @Override
    protected Settings nodeSettings(int nodeOrdinal) {
        return Settings.builder()
                .put(super.nodeSettings(nodeOrdinal))
                .put("plugins." + PluginsService.LOAD_PLUGIN_FROM_CLASSPATH, true)
                .build();
    }
    @Test
    public void testKuromojiAnalyzer() throws ExecutionException, InterruptedException {
        AnalyzeResponse response = client().admin().indices()
                .prepareAnalyze("JR新宿駅の近くにビールを飲みに行こうか").setAnalyzer("kuromoji")
                .execute().get();
        String[] expectedTokens = {"jr", "新宿", "駅", "近く", "ビール", "飲む", "行く"};
        assertThat(response, notNullValue());
        assertThat(response.getTokens().size(), is(7));
        for (int i = 0; i < expectedTokens.length; i++) {
            assertThat(response.getTokens().get(i).getTerm(), is(expectedTokens[i]));
        }
    }
    @Test
    public void testKuromojiAnalyzerInMapping() throws ExecutionException, InterruptedException, IOException {
        createIndex("test");
        ensureGreen("test");
        final XContentBuilder mapping = jsonBuilder().startObject()
                .startObject("type")
                .startObject("properties")
                .startObject("foo")
                .field("type", "string")
                .field("analyzer", "kuromoji")
                .endObject()
                .endObject()
                .endObject()
                .endObject();
        client().admin().indices().preparePutMapping("test").setType("type").setSource(mapping).get();
        index("test", "type", "1", "foo", "JR新宿駅の近くにビールを飲みに行こうか");
        refresh();
        SearchResponse response = client().prepareSearch("test").setQuery(
                QueryBuilders.matchQuery("foo", "jr")
        ).execute().actionGet();
        assertThat(response.getHits().getTotalHits(), is(1L));
    }
 }
--- a/plugins/analysis-kuromoji/src/test/java/org/elasticsearch/index/analysis/empty_user_dict.txt
+++ b/plugins/analysis-kuromoji/src/test/java/org/elasticsearch/index/analysis/empty_user_dict.txt
--- a/plugins/analysis-kuromoji/src/test/java/org/elasticsearch/index/analysis/kuromoji_analysis.json
+++ b/plugins/analysis-kuromoji/src/test/java/org/elasticsearch/index/analysis/kuromoji_analysis.json
@ -0,0 +1,62 @@
 {
    "index":{
        "analysis":{
            "filter":{
                "kuromoji_rf":{
                    "type":"kuromoji_readingform",
                    "use_romaji" : "true"
                },
                "kuromoji_pos" : {
                    "type": "kuromoji_part_of_speech",
                    "stoptags" : ["#  verb-main:", "動詞-自立"]
                },
                "kuromoji_ks" : {
                    "type": "kuromoji_stemmer",
                    "minimum_length" : 6
                },
                "ja_stop" : {
                    "type": "ja_stop",
                    "stopwords": ["_japanese_", "スピード"]
                }
            },
            "char_filter":{
                "kuromoji_im_only_kanji":{
                    "type":"kuromoji_iteration_mark",
                    "normalize_kanji":true,
                    "normalize_kana":false
                },
                "kuromoji_im_only_kana":{
                    "type":"kuromoji_iteration_mark",
                    "normalize_kanji":false,
                    "normalize_kana":true
                },
                "kuromoji_im_default":{
                    "type":"kuromoji_iteration_mark"
                }
            },
            "tokenizer" : {
                "kuromoji" : {
                    "type":"kuromoji_tokenizer"
                },
                "kuromoji_empty_user_dict" : {
                    "type":"kuromoji_tokenizer",
                    "user_dictionary":"org/elasticsearch/index/analysis/empty_user_dict.txt"
                },
                "kuromoji_user_dict" : {
                    "type":"kuromoji_tokenizer",
                    "user_dictionary":"org/elasticsearch/index/analysis/user_dict.txt"
                }
            },
            "analyzer" : {
                "my_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "kuromoji_tokenizer"
                }
            }
        }
    }
 }
--- a/plugins/analysis-kuromoji/src/test/java/org/elasticsearch/index/analysis/user_dict.txt
+++ b/plugins/analysis-kuromoji/src/test/java/org/elasticsearch/index/analysis/user_dict.txt
@ -0,0 +1 @@
 制限スピード,制限スピード,セイゲンスピード,テスト名詞
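
The single entry in `user_dict.txt` above follows the four-column layout used by kuromoji user dictionaries: surface form, segmentation (space-separated tokens), readings (space-separated, one per token), and a part-of-speech label. The sketch below only illustrates how such a line breaks down. It is not the parser Lucene or the plugin uses, and the names `UserDictSketch`, `Entry`, `parseLine`, and `parse` are hypothetical. It assumes that blank lines and lines starting with `#` can be skipped and that no field contains a comma.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a simplified reader for kuromoji-style user dictionary lines.
// Class, field, and method names are assumptions made for this sketch.
class UserDictSketch {

    /** One parsed dictionary entry. */
    static final class Entry {
        final String surface;          // text to match in the input
        final List<String> segments;   // how the surface form is split into tokens
        final List<String> readings;   // one reading per segment
        final String partOfSpeech;     // custom part-of-speech label

        Entry(String surface, List<String> segments, List<String> readings, String partOfSpeech) {
            this.surface = surface;
            this.segments = segments;
            this.readings = readings;
            this.partOfSpeech = partOfSpeech;
        }
    }

    /** Parses a single "surface,segmentation,readings,pos" line. */
    static Entry parseLine(String line) {
        String[] cols = line.trim().split(",");
        if (cols.length != 4) {
            throw new IllegalArgumentException("expected 4 columns but got " + cols.length);
        }
        List<String> segments = Arrays.asList(cols[1].trim().split("\\s+"));
        List<String> readings = Arrays.asList(cols[2].trim().split("\\s+"));
        if (segments.size() != readings.size()) {
            throw new IllegalArgumentException("segment and reading counts differ");
        }
        return new Entry(cols[0].trim(), segments, readings, cols[3].trim());
    }

    /** Parses many lines, skipping blank lines and lines that start with '#'. */
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            entries.add(parseLine(trimmed));
        }
        return entries;
    }

    public static void main(String[] args) {
        Entry e = parseLine("制限スピード,制限スピード,セイゲンスピード,テスト名詞");
        System.out.println(e.surface + " -> " + e.segments + " / " + e.partOfSpeech);
    }
}
```

In the fixture entry the segmentation column repeats the surface form, so the whole string is kept as a single token with the custom part-of-speech label `テスト名詞`.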