OpenSearch/plugins/analysis-kuromoji
David Pilato e429b8d190 [build] include in plugins only needed jars
Follow up for https://github.com/elastic/elasticsearch-analysis-kuromoji/issues/61

We no longer shade Elasticsearch dependencies, so plugins might include jars in the distribution ZIP file that are not needed anymore.

For example, `elasticsearch-cloud-aws` comes with:

```
Archive:  cloud-aws/target/releases/elasticsearch-cloud-aws-2.0.0-SNAPSHOT.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
  1920788  05-18-15 09:42   aws-java-sdk-ec2-1.9.34.jar
   503963  05-18-15 09:42   aws-java-sdk-core-1.9.34.jar
   232771  01-19-15 09:24   commons-codec-1.6.jar
   915096  01-19-15 09:24   jackson-databind-2.3.2.jar
   252288  05-18-15 09:42   aws-java-sdk-kms-1.9.34.jar
    62050  01-19-15 09:24   commons-logging-1.1.3.jar
   282269  10-31-14 13:19   httpcore-4.3.2.jar
    35058  01-19-15 09:24   jackson-annotations-2.3.0.jar
   229998  05-29-15 12:28   jackson-core-2.5.3.jar
   589289  01-19-15 09:24   joda-time-2.7.jar
   562858  05-18-15 09:42   aws-java-sdk-s3-1.9.34.jar
   590533  10-31-14 13:19   httpclient-4.3.5.jar
    44854  06-12-15 19:22   elasticsearch-cloud-aws-2.0.0-SNAPSHOT.jar
 --------                   -------
  6221815                   13 files
```

Many of those files are already distributed with Elasticsearch itself, so their classes are already available within the classloader.

We mark all ES core dependencies as provided in plugins.
We also remove `groupId`, as it is already defined in the parent pom.
And we remove unneeded license files, as some jars are no longer included in plugins.

Closes #11647.
2015-07-01 21:37:27 +02:00
licenses [build] include in plugins only needed jars 2015-07-01 21:37:27 +02:00
src [build] include in plugins only needed jars 2015-07-01 21:37:27 +02:00
LICENSE.txt Added LICENSE and NOTICE files for all plugins 2015-06-23 12:50:31 +02:00
NOTICE.txt Added LICENSE and NOTICE files for all plugins 2015-06-23 12:50:31 +02:00
README.md add analysis-kuromoji module 2015-06-05 13:12:07 +02:00
pom.xml add analysis-kuromoji module 2015-06-05 13:12:07 +02:00

README.md

Japanese (kuromoji) Analysis for Elasticsearch

The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis module into Elasticsearch.

In order to install the plugin, run:

bin/plugin install elasticsearch/elasticsearch-analysis-kuromoji/2.5.0

You need to install a version matching your Elasticsearch version:

| elasticsearch | Kuromoji Analysis Plugin | Docs |
|---|---|---|
| master | Build from source | See below |
| es-1.x | Build from source | 2.6.0-SNAPSHOT |
| es-1.5 | 2.5.0 | 2.5.0 |
| es-1.4 | 2.4.3 | 2.4.3 |
| < 1.4.5 | 2.4.2 | 2.4.2 |
| < 1.4.3 | 2.4.1 | 2.4.1 |
| es-1.3 | 2.3.0 | 2.3.0 |
| es-1.2 | 2.2.0 | 2.2.0 |
| es-1.1 | 2.1.0 | 2.1.0 |
| es-1.0 | 2.0.0 | 2.0.0 |
| es-0.90 | 1.8.0 | 1.8.0 |

To build a SNAPSHOT version, you need to build it with Maven:

mvn clean install
plugin --install analysis-kuromoji \
       --url file:target/releases/elasticsearch-analysis-kuromoji-X.X.X-SNAPSHOT.zip

Includes Analyzer, Tokenizer, TokenFilter, CharFilter

The plugin includes the following analyzer, tokenizer, token filters, and char filter.

| name | type |
|---|---|
| kuromoji_iteration_mark | charfilter |
| kuromoji | analyzer |
| kuromoji_tokenizer | tokenizer |
| kuromoji_baseform | tokenfilter |
| kuromoji_part_of_speech | tokenfilter |
| kuromoji_readingform | tokenfilter |
| kuromoji_stemmer | tokenfilter |
| ja_stop | tokenfilter |

Usage

Analyzer : kuromoji

An analyzer of type kuromoji. This analyzer combines the following tokenizer and token filters:

  • kuromoji_tokenizer : Kuromoji Tokenizer
  • kuromoji_baseform : Kuromoji BaseFormFilter (TokenFilter)
  • kuromoji_part_of_speech : Kuromoji Part of Speech Stop Filter (TokenFilter)
  • cjk_width : CJK Width Filter (TokenFilter)
  • stop : Stop Filter (TokenFilter)
  • kuromoji_stemmer : Kuromoji Katakana Stemmer Filter (TokenFilter)
  • lowercase : LowerCase Filter (TokenFilter)

CharFilter : kuromoji_iteration_mark

A charfilter of type kuromoji_iteration_mark. This charfilter normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.

The following are settings that can be set for a kuromoji_iteration_mark charfilter type:

| Setting | Description | Default value |
|---|---|---|
| normalize_kanji | Indicates whether kanji iteration marks should be normalized | true |
| normalize_kana | Indicates whether kana iteration marks should be normalized | true |
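
As a minimal sketch of how these settings might be used, the Python snippet below builds an index settings body that registers a charfilter of this type and attaches it to a custom analyzer. The names `my_iteration_mark` and `my_analyzer` are hypothetical, and the choice to disable kana normalization is only for illustration. The resulting JSON could be sent as the body of an index-creation request like the curl examples elsewhere in this README.

```python
import json

# Hypothetical names: "my_iteration_mark" and "my_analyzer" are illustrative only.
settings = {
    "settings": {
        "index": {
            "analysis": {
                "char_filter": {
                    "my_iteration_mark": {
                        "type": "kuromoji_iteration_mark",
                        "normalize_kanji": True,   # expand kanji iteration marks
                        "normalize_kana": False,   # leave kana iteration marks untouched
                    }
                },
                "analyzer": {
                    "my_analyzer": {
                        "type": "custom",
                        "tokenizer": "kuromoji_tokenizer",
                        "char_filter": ["my_iteration_mark"],
                    }
                },
            }
        }
    }
}

# Serialize to the JSON body for the index-creation request.
body = json.dumps(settings, ensure_ascii=False, indent=4)
print(body)
```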

Tokenizer : kuromoji_tokenizer

A tokenizer of type kuromoji_tokenizer.

The following are settings that can be set for a kuromoji_tokenizer tokenizer type:

| Setting | Description | Default value |
|---|---|---|
| mode | Tokenization mode: this determines how the tokenizer handles compound and unknown words. One of normal, search, or extended. | search |
| discard_punctuation | true if punctuation tokens should be dropped from the output. | true |
| user_dictionary | Path to the User Dictionary file. | |

Tokenization mode

There are three modes:

  • normal : Ordinary segmentation: no decomposition for compounds

  • search : Segmentation geared towards search: this includes a decompounding process for long nouns, also including the full compound token as a synonym.

  • extended : Extended mode outputs unigrams for unknown words.

Differences in output between tokenization modes

The input texts are 関西国際空港 and アブラカダブラ.

| mode | 関西国際空港 | アブラカダブラ |
|---|---|---|
| normal | 関西国際空港 | アブラカダブラ |
| search | 関西 関西国際空港 国際 空港 | アブラカダブラ |
| extended | 関西 国際 空港 | ア ブ ラ カ ダ ブ ラ |
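
To compare the modes side by side on a real index, one option is to define one tokenizer and one analyzer per mode. The Python sketch below builds such a settings body. The names `kuromoji_<mode>` and `<mode>_analyzer` are hypothetical, chosen only for this illustration.

```python
import json

MODES = ("normal", "search", "extended")


def build_mode_settings(modes=MODES):
    """Return index settings with one tokenizer and one analyzer per mode."""
    tokenizers = {}
    analyzers = {}
    for mode in modes:
        # Hypothetical naming scheme: kuromoji_<mode> / <mode>_analyzer.
        tokenizers[f"kuromoji_{mode}"] = {
            "type": "kuromoji_tokenizer",
            "mode": mode,
        }
        analyzers[f"{mode}_analyzer"] = {
            "type": "custom",
            "tokenizer": f"kuromoji_{mode}",
        }
    return {
        "settings": {
            "index": {
                "analysis": {"tokenizer": tokenizers, "analyzer": analyzers}
            }
        }
    }


settings = build_mode_settings()
print(json.dumps(settings, indent=4))
```

Each analyzer can then be passed to the _analyze API in turn to see how the same input is segmented under each mode.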

User Dictionary

The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. Users can also define additional dictionary entries; this is the User Dictionary. User Dictionary entries are defined using the following CSV format:

<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>

Dictionary Example

東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞

To use a User Dictionary, set the file path in the user_dictionary setting. The User Dictionary file is placed in the ES_HOME/config directory.
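
Before deploying a dictionary file, it can help to check that each line has the expected shape. The Python sketch below splits a line into the four fields of the CSV format and checks that the number of tokens matches the number of readings. It is a simplified illustration, not the parser Lucene uses; `UserDictEntry` and `parse_user_dict_line` are hypothetical names, and quoting or comment handling is not covered.

```python
from typing import List, NamedTuple


class UserDictEntry(NamedTuple):
    text: str            # surface form to match
    tokens: List[str]    # segmentation of the surface form
    readings: List[str]  # one reading per token
    pos_tag: str         # part-of-speech tag


def parse_user_dict_line(line: str) -> UserDictEntry:
    """Parse '<text>,<tokens>,<readings>,<pos>' into a UserDictEntry."""
    fields = line.strip().split(",")
    if len(fields) != 4:
        raise ValueError(f"expected 4 comma-separated fields, got {len(fields)}")
    text, tokens, readings, pos_tag = fields
    token_list = tokens.split()
    reading_list = readings.split()
    if len(token_list) != len(reading_list):
        raise ValueError("number of tokens and readings differ")
    return UserDictEntry(text, token_list, reading_list, pos_tag)


entry = parse_user_dict_line(
    "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
)
print(entry)
```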

example

Example Settings:

curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "settings": {
        "index":{
            "analysis":{
                "tokenizer" : {
                    "kuromoji_user_dict" : {
                       "type" : "kuromoji_tokenizer",
                       "mode" : "extended",
                       "discard_punctuation" : "false",
                       "user_dictionary" : "userdict_ja.txt"
                    }
                },
                "analyzer" : {
                    "my_analyzer" : {
                        "type" : "custom",
                        "tokenizer" : "kuromoji_user_dict"
                    }
                }

            }
        }
    }
}
'

Example Request using _analyze API :

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '東京スカイツリー'

Response :

{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}

TokenFilter : kuromoji_baseform

A token filter of type kuromoji_baseform that replaces term text with BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.

example

Example Settings:

curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "my_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["kuromoji_baseform"]
                    }
                }
            }
        }
    }
}
'

Example Request using _analyze API :

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '飲み'

Response :

{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}

TokenFilter : kuromoji_part_of_speech

A token filter of type kuromoji_part_of_speech that removes tokens that match a set of part-of-speech tags.

The following are settings that can be set for a kuromoji_part_of_speech token filter type:

| Setting | Description |
|---|---|
| stoptags | A list of part-of-speech tags that should be removed |

Note that the default is the stoptags.txt file included in lucene-analyzer-kuromoji.jar.
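
Conceptually, the filter drops every token whose part-of-speech tag appears in stoptags. The Python sketch below imitates that behaviour on a pre-tagged token list. It does not call the plugin; `remove_stop_tags` is a hypothetical helper, and the tags given for 寿司 and おいしい are assumed for illustration.

```python
from typing import Iterable, List, Tuple


def remove_stop_tags(
    tokens: Iterable[Tuple[str, str]], stoptags: Iterable[str]
) -> List[str]:
    """Keep the surface form of every token whose tag is not in stoptags."""
    stop = set(stoptags)
    return [surface for surface, tag in tokens if tag not in stop]


# (surface, part-of-speech tag) pairs; the tags for 寿司 and おいしい
# are assumptions made for this illustration.
tagged = [
    ("寿司", "名詞-一般"),
    ("が", "助詞-格助詞-一般"),
    ("おいしい", "形容詞-自立"),
    ("ね", "助詞-終助詞"),
]

kept = remove_stop_tags(tagged, ["助詞-格助詞-一般", "助詞-終助詞"])
print(kept)
```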

example

Example Settings:

curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "my_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["my_posfilter"]
                    }
                },
                "filter" : {
                    "my_posfilter" : {
                        "type" : "kuromoji_part_of_speech",
                        "stoptags" : [
                            "助詞-格助詞-一般",
                            "助詞-終助詞"
                        ]
                    }
                }
            }
        }
    }
}
'

Example Request using _analyze API :

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d '寿司がおいしいね'

Response :

{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 3
  } ]
}

TokenFilter : kuromoji_readingform

A token filter of type kuromoji_readingform that replaces the term attribute with the reading of a token in either katakana or romaji form. The default reading form is katakana.

The following are settings that can be set for a kuromoji_readingform token filter type:

| Setting | Description | Default value |
|---|---|---|
| use_romaji | true if the romaji reading form should be output instead of katakana. | false |

Note that the kuromoji_readingform filter built into elasticsearch-analysis-kuromoji sets use_romaji to true by default.

example

Example Settings:

curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "romaji_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["romaji_readingform"]
                    },
                    "katakana_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["katakana_readingform"]
                    }
                },
                "filter" : {
                    "romaji_readingform" : {
                        "type" : "kuromoji_readingform",
                        "use_romaji" : true
                    },
                    "katakana_readingform" : {
                        "type" : "kuromoji_readingform",
                        "use_romaji" : false
                    }
                }
            }
        }
    }
}
'

Example Request using _analyze API :

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=katakana_analyzer&pretty' -d '寿司'

Response :

{
  "tokens" : [ {
    "token" : "スシ",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}

Example Request using _analyze API :

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=romaji_analyzer&pretty' -d '寿司'

Response :

{
  "tokens" : [ {
    "token" : "sushi",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  } ]
}

TokenFilter : kuromoji_stemmer

A token filter of type kuromoji_stemmer that normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC). Only katakana words of at least a minimum length are stemmed (default is four).

Note that only full-width katakana characters are supported.

The following are settings that can be set for a kuromoji_stemmer token filter type:

| Setting | Description | Default value |
|---|---|---|
| minimum_length | The minimum length to stem | 4 |
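
The stemming rule can be sketched in a few lines of Python. This is a simplified illustration rather than the Lucene implementation: `stem_katakana` is a hypothetical function, and it assumes that "full-width katakana" means characters in the range U+30A0 to U+30FF and that a word whose length equals the minimum is stemmed.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # the long sound character ー


def is_katakana(word: str) -> bool:
    """True if every character is in the full-width Katakana block."""
    return bool(word) and all("\u30a0" <= ch <= "\u30ff" for ch in word)


def stem_katakana(word: str, minimum_length: int = 4) -> str:
    """Drop a trailing long sound mark from sufficiently long katakana words."""
    if (
        len(word) >= minimum_length
        and is_katakana(word)
        and word.endswith(PROLONGED_SOUND_MARK)
    ):
        return word[:-1]
    return word


print(stem_katakana("コピー"))    # 3 characters, below the minimum length
print(stem_katakana("サーバー"))  # 4 characters, trailing mark removed
```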

example

Example Settings:

curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "my_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["my_katakana_stemmer"]
                    }
                },
                "filter" : {
                    "my_katakana_stemmer" : {
                        "type" : "kuromoji_stemmer",
                        "minimum_length" : 4
                    }
                }
            }
        }
    }
}
'

Example Request using _analyze API :

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'コピー'

Response :

{
  "tokens" : [ {
    "token" : "コピー",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  } ]
}

Example Request using _analyze API :

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=my_analyzer&pretty' -d 'サーバー'

Response :

{
  "tokens" : [ {
    "token" : "サーバ",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  } ]
}

TokenFilter : ja_stop

A token filter of type ja_stop that provides the predefined "_japanese_" stop word list. Note: it only provides "_japanese_". If you want to use other predefined stop word lists, use the stop token filter.

example

Example Settings:

curl -XPUT 'http://localhost:9200/kuromoji_sample/' -d'
{
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "analyzer_with_ja_stop" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["ja_stop"]
                    }
                },
                "filter" : {
                    "ja_stop" : {
                        "type" : "ja_stop",
                        "stopwords" : ["_japanese_", "ストップ"]
                    }
                }
            }
        }
    }
}'

Example Request using _analyze API :

curl -XPOST 'http://localhost:9200/kuromoji_sample/_analyze?analyzer=analyzer_with_ja_stop&pretty' -d 'ストップは消える'

Response :

{
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 3
  } ]
}

License

This software is licensed under the Apache 2 license, quoted below.

Copyright 2009-2014 Elasticsearch <http://www.elasticsearch.org>

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.