OpenSearch/docs/plugins/analysis-nori.asciidoc

[[analysis-nori]]
=== Korean (nori) Analysis Plugin

The Korean (nori) Analysis plugin integrates Lucene nori analysis
module into elasticsearch. It uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic dictionary]
to perform morphological analysis of Korean texts.

:plugin_name: analysis-nori
include::install_remove.asciidoc[]

[[analysis-nori-analyzer]]
==== `nori` analyzer

The `nori` analyzer consists of the following tokenizer and token filters:

* <<analysis-nori-tokenizer,`nori_tokenizer`>>
* <<analysis-nori-speech,`nori_part_of_speech`>> token filter
* <<analysis-nori-readingform,`nori_readingform`>> token filter
* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter

It supports the `decompound_mode` and `user_dictionary` settings from
<<analysis-nori-tokenizer,`nori_tokenizer`>> and the `stoptags` setting from
<<analysis-nori-speech,`nori_part_of_speech`>>.

[[analysis-nori-tokenizer]]
==== `nori_tokenizer`

The `nori_tokenizer` accepts the following settings:

`decompound_mode`::
+
--

The decompound mode determines how the tokenizer handles compound tokens.
It can be set to:

`none`::

    No decomposition for compounds. Example output:

    가거도항
    가곡역

`discard`::

    Decomposes compounds and discards the original form (*default*). Example output:

    가곡역 => 가곡, 역

`mixed`::

    Decomposes compounds and keeps the original form. Example output:

    가곡역 => 가곡역, 가곡, 역
--

`discard_punctuation`::

    Whether punctuation should be discarded from the output. Defaults to `true`.

`user_dictionary`::
+
--
The Nori tokenizer uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic dictionary] by default.
A `user_dictionary` with custom nouns (`NNG`) may be appended to the default dictionary.
The dictionary should have the following format:

[source,txt]
-----------------------
<token> [<token 1> ... <token n>]
-----------------------

The first token is mandatory and represents the custom noun that should be added in
the dictionary. For compound nouns the custom segmentation can be provided
after the first token (`[<token 1> ... <token n>]`). The segmentation of the
custom compound nouns is controlled by the `decompound_mode` setting.


As a demonstration of how the user dictionary can be used, save the following
dictionary to `$ES_HOME/config/userdict_ko.txt`:

[source,txt]
-----------------------
c++                 <1>
C샤프
세종
세종시 세종 시        <2>
-----------------------

<1> A simple noun
<2> A compound noun (`세종시`) followed by its decomposition: `세종` and `시`.

Then create an analyzer as follows:

[source,console]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ko.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "세종시"  <1>
}
--------------------------------------------------

<1> Sejong city

The above `analyze` request returns the following:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "세종시",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0,
    "positionLength" : 2    <1>
  }, {
    "token" : "세종",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "시",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
   }]
}
--------------------------------------------------

<1> This is a compound token that spans two positions (`mixed` mode).
--

`user_dictionary_rules`::
+
--

You can also inline the rules directly in the tokenizer definition using
the `user_dictionary_rules` option:

[source,console]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}
--------------------------------------------------
--

The `nori_tokenizer` sets a number of additional attributes per token that are used by token filters
to modify the stream.
You can view all these additional attributes with the following request:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "nori_tokenizer",
  "text": "뿌리가 깊은 나무는",   <1>
  "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
  "explain": true
}
--------------------------------------------------

<1> A tree with deep roots

Which responds with:

[source,console-result]
--------------------------------------------------
{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "nori_tokenizer",
            "tokens": [
                {
                    "token": "뿌리",
                    "start_offset": 0,
                    "end_offset": 2,
                    "type": "word",
                    "position": 0,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "가",
                    "start_offset": 2,
                    "end_offset": 3,
                    "type": "word",
                    "position": 1,
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "J(Ending Particle)"
                },
                {
                    "token": "깊",
                    "start_offset": 4,
                    "end_offset": 5,
                    "type": "word",
                    "position": 2,
                    "leftPOS": "VA(Adjective)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "VA(Adjective)"
                },
                {
                    "token": "은",
                    "start_offset": 5,
                    "end_offset": 6,
                    "type": "word",
                    "position": 3,
                    "leftPOS": "E(Verbal endings)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "E(Verbal endings)"
                },
                {
                    "token": "나무",
                    "start_offset": 7,
                    "end_offset": 9,
                    "type": "word",
                    "position": 4,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "는",
                    "start_offset": 9,
                    "end_offset": 10,
                    "type": "word",
                    "position": 5,
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "J(Ending Particle)"
                }
            ]
        },
        "tokenfilters": []
    }
}
--------------------------------------------------


[[analysis-nori-speech]]
==== `nori_part_of_speech` token filter

The `nori_part_of_speech` token filter removes tokens that match a set of
part-of-speech tags. The list of supported tags and their meanings can be found here:
{lucene-core-javadoc}/../analyzers-nori/org/apache/lucene/analysis/ko/POS.Tag.html[Part of speech tags]

It accepts the following setting:

`stoptags`::

    An array of part-of-speech tags that should be removed.

and defaults to:

[source,js]
--------------------------------------------------
"stoptags": [
    "E",
    "IC",
    "J",
    "MAG", "MAJ", "MM",
    "SP", "SSC", "SSO", "SC", "SE",
    "XPN", "XSA", "XSN", "XSV",
    "UNA", "NA", "VSV"
]
--------------------------------------------------
// NOTCONSOLE

For example:

[source,console]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "nori_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "nori_part_of_speech",
            "stoptags": [
              "NR"   <1>
            ]
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "여섯 용이"  <2>
}
--------------------------------------------------

<1> Korean numerals should be removed (`NR`)
<2> Six dragons

Which responds with:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "용",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "이",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------


[[analysis-nori-readingform]]
==== `nori_readingform` token filter

The `nori_readingform` token filter rewrites tokens written in Hanja to their Hangul form.

[source,console]
--------------------------------------------------
PUT nori_sample
{
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "my_analyzer" : {
                        "tokenizer" : "nori_tokenizer",
                        "filter" : ["nori_readingform"]
                    }
                }
            }
        }
    }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "鄕歌"      <1>
}
--------------------------------------------------

<1> A token written in Hanja: Hyangga

Which responds with:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "향가",     <1>
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }]
}
--------------------------------------------------

<1> The Hanja form is replaced by the Hangul translation.


[[analysis-nori-number]]
==== `nori_number` token filter

The `nori_number` token filter normalizes Korean numbers
to regular Arabic decimal numbers in half-width characters.

Korean numbers are often written using a combination of Hangul and Arabic numbers with various kinds punctuation.
For example, ３．２천 means 3200.
This filter does this kind of normalization and allows a search for 3200 to match ３．２천 in text,
but can also be used to make range facets based on the normalized numbers and so on.

[NOTE]
====
Notice that this analyzer uses a token composition scheme and relies on punctuation tokens
being found in the token stream.
Please make sure your `nori_tokenizer` has `discard_punctuation` set to false.
In case punctuation characters, such as U+FF0E(．), is removed from the token stream,
this filter would find input tokens ３ and ２천 and give outputs 3 and 2000 instead of 3200,
which is likely not the intended result.

If you want to remove punctuation characters from your index that are not part of normalized numbers,
add a `stop` token filter with the punctuation you wish to remove after `nori_number` in your analyzer chain.
====
Below are some examples of normalizations this filter supports.
The input is untokenized text and the result is the single term attribute emitted for the input.

- 영영칠 -> 7
- 일영영영 -> 1000
- 삼천2백2십삼 -> 3223
- 조육백만오천일 -> 1000006005001
- ３.２천 ->  3200
- １.２만３４５.６７ -> 12345.67
- 4,647.100 -> 4647.1
- 15,7 -> 157 (be aware of this weakness)

For example:

[source,console]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "tokenizer_discard_puncuation_false",
            "filter": [
              "part_of_speech_stop_sp", "nori_number"
            ]
          }
        },
        "tokenizer": {
          "tokenizer_discard_puncuation_false": {
            "type": "nori_tokenizer",
            "discard_punctuation": "false"
          }
        },
        "filter": {
            "part_of_speech_stop_sp": {
                "type": "nori_part_of_speech",
                "stoptags": ["SP"]
            }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "십만이천오백과 ３.２천"
}
--------------------------------------------------

Which results in:

[source,console-result]
--------------------------------------------------
{
  "tokens" : [{
    "token" : "102500",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "과",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "3200",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "word",
    "position" : 2
  }]
}
--------------------------------------------------
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								[[analysis-nori]]
 								=== Korean (nori) Analysis Plugin
 								The Korean (nori) Analysis plugin integrates Lucene nori analysis
 								module into elasticsearch. It uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic dictionary]
 								to perform morphological analysis of Korean texts.
 								:plugin_name: analysis-nori
 								include::install_remove.asciidoc[]
 								[[analysis-nori-analyzer]]
 								==== `nori` analyzer
 								The `nori` analyzer consists of the following tokenizer and token filters:
 								* <<analysis-nori-tokenizer,`nori_tokenizer`>>
 								* <<analysis-nori-speech,`nori_part_of_speech`>> token filter
-												[Docs] Fix bad link

relates #30397

											
										
										
											2018-05-04 16:07:12 -04:00
+								* <<analysis-nori-readingform,`nori_readingform`>> token filter
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter
 								It supports the `decompound_mode` and `user_dictionary` settings from
 								<<analysis-nori-tokenizer,`nori_tokenizer`>> and the `stoptags` setting from
 								<<analysis-nori-speech,`nori_part_of_speech`>>.
 								[[analysis-nori-tokenizer]]
 								==== `nori_tokenizer`
 								The `nori_tokenizer` accepts the following settings:
 								`decompound_mode`::
 								+
 								--
 								The decompound mode determines how the tokenizer handles compound tokens.
 								It can be set to:
 								`none`::
 								    No decomposition for compounds. Example output:
 								    가거도항
 								    가곡역
 								`discard`::
 								    Decomposes compounds and discards the original form (*default*). Example output:
 								    가곡역 => 가곡, 역
 								`mixed`::
 								    Decomposes compounds and keeps the original form. Example output:
 								    가곡역 => 가곡역, 가곡, 역
 								--
-												Add nori_number token filter in analysis-nori (#53583)

This change adds the `nori_number` token filter.
It also adds a `discard_punctuation` option in nori_tokenizer that should be used in conjunction with the new filter.

											
										
										
											2020-03-23 14:28:49 -04:00
+								`discard_punctuation`::
 								    Whether punctuation should be discarded from the output. Defaults to `true`.
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								`user_dictionary`::
 								+
 								--
 								The Nori tokenizer uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic dictionary] by default.
 								A `user_dictionary` with custom nouns (`NNG`) may be appended to the default dictionary.
 								The dictionary should have the following format:
 								[source,txt]
 								-----------------------
 								<token> [<token 1> ... <token n>]
 								-----------------------
 								The first token is mandatory and represents the custom noun that should be added in
 								the dictionary. For compound nouns the custom segmentation can be provided
 								after the first token (`[<token 1> ... <token n>]`). The segmentation of the
 								custom compound nouns is controlled by the `decompound_mode` setting.
-												Add support for inlined user dictionary in Nori (#36123)

Add support for inlined user dictionary in Nori

This change adds a new option called `user_dictionary_rules` to the
Nori a tokenizer`. It can be used to set additional tokenization rules
to the Korean tokenizer directly in the settings (instead of using a file).

Closes #35842
											
										
										
											2018-12-07 09:26:08 -05:00
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
 								As a demonstration of how the user dictionary can be used, save the following
 								dictionary to `$ES_HOME/config/userdict_ko.txt`:
 								[source,txt]
 								-----------------------
 								c++                 <1>
 								C샤프
 								세종
 								세종시 세종 시        <2>
 								-----------------------
 								<1> A simple noun
 								<2> A compound noun (`세종시`) followed by its decomposition: `세종` and `시`.
 								Then create an analyzer as follows:
-												[DOCS] [2 of 5] Change // CONSOLE comments to [source,console] (#46353) (#46502)


											
										
										
											2019-09-09 13:38:14 -04:00
+								[source,console]
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								--------------------------------------------------
-												Remove `include_type_name` in asciidoc where possible (#37568)

The "include_type_name" parameter was temporarily introduced in #37285 to facilitate
moving the default parameter setting to "false" in many places in the documentation
code snippets. Most of the places can simply be reverted without causing errors.
In this change I looked for asciidoc files that contained the
"include_type_name=true" addition when creating new indices but didn't look
likey they made use of the "_doc" type for mappings. This is mostly the case
e.g. in the analysis docs where index creating often only contains settings. I
manually corrected the use of types in some places where the docs still used an
explicit type name and not the dummy "_doc" type.
											
										
										
											2019-01-18 03:34:11 -05:00
+								PUT nori_sample
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								{
 								  "settings": {
 								    "index": {
 								      "analysis": {
 								        "tokenizer": {
 								          "nori_user_dict": {
 								            "type": "nori_tokenizer",
 								            "decompound_mode": "mixed",
-												Add nori_number token filter in analysis-nori (#53583)

This change adds the `nori_number` token filter.
It also adds a `discard_punctuation` option in nori_tokenizer that should be used in conjunction with the new filter.

											
										
										
											2020-03-23 14:28:49 -04:00
+								            "discard_punctuation": "false",
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								            "user_dictionary": "userdict_ko.txt"
 								          }
 								        },
 								        "analyzer": {
 								          "my_analyzer": {
 								            "type": "custom",
 								            "tokenizer": "nori_user_dict"
 								          }
 								        }
 								      }
 								    }
 								  }
 								}
 								GET nori_sample/_analyze
 								{
 								  "analyzer": "my_analyzer",
 								  "text": "세종시"  <1>
 								}
 								--------------------------------------------------
 								<1> Sejong city
 								The above `analyze` request returns the following:
-												[DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result] (#46295) (#46418)


											
										
										
											2019-09-06 09:22:08 -04:00
+								[source,console-result]
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								--------------------------------------------------
 								{
 								  "tokens" : [ {
 								    "token" : "세종시",
 								    "start_offset" : 0,
 								    "end_offset" : 3,
 								    "type" : "word",
 								    "position" : 0,
 								    "positionLength" : 2    <1>
 								  }, {
 								    "token" : "세종",
 								    "start_offset" : 0,
 								    "end_offset" : 2,
 								    "type" : "word",
 								    "position" : 0
 								  }, {
 								    "token" : "시",
 								    "start_offset" : 2,
 								    "end_offset" : 3,
 								    "type" : "word",
 								    "position" : 1
 								   }]
 								}
 								--------------------------------------------------
 								<1> This is a compound token that spans two positions (`mixed` mode).
-												Add support for inlined user dictionary in Nori (#36123)

Add support for inlined user dictionary in Nori

This change adds a new option called `user_dictionary_rules` to the
Nori a tokenizer`. It can be used to set additional tokenization rules
to the Korean tokenizer directly in the settings (instead of using a file).

Closes #35842
											
										
										
											2018-12-07 09:26:08 -05:00
+								--
 								`user_dictionary_rules`::
 								+
 								--
 								You can also inline the rules directly in the tokenizer definition using
 								the `user_dictionary_rules` option:
-												[DOCS] [2 of 5] Change // CONSOLE comments to [source,console] (#46353) (#46502)


											
										
										
											2019-09-09 13:38:14 -04:00
+								[source,console]
-												Add support for inlined user dictionary in Nori (#36123)

Add support for inlined user dictionary in Nori

This change adds a new option called `user_dictionary_rules` to the
Nori a tokenizer`. It can be used to set additional tokenization rules
to the Korean tokenizer directly in the settings (instead of using a file).

Closes #35842
											
										
										
											2018-12-07 09:26:08 -05:00
+								--------------------------------------------------
-												Remove `include_type_name` in asciidoc where possible (#37568)

The "include_type_name" parameter was temporarily introduced in #37285 to facilitate
moving the default parameter setting to "false" in many places in the documentation
code snippets. Most of the places can simply be reverted without causing errors.
In this change I looked for asciidoc files that contained the
"include_type_name=true" addition when creating new indices but didn't look
likey they made use of the "_doc" type for mappings. This is mostly the case
e.g. in the analysis docs where index creating often only contains settings. I
manually corrected the use of types in some places where the docs still used an
explicit type name and not the dummy "_doc" type.
											
										
										
											2019-01-18 03:34:11 -05:00
+								PUT nori_sample
-												Add support for inlined user dictionary in Nori (#36123)

Add support for inlined user dictionary in Nori

This change adds a new option called `user_dictionary_rules` to the
Nori a tokenizer`. It can be used to set additional tokenization rules
to the Korean tokenizer directly in the settings (instead of using a file).

Closes #35842
											
										
										
											2018-12-07 09:26:08 -05:00
+								{
 								  "settings": {
 								    "index": {
 								      "analysis": {
 								        "tokenizer": {
 								          "nori_user_dict": {
 								            "type": "nori_tokenizer",
 								            "decompound_mode": "mixed",
 								            "user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]
 								          }
 								        },
 								        "analyzer": {
 								          "my_analyzer": {
 								            "type": "custom",
 								            "tokenizer": "nori_user_dict"
 								          }
 								        }
 								      }
 								    }
 								  }
 								}
 								--------------------------------------------------
 								--
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
 								The `nori_tokenizer` sets a number of additional attributes per token that are used by token filters
 								to modify the stream.
 								You can view all these additional attributes with the following request:
-												[DOCS] [2 of 5] Change // CONSOLE comments to [source,console] (#46353) (#46502)


											
										
										
											2019-09-09 13:38:14 -04:00
+								[source,console]
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								--------------------------------------------------
 								GET _analyze
 								{
 								  "tokenizer": "nori_tokenizer",
 								  "text": "뿌리가 깊은 나무는",   <1>
 								  "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
 								  "explain": true
 								}
 								--------------------------------------------------
 								<1> A tree with deep roots
 								Which responds with:
-												[DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result] (#46295) (#46418)


											
										
										
											2019-09-06 09:22:08 -04:00
+								[source,console-result]
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								--------------------------------------------------
 								{
 								    "detail": {
 								        "custom_analyzer": true,
 								        "charfilters": [],
 								        "tokenizer": {
 								            "name": "nori_tokenizer",
 								            "tokens": [
 								                {
 								                    "token": "뿌리",
 								                    "start_offset": 0,
 								                    "end_offset": 2,
 								                    "type": "word",
 								                    "position": 0,
 								                    "leftPOS": "NNG(General Noun)",
 								                    "morphemes": null,
 								                    "posType": "MORPHEME",
 								                    "reading": null,
 								                    "rightPOS": "NNG(General Noun)"
 								                },
 								                {
 								                    "token": "가",
 								                    "start_offset": 2,
 								                    "end_offset": 3,
 								                    "type": "word",
 								                    "position": 1,
 								                    "leftPOS": "J(Ending Particle)",
 								                    "morphemes": null,
 								                    "posType": "MORPHEME",
 								                    "reading": null,
 								                    "rightPOS": "J(Ending Particle)"
 								                },
 								                {
 								                    "token": "깊",
 								                    "start_offset": 4,
 								                    "end_offset": 5,
 								                    "type": "word",
 								                    "position": 2,
 								                    "leftPOS": "VA(Adjective)",
 								                    "morphemes": null,
 								                    "posType": "MORPHEME",
 								                    "reading": null,
 								                    "rightPOS": "VA(Adjective)"
 								                },
 								                {
 								                    "token": "은",
 								                    "start_offset": 5,
 								                    "end_offset": 6,
 								                    "type": "word",
 								                    "position": 3,
 								                    "leftPOS": "E(Verbal endings)",
 								                    "morphemes": null,
 								                    "posType": "MORPHEME",
 								                    "reading": null,
 								                    "rightPOS": "E(Verbal endings)"
 								                },
 								                {
 								                    "token": "나무",
 								                    "start_offset": 7,
 								                    "end_offset": 9,
 								                    "type": "word",
 								                    "position": 4,
 								                    "leftPOS": "NNG(General Noun)",
 								                    "morphemes": null,
 								                    "posType": "MORPHEME",
 								                    "reading": null,
 								                    "rightPOS": "NNG(General Noun)"
 								                },
 								                {
 								                    "token": "는",
 								                    "start_offset": 9,
 								                    "end_offset": 10,
 								                    "type": "word",
 								                    "position": 5,
 								                    "leftPOS": "J(Ending Particle)",
 								                    "morphemes": null,
 								                    "posType": "MORPHEME",
 								                    "reading": null,
 								                    "rightPOS": "J(Ending Particle)"
 								                }
 								            ]
 								        },
 								        "tokenfilters": []
 								    }
 								}
 								--------------------------------------------------
-												[DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result] (#46295) (#46418)


											
										
										
											2019-09-06 09:22:08 -04:00
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
 								[[analysis-nori-speech]]
 								==== `nori_part_of_speech` token filter
 								The `nori_part_of_speech` token filter removes tokens that match a set of
 								part-of-speech tags. The list of supported tags and their meanings can be found here:
-												[Docs] Fix wrong link in Korean analyzer docs (#31815)


											
										
										
											2018-07-06 03:30:48 -04:00
+								{lucene-core-javadoc}/../analyzers-nori/org/apache/lucene/analysis/ko/POS.Tag.html[Part of speech tags]
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
 								It accepts the following setting:
 								`stoptags`::
 								    An array of part-of-speech tags that should be removed.
 								and defaults to:
-												[Docs] Add snippets for POS stop tags default value

relates #30397

											
										
										
											2018-05-05 01:53:21 -04:00
+								[source,js]
 								--------------------------------------------------
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								"stoptags": [
 								    "E",
 								    "IC",
 								    "J",
 								    "MAG", "MAJ", "MM",
 								    "SP", "SSC", "SSO", "SC", "SE",
 								    "XPN", "XSA", "XSN", "XSV",
 								    "UNA", "NA", "VSV"
 								]
-												[Docs] Add snippets for POS stop tags default value

relates #30397

											
										
										
											2018-05-05 01:53:21 -04:00
+								--------------------------------------------------
 								// NOTCONSOLE
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
 								For example:
-												[DOCS] [2 of 5] Change // CONSOLE comments to [source,console] (#46353) (#46502)


											
										
										
											2019-09-09 13:38:14 -04:00
+								[source,console]
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								--------------------------------------------------
-												Remove `include_type_name` in asciidoc where possible (#37568)

The "include_type_name" parameter was temporarily introduced in #37285 to facilitate
moving the default parameter setting to "false" in many places in the documentation
code snippets. Most of the places can simply be reverted without causing errors.
In this change I looked for asciidoc files that contained the
"include_type_name=true" addition when creating new indices but didn't look
likey they made use of the "_doc" type for mappings. This is mostly the case
e.g. in the analysis docs where index creating often only contains settings. I
manually corrected the use of types in some places where the docs still used an
explicit type name and not the dummy "_doc" type.
											
										
										
											2019-01-18 03:34:11 -05:00
+								PUT nori_sample
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								{
 								  "settings": {
 								    "index": {
 								      "analysis": {
 								        "analyzer": {
 								          "my_analyzer": {
 								            "tokenizer": "nori_tokenizer",
 								            "filter": [
 								              "my_posfilter"
 								            ]
 								          }
 								        },
 								        "filter": {
 								          "my_posfilter": {
 								            "type": "nori_part_of_speech",
 								            "stoptags": [
 								              "NR"   <1>
 								            ]
 								          }
 								        }
 								      }
 								    }
 								  }
 								}
 								GET nori_sample/_analyze
 								{
 								  "analyzer": "my_analyzer",
 								  "text": "여섯 용이"  <2>
 								}
 								--------------------------------------------------
 								<1> Korean numerals should be removed (`NR`)
 								<2> Six dragons
 								Which responds with:
-												[DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result] (#46295) (#46418)


											
										
										
											2019-09-06 09:22:08 -04:00
+								[source,console-result]
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								--------------------------------------------------
 								{
 								  "tokens" : [ {
 								    "token" : "용",
 								    "start_offset" : 3,
 								    "end_offset" : 4,
 								    "type" : "word",
 								    "position" : 1
 								  }, {
 								    "token" : "이",
 								    "start_offset" : 4,
 								    "end_offset" : 5,
 								    "type" : "word",
 								    "position" : 2
 								  } ]
 								}
 								--------------------------------------------------
-												[DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result] (#46295) (#46418)


											
										
										
											2019-09-06 09:22:08 -04:00
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
 								[[analysis-nori-readingform]]
 								==== `nori_readingform` token filter
 								The `nori_readingform` token filter rewrites tokens written in Hanja to their Hangul form.
-												[DOCS] [2 of 5] Change // CONSOLE comments to [source,console] (#46353) (#46502)


											
										
										
											2019-09-09 13:38:14 -04:00
+								[source,console]
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								--------------------------------------------------
-												Remove `include_type_name` in asciidoc where possible (#37568)

The "include_type_name" parameter was temporarily introduced in #37285 to facilitate
moving the default parameter setting to "false" in many places in the documentation
code snippets. Most of the places can simply be reverted without causing errors.
In this change I looked for asciidoc files that contained the
"include_type_name=true" addition when creating new indices but didn't look
likey they made use of the "_doc" type for mappings. This is mostly the case
e.g. in the analysis docs where index creating often only contains settings. I
manually corrected the use of types in some places where the docs still used an
explicit type name and not the dummy "_doc" type.
											
										
										
											2019-01-18 03:34:11 -05:00
+								PUT nori_sample
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								{
 								    "settings": {
 								        "index":{
 								            "analysis":{
 								                "analyzer" : {
 								                    "my_analyzer" : {
 								                        "tokenizer" : "nori_tokenizer",
 								                        "filter" : ["nori_readingform"]
 								                    }
 								                }
 								            }
 								        }
 								    }
 								}
 								GET nori_sample/_analyze
 								{
 								  "analyzer": "my_analyzer",
-												[Docs] Fix bad link

relates #30397

											
										
										
											2018-05-04 16:07:12 -04:00
+								  "text": "鄕歌"      <1>
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								}
 								--------------------------------------------------
-												[Docs] Fix bad link

relates #30397

											
										
										
											2018-05-04 16:07:12 -04:00
+								<1> A token written in Hanja: Hyangga
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
 								Which responds with:
-												[DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result] (#46295) (#46418)


											
										
										
											2019-09-06 09:22:08 -04:00
+								[source,console-result]
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								--------------------------------------------------
 								{
 								  "tokens" : [ {
-												[Docs] Fix bad link

relates #30397

											
										
										
											2018-05-04 16:07:12 -04:00
+								    "token" : "향가",     <1>
-												Expose the Lucene Korean analyzer module in a plugin (#30397)

This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
											
										
										
											2018-05-04 14:46:13 -04:00
+								    "start_offset" : 0,
 								    "end_offset" : 2,
 								    "type" : "word",
 								    "position" : 0
 								  }]
 								}
 								--------------------------------------------------
-												[Docs] Fix bad link

relates #30397

											
										
										
											2018-05-04 16:07:12 -04:00
+								<1> The Hanja form is replaced by the Hangul translation.
-												Add nori_number token filter in analysis-nori (#53583)

This change adds the `nori_number` token filter.
It also adds a `discard_punctuation` option in nori_tokenizer that should be used in conjunction with the new filter.

											
										
										
											2020-03-23 14:28:49 -04:00
 								[[analysis-nori-number]]
 								==== `nori_number` token filter
 								The `nori_number` token filter normalizes Korean numbers
 								to regular Arabic decimal numbers in half-width characters.
 								Korean numbers are often written using a combination of Hangul and Arabic numbers with various kinds punctuation.
 								For example, ３．２천 means 3200.
 								This filter does this kind of normalization and allows a search for 3200 to match ３．２천 in text,
 								but can also be used to make range facets based on the normalized numbers and so on.
 								[NOTE]
 								====
 								Notice that this analyzer uses a token composition scheme and relies on punctuation tokens
 								being found in the token stream.
 								Please make sure your `nori_tokenizer` has `discard_punctuation` set to false.
 								In case punctuation characters, such as U+FF0E(．), is removed from the token stream,
 								this filter would find input tokens ３ and ２천 and give outputs 3 and 2000 instead of 3200,
 								which is likely not the intended result.
 								If you want to remove punctuation characters from your index that are not part of normalized numbers,
 								add a `stop` token filter with the punctuation you wish to remove after `nori_number` in your analyzer chain.
 								====
 								Below are some examples of normalizations this filter supports.
 								The input is untokenized text and the result is the single term attribute emitted for the input.
 								- 영영칠 -> 7
 								- 일영영영 -> 1000
 								- 삼천2백2십삼 -> 3223
 								- 조육백만오천일 -> 1000006005001
 								- ３.２천 ->  3200
 								- １.２만３４５.６７ -> 12345.67
 								- 4,647.100 -> 4647.1
 								- 15,7 -> 157 (be aware of this weakness)
 								For example:
 								[source,console]
 								--------------------------------------------------
 								PUT nori_sample
 								{
 								  "settings": {
 								    "index": {
 								      "analysis": {
 								        "analyzer": {
 								          "my_analyzer": {
 								            "tokenizer": "tokenizer_discard_puncuation_false",
 								            "filter": [
 								              "part_of_speech_stop_sp", "nori_number"
 								            ]
 								          }
 								        },
 								        "tokenizer": {
 								          "tokenizer_discard_puncuation_false": {
 								            "type": "nori_tokenizer",
 								            "discard_punctuation": "false"
 								          }
 								        },
 								        "filter": {
 								            "part_of_speech_stop_sp": {
 								                "type": "nori_part_of_speech",
 								                "stoptags": ["SP"]
 								            }
 								        }
 								      }
 								    }
 								  }
 								}
 								GET nori_sample/_analyze
 								{
 								  "analyzer": "my_analyzer",
 								  "text": "십만이천오백과 ３.２천"
 								}
 								--------------------------------------------------
 								Which results in:
 								[source,console-result]
 								--------------------------------------------------
 								{
 								  "tokens" : [{
 								    "token" : "102500",
 								    "start_offset" : 0,
 								    "end_offset" : 6,
 								    "type" : "word",
 								    "position" : 0
 								  }, {
 								    "token" : "과",
 								    "start_offset" : 6,
 								    "end_offset" : 7,
 								    "type" : "word",
 								    "position" : 1
 								  }, {
 								    "token" : "3200",
 								    "start_offset" : 8,
 								    "end_offset" : 12,
 								    "type" : "word",
 								    "position" : 2
 								  }]
 								}
 								--------------------------------------------------