[[analysis-nori]]
=== Korean (nori) Analysis Plugin

The Korean (nori) Analysis plugin integrates the Lucene nori analysis
module into Elasticsearch. It uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic dictionary]
to perform morphological analysis of Korean texts.

:plugin_name: analysis-nori
include::install_remove.asciidoc[]

[[analysis-nori-analyzer]]
==== `nori` analyzer

The `nori` analyzer consists of the following tokenizer and token filters:

* <<analysis-nori-tokenizer,`nori_tokenizer`>>
* <<analysis-nori-speech,`nori_part_of_speech`>> token filter
* <<analysis-nori-readingform,`nori_readingform`>> token filter
* {ref}/analysis-lowercase-tokenfilter.html[`lowercase`] token filter

It supports the `decompound_mode` and `user_dictionary` settings from
<<analysis-nori-tokenizer,`nori_tokenizer`>> and the `stoptags` setting from
<<analysis-nori-speech,`nori_part_of_speech`>>.

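For instance, the following is a minimal sketch of configuring the built-in
`nori` analyzer with these settings (the index name `nori_sample`, the analyzer
name `my_nori_analyzer`, and the chosen `decompound_mode` and `stoptags` values
are only illustrative):

[source,js]
--------------------------------------------------
PUT nori_sample?include_type_name=true
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_nori_analyzer": {
            "type": "nori",
            "decompound_mode": "mixed",
            "stoptags": ["E", "J"]
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
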
[[analysis-nori-tokenizer]]
==== `nori_tokenizer`

The `nori_tokenizer` accepts the following settings:

`decompound_mode`::
+
--

The decompound mode determines how the tokenizer handles compound tokens.
It can be set to:

`none`::

No decomposition for compounds. Example output:

    가거도항
    가곡역

`discard`::

Decomposes compounds and discards the original form (*default*). Example output:

    가곡역 => 가곡, 역

`mixed`::

Decomposes compounds and keeps the original form. Example output:

    가곡역 => 가곡역, 가곡, 역

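As a quick sketch of comparing the modes, a tokenizer definition can be passed
inline to the `_analyze` API (the request below is only illustrative); with
`discard`, only the decomposed tokens `가곡` and `역` are returned:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer": {
    "type": "nori_tokenizer",
    "decompound_mode": "discard"
  },
  "text": "가곡역"
}
--------------------------------------------------
// CONSOLE
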
--

`user_dictionary`::
+
--
The Nori tokenizer uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic dictionary] by default.
A `user_dictionary` with custom nouns (`NNG`) may be appended to the default dictionary.
The dictionary should have the following format:

[source,txt]
-----------------------
<token> [<token 1> ... <token n>]
-----------------------

The first token is mandatory and represents the custom noun that should be added in
the dictionary. For compound nouns the custom segmentation can be provided
after the first token (`[<token 1> ... <token n>]`). The segmentation of the
custom compound nouns is controlled by the `decompound_mode` setting.

As a demonstration of how the user dictionary can be used, save the following
dictionary to `$ES_HOME/config/userdict_ko.txt`:

[source,txt]
-----------------------
c++ <1>
C샤프
세종
세종시 세종 시 <2>
-----------------------

<1> A simple noun
<2> A compound noun (`세종시`) followed by its decomposition: `세종` and `시`.

Then create an analyzer as follows:

[source,js]
--------------------------------------------------
PUT nori_sample?include_type_name=true
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary": "userdict_ko.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "세종시" <1>
}
--------------------------------------------------
// CONSOLE

<1> Sejong city

The above `analyze` request returns the following:

[source,js]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "세종시",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0,
    "positionLength" : 2    <1>
  }, {
    "token" : "세종",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "시",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }]
}
--------------------------------------------------
// TESTRESPONSE

<1> This is a compound token that spans two positions (`mixed` mode).
--

`user_dictionary_rules`::
+
--

You can also inline the rules directly in the tokenizer definition using
the `user_dictionary_rules` option:

[source,js]
--------------------------------------------------
PUT nori_sample?include_type_name=true
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
--

The `nori_tokenizer` sets a number of additional attributes per token that are used by token filters
to modify the stream.
You can view all these additional attributes with the following request:

[source,js]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "nori_tokenizer",
  "text": "뿌리가 깊은 나무는", <1>
  "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
  "explain": true
}
--------------------------------------------------
// CONSOLE

<1> A tree with deep roots

Which responds with:

[source,js]
--------------------------------------------------
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "nori_tokenizer",
      "tokens": [
        {
          "token": "뿌리",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 0,
          "leftPOS": "NNG(General Noun)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "NNG(General Noun)"
        },
        {
          "token": "가",
          "start_offset": 2,
          "end_offset": 3,
          "type": "word",
          "position": 1,
          "leftPOS": "J(Ending Particle)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "J(Ending Particle)"
        },
        {
          "token": "깊",
          "start_offset": 4,
          "end_offset": 5,
          "type": "word",
          "position": 2,
          "leftPOS": "VA(Adjective)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "VA(Adjective)"
        },
        {
          "token": "은",
          "start_offset": 5,
          "end_offset": 6,
          "type": "word",
          "position": 3,
          "leftPOS": "E(Verbal endings)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "E(Verbal endings)"
        },
        {
          "token": "나무",
          "start_offset": 7,
          "end_offset": 9,
          "type": "word",
          "position": 4,
          "leftPOS": "NNG(General Noun)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "NNG(General Noun)"
        },
        {
          "token": "는",
          "start_offset": 9,
          "end_offset": 10,
          "type": "word",
          "position": 5,
          "leftPOS": "J(Ending Particle)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "J(Ending Particle)"
        }
      ]
    },
    "tokenfilters": []
  }
}
--------------------------------------------------
// TESTRESPONSE

[[analysis-nori-speech]]
==== `nori_part_of_speech` token filter

The `nori_part_of_speech` token filter removes tokens that match a set of
part-of-speech tags. The list of supported tags and their meanings can be found here:
{lucene-core-javadoc}/../analyzers-nori/org/apache/lucene/analysis/ko/POS.Tag.html[Part of speech tags]

It accepts the following setting:

`stoptags`::

An array of part-of-speech tags that should be removed.

It defaults to:

[source,js]
--------------------------------------------------
"stoptags": [
    "E",
    "IC",
    "J",
    "MAG", "MAJ", "MM",
    "SP", "SSC", "SSO", "SC", "SE",
    "XPN", "XSA", "XSN", "XSV",
    "UNA", "NA", "VSV"
]
--------------------------------------------------
// NOTCONSOLE

For example:

[source,js]
--------------------------------------------------
PUT nori_sample?include_type_name=true
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "nori_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "nori_part_of_speech",
            "stoptags": [
              "NR" <1>
            ]
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "여섯 용이" <2>
}
--------------------------------------------------
// CONSOLE

<1> Korean numerals should be removed (`NR`)
<2> Six dragons

Which responds with:

[source,js]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "용",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "이",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "word",
    "position" : 2
  } ]
}
--------------------------------------------------
// TESTRESPONSE

[[analysis-nori-readingform]]
==== `nori_readingform` token filter

The `nori_readingform` token filter rewrites tokens written in Hanja to their Hangul form.

[source,js]
--------------------------------------------------
PUT nori_sample?include_type_name=true
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "nori_tokenizer",
            "filter": ["nori_readingform"]
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "鄕歌" <1>
}
--------------------------------------------------
// CONSOLE

<1> A token written in Hanja: Hyangga

Which responds with:

[source,js]
--------------------------------------------------
{
  "tokens" : [ {
    "token" : "향가", <1>
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }]
}
--------------------------------------------------
// TESTRESPONSE

<1> The Hanja form is replaced by the Hangul translation.