[[analysis-custom-analyzer]]
=== Custom Analyzer

When the built-in analyzers do not fulfill your needs, you can create a
`custom` analyzer which uses the appropriate combination of:

* zero or more <<analysis-charfilters, character filters>>
* a <<analysis-tokenizers,tokenizer>>
* zero or more <<analysis-tokenfilters,token filters>>.

[float]
=== Configuration

The `custom` analyzer accepts the following parameters:

[horizontal]
`tokenizer`::

    A built-in or customised <<analysis-tokenizers,tokenizer>>.
    (Required)

`char_filter`::

    An optional array of built-in or customised
    <<analysis-charfilters, character filters>>.

`filter`::

    An optional array of built-in or customised
    <<analysis-tokenfilters, token filters>>.

`position_increment_gap`::

    When indexing an array of text values, Elasticsearch inserts a fake "gap"
    between the last term of one value and the first term of the next value to
    ensure that a phrase query doesn't match two terms from different array
    elements. Defaults to `100`. See <<position-increment-gap>> for more.
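
Although it rarely needs changing, `position_increment_gap` can be set
directly on the analyzer definition alongside the other parameters. The
following is a minimal sketch; the index name, analyzer name, and the value
`500` are illustrative only:

[source,js]
--------------------------------
PUT gap_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_gapped_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "position_increment_gap": 500
        }
      }
    }
  }
}
--------------------------------
// CONSOLE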

[float]
=== Example configuration

Here is an example that combines the following:

Character Filter::
* <<analysis-htmlstrip-charfilter,HTML Strip Character Filter>>

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII-Folding Token Filter>>

[source,js]
--------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", <1>
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
--------------------------------
// CONSOLE

<1> Setting `type` to `custom` tells Elasticsearch that we are defining a custom analyzer.
Compare this to how <<configuring-analyzers,built-in analyzers can be configured>>:
`type` will be set to the name of the built-in analyzer, like
<<analysis-standard-analyzer,`standard`>> or <<analysis-simple-analyzer,`simple`>>.
A sketch of this comparison follows the example output below.

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ is, this, deja, vu ]
---------------------------
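
For comparison, when configuring a built-in analyzer, `type` is set to the
analyzer's own name rather than to `custom`. A minimal sketch, where the
`std_english` analyzer name and index name are illustrative only:

[source,js]
--------------------------------
PUT my_index_2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
--------------------------------
// CONSOLE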

The previous example used a tokenizer, token filters, and character filters with
their default configurations, but it is possible to create configured versions
of each and to use them in a custom analyzer.

Here is a more complicated example that combines the following:

Character Filter::
* <<analysis-mapping-charfilter,Mapping Character Filter>>, configured to replace `:)` with `_happy_` and `:(` with `_sad_`

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>, configured to split on punctuation characters

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>>, configured to use the pre-defined list of English stop words

Here is an example:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons" <1>
          ],
          "tokenizer": "punctuation", <1>
          "filter": [
            "lowercase",
            "english_stop" <1>
          ]
        }
      },
      "tokenizer": {
        "punctuation": { <1>
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { <1>
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { <1>
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
--------------------------------------------------
// CONSOLE

<1> The `emoticons` character filter, `punctuation` tokenizer and
`english_stop` token filter are custom implementations which are defined
in the same index settings.

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "_happy_",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "person",
      "start_offset": 9,
      "end_offset": 15,
      "type": "word",
      "position": 3
    },
    {
      "token": "you",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ i'm, _happy_, person, you ]
---------------------------
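
To use a custom analyzer at index time, reference it from a `text` field in
the index mappings. The following is a minimal, illustrative sketch; the index
name, field name, and the simplified analyzer are assumptions, not part of the
examples above:

[source,js]
--------------------------------
PUT my_index_3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "comment": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
--------------------------------
// CONSOLE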