[[analysis-custom-analyzer]]
=== Custom Analyzer

When the built-in analyzers do not meet your needs, you can create a
`custom` analyzer that uses the appropriate combination of:

* zero or more <<analysis-charfilters, character filters>>
* a <<analysis-tokenizers,tokenizer>>
* zero or more <<analysis-tokenfilters,token filters>>.

[float]
=== Configuration

The `custom` analyzer accepts the following parameters:

[horizontal]
`tokenizer`::

    A built-in or customised <<analysis-tokenizers,tokenizer>>.
    (Required)

`char_filter`::

    An optional array of built-in or customised
    <<analysis-charfilters, character filters>>.

`filter`::

    An optional array of built-in or customised
    <<analysis-tokenfilters, token filters>>.

`position_increment_gap`::

    When indexing an array of text values, Elasticsearch inserts a fake "gap"
    between the last term of one value and the first term of the next value to
    ensure that a phrase query doesn't match two terms from different array
    elements. Defaults to `100`. See <<position-increment-gap>> for more.
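
As a minimal sketch of how the parameters above fit together, the following
request defines a custom analyzer that sets a non-default
`position_increment_gap`. The index name `gap_example`, the analyzer name
`my_gap_analyzer` and the value `50` are hypothetical and chosen only for
illustration:

[source,js]
--------------------------------
PUT gap_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_gap_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ],
          "position_increment_gap": 50
        }
      }
    }
  }
}
--------------------------------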

[float]
=== Example configuration

Here is an example that combines the following:

Character Filter::
* <<analysis-htmlstrip-charfilter,HTML Strip Character Filter>>

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII-Folding Token Filter>>

[source,js]
--------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
--------------------------------
// CONSOLE

The above example produces the following terms:

[source,text]
---------------------------
[ is, this, deja, vu ]
---------------------------
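
The same combination can also be tried out without creating an index, by
naming the components directly in an `_analyze` request. This is a sketch that
assumes a release in which the `_analyze` API accepts the `tokenizer`,
`char_filter` and `filter` parameters in the request body; it should yield the
same terms as the `my_custom_analyzer` request shown earlier:

[source,js]
--------------------------------
POST _analyze
{
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "asciifolding"
  ],
  "text": "Is this <b>déjà vu</b>?"
}
--------------------------------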

The previous example used a tokenizer, token filters, and a character filter
with their default configurations. You can also define configured versions of
each and use them in a custom analyzer.

Here is a more complicated example that combines the following:

Character Filter::
* <<analysis-mapping-charfilter,Mapping Character Filter>>, configured to replace `:)` with `_happy_` and `:(` with `_sad_`

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>, configured to split on spaces and the punctuation characters `.`, `,`, `!` and `?`

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>>, configured to use the pre-defined list of English stop words

The following request defines this analyzer and then tests it:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons" <1>
          ],
          "tokenizer": "punctuation", <1>
          "filter": [
            "lowercase",
            "english_stop" <1>
          ]
        }
      },
      "tokenizer": {
        "punctuation": { <1>
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { <1>
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { <1>
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
--------------------------------------------------

<1> The `emoticons` character filter, `punctuation` tokenizer and
`english_stop` token filter are custom implementations which are defined
in the same index settings.

The above example produces the following terms:

[source,text]
---------------------------
[ i'm, _happy_, person, you ]
---------------------------
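
Once defined, a custom analyzer can be referenced by name from a `text` field.
The following is a minimal sketch, assuming `my_index` was created as above
and that the release in use supports typeless mappings (older releases expect
a mapping type name in the request). The field name `comment` is hypothetical:

[source,js]
--------------------------------
PUT my_index/_mapping
{
  "properties": {
    "comment": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}
--------------------------------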