[[analysis-custom-analyzer]]
=== Custom Analyzer

When the built-in analyzers do not fulfill your needs, you can create a
`custom` analyzer which uses the appropriate combination of:

* zero or more <<analysis-charfilters, character filters>>
* a <<analysis-tokenizers,tokenizer>>
* zero or more <<analysis-tokenfilters,token filters>>.

[float]
=== Configuration

The `custom` analyzer accepts the following parameters:

[horizontal]
`tokenizer`::

    A built-in or customised <<analysis-tokenizers,tokenizer>>.
    (Required)

`char_filter`::

    An optional array of built-in or customised
    <<analysis-charfilters, character filters>>.

`filter`::

    An optional array of built-in or customised
    <<analysis-tokenfilters, token filters>>.

`position_increment_gap`::

    When indexing an array of text values, Elasticsearch inserts a fake "gap"
    between the last term of one value and the first term of the next value to
    ensure that a phrase query doesn't match two terms from different array
    elements. Defaults to `100`. See <<position-increment-gap>> for more.
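
As a minimal sketch of the `position_increment_gap` parameter, the request
below defines an analyzer with a non-default gap and then analyzes an array of
two values. The index name `gap_index`, the analyzer name `my_gap_analyzer`,
the gap of `50` and the sample text are assumptions chosen for illustration.
In the response, the `position` of the first term of the second value should
be separated from the last term of the first value by roughly the configured
gap rather than by `1`.

[source,js]
--------------------------------
PUT gap_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_gap_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ],
          "position_increment_gap": 50
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST gap_index/_analyze
{
  "analyzer": "my_gap_analyzer",
  "text": [ "John Abraham", "Lincoln Smith" ]
}
--------------------------------
// CONSOLE
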
[float]
=== Example configuration

Here is an example that combines the following:

Character Filter::
* <<analysis-htmlstrip-charfilter,HTML Strip Character Filter>>

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII-Folding Token Filter>>

[source,js]
--------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
--------------------------------
// CONSOLE
/////////////////////
[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
----------------------------
// TESTRESPONSE
/////////////////////
The above example produces the following terms:
[source,text]
---------------------------
[ is, this, deja, vu ]
---------------------------
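
To experiment with a combination before committing it to index settings, the
`_analyze` API also accepts the components inline. The request below is a
sketch that rebuilds the same chain without creating an index. It assumes a
version in which `_analyze` takes the `char_filter`, `tokenizer` and `filter`
request-body parameters, and it should yield the same terms as above.

[source,js]
--------------------------------
POST _analyze
{
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "asciifolding"
  ],
  "text": "Is this <b>déjà vu</b>?"
}
--------------------------------
// CONSOLE
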

The previous example used a tokenizer, token filters, and character filters with
their default configurations, but it is possible to create configured versions
of each and to use them in a custom analyzer.

Here is a more complicated example that combines the following:

Character Filter::
* <<analysis-mapping-charfilter,Mapping Character Filter>>, configured to replace `:)` with `_happy_` and `:(` with `_sad_`

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>, configured to split on spaces and the punctuation characters `.`, `,`, `!` and `?`

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>>, configured to use the pre-defined list of English stop words

The configuration looks like this:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons" <1>
          ],
          "tokenizer": "punctuation", <1>
          "filter": [
            "lowercase",
            "english_stop" <1>
          ]
        }
      },
      "tokenizer": {
        "punctuation": { <1>
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { <1>
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { <1>
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
--------------------------------------------------
// CONSOLE
<1> The `emoticons` character filter, `punctuation` tokenizer and
    `english_stop` token filter are customised versions which are defined
    in the same index settings.
/////////////////////
[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "_happy_",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "person",
      "start_offset": 9,
      "end_offset": 15,
      "type": "word",
      "position": 3
    },
    {
      "token": "you",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------
// TESTRESPONSE
/////////////////////
The above example produces the following terms:
[source,text]
---------------------------
[ i'm, _happy_, person, you ]
---------------------------
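
To see what each stage contributes, the same request can be repeated with the
`explain` option of the `_analyze` API. This is a sketch that assumes the
`my_index` index defined above still exists. The response is expected to break
the output down by character filter, tokenizer and token filter, and is
omitted here.

[source,js]
--------------------------------------------------
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?",
  "explain": true
}
--------------------------------------------------
// CONSOLE
// TEST[continued]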