261 lines
5.9 KiB
Plaintext
261 lines
5.9 KiB
Plaintext
[[analysis-custom-analyzer]]
|
|
=== Create a custom analyzer
|
|
|
|
When the built-in analyzers do not fulfill your needs, you can create a
|
|
`custom` analyzer which uses the appropriate combination of:
|
|
|
|
* zero or more <<analysis-charfilters, character filters>>
|
|
* a <<analysis-tokenizers,tokenizer>>
|
|
* zero or more <<analysis-tokenfilters,token filters>>.
|
|
|
|
[float]
|
|
=== Configuration
|
|
|
|
The `custom` analyzer accepts the following parameters:
|
|
|
|
[horizontal]
|
|
`tokenizer`::
|
|
|
|
A built-in or customised <<analysis-tokenizers,tokenizer>>.
|
|
(Required)
|
|
|
|
`char_filter`::
|
|
|
|
An optional array of built-in or customised
|
|
<<analysis-charfilters, character filters>>.
|
|
|
|
`filter`::
|
|
|
|
An optional array of built-in or customised
|
|
<<analysis-tokenfilters, token filters>>.
|
|
|
|
`position_increment_gap`::
|
|
|
|
When indexing an array of text values, Elasticsearch inserts a fake "gap"
|
|
between the last term of one value and the first term of the next value to
|
|
ensure that a phrase query doesn't match two terms from different array
|
|
elements. Defaults to `100`. See <<position-increment-gap>> for more.
|
|
|
|
[float]
|
|
=== Example configuration
|
|
|
|
Here is an example that combines the following:
|
|
|
|
Character Filter::
|
|
* <<analysis-htmlstrip-charfilter,HTML Strip Character Filter>>
|
|
|
|
Tokenizer::
|
|
* <<analysis-standard-tokenizer,Standard Tokenizer>>
|
|
|
|
Token Filters::
|
|
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
|
|
* <<analysis-asciifolding-tokenfilter,ASCII-Folding Token Filter>>
|
|
|
|
[source,console]
|
|
--------------------------------
|
|
PUT my_index
|
|
{
|
|
"settings": {
|
|
"analysis": {
|
|
"analyzer": {
|
|
"my_custom_analyzer": {
|
|
"type": "custom", <1>
|
|
"tokenizer": "standard",
|
|
"char_filter": [
|
|
"html_strip"
|
|
],
|
|
"filter": [
|
|
"lowercase",
|
|
"asciifolding"
|
|
]
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
POST my_index/_analyze
|
|
{
|
|
"analyzer": "my_custom_analyzer",
|
|
"text": "Is this <b>déjà vu</b>?"
|
|
}
|
|
--------------------------------
|
|
|
|
<1> Setting `type` to `custom` tells Elasticsearch that we are defining a custom analyzer.
|
|
Compare this to how <<configuring-analyzers,built-in analyzers can be configured>>:
|
|
`type` will be set to the name of the built-in analyzer, like
|
|
<<analysis-standard-analyzer,`standard`>> or <<analysis-simple-analyzer,`simple`>>.
|
|
|
|
/////////////////////
|
|
|
|
[source,console-result]
|
|
----------------------------
|
|
{
|
|
"tokens": [
|
|
{
|
|
"token": "is",
|
|
"start_offset": 0,
|
|
"end_offset": 2,
|
|
"type": "<ALPHANUM>",
|
|
"position": 0
|
|
},
|
|
{
|
|
"token": "this",
|
|
"start_offset": 3,
|
|
"end_offset": 7,
|
|
"type": "<ALPHANUM>",
|
|
"position": 1
|
|
},
|
|
{
|
|
"token": "deja",
|
|
"start_offset": 11,
|
|
"end_offset": 15,
|
|
"type": "<ALPHANUM>",
|
|
"position": 2
|
|
},
|
|
{
|
|
"token": "vu",
|
|
"start_offset": 16,
|
|
"end_offset": 22,
|
|
"type": "<ALPHANUM>",
|
|
"position": 3
|
|
}
|
|
]
|
|
}
|
|
----------------------------
|
|
|
|
/////////////////////
|
|
|
|
|
|
The above example produces the following terms:
|
|
|
|
[source,text]
|
|
---------------------------
|
|
[ is, this, deja, vu ]
|
|
---------------------------
|
|
|
|
The previous example used tokenizer, token filters, and character filters with
|
|
their default configurations, but it is possible to create configured versions
|
|
of each and to use them in a custom analyzer.
|
|
|
|
Here is a more complicated example that combines the following:
|
|
|
|
Character Filter::
|
|
* <<analysis-mapping-charfilter,Mapping Character Filter>>, configured to replace `:)` with `_happy_` and `:(` with `_sad_`
|
|
|
|
Tokenizer::
|
|
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>, configured to split on punctuation characters
|
|
|
|
Token Filters::
|
|
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
|
|
* <<analysis-stop-tokenfilter,Stop Token Filter>>, configured to use the pre-defined list of English stop words
|
|
|
|
|
|
Here is an example:
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT my_index
|
|
{
|
|
"settings": {
|
|
"analysis": {
|
|
"analyzer": {
|
|
"my_custom_analyzer": { <1>
|
|
"type": "custom",
|
|
"char_filter": [
|
|
"emoticons"
|
|
],
|
|
"tokenizer": "punctuation",
|
|
"filter": [
|
|
"lowercase",
|
|
"english_stop"
|
|
]
|
|
}
|
|
},
|
|
"tokenizer": {
|
|
"punctuation": { <2>
|
|
"type": "pattern",
|
|
"pattern": "[ .,!?]"
|
|
}
|
|
},
|
|
"char_filter": {
|
|
"emoticons": { <3>
|
|
"type": "mapping",
|
|
"mappings": [
|
|
":) => _happy_",
|
|
":( => _sad_"
|
|
]
|
|
}
|
|
},
|
|
"filter": {
|
|
"english_stop": { <4>
|
|
"type": "stop",
|
|
"stopwords": "_english_"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
POST my_index/_analyze
|
|
{
|
|
"analyzer": "my_custom_analyzer",
|
|
"text": "I'm a :) person, and you?"
|
|
}
|
|
--------------------------------------------------
|
|
|
|
<1> Assigns the index a default custom analyzer, `my_custom_analyzer`. This
|
|
analyzer uses a custom tokenizer, character filter, and token filter that
|
|
are defined later in the request.
|
|
<2> Defines the custom `punctuation` tokenizer.
|
|
<3> Defines the custom `emoticons` character filter.
|
|
<4> Defines the custom `english_stop` token filter.
|
|
|
|
/////////////////////
|
|
|
|
[source,console-result]
|
|
----------------------------
|
|
{
|
|
"tokens": [
|
|
{
|
|
"token": "i'm",
|
|
"start_offset": 0,
|
|
"end_offset": 3,
|
|
"type": "word",
|
|
"position": 0
|
|
},
|
|
{
|
|
"token": "_happy_",
|
|
"start_offset": 6,
|
|
"end_offset": 8,
|
|
"type": "word",
|
|
"position": 2
|
|
},
|
|
{
|
|
"token": "person",
|
|
"start_offset": 9,
|
|
"end_offset": 15,
|
|
"type": "word",
|
|
"position": 3
|
|
},
|
|
{
|
|
"token": "you",
|
|
"start_offset": 21,
|
|
"end_offset": 24,
|
|
"type": "word",
|
|
"position": 5
|
|
}
|
|
]
|
|
}
|
|
----------------------------
|
|
|
|
/////////////////////
|
|
|
|
|
|
The above example produces the following terms:
|
|
|
|
[source,text]
|
|
---------------------------
|
|
[ i'm, _happy_, person, you ]
|
|
---------------------------
|