[[analysis-custom-analyzer]]
=== Custom Analyzer

An analyzer of type `custom` allows you to combine a `Tokenizer` with
zero or more `Token Filters` and zero or more `Char Filters`. The
custom analyzer accepts a logical/registered name for the tokenizer to
use and a list of logical/registered names of token filters. The name
of a custom analyzer must not start with `_`.

The following settings can be set for a `custom` analyzer type:

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`tokenizer` |The logical / registered name of the tokenizer to use.
|`filter` |An optional list of logical / registered names of token
filters.
|`char_filter` |An optional list of logical / registered names of char
filters.
|`position_offset_gap` |An optional number of positions to increment
between each field value of a field using this analyzer. Defaults to 100.
The default of 100 was chosen because it prevents phrase queries with
reasonably large slops (less than 100) from matching terms across field
values; see the sketch after this table.
|=======================================================================
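
Because each new value of a multi-valued field begins 100 positions after
the last term of the previous value, a phrase query cannot match terms
drawn from two different values unless its `slop` spans the gap. The
following is a minimal sketch of that behaviour (the index, type, field
name, and document are illustrative):

[source,js]
--------------------------------------------------
# One document with a two-valued field; the terms of "Lincoln Smith"
# start 100 positions after the terms of "John Abraham"
PUT /my_index/my_type/1
{
    "names" : [ "John Abraham", "Lincoln Smith" ]
}

# No hit: "Abraham" and "Lincoln" come from different values, and a
# slop of 50 cannot bridge the 100 position gap
GET /my_index/_search
{
    "query" : {
        "match_phrase" : {
            "names" : {
                "query" : "Abraham Lincoln",
                "slop" : 50
            }
        }
    }
}
--------------------------------------------------

A `slop` of 100 or more (or a smaller `position_offset_gap` on the field
or its analyzer) would allow the phrase to match across the two values.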

Here is a full example of a `custom` analyzer configuration:

[source,js]
--------------------------------------------------
index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : myTokenizer1
                filter : [myTokenFilter1, myTokenFilter2]
                char_filter : [my_html]
                position_offset_gap: 256
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myTokenFilter2 :
                type : length
                min : 0
                max : 2000
        char_filter :
            my_html :
                type : html_strip
                escaped_tags : [xxx, yyy]
                read_ahead : 1024
--------------------------------------------------
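
Once registered, a custom analyzer is referenced by name from a field
mapping. A minimal sketch, assuming a `string` field called `title` (the
type and field names are illustrative):

[source,js]
--------------------------------------------------
{
    "mappings" : {
        "my_type" : {
            "properties" : {
                "title" : {
                    "type" : "string",
                    "analyzer" : "myAnalyzer2"
                }
            }
        }
    }
}
--------------------------------------------------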