mirror of
https://github.com/honeymoose/OpenSearch.git
synced 2025-02-06 04:58:50 +00:00
4b9664beeb
This is much more fiddly than you'd expect because of the way position_offset_gap is applied in StringFieldMapper. Instead of setting the default to 100, it's simpler to make sure that all the analyzers default to 100 and that StringFieldMapper doesn't override the default unless the user specifies something different. Unless the index was created before 2.1, in which case the old default of 0 has to apply. Also, position_offset_gap values less than 0 aren't allowed at all. New tests verify that: 1. the new default doesn't match phrases across values with reasonably low slop (5) 2. the new default doesn't match phrases across values with reasonably high slop (50) 3. you can override the value and phrases work as you'd expect 4. if you leave the value undefined in the mapping and define it on a custom analyzer, the value from the custom analyzer shines through Closes #7268
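The default-resolution rule the commit message describes can be sketched as follows. This is a hypothetical illustration only, not the actual StringFieldMapper code; the function name and arguments are invented for the example:

```python
def resolve_position_offset_gap(user_value=None, index_created_before_2_1=False):
    """Hypothetical sketch of the default-resolution rule described above."""
    if user_value is not None:
        if user_value < 0:
            # position_offset_gap values less than 0 aren't allowed at all
            raise ValueError("position_offset_gap must be >= 0")
        return user_value  # an explicit user setting always wins
    # indices created before 2.1 keep the old default of 0; everything else gets 100
    return 0 if index_created_before_2_1 else 100
```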
60 lines
2.0 KiB
Plaintext
[[analysis-custom-analyzer]]
=== Custom Analyzer

An analyzer of type `custom` that allows you to combine a `Tokenizer` with
zero or more `Token Filters`, and zero or more `Char Filters`. The
custom analyzer accepts a logical/registered name of the tokenizer to
use, and a list of logical/registered names of token filters.
The name of the custom analyzer must not start with "_".

The following are settings that can be set for a `custom` analyzer type:
[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`tokenizer` |The logical / registered name of the tokenizer to use.

|`filter` |An optional list of logical / registered names of token
filters.

|`char_filter` |An optional list of logical / registered names of char
filters.

|`position_offset_gap` |An optional number of positions to increment
between each field value of a field using this analyzer. Defaults to 100.
100 was chosen because it prevents phrase queries with reasonably large
slops (less than 100) from matching terms across field values.
|=======================================================================
Here is an example:

[source,js]
--------------------------------------------------
index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : myTokenizer1
                filter : [myTokenFilter1, myTokenFilter2]
                char_filter : [my_html]
                position_offset_gap: 256
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myTokenFilter2 :
                type : length
                min : 0
                max : 2000
        char_filter :
            my_html :
                type : html_strip
                escaped_tags : [xxx, yyy]
                read_ahead : 1024
--------------------------------------------------
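To see why a gap of 100 stops phrase queries from matching across field values, here is a small standalone simulation of position-based phrase matching. It is an illustration of the idea only, not Elasticsearch/Lucene code; the tokenization and the slop check are deliberately simplified:

```python
def term_positions(values, position_offset_gap=100):
    """Record each token's position, inserting the gap between field values."""
    positions, pos = {}, 0
    for i, value in enumerate(values):
        if i > 0:
            pos += position_offset_gap  # gap between consecutive field values
        for token in value.lower().split():
            positions.setdefault(token, []).append(pos)
            pos += 1
    return positions

def phrase_matches(positions, first, second, slop=0):
    """Simplified two-term sloppy phrase check: terms in order, within slop."""
    return any(0 <= pb - pa - 1 <= slop
               for pa in positions.get(first, [])
               for pb in positions.get(second, []))

# Two values of a multi-valued field:
docs = ["John Smith", "Jane Doe"]

# With the default gap of 100, "smith jane" does not match at slop 5:
phrase_matches(term_positions(docs), "smith", "jane", slop=5)       # False
# With a gap of 0 (the old default), the phrase leaks across values:
phrase_matches(term_positions(docs, 0), "smith", "jane", slop=5)    # True
```

With the 100-position gap, the last token of one value and the first token of the next are 100 positions apart, so only a phrase query with slop of 100 or more could bridge them.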