
[[analysis-custom-analyzer]]
=== Custom Analyzer

An analyzer of type `custom` combines a `Tokenizer` with
zero or more `Token Filters` and zero or more `Char Filters`. The
custom analyzer accepts the logical/registered name of the tokenizer to
use, a list of logical/registered names of token filters, and a list of
logical/registered names of char filters.
The name of the custom analyzer must not start with "_".

The following settings can be set for a `custom` analyzer type:

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`tokenizer` |The logical / registered name of the tokenizer to use.

|`filter` |An optional list of logical / registered names of token
filters.

|`char_filter` |An optional list of logical / registered names of char
filters.

|`position_increment_gap` |The number of positions to insert between
successive values of a multi-valued field that uses this analyzer.
Optional; defaults to 100. The default of 100 prevents phrase queries with
reasonably large slops (less than 100) from matching terms across field
values.
|=======================================================================
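
To make the effect of the gap concrete, the following is a minimal sketch of
a phrase query. It assumes a hypothetical index `my_index` with a field
`names` that uses this analyzer and holds the two values `John Abraham` and
`Lincoln Smith` in a single document; neither name comes from this page. With
the default gap of 100, a slop of 5 should not be enough for `abraham` and
`lincoln` to match as a phrase across the boundary between the two values.

[source,js]
--------------------------------------------------
GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "names": {
        "query": "Abraham Lincoln",
        "slop": 5
      }
    }
  }
}
--------------------------------------------------

Under the same assumptions, a slop at least as large as the gap, or a
smaller `position_increment_gap`, would be expected to let the phrase match
across values.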

Here is an example:

[source,yaml]
--------------------------------------------------
index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : myTokenizer1
                filter : [myTokenFilter1, myTokenFilter2]
                char_filter : [my_html]
                position_increment_gap : 256
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myTokenFilter2 :
                type : length
                min : 0
                max : 2000
        char_filter :
            my_html :
                type : html_strip
                escaped_tags : [xxx, yyy]
                read_ahead : 1024
--------------------------------------------------
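
The same settings can also be supplied as JSON when an index is created. The
request below is a sketch that mirrors the example above one for one; the
index name `my_index` is a placeholder, and it is assumed that the version in
use accepts every setting shown.

[source,js]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "myAnalyzer2": {
          "type": "custom",
          "tokenizer": "myTokenizer1",
          "filter": ["myTokenFilter1", "myTokenFilter2"],
          "char_filter": ["my_html"],
          "position_increment_gap": 256
        }
      },
      "tokenizer": {
        "myTokenizer1": {
          "type": "standard",
          "max_token_length": 900
        }
      },
      "filter": {
        "myTokenFilter1": {
          "type": "stop",
          "stopwords": ["stop1", "stop2", "stop3", "stop4"]
        },
        "myTokenFilter2": {
          "type": "length",
          "min": 0,
          "max": 2000
        }
      },
      "char_filter": {
        "my_html": {
          "type": "html_strip",
          "escaped_tags": ["xxx", "yyy"],
          "read_ahead": 1024
        }
      }
    }
  }
}
--------------------------------------------------

Each name listed under `filter` and `char_filter` in the analyzer must match
a key defined in the corresponding section of the same `analysis` block, or
be the name of a built-in component.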