[[analysis-custom-analyzer]]
=== Custom Analyzer

When the built-in analyzers do not fulfill your needs, you can create a
`custom` analyzer which uses the appropriate combination of:

* zero or more <<analysis-charfilters, character filters>>
* a <<analysis-tokenizers,tokenizer>>
* zero or more <<analysis-tokenfilters,token filters>>.

[float]
=== Configuration

The `custom` analyzer accepts the following parameters:

[horizontal]
`tokenizer`::

    A built-in or customised <<analysis-tokenizers,tokenizer>>.
    (Required)

`char_filter`::

    An optional array of built-in or customised
    <<analysis-charfilters, character filters>>.

`filter`::

    An optional array of built-in or customised
    <<analysis-tokenfilters, token filters>>.

`position_increment_gap`::

    When indexing an array of text values, Elasticsearch inserts a fake "gap"
    between the last term of one value and the first term of the next value to
    ensure that a phrase query doesn't match two terms from different array
    elements. Defaults to `100`. See <<position-increment-gap>> for more.

[float]
=== Example configuration

Here is an example that combines the following:

Character Filter::
* <<analysis-htmlstrip-charfilter,HTML Strip Character Filter>>

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII-Folding Token Filter>>

[source,console]
--------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", <1>
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
--------------------------------

<1> Setting `type` to `custom` tells Elasticsearch that we are defining a
custom analyzer. Compare this to how
<<configuring-analyzers,built-in analyzers can be configured>>: `type` will be
set to the name of the built-in analyzer, like
<<analysis-standard-analyzer,`standard`>> or
<<analysis-simple-analyzer,`simple`>>.

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ is, this, deja, vu ]
---------------------------

The previous example used a tokenizer, token filters, and character filters
with their default configurations, but it is possible to create configured
versions of each and to use them in a custom analyzer.
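Before looking at a configured example, note that the effect of the
`position_increment_gap` parameter described under Configuration can be
checked with the `_analyze` API by passing an array of values. A minimal
sketch, reusing `my_custom_analyzer` from the example above (the two sample
strings are purely illustrative):

[source,console]
--------------------------------
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": [ "first value", "second value" ]
}
--------------------------------

With the default gap of `100`, the response should assign `first` and `value`
positions `0` and `1`, while `second` should appear at position `102` rather
than `2`. It is this gap that stops a phrase query from matching terms drawn
from two different array elements.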
Here is a more complicated example that combines the following:

Character Filter::
* <<analysis-mapping-charfilter,Mapping Character Filter>>, configured to
replace `:)` with `_happy_` and `:(` with `_sad_`

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>, configured to split on
punctuation characters

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>>, configured to use the
pre-defined list of English stop words

Here is an example:

[source,console]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": { <1>
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": { <2>
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { <3>
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { <4>
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
--------------------------------------------------

<1> Defines a custom analyzer, `my_custom_analyzer`, for the index. This
analyzer uses a custom tokenizer, character filter, and token filter that
are defined later in the request.
<2> Defines the custom `punctuation` tokenizer.
<3> Defines the custom `emoticons` character filter.
<4> Defines the custom `english_stop` token filter.

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "_happy_",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "person",
      "start_offset": 9,
      "end_offset": 15,
      "type": "word",
      "position": 3
    },
    {
      "token": "you",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ i'm, _happy_, person, you ]
---------------------------
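Defining an analyzer in the index settings does not by itself apply it to any
field. To use `my_custom_analyzer` at index and search time, reference it from
a field mapping. A minimal sketch, assuming a recent Elasticsearch version
where mapping types are no longer required, and a hypothetical `my_text`
field added to the same index:

[source,console]
--------------------------------
PUT my_index/_mapping
{
  "properties": {
    "my_text": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}
--------------------------------

Text indexed into `my_text` would then pass through the emoticon mapping,
punctuation tokenization, lowercasing, and stop-word removal shown above.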