[[analysis-custom-analyzer]]
=== Create a custom analyzer

When the built-in analyzers do not fulfill your needs, you can create a
`custom` analyzer which uses the appropriate combination of:

* zero or more character filters
* a tokenizer
* zero or more token filters.

[float]
=== Configuration

The `custom` analyzer accepts the following parameters:

[horizontal]
`tokenizer`::

    A built-in or customised tokenizer.
    (Required)

`char_filter`::

    An optional array of built-in or customised character filters.

`filter`::

    An optional array of built-in or customised token filters.

`position_increment_gap`::

    When indexing an array of text values, Elasticsearch inserts a fake "gap"
    between the last term of one value and the first term of the next value to
    ensure that a phrase query doesn't match two terms from different array
    elements. Defaults to `100`. See the `position_increment_gap` mapping
    parameter for more.

[float]
=== Example configuration

Here is an example that combines the following:

Character Filter::
* HTML Strip Character Filter

Tokenizer::
* Standard Tokenizer

Token Filters::
* Lowercase Token Filter
* ASCII-Folding Token Filter

[source,console]
--------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", <1>
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
--------------------------------

<1> Setting `type` to `custom` tells Elasticsearch that we are defining a custom analyzer.
    Compare this to how built-in analyzers can be configured:
    `type` will be set to the name of the built-in analyzer, like
    `standard` or `simple`.
/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
----------------------------

/////////////////////


The above example produces the following terms:

[source,text]
---------------------------
[ is, this, deja, vu ]
---------------------------

The previous example used a tokenizer, token filters, and a character filter
with their default configurations, but it is possible to create configured
versions of each and to use them in a custom analyzer.

Here is a more complicated example that combines the following:

Character Filter::
* Mapping Character Filter, configured to replace `:)` with `_happy_` and `:(` with `_sad_`

Tokenizer::
* Pattern Tokenizer, configured to split on spaces and punctuation characters

Token Filters::
* Lowercase Token Filter
* Stop Token Filter, configured to use the pre-defined list of English stop words

[source,console]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": { <1>
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": { <2>
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { <3>
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { <4>
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
--------------------------------------------------

<1> Defines the custom analyzer `my_custom_analyzer` for the index.
This analyzer uses a custom tokenizer, character filter, and token filter that
are defined later in the request.
<2> Defines the custom `punctuation` tokenizer.
<3> Defines the custom `emoticons` character filter.
<4> Defines the custom `english_stop` token filter.

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "_happy_",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "person",
      "start_offset": 9,
      "end_offset": 15,
      "type": "word",
      "position": 3
    },
    {
      "token": "you",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------

/////////////////////


The above example produces the following terms:

[source,text]
---------------------------
[ i'm, _happy_, person, you ]
---------------------------
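
To make the order of the stages concrete, the following Python sketch mimics
the second analyzer outside Elasticsearch: the character filter runs on the
raw text, the tokenizer splits the filtered text into terms, and the token
filters run in the order they are listed. It is only an illustration, not how
Elasticsearch implements analysis. All function names are hypothetical, the
`ENGLISH_STOP` set is assumed to mirror the `_english_` list, and the simple
string replacement only approximates the `mapping` character filter.

[source,python]
--------------------------------------------------
import re

# Assumed contents of the `_english_` stop word list.
ENGLISH_STOP = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

# Same mappings as the `emoticons` character filter in the request above.
EMOTICONS = {":)": "_happy_", ":(": "_sad_"}


def emoticons_char_filter(text):
    """Character filter: rewrite the raw text before it is tokenized."""
    for source, target in EMOTICONS.items():
        text = text.replace(source, target)
    return text


def punctuation_tokenizer(text):
    """Tokenizer: split on the pattern `[ .,!?]` and drop empty pieces.

    Returns (term, position) pairs, with positions counted from 0.
    """
    pieces = [piece for piece in re.split(r"[ .,!?]", text) if piece]
    return [(term, position) for position, term in enumerate(pieces)]


def lowercase_filter(tokens):
    """Token filter: lowercase every term, leaving positions untouched."""
    return [(term.lower(), position) for term, position in tokens]


def english_stop_filter(tokens):
    """Token filter: drop stop words; surviving terms keep their position."""
    return [(term, position) for term, position in tokens
            if term not in ENGLISH_STOP]


def analyze(text):
    """Run the stages in order: char filter, tokenizer, token filters."""
    tokens = punctuation_tokenizer(emoticons_char_filter(text))
    for token_filter in (lowercase_filter, english_stop_filter):
        tokens = token_filter(tokens)
    return tokens


print(analyze("I'm a :) person, and you?"))
--------------------------------------------------

In this sketch, `analyze("I'm a :) person, and you?")` returns the terms
`i'm`, `_happy_`, `person`, and `you` at positions 0, 2, 3, and 5. Removing the
stop words `a` and `and` leaves gaps in the positions instead of renumbering
the remaining terms.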