OpenSearch/docs/reference/analysis/analyzers/custom-analyzer.asciidoc

[[analysis-custom-analyzer]]
=== Custom Analyzer

When the built-in analyzers do not fulfill your needs, you can create a
`custom` analyzer which uses the appropriate combination of:

* zero or more <<analysis-charfilters, character filters>>
* a <<analysis-tokenizers,tokenizer>>
* zero or more <<analysis-tokenfilters,token filters>>.

[float]
=== Configuration

The `custom` analyzer accepts the following parameters:

[horizontal]
`tokenizer`::

    A built-in or customised <<analysis-tokenizers,tokenizer>>.
    (Required)

`char_filter`::

    An optional array of built-in or customised
    <<analysis-charfilters, character filters>>.

`filter`::

    An optional array of built-in or customised
    <<analysis-tokenfilters, token filters>>.

`position_increment_gap`::

    When indexing an array of text values, Elasticsearch inserts a fake "gap"
    between the last term of one value and the first term of the next value to
    ensure that a phrase query doesn't match two terms from different array
    elements. Defaults to `100`. See <<position-increment-gap>> for more.
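
As a minimal sketch of how these parameters fit together, the following request
defines a `custom` analyzer that sets every parameter listed above. The index
name `my_index`, the analyzer name `my_custom_analyzer`, and the gap value are
arbitrary choices made for illustration, not recommendations:

[source,js]
--------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":                   "custom",
          "tokenizer":              "standard",
          "char_filter":            [ "html_strip" ],
          "filter":                 [ "lowercase" ],
          "position_increment_gap": 50 <1>
        }
      }
    }
  }
}
--------------------------------
// CONSOLE

<1> An illustrative value only. Omit this parameter to keep the default of `100`.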

[float]
=== Example configuration

Here is an example that combines the following:

Character Filter::
* <<analysis-htmlstrip-charfilter,HTML Strip Character Filter>>

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII-Folding Token Filter>>

[source,js]
--------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
--------------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////


The above example produces the following terms:

[source,text]
---------------------------
[ is, this, deja, vu ]
---------------------------
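
To experiment with a combination like this before creating an index, the same
chain can usually be passed directly to the `_analyze` API. The request below
is a sketch which assumes a version whose `_analyze` request body accepts the
`tokenizer`, `char_filter`, and `filter` keys:

[source,js]
--------------------------------
POST _analyze
{
  "tokenizer":   "standard",
  "char_filter": [ "html_strip" ],
  "filter":      [ "lowercase", "asciifolding" ],
  "text":        "Is this <b>déjà vu</b>?"
}
--------------------------------
// CONSOLE

Because the components and their order match those of `my_custom_analyzer`,
this request should produce the same terms as the indexed analyzer.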

The previous example used a tokenizer, token filters, and a character filter
with their default configurations, but it is possible to create configured
versions of each and to use them in a custom analyzer.

Here is a more complicated example that combines the following:

Character Filter::
* <<analysis-mapping-charfilter,Mapping Character Filter>>, configured to replace `:)` with `_happy_` and `:(` with `_sad_`

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>, configured to split on spaces and punctuation characters

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>>, configured to use the pre-defined list of English stop words


The following request defines this analyzer and then tests it:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons" <1>
          ],
          "tokenizer": "punctuation", <1>
          "filter": [
            "lowercase",
            "english_stop" <1>
          ]
        }
      },
      "tokenizer": {
        "punctuation": { <1>
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { <1>
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { <1>
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text":     "I'm a :) person, and you?"
}
--------------------------------------------------
// CONSOLE

<1> The `emoticons` character filter, `punctuation` tokenizer, and
    `english_stop` token filter are custom implementations which are defined
    in the same index settings.

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "_happy_",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "person",
      "start_offset": 9,
      "end_offset": 15,
      "type": "word",
      "position": 3
    },
    {
      "token": "you",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////


The above example produces the following terms:

[source,text]
---------------------------
[ i'm, _happy_, person, you ]
---------------------------