
[[analysis-custom-analyzer]]
=== Custom Analyzer

When the built-in analyzers do not fulfill your needs, you can create a
`custom` analyzer which uses the appropriate combination of:

* zero or more <<analysis-charfilters, character filters>>
* a <<analysis-tokenizers,tokenizer>>
* zero or more <<analysis-tokenfilters,token filters>>.

[float]
=== Configuration

The `custom` analyzer accepts the following parameters:

[horizontal]
`tokenizer`::

    A built-in or customised <<analysis-tokenizers,tokenizer>>.
    (Required)

`char_filter`::

    An optional array of built-in or customised
    <<analysis-charfilters, character filters>>.

`filter`::

    An optional array of built-in or customised
    <<analysis-tokenfilters, token filters>>.

`position_increment_gap`::

    When indexing an array of text values, Elasticsearch inserts a fake "gap"
    between the last term of one value and the first term of the next value to
    ensure that a phrase query doesn't match two terms from different array
    elements. Defaults to `100`. See <<position-increment-gap>> for more.
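
For example, the gap can be raised for an individual custom analyzer. A
minimal sketch (the analyzer name and the value `500` are illustrative):

[source,js]
--------------------------------
PUT my_index?include_type_name=true
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_gap_analyzer": { <1>
          "type": "custom",
          "tokenizer": "standard",
          "position_increment_gap": 500 <2>
        }
      }
    }
  }
}
--------------------------------

<1> An illustrative analyzer name.
<2> Raises the gap from the default of `100`.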

[float]
=== Example configuration

Here is an example that combines the following:

Character Filter::
* <<analysis-htmlstrip-charfilter,HTML Strip Character Filter>>

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII-Folding Token Filter>>

[source,js]
--------------------------------
PUT my_index?include_type_name=true
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", <1>
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
--------------------------------
// CONSOLE

<1> Setting `type` to `custom` tells Elasticsearch that we are defining a custom analyzer.
Compare this to how <<configuring-analyzers,built-in analyzers can be configured>>:
`type` will be set to the name of the built-in analyzer, like
<<analysis-standard-analyzer,`standard`>> or <<analysis-simple-analyzer,`simple`>>.
A short sketch of this contrast follows the term listing below.
/////////////////////
[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "is",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "deja",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "vu",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
----------------------------
// TESTRESPONSE
/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ is, this, deja, vu ]
---------------------------
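
For comparison, a built-in analyzer is configured by setting `type` to the
analyzer's own name rather than to `custom`. A minimal sketch (the index and
analyzer names are illustrative; `stopwords` is one of the `standard`
analyzer's configuration parameters):

[source,js]
--------------------------------
PUT my_index_2?include_type_name=true
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": { <1>
          "type": "standard", <2>
          "stopwords": "_english_"
        }
      }
    }
  }
}
--------------------------------

<1> An illustrative analyzer name.
<2> The name of a built-in analyzer, rather than `custom`.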

The custom analyzer example above used a tokenizer, token filters, and
character filters with their default configurations, but it is possible to
create configured versions of each and to use them in a custom analyzer.

Here is a more complicated example that combines the following:

Character Filter::
* <<analysis-mapping-charfilter,Mapping Character Filter>>, configured to replace `:)` with `_happy_` and `:(` with `_sad_`

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>, configured to split on punctuation characters

Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>>, configured to use the pre-defined list of English stop words

Here is an example:

[source,js]
--------------------------------------------------
PUT my_index?include_type_name=true
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons" <1>
          ],
          "tokenizer": "punctuation", <1>
          "filter": [
            "lowercase",
            "english_stop" <1>
          ]
        }
      },
      "tokenizer": {
        "punctuation": { <1>
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { <1>
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { <1>
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
--------------------------------------------------
// CONSOLE

<1> The `emoticons` character filter, `punctuation` tokenizer and
`english_stop` token filter are custom implementations which are defined
in the same index settings.
/////////////////////
[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "i'm",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "_happy_",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "person",
      "start_offset": 9,
      "end_offset": 15,
      "type": "word",
      "position": 3
    },
    {
      "token": "you",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------
// TESTRESPONSE
/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ i'm, _happy_, person, you ]
---------------------------
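
To put a custom analyzer to work, reference it by name in a field mapping.
A minimal sketch (the index name, `_doc` type, and `my_text` field are
illustrative):

[source,js]
--------------------------------
PUT my_index_3?include_type_name=true
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": { <1>
          "type": "text",
          "analyzer": "my_custom_analyzer" <2>
        }
      }
    }
  }
}
--------------------------------

<1> An illustrative field name.
<2> The custom analyzer, referenced by name in the field mapping.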