[[analysis-ngram-tokenizer]]
=== NGram Tokenizer

The `ngram` tokenizer first breaks text into words wherever it encounters a
character outside the configured character classes, then emits
https://en.wikipedia.org/wiki/N-gram[N-grams] of the specified length for each
word.

N-grams are like a sliding window that moves across the word: each N-gram is a
continuous sequence of characters of the specified length. They are useful for
querying languages that don't use spaces between words, or languages with long
compound words, such as German.
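
As a minimal sketch of the sliding window, the following request defines an
inline `ngram` tokenizer with `min_gram` and `max_gram` both set to `3`. The
sample text `Fuchs` is an arbitrary choice for illustration:

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3
  },
  "text": "Fuchs"
}
---------------------------

Each term is the content of the window after it moves one character to the
right:

[source,text]
---------------------------
[ Fuc, uch, chs ]
---------------------------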

[float]
=== Example output

With the default settings, the `ngram` tokenizer treats the initial text as a
single token and produces N-grams with minimum length `1` and maximum length
`2`:

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "Q",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "Qu",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "u",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "ui",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "i",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 4
    },
    {
      "token": "ic",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 5
    },
    {
      "token": "c",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 6
    },
    {
      "token": "ck",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 7
    },
    {
      "token": "k",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 8
    },
    {
      "token": "k ",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 9
    },
    {
      "token": " ",
      "start_offset": 5,
      "end_offset": 6,
      "type": "word",
      "position": 10
    },
    {
      "token": " F",
      "start_offset": 5,
      "end_offset": 7,
      "type": "word",
      "position": 11
    },
    {
      "token": "F",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 12
    },
    {
      "token": "Fo",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 13
    },
    {
      "token": "o",
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 14
    },
    {
      "token": "ox",
      "start_offset": 7,
      "end_offset": 9,
      "type": "word",
      "position": 15
    },
    {
      "token": "x",
      "start_offset": 8,
      "end_offset": 9,
      "type": "word",
      "position": 16
    }
  ]
}
----------------------------

/////////////////////


The above request produces the following terms:

[source,text]
---------------------------
[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]
---------------------------

[float]
=== Configuration

The `ngram` tokenizer accepts the following parameters:

[horizontal]
`min_gram`::
    Minimum number of characters in a gram.  Defaults to `1`.

`max_gram`::
    Maximum number of characters in a gram.  Defaults to `2`.

`token_chars`::

    Character classes that should be included in a token.  Elasticsearch
    will split on characters that don't belong to the classes specified.
    Defaults to `[]` (keep all characters).
+
Character classes may be any of the following:
+
* `letter` --      for example `a`, `b`, `ï` or `京`
* `digit` --       for example `3` or `7`
* `whitespace` --  for example `" "` or `"\n"`
* `punctuation` -- for example `!` or `"`
* `symbol` --      for example `$` or `√`
* `custom` --      custom characters which need to be set using the
`custom_token_chars` setting.

`custom_token_chars`::

    Custom characters that should be treated as part of a token. For example,
    setting this to `+-_` will make the tokenizer treat the plus, minus, and
    underscore signs as part of a token.
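
The `custom` character class can be sketched with an inline tokenizer
definition. The sample text `user_id` and the gram length of `3` are
illustrative assumptions, not values taken from this page:

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": [
      "letter",
      "digit",
      "custom"
    ],
    "custom_token_chars": "+-_"
  },
  "text": "user_id"
}
---------------------------

Because the underscore is treated as part of the token, the text is not split
at that character and the grams span it:

[source,text]
---------------------------
[ use, ser, er_, r_i, _id ]
---------------------------

Without `custom` and `custom_token_chars`, the same text would be split into
`user` and `id`, and only `user` is long enough to yield tri-grams.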

TIP:  It usually makes sense to set `min_gram` and `max_gram` to the same
value.  The smaller the length, the more documents will match but the lower
the quality of the matches.  The longer the length, the more specific the
matches.  A tri-gram (length `3`) is a good place to start.

The index-level setting `index.max_ngram_diff` controls the maximum allowed
difference between `max_gram` and `min_gram`. It defaults to `1`.
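
As a sketch of how this limit interacts with the tokenizer settings, the
request below assumes that grams of length `2` to `4` are wanted. That is a
difference of `2`, which exceeds the default limit of `1`, so the index
setting is raised to match. The index name `my_wide_index` and the chosen
lengths are hypothetical:

[source,console]
----------------------------
PUT my_wide_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 4
        }
      }
    }
  }
}
----------------------------

A wider range produces more terms per word and therefore a larger index, which
is one reason to keep the difference small.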

[float]
=== Example configuration

In this example, we configure the `ngram` tokenizer to treat letters and
digits as token characters, and to produce tri-grams (grams of length `3`):

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "Qui",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "uic",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "ick",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 2
    },
    {
      "token": "Fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 3
    },
    {
      "token": "oxe",
      "start_offset": 9,
      "end_offset": 12,
      "type": "word",
      "position": 4
    },
    {
      "token": "xes",
      "start_offset": 10,
      "end_offset": 13,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------

/////////////////////


The above example produces the following terms. The single-character token `2`
is shorter than `min_gram`, so it yields no grams:

[source,text]
---------------------------
[ Qui, uic, ick, Fox, oxe, xes ]
---------------------------