[[analysis-standard-tokenizer]]
=== Standard tokenizer
++++
<titleabbrev>Standard</titleabbrev>
++++

The `standard` tokenizer provides grammar-based tokenization (based on the
Unicode Text Segmentation algorithm, as specified in
https://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.

[discrete]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "QUICK",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "Brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "Foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}
----------------------------

/////////////////////


The request above produces the following terms; `Brown-Foxes` is split at the
hyphen and the trailing period is dropped:

[source,text]
---------------------------
[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
---------------------------

[discrete]
=== Configuration

The `standard` tokenizer accepts the following parameters:

[horizontal]
`max_token_length`::

    The maximum token length. If a token exceeds this length, it is split at
    `max_token_length` intervals. Defaults to `255`.
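
The splitting rule described above can be sketched in a few lines of Python.
This is only an illustration of the "split at `max_token_length` intervals"
behavior, not the tokenizer's actual implementation. The `split_token` helper
is hypothetical, and it assumes length is counted the way Python's `len`
counts characters, which may differ from the tokenizer's internal counting for
some Unicode text.

```python
def split_token(token: str, max_token_length: int = 255) -> list[str]:
    """Cut a token into consecutive chunks of at most max_token_length characters.

    Hypothetical helper for illustration only; the real splitting happens
    inside the tokenizer.
    """
    if max_token_length < 1:
        raise ValueError("max_token_length must be at least 1")
    # Step through the token in fixed-size intervals; the last chunk may be shorter.
    return [
        token[start:start + max_token_length]
        for start in range(0, len(token), max_token_length)
    ]


# A six-character token with a limit of 5 yields a five-character chunk and a remainder.
print(split_token("jumped", max_token_length=5))  # ['jumpe', 'd']
# A token that fits within the limit is left as a single chunk.
print(split_token("QUICK", max_token_length=5))   # ['QUICK']
```

With the default of `255`, ordinary words pass through unchanged; only unusually
long character runs are affected.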

[discrete]
=== Example configuration

In this example, we configure the `standard` tokenizer to have a
`max_token_length` of 5 (for demonstration purposes):

[source,console]
----------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "QUICK",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "Brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "Foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumpe",
      "start_offset": 24,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "d",
      "start_offset": 29,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 11
    }
  ]
}
----------------------------

/////////////////////


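With `max_token_length` set to 5, the six-character token `jumped` is split
into `jumpe` and `d`, so the example above produces the following terms: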

[source,text]
---------------------------
[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]
---------------------------