
[[analysis-standard-analyzer]]
=== Standard Analyzer

The `standard` analyzer is the default analyzer, used when no other analyzer
is specified. It provides grammar-based tokenization (based on the Unicode Text
Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.

[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}
----------------------------

/////////////////////


The above sentence produces the following terms:

[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
---------------------------

[float]
=== Configuration

The `standard` analyzer accepts the following parameters:

[horizontal]
`max_token_length`::

    The maximum token length. If a token exceeds this length, it is split at
    `max_token_length` intervals. Defaults to `255`.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
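
As a minimal sketch of the array form of `stopwords`, the following request
defines an analyzer that should remove only the words `the` and `over`. The
index name `my_stopwords_index` and the analyzer name `my_array_stop_analyzer`
are hypothetical and chosen purely for illustration:

[source,console]
----------------------------
PUT my_stopwords_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_array_stop_analyzer": {
          "type": "standard",
          "stopwords": [ "the", "over" ]
        }
      }
    }
  }
}

POST my_stopwords_index/_analyze
{
  "analyzer": "my_array_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------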


[float]
=== Example configuration

In this example, we configure the `standard` analyzer to have a
`max_token_length` of 5 (for demonstration purposes), and to use the
pre-defined list of English stop words:

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumpe",
      "start_offset": 24,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "d",
      "start_offset": 29,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 11
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
---------------------------

[float]
=== Definition

The `standard` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `standard` analyzer beyond the configuration
parameters, recreate it as a `custom` analyzer and modify it, usually by
adding token filters. The following request recreates the built-in
`standard` analyzer, and you can use it as a starting point:

[source,console]
----------------------------------------------------
PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"       <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/]
<1> You'd add any token filters after `lowercase`.
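
As one possible illustration of that starting point, the sketch below appends
the built-in `asciifolding` token filter after `lowercase`. The index name
`standard_folding_example` and the analyzer name `rebuilt_standard_folding`
are hypothetical, and `asciifolding` is only an example; substitute whichever
token filters your use case needs:

[source,console]
----------------------------------------------------
PUT /standard_folding_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard_folding": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
----------------------------------------------------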