[[analysis-standard-tokenizer]]
=== Standard Tokenizer

A tokenizer of type `standard` that provides grammar based tokenization,
making it a good choice for most European language documents. The
tokenizer implements the Unicode Text Segmentation algorithm, as
specified in http://unicode.org/reports/tr29/[Unicode Standard Annex #29].
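
For example, the `_analyze` API can be used to see how this tokenizer
splits a piece of text. This is a minimal sketch, assuming a version of
Elasticsearch whose `_analyze` API accepts a JSON request body; the
sample text is illustrative:

[source,js]
--------------------------------------------------
POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
--------------------------------------------------

The text above would be split on word boundaries into terms such as
`The`, `2`, `QUICK`, `Brown`, `Foxes`, `jumped`, `over`, `the`, `lazy`,
`dog's`, `bone`, with the punctuation discarded.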

The following settings can be configured for a `standard` tokenizer:

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`max_token_length` |The maximum token length. If a token exceeds this
length, it is split at `max_token_length` intervals. Defaults to `255`.
|=======================================================================
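
Below is a sketch of how `max_token_length` could be set when defining
a custom tokenizer in the index settings. The index name `my_index` and
the tokenizer and analyzer names `my_tokenizer` and `my_analyzer` are
illustrative placeholders:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}
--------------------------------------------------

With `max_token_length` set to `5`, a token such as `jumped` would be
emitted as `jumpe` followed by `d`.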