OpenSearch/docs/reference/analysis/tokenizers/standard-tokenizer.asciidoc

[[analysis-standard-tokenizer]]
=== Standard Tokenizer

A tokenizer of type `standard` provides grammar-based tokenization that
works well for most European-language documents. The tokenizer
implements the Unicode Text Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29].

The following settings can be configured for a `standard` tokenizer:

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`max_token_length` |The maximum token length. A token that exceeds
this length is discarded. Defaults to `255`.
|=======================================================================
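
As a minimal sketch of how the setting above might be applied, the
following index settings body defines a custom tokenizer of type
`standard` with a lower `max_token_length`, and an analyzer that uses it.
The names `my_analyzer` and `my_tokenizer` and the value `5` are
illustrative assumptions, not values taken from this reference; the body
would be sent when creating an index.

[source,json]
--------------------------------------------------
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}
--------------------------------------------------

Choosing a very small limit like this is only useful for experimenting
with the setting; the default of `255` suits most text.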

// Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00