[[analysis-classic-tokenizer]]
=== Classic Tokenizer
added[1.3.0]
A tokenizer of type `classic` provides a grammar-based tokenizer that is a
good choice for English-language documents. This tokenizer has heuristics
for the special treatment of acronyms, company names, email addresses, and
internet host names. However, these rules don't always work, and the
tokenizer doesn't work well for most languages other than English.
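
For example, the `_analyze` API can show how this tokenizer keeps an email
address and a host name together as single tokens (a minimal sketch; the
host, port, and sample text are illustrative):

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/_analyze?tokenizer=classic&pretty' \
    -d 'Email john.doe@example.com about elasticsearch.org'
--------------------------------------------------

Here `john.doe@example.com` and `elasticsearch.org` should come back as
single tokens instead of being split on the punctuation.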
The following are settings that can be set for a `classic` tokenizer
type:
[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`max_token_length` |The maximum token length. If a token is seen that
exceeds this length then it is discarded. Defaults to `255`.
|=======================================================================
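
For example, a lower `max_token_length` can be set by registering a custom
tokenizer in the index settings (a sketch; the index, tokenizer, and
analyzer names and the value `120` are illustrative):

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_classic_tokenizer": {
          "type": "classic",
          "max_token_length": 120
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_classic_tokenizer",
          "filter": ["classic", "lowercase"]
        }
      }
    }
  }
}'
--------------------------------------------------

The `classic` token filter used above pairs naturally with this tokenizer:
it strips the possessive `'s` and removes the dots from acronyms that the
tokenizer leaves in place.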