
[[analysis]]
= Analysis

[partintro]
--
The index analysis module acts as a configurable registry of analyzers.
Analyzers are used both to break indexed (analyzed) fields into terms when
a document is indexed and to process query strings. Each analyzer maps to
a Lucene `Analyzer`.


Analyzers are composed of a single <<analysis-tokenizers,Tokenizer>>
and zero or more <<analysis-tokenfilters,TokenFilters>>. The tokenizer may
be preceded by one or more <<analysis-charfilters,CharFilters>>. The
analysis module lets you register `TokenFilters`, `Tokenizers`, and
`Analyzers` under logical names that can then be referenced in
mapping definitions or in certain APIs. The analysis module
automatically registers (*if not explicitly defined*) the built-in
analyzers, token filters, and tokenizers.

Here is a sample configuration:

[source,yaml]
--------------------------------------------------
index :
    analysis :
        analyzer : 
            standard : 
                type : standard
                stopwords : [stop1, stop2]
            myAnalyzer1 :
                type : standard
                stopwords : [stop1, stop2, stop3]
                max_token_length : 500
            # configure a custom analyzer which is 
            # exactly like the default standard analyzer
            myAnalyzer2 :
                tokenizer : standard
                filter : [standard, lowercase, stop]
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
            myTokenizer2 :
                type : keyword
                buffer_size : 512
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myTokenFilter2 :
                type : length
                min : 0
                max : 2000
--------------------------------------------------
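
The sample above does not use a char filter, and its custom analyzer
refers only to built-in components. As a minimal sketch, the fragment below
chains all three stages (char filter, tokenizer, token filters) and refers to
the `myTokenizer1` and `myTokenFilter1` entries by their logical names. It
assumes that those two entries are defined as in the sample above and that the
built-in `html_strip` char filter is available. The analyzer name
`myAnalyzer3` is hypothetical.

[source,yaml]
--------------------------------------------------
index :
    analysis :
        analyzer :
            # hypothetical name, used for illustration only
            myAnalyzer3 :
                type : custom
                # char filters run first, on the raw text
                char_filter : [html_strip]
                # exactly one tokenizer, referenced by logical name
                tokenizer : myTokenizer1
                # token filters are applied in the listed order
                filter : [lowercase, myTokenFilter1]
--------------------------------------------------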

[float]
[[backwards-compatibility]]
=== Backwards compatibility

All analyzers, tokenizers, and token filters can be configured with a
`version` parameter that controls which Lucene version's behavior they
use. Possible values are `3.0` through `3.6` and `4.0` through `4.3`. The
highest version number is the default.
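
As an illustrative sketch, the `version` parameter could be set on an
analyzer as shown below. The analyzer name `myVersionedAnalyzer` is
hypothetical, and `3.6` is simply one of the values listed above.

[source,yaml]
--------------------------------------------------
index :
    analysis :
        analyzer :
            # hypothetical name, used for illustration only
            myVersionedAnalyzer :
                type : standard
                # request Lucene 3.6 behavior instead of the default
                version : 3.6
--------------------------------------------------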

--

include::analysis/analyzers.asciidoc[]

include::analysis/tokenizers.asciidoc[]

include::analysis/tokenfilters.asciidoc[]

include::analysis/charfilters.asciidoc[]

include::analysis/icu-plugin.asciidoc[]