2013-08-28 19:24:34 -04:00
|
|
|
[[analysis]]
|
|
|
|
= Analysis
|
|
|
|
|
|
|
|
[partintro]
|
|
|
|
--
|
|
|
|
The index analysis module acts as a configurable registry of Analyzers
|
|
|
|
that can be used in order to both break indexed (analyzed) fields when a
|
|
|
|
document is indexed and process query strings. It maps to the Lucene
|
|
|
|
`Analyzer`.
|
|
|
|
|
|
|
|
|
|
|
|
Analyzers are composed of a single <<analysis-tokenizers,Tokenizer>>
|
|
|
|
and zero or more <<analysis-tokenfilters,TokenFilters>>. The tokenizer may
|
|
|
|
be preceded by one or more <<analysis-charfilters,CharFilters>>. The
|
|
|
|
analysis module allows one to register `TokenFilters`, `Tokenizers` and
|
|
|
|
`Analyzers` under logical names that can then be referenced either in
|
|
|
|
mapping definitions or in certain APIs. The Analysis module
|
|
|
|
automatically registers (*if not explicitly defined*) built in
|
|
|
|
analyzers, token filters, and tokenizers.
|
|
|
|
|
|
|
|
Here is a sample configuration:
|
|
|
|
|
|
|
|
[source,js]
|
|
|
|
--------------------------------------------------
|
|
|
|
index :
|
|
|
|
analysis :
|
|
|
|
analyzer :
|
|
|
|
standard :
|
|
|
|
type : standard
|
|
|
|
stopwords : [stop1, stop2]
|
|
|
|
myAnalyzer1 :
|
|
|
|
type : standard
|
|
|
|
stopwords : [stop1, stop2, stop3]
|
|
|
|
max_token_length : 500
|
|
|
|
# configure a custom analyzer which is
|
|
|
|
# exactly like the default standard analyzer
|
|
|
|
myAnalyzer2 :
|
|
|
|
tokenizer : standard
|
|
|
|
filter : [standard, lowercase, stop]
|
|
|
|
tokenizer :
|
|
|
|
myTokenizer1 :
|
|
|
|
type : standard
|
|
|
|
max_token_length : 900
|
|
|
|
myTokenizer2 :
|
|
|
|
type : keyword
|
|
|
|
buffer_size : 512
|
|
|
|
filter :
|
|
|
|
myTokenFilter1 :
|
|
|
|
type : stop
|
|
|
|
stopwords : [stop1, stop2, stop3, stop4]
|
|
|
|
myTokenFilter2 :
|
|
|
|
type : length
|
|
|
|
min : 0
|
|
|
|
max : 2000
|
|
|
|
--------------------------------------------------
|
|
|
|
|
|
|
|
[float]
|
2013-09-25 12:17:40 -04:00
|
|
|
[[backwards-compatibility]]
|
2013-08-28 19:24:34 -04:00
|
|
|
=== Backwards compatibility
|
|
|
|
|
|
|
|
All analyzers, tokenizers, and token filters can be configured with a
|
|
|
|
`version` parameter to control which Lucene version behavior they should
|
|
|
|
use. Possible values are: `3.0` - `3.6`, `4.0` - `4.3` (the highest
|
|
|
|
version number is the default option).
|
|
|
|
|
|
|
|
--
|
|
|
|
|
|
|
|
include::analysis/analyzers.asciidoc[]
|
|
|
|
|
|
|
|
include::analysis/tokenizers.asciidoc[]
|
|
|
|
|
|
|
|
include::analysis/tokenfilters.asciidoc[]
|
|
|
|
|
|
|
|
include::analysis/charfilters.asciidoc[]
|
|
|
|
|
|
|
|
include::analysis/icu-plugin.asciidoc[]
|
|
|
|
|