First pass at improving analyzer docs (#18269)

* Docs: First pass at improving analyzer docs

I've rewritten the intro to analyzers plus the docs
for all analyzers to provide working examples.

I've also removed:

* analyzer aliases (see #18244)
* analyzer versions (see #18267)
* snowball analyzer (see #8690)

Next steps will be tokenizers, token filters, char filters

* Fixed two typos
This commit is contained in:
Clinton Gormley 2016-05-11 14:17:56 +02:00
parent ae01a4f7f9
commit 97a41ee973
16 changed files with 1045 additions and 362 deletions

View File

@ -3,68 +3,113 @@
[partintro]
--
The index analysis module acts as a configurable registry of Analyzers
that can be used both to break down indexed (analyzed) fields when a
document is indexed and to process query strings. It maps to the Lucene
`Analyzer`.
Analyzers are composed of a single <<analysis-tokenizers,Tokenizer>>
and zero or more <<analysis-tokenfilters,TokenFilters>>. The tokenizer may
be preceded by one or more <<analysis-charfilters,CharFilters>>. The
analysis module allows one to register `TokenFilters`, `Tokenizers` and
`Analyzers` under logical names that can then be referenced either in
mapping definitions or in certain APIs. The Analysis module
automatically registers (*if not explicitly defined*) built in
analyzers, token filters, and tokenizers.
Here is a sample configuration:
[source,js]
--------------------------------------------------
index :
analysis :
analyzer :
standard :
type : standard
stopwords : [stop1, stop2]
myAnalyzer1 :
type : standard
stopwords : [stop1, stop2, stop3]
max_token_length : 500
# configure a custom analyzer which is
# exactly like the default standard analyzer
myAnalyzer2 :
tokenizer : standard
filter : [standard, lowercase, stop]
tokenizer :
myTokenizer1 :
type : standard
max_token_length : 900
myTokenizer2 :
type : keyword
buffer_size : 512
filter :
myTokenFilter1 :
type : stop
stopwords : [stop1, stop2, stop3, stop4]
myTokenFilter2 :
type : length
min : 0
max : 2000
--------------------------------------------------
_Analysis_ is the process of converting text, like the body of any email, into
_tokens_ or _terms_ which are added to the inverted index for searching.
Analysis is performed by an <<analysis-analyzers,_analyzer_>> which can be
either a built-in analyzer or a <<analysis-custom-analyzer,`custom`>> analyzer
defined per index.
[float]
[[backwards-compatibility]]
=== Backwards compatibility
== Index time analysis
All analyzers, tokenizers, and token filters can be configured with a
`version` parameter to control which Lucene version behavior they should
use. Possible values are: `3.0` - `3.6`, `4.0` - `4.3` (the highest
version number is the default option).
For instance, at index time, the built-in <<english-analyzer,`english`>> _analyzer_ would
convert this sentence:
[source,text]
------
"The QUICK brown foxes jumped over the lazy dog!"
------
into these terms, which would be added to the inverted index.
[source,text]
------
[ quick, brown, fox, jump, over, lazi, dog ]
------
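The same conversion can be reproduced with the <<indices-analyze,`analyze` API>>,
which is covered in more detail later. Here is a minimal sketch of the request:
[source,js]
-------------------------
POST _analyze
{
  "analyzer": "english",
  "text": "The QUICK brown foxes jumped over the lazy dog!"
}
-------------------------
// CONSOLE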
[float]
=== Specifying an index time analyzer
Each <<text,`text`>> field in a mapping can specify its own
<<analyzer,`analyzer`>>:
[source,js]
-------------------------
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"title": {
"type": "text",
"analyzer": "standard"
}
}
}
}
}
-------------------------
// CONSOLE
At index time, if no `analyzer` has been specified, Elasticsearch looks for an
analyzer in the index settings called `default`. Failing that, it defaults to
using the <<analysis-standard-analyzer,`standard` analyzer>>.
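The `default` analyzer itself is just an entry in the index settings. For
illustration, here is a minimal sketch (the choice of the built-in `simple`
analyzer is arbitrary):
[source,js]
-------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "simple"
        }
      }
    }
  }
}
-------------------------
// CONSOLE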
[float]
== Search time analysis
This same analysis process is applied to the query string at search time in
<<full-text-queries,full text queries>> like the
<<query-dsl-match-query,`match` query>>
to convert the text in the query string into terms of the same form as those
that are stored in the inverted index.
For instance, a user might search for:
[source,text]
------
"a quick fox"
------
which would be analysed by the same `english` analyzer into the following terms:
[source,text]
------
[ quick, fox ]
------
Even though the exact words used in the query string don't appear in the
original text (`quick` vs `QUICK`, `fox` vs `foxes`), because we have applied
the same analyzer to both the text and the query string, the terms from the
query string exactly match the terms from the text in the inverted index,
which means that this query would match our example document.
[float]
=== Specifying a search time analyzer
Usually the same analyzer should be used both at
index time and at search time, and <<full-text-queries,full text queries>>
like the <<query-dsl-match-query,`match` query>> will use the mapping to look
up the analyzer to use for each field.
The analyzer to use to search a particular field is determined by
looking for:
* An `analyzer` specified in the query itself.
* The <<search-analyzer,`search_analyzer`>> mapping parameter.
* The <<analyzer,`analyzer`>> mapping parameter.
* An analyzer in the index settings called `default_search`.
* An analyzer in the index settings called `default`.
* The `standard` analyzer.
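For illustration, the sketch below combines the first three options: the
mapping defines both an `analyzer` and a `search_analyzer` for the field, and
the `match` query then overrides them with an `analyzer` of its own (the
index, type, and field names are purely illustrative):
[source,js]
-------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "text",
          "analyzer": "standard", <1>
          "search_analyzer": "simple" <2>
        }
      }
    }
  }
}
GET _cluster/health?wait_for_status=yellow
GET my_index/_search
{
  "query": {
    "match": {
      "my_text": {
        "query": "a quick fox",
        "analyzer": "whitespace" <3>
      }
    }
  }
}
-------------------------
// CONSOLE
<1> Used at index time, and at search time when nothing more specific is found.
<2> Used at search time in preference to the `analyzer` mapping parameter.
<3> Specified in the query itself, so it takes precedence over both mapping parameters.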
--
include::analysis/anatomy.asciidoc[]
include::analysis/testing.asciidoc[]
include::analysis/analyzers.asciidoc[]
include::analysis/tokenizers.asciidoc[]

View File

@ -1,67 +1,60 @@
[[analysis-analyzers]]
== Analyzers
Analyzers are composed of a single <<analysis-tokenizers,Tokenizer>>
and zero or more <<analysis-tokenfilters,TokenFilters>>. The tokenizer may
be preceded by one or more <<analysis-charfilters,CharFilters>>.
The analysis module allows you to register `Analyzers` under logical
names which can then be referenced either in mapping definitions or in
certain APIs.
Elasticsearch ships with a wide range of built-in analyzers, which can be used
in any index without further configuration:
Elasticsearch comes with a number of prebuilt analyzers which are
ready to use. Alternatively, you can combine the built-in
character filters, tokenizers, and token filters to create
<<analysis-custom-analyzer,custom analyzers>>.
<<analysis-standard-analyzer,Standard Analyzer>>::
The `standard` analyzer divides text into terms on word boundaries, as defined
by the Unicode Text Segmentation algorithm. It removes most punctuation,
lowercases terms, and supports removing stop words.
<<analysis-simple-analyzer,Simple Analyzer>>::
The `simple` analyzer divides text into terms whenever it encounters a
character which is not a letter. It lowercases all terms.
<<analysis-whitespace-analyzer,Whitespace Analyzer>>::
The `whitespace` analyzer divides text into terms whenever it encounters any
whitespace character. It does not lowercase terms.
<<analysis-stop-analyzer,Stop Analyzer>>::
The `stop` analyzer is like the `simple` analyzer, but also supports removal
of stop words.
<<analysis-keyword-analyzer,Keyword Analyzer>>::
The `keyword` analyzer is a ``noop'' analyzer that accepts whatever text it is
given and outputs the exact same text as a single term.
<<analysis-pattern-analyzer,Pattern Analyzer>>::
The `pattern` analyzer uses a regular expression to split the text into terms.
It supports lower-casing and stop words.
<<analysis-lang-analyzer,Language Analyzers>>::
Elasticsearch provides many language-specific analyzers like `english` or
`french`.
<<analysis-fingerprint-analyzer,Fingerprint Analyzer>>::
The `fingerprint` analyzer is a specialist analyzer which creates a
fingerprint which can be used for duplicate detection.
[float]
[[default-analyzers]]
=== Default Analyzers
=== Custom analyzers
An analyzer is registered under a logical name. It can then be
referenced from mapping definitions or certain APIs. When no analyzer is
defined, defaults are used; you can also configure which analyzers will
be used by default when none can be derived.
The `default` logical name allows one to configure an analyzer that will
be used both for indexing and for searching APIs. The `default_search`
logical name can be used to configure a default analyzer that will be
used just when searching (the `default` analyzer would still be used for
indexing).
For instance, the following settings could be used to perform exact matching
only by default:
[source,js]
--------------------------------------------------
index :
analysis :
analyzer :
default :
tokenizer : keyword
--------------------------------------------------
[float]
[[aliasing-analyzers]]
=== Aliasing Analyzers
Analyzers can be aliased to have several registered lookup names
associated with them. For example, the following will allow
the `standard` analyzer to also be referenced with `alias1`
and `alias2` values.
If you do not find an analyzer suitable for your needs, you can create a
<<analysis-custom-analyzer,`custom`>> analyzer which combines the appropriate
<<analysis-charfilters, character filters>>,
<<analysis-tokenizers,tokenizer>>, and <<analysis-tokenfilters,token filters>>.
[source,js]
--------------------------------------------------
index :
analysis :
analyzer :
standard :
alias: [alias1, alias2]
type : standard
stopwords : [test1, test2, test3]
--------------------------------------------------
Below is a list of the built-in analyzers.
include::analyzers/configuring.asciidoc[]
include::analyzers/standard-analyzer.asciidoc[]
@ -77,8 +70,6 @@ include::analyzers/pattern-analyzer.asciidoc[]
include::analyzers/lang-analyzer.asciidoc[]
include::analyzers/snowball-analyzer.asciidoc[]
include::analyzers/fingerprint-analyzer.asciidoc[]
include::analyzers/custom-analyzer.asciidoc[]

View File

@ -0,0 +1,66 @@
[[configuring-analyzers]]
=== Configuring built-in analyzers
The built-in analyzers can be used directly without any configuration. Some
of them, however, support configuration options to alter their behaviour. For
instance, the <<analysis-standard-analyzer,`standard` analyzer>> can be configured
to support a list of stop words:
[source,js]
--------------------------------
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"std_english": { <1>
"type": "standard",
"stopwords": "_english_"
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"my_text": {
"type": "text",
"analyzer": "standard", <2>
"fields": {
"english": {
"type": "text",
"analyzer": "std_english" <3>
}
}
}
}
}
}
}
GET _cluster/health?wait_for_status=yellow
POST my_index/_analyze
{
"field": "my_text", <2>
"text": "The old brown cow"
}
POST my_index/_analyze
{
"field": "my_text.english", <3>
"text": "The old brown cow"
}
--------------------------------
// CONSOLE
<1> We define the `std_english` analyzer to be based on the `standard`
analyzer, but configured to remove the pre-defined list of English stopwords.
<2> The `my_text` field uses the `standard` analyzer directly, without
any configuration. No stop words will be removed from this field.
The resulting terms are: `[ the, old, brown, cow ]`
<3> The `my_text.english` field uses the `std_english` analyzer, so
English stop words will be removed. The resulting terms are:
`[ old, brown, cow ]`

View File

@ -1,59 +1,179 @@
[[analysis-custom-analyzer]]
=== Custom Analyzer
An analyzer of type `custom` that allows you to combine a `Tokenizer` with
zero or more `Token Filters`, and zero or more `Char Filters`. The
custom analyzer accepts a logical/registered name of the tokenizer to
use, and a list of logical/registered names of token filters.
The name of the custom analyzer must not start with "_".
When the built-in analyzers do not fulfill your needs, you can create a
`custom` analyzer which uses the appropriate combination of:
The following are settings that can be set for a `custom` analyzer type:
* zero or more <<analysis-charfilters, character filters>>
* a <<analysis-tokenizers,tokenizer>>
* zero or more <<analysis-tokenfilters,token filters>>.
[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`tokenizer` |The logical / registered name of the tokenizer to use.
[float]
=== Configuration
|`filter` |An optional list of logical / registered name of token
filters.
The `custom` analyzer accepts the following parameters:
|`char_filter` |An optional list of logical / registered name of char
filters.
[horizontal]
`tokenizer`::
A built-in or customised <<analysis-tokenizers,tokenizer>>.
(Required)
`char_filter`::
An optional array of built-in or customised
<<analysis-charfilters, character filters>>.
`filter`::
An optional array of built-in or customised
<<analysis-tokenfilters, token filters>>.
`position_increment_gap`::
When indexing an array of text values, Elasticsearch inserts a fake "gap"
between the last term of one value and the first term of the next value to
ensure that a phrase query doesn't match two terms from different array
elements. Defaults to `100`. See <<position-increment-gap>> for more.
[float]
=== Example configuration
Here is an example that combines the following:
Character Filter::
* <<analysis-htmlstrip-charfilter,HTML Strip Character Filter>>
Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>
Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-asciifolding-tokenfilter,ASCII-Folding Token Filter>>
[source,js]
--------------------------------
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
}
GET _cluster/health?wait_for_status=yellow
POST my_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "Is this <b>déjà vu</b>?"
}
--------------------------------
// CONSOLE
The above example produces the following terms:
[source,text]
---------------------------
[ is, this, deja, vu ]
---------------------------
The previous example used a tokenizer, token filters, and character filters with
their default configurations, but it is possible to create configured versions
of each and to use them in a custom analyzer.
Here is a more complicated example that combines the following:
Character Filter::
* <<analysis-mapping-charfilter,Mapping Character Filter>>, configured to replace `:)` with `_happy_` and `:(` with `_sad_`
Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>, configured to split on punctuation characters
Token Filters::
* <<analysis-lowercase-tokenfilter,Lowercase Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>>, configured to use the pre-defined list of English stop words
|`position_increment_gap` |An optional number of positions to increment
between each field value of a field using this analyzer. Defaults to 100.
100 was chosen because it prevents phrase queries with reasonably large
slops (less than 100) from matching terms across field values.
|=======================================================================
Here is an example:
[source,js]
--------------------------------------------------
index :
analysis :
analyzer :
myAnalyzer2 :
type : custom
tokenizer : myTokenizer1
filter : [myTokenFilter1, myTokenFilter2]
char_filter : [my_html]
position_increment_gap: 256
tokenizer :
myTokenizer1 :
type : standard
max_token_length : 900
filter :
myTokenFilter1 :
type : stop
stopwords : [stop1, stop2, stop3, stop4]
myTokenFilter2 :
type : length
min : 0
max : 2000
char_filter :
my_html :
type : html_strip
escaped_tags : [xxx, yyy]
read_ahead : 1024
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"char_filter": [
"emoticons" <1>
],
"tokenizer": "punctuation", <1>
"filter": [
"lowercase",
"english_stop" <1>
]
}
},
"tokenizer": {
"punctuation": { <1>
"type": "pattern",
"pattern": "[ .,!?]"
}
},
"char_filter": {
"emoticons": { <1>
"type": "mapping",
"mappings": [
":) => _happy_",
":( => _sad_"
]
}
},
"filter": {
"english_stop": { <1>
"type": "stop",
"stopwords": "_english_"
}
}
}
}
}
GET _cluster/health?wait_for_status=yellow
POST my_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "I'm a :) person, and you?"
}
--------------------------------------------------
<1> The `emoticons` character filter, `punctuation` tokenizer, and
`english_stop` token filter are custom implementations which are defined
in the same index settings.
The above example produces the following terms:
[source,text]
---------------------------
[ i'm, _happy_, person, you ]
---------------------------

View File

@ -5,37 +5,115 @@ The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.
The `fingerprint` analyzer is composed of a <<analysis-standard-tokenizer>>, and four
token filters (in this order): <<analysis-lowercase-tokenfilter>>, <<analysis-stop-tokenfilter>>,
<<analysis-fingerprint-tokenfilter>> and <<analysis-asciifolding-tokenfilter>>.
Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.
Input text is lowercased, normalized to remove extended characters, sorted, deduplicated and
concatenated into a single token. If a stopword list is configured, stop words will
also be removed. For example, the sentence:
[float]
=== Definition
____
"Yes yes, Gödel said this sentence is consistent and."
____
It consists of:
will be transformed into the token: `"and consistent godel is said sentence this yes"`
Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>
Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter>>
* <<analysis-asciifolding-tokenfilter>>
[float]
=== Example output
[source,js]
---------------------------
POST _analyze
{
"analyzer": "fingerprint",
"text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------
// CONSOLE
The above sentence would produce the following single term:
[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------
[float]
=== Configuration
The `fingerprint` analyzer accepts the following parameters:
[horizontal]
`separator`::
The character to use to concatenate the terms. Defaults to a space.
`max_output_size`::
The maximum token size to emit. Defaults to `255`. Tokens larger than
this size will be discarded.
`preserve_original`::
If `true`, emits two tokens: one with ASCII-folding of terms that contain
extended characters (if any) and one with the original characters.
Defaults to `false`.
`stopwords`::
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.
`stopwords_path`::
The path to a file containing stop words.
See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
Notice how the words are all lowercased, the umlaut in "gödel" has been normalized to "godel",
punctuation has been removed, and "yes" has been de-duplicated.
[float]
=== Example configuration
The `fingerprint` analyzer has these configurable settings:
In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words, and to emit a second token in
the presence of non-ASCII characters:
[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`separator` | The character that separates the tokens after concatenation.
Defaults to a space.
|`max_output_size` | The maximum token size to emit. Defaults to `255`. See <<analysis-fingerprint-tokenfilter-max-size>>
|`preserve_original`| If true, emits both the original and folded version of
tokens that contain extended characters. Defaults to `false`
|`stopwords` | A list of stop words to use. Defaults to an empty list (`_none_`).
|`stopwords_path` | A path (either relative to `config` location, or absolute) to a stopwords
file configuration. Each stop word should be in its own "line" (separated
by a line break). The file must be UTF-8 encoded.
|=======================================================================
[source,js]
----------------------------
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_fingerprint_analyzer": {
"type": "fingerprint",
"stopwords": "_english_",
"preserve_original": true
}
}
}
}
}
GET _cluster/health?wait_for_status=yellow
POST my_index/_analyze
{
"analyzer": "my_fingerprint_analyzer",
"text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------
// CONSOLE
The above example produces the following two terms:
[source,text]
---------------------------
[ consistent godel said sentence yes, consistent gödel said sentence yes ]
---------------------------

View File

@ -1,7 +1,38 @@
[[analysis-keyword-analyzer]]
=== Keyword Analyzer
An analyzer of type `keyword` that "tokenizes" an entire stream as a
single token. This is useful for data like zip codes, ids and so on.
Note, when using mapping definitions, it might make more sense to simply
map the field as a <<keyword,`keyword`>>.
The `keyword` analyzer is a ``noop'' analyzer which returns the entire input
string as a single token.
[float]
=== Definition
It consists of:
Tokenizer::
* <<analysis-keyword-tokenizer,Keyword Tokenizer>>
[float]
=== Example output
[source,js]
---------------------------
POST _analyze
{
"analyzer": "keyword",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE
The above sentence would produce the following single term:
[source,text]
---------------------------
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
---------------------------
[float]
=== Configuration
The `keyword` analyzer is not configurable.

View File

@ -303,8 +303,8 @@ The `catalan` analyzer could be reimplemented as a `custom` analyzer as follows:
"analysis": {
"filter": {
"catalan_elision": {
"type": "elision",
"articles": [ "d", "l", "m", "n", "s", "t"]
"type": "elision",
"articles": [ "d", "l", "m", "n", "s", "t"]
},
"catalan_stop": {
"type": "stop",
@ -623,10 +623,10 @@ The `french` analyzer could be reimplemented as a `custom` analyzer as follows:
"french_elision": {
"type": "elision",
"articles_case": true,
"articles": [
"articles": [
"l", "m", "t", "qu", "n", "s",
"j", "d", "c", "jusqu", "quoiqu",
"lorsqu", "puisqu"
"j", "d", "c", "jusqu", "quoiqu",
"lorsqu", "puisqu"
]
},
"french_stop": {
@ -1000,13 +1000,13 @@ The `italian` analyzer could be reimplemented as a `custom` analyzer as follows:
"analysis": {
"filter": {
"italian_elision": {
"type": "elision",
"articles": [
"type": "elision",
"articles": [
"c", "l", "all", "dall", "dell",
"nell", "sull", "coll", "pell",
"gl", "agl", "dagl", "degl", "negl",
"sugl", "un", "m", "t", "s", "v", "d"
]
]
},
"italian_stop": {
"type": "stop",

View File

@ -1,46 +1,96 @@
[[analysis-pattern-analyzer]]
=== Pattern Analyzer
An analyzer of type `pattern` that can flexibly separate text into terms
via a regular expression. Accepts the following settings:
The `pattern` analyzer uses a regular expression to split the text into terms.
The regular expression should match the *token separators*, not the tokens
themselves. The regular expression defaults to `\W+` (or all non-word characters).
The following are settings that can be set for a `pattern` analyzer
type:
[float]
=== Definition
It consists of:
Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>
Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
[float]
=== Example output
[source,js]
---------------------------
POST _analyze
{
"analyzer": "pattern",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE
The above sentence would produce the following terms:
[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
---------------------------
[float]
=== Configuration
The `pattern` analyzer accepts the following parameters:
[horizontal]
`lowercase`:: Should terms be lowercased or not. Defaults to `true`.
`pattern`:: The regular expression pattern, defaults to `\W+`.
`flags`:: The regular expression flags.
`stopwords`:: A list of stopwords to initialize the stop filter with.
Defaults to an 'empty' stopword list. Check
<<analysis-stop-analyzer,Stop Analyzer>> for more details.
`pattern`::
*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.
A http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java regular expression], defaults to `\W+`.
`flags`::
Java regular expression http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary[flags].
Flags should be pipe-separated, eg `"CASE_INSENSITIVE|COMMENTS"`.
`lowercase`::
Should terms be lowercased or not. Defaults to `true`.
`max_token_length`::
The maximum token length. If a token is seen that exceeds this length then
it is split at `max_token_length` intervals. Defaults to `255`.
`stopwords`::
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.
`stopwords_path`::
The path to a file containing stop words.
See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
Flags should be pipe-separated, eg `"CASE_INSENSITIVE|COMMENTS"`. Check
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
Pattern API] for more details about `flags` options.
[float]
==== Pattern Analyzer Examples
=== Example configuration
In order to try out these examples, you should delete the `test` index
before running each example.
[float]
===== Whitespace tokenizer
In this example, we configure the `pattern` analyzer to split email addresses
on non-word characters or on underscores (`\W|_`), and to lower-case the result:
[source,js]
--------------------------------------------------
PUT test
----------------------------
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"whitespace": {
"type": "pattern",
"pattern": "\\s+"
"my_email_analyzer": {
"type": "pattern",
"pattern": "\\W|_", <1>
"lowercase": true
}
}
}
@ -49,47 +99,32 @@ PUT test
GET _cluster/health?wait_for_status=yellow
GET test/_analyze?analyzer=whitespace&text=foo,bar baz
# "foo,bar", "baz"
--------------------------------------------------
// CONSOLE
[float]
===== Non-word character tokenizer
[source,js]
--------------------------------------------------
PUT test
POST my_index/_analyze
{
"settings": {
"analysis": {
"analyzer": {
"nonword": {
"type": "pattern",
"pattern": "[^\\w]+" <1>
}
}
}
}
"analyzer": "my_email_analyzer",
"text": "John_Smith@foo-bar.com"
}
GET _cluster/health?wait_for_status=yellow
GET test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"
GET test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"
--------------------------------------------------
----------------------------
// CONSOLE
<1> The backslashes in the pattern need to be escaped when specifying the
pattern as a JSON string.
The above example produces the following terms:
[source,text]
---------------------------
[ john, smith, foo, bar, com ]
---------------------------
[float]
===== CamelCase tokenizer
==== CamelCase tokenizer
The following more complicated example splits CamelCase text into tokens:
[source,js]
--------------------------------------------------
PUT test?pretty=1
PUT my_index
{
"settings": {
"analysis": {
@ -105,11 +140,21 @@ PUT test?pretty=1
GET _cluster/health?wait_for_status=yellow
GET test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
# "moose","x","ftp","class","2","beta"
GET my_index/_analyze
{
"analyzer": "camel",
"text": "MooseX::FTPClass2_beta"
}
--------------------------------------------------
// CONSOLE
The above example produces the following terms:
[source,text]
---------------------------
[ moose, x, ftp, class, 2, beta ]
---------------------------
The regex above is easier to understand as:
[source,js]

View File

@ -1,6 +1,38 @@
[[analysis-simple-analyzer]]
=== Simple Analyzer
An analyzer of type `simple` that is built using a
<<analysis-lowercase-tokenizer,Lower
Case Tokenizer>>.
The `simple` analyzer breaks text into terms whenever it encounters a
character which is not a letter. All terms are lowercased.
[float]
=== Definition
It consists of:
Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
[float]
=== Example output
[source,js]
---------------------------
POST _analyze
{
"analyzer": "simple",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE
The above sentence would produce the following terms:
[source,text]
---------------------------
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
---------------------------
[float]
=== Configuration
The `simple` analyzer is not configurable.

View File

@ -1,64 +0,0 @@
[[analysis-snowball-analyzer]]
=== Snowball Analyzer
An analyzer of type `snowball` that uses the
<<analysis-standard-tokenizer,standard
tokenizer>>, with
<<analysis-standard-tokenfilter,standard
filter>>,
<<analysis-lowercase-tokenfilter,lowercase
filter>>,
<<analysis-stop-tokenfilter,stop
filter>>, and
<<analysis-snowball-tokenfilter,snowball
filter>>.
The Snowball Analyzer is a stemming analyzer from Lucene that is
originally based on the snowball project from
http://snowballstem.org[snowballstem.org].
Sample usage:
[source,js]
--------------------------------------------------
{
"index" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"type" : "snowball",
"language" : "English"
}
}
}
}
}
--------------------------------------------------
The `language` parameter can have the same values as the
<<analysis-snowball-tokenfilter,snowball
filter>> and defaults to `English`. Note that not all the language
analyzers have a default set of stopwords provided.
The `stopwords` parameter can be used to provide stopwords for the
languages that have no defaults, or to simply replace the default set
with your custom list. Check <<analysis-stop-analyzer,Stop Analyzer>>
for more details. A default set of stopwords for many of these
languages is available from, for instance,
https://github.com/apache/lucene-solr/tree/trunk/lucene/analysis/common/src/resources/org/apache/lucene/analysis/[here]
and
https://github.com/apache/lucene-solr/tree/trunk/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball[here].
A sample configuration (in YAML format) specifying Swedish with
stopwords:
[source,js]
--------------------------------------------------
index :
analysis :
analyzer :
my_analyzer:
type: snowball
language: Swedish
stopwords: "och,det,att,i,en,jag,hon,som,han,på,den,med,var,sig,för,så,till,är,men,ett,om,hade,de,av,icke,mig,du,henne,då,sin,nu,har,inte,hans,honom,skulle,hennes,där,min,man,ej,vid,kunde,något,från,ut,när,efter,upp,vi,dem,vara,vad,över,än,dig,kan,sina,här,ha,mot,alla,under,någon,allt,mycket,sedan,ju,denna,själv,detta,åt,utan,varit,hur,ingen,mitt,ni,bli,blev,oss,din,dessa,några,deras,blir,mina,samma,vilken,er,sådan,vår,blivit,dess,inom,mellan,sådant,varför,varje,vilka,ditt,vem,vilket,sitta,sådana,vart,dina,vars,vårt,våra,ert,era,vilkas"
--------------------------------------------------

View File

@ -1,26 +1,107 @@
[[analysis-standard-analyzer]]
=== Standard Analyzer
An analyzer of type `standard` is built using the
<<analysis-standard-tokenizer,Standard
Tokenizer>> with the
<<analysis-standard-tokenfilter,Standard
Token Filter>>,
<<analysis-lowercase-tokenfilter,Lower
Case Token Filter>>, and
<<analysis-stop-tokenfilter,Stop
Token Filter>>.
The `standard` analyzer is the default analyzer which is used if none is
specified. It provides grammar based tokenization (based on the Unicode Text
Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.
The following are settings that can be set for a `standard` analyzer
type:
[float]
=== Definition
[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`stopwords` |A list of stopwords to initialize the stop filter with.
Defaults to an 'empty' stopword list. Check
<<analysis-stop-analyzer,Stop Analyzer>> for more details.
|`max_token_length` |The maximum token length. If a token is seen that exceeds
this length then it is split at `max_token_length` intervals. Defaults to `255`.
|=======================================================================
It consists of:
Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>
Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
[float]
=== Example output
[source,js]
---------------------------
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE
The above sentence would produce the following terms:
[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
---------------------------
[float]
=== Configuration
The `standard` analyzer accepts the following parameters:
[horizontal]
`max_token_length`::
The maximum token length. If a token is seen that exceeds this length then
it is split at `max_token_length` intervals. Defaults to `255`.
`stopwords`::
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.
`stopwords_path`::
The path to a file containing stop words.
See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
[float]
=== Example configuration
In this example, we configure the `standard` analyzer to have a
`max_token_length` of 5 (for demonstration purposes), and to use the
pre-defined list of English stop words:
[source,js]
----------------------------
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"
}
}
}
}
}
GET _cluster/health?wait_for_status=yellow
POST my_index/_analyze
{
"analyzer": "my_english_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------
// CONSOLE
The above example produces the following terms:
[source,text]
---------------------------
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
---------------------------

View File

@ -1,22 +1,97 @@
[[analysis-stop-analyzer]]
=== Stop Analyzer
An analyzer of type `stop` that is built using a
<<analysis-lowercase-tokenizer,Lower
Case Tokenizer>>, with
<<analysis-stop-tokenfilter,Stop
Token Filter>>.
The `stop` analyzer is the same as the <<analysis-simple-analyzer,`simple` analyzer>>
but adds support for removing stop words. It defaults to using the
`_english_` stop words.
The following are settings that can be set for a `stop` analyzer type:
[float]
=== Definition
[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`stopwords` |A list of stopwords to initialize the stop filter with.
Defaults to the English stop words.
|`stopwords_path` |A path (either relative to `config` location, or
absolute) to a stopwords file configuration.
|=======================================================================
It consists of:
Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
Token filters::
* <<analysis-stop-tokenfilter,Stop Token Filter>>
[float]
=== Example output
[source,js]
---------------------------
POST _analyze
{
"analyzer": "stop",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE
The above sentence would produce the following terms:
[source,text]
---------------------------
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
---------------------------
[float]
=== Configuration
The `stop` analyzer accepts the following parameters:
[horizontal]
`stopwords`::
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_english_`.
`stopwords_path`::
The path to a file containing stop words.
See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
[float]
=== Example configuration
In this example, we configure the `stop` analyzer to use a specified list of
words as stop words:
[source,js]
----------------------------
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_stop_analyzer": {
"type": "stop",
"stopwords": ["the", "over"]
}
}
}
}
}
GET _cluster/health?wait_for_status=yellow
POST my_index/_analyze
{
"analyzer": "my_stop_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------
// CONSOLE
The above example produces the following terms:
[source,text]
---------------------------
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
---------------------------
Use `stopwords: _none_` to explicitly specify an 'empty' stopword list.
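For example, an explicitly empty list could be configured like this (a minimal
sketch; the index and analyzer names are illustrative):
[source,js]
----------------------------
PUT my_other_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_unstopped_analyzer": {
          "type": "stop",
          "stopwords": "_none_"
        }
      }
    }
  }
}
----------------------------
// CONSOLE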

View File

@ -1,6 +1,38 @@
[[analysis-whitespace-analyzer]]
=== Whitespace Analyzer
An analyzer of type `whitespace` that is built using a
<<analysis-whitespace-tokenizer,Whitespace
Tokenizer>>.
The `whitespace` analyzer breaks text into terms whenever it encounters a
whitespace character.
[float]
=== Definition
It consists of:
Tokenizer::
* <<analysis-whitespace-tokenizer,Whitespace Tokenizer>>
[float]
=== Example output
[source,js]
---------------------------
POST _analyze
{
"analyzer": "whitespace",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE
The above sentence would produce the following terms:
[source,text]
---------------------------
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
---------------------------
[float]
=== Configuration
The `whitespace` analyzer is not configurable.

View File

@ -0,0 +1,60 @@
[[analyzer-anatomy]]
== Anatomy of an analyzer
An _analyzer_ -- whether built-in or custom -- is just a package which
contains three lower-level building blocks: _character filters_,
_tokenizers_, and _token filters_.
The built-in <<analysis-analyzers,analyzers>> pre-package these building
blocks into analyzers suitable for different languages and types of text.
Elasticsearch also exposes the individual building blocks so that they can be
combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.
[float]
=== Character filters
A _character filter_ receives the original text as a stream of characters and
can transform the stream by adding, removing, or changing characters. For
instance, a character filter could be used to convert Arabic numerals
(٠‎١٢٣٤٥٦٧٨‎٩‎) into their Latin equivalents (0123456789), or to strip HTML
elements like `<b>` from the stream.
An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
which are applied in order.
[float]
=== Tokenizer
A _tokenizer_ receives a stream of characters, breaks it up into individual
_tokens_ (usually individual words), and outputs a stream of _tokens_. For
instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.
The tokenizer is also responsible for recording the order or _position_ of
each term and the start and end _character offsets_ of the original word which
the term represents.
An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.
[float]
=== Token filters
A _token filter_ receives the token stream and may add, remove, or change
tokens. For example, a <<analysis-lowercase-tokenfilter,`lowercase`>> token
filter converts all tokens to lowercase, a
<<analysis-stop-tokenfilter,`stop`>> token filter removes common words
(_stop words_) like `the` from the token stream, and a
<<analysis-synonym-tokenfilter,`synonym`>> token filter introduces synonyms
into the token stream.
Token filters are not allowed to change the position or character offsets of
each token.
An analyzer may have *zero or more* <<analysis-tokenfilters,token filters>>,
which are applied in order.
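As a quick sketch of how the three building blocks fit together, they can be
specified inline in a single <<indices-analyze,`analyze` API>> request
(assuming, as with `tokenizer` and `filter`, that the API accepts an inline
`char_filter` list):
[source,js]
-------------------------
POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "whitespace",
  "filter": [ "lowercase" ],
  "text": "<b>Quick</b> brown fox!"
}
-------------------------
// CONSOLE
Here the `html_strip` character filter removes the `<b>` tags, the
`whitespace` tokenizer splits on whitespace, and the `lowercase` token filter
lowercases each token, which should produce the terms `[ quick, brown, fox! ]`.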

View File

@ -0,0 +1,91 @@
== Testing analyzers
The <<indices-analyze,`analyze` API>> is an invaluable tool for viewing the
terms produced by an analyzer. A built-in analyzer (or combination of built-in
tokenizer, token filters, and character filters) can be specified inline in
the request:
[source,js]
-------------------------------------
POST _analyze
{
"analyzer": "whitespace",
"text": "The quick brown fox."
}
POST _analyze
{
"tokenizer": "standard",
"filter": [ "lowercase", "asciifolding" ],
"text": "Is this déja vu?"
}
-------------------------------------
// CONSOLE
.Positions and character offsets
*********************************************************
As can be seen from the output of the `analyze` API, analyzers not only
convert words into terms, they also record the order or relative _positions_
of each term (used for phrase queries or word proximity queries), and the
start and end _character offsets_ of each term in the original text (used for
highlighting search snippets).
*********************************************************
Alternatively, a <<analysis-custom-analyzer,`custom` analyzer>> can be
referred to when running the `analyze` API on a specific index:
[source,js]
-------------------------------------
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"std_folded": { <1>
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"my_text": {
"type": "text",
"analyzer": "std_folded" <2>
}
}
}
}
}
GET my_index/_analyze <3>
{
"analyzer": "std_folded", <4>
"text": "Is this déjà vu?"
}
GET my_index/_analyze <3>
{
"field": "my_text", <5>
"text": "Is this déjà vu?"
}
-------------------------------------
// CONSOLE
<1> Define a `custom` analyzer called `std_folded`.
<2> The field `my_text` uses the `std_folded` analyzer.
<3> To refer to this analyzer, the `analyze` API must specify the index name.
<4> Refer to the analyzer by name.
<5> Refer to the analyzer used by field `my_text`.

View File

@ -1,10 +1,10 @@
[[position-increment-gap]]
=== `position_increment_gap`
<<mapping-index,Analyzed>> string fields take term <<index-options,positions>>
<<mapping-index,Analyzed>> text fields take term <<index-options,positions>>
into account, in order to be able to support
<<query-dsl-match-query-phrase,proximity or phrase queries>>.
When indexing string fields with multiple values a "fake" gap is added between
When indexing text fields with multiple values a "fake" gap is added between
the values to prevent most phrase queries from matching across the values. The
size of this gap is configured using `position_increment_gap` and defaults to
`100`.
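The sketch below (the index, type, and field names are illustrative) indexes a
field with two values and then runs a phrase query that would only match if
the last term of the first value and the first term of the second value were
adjacent:
[source,js]
--------------------------------------------------
PUT my_index/groups/1?refresh=true
{
  "names": [ "John Abraham", "Lincoln Smith" ]
}
GET my_index/groups/_search
{
  "query": {
    "match_phrase": {
      "names": "Abraham Lincoln" <1>
    }
  }
}
--------------------------------------------------
// CONSOLE
<1> With the default `position_increment_gap` of `100`, `Abraham` and
`Lincoln` are separated by the gap, so this phrase query finds no match.
Setting `position_increment_gap` to `0` in the mapping (or allowing a large
enough `slop`) would let it match.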