108 lines
2.5 KiB
Plaintext
108 lines
2.5 KiB
Plaintext
[[analysis-standard-analyzer]]
|
|
=== Standard Analyzer
|
|
|
|
The `standard` analyzer is the default analyzer which is used if none is
|
|
specified. It provides grammar based tokenization (based on the Unicode Text
|
|
Segmentation algorithm, as specified in
|
|
http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
|
|
for most languages.
|
|
|
|
[float]
|
|
=== Definition
|
|
|
|
It consists of:
|
|
|
|
Tokenizer::
|
|
* <<analysis-standard-tokenizer,Standard Tokenizer>>
|
|
|
|
Token Filters::
|
|
* <<analysis-standard-tokenfilter,Standard Token Filter>>
|
|
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
|
|
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
|
|
|
|
[float]
|
|
=== Example output
|
|
|
|
[source,js]
|
|
---------------------------
|
|
POST _analyze
|
|
{
|
|
"analyzer": "standard",
|
|
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
|
|
}
|
|
---------------------------
|
|
// CONSOLE
|
|
|
|
The above sentence would produce the following terms:
|
|
|
|
[source,text]
|
|
---------------------------
|
|
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
|
|
---------------------------
|
|
|
|
[float]
|
|
=== Configuration
|
|
|
|
The `standard` analyzer accepts the following parameters:
|
|
|
|
[horizontal]
|
|
`max_token_length`::
|
|
|
|
The maximum token length. If a token is seen that exceeds this length then
|
|
it is split at `max_token_length` intervals. Defaults to `255`.
|
|
|
|
`stopwords`::
|
|
|
|
A pre-defined stop words list like `_english_` or an array containing a
|
|
list of stop words. Defaults to `_none_`.
|
|
|
|
`stopwords_path`::
|
|
|
|
The path to a file containing stop words.
|
|
|
|
See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
|
|
about stop word configuration.
|
|
|
|
|
|
[float]
|
|
=== Example configuration
|
|
|
|
In this example, we configure the `standard` analyzer to have a
|
|
`max_token_length` of 5 (for demonstration purposes), and to use the
|
|
pre-defined list of English stop words:
|
|
|
|
[source,js]
|
|
----------------------------
|
|
PUT my_index
|
|
{
|
|
"settings": {
|
|
"analysis": {
|
|
"analyzer": {
|
|
"my_english_analyzer": {
|
|
"type": "standard",
|
|
"max_token_length": 5,
|
|
"stopwords": "_english_"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
GET _cluster/health?wait_for_status=yellow
|
|
|
|
POST my_index/_analyze
|
|
{
|
|
"analyzer": "my_english_analyzer",
|
|
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
|
|
}
|
|
----------------------------
|
|
// CONSOLE
|
|
|
|
The above example produces the following terms:
|
|
|
|
[source,text]
|
|
---------------------------
|
|
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
|
|
---------------------------
|
|
|