OpenSearch/docs/reference/analysis/analyzers/standard-analyzer.asciidoc

108 lines
2.5 KiB
Plaintext

[[analysis-standard-analyzer]]
=== Standard Analyzer
The `standard` analyzer is the default analyzer which is used if none is
specified. It provides grammar based tokenization (based on the Unicode Text
Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.
[float]
=== Definition
It consists of:
Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>
Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
[float]
=== Example output
[source,js]
---------------------------
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE
The above sentence would produce the following terms:
[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
---------------------------
[float]
=== Configuration
The `standard` analyzer accepts the following parameters:
[horizontal]
`max_token_length`::
The maximum token length. If a token is seen that exceeds this length then
it is split at `max_token_length` intervals. Defaults to `255`.
`stopwords`::
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.
`stopwords_path`::
The path to a file containing stop words.
See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
[float]
=== Example configuration
In this example, we configure the `standard` analyzer to have a
`max_token_length` of 5 (for demonstration purposes), and to use the
pre-defined list of English stop words:
[source,js]
----------------------------
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"
}
}
}
}
}
GET _cluster/health?wait_for_status=yellow
POST my_index/_analyze
{
"analyzer": "my_english_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------
// CONSOLE
The above example produces the following terms:
[source,text]
---------------------------
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
---------------------------