OpenSearch/docs/reference/analysis/analyzers/standard-analyzer.asciidoc
James Rodewig 5a2c6f0d4f
[DOCS] http -> https, remove outdated plugin docs (#60380) (#60545)
Plugin discovery documentation contained information about installing
Elasticsearch 2.0 and installing an oracle JDK, both of which is no
longer valid.

While noticing that the instructions used cleartext HTTP to install
packages, this commit replaces HTTPs links instead of HTTP where possible.

In addition a few community links have been removed, as they do not seem
to exist anymore.

Co-authored-by: Alexander Reelsen <alexander@reelsen.net>
2020-07-31 16:16:31 -04:00

303 lines
6.3 KiB
Plaintext

[[analysis-standard-analyzer]]
=== Standard analyzer
++++
<titleabbrev>Standard</titleabbrev>
++++
The `standard` analyzer is the default analyzer which is used if none is
specified. It provides grammar based tokenization (based on the Unicode Text
Segmentation algorithm, as specified in
https://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.
[discrete]
=== Example output
[source,console]
---------------------------
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
/////////////////////
[source,console-result]
----------------------------
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "2",
"start_offset": 4,
"end_offset": 5,
"type": "<NUM>",
"position": 1
},
{
"token": "quick",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "brown",
"start_offset": 12,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "foxes",
"start_offset": 18,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "jumped",
"start_offset": 24,
"end_offset": 30,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "over",
"start_offset": 31,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "the",
"start_offset": 36,
"end_offset": 39,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "lazy",
"start_offset": 40,
"end_offset": 44,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "dog's",
"start_offset": 45,
"end_offset": 50,
"type": "<ALPHANUM>",
"position": 9
},
{
"token": "bone",
"start_offset": 51,
"end_offset": 55,
"type": "<ALPHANUM>",
"position": 10
}
]
}
----------------------------
/////////////////////
The above sentence would produce the following terms:
[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
---------------------------
[discrete]
=== Configuration
The `standard` analyzer accepts the following parameters:
[horizontal]
`max_token_length`::
The maximum token length. If a token is seen that exceeds this length then
it is split at `max_token_length` intervals. Defaults to `255`.
`stopwords`::
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.
`stopwords_path`::
The path to a file containing stop words.
See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
[discrete]
=== Example configuration
In this example, we configure the `standard` analyzer to have a
`max_token_length` of 5 (for demonstration purposes), and to use the
pre-defined list of English stop words:
[source,console]
----------------------------
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_english_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------
/////////////////////
[source,console-result]
----------------------------
{
"tokens": [
{
"token": "2",
"start_offset": 4,
"end_offset": 5,
"type": "<NUM>",
"position": 1
},
{
"token": "quick",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "brown",
"start_offset": 12,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "foxes",
"start_offset": 18,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "jumpe",
"start_offset": 24,
"end_offset": 29,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "d",
"start_offset": 29,
"end_offset": 30,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "over",
"start_offset": 31,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "lazy",
"start_offset": 40,
"end_offset": 44,
"type": "<ALPHANUM>",
"position": 9
},
{
"token": "dog's",
"start_offset": 45,
"end_offset": 50,
"type": "<ALPHANUM>",
"position": 10
},
{
"token": "bone",
"start_offset": 51,
"end_offset": 55,
"type": "<ALPHANUM>",
"position": 11
}
]
}
----------------------------
/////////////////////
The above example produces the following terms:
[source,text]
---------------------------
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
---------------------------
[discrete]
=== Definition
The `standard` analyzer consists of:
Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>
Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
If you need to customize the `standard` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`standard` analyzer and you can use it as a starting point:
[source,console]
----------------------------------------------------
PUT /standard_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_standard": {
"tokenizer": "standard",
"filter": [
"lowercase" <1>
]
}
}
}
}
}
----------------------------------------------------
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/]
<1> You'd add any token filters after `lowercase`.