[[analysis-standard-analyzer]] === Standard analyzer ++++ Standard ++++ The `standard` analyzer is the default analyzer which is used if none is specified. It provides grammar based tokenization (based on the Unicode Text Segmentation algorithm, as specified in http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well for most languages. [discrete] === Example output [source,console] --------------------------- POST _analyze { "analyzer": "standard", "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." } --------------------------- ///////////////////// [source,console-result] ---------------------------- { "tokens": [ { "token": "the", "start_offset": 0, "end_offset": 3, "type": "", "position": 0 }, { "token": "2", "start_offset": 4, "end_offset": 5, "type": "", "position": 1 }, { "token": "quick", "start_offset": 6, "end_offset": 11, "type": "", "position": 2 }, { "token": "brown", "start_offset": 12, "end_offset": 17, "type": "", "position": 3 }, { "token": "foxes", "start_offset": 18, "end_offset": 23, "type": "", "position": 4 }, { "token": "jumped", "start_offset": 24, "end_offset": 30, "type": "", "position": 5 }, { "token": "over", "start_offset": 31, "end_offset": 35, "type": "", "position": 6 }, { "token": "the", "start_offset": 36, "end_offset": 39, "type": "", "position": 7 }, { "token": "lazy", "start_offset": 40, "end_offset": 44, "type": "", "position": 8 }, { "token": "dog's", "start_offset": 45, "end_offset": 50, "type": "", "position": 9 }, { "token": "bone", "start_offset": 51, "end_offset": 55, "type": "", "position": 10 } ] } ---------------------------- ///////////////////// The above sentence would produce the following terms: [source,text] --------------------------- [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ] --------------------------- [discrete] === Configuration The `standard` analyzer accepts the following parameters: [horizontal] `max_token_length`:: The maximum token length. If a token is seen that exceeds this length then it is split at `max_token_length` intervals. Defaults to `255`. `stopwords`:: A pre-defined stop words list like `_english_` or an array containing a list of stop words. Defaults to `_none_`. `stopwords_path`:: The path to a file containing stop words. See the <> for more information about stop word configuration. [discrete] === Example configuration In this example, we configure the `standard` analyzer to have a `max_token_length` of 5 (for demonstration purposes), and to use the pre-defined list of English stop words: [source,console] ---------------------------- PUT my-index-000001 { "settings": { "analysis": { "analyzer": { "my_english_analyzer": { "type": "standard", "max_token_length": 5, "stopwords": "_english_" } } } } } POST my-index-000001/_analyze { "analyzer": "my_english_analyzer", "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." } ---------------------------- ///////////////////// [source,console-result] ---------------------------- { "tokens": [ { "token": "2", "start_offset": 4, "end_offset": 5, "type": "", "position": 1 }, { "token": "quick", "start_offset": 6, "end_offset": 11, "type": "", "position": 2 }, { "token": "brown", "start_offset": 12, "end_offset": 17, "type": "", "position": 3 }, { "token": "foxes", "start_offset": 18, "end_offset": 23, "type": "", "position": 4 }, { "token": "jumpe", "start_offset": 24, "end_offset": 29, "type": "", "position": 5 }, { "token": "d", "start_offset": 29, "end_offset": 30, "type": "", "position": 6 }, { "token": "over", "start_offset": 31, "end_offset": 35, "type": "", "position": 7 }, { "token": "lazy", "start_offset": 40, "end_offset": 44, "type": "", "position": 9 }, { "token": "dog's", "start_offset": 45, "end_offset": 50, "type": "", "position": 10 }, { "token": "bone", "start_offset": 51, "end_offset": 55, "type": "", "position": 11 } ] } ---------------------------- ///////////////////// The above example produces the following terms: [source,text] --------------------------- [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ] --------------------------- [discrete] === Definition The `standard` analyzer consists of: Tokenizer:: * <> Token Filters:: * <> * <> (disabled by default) If you need to customize the `standard` analyzer beyond the configuration parameters then you need to recreate it as a `custom` analyzer and modify it, usually by adding token filters. This would recreate the built-in `standard` analyzer and you can use it as a starting point: [source,console] ---------------------------------------------------- PUT /standard_example { "settings": { "analysis": { "analyzer": { "rebuilt_standard": { "tokenizer": "standard", "filter": [ "lowercase" <1> ] } } } } } ---------------------------------------------------- // TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/] <1> You'd add any token filters after `lowercase`.