diff --git a/docs/reference/analysis/analyzers/fingerprint-analyzer.asciidoc b/docs/reference/analysis/analyzers/fingerprint-analyzer.asciidoc index 53c7d913ad2..cc873a4fe89 100644 --- a/docs/reference/analysis/analyzers/fingerprint-analyzer.asciidoc +++ b/docs/reference/analysis/analyzers/fingerprint-analyzer.asciidoc @@ -9,20 +9,6 @@ Input text is lowercased, normalized to remove extended characters, sorted, deduplicated and concatenated into a single token. If a stopword list is configured, stop words will also be removed. -[float] -=== Definition - -It consists of: - -Tokenizer:: -* <> - -Token Filters (in order):: -1. <> -2. <> -3. <> (disabled by default) -4. <> - [float] === Example output @@ -149,3 +135,46 @@ The above example produces the following term: --------------------------- [ consistent godel said sentence yes ] --------------------------- + +[float] +=== Definition + +The `fingerprint` analyzer consists of: + +Tokenizer:: +* <> + +Token Filters (in order):: +* <> +* <> +* <> (disabled by default) +* <> + +If you need to customize the `fingerprint` analyzer beyond the configuration +parameters then you need to recreate it as a `custom` analyzer and modify +it, usually by adding token filters. 
This would recreate the built-in +`fingerprint` analyzer and you can use it as a starting point for further +customization: + +[source,js] +---------------------------------------------------- +PUT /fingerprint_example +{ + "settings": { + "analysis": { + "analyzer": { + "rebuilt_fingerprint": { + "tokenizer": "standard", + "filter": [ + "lowercase", + "asciifolding", + "fingerprint" + ] + } + } + } + } +} +---------------------------------------------------- +// CONSOLE +// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/] diff --git a/docs/reference/analysis/analyzers/keyword-analyzer.asciidoc b/docs/reference/analysis/analyzers/keyword-analyzer.asciidoc index cc94f3b757e..954b514ced6 100644 --- a/docs/reference/analysis/analyzers/keyword-analyzer.asciidoc +++ b/docs/reference/analysis/analyzers/keyword-analyzer.asciidoc @@ -4,14 +4,6 @@ The `keyword` analyzer is a ``noop'' analyzer which returns the entire input string as a single token. -[float] -=== Definition - -It consists of: - -Tokenizer:: -* <> - [float] === Example output @@ -57,3 +49,40 @@ The above sentence would produce the following single term: === Configuration The `keyword` analyzer is not configurable. + +[float] +=== Definition + +The `keyword` analyzer consists of: + +Tokenizer:: +* <> + +If you need to customize the `keyword` analyzer then you need to +recreate it as a `custom` analyzer and modify it, usually by adding +token filters. 
Usually, you should prefer the +<> when you want strings that are not split +into tokens, but just in case you need it, this would recreate the +built-in `keyword` analyzer and you can use it as a starting point +for further customization: + +[source,js] +---------------------------------------------------- +PUT /keyword_example +{ + "settings": { + "analysis": { + "analyzer": { + "rebuilt_keyword": { + "tokenizer": "keyword", + "filter": [ <1> + ] + } + } + } + } +} +---------------------------------------------------- +// CONSOLE +// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: keyword_example, first: keyword, second: rebuilt_keyword}\nendyaml\n/] +<1> You'd add any token filters here. diff --git a/docs/reference/analysis/analyzers/pattern-analyzer.asciidoc b/docs/reference/analysis/analyzers/pattern-analyzer.asciidoc index 64ab3999ef9..027f37280a6 100644 --- a/docs/reference/analysis/analyzers/pattern-analyzer.asciidoc +++ b/docs/reference/analysis/analyzers/pattern-analyzer.asciidoc @@ -19,19 +19,6 @@ Read more about http://www.regular-expressions.info/catastrophic.html[pathologic ======================================== - -[float] -=== Definition - -It consists of: - -Tokenizer:: -* <> - -Token Filters:: -* <> -* <> (disabled by default) - [float] === Example output @@ -378,3 +365,51 @@ The regex above is easier to understand as: [\p{L}&&[^\p{Lu}]] # then lower case ) -------------------------------------------------- + +[float] +=== Definition + +The `pattern` analyzer consists of: + +Tokenizer:: +* <> + +Token Filters:: +* <> +* <> (disabled by default) + +If you need to customize the `pattern` analyzer beyond the configuration +parameters then you need to recreate it as a `custom` analyzer and modify +it, usually by adding token filters. 
This would recreate the built-in +`pattern` analyzer and you can use it as a starting point for further +customization: + +[source,js] +---------------------------------------------------- +PUT /pattern_example +{ + "settings": { + "analysis": { + "tokenizer": { + "split_on_non_word": { + "type": "pattern", + "pattern": "\\W+" <1> + } + }, + "analyzer": { + "rebuilt_pattern": { + "tokenizer": "split_on_non_word", + "filter": [ + "lowercase" <2> + ] + } + } + } + } +} +---------------------------------------------------- +// CONSOLE +// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/] +<1> The default pattern is `\W+` which splits on non-word characters +and this is where you'd change it. +<2> You'd add other token filters after `lowercase`. diff --git a/docs/reference/analysis/analyzers/simple-analyzer.asciidoc b/docs/reference/analysis/analyzers/simple-analyzer.asciidoc index a57c30d8dd6..d82655d9bd8 100644 --- a/docs/reference/analysis/analyzers/simple-analyzer.asciidoc +++ b/docs/reference/analysis/analyzers/simple-analyzer.asciidoc @@ -4,14 +4,6 @@ The `simple` analyzer breaks text into terms whenever it encounters a character which is not a letter. All terms are lower cased. -[float] -=== Definition - -It consists of: - -Tokenizer:: -* <> - [float] === Example output @@ -127,3 +119,37 @@ The above sentence would produce the following terms: === Configuration The `simple` analyzer is not configurable. + +[float] +=== Definition + +The `simple` analyzer consists of: + +Tokenizer:: +* <> + +If you need to customize the `simple` analyzer then you need to recreate +it as a `custom` analyzer and modify it, usually by adding token filters. 
+This would recreate the built-in `simple` analyzer and you can use it as +a starting point for further customization: + +[source,js] +---------------------------------------------------- +PUT /simple_example +{ + "settings": { + "analysis": { + "analyzer": { + "rebuilt_simple": { + "tokenizer": "lowercase", + "filter": [ <1> + ] + } + } + } + } +} +---------------------------------------------------- +// CONSOLE +// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: simple_example, first: simple, second: rebuilt_simple}\nendyaml\n/] +<1> You'd add any token filters here. diff --git a/docs/reference/analysis/analyzers/standard-analyzer.asciidoc b/docs/reference/analysis/analyzers/standard-analyzer.asciidoc index eacbb1c3cad..20aa072066b 100644 --- a/docs/reference/analysis/analyzers/standard-analyzer.asciidoc +++ b/docs/reference/analysis/analyzers/standard-analyzer.asciidoc @@ -7,19 +7,6 @@ Segmentation algorithm, as specified in http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well for most languages. -[float] -=== Definition - -It consists of: - -Tokenizer:: -* <> - -Token Filters:: -* <> -* <> -* <> (disabled by default) - [float] === Example output @@ -276,3 +263,44 @@ The above example produces the following terms: --------------------------- [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ] --------------------------- + +[float] +=== Definition + +The `standard` analyzer consists of: + +Tokenizer:: +* <> + +Token Filters:: +* <> +* <> +* <> (disabled by default) + +If you need to customize the `standard` analyzer beyond the configuration +parameters then you need to recreate it as a `custom` analyzer and modify +it, usually by adding token filters. 
This would recreate the built-in +`standard` analyzer and you can use it as a starting point: + +[source,js] +---------------------------------------------------- +PUT /standard_example +{ + "settings": { + "analysis": { + "analyzer": { + "rebuilt_standard": { + "tokenizer": "standard", + "filter": [ + "standard", + "lowercase" <1> + ] + } + } + } + } +} +---------------------------------------------------- +// CONSOLE +// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/] +<1> You'd add any token filters after `lowercase`. diff --git a/docs/reference/analysis/analyzers/stop-analyzer.asciidoc b/docs/reference/analysis/analyzers/stop-analyzer.asciidoc index eacc7e106e7..1b84797d947 100644 --- a/docs/reference/analysis/analyzers/stop-analyzer.asciidoc +++ b/docs/reference/analysis/analyzers/stop-analyzer.asciidoc @@ -5,17 +5,6 @@ The `stop` analyzer is the same as the <> - -Token filters:: -* <> - [float] === Example output @@ -239,3 +228,50 @@ The above example produces the following terms: --------------------------- [ quick, brown, foxes, jumped, lazy, dog, s, bone ] --------------------------- + +[float] +=== Definition + +It consists of: + +Tokenizer:: +* <> + +Token filters:: +* <> + +If you need to customize the `stop` analyzer beyond the configuration +parameters then you need to recreate it as a `custom` analyzer and modify +it, usually by adding token filters. 
This would recreate the built-in +`stop` analyzer and you can use it as a starting point for further +customization: + +[source,js] +---------------------------------------------------- +PUT /stop_example +{ + "settings": { + "analysis": { + "filter": { + "english_stop": { + "type": "stop", + "stopwords": "_english_" <1> + } + }, + "analyzer": { + "rebuilt_stop": { + "tokenizer": "lowercase", + "filter": [ + "english_stop" <2> + ] + } + } + } + } +} +---------------------------------------------------- +// CONSOLE +// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: stop_example, first: stop, second: rebuilt_stop}\nendyaml\n/] +<1> The default stopwords can be overridden with the `stopwords` + or `stopwords_path` parameters. +<2> You'd add any token filters after `english_stop`. diff --git a/docs/reference/analysis/analyzers/whitespace-analyzer.asciidoc b/docs/reference/analysis/analyzers/whitespace-analyzer.asciidoc index f95e5c6e4ab..31ba8d9ce8f 100644 --- a/docs/reference/analysis/analyzers/whitespace-analyzer.asciidoc +++ b/docs/reference/analysis/analyzers/whitespace-analyzer.asciidoc @@ -4,14 +4,6 @@ The `whitespace` analyzer breaks text into terms whenever it encounters a whitespace character. -[float] -=== Definition - -It consists of: - -Tokenizer:: -* <> - [float] === Example output @@ -120,3 +112,37 @@ The above sentence would produce the following terms: === Configuration The `whitespace` analyzer is not configurable. + +[float] +=== Definition + +It consists of: + +Tokenizer:: +* <> + +If you need to customize the `whitespace` analyzer then you need to +recreate it as a `custom` analyzer and modify it, usually by adding +token filters. 
This would recreate the built-in `whitespace` analyzer +and you can use it as a starting point for further customization: + +[source,js] +---------------------------------------------------- +PUT /whitespace_example +{ + "settings": { + "analysis": { + "analyzer": { + "rebuilt_whitespace": { + "tokenizer": "whitespace", + "filter": [ <1> + ] + } + } + } + } +} +---------------------------------------------------- +// CONSOLE +// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: whitespace_example, first: whitespace, second: rebuilt_whitespace}\nendyaml\n/] +<1> You'd add any token filters here.