Docs: Document how to rebuild analyzers (#30498)
Adds documentation for how to rebuild all of the built-in analyzers, and adds tests for that documentation using the mechanism added in #29535. Closes #29499
parent 7f47ff9fcd
commit 9881bfaea5
@@ -9,20 +9,6 @@ Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
1. <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
2. <<analysis-asciifolding-tokenfilter>>
3. <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
4. <<analysis-fingerprint-tokenfilter>>

[float]
=== Example output

@@ -149,3 +135,46 @@ The above example produces the following term:
---------------------------
[ consistent godel said sentence yes ]
---------------------------

[float]
=== Definition

The `fingerprint` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter>>

If you need to customize the `fingerprint` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`fingerprint` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
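
As a quick check, you could run some text through the rebuilt analyzer with
the `_analyze` API (this assumes the `fingerprint_example` index created
above); the response should contain a single lowercased, deduplicated,
sorted token, just as the default `fingerprint` analyzer would produce:

[source,js]
----------------------------------------------------
POST /fingerprint_example/_analyze
{
  "analyzer": "rebuilt_fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------------------------------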
@@ -4,14 +4,6 @@
The `keyword` analyzer is a ``noop'' analyzer which returns the entire input
string as a single token.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-keyword-tokenizer,Keyword Tokenizer>>

[float]
=== Example output

@@ -57,3 +49,40 @@ The above sentence would produce the following single term:
=== Configuration

The `keyword` analyzer is not configurable.

[float]
=== Definition

The `keyword` analyzer consists of:

Tokenizer::
* <<analysis-keyword-tokenizer,Keyword Tokenizer>>

If you need to customize the `keyword` analyzer then you need to
recreate it as a `custom` analyzer and modify it, usually by adding
token filters. You should usually prefer the
<<keyword,Keyword type>> when you want strings that are not split
into tokens, but just in case you need it, this would recreate the
built-in `keyword` analyzer and you can use it as a starting point
for further customization:

[source,js]
----------------------------------------------------
PUT /keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_keyword": {
          "tokenizer": "keyword",
          "filter": [ <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: keyword_example, first: keyword, second: rebuilt_keyword}\nendyaml\n/]
<1> You'd add any token filters here.
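
For example, if you also wanted the single token lowercased, one possible
variation would be to add the `lowercase` filter (the
`keyword_lowercase_example` index and analyzer names are only for this
sketch):

[source,js]
----------------------------------------------------
PUT /keyword_lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
----------------------------------------------------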
@@ -19,19 +19,6 @@ Read more about http://www.regular-expressions.info/catastrophic.html[pathologic

========================================

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

[float]
=== Example output

@@ -378,3 +365,51 @@ The regex above is easier to understand as:
    [\p{L}&&[^\p{Lu}]]  # then lower case
  )
--------------------------------------------------

[float]
=== Definition

The `pattern` analyzer consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `pattern` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`pattern` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+" <1>
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/]
<1> The default pattern is `\W+`, which splits on non-word characters,
and this is where you'd change it.
<2> You'd add other token filters after `lowercase`.
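
As an illustration of <1>, the same structure could be adapted to split on
commas instead; the `pattern_on_commas_example` index and the tokenizer and
analyzer names below are only for this sketch:

[source,js]
----------------------------------------------------
PUT /pattern_on_commas_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_commas": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "comma_pattern": {
          "tokenizer": "split_on_commas",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
----------------------------------------------------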
@@ -4,14 +4,6 @@
The `simple` analyzer breaks text into terms whenever it encounters a
character which is not a letter. All terms are lower cased.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

[float]
=== Example output

@@ -127,3 +119,37 @@ The above sentence would produce the following terms:
=== Configuration

The `simple` analyzer is not configurable.

[float]
=== Definition

The `simple` analyzer consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

If you need to customize the `simple` analyzer then you need to recreate
it as a `custom` analyzer and modify it, usually by adding token filters.
This would recreate the built-in `simple` analyzer and you can use it as
a starting point for further customization:

[source,js]
----------------------------------------------------
PUT /simple_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [ <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: simple_example, first: simple, second: rebuilt_simple}\nendyaml\n/]
<1> You'd add any token filters here.
@@ -7,19 +7,6 @@ Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

[float]
=== Example output

@@ -276,3 +263,44 @@ The above example produces the following terms:
---------------------------
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
---------------------------

[float]
=== Definition

The `standard` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `standard` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`standard` analyzer and you can use it as a starting point:

[source,js]
----------------------------------------------------
PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase" <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/]
<1> You'd add any token filters after `lowercase`.
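
For example, to get the same effect as the `stopwords` configuration
parameter in this rebuilt form, you could define a `stop` filter and append
it after `lowercase`; the `standard_with_stop_example` index and the filter
and analyzer names below are only for illustration:

[source,js]
----------------------------------------------------
PUT /standard_with_stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "standard_with_stop": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "english_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------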
@@ -5,17 +5,6 @@ The `stop` analyzer is the same as the <<analysis-simple-analyzer,`simple` analy
but adds support for removing stop words. It defaults to using the
`_english_` stop words.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

Token filters::
* <<analysis-stop-tokenfilter,Stop Token Filter>>

[float]
=== Example output

@@ -239,3 +228,50 @@ The above example produces the following terms:
---------------------------
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
---------------------------

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>

Token filters::
* <<analysis-stop-tokenfilter,Stop Token Filter>>

If you need to customize the `stop` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`stop` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_" <1>
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "english_stop" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: stop_example, first: stop, second: rebuilt_stop}\nendyaml\n/]
<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
<2> You'd add any token filters after `english_stop`.
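
For instance, to use your own stopword list instead of `_english_` (as <1>
mentions), you could pass an explicit array to the `stop` filter; the
`stop_custom_words_example` index name, filter name, and word list below are
only for this sketch:

[source,js]
----------------------------------------------------
PUT /stop_custom_words_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": [ "and", "the", "over" ]
        }
      },
      "analyzer": {
        "custom_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "my_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------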
@@ -4,14 +4,6 @@
The `whitespace` analyzer breaks text into terms whenever it encounters a
whitespace character.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-whitespace-tokenizer,Whitespace Tokenizer>>

[float]
=== Example output

@@ -120,3 +112,37 @@ The above sentence would produce the following terms:
=== Configuration

The `whitespace` analyzer is not configurable.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-whitespace-tokenizer,Whitespace Tokenizer>>

If you need to customize the `whitespace` analyzer then you need to
recreate it as a `custom` analyzer and modify it, usually by adding
token filters. This would recreate the built-in `whitespace` analyzer
and you can use it as a starting point for further customization:

[source,js]
----------------------------------------------------
PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_whitespace": {
          "tokenizer": "whitespace",
          "filter": [ <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: whitespace_example, first: whitespace, second: rebuilt_whitespace}\nendyaml\n/]
<1> You'd add any token filters here.
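
A quick way to confirm that the rebuilt analyzer does nothing but split on
whitespace is to run some mixed-case, punctuated text through the `_analyze`
API (this assumes the `whitespace_example` index created above); case and
punctuation should come through unchanged:

[source,js]
----------------------------------------------------
POST /whitespace_example/_analyze
{
  "analyzer": "rebuilt_whitespace",
  "text": "The QUICK brown-foxes jumped!"
}
----------------------------------------------------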