Docs: Document how to rebuild analyzers (#30498)

Adds documentation for how to rebuild all the built-in analyzers and
tests for that documentation using the mechanism added in #29535.

Closes #29499
Nik Everett 2018-05-14 18:40:54 -04:00 committed by GitHub
parent 7f47ff9fcd
commit 9881bfaea5
7 changed files with 284 additions and 75 deletions

View File

@@ -9,20 +9,6 @@ Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.
[float]
=== Definition
It consists of:
Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>
Token Filters (in order)::
1. <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
2. <<analysis-asciifolding-tokenfilter>>
3. <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
4. <<analysis-fingerprint-tokenfilter>>
[float]
=== Example output
@@ -149,3 +135,46 @@ The above example produces the following term:
---------------------------
[ consistent godel said sentence yes ]
---------------------------
[float]
=== Definition
The `fingerprint` analyzer consists of:
Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>
Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter>>
If you need to customize the `fingerprint` analyzer beyond the configuration
parameters, you need to recreate it as a `custom` analyzer and modify it,
usually by adding token filters. The following recreates the built-in
`fingerprint` analyzer; you can use it as a starting point for further
customization:
[source,js]
----------------------------------------------------
PUT /fingerprint_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_fingerprint": {
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"fingerprint"
]
}
}
}
}
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
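Once the index exists, you could sanity-check the rebuilt analyzer with the
`_analyze` API. A minimal sketch; the sample text is only illustrative:
[source,js]
----------------------------------------------------
POST /fingerprint_example/_analyze
{
  "analyzer": "rebuilt_fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------------------------------
The response contains the single fingerprint token for the text, which should
be identical to what the built-in `fingerprint` analyzer returns for the same
input.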

View File

@@ -4,14 +4,6 @@
The `keyword` analyzer is a ``noop'' analyzer which returns the entire input
string as a single token.
[float]
=== Definition
It consists of:
Tokenizer::
* <<analysis-keyword-tokenizer,Keyword Tokenizer>>
[float]
=== Example output
@@ -57,3 +49,40 @@ The above sentence would produce the following single term:
=== Configuration
The `keyword` analyzer is not configurable.
[float]
=== Definition
The `keyword` analyzer consists of:
Tokenizer::
* <<analysis-keyword-tokenizer,Keyword Tokenizer>>
If you need to customize the `keyword` analyzer, you need to recreate it
as a `custom` analyzer and modify it, usually by adding token filters.
In most cases you should prefer the <<keyword, Keyword type>> when you
want strings that are not split into tokens, but if you do need the
analyzer, the following recreates the built-in `keyword` analyzer; you
can use it as a starting point for further customization:
[source,js]
----------------------------------------------------
PUT /keyword_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_keyword": {
"tokenizer": "keyword",
"filter": [ <1>
]
}
}
}
}
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: keyword_example, first: keyword, second: rebuilt_keyword}\nendyaml\n/]
<1> You'd add any token filters here.
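To give a concrete flavor of such a customization, here is a minimal sketch
(the index and analyzer names are made up for this example) that adds a
`lowercase` token filter, so the whole input is kept as a single, lowercased
token:
[source,js]
----------------------------------------------------
PUT /lowercase_keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase" <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The added filter; the rest mirrors the rebuilt `keyword` analyzer above.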

View File

@@ -19,19 +19,6 @@ Read more about http://www.regular-expressions.info/catastrophic.html[pathologic
========================================
[float]
=== Definition
It consists of:
Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>
Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
[float]
=== Example output
@@ -378,3 +365,51 @@ The regex above is easier to understand as:
[\p{L}&&[^\p{Lu}]] # then lower case
)
--------------------------------------------------
[float]
=== Definition
The `pattern` analyzer consists of:
Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>
Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
If you need to customize the `pattern` analyzer beyond the configuration
parameters, you need to recreate it as a `custom` analyzer and modify it,
usually by adding token filters. The following recreates the built-in
`pattern` analyzer; you can use it as a starting point for further
customization:
[source,js]
----------------------------------------------------
PUT /pattern_example
{
"settings": {
"analysis": {
"tokenizer": {
"split_on_non_word": {
"type": "pattern",
"pattern": "\\W+" <1>
}
},
"analyzer": {
"rebuilt_pattern": {
"tokenizer": "split_on_non_word",
"filter": [
"lowercase" <2>
]
}
}
}
}
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/]
<1> The default pattern is `\W+`, which splits on non-word characters;
this is where you'd change it.
<2> You'd add other token filters after `lowercase`.
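As an illustration of changing the pattern, here is a sketch (the index,
tokenizer, and analyzer names are made up for this example) that splits on
commas instead of non-word characters:
[source,js]
----------------------------------------------------
PUT /comma_pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_comma": {
          "type": "pattern",
          "pattern": "," <1>
        }
      },
      "analyzer": {
        "comma_separated": {
          "tokenizer": "split_on_comma",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> Any Java regular expression works here; a plain `,` splits on commas.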

View File

@@ -4,14 +4,6 @@
The `simple` analyzer breaks text into terms whenever it encounters a
character which is not a letter. All terms are lower cased.
[float]
=== Definition
It consists of:
Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
[float]
=== Example output
@@ -127,3 +119,37 @@ The above sentence would produce the following terms:
=== Configuration
The `simple` analyzer is not configurable.
[float]
=== Definition
The `simple` analyzer consists of:
Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
If you need to customize the `simple` analyzer, you need to recreate it as
a `custom` analyzer and modify it, usually by adding token filters. The
following recreates the built-in `simple` analyzer; you can use it as a
starting point for further customization:
[source,js]
----------------------------------------------------
PUT /simple_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_simple": {
"tokenizer": "lowercase",
"filter": [ <1>
]
}
}
}
}
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: simple_example, first: simple, second: rebuilt_simple}\nendyaml\n/]
<1> You'd add any token filters here.
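For example, here is a sketch (the index and analyzer names are made up for
this example) that keeps the `simple` behaviour but also folds accented
characters to their ASCII equivalents with the `asciifolding` token filter:
[source,js]
----------------------------------------------------
PUT /simple_folding_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "simple_with_folding": {
          "tokenizer": "lowercase",
          "filter": [
            "asciifolding" <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The added filter; everything else matches the rebuilt `simple` analyzer above.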

View File

@@ -7,19 +7,6 @@ Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
for most languages.
[float]
=== Definition
It consists of:
Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>
Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
[float]
=== Example output
@@ -276,3 +263,44 @@ The above example produces the following terms:
---------------------------
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
---------------------------
[float]
=== Definition
The `standard` analyzer consists of:
Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>
Token Filters::
* <<analysis-standard-tokenfilter,Standard Token Filter>>
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
If you need to customize the `standard` analyzer beyond the configuration
parameters, you need to recreate it as a `custom` analyzer and modify it,
usually by adding token filters. The following recreates the built-in
`standard` analyzer; you can use it as a starting point:
[source,js]
----------------------------------------------------
PUT /standard_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_standard": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase" <1>
]
}
}
}
}
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/]
<1> You'd add any token filters after `lowercase`.
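For example, here is a sketch (the index and analyzer names are made up for
this example) that appends the `asciifolding` token filter after `lowercase`:
[source,js]
----------------------------------------------------
PUT /standard_folding_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_with_folding": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "asciifolding" <1>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> The added filter goes after `lowercase`; the rest matches the rebuilt `standard` analyzer above.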

View File

@@ -5,17 +5,6 @@ The `stop` analyzer is the same as the <<analysis-simple-analyzer,`simple` analy
but adds support for removing stop words. It defaults to using the
`_english_` stop words.
[float]
=== Definition
It consists of:
Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
Token filters::
* <<analysis-stop-tokenfilter,Stop Token Filter>>
[float]
=== Example output
@@ -239,3 +228,50 @@ The above example produces the following terms:
---------------------------
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
---------------------------
[float]
=== Definition
The `stop` analyzer consists of:
Tokenizer::
* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
Token filters::
* <<analysis-stop-tokenfilter,Stop Token Filter>>
If you need to customize the `stop` analyzer beyond the configuration
parameters, you need to recreate it as a `custom` analyzer and modify it,
usually by adding token filters. The following recreates the built-in
`stop` analyzer; you can use it as a starting point for further
customization:
[source,js]
----------------------------------------------------
PUT /stop_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_" <1>
}
},
"analyzer": {
"rebuilt_stop": {
"tokenizer": "lowercase",
"filter": [
"english_stop" <2>
]
}
}
}
}
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: stop_example, first: stop, second: rebuilt_stop}\nendyaml\n/]
<1> The default stopwords can be overridden with the `stopwords`
or `stopwords_path` parameters.
<2> You'd add any token filters after `english_stop`.
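For example, here is a sketch (the index, filter, and analyzer names are made
up for this example) that swaps the `_english_` list for an explicit list of
stop words:
[source,js]
----------------------------------------------------
PUT /custom_stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": [ "and", "is", "the" ] <1>
        }
      },
      "analyzer": {
        "my_stop_analyzer": {
          "tokenizer": "lowercase",
          "filter": [
            "my_stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
<1> An explicit list of stop words; `stopwords_path` could point to a file instead.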

View File

@@ -4,14 +4,6 @@
The `whitespace` analyzer breaks text into terms whenever it encounters a
whitespace character.
[float]
=== Definition
It consists of:
Tokenizer::
* <<analysis-whitespace-tokenizer,Whitespace Tokenizer>>
[float]
=== Example output
@@ -120,3 +112,37 @@ The above sentence would produce the following terms:
=== Configuration
The `whitespace` analyzer is not configurable.
[float]
=== Definition
The `whitespace` analyzer consists of:
Tokenizer::
* <<analysis-whitespace-tokenizer,Whitespace Tokenizer>>
If you need to customize the `whitespace` analyzer, you need to recreate it
as a `custom` analyzer and modify it, usually by adding token filters. The
following recreates the built-in `whitespace` analyzer; you can use it as a
starting point for further customization:
[source,js]
----------------------------------------------------
PUT /whitespace_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_whitespace": {
"tokenizer": "whitespace",
"filter": [ <1>
]
}
}
}
}
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: whitespace_example, first: whitespace, second: rebuilt_whitespace}\nendyaml\n/]
<1> You'd add any token filters here.
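As with the other analyzers, you could check the rebuilt analyzer against the
built-in one with the `_analyze` API. A minimal sketch; the sample sentence is
only illustrative:
[source,js]
----------------------------------------------------
POST /whitespace_example/_analyze
{
  "analyzer": "rebuilt_whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
----------------------------------------------------
The tokens in the response should be the whitespace-separated terms, identical
to what the built-in `whitespace` analyzer returns for the same text.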