[DOCS] Reformat keep types and keep words token filter docs (#49604)

* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds explanations of token types to keep types token filter and tokenizer docs
James Rodewig 2019-12-02 09:22:21 -05:00
parent 86a40f6d8b
commit ade72b97b7
3 changed files with 285 additions and 121 deletions

[[analysis-keep-types-tokenfilter]]
=== Keep types token filter
++++
<titleabbrev>Keep types</titleabbrev>
++++

Keeps or removes tokens of a specific type. For example, you can use this filter
to change `3 quick foxes` to `quick foxes` by keeping only `<ALPHANUM>`
(alphanumeric) tokens.

[NOTE]
.Token types
====
Token types are set by the <<analysis-tokenizers,tokenizer>> when converting
characters to tokens. Token types can vary between tokenizers.

For example, the <<analysis-standard-tokenizer,`standard`>> tokenizer can
produce a variety of token types, including `<ALPHANUM>`, `<HANGUL>`, and
`<NUM>`. Simpler tokenizers, like the
<<analysis-lowercase-tokenizer,`lowercase`>> tokenizer, only produce the `word`
token type.

Certain token filters can also add token types. For example, the
<<analysis-synonym-tokenfilter,`synonym`>> filter can add the `<SYNONYM>` token
type.
====
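
To see the token types a tokenizer emits for a given text, you can call the
<<indices-analyze,analyze API>> without any token filters. As a minimal sketch,
the following request shows the `standard` tokenizer producing a `<NUM>` token
for `3` and `<ALPHANUM>` tokens for `quick` and `foxes`:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "text": "3 quick foxes"
}
--------------------------------------------------

Each token in the response includes a `type` attribute that you can reference
in the `keep_types` filter.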

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html[TypeTokenFilter].

[[analysis-keep-types-tokenfilter-analyze-include-ex]]
==== Include example

The following <<indices-analyze,analyze API>> request uses the `keep_types`
filter to keep only `<NUM>` (numeric) tokens from `1 quick fox 2 lazy dogs`.

[source,console]
--------------------------------------------------
GET _analyze
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "extract_numbers"]
}
},
"filter" : {
"extract_numbers" : {
"type" : "keep_types",
"types" : [ "<NUM>" ]
}
}
}
"tokenizer": "standard",
"filter": [
{
"type": "keep_types",
"types": [ "<NUM>" ]
}
],
"text": "1 quick fox 2 lazy dogs"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ 1, 2 ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "1",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<NUM>",
      "position": 3
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-keep-types-tokenfilter-analyze-exclude-ex]]
==== Exclude example

The following <<indices-analyze,analyze API>> request uses the `keep_types`
filter to remove `<NUM>` tokens from `1 quick fox 2 lazy dogs`. Note the `mode`
parameter is set to `exclude`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ],
      "mode": "exclude"
    }
  ],
  "text": "1 quick fox 2 lazy dogs"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ quick, fox, lazy, dogs ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "lazy",
      "start_offset": 14,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "dogs",
      "start_offset": 19,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-keep-types-tokenfilter-configure-parms]]
==== Configurable parameters

`types`::
(Required, array of strings)
List of token types to keep or remove.

`mode`::
(Optional, string)
Indicates whether to keep or remove the specified token types.
Valid values are:

`include`:::
(Default) Keep only the specified token types.

`exclude`:::
Remove the specified token types.

[[analysis-keep-types-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `keep_types` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `keep_types` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>. The custom `keep_types` filter
keeps only `<ALPHANUM>` (alphanumeric) tokens.

[source,console]
--------------------------------------------------
PUT keep_types_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "extract_alpha" ]
        }
      },
      "filter": {
        "extract_alpha": {
          "type": "keep_types",
          "types": [ "<ALPHANUM>" ]
        }
      }
    }
  }
}
--------------------------------------------------
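
To verify the new analyzer, you can run it through the analyze API. A minimal
sketch, reusing the `keep_types_example` index and `my_analyzer` analyzer
defined above:

[source,console]
--------------------------------------------------
GET keep_types_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "3 quick foxes"
}
--------------------------------------------------

Because the `extract_alpha` filter keeps only `<ALPHANUM>` tokens, the response
should contain `quick` and `foxes` but not `3`.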

[[analysis-keep-words-tokenfilter]]
=== Keep words token filter
++++
<titleabbrev>Keep words</titleabbrev>
++++

Keeps only tokens contained in a specified word list.

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html[KeepWordFilter].

[NOTE]
====
To remove a list of words from a token stream, use the
<<analysis-stop-tokenfilter,`stop`>> filter.
====

[[analysis-keep-words-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the `keep` filter to
keep only the `fox` and `dog` tokens from
`the quick fox jumps over the lazy dog`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "dog", "elephant", "fox" ]
    }
  ],
  "text": "the quick fox jumps over the lazy dog"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
[ fox, dog ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "fox",
      "start_offset": 10,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "dog",
      "start_offset": 34,
      "end_offset": 37,
      "type": "word",
      "position": 7
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-keep-words-tokenfilter-configure-parms]]
==== Configurable parameters

`keep_words`::
+
--
(Required+++*+++, array of strings)
List of words to keep. Only tokens that match words in this list are included in
the output.

Either this parameter or `keep_words_path` must be specified.
--

`keep_words_path`::
+
--
(Required+++*+++, string)
Path to a file that contains a list of words to keep. Only tokens that match
words in this list are included in the output.

This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each word in the file must be separated by a line break.

Either this parameter or `keep_words` must be specified.
--

`keep_words_case`::
(Optional, boolean)
If `true`, lowercase all keep words. Defaults to `false`.
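
As a minimal sketch, the following analyze API request shows how
`keep_words_case` makes matching case-insensitive. The keep words `DOG` and
`FOX` are lowercased before comparison, so they still match the tokens
lowercased by the `lowercase` filter:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    {
      "type": "keep",
      "keep_words": [ "DOG", "FOX" ],
      "keep_words_case": true
    }
  ],
  "text": "the quick FOX jumps over the lazy DOG"
}
--------------------------------------------------

Without `keep_words_case`, the uppercase keep words would not match the
lowercased tokens and the filter would remove every token.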

[[analysis-keep-words-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `keep` filter, duplicate it to create the basis for a new
custom token filter. You can modify the filter using its configurable
parameters.

For example, the following <<indices-create-index,create index API>> request
uses custom `keep` filters to configure two new
<<analysis-custom-analyzer,custom analyzers>>:

* `standard_keep_word_array`, which uses a custom `keep` filter with an inline
array of keep words
* `standard_keep_word_file`, which uses a custom `keep` filter with a keep
words file

[source,console]
--------------------------------------------------
PUT keep_words_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_keep_word_array": {
          "tokenizer": "standard",
          "filter": [ "keep_word_array" ]
        },
        "standard_keep_word_file": {
          "tokenizer": "standard",
          "filter": [ "keep_word_file" ]
        }
      },
      "filter": {
        "keep_word_array": {
          "type": "keep",
          "keep_words": [ "one", "two", "three" ]
        },
        "keep_word_file": {
          "type": "keep",
          "keep_words_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}
--------------------------------------------------
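
To try the array-based analyzer, you can run an analyze request against the new
index. A minimal sketch; the file-based analyzer works the same way, assuming
`analysis/example_word_list.txt` exists on each node and contains one word per
line:

[source,console]
--------------------------------------------------
GET keep_words_example/_analyze
{
  "analyzer": "standard_keep_word_array",
  "text": "one two four"
}
--------------------------------------------------

Only `one` and `two` appear in the `keep_word_array` word list, so `four` is
removed from the output.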

For instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the following (see the example
after this list):

* Order or _position_ of each term (used for phrase and word proximity queries)
* Start and end _character offsets_ of the original word which the term
represents (used for highlighting search snippets)
* _Token type_, a classification of each term produced, such as `<ALPHANUM>`,
`<HANGUL>`, or `<NUM>`. Simpler tokenizers only produce the `word` token type.
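
As a minimal sketch, the following <<indices-analyze,analyze API>> request
surfaces all three attributes for the `standard` tokenizer:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "text": "Quick brown fox!"
}
--------------------------------------------------

Each token in the response carries `position`, `start_offset`, `end_offset`,
and `type` attributes, such as `<ALPHANUM>` for `Quick`.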

Elasticsearch has a number of built-in tokenizers which can be used to build
<<analysis-custom-analyzer,custom analyzers>>.