[DOCS] Reformat keep types and keep words token filter docs (#49604)
* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds explanations of token types to keep types token filter and tokenizer docs
This commit is contained in:
parent 86a40f6d8b
commit ade72b97b7
[[analysis-keep-types-tokenfilter]]
=== Keep types token filter
++++
<titleabbrev>Keep types</titleabbrev>
++++

Keeps or removes tokens of a specific type. For example, you can use this filter
to change `3 quick foxes` to `quick foxes` by keeping only `<ALPHANUM>`
(alphanumeric) tokens.

[NOTE]
.Token types
====
Token types are set by the <<analysis-tokenizers,tokenizer>> when converting
characters to tokens. Token types can vary between tokenizers.

For example, the <<analysis-standard-tokenizer,`standard`>> tokenizer can
produce a variety of token types, including `<ALPHANUM>`, `<HANGUL>`, and
`<NUM>`. Simpler analyzers, like the
<<analysis-lowercase-tokenizer,`lowercase`>> tokenizer, only produce the `word`
token type.

Certain token filters can also add token types. For example, the
<<analysis-synonym-tokenfilter,`synonym`>> filter can add the `<SYNONYM>` token
type.
====
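
If you're unsure which token types a tokenizer produces for a given input, a
quick way to check is the <<indices-analyze,analyze API>>; each token in the
response includes a `type` field. For example, the following request shows the
types the `standard` tokenizer assigns to `3 quick foxes`:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "text": "3 quick foxes"
}
--------------------------------------------------

In the response, `3` has the `<NUM>` type while `quick` and `foxes` are
`<ALPHANUM>` tokens.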

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html[TypeTokenFilter].

[[analysis-keep-types-tokenfilter-analyze-include-ex]]
==== Include example

The following <<indices-analyze,analyze API>> request uses the `keep_types`
filter to keep only `<NUM>` (numeric) tokens from `1 quick fox 2 lazy dogs`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ]
    }
  ],
  "text": "1 quick fox 2 lazy dogs"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ 1, 2 ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "1",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<NUM>",
      "position": 3
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-keep-types-tokenfilter-analyze-exclude-ex]]
==== Exclude example

The following <<indices-analyze,analyze API>> request uses the `keep_types`
filter to remove `<NUM>` tokens from `1 quick fox 2 lazy dogs`. Note the `mode`
parameter is set to `exclude`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ],
      "mode": "exclude"
    }
  ],
  "text": "1 quick fox 2 lazy dogs"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ quick, fox, lazy, dogs ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "lazy",
      "start_offset": 14,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "dogs",
      "start_offset": 19,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-keep-types-tokenfilter-configure-parms]]
==== Configurable parameters

`types`::
(Required, array of strings)
List of token types to keep or remove.

`mode`::
(Optional, string)
Indicates whether to keep or remove the specified token types.
Valid values are:

`include`:::
(Default) Keep only the specified token types.

`exclude`:::
Remove the specified token types.

[[analysis-keep-types-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `keep_types` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `keep_types` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>. The custom `keep_types` filter
keeps only `<ALPHANUM>` (alphanumeric) tokens.

[source,console]
--------------------------------------------------
PUT keep_types_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "extract_alpha" ]
        }
      },
      "filter": {
        "extract_alpha": {
          "type": "keep_types",
          "types": [ "<ALPHANUM>" ]
        }
      }
    }
  }
}
--------------------------------------------------
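
As a quick check (assuming the index above has been created), you can run the
new analyzer through the <<indices-analyze,analyze API>>:

[source,console]
--------------------------------------------------
POST keep_types_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "3 quick foxes"
}
--------------------------------------------------
// TEST[continued]

Only `quick` and `foxes` should appear in the output, since `3` is a `<NUM>`
token.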
[[analysis-keep-words-tokenfilter]]
=== Keep words token filter
++++
<titleabbrev>Keep words</titleabbrev>
++++

Keeps only tokens contained in a specified word list.

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html[KeepWordFilter].

[NOTE]
====
To remove a list of words from a token stream, use the
<<analysis-stop-tokenfilter,`stop`>> filter.
====
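
For contrast, here is a minimal sketch of the inverse operation with the `stop`
filter and a custom `stopwords` list, run on the same sentence as the `keep`
example below:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "stop",
      "stopwords": [ "fox", "dog" ]
    }
  ],
  "text": "the quick fox jumps over the lazy dog"
}
--------------------------------------------------

This removes `fox` and `dog` and keeps everything else, the mirror image of the
`keep` filter.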

[[analysis-keep-words-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the `keep` filter to
keep only the `fox` and `dog` tokens from
`the quick fox jumps over the lazy dog`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "dog", "elephant", "fox" ]
    }
  ],
  "text": "the quick fox jumps over the lazy dog"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ fox, dog ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "fox",
      "start_offset": 10,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "dog",
      "start_offset": 34,
      "end_offset": 37,
      "type": "word",
      "position": 7
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-keep-words-tokenfilter-configure-parms]]
==== Configurable parameters

`keep_words`::
+
--
(Required+++*+++, array of strings)
List of words to keep. Only tokens that match words in this list are included
in the output.

Either this parameter or `keep_words_path` must be specified.
--

`keep_words_path`::
+
--
(Required+++*+++, string)
Path to a file that contains a list of words to keep. Only tokens that match
words in this list are included in the output.

This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each word in the file must be separated by a line break.

Either this parameter or `keep_words` must be specified.
--

`keep_words_case`::
(Optional, boolean)
If `true`, lowercase all keep words. Defaults to `false`.
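
For example, the following sketch (with illustrative values) lowercases the
keep words `Fox` and `DOG` so they match tokens already lowercased by a
preceding `lowercase` filter:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    {
      "type": "keep",
      "keep_words": [ "Fox", "DOG" ],
      "keep_words_case": true
    }
  ],
  "text": "The quick Fox jumps over the lazy DOG"
}
--------------------------------------------------

The filter outputs the `fox` and `dog` tokens.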
[[analysis-keep-words-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `keep` filter, duplicate it to create the basis for a new
custom token filter. You can modify the filter using its configurable
parameters.

For example, the following <<indices-create-index,create index API>> request
uses custom `keep` filters to configure two new
<<analysis-custom-analyzer,custom analyzers>>:

* `standard_keep_word_array`, which uses a custom `keep` filter with an inline
array of keep words
* `standard_keep_word_file`, which uses a custom `keep` filter with a keep
words file

[source,console]
--------------------------------------------------
PUT keep_words_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_keep_word_array": {
          "tokenizer": "standard",
          "filter": [ "keep_word_array" ]
        },
        "standard_keep_word_file": {
          "tokenizer": "standard",
          "filter": [ "keep_word_file" ]
        }
      },
      "filter": {
        "keep_word_array": {
          "type": "keep",
          "keep_words": [ "one", "two", "three" ]
        },
        "keep_word_file": {
          "type": "keep",
          "keep_words_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}
--------------------------------------------------
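
To try the array-based analyzer (assuming the index above exists), you can use
the analyze API; the file-based analyzer works the same way once
`analysis/example_word_list.txt` is present in the config directory:

[source,console]
--------------------------------------------------
POST keep_words_example/_analyze
{
  "analyzer": "standard_keep_word_array",
  "text": "one two four"
}
--------------------------------------------------
// TEST[continued]

Only `one` and `two` are returned; `four` is not in the keep list.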

For instance, a <<analysis-whitespace-tokenizer,`whitespace`>> tokenizer breaks
text into tokens whenever it sees any whitespace. It would convert the text
`"Quick brown fox!"` into the terms `[Quick, brown, fox!]`.

The tokenizer is also responsible for recording the following:

* Order or _position_ of each term (used for phrase and word proximity queries)
* Start and end _character offsets_ of the original word which the term
represents (used for highlighting search snippets)
* _Token type_, a classification of each term produced, such as `<ALPHANUM>`,
`<HANGUL>`, or `<NUM>`. Simpler analyzers only produce the `word` token type.
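
All three pieces of recorded information are visible in the
<<indices-analyze,analyze API>> output; each returned token carries a
`position`, a `start_offset`/`end_offset` pair, and a `type`:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "text": "Quick brown fox!"
}
--------------------------------------------------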

Elasticsearch has a number of built-in tokenizers which can be used to build
<<analysis-custom-analyzer,custom analyzers>>.