[DOCS] Reformat `keyword_marker` token filter (#54076)

Makes the following changes to the `keyword_marker` token filter docs:

* Rewrites description and adds Lucene link
* Adds detailed analyze example
* Rewrites parameter definitions
* Adds custom analyzer and filter example
This commit is contained in:
James Rodewig 2020-03-25 09:26:06 -04:00 committed by GitHub
parent ba09a778dc
commit 74051d68a8
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed files with 360 additions and 112 deletions

View File

@ -4,137 +4,385 @@
<titleabbrev>Keyword marker</titleabbrev> <titleabbrev>Keyword marker</titleabbrev>
++++ ++++
Protects words from being modified by stemmers. Must be placed before Marks specified tokens as keywords, which are not stemmed.
any stemming filters.
[cols="<,<",options="header",] The `keyword_marker` assigns specified tokens a `keyword` attribute of `true`.
|======================================================================= Stemmer token filters, such as <<analysis-stemmer-tokenfilter,`stemmer`>> or
|Setting |Description <<analysis-porterstem-tokenfilter,`porter_stem`>>, skip tokens with a `keyword`
|`keywords` |A list of words to use. attribute of `true`.
|`keywords_path` |A path (either relative to `config` location, or [IMPORTANT]
absolute) to a list of words. ====
To work properly, the `keyword_marker` filter must be listed before any stemmer
token filters in the <<analysis-custom-analyzer,analyzer configuration>>.
====
|`keywords_pattern` |A regular expression pattern to match against words The `keyword_marker` filter uses Lucene's
in the text. {lucene-analysis-docs}/miscellaneous/KeywordMarkerFilter.html[KeywordMarkerFilter].
|`ignore_case` |Set to `true` to lower case all words first. Defaults to [[analysis-keyword-marker-tokenfilter-analyze-ex]]
`false`. ==== Example
|=======================================================================
You can configure it like: To see how the `keyword_marker` filter works, you first need to produce a token
stream containing stemmed tokens.
The following <<indices-analyze,analyze API>> request uses the
<<analysis-stemmer-tokenfilter,`stemmer`>> filter to create stemmed tokens for
`fox running and jumping`.
[source,console] [source,console]
-------------------------------------------------- ----
PUT /keyword_marker_example GET /_analyze
{
"tokenizer": "whitespace",
"filter": [ "stemmer" ],
"text": "fox running and jumping"
}
----
The request produces the following tokens. Note that `running` was stemmed to
`run` and `jumping` was stemmed to `jump`.
[source,text]
----
[ fox, run, and, jump ]
----
////
[source,console-result]
----
{
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "run",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2
},
{
"token": "jump",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3
}
]
}
----
////
To prevent `jumping` from being stemmed, add the `keyword_marker` filter before
the `stemmer` filter in the previous analyze API request. Specify `jumping` in
the `keywords` parameter of the `keyword_marker` filter.
[source,console]
----
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "keyword_marker",
"keywords": [ "jumping" ]
},
"stemmer"
],
"text": "fox running and jumping"
}
----
The request produces the following tokens. `running` is still stemmed to `run`,
but `jumping` is not stemmed.
[source,text]
----
[ fox, run, and, jumping ]
----
////
[source,console-result]
----
{
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "run",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2
},
{
"token": "jumping",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3
}
]
}
----
////
To see the `keyword` attribute for these tokens, add the following arguments to
the analyze API request:
* `explain`: `true`
* `attributes`: `keyword`
[source,console]
----
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "keyword_marker",
"keywords": [ "jumping" ]
},
"stemmer"
],
"text": "fox running and jumping",
"explain": true,
"attributes": "keyword"
}
----
The API returns the following response. Note the `jumping` token has a
`keyword` attribute of `true`.
[source,console-result]
----
{
"detail": {
"custom_analyzer": true,
"charfilters": [],
"tokenizer": {
"name": "whitespace",
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "running",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2
},
{
"token": "jumping",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3
}
]
},
"tokenfilters": [
{
"name": "__anonymous__keyword_marker",
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0,
"keyword": false
},
{
"token": "running",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1,
"keyword": false
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2,
"keyword": false
},
{
"token": "jumping",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3,
"keyword": true
}
]
},
{
"name": "stemmer",
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0,
"keyword": false
},
{
"token": "run",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1,
"keyword": false
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2,
"keyword": false
},
{
"token": "jumping",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3,
"keyword": true
}
]
}
]
}
}
----
[[analysis-keyword-marker-tokenfilter-configure-parms]]
==== Configurable parameters
`ignore_case`::
(Optional, boolean)
If `true`, matching for the `keywords` and `keywords_path` parameters ignores
letter case. Defaults to `false`.
`keywords`::
(Required*, array of strings)
Array of keywords. Tokens that match these keywords are not stemmed.
+
This parameter, `keywords_path`, or `keywords_pattern` must be specified.
You cannot specify this parameter and `keywords_pattern`.
`keywords_path`::
+
--
(Required*, array of strings)
Path to a file that contains a list of keywords. Tokens that match these
keywords are not stemmed.
This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each word in the file must be separated by a line break.
This parameter, `keywords`, or `keywords_pattern` must be specified.
You cannot specify this parameter and `keywords_pattern`.
--
`keywords_pattern`::
+
--
(Required*, string)
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java
regular expression] used to match tokens. Tokens that match this expression are
marked as keywords and not stemmed.
This parameter, `keywords`, or `keywords_path` must be specified. You
cannot specify this parameter and `keywords` or `keywords_pattern`.
[WARNING]
====
Poorly written regular expressions can cause {es} to run slowly or result
in stack overflow errors, causing the running node to suddenly exit.
====
--
[[analysis-keyword-marker-tokenfilter-customize]]
==== Customize and add to an analyzer
To customize the `keyword_marker` filter, duplicate it to create the basis for a
new custom token filter. You can modify the filter using its configurable
parameters.
For example, the following <<indices-create-index,create index API>> request
uses a custom `keyword_marker` filter and the `porter_stem`
filter to configure a new <<analysis-custom-analyzer,custom analyzer>>.
The custom `keyword_marker` filter marks tokens specified in the
`analysis/example_word_list.txt` file as keywords. The `porter_stem` filter does
not stem these tokens.
[source,console]
----
PUT /my_index
{ {
"settings": { "settings": {
"analysis": { "analysis": {
"analyzer": { "analyzer": {
"protect_cats": { "my_custom_analyzer": {
"type": "custom", "type": "custom",
"tokenizer": "standard", "tokenizer": "standard",
"filter": ["lowercase", "protect_cats", "porter_stem"] "filter": [
}, "my_custom_keyword_marker_filter",
"normal": { "porter_stem"
"type": "custom", ]
"tokenizer": "standard",
"filter": ["lowercase", "porter_stem"]
} }
}, },
"filter": { "filter": {
"protect_cats": { "my_custom_keyword_marker_filter": {
"type": "keyword_marker", "type": "keyword_marker",
"keywords": ["cats"] "keywords": "analysis/example_word_list.txt"
} }
} }
} }
} }
} }
-------------------------------------------------- ----
And test it with:
[source,console]
--------------------------------------------------
POST /keyword_marker_example/_analyze
{
"analyzer" : "protect_cats",
"text" : "I like cats"
}
--------------------------------------------------
// TEST[continued]
And it'd respond:
[source,console-result]
--------------------------------------------------
{
"tokens": [
{
"token": "i",
"start_offset": 0,
"end_offset": 1,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "like",
"start_offset": 2,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "cats",
"start_offset": 7,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 2
}
]
}
--------------------------------------------------
As compared to the `normal` analyzer which has `cats` stemmed to `cat`:
[source,console]
--------------------------------------------------
POST /keyword_marker_example/_analyze
{
"analyzer" : "normal",
"text" : "I like cats"
}
--------------------------------------------------
// TEST[continued]
Response:
[source,console-result]
--------------------------------------------------
{
"tokens": [
{
"token": "i",
"start_offset": 0,
"end_offset": 1,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "like",
"start_offset": 2,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "cat",
"start_offset": 7,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 2
}
]
}
--------------------------------------------------