[DOCS] Reformat `remove_duplicates` token filter (#53608)

Makes the following changes to the `remove_duplicates` token filter
docs:

* Rewrites description and adds Lucene link
* Adds detailed analyze example
* Adds custom analyzer example
James Rodewig 2020-03-16 11:37:06 -04:00 committed by GitHub
parent 2c74f3e22c
commit e1eebea846
1 changed file with 148 additions and 2 deletions


@@ -4,5 +4,151 @@
<titleabbrev>Remove duplicates</titleabbrev>
++++
Removes duplicate tokens in the same position.
The `remove_duplicates` filter uses Lucene's
{lucene-analysis-docs}/miscellaneous/RemoveDuplicatesTokenFilter.html[RemoveDuplicatesTokenFilter].
[[analysis-remove-duplicates-tokenfilter-analyze-ex]]
==== Example
To see how the `remove_duplicates` filter works, you first need to produce a
token stream containing duplicate tokens in the same position.
The following <<indices-analyze,analyze API>> request uses the
<<analysis-keyword-repeat-tokenfilter,`keyword_repeat`>> and
<<analysis-stemmer-tokenfilter,`stemmer`>> filters to create stemmed and
unstemmed tokens for `jumping dog`.
[source,console]
----
GET _analyze
{
"tokenizer": "whitespace",
"filter": [
"keyword_repeat",
"stemmer"
],
"text": "jumping dog"
}
----
The API returns the following response. Note that the `dog` token in position
`1` is duplicated.
[source,console-result]
----
{
"tokens": [
{
"token": "jumping",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "jump",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "dog",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "dog",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
----
To remove one of the duplicate `dog` tokens, add the `remove_duplicates` filter
to the previous analyze API request.
[source,console]
----
GET _analyze
{
"tokenizer": "whitespace",
"filter": [
"keyword_repeat",
"stemmer",
"remove_duplicates"
],
"text": "jumping dog"
}
----
The API returns the following response. There is now only one `dog` token in
position `1`.
[source,console-result]
----
{
"tokens": [
{
"token": "jumping",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "jump",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "dog",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
----
[[analysis-remove-duplicates-tokenfilter-analyzer-ex]]
==== Add to an analyzer
The following <<indices-create-index,create index API>> request uses the
`remove_duplicates` filter to configure a new <<analysis-custom-analyzer,custom
analyzer>>.
This custom analyzer uses the `keyword_repeat` and `stemmer` filters to create a
stemmed and unstemmed version of each token in a stream. The `remove_duplicates`
filter then removes any duplicate tokens in the same position.
[source,console]
----
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"tokenizer": "standard",
"filter": [
"keyword_repeat",
"stemmer",
"remove_duplicates"
]
}
}
}
}
}
----
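To check that the new analyzer collapses duplicate tokens, you can test it with the <<indices-analyze,analyze API>>. The following request is a minimal sketch that assumes the `my_index` index created by the previous request; it analyzes `jumping dog` with `my_custom_analyzer`. The response should contain the `jumping` and `jump` tokens in position `0`, but only a single `dog` token in position `1`.
[source,console]
----
GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "jumping dog"
}
----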