[DOCS] Reformat fingerprint token filter docs (#49311)

This commit is contained in:
James Rodewig 2019-11-19 10:54:16 -05:00
parent 0acba44a2e
commit 8639ddab5e
1 changed file with 130 additions and 20 deletions


[[analysis-fingerprint-tokenfilter]]
=== Fingerprint token filter
++++
<titleabbrev>Fingerprint</titleabbrev>
++++

Sorts and removes duplicate tokens from a token stream, then concatenates the
stream into a single output token.

For example, this filter changes the `[ the, fox, was, very, very, quick ]`
token stream as follows:

. Sorts the tokens alphabetically to `[ fox, quick, the, very, very, was ]`
. Removes a duplicate instance of the `very` token.
. Concatenates the token stream to a single output token: `[ fox quick the very was ]`
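
You can reproduce these steps with the <<indices-analyze,analyze API>>. A
minimal check, assuming a `whitespace` tokenizer so the text splits on spaces;
the request should return the single token `fox quick the very was`:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "fingerprint" ],
  "text": "the fox was very very quick"
}
--------------------------------------------------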
Output tokens produced by this filter are useful for
fingerprinting and clustering a body of text as described in the
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[OpenRefine
project].

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html[FingerprintFilter].

[[analysis-fingerprint-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the `fingerprint`
filter to create a single output token for the text `zebra jumps over resting
resting dog`:

[source,console]
--------------------------------------------------
GET _analyze
{
"tokenizer" : "whitespace",
"filter" : ["fingerprint"],
"text" : "zebra jumps over resting resting dog"
}
--------------------------------------------------

The filter produces the following token:

[source,text]
--------------------------------------------------
[ dog jumps over resting zebra ]
--------------------------------------------------
/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "dog jumps over resting zebra",
      "start_offset" : 0,
      "end_offset" : 36,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-fingerprint-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`fingerprint` filter to configure a new <<analysis-custom-analyzer,custom
analyzer>>.

[source,console]
--------------------------------------------------
PUT fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_fingerprint": {
          "tokenizer": "whitespace",
          "filter": [ "fingerprint" ]
        }
      }
    }
  }
}
--------------------------------------------------
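
To check the new analyzer, you can run it against the index with the
<<indices-analyze,analyze API>>. A quick verification, assuming the
`fingerprint_example` index above exists:

[source,console]
--------------------------------------------------
GET fingerprint_example/_analyze
{
  "analyzer": "whitespace_fingerprint",
  "text": "zebra jumps over resting resting dog"
}
--------------------------------------------------

The request should return the single token `dog jumps over resting zebra`.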

[[analysis-fingerprint-tokenfilter-configure-parms]]
==== Configurable parameters

`max_output_size`::
(Optional, integer)
Maximum character length, including whitespace, of the output token. Defaults to
`255`. Concatenated tokens longer than this will result in no token output.

`separator`::
(Optional, string)
Character to use to concatenate the token stream input. Defaults to a space.
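
You can try both parameters without creating an index by passing a transient
filter definition to the <<indices-analyze,analyze API>>. A minimal sketch
using a deliberately small `max_output_size` of `10`:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "fingerprint",
      "max_output_size": 10
    }
  ],
  "text": "zebra jumps over resting resting dog"
}
--------------------------------------------------

Because the concatenated fingerprint `dog jumps over resting zebra` is longer
than `10` characters, the filter emits no token and the response should
contain an empty `tokens` list.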

[[analysis-fingerprint-tokenfilter-customize]]
==== Customize

To customize the `fingerprint` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following request creates a custom `fingerprint` filter that
uses `+` to concatenate token streams. The filter also limits output tokens to
`100` characters or fewer.

[source,console]
--------------------------------------------------
PUT custom_fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_fingerprint": {
          "tokenizer": "whitespace",
          "filter": [ "fingerprint_plus_concat" ]
        }
      },
      "filter": {
        "fingerprint_plus_concat": {
          "type": "fingerprint",
          "max_output_size": 100,
          "separator": "+"
        }
      }
    }
  }
}
--------------------------------------------------
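
As a quick check, assuming the `custom_fingerprint_example` index above, the
following request runs the custom analyzer through the
<<indices-analyze,analyze API>>:

[source,console]
--------------------------------------------------
GET custom_fingerprint_example/_analyze
{
  "analyzer": "whitespace_fingerprint",
  "text": "zebra jumps over resting resting dog"
}
--------------------------------------------------

The request should return the single token `dog+jumps+over+resting+zebra`,
with the tokens joined by the `+` separator.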