[DOCS] Reformat fingerprint token filter docs (#49311)

This commit is contained in:
James Rodewig 2019-11-19 10:54:16 -05:00
parent 0acba44a2e
commit 8639ddab5e
1 changed file with 130 additions and 20 deletions


[[analysis-fingerprint-tokenfilter]]
=== Fingerprint token filter
++++
<titleabbrev>Fingerprint</titleabbrev>
++++

Sorts and removes duplicate tokens from a token stream, then concatenates the
stream into a single output token.

For example, this filter changes the `[ the, fox, was, very, very, quick ]`
token stream as follows:

. Sorts the tokens alphabetically to `[ fox, quick, the, very, very, was ]`
. Removes a duplicate instance of the `very` token.
. Concatenates the token stream to a single output token: `[ fox quick the very was ]`
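
You can reproduce these steps with the <<indices-analyze,analyze API>>. A
minimal check, assuming a `whitespace` tokenizer so the text splits on spaces;
the request should return the single token `fox quick the very was`:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "fingerprint" ],
  "text": "the fox was very very quick"
}
--------------------------------------------------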
Output tokens produced by this filter are useful for
fingerprinting and clustering a body of text as described in the
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[OpenRefine
project].

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html[FingerprintFilter].

[[analysis-fingerprint-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the `fingerprint`
filter to create a single output token for the text `zebra jumps over resting
resting dog`:

[source,console]
--------------------------------------------------
GET _analyze
{
"tokenizer" : "whitespace",
"filter" : ["fingerprint"],
"text" : "zebra jumps over resting resting dog"
}
--------------------------------------------------

The filter produces the following token:

[source,text]
--------------------------------------------------
[ dog jumps over resting zebra ]
--------------------------------------------------
/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "dog jumps over resting zebra",
      "start_offset" : 0,
      "end_offset" : 36,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-fingerprint-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`fingerprint` filter to configure a new <<analysis-custom-analyzer,custom
analyzer>>.

[source,console]
--------------------------------------------------
PUT fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_fingerprint": {
          "tokenizer": "whitespace",
          "filter": [ "fingerprint" ]
        }
      }
    }
  }
}
--------------------------------------------------
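
To check the new analyzer, you can run it against the index with the
<<indices-analyze,analyze API>>. A quick verification, assuming the
`fingerprint_example` index above exists:

[source,console]
--------------------------------------------------
GET fingerprint_example/_analyze
{
  "analyzer": "whitespace_fingerprint",
  "text": "zebra jumps over resting resting dog"
}
--------------------------------------------------

The request should return the single token `dog jumps over resting zebra`.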

[[analysis-fingerprint-tokenfilter-configure-parms]]
==== Configurable parameters

`max_output_size`::
(Optional, integer)
Maximum character length, including whitespace, of the output token. Defaults to
`255`. Concatenated tokens longer than this will result in no token output.

`separator`::
(Optional, string)
Character to use to concatenate the token stream input. Defaults to a space.
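
You can try both parameters without creating an index by passing a transient
filter definition to the <<indices-analyze,analyze API>>. A minimal sketch
using a deliberately small `max_output_size` of `10`:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "fingerprint",
      "max_output_size": 10
    }
  ],
  "text": "zebra jumps over resting resting dog"
}
--------------------------------------------------

Because the concatenated fingerprint `dog jumps over resting zebra` is longer
than `10` characters, the filter emits no token and the response should
contain an empty `tokens` list.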

[[analysis-fingerprint-tokenfilter-customize]]
==== Customize

To customize the `fingerprint` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following request creates a custom `fingerprint` filter that
uses `+` to concatenate token streams. The filter also limits output tokens to
`100` characters or fewer.

[source,console]
--------------------------------------------------
PUT custom_fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_fingerprint": {
          "tokenizer": "whitespace",
          "filter": [ "fingerprint_plus_concat" ]
        }
      },
      "filter": {
        "fingerprint_plus_concat": {
          "type": "fingerprint",
          "max_output_size": 100,
          "separator": "+"
        }
      }
    }
  }
}
--------------------------------------------------
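
As a quick check, assuming the `custom_fingerprint_example` index above, the
following request runs the custom analyzer through the
<<indices-analyze,analyze API>>:

[source,console]
--------------------------------------------------
GET custom_fingerprint_example/_analyze
{
  "analyzer": "whitespace_fingerprint",
  "text": "zebra jumps over resting resting dog"
}
--------------------------------------------------

The request should return the single token `dog+jumps+over+resting+zebra`,
with the tokens joined by the `+` separator.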