OpenSearch/docs/reference/analysis/analyzers/fingerprint-analyzer.asciidoc

[[analysis-fingerprint-analyzer]]
=== Fingerprint Analyzer

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token.  If a stopword list is
configured, stop words will also be removed.

[float]
=== Definition

It consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
1. <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
2. <<analysis-asciifolding-tokenfilter>>
3. <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
4. <<analysis-fingerprint-tokenfilter>>

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------
// CONSOLE

The above sentence would produce the following single term:

[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------

[float]
=== Configuration

The `fingerprint` analyzer accepts the following parameters:

[horizontal]
`separator`::

    The character to use to concate the terms.  Defaults to a space.

`max_output_size`::

    The maximum token size to emit.  Defaults to `255`. Tokens larger than
    this size will be discarded.

`preserve_original`::

    If `true`, emits two tokens: one with ASCII-folding of terms that contain
    extended characters (if any) and one with the original characters.
    Defaults to `false`.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array  containing a
    list of stop words.  Defaults to `_none_`.
`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.


[float]
=== Example configuration

In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words, and to emit a second token in
the presence of non-ASCII characters:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_",
          "preserve_original": true
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------
// CONSOLE

The above example produces the following two terms:

[source,text]
---------------------------
[ consistent godel said sentence yes, consistent gödel said sentence yes ]
---------------------------