[[analysis-fingerprint-analyzer]]
=== Fingerprint Analyzer

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

The `fingerprint` analyzer is composed of a
<<analysis-standard-tokenizer,Standard Tokenizer>> and four token filters (in
this order): <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>,
<<analysis-asciifolding-tokenfilter,ASCII Folding Token Filter>>,
<<analysis-stop-tokenfilter,Stop Token Filter>> and
<<analysis-fingerprint-tokenfilter,Fingerprint Token Filter>>.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.

For example, the sentence:

____
"Yes yes, Gödel said this sentence is consistent and."
____

will be transformed into the token:

`"and consistent godel is said sentence this yes"`

Notice how the words are all lowercased, the umlaut in "gödel" has been
normalized to "godel", punctuation has been removed, and "yes" has been
de-duplicated.

The `fingerprint` analyzer has these configurable settings:

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description

|`separator` |The character that separates the tokens after concatenation.
Defaults to a space.

|`max_output_size` |The maximum token size to emit. Defaults to `255`. See
the <<analysis-fingerprint-tokenfilter,Fingerprint Token Filter>> for
details.

|`preserve_original` |If `true`, emits both the original and folded versions
of tokens that contain extended characters. Defaults to `false`.

|`stopwords` |A list of stop words to use. Defaults to an empty list
(`_none_`).

|`stopwords_path` |A path (either relative to the `config` location, or
absolute) to a stopwords file. Each stop word should be on its own line
(separated by a line break). The file must be UTF-8 encoded.

|=======================================================================
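The transformation described above can be checked with the Analyze API. A
minimal sketch, assuming a running node and an Elasticsearch version in which
`_analyze` accepts a JSON request body:

[source,js]
--------------------------------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
--------------------------------------------------

The response should contain the single fingerprint token shown above:
`and consistent godel is said sentence this yes`.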
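To illustrate the settings in the table, here is a sketch of a custom
analyzer configuration. The index name `my_index` and the analyzer name
`my_fingerprint_analyzer` are hypothetical, and `_english_` is assumed here
as one of the pre-defined stop word lists accepted by the `stopwords`
setting:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "separator": " ",
          "max_output_size": 255,
          "stopwords": "_english_"
        }
      }
    }
  }
}
--------------------------------------------------

With an English stop word list configured, the example sentence above should
instead produce `consistent godel said sentence yes`, since "and", "is" and
"this" are removed before the fingerprint is built.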