[[analysis-fingerprint-analyzer]]
=== Fingerprint Analyzer

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stopword list is
configured, stop words will also be removed.

[float]
=== Definition

It consists of:

Tokenizer::
* <>

Token Filters (in order)::
1. <>
2. <>
3. <> (disabled by default)
4. <>

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------
// CONSOLE

The above sentence would produce the following single term:

[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------

[float]
=== Configuration

The `fingerprint` analyzer accepts the following parameters:

[horizontal]
`separator`::

    The character used to concatenate the terms. Defaults to a space.

`max_output_size`::

    The maximum token size to emit. Defaults to `255`. Tokens larger than
    this size will be discarded.

`preserve_original`::

    If `true`, emits two tokens: one with ASCII-folding of terms that contain
    extended characters (if any) and one with the original characters.
    Defaults to `false`.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <> for more information about stop word configuration.

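The processing described above, together with the `separator`,
`max_output_size` and `stopwords` parameters, can be approximated outside
Elasticsearch with a short Python sketch. This is an illustration only, not
the analyzer's actual implementation: the function names `ascii_fold` and
`fingerprint` are hypothetical, the regular-expression tokenizer and the
Unicode-decomposition folding are rough stand-ins for the real tokenizer and
ASCII-folding filter, the size limit is assumed to be counted in characters,
and `preserve_original` is not modelled.

```python
import re
import unicodedata


def ascii_fold(term):
    # Rough stand-in for ASCII folding: decompose accented characters
    # and drop the combining marks (e.g. "gödel" -> "godel").
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(text, separator=" ", max_output_size=255, stopwords=()):
    # 1. Lowercase and split into word tokens (simplified tokenizer).
    terms = re.findall(r"\w+", text.lower())
    # 2. Fold extended characters to their ASCII equivalents.
    terms = [ascii_fold(term) for term in terms]
    # 3. Remove stop words, if any were supplied (none by default).
    stop = set(stopwords)
    terms = [term for term in terms if term not in stop]
    # 4. Deduplicate, sort and concatenate into a single token.
    token = separator.join(sorted(set(terms)))
    # Assumption: an oversized token is dropped entirely, not truncated.
    if len(token) > max_output_size:
        return None
    return token


if __name__ == "__main__":
    sentence = "Yes yes, G\u00f6del said this sentence is consistent and."
    print(fingerprint(sentence))
    print(fingerprint(sentence, separator="+", stopwords=["and", "is", "this"]))
```

Passing a small `max_output_size` makes the sketch return `None`, which
mirrors the documented behaviour of discarding tokens that exceed the limit.
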
[float]
=== Example configuration

In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words, and to emit a second token in
the presence of non-ASCII characters:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_",
          "preserve_original": true
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------
// CONSOLE

The above example produces the following two terms:

[source,text]
---------------------------
[ consistent godel said sentence yes, consistent gödel said sentence yes ]
---------------------------