[[analysis-fingerprint-analyzer]] === Fingerprint analyzer ++++ Fingerprint ++++ The `fingerprint` analyzer implements a https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm] which is used by the OpenRefine project to assist in clustering. Input text is lowercased, normalized to remove extended characters, sorted, deduplicated and concatenated into a single token. If a stopword list is configured, stop words will also be removed. [discrete] === Example output [source,console] --------------------------- POST _analyze { "analyzer": "fingerprint", "text": "Yes yes, Gödel said this sentence is consistent and." } --------------------------- ///////////////////// [source,console-result] ---------------------------- { "tokens": [ { "token": "and consistent godel is said sentence this yes", "start_offset": 0, "end_offset": 52, "type": "fingerprint", "position": 0 } ] } ---------------------------- ///////////////////// The above sentence would produce the following single term: [source,text] --------------------------- [ and consistent godel is said sentence this yes ] --------------------------- [discrete] === Configuration The `fingerprint` analyzer accepts the following parameters: [horizontal] `separator`:: The character to use to concatenate the terms. Defaults to a space. `max_output_size`:: The maximum token size to emit. Defaults to `255`. Tokens larger than this size will be discarded. `stopwords`:: A pre-defined stop words list like `_english_` or an array containing a list of stop words. Defaults to `_none_`. `stopwords_path`:: The path to a file containing stop words. See the <> for more information about stop word configuration. [discrete] === Example configuration In this example, we configure the `fingerprint` analyzer to use the pre-defined list of English stop words: [source,console] ---------------------------- PUT my_index { "settings": { "analysis": { "analyzer": { "my_fingerprint_analyzer": { "type": "fingerprint", "stopwords": "_english_" } } } } } POST my_index/_analyze { "analyzer": "my_fingerprint_analyzer", "text": "Yes yes, Gödel said this sentence is consistent and." } ---------------------------- ///////////////////// [source,console-result] ---------------------------- { "tokens": [ { "token": "consistent godel said sentence yes", "start_offset": 0, "end_offset": 52, "type": "fingerprint", "position": 0 } ] } ---------------------------- ///////////////////// The above example produces the following term: [source,text] --------------------------- [ consistent godel said sentence yes ] --------------------------- [discrete] === Definition The `fingerprint` tokenizer consists of: Tokenizer:: * <> Token Filters (in order):: * <> * <> * <> (disabled by default) * <> If you need to customize the `fingerprint` analyzer beyond the configuration parameters then you need to recreate it as a `custom` analyzer and modify it, usually by adding token filters. This would recreate the built-in `fingerprint` analyzer and you can use it as a starting point for further customization: [source,console] ---------------------------------------------------- PUT /fingerprint_example { "settings": { "analysis": { "analyzer": { "rebuilt_fingerprint": { "tokenizer": "standard", "filter": [ "lowercase", "asciifolding", "fingerprint" ] } } } } } ---------------------------------------------------- // TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]