OpenSearch/docs/reference/analysis/analyzers/fingerprint-analyzer.asciidoc

[[analysis-fingerprint-analyzer]]
=== Fingerprint Analyzer

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters (for
example, `ö` becomes `o`), sorted, deduplicated, and concatenated into a
single token. If a stop word list is configured, stop words are also removed.

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "and consistent godel is said sentence this yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////


The above sentence would produce the following single term:

[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------

[float]
=== Configuration

The `fingerprint` analyzer accepts the following parameters:

[horizontal]
`separator`::

    The character to use to concatenate the terms.  Defaults to a space.

`max_output_size`::

    The maximum token size to emit.  Defaults to `255`. Tokens larger than
    this size will be discarded.

`stopwords`::

    A pre-defined stop word list such as `_english_`, or an array containing
    a list of stop words.  Defaults to `\_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.


[float]
=== Example configuration

In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words:

[source,js]
----------------------------
PUT my_index?include_type_name=true
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "consistent godel said sentence yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////


The above example produces the following term:

[source,text]
---------------------------
[ consistent godel said sentence yes ]
---------------------------
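
The other parameters can be set in the same way. The following is a minimal
sketch, not a tested configuration: the index name `my_separator_index` and
the analyzer name `my_custom_fingerprint` are hypothetical, and the values
chosen for `separator`, `max_output_size`, and `stopwords` are illustrative
only. It joins terms with a hyphen instead of a space, lowers the maximum
token size, and passes the stop words as an array:

[source,js]
----------------------------
PUT my_separator_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_fingerprint": {
          "type": "fingerprint",
          "separator": "-",
          "max_output_size": 100,
          "stopwords": [ "yes", "and" ]
        }
      }
    }
  }
}

POST my_separator_index/_analyze
{
  "analyzer": "my_custom_fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------
// CONSOLE

With these settings, the remaining terms should be joined by `-` in the
single output token. If the concatenated fingerprint were longer than
`max_output_size`, the token would be discarded rather than truncated.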

[float]
=== Definition

The `fingerprint` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter>>

If you need to customize the `fingerprint` analyzer beyond the configuration
parameters, recreate it as a `custom` analyzer and modify it, usually by
adding token filters. The following request recreates the built-in
`fingerprint` analyzer, and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /fingerprint_example?include_type_name=true
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
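
As one possible customization, the sketch below adds the built-in `stop`
token filter to the rebuilt analyzer, in the position given in the filter
list above. The index name `fingerprint_stop_example` and the analyzer name
`rebuilt_fingerprint_with_stop` are hypothetical. The built-in `stop` filter
is assumed here to use its default English stop word list:

[source,js]
----------------------------------------------------
PUT /fingerprint_stop_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint_with_stop": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "stop",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE

Placing `stop` before `fingerprint` matters: stop words have to be removed
before the remaining terms are sorted, deduplicated, and concatenated.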