[[analysis-fingerprint-analyzer]]
=== Fingerprint Analyzer

The `fingerprint` analyzer implements a
https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint[fingerprinting algorithm]
which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted,
deduplicated and concatenated into a single token. If a stop word list is
configured, stop words are also removed.
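
The steps above can be sketched in a few lines of Python. This is an
illustrative approximation, not the analyzer's implementation: the `\w+`
regular expression stands in for the standard tokenizer, Unicode NFKD
normalization stands in for ASCII folding, and the function name
`fingerprint` is hypothetical.

```python
import re
import unicodedata


def fingerprint(text: str) -> str:
    """Approximate the fingerprint steps: lowercase, fold, dedupe, sort, join."""
    # Assumption: a simple word regex stands in for the standard tokenizer.
    tokens = re.findall(r"\w+", text.lower())
    # Assumption: NFKD decomposition plus dropping non-ASCII characters
    # approximates ASCII folding (for example, "ö" becomes "o").
    folded = (
        unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode("ascii")
        for token in tokens
    )
    # A set removes duplicates; sorting makes the result independent of word order.
    unique_sorted = sorted({token for token in folded if token})
    return " ".join(unique_sorted)


print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
```

For the sample sentence used on this page, the sketch prints
`and consistent godel is said sentence this yes`. Because the terms are sorted
and deduplicated, inputs that differ only in case, accents, punctuation, or
word order collapse to the same token, which is what makes the result useful
for clustering.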

[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "and consistent godel is said sentence this yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above request produces the following single term:

[source,text]
---------------------------
[ and consistent godel is said sentence this yes ]
---------------------------

[float]
=== Configuration

The `fingerprint` analyzer accepts the following parameters:

[horizontal]
`separator`::
The character to use to concatenate the terms. Defaults to a space.

`max_output_size`::
The maximum token size to emit. Defaults to `255`. Tokens larger than
this size will be discarded.

`stopwords`::
A pre-defined stop words list like `_english_` or an array containing a
list of stop words. Defaults to `_none_`.

`stopwords_path`::
The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
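
To make the effect of these parameters concrete, here is a minimal Python
sketch. It is not the analyzer's implementation: the function name
`configurable_fingerprint` is hypothetical, the tokenizing and folding steps
are rough approximations, the stop word list passed in is a small hand-picked
subset rather than the real `_english_` list, and measuring
`max_output_size` in characters of the joined output is an assumption.

```python
import re
import unicodedata
from typing import Iterable, Optional


def configurable_fingerprint(
    text: str,
    separator: str = " ",
    max_output_size: int = 255,
    stopwords: Iterable[str] = (),
) -> Optional[str]:
    """Return the fingerprint token, or None when no token is emitted."""
    stop = set(stopwords)
    # Assumption: a word regex approximates the standard tokenizer.
    tokens = re.findall(r"\w+", text.lower())
    # Assumption: NFKD normalization approximates ASCII folding.
    folded = (
        unicodedata.normalize("NFKD", t).encode("ascii", "ignore").decode("ascii")
        for t in tokens
    )
    # Stop words are removed after lowercasing and folding, before joining.
    kept = sorted({t for t in folded if t and t not in stop})
    if not kept:
        return None
    output = separator.join(kept)
    # Assumption: an oversized token is dropped entirely, not truncated.
    if len(output) > max_output_size:
        return None
    return output


sentence = "Yes yes, Gödel said this sentence is consistent and."
print(configurable_fingerprint(sentence, separator="+"))
print(configurable_fingerprint(sentence, stopwords=["and", "is", "this"]))
print(configurable_fingerprint(sentence, max_output_size=10))
```

The three calls print `and+consistent+godel+is+said+sentence+this+yes`,
`consistent godel said sentence yes`, and `None`. The last case illustrates
that a fingerprint longer than `max_output_size` is discarded rather than
shortened.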

[float]
=== Example configuration

In this example, we configure the `fingerprint` analyzer to use the
pre-defined list of English stop words:

[source,console]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
----------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "consistent godel said sentence yes",
      "start_offset": 0,
      "end_offset": 52,
      "type": "fingerprint",
      "position": 0
    }
  ]
}
----------------------------

/////////////////////

The above example produces the following term:

[source,text]
---------------------------
[ consistent godel said sentence yes ]
---------------------------

[float]
=== Definition

The `fingerprint` analyzer consists of:

Tokenizer::
* <<analysis-standard-tokenizer,Standard Tokenizer>>

Token Filters (in order)::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-asciifolding-tokenfilter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
* <<analysis-fingerprint-tokenfilter>>

If you need to customize the `fingerprint` analyzer beyond the configuration
parameters, recreate it as a `custom` analyzer and modify it, usually by
adding token filters. The following request recreates the built-in
`fingerprint` analyzer, and you can use it as a starting point for further
customization:

[source,console]
----------------------------------------------------
PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]