[[index-modules-similarity]]
== Similarity module

A similarity (scoring / ranking model) defines how matching documents
are scored. Similarity is per field, meaning that via the mapping one
can define a different similarity per field.

Configuring a custom similarity is considered an expert feature and the
built-in similarities are most likely sufficient, as described in
<<similarity>>.

[float]
[[configuration]]
=== Configuring a similarity

Most existing or custom similarities have options that can be
configured via the index settings, as shown below. These settings
can be provided when creating an index or when updating index
settings.

[source,js]
--------------------------------------------------
PUT /index
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_similarity" : {
          "type" : "DFR",
          "basic_model" : "g",
          "after_effect" : "l",
          "normalization" : "h2",
          "normalization.h2.c" : "3.0"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

Here we configure the DFR similarity so that it can be referenced as
`my_similarity` in mappings, as illustrated in the example below:

[source,js]
--------------------------------------------------
PUT /index/_mapping/_doc
{
  "properties" : {
    "title" : { "type" : "text", "similarity" : "my_similarity" }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

[float]
=== Available similarities

[float]
[[bm25]]
==== BM25 similarity (*default*)

TF/IDF based similarity that has built-in tf normalization and
is supposed to work better for short fields (like names). See
http://en.wikipedia.org/wiki/Okapi_BM25[Okapi_BM25] for more details.
This similarity has the following options:

[horizontal]
`k1`::
    Controls non-linear term frequency normalization
    (saturation). The default value is `1.2`.

`b`::
    Controls to what degree document length normalizes tf values.
    The default value is `0.75`.

`discount_overlaps`::
    Determines whether overlap tokens (tokens with
    a position increment of 0) are ignored when computing norms. By default this
    is true, meaning overlap tokens do not count when computing norms.

Type name: `BM25`

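To make the roles of `k1` and `b` more concrete, the following is a minimal Python sketch of the textbook BM25 score for a single term. It is an illustration under stated assumptions, not the Lucene implementation: the function name and its parameters are hypothetical, and the exact constants used by the real scorer may differ between versions. With `b` set to `0` the document length no longer matters, and as `freq` grows the term-frequency part levels off below `k1 + 1`.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document (illustrative)."""
    # Inverse document frequency: rarer terms weigh more.
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # b interpolates between no length normalization (b=0) and full (b=1).
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences stop adding score.
    tf_part = freq * (k1 + 1.0) / (freq + k1 * length_norm)
    return idf * tf_part


if __name__ == "__main__":
    # Hypothetical statistics: 1000 documents, term present in 10 of them.
    for freq in (1, 2, 5, 50):
        print(freq, round(bm25_term_score(freq, 100, 100, 1000, 10), 4))
```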
[float]
[[classic-similarity]]
==== Classic similarity

The classic similarity that is based on the TF/IDF model. This
similarity has the following option:

`discount_overlaps`::
    Determines whether overlap tokens (tokens with
    a position increment of 0) are ignored when computing norms. By default this
    is true, meaning overlap tokens do not count when computing norms.

Type name: `classic`

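The effect of `discount_overlaps` can be sketched in a few lines of Python. This is only an illustration: the helper name `norm_length` and the list-of-increments input are hypothetical, chosen to show which tokens count toward the field length that feeds the norm.

```python
def norm_length(position_increments, discount_overlaps=True):
    """Field length used for norms, given each token's position increment."""
    total = len(position_increments)
    # Overlap tokens sit at the same position as the previous token.
    overlaps = sum(1 for inc in position_increments if inc == 0)
    return total - overlaps if discount_overlaps else total


if __name__ == "__main__":
    # "quick fox" where a synonym filter stacks "fast" on top of "quick".
    increments = [1, 0, 1]  # quick, fast (same position), fox
    print(norm_length(increments))         # overlap token ignored
    print(norm_length(increments, False))  # overlap token counted
```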
[float]
[[drf]]
==== DFR similarity

Similarity that implements the
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
from randomness] framework. This similarity has the following options:

[horizontal]
`basic_model`::
    Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.

`after_effect`::
    Possible values: `no`, `b` and `l`.

`normalization`::
    Possible values: `no`, `h1`, `h2`, `h3` and `z`.

All `normalization` options except the first (`no`) need a normalization
value, set as in the `normalization.h2.c` line of the configuration example above.

Type name: `DFR`

[float]
[[dfi]]
==== DFI similarity

Similarity that implements the http://trec.nist.gov/pubs/trec21/papers/irra.web.nb.pdf[divergence from independence]
model.
This similarity has the following options:

[horizontal]
`independence_measure`:: Possible values: `standardized`, `saturated` and `chisquared`.

Type name: `DFI`

[float]
[[ib]]
==== IB similarity

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/IBSimilarity.html[Information
based model]. The algorithm is based on the concept that the information content in any symbolic 'distribution'
sequence is primarily determined by the repetitive usage of its basic elements.
For written texts this challenge would correspond to comparing the writing styles of different authors.
This similarity has the following options:

[horizontal]
`distribution`::  Possible values: `ll` and `spl`.
`lambda`::        Possible values: `df` and `ttf`.
`normalization`:: Same as in `DFR` similarity.

Type name: `IB`

[float]
[[lm_dirichlet]]
==== LM Dirichlet similarity

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
Dirichlet similarity]. This similarity has the following options:

[horizontal]
`mu`::  Defaults to `2000`.

Type name: `LMDirichlet`

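As a rough illustration of what `mu` does, here is a minimal Python sketch of Dirichlet-prior smoothing for one term. The function name and the statistics are hypothetical, and the formula is the standard language-model form written from memory rather than copied from the Lucene class, so treat it as an assumption. A larger `mu` pulls the score toward the collection-wide term probability, which dampens the effect of the in-document frequency.

```python
import math


def dirichlet_term_score(freq, doc_len, collection_prob, mu=2000.0):
    """Dirichlet-smoothed score of one term in one document (illustrative)."""
    # How much more often the term occurs than the collection model predicts.
    term_part = math.log(1.0 + freq / (mu * collection_prob))
    # Length correction: longer documents rely less on the prior.
    length_part = math.log(mu / (doc_len + mu))
    # Negative scores are clamped to zero in this sketch.
    return max(0.0, term_part + length_part)


if __name__ == "__main__":
    # Hypothetical statistics: 3 occurrences in a 100-token document.
    for mu in (200.0, 2000.0):
        print(mu, round(dirichlet_term_score(3, 100, 0.001, mu), 4))
```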
[float]
[[lm_jelinek_mercer]]
==== LM Jelinek Mercer similarity

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
Jelinek Mercer similarity]. The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:

[horizontal]
`lambda`::  The optimal value depends on both the collection and the query. The optimal value is around `0.1`
for title queries and `0.7` for long queries. Defaults to `0.1`. When the value approaches `0`, documents that match more query terms will be ranked higher than those that match fewer terms.

Type name: `LMJelinekMercer`

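The remark about small `lambda` values can be checked with a short Python sketch of Jelinek-Mercer smoothing. The function names and the statistics are hypothetical, and the formula is the usual textbook form, assumed rather than taken from the Lucene source. With `lambda` near `0`, a document that matches both query terms once outranks one that matches a single term five times; with a large `lambda` the order flips.

```python
import math


def jm_term_score(freq, doc_len, collection_prob, lam=0.1):
    """Jelinek-Mercer smoothed score of one query term in one document."""
    doc_part = (1.0 - lam) * freq / doc_len
    background = lam * collection_prob
    return math.log(1.0 + doc_part / background)


def jm_score(freqs, doc_len, collection_prob, lam):
    """Sum the per-term scores; a term with freq 0 contributes log(1) = 0."""
    return sum(jm_term_score(f, doc_len, collection_prob, lam) for f in freqs)


if __name__ == "__main__":
    both_terms_once = [1, 1]  # document matching both query terms once
    one_term_often = [5, 0]   # document matching one term five times
    for lam in (0.01, 0.9):
        a = jm_score(both_terms_once, 10, 0.01, lam)
        b = jm_score(one_term_often, 10, 0.01, lam)
        print(lam, round(a, 3), round(b, 3))
```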
[float]
[[scripted_similarity]]
==== Scripted similarity

A similarity that allows you to use a script in order to specify how scores
should be computed. For instance, the example below shows how to reimplement
TF-IDF:

[source,js]
--------------------------------------------------
PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "field": {
          "type": "text",
          "similarity": "scripted_tfidf"
        }
      }
    }
  }
}

PUT /index/_doc/1
{
  "field": "foo bar foo"
}

PUT /index/_doc/2
{
  "field": "bar baz"
}

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}
--------------------------------------------------
// CONSOLE

Which yields:

[source,js]
--------------------------------------------------
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.9508477,
    "hits": [
      {
        "_shard": "[index][0]",
        "_node": "OzrdjxNtQGaqs4DmioFw9A",
        "_index": "index",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.9508477,
        "_source": {
          "field": "foo bar foo"
        },
        "_explanation": {
          "value": 1.9508477,
          "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.9508477,
              "description": "score from ScriptedSimilarity(weightScript=[null], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;', options={}, params={}}]) computed from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "weight",
                  "details": []
                },
                {
                  "value": 1.7,
                  "description": "query.boost",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "field.docCount",
                  "details": []
                },
                {
                  "value": 4.0,
                  "description": "field.sumDocFreq",
                  "details": []
                },
                {
                  "value": 5.0,
                  "description": "field.sumTotalTermFreq",
                  "details": []
                },
                {
                  "value": 1.0,
                  "description": "term.docFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "term.totalTermFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "doc.freq",
                  "details": []
                },
                {
                  "value": 3.0,
                  "description": "doc.length",
                  "details": []
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 12/"took" : $body.took/]
// TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]

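As a sanity check, the reported score can be recomputed by hand from the statistics listed in the explanation. The Python function below is a plain transcription of the Painless script, with a hypothetical name and argument order; it is a sketch for verifying the arithmetic, not part of Elasticsearch. The result agrees with `1.9508477` up to single-precision rounding.

```python
import math


def scripted_tfidf(boost, doc_count, doc_freq, freq, length):
    """Python transcription of the Painless TF-IDF script above."""
    tf = math.sqrt(freq)
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0
    norm = 1 / math.sqrt(length)
    return boost * tf * idf * norm


if __name__ == "__main__":
    # Statistics copied from the explain output for document 1:
    # query.boost, field.docCount, term.docFreq, doc.freq, doc.length.
    score = scripted_tfidf(boost=1.7, doc_count=2.0, doc_freq=1.0,
                           freq=2.0, length=3.0)
    print(score)
```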
You might have noticed that a significant part of the script depends on
statistics that are the same for every document. It is possible to make the
above slightly more efficient by providing a `weight_script` which will
compute the document-independent part of the score and will be available
under the `weight` variable. When no `weight_script` is provided, `weight`
is equal to `1`. The `weight_script` has access to the same variables as
the `script` except `doc` since it is supposed to compute a
document-independent contribution to the score.

The configuration below will give the same tf-idf scores but is slightly
more efficient:

[source,js]
--------------------------------------------------
PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "weight_script": {
          "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;"
        },
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "field": {
          "type": "text",
          "similarity": "scripted_tfidf"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

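The split between `weight_script` and `script` can be mirrored in Python to show why it saves work: the document-independent factor is computed once per term and reused for every matching document. The function names and the second `(freq, length)` pair are hypothetical; the first pair uses the statistics of document 1 from the earlier example and reproduces its score of about `1.9508477`.

```python
import math


def weight(boost, doc_count, doc_freq):
    """Document-independent part: computed once, like weight_script."""
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0
    return boost * idf


def score(weight_value, freq, length):
    """Document-dependent part: computed per matching document, like script."""
    tf = math.sqrt(freq)
    norm = 1 / math.sqrt(length)
    return weight_value * tf * norm


if __name__ == "__main__":
    w = weight(boost=1.7, doc_count=2.0, doc_freq=1.0)  # evaluated once
    # (freq, length) pairs; the second pair is a made-up extra document.
    for freq, length in [(2.0, 3.0), (1.0, 8.0)]:
        print(round(score(w, freq, length), 6))
```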
////////////////////

[source,js]
--------------------------------------------------
PUT /index/_doc/1
{
  "field": "foo bar foo"
}

PUT /index/_doc/2
{
  "field": "bar baz"
}

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

[source,js]
--------------------------------------------------
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.9508477,
    "hits": [
      {
        "_shard": "[index][0]",
        "_node": "OzrdjxNtQGaqs4DmioFw9A",
        "_index": "index",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.9508477,
        "_source": {
          "field": "foo bar foo"
        },
        "_explanation": {
          "value": 1.9508477,
          "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.9508477,
              "description": "score from ScriptedSimilarity(weightScript=[Script{type=inline, lang='painless', idOrCode='double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;', options={}, params={}}], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;', options={}, params={}}]) computed from:",
              "details": [
                {
                  "value": 2.3892908,
                  "description": "weight",
                  "details": []
                },
                {
                  "value": 1.7,
                  "description": "query.boost",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "field.docCount",
                  "details": []
                },
                {
                  "value": 4.0,
                  "description": "field.sumDocFreq",
                  "details": []
                },
                {
                  "value": 5.0,
                  "description": "field.sumTotalTermFreq",
                  "details": []
                },
                {
                  "value": 1.0,
                  "description": "term.docFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "term.totalTermFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "doc.freq",
                  "details": []
                },
                {
                  "value": 3.0,
                  "description": "doc.length",
                  "details": []
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 1/"took" : $body.took/]
// TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]

////////////////////

Type name: `scripted`

[float]
[[default-base]]
==== Default Similarity

By default, Elasticsearch will use whatever similarity is configured as
`default`.

You can change the default similarity for all fields in an index when
it is <<indices-create-index,created>>:

[source,js]
--------------------------------------------------
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "classic"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

If you want to change the default similarity after creating the index,
you must <<indices-open-close,close>> your index, send the following
request and <<indices-open-close,open>> it again afterwards:

[source,js]
--------------------------------------------------
POST /index/_close

PUT /index/_settings
{
  "index": {
    "similarity": {
      "default": {
        "type": "classic"
      }
    }
  }
}

POST /index/_open
--------------------------------------------------
// CONSOLE
// TEST[continued]