[[index-modules-similarity]]
== Similarity module

A similarity (scoring / ranking model) defines how matching documents
are scored. Similarity is per field, meaning that via the mapping one
can define a different similarity per field.

Configuring a custom similarity is considered an expert feature and the
built-in similarities are most likely sufficient, as described in
<<similarity>>.

[discrete]
[[configuration]]
=== Configuring a similarity

Most existing or custom similarities have configuration options which
can be configured via the index settings, as shown below. The index
options can be provided when creating an index or when updating index
settings.

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "DFR",
          "basic_model": "g",
          "after_effect": "l",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}
--------------------------------------------------

Here we configure the DFR similarity so that it can be referenced as
`my_similarity` in mappings, as illustrated in the example below:

[source,console]
--------------------------------------------------
PUT /index/_mapping
{
  "properties" : {
    "title" : { "type" : "text", "similarity" : "my_similarity" }
  }
}
--------------------------------------------------
// TEST[continued]

[discrete]
=== Available similarities

[discrete]
[[bm25]]
==== BM25 similarity (*default*)

TF/IDF-based similarity that has built-in tf normalization and
is supposed to work better for short fields (like names). See
http://en.wikipedia.org/wiki/Okapi_BM25[Okapi_BM25] for more details.
This similarity has the following options:

[horizontal]
`k1`::
    Controls non-linear term frequency normalization
    (saturation). The default value is `1.2`.

`b`::
    Controls to what degree document length normalizes tf values.
    The default value is `0.75`.

`discount_overlaps`::
    Determines whether overlap tokens (tokens with a position increment
    of 0) are ignored when computing norms. By default this is `true`,
    meaning overlap tokens do not count when computing norms.
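
For example, a BM25-typed similarity with custom `k1` and `b` values can be
registered in the index settings in the same way as the DFR example above.
This is only a minimal sketch; the index and similarity names and the
parameter values are purely illustrative:

[source,console]
--------------------------------------------------
PUT /bm25_example
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": {
          "type": "BM25",
          "k1": 1.3,
          "b": 0.6,
          "discount_overlaps": true
        }
      }
    }
  }
}
--------------------------------------------------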

Type name: `BM25`

[discrete]
[[dfr]]
==== DFR similarity

Similarity that implements the
{lucene-core-javadoc}/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
from randomness] framework. This similarity has the following options:

[horizontal]
`basic_model`::
    Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelG.html[`g`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIF.html[`if`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIn.html[`in`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIne.html[`ine`].

`after_effect`::
    Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectB.html[`b`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectL.html[`l`].

`normalization`::
    Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/Normalization.NoNormalization.html[`no`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH1.html[`h1`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH2.html[`h2`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH3.html[`h3`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationZ.html[`z`].

All `normalization` values except the first (`no`) need a normalization value
to be set, for example `normalization.h2.c` in the configuration example above.

Type name: `DFR`

[discrete]
[[dfi]]
==== DFI similarity

Similarity that implements the http://trec.nist.gov/pubs/trec21/papers/irra.web.nb.pdf[divergence from independence]
model. This similarity has the following options:

[horizontal]
`independence_measure`:: Possible values:
{lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceStandardized.html[`standardized`],
{lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceSaturated.html[`saturated`] and
{lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceChiSquared.html[`chisquared`].

When using this similarity, it is highly recommended *not* to remove stop words
in order to get good relevance. Also beware that terms whose frequency is less
than the expected frequency will get a score equal to 0.
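
As a sketch (the index and similarity names are hypothetical), a DFI-typed
similarity could be registered like this:

[source,console]
--------------------------------------------------
PUT /dfi_example
{
  "settings": {
    "index": {
      "similarity": {
        "my_dfi": {
          "type": "DFI",
          "independence_measure": "standardized"
        }
      }
    }
  }
}
--------------------------------------------------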

Type name: `DFI`

[discrete]
[[ib]]
==== IB similarity

{lucene-core-javadoc}/org/apache/lucene/search/similarities/IBSimilarity.html[Information-based model].
The algorithm is based on the concept that the information content in any symbolic 'distribution'
sequence is primarily determined by the repetitive usage of its basic elements.
For written texts, this would correspond to comparing the writing styles of different authors.
This similarity has the following options:

[horizontal]
`distribution`:: Possible values:
{lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionLL.html[`ll`] and
{lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionSPL.html[`spl`].
`lambda`:: Possible values:
{lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaDF.html[`df`] and
{lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaTTF.html[`ttf`].
`normalization`:: Same as in `DFR` similarity.
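
A configuration sketch for an IB-typed similarity, following the same pattern
as the DFR example at the top of this page (hypothetical names; the option
values are only illustrative):

[source,console]
--------------------------------------------------
PUT /ib_example
{
  "settings": {
    "index": {
      "similarity": {
        "my_ib": {
          "type": "IB",
          "distribution": "ll",
          "lambda": "df",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}
--------------------------------------------------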

Type name: `IB`

[discrete]
[[lm_dirichlet]]
==== LM Dirichlet similarity

{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
Dirichlet similarity]. This similarity has the following options:

[horizontal]
`mu`:: Defaults to `2000`.

The scoring formula in the paper assigns negative scores to terms that have
fewer occurrences than predicted by the language model; this is not allowed in
Lucene, so such terms get a score of 0.
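
A configuration sketch (hypothetical names; the `mu` value is only
illustrative):

[source,console]
--------------------------------------------------
PUT /lm_dirichlet_example
{
  "settings": {
    "index": {
      "similarity": {
        "my_lm_dirichlet": {
          "type": "LMDirichlet",
          "mu": 1000
        }
      }
    }
  }
}
--------------------------------------------------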

Type name: `LMDirichlet`

[discrete]
[[lm_jelinek_mercer]]
==== LM Jelinek Mercer similarity

{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
Jelinek Mercer similarity]. The algorithm attempts to capture important patterns in the text while leaving out noise. This similarity has the following options:

[horizontal]
`lambda`:: The optimal value depends on both the collection and the query. The
optimal value is around `0.1` for title queries and `0.7` for long queries.
Defaults to `0.1`. As the value approaches `0`, documents that match more query
terms will be ranked higher than those that match fewer terms.
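
A configuration sketch (hypothetical names; as noted above, a larger `lambda`
such as `0.7` may suit long queries):

[source,console]
--------------------------------------------------
PUT /lm_jelinek_mercer_example
{
  "settings": {
    "index": {
      "similarity": {
        "my_lm_jelinek_mercer": {
          "type": "LMJelinekMercer",
          "lambda": 0.7
        }
      }
    }
  }
}
--------------------------------------------------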

Type name: `LMJelinekMercer`

[discrete]
[[scripted_similarity]]
==== Scripted similarity

A similarity that allows you to use a script in order to specify how scores
should be computed. For instance, the example below shows how to reimplement
TF-IDF:

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field": {
        "type": "text",
        "similarity": "scripted_tfidf"
      }
    }
  }
}

PUT /index/_doc/1
{
  "field": "foo bar foo"
}

PUT /index/_doc/2
{
  "field": "bar baz"
}

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}
--------------------------------------------------

Which yields:

[source,console-result]
--------------------------------------------------
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.9508477,
    "hits": [
      {
        "_shard": "[index][0]",
        "_node": "OzrdjxNtQGaqs4DmioFw9A",
        "_index": "index",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.9508477,
        "_source": {
          "field": "foo bar foo"
        },
        "_explanation": {
          "value": 1.9508477,
          "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.9508477,
              "description": "score from ScriptedSimilarity(weightScript=[null], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;', options={}, params={}}]) computed from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "weight",
                  "details": []
                },
                {
                  "value": 1.7,
                  "description": "query.boost",
                  "details": []
                },
                {
                  "value": 2,
                  "description": "field.docCount",
                  "details": []
                },
                {
                  "value": 4,
                  "description": "field.sumDocFreq",
                  "details": []
                },
                {
                  "value": 5,
                  "description": "field.sumTotalTermFreq",
                  "details": []
                },
                {
                  "value": 1,
                  "description": "term.docFreq",
                  "details": []
                },
                {
                  "value": 2,
                  "description": "term.totalTermFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "doc.freq",
                  "details": []
                },
                {
                  "value": 3,
                  "description": "doc.length",
                  "details": []
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 12/"took" : $body.took/]
// TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]

WARNING: While scripted similarities provide a lot of flexibility, there is
a set of rules that they need to satisfy. Failing to do so could make
Elasticsearch silently return wrong top hits or fail with internal errors at
search time:

- Returned scores must be positive.
- All other variables remaining equal, scores must not decrease when
  `doc.freq` increases.
- All other variables remaining equal, scores must not increase when
  `doc.length` increases.
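
For instance, the following sketch (hypothetical index and similarity names,
and a deliberately crude frequency-based score) respects all three rules: the
score stays positive, grows with `doc.freq` and shrinks with `doc.length`. A
script such as `return doc.freq - doc.length;`, by contrast, could return
negative scores and would break the first rule.

[source,console]
--------------------------------------------------
PUT /rules_example
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_rules_ok": {
        "type": "scripted",
        "script": {
          "source": "return query.boost * Math.sqrt(doc.freq) / Math.sqrt(doc.length);"
        }
      }
    }
  }
}
--------------------------------------------------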

You might have noticed that a significant part of the TF-IDF script above
depends on statistics that are the same for every document. It is possible to
make it slightly more efficient by providing a `weight_script`, which computes
the document-independent part of the score and is made available under the
`weight` variable. When no `weight_script` is provided, `weight` is equal to
`1`. The `weight_script` has access to the same variables as the `script`,
except `doc`, since it is supposed to compute a document-independent
contribution to the score.

The configuration below gives the same tf-idf scores but is slightly
more efficient:

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "weight_script": {
          "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;"
        },
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field": {
        "type": "text",
        "similarity": "scripted_tfidf"
      }
    }
  }
}
--------------------------------------------------

////////////////////

[source,console]
--------------------------------------------------
PUT /index/_doc/1
{
  "field": "foo bar foo"
}

PUT /index/_doc/2
{
  "field": "bar baz"
}

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}
--------------------------------------------------
// TEST[continued]

[source,console-result]
--------------------------------------------------
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.9508477,
    "hits": [
      {
        "_shard": "[index][0]",
        "_node": "OzrdjxNtQGaqs4DmioFw9A",
        "_index": "index",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.9508477,
        "_source": {
          "field": "foo bar foo"
        },
        "_explanation": {
          "value": 1.9508477,
          "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.9508477,
              "description": "score from ScriptedSimilarity(weightScript=[Script{type=inline, lang='painless', idOrCode='double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;', options={}, params={}}], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;', options={}, params={}}]) computed from:",
              "details": [
                {
                  "value": 2.3892908,
                  "description": "weight",
                  "details": []
                },
                {
                  "value": 1.7,
                  "description": "query.boost",
                  "details": []
                },
                {
                  "value": 2,
                  "description": "field.docCount",
                  "details": []
                },
                {
                  "value": 4,
                  "description": "field.sumDocFreq",
                  "details": []
                },
                {
                  "value": 5,
                  "description": "field.sumTotalTermFreq",
                  "details": []
                },
                {
                  "value": 1,
                  "description": "term.docFreq",
                  "details": []
                },
                {
                  "value": 2,
                  "description": "term.totalTermFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "doc.freq",
                  "details": []
                },
                {
                  "value": 3,
                  "description": "doc.length",
                  "details": []
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 1/"took" : $body.took/]
// TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]

////////////////////

Type name: `scripted`

[discrete]
[[default-base]]
==== Default similarity

By default, Elasticsearch will use whatever similarity is configured as
`default`.

You can change the default similarity for all fields in an index when
it is <<indices-create-index,created>>:

[source,console]
--------------------------------------------------
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "boolean"
        }
      }
    }
  }
}
--------------------------------------------------

If you want to change the default similarity after creating the index,
you must <<indices-open-close,close>> your index, send the following
request and <<indices-open-close,open>> it again afterwards:

[source,console]
--------------------------------------------------
POST /index/_close

PUT /index/_settings
{
  "index": {
    "similarity": {
      "default": {
        "type": "boolean"
      }
    }
  }
}

POST /index/_open
--------------------------------------------------
// TEST[continued]
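
The `default` entry accepts the same configuration options as any other named
similarity. As a sketch (the index name and parameter values are only
illustrative), you could make a BM25 similarity with custom parameters the
index-wide default at creation time:

[source,console]
--------------------------------------------------
PUT /another_index
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "BM25",
          "k1": 1.2,
          "b": 0.5
        }
      }
    }
  }
}
--------------------------------------------------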