169 lines
5.3 KiB
Plaintext
169 lines
5.3 KiB
Plaintext
[[index-modules-similarity]]
|
|
== Similarity module
|
|
|
|
A similarity (scoring / ranking model) defines how matching documents
|
|
are scored. Similarity is per field, meaning that via the mapping one
|
|
can define a different similarity per field.
|
|
|
|
Configuring a custom similarity is considered a expert feature and the
|
|
builtin similarities are most likely sufficient as is described in
|
|
<<similarity>>.
|
|
|
|
[float]
|
|
[[configuration]]
|
|
=== Configuring a similarity
|
|
|
|
Most existing or custom Similarities have configuration options which
|
|
can be configured via the index settings as shown below. The index
|
|
options can be provided when creating an index or updating index
|
|
settings.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
"similarity" : {
|
|
"my_similarity" : {
|
|
"type" : "DFR",
|
|
"basic_model" : "g",
|
|
"after_effect" : "l",
|
|
"normalization" : "h2",
|
|
"normalization.h2.c" : "3.0"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Here we configure the DFRSimilarity so it can be referenced as
|
|
`my_similarity` in mappings as is illustrate in the below example:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"book" : {
|
|
"properties" : {
|
|
"title" : { "type" : "string", "similarity" : "my_similarity" }
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
[float]
|
|
=== Available similarities
|
|
|
|
[float]
|
|
[[classic-similarity]]
|
|
==== Classic similarity
|
|
|
|
The classic similarity that is based on the TF/IDF model. This
|
|
similarity has the following option:
|
|
|
|
`discount_overlaps`::
|
|
Determines whether overlap tokens (Tokens with
|
|
0 position increment) are ignored when computing norm. By default this
|
|
is true, meaning overlap tokens do not count when computing norms.
|
|
|
|
Type name: `classic`
|
|
|
|
[float]
|
|
[[bm25]]
|
|
==== BM25 similarity
|
|
|
|
Another TF/IDF based similarity that has built-in tf normalization and
|
|
is supposed to work better for short fields (like names). See
|
|
http://en.wikipedia.org/wiki/Okapi_BM25[Okapi_BM25] for more details.
|
|
This similarity has the following options:
|
|
|
|
[horizontal]
|
|
`k1`::
|
|
Controls non-linear term frequency normalization
|
|
(saturation).
|
|
|
|
`b`::
|
|
Controls to what degree document length normalizes tf values.
|
|
|
|
`discount_overlaps`::
|
|
Determines whether overlap tokens (Tokens with
|
|
0 position increment) are ignored when computing norm. By default this
|
|
is true, meaning overlap tokens do not count when computing norms.
|
|
|
|
Type name: `BM25`
|
|
|
|
[float]
|
|
[[drf]]
|
|
==== DFR similarity
|
|
|
|
Similarity that implements the
|
|
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
|
|
from randomness] framework. This similarity has the following options:
|
|
|
|
[horizontal]
|
|
`basic_model`::
|
|
Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.
|
|
|
|
`after_effect`::
|
|
Possible values: `no`, `b` and `l`.
|
|
|
|
`normalization`::
|
|
Possible values: `no`, `h1`, `h2`, `h3` and `z`.
|
|
|
|
All options but the first option need a normalization value.
|
|
|
|
Type name: `DFR`
|
|
|
|
[float]
|
|
[[ib]]
|
|
==== IB similarity.
|
|
|
|
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/IBSimilarity.html[Information
|
|
based model] . The algorithm is based on the concept that the information content in any symbolic 'distribution'
|
|
sequence is primarily determined by the repetitive usage of its basic elements.
|
|
For written texts this challenge would correspond to comparing the writing styles of diferent authors.
|
|
This similarity has the following options:
|
|
|
|
[horizontal]
|
|
`distribution`:: Possible values: `ll` and `spl`.
|
|
`lambda`:: Possible values: `df` and `ttf`.
|
|
`normalization`:: Same as in `DFR` similarity.
|
|
|
|
Type name: `IB`
|
|
|
|
[float]
|
|
[[lm_dirichlet]]
|
|
==== LM Dirichlet similarity.
|
|
|
|
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
|
|
Dirichlet similarity] . This similarity has the following options:
|
|
|
|
[horizontal]
|
|
`mu`:: Default to `2000`.
|
|
|
|
Type name: `LMDirichlet`
|
|
|
|
[float]
|
|
[[lm_jelinek_mercer]]
|
|
==== LM Jelinek Mercer similarity.
|
|
|
|
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
|
|
Jelinek Mercer similarity] . The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:
|
|
|
|
[horizontal]
|
|
`lambda`:: The optimal value depends on both the collection and the query. The optimal value is around `0.1`
|
|
for title queries and `0.7` for long queries. Default to `0.1`. When value approaches `0`, documents that match more query terms will be ranked higher than those that match fewer terms.
|
|
|
|
Type name: `LMJelinekMercer`
|
|
|
|
[float]
|
|
[[default-base]]
|
|
==== Default and Base Similarities
|
|
|
|
By default, Elasticsearch will use whatever similarity is configured as
|
|
`default`. However, the similarity functions `queryNorm()` and `coord()`
|
|
are not per-field. Consequently, for expert users wanting to change the
|
|
implementation used for these two methods, while not changing the
|
|
`default`, it is possible to configure a similarity with the name
|
|
`base`. This similarity will then be used for the two methods.
|
|
|
|
You can change the default similarity for all fields by putting the following setting into `elasticsearch.yml`:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
index.similarity.default.type: BM25
|
|
--------------------------------------------------
|