OpenSearch/docs/reference/index-modules/similarity.asciidoc

166 lines
4.7 KiB
Plaintext

[[index-modules-similarity]]
== Similarity module
A similarity (scoring / ranking model) defines how matching documents
are scored. Similarity is per field, meaning that via the mapping one
can define a different similarity per field.
Configuring a custom similarity is considered a expert feature and the
builtin similarities are most likely sufficient as is described in
<<similarity>>.
[float]
[[configuration]]
=== Configuring a similarity
Most existing or custom Similarities have configuration options which
can be configured via the index settings as shown below. The index
options can be provided when creating an index or updating index
settings.
[source,js]
--------------------------------------------------
"similarity" : {
"my_similarity" : {
"type" : "DFR",
"basic_model" : "g",
"after_effect" : "l",
"normalization" : "h2",
"normalization.h2.c" : "3.0"
}
}
--------------------------------------------------
Here we configure the DFRSimilarity so it can be referenced as
`my_similarity` in mappings as is illustrate in the below example:
[source,js]
--------------------------------------------------
{
"book" : {
"properties" : {
"title" : { "type" : "string", "similarity" : "my_similarity" }
}
}
--------------------------------------------------
[float]
=== Available similarities
[float]
[[default-similarity]]
==== Default similarity
The default similarity that is based on the TF/IDF model. This
similarity has the following option:
`discount_overlaps`::
Determines whether overlap tokens (Tokens with
0 position increment) are ignored when computing norm. By default this
is true, meaning overlap tokens do not count when computing norms.
Type name: `default`
[float]
[[bm25]]
==== BM25 similarity
Another TF/IDF based similarity that has built-in tf normalization and
is supposed to work better for short fields (like names). See
http://en.wikipedia.org/wiki/Okapi_BM25[Okapi_BM25] for more details.
This similarity has the following options:
[horizontal]
`k1`::
Controls non-linear term frequency normalization
(saturation).
`b`::
Controls to what degree document length normalizes tf values.
`discount_overlaps`::
Determines whether overlap tokens (Tokens with
0 position increment) are ignored when computing norm. By default this
is true, meaning overlap tokens do not count when computing norms.
Type name: `BM25`
[float]
[[drf]]
==== DFR similarity
Similarity that implements the
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
from randomness] framework. This similarity has the following options:
[horizontal]
`basic_model`::
Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.
`after_effect`::
Possible values: `no`, `b` and `l`.
`normalization`::
Possible values: `no`, `h1`, `h2`, `h3` and `z`.
All options but the first option need a normalization value.
Type name: `DFR`
[float]
[[ib]]
==== IB similarity.
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/IBSimilarity.html[Information
based model] . This similarity has the following options:
[horizontal]
`distribution`:: Possible values: `ll` and `spl`.
`lambda`:: Possible values: `df` and `ttf`.
`normalization`:: Same as in `DFR` similarity.
Type name: `IB`
[float]
[[lm_dirichlet]]
==== LM Dirichlet similarity.
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
Dirichlet similarity] . This similarity has the following options:
[horizontal]
`mu`:: Default to `2000`.
Type name: `LMDirichlet`
[float]
[[lm_jelinek_mercer]]
==== LM Jelinek Mercer similarity.
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
Jelinek Mercer similarity] . This similarity has the following options:
[horizontal]
`lambda`:: The optimal value depends on both the collection and the query. The optimal value is around `0.1`
for title queries and `0.7` for long queries. Default to `0.1`.
Type name: `LMJelinekMercer`
[float]
[[default-base]]
==== Default and Base Similarities
By default, Elasticsearch will use whatever similarity is configured as
`default`. However, the similarity functions `queryNorm()` and `coord()`
are not per-field. Consequently, for expert users wanting to change the
implementation used for these two methods, while not changing the
`default`, it is possible to configure a similarity with the name
`base`. This similarity will then be used for the two methods.
You can change the default similarity for all fields like this:
[source,js]
--------------------------------------------------
index.similarity.default.type: BM25
--------------------------------------------------