
[[index-modules-similarity]]
== Similarity module

A similarity (scoring / ranking model) defines how matching documents
are scored. Similarity is per field, meaning that via the mapping one
can define a different similarity per field.

Configuring a custom similarity is considered an expert feature, and the
built-in similarities are most likely sufficient, as described in
<<similarity>>.

[float]
[[configuration]]
=== Configuring a similarity

Most existing or custom similarities have options that can be set via
the index settings, as shown below. These options can be provided when
creating an index or when updating index settings.

[source,js]
--------------------------------------------------
"similarity" : {
  "my_similarity" : {
    "type" : "DFR",
    "basic_model" : "g",
    "after_effect" : "l",
    "normalization" : "h2",
    "normalization.h2.c" : "3.0"
  }
}
--------------------------------------------------

Here we configure the DFRSimilarity so it can be referenced as
`my_similarity` in mappings, as illustrated in the example below:

[source,js]
--------------------------------------------------
{
  "book" : {
    "properties" : {
      "title" : { "type" : "text", "similarity" : "my_similarity" }
    }
  }
}
--------------------------------------------------
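
As a sketch of how the two fragments above fit together, the following
request creates an index with both the similarity settings and the mapping
in one call. The index name `my_index` is only illustrative, and the `book`
mapping type follows the example above; versions that no longer use mapping
types would place `properties` directly under `mappings`.

[source,js]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "DFR",
          "basic_model": "g",
          "after_effect": "l",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  },
  "mappings": {
    "book": {
      "properties": {
        "title": { "type": "text", "similarity": "my_similarity" }
      }
    }
  }
}
--------------------------------------------------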

[float]
=== Available similarities

[float]
[[bm25]]
==== BM25 similarity (*default*)

TF/IDF-based similarity that has built-in tf normalization and
is supposed to work better for short fields (like names). See
http://en.wikipedia.org/wiki/Okapi_BM25[Okapi BM25] for more details.
This similarity has the following options:

[horizontal]
`k1`::
    Controls non-linear term frequency normalization
    (saturation). The default value is `1.2`.

`b`::
    Controls to what degree document length normalizes tf values.
    The default value is `0.75`.

`discount_overlaps`::
    Determines whether overlap tokens (tokens with a position increment
    of 0) are ignored when computing norms. By default this is `true`,
    meaning overlap tokens do not count when computing norms.

Type name: `BM25`
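
For illustration, a custom BM25 similarity could be declared in the index
settings as follows. The name `my_bm25` and the chosen values are
assumptions made for this example, not recommendations; here `b` is lowered
from its default so that document length has less influence on the score.

[source,js]
--------------------------------------------------
"similarity" : {
  "my_bm25" : {
    "type" : "BM25",
    "k1" : 1.2,
    "b" : 0.5,
    "discount_overlaps" : true
  }
}
--------------------------------------------------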

[float]
[[classic-similarity]]
==== Classic similarity

The classic similarity that is based on the TF/IDF model. This
similarity has the following option:

`discount_overlaps`::
    Determines whether overlap tokens (tokens with a position increment
    of 0) are ignored when computing norms. By default this is `true`,
    meaning overlap tokens do not count when computing norms.

Type name: `classic`

[float]
[[drf]]
==== DFR similarity

Similarity that implements the
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
from randomness] framework. This similarity has the following options:

[horizontal]
`basic_model`::
    Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.

`after_effect`::
    Possible values: `no`, `b` and `l`.

`normalization`::
    Possible values: `no`, `h1`, `h2`, `h3` and `z`.

All `normalization` options except the first (`no`) need a normalization
value, such as `normalization.h2.c` in the configuration example above.

Type name: `DFR`

[float]
[[dfi]]
==== DFI similarity

Similarity that implements the
http://trec.nist.gov/pubs/trec21/papers/irra.web.nb.pdf[divergence from independence]
model. This similarity has the following option:

[horizontal]
`independence_measure`:: Possible values: `standardized`, `saturated` and `chisquared`.

Type name: `DFI`
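
A minimal sketch of a DFI configuration, assuming the hypothetical name
`my_dfi` and picking `standardized` only as an example value:

[source,js]
--------------------------------------------------
"similarity" : {
  "my_dfi" : {
    "type" : "DFI",
    "independence_measure" : "standardized"
  }
}
--------------------------------------------------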

[float]
[[ib]]
==== IB similarity

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/IBSimilarity.html[Information
based model]. The algorithm is based on the concept that the information content in any symbolic 'distribution'
sequence is primarily determined by the repetitive usage of its basic elements.
For written texts, this challenge would correspond to comparing the writing styles of different authors.
This similarity has the following options:

[horizontal]
`distribution`::  Possible values: `ll` and `spl`.
`lambda`::        Possible values: `df` and `ttf`.
`normalization`:: Same as in `DFR` similarity.

Type name: `IB`
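
As a sketch, an IB similarity could be configured as shown below. The name
`my_ib` is hypothetical, the option values are picked arbitrarily from the
lists above, and the example assumes the normalization value is supplied in
the same way as in the DFR configuration example.

[source,js]
--------------------------------------------------
"similarity" : {
  "my_ib" : {
    "type" : "IB",
    "distribution" : "ll",
    "lambda" : "df",
    "normalization" : "h2",
    "normalization.h2.c" : "3.0"
  }
}
--------------------------------------------------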

[float]
[[lm_dirichlet]]
==== LM Dirichlet similarity

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
Dirichlet similarity]. This similarity has the following option:

[horizontal]
`mu`::  Defaults to `2000`.

Type name: `LMDirichlet`
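
A minimal sketch of an LM Dirichlet configuration, assuming the
hypothetical name `my_lm_dirichlet` and spelling out the default `mu`
explicitly:

[source,js]
--------------------------------------------------
"similarity" : {
  "my_lm_dirichlet" : {
    "type" : "LMDirichlet",
    "mu" : 2000
  }
}
--------------------------------------------------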

[float]
[[lm_jelinek_mercer]]
==== LM Jelinek Mercer similarity

http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
Jelinek Mercer similarity]. The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following option:

[horizontal]
`lambda`::  The optimal value depends on both the collection and the query: it is around `0.1`
for title queries and `0.7` for long queries. Defaults to `0.1`. When the value approaches `0`, documents that match more query terms are ranked higher than those that match fewer terms.

Type name: `LMJelinekMercer`
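
As an illustration, a field that mostly receives long queries might use a
configuration like the one below. The name `my_lm_jelinek_mercer` is
hypothetical, and `0.7` is simply the value suggested above for long
queries.

[source,js]
--------------------------------------------------
"similarity" : {
  "my_lm_jelinek_mercer" : {
    "type" : "LMJelinekMercer",
    "lambda" : 0.7
  }
}
--------------------------------------------------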

[float]
[[default-base]]
==== Default similarity

By default, Elasticsearch will use whatever similarity is configured as
`default`.

You can change the default similarity for all fields in an index when
it is <<indices-create-index,created>>:

[source,js]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "boolean"
        }
      }
    }
  }
}
--------------------------------------------------

If you want to change the default similarity after creating the index,
you must <<indices-open-close,close>> your index, send the following
request, and <<indices-open-close,open>> it again afterwards:

[source,js]
--------------------------------------------------
PUT /my_index/_settings
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "boolean"
        }
      }
    }
  }
}
--------------------------------------------------
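
Put together, the close, update and reopen steps could be issued as the
following sequence. This is a sketch that assumes the index is named
`my_index`, as in the requests above.

[source,js]
--------------------------------------------------
POST /my_index/_close

PUT /my_index/_settings
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "boolean"
        }
      }
    }
  }
}

POST /my_index/_open
--------------------------------------------------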