[[index-modules-similarity]]
== Similarity module

A similarity (scoring / ranking model) defines how matching documents
are scored. Similarity is per field, meaning that via the mapping one
can define a different similarity per field.

Configuring a custom similarity is considered an expert feature; the
built-in similarities are most likely sufficient, as described in
<<similarity>>.

[float]
[[configuration]]
=== Configuring a similarity

Most existing or custom similarities have configuration options, which
can be set via the index settings as shown below. The index
options can be provided when creating an index or updating index
settings.

[source,js]
--------------------------------------------------
PUT /index?include_type_name=true
{
  "settings" : {
    "index" : {
      "similarity" : {
        "my_similarity" : {
          "type" : "DFR",
          "basic_model" : "g",
          "after_effect" : "l",
          "normalization" : "h2",
          "normalization.h2.c" : "3.0"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

Here we configure the DFRSimilarity so it can be referenced as
`my_similarity` in mappings, as illustrated in the example below:

[source,js]
--------------------------------------------------
PUT /index/_mapping
{
  "properties" : {
    "title" : { "type" : "text", "similarity" : "my_similarity" }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

[float]
=== Available similarities

[float]
[[bm25]]
==== BM25 similarity (*default*)

TF/IDF based similarity that has built-in tf normalization and
is supposed to work better for short fields (like names). See
http://en.wikipedia.org/wiki/Okapi_BM25[Okapi_BM25] for more details.
This similarity has the following options:

[horizontal]
`k1`::
    Controls non-linear term frequency normalization
    (saturation). The default value is `1.2`.

`b`::
    Controls to what degree document length normalizes tf values.
    The default value is `0.75`.

`discount_overlaps`::
    Determines whether overlap tokens (tokens with
    0 position increment) are ignored when computing norms. By default this
    is `true`, meaning overlap tokens do not count when computing norms.

Type name: `BM25`

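As a minimal sketch, the options above could be set explicitly as follows. The
similarity name `my_bm25` is a placeholder chosen for this example, and the
values shown are simply the documented defaults:

[source,js]
--------------------------------------------------
PUT /index?include_type_name=true
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": { <1>
          "type": "BM25",
          "k1": 1.2,
          "b": 0.75,
          "discount_overlaps": true
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> `my_bm25` is a hypothetical name used only for illustration.
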
[float]
[[dfr]]
==== DFR similarity

Similarity that implements the
{lucene-core-javadoc}/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
from randomness] framework. This similarity has the following options:

[horizontal]
`basic_model`::
    Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelBE.html[`be`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelD.html[`d`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelG.html[`g`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIF.html[`if`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIn.html[`in`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIne.html[`ine`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelP.html[`p`].

`be`, `d` and `p` should be avoided in practice as they might return scores that
are equal to 0 or infinite with terms that do not meet the expected random
distribution.

`after_effect`::
    Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffect.NoAfterEffect.html[`no`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectB.html[`b`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectL.html[`l`].

`normalization`::
    Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/Normalization.NoNormalization.html[`no`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH1.html[`h1`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH2.html[`h2`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH3.html[`h3`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationZ.html[`z`].

All options except the first (`no`) need a normalization value.

Type name: `DFR`

[float]
[[dfi]]
==== DFI similarity

Similarity that implements the http://trec.nist.gov/pubs/trec21/papers/irra.web.nb.pdf[divergence from independence]
model.
This similarity has the following options:

[horizontal]
`independence_measure`:: Possible values:
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceStandardized.html[`standardized`],
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceSaturated.html[`saturated`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceChiSquared.html[`chisquared`].

When using this similarity, it is highly recommended to remove stop words to get
good relevance. Also beware that terms whose frequency is less than the expected
frequency will get a score equal to 0.

Type name: `DFI`

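For illustration, a DFI similarity could be declared in the index settings along
these lines. This is only a sketch: the name `my_dfi` is made up for the
example, and `standardized` is just one of the measures listed above, not a
recommendation:

[source,js]
--------------------------------------------------
PUT /index?include_type_name=true
{
  "settings": {
    "index": {
      "similarity": {
        "my_dfi": { <1>
          "type": "DFI",
          "independence_measure": "standardized"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> `my_dfi` is a hypothetical name used only for illustration.
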
[float]
[[ib]]
==== IB similarity

{lucene-core-javadoc}/org/apache/lucene/search/similarities/IBSimilarity.html[Information
based model]. The algorithm is based on the concept that the information content in any symbolic 'distribution'
sequence is primarily determined by the repetitive usage of its basic elements.
For written texts this challenge would correspond to comparing the writing styles of different authors.
This similarity has the following options:

[horizontal]
`distribution`::  Possible values:
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionLL.html[`ll`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionSPL.html[`spl`].
`lambda`::        Possible values:
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaDF.html[`df`] and
    {lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaTTF.html[`ttf`].
`normalization`:: Same as in `DFR` similarity.

Type name: `IB`

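The three options combine as in the following sketch. The name `my_ib` is an
assumption made for this example, and the normalization settings simply reuse
the `h2` values from the DFR configuration shown earlier in this section:

[source,js]
--------------------------------------------------
PUT /index?include_type_name=true
{
  "settings": {
    "index": {
      "similarity": {
        "my_ib": { <1>
          "type": "IB",
          "distribution": "ll",
          "lambda": "df",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> `my_ib` is a hypothetical name used only for illustration.
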
[float]
[[lm_dirichlet]]
==== LM Dirichlet similarity

{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
Dirichlet similarity]. This similarity has the following options:

[horizontal]
`mu`::  Defaults to `2000`.

The scoring formula in the paper assigns negative scores to terms that have
fewer occurrences than predicted by the language model, which is illegal in
Lucene, so such terms get a score of 0.

Type name: `LMDirichlet`

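A configuration that sets `mu` explicitly might look like the sketch below. The
name `my_lm_dirichlet` is a placeholder for this example, and the value is the
documented default rather than a tuned setting:

[source,js]
--------------------------------------------------
PUT /index?include_type_name=true
{
  "settings": {
    "index": {
      "similarity": {
        "my_lm_dirichlet": { <1>
          "type": "LMDirichlet",
          "mu": 2000
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> `my_lm_dirichlet` is a hypothetical name used only for illustration.
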
[float]
[[lm_jelinek_mercer]]
==== LM Jelinek Mercer similarity

{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
Jelinek Mercer similarity]. The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:

[horizontal]
`lambda`::  The optimal value depends on both the collection and the query. The optimal value is around `0.1`
for title queries and `0.7` for long queries. Defaults to `0.1`. As the value approaches `0`, documents that match more query terms are ranked higher than those that match fewer terms.

Type name: `LMJelinekMercer`

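As a sketch of the guidance above, an index that mostly serves long queries
could raise `lambda` to `0.7`. The name `my_lm_jm` is invented for this example,
and whether `0.7` suits a given collection is something to verify against your
own queries:

[source,js]
--------------------------------------------------
PUT /index?include_type_name=true
{
  "settings": {
    "index": {
      "similarity": {
        "my_lm_jm": { <1>
          "type": "LMJelinekMercer",
          "lambda": 0.7
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

<1> `my_lm_jm` is a hypothetical name used only for illustration.
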
[float]
[[scripted_similarity]]
==== Scripted similarity

A similarity that allows you to use a script in order to specify how scores
should be computed. For instance, the below example shows how to reimplement
TF-IDF:

[source,js]
--------------------------------------------------
PUT /index?include_type_name=true
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "field": {
          "type": "text",
          "similarity": "scripted_tfidf"
        }
      }
    }
  }
}

PUT /index/_doc/1
{
  "field": "foo bar foo"
}

PUT /index/_doc/2
{
  "field": "bar baz"
}

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}
--------------------------------------------------
// CONSOLE

Which yields:

[source,js]
--------------------------------------------------
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.9508477,
    "hits": [
      {
        "_shard": "[index][0]",
        "_node": "OzrdjxNtQGaqs4DmioFw9A",
        "_index": "index",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.9508477,
        "_source": {
          "field": "foo bar foo"
        },
        "_explanation": {
          "value": 1.9508477,
          "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.9508477,
              "description": "score from ScriptedSimilarity(weightScript=[null], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;', options={}, params={}}]) computed from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "weight",
                  "details": []
                },
                {
                  "value": 1.7,
                  "description": "query.boost",
                  "details": []
                },
                {
                  "value": 2,
                  "description": "field.docCount",
                  "details": []
                },
                {
                  "value": 4,
                  "description": "field.sumDocFreq",
                  "details": []
                },
                {
                  "value": 5,
                  "description": "field.sumTotalTermFreq",
                  "details": []
                },
                {
                  "value": 1,
                  "description": "term.docFreq",
                  "details": []
                },
                {
                  "value": 2,
                  "description": "term.totalTermFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "doc.freq",
                  "details": []
                },
                {
                  "value": 3,
                  "description": "doc.length",
                  "details": []
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 12/"took" : $body.took/]
// TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]

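As a rough check (a worked calculation added for illustration, not part of the
response), plugging the statistics reported in the explanation into the script
reproduces the score up to floating-point rounding. Here `ln` stands for the
natural logarithm computed by `Math.log`, and the intermediate values are
rounded to five decimal places:

[latexmath]
++++
\begin{aligned}
\mathit{tf} &= \sqrt{\mathit{doc.freq}} = \sqrt{2} \approx 1.41421 \\
\mathit{idf} &= \ln\frac{\mathit{field.docCount} + 1}{\mathit{term.docFreq} + 1} + 1 = \ln\frac{3}{2} + 1 \approx 1.40547 \\
\mathit{norm} &= \frac{1}{\sqrt{\mathit{doc.length}}} = \frac{1}{\sqrt{3}} \approx 0.57735 \\
\mathit{score} &= \mathit{query.boost} \cdot \mathit{tf} \cdot \mathit{idf} \cdot \mathit{norm}
  \approx 1.7 \cdot 1.41421 \cdot 1.40547 \cdot 0.57735 \approx 1.9508
\end{aligned}
++++
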
WARNING: While scripted similarities provide a lot of flexibility, there is
a set of rules that they need to satisfy. Failing to do so could make
Elasticsearch silently return wrong top hits or fail with internal errors at
search time:

 - Returned scores must be positive.
 - All other variables remaining equal, scores must not decrease when
   `doc.freq` increases.
 - All other variables remaining equal, scores must not increase when
   `doc.length` increases.

You might have noticed that a significant part of the above script depends on
statistics that are the same for every document. It is possible to make the
above slightly more efficient by providing a `weight_script`, which
computes the document-independent part of the score and makes it available
under the `weight` variable. When no `weight_script` is provided, `weight`
is equal to `1`. The `weight_script` has access to the same variables as
the `script` except `doc`, since it is supposed to compute a
document-independent contribution to the score.

The below configuration will give the same tf-idf scores but is slightly
more efficient:

[source,js]
--------------------------------------------------
PUT /index?include_type_name=true
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "weight_script": {
          "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;"
        },
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "field": {
          "type": "text",
          "similarity": "scripted_tfidf"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

////////////////////

[source,js]
--------------------------------------------------
PUT /index/_doc/1
{
  "field": "foo bar foo"
}

PUT /index/_doc/2
{
  "field": "bar baz"
}

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

[source,js]
--------------------------------------------------
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.9508477,
    "hits": [
      {
        "_shard": "[index][0]",
        "_node": "OzrdjxNtQGaqs4DmioFw9A",
        "_index": "index",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.9508477,
        "_source": {
          "field": "foo bar foo"
        },
        "_explanation": {
          "value": 1.9508477,
          "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 1.9508477,
              "description": "score from ScriptedSimilarity(weightScript=[Script{type=inline, lang='painless', idOrCode='double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;', options={}, params={}}], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;', options={}, params={}}]) computed from:",
              "details": [
                {
                  "value": 2.3892908,
                  "description": "weight",
                  "details": []
                },
                {
                  "value": 1.7,
                  "description": "query.boost",
                  "details": []
                },
                {
                  "value": 2,
                  "description": "field.docCount",
                  "details": []
                },
                {
                  "value": 4,
                  "description": "field.sumDocFreq",
                  "details": []
                },
                {
                  "value": 5,
                  "description": "field.sumTotalTermFreq",
                  "details": []
                },
                {
                  "value": 1,
                  "description": "term.docFreq",
                  "details": []
                },
                {
                  "value": 2,
                  "description": "term.totalTermFreq",
                  "details": []
                },
                {
                  "value": 2.0,
                  "description": "doc.freq",
                  "details": []
                },
                {
                  "value": 3,
                  "description": "doc.length",
                  "details": []
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 1/"took" : $body.took/]
// TESTRESPONSE[s/OzrdjxNtQGaqs4DmioFw9A/$body.hits.hits.0._node/]

////////////////////

Type name: `scripted`

[float]
[[default-base]]
==== Default Similarity

By default, Elasticsearch will use whatever similarity is configured as
`default`.

You can change the default similarity for all fields in an index when
it is <<indices-create-index,created>>:

[source,js]
--------------------------------------------------
PUT /index?include_type_name=true
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "boolean"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

If you want to change the default similarity after creating the index,
you must <<indices-open-close,close>> your index, send the following
request, and <<indices-open-close,open>> it again afterwards:

[source,js]
--------------------------------------------------
POST /index/_close

PUT /index/_settings
{
  "index": {
    "similarity": {
      "default": {
        "type": "boolean"
      }
    }
  }
}

POST /index/_open
--------------------------------------------------
// CONSOLE
// TEST[continued]