mirror of
https://github.com/honeymoose/OpenSearch.git
synced 2025-03-27 18:38:41 +00:00
Previously, the only way to specify a document not present in the index was to use `like_text`. This would usually lead to complex queries made of multiple MLT queries per document field. This commit adds the ability to the MLT query to directly specify documents not present in the index (artificial documents). The syntax is similar to the Percolator API or to the Multi Term Vector API. Closes #7725
164 lines
6.1 KiB
Plaintext
164 lines
6.1 KiB
Plaintext
[[query-dsl-mlt-query]]
|
|
=== More Like This Query
|
|
|
|
More like this query find documents that are "like" provided text by
|
|
running it against one or more fields.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"more_like_this" : {
|
|
"fields" : ["name.first", "name.last"],
|
|
"like_text" : "text like this one",
|
|
"min_term_freq" : 1,
|
|
"max_query_terms" : 12
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
More Like This can find documents that are "like" a set of
|
|
chosen documents. The syntax to specify one or more documents is similar to
|
|
the <<docs-multi-get,Multi GET API>>, and supports the `ids` or `docs` array.
|
|
If only one document is specified, the query behaves the same as the
|
|
<<search-more-like-this,More Like This API>>.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"more_like_this" : {
|
|
"fields" : ["name.first", "name.last"],
|
|
"docs" : [
|
|
{
|
|
"_index" : "test",
|
|
"_type" : "type",
|
|
"_id" : "1"
|
|
},
|
|
{
|
|
"_index" : "test",
|
|
"_type" : "type",
|
|
"_id" : "2"
|
|
}
|
|
],
|
|
"ids" : ["3", "4"],
|
|
"min_term_freq" : 1,
|
|
"max_query_terms" : 12
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Additionally, the `doc` syntax of the
|
|
<<docs-multi-termvectors,Multi Term Vectors API>> is also supported. This is useful in
|
|
order to specify one or more documents not present in the index, and in
|
|
this case should be preferred over only using `like_text`.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"more_like_this" : {
|
|
"fields" : ["name.first", "name.last"],
|
|
"docs" : [
|
|
{
|
|
"_index" : "test",
|
|
"_type" : "type",
|
|
"doc" : {
|
|
"name": {
|
|
"first": "Ben",
|
|
"last": "Grimm"
|
|
},
|
|
"tweet": "You got no idea what I'd... what I'd give to be invisible."
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"_index" : "test",
|
|
"_type" : "type",
|
|
"_id" : "2"
|
|
}
|
|
],
|
|
"min_term_freq" : 1,
|
|
"max_query_terms" : 12
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
`more_like_this` can be shortened to `mlt`.
|
|
|
|
Under the hood, `more_like_this` simply creates multiple `should` clauses in a `bool` query of
|
|
interesting terms extracted from some provided text. The interesting terms are
|
|
selected with respect to their tf-idf scores. These are controlled by
|
|
`min_term_freq`, `min_doc_freq`, and `max_doc_freq`. The number of interesting
|
|
terms is controlled by `max_query_terms`. While the minimum number of clauses
|
|
that must be satisfied is controlled by `percent_terms_to_match`. The terms
|
|
are extracted from `like_text` which is analyzed by the analyzer associated
|
|
with the field, unless specified by `analyzer`. There are other parameters,
|
|
such as `min_word_length`, `max_word_length` or `stop_words`, to control what
|
|
terms should be considered as interesting. In order to give more weight to
|
|
more interesting terms, each boolean clause associated with a term could be
|
|
boosted by the term tf-idf score times some boosting factor `boost_terms`.
|
|
When a search for multiple `docs` is issued, More Like This generates a
|
|
`more_like_this` query per document field in `fields`. These `fields` are
|
|
specified as a top level parameter or within each `doc`.
|
|
|
|
IMPORTANT: The fields must be indexed and of type `string`. Additionally, when
|
|
using `ids` or `docs`, the fields must be either `stored`, store `term_vector`
|
|
or `_source` must be enabled.
|
|
|
|
The `more_like_this` top level parameters include:
|
|
|
|
[cols="<,<",options="header",]
|
|
|=======================================================================
|
|
|Parameter |Description
|
|
|`fields` |A list of the fields to run the more like this query against.
|
|
Defaults to the `_all` field for `like_text` and to all possible fields
|
|
for `ids` or `docs`.
|
|
|
|
|`like_text` |The text to find documents like it, *required* if `ids` or `docs` are
|
|
not specified.
|
|
|
|
|`ids` or `docs` |A list of documents following the same syntax as the
|
|
<<docs-multi-get,Multi GET API>> or <<docs-multi-termvectors,Multi Term Vectors API>>.
|
|
The text is fetched from `fields` unless specified otherwise in each `doc`.
|
|
|
|
|`include` |When using `ids` or `docs`, specifies whether the documents should be
|
|
included from the search. Defaults to `false`.
|
|
|
|
|`minimum_should_match`| From the generated query, the number of terms that
|
|
must match following the <<query-dsl-minimum-should-match,minimum should
|
|
syntax>>. (Defaults to `"30%"`).
|
|
|
|
|`min_term_freq` |The frequency below which terms will be ignored in the
|
|
source doc. The default frequency is `2`.
|
|
|
|
|`max_query_terms` |The maximum number of query terms that will be
|
|
included in any generated query. Defaults to `25`.
|
|
|
|
|`stop_words` |An array of stop words. Any word in this set is
|
|
considered "uninteresting" and ignored. Even if your Analyzer allows
|
|
stopwords, you might want to tell the MoreLikeThis code to ignore them,
|
|
as for the purposes of document similarity it seems reasonable to assume
|
|
that "a stop word is never interesting".
|
|
|
|
|`min_doc_freq` |The frequency at which words will be ignored which do
|
|
not occur in at least this many docs. Defaults to `5`.
|
|
|
|
|`max_doc_freq` |The maximum frequency in which words may still appear.
|
|
Words that appear in more than this many docs will be ignored. Defaults
|
|
to unbounded.
|
|
|
|
|`min_word_length` |The minimum word length below which words will be
|
|
ignored. Defaults to `0`.(Old name "min_word_len" is deprecated)
|
|
|
|
|`max_word_length` |The maximum word length above which words will be
|
|
ignored. Defaults to unbounded (`0`). (Old name "max_word_len" is deprecated)
|
|
|
|
|`boost_terms` |Sets the boost factor to use when boosting terms.
|
|
Defaults to deactivated (`0`). Any other value activates boosting with given
|
|
boost factor.
|
|
|
|
|`boost` |Sets the boost value of the query. Defaults to `1.0`.
|
|
|
|
|`analyzer` |The analyzer that will be used to analyze the `like text`.
|
|
Defaults to the analyzer associated with the first field in `fields`.
|
|
|=======================================================================
|
|
|