178 lines
7.0 KiB
Plaintext
178 lines
7.0 KiB
Plaintext
[[query-dsl-mlt-query]]
|
|
=== More Like This Query
|
|
|
|
More like this query find documents that are "like" provided text by
|
|
running it against one or more fields.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"more_like_this" : {
|
|
"fields" : ["name.first", "name.last"],
|
|
"like" : "text like this one",
|
|
"min_term_freq" : 1,
|
|
"max_query_terms" : 12
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
More Like This can find documents that are "like" a set of
|
|
chosen documents. The syntax to specify one or more documents is similar to
|
|
the <<docs-multi-get,Multi GET API>>.
|
|
If only one document is specified, the query behaves the same as the
|
|
<<search-more-like-this,More Like This API>>.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"more_like_this" : {
|
|
"fields" : ["name.first", "name.last"],
|
|
"like" : [
|
|
{
|
|
"_index" : "test",
|
|
"_type" : "type",
|
|
"_id" : "1"
|
|
},
|
|
{
|
|
"_index" : "test",
|
|
"_type" : "type",
|
|
"_id" : "2"
|
|
},
|
|
"and also some text like this one!"
|
|
],
|
|
"min_term_freq" : 1,
|
|
"max_query_terms" : 12
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
Additionally, <<docs-termvectors-artificial-doc,artificial documents>> are also supported.
|
|
This is useful in order to specify one or more documents not present in the index.
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
{
|
|
"more_like_this" : {
|
|
"fields" : ["name.first", "name.last"],
|
|
"like" : [
|
|
{
|
|
"_index" : "test",
|
|
"_type" : "type",
|
|
"doc" : {
|
|
"name": {
|
|
"first": "Ben",
|
|
"last": "Grimm"
|
|
},
|
|
"tweet": "You got no idea what I'd... what I'd give to be invisible."
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"_index" : "test",
|
|
"_type" : "type",
|
|
"_id" : "2"
|
|
}
|
|
],
|
|
"min_term_freq" : 1,
|
|
"max_query_terms" : 12
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
|
|
`more_like_this` can be shortened to `mlt`.
|
|
|
|
Under the hood, `more_like_this` simply creates multiple `should` clauses in a `bool` query of
|
|
interesting terms extracted from some provided text. The interesting terms are
|
|
selected with respect to their tf-idf scores. These are controlled by
|
|
`min_term_freq`, `min_doc_freq`, and `max_doc_freq`. The number of interesting
|
|
terms is controlled by `max_query_terms`. While the minimum number of clauses
|
|
that must be satisfied is controlled by `minimum_should_match`. The terms
|
|
are extracted from the text in `like` and analyzed by the analyzer associated
|
|
with the field, unless specified by `analyzer`. There are other parameters,
|
|
such as `min_word_length`, `max_word_length` or `stop_words`, to control what
|
|
terms should be considered as interesting. In order to give more weight to
|
|
more interesting terms, each boolean clause associated with a term could be
|
|
boosted by the term tf-idf score times some boosting factor `boost_terms`.
|
|
When a search for multiple documents is issued, More Like This generates a
|
|
`more_like_this` query per document field in `fields`. These `fields` are
|
|
specified as a top level parameter or within each document request.
|
|
|
|
IMPORTANT: The fields must be indexed and of type `string`. Additionally, when
|
|
using `like` with documents, the fields must be either `stored`, store `term_vector`
|
|
or `_source` must be enabled.
|
|
|
|
The `more_like_this` top level parameters include:
|
|
|
|
[cols="<,<",options="header",]
|
|
|=======================================================================
|
|
|Parameter |Description
|
|
|`fields` |A list of the fields to run the more like this query against.
|
|
Defaults to the `_all` field for text and to all possible fields
|
|
for documents.
|
|
|
|
|`like`|coming[2.0]
|
|
Can either be some text, some documents or a combination of all, *required*.
|
|
A document request follows the same syntax as the
|
|
<<docs-multi-get,Multi Get API>> or <<docs-multi-termvectors,Multi Term Vectors API>>.
|
|
In this case, the text is fetched from `fields` unless specified otherwise in each document request.
|
|
The text is analyzed by the default analyzer at the field, unless overridden by the
|
|
`per_field_analyzer` parameter of the <<docs-termvectors-per-field-analyzer,Term Vectors API>>.
|
|
|
|
|`like_text` |deprecated[2.0,Replaced by `like`]
|
|
The text to find documents like it, *required* if `ids` or `docs` are
|
|
not specified.
|
|
|
|
|`ids` or `docs` |deprecated[2.0,Replaced by `like`]
|
|
A list of documents following the same syntax as the
|
|
<<docs-multi-get,Multi GET API>> or <<docs-multi-termvectors,Multi termvectors API>>.
|
|
The text is fetched from `fields` unless specified otherwise in each `doc`.
|
|
The text is analyzed by the default analyzer at the field, unless specified by the
|
|
`per_field_analyzer` parameter of the <<docs-termvectors-per-field-analyzer,Term Vectors API>>.
|
|
|
|
|`include` |When using `like` with document requests, specifies whether the documents should be
|
|
included from the search. Defaults to `false`.
|
|
|
|
|`minimum_should_match`| coming[1.5.0] From the generated query, the number of terms that
|
|
must match following the <<query-dsl-minimum-should-match,minimum should
|
|
syntax>>. (Defaults to `"30%"`).
|
|
|
|
|`percent_terms_to_match` | deprecated[1.5.0,Replaced by `minimum_should_match`]
|
|
From the generated query, the percentage of terms that must match (float value
|
|
between 0 and 1). Defaults to `0.3` (30 percent).
|
|
|
|
|`min_term_freq` |The frequency below which terms will be ignored in the
|
|
source doc. The default frequency is `2`.
|
|
|
|
|`max_query_terms` |The maximum number of query terms that will be
|
|
included in any generated query. Defaults to `25`.
|
|
|
|
|`stop_words` |An array of stop words. Any word in this set is
|
|
considered "uninteresting" and ignored. Even if your Analyzer allows
|
|
stopwords, you might want to tell the MoreLikeThis code to ignore them,
|
|
as for the purposes of document similarity it seems reasonable to assume
|
|
that "a stop word is never interesting".
|
|
|
|
|`min_doc_freq` |The frequency at which words will be ignored which do
|
|
not occur in at least this many docs. Defaults to `5`.
|
|
|
|
|`max_doc_freq` |The maximum frequency in which words may still appear.
|
|
Words that appear in more than this many docs will be ignored. Defaults
|
|
to unbounded.
|
|
|
|
|`min_word_length` |The minimum word length below which words will be
|
|
ignored. Defaults to `0`.(Old name "min_word_len" is deprecated)
|
|
|
|
|`max_word_length` |The maximum word length above which words will be
|
|
ignored. Defaults to unbounded (`0`). (Old name "max_word_len" is deprecated)
|
|
|
|
|`boost_terms` |Sets the boost factor to use when boosting terms.
|
|
Defaults to deactivated (`0`). Any other value activates boosting with given
|
|
boost factor.
|
|
|
|
|`boost` |Sets the boost value of the query. Defaults to `1.0`.
|
|
|
|
|`analyzer` |The analyzer that will be used to analyze the `like text`.
|
|
Defaults to the analyzer associated with the first field in `fields`.
|
|
|=======================================================================
|
|
|