[[query-dsl-mlt-query]]
=== More Like This Query

The more like this query finds documents that are "like" the provided text by
running it against one or more fields.

[source,js]
--------------------------------------------------
{
    "more_like_this" : {
        "fields" : ["name.first", "name.last"],
        "like_text" : "text like this one",
        "min_term_freq" : 1,
        "max_query_terms" : 12
    }
}
--------------------------------------------------

`more_like_this` can be shortened to `mlt`.

Under the hood, `more_like_this` simply creates a `bool` query with multiple
`should` clauses, one for each interesting term extracted from the provided
text. The interesting terms are selected by their tf-idf scores, and the
selection is controlled by `min_term_freq`, `min_doc_freq`, and
`max_doc_freq`. The number of interesting terms is controlled by
`max_query_terms`, while the minimum number of clauses that must be satisfied
is controlled by `percent_terms_to_match`. The terms are extracted from
`like_text`, which is analyzed by the analyzer associated with the field
unless another one is specified by `analyzer`. Other parameters, such as
`min_word_length`, `max_word_length` or `stop_words`, further control which
terms are considered interesting. To give more weight to more interesting
terms, each boolean clause associated with a term can be boosted by the
term's tf-idf score times a boosting factor, `boost_terms`.
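
As a rough illustration of the description above, the example query at the top
of this page could be rewritten into something like the following `bool`
query. This is a conceptual sketch, not the exact query that is generated: the
terms shown are hypothetical, since which terms are selected depends on the
analyzer and on the term statistics of the index, and `30%` assumes the
default `percent_terms_to_match` of `0.3`.

[source,js]
--------------------------------------------------
{
    "bool" : {
        "should" : [
            { "term" : { "name.first" : "text" } },
            { "term" : { "name.first" : "like" } },
            { "term" : { "name.last" : "text" } },
            { "term" : { "name.last" : "like" } }
        ],
        "minimum_should_match" : "30%"
    }
}
--------------------------------------------------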

The `more_like_this` top level parameters include:

[cols="<,<",options="header",]
|=======================================================================
|Parameter |Description
|`fields` |A list of the fields to run the more like this query against.
Defaults to the `_all` field.

|`like_text` |The text to find similar documents for. *Required*.

|`percent_terms_to_match` |The percentage of terms to match on (float
value). Defaults to `0.3` (30 percent).

|`min_term_freq` |The minimum term frequency. Terms that occur fewer than
this many times in the source doc will be ignored. Defaults to `2`.

|`max_query_terms` |The maximum number of query terms that will be
included in any generated query. Defaults to `25`.

|`stop_words` |An array of stop words. Any word in this set is
considered "uninteresting" and ignored. Even if your Analyzer allows
stopwords, you might want to tell the MoreLikeThis code to ignore them,
as for the purposes of document similarity it seems reasonable to assume
that "a stop word is never interesting".

|`min_doc_freq` |The minimum document frequency. Words that do not occur
in at least this many docs will be ignored. Defaults to `5`.

|`max_doc_freq` |The maximum document frequency at which words may still
appear. Words that appear in more than this many docs will be ignored.
Defaults to unbounded.

|`min_word_length` |The minimum word length below which words will be
ignored. Defaults to `0`. (The old name `min_word_len` is deprecated.)

|`max_word_length` |The maximum word length above which words will be
ignored. Defaults to unbounded (`0`). (The old name `max_word_len` is
deprecated.)

|`boost_terms` |Sets the boost factor to use when boosting terms.
Defaults to deactivated (`0`). Any other value activates boosting with the
given boost factor.

|`boost` |Sets the boost value of the query. Defaults to `1.0`.

|`analyzer` |The analyzer that will be used to analyze the text.
Defaults to the analyzer associated with the field.
|=======================================================================
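
The following sketch combines several of the parameters above in a single
query. The field names are taken from the earlier example; all values,
including the stop words, are illustrative assumptions rather than
recommendations.

[source,js]
--------------------------------------------------
{
    "more_like_this" : {
        "fields" : ["name.first", "name.last"],
        "like_text" : "text like this one",
        "percent_terms_to_match" : 0.5,
        "min_term_freq" : 1,
        "min_doc_freq" : 2,
        "max_query_terms" : 12,
        "stop_words" : ["this", "one"],
        "min_word_length" : 3,
        "boost_terms" : 2.0
    }
}
--------------------------------------------------

With these settings, words shorter than three characters, the listed stop
words, and words that occur in fewer than two docs are ignored, and a non-zero
`boost_terms` activates term boosting.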