[[query-dsl-mlt-query]]
=== More Like This Query

The more like this query finds documents that are "like" the provided text by
running it against one or more fields.

[source,js]
--------------------------------------------------
{
    "more_like_this" : {
        "fields" : ["name.first", "name.last"],
        "like_text" : "text like this one",
        "min_term_freq" : 1,
        "max_query_terms" : 12
    }
}
--------------------------------------------------

More Like This can also find documents that are "like" a set of
chosen documents. The syntax to specify one or more documents is similar to
the <<docs-multi-get,Multi GET API>>, and supports the `ids` or `docs` array.
If only one document is specified, the query behaves the same as the
<<search-more-like-this,More Like This API>>.

[source,js]
--------------------------------------------------
{
    "more_like_this" : {
        "fields" : ["name.first", "name.last"],
        "docs" : [
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "1"
        },
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "2"
        }
        ],
        "ids" : ["3", "4"],
        "min_term_freq" : 1,
        "max_query_terms" : 12
    }
}
--------------------------------------------------

The `doc` syntax of the
<<docs-multi-termvectors,Multi Term Vectors API>> is also supported. This is useful for
specifying one or more documents not present in the index, and in
this case should be preferred over using only `like_text`.

[source,js]
--------------------------------------------------
{
    "more_like_this" : {
        "fields" : ["name.first", "name.last"],
        "docs" : [
        {
            "_index" : "test",
            "_type" : "type",
            "doc" : {
                "name": {
                    "first": "Ben",
                    "last": "Grimm"
                },
                "tweet": "You got no idea what I'd... what I'd give to be invisible."
            }
        },
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "2"
        }
        ],
        "min_term_freq" : 1,
        "max_query_terms" : 12
    }
}
--------------------------------------------------

`more_like_this` can be shortened to `mlt`.

Under the hood, `more_like_this` simply creates a `bool` query with one `should`
clause for each interesting term extracted from the provided text. The
interesting terms are selected by their tf-idf scores, and the selection is
controlled by `min_term_freq`, `min_doc_freq`, and `max_doc_freq`. The number of
interesting terms is capped by `max_query_terms`, while the minimum number of
clauses that must be satisfied is controlled by `minimum_should_match` (which
replaces the deprecated `percent_terms_to_match`). The terms are extracted from
`like_text`, which is analyzed by the analyzer associated with the field unless
`analyzer` is specified. Other parameters, such as `min_word_length`,
`max_word_length`, and `stop_words`, further restrict which terms are considered
interesting. To give more weight to more interesting terms, each boolean clause
associated with a term can be boosted by the term's tf-idf score times a
boosting factor, `boost_terms`. When a search for multiple `docs` is issued,
More Like This generates a `more_like_this` query for each document field in
`fields`. These `fields` are specified either as a top level parameter or
within each `doc`.
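
As a rough illustration of the description above, the generated query for a
text containing the terms "Ben" and "Grimm" might resemble the sketch below.
The selected terms, their lowercased forms, and the mapping of terms to fields
are assumptions made for illustration only. The actual terms depend on the
analyzer and on the index statistics, and the real query is built internally
rather than expressed as JSON.

[source,js]
--------------------------------------------------
{
    "bool" : {
        "should" : [
            { "term" : { "name.first" : "ben" } },
            { "term" : { "name.last" : "grimm" } }
        ],
        "minimum_should_match" : "30%"
    }
}
--------------------------------------------------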

IMPORTANT: The fields must be indexed and of type `string`. Additionally, when
using `ids` or `docs`, the fields must either be `stored` or store
`term_vector`, or `_source` must be enabled.

The `more_like_this` top level parameters include:

[cols="<,<",options="header",]
|=======================================================================
|Parameter |Description
|`fields` |A list of the fields to run the more like this query against.
Defaults to the `_all` field for `like_text` and to all possible fields
for `ids` or `docs`.

|`like_text` |The text for which to find similar documents. *Required* if
`ids` or `docs` are not specified.

|`ids` or `docs` |A list of documents following the same syntax as the
<<docs-multi-get,Multi GET API>> or <<docs-multi-termvectors,Multi Term Vectors API>>.
The text is fetched from `fields` unless specified otherwise in each `doc`.
The text is analyzed by the field's default analyzer, unless overridden by the
`per_field_analyzer` parameter of the <<docs-termvectors-per-field-analyzer,Term Vectors API>>.

|`include` |When using `ids` or `docs`, specifies whether the input documents
should be included in the search results. Defaults to `false`.

|`minimum_should_match` |The number of terms from the generated query that
must match, following the <<query-dsl-minimum-should-match,minimum should
match syntax>>. Defaults to `"30%"`.

|`min_term_freq` |The minimum term frequency in the source doc. Terms that
occur less often are ignored. Defaults to `2`.

|`max_query_terms` |The maximum number of query terms that will be
included in any generated query. Defaults to `25`.

|`stop_words` |An array of stop words. Any word in this set is
considered "uninteresting" and ignored. Even if your Analyzer allows
stopwords, you might want to tell the MoreLikeThis code to ignore them,
as for the purposes of document similarity it seems reasonable to assume
that "a stop word is never interesting".

|`min_doc_freq` |The minimum document frequency. Words that do not occur in
at least this many docs are ignored. Defaults to `5`.

|`max_doc_freq` |The maximum document frequency. Words that appear in more
than this many docs are ignored. Defaults to unbounded.

|`min_word_length` |The minimum word length below which words will be
ignored. Defaults to `0`. (The old name `min_word_len` is deprecated.)

|`max_word_length` |The maximum word length above which words will be
ignored. Defaults to unbounded (`0`). (The old name `max_word_len` is deprecated.)

|`boost_terms` |Sets the boost factor to use when boosting terms.
Defaults to deactivated (`0`). Any other value activates boosting with the
given boost factor.

|`boost` |Sets the boost value of the query. Defaults to `1.0`.

|`analyzer` |The analyzer that will be used to analyze the `like_text`.
Defaults to the analyzer associated with the first field in `fields`.
|=======================================================================
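
The following sketch combines several of the parameters above in a single
query. The parameter names come from the table, while the values, the document
ids, and the stop word list are illustrative assumptions rather than
recommendations. Here the input documents are kept in the results via
`include`, at least half of the selected terms must match, and term boosting is
activated with a factor of `2`.

[source,js]
--------------------------------------------------
{
    "more_like_this" : {
        "fields" : ["name.first", "name.last"],
        "ids" : ["3", "4"],
        "include" : true,
        "minimum_should_match" : "50%",
        "min_term_freq" : 1,
        "min_doc_freq" : 2,
        "max_query_terms" : 12,
        "stop_words" : ["the", "a", "of"],
        "boost_terms" : 2
    }
}
--------------------------------------------------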