parent
23c358d2ef
commit
615513ee9b
|
@ -1,44 +1,51 @@
|
||||||
[[query-dsl-mlt-query]]
|
[[query-dsl-mlt-query]]
|
||||||
=== More Like This Query
|
=== More Like This Query
|
||||||
|
|
||||||
More like this query find documents that are "like" provided text by
|
The More Like This Query (MLT Query) finds documents that are "like" a given
|
||||||
running it against one or more fields.
|
set of documents. In order to do so, MLT selects a set of representative terms
|
||||||
|
of these input documents, forms a query using these terms, executes the query
|
||||||
|
and returns the results. The user controls the input documents, how the terms
|
||||||
|
should be selected and how the query is formed. `more_like_this` can be
|
||||||
|
shortened to `mlt`.
|
||||||
|
|
||||||
|
The simplest use case consists of asking for documents that are similar to a
|
||||||
|
provided piece of text. Here, we are asking for all movies that have some text
|
||||||
|
similar to "Once upon a time" in their "title" and in their "description"
|
||||||
|
fields, limiting the number of selected terms to 12.
|
||||||
|
|
||||||
[source,js]
|
[source,js]
|
||||||
--------------------------------------------------
|
--------------------------------------------------
|
||||||
{
|
{
|
||||||
"more_like_this" : {
|
"more_like_this" : {
|
||||||
"fields" : ["name.first", "name.last"],
|
"fields" : ["title", "description"],
|
||||||
"like" : "text like this one",
|
"like" : "Once upon a time",
|
||||||
"min_term_freq" : 1,
|
"min_term_freq" : 1,
|
||||||
"max_query_terms" : 12
|
"max_query_terms" : 12
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
--------------------------------------------------
|
--------------------------------------------------
|
||||||
|
|
||||||
More Like This can find documents that are "like" a set of
|
A more complicated use case consists of mixing texts with documents already
|
||||||
chosen documents. The syntax to specify one or more documents is similar to
|
existing in the index. In this case, the syntax to specify a document is
|
||||||
the <<docs-multi-get,Multi GET API>>.
|
similar to the one used in the <<docs-multi-get,Multi GET API>>.
|
||||||
If only one document is specified, the query behaves the same as the
|
|
||||||
<<search-more-like-this,More Like This API>>.
|
|
||||||
|
|
||||||
[source,js]
|
[source,js]
|
||||||
--------------------------------------------------
|
--------------------------------------------------
|
||||||
{
|
{
|
||||||
"more_like_this" : {
|
"more_like_this" : {
|
||||||
"fields" : ["name.first", "name.last"],
|
"fields" : ["title", "description"],
|
||||||
"like" : [
|
"like" : [
|
||||||
{
|
{
|
||||||
"_index" : "test",
|
"_index" : "imdb",
|
||||||
"_type" : "type",
|
"_type" : "movies",
|
||||||
"_id" : "1"
|
"_id" : "1"
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"_index" : "test",
|
"_index" : "imdb",
|
||||||
"_type" : "type",
|
"_type" : "movies",
|
||||||
"_id" : "2"
|
"_id" : "2"
|
||||||
},
|
},
|
||||||
"and also some text like this one!"
|
"and potentially some more text here as well"
|
||||||
],
|
],
|
||||||
"min_term_freq" : 1,
|
"min_term_freq" : 1,
|
||||||
"max_query_terms" : 12
|
"max_query_terms" : 12
|
||||||
|
@ -46,8 +53,9 @@ If only one document is specified, the query behaves the same as the
|
||||||
}
|
}
|
||||||
--------------------------------------------------
|
--------------------------------------------------
|
||||||
|
|
||||||
Additionally, <<docs-termvectors-artificial-doc,artificial documents>> are also supported.
|
Finally, users can mix some texts, a chosen set of documents but also provide
|
||||||
This is useful in order to specify one or more documents not present in the index.
|
documents not necessarily present in the index. To provide documents not
|
||||||
|
present in the index, the syntax is similar to <<docs-termvectors-artificial-doc,artificial documents>>.
|
||||||
|
|
||||||
[source,js]
|
[source,js]
|
||||||
--------------------------------------------------
|
--------------------------------------------------
|
||||||
|
@ -56,8 +64,8 @@ This is useful in order to specify one or more documents not present in the inde
|
||||||
"fields" : ["name.first", "name.last"],
|
"fields" : ["name.first", "name.last"],
|
||||||
"like" : [
|
"like" : [
|
||||||
{
|
{
|
||||||
"_index" : "test",
|
"_index" : "marvel",
|
||||||
"_type" : "type",
|
"_type" : "quotes",
|
||||||
"doc" : {
|
"doc" : {
|
||||||
"name": {
|
"name": {
|
||||||
"first": "Ben",
|
"first": "Ben",
|
||||||
|
@ -68,8 +76,8 @@ This is useful in order to specify one or more documents not present in the inde
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"_index" : "test",
|
"_index" : "marvel",
|
||||||
"_type" : "type",
|
"_type" : "quotes",
|
||||||
"_id" : "2"
|
"_id" : "2"
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
|
@ -79,100 +87,155 @@ This is useful in order to specify one or more documents not present in the inde
|
||||||
}
|
}
|
||||||
--------------------------------------------------
|
--------------------------------------------------
|
||||||
|
|
||||||
`more_like_this` can be shortened to `mlt`.
|
==== How it Works
|
||||||
|
|
||||||
Under the hood, `more_like_this` simply creates multiple `should` clauses in a `bool` query of
|
Suppose we wanted to find all documents similar to a given input document.
|
||||||
interesting terms extracted from some provided text. The interesting terms are
|
Obviously, the input document itself should be its best match for that type of
|
||||||
selected with respect to their tf-idf scores. These are controlled by
|
query. And the reason would be mostly, according to
|
||||||
`min_term_freq`, `min_doc_freq`, and `max_doc_freq`. The number of interesting
|
link:https://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html[Lucene scoring formula],
|
||||||
terms is controlled by `max_query_terms`. While the minimum number of clauses
|
due to the terms with the highest tf-idf. Therefore, the terms of the input
|
||||||
that must be satisfied is controlled by `minimum_should_match`. The terms
|
document that have the highest tf-idf are good representatives of that
|
||||||
are extracted from the text in `like` and analyzed by the analyzer associated
|
document, and could be used within a disjunctive query (or `OR`) to retrieve similar
|
||||||
with the field, unless specified by `analyzer`. There are other parameters,
|
documents. The MLT query simply extracts the text from the input document,
|
||||||
such as `min_word_length`, `max_word_length` or `stop_words`, to control what
|
analyzes it, usually using the same analyzer at the field, then selects the
|
||||||
terms should be considered as interesting. In order to give more weight to
|
top K terms with highest tf-idf to form a disjunctive query of these terms.
|
||||||
more interesting terms, each boolean clause associated with a term could be
|
|
||||||
boosted by the term tf-idf score times some boosting factor `boost_terms`.
|
|
||||||
When a search for multiple documents is issued, More Like This generates a
|
|
||||||
`more_like_this` query per document field in `fields`. These `fields` are
|
|
||||||
specified as a top level parameter or within each document request.
|
|
||||||
|
|
||||||
IMPORTANT: The fields must be indexed and of type `string`. Additionally, when
|
IMPORTANT: The fields on which to perform MLT must be indexed and of type
|
||||||
using `like` with documents, the fields must be either `stored`, store `term_vector`
|
`string`. Additionally, when using `like` with documents, either `_source`
|
||||||
or `_source` must be enabled.
|
must be enabled or the fields must be `stored` or store `term_vector`. In
|
||||||
|
order to speed up analysis, it could help to store term vectors at index time.
|
||||||
|
|
||||||
The `more_like_this` top level parameters include:
|
For example, if we wish to perform MLT on the "title" and "tags.raw" fields,
|
||||||
|
we can explicitly store their `term_vector` at index time. We can still
|
||||||
|
perform MLT on the "description" and "tags" fields, as `_source` is enabled by
|
||||||
|
default, but there will be no speed up on analysis for these fields.
|
||||||
|
|
||||||
[cols="<,<",options="header",]
|
[source,js]
|
||||||
|=======================================================================
|
--------------------------------------------------
|
||||||
|Parameter |Description
|
curl -s -XPUT 'http://localhost:9200/imdb/' -d '{
|
||||||
|`fields` |A list of the fields to run the more like this query against.
|
"mappings": {
|
||||||
Defaults to the `_all` field for text and to all possible fields
|
"movies": {
|
||||||
for documents.
|
"properties": {
|
||||||
|
"title": {
|
||||||
|
"type": "string",
|
||||||
|
"term_vector": "yes"
|
||||||
|
},
|
||||||
|
"description": {
|
||||||
|
"type": "string"
|
||||||
|
},
|
||||||
|
"tags": {
|
||||||
|
"type": "string",
|
||||||
|
"fields" : {
|
||||||
|
"raw": {
|
||||||
|
"type" : "string",
|
||||||
|
"index" : "not_analyzed",
|
||||||
|
"term_vector" : "yes"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
--------------------------------------------------
|
||||||
|
|
||||||
|`like`|coming[2.0]
|
==== Parameters
|
||||||
Can either be some text, some documents or a combination of all, *required*.
|
|
||||||
A document request follows the same syntax as the
|
|
||||||
<<docs-multi-get,Multi Get API>> or <<docs-multi-termvectors,Multi Term Vectors API>>.
|
|
||||||
In this case, the text is fetched from `fields` unless specified otherwise in each document request.
|
|
||||||
The text is analyzed by the default analyzer at the field, unless overridden by the
|
|
||||||
`per_field_analyzer` parameter of the <<docs-termvectors-per-field-analyzer,Term Vectors API>>.
|
|
||||||
|
|
||||||
|`like_text` |deprecated[2.0,Replaced by `like`]
|
The only required parameter is `like`, all other parameters have sensible
|
||||||
The text to find documents like it, *required* if `ids` or `docs` are
|
defaults. There are three types of parameters: one to specify the document
|
||||||
not specified.
|
input, the other one for term selection and for query formation.
|
||||||
|
|
||||||
|`ids` or `docs` |deprecated[2.0,Replaced by `like`]
|
[float]
|
||||||
A list of documents following the same syntax as the
|
==== Document Input Parameters
|
||||||
<<docs-multi-get,Multi GET API>> or <<docs-multi-termvectors,Multi termvectors API>>.
|
|
||||||
The text is fetched from `fields` unless specified otherwise in each `doc`.
|
|
||||||
The text is analyzed by the default analyzer at the field, unless specified by the
|
|
||||||
`per_field_analyzer` parameter of the <<docs-termvectors-per-field-analyzer,Term Vectors API>>.
|
|
||||||
|
|
||||||
|`ignore_like`|coming[2.0] The `ignore_like` parameter is used to skip terms
|
[horizontal]
|
||||||
from the documents specified by `like`. In other words, we could ask for
|
`like`:: coming[2.0]
|
||||||
documents `like: "Apple"`, but `ignore_like: "cake crumble tree"`. Follows the
|
The only *required* parameter of the MLT query is `like` and follows a
|
||||||
same syntax as `like`.
|
versatile syntax, in which the user can specify free form text and/or a single
|
||||||
|
or multiple documents (see examples above). The syntax to specify documents is
|
||||||
|
similar to the one used by the <<docs-multi-get,Multi GET API>>. When
|
||||||
|
specifying documents, the text is fetched from `fields` unless overridden in
|
||||||
|
each document request. The text is analyzed by the analyzer at the field, but
|
||||||
|
could also be overridden. The syntax to override the analyzer at the field
|
||||||
|
follows a similar syntax to the `per_field_analyzer` parameter of the
|
||||||
|
<<docs-termvectors-per-field-analyzer,Term Vectors API>>.
|
||||||
|
Additionally, to provide documents not necessarily present in the index,
|
||||||
|
<<docs-termvectors-artificial-doc,artificial documents>> are also supported.
|
||||||
|
|
||||||
|`include` |When using `like` with document requests, specifies whether the documents should be
|
`fields`::
|
||||||
included from the search. Defaults to `false`.
|
A list of fields to fetch and analyze the text from. Defaults to the `_all`
|
||||||
|
field for free text and to all possible fields for document inputs.
|
||||||
|
|
||||||
|`minimum_should_match`| From the generated query, the number of terms that
|
`ignore_like`:: coming[2.0]
|
||||||
must match following the <<query-dsl-minimum-should-match,minimum should
|
The `ignore_like` parameter is used to skip the terms found in a chosen set of
|
||||||
syntax>>. (Defaults to `"30%"`).
|
documents. In other words, we could ask for documents `like: "Apple"`, but
|
||||||
|
`ignore_like: "cake crumble tree"`. The syntax is the same as `like`.
|
||||||
|
|
||||||
|`min_term_freq` |The frequency below which terms will be ignored in the
|
`like_text`:: deprecated[2.0,Replaced by `like`]
|
||||||
source doc. The default frequency is `2`.
|
The text to find documents like it.
|
||||||
|
|
||||||
|`max_query_terms` |The maximum number of query terms that will be
|
`ids` or `docs`:: deprecated[2.0,Replaced by `like`]
|
||||||
included in any generated query. Defaults to `25`.
|
A list of documents following the same syntax as the <<docs-multi-get,Multi GET API>>.
|
||||||
|
|
||||||
|`stop_words` |An array of stop words. Any word in this set is
|
[float]
|
||||||
considered "uninteresting" and ignored. Even if your Analyzer allows
|
==== Term Selection Parameters
|
||||||
stopwords, you might want to tell the MoreLikeThis code to ignore them,
|
|
||||||
as for the purposes of document similarity it seems reasonable to assume
|
|
||||||
that "a stop word is never interesting".
|
|
||||||
|
|
||||||
|`min_doc_freq` |The frequency at which words will be ignored which do
|
[horizontal]
|
||||||
not occur in at least this many docs. Defaults to `5`.
|
`max_query_terms`::
|
||||||
|
The maximum number of query terms that will be selected. Increasing this value
|
||||||
|
gives greater accuracy at the expense of query execution speed. Defaults to
|
||||||
|
`25`.
|
||||||
|
|
||||||
|`max_doc_freq` |The maximum frequency in which words may still appear.
|
`min_term_freq`::
|
||||||
Words that appear in more than this many docs will be ignored. Defaults
|
The minimum term frequency below which the terms will be ignored from the
|
||||||
to unbounded.
|
input document. Defaults to `2`.
|
||||||
|
|
||||||
|`min_word_length` |The minimum word length below which words will be
|
`min_doc_freq`::
|
||||||
ignored. Defaults to `0`.(Old name "min_word_len" is deprecated)
|
The minimum document frequency below which the terms will be ignored from the
|
||||||
|
input document. Defaults to `5`.
|
||||||
|
|
||||||
|`max_word_length` |The maximum word length above which words will be
|
`max_doc_freq`::
|
||||||
ignored. Defaults to unbounded (`0`). (Old name "max_word_len" is deprecated)
|
The maximum document frequency above which the terms will be ignored from the
|
||||||
|
input document. This could be useful in order to ignore highly frequent words
|
||||||
|
such as stop words. Defaults to unbounded (`0`).
|
||||||
|
|
||||||
|`boost_terms` |Sets the boost factor to use when boosting terms.
|
`min_word_length`::
|
||||||
Defaults to deactivated (`0`). Any other value activates boosting with given
|
The minimum word length below which the terms will be ignored. The old name
|
||||||
boost factor.
|
`min_word_len` is deprecated. Defaults to `0`.
|
||||||
|
|
||||||
|`boost` |Sets the boost value of the query. Defaults to `1.0`.
|
`max_word_length`::
|
||||||
|
The maximum word length above which the terms will be ignored. The old name
|
||||||
|
`max_word_len` is deprecated. Defaults to unbounded (`0`).
|
||||||
|
|
||||||
|`analyzer` |The analyzer that will be used to analyze the `like text`.
|
`stop_words`::
|
||||||
Defaults to the analyzer associated with the first field in `fields`.
|
An array of stop words. Any word in this set is considered "uninteresting" and
|
||||||
|=======================================================================
|
ignored. If the analyzer allows for stop words, you might want to tell MLT to
|
||||||
|
explicitly ignore them, as for the purposes of document similarity it seems
|
||||||
|
reasonable to assume that "a stop word is never interesting".
|
||||||
|
|
||||||
|
`analyzer`::
|
||||||
|
The analyzer that is used to analyze the free form text. Defaults to the
|
||||||
|
analyzer associated with the first field in `fields`.
|
||||||
|
|
||||||
|
[float]
|
||||||
|
==== Query Formation Parameters
|
||||||
|
|
||||||
|
[horizontal]
|
||||||
|
`minimum_should_match`::
|
||||||
|
After the disjunctive query has been formed, this parameter controls the
|
||||||
|
number of terms that must match.
|
||||||
|
The syntax is the same as the <<query-dsl-minimum-should-match,minimum should match>>.
|
||||||
|
(Defaults to `"30%"`).
|
||||||
|
|
||||||
|
`boost_terms`::
|
||||||
|
Each term in the formed query could be further boosted by their tf-idf score.
|
||||||
|
This sets the boost factor to use when using this feature. Defaults to
|
||||||
|
deactivated (`0`). Any other positive value activates terms boosting with the
|
||||||
|
given boost factor.
|
||||||
|
|
||||||
|
`include`::
|
||||||
|
Specifies whether the input documents should also be included in the search
|
||||||
|
results returned. Defaults to `false`.
|
||||||
|
|
||||||
|
`boost`::
|
||||||
|
Sets the boost value of the whole query. Defaults to `1.0`.
|
||||||
|
|
Loading…
Reference in New Issue