= MoreLikeThis
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
The `MoreLikeThis` search component enables users to query for documents similar to a document in their result list.
It does this by using terms from the original document to find similar documents in the index.
There are three ways to use MoreLikeThis. The first, and most common, is to use it as a request handler. In this case, you send text to the MoreLikeThis request handler as needed, for example when a user clicks a "similar documents" link.
The second is to use it as a search component. This is less desirable because it performs the MoreLikeThis analysis on every document returned, which may slow search responses.
The final approach is to use it as a request handler but with externally supplied text. This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.
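The externally supplied text approach can be sketched as an HTTP POST whose body is the text to compare against the index. This is a minimal sketch under assumptions: the host, the collection name `techproducts`, the handler path `/mlt`, and the field names are all hypothetical, and the handler is assumed to read the posted body as its source text. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumptions: a local Solr, a collection named "techproducts",
# and a MoreLikeThisHandler registered at "/mlt".
MLT_URL = "http://localhost:8983/solr/techproducts/mlt"


def build_external_text_request(text, fields):
    """Build (but do not send) a POST that supplies the source text directly."""
    query = urlencode({
        "mlt.fl": ",".join(fields),       # fields used for similarity
        "mlt.interestingTerms": "list",   # also report the selected terms
    })
    return Request(
        f"{MLT_URL}?{query}",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


req = build_external_text_request(
    "solid state drive with fast flash memory", ["name", "features"]
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running Solr instance.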
== How MoreLikeThis Works
`MoreLikeThis` constructs a Lucene query based on terms in a document. It does this by pulling terms from the defined list of fields (see the `mlt.fl` parameter, below). For best results, the fields should have stored term vectors in `schema.xml`. For example:
[source,xml]
----
<field name="cat" ... termVectors="true" />
----
If term vectors are not stored, `MoreLikeThis` will generate terms from stored fields. A `uniqueKey` must also be stored in order for MoreLikeThis to work properly.
The next phase filters terms from the original document using thresholds defined with the MoreLikeThis parameters. Finally, a query is run with these terms and any other query parameters that have been defined (see the `mlt.qf` parameter, below), and a new document set is returned.
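The threshold phase can be illustrated with a toy term selector. This is a simplified sketch, not Lucene's implementation: the function name `select_terms`, the scoring formula, the default values, and the convention that `0` means "no word-length limit" are all assumptions made for illustration. The keyword arguments are named after the `mlt.*` parameters they imitate.

```python
import math
from collections import Counter


def select_terms(tokens, doc_freq, num_docs, *, mintf, mindf,
                 maxdfpct=100, minwl=0, maxwl=0, maxqt=25):
    """Toy threshold filter: keep terms that pass every limit, best first."""
    term_freq = Counter(tokens)
    max_df = num_docs * maxdfpct / 100  # percentage -> absolute document count
    scored = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if tf < mintf:                          # too rare in the source document
            continue
        if df < max(mindf, 1) or df > max_df:   # too rare or too common in the index
            continue
        if minwl and len(term) < minwl:         # too short
            continue
        if maxwl and len(term) > maxwl:         # too long
            continue
        idf = math.log(num_docs / df) + 1.0     # simplified TF/IDF weight
        scored.append((tf * idf, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:maxqt]]  # cap the number of query terms


tokens = "solr search solr index search solr the the the a".split()
doc_freq = {"solr": 10, "search": 20, "index": 15, "the": 100, "a": 100}
terms = select_terms(tokens, doc_freq, num_docs=100,
                     mintf=2, mindf=5, maxdfpct=75, minwl=3)
print(terms)
```

In this example `index` is dropped for occurring only once in the source text, and `the` is dropped because it appears in more than 75 percent of the documents.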
== Common Parameters for MoreLikeThis
The list below summarizes the `MoreLikeThis` parameters supported by Lucene/Solr. These parameters can be used with any of the three possible MoreLikeThis approaches.
`mlt.fl`::
Specifies the fields to use for similarity. If possible, these should have stored `termVectors`.
`mlt.mintf`::
Specifies the Minimum Term Frequency, the frequency below which terms will be ignored in the source document.
`mlt.mindf`::
Specifies the Minimum Document Frequency: words that do not occur in at least this many documents will be ignored.
`mlt.maxdf`::
Specifies the Maximum Document Frequency: words that occur in more than this many documents will be ignored.
`mlt.maxdfpct`::
Specifies the Maximum Document Frequency as a percentage of the number of documents in the index. The argument must be an integer between 0 and 100. For example, 75 means the word will be ignored if it occurs in more than 75 percent of the documents in the index.
`mlt.minwl`::
Sets the minimum word length below which words will be ignored.
`mlt.maxwl`::
Sets the maximum word length above which words will be ignored.
`mlt.maxqt`::
Sets the maximum number of query terms that will be included in any generated query.
`mlt.maxntp`::
Sets the maximum number of tokens to parse in each example document field that is not stored with TermVector support.
`mlt.boost`::
Specifies if the query will be boosted by the interesting term relevance. It can be either "true" or "false".
`mlt.qf`::
Query fields and their boosts using the same format as that used by the <<the-dismax-query-parser.adoc#the-dismax-query-parser,DisMax Query Parser>>. These fields must also be specified in `mlt.fl`.
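The common parameters can be assembled programmatically before being sent with a request. The helper below is a hypothetical sketch (the name `common_mlt_params` and the field names are assumptions). It formats `mlt.qf` as space-separated `field^boost` pairs and enforces two rules stated above: every `mlt.qf` field must also appear in `mlt.fl`, and `mlt.maxdfpct` must be an integer between 0 and 100.

```python
from urllib.parse import urlencode

# Threshold parameters described in this section, without the "mlt." prefix.
THRESHOLDS = {"mintf", "mindf", "maxdf", "maxdfpct", "minwl", "maxwl", "maxqt", "maxntp"}


def common_mlt_params(fl, qf=None, boost=False, **thresholds):
    """fl: list of field names; qf: dict of field -> boost; thresholds: e.g. mintf=1."""
    params = {"mlt.fl": ",".join(fl)}
    if qf:
        missing = sorted(set(qf) - set(fl))
        if missing:
            raise ValueError(f"mlt.qf fields not in mlt.fl: {missing}")
        params["mlt.qf"] = " ".join(f"{field}^{weight}" for field, weight in qf.items())
    params["mlt.boost"] = "true" if boost else "false"
    for name, value in thresholds.items():
        if name not in THRESHOLDS:
            raise ValueError(f"unknown threshold parameter: {name}")
        if name == "maxdfpct" and not (isinstance(value, int) and 0 <= value <= 100):
            raise ValueError("mlt.maxdfpct must be an integer between 0 and 100")
        params[f"mlt.{name}"] = str(value)
    return params


params = common_mlt_params(
    ["name", "features"],
    qf={"name": 2.0, "features": 1.0},
    boost=True,
    mintf=1,
    maxdfpct=75,
)
print(urlencode(params))  # query-string form, ready to append to a request URL
```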
== Parameters for the MoreLikeThisComponent
Using MoreLikeThis as a search component returns similar documents for each document in the response set. In addition to the common parameters, the following options are available:
`mlt`::
If set to `true`, activates the `MoreLikeThis` component and enables Solr to return `MoreLikeThis` results.
`mlt.count`::
Specifies the number of similar documents to be returned for each result. The default value is 5.
`mlt.interestingTerms`:: _Same as defined below for the MLT Handler._
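A search-component request is an ordinary query with `mlt=true` added. The sketch below only builds the URL. The host, the collection name `techproducts`, the `/select` path, and the field name are assumptions, and `max_similar_docs` is a hypothetical helper showing why the component can be costly: similar documents are computed for every row returned.

```python
from urllib.parse import urlencode

# Assumptions: a local Solr and a collection named "techproducts".
SELECT_URL = "http://localhost:8983/solr/techproducts/select"


def component_url(q, fl, count=5, rows=10):
    """Build a normal search URL with the MoreLikeThis component switched on."""
    params = [
        ("q", q),
        ("rows", rows),
        ("mlt", "true"),            # activate the component
        ("mlt.fl", ",".join(fl)),   # fields used for similarity
        ("mlt.count", count),       # similar documents per result
    ]
    return f"{SELECT_URL}?{urlencode(params)}"


def max_similar_docs(rows, count):
    """Upper bound on similar documents returned: one list per result row."""
    return rows * count


url = component_url("cat:electronics", ["name"], count=3, rows=2)
print(url)
print(max_similar_docs(rows=2, count=3))
```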
== Parameters for the MoreLikeThisHandler
The list below summarizes parameters accessible through the `MoreLikeThisHandler`. The handler supports faceting, paging, and filtering using common query parameters, but does not work well with alternate query parsers.
`mlt.match.include`::
Specifies whether or not the response should include the matched document. If set to false, the response will look like a normal select response.
`mlt.match.offset`::
Specifies an offset into the main query search results to locate the document on which the `MoreLikeThis` query should operate. By default, the query operates on the first result for the `q` parameter.
`mlt.interestingTerms`::
Controls how the `MoreLikeThis` component presents the "interesting" terms (the top TF/IDF terms) for the query. It supports three settings:
`list`::: Lists the terms.
`none`::: Lists no terms.
`details`::: Lists the terms along with the boost value used for each term. Unless `mlt.boost=true`, all terms will have `boost=1.0`.
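The handler-specific parameters can be combined in a single request. The following is a sketch under assumptions: the host, the collection name `techproducts`, the handler path `/mlt`, the document ID, and the field names are hypothetical, and the function's default values are choices of this sketch rather than documented Solr defaults. The helper rejects any `mlt.interestingTerms` value other than the three settings listed above.

```python
from urllib.parse import urlencode

# Assumptions: a local Solr, a collection named "techproducts",
# and a MoreLikeThisHandler registered at "/mlt".
MLT_URL = "http://localhost:8983/solr/techproducts/mlt"
INTERESTING_TERMS = {"list", "none", "details"}


def handler_url(q, fl, match_include=True, match_offset=0, interesting_terms="none"):
    """Build a MoreLikeThisHandler URL for the document matched by q."""
    if interesting_terms not in INTERESTING_TERMS:
        raise ValueError(f"mlt.interestingTerms must be one of {sorted(INTERESTING_TERMS)}")
    if match_offset < 0:
        raise ValueError("mlt.match.offset must not be negative")
    params = [
        ("q", q),                                                     # locates the source document
        ("mlt.fl", ",".join(fl)),
        ("mlt.match.include", "true" if match_include else "false"),
        ("mlt.match.offset", match_offset),                           # 0 = first result of q
        ("mlt.interestingTerms", interesting_terms),
    ]
    return f"{MLT_URL}?{urlencode(params)}"


url = handler_url("id:SP2514N", ["manu", "cat"],
                  match_include=False, interesting_terms="details")
print(url)
```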
== MoreLikeThis Query Parser
The `mlt` query parser provides a mechanism to retrieve documents similar to a given document, much like the handler. More information on the usage of the `mlt` query parser can be found in the section <<other-parsers.adoc#other-parsers,Other Parsers>>.
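A query for the `mlt` parser takes the form of local parameters followed by the ID of the source document. The helper below is a hypothetical sketch that only formats the `q` value. The document ID `1234`, the field `name`, and the threshold values are placeholders, and the Other Parsers section remains the authority on which local parameters are accepted.

```python
from urllib.parse import urlencode


def mlt_parser_query(doc_id, qf, **local_params):
    """Format a q value such as {!mlt qf=name mintf=2}1234."""
    parts = [f"qf={qf}"] + [f"{name}={value}" for name, value in local_params.items()]
    return "{!mlt " + " ".join(parts) + "}" + str(doc_id)


q = mlt_parser_query("1234", "name", mintf=2, mindf=3)
print(q)
print(urlencode({"q": q}))  # URL-encoded form for use in a request
```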