= Suggester
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
The SuggestComponent in Solr provides users with automatic suggestions for query terms.
You can use this to implement a powerful auto-suggest feature in your search application.
Although it is possible to use the <<spell-checking.adoc#spell-checking,Spell Checking>> functionality to power autosuggest behavior, Solr has a dedicated http://lucene.apache.org/solr/api/solr-core/org/apache/solr/handler/component/SuggestComponent.html[SuggestComponent] designed for this functionality.
This approach utilizes Lucene's Suggester implementation and supports all of the lookup implementations available in Lucene.
The main features of this Suggester are:
* Lookup implementation pluggability
* Term dictionary pluggability, giving you the flexibility to choose the dictionary implementation
* Distributed support
The `solrconfig.xml` found in Solr's "```techproducts```" example has a `suggest` search component and a `/suggest` request handler already configured. You can use that as the basis for your configuration, or create it from scratch, as detailed below. For more on search components, see the section <<requesthandlers-and-searchcomponents-in-solrconfig.adoc#requesthandlers-and-searchcomponents-in-solrconfig,RequestHandlers and SearchComponents in SolrConfig>>.
== Adding the Suggest Search Component
The first step is to add a search component to `solrconfig.xml` and tell it to use the SuggestComponent. Here is some sample code that could be used.
[source,xml]
----
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">cat</str>
    <str name="weightField">price</str>
    <str name="suggestAnalyzerFieldType">string</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>
----
=== Suggester Search Component Parameters
The Suggester search component takes several configuration parameters.
The choice of the lookup implementation (`lookupImpl`, how terms are found in the suggestion dictionary) and the dictionary implementation (`dictionaryImpl`, how terms are stored in the suggestion dictionary) will dictate some of the parameters required.
Below are the main parameters that can be used no matter what lookup or dictionary implementation is used. In the following sections additional parameters are provided for each implementation.
`searchComponent name`::
Arbitrary name for the search component.
`name`::
A symbolic name for this suggester. You can refer to this name in the URL parameters and in the SearchHandler configuration. You can define more than one suggester in a single `solrconfig.xml` file.
`lookupImpl`::
Lookup implementation. There are several possible implementations, described below in the section <<Lookup Implementations>>. If not set, the default lookup is `JaspellLookupFactory`.
`dictionaryImpl`::
The dictionary implementation to use. There are several possible implementations, described below in the section <<Dictionary Implementations>>.
+
If not set, the default dictionary implementation is `HighFrequencyDictionaryFactory`. However, if a `sourceLocation` is used, the dictionary implementation will be `FileDictionaryFactory`.
`field`::
A field from the index to use as the basis of suggestion terms. If `sourceLocation` is empty (meaning any dictionary implementation other than `FileDictionaryFactory`), then terms from this field in the index will be used.
+
To be used as the basis for a suggestion, the field must be stored. You may want to <<copying-fields.adoc#copying-fields,use copyField rules>> to create a special 'suggest' field composed of terms from other fields in documents. In any event, you very likely want a minimal amount of analysis on the field, so an additional option is to create a field type in your schema that only uses basic tokenizers or filters. One option for such a field type is shown here:
+
[source,xml]
----
<fieldType class="solr.TextField" name="textSuggest" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
----
+
This minimal analysis is not required, however, if you want more analysis to occur on terms. If using the `AnalyzingLookupFactory` as your `lookupImpl`, you have the option of defining the field type rules to use for index-time and query-time analysis.
`sourceLocation`::
The path to the dictionary file if using the `FileDictionaryFactory`. If this value is empty then the main index will be used as a source of terms and weights.
`storeDir`::
The location to store the dictionary file.
`buildOnCommit` and `buildOnOptimize`::
If `true`, the lookup data structure will be rebuilt after soft-commit. If `false`, the default, then the lookup data will be built only when requested by URL parameter `suggest.build=true`. Use `buildOnCommit` to rebuild the dictionary with every soft-commit, or `buildOnOptimize` to build the dictionary only when the index is optimized.
+
Some lookup implementations may take a long time to build, especially with large indexes. In such cases, using `buildOnCommit` or `buildOnOptimize`, particularly with a high frequency of soft commits, is not recommended; instead, build the suggester at a lower frequency by manually issuing requests with `suggest.build=true`.
`buildOnStartup`::
If `true`, then the lookup data structure will be built when Solr starts or when the core is reloaded. If this parameter is not specified, the suggester will check if the lookup data structure is present on disk and build it if not found.
+
Setting this to `true` could lead to the core taking longer to load (or reload), as the suggester data structure needs to be built, which can sometimes take a long time. It's usually preferable to leave this setting at `false`, the default, and build suggesters by manually issuing requests with `suggest.build=true`.
=== Lookup Implementations
The `lookupImpl` parameter defines the algorithms used to look up terms in the suggest index. There are several possible implementations to choose from, and some require additional parameters to be configured.
==== AnalyzingLookupFactory
A lookup that first analyzes the incoming text and adds the analyzed form to a weighted FST, and then does the same thing at lookup time.
This implementation uses the following additional properties:
`suggestAnalyzerFieldType`::
The field type to use for the query-time and build-time term suggestion analysis.
`exactMatchFirst`::
If `true`, the default, exact suggestions are returned first, even if they are prefixes of other strings in the FST that have larger weights.
`preserveSep`::
If `true`, the default, then a separator between tokens is preserved. This means that suggestions are sensitive to tokenization (e.g., baseball is different from base ball).
`preservePositionIncrements`::
If `true`, the suggester will preserve position increments. This means that the gaps left by token filters (for example, when StopFilter removes a stopword) are respected when building the suggester. The default is `false`.
==== FuzzyLookupFactory
This suggester extends the AnalyzingSuggester with fuzzy matching. Similarity is measured by the Levenshtein algorithm.
This implementation uses the following additional properties:
`exactMatchFirst`::
If `true`, the default, exact suggestions are returned first, even if they are prefixes of other strings in the FST that have larger weights.
`preserveSep`::
If `true`, the default, then a separator between tokens is preserved. This means that suggestions are sensitive to tokenization (e.g., baseball is different from base ball).
`maxSurfaceFormsPerAnalyzedForm`::
The maximum number of surface forms to keep for a single analyzed form. When there are too many surface forms we discard the lowest weighted ones.
`maxGraphExpansions`::
When building the FST ("index-time"), we add each path through the tokenstream graph as an individual entry. This places an upper-bound on how many expansions will be added for a single suggestion. The default is `-1` which means there is no limit.
`preservePositionIncrements`::
If `true`, the suggester will preserve position increments. This means that the gaps left by token filters (for example, when StopFilter removes a stopword) are respected when building the suggester. The default is `false`.
`maxEdits`::
The maximum number of string edits allowed. The system's hard limit is `2`. The default is `1`.
`transpositions`::
If `true`, the default, transpositions are treated as a primitive edit operation.
`nonFuzzyPrefix`::
The length of the leading part of the query that must match a suggestion exactly (the non-fuzzy prefix). The default is `1`.
`minFuzzyLength`::
The minimum query length before any string edits are allowed. The default is `3`.
`unicodeAware`::
If `true`, the `maxEdits`, `minFuzzyLength`, `transpositions` and `nonFuzzyPrefix` parameters will be measured in Unicode code points (actual letters) instead of bytes. The default is `false`.
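
To make `maxEdits` and `transpositions` more concrete, the following Python sketch computes an edit distance between two strings, with and without counting a swap of adjacent characters as a single edit. It only illustrates the measure; it is not how the suggester is implemented internally, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str, transpositions: bool = True) -> int:
    """Number of edits needed to turn a into b.

    With transpositions=True, swapping two adjacent characters counts as
    one edit (optimal string alignment); otherwise it costs two edits.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] holds the distance between a[:i] and b[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # adjacent characters swapped
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("elce", "elec", transpositions=True))
print(edit_distance("elce", "elec", transpositions=False))
```

Under this measure, a mistyped `elce` is one edit away from `elec` when transpositions are counted as a primitive edit, and two edits away otherwise. Python strings are sequences of code points, so the sketch counts in the same units as `unicodeAware=true`.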
==== AnalyzingInfixLookupFactory
Analyzes the input text and then suggests matches based on prefix matches to any tokens in the indexed text. This uses a Lucene index for its dictionary.
This implementation uses the following additional properties.
`indexPath`::
When using `AnalyzingInfixSuggester` you can provide your own path where the index will get built. The default is `analyzingInfixSuggesterIndexDir` and will be created in your collection's `data/` directory.
`minPrefixChars`::
Minimum number of leading characters before PrefixQuery is used (default is `4`). Prefixes shorter than this are indexed as character ngrams (increasing index size but making lookups faster).
`allTermsRequired`::
Boolean option for multiple terms. The default is `true`, meaning all terms are required.
`highlight`::
Highlight suggest terms. Default is `true`.
This implementation supports <<Context Filtering>>.
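
The idea of matching a prefix against any token of the indexed text can be pictured with a short Python sketch. It assumes whitespace tokenization and lowercasing as a stand-in for real analysis, and the function name and sample entries are made up for illustration; it does not reproduce the Lucene index that this lookup actually uses.

```python
from typing import List


def infix_matches(prefix: str, entries: List[str]) -> List[str]:
    """Return the entries in which at least one token starts with the prefix."""
    needle = prefix.lower()
    return [
        entry for entry in entries
        if any(token.startswith(needle) for token in entry.lower().split())
    ]


# Hypothetical sample entries.
entries = ["electronics and computer1", "hard drive", "computer memory"]
print(infix_matches("comp", entries))
```

Here `comp` matches both entries that contain a token beginning with `comp`, even when that token is not the first word of the suggestion.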
==== BlendedInfixLookupFactory
An extension of the `AnalyzingInfixSuggester` which provides additional functionality to weight prefix matches across the matched documents. You can tell it to score higher if a hit is closer to the start of the suggestion or vice versa.
This implementation uses the following additional properties:
`blenderType`::
Used to calculate weight coefficient using the position of the first matching word. Can be one of:
`position_linear`:::
`weightFieldValue*(1 - 0.10*position)`: Matches to the start will be given a higher score. This is the default.
`position_reciprocal`:::
`weightFieldValue/(1+position)`: Matches closer to the start will be given a higher score, because the coefficient shrinks as the position grows.
`exponent`::::
An optional configuration variable for the position_reciprocal blenderType used to control how fast the score will increase or decrease. Default `2.0`.
`numFactor`::
The factor by which the number of requested suggestions is multiplied to obtain the number of candidates searched, from which the final results are pruned. Default is `10`.
`indexPath`::
When using `BlendedInfixSuggester` you can provide your own path where the index will get built. The default directory name is `blendedInfixSuggesterIndexDir` and will be created in your collection's `data/` directory.
`minPrefixChars`::
Minimum number of leading characters before PrefixQuery is used (the default is `4`). Prefixes shorter than this are indexed as character ngrams (increasing index size but making lookups faster).
This implementation supports <<Context Filtering>>.
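
As a worked illustration of the two `blenderType` formulas given above, this Python sketch evaluates them for a few positions. It assumes that `position` is the zero-based position of the first matching word; the function names are hypothetical and simply mirror the formulas as written.

```python
def position_linear(weight: float, position: int) -> float:
    """weightFieldValue * (1 - 0.10 * position)"""
    return weight * (1 - 0.10 * position)


def position_reciprocal(weight: float, position: int) -> float:
    """weightFieldValue / (1 + position)"""
    return weight / (1 + position)


# Compare both coefficients for a weight of 100 at positions 0 to 3.
for position in range(4):
    print(position,
          round(position_linear(100.0, position), 2),
          round(position_reciprocal(100.0, position), 2))
```

With both formulas as written, the resulting score gets smaller as the first matching word moves further from the start of the suggestion.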
==== FreeTextLookupFactory
This suggester looks at the last tokens plus the prefix of whatever final token the user is typing, if present, to predict the most likely next token. The number of previous tokens that need to be considered can also be specified. It would typically be used only as a fallback, when the primary suggester fails to find any suggestions.
This implementation uses the following additional properties:
`suggestFreeTextAnalyzerFieldType`::
The analyzer used at "query-time" and "build-time" to analyze suggestions. This parameter is required.
`ngrams`::
The maximum number of tokens out of which shingles will be made for the dictionary. The default value is `2`. Increasing this means that more than the previous 2 tokens are taken into consideration when making suggestions.
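
The effect of `ngrams` can be pictured with a small Python sketch that lists the token sequences of up to `ngrams` tokens found in a phrase. It assumes plain whitespace tokenization, the function name is hypothetical, and it does not reproduce the model the suggester builds internally.

```python
from typing import List


def token_ngrams(text: str, ngrams: int = 2) -> List[str]:
    """List every run of 1..ngrams consecutive tokens in the text."""
    tokens = text.split()
    result = []
    for size in range(1, ngrams + 1):
        for start in range(len(tokens) - size + 1):
            result.append(" ".join(tokens[start:start + size]))
    return result


print(token_ngrams("electronics and computer", ngrams=2))
```

Raising `ngrams` to `3` in this sketch would add the three-token sequence `electronics and computer` to the list.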
==== FSTLookupFactory
An automaton-based lookup. This implementation is slower to build, but provides the lowest memory cost. We recommend using this implementation unless you need more sophisticated matching results, in which case you should use the Jaspell implementation.
This implementation uses the following additional properties:
`exactMatchFirst`::
If `true`, the default, exact suggestions are returned first, even if they are prefixes of other strings in the FST that have larger weights.
`weightBuckets`::
The number of separate buckets for weights which the suggester will use while building its dictionary.
==== TSTLookupFactory
A simple compact ternary trie based lookup.
==== WFSTLookupFactory
A weighted automaton representation which is an alternative to `FSTLookup` for more fine-grained ranking. `WFSTLookup` does not use buckets, but instead a shortest path algorithm.
Note that it expects weights to be whole numbers. If a weight is missing, it is assumed to be `1.0`. Weights affect the sorting of matching suggestions when `spellcheck.onlyMorePopular=true` is selected: weights are treated as a "popularity" score, with higher weights preferred over suggestions with lower weights.
==== JaspellLookupFactory
A more complex lookup based on a ternary trie from the http://jaspell.sourceforge.net/[JaSpell] project. Use this implementation if you need more sophisticated matching results.
=== Dictionary Implementations
The dictionary implementations define how terms are stored. There are several options, and multiple dictionaries can be used in a single request if necessary.
==== DocumentDictionaryFactory
A dictionary with terms, weights, and an optional payload taken from the index.
This dictionary implementation takes the following parameters in addition to parameters described for the Suggester generally and for the lookup implementation:
`weightField`::
A field that is stored or a numeric DocValue field. This parameter is optional.
`payloadField`::
The `payloadField` should be a field that is stored. This parameter is optional.
`contextField`::
Field to be used for context filtering. Note that only some lookup implementations support filtering.
==== DocumentExpressionDictionaryFactory
This dictionary implementation is the same as the `DocumentDictionaryFactory` but allows users to specify an arbitrary expression in the `weightExpression` parameter.
This dictionary implementation takes the following parameters in addition to parameters described for the Suggester generally and for the lookup implementation:
`payloadField`::
The `payloadField` should be a field that is stored. This parameter is optional.
`weightExpression`::
An arbitrary expression used for scoring the suggestions. The fields used must be numeric fields. This parameter is required.
`contextField`::
Field to be used for context filtering. Note that only some lookup implementations support filtering.
==== HighFrequencyDictionaryFactory
This dictionary implementation allows adding a threshold to prune out less frequent terms in cases where very common terms may overwhelm other terms.
This dictionary implementation takes one parameter in addition to parameters described for the Suggester generally and for the lookup implementation:
`threshold`::
A value between zero and one representing the minimum fraction of the total documents where a term should appear in order to be added to the lookup dictionary.
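
The `threshold` rule can be sketched in Python as a filter over document frequencies. The term counts below are made-up sample data, the function name is hypothetical, and treating the boundary as inclusive is an assumption of this sketch rather than something stated above.

```python
from typing import Dict, List


def prune_by_threshold(doc_freq: Dict[str, int], total_docs: int,
                       threshold: float) -> List[str]:
    """Keep terms that appear in at least `threshold` (a fraction) of all documents."""
    return sorted(
        term for term, freq in doc_freq.items()
        if freq / total_docs >= threshold  # inclusive boundary is an assumption
    )


# Hypothetical document frequencies in an index of 100 documents.
doc_freq = {"electronics": 40, "memory": 12, "zune": 1}
print(prune_by_threshold(doc_freq, total_docs=100, threshold=0.05))
```

With a threshold of `0.05`, the term that appears in only 1 of 100 documents is left out of the lookup dictionary.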
==== FileDictionaryFactory
This dictionary implementation allows using an external file that contains suggest entries. Weights and payloads can also be used.
If using a dictionary file, it should be a plain text file in UTF-8 encoding. You can use both single terms and phrases in the dictionary file. If adding weights or payloads, those should be separated from terms using the delimiter defined with the `fieldDelimiter` property (the default is '\t', the tab representation). If using payloads, the first line in the file *must* specify a payload.
This dictionary implementation takes one parameter in addition to parameters described for the Suggester generally and for the lookup implementation:
`fieldDelimiter`::
Specifies the delimiter used to separate the entries, weights and payloads. The default is tab (`\t`).
+
.Example File
[source,text]
----
acquire
accidentally 2.0
accommodate 3.0
----
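
A dictionary file in this layout can be generated with a few lines of Python. This is a minimal sketch that assumes the default tab delimiter and no payloads; the file name `suggest_dictionary.txt` and the helper `write_dictionary` are hypothetical.

```python
import os
import tempfile

# Terms with optional weights, matching the example file above.
entries = [("acquire", None), ("accidentally", 2.0), ("accommodate", 3.0)]


def write_dictionary(path, entries, delimiter="\t"):
    """Write one entry per line: the term, then an optional weight after the delimiter."""
    with open(path, "w", encoding="utf-8") as handle:
        for term, weight in entries:
            fields = [term] if weight is None else [term, str(weight)]
            handle.write(delimiter.join(fields) + "\n")


path = os.path.join(tempfile.mkdtemp(), "suggest_dictionary.txt")
write_dictionary(path, entries)

# Read the file back to check what was written.
with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```

Writing the file explicitly as UTF-8 and joining fields with the delimiter avoids the common mistake of separating terms and weights with spaces.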
=== Multiple Dictionaries
It is possible to include multiple `dictionaryImpl` definitions in a single SuggestComponent definition.
To do this, simply define separate suggesters, as in this example:
[source,xml]
----
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">cat</str>
    <str name="weightField">price</str>
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
  <lst name="suggester">
    <str name="name">altSuggester</str>
    <str name="dictionaryImpl">DocumentExpressionDictionaryFactory</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="field">product_name</str>
    <str name="weightExpression">((price * 2) + ln(popularity))</str>
    <str name="sortField">weight</str>
    <str name="sortField">price</str>
    <str name="storeDir">suggest_fuzzy_doc_expr_dict</str>
    <str name="suggestAnalyzerFieldType">text_en</str>
  </lst>
</searchComponent>
----
When using these Suggesters in a query, you would define multiple `suggest.dictionary` parameters in the request, referring to the names given for each Suggester in the search component definition. The response will include the terms in sections for each Suggester. See the <<Example Usages>> section below for an example request and response.
== Adding the Suggest Request Handler
After adding the search component, a request handler must be added to `solrconfig.xml`. This request handler works the <<requesthandlers-and-searchcomponents-in-solrconfig.adoc#requesthandlers-and-searchcomponents-in-solrconfig,same as any other request handler>>, and allows you to configure default parameters for serving suggestion requests. The request handler definition must incorporate the "suggest" search component defined previously.
[source,xml]
----
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
----
=== Suggest Request Handler Parameters
The following parameters allow you to set defaults for the Suggest request handler:
`suggest=true`::
This parameter should always be `true`, because we always want to run the Suggester for queries submitted to this handler.
`suggest.dictionary`::
The name of the dictionary component configured in the search component. This is a mandatory parameter. It can be set in the request handler, or sent as a parameter at query time.
`suggest.q`::
The query to use for suggestion lookups.
`suggest.count`::
Specifies the number of suggestions for Solr to return.
`suggest.cfq`::
A Context Filter Query used to filter suggestions based on the context field, if supported by the suggester.
`suggest.build`::
If `true`, it will build the suggester index. This is likely useful only for initial requests; you would probably not want to build the dictionary on every request, particularly in a production system. If you would like to keep your dictionary up to date, you should use the `buildOnCommit` or `buildOnOptimize` parameter for the search component.
`suggest.reload`::
If `true`, it will reload the suggester index.
`suggest.buildAll`::
If `true`, it will build all suggester indexes.
`suggest.reloadAll`::
If `true`, it will reload all suggester indexes.
These properties can also be overridden at query time, or not set in the request handler at all and always sent at query time.
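
When the parameters are sent at query time, a client has to assemble them into the request URL. The following Python sketch shows one way to do that with the standard library; the helper name `suggest_url` is hypothetical, and the host, core and suggester names are taken from the examples in this section.

```python
from urllib.parse import urlencode


def suggest_url(base_url, query, dictionaries, count=None, build=False):
    """Assemble a suggest request URL from the parameters described above."""
    params = [("suggest", "true")]
    # suggest.dictionary may be repeated, once per suggester.
    params += [("suggest.dictionary", name) for name in dictionaries]
    params.append(("suggest.q", query))
    if count is not None:
        params.append(("suggest.count", str(count)))
    if build:
        params.append(("suggest.build", "true"))
    return base_url + "?" + urlencode(params)


url = suggest_url(
    "http://localhost:8983/solr/techproducts/suggest",
    "elec",
    ["mySuggester", "altSuggester"],
    count=5,
)
print(url)
```

Using `urlencode` rather than string concatenation keeps user-typed text in `suggest.q` correctly escaped.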
.Context Filtering
[IMPORTANT]
====
Context filtering (`suggest.cfq`) is currently only supported by `AnalyzingInfixLookupFactory` and `BlendedInfixLookupFactory`, and only when backed by a `Document*Dictionary`. All other implementations will return unfiltered matches as if filtering was not requested.
====
== Example Usages
=== Get Suggestions with Weights
This is a basic suggestion using a single dictionary and a single Solr core.
Example query:
[source,text]
----
http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&suggest.q=elec
----
In this example, we've simply requested the string 'elec' with the `suggest.q` parameter and requested that the suggestion dictionary be built with `suggest.build`. Note, however, that you would likely not want to build the index on every query; instead, use `buildOnCommit` or `buildOnOptimize` if you have regularly changing documents.
Example response:
[source,json]
----
{
  "responseHeader": {
    "status": 0,
    "QTime": 35
  },
  "command": "build",
  "suggest": {
    "mySuggester": {
      "elec": {
        "numFound": 3,
        "suggestions": [
          {
            "term": "electronics and computer1",
            "weight": 2199,
            "payload": ""
          },
          {
            "term": "electronics",
            "weight": 649,
            "payload": ""
          },
          {
            "term": "electronics and stuff2",
            "weight": 279,
            "payload": ""
          }
        ]
      }
    }
  }
}
----
=== Using Multiple Dictionaries
If you have defined multiple dictionaries, you can use them in queries.
Example query:
[source,text]
----
http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.dictionary=altSuggester&suggest.q=elec
----
In this example we have sent the string 'elec' as the `suggest.q` parameter and named two `suggest.dictionary` definitions to be used.
Example response:
[source,json]
----
{
  "responseHeader": {
    "status": 0,
    "QTime": 3
  },
  "suggest": {
    "mySuggester": {
      "elec": {
        "numFound": 1,
        "suggestions": [
          {
            "term": "electronics and computer1",
            "weight": 100,
            "payload": ""
          }
        ]
      }
    },
    "altSuggester": {
      "elec": {
        "numFound": 1,
        "suggestions": [
          {
            "term": "electronics and computer1",
            "weight": 10,
            "payload": ""
          }
        ]
      }
    }
  }
}
----
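
A client consuming this response needs to walk the `suggest` object by suggester name and then by query string. The Python sketch below does this for a condensed copy of the response shown above; the helper name `terms_by_suggester` is hypothetical.

```python
import json

# Condensed copy of the example response above.
response_text = """
{
  "responseHeader": {"status": 0, "QTime": 3},
  "suggest": {
    "mySuggester": {
      "elec": {"numFound": 1, "suggestions": [
        {"term": "electronics and computer1", "weight": 100, "payload": ""}]}},
    "altSuggester": {
      "elec": {"numFound": 1, "suggestions": [
        {"term": "electronics and computer1", "weight": 10, "payload": ""}]}}
  }
}
"""


def terms_by_suggester(response, query):
    """Map each suggester name to its (term, weight) pairs for the given query."""
    result = {}
    for name, by_query in response["suggest"].items():
        suggestions = by_query.get(query, {}).get("suggestions", [])
        result[name] = [(s["term"], s["weight"]) for s in suggestions]
    return result


parsed = terms_by_suggester(json.loads(response_text), "elec")
print(parsed)
```

Keeping the results grouped by suggester lets the application decide how to merge or rank suggestions that come from different dictionaries.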
=== Context Filtering
Context filtering lets you filter suggestions by a separate context field, such as category, department or any other token. The `AnalyzingInfixLookupFactory` and `BlendedInfixLookupFactory` currently support this feature, when backed by `DocumentDictionaryFactory`.
Add `contextField` to your suggester configuration. This example will suggest names and allow filtering by category:
.solrconfig.xml
[source,xml]
----
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <str name="weightField">price</str>
    <str name="contextField">cat</str>
    <str name="suggestAnalyzerFieldType">string</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>
----
Example context filtering suggest query:
[source,text]
----
http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&suggest.q=c&suggest.cfq=memory
----
The suggester will only bring back suggestions for products tagged with 'cat=memory'.
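
The effect of the context filter can be imitated in a few lines of Python. This is only a conceptual sketch: the entries are made-up sample data, the names `Entry` and `suggest` are hypothetical, and a simple prefix match on the whole term stands in for the real lookup.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    term: str      # value of the suggestion field (name)
    weight: int    # value of the weight field (price)
    context: str   # value of the context field (cat)


def suggest(entries: List[Entry], prefix: str,
            context: Optional[str] = None) -> List[str]:
    """Return matching terms, highest weight first, optionally limited to one context."""
    matches = [
        e for e in entries
        if e.term.lower().startswith(prefix.lower())
        and (context is None or e.context == context)
    ]
    return [e.term for e in sorted(matches, key=lambda e: e.weight, reverse=True)]


# Hypothetical sample data.
entries = [
    Entry("Corsair 1GB module", 75, "memory"),
    Entry("Canon printer", 180, "printer"),
    Entry("Crucial 512MB module", 50, "memory"),
]
print(suggest(entries, "c", context="memory"))
```

Without the `context` argument the sketch returns all three terms; with `context="memory"` it returns only the two entries in that category, which mirrors what `suggest.cfq=memory` requests.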