SOLR-10700: Convert PostingsSolrHighlighter to extend UnifiedSolrHighlighter

David Smiley 2017-05-23 11:02:52 -04:00
parent 715d8b0ccf
commit 2218ded2af
4 changed files with 53 additions and 315 deletions


@ -76,6 +76,9 @@ Upgrading from Solr 6.x
* Setting <defaultSearchField> in schema is no longer allowed and will cause an exception.
Please use "df" parameter on the request instead. For more details, see SOLR-10585.
* The PostingsSolrHighlighter is deprecated. Furthermore, it now internally works via a re-configuration
of the UnifiedSolrHighlighter.
New Features
----------------------
* SOLR-9857, SOLR-9858: Collect aggregated metrics from nodes and shard leaders in overseer. (ab)
@ -185,6 +188,9 @@ Other Changes
* SOLR-10378: Clicking Solr logo on AdminUI shows blank page (Takumi Yoshida via janhoy)
* SOLR-10700: Deprecated and converted the PostingsSolrHighlighter to extend UnifiedSolrHighlighter and thus no
longer use the PostingsHighlighter. It should behave mostly the same. (David Smiley)
================== 6.7.0 ==================
Consult the LUCENE_CHANGES.txt file for additional, low level, changes in this release.


@ -16,306 +16,56 @@
*/
package org.apache.solr.highlight;
import java.io.IOException;
import java.text.BreakIterator;
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.lang.invoke.MethodHandles;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.postingshighlight.CustomSeparatorBreakIterator;
import org.apache.lucene.search.postingshighlight.DefaultPassageFormatter;
import org.apache.lucene.search.postingshighlight.Passage;
import org.apache.lucene.search.postingshighlight.PassageFormatter;
import org.apache.lucene.search.postingshighlight.PassageScorer;
import org.apache.lucene.search.postingshighlight.PostingsHighlighter;
import org.apache.lucene.search.postingshighlight.WholeBreakIterator;
import org.apache.solr.common.SolrException;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;
import org.apache.solr.common.params.HighlightParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.SimpleOrderedMap;
import org.apache.solr.core.PluginInfo;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.plugin.PluginInfoInitialized;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* Highlighter impl that uses {@link PostingsHighlighter}
* <p>
* Example configuration:
* <pre class="prettyprint">
* &lt;requestHandler name="/select" class="solr.SearchHandler"&gt;
* &lt;lst name="defaults"&gt;
* &lt;str name="hl.method"&gt;postings&lt;/str&gt;
* &lt;int name="hl.snippets"&gt;1&lt;/int&gt;
* &lt;str name="hl.tag.pre"&gt;&amp;lt;em&amp;gt;&lt;/str&gt;
* &lt;str name="hl.tag.post"&gt;&amp;lt;/em&amp;gt;&lt;/str&gt;
* &lt;str name="hl.tag.ellipsis"&gt;... &lt;/str&gt;
* &lt;bool name="hl.defaultSummary"&gt;true&lt;/bool&gt;
* &lt;str name="hl.encoder"&gt;simple&lt;/str&gt;
* &lt;float name="hl.score.k1"&gt;1.2&lt;/float&gt;
* &lt;float name="hl.score.b"&gt;0.75&lt;/float&gt;
* &lt;float name="hl.score.pivot"&gt;87&lt;/float&gt;
* &lt;str name="hl.bs.language"&gt;&lt;/str&gt;
* &lt;str name="hl.bs.country"&gt;&lt;/str&gt;
* &lt;str name="hl.bs.variant"&gt;&lt;/str&gt;
* &lt;str name="hl.bs.type"&gt;SENTENCE&lt;/str&gt;
* &lt;int name="hl.maxAnalyzedChars"&gt;51200&lt;/int&gt;
* &lt;str name="hl.multiValuedSeparatorChar"&gt; &lt;/str&gt;
* &lt;bool name="hl.highlightMultiTerm"&gt;false&lt;/bool&gt;
* &lt;/lst&gt;
* &lt;/requestHandler&gt;
* </pre>
* <p>
* Notes:
* <ul>
* <li>fields to highlight must be configured with storeOffsetsWithPositions="true"
* <li>hl.q (string) can specify the query
* <li>hl.fl (string) specifies the field list.
* <li>hl.snippets (int) specifies how many underlying passages form the resulting snippet.
* <li>hl.tag.pre (string) specifies text which appears before a highlighted term.
* <li>hl.tag.post (string) specifies text which appears after a highlighted term.
* <li>hl.tag.ellipsis (string) specifies text which joins non-adjacent passages.
* <li>hl.defaultSummary (bool) specifies if a field should have a default summary.
* <li>hl.encoder (string) can be 'html' (html escapes content) or 'simple' (no escaping).
* <li>hl.score.k1 (float) specifies bm25 scoring parameter 'k1'
* <li>hl.score.b (float) specifies bm25 scoring parameter 'b'
* <li>hl.score.pivot (float) specifies bm25 scoring parameter 'avgdl'
* <li>hl.bs.type (string) specifies how to divide text into passages: [SENTENCE, LINE, WORD, CHARACTER, WHOLE, SEPARATOR]
* <li>hl.bs.language (string) specifies language code for BreakIterator. default is empty string (root locale)
* <li>hl.bs.country (string) specifies country code for BreakIterator. default is empty string (root locale)
* <li>hl.bs.variant (string) specifies variant code for BreakIterator. default is empty string (root locale)
* <li>hl.maxAnalyzedChars specifies how many characters at most will be processed in a document.
* <li>hl.multiValuedSeparatorChar specifies the logical separator between values for multi-valued fields.
* <li>hl.highlightMultiTerm enables highlighting for range/wildcard/fuzzy/prefix queries.
* NOTE: currently hl.maxAnalyzedChars cannot yet be specified per-field
* </ul>
* Highlighter impl that uses {@link UnifiedHighlighter} configured to operate as its ancestor/predecessor, the
* {@code PostingsHighlighter}.
*
* @lucene.experimental
* @deprecated Use {@link UnifiedSolrHighlighter} instead
*/
public class PostingsSolrHighlighter extends SolrHighlighter implements PluginInfoInitialized {
@Deprecated
public class PostingsSolrHighlighter extends UnifiedSolrHighlighter {
private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
@Override
public void init(PluginInfo info) {}
public void init(PluginInfo info) {
log.warn("The PostingsSolrHighlighter is deprecated; use the UnifiedSolrHighlighter instead.");
super.init(info);
}
@Override
public NamedList<Object> doHighlighting(DocList docs, Query query, SolrQueryRequest req, String[] defaultFields) throws IOException {
final SolrParams params = req.getParams();
protected UnifiedHighlighter getHighlighter(SolrQueryRequest req) {
// Adjust the highlight parameters to match what the old PostingsHighlighter had.
ModifiableSolrParams invariants = new ModifiableSolrParams();
invariants.set(HighlightParams.OFFSET_SOURCE, "POSTINGS");
invariants.set(HighlightParams.FIELD_MATCH, true);
invariants.set(HighlightParams.USE_PHRASE_HIGHLIGHTER, false);
invariants.set(HighlightParams.FRAGSIZE, -1);
// if highlighting isn't enabled, then why call doHighlighting?
if (!isHighlightingEnabled(params))
return null;
ModifiableSolrParams defaults = new ModifiableSolrParams();
defaults.set(HighlightParams.DEFAULT_SUMMARY, true);
defaults.set(HighlightParams.TAG_ELLIPSIS, "... ");
SolrIndexSearcher searcher = req.getSearcher();
int[] docIDs = toDocIDs(docs);
// fetch the unique keys
String[] keys = getUniqueKeys(searcher, docIDs);
// query-time parameters
String[] fieldNames = getHighlightFields(query, req, defaultFields);
int maxPassages[] = new int[fieldNames.length];
for (int i = 0; i < fieldNames.length; i++) {
maxPassages[i] = params.getFieldInt(fieldNames[i], HighlightParams.SNIPPETS, 1);
}
PostingsHighlighter highlighter = getHighlighter(req);
Map<String,String[]> snippets = highlighter.highlightFields(fieldNames, query, searcher, docIDs, maxPassages);
return encodeSnippets(keys, fieldNames, snippets);
}
/** Creates an instance of the Lucene PostingsHighlighter. Provided for subclass extension so that
* a subclass can return a subclass of {@link PostingsSolrHighlighter.SolrExtendedPostingsHighlighter}. */
protected PostingsHighlighter getHighlighter(SolrQueryRequest req) {
return new SolrExtendedPostingsHighlighter(req);
}
/**
* Encodes the resulting snippets into a namedlist
* @param keys the document unique keys
* @param fieldNames field names to highlight in the order
* @param snippets map from field name to snippet array for the docs
* @return encoded namedlist of summaries
*/
protected NamedList<Object> encodeSnippets(String[] keys, String[] fieldNames, Map<String,String[]> snippets) {
NamedList<Object> list = new SimpleOrderedMap<>();
for (int i = 0; i < keys.length; i++) {
NamedList<Object> summary = new SimpleOrderedMap<>();
for (String field : fieldNames) {
String snippet = snippets.get(field)[i];
// box in an array to match the format of existing highlighters,
// even though it's always one element.
if (snippet == null) {
summary.add(field, new String[0]);
} else {
summary.add(field, new String[] { snippet });
}
}
list.add(keys[i], summary);
}
return list;
}
/** Converts solr's DocList to the int[] docIDs */
protected int[] toDocIDs(DocList docs) {
int[] docIDs = new int[docs.size()];
DocIterator iterator = docs.iterator();
for (int i = 0; i < docIDs.length; i++) {
if (!iterator.hasNext()) {
throw new AssertionError();
}
docIDs[i] = iterator.nextDoc();
}
if (iterator.hasNext()) {
throw new AssertionError();
}
return docIDs;
}
/** Retrieves the unique keys for the topdocs to key the results */
protected String[] getUniqueKeys(SolrIndexSearcher searcher, int[] docIDs) throws IOException {
IndexSchema schema = searcher.getSchema();
SchemaField keyField = schema.getUniqueKeyField();
if (keyField != null) {
Set<String> selector = Collections.singleton(keyField.getName());
String uniqueKeys[] = new String[docIDs.length];
for (int i = 0; i < docIDs.length; i++) {
int docid = docIDs[i];
Document doc = searcher.doc(docid, selector);
String id = schema.printableUniqueKey(doc);
uniqueKeys[i] = id;
}
return uniqueKeys;
} else {
return new String[docIDs.length];
}
}
/** From {@link #getHighlighter(org.apache.solr.request.SolrQueryRequest)}. */
public class SolrExtendedPostingsHighlighter extends PostingsHighlighter {
protected final SolrParams params;
protected final IndexSchema schema;
public SolrExtendedPostingsHighlighter(SolrQueryRequest req) {
super(req.getParams().getInt(HighlightParams.MAX_CHARS, DEFAULT_MAX_CHARS));
this.params = req.getParams();
this.schema = req.getSchema();
}
@Override
protected Passage[] getEmptyHighlight(String fieldName, BreakIterator bi, int maxPassages) {
boolean defaultSummary = params.getFieldBool(fieldName, HighlightParams.DEFAULT_SUMMARY, true);
if (defaultSummary) {
return super.getEmptyHighlight(fieldName, bi, maxPassages);
} else {
//TODO reuse logic of DefaultSolrHighlighter.alternateField
return new Passage[0];
}
}
@Override
protected PassageFormatter getFormatter(String fieldName) {
String preTag = params.getFieldParam(fieldName, HighlightParams.TAG_PRE, "<em>");
String postTag = params.getFieldParam(fieldName, HighlightParams.TAG_POST, "</em>");
String ellipsis = params.getFieldParam(fieldName, HighlightParams.TAG_ELLIPSIS, "... ");
String encoder = params.getFieldParam(fieldName, HighlightParams.ENCODER, "simple");
return new DefaultPassageFormatter(preTag, postTag, ellipsis, "html".equals(encoder));
}
@Override
protected PassageScorer getScorer(String fieldName) {
float k1 = params.getFieldFloat(fieldName, HighlightParams.SCORE_K1, 1.2f);
float b = params.getFieldFloat(fieldName, HighlightParams.SCORE_B, 0.75f);
float pivot = params.getFieldFloat(fieldName, HighlightParams.SCORE_PIVOT, 87f);
return new PassageScorer(k1, b, pivot);
}
@Override
protected BreakIterator getBreakIterator(String field) {
String type = params.getFieldParam(field, HighlightParams.BS_TYPE);
if ("WHOLE".equals(type)) {
return new WholeBreakIterator();
} else if ("SEPARATOR".equals(type)) {
char customSep = parseBiSepChar(params.getFieldParam(field, HighlightParams.BS_SEP));
return new CustomSeparatorBreakIterator(customSep);
} else {
String language = params.getFieldParam(field, HighlightParams.BS_LANGUAGE);
String country = params.getFieldParam(field, HighlightParams.BS_COUNTRY);
String variant = params.getFieldParam(field, HighlightParams.BS_VARIANT);
Locale locale = parseLocale(language, country, variant);
return parseBreakIterator(type, locale);
}
}
/**
* parse custom separator char for {@link CustomSeparatorBreakIterator}
*/
protected char parseBiSepChar(String sepChar) {
if (sepChar == null) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, HighlightParams.BS_SEP + " not passed");
}
if (sepChar.length() != 1) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, HighlightParams.BS_SEP +
" must be a single char but got: '" + sepChar + "'");
}
return sepChar.charAt(0);
}
@Override
protected char getMultiValuedSeparator(String field) {
String sep = params.getFieldParam(field, HighlightParams.MULTI_VALUED_SEPARATOR, " ");
if (sep.length() != 1) {
throw new IllegalArgumentException(HighlightParams.MULTI_VALUED_SEPARATOR + " must be exactly one character.");
}
return sep.charAt(0);
}
@Override
protected Analyzer getIndexAnalyzer(String field) {
if (params.getFieldBool(field, HighlightParams.HIGHLIGHT_MULTI_TERM, false)) {
return schema.getIndexAnalyzer();
} else {
return null;
}
}
}
/** parse a break iterator type for the specified locale */
protected BreakIterator parseBreakIterator(String type, Locale locale) {
if (type == null || "SENTENCE".equals(type)) {
return BreakIterator.getSentenceInstance(locale);
} else if ("LINE".equals(type)) {
return BreakIterator.getLineInstance(locale);
} else if ("WORD".equals(type)) {
return BreakIterator.getWordInstance(locale);
} else if ("CHARACTER".equals(type)) {
return BreakIterator.getCharacterInstance(locale);
} else {
throw new IllegalArgumentException("Unknown " + HighlightParams.BS_TYPE + ": " + type);
}
}
/** parse a locale from a language+country+variant spec */
protected Locale parseLocale(String language, String country, String variant) {
if (language == null && country == null && variant == null) {
return Locale.ROOT;
} else if (language != null && country == null && variant != null) {
throw new IllegalArgumentException("To specify variant, country is required");
} else if (language != null && country != null && variant != null) {
return new Locale(language, country, variant);
} else if (language != null && country != null) {
return new Locale(language, country);
} else {
return new Locale(language);
SolrParams newParams = SolrParams.wrapDefaults(
invariants,// this takes precedence
SolrParams.wrapDefaults(
req.getParams(), // then this (original)
defaults // finally our defaults
)
);
try (LocalSolrQueryRequest fakeReq = new LocalSolrQueryRequest(req.getCore(), newParams)) {
return super.getHighlighter(fakeReq);
}
}
}
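
The parameter layering in `getHighlighter` above can be illustrated with a small standalone sketch. It does not use Solr's `SolrParams` API: the class `LayeredParamsSketch` and its methods `resolve`, `invariants`, `defaults`, and `effective` are hypothetical names introduced only for illustration, and plain maps stand in for the parameter objects. The assumption being modeled is the lookup order stated in the code comments: invariants first, then the original request, then the defaults.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for layered parameter lookup; this is not Solr's SolrParams API. */
public class LayeredParamsSketch {

  /** Returns the value from the first layer that defines the key, or null if no layer does. */
  @SafeVarargs
  public static String resolve(String key, Map<String, String>... layers) {
    for (Map<String, String> layer : layers) {
      String value = layer.get(key);
      if (value != null) {
        return value;
      }
    }
    return null;
  }

  /** Settings that are fixed regardless of what the request asks for. */
  public static Map<String, String> invariants() {
    Map<String, String> m = new LinkedHashMap<>();
    m.put("hl.offsetSource", "POSTINGS");
    m.put("hl.requireFieldMatch", "true");
    m.put("hl.usePhraseHighlighter", "false");
    m.put("hl.fragsize", "-1");
    return m;
  }

  /** Settings that apply only when the request does not specify them. */
  public static Map<String, String> defaults() {
    Map<String, String> m = new LinkedHashMap<>();
    m.put("hl.defaultSummary", "true");
    m.put("hl.tag.ellipsis", "... ");
    return m;
  }

  /** Looks a key up in the order: invariants, then request, then defaults. */
  public static String effective(String key, Map<String, String> request) {
    return resolve(key, invariants(), request, defaults());
  }

  public static void main(String[] args) {
    Map<String, String> request = new LinkedHashMap<>();
    request.put("hl.fragsize", "100");         // ignored: an invariant wins
    request.put("hl.defaultSummary", "false"); // honored: overrides a default
    for (String key : new String[] {"hl.fragsize", "hl.defaultSummary", "hl.tag.ellipsis"}) {
      System.out.println(key + "=" + effective(key, request));
    }
  }
}
```

Under this model, a request that sets `hl.fragsize` has no effect, while a request that sets `hl.defaultSummary` or `hl.tag.ellipsis` overrides the Postings-style default.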


@ -34,17 +34,17 @@
},
"termVectors": {
"type": "boolean",
"description": "If true, term vectors will be stored to be able to compute similarity between two documents. This is required to use More Like This. If this is not defined, it will inherit the value from the fieldType. If the fieldType does not define a value, it will default to false. Do not enable this if using the PostingsHighlighter.",
"description": "If true, term vectors will be stored, which can be used to optimize More Like This and the highlighting of wildcard queries. If this is not defined, it will inherit the value from the fieldType. If the fieldType does not define a value, it will default to false.",
"default": "false"
},
"termPositions": {
"type": "boolean",
"description": "If true, term positions will be stored for use with highlighting. If this is not defined, it will inherit the value from the fieldType. If the fieldType does not define a value, it will default to false. Do not enable this if using the PostingsHighlighter.",
"description": "If true, term vectors will include positions. If this is not defined, it will inherit the value from the fieldType. If the fieldType does not define a value, it will default to false.",
"default": "false"
},
"termOffsets": {
"type": "boolean",
"description": "If true, term offsets will be stored for use with highlighting. If this is not defined, it will inherit the value from the fieldType. If the fieldType does not define a value, it will default to false. Do not enable this if using the PostingsHighlighter.",
"description": "If true, term vectors will include offsets. If this is not defined, it will inherit the value from the fieldType. If the fieldType does not define a value, it will default to false.",
"default": "false"
},
"multiValued": {
@ -73,7 +73,7 @@
},
"storeOffsetsWithPositions": {
"type": "boolean",
"description": "If true, term offsets will be stored with positions in the postings list in the index. This is required if using the PostingsHighlighter. If this is not defined, it will inherit the value from the fieldType. If the fieldType does not define a value, it will default to false.",
"description": "If true, term offsets will be stored with positions in the postings list in the index. This optimizes highlighting with the UnifiedHighlighter. If this is not defined, it will inherit the value from the fieldType. If the fieldType does not define a value, it will default to false.",
"default": "false"
},
"docValues": {


@ -137,11 +137,14 @@ This highlighter's query-representation is less advanced than the Original or Un
+
Note that both the FastVector and Original Highlighters can be used in conjunction in a search request to highlight some fields with one and some the other. In contrast, the other highlighters can only be chosen exclusively.
<<Highlighting-ThePostingsHighlighter,Postings Highlighter>>:: (`hl.method=postings`)
The Postings Highlighter:: (`hl.method=postings`)
+
The Postings Highlighter is the ancestor of the Unified Highlighter, supporting a subset of its options and none of its index configuration flexibility - it _requires_ `storeOffsetsWithPositions` on all fields to highlight. This option is here for backwards compatibility; if you find you need it, please share your experience with the Solr community.
The Postings Highlighter is the ancestor of the Unified Highlighter, supporting a subset of its options and none of its index configuration flexibility - it _requires_ `storeOffsetsWithPositions` on all fields to highlight.
This option is here for backwards compatibility; it's deprecated.
In 7.0, it is internally implemented as a reconfiguration of the Unified Highlighter.
See older reference guide editions for its options.
The Unified Highlighter and Postings Highlighter from which it derives, are exclusively configured via search parameters. In contrast, some settings for the Original and FastVector Highlighters are set in `solrconfig.xml`. There's a robust example of the latter in the "```techproducts```" configset.
The Unified Highlighter is exclusively configured via search parameters. In contrast, some settings for the Original and FastVector Highlighters are set in `solrconfig.xml`. There's a robust example of the latter in the "```techproducts```" configset.
In addition to further information below, more information can be found in the {solr-javadocs}/solr-core/org/apache/solr/highlight/package-summary.html[Solr javadocs].
@ -157,7 +160,7 @@ The benefit of this approach is that your index won't grow larger with any extra
The down side is that highlighting speed is roughly linear with the amount of text to process, with a large factor being the complexity of your analysis chain.
+
For "short" text, this is a good choice. Or maybe it's not short but you're prioritizing a smaller index and indexing speed over highlighting performance.
* *Postings*: Supported by the Unified and Postings Highlighters. Set `storeOffsetsWithPositions` to `true`. This adds a moderate amount of extra data to the index but it speeds up highlighting tremendously, especially compared to analysis with longer text fields.
* *Postings*: Supported by the Unified Highlighter. Set `storeOffsetsWithPositions` to `true`. This adds a moderate amount of extra data to the index but it speeds up highlighting tremendously, especially compared to analysis with longer text fields.
+
However, wildcard queries will fall back to analysis unless "light" term vectors are added.
@ -193,27 +196,6 @@ The Unified Highlighter supports these following additional parameters to the on
|hl.bs.separator |_(blank)_ |Indicates which character to break the text on. Requires `hl.bs.type=SEPARATOR`. This is useful when the text has already been manipulated in advance to have a special delineation character at desired highlight passage boundaries. This character will still appear in the text as the last character of a passage.
|===
[[Highlighting-ThePostingsHighlighter]]
=== The Postings Highlighter
The Postings Highlighter is the ancestor of the Unified Highlighter, supporting a subset of its options and sometimes with different default settings for some common parameters.
Viewed from the perspective of the Unified Highlighter, these settings are effectively non-settings and fixed as-such:
* `hl.offsetSource=POSTINGS`
* `hl.requireFieldMatch=true`
* `hl.usePhraseHighlighter=false`
* `hl.fragsize=-1` (none).
It has these different default settings:
* `hl.defaultSummary=true`
* `hl.tag.ellipsis="... "`.
In addition, it has a setting `hl.multiValuedSeparatorChar=" "` (space).
This highlighter never returns separate snippets as separate values; they are always joined by `hl.tag.ellipsis`.
[[Highlighting-TheOriginalHighlighter]]
== The Original Highlighter