SOLR-14254: Docs for text tagger: FST50 trade-off (#1332)

David Smiley 2020-03-13 22:02:01 -04:00 committed by GitHub
parent adb829cf35
commit cbd0dcb5df
2 changed files with 13 additions and 5 deletions

View File

@ -46,6 +46,8 @@ for an overview of the main new features of Solr 8.5.
When upgrading to 8.5.x users should be aware of the following major changes from 8.4.
__Note: an index incompatibility warning was retroactively added below to 8.4 for users choosing a non-default postings format (e.g. "FST50").__
*Considerations for a SolrCloud Upgrade*
Solr 8.5 introduces a change in the format used for the elements in the Overseer queues and maps (see https://issues.apache.org/jira/browse/SOLR-14095[SOLR-14095] for technical discussion of the change). This queue is used internally by the Overseer to reliably handle
@ -134,6 +136,11 @@ for an overview of the main new features of Solr 8.4.
When upgrading to 8.4.x users should be aware of the following major changes from 8.3.
*Reminder:* If you set the `postingsFormat` or `docValuesFormat` in the schema in order to use a non-default option, you risk preventing yourself from upgrading your Lucene/Solr software to future versions.
Multiple non-default postings formats changed in 8.4, thus rendering the index data from a previous version incompatible.
This includes "FST50" which was recommended by the Solr TaggerHandler for performance reasons.
There is now improved documentation to navigate this trade-off choice.
*Package Management System*
Version 8.4 introduces a package management system to Solr. The goals of the

View File

@ -43,7 +43,7 @@ To configure the tagger, your Solr schema needs 2 fields:
Recommended field settings: set `docValues=true`.
* A tag field, which must be a `TextField`, with `ConcatenateGraphFilterFactory` at the end of the index chain (not the query chain):
Set `preservePositionIncrements=false` on that filter.
Recommended field settings: `omitNorms=true`, `omitTermFreqAndPositions=true` and `postingsFormat=FST50`.
Recommended field settings: `omitNorms=true`, `omitTermFreqAndPositions=true` and _maybe_ specify the postings format -- see <<tagger-performance-tips,performance tips>>.
The text field's _index analysis chain_, aside from needing `ConcatenateGraphFilterFactory` at the end,
can otherwise have whatever tokenizer and filters suit your matching preferences.
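
As a rough illustration of the tag field settings described in this hunk, the sketch below builds a Schema API `add-field-type` command in Python. It is not taken from the commit: the field type name `tag`, the helper `tag_field_type`, and the choice of `StandardTokenizerFactory` with `LowerCaseFilterFactory` are assumptions made for the example. The postings format is left unset unless the caller opts in, mirroring the "_maybe_ specify the postings format" wording.

```python
import json


def tag_field_type(name="tag", postings_format=None):
    """Build a Schema API "add-field-type" command for a tagger tag field.

    `name` and the tokenizer/filter choices are illustrative assumptions;
    only the settings named in the documentation text are required.
    """
    # Filters shared by the index and query chains (an arbitrary example choice).
    shared_filters = [{"class": "solr.LowerCaseFilterFactory"}]

    field_type = {
        "name": name,
        "class": "solr.TextField",
        "omitNorms": True,
        "omitTermFreqAndPositions": True,
        "indexAnalyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            # ConcatenateGraphFilterFactory goes last, on the index chain only.
            "filters": shared_filters + [
                {
                    "class": "solr.ConcatenateGraphFilterFactory",
                    "preservePositionIncrements": False,
                }
            ],
        },
        "queryAnalyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": list(shared_filters),
        },
    }

    # Opt in to a non-default postings format (e.g. "FST50") only when the
    # speed gain is worth giving up index backwards compatibility.
    if postings_format is not None:
        field_type["postingsFormat"] = postings_format

    return {"add-field-type": field_type}


# Default: no postings format, so the index stays upgrade-friendly.
print(json.dumps(tag_field_type(), indent=2))
```

Calling `tag_field_type(postings_format="FST50")` would produce the faster but upgrade-sensitive variant. The resulting JSON could then be POSTed to a collection's `/schema` endpoint.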
@ -271,11 +271,12 @@ The response should be this (the QTime may vary):
}}
----
== Tagger Tips
== Tagger Performance Tips
Performance Tips:
* Follow the recommended configuration field settings, especially `postingsFormat=FST50`.
* Follow the recommended configuration field settings above.
Additionally, for the best tagger performance, set `postingsFormat=FST50`.
However, non-default postings formats have no backwards-compatibility guarantees, and so if you upgrade Solr then you may find a nasty exception on startup as it fails to read the older index.
If the input text to be tagged is small (e.g. you are tagging queries or tweets) then the postings format choice isn't as important.
* "optimize" after loading your dictionary down to 1 Lucene segment, or at least to as few as possible.
* For bulk tagging lots of documents, there are some strategies, not mutually exclusive:
** Batch them.