mirror of https://github.com/apache/lucene.git
SOLR-14254: Docs for text tagger: FST50 trade-off (#1332)
parent adb829cf35
commit cbd0dcb5df
@@ -46,6 +46,8 @@ for an overview of the main new features of Solr 8.5.
When upgrading to 8.5.x users should be aware of the following major changes from 8.4.

__Note: an index incompatibility warning was retroactively added below to 8.4 for users choosing a non-default postings format (e.g. "FST50").__

*Considerations for a SolrCloud Upgrade*

Solr 8.5 introduces a change in the format used for the elements in the Overseer queues and maps (see https://issues.apache.org/jira/browse/SOLR-14095[SOLR-14095] for technical discussion of the change). This queue is used internally by the Overseer to reliably handle
@@ -134,6 +136,11 @@ for an overview of the main new features of Solr 8.4.
When upgrading to 8.4.x users should be aware of the following major changes from 8.3.

*Reminder:* If you set the `postingsFormat` or `docValuesFormat` in the schema in order to use a non-default option, you risk being unable to upgrade your Lucene/Solr software to future versions.
Multiple non-default postings formats changed in 8.4, rendering index data written with them by a previous version unreadable.
This includes "FST50", which was recommended by the Solr TaggerHandler for performance reasons.
There is now improved documentation to navigate this trade-off choice.

*Package Management System*

Version 8.4 introduces a package management system to Solr. The goals of the

@@ -43,7 +43,7 @@ To configure the tagger, your Solr schema needs 2 fields:
Recommended field settings: set `docValues=true`.
* A tag field, which must be a `TextField`, with `ConcatenateGraphFilterFactory` at the end of the index chain (not the query chain):
Set `preservePositionIncrements=false` on that filter.
Recommended field settings: `omitNorms=true`, `omitTermFreqAndPositions=true` and `postingsFormat=FST50`.
Recommended field settings: `omitNorms=true`, `omitTermFreqAndPositions=true` and _maybe_ specify the postings format -- see <<tagger-performance-tips,performance tips>>.

The text field's _index analysis chain_, aside from needing `ConcatenateGraphFilterFactory` at the end,
can otherwise have whatever tokenizer and filters suit your matching preferences.
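
As an illustration only (this is not part of the commit), the recommended tag field settings above could be expressed as a schema fragment along the following lines. The names `tag`, `name` and `name_tag`, and the choice of tokenizer and lowercase filter, are assumptions made for the example. `postingsFormat` is deliberately left off the field type and mentioned only in a comment, given the upgrade trade-off.

[source,xml]
----
<!-- Hypothetical field type name "tag"; tokenizer and filters are illustrative choices. -->
<fieldType name="tag" class="solr.TextField" positionIncrementGap="100"
           omitNorms="true" omitTermFreqAndPositions="true">
  <!-- Optionally add postingsFormat="FST50" above for speed, accepting the
       upgrade risk. Per-field postings formats assume solrconfig.xml uses a
       schema-aware codec factory such as solr.SchemaCodecFactory. -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Must come last, and only in the index chain. -->
    <filter class="solr.ConcatenateGraphFilterFactory" preservePositionIncrements="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field names: "name" holds the dictionary text, "name_tag" is the tag field. -->
<field name="name_tag" type="tag" stored="false"/>
<copyField source="name" dest="name_tag"/>
----

The query analyzer mirrors the index analyzer minus the final concatenation step, so that input text is tokenized the same way as the dictionary entries.
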
@@ -271,11 +271,12 @@ The response should be this (the QTime may vary):
}}
----

== Tagger Tips
== Tagger Performance Tips

Performance Tips:

* Follow the recommended configuration field settings, especially `postingsFormat=FST50`.
* Follow the recommended configuration field settings above.
Additionally, for the best tagger performance, set `postingsFormat=FST50`.
However, non-default postings formats have no backwards-compatibility guarantees, so after upgrading Solr you may hit an exception on startup when it fails to read the older index.
If the input text to be tagged is small (e.g. you are tagging queries or tweets) then the postings format choice isn't as important.
* After loading your dictionary, "optimize" the index down to 1 Lucene segment, or at least to as few as possible.
* For bulk tagging lots of documents, there are some strategies, not mutually exclusive:
** Batch them.