Improve similarity docs. (#29089)
This adds links to the relevant Lucene javadocs and warnings regarding similarities that might return 0 as a score. Close #29015
parent 08c530907a
commit 1d6ed824c7
|
@ -97,22 +97,38 @@ similarity has the following option:
|
||||||
Type name: `classic`
|
Type name: `classic`
|
||||||
|
|
||||||
[float]
|
[float]
|
||||||
[[drf]]
|
[[dfr]]
|
||||||
==== DFR similarity
|
==== DFR similarity
|
||||||
|
|
||||||
Similarity that implements the
|
Similarity that implements the
|
||||||
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/DFRSimilarity.html[divergence
|
||||||
from randomness] framework. This similarity has the following options:
|
from randomness] framework. This similarity has the following options:
|
||||||
|
|
||||||
[horizontal]
|
[horizontal]
|
||||||
`basic_model`::
|
`basic_model`::
|
||||||
Possible values: `be`, `d`, `g`, `if`, `in`, `ine` and `p`.
|
Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelBE.html[`be`],
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelD.html[`d`],
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelG.html[`g`],
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIF.html[`if`],
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIn.html[`in`],
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelIne.html[`ine`] and
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/BasicModelP.html[`p`].
|
||||||
|
|
||||||
|
`be`, `d` and `p` should be avoided in practice as they might return scores that
|
||||||
|
are equal to 0 or infinite with terms that do not meet the expected random
|
||||||
|
distribution.
|
||||||
|
|
||||||
`after_effect`::
|
`after_effect`::
|
||||||
Possible values: `no`, `b` and `l`.
|
Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffect.NoAfterEffect.html[`no`],
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectB.html[`b`] and
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/AfterEffectL.html[`l`].
|
||||||
|
|
||||||
`normalization`::
|
`normalization`::
|
||||||
Possible values: `no`, `h1`, `h2`, `h3` and `z`.
|
Possible values: {lucene-core-javadoc}/org/apache/lucene/search/similarities/Normalization.NoNormalization.html[`no`],
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH1.html[`h1`],
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH2.html[`h2`],
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationH3.html[`h3`] and
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/NormalizationZ.html[`z`].
|
||||||
|
|
||||||
All options but the first option need a normalization value.
|
All options but the first option need a normalization value.
|
||||||
|
|
||||||
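To make the DFR options above concrete, here is a minimal Python sketch that assembles an index-settings body for a custom DFR similarity. The helper `dfr_similarity`, the similarity name `my_dfr`, and the chosen values are assumptions made for illustration. The `normalization.h2.c` key follows the pattern shown in the Elasticsearch similarity documentation for passing the normalization value. The warning mirrors the note about `be`, `d` and `p`.

```python
import json
import warnings

# Basic models the documentation above advises against.
RISKY_BASIC_MODELS = {"be", "d", "p"}


def dfr_similarity(basic_model="g", after_effect="l"):
    """Hypothetical helper: settings for one DFR similarity using h2 normalization."""
    if basic_model in RISKY_BASIC_MODELS:
        warnings.warn(
            f"basic_model {basic_model!r} might return scores equal to 0 or infinite"
        )
    return {
        "type": "DFR",
        "basic_model": basic_model,
        "after_effect": after_effect,
        "normalization": "h2",
        # Normalization value for h2; the value here is only an example.
        "normalization.h2.c": "3.0",
    }


# "my_dfr" is an arbitrary similarity name chosen for this sketch.
dfr_settings = {"settings": {"index": {"similarity": {"my_dfr": dfr_similarity()}}}}
print(json.dumps(dfr_settings, indent=2))
```

The printed JSON could then be sent as the body of an index-creation request. Calling `dfr_similarity(basic_model="p")` still returns settings but emits a warning first.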
|
@ -127,7 +143,14 @@ model.
|
||||||
This similarity has the following options:
|
This similarity has the following options:
|
||||||
|
|
||||||
[horizontal]
|
[horizontal]
|
||||||
`independence_measure`:: Possible values `standardized`, `saturated`, `chisquared`.
|
`independence_measure`:: Possible values
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceStandardized.html[`standardized`],
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceSaturated.html[`saturated`],
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/IndependenceChiSquared.html[`chisquared`].
|
||||||
|
|
||||||
|
When using this similarity, it is highly recommended to remove stop words to get
|
||||||
|
good relevance. Also beware that terms whose frequency is less than the expected
|
||||||
|
frequency will get a score equal to 0.
|
||||||
|
|
||||||
Type name: `DFI`
|
Type name: `DFI`
|
||||||
|
|
||||||
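The DFI recommendation to remove stop words can be sketched as index settings that pair the similarity with a stop-word-removing analyzer. This is an illustrative Python sketch, not a prescribed setup: the helper `dfi_index_settings` and the names `my_dfi` and `english_no_stop` are assumptions, and the `standard` analyzer with `"stopwords": "_english_"` is just one possible way to drop stop words.

```python
import json

# Values listed for `independence_measure` in the text above.
ALLOWED_MEASURES = {"standardized", "saturated", "chisquared"}


def dfi_index_settings(independence_measure="standardized"):
    """Hypothetical helper: DFI similarity plus an analyzer that drops stop words."""
    if independence_measure not in ALLOWED_MEASURES:
        raise ValueError(f"unknown independence_measure: {independence_measure!r}")
    return {
        "settings": {
            "index": {
                "similarity": {
                    # "my_dfi" is an arbitrary name chosen for this sketch.
                    "my_dfi": {
                        "type": "DFI",
                        "independence_measure": independence_measure,
                    }
                },
                "analysis": {
                    "analyzer": {
                        # Standard analyzer configured with the English stop-word list.
                        "english_no_stop": {
                            "type": "standard",
                            "stopwords": "_english_",
                        }
                    }
                },
            }
        }
    }


print(json.dumps(dfi_index_settings("saturated"), indent=2))
```

A text field would then need to reference both `my_dfi` and `english_no_stop` in its mapping for the combination to take effect.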
|
@ -135,15 +158,19 @@ Type name: `DFI`
|
||||||
[[ib]]
|
[[ib]]
|
||||||
==== IB similarity.
|
==== IB similarity.
|
||||||
|
|
||||||
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/IBSimilarity.html[Information
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/IBSimilarity.html[Information
|
||||||
based model] . The algorithm is based on the concept that the information content in any symbolic 'distribution'
|
based model] . The algorithm is based on the concept that the information content in any symbolic 'distribution'
|
||||||
sequence is primarily determined by the repetitive usage of its basic elements.
|
sequence is primarily determined by the repetitive usage of its basic elements.
|
||||||
For written texts this challenge would correspond to comparing the writing styles of different authors.
|
For written texts this challenge would correspond to comparing the writing styles of different authors.
|
||||||
This similarity has the following options:
|
This similarity has the following options:
|
||||||
|
|
||||||
[horizontal]
|
[horizontal]
|
||||||
`distribution`:: Possible values: `ll` and `spl`.
|
`distribution`:: Possible values:
|
||||||
`lambda`:: Possible values: `df` and `ttf`.
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionLL.html[`ll`] and
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/DistributionSPL.html[`spl`].
|
||||||
|
`lambda`:: Possible values:
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaDF.html[`df`] and
|
||||||
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/LambdaTTF.html[`ttf`].
|
||||||
`normalization`:: Same as in `DFR` similarity.
|
`normalization`:: Same as in `DFR` similarity.
|
||||||
|
|
||||||
Type name: `IB`
|
Type name: `IB`
|
||||||
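As a small illustration of the IB options, the Python sketch below checks the `distribution` and `lambda` values against the lists above before building the similarity settings. The helper name `ib_similarity` is hypothetical, and `normalization` is set to `no` only because that is the one option that needs no normalization value.

```python
# Values listed for the IB options in the text above.
IB_DISTRIBUTIONS = {"ll", "spl"}
IB_LAMBDAS = {"df", "ttf"}


def ib_similarity(distribution="ll", lambda_="df"):
    """Hypothetical helper: settings for one IB similarity without normalization."""
    if distribution not in IB_DISTRIBUTIONS:
        raise ValueError(f"unknown distribution: {distribution!r}")
    if lambda_ not in IB_LAMBDAS:
        raise ValueError(f"unknown lambda: {lambda_!r}")
    return {
        "type": "IB",
        "distribution": distribution,
        "lambda": lambda_,
        "normalization": "no",
    }


print(ib_similarity("spl", "ttf"))
```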
|
@ -152,19 +179,23 @@ Type name: `IB`
|
||||||
[[lm_dirichlet]]
|
[[lm_dirichlet]]
|
||||||
==== LM Dirichlet similarity.
|
==== LM Dirichlet similarity.
|
||||||
|
|
||||||
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMDirichletSimilarity.html[LM
|
||||||
Dirichlet similarity] . This similarity has the following options:
|
Dirichlet similarity] . This similarity has the following options:
|
||||||
|
|
||||||
[horizontal]
|
[horizontal]
|
||||||
`mu`:: Defaults to `2000`.
|
`mu`:: Defaults to `2000`.
|
||||||
|
|
||||||
|
The scoring formula in the paper assigns negative scores to terms that have
|
||||||
|
fewer occurrences than predicted by the language model, which Lucene does not
|
||||||
|
allow, so such terms get a score of 0.
|
||||||
|
|
||||||
Type name: `LMDirichlet`
|
Type name: `LMDirichlet`
|
||||||
|
|
||||||
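The clamping behaviour described for LM Dirichlet can be illustrated with a short Python sketch. The formula is my reading of a Dirichlet-smoothed term score of the kind Lucene's `LMDirichletSimilarity` computes. Treat it as an assumption rather than the exact implementation. The function name and the sample inputs are made up for the example.

```python
import math


def lm_dirichlet_score(freq, doc_len, collection_prob, mu=2000.0, boost=1.0):
    """Sketch of a Dirichlet-smoothed term score, clamped so it is never negative."""
    raw = boost * (
        math.log(1 + freq / (mu * collection_prob))
        + math.log(mu / (doc_len + mu))
    )
    # Negative scores are replaced by 0, as the note above describes.
    return max(raw, 0.0)


# Term occurs more often than the collection model predicts: positive score.
print(lm_dirichlet_score(freq=5, doc_len=100, collection_prob=0.001))
# Term occurs less often than predicted in a long document: clamped to 0.
print(lm_dirichlet_score(freq=1, doc_len=10000, collection_prob=0.01))
```

In the second call the unclamped value is negative, so the sketch returns 0, which is why such terms stop contributing to the ranking.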
[float]
|
[float]
|
||||||
[[lm_jelinek_mercer]]
|
[[lm_jelinek_mercer]]
|
||||||
==== LM Jelinek Mercer similarity.
|
==== LM Jelinek Mercer similarity.
|
||||||
|
|
||||||
http://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
|
{lucene-core-javadoc}/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html[LM
|
||||||
Jelinek Mercer similarity] . The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:
|
Jelinek Mercer similarity] . The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:
|
||||||
|
|
||||||
[horizontal]
|
[horizontal]
|
||||||
|
|