significant terms: infrastructure for easily changing the significance heuristic

This commit adds the infrastructure to allow plugging in different
measures for computing the significance of a term.
Significance measures can be provided externally by overriding

- SignificanceHeuristic
- SignificanceHeuristicBuilder
- SignificanceHeuristicParser

closes #6561
Britta Weber 2014-05-14 17:39:07 +02:00
parent 8865e60e93
commit 74927adced
26 changed files with 1747 additions and 154 deletions
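As an illustration of the three extension points above, here is a minimal sketch of an externally provided heuristic. Everything except the interfaces is hypothetical: the SimpleRatio class, its name and its scoring are invented for illustration and are not part of this commit. It would then be registered via the SignificantTermsHeuristicModule added below.

package org.example.heuristics;

import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.*;

import java.io.IOException;

public class SimpleRatio implements SignificanceHeuristic {

    protected static final String[] NAMES = {"simple_ratio"};

    public static final SignificanceHeuristicStreams.Stream STREAM = new SignificanceHeuristicStreams.Stream() {
        @Override
        public SignificanceHeuristic readResult(StreamInput in) throws IOException {
            return new SimpleRatio();
        }

        @Override
        public String getName() {
            return NAMES[0];
        }
    };

    @Override
    public double getScore(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
        // favour terms that are relatively more frequent in the subset than in the superset
        if (subsetSize == 0 || supersetSize == 0 || supersetFreq == 0) {
            return 0;
        }
        return ((double) subsetFreq / subsetSize) / ((double) supersetFreq / supersetSize);
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeString(STREAM.getName());
    }

    public static class SimpleRatioParser implements SignificanceHeuristicParser {
        @Override
        public SignificanceHeuristic parse(XContentParser parser) throws IOException {
            parser.nextToken(); // consume the closing bracket of the empty "simple_ratio" object
            return new SimpleRatio();
        }

        @Override
        public String[] getNames() {
            return NAMES;
        }
    }

    public static class SimpleRatioBuilder implements SignificanceHeuristicBuilder {
        @Override
        public void toXContent(XContentBuilder builder) throws IOException {
            builder.startObject(NAMES[0]).endObject();
        }
    }
}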

View File

@ -194,10 +194,7 @@ where a simple `terms` aggregation would typically show the very popular "consta
.How are the scores calculated?
**********************************
The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users.
The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favour
common terms whereas the _relative_ change in popularity (foregroundPercent/ backgroundPercent) would favour rare terms.
Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users. The scores are derived from the doc frequencies in _foreground_ and _background_ sets. In brief, a term is considered significant if there is a noticeable difference between the frequency with which it appears in the subset and in the background. How the terms are ranked can be configured; see the "Parameters" section.
**********************************
@ -282,7 +279,35 @@ However, the `size` and `shard size` settings covered in the next section provid
==== Parameters
===== JLH score
The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the _relative_ change in popularity (foregroundPercent / backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
===== mutual information
added[1.3.0]
Mutual information as described in "Information Retrieval", Manning et al., Chapter 13.5.1, can be used as the significance score by adding the parameter
[source,js]
--------------------------------------------------
"mutual_information": {
"include_negatives": true
}
--------------------------------------------------
Mutual information does not differentiate between terms that are descriptive for the subset or for documents outside the subset. The significant terms can therefore include terms that appear more or less frequently in the subset than outside it. To filter out the terms that appear less often in the subset than in documents outside the subset, `include_negatives` can be set to `false`.
By default, the assumption is that the documents in the bucket are also contained in the background. If instead you define a custom background filter that represents a different set of documents that you want to compare to, set
[source,js]
--------------------------------------------------
"background_is_superset": false
--------------------------------------------------
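For illustration, selecting the heuristic in a complete request might look as follows (the index, query and field names here are invented):

[source,js]
--------------------------------------------------
{
    "query": {
        "terms": {"text": ["crime"]}
    },
    "aggregations": {
        "significant_crime_types": {
            "significant_terms": {
                "field": "crime_type",
                "mutual_information": {
                    "include_negatives": true,
                    "background_is_superset": true
                }
            }
        }
    }
}
--------------------------------------------------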
===== Size & Shard Size
The `size` parameter can be set to define how many term buckets should be returned out of the overall terms list. By
@ -338,7 +363,7 @@ Terms that score highly will be collected on a shard level and merged with the t
added[1.2.0] `shard_min_doc_count` parameter
The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low frequent words and you are not interested in these (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a resonable certainty not reach the required `min_doc_count` even after merging the local frequencies. `shard_min_doc_count` is set to `1` per default and has no effect unless you explicitly set it.
The parameter `shard_min_doc_count` regulates the _certainty_ a shard has whether the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low-frequency words and you are not interested in these (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required `min_doc_count` even after merging the local frequencies. `shard_min_doc_count` is set to `1` by default and has no effect unless you explicitly set it.
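For example, both thresholds combined might be set as follows (values invented):

[source,js]
--------------------------------------------------
"significant_terms": {
    "field": "text",
    "min_doc_count": 10,
    "shard_min_doc_count": 5
}
--------------------------------------------------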

View File

@ -55,6 +55,16 @@ public class ParseField {
return underscoreName;
}
public String[] getAllNamesIncludedDeprecated() {
String[] allNames = new String[2 + deprecatedNames.length];
allNames[0] = camelCaseName;
allNames[1] = underscoreName;
for (int i = 0; i < deprecatedNames.length; i++) {
allNames[i + 2] = deprecatedNames[i];
}
return allNames;
}
public ParseField withDeprecation(String... deprecatedNames) {
return new ParseField(this.underscoreName, deprecatedNames);
}
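A brief usage sketch of the new helper (the deprecated alias below is invented for illustration):

ParseField field = new ParseField("mutual_information").withDeprecation("mutual_info");
// returns the camel case, underscore and deprecated variants, i.e.
// {"mutualInformation", "mutual_information", "mutual_info"}
String[] allNames = field.getAllNamesIncludedDeprecated();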

View File

@ -27,6 +27,7 @@ import org.elasticsearch.index.query.functionscore.FunctionScoreModule;
import org.elasticsearch.index.search.morelikethis.MoreLikeThisFetchService;
import org.elasticsearch.search.action.SearchServiceTransportAction;
import org.elasticsearch.search.aggregations.AggregationModule;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificantTermsHeuristicModule;
import org.elasticsearch.search.controller.SearchPhaseController;
import org.elasticsearch.search.dfs.DfsPhase;
import org.elasticsearch.search.facet.FacetModule;
@ -50,7 +51,7 @@ public class SearchModule extends AbstractModule implements SpawnModules {
@Override
public Iterable<? extends Module> spawnModules() {
return ImmutableList.of(new TransportSearchModule(), new FacetModule(), new HighlightModule(), new SuggestModule(), new FunctionScoreModule(), new AggregationModule());
return ImmutableList.of(new TransportSearchModule(), new FacetModule(), new HighlightModule(), new SuggestModule(), new FunctionScoreModule(), new AggregationModule(), new SignificantTermsHeuristicModule());
}
@Override

View File

@ -99,7 +99,7 @@ public class GlobalOrdinalsSignificantTermsAggregator extends GlobalOrdinalsStri
// that are for this shard only
// Back at the central reducer these properties will be updated with
// global stats
spare.updateScore();
spare.updateScore(termsAggFactory.getSignificanceHeuristic());
if (spare.subsetDf >= bucketCountThresholds.getShardMinDocCount()) {
spare = (SignificantStringTerms.Bucket) ordered.insertWithOverflow(spare);
}
@ -114,7 +114,7 @@ public class GlobalOrdinalsSignificantTermsAggregator extends GlobalOrdinalsStri
list[i] = bucket;
}
return new SignificantStringTerms(subsetSize, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), Arrays.asList(list));
return new SignificantStringTerms(subsetSize, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), termsAggFactory.getSignificanceHeuristic(), Arrays.asList(list));
}
@Override
@ -123,7 +123,7 @@ public class GlobalOrdinalsSignificantTermsAggregator extends GlobalOrdinalsStri
ContextIndexSearcher searcher = context.searchContext().searcher();
IndexReader topReader = searcher.getIndexReader();
int supersetSize = topReader.numDocs();
return new SignificantStringTerms(0, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), Collections.<InternalSignificantTerms.Bucket>emptyList());
return new SignificantStringTerms(0, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), termsAggFactory.getSignificanceHeuristic(), Collections.<InternalSignificantTerms.Bucket>emptyList());
}
@Override

View File

@ -25,6 +25,7 @@ import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.InternalAggregation;
import org.elasticsearch.search.aggregations.InternalAggregations;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristic;
import java.util.*;
@ -33,6 +34,7 @@ import java.util.*;
*/
public abstract class InternalSignificantTerms extends InternalAggregation implements SignificantTerms, ToXContent, Streamable {
protected SignificanceHeuristic significanceHeuristic;
protected int requiredSize;
protected long minDocCount;
protected Collection<Bucket> buckets;
@ -42,7 +44,6 @@ public abstract class InternalSignificantTerms extends InternalAggregation imple
protected InternalSignificantTerms() {} // for serialization
// TODO updateScore call in constructor to be cleaned up as part of adding pluggable scoring algos
@SuppressWarnings("PMD.ConstructorCallsOverridableMethod")
public static abstract class Bucket extends SignificantTerms.Bucket {
@ -53,7 +54,6 @@ public abstract class InternalSignificantTerms extends InternalAggregation imple
protected Bucket(long subsetDf, long subsetSize, long supersetDf, long supersetSize, InternalAggregations aggregations) {
super(subsetDf, subsetSize, supersetDf, supersetSize);
this.aggregations = aggregations;
updateScore();
}
@Override
@ -76,59 +76,8 @@ public abstract class InternalSignificantTerms extends InternalAggregation imple
return subsetSize;
}
/**
* Calculates the significance of a term in a sample against a background of
* normal distributions by comparing the changes in frequency. This is the heart
* of the significant terms feature.
* <p/>
* TODO - allow pluggable scoring implementations
*
* @param subsetFreq The frequency of the term in the selected sample
* @param subsetSize The size of the selected sample (typically number of docs)
* @param supersetFreq The frequency of the term in the superset from which the sample was taken
* @param supersetSize The size of the superset from which the sample was taken (typically number of docs)
* @return a "significance" score
*/
public static double getSampledTermSignificance(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
if ((subsetSize == 0) || (supersetSize == 0)) {
// avoid any divide by zero issues
return 0;
}
if (supersetFreq == 0) {
// If we are using a background context that is not a strict superset, a foreground
// term may be missing from the background, so for the purposes of this calculation
// we assume a value of 1 for our calculations which avoids returning an "infinity" result
supersetFreq = 1;
}
double subsetProbability = (double) subsetFreq / (double) subsetSize;
double supersetProbability = (double) supersetFreq / (double) supersetSize;
// Using absoluteProbabilityChange alone favours very common words e.g. you, we etc
// because a doubling in popularity of a common term is a big percent difference
// whereas a rare term would have to achieve a hundred-fold increase in popularity to
// achieve the same difference measure.
// In favouring common words as suggested features for search we would get high
// recall but low precision.
double absoluteProbabilityChange = subsetProbability - supersetProbability;
if (absoluteProbabilityChange <= 0) {
return 0;
}
// Using relativeProbabilityChange tends to favour rarer terms e.g.mis-spellings or
// unique URLs.
// A very low-probability term can very easily double in popularity due to the low
// numbers required to do so whereas a high-probability term would have to add many
// extra individual sightings to achieve the same shift.
// In favouring rare words as suggested features for search we would get high
// precision but low recall.
double relativeProbabilityChange = (subsetProbability / supersetProbability);
// A blend of the above metrics - favours medium-rare terms to strike a useful
// balance between precision and recall.
return absoluteProbabilityChange * relativeProbabilityChange;
}
public void updateScore() {
score = getSampledTermSignificance(subsetDf, subsetSize, supersetDf, supersetSize);
public void updateScore(SignificanceHeuristic significanceHeuristic) {
score = significanceHeuristic.getScore(subsetDf, subsetSize, supersetDf, supersetSize);
}
@Override
@ -162,13 +111,14 @@ public abstract class InternalSignificantTerms extends InternalAggregation imple
}
}
protected InternalSignificantTerms(long subsetSize, long supersetSize, String name, int requiredSize, long minDocCount, Collection<Bucket> buckets) {
protected InternalSignificantTerms(long subsetSize, long supersetSize, String name, int requiredSize, long minDocCount, SignificanceHeuristic significanceHeuristic, Collection<Bucket> buckets) {
super(name);
this.requiredSize = requiredSize;
this.minDocCount = minDocCount;
this.buckets = buckets;
this.subsetSize = subsetSize;
this.supersetSize = supersetSize;
this.significanceHeuristic = significanceHeuristic;
}
@Override
@ -227,6 +177,7 @@ public abstract class InternalSignificantTerms extends InternalAggregation imple
for (Map.Entry<String, List<Bucket>> entry : buckets.entrySet()) {
List<Bucket> sameTermBuckets = entry.getValue();
final Bucket b = sameTermBuckets.get(0).reduce(sameTermBuckets, reduceContext.bigArrays());
b.updateScore(significanceHeuristic);
if ((b.score > 0) && (b.subsetDf >= minDocCount)) {
ordered.insertWithOverflow(b);
}

View File

@ -18,6 +18,7 @@
*/
package org.elasticsearch.search.aggregations.bucket.significant;
import org.elasticsearch.Version;
import org.elasticsearch.common.Nullable;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
@ -26,6 +27,7 @@ import org.elasticsearch.common.text.Text;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.search.aggregations.AggregationStreams;
import org.elasticsearch.search.aggregations.InternalAggregations;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.*;
import org.elasticsearch.search.aggregations.support.format.ValueFormatter;
import org.elasticsearch.search.aggregations.support.format.ValueFormatterStreams;
@ -92,12 +94,13 @@ public class SignificantLongTerms extends InternalSignificantTerms {
private ValueFormatter formatter;
SignificantLongTerms() {} // for serialization
SignificantLongTerms() {
} // for serialization
public SignificantLongTerms(long subsetSize, long supersetSize, String name, @Nullable ValueFormatter formatter,
int requiredSize, long minDocCount, Collection<InternalSignificantTerms.Bucket> buckets) {
int requiredSize, long minDocCount, SignificanceHeuristic significanceHeuristic, Collection<InternalSignificantTerms.Bucket> buckets) {
super(subsetSize, supersetSize, name, requiredSize, minDocCount, buckets);
super(subsetSize, supersetSize, name, requiredSize, minDocCount, significanceHeuristic, buckets);
this.formatter = formatter;
}
@ -109,7 +112,7 @@ public class SignificantLongTerms extends InternalSignificantTerms {
@Override
InternalSignificantTerms newAggregation(long subsetSize, long supersetSize,
List<InternalSignificantTerms.Bucket> buckets) {
return new SignificantLongTerms(subsetSize, supersetSize, getName(), formatter, requiredSize, minDocCount, buckets);
return new SignificantLongTerms(subsetSize, supersetSize, getName(), formatter, requiredSize, minDocCount, significanceHeuristic, buckets);
}
@Override
@ -120,6 +123,7 @@ public class SignificantLongTerms extends InternalSignificantTerms {
this.minDocCount = in.readVLong();
this.subsetSize = in.readVLong();
this.supersetSize = in.readVLong();
significanceHeuristic = SignificanceHeuristicStreams.read(in);
int size = in.readVInt();
List<InternalSignificantTerms.Bucket> buckets = new ArrayList<>(size);
@ -127,7 +131,9 @@ public class SignificantLongTerms extends InternalSignificantTerms {
long subsetDf = in.readVLong();
long supersetDf = in.readVLong();
long term = in.readLong();
buckets.add(new Bucket(subsetDf, subsetSize, supersetDf,supersetSize, term, InternalAggregations.readAggregations(in)));
Bucket readBucket = new Bucket(subsetDf, subsetSize, supersetDf, supersetSize, term, InternalAggregations.readAggregations(in));
readBucket.updateScore(significanceHeuristic);
buckets.add(readBucket);
}
this.buckets = buckets;
this.bucketMap = null;
@ -141,6 +147,9 @@ public class SignificantLongTerms extends InternalSignificantTerms {
out.writeVLong(minDocCount);
out.writeVLong(subsetSize);
out.writeVLong(supersetSize);
if (out.getVersion().onOrAfter(Version.V_1_3_0)) {
significanceHeuristic.writeTo(out);
}
out.writeVInt(buckets.size());
for (InternalSignificantTerms.Bucket bucket : buckets) {
out.writeVLong(((Bucket) bucket).subsetDf);

View File

@ -75,10 +75,9 @@ public class SignificantLongTermsAggregator extends LongTermsAggregator {
spare.subsetSize = subsetSize;
spare.supersetDf = termsAggFactory.getBackgroundFrequency(spare.term);
spare.supersetSize = supersetSize;
assert spare.subsetDf <= spare.supersetDf;
// During shard-local down-selection we use subset/superset stats that are for this shard only
// Back at the central reducer these properties will be updated with global stats
spare.updateScore();
spare.updateScore(termsAggFactory.getSignificanceHeuristic());
spare.bucketOrd = i;
if (spare.subsetDf >= bucketCountThresholds.getShardMinDocCount()) {
@ -92,7 +91,7 @@ public class SignificantLongTermsAggregator extends LongTermsAggregator {
bucket.aggregations = bucketAggregations(bucket.bucketOrd);
list[i] = bucket;
}
return new SignificantLongTerms(subsetSize, supersetSize, name, formatter, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), Arrays.asList(list));
return new SignificantLongTerms(subsetSize, supersetSize, name, formatter, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), termsAggFactory.getSignificanceHeuristic(), Arrays.asList(list));
}
@Override
@ -101,7 +100,7 @@ public class SignificantLongTermsAggregator extends LongTermsAggregator {
ContextIndexSearcher searcher = context.searchContext().searcher();
IndexReader topReader = searcher.getIndexReader();
int supersetSize = topReader.numDocs();
return new SignificantLongTerms(0, supersetSize, name, formatter, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), Collections.<InternalSignificantTerms.Bucket>emptyList());
return new SignificantLongTerms(0, supersetSize, name, formatter, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), termsAggFactory.getSignificanceHeuristic(), Collections.<InternalSignificantTerms.Bucket>emptyList());
}
@Override

View File

@ -19,6 +19,7 @@
package org.elasticsearch.search.aggregations.bucket.significant;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.Version;
import org.elasticsearch.common.bytes.BytesArray;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
@ -28,6 +29,7 @@ import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.search.aggregations.AggregationStreams;
import org.elasticsearch.search.aggregations.InternalAggregation;
import org.elasticsearch.search.aggregations.InternalAggregations;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.*;
import java.io.IOException;
import java.util.ArrayList;
@ -94,8 +96,8 @@ public class SignificantStringTerms extends InternalSignificantTerms {
SignificantStringTerms() {} // for serialization
public SignificantStringTerms(long subsetSize, long supersetSize, String name, int requiredSize,
long minDocCount, Collection<InternalSignificantTerms.Bucket> buckets) {
super(subsetSize, supersetSize, name, requiredSize, minDocCount, buckets);
long minDocCount, SignificanceHeuristic significanceHeuristic, Collection<InternalSignificantTerms.Bucket> buckets) {
super(subsetSize, supersetSize, name, requiredSize, minDocCount, significanceHeuristic, buckets);
}
@Override
@ -106,7 +108,7 @@ public class SignificantStringTerms extends InternalSignificantTerms {
@Override
InternalSignificantTerms newAggregation(long subsetSize, long supersetSize,
List<InternalSignificantTerms.Bucket> buckets) {
return new SignificantStringTerms(subsetSize, supersetSize, getName(), requiredSize, minDocCount, buckets);
return new SignificantStringTerms(subsetSize, supersetSize, getName(), requiredSize, minDocCount, significanceHeuristic, buckets);
}
@Override
@ -116,13 +118,16 @@ public class SignificantStringTerms extends InternalSignificantTerms {
this.minDocCount = in.readVLong();
this.subsetSize = in.readVLong();
this.supersetSize = in.readVLong();
significanceHeuristic = SignificanceHeuristicStreams.read(in);
int size = in.readVInt();
List<InternalSignificantTerms.Bucket> buckets = new ArrayList<>(size);
for (int i = 0; i < size; i++) {
BytesRef term = in.readBytesRef();
long subsetDf = in.readVLong();
long supersetDf = in.readVLong();
buckets.add(new Bucket(term, subsetDf, subsetSize, supersetDf, supersetSize, InternalAggregations.readAggregations(in)));
Bucket readBucket = new Bucket(term, subsetDf, subsetSize, supersetDf, supersetSize, InternalAggregations.readAggregations(in));
readBucket.updateScore(significanceHeuristic);
buckets.add(readBucket);
}
this.buckets = buckets;
this.bucketMap = null;
@ -135,6 +140,9 @@ public class SignificantStringTerms extends InternalSignificantTerms {
out.writeVLong(minDocCount);
out.writeVLong(subsetSize);
out.writeVLong(supersetSize);
if (out.getVersion().onOrAfter(Version.V_1_3_0)) {
significanceHeuristic.writeTo(out);
}
out.writeVInt(buckets.size());
for (InternalSignificantTerms.Bucket bucket : buckets) {
out.writeBytesRef(((Bucket) bucket).termBytes);

View File

@ -76,11 +76,11 @@ public class SignificantStringTermsAggregator extends StringTermsAggregator {
spare.subsetSize = subsetSize;
spare.supersetDf = termsAggFactory.getBackgroundFrequency(spare.termBytes);
spare.supersetSize = supersetSize;
// During shard-local down-selection we use subset/superset stats
// During shard-local down-selection we use subset/superset stats
// that are for this shard only
// Back at the central reducer these properties will be updated with
// global stats
spare.updateScore();
spare.updateScore(termsAggFactory.getSignificanceHeuristic());
spare.bucketOrd = i;
if (spare.subsetDf >= bucketCountThresholds.getShardMinDocCount()) {
@ -97,7 +97,7 @@ public class SignificantStringTermsAggregator extends StringTermsAggregator {
list[i] = bucket;
}
return new SignificantStringTerms(subsetSize, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), Arrays.asList(list));
return new SignificantStringTerms(subsetSize, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), termsAggFactory.getSignificanceHeuristic(), Arrays.asList(list));
}
@Override
@ -106,7 +106,7 @@ public class SignificantStringTermsAggregator extends StringTermsAggregator {
ContextIndexSearcher searcher = context.searchContext().searcher();
IndexReader topReader = searcher.getIndexReader();
int supersetSize = topReader.numDocs();
return new SignificantStringTerms(0, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), Collections.<InternalSignificantTerms.Bucket>emptyList());
return new SignificantStringTerms(0, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), termsAggFactory.getSignificanceHeuristic(), Collections.<InternalSignificantTerms.Bucket>emptyList());
}
@Override

View File

@ -31,6 +31,7 @@ import org.elasticsearch.common.lucene.index.FilterableTermsEnum;
import org.elasticsearch.common.lucene.index.FreqTermsEnum;
import org.elasticsearch.index.mapper.FieldMapper;
import org.elasticsearch.search.aggregations.*;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristic;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregator;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregatorFactory;
import org.elasticsearch.search.aggregations.bucket.terms.support.IncludeExclude;
@ -47,6 +48,10 @@ import java.io.IOException;
*/
public class SignificantTermsAggregatorFactory extends ValuesSourceAggregatorFactory implements Releasable {
public SignificanceHeuristic getSignificanceHeuristic() {
return significanceHeuristic;
}
public enum ExecutionMode {
MAP(new ParseField("map")) {
@ -131,14 +136,20 @@ public class SignificantTermsAggregatorFactory extends ValuesSourceAggregatorFac
private int numberOfAggregatorsCreated = 0;
private Filter filter;
private final TermsAggregator.BucketCountThresholds bucketCountThresholds;
private final SignificanceHeuristic significanceHeuristic;
protected TermsAggregator.BucketCountThresholds getBucketCountThresholds() {
return new TermsAggregator.BucketCountThresholds(bucketCountThresholds);
}
public SignificantTermsAggregatorFactory(String name, ValuesSourceConfig valueSourceConfig, TermsAggregator.BucketCountThresholds bucketCountThresholds, IncludeExclude includeExclude,
String executionHint, Filter filter) {
String executionHint, Filter filter, SignificanceHeuristic significanceHeuristic) {
super(name, SignificantStringTerms.TYPE.name(), valueSourceConfig);
this.bucketCountThresholds = bucketCountThresholds;
this.includeExclude = includeExclude;
this.executionHint = executionHint;
this.significanceHeuristic = significanceHeuristic;
if (!valueSourceConfig.unmapped()) {
this.indexedFieldName = config.fieldContext().field();
mapper = SearchContext.current().smartNameFieldMapper(indexedFieldName);

View File

@ -22,6 +22,7 @@ package org.elasticsearch.search.aggregations.bucket.significant;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.index.query.FilterBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilder;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristicBuilder;
import org.elasticsearch.search.aggregations.bucket.terms.AbstractTermsParametersParser;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregator;
@ -44,6 +45,7 @@ public class SignificantTermsBuilder extends AggregationBuilder<SignificantTerms
private String excludePattern;
private int excludeFlags;
private FilterBuilder filterBuilder;
private SignificanceHeuristicBuilder significanceHeuristicBuilder;
public SignificantTermsBuilder(String name) {
@ -165,7 +167,15 @@ public class SignificantTermsBuilder extends AggregationBuilder<SignificantTerms
builder.field(SignificantTermsParametersParser.BACKGROUND_FILTER.getPreferredName());
filterBuilder.toXContent(builder, params);
}
if (significanceHeuristicBuilder != null) {
significanceHeuristicBuilder.toXContent(builder);
}
return builder.endObject();
}
public SignificantTermsBuilder significanceHeuristic(SignificanceHeuristicBuilder significanceHeuristicBuilder) {
this.significanceHeuristicBuilder = significanceHeuristicBuilder;
return this;
}
}
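A hedged usage sketch of the new builder hook via the Java API (assuming a `client` handle; the index, aggregation and field names are invented):

SearchResponse response = client.prepareSearch("my_index")
        .addAggregation(new SignificantTermsBuilder("significant_crime_types")
                .field("crime_type")
                .significanceHeuristic(new MutualInformation.MutualInformationBuilder(true, true)))
        .get();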

View File

@ -24,6 +24,9 @@ import org.apache.lucene.search.Filter;
import org.elasticsearch.common.ParseField;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.search.SearchParseException;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristic;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristicParser;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristicParserMapper;
import org.elasticsearch.search.aggregations.bucket.terms.AbstractTermsParametersParser;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregator;
import org.elasticsearch.search.internal.SearchContext;
@ -34,6 +37,11 @@ import java.io.IOException;
public class SignificantTermsParametersParser extends AbstractTermsParametersParser {
private static final TermsAggregator.BucketCountThresholds DEFAULT_BUCKET_COUNT_THRESHOLDS = new TermsAggregator.BucketCountThresholds(3, 0, 10, -1);
private final SignificanceHeuristicParserMapper significanceHeuristicParserMapper;
public SignificantTermsParametersParser(SignificanceHeuristicParserMapper significanceHeuristicParserMapper) {
this.significanceHeuristicParserMapper = significanceHeuristicParserMapper;
}
public Filter getFilter() {
return filter;
@ -41,6 +49,8 @@ public class SignificantTermsParametersParser extends AbstractTermsParametersPar
private Filter filter = null;
private SignificanceHeuristic significanceHeuristic;
public TermsAggregator.BucketCountThresholds getDefaultBucketCountThresholds() {
return new TermsAggregator.BucketCountThresholds(DEFAULT_BUCKET_COUNT_THRESHOLDS);
}
@ -49,9 +59,12 @@ public class SignificantTermsParametersParser extends AbstractTermsParametersPar
@Override
public void parseSpecial(String aggregationName, XContentParser parser, SearchContext context, XContentParser.Token token, String currentFieldName) throws IOException {
if (token == XContentParser.Token.START_OBJECT) {
if (BACKGROUND_FILTER.match(currentFieldName)) {
SignificanceHeuristicParser significanceHeuristicParser = significanceHeuristicParserMapper.get(currentFieldName);
if (significanceHeuristicParser != null) {
significanceHeuristic = significanceHeuristicParser.parse(parser);
} else if (BACKGROUND_FILTER.match(currentFieldName)) {
filter = context.queryParserService().parseInnerFilter(parser).filter();
} else {
throw new SearchParseException(context, "Unknown key for a " + token + " in [" + aggregationName + "]: [" + currentFieldName + "].");
@ -60,4 +73,8 @@ public class SignificantTermsParametersParser extends AbstractTermsParametersPar
throw new SearchParseException(context, "Unknown key for a " + token + " in [" + aggregationName + "]: [" + currentFieldName + "].");
}
}
public SignificanceHeuristic getSignificanceHeuristic() {
return significanceHeuristic;
}
}

View File

@ -18,10 +18,14 @@
*/
package org.elasticsearch.search.aggregations.bucket.significant;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.search.aggregations.Aggregator;
import org.elasticsearch.search.aggregations.AggregatorFactory;
import org.elasticsearch.search.aggregations.bucket.BucketUtils;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.JLHScore;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristic;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristicParserMapper;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregator;
import org.elasticsearch.search.aggregations.bucket.terms.support.IncludeExclude;
import org.elasticsearch.search.aggregations.support.ValuesSourceParser;
@ -34,6 +38,13 @@ import java.io.IOException;
*/
public class SignificantTermsParser implements Aggregator.Parser {
private final SignificanceHeuristicParserMapper significanceHeuristicParserMapper;
@Inject
public SignificantTermsParser(SignificanceHeuristicParserMapper significanceHeuristicParserMapper) {
this.significanceHeuristicParserMapper = significanceHeuristicParserMapper;
}
@Override
public String type() {
return SignificantStringTerms.TYPE.name();
@ -41,7 +52,7 @@ public class SignificantTermsParser implements Aggregator.Parser {
@Override
public AggregatorFactory parse(String aggregationName, XContentParser parser, SearchContext context) throws IOException {
SignificantTermsParametersParser aggParser = new SignificantTermsParametersParser();
SignificantTermsParametersParser aggParser = new SignificantTermsParametersParser(significanceHeuristicParserMapper);
ValuesSourceParser vsParser = ValuesSourceParser.any(aggregationName, SignificantStringTerms.TYPE, context)
.scriptable(false)
.formattable(true)
@ -52,7 +63,7 @@ public class SignificantTermsParser implements Aggregator.Parser {
aggParser.parse(aggregationName, parser, context, vsParser, incExcParser);
TermsAggregator.BucketCountThresholds bucketCountThresholds = aggParser.getBucketCountThresholds();
if (bucketCountThresholds.getShardSize() == new SignificantTermsParametersParser().getDefaultBucketCountThresholds().getShardSize()) {
if (bucketCountThresholds.getShardSize() == aggParser.getDefaultBucketCountThresholds().getShardSize()) {
//The user has not made a shardSize selection.
//Use default heuristic to avoid any wrong-ranking caused by distributed counting
//but request double the usual amount.
@ -64,6 +75,10 @@ public class SignificantTermsParser implements Aggregator.Parser {
}
bucketCountThresholds.ensureValidity();
return new SignificantTermsAggregatorFactory(aggregationName, vsParser.config(), bucketCountThresholds, aggParser.getIncludeExclude(), aggParser.getExecutionHint(), aggParser.getFilter());
SignificanceHeuristic significanceHeuristic = aggParser.getSignificanceHeuristic();
if (significanceHeuristic == null) {
significanceHeuristic = JLHScore.INSTANCE;
}
return new SignificantTermsAggregatorFactory(aggregationName, vsParser.config(), bucketCountThresholds, aggParser.getIncludeExclude(), aggParser.getExecutionHint(), aggParser.getFilter(), significanceHeuristic);
}
}

View File

@ -23,6 +23,7 @@ import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.search.aggregations.AggregationStreams;
import org.elasticsearch.search.aggregations.InternalAggregation;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.JLHScore;
import java.io.IOException;
import java.util.Collection;
@ -58,7 +59,7 @@ public class UnmappedSignificantTerms extends InternalSignificantTerms {
public UnmappedSignificantTerms(String name, int requiredSize, long minDocCount) {
//We pass zero for index/subset sizes because for the purpose of significant term analysis
// we assume an unmapped index's size is irrelevant to the proceedings.
super(0, 0, name, requiredSize, minDocCount, BUCKETS);
super(0, 0, name, requiredSize, minDocCount, JLHScore.INSTANCE, BUCKETS);
}
@Override

View File

@ -0,0 +1,148 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.aggregations.bucket.significant.heuristics;
import org.elasticsearch.ElasticsearchIllegalArgumentException;
import org.elasticsearch.ElasticsearchParseException;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.index.query.QueryParsingException;
import java.io.IOException;
public class JLHScore implements SignificanceHeuristic {
public static final JLHScore INSTANCE = new JLHScore();
protected static final String[] NAMES = {"jlh"};
private JLHScore() {}
public static final SignificanceHeuristicStreams.Stream STREAM = new SignificanceHeuristicStreams.Stream() {
@Override
public SignificanceHeuristic readResult(StreamInput in) throws IOException {
return readFrom(in);
}
@Override
public String getName() {
return NAMES[0];
}
};
public static SignificanceHeuristic readFrom(StreamInput in) throws IOException {
return INSTANCE;
}
/**
* Calculates the significance of a term in a sample against a background of
* normal distributions by comparing the changes in frequency. This is the heart
* of the significant terms feature.
* <p/>
*
* @param subsetFreq The frequency of the term in the selected sample
* @param subsetSize The size of the selected sample (typically number of docs)
* @param supersetFreq The frequency of the term in the superset from which the sample was taken
* @param supersetSize The size of the superset from which the sample was taken (typically number of docs)
* @return a "significance" score
*/
@Override
public double getScore(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
if (subsetFreq < 0 || subsetSize < 0 || supersetFreq < 0 || supersetSize < 0) {
throw new ElasticsearchIllegalArgumentException("Frequencies of subset and superset must be positive in JLHScore.getScore()");
}
if (subsetFreq > subsetSize) {
throw new ElasticsearchIllegalArgumentException("subsetFreq > subsetSize, in JLHScore.score(..)");
}
if (supersetFreq > supersetSize) {
throw new ElasticsearchIllegalArgumentException("supersetFreq > supersetSize, in JLHScore.score(..)");
}
if ((subsetSize == 0) || (supersetSize == 0)) {
// avoid any divide by zero issues
return 0;
}
if (supersetFreq == 0) {
// If we are using a background context that is not a strict superset, a foreground
// term may be missing from the background, so for the purposes of this calculation
// we assume a value of 1 for our calculations which avoids returning an "infinity" result
supersetFreq = 1;
}
double subsetProbability = (double) subsetFreq / (double) subsetSize;
double supersetProbability = (double) supersetFreq / (double) supersetSize;
// Using absoluteProbabilityChange alone favours very common words e.g. you, we etc
// because a doubling in popularity of a common term is a big percent difference
// whereas a rare term would have to achieve a hundred-fold increase in popularity to
// achieve the same difference measure.
// In favouring common words as suggested features for search we would get high
// recall but low precision.
double absoluteProbabilityChange = subsetProbability - supersetProbability;
if (absoluteProbabilityChange <= 0) {
return 0;
}
// Using relativeProbabilityChange tends to favour rarer terms e.g. mis-spellings or
// unique URLs.
// A very low-probability term can very easily double in popularity due to the low
// numbers required to do so whereas a high-probability term would have to add many
// extra individual sightings to achieve the same shift.
// In favouring rare words as suggested features for search we would get high
// precision but low recall.
double relativeProbabilityChange = (subsetProbability / supersetProbability);
// A blend of the above metrics - favours medium-rare terms to strike a useful
// balance between precision and recall.
return absoluteProbabilityChange * relativeProbabilityChange;
}
@Override
public void writeTo(StreamOutput out) throws IOException {
out.writeString(STREAM.getName());
}
public static class JLHScoreParser implements SignificanceHeuristicParser {
@Override
public SignificanceHeuristic parse(XContentParser parser) throws IOException, QueryParsingException {
// move to the closing bracket
if (!parser.nextToken().equals(XContentParser.Token.END_OBJECT)) {
throw new ElasticsearchParseException("expected }, got " + parser.currentName() + " instead in jhl score");
}
return new JLHScore();
}
@Override
public String[] getNames() {
return NAMES;
}
}
public static class JLHScoreBuilder implements SignificanceHeuristicBuilder {
@Override
public void toXContent(XContentBuilder builder) throws IOException {
builder.startObject(STREAM.getName()).endObject();
}
}
}
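As a quick sanity check of the blend (toy numbers, invented for illustration): a term appearing in 10 of 100 foreground docs and in 100 of 10000 background docs gives an absolute change of 0.1 - 0.01 = 0.09 and a relative change of 0.1 / 0.01 = 10, so the score is 0.09 * 10 = 0.9:

double score = JLHScore.INSTANCE.getScore(10, 100, 100, 10000); // 0.9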

View File

@ -0,0 +1,263 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.aggregations.bucket.significant.heuristics;
import org.elasticsearch.ElasticsearchIllegalArgumentException;
import org.elasticsearch.ElasticsearchParseException;
import org.elasticsearch.common.ParseField;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.index.query.QueryParsingException;
import java.io.IOException;
public class MutualInformation implements SignificanceHeuristic {
protected static final ParseField NAMES_FIELD = new ParseField("mutual_information");
protected static final ParseField INCLUDE_NEGATIVES_FIELD = new ParseField("include_negatives");
protected static final ParseField BACKGROUND_IS_SUPERSET = new ParseField("background_is_superset");
protected static final String SCORE_ERROR_MESSAGE = ", does your background filter not include all documents in the bucket? If so and it is intentional, set \"" + BACKGROUND_IS_SUPERSET.getPreferredName() + "\": false";
private static final double log2 = Math.log(2.0);
/**
* Mutual information does not differentiate between terms that are descriptive for the subset or for
* the background without the subset. We might want to filter out the terms that appear much less often
* in the subset than in the background without the subset.
*/
protected boolean includeNegatives = false;
private boolean backgroundIsSuperset = true;
private MutualInformation() {}
public MutualInformation(boolean includeNegatives, boolean backgroundIsSuperset) {
this.includeNegatives = includeNegatives;
this.backgroundIsSuperset = backgroundIsSuperset;
}
@Override
public boolean equals(Object other) {
if (! (other instanceof MutualInformation)) {
return false;
}
return ((MutualInformation)other).includeNegatives == includeNegatives && ((MutualInformation)other).backgroundIsSuperset == backgroundIsSuperset;
}
public static final SignificanceHeuristicStreams.Stream STREAM = new SignificanceHeuristicStreams.Stream() {
@Override
public SignificanceHeuristic readResult(StreamInput in) throws IOException {
return new MutualInformation(in.readBoolean(), in.readBoolean());
}
@Override
public String getName() {
return NAMES_FIELD.getPreferredName();
}
};
/**
* Calculates mutual information
* see "Information Retrieval", Manning et al., Eq. 13.17
*
* @param subsetFreq The frequency of the term in the selected sample
* @param subsetSize The size of the selected sample (typically number of docs)
* @param supersetFreq The frequency of the term in the superset from which the sample was taken
* @param supersetSize The size of the superset from which the sample was taken (typically number of docs)
* @return a "significance" score
*/
@Override
public double getScore(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
if (subsetFreq < 0 || subsetSize < 0 || supersetFreq < 0 || supersetSize < 0) {
throw new ElasticsearchIllegalArgumentException("Frequencies of subset and superset must be positive in MutualInformation.getScore()");
}
if (subsetFreq > subsetSize) {
throw new ElasticsearchIllegalArgumentException("subsetFreq > subsetSize, in MutualInformation.score(..)");
}
if (supersetFreq > supersetSize) {
throw new ElasticsearchIllegalArgumentException("supersetFreq > supersetSize, in MutualInformation.score(..)");
}
if (backgroundIsSuperset) {
if (subsetFreq > supersetFreq) {
throw new ElasticsearchIllegalArgumentException("subsetFreq > supersetFreq" + SCORE_ERROR_MESSAGE);
}
if (subsetSize > supersetSize) {
throw new ElasticsearchIllegalArgumentException("subsetSize > supersetSize" + SCORE_ERROR_MESSAGE);
}
if (supersetFreq - subsetFreq > supersetSize - subsetSize) {
throw new ElasticsearchIllegalArgumentException("supersetFreq - subsetFreq > supersetSize - subsetSize" + SCORE_ERROR_MESSAGE);
}
}
double N00, N01, N10, N11, N0_, N1_, N_0, N_1, N;
if (backgroundIsSuperset) {
//documents not in class and do not contain term
N00 = supersetSize - supersetFreq - (subsetSize - subsetFreq);
//documents in class and do not contain term
N01 = (subsetSize - subsetFreq);
// documents not in class and do contain term
N10 = supersetFreq - subsetFreq;
// documents in class and do contain term
N11 = subsetFreq;
//documents that do not contain term
N0_ = supersetSize - supersetFreq;
//documents that contain term
N1_ = supersetFreq;
//documents that are not in class
N_0 = supersetSize - subsetSize;
//documents that are in class
N_1 = subsetSize;
//all docs
N = supersetSize;
} else {
//documents not in class and do not contain term
N00 = supersetSize - supersetFreq;
//documents in class and do not contain term
N01 = subsetSize - subsetFreq;
// documents not in class and do contain term
N10 = supersetFreq;
// documents in class and do contain term
N11 = subsetFreq;
//documents that do not contain term
N0_ = supersetSize - supersetFreq + subsetSize - subsetFreq;
//documents that contain term
N1_ = supersetFreq + subsetFreq;
//documents that are not in class
N_0 = supersetSize;
//documents that are in class
N_1 = subsetSize;
//all docs
N = supersetSize + subsetSize;
}
double score = (getMITerm(N00, N0_, N_0, N) +
getMITerm(N01, N0_, N_1, N) +
getMITerm(N10, N1_, N_0, N) +
getMITerm(N11, N1_, N_1, N))
/ log2;
if (Double.isNaN(score)) {
score = -1.0 * Float.MAX_VALUE;
}
// if requested, penalize terms that appear less often in the subset than in the background without the subset.
if (!includeNegatives && N11 / N_1 < N10 / N_0) {
score = -1.0 * Double.MAX_VALUE;
}
return score;
}
/* make sure that
0 * log(0/0) = 0
0 * log(0) = 0
Else, this would be the score:
double score =
N11 / N * Math.log((N * N11) / (N1_ * N_1))
+ N01 / N * Math.log((N * N01) / (N0_ * N_1))
+ N10 / N * Math.log((N * N10) / (N1_ * N_0))
+ N00 / N * Math.log((N * N00) / (N0_ * N_0));
but we get many NaN if we do not take care of the 0s */
double getMITerm(double Nxy, double Nx_, double N_y, double N) {
double numerator = Math.abs(N * Nxy);
double denominator = Math.abs(Nx_ * N_y);
double factor = Math.abs(Nxy / N);
if (numerator < 1.e-7 && factor < 1.e-7) {
return 0.0;
} else {
return factor * Math.log(numerator / denominator);
}
}
@Override
public void writeTo(StreamOutput out) throws IOException {
out.writeString(STREAM.getName());
out.writeBoolean(includeNegatives);
out.writeBoolean(backgroundIsSuperset);
}
public boolean getIncludeNegatives() {
return includeNegatives;
}
@Override
public int hashCode() {
int result = (includeNegatives ? 1 : 0);
result = 31 * result + (backgroundIsSuperset ? 1 : 0);
return result;
}
public static class MutualInformationParser implements SignificanceHeuristicParser {
@Override
public SignificanceHeuristic parse(XContentParser parser) throws IOException, QueryParsingException {
NAMES_FIELD.match(parser.currentName(), ParseField.EMPTY_FLAGS);
boolean includeNegatives = false;
boolean backgroundIsSuperset = true;
XContentParser.Token token = parser.nextToken();
while (!token.equals(XContentParser.Token.END_OBJECT)) {
if (INCLUDE_NEGATIVES_FIELD.match(parser.currentName(), ParseField.EMPTY_FLAGS)) {
parser.nextToken();
includeNegatives = parser.booleanValue();
} else if (BACKGROUND_IS_SUPERSET.match(parser.currentName(), ParseField.EMPTY_FLAGS)) {
parser.nextToken();
backgroundIsSuperset = parser.booleanValue();
} else {
throw new ElasticsearchParseException("Field " + parser.currentName().toString() + " unknown for mutual_information.");
}
token = parser.nextToken();
}
// move to the closing bracket
return new MutualInformation(includeNegatives, backgroundIsSuperset);
}
@Override
public String[] getNames() {
return NAMES_FIELD.getAllNamesIncludedDeprecated();
}
}
public static class MutualInformationBuilder implements SignificanceHeuristicBuilder {
boolean includeNegatives = true;
private boolean backgroundIsSuperset = true;
private MutualInformationBuilder() {}
public MutualInformationBuilder(boolean includeNegatives, boolean backgroundIsSuperset) {
this.includeNegatives = includeNegatives;
this.backgroundIsSuperset = backgroundIsSuperset;
}
@Override
public void toXContent(XContentBuilder builder) throws IOException {
builder.startObject(STREAM.getName())
.field(INCLUDE_NEGATIVES_FIELD.getPreferredName(), includeNegatives)
.field(BACKGROUND_IS_SUPERSET.getPreferredName(), backgroundIsSuperset)
.endObject();
}
}
}
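To see how the contingency table is filled, a toy computation (numbers invented for illustration), with the default background-is-superset interpretation:

SignificanceHeuristic mutualInformation = new MutualInformation(true, true);
double score = mutualInformation.getScore(10, 100, 100, 10000);
// N11 = 10   (in class, term present)     N01 = 90   (in class, term absent)
// N10 = 90   (not in class, term present) N00 = 9810 (not in class, term absent)
// N  = 10000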

View File

@ -0,0 +1,32 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.aggregations.bucket.significant.heuristics;
import org.elasticsearch.common.io.stream.StreamOutput;
import java.io.IOException;
public interface SignificanceHeuristic {
public double getScore(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize);
void writeTo(StreamOutput out) throws IOException;
}

View File

@ -0,0 +1,31 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.aggregations.bucket.significant.heuristics;
import org.elasticsearch.common.xcontent.XContentBuilder;
import java.io.IOException;
public interface SignificanceHeuristicBuilder {
public void toXContent(XContentBuilder builder) throws IOException;
}

View File

@ -0,0 +1,36 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.aggregations.bucket.significant.heuristics;
import org.elasticsearch.common.ParseField;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.index.query.QueryParsingException;
import org.elasticsearch.search.internal.SearchContext;
import java.io.IOException;
import java.util.EnumSet;
public interface SignificanceHeuristicParser {
public SignificanceHeuristic parse(XContentParser parser) throws IOException, QueryParsingException;
public String[] getNames();
}

View File

@ -0,0 +1,47 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.aggregations.bucket.significant.heuristics;
import com.google.common.collect.ImmutableMap;
import org.elasticsearch.common.collect.MapBuilder;
import org.elasticsearch.common.inject.Inject;
import java.util.Set;
public class SignificanceHeuristicParserMapper {
protected ImmutableMap<String, SignificanceHeuristicParser> significanceHeuristicParsers;
@Inject
public SignificanceHeuristicParserMapper(Set<SignificanceHeuristicParser> parsers) {
MapBuilder<String, SignificanceHeuristicParser> builder = MapBuilder.newMapBuilder();
for (SignificanceHeuristicParser parser : parsers) {
for (String name : parser.getNames()) {
builder.put(name, parser);
}
}
significanceHeuristicParsers = builder.immutableMap();
}
public SignificanceHeuristicParser get(String parserName) {
return significanceHeuristicParsers.get(parserName);
}
}

View File

@ -0,0 +1,78 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.aggregations.bucket.significant.heuristics;
import com.google.common.collect.ImmutableMap;
import org.elasticsearch.Version;
import org.elasticsearch.common.collect.MapBuilder;
import org.elasticsearch.common.io.stream.StreamInput;
import java.io.IOException;
/**
* A registry for all significance heuristics. This is needed for reading them from a stream without knowing which
* one it is.
*/
public class SignificanceHeuristicStreams {
private static ImmutableMap<String, Stream> STREAMS = ImmutableMap.of();
public static SignificanceHeuristic read(StreamInput in) throws IOException {
if (in.getVersion().onOrAfter(Version.V_1_3_0)) {
return stream(in.readString()).readResult(in);
} else {
return JLHScore.INSTANCE;
}
}
/**
* A stream that knows how to read a heuristic from the input.
*/
public static interface Stream {
SignificanceHeuristic readResult(StreamInput in) throws IOException;
String getName();
}
/**
* Registers the given stream and associates it with the given names.
*
* @param stream The stream to register
* @param names The names associated with the streams
*/
public static synchronized void registerStream(Stream stream, String... names) {
MapBuilder<String, Stream> uStreams = MapBuilder.newMapBuilder(STREAMS);
for (String name : names) {
uStreams.put(name, stream);
}
STREAMS = uStreams.immutableMap();
}
/**
* Returns the stream that is registered for the given name
*
* @param name The given name
* @return The associated stream
*/
public static Stream stream(String name) {
return STREAMS.get(name);
}
}
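Registration and lookup form a simple name-keyed round trip. A minimal sketch of the intended usage, assuming the stream was registered at module setup (the method shape is illustrative only):

[source,java]
--------------------------------------------------
// A hedged sketch of the wire round trip (method shape assumed for illustration):
SignificanceHeuristic readHeuristic(StreamInput in) throws IOException {
    // Registration normally happens once in SignificantTermsHeuristicModule#configure;
    // repeating it here simply overwrites the same mapping.
    SignificanceHeuristicStreams.registerStream(MutualInformation.STREAM, MutualInformation.STREAM.getName());
    // read() consumes the heuristic name, then delegates to the registered stream;
    // responses from pre-1.3.0 nodes fall back to the JLH default.
    return SignificanceHeuristicStreams.read(in);
}
--------------------------------------------------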

View File

@ -0,0 +1,56 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.aggregations.bucket.significant.heuristics;
import com.google.common.collect.Lists;
import org.elasticsearch.common.inject.AbstractModule;
import org.elasticsearch.common.inject.multibindings.Multibinder;
import java.util.List;
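/**
 * Registers the built-in significance heuristics and allows plugins to add their own via {@link #registerHeuristic}.
 */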
public class SignificantTermsHeuristicModule extends AbstractModule {
private List<Class<? extends SignificanceHeuristicParser>> parsers = Lists.newArrayList();
private List<SignificanceHeuristicStreams.Stream> streams = Lists.newArrayList();
public SignificantTermsHeuristicModule() {
registerHeuristic(JLHScore.JLHScoreParser.class, JLHScore.STREAM);
registerHeuristic(MutualInformation.MutualInformationParser.class, MutualInformation.STREAM);
}
public void registerHeuristic(Class<? extends SignificanceHeuristicParser> parser, SignificanceHeuristicStreams.Stream stream) {
parsers.add(parser);
streams.add(stream);
}
@Override
protected void configure() {
Multibinder<SignificanceHeuristicParser> parserMapBinder = Multibinder.newSetBinder(binder(), SignificanceHeuristicParser.class);
for (Class<? extends SignificanceHeuristicParser> clazz : parsers) {
parserMapBinder.addBinding().to(clazz);
}
bind(SignificanceHeuristicParserMapper.class);
for (SignificanceHeuristicStreams.Stream stream : streams) {
SignificanceHeuristicStreams.registerStream(stream, stream.getName());
}
}
}
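Once registered through this module, a heuristic becomes addressable by name in the request body. A hedged request sketch, assuming a heuristic registered under the name `simple`, as the test plugin further down does:

[source,js]
--------------------------------------------------
"sig_terms": {
    "significant_terms": {
        "field": "text",
        "simple": {}
    }
}
--------------------------------------------------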

View File

@ -0,0 +1,117 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.aggregations.bucket;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.aggregations.Aggregation;
import org.elasticsearch.search.aggregations.bucket.significant.SignificantTerms;
import org.elasticsearch.search.aggregations.bucket.significant.SignificantTermsBuilder;
import org.elasticsearch.search.aggregations.bucket.terms.StringTerms;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.elasticsearch.search.aggregations.bucket.terms.TermsBuilder;
import org.elasticsearch.test.ElasticsearchBackwardsCompatIntegrationTest;
import org.elasticsearch.test.ElasticsearchIntegrationTest;
import org.junit.Test;
import java.io.IOException;
import java.util.*;
import java.util.concurrent.ExecutionException;
import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;
import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertSearchResponse;
import static org.hamcrest.Matchers.equalTo;
/**
*/
public class SignificantTermsBackwardCompatibilityTests extends ElasticsearchBackwardsCompatIntegrationTest {
static final String INDEX_NAME = "testidx";
static final String DOC_TYPE = "doc";
static final String TEXT_FIELD = "text";
static final String CLASS_FIELD = "class";
/**
* Simple upgrade test for streaming significant terms buckets
*/
@Test
public void testBucketStreaming() throws IOException, ExecutionException, InterruptedException {
logger.debug("testBucketStreaming: indexing documents");
String type = randomBoolean() ? "string" : "long";
String settings = "{\"index.number_of_shards\": 5, \"index.number_of_replicas\": 0}";
index01Docs(type, settings);
logClusterState();
boolean upgraded;
int upgradedNodesCounter = 1;
do {
logger.debug("testBucketStreaming: upgrading {}st node", upgradedNodesCounter++);
upgraded = backwardsCluster().upgradeOneNode();
ensureGreen();
logClusterState();
checkSignificantTermsAggregationCorrect();
} while (upgraded);
logger.debug("testBucketStreaming: done testing significant terms while upgrading");
}
private void index01Docs(String type, String settings) throws ExecutionException, InterruptedException {
String mappings = "{\"doc\": {\"properties\":{\"text\": {\"type\":\"" + type + "\"}}}}";
assertAcked(prepareCreate(INDEX_NAME).setSettings(settings).addMapping("doc", mappings));
String[] gb = {"0", "1"};
List<IndexRequestBuilder> indexRequestBuilderList = new ArrayList<>();
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "1")
.setSource(TEXT_FIELD, "1", CLASS_FIELD, "1"));
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "2")
.setSource(TEXT_FIELD, "1", CLASS_FIELD, "1"));
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "3")
.setSource(TEXT_FIELD, "0", CLASS_FIELD, "0"));
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "4")
.setSource(TEXT_FIELD, "0", CLASS_FIELD, "0"));
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "5")
.setSource(TEXT_FIELD, gb, CLASS_FIELD, "1"));
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "6")
.setSource(TEXT_FIELD, gb, CLASS_FIELD, "0"));
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "7")
.setSource(TEXT_FIELD, "0", CLASS_FIELD, "0"));
indexRandom(true, indexRequestBuilderList);
}
private void checkSignificantTermsAggregationCorrect() {
SearchResponse response = client().prepareSearch(INDEX_NAME).setTypes(DOC_TYPE)
.addAggregation(new TermsBuilder("class").field(CLASS_FIELD).subAggregation(
new SignificantTermsBuilder("sig_terms")
.field(TEXT_FIELD)))
.execute()
.actionGet();
assertSearchResponse(response);
StringTerms classes = (StringTerms) response.getAggregations().get("class");
assertThat(classes.getBuckets().size(), equalTo(2));
for (Terms.Bucket classBucket : classes.getBuckets()) {
Map<String, Aggregation> aggs = classBucket.getAggregations().asMap();
assertTrue(aggs.containsKey("sig_terms"));
SignificantTerms agg = (SignificantTerms) aggs.get("sig_terms");
assertThat(agg.getBuckets().size(), equalTo(1));
String term = agg.iterator().next().getKey();
String classTerm = classBucket.getKey();
assertThat(term, equalTo(classTerm));
}
}
}

View File

@ -0,0 +1,404 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.aggregations.bucket;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.search.SearchPhaseExecutionException;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.common.ParseField;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryParsingException;
import org.elasticsearch.plugins.AbstractPlugin;
import org.elasticsearch.search.aggregations.Aggregation;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.bucket.filter.FilterAggregationBuilder;
import org.elasticsearch.search.aggregations.bucket.filter.InternalFilter;
import org.elasticsearch.search.aggregations.bucket.significant.SignificantTerms;
import org.elasticsearch.search.aggregations.bucket.significant.SignificantTermsAggregatorFactory;
import org.elasticsearch.search.aggregations.bucket.significant.SignificantTermsBuilder;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.*;
import org.elasticsearch.search.aggregations.bucket.terms.StringTerms;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.elasticsearch.search.aggregations.bucket.terms.TermsBuilder;
import org.elasticsearch.test.ElasticsearchIntegrationTest;
import org.junit.Test;
import java.io.IOException;
import java.util.*;
import java.util.concurrent.ExecutionException;
import static org.elasticsearch.cluster.metadata.IndexMetaData.SETTING_NUMBER_OF_REPLICAS;
import static org.elasticsearch.cluster.metadata.IndexMetaData.SETTING_NUMBER_OF_SHARDS;
import static org.elasticsearch.common.settings.ImmutableSettings.settingsBuilder;
import static org.elasticsearch.test.ElasticsearchIntegrationTest.ClusterScope;
import static org.elasticsearch.test.ElasticsearchIntegrationTest.Scope;
import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;
import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertSearchResponse;
import static org.hamcrest.Matchers.*;
/**
*
*/
@ClusterScope(scope = Scope.SUITE)
public class SignificantTermsSignificanceScoreTests extends ElasticsearchIntegrationTest {
static final String INDEX_NAME = "testidx";
static final String DOC_TYPE = "doc";
static final String TEXT_FIELD = "text";
static final String CLASS_FIELD = "class";
@Override
protected Settings nodeSettings(int nodeOrdinal) {
return settingsBuilder()
.put("plugin.types", CustomSignificanceHeuristicPlugin.class.getName())
.put(super.nodeSettings(nodeOrdinal))
.build();
}
public String randomExecutionHint() {
return randomBoolean() ? null : randomFrom(SignificantTermsAggregatorFactory.ExecutionMode.values()).toString();
}
@Test
public void testPlugin() throws Exception {
String type = randomBoolean() ? "string" : "long";
String settings = "{\"index.number_of_shards\": 1, \"index.number_of_replicas\": 0}";
index01Docs(type, settings);
SearchResponse response = client().prepareSearch(INDEX_NAME).setTypes(DOC_TYPE)
.addAggregation(new TermsBuilder("class")
.field(CLASS_FIELD)
.subAggregation((new SignificantTermsBuilder("sig_terms"))
.field(TEXT_FIELD)
.significanceHeuristic(new SimpleHeuristic.SimpleHeuristicBuilder())
.minDocCount(1)
)
)
.execute()
.actionGet();
assertSearchResponse(response);
StringTerms classes = (StringTerms) response.getAggregations().get("class");
assertThat(classes.getBuckets().size(), equalTo(2));
for (Terms.Bucket classBucket : classes.getBuckets()) {
Map<String, Aggregation> aggs = classBucket.getAggregations().asMap();
assertTrue(aggs.containsKey("sig_terms"));
SignificantTerms agg = (SignificantTerms) aggs.get("sig_terms");
assertThat(agg.getBuckets().size(), equalTo(2));
Iterator<SignificantTerms.Bucket> bucketIterator = agg.iterator();
SignificantTerms.Bucket sigBucket = bucketIterator.next();
String term = sigBucket.getKey();
String classTerm = classBucket.getKey();
assertThat(term, equalTo(classTerm));
assertThat(sigBucket.getSignificanceScore(), closeTo(2.0, 1.e-8));
sigBucket = bucketIterator.next();
assertThat(sigBucket.getSignificanceScore(), closeTo(1.0, 1.e-8));
}
// Run the same test again, this time without calling assertSearchResponse() before the assertions:
// assertSearchResponse() triggers toXContent, and we want to verify the assertions also hold when toXContent has not run, i.e. that it has no side effects.
response = client().prepareSearch(INDEX_NAME).setTypes(DOC_TYPE)
.addAggregation(new TermsBuilder("class")
.field(CLASS_FIELD)
.subAggregation((new SignificantTermsBuilder("sig_terms"))
.field(TEXT_FIELD)
.significanceHeuristic(new SimpleHeuristic.SimpleHeuristicBuilder())
.minDocCount(1)
)
)
.execute()
.actionGet();
classes = (StringTerms) response.getAggregations().get("class");
assertThat(classes.getBuckets().size(), equalTo(2));
for (Terms.Bucket classBucket : classes.getBuckets()) {
Map<String, Aggregation> aggs = classBucket.getAggregations().asMap();
assertTrue(aggs.containsKey("sig_terms"));
SignificantTerms agg = (SignificantTerms) aggs.get("sig_terms");
assertThat(agg.getBuckets().size(), equalTo(2));
Iterator<SignificantTerms.Bucket> bucketIterator = agg.iterator();
SignificantTerms.Bucket sigBucket = bucketIterator.next();
String term = sigBucket.getKey();
String classTerm = classBucket.getKey();
assertThat(term, equalTo(classTerm));
assertThat(sigBucket.getSignificanceScore(), closeTo(2.0, 1.e-8));
sigBucket = bucketIterator.next();
assertThat(sigBucket.getSignificanceScore(), closeTo(1.0, 1.e-8));
}
}
public static class CustomSignificanceHeuristicPlugin extends AbstractPlugin {
@Override
public String name() {
return "test-plugin-significance-heuristic";
}
@Override
public String description() {
return "Significance heuristic plugin";
}
public void onModule(SignificantTermsHeuristicModule significanceModule) {
significanceModule.registerHeuristic(SimpleHeuristic.SimpleHeuristicParser.class, SimpleHeuristic.STREAM);
}
}
public static class SimpleHeuristic implements SignificanceHeuristic {
protected static final String[] NAMES = {"simple"};
public static final SignificanceHeuristicStreams.Stream STREAM = new SignificanceHeuristicStreams.Stream() {
@Override
public SignificanceHeuristic readResult(StreamInput in) throws IOException {
return readFrom(in);
}
@Override
public String getName() {
return NAMES[0];
}
};
public static SignificanceHeuristic readFrom(StreamInput in) throws IOException {
return new SimpleHeuristic();
}
/**
* @param subsetFreq The frequency of the term in the selected sample
* @param subsetSize The size of the selected sample (typically number of docs)
* @param supersetFreq The frequency of the term in the superset from which the sample was taken
* @param supersetSize The size of the superset from which the sample was taken (typically number of docs)
* @return a "significance" score
*/
@Override
public double getScore(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
// Floating-point division is required here; with long division the quotients would truncate to 0 or 1.
return (double) subsetFreq / subsetSize > (double) supersetFreq / supersetSize ? 2.0 : 1.0;
}
@Override
public void writeTo(StreamOutput out) throws IOException {
out.writeString(STREAM.getName());
}
public static class SimpleHeuristicParser implements SignificanceHeuristicParser {
@Override
public SignificanceHeuristic parse(XContentParser parser) throws IOException, QueryParsingException {
parser.nextToken();
return new SimpleHeuristic();
}
@Override
public String[] getNames() {
return NAMES;
}
}
public static class SimpleHeuristicBuilder implements SignificanceHeuristicBuilder {
@Override
public void toXContent(XContentBuilder builder) throws IOException {
builder.startObject(STREAM.getName()).endObject();
}
}
}
@Test
public void testXContentResponse() throws Exception {
String type = randomBoolean() ? "string" : "long";
String settings = "{\"index.number_of_shards\": 1, \"index.number_of_replicas\": 0}";
index01Docs(type, settings);
SearchResponse response = client().prepareSearch(INDEX_NAME).setTypes(DOC_TYPE)
.addAggregation(new TermsBuilder("class").field(CLASS_FIELD).subAggregation(new SignificantTermsBuilder("sig_terms").field(TEXT_FIELD)))
.execute()
.actionGet();
assertSearchResponse(response);
StringTerms classes = (StringTerms) response.getAggregations().get("class");
assertThat(classes.getBuckets().size(), equalTo(2));
for (Terms.Bucket classBucket : classes.getBuckets()) {
Map<String, Aggregation> aggs = classBucket.getAggregations().asMap();
assertTrue(aggs.containsKey("sig_terms"));
SignificantTerms agg = (SignificantTerms) aggs.get("sig_terms");
assertThat(agg.getBuckets().size(), equalTo(1));
String term = agg.iterator().next().getKey();
String classTerm = classBucket.getKey();
assertThat(term, equalTo(classTerm));
}
XContentBuilder responseBuilder = XContentFactory.jsonBuilder();
classes.toXContent(responseBuilder, null);
String result = null;
if (type.equals("long")) {
result = "\"class\"{\"buckets\":[{\"key\":\"0\",\"doc_count\":4,\"sig_terms\":{\"doc_count\":4,\"buckets\":[{\"key\":0,\"key_as_string\":\"0\",\"doc_count\":4,\"score\":0.39999999999999997,\"bg_count\":5}]}},{\"key\":\"1\",\"doc_count\":3,\"sig_terms\":{\"doc_count\":3,\"buckets\":[{\"key\":1,\"key_as_string\":\"1\",\"doc_count\":3,\"score\":0.75,\"bg_count\":4}]}}]}";
} else {
result = "\"class\"{\"buckets\":[{\"key\":\"0\",\"doc_count\":4,\"sig_terms\":{\"doc_count\":4,\"buckets\":[{\"key\":\"0\",\"doc_count\":4,\"score\":0.39999999999999997,\"bg_count\":5}]}},{\"key\":\"1\",\"doc_count\":3,\"sig_terms\":{\"doc_count\":3,\"buckets\":[{\"key\":\"1\",\"doc_count\":3,\"score\":0.75,\"bg_count\":4}]}}]}";
}
assertThat(responseBuilder.string(), equalTo(result));
}
// Compute the significance scores in two ways:
// 1. a terms agg on class with a significant_terms sub-aggregation
// 2. one filter agg per class, with the background set to the other class and background_is_superset=false
// Both approaches must yield exactly the same scores.
@Test
public void testBackgroundVsSeparateSet() throws Exception {
String type = randomBoolean() ? "string" : "long";
String settings = "{\"index.number_of_shards\": 1, \"index.number_of_replicas\": 0}";
index01Docs(type, settings);
SearchResponse response1 = client().prepareSearch(INDEX_NAME).setTypes(DOC_TYPE)
.addAggregation(new TermsBuilder("class")
.field(CLASS_FIELD)
.subAggregation(
new SignificantTermsBuilder("sig_terms")
.field(TEXT_FIELD)
.minDocCount(1)
.significanceHeuristic(
new MutualInformation.MutualInformationBuilder(true, true))))
.execute()
.actionGet();
assertSearchResponse(response1);
SearchResponse response2 = client().prepareSearch(INDEX_NAME).setTypes(DOC_TYPE)
.addAggregation((new FilterAggregationBuilder("0"))
.filter(FilterBuilders.termFilter(CLASS_FIELD, "0"))
.subAggregation(new SignificantTermsBuilder("sig_terms")
.field(TEXT_FIELD)
.minDocCount(1)
.backgroundFilter(FilterBuilders.termFilter(CLASS_FIELD, "1"))
.significanceHeuristic(new MutualInformation.MutualInformationBuilder(true, false))))
.addAggregation((new FilterAggregationBuilder("1"))
.filter(FilterBuilders.termFilter(CLASS_FIELD, "1"))
.subAggregation(new SignificantTermsBuilder("sig_terms")
.field(TEXT_FIELD)
.minDocCount(1)
.backgroundFilter(FilterBuilders.termFilter(CLASS_FIELD, "0"))
.significanceHeuristic(new MutualInformation.MutualInformationBuilder(true, false))))
.execute()
.actionGet();
SignificantTerms sigTerms0 = ((SignificantTerms) (((StringTerms) response1.getAggregations().get("class")).getBucketByKey("0").getAggregations().asMap().get("sig_terms")));
assertThat(sigTerms0.getBuckets().size(), equalTo(2));
double score00Background = sigTerms0.getBucketByKey("0").getSignificanceScore();
double score01Background = sigTerms0.getBucketByKey("1").getSignificanceScore();
SignificantTerms sigTerms1 = ((SignificantTerms) (((StringTerms) response1.getAggregations().get("class")).getBucketByKey("1").getAggregations().asMap().get("sig_terms")));
double score10Background = sigTerms1.getBucketByKey("0").getSignificanceScore();
double score11Background = sigTerms1.getBucketByKey("1").getSignificanceScore();
double score00SeparateSets = ((SignificantTerms) ((InternalFilter) response2.getAggregations().get("0")).getAggregations().getAsMap().get("sig_terms")).getBucketByKey("0").getSignificanceScore();
double score01SeparateSets = ((SignificantTerms) ((InternalFilter) response2.getAggregations().get("0")).getAggregations().getAsMap().get("sig_terms")).getBucketByKey("1").getSignificanceScore();
double score10SeparateSets = ((SignificantTerms) ((InternalFilter) response2.getAggregations().get("1")).getAggregations().getAsMap().get("sig_terms")).getBucketByKey("0").getSignificanceScore();
double score11SeparateSets = ((SignificantTerms) ((InternalFilter) response2.getAggregations().get("1")).getAggregations().getAsMap().get("sig_terms")).getBucketByKey("1").getSignificanceScore();
assertThat(score00Background, equalTo(score00SeparateSets));
assertThat(score01Background, equalTo(score01SeparateSets));
assertThat(score10Background, equalTo(score10SeparateSets));
assertThat(score11Background, equalTo(score11SeparateSets));
}
private void index01Docs(String type, String settings) throws ExecutionException, InterruptedException {
String mappings = "{\"doc\": {\"properties\":{\"text\": {\"type\":\"" + type + "\"}}}}";
assertAcked(prepareCreate(INDEX_NAME).setSettings(settings).addMapping("doc", mappings));
String[] gb = {"0", "1"};
List<IndexRequestBuilder> indexRequestBuilderList = new ArrayList<>();
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "1")
.setSource(TEXT_FIELD, "1", CLASS_FIELD, "1"));
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "2")
.setSource(TEXT_FIELD, "1", CLASS_FIELD, "1"));
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "3")
.setSource(TEXT_FIELD, "0", CLASS_FIELD, "0"));
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "4")
.setSource(TEXT_FIELD, "0", CLASS_FIELD, "0"));
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "5")
.setSource(TEXT_FIELD, gb, CLASS_FIELD, "1"));
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "6")
.setSource(TEXT_FIELD, gb, CLASS_FIELD, "0"));
indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME, DOC_TYPE, "7")
.setSource(TEXT_FIELD, "0", CLASS_FIELD, "0"));
indexRandom(true, indexRequestBuilderList);
}
@Test
public void testMutualInformationEqual() throws Exception {
indexEqualTestData();
// Check that, with include_negatives=true, both classes yield the same scores for their mirrored term distributions.
SearchResponse response = client().prepareSearch("test")
.addAggregation(new TermsBuilder("class").field("class").subAggregation(new SignificantTermsBuilder("mySignificantTerms")
.field("text")
.executionHint(randomExecutionHint())
.significanceHeuristic(new MutualInformation.MutualInformationBuilder(true, true))
.minDocCount(1).shardSize(1000).size(1000)))
.execute()
.actionGet();
assertSearchResponse(response);
StringTerms classes = (StringTerms) response.getAggregations().get("class");
assertThat(classes.getBuckets().size(), equalTo(2));
Iterator<Terms.Bucket> classBuckets = classes.getBuckets().iterator();
Collection<SignificantTerms.Bucket> classA = ((SignificantTerms) classBuckets.next().getAggregations().get("mySignificantTerms")).getBuckets();
Iterator<SignificantTerms.Bucket> classBBucketIterator = ((SignificantTerms) classBuckets.next().getAggregations().get("mySignificantTerms")).getBuckets().iterator();
assertThat(classA.size(), greaterThan(0));
for (SignificantTerms.Bucket classABucket : classA) {
SignificantTerms.Bucket classBBucket = classBBucketIterator.next();
assertThat(classABucket.getKey(), equalTo(classBBucket.getKey()));
assertThat(classABucket.getSignificanceScore(), closeTo(classBBucket.getSignificanceScore(), 1.e-5));
}
}
private void indexEqualTestData() throws ExecutionException, InterruptedException {
assertAcked(prepareCreate("test").setSettings(SETTING_NUMBER_OF_SHARDS, 1, SETTING_NUMBER_OF_REPLICAS, 0).addMapping("doc",
"text", "type=string", "class", "type=string"));
createIndex("idx_unmapped");
ensureGreen();
String[] data = {
"A\ta",
"A\ta",
"A\tb",
"A\tb",
"A\tb",
"B\tc",
"B\tc",
"B\tc",
"B\tc",
"B\td",
"B\td",
"B\td",
"B\td",
"B\td",
"A\tc d",
"B\ta b"
};
List<IndexRequestBuilder> indexRequestBuilders = new ArrayList<>();
for (int i = 0; i < data.length; i++) {
String[] parts = data[i].split("\t");
indexRequestBuilders.add(client().prepareIndex("test", "doc", "" + i)
.setSource("class", parts[0], "text", parts[1]));
}
indexRandom(true, indexRequestBuilders);
}
}
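The equivalence exercised by `testBackgroundVsSeparateSet` can also be written directly in the request DSL. A hedged sketch of the separate-set variant for class `0` (field and class values taken from the test; the exact JSON shape is an assumption):

[source,js]
--------------------------------------------------
"sig_terms": {
    "significant_terms": {
        "field": "text",
        "min_doc_count": 1,
        "background_filter": { "term": { "class": "1" } },
        "mutual_information": {
            "include_negatives": true,
            "background_is_superset": false
        }
    }
}
--------------------------------------------------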

View File

@ -19,30 +19,24 @@
package org.elasticsearch.search.aggregations.bucket;
import org.elasticsearch.action.admin.indices.refresh.RefreshRequest;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.TermQueryBuilder;
import org.elasticsearch.search.aggregations.Aggregation;
import org.elasticsearch.search.aggregations.bucket.significant.SignificantStringTerms;
import org.elasticsearch.search.aggregations.bucket.significant.SignificantTerms;
import org.elasticsearch.search.aggregations.bucket.significant.SignificantTerms.Bucket;
import org.elasticsearch.search.aggregations.bucket.significant.SignificantTermsAggregatorFactory.ExecutionMode;
import org.elasticsearch.search.aggregations.bucket.significant.SignificantTermsBuilder;
import org.elasticsearch.search.aggregations.bucket.terms.StringTerms;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.JLHScore;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.MutualInformation;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.elasticsearch.search.aggregations.bucket.terms.TermsBuilder;
import org.elasticsearch.test.ElasticsearchIntegrationTest;
import org.junit.Test;
import java.util.*;
import java.util.concurrent.ExecutionException;
import static org.elasticsearch.cluster.metadata.IndexMetaData.SETTING_NUMBER_OF_REPLICAS;
import static org.elasticsearch.cluster.metadata.IndexMetaData.SETTING_NUMBER_OF_SHARDS;
@ -219,7 +213,8 @@ public class SignificantTermsTests extends ElasticsearchIntegrationTest {
}
}
assertTrue(hasMissingBackgroundTerms);
}
}
@Test
public void filteredAnalysis() throws Exception {
@ -276,13 +271,13 @@ public class SignificantTermsTests extends ElasticsearchIntegrationTest {
@Test
public void partiallyUnmapped() throws Exception {
SearchResponse response = client().prepareSearch("idx_unmapped","test")
SearchResponse response = client().prepareSearch("idx_unmapped", "test")
.setSearchType(SearchType.QUERY_AND_FETCH)
.setQuery(new TermQueryBuilder("_all", "terje"))
.setFrom(0).setSize(60).setExplain(true)
.addAggregation(new SignificantTermsBuilder("mySignificantTerms").field("description")
.executionHint(randomExecutionHint())
.minDocCount(2))
.executionHint(randomExecutionHint())
.minDocCount(2))
.execute()
.actionGet();
assertSearchResponse(response);
@ -306,63 +301,38 @@ public class SignificantTermsTests extends ElasticsearchIntegrationTest {
assertEquals(4, kellyTerm.getSupersetDf());
}
@Test
public void testXContentResponse() throws Exception {
String indexName = "10index";
String docType = "doc";
String classField = "class";
String textField = "text";
cluster().wipeIndices(indexName);
String type = randomBoolean() ? "string" : "long";
index01Docs(indexName, docType, classField, textField, type);
SearchResponse response = client().prepareSearch(indexName).setTypes(docType)
.addAggregation(new TermsBuilder("class").field(classField).subAggregation(new SignificantTermsBuilder("sig_terms").field(textField)))
public void testDefaultSignificanceHeuristic() throws Exception {
SearchResponse response = client().prepareSearch("test")
.setSearchType(SearchType.QUERY_AND_FETCH)
.setQuery(new TermQueryBuilder("_all", "terje"))
.setFrom(0).setSize(60).setExplain(true)
.addAggregation(new SignificantTermsBuilder("mySignificantTerms")
.field("description")
.executionHint(randomExecutionHint())
.significanceHeuristic(new JLHScore.JLHScoreBuilder())
.minDocCount(2))
.execute()
.actionGet();
assertSearchResponse(response);
StringTerms classes = (StringTerms) response.getAggregations().get("class");
assertThat(classes.getBuckets().size(), equalTo(2));
for (Terms.Bucket classBucket : classes.getBuckets()) {
Map<String, Aggregation> aggs = classBucket.getAggregations().asMap();
assertTrue(aggs.containsKey("sig_terms"));
SignificantTerms agg = (SignificantTerms) aggs.get("sig_terms");
assertThat(agg.getBuckets().size(), equalTo(1));
String term = agg.iterator().next().getKey();
String classTerm = classBucket.getKey();
assertTrue(term.equals(classTerm));
}
XContentBuilder responseBuilder = XContentFactory.jsonBuilder();
classes.toXContent(responseBuilder, null);
String result = null;
if (type.equals("long")) {
result = "\"class\"{\"buckets\":[{\"key\":\"0\",\"doc_count\":4,\"sig_terms\":{\"doc_count\":4,\"buckets\":[{\"key\":0,\"key_as_string\":\"0\",\"doc_count\":4,\"score\":0.39999999999999997,\"bg_count\":5}]}},{\"key\":\"1\",\"doc_count\":3,\"sig_terms\":{\"doc_count\":3,\"buckets\":[{\"key\":1,\"key_as_string\":\"1\",\"doc_count\":3,\"score\":0.75,\"bg_count\":4}]}}]}";
} else {
result = "\"class\"{\"buckets\":[{\"key\":\"0\",\"doc_count\":4,\"sig_terms\":{\"doc_count\":4,\"buckets\":[{\"key\":\"0\",\"doc_count\":4,\"score\":0.39999999999999997,\"bg_count\":5}]}},{\"key\":\"1\",\"doc_count\":3,\"sig_terms\":{\"doc_count\":3,\"buckets\":[{\"key\":\"1\",\"doc_count\":3,\"score\":0.75,\"bg_count\":4}]}}]}";
}
assertThat(responseBuilder.string(), equalTo(result));
SignificantTerms topTerms = response.getAggregations().get("mySignificantTerms");
checkExpectedStringTermsFound(topTerms);
}
private void index01Docs(String indexName, String docType, String classField, String textField, String type) throws ExecutionException, InterruptedException {
String mappings = "{\"doc\": {\"properties\":{\"text\": {\"type\":\"" + type + "\"}}}}";
assertAcked(prepareCreate(indexName).setSettings(SETTING_NUMBER_OF_SHARDS, 1, SETTING_NUMBER_OF_REPLICAS, 0).addMapping("doc", mappings));
String[] gb = {"0", "1"};
List<IndexRequestBuilder> indexRequestBuilderList = new ArrayList<>();
indexRequestBuilderList.add(client().prepareIndex(indexName, docType, "1")
.setSource(textField, "1", classField, "1"));
indexRequestBuilderList.add(client().prepareIndex(indexName, docType, "2")
.setSource(textField, "1", classField, "1"));
indexRequestBuilderList.add(client().prepareIndex(indexName, docType, "3")
.setSource(textField, "0", classField, "0"));
indexRequestBuilderList.add(client().prepareIndex(indexName, docType, "4")
.setSource(textField, "0", classField, "0"));
indexRequestBuilderList.add(client().prepareIndex(indexName, docType, "5")
.setSource(textField, gb, classField, "1"));
indexRequestBuilderList.add(client().prepareIndex(indexName, docType, "6")
.setSource(textField, gb, classField, "0"));
indexRequestBuilderList.add(client().prepareIndex(indexName, docType, "7")
.setSource(textField, "0", classField, "0"));
indexRandom(true, indexRequestBuilderList);
@Test
public void testMutualInformation() throws Exception {
SearchResponse response = client().prepareSearch("test")
.setSearchType(SearchType.QUERY_AND_FETCH)
.setQuery(new TermQueryBuilder("_all", "terje"))
.setFrom(0).setSize(60).setExplain(true)
.addAggregation(new SignificantTermsBuilder("mySignificantTerms")
.field("description")
.executionHint(randomExecutionHint())
.significanceHeuristic(new MutualInformation.MutualInformationBuilder(false, true))
.minDocCount(1))
.execute()
.actionGet();
assertSearchResponse(response);
SignificantTerms topTerms = response.getAggregations().get("mySignificantTerms");
checkExpectedStringTermsFound(topTerms);
}
}
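For completeness, selecting the default heuristic explicitly, as `testDefaultSignificanceHeuristic` does through `JLHScoreBuilder`, corresponds to naming `jlh` in the request body; a hedged sketch:

[source,js]
--------------------------------------------------
"mySignificantTerms": {
    "significant_terms": {
        "field": "description",
        "min_doc_count": 2,
        "jlh": {}
    }
}
--------------------------------------------------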

View File

@ -0,0 +1,354 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.aggregations.bucket.significant;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.ElasticsearchIllegalArgumentException;
import org.elasticsearch.ElasticsearchParseException;
import org.elasticsearch.Version;
import org.elasticsearch.common.io.stream.InputStreamStreamInput;
import org.elasticsearch.common.io.stream.OutputStreamStreamOutput;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.common.xcontent.json.JsonXContent;
import org.elasticsearch.index.search.child.TestSearchContext;
import org.elasticsearch.search.SearchShardTarget;
import org.elasticsearch.search.aggregations.InternalAggregations;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.*;
import org.elasticsearch.search.internal.SearchContext;
import org.elasticsearch.test.ElasticsearchIntegrationTest;
import org.elasticsearch.test.ElasticsearchTestCase;
import org.junit.Test;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.*;
import static org.hamcrest.Matchers.*;
/**
*
*/
public class SignificanceHeuristicTests extends ElasticsearchTestCase {
static class SignificantTermsTestSearchContext extends TestSearchContext {
@Override
public int numberOfShards() {
return 1;
}
@Override
public SearchShardTarget shardTarget() {
return new SearchShardTarget("no node, this is a unit test", "no index, this is a unit test", 0);
}
}
// Test that the stream output can actually be read back; this does not replace the backwards compatibility test.
@Test
public void streamResponse() throws Exception {
SignificanceHeuristicStreams.registerStream(MutualInformation.STREAM, MutualInformation.STREAM.getName());
SignificanceHeuristicStreams.registerStream(JLHScore.STREAM, JLHScore.STREAM.getName());
Version version = ElasticsearchIntegrationTest.randomVersion();
InternalSignificantTerms[] sigTerms = getRandomSignificantTerms(getRandomSignificanceHeuristic());
// write
ByteArrayOutputStream outBuffer = new ByteArrayOutputStream();
OutputStreamStreamOutput out = new OutputStreamStreamOutput(outBuffer);
out.setVersion(version);
sigTerms[0].writeTo(out);
// read
ByteArrayInputStream inBuffer = new ByteArrayInputStream(outBuffer.toByteArray());
InputStreamStreamInput in = new InputStreamStreamInput(inBuffer);
in.setVersion(version);
sigTerms[1].readFrom(in);
if (version.onOrAfter(Version.V_1_3_0)) {
assertTrue(sigTerms[1].significanceHeuristic.equals(sigTerms[0].significanceHeuristic));
} else {
assertTrue(sigTerms[1].significanceHeuristic instanceof JLHScore);
}
}
InternalSignificantTerms[] getRandomSignificantTerms(SignificanceHeuristic heuristic) {
InternalSignificantTerms[] sTerms = new InternalSignificantTerms[2];
ArrayList<InternalSignificantTerms.Bucket> buckets = new ArrayList<>();
if (randomBoolean()) {
buckets.add(new SignificantLongTerms.Bucket(1, 2, 3, 4, 123, InternalAggregations.EMPTY));
sTerms[0] = new SignificantLongTerms(10, 20, "some_name", null, 1, 1, heuristic, buckets);
sTerms[1] = new SignificantLongTerms();
} else {
BytesRef term = new BytesRef("someterm");
buckets.add(new SignificantStringTerms.Bucket(term, 1, 2, 3, 4, InternalAggregations.EMPTY));
sTerms[0] = new SignificantStringTerms(10, 20, "some_name", 1, 1, heuristic, buckets);
sTerms[1] = new SignificantStringTerms();
}
return sTerms;
}
SignificanceHeuristic getRandomSignificanceHeuristic() {
if (randomBoolean()) {
return JLHScore.INSTANCE;
} else {
return new MutualInformation(randomBoolean(), true);
}
}
// test that
// 1. The output of the builders can actually be parsed
// 2. The parser does not swallow parameters after a significance heuristic was defined
@Test
public void testBuilderAndParser() throws Exception {
Set<SignificanceHeuristicParser> parsers = new HashSet<>();
parsers.add(new JLHScore.JLHScoreParser());
parsers.add(new MutualInformation.MutualInformationParser());
SignificanceHeuristicParserMapper heuristicParserMapper = new SignificanceHeuristicParserMapper(parsers);
SearchContext searchContext = new SignificantTermsTestSearchContext();
// test default with string
XContentParser stParser = JsonXContent.jsonXContent.createParser("{\"field\":\"text\", \"jlh\":{}, \"min_doc_count\":200}");
stParser.nextToken();
SignificantTermsAggregatorFactory aggregatorFactory = (SignificantTermsAggregatorFactory) new SignificantTermsParser(heuristicParserMapper).parse("testagg", stParser, searchContext);
stParser.nextToken();
assertThat(aggregatorFactory.getBucketCountThresholds().getMinDocCount(), equalTo(200L));
assertThat(stParser.currentToken(), equalTo(null));
stParser.close();
// test default with builders
SignificantTermsBuilder stBuilder = new SignificantTermsBuilder("testagg");
stBuilder.significanceHeuristic(new JLHScore.JLHScoreBuilder()).field("text").minDocCount(200);
XContentBuilder stXContentBuilder = XContentFactory.jsonBuilder();
stBuilder.internalXContent(stXContentBuilder, null);
stParser = JsonXContent.jsonXContent.createParser(stXContentBuilder.string());
stParser.nextToken();
aggregatorFactory = (SignificantTermsAggregatorFactory) new SignificantTermsParser(heuristicParserMapper).parse("testagg", stParser, searchContext);
stParser.nextToken();
assertThat(aggregatorFactory.getBucketCountThresholds().getMinDocCount(), equalTo(200L));
assertThat(stParser.currentToken(), equalTo(null));
stParser.close();
// test mutual_information with string
stParser = JsonXContent.jsonXContent.createParser("{\"field\":\"text\", \"mutual_information\":{\"include_negatives\": false}, \"min_doc_count\":200}");
stParser.nextToken();
aggregatorFactory = (SignificantTermsAggregatorFactory) new SignificantTermsParser(heuristicParserMapper).parse("testagg", stParser, searchContext);
stParser.nextToken();
assertThat(aggregatorFactory.getBucketCountThresholds().getMinDocCount(), equalTo(200L));
assertFalse(((MutualInformation) aggregatorFactory.getSignificanceHeuristic()).getIncludeNegatives());
assertThat(stParser.currentToken(), equalTo(null));
stParser.close();
// test mutual_information with builders
stBuilder = new SignificantTermsBuilder("testagg");
stBuilder.significanceHeuristic(new MutualInformation.MutualInformationBuilder(false, true)).field("text").minDocCount(200);
stXContentBuilder = XContentFactory.jsonBuilder();
stBuilder.internalXContent(stXContentBuilder, null);
stParser = JsonXContent.jsonXContent.createParser(stXContentBuilder.string());
stParser.nextToken();
aggregatorFactory = (SignificantTermsAggregatorFactory) new SignificantTermsParser(heuristicParserMapper).parse("testagg", stParser, searchContext);
stParser.nextToken();
assertThat(aggregatorFactory.getBucketCountThresholds().getMinDocCount(), equalTo(200L));
assertFalse(((MutualInformation) aggregatorFactory.getSignificanceHeuristic()).getIncludeNegatives());
assertThat(stParser.currentToken(), equalTo(null));
stParser.close();
// test exceptions
try {
// 1. unknown parameter for mutual_information
stParser = JsonXContent.jsonXContent.createParser("{\"field\":\"text\", \"mutual_information\":{\"include_negatives\": false, \"some_unknown_field\": false}, \"min_doc_count\":200}");
stParser.nextToken();
new SignificantTermsParser(heuristicParserMapper).parse("testagg", stParser, searchContext);
fail();
} catch (ElasticsearchParseException e) {
assertTrue(e.getMessage().contains("unknown for mutual_information"));
}
try {
// 2. unknown field in jlh
stParser = JsonXContent.jsonXContent.createParser("{\"field\":\"text\", \"jlh\":{\"unknown_field\": true}, \"min_doc_count\":200}");
stParser.nextToken();
new SignificantTermsParser(heuristicParserMapper).parse("testagg", stParser, searchContext);
fail();
} catch (ElasticsearchParseException e) {
assertTrue(e.getMessage().contains("expected }, got "));
}
}
@Test
public void testAssertions() throws Exception {
MutualInformation mutualInformation = new MutualInformation(true, true);
try {
mutualInformation.getScore(2, 3, 1, 4);
fail();
} catch (ElasticsearchIllegalArgumentException illegalArgumentException) {
assertNotNull(illegalArgumentException.getMessage());
assertTrue(illegalArgumentException.getMessage().contains("subsetFreq > supersetFreq"));
}
try {
mutualInformation.getScore(1, 4, 2, 3);
fail();
} catch (ElasticsearchIllegalArgumentException illegalArgumentException) {
assertNotNull(illegalArgumentException.getMessage());
assertTrue(illegalArgumentException.getMessage().contains("subsetSize > supersetSize"));
}
try {
mutualInformation.getScore(2, 1, 3, 4);
fail();
} catch (ElasticsearchIllegalArgumentException illegalArgumentException) {
assertNotNull(illegalArgumentException.getMessage());
assertTrue(illegalArgumentException.getMessage().contains("subsetFreq > subsetSize"));
}
try {
mutualInformation.getScore(1, 2, 4, 3);
fail();
} catch (ElasticsearchIllegalArgumentException illegalArgumentException) {
assertNotNull(illegalArgumentException.getMessage());
assertTrue(illegalArgumentException.getMessage().contains("supersetFreq > supersetSize"));
}
try {
mutualInformation.getScore(1, 3, 4, 4);
fail();
} catch (ElasticsearchIllegalArgumentException illegalArgumentException) {
assertNotNull(illegalArgumentException.getMessage());
assertTrue(illegalArgumentException.getMessage().contains("supersetFreq - subsetFreq > supersetSize - subsetSize"));
}
try {
int idx = randomInt(3);
long[] values = {1, 2, 3, 4};
values[idx] *= -1;
mutualInformation.getScore(values[0], values[1], values[2], values[3]);
fail();
} catch (ElasticsearchIllegalArgumentException illegalArgumentException) {
assertNotNull(illegalArgumentException.getMessage());
assertTrue(illegalArgumentException.getMessage().contains("Frequencies of subset and superset must be positive"));
}
mutualInformation = new MutualInformation(true, false);
double score = mutualInformation.getScore(2, 3, 1, 4);
assertThat(score, greaterThanOrEqualTo(0.0));
assertThat(score, lessThanOrEqualTo(1.0));
score = mutualInformation.getScore(1, 4, 2, 3);
assertThat(score, greaterThanOrEqualTo(0.0));
assertThat(score, lessThanOrEqualTo(1.0));
try {
mutualInformation.getScore(2, 1, 3, 4);
fail();
} catch (ElasticsearchIllegalArgumentException illegalArgumentException) {
assertNotNull(illegalArgumentException.getMessage());
assertTrue(illegalArgumentException.getMessage().contains("subsetFreq > subsetSize"));
}
try {
mutualInformation.getScore(1, 2, 4, 3);
fail();
} catch (ElasticsearchIllegalArgumentException illegalArgumentException) {
assertNotNull(illegalArgumentException.getMessage());
assertTrue(illegalArgumentException.getMessage().contains("supersetFreq > supersetSize"));
}
score = mutualInformation.getScore(1, 3, 4, 4);
assertThat(score, greaterThanOrEqualTo(0.0));
assertThat(score, lessThanOrEqualTo(1.0));
try {
int idx = randomInt(3);
long[] values = {1, 2, 3, 4};
values[idx] *= -1;
mutualInformation.getScore(values[0], values[1], values[2], values[3]);
fail();
} catch (ElasticsearchIllegalArgumentException illegalArgumentException) {
assertNotNull(illegalArgumentException.getMessage());
assertTrue(illegalArgumentException.getMessage().contains("Frequencies of subset and superset must be positive"));
}
JLHScore jlhScore = JLHScore.INSTANCE;
try {
int idx = randomInt(3);
long[] values = {1, 2, 3, 4};
values[idx] *= -1;
jlhScore.getScore(values[0], values[1], values[2], values[3]);
fail();
} catch (ElasticsearchIllegalArgumentException illegalArgumentException) {
assertNotNull(illegalArgumentException.getMessage());
assertTrue(illegalArgumentException.getMessage().contains("Frequencies of subset and superset must be positive"));
}
try {
jlhScore.getScore(1, 2, 4, 3);
fail();
} catch (ElasticsearchIllegalArgumentException illegalArgumentException) {
assertNotNull(illegalArgumentException.getMessage());
assertTrue(illegalArgumentException.getMessage().contains("supersetFreq > supersetSize"));
}
try {
jlhScore.getScore(2, 1, 3, 4);
fail();
} catch (ElasticsearchIllegalArgumentException illegalArgumentException) {
assertNotNull(illegalArgumentException.getMessage());
assertTrue(illegalArgumentException.getMessage().contains("subsetFreq > subsetSize"));
}
}
@Test
public void scoreDefault() {
SignificanceHeuristic heuristic = JLHScore.INSTANCE;
assertThat(heuristic.getScore(1, 1, 1, 3), greaterThan(0.0));
assertThat(heuristic.getScore(1, 1, 2, 3), lessThan(heuristic.getScore(1, 1, 1, 3)));
assertThat(heuristic.getScore(0, 1, 2, 3), equalTo(0.0));
double score = 0.0;
try {
long a = randomLong();
long b = randomLong();
long c = randomLong();
long d = randomLong();
score = heuristic.getScore(a, b, c, d);
} catch (ElasticsearchIllegalArgumentException e) {
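// Random inputs may violate the frequency invariants; in that case the default score of 0.0 stands.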
}
assertThat(score, greaterThanOrEqualTo(0.0));
}
@Test
public void scoreMutual() throws Exception {
SignificanceHeuristic heuristic = new MutualInformation(true, true);
assertThat(heuristic.getScore(1, 1, 1, 3), greaterThan(0.0));
assertThat(heuristic.getScore(1, 1, 2, 3), lessThan(heuristic.getScore(1, 1, 1, 3)));
assertThat(heuristic.getScore(2, 2, 2, 4), equalTo(1.0));
assertThat(heuristic.getScore(0, 2, 2, 4), equalTo(1.0));
assertThat(heuristic.getScore(2, 2, 4, 4), equalTo(0.0));
assertThat(heuristic.getScore(1, 2, 2, 4), equalTo(0.0));
assertThat(heuristic.getScore(3, 6, 9, 18), equalTo(0.0));
double score = 0.0;
try {
long a = randomLong();
long b = randomLong();
long c = randomLong();
long d = randomLong();
score = heuristic.getScore(a, b, c, d);
} catch (ElasticsearchIllegalArgumentException e) {
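// Random inputs may violate the frequency invariants; in that case the default score of 0.0 stands.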
}
assertThat(score, lessThanOrEqualTo(1.0));
assertThat(score, greaterThanOrEqualTo(0.0));
heuristic = new MutualInformation(false, true);
assertThat(heuristic.getScore(0, 1, 2, 3), equalTo(-1.0 * Double.MAX_VALUE));
}
}
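The expected values in `scoreMutual` follow from the mutual information formula in Manning et al., chapter 13.5.1, which the heuristic presumably evaluates over the four cell counts derived from (subsetFreq, subsetSize, supersetFreq, supersetSize):

[source,latex]
--------------------------------------------------
I(T;C) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}}
         \frac{N_{e_t e_c}}{N} \log_2 \frac{N \cdot N_{e_t e_c}}{N_{e_t \cdot} \, N_{\cdot e_c}}
--------------------------------------------------

For `getScore(2, 2, 2, 4)` the cells are N11 = 2, N10 = N01 = 0, N00 = 2, giving I = 0.5 + 0.5 = 1.0; `getScore(0, 2, 2, 4)` is the perfectly anti-correlated case and, with `include_negatives` set to true, also scores 1.0, exactly as the assertions above require.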