Added `cross_fields` mode to multi_match query

`cross_fields` attempts to treat fields with the same analysis
configuration as a single field, and either promotes the maximum score
or combines the scores depending on the `use_dis_max` setting.
By default scores are combined. `cross_fields` can also search across
fields of heterogeneous types: for instance, if numbers can be part of
the query it makes sense to also search numeric fields, provided an
analyzer is given in the request.

Relates to #2959
This commit is contained in:
Simon Willnauer 2014-01-29 17:57:27 +01:00
parent 56479fb0e4
commit 162ca99376
8 changed files with 1616 additions and 341 deletions


@ -1,64 +1,437 @@
[[query-dsl-multi-match-query]]
=== Multi Match Query
The `multi_match` query builds further on top of the `match` query by
allowing multiple fields to be specified. The idea here is to allow to
more easily build a concise match type query over multiple fields
instead of using a relatively more expressive query by using multiple
match queries within a `bool` query.
The `multi_match` query builds on the <<query-dsl-match-query,`match` query>>
to allow multi-field queries:
The structure of the query is a bit different. Instead of a nested json
object defining the query field, there is a top json level field for
defining the query fields. Example:
[source,js]
--------------------------------------------------
{
"multi_match" : {
"query": "this is a test", <1>
"fields": [ "subject", "message" ] <2>
}
}
--------------------------------------------------
<1> The query string.
<2> The fields to be queried.
[float]
=== `fields` and per-field boosting
Fields can be specified with wildcards, eg:
[source,js]
--------------------------------------------------
{
"multi_match" : {
"query": "Will Smith"
"fields": [ "title", "*_name" ] <1>
}
}
--------------------------------------------------
<1> Query the `title`, `first_name` and `last_name` fields.
Individual fields can be boosted with the caret (`^`) notation:
[source,js]
--------------------------------------------------
{
"multi_match" : {
"query" : "this is a test",
"fields" : [ "subject", "message" ]
"fields" : [ "subject^3", "message" ] <1>
}
}
--------------------------------------------------
The `multi_match` query creates either a `bool` or a `dis_max` top level
query. Each field is a query clause in this top level query. The query
clause contains the actual query (the specified 'type' defines what
query this will be). Each query clause is basically a `should` clause.
<1> The `subject` field is three times as important as the `message` field.
[float]
=== `use_dis_max`
deprecated[1.1.0,Use `type:best_fields` or `type:most_fields` instead. See <<multi-match-types>>]
By default, the `multi_match` query generates a `match` clause per field, then wraps them
in a `dis_max` query. By setting `use_dis_max` to `false`, they will be wrapped in a
`bool` query instead.
[[multi-match-types]]
[float]
==== Options
=== Types of `multi_match` query:
All options that apply on the `match` query also apply on the
`multi_match` query. The `match` query options apply only on the
individual clauses inside the top level query.
coming[1.1.0]
* `fields` - Fields to be used in the query.
* `use_dis_max` - Boolean indicating to either create a `dis_max` query
or a `bool` query. Defaults to `true`.
* `tie_breaker` - Multiplier value to balance the scores between lower
and higher scoring fields. Only applicable when `use_dis_max` is set to
true. Defaults to `0.0`.
The way the `multi_match` query is executed internally depends on the `type`
parameter, which can be set to:
The query accepts all the options that a regular `match` query accepts.
[horizontal]
`best_fields`:: (*default*) Finds documents which match any field, but
uses the `_score` from the best field. See <<type-best-fields>>.
[float]
[float]
==== Boosting
`most_fields`:: Finds documents which match any field and combines
the `_score` from each field. See <<type-most-fields>>.
The `multi_match` query supports field boosting via `^` notation in the
fields json field.
`cross_fields`:: Treats fields with the same `analyzer` as though they
were one big field. Looks for each word in *any*
field. See <<type-cross-fields>>.
`phrase`:: Runs a `match_phrase` query on each field and combines
the `_score` from each field. See <<type-phrase>>.
`phrase_prefix`:: Runs a `match_phrase_prefix` query on each field and
combines the `_score` from each field. See <<type-phrase>>.
[[type-best-fields]]
==== `best_fields`
The `best_fields` type is most useful when you are searching for multiple
words best found in the same field. For instance ``brown fox'' in a single
field is more meaningful than ``brown'' in one field and ``fox'' in the other.
The `best_fields` type generates a <<query-dsl-match-query,`match` query>> for
each field and wraps them in a <<query-dsl-dis-max-query,`dis_max`>> query, to
find the single best matching field. For instance, this query:
[source,js]
--------------------------------------------------
{
"multi_match" : {
"query" : "this is a test",
"fields" : [ "subject^2", "message" ]
"query": "brown fox",
"type": "best_fields",
"fields": [ "subject", "message" ],
"tie_breaker": 0.3
}
}
--------------------------------------------------
In the above example hits in the `subject` field are 2 times more
important than in the `message` field.
would be executed as:
[source,js]
--------------------------------------------------
{
"dis_max": {
"queries": [
{ "match": { "subject": "brown fox" }},
{ "match": { "message": "brown fox" }}
],
"tie_breaker": 0.3
}
}
--------------------------------------------------
Normally the `best_fields` type uses the score of the *single* best matching
field, but if `tie_breaker` is specified, then it calculates the score as
follows:
* the score from the best matching field
* plus `tie_breaker * _score` for all other matching fields
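As a sketch (the per-field scores below are hypothetical, not produced by Elasticsearch), the two bullet points above amount to:

```python
# Mirrors the tie_breaker formula described above: the best field's score,
# plus tie_breaker times the score of every other matching field.
def best_fields_score(field_scores, tie_breaker=0.0):
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

# With subject scoring 2.0 and message scoring 1.0:
# tie_breaker 0.0 keeps only the best field -> 2.0
# tie_breaker 0.3 yields 2.0 + 0.3 * 1.0    -> 2.3
```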
Also, accepts `analyzer`, `boost`, `operator`, `minimum_should_match`,
`fuzziness`, `prefix_length`, `max_expansions`, `rewrite`, `zero_terms_query`
and `cutoff_frequency`, as explained in <<query-dsl-match-query, match query>>.
[IMPORTANT]
[[operator-min]]
.`operator` and `minimum_should_match`
==================================================
The `best_fields` and `most_fields` types are _field-centric_ -- they generate
a `match` query *per field*. This means that the `operator` and
`minimum_should_match` parameters are applied to each field individually,
which is probably not what you want.
Take this query for example:
[source,js]
--------------------------------------------------
{
"multi_match" : {
"query": "Will Smith",
"type": "best_fields",
"fields": [ "first_name", "last_name" ],
"operator": "and" <1>
}
}
--------------------------------------------------
<1> All terms must be present.
This query is executed as:
(+first_name:will +first_name:smith)
| (+last_name:will +last_name:smith)
In other words, *all terms* must be present *in a single field* for a document
to match.
See <<type-cross-fields>> for a better solution.
==================================================
[[type-most-fields]]
==== `most_fields`
The `most_fields` type is most useful when querying multiple fields that
contain the same text analyzed in different ways. For instance, the main
field may contain synonyms, stemming and terms without diacritics. A second
field may contain the original terms, and a third field might contain
shingles. By combining scores from all three fields we can match as many
documents as possible with the main field, but use the second and third fields
to push the most similar results to the top of the list.
This query:
[source,js]
--------------------------------------------------
{
"multi_match" : {
"query": "quick brown fox",
"type": "most_fields",
"fields": [ "title", "title.original", "title.shingles" ]
}
}
--------------------------------------------------
would be executed as:
[source,js]
--------------------------------------------------
{
"bool": {
"should": [
{ "match": { "title": "quick brown fox" }},
{ "match": { "title.original": "quick brown fox" }},
{ "match": { "title.shingles": "quick brown fox" }}
]
}
}
--------------------------------------------------
The score from each `match` clause is added together, then divided by the
number of `match` clauses.
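As a sketch with made-up clause scores, the combination described above is:

```python
# Each match clause contributes its score; the bool query sums them and
# divides by the number of clauses (scores here are purely illustrative).
def most_fields_score(clause_scores):
    return sum(clause_scores) / len(clause_scores)

# clauses scoring 1.5, 1.0 and 0.5 combine to (1.5 + 1.0 + 0.5) / 3 = 1.0
```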
Also, accepts `analyzer`, `boost`, `operator`, `minimum_should_match`,
`fuzziness`, `prefix_length`, `max_expansions`, `rewrite`, `zero_terms_query`
and `cutoff_frequency`, as explained in <<query-dsl-match-query,match query>>, but
*see <<operator-min>>*.
[[type-phrase]]
==== `phrase` and `phrase_prefix`
The `phrase` and `phrase_prefix` types behave just like <<type-best-fields>>,
but they use a `match_phrase` or `match_phrase_prefix` query instead of a
`match` query.
This query:
[source,js]
--------------------------------------------------
{
"multi_match" : {
"query": "quick brown f",
"type": "phrase_prefix",
"fields": [ "subject", "message" ]
}
}
--------------------------------------------------
[source,js]
--------------------------------------------------
{
"dis_max": {
"queries": [
{ "match_phrase_prefix": { "subject": "quick brown f" }},
{ "match_phrase_prefix": { "message": "quick brown f" }}
]
}
}
--------------------------------------------------
Also, accepts `analyzer`, `boost`, `slop` and `zero_terms_query` as explained
in <<query-dsl-match-query>>. Type `phrase_prefix` additionally accepts
`max_expansions`.
[[type-cross-fields]]
==== `cross_fields`
The `cross_fields` type is particularly useful with structured documents where
multiple fields *should* match. For instance, when querying the `first_name`
and `last_name` fields for ``Will Smith'', the best match is likely to have
``Will'' in one field and ``Smith'' in the other.
****
This sounds like a job for <<type-most-fields>> but there are two problems
with that approach. The first problem is that `operator` and
`minimum_should_match` are applied per-field, instead of per-term (see
<<operator-min,explanation above>>).
The second problem is to do with relevance: the different term frequencies in
the `first_name` and `last_name` fields can produce unexpected results.
For instance, imagine we have two people: ``Will Smith'' and ``Smith Jones''.
``Smith'' as a last name is very common (and so is of low importance) but
``Smith'' as a first name is very uncommon (and so is of great importance).
If we do a search for ``Will Smith'', the ``Smith Jones'' document will
probably appear above the better matching ``Will Smith'' because the score of
`first_name:smith` has trumped the combined scores of `first_name:will` plus
`last_name:smith`.
****
One way of dealing with these types of queries is simply to index the
`first_name` and `last_name` fields into a single `full_name` field. Of
course, this can only be done at index time.
The `cross_fields` type tries to solve these problems at query time by taking a
_term-centric_ approach. It first analyzes the query string into individual
terms, then looks for each term in any of the fields, as though they were one
big field.
A query like:
[source,js]
--------------------------------------------------
{
"multi_match" : {
"query": "Will Smith",
"type": "cross_fields",
"fields": [ "first_name", "last_name" ],
"operator": "and"
}
}
--------------------------------------------------
is executed as:
+(first_name:will last_name:will)
+(first_name:smith last_name:smith)
In other words, *all terms* must be present *in at least one field* for a
document to match. (Compare this to
<<operator-min,the logic used for `best_fields` and `most_fields`>>.)
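The contrast between the two matching models can be sketched with a toy document model, where each field is just a set of terms (this is an illustration, not the actual Lucene execution):

```python
# field-centric (best_fields / most_fields with operator:and):
# all terms must appear together in a single field
def field_centric_and(doc, fields, terms):
    return any(all(t in doc.get(f, set()) for t in terms) for f in fields)

# term-centric (cross_fields with operator:and):
# each term must appear in at least one of the fields
def term_centric_and(doc, fields, terms):
    return all(any(t in doc.get(f, set()) for f in fields) for t in terms)

doc = {"first_name": {"will"}, "last_name": {"smith"}}
# field_centric_and(doc, ["first_name", "last_name"], ["will", "smith"]) -> False
# term_centric_and(doc, ["first_name", "last_name"], ["will", "smith"]) -> True
```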
That solves one of the two problems. The problem of differing term frequencies
is solved by _blending_ the term frequencies for all fields in order to even
out the differences. In other words, `first_name:smith` will be treated as
though it has the same weight as `last_name:smith`. (Actually,
`first_name:smith` is given a tiny advantage over `last_name:smith`, just to
make the order of results more stable.)
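A rough sketch of the blending, loosely following the `BlendedTermQuery` code added in this commit (field names and frequencies are hypothetical, and the real implementation works on Lucene term statistics, not dictionaries):

```python
def blend_doc_freqs(doc_freqs, max_doc):
    """Assign one blended document frequency per field for a single term.

    Fields are ranked by descending df; each strictly less frequent field
    gets a df one higher than the previous, so the fields end up with
    slightly different weights and the result order stays stable.
    """
    actual_df = min(max_doc, max(doc_freqs.values()))
    blended, prev = {}, None
    for field, df in sorted(doc_freqs.items(), key=lambda kv: -kv[1]):
        if df == 0:
            break
        if prev is not None and prev > df:
            actual_df += 1
        blended[field] = min(max_doc, actual_df)
        prev = df
    return blended

# "smith" is common as a last name but rare as a first name:
# blend_doc_freqs({"last_name": 100, "first_name": 5}, 1000)
#   -> {"last_name": 100, "first_name": 101}
```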
If you run the above query through the <<search-validate>>, it returns this
explanation:
+blended("will", fields: [first_name, last_name])
+blended("smith", fields: [first_name, last_name])
Also, accepts `analyzer`, `boost`, `operator`, `minimum_should_match`,
`zero_terms_query` and `cutoff_frequency`, as explained in
<<query-dsl-match-query, match query>>.
===== `cross_fields` and analysis
The `cross_fields` type can only work in term-centric mode on fields that have
the same analyzer. Fields with the same analyzer are grouped together as in
the example above. If there are multiple groups, they are combined with a
`bool` query.
For instance, if we have a `first` and `last` field which have
the same analyzer, plus a `first.edge` and `last.edge` which
both use an `edge_ngram` analyzer, this query:
[source,js]
--------------------------------------------------
{
"multi_match" : {
"query": "Jon",
"type": "cross_fields",
"fields": [
"first", "first.edge",
"last", "last.edge"
]
}
}
--------------------------------------------------
would be executed as:
blended("jon", fields: [first, last])
| (
blended("j", fields: [first.edge, last.edge])
blended("jo", fields: [first.edge, last.edge])
blended("jon", fields: [first.edge, last.edge])
)
In other words, `first` and `last` would be grouped together and
treated as a single field, and `first.edge` and `last.edge` would be
grouped together and treated as a single field.
Having multiple groups is fine but, when combined with `operator` or
`minimum_should_match`, it can suffer from the <<operator-min,same problem>>
as `most_fields` or `best_fields`.
You can easily rewrite this query yourself as two separate `cross_fields`
queries combined with a `bool` query, and apply the `minimum_should_match`
parameter to just one of them:
[source,js]
--------------------------------------------------
{
"bool": {
"should": [
{
"multi_match" : {
"query": "Will Smith",
"type": "cross_fields",
"fields": [ "first", "last" ],
"minimum_should_match": 50% <1>
}
},
{
"multi_match" : {
"query": "Will Smith",
"type": "cross_fields",
"fields": [ "*.edge" ]
}
}
]
}
}
--------------------------------------------------
<1> Either `will` or `smith` must be present in either of the `first`
or `last` fields
You can force all fields into the same group by specifying the `analyzer`
parameter in the query.
[source,js]
--------------------------------------------------
{
"multi_match" : {
"query": "Jon",
"type": "cross_fields",
"analyzer": "standard", <1>
"fields": [ "first", "last", "*.edge" ]
}
}
--------------------------------------------------
<1> Use the `standard` analyzer for all fields.
which will be executed as:
blended("will", fields: [first, first.edge, last.edge, last])
blended("smith", fields: [first, first.edge, last.edge, last])
===== `tie_breaker`
By default, each per-term `blended` query will use the best score returned by
any field in a group, then these scores are added together to give the final
score. The `tie_breaker` parameter can change the default behaviour of the
per-term `blended` queries. It accepts:
[horizontal]
`0.0`:: Take the single best score out of (eg) `first_name:will`
and `last_name:will` (*default*)
`1.0`:: Add together the scores for (eg) `first_name:will` and
`last_name:will`
`0.0 < n < 1.0`:: Take the single best score plus `tie_breaker` multiplied
by each of the scores from other matching fields.
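With hypothetical per-field scores, the final `cross_fields` score sums one blended contribution per query term:

```python
def cross_fields_score(per_term_scores, tie_breaker=0.0):
    # per_term_scores: for each query term, the scores it got in each field.
    # Each term contributes its best field score plus tie_breaker times the
    # scores from the other matching fields; terms are then added together.
    total = 0.0
    for field_scores in per_term_scores:
        best = max(field_scores)
        total += best + tie_breaker * (sum(field_scores) - best)
    return total

# two terms, each matching two fields ([2.0, 1.0] and [3.0, 1.0]):
# tie_breaker 0.0 -> 2.0 + 3.0 = 5.0; tie_breaker 1.0 -> 3.0 + 4.0 = 7.0
```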


@ -0,0 +1,302 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.lucene.queries;
import com.google.common.primitives.Ints;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.util.ArrayUtil;
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
/**
* BlendedTermQuery can be used to unify term statistics across
* one or more fields in the index. A common problem with structured
* documents is that a term that is significant in one field might not be
* significant in other fields, as in a scenario where documents represent
* users with a "first_name" and a "second_name". When someone searches
* for "simon" they will very likely get "paul simon" first since "simon" is
* an uncommon last name, i.e. has a low document frequency. This query
* tries to "lie" about global statistics like document frequency as well
* as total term frequency to rank based on the estimated statistics.
* <p>
* While aggregating the total term frequency is trivial since it
* can be summed up, not every {@link org.apache.lucene.search.similarities.Similarity}
* makes use of this statistic. The document frequency, which is used in the
* {@link org.apache.lucene.search.similarities.DefaultSimilarity},
* can only be estimated as a lower bound since it is a document-based statistic. For
* the document frequency the maximum frequency across all fields per term is used,
* which is the minimum number of documents the term occurs in.
* </p>
*/
// TODO maybe contribute to Lucene
public abstract class BlendedTermQuery extends Query {
private final Term[] terms;
public BlendedTermQuery(Term[] terms) {
if (terms == null || terms.length == 0) {
throw new IllegalArgumentException("terms must not be null or empty");
}
this.terms = terms;
}
@Override
public Query rewrite(IndexReader reader) throws IOException {
IndexReaderContext context = reader.getContext();
TermContext[] ctx = new TermContext[terms.length];
int[] docFreqs = new int[ctx.length];
for (int i = 0; i < terms.length; i++) {
ctx[i] = TermContext.build(context, terms[i]);
docFreqs[i] = ctx[i].docFreq();
}
final int maxDoc = reader.maxDoc();
blend(ctx, maxDoc, reader);
Query query = topLevelQuery(terms, ctx, docFreqs, maxDoc);
query.setBoost(getBoost());
return query;
}
protected abstract Query topLevelQuery(Term[] terms, TermContext[] ctx, int[] docFreqs, int maxDoc);
protected void blend(TermContext[] contexts, int maxDoc, IndexReader reader) throws IOException {
if (contexts.length <= 1) {
return;
}
int max = 0;
long minSumTTF = Long.MAX_VALUE;
for (int i = 0; i < contexts.length; i++) {
TermContext ctx = contexts[i];
int df = ctx.docFreq();
// we use the max here since it's the only "true" estimation we can make here:
// at least max(df) documents have that term. Sums or averages don't seem
// to have a significant meaning here.
// TODO: Maybe it could also make sense to assume independent distributions of documents and eg. have:
// df = df1 + df2 - (df1 * df2 / maxDoc)?
max = Math.max(df, max);
if (minSumTTF != -1 && ctx.totalTermFreq() != -1) {
// we need to find out the minimum sumTTF to adjust the statistics
// otherwise the statistics don't match
minSumTTF = Math.min(minSumTTF, reader.getSumTotalTermFreq(terms[i].field()));
} else {
minSumTTF = -1;
}
}
if (minSumTTF != -1 && maxDoc > minSumTTF) {
maxDoc = (int)minSumTTF;
}
if (max == 0) {
return; // we are done, that term doesn't exist at all
}
long sumTTF = minSumTTF == -1 ? -1 : 0;
final TermContext[] tieBreak = new TermContext[contexts.length];
System.arraycopy(contexts, 0, tieBreak, 0, contexts.length);
ArrayUtil.timSort(tieBreak, new Comparator<TermContext>() {
@Override
public int compare(TermContext o1, TermContext o2) {
return Ints.compare(o2.docFreq(), o1.docFreq());
}
});
int prev = tieBreak[0].docFreq();
int actualDf = Math.min(maxDoc, max);
assert actualDf >=0 : "DF must be >= 0";
// here we try to add a little bias towards
// the more popular (more frequent) fields
// that acts as a tie breaker
for (TermContext ctx : tieBreak) {
if (ctx.docFreq() == 0) {
break;
}
final int current = ctx.docFreq();
if (prev > current) {
actualDf++;
}
ctx.setDocFreq(Math.min(maxDoc, actualDf));
prev = current;
if (sumTTF >= 0 && ctx.totalTermFreq() >= 0) {
sumTTF += ctx.totalTermFreq();
} else {
sumTTF = -1; // omit once TF is omitted anywhere!
}
}
sumTTF = Math.min(sumTTF, minSumTTF);
for (int i = 0; i < contexts.length; i++) {
int df = contexts[i].docFreq();
if (df == 0) {
continue;
}
// the blended sumTTF can't be greater than the sumTTF on the field
final long fixedTTF = sumTTF == -1 ? -1 : sumTTF;
contexts[i] = adjustTTF(contexts[i], fixedTTF);
}
}
private TermContext adjustTTF(TermContext termContext, long sumTTF) {
if (sumTTF == -1 && termContext.totalTermFreq() == -1) {
return termContext;
}
TermContext newTermContext = new TermContext(termContext.topReaderContext);
List<AtomicReaderContext> leaves = termContext.topReaderContext.leaves();
final int len;
if (leaves == null) {
len = 1;
} else {
len = leaves.size();
}
int df = termContext.docFreq();
long ttf = sumTTF;
for (int i = 0; i < len; i++) {
TermState termState = termContext.get(i);
if (termState == null) {
continue;
}
newTermContext.register(termState, i, df, ttf);
df = 0;
ttf = 0;
}
return newTermContext;
}
@Override
public String toString(String field) {
return "blended(terms: " + Arrays.toString(terms) + ")";
}
private volatile Term[] equalTerms = null;
private Term[] equalsTerms() {
if (terms.length == 1) {
return terms;
}
if (equalTerms == null) {
// sort the terms to make sure equals and hashCode are consistent
// this should be a very small cost and equivalent to a HashSet but less object creation
final Term[] t = new Term[terms.length];
System.arraycopy(terms, 0, t, 0, terms.length);
ArrayUtil.timSort(t);
equalTerms = t;
}
return equalTerms;
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
if (!super.equals(o)) return false;
BlendedTermQuery that = (BlendedTermQuery) o;
if (!Arrays.equals(equalsTerms(), that.equalsTerms())) return false;
return true;
}
@Override
public int hashCode() {
int result = super.hashCode();
result = 31 * result + Arrays.hashCode(equalsTerms());
return result;
}
public static BlendedTermQuery booleanBlendedQuery(Term[] terms, final boolean disableCoord) {
return booleanBlendedQuery(terms, null, disableCoord);
}
public static BlendedTermQuery booleanBlendedQuery(Term[] terms, final float[] boosts, final boolean disableCoord) {
return new BlendedTermQuery(terms) {
protected Query topLevelQuery(Term[] terms, TermContext[] ctx, int[] docFreqs, int maxDoc) {
BooleanQuery query = new BooleanQuery(disableCoord);
for (int i = 0; i < terms.length; i++) {
TermQuery termQuery = new TermQuery(terms[i], ctx[i]);
if (boosts != null) {
termQuery.setBoost(boosts[i]);
}
query.add(termQuery, BooleanClause.Occur.SHOULD);
}
return query;
}
};
}
public static BlendedTermQuery commonTermsBlendedQuery(Term[] terms, final float[] boosts, final boolean disableCoord, final float maxTermFrequency) {
return new BlendedTermQuery(terms) {
protected Query topLevelQuery(Term[] terms, TermContext[] ctx, int[] docFreqs, int maxDoc) {
BooleanQuery query = new BooleanQuery(true);
BooleanQuery high = new BooleanQuery(disableCoord);
BooleanQuery low = new BooleanQuery(disableCoord);
for (int i = 0; i < terms.length; i++) {
TermQuery termQuery = new TermQuery(terms[i], ctx[i]);
if (boosts != null) {
termQuery.setBoost(boosts[i]);
}
if ((maxTermFrequency >= 1f && docFreqs[i] > maxTermFrequency)
|| (docFreqs[i] > (int) Math.ceil(maxTermFrequency
* (float) maxDoc))) {
high.add(termQuery, BooleanClause.Occur.SHOULD);
} else {
low.add(termQuery, BooleanClause.Occur.SHOULD);
}
}
if (low.clauses().isEmpty()) {
for (BooleanClause booleanClause : high) {
booleanClause.setOccur(BooleanClause.Occur.MUST);
}
return high;
} else if (high.clauses().isEmpty()) {
return low;
} else {
query.add(high, BooleanClause.Occur.SHOULD);
query.add(low, BooleanClause.Occur.MUST);
return query;
}
}
};
}
public static BlendedTermQuery dismaxBlendedQuery(Term[] terms, final float tieBreakerMultiplier) {
return dismaxBlendedQuery(terms, null, tieBreakerMultiplier);
}
public static BlendedTermQuery dismaxBlendedQuery(Term[] terms, final float[] boosts, final float tieBreakerMultiplier) {
return new BlendedTermQuery(terms) {
protected Query topLevelQuery(Term[] terms, TermContext[] ctx, int[] docFreqs, int maxDoc) {
DisjunctionMaxQuery query = new DisjunctionMaxQuery(tieBreakerMultiplier);
for (int i = 0; i < terms.length; i++) {
TermQuery termQuery = new TermQuery(terms[i], ctx[i]);
if (boosts != null) {
termQuery.setBoost(boosts[i]);
}
query.add(termQuery);
}
return query;
}
};
}
}


@ -21,11 +21,15 @@ package org.elasticsearch.index.query;
import com.carrotsearch.hppc.ObjectFloatOpenHashMap;
import com.google.common.collect.Lists;
import org.elasticsearch.ElasticsearchParseException;
import org.elasticsearch.common.ParseField;
import org.elasticsearch.common.unit.Fuzziness;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.index.search.MatchQuery;
import java.io.IOException;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Locale;
@ -39,7 +43,7 @@ public class MultiMatchQueryBuilder extends BaseQueryBuilder implements Boostabl
private final List<String> fields;
private ObjectFloatOpenHashMap<String> fieldsBoosts;
private MatchQueryBuilder.Type type;
private MultiMatchQueryBuilder.Type type;
private MatchQueryBuilder.Operator operator;
@ -73,6 +77,82 @@ public class MultiMatchQueryBuilder extends BaseQueryBuilder implements Boostabl
private String queryName;
public enum Type {
/**
* Uses the best matching boolean field as main score and uses
* a tie-breaker to adjust the score based on remaining field matches
*/
BEST_FIELDS(MatchQuery.Type.BOOLEAN, 0.0f, new ParseField("best_fields", "boolean")),
/**
* Uses the sum of the matching boolean fields to score the query
*/
MOST_FIELDS(MatchQuery.Type.BOOLEAN, 1.0f, new ParseField("most_fields")),
/**
* Uses a blended DocumentFrequency to dynamically combine the queried
* fields into a single field given the configured analysis is identical.
* This type uses a tie-breaker to adjust the score based on remaining
* matches per analyzed terms
*/
CROSS_FIELDS(MatchQuery.Type.BOOLEAN, 0.0f, new ParseField("cross_fields")),
/**
* Uses the best matching phrase field as main score and uses
* a tie-breaker to adjust the score based on remaining field matches
*/
PHRASE(MatchQuery.Type.PHRASE, 0.0f, new ParseField("phrase")),
/**
* Uses the best matching phrase-prefix field as main score and uses
* a tie-breaker to adjust the score based on remaining field matches
*/
PHRASE_PREFIX(MatchQuery.Type.PHRASE_PREFIX, 0.0f, new ParseField("phrase_prefix"));
private MatchQuery.Type matchQueryType;
private final float tieBreaker;
private final ParseField parseField;
Type (MatchQuery.Type matchQueryType, float tieBreaker, ParseField parseField) {
this.matchQueryType = matchQueryType;
this.tieBreaker = tieBreaker;
this.parseField = parseField;
}
public float tieBreaker() {
return this.tieBreaker;
}
public MatchQuery.Type matchQueryType() {
return matchQueryType;
}
public ParseField parseField() {
return parseField;
}
public static Type parse(String value) {
return parse(value, ParseField.EMPTY_FLAGS);
}
public static Type parse(String value, EnumSet<ParseField.Flag> flags) {
MultiMatchQueryBuilder.Type[] values = MultiMatchQueryBuilder.Type.values();
Type type = null;
for (MultiMatchQueryBuilder.Type t : values) {
if (t.parseField().match(value, flags)) {
type = t;
break;
}
}
if (type == null) {
throw new ElasticsearchParseException("No type found for value: " + value);
}
return type;
}
}
/**
* Constructs a new text query.
*/
@ -105,11 +185,19 @@ public class MultiMatchQueryBuilder extends BaseQueryBuilder implements Boostabl
/**
* Sets the type of the text query.
*/
public MultiMatchQueryBuilder type(MatchQueryBuilder.Type type) {
public MultiMatchQueryBuilder type(MultiMatchQueryBuilder.Type type) {
this.type = type;
return this;
}
/**
* Sets the type of the text query.
*/
public MultiMatchQueryBuilder type(Object type) {
this.type = type == null ? null : Type.parse(type.toString().toLowerCase(Locale.ROOT));
return this;
}
/**
* Sets the operator to use when using a boolean query. Defaults to <tt>OR</tt>.
*/
@ -180,12 +268,29 @@ public class MultiMatchQueryBuilder extends BaseQueryBuilder implements Boostabl
return this;
}
public MultiMatchQueryBuilder useDisMax(Boolean useDisMax) {
/**
* @deprecated use a tieBreaker of 1.0f to disable "dis-max"
* query or select the appropriate {@link Type}
*/
@Deprecated
public MultiMatchQueryBuilder useDisMax(boolean useDisMax) {
this.useDisMax = useDisMax;
return this;
}
public MultiMatchQueryBuilder tieBreaker(Float tieBreaker) {
/**
* <p>Tie-Breaker for "best-match" disjunction queries (OR-Queries).
* The tie breaker capability allows documents that match more than one query clause
* (in this case on more than one field) to be scored better than documents that
* match only the best of the fields, without confusing this with the better case of
* two distinct matches in the multiple fields.</p>
*
* <p>A tie-breaker value of <tt>1.0</tt> is interpreted as a signal to score queries as
* "most-match" queries where all matching query clauses are considered for scoring.</p>
*
* @see Type
*/
public MultiMatchQueryBuilder tieBreaker(float tieBreaker) {
this.tieBreaker = tieBreaker;
return this;
}
@ -297,4 +402,5 @@ public class MultiMatchQueryBuilder extends BaseQueryBuilder implements Boostabl
builder.endObject();
}
}


@ -57,14 +57,15 @@ public class MultiMatchQueryParser implements QueryParser {
Object value = null;
float boost = 1.0f;
MatchQuery.Type type = MatchQuery.Type.BOOLEAN;
Float tieBreaker = null;
MultiMatchQueryBuilder.Type type = null;
MultiMatchQuery multiMatchQuery = new MultiMatchQuery(parseContext);
String minimumShouldMatch = null;
Map<String, Float> fieldNameWithBoosts = Maps.newHashMap();
String queryName = null;
XContentParser.Token token;
String currentFieldName = null;
Boolean useDisMax = null;
while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
if (token == XContentParser.Token.FIELD_NAME) {
currentFieldName = parser.currentName();
@ -83,16 +84,7 @@ public class MultiMatchQueryParser implements QueryParser {
if ("query".equals(currentFieldName)) {
value = parser.objectText();
} else if ("type".equals(currentFieldName)) {
String tStr = parser.text();
if ("boolean".equals(tStr)) {
type = MatchQuery.Type.BOOLEAN;
} else if ("phrase".equals(tStr)) {
type = MatchQuery.Type.PHRASE;
} else if ("phrase_prefix".equals(tStr) || "phrasePrefix".equals(currentFieldName)) {
type = MatchQuery.Type.PHRASE_PREFIX;
} else {
throw new QueryParsingException(parseContext.index(), "[" + NAME + "] query does not support type " + tStr);
}
type = MultiMatchQueryBuilder.Type.parse(parser.text(), parseContext.parseFlags());
} else if ("analyzer".equals(currentFieldName)) {
String analyzer = parser.text();
if (parseContext.analysisService().analyzer(analyzer) == null) {
@ -125,9 +117,9 @@ public class MultiMatchQueryParser implements QueryParser {
} else if ("fuzzy_rewrite".equals(currentFieldName) || "fuzzyRewrite".equals(currentFieldName)) {
multiMatchQuery.setFuzzyRewriteMethod(QueryParsers.parseRewriteMethod(parser.textOrNull(), null));
} else if ("use_dis_max".equals(currentFieldName) || "useDisMax".equals(currentFieldName)) {
multiMatchQuery.setUseDisMax(parser.booleanValue());
useDisMax = parser.booleanValue();
} else if ("tie_breaker".equals(currentFieldName) || "tieBreaker".equals(currentFieldName)) {
multiMatchQuery.setTieBreaker(parser.floatValue());
multiMatchQuery.setTieBreaker(tieBreaker = parser.floatValue());
} else if ("cutoff_frequency".equals(currentFieldName)) {
multiMatchQuery.setCommonTermsCutoff(parser.floatValue());
} else if ("lenient".equals(currentFieldName)) {
@ -156,7 +148,19 @@ public class MultiMatchQueryParser implements QueryParser {
if (fieldNameWithBoosts.isEmpty()) {
throw new QueryParsingException(parseContext.index(), "No fields specified for multi_match query");
}
if (type == null) {
type = MultiMatchQueryBuilder.Type.BEST_FIELDS;
}
if (useDisMax != null) { // backwards compatibility: use_dis_max is superseded by the type's tie_breaker
boolean typeUsesDismax = type.tieBreaker() != 1.0f;
if (typeUsesDismax != useDisMax) {
if (useDisMax && tieBreaker == null) {
multiMatchQuery.setTieBreaker(0.0f);
} else {
multiMatchQuery.setTieBreaker(1.0f);
}
}
}
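The backwards-compatibility branch above can be read as a small mapping from the legacy `use_dis_max` flag to an effective tie-breaker. A minimal standalone sketch (the helper name and the simplified handling of an explicit `tie_breaker` are assumptions for illustration, not parser code):

```java
// Sketch of the use_dis_max -> tie_breaker mapping above. A type whose
// default tieBreaker is 1.0f combines scores (bool-style); any other
// default means dis_max-style scoring.
public class UseDisMaxCompat {
    // typeTieBreaker: the default for the chosen type (e.g. 0.0f for best_fields)
    // useDisMax: the legacy flag, or null if not supplied
    // explicitTieBreaker: the tie_breaker parameter, or null if not supplied
    static float effectiveTieBreaker(float typeTieBreaker, Boolean useDisMax, Float explicitTieBreaker) {
        Float group = explicitTieBreaker;
        if (useDisMax != null) {
            boolean typeUsesDismax = typeTieBreaker != 1.0f;
            if (typeUsesDismax != useDisMax) {
                // the legacy flag contradicts the type default: force pure
                // dis_max (0.0f) or pure score summing (1.0f)
                group = (useDisMax && explicitTieBreaker == null) ? 0.0f : 1.0f;
            }
        }
        return group == null ? typeTieBreaker : group;
    }

    public static void main(String[] args) {
        // best_fields defaults to dis_max scoring
        System.out.println(effectiveTieBreaker(0.0f, null, null));  // 0.0
        // use_dis_max=false turns it into plain score summing
        System.out.println(effectiveTieBreaker(0.0f, false, null)); // 1.0
        // an explicit tie_breaker is used as-is when the flag agrees
        System.out.println(effectiveTieBreaker(0.0f, true, 0.3f));  // 0.3
    }
}
```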
Query query = multiMatchQuery.parse(type, fieldNameWithBoosts, value, minimumShouldMatch);
if (query == null) {
return null;

View File

@ -20,16 +20,11 @@
package org.elasticsearch.index.search;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.ExtendedCommonTermsQuery;
import org.apache.lucene.search.*;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.UnicodeUtil;
import org.apache.lucene.util.QueryBuilder;
import org.elasticsearch.ElasticsearchIllegalArgumentException;
import org.elasticsearch.ElasticsearchIllegalStateException;
import org.elasticsearch.common.Nullable;
@ -42,7 +37,6 @@ import org.elasticsearch.index.query.QueryParseContext;
import org.elasticsearch.index.query.support.QueryParsers;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import static org.elasticsearch.index.query.support.QueryParsers.wrapSmartNameQuery;
@ -145,6 +139,31 @@ public class MatchQuery {
this.zeroTermsQuery = zeroTermsQuery;
}
protected boolean forceAnalyzeQueryString() {
return false;
}
protected Analyzer getAnalyzer(FieldMapper mapper, MapperService.SmartNameFieldMappers smartNameFieldMappers) {
Analyzer analyzer = null;
if (this.analyzer == null) {
if (mapper != null) {
analyzer = mapper.searchAnalyzer();
}
if (analyzer == null && smartNameFieldMappers != null) {
analyzer = smartNameFieldMappers.searchAnalyzer();
}
if (analyzer == null) {
analyzer = parseContext.mapperService().searchAnalyzer();
}
} else {
analyzer = parseContext.mapperService().analysisService().analyzer(this.analyzer);
if (analyzer == null) {
throw new ElasticsearchIllegalArgumentException("No analyzer found for [" + this.analyzer + "]");
}
}
return analyzer;
}
public Query parse(Type type, String fieldName, Object value) throws IOException {
FieldMapper mapper = null;
final String field;
@ -156,7 +175,7 @@ public class MatchQuery {
field = fieldName;
}
if (mapper != null && mapper.useTermQueryWithQueryString()) {
if (mapper != null && mapper.useTermQueryWithQueryString() && !forceAnalyzeQueryString()) {
if (smartNameFieldMappers.explicitTypeInNameWithDocMapper()) {
String[] previousTypes = QueryParseContext.setTypesWithPrevious(new String[]{smartNameFieldMappers.docMapper().type()});
try {
@ -180,185 +199,102 @@ public class MatchQuery {
}
}
}
Analyzer analyzer = getAnalyzer(mapper, smartNameFieldMappers);
MatchQueryBuilder builder = new MatchQueryBuilder(analyzer, mapper);
builder.setEnablePositionIncrements(this.enablePositionIncrements);
Analyzer analyzer = null;
if (this.analyzer == null) {
if (mapper != null) {
analyzer = mapper.searchAnalyzer();
}
if (analyzer == null && smartNameFieldMappers != null) {
analyzer = smartNameFieldMappers.searchAnalyzer();
}
if (analyzer == null) {
analyzer = parseContext.mapperService().searchAnalyzer();
}
} else {
analyzer = parseContext.mapperService().analysisService().analyzer(this.analyzer);
if (analyzer == null) {
throw new ElasticsearchIllegalArgumentException("No analyzer found for [" + this.analyzer + "]");
}
}
// Logic similar to QueryParser#getFieldQuery
final TokenStream source = analyzer.tokenStream(field, value.toString());
source.reset();
int numTokens = 0;
int positionCount = 0;
boolean severalTokensAtSamePosition = false;
final CachingTokenFilter buffer = new CachingTokenFilter(source);
buffer.reset();
final CharTermAttribute termAtt = buffer.addAttribute(CharTermAttribute.class);
final PositionIncrementAttribute posIncrAtt = buffer.addAttribute(PositionIncrementAttribute.class);
boolean hasMoreTokens = buffer.incrementToken();
while (hasMoreTokens) {
numTokens++;
int positionIncrement = posIncrAtt.getPositionIncrement();
if (positionIncrement != 0) {
positionCount += positionIncrement;
} else {
severalTokensAtSamePosition = true;
}
hasMoreTokens = buffer.incrementToken();
}
// rewind the buffer stream
buffer.reset();
source.close();
if (numTokens == 0) {
return zeroTermsQuery();
} else if (type == Type.BOOLEAN) {
if (numTokens == 1) {
boolean hasNext = buffer.incrementToken();
assert hasNext == true;
final Query q = newTermQuery(mapper, new Term(field, termToByteRef(termAtt)));
return wrapSmartNameQuery(q, smartNameFieldMappers, parseContext);
}
if (commonTermsCutoff != null) {
ExtendedCommonTermsQuery q = new ExtendedCommonTermsQuery(occur, occur, commonTermsCutoff, positionCount == 1);
for (int i = 0; i < numTokens; i++) {
boolean hasNext = buffer.incrementToken();
assert hasNext == true;
q.add(new Term(field, termToByteRef(termAtt)));
}
return wrapSmartNameQuery(q, smartNameFieldMappers, parseContext);
} if (severalTokensAtSamePosition && occur == Occur.MUST) {
BooleanQuery q = new BooleanQuery(positionCount == 1);
Query currentQuery = null;
for (int i = 0; i < numTokens; i++) {
boolean hasNext = buffer.incrementToken();
assert hasNext == true;
if (posIncrAtt != null && posIncrAtt.getPositionIncrement() == 0) {
if (!(currentQuery instanceof BooleanQuery)) {
Query t = currentQuery;
currentQuery = new BooleanQuery(true);
((BooleanQuery)currentQuery).add(t, BooleanClause.Occur.SHOULD);
}
((BooleanQuery)currentQuery).add(newTermQuery(mapper, new Term(field, termToByteRef(termAtt))), BooleanClause.Occur.SHOULD);
} else {
if (currentQuery != null) {
q.add(currentQuery, occur);
}
currentQuery = newTermQuery(mapper, new Term(field, termToByteRef(termAtt)));
}
}
q.add(currentQuery, occur);
return wrapSmartNameQuery(q, smartNameFieldMappers, parseContext);
} else {
BooleanQuery q = new BooleanQuery(positionCount == 1);
for (int i = 0; i < numTokens; i++) {
boolean hasNext = buffer.incrementToken();
assert hasNext == true;
final Query currentQuery = newTermQuery(mapper, new Term(field, termToByteRef(termAtt)));
q.add(currentQuery, occur);
}
return wrapSmartNameQuery(q, smartNameFieldMappers, parseContext);
}
} else if (type == Type.PHRASE) {
if (severalTokensAtSamePosition) {
final MultiPhraseQuery mpq = new MultiPhraseQuery();
mpq.setSlop(phraseSlop);
final List<Term> multiTerms = new ArrayList<Term>();
int position = -1;
for (int i = 0; i < numTokens; i++) {
int positionIncrement = 1;
boolean hasNext = buffer.incrementToken();
assert hasNext == true;
positionIncrement = posIncrAtt.getPositionIncrement();
if (positionIncrement > 0 && multiTerms.size() > 0) {
if (enablePositionIncrements) {
mpq.add(multiTerms.toArray(new Term[multiTerms.size()]), position);
} else {
mpq.add(multiTerms.toArray(new Term[multiTerms.size()]));
}
multiTerms.clear();
}
position += positionIncrement;
//LUCENE 4 UPGRADE instead of string term we can convert directly from utf-16 to utf-8
multiTerms.add(new Term(field, termToByteRef(termAtt)));
}
if (enablePositionIncrements) {
mpq.add(multiTerms.toArray(new Term[multiTerms.size()]), position);
Query query = null;
switch (type) {
case BOOLEAN:
if (commonTermsCutoff == null) {
query = builder.createBooleanQuery(field, value.toString(), occur);
} else {
mpq.add(multiTerms.toArray(new Term[multiTerms.size()]));
query = builder.createCommonTermsQuery(field, value.toString(), occur, occur, commonTermsCutoff);
}
return wrapSmartNameQuery(mpq, smartNameFieldMappers, parseContext);
} else {
PhraseQuery pq = new PhraseQuery();
pq.setSlop(phraseSlop);
int position = -1;
for (int i = 0; i < numTokens; i++) {
int positionIncrement = 1;
boolean hasNext = buffer.incrementToken();
assert hasNext == true;
positionIncrement = posIncrAtt.getPositionIncrement();
if (enablePositionIncrements) {
position += positionIncrement;
//LUCENE 4 UPGRADE instead of string term we can convert directly from utf-16 to utf-8
pq.add(new Term(field, termToByteRef(termAtt)), position);
} else {
pq.add(new Term(field, termToByteRef(termAtt)));
}
}
return wrapSmartNameQuery(pq, smartNameFieldMappers, parseContext);
}
} else if (type == Type.PHRASE_PREFIX) {
MultiPhrasePrefixQuery mpq = new MultiPhrasePrefixQuery();
mpq.setSlop(phraseSlop);
mpq.setMaxExpansions(maxExpansions);
List<Term> multiTerms = new ArrayList<Term>();
int position = -1;
for (int i = 0; i < numTokens; i++) {
int positionIncrement = 1;
boolean hasNext = buffer.incrementToken();
assert hasNext == true;
positionIncrement = posIncrAtt.getPositionIncrement();
if (positionIncrement > 0 && multiTerms.size() > 0) {
if (enablePositionIncrements) {
mpq.add(multiTerms.toArray(new Term[multiTerms.size()]), position);
} else {
mpq.add(multiTerms.toArray(new Term[multiTerms.size()]));
}
multiTerms.clear();
}
position += positionIncrement;
multiTerms.add(new Term(field, termToByteRef(termAtt)));
}
if (enablePositionIncrements) {
mpq.add(multiTerms.toArray(new Term[multiTerms.size()]), position);
} else {
mpq.add(multiTerms.toArray(new Term[multiTerms.size()]));
}
return wrapSmartNameQuery(mpq, smartNameFieldMappers, parseContext);
break;
case PHRASE:
query = builder.createPhraseQuery(field, value.toString(), phraseSlop);
break;
case PHRASE_PREFIX:
query = builder.createPhrasePrefixQuery(field, value.toString(), phraseSlop, maxExpansions);
break;
default:
throw new ElasticsearchIllegalStateException("No type found for [" + type + "]");
}
throw new ElasticsearchIllegalStateException("No type found for [" + type + "]");
if (query == null) {
return zeroTermsQuery();
} else {
return wrapSmartNameQuery(query, smartNameFieldMappers, parseContext);
}
}
private Query newTermQuery(@Nullable FieldMapper mapper, Term term) {
protected Query zeroTermsQuery() {
return zeroTermsQuery == ZeroTermsQuery.NONE ? Queries.newMatchNoDocsQuery() : Queries.newMatchAllQuery();
}
private class MatchQueryBuilder extends QueryBuilder {
private final FieldMapper mapper;
/**
* Creates a new QueryBuilder using the given analyzer.
*/
public MatchQueryBuilder(Analyzer analyzer, @Nullable FieldMapper mapper) {
super(analyzer);
this.mapper = mapper;
}
@Override
protected Query newTermQuery(Term term) {
return blendTermQuery(term, mapper);
}
public Query createPhrasePrefixQuery(String field, String queryText, int phraseSlop, int maxExpansions) {
Query query = createFieldQuery(getAnalyzer(), Occur.MUST, field, queryText, true, phraseSlop);
if (query instanceof PhraseQuery) {
PhraseQuery pq = (PhraseQuery)query;
MultiPhrasePrefixQuery prefixQuery = new MultiPhrasePrefixQuery();
prefixQuery.setMaxExpansions(maxExpansions);
Term[] terms = pq.getTerms();
int[] positions = pq.getPositions();
for (int i = 0; i < terms.length; i++) {
prefixQuery.add(new Term[] {terms[i]}, positions[i]);
}
return prefixQuery;
} else if (query instanceof MultiPhraseQuery) {
MultiPhraseQuery pq = (MultiPhraseQuery)query;
MultiPhrasePrefixQuery prefixQuery = new MultiPhrasePrefixQuery();
prefixQuery.setMaxExpansions(maxExpansions);
List<Term[]> terms = pq.getTermArrays();
int[] positions = pq.getPositions();
for (int i = 0; i < terms.size(); i++) {
prefixQuery.add(terms.get(i), positions[i]);
}
return prefixQuery;
}
return query;
}
public Query createCommonTermsQuery(String field, String queryText, Occur highFreqOccur, Occur lowFreqOccur, float maxTermFrequency) {
Query booleanQuery = createBooleanQuery(field, queryText, Occur.SHOULD);
if (booleanQuery != null && booleanQuery instanceof BooleanQuery) {
BooleanQuery bq = (BooleanQuery) booleanQuery;
ExtendedCommonTermsQuery query = new ExtendedCommonTermsQuery(highFreqOccur, lowFreqOccur, maxTermFrequency, ((BooleanQuery)booleanQuery).isCoordDisabled());
for (BooleanClause clause : bq.clauses()) {
if (!(clause.getQuery() instanceof TermQuery)) {
return booleanQuery;
}
query.add(((TermQuery) clause.getQuery()).getTerm());
}
return query;
}
return booleanQuery;
}
}
protected Query blendTermQuery(Term term, FieldMapper mapper) {
if (fuzziness != null) {
if (mapper != null) {
Query query = mapper.fuzzyQuery(term.text(), fuzziness, fuzzyPrefixLength, maxExpansions, transpositions);
@ -379,14 +315,5 @@ public class MatchQuery {
}
return new TermQuery(term);
}
private static BytesRef termToByteRef(CharTermAttribute attr) {
final BytesRef ref = new BytesRef();
UnicodeUtil.UTF16toUTF8(attr.buffer(), 0, attr.length(), ref);
return ref;
}
protected Query zeroTermsQuery() {
return zeroTermsQuery == ZeroTermsQuery.NONE ? Queries.newMatchNoDocsQuery() : Queries.newMatchAllQuery();
}
}
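`createCommonTermsQuery` above only performs the rewrite when every clause of the analyzed boolean query is a plain term, and `ExtendedCommonTermsQuery` then splits those terms on the `cutoff_frequency` ratio. The split itself can be sketched in isolation (a hypothetical helper, not a Lucene API):

```java
import java.util.*;

// Sketch of the cutoff_frequency split: terms whose document-frequency
// ratio exceeds the cutoff are "high frequency" and are demoted to an
// optional clause group; the remaining low-frequency terms drive matching.
public class CutoffSplit {
    static Map<String, List<String>> splitByFrequency(Map<String, Integer> docFreq, int numDocs, float cutoff) {
        List<String> high = new ArrayList<>();
        List<String> low = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if ((float) e.getValue() / numDocs > cutoff) {
                high.add(e.getKey());
            } else {
                low.add(e.getKey());
            }
        }
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("high", high);
        out.put("low", low);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new LinkedHashMap<>();
        df.put("the", 950);      // stopword-like: almost every doc
        df.put("generator", 12); // rare, selective term
        Map<String, List<String>> split = splitByFrequency(df, 1000, 0.1f);
        System.out.println(split.get("high")); // [the]
        System.out.println(split.get("low"));  // [generator]
    }
}
```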

View File

@ -19,27 +19,34 @@
package org.elasticsearch.index.search;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.BlendedTermQuery;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.ElasticsearchIllegalStateException;
import org.elasticsearch.common.collect.Tuple;
import org.elasticsearch.common.lucene.search.Queries;
import org.elasticsearch.index.mapper.FieldMapper;
import org.elasticsearch.index.mapper.MapperService;
import org.elasticsearch.index.query.MultiMatchQueryBuilder;
import org.elasticsearch.index.query.QueryParseContext;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class MultiMatchQuery extends MatchQuery {
private boolean useDisMax = true;
private float tieBreaker;
public void setUseDisMax(boolean useDisMax) {
this.useDisMax = useDisMax;
}
private Float groupTieBreaker = null;
public void setTieBreaker(float tieBreaker) {
this.tieBreaker = tieBreaker;
this.groupTieBreaker = tieBreaker;
}
public MultiMatchQuery(QueryParseContext parseContext) {
@ -57,36 +64,205 @@ public class MultiMatchQuery extends MatchQuery {
return query;
}
public Query parse(Type type, Map<String, Float> fieldNames, Object value, String minimumShouldMatch) throws IOException {
public Query parse(MultiMatchQueryBuilder.Type type, Map<String, Float> fieldNames, Object value, String minimumShouldMatch) throws IOException {
if (fieldNames.size() == 1) {
Map.Entry<String, Float> fieldBoost = fieldNames.entrySet().iterator().next();
Float boostValue = fieldBoost.getValue();
return parseAndApply(type, fieldBoost.getKey(), value, minimumShouldMatch, boostValue);
return parseAndApply(type.matchQueryType(), fieldBoost.getKey(), value, minimumShouldMatch, boostValue);
}
if (useDisMax) {
DisjunctionMaxQuery disMaxQuery = new DisjunctionMaxQuery(tieBreaker);
boolean clauseAdded = false;
final float tieBreaker = groupTieBreaker == null ? type.tieBreaker() : groupTieBreaker;
switch (type) {
case PHRASE:
case PHRASE_PREFIX:
case BEST_FIELDS:
case MOST_FIELDS:
queryBuilder = new QueryBuilder(tieBreaker);
break;
case CROSS_FIELDS:
queryBuilder = new CrossFieldsQueryBuilder(tieBreaker);
break;
default:
throw new ElasticsearchIllegalStateException("No such type: " + type);
}
final List<? extends Query> queries = queryBuilder.buildGroupedQueries(type, fieldNames, value, minimumShouldMatch);
return queryBuilder.conbineGrouped(queries);
}
private QueryBuilder queryBuilder;
public class QueryBuilder {
protected final boolean groupDismax;
protected final float tieBreaker;
public QueryBuilder(float tieBreaker) {
this(tieBreaker != 1.0f, tieBreaker);
}
public QueryBuilder(boolean groupDismax, float tieBreaker) {
this.groupDismax = groupDismax;
this.tieBreaker = tieBreaker;
}
public List<Query> buildGroupedQueries(MultiMatchQueryBuilder.Type type, Map<String, Float> fieldNames, Object value, String minimumShouldMatch) throws IOException {
List<Query> queries = new ArrayList<Query>();
for (String fieldName : fieldNames.keySet()) {
Float boostValue = fieldNames.get(fieldName);
Query query = parseAndApply(type, fieldName, value, minimumShouldMatch, boostValue);
Query query = parseGroup(type.matchQueryType(), fieldName, boostValue, value, minimumShouldMatch);
if (query != null) {
clauseAdded = true;
queries.add(query);
}
}
return queries;
}
public Query parseGroup(Type type, String field, Float boostValue, Object value, String minimumShouldMatch) throws IOException {
return parseAndApply(type, field, value, minimumShouldMatch, boostValue);
}
public Query conbineGrouped(List<? extends Query> groupQuery) {
if (groupQuery == null || groupQuery.isEmpty()) {
return null;
}
if (groupQuery.size() == 1) {
return groupQuery.get(0);
}
if (groupDismax) {
DisjunctionMaxQuery disMaxQuery = new DisjunctionMaxQuery(tieBreaker);
for (Query query : groupQuery) {
disMaxQuery.add(query);
}
}
return clauseAdded ? disMaxQuery : null;
} else {
BooleanQuery booleanQuery = new BooleanQuery();
for (String fieldName : fieldNames.keySet()) {
Float boostValue = fieldNames.get(fieldName);
Query query = parseAndApply(type, fieldName, value, minimumShouldMatch, boostValue);
if (query != null) {
return disMaxQuery;
} else {
final BooleanQuery booleanQuery = new BooleanQuery();
for (Query query : groupQuery) {
booleanQuery.add(query, BooleanClause.Occur.SHOULD);
}
return booleanQuery;
}
return !booleanQuery.clauses().isEmpty() ? booleanQuery : null;
}
public Query blendTerm(Term term, FieldMapper mapper) {
return MultiMatchQuery.super.blendTermQuery(term, mapper);
}
public boolean forceAnalyzeQueryString() {
return false;
}
}
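`conbineGrouped` above either wraps the per-field queries in a `DisjunctionMaxQuery` or sums them via boolean `SHOULD` clauses. The resulting scoring difference can be sketched numerically (a simplified model that ignores coord and norms):

```java
// Simplified model of the two grouping modes: dis_max scores a document by
// its best field plus tieBreaker times the remaining fields, while the
// boolean grouping (tieBreaker == 1.0) simply sums all field scores.
public class GroupScore {
    static float combine(float[] fieldScores, float tieBreaker) {
        float max = Float.NEGATIVE_INFINITY;
        float sum = 0f;
        for (float s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        float[] scores = {2.0f, 1.0f, 0.5f};
        System.out.println(combine(scores, 0.0f)); // 2.0  -> pure dis_max
        System.out.println(combine(scores, 1.0f)); // 3.5  -> plain sum
        System.out.println(combine(scores, 0.5f)); // 2.75 -> in between
    }
}
```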
public class CrossFieldsQueryBuilder extends QueryBuilder {
private FieldAndMapper[] blendedFields;
public CrossFieldsQueryBuilder(float tieBreaker) {
super(false, tieBreaker);
}
public List<Query> buildGroupedQueries(MultiMatchQueryBuilder.Type type, Map<String, Float> fieldNames, Object value, String minimumShouldMatch) throws IOException {
Map<Analyzer, List<FieldAndMapper>> groups = new HashMap<Analyzer, List<FieldAndMapper>>();
List<Tuple<String, Float>> missing = new ArrayList<Tuple<String, Float>>();
for (Map.Entry<String, Float> entry : fieldNames.entrySet()) {
String name = entry.getKey();
MapperService.SmartNameFieldMappers smartNameFieldMappers = parseContext.smartFieldMappers(name);
if (smartNameFieldMappers != null && smartNameFieldMappers.hasMapper()) {
Analyzer actualAnalyzer = getAnalyzer(smartNameFieldMappers.mapper(), smartNameFieldMappers);
name = smartNameFieldMappers.mapper().names().indexName();
if (!groups.containsKey(actualAnalyzer)) {
groups.put(actualAnalyzer, new ArrayList<FieldAndMapper>());
}
Float boost = entry.getValue();
boost = boost == null ? Float.valueOf(1.0f) : boost;
groups.get(actualAnalyzer).add(new FieldAndMapper(name, smartNameFieldMappers.mapper(), boost));
} else {
missing.add(new Tuple<String, Float>(name, entry.getValue()));
}
}
List<Query> queries = new ArrayList<Query>();
for (Tuple<String, Float> tuple : missing) {
Query q = parseGroup(type.matchQueryType(), tuple.v1(), tuple.v2(), value, minimumShouldMatch);
if (q != null) {
queries.add(q);
}
}
for (List<FieldAndMapper> group : groups.values()) {
if (group.size() > 1) {
blendedFields = new FieldAndMapper[group.size()];
int i = 0;
for (FieldAndMapper fieldAndMapper : group) {
blendedFields[i++] = fieldAndMapper;
}
} else {
blendedFields = null;
}
final FieldAndMapper fieldAndMapper = group.get(0);
Query q = parseGroup(type.matchQueryType(), fieldAndMapper.field, fieldAndMapper.boost, value, minimumShouldMatch);
if (q != null) {
queries.add(q);
}
}
return queries.isEmpty() ? null : queries;
}
public boolean forceAnalyzeQueryString() {
return blendedFields != null;
}
public Query blendTerm(Term term, FieldMapper mapper) {
if (blendedFields == null) {
return super.blendTerm(term, mapper);
}
final Term[] terms = new Term[blendedFields.length];
float[] blendedBoost = new float[blendedFields.length];
for (int i = 0; i < blendedFields.length; i++) {
terms[i] = blendedFields[i].newTerm(term.text());
blendedBoost[i] = blendedFields[i].boost;
}
if (commonTermsCutoff != null) {
return BlendedTermQuery.commonTermsBlendedQuery(terms, blendedBoost, false, commonTermsCutoff);
}
if (tieBreaker == 1.0f) {
return BlendedTermQuery.booleanBlendedQuery(terms, blendedBoost, false);
}
return BlendedTermQuery.dismaxBlendedQuery(terms, blendedBoost, tieBreaker);
}
}
@Override
protected Query blendTermQuery(Term term, FieldMapper mapper) {
if (queryBuilder == null) {
return super.blendTermQuery(term, mapper);
}
return queryBuilder.blendTerm(term, mapper);
}
private static final class FieldAndMapper {
final String field;
final FieldMapper mapper;
final float boost;
private FieldAndMapper(String field, FieldMapper mapper, float boost) {
this.field = field;
this.mapper = mapper;
this.boost = boost;
}
public Term newTerm(String value) {
try {
final BytesRef bytesRef = mapper.indexedValueForSearch(value);
return new Term(field, bytesRef);
} catch (Exception ex) {
// we can't parse it just use the incoming value -- it will
// just have a DF of 0 at the end of the day and will be ignored
}
return new Term(field, value);
}
}
protected boolean forceAnalyzeQueryString() {
return this.queryBuilder.forceAnalyzeQueryString();
}
}
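`CrossFieldsQueryBuilder.buildGroupedQueries` above groups fields by their resolved search analyzer, so only fields that analyze text the same way are blended into one term group. The grouping step in isolation (analyzers reduced to plain names, a hypothetical simplification):

```java
import java.util.*;

// Sketch of cross_fields grouping: fields sharing an analyzer form one
// blended group; fields with no mapper (the "missing" list above) fall
// back to ordinary single-field queries.
public class AnalyzerGroups {
    static Map<String, List<String>> groupByAnalyzer(Map<String, String> fieldToAnalyzer) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fieldToAnalyzer.entrySet()) {
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("first_name", "standard");
        fields.put("last_name", "standard");
        fields.put("skill", "numeric");
        Map<String, List<String>> groups = groupByAnalyzer(fields);
        System.out.println(groups.get("standard")); // [first_name, last_name]
        System.out.println(groups.get("numeric"));  // [skill]
    }
}
```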

View File

@ -0,0 +1,204 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.lucene.queries;
import org.apache.lucene.analysis.MockAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.LuceneTestCase;
import org.apache.lucene.util._TestUtil;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
/**
*/
public class BlendedTermQueryTest extends LuceneTestCase {
public void testBooleanQuery() throws IOException {
Directory dir = newDirectory();
IndexWriter w = new IndexWriter(dir, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())));
String[] firstNames = new String[]{
"simon", "paul"
};
String[] surNames = new String[]{
"willnauer", "simon"
};
for (int i = 0; i < surNames.length; i++) {
Document d = new Document();
d.add(new TextField("id", Integer.toString(i), Field.Store.YES));
d.add(new TextField("firstname", firstNames[i], Field.Store.NO));
d.add(new TextField("surname", surNames[i], Field.Store.NO));
w.addDocument(d);
}
int iters = atLeast(25);
for (int j = 0; j < iters; j++) {
Document d = new Document();
d.add(new TextField("id", Integer.toString(firstNames.length + j), Field.Store.YES));
d.add(new TextField("firstname", rarely() ? "some_other_name" : "simon", Field.Store.NO));
d.add(new TextField("surname", "bogus", Field.Store.NO));
w.addDocument(d);
}
w.commit();
DirectoryReader reader = DirectoryReader.open(w, true);
IndexSearcher searcher = setSimilarity(newSearcher(reader));
{
Term[] terms = new Term[]{new Term("firstname", "simon"), new Term("surname", "simon")};
BlendedTermQuery query = BlendedTermQuery.booleanBlendedQuery(terms, true);
TopDocs search = searcher.search(query, 3);
ScoreDoc[] scoreDocs = search.scoreDocs;
assertEquals(3, scoreDocs.length);
assertEquals(Integer.toString(0), reader.document(scoreDocs[0].doc).getField("id").stringValue());
}
{
BooleanQuery query = new BooleanQuery(false);
query.add(new TermQuery(new Term("firstname", "simon")), BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term("surname", "simon")), BooleanClause.Occur.SHOULD);
TopDocs search = searcher.search(query, 1);
ScoreDoc[] scoreDocs = search.scoreDocs;
assertEquals(Integer.toString(1), reader.document(scoreDocs[0].doc).getField("id").stringValue());
}
reader.close();
w.close();
dir.close();
}
public void testDismaxQuery() throws IOException {
Directory dir = newDirectory();
IndexWriter w = new IndexWriter(dir, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())));
String[] username = new String[]{
"foo fighters", "some cool fan", "cover band"};
String[] song = new String[]{
"generator", "foo fighers - generator", "foo fighters generator"
};
FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
ft.setIndexOptions(random().nextBoolean() ? FieldInfo.IndexOptions.DOCS_ONLY : FieldInfo.IndexOptions.DOCS_AND_FREQS);
ft.setOmitNorms(random().nextBoolean());
ft.freeze();
FieldType ft1 = new FieldType(TextField.TYPE_NOT_STORED);
ft1.setIndexOptions(random().nextBoolean() ? FieldInfo.IndexOptions.DOCS_ONLY : FieldInfo.IndexOptions.DOCS_AND_FREQS);
ft1.setOmitNorms(random().nextBoolean());
ft1.freeze();
for (int i = 0; i < username.length; i++) {
Document d = new Document();
d.add(new TextField("id", Integer.toString(i), Field.Store.YES));
d.add(new Field("username", username[i], ft));
d.add(new Field("song", song[i], ft));
w.addDocument(d);
}
int iters = atLeast(25);
for (int j = 0; j < iters; j++) {
Document d = new Document();
d.add(new TextField("id", Integer.toString(username.length + j), Field.Store.YES));
d.add(new Field("username", "foo fighters", ft1));
d.add(new Field("song", "some bogus text to bump up IDF", ft1));
w.addDocument(d);
}
w.commit();
DirectoryReader reader = DirectoryReader.open(w, true);
IndexSearcher searcher = setSimilarity(newSearcher(reader));
{
String[] fields = new String[]{"username", "song"};
BooleanQuery query = new BooleanQuery(false);
query.add(BlendedTermQuery.dismaxBlendedQuery(toTerms(fields, "foo"), 0.1f), BooleanClause.Occur.SHOULD);
query.add(BlendedTermQuery.dismaxBlendedQuery(toTerms(fields, "fighters"), 0.1f), BooleanClause.Occur.SHOULD);
query.add(BlendedTermQuery.dismaxBlendedQuery(toTerms(fields, "generator"), 0.1f), BooleanClause.Occur.SHOULD);
TopDocs search = searcher.search(query, 10);
ScoreDoc[] scoreDocs = search.scoreDocs;
assertEquals(Integer.toString(0), reader.document(scoreDocs[0].doc).getField("id").stringValue());
}
{
BooleanQuery query = new BooleanQuery(false);
DisjunctionMaxQuery uname = new DisjunctionMaxQuery(0.0f);
uname.add(new TermQuery(new Term("username", "foo")));
uname.add(new TermQuery(new Term("song", "foo")));
DisjunctionMaxQuery s = new DisjunctionMaxQuery(0.0f);
s.add(new TermQuery(new Term("username", "fighers")));
s.add(new TermQuery(new Term("song", "fighers")));
DisjunctionMaxQuery gen = new DisjunctionMaxQuery(0f);
gen.add(new TermQuery(new Term("username", "generator")));
gen.add(new TermQuery(new Term("song", "generator")));
query.add(uname, BooleanClause.Occur.SHOULD);
query.add(s, BooleanClause.Occur.SHOULD);
query.add(gen, BooleanClause.Occur.SHOULD);
TopDocs search = searcher.search(query, 4);
ScoreDoc[] scoreDocs = search.scoreDocs;
assertEquals(Integer.toString(1), reader.document(scoreDocs[0].doc).getField("id").stringValue());
}
reader.close();
w.close();
dir.close();
}
public void testBasics() {
final int iters = atLeast(5);
for (int j = 0; j < iters; j++) {
String[] fields = new String[1 + random().nextInt(10)];
for (int i = 0; i < fields.length; i++) {
fields[i] = _TestUtil.randomRealisticUnicodeString(random(), 1, 10);
}
String term = _TestUtil.randomRealisticUnicodeString(random(), 1, 10);
Term[] terms = toTerms(fields, term);
boolean disableCoord = random().nextBoolean();
boolean useBoolean = random().nextBoolean();
float tieBreaker = random().nextFloat();
BlendedTermQuery query = useBoolean ? BlendedTermQuery.booleanBlendedQuery(terms, disableCoord) : BlendedTermQuery.dismaxBlendedQuery(terms, tieBreaker);
QueryUtils.check(query);
terms = toTerms(fields, term);
BlendedTermQuery query2 = useBoolean ? BlendedTermQuery.booleanBlendedQuery(terms, disableCoord) : BlendedTermQuery.dismaxBlendedQuery(terms, tieBreaker);
assertEquals(query, query2);
}
}
public Term[] toTerms(String[] fields, String term) {
Term[] terms = new Term[fields.length];
List<String> fieldsList = Arrays.asList(fields);
Collections.shuffle(fieldsList, random());
fields = fieldsList.toArray(new String[0]);
for (int i = 0; i < fields.length; i++) {
terms[i] = new Term(fields[i], term);
}
return terms;
}
public IndexSearcher setSimilarity(IndexSearcher searcher) {
Similarity similarity = random().nextBoolean() ? new BM25Similarity() : new DefaultSimilarity();
searcher.setSimilarity(similarity);
return searcher;
}
}
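The tests above depend on `BlendedTermQuery` leveling document frequencies across the grouped fields, so a term that is rare in one field but common in another is not unfairly boosted by its rarity. A simplified sketch of that df adjustment (the real implementation also adjusts total term frequency and caps at maxDoc):

```java
import java.util.*;

// Simplified sketch of df blending: every (field, term) in the group is
// rewritten to use the maximum document frequency observed across the
// group, so no single field dominates purely through a rarer term.
public class DfBlend {
    static Map<String, Integer> blend(Map<String, Integer> perFieldDf) {
        int max = Collections.max(perFieldDf.values());
        Map<String, Integer> out = new LinkedHashMap<>();
        for (String field : perFieldDf.keySet()) {
            out.put(field, max);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new LinkedHashMap<>();
        df.put("firstname", 27); // "simon" appears in many first names
        df.put("surname", 1);    // ...but only once as a surname
        System.out.println(blend(df)); // {firstname=27, surname=27}
    }
}
```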

View File

@ -18,7 +18,6 @@
*/
package org.elasticsearch.search.query;
import com.carrotsearch.randomizedtesting.annotations.Repeat;
import com.carrotsearch.randomizedtesting.generators.RandomPicks;
import com.google.common.collect.Sets;
import org.elasticsearch.action.admin.indices.create.CreateIndexRequestBuilder;
@ -27,6 +26,7 @@ import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.index.query.MatchQueryBuilder;
import org.elasticsearch.index.query.MultiMatchQueryBuilder;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;
import org.elasticsearch.test.ElasticsearchIntegrationTest;
@ -34,6 +34,7 @@ import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
@ -68,29 +69,35 @@ public class MultiMatchQueryTests extends ElasticsearchIntegrationTest {
"full_name", "Captain America",
"first_name", "Captain",
"last_name", "America",
"category", "marvel hero"));
"category", "marvel hero",
"skill", 15,
"int-field", 25));
builders.add(client().prepareIndex("test", "test", "theother").setSource(
"full_name", "marvel hero",
"first_name", "marvel",
"last_name", "hero",
"category", "bogus"));
"category", "bogus",
"skill", 5));
builders.add(client().prepareIndex("test", "test", "ultimate1").setSource(
"full_name", "Alpha the Ultimate Mutant",
"first_name", "Alpha the",
"last_name", "Ultimate Mutant",
"category", "marvel hero"));
"category", "marvel hero",
"skill", 1));
builders.add(client().prepareIndex("test", "test", "ultimate2").setSource(
"full_name", "Man the Ultimate Ninja",
"first_name", "Man the Ultimate",
"last_name", "Ninja",
"category", "marvel hero"));
"category", "marvel hero",
"skill", 3));
builders.add(client().prepareIndex("test", "test", "anotherhero").setSource(
"full_name", "ultimate",
"first_name", "wolferine",
"last_name", "",
"category", "marvel hero"));
"category", "marvel hero",
"skill", 1));
List<String> firstNames = new ArrayList<String>();
fill(firstNames, "Captain", between(15, 25));
fill(firstNames, "Ultimate", between(5, 10));
@@ -105,51 +112,12 @@ public class MultiMatchQueryTests extends ElasticsearchIntegrationTest {
"full_name", first + " " + last,
"first_name", first,
"last_name", last,
"category", randomBoolean() ? "marvel hero" : "bogus"));
"category", randomBoolean() ? "marvel hero" : "bogus",
"skill", between(1, 3)));
}
indexRandom(true, builders);
}
@Test
public void testDefaults() throws ExecutionException, InterruptedException {
MatchQueryBuilder.Type type = randomBoolean() ? null : MatchQueryBuilder.Type.BOOLEAN;
SearchResponse searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR)).get();
Set<String> topNIds = Sets.newHashSet("theone", "theother");
for (int i = 0; i < searchResponse.getHits().hits().length; i++) {
topNIds.remove(searchResponse.getHits().getAt(i).getId());
// it's very likely that we hit a random doc with the same score, so the order is random since
// the doc id is the tie-breaker
}
assertThat(topNIds, empty());
assertThat(searchResponse.getHits().hits()[0].getScore(), equalTo(searchResponse.getHits().hits()[1].getScore()));
searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).useDisMax(false).type(type)).get();
assertFirstHit(searchResponse, anyOf(hasId("theone"), hasId("theother")));
assertThat(searchResponse.getHits().hits()[0].getScore(), greaterThan(searchResponse.getHits().hits()[1].getScore()));
searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("marvel hero", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).type(type)).get();
assertFirstHit(searchResponse, hasId("theother"));
searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.AND).type(type)).get();
assertHitCount(searchResponse, 1l);
assertFirstHit(searchResponse, hasId("theone"));
searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.AND).type(type)).get();
assertHitCount(searchResponse, 1l);
assertFirstHit(searchResponse, hasId("theone"));
}
private XContentBuilder createMapping() throws IOException {
return XContentFactory.jsonBuilder().startObject().startObject("test")
.startObject("properties")
@@ -179,21 +147,62 @@ public class MultiMatchQueryTests extends ElasticsearchIntegrationTest {
.endObject().endObject();
}
@Test
public void testDefaults() throws ExecutionException, InterruptedException {
MatchQueryBuilder.Type type = randomBoolean() ? null : MatchQueryBuilder.Type.BOOLEAN;
SearchResponse searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR))).get();
Set<String> topNIds = Sets.newHashSet("theone", "theother");
for (int i = 0; i < searchResponse.getHits().hits().length; i++) {
topNIds.remove(searchResponse.getHits().getAt(i).getId());
// it's very likely that we hit a random doc with the same score, so the order is random since
// the doc id is the tie-breaker
}
assertThat(topNIds, empty());
assertThat(searchResponse.getHits().hits()[0].getScore(), equalTo(searchResponse.getHits().hits()[1].getScore()));
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).useDisMax(false).type(type))).get();
assertFirstHit(searchResponse, anyOf(hasId("theone"), hasId("theother")));
assertThat(searchResponse.getHits().hits()[0].getScore(), greaterThan(searchResponse.getHits().hits()[1].getScore()));
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("marvel hero", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).type(type))).get();
assertFirstHit(searchResponse, hasId("theother"));
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.AND).type(type))).get();
assertHitCount(searchResponse, 1l);
assertFirstHit(searchResponse, hasId("theone"));
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.AND).type(type))).get();
assertHitCount(searchResponse, 1l);
assertFirstHit(searchResponse, hasId("theone"));
}
@Test
public void testPhraseType() {
SearchResponse searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("Man the Ultimate", "full_name_phrase", "first_name_phrase", "last_name_phrase", "category_phrase")
.operator(MatchQueryBuilder.Operator.OR).type(MatchQueryBuilder.Type.PHRASE)).get();
.setQuery(randomizeType(multiMatchQuery("Man the Ultimate", "full_name_phrase", "first_name_phrase", "last_name_phrase", "category_phrase")
.operator(MatchQueryBuilder.Operator.OR).type(MatchQueryBuilder.Type.PHRASE))).get();
assertFirstHit(searchResponse, hasId("ultimate2"));
assertHitCount(searchResponse, 1l);
searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("Captain", "full_name_phrase", "first_name_phrase", "last_name_phrase", "category_phrase")
.operator(MatchQueryBuilder.Operator.OR).type(MatchQueryBuilder.Type.PHRASE)).get();
.setQuery(randomizeType(multiMatchQuery("Captain", "full_name_phrase", "first_name_phrase", "last_name_phrase", "category_phrase")
.operator(MatchQueryBuilder.Operator.OR).type(MatchQueryBuilder.Type.PHRASE))).get();
assertThat(searchResponse.getHits().getTotalHits(), greaterThan(1l));
searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("the Ul", "full_name_phrase", "first_name_phrase", "last_name_phrase", "category_phrase")
.operator(MatchQueryBuilder.Operator.OR).type(MatchQueryBuilder.Type.PHRASE_PREFIX)).get();
.setQuery(randomizeType(multiMatchQuery("the Ul", "full_name_phrase", "first_name_phrase", "last_name_phrase", "category_phrase")
.operator(MatchQueryBuilder.Operator.OR).type(MatchQueryBuilder.Type.PHRASE_PREFIX))).get();
assertFirstHit(searchResponse, hasId("ultimate2"));
assertSecondHit(searchResponse, hasId("ultimate1"));
assertHitCount(searchResponse, 2l);
@@ -206,8 +215,8 @@ public class MultiMatchQueryTests extends ElasticsearchIntegrationTest {
MatchQueryBuilder.Type type = randomBoolean() ? null : MatchQueryBuilder.Type.BOOLEAN;
Float cutoffFrequency = randomBoolean() ? Math.min(1, numDocs * 1.f / between(10, 20)) : 1.f / between(10, 20);
SearchResponse searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).cutoffFrequency(cutoffFrequency)).get();
.setQuery(randomizeType(multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).cutoffFrequency(cutoffFrequency))).get();
Set<String> topNIds = Sets.newHashSet("theone", "theother");
for (int i = 0; i < searchResponse.getHits().hits().length; i++) {
topNIds.remove(searchResponse.getHits().getAt(i).getId());
@@ -219,39 +228,48 @@ public class MultiMatchQueryTests extends ElasticsearchIntegrationTest {
cutoffFrequency = randomBoolean() ? Math.min(1, numDocs * 1.f / between(10, 20)) : 1.f / between(10, 20);
searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).useDisMax(false).cutoffFrequency(cutoffFrequency).type(type)).get();
.setQuery(randomizeType(multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).useDisMax(false).cutoffFrequency(cutoffFrequency).type(type))).get();
assertFirstHit(searchResponse, anyOf(hasId("theone"), hasId("theother")));
assertThat(searchResponse.getHits().hits()[0].getScore(), greaterThan(searchResponse.getHits().hits()[1].getScore()));
long size = searchResponse.getHits().getTotalHits();
searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).useDisMax(false).type(type)).get();
.setQuery(randomizeType(multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).useDisMax(false).type(type))).get();
assertFirstHit(searchResponse, anyOf(hasId("theone"), hasId("theother")));
assertThat("common terms expected to be a way smaller result set", size, lessThan(searchResponse.getHits().getTotalHits()));
cutoffFrequency = randomBoolean() ? Math.min(1, numDocs * 1.f / between(10, 20)) : 1.f / between(10, 20);
searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("marvel hero", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).cutoffFrequency(cutoffFrequency).type(type)).get();
.setQuery(randomizeType(multiMatchQuery("marvel hero", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).cutoffFrequency(cutoffFrequency).type(type))).get();
assertFirstHit(searchResponse, hasId("theother"));
searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.AND).cutoffFrequency(cutoffFrequency).type(type)).get();
.setQuery(randomizeType(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.AND).cutoffFrequency(cutoffFrequency).type(type))).get();
assertHitCount(searchResponse, 1l);
assertFirstHit(searchResponse, hasId("theone"));
searchResponse = client().prepareSearch("test")
.setQuery(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.AND).cutoffFrequency(cutoffFrequency).type(type)).get();
.setQuery(randomizeType(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.AND).cutoffFrequency(cutoffFrequency).type(type))).get();
assertHitCount(searchResponse, 1l);
assertFirstHit(searchResponse, hasId("theone"));
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("marvel hero", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.AND).cutoffFrequency(cutoffFrequency)
.analyzer("category")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS))).get();
assertHitCount(searchResponse, 1l);
assertFirstHit(searchResponse, hasId("theother"));
}
@Test
public void testEquivalence() {
final int numDocs = (int) client().prepareCount("test")
@@ -260,9 +278,11 @@ public class MultiMatchQueryTests extends ElasticsearchIntegrationTest {
for (int i = 0; i < numIters; i++) {
{
MatchQueryBuilder.Type type = randomBoolean() ? null : MatchQueryBuilder.Type.BOOLEAN;
MultiMatchQueryBuilder multiMatchQueryBuilder = randomBoolean() ? multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category") :
multiMatchQuery("marvel hero captain america", "*_name", randomBoolean() ? "category" : "categ*");
SearchResponse left = client().prepareSearch("test").setSize(numDocs)
.setQuery(multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category")
.operator(MatchQueryBuilder.Operator.OR).type(type)).get();
.setQuery(randomizeType(multiMatchQueryBuilder
.operator(MatchQueryBuilder.Operator.OR).type(type))).get();
SearchResponse right = client().prepareSearch("test").setSize(numDocs)
.setQuery(disMaxQuery().
@@ -278,9 +298,11 @@ public class MultiMatchQueryTests extends ElasticsearchIntegrationTest {
MatchQueryBuilder.Type type = randomBoolean() ? null : MatchQueryBuilder.Type.BOOLEAN;
String minShouldMatch = randomBoolean() ? null : "" + between(0, 1);
MatchQueryBuilder.Operator op = randomBoolean() ? MatchQueryBuilder.Operator.AND : MatchQueryBuilder.Operator.OR;
MultiMatchQueryBuilder multiMatchQueryBuilder = randomBoolean() ? multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category") :
multiMatchQuery("captain america", "*_name", randomBoolean() ? "category" : "categ*");
SearchResponse left = client().prepareSearch("test").setSize(numDocs)
.setQuery(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.operator(op).useDisMax(false).minimumShouldMatch(minShouldMatch).type(type)).get();
.setQuery(randomizeType(multiMatchQueryBuilder
.operator(op).useDisMax(false).minimumShouldMatch(minShouldMatch).type(type))).get();
SearchResponse right = client().prepareSearch("test").setSize(numDocs)
.setQuery(boolQuery().minimumShouldMatch(minShouldMatch)
@@ -296,8 +318,8 @@ public class MultiMatchQueryTests extends ElasticsearchIntegrationTest {
String minShouldMatch = randomBoolean() ? null : "" + between(0, 1);
MatchQueryBuilder.Operator op = randomBoolean() ? MatchQueryBuilder.Operator.AND : MatchQueryBuilder.Operator.OR;
SearchResponse left = client().prepareSearch("test").setSize(numDocs)
.setQuery(multiMatchQuery("capta", "full_name", "first_name", "last_name", "category")
.type(MatchQueryBuilder.Type.PHRASE_PREFIX).useDisMax(false).minimumShouldMatch(minShouldMatch)).get();
.setQuery(randomizeType(multiMatchQuery("capta", "full_name", "first_name", "last_name", "category")
.type(MatchQueryBuilder.Type.PHRASE_PREFIX).useDisMax(false).minimumShouldMatch(minShouldMatch))).get();
SearchResponse right = client().prepareSearch("test").setSize(numDocs)
.setQuery(boolQuery().minimumShouldMatch(minShouldMatch)
@@ -311,10 +333,16 @@ public class MultiMatchQueryTests extends ElasticsearchIntegrationTest {
{
String minShouldMatch = randomBoolean() ? null : "" + between(0, 1);
MatchQueryBuilder.Operator op = randomBoolean() ? MatchQueryBuilder.Operator.AND : MatchQueryBuilder.Operator.OR;
SearchResponse left = client().prepareSearch("test").setSize(numDocs)
.setQuery(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.type(MatchQueryBuilder.Type.PHRASE).useDisMax(false).minimumShouldMatch(minShouldMatch)).get();
SearchResponse left;
if (randomBoolean()) {
left = client().prepareSearch("test").setSize(numDocs)
.setQuery(randomizeType(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.type(MatchQueryBuilder.Type.PHRASE).useDisMax(false).minimumShouldMatch(minShouldMatch))).get();
} else {
left = client().prepareSearch("test").setSize(numDocs)
.setQuery(randomizeType(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.type(MatchQueryBuilder.Type.PHRASE).tieBreaker(1.0f).minimumShouldMatch(minShouldMatch))).get();
}
SearchResponse right = client().prepareSearch("test").setSize(numDocs)
.setQuery(boolQuery().minimumShouldMatch(minShouldMatch)
.should(matchPhraseQuery("full_name", "captain america"))
@@ -327,6 +355,114 @@ public class MultiMatchQueryTests extends ElasticsearchIntegrationTest {
}
}
@Test
public void testCrossFieldMode() throws ExecutionException, InterruptedException {
SearchResponse searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("captain america", "full_name", "first_name", "last_name")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
.operator(MatchQueryBuilder.Operator.OR))).get();
assertFirstHit(searchResponse, hasId("theone"));
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("marvel hero captain america", "full_name", "first_name", "last_name", "category")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
.operator(MatchQueryBuilder.Operator.OR))).get();
assertFirstHit(searchResponse, hasId("theone"));
assertSecondHit(searchResponse, hasId("theother"));
assertThat(searchResponse.getHits().hits()[0].getScore(), greaterThan(searchResponse.getHits().hits()[1].getScore()));
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("marvel hero", "full_name", "first_name", "last_name", "category")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
.operator(MatchQueryBuilder.Operator.OR))).get();
assertFirstHit(searchResponse, hasId("theother"));
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("captain america", "full_name", "first_name", "last_name", "category")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
.operator(MatchQueryBuilder.Operator.AND))).get();
assertHitCount(searchResponse, 1l);
assertFirstHit(searchResponse, hasId("theone"));
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("captain america 15", "full_name", "first_name", "last_name", "category", "skill")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
.analyzer("category")
.operator(MatchQueryBuilder.Operator.AND))).get();
assertHitCount(searchResponse, 1l);
assertFirstHit(searchResponse, hasId("theone"));
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("captain america 15", "first_name", "last_name", "skill")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
.analyzer("category"))).get();
assertFirstHit(searchResponse, hasId("theone"));
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("25 15", "int-field", "skill")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
.analyzer("category"))).get();
assertFirstHit(searchResponse, hasId("theone"));
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("captain america marvel hero", "first_name", "last_name", "category")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
.cutoffFrequency(0.1f)
.analyzer("category")
.operator(MatchQueryBuilder.Operator.OR))).get();
assertFirstHit(searchResponse, anyOf(hasId("theother"), hasId("theone")));
long numResults = searchResponse.getHits().totalHits();
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("captain america marvel hero", "first_name", "last_name", "category")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
.analyzer("category")
.operator(MatchQueryBuilder.Operator.OR))).get();
assertThat(numResults, lessThan(searchResponse.getHits().getTotalHits()));
assertFirstHit(searchResponse, hasId("theone"));
// test group based on analyzer -- all fields are grouped into a cross field search
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("captain america marvel hero", "first_name", "last_name", "category")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
.analyzer("category")
.operator(MatchQueryBuilder.Operator.AND))).get();
assertHitCount(searchResponse, 1l);
assertFirstHit(searchResponse, hasId("theone"));
// counter example
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("captain america marvel hero", "first_name", "last_name", "category")
.type(randomBoolean() ? MultiMatchQueryBuilder.Type.CROSS_FIELDS : null)
.operator(MatchQueryBuilder.Operator.AND))).get();
assertHitCount(searchResponse, 0l);
// counter example
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("captain america marvel hero", "first_name", "last_name", "category")
.type(randomBoolean() ? MultiMatchQueryBuilder.Type.CROSS_FIELDS : null)
.operator(MatchQueryBuilder.Operator.AND))).get();
assertHitCount(searchResponse, 0l);
// test if boosts work
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("the ultimate", "full_name", "first_name", "last_name^2", "category")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
.operator(MatchQueryBuilder.Operator.AND))).get();
assertFirstHit(searchResponse, hasId("ultimate1")); // has ultimate in the last_name and that is boosted
assertSecondHit(searchResponse, hasId("ultimate2"));
assertThat(searchResponse.getHits().hits()[0].getScore(), greaterThan(searchResponse.getHits().hits()[1].getScore()));
// since we try to treat the matching fields as one field, the scores are very similar, but there is a small bias towards the
// more frequent field, which acts as a tie-breaker internally
searchResponse = client().prepareSearch("test")
.setQuery(randomizeType(multiMatchQuery("the ultimate", "full_name", "first_name", "last_name", "category")
.type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
.operator(MatchQueryBuilder.Operator.AND))).get();
assertFirstHit(searchResponse, hasId("ultimate2"));
assertSecondHit(searchResponse, hasId("ultimate1"));
assertThat(searchResponse.getHits().hits()[0].getScore(), greaterThan(searchResponse.getHits().hits()[1].getScore()));
}
private static final void assertEquivalent(String query, SearchResponse left, SearchResponse right) {
assertNoFailures(left);
@@ -338,7 +474,7 @@ public class MultiMatchQueryTests extends ElasticsearchIntegrationTest {
SearchHit[] hits = leftHits.getHits();
SearchHit[] rHits = rightHits.getHits();
for (int i = 0; i < hits.length; i++) {
assertThat("query: " + query + " hit: " + i, (double)hits[i].getScore(), closeTo(rHits[i].getScore(), 0.00001d));
assertThat("query: " + query + " hit: " + i, (double) hits[i].getScore(), closeTo(rHits[i].getScore(), 0.00001d));
}
for (int i = 0; i < hits.length; i++) {
if (hits[i].getScore() == hits[hits.length - 1].getScore()) {
@@ -372,4 +508,51 @@ public class MultiMatchQueryTests extends ElasticsearchIntegrationTest {
return t;
}
}
public MultiMatchQueryBuilder randomizeType(MultiMatchQueryBuilder builder) {
try {
Field field = MultiMatchQueryBuilder.class.getDeclaredField("type");
field.setAccessible(true);
MultiMatchQueryBuilder.Type type = (MultiMatchQueryBuilder.Type) field.get(builder);
if (type == null && randomBoolean()) {
return builder;
}
if (type == null) {
type = MultiMatchQueryBuilder.Type.BEST_FIELDS;
}
if (randomBoolean()) {
builder.type(type);
} else {
Object oType = type;
switch (type) {
case BEST_FIELDS:
if (randomBoolean()) {
oType = MatchQueryBuilder.Type.BOOLEAN;
}
break;
case MOST_FIELDS:
if (randomBoolean()) {
oType = MatchQueryBuilder.Type.BOOLEAN;
}
break;
case CROSS_FIELDS:
break;
case PHRASE:
if (randomBoolean()) {
oType = MatchQueryBuilder.Type.PHRASE;
}
break;
case PHRASE_PREFIX:
if (randomBoolean()) {
oType = MatchQueryBuilder.Type.PHRASE_PREFIX;
}
break;
}
builder.type(oType);
}
return builder;
} catch (Exception ex) {
throw new RuntimeException(ex);
}
}
}