# Phrase Suggester

The `term` suggester provides a very convenient API to access word alternatives on a per-token
basis within a certain string distance. The API allows accessing each token in the stream
individually, while suggestion selection is left to the API consumer. Yet, often already ranked
and selected suggestions are required in order to present them to the end user.
Inside ElasticSearch we can access far more statistics and information quickly
to make a better decision about which token alternative to pick, or whether to pick an alternative at all.

This `phrase` suggester adds some logic on top of the `term` suggester to select entire
corrected phrases instead of individual tokens, weighted based on *ngram language models*. In practice it
will be able to make better decisions about which tokens to pick based on co-occurrence and frequencies.
The current implementation is kept quite general and leaves room for future improvements.
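The idea can be illustrated with a toy bigram model (a sketch over made-up counts, not ElasticSearch's actual implementation): phrases whose token pairs actually co-occur in the index score higher than phrases containing unseen tokens.

```python
from collections import Counter

def bigram_score(phrase, unigrams, bigrams, alpha=0.5):
    """Score a phrase by Laplace-smoothed bigram probabilities.
    Toy sketch only; corpus, constants, and names are illustrative."""
    tokens = phrase.split()
    vocab = len(unigrams)
    score = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        score *= (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab)
    return score

# A tiny made-up corpus: "xor" and "got" never occur in it, so the fully
# corrected phrase wins on co-occurrence statistics alone.
corpus = "xorr the god jewel is the god jewel of xorr".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

candidates = ["xorr the god jewel", "xor the got jewel"]
best = max(candidates, key=lambda p: bigram_score(p, unigrams, bigrams))
```

Smoothing (the `alpha` constant) keeps unseen bigrams from zeroing out a whole phrase, so partially corrected phrases can still be compared.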

# API Example

The `phrase` request is defined alongside the query part in the JSON request:

```json
curl -s -XPOST 'localhost:9200/_search' -d '{
  "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 1,
        "real_word_error_likelihood" : 0.95,
        "max_errors" : 0.5,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        } ]
      }
    }
  }
}'
```

The response contains suggestions sorted by the most likely spell correction first. In this case we got the expected correction
`xorr the god jewel` first, while the second correction is less conservative, correcting only one of the errors. Note that the request
is executed with `max_errors` set to `0.5`, so 50% of the terms can contain misspellings (see the parameter descriptions below).

```json
{
  "took" : 37,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2938,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "simple_phrase" : [ {
      "text" : "Xor the Got-Jewel",
      "offset" : 0,
      "length" : 17,
      "options" : [ {
        "text" : "xorr the god jewel",
        "score" : 0.17877324
      }, {
        "text" : "xor the god jewel",
        "score" : 0.14231323
      } ]
    } ]
  }
}
```

# Phrase suggest API

## Basic parameters

* `field` - the name of the field used to do n-gram lookups for the language model; the suggester will use this field to gain statistics to score corrections.
* `gram_size` - sets max size of the n-grams (shingles) in the `field`. If the field doesn't contain n-grams (shingles) this should be omitted or set to `1`.
* `real_word_error_likelihood` - the likelihood of a term being misspelled even if the term exists in the dictionary. The default is `0.95`, corresponding to 5% of the real words being misspelled.
* `confidence` - The confidence level defines a factor applied to the input phrases score which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result. For instance a confidence level of `1.0` will only return suggestions that score higher than the input phrase. If set to `0.0` the top N candidates are returned. The default is `1.0`.
* `max_errors` - the maximum percentage of the terms that can be considered misspellings in order to form a correction. This method accepts a float value in the range `[0..1)` as a fraction of the actual query terms, or a number `>=1` as an absolute number of query terms. The default is set to `1.0`, which means that only corrections with at most one misspelled term are returned.
* `separator` - the separator that is used to separate terms in the bigram field. If not set, the whitespace character is used as a separator.
* `size` - the number of candidates that are generated for each individual query term. Low numbers like `3` or `5` typically produce good results. Raising this can bring up terms with higher edit distances. The default is `5`.
* `analyzer` - sets the analyzer to analyze the suggest text with. Defaults to the search analyzer of the suggest field passed via `field`.
* `shard_size` - sets the maximum number of suggested terms to be retrieved from each individual shard. During the reduce phase only the top N suggestions are returned based on the `size` option. Defaults to `5`.
* `text` - Sets the text / query to provide suggestions for.
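To make the `max_errors` semantics concrete, here is a small sketch of one plausible interpretation (the helper name and the rounding behavior are assumptions for illustration, not taken from the implementation):

```python
def max_misspellings(num_query_terms, max_errors):
    """Hypothetical helper: values of max_errors in [0..1) are read as a
    fraction of the query terms, values >= 1 as an absolute count."""
    if max_errors >= 1:
        return int(max_errors)
    # fractional case: floor, but always allow at least one correction
    return max(1, int(num_query_terms * max_errors))

# "Xor the Got-Jewel" analyzes to 4 terms; max_errors 0.5 permits 2 of them,
# while the default of 1.0 permits exactly one misspelled term.
```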

## Smoothing Models
The `phrase` suggester supports multiple smoothing models to balance weight between infrequent grams (grams (shingles) that do not exist in the index) and frequent grams (that appear at least once in the index).
* `laplace` - the default model, which uses additive smoothing where a constant (typically `1.0` or smaller) is added to all counts to balance weights. The default `alpha` is `0.5`.
* `stupid_backoff` - a simple backoff model that backs off to lower order n-gram models if the higher order count is `0` and discounts the lower order n-gram model by a constant factor. The default `discount` is `0.4`.
* `linear_interpolation` - a smoothing model that takes the weighted mean of the unigrams, bigrams and trigrams based on user supplied weights (lambdas). Linear Interpolation doesn't have any default values. All parameters (`trigram_lambda`, `bigram_lambda`, `unigram_lambda`) must be supplied.
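The three models can be sketched as follows for a bigram model (illustrative Python, not the ElasticSearch code; the counts and lambdas in the usage below are made up):

```python
def laplace(bigrams, unigrams, vocab_size, alpha=0.5):
    # additive smoothing: add a constant alpha to every count
    def p(w1, w2):
        return (bigrams.get((w1, w2), 0) + alpha) / (unigrams.get(w1, 0) + alpha * vocab_size)
    return p

def stupid_backoff(bigrams, unigrams, total_words, discount=0.4):
    # back off to the discounted unigram estimate when the bigram count is 0
    def p(w1, w2):
        if bigrams.get((w1, w2), 0) > 0:
            return bigrams[(w1, w2)] / unigrams[w1]
        return discount * unigrams.get(w2, 0) / total_words
    return p

def linear_interpolation(bigrams, unigrams, total_words, bigram_lambda, unigram_lambda):
    # weighted mean of the bigram and unigram estimates (trigrams omitted here)
    def p(w1, w2):
        p_bi = bigrams.get((w1, w2), 0) / unigrams[w1] if unigrams.get(w1) else 0.0
        p_uni = unigrams.get(w2, 0) / total_words
        return bigram_lambda * p_bi + unigram_lambda * p_uni
    return p
```

With stupid backoff, an unseen bigram still gets a nonzero score as long as the second word occurs somewhere in the index, which is what makes backing off useful when scoring partially corrected phrases.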

## Candidate Generators
The `phrase` suggester uses candidate generators to produce a list of possible terms per term in the given text. A single candidate generator is similar to a `term` suggester called for each individual term in the text. The output of the generators is subsequently scored in combination with the candidates from the other terms to form suggestion candidates.
Currently only one type of candidate generator is supported, the `direct_generator`. The phrase suggest API accepts a list of generators under the key `direct_generator`; each of the generators in the list is called per term in the original text.
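A rough sketch of how per-term candidate lists could combine into whole-phrase candidates (a hypothetical helper for illustration; the real implementation prunes the search far more aggressively than this cartesian product):

```python
from itertools import product

def phrase_candidates(per_term_candidates, max_errors=1):
    """Combine per-term candidate lists into whole-phrase candidates,
    keeping only phrases that change at most max_errors terms.
    Assumes the original term is listed first in each candidate list."""
    originals = [cands[0] for cands in per_term_candidates]
    phrases = []
    for combo in product(*per_term_candidates):
        changed = sum(1 for orig, pick in zip(originals, combo) if orig != pick)
        if changed <= max_errors:
            phrases.append(" ".join(combo))
    return phrases
```

For `[["xor", "xorr"], ["the"], ["got", "god"]]` with `max_errors=1`, single-term corrections like `xorr the got` survive while the double correction `xorr the god` is pruned; it is the language model's job to then rank the survivors.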

## Direct Generators

The direct generators support the following parameters:

* `field` - The field to fetch the candidate suggestions from. This is a required option that either needs to be set globally or per suggestion.
* `analyzer` - The analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field.
* `size` - The maximum corrections to be returned per suggest text token.
* `suggest_mode` - The suggest mode controls which suggestions are included, and for which suggest text terms suggestions should be generated. Three possible values can be specified:
 * `missing` - Only suggest terms in the suggest text that aren't in the index. This is the default.
 * `popular` - Only suggest suggestions that occur in more docs than the original suggest text term.
 * `always` - Suggest any matching suggestions based on terms in the suggest text.
* `max_edits` - The maximum edit distance candidate suggestions can have in order to be considered as a suggestion. Can only be a value between 1 and 2. Any other value results in a bad request error being thrown. Defaults to 2.
* `min_prefix` - The minimal number of prefix characters that must match in order to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance; usually misspellings don't occur in the beginning of terms.
* `min_query_length` -  The minimum length a suggest text term must have in order to be included. Defaults to 4.
* `max_inspections` - A factor that is multiplied with the `shard_size` in order to inspect more candidate spell corrections on the shard level. Can improve accuracy at the cost of performance. Defaults to 5.
* `threshold_frequency` - The minimal threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high frequency terms. Defaults to 0f and is disabled by default. If a value higher than 1 is specified then the number cannot be fractional. The shard level document frequencies are used for this option.
* `max_query_frequency` - The maximum threshold in number of documents a suggest text token can exist in in order to be included. Can be a relative percentage number (e.g. 0.4) or an absolute number to represent document frequencies. If a value higher than 1 is specified then it cannot be fractional. Defaults to 0.01f. This can be used to exclude high frequency terms from being spellchecked. High frequency terms are usually spelled correctly; excluding them also improves spellcheck performance. The shard level document frequencies are used for this option.
* `pre_filter` - a filter (analyzer) that is applied to each of the tokens passed to this candidate generator. This filter is applied to the original token before candidates are generated. (optional)
* `post_filter` - a filter (analyzer) that is applied to each of the generated tokens before they are passed to the actual phrase scorer. (optional)
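A direct generator can be sketched as an edit-distance lookup constrained by the parameters above (toy Python, not Lucene's actual spell checker; the linear dictionary scan is for illustration only — the real generator walks the term index):

```python
def edit_distance(a, b):
    # single-row dynamic-programming Levenshtein distance
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (ca != cb))
    return row[-1]

def direct_generate(term, dictionary, max_edits=2, min_prefix=1, min_query_length=4):
    """Candidates from dictionary within max_edits of term that share a
    min_prefix-character prefix; terms shorter than min_query_length are
    skipped entirely (parameter names follow the docs above)."""
    if len(term) < min_query_length:
        return []
    return sorted(w for w in dictionary
                  if w[:min_prefix] == term[:min_prefix]
                  and edit_distance(term, w) <= max_edits)
```

The prefix constraint is what makes the lookup fast, and it is also the limitation that the `reverse`-field trick in the next example works around.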

The following example shows a `phrase` suggest call with two generators: the first one uses a field containing ordinary indexed terms, and the second one uses a field whose terms are indexed with a `reverse` filter (tokens are indexed in reverse order). This is used to overcome the limitation of the direct generators, which require a constant prefix to provide high-performance suggestions. The `pre_filter` and `post_filter` options accept ordinary analyzer names.

```json
curl -s -XPOST 'localhost:9200/_search' -d '{
 "suggest" : {
    "text" : "Xor the Got-Jewel",
    "simple_phrase" : {
      "phrase" : {
        "analyzer" : "body",
        "field" : "bigram",
        "size" : 4,
        "real_word_error_likelihood" : 0.95,
        "confidence" : 2.0,
        "gram_size" : 2,
        "direct_generator" : [ {
          "field" : "body",
          "suggest_mode" : "always",
          "min_word_len" : 1
        }, {
          "field" : "reverse",
          "suggest_mode" : "always",
          "min_word_len" : 1,
          "pre_filter" : "reverse",
          "post_filter" : "reverse"
        } ]
      }
    }
  }
}'
```

`pre_filter` and `post_filter` can also be used to inject synonyms after candidates are generated. For instance, for the query `captain usq` we might generate the candidate `usa` for the term `usq`, which is a synonym for `america`; this allows us to present `captain america` to the user if this phrase scores high enough.
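The `reverse` pre/post filter trick from the request above can be sketched like this (illustrative only; `prefix_match` stands in for a real prefix-constrained generator over the reverse-indexed field): reversing both the query token and the indexed terms turns leading-character errors into trailing-character errors, which a prefix-constrained generator can handle.

```python
def reverse_pre_post(term, generator):
    """Apply a reverse pre_filter, generate candidates against a field whose
    terms were indexed reversed, then reverse the candidates back (post_filter)."""
    reversed_candidates = generator(term[::-1])    # pre_filter: reverse the token
    return [c[::-1] for c in reversed_candidates]  # post_filter: reverse back

# A leading-character misspelling like "korr" for "xorr" defeats a forward
# prefix-constrained generator, but the reversed forms share their prefix.
reversed_field = {w[::-1] for w in {"xorr", "god", "jewel"}}  # terms as a reverse filter indexes them
prefix_match = lambda t: [w for w in reversed_field
                          if w[:1] == t[:1] and abs(len(w) - len(t)) <= 1]
```

Here `reverse_pre_post("korr", prefix_match)` recovers `xorr` even though `korr` and `xorr` share no forward prefix.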

Closes #2709
Commit d4ec03ed76 (parent 2bc624806d) by Simon Willnauer, 2013-02-11 18:19:50 +01:00.
37 changed files with 4534 additions and 979 deletions.


@ -657,9 +657,9 @@ public class SearchRequestBuilder extends ActionRequestBuilder<SearchRequest, Se
}
/**
* Delegates to {@link org.elasticsearch.search.suggest.SuggestBuilder#addSuggestion(org.elasticsearch.search.suggest.SuggestBuilder.Suggestion)}.
* Delegates to {@link org.elasticsearch.search.suggest.SuggestBuilder#addSuggestion(org.elasticsearch.search.suggest.SuggestBuilder.SuggestionBuilder)}.
*/
public SearchRequestBuilder addSuggestion(SuggestBuilder.Suggestion suggestion) {
public SearchRequestBuilder addSuggestion(SuggestBuilder.SuggestionBuilder<?> suggestion) {
suggestBuilder().addSuggestion(suggestion);
return this;
}


@ -36,7 +36,7 @@ public class ShingleTokenFilterFactory extends AbstractTokenFilterFactory {
private final boolean outputUnigrams;
private Boolean outputUnigramsIfNoShingles;
private final boolean outputUnigramsIfNoShingles;
private String tokenSeparator;
@ -60,4 +60,20 @@ public class ShingleTokenFilterFactory extends AbstractTokenFilterFactory {
filter.setTokenSeparator(tokenSeparator);
return filter;
}
public int getMaxShingleSize() {
return maxShingleSize;
}
public int getMinShingleSize() {
return minShingleSize;
}
public boolean getOutputUnigrams() {
return outputUnigrams;
}
public boolean getOutputUnigramsIfNoShingles() {
return outputUnigramsIfNoShingles;
}
}


@ -762,6 +762,15 @@ public class MapperService extends AbstractIndexComponent implements Iterable<Do
public Analyzer searchQuoteAnalyzer() {
return this.searchQuoteAnalyzer;
}
public Analyzer fieldSearchAnalyzer(String field) {
return this.searchAnalyzer.getWrappedAnalyzer(field);
}
public Analyzer fieldSearchQuoteAnalyzer(String field) {
return this.searchQuoteAnalyzer.getWrappedAnalyzer(field);
}
/**
* Resolves the closest inherited {@link ObjectMapper} that is nested.


@ -45,7 +45,7 @@ import static org.elasticsearch.rest.RestRequest.Method.GET;
import static org.elasticsearch.rest.RestRequest.Method.POST;
import static org.elasticsearch.rest.RestStatus.BAD_REQUEST;
import static org.elasticsearch.rest.action.support.RestXContentBuilder.restContentBuilder;
import static org.elasticsearch.search.suggest.SuggestBuilder.fuzzySuggestion;
import static org.elasticsearch.search.suggest.SuggestBuilder.termSuggestion;
/**
*
@ -286,8 +286,8 @@ public class RestSearchAction extends BaseRestHandler {
}
String suggestMode = request.param("suggest_mode");
searchSourceBuilder.suggest().addSuggestion(
fuzzySuggestion(suggestField).setField(suggestField).setText(suggestText).setSize(suggestSize)
.setSuggestMode(suggestMode)
termSuggestion(suggestField).field(suggestField).text(suggestText).size(suggestSize)
.suggestMode(suggestMode)
);
}


@ -140,7 +140,7 @@ public class SearchServiceTransportAction extends AbstractComponent {
try {
QuerySearchResult result = searchService.executeQueryPhase(request);
listener.onResult(result);
} catch (Exception e) {
} catch (Throwable e) {
listener.onFailure(e);
}
} else {


@ -49,9 +49,13 @@ import org.elasticsearch.search.internal.InternalSearchResponse;
import org.elasticsearch.search.query.QuerySearchResult;
import org.elasticsearch.search.query.QuerySearchResultProvider;
import org.elasticsearch.search.suggest.Suggest;
import org.elasticsearch.search.suggest.Suggest.Suggestion;
import org.elasticsearch.search.suggest.Suggest.Suggestion.Entry;
import org.elasticsearch.search.suggest.Suggest.Suggestion.Entry.Option;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
@ -376,32 +380,31 @@ public class SearchPhaseController extends AbstractComponent {
// merge suggest results
Suggest suggest = null;
if (!queryResults.isEmpty()) {
List<Suggest.Suggestion> mergedSuggestions = null;
Map<String, List<Suggest.Suggestion>> groupedSuggestions = new HashMap<String, List<Suggest.Suggestion>>();
for (QuerySearchResultProvider resultProvider : queryResults.values()) {
Suggest shardResult = resultProvider.queryResult().suggest();
if (shardResult == null) {
continue;
}
if (mergedSuggestions == null) {
mergedSuggestions = shardResult.getSuggestions();
continue;
}
for (Suggest.Suggestion shardCommand : shardResult.getSuggestions()) {
for (Suggest.Suggestion mergedSuggestion : mergedSuggestions) {
if (mergedSuggestion.getName().equals(shardCommand.getName())) {
mergedSuggestion.reduce(shardCommand);
}
for (Suggestion<? extends Entry<? extends Option>> suggestion : shardResult) {
List<Suggestion> list = groupedSuggestions.get(suggestion.getName());
if (list == null) {
list = new ArrayList<Suggest.Suggestion>();
groupedSuggestions.put(suggestion.getName(), list);
}
list.add(suggestion);
}
}
if (mergedSuggestions != null) {
suggest = new Suggest(mergedSuggestions);
for (Suggest.Suggestion suggestion : mergedSuggestions) {
suggestion.trim();
}
List<Suggestion<? extends Entry<? extends Option>>> reduced = new ArrayList<Suggestion<? extends Entry<? extends Option>>>();
for (java.util.Map.Entry<String, List<Suggestion>> unmergedResults : groupedSuggestions.entrySet()) {
List<Suggestion> value = unmergedResults.getValue();
Suggestion reduce = value.get(0).reduce(value);
reduce.trim();
reduced.add(reduce);
}
suggest = new Suggest(reduced);
}
InternalSearchHits searchHits = new InternalSearchHits(hits.toArray(new InternalSearchHit[hits.size()]), totalHits, maxScore);


@ -0,0 +1,119 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.StringDistance;
import org.apache.lucene.search.spell.SuggestMode;
import org.apache.lucene.util.automaton.LevenshteinAutomata;
public class DirectSpellcheckerSettings {
private SuggestMode suggestMode = SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;
private float accuracy = 0.5f;
private Suggest.Suggestion.Sort sort = Suggest.Suggestion.Sort.SCORE;
private StringDistance stringDistance = DirectSpellChecker.INTERNAL_LEVENSHTEIN;
private int maxEdits = LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE;
private int maxInspections = 5;
private float maxTermFreq = 0.01f;
private int prefixLength = 1;
private int minWordLength = 4;
private float minDocFreq = 0f;
public SuggestMode suggestMode() {
return suggestMode;
}
public void suggestMode(SuggestMode suggestMode) {
this.suggestMode = suggestMode;
}
public float accuracy() {
return accuracy;
}
public void accuracy(float accuracy) {
this.accuracy = accuracy;
}
public Suggest.Suggestion.Sort sort() {
return sort;
}
public void sort(Suggest.Suggestion.Sort sort) {
this.sort = sort;
}
public StringDistance stringDistance() {
return stringDistance;
}
public void stringDistance(StringDistance distance) {
this.stringDistance = distance;
}
public int maxEdits() {
return maxEdits;
}
public void maxEdits(int maxEdits) {
this.maxEdits = maxEdits;
}
public int maxInspections() {
return maxInspections;
}
public void maxInspections(int maxInspections) {
this.maxInspections = maxInspections;
}
public float maxTermFreq() {
return maxTermFreq;
}
public void maxTermFreq(float maxTermFreq) {
this.maxTermFreq = maxTermFreq;
}
public int prefixLength() {
return prefixLength;
}
public void prefixLength(int prefixLength) {
this.prefixLength = prefixLength;
}
public int minWordLength() {
return minWordLength;
}
public void minQueryLength(int minQueryLength) {
this.minWordLength = minQueryLength;
}
public float minDocFreq() {
return minDocFreq;
}
public void minDocFreq(float minDocFreq) {
this.minDocFreq = minDocFreq;
}
}


@ -16,9 +16,17 @@
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.elasticsearch.ElasticSearchException;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
@ -27,48 +35,83 @@ import org.elasticsearch.common.text.Text;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentBuilderString;
import java.io.IOException;
import java.util.*;
import org.elasticsearch.search.suggest.Suggest.Suggestion;
import org.elasticsearch.search.suggest.Suggest.Suggestion.Entry;
import org.elasticsearch.search.suggest.Suggest.Suggestion.Entry.Option;
import org.elasticsearch.search.suggest.term.TermSuggestion;
/**
* Top level suggest result, containing the result for each suggestion.
*/
public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXContent {
public class Suggest implements Iterable<Suggest.Suggestion<? extends Entry<? extends Option>>>, Streamable, ToXContent {
static class Fields {
static final XContentBuilderString SUGGEST = new XContentBuilderString("suggest");
}
private List<Suggestion> suggestions;
private static final Comparator<Option> COMPARATOR = new Comparator<Suggest.Suggestion.Entry.Option>() {
@Override
public int compare(Option first, Option second) {
int cmp = Float.compare(second.getScore(), first.getScore());
if (cmp != 0) {
return cmp;
}
return first.getText().compareTo(second.getText());
}
};
Suggest() {
private List<Suggestion<? extends Entry<? extends Option>>> suggestions;
private Map<String, Suggestion<? extends Entry<? extends Option>>> suggestMap;
public Suggest() {
}
public Suggest(List<Suggestion> suggestions) {
public Suggest(List<Suggestion<? extends Entry<? extends Option>>> suggestions) {
this.suggestions = suggestions;
}
/**
* @return the suggestions
*/
public List<Suggestion> getSuggestions() {
return suggestions;
}
@Override
public Iterator<Suggestion> iterator() {
public Iterator<Suggestion<? extends Entry<? extends Option>>> iterator() {
return suggestions.iterator();
}
/**
* The number of suggestions in this {@link Suggest} result
*/
public int size() {
return suggestions.size();
}
public <T extends Suggestion<? extends Entry<? extends Option>>> T getSuggestion(String name) {
if (suggestions.isEmpty() || name == null) {
return null;
} else if (suggestions.size() == 1) {
return (T) (name.equals(suggestions.get(0).name) ? suggestions.get(0) : null);
} else if (this.suggestMap == null) {
suggestMap = new HashMap<String, Suggestion<? extends Entry<? extends Option>>>();
for (Suggest.Suggestion<? extends Entry<? extends Option>> item : suggestions) {
suggestMap.put(item.getName(), item);
}
}
return (T) suggestMap.get(name);
}
@Override
public void readFrom(StreamInput in) throws IOException {
int size = in.readVInt();
suggestions = new ArrayList<Suggestion>(size);
final int size = in.readVInt();
suggestions = new ArrayList<Suggestion<? extends Entry<? extends Option>>>(size);
for (int i = 0; i < size; i++) {
Suggestion suggestion = new Suggestion();
Suggestion<? extends Entry<? extends Option>> suggestion;
final int type = in.readVInt();
switch (type) {
case TermSuggestion.TYPE:
suggestion = new TermSuggestion();
break;
default:
suggestion = new Suggestion<Entry<Option>>();
break;
}
suggestion.readFrom(in);
suggestions.add(suggestion);
}
@ -77,7 +120,8 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
@Override
public void writeTo(StreamOutput out) throws IOException {
out.writeVInt(suggestions.size());
for (Suggestion command : suggestions) {
for (Suggestion<?> command : suggestions) {
out.writeVInt(command.getType());
command.writeTo(out);
}
}
@ -85,7 +129,7 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder.startObject(Fields.SUGGEST);
for (Suggestion suggestion : suggestions) {
for (Suggestion<?> suggestion : suggestions) {
suggestion.toXContent(builder, params);
}
builder.endObject();
@ -101,35 +145,39 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
/**
* The suggestion responses corresponding with the suggestions in the request.
*/
public static class Suggestion implements Iterable<Suggestion.Entry>, Streamable, ToXContent {
public static class Suggestion<T extends Suggestion.Entry> implements Iterable<T>, Streamable, ToXContent {
public static final int TYPE = 0;
protected String name;
protected int size;
protected final List<T> entries = new ArrayList<T>(5);
private String name;
private int size;
private Sort sort;
private final List<Entry> entries = new ArrayList<Entry>(5);
Suggestion() {
public Suggestion() {
}
Suggestion(String name, int size, Sort sort) {
public Suggestion(String name, int size) {
this.name = name;
this.size = size; // The suggested term size specified in request, only used for merging shard responses
this.sort = sort;
}
void addTerm(Entry entry) {
public void addTerm(T entry) {
entries.add(entry);
}
public int getType() {
return TYPE;
}
@Override
public Iterator<Entry> iterator() {
public Iterator<T> iterator() {
return entries.iterator();
}
/**
* @return The entries for this suggestion.
*/
public List<Entry> getEntries() {
public List<T> getEntries() {
return entries;
}
@ -144,53 +192,85 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
* Merges the result of another suggestion into this suggestion.
* For internal usage.
*/
public void reduce(Suggestion other) {
assert name.equals(other.name);
assert entries.size() == other.entries.size();
for (int i = 0; i < entries.size(); i++) {
Entry thisEntry = entries.get(i);
Entry otherEntry = other.entries.get(i);
thisEntry.reduce(otherEntry, sort);
public Suggestion<T> reduce(List<Suggestion<T>> toReduce) {
if (toReduce.size() == 1) {
return toReduce.get(0);
} else if (toReduce.isEmpty()) {
return null;
}
Suggestion<T> leader = toReduce.get(0);
List<T> entries = leader.entries;
final int size = entries.size();
Comparator<Option> sortComparator = sortComparator();
List<T> currentEntries = new ArrayList<T>();
for (int i = 0; i < size; i++) {
for (Suggestion<T> suggestion : toReduce) {
assert suggestion.entries.size() == size;
assert suggestion.name.equals(leader.name);
currentEntries.add(suggestion.entries.get(i));
}
T entry = (T) entries.get(i).reduce(currentEntries);
entry.sort(sortComparator);
entries.set(i, entry);
currentEntries.clear();
}
return leader;
}
protected Comparator<Option> sortComparator() {
return COMPARATOR;
}
/**
* Trims the number of options per suggest text term to the requested size.
* For internal usage.
*/
public void trim() {
for (Entry entry : entries) {
for (Entry<?> entry : entries) {
entry.trim(size);
}
}
@Override
public void readFrom(StreamInput in) throws IOException {
name = in.readString();
size = in.readVInt();
sort = Sort.fromId(in.readByte());
innerReadFrom(in);
int size = in.readVInt();
entries.clear();
for (int i = 0; i < size; i++) {
entries.add(Entry.read(in));
T newEntry = newEntry();
newEntry.readFrom(in);
entries.add(newEntry);
}
}
protected T newEntry() {
return (T)new Entry();
}
protected void innerReadFrom(StreamInput in) throws IOException {
name = in.readString();
size = in.readVInt();
}
@Override
public void writeTo(StreamOutput out) throws IOException {
out.writeString(name);
out.writeVInt(size);
out.writeByte(sort.id());
innerWriteTo(out);
out.writeVInt(entries.size());
for (Entry entry : entries) {
for (Entry<?> entry : entries) {
entry.writeTo(out);
}
}
public void innerWriteTo(StreamOutput out) throws IOException {
out.writeString(name);
out.writeVInt(size);
}
@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder.startArray(name);
for (Entry entry : entries) {
for (Entry<?> entry : entries) {
entry.toXContent(builder, params);
}
builder.endArray();
@ -201,7 +281,7 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
/**
* Represents a part from the suggest text with suggested options.
*/
public static class Entry implements Iterable<Entry.Option>, Streamable, ToXContent {
public static class Entry<O extends Entry.Option> implements Iterable<O>, Streamable, ToXContent {
static class Fields {
@ -212,55 +292,54 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
}
private Text text;
private int offset;
private int length;
protected Text text;
protected int offset;
protected int length;
private List<Option> options;
protected List<O> options;
Entry(Text text, int offset, int length) {
public Entry(Text text, int offset, int length) {
this.text = text;
this.offset = offset;
this.length = length;
this.options = new ArrayList<Option>(5);
this.options = new ArrayList<O>(5);
}
Entry() {
public Entry() {
}
void addOption(Option option) {
public void addOption(O option) {
options.add(option);
}
void reduce(Entry otherEntry, Sort sort) {
assert text.equals(otherEntry.text);
assert offset == otherEntry.offset;
assert length == otherEntry.length;
for (Option otherOption : otherEntry.options) {
int index = options.indexOf(otherOption);
if (index >= 0) {
Option thisOption = options.get(index);
thisOption.setFreq(thisOption.freq + otherOption.freq);
} else {
options.add(otherOption);
}
}
Comparator<Option> comparator;
switch (sort) {
case SCORE:
comparator = SuggestPhase.SCORE;
break;
case FREQUENCY:
comparator = SuggestPhase.FREQUENCY;
break;
default:
throw new ElasticSearchException("Could not resolve comparator in reduce phase.");
}
protected void sort(Comparator<O> comparator) {
Collections.sort(options, comparator);
}
protected Entry<O> reduce(List<Entry<O>> toReduce) {
if (toReduce.size() == 1) {
return toReduce.get(0);
}
final Map<O, O> entries = new HashMap<O, O>();
Entry<O> leader = toReduce.get(0);
for (Entry<O> entry : toReduce) {
assert leader.text.equals(entry.text);
assert leader.offset == entry.offset;
assert leader.length == entry.length;
for (O option : entry) {
O merger = entries.get(option);
if (merger == null) {
entries.put(option, option);
} else {
merger.mergeInto(option);
}
}
}
leader.options.clear();
leader.options.addAll(entries.keySet());
return leader;
}
/**
* @return the text (analyzed by suggest analyzer) originating from the suggest text. Usually this is a
* single term.
@ -284,7 +363,7 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
}
@Override
public Iterator<Option> iterator() {
public Iterator<O> iterator() {
return options.iterator();
}
@ -292,7 +371,7 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
* @return The suggested options for this particular suggest entry. If there are no suggested terms then
* an empty list is returned.
*/
public List<Option> getOptions() {
public List<O> getOptions() {
return options;
}
@ -308,7 +387,7 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
Entry entry = (Entry) o;
Entry<?> entry = (Entry<?>) o;
if (length != entry.length) return false;
if (offset != entry.offset) return false;
@ -325,23 +404,23 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
return result;
}
static Entry read(StreamInput in) throws IOException {
Entry entry = new Entry();
entry.readFrom(in);
return entry;
}
@Override
public void readFrom(StreamInput in) throws IOException {
text = in.readText();
offset = in.readVInt();
length = in.readVInt();
int suggestedWords = in.readVInt();
options = new ArrayList<Option>(suggestedWords);
options = new ArrayList<O>(suggestedWords);
for (int j = 0; j < suggestedWords; j++) {
options.add(Option.create(in));
O newOption = newOption();
newOption.readFrom(in);
options.add(newOption);
}
}
protected O newOption(){
return (O) new Option();
}
@Override
public void writeTo(StreamOutput out) throws IOException {
@ -377,26 +456,19 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
static class Fields {
static final XContentBuilderString TEXT = new XContentBuilderString("text");
static final XContentBuilderString FREQ = new XContentBuilderString("freq");
static final XContentBuilderString SCORE = new XContentBuilderString("score");
}
private Text text;
private int freq;
private float score;
Option(Text text, int freq, float score) {
public Option(Text text, float score) {
this.text = text;
this.freq = freq;
this.score = score;
}
Option() {
}
public void setFreq(int freq) {
this.freq = freq;
public Option() {
}
/**
@@ -406,13 +478,6 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
return text;
}
/**
* @return How often this suggested text appears in the index.
*/
public int getFreq() {
return freq;
}
/**
* @return The score based on the edit distance difference between the suggested term and the
* term in the suggest text.
@@ -421,35 +486,34 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
return score;
}
static Option create(StreamInput in) throws IOException {
Option suggestion = new Option();
suggestion.readFrom(in);
return suggestion;
}
@Override
public void readFrom(StreamInput in) throws IOException {
text = in.readText();
freq = in.readVInt();
score = in.readFloat();
}
@Override
public void writeTo(StreamOutput out) throws IOException {
out.writeText(text);
out.writeVInt(freq);
out.writeFloat(score);
}
@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder.startObject();
builder.field(Fields.TEXT, text);
builder.field(Fields.FREQ, freq);
builder.field(Fields.SCORE, score);
innerToXContent(builder, params);
builder.endObject();
return builder;
}
protected XContentBuilder innerToXContent(XContentBuilder builder, Params params) throws IOException {
builder.field(Fields.TEXT, text);
builder.field(Fields.SCORE, score);
return builder;
}
protected void mergeInto(Option otherOption) {
}
@Override
public boolean equals(Object o) {
@@ -468,8 +532,7 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
}
}
enum Sort {
public enum Sort {
/**
* Sort should first be based on score.
@@ -491,7 +554,7 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
return id;
}
static Sort fromId(byte id) {
public static Sort fromId(byte id) {
if (id == 0) {
return SCORE;
} else if (id == 1) {
@@ -504,5 +567,4 @@ public class Suggest implements Iterable<Suggest.Suggestion>, Streamable, ToXCon
}
}
}
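The `Sort` enum above serializes as a single byte id (0 for `SCORE`, 1 for `FREQUENCY`), which `fromId` reverses when reading a suggestion off the wire. A minimal standalone sketch of that round-trip, using an illustrative class name rather than the actual nested `Suggest.Suggestion.Sort` type:

```java
// Sketch of the byte-id round-trip used by Suggest.Suggestion.Sort.
// Ids follow the diff above: 0 -> SCORE, 1 -> FREQUENCY; anything else is rejected.
public enum SortSketch {
    SCORE((byte) 0),
    FREQUENCY((byte) 1);

    private final byte id;

    SortSketch(byte id) {
        this.id = id;
    }

    public byte id() {
        return id;
    }

    public static SortSketch fromId(byte id) {
        if (id == 0) {
            return SCORE;
        } else if (id == 1) {
            return FREQUENCY;
        }
        throw new IllegalArgumentException("Illegal suggest sort id: " + id);
    }
}
```

Keeping the id mapping in one place means the stream format stays stable even if enum constants are reordered later.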


@@ -16,20 +16,21 @@
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest;
import org.elasticsearch.ElasticSearchIllegalArgumentException;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.XContentBuilder;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.elasticsearch.ElasticSearchIllegalArgumentException;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.search.suggest.phrase.PhraseSuggestionBuilder;
import org.elasticsearch.search.suggest.term.TermSuggestionBuilder;
/**
* Defines how to perform suggesting. This builder allows a number of global options to be specified and
* an arbitrary number of {@link org.elasticsearch.search.suggest.SuggestBuilder.FuzzySuggestion} instances.
* an arbitrary number of {@link org.elasticsearch.search.suggest.SuggestBuilder.TermSuggestionBuilder} instances.
* <p/>
* Suggesting works by suggesting terms that appear in the suggest text and that are similar to the terms in
* the provided text. These spelling suggestions are based on several options described in this class.
@@ -38,11 +39,11 @@ public class SuggestBuilder implements ToXContent {
private String globalText;
private final List<Suggestion> suggestions = new ArrayList<Suggestion>();
private final List<SuggestionBuilder<?>> suggestions = new ArrayList<SuggestionBuilder<?>>();
/**
* Sets the text to provide suggestions for. The suggest text is a required option that needs
* to be set either via this setter or via the {@link org.elasticsearch.search.suggest.SuggestBuilder.Suggestion#setText(String)} method.
* to be set either via this setter or via the {@link org.elasticsearch.search.suggest.SuggestBuilder.SuggestionBuilder#setText(String)} method.
* <p/>
* The suggest text gets analyzed by the suggest analyzer or the suggest field search analyzer.
* For each analyzed token, suggested terms are suggested if possible.
@@ -53,10 +54,10 @@ public class SuggestBuilder implements ToXContent {
}
/**
* Adds an {@link org.elasticsearch.search.suggest.SuggestBuilder.FuzzySuggestion} instance under a user defined name.
* Adds an {@link org.elasticsearch.search.suggest.SuggestBuilder.TermSuggestionBuilder} instance under a user defined name.
* The order in which the <code>Suggestions</code> are added is the same as in the response.
*/
public SuggestBuilder addSuggestion(Suggestion suggestion) {
public SuggestBuilder addSuggestion(SuggestionBuilder<?> suggestion) {
suggestions.add(suggestion);
return this;
}
@@ -64,7 +65,7 @@ public class SuggestBuilder implements ToXContent {
/**
* Returns all suggestions with the defined names.
*/
public List<Suggestion> getSuggestion() {
public List<SuggestionBuilder<?>> getSuggestion() {
return suggestions;
}
@@ -74,7 +75,7 @@ public class SuggestBuilder implements ToXContent {
if (globalText != null) {
builder.field("text", globalText);
}
for (Suggestion suggestion : suggestions) {
for (SuggestionBuilder<?> suggestion : suggestions) {
builder = suggestion.toXContent(builder, params);
}
builder.endObject();
@@ -86,17 +87,30 @@ public class SuggestBuilder implements ToXContent {
*
* @param name The name of this suggestion. This is a required parameter.
*/
public static FuzzySuggestion fuzzySuggestion(String name) {
return new FuzzySuggestion(name);
public static TermSuggestionBuilder termSuggestion(String name) {
return new TermSuggestionBuilder(name);
}
/**
* Convenience factory method.
*
* @param name The name of this suggestion. This is a required parameter.
*/
public static PhraseSuggestionBuilder phraseSuggestion(String name) {
return new PhraseSuggestionBuilder(name);
}
public static abstract class Suggestion<T> implements ToXContent {
public static abstract class SuggestionBuilder<T> implements ToXContent {
private String name;
private String suggester;
private String text;
private String field;
private String analyzer;
private Integer size;
private Integer shardSize;
public Suggestion(String name, String suggester) {
public SuggestionBuilder(String name, String suggester) {
this.name = name;
this.suggester = suggester;
}
@@ -104,7 +118,8 @@ public class SuggestBuilder implements ToXContent {
/**
* Same as in {@link SuggestBuilder#setText(String)}, but in the suggestion scope.
*/
public T setText(String text) {
@SuppressWarnings("unchecked")
public T text(String text) {
this.text = text;
return (T) this;
}
@@ -116,6 +131,18 @@ public class SuggestBuilder implements ToXContent {
builder.field("text", text);
}
builder.startObject(suggester);
if (analyzer != null) {
builder.field("analyzer", analyzer);
}
if (field != null) {
builder.field("field", field);
}
if (size != null) {
builder.field("size", size);
}
if (shardSize != null) {
builder.field("shard_size", shardSize);
}
builder = innerToXContent(builder, params);
builder.endObject();
builder.endObject();
@@ -123,256 +150,57 @@ public class SuggestBuilder implements ToXContent {
}
protected abstract XContentBuilder innerToXContent(XContentBuilder builder, Params params) throws IOException;
}
/**
* Defines the actual suggest command. Each command uses the global options unless defined in the suggestion itself.
* All options are the same as the global options, but are only applicable for this suggestion.
*/
public static class FuzzySuggestion extends Suggestion<FuzzySuggestion> {
private String field;
private String analyzer;
private String suggestMode;
private Float accuracy;
private Integer size;
private String sort;
private String stringDistance;
private Boolean lowerCaseTerms;
private Integer maxEdits;
private Integer factor;
private Float maxTermFreq;
private Integer prefixLength;
private Integer minWordLength;
private Float minDocFreq;
private Integer shardSize;
/**
* @param name The name of this suggestion. This is a required parameter.
* Sets the field to fetch the candidate suggestions from. This is a
* required option and needs to be set via this setter or
* {@link org.elasticsearch.search.suggest.SuggestBuilder.TermSuggestionBuilder#setField(String)}
* method
*/
public FuzzySuggestion(String name) {
super(name, "fuzzy");
}
/**
* Sets the field to fetch the candidate suggestions from. This is a required option and needs to be set
* via this setter or the {@link org.elasticsearch.search.suggest.SuggestBuilder.FuzzySuggestion#setField(String)} method
*/
public FuzzySuggestion setField(String field) {
@SuppressWarnings("unchecked")
public T field(String field) {
this.field = field;
return this;
return (T)this;
}
/**
* Sets the analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field.
* Sets the analyzer to analyse the suggest text with. Defaults to the search
* analyzer of the suggest field.
*/
public FuzzySuggestion setAnalyzer(String analyzer) {
@SuppressWarnings("unchecked")
public T analyzer(String analyzer) {
this.analyzer = analyzer;
return this;
}
/**
* The global suggest mode controls what suggested terms are included or controls for what suggest text tokens,
* terms should be suggested for. Three possible values can be specified:
* <ol>
* <li><code>missing</code> - Only suggest terms in the suggest text that aren't in the index. This is the default.
* <li><code>popular</code> - Only suggest terms that occur in more docs than the original suggest text term.
* <li><code>always</code> - Suggest any matching suggest terms based on tokens in the suggest text.
* </ol>
*/
public FuzzySuggestion setSuggestMode(String suggestMode) {
this.suggestMode = suggestMode;
return this;
}
/**
* Sets how similar the suggested terms at least need to be compared to the original suggest text tokens.
* A value between 0 and 1 can be specified. This value will be compared to the string distance result of each
* candidate spelling correction.
* <p/>
* Default is 0.5f.
*/
public FuzzySuggestion setAccuracy(float accuracy) {
this.accuracy = accuracy;
return this;
return (T)this;
}
/**
* Sets the maximum suggestions to be returned per suggest text term.
*/
public FuzzySuggestion setSize(int size) {
@SuppressWarnings("unchecked")
public T size(int size) {
if (size <= 0) {
throw new ElasticSearchIllegalArgumentException("Size must be positive");
}
this.size = size;
return this;
return (T)this;
}
/**
* Sets how to sort the suggest terms per suggest text token.
* Two possible values:
* <ol>
* <li><code>score</code> - Sort should first be based on score, then document frequency and then the term itself.
* <li><code>frequency</code> - Sort should first be based on document frequency, then score and then the term itself.
* </ol>
* <p/>
* What the score is depends on the suggester being used.
*/
public FuzzySuggestion setSort(String sort) {
this.sort = sort;
return this;
}
/**
* Sets what string distance implementation to use for comparing how similar suggested terms are.
* Five possible values can be specified:
* <ol>
* <li><code>internal</code> - This is the default and is based on <code>damerau_levenshtein</code>, but
* highly optimized for comparing string distance for terms inside the index.
* <li><code>damerau_levenshtein</code> - String distance algorithm based on Damerau-Levenshtein algorithm.
* <li><code>levenstein</code> - String distance algorithm based on Levenstein edit distance algorithm.
* <li><code>jarowinkler</code> - String distance algorithm based on Jaro-Winkler algorithm.
* <li><code>ngram</code> - String distance algorithm based on n-grams.
* </ol>
*/
public FuzzySuggestion setStringDistance(String stringDistance) {
this.stringDistance = stringDistance;
return this;
}
/**
* Sets whether to lowercase the suggest text tokens just before suggesting terms.
*/
public FuzzySuggestion setLowerCaseTerms(Boolean lowerCaseTerms) {
this.lowerCaseTerms = lowerCaseTerms;
return this;
}
/**
* Sets the maximum edit distance candidate suggestions can have in order to be considered as a suggestion.
* Can only be a value between 1 and 2. Any other value results in a bad request error being thrown. Defaults to 2.
*/
public FuzzySuggestion setMaxEdits(Integer maxEdits) {
this.maxEdits = maxEdits;
return this;
}
/**
* A factor that is used to multiply with the size in order to inspect more candidate suggestions.
* Can improve accuracy at the cost of performance. Defaults to 5.
*/
public FuzzySuggestion setFactor(Integer factor) {
this.factor = factor;
return this;
}
/**
* Sets a maximum threshold in number of documents a suggest text token can exist in order to be corrected.
* Can be a relative percentage number (e.g. 0.4) or an absolute number to represent document frequencies.
* If a value higher than 1 is specified then the number cannot be fractional. Defaults to 0.01f.
* <p/>
* This can be used to exclude high frequency terms from being suggested. High frequency terms are usually
* spelled correctly; on top of that, this also improves the suggest performance.
*/
public FuzzySuggestion setMaxTermFreq(float maxTermFreq) {
this.maxTermFreq = maxTermFreq;
return this;
}
/**
* Sets the minimal number of prefix characters that must match in order to be a candidate suggestion.
* Defaults to 1. Increasing this number improves suggest performance. Usually misspellings don't occur in the
* beginning of terms.
*/
public FuzzySuggestion setPrefixLength(int prefixLength) {
this.prefixLength = prefixLength;
return this;
}
/**
* The minimum length a suggest text term must have in order to be corrected. Defaults to 4.
*/
public FuzzySuggestion setMinWordLength(int minWordLength) {
this.minWordLength = minWordLength;
return this;
}
/**
* Sets a minimal threshold in number of documents a suggested term should appear in. This can be specified as
* an absolute number or as a relative percentage of number of documents. This can improve quality by only suggesting
* high frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified then the number
* cannot be fractional.
*/
public FuzzySuggestion setMinDocFreq(float minDocFreq) {
this.minDocFreq = minDocFreq;
return this;
}
/**
* Sets the maximum number of suggested terms to be retrieved from each individual shard. During the reduce
* phase only the top N suggestions are returned based on the <code>size</code> option. Defaults to the
* Sets the maximum number of suggested terms to be retrieved from each
* individual shard. During the reduce phase only the top N suggestions
* are returned based on the <code>size</code> option. Defaults to the
* <code>size</code> option.
* <p/>
* Setting this to a value higher than the `size` can be useful in order to get a more accurate document frequency
* for suggested terms. Due to the fact that terms are partitioned amongst shards, the shard level document
* frequencies of suggestions may not be precise. Increasing this will make these document frequencies
* more precise.
* Setting this to a value higher than the `size` can be useful in order to
* get a more accurate document frequency for suggested terms. Due to the
* fact that terms are partitioned amongst shards, the shard level document
* frequencies of suggestions may not be precise. Increasing this will make
* these document frequencies more precise.
*/
public FuzzySuggestion setShardSize(Integer shardSize) {
@SuppressWarnings("unchecked")
public T shardSize(Integer shardSize) {
this.shardSize = shardSize;
return this;
}
@Override
public XContentBuilder innerToXContent(XContentBuilder builder, Params params) throws IOException {
if (analyzer != null) {
builder.field("analyzer", analyzer);
}
if (field != null) {
builder.field("field", field);
}
if (suggestMode != null) {
builder.field("suggest_mode", suggestMode);
}
if (accuracy != null) {
builder.field("accuracy", accuracy);
}
if (size != null) {
builder.field("size", size);
}
if (sort != null) {
builder.field("sort", sort);
}
if (stringDistance != null) {
builder.field("string_distance", stringDistance);
}
if (lowerCaseTerms != null) {
builder.field("lowercase_terms", lowerCaseTerms);
}
if (maxEdits != null) {
builder.field("max_edits", maxEdits);
}
if (factor != null) {
builder.field("factor", factor);
}
if (maxTermFreq != null) {
builder.field("max_term_freq", maxTermFreq);
}
if (prefixLength != null) {
builder.field("prefix_length", prefixLength);
}
if (minWordLength != null) {
builder.field("min_word_len", minWordLength);
}
if (minDocFreq != null) {
builder.field("min_doc_freq", minDocFreq);
}
if (shardSize != null) {
builder.field("shard_size", shardSize);
}
return builder;
return (T)this;
}
}
}
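The refactor above hoists the shared options (`field`, `analyzer`, `size`, `shard_size`) into the generic `SuggestionBuilder<T>`, whose setters return `(T) this` so chained calls keep the concrete subclass type. A stripped-down, self-contained sketch of that self-typed builder pattern, with hypothetical class and field names (not the actual Elasticsearch types):

```java
// Sketch of the self-typed fluent builder pattern used by SuggestionBuilder<T>:
// shared setters return (T) this so chaining stays at the subclass type.
public class BuilderSketch {
    static abstract class Base<T extends Base<T>> {
        String field;
        Integer size;

        @SuppressWarnings("unchecked")
        public T field(String field) {
            this.field = field;
            return (T) this;
        }

        @SuppressWarnings("unchecked")
        public T size(int size) {
            if (size <= 0) {
                throw new IllegalArgumentException("Size must be positive");
            }
            this.size = size;
            return (T) this;
        }
    }

    // Stand-in for PhraseSuggestionBuilder with one subclass-specific option.
    static class PhraseSketch extends Base<PhraseSketch> {
        Integer gramSize;

        public PhraseSketch gramSize(int gramSize) {
            this.gramSize = gramSize;
            return this;
        }
    }

    public static void main(String[] args) {
        // Shared setters still return PhraseSketch, so subclass options chain freely.
        PhraseSketch b = new PhraseSketch().field("bigram").size(1).gramSize(2);
        System.out.println(b.field + " " + b.size + " " + b.gramSize);
    }
}
```

Without the self-type, calling an inherited setter first (e.g. `field(...)`) would return the base type and break the chain before `gramSize(...)` could be called.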


@@ -0,0 +1,29 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest;
import java.io.IOException;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.search.internal.SearchContext;
public interface SuggestContextParser {
public SuggestionSearchContext.SuggestionContext parse(XContentParser parser, SearchContext context) throws IOException;
}


@@ -16,43 +16,30 @@
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.spell.*;
import java.io.IOException;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.LevenshteinAutomata;
import org.elasticsearch.ElasticSearchIllegalArgumentException;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.search.SearchParseElement;
import org.elasticsearch.search.internal.SearchContext;
import org.elasticsearch.search.suggest.SuggestionSearchContext.SuggestionContext;
import org.elasticsearch.search.suggest.phrase.PhraseSuggestParser;
import org.elasticsearch.search.suggest.term.TermSuggestParser;
/**
*
*/
public class SuggestParseElement implements SearchParseElement {
private final SuggestContextParser termSuggestParser = new TermSuggestParser();
private final SuggestContextParser phraseSuggestParser = new PhraseSuggestParser();
@Override
public void parse(XContentParser parser, SearchContext context) throws Exception {
SuggestionSearchContext suggestionSearchContext = new SuggestionSearchContext();
BytesRef globalText = null;
Analyzer defaultAnalyzer = context.mapperService().searchAnalyzer();
float defaultAccuracy = SpellChecker.DEFAULT_ACCURACY;
int defaultSize = 5;
SuggestMode defaultSuggestMode = SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;
Suggest.Suggestion.Sort defaultSort = Suggest.Suggestion.Sort.SCORE;
StringDistance defaultStringDistance = DirectSpellChecker.INTERNAL_LEVENSHTEIN;
boolean defaultLowerCaseTerms = false; // changed from Lucene default because we rely on search analyzer to properly handle it
int defaultMaxEdits = LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE;
int defaultFactor = 5;
float defaultMaxTermFreq = 0.01f;
int defaultPrefixLength = 1;
int defaultMinQueryLength = 4;
float defaultMinDocFreq = 0f;
String fieldName = null;
XContentParser.Token token;
while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
@@ -81,162 +68,30 @@ public class SuggestParseElement implements SearchParseElement {
if (suggestionName == null) {
throw new ElasticSearchIllegalArgumentException("Suggestion must have name");
}
// TODO: Once we have more suggester impls we need to have different parsing logic per suggester.
// This code is now specific for the fuzzy suggester
if (!"fuzzy".equals(fieldName)) {
final SuggestContextParser contextParser;
if ("term".equals(fieldName)) {
contextParser = termSuggestParser;
} else if ("phrase".equals(fieldName)) {
contextParser = phraseSuggestParser;
} else {
throw new ElasticSearchIllegalArgumentException("Suggester[" + fieldName + "] not supported");
}
SuggestionSearchContext.Suggestion suggestion = new SuggestionSearchContext.Suggestion();
suggestion.text(suggestText);
suggestionSearchContext.addSuggestion(suggestionName, suggestion);
while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
if (token == XContentParser.Token.FIELD_NAME) {
fieldName = parser.currentName();
} else if (token.isValue()) {
if ("analyzer".equals(fieldName)) {
String analyzerName = parser.text();
Analyzer analyzer = context.mapperService().analysisService().analyzer(analyzerName);
if (analyzer == null) {
throw new ElasticSearchIllegalArgumentException("Analyzer [" + analyzerName + "] doesn't exist");
}
suggestion.analyzer(analyzer);
} else if ("field".equals(fieldName)) {
suggestion.setField(parser.text());
} else if ("accuracy".equals(fieldName)) {
suggestion.accuracy(parser.floatValue());
} else if ("size".equals(fieldName)) {
suggestion.size(parser.intValue());
} else if ("suggest_mode".equals(fieldName) || "suggestMode".equals(fieldName)) {
suggestion.suggestMode(resolveSuggestMode(parser.text()));
} else if ("sort".equals(fieldName)) {
suggestion.sort(resolveSort(parser.text()));
} else if ("string_distance".equals(fieldName) || "stringDistance".equals(fieldName)) {
suggestion.stringDistance(resolveDistance(parser.text()));
} else if ("lowercase_terms".equals(fieldName) || "lowercaseTerms".equals(fieldName)) {
suggestion.lowerCaseTerms(parser.booleanValue());
} else if ("max_edits".equals(fieldName) || "maxEdits".equals(fieldName) || "fuzziness".equals(fieldName)) {
suggestion.maxEdits(parser.intValue());
if (suggestion.maxEdits() < 1 || suggestion.maxEdits() > LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE) {
throw new ElasticSearchIllegalArgumentException("Illegal max_edits value " + suggestion.maxEdits());
}
} else if ("factor".equals(fieldName)) {
suggestion.factor(parser.intValue());
} else if ("max_term_freq".equals(fieldName) || "maxTermFreq".equals(fieldName)) {
suggestion.maxTermFreq(parser.floatValue());
} else if ("prefix_length".equals(fieldName) || "prefixLength".equals(fieldName)) {
suggestion.prefixLength(parser.intValue());
} else if ("min_word_len".equals(fieldName) || "minWordLen".equals(fieldName)) {
suggestion.minQueryLength(parser.intValue());
} else if ("min_doc_freq".equals(fieldName) || "minDocFreq".equals(fieldName)) {
suggestion.minDocFreq(parser.floatValue());
} else if ("shard_size".equals(fieldName) || "shardSize".equals(fieldName)) {
suggestion.shardSize(parser.intValue());
} else {
throw new ElasticSearchIllegalArgumentException("suggester[fuzzy] doesn't support [" + fieldName + "]");
}
}
}
parseAndVerify(parser, context, suggestionSearchContext, globalText, suggestionName, suggestText, contextParser);
}
}
}
}
// Verify options and set defaults
for (SuggestionSearchContext.Suggestion command : suggestionSearchContext.suggestions().values()) {
if (command.field() == null) {
throw new ElasticSearchIllegalArgumentException("The required field option is missing");
}
if (command.text() == null) {
if (globalText == null) {
throw new ElasticSearchIllegalArgumentException("The required text option is missing");
}
command.text(globalText);
}
if (command.analyzer() == null) {
command.analyzer(defaultAnalyzer);
}
if (command.accuracy() == null) {
command.accuracy(defaultAccuracy);
}
if (command.size() == null) {
command.size(defaultSize);
}
if (command.suggestMode() == null) {
command.suggestMode(defaultSuggestMode);
}
if (command.sort() == null) {
command.sort(defaultSort);
}
if (command.stringDistance() == null) {
command.stringDistance(defaultStringDistance);
}
if (command.lowerCaseTerms() == null) {
command.lowerCaseTerms(defaultLowerCaseTerms);
}
if (command.maxEdits() == null) {
command.maxEdits(defaultMaxEdits);
}
if (command.factor() == null) {
command.factor(defaultFactor);
}
if (command.maxTermFreq() == null) {
command.maxTermFreq(defaultMaxTermFreq);
}
if (command.prefixLength() == null) {
command.prefixLength(defaultPrefixLength);
}
if (command.minWordLength() == null) {
command.minQueryLength(defaultMinQueryLength);
}
if (command.minDocFreq() == null) {
command.minDocFreq(defaultMinDocFreq);
}
if (command.shardSize() == null) {
command.shardSize(defaultSize);
}
}
context.suggest(suggestionSearchContext);
}
private SuggestMode resolveSuggestMode(String sortVal) {
if ("missing".equals(sortVal)) {
return SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;
} else if ("popular".equals(sortVal)) {
return SuggestMode.SUGGEST_MORE_POPULAR;
} else if ("always".equals(sortVal)) {
return SuggestMode.SUGGEST_ALWAYS;
} else {
throw new ElasticSearchIllegalArgumentException("Illegal suggest mode " + sortVal);
}
}
private Suggest.Suggestion.Sort resolveSort(String sortVal) {
if ("score".equals(sortVal)) {
return Suggest.Suggestion.Sort.SCORE;
} else if ("frequency".equals(sortVal)) {
return Suggest.Suggestion.Sort.FREQUENCY;
} else {
throw new ElasticSearchIllegalArgumentException("Illegal suggest sort " + sortVal);
}
}
private StringDistance resolveDistance(String distanceVal) {
if ("internal".equals(distanceVal)) {
return DirectSpellChecker.INTERNAL_LEVENSHTEIN;
} else if ("damerau_levenshtein".equals(distanceVal)) {
return new LuceneLevenshteinDistance();
} else if ("levenstein".equals(distanceVal)) {
return new LevensteinDistance();
} else if ("jarowinkler".equals(distanceVal)) {
return new JaroWinklerDistance();
} else if ("ngram".equals(distanceVal)) {
return new NGramDistance();
} else {
throw new ElasticSearchIllegalArgumentException("Illegal distance option " + distanceVal);
}
public void parseAndVerify(XContentParser parser, SearchContext context, SuggestionSearchContext suggestionSearchContext,
BytesRef globalText, String suggestionName, BytesRef suggestText, SuggestContextParser suggestParser ) throws IOException {
SuggestionContext suggestion = suggestParser.parse(parser, context);
suggestion.setText(suggestText);
SuggestUtils.verifySuggestion(context, globalText, suggestion);
suggestionSearchContext.addSuggestion(suggestionName, suggestion);
}
}
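`SuggestParseElement` no longer hard-codes the `fuzzy` suggester: it dispatches on the suggester name, mapping `term` and `phrase` to their `SuggestContextParser` and failing fast on unknown names. A self-contained sketch of that lookup; the `ContextParser` interface and the returned strings are simplified stand-ins for the real parser types:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the name-based dispatch in SuggestParseElement.parse:
// each registered suggester name maps to its context parser.
public class DispatchSketch {
    // Stand-in for SuggestContextParser; the real one takes an
    // XContentParser and a SearchContext and returns a SuggestionContext.
    interface ContextParser {
        String parse();
    }

    private static final Map<String, ContextParser> PARSERS = new HashMap<>();
    static {
        PARSERS.put("term", () -> "term-suggestion");
        PARSERS.put("phrase", () -> "phrase-suggestion");
    }

    public static ContextParser resolve(String name) {
        ContextParser parser = PARSERS.get(name);
        if (parser == null) {
            throw new IllegalArgumentException("Suggester[" + name + "] not supported");
        }
        return parser;
    }
}
```

A registry like this is what lets later suggester implementations plug in without touching the parse loop itself.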


@@ -16,43 +16,27 @@
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest;
import com.google.common.collect.ImmutableMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.SuggestWord;
import org.apache.lucene.search.spell.SuggestWordFrequencyComparator;
import org.apache.lucene.search.spell.SuggestWordQueue;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.UnicodeUtil;
import org.elasticsearch.ElasticSearchException;
import org.elasticsearch.ElasticSearchIllegalArgumentException;
import org.elasticsearch.common.bytes.BytesArray;
import org.elasticsearch.common.component.AbstractComponent;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.io.FastCharArrayReader;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.text.BytesText;
import org.elasticsearch.common.text.StringText;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.search.SearchParseElement;
import org.elasticsearch.search.SearchPhase;
import org.elasticsearch.search.internal.SearchContext;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import static org.elasticsearch.search.suggest.Suggest.Suggestion;
import org.apache.lucene.util.CharsRef;
import org.elasticsearch.ElasticSearchException;
import org.elasticsearch.common.component.AbstractComponent;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.search.SearchParseElement;
import org.elasticsearch.search.SearchPhase;
import org.elasticsearch.search.internal.SearchContext;
import org.elasticsearch.search.suggest.Suggest.Suggestion;
import org.elasticsearch.search.suggest.Suggest.Suggestion.Entry;
import org.elasticsearch.search.suggest.Suggest.Suggestion.Entry.Option;
import org.elasticsearch.search.suggest.SuggestionSearchContext.SuggestionContext;
import com.google.common.collect.ImmutableMap;
/**
*/
@@ -76,152 +60,27 @@ public class SuggestPhase extends AbstractComponent implements SearchPhase {
@Override
public void execute(SearchContext context) throws ElasticSearchException {
try {
SuggestionSearchContext suggest = context.suggest();
if (suggest == null) {
return;
}
try {
CharsRef spare = new CharsRef(); // Maybe add CharsRef to CacheRecycler?
List<Suggestion> suggestions = new ArrayList<Suggestion>(2);
for (Map.Entry<String, SuggestionSearchContext.Suggestion> entry : suggest.suggestions().entrySet()) {
SuggestionSearchContext.Suggestion suggestion = entry.getValue();
suggestions.add(executeDirectSpellChecker(entry.getKey(), suggestion, context, spare));
final List<Suggestion<? extends Entry<? extends Option>>> suggestions = new ArrayList<Suggestion<? extends Entry<? extends Option>>>(suggest.suggestions().size());
for (Map.Entry<String, SuggestionSearchContext.SuggestionContext> entry : suggest.suggestions().entrySet()) {
SuggestionSearchContext.SuggestionContext suggestion = entry.getValue();
Suggester<SuggestionContext> suggester = suggestion.getSuggester();
Suggestion<? extends Entry<? extends Option>> result = suggester.execute(entry.getKey(), suggestion, context, spare);
assert entry.getKey().equals(result.name);
suggestions.add(result);
}
context.queryResult().suggest(new Suggest(suggestions));
} catch (IOException e) {
throw new ElasticSearchException("I/O exception during suggest phase", e);
}
}
private Suggestion executeDirectSpellChecker(String name, SuggestionSearchContext.Suggestion suggestion, SearchContext context, CharsRef spare) throws IOException {
DirectSpellChecker directSpellChecker = new DirectSpellChecker();
directSpellChecker.setAccuracy(suggestion.accuracy());
Comparator<SuggestWord> comparator;
switch (suggestion.sort()) {
case SCORE:
comparator = SuggestWordQueue.DEFAULT_COMPARATOR;
break;
case FREQUENCY:
comparator = LUCENE_FREQUENCY;
break;
default:
throw new ElasticSearchIllegalArgumentException("Illegal suggest sort: " + suggestion.sort());
} catch (NullPointerException e) {
e.printStackTrace();
}
directSpellChecker.setComparator(comparator);
directSpellChecker.setDistance(suggestion.stringDistance());
directSpellChecker.setLowerCaseTerms(suggestion.lowerCaseTerms());
directSpellChecker.setMaxEdits(suggestion.maxEdits());
directSpellChecker.setMaxInspections(suggestion.factor());
directSpellChecker.setMaxQueryFrequency(suggestion.maxTermFreq());
directSpellChecker.setMinPrefix(suggestion.prefixLength());
directSpellChecker.setMinQueryLength(suggestion.minWordLength());
directSpellChecker.setThresholdFrequency(suggestion.minDocFreq());
Suggestion response = new Suggestion(
name, suggestion.size(), suggestion.sort()
);
List<Token> tokens = queryTerms(suggestion, spare);
for (Token token : tokens) {
IndexReader indexReader = context.searcher().getIndexReader();
// TODO: Extend DirectSpellChecker in 4.1, to get the raw suggested words as BytesRef
SuggestWord[] suggestedWords = directSpellChecker.suggestSimilar(
token.term, suggestion.shardSize(), indexReader, suggestion.suggestMode()
);
Text key = new BytesText(new BytesArray(token.term.bytes()));
Suggestion.Entry resultEntry = new Suggestion.Entry(key, token.startOffset, token.endOffset - token.startOffset);
for (SuggestWord suggestWord : suggestedWords) {
Text word = new StringText(suggestWord.string);
resultEntry.addOption(new Suggestion.Entry.Option(word, suggestWord.freq, suggestWord.score));
}
response.addTerm(resultEntry);
}
return response;
}
private List<Token> queryTerms(SuggestionSearchContext.Suggestion suggestion, CharsRef spare) throws IOException {
UnicodeUtil.UTF8toUTF16(suggestion.text(), spare);
TokenStream ts = suggestion.analyzer().tokenStream(
suggestion.field(), new FastCharArrayReader(spare.chars, spare.offset, spare.length)
);
ts.reset();
TermToBytesRefAttribute termAtt = ts.addAttribute(TermToBytesRefAttribute.class);
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
BytesRef termRef = termAtt.getBytesRef();
List<Token> result = new ArrayList<Token>(5);
while (ts.incrementToken()) {
termAtt.fillBytesRef();
Term term = new Term(suggestion.field(), BytesRef.deepCopyOf(termRef));
result.add(new Token(term, offsetAtt.startOffset(), offsetAtt.endOffset()));
}
return result;
}
private static Comparator<SuggestWord> LUCENE_FREQUENCY = new SuggestWordFrequencyComparator();
public static Comparator<Suggestion.Entry.Option> SCORE = new Score();
public static Comparator<Suggestion.Entry.Option> FREQUENCY = new Frequency();
// Same behaviour as comparators in suggest module, but for SuggestedWord
// Highest score first, then highest freq first, then lowest term first
public static class Score implements Comparator<Suggestion.Entry.Option> {
@Override
public int compare(Suggestion.Entry.Option first, Suggestion.Entry.Option second) {
// first criteria: the distance
int cmp = Float.compare(second.getScore(), first.getScore());
if (cmp != 0) {
return cmp;
}
// second criteria (if first criteria is equal): the popularity
cmp = second.getFreq() - first.getFreq();
if (cmp != 0) {
return cmp;
}
// third criteria: term text
return first.getText().compareTo(second.getText());
}
}
// Same behaviour as comparators in suggest module, but for SuggestedWord
// Highest freq first, then highest score first, then lowest term first
public static class Frequency implements Comparator<Suggestion.Entry.Option> {
@Override
public int compare(Suggestion.Entry.Option first, Suggestion.Entry.Option second) {
// first criteria: the popularity
int cmp = second.getFreq() - first.getFreq();
if (cmp != 0) {
return cmp;
}
// second criteria (if first criteria is equal): the distance
cmp = Float.compare(second.getScore(), first.getScore());
if (cmp != 0) {
return cmp;
}
// third criteria: term text
return first.getText().compareTo(second.getText());
}
}
private static class Token {
public final Term term;
public final int startOffset;
public final int endOffset;
private Token(Term term, int startOffset, int endOffset) {
this.term = term;
this.startOffset = startOffset;
this.endOffset = endOffset;
}
}
}
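The `Score` and `Frequency` comparators above share one contract: a primary criterion, a tie-break on the secondary one, and finally the lexicographically lowest term. Below is a minimal, self-contained sketch of the score-first ordering; the `Option` class is a hypothetical stand-in for `Suggestion.Entry.Option`, and `Integer.compare` is used for the frequency tie-break instead of subtraction, which avoids overflow:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ComparatorDemo {
    // hypothetical stand-in for Suggestion.Entry.Option: term text, doc frequency, score
    static final class Option {
        final String text;
        final int freq;
        final float score;
        Option(String text, int freq, float score) {
            this.text = text;
            this.freq = freq;
            this.score = score;
        }
    }

    // mirrors the Score comparator: highest score first, then highest freq, then lowest term
    static final Comparator<Option> SCORE = (a, b) -> {
        int cmp = Float.compare(b.score, a.score);
        if (cmp != 0) return cmp;
        cmp = Integer.compare(b.freq, a.freq);       // overflow-safe tie-break
        if (cmp != 0) return cmp;
        return a.text.compareTo(b.text);
    };

    public static void main(String[] args) {
        List<Option> options = new ArrayList<>(Arrays.asList(
                new Option("god", 2, 0.8f),
                new Option("got", 5, 0.8f),
                new Option("goat", 1, 0.9f)));
        options.sort(SCORE);
        // "goat" wins on score; "got" beats "god" on frequency at equal score
        System.out.println(options.get(0).text + " " + options.get(1).text + " " + options.get(2).text);
    }
}
```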

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest;
import java.io.IOException;
import java.util.Comparator;
import java.util.Locale;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CustomAnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.LevensteinDistance;
import org.apache.lucene.search.spell.LuceneLevenshteinDistance;
import org.apache.lucene.search.spell.NGramDistance;
import org.apache.lucene.search.spell.StringDistance;
import org.apache.lucene.search.spell.SuggestMode;
import org.apache.lucene.search.spell.SuggestWord;
import org.apache.lucene.search.spell.SuggestWordFrequencyComparator;
import org.apache.lucene.search.spell.SuggestWordQueue;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.UnicodeUtil;
import org.apache.lucene.util.automaton.LevenshteinAutomata;
import org.elasticsearch.ElasticSearchIllegalArgumentException;
import org.elasticsearch.common.io.FastCharArrayReader;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.index.analysis.CustomAnalyzer;
import org.elasticsearch.index.analysis.NamedAnalyzer;
import org.elasticsearch.index.analysis.ShingleTokenFilterFactory;
import org.elasticsearch.index.analysis.TokenFilterFactory;
import org.elasticsearch.search.internal.SearchContext;
import org.elasticsearch.search.suggest.SuggestionSearchContext.SuggestionContext;
public final class SuggestUtils {
public static Comparator<SuggestWord> LUCENE_FREQUENCY = new SuggestWordFrequencyComparator();
public static Comparator<SuggestWord> SCORE_COMPARATOR = SuggestWordQueue.DEFAULT_COMPARATOR;
private SuggestUtils() {
// utils!!
}
public static DirectSpellChecker getDirectSpellChecker(DirectSpellcheckerSettings suggestion) {
DirectSpellChecker directSpellChecker = new DirectSpellChecker();
directSpellChecker.setAccuracy(suggestion.accuracy());
Comparator<SuggestWord> comparator;
switch (suggestion.sort()) {
case SCORE:
comparator = SCORE_COMPARATOR;
break;
case FREQUENCY:
comparator = LUCENE_FREQUENCY;
break;
default:
throw new ElasticSearchIllegalArgumentException("Illegal suggest sort: " + suggestion.sort());
}
directSpellChecker.setComparator(comparator);
directSpellChecker.setDistance(suggestion.stringDistance());
directSpellChecker.setMaxEdits(suggestion.maxEdits());
directSpellChecker.setMaxInspections(suggestion.maxInspections());
directSpellChecker.setMaxQueryFrequency(suggestion.maxTermFreq());
directSpellChecker.setMinPrefix(suggestion.prefixLength());
directSpellChecker.setMinQueryLength(suggestion.minWordLength());
directSpellChecker.setThresholdFrequency(suggestion.minDocFreq());
return directSpellChecker;
}
public static BytesRef join(BytesRef separator, BytesRef result, BytesRef... toJoin) {
int len = separator.length * toJoin.length - 1;
for (BytesRef br : toJoin) {
len += br.length;
}
result.grow(len);
return joinPreAllocated(separator, result, toJoin);
}
public static BytesRef joinPreAllocated(BytesRef separator, BytesRef result, BytesRef... toJoin) {
result.length = 0;
result.offset = 0;
for (int i = 0; i < toJoin.length - 1; i++) {
BytesRef br = toJoin[i];
System.arraycopy(br.bytes, br.offset, result.bytes, result.offset, br.length);
result.offset += br.length;
System.arraycopy(separator.bytes, separator.offset, result.bytes, result.offset, separator.length);
result.offset += separator.length;
}
final BytesRef br = toJoin[toJoin.length-1];
System.arraycopy(br.bytes, br.offset, result.bytes, result.offset, br.length);
result.length = result.offset + br.length;
result.offset = 0;
return result;
}
public static abstract class TokenConsumer {
protected CharTermAttribute charTermAttr;
protected PositionIncrementAttribute posIncAttr;
protected OffsetAttribute offsetAttr;
public void reset(TokenStream stream) {
charTermAttr = stream.addAttribute(CharTermAttribute.class);
posIncAttr = stream.addAttribute(PositionIncrementAttribute.class);
offsetAttr = stream.addAttribute(OffsetAttribute.class);
}
protected BytesRef fillBytesRef(BytesRef spare) {
spare.offset = 0;
spare.length = spare.bytes.length;
char[] source = charTermAttr.buffer();
UnicodeUtil.UTF16toUTF8(source, 0, charTermAttr.length(), spare);
return spare;
}
public abstract void nextToken() throws IOException;
public void end() {}
}
public static int analyze(Analyzer analyzer, BytesRef toAnalyze, String field, TokenConsumer consumer, CharsRef spare) throws IOException {
UnicodeUtil.UTF8toUTF16(toAnalyze, spare);
return analyze(analyzer, spare, field, consumer);
}
public static int analyze(Analyzer analyzer, CharsRef toAnalyze, String field, TokenConsumer consumer) throws IOException {
TokenStream ts = analyzer.tokenStream(
field, new FastCharArrayReader(toAnalyze.chars, toAnalyze.offset, toAnalyze.length)
);
return analyze(ts, consumer);
}
public static int analyze(TokenStream stream, TokenConsumer consumer) throws IOException {
stream.reset();
consumer.reset(stream);
int numTokens = 0;
while (stream.incrementToken()) {
consumer.nextToken();
numTokens++;
}
consumer.end();
return numTokens;
}
public static SuggestMode resolveSuggestMode(String suggestMode) {
suggestMode = suggestMode.toLowerCase(Locale.US);
if ("missing".equals(suggestMode)) {
return SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;
} else if ("popular".equals(suggestMode)) {
return SuggestMode.SUGGEST_MORE_POPULAR;
} else if ("always".equals(suggestMode)) {
return SuggestMode.SUGGEST_ALWAYS;
} else {
throw new ElasticSearchIllegalArgumentException("Illegal suggest mode " + suggestMode);
}
}
public static Suggest.Suggestion.Sort resolveSort(String sortVal) {
if ("score".equals(sortVal)) {
return Suggest.Suggestion.Sort.SCORE;
} else if ("frequency".equals(sortVal)) {
return Suggest.Suggestion.Sort.FREQUENCY;
} else {
throw new ElasticSearchIllegalArgumentException("Illegal suggest sort " + sortVal);
}
}
public static StringDistance resolveDistance(String distanceVal) {
if ("internal".equals(distanceVal)) {
return DirectSpellChecker.INTERNAL_LEVENSHTEIN;
} else if ("damerau_levenshtein".equals(distanceVal)) {
return new LuceneLevenshteinDistance();
} else if ("levenstein".equals(distanceVal)) {
return new LevensteinDistance();
} else if ("jarowinkler".equals(distanceVal)) {
return new JaroWinklerDistance();
} else if ("ngram".equals(distanceVal)) {
return new NGramDistance();
} else {
throw new ElasticSearchIllegalArgumentException("Illegal distance option " + distanceVal);
}
}
public static boolean parseDirectSpellcheckerSettings(XContentParser parser, String fieldName,
DirectSpellcheckerSettings suggestion) throws IOException {
if ("accuracy".equals(fieldName)) {
suggestion.accuracy(parser.floatValue());
} else if ("suggest_mode".equals(fieldName) || "suggestMode".equals(fieldName)) {
suggestion.suggestMode(SuggestUtils.resolveSuggestMode(parser.text()));
} else if ("sort".equals(fieldName)) {
suggestion.sort(SuggestUtils.resolveSort(parser.text()));
} else if ("string_distance".equals(fieldName) || "stringDistance".equals(fieldName)) {
suggestion.stringDistance(SuggestUtils.resolveDistance(parser.text()));
} else if ("max_edits".equals(fieldName) || "maxEdits".equals(fieldName) || "fuzziness".equals(fieldName)) {
suggestion.maxEdits(parser.intValue());
if (suggestion.maxEdits() < 1 || suggestion.maxEdits() > LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE) {
throw new ElasticSearchIllegalArgumentException("Illegal max_edits value " + suggestion.maxEdits());
}
} else if ("max_inspections".equals(fieldName)) {
suggestion.maxInspections(parser.intValue());
} else if ("max_term_freq".equals(fieldName) || "maxTermFreq".equals(fieldName)) {
suggestion.maxTermFreq(parser.floatValue());
} else if ("prefix_length".equals(fieldName) || "prefixLength".equals(fieldName)) {
suggestion.prefixLength(parser.intValue());
} else if ("min_word_len".equals(fieldName) || "minWordLen".equals(fieldName)) {
suggestion.minQueryLength(parser.intValue());
} else if ("min_doc_freq".equals(fieldName) || "minDocFreq".equals(fieldName)) {
suggestion.minDocFreq(parser.floatValue());
} else {
return false;
}
return true;
}
public static boolean parseSuggestContext(XContentParser parser, SearchContext context, String fieldName,
SuggestionSearchContext.SuggestionContext suggestion) throws IOException {
if ("analyzer".equals(fieldName)) {
String analyzerName = parser.text();
Analyzer analyzer = context.mapperService().analysisService().analyzer(analyzerName);
if (analyzer == null) {
throw new ElasticSearchIllegalArgumentException("Analyzer [" + analyzerName + "] doesn't exist");
}
suggestion.setAnalyzer(analyzer);
} else if ("field".equals(fieldName)) {
suggestion.setField(parser.text());
} else if ("size".equals(fieldName)) {
suggestion.setSize(parser.intValue());
} else if ("shard_size".equals(fieldName) || "shardSize".equals(fieldName)) {
suggestion.setShardSize(parser.intValue());
} else {
return false;
}
return true;
}
public static void verifySuggestion(SearchContext context, BytesRef globalText, SuggestionContext suggestion) {
// Verify options and set defaults
if (suggestion.getField() == null) {
throw new ElasticSearchIllegalArgumentException("The required field option is missing");
}
if (suggestion.getText() == null) {
if (globalText == null) {
throw new ElasticSearchIllegalArgumentException("The required text option is missing");
}
suggestion.setText(globalText);
}
if (suggestion.getAnalyzer() == null) {
suggestion.setAnalyzer(context.mapperService().searchAnalyzer());
}
}
public static ShingleTokenFilterFactory getShingleFilterFactory(Analyzer analyzer) {
if (analyzer instanceof NamedAnalyzer) {
analyzer = ((NamedAnalyzer)analyzer).analyzer();
}
if (analyzer instanceof CustomAnalyzer) {
CustomAnalyzer a = (CustomAnalyzer) analyzer;
TokenFilterFactory[] tokenFilters = a.tokenFilters();
for (TokenFilterFactory tokenFilterFactory : tokenFilters) {
if (tokenFilterFactory instanceof ShingleTokenFilterFactory) {
return ((ShingleTokenFilterFactory) tokenFilterFactory);
}
}
}
return null;
}
}
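`join` and `joinPreAllocated` stitch candidate terms back into a single phrase at the `BytesRef` level, placing the separator between every pair of terms. The same contract expressed at the `String` level, as a sketch (the original additionally pre-computes the total byte length so `grow` allocates once; like the original, this assumes at least one part):

```java
public class JoinDemo {
    // String-level equivalent of SuggestUtils.join: every part but the
    // last is followed by the separator
    static String join(String separator, String... toJoin) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < toJoin.length - 1; i++) {
            sb.append(toJoin[i]).append(separator);
        }
        return sb.append(toJoin[toJoin.length - 1]).toString();
    }

    public static void main(String[] args) {
        // rebuilds a corrected phrase from individual candidate terms
        System.out.println(join(" ", "xorr", "the", "god", "jewel"));
    }
}
```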

package org.elasticsearch.search.suggest;
import java.io.IOException;
import org.apache.lucene.util.CharsRef;
import org.elasticsearch.search.internal.SearchContext;
import org.elasticsearch.search.suggest.Suggest.Suggestion;
import org.elasticsearch.search.suggest.Suggest.Suggestion.Entry;
import org.elasticsearch.search.suggest.Suggest.Suggestion.Entry.Option;
public interface Suggester<T extends SuggestionSearchContext.SuggestionContext> {
public abstract Suggestion<? extends Entry<? extends Option>> execute(String name, T suggestion, SearchContext context, CharsRef spare)
throws IOException;
}

package org.elasticsearch.search.suggest;

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.ElasticSearchIllegalArgumentException;

/**
 */
public class SuggestionSearchContext {

    private final Map<String, SuggestionContext> suggestions = new LinkedHashMap<String, SuggestionContext>(4);

    public void addSuggestion(String name, SuggestionContext suggestion) {
        suggestions.put(name, suggestion);
    }

    public Map<String, SuggestionContext> suggestions() {
        return suggestions;
    }

    public static class SuggestionContext {

        private BytesRef text;
        private final Suggester suggester;
        private String field;
        private Analyzer analyzer;
        private int size = 5;
        private int shardSize = 5;

        protected SuggestionContext(Suggester suggester) {
            this.suggester = suggester;
        }

        public BytesRef getText() {
            return text;
        }

        public void setText(BytesRef text) {
            this.text = text;
        }

        public Suggester<SuggestionContext> getSuggester() {
            return this.suggester;
        }

        public Analyzer getAnalyzer() {
            return analyzer;
        }

        public void setAnalyzer(Analyzer analyzer) {
            this.analyzer = analyzer;
        }

        public String getField() {
            return field;
        }

        public void setField(String field) {
            this.field = field;
        }

        public int getSize() {
            return size;
        }

        public void setSize(int size) {
            if (size <= 0) {
                throw new ElasticSearchIllegalArgumentException("Size must be positive");
            }
            this.size = size;
        }

        public Integer getShardSize() {
            return shardSize;
        }

        public void setShardSize(int shardSize) {
            this.shardSize = shardSize;
        }
    }
}

package org.elasticsearch.search.suggest.phrase;
import java.io.IOException;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.Candidate;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.CandidateSet;
//TODO public for tests
public abstract class CandidateGenerator {
public abstract boolean isKnownWord(BytesRef term) throws IOException;
public abstract int frequency(BytesRef term) throws IOException;
public CandidateSet drawCandidates(BytesRef term, int numCandidates) throws IOException {
CandidateSet set = new CandidateSet(Candidate.EMPTY, createCandidate(term));
return drawCandidates(set, numCandidates);
}
public Candidate createCandidate(BytesRef term) throws IOException {
return createCandidate(term, frequency(term), 1.0);
}
public abstract Candidate createCandidate(BytesRef term, int frequency, double channelScore) throws IOException;
public abstract CandidateSet drawCandidates(CandidateSet set, int numCandidates) throws IOException;
}
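The abstract contract above boils down to two capabilities: looking up a term's frequency and drawing ranked candidate spellings for it. A toy, purely in-memory illustration of that idea follows; a hypothetical dictionary map replaces the Lucene index, and candidates are limited to single-character substitutions, whereas the real `DirectCandidateGenerator` below delegates both jobs to the index via `DirectSpellChecker`:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyGeneratorDemo {
    // hypothetical in-memory stand-in for the index: term -> document frequency
    static final Map<String, Integer> DICT = new HashMap<>();
    static {
        DICT.put("god", 4);
        DICT.put("got", 7);
        DICT.put("jewel", 3);
    }

    static int frequency(String term) {
        return DICT.getOrDefault(term, 0);
    }

    // candidates are dictionary words within one character substitution of the term,
    // most frequent first, capped at numCandidates
    static List<String> drawCandidates(String term, int numCandidates) {
        List<String> out = new ArrayList<>();
        for (String word : DICT.keySet()) {
            if (word.length() != term.length()) {
                continue;
            }
            int diff = 0;
            for (int i = 0; i < word.length(); i++) {
                if (word.charAt(i) != term.charAt(i)) diff++;
            }
            if (diff == 1) out.add(word);
        }
        out.sort(Comparator.comparingInt(ToyGeneratorDemo::frequency).reversed());
        return out.subList(0, Math.min(numCandidates, out.size()));
    }

    public static void main(String[] args) {
        System.out.println(drawCandidates("gop", 2)); // "got" and "god" both differ by one char
    }
}
```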

package org.elasticsearch.search.suggest.phrase;
import java.io.IOException;
import org.apache.lucene.util.PriorityQueue;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.Candidate;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.CandidateSet;
final class CandidateScorer {
private final WordScorer scorer;
private final int maxNumCorrections;
private final int gramSize;
public CandidateScorer(WordScorer scorer, int maxNumCorrections, int gramSize) {
this.scorer = scorer;
this.maxNumCorrections = maxNumCorrections;
this.gramSize = gramSize;
}
public Correction[] findBestCandiates(CandidateSet[] sets, float errorFraction, double cutoffScore) throws IOException {
PriorityQueue<Correction> corrections = new PriorityQueue<Correction>(maxNumCorrections) {
@Override
protected boolean lessThan(Correction a, Correction b) {
return a.score < b.score;
}
};
int numMissspellings = 1;
if (errorFraction >= 1.0) {
numMissspellings = (int) errorFraction;
} else {
numMissspellings = Math.round(errorFraction * sets.length);
}
findCandidates(sets, new Candidate[sets.length], 0, Math.max(1, numMissspellings), corrections, cutoffScore, 0.0);
Correction[] result = new Correction[corrections.size()];
for (int i = result.length - 1; i >= 0; i--) {
result[i] = corrections.pop();
}
assert corrections.size() == 0;
return result;
}
public void findCandidates(CandidateSet[] candidates, Candidate[] path, int ord, int numMissspellingsLeft,
PriorityQueue<Correction> corrections, double cutoffScore, final double pathScore) throws IOException {
CandidateSet current = candidates[ord];
if (ord == candidates.length - 1) {
path[ord] = current.originalTerm;
updateTop(candidates, path, corrections, cutoffScore, pathScore + scorer.score(path, candidates, ord, gramSize));
if (numMissspellingsLeft > 0) {
for (int i = 0; i < current.candidates.length; i++) {
path[ord] = current.candidates[i];
updateTop(candidates, path, corrections, cutoffScore, pathScore + scorer.score(path, candidates, ord, gramSize));
}
}
} else {
if (numMissspellingsLeft > 0) {
path[ord] = current.originalTerm;
findCandidates(candidates, path, ord + 1, numMissspellingsLeft, corrections, cutoffScore, pathScore + scorer.score(path, candidates, ord, gramSize));
for (int i = 0; i < current.candidates.length; i++) {
path[ord] = current.candidates[i];
findCandidates(candidates, path, ord + 1, numMissspellingsLeft - 1, corrections, cutoffScore, pathScore + scorer.score(path, candidates, ord, gramSize));
}
} else {
path[ord] = current.originalTerm;
findCandidates(candidates, path, ord + 1, 0, corrections, cutoffScore, pathScore + scorer.score(path, candidates, ord, gramSize));
}
}
}
private void updateTop(CandidateSet[] candidates, Candidate[] path, PriorityQueue<Correction> corrections, double cutoffScore, double score)
throws IOException {
score = Math.exp(score);
assert Math.abs(score - score(path, candidates)) < 0.00001;
if (score > cutoffScore) {
if (corrections.size() < maxNumCorrections) {
Candidate[] c = new Candidate[candidates.length];
System.arraycopy(path, 0, c, 0, path.length);
corrections.add(new Correction(score, c));
} else if (corrections.top().score < score) {
Correction top = corrections.top();
System.arraycopy(path, 0, top.candidates, 0, path.length);
top.score = score;
corrections.updateTop();
}
}
}
public double score(Candidate[] path, CandidateSet[] candidates) throws IOException {
double score = 0.0d;
for (int i = 0; i < candidates.length; i++) {
score += scorer.score(path, candidates, i, gramSize);
}
return Math.exp(score);
}
}
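`findBestCandiates` interprets `errorFraction` (the `max_errors` request parameter from the API example) in two ways: values of at least one are an absolute budget of correctable terms, while fractions are a share of the query's terms, with a floor of one. That computation, extracted as a runnable sketch (the method name here is illustrative):

```java
public class MaxErrorsDemo {
    // mirrors the numMissspellings computation in CandidateScorer.findBestCandiates:
    // >= 1.0 is an absolute error budget, a fraction is rounded against the term count,
    // and at least one correction is always allowed
    static int maxMisspellings(float errorFraction, int numTerms) {
        int n = errorFraction >= 1.0f
                ? (int) errorFraction
                : Math.round(errorFraction * numTerms);
        return Math.max(1, n);
    }

    public static void main(String[] args) {
        System.out.println(maxMisspellings(0.5f, 3)); // 50% of a 3-term query
        System.out.println(maxMisspellings(2.0f, 3)); // absolute budget of 2
    }
}
```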

package org.elasticsearch.search.suggest.phrase;
import java.util.Arrays;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.search.suggest.SuggestUtils;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.Candidate;
//TODO public for tests
public final class Correction {
public double score;
public final Candidate[] candidates;
public Correction(double score, Candidate[] candidates) {
this.score = score;
this.candidates = candidates;
}
@Override
public String toString() {
return "Correction [score=" + score + ", candidates=" + Arrays.toString(candidates) + "]";
}
public BytesRef join(BytesRef separator) {
return join(separator, new BytesRef());
}
public BytesRef join(BytesRef separator, BytesRef result) {
BytesRef[] toJoin = new BytesRef[this.candidates.length];
int len = separator.length * this.candidates.length - 1;
for (int i = 0; i < toJoin.length; i++) {
toJoin[i] = candidates[i].term;
len += toJoin[i].length;
}
result.offset = 0;
result.grow(len);
return SuggestUtils.joinPreAllocated(separator, result, toJoin);
}
}

package org.elasticsearch.search.suggest.phrase;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.SuggestMode;
import org.apache.lucene.search.spell.SuggestWord;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.CharsRef;
import org.elasticsearch.ElasticSearchIllegalArgumentException;
import org.elasticsearch.search.suggest.SuggestUtils;
//TODO public for tests
public final class DirectCandidateGenerator extends CandidateGenerator {
private final DirectSpellChecker spellchecker;
private final String field;
private final SuggestMode suggestMode;
private final IndexReader reader;
private final int docCount;
    private final double logBase = 5;
    private final int frequencyPlateau;
    private final Analyzer preFilter;
    private final Analyzer postFilter;
    private final double nonErrorLikelihood;

    public DirectCandidateGenerator(DirectSpellChecker spellchecker, String field, SuggestMode suggestMode, IndexReader reader, double nonErrorLikelihood) throws IOException {
        this(spellchecker, field, suggestMode, reader, nonErrorLikelihood, null, null);
    }

    public DirectCandidateGenerator(DirectSpellChecker spellchecker, String field, SuggestMode suggestMode, IndexReader reader, double nonErrorLikelihood, Analyzer preFilter, Analyzer postFilter) throws IOException {
        this.spellchecker = spellchecker;
        this.field = field;
        this.suggestMode = suggestMode;
        this.reader = reader;
        Terms terms = MultiFields.getTerms(reader, field);
        if (terms == null) {
            throw new ElasticSearchIllegalArgumentException("generator field [" + field + "] doesn't exist");
        }
        final int docCount = terms.getDocCount();
        this.docCount = docCount == -1 ? reader.maxDoc() : docCount;
        this.preFilter = preFilter;
        this.postFilter = postFilter;
        this.nonErrorLikelihood = nonErrorLikelihood;
        float thresholdFrequency = spellchecker.getThresholdFrequency();
        this.frequencyPlateau = thresholdFrequency >= 1.0f ? (int) thresholdFrequency : (int) (docCount * thresholdFrequency);
    }

    /* (non-Javadoc)
     * @see org.elasticsearch.search.suggest.phrase.CandidateGenerator#isKnownWord(org.apache.lucene.util.BytesRef)
     */
    @Override
    public boolean isKnownWord(BytesRef term) throws IOException {
        return frequency(term) > 0;
    }

    /* (non-Javadoc)
     * @see org.elasticsearch.search.suggest.phrase.CandidateGenerator#frequency(org.apache.lucene.util.BytesRef)
     */
    @Override
    public int frequency(BytesRef term) throws IOException {
        return reader.docFreq(new Term(field, term));
    }

    public String getField() {
        return field;
    }

    /* (non-Javadoc)
     * @see org.elasticsearch.search.suggest.phrase.CandidateGenerator#drawCandidates(org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.CandidateSet, int)
     */
    @Override
    public CandidateSet drawCandidates(CandidateSet set, int numCandidates) throws IOException {
        CharsRef spare = new CharsRef();
        BytesRef byteSpare = new BytesRef();
        Candidate original = set.originalTerm;
        BytesRef term = preFilter(original.term, spare, byteSpare);
        final int frequency = original.frequency;
        spellchecker.setThresholdFrequency(thresholdFrequency(frequency, docCount));
        SuggestWord[] suggestSimilar = spellchecker.suggestSimilar(new Term(field, term), numCandidates, reader, this.suggestMode);
        List<Candidate> candidates = new ArrayList<Candidate>(suggestSimilar.length);
        for (int i = 0; i < suggestSimilar.length; i++) {
            SuggestWord suggestWord = suggestSimilar[i];
            BytesRef candidate = new BytesRef(suggestWord.string);
            postFilter(new Candidate(candidate, suggestWord.freq, suggestWord.score, score(suggestWord.freq, suggestWord.score, docCount)), spare, byteSpare, candidates);
        }
        set.addCandidates(candidates);
        return set;
    }

    protected BytesRef preFilter(final BytesRef term, final CharsRef spare, final BytesRef byteSpare) throws IOException {
        if (preFilter == null) {
            return term;
        }
        final BytesRef result = byteSpare;
        SuggestUtils.analyze(preFilter, term, field, new SuggestUtils.TokenConsumer() {
            @Override
            public void nextToken() throws IOException {
                this.fillBytesRef(result);
            }
        }, spare);
        return result;
    }

    protected void postFilter(final Candidate candidate, final CharsRef spare, BytesRef byteSpare, final List<Candidate> candidates) throws IOException {
        if (postFilter == null) {
            candidates.add(candidate);
        } else {
            final BytesRef result = byteSpare;
            SuggestUtils.analyze(postFilter, candidate.term, field, new SuggestUtils.TokenConsumer() {
                @Override
                public void nextToken() throws IOException {
                    this.fillBytesRef(result);
                    if (posIncAttr.getPositionIncrement() > 0 && result.bytesEquals(candidate.term)) {
                        candidates.add(new Candidate(BytesRef.deepCopyOf(result), candidate.frequency, candidate.stringDistance, score(candidate.frequency, candidate.stringDistance, docCount)));
                    } else {
                        int freq = frequency(result);
                        candidates.add(new Candidate(BytesRef.deepCopyOf(result), freq, nonErrorLikelihood, score(candidate.frequency, candidate.stringDistance, docCount)));
                    }
                }
            }, spare);
        }
    }

    private double score(int frequency, double errorScore, int docCount) {
        return errorScore * (((double) frequency + 1) / ((double) docCount + 1));
    }

    protected int thresholdFrequency(int termFrequency, int docCount) {
        if (termFrequency > 0) {
            return (int) Math.round(termFrequency * (Math.log10(termFrequency - frequencyPlateau) * (1.0 / Math.log10(logBase))) + 1);
        }
        return 0;
    }

    public static class CandidateSet {
        public Candidate[] candidates;
        public final Candidate originalTerm;

        public CandidateSet(Candidate[] candidates, Candidate originalTerm) {
            this.candidates = candidates;
            this.originalTerm = originalTerm;
        }

        public void addCandidates(List<Candidate> candidates) {
            final Set<Candidate> set = new HashSet<DirectCandidateGenerator.Candidate>(candidates);
            for (int i = 0; i < this.candidates.length; i++) {
                set.add(this.candidates[i]);
            }
            this.candidates = set.toArray(new Candidate[set.size()]);
        }

        public void addOneCandidate(Candidate candidate) {
            Candidate[] candidates = new Candidate[this.candidates.length + 1];
            System.arraycopy(this.candidates, 0, candidates, 0, this.candidates.length);
            candidates[candidates.length - 1] = candidate;
            this.candidates = candidates;
        }
    }

    public static class Candidate {
        public static final Candidate[] EMPTY = new Candidate[0];
        public final BytesRef term;
        public final double stringDistance;
        public final int frequency;
        public final double score;

        public Candidate(BytesRef term, int frequency, double stringDistance, double score) {
            this.frequency = frequency;
            this.term = term;
            this.stringDistance = stringDistance;
            this.score = score;
        }

        @Override
        public String toString() {
            return "Candidate [term=" + term.utf8ToString() + ", stringDistance=" + stringDistance + ", frequency=" + frequency + "]";
        }

        @Override
        public int hashCode() {
            final int prime = 31;
            int result = 1;
            result = prime * result + ((term == null) ? 0 : term.hashCode());
            return result;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj)
                return true;
            if (obj == null)
                return false;
            if (getClass() != obj.getClass())
                return false;
            Candidate other = (Candidate) obj;
            if (term == null) {
                if (other.term != null)
                    return false;
            } else if (!term.equals(other.term))
                return false;
            return true;
        }
    }

    @Override
    public Candidate createCandidate(BytesRef term, int frequency, double channelScore) throws IOException {
        return new Candidate(term, frequency, channelScore, score(frequency, channelScore, docCount));
    }
}
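The private `score` method above is the generator's channel score: the string-distance based error score weighted by a smoothed document-frequency ratio, so that at equal edit distance the more frequent candidate wins. A minimal standalone sketch of just that formula (the `ChannelScoreDemo` class is made up for illustration):

```java
// Standalone sketch of DirectCandidateGenerator's channel score:
// errorScore * (freq + 1) / (docCount + 1), i.e. the string-distance score
// weighted by an add-one smoothed document frequency.
public class ChannelScoreDemo {
    static double score(int frequency, double errorScore, int docCount) {
        return errorScore * (((double) frequency + 1) / ((double) docCount + 1));
    }

    public static void main(String[] args) {
        // at equal string distance, the more frequent candidate scores higher
        System.out.println(score(100, 0.8, 1000) > score(1, 0.8, 1000));
        // unseen terms (freq == 0) still get a non-zero score thanks to the +1
        System.out.println(score(0, 0.8, 1000) > 0);
    }
}
```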

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.phrase;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.search.suggest.SuggestUtils;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.Candidate;
//TODO public for tests
public final class LaplaceScorer extends WordScorer {

    public static final WordScorerFactory FACTORY = new WordScorer.WordScorerFactory() {
        @Override
        public WordScorer newScorer(IndexReader reader, String field, double realWordLikelyhood, BytesRef separator) throws IOException {
            return new LaplaceScorer(reader, field, realWordLikelyhood, separator, 0.5);
        }
    };

    private double alpha;

    public LaplaceScorer(IndexReader reader, String field,
            double realWordLikelyhood, BytesRef separator, double alpha) throws IOException {
        super(reader, field, realWordLikelyhood, separator);
        this.alpha = alpha;
    }

    public double score(Candidate word, Candidate previousWord) throws IOException {
        SuggestUtils.join(separator, spare, previousWord.term, word.term);
        return (alpha + frequency(spare)) / (alpha + previousWord.frequency);
    }

    @Override
    protected double scoreBigram(Candidate word, Candidate w_1) throws IOException {
        SuggestUtils.join(separator, spare, w_1.term, word.term);
        return (alpha + frequency(spare)) / (alpha + w_1.frequency);
    }

    @Override
    protected double scoreTrigram(Candidate word, Candidate w_1, Candidate w_2) throws IOException {
        SuggestUtils.join(separator, spare, w_2.term, w_1.term, word.term);
        int trigramCount = frequency(spare);
        SuggestUtils.join(separator, spare, w_1.term, word.term);
        return (alpha + trigramCount) / (alpha + frequency(spare));
    }
}
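The add-alpha smoothing in `LaplaceScorer#scoreBigram` can be checked in isolation: it divides the (alpha-shifted) bigram count by the (alpha-shifted) count of the previous word, so unseen bigrams still get a small non-zero score. A hedged sketch with invented counts (the `LaplaceDemo` class is not part of the patch):

```java
// Standalone illustration of Laplace (add-alpha) smoothing as used in
// LaplaceScorer#scoreBigram: (alpha + count(w_1 w)) / (alpha + count(w_1)).
public class LaplaceDemo {
    static double scoreBigram(long bigramFreq, long prevFreq, double alpha) {
        return (alpha + bigramFreq) / (alpha + prevFreq);
    }

    public static void main(String[] args) {
        // an unseen bigram still receives a small non-zero score
        System.out.println(scoreBigram(0, 100, 0.5) > 0);
        // more bigram evidence yields a higher score
        System.out.println(scoreBigram(10, 100, 0.5) > scoreBigram(1, 100, 0.5));
    }
}
```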

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.phrase;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.search.suggest.SuggestUtils;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.Candidate;
//TODO public for tests
public final class LinearInterpoatingScorer extends WordScorer {

    private final double unigramLambda;
    private final double bigramLambda;
    private final double trigramLambda;

    public LinearInterpoatingScorer(IndexReader reader, String field, double realWordLikelyhood, BytesRef separator, double trigramLambda, double bigramLambda, double unigramLambda)
            throws IOException {
        super(reader, field, realWordLikelyhood, separator);
        // normalize the lambdas so they sum to 1
        double sum = unigramLambda + bigramLambda + trigramLambda;
        this.unigramLambda = unigramLambda / sum;
        this.bigramLambda = bigramLambda / sum;
        this.trigramLambda = trigramLambda / sum;
    }

    @Override
    protected double scoreBigram(Candidate word, Candidate w_1) throws IOException {
        SuggestUtils.join(separator, spare, w_1.term, word.term);
        final int count = frequency(spare);
        if (count < 1) {
            return unigramLambda * scoreUnigram(word);
        }
        return bigramLambda * (count / (0.5d + w_1.frequency)) + unigramLambda * scoreUnigram(word);
    }

    @Override
    protected double scoreTrigram(Candidate w, Candidate w_1, Candidate w_2) throws IOException {
        // join in index order (w_2 w_1 w), consistent with scoreBigram above
        SuggestUtils.join(separator, spare, w_2.term, w_1.term, w.term);
        final int count = frequency(spare);
        if (count < 1) {
            return scoreBigram(w, w_1);
        }
        SuggestUtils.join(separator, spare, w_1.term, w.term);
        return trigramLambda * (count / (1.d + frequency(spare))) + scoreBigram(w, w_1);
    }
}
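The interpolation idea above is simple once the scorer plumbing is stripped away: normalize the three lambdas, then mix the trigram, bigram, and unigram estimates. A minimal sketch under that reading (the `InterpolationDemo` class and its literal probabilities are invented):

```java
// Standalone illustration of the linearly interpolated language model used by
// LinearInterpoatingScorer: lambdas are normalized to sum to 1, then each
// n-gram order's relative-frequency estimate is mixed into one score.
public class InterpolationDemo {
    static double interpolate(double triProb, double biProb, double uniProb,
                              double triLambda, double biLambda, double uniLambda) {
        double sum = triLambda + biLambda + uniLambda; // normalize as the scorer does
        return (triLambda * triProb + biLambda * biProb + uniLambda * uniProb) / sum;
    }

    public static void main(String[] args) {
        // with equal lambdas the result is the plain average of the three estimates
        System.out.println(interpolate(0.9, 0.5, 0.1, 1, 1, 1));
    }
}
```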

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.phrase;
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.Candidate;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.CandidateSet;
//TODO public for tests
public final class MultiCandidateGeneratorWrapper extends CandidateGenerator {

    private final CandidateGenerator[] candidateGenerator;

    public MultiCandidateGeneratorWrapper(CandidateGenerator... candidateGenerators) {
        this.candidateGenerator = candidateGenerators;
    }

    @Override
    public boolean isKnownWord(BytesRef term) throws IOException {
        return candidateGenerator[0].isKnownWord(term);
    }

    @Override
    public int frequency(BytesRef term) throws IOException {
        return candidateGenerator[0].frequency(term);
    }

    @Override
    public CandidateSet drawCandidates(CandidateSet set, int numCandidates) throws IOException {
        for (CandidateGenerator generator : candidateGenerator) {
            generator.drawCandidates(set, numCandidates);
        }
        return reduce(set, numCandidates);
    }

    private final CandidateSet reduce(CandidateSet set, int numCandidates) {
        if (set.candidates.length > numCandidates) {
            Candidate[] candidates = set.candidates;
            Arrays.sort(candidates, new Comparator<Candidate>() {
                @Override
                public int compare(Candidate left, Candidate right) {
                    return Double.compare(right.score, left.score);
                }
            });
            Candidate[] newSet = new Candidate[numCandidates];
            System.arraycopy(candidates, 0, newSet, 0, numCandidates);
            set.candidates = newSet;
        }
        return set;
    }

    @Override
    public Candidate createCandidate(BytesRef term, int frequency, double channelScore) throws IOException {
        return candidateGenerator[0].createCandidate(term, frequency, channelScore);
    }
}

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.phrase;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.UnicodeUtil;
import org.elasticsearch.common.io.FastCharArrayReader;
import org.elasticsearch.search.suggest.SuggestUtils;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.Candidate;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.CandidateSet;
//TODO public for tests
public final class NoisyChannelSpellChecker {
    public static final double REAL_WORD_LIKELYHOOD = 0.95d;
    private final double realWordLikelihood;
    private final boolean requireUnigram;

    public NoisyChannelSpellChecker() {
        this(REAL_WORD_LIKELYHOOD);
    }

    public NoisyChannelSpellChecker(double nonErrorLikelihood) {
        this(nonErrorLikelihood, true);
    }

    public NoisyChannelSpellChecker(double nonErrorLikelihood, boolean requireUnigram) {
        this.realWordLikelihood = nonErrorLikelihood;
        this.requireUnigram = requireUnigram;
    }

    public Correction[] getCorrections(TokenStream stream, final CandidateGenerator generator, final int numCandidates,
            float maxErrors, int numCorrections, IndexReader reader, WordScorer wordScorer, BytesRef separator, float confidence, int gramSize) throws IOException {
        final List<CandidateSet> candidateSetsList = new ArrayList<DirectCandidateGenerator.CandidateSet>();
        SuggestUtils.analyze(stream, new SuggestUtils.TokenConsumer() {
            CandidateSet currentSet = null;
            private TypeAttribute typeAttribute;
            private final BytesRef termsRef = new BytesRef();
            private boolean anyUnigram = false;
            private boolean anyTokens = false;

            @Override
            public void reset(TokenStream stream) {
                super.reset(stream);
                typeAttribute = stream.addAttribute(TypeAttribute.class);
            }

            @Override
            public void nextToken() throws IOException {
                anyTokens = true;
                BytesRef term = fillBytesRef(termsRef);
                if (requireUnigram && typeAttribute.type() == ShingleFilter.DEFAULT_TOKEN_TYPE) {
                    return;
                }
                anyUnigram = true;
                if (posIncAttr.getPositionIncrement() == 0 && typeAttribute.type() == SynonymFilter.TYPE_SYNONYM) {
                    assert currentSet != null;
                    int freq = 0;
                    if ((freq = generator.frequency(term)) > 0) {
                        currentSet.addOneCandidate(generator.createCandidate(BytesRef.deepCopyOf(term), freq, realWordLikelihood));
                    }
                } else {
                    if (currentSet != null) {
                        candidateSetsList.add(currentSet);
                    }
                    currentSet = new CandidateSet(Candidate.EMPTY, generator.createCandidate(BytesRef.deepCopyOf(term)));
                }
            }

            @Override
            public void end() {
                if (currentSet != null) {
                    candidateSetsList.add(currentSet);
                }
                if (requireUnigram && !anyUnigram && anyTokens) {
                    throw new IllegalStateException("At least one unigram is required but all tokens were ngrams");
                }
            }
        });
        for (CandidateSet candidateSet : candidateSetsList) {
            generator.drawCandidates(candidateSet, numCandidates);
        }
        double cutoffScore = Double.MIN_VALUE;
        CandidateScorer scorer = new CandidateScorer(wordScorer, numCorrections, gramSize);
        CandidateSet[] candidateSets = candidateSetsList.toArray(new CandidateSet[candidateSetsList.size()]);
        if (confidence > 0.0) {
            Candidate[] candidates = new Candidate[candidateSets.length];
            for (int i = 0; i < candidates.length; i++) {
                candidates[i] = candidateSets[i].originalTerm;
            }
            cutoffScore = scorer.score(candidates, candidateSets);
        }
        return scorer.findBestCandiates(candidateSets, maxErrors, cutoffScore * confidence);
    }

    public Correction[] getCorrections(Analyzer analyzer, BytesRef query, CandidateGenerator generator, int numCandidates,
            float maxErrors, int numCorrections, IndexReader reader, String analysisField, WordScorer scorer, float confidence, int gramSize) throws IOException {
        return getCorrections(tokenStream(analyzer, query, new CharsRef(), analysisField), generator, numCandidates, maxErrors, numCorrections, reader, scorer, new BytesRef(" "), confidence, gramSize);
    }

    public TokenStream tokenStream(Analyzer analyzer, BytesRef query, CharsRef spare, String field) throws IOException {
        UnicodeUtil.UTF8toUTF16(query, spare);
        return analyzer.tokenStream(field, new FastCharArrayReader(spare.chars, spare.offset, spare.length));
    }
}
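At its core the spell checker above makes a noisy-channel decision per position: each candidate's channel score (how plausible the misspelling is) is combined with a language-model score (how likely the word is in context), and the best product wins. A toy sketch of that selection rule; the candidate words and all numeric scores below are invented for illustration:

```java
// Toy version of the noisy-channel decision: pick the candidate maximizing
// channelScore * languageModelScore. Returns the index of the winner.
public class NoisyChannelDemo {
    static int best(double[] channelScores, double[] lmScores) {
        int best = 0;
        for (int i = 1; i < channelScores.length; i++) {
            if (channelScores[i] * lmScores[i] > channelScores[best] * lmScores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] candidates = {"got", "god", "good"};
        // keeping "got" has the best channel score, but "god" is far more
        // likely under the (made-up) language model for this context
        double[] channel = {0.95, 0.70, 0.40};
        double[] lm = {0.001, 0.200, 0.010};
        System.out.println(candidates[best(channel, lm)]); // -> god
    }
}
```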

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.phrase;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.ElasticSearchIllegalArgumentException;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.common.xcontent.XContentParser.Token;
import org.elasticsearch.index.analysis.ShingleTokenFilterFactory;
import org.elasticsearch.search.internal.SearchContext;
import org.elasticsearch.search.suggest.SuggestContextParser;
import org.elasticsearch.search.suggest.SuggestUtils;
import org.elasticsearch.search.suggest.SuggestionSearchContext;
import org.elasticsearch.search.suggest.phrase.PhraseSuggestionContext.DirectCandidateGenerator;
public final class PhraseSuggestParser implements SuggestContextParser {

    private final PhraseSuggester suggester = new PhraseSuggester();

    public SuggestionSearchContext.SuggestionContext parse(XContentParser parser, SearchContext context) throws IOException {
        PhraseSuggestionContext suggestion = new PhraseSuggestionContext(suggester);
        XContentParser.Token token;
        String fieldName = null;
        boolean gramSizeSet = false;
        while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
            if (token == XContentParser.Token.FIELD_NAME) {
                fieldName = parser.currentName();
            } else if (token.isValue()) {
                if (!SuggestUtils.parseSuggestContext(parser, context, fieldName, suggestion)) {
                    if ("real_word_error_likelihood".equals(fieldName)) {
                        suggestion.setRealWordErrorLikelihood(parser.floatValue());
                        if (suggestion.realworldErrorLikelyhood() <= 0.0) {
                            throw new ElasticSearchIllegalArgumentException("real_word_error_likelihood must be > 0.0");
                        }
                    } else if ("confidence".equals(fieldName)) {
                        suggestion.setConfidence(parser.floatValue());
                        if (suggestion.confidence() < 0.0) {
                            throw new ElasticSearchIllegalArgumentException("confidence must be >= 0.0");
                        }
                    } else if ("separator".equals(fieldName)) {
                        suggestion.setSeparator(new BytesRef(parser.text()));
                    } else if ("max_errors".equals(fieldName)) {
                        suggestion.setMaxErrors(parser.floatValue());
                        if (suggestion.maxErrors() <= 0.0) {
                            throw new ElasticSearchIllegalArgumentException("max_errors must be > 0.0");
                        }
                    } else if ("gram_size".equals(fieldName)) {
                        suggestion.setGramSize(parser.intValue());
                        if (suggestion.gramSize() < 1) {
                            throw new ElasticSearchIllegalArgumentException("gram_size must be >= 1");
                        }
                        gramSizeSet = true;
                    } else if ("force_unigrams".equals(fieldName)) {
                        suggestion.setRequireUnigram(parser.booleanValue());
                    }
                }
            } else if (token == Token.START_ARRAY) {
                if ("direct_generator".equals(fieldName)) {
                    // for now we only have a single type of generators
                    while ((token = parser.nextToken()) == Token.START_OBJECT) {
                        PhraseSuggestionContext.DirectCandidateGenerator generator = new PhraseSuggestionContext.DirectCandidateGenerator();
                        while ((token = parser.nextToken()) != Token.END_OBJECT) {
                            if (token == XContentParser.Token.FIELD_NAME) {
                                fieldName = parser.currentName();
                            }
                            if (token.isValue()) {
                                parseCandidateGenerator(parser, context, fieldName, generator);
                            }
                        }
                        verifyGenerator(context, generator);
                        suggestion.addGenerator(generator);
                    }
                } else {
                    throw new ElasticSearchIllegalArgumentException("suggester[phrase] doesn't support array field [" + fieldName + "]");
                }
            } else if (token == Token.START_OBJECT) {
                if ("linear".equals(fieldName)) {
                    ensureNoSmoothing(suggestion);
                    final double[] lambdas = new double[3];
                    while ((token = parser.nextToken()) != Token.END_OBJECT) {
                        if (token == XContentParser.Token.FIELD_NAME) {
                            fieldName = parser.currentName();
                        }
                        if (token.isValue()) {
                            if ("trigram_lambda".equals(fieldName)) {
                                lambdas[0] = parser.doubleValue();
                                if (lambdas[0] < 0) {
                                    throw new ElasticSearchIllegalArgumentException("trigram_lambda must be positive");
                                }
                            }
                            if ("bigram_lambda".equals(fieldName)) {
                                lambdas[1] = parser.doubleValue();
                                if (lambdas[1] < 0) {
                                    throw new ElasticSearchIllegalArgumentException("bigram_lambda must be positive");
                                }
                            }
                            if ("unigram_lambda".equals(fieldName)) {
                                lambdas[2] = parser.doubleValue();
                                if (lambdas[2] < 0) {
                                    throw new ElasticSearchIllegalArgumentException("unigram_lambda must be positive");
                                }
                            }
                        }
                    }
                    double sum = 0.0d;
                    for (int i = 0; i < lambdas.length; i++) {
                        sum += lambdas[i];
                    }
                    if (Math.abs(sum - 1.0) > 0.001) {
                        throw new ElasticSearchIllegalArgumentException("linear smoothing lambdas must sum to 1");
                    }
                    suggestion.setModel(new WordScorer.WordScorerFactory() {
                        @Override
                        public WordScorer newScorer(IndexReader reader, String field, double realWordLikelyhood, BytesRef separator)
                                throws IOException {
                            return new LinearInterpoatingScorer(reader, field, realWordLikelyhood, separator, lambdas[0], lambdas[1],
                                    lambdas[2]);
                        }
                    });
                } else if ("laplace".equals(fieldName)) {
                    ensureNoSmoothing(suggestion);
                    double theAlpha = 0.5;
                    while ((token = parser.nextToken()) != Token.END_OBJECT) {
                        if (token == XContentParser.Token.FIELD_NAME) {
                            fieldName = parser.currentName();
                        }
                        if (token.isValue()) {
                            if ("alpha".equals(fieldName)) {
                                theAlpha = parser.doubleValue();
                            }
                        }
                    }
                    final double alpha = theAlpha;
                    suggestion.setModel(new WordScorer.WordScorerFactory() {
                        @Override
                        public WordScorer newScorer(IndexReader reader, String field, double realWordLikelyhood, BytesRef separator) throws IOException {
                            return new LaplaceScorer(reader, field, realWordLikelyhood, separator, alpha);
                        }
                    });
                } else if ("stupid_backoff".equals(fieldName)) {
                    ensureNoSmoothing(suggestion);
                    double theDiscount = 0.4;
                    while ((token = parser.nextToken()) != Token.END_OBJECT) {
                        if (token == XContentParser.Token.FIELD_NAME) {
                            fieldName = parser.currentName();
                        }
                        if (token.isValue()) {
                            if ("discount".equals(fieldName)) {
                                theDiscount = parser.doubleValue();
                            }
                        }
                    }
                    final double discount = theDiscount;
                    suggestion.setModel(new WordScorer.WordScorerFactory() {
                        @Override
                        public WordScorer newScorer(IndexReader reader, String field, double realWordLikelyhood, BytesRef separator) throws IOException {
                            return new StupidBackoffScorer(reader, field, realWordLikelyhood, separator, discount);
                        }
                    });
                } else {
                    throw new ElasticSearchIllegalArgumentException("suggester[phrase] doesn't support object field [" + fieldName + "]");
                }
            } else {
                throw new ElasticSearchIllegalArgumentException("suggester[phrase] doesn't support field [" + fieldName + "]");
            }
        }
        if (suggestion.getField() == null) {
            throw new ElasticSearchIllegalArgumentException("The required field option is missing");
        }
        if (suggestion.model() == null) {
            suggestion.setModel(LaplaceScorer.FACTORY);
        }
        if (!gramSizeSet || suggestion.generators().isEmpty()) {
            final ShingleTokenFilterFactory shingleFilterFactory = SuggestUtils.getShingleFilterFactory(suggestion.getAnalyzer() == null ? context.mapperService().fieldSearchAnalyzer(suggestion.getField()) : suggestion.getAnalyzer());
            if (!gramSizeSet) {
                // try to detect the shingle size
                if (shingleFilterFactory != null) {
                    suggestion.setGramSize(shingleFilterFactory.getMaxShingleSize());
                    if (suggestion.getAnalyzer() == null && shingleFilterFactory.getMinShingleSize() > 1 && !shingleFilterFactory.getOutputUnigrams()) {
                        throw new ElasticSearchIllegalArgumentException("The default analyzer for field: [" + suggestion.getField() + "] doesn't emit unigrams. If this is intentional try to set the analyzer explicitly");
                    }
                }
            }
            if (suggestion.generators().isEmpty()) {
                if (shingleFilterFactory != null && shingleFilterFactory.getMinShingleSize() > 1 && !shingleFilterFactory.getOutputUnigrams() && suggestion.getRequireUnigram()) {
                    throw new ElasticSearchIllegalArgumentException("The default candidate generator for phrase suggest can't operate on field: [" + suggestion.getField() + "] since it doesn't emit unigrams. If this is intentional try to set the candidate generator field explicitly");
                }
                // use a default generator on the same field
                DirectCandidateGenerator generator = new DirectCandidateGenerator();
                generator.setField(suggestion.getField());
                suggestion.addGenerator(generator);
            }
        }
        return suggestion;
    }

    private void ensureNoSmoothing(PhraseSuggestionContext suggestion) {
        if (suggestion.model() != null) {
            throw new ElasticSearchIllegalArgumentException("only one smoothing model supported");
        }
    }

    private void verifyGenerator(SearchContext context, PhraseSuggestionContext.DirectCandidateGenerator suggestion) {
        // Verify options and set defaults
        if (suggestion.field() == null) {
            throw new ElasticSearchIllegalArgumentException("The required field option is missing");
        }
    }

    private void parseCandidateGenerator(XContentParser parser, SearchContext context, String fieldName,
            PhraseSuggestionContext.DirectCandidateGenerator generator) throws IOException {
        if (!SuggestUtils.parseDirectSpellcheckerSettings(parser, fieldName, generator)) {
            if ("field".equals(fieldName)) {
                generator.setField(parser.text());
            } else if ("size".equals(fieldName)) {
                generator.size(parser.intValue());
            } else if ("pre_filter".equals(fieldName) || "preFilter".equals(fieldName)) {
                String analyzerName = parser.text();
                Analyzer analyzer = context.mapperService().analysisService().analyzer(analyzerName);
                if (analyzer == null) {
                    throw new ElasticSearchIllegalArgumentException("Analyzer [" + analyzerName + "] doesn't exist");
                }
                generator.preFilter(analyzer);
            } else if ("post_filter".equals(fieldName) || "postFilter".equals(fieldName)) {
                String analyzerName = parser.text();
                Analyzer analyzer = context.mapperService().analysisService().analyzer(analyzerName);
                if (analyzer == null) {
                    throw new ElasticSearchIllegalArgumentException("Analyzer [" + analyzerName + "] doesn't exist");
                }
                generator.postFilter(analyzer);
            } else {
                throw new ElasticSearchIllegalArgumentException("CandidateGenerator doesn't support [" + fieldName + "]");
            }
        }
    }
}

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.phrase;
import java.io.IOException;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.UnicodeUtil;
import org.elasticsearch.common.text.StringText;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.search.internal.SearchContext;
import org.elasticsearch.search.suggest.Suggest.Suggestion;
import org.elasticsearch.search.suggest.Suggest.Suggestion.Entry;
import org.elasticsearch.search.suggest.Suggest.Suggestion.Entry.Option;
import org.elasticsearch.search.suggest.SuggestUtils;
import org.elasticsearch.search.suggest.Suggester;
final class PhraseSuggester implements Suggester<PhraseSuggestionContext> {
private final BytesRef SEPARATOR = new BytesRef(" ");
/*
* More Ideas:
* - add ability to find whitespace problems -> we can build a poor mans decompounder with our index based on a automaton?
* - add ability to build different error models maybe based on a confusion matrix?
* - try to combine a token with its subsequent token to find / detect word splits (optional)
* - for this to work we need some way to defined the position length of a candidate
* - phonetic filters could be interesting here too for candidate selection
*/
@Override
public Suggestion<? extends Entry<? extends Option>> execute(String name, PhraseSuggestionContext suggestion,
SearchContext context, CharsRef spare) throws IOException {
final IndexReader indexReader = context.searcher().getIndexReader();
double realWordErrorLikelihood = suggestion.realworldErrorLikelyhood();
List<PhraseSuggestionContext.DirectCandidateGenerator> generators = suggestion.generators();
CandidateGenerator[] gens = new CandidateGenerator[generators.size()];
for (int i = 0; i < gens.length; i++) {
PhraseSuggestionContext.DirectCandidateGenerator generator = generators.get(i);
DirectSpellChecker directSpellChecker = SuggestUtils.getDirectSpellChecker(generator);
gens[i] = new DirectCandidateGenerator(directSpellChecker, generator.field(), generator.suggestMode(), indexReader, realWordErrorLikelihood, generator.preFilter(), generator.postFilter());
}
final NoisyChannelSpellChecker checker = new NoisyChannelSpellChecker(realWordErrorLikelihood, suggestion.getRequireUnigram());
final BytesRef separator = suggestion.separator();
TokenStream stream = checker.tokenStream(suggestion.getAnalyzer(), suggestion.getText(), spare, suggestion.getField());
WordScorer wordScorer = suggestion.model().newScorer(indexReader, suggestion.getField(), realWordErrorLikelihood, separator);
Correction[] corrections = checker.getCorrections(stream, new MultiCandidateGeneratorWrapper(gens), suggestion.getShardSize(), suggestion.maxErrors(),
suggestion.getShardSize(), indexReader, wordScorer, separator, suggestion.confidence(), suggestion.gramSize());
UnicodeUtil.UTF8toUTF16(suggestion.getText(), spare);
Suggestion.Entry<Option> resultEntry = new Suggestion.Entry<Option>(new StringText(spare.toString()), 0, spare.length);
BytesRef byteSpare = new BytesRef();
for (Correction correction : corrections) {
UnicodeUtil.UTF8toUTF16(correction.join(SEPARATOR, byteSpare), spare);
Text phrase = new StringText(spare.toString());
resultEntry.addOption(new Suggestion.Entry.Option(phrase, (float) (correction.score)));
}
final Suggestion<Entry<Option>> response = new Suggestion<Entry<Option>>(name, suggestion.getSize());
response.addTerm(resultEntry);
return response;
}
}
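
The `execute` method above wires candidate generation to a noisy-channel scorer: each corrected phrase is scored by combining a channel score (how plausible the edit is) with a language-model score, accumulated in log space as in `WordScorer#score`. A toy sketch of that combination with made-up probabilities (plain Java, no Elasticsearch dependencies, all numbers invented for illustration):

```java
// Toy illustration of the noisy-channel scoring used by the phrase suggester:
// score(path) = sum_i log10(channel(c_i) * lm(c_i | context)).
// The probabilities below are made up for the example.
public class NoisyChannelDemo {
    static double pathScore(double[] channel, double[] lm) {
        double score = 0d;
        for (int i = 0; i < channel.length; i++) {
            // each term contributes its channel score times its language-model score
            score += Math.log10(channel[i] * lm[i]);
        }
        return score;
    }

    public static void main(String[] args) {
        // candidate A: two corrected terms with strong language-model scores
        double a = pathScore(new double[]{0.7, 0.7}, new double[]{0.2, 0.3});
        // candidate B: unedited terms (channel = realWordErrorLikelihood) but an unlikely phrase
        double b = pathScore(new double[]{0.95, 0.95}, new double[]{0.001, 0.001});
        System.out.println(a > b); // the corrected phrase wins despite its edits
    }
}
```

Even though both of candidate A's terms were edited, its stronger language-model probabilities outweigh the unedited candidate's channel advantage; this is why real-word errors can still be corrected.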

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.phrase;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import org.elasticsearch.ElasticSearchIllegalArgumentException;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.search.suggest.SuggestBuilder.SuggestionBuilder;
/**
* Defines the actual suggest command for phrase suggestions (<tt>phrase</tt>).
*/
public final class PhraseSuggestionBuilder extends SuggestionBuilder<PhraseSuggestionBuilder> {
private Float maxErrors;
private String separator;
private Float realWordErrorLikelihood;
private Float confidence;
private final Map<String, List<CandidateGenerator>> generators = new HashMap<String, List<PhraseSuggestionBuilder.CandidateGenerator>>();
private Integer gramSize;
private SmoothingModel model;
private Boolean forceUnigrams;
public PhraseSuggestionBuilder(String name) {
super(name, "phrase");
}
/**
* Sets the gram size for the n-gram model used for this suggester. The
* default value is <tt>1</tt> corresponding to <tt>unigrams</tt>. Use
* <tt>2</tt> for <tt>bigrams</tt> and <tt>3</tt> for <tt>trigrams</tt>.
*/
public PhraseSuggestionBuilder gramSize(int gramSize) {
if (gramSize < 1) {
throw new ElasticSearchIllegalArgumentException("gramSize must be >= 1");
}
this.gramSize = gramSize;
return this;
}
/**
* Sets the maximum number of terms that can be considered misspellings in
* order to form a correction. This method accepts a float value in the
* range [0..1) as a fraction of the actual query terms, or a number
* <tt>&gt;=1</tt> as an absolute number of query terms.
*
* The default is <tt>1.0</tt>, which means only corrections with at most
* one misspelled term are returned.
*/
public PhraseSuggestionBuilder maxErrors(Float maxErrors) {
this.maxErrors = maxErrors;
return this;
}
/**
* Sets the separator that is used to separate terms in the bigram field. If
* not set the whitespace character is used as a separator.
*/
public PhraseSuggestionBuilder separator(String separator) {
this.separator = separator;
return this;
}
/**
* Sets the likelihood of a term being misspelled even if the term exists
* in the dictionary. The default is <tt>0.95</tt>, corresponding to 5% of
* the real words being misspelled.
*/
public PhraseSuggestionBuilder realWordErrorLikelihood(Float realWordErrorLikelihood) {
this.realWordErrorLikelihood = realWordErrorLikelihood;
return this;
}
/**
* Sets the confidence level for this suggester. The confidence level
* defines a factor applied to the input phrase's score, which is used as a
* threshold for other suggest candidates. Only candidates that score higher
* than the threshold will be included in the result. For instance a
* confidence level of <tt>1.0</tt> will only return suggestions that score
* higher than the input phrase. If set to <tt>0.0</tt> the top N candidates
* are returned. The default is <tt>1.0</tt>.
*/
public PhraseSuggestionBuilder confidence(Float confidence) {
this.confidence = confidence;
return this;
}
/**
* Adds a {@link CandidateGenerator} to this suggester. The
* {@link CandidateGenerator} is used to draw candidates for each individual
* phrase term before the candidates are scored.
*/
public PhraseSuggestionBuilder addCandidateGenerator(CandidateGenerator generator) {
List<CandidateGenerator> list = this.generators.get(generator.getType());
if (list == null) {
list = new ArrayList<PhraseSuggestionBuilder.CandidateGenerator>();
this.generators.put(generator.getType(), list);
}
list.add(generator);
return this;
}
/**
* If set to <code>true</code> the phrase suggester will fail if the analyzer only
* produces ngrams. The default is <code>true</code>.
*/
public PhraseSuggestionBuilder forceUnigrams(boolean forceUnigrams) {
this.forceUnigrams = forceUnigrams;
return this;
}
/**
* Sets an explicit smoothing model used for this suggester. The default is
* {@link Laplace}.
*/
public PhraseSuggestionBuilder smoothingModel(SmoothingModel model) {
this.model = model;
return this;
}
@Override
public XContentBuilder innerToXContent(XContentBuilder builder, Params params) throws IOException {
if (realWordErrorLikelihood != null) {
builder.field("real_word_error_likelihood", realWordErrorLikelihood);
}
if (confidence != null) {
builder.field("confidence", confidence);
}
if (separator != null) {
builder.field("separator", separator);
}
if (maxErrors != null) {
builder.field("max_errors", maxErrors);
}
if (gramSize != null) {
builder.field("gram_size", gramSize);
}
if (forceUnigrams != null) {
builder.field("force_unigrams", forceUnigrams);
}
if (!generators.isEmpty()) {
Set<Entry<String, List<CandidateGenerator>>> entrySet = generators.entrySet();
for (Entry<String, List<CandidateGenerator>> entry : entrySet) {
builder.startArray(entry.getKey());
for (CandidateGenerator generator : entry.getValue()) {
generator.toXContent(builder, params);
}
builder.endArray();
}
}
if (model != null) {
builder.startObject(model.type);
model.toXContent(builder, params);
builder.endObject();
}
return builder;
}
/**
* Creates a new {@link DirectCandidateGenerator}
*
* @param field
* the field this candidate generator operates on.
*/
public static DirectCandidateGenerator candidateGenerator(String field) {
return new DirectCandidateGenerator(field);
}
/**
* A "stupid-backoff" smoothing model simialr to <a
* href="http://en.wikipedia.org/wiki/Katz's_back-off_model"> Katz's
* Backoff</a>.
* <p>
* See <a
* href="http://en.wikipedia.org/wiki/N-gram#Smoothing_techniques">N-Gram
* Smoothing</a> for details.
* </p>
*/
public static final class StupidBackoff extends SmoothingModel {
private final double discount;
/**
* Creates a Stupid-Backoff smoothing model.
*
* @param discount
* the discount given to lower order ngrams if the higher order ngram doesn't exist
*/
public StupidBackoff(double discount) {
super("stupid_backoff");
this.discount = discount;
}
@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder = super.toXContent(builder, params);
builder.field("discount", discount);
return builder;
}
}
/**
* An <a href="http://en.wikipedia.org/wiki/Additive_smoothing">additive
* smoothing</a> model. Laplace is used as the default if no smoothing model
* is configured.
* <p>
* See <a
* href="http://en.wikipedia.org/wiki/N-gram#Smoothing_techniques">N-Gram
* Smoothing</a> for details.
* </p>
*/
public static final class Laplace extends SmoothingModel {
private final double alpha;
/**
* Creates a Laplace smoothing model.
*
* @param alpha
* the constant added to all counts (additive smoothing parameter)
*/
public Laplace(double alpha) {
super("laplace");
this.alpha = alpha;
}
@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder = super.toXContent(builder, params);
builder.field("alpha", alpha);
return builder;
}
}
public static class SmoothingModel implements ToXContent {
private final String type;
protected SmoothingModel(String type) {
this.type = type;
}
@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
return builder;
}
}
/**
* Linear interpolation smoothing model.
* <p>
* See <a
* href="http://en.wikipedia.org/wiki/N-gram#Smoothing_techniques">N-Gram
* Smoothing</a> for details.
* </p>
*/
public static final class LinearInterpolation extends SmoothingModel {
private final double trigramLambda;
private final double bigramLambda;
private final double unigramLambda;
/**
* Creates a linear interpolation smoothing model.
*
* Note: the lambdas must sum up to one.
*
* @param trigramLambda
* the trigram lambda
* @param bigramLambda
* the bigram lambda
* @param unigramLambda
* the unigram lambda
*/
public LinearInterpolation(double trigramLambda, double bigramLambda, double unigramLambda) {
super("linear");
this.trigramLambda = trigramLambda;
this.bigramLambda = bigramLambda;
this.unigramLambda = unigramLambda;
}
@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder = super.toXContent(builder, params);
builder.field("trigram_lambda", trigramLambda);
builder.field("bigram_lambda", bigramLambda);
builder.field("unigram_lambda", unigramLambda);
return builder;
}
}
/**
* {@link CandidateGenerator} base class.
*/
public static abstract class CandidateGenerator implements ToXContent {
private final String type;
public CandidateGenerator(String type) {
this.type = type;
}
public String getType() {
return type;
}
}
/**
* A {@link CandidateGenerator} that draws its candidates directly from the
* terms of an index field.
*/
public static final class DirectCandidateGenerator extends CandidateGenerator {
private final String field;
private String preFilter;
private String postFilter;
private String suggestMode;
private Float accuracy;
private Integer size;
private String sort;
private String stringDistance;
private Integer maxEdits;
private Integer maxInspections;
private Float maxTermFreq;
private Integer prefixLength;
private Integer minWordLength;
private Float minDocFreq;
/**
* Sets the field to fetch candidate suggestions from. This is a
* required option and needs to be set via this constructor or the
* {@link org.elasticsearch.search.suggest.SuggestBuilder.TermSuggestionBuilder#setField(String)}
* method.
*/
public DirectCandidateGenerator(String field) {
super("direct_generator");
this.field = field;
}
/**
* The global suggest mode controls which suggested terms are included,
* i.e. for which suggest text tokens terms should be suggested.
* Three possible values can be specified:
* <ol>
* <li><code>missing</code> - Only suggest terms in the suggest text
* that aren't in the index. This is the default.
* <li><code>popular</code> - Only suggest terms that occur in more docs
* than the original suggest text term.
* <li><code>always</code> - Suggest any matching suggest terms based on
* tokens in the suggest text.
* </ol>
*/
public DirectCandidateGenerator suggestMode(String suggestMode) {
this.suggestMode = suggestMode;
return this;
}
/**
* Sets how similar, at minimum, the suggested terms need to be to the
* original suggest text tokens. A value between 0 and 1 can be
* specified. This value will be compared to the string distance result
* of each candidate spelling correction.
* <p/>
* Default is <tt>0.5</tt>
*/
public DirectCandidateGenerator accuracy(float accuracy) {
this.accuracy = accuracy;
return this;
}
/**
* Sets the maximum suggestions to be returned per suggest text term.
*/
public DirectCandidateGenerator size(int size) {
if (size <= 0) {
throw new ElasticSearchIllegalArgumentException("Size must be positive");
}
this.size = size;
return this;
}
/**
* Sets how to sort the suggest terms per suggest text token. Two
* possible values:
* <ol>
* <li><code>score</code> - Sort should first be based on score, then
* document frequency and then the term itself.
* <li><code>frequency</code> - Sort should first be based on document
* frequency, then score and then the term itself.
* </ol>
* <p/>
* What the score is depends on the suggester being used.
*/
public DirectCandidateGenerator sort(String sort) {
this.sort = sort;
return this;
}
/**
* Sets what string distance implementation to use for comparing how
* similar suggested terms are. Five possible values can be specified:
* <ol>
* <li><code>internal</code> - This is the default and is based on
* <code>damerau_levenshtein</code>, but highly optimized for comparing
* string distance for terms inside the index.
* <li><code>damerau_levenshtein</code> - String distance algorithm
* based on Damerau-Levenshtein algorithm.
* <li><code>levenstein</code> - String distance algorithm based on
* the Levenshtein edit distance.
* <li><code>jarowinkler</code> - String distance algorithm based on
* Jaro-Winkler algorithm.
* <li><code>ngram</code> - String distance algorithm based on character
* n-grams.
* </ol>
*/
public DirectCandidateGenerator stringDistance(String stringDistance) {
this.stringDistance = stringDistance;
return this;
}
/**
* Sets the maximum edit distance candidate suggestions can have in
* order to be considered as a suggestion. Can only be a value between 1
* and 2. Any other value results in a bad request error being thrown.
* Defaults to <tt>2</tt>.
*/
public DirectCandidateGenerator maxEdits(Integer maxEdits) {
this.maxEdits = maxEdits;
return this;
}
/**
* A factor that is multiplied with the size in order to inspect
* more candidate suggestions. Can improve accuracy at the cost of
* performance. Defaults to <tt>5</tt>.
*/
public DirectCandidateGenerator maxInspections(Integer maxInspections) {
this.maxInspections = maxInspections;
return this;
}
/**
* Sets a maximum threshold on the number of documents a suggest text token
* can exist in, in order to be corrected. Can be a relative percentage
* (e.g. 0.4) or an absolute number representing document frequencies. If
* a value higher than 1 is specified then the value cannot be fractional.
* Defaults to <tt>0.01</tt>.
* <p/>
* This can be used to exclude high frequency terms from being suggested.
* High frequency terms are usually spelled correctly, and on top of that,
* excluding them also improves the suggest performance.
*/
public DirectCandidateGenerator maxTermFreq(float maxTermFreq) {
this.maxTermFreq = maxTermFreq;
return this;
}
/**
* Sets the minimal number of prefix characters that must match in order
* for a term to be a candidate suggestion. Defaults to <tt>1</tt>.
* Increasing this number improves suggest performance since misspellings
* usually don't occur at the beginning of terms.
*/
public DirectCandidateGenerator prefixLength(int prefixLength) {
this.prefixLength = prefixLength;
return this;
}
/**
* The minimum length a suggest text term must have in order to be
* corrected. Defaults to <tt>4</tt>.
*/
public DirectCandidateGenerator minWordLength(int minWordLength) {
this.minWordLength = minWordLength;
return this;
}
/**
* Sets a minimal threshold for the number of documents a suggested term
* should appear in. This can be specified as an absolute number or as a
* relative percentage of number of documents. This can improve quality
* by only suggesting high frequency terms. Defaults to 0f and is not
* enabled. If a value higher than 1 is specified then the number cannot
* be fractional.
*/
public DirectCandidateGenerator minDocFreq(float minDocFreq) {
this.minDocFreq = minDocFreq;
return this;
}
/**
* Sets a filter (analyzer) that is applied to each of the tokens passed to this candidate generator.
* This filter is applied to the original token before candidates are generated.
*/
public DirectCandidateGenerator preFilter(String preFilter) {
this.preFilter = preFilter;
return this;
}
/**
* Sets a filter (analyzer) that is applied to each of the generated tokens
* before they are passed to the actual phrase scorer.
*/
public DirectCandidateGenerator postFilter(String postFilter) {
this.postFilter = postFilter;
return this;
}
@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder.startObject();
if (field != null) {
builder.field("field", field);
}
if (suggestMode != null) {
builder.field("suggest_mode", suggestMode);
}
if (accuracy != null) {
builder.field("accuracy", accuracy);
}
if (size != null) {
builder.field("size", size);
}
if (sort != null) {
builder.field("sort", sort);
}
if (stringDistance != null) {
builder.field("string_distance", stringDistance);
}
if (maxEdits != null) {
builder.field("max_edits", maxEdits);
}
if (maxInspections != null) {
builder.field("max_inspections", maxInspections);
}
if (maxTermFreq != null) {
builder.field("max_term_freq", maxTermFreq);
}
if (prefixLength != null) {
builder.field("prefix_length", prefixLength);
}
if (minWordLength != null) {
builder.field("min_word_len", minWordLength);
}
if (minDocFreq != null) {
builder.field("min_doc_freq", minDocFreq);
}
if (preFilter != null) {
builder.field("pre_filter", preFilter);
}
if (postFilter != null) {
builder.field("post_filter", postFilter);
}
builder.endObject();
return builder;
}
}
}
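
The `LinearInterpolation` model defined above mixes trigram, bigram and unigram estimates with weights that must sum to one, so an unseen higher-order ngram never zeroes out the whole score. A minimal sketch of that mixture with invented probabilities (plain Java, independent of the classes above):

```java
// Toy linear-interpolation smoothing: mix trigram, bigram and unigram
// estimates with lambdas that sum to one, as LinearInterpolation requires.
public class LinearInterpolationDemo {
    static double interpolate(double trigramLambda, double bigramLambda, double unigramLambda,
                              double pTrigram, double pBigram, double pUnigram) {
        double sum = trigramLambda + bigramLambda + unigramLambda;
        if (Math.abs(sum - 1.0d) > 1e-6) {
            throw new IllegalArgumentException("lambdas must sum up to one, got " + sum);
        }
        return trigramLambda * pTrigram + bigramLambda * pBigram + unigramLambda * pUnigram;
    }

    public static void main(String[] args) {
        // even if the trigram was never seen (p = 0) the mixture stays non-zero
        System.out.println(interpolate(0.7, 0.2, 0.1, 0.0, 0.25, 0.01));
    }
}
```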

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.phrase;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.ElasticSearchIllegalArgumentException;
import org.elasticsearch.search.suggest.DirectSpellcheckerSettings;
import org.elasticsearch.search.suggest.Suggester;
import org.elasticsearch.search.suggest.SuggestionSearchContext.SuggestionContext;
class PhraseSuggestionContext extends SuggestionContext {
private final BytesRef SEPARATOR = new BytesRef(" ");
private float maxErrors = 0.5f;
private BytesRef separator = SEPARATOR;
private float realworldErrorLikelihood = 0.95f;
private List<DirectCandidateGenerator> generators = new ArrayList<PhraseSuggestionContext.DirectCandidateGenerator>();
private int gramSize = 1;
private float confidence = 1.0f;
private WordScorer.WordScorerFactory scorer;
private boolean requireUnigram = true;
public PhraseSuggestionContext(Suggester<? extends PhraseSuggestionContext> suggester) {
super(suggester);
}
public float maxErrors() {
return maxErrors;
}
public void setMaxErrors(Float maxErrors) {
this.maxErrors = maxErrors;
}
public BytesRef separator() {
return separator;
}
public void setSeparator(BytesRef separator) {
this.separator = separator;
}
public Float realworldErrorLikelyhood() {
return realworldErrorLikelihood;
}
public void setRealWordErrorLikelihood(Float realworldErrorLikelihood) {
this.realworldErrorLikelihood = realworldErrorLikelihood;
}
public void addGenerator(DirectCandidateGenerator generator) {
this.generators.add(generator);
}
public List<DirectCandidateGenerator> generators() {
return this.generators;
}
public void setGramSize(int gramSize) {
this.gramSize = gramSize;
}
public int gramSize() {
return gramSize;
}
public float confidence() {
return confidence;
}
public void setConfidence(float confidence) {
this.confidence = confidence;
}
public void setModel(WordScorer.WordScorerFactory scorer) {
this.scorer = scorer;
}
public WordScorer.WordScorerFactory model() {
return scorer;
}
static class DirectCandidateGenerator extends DirectSpellcheckerSettings {
private Analyzer preFilter;
private Analyzer postFilter;
private String field;
private int size = 5;
public String field() {
return field;
}
public void setField(String field) {
this.field = field;
}
public int size() {
return size;
}
public void size(int size) {
if (size <= 0) {
throw new ElasticSearchIllegalArgumentException("Size must be positive");
}
this.size = size;
}
public Analyzer preFilter() {
return preFilter;
}
public void preFilter(Analyzer preFilter) {
this.preFilter = preFilter;
}
public Analyzer postFilter() {
return postFilter;
}
public void postFilter(Analyzer postFilter) {
this.postFilter = postFilter;
}
}
public void setRequireUnigram(boolean requireUnigram) {
this.requireUnigram = requireUnigram;
}
public boolean getRequireUnigram() {
return requireUnigram;
}
}

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.phrase;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.search.suggest.SuggestUtils;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.Candidate;
public class StupidBackoffScorer extends WordScorer {
public static final WordScorerFactory FACTORY = new WordScorer.WordScorerFactory() {
@Override
public WordScorer newScorer(IndexReader reader, String field, double realWordLikelyhood, BytesRef separator) throws IOException {
return new StupidBackoffScorer(reader, field, realWordLikelyhood, separator, 0.4f);
}
};
private final double discount;
public StupidBackoffScorer(IndexReader reader, String field, double realWordLikelyhood, BytesRef separator, double discount)
throws IOException {
super(reader, field, realWordLikelyhood, separator);
this.discount = discount;
}
@Override
protected double scoreBigram(Candidate word, Candidate w_1) throws IOException {
SuggestUtils.join(separator, spare, word.term, w_1.term);
final int count = frequency(spare);
if (count < 1) {
return discount * scoreUnigram(word);
}
return count / (w_1.frequency + 0.00000000001d);
}
@Override
protected double scoreTrigram(Candidate w, Candidate w_1, Candidate w_2) throws IOException {
SuggestUtils.join(separator, spare, w_2.term, w_1.term, w.term);
final int trigramCount = frequency(spare);
if (trigramCount < 1) {
return discount * scoreBigram(w, w_1);
}
SuggestUtils.join(separator, spare, w_1.term, w.term);
final int bigramCount = frequency(spare);
return trigramCount / (bigramCount + 0.00000000001d);
}
}
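
The scorer above implements the stupid-backoff heuristic: when a higher-order ngram was never observed, it falls back to the next lower order multiplied by a constant discount (0.4 by default) instead of redistributing probability mass. A self-contained sketch of `scoreBigram` with hard-coded toy counts (terms, frequencies and the document total are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Toy stupid-backoff scorer over hard-coded ngram counts, mirroring
// StupidBackoffScorer#scoreBigram: use the raw bigram frequency ratio when
// the bigram was observed, otherwise back off to the discounted unigram score.
public class StupidBackoffDemo {
    static final double DISCOUNT = 0.4d;
    static final int TOTAL_DOCS = 1000;
    static final Map<String, Integer> COUNTS = new HashMap<>();
    static {
        COUNTS.put("god", 20);
        COUNTS.put("jewel", 10);
        COUNTS.put("god jewel", 5);
    }

    static int freq(String term) {
        return COUNTS.getOrDefault(term, 0);
    }

    static double scoreUnigram(String w) {
        // add-one smoothed unigram estimate, as in WordScorer#scoreUnigram
        return (1.0 + freq(w)) / (1.0 + TOTAL_DOCS);
    }

    static double scoreBigram(String w, String wMinus1) {
        int count = freq(wMinus1 + " " + w);
        if (count < 1) {
            return DISCOUNT * scoreUnigram(w); // unseen bigram: back off
        }
        return count / (freq(wMinus1) + 0.00000000001d);
    }

    public static void main(String[] args) {
        System.out.println(scoreBigram("jewel", "god"));  // seen bigram: ratio 5 / 20
        System.out.println(scoreBigram("jewel", "xorr")); // unseen: discounted unigram
    }
}
```

The tiny additive constant in the denominator matches the original code and merely guards against division by zero; stupid backoff is a score, not a normalized probability, which is why it is cheap enough to compute per candidate phrase.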

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.phrase;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.ElasticSearchException;
import org.elasticsearch.ElasticSearchIllegalArgumentException;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.Candidate;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.CandidateSet;
//TODO public for tests
public abstract class WordScorer {
protected final IndexReader reader;
protected final String field;
protected final Terms terms;
protected final int totalDocuments;
protected double realWordLikelyhood;
protected final BytesRef spare = new BytesRef();
protected final BytesRef separator;
protected final TermsEnum termsEnum;
public WordScorer(IndexReader reader, String field, double realWordLikelyHood, BytesRef separator) throws IOException {
this.field = field;
this.terms = MultiFields.getTerms(reader, field);
if (terms == null) {
throw new ElasticSearchIllegalArgumentException("Field: [" + field + "] does not exist");
}
final int docCount = terms.getDocCount();
this.totalDocuments = docCount == -1 ? reader.maxDoc() : docCount;
this.termsEnum = terms.iterator(null);
this.reader = reader;
this.realWordLikelyhood = realWordLikelyHood;
this.separator = separator;
}
public int frequency(BytesRef term) throws IOException {
if (termsEnum.seekExact(term, true)) {
return termsEnum.docFreq();
}
return 0;
}
protected double channelScore(Candidate candidate, Candidate original) throws IOException {
if (candidate.stringDistance == 1.0d) {
return realWordLikelyhood;
}
return candidate.stringDistance;
}
public double score(Candidate[] path, CandidateSet[] candidateSet, int at, int gramSize) throws IOException {
if (at == 0 || gramSize == 1) {
return Math.log10(channelScore(path[0], candidateSet[0].originalTerm) * scoreUnigram(path[0]));
} else if (at == 1 || gramSize == 2) {
return Math.log10(channelScore(path[at], candidateSet[at].originalTerm) * scoreBigram(path[at], path[at - 1]));
} else {
return Math.log10(channelScore(path[at], candidateSet[at].originalTerm) * scoreTrigram(path[at], path[at - 1], path[at - 2]));
}
}
protected double scoreUnigram(Candidate word) throws IOException {
return (1.0 + word.frequency) / (1.0 + totalDocuments);
}
protected double scoreBigram(Candidate word, Candidate w_1) throws IOException {
return scoreUnigram(word);
}
protected double scoreTrigram(Candidate word, Candidate w_1, Candidate w_2) throws IOException {
return scoreBigram(word, w_1);
}
public static interface WordScorerFactory {
public WordScorer newScorer(IndexReader reader, String field,
double realWordLikelyhood, BytesRef separator) throws IOException;
}
}

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.term;
import java.io.IOException;
import org.elasticsearch.ElasticSearchIllegalArgumentException;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.search.internal.SearchContext;
import org.elasticsearch.search.suggest.DirectSpellcheckerSettings;
import org.elasticsearch.search.suggest.SuggestContextParser;
import org.elasticsearch.search.suggest.SuggestUtils;
import org.elasticsearch.search.suggest.SuggestionSearchContext;
public final class TermSuggestParser implements SuggestContextParser {
private final TermSuggester suggester = new TermSuggester();
public SuggestionSearchContext.SuggestionContext parse(XContentParser parser, SearchContext context) throws IOException {
XContentParser.Token token;
String fieldName = null;
TermSuggestionContext suggestion = new TermSuggestionContext(suggester);
DirectSpellcheckerSettings settings = suggestion.getDirectSpellCheckerSettings();
while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
if (token == XContentParser.Token.FIELD_NAME) {
fieldName = parser.currentName();
} else if (token.isValue()) {
parseTokenValue(parser, context, fieldName, suggestion, settings);
} else {
throw new ElasticSearchIllegalArgumentException("suggester[term] doesn't support field [" + fieldName + "]");
}
}
return suggestion;
}
private void parseTokenValue(XContentParser parser, SearchContext context, String fieldName, TermSuggestionContext suggestion,
DirectSpellcheckerSettings settings) throws IOException {
if (!(SuggestUtils.parseSuggestContext(parser, context, fieldName, suggestion) || SuggestUtils.parseDirectSpellcheckerSettings(
parser, fieldName, settings))) {
throw new ElasticSearchIllegalArgumentException("suggester[term] doesn't support [" + fieldName + "]");
}
}
}

/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.term;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.SuggestWord;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.CharsRef;
import org.elasticsearch.common.bytes.BytesArray;
import org.elasticsearch.common.text.BytesText;
import org.elasticsearch.common.text.StringText;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.search.internal.SearchContext;
import org.elasticsearch.search.suggest.SuggestUtils;
import org.elasticsearch.search.suggest.Suggester;
import org.elasticsearch.search.suggest.SuggestionSearchContext.SuggestionContext;
final class TermSuggester implements Suggester<TermSuggestionContext> {
@Override
public TermSuggestion execute(String name, TermSuggestionContext suggestion, SearchContext context, CharsRef spare) throws IOException {
DirectSpellChecker directSpellChecker = SuggestUtils.getDirectSpellChecker(suggestion.getDirectSpellCheckerSettings());
TermSuggestion response = new TermSuggestion(
name, suggestion.getSize(), suggestion.getDirectSpellCheckerSettings().sort()
);
List<Token> tokens = queryTerms(suggestion, spare);
for (Token token : tokens) {
IndexReader indexReader = context.searcher().getIndexReader();
// TODO: Extend DirectSpellChecker in 4.1, to get the raw suggested words as BytesRef
SuggestWord[] suggestedWords = directSpellChecker.suggestSimilar(
token.term, suggestion.getShardSize(), indexReader, suggestion.getDirectSpellCheckerSettings().suggestMode()
);
Text key = new BytesText(new BytesArray(token.term.bytes()));
TermSuggestion.Entry resultEntry = new TermSuggestion.Entry(key, token.startOffset, token.endOffset - token.startOffset);
for (SuggestWord suggestWord : suggestedWords) {
Text word = new StringText(suggestWord.string);
resultEntry.addOption(new TermSuggestion.Entry.Option(word, suggestWord.freq, suggestWord.score));
}
response.addTerm(resultEntry);
}
return response;
}
private List<Token> queryTerms(SuggestionContext suggestion, CharsRef spare) throws IOException {
final List<Token> result = new ArrayList<TermSuggester.Token>();
final String field = suggestion.getField();
SuggestUtils.analyze(suggestion.getAnalyzer(), suggestion.getText(), field, new SuggestUtils.TokenConsumer() {
@Override
public void nextToken() {
Term term = new Term(field, BytesRef.deepCopyOf(fillBytesRef(new BytesRef())));
result.add(new Token(term, offsetAttr.startOffset(), offsetAttr.endOffset()));
}
}, spare);
return result;
}
private static class Token {
public final Term term;
public final int startOffset;
public final int endOffset;
private Token(Term term, int startOffset, int endOffset) {
this.term = term;
this.startOffset = startOffset;
this.endOffset = endOffset;
}
}
}
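The suggester above analyzes the suggest text into tokens and asks Lucene's `DirectSpellChecker` for alternatives per token; each token becomes a result entry carrying its text, offset and length, with one option per suggested word. As a sketch (the term and the `freq`/`score` values are purely illustrative), a single entry in the response would look roughly like:

```json
{
  "text" : "spellcecker",
  "offset" : 0,
  "length" : 11,
  "options" : [ {
    "text" : "spellchecker",
    "freq" : 2,
    "score" : 0.8333333
  } ]
}
```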


@@ -0,0 +1,201 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.term;
import java.io.IOException;
import java.util.Comparator;
import org.elasticsearch.ElasticSearchException;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentBuilderString;
import org.elasticsearch.search.suggest.Suggest.Suggestion;
import org.elasticsearch.search.suggest.Suggest.Suggestion.Entry.Option;
/**
* The suggestion response corresponding to the suggestions in the request.
*/
public class TermSuggestion extends Suggestion<TermSuggestion.Entry> {
public static Comparator<Suggestion.Entry.Option> SCORE = new Score();
public static Comparator<Suggestion.Entry.Option> FREQUENCY = new Frequency();
// Same behaviour as comparators in suggest module, but for SuggestedWord
// Highest score first, then highest freq first, then lowest term first
public static class Score implements Comparator<Suggestion.Entry.Option> {
@Override
public int compare(Suggestion.Entry.Option first, Suggestion.Entry.Option second) {
// first criteria: the distance
int cmp = Float.compare(second.getScore(), first.getScore());
if (cmp != 0) {
return cmp;
}
return FREQUENCY.compare(first, second);
}
}
// Same behaviour as comparators in suggest module, but for SuggestedWord
// Highest freq first, then highest score first, then lowest term first
public static class Frequency implements Comparator<Suggestion.Entry.Option> {
@Override
public int compare(Suggestion.Entry.Option first, Suggestion.Entry.Option second) {
// first criteria: the popularity
int cmp = ((TermSuggestion.Entry.Option) second).getFreq() - ((TermSuggestion.Entry.Option) first).getFreq();
if (cmp != 0) {
return cmp;
}
// second criteria (if first criteria is equal): the distance
cmp = Float.compare(second.getScore(), first.getScore());
if (cmp != 0) {
return cmp;
}
// third criteria: term text
return first.getText().compareTo(second.getText());
}
}
public static final int TYPE = 1;
private Sort sort;
public TermSuggestion() {
}
public TermSuggestion(String name, int size, Sort sort) {
super(name, size);
this.sort = sort;
}
public int getType() {
return TYPE;
}
@Override
protected Comparator<Option> sortComparator() {
switch (sort) {
case SCORE:
return SCORE;
case FREQUENCY:
return FREQUENCY;
default:
throw new ElasticSearchException("Could not resolve comparator for sort key: [" + sort + "]");
}
}
@Override
protected void innerReadFrom(StreamInput in) throws IOException {
super.innerReadFrom(in);
sort = Sort.fromId(in.readByte());
}
@Override
public void innerWriteTo(StreamOutput out) throws IOException {
super.innerWriteTo(out);
out.writeByte(sort.id());
}
protected Entry newEntry() {
return new Entry();
}
/**
* Represents a part of the suggest text with suggested options.
*/
public static class Entry extends
org.elasticsearch.search.suggest.Suggest.Suggestion.Entry<TermSuggestion.Entry.Option> {
Entry(Text text, int offset, int length) {
super(text, offset, length);
}
Entry() {
}
@Override
protected Option newOption() {
return new Option();
}
/**
* Contains the suggested text with its document frequency and score.
*/
public static class Option extends org.elasticsearch.search.suggest.Suggest.Suggestion.Entry.Option {
static class Fields {
static final XContentBuilderString FREQ = new XContentBuilderString("freq");
}
private int freq;
protected Option(Text text, int freq, float score) {
super(text, score);
this.freq = freq;
}
@Override
protected void mergeInto(Suggestion.Entry.Option otherOption) {
super.mergeInto(otherOption);
freq += ((Option) otherOption).freq;
}
protected Option() {
super();
}
public void setFreq(int freq) {
this.freq = freq;
}
/**
* @return How often this suggested text appears in the index.
*/
public int getFreq() {
return freq;
}
@Override
public void readFrom(StreamInput in) throws IOException {
super.readFrom(in);
freq = in.readVInt();
}
@Override
public void writeTo(StreamOutput out) throws IOException {
super.writeTo(out);
out.writeVInt(freq);
}
@Override
protected XContentBuilder innerToXContent(XContentBuilder builder, Params params) throws IOException {
builder = super.innerToXContent(builder, params);
builder.field(Fields.FREQ, freq);
return builder;
}
}
}
}
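The two comparators above differ only in which criterion is applied first; the term text is always the final tie-breaker. Under the `frequency` sort, for example, a more frequent but lower-scoring option is ordered before a rarer, higher-scoring one (the values below are illustrative, not from an actual response):

```json
"options" : [
  { "text" : "abcd", "freq" : 10, "score" : 0.5 },
  { "text" : "abce", "freq" : 2,  "score" : 0.9 }
]
```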


@@ -0,0 +1,224 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.term;
import java.io.IOException;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.search.suggest.SuggestBuilder.SuggestionBuilder;
/**
* Defines the actual suggest command. Each command uses the global options
* unless overridden in the suggestion itself. Per-suggestion options are the
* same as the global options, but apply only to this suggestion.
*/
public class TermSuggestionBuilder extends SuggestionBuilder<TermSuggestionBuilder> {
private String suggestMode;
private Float accuracy;
private String sort;
private String stringDistance;
private Integer maxEdits;
private Integer maxInspections;
private Float maxTermFreq;
private Integer prefixLength;
private Integer minWordLength;
private Float minDocFreq;
/**
* @param name
* The name of this suggestion. This is a required parameter.
*/
public TermSuggestionBuilder(String name) {
super(name, "term");
}
/**
* Sets the suggest mode, which controls which suggested terms are included,
* or for which suggest text tokens suggestions should be generated.
* Three possible values can be specified:
* <ol>
* <li><code>missing</code> - Only suggest terms in the suggest text that
* aren't in the index. This is the default.
* <li><code>popular</code> - Only suggest terms that occur in more docs
* than the original suggest text term.
* <li><code>always</code> - Suggest any matching suggest terms based on
* tokens in the suggest text.
* </ol>
*/
public TermSuggestionBuilder suggestMode(String suggestMode) {
this.suggestMode = suggestMode;
return this;
}
/**
* Sets how similar the suggested terms at least need to be compared to the
* original suggest text tokens. A value between 0 and 1 can be specified.
* This value will be compared to the string distance result of each
* candidate spelling correction.
* <p/>
* Default is <tt>0.5</tt>
*/
public TermSuggestionBuilder setAccuracy(float accuracy) {
this.accuracy = accuracy;
return this;
}
/**
* Sets how to sort the suggest terms per suggest text token. Two possible
* values:
* <ol>
* <li><code>score</code> - Sort should first be based on score, then
* document frequency and then the term itself.
* <li><code>frequency</code> - Sort should first be based on document
* frequency, then score and then the term itself.
* </ol>
* <p/>
* What the score is depends on the suggester being used.
*/
public TermSuggestionBuilder sort(String sort) {
this.sort = sort;
return this;
}
/**
* Sets what string distance implementation to use for comparing how similar
* suggested terms are. Five possible values can be specified:
* <ol>
* <li><code>internal</code> - This is the default and is based on
* <code>damerau_levenshtein</code>, but highly optimized for comparing
* string distance for terms inside the index.
* <li><code>damerau_levenshtein</code> - String distance algorithm based on
* Damerau-Levenshtein algorithm.
* <li><code>levenstein</code> - String distance algorithm based on the
* Levenshtein edit distance algorithm.
* <li><code>jarowinkler</code> - String distance algorithm based on
* Jaro-Winkler algorithm.
* <li><code>ngram</code> - String distance algorithm based on character
* n-grams.
* </ol>
*/
public TermSuggestionBuilder stringDistance(String stringDistance) {
this.stringDistance = stringDistance;
return this;
}
/**
* Sets the maximum edit distance candidate suggestions can have in order to
* be considered as a suggestion. Can only be a value between 1 and 2. Any
* other value results in a bad request error being thrown. Defaults to
* <tt>2</tt>.
*/
public TermSuggestionBuilder maxEdits(Integer maxEdits) {
this.maxEdits = maxEdits;
return this;
}
/**
* A factor that is multiplied with the size in order to inspect more
* candidate suggestions. Can improve accuracy at the cost of performance.
* Defaults to <tt>5</tt>.
*/
public TermSuggestionBuilder maxInspections(Integer maxInspections) {
this.maxInspections = maxInspections;
return this;
}
/**
* Sets a maximum threshold on the number of documents a suggest text token
* may exist in for it to be corrected. Can be a relative percentage (e.g.
* 0.4) or an absolute number representing document frequency. If a value
* higher than 1 is specified, the number cannot be fractional. Defaults
* to <tt>0.01</tt>.
* <p/>
* This can be used to exclude high frequency terms from being suggested.
* High frequency terms are usually spelled correctly; excluding them also
* improves suggest performance.
*/
public TermSuggestionBuilder maxTermFreq(float maxTermFreq) {
this.maxTermFreq = maxTermFreq;
return this;
}
/**
* Sets the minimal number of prefix characters that must match in order to
* be a candidate suggestion. Defaults to 1. Increasing this number improves
* suggest performance, since misspellings usually don't occur at the
* beginning of terms.
*/
public TermSuggestionBuilder prefixLength(int prefixLength) {
this.prefixLength = prefixLength;
return this;
}
/**
* The minimum length a suggest text term must have in order to be
* corrected. Defaults to <tt>4</tt>.
*/
public TermSuggestionBuilder minWordLength(int minWordLength) {
this.minWordLength = minWordLength;
return this;
}
/**
* Sets a minimal threshold on the number of documents a suggested term
* should appear in. This can be specified as an absolute number or as a
* relative percentage of the number of documents. This can improve quality
* by only suggesting high frequency terms. Defaults to 0f, which disables
* the check. If a value higher than 1 is specified, the number cannot be
* fractional.
*/
public TermSuggestionBuilder minDocFreq(float minDocFreq) {
this.minDocFreq = minDocFreq;
return this;
}
@Override
public XContentBuilder innerToXContent(XContentBuilder builder, Params params) throws IOException {
if (suggestMode != null) {
builder.field("suggest_mode", suggestMode);
}
if (accuracy != null) {
builder.field("accuracy", accuracy);
}
if (sort != null) {
builder.field("sort", sort);
}
if (stringDistance != null) {
builder.field("string_distance", stringDistance);
}
if (maxEdits != null) {
builder.field("max_edits", maxEdits);
}
if (maxInspections != null) {
builder.field("max_inspections", maxInspections);
}
if (maxTermFreq != null) {
builder.field("max_term_freq", maxTermFreq);
}
if (prefixLength != null) {
builder.field("prefix_length", prefixLength);
}
if (minWordLength != null) {
builder.field("min_word_len", minWordLength);
}
if (minDocFreq != null) {
builder.field("min_doc_freq", minDocFreq);
}
return builder;
}
}
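Serialized through `innerToXContent` above, a fully specified `term` suggestion takes the same shape as the `phrase` request shown earlier. The following is a sketch only: the suggestion name, field name and text are made up, and the values merely echo the documented defaults; the parameter keys themselves are the ones the builder emits:

```json
"suggest" : {
  "my_suggestion" : {
    "text" : "spellcecker",
    "term" : {
      "field" : "body",
      "suggest_mode" : "popular",
      "accuracy" : 0.5,
      "sort" : "frequency",
      "string_distance" : "internal",
      "max_edits" : 2,
      "max_inspections" : 5,
      "max_term_freq" : 0.01,
      "prefix_length" : 1,
      "min_word_len" : 4,
      "min_doc_freq" : 0.0
    }
  }
}
```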


@@ -0,0 +1,37 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.search.suggest.term;
import org.elasticsearch.search.suggest.DirectSpellcheckerSettings;
import org.elasticsearch.search.suggest.Suggester;
import org.elasticsearch.search.suggest.SuggestionSearchContext.SuggestionContext;
final class TermSuggestionContext extends SuggestionContext {
private final DirectSpellcheckerSettings settings = new DirectSpellcheckerSettings();
public TermSuggestionContext(Suggester<? extends TermSuggestionContext> suggester) {
super(suggester);
}
public DirectSpellcheckerSettings getDirectSpellCheckerSettings() {
return settings;
}
}


@@ -31,7 +31,7 @@ import org.elasticsearch.common.unit.SizeValue;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.node.Node;
import org.elasticsearch.search.suggest.Suggest;
import org.elasticsearch.search.suggest.Suggest.Suggestion.Entry.Option;
import org.elasticsearch.search.suggest.SuggestBuilder;
import java.io.IOException;
@@ -118,7 +118,7 @@ public class SuggestSearchBenchMark {
String term = "prefix" + startChar;
SearchResponse response = client.prepareSearch()
.setQuery(prefixQuery("field", term))
.addSuggestion(new SuggestBuilder.FuzzySuggestion("field").setField("field").setText(term).setSuggestMode("always"))
.addSuggestion(SuggestBuilder.termSuggestion("field").field("field").text(term).suggestMode("always"))
.execute().actionGet();
if (response.getHits().totalHits() == 0) {
System.err.println("No hits");
@@ -135,14 +135,14 @@ public class SuggestSearchBenchMark {
String term = "prefix" + startChar;
SearchResponse response = client.prepareSearch()
.setQuery(matchQuery("field", term))
.addSuggestion(new SuggestBuilder.FuzzySuggestion("field").setText(term).setField("field").setSuggestMode("always"))
.addSuggestion(SuggestBuilder.termSuggestion("field").text(term).field("field").suggestMode("always"))
.execute().actionGet();
timeTaken += response.getTookInMillis();
if (response.getSuggest() == null) {
System.err.println("No suggestions");
continue;
}
List<Suggest.Suggestion.Entry.Option> options = response.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions();
List<? extends Option> options = response.getSuggest().getSuggestion("field").getEntries().get(0).getOptions();
if (options == null || options.isEmpty()) {
System.err.println("No suggestions");
}


@@ -19,27 +19,40 @@
package org.elasticsearch.test.integration.search.suggest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.test.integration.AbstractNodesTests;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import static org.elasticsearch.cluster.metadata.IndexMetaData.SETTING_NUMBER_OF_REPLICAS;
import static org.elasticsearch.cluster.metadata.IndexMetaData.SETTING_NUMBER_OF_SHARDS;
import static org.elasticsearch.common.settings.ImmutableSettings.settingsBuilder;
import static org.elasticsearch.index.query.QueryBuilders.matchQuery;
import static org.elasticsearch.search.suggest.SuggestBuilder.fuzzySuggestion;
import static org.elasticsearch.search.suggest.SuggestBuilder.phraseSuggestion;
import static org.elasticsearch.search.suggest.SuggestBuilder.termSuggestion;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.notNullValue;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.elasticsearch.ElasticSearchException;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.ListenableActionFuture;
import org.elasticsearch.action.search.SearchPhaseExecutionException;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.ImmutableSettings.Builder;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.search.suggest.phrase.PhraseSuggestionBuilder;
import org.elasticsearch.test.integration.AbstractNodesTests;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;
/**
*/
public class SuggestSearchTests extends AbstractNodesTests {
@@ -106,38 +119,38 @@ public class SuggestSearchTests extends AbstractNodesTests {
SearchResponse search = client.prepareSearch()
.setQuery(matchQuery("text", "spellcecker"))
.addSuggestion(
fuzzySuggestion("test").setSuggestMode("always") // Always, otherwise the results can vary between requests.
.setText("abcd")
.setField("text"))
termSuggestion("test").suggestMode("always") // Always, otherwise the results can vary between requests.
.text("abcd")
.field("text"))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().getSuggestions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(0).getName(), equalTo("test"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getText().string(), equalTo("abcd"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().size(), equalTo(3));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().get(0).getText().string(), equalTo("aacd"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().get(1).getText().string(), equalTo("abbd"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().get(2).getText().string(), equalTo("abcc"));
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("test").getName(), equalTo("test"));
assertThat(search.getSuggest().getSuggestion("test").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("test").getEntries().get(0).getText().string(), equalTo("abcd"));
assertThat(search.getSuggest().getSuggestion("test").getEntries().get(0).getOptions().size(), equalTo(3));
assertThat(search.getSuggest().getSuggestion("test").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("aacd"));
assertThat(search.getSuggest().getSuggestion("test").getEntries().get(0).getOptions().get(1).getText().string(), equalTo("abbd"));
assertThat(search.getSuggest().getSuggestion("test").getEntries().get(0).getOptions().get(2).getText().string(), equalTo("abcc"));
client.prepareSearch()
.addSuggestion(
fuzzySuggestion("test").setSuggestMode("always") // Always, otherwise the results can vary between requests.
.setText("abcd")
.setField("text"))
termSuggestion("test").suggestMode("always") // Always, otherwise the results can vary between requests.
.text("abcd")
.field("text"))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().getSuggestions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(0).getName(), equalTo("test"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().size(), equalTo(3));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().get(0).getText().string(), equalTo("aacd"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().get(1).getText().string(), equalTo("abbd"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().get(2).getText().string(), equalTo("abcc"));
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("test").getName(), equalTo("test"));
assertThat(search.getSuggest().getSuggestion("test").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("test").getEntries().get(0).getOptions().size(), equalTo(3));
assertThat(search.getSuggest().getSuggestion("test").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("aacd"));
assertThat(search.getSuggest().getSuggestion("test").getEntries().get(0).getOptions().get(1).getText().string(), equalTo("abbd"));
assertThat(search.getSuggest().getSuggestion("test").getEntries().get(0).getOptions().get(2).getText().string(), equalTo("abcc"));
}
@Test
@@ -153,32 +166,32 @@ public class SuggestSearchTests extends AbstractNodesTests {
SearchResponse search = client.prepareSearch()
.setQuery(matchQuery("text", "spellcecker"))
.addSuggestion(
fuzzySuggestion("test").setSuggestMode("always") // Always, otherwise the results can vary between requests.
.setText("abcd")
.setField("text"))
termSuggestion("test").suggestMode("always") // Always, otherwise the results can vary between requests.
.text("abcd")
.field("text"))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().getSuggestions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(0).getName(), equalTo("test"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getText().string(), equalTo("abcd"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().size(), equalTo(0));
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("test").getName(), equalTo("test"));
assertThat(search.getSuggest().getSuggestion("test").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("test").getEntries().get(0).getText().string(), equalTo("abcd"));
assertThat(search.getSuggest().getSuggestion("test").getEntries().get(0).getOptions().size(), equalTo(0));
client.prepareSearch()
.addSuggestion(
fuzzySuggestion("test").setSuggestMode("always") // Always, otherwise the results can vary between requests.
.setText("abcd")
.setField("text"))
termSuggestion("test").suggestMode("always") // Always, otherwise the results can vary between requests.
.text("abcd")
.field("text"))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().getSuggestions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(0).getName(), equalTo("test"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().size(), equalTo(0));
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("test").getName(), equalTo("test"));
assertThat(search.getSuggest().getSuggestion("test").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("test").getEntries().get(0).getOptions().size(), equalTo(0));
}
@Test
@@ -226,39 +239,39 @@ public class SuggestSearchTests extends AbstractNodesTests {
client.admin().indices().prepareRefresh().execute().actionGet();
SearchResponse search = client.prepareSearch()
.addSuggestion(fuzzySuggestion("size1")
.setSize(1).setText("prefix_abcd").setMaxTermFreq(10).setMinDocFreq(0)
.setField("field1").setSuggestMode("always"))
.addSuggestion(fuzzySuggestion("field2")
.setField("field2").setText("prefix_eeeh prefix_efgh")
.setMaxTermFreq(10).setMinDocFreq(0).setSuggestMode("always"))
.addSuggestion(fuzzySuggestion("accuracy")
.setField("field2").setText("prefix_efgh").setAccuracy(1f)
.setMaxTermFreq(10).setMinDocFreq(0).setSuggestMode("always"))
.addSuggestion(termSuggestion("size1")
.size(1).text("prefix_abcd").maxTermFreq(10).minDocFreq(0)
.field("field1").suggestMode("always"))
.addSuggestion(termSuggestion("field2")
.field("field2").text("prefix_eeeh prefix_efgh")
.maxTermFreq(10).minDocFreq(0).suggestMode("always"))
.addSuggestion(termSuggestion("accuracy")
.field("field2").text("prefix_efgh").setAccuracy(1f)
.maxTermFreq(10).minDocFreq(0).suggestMode("always"))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().getSuggestions().size(), equalTo(3));
assertThat(search.getSuggest().getSuggestions().get(0).getName(), equalTo("size1"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().get(0).getText().string(), equalTo("prefix_aacd"));
assertThat(search.getSuggest().getSuggestions().get(1).getName(), equalTo("field2"));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().size(), equalTo(2));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(0).getText().string(), equalTo("prefix_eeeh"));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(0).getOffset(), equalTo(0));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(0).getLength(), equalTo(11));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(1).getText().string(), equalTo("prefix_efgh"));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(1).getOffset(), equalTo(12));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(1).getLength(), equalTo(11));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(1).getOptions().size(), equalTo(3));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(1).getOptions().get(0).getText().string(), equalTo("prefix_eeeh"));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(1).getOptions().get(1).getText().string(), equalTo("prefix_efff"));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(1).getOptions().get(2).getText().string(), equalTo("prefix_eggg"));
assertThat(search.getSuggest().getSuggestions().get(2).getName(), equalTo("accuracy"));
assertThat(search.getSuggest().getSuggestions().get(2).getEntries().get(0).getOptions().isEmpty(), equalTo(true));
assertThat(search.getSuggest().size(), equalTo(3));
assertThat(search.getSuggest().getSuggestion("size1").getName(), equalTo("size1"));
assertThat(search.getSuggest().getSuggestion("size1").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("size1").getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("size1").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("prefix_aacd"));
assertThat(search.getSuggest().getSuggestion("field2").getName(), equalTo("field2"));
assertThat(search.getSuggest().getSuggestion("field2").getEntries().size(), equalTo(2));
assertThat(search.getSuggest().getSuggestion("field2").getEntries().get(0).getText().string(), equalTo("prefix_eeeh"));
assertThat(search.getSuggest().getSuggestion("field2").getEntries().get(0).getOffset(), equalTo(0));
assertThat(search.getSuggest().getSuggestion("field2").getEntries().get(0).getLength(), equalTo(11));
assertThat(search.getSuggest().getSuggestion("field2").getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("field2").getEntries().get(1).getText().string(), equalTo("prefix_efgh"));
assertThat(search.getSuggest().getSuggestion("field2").getEntries().get(1).getOffset(), equalTo(12));
assertThat(search.getSuggest().getSuggestion("field2").getEntries().get(1).getLength(), equalTo(11));
assertThat(search.getSuggest().getSuggestion("field2").getEntries().get(1).getOptions().size(), equalTo(3));
assertThat(search.getSuggest().getSuggestion("field2").getEntries().get(1).getOptions().get(0).getText().string(), equalTo("prefix_eeeh"));
assertThat(search.getSuggest().getSuggestion("field2").getEntries().get(1).getOptions().get(1).getText().string(), equalTo("prefix_efff"));
assertThat(search.getSuggest().getSuggestion("field2").getEntries().get(1).getOptions().get(2).getText().string(), equalTo("prefix_eggg"));
assertThat(search.getSuggest().getSuggestion("accuracy").getName(), equalTo("accuracy"));
assertThat(search.getSuggest().getSuggestion("accuracy").getEntries().get(0).getOptions().isEmpty(), equalTo(true));
}
@Test
public void testSizeAndSort() throws Exception {
SearchResponse search = client.prepareSearch()
.setSuggestText("prefix_abcd")
.addSuggestion(fuzzySuggestion("size3SortScoreFirst")
.setSize(3).setMinDocFreq(0).setField("field1").setSuggestMode("always"))
.addSuggestion(fuzzySuggestion("size10SortScoreFirst")
.setSize(10).setMinDocFreq(0).setField("field1").setSuggestMode("always").setShardSize(50))
.addSuggestion(fuzzySuggestion("size3SortScoreFirstMaxEdits1")
.setMaxEdits(1)
.setSize(10).setMinDocFreq(0).setField("field1").setSuggestMode("always"))
.addSuggestion(fuzzySuggestion("size10SortFrequencyFirst")
.setSize(10).setSort("frequency").setShardSize(1000)
.setMinDocFreq(0).setField("field1").setSuggestMode("always"))
.addSuggestion(termSuggestion("size3SortScoreFirst")
.size(3).minDocFreq(0).field("field1").suggestMode("always"))
.addSuggestion(termSuggestion("size10SortScoreFirst")
.size(10).minDocFreq(0).field("field1").suggestMode("always").shardSize(50))
.addSuggestion(termSuggestion("size3SortScoreFirstMaxEdits1")
.maxEdits(1)
.size(10).minDocFreq(0).field("field1").suggestMode("always"))
.addSuggestion(termSuggestion("size10SortFrequencyFirst")
.size(10).sort("frequency").shardSize(1000)
.minDocFreq(0).field("field1").suggestMode("always"))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().getSuggestions().size(), equalTo(4));
assertThat(search.getSuggest().getSuggestions().get(0).getName(), equalTo("size3SortScoreFirst"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().size(), equalTo(3));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().get(0).getText().string(), equalTo("prefix_aacd"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().get(1).getText().string(), equalTo("prefix_abcc"));
assertThat(search.getSuggest().getSuggestions().get(0).getEntries().get(0).getOptions().get(2).getText().string(), equalTo("prefix_accd"));
assertThat(search.getSuggest().size(), equalTo(4));
assertThat(search.getSuggest().getSuggestion("size3SortScoreFirst").getName(), equalTo("size3SortScoreFirst"));
assertThat(search.getSuggest().getSuggestion("size3SortScoreFirst").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("size3SortScoreFirst").getEntries().get(0).getOptions().size(), equalTo(3));
assertThat(search.getSuggest().getSuggestion("size3SortScoreFirst").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("prefix_aacd"));
assertThat(search.getSuggest().getSuggestion("size3SortScoreFirst").getEntries().get(0).getOptions().get(1).getText().string(), equalTo("prefix_abcc"));
assertThat(search.getSuggest().getSuggestion("size3SortScoreFirst").getEntries().get(0).getOptions().get(2).getText().string(), equalTo("prefix_accd"));
assertThat(search.getSuggest().getSuggestions().get(1).getName(), equalTo("size10SortScoreFirst"));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(0).getOptions().size(), equalTo(10));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(0).getOptions().get(0).getText().string(), equalTo("prefix_aacd"));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(0).getOptions().get(1).getText().string(), equalTo("prefix_abcc"));
assertThat(search.getSuggest().getSuggestions().get(1).getEntries().get(0).getOptions().get(2).getText().string(), equalTo("prefix_accd"));
assertThat(search.getSuggest().getSuggestion("size10SortScoreFirst").getName(), equalTo("size10SortScoreFirst"));
assertThat(search.getSuggest().getSuggestion("size10SortScoreFirst").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("size10SortScoreFirst").getEntries().get(0).getOptions().size(), equalTo(10));
assertThat(search.getSuggest().getSuggestion("size10SortScoreFirst").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("prefix_aacd"));
assertThat(search.getSuggest().getSuggestion("size10SortScoreFirst").getEntries().get(0).getOptions().get(1).getText().string(), equalTo("prefix_abcc"));
assertThat(search.getSuggest().getSuggestion("size10SortScoreFirst").getEntries().get(0).getOptions().get(2).getText().string(), equalTo("prefix_accd"));
// This fails sometimes, depending on how the docs are sharded: the suggested corrections get their document
// frequency at shard level, which doesn't match the index-level frequency.
// assertThat(search.suggest().suggestions().get(1).getSuggestedWords().get("prefix_abcd").get(3).getTerm(), equalTo("prefix_aaad"));
assertThat(search.getSuggest().getSuggestions().get(2).getName(), equalTo("size3SortScoreFirstMaxEdits1"));
assertThat(search.getSuggest().getSuggestions().get(2).getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(2).getEntries().get(0).getOptions().size(), equalTo(3));
assertThat(search.getSuggest().getSuggestions().get(2).getEntries().get(0).getOptions().get(0).getText().string(), equalTo("prefix_aacd"));
assertThat(search.getSuggest().getSuggestions().get(2).getEntries().get(0).getOptions().get(1).getText().string(), equalTo("prefix_abcc"));
assertThat(search.getSuggest().getSuggestions().get(2).getEntries().get(0).getOptions().get(2).getText().string(), equalTo("prefix_accd"));
assertThat(search.getSuggest().getSuggestion("size3SortScoreFirstMaxEdits1").getName(), equalTo("size3SortScoreFirstMaxEdits1"));
assertThat(search.getSuggest().getSuggestion("size3SortScoreFirstMaxEdits1").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("size3SortScoreFirstMaxEdits1").getEntries().get(0).getOptions().size(), equalTo(3));
assertThat(search.getSuggest().getSuggestion("size3SortScoreFirstMaxEdits1").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("prefix_aacd"));
assertThat(search.getSuggest().getSuggestion("size3SortScoreFirstMaxEdits1").getEntries().get(0).getOptions().get(1).getText().string(), equalTo("prefix_abcc"));
assertThat(search.getSuggest().getSuggestion("size3SortScoreFirstMaxEdits1").getEntries().get(0).getOptions().get(2).getText().string(), equalTo("prefix_accd"));
assertThat(search.getSuggest().getSuggestions().get(3).getName(), equalTo("size10SortFrequencyFirst"));
assertThat(search.getSuggest().getSuggestions().get(3).getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestions().get(3).getEntries().get(0).getOptions().size(), equalTo(10));
assertThat(search.getSuggest().getSuggestions().get(3).getEntries().get(0).getOptions().get(0).getText().string(), equalTo("prefix_aaad"));
assertThat(search.getSuggest().getSuggestions().get(3).getEntries().get(0).getOptions().get(1).getText().string(), equalTo("prefix_abbb"));
assertThat(search.getSuggest().getSuggestions().get(3).getEntries().get(0).getOptions().get(2).getText().string(), equalTo("prefix_aaca"));
assertThat(search.getSuggest().getSuggestions().get(3).getEntries().get(0).getOptions().get(3).getText().string(), equalTo("prefix_abba"));
assertThat(search.getSuggest().getSuggestions().get(3).getEntries().get(0).getOptions().get(4).getText().string(), equalTo("prefix_accc"));
assertThat(search.getSuggest().getSuggestions().get(3).getEntries().get(0).getOptions().get(5).getText().string(), equalTo("prefix_addd"));
assertThat(search.getSuggest().getSuggestions().get(3).getEntries().get(0).getOptions().get(6).getText().string(), equalTo("prefix_abaa"));
assertThat(search.getSuggest().getSuggestions().get(3).getEntries().get(0).getOptions().get(7).getText().string(), equalTo("prefix_dbca"));
assertThat(search.getSuggest().getSuggestions().get(3).getEntries().get(0).getOptions().get(8).getText().string(), equalTo("prefix_cbad"));
assertThat(search.getSuggest().getSuggestions().get(3).getEntries().get(0).getOptions().get(9).getText().string(), equalTo("prefix_aacd"));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getName(), equalTo("size10SortFrequencyFirst"));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getEntries().get(0).getOptions().size(), equalTo(10));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("prefix_aaad"));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getEntries().get(0).getOptions().get(1).getText().string(), equalTo("prefix_abbb"));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getEntries().get(0).getOptions().get(2).getText().string(), equalTo("prefix_aaca"));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getEntries().get(0).getOptions().get(3).getText().string(), equalTo("prefix_abba"));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getEntries().get(0).getOptions().get(4).getText().string(), equalTo("prefix_accc"));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getEntries().get(0).getOptions().get(5).getText().string(), equalTo("prefix_addd"));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getEntries().get(0).getOptions().get(6).getText().string(), equalTo("prefix_abaa"));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getEntries().get(0).getOptions().get(7).getText().string(), equalTo("prefix_dbca"));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getEntries().get(0).getOptions().get(8).getText().string(), equalTo("prefix_cbad"));
assertThat(search.getSuggest().getSuggestion("size10SortFrequencyFirst").getEntries().get(0).getOptions().get(9).getText().string(), equalTo("prefix_aacd"));
// assertThat(search.suggest().suggestions().get(3).getSuggestedWords().get("prefix_abcd").get(4).getTerm(), equalTo("prefix_abcc"));
// assertThat(search.suggest().suggestions().get(3).getSuggestedWords().get("prefix_abcd").get(4).getTerm(), equalTo("prefix_accd"));
}
@Test
public void testMarvelHerosPhraseSuggest() throws ElasticSearchException, IOException {
client.admin().indices().prepareDelete().execute().actionGet();
Builder builder = ImmutableSettings.builder();
builder.put("index.analysis.analyzer.reverse.tokenizer", "standard");
builder.putArray("index.analysis.analyzer.reverse.filter", "lowercase", "reverse");
builder.put("index.analysis.analyzer.body.tokenizer", "standard");
builder.putArray("index.analysis.analyzer.body.filter", "lowercase");
builder.put("index.analysis.analyzer.bigram.tokenizer", "standard");
builder.putArray("index.analysis.analyzer.bigram.filter", "my_shingle", "lowercase");
builder.put("index.analysis.filter.my_shingle.type", "shingle");
builder.put("index.analysis.filter.my_shingle.output_unigrams", false);
builder.put("index.analysis.filter.my_shingle.min_shingle_size", 2);
builder.put("index.analysis.filter.my_shingle.max_shingle_size", 2);
XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1")
.startObject("_all").field("store", "yes").field("termVector", "with_positions_offsets").endObject()
.startObject("properties")
.startObject("body").field("type", "string").field("analyzer", "body").endObject()
.startObject("body_reverse").field("type", "string").field("analyzer", "reverse").endObject()
.startObject("bigram").field("type", "string").field("analyzer", "bigram").endObject()
.endObject()
.endObject().endObject();
client.admin().indices().prepareCreate("test").setSettings(builder.build()).addMapping("type1", mapping).execute().actionGet();
client.admin().cluster().prepareHealth("test").setWaitForGreenStatus().execute().actionGet();
BufferedReader reader = new BufferedReader(new InputStreamReader(SuggestSearchTests.class.getResourceAsStream("/config/names.txt")));
String line = null;
while ((line = reader.readLine()) != null) {
client.prepareIndex("test", "type1")
.setSource(XContentFactory.jsonBuilder()
.startObject()
.field("body", line)
.field("body_reverse", line)
.field("bigram", line)
.endObject()
)
.execute().actionGet();
}
reader.close();
client.admin().indices().prepareRefresh().execute().actionGet();
SearchResponse search = client.prepareSearch()
.setSuggestText("american ame")
.addSuggestion(phraseSuggestion("simple_phrase")
.field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.size(1)).execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getText().string(), equalTo("american ame"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("american ace"));
search = client.prepareSearch()
.setSearchType(SearchType.COUNT)
.setSuggestText("Xor the Got-Jewel")
.addSuggestion(phraseSuggestion("simple_phrase").
realWordErrorLikelihood(0.95f).field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.maxErrors(0.5f)
.size(1))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getText().string(), equalTo("Xor the Got-Jewel"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("xorr the god jewel"));
// pass in a correct phrase
search = client.prepareSearch()
.setSearchType(SearchType.COUNT)
.setSuggestText("Xorr the God-Jewel")
.addSuggestion(phraseSuggestion("simple_phrase").
realWordErrorLikelihood(0.95f).field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.maxErrors(0.5f)
.confidence(0.f)
.size(2))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(2));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getText().string(), equalTo("Xorr the God-Jewel"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("xorr the god jewel"));
// pass in a correct phrase - set confidence to 2
search = client.prepareSearch()
.setSearchType(SearchType.COUNT)
.setSuggestText("Xorr the God-Jewel")
.addSuggestion(phraseSuggestion("simple_phrase").
realWordErrorLikelihood(0.95f).field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.maxErrors(0.5f)
.confidence(2.f))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(0));
// pass in a correct phrase - set confidence to 0.99
search = client.prepareSearch()
.setSearchType(SearchType.COUNT)
.setSuggestText("Xorr the God-Jewel")
.addSuggestion(phraseSuggestion("simple_phrase").
realWordErrorLikelihood(0.95f).field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.maxErrors(0.5f)
.confidence(0.99f))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getText().string(), equalTo("Xorr the God-Jewel"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("xorr the god jewel"));
// test reverse suggestions with pre & post filter
search = client.prepareSearch()
.setSearchType(SearchType.COUNT)
.setSuggestText("xor the yod-Jewel")
.addSuggestion(phraseSuggestion("simple_phrase").
realWordErrorLikelihood(0.95f).field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body_reverse").minWordLength(1).suggestMode("always").preFilter("reverse").postFilter("reverse"))
.maxErrors(0.5f)
.size(1))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getText().string(), equalTo("xor the yod-Jewel"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("xorr the god jewel"));
// set all mass to trigrams (not indexed)
search = client.prepareSearch()
.setSearchType(SearchType.COUNT)
.setSuggestText("Xor the Got-Jewel")
.addSuggestion(phraseSuggestion("simple_phrase").
realWordErrorLikelihood(0.95f).field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.smoothingModel(new PhraseSuggestionBuilder.LinearInterpolation(1,0,0))
.maxErrors(0.5f)
.size(1))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(0));
// set all mass to bigrams
search = client.prepareSearch()
.setSearchType(SearchType.COUNT)
.setSuggestText("Xor the Got-Jewel")
.addSuggestion(phraseSuggestion("simple_phrase").
realWordErrorLikelihood(0.95f).field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.smoothingModel(new PhraseSuggestionBuilder.LinearInterpolation(0,1,0))
.maxErrors(0.5f)
.size(1))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getText().string(), equalTo("Xor the Got-Jewel"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("xorr the god jewel"));
// distribute mass
search = client.prepareSearch()
.setSearchType(SearchType.COUNT)
.setSuggestText("Xor the Got-Jewel")
.addSuggestion(phraseSuggestion("simple_phrase").
realWordErrorLikelihood(0.95f).field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.smoothingModel(new PhraseSuggestionBuilder.LinearInterpolation(0.4,0.4,0.2))
.maxErrors(0.5f)
.size(1))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getText().string(), equalTo("Xor the Got-Jewel"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("xorr the god jewel"));
search = client.prepareSearch()
.setSuggestText("american ame")
.addSuggestion(phraseSuggestion("simple_phrase")
.field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.smoothingModel(new PhraseSuggestionBuilder.LinearInterpolation(0.4,0.4,0.2))
.size(1)).execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getText().string(), equalTo("american ame"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("american ace"));
// try all smoothing methods
search = client.prepareSearch()
.setSearchType(SearchType.COUNT)
.setSuggestText("Xor the Got-Jewel")
.addSuggestion(phraseSuggestion("simple_phrase").
realWordErrorLikelihood(0.95f).field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.smoothingModel(new PhraseSuggestionBuilder.LinearInterpolation(0.4,0.4,0.2))
.maxErrors(0.5f)
.size(1))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getText().string(), equalTo("Xor the Got-Jewel"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("xorr the god jewel"));
search = client.prepareSearch()
.setSearchType(SearchType.COUNT)
.setSuggestText("Xor the Got-Jewel")
.addSuggestion(phraseSuggestion("simple_phrase").
realWordErrorLikelihood(0.95f).field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.smoothingModel(new PhraseSuggestionBuilder.Laplace(0.2))
.maxErrors(0.5f)
.size(1))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getText().string(), equalTo("Xor the Got-Jewel"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("xorr the god jewel"));
search = client.prepareSearch()
.setSearchType(SearchType.COUNT)
.setSuggestText("Xor the Got-Jewel")
.addSuggestion(phraseSuggestion("simple_phrase").
realWordErrorLikelihood(0.95f).field("bigram").gramSize(2).analyzer("body")
.addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
.smoothingModel(new PhraseSuggestionBuilder.StupidBackoff(1.0))
.maxErrors(0.5f)
.size(1))
.execute().actionGet();
assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
assertThat(search.getSuggest(), notNullValue());
assertThat(search.getSuggest().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(1));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getText().string(), equalTo("Xor the Got-Jewel"));
assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("xorr the god jewel"));
}
@Test
public void testPhraseBoundaryCases() throws ElasticSearchException, IOException {
client.admin().indices().prepareDelete().execute().actionGet();
Builder builder = ImmutableSettings.builder();
builder.put("index.analysis.analyzer.body.tokenizer", "standard");
builder.putArray("index.analysis.analyzer.body.filter", "lowercase");
builder.put("index.analysis.analyzer.bigram.tokenizer", "standard");
builder.putArray("index.analysis.analyzer.bigram.filter", "my_shingle", "lowercase");
builder.put("index.analysis.analyzer.ngram.tokenizer", "standard");
builder.putArray("index.analysis.analyzer.ngram.filter", "my_shingle2", "lowercase");
builder.put("index.analysis.filter.my_shingle.type", "shingle");
builder.put("index.analysis.filter.my_shingle.output_unigrams", false);
builder.put("index.analysis.filter.my_shingle.min_shingle_size", 2);
builder.put("index.analysis.filter.my_shingle.max_shingle_size", 2);
builder.put("index.analysis.filter.my_shingle2.type", "shingle");
builder.put("index.analysis.filter.my_shingle2.output_unigrams", true);
builder.put("index.analysis.filter.my_shingle2.min_shingle_size", 2);
builder.put("index.analysis.filter.my_shingle2.max_shingle_size", 2);
XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1")
.startObject("_all").field("store", "yes").field("termVector", "with_positions_offsets").endObject()
.startObject("properties")
.startObject("body").field("type", "string").field("analyzer", "body").endObject()
.startObject("bigram").field("type", "string").field("analyzer", "bigram").endObject()
.startObject("ngram").field("type", "string").field("analyzer", "ngram").endObject()
.endObject()
.endObject().endObject();
client.admin().indices().prepareCreate("test").setSettings(builder.build()).addMapping("type1", mapping).execute().actionGet();
client.admin().cluster().prepareHealth("test").setWaitForGreenStatus().execute().actionGet();
BufferedReader reader = new BufferedReader(new InputStreamReader(SuggestSearchTests.class.getResourceAsStream("/config/names.txt")));
String line = null;
while ((line = reader.readLine()) != null) {
client.prepareIndex("test", "type1")
.setSource(XContentFactory.jsonBuilder()
.startObject()
.field("body", line)
.field("bigram", line)
.field("ngram", line)
.endObject()
)
.execute().actionGet();
}
reader.close();
client.admin().indices().prepareRefresh().execute().actionGet();
        try {
            client.prepareSearch()
                    .setSearchType(SearchType.COUNT)
                    .setSuggestText("Xor the Got-Jewel")
                    .addSuggestion(
                            phraseSuggestion("simple_phrase")
                                    .realWordErrorLikelihood(0.95f)
                                    .field("bigram")
                                    .gramSize(2)
                                    .analyzer("body")
                                    .addCandidateGenerator(
                                            PhraseSuggestionBuilder.candidateGenerator("does_not_exists").minWordLength(1)
                                                    .suggestMode("always")).maxErrors(0.5f).size(1)).execute().actionGet();
            assert false : "field does not exist";
        } catch (SearchPhaseExecutionException e) {
            // expected: the candidate generator references a non-existing field
        }
        try {
            client.prepareSearch()
                    .setSearchType(SearchType.COUNT)
                    .setSuggestText("Xor the Got-Jewel")
                    .addSuggestion(
                            phraseSuggestion("simple_phrase").realWordErrorLikelihood(0.95f).field("bigram").maxErrors(0.5f)
                                    .size(1)).execute().actionGet();
            assert false : "analyzer only produces ngrams";
        } catch (SearchPhaseExecutionException e) {
            // expected
        }
        try {
            client.prepareSearch()
                    .setSearchType(SearchType.COUNT)
                    .setSuggestText("Xor the Got-Jewel")
                    .addSuggestion(
                            phraseSuggestion("simple_phrase").realWordErrorLikelihood(0.95f).field("bigram").analyzer("bigram").maxErrors(0.5f)
                                    .size(1)).execute().actionGet();
            assert false : "analyzer only produces ngrams";
        } catch (SearchPhaseExecutionException e) {
            // expected
        }
        // don't force unigrams
        client.prepareSearch()
                .setSearchType(SearchType.COUNT)
                .setSuggestText("Xor the Got-Jewel")
                .addSuggestion(
                        phraseSuggestion("simple_phrase").realWordErrorLikelihood(0.95f).field("bigram").gramSize(2).analyzer("bigram").forceUnigrams(false).maxErrors(0.5f)
                                .size(1)).execute().actionGet();
        client.prepareSearch()
                .setSearchType(SearchType.COUNT)
                .setSuggestText("Xor the Got-Jewel")
                .addSuggestion(
                        phraseSuggestion("simple_phrase").realWordErrorLikelihood(0.95f).field("bigram").analyzer("ngram").maxErrors(0.5f)
                                .size(1)).execute().actionGet();
        SearchResponse search = client.prepareSearch()
                .setSearchType(SearchType.COUNT)
                .setSuggestText("Xor the Got-Jewel")
                .addSuggestion(
                        phraseSuggestion("simple_phrase").maxErrors(0.5f).field("ngram")
                                .addCandidateGenerator(PhraseSuggestionBuilder.candidateGenerator("body").minWordLength(1).suggestMode("always"))
                                .size(1)).execute().actionGet();
        assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
        assertThat(search.getSuggest(), notNullValue());
        assertThat(search.getSuggest().size(), equalTo(1));
        assertThat(search.getSuggest().getSuggestion("simple_phrase").getName(), equalTo("simple_phrase"));
        assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().size(), equalTo(1));
        assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().size(), equalTo(1));
        assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getText().string(), equalTo("Xor the Got-Jewel"));
        assertThat(search.getSuggest().getSuggestion("simple_phrase").getEntries().get(0).getOptions().get(0).getText().string(), equalTo("xorr the god jewel"));
    }
    @Test
    public void testDifferentShardSize() throws Exception {
        // test suggestion with explicitly added different shard sizes
        client.admin().indices().prepareDelete().execute().actionGet();
        client.admin().indices().prepareCreate("test")
                .setSettings(settingsBuilder()
                        .put(SETTING_NUMBER_OF_SHARDS, 5)
                        .put(SETTING_NUMBER_OF_REPLICAS, 0))
                .execute().actionGet();
        client.admin().cluster().prepareHealth("test").setWaitForGreenStatus().execute().actionGet();

        client.prepareIndex("test", "type1", "1")
                .setSource(XContentFactory.jsonBuilder()
                        .startObject()
                        .field("field1", "foobar1")
                        .endObject()
                ).setRouting("1").execute().actionGet();
        client.prepareIndex("test", "type1", "2")
                .setSource(XContentFactory.jsonBuilder()
                        .startObject()
                        .field("field1", "foobar2")
                        .endObject()
                ).setRouting("2").execute().actionGet();
        client.prepareIndex("test", "type1", "3")
                .setSource(XContentFactory.jsonBuilder()
                        .startObject()
                        .field("field1", "foobar3")
                        .endObject()
                ).setRouting("1").execute().actionGet();
        client.admin().indices().prepareRefresh().execute().actionGet();

        SearchResponse search = client.prepareSearch()
                .setSuggestText("foobar")
                .addSuggestion(termSuggestion("simple")
                        .size(10).minDocFreq(0).field("field1").suggestMode("always"))
                .execute().actionGet();
        assertThat(Arrays.toString(search.getShardFailures()), search.getFailedShards(), equalTo(0));
        assertThat(search.getSuggest(), notNullValue());
        assertThat(search.getSuggest().size(), equalTo(1));
        assertThat(search.getSuggest().getSuggestion("simple").getName(), equalTo("simple"));
        assertThat(search.getSuggest().getSuggestion("simple").getEntries().size(), equalTo(1));
        assertThat(search.getSuggest().getSuggestion("simple").getEntries().get(0).getOptions().size(), equalTo(3));
    }
}

package org.elasticsearch.test.unit.search.suggest.phrase;
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.equalTo;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.reverse.ReverseStringFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SolrSynonymParser;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.SuggestMode;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;
import org.elasticsearch.search.suggest.phrase.CandidateGenerator;
import org.elasticsearch.search.suggest.phrase.Correction;
import org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator;
import org.elasticsearch.search.suggest.phrase.LaplaceScorer;
import org.elasticsearch.search.suggest.phrase.LinearInterpoatingScorer;
import org.elasticsearch.search.suggest.phrase.MultiCandidateGeneratorWrapper;
import org.elasticsearch.search.suggest.phrase.NoisyChannelSpellChecker;
import org.elasticsearch.search.suggest.phrase.StupidBackoffScorer;
import org.elasticsearch.search.suggest.phrase.WordScorer;
import org.testng.annotations.Test;
public class NoisyChannelSpellCheckerTests {

    @Test
    public void testMarvelHeros() throws IOException {
        RAMDirectory dir = new RAMDirectory();
        Map<String, Analyzer> mapping = new HashMap<String, Analyzer>();
        mapping.put("body_ngram", new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer t = new StandardTokenizer(Version.LUCENE_41, reader);
                ShingleFilter tf = new ShingleFilter(t, 2, 3);
                tf.setOutputUnigrams(false);
                return new TokenStreamComponents(t, new LowerCaseFilter(Version.LUCENE_41, tf));
            }
        });
        mapping.put("body", new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer t = new StandardTokenizer(Version.LUCENE_41, reader);
                return new TokenStreamComponents(t, new LowerCaseFilter(Version.LUCENE_41, t));
            }
        });
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer(Version.LUCENE_41), mapping);

        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_41, wrapper);
        IndexWriter writer = new IndexWriter(dir, conf);
        BufferedReader reader = new BufferedReader(new InputStreamReader(NoisyChannelSpellCheckerTests.class.getResourceAsStream("/config/names.txt")));
        String line = null;
        while ((line = reader.readLine()) != null) {
            Document doc = new Document();
            doc.add(new Field("body", line, TextField.TYPE_NOT_STORED));
            doc.add(new Field("body_ngram", line, TextField.TYPE_NOT_STORED));
            writer.addDocument(doc);
        }

        DirectoryReader ir = DirectoryReader.open(writer, false);
        WordScorer wordScorer = new LaplaceScorer(ir, "body_ngram", 0.95d, new BytesRef(" "), 0.5f);

        NoisyChannelSpellChecker suggester = new NoisyChannelSpellChecker();
        DirectSpellChecker spellchecker = new DirectSpellChecker();
        spellchecker.setMinQueryLength(1);
        DirectCandidateGenerator generator = new DirectCandidateGenerator(spellchecker, "body", SuggestMode.SUGGEST_MORE_POPULAR, ir, 0.95);
        Correction[] corrections = suggester.getCorrections(wrapper, new BytesRef("american ame"), generator, 5, 1, 1, ir, "body", wordScorer, 1, 2);
        assertThat(corrections.length, equalTo(1));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("american ace"));

        corrections = suggester.getCorrections(wrapper, new BytesRef("american ame"), generator, 5, 1, 1, ir, "body", wordScorer, 0, 1);
        assertThat(corrections.length, equalTo(1));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("american ame"));

        suggester = new NoisyChannelSpellChecker(0.85);
        wordScorer = new LaplaceScorer(ir, "body_ngram", 0.85d, new BytesRef(" "), 0.5f);
        corrections = suggester.getCorrections(wrapper, new BytesRef("Xor the Got-Jewel"), generator, 5, 0.5f, 4, ir, "body", wordScorer, 0, 2);
        assertThat(corrections.length, equalTo(4));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the god jewel"));
        assertThat(corrections[1].join(new BytesRef(" ")).utf8ToString(), equalTo("xor the god jewel"));
        assertThat(corrections[2].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the got jewel"));
        assertThat(corrections[3].join(new BytesRef(" ")).utf8ToString(), equalTo("xorn the god jewel"));

        corrections = suggester.getCorrections(wrapper, new BytesRef("Xor the Got-Jewel"), generator, 5, 0.5f, 4, ir, "body", wordScorer, 1, 2);
        assertThat(corrections.length, equalTo(4));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the god jewel"));
        assertThat(corrections[1].join(new BytesRef(" ")).utf8ToString(), equalTo("xor the god jewel"));
        assertThat(corrections[2].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the got jewel"));
        assertThat(corrections[3].join(new BytesRef(" ")).utf8ToString(), equalTo("xorn the god jewel"));

        // test synonyms
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer t = new StandardTokenizer(Version.LUCENE_41, reader);
                TokenFilter filter = new LowerCaseFilter(Version.LUCENE_41, t);
                try {
                    SolrSynonymParser parser = new SolrSynonymParser(true, false, new WhitespaceAnalyzer(Version.LUCENE_41));
                    parser.add(new StringReader("usa => usa, america, american\nursa => usa, america, american"));
                    filter = new SynonymFilter(filter, parser.build(), true);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                return new TokenStreamComponents(t, filter);
            }
        };
        spellchecker.setAccuracy(0.0f);
        spellchecker.setMinPrefix(1);
        spellchecker.setMinQueryLength(1);
        suggester = new NoisyChannelSpellChecker(0.85);
        wordScorer = new LaplaceScorer(ir, "body_ngram", 0.85d, new BytesRef(" "), 0.5f);
        corrections = suggester.getCorrections(analyzer, new BytesRef("captian usa"), generator, 10, 2, 4, ir, "body", wordScorer, 1, 2);
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("captain america"));

        generator = new DirectCandidateGenerator(spellchecker, "body", SuggestMode.SUGGEST_MORE_POPULAR, ir, 0.85, null, analyzer);
        corrections = suggester.getCorrections(analyzer, new BytesRef("captian usw"), generator, 10, 2, 4, ir, "body", wordScorer, 1, 2);
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("captain america"));
    }
    @Test
    public void testMarvelHerosMultiGenerator() throws IOException {
        RAMDirectory dir = new RAMDirectory();
        Map<String, Analyzer> mapping = new HashMap<String, Analyzer>();
        mapping.put("body_ngram", new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer t = new StandardTokenizer(Version.LUCENE_41, reader);
                ShingleFilter tf = new ShingleFilter(t, 2, 3);
                tf.setOutputUnigrams(false);
                return new TokenStreamComponents(t, new LowerCaseFilter(Version.LUCENE_41, tf));
            }
        });
        mapping.put("body", new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer t = new StandardTokenizer(Version.LUCENE_41, reader);
                return new TokenStreamComponents(t, new LowerCaseFilter(Version.LUCENE_41, t));
            }
        });
        mapping.put("body_reverse", new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer t = new StandardTokenizer(Version.LUCENE_41, reader);
                return new TokenStreamComponents(t, new ReverseStringFilter(Version.LUCENE_41, new LowerCaseFilter(Version.LUCENE_41, t)));
            }
        });
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer(Version.LUCENE_41), mapping);

        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_41, wrapper);
        IndexWriter writer = new IndexWriter(dir, conf);
        BufferedReader reader = new BufferedReader(new InputStreamReader(NoisyChannelSpellCheckerTests.class.getResourceAsStream("/config/names.txt")));
        String line = null;
        while ((line = reader.readLine()) != null) {
            Document doc = new Document();
            doc.add(new Field("body", line, TextField.TYPE_NOT_STORED));
            doc.add(new Field("body_reverse", line, TextField.TYPE_NOT_STORED));
            doc.add(new Field("body_ngram", line, TextField.TYPE_NOT_STORED));
            writer.addDocument(doc);
        }

        DirectoryReader ir = DirectoryReader.open(writer, false);
        LaplaceScorer wordScorer = new LaplaceScorer(ir, "body_ngram", 0.95d, new BytesRef(" "), 0.5f);

        NoisyChannelSpellChecker suggester = new NoisyChannelSpellChecker();
        DirectSpellChecker spellchecker = new DirectSpellChecker();
        spellchecker.setMinQueryLength(1);
        DirectCandidateGenerator forward = new DirectCandidateGenerator(spellchecker, "body", SuggestMode.SUGGEST_MORE_POPULAR, ir, 0.95);
        DirectCandidateGenerator reverse = new DirectCandidateGenerator(spellchecker, "body_reverse", SuggestMode.SUGGEST_MORE_POPULAR, ir, 0.95, wrapper, wrapper);
        CandidateGenerator generator = new MultiCandidateGeneratorWrapper(forward, reverse);

        Correction[] corrections = suggester.getCorrections(wrapper, new BytesRef("american cae"), generator, 5, 1, 1, ir, "body", wordScorer, 1, 2);
        assertThat(corrections.length, equalTo(1));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("american ace"));

        corrections = suggester.getCorrections(wrapper, new BytesRef("american ame"), generator, 5, 1, 1, ir, "body", wordScorer, 1, 2);
        assertThat(corrections.length, equalTo(1));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("american ace"));

        corrections = suggester.getCorrections(wrapper, new BytesRef("american cae"), forward, 5, 1, 1, ir, "body", wordScorer, 1, 2);
        assertThat(corrections.length, equalTo(0)); // only use forward with constant prefix

        corrections = suggester.getCorrections(wrapper, new BytesRef("america cae"), generator, 5, 2, 1, ir, "body", wordScorer, 1, 2);
        assertThat(corrections.length, equalTo(1));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("american ace"));

        corrections = suggester.getCorrections(wrapper, new BytesRef("Zorr the Got-Jewel"), generator, 5, 0.5f, 4, ir, "body", wordScorer, 0, 2);
        assertThat(corrections.length, equalTo(4));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the god jewel"));
        assertThat(corrections[1].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the got jewel"));
        assertThat(corrections[2].join(new BytesRef(" ")).utf8ToString(), equalTo("zorr the god jewel"));
        assertThat(corrections[3].join(new BytesRef(" ")).utf8ToString(), equalTo("gorr the god jewel"));

        corrections = suggester.getCorrections(wrapper, new BytesRef("Zorr the Got-Jewel"), generator, 5, 0.5f, 1, ir, "body", wordScorer, 1.5f, 2);
        assertThat(corrections.length, equalTo(1));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the god jewel"));

        corrections = suggester.getCorrections(wrapper, new BytesRef("Xor the Got-Jewel"), generator, 5, 0.5f, 1, ir, "body", wordScorer, 1.5f, 2);
        assertThat(corrections.length, equalTo(1));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the god jewel"));
    }
    @Test
    public void testMarvelHerosTrigram() throws IOException {
        RAMDirectory dir = new RAMDirectory();
        Map<String, Analyzer> mapping = new HashMap<String, Analyzer>();
        mapping.put("body_ngram", new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer t = new StandardTokenizer(Version.LUCENE_41, reader);
                ShingleFilter tf = new ShingleFilter(t, 2, 3);
                tf.setOutputUnigrams(false);
                return new TokenStreamComponents(t, new LowerCaseFilter(Version.LUCENE_41, tf));
            }
        });
        mapping.put("body", new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer t = new StandardTokenizer(Version.LUCENE_41, reader);
                return new TokenStreamComponents(t, new LowerCaseFilter(Version.LUCENE_41, t));
            }
        });
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer(Version.LUCENE_41), mapping);

        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_41, wrapper);
        IndexWriter writer = new IndexWriter(dir, conf);
        BufferedReader reader = new BufferedReader(new InputStreamReader(NoisyChannelSpellCheckerTests.class.getResourceAsStream("/config/names.txt")));
        String line = null;
        while ((line = reader.readLine()) != null) {
            Document doc = new Document();
            doc.add(new Field("body", line, TextField.TYPE_NOT_STORED));
            doc.add(new Field("body_ngram", line, TextField.TYPE_NOT_STORED));
            writer.addDocument(doc);
        }

        DirectoryReader ir = DirectoryReader.open(writer, false);
        WordScorer wordScorer = new LinearInterpoatingScorer(ir, "body_ngram", 0.85d, new BytesRef(" "), 0.5, 0.4, 0.1);

        NoisyChannelSpellChecker suggester = new NoisyChannelSpellChecker();
        DirectSpellChecker spellchecker = new DirectSpellChecker();
        spellchecker.setMinQueryLength(1);
        DirectCandidateGenerator generator = new DirectCandidateGenerator(spellchecker, "body", SuggestMode.SUGGEST_MORE_POPULAR, ir, 0.95);
        Correction[] corrections = suggester.getCorrections(wrapper, new BytesRef("american ame"), generator, 5, 1, 1, ir, "body", wordScorer, 1, 3);
        assertThat(corrections.length, equalTo(1));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("american ace"));

        corrections = suggester.getCorrections(wrapper, new BytesRef("american ame"), generator, 5, 1, 1, ir, "body", wordScorer, 1, 1);
        assertThat(corrections.length, equalTo(0));
        // assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("american ape"));

        wordScorer = new LinearInterpoatingScorer(ir, "body_ngram", 0.85d, new BytesRef(" "), 0.5, 0.4, 0.1);
        corrections = suggester.getCorrections(wrapper, new BytesRef("Xor the Got-Jewel"), generator, 5, 0.5f, 4, ir, "body", wordScorer, 0, 3);
        assertThat(corrections.length, equalTo(4));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the god jewel"));
        assertThat(corrections[1].join(new BytesRef(" ")).utf8ToString(), equalTo("xorn the god jewel"));
        assertThat(corrections[2].join(new BytesRef(" ")).utf8ToString(), equalTo("xor the god jewel"));
        assertThat(corrections[3].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the gog jewel"));

        corrections = suggester.getCorrections(wrapper, new BytesRef("Xor the Got-Jewel"), generator, 5, 0.5f, 4, ir, "body", wordScorer, 1, 3);
        assertThat(corrections.length, equalTo(4));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the god jewel"));
        assertThat(corrections[1].join(new BytesRef(" ")).utf8ToString(), equalTo("xorn the god jewel"));
        assertThat(corrections[2].join(new BytesRef(" ")).utf8ToString(), equalTo("xor the god jewel"));
        assertThat(corrections[3].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the gog jewel"));

        corrections = suggester.getCorrections(wrapper, new BytesRef("Xor the Got-Jewel"), generator, 5, 0.5f, 1, ir, "body", wordScorer, 100, 3);
        assertThat(corrections.length, equalTo(1));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the god jewel"));

        // test synonyms
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer t = new StandardTokenizer(Version.LUCENE_41, reader);
                TokenFilter filter = new LowerCaseFilter(Version.LUCENE_41, t);
                try {
                    SolrSynonymParser parser = new SolrSynonymParser(true, false, new WhitespaceAnalyzer(Version.LUCENE_41));
                    parser.add(new StringReader("usa => usa, america, american\nursa => usa, america, american"));
                    filter = new SynonymFilter(filter, parser.build(), true);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                return new TokenStreamComponents(t, filter);
            }
        };
        spellchecker.setAccuracy(0.0f);
        spellchecker.setMinPrefix(1);
        spellchecker.setMinQueryLength(1);
        suggester = new NoisyChannelSpellChecker(0.95);
        wordScorer = new LinearInterpoatingScorer(ir, "body_ngram", 0.95d, new BytesRef(" "), 0.5, 0.4, 0.1);
        corrections = suggester.getCorrections(analyzer, new BytesRef("captian usa"), generator, 10, 2, 4, ir, "body", wordScorer, 1, 3);
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("captain america"));
        assertThat(corrections[1].join(new BytesRef(" ")).utf8ToString(), equalTo("captain american"));
        assertThat(corrections[2].join(new BytesRef(" ")).utf8ToString(), equalTo("captain ursa"));

        generator = new DirectCandidateGenerator(spellchecker, "body", SuggestMode.SUGGEST_MORE_POPULAR, ir, 0.95, null, analyzer);
        corrections = suggester.getCorrections(analyzer, new BytesRef("captian usw"), generator, 10, 2, 4, ir, "body", wordScorer, 1, 3);
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("captain america"));
        assertThat(corrections[1].join(new BytesRef(" ")).utf8ToString(), equalTo("captain american"));
        assertThat(corrections[2].join(new BytesRef(" ")).utf8ToString(), equalTo("captain usw"));

        wordScorer = new StupidBackoffScorer(ir, "body_ngram", 0.85d, new BytesRef(" "), 0.4);
        corrections = suggester.getCorrections(wrapper, new BytesRef("Xor the Got-Jewel"), generator, 5, 0.5f, 2, ir, "body", wordScorer, 0, 3);
        assertThat(corrections.length, equalTo(2));
        assertThat(corrections[0].join(new BytesRef(" ")).utf8ToString(), equalTo("xorr the god jewel"));
        assertThat(corrections[1].join(new BytesRef(" ")).utf8ToString(), equalTo("xorn the god jewel"));
    }
}