Term vector request
================================

Returns information and statistics on terms in the fields of a particular document as stored in the index.

        curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true'

Three types of values can be requested: term information, term statistics and field statistics.
By default, all term information and field statistics are returned for all fields, but term statistics are excluded.

Optionally, you can specify the fields for which the information is retrieved, either with a parameter in the URL

	curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?fields=text,...'

or by adding the requested fields in the request body (see example below).

Term information
-------------------------

- term frequency in the field (always returned)
- term positions ("positions" : true)
- start and end offsets ("offsets" : true)
- term payloads ("payloads" : true), as base64 encoded bytes

If the requested information wasn't stored in the index, it will be omitted without further warning.
See [mapping](http://www.elasticsearch.org/guide/reference/mapping/core-types/) on how to configure your index to store term vectors.

Term statistics
-------------------------

Setting "term_statistics" to "true" (default is "false") will return

- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.

Field statistics
-------------------------

Setting "field_statistics" to "false" (default is "true") will omit

- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)

Behavior
-------------------------

The term and field statistics are not accurate: deleted documents are not taken into account, and the information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures, whereas the absolute numbers have no meaning in this context.

Example
-------------------------

First, we create an index that stores term vectors, payloads, etc.:

    curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
        "mappings": {
            "tweet": {
                "properties": {
                    "text": {
                                "type": "string",
                                "term_vector": "with_positions_offsets_payloads",
                                "store" : "yes",
                                "index_analyzer" : "fulltext_analyzer"
                         },
                     "fullname": {
                                "type": "string",
                                "term_vector": "with_positions_offsets_payloads",
                                "index_analyzer" : "fulltext_analyzer"
                         }
                 }
            }
        },
        "settings" : {
            "index" : {
                "number_of_shards" : 1,
                "number_of_replicas" : 0
            },
            "analysis": {
                    "analyzer": {
                        "fulltext_analyzer": {
                            "type": "custom",
                            "tokenizer": "whitespace",
                            "filter": [
                                "lowercase",
                                "type_as_payload"
                            ]
                        }
                    }
            }
         }
    }'

Second, we add some documents:

    curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
      "fullname" : "John Doe",
      "text" : "twitter test test test "

    }'

    curl -XPUT 'http://localhost:9200/twitter/tweet/2?pretty=true' -d '{
      "fullname" : "Jane Doe",
      "text" : "Another twitter test ..."

    }'

The following request returns all information and statistics for field "text" in document "1" (John Doe):

     curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
                    "fields" : ["text"],
                    "offsets" : true,
                    "payloads" : true,
                    "positions" : true,
                    "term_statistics" : true,
                    "field_statistics" : true
        }'

Equivalently, all parameters can be passed as URI parameters:

     curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true&fields=text&offsets=true&payloads=true&positions=true&term_statistics=true&field_statistics=true'

Response:

    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "1",
      "_version" : 1,
      "exists" : true,
      "term_vectors" : {
        "text" : {
          "field_statistics" : {
            "sum_doc_freq" : 6,
            "doc_count" : 2,
            "sum_ttf" : 8
          },
          "terms" : {
            "test" : {
              "doc_freq" : 2,
              "ttf" : 4,
              "term_freq" : 3,
              "pos" : [ 1, 2, 3 ],
              "start" : [ 8, 13, 18 ],
              "end" : [ 12, 17, 22 ],
              "payload" : [ "d29yZA==", "d29yZA==", "d29yZA==" ]
            },
            "twitter" : {
              "doc_freq" : 2,
              "ttf" : 2,
              "term_freq" : 1,
              "pos" : [ 0 ],
              "start" : [ 0 ],
              "end" : [ 7 ],
              "payload" : [ "d29yZA==" ]
            }
          }
        }
      }
    }
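
The same request can also be issued through the Java API added by this commit. A minimal sketch, assuming client is an already connected Client instance:

    import org.apache.lucene.index.Fields;
    import org.elasticsearch.action.termvector.TermVectorResponse;
    import org.elasticsearch.client.Client;

    public class TermVectorApiExample {
        // Sketch: fetch the term vector of tweet "1", mirroring the curl request above.
        public static Fields termVectorFor(Client client) throws Exception {
            TermVectorResponse response = client.prepareTermVector("twitter", "tweet", "1")
                    .setSelectedFields("text")
                    .setOffsets(true)
                    .setPayloads(true)
                    .setPositions(true)
                    .setTermStatistics(true)
                    .setFieldStatistics(true)
                    .execute().actionGet();
            // The response exposes the term vectors as a Lucene Fields instance.
            return response.getFields();
        }
    }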

Further changes
-------------------------

XContentBuilder:
new method public XContentBuilder field(XContentBuilderString name, int offset, int length, int... value)
to write an integer array.
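
A minimal usage sketch of this new method (hypothetical example values; the offset and length arguments select the valid slice of a reusable buffer, as in TermVectorResponse below):

    import java.io.IOException;

    import org.elasticsearch.common.xcontent.XContentBuilder;
    import org.elasticsearch.common.xcontent.XContentBuilderString;
    import org.elasticsearch.common.xcontent.XContentFactory;

    public class FieldArrayExample {
        public static void main(String[] args) throws IOException {
            // A reusable buffer may hold more slots than valid entries, so
            // offset and length select the slice that is actually written.
            XContentBuilderString pos = new XContentBuilderString("pos");
            int freq = 3;
            int[] positions = { 1, 2, 3, 0, 0 }; // only the first freq values are valid
            XContentBuilder builder = XContentFactory.jsonBuilder();
            builder.startObject();
            builder.field(pos, 0, freq, positions); // the method added here
            builder.endObject();
            System.out.println(builder.string()); // prints {"pos":[1,2,3]}
        }
    }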

IndicesAnalysisService:
make the token filter for saving payloads available in Elasticsearch.

AbstractFieldMapper/TypeParser:
make the term vector options string available and also fix the parsing of this string:
with_positions_payloads is actually allowed, as can be seen in TermVectorsConsumerPerFields.

Closes #3114
Commit 11d08ac436 (parent 945b89fd80) by Britta Weber, 2013-05-29 19:31:19 +02:00.
20 changed files with 2859 additions and 2 deletions.

@@ -119,6 +119,8 @@ import org.elasticsearch.action.search.type.*;
import org.elasticsearch.action.suggest.SuggestAction;
import org.elasticsearch.action.suggest.TransportSuggestAction;
import org.elasticsearch.action.support.TransportAction;
import org.elasticsearch.action.termvector.TermVectorAction;
import org.elasticsearch.action.termvector.TransportSingleShardTermVectorAction;
import org.elasticsearch.action.update.TransportUpdateAction;
import org.elasticsearch.action.update.UpdateAction;
import org.elasticsearch.common.inject.AbstractModule;
@@ -210,6 +212,7 @@ public class ActionModule extends AbstractModule {
registerAction(IndexAction.INSTANCE, TransportIndexAction.class);
registerAction(GetAction.INSTANCE, TransportGetAction.class);
registerAction(TermVectorAction.INSTANCE, TransportSingleShardTermVectorAction.class);
registerAction(DeleteAction.INSTANCE, TransportDeleteAction.class,
TransportIndexDeleteAction.class, TransportShardDeleteAction.class);
registerAction(CountAction.INSTANCE, TransportCountAction.class);

@@ -0,0 +1,46 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.action.termvector;
import org.elasticsearch.action.Action;
import org.elasticsearch.client.Client;
/**
*/
public class TermVectorAction extends Action<TermVectorRequest, TermVectorResponse, TermVectorRequestBuilder> {
public static final TermVectorAction INSTANCE = new TermVectorAction();
public static final String NAME = "tv";
private TermVectorAction() {
super(NAME);
}
@Override
public TermVectorResponse newResponse() {
return new TermVectorResponse();
}
@Override
public TermVectorRequestBuilder newRequestBuilder(Client client) {
return new TermVectorRequestBuilder(client);
}
}

@@ -0,0 +1,469 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.action.termvector;
import static org.apache.lucene.util.ArrayUtil.grow;
import gnu.trove.map.hash.TObjectLongHashMap;
import java.io.IOException;
import java.util.Comparator;
import java.util.Iterator;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.RamUsageEstimator;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.io.stream.BytesStreamInput;
/**
* This class represents the result of a {@link TermVectorRequest}. It works
* exactly like the {@link Fields} class except for one thing: It can return
* offsets and payloads even if positions are not present. You must call
* nextPosition() anyway to move the counter although this method only returns
* <tt>-1</tt> if no positions were returned by the {@link TermVectorRequest}.
*
* The data is stored in two byte arrays ({@code headerRef} and
* {@code termVectors}, both {@link BytesRef}) that have the following format:
* <p>
* {@code headerRef}: Stores offsets per field in the {@code termVectors} array
* and some header information as {@link BytesRef}. Format is
* <ul>
* <li>String : "TV"</li>
* <li>vint: version (=-1)</li>
* <li>boolean: hasTermStatistics (are the term statistics stored?)</li>
* <li>boolean: hasFieldStatistics (are the field statistics stored?)</li>
* <li>vint: number of fields</li>
* <ul>
* <li>String: field name 1</li>
* <li>vint: offset in {@code termVectors} for field 1</li>
* <li>...</li>
* <li>String: field name last field</li>
* <li>vint: offset in {@code termVectors} for last field</li>
* </ul>
* </ul>
*
* termVectors: Stores the actual term vectors as a {@link BytesRef}.
*
* Term vectors for each fields are stored in blocks, one for each field. The
* offsets in {@code headerRef} are used to find where the block for a field
* starts. Each block begins with a
* <ul>
* <li>vint: number of terms</li>
* <li>boolean: positions (has it positions stored?)</li>
* <li>boolean: offsets (has it offsets stored?)</li>
* <li>boolean: payloads (has it payloads stored?)</li>
* </ul>
* If the field statistics were requested ({@code hasFieldStatistics} is true,
* see {@code headerRef}), the following numbers are stored:
* <ul>
* <li>vlong: sum of total term frequencies of the field (sumTotalTermFreq)</li>
* <li>vlong: sum of document frequencies for each term (sumDocFreq)</li>
* <li>vint: number of documents in the shard that has an entry for this field
* (docCount)</li>
* </ul>
*
* After that, for each term it stores
* <ul>
* <ul>
* <li>vint: term length</li>
* <li>BytesRef: term name</li>
* </ul>
*
* If term statistics are requested ({@code hasTermStatistics} is true, see
* {@code headerRef}):
* <ul>
* <li>vint: document frequency, how often does this term appear in documents?</li>
* <li>vlong: total term frequency, how often this term occurs in all documents.</li>
* </ul>
* After that
* <ul>
* <li>vint: frequency (always returned)</li>
* <ul>
* <li>vint: position_1 (if positions == true)</li>
* <li>vint: startOffset_1 (if offset == true)</li>
* <li>vint: endOffset_1 (if offset == true)</li>
* <li>BytesRef: payload_1 (if payloads == true)</li>
* <li>...</li>
* <li>vint: endOffset_frequency (if offset == true)</li>
* <li>BytesRef: payload_frequency (if payloads == true)</li>
* <ul>
* </ul> </ul>
*
*/
public final class TermVectorFields extends Fields {
final private TObjectLongHashMap<String> fieldMap;
final private BytesReference termVectors;
final boolean hasTermStatistic;
final boolean hasFieldStatistic;
/**
* @param headerRef
* Stores offsets per field in the {@code termVectors} and some
* header information as {@link BytesRef}.
*
* @param termVectors
* Stores the actual term vectors as a {@link BytesRef}.
*
*/
public TermVectorFields(BytesReference headerRef, BytesReference termVectors) throws IOException {
BytesStreamInput header = new BytesStreamInput(headerRef);
fieldMap = new TObjectLongHashMap<String>();
// here we read the header to fill the field offset map
String headerString = header.readString();
assert headerString.equals("TV");
int version = header.readInt();
assert version == -1;
hasTermStatistic = header.readBoolean();
hasFieldStatistic = header.readBoolean();
final int numFields = header.readVInt();
for (int i = 0; i < numFields; i++) {
fieldMap.put((header.readString()), header.readVLong());
}
header.close();
// reference to the term vector data
this.termVectors = termVectors;
}
@Override
public Iterator<String> iterator() {
return fieldMap.keySet().iterator();
}
@Override
public Terms terms(String field) throws IOException {
// first, find where in the termVectors bytes the actual term vector for
// this field is stored
Long offset = fieldMap.get(field);
final BytesStreamInput perFieldTermVectorInput = new BytesStreamInput(this.termVectors);
perFieldTermVectorInput.reset();
perFieldTermVectorInput.skip(offset.longValue());
// read how many terms....
final long numTerms = perFieldTermVectorInput.readVLong();
// ...if positions etc. were stored....
final boolean hasPositions = perFieldTermVectorInput.readBoolean();
final boolean hasOffsets = perFieldTermVectorInput.readBoolean();
final boolean hasPayloads = perFieldTermVectorInput.readBoolean();
// read the field statistics
final long sumTotalTermFreq = hasFieldStatistic ? readPotentiallyNegativeVLong(perFieldTermVectorInput) : -1;
final long sumDocFreq = hasFieldStatistic ? readPotentiallyNegativeVLong(perFieldTermVectorInput) : -1;
final int docCount = hasFieldStatistic ? readPotentiallyNegativeVInt(perFieldTermVectorInput) : -1;
return new Terms() {
@Override
public TermsEnum iterator(TermsEnum reuse) throws IOException {
// convert bytes ref for the terms to actual data
return new TermsEnum() {
int currentTerm = 0;
int freq = 0;
int docFreq = -1;
long totalTermFrequency = -1;
int[] positions = new int[1];
int[] startOffsets = new int[1];
int[] endOffsets = new int[1];
BytesRef[] payloads = new BytesRef[1];
final BytesRef spare = new BytesRef();
@Override
public BytesRef next() throws IOException {
if (currentTerm++ < numTerms) {
// term string. first the size...
int termVectorSize = perFieldTermVectorInput.readVInt();
spare.grow(termVectorSize);
// ...then the value.
perFieldTermVectorInput.readBytes(spare.bytes, 0, termVectorSize);
spare.length = termVectorSize;
if (hasTermStatistic) {
docFreq = readPotentiallyNegativeVInt(perFieldTermVectorInput);
totalTermFrequency = readPotentiallyNegativeVLong(perFieldTermVectorInput);
}
freq = readPotentiallyNegativeVInt(perFieldTermVectorInput);
// grow the arrays to read the values. this is just
// for performance reasons. Re-use memory instead of
// realloc.
growBuffers();
// finally, read the values into the arrays
// currentPosition etc. so that we can just iterate
// later
writeInfos(perFieldTermVectorInput);
return spare;
} else {
return null;
}
}
private void writeInfos(final BytesStreamInput input) throws IOException {
for (int i = 0; i < freq; i++) {
if (hasPositions) {
positions[i] = input.readVInt();
}
if (hasOffsets) {
startOffsets[i] = input.readVInt();
endOffsets[i] = input.readVInt();
}
if (hasPayloads) {
int payloadLength = input.readVInt();
if (payloadLength > 0) {
if (payloads[i] == null) {
payloads[i] = new BytesRef(payloadLength);
} else {
payloads[i].grow(payloadLength);
}
input.readBytes(payloads[i].bytes, 0, payloadLength);
payloads[i].length = payloadLength;
payloads[i].offset = 0;
}
}
}
}
private void growBuffers() {
if (hasPositions) {
positions = grow(positions, freq);
}
if (hasOffsets) {
startOffsets = grow(startOffsets, freq);
endOffsets = grow(endOffsets, freq);
}
if (hasPayloads) {
if (payloads.length < freq) {
final BytesRef[] newArray = new BytesRef[ArrayUtil.oversize(freq, RamUsageEstimator.NUM_BYTES_OBJECT_REF)];
System.arraycopy(payloads, 0, newArray, 0, payloads.length);
payloads = newArray;
}
}
}
@Override
public Comparator<BytesRef> getComparator() {
return BytesRef.getUTF8SortedAsUnicodeComparator();
}
@Override
public SeekStatus seekCeil(BytesRef text, boolean useCache) throws IOException {
throw new UnsupportedOperationException();
}
@Override
public void seekExact(long ord) throws IOException {
throw new UnsupportedOperationException("Seek is not supported");
}
@Override
public BytesRef term() throws IOException {
return spare;
}
@Override
public long ord() throws IOException {
throw new UnsupportedOperationException("ordinals are not supported");
}
@Override
public int docFreq() throws IOException {
return docFreq;
}
@Override
public long totalTermFreq() throws IOException {
return totalTermFrequency;
}
@Override
public DocsEnum docs(Bits liveDocs, DocsEnum reuse, int flags) throws IOException {
return docsAndPositions(liveDocs, reuse instanceof DocsAndPositionsEnum ? (DocsAndPositionsEnum) reuse : null, 0);
}
@Override
public DocsAndPositionsEnum docsAndPositions(Bits liveDocs, DocsAndPositionsEnum reuse, int flags) throws IOException {
final TermVectorsDocsAndPosEnum retVal = (reuse instanceof TermVectorsDocsAndPosEnum ? (TermVectorsDocsAndPosEnum) reuse
: new TermVectorsDocsAndPosEnum());
return retVal.reset(hasPositions ? positions : null, hasOffsets ? startOffsets : null, hasOffsets ? endOffsets
: null, hasPayloads ? payloads : null, freq);
}
};
}
@Override
public Comparator<BytesRef> getComparator() {
return BytesRef.getUTF8SortedAsUnicodeComparator();
}
@Override
public long size() throws IOException {
return numTerms;
}
@Override
public long getSumTotalTermFreq() throws IOException {
return sumTotalTermFreq;
}
@Override
public long getSumDocFreq() throws IOException {
return sumDocFreq;
}
@Override
public int getDocCount() throws IOException {
return docCount;
}
@Override
public boolean hasOffsets() {
return hasOffsets;
}
@Override
public boolean hasPositions() {
return hasPositions;
}
@Override
public boolean hasPayloads() {
return hasPayloads;
}
};
}
@Override
public int size() {
return fieldMap.size();
}
private final class TermVectorsDocsAndPosEnum extends DocsAndPositionsEnum {
private boolean hasPositions;
private boolean hasOffsets;
private boolean hasPayloads;
int curPos = -1;
int doc = -1;
private int freq;
private int[] startOffsets;
private int[] positions;
private BytesRef[] payloads;
private int[] endOffsets;
private DocsAndPositionsEnum reset(int[] positions, int[] startOffsets, int[] endOffsets, BytesRef[] payloads, int freq) {
curPos = -1;
doc = -1;
this.hasPositions = positions != null;
this.hasOffsets = startOffsets != null;
this.hasPayloads = payloads != null;
this.freq = freq;
this.startOffsets = startOffsets;
this.endOffsets = endOffsets;
this.payloads = payloads;
this.positions = positions;
return this;
}
@Override
public int nextDoc() throws IOException {
return doc = (doc == -1 ? 0 : NO_MORE_DOCS);
}
@Override
public int docID() {
return doc;
}
@Override
public int advance(int target) throws IOException {
while (nextDoc() < target && doc != NO_MORE_DOCS) {
}
return doc;
}
@Override
public int freq() throws IOException {
return freq;
}
// call nextPosition once before calling this one
// because else counter is not advanced
@Override
public int startOffset() throws IOException {
assert curPos < freq && curPos >= 0;
return hasOffsets ? startOffsets[curPos] : -1;
}
@Override
// can return -1 if positions were not requested or
// stored but offsets were stored and requested
public int nextPosition() throws IOException {
assert curPos + 1 < freq;
++curPos;
// this is kind of cheating but if you don't need positions
// we save lots of space on the wire
return hasPositions ? positions[curPos] : -1;
}
@Override
public BytesRef getPayload() throws IOException {
assert curPos < freq && curPos >= 0;
return hasPayloads ? payloads[curPos] : null;
}
@Override
public int endOffset() throws IOException {
assert curPos < freq && curPos >= 0;
return hasOffsets ? endOffsets[curPos] : -1;
}
@Override
public long cost() {
return 1;
}
}
// read a vInt. this is used if the integer might be negative. In this case,
// the writer writes a 0 for -1 or value +1 and accordingly we have to
// subtract 1 again
// adds one to mock not existing term freq
int readPotentiallyNegativeVInt(BytesStreamInput stream) throws IOException {
return stream.readVInt() - 1;
}
// read a vLong. this is used if the integer might be negative. In this
// case, the writer writes a 0 for -1 or value +1 and accordingly we have to
// subtract 1 again
// adds one to mock not existing term freq
long readPotentiallyNegativeVLong(BytesStreamInput stream) throws IOException {
return stream.readVLong() - 1;
}
}

@@ -0,0 +1,303 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.action.termvector;
import java.io.IOException;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.Set;
import org.elasticsearch.action.ActionRequestValidationException;
import org.elasticsearch.action.ValidateActions;
import org.elasticsearch.action.support.single.shard.SingleShardOperationRequest;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import com.google.common.collect.Sets;
/**
* Request returning the term vector (doc frequency, positions, offsets) for a
* document.
* <p>
* Note, the {@link #index()}, {@link #type(String)} and {@link #id(String)} are
* required.
*/
public class TermVectorRequest extends SingleShardOperationRequest<TermVectorRequest> {
private String type;
private String id;
private String routing;
protected String preference;
private Set<String> selectedFields;
private EnumSet<Flag> flagsEnum = EnumSet.of(Flag.Positions, Flag.Offsets, Flag.Payloads,
Flag.FieldStatistics);
TermVectorRequest() {
}
/**
* Constructs a new term vector request for a document that will be fetched
* from the provided index. Use {@link #type(String)} and
* {@link #id(String)} to specify the document to load.
*/
public TermVectorRequest(String index, String type, String id) {
super(index);
this.id = id;
this.type = type;
}
public EnumSet<Flag> getFlags() {
return flagsEnum;
}
/**
* Returns the type of document to get the term vector for.
*/
public String type() {
return type;
}
/**
* Returns the id of document the term vector is requested for.
*/
public String id() {
return id;
}
/**
* @return The routing for this request.
*/
public String routing() {
return routing;
}
public void routing(String routing) {
this.routing = routing;
}
/**
* Sets the parent id of this document. Will simply set the routing to this
* value, as it is only used for routing with delete requests.
*/
public TermVectorRequest parent(String parent) {
if (routing == null) {
routing = parent;
}
return this;
}
public String preference() {
return this.preference;
}
/**
* Sets the preference to execute the search. Defaults to randomize across
* shards. Can be set to <tt>_local</tt> to prefer local shards,
* <tt>_primary</tt> to execute only on primary shards, or a custom value,
* which guarantees that the same order will be used across different
* requests.
*/
public TermVectorRequest preference(String preference) {
this.preference = preference;
return this;
}
/**
* Return the start and stop offsets for each term if they were stored or
* skip offsets.
*/
public TermVectorRequest offsets(boolean offsets) {
setFlag(Flag.Offsets, offsets);
return this;
}
/**
* @return <code>true</code> if term offsets should be returned. Otherwise
* <code>false</code>
*/
public boolean offsets() {
return flagsEnum.contains(Flag.Offsets);
}
/**
* Return the positions for each term if stored or skip.
*/
public TermVectorRequest positions(boolean positions) {
setFlag(Flag.Positions, positions);
return this;
}
/**
* @return <code>true</code> if the positions for each term should be
* returned. Otherwise <code>false</code>
*/
public boolean positions() {
return flagsEnum.contains(Flag.Positions);
}
/**
* @return <code>true</code> if term payloads should be returned. Otherwise
* <code>false</code>
*/
public boolean payloads() {
return flagsEnum.contains(Flag.Payloads);
}
/**
* Return the payloads for each term or skip.
*/
public TermVectorRequest payloads(boolean payloads) {
setFlag(Flag.Payloads, payloads);
return this;
}
/**
* @return <code>true</code> if term statistics should be returned.
* Otherwise <code>false</code>
*/
public boolean termStatistics() {
return flagsEnum.contains(Flag.TermStatistics);
}
/**
* Return the term statistics for each term in the shard or skip.
*/
public TermVectorRequest termStatistics(boolean termStatistics) {
setFlag(Flag.TermStatistics, termStatistics);
return this;
}
/**
* @return <code>true</code> if field statistics should be returned.
* Otherwise <code>false</code>
*/
public boolean fieldStatistics() {
return flagsEnum.contains(Flag.FieldStatistics);
}
/**
* Return the field statistics for each term in the shard or skip.
*/
public TermVectorRequest fieldStatistics(boolean fieldStatistics) {
setFlag(Flag.FieldStatistics, fieldStatistics);
return this;
}
/**
* Return only term vectors for special selected fields. Returns the term
* vectors for all fields if selectedFields == null
*/
public Set<String> selectedFields() {
return selectedFields;
}
/**
* Return only term vectors for special selected fields. Returns the term
* vectors for all fields if selectedFields == null
*/
public TermVectorRequest selectedFields(String[] fields) {
selectedFields = fields != null && fields.length != 0 ? Sets.newHashSet(fields) : null;
return this;
}
private void setFlag(Flag flag, boolean set) {
if (set && !flagsEnum.contains(flag)) {
flagsEnum.add(flag);
} else if (!set) {
flagsEnum.remove(flag);
assert (!flagsEnum.contains(flag));
}
}
@Override
public ActionRequestValidationException validate() {
ActionRequestValidationException validationException = null;
if (index == null) {
validationException = ValidateActions.addValidationError("index is missing", validationException);
}
if (type == null) {
validationException = ValidateActions.addValidationError("type is missing", validationException);
}
if (id == null) {
validationException = ValidateActions.addValidationError("id is missing", validationException);
}
return validationException;
}
@Override
public void readFrom(StreamInput in) throws IOException {
super.readFrom(in);
index = in.readString();
type = in.readString();
id = in.readString();
routing = in.readOptionalString();
preference = in.readOptionalString();
long flags = in.readVLong();
flagsEnum.clear();
for (Flag flag : Flag.values()) {
if ((flags & (1 << flag.ordinal())) != 0) {
flagsEnum.add(flag);
}
}
int numSelectedFields = in.readVInt();
if (numSelectedFields > 0) {
selectedFields = new HashSet<String>();
for (int i = 0; i < numSelectedFields; i++) {
selectedFields.add(in.readString());
}
}
}
@Override
public void writeTo(StreamOutput out) throws IOException {
super.writeTo(out);
out.writeString(index);
out.writeString(type);
out.writeString(id);
out.writeOptionalString(routing);
out.writeOptionalString(preference);
long longFlags = 0;
for (Flag flag : flagsEnum) {
longFlags |= (1 << flag.ordinal());
}
out.writeVLong(longFlags);
if (selectedFields != null) {
out.writeVInt(selectedFields.size());
for (String selectedField : selectedFields) {
out.writeString(selectedField);
}
} else {
out.writeVInt(0);
}
}
public static enum Flag {
// Do not change the order of these flags we use
// the ordinal for encoding! Only append to the end!
Positions, Offsets, Payloads, FieldStatistics, TermStatistics;
}
}

@@ -0,0 +1,81 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.action.termvector;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.ActionRequestBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.internal.InternalClient;
/**
*/
public class TermVectorRequestBuilder extends ActionRequestBuilder<TermVectorRequest, TermVectorResponse, TermVectorRequestBuilder> {
public TermVectorRequestBuilder(Client client) {
super((InternalClient) client, new TermVectorRequest());
}
public TermVectorRequestBuilder(Client client, String index, String type, String id) {
super((InternalClient) client, new TermVectorRequest(index, type, id));
}
/**
* Sets the routing. Required if routing isn't id based.
*/
public TermVectorRequestBuilder setRouting(String routing) {
request.routing(routing);
return this;
}
public TermVectorRequestBuilder setOffsets(boolean offsets) {
request.offsets(offsets);
return this;
}
public TermVectorRequestBuilder setPositions(boolean positions) {
request.positions(positions);
return this;
}
public TermVectorRequestBuilder setPayloads(boolean payloads) {
request.payloads(payloads);
return this;
}
public TermVectorRequestBuilder setTermStatistics(boolean termStatistics) {
request.termStatistics(termStatistics);
return this;
}
public TermVectorRequestBuilder setFieldStatistics(boolean fieldStatistics) {
request.fieldStatistics(fieldStatistics);
return this;
}
public TermVectorRequestBuilder setSelectedFields(String... fields) {
request.selectedFields(fields);
return this;
}
@Override
protected void doExecute(ActionListener<TermVectorResponse> listener) {
((Client) client).termVector(request, listener);
}
}

@@ -0,0 +1,323 @@
/*
* Licensed to Elastic Search and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. Elastic Search licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.action.termvector;
import java.io.IOException;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.Set;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.UnicodeUtil;
import org.elasticsearch.ElasticSearchIllegalStateException;
import org.elasticsearch.action.ActionResponse;
import org.elasticsearch.action.termvector.TermVectorRequest.Flag;
import org.elasticsearch.common.Base64;
import org.elasticsearch.common.bytes.BytesArray;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentBuilderString;
import com.google.common.collect.Iterators;
public class TermVectorResponse extends ActionResponse implements ToXContent {
private static class FieldStrings {
// term statistics strings
public static final XContentBuilderString TTF = new XContentBuilderString("ttf");
public static final XContentBuilderString DOC_FREQ = new XContentBuilderString("doc_freq");
public static final XContentBuilderString TERM_FREQ = new XContentBuilderString("term_freq");
// field statistics strings
public static final XContentBuilderString FIELD_STATISTICS = new XContentBuilderString("field_statistics");
public static final XContentBuilderString DOC_COUNT = new XContentBuilderString("doc_count");
public static final XContentBuilderString SUM_DOC_FREQ = new XContentBuilderString("sum_doc_freq");
public static final XContentBuilderString SUM_TTF = new XContentBuilderString("sum_ttf");
public static final XContentBuilderString POS = new XContentBuilderString("pos");
public static final XContentBuilderString START_OFFSET = new XContentBuilderString("start");
public static final XContentBuilderString END_OFFSET = new XContentBuilderString("end");
public static final XContentBuilderString PAYLOAD = new XContentBuilderString("payload");
public static final XContentBuilderString _INDEX = new XContentBuilderString("_index");
public static final XContentBuilderString _TYPE = new XContentBuilderString("_type");
public static final XContentBuilderString _ID = new XContentBuilderString("_id");
public static final XContentBuilderString _VERSION = new XContentBuilderString("_version");
public static final XContentBuilderString EXISTS = new XContentBuilderString("exists");
public static final XContentBuilderString TERMS = new XContentBuilderString("terms");
public static final XContentBuilderString TERM_VECTORS = new XContentBuilderString("term_vectors");
}
private BytesReference termVectors;
private BytesReference headerRef;
private String index;
private String type;
private String id;
private long docVersion;
private boolean sourceCopied = false;
int[] curentPositions = new int[0];
int[] currentStartOffset = new int[0];
int[] currentEndOffset = new int[0];
BytesReference[] currentPayloads = new BytesReference[0];
public TermVectorResponse(String index, String type, String id) {
this.index = index;
this.type = type;
this.id = id;
}
TermVectorResponse() {
}
@Override
public void writeTo(StreamOutput out) throws IOException {
out.writeString(index);
out.writeString(type);
out.writeString(id);
out.writeVLong(docVersion);
final boolean docExists = documentExists();
out.writeBoolean(docExists);
if (docExists) {
out.writeBytesReference(headerRef);
out.writeBytesReference(termVectors);
}
}
@Override
public void readFrom(StreamInput in) throws IOException {
index = in.readString();
type = in.readString();
id = in.readString();
docVersion = in.readVLong();
if (in.readBoolean()) {
headerRef = in.readBytesReference();
termVectors = in.readBytesReference();
}
}
public Fields getFields() throws IOException {
if (documentExists()) {
if (!sourceCopied) { // make the bytes safe
headerRef = headerRef.copyBytesArray();
termVectors = termVectors.copyBytesArray();
}
return new TermVectorFields(headerRef, termVectors);
} else {
return new Fields() {
@Override
public Iterator<String> iterator() {
return Iterators.emptyIterator();
}
@Override
public Terms terms(String field) throws IOException {
return null;
}
@Override
public int size() {
return 0;
}
};
}
}
@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
assert index != null;
assert type != null;
assert id != null;
builder.startObject();
builder.field(FieldStrings._INDEX, index);
builder.field(FieldStrings._TYPE, type);
builder.field(FieldStrings._ID, id);
builder.field(FieldStrings._VERSION, docVersion);
builder.field(FieldStrings.EXISTS, documentExists());
if (!documentExists()) {
builder.endObject();
return builder;
}
builder.startObject(FieldStrings.TERM_VECTORS);
final CharsRef spare = new CharsRef();
Fields theFields = getFields();
Iterator<String> fieldIter = theFields.iterator();
while (fieldIter.hasNext()) {
buildField(builder, spare, theFields, fieldIter);
}
builder.endObject();
builder.endObject();
return builder;
}
private void buildField(XContentBuilder builder, final CharsRef spare, Fields theFields, Iterator<String> fieldIter) throws IOException {
String fieldName = fieldIter.next();
builder.startObject(fieldName);
Terms curTerms = theFields.terms(fieldName);
// write field statistics
buildFieldStatistics(builder, curTerms);
builder.startObject(FieldStrings.TERMS);
TermsEnum termIter = curTerms.iterator(null);
for (int i = 0; i < curTerms.size(); i++) {
buildTerm(builder, spare, curTerms, termIter);
}
builder.endObject();
builder.endObject();
}
private void buildTerm(XContentBuilder builder, final CharsRef spare, Terms curTerms, TermsEnum termIter) throws IOException {
// start term, optimized writing
BytesRef term = termIter.next();
UnicodeUtil.UTF8toUTF16(term, spare);
builder.startObject(spare.toString());
buildTermStatistics(builder, termIter);
// finally write the term vectors
DocsAndPositionsEnum posEnum = termIter.docsAndPositions(null, null);
int termFreq = posEnum.freq();
builder.field(FieldStrings.TERM_FREQ, termFreq);
initMemory(curTerms, termFreq);
initValues(curTerms, posEnum, termFreq);
buildValues(builder, curTerms, termFreq);
builder.endObject();
}
private void buildTermStatistics(XContentBuilder builder, TermsEnum termIter) throws IOException {
// write term statistics. At this point we do not naturally have a
// boolean that says if these values actually were requested.
// However, we can assume that they were not if the statistic values are
// <= 0.
assert (((termIter.docFreq() > 0) && (termIter.totalTermFreq() > 0)) || ((termIter.docFreq() == -1) && (termIter.totalTermFreq() == -1)));
int docFreq = termIter.docFreq();
if (docFreq > 0) {
builder.field(FieldStrings.DOC_FREQ, docFreq);
builder.field(FieldStrings.TTF, termIter.totalTermFreq());
}
}
private void buildValues(XContentBuilder builder, Terms curTerms, int termFreq) throws IOException {
if (curTerms.hasPositions()) {
builder.field(FieldStrings.POS, 0, termFreq, curentPositions);
}
if (curTerms.hasOffsets()) {
builder.field(FieldStrings.START_OFFSET, 0, termFreq, currentStartOffset);
builder.field(FieldStrings.END_OFFSET, 0, termFreq, currentEndOffset);
}
if (curTerms.hasPayloads()) {
builder.array(FieldStrings.PAYLOAD, (Object[])currentPayloads);
}
}
private void initValues(Terms curTerms, DocsAndPositionsEnum posEnum, int termFreq) throws IOException {
for (int j = 0; j < termFreq; j++) {
int nextPos = posEnum.nextPosition();
if (curTerms.hasPositions()) {
curentPositions[j] = nextPos;
}
if (curTerms.hasOffsets()) {
currentStartOffset[j] = posEnum.startOffset();
currentEndOffset[j] = posEnum.endOffset();
}
if (curTerms.hasPayloads()) {
BytesRef curPaypoad = posEnum.getPayload();
currentPayloads[j] = new BytesArray(curPaypoad.bytes, 0, curPaypoad.length);
}
}
}
private void initMemory(Terms curTerms, int termFreq) {
// init memory for performance reasons
if (curTerms.hasPositions()) {
curentPositions = ArrayUtil.grow(curentPositions, termFreq);
}
if (curTerms.hasOffsets()) {
currentStartOffset = ArrayUtil.grow(currentStartOffset, termFreq);
currentEndOffset = ArrayUtil.grow(currentEndOffset, termFreq);
}
if (curTerms.hasPayloads()) {
currentPayloads = new BytesArray[termFreq];
}
}
private void buildFieldStatistics(XContentBuilder builder, Terms curTerms) throws IOException {
long sumDocFreq = curTerms.getSumDocFreq();
int docCount = curTerms.getDocCount();
long sumTotalTermFrequencies = curTerms.getSumTotalTermFreq();
if (docCount > 0) {
assert ((sumDocFreq > 0)) : "docCount >= 0 but sumDocFreq ain't!";
assert ((sumTotalTermFrequencies > 0)) : "docCount >= 0 but sumTotalTermFrequencies ain't!";
builder.startObject(FieldStrings.FIELD_STATISTICS);
builder.field(FieldStrings.SUM_DOC_FREQ, sumDocFreq);
builder.field(FieldStrings.DOC_COUNT, docCount);
builder.field(FieldStrings.SUM_TTF, sumTotalTermFrequencies);
builder.endObject();
} else if (docCount == -1) { // this should only be -1 if the field
// statistics were not requested at all. In
// this case all 3 values should be -1
assert ((sumDocFreq == -1)) : "docCount was -1 but sumDocFreq ain't!";
assert ((sumTotalTermFrequencies == -1)) : "docCount was -1 but sumTotalTermFrequencies ain't!";
} else {
throw new ElasticSearchIllegalStateException(
"Something is wrong with the field statistics of the term vector request: Values are " + "\n"
+ FieldStrings.SUM_DOC_FREQ + " " + sumDocFreq + "\n" + FieldStrings.DOC_COUNT + " " + docCount + "\n"
+ FieldStrings.SUM_TTF + " " + sumTotalTermFrequencies);
}
}
public boolean documentExists() {
assert (headerRef == null && termVectors == null) || (headerRef != null && termVectors != null);
return headerRef != null;
}
public void setFields(Fields fields, Set<String> selectedFields, EnumSet<Flag> flags, Fields topLevelFields) throws IOException {
TermVectorWriter tvw = new TermVectorWriter(this);
tvw.setFields(fields, selectedFields, flags, topLevelFields);
}
public void setTermVectorField(BytesStreamOutput output) {
termVectors = output.bytes();
}
public void setHeader(BytesReference header) {
headerRef = header;
}
public void setDocVersion(long version) {
this.docVersion = version;
}
}

@@ -0,0 +1,236 @@
/*
* Licensed to Elastic Search and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. Elastic Search licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.action.termvector;
import java.io.IOException;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.action.termvector.TermVectorRequest.Flag;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
// package only - this is an internal class!
final class TermVectorWriter {
final List<String> fields = new ArrayList<String>();
final List<Long> fieldOffset = new ArrayList<Long>();
final BytesStreamOutput output = new BytesStreamOutput(1); // can we somehow
// predict the
// size here?
private static final String HEADER = "TV";
private static final int CURRENT_VERSION = -1;
TermVectorResponse response = null;
TermVectorWriter(TermVectorResponse termVectorResponse) throws IOException {
response = termVectorResponse;
}
void setFields(Fields fields, Set<String> selectedFields, EnumSet<Flag> flags, Fields topLevelFields) throws IOException {
int numFieldsWritten = 0;
TermsEnum iterator = null;
DocsAndPositionsEnum docsAndPosEnum = null;
DocsEnum docsEnum = null;
TermsEnum topLevelIterator = null;
for (String field : fields) {
if ((selectedFields != null) && (!selectedFields.contains(field))) {
continue;
}
Terms terms = fields.terms(field);
Terms topLevelTerms = topLevelFields.terms(field);
topLevelIterator = topLevelTerms.iterator(topLevelIterator);
boolean positions = flags.contains(Flag.Positions) && terms.hasPositions();
boolean offsets = flags.contains(Flag.Offsets) && terms.hasOffsets();
boolean payloads = flags.contains(Flag.Payloads) && terms.hasPayloads();
startField(field, terms.size(), positions, offsets, payloads);
if (flags.contains(Flag.FieldStatistics)) {
writeFieldStatistics(topLevelTerms);
}
iterator = terms.iterator(iterator);
final boolean useDocsAndPos = positions || offsets || payloads;
while (iterator.next() != null) { // iterate all terms of the
// current field
// get the doc frequency
BytesRef term = iterator.term();
boolean foundTerm = topLevelIterator.seekExact(term, false);
assert (foundTerm);
startTerm(term);
if (flags.contains(Flag.TermStatistics)) {
writeTermStatistics(topLevelIterator);
}
if (useDocsAndPos) {
// given we have pos or offsets
docsAndPosEnum = writeTermWithDocsAndPos(iterator, docsAndPosEnum, positions, offsets, payloads);
} else {
// if we do not have the positions stored, we need to
// get the frequency from a DocsEnum.
docsEnum = writeTermWithDocsOnly(iterator, docsEnum);
}
}
numFieldsWritten++;
}
response.setTermVectorField(output);
response.setHeader(writeHeader(numFieldsWritten, flags.contains(Flag.TermStatistics), flags.contains(Flag.FieldStatistics)));
}
private BytesReference writeHeader(int numFieldsWritten, boolean getTermStatistics, boolean getFieldStatistics) throws IOException {
// now, write the information about offset of the terms in the
// termVectors field
BytesStreamOutput header = new BytesStreamOutput();
header.writeString(HEADER);
header.writeInt(CURRENT_VERSION);
header.writeBoolean(getTermStatistics);
header.writeBoolean(getFieldStatistics);
header.writeVInt(numFieldsWritten);
for (int i = 0; i < fields.size(); i++) {
header.writeString(fields.get(i));
header.writeVLong(fieldOffset.get(i).longValue());
}
header.close();
return header.bytes();
}
private DocsEnum writeTermWithDocsOnly(TermsEnum iterator, DocsEnum docsEnum) throws IOException {
docsEnum = iterator.docs(null, docsEnum);
int nextDoc = docsEnum.nextDoc();
assert nextDoc != DocsEnum.NO_MORE_DOCS;
writeFreq(docsEnum.freq());
nextDoc = docsEnum.nextDoc();
assert nextDoc == DocsEnum.NO_MORE_DOCS;
return docsEnum;
}
private DocsAndPositionsEnum writeTermWithDocsAndPos(TermsEnum iterator, DocsAndPositionsEnum docsAndPosEnum, boolean positions,
boolean offsets, boolean payloads) throws IOException {
docsAndPosEnum = iterator.docsAndPositions(null, docsAndPosEnum);
// for each term (iterator next) in this field (field)
// iterate over the docs (should only be one)
int nextDoc = docsAndPosEnum.nextDoc();
assert nextDoc != DocsEnum.NO_MORE_DOCS;
final int freq = docsAndPosEnum.freq();
writeFreq(freq);
for (int j = 0; j < freq; j++) {
int curPos = docsAndPosEnum.nextPosition();
if (positions) {
writePosition(curPos);
}
if (offsets) {
writeOffsets(docsAndPosEnum.startOffset(), docsAndPosEnum.endOffset());
}
if (payloads) {
writePayload(docsAndPosEnum.getPayload());
}
}
nextDoc = docsAndPosEnum.nextDoc();
assert nextDoc == DocsEnum.NO_MORE_DOCS;
return docsAndPosEnum;
}
private void writePayload(BytesRef payload) throws IOException {
assert (payload != null);
if (payload != null) {
output.writeVInt(payload.length);
output.writeBytes(payload.bytes, payload.offset, payload.length);
}
}
private void writeFreq(int termFreq) throws IOException {
writePotentiallyNegativeVInt(termFreq);
}
private void writeOffsets(int startOffset, int endOffset) throws IOException {
assert (startOffset >= 0);
assert (endOffset >= 0);
if ((startOffset >= 0) && (endOffset >= 0)) {
output.writeVInt(startOffset);
output.writeVInt(endOffset);
}
}
private void writePosition(int pos) throws IOException {
assert (pos >= 0);
if (pos >= 0) {
output.writeVInt(pos);
}
}
private void startField(String fieldName, long termsSize, boolean writePositions, boolean writeOffsets, boolean writePayloads)
throws IOException {
fields.add(fieldName);
fieldOffset.add(output.position());
output.writeVLong(termsSize);
// add information on if positions etc. are written
output.writeBoolean(writePositions);
output.writeBoolean(writeOffsets);
output.writeBoolean(writePayloads);
}
private void startTerm(BytesRef term) throws IOException {
output.writeVInt(term.length);
output.writeBytes(term.bytes, term.offset, term.length);
}
private void writeTermStatistics(TermsEnum topLevelIterator) throws IOException {
int docFreq = topLevelIterator.docFreq();
assert (docFreq >= 0);
writePotentiallyNegativeVInt(docFreq);
long ttf = topLevelIterator.totalTermFreq();
assert (ttf >= 0);
writePotentiallyNegativeVLong(ttf);
}
private void writeFieldStatistics(Terms topLevelTerms) throws IOException {
long sttf = topLevelTerms.getSumTotalTermFreq();
assert (sttf >= 0);
writePotentiallyNegativeVLong(sttf);
long sdf = topLevelTerms.getSumDocFreq();
assert (sdf >= 0);
writePotentiallyNegativeVLong(sdf);
int dc = topLevelTerms.getDocCount();
assert (dc >= 0);
writePotentiallyNegativeVInt(dc);
}
private void writePotentiallyNegativeVInt(int value) throws IOException {
// term freq etc. can be negative if not present... we transport that
// further...
output.writeVInt(Math.max(0, value + 1));
}
private void writePotentiallyNegativeVLong(long value) throws IOException {
// term freq etc. can be negative if not present... we transport that
// further...
output.writeVLong(Math.max(0, value + 1));
}
}

@@ -0,0 +1,138 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.action.termvector;
import java.util.List;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.elasticsearch.ElasticSearchException;
import org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction;
import org.elasticsearch.cluster.ClusterService;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.block.ClusterBlockException;
import org.elasticsearch.cluster.block.ClusterBlockLevel;
import org.elasticsearch.cluster.routing.ShardIterator;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.lucene.Lucene;
import org.elasticsearch.common.lucene.uid.Versions;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.engine.Engine;
import org.elasticsearch.index.mapper.Uid;
import org.elasticsearch.index.mapper.internal.UidFieldMapper;
import org.elasticsearch.index.service.IndexService;
import org.elasticsearch.index.shard.service.IndexShard;
import org.elasticsearch.indices.IndicesService;
import org.elasticsearch.threadpool.ThreadPool;
import org.elasticsearch.transport.TransportService;
/**
* Performs the get operation.
*/
public class TransportSingleShardTermVectorAction extends TransportShardSingleOperationAction<TermVectorRequest, TermVectorResponse> {
private final IndicesService indicesService;
@Inject
public TransportSingleShardTermVectorAction(Settings settings, ClusterService clusterService, TransportService transportService,
IndicesService indicesService, ThreadPool threadPool) {
super(settings, threadPool, clusterService, transportService);
this.indicesService = indicesService;
}
@Override
protected String executor() {
// TODO: Is this the right pool to execute this on?
return ThreadPool.Names.GET;
}
@Override
protected String transportAction() {
return TermVectorAction.NAME;
}
@Override
protected ClusterBlockException checkGlobalBlock(ClusterState state, TermVectorRequest request) {
return state.blocks().globalBlockedException(ClusterBlockLevel.READ);
}
@Override
protected ClusterBlockException checkRequestBlock(ClusterState state, TermVectorRequest request) {
return state.blocks().indexBlockedException(ClusterBlockLevel.READ, request.index());
}
@Override
protected ShardIterator shards(ClusterState state, TermVectorRequest request) {
return clusterService.operationRouting().getShards(clusterService.state(), request.index(), request.type(), request.id(),
request.routing(), request.preference());
}
@Override
protected void resolveRequest(ClusterState state, TermVectorRequest request) {
// update the routing (request#index here is possibly an alias)
request.routing(state.metaData().resolveIndexRouting(request.routing(), request.index()));
request.index(state.metaData().concreteIndex(request.index()));
}
@Override
protected TermVectorResponse shardOperation(TermVectorRequest request, int shardId) throws ElasticSearchException {
IndexService indexService = indicesService.indexServiceSafe(request.index());
IndexShard indexShard = indexService.shardSafe(shardId);
final Engine.Searcher searcher = indexShard.searcher();
IndexReader topLevelReader = searcher.reader();
final TermVectorResponse termVectorResponse = new TermVectorResponse(request.index(), request.type(), request.id());
final Term uidTerm = new Term(UidFieldMapper.NAME, Uid.createUidAsBytes(request.type(), request.id()));
try {
Fields topLevelFields = MultiFields.getFields(topLevelReader);
Versions.DocIdAndVersion docIdAndVersion = Versions.loadDocIdAndVersion(topLevelReader, uidTerm);
if(docIdAndVersion!=null) {
termVectorResponse.setFields(topLevelReader.getTermVectors(docIdAndVersion.docId), request.selectedFields(),
request.getFlags(), topLevelFields);
termVectorResponse.setDocVersion(docIdAndVersion.version);
} else {
}
} catch (Throwable ex) {
throw new ElasticSearchException("failed to execute term vector request", ex);
} finally {
searcher.release();
}
return termVectorResponse;
}
@Override
protected TermVectorRequest newRequest() {
return new TermVectorRequest();
}
@Override
protected TermVectorResponse newResponse() {
return new TermVectorResponse();
}
}

@@ -0,0 +1,23 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/**
* Get the term vector for a specific document.
*/
package org.elasticsearch.action.termvector;


@ -41,6 +41,9 @@ import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.mlt.MoreLikeThisRequest;
import org.elasticsearch.action.mlt.MoreLikeThisRequestBuilder;
import org.elasticsearch.action.termvector.TermVectorRequest;
import org.elasticsearch.action.termvector.TermVectorRequestBuilder;
import org.elasticsearch.action.termvector.TermVectorResponse;
import org.elasticsearch.action.percolate.PercolateRequest;
import org.elasticsearch.action.percolate.PercolateRequestBuilder;
import org.elasticsearch.action.percolate.PercolateResponse;
@ -433,7 +436,7 @@ public interface Client {
* @param listener A listener to be notified of the result
*/
void moreLikeThis(MoreLikeThisRequest request, ActionListener<SearchResponse> listener);
/**
* A more like this action to search for documents that are "like" a specific document.
*
@ -443,6 +446,33 @@ public interface Client {
*/
MoreLikeThisRequestBuilder prepareMoreLikeThis(String index, String type, String id);
/**
* An action that returns the term vectors for a specific document.
*
* @param request The term vector request
* @return The response future
*/
ActionFuture<TermVectorResponse> termVector(TermVectorRequest request);
/**
* An action that returns the term vectors for a specific document.
*
* @param request The term vector request
* @param listener A listener to be notified of the result
*/
void termVector(TermVectorRequest request, ActionListener<TermVectorResponse> listener);
/**
* Builder for the term vector request.
*
* @param index The index to load the document from
* @param type The type of the document
* @param id The id of the document
*/
TermVectorRequestBuilder prepareTermVector(String index, String type, String id);
/**
* Percolates a request returning the matches documents.
*/
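The listener variant added above can be exercised like this sketch (again assuming an initialized `client`; values are illustrative):

    TermVectorRequest request = new TermVectorRequest("twitter", "tweet", "1");
    request.offsets(true);
    request.positions(true);
    client.termVector(request, new ActionListener<TermVectorResponse>() {
        @Override
        public void onResponse(TermVectorResponse response) {
            // response.getFields() exposes the term vectors as Lucene Fields
        }
        @Override
        public void onFailure(Throwable e) {
            // handle the failure, e.g. log it
        }
    });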


@ -48,6 +48,10 @@ import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.mlt.MoreLikeThisAction;
import org.elasticsearch.action.mlt.MoreLikeThisRequest;
import org.elasticsearch.action.mlt.MoreLikeThisRequestBuilder;
import org.elasticsearch.action.termvector.TermVectorAction;
import org.elasticsearch.action.termvector.TermVectorRequest;
import org.elasticsearch.action.termvector.TermVectorRequestBuilder;
import org.elasticsearch.action.termvector.TermVectorResponse;
import org.elasticsearch.action.percolate.PercolateAction;
import org.elasticsearch.action.percolate.PercolateRequest;
import org.elasticsearch.action.percolate.PercolateRequestBuilder;
@ -293,6 +297,21 @@ public abstract class AbstractClient implements InternalClient {
public MoreLikeThisRequestBuilder prepareMoreLikeThis(String index, String type, String id) {
return new MoreLikeThisRequestBuilder(this, index, type, id);
}
@Override
public ActionFuture<TermVectorResponse> termVector(final TermVectorRequest request) {
return execute(TermVectorAction.INSTANCE, request);
}
@Override
public void termVector(final TermVectorRequest request, final ActionListener<TermVectorResponse> listener) {
execute(TermVectorAction.INSTANCE, request, listener);
}
@Override
public TermVectorRequestBuilder prepareTermVector(String index, String type, String id) {
return new TermVectorRequestBuilder(this, index, type, id);
}
@Override
public ActionFuture<PercolateResponse> percolate(final PercolateRequest request) {
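The request builder wired up here is the most convenient entry point; a sketch (flags and document coordinates are illustrative):

    TermVectorResponse response = client.prepareTermVector("twitter", "tweet", "1")
            .setPositions(true)
            .setOffsets(true)
            .setPayloads(true)
            .setTermStatistics(true)
            .execute().actionGet();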


@ -39,6 +39,8 @@ import org.elasticsearch.action.get.MultiGetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.mlt.MoreLikeThisRequest;
import org.elasticsearch.action.termvector.TermVectorRequest;
import org.elasticsearch.action.termvector.TermVectorResponse;
import org.elasticsearch.action.percolate.PercolateRequest;
import org.elasticsearch.action.percolate.PercolateResponse;
import org.elasticsearch.action.search.*;
@ -429,6 +431,11 @@ public class TransportClient extends AbstractClient {
public void moreLikeThis(MoreLikeThisRequest request, ActionListener<SearchResponse> listener) {
internalClient.moreLikeThis(request, listener);
}
@Override
public void termVector(TermVectorRequest request, ActionListener<TermVectorResponse> listener) {
internalClient.termVector(request, listener);
}
@Override
public ActionFuture<PercolateResponse> percolate(PercolateRequest request) {


@ -647,6 +647,16 @@ public final class XContentBuilder implements BytesStream {
endArray();
return this;
}
public XContentBuilder field(XContentBuilderString name, int offset, int length, int... value) throws IOException {
// writes the sub-range value[offset ... offset + length) as a JSON array
assert (offset >= 0) && (offset + length <= value.length);
startArray(name);
for (int i = offset; i < offset + length; i++) {
value(value[i]);
}
endArray();
return this;
}
public XContentBuilder field(XContentBuilderString name, int... value) throws IOException {
startArray(name);
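Assuming the corrected range semantics above (write `length` values starting at `offset`), the new overload would be used as in this hypothetical sketch:

    int[] positions = { 0, 6, 11, 15 };
    // emits "pos" : [6,11]
    builder.field(new XContentBuilderString("pos"), 1, 2, positions);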


@ -666,7 +666,7 @@ public abstract class AbstractFieldMapper<T> implements FieldMapper<T>, Mapper {
}
}
- protected static String termVectorOptionsToString(FieldType fieldType) {
+ public static String termVectorOptionsToString(FieldType fieldType) {
if (!fieldType.storeTermVectors()) {
return "no";
} else if(!fieldType.storeTermVectorOffsets() && !fieldType.storeTermVectorPositions()) {


@ -164,6 +164,9 @@ public class TypeParsers {
} else if ("with_positions_offsets".equals(termVector)) {
builder.storeTermVectorPositions(true);
builder.storeTermVectorOffsets(true);
} else if ("with_positions_payloads".equals(termVector)) {
builder.storeTermVectorPositions(true);
builder.storeTermVectorPayloads(true);
} else if ("with_positions_offsets_payloads".equals(termVector)) {
builder.storeTermVectorPositions(true);
builder.storeTermVectorOffsets(true);
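With this branch in place the type parser accepts the new option; a minimal sketch (the builder type is borrowed from the tests further below, the field name is illustrative):

    AllFieldMapper.Builder builder = new AllFieldMapper.Builder();
    // parses without throwing a MapperParsingException
    TypeParsers.parseTermVector("field", "with_positions_payloads", builder);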


@ -65,6 +65,7 @@ import org.apache.lucene.analysis.nl.DutchStemFilter;
import org.apache.lucene.analysis.no.NorwegianAnalyzer;
import org.apache.lucene.analysis.path.PathHierarchyTokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.apache.lucene.analysis.payloads.TypeAsPayloadTokenFilter;
import org.apache.lucene.analysis.pt.PortugueseAnalyzer;
import org.apache.lucene.analysis.reverse.ReverseStringFilter;
import org.apache.lucene.analysis.ro.RomanianAnalyzer;
@ -649,6 +650,19 @@ public class IndicesAnalysisService extends AbstractComponent {
return new KeywordRepeatFilter(tokenStream);
}
}));
tokenFilterFactories.put("type_as_payload", new PreBuiltTokenFilterFactoryFactory(new TokenFilterFactory() {
@Override
public String name() {
return "type_as_payload";
}
@Override
public TokenStream create(TokenStream tokenStream) {
return new TypeAsPayloadTokenFilter(tokenStream);
}
}));
// Char Filter
charFilterFactories.put("html_strip", new PreBuiltCharFilterFactoryFactory(new CharFilterFactory() {
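Once registered, the pre-built filter can be referenced by name in analyzer settings, as in this sketch (the settings keys mirror those used in the tests below):

    Settings settings = ImmutableSettings.settingsBuilder()
            .put("index.analysis.analyzer.tv_test.tokenizer", "whitespace")
            .putArray("index.analysis.analyzer.tv_test.filter", "type_as_payload", "lowercase")
            .build();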


@ -83,6 +83,7 @@ import org.elasticsearch.rest.action.search.RestMultiSearchAction;
import org.elasticsearch.rest.action.search.RestSearchAction;
import org.elasticsearch.rest.action.search.RestSearchScrollAction;
import org.elasticsearch.rest.action.suggest.RestSuggestAction;
import org.elasticsearch.rest.action.termvector.RestTermVectorAction;
import org.elasticsearch.rest.action.update.RestUpdateAction;
import java.util.List;
@ -165,6 +166,7 @@ public class RestActionModule extends AbstractModule {
bind(RestDeleteByQueryAction.class).asEagerSingleton();
bind(RestCountAction.class).asEagerSingleton();
bind(RestSuggestAction.class).asEagerSingleton();
bind(RestTermVectorAction.class).asEagerSingleton();
bind(RestBulkAction.class).asEagerSingleton();
bind(RestUpdateAction.class).asEagerSingleton();
bind(RestPercolateAction.class).asEagerSingleton();


@ -0,0 +1,188 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.rest.action.termvector;
import static org.elasticsearch.rest.RestRequest.Method.GET;
import static org.elasticsearch.rest.RestRequest.Method.POST;
import static org.elasticsearch.rest.RestStatus.OK;
import static org.elasticsearch.rest.action.support.RestXContentBuilder.restContentBuilder;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;
import org.elasticsearch.ElasticSearchParseException;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.termvector.TermVectorRequest;
import org.elasticsearch.action.termvector.TermVectorResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.rest.BaseRestHandler;
import org.elasticsearch.rest.RestChannel;
import org.elasticsearch.rest.RestController;
import org.elasticsearch.rest.RestRequest;
import org.elasticsearch.rest.XContentRestResponse;
import org.elasticsearch.rest.XContentThrowableRestResponse;
/**
* This class parses the json request and translates it into a
* TermVectorRequest.
*/
public class RestTermVectorAction extends BaseRestHandler {
@Inject
public RestTermVectorAction(Settings settings, Client client, RestController controller) {
super(settings, client);
controller.registerHandler(GET, "/{index}/{type}/{id}/_termvector", this);
controller.registerHandler(POST, "/{index}/{type}/{id}/_termvector", this);
}
@Override
public void handleRequest(final RestRequest request, final RestChannel channel) {
TermVectorRequest termVectorRequest = new TermVectorRequest(request.param("index"), request.param("type"), request.param("id"));
termVectorRequest.routing(request.param("routing"));
termVectorRequest.parent(request.param("parent"));
termVectorRequest.preference(request.param("preference"));
if (request.hasContent()) {
try {
parseRequest(request.content(), termVectorRequest);
} catch (IOException e1) {
Set<String> selectedFields = termVectorRequest.selectedFields();
String fieldString = "all";
if (selectedFields != null) {
fieldString = Strings.arrayToDelimitedString(selectedFields.toArray(new String[selectedFields.size()]), " ");
}
logger.error("Failed to parse the term vector request body. Falling back to these parameters:"
+ "\n positions: " + termVectorRequest.positions() + "\n offsets: " + termVectorRequest.offsets() + "\n payloads: "
+ termVectorRequest.payloads() + "\n termStatistics: " + termVectorRequest.termStatistics()
+ "\n fieldStatistics: " + termVectorRequest.fieldStatistics() + "\n fields: " + fieldString, e1);
}
}
readURIParameters(termVectorRequest, request);
client.termVector(termVectorRequest, new ActionListener<TermVectorResponse>() {
@Override
public void onResponse(TermVectorResponse response) {
try {
XContentBuilder builder = restContentBuilder(request);
response.toXContent(builder, request);
channel.sendResponse(new XContentRestResponse(request, OK, builder));
} catch (Throwable e) {
onFailure(e);
}
}
@Override
public void onFailure(Throwable e) {
try {
channel.sendResponse(new XContentThrowableRestResponse(request, e));
} catch (IOException e1) {
logger.error("Failed to send failure response", e1);
}
}
});
}
public static void readURIParameters(TermVectorRequest termVectorRequest, RestRequest request) {
String fields = request.param("fields");
addFieldStringsFromParameter(termVectorRequest, fields);
termVectorRequest.offsets(request.paramAsBoolean("offsets", termVectorRequest.offsets()));
termVectorRequest.positions(request.paramAsBoolean("positions", termVectorRequest.positions()));
termVectorRequest.payloads(request.paramAsBoolean("payloads", termVectorRequest.payloads()));
termVectorRequest.termStatistics(request.paramAsBoolean("termStatistics", termVectorRequest.termStatistics()));
termVectorRequest.termStatistics(request.paramAsBoolean("term_statistics", termVectorRequest.termStatistics()));
termVectorRequest.fieldStatistics(request.paramAsBoolean("fieldStatistics", termVectorRequest.fieldStatistics()));
termVectorRequest.fieldStatistics(request.paramAsBoolean("field_statistics", termVectorRequest.fieldStatistics()));
}
public static void addFieldStringsFromParameter(TermVectorRequest termVectorRequest, String fields) {
Set<String> selectedFields = termVectorRequest.selectedFields();
if (fields != null) {
String[] paramFieldStrings = Strings.commaDelimitedListToStringArray(fields);
for (String field : paramFieldStrings) {
if (selectedFields == null) {
selectedFields = new HashSet<String>();
}
if (!selectedFields.contains(field)) {
field = field.replaceAll("\\s", "");
selectedFields.add(field);
}
}
}
if (selectedFields != null) {
termVectorRequest.selectedFields(selectedFields.toArray(new String[selectedFields.size()]));
}
}
public static void parseRequest(BytesReference cont, TermVectorRequest termVectorRequest) throws IOException {
XContentParser parser = XContentFactory.xContent(cont).createParser(cont);
try {
XContentParser.Token token;
String currentFieldName = null;
List<String> fields = new ArrayList<String>();
while ((token = parser.nextToken()) != XContentParser.Token.END_OBJECT) {
if (token == XContentParser.Token.FIELD_NAME) {
currentFieldName = parser.currentName();
} else if (currentFieldName != null) {
if (currentFieldName.equals("fields")) {
if (token == XContentParser.Token.START_ARRAY) {
while (parser.nextToken() != XContentParser.Token.END_ARRAY) {
fields.add(parser.text());
}
} else {
throw new ElasticSearchParseException(
"The parameter fields must be given as an array! Use syntax : \"fields\" : [\"field1\", \"field2\",...]");
}
} else if (currentFieldName.equals("offsets")) {
termVectorRequest.offsets(parser.booleanValue());
} else if (currentFieldName.equals("positions")) {
termVectorRequest.positions(parser.booleanValue());
} else if (currentFieldName.equals("payloads")) {
termVectorRequest.payloads(parser.booleanValue());
} else if (currentFieldName.equals("term_statistics") || currentFieldName.equals("termStatistics")) {
termVectorRequest.termStatistics(parser.booleanValue());
} else if (currentFieldName.equals("field_statistics") || currentFieldName.equals("fieldStatistics")) {
termVectorRequest.fieldStatistics(parser.booleanValue());
} else {
throw new ElasticSearchParseException("The parameter " + currentFieldName
+ " is not valid for term vector request!");
}
}
}
String[] fieldsAsArray = new String[fields.size()];
termVectorRequest.selectedFields(fields.toArray(fieldsAsArray));
} finally {
parser.close();
}
}
}
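The body parser accepts the same flags as the URI parameters; a sketch of calling it directly (this mirrors the parsing tests further below, values are illustrative):

    BytesReference body = new BytesArray("{\"fields\" : [\"text\"], \"offsets\" : true, \"positions\" : true}");
    TermVectorRequest request = new TermVectorRequest("twitter", "tweet", "1");
    RestTermVectorAction.parseRequest(body, request);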


@ -0,0 +1,665 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.test.integration.termvectors;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.equalTo;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.payloads.TypeAsPayloadTokenFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;
import org.elasticsearch.ElasticSearchException;
import org.elasticsearch.action.ActionFuture;
import org.elasticsearch.action.termvector.TermVectorRequest;
import org.elasticsearch.action.termvector.TermVectorRequest.Flag;
import org.elasticsearch.action.termvector.TermVectorRequestBuilder;
import org.elasticsearch.action.termvector.TermVectorResponse;
import org.elasticsearch.common.bytes.BytesArray;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.io.stream.InputStreamStreamInput;
import org.elasticsearch.common.io.stream.OutputStreamStreamOutput;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.index.mapper.MapperParsingException;
import org.elasticsearch.index.mapper.core.AbstractFieldMapper;
import org.elasticsearch.index.mapper.core.TypeParsers;
import org.elasticsearch.index.mapper.internal.AllFieldMapper;
import org.elasticsearch.rest.action.termvector.RestTermVectorAction;
import org.elasticsearch.test.integration.AbstractSharedClusterTest;
import org.hamcrest.Matchers;
import org.testng.annotations.Test;
public class GetTermVectorTests extends AbstractSharedClusterTest {
@Test
public void streamTest() throws Exception {
TermVectorResponse outResponse = new TermVectorResponse("a", "b", "c");
writeStandardTermVector(outResponse);
// write
ByteArrayOutputStream outBuffer = new ByteArrayOutputStream();
OutputStreamStreamOutput out = new OutputStreamStreamOutput(outBuffer);
outResponse.writeTo(out);
// read
ByteArrayInputStream esInBuffer = new ByteArrayInputStream(outBuffer.toByteArray());
InputStreamStreamInput esBuffer = new InputStreamStreamInput(esInBuffer);
TermVectorResponse inResponse = new TermVectorResponse("a", "b", "c");
inResponse.readFrom(esBuffer);
// see if correct
checkIfStandardTermVector(inResponse);
}
private void checkIfStandardTermVector(TermVectorResponse inResponse) throws IOException {
Fields fields = inResponse.getFields();
assertThat(fields.terms("title"), Matchers.notNullValue());
assertThat(fields.terms("desc"), Matchers.notNullValue());
assertThat(fields.size(), equalTo(2));
}
private void writeStandardTermVector(TermVectorResponse outResponse) throws IOException {
// use an in-memory directory instead of a fixed, shared /tmp path
Directory dir = new RAMDirectory();
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42));
conf.setOpenMode(OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, conf);
FieldType type = new FieldType(TextField.TYPE_STORED);
type.setStoreTermVectorOffsets(true);
type.setStoreTermVectorPayloads(false);
type.setStoreTermVectorPositions(true);
type.setStoreTermVectors(true);
type.freeze();
Document d = new Document();
d.add(new Field("id", "abc", StringField.TYPE_STORED));
d.add(new Field("title", "the1 quick brown fox jumps over the1 lazy dog", type));
d.add(new Field("desc", "the1 quick brown fox jumps over the1 lazy dog", type));
writer.updateDocument(new Term("id", "abc"), d);
writer.commit();
writer.close();
DirectoryReader dr = DirectoryReader.open(dir);
IndexSearcher s = new IndexSearcher(dr);
TopDocs search = s.search(new TermQuery(new Term("id", "abc")), 1);
ScoreDoc[] scoreDocs = search.scoreDocs;
int doc = scoreDocs[0].doc;
Fields fields = dr.getTermVectors(doc);
EnumSet<Flag> flags = EnumSet.of(Flag.Positions, Flag.Offsets);
outResponse.setFields(fields, null, flags, fields);
}
private Fields buildWithLuceneAndReturnFields(String docId, String[] fields, String[] content, boolean[] withPositions,
boolean[] withOffsets, boolean[] withPayloads) throws IOException {
assert (fields.length == withPayloads.length);
assert (content.length == withPayloads.length);
assert (withPositions.length == withPayloads.length);
assert (withOffsets.length == withPayloads.length);
Map<String, Analyzer> mapping = new HashMap<String, Analyzer>();
for (int i = 0; i < withPayloads.length; i++) {
if (withPayloads[i]) {
mapping.put(fields[i], new Analyzer() {
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_42, reader);
TokenFilter filter = new LowerCaseFilter(Version.LUCENE_42, tokenizer);
filter = new TypeAsPayloadTokenFilter(filter);
return new TokenStreamComponents(tokenizer, filter);
}
});
}
}
PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_42), mapping);
Directory dir = new RAMDirectory();
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_42, wrapper);
conf.setOpenMode(OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, conf);
Document d = new Document();
for (int i = 0; i < fields.length; i++) {
d.add(new Field("id", docId, StringField.TYPE_STORED));
FieldType type = new FieldType(TextField.TYPE_STORED);
type.setStoreTermVectorOffsets(withOffsets[i]);
type.setStoreTermVectorPayloads(withPayloads[i]);
type.setStoreTermVectorPositions(withPositions[i] || withOffsets[i] || withPayloads[i]);
type.setStoreTermVectors(true);
type.freeze();
d.add(new Field(fields[i], content[i], type));
writer.updateDocument(new Term("id", docId), d);
writer.commit();
}
writer.close();
DirectoryReader dr = DirectoryReader.open(dir);
IndexSearcher s = new IndexSearcher(dr);
TopDocs search = s.search(new TermQuery(new Term("id", docId)), 1);
ScoreDoc[] scoreDocs = search.scoreDocs;
assert (scoreDocs.length == 1);
int doc = scoreDocs[0].doc;
Fields returnFields = dr.getTermVectors(doc);
return returnFields;
}
@Test
public void testRestRequestParsing() throws Exception {
BytesReference inputBytes = new BytesArray(
" {\"fields\" : [\"a\", \"b\",\"c\"], \"offsets\":false, \"positions\":false, \"payloads\":true}");
TermVectorRequest tvr = new TermVectorRequest(null, null, null);
RestTermVectorAction.parseRequest(inputBytes, tvr);
Set<String> fields = tvr.selectedFields();
assertThat(fields.contains("a"), equalTo(true));
assertThat(fields.contains("b"), equalTo(true));
assertThat(fields.contains("c"), equalTo(true));
assertThat(tvr.offsets(), equalTo(false));
assertThat(tvr.positions(), equalTo(false));
assertThat(tvr.payloads(), equalTo(true));
String additionalFields = "b,c ,d, e ";
RestTermVectorAction.addFieldStringsFromParameter(tvr, additionalFields);
assertThat(tvr.selectedFields().size(), equalTo(5));
assertThat(fields.contains("d"), equalTo(true));
assertThat(fields.contains("e"), equalTo(true));
additionalFields = "";
RestTermVectorAction.addFieldStringsFromParameter(tvr, additionalFields);
inputBytes = new BytesArray(" {\"offsets\":false, \"positions\":false, \"payloads\":true}");
tvr = new TermVectorRequest(null, null, null);
RestTermVectorAction.parseRequest(inputBytes, tvr);
additionalFields = "";
RestTermVectorAction.addFieldStringsFromParameter(tvr, additionalFields);
assertThat(tvr.selectedFields(), equalTo(null));
additionalFields = "b,c ,d, e ";
RestTermVectorAction.addFieldStringsFromParameter(tvr, additionalFields);
assertThat(tvr.selectedFields().size(), equalTo(4));
}
@Test
public void testRestRequestParsingThrowsException() throws Exception {
BytesReference inputBytes = new BytesArray(
" {\"fields\" : \"a, b,c \", \"offsets\":false, \"positions\":false, \"payloads\":true, \"meaningless_term\":2}");
TermVectorRequest tvr = new TermVectorRequest(null, null, null);
boolean threwException = false;
try {
RestTermVectorAction.parseRequest(inputBytes, tvr);
} catch (Exception e) {
threwException = true;
}
assertThat(threwException, equalTo(true));
}
@Test
public void testNoSuchDoc() throws Exception {
run(addMapping(prepareCreate("test"), "type1", new Object[] { "field", "type", "string", "term_vector",
"with_positions_offsets_payloads" }));
ensureYellow();
client().prepareIndex("test", "type1", "666").setSource("field", "foo bar").execute().actionGet();
refresh();
for (int i = 0; i < 20; i++) {
ActionFuture<TermVectorResponse> termVector = client().termVector(new TermVectorRequest("test", "type1", "" + i));
TermVectorResponse actionGet = termVector.actionGet();
assertThat(actionGet, Matchers.notNullValue());
assertThat(actionGet.documentExists(), Matchers.equalTo(false));
}
}
@Test
public void testSimpleTermVectors() throws ElasticSearchException, IOException {
run(addMapping(prepareCreate("test"), "type1",
new Object[] { "field", "type", "string", "term_vector", "with_positions_offsets_payloads", "analyzer", "tv_test" })
.setSettings(
ImmutableSettings.settingsBuilder().put("index.analysis.analyzer.tv_test.tokenizer", "whitespace")
.putArray("index.analysis.analyzer.tv_test.filter", "type_as_payload", "lowercase")));
ensureYellow();
for (int i = 0; i < 10; i++) {
client().prepareIndex("test", "type1", Integer.toString(i))
.setSource(XContentFactory.jsonBuilder().startObject().field("field", "the quick brown fox jumps over the lazy dog")
// 0the3 4quick9 10brown15 16fox19 20jumps25 26over30
// 31the34 35lazy39 40dog43
.endObject()).execute().actionGet();
refresh();
}
String[] values = { "brown", "dog", "fox", "jumps", "lazy", "over", "quick", "the" };
int[] freq = { 1, 1, 1, 1, 1, 1, 1, 2 };
int[][] pos = { { 2 }, { 8 }, { 3 }, { 4 }, { 7 }, { 5 }, { 1 }, { 0, 6 } };
int[][] startOffset = { { 10 }, { 40 }, { 16 }, { 20 }, { 35 }, { 26 }, { 4 }, { 0, 31 } };
int[][] endOffset = { { 15 }, { 43 }, { 19 }, { 25 }, { 39 }, { 30 }, { 9 }, { 3, 34 } };
for (int i = 0; i < 10; i++) {
TermVectorRequestBuilder resp = client().prepareTermVector("test", "type1", Integer.toString(i)).setPayloads(true)
.setOffsets(true).setPositions(true).setSelectedFields();
TermVectorResponse response = resp.execute().actionGet();
assertThat("doc id: " + i + " doesn't exists but should", response.documentExists(), equalTo(true));
Fields fields = response.getFields();
assertThat(fields.size(), equalTo(1));
Terms terms = fields.terms("field");
assertThat(terms.size(), equalTo(8L));
TermsEnum iterator = terms.iterator(null);
for (int j = 0; j < values.length; j++) {
String string = values[j];
BytesRef next = iterator.next();
assertThat(next, Matchers.notNullValue());
assertThat("expected " + string, string, equalTo(next.utf8ToString()));
assertThat(next, Matchers.notNullValue());
// do not test ttf or doc frequency, because here we have many
// shards and do not know how documents are distributed
DocsAndPositionsEnum docsAndPositions = iterator.docsAndPositions(null, null);
assertThat(docsAndPositions.nextDoc(), equalTo(0));
assertThat(freq[j], equalTo(docsAndPositions.freq()));
int[] termPos = pos[j];
int[] termStartOffset = startOffset[j];
int[] termEndOffset = endOffset[j];
assertThat(termPos.length, equalTo(freq[j]));
assertThat(termStartOffset.length, equalTo(freq[j]));
assertThat(termEndOffset.length, equalTo(freq[j]));
for (int k = 0; k < freq[j]; k++) {
int nextPosition = docsAndPositions.nextPosition();
assertThat("term: " + string, nextPosition, equalTo(termPos[k]));
assertThat("term: " + string, docsAndPositions.startOffset(), equalTo(termStartOffset[k]));
assertThat("term: " + string, docsAndPositions.endOffset(), equalTo(termEndOffset[k]));
assertThat("term: " + string, docsAndPositions.getPayload(), equalTo(new BytesRef("word")));
}
}
assertThat(iterator.next(), Matchers.nullValue());
}
}
@Test
public void testRandomSingleTermVectors() throws ElasticSearchException, IOException {
long seed = System.currentTimeMillis();
Random random = new Random(seed);
FieldType ft = new FieldType();
int config = random.nextInt(7); // one of the seven storage configurations below
boolean storePositions = false;
boolean storeOffsets = false;
boolean storePayloads = false;
boolean storeTermVectors = false;
switch (config) {
case 0: {
// do nothing
break;
}
case 1: {
storeTermVectors = true;
break;
}
case 2: {
storeTermVectors = true;
storePositions = true;
break;
}
case 3: {
storeTermVectors = true;
storeOffsets = true;
break;
}
case 4: {
storeTermVectors = true;
storePositions = true;
storeOffsets = true;
break;
}
case 5: {
storeTermVectors = true;
storePositions = true;
storePayloads = true;
break;
}
case 6: {
storeTermVectors = true;
storePositions = true;
storeOffsets = true;
storePayloads = true;
break;
}
}
ft.setStoreTermVectors(storeTermVectors);
ft.setStoreTermVectorOffsets(storeOffsets);
ft.setStoreTermVectorPayloads(storePayloads);
ft.setStoreTermVectorPositions(storePositions);
String optionString = AbstractFieldMapper.termVectorOptionsToString(ft);
run(addMapping(prepareCreate("test"), "type1",
new Object[] { "field", "type", "string", "term_vector", optionString, "analyzer", "tv_test" }).setSettings(
ImmutableSettings.settingsBuilder().put("index.analysis.analyzer.tv_test.tokenizer", "whitespace")
.putArray("index.analysis.analyzer.tv_test.filter", "type_as_payload", "lowercase")));
ensureYellow();
for (int i = 0; i < 10; i++) {
client().prepareIndex("test", "type1", Integer.toString(i))
.setSource(XContentFactory.jsonBuilder().startObject().field("field", "the quick brown fox jumps over the lazy dog")
// 0the3 4quick9 10brown15 16fox19 20jumps25 26over30
// 31the34 35lazy39 40dog43
.endObject()).execute().actionGet();
refresh();
}
String[] values = { "brown", "dog", "fox", "jumps", "lazy", "over", "quick", "the" };
int[] freq = { 1, 1, 1, 1, 1, 1, 1, 2 };
int[][] pos = { { 2 }, { 8 }, { 3 }, { 4 }, { 7 }, { 5 }, { 1 }, { 0, 6 } };
int[][] startOffset = { { 10 }, { 40 }, { 16 }, { 20 }, { 35 }, { 26 }, { 4 }, { 0, 31 } };
int[][] endOffset = { { 15 }, { 43 }, { 19 }, { 25 }, { 39 }, { 30 }, { 9 }, { 3, 34 } };
boolean isPayloadRequested = random.nextBoolean();
boolean isOffsetRequested = random.nextBoolean();
boolean isPositionsRequested = random.nextBoolean();
String infoString = createInfoString(isPositionsRequested, isOffsetRequested, isPayloadRequested, optionString, seed);
for (int i = 0; i < 10; i++) {
TermVectorRequestBuilder resp = client().prepareTermVector("test", "type1", Integer.toString(i))
.setPayloads(isPayloadRequested).setOffsets(isOffsetRequested).setPositions(isPositionsRequested).setSelectedFields();
TermVectorResponse response = resp.execute().actionGet();
assertThat(infoString + "doc id: " + i + " doesn't exists but should", response.documentExists(), equalTo(true));
Fields fields = response.getFields();
assertThat(fields.size(), equalTo(ft.storeTermVectors() ? 1 : 0));
if (ft.storeTermVectors()) {
Terms terms = fields.terms("field");
assertThat(terms.size(), equalTo(8L));
TermsEnum iterator = terms.iterator(null);
for (int j = 0; j < values.length; j++) {
String string = values[j];
BytesRef next = iterator.next();
assertThat(infoString, next, Matchers.notNullValue());
assertThat(infoString + "expected " + string, string, equalTo(next.utf8ToString()));
assertThat(infoString, next, Matchers.notNullValue());
// do not test ttf or doc frequency, because here we have
// many shards and do not know how documents are distributed
DocsAndPositionsEnum docsAndPositions = iterator.docsAndPositions(null, null);
// docsAndPositions() only returns something useful if positions,
// payloads or offsets are stored and requested; otherwise DocsEnum
// would have to be used
assertThat(infoString, docsAndPositions.nextDoc(), equalTo(0));
assertThat(infoString, freq[j], equalTo(docsAndPositions.freq()));
int[] termPos = pos[j];
int[] termStartOffset = startOffset[j];
int[] termEndOffset = endOffset[j];
if (isPositionsRequested && storePositions) {
assertThat(infoString, termPos.length, equalTo(freq[j]));
}
if (isOffsetRequested && storeOffsets) {
assertThat(termStartOffset.length, equalTo(freq[j]));
assertThat(termEndOffset.length, equalTo(freq[j]));
}
for (int k = 0; k < freq[j]; k++) {
int nextPosition = docsAndPositions.nextPosition();
// only return something useful if requested and stored
if (isPositionsRequested && storePositions) {
assertThat(infoString + "positions for term: " + string, nextPosition, equalTo(termPos[k]));
} else {
assertThat(infoString + "positions for term: ", nextPosition, equalTo(-1));
}
// only return something useful if requested and stored
if (isPayloadRequested && storePayloads) {
assertThat(infoString + "payloads for term: " + string, docsAndPositions.getPayload(), equalTo(new BytesRef(
"word")));
} else {
assertThat(infoString + "payloads for term: " + string, docsAndPositions.getPayload(), equalTo(null));
}
// only return something useful if requested and stored
if (isOffsetRequested && storeOffsets) {
assertThat(infoString + "startOffsets term: " + string, docsAndPositions.startOffset(),
equalTo(termStartOffset[k]));
assertThat(infoString + "endOffsets term: " + string, docsAndPositions.endOffset(), equalTo(termEndOffset[k]));
} else {
assertThat(infoString + "startOffsets term: " + string, docsAndPositions.startOffset(), equalTo(-1));
assertThat(infoString + "endOffsets term: " + string, docsAndPositions.endOffset(), equalTo(-1));
}
}
}
assertThat(iterator.next(), Matchers.nullValue());
}
}
}
private String createInfoString(boolean isPositionsRequested, boolean isOffsetRequested, boolean isPayloadRequested,
String optionString, long seed) {
String ret = "Seed: " + seed + "\n" + "Store config: " + optionString + "\n" + "Requested: pos-"
+ (isPositionsRequested ? "yes" : "no") + ", offsets-" + (isOffsetRequested ? "yes" : "no") + ", payload- "
+ (isPayloadRequested ? "yes" : "no") + "\n";
return ret;
}
@Test
public void testDuellESLucene() throws Exception {
String[] fieldNames = { "field_that_should_not_be_requested", "field_with_positions", "field_with_offsets", "field_with_only_tv",
"field_with_positions_offsets", "field_with_positions_payloads" };
run(addMapping(prepareCreate("test"), "type1",
new Object[] { fieldNames[0], "type", "string", "term_vector", "with_positions_offsets" },
new Object[] { fieldNames[1], "type", "string", "term_vector", "with_positions" },
new Object[] { fieldNames[2], "type", "string", "term_vector", "with_offsets" },
new Object[] { fieldNames[3], "type", "string", "store_term_vectors", "yes" },
new Object[] { fieldNames[4], "type", "string", "term_vector", "with_positions_offsets" },
new Object[] { fieldNames[5], "type", "string", "term_vector", "with_positions_payloads", "analyzer", "tv_test" })
.setSettings(
ImmutableSettings.settingsBuilder().put("index.analysis.analyzer.tv_test.tokenizer", "standard")
.putArray("index.analysis.analyzer.tv_test.filter", "type_as_payload", "lowercase")));
ensureYellow();
// would also work with: XContentBuilder xcb = new XContentBuilder();
// now, create the same thing with lucene and see if the returned stuff
// is the same
String[] fieldContent = { "the quick shard jumps over the stupid brain", "here is another field",
"And yet another field without any use.", "I am out of ideas on what to type here.",
"The last field for which offsets are stored but not positions.",
"The last field for which offsets are stored but not positions." };
boolean[] storeOffsets = { true, false, true, false, true, false };
boolean[] storePositions = { true, true, false, false, true, true };
boolean[] storePayloads = { false, false, false, false, false, true };
Map<String, Object> testSource = new HashMap<String, Object>();
for (int i = 0; i < fieldNames.length; i++) {
testSource.put(fieldNames[i], fieldContent[i]);
}
client().prepareIndex("test", "type1", "1").setSource(testSource).execute().actionGet();
refresh();
String[] selectedFields = { fieldNames[1], fieldNames[2], fieldNames[3], fieldNames[4], fieldNames[5] };
testForConfig(fieldNames, fieldContent, storeOffsets, storePositions, storePayloads, selectedFields, false, false, false);
testForConfig(fieldNames, fieldContent, storeOffsets, storePositions, storePayloads, selectedFields, true, false, false);
testForConfig(fieldNames, fieldContent, storeOffsets, storePositions, storePayloads, selectedFields, false, true, false);
testForConfig(fieldNames, fieldContent, storeOffsets, storePositions, storePayloads, selectedFields, true, true, false);
testForConfig(fieldNames, fieldContent, storeOffsets, storePositions, storePayloads, selectedFields, true, false, true);
testForConfig(fieldNames, fieldContent, storeOffsets, storePositions, storePayloads, selectedFields, true, true, true);
}
private void testForConfig(String[] fieldNames, String[] fieldContent, boolean[] storeOffsets, boolean[] storePositions,
boolean[] storePayloads, String[] selectedFields, boolean withPositions, boolean withOffsets, boolean withPayloads)
throws IOException {
TermVectorRequestBuilder resp = client().prepareTermVector("test", "type1", "1").setPayloads(withPayloads).setOffsets(withOffsets)
.setPositions(withPositions).setFieldStatistics(true).setTermStatistics(true).setSelectedFields(selectedFields);
TermVectorResponse response = resp.execute().actionGet();
// build the same with lucene and compare the Fields
Fields luceneFields = buildWithLuceneAndReturnFields("1", fieldNames, fieldContent, storePositions, storeOffsets, storePayloads);
HashMap<String, Boolean> storeOffsetsMap = new HashMap<String, Boolean>();
HashMap<String, Boolean> storePositionsMap = new HashMap<String, Boolean>();
HashMap<String, Boolean> storePayloadsMap = new HashMap<String, Boolean>();
for (int i = 0; i < storePositions.length; i++) {
storeOffsetsMap.put(fieldNames[i], storeOffsets[i]);
storePositionsMap.put(fieldNames[i], storePositions[i]);
storePayloadsMap.put(fieldNames[i], storePayloads[i]);
}
compareLuceneESTermVectorResults(response.getFields(), luceneFields, storePositionsMap, storeOffsetsMap, storePayloadsMap,
withPositions, withOffsets, withPayloads, selectedFields);
}
private void compareLuceneESTermVectorResults(Fields fields, Fields luceneFields, HashMap<String, Boolean> storePositionsMap,
HashMap<String, Boolean> storeOffsetsMap, HashMap<String, Boolean> storePayloadsMap, boolean getPositions, boolean getOffsets,
boolean getPayloads, String[] selectedFields) throws IOException {
HashSet<String> selectedFieldsMap = new HashSet<String>(Arrays.asList(selectedFields));
Iterator<String> luceneFieldNames = luceneFields.iterator();
assertThat(luceneFields.size(), equalTo(storeOffsetsMap.size()));
assertThat(fields.size(), equalTo(selectedFields.length));
while (luceneFieldNames.hasNext()) {
String luceneFieldName = luceneFieldNames.next();
if (!selectedFieldsMap.contains(luceneFieldName))
continue;
Terms esTerms = fields.terms(luceneFieldName);
Terms luceneTerms = luceneFields.terms(luceneFieldName);
TermsEnum esTermEnum = esTerms.iterator(null);
TermsEnum luceneTermEnum = luceneTerms.iterator(null);
int numTerms = 0;
while (esTermEnum.next() != null) {
luceneTermEnum.next();
assertThat(esTermEnum.totalTermFreq(), equalTo(luceneTermEnum.totalTermFreq()));
DocsAndPositionsEnum esDocsPosEnum = esTermEnum.docsAndPositions(null, null, 0);
DocsAndPositionsEnum luceneDocsPosEnum = luceneTermEnum.docsAndPositions(null, null, 0);
if (luceneDocsPosEnum == null) {
assertThat(storeOffsetsMap.get(luceneFieldName), equalTo(false));
assertThat(storePayloadsMap.get(luceneFieldName), equalTo(false));
assertThat(storePositionsMap.get(luceneFieldName), equalTo(false));
continue;
}
numTerms++;
assertThat("failed for field: " + luceneFieldName, esTermEnum.term().utf8ToString(), equalTo(luceneTermEnum.term()
.utf8ToString()));
esDocsPosEnum.nextDoc();
luceneDocsPosEnum.nextDoc();
int freq = esDocsPosEnum.freq();
assertThat(freq, equalTo(luceneDocsPosEnum.freq()));
for (int i = 0; i < freq; i++) {
int lucenePos = luceneDocsPosEnum.nextPosition();
int esPos = esDocsPosEnum.nextPosition();
if (storePositionsMap.get(luceneFieldName) && getPositions) {
assertThat(luceneFieldName, lucenePos, equalTo(esPos));
} else {
assertThat(esPos, equalTo(-1));
}
if (storeOffsetsMap.get(luceneFieldName) && getOffsets) {
assertThat(luceneDocsPosEnum.startOffset(), equalTo(esDocsPosEnum.startOffset()));
assertThat(luceneDocsPosEnum.endOffset(), equalTo(esDocsPosEnum.endOffset()));
} else {
assertThat(esDocsPosEnum.startOffset(), equalTo(-1));
assertThat(esDocsPosEnum.endOffset(), equalTo(-1));
}
if (storePayloadsMap.get(luceneFieldName) && getPayloads) {
assertThat(luceneFieldName, luceneDocsPosEnum.getPayload(), equalTo(esDocsPosEnum.getPayload()));
} else {
assertThat(esDocsPosEnum.getPayload(), equalTo(null));
}
}
}
}
}
@Test
public void testFieldTypeToTermVectorString() throws Exception {
FieldType ft = new FieldType();
ft.setStoreTermVectorOffsets(false);
ft.setStoreTermVectorPayloads(true);
ft.setStoreTermVectors(true);
ft.setStoreTermVectorPositions(true);
String ftOpts = AbstractFieldMapper.termVectorOptionsToString(ft);
assertThat("with_positions_payloads", equalTo(ftOpts));
AllFieldMapper.Builder builder = new AllFieldMapper.Builder();
boolean exceptionThrown = false;
try {
TypeParsers.parseTermVector("", ftOpts, builder);
} catch (MapperParsingException e) {
exceptionThrown = true;
}
assertThat("TypeParsers.parseTermVector should accept string with_positions_payloads but does not.", excptiontrown, equalTo(false));
}
@Test
public void testTermVectorStringGenerationIllegalState() throws Exception {
FieldType ft = new FieldType();
ft.setStoreTermVectorOffsets(true);
ft.setStoreTermVectorPayloads(true);
ft.setStoreTermVectors(true);
ft.setStoreTermVectorPositions(false);
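// payloads require positions; with positions disabled the payload flag
// is dropped and the option string falls back to "with_offsets"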
String ftOpts = AbstractFieldMapper.termVectorOptionsToString(ft);
assertThat(ftOpts, equalTo("with_offsets"));
}
}


@ -0,0 +1,297 @@
/*
* Licensed to ElasticSearch and Shay Banon under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. ElasticSearch licenses this
* file to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.elasticsearch.test.integration.termvectors;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.equalTo;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Random;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.ElasticSearchException;
import org.elasticsearch.action.termvector.TermVectorRequest;
import org.elasticsearch.action.termvector.TermVectorRequestBuilder;
import org.elasticsearch.action.termvector.TermVectorResponse;
import org.elasticsearch.common.io.BytesStream;
import org.elasticsearch.common.io.stream.InputStreamStreamInput;
import org.elasticsearch.common.io.stream.OutputStreamStreamOutput;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.test.integration.AbstractSharedClusterTest;
import org.hamcrest.Matchers;
import org.testng.annotations.Test;
public class GetTermVectorTestsCheckDocFreq extends AbstractSharedClusterTest {
@Test
public void streamRequest() throws IOException {
long seed = System.currentTimeMillis();
Random random = new Random(seed);
for (int i = 0; i < 10; i++) {
TermVectorRequest request = new TermVectorRequest("index", "type", "id");
request.offsets(random.nextBoolean());
request.fieldStatistics(random.nextBoolean());
request.payloads(random.nextBoolean());
request.positions(random.nextBoolean());
request.termStatistics(random.nextBoolean());
String parent = random.nextBoolean() ? "someParent" : null;
request.parent(parent);
String pref = random.nextBoolean() ? "somePreference" : null;
request.preference(pref);
// write
ByteArrayOutputStream outBuffer = new ByteArrayOutputStream();
OutputStreamStreamOutput out = new OutputStreamStreamOutput(outBuffer);
request.writeTo(out);
// read
ByteArrayInputStream esInBuffer = new ByteArrayInputStream(outBuffer.toByteArray());
InputStreamStreamInput esBuffer = new InputStreamStreamInput(esInBuffer);
TermVectorRequest req2 = new TermVectorRequest(null, null, null);
req2.readFrom(esBuffer);
assertThat(request.offsets(), equalTo(req2.offsets()));
assertThat(request.fieldStatistics(), equalTo(req2.fieldStatistics()));
assertThat(request.payloads(), equalTo(req2.payloads()));
assertThat(request.positions(), equalTo(req2.positions()));
assertThat(request.termStatistics(), equalTo(req2.termStatistics()));
assertThat(req2.preference(), equalTo(pref));
assertThat(req2.routing(), equalTo(request.routing()));
}
}
@Test
public void testSimpleTermVectors() throws ElasticSearchException, IOException {
run(addMapping(prepareCreate("test"), "type1",
new Object[] { "field", "type", "string", "term_vector", "with_positions_offsets_payloads", "analyzer", "tv_test" })
.setSettings(
ImmutableSettings.settingsBuilder().put("index.number_of_shards", 1)
.put("index.analysis.analyzer.tv_test.tokenizer", "whitespace").put("index.number_of_replicas", 0)
.putArray("index.analysis.analyzer.tv_test.filter", "type_as_payload", "lowercase")));
ensureGreen();
int numDocs = 15;
for (int i = 0; i < numDocs; i++) {
client().prepareIndex("test", "type1", Integer.toString(i))
.setSource(XContentFactory.jsonBuilder().startObject().field("field", "the quick brown fox jumps over the lazy dog")
// 0the3 4quick9 10brown15 16fox19 20jumps25 26over30
// 31the34 35lazy39 40dog43
.endObject()).execute().actionGet();
refresh();
}
String[] values = { "brown", "dog", "fox", "jumps", "lazy", "over", "quick", "the" };
int[] freq = { 1, 1, 1, 1, 1, 1, 1, 2 };
int[][] pos = { { 2 }, { 8 }, { 3 }, { 4 }, { 7 }, { 5 }, { 1 }, { 0, 6 } };
int[][] startOffset = { { 10 }, { 40 }, { 16 }, { 20 }, { 35 }, { 26 }, { 4 }, { 0, 31 } };
int[][] endOffset = { { 15 }, { 43 }, { 19 }, { 25 }, { 39 }, { 30 }, { 9 }, { 3, 34 } };
for (int i = 0; i < numDocs; i++) {
checkAllInfo(numDocs, values, freq, pos, startOffset, endOffset, i);
checkWithoutTermStatistics(numDocs, values, freq, pos, startOffset, endOffset, i);
checkWithoutFieldStatistics(numDocs, values, freq, pos, startOffset, endOffset, i);
}
}
private void checkWithoutFieldStatistics(int numDocs, String[] values, int[] freq, int[][] pos, int[][] startOffset, int[][] endOffset,
int i) throws IOException {
TermVectorRequestBuilder resp = client().prepareTermVector("test", "type1", Integer.toString(i)).setPayloads(true).setOffsets(true)
.setPositions(true).setTermStatistics(true).setFieldStatistics(false).setSelectedFields();
TermVectorResponse response = resp.execute().actionGet();
assertThat("doc id: " + i + " doesn't exists but should", response.documentExists(), equalTo(true));
Fields fields = response.getFields();
assertThat(fields.size(), equalTo(1));
Terms terms = fields.terms("field");
assertThat(terms.size(), equalTo(8L));
assertThat(terms.getSumTotalTermFreq(), Matchers.equalTo((long) -1));
assertThat(terms.getDocCount(), Matchers.equalTo(-1));
assertThat(terms.getSumDocFreq(), equalTo((long) -1));
TermsEnum iterator = terms.iterator(null);
for (int j = 0; j < values.length; j++) {
String string = values[j];
BytesRef next = iterator.next();
assertThat(next, Matchers.notNullValue());
assertThat("expected " + string, string, equalTo(next.utf8ToString()));
assertThat(next, Matchers.notNullValue());
if (string.equals("the")) {
assertThat("expected ttf of " + string, numDocs * 2, equalTo((int) iterator.totalTermFreq()));
} else {
assertThat("expected ttf of " + string, numDocs, equalTo((int) iterator.totalTermFreq()));
}
DocsAndPositionsEnum docsAndPositions = iterator.docsAndPositions(null, null);
assertThat(docsAndPositions.nextDoc(), equalTo(0));
assertThat(freq[j], equalTo(docsAndPositions.freq()));
assertThat(iterator.docFreq(), equalTo(numDocs));
int[] termPos = pos[j];
int[] termStartOffset = startOffset[j];
int[] termEndOffset = endOffset[j];
assertThat(termPos.length, equalTo(freq[j]));
assertThat(termStartOffset.length, equalTo(freq[j]));
assertThat(termEndOffset.length, equalTo(freq[j]));
for (int k = 0; k < freq[j]; k++) {
int nextPosition = docsAndPositions.nextPosition();
assertThat("term: " + string, nextPosition, equalTo(termPos[k]));
assertThat("term: " + string, docsAndPositions.startOffset(), equalTo(termStartOffset[k]));
assertThat("term: " + string, docsAndPositions.endOffset(), equalTo(termEndOffset[k]));
assertThat("term: " + string, docsAndPositions.getPayload(), equalTo(new BytesRef("word")));
}
}
assertThat(iterator.next(), Matchers.nullValue());
XContentBuilder xBuilder = XContentFactory.jsonBuilder();
response.toXContent(xBuilder, null);
BytesStream bytesStream = xBuilder.bytesStream();
String utf8 = bytesStream.bytes().toUtf8();
String expectedString = "{\"_index\":\"test\",\"_type\":\"type1\",\"_id\":\""
+ i
+ "\",\"_version\":1,\"exists\":true,\"term_vectors\":{\"field\":{\"terms\":{\"brown\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[2],\"start\":[10],\"end\":[15],\"payload\":[\"d29yZA==\"]},\"dog\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[8],\"start\":[40],\"end\":[43],\"payload\":[\"d29yZA==\"]},\"fox\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[3],\"start\":[16],\"end\":[19],\"payload\":[\"d29yZA==\"]},\"jumps\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[4],\"start\":[20],\"end\":[25],\"payload\":[\"d29yZA==\"]},\"lazy\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[7],\"start\":[35],\"end\":[39],\"payload\":[\"d29yZA==\"]},\"over\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[5],\"start\":[26],\"end\":[30],\"payload\":[\"d29yZA==\"]},\"quick\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[1],\"start\":[4],\"end\":[9],\"payload\":[\"d29yZA==\"]},\"the\":{\"doc_freq\":15,\"ttf\":30,\"term_freq\":2,\"pos\":[0,6],\"start\":[0,31],\"end\":[3,34],\"payload\":[\"d29yZA==\",\"d29yZA==\"]}}}}}";
assertThat(utf8, equalTo(expectedString));
}
private void checkWithoutTermStatistics(int numDocs, String[] values, int[] freq, int[][] pos, int[][] startOffset, int[][] endOffset,
int i) throws IOException {
TermVectorRequestBuilder resp = client().prepareTermVector("test", "type1", Integer.toString(i)).setPayloads(true).setOffsets(true)
.setPositions(true).setTermStatistics(false).setFieldStatistics(true).setSelectedFields();
assertThat(resp.request().termStatistics(), equalTo(false));
TermVectorResponse response = resp.execute().actionGet();
assertThat("doc id: " + i + " doesn't exists but should", response.documentExists(), equalTo(true));
Fields fields = response.getFields();
assertThat(fields.size(), equalTo(1));
Terms terms = fields.terms("field");
assertThat(terms.size(), equalTo(8L));
assertThat(terms.getSumTotalTermFreq(), Matchers.equalTo((long) (9 * numDocs)));
assertThat(terms.getDocCount(), Matchers.equalTo(numDocs));
assertThat(terms.getSumDocFreq(), equalTo((long) numDocs * values.length));
TermsEnum iterator = terms.iterator(null);
for (int j = 0; j < values.length; j++) {
String string = values[j];
BytesRef next = iterator.next();
assertThat(next, Matchers.notNullValue());
assertThat("expected " + string, string, equalTo(next.utf8ToString()));
assertThat(next, Matchers.notNullValue());
assertThat("expected ttf of " + string, -1, equalTo((int) iterator.totalTermFreq()));
DocsAndPositionsEnum docsAndPositions = iterator.docsAndPositions(null, null);
assertThat(docsAndPositions.nextDoc(), equalTo(0));
assertThat(freq[j], equalTo(docsAndPositions.freq()));
assertThat(iterator.docFreq(), equalTo(-1));
int[] termPos = pos[j];
int[] termStartOffset = startOffset[j];
int[] termEndOffset = endOffset[j];
assertThat(termPos.length, equalTo(freq[j]));
assertThat(termStartOffset.length, equalTo(freq[j]));
assertThat(termEndOffset.length, equalTo(freq[j]));
for (int k = 0; k < freq[j]; k++) {
int nextPosition = docsAndPositions.nextPosition();
assertThat("term: " + string, nextPosition, equalTo(termPos[k]));
assertThat("term: " + string, docsAndPositions.startOffset(), equalTo(termStartOffset[k]));
assertThat("term: " + string, docsAndPositions.endOffset(), equalTo(termEndOffset[k]));
assertThat("term: " + string, docsAndPositions.getPayload(), equalTo(new BytesRef("word")));
}
}
assertThat(iterator.next(), Matchers.nullValue());
XContentBuilder xBuilder = XContentFactory.jsonBuilder();
response.toXContent(xBuilder, null);
BytesStream bytesStream = xBuilder.bytesStream();
String utf8 = bytesStream.bytes().toUtf8();
String expectedString = "{\"_index\":\"test\",\"_type\":\"type1\",\"_id\":\""
+ i
+ "\",\"_version\":1,\"exists\":true,\"term_vectors\":{\"field\":{\"field_statistics\":{\"sum_doc_freq\":120,\"doc_count\":15,\"sum_ttf\":135},\"terms\":{\"brown\":{\"term_freq\":1,\"pos\":[2],\"start\":[10],\"end\":[15],\"payload\":[\"d29yZA==\"]},\"dog\":{\"term_freq\":1,\"pos\":[8],\"start\":[40],\"end\":[43],\"payload\":[\"d29yZA==\"]},\"fox\":{\"term_freq\":1,\"pos\":[3],\"start\":[16],\"end\":[19],\"payload\":[\"d29yZA==\"]},\"jumps\":{\"term_freq\":1,\"pos\":[4],\"start\":[20],\"end\":[25],\"payload\":[\"d29yZA==\"]},\"lazy\":{\"term_freq\":1,\"pos\":[7],\"start\":[35],\"end\":[39],\"payload\":[\"d29yZA==\"]},\"over\":{\"term_freq\":1,\"pos\":[5],\"start\":[26],\"end\":[30],\"payload\":[\"d29yZA==\"]},\"quick\":{\"term_freq\":1,\"pos\":[1],\"start\":[4],\"end\":[9],\"payload\":[\"d29yZA==\"]},\"the\":{\"term_freq\":2,\"pos\":[0,6],\"start\":[0,31],\"end\":[3,34],\"payload\":[\"d29yZA==\",\"d29yZA==\"]}}}}}";
assertThat(utf8, equalTo(expectedString));
}
private void checkAllInfo(int numDocs, String[] values, int[] freq, int[][] pos, int[][] startOffset, int[][] endOffset, int i)
throws IOException {
TermVectorRequestBuilder resp = client().prepareTermVector("test", "type1", Integer.toString(i)).setPayloads(true).setOffsets(true)
.setPositions(true).setFieldStatistics(true).setTermStatistics(true).setSelectedFields();
        assertThat(resp.request().fieldStatistics(), equalTo(true));
        TermVectorResponse response = resp.execute().actionGet();
        assertThat("doc id: " + i + " doesn't exist but should", response.documentExists(), equalTo(true));
        Fields fields = response.getFields();
        assertThat(fields.size(), equalTo(1));
        Terms terms = fields.terms("field");
        assertThat(terms.size(), equalTo(8L));
        assertThat(terms.getSumTotalTermFreq(), Matchers.equalTo((long) (9 * numDocs)));
        assertThat(terms.getDocCount(), Matchers.equalTo(numDocs));
        assertThat(terms.getSumDocFreq(), equalTo((long) numDocs * values.length));
        TermsEnum iterator = terms.iterator(null);
        for (int j = 0; j < values.length; j++) {
            String string = values[j];
            BytesRef next = iterator.next();
            assertThat(next, Matchers.notNullValue());
            assertThat("expected " + string, string, equalTo(next.utf8ToString()));
            // Term statistics were requested this time: "the" occurs twice per
            // document, every other term once.
            if (string.equals("the")) {
                assertThat("expected ttf of " + string, (int) iterator.totalTermFreq(), equalTo(numDocs * 2));
            } else {
                assertThat("expected ttf of " + string, (int) iterator.totalTermFreq(), equalTo(numDocs));
            }
            DocsAndPositionsEnum docsAndPositions = iterator.docsAndPositions(null, null);
            assertThat(docsAndPositions.nextDoc(), equalTo(0));
            assertThat(docsAndPositions.freq(), equalTo(freq[j]));
            assertThat(iterator.docFreq(), equalTo(numDocs));
            int[] termPos = pos[j];
            int[] termStartOffset = startOffset[j];
            int[] termEndOffset = endOffset[j];
            assertThat(termPos.length, equalTo(freq[j]));
            assertThat(termStartOffset.length, equalTo(freq[j]));
            assertThat(termEndOffset.length, equalTo(freq[j]));
            for (int k = 0; k < freq[j]; k++) {
                int nextPosition = docsAndPositions.nextPosition();
                assertThat("term: " + string, nextPosition, equalTo(termPos[k]));
                assertThat("term: " + string, docsAndPositions.startOffset(), equalTo(termStartOffset[k]));
                assertThat("term: " + string, docsAndPositions.endOffset(), equalTo(termEndOffset[k]));
                assertThat("term: " + string, docsAndPositions.getPayload(), equalTo(new BytesRef("word")));
            }
        }
        assertThat(iterator.next(), Matchers.nullValue());
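        // With term statistics enabled, the expected JSON now carries per-term
        // "doc_freq" and "ttf" entries alongside the term vector data; as above,
        // the hard-coded field statistics assume numDocs == 15.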
        XContentBuilder xBuilder = XContentFactory.jsonBuilder();
        response.toXContent(xBuilder, null);
        BytesStream bytesStream = xBuilder.bytesStream();
        String utf8 = bytesStream.bytes().toUtf8();
        String expectedString = "{\"_index\":\"test\",\"_type\":\"type1\",\"_id\":\""
                + i
                + "\",\"_version\":1,\"exists\":true,\"term_vectors\":{\"field\":{\"field_statistics\":{\"sum_doc_freq\":120,\"doc_count\":15,\"sum_ttf\":135},\"terms\":{\"brown\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[2],\"start\":[10],\"end\":[15],\"payload\":[\"d29yZA==\"]},\"dog\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[8],\"start\":[40],\"end\":[43],\"payload\":[\"d29yZA==\"]},\"fox\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[3],\"start\":[16],\"end\":[19],\"payload\":[\"d29yZA==\"]},\"jumps\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[4],\"start\":[20],\"end\":[25],\"payload\":[\"d29yZA==\"]},\"lazy\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[7],\"start\":[35],\"end\":[39],\"payload\":[\"d29yZA==\"]},\"over\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[5],\"start\":[26],\"end\":[30],\"payload\":[\"d29yZA==\"]},\"quick\":{\"doc_freq\":15,\"ttf\":15,\"term_freq\":1,\"pos\":[1],\"start\":[4],\"end\":[9],\"payload\":[\"d29yZA==\"]},\"the\":{\"doc_freq\":15,\"ttf\":30,\"term_freq\":2,\"pos\":[0,6],\"start\":[0,31],\"end\":[3,34],\"payload\":[\"d29yZA==\",\"d29yZA==\"]}}}}}";
        assertThat(utf8, equalTo(expectedString));
    }
}