Term Vectors: Support for artificial documents

This adds to the Term Vector API the ability to generate term vectors for
artificial documents, that is, documents not present in the index. Following
a syntax similar to the Percolator API, a new 'doc' parameter is used instead
of '_id' to specify the document of interest. The parameters '_index' and
'_type' determine the mapping, and therefore the analyzers, to apply to each
field value.

Closes #7530
Alex Ksikes 2014-08-29 14:32:22 +02:00
parent b49853a619
commit 07d741c2cb
13 changed files with 457 additions and 70 deletions


@ -1,7 +1,10 @@
[[docs-multi-termvectors]]
== Multi termvectors API
Multi termvectors API allows to get multiple termvectors based on an index, type and id. The response includes a `docs`
The multi termvectors API allows getting multiple termvectors at once. The
documents from which to retrieve the term vectors are specified by an index,
type and id. The documents can also be artificially provided in the request
body coming[1.4.0].
The response includes a `docs`
array with all the fetched termvectors, each element having the structure
provided by the <<docs-termvectors,termvectors>>
API. Here is an example:
@ -89,4 +92,31 @@ curl 'localhost:9200/testidx/test/_mtermvectors' -d '{
}'
--------------------------------------------------
Parameters can also be set by passing them as URI parameters (see <<docs-termvectors,termvectors>>). URI parameters provide the defaults and are overridden by any parameter set in the body.
Additionally coming[1.4.0], just like for the <<docs-termvectors,termvectors>>
API, term vectors can be generated for user-provided documents. The syntax
is similar to the <<search-percolate,percolator>> API. The mapping used is
determined by `_index` and `_type`.
[source,js]
--------------------------------------------------
curl 'localhost:9200/_mtermvectors' -d '{
"docs": [
{
"_index": "testidx",
"_type": "test",
"doc" : {
"fullname" : "John Doe",
"text" : "twitter test test test"
}
},
{
"_index": "testidx",
"_type": "test",
"doc" : {
"fullname" : "Jane Doe",
"text" : "Another twitter test ..."
}
}
]
}'
--------------------------------------------------
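Because each element of the `docs` array is parsed on its own, and each element takes either an `_id` or a `doc` (but not both), requests for stored and artificial documents can presumably be mixed in a single call. A hypothetical sketch, reusing the index and type from the examples above:

[source,js]
--------------------------------------------------
curl 'localhost:9200/_mtermvectors' -d '{
   "docs": [
      {
         "_index": "testidx",
         "_type": "test",
         "_id": "1"
      },
      {
         "_index": "testidx",
         "_type": "test",
         "doc" : {
            "text" : "twitter test test test"
         }
      }
   ]
}'
--------------------------------------------------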


@ -3,10 +3,11 @@
added[1.0.0.Beta1]
Returns information and statistics on terms in the fields of a
particular document as stored in the index. Note that this is a
near realtime API as the term vectors are not available until the
next refresh.
Returns information and statistics on terms in the fields of a particular
document. The document can either be stored in the index or artificially
provided by the user coming[1.4.0]. Note that for documents stored in the
index, this is a near realtime API, as the term vectors are not available
until the next refresh.
[source,js]
--------------------------------------------------
@ -41,10 +42,10 @@ statistics are returned for all fields but no term statistics.
* term payloads (`payloads` : true), as base64 encoded bytes
If the requested information wasn't stored in the index, it will be
computed on the fly if possible. See <<mapping-types,type mapping>>
for how to configure your index to store term vectors.
computed on the fly if possible. Additionally, term vectors can be computed
for documents that do not exist in the index but are provided by the user.
coming[1.4.0,The ability to computed term vectors on the fly is only available from 1.4.0 onwards (see below)]
coming[1.4.0,The ability to compute term vectors on the fly as well as support for artificial documents is only available from 1.4.0 onwards (see examples 2 and 3 below, respectively)]
[WARNING]
======
@ -86,7 +87,9 @@ The term and field statistics are not accurate. Deleted documents
are not taken into account. The information is only retrieved for the
shard the requested document resides in. The term and field statistics
are therefore only useful as relative measures whereas the absolute
numbers have no meaning in this context.
numbers have no meaning in this context. By default, when requesting term
vectors of artificial documents, the shard from which to fetch the statistics
is selected at random. Use `routing` to hit a particular shard.
[float]
=== Example 1
@ -231,7 +234,7 @@ Response:
[float]
=== Example 2 coming[1.4.0]
Additionally, term vectors which are not explicitly stored in the index are automatically
Term vectors which are not explicitly stored in the index are automatically
computed on the fly. The following request returns all information and statistics for the
fields in document `1`, even though the terms haven't been explicitly stored in the index.
Note that for the field `text`, the terms are not re-generated.
@ -246,3 +249,29 @@ curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
"field_statistics" : true
}'
--------------------------------------------------
[float]
=== Example 3 coming[1.4.0]
Term vectors can also be generated for artificial documents, that is, for
documents not present in the index. The syntax is similar to the
<<search-percolate,percolator>> API. For example, the following request would
return the same results as in example 1. The mapping used is determined by
the index and type.
[WARNING]
======
If dynamic mapping is turned on (default), document fields not present in the
original mapping will be dynamically created.
======
[source,js]
--------------------------------------------------
curl -XGET 'http://localhost:9200/twitter/tweet/_termvector' -d '{
"doc" : {
"fullname" : "John Doe",
"text" : "twitter test test test"
}
}'
--------------------------------------------------
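Since the shard used for the statistics of an artificial document is picked at random by default, one could pin it with the `routing` parameter. A sketch of such a request (the routing value `1` is illustrative only):

[source,js]
--------------------------------------------------
curl -XGET 'http://localhost:9200/twitter/tweet/_termvector?routing=1' -d '{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  }
}'
--------------------------------------------------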


@ -90,7 +90,6 @@ public class MultiTermVectorsRequest extends ActionRequest<MultiTermVectorsReque
if (token == XContentParser.Token.FIELD_NAME) {
currentFieldName = parser.currentName();
} else if (token == XContentParser.Token.START_ARRAY) {
if ("docs".equals(currentFieldName)) {
while ((token = parser.nextToken()) != XContentParser.Token.END_ARRAY) {
if (token != XContentParser.Token.START_OBJECT) {


@ -26,12 +26,17 @@ import org.elasticsearch.action.ActionRequestValidationException;
import org.elasticsearch.action.ValidateActions;
import org.elasticsearch.action.get.MultiGetRequest;
import org.elasticsearch.action.support.single.shard.SingleShardOperationRequest;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentParser;
import java.io.IOException;
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
/**
* Request returning the term vector (doc frequency, positions, offsets) for a
@ -46,10 +51,14 @@ public class TermVectorRequest extends SingleShardOperationRequest<TermVectorReq
private String id;
private BytesReference doc;
private String routing;
protected String preference;
private static final AtomicInteger randomInt = new AtomicInteger(0);
// TODO: change to String[]
private Set<String> selectedFields;
@ -129,6 +138,23 @@ public class TermVectorRequest extends SingleShardOperationRequest<TermVectorReq
return this;
}
/**
* Returns the artificial document from which term vectors are requested.
*/
public BytesReference doc() {
return doc;
}
/**
* Sets the artificial document from which term vectors are requested.
*/
public TermVectorRequest doc(XContentBuilder documentBuilder) {
// assign a random id to this artificial document, for routing
this.id(String.valueOf(randomInt.getAndAdd(1)));
this.doc = documentBuilder.bytes();
return this;
}
/**
* @return The routing for this request.
*/
@ -281,8 +307,8 @@ public class TermVectorRequest extends SingleShardOperationRequest<TermVectorReq
if (type == null) {
validationException = ValidateActions.addValidationError("type is missing", validationException);
}
if (id == null) {
validationException = ValidateActions.addValidationError("id is missing", validationException);
if (id == null && doc == null) {
validationException = ValidateActions.addValidationError("id or doc is missing", validationException);
}
return validationException;
}
@ -303,6 +329,12 @@ public class TermVectorRequest extends SingleShardOperationRequest<TermVectorReq
}
type = in.readString();
id = in.readString();
if (in.getVersion().onOrAfter(Version.V_1_4_0)) {
if (in.readBoolean()) {
doc = in.readBytesReference();
}
}
routing = in.readOptionalString();
preference = in.readOptionalString();
long flags = in.readVLong();
@ -331,6 +363,13 @@ public class TermVectorRequest extends SingleShardOperationRequest<TermVectorReq
}
out.writeString(type);
out.writeString(id);
if (out.getVersion().onOrAfter(Version.V_1_4_0)) {
out.writeBoolean(doc != null);
if (doc != null) {
out.writeBytesReference(doc);
}
}
out.writeOptionalString(routing);
out.writeOptionalString(preference);
long longFlags = 0;
@ -389,7 +428,15 @@ public class TermVectorRequest extends SingleShardOperationRequest<TermVectorReq
} else if ("_type".equals(currentFieldName)) {
termVectorRequest.type = parser.text();
} else if ("_id".equals(currentFieldName)) {
if (termVectorRequest.doc != null) {
throw new ElasticsearchParseException("Either \"id\" or \"doc\" can be specified, but not both!");
}
termVectorRequest.id = parser.text();
} else if ("doc".equals(currentFieldName)) {
if (termVectorRequest.id != null) {
throw new ElasticsearchParseException("Either \"id\" or \"doc\" can be specified, but not both!");
}
termVectorRequest.doc(jsonBuilder().copyCurrentStructure(parser));
} else if ("_routing".equals(currentFieldName) || "routing".equals(currentFieldName)) {
termVectorRequest.routing = parser.text();
} else {
@ -398,7 +445,6 @@ public class TermVectorRequest extends SingleShardOperationRequest<TermVectorReq
}
}
}
if (fields.size() > 0) {
String[] fieldsAsArray = new String[fields.size()];
termVectorRequest.selectedFields(fields.toArray(fieldsAsArray));


@ -22,6 +22,7 @@ package org.elasticsearch.action.termvector;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.ActionRequestBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.xcontent.XContentBuilder;
/**
*/
@ -35,6 +36,38 @@ public class TermVectorRequestBuilder extends ActionRequestBuilder<TermVectorReq
super(client, new TermVectorRequest(index, type, id));
}
/**
* Sets the index where the document is located.
*/
public TermVectorRequestBuilder setIndex(String index) {
request.index(index);
return this;
}
/**
* Sets the type of the document.
*/
public TermVectorRequestBuilder setType(String type) {
request.type(type);
return this;
}
/**
* Sets the id of the document.
*/
public TermVectorRequestBuilder setId(String id) {
request.id(id);
return this;
}
/**
* Sets the artificial document from which to generate term vectors.
*/
public TermVectorRequestBuilder setDoc(XContentBuilder xContent) {
request.doc(xContent);
return this;
}
/**
* Sets the routing. Required if routing isn't id based.
*/


@ -81,10 +81,11 @@ public class TermVectorResponse extends ActionResponse implements ToXContent {
private String id;
private long docVersion;
private boolean exists = false;
private boolean artificial = false;
private boolean sourceCopied = false;
int[] curentPositions = new int[0];
int[] currentPositions = new int[0];
int[] currentStartOffset = new int[0];
int[] currentEndOffset = new int[0];
BytesReference[] currentPayloads = new BytesReference[0];
@ -156,7 +157,6 @@ public class TermVectorResponse extends ActionResponse implements ToXContent {
}
};
}
}
@Override
@ -166,7 +166,9 @@ public class TermVectorResponse extends ActionResponse implements ToXContent {
assert id != null;
builder.field(FieldStrings._INDEX, index);
builder.field(FieldStrings._TYPE, type);
builder.field(FieldStrings._ID, id);
if (!isArtificial()) {
builder.field(FieldStrings._ID, id);
}
builder.field(FieldStrings._VERSION, docVersion);
builder.field(FieldStrings.FOUND, isExists());
if (!isExists()) {
@ -181,7 +183,6 @@ public class TermVectorResponse extends ActionResponse implements ToXContent {
}
builder.endObject();
return builder;
}
private void buildField(XContentBuilder builder, final CharsRef spare, Fields theFields, Iterator<String> fieldIter) throws IOException {
@ -237,7 +238,7 @@ public class TermVectorResponse extends ActionResponse implements ToXContent {
for (int i = 0; i < termFreq; i++) {
builder.startObject();
if (curTerms.hasPositions()) {
builder.field(FieldStrings.POS, curentPositions[i]);
builder.field(FieldStrings.POS, currentPositions[i]);
}
if (curTerms.hasOffsets()) {
builder.field(FieldStrings.START_OFFSET, currentStartOffset[i]);
@ -249,14 +250,13 @@ public class TermVectorResponse extends ActionResponse implements ToXContent {
builder.endObject();
}
builder.endArray();
}
private void initValues(Terms curTerms, DocsAndPositionsEnum posEnum, int termFreq) throws IOException {
for (int j = 0; j < termFreq; j++) {
int nextPos = posEnum.nextPosition();
if (curTerms.hasPositions()) {
curentPositions[j] = nextPos;
currentPositions[j] = nextPos;
}
if (curTerms.hasOffsets()) {
currentStartOffset[j] = posEnum.startOffset();
@ -269,7 +269,6 @@ public class TermVectorResponse extends ActionResponse implements ToXContent {
} else {
currentPayloads[j] = null;
}
}
}
}
@ -277,7 +276,7 @@ public class TermVectorResponse extends ActionResponse implements ToXContent {
private void initMemory(Terms curTerms, int termFreq) {
// init memory for performance reasons
if (curTerms.hasPositions()) {
curentPositions = ArrayUtil.grow(curentPositions, termFreq);
currentPositions = ArrayUtil.grow(currentPositions, termFreq);
}
if (curTerms.hasOffsets()) {
currentStartOffset = ArrayUtil.grow(currentStartOffset, termFreq);
@ -336,7 +335,6 @@ public class TermVectorResponse extends ActionResponse implements ToXContent {
public void setHeader(BytesReference header) {
headerRef = header;
}
public void setDocVersion(long version) {
@ -356,4 +354,11 @@ public class TermVectorResponse extends ActionResponse implements ToXContent {
return id;
}
public boolean isArtificial() {
return artificial;
}
public void setArtificial(boolean artificial) {
this.artificial = artificial;
}
}


@ -46,7 +46,6 @@ final class TermVectorWriter {
}
void setFields(Fields termVectorsByField, Set<String> selectedFields, EnumSet<Flag> flags, Fields topLevelFields) throws IOException {
int numFieldsWritten = 0;
TermsEnum iterator = null;
DocsAndPositionsEnum docsAndPosEnum = null;
@ -60,6 +59,11 @@ final class TermVectorWriter {
Terms fieldTermVector = termVectorsByField.terms(field);
Terms topLevelTerms = topLevelFields.terms(field);
// if no terms found, take the retrieved term vector fields for stats
if (topLevelTerms == null) {
topLevelTerms = fieldTermVector;
}
topLevelIterator = topLevelTerms.iterator(topLevelIterator);
boolean positions = flags.contains(Flag.Positions) && fieldTermVector.hasPositions();
boolean offsets = flags.contains(Flag.Offsets) && fieldTermVector.hasOffsets();
@ -75,7 +79,6 @@ final class TermVectorWriter {
// get the doc frequency
BytesRef term = iterator.term();
boolean foundTerm = topLevelIterator.seekExact(term);
assert (foundTerm);
startTerm(term);
if (flags.contains(Flag.TermStatistics)) {
writeTermStatistics(topLevelIterator);


@ -533,7 +533,6 @@ public interface Client extends ElasticsearchClient<Client>, Releasable {
*/
MoreLikeThisRequestBuilder prepareMoreLikeThis(String index, String type, String id);
/**
* An action that returns the term vectors for a specific document.
*
@ -550,6 +549,10 @@ public interface Client extends ElasticsearchClient<Client>, Releasable {
*/
void termVector(TermVectorRequest request, ActionListener<TermVectorResponse> listener);
/**
* Builder for the term vector request.
*/
TermVectorRequestBuilder prepareTermVector();
/**
* Builder for the term vector request.
@ -560,7 +563,6 @@ public interface Client extends ElasticsearchClient<Client>, Releasable {
*/
TermVectorRequestBuilder prepareTermVector(String index, String type, String id);
/**
* Multi get term vectors.
*/
@ -576,7 +578,6 @@ public interface Client extends ElasticsearchClient<Client>, Releasable {
*/
MultiTermVectorsRequestBuilder prepareMultiTermVectors();
/**
* Percolates a request returning the matches documents.
*/


@ -441,6 +441,11 @@ public abstract class AbstractClient implements Client {
execute(TermVectorAction.INSTANCE, request, listener);
}
@Override
public TermVectorRequestBuilder prepareTermVector() {
return new TermVectorRequestBuilder(this);
}
@Override
public TermVectorRequestBuilder prepareTermVector(String index, String type, String id) {
return new TermVectorRequestBuilder(this, index, type, id);


@ -126,6 +126,26 @@ public abstract class ParseContext {
return f.toArray(new IndexableField[f.size()]);
}
/**
* Returns an array of values of the field specified as the method parameter.
* This method returns an empty array when there are no
* matching fields. It never returns null.
* For {@link org.apache.lucene.document.IntField}, {@link org.apache.lucene.document.LongField}, {@link
* org.apache.lucene.document.FloatField} and {@link org.apache.lucene.document.DoubleField} it returns the string value of the number.
* If you want the actual numeric field instances back, use {@link #getFields}.
* @param name the name of the field
* @return a <code>String[]</code> of field values
*/
public final String[] getValues(String name) {
List<String> result = new ArrayList<>();
for (IndexableField field : fields) {
if (field.name().equals(name) && field.stringValue() != null) {
result.add(field.stringValue());
}
}
return result.toArray(new String[result.size()]);
}
public IndexableField getField(String name) {
for (IndexableField field : fields) {
if (field.name().equals(name)) {


@ -25,18 +25,20 @@ import org.apache.lucene.index.memory.MemoryIndex;
import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.action.termvector.TermVectorRequest;
import org.elasticsearch.action.termvector.TermVectorResponse;
import org.elasticsearch.cluster.action.index.MappingUpdatedAction;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.collect.Tuple;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.lucene.uid.Versions;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.engine.Engine;
import org.elasticsearch.index.get.GetField;
import org.elasticsearch.index.get.GetResult;
import org.elasticsearch.index.mapper.FieldMapper;
import org.elasticsearch.index.mapper.Uid;
import org.elasticsearch.index.mapper.*;
import org.elasticsearch.index.mapper.core.StringFieldMapper;
import org.elasticsearch.index.mapper.internal.UidFieldMapper;
import org.elasticsearch.index.service.IndexService;
import org.elasticsearch.index.settings.IndexSettings;
import org.elasticsearch.index.shard.AbstractIndexShardComponent;
import org.elasticsearch.index.shard.ShardId;
@ -45,16 +47,20 @@ import org.elasticsearch.index.shard.service.IndexShard;
import java.io.IOException;
import java.util.*;
import static org.elasticsearch.index.mapper.SourceToParse.source;
/**
*/
public class ShardTermVectorService extends AbstractIndexShardComponent {
private IndexShard indexShard;
private final MappingUpdatedAction mappingUpdatedAction;
@Inject
public ShardTermVectorService(ShardId shardId, @IndexSettings Settings indexSettings) {
public ShardTermVectorService(ShardId shardId, @IndexSettings Settings indexSettings, MappingUpdatedAction mappingUpdatedAction) {
super(shardId, indexSettings);
this.mappingUpdatedAction = mappingUpdatedAction;
}
// sadly, to overcome cyclic dep, we need to do this and inject it ourselves...
@ -67,23 +73,39 @@ public class ShardTermVectorService extends AbstractIndexShardComponent {
final Engine.Searcher searcher = indexShard.acquireSearcher("term_vector");
IndexReader topLevelReader = searcher.reader();
final TermVectorResponse termVectorResponse = new TermVectorResponse(concreteIndex, request.type(), request.id());
final Term uidTerm = new Term(UidFieldMapper.NAME, Uid.createUidAsBytes(request.type(), request.id()));
/* handle potential wildcards in fields */
if (request.selectedFields() != null) {
handleFieldWildcards(request);
}
try {
Fields topLevelFields = MultiFields.getFields(topLevelReader);
Versions.DocIdAndVersion docIdAndVersion = Versions.loadDocIdAndVersion(topLevelReader, uidTerm);
if (docIdAndVersion != null) {
/* handle potential wildcards in fields */
if (request.selectedFields() != null) {
handleFieldWildcards(request);
}
/* generate term vectors if not available */
Fields termVectorsByField = docIdAndVersion.context.reader().getTermVectors(docIdAndVersion.docId);
if (request.selectedFields() != null) {
termVectorsByField = generateTermVectorsIfNeeded(termVectorsByField, request, uidTerm, false);
/* from an artificial document */
if (request.doc() != null) {
Fields termVectorsByField = generateTermVectorsFromDoc(request);
// if no document indexed in shard, take the queried document itself for stats
if (topLevelFields == null) {
topLevelFields = termVectorsByField;
}
termVectorResponse.setFields(termVectorsByField, request.selectedFields(), request.getFlags(), topLevelFields);
termVectorResponse.setExists(true);
termVectorResponse.setArtificial(true);
return termVectorResponse;
}
/* or from an existing document */
final Term uidTerm = new Term(UidFieldMapper.NAME, Uid.createUidAsBytes(request.type(), request.id()));
Versions.DocIdAndVersion docIdAndVersion = Versions.loadDocIdAndVersion(topLevelReader, uidTerm);
if (docIdAndVersion != null) {
// fields with stored term vectors
Fields termVectorsByField = docIdAndVersion.context.reader().getTermVectors(docIdAndVersion.docId);
// fields without term vectors
if (request.selectedFields() != null) {
termVectorsByField = addGeneratedTermVectors(termVectorsByField, request, uidTerm, false);
}
termVectorResponse.setFields(termVectorsByField, request.selectedFields(), request.getFlags(), topLevelFields);
termVectorResponse.setDocVersion(docIdAndVersion.version);
termVectorResponse.setExists(true);
} else {
termVectorResponse.setExists(false);
}
@ -103,39 +125,52 @@ public class ShardTermVectorService extends AbstractIndexShardComponent {
request.selectedFields(fieldNames.toArray(Strings.EMPTY_ARRAY));
}
private Fields generateTermVectorsIfNeeded(Fields termVectorsByField, TermVectorRequest request, Term uidTerm, boolean realTime) throws IOException {
List<String> validFields = new ArrayList<>();
private boolean isValidField(FieldMapper field) {
// must be a string
if (!(field instanceof StringFieldMapper)) {
return false;
}
// and must be indexed
if (!field.fieldType().indexed()) {
return false;
}
return true;
}
private Fields addGeneratedTermVectors(Fields termVectorsByField, TermVectorRequest request, Term uidTerm, boolean realTime) throws IOException {
/* only keep valid fields */
Set<String> validFields = new HashSet<>();
for (String field : request.selectedFields()) {
FieldMapper fieldMapper = indexShard.mapperService().smartNameFieldMapper(field);
if (!(fieldMapper instanceof StringFieldMapper)) {
if (!isValidField(fieldMapper)) {
continue;
}
// already retrieved
if (fieldMapper.fieldType().storeTermVectors()) {
continue;
}
// only disallow fields which are not indexed
if (!fieldMapper.fieldType().indexed()) {
continue;
}
validFields.add(field);
}
if (validFields.isEmpty()) {
return termVectorsByField;
}
/* generate term vectors from fetched document fields */
Engine.GetResult get = indexShard.get(new Engine.Get(realTime, uidTerm));
Fields generatedTermVectors;
try {
if (!get.exists()) {
return termVectorsByField;
}
// TODO: support for fetchSourceContext?
GetResult getResult = indexShard.getService().get(
get, request.id(), request.type(), validFields.toArray(Strings.EMPTY_ARRAY), null, false);
generatedTermVectors = generateTermVectors(getResult.getFields().values(), request.offsets());
} finally {
get.release();
}
/* merge with existing Fields */
if (termVectorsByField == null) {
return generatedTermVectors;
} else {
@ -144,7 +179,7 @@ public class ShardTermVectorService extends AbstractIndexShardComponent {
}
private Fields generateTermVectors(Collection<GetField> getFields, boolean withOffsets) throws IOException {
// store document in memory index
/* store document in memory index */
MemoryIndex index = new MemoryIndex(withOffsets);
for (GetField getField : getFields) {
String field = getField.getName();
@ -156,10 +191,51 @@ public class ShardTermVectorService extends AbstractIndexShardComponent {
index.addField(field, text.toString(), analyzer);
}
}
// and read vectors from it
/* and read vectors from it */
return MultiFields.getFields(index.createSearcher().getIndexReader());
}
private Fields generateTermVectorsFromDoc(TermVectorRequest request) throws IOException {
// parse the document; for now we do update the mapping, just like percolate does
ParsedDocument parsedDocument = parseDocument(indexShard.shardId().getIndex(), request.type(), request.doc());
// select the right fields and generate term vectors
ParseContext.Document doc = parsedDocument.rootDoc();
Collection<String> seenFields = new HashSet<>();
Collection<GetField> getFields = new HashSet<>();
for (IndexableField field : doc.getFields()) {
FieldMapper fieldMapper = indexShard.mapperService().smartNameFieldMapper(field.name());
if (seenFields.contains(field.name())) {
continue;
}
seenFields.add(field.name());
if (!isValidField(fieldMapper)) {
continue;
}
if (request.selectedFields() != null && !request.selectedFields().contains(field.name())) {
continue;
}
String[] values = doc.getValues(field.name());
getFields.add(new GetField(field.name(), Arrays.asList((Object[]) values)));
}
return generateTermVectors(getFields, request.offsets());
}
private ParsedDocument parseDocument(String index, String type, BytesReference doc) {
MapperService mapperService = indexShard.mapperService();
IndexService indexService = indexShard.indexService();
// TODO: make parsing not dynamically create fields not in the original mapping
Tuple<DocumentMapper, Boolean> docMapper = mapperService.documentMapperWithAutoCreate(type);
ParsedDocument parsedDocument = docMapper.v1().parse(source(doc).type(type).flyweight(true)).setMappingsModified(docMapper);
if (parsedDocument.mappingsModified()) {
mappingUpdatedAction.updateMappingOnMaster(index, docMapper.v1(), indexService.indexUUID());
}
return parsedDocument;
}
private Fields mergeFields(String[] fieldNames, Fields... fieldsObject) throws IOException {
ParallelFields parallelFields = new ParallelFields();
for (Fields fieldObject : fieldsObject) {


@ -48,6 +48,8 @@ public class RestTermVectorAction extends BaseRestHandler {
@Inject
public RestTermVectorAction(Settings settings, Client client, RestController controller) {
super(settings, client);
controller.registerHandler(GET, "/{index}/{type}/_termvector", this);
controller.registerHandler(POST, "/{index}/{type}/_termvector", this);
controller.registerHandler(GET, "/{index}/{type}/{id}/_termvector", this);
controller.registerHandler(POST, "/{index}/{type}/{id}/_termvector", this);
}


@ -31,8 +31,9 @@ import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.index.mapper.core.AbstractFieldMapper;
import org.elasticsearch.index.service.IndexService;
import org.elasticsearch.indices.IndicesService;
import org.junit.Test;
import java.io.IOException;
@ -43,6 +44,7 @@ import java.util.Map;
import java.util.concurrent.ExecutionException;
import static org.elasticsearch.common.settings.ImmutableSettings.settingsBuilder;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;
import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertThrows;
import static org.hamcrest.Matchers.*;
@ -51,7 +53,7 @@ public class GetTermVectorTests extends AbstractTermVectorTests {
@Test
public void testNoSuchDoc() throws Exception {
XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1")
XContentBuilder mapping = jsonBuilder().startObject().startObject("type1")
.startObject("properties")
.startObject("field")
.field("type", "string")
@ -72,13 +74,13 @@ public class GetTermVectorTests extends AbstractTermVectorTests {
assertThat(actionGet.getIndex(), equalTo("test"));
assertThat(actionGet.isExists(), equalTo(false));
// check response is nevertheless serializable to json
actionGet.toXContent(XContentFactory.jsonBuilder(), ToXContent.EMPTY_PARAMS);
actionGet.toXContent(jsonBuilder(), ToXContent.EMPTY_PARAMS);
}
}
@Test
public void testExistingFieldWithNoTermVectorsNoNPE() throws Exception {
XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1")
XContentBuilder mapping = jsonBuilder().startObject().startObject("type1")
.startObject("properties")
.startObject("existingfield")
.field("type", "string")
@ -107,7 +109,7 @@ public class GetTermVectorTests extends AbstractTermVectorTests {
@Test
public void testExistingFieldButNotInDocNPE() throws Exception {
XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1")
XContentBuilder mapping = jsonBuilder().startObject().startObject("type1")
.startObject("properties")
.startObject("existingfield")
.field("type", "string")
@ -179,7 +181,7 @@ public class GetTermVectorTests extends AbstractTermVectorTests {
@Test
public void testSimpleTermVectors() throws ElasticsearchException, IOException {
XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1")
XContentBuilder mapping = jsonBuilder().startObject().startObject("type1")
.startObject("properties")
.startObject("field")
.field("type", "string")
@ -197,7 +199,7 @@ public class GetTermVectorTests extends AbstractTermVectorTests {
ensureYellow();
for (int i = 0; i < 10; i++) {
client().prepareIndex("test", "type1", Integer.toString(i))
.setSource(XContentFactory.jsonBuilder().startObject().field("field", "the quick brown fox jumps over the lazy dog")
.setSource(jsonBuilder().startObject().field("field", "the quick brown fox jumps over the lazy dog")
// 0the3 4quick9 10brown15 16fox19 20jumps25 26over30
// 31the34 35lazy39 40dog43
.endObject()).execute().actionGet();
@ -268,7 +270,7 @@ public class GetTermVectorTests extends AbstractTermVectorTests {
ft.setStoreTermVectorPositions(storePositions);
String optionString = AbstractFieldMapper.termVectorOptionsToString(ft);
XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1")
XContentBuilder mapping = jsonBuilder().startObject().startObject("type1")
.startObject("properties")
.startObject("field")
.field("type", "string")
@ -284,7 +286,7 @@ public class GetTermVectorTests extends AbstractTermVectorTests {
ensureYellow();
for (int i = 0; i < 10; i++) {
client().prepareIndex("test", "type1", Integer.toString(i))
.setSource(XContentFactory.jsonBuilder().startObject().field("field", "the quick brown fox jumps over the lazy dog")
.setSource(jsonBuilder().startObject().field("field", "the quick brown fox jumps over the lazy dog")
// 0the3 4quick9 10brown15 16fox19 20jumps25 26over30
// 31the34 35lazy39 40dog43
.endObject()).execute().actionGet();
@@ -423,7 +425,7 @@ public class GetTermVectorTests extends AbstractTermVectorTests {
         String delimiter = createRandomDelimiter(tokens);
         String queryString = createString(tokens, payloads, encoding, delimiter.charAt(0));
         //create the mapping
-        XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1").startObject("properties")
+        XContentBuilder mapping = jsonBuilder().startObject().startObject("type1").startObject("properties")
                 .startObject("field").field("type", "string").field("term_vector", "with_positions_offsets_payloads")
                 .field("analyzer", "payload_test").endObject().endObject().endObject().endObject();
         assertAcked(prepareCreate("test").addMapping("type1", mapping).setSettings(
@@ -437,7 +439,7 @@ public class GetTermVectorTests extends AbstractTermVectorTests {
         ensureYellow();
         client().prepareIndex("test", "type1", Integer.toString(1))
-                .setSource(XContentFactory.jsonBuilder().startObject().field("field", queryString).endObject()).execute().actionGet();
+                .setSource(jsonBuilder().startObject().field("field", queryString).endObject()).execute().actionGet();
         refresh();
         TermVectorRequestBuilder resp = client().prepareTermVector("test", "type1", Integer.toString(1)).setPayloads(true).setOffsets(true)
                 .setPositions(true).setSelectedFields();
@@ -579,8 +581,8 @@ public class GetTermVectorTests extends AbstractTermVectorTests {
             fieldNames[i] = "field" + String.valueOf(i);
         }
 
-        XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1").startObject("properties");
-        XContentBuilder source = XContentFactory.jsonBuilder().startObject();
+        XContentBuilder mapping = jsonBuilder().startObject().startObject("type1").startObject("properties");
+        XContentBuilder source = jsonBuilder().startObject();
         for (String field : fieldNames) {
             mapping.startObject(field)
                     .field("type", "string")
@@ -764,8 +766,8 @@ public class GetTermVectorTests extends AbstractTermVectorTests {
     public void testSimpleWildCards() throws ElasticsearchException, IOException {
         int numFields = 25;
 
-        XContentBuilder mapping = XContentFactory.jsonBuilder().startObject().startObject("type1").startObject("properties");
-        XContentBuilder source = XContentFactory.jsonBuilder().startObject();
+        XContentBuilder mapping = jsonBuilder().startObject().startObject("type1").startObject("properties");
+        XContentBuilder source = jsonBuilder().startObject();
         for (int i = 0; i < numFields; i++) {
             mapping.startObject("field" + i)
                     .field("type", "string")
@@ -788,6 +790,142 @@ public class GetTermVectorTests extends AbstractTermVectorTests {
         assertThat("All term vectors should have been generated", response.getFields().size(), equalTo(numFields));
     }
 
+    @Test
+    public void testArtificialVsExisting() throws ElasticsearchException, ExecutionException, InterruptedException, IOException {
+        // setup indices
+        ImmutableSettings.Builder settings = settingsBuilder()
+                .put(indexSettings())
+                .put("index.analysis.analyzer", "standard");
+        assertAcked(prepareCreate("test")
+                .setSettings(settings)
+                .addMapping("type1", "field1", "type=string,term_vector=with_positions_offsets"));
+        ensureGreen();
+
+        // index existing documents
+        String[] content = new String[]{
+                "Generating a random permutation of a sequence (such as when shuffling cards).",
+                "Selecting a random sample of a population (important in statistical sampling).",
+                "Allocating experimental units via random assignment to a treatment or control condition.",
+                "Generating random numbers: see Random number generation."};
+        List<IndexRequestBuilder> indexBuilders = new ArrayList<>();
+        for (int i = 0; i < content.length; i++) {
+            indexBuilders.add(client().prepareIndex()
+                    .setIndex("test")
+                    .setType("type1")
+                    .setId(String.valueOf(i))
+                    .setSource("field1", content[i]));
+        }
+        indexRandom(true, indexBuilders);
+
+        for (int i = 0; i < content.length; i++) {
+            // request tvs from existing document
+            TermVectorResponse respExisting = client().prepareTermVector("test", "type1", String.valueOf(i))
+                    .setOffsets(true)
+                    .setPositions(true)
+                    .setFieldStatistics(true)
+                    .setTermStatistics(true)
+                    .get();
+            assertThat("doc with index: test, type1 and id: " + String.valueOf(i), respExisting.isExists(), equalTo(true));
+
+            // request tvs from artificial document
+            TermVectorResponse respArtificial = client().prepareTermVector()
+                    .setIndex("test")
+                    .setType("type1")
+                    .setRouting(String.valueOf(i)) // ensure we get the stats from the same shard as the existing doc
+                    .setDoc(jsonBuilder()
+                            .startObject()
+                            .field("field1", content[i])
+                            .endObject())
+                    .setOffsets(true)
+                    .setPositions(true)
+                    .setFieldStatistics(true)
+                    .setTermStatistics(true)
+                    .get();
+            assertThat("doc with index: test, type1 and id: " + String.valueOf(i), respArtificial.isExists(), equalTo(true));
+
+            // compare term vectors of the existing document with the artificial one
+            compareTermVectors("field1", respExisting.getFields(), respArtificial.getFields());
+        }
+    }
+
+    @Test
+    public void testArtificialNoDoc() throws IOException {
+        // setup indices
+        ImmutableSettings.Builder settings = settingsBuilder()
+                .put(indexSettings())
+                .put("index.analysis.analyzer", "standard");
+        assertAcked(prepareCreate("test")
+                .setSettings(settings)
+                .addMapping("type1", "field1", "type=string"));
+        ensureGreen();
+
+        // request tvs from artificial document
+        String text = "the quick brown fox jumps over the lazy dog";
+        TermVectorResponse resp = client().prepareTermVector()
+                .setIndex("test")
+                .setType("type1")
+                .setDoc(jsonBuilder()
+                        .startObject()
+                        .field("field1", text)
+                        .endObject())
+                .setOffsets(true)
+                .setPositions(true)
+                .setFieldStatistics(true)
+                .setTermStatistics(true)
+                .get();
+        assertThat(resp.isExists(), equalTo(true));
+        checkBrownFoxTermVector(resp.getFields(), "field1", false);
+    }
+
+    @Test
+    public void testArtificialNonExistingField() throws Exception {
+        // setup indices
+        ImmutableSettings.Builder settings = settingsBuilder()
+                .put(indexSettings())
+                .put("index.analysis.analyzer", "standard");
+        assertAcked(prepareCreate("test")
+                .setSettings(settings)
+                .addMapping("type1", "field1", "type=string"));
+        ensureGreen();
+
+        // index just one doc
+        List<IndexRequestBuilder> indexBuilders = new ArrayList<>();
+        indexBuilders.add(client().prepareIndex()
+                .setIndex("test")
+                .setType("type1")
+                .setId("1")
+                .setRouting("1")
+                .setSource("field1", "some text"));
+        indexRandom(true, indexBuilders);
+
+        // request tvs from artificial document with a non-existing field
+        XContentBuilder doc = jsonBuilder()
+                .startObject()
+                .field("field1", "the quick brown fox jumps over the lazy dog")
+                .field("non_existing", "the quick brown fox jumps over the lazy dog")
+                .endObject();
+
+        for (int i = 0; i < 2; i++) {
+            TermVectorResponse resp = client().prepareTermVector()
+                    .setIndex("test")
+                    .setType("type1")
+                    .setDoc(doc)
+                    .setRouting("" + i)
+                    .setOffsets(true)
+                    .setPositions(true)
+                    .setFieldStatistics(true)
+                    .setTermStatistics(true)
+                    .get();
+            assertThat(resp.isExists(), equalTo(true));
+            checkBrownFoxTermVector(resp.getFields(), "field1", false);
+            // we should have created a mapping for this field
+            waitForMappingOnMaster("test", "type1", "non_existing");
+            // and returned the generated term vectors for it
+            checkBrownFoxTermVector(resp.getFields(), "non_existing", false);
+        }
+    }
+
     private static String indexOrAlias() {
         return randomBoolean() ? "test" : "alias";
     }