From 11d08ac436321a800fcd91caa2f2ef2c1944aafb Mon Sep 17 00:00:00 2001
From: Britta Weber
Date: Wed, 29 May 2013 19:31:19 +0200
Subject: [PATCH] term vector request ================================

Returns information and statistics on terms in the fields of a particular document as stored in the index.

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true'

Three types of values can be requested: term information, term statistics and field statistics. By default, all term information and field statistics are returned for all fields but no term statistics.

Optionally, you can specify the fields for which the information is retrieved, either with a parameter in the URL

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?fields=text,...'

or by adding the requested fields in the request body (see example below).

Term information
-------------------------

- term frequency in the field (always returned)
- term positions ("positions" : true)
- start and end offsets ("offsets" : true)
- term payloads ("payloads" : true), as base64 encoded bytes

If the requested information wasn't stored in the index, it will be omitted without further warning. See [mapping](http://www.elasticsearch.org/guide/reference/mapping/core-types/) on how to configure your index to store term vectors.

Term statistics
-------------------------

Setting "term_statistics" to "true" (default is "false") will return

- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.
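
To make the two term statistics concrete, here is a minimal Java sketch that computes them by hand for two small sample texts. It is not part of this patch: the class and method names are hypothetical, and the tokenizer only approximates a whitespace tokenizer followed by a lowercase filter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in the patch.
public class TermStatisticsSketch {

    // Approximates a whitespace tokenizer followed by a lowercase filter.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Total term frequency: occurrences of the term summed over all documents.
    static long totalTermFrequency(List<String> docs, String term) {
        long ttf = 0;
        for (String doc : docs) {
            for (String token : tokenize(doc)) {
                if (token.equals(term)) {
                    ttf++;
                }
            }
        }
        return ttf;
    }

    // Document frequency: number of documents containing the term at least once.
    static int documentFrequency(List<String> docs, String term) {
        int df = 0;
        for (String doc : docs) {
            if (new HashSet<String>(tokenize(doc)).contains(term)) {
                df++;
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("twitter test test test ", "Another twitter test ...");
        System.out.println("ttf(test) = " + totalTermFrequency(docs, "test"));      // ttf(test) = 4
        System.out.println("doc_freq(test) = " + documentFrequency(docs, "test")); // doc_freq(test) = 2
    }
}
```

The term frequency reported per document ("term_freq") is the same count restricted to a single document, which is why it is cheap to return, while the two statistics above need index-wide information.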
Field statistics
-------------------------

Setting "field_statistics" to "false" (default is "true") will omit

- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)

Behavior
-------------------------

The term and field statistics are not exact. Deleted documents are not taken into account, and the information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures, whereas the absolute numbers have no meaning in this context.

Example
-------------------------

First, we create an index that stores term vectors, payloads, etc.:

    curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
      "mappings": {
        "tweet": {
          "properties": {
            "text": {
              "type": "string",
              "term_vector": "with_positions_offsets_payloads",
              "store" : "yes",
              "index_analyzer" : "fulltext_analyzer"
            },
            "fullname": {
              "type": "string",
              "term_vector": "with_positions_offsets_payloads",
              "index_analyzer" : "fulltext_analyzer"
            }
          }
        }
      },
      "settings" : {
        "index" : {
          "number_of_shards" : 1,
          "number_of_replicas" : 0
        },
        "analysis": {
          "analyzer": {
            "fulltext_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "filter": [
                "lowercase",
                "type_as_payload"
              ]
            }
          }
        }
      }
    }'

Second, we add some documents:

    curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
      "fullname" : "John Doe",
      "text" : "twitter test test test "
    }'

    curl -XPUT 'http://localhost:9200/twitter/tweet/2?pretty=true' -d '{
      "fullname" : "Jane Doe",
      "text" : "Another twitter test ..."
    }'

The following request returns all information and statistics for field "text" in document "1" (John Doe):

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
      "fields" : ["text"],
      "offsets" : true,
      "payloads" : true,
      "positions" : true,
      "term_statistics" : true,
      "field_statistics" : true
    }'

Equivalently, all parameters can be passed as URI parameters:

    curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true&fields=text&offsets=true&payloads=true&positions=true&term_statistics=true&field_statistics=true'

Response:

    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "1",
      "_version" : 1,
      "exists" : true,
      "term_vectors" : {
        "text" : {
          "field_statistics" : {
            "sum_doc_freq" : 6,
            "doc_count" : 2,
            "sum_ttf" : 8
          },
          "terms" : {
            "test" : {
              "doc_freq" : 2,
              "ttf" : 4,
              "term_freq" : 3,
              "pos" : [ 1, 2, 3 ],
              "start" : [ 8, 13, 18 ],
              "end" : [ 12, 17, 22 ],
              "payload" : [ "d29yZA==", "d29yZA==", "d29yZA==" ]
            },
            "twitter" : {
              "doc_freq" : 2,
              "ttf" : 2,
              "term_freq" : 1,
              "pos" : [ 0 ],
              "start" : [ 0 ],
              "end" : [ 7 ],
              "payload" : [ "d29yZA==" ]
            }
          }
        }
      }
    }

Further changes:
-------------------------

- XContentBuilder: new method public XContentBuilder field(XContentBuilderString name, int offset, int length, int... value) to put an integer array.
- IndicesAnalysisService: make the token filter for saving payloads available in elasticsearch.
- AbstractFieldMapper/TypeParsers: make the term vector options string available and also fix the parsing of this string: with_positions_payloads is actually allowed, as can be seen in TermVectorsConsumerPerField.

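
The field statistics and payloads in the example response above can be checked by hand. The following Java sketch is not part of this patch: all names in it are hypothetical, the tokenizer only approximates the "fulltext_analyzer" (whitespace tokenizer plus lowercase filter), and it assumes Java 8 or later for `java.util.Base64`.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in the patch.
public class FieldStatisticsSketch {

    // Approximates a whitespace tokenizer followed by a lowercase filter.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<String>();
        }
        return Arrays.asList(trimmed.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // doc_count: documents that have at least one term in the field.
    static int docCount(List<String> fieldValues) {
        int count = 0;
        for (String value : fieldValues) {
            if (!tokenize(value).isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // sum_doc_freq: document frequencies summed over all terms, which equals
    // the number of distinct terms per document summed over all documents.
    static long sumDocFreq(List<String> fieldValues) {
        long sum = 0;
        for (String value : fieldValues) {
            sum += new HashSet<String>(tokenize(value)).size();
        }
        return sum;
    }

    // sum_ttf: total term frequencies summed over all terms, i.e. the total
    // number of tokens in the field across all documents.
    static long sumTotalTermFreq(List<String> fieldValues) {
        long sum = 0;
        for (String value : fieldValues) {
            sum += tokenize(value).size();
        }
        return sum;
    }

    // Payloads are returned as base64 encoded bytes; decode them as UTF-8 text.
    static String decodePayload(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("twitter test test test ", "Another twitter test ...");
        System.out.println("doc_count = " + docCount(text));            // doc_count = 2
        System.out.println("sum_doc_freq = " + sumDocFreq(text));       // sum_doc_freq = 6
        System.out.println("sum_ttf = " + sumTotalTermFreq(text));      // sum_ttf = 8
        System.out.println("payload = " + decodePayload("d29yZA=="));   // payload = word
    }
}
```

The decoded payload is the token type ("word"), which is what the "type_as_payload" filter in the example mapping stores for each token.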
Closes #3114
---
 .../elasticsearch/action/ActionModule.java | 3 +
 .../action/termvector/TermVectorAction.java | 46 ++
 .../action/termvector/TermVectorFields.java | 469 ++++++++++++
 .../action/termvector/TermVectorRequest.java | 303 ++++++++
 .../termvector/TermVectorRequestBuilder.java | 81 +++
 .../action/termvector/TermVectorResponse.java | 323 +++++++++
 .../action/termvector/TermVectorWriter.java | 236 +++++++
 .../TransportSingleShardTermVectorAction.java | 138 ++++
 .../action/termvector/package-info.java | 23 +
 .../java/org/elasticsearch/client/Client.java | 32 +-
 .../client/support/AbstractClient.java | 19 +
 .../client/transport/TransportClient.java | 7 +
 .../common/xcontent/XContentBuilder.java | 10 +
 .../mapper/core/AbstractFieldMapper.java | 2 +-
 .../index/mapper/core/TypeParsers.java | 3 +
 .../analysis/IndicesAnalysisService.java | 14 +
 .../rest/action/RestActionModule.java | 2 +
 .../termvector/RestTermVectorAction.java | 188 +++++
 .../termvectors/GetTermVectorTests.java | 665 ++++++++++++++++++
 .../GetTermVectorTestsCheckDocFreq.java | 297 ++++++++
 20 files changed, 2859 insertions(+), 2 deletions(-)
 create mode 100644 src/main/java/org/elasticsearch/action/termvector/TermVectorAction.java
 create mode 100644 src/main/java/org/elasticsearch/action/termvector/TermVectorFields.java
 create mode 100644 src/main/java/org/elasticsearch/action/termvector/TermVectorRequest.java
 create mode 100644 src/main/java/org/elasticsearch/action/termvector/TermVectorRequestBuilder.java
 create mode 100644 src/main/java/org/elasticsearch/action/termvector/TermVectorResponse.java
 create mode 100644 src/main/java/org/elasticsearch/action/termvector/TermVectorWriter.java
 create mode 100644 src/main/java/org/elasticsearch/action/termvector/TransportSingleShardTermVectorAction.java
 create mode 100644 src/main/java/org/elasticsearch/action/termvector/package-info.java
 create mode 100644 src/main/java/org/elasticsearch/rest/action/termvector/RestTermVectorAction.java
 create mode 100644 src/test/java/org/elasticsearch/test/integration/termvectors/GetTermVectorTests.java
 create mode 100644 src/test/java/org/elasticsearch/test/integration/termvectors/GetTermVectorTestsCheckDocFreq.java

diff --git a/src/main/java/org/elasticsearch/action/ActionModule.java b/src/main/java/org/elasticsearch/action/ActionModule.java
index dc03a164528..f9d7e28c3a5 100644
--- a/src/main/java/org/elasticsearch/action/ActionModule.java
+++ b/src/main/java/org/elasticsearch/action/ActionModule.java
@@ -119,6 +119,8 @@ import org.elasticsearch.action.search.type.*;
 import org.elasticsearch.action.suggest.SuggestAction;
 import org.elasticsearch.action.suggest.TransportSuggestAction;
 import org.elasticsearch.action.support.TransportAction;
+import org.elasticsearch.action.termvector.TermVectorAction;
+import org.elasticsearch.action.termvector.TransportSingleShardTermVectorAction;
 import org.elasticsearch.action.update.TransportUpdateAction;
 import org.elasticsearch.action.update.UpdateAction;
 import org.elasticsearch.common.inject.AbstractModule;
@@ -210,6 +212,7 @@ public class ActionModule extends AbstractModule {

         registerAction(IndexAction.INSTANCE, TransportIndexAction.class);
         registerAction(GetAction.INSTANCE, TransportGetAction.class);
+        registerAction(TermVectorAction.INSTANCE, TransportSingleShardTermVectorAction.class);
         registerAction(DeleteAction.INSTANCE, TransportDeleteAction.class,
                 TransportIndexDeleteAction.class, TransportShardDeleteAction.class);
         registerAction(CountAction.INSTANCE, TransportCountAction.class);
diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorAction.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorAction.java
new file mode 100644
index 00000000000..00c4997be06
--- /dev/null
+++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorAction.java
@@ -0,0 +1,46 @@
+/*
+ * Licensed to ElasticSearch and Shay Banon under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. ElasticSearch licenses this
+ * file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.action.termvector;
+
+import org.elasticsearch.action.Action;
+
+import org.elasticsearch.client.Client;
+
+/**
+ */
+public class TermVectorAction extends Action<TermVectorRequest, TermVectorResponse, TermVectorRequestBuilder> {
+
+    public static final TermVectorAction INSTANCE = new TermVectorAction();
+    public static final String NAME = "tv";
+
+    private TermVectorAction() {
+        super(NAME);
+    }
+
+    @Override
+    public TermVectorResponse newResponse() {
+        return new TermVectorResponse();
+    }
+
+    @Override
+    public TermVectorRequestBuilder newRequestBuilder(Client client) {
+        return new TermVectorRequestBuilder(client);
+    }
+}
diff --git a/src/main/java/org/elasticsearch/action/termvector/TermVectorFields.java b/src/main/java/org/elasticsearch/action/termvector/TermVectorFields.java
new file mode 100644
index 00000000000..aa0ef394e4e
--- /dev/null
+++ b/src/main/java/org/elasticsearch/action/termvector/TermVectorFields.java
@@ -0,0 +1,469 @@
+/*
+ * Licensed to ElasticSearch and Shay Banon under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. ElasticSearch licenses this
+ * file to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.action.termvector;
+
+import static org.apache.lucene.util.ArrayUtil.grow;
+import gnu.trove.map.hash.TObjectLongHashMap;
+
+import java.io.IOException;
+import java.util.Comparator;
+import java.util.Iterator;
+
+import org.apache.lucene.index.DocsAndPositionsEnum;
+import org.apache.lucene.index.DocsEnum;
+import org.apache.lucene.index.Fields;
+import org.apache.lucene.index.Terms;
+import org.apache.lucene.index.TermsEnum;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.RamUsageEstimator;
+import org.elasticsearch.common.bytes.BytesReference;
+import org.elasticsearch.common.io.stream.BytesStreamInput;
+
+/**
+ * This class represents the result of a {@link TermVectorRequest}. It works
+ * exactly like the {@link Fields} class except for one thing: It can return
+ * offsets and payloads even if positions are not present. You must call
+ * nextPosition() anyway to move the counter, although this method only returns
+ * -1 if no positions were returned by the {@link TermVectorRequest}.
+ *
+ * The data is stored in two byte arrays ({@code headerRef} and
+ * {@code termVectors}, both {@link BytesRef}) that have the following format:
+ *

+ * {@code headerRef}: Stores offsets per field in the {@code termVectors} array
+ * and some header information as {@link BytesRef}. Format is
+ *

+ *
+ * termVectors: Stores the actual term vectors as a {@link BytesRef}.
+ *
+ * Term vectors for each field are stored in blocks, one for each field. The
+ * offsets in {@code headerRef} are used to find where the block for a field
+ * starts. Each block begins with a
+ *
+ * If the field statistics were requested ({@code hasFieldStatistics} is true,
+ * see {@code headerRef}), the following numbers are stored:
+ *
+ *
+ * After that, for each term it stores
+ *