Currently we have many different places that convert String to UTF-8
bytes and back. We shouldn't maintain more code than necessary for
this conversion and should rather use Lucene's support for it.
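For illustration, a minimal sketch of what that round trip looks like with Lucene's BytesRef (the class and its calls are Lucene's; the surrounding example is made up):

    import org.apache.lucene.util.BytesRef;

    public class Utf8RoundTrip {
        public static void main(String[] args) {
            // String -> UTF-8 bytes, handled by Lucene instead of hand-rolled conversion code
            BytesRef bytes = new BytesRef("münchen");
            // UTF-8 bytes -> String
            String text = bytes.utf8ToString();
            System.out.println(text.equals("münchen")); // true
        }
    }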
================================
Returns information and statistics on terms in the fields of a particular document as stored in the index.
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true'
Three types of values can be requested: term information, term statistics and field statistics.
By default, all term information and field statistics are returned for all fields but no term statistics.
Optionally, you can specify the fields for which the information is retrieved either with a parameter in the url
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?fields=text,...'
or by adding the requested fields in the request body (see the example below).
Term information
-------------------------
- term frequency in the field (always returned)
- term positions ("positions" : true)
- start and end offsets ("offsets" : true)
- term payloads ("payloads" : true), as base64 encoded bytes
If the requested information wasn't stored in the index, it will be omitted without further warning.
See [mapping](http://www.elasticsearch.org/guide/reference/mapping/core-types/) on how to configure your index to store term vectors.
Term statistics
-------------------------
Setting "term_statistics" to "true" (default is "false") will return
- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)
By default these values are not returned since term statistics can have a serious performance impact.
Field statistics
-------------------------
Setting "field_statistics" to "false" (default is "true") will omit
- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)
Behavior
-------------------------
The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context.
Example
-------------------------
First, we create an index that stores term vectors, payloads etc.:
curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : "yes",
          "index_analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "index_analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}'
Second, we add some documents:
curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
  "fullname" : "John Doe",
  "text" : "twitter test test test "
}'
curl -XPUT 'http://localhost:9200/twitter/tweet/2?pretty=true' -d '{
  "fullname" : "Jane Doe",
  "text" : "Another twitter test ..."
}'
The following request returns all information and statistics for field "text" in document "1" (John Doe):
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}'
Equivalently, all parameters can be passed as URI parameters:
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true&fields=text&offsets=true&payloads=true&positions=true&term_statistics=true&field_statistics=true'
Response:
{
  "_index" : "twitter",
  "_type" : "tweet",
  "_id" : "1",
  "_version" : 1,
  "exists" : true,
  "term_vectors" : {
    "text" : {
      "field_statistics" : {
        "sum_doc_freq" : 6,
        "doc_count" : 2,
        "sum_ttf" : 8
      },
      "terms" : {
        "test" : {
          "doc_freq" : 2,
          "ttf" : 4,
          "term_freq" : 3,
          "pos" : [ 1, 2, 3 ],
          "start" : [ 8, 13, 18 ],
          "end" : [ 12, 17, 22 ],
          "payload" : [ "d29yZA==", "d29yZA==", "d29yZA==" ]
        },
        "twitter" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "pos" : [ 0 ],
          "start" : [ 0 ],
          "end" : [ 7 ],
          "payload" : [ "d29yZA==" ]
        }
      }
    }
  }
}
Further changes:
-------------------------
XContentBuilder
  New method
    public XContentBuilder field(XContentBuilderString name, int offset, int length, int... value)
  to put an integer array (see the sketch after this list).
IndicesAnalysisService
  Make the token filter for saving payloads available in elasticsearch.
AbstractFieldMapper/TypeParser
  Make the term vector options string available and also fix the parsing of this string:
  with_positions_payloads is actually allowed, as can be seen in TermVectorsConsumerPerFields.
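A hedged sketch of roughly what the new int-array field method covers, written against the pre-existing XContentBuilder API (the field name and values are illustrative only, not taken from the change):

    import org.elasticsearch.common.xcontent.XContentBuilder;
    import org.elasticsearch.common.xcontent.XContentFactory;

    public class IntArrayFieldSketch {
        // Writes {"pos":[1,2,3]} the long way; the new field(...) method is a
        // shortcut for this pattern over a slice of an int array.
        public static String positionsAsJson(int[] positions) throws Exception {
            XContentBuilder builder = XContentFactory.jsonBuilder();
            builder.startObject();
            builder.startArray("pos");
            for (int position : positions) {
                builder.value(position);
            }
            builder.endArray();
            builder.endObject();
            return builder.string();
        }
    }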
Closes #3114
According to #2515 the Ubuntu Software Center does not allow installing
Debian packages which are not lintian compatible.
I worked on the package and made it lintian compatible by
* Ignoring errors about arch-dependent binaries, as we will not split
this package. The arch-dependent libraries are used correctly.
* Adding a copyright file pointing to the Apache license in debian.
Closes #2515
Closes #2320
Currently, if an MPQ is very large, highlighting can take down a node
or cause high CPU / RAM consumption. If the query grows beyond 16 terms
we just extract the terms and do term-by-term highlighting.
Closes #3142, #3128
The SimpleFragmentsBuilder did not correct offsets when the analysis
chain in use could produce broken offsets, which could lead to
StringIndexOutOfBounds or ArrayIndexOutOfBounds exceptions.
Closes #3140
- SimpleSortTests#testSortScript, which was not using the mapping correctly
- SearchStatsTests#testSimpleStats, which didn't clear the stats before
running the test, so a previous run could have added queries
Version is now stored in a distinct field, which AbstractSimpleEngineTests
didn't correctly add before running tests. This caused a test failure
when the version needed to be loaded from the index.
Since people are using the Oracle Java distribution and not OpenJDK,
OpenJDK is no longer required (you can of course still suggest it). Now the
installation will at least continue. If the init script is called and no
JDK is available via the JAVA_HOME variable, it will exit with a useful
error message.
The Version class had hard-to-understand semantics when two versions were
compared against each other.
Sample of the new logic:
* V_0_20_0.before(V_0_90_0) => true
* V_0_90_0.after(V_0_20_0) => true
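A rough sketch of the intended semantics; this is illustrative only, not the actual org.elasticsearch.Version implementation, and the ordering by an internal numeric id is an assumption:

    public class VersionSketch {
        final int id; // assumption: e.g. V_0_20_0 gets a smaller id than V_0_90_0

        VersionSketch(int id) {
            this.id = id;
        }

        boolean before(VersionSketch other) {
            return id < other.id; // strictly older
        }

        boolean after(VersionSketch other) {
            return id > other.id; // strictly newer
        }
    }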
Closes #3124
This test checks for the "perfect" or a "sane" allocation
when the total number of shards is divisible by the total number of nodes
the index can be allocated on.
In order to ensure that configuration files do not get overwritten when
upgrading an RPM, it is not sufficient to mark them as configuration. You
have to use the 'noreplace' parameter to make sure they are never
overwritten. Added this parameter for the /etc/elasticsearch directory
as well as the /etc/sysconfig/elasticsearch file.
In addition, the post-remove script now only deletes the user in case of
a package removal (and does nothing on package upgrade).
Closes #3123
The new AbstractSharedClusterTest abstracts integration testing further to
reduce the overhead of writing tests that don't rely on explicit control over
the cluster. For instance, tests that run queries or facets, or that test
highlighting, don't need to explicitly start and stop nodes. Testing features
like the ones just mentioned is based on the assumption that the underlying
cluster can be arbitrary. Based on this assumption, this base class allows to:
* randomize cluster and index settings if not explicitly specified
* transparently test transport & node clients
* test features like search or highlighting on different cluster sizes
* allow reuse of node instances across tests
* provide utility methods that act as upper or lower bounds a test must pass with,
i.e. if a test requires at least 3 nodes then it should also pass with 4 nodes
* guarantee that, given unmodified cluster settings (persistent and transient), a cluster
reused across tests does not differ from a freshly started cluster
* allow the client implementation and the client's associated node to be changed
at any time within a test while still returning valid results.
This patch also prepares some redundant tests like 'RelocationTests.java' for randomized
testing. Tests like this are very long-running on some machines and run the same test with
different parameters like 'number of writers' or 'number of relocations', which can easily
be chosen with a random number and run only once during development but multiple times
during CI builds.
All the improvements in this change reduce the test time by ~30%.
This is mainly due to the fact that SpanNearQuery allows some neat
tricks with negative slops to run zero-sloped near queries across
2 or more SpanTermQueries.
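A minimal Lucene sketch of the zero-slop case mentioned above; the field and terms are made up, and the negative-slop trick itself is not shown here:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class SpanNearSketch {
        // With slop = 0 and inOrder = true the clauses must appear adjacently and in
        // order, which behaves like an exact phrase across the given SpanTermQueries.
        public static SpanQuery zeroSlopNear() {
            SpanQuery[] clauses = new SpanQuery[] {
                new SpanTermQuery(new Term("text", "quick")),
                new SpanTermQuery(new Term("text", "brown")),
                new SpanTermQuery(new Term("text", "fox"))
            };
            return new SpanNearQuery(clauses, 0, true);
        }
    }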
Closes #3079
New option -l, --list displays the list of existing plugins
New option -h, --help displays help
Deprecated options:
-install is now -i, --install
-remove is now -r, --remove
-url is now -u, --url
Catch ArrayIndexOutOfBoundsException when no argument is given to the install, remove or url option
Add a description of the plugin name structure:
- elasticsearch/plugin/version for official elasticsearch plugins (downloaded from download.elasticsearch.org)
- groupId/artifactId/version for community plugins (downloaded from Maven Central or OSS Sonatype)
- username/repository for site plugins (downloaded from GitHub master)
Closes #3112.
This patch makes mvn eclipse:eclipse generate additional eclipse configuration
files so that Eclipse:
- uses Java 1.6 compliance level,
- truncates lines after 140 chars,
- uses 4 spaces for indentation,
- automatically adds a license header when creating a new class file,
- organizes imports the same way as IntelliJ IDEA (which makes sense I guess
since most of the code base has been written with IntelliJ; this will prevent
large diffs due to the fact that the order of imports has
changed).
Doc values can be expected to be more compact than payloads and should provide
better flexibility since doc values formats can be picked on a per-field basis.
This patch:
- makes _version stored as a numeric doc values field (see the sketch after this list),
- manages backwards compatibility: if a version is not found in doc values,
then it will look into payloads,
- uses background merges to upgrade old segments and move _version from
payloads to doc values.
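A hedged Lucene-level sketch of the doc values part; the field name and value here are illustrative, and the actual mapper code differs:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.NumericDocValuesField;

    public class VersionDocValuesSketch {
        // Attaches the version as a numeric doc values field instead of a term payload.
        public static Document withVersion(long version) {
            Document doc = new Document();
            doc.add(new NumericDocValuesField("_version", version));
            return doc;
        }
    }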
Closes #3103
PlainHighlighter fails with an NPE when the field to highlight is marked as
stored in the mapping but doesn't exist in a hit. This patch makes
FieldsVisitor.fields less error-prone by returning an empty list instead
of null when no matching stored field was found.
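A minimal sketch of the null-safe accessor idea; the names here are hypothetical and the real FieldsVisitor differs:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    class StoredFieldsSketch {
        private Map<String, List<Object>> fields; // may be null if nothing was loaded

        // Returning an empty list instead of null spares callers such as
        // PlainHighlighter from null checks on missing stored fields.
        List<Object> values(String field) {
            if (fields == null || !fields.containsKey(field)) {
                return Collections.emptyList();
            }
            return fields.get(field);
        }
    }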
Closes #3109
This patch tries to make the suggester implementation as pluggable as
the facet or highlight implementations. The goal is to be able to create
your own suggest implementations and use them in a suggest query.
Closes #3089