The PhraseSuggester can be very slow and CPU intensive if a lot of terms
are suggested. To prevent cluster instability and long-running requests,
this commit adds a hard limit, 10 tokens by default: if the query is parsed
into more tokens than that, no correction is returned anymore.
Closes#3164
Until now, 'named' date formats like dateOptionalTime could not be used as
part of a group of date formats. This patch allows grouping them arbitrarily,
for example:
* yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||dateOptionalTime
* dateOptionalTime||yyyy/MM/dd HH:mm:ss||yyyy/MM/dd
* yyyy/MM/dd HH:mm:ss||dateOptionalTime||yyyy/MM/dd
* date_time||date_time_no_millis
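A hedged sketch of how such a combined format might be declared in a mapping, built here with XContentBuilder (the "tweet" type and "created" field are illustrative, not part of the original change):
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

public class GroupedDateFormatMapping {
    public static void main(String[] args) throws Exception {
        // A date field whose format mixes an explicit pattern with the
        // named dateOptionalTime format, separated by '||'.
        XContentBuilder mapping = XContentFactory.jsonBuilder()
                .startObject()
                    .startObject("tweet")
                        .startObject("properties")
                            .startObject("created")
                                .field("type", "date")
                                .field("format", "yyyy/MM/dd HH:mm:ss||dateOptionalTime")
                            .endObject()
                        .endObject()
                    .endObject()
                .endObject();
        System.out.println(mapping.string());
    }
}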
Closes#2132
Character.codePointAt and codePointBefore have two versions: one that only
accepts an offset, and one that additionally accepts a bound (a limit for
codePointAt, a start for codePointBefore). The former can be dangerous when
working with character buffers: if the offset is the last char of the buffer,
a char outside the buffer might be used to compute the code point. One should
therefore always use the version that accepts a bound.
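A minimal, contrived sketch of the difference:
public class CodePointAtDemo {
    public static void main(String[] args) {
        // Only the first char of this buffer is valid; the second is stale
        // data (a low surrogate left over from an earlier fill).
        char[] buf = { '\uD835', '\uDD0A' };
        int limit = 1;

        // Without a limit, the high surrogate at buf[0] pairs with the stale
        // low surrogate at buf[1] and yields code point 0x1d50a.
        System.out.println(Integer.toHexString(Character.codePointAt(buf, 0)));

        // With a limit, lookahead stops at the end of the valid region and
        // the lone high surrogate itself (0xd835) is returned.
        System.out.println(Integer.toHexString(Character.codePointAt(buf, 0, limit)));
    }
}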
Collections.sort is wasteful on random-access lists: it dumps the data into an
array, sorts the array, and then copies the elements back into the list.
However, the sorting can easily be performed in place by using Lucene's
CollectionUtil.(merge|quick|tim)Sort, as sketched below.
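A minimal sketch of the in-place variant, assuming Lucene's CollectionUtil is on the classpath:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.util.CollectionUtil;

public class InPlaceSortDemo {
    public static void main(String[] args) {
        List<String> terms = new ArrayList<String>(Arrays.asList("zoo", "alpha", "mango"));
        // Sorts the random-access list in place; Collections.sort would first
        // dump the elements into an array and copy them back afterwards.
        CollectionUtil.timSort(terms);
        System.out.println(terms); // [alpha, mango, zoo]
    }
}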
The flag is set to true when a document is new, and to false when it replaces an existing document.
Other minor changes:
Fixed an issue with dynamic gc deletes settings update
Added an assertThrows to ElasticsearchAssertion
Closes#3084, Closes#3154
Both APIs now also support a `local` parameter, which fetches the mapping / warmer from the cluster state of the node that received the request. The `type` option in the get mapping API now also supports wildcards. The warmer API now also supports the `type` option.
Closes#3171
Lucene's MergePolicies support a noCFSRatio. This commit introduces
support for this ratio via `index.compound_format`. This setting
can parse a boolean value or a value in the interval [0..1] that
is equivalent to the noCFSRatio. The settings `1`, `1.0` and `true`
are equivalent, as are `0`, `0.0` and `false`.
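As a hedged sketch, the setting could be supplied like any other index setting, e.g. via a Settings builder (the 0.5 ratio is illustrative):
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;

public class CompoundFormatSetting {
    public static void main(String[] args) {
        // A ratio of 0.5 means segments up to half the index size may use
        // the compound file format; true/false map to 1.0/0.0.
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("index.compound_format", "0.5")
                .build();
        System.out.println(settings.get("index.compound_format"));
    }
}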
Closes#3166
Currently we have many different places that convert Strings to UTF-8
bytes and back. We shouldn't maintain more code than necessary for this
conversion and should rather use Lucene's support for it.
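For instance, Lucene's BytesRef already covers the round trip:
import org.apache.lucene.util.BytesRef;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        // Encode a String to UTF-8 bytes and decode it back using Lucene.
        BytesRef utf8 = new BytesRef("München");
        String decoded = utf8.utf8ToString();
        System.out.println(decoded); // München
    }
}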
Term Vector API
================================
Returns information and statistics on terms in the fields of a particular document as stored in the index.
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true'
Three types of values can be requested: term information, term statistics and field statistics.
By default, all term information and field statistics are returned for all fields but no term statistics.
Optionally, you can specify the fields for which the information is retrieved, either with a parameter in the URL
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?fields=text,...'
or by adding the requested fields in the request body (see example below).
Term information
-------------------------
- term frequency in the field (always returned)
- term positions ("positions" : true)
- start and end offsets ("offsets" : true)
- term payloads ("payloads" : true), as base64 encoded bytes
If the requested information wasn't stored in the index, it will be omitted without further warning.
See [mapping](http://www.elasticsearch.org/guide/reference/mapping/core-types/) on how to configure your index to store term vectors.
Term statistics
-------------------------
Setting "term_statistics" to "true" (default is "false") will return
- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)
By default these values are not returned since term statistics can have a serious performance impact.
Field statistics
-------------------------
Setting "field_statistics" to "false" (default is "true") will omit
- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)
Behavior
-------------------------
The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context.
Example
-------------------------
First, we create an index that stores term vectors, payloads, etc.:
curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : "yes",
          "index_analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "index_analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}'
Second, we add some documents:
curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
  "fullname" : "John Doe",
  "text" : "twitter test test test "
}'
curl -XPUT 'http://localhost:9200/twitter/tweet/2?pretty=true' -d '{
  "fullname" : "Jane Doe",
  "text" : "Another twitter test ..."
}'
The following request returns all information and statistics for field "text" in document "1" (John Doe):
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}'
Equivalently, all parameters can be passed as URI parameters:
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true&fields=text&offsets=true&payloads=true&positions=true&term_statistics=true&field_statistics=true'
Response:
{
  "_index" : "twitter",
  "_type" : "tweet",
  "_id" : "1",
  "_version" : 1,
  "exists" : true,
  "term_vectors" : {
    "text" : {
      "field_statistics" : {
        "sum_doc_freq" : 6,
        "doc_count" : 2,
        "sum_ttf" : 8
      },
      "terms" : {
        "test" : {
          "doc_freq" : 2,
          "ttf" : 4,
          "term_freq" : 3,
          "pos" : [ 1, 2, 3 ],
          "start" : [ 8, 13, 18 ],
          "end" : [ 12, 17, 22 ],
          "payload" : [ "d29yZA==", "d29yZA==", "d29yZA==" ]
        },
        "twitter" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "pos" : [ 0 ],
          "start" : [ 0 ],
          "end" : [ 7 ],
          "payload" : [ "d29yZA==" ]
        }
      }
    }
  }
}
Further changes:
-------------------------
XContentBuilder:
- new method
  public XContentBuilder field(XContentBuilderString name, int offset, int length, int... value)
  to put an integer array (see the sketch after this list)
IndicesAnalysisService:
- make the token filter for saving payloads available in elasticsearch
AbstractFieldMapper/TypeParser:
- make the term vector options string available and fix the parsing of this string:
  with_positions_payloads is actually allowed, as can be seen in TermVectorsConsumerPerFields.
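A hedged usage sketch of the new XContentBuilder method; the slice semantics (offset and length into the varargs array) are assumed from the signature:
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentBuilderString;
import org.elasticsearch.common.xcontent.XContentFactory;

public class IntArrayField {
    public static void main(String[] args) throws Exception {
        int[] positions = { 1, 2, 3, 4, 5 };
        XContentBuilder builder = XContentFactory.jsonBuilder()
                .startObject()
                // assumed to emit "pos" : [1,2,3] (the first three values)
                .field(new XContentBuilderString("pos"), 0, 3, positions)
                .endObject();
        System.out.println(builder.string());
    }
}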
Closes#3114
According to #2515, the Ubuntu Software Center does not allow installing
Debian packages which are not lintian compatible.
I worked on the package and made it lintian compatible by:
* Ignoring errors about arch-dependent binaries, as we will not split
this package. The arch-dependent libraries are used correctly.
* Adding a copyright file pointing to the Apache license in debian
Closes#2515, Closes#2320
Currently, if an MPQ is very large, highlighting can take down a node
or cause high CPU / RAM consumption. If the query grows beyond 16 terms,
we just extract the terms and do term-by-term highlighting.
Closes #3142, #3128
The SimpleFragmentsBuilder did not correct offsets if the analysis chain
in use could produce broken offsets, which could lead to
String/ArrayIndexOutOfBounds exceptions.
Closes#3140
Fixed two tests:
- SimpleSortTests#testSortScript, which was not using the mapping correctly
- SearchStatsTests#testSimpleStats, which didn't clear the stats before
running the test, so a previous run could have added queries
The version is now stored in a distinct field that AbstractSimpleEngineTests
didn't correctly add before running tests. This caused a test failure
when the version needed to be loaded from the index.
Since people are using the Oracle Java distribution and not OpenJDK,
the package no longer hard-depends on a JDK (it can of course still be
suggested), so the installation will at least continue. If the init script
is called and no JDK is available via the JAVA_HOME variable, it will exit
with a useful error message.
The Version class had hard-to-understand semantics when two versions were
compared against each other.
Sample of the new logic:
* V_0_20_0.before(V_0_90_0) => true
* V_0_90_0.after(V_0_20_0) => true
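A minimal runnable sketch of the comparisons (assuming the Version constants shown above exist in org.elasticsearch.Version):
import org.elasticsearch.Version;

public class VersionCompareDemo {
    public static void main(String[] args) {
        System.out.println(Version.V_0_20_0.before(Version.V_0_90_0)); // true
        System.out.println(Version.V_0_90_0.after(Version.V_0_20_0));  // true
    }
}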
Closes#3124