RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.
closes#6907, #7446
Items with no specified field now defaults to all the possible fields from the
document source. Previously, we had required 'fields' to be specified either
as a top level parameter or for each item. The default behavior is now similar
to the MLT API.
Closes#7382
The term vector API can now generate term vectors on the fly, if the terms are
not already stored in the index. This commit exploits this new functionality
for the MLT query. Now the terms are directly retrieved using multi-
termvectors API, instead of generating them from the texts retrieved using the
multi-get API.
Closes#7014
The current implementation of 'date_histogram' does not understand
the `factor` parameter. Since the docs shouldn't raise false hopes,
I removed the section.
Closes#7277
This change means that the default settings for expand_wildcards are only applied if the expand_wildcards parameter is not specified rather than being set upfront. It also adds the none and all options to the parameter to allow the user to specify no expansion and expansion to all indexes (equivalent to 'open,closed')
Closes#7258
This change stores the index creation time in the index metadata when an index is created. The creation time cannot be changed but can be set as part of the create index request to allow for correct creation times for historical data.
Closes#7119
This documentation was dangerous because it felt like it was possible to gain
substantial performance by just switching the codec of the index.
However, non-default codecs are dangerous to use since they are not supported
in terms of backward compatibility, and most improvements that they bring have
been folded into the default codec anyway (for example, the default codec
"pulses" postings lists that contain a single document).
In the case of inserts the UpdateHelper class will now allow the script used to apply updates to run on the upsert doc provided by clients. This allows the logic for managing the internal state of the data item to be managed by the script and is not reliant on clients performing the initialisation of data structures managed by the script.
Closes#7143
This adds support to return the "Access-Control-Allow-Credentials" header
if needed, so CORS will work flawlessly with authenticated applications.
Closes#6380
AND and OR filter docs talk about different targets for the operators. I
believe that both should be described in terms of modifying other 'filters'.
I also added articles for easier (human) parsing. This fixes#4762Closes#7165
Filters and Queries now supports `time_zone` parameter which defines which time zone should be applied to the query or filter to convert it to UTC time based value.
When applied on `date` fields the `range` filter and queries accept also a `time_zone` parameter.
The `time_zone` parameter will be applied to your input lower and upper bounds and will move them to UTC time based date:
[source,js]
--------------------------------------------------
{
"constant_score": {
"filter": {
"range" : {
"born" : {
"gte": "2012-01-01",
"lte": "now",
"time_zone": "+1:00"
}
}
}
}
}
{
"range" : {
"born" : {
"gte": "2012-01-01",
"lte": "now",
"time_zone": "+1:00"
}
}
}
--------------------------------------------------
In the above examples, `gte` will be actually moved to `2011-12-31T23:00:00` UTC date.
NOTE: if you give a date with a timezone explicitly defined and use the `time_zone` parameter, `time_zone` will be
ignored. For example, setting `from` to `2012-01-01T00:00:00+01:00` with `"time_zone":"+10:00"` will still use `+01:00` time zone.
Closes#3729.
Fields of type `token_count`, `murmur3`, `_all` and `_field_names` are generated only when indexing.
If a GET requests accesses the transaction log (because no refresh
between indexing and GET request) then these fields cannot be retrieved at all.
Before the behavior was so:
`_all, _field_names`: The field was siletly ignored
`murmur3, token_count`: `NumberFormatException` because GET tried to parse the values from the source.
In addition, if these fields were not stored, the same behavior occured if the fields were
retrieved with GET after a `refresh()` because here also the source was used to get the fields.
Now, GET accepts a parameter `ignore_errors_on_generated_fields` which has
the following effect:
- Throw exception with meaningful error message explaining the problem if set to false (default)
- Ignore the field if set to true
- Always ignore the field if it was not set to stored
This changes the behavior for `_all` and `_field_names` as now an Exception is thrown if a user
tries to GET them before a `refresh()`.
closes#6676closes#6973
Allow to set the value default to network.tcp.no_delay and network.tcp.keep_alive so they won't be set at all, since on solaris, setting tcpNoDelay can actually cause failure
relates to #7115
A multi-bucket aggregation where multiple filters can be defined (each filter defines a bucket). The buckets will collect all the documents that match their associated filter.
This aggregation can be very useful when one wants to compare analytics between different criterias. It can also be accomplished using multiple definitions of the single filter aggregation, but here, the user will only need to define the sub-aggregations only once.
Closes#6118
Implements a new Exists API allowing users to do fast exists check on any matched documents for a given query.
This API should be faster then using the Count API as it will:
- early terminate the search execution once any document is found to exist
- return the response as soon as the first shard reports matched documents
closes#6995
Index process fails when having `_timestamp` enabled and `path` option is set.
It fails with a `TimestampParsingException[failed to parse timestamp [null]]` message.
Reproduction:
```
DELETE test
PUT test
{
"mappings": {
"test": {
"_timestamp" : {
"enabled" : "yes",
"path" : "post_date"
}
}
}
}
PUT test/test/1
{
"foo": "bar"
}
```
You can define a default value for when timestamp is not provided
within the index request or in the `_source` document.
By default, the default value is `now` which means the date the document was processed by the indexing chain.
You can disable that default value by setting `default` to `null`. It means that `timestamp` is mandatory:
```
{
"tweet" : {
"_timestamp" : {
"enabled" : true,
"default" : null
}
}
}
```
If you don't provide any timestamp value, indexation will fail.
You can also set the default value to any date respecting timestamp format:
```
{
"tweet" : {
"_timestamp" : {
"enabled" : true,
"format" : "YYYY-MM-dd",
"default" : "1970-01-01"
}
}
}
```
If you don't provide any timestamp value, indexation will fail.
Closes#4718.
Closes#7036.
Adds a breaker for request BigArrays, which are used for parent/child
queries as well as some aggregations. Certain operations like Netty HTTP
responses and transport responses increment the breaker, but will not
trip.
This also changes the output of the nodes' stats endpoint to show the
parent breaker as well as the fielddata and request breakers.
There are a number of new settings for breakers now:
`indices.breaker.total.limit`: starting limit for all memory-use breaker,
defaults to 70%
`indices.breaker.fielddata.limit`: starting limit for fielddata breaker,
defaults to 60%
`indices.breaker.fielddata.overhead`: overhead for fielddata breaker
estimations, defaults to 1.03
(the fielddata breaker settings also use the backwards-compatible
setting `indices.fielddata.breaker.limit` and
`indices.fielddata.breaker.overhead`)
`indices.breaker.request.limit`: starting limit for request breaker,
defaults to 40%
`indices.breaker.request.overhead`: request breaker estimation overhead,
defaults to 1.0
The breaker service infrastructure is now generic and opens the path to
adding additional circuit breakers in the future.
Fixes#6129
Conflicts:
src/main/java/org/elasticsearch/index/fielddata/IndexFieldData.java
src/main/java/org/elasticsearch/index/fielddata/IndexFieldDataService.java
src/main/java/org/elasticsearch/index/fielddata/RamAccountingTermsEnum.java
src/main/java/org/elasticsearch/index/fielddata/ordinals/GlobalOrdinalsBuilder.java
src/main/java/org/elasticsearch/index/fielddata/ordinals/InternalGlobalOrdinalsBuilder.java
src/main/java/org/elasticsearch/index/fielddata/plain/AbstractIndexOrdinalsFieldData.java
src/main/java/org/elasticsearch/index/fielddata/plain/DisabledIndexFieldData.java
src/main/java/org/elasticsearch/index/fielddata/plain/IndexIndexFieldData.java
src/main/java/org/elasticsearch/index/fielddata/plain/NonEstimatingEstimator.java
src/main/java/org/elasticsearch/index/fielddata/plain/PackedArrayIndexFieldData.java
src/main/java/org/elasticsearch/index/fielddata/plain/ParentChildIndexFieldData.java
src/main/java/org/elasticsearch/index/fielddata/plain/SortedSetDVOrdinalsIndexFieldData.java
src/main/java/org/elasticsearch/node/internal/InternalNode.java
src/test/java/org/elasticsearch/index/aliases/IndexAliasesServiceTests.java
src/test/java/org/elasticsearch/index/codec/CodecTests.java
src/test/java/org/elasticsearch/index/fielddata/AbstractFieldDataTests.java
src/test/java/org/elasticsearch/index/fielddata/IndexFieldDataServiceTests.java
src/test/java/org/elasticsearch/index/mapper/MapperTestUtils.java
src/test/java/org/elasticsearch/index/query/IndexQueryParserFilterCachingTests.java
src/test/java/org/elasticsearch/index/query/SimpleIndexQueryParserTests.java
src/test/java/org/elasticsearch/index/query/guice/IndexQueryParserModuleTests.java
src/test/java/org/elasticsearch/index/search/FieldDataTermsFilterTests.java
src/test/java/org/elasticsearch/index/search/child/ChildrenConstantScoreQueryTests.java
src/test/java/org/elasticsearch/index/similarity/SimilarityTests.java
This is only applicable when the order is set to _count. The upper bound of the error in the doc count is calculated by summing the doc count of the last term on each shard which did not return the term. The implementation calculates the error by summing the doc count for the last term on each shard for which the term IS returned and then subtracts this value from the sum of the doc counts for the last term from ALL shards.
Closes#6696
This commit adds regular expression support for the allow-origin
header depending on the value of the request `Origin` header.
The existing HttpRequestBuilder is also extended to support the
OPTIONS HTTP method.
Relates #5601Closes#6891
This commit adds the ability to force blocking on the flush operaition
to make sure all files have been written and synced to disk. Without
this option a flush might be executing at the same time causing the
current flush to fail and return before all files being synced.
Closes#6996
Allow users to control document collection termination, if a specified terminate_after number is
set. Upon setting the newly added parameter, the response will include a boolean terminated_early
flag, indicating if the document collection for any shard terminated early.
closes#6876
This change just changes the default for index.codec.bloom.load to
false: with recent performance improvements to ID lookup, such as
#6298, bloom filters don't give much of a performance gain anymore,
and they can consume non-trivial RAM when there are many tiny
documents.
For now, we still index the bloom filters, so if a given app wants
them back, it can just update the index.codec.bloom.load to true.
Closes#6959
A new option `prune` has been added to allow users to control phrase suggestion pruning when `collate`
is set. If the new option is set, the phrase suggestion option will contain a boolean `collate_match`
indicating whether the respective result had hits in collation.
CLoses#6927
Adds the ability to the Term Vector API to generate term vectors for some
chosen fields, even though they haven't been explicitely stored in the index.
Relates to #5184Closes#6567
These are javascript expressions, which can only access numeric
fielddata, parameters, and _score. They can only be used for searches (not document updates).
closes#6818
The newly added collate option will let the user provide a template query/filter which will be executed for every phrase suggestions generated to ensure that the suggestion matches at least one document for the filter/query.
The user can also add routing preference `preference` to route the collate query/filter and additional `params` to inject into the collate template.
Closes#3482
The Hunspell service would throw a confusing error message if more than
one affix file was present. This commit distinguishes between the two
error cases: where there are no affix files and when there are too many
affix files.
Also implements lazy dictionary loading, which was used in the tests
but not implemented.
Closes#6850
This commit adds the infrastructure to allow pluging in different
measures for computing the significance of a term.
Significance measures can be provided externally by overriding
- SignificanceHeuristic
- SignificanceHeuristicBuilder
- SignificanceHeuristicParser
closes#6561
The `recovery_after_time` tells the gateway to wait before starting recovery from disk. The goal here is to allow for more nodes to join the cluster and thus not start potentially unneeded replications. The `expectedNodes` setting (and friends) tells the gateway when it can start recovering even if the `recover_after_time` has not yet elapsed. However, `expectedNodes` is useless if one doesn't set `recovery_after_time`. This commit changes that by setting a sensible default of 5m for `recover_after_time` *if* a `expectedNodes` setting is present.
Closes#6742
`mmapfs` is really good for random access but can have sideeffects if
memory maps are large depending on the operating system etc. A hybrid
solution where only selected files are actually memory mapped but others
mostly consumed sequentially brings the best of both worlds and
minimizes the memory map impact.
This commit mmaps only the `dvd` and `tim` file for fast random access
on docvalues and term dictionaries.
Closes#6636
Currently regexes in Pattern Tokenizer docs are escaped (it seems according to Java rules). I think it is better not to escape them because JSON escaping should be automatic in client libraries, and string escaping depends on a client language used. The default pattern is `\W+`, not `\\W+`.
Closes#6615
Add `irish` analyzer
Add `sorani` analyzer (Kurdish)
Add `classic` tokenizer: specific to english text and tries to recognize hostnames, companies, acronyms, etc.
Add `thai` tokenizer: segments thai text into words.
Add `classic` tokenfilter: cleans up acronyms and possessives from classic tokenizer
Add `apostrophe` tokenfilter: removes text after apostrophe and the apostrophe itself
Add `german_normalization` tokenfilter: umlaut/sharp S normalization
Add `hindi_normalization` tokenfilter: accounts for hindi spelling differences
Add `indic_normalization` tokenfilter: accounts for different unicode representations in Indian languages
Add `sorani_normalization` tokenfilter: normalizes kurdish text
Add `scandinavian_normalization` tokenfilter: normalizes Norwegian, Danish, Swedish text
Add `scandinavian_folding` tokenfilter: much more aggressive form of `scandinavian_normalization`
Add additional languages to stemmer tokenfilter: `galician`, `minimal_galician`, `irish`, `sorani`, `light_nynorsk`, `minimal_nynorsk`
Add support access to default Thai stopword set "_thai_"
Fix some bugs and broken links in documentation.
Closes#5935
For the casual reader, the reference to "term queries" may be glossed over, yielding an unexpected result when using `regexp` queries.
This attempts to make that distinction more prominent.
Closes#6698
If the match query with cutoff_frequency encounters stacked tokens,
like synonyms in the same position, it returns a boolean query instead
of a common terms query. However, if the original operator was set
to "and", it was ignoring that and resetting the operator to "or".
In fact, if operator is "and" then there is little benefit in using
a common terms query as a must query is already
executed efficiently.
The `exists` and `missing` filters need to merge postings lists of all existing
terms, which can be very costly, especially on high-cardinality fields. This
commit indexes the field names of a document under `_field_names` and reuses it
to speed up the `exists` and `missing` filters.
This is only enabled for indices that are created on or after Elasticsearch
1.3.0.
Close#5659
Added the http.jsonp.enable option to configure disabling of JSONP responses, as those
might pose a security risk, and can be disabled if unused.
This also fixes bugs in NettyHttpChannel
* JSONP responses were never setting application/javascript as the content-type
* The content-type and content-length headers were being overwritten even if they were set before
Closes#6164
Percentile Rank Aggregation is the reverse of the Percetiles aggregation. It determines the percentile rank (the proportion of values less than a given value) of the provided array of values.
Closes#6386
When a node sends a join request to the master, only send back the response after it has been added to the master cluster state and published.
This will fix the rare cases where today, a join request can return, and the master, since its under load, have not yet added the node to its cluster state, and the node that joined will start a fault detect against the master, failing since its not part of the cluster state.
Since now the join request is longer, also increase the join request timeout default.
closes#6480
* `english` returned the slow snowball English stemmer
* `porter2` returned the snowball Porter stemmer (v1)
* `portuguese` was used twice, preventing the second version from working
Changes:
* `english` now returns the fast PorterStemmer (for indices created from v1.3.0 onwards)
* `porter2` now returns the snowball English stemmer (for indices created from v1.3.0 onwards)
* `light_english` now returns the `kstem` stemmer (`kstem` still works)
* `portuguese_rslp` returns the PortugueseStemmer
* `dutch_kp` is a synonym for `kp`
Tests and docs updated
Fixes#6345Fixes#6213Fixes#6330
Previously if the user provided a non-conforming string, it would blow up with
`java.lang.StringIndexOutOfBoundsException: String index out of range: -1`
which is not a *helpful* error message.
Also updated the documentation to make the possible setting values more clear.
Close#5752
A new "breadth_first" results collection mode allows upper branches of aggregation tree to be calculated and then pruned
to a smaller selection before advancing into executing collection on child branches.
Closes#6128
The existing Note about the shorthand suggest syntax was poorly worded and confusing. Please check whether the way I've phrased it now is still correct as to what the shorthand form actually does and doesn't do: the original wording did not provide me enough information to be sure.
Thanks!
The GeoBounds Aggregation is a new single bucket aggregation which outputs the coordinates of a bounding box containing all the points from all the documents passed to the aggregation as well as the doc count. Geobound Aggregation also use a wrap_logitude parameter which specifies whether the resulting bounding box is permitted to overlap the international date line. This option defaults to true.
This aggregation introduces the idea of MetricsAggregation which do not return double values and cannot be used for sorting. The existing MetricsAggregation has been renamed to NumericMetricsAggregation and is a subclass of MetricsAggregation. MetricsAggregations do not store doc counts and do not support child aggregations.
Closes#5634
Added support for min_children and max_children parameters to
the has_child query and filter. A parent document will only
be considered if a match if the number of matching children
fall between the min/max bounds.
Closes#6019
Using ping.timeout, which defaults to 3s, to use as a timeout value on the join request a node makes to the master once its discovered can be too small, specifically when there is a large cluster state involved (and by definition, all the buffers and such on the nio layer will be "cold"). Introduce a dedicated join.timeout setting, that by default is 10x the ping.timeout (so 30s by default).
closes#6342
Because json objects are unordered this also adds an explicit order syntax
that looks like
"highlight": {
"fields": [
{"title":{ /*params*/ }},
{"text":{ /*params*/ }}
]
}
This is not useful for any of the builtin highlighters but will be useful
in plugins.
Closes#4649
This change adds a new cluster state that waits for the replication of a shard to finish before starting snapshotting process. Because this change adds a new snapshot state, an pre-1.2.0 nodes will not be able to join the 1.2.0 cluster that is currently running snapshot/restore operation.
Closes#5531
This commit upgrades to the latest Lucene 4.8.1 release including the
following bugfixes:
* An IndexThrottle now kicks in when merges start falling behind
limiting index threads to 1 until merges caught up. Closes#6066
* RateLimiter now kicks in at the configured rate where previously
the limiter was limiting at ~8MB/sec almost all the time. Closes#6018
This is an update for the _cat/recovery API documentation. The examples
have been updated. Removed the bottom paragraph explaining why there
could be values > 100%. This can no longer happen so that had to be
removed.
Closes#6159
The syntax to specify one or more items is the same as for the Multi GET API.
If only one document is specified, the results returned are the same as when
using the More Like This API.
Relates #4075Closes#5857
Until now all version types have officially required the version to be a positive long number. Despite of this has being documented, ES versions <=1.0 did not enforce it when using the `external` version type. As a result people have succesfully indexed documents with 0 as a version. In 1.1. we introduced validation checks on incoming version values and causing indexing request to fail if the version was set to 0. While this is strictly speaking OK, we effectively have a situation where data already indexed does not match the version invariant.
To be lenient and adhere to spirit of our data backward compatibility policy, we have decided to allow 0 as a valid external version type. This is somewhat complicated as 0 is also the internal value of `MATCH_ANY`, which indicates requests should succeed regardles off the current doc version. To keep things simple, this commit changes the internal value of `MATCH_ANY` to `-3` for all version types.
Since we're doing this in a minor release (and because versions are stored in the transaction log), the default `internal` version type still accepts 0 as a `MATCH_ANY` value. This is not a problem for other version types as `MATCH_ANY` doesn't make sense in that context.
Closes#5662
* If plugin does not provide `lucene` property, we consider that the plugin is compatible.
* If plugin provides `lucene` property, we try to load related Enum org.apache.lucene.util.Version. If this fails, it means that the node is too "old" comparing to the Lucene version the plugin was built for.
* We compare then two first digits of current node lucene version against two first digits of plugin Lucene version. If not equal, it means that the plugin is too "old" for the current node.
Plugin developers who wants to launch plugin check only have to add a `lucene` property in `es-plugin.properties` file. If you are using maven to build your plugin, you can do it like this:
In `pom.xml`:
```xml
<properties>
<lucene.version>4.6.0</lucene.version>
</properties>
<build>
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>true</filtering>
</resource>
</resources>
</build>
```
In `es-plugin.properties`, add:
```properties
lucene=${lucene.version}
```
BTW, if you don't already have it, you can add the plugin version as well:
```properties
version=${project.version}
```
You can disable that check using `plugins.check_lucene: false`.
Our improvements to t-digest have been pushed upstream and t-digest also got
some additional nice improvements around memory usage and speedups of quantile
estimation. So it makes sense to use it as a dependency now.
This also allows to remove the test dependency on Apache Mahout.
Close#6142
It's dangerous to expose SerialMergeScheduler as an option: since it only allows one merge at a time, it can easily cause merging to fall behind.
Closes#6120
By default More Like This API excludes the queried document from the response.
However, when debugging or when comparing scores across different queries, it
could be useful to have the best possible matched hit. So this option lets users
explicitly specify the desired behavior.
Closes#6067
In the Google Groups forum there appears to be some confusion as to what mlt
does. This documentation update should hopefully help demystifying this
feature, and provide some understanding as to how to use its parameters.
Closes#6092
- Randomized integration tests for the benchmark API.
- Negative tests for cases where the cluster cannot run benchmarks.
- Return 404 on missing benchmark name.
- Allow to specify 'types' as an array in the JSON syntax when describing a benchmark competition.
- Don't record slowest for single-request competitions.
Closes#6003, #5906, #5903, #5904
Adds a table with the exhaustive list of all available headers with a brief description (mostly from `org.elasticsearch.rest.action.cat.RestNodesAction`) so that people do not need to go searching for them in the code like I did, or search through `nodes?help`.
Significant terms internally maintain a priority queue per shard with a size potentially
lower than the number of terms. This queue uses the score as criterion to determine if
a bucket is kept or not. If many terms with low subsetDF score very high
but the `min_doc_count` is set high, this might result in no terms being
returned because the pq is filled with low frequent terms which are all sorted
out in the end.
This can be avoided by increasing the `shard_size` parameter to a higher value.
However, it is not immediately clear to which value this parameter must be set
because we can not know how many terms with low frequency are scored higher that
the high frequent terms that we are actually interested in.
On the other hand, if there is no routing of docs to shards involved, we can maybe
assume that the documents of classes and also the terms therein are distributed evenly
across shards. In that case it might be easier to not add documents to the pq that have
subsetDF <= `shard_min_doc_count` which can be set to something like
`min_doc_count`/number of shards because we would assume that even when summing up
the subsetDF across shards `min_doc_count` will not be reached.
closes#5998closes#6041
Update `geo-shape-type.asciidoc` to include all `GeoShapeType`s supported by the `org.elasticsearch.common.geo.builders.ShapeBuilder`.
Changes include:
1. A tabular mapping of GeoJSON types to Elasticsearch types
2. Listing all types, with brief examples, for all support Elasticsearch types
3. Putting non-standard types to the bottom (really just moving Envelope to the bottom)
4. Linking to all GeoJSON types.
5. Adding whitespace around tightly nested arrays (particularly `multipolygon`) for readability
The possibility of filtering for index templates in the cluster state API
had been introduced before there was a dedicated index templates API. This
commit removes this support from the cluster state API, as it was not really
clean, requiring you to specify the metadata and the index templates.
Closes#4954