This documentation was dangerous because it felt like it was possible to gain
substantial performance by just switching the codec of the index.
However, non-default codecs are dangerous to use since they are not supported
in terms of backward compatibility, and most improvements that they bring have
been folded into the default codec anyway (for example, the default codec
"pulses" postings lists that contain a single document).
In the case of inserts the UpdateHelper class will now allow the script used to apply updates to run on the upsert doc provided by clients. This allows the logic for managing the internal state of the data item to be managed by the script and is not reliant on clients performing the initialisation of data structures managed by the script.
Closes#7143
This adds support to return the "Access-Control-Allow-Credentials" header
if needed, so CORS will work flawlessly with authenticated applications.
Closes#6380
AND and OR filter docs talk about different targets for the operators. I
believe that both should be described in terms of modifying other 'filters'.
I also added articles for easier (human) parsing. This fixes#4762Closes#7165
Filters and Queries now supports `time_zone` parameter which defines which time zone should be applied to the query or filter to convert it to UTC time based value.
When applied on `date` fields the `range` filter and queries accept also a `time_zone` parameter.
The `time_zone` parameter will be applied to your input lower and upper bounds and will move them to UTC time based date:
[source,js]
--------------------------------------------------
{
"constant_score": {
"filter": {
"range" : {
"born" : {
"gte": "2012-01-01",
"lte": "now",
"time_zone": "+1:00"
}
}
}
}
}
{
"range" : {
"born" : {
"gte": "2012-01-01",
"lte": "now",
"time_zone": "+1:00"
}
}
}
--------------------------------------------------
In the above examples, `gte` will be actually moved to `2011-12-31T23:00:00` UTC date.
NOTE: if you give a date with a timezone explicitly defined and use the `time_zone` parameter, `time_zone` will be
ignored. For example, setting `from` to `2012-01-01T00:00:00+01:00` with `"time_zone":"+10:00"` will still use `+01:00` time zone.
Closes#3729.
Fields of type `token_count`, `murmur3`, `_all` and `_field_names` are generated only when indexing.
If a GET requests accesses the transaction log (because no refresh
between indexing and GET request) then these fields cannot be retrieved at all.
Before the behavior was so:
`_all, _field_names`: The field was siletly ignored
`murmur3, token_count`: `NumberFormatException` because GET tried to parse the values from the source.
In addition, if these fields were not stored, the same behavior occured if the fields were
retrieved with GET after a `refresh()` because here also the source was used to get the fields.
Now, GET accepts a parameter `ignore_errors_on_generated_fields` which has
the following effect:
- Throw exception with meaningful error message explaining the problem if set to false (default)
- Ignore the field if set to true
- Always ignore the field if it was not set to stored
This changes the behavior for `_all` and `_field_names` as now an Exception is thrown if a user
tries to GET them before a `refresh()`.
closes#6676closes#6973
Allow to set the value default to network.tcp.no_delay and network.tcp.keep_alive so they won't be set at all, since on solaris, setting tcpNoDelay can actually cause failure
relates to #7115
A multi-bucket aggregation where multiple filters can be defined (each filter defines a bucket). The buckets will collect all the documents that match their associated filter.
This aggregation can be very useful when one wants to compare analytics between different criterias. It can also be accomplished using multiple definitions of the single filter aggregation, but here, the user will only need to define the sub-aggregations only once.
Closes#6118
Implements a new Exists API allowing users to do fast exists check on any matched documents for a given query.
This API should be faster then using the Count API as it will:
- early terminate the search execution once any document is found to exist
- return the response as soon as the first shard reports matched documents
closes#6995
Index process fails when having `_timestamp` enabled and `path` option is set.
It fails with a `TimestampParsingException[failed to parse timestamp [null]]` message.
Reproduction:
```
DELETE test
PUT test
{
"mappings": {
"test": {
"_timestamp" : {
"enabled" : "yes",
"path" : "post_date"
}
}
}
}
PUT test/test/1
{
"foo": "bar"
}
```
You can define a default value for when timestamp is not provided
within the index request or in the `_source` document.
By default, the default value is `now` which means the date the document was processed by the indexing chain.
You can disable that default value by setting `default` to `null`. It means that `timestamp` is mandatory:
```
{
"tweet" : {
"_timestamp" : {
"enabled" : true,
"default" : null
}
}
}
```
If you don't provide any timestamp value, indexation will fail.
You can also set the default value to any date respecting timestamp format:
```
{
"tweet" : {
"_timestamp" : {
"enabled" : true,
"format" : "YYYY-MM-dd",
"default" : "1970-01-01"
}
}
}
```
If you don't provide any timestamp value, indexation will fail.
Closes#4718.
Closes#7036.
Adds a breaker for request BigArrays, which are used for parent/child
queries as well as some aggregations. Certain operations like Netty HTTP
responses and transport responses increment the breaker, but will not
trip.
This also changes the output of the nodes' stats endpoint to show the
parent breaker as well as the fielddata and request breakers.
There are a number of new settings for breakers now:
`indices.breaker.total.limit`: starting limit for all memory-use breaker,
defaults to 70%
`indices.breaker.fielddata.limit`: starting limit for fielddata breaker,
defaults to 60%
`indices.breaker.fielddata.overhead`: overhead for fielddata breaker
estimations, defaults to 1.03
(the fielddata breaker settings also use the backwards-compatible
setting `indices.fielddata.breaker.limit` and
`indices.fielddata.breaker.overhead`)
`indices.breaker.request.limit`: starting limit for request breaker,
defaults to 40%
`indices.breaker.request.overhead`: request breaker estimation overhead,
defaults to 1.0
The breaker service infrastructure is now generic and opens the path to
adding additional circuit breakers in the future.
Fixes#6129
Conflicts:
src/main/java/org/elasticsearch/index/fielddata/IndexFieldData.java
src/main/java/org/elasticsearch/index/fielddata/IndexFieldDataService.java
src/main/java/org/elasticsearch/index/fielddata/RamAccountingTermsEnum.java
src/main/java/org/elasticsearch/index/fielddata/ordinals/GlobalOrdinalsBuilder.java
src/main/java/org/elasticsearch/index/fielddata/ordinals/InternalGlobalOrdinalsBuilder.java
src/main/java/org/elasticsearch/index/fielddata/plain/AbstractIndexOrdinalsFieldData.java
src/main/java/org/elasticsearch/index/fielddata/plain/DisabledIndexFieldData.java
src/main/java/org/elasticsearch/index/fielddata/plain/IndexIndexFieldData.java
src/main/java/org/elasticsearch/index/fielddata/plain/NonEstimatingEstimator.java
src/main/java/org/elasticsearch/index/fielddata/plain/PackedArrayIndexFieldData.java
src/main/java/org/elasticsearch/index/fielddata/plain/ParentChildIndexFieldData.java
src/main/java/org/elasticsearch/index/fielddata/plain/SortedSetDVOrdinalsIndexFieldData.java
src/main/java/org/elasticsearch/node/internal/InternalNode.java
src/test/java/org/elasticsearch/index/aliases/IndexAliasesServiceTests.java
src/test/java/org/elasticsearch/index/codec/CodecTests.java
src/test/java/org/elasticsearch/index/fielddata/AbstractFieldDataTests.java
src/test/java/org/elasticsearch/index/fielddata/IndexFieldDataServiceTests.java
src/test/java/org/elasticsearch/index/mapper/MapperTestUtils.java
src/test/java/org/elasticsearch/index/query/IndexQueryParserFilterCachingTests.java
src/test/java/org/elasticsearch/index/query/SimpleIndexQueryParserTests.java
src/test/java/org/elasticsearch/index/query/guice/IndexQueryParserModuleTests.java
src/test/java/org/elasticsearch/index/search/FieldDataTermsFilterTests.java
src/test/java/org/elasticsearch/index/search/child/ChildrenConstantScoreQueryTests.java
src/test/java/org/elasticsearch/index/similarity/SimilarityTests.java
This is only applicable when the order is set to _count. The upper bound of the error in the doc count is calculated by summing the doc count of the last term on each shard which did not return the term. The implementation calculates the error by summing the doc count for the last term on each shard for which the term IS returned and then subtracts this value from the sum of the doc counts for the last term from ALL shards.
Closes#6696
This commit adds regular expression support for the allow-origin
header depending on the value of the request `Origin` header.
The existing HttpRequestBuilder is also extended to support the
OPTIONS HTTP method.
Relates #5601Closes#6891
This commit adds the ability to force blocking on the flush operaition
to make sure all files have been written and synced to disk. Without
this option a flush might be executing at the same time causing the
current flush to fail and return before all files being synced.
Closes#6996
Allow users to control document collection termination, if a specified terminate_after number is
set. Upon setting the newly added parameter, the response will include a boolean terminated_early
flag, indicating if the document collection for any shard terminated early.
closes#6876
This change just changes the default for index.codec.bloom.load to
false: with recent performance improvements to ID lookup, such as
#6298, bloom filters don't give much of a performance gain anymore,
and they can consume non-trivial RAM when there are many tiny
documents.
For now, we still index the bloom filters, so if a given app wants
them back, it can just update the index.codec.bloom.load to true.
Closes#6959
A new option `prune` has been added to allow users to control phrase suggestion pruning when `collate`
is set. If the new option is set, the phrase suggestion option will contain a boolean `collate_match`
indicating whether the respective result had hits in collation.
CLoses#6927
Adds the ability to the Term Vector API to generate term vectors for some
chosen fields, even though they haven't been explicitely stored in the index.
Relates to #5184Closes#6567
These are javascript expressions, which can only access numeric
fielddata, parameters, and _score. They can only be used for searches (not document updates).
closes#6818
The newly added collate option will let the user provide a template query/filter which will be executed for every phrase suggestions generated to ensure that the suggestion matches at least one document for the filter/query.
The user can also add routing preference `preference` to route the collate query/filter and additional `params` to inject into the collate template.
Closes#3482
The Hunspell service would throw a confusing error message if more than
one affix file was present. This commit distinguishes between the two
error cases: where there are no affix files and when there are too many
affix files.
Also implements lazy dictionary loading, which was used in the tests
but not implemented.
Closes#6850
This commit adds the infrastructure to allow pluging in different
measures for computing the significance of a term.
Significance measures can be provided externally by overriding
- SignificanceHeuristic
- SignificanceHeuristicBuilder
- SignificanceHeuristicParser
closes#6561
The `recovery_after_time` tells the gateway to wait before starting recovery from disk. The goal here is to allow for more nodes to join the cluster and thus not start potentially unneeded replications. The `expectedNodes` setting (and friends) tells the gateway when it can start recovering even if the `recover_after_time` has not yet elapsed. However, `expectedNodes` is useless if one doesn't set `recovery_after_time`. This commit changes that by setting a sensible default of 5m for `recover_after_time` *if* a `expectedNodes` setting is present.
Closes#6742
`mmapfs` is really good for random access but can have sideeffects if
memory maps are large depending on the operating system etc. A hybrid
solution where only selected files are actually memory mapped but others
mostly consumed sequentially brings the best of both worlds and
minimizes the memory map impact.
This commit mmaps only the `dvd` and `tim` file for fast random access
on docvalues and term dictionaries.
Closes#6636
Currently regexes in Pattern Tokenizer docs are escaped (it seems according to Java rules). I think it is better not to escape them because JSON escaping should be automatic in client libraries, and string escaping depends on a client language used. The default pattern is `\W+`, not `\\W+`.
Closes#6615
Add `irish` analyzer
Add `sorani` analyzer (Kurdish)
Add `classic` tokenizer: specific to english text and tries to recognize hostnames, companies, acronyms, etc.
Add `thai` tokenizer: segments thai text into words.
Add `classic` tokenfilter: cleans up acronyms and possessives from classic tokenizer
Add `apostrophe` tokenfilter: removes text after apostrophe and the apostrophe itself
Add `german_normalization` tokenfilter: umlaut/sharp S normalization
Add `hindi_normalization` tokenfilter: accounts for hindi spelling differences
Add `indic_normalization` tokenfilter: accounts for different unicode representations in Indian languages
Add `sorani_normalization` tokenfilter: normalizes kurdish text
Add `scandinavian_normalization` tokenfilter: normalizes Norwegian, Danish, Swedish text
Add `scandinavian_folding` tokenfilter: much more aggressive form of `scandinavian_normalization`
Add additional languages to stemmer tokenfilter: `galician`, `minimal_galician`, `irish`, `sorani`, `light_nynorsk`, `minimal_nynorsk`
Add support access to default Thai stopword set "_thai_"
Fix some bugs and broken links in documentation.
Closes#5935
For the casual reader, the reference to "term queries" may be glossed over, yielding an unexpected result when using `regexp` queries.
This attempts to make that distinction more prominent.
Closes#6698
If the match query with cutoff_frequency encounters stacked tokens,
like synonyms in the same position, it returns a boolean query instead
of a common terms query. However, if the original operator was set
to "and", it was ignoring that and resetting the operator to "or".
In fact, if operator is "and" then there is little benefit in using
a common terms query as a must query is already
executed efficiently.
The `exists` and `missing` filters need to merge postings lists of all existing
terms, which can be very costly, especially on high-cardinality fields. This
commit indexes the field names of a document under `_field_names` and reuses it
to speed up the `exists` and `missing` filters.
This is only enabled for indices that are created on or after Elasticsearch
1.3.0.
Close#5659
Added the http.jsonp.enable option to configure disabling of JSONP responses, as those
might pose a security risk, and can be disabled if unused.
This also fixes bugs in NettyHttpChannel
* JSONP responses were never setting application/javascript as the content-type
* The content-type and content-length headers were being overwritten even if they were set before
Closes#6164
Percentile Rank Aggregation is the reverse of the Percetiles aggregation. It determines the percentile rank (the proportion of values less than a given value) of the provided array of values.
Closes#6386
When a node sends a join request to the master, only send back the response after it has been added to the master cluster state and published.
This will fix the rare cases where today, a join request can return, and the master, since its under load, have not yet added the node to its cluster state, and the node that joined will start a fault detect against the master, failing since its not part of the cluster state.
Since now the join request is longer, also increase the join request timeout default.
closes#6480
* `english` returned the slow snowball English stemmer
* `porter2` returned the snowball Porter stemmer (v1)
* `portuguese` was used twice, preventing the second version from working
Changes:
* `english` now returns the fast PorterStemmer (for indices created from v1.3.0 onwards)
* `porter2` now returns the snowball English stemmer (for indices created from v1.3.0 onwards)
* `light_english` now returns the `kstem` stemmer (`kstem` still works)
* `portuguese_rslp` returns the PortugueseStemmer
* `dutch_kp` is a synonym for `kp`
Tests and docs updated
Fixes#6345Fixes#6213Fixes#6330
Previously if the user provided a non-conforming string, it would blow up with
`java.lang.StringIndexOutOfBoundsException: String index out of range: -1`
which is not a *helpful* error message.
Also updated the documentation to make the possible setting values more clear.
Close#5752
A new "breadth_first" results collection mode allows upper branches of aggregation tree to be calculated and then pruned
to a smaller selection before advancing into executing collection on child branches.
Closes#6128
The existing Note about the shorthand suggest syntax was poorly worded and confusing. Please check whether the way I've phrased it now is still correct as to what the shorthand form actually does and doesn't do: the original wording did not provide me enough information to be sure.
Thanks!
The GeoBounds Aggregation is a new single bucket aggregation which outputs the coordinates of a bounding box containing all the points from all the documents passed to the aggregation as well as the doc count. Geobound Aggregation also use a wrap_logitude parameter which specifies whether the resulting bounding box is permitted to overlap the international date line. This option defaults to true.
This aggregation introduces the idea of MetricsAggregation which do not return double values and cannot be used for sorting. The existing MetricsAggregation has been renamed to NumericMetricsAggregation and is a subclass of MetricsAggregation. MetricsAggregations do not store doc counts and do not support child aggregations.
Closes#5634
Added support for min_children and max_children parameters to
the has_child query and filter. A parent document will only
be considered if a match if the number of matching children
fall between the min/max bounds.
Closes#6019
Using ping.timeout, which defaults to 3s, to use as a timeout value on the join request a node makes to the master once its discovered can be too small, specifically when there is a large cluster state involved (and by definition, all the buffers and such on the nio layer will be "cold"). Introduce a dedicated join.timeout setting, that by default is 10x the ping.timeout (so 30s by default).
closes#6342
Because json objects are unordered this also adds an explicit order syntax
that looks like
"highlight": {
"fields": [
{"title":{ /*params*/ }},
{"text":{ /*params*/ }}
]
}
This is not useful for any of the builtin highlighters but will be useful
in plugins.
Closes#4649
This change adds a new cluster state that waits for the replication of a shard to finish before starting snapshotting process. Because this change adds a new snapshot state, an pre-1.2.0 nodes will not be able to join the 1.2.0 cluster that is currently running snapshot/restore operation.
Closes#5531
This commit upgrades to the latest Lucene 4.8.1 release including the
following bugfixes:
* An IndexThrottle now kicks in when merges start falling behind
limiting index threads to 1 until merges caught up. Closes#6066
* RateLimiter now kicks in at the configured rate where previously
the limiter was limiting at ~8MB/sec almost all the time. Closes#6018
This is an update for the _cat/recovery API documentation. The examples
have been updated. Removed the bottom paragraph explaining why there
could be values > 100%. This can no longer happen so that had to be
removed.
Closes#6159
The syntax to specify one or more items is the same as for the Multi GET API.
If only one document is specified, the results returned are the same as when
using the More Like This API.
Relates #4075Closes#5857
Until now all version types have officially required the version to be a positive long number. Despite of this has being documented, ES versions <=1.0 did not enforce it when using the `external` version type. As a result people have succesfully indexed documents with 0 as a version. In 1.1. we introduced validation checks on incoming version values and causing indexing request to fail if the version was set to 0. While this is strictly speaking OK, we effectively have a situation where data already indexed does not match the version invariant.
To be lenient and adhere to spirit of our data backward compatibility policy, we have decided to allow 0 as a valid external version type. This is somewhat complicated as 0 is also the internal value of `MATCH_ANY`, which indicates requests should succeed regardles off the current doc version. To keep things simple, this commit changes the internal value of `MATCH_ANY` to `-3` for all version types.
Since we're doing this in a minor release (and because versions are stored in the transaction log), the default `internal` version type still accepts 0 as a `MATCH_ANY` value. This is not a problem for other version types as `MATCH_ANY` doesn't make sense in that context.
Closes#5662
* If plugin does not provide `lucene` property, we consider that the plugin is compatible.
* If plugin provides `lucene` property, we try to load related Enum org.apache.lucene.util.Version. If this fails, it means that the node is too "old" comparing to the Lucene version the plugin was built for.
* We compare then two first digits of current node lucene version against two first digits of plugin Lucene version. If not equal, it means that the plugin is too "old" for the current node.
Plugin developers who wants to launch plugin check only have to add a `lucene` property in `es-plugin.properties` file. If you are using maven to build your plugin, you can do it like this:
In `pom.xml`:
```xml
<properties>
<lucene.version>4.6.0</lucene.version>
</properties>
<build>
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>true</filtering>
</resource>
</resources>
</build>
```
In `es-plugin.properties`, add:
```properties
lucene=${lucene.version}
```
BTW, if you don't already have it, you can add the plugin version as well:
```properties
version=${project.version}
```
You can disable that check using `plugins.check_lucene: false`.
Our improvements to t-digest have been pushed upstream and t-digest also got
some additional nice improvements around memory usage and speedups of quantile
estimation. So it makes sense to use it as a dependency now.
This also allows to remove the test dependency on Apache Mahout.
Close#6142
It's dangerous to expose SerialMergeScheduler as an option: since it only allows one merge at a time, it can easily cause merging to fall behind.
Closes#6120
By default More Like This API excludes the queried document from the response.
However, when debugging or when comparing scores across different queries, it
could be useful to have the best possible matched hit. So this option lets users
explicitly specify the desired behavior.
Closes#6067
In the Google Groups forum there appears to be some confusion as to what mlt
does. This documentation update should hopefully help demystifying this
feature, and provide some understanding as to how to use its parameters.
Closes#6092
- Randomized integration tests for the benchmark API.
- Negative tests for cases where the cluster cannot run benchmarks.
- Return 404 on missing benchmark name.
- Allow to specify 'types' as an array in the JSON syntax when describing a benchmark competition.
- Don't record slowest for single-request competitions.
Closes#6003, #5906, #5903, #5904
Adds a table with the exhaustive list of all available headers with a brief description (mostly from `org.elasticsearch.rest.action.cat.RestNodesAction`) so that people do not need to go searching for them in the code like I did, or search through `nodes?help`.
Significant terms internally maintain a priority queue per shard with a size potentially
lower than the number of terms. This queue uses the score as criterion to determine if
a bucket is kept or not. If many terms with low subsetDF score very high
but the `min_doc_count` is set high, this might result in no terms being
returned because the pq is filled with low frequent terms which are all sorted
out in the end.
This can be avoided by increasing the `shard_size` parameter to a higher value.
However, it is not immediately clear to which value this parameter must be set
because we can not know how many terms with low frequency are scored higher that
the high frequent terms that we are actually interested in.
On the other hand, if there is no routing of docs to shards involved, we can maybe
assume that the documents of classes and also the terms therein are distributed evenly
across shards. In that case it might be easier to not add documents to the pq that have
subsetDF <= `shard_min_doc_count` which can be set to something like
`min_doc_count`/number of shards because we would assume that even when summing up
the subsetDF across shards `min_doc_count` will not be reached.
closes#5998closes#6041
Update `geo-shape-type.asciidoc` to include all `GeoShapeType`s supported by the `org.elasticsearch.common.geo.builders.ShapeBuilder`.
Changes include:
1. A tabular mapping of GeoJSON types to Elasticsearch types
2. Listing all types, with brief examples, for all support Elasticsearch types
3. Putting non-standard types to the bottom (really just moving Envelope to the bottom)
4. Linking to all GeoJSON types.
5. Adding whitespace around tightly nested arrays (particularly `multipolygon`) for readability
The possibility of filtering for index templates in the cluster state API
had been introduced before there was a dedicated index templates API. This
commit removes this support from the cluster state API, as it was not really
clean, requiring you to specify the metadata and the index templates.
Closes#4954
A boost terms factor of 1.0 is not the same as no boosting of terms.
The desired behavior is to deactivate boosting by default. If the user
specifies any value other than 0, then boosting is activated.
Closes#6021
Updating to this version allows to configure a special JNA directory,
in case the /tmp directory is mounted with the noexec option, as JNA
extracts some data and tries to execute parts of it.
Also updated documentation to clarify mlockall and memory settings as well
as pointing to the new jna.tmpdir system property.
Closes#5493
Decay functions currently only use the first value in a field that contains
multiple values to compute the distance to the origin. Instead, it should
consider all distances if more values are in the field and then use
one of min/max/sum/avg which is defined by the user.
Relates to #3960closes#5940
Separate version check logic for reads and writes for all version types, which allows different behavior in these cases.
Change `VersionType.EXTERNAL` & `VersionType.EXTERNAL_GTE` to behave the same as `VersionType.INTERNAL` for read operations.
The previous behavior was fit for writes but is useless in reads.
This commit also makes the usage of `EXTERNAL` & `EXTERNAL_GTE` in the update api raise a validation error as it make cause data to
be lost.
Closes#5663 , Closes#5661, Closes#5929
The current setting of 20MB/sec seems to be too conservative given
the capabilities of modern hardware / network throughput.
A 50MB default should provide better out of the box performance.
Change the default numeric precision_step to 16 for 64-bit types,
8 for 32-bit and 16-bit types. Disable precision_step for the 8-bit
byte type.
Closes#5905
The current setting of 20MB/sec seems to be too conservative given
the capabilities of modern hardware. Even on cloud infrastructure this
seems to be too lowish. A 50MB default should provide better out of the box
performance
Currently we use 5k operations as a flush threshold. Indexing 5k documents
per second is rather common which would cause the index to be committed on
the lucene level each time the flush logic runs which is 5 seconds by default.
We should rather use a size based threshold similar to the lucene index writer
that doesn't cause such agressive commits which can slow down indexing significantly
especially since they cause the underlying devices to fsync their data.
Load tests showed that SerialMS has problems to keep up with
the merges under high load. We should switch back to CMS
until we have a better story to balance merge
threads / efforts across shards on a single node.
Closes#5817
Add an API endpoint at /_bench for submitting, listing, and aborting
search benchmarks. This API can be used for timing search requests,
subject to various user-defined settings.
Benchmark results provide summary and detailed statistics on such
values as min, max, and mean time. Values are reported per-node so that
it is easy to spot outliers. Slow requests are also reported.
Long running benchmarks can be viewed with a GET request, or aborted
with a POST request.
Benchmark results are optionally stored in an index for subsequent
analysis.
Closes#5407
The default precision was way too exact and could lead people to
think that geo context suggestions are not working. This patch now
requires you to set the precision in the mapping, as elasticsearch itself
can never tell exactly, what the required precision for the users
suggestions are.
Closes#5621
The `field_value_factor` function uses the value of a field in the
document to influence the score.
A query that looks like:
{
"query": {
"function_score": {
"query": {"match": { "body": "foo" }},
"functions": [
{
"field_value_factor": {
"field": "popularity",
"factor": 1.1,
"modifier": "square"
}
}
],
"score_mode": "max",
"boost_mode": "sum"
}
}
}
Would have the score modified by:
square(1.1 * doc['popularity'].value)
Closes#5519
allow to configure on the index level which blocks can optionally be applied using tribe.blocks.indices prefix settings.
allow to control what will be done when a conflict is detected on index names coming from several clusters using the tribe.on_conflict setting. Defaults remains "any", but now support also "drop" and "prefer_[tribeName]".
closes#5501
Adds a new API endpoint at /_recovery as well as to the Java API. The
recovery API allows one to see the recovery status of all shards in the
cluster. It will report on percent complete, recovery type, and which
files are copied.
Closes#4637
By default the date_/histogram returns all the buckets within the range of the data itself, that is, the documents with the smallest values (on which with histogram) will determine the min bucket (the bucket with the smallest key) and the documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when when requesting empty buckets (min_doc_count : 0), this causes a confusion, specifically, when the data is also filtered.
To understand why, let's look at an example:
Lets say the you're filtering your request to get all docs from the last month, and in the date_histogram aggs you'd like to slice the data per day. You also specify min_doc_count:0 so that you'd still get empty buckets for those days to which no document belongs. By default, if the first document that fall in this last month also happen to fall on the first day of the **second week** of the month, the date_histogram will **not** return empty buckets for all those days prior to that second week. The reason for that is that by default the histogram aggregations only start building buckets when they encounter documents (hence, missing on all the days of the first week in our example).
With extended_bounds, you now can "force" the histogram aggregations to start building buckets on a specific min values and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if the min_doc_count is greater than 0).
Note that (as the name suggest) extended_bounds is **not** filtering buckets. Meaning, if the min bounds is higher than the values extracted from the documents, the documents will still dictate what the min bucket will be (and the same goes to the extended_bounds.max and the max bucket). For filtering buckets, one should nest the histogram agg under a range filter agg with the appropriate min/max.
Closes#5224
Today, we use ConcurrentMergeScheduler, and this can be painful since it is concurrent on a shard level, with a max of 3 threads doing concurrent merges. If there are several shards being indexed, then there will be a minor explosion of threads trying to do merges, all being throttled by our merge throttling.
Moving to serial merge scheduler will still maintain concurrency of merges across shards, as we have the merge thread pool that schedules those merges. It will just be a serial one on a specific shard.
Also, on serial merge scheduler, we now have a limit of how many merges it will do at one go, so it will let other shards get their fair chance of merging. We use the pending merges on IW to check if merges are needed or not for it.
Note, that if a merge is happening, it will not block due to a sync on the maybeMerge call at indexing (flush) time, since we wrap our merge scheduler with the EnabledMergeScheduler, where maybeMerge is not activated during indexing, only with explicit calls to IW#maybeMerge (see Merges).
closes#5447
If we want to have a full picture of versions running in a cluster, we need to add a `_cat/plugins` endpoint.
Response could look like:
```sh
% curl es2:9200/_cat/plugins?v
node component version type url desc
es1 mapper-attachments 1.7.0 j Adds the attachment type allowing to parse difference attachment formats
es1 lang-javascript 1.4.0 j JavaScript plugin allowing to add javascript scripting support
es1 analysis-smartcn 1.9.0 j Smart Chinese analysis support
es1 marvel 1.1.0 j/s http://localhost:9200/_plugins/marvel Elasticsearch Management & Monitoring
es1 kopf 0.5.3 s http://localhost:9200/_plugins/kopf kopf - simple web administration tool for ElasticSearch
es2 mapper-attachments 2.0.0.RC1 j Adds the attachment type allowing to parse difference attachment formats
es2 lang-javascript 2.0.0.RC1 j JavaScript plugin allowing to add javascript scripting support
es2 analysis-smartcn 2.0.0.RC1 j Smart Chinese analysis support
```
Closes#4824.
Significance is related to the changes in document frequency observed between everyday use in the corpus and
frequency observed in the result set. The asciidocs include extensive details on the applications of this feature.
Closes#5146
This aggregation computes unique term counts using the hyperloglog++ algorithm
which uses linear counting to estimate low cardinalities and hyperloglog on
higher cardinalities.
Since this algorithm works on hashes, it is useful for high-cardinality fields
to store the hash of values directly in the index, which is the purpose of
the new `murmur3` field type. This is less necessary on low-cardinality
string fields because the aggregator is smart enough to only compute the hash
once per unique value per segment thanks to ordinals, or on numeric fields
since hashing them is very fast.
Close#5426
================
This commit extends the `CompletionSuggester` by context
informations. In example such a context informations can
be a simple string representing a category reducing the
suggestions in order to this category.
Three base implementations of these context informations
have been setup in this commit.
- a Category Context
- a Geo Context
All the mapping for these context informations are
specified within a context field in the completion
field that should use this kind of information.
Introduced two levels of randomization for the number of shards (between 1 and 10) when running tests:
1) through the existing random index template, which now sets a random number of shards that is shared across all the indices created in the same test method unless overwritten
2) through `createIndex` and `prepareCreate` methods, similar to what happens using the `indexSettings` method, which changes for every `createIndex` or `prepareCreate` unless overwritten (overwrites index template for what concerns the number of shards)
Added the following facilities to deal with the random number of shards:
- `getNumShards` to retrieve the number of shards of a given existing index, useful when doing comparisons based on the number of shards and we can avoid specifying a static number. The method returns an object containing the number of primaries, number of replicas and the total number of shards for the existing index
- added `assertFailures` that checks that a shard failure happened during a search request, either partial failure or total (all shards failed). Checks also the error code and the error message related to the failure. This is needed as without knowing the number of shards upfront, when simulating errors we can run into either partial (search returns partial results and failures) or total failures (search returns an error)
- added common methods similar to `indexSettings`, to be used in combination with `createIndex` and `prepareCreate` method and explicitly control the second level of randomization: `numberOfShards`, `minimumNumberOfShards` and `maximumNumberOfShards`. Added also `numberOfReplicas` despite the number of replicas is not randomized (default not specified but can be overwritten by tests)
Tests that specified the number of shards have been reviewed and the results follow:
- removed number_of_shards in node settings, ignored anyway as it would be overwritten by both mechanisms above
- remove specific number of shards when not needed
- removed manual shards randomization where present, replaced with ordinary one that's now available
- adapted tests that didn't need a specific number of shards to the new random behaviour
- fixed a couple of test bugs (e.g. 3 levels parent child test could only work on a single shard as the routing key used for grand-children wasn't correct)
- also done some cleanup, shared code through shard size facets and aggs tests and used common methods like `assertAcked`, `ensureGreen`, `refresh`, `flush` and `refreshAndFlush` where possible
- made sure that `indexSettings()` is always used as a basis when using `prepareCreate` to inject specific settings
- converted indexRandom(false, ...) + refresh to indexRandom(true, ...)
Supports sorting on sub-aggs down the current hierarchy. This is supported as long as the aggregation in the specified order path are of a single-bucket type, where the last aggregation in the path points to either a single-bucket aggregation or a metrics one. If it's a single-bucket aggregation, the sort will be applied on the document count in the bucket (i.e. doc_count), and if it is a metrics type, the sort will be applied on the pointed out metric (in case of a single-metric aggregations, such as avg, the sort will be applied on the single metric value)
NOTE: this commit adds a constraint on what should be considered a valid aggregation name. Aggregations names must be alpha-numeric and may contain '-' and '_'.
Closes#5253
Lucene 4.7 supports a setter for the `filler_token` that is
inserted if there are gaps in the token stream. This change exposes
this setting.
Closes#4307
In #4052 we added support for highlighting multi term queries using the postings highlighter. That worked only for top-level queries though, and not for multi term queries that are nested for instance within a bool query, or filtered query, or a constant score query.
The way we make this work is by walking the query structure and temporarily overriding the query rewrite method with a method that allows for multi terms extraction.
Closes#5102
Fixes#5128
Remove java 7 specific Locale functions, add "coming[1.1.0]" to documentation
add LocaleUtils utility class for dealing with Locale functions
Adds support for storing mustache based query templates that can later be filled
with query parameter values at execution time. Templates may be both quoted,
non-quoted and referencing templates stored in config/scripts/*.mustache by file
name.
See docs/reference/query-dsl/queries/template-query.asciidoc for templating
examples.
Implementation detail: mustache itself is being shaded as it depends directly on
guava - so having it marked optional but included in the final distribution
raises chances of version conflicts downstream.
Fixes#4879
It is now possible to specify aliases during index creation:
curl -XPUT 'http://localhost:9200/test' -d '
{
"aliases" : {
"alias1" : {},
"alias2" : {
"filter" : { "term" : {"field":"value"}}
}
}
}'
Closes#4920
In order to be consistent (and because in 1.0 we switched from
parameter driven information to specifzing the metrics as part of the URI)
this patch moves from 'plugin' to 'plugins' in the Nodes Info API.
`cross_fields` attemps to treat fields with the same analysis
configuration as a single field and uses maximum score promotion or
combination of the scores based depending on the `use_dis_max` setting.
By default scores are combined. `cross_fields` can also search across
fields of hetrogenous types for instance if numbers can be part of
the query it makes sense to search also on numeric fields if an analyzer
is provided in the reqeust.
Relates to #2959
* Mostly minor things like typos and grammar stuff
* Some clarifications
* The note on the deprecation was ambiguous. I've removed the problematic part so that it now definitely says it's deprecated
Currently, boosting on `copy_to` is misleading and does not work as originally specified in #4520. Instead of boosting just the terms from the origin field, it boosts the whole destination field. If two fields copy_to a third field, one with a boost of 2 and another with a boost of 3, all the terms in the third field end up with a boost of 6. This was not the intention.
The alternative: to store the boost in a payload for every term, results in poor performance and inflexibility. Instead, users should either (1) query the common field AND the field that requires boosting, or (2) the multi_match query will soon be able to perform term-centric cross-field matching that will allow per-field boosting at query time (coming in 1.1).
By default active, rejected and queue thread statistics are included for the index, bulk and search thread pool.
Other thread statistics of other thread pools can be included via the `h` query string parameter.
Closes#4907
Detects if rescores arrive as an array instead of a plain object. If so
then parse each element of the array as a separate rescore to be executed
one after another. It looks like this:
"rescore" : [ {
"window_size" : 100,
"query" : {
"rescore_query" : {
"match" : {
"field1" : {
"query" : "the quick brown",
"type" : "phrase",
"slop" : 2
}
}
},
"query_weight" : 0.7,
"rescore_query_weight" : 1.2
}
}, {
"window_size" : 10,
"query" : {
"score_mode": "multiply",
"rescore_query" : {
"function_score" : {
"script_score": {
"script": "log10(doc['numeric'].value + 2)"
}
}
}
}
} ]
Rescores as a single object are still supported.
Closes#4748
Removed unused misc.asciidoc file
Added plugins directory to directory layout
Fixed transport.tcp.connect_timeout value to match the code found in NetworkService.TcpSettings
Clarified that phrase query does not preserve order of terms
Clarified merge page
Added instructions on how to build documentation to docs/README
Terms aggregations return up to `size` terms, so up to now, the way to get all
matching terms back was to set `size` to an arbitrary high number that would be
larger than the number of unique terms.
Terms aggregators already made sure to not allocate memory based on the `size`
parameter so this commit mostly consists in making `0` an alias for the
maximum integer value in the TermsParser.
Close#4837
Adds a new FetchSubPhase, FieldDataFieldsFetchSubPhase, which loads the
field data cache for a field and returns an array of values for the
field.
Also removes `doc['<field>']` and `_source.<field>` workaround no longer
needed in field name resolving.
Closes#4492
* Make it clearer that `aggs` is an allowed synomym
for the `aggregations` key
* Fix broken example in for datehistogram, `1.5M` is
not an allowed interval
* Make use of colon before examples consistent
* Fix typos
- Removed "ok": true from response examples
- Added "created" flag to index response examples
- Replaced exists flag with found in delete response examples
* Made GET mappings consistent, supporting
* /{index}/_mappings/{type}
* /{index}/_mapping/{type}
* /_mapping/{type}
* Added "mappings" in the JSON response to align it with other responses
* Made GET warmers consistent, support /{index}/_warmers/{type} and /_warmer, /_warner/{name}
as well as wildcards and _all notation
* Made GET aliases consistent, support /{index}/_aliases/{name} and /_alias, /_aliases/{name}
as well as wildcards and _all notation
* Made GET settings consistent, added /{index}/_setting/{name}, /_settings/{name}
as well as supportings wildcards in settings name
* Returning empty JSON instead of a 404, if a specific warmer/
setting/alias/type is missing
* Added a ton of spec tests for all of the above
* Added a couple of more integration tests for several features
Relates #4071
See issue #4071
PUT options for _mapping:
Single type can now be added with
`[PUT|POST] {index|_all|*|regex|blank}/[_mapping|_mappings]/type`
and
`[PUT|POST] {index|_all|*|regex|blank}/type/[_mapping|_mappings]`
PUT options for _warmer:
PUT with a single warmer can now be done with
`[PUT|POST] {index|_all|*|prefix*|blank}/{type|_all|*|prefix*|blank}/[_warmer|_warmers]/warmer_name`
PUT options for _alias:
Single alias can now be PUT with
`[PUT|POST] {index|_all|*|prefix*|blank}/[_alias|_aliases]/alias`
DELETE options _mapping:
Several mappings can be deleted at once by defining several indices and types with
`[DELETE] /{index}/{type}`
`[DELETE] /{index}/{type}/_mapping`
`[DELETE] /{index}/_mapping/{type}`
where
`index= * | _all | glob pattern | name1, name2, …`
`type= * | _all | glob pattern | name1, name2, …`
Alternatively, the keyword `_mapings` can be used.
DELETE options for _warmer:
Several warmers can be deleted at once by defining several indices and names with
`[DELETE] /{index}/_warmer/{type}`
where
`index= * | _all | glob pattern | name1, name2, …`
`type= * | _all | glob pattern | name1, name2, …`
Alternatively, the keyword `_warmers` can be used.
DELETE options for _alias:
Several aliases can be deleted at once by defining several indices and names with
`[DELETE] /{index}/_alias/{type}`
where
`index= * | _all | glob pattern | name1, name2, …`
`type= * | _all | glob pattern | name1, name2, …`
Alternatively, the keyword `_aliases` can be used.
Fixes#4701. Changes behavior of the snapshot operation. The operation now fails if not all primary shards are available at the beginning of the snapshot operation. The restore operation no longer tries to restore indices with shards that failed or were missing during snapshot operation.
Currently it is possible to index a document as:
```
POST /myindex/mytype/1
{ "foo"...}
```
or as:
```
POST /myindex/mytype/1
{
"mytype": {
"foo"...
}
}
```
This makes indexing non-deterministic and fields can be misinterpreted
as type names.
This changes makes Elasticsearch accept only the first form by default,
ie without the type wrapper. This can be changed by setting
`index.mapping.allow_type_wrapper` to `true`` when creating the index.
Closes#4484
When set to false a new strict mode of parsing is employed which
a) does not permit numbers to be passed as JSON strings in quotes
b) rejects numbers with fractions that are passed to integer, short or long fields.
Closes#4117
Java Builder apis drop old “len” methods in favour of new “length”
Rest APIs support both old “len: and new “length” forms using new ParseField class to a) provide compiler-checked consistency between Builder and Parser classes and
b) a common means of handling deprecated syntax in the DSL.
Documentation and rest specs only document the new “*length” forms
Closes#4083