This adds a low level primitive operations to shrink an existing
index into a new index with a single shard. This primitive expects
all shards of the source index to allocated on a single node. Once the target index is initializing on the shrink node it takes a snapshot of the source index shards and copies all files into the target indices data folder. An [optimization](https://issues.apache.org/jira/browse/LUCENE-7300) coming in Lucene 6.1 will also allow for optional constant time copy if hard-links are supported by the filesystem. All mappings are merged into the new indexes metadata once the snapshots have been taken on the merge node.
To shrink an existing index all shards must be moved to a single node (one instance of each shard) and the index must be read-only:
```BASH
$ curl -XPUT 'http://localhost:9200/logs/_settings' -d '{
"settings" : {
"index.routing.allocation.require._name" : "shrink_node_name",
"index.blocks.write" : true
}
}
```
once all shards are started on the shrink node. the new index can be created via:
```BASH
$ curl -XPUT 'http://localhost:9200/logs/_shrink/logs_single_shard' -d '{
"settings" : {
"index.codec" : "best_compression",
"index.number_of_replicas" : 1
}
}'
```
This API will perform all needed check before the new index is created and selects the shrink node based on the allocation of the source index. This call returns immediately, to monitor shrink progress the recovery API should be used since all copy operations are reflected in the recovery API with byte copy progress etc.
The shrink operation does not modify the source index, if a shrink operation should
be canceled or if the shrink failed, the target index can simply be deleted and
all resources are released.
Like on other places in the query dsl the full field name should be used.
Before this change this wasn't the case for nested inner hits when source filtering was used.
Highlighting has a workaround, which is now removed as the source of nested inner hits can only be refered by the full name.
Closes#16653
This commit adds documentation for the bootstrap checks and provides
either links or inline guidance for setting the necessary settings to
pass the bootstrap checks.
Relates #18605
This PR changes the InternalTestCluster to support dedicated master nodes. The creation of dedicated master nodes can be controlled using a new `supportsMasterNodes` parameter to the ClusterScope annotation. If set to true (the default), dedicated master nodes will randomly be used. If set to false, no master nodes will be created and data nodes will also be allowed to become masters. If active, test runs will either have 1 or 3 masternodes
This commit removes the ability to specify a custom plugins
path. Instead, the plugins path will always be a subdirectory called
"plugins" off of the home directory.
- now you can specify a list of grok patterns to match your field with
and the first one to successfully match wins.
- only non-null captures will be inserted into your matched document.
Fixes#17903.
`doc_values` for _type field are created but any attempt to load them throws an IAE.
This PR re-enables `doc_values` loading for _type, it also enables `fielddata` loading for indices created between 2.0 and 2.1 since doc_values were disabled during that period.
It also restores the old docs that gives example on how to sort or aggregate on _type field.
The syntax highlighter only supports [source,js].
Also adds a check to the rest test generator that runs during
the build that'll fail the build if it sees `[source,json]`.
Significant changes:
* AbstractQueryTestCase has moved to the test framework module, in order for query builder tests in modules and plugins
* Added support to AbstractQueryTestCase to register plugins
* Lift the restriction that only one percolator could be added per index. This validation existed in MapperService, but because the percolator moved to a module it could no longer exist there. Instead of bringing it back it was removed. This validation existed since the percolator cache only supported one percolator query per document, since the percolator cache has been removed this restriction could removed as well.
* While moving percolator tests to the new module, also removed a couple of tests for the deprecated percolate and mpercolate api. These APIs are now sugar APIs for bwc and rediect to the searvh and msearvh APIs. Some tests were still testing as if percolate and mpercolate API did the percolation, but this no longer the case and these tests could be removed.
Before the query extraction would have been aborted and the percolator query would be marked as unknown.
This resulted in a situation that these queries always need to be evaluated by the memory index at search time.
By adding support for this query many more percolator query candidate hits can skip the expensive memory index verification step. For example the `match` query parser returns a MatchNoDocsQuery if the query terms are removed by text analysis (lets query text only contained stop words).
Remove the arbitrary limit on epoch_millis and epoch_seconds of 13 and 10
characters, respectively. Instead allow any character combination that can
be converted to a Java Long.
Update the docs to reflect this change.
Today if a shard fails during initialization phase due to misconfiguration, broken disks,
missing analyzers, not installed plugins etc. elasticsaerch keeps on trying to initialize
or rather allocate that shard. Yet, in the worst case scenario this ends in an endless
allocation loop. To prevent this loop and all it's sideeffects like spamming log files over
and over again this commit adds an allocation decider that stops allocating a shard that
failed more than N times in a row to allocate. The number or retries can be configured via
`index.allocation.max_retry` and it's default is set to `5`. Once the setting is updated
shards with less failures than the number set per index will be allowed to allocate again.
Internally we maintain a counter on the UnassignedInfo that is reset to `0` once the shards
has been started.
Relates to #18417
Before 5.0 for it was required that the percolator queries were cached in jvm heap as Lucene queries for two reasons:
1) Performance. The percolator evaluated all percolator queries all the time. There was no pre-selecting queries that are likely to match like we have today.
2) Updates made to percolator queries were visible in realtime, Today these changes are visible in near realtime. So updating no longer requires the percolator to have the queries in jvm heap.
So having the percolator queries in jvm heap via the percolator cache is now less attractive. Especially when there are many percolator queries then these queries can consume many GBs of jvm heap.
Removing the percolator cache does make the percolate query slower compared to how the execution time in 5.0.0-alpha1 and alpha2, but it is still faster compared to 2.x and before.
Currently the query builders expose the clauses of the span
query as a modifiable list. Instead we should make the that
getter return an unmodifiable list. Also renaming the method
used to add a clause from `clause(spanQuery)` to
`addClause(spanQuery)`.
Today when parsing settings during bootstrap, we add a system property
for every Elasticsearch setting. Additionally, settings can be set via
system properties. This commit simplifies this situation.
- settings are no longer propogated to system properties
- system properties can not be used to set settings
- the "es." prefix on settings is no longer required (nor permitted)
- test logging has a dedicated system property (tests.logger.level)
Relates #18198
* Docs: Improved tokenizer docs
Added descriptions and runnable examples
* Addressed Nik's comments
* Added TESTRESPONSEs for all tokenizer examples
* Added TESTRESPONSEs for all analyzer examples too
* Added docs, examples, and TESTRESPONSES for character filters
* Skipping two tests:
One interprets "$1" as a stack variable - same problem exists with the REST tests
The other because the "took" value is always different
* Fixed tests with "took"
* Fixed failing tests and removed preserve_original from fingerprint analyzer
* Register `indices.query.bool.max_clause_count` setting
This commit registers `indices.query.bool.max_clause_count` as a node
level setting and removes support for its synonym setting
`index.query.bool.max_clause_count`.
Closes#18336
Two of the snippets in validate weren't working properly so they are
marked as skip and linked to this:
https://github.com/elastic/elasticsearch/issues/18254
We didn't properly handle empty parameter values. We were sending
them as the literal string "null". Now we do better and send them
as the empty string.
This commit adds a variety of real disk metrics for the block devices
that back Elasticsearch data paths. A collection of statistics are read
from /proc/diskstats and are used to report the raw metrics for
operations and read/write bytes.
Relates #15915
This uses the same backoff policy we use for bulk and just retries until
the request isn't rejected.
Instead of `{"retries": 12}` in the response to count retries this now
looks like `{"retries": {"bulk": 12", "search": 1}`.
Closes#18059
Sorts an array of values in ascending or descending order. If all elements are numerics, they will be sorted numerically. If values are strings, or mixtures of strings/numbers, the elements will be sorted lexicographically.
This change does the following:
- Queries that are currently unsupported such as prefix queries on numeric
fields or term queries on geo fields now throw an error rather than returning
a query that does not match anything.
- Fuzzy queries on numeric, date and ip fields are now unsupported: they used
to create range queries, we now expect users to use range queries directly.
Fuzzy, regexp and prefix queries are now only supported on text/keyword
fields (including `_all`).
- The `_uid` and `_id` fields do not support prefix or range queries anymore as
it would prevent us to store them more efficiently in the future, eg. by
using a binary encoding.
Note that it is still possible to ignore these errors by using the `lenient`
option of the `match` or `query_string` queries.
This commit adds a note to the Windows service docs regarding the thread
stack size setting for the Windows service installer. As the Apache
Commons procrun daemon requires that this setting be explicitly set, we
need a value to be set when the service is installed. The right place
for this setting is the jvm.options file. We do not want to ship with a
hard-coded value here because we do not want to override the default
setting on other platforms, and the right default depends on whether or
not the end-user is on a 32-bit versus a 64-bit Windows system.
Relates #18324
Previously multiple extensions could be provided, however, this can lead
to confusion with on-disk scripts (ie, "foo.js" and "foo.javascript")
having different content. Only a single extension is now supported.
The only language currently supporting multiple extensions was the
Javascript engine ("js" and "javascript"). It now only supports the
`.js` extension.
Relates to #10598
Currently `fuzziness` is not supported for the `cross_fields` type
of the `multi_match` query since it complicates the logic that
blends the term queries that cross_fields uses internally. At the
moment using this combination is silently ignored, which can lead to
confusions. Instead we should throw an exception in this case.
The same is true for phrase and phrase_prefix type.
Closes#7764
This removes all the mentions of the sandbox from the script engine
services and permissions model. This means that the following settings
are no longer supported:
```yaml
script.inline: sandbox
script.stored: sandbox
```
Instead, only a `true` or `false` value can be specified.
Since this would otherwise break the default-allow parameter for
languages like expressions, painless, and mustache, all script engines
have been updated to have individual settings, for instance:
```yaml
script.engine.groovy.inline: true
```
Would enable all inline scripts for groovy. (they can still be
overridden on a per-operation basis).
Expressions, Painless, and Mustache all default to `true` for inline,
file, and stored scripts to preserve the old scripting behavior.
Resolves#17114
Rearranges the FingerprintAnalyzer so that AsciiFolding comes earlier in the chain (after lowercasing, before stop removal, for maximum deduping power)
Closes#18266
It will keep using the caching terms enum for keyword/text fields and falls back
to IndexSearcher.count for fields that do not use the inverted index for
searching (such as numbers and ip addresses). Note that this probably means that
significant terms aggregations on these fields will be less efficient than they
used to be. It should be ok under a sampler aggregation though.
This moves tests back to the state they were in before numbers started using
points, and also adds a new test that significant terms aggs fail if a field is
not indexed.
In the long term, we might want to follow the approach that Robert initially
proposed that consists in collecting all documents from the background filter in
order to compute frequencies using doc values. This would also mean that
significant terms aggregations do not require fields to be indexed anymore.
* Docs: First pass at improving analyzer docs
I've rewritten the intro to analyzers plus the docs
for all analyzers to provide working examples.
I've also removed:
* analyzer aliases (see #18244)
* analyzer versions (see #18267)
* snowball analyzer (see #8690)
Next steps will be tokenizers, token filters, char filters
* Fixed two typos
This commit adds a hard requirement to the RPM and Debian packages for
/bin/bash to be present, and adds a note regarding this to the migration
docs.
Relates #18259
Remove performance hack for accessing a document's fields, its not needed.
Add support for accessing is-getter methods like List.isEmpty() as .empty
Closes#18201
The plugin script parses command-line options looking for Java system
properties and extracts these arguments to pass to the java command when
starting the JVM. Since elasticsearch-plugin allows arbitrary user
arguments to the JVM via ES_JAVA_OPTS, this parsing is unnecessary. This
commit removes this unnecessary
Relates #18207
We should prevent highlighting if a field is anything but a text or keyword field.
However, someone might implement a custom field type that has text and still want to
highlight on that. We cannot know in advance if the highlighter will be able to
highlight such a field and so we do the following:
If the field is only highlighted because the field matches a wildcard we assume
it was a mistake and do not process it.
If the field was explicitly given we assume that whoever issued the query knew
what they were doing and try to highlight anyway.
closes#17537
The `ip` field uses a binary representation internally. This breaks when
rendering sort values in search responses since elasticsearch tries to write a
binary byte[] as an utf8 json string. This commit extends the `DocValueFormat`
API in order to give fields a chance to choose how to render values.
Closes#6077
QueryBuilder has generics, but those are never used: all call sites use
`QueryBuilder<?>`. Only `AbstractQueryBuilder` needs generics so that the base
class can contain a default implementation for setters that returns `this`.
This gives better coverage and consistency with the scripting APIs, by
whitelisting the primary search scripting API classes and using them instead
of only Map and List methods.
For example, accessing fields can now be done with `.value` instead of `.0`
because `getValue()` is whitelisted. For now, access to a document's fields in
this way (loads) are fast-pathed in the code, to avoid dynamic overhead.
Access to geo fields and geo distance functions is now supported.
TODO: date support (e.g. whitelist ReadableDateTime methods as a start)
TODO: improve docs (like expressions and groovy have for document's fields)
TODO: remove fast-path hack
Closes#18169
Squashed commit of the following:
commit ec9f24b2424891a7429bb4c0a03f9868cba0a213
Author: Robert Muir <rmuir@apache.org>
Date: Thu May 5 17:59:37 2016 -0400
cutover to <Def> instead of <Object> here
commit 9edb1550438acd209733bc36f0d2e0aecf190ecb
Author: Robert Muir <rmuir@apache.org>
Date: Thu May 5 17:03:02 2016 -0400
add fast-path for docvalues field loads
commit f8e38c0932fccc0cfa217516130ad61522e59fe5
Author: Robert Muir <rmuir@apache.org>
Date: Thu May 5 16:47:31 2016 -0400
Painless: add fielddata accessors (.value/.values/.distance()/etc)
The snippet in the docs creates and index and uses it with the
_analyze api. The trouble is that if the index hasn't been created
fully the _analyze API will fail. This adds a
GET _cluster/health?wait_for_status=yellow
which fixes the issue.
While this does make the docs more cluttered, it also makes the snippets
actually runnable.
Closes#18165
Adds infrastructure so `gradle :docs:check` will extract tests from
snippets in the documentation and execute the tests. This is included
in `gradle check` so it should happen on CI and during a normal build.
By default each `// AUTOSENSE` snippet creates a unique REST test. These
tests are executed in a random order and the cluster is wiped between
each one. If multiple snippets chain together into a test you can annotate
all snippets after the first with `// TEST[continued]` to have the
generated tests for both snippets joined.
Snippets marked as `// TESTRESPONSE` are checked against the response
of the last action.
See docs/README.asciidoc for lots more.
Closes#12583. That issue is about catching bugs in the docs during build.
This catches *some* bugs in the docs during build which is a good start.
All other values are errors.
Add java test for throttling. We had a REST test but it only ran against
one node so it didn't catch serialization errors.
Add Simple round trip test for rethrottle request
* Reorganize scripting documentation
* Further changes to tidy up scripting docs
Closes#18116
* Add note about .lat/lon potentially returning null
* Added .value to expressions example
* Fixed two bad ASCIIDOC links