This commit adds a small note to the discovery docs to include a note
that we recommend that the unicast hosts list be maintained as the list
of master-eligible nodes in the cluster.
Relates #25991
This commit updates the docs for the config files to explain the new
mechanism for customizing the configuration directory via the
environment variable CONF_DIR.
Relates #25990
This commit removes an outdated reference to http_address in the nodes
info docs. This information is available in the http object for each
node in the nodes info API response.
Relates #25980
On non-Windows platforms, we ignore the environment variable
JAVA_TOOL_OPTIONS (this is an environment variable that the JVM respects
by default for picking up extra JVM options). The primary reason that we
ignore this because of the Jayatana agent on Ubuntu; a secondary reason
is that it produces an annoying "Picked up JAVA_TOOL_OPTIONS: ..."
output message. When the elasticsearch-env batch script was introduced
for Windows, ignoring this environment variable was deliberately not
carried over as the primary reason does not apply on Windows. However,
after additional thinking, it seems that we should simply be consistent
to the extent possible here (and also avoid that annoying "Picked up
JAVA_TOOL_OPTIONS: ..." on Windows too). This commit causes the Windows
version of elasticsearch-env to also ignore JAVA_TOOL_OPTIONS.
Relates #25968
This commit adds a bootstrap check for the maximum file size, and
ensures the limit is set correctly when Elasticsearch is installed as a
service on systemd-based systems.
Relates #25974
The Writeble representation is less heavy to parse and that will benefit percolate performance and throughput.
The query builder's binary format has now the same bwc guarentees as the xcontent format.
Added a qa test that verifies that percolator queries written in older versions are still readable by the current version.
The example output for node info and cluster stats was outdated w.r.t.
to the information that is shown for plugins. With this commit we
updated the example output and update the explanation of the respective
fields.
This commit changes the way we handle field expansion in `match`, `multi_match` and `query_string` query.
The main changes are:
- For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.
- For partial field names (with `*` suffix), the expansion is done only on `keyword`, `text`, `date`, `ip` and `number` field types. Other field types are simply ignored.
- For all fields (`*`), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.
- The `*` notation can also be used to set `default_field` option on`query_string` query. This should replace the needs for the extra option `use_all_fields` which is deprecated in this change.
This commit also rewrites simple `*` query to matchalldocs query when all fields are requested (Fixes#25556).
The same change should be done on `simple_query_string` for completeness.
`use_all_fields` option in `query_string` is also deprecated in this change, `default_field` should be set to `*` instead.
Relates #25551
This commit adds the min wire/index compat versions to the main action
output. Not only will this make the compatility expected more
transparent, but it also allows to test which version others think the
compat versions are, similar to how we test the lucene version.
When a node tries to join a cluster, it goes through a validation step to make sure the node is compatible with the cluster. Currently we validation that the node can read the cluster state and that it is compatible with the indexes of the cluster. This PR adds validation that the joining node's version is compatible with the versions of existing nodes. Concretely we check that:
1) The node's min compatible version is higher or equal to any node in the cluster (this prevents a too-new node from joining)
2) The node's version is higher or equal to the min compat version of all cluster nodes (this prevents a too old join where, for example, the master is on 5.6, there's another 6.0 node in the cluster and a 5.4 node tries to join).
3) The node's major version is at least as higher as the lowest node in the cluster. This is important as we use the minimum version in the cluster to stop executing bwc code for operations that require multiple nodes. If the nodes are already operating in "new cluster mode", we should prevent nodes from the previous major to join (even if they are wire level compatible). This does mean that if you have a very unlucky partition during the upgrade which partitions all old nodes which are also a minority / data nodes only, the may not be able to re-join the cluster. We feel this edge case risk is well worth the simplification it brings to BWC layers only going one way. This restriction only holds if the cluster state has been recovered (i.e., the cluster has properly formed).
Also, the node join validation can now selectively fail specific nodes (previously the entire batch was failed). This is an important preparation for a follow up PR where we plan to have a rejected joining node die with dignity.
Also has updates to ScriptMetaData for allowing the old namespace format to be loaded all the way back through 5.0; however, it will throw an exception if two scripts share the same id but different languages.
Today we enable users to customize the environment through the use of
ES_INCLUDE. This made sense for legacy reasons when we did not have
nicities like jvm.options (so dumped JVM options in the default include
script) and somewhat duplicates some of the functionality that we will
need from a dedicated environment script. This commit removes support
for ES_INCLUDE as a first step towards a dedicated include script.
Relates #25804
We currently use fielddata on the `_id` field which is trappy, especially as we
do it implicitly. This changes the `random_score` function to use doc ids when
no seed is provided and to suggest a field when a seed is provided.
For now the change only emits a deprecation warning when no field is supplied
but this should be replaced by a strict check on 7.0.
Closes#25240
When a node tries to join a cluster, it goes through a validation step to make sure the node is compatible with the cluster. Currently we validation that the node can read the cluster state and that it is compatible with the indexes of the cluster. This PR adds validation that the joining node's version is compatible with the versions of existing nodes. Concretely we check that:
1) The node's min compatible version is higher or equal to any node in the cluster (this prevents a too-new node from joining)
2) The node's version is higher or equal to the min compat version of all cluster nodes (this prevents a too old join where, for example, the master is on 5.6, there's another 6.0 node in the cluster and a 5.4 node tries to join).
3) The node's major version is at least as higher as the lowest node in the cluster. This is important as we use the minimum version in the cluster to stop executing bwc code for operations that require multiple nodes. If the nodes are already operating in "new cluster mode", we should prevent nodes from the previous major to join (even if they are wire level compatible). This does mean that if you have a very unlucky partition during the upgrade which partitions all old nodes which are also a minority / data nodes only, the may not be able to re-join the cluster. We feel this edge case risk is well worth the simplification it brings to BWC layers only going one way.
Also, the node join validation can now selectively fail specific nodes (previously the entire batch was failed). This is an important preparation for a follow up PR where we plan to have a rejected joining node die with dignity.
This commit expands on the migration note regarding the removal of
default.path.data and default.path.logs to include a note that users
that were relying on the defaults (the common case for path.logs), and
they carry over their previous elasticsearch.yml configruation file,
then they must add explicit values for path.data and path.logs.
403 can be confused with security. If an API doesn't support working against closed indices and closed indices are referred to in a request, that is a bad request, hence 400 is more appropriate.
Currently the `to` and `from` parameter in the `date_range` aggregation is not
parsed with the correct date field format from the mappings or the aggregation
if the argument is numeric, but always treated as a long value specifying
`epoch_millis`. This leads to problems e.g. when the format is `epoch_second`,
but the `to` and `from` are currently treated as millis.
With this change, we interpret these parameters according to the `format` of the target field.
If the `format` in the mappings is not compatible with numeric input values,
a compatible `format` (e.g. `epoch_millis`, `epoch_second`) must be specified in
the `date_range` aggregation itself, otherwise an error is thrown.
#Closes #17920
This change removes the leniency of having a `null` index to fetch
terms from in 6.0 onwards. This feature will be deprecated in the 5.x series
and 6.0 nodes will require the index to be set.
Closes#25750
When simulating an ingest pipeline against an existing pipeline, the
_source field is required to wrap each doc. This commit fixes another
example in the docs that is missing this.
Relates #25743, relates e3a0c11239
When simulating an ingest pipeline against an existing pipeline, the
_source field is required to wrap each doc. This commit fixes an example
in the docs that is missing this.
Relates #25742
This commit changes the default heap size to 1 GB. Experimenting with
elasticsearch is often done on laptops, and 1 GB is much friendlier to
laptop memory. It does put more pressure on the gc, but the tradeoff is
a smaller default footprint. Users running in production can (and
should) adjust the heap size as necessary for their usecase.
The Filtered Query has been deprecated in favour of the Bool Query with a filter context. However, this deleted page for the Filtered Query is often ranked highly in search results when searching for documentation on "filtered queries". Often people just copy the first code snippet they see, which in this case is the INCORRECT syntax (the correct syntax follows). I think reordering the examples would help avoid a lot of confusion (I have seen people make this same mistake 3 times now)
Adding a comment to indicate that the first example shouldn't be used
This change refactors the query_string query to analyze the query text around logical operators of the query string the same way than a match_query/multi_match_query.
It also adds a type parameter that can be used to change the way multi fields query are built the same way than a multi_match query does.
Now that these queries share the same behavior regarding text analysis, some parameters are obsolete and have been deprecated:
split_on_whitespace: This setting is now ignored with a deprecation notice
if it is used explicitely. With this PR The query_string always splits on logical operator.
It simplifies the understanding of the other parameters that can have different meanings
depending on the value of split_on_whitespace.
auto_generate_phrase_queries: This setting is now ignored with a deprecation notice
if it is used explicitely. This setting only makes sense when the parser splits on whitespace.
use_dismax: This setting is now ignored with a deprecation notice
if it is used explicitely. The tie_breaker parameter is sufficient to handle best_fields/most_fields.
Fixes#25574
Today if we search across a large amount of shards we hit every shard. Yet, it's quite
common to search across an index pattern for time based indices but filtering will exclude
all results outside a certain time range ie. `now-3d`. While the search can potentially hit
hundreds of shards the majority of the shards might yield 0 results since there is not document
that is within this date range. Kibana for instance does this regularly but used `_field_stats`
to optimize the indexes they need to query. Now with the deprecation of `_field_stats` and it's upcoming removal a single dashboard in kibana can potentially turn into searches hitting hundreds or thousands of shards and that can easily cause search rejections even though the most of the requests are very likely super cheap and only need a query rewriting to early terminate with 0 results.
This change adds a pre-filter phase for searches that can, if the number of shards are higher than a the `pre_filter_shard_size` threshold (defaults to 128 shards), fan out to the shards
and check if the query can potentially match any documents at all. While false positives are possible, a negative response means that no matches are possible. These requests are not subject to rejection and can greatly reduce the number of shards a request needs to hit. The approach here is preferable to the kibana approach with field stats since it correctly handles aliases and uses the correct threadpools to execute these requests. Further it's completely transparent to the user and improves scalability of elasticsearch in general on large clusters.
This commit enables management of the main Elasticsearch log files
out-of-the-box by the following changes:
- compress rolled logs
- roll logs every 128 MB
- maintain a sliding window of logs
- remove the oldest logs maintaining no more than 2 GB of compressed
logs on disk
Relates #25660
This commit removes the environment variable ES_JVM_OPTIONS that allows
the jvm.options file to sit separately from the rest of the config
directory. Instead, we use the CONF_DIR environment variable for custom
configuration location just as we do for the other configuration files.
Relates #25679
Requests that execute a stored script will no longer be allowed to specify the lang of the script. This information is stored in the cluster state making only an id necessary to execute against. Putting a stored script will still require a lang.
Indexing a join field on a document requires a value of type "object" and two sub fields "name"
and "parent". The "parent" field is only required on child documents, but the "name" field which
denotes the name of the relation is always needed. Previously, only the short-hand version of the
join field was documented. This adds documentation for the long-hand join field data, and
explicitly points out that just specifying the name of the relation for the field value is a
convenience shortcut.
This is a protection mechanism to prevent a single search request from
hitting a large number of shards in the cluster concurrently. If a search is
executed against all indices in the cluster this can easily overload the cluster
causing rejections etc. which is not necessarily desirable. Instead this PR adds
a per request limit of `max_concurrent_shard_requests` that throttles the number of
concurrent initial phase requests to `256` by default. This limit can be increased per request
and protects single search requests from overloading the cluster. Subsequent PRs can introduces
addiontional improvemetns ie. limiting this on a `_msearch` level, making defaults a factor of
the number of nodes or sort shards iters such that we gain the best concurrency across nodes.
This change collapses some of the packages for the bucket aggregations into their parent packages. This was done for the following aggregations:
* The variants of the range aggregation (geo_distance, date and ip) were moved into the `o.e.s.a.bucket.range` package
* The `o.e.s.a.bucket.terms.support` package was removed and the classes were moved to `o.e.s.a.bucket.terms`
* The filter aggregation was moved to `o.e.s.a.bucket.filter`
Since this PR is already relatively large with only the above changes subsequent PRs will do similar operations on relevant metric and pipeline aggregations
Relates to #22868
This commit adds cross-settings validation for the low/high/flood stage
disk watermark settings. This validation was enabled by the introduction
of multiple settings validation.
Relates #25600
The created and found fields in index and delete responses became obsolete after the introduction of the result field in index, update and delete responses (#19566).
After deprecating the created and found fields in 5.x (#19633), now they are removed.
Fixes#19630
* Improved REST endpoint exception handling, see #15335
Also improved OPTIONS http method handling to better conform with the
http spec.
* Tidied up formatting and comments
See #15335
* Tests for #15335
* Cleaned up comments, added section number
* Swapped out tab indents for space indents
* Test class now extends ESSingleNodeTestCase
* Capture RestResponse so it can be examined in test cases
Simple addition to surface the RestResponse object so we can run tests
against it (see issue #15335).
* Refactored class name, included feedback
See #15335.
* Unit test for REST error handling enhancements
Randomizing unit test for enhanced REST response error handling. See
issue #15335 for more details.
* Cleaned up formatting
* New constructor to set HTTP method
Constructor added to support RestController test cases.
* Refactored FakeRestRequest, streamlined test case.
* Cleaned up conflicts
* Tests for #15335
* Added functionality to ignore or include path wildcards
See #15335
* Further enhancements to request handling
Refactored executeHandler to prioritize explicit path matches. See
#15335 for more information.
* Cosmetic fixes
* Refactored method handlers
* Removed redundant import
* Updated integration tests
* Refactoring to address issue #17853
* Cleaned up test assertions
* Fixed edge case if OPTIONS method randomly selected as invalid method
In this test, an OPTIONS method request is valid, and should not return
a 405 error.
* Remove redundant static modifier
* Hook the multiple PathTrie attempts into RestHandler.dispatchRequest
* Add missing space
* Correctly retrieve new handler for each Trie strategy
* Only copy headers to threadcontext once
* Fix test after REST header copying moved higher up
* Restore original params when trying the next trie candidate
* Remove OPTIONS for invalidHttpMethodArray so a 405 is guaranteed in tests
* Re-add the fix I already added and got removed during merge :-/
* Add missing GET method to test
* Add documentation to migration guide about breaking 404 -> 405 changes
* Explain boolean response, pull into local var
* fixup! Explain boolean response, pull into local var
* Encapsulate multiple HTTP methods into PathTrie<MethodHandlers>
* Add PathTrie.retrieveAll where all matching modes can be retrieved
Then TrieMatchingMode can be package private and not leak into RestController
* Include body of error with 405 responses to give hint about valid methods
* Fix missing usageService handler addition
I accidentally removed this :X
* Initialize PathTrieIterator modes with Arrays.asList
* Use "== false" instead of !
* Missing paren :-/
Today when we run out of disk all kinds of crazy things can happen
and nodes are becoming hard to maintain once out of disk is hit.
While we try to move shards away if we hit watermarks this might not
be possible in many situations. Based on the discussion in #24299
this change monitors disk utilization and adds a flood-stage watermark
that causes all indices that are allocated on a node hitting the flood-stage
mark to be switched read-only (with the option to be deleted). This allows users to react on the low disk
situation while subsequent write requests will be rejected. Users can switch
individual indices read-write once the situation is sorted out. There is no
automatic read-write switch once the node has enough space. This requires
user interaction.
The flood-stage watermark is set to `95%` utilization by default.
Closes#24299
Add an Important admonition for upgrading via the command line
using the Windows MSI Installer. This calls out the need to pass
the same command line options for an upgrade as were used for
the initial installation.
* Adds check for negative search request size
This change adds a check to `SearchSourceBuilder` to throw and exception if the size set on it is set to a negative value.
Closes#22530
* fix error in reindex
* update re-index tests
* Addresses review comment
* Fixed tests
* Added random negative size test
* Fixes test
This commit adds a note to the docs regarding explicilty setting a
publish host if the network.host setting results in multiple bind
addresses.
Relates #25496
Currently QueryParseContext is only a thin wrapper around an XContentParser that
adds little functionality of its own. I provides helpers for long deprecated
field names which can be removed and two helper methods that can be made static
and moved to other classes. This is a first step in helping to remove
QueryParseContext entirely.
Expand `/_cat/nodes` with already present information about available disk space `diskAvail` (alias: `d`, `disk`) by:
* `diskTotal` (alias `dt`): total disk space
* `diskUsed` (alias `du`): used disk space (`diskTotal - diskAvail`)
* `diskUsedPercent` (alias `dup`): used disk space percentage
Note: The available disk space is the number of bytes available to the node's Java virtual machine. The size might be smaller than the real one. That means the used disk space (percentage) is larger.
Closes#21679
We previously tried to maintain (while not formally supporting) 32-bit
support, although we never tested this anywhere in CI. Since we do not
formally support this, and 32-bit usage is very low, we have elected to
no longer maintain 32-bit support. This commit removes any implication
of 32-bit support.
Relates #25435
Changed `rescore`s to `rescore` requests as an backtick followed by the s character appears to be interpreted as an apostrophe which then leads to an unbalanced backtick for the next code span in the remainder of the paragraph
Closes#25443
This commit removes the default path settings for data and logs. With
this change, we now ship the packages with these settings set in the
elasticsearch.yml configuration file rather than going through the
default.path.data and default.path.logs dance that we went through in
the past.
Relates #25408
This commit removes path.conf as a valid setting and replaces it with a
command-line flag for specifying a non-default path for configuration.
Relates #25392
#25147 added the translog deletion policy but didn't enable it by default. This PR enables a default retention of 512MB (same maximum size of the current translog) and an age of 12 hours (i.e., after 12 hours all translog files will be deleted). This increases to chance to have an ops based recovery, even if the primary flushed or the replica was offline for a few hours.
In order to see which parts of the translog are committed into lucene the translog stats are extended to include information about uncommitted operations.
Views now include all translog ops and guarantee, as before, that those will not go away. Snapshotting a view allows to filter out generations that are not relevant based on a specific sequence number.
Relates to #10708
The `document_type` parameter is no longer required to be specified,
because by default from 6.0 only a single type is allowed. (`index.mapping.single_type` defaults to `true`)
* [Analysis] Parse synonyms with the same analysis chain
Synonym Token Filter / Synonym Graph Filter tokenize synonyms with whatever tokenizer and token filters appear before it in the chain.
Close#7199
* Add MSI installation to documentation
Move installation documentation for Windows with the .zip archive into the zip and tar installation documentation, and clearly indicate any differences for installing on macOS/Linux and Windows.
* Separate out installation with .zip on Windows
With #23997 we have introduced a new internal index option that allows to resolve index expressions only against concrete indices while ignoring aliases. Such index option was applied to IndicesAliasesRequest, so that the index part of alias actions would only be resolved against concrete indices.
Same is done in this commit with delete index request. Deleting aliases has always been confusing as some users expect it to only remove the alias from the index (which has its own specific API). Even worse, in case of filtered aliases, deleting an alias may leave users with the expectation that only the documents that match the filter are deleted, which was never the case. To address all this confusion, delete index api works now only against concrete indices. WIldcard expressions will be only resolved against concrete index, as if aliases didn't exist. If one tries to delete against an alias, an IndexNotFoundException will be thrown regardless of whether the alias exists or not, as a concrete index with such a name doesn't exist.
Closes#2318
It adds notes about:
- how preference can help optimize cache usage
- the fact that too many replicas can hurt search performance due to lower
utilization of the filesystem cache
- how index sorting can improve _source compression
- how always putting fields in the same order in documents can improve _source
compression
* Add documentation for the new parent-join field
This commit adds the docs for the new parent-join field.
It explains how to define, index and query this new field.
Relates #20257
* Upgrade icu4j for the ICU analysis plugin to 59.1
Lucene upgraded to 59.1 so we should use the same.
Closes#21425
* Add breaking change for the icu upgrade
This snapshot has faster range queries on range fields (LUCENE-7828), more
accurate norms (LUCENE-7730) and the ability to use fake term frequencies
(LUCENE-7854).
Expose the experimental simplepattern and
simplepatternsplit tokenizers in the common
analysis plugin. They provide tokenization based
on regular expressions, using Lucene's
deterministic regex implementation that is usually
faster than Java's and has protections against
creating too-deep stacks during matching.
Both have a not-very-useful default pattern of the
empty string because all tokenizer factories must
be able to be instantiated at index creation time.
They should always be configured by the user
in practice.
Get mappings HEAD requests incorrectly return a content-length header of
0. This commit addresses this by removing the special handling for get
mappings HEAD requests, and just relying on the general mechanism that
exists for handling HEAD requests in the REST layer.
Relates #23192
This commit adds back "id" as the key within a script to specify a
stored script (which with file scripts now gone is no longer ambiguous).
It also adds "source" as a replacement for "code". This is in an attempt
to normalize how scripts are specified across both put stored scripts and script usages, including search template requests. This also deprecates the old inline/stored keys.
This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves
exactly like the `unified` highlighter when index_options is set to `offsets`:
https://issues.apache.org/jira/browse/LUCENE-7815
It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided).
The strategy used internally by this highlighter remain the same as before, it checks `term_vectors` first, then `postings` and ultimately it re-analyzes the text.
Ultimately it rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such.
There are few features that the `unified` highlighter is not able to handle which is why the other highlighters (`plain` and `fvh`) are still available.
I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features have been added to the `unified`.
This PR enables Ingest plugins to leverage processor-scoped REST
endpoints. First of which being the Grok endpoint that retrieves
Grok Patterns for users to retrieve all the built-in patterns.
Example usage: Kibana Grok Autocomplete!
This commit refactors the query phase in order to be able
to automatically detect queries that can be early terminated.
If the index sort matches the query sort, the top docs collection is early terminated
on each segment and the computing of the total number of hits that match the query is delegated to a simple TotalHitCountCollector.
This change also adds a new parameter to the search request called `track_total_hits`.
It indicates if the total number of hits that match the query should be tracked.
If false, queries sorted by the index sort will not try to compute this information and
and will limit the collection to the first N documents per segment.
Aggregations are not impacted and will continue to see every document
even when the index sort matches the query sort and `track_total_hits` is false.
Relates #6720
During package install on systemd-based systems, some sysctl settings
should be set (e.g. vm.max_map_count).
In some environments, changing sysctl settings plainly does not work;
previously a global environment variable named
ES_SKIP_SET_KERNEL_PARAMETERS was introduced to skip calling sysctl, but
this causes trouble for:
- configuration management systems, which usually cannot apply an env
var when running a package manager
- package upgrades, which will not have the env var set any more, and
thus leaving the package management system in a bad state (possibly
half-way upgraded, can be very hard to recover)
This removes the env var again and instead of calling systemd-sysctl
manually, tells systemd to restart the wrapper unit - which itself can
be masked by system administrators or management tools if it is known
that sysctl does not work in a given environment.
The restart is not silent on systems in their default configuration, but
is ignored if the unit is masked.
Relates #24234
The index parameter in the update-aliases, put-alias, and delete-alias APIs no longer accepts alias names. Instead, it accepts only index names (or wildcards which will expand to matching indices).
Closes#23960
This removes the parsing of things like `GET /idx/_aliases,_mappings`, instead,
a user must choose between retriving all index metadata with `GET /idx`, or only
a specific form such as `GET /idx/_settings`.
Relates to (and is a prerequisite of) #24437
Some response classes in the java api expose both `getTook()` which returns a `TimeValue` and `getTookInMillis` which returns a `long` value. `getTook()` is enough as one can do `getTook().millis()` to obtain the same result as `getTookInMillis()`, which can be removed.
* Adds nodes usage API to monitor usages of actions
The nodes usage API has 2 main endpoints
/_nodes/usage and /_nodes/{nodeIds}/usage return the usage statistics
for all nodes and the specified node(s) respectively.
At the moment only one type of usage statistics is available, the REST
actions usage. This records the number of times each REST action class is
called and when the nodes usage api is called will return a map of rest
action class name to long representing the number of times each of the action
classes has been called.
Still to do:
* [x] Create usage service to store usage statistics
* [x] Record usage in REST layer
* [x] Add Transport Actions
* [x] Add REST Actions
* [x] Tests
* [x] Documentation
* Rafactors UsageService so counts are done by the handlers
* Fixing up docs tests
* Adds a name to all rest actions
* Addresses review comments
This commit adds a new bg_count field to the REST response of
SignificantTerms aggregations. Similarly to the bg_count that already
exists in significant terms buckets, this new bg_count field is set at
the aggregation level and is populated with the superset size value.
Currently global ordinals are documented under `fielddata`. It moves them to
their own file since they also work with doc values and fielddata is on the way
out.
Closes#23101
This commit adds a `doc_count` field to the response body of Matrix
Stats aggregation. It exposes the number of documents involved in
the computation of statistics, a value that can already be retrieved using
the method MatrixStats.getDocCount() in the Java API.
* SignificantText aggregation - like significant_terms but doesn’t require fielddata=true, recommended used with `sampler` agg to limit expense of tokenizing docs and takes optional `filter_duplicate_text`:true setting to avoid stats skew from repeated sections of text in search results.
Closes#23674
Adds a "magic" key to the yaml testing stash mostly for use with
documentation tests. When unstashing an object, `$_path` is the
path into the current position in the object you are unstashing.
This means that in docs tests you can use
`// TESTRESPONSEs/somevalue/$body.${_path}/` to mean "replace
`somevalue` with whatever is the response in the same position."
Compare how you must carefully mock out all the numbers in the profile
response without this change:
```
// TESTRESPONSE[s/"id": "\[2aE02wS1R8q_QFnYu6vDVQ\]\[twitter\]\[1\]"/"id": $body.profile.shards.0.id/]
// TESTRESPONSE[s/"rewrite_time": 51443/"rewrite_time": $body.profile.shards.0.searches.0.rewrite_time/]
// TESTRESPONSE[s/"score": 51306/"score": $body.profile.shards.0.searches.0.query.0.breakdown.score/]
// TESTRESPONSE[s/"time_in_nanos": "1873811"/"time_in_nanos": $body.profile.shards.0.searches.0.query.0.time_in_nanos/]
// TESTRESPONSE[s/"build_scorer": 2935582/"build_scorer": $body.profile.shards.0.searches.0.query.0.breakdown.build_scorer/]
// TESTRESPONSE[s/"create_weight": 919297/"create_weight": $body.profile.shards.0.searches.0.query.0.breakdown.create_weight/]
// TESTRESPONSE[s/"next_doc": 53876/"next_doc": $body.profile.shards.0.searches.0.query.0.breakdown.next_doc/]
// TESTRESPONSE[s/"time_in_nanos": "391943"/"time_in_nanos": $body.profile.shards.0.searches.0.query.0.children.0.time_in_nanos/]
// TESTRESPONSE[s/"score": 28776/"score": $body.profile.shards.0.searches.0.query.0.children.0.breakdown.score/]
// TESTRESPONSE[s/"build_scorer": 784451/"build_scorer": $body.profile.shards.0.searches.0.query.0.children.0.breakdown.build_scorer/]
// TESTRESPONSE[s/"create_weight": 1669564/"create_weight": $body.profile.shards.0.searches.0.query.0.children.0.breakdown.create_weight/]
// TESTRESPONSE[s/"next_doc": 10111/"next_doc": $body.profile.shards.0.searches.0.query.0.children.0.breakdown.next_doc/]
// TESTRESPONSE[s/"time_in_nanos": "210682"/"time_in_nanos": $body.profile.shards.0.searches.0.query.0.children.1.time_in_nanos/]
// TESTRESPONSE[s/"score": 4552/"score": $body.profile.shards.0.searches.0.query.0.children.1.breakdown.score/]
// TESTRESPONSE[s/"build_scorer": 42602/"build_scorer": $body.profile.shards.0.searches.0.query.0.children.1.breakdown.build_scorer/]
// TESTRESPONSE[s/"create_weight": 89323/"create_weight": $body.profile.shards.0.searches.0.query.0.children.1.breakdown.create_weight/]
// TESTRESPONSE[s/"next_doc": 2852/"next_doc": $body.profile.shards.0.searches.0.query.0.children.1.breakdown.next_doc/]
// TESTRESPONSE[s/"time_in_nanos": "304311"/"time_in_nanos": $body.profile.shards.0.searches.0.collector.0.time_in_nanos/]
// TESTRESPONSE[s/"time_in_nanos": "32273"/"time_in_nanos": $body.profile.shards.0.searches.0.collector.0.children.0.time_in_nanos/]
```
To how you can cavalierly mock all the numbers at once with this change:
```
// TESTRESPONSE[s/(?<=[" ])\d+(\.\d+)?/$body.$_path/]
```
Currently a `delete document` request against a non-existing index actually **creates** this index.
With this change the `delete document` no longer creates the previously non-existing index and throws an `index_not_found` exception instead.
However as discussed in https://github.com/elastic/elasticsearch/pull/15451#issuecomment-165772026, if an external version is explicitly used, the current behavior is preserved and the index is still created and the document is marked for deletion.
Fixes#15425
Native scripts have been replaced in documentation by implementing
a ScriptEngine and they were deprecated in 5.5.0. This commit
removes the native script infrastructure for 6.0.
closes#19966
This PR adds a new thread pool type: `fixed_auto_queue_size`. This thread pool
behaves like a regular `fixed` threadpool, except that every
`auto_queue_frame_size` operations (default: 10,000) in the thread pool,
[Little's Law](https://en.wikipedia.org/wiki/Little's_law) is calculated and
used to adjust the pool's `queue_size` either up or down by 50. A minimum and
maximum is taken into account also. When the min and max are the same value, a
regular fixed executor is used instead.
The `SEARCH` threadpool is changed to use this new type of thread pool. However,
the min and max are both set to 1000, meaning auto adjustment is opt-in rather
than opt-out.
Resolves#3890
In scripts (at least some of the languages), the terms dictionary and
postings can be access with the special _index variable. This is for
very advanced use cases which want to do their own scoring. The problem
is segment level statistics must be recomputed for every document.
Additionally, this is not friendly to the terms index caching as the
order of looking up terms should be controlled by lucene.
This change removes _index from scripts. Anyone using it can and should
instead write a Similarity plugin, which is explicitly designed to allow
doing the calculations needed for a relevance score.
closes#19359
Today when an index is `read-only` the index is also blocked from
being deleted which sometimes is undesired since in-order to make
changes to a cluster indices must be deleted to free up space. This is
a likely scenario in a hosted environment when disk-space is limited to switch
indices read-only but allow deletions to free up space.
Our strong recommendation is disabling swap over any other alternative
to avoid the JVM from landing on disk. This commit clarifies the docs in
this regard.
This commit documents how to write a `ScriptEngine` in order to use
expert internal apis, such as using Lucene directly to find index term
statistics. These documents prepare the way to remove both native
scripts and IndexLookup.
The example java code is actually compiled and tested under a new gradle
subproject for example plugins. This change does not yet breakup
jvm-example into the new examples dir, which should be done separately.
relates #19359
relates #19966
This commit adds support for histogram and date_histogram agg compound order by refactoring and reusing terms agg order code. The major change is that the Terms.Order and Histogram.Order classes have been replaced/refactored into a new class BucketOrder. This is a breaking change for the Java Transport API. For backward compatibility with previous ES versions the (date)histogram compound order will use the first order. Also the _term and _time aggregation order keys have been deprecated; replaced by _key.
Relates to #20003: now that all these aggregations use the same order code, it should be easier to move validation to parse time (as a follow up PR).
Relates to #14771: histogram and date_histogram aggregation order will now be validated at reduce time.
Closes#23613: if a single BucketOrder that is not a tie-breaker is added with the Java Transport API, it will be converted into a CompoundOrder with a tie-breaker.
Currently, the get snapshots API (e.g. /_snapshot/{repositoryName}/_all)
provides information about snapshots in the repository, including the
snapshot state, number of shards snapshotted, failures, etc. In order
to provide information about each snapshot in the repository, the call
must read the snapshot metadata blob (`snap-{snapshot_uuid}.dat`) for
every snapshot. In cloud-based repositories, this can be expensive,
both from a cost and performance perspective. Sometimes, all the user
wants is to retrieve all the names/uuids of each snapshot, and the
indices that went into each snapshot, without any of the other status
information about the snapshot. This minimal information can be
retrieved from the repository index blob (`index-N`) without needing to
read each snapshot metadata blob.
This commit enhances the get snapshots API with an optional `verbose`
parameter. If `verbose` is set to false on the request, then the get
snapshots API will only retrieve the minimal information about each
snapshot (the name, uuid, and indices in the snapshot), and only read
this information from the repository index blob, thereby giving users
the option to retrieve the snapshots in a repository in a more
cost-effective and efficient manner.
Closes#24288
Now that indices have a single type by default, we can move to the next step
and identify documents using their `_id` rather than the `_uid`.
One notable change in this commit is that I made deletions implicitly create
types. This helps with the live version map in the case that documents are
deleted before the first type is introduced. Otherwise there would be no way
to differenciate `DELETE index/foo/1` followed by `PUT index/foo/1` from
`DELETE index/bar/1` followed by `PUT index/foo/1`, even though those are
different if versioning is involved.
This commit adds support for indexing and searching a new ip_range field type. Both IPv4 and IPv6 formats are supported. Tests are updated and docs are added.
`_search_shards`API today only returns aliases names if there is an alias
filter associated with one of them. Now it can be useful to see which aliases
have been expanded for an index given the index expressions. This change also includes non-filtering aliases even without a filtering alias being present.
Adds CONSOLE to cross-cluster-search docs but skips them for testing
because we don't have a second cluster set up. This gets us the
`VIEW IN CONSOLE` and `COPY AS CURL` links and makes sure that they
are valid yaml (not json, technically) but doesn't get testing.
Which is better than we had before.
Adds CONSOLE to the dynamic templates docs and ingest-node docs.
The ingest-node docs contain a *ton* of non-console snippets. We
might want to convert them to full examples later, but that can be
a separate thing.
Relates to #18160
Rewrites most of the snippets in the `innert_hits` docs to be
complete examples and enables `VIEW IN CONSOLE`, `COPY AS CURL`,
and automatic testing of the snippets.
Today we go to heroic lengths to workaround bugs in the JDK or around
issues like BSD jails to get information about the underlying file
store. For example, we went to lengths to work around a JDK bug where
the file store returned would incorrectly report whether or not a path
is writable in certain situations in Windows operating
systems. Another bug prevented getting file store information on
Windows on a virtual drive on Windows. We no longer need to work
around these bugs, we could simply try to write to disk and let an I/O
exception arise if we could not write to the disk or take advantage of
the fact that these bugs are fixed in recent releases of the JDK
(e.g., the file store bug is fixed since 8u72). Additionally, we
collected information about all file stores on the system which meant
that if the user had a stale NFS mount, Elasticsearch could hang and
fail on startup if that mount point was not available. Finally, we
collected information through Lucene about whether or not a disk was a
spinning disk versus an SSD, information that we do not need since we
assume SSDs by default. This commit takes into consideration that we
simply do not need this heroic effort, we do not need information
about all file stores, and we do not need information about whether or
not a disk spins to greatly simplfy file store handling.
Relates #24402
Open/Close index api have allow_no_indices set to false by default, while delete index has it set to true. The flag controls where a wildcard expression that matches no indices will be ignored or an error will be thrown instead. This commit aligns open/close default behaviour to that of delete index.
Add info about the base image used and the github repo of
elasticsearch-docker.
Clarify that setting `memlock=-1:-1` is only a requirement when
`bootstrap_memory_lock=true` and the alternatives we document
elsewhere in docs for disabling swap are valid for Docker as well.
Additionally, with latest versions of docker-ce shipping with
unlimited (or high enough) defaults for `nofile` and `nproc`, clarify
that explicitly setting those per ES container is not required, unless
they are not defined in the Docker daemon.
Finally simplify production `docker-compose.yml` example by removing
unneeded options.
Relates #24389
This change makes the request builder code-path same as `Client#execute`. The request builder used to return a `ListenableActionFuture` when calling execute, which allows to associate listeners with the returned future. For async execution though it is recommended to use the `execute` method that accepts an `ActionListener`, like users would do when using `Client#execute`.
Relates to #24412
Relates to #9201
The _cat/shards docs asserted that one of the columns looked like
a propery byte size but used a regex like `\d+\.\d+.*` which doesn't
match `0b` which is a possible value. Instead this uses
`\d(\.\d+)?[kmg]?b`.
The response is attempting to illustrate the sync_id marker, but in
the test the index is too "fresh" to have a sync marker. So the test
needs to execute a sync flush behind the scenes so that the marker
is present
Currently we don't write the count value to the geo_centroid aggregation rest response,
but it is provided via the java api and the count() method in the GeoCentroid interface.
We should add this parameter to the rest output and also provide it via the getProperty()
method.
The alias parameter was documented as a list in our rest-spec, yet only the first value out of a list was getting read and processed. This commit adds support for multiple aliases to _cat/aliases
Closes#23661
This adds the `index.mapping.single_type` setting, which enforces that indices
have at most one type when it is true. The default value is true for 6.0+ indices
and false for old indices.
Relates #15613
Docs: rewrite description of `bool`'s `should`
Rewrites the description of the `bool` query's `should`
clauses so it is (hopefully) more clear what the defaults
for `minimum_should_match` are.
There is still an `[IMPORTANT]` section about `minimum_should_match`
in a filter context. I think it is worth keeping because it is, well,
important.
Closes#23831
This commit adds a link to the minimum master nodes section of the
important settings docs from the Zen discovery docs to clarify the
meaning and importance of setting minimum master nodes to a quorum of
master-eligible nodes.
Relates #24311
The `count` value in the stats aggregation represents a simple doc count
that doesn't require a formatted version. We didn't render an "as_string"
version for count in the rest response, so the method should also be
removed in favour of just using String.valueOf(getCount()) if a string
version of the count is needed.
Closes#24287
I just spent ages debugging a script I wrote after following the documentation. It was not clear to me that _index is not defined when using painless; if it was mentioned on this page I would have saved myself a lot of time.
For the Windows service, JAVA_HOME should be set to the path to the
JDK. We should make this clear in the docs to help users avoid
frustrating startup problems.
Relates #24260
Add option "enable_position_increments" with default value true.
If option is set to false, indexed value is the number of tokens
(not position increments count)
This commit rewords the note on whitespace in Log4j settings to not
refer to only of the examples on the page, but instead be clear that the
note applies to all the examples on the page.
A confusing thing that can happen when configuring Log4j is that
extraneous whitespace throws off its configuration parsing yet the error
messages that arise give no indication that this is the problem. This
commit adds a note to the docs.
Relates #24198
This commit removes the deprecated cloud.aws.* settings. It also removes
backcompat for specifying `discovery.type: ec2`, and unused aws signer
code which was removed in a previous PR.
This change adds an index setting to define how the documents should be sorted inside each Segment.
It allows any numeric, date, boolean or keyword field inside a mapping to be used to sort the index on disk.
It is not allowed to use a `nested` fields inside an index that defines an index sorting since `nested` fields relies on the original sort of the index.
This change does not add early termination capabilities in the search layer. This will be added in a follow up.
Relates #6720
Elasticsearch runs as user elasticsearch with uid:gid 1000:1000 inside
the Docker container. Clarify that bind mounted local directories need
to be accessible by this user.
Relates #24092
We want to upgrade to Lucene 7 ahead of time in order to be able to check whether it causes any trouble to Elasticsearch before Lucene 7.0 gets released. From a user perspective, the main benefit of this upgrade is the enhanced support for sparse fields, whose resource consumption is now function of the number of docs that have a value rather than the total number of docs in the index.
Some notes about the change:
- it includes the deprecation of the `disable_coord` parameter of the `bool` and `common_terms` queries: Lucene has removed support for coord factors
- it includes the deprecation of the `index.similarity.base` expert setting, since it was only useful to configure coords and query norms, which have both been removed
- two tests have been marked with `@AwaitsFix` because of #23966, which we intend to address after the merge
The docs don't clearly explain that the deleted doc count also comes from lucene.
IMHO, it is worth highlighting this information separately, as a Note.
Apart from that, there should be an official recommended alternative as well.
Today Elasticsearch allows default settings to be used only if the
actual setting is not set. These settings are trappy, and the complexity
invites bugs. This commit removes support for default settings with the
exception of default.path.data, default.path.conf, and default.path.logs
which are maintainted to support packaging. A follow-up will remove
support for these as well.
Relates #24093
Drops any mention of non-sandboxed scripting languages other than a
brief "we don't support them and we shouldn't because A and B"
statement.
Relates to #23930
Now that we have incremental reduce functions for topN and aggregations
we can set the default for `action.search.shard_count.limit` to unlimited.
This still allows users to restrict these settings while by default we executed
across all shards matching the search requests index pattern.
_field_stats has evolved quite a lot to become a multi purpose API capable of retrieving the field capabilities and the min/max value for a field.
In the mean time a more focused API called `_field_caps` has been added, this enpoint is a good replacement for _field_stats since he can
retrieve the field capabilities by just looking at the field mapping (no lookup in the index structures).
Also the recent improvement made to range queries makes the _field_stats API obsolete since this queries are now rewritten per shard based on the min/max found for the field.
This means that a range query that does not match any document in a shard can return quickly and can be cached efficiently.
For these reasons this change deprecates _field_stats. The deprecation should happen in 5.4 but we won't remove this API in 6.x yet which is why
this PR is made directly to 6.0.
The rest tests have also been adapted to not throw an error while this change is backported to 5.4.
This commit removes some leniency from the plugin service which skips
hidden files in the plugins directory. We really want to ensure the
integrity of the plugin folder, so hasta la vista leniency.
Relates #23982
They needed to be updated now that Painless is the default and
the non-sandboxed scripting languages are going away or gone.
I dropped the entire section about customizing the classloader
whitelists. In master this barely does anything (exposes more
things to expressions).
Before now ranges where forbidden, because the percolator query itself could get cached and then the percolator queries with now ranges that should no longer match, incorrectly will continue to match.
By disabling caching when the `percolator` is being used, the percolator can now correctly support range queries with now based ranges.
I think this is the right tradeoff. The percolator query is likely to not be the same between search requests and disabling range queries with now ranges really disabled people using the percolator for their use cases.
Also fixed an issue that existed in the percolator fieldmapper, it was unable to find forbidden queries inside `dismax` queries.
Closes#23859
While there are use-cases where a single-node is in production, there
are also use-cases for starting a single-node that binds transport to an
external interface where the node is not in production (for example, for
testing the transport client against a node started in a Docker
container). It's tricky to balance the desire to always enforce the
bootstrap checks when a node might be in production with the need for
the community to perform testing in situations that would trip the
bootstrap checks. This commit enables some flexibility for these
users. By setting the discovery type to "single-node", we disable the
bootstrap checks independently of how transport is bound. While this
sounds like a hole in the bootstrap checks, the bootstrap checks can
already be avoided in the single-node use-case by binding only HTTP but
not transport. For users that are genuinely in production on a
single-node use-case with transport bound to an external use-case, they
can set the system property "es.enable.bootstrap.checks" to force
running the bootstrap checks. It would be a mistake for them not to do
this.
Relates #23598
Fielddata can no longer be configured to be loaded eagerly (it only accepts
`true` and `false`), so this line is a little misleading because it talks about
a procedure we can no longer do.
Converts the analysis docs to that were marked as json into `CONSOLE`
format. A few of them were in yaml but marked as json for historical
reasons. I added more complete examples for a few of the less obvious
sounding ones.
Relates to #18160
The pattern-analyzer docs contained a snippet that was an expanded
regex that was marked as `[source,js]`. This changes it to
`[source,regex]`.
The htmlstrip-charfilter and pattern-replace-charfilter docs had
examples that were actually a list of tokens but marked `[source,js]`.
This marks them as `[source,text]` so they don't count as unconverted
CONSOLE snippets.
The pattern-replace-charfilter also had a doc who's test was
skipped because of funny interaction with the test framework. This
fixes the test.
Three more down, eighty-two to go.
Relates to #18160
CONSOLEifies the lang-analyzer docs and replaces the (invalid)
empty `keyword_marker` setups that were on the page with one
that contains the word "example" translated into the appropriate
language.
Relates to #18160
This change introduces a new API called `_field_caps` that allows to retrieve the capabilities of specific fields.
Example:
````
GET t,s,v,w/_field_caps?fields=field1,field2
````
... returns:
````
{
"fields": {
"field1": {
"string": {
"searchable": true,
"aggregatable": true
}
},
"field2": {
"keyword": {
"searchable": false,
"aggregatable": true,
"non_searchable_indices": ["t"]
"indices": ["t", "s"]
},
"long": {
"searchable": true,
"aggregatable": false,
"non_aggregatable_indices": ["v"]
"indices": ["v", "w"]
}
}
}
}
````
In this example `field1` have the same type `text` across the requested indices `t`, `s`, `v`, `w`.
Conversely `field2` is defined with two conflicting types `keyword` and `long`.
Note that `_field_caps` does not treat this case as an error but rather return the list of unique types seen for this field.
I managed to push the last one without testing it because I'd changed
the way I run tests locally and hadn't picked it up. Ooops. This one
works better.
All the docs for the `exists` query that aren't marked as `CONSOLE`
aren't actually `CONSOLE`-worthy so this marks them as `NOTCONSOLE`.
It also rewrites the text around `missing` query. Since it was
removed in 5.0 we don't need to talk about it in the 6.0 docs.
Relates to #18160
Turns the top example in each of the geo aggregation docs into a working
example that can be opened in CONSOLE. Subsequent examples can all also
be opened in console and will work after you've run the first example.
All examples are tested as part of the build.
This commit clarifies the preference docs regarding the explanation of
how operations are routed by default. In particular, the previous use of
"shard replicas" was confusing as it could imply an operation would only
be routed to replicas by default.
Relates #23794
This commit adds support for the pattern keyword marker filter in
Lucene. Previously, the keyword marker filter in Elasticsearch
supported specifying a keywords set or a path to a set of keywords.
This commit exposes the regular expression pattern based keyword marker
filter also available in Lucene, so that any token matching the pattern
specified by the `keywords_pattern` setting is excluded from being
stemmed by any stemming filters.
Closes#4877
As the query of a search request defaults to match_all,
calling _delete_by_query without an explicit query may
result in deleting all data.
In order to protect users against falling into that
pitfall, this commit adds a check to require the explicit
setting of a query.
Closes#23629
This commit adds the boolean similarity scoring from Lucene to
Elasticsearch. The boolean similarity provides a means to specify that
a field should not be scored with typical full-text ranking algorithms,
but rather just whether the query terms match the document or not.
Boolean similarity scores a query term equal to its query boost only.
Boolean similarity is available as a default similarity option and thus
a field can be specified to have boolean similarity by declaring in its
mapping:
"similarity": "boolean"
Closes#6731
The OpenJDK project provides early-access builds of upcoming
releases. These early-access builds are not suitable for
production. These builds sometimes end up on systems due to aggressive
packaging (e.g., Ubuntu). This commit adds a bootstrap check to ensure
these early-access builds are not being used in production.
Relates #23743
This is especially useful when we rewrite the query because the result of the rewrite can be very different on different shards. See #18254 for example.
* Add support for fragment_length in the unified highlighter
This commit introduce a new break iterator (a BoundedBreakIterator) designed for the unified highlighter
that is able to limit the size of fragments produced by generic break iterator like `sentence`.
The `unified` highlighter now supports `boundary_scanner` which can `words` or `sentence`.
The `sentence` mode will use the bounded break iterator in order to limit the size of the sentence to `fragment_length`.
When sentences bigger than `fragment_length` are produced, this mode will break the sentence at the next word boundary **after**
`fragment_length` is reached.
The reindex API is mature now, and we will work to maintain backwards
compatibility in accordance with our backwards compatibility
policy. This commit unmarks the reindex API as experimental.
Relates #23621
In SI units, "kilobyte" or "kB" would mean 1000 bytes, whereas "KiB" is
used for 1024. Add a note in `api-conventions.asciidoc` to clarify the
meaning in Elasticsearch.
This commit adds a system property that enables end-users to explicitly
enforce the bootstrap checks, independently of the binding of the
transport protocol. This can be useful for single-node production
systems that do not bind the transport protocol (and thus the bootstrap
checks would not be enforced).
Relates #23585
This commit adds the size of the cluster state to the response for the
get cluster state API call (GET /_cluster/state). The size that is
returned is the size of the full cluster state in bytes when compressed.
This is the same size of the full cluster state when serialized to
transmit over the network. Specifying the ?human flag displays the
compressed size in a more human friendly manner. Note that even if the
cluster state request filters items from the cluster state (so a subset
of the cluster state is returned), the size that is returned is the
compressed size of the entire cluster state.
Closes#3415
This change exposes the new Lucene graph based word delimiter token filter in the analysis filters.
Unlike the `word_delimiter` this token filter named `word_delimiter_graph` correctly handles multi terms expansion at query time.
Closes#23104
This commit adds a boundary_scanner property to the search highlight
request so the user can specify different boundary scanners:
* `chars` (default, current behavior)
* `word` Use a WordBreakIterator
* `sentence` Use a SentenceBreakIterator
This commit also adds "boundary_scanner_locale" to define which locale
should be used when scanning the text.
Today we have multiple ways to define settings when a user needs to create a repository:
* in `elasticsearch.yml` file using `repositories.azure` prefix
* when creating the repository itself with `PUT _snaphot/repo`
The plan is to:
* Deprecate `repositories.azure` settings in 5.x (done with #22856)
* Remove in 6.x (this PR)
Related to #22800
This commit enforces the requirement of Content-Type for the REST layer and removes the deprecated methods in transport
requests and their usages.
While doing this, it turns out that there are many places where *Entity classes are used from the apache http client
libraries and many of these usages did not specify the content type. The methods that do not specify a content type
explicitly have been added to forbidden apis to prevent more of these from entering our code base.
Relates #19388
* Console-ify curl statements for allocation explain API docs
Relates to #23001
* Fix tests
* Remove exclusion from build.gradle
* Call out index creation in prose
* Add console back and skip test
This commit brings the snapshot documentation in conformity
with the CONSOLE format, and fixes the docs so that the documentation
tests can be run against them.
When nested objects are present in the mappings, many queries get deoptimized
due to the need to exclude documents that are not in the right space. For
instance, a filter is applied to all queries that prevents them from matching
non-root documents (`+*:* -_type:__*`). Moreover, a filter is applied to all
child queries of `nested` queries in order to make sure that the child query
only matches child documents (`_type:__nested_path`), which is required by
`ToParentBlockJoinQuery` (the Lucene query behing Elasticsearch's `nested`
queries).
These additional filters slow down `nested` queries. In 1.7-, the cost was
somehow amortized by the fact that we cached filters very aggressively. However,
this has proven to be a significant source of slow downs since 2.0 for users
of `nested` mappings and queries, see #20797.
This change makes the filtering a bit smarter. For instance if the query is a
`match_all` query, then we need to exclude nested docs. However, if the query
is `foo: bar` then it may only match root documents since `foo` is a top-level
field, so no additional filtering is required.
Another improvement is to use a `FILTER` clause on all types rather than a
`MUST_NOT` clause on all nested paths when possible since `FILTER` clauses
are more efficient.
Here are some examples of queries and how they get rewritten:
```
"match_all": {}
```
This query gets rewritten to `ConstantScore(+*:* -_type:__*)` on master and
`ConstantScore(_type:AutomatonQuery {\norg.apache.lucene.util.automaton.Automaton@4371da44})`
with this change. The automaton is the complement of `_type:__*` so it matches
the same documents, but is faster since it is now a positive clause. Simplistic
performance testing on a 10M index where each root document has 5 nested
documents on average gave a latency of 420ms on master and 90ms with this change
applied.
```
"term": {
"foo": {
"value": "0"
}
}
```
This query is rewritten to `+foo:0 #(ConstantScore(+*:* -_type:__*))^0.0` on
master and `foo:0` with this change: we do not need to filter nested docs out
since the query cannot match nested docs. While doing performance testing in
the same conditions as above, response times went from 250ms to 50ms.
```
"nested": {
"path": "nested",
"query": {
"term": {
"nested.foo": {
"value": "0"
}
}
}
}
```
This query is rewritten to
`+ToParentBlockJoinQuery (+nested.foo:0 #_type:__nested) #(ConstantScore(+*:* -_type:__*))^0.0`
on master and `ToParentBlockJoinQuery (nested.foo:0)` with this change. The
top-level filter (`-_type:__*`) could be removed since `nested` queries only
match documents of the parent space, as well as the child filter
(`#_type:__nested`) since the child query may only match nested docs since the
`nested` object has both `include_in_parent` and `include_in_root` set to
`false`. While doing performance testing in the same conditions as above,
response times went from 850ms to 270ms.
This pull request reuses the typed_keys parameter added in #22965, but this time it applies it to suggesters. When set to true, the suggester names in the search response will be prefixed with a prefix that reflects their type.
This changes removes the SearchResponseListener that was used by the ExpandCollapseSearchResponseListener to expand collapsed hits.
The removal of SearchResponseListener is not a breaking change because it was never released.
This change also replace the blocking call in ExpandCollapseSearchResponseListener by a single asynchronous multi search request. The parallelism of the expand request can be set via CollapseBuilder#max_concurrent_group_searches
Closes#23048
This pull request adds a new parameter to the REST Search API named `typed_keys`. When set to true, the aggregation names in the search response will be prefixed with a prefix that reflects the internal type of the aggregation.
Here is a simple example:
```
GET /_search?typed_keys
{
"aggs": {
"tweets_per_user": {
"terms": {
"field": "user"
}
}
},
"size": 0
}
```
And the response:
```
{
"aggs": {
"sterms:tweets_per_user": {
...
}
}
}
```
This parameter is intended to make life easier for REST clients that could parse back the prefix and could detect the type of the aggregation to parse. It could also be implemented for suggesters.
This commit removes support for the `application/x-ldjson` Content-Type header as this was only used in the first draft
of the spec and had very little uptake. Additionally, the docs for bulk and msearch have been updated to specifically
call out ndjson and mention that the newline character may be preceded by a carriage return.
Finally, the bulk request handling of the carriage return has been improved to remove this character from the source.
Closes#23025
Elasticsearch v5.0.0 uses allocation IDs to safely allocate primary shards whereas prior versions of ES used a version-based mode instead. Elasticsearch v5 still has support for version-based primary shard allocation as it needs to be able to load 2.x shards. ES v6 can drop the legacy support.
Since `_all` is now deprecated and cannot be set for new indices, we should also
disallow any field that has the `include_in_all` parameter set.
Resolves#22923
These need to be CONSOLEified *now* because we're starting to
require Content-Type headers and they didn't have any.
* cluster/reroute: Marked as CONSOLE but skipped because the docs
build runs with a single node.
* docs/bulk: Marked as NOTCONSOLE because the snippets describe
either examples or `curl` commands. Fixed the `curl` command to
include the `Content-Type` header.
* query-dsl/terms-query: Marked as CONSOLE.
* search/request/rescore: Marked as CONSOLE. Fixed deprecated
syntax.
Relates #23001
Relates #18160
This adds the `COPY AS CURL` and `VIEW IN CONSOLE` links to the docs
and causes the snippets to be tested during Elasticsearch's build.
Relates to #18160
This adds the `COPY AS CURL` and `VIEW IN CONSOLE` buttons to the
docs and makes the build execute the snippets as part of `docs:check`.
Relates to #18160
Painless uses Ruby-like method dispatch (reciever type, method name,
and arity) rather than Java-like (reciever type, method name, and
argument compile time types) or Groovy-like method dispatch (receiver
type, method name, and argument run time types). We do this for
mostly good reasons but we never documented it.
Relates to #22720
Today either all nodes in the cluster connect to remote clusters of only nodes
that have remote clusters configured in their node config. To allow global remote
cluster configuration but restrict connections to a set of nodes in the cluster
this change adds a new setting `search.remote.connect` (defaults to `true`) to allow
to disable remote cluster connections on a per node basis.
This changes the way that replica failures are handled such that not all
failures will cause the replica shard to be failed or marked as stale.
In some cases such as refresh operations, or global checkpoint syncs, it is
"okay" for the operation to fail without the shard being failed (because no data
is out of sync). In these cases, instead of failing the shard we should simply
fail the operation, and, in the event it is a user-facing operation, return a
5xx response code including the shard-specific failures.
This was accomplished by having two forms of the `Replicas` proxy, one that is
for non-write operations that does not fail the shard, and one that is for write
operations that will fail the shard when an operation fails.
Relates to #10708
It was accidentally renamed `enabled_position_increment` in the cleanups
for 5.0. This adds `enable_position_increment` as a deprecated alias
so it will continue to work.
This commit removes the following queries and parameters (which were deprecated in 5.0):
* GeoPointDistanceRangeQuery
* coerce, and ignore_malformed for GeoBoundingBoxQuery, GeoDistanceQuery, GeoPolygonQuery, and GeoDistanceSort
This change also removes the reference to the difference bewteen full name and index name.
They are always the same since 2.x and `name` does not refer anymore to `author.name` automatically.
A simple pattern must be used instead.
Remove redundant code that checks the field name twice.
This change adds a strict mode for xcontent parsing on the rest layer. The strict mode will be off by default for 5.x and in a separate commit will be enabled by default for 6.0. The strict mode, which can be enabled by setting `http.content_type.required: true` in 5.x, will require that all incoming rest requests have a valid and supported content type header before the request is dispatched. In the non-strict mode, the Content-Type header will be inspected and if it is not present or not valid, we will continue with auto detection of content like we have done previously.
The content type header is parsed to the matching XContentType value with the only exception being for plain text requests. This value is then passed on with the content bytes so that we can reduce the number of places where we need to auto-detect the content type.
As part of this, many transport requests and builders were updated to provide methods that
accepted the XContentType along with the bytes and the methods that would rely on auto-detection have been deprecated.
In the non-strict mode, deprecation warnings are issued whenever a request with body doesn't provide the Content-Type header.
See #19388
GeoDistance query, sort, and scripts make use of a crazy GeoDistance enum for handling 4 different ways of computing geo distance: SLOPPY_ARC, ARC, FACTOR, and PLANE. Only two of these are necessary: ARC, PLANE. This commit removes SLOPPY_ARC, and FACTOR and cleans up the way Geo distance is computed.
Implemented by wrapping an array of reused `ModuleDateTime`s that
we grow when needed. The `ModuleDateTime`s are reused when we
move to the next document.
Also improves the error message returned when attempting to modify
the `ScriptdocValues`, removes a couple of allocations, and documents
that the date functions are available in Painless.
Relates to #22162
Currently, stored scripts use a namespace of (lang, id) to be put, get, deleted, and executed. This is not necessary since the lang is stored with the stored script. A user should only have to specify an id to use a stored script. This change makes that possible while keeping backwards compatibility with the previous namespace of (lang, id). Anywhere the previous namespace is used will log deprecation warnings.
The new behavior is the following:
When a user specifies a stored script, that script will be stored under both the new namespace and old namespace.
Take for example script 'A' with lang 'L0' and data 'D0'. If we add script 'A' to the empty set, the scripts map will be ["A" -- D0, "A#L0" -- D0]. If a script 'A' with lang 'L1' and data 'D1' is then added, the scripts map will be ["A" -- D1, "A#L1" -- D1, "A#L0" -- D0].
When a user deletes a stored script, that script will be deleted from both the new namespace (if it exists) and the old namespace.
Take for example a scripts map with {"A" -- D1, "A#L1" -- D1, "A#L0" -- D0}. If a script is removed specified by an id 'A' and lang null then the scripts map will be {"A#L0" -- D0}. To remove the final script, the deprecated namespace must be used, so an id 'A' and lang 'L0' would need to be specified.
When a user gets/executes a stored script, if the new namespace is used then the script will be retrieved/executed using only 'id', and if the old namespace is used then the script will be retrieved/executed using 'id' and 'lang'
* Integrate UnifiedHighlighter
This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES.
This highlighter can extract offsets from either postings, term vectors, or via re-analyzing text.
The best strategy is picked automatically at query time and depends on the field and the query to highlight.
Added missing CONSOLE scripts to documentation for sampler and diversified_sampler aggs.
Includes new StackOverflow index setup in build.gradle
Closes#22746
* Formatting tweaks
This change removes the ability to set region for s3 repositories.
Endpoint should be used instead if a custom s3 location needs to be
used.
closes#22758
Follow up of #22857 where we deprecate automatic creation of azure containers.
BTW I found that the `AzureSnapshotRestoreServiceIntegTests` does not bring any value because it runs basically a Snapshot/Restore operation on local files which we already test in core.
So instead of trying to fix it to make it pass with this PR, I simply removed it.
Adds "Appending B. Painless API Reference", a reference of all classes
and methods available from Painless. Removes links to java packages
because they contain methods that we don't expose and don't contain
methods that we do expose (the ones in Augmentation). Instead this
generates a list of every class and every exposed method using the same
type information available to the
interpreter/compiler/whatever-we-call-it. From there you can jump to
the relevant docs.
Right now you build all the asciidoc files by running
```
gradle generatePainlessApi
```
These files are expected to be committed because we build the docs
without running `gradle`.
Also changes the output of `Debug.explain` so that it is easy to
search for the class in the generated reference documentation.
You can also run it in an IDE safely if you pass the path to the
directory in which to generate the docs as the first parameter. It'll
blow away the entire directory an recreate it from scratch so be careful.
And then you can build the docs by running something like:
```
../docs/build_docs.pl --out ../built_docs/ --doc docs/reference/index.asciidoc --open
```
That is, if you have checked out https://github.com/elastic/docs in
`../docs`. Wait a minute or two and your browser will pop open in with
all of Elasticsearch's reference documentation. If you go to
`http://localhost:8000/painless-api-reference.html` you can see this
list. Or you can get there by following the links to `Modules` and
`Scripting` and `Painless` and then clicking the link in the paragraphs
below titled `Appendix B. Painless API Reference`.
I like having these in asciidoc because we can deep link to them from the
rest of the guide with constructs like
`<<painless-api-reference-Object-hashCode-0>>` and
`<<painless-api-reference->>` and we get link checking. Then the only
brittle link maintenance bit is the link generation for javadoc. Which
sucks. But I think it is important that we link to the methods directly
so they are easy to find.
Relates to #22720
Add unit tests for `TopHitsAggregator` and convert some snippets in
docs for `top_hits` aggregation to `// CONSOLE`.
Relates to #22278
Relates to #18160
There was a typo in the `ParseField` declaration. I know
we want to port these parsers to `ObjectParser` eventually
but I don't have the energy for that today and want to get
this fixed.
Closes#22722
When users need to specify a custom location for configuration files,
they also need to specify a custom location for the jvm.options file yet
our docs are absent in this regard. This commit adds a note to the
rolling upgrade docs explaining this situation.
Relates #22747
* Add top hits collapsing to search request
The field collapsing is done with a custom top docs collector that "collapse" search hits with same field value.
The distributed aspect is resolve using the two passes that the regular search uses. The first pass "collapse" the top hits, then the coordinating node merge/collapse the top hits from each shard.
```
GET _search
{
"collapse": {
"field": "category",
}
}
```
This change also adds an ExpandCollapseSearchResponseListener that intercepts the search response and expands collapsed hits using the CollapseBuilder#innerHit} options.
The retrieval of each inner_hits is done by sending a query to all shards filtered by the collapse key.
```
GET _search
{
"collapse": {
"field": "category",
"inner_hits": {
"size": 2
}
}
}
```
Adds the `VIEW IN CONSOLE` and `COPY AS CURL` links to the snippets
in the `value_count` docs and causes the build to execute the snippets
for testing.
Release #18160
This adds the `VIEW IN CONSOLE` and `COPY AS CURL` links to the
snippets in the docs for the `date_range` aggregation and tests
those snippets as part of the build.
Relates to #18160
This adds the `VIEW IN SENSE` and `COPY AS CURL` links and has
the build automatically execute the snippets and verify that they
work.
Relates to #18160
Adds the `VIEW IN CONSOLE` and `COPY AS CURL` links to the example
`global` aggregation. Also improves the example by adding a
non-`global` aggregation to compare it to.
Relates to #18160
Similar to the Filters aggregation but only supports "keyed" filter buckets and automatically "ANDs" pairs of filters to produce a form of adjacency matrix.
The intersection of buckets "A" and "B" is named "A&B" (the choice of separator is configurable). Empty intersection buckets are removed from the final results.
Closes#22169
This PR removes all leniency in the conversion of Strings to booleans: "true"
is converted to the boolean value `true`, "false" is converted to the boolean
value `false`. Everything else raises an error.
Relates to #22024
On top of documentation, the PR adds deprecation loggers and deals with the resulting warning headers.
The yaml test is set exclude versions up to 6.0. This is need to make sure bwc tests pass until this is backported to 5.2.0 . Once that's done, I will change the yaml test version limits
This change makes it possible for custom routing values to go to a subset of shards rather than
just a single shard. This enables the ability to utilize the spatial locality that custom routing can
provide while mitigating the likelihood of ending up with an imbalanced cluster or suffering
from a hot shard.
This is ideal for large multi-tenant indices with custom routing that suffer from one or both of
the following:
- The big tenants cannot fit into a single shard or there is so many of them that they will likely
end up on the same shard
- Tenants often have a surge in write traffic and a single shard cannot process it fast enough
Beyond that, this should also be useful for use cases where most queries are done under the context
of a specific field (e.g. a category) since it gives a hint at how the data can be stored to minimize
the number of shards to check per query. While a similar solution can be achieved with multiple
concrete indices or aliases per value today, those approaches breakdown for high cardinality fields.
A partitioned index enforces that mappings have routing required, that the partition size does not
change when shrinking an index (the partitions will shrink proportionally), and rejects mappings
that have parent/child relationships.
Closes#21585
This commit corrects the grammar in a list in the how-to docs; namely,
the word "and" was missing preceding the final element in a list.
Relates #22663
* Grammatical correction
* Add note for legacy string mapping type
* Update truncate token filter to not mention the keyword tokenizer
The advice predates the existence of the keyword field
Closes#22650
Currently both ProfileResult and CollectorResult print the time field in a human readable string format
(e.g. "time": "55.20315000ms"). When trying to parse this back to a long value, for example to use in
the planned high level java rest client, we can lose precision because of conversion and rounding issues.
This change adds a new additional field (`time_in_nanos`) to the profile response to be able to get the
original time value in nanoseconds back.
The old `time` field is only printed when the `?`human=true` flag in the url is set. This follow the behaviour for
all other stats-related apis. Also the format of the `time` field is slightly changed. Instead of always formatting
the output as a 10-digit ms value, by using the `XContentBuilder#timeValueField()` method we now print
the largest time unit present is used (e.g. "s", "ms", "micros").
For certain situations, end-users need the base path for Elasticsearch
logs. Exposing this as a property is better than hard-coding the path
into the logging configuration file as otherwise the logging
configuration file could easily diverge from the Elasticsearch
configuration file. Additionally, Elasticsearch will only have
permissions to write to the log directory configured in the
Elasticsearch configuration file. This commit adds a property that
exposes this base path.
One use-case for this is configuring a rollover strategy to retain logs
for a certain period of time. As such, we add an example of this to the
documentation.
Additionally, we expose the property es.logs.cluster_name as this is
used as the name of the log files in the default configuration.
Finally, we expose es.logs.node_name in cases where node.name is
explicitly set in case users want to include the node name as part of
the name of the log files.
Relates #22625
This change disables the _all meta field by default.
Now that we have the "all-fields" method of query execution, we can save both
indexing time and disk space by disabling it.
_all can no longer be configured for indices created after 6.0.
Relates to #20925 and #21341Resolves#19784
Today when you do not specify a port for an entry in
discovery.zen.ping.unicast.hosts, the default port is the value of the
setting transport.profiles.default.port and falls back to the value of
transport.tcp.port if this is not set. For a node that is explicitly
bound to a different port than the default port, this means that the
default port will be equal to this explicitly bound port. Yet, the docs
say that we fall back to 9300 here. This commit corrects the docs.
Relates #22568
Instead of `search.remote.seeds.${clustername}` we now specify the seeds as:
`search.remote.${clustername}.seeds` which is a real list setting compared to an unvalidated
group setting before.
Fix the `Dockerfile` example in the `Customizing image` third configuration
method by adding the missing RUN instruction.
Originally reported by Shankar Vasudevan (@vshank77).
Relates #21973
#9261 added a warning about the use of `add-apt-repository` which is becoming obsolete over time as new distribution releases include later versions of `add-apt-repository` which don't automatically add the `deb-src` line. This change updates the documentation to make the block a note rather than a warning and adds two other reasons for avoiding `add-apt-repository` which are still relevant: avoiding edits to a system shared file and not requiring a large number of non-default packages to add one line of text to a file.
The example
```
/<logstash-{now/d-2d}>,<logstash-{now/d-1d}>,<logstash-{now/d}>/_search
```
shows escaped URL where `, => %2C`, so I assume it should be escaped and be present in the table
This commit updates the cluster allocation explain API documentation to
explain the new request parameters and response formats, and gives
examples of the explain API responses under various scenarios.
This commit adds a document describing our data replication model in high level terms. The goal is give people basic insight into how things work in order to better understand how read and writes interact, both during normal operations and under failures.
* Promote longs to doubles when a terms agg mixes decimal and non-decimal number
This change makes the terms aggregation work when the buckets coming from different indices are a mix of decimal numbers and non-decimal numbers. In this case non-decimal number (longs) are promoted to decimal (double) which can result in a loss of precision for big numbers.
Fixes#22232
Provides an example of using is and an example return description
and explains that we've added descriptions for some tasks but not
even close to all of them. And that we expect to change the
descriptions as we learn more.
Closes#22407
* Fix example
Getting a single task is always detailed, no need to specify.
* Rewrite like imotov wants it
After deprecating getters and setters and the query DSL parameter in 5.x,
support for `minimum_number_should_match` can be removed entirely. Also
consolidated comments with the ones on 5.x branch and added an entry to the
migration docs.
The `Script source settings` section currently states that `false` means scripting is ENABLED.
The other sections seem to indicate that `false` means scripting is DISABLED.
If the current documentation is correct, that would imply that `inline` and `stored` scripting are ENABLED by default, which seems to conflict with all the other sections in the document.
This adds a new `normalizer` property to `keyword` fields that pre-processes the
field value prior to indexing, but without altering the `_source`. Note that
only the normalization components that work on a per-character basis are
applied, so for instance stemming filters will be ignored while lowercasing or
ascii folding will be applied.
Closes#18064
Today we try to pull stats from index writer but we do not get a
consistent view of stats. Under heavy indexing, this inconsistency can
be very skewed indeed. In particular, it can lead to the number of
deleted docs being reported as negative and this leads to serialization
issues. Instead, we should provide a consistent view of the stats by
using an index reader.
Relates #22317
Today we only expose `value_type` in scriptable aggregations, however it is
also useful with unmapped fields. I suspect we never noticed because
`value_type` was not documented (fixed) and most aggregations are scriptable.
Closes#20163