The shard preference _primary, _replica and its variants were useful
for the asynchronous replication. However, with the current impl, they
are no longer useful and should be removed.
Closes#26335
Add fuzzy_transpositions parameter to multi_match and query_string queries.
Add fuzzy_transpositions, fuzzy_prefix_length and fuzzy_max_expansions
parameters to simple_query_string query.
In 5.x pure wildcard queries `*` in `query_string` are rewritten to `exists` query for efficiency.
Though this introduced a change in the document that match such queries because
`exists` query also return documents with an empty value for the field.
This change clarifies this behavior for 5.x and beyond.
Closes#26801
* review
This change adds cgroup memory usage/limit to the OS stats section of
the node stats on Linux. This information is useful because in Docker
containers the standard node stats report the host memory limit, not
taking account of extra restrictions that may have been applied to the
container.
The original idea was to store these values as Long, truncating any values
outside the range of long. However, this meant that in the relatively common
case of no limit being applied, users would not see the same value in the OS
stats as they see by querying Linux directly. So instead the values are stored
as String. This change places a burden on consumers of the strings to
convert the strings to numbers and decide what to do about extremely large
values, but there will be very few consumers and they would need to have a
policy for dealing with "no limit" in any case.
Early termination with index sorting always return the best top N in the response but set the flag `terminated_early`
in the response. This can be confusing because we use the same flag for `terminate_after` which on the contrary returns partial results.
This change removes the flag when results are not partial (early termination due to index sorting) and keeps it only when `terminate_after` is used.
Closes#26408
Numeric fields no longer support the index_options parameter. This changes the parameter
to be rejected in numeric field types after it was deprecated in 6.0.
Closes#21475
Other tokenizers like the standard tokenizer allow overriding the default
maximum token length of 255 using the `"max_token_length` parameter. This change
enables using this parameter also with the whitespace tokenizer. The range that
is currently allowed is from 0 to StandardTokenizer.MAX_TOKEN_LENGTH_LIMIT,
which is 1024 * 1024 = 1048576 characters.
Closes#26643
The JVM defaults to dumping the heap to the working directory of
Elasticsearch. For the RPM and Debian packages, this location is
/usr/share/elasticsearch. This directory is not writable by the
elasticsearch user, so by default heap dumps in this situation are
lost. This commit modifies the packaging for the RPM and Debian packages
to set the heap dump path to /var/lib/elasticsearch as the default
location for dumping the heap. This location is writable by the
elasticsearch user by default. We add documentation of this important
setting if /var/lib/elasticsearch is not suitable for receiving heap
dumps.
Relates #26755
Adds the wait_for_active_shards parameter to the index open command. Similar to the index creation command, the index open command will now, by default, wait until the primaries have been allocated.
Closes#20937
The `fielddata` field and the use of the `_name` field in the short syntax of the range
query have been deprecated in 5.0 and can be removed.
The same goes for the deprecated `score_mode` field in HasParentQueryBuilder,
the deprecated `like_text`, `ids` and `docs` parameter in the `more_like_this` query,
the deprecated query name in the short version of the `regexp` query, and several
deprecated alternative field names in other query builders.
In this test, 260b is replaced by the regexp \d+b
but the test sometimes produces results like 1.1kb
so this commit adapts the regexp to match values
with decimals
The new ops based recovery, introduce as part of #10708, is based on the assumption that all operations below the global checkpoint known to the replica do not need to be synced with the primary. This is based on the guarantee that all ops below it are available on primary and they are equal. Under normal operations this guarantee holds. Sadly, it can be violated when a primary is restored from an old snapshot. At the point the restore primary can miss operations below the replica's global checkpoint, or even worse may have total different operations at the same spot. This PR introduces the notion of a history uuid to be able to capture the difference with the restored primary (in a follow up PR).
The History UUID is generated by a primary when it is first created and is synced to the replicas which are recovered via a file based recovery. The PR adds a requirement to ops based recovery to make sure that the history uuid of the source and the target are equal. Under normal operations, all shard copies will stay with that history uuid for the rest of the index lifetime and thus this is a noop. However, it gives us a place to guarantee we fall back to file base syncing in special events like a restore from snapshot (to be done as a follow up) and when someone calls the truncate translog command which can go wrong when combined with primary recovery (this is done in this PR).
We considered in the past to use the translog uuid for this function (i.e., sync it across copies) and thus avoid adding an extra identifier. This idea was rejected as it removes the ability to verify that a specific translog really belongs to a specific lucene index. We also feel that having a history uuid will serve us well in the future.
Removing several occurrences of this typo in the docs and javadocs, seems to be
a common mistake. Corrections turn up once in a while in PRs, better to correct
some of this in one sweep.
* Fix percolator highlight sub fetch phase to not highlight query twice
The PercolatorHighlightSubFetchPhase does not override hitExecute and since it extends HighlightPhase the search hits
are highlighted twice (by the highlight phase and then by the percolator). This does not alter the results, the second highlighting
just overrides the first one but this slow down the request because it duplicates the work.
Requesting to many script_fields in a search request can be costly
because of script execution. This change introduces a soft limit on the number
of script fields that are allowed per request. The setting can be
changed per index using the index.max_script_fields setting.
Relates to #26390
Requesting to many docvalue_fields in a search request can potentially be costly
because it might incur a per-field per-document seek. This change introduces a
soft limit on the number of fields that can be retrieved. The setting can be
changed per index using the `index.max_docvalue_fields_search` setting.
Relates to #26390
You can define a proxy using the following settings:
```yml
azure.client.default.proxy.host: proxy.host
azure.client.default.proxy.port: 8888
azure.client.default.proxy.type: http
```
Supported values for `proxy.type` are `direct`, `http` or `socks`. Defaults to `direct` (no proxy).
Closes#23506
BTW I changed a test `testGetSelectedClientBackoffPolicyNbRetries` as it was using an old setting name `cloud.azure.storage.azure.max_retries` instead of `azure.client.azure1.max_retries`.
Follow up for #23405.
We remove azure deprecated settings in 7.0:
* The legacy azure settings which where starting with `cloud.azure.storage.` prefix have been removed.
This includes `account`, `key`, `default` and `timeout`.
You need to use settings which are starting with `azure.client.` prefix instead.
* Global timeout setting `cloud.azure.storage.timeout` has been removed.
You must set it per azure client instead. Like `azure.client.default.timeout: 10s` for example.
* Limit the number of expanded fields it query_string and simple_query_string
This limits the number of automatically expanded fields for the "all fields"
mode (`"default_field": "*"`) for the `query_string` and `simple_query_string`
queries to 1024 fields.
Resolves#25105
* Add blurb about limit to the docs
The percolator will add a `_percolator_document_slot` field to all percolator
hits to indicate with what document it has matched. This number matches with
the order in which the documents have been specified in the percolate query.
Also improved the support for multiple percolate queries in a search request.
This change exposes the duplicate removal option added in Lucene for the completion suggester
with a new option called `skip_duplicates` (defaults to false).
This commit also adapts the custom suggest collector to handle deduplication when multiple contexts match the input.
Closes#23364
The current "Building Queries" and "Building Aggregations" pages are
located under the "Supported Apis" section because they are linked to
the "Search API" page.
It should instead be in a dedicated section: this commit adds a new
"Using Java Builders" section and renames few filenames in favor of
more meaningful names.
The `index.percolator.map_unmapped_fields_as_text` is a more better name, because unmapped fields are mapped to a text field with default settings
and string is no longer a field type (it is either keyword or text).
The definition of development vs. production mode has evolved slightly
over time (with the introduction of single-node) discovery. This commit
clarifies the documentation to better account for this adjustment.
Relates #26460
Adding a check to QueryStringQueryBuilderTests that checks the override
behaviour of `quote_analyzer`, also adding documentation explaining the use of
this parameter in `query_string` query.
Closes#25417
The current script service has a script compilation limit for a one
minute window. This is set to a small default value of 15. Instead of
increasing that default value, this commit introduces a new setting
that allows to configure a rate per time unit, so that the script service can deal with bursts better.
The new setting is named `script.max_compilations_rate`,
requires a nonnegative number and a positive time value.
The default is `75/5m`, which is equivalent to the existing 15 per minute.
Multi-level Nested Sort with Filters
Allow multiple levels of nested sorting where each level can have it's own filter.
Backward compatible with previous single-level nested sort.
* Remove the _all metadata field
This change removes the `_all` metadata field. This field is deprecated in 6
and cannot be activated for indices created in 6 so it can be safely removed in
the next major version (e.g. 7).
At current, we do not feel there is enough of a reason to shade the low
level rest client. It caused problems with commons logging and IDE's
during the brief time it was used. We did not know exactly how many
users will need this, and decided that leaving shading out until we
gather more information is best. Users can still shade the jar
themselves. For information and feeback, see issue #26366.
Closes#26328
This reverts commit 3a20922046.
This reverts commit 2c271f0f22.
This reverts commit 9d10dbea39.
This reverts commit e816ef89a2.
587409e893 introduced a bug where an example of the format of a request which contained placeholder values was attempted to be tested. This change adds `NOTCONSOLE` to that snippet as the immediately following snippet tests a concrete example.
220212dd69 introduced a bug because the test substitution was looking for `otherhost` where the snippet contained `oldhost`. This change fixes the substitution
By making RestHighLevelClient Closeable, its close method will close the internal low-level REST client instance by default, which simplifies the way most users interact with the high-level client.
Its constructor accepts now a RestClientBuilder, which clarifies that the low-level REST client is internally created and managed.
It is still possible to provide an already built `RestClient` instance, but that can only be done by subclassing `RestHighLevelClient` and calling the protected constructor that accepts a `RestClient`. In such case a consumer has also to be provided, which controls what has to be done when the high-level client gets done.
Closes#26086
* Accept an array of field names and boosts in the index.query.default_field setting
This commit allows to define an array of field names and boosts for the index setting `index.query.default_field`.
The format is equivalent to the `fields` options of the full text search queries (e.g. field_name^boost).
This commit also makes this setting dynamically updatable.
Fixes#25946
* Deprecate global_ordinals_hash and global_ordinals_low_cardinality
This change deprecates the `global_ordinals_hash` and `global_ordinals_low_cardinality` and
makes the `global_ordinals` execution hint choose internally if global ords should be remapped or use the segment ord directly.
These hints are too sensitive and expert to be exposed and we should be able to take the right decision internally based on the agg tree.
Currently the `precision` parameter must be a precision level
in the range of [1,12]. In #5042 it was suggested also supporting
distance units like "1km" to automatically approcimate the needed
precision level. This change adds this support to the Rest API by
making use of GeoUtils#geoHashLevelsForPrecision.
Plain integer values without a unit are still treated as precision
levels like before. Distance values that are too small to be represented
by a precision level of 12 (values approx. less than 0.056m) are
rejected.
Closes#5042
There was some confusion about the fact that tokens emitted from a Pattern
Capture Token Filter are treated as synonyms when used to analyze a search
query. This commit adds an explanation to the note in the docs to emphasize this
behaviour.
Closes#25746
This change is a continuation of #25726 that aligns field expansions for the simple_query_string with the query_string and multi_match query.
The main changes are:
* For exact field name, the new behavior is to rewrite to a matchnodocs query when the field name is not found in the mapping.
* For partial field names (with * suffix), the expansion is done only on keyword, text, date, ip and number field types. Other field types are simply ignored.
* For all fields (*), the expansion is done on accepted field types only (see above) and metadata fields are also filtered.
The use_all_fields option is deprecated in this change and can be replaced by setting `*` in the fields parameter.
This commit also changes how text fields are analyzed. Previously the default search analyzer (or the provided analyzer) was used to analyze every text part
, ignoring the analyzer set on the field in the mapping. With this change, the field analyzer is used instead unless an analyzer has been forced in the parameter of the query.
Finally now that all full text queries can handle the special "*" expansion (`all_fields` mode), the `index.query.default_field` is now set to `*` for indices created in 6.
We use `:` for cross-cluster search (eg `cluster:index`), therefore, we should
not allow the ambiguity when allowing cluster or index names.
Relates to #23892
Links to inner classes were using `$` in urls instead of `.`, causing
them to 404.
Also fixes the doc generation code to generate docs into the correct
directory. We moved the docs but never updated the generation code.
All of the snippets in our docs marked with `// TESTRESPONSE` are
checked against the response from Elasticsearch but, due to the
way they are implemented they are actually parsed as YAML instead
of JSON. Luckilly, all valid JSON is valid YAML! Unfurtunately
that means that invalid JSON has snuck into the exmples!
This adds a step during the build to parse them as JSON and fail
the build if they don't parse.
But no! It isn't quite that simple. The displayed text of some of
these responses looks like:
```
{
...
"aggregations": {
"range": {
"buckets": [
{
"to": 1.4436576E12,
"to_as_string": "10-2015",
"doc_count": 7,
"key": "*-10-2015"
},
{
"from": 1.4436576E12,
"from_as_string": "10-2015",
"doc_count": 0,
"key": "10-2015-*"
}
]
}
}
}
```
Note the `...` which isn't valid json but we like it anyway and want
it in the output. We use substitution rules to convert the `...`
into the response we expect. That yields a response that looks like:
```
{
"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,
"aggregations": {
"range": {
"buckets": [
{
"to": 1.4436576E12,
"to_as_string": "10-2015",
"doc_count": 7,
"key": "*-10-2015"
},
{
"from": 1.4436576E12,
"from_as_string": "10-2015",
"doc_count": 0,
"key": "10-2015-*"
}
]
}
}
}
```
That is what the tests consume but it isn't valid JSON! Oh no! We don't
want to go update all the substitution rules because that'd be huge and,
ultimately, wouldn't buy much. So we quote the `$body.took` bits before
parsing the JSON.
Note the responses that we use for the `_cat` APIs are all converted into
regexes and there is no expectation that they are valid JSON.
Closes#26233
* Migrate migration docs from 6.0 to 7.0
Since we only keep one version of migration docs and master is now on 7.0, we
should migrate these so breaking changes can be added in the right place.
* Remove release notes as well
They link to the migration guides, so they have to go.
* Add placeholder notes for 7.0 so doc build is happy
The names of two settings in the script security docs are incorrect,
referring to the prefix as "scripts" instead of "script". This commit
fixes this issue.
Relates #26236
In #26185 we made the description of `requests_per_second` sane
for reindex. This improves on the description by using some more
common vocabulary ("batch size", etc) and improving the formatting
of the example calculation so it stands out and doesn't require
scrolling.
An array of values is required because there is no default (or
reasonable way to set a default). But validation for values
only happens if it is actually set. If the values param is omitted
entirely than the agg builder will NPE.
The environment variable CONF_DIR was previously inconsistently used in
our packaging to customize the location of Elasticsearch configuration
files. The importance of this environment variable has increased
starting in 6.0.0 as it's now used consistently to ensure Elasticsearch
and all secondary scripts (e.g., elasticsearch-keystore) all use the
same configuration. The name CONF_DIR is there for legacy reasons yet
it's too generic. This commit renames CONF_DIR to ES_PATH_CONF.
Relates #26197
In reindex APIs, when using the `slices` parameter to choose the number of slices, adds the option to specify `slices` as "auto" which will choose a reasonable number of slices. It uses the number of shards in the source index, up to a ceiling. If there is more than one source index, it uses the smallest number of shards among them.
This gives users an easy way to use slicing in these APIs without having to make decisions about how to configure it, as it provides a good-enough configuration for them out of the box. This may become the default behavior for these APIs in the future.
The percolator field mapper doesn't need to extract all terms and ranges from a bool query with must or filter clauses.
In order to help to default extraction behavior, boost fields can be configured, so that fields that are known for not being
selective enough can be ignored in favor for other fields or clauses with specific fields can forcefully take precedence over other clauses.
This can help selecting clauses for fields that don't match with a lot of percolator queries over other clauses and thus improving performance of the percolate query.
For example a status like field is something that should configured as an ignore field.
Queries on this field tend to match with more documents and so if clauses for this fields
get selected as best clause then that isn't very helpful for the candidate query that the
percolate query generates to filter out percolator queries that are likely not going to match.
With this commit we remove the following three previously unused
(and undocumented) Netty 4 related settings:
* transport.netty.max_cumulation_buffer_capacity,
* transport.netty.max_composite_buffer_components and
* http.netty.max_cumulation_buffer_capacity
from Elasticsearch.
When using the High Level Rest Client 6.0.0-beta1, we are missing some transitive dependencies for Lucene as Lucene 7 has not been released yet. See the following `pom.xml`:
```xml
<dependency>
<groupId>org.elasticsearch.client</groupId>
<artifactId>elasticsearch-rest-client</artifactId>
<version>6.0.0-beta1</version>
</dependency>
<dependency>
<groupId>org.elasticsearch.client</groupId>
<artifactId>elasticsearch-rest-high-level-client</artifactId>
<version>6.0.0-beta1</version>
</dependency>
```
It gives:
```
[ERROR] Failed to execute goal on project fscrawler: Could not resolve dependencies for project fr.pilato.elasticsearch.crawler:fscrawler:jar:2.4-SNAPSHOT: The following artifacts could not be resolved: org.apache.lucene:lucene-analyzers-common:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-backward-codecs:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-grouping:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-highlighter:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-join:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-memory:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-misc:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-queries:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-queryparser:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-sandbox:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-spatial:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-spatial-extras:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-spatial3d:jar:7.0.0-snapshot-00142c9, org.apache.lucene:lucene-suggest:jar:7.0.0-snapshot-00142c9: Failure to find org.apache.lucene:lucene-analyzers-common:jar:7.0.0-snapshot-00142c9 in https://artifacts.elastic.co/maven/ was cached in the local repository, resolution will not be reattempted until the update interval of elastic-download-service has elapsed or updates are forced -
```
We need to add some temporary documentation on how to add the missing repository to a gradle or maven project:
```xml
<repository>
<id>elastic-lucene-snapshots</id>
<name>Elastic Lucene Snapshots</name>
<url>http://s3.amazonaws.com/download.elasticsearch.org/lucenesnapshots/00142c9</url>
<releases><enabled>true</enabled></releases>
<snapshots><enabled>false</enabled></snapshots>
</repository>
```
This also applies to the transport client.
Closes#26106.
* Add support for auto_generate_synonyms_phrase_query in match_query, multi_match_query, query_string and simple_query_string
This change adds a new parameter called auto_generate_synonyms_phrase_query (defaults to true).
This option can be used in conjunction with synonym_graph token filter to generate phrase queries
when multi terms synonyms are encountered.
For example, a synonym like "ny, new york" would produce the following boolean query when "ny city" is parsed:
((ny OR "new york") AND city)
Note how the multi terms synonym "new york" produces a phrase query.
We should have the same behavior for Azure repositories as we have for S3 (see #22762).
Instead of:
```yml
cloud:
azure:
storage:
my_account1:
account: your_azure_storage_account1
key: your_azure_storage_key1
default: true
my_account2:
account: your_azure_storage_account2
key: your_azure_storage_key2
```
Support something like:
```
azure.client:
default:
account: your_azure_storage_account1
key: your_azure_storage_key1
my_account2:
account: your_azure_storage_account2
key: your_azure_storage_key2
```
Then instead of:
```
PUT _snapshot/my_backup3
{
"type": "azure",
"settings": {
"account": "my_account2"
}
}
```
Use:
```
PUT _snapshot/my_backup3
{
"type": "azure",
"settings": {
"config": "my_account2"
}
}
```
If someone uses:
```
PUT _snapshot/my_backup3
{
"type": "azure"
}
```
It will use the `default` azure repository settings.
And mark as deprecated old settings.
Closes#22763.
The goal of this similarity is to help users who would like to keep the
functionality of the `tf-idf` similarity that we want to remove, or to allow
for specific usec-cases (disabling idf, disabling tf, disabling length norm,
etc.) to not have to build a custom plugin and familiarize with the low-level
Lucene API.
Raw requests are supported only by the java yaml test runner and were introduced to test docs snippets. Some yaml tests ended up using them (see #23497) which causes failures for other language clients. This commit migrates those yaml tests to Java tests that send requests through the Java low-level REST client, and also moves the ability to send raw requests to a special client that's only available when testing docs snippets.
Closes#25694
This commit updates the s3 repository docs to clearly mark settings as
part of the s3 client settings, as well as those that are secure and
must be stored in the elasticsearch keystore.
relates #25619
The s3 repository plugin has "third party" integ tests which rely
on external service and configuration setup. These tests are really
internal verification of the plugin (and should be moved to real integ
tests). Running them is not something a user should do, and the
documentation has been out of date for all of 5.x. This commit removes
the docs, removing potential confusion for users.
This commit adds a small note to the discovery docs to include a note
that we recommend that the unicast hosts list be maintained as the list
of master-eligible nodes in the cluster.
Relates #25991
This commit updates the docs for the config files to explain the new
mechanism for customizing the configuration directory via the
environment variable CONF_DIR.
Relates #25990