The pattern in the latest failure is similar to the source fixed in #46956
but relates to synced-flush. If peer recovery happens after indexing,
and indexing flushes some shard at the end, then a synced flush in the
test will not roll or commit the translog.
Closes #46712
* Fix Snapshot Finalization not Waiting for Index Metadata
We were mixing up the listeners here, so the final listener, which
should only be called after all the metadata has been written, was
called before that.
I fixed this by removing the one redundant listener and flattening
the logic out.
* Closes #47425
Today when settings validate, they can only validate against settings
that are of the same type. While this strong typing is convenient from a
development perspective, it is too limiting in that some settings need
to validate against settings of a different type. For example, the list
setting xpack.monitoring.exporters.<namespace>.host wants to validate
that it is non-empty if and only if the string setting
xpack.monitoring.exporters.<namespace>.type is "http". Today this is
impossible since the settings validation framework only allows that
setting to validate against other list settings. This commit increases
the flexibility here to validate against settings of arbitrary type, at
the expense of losing strong typing during development.
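As a rough illustration of the cross-type check described above, here is a minimal sketch in plain Java. It does not use the real settings framework: `ExporterSettingsCheck`, `validateHosts`, and the `dependencies` map are hypothetical names, and the map merely stands in for "other settings of arbitrary type".

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not the Elasticsearch settings API.
class ExporterSettingsCheck {

    // Validates a list-typed "host" value against a string-typed "type" value.
    // The dependency arrives as Object and has to be cast, which is the strong
    // typing that is given up in exchange for the added flexibility.
    static void validateHosts(List<String> hosts, Map<String, Object> dependencies) {
        String type = (String) dependencies.get("type");
        boolean isHttp = "http".equals(type);
        if (isHttp && hosts.isEmpty()) {
            throw new IllegalArgumentException("host list must not be empty when type is [http]");
        }
        if (!isHttp && !hosts.isEmpty()) {
            throw new IllegalArgumentException("host list must be empty when type is [" + type + "]");
        }
    }
}
```

The two checks together encode the "non-empty if and only if the type is http" rule from the example above.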
* Add API to execute SLM retention on-demand (#47405)
This is a backport of #47405
This commit adds the `/_slm/_execute_retention` API endpoint. This
endpoint kicks off SLM retention and then returns immediately.
This in particular allows us to run retention without scheduling it
(for entirely manual invocation) or perform a one-off cleanup.
This commit also includes HLRC for the new API, and fixes an issue
in SLMSnapshotBlockingIntegTests where retention invoked prior to the
test completing could resurrect an index the internal test cluster
cleanup had already deleted.
Resolves #46508
Relates to #43663
* Fix AllocationRoutedStepTests.testConditionMetOnlyOneCopyAllocated
These tests were using randomly generated includes/excludes/requires for
routing, however, it was possible to generate mutually exclusive
allocation settings (about 1 out of 50,000 times for my runs).
This splits the test into three different tests, and removes the
randomization (it doesn't add anything to the testing here) to fix the
issue.
Resolves #47142
The passage formatter that the unified highlighter uses doesn't handle terms with overlapping offsets.
For tokenizers that provide multiple segmentations of the same term (edge n-gram, for instance), the formatter
should select the largest span in order to highlight the term only once. This change implements this logic.
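The idea of selecting the largest span can be sketched as follows. This is a simplified stand-in, not the formatter's actual code: `OverlappingOffsets` and `collapse` are hypothetical names, and each offset is modelled as a plain `{start, end}` pair.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of collapsing overlapping match offsets.
class OverlappingOffsets {

    // Each int[] is {startOffset, endOffset}. Overlapping entries are collapsed
    // into one span that covers all of them, so the text is wrapped only once.
    static List<int[]> collapse(List<int[]> offsets) {
        List<int[]> sorted = new ArrayList<>(offsets);
        sorted.sort(Comparator.<int[]>comparingInt(o -> o[0]).thenComparingInt(o -> o[1]));
        List<int[]> result = new ArrayList<>();
        for (int[] offset : sorted) {
            int[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && offset[0] < last[1]) {
                // Overlaps the previous span: keep the larger end offset.
                last[1] = Math.max(last[1], offset[1]);
            } else {
                // Copy so that the caller's arrays are never modified.
                result.add(new int[] { offset[0], offset[1] });
            }
        }
        return result;
    }
}
```

With edge n-gram style offsets such as `{0,1}`, `{0,2}`, `{0,3}`, the three entries collapse into the single span `{0,3}`.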
Fixes multiple Active Directory related tests that run against the
samba fixture. Some were failing since we changed the realm settings
format in 7.0 and a few were slightly broken in other ways.
We can clean up the tests in a follow-up, but that work is better
done when, or after, we move the tests from a Samba
based fixture to a real(-ish) Microsoft Active Directory based
fixture.
Resolves: #33425, #35738
While it seemed like the PUT data frame analytics action did not
have to be a master node action as the config is stored in an index
rather than the cluster state, there are other subtle nuances which
make it worthwhile to convert it. In particular, it helps maintain
order of execution for put actions which are anyhow user driven and
are expected to have low volume.
This commit converts `TransportPutDataFrameAnalyticsAction` from
a handled transport action to a master node action.
Note this means that the action might fail in a mixed cluster
but as the API is still experimental and not widely used there will
be few moments more suitable to make this change than now.
Changes auto-id index requests to use optype CREATE, making them compliant with our docs.
This will also make these auto-id index requests compatible with the new "create-doc" index
privilege (which is based on the optype). The default optype is changed to create, just as it is
already documented.
The autoGeneratedTimestamp field is internally used to speed up indexing of operations with
auto-ids, as we can rule out duplicates. Setting this field externally can make the index
inconsistent, resulting in duplicate documents with the same id.
Adds support for handling auto-id requests with optype CREATE. Also simplifies the code
handling this by using the standard indexing path when dealing with possible retry conflicts.
Relates #47169
Bulk requests currently do not allow adding "create" actions with auto-generated IDs.
This commit allows using the optype CREATE for append-only indexing operations. This is
mainly the user facing aspect of it.
* Add support for bwc for testclusters and convert full cluster restart (#45374)
* Testclusters fix bwc (#46740)
Additions to make testclusters work with later versions of ES
* Do common node config on bwc tests
Before this PR we only ever ran `ElasticsearchCluster.start` once, and
the common node config was never done.
This becomes apparent in upgrading from `6.x` to `7.x` as the new config
is missing, preventing the cluster from starting.
* Fix logic to pick up snapshot from 6.x
* Make sure ports are cleared
* Fix test
* Don't clear all the config as we rely on it
* Fix removal of keys
XPackPlugin holds data in statics and can only be initialized once. This
caused tests to fail primarily when running with a low max-workers.
Replaced usages with the LocalStateCompositeXPackPlugin, which handles
this properly for testing.
Most of the information in AnalysisPredicateScript.Token is pulled directly
from its underlying AttributeSource, but we also keep track of the token position,
and this state is held directly on the Token. This information needs to be reset when
the containing ScriptFilteringTokenFilter or ScriptedConditionTokenFilter is re-used.
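A minimal sketch of the reset problem, using hypothetical stand-ins rather than the real classes: `PositionTrackingToken` and `ReusableFilter` are invented names, and the starting position of -1 is an assumption made for the illustration.

```java
// Hypothetical stand-in for a token that tracks its own position.
class PositionTrackingToken {
    private int position = -1; // assumed starting value, for illustration only

    void advance(int positionIncrement) {
        position += positionIncrement;
    }

    int getPosition() {
        return position;
    }

    // Must be called whenever the owning filter starts on a new stream.
    void reset() {
        position = -1;
    }
}

// Hypothetical stand-in for a filter instance that is re-used across streams.
class ReusableFilter {
    private final PositionTrackingToken token = new PositionTrackingToken();

    // Consumes a stream described only by its position increments and
    // returns the position of the last token.
    int consume(int[] positionIncrements) {
        token.reset(); // without this, positions leak from the previous stream
        for (int increment : positionIncrements) {
            token.advance(increment);
        }
        return token.getPosition();
    }
}
```

Without the `reset()` call, the second stream handled by the same filter instance would continue counting from the last position of the first one.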
Fixes #47197
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large number of documents.
This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average of 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)
Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)
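The "rough" time limit can be sketched as below. This is an illustration of checking a deadline only between operations, not the actual maintenance code: `ExpiredDataRunner`, `runUntilTimeout`, and the injectable clock are hypothetical.

```java
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical illustration of a deadline that is only checked between operations.
class ExpiredDataRunner {

    // Runs the operations in order and returns how many were run. An operation
    // that has already started is never interrupted, so the total running time
    // can overshoot the timeout by the duration of one operation.
    static int runUntilTimeout(List<Runnable> operations, long timeoutMillis, LongSupplier clockMillis) {
        long deadline = clockMillis.getAsLong() + timeoutMillis;
        int completed = 0;
        for (Runnable operation : operations) {
            if (clockMillis.getAsLong() >= deadline) {
                break; // timed out: skip the remaining operations
            }
            operation.run();
            completed++;
        }
        return completed;
    }
}
```

Passing the clock in as a parameter is a design choice for the sketch only; it makes the behaviour easy to exercise without waiting for real time to pass.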
Relates #47103
Synonym queries (when two tokens/paths start at the same position) use the alias field instead
of the concrete field to build Lucene queries. This commit fixes this bug by resolving the alias field upfront in order to provide the concrete field to the actual query parser.
* [DOCS] Adds examples to the PUT dfa and the evaluate dfa APIs.
* [DOCS] Removes extra lines from examples.
* Update docs/reference/ml/df-analytics/apis/evaluate-dfanalytics.asciidoc
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
* Update docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
* [DOCS] Explains examples.
We do mention that rolling back an upgrade requires a restore from a snapshot,
but it's hidden at the bottom of the "preparing to upgrade" instructions on a
different page from the actual upgrade instructions. This commit duplicates the
preparatory instructions onto the pages containing the actual upgrade
instructions and rewords the point about rollbacks a bit.
This commit restores the model state if available in data
frame analytics jobs.
In addition, this changes the start API so that a stopped job
can be restarted. As we now store the progress in the state index
when the task is stopped, we can use it to determine what state
the job was in when it got stopped.
Note that in order to be able to distinguish between a job
that runs for the first time and another that is restarting,
we ensure reindexing progress is reported to be at least 1
for a running task.
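One way to read the note above, as a sketch under assumptions: progress is taken to be an integer percentage, and `ReindexingProgress` with its two methods is a hypothetical helper, not the real implementation.

```java
// Hypothetical illustration of the "at least 1" rule for running tasks.
class ReindexingProgress {

    // Progress reported while the task is running is never below 1, so a
    // stored value of 0 can only come from a job that has never run.
    static int reportedForRunningTask(int actualPercent) {
        return Math.max(1, actualPercent);
    }

    // Under that rule, any stored progress above 0 indicates a restart.
    static boolean isRestart(int storedPercent) {
        return storedPercent > 0;
    }
}
```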
Since the bundled jdk was added to Elasticsearch, there are now 2 ways
java can be missing. Either JAVA_HOME is set but does not exist, or the
bundled jdk does not exist. This commit improves the error messages in
those two cases, and also ensures our tests cover both cases.
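The two failure cases can be sketched like this. The real check lives in the startup scripts and is not shown in this message, so everything here is an assumption: `JavaLocator`, `check`, the `bin/java` layout, and the message wording are illustrative only.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration; the real logic and messages may differ.
class JavaLocator {

    // Returns null when a java binary is found, otherwise an illustrative
    // error message that names which of the two lookups failed.
    static String check(String javaHome, Path bundledJdk) {
        if (javaHome != null) {
            // Case 1: JAVA_HOME is set, so it takes precedence over the bundled jdk.
            Path java = Paths.get(javaHome, "bin", "java");
            return Files.exists(java) ? null : "could not find java in JAVA_HOME at " + java;
        }
        // Case 2: JAVA_HOME is not set, so fall back to the bundled jdk.
        Path java = bundledJdk.resolve("bin").resolve("java");
        return Files.exists(java) ? null : "could not find java in bundled jdk at " + java;
    }
}
```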
Today, we don't clear the shard info of the primary shard when a new
node joins, so we risk making replica allocation decisions
based on stale information about the primary. The serious problem is
that, because of this old info from the primary, we can cancel a current
recovery that is more advanced than the copy on the new node.
With this change, we ensure the shard info from the primary is not older
than any node when allocating replicas.
Relates #46959
This work was done by Henning in #42518.
Co-authored-by: Henning Andersen <henning.andersen@elastic.co>
This commit adjusts randomization for the cluster shard limit tests so
that there is often more of a gap left between the limit and the size of
the first index. This allows the same randomization to be used for all
tests, and alleviates flakiness in
`testIndexCreationOverLimitFromTemplate`.
Today the comment boldly claims that this line of code keeps nodes above the
10-byte low watermark when in fact this is not true at all. This change fixes
this so that it really does keep nodes above the low watermark.
Fixes #45338. Again.
This is a preliminary to #46250, making the snapshot
delete work by doing all the metadata updates first
and then bulk deleting all of the now unreferenced
blobs.
Before this change, the metadata updates for each shard
and subsequent deletion of the blobs that have become unreferenced
due to the delete would happen sequentially shard-by-shard
parallelising only over all the indices in the snapshot.
This change makes it so that all the metadata updates
happen in parallel on a shard level first.
Once all of the updates of shard-level metadata have finished,
all the now unreferenced blobs are deleted in bulk.
This has two benefits (outside of making #46250 a smaller change):
* We have a lower likelihood of failing to update shard level metadata because
it happens with priority and a higher degree of parallelism
* Deleting of unreferenced data in the shards should go much faster in many cases (rolling indices, large number of indices with many unchanged shards) as well, because a number of small bulk deletions (just two blobs for `index-N` and `snap-` for each unchanged shard) are grouped into larger bulk deletes of `100-1000` blobs depending on the cloud provider. Even though the final bulk deletes happen sequentially, this should be much faster in almost all cases, as you'd need a parallelism of 50 (GCS) to 500 (S3) snapshot threads to achieve the same delete rates when deleting from unchanged shards.
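The grouping step described above can be sketched as follows. This is not the repository code: `BulkDeleteBatches` and `batches` are hypothetical names, blobs are plain strings, and the batch size is left to the caller because it depends on the provider.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of grouping per-shard deletions into bulk deletes.
class BulkDeleteBatches {

    // Takes the unreferenced blobs collected per shard (after all shard-level
    // metadata updates have finished) and groups them into batches of at most
    // batchSize blobs, regardless of which shard they came from.
    static List<List<String>> batches(List<List<String>> unreferencedPerShard, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> all = new ArrayList<>();
        for (List<String> shardBlobs : unreferencedPerShard) {
            all.addAll(shardBlobs);
        }
        List<List<String>> result = new ArrayList<>();
        for (int from = 0; from < all.size(); from += batchSize) {
            int to = Math.min(from + batchSize, all.size());
            result.add(new ArrayList<>(all.subList(from, to)));
        }
        return result;
    }
}
```

Three unchanged shards with two blobs each would otherwise need three tiny deletes; with a batch size of four they become two bulk deletes.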
Due to a regression bug the metadata Active Directory realm
setting is ignored (it works correctly for the LDAP realm type).
This commit redresses it.
Closes #45848
DATE_PART(<datetime unit>, <date/datetime>) is a function that allows
the user to extract the specified unit from a date/datetime field,
similar to EXTRACT (<datetime unit> FROM <date/datetime>) but
with different names and aliases for the units. It also provides more
options like `DATE_PART('tzoffset', datetimeField)`.
Implemented following the SQL Server spec: https://docs.microsoft.com/en-us/sql/t-sql/functions/datepart-transact-sql?view=sql-server-2017
with the difference that the <datetime unit> argument is either a
literal single quoted string or gets a value from a table field, whereas
in SQL Server keywords are used (unquoted identifiers) and it's not
possible to use a value coming from a table column.
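To give a feel for what a unit like `tzoffset` extracts, here is a small sketch using `java.time`. It assumes the offset is expressed in signed minutes, as in the linked SQL Server spec; `TzOffsetPart` and `tzOffsetMinutes` are hypothetical names and not part of the SQL implementation.

```java
import java.time.ZonedDateTime;

// Hypothetical illustration of extracting a time zone offset in minutes.
class TzOffsetPart {

    // Returns the signed offset from UTC in minutes for the given datetime.
    static int tzOffsetMinutes(ZonedDateTime dateTime) {
        return dateTime.getOffset().getTotalSeconds() / 60;
    }
}
```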
Closes: #46372
(cherry picked from commit ead743d3579eb753fd314d4a58fae205e465d72e)
* [ML][Inference] adding .ml-inference* index and storage (#47267)
* [ML][Inference] adding .ml-inference* index and storage
* Addressing PR comments
* Allowing null definition, adding validation tests for model config
* fixing line length
* adjusting for backport
The change #47238 fixed a first issue (#47076) but introduced
another one that can be reproduced using:
    org.elasticsearch.common.CharArraysTests > testConstantTimeEquals FAILED
        java.lang.StringIndexOutOfBoundsException: String index out of range: 1
            at __randomizedtesting.SeedInfo.seed([DFCA64FE2C786BE3:ED987E883715C63B]:0)
            at java.lang.String.substring(String.java:1963)
            at org.elasticsearch.common.CharArraysTests.testConstantTimeEquals(CharArraysTests.java:74)
    REPRODUCE WITH: ./gradlew ':libs:elasticsearch-core:test' --tests
    "org.elasticsearch.common.CharArraysTests.testConstantTimeEquals"
    -Dtests.seed=DFCA64FE2C786BE3 -Dtests.security.manager=true -Dtests.locale=fr-CA
    -Dtests.timezone=Pacific/Johnston -Dcompiler.java=12 -Druntime.java=8
that happens when the first randomized string has a length of 0.
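The test code itself is not shown in this message, so the following only illustrates the general failure mode: asking for a one-character substring of an empty string throws. `FirstCharacter` and its methods are hypothetical names.

```java
// Hypothetical illustration of the empty-string failure mode.
class FirstCharacter {

    // Unguarded: throws StringIndexOutOfBoundsException when s is empty,
    // because the end index 1 is beyond the string's length of 0.
    static String unguarded(String s) {
        return s.substring(0, 1);
    }

    // Guarded: handles the zero-length case explicitly instead of throwing.
    static String guarded(String s) {
        return s.isEmpty() ? "" : s.substring(0, 1);
    }
}
```

Any randomized test that slices its input needs a guard like this, or a generator that never produces a zero-length string.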
We cancel ongoing peer recoveries if a node joins the cluster with a completely
up-to-date copy of a shard, because we can use such a copy to recover a replica
instantly. However, today we only look for recoveries to cancel while there are
unassigned shards in the cluster. This means that we do not contemplate the
cancellation of the last few recoveries since recovering shards are not
unassigned. It might take much longer for these recoveries to complete than
would be necessary if they were cancelled.
This commit fixes this by checking for cancellable recoveries even if all
shards are assigned.