On Windows, it can happen that the process we called terminates but some
other process it created still holds the same output streams, and thus the
files, open, so we can't clean them up.
This PR makes the cleanup best effort.
When an ML job runs the memory required can be
broken down into:
1. Memory required to load the executable code
2. Instrumented model memory
3. Other memory used by the job's main process or
ancillary processes that is not instrumented
Previously we added a simple fixed overhead to
account for 1 and 3. This was 100MB for anomaly
detection jobs (large because of the completely
uninstrumented categorization function and
normalize process), and 20MB for data frame
analytics jobs.
However, this was an oversimplification because
the executable code only needs to be loaded once
per machine. Also the 100MB overhead for anomaly
detection jobs was probably too high in most cases
because categorization and normalization don't use
_that_ much memory.
This PR therefore changes the calculation of memory
requirements as follows:
1. A per-node overhead of 30MB for _only_ the first
job of any type to be run on a given node - this
is to account for loading the executable code
2. The established model memory (if applicable) or
model memory limit of the job
3. A per-job overhead of 10MB for anomaly detection
jobs and 5MB for data frame analytics jobs, to
account for the uninstrumented memory usage
This change will enable more jobs to be run on the
same node. It will be particularly beneficial when
there are a large number of small jobs. It will
have less of an effect when there are a small number
of large jobs.
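The arithmetic above can be sketched as follows. This is a minimal illustration in Python, not the actual implementation; the function and constant names are hypothetical, and only the overhead figures (30MB, 10MB, 5MB, and the previous 100MB and 20MB) come from the description.

```python
MB = 1024 * 1024

# Figures quoted in the description; the names are illustrative only.
PER_NODE_OVERHEAD = 30 * MB  # charged once per node, for loading the executable code
PER_JOB_OVERHEAD = {
    "anomaly_detection": 10 * MB,
    "data_frame_analytics": 5 * MB,
}
PREVIOUS_FIXED_OVERHEAD = {
    "anomaly_detection": 100 * MB,
    "data_frame_analytics": 20 * MB,
}


def node_memory_requirement(jobs):
    """Bytes required on one node under the new calculation.

    `jobs` is a list of (job_type, model_memory_bytes) tuples, where
    model_memory_bytes is the established model memory if known,
    otherwise the job's model memory limit.
    """
    if not jobs:
        return 0
    total = PER_NODE_OVERHEAD
    for job_type, model_memory in jobs:
        total += model_memory + PER_JOB_OVERHEAD[job_type]
    return total


def previous_node_memory_requirement(jobs):
    """Bytes required on one node under the previous fixed-overhead calculation."""
    return sum(
        model_memory + PREVIOUS_FIXED_OVERHEAD[job_type]
        for job_type, model_memory in jobs
    )
```

For ten anomaly detection jobs of 10MB each, this gives 230MB instead of 1100MB. For a single 1000MB anomaly detection job it gives 1040MB instead of 1100MB, which is why the effect is smaller for a few large jobs.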
This PR makes the necessary adaptations to the tests and adds a PowerShell script to
invoke the OS tests on GCP instances connected as CI workers.
Also noticed that logs were not being produced by the tests and that these were not using log4j, so fixed that too.
One of the difficulties in working on these tests was that the tests just stalled with no indication of where the problem was.
To ease debugging, after Process Explorer suggested that the tests were running some commands, we now have multiple timeouts: one for the tests (which will generate a thread dump) and one for individual commands (which bails out with the command being run and the output and error so far), to make it easier to see what went wrong.
The tests were blocking because apparently the pipes to the sub-process were not closing, thus the threads were blocking on them and we were blocking indefinitely on the join. I'm not sure why this doesn't happen in Vagrant, but we now properly deal with it.
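The per-command timeout idea can be sketched like this. It is a Python illustration only (the tests themselves are not written in Python), and `run_with_timeout` and `CommandTimedOut` are hypothetical names. It shows a command that fails with the command line and whatever output was captured so far, rather than stalling silently.

```python
import subprocess
import sys


class CommandTimedOut(Exception):
    """Raised when a command exceeds its time budget."""


def _as_text(data):
    # Partial output may be missing (None) or raw bytes, depending on the platform.
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode(errors="replace")
    return data


def run_with_timeout(command, timeout_seconds):
    """Run `command`; on timeout, fail with the command and the output so far."""
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired as e:
        raise CommandTimedOut(
            "command {!r} did not finish within {}s\n"
            "stdout so far: {}\nstderr so far: {}".format(
                command, timeout_seconds, _as_text(e.stdout), _as_text(e.stderr)
            )
        ) from None
    return completed.returncode, completed.stdout, completed.stderr
```

A call such as `run_with_timeout([sys.executable, "-c", "print('ok')"], 30)` returns the exit code and captured output; a command that sleeps past the limit raises `CommandTimedOut` with the offending command in the message.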
While function scores using scripts do allow explanations, they are only
creatable with an expert plugin. This commit improves the situation for
the newer script score query by adding the ability to set the
explanation from the script itself.
To set the explanation, a user would check for `explanation != null` to
indicate an explanation is needed, and then call
`explanation.set("some description")`.
When an API key is invalidated we do two things: first we try to trigger the `ExpiredApiKeysRemover` task,
and second we index the invalidation for the API key. The invalidation may be indexed
before the `ExpiredApiKeysRemover` task runs, and in that case the invalidated
API key will also get deleted. If the `ExpiredApiKeysRemover` runs before the
API key invalidation is indexed, then the API key is not deleted and will be
deleted in a future run.
This behavior was not captured in the tests related to `ExpiredApiKeysRemover`
causing intermittent failures.
This commit fixes those tests by checking whether the invalidated API key is reported
back when we get API keys after invalidation, and performing the checks based on that.
Closes #41747
Today we control the extra translog (when soft-deletes is disabled) for
peer recoveries by size and age. If users manually (force) flush many
times within a short period, we can keep many small (or empty) translog
files as neither the size nor the age condition is reached. We can protect
the cluster from running out of file descriptors in such a situation
by limiting the number of retained translog files.
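A rough sketch of the idea, in Python for illustration: retention by size and age, with an additional cap on the number of files. The `TranslogFile` tuple, the function name, and the exact order of the checks are assumptions made for this example and are not taken from the actual deletion policy.

```python
from collections import namedtuple

# Hypothetical stand-in for a translog generation on disk.
TranslogFile = namedtuple("TranslogFile", ["generation", "size_bytes", "age_seconds"])


def retained_generations(files, max_size_bytes, max_age_seconds, max_files):
    """Return the generations to keep; `files` is ordered oldest to newest.

    Walks from the newest file backwards and stops as soon as the age,
    total size, or file-count limit would be exceeded.
    """
    kept = []
    total_size = 0
    for f in reversed(files):
        if len(kept) >= max_files:
            break  # the new limit: too many retained files
        if f.age_seconds > max_age_seconds:
            break  # age condition reached
        if total_size + f.size_bytes > max_size_bytes:
            break  # size condition reached
        total_size += f.size_bytes
        kept.append(f.generation)
    return sorted(kept)
```

With 500 empty, freshly created files, neither the size nor the age limit is ever reached, so only the file-count cap bounds how many are kept.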
The changes introduced in #47179 made it so that we could try to
build an SSLContext with verification mode set to None, which is
not allowed in FIPS 140 JVMs. This commit addresses that.
The warning section above the example says that the index name has to end with digits, but the example itself uses an index name without digits, which is confusing.
If we fail to read the global metadata in a snapshot
we would throw `SnapshotMissingException` but wouldn't
do so for the index metadata.
This is breaking SLM tests at a low rate because they
use `SnapshotMissingException` thrown from snapshot status APIs
to wait for a snapshot to be gone.
Also, we should be consistent here in general and not leak the
`NoSuchFileException` to the transport layer for index meta.
Closes #46508
Since the property defaulted to `true` this deprecation logging
runs every time unless it's set to `false` manually (in which case
it should've also logged but didn't).
I didn't add a test and removed the tests we had in `7.x` that
covered this logging. I did move the check out of the `if (InetAddresses.isInetAddress(hostString) == false) {` condition so this is sort-of covered by the REST tests.
IMO, any unit-test of this would be somewhat redundant and would've forced
adding a field that just indicates that the deprecated property was used to
every instance which seemed pointless.
Closes #47436
* Remove eclipse conditionals
We used to have some meta projects with a `-test` prefix because
historically Eclipse could not distinguish between test and main
source-sets and could only use a single classpath.
This has not been the case for the past few Eclipse versions.
This PR adds the necessary configuration to correctly categorize source
folders and libraries.
With this change Eclipse can import projects, and the visibility rules
are correct, e.g. auto-complete doesn't offer classes from test code or
`testCompile` dependencies when editing classes in `main`.
Unfortunately the cyclic dependency detection in Eclipse doesn't seem to
take the difference between test and non test source sets into account,
but since we are checking this in Gradle anyhow, it's safe to set to
`warning` in the settings. Unfortunately there is no setting to ignore
it.
This might cause problems when building since Eclipse will probably not
know the right order to build things in, so more work might be necessary.
The rest high level client has a dependency on mapper-extras but the jar
is not published so this commit adds a client jar for this module.
Closes #47413
The pattern in the latest failure is similar to the source fixed in #46956
but relates to synced-flush. If peer recovery happens after indexing,
and indexing flushes some shard at the end, then a synced flush in the
test will not roll or commit translog.
Closes #46712
* Fix Snapshot Finalization not Waiting for Index Metadata
We were mixing up the listeners here which led to the final listener
that should be called after all the metadata has been written
to be called before that.
I fixed this by removing the one redundant listener and flattening
the logic out.
* Closes #47425
Today when settings validate, they can only validate against settings
that are of the same type. While this strong typing is convenient from a
development perspective, it is too limiting in that some settings need
to validate against settings of a different type. For example, the list
setting xpack.monitoring.exporters.<namespace>.host wants to validate
that it is non-empty if and only if the string setting
xpack.monitoring.exporters.<namespace>.type is "http". Today this is
impossible since the settings validation framework only allows that
setting to validate against other list settings. This commit increases
the flexibility here to validate against settings of arbitrary type, at
the expense of losing strong-typing during development.
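To make the example concrete, here is a hedged Python sketch of a validator for the list-valued host setting that consults the string-valued type setting. The function name and the shape of the `dependencies` mapping are hypothetical and do not reflect the real settings framework; the untyped mapping mirrors the loss of strong typing mentioned above.

```python
def validate_exporter_host(hosts, dependencies):
    """Validate a list-valued host setting against a string-valued type setting.

    `dependencies` maps a dependency name to its value; values may be of any
    type, so the validator has to know what type it expects for each one.
    """
    exporter_type = dependencies["type"]  # a string, unlike the list being validated
    if exporter_type == "http" and not hosts:
        raise ValueError("host list must be non-empty when type is [http]")
    if exporter_type != "http" and hosts:
        raise ValueError("host list must be empty unless type is [http]")
```

The two checks together encode the "if and only if" condition: a non-empty host list requires type `http`, and type `http` requires a non-empty host list.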
* Add API to execute SLM retention on-demand (#47405)
This is a backport of #47405
This commit adds the `/_slm/_execute_retention` API endpoint. This
endpoint kicks off SLM retention and then returns immediately.
This in particular allows us to run retention without scheduling it
(for entirely manual invocation) or perform a one-off cleanup.
This commit also includes HLRC for the new API, and fixes an issue
in SLMSnapshotBlockingIntegTests where retention invoked prior to the
test completing could resurrect an index the internal test cluster
cleanup had already deleted.
Resolves #46508
Relates to #43663
* Fix AllocationRoutedStepTests.testConditionMetOnlyOneCopyAllocated
These tests were using randomly generated includes/excludes/requires for
routing, however, it was possible to generate mutually exclusive
allocation settings (about 1 out of 50,000 times for my runs).
This splits the test into three different tests, and removes the
randomization (it doesn't add anything to the testing here) to fix the
issue.
Resolves #47142
The passage formatter that the unified highlighter uses doesn't handle terms with overlapping offsets.
For tokenizers that provide multiple segmentations of the same term (edge ngram for instance) the formatter
should select the largest span in order to highlight the term only once. This change implements this logic.
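One way to picture this is the Python sketch below, which collapses overlapping match offsets into their widest span before wrapping the text in tags. It is an illustration under the assumption that overlapping offsets are merged; the function names and the `<em>` tags are chosen for the example and are not taken from the formatter itself.

```python
def merge_overlapping(offsets):
    """Collapse overlapping (start, end) offsets into their widest spans."""
    merged = []
    for start, end in sorted(offsets):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: keep a single span covering both.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def highlight(text, offsets, pre="<em>", post="</em>"):
    """Wrap each merged span of `text` in pre/post tags exactly once."""
    out = []
    pos = 0
    for start, end in merge_overlapping(offsets):
        out.append(text[pos:start])
        out.append(pre + text[start:end] + post)
        pos = end
    out.append(text[pos:])
    return "".join(out)
```

For edge ngram style offsets such as `(0, 1), (0, 2), (0, 3), (0, 5)` over the text `quick fox`, this produces a single highlighted `quick` rather than several nested or repeated highlights.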
Fixes multiple Active Directory related tests that run against the
samba fixture. Some were failing since we changed the realm settings
format in 7.0 and a few were slightly broken in other ways.
We can clean up the tests in a follow-up, but that work is
better done when, or after, we move the tests from a Samba
based fixture to a real(-ish) Microsoft Active Directory based
fixture.
Resolves: #33425, #35738
While it seemed like the PUT data frame analytics action did not
have to be a master node action as the config is stored in an index
rather than the cluster state, there are other subtle nuances which
make it worthwhile to convert it. In particular, it helps maintain
order of execution for put actions which are anyhow user driven and
are expected to have low volume.
This commit converts `TransportPutDataFrameAnalyticsAction` from
a handled transport action to a master node action.
Note this means that the action might fail in a mixed cluster
but as the API is still experimental and not widely used there will
be few moments more suitable to make this change than now.
Changes auto-id index requests to use optype CREATE, making them compliant with our docs.
This will also make these auto-id index requests compatible with the new "create-doc" index
privilege (which is based on the optype). The default optype is changed to create, just as it is
already documented.
The autoGeneratedTimestamp field is internally used to speed up indexing of operations with
auto-ids, as we can rule out duplicates. Setting this field externally can make the index
inconsistent, resulting in duplicate documents with the same id.
Adds support for handling auto-id requests with optype CREATE. Also simplifies the code
handling this by using the standard indexing path when dealing with possible retry conflicts.
Relates #47169
Bulk requests currently do not allow adding "create" actions with auto-generated IDs.
This commit allows using the optype CREATE for append-only indexing operations. This is
mainly the user facing aspect of it.
* Add support for bwc for testclusters and convert full cluster restart (#45374)
* Testclusters fix bwc (#46740)
Additions to make testclusters work with later versions of ES
* Do common node config on bwc tests
Before this PR we only ever ran `ElasticsearchCluster.start` once, and
the common node config was never done.
This becomes apparent in upgrading from `6.x` to `7.x` as the new config
is missing, preventing the cluster from starting.
* Fix logic to pick up snapshot from 6.x
* Make sure ports are cleared
* Fix test
* Don't clear all the config as we rely on it
* Fix removal of keys
XPackPlugin holds data in statics and can only be initialized once. This
caused tests to fail primarily when running with a low max-workers.
Replaced usages with the LocalStateCompositeXPackPlugin, which handles
this properly for testing.
Most of the information in AnalysisPredicateScript.Token is pulled directly
from its underlying AttributeSource, but we also keep track of the token position,
and this state is held directly on the Token. This information needs to be reset when
the containing ScriptFilteringTokenFilter or ScriptedConditionTokenFilter is re-used.
Fixes #47197
Due to #47003 many clusters will have built up a
large backlog of expired results. On upgrading to
a version where that bug is fixed users could find
that the first ML daily maintenance task deletes
a very large number of documents.
This change introduces throttling to the
delete-by-query that the ML daily maintenance uses
to delete expired results to limit it to deleting an
average of 200 documents per second. (There is no
throttling for state/forecast documents as these
are expected to be lower volume.)
Additionally a rough time limit of 8 hours is applied
to the whole delete expired data action. (This is only
rough as it won't stop part way through a single
operation - it only checks the timeout between
operations.)
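The "rough" nature of the time limit, and the effect of the throttle, can be sketched as follows. This is a Python illustration with hypothetical names; only the figures of 200 documents per second and 8 hours come from the description.

```python
import time


def run_with_rough_deadline(operations, timeout_seconds, clock=time.monotonic):
    """Run (name, callable) operations in order.

    The deadline is checked only between operations, so an operation that
    has already started is allowed to finish even if it overruns.
    """
    deadline = clock() + timeout_seconds
    completed = []
    for name, operation in operations:
        if clock() >= deadline:
            break  # out of time: skip the remaining operations
        operation()
        completed.append(name)
    return completed


def min_seconds_to_delete(doc_count, docs_per_second=200):
    """Lower bound on how long a throttled delete of `doc_count` documents takes."""
    return doc_count / docs_per_second
```

At 200 documents per second, a backlog of one million expired results takes at least 5000 seconds to delete, which is why an overall time limit is useful alongside the throttle.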
Relates #47103
Synonym queries (when two tokens/paths start at the same position) use the alias field instead
of the concrete field to build Lucene queries. This commit fixes this bug by resolving the alias field upfront in order to provide the concrete field to the actual query parser.
* [DOCS] Adds examples to the PUT dfa and the evaluate dfa APIs.
* [DOCS] Removes extra lines from examples.
* Update docs/reference/ml/df-analytics/apis/evaluate-dfanalytics.asciidoc
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
* Update docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc
Co-Authored-By: Lisa Cawley <lcawley@elastic.co>
* [DOCS] Explains examples.