Values higher than 100% are now allowed to accommodate use
cases where swapping has been determined to be acceptable.
Anomaly detector jobs only use their full model memory
during background persistence, and this is deliberately
staggered, so with large numbers of jobs few will generally
be persisting state at the same time. Settings higher than
available memory are only recommended for OEM type
situations where a wrapper tightly controls the types of
jobs that can be created, and each job alone is considerably
smaller than what each node can handle.
This change adds information about which UI path
(if any) created ML anomaly detector jobs to the
stats returned by the _xpack/usage endpoint.
Counts for the following possibilities are expected:
* ml_module_apache_access
* ml_module_apm_transaction
* ml_module_auditbeat_process_docker
* ml_module_auditbeat_process_hosts
* ml_module_nginx_access
* ml_module_sample
* multi_metric_wizard
* population_wizard
* single_metric_wizard
* unknown
The "unknown" count is for jobs that do not have a
created_by setting in their custom_settings.
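
As a rough illustration of the counting described above, the following sketch tallies jobs by their `created_by` value and falls back to "unknown" when it is absent. The job dictionaries and the `count_created_by` helper are hypothetical and only mirror the shape of the description, not the actual usage-stats code.

```python
from collections import Counter


def count_created_by(jobs):
    """Tally jobs by custom_settings.created_by, using "unknown" when absent."""
    counts = Counter()
    for job in jobs:
        # custom_settings may be missing entirely or present but empty.
        custom_settings = job.get("custom_settings") or {}
        counts[custom_settings.get("created_by", "unknown")] += 1
    return dict(counts)


# Hypothetical job configs, for illustration only.
jobs = [
    {"job_id": "a", "custom_settings": {"created_by": "single_metric_wizard"}},
    {"job_id": "b", "custom_settings": {"created_by": "ml_module_nginx_access"}},
    {"job_id": "c", "custom_settings": {}},
    {"job_id": "d"},
]
created_by_counts = count_created_by(jobs)
```

Jobs "c" and "d" both land in the "unknown" bucket because neither carries a `created_by` value.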
Closes #38403
If multiple jobs are created together and the anomaly
results index does not exist then some of the jobs could
fail to update the mappings of the results index. This
led them to fail to write their results correctly later.
Although this scenario sounds rare, it is exactly what
happens if the user creates their first jobs using the
Nginx module in the ML UI.
This change fixes the problem by updating the mappings
of the results index if it is found to exist during a
creation attempt.
Fixes #38785
* [ML] Refactor common utils out of ML plugin to XPack.Core
* implementing GET filters with abstract transport
* removing added rest param
* adjusting how defaults can be supplied
The problem here was that `DatafeedJob` was updating the last end time searched
based on `now`, even though when there are aggregations the extractor will
only search up to the floor of `now` against the histogram interval.
This commit fixes the issue by using the end time as calculated by the extractor.
It also adds an integration test that uses aggregations. This test would fail
before this fix. Unfortunately the test is slow as we need to wait for the
datafeed to work in real time.
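
The mismatch can be illustrated with a little arithmetic. This is a minimal sketch, assuming times are epoch milliseconds; `aggregated_search_end` is a hypothetical helper, not the extractor's real method.

```python
def aggregated_search_end(now_ms, histogram_interval_ms):
    """Floor `now` to the start of the current histogram bucket."""
    return (now_ms // histogram_interval_ms) * histogram_interval_ms


now_ms = 1_000_250_000
interval_ms = 300_000  # 5 minute histogram buckets

end_ms = aggregated_search_end(now_ms, interval_ms)

# Data in [end_ms, now_ms) has not been searched yet. Recording now_ms as the
# last end time would skip that range on the next real-time search.
unsearched_ms = now_ms - end_ms
```

Recording `end_ms` rather than `now_ms` as the last end time means the next search starts where the previous one actually stopped.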
Closes #39842
* [ML] refactoring lazy query and agg parsing
* Clean up and addressing PR comments
* removing unnecessary try/catch block
* removing bad call to logger
* removing unused import
* fixing bwc test failure due to serialization and config migrator test
* fixing style issues
* Adjusting DatafeedUpdate class serialization
* Adding todo for refactor in v8
* Making query non-optional so it does not write a boolean byte
This change does the following:
1. Makes the per-node setting xpack.ml.max_open_jobs
into a cluster-wide dynamic setting
2. Changes the job node selection to continue to use the
per-node attributes storing the maximum number of open
jobs if any node in the cluster is older than 7.1, and
use the dynamic cluster-wide setting if all nodes are on
7.1 or later
3. Changes the docs to reflect this
4. Changes the thread pools for native process communication
from fixed size to scaling, to support the dynamic nature
of xpack.ml.max_open_jobs
5. Renames the autodetect thread pool to the job comms
thread pool to make clear that it will be used for other
types of ML jobs (data frame analytics in particular)
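
The selection rule in point 2 can be sketched as follows. This is an illustration only: the node dictionaries, the `max_open_jobs` attribute key and both helper functions are hypothetical, and versions are assumed to be plain dotted numbers.

```python
def parse_version(version):
    """Turn "7.1.0" into (7, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def effective_max_open_jobs(node, cluster_nodes, cluster_wide_setting):
    """Pick the open job limit that applies to `node`."""
    any_old_node = any(parse_version(n["version"]) < (7, 1) for n in cluster_nodes)
    if any_old_node:
        # Mixed version cluster: keep trusting the per-node attribute.
        return node["attributes"]["max_open_jobs"]
    # Every node is on 7.1 or later: the dynamic cluster-wide setting wins.
    return cluster_wide_setting


mixed_cluster = [
    {"version": "7.0.1", "attributes": {"max_open_jobs": 20}},
    {"version": "7.1.0", "attributes": {"max_open_jobs": 30}},
]
upgraded_cluster = [
    {"version": "7.1.0", "attributes": {"max_open_jobs": 20}},
    {"version": "7.2.0", "attributes": {"max_open_jobs": 30}},
]

limit_in_mixed = effective_max_open_jobs(mixed_cluster[1], mixed_cluster, 50)
limit_in_upgraded = effective_max_open_jobs(upgraded_cluster[0], upgraded_cluster, 50)
```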
Backport of #39320
ML has historically used doc as the single mapping type but reindex in 7.x
will change the mapping to _doc. Switching to the typeless APIs handles
the case where the mapping type is either doc or _doc. This change removes
deprecated typed usages.
The ScheduledEvent class has never preserved the time
zone so it makes more sense for it to store the start and
end time using Instant rather than ZonedDateTime.
Closes #38620
The ML memory tracker does searches against ML results
and config indices. These searches can be asynchronous,
and if they are running while the node is closing then
they can cause problems for other components.
This change adds a stop() method to the MlMemoryTracker
that waits for in-flight searches to complete. Once
stop() has returned the MlMemoryTracker will not kick
off any new searches.
The MlLifeCycleService now calls MlMemoryTracker.stop()
before stopping the node.
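
The stop() contract described above (wait for in-flight work, then refuse new work) can be sketched with a condition variable. This is not how MlMemoryTracker is implemented; `SearchTracker` and its methods are hypothetical names used only to show the pattern.

```python
import threading


class SearchTracker:
    """Counts in-flight searches and blocks stop() until they drain."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._stopped = False

    def try_start_search(self):
        """Register a search; returns False once stop() has been called."""
        with self._cond:
            if self._stopped:
                return False
            self._in_flight += 1
            return True

    def search_finished(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def stop(self):
        """Refuse new searches, then wait for the running ones to finish."""
        with self._cond:
            self._stopped = True
            while self._in_flight > 0:
                self._cond.wait()

    def in_flight(self):
        with self._cond:
            return self._in_flight


tracker = SearchTracker()
started = tracker.try_start_search()
# Simulate an asynchronous search that completes shortly afterwards.
threading.Timer(0.05, tracker.search_finished).start()
tracker.stop()  # blocks until the simulated search has finished
started_after_stop = tracker.try_start_search()
```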
Fixes #37117
These two changes are interlinked.
Before this change unsetting ML upgrade mode would wait until all
datafeeds were assigned and no longer waiting for their corresponding
jobs to initialise. However, this could be inappropriate if
there was a reason other than upgrade mode why one job was unable
to be assigned or was slow to start up. Unsetting of upgrade mode
would hang in this case.
This change relaxes the condition for considering upgrade mode to
be unset to simply that an assignment attempt has been made for
each ML persistent task that did not fail because upgrade mode
was enabled. Thus after unsetting upgrade mode there is no
guarantee that every ML persistent task is assigned, just that
each is not unassigned due to upgrade mode.
In order to make setting upgrade mode work immediately after
unsetting upgrade mode it was then also necessary to make it
possible to stop a datafeed that was not assigned. There was
no particularly good reason why this was not allowed in the past.
It is trivial to stop an unassigned datafeed because it just
involves removing the persistent task.
The .ml-annotations index is created asynchronously when
some other ML index exists. This can interfere with the
post-test index deletion, as the .ml-annotations index
can be created after all other indices have been deleted.
This change adds an ML specific post-test cleanup step
that runs before the main cleanup and:
1. Checks if any ML indices exist
2. If so, waits for the .ml-annotations index to exist
3. Deletes the other ML indices found in step 1.
4. Calls the super class cleanup
This means that by the time the main post-test index
cleanup code runs:
1. The only ML index it has to delete will be the
.ml-annotations index
2. No other ML indices will exist that could trigger
recreation of the .ml-annotations index
Fixes #38952
Elasticsearch has long [supported](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning) compare and set (a.k.a. optimistic concurrency control) operations using internal document versioning. Sadly that approach is flawed and can sometimes do the wrong thing. Here's the relevant excerpt from the resiliency status page:
> When a primary has been partitioned away from the cluster there is a short period of time until it detects this. During that time it will continue indexing writes locally, thereby updating document versions. When it tries to replicate the operation, however, it will discover that it is partitioned away. It won’t acknowledge the write and will wait until the partition is resolved to negotiate with the master on how to proceed. The master will decide to either fail any replicas which failed to index the operations on the primary or tell the primary that it has to step down because a new primary has been chosen in the meantime. Since the old primary has already written documents, clients may already have read from the old primary before it shuts itself down. The version numbers of these reads may not be unique if the new primary has already accepted writes for the same document
We recently [introduced](https://www.elastic.co/guide/en/elasticsearch/reference/6.x/optimistic-concurrency-control.html) a new sequence number based approach that doesn't suffer from this dirty reads problem.
This commit removes support for internal versioning as a concurrency control mechanism in favor of the sequence number approach.
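
To make the compare and set idea concrete, here is a toy in-memory model of a write that only succeeds when the caller's expected sequence number and primary term match the stored document. `ToyShard` and `VersionConflictError` are hypothetical stand-ins, not Elasticsearch classes; only the `if_seq_no` and `if_primary_term` parameter names are borrowed from the sequence number based API, and the real semantics are richer than this.

```python
class VersionConflictError(Exception):
    """Raised when the caller's expected (seq_no, primary_term) is stale."""


class ToyShard:
    """In-memory stand-in for a primary shard; not Elasticsearch code."""

    def __init__(self, primary_term=1):
        self.primary_term = primary_term
        self._next_seq_no = 0
        self._docs = {}

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self._docs.get(doc_id)
        if if_seq_no is not None or if_primary_term is not None:
            expected = (if_seq_no, if_primary_term)
            actual = (current["seq_no"], current["primary_term"]) if current else None
            if actual != expected:
                raise VersionConflictError(f"{doc_id}: expected {expected}, found {actual}")
        doc = {"source": source, "seq_no": self._next_seq_no, "primary_term": self.primary_term}
        self._next_seq_no += 1
        self._docs[doc_id] = doc
        return doc


shard = ToyShard()
first = shard.index("1", {"counter": 1})
second = shard.index(
    "1", {"counter": 2}, if_seq_no=first["seq_no"], if_primary_term=first["primary_term"]
)

try:
    # Reusing the stale seq_no from the first write must be rejected.
    shard.index(
        "1", {"counter": 3}, if_seq_no=first["seq_no"], if_primary_term=first["primary_term"]
    )
    conflict = False
except VersionConflictError:
    conflict = True
```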
Relates to #1078
With this change we no longer support pluggable discovery implementations. No
known implementations of `DiscoveryPlugin` actually override this method, so in
practice this should have no effect on the wider world. However, we were using
this rather extensively in tests to provide the `test-zen` discovery type. We
no longer need a separate discovery type for tests as we no longer need to
customise its behaviour.
Relates #38410
If a job cannot be assigned to a node because an index it
requires is unavailable and there are lazy ML nodes then
index unavailable should be reported as the assignment
explanation rather than waiting for a lazy ML node.
Today the following settings in the `discovery.zen` namespace are still used:
- `discovery.zen.no_master_block`
- `discovery.zen.hosts_provider`
- `discovery.zen.ping.unicast.concurrent_connects`
- `discovery.zen.ping.unicast.hosts.resolve_timeout`
- `discovery.zen.ping.unicast.hosts`
This commit deprecates all other settings in this namespace so that they can be
removed in the next major version.
In 7.x Java timestamp formats are the default timestamp format and
there is no need to prefix them with "8". (The "8" prefix was used
in 6.7 to distinguish Java timestamp formats from Joda timestamp
formats.)
This change removes the "8" prefixes from timestamp formats in the
output of the ML file structure finder.
Scheduler.schedule(...) previously assumed that the caller would handle
exceptions by calling get() on the returned ScheduledFuture.
schedule() now returns a ScheduledCancellable that no longer gives
access to the exception. Instead, any exception thrown out of a
scheduled Runnable is logged as a warning.
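
The behaviour can be sketched outside Java as well. The snippet below is a rough Python analogue, assuming a timer-based scheduler: the returned handle only offers cancellation, and an exception escaping the task is logged as a warning. `schedule` and `ListHandler` are hypothetical names, not part of any Elasticsearch API.

```python
import logging
import threading


class ListHandler(logging.Handler):
    """Collects log records in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def schedule(task, delay_seconds, logger):
    """Run `task` after a delay; log, rather than expose, any exception."""

    def run():
        try:
            task()
        except Exception:
            logger.warning("uncaught exception in scheduled task", exc_info=True)

    timer = threading.Timer(delay_seconds, run)
    timer.start()
    return timer  # callers can cancel() it but cannot retrieve the exception


def failing_task():
    raise ValueError("boom")


handler = ListHandler()
logger = logging.getLogger("scheduler_sketch")
logger.addHandler(handler)
logger.propagate = False  # keep the example quiet on stderr

timer = schedule(failing_task, 0.01, logger)
timer.join()  # wait for the scheduled task to run
```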
This is a continuation of #28667, #36137 and also fixes #37708.
Doc-value fields now return a value that is based on the mappings rather than
the script implementation by default.
This deprecates the special `use_field_mapping` docvalue format, which was added
in #29639 only to ease the transition to 7.x and is no longer necessary in
7.0.
Runnables can be submitted to
AutodetectProcessManager.AutodetectWorkerExecutorService
without error after it has been shut down, which can lead
to requests timing out as their handlers are never called
by the terminated executor.
This change throws an EsRejectedExecutionException if a
runnable is submitted after the shutdown and calls
AbstractRunnable.onRejection on any tasks not run.
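
A simplified, single-threaded sketch of that contract follows. The class names are hypothetical and the real executor runs tasks on its own thread; the point is only that submissions after shutdown raise, and queued tasks that never ran are told they were rejected.

```python
from collections import deque


class RejectedExecutionError(RuntimeError):
    """Stands in for EsRejectedExecutionException in this sketch."""


class Task:
    def __init__(self, name):
        self.name = name
        self.state = "pending"

    def run(self):
        self.state = "done"

    def on_rejection(self, error):
        # Lets the handler respond instead of waiting until it times out.
        self.state = "rejected"


class WorkerExecutor:
    def __init__(self):
        self._pending = deque()
        self._shutdown = False

    def execute(self, task):
        if self._shutdown:
            raise RejectedExecutionError("executor has shut down")
        self._pending.append(task)

    def run_next(self):
        self._pending.popleft().run()

    def shutdown(self):
        self._shutdown = True
        # Notify every task that was queued but never ran.
        while self._pending:
            self._pending.popleft().on_rejection(
                RejectedExecutionError("executor has shut down")
            )


executor = WorkerExecutor()
ran, queued, late = Task("ran"), Task("queued"), Task("late")
executor.execute(ran)
executor.execute(queued)
executor.run_next()
executor.shutdown()

try:
    executor.execute(late)
    late_rejected = False
except RejectedExecutionError:
    late_rejected = True
```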
Closes #37108
* ML: Add MlMetadata.upgrade_mode and API
* Adding tests
* Adding wait conditionals for the upgrade_mode call to return
* Adding tests
* adjusting format and tests
* Adjusting wait conditions for api return and msgs
* adjusting doc tests
* adding upgrade mode tests to black list
We have read and write aliases for the ML results indices. However,
the job still had methods that purported to reliably return the name
of the concrete results index being used by the job. After reindexing
prior to upgrade to 7.x this will be wrong, so the method has been
renamed and the comments made more explicit to say the returned index
name may not be the actual concrete index name for the lifetime of the
job. Additionally, the selection of indices when deleting the job
has been changed so that it works regardless of concrete index names.
All these changes are nice-to-have for 6.7 and 7.0, but will become
critical if we add rolling results indices in the 7.x release stream
as 6.7 and 7.0 nodes may have to operate in a mixed version cluster
that includes a version that can roll results indices.
The ML file structure finder has always reported both Joda
and Java time format strings. This change makes the Java time
format strings the ones that are incorporated into mappings
and ingest pipeline definitions.
The BWC syntax of prepending "8" to these formats is used.
This will need to be removed once Java time format strings
become the default in Elasticsearch.
This commit also removes direct imports of Joda classes in the
structure finder unit tests. Instead the core Joda BWC class
is used.
This commit changes the default for the `track_total_hits` option of the search request
to `10,000`. This means that by default search requests will accurately track the total hit count
up to `10,000` documents. Requests that match more than this value will set the `"total.relation"`
to `"gte"` (i.e. greater than or equal to) and the `"total.value"` to `10,000` in the search response.
Scroll queries are not impacted; they will continue to count the total hits accurately.
The default is set back to `true` (accurate hit count) if `rest_total_hits_as_int` is set in the search request.
I chose `10,000` as the default because that's also the number we use to limit pagination. This means that users will be able to know how far they can jump (up to 10,000) even if the total number of hits is not accurate.
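
The resulting shape of the total can be modelled with a small function. This is a sketch of the rule described above, not the search implementation; `total_hits` is a hypothetical helper, and the `"eq"` relation for accurate counts is the counterpart of `"gte"` in the response.

```python
def total_hits(actual_matches, track_total_hits=10_000):
    """Model the `total` object returned for a given tracking threshold."""
    if track_total_hits is True:
        # Accurate counting, e.g. when rest_total_hits_as_int is set.
        return {"value": actual_matches, "relation": "eq"}
    if actual_matches > track_total_hits:
        # Counting stopped at the threshold, so the value is a lower bound.
        return {"value": track_total_hits, "relation": "gte"}
    return {"value": actual_matches, "relation": "eq"}


small = total_hits(250)
large = total_hits(1_234_567)
exact = total_hits(1_234_567, track_total_hits=True)
```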
Closes #33028
This commit moves the aggregation and mapping code from joda time to
java time. This includes field mappers, root object mappers, aggregations with date
histograms, query builders and a lot of changes within tests.
The cut-over to java time is a requirement so that we can support nanoseconds
properly in a future field mapper.
Relates #27330
This change moves the update to the results index mappings
from the open job action to the code that starts the
autodetect process.
When a rolling upgrade is performed we need to update the
mappings for already-open jobs that are reassigned from an
old version node to a new version node, but the open job
action is not called in this case.
Closes #37607
This is a continuation of #28667 and aims to convert all executors to propagate errors to the
uncaught exception handler. Notable missing ones were the direct executor and the scheduler. This
commit also makes it the responsibility of the executor, not the runnable, to ensure this behaviour. A big
part of this commit also consists of vastly improving the test coverage in this area.
Migrate ml job and datafeed config of open jobs and update
the parameters of the persistent tasks as they become unallocated
during a rolling upgrade. Block allocation of ml persistent tasks
until the configs are migrated.
* ML: Updating .ml-state calls to be able to support > 1 index
* Matching bulk delete behavior with dbq
* Adjusting state name
* refreshing indices before search
* fixing line length
* adjusting index expansion options
This is a reinforcement of #37227. It turns out that
persistent tasks are not made stale if the node they
were running on is restarted and the master node does
not notice this. The main scenario where this happens
is when minimum master nodes is the same as the number
of nodes in the cluster, so the cluster cannot elect a
master node when any node is restarted.
When an ML node restarts we need the datafeeds for any
jobs that were running on that node to not just wait
until the jobs are allocated, but to wait for the
autodetect process of the job to start up. In the case
of reassignment of the job persistent task this was
dealt with by the stale status test. But in the case
where a node restarts but its persistent tasks are not
reassigned we need a deeper test.
Fixes #36810
Added warnings checks to existing tests
Added “defaultTypeIfNull” to DocWriteRequest interface so that Bulk requests can override a null choice of document type with any global custom choice.
Related to #35190
Jobs created in version 6.1 or earlier can have a
null model_memory_limit. If these are parsed from
cluster state following a full cluster restart then
we replace the null with 4096mb to make the meaning
explicit. But if such jobs are streamed from an
old node in a mixed version cluster this does not
happen. Therefore we need to account for the
possibility of a null model_memory_limit in the ML
memory tracker.
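
A defensive lookup of this kind might look like the sketch below. It assumes, for illustration only, that the limit is held as a number of megabytes under an `analysis_limits` object; `model_memory_limit_mb` is a hypothetical helper, not the memory tracker's real code.

```python
DEFAULT_MODEL_MEMORY_LIMIT_MB = 4096  # value substituted for a null limit


def model_memory_limit_mb(job):
    """Return the job's model memory limit, tolerating a null value."""
    analysis_limits = job.get("analysis_limits") or {}
    limit = analysis_limits.get("model_memory_limit")
    return DEFAULT_MODEL_MEMORY_LIMIT_MB if limit is None else limit


# Hypothetical job configs, for illustration only.
old_job = {"job_id": "created-in-6.1", "analysis_limits": {"model_memory_limit": None}}
new_job = {"job_id": "created-in-7.0", "analysis_limits": {"model_memory_limit": 1024}}

old_limit = model_memory_limit_mb(old_job)
new_limit = model_memory_limit_mb(new_job)
```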