This adds a setting to data frame analytics jobs called
`max_number_threads`. The setting expects a positive integer.
When used the user specifies the max number of threads that may
be used by the analysis. Note that the actual number of threads
used is limited by the number of processors on the node where
the job is assigned. Also, the process may use a couple more threads
for operational functionality that is not the analysis itself.
This setting may also be updated for a stopped job.
More threads may reduce the time it takes to complete the job at the cost
of using more CPU.
Backport of #59254 and #57274
Since we are able to load the inference model
and perform inference in java, we no longer need
to rely on the analytics process to be performing
test inference on the docs that were not used for
training. The benefit is that we do not need to
send test docs and fit them in memory of the c++
process.
Backport of #58877
Co-authored-by: Dimitris Athanasiou <dimitris@elastic.co>
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Backport of #59076 to 7.x branch.
The commit makes the following changes:
* The timestamp field of a data stream definition in a composable
index template can only be set to '@timestamp'.
* Removed custom data stream timestamp field validation and reuse the validation from `TimestampFieldMapper` and
instead only check that the _timestamp field mapping has been defined on a backing index of a data stream.
* Moved code that injects _timestamp meta field mapping from `MetadataCreateIndexService#applyCreateIndexRequestWithV2Template58956(...)` method
to `MetadataIndexTemplateService#collectMappings(...)` method.
* Fixed a bug (#58956) that cases timestamp field validation to be performed
for each template and instead of the final mappings that is created.
* only apply _timestamp meta field if index is created as part of a data stream or data stream rollover,
this fixes a docs test, where a regular index creation matches (logs-*) with a template with a data stream definition.
Relates to #58642
Relates to #53100Closes#58956Closes#58583
This removes the blocking model lookup from the `inference` aggregator's
builder by integrating it into the request rewrite process that loads
stuff asynchronously.
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
There have been a few test failures that are likely caused by tests
performing actions that use ML indices immediately after the actions
that create those ML indices. Currently this can result in attempts
to search the newly created index before its shards have initialized.
This change makes the method that creates the internal ML indices
that have been affected by this problem (state and stats) wait for
the shards to be initialized before returning.
Backport of #59027
This commit creates a new Gradle plugin to provide a separate task name
and source set for running YAML based REST tests. The only project
converted to use the new plugin in this PR is distribution/archives/integ-test-zip.
For which the testing has been moved to :rest-api-spec since it makes the most
sense and it avoids a small but awkward change to the distribution plugin.
The remaining cases in modules, plugins, and x-pack will be handled in followups.
This plugin is distinctly different from the plugin introduced in #55896 since
the YAML REST tests are intended to be black box tests over HTTP. As such they
should not (by default) have access to the classpath for that which they are testing.
The YAML based REST tests will be moved to separate source sets (yamlRestTest).
The which source is the target for the test resources is dependent on if this
new plugin is applied. If it is not applied, it will default to the test source
set.
Further, this introduces a breaking change for plugin developers that
use the YAML testing framework. They will now need to either use the new source set
and matching task, or configure the rest resources to use the old "test" source set that
matches the old integTest task. (The former should be preferred).
As part of this change (which is also breaking for plugin developers) the
rest resources plugin has been removed from the build plugin and now requires
either explicit application or application via the new YAML REST test plugin.
Plugin developers should be able to fix the breaking changes to the YAML tests
by adding apply plugin: 'elasticsearch.yaml-rest-test' and moving the YAML tests
under a yamlRestTest folder (instead of test)
Backport of #58582 to 7.x branch.
This commit adds a new metadata field mapper that validates,
that a document has exactly a single timestamp value in the data stream timestamp field and
that the timestamp field mapping only has `type`, `meta` or `format` attributes configured.
Other attributes can affect the guarantee that an index with this meta field mapper has a
useable timestamp field.
The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever
a new backing index of a data stream is created.
Relates to #53100
Dry up tests that use a disruption that isolates the master from all other nodes.
Also, turn disruption types that have neither parameters nor state into constants
to make things a little clearer.
Working through a heap dump for an unrelated issue I found that we can easily rack up
tens of MBs of duplicate empty instances in some cases.
I moved to a static constructor to guard against that in all cases.
.ml-state-write is supposed to be an index alias, however by accident it can become an index. If
.ml-state-write is a concrete index instead of an alias ML stops working. This change improves error
handling by setting the job to failed and properly log and audit the problem. The user still has to
manually fix the problem. This change should lead to a quicker resolution of the problem.
fixes#58482
We already had code to ensure the config index mappings were
up-to-date before creating a new config. However, it's also
possible that an update to a config could add the latest
settings that require the latest mappings to index correctly.
This change checks that the latest config index mappings are
in place in the 3 update actions in the same way as the checks
are done in the 3 put actions.
Backport of #58916
* [ML] handles compressed model stream from native process (#58009)
This moves model storage from handling the fully parsed JSON string to handling two separate types of documents.
1. ModelSizeInfo which contains model size information
2. TrainedModelDefinitionChunk which contains a particular chunk of the compressed model definition string.
`model_size_info` is assumed to be handled first. This will generate the model_id and store the initial trained model config object. Then each chunk is assumed to be in correct order for concatenating the chunks to get a compressed definition.
Native side change: https://github.com/elastic/ml-cpp/pull/1349
The checks on the license state have a singular method, isAllowed, that
returns whether the given feature is allowed by the current license.
However, there are two classes of usages, one which intends to actually
use a feature, and another that intends to return in telemetry whether
the feature is allowed. When feature usage tracking is added, the latter
case should not count as a "usage", so this commit reworks the calls to
isAllowed into 2 methods, checkFeature, which will (eventually) both
check whether a feature is allowed, and keep track of the last usage
time, and isAllowed, which simply determines whether the feature is
allowed.
Note that I considered having a boolean flag on the current method, but
wanted the additional clarity that a different method name provides,
versus a boolean flag which is more easily copied without realizing what
the flag means since it is nameless in call sites.
This PR implements recursive mapping merging for composable index templates.
When creating an index, we perform the following:
* Add each component template mapping in order, merging each one in after the
last.
* Merge in the index template mappings (if present).
* Merge in the mappings on the index request itself (if present).
Some principles:
* All 'structural' changes are disallowed (but everything else is fine). An
object mapper can never be changed between `type: object` and `type: nested`. A
field mapper can never be changed to an object mapper, and vice versa.
* Generally, each section is merged recursively. This includes `object`
mappings, as well as root options like `dynamic_templates` and `meta`. Once we
reach 'leaf components' like field definitions, they always overwrite an
existing one instead of being merged.
Relates to #53101.
When per_partition_categorization.stop_on_warn is set for an analysis
config it is now passed through to the autodetect C++ process.
Also adds some end-to-end tests that exercise the functionality
added in elastic/ml-cpp#1356
Backport of #58632
* Replace compile configuration usage with api (#58451)
- Use java-library instead of plugin to allow api configuration usage
- Remove explicit references to runtime configurations in dependency declarations
- Make test runtime classpath input for testing convention
- required as java library will by default not have build jar file
- jar file is now explicit input of the task and gradle will ensure its properly build
* Fix compile usages in 7.x branch
Adds parsing of `status` and `memory_reestimate_bytes`
to data frame analytics `memory_usage`. When the training surpasses
the model memory limit, the status will be set to `hard_limit` and
`memory_reestimate_bytes` can be used to update the job's
limit in order to restart the job.
Backport of #58588
Today we have individual settings for configuring node roles such as
node.data and node.master. Additionally, roles are pluggable and we have
used this to introduce roles such as node.ml and node.voting_only. As
the number of roles is growing, managing these becomes harder for the
user. For example, to create a master-only node, today a user has to
configure:
- node.data: false
- node.ingest: false
- node.remote_cluster_client: false
- node.ml: false
at a minimum if they are relying on defaults, but also add:
- node.master: true
- node.transform: false
- node.voting_only: false
If they want to be explicit. This is also challenging in cases where a
user wants to have configure a coordinating-only node which requires
disabling all roles, a list which we are adding to, requiring the user
to keep checking whether a node has acquired any of these roles.
This commit addresses this by adding a list setting node.roles for which
a user has explicit control over the list of roles that a node has. If
the setting is configured, the node has exactly the roles in the list,
and not any additional roles. This means to configure a master-only
node, the setting is merely 'node.roles: [master]', and to configure a
coordinating-only node, the setting is merely: 'node.roles: []'.
With this change we deprecate the existing 'node.*' settings such as
'node.data'.
* [ML] make waiting for renormalization optional for internally flushing job (#58537)
When flushing, datafeeds only need the guaruntee that the latest bucket has been handled.
But, in addition to this, the typical call to flush waits for renormalization to complete. For large jobs, this can take a fair bit of time (even longer than a bucket length). This causes unnecessary delays in handling data.
This commit adds a new internal only flag that allows datafeeds (and forecasting) to skip waiting on renormalization.
closes#58395
This changes the default value for the results field of inference
applied on models that are trained via a data frame analytics job.
Previously, the results field default was `predicted_value`. This
commit makes it the same as in the training job itself. The new
default field is `<dependent_variable>_prediction`. Apart from
making inference consistent with the training job the model came
from, it is helpful to preserve the dependent variable name
by default as it provides some context to the user that may
avoid confusion as to which model results came from.
Backport of #58538
This commits allows data streams to be a valid source for analytics and transforms.
Data streams are fairly transparent and our `_search` and `_reindex` actions work without error.
For `_transforms` the check-pointing works as desired as well. Data streams are effectively treated as an `alias` and the backing index values are stored within checkpointing information.
Unlike `classification`, which is using a cross validation splitter
that produces training sets whose size is predictable and equal to
`training_percent * class_cardinality`, for regression we have been
using a random splitter that takes an independent decision for each
document. This means we cannot predict the exact size of the training
set. This poses a problem as we move towards performing test inference
on the java side as we need to be able to provide an accurate upper
bound of the training set size to the c++ process.
This commit replaces the random splitter we use for regression with
the same streaming-reservoir approach we do for `classification`.
Backport of #58331
There was a discrepancy in the implementation of flush
acknowledgements: most of the class was designed on the
basis that the "last finalized bucket time" could be null
but the wire serialization assumed that it was never
null. This works because, the C++ sends zero "last
finalized bucket time" when it is not known or not
relevant. But then the Java code will print that to
XContent as it is assuming null represents not known or
not relevant.
This change corrects the discrepancies. Internally within
the class null represents not known or not relevant, but
this is translated from/to 0 for communications from the
C++ and old nodes that have the bug.
Additionally I switched from Date to Instant for this
class and made the member variables final to modernise it
a bit.
Backport of #58413
Backporting #58096 to 7.x branch.
Relates to #53100
* use mapping source direcly instead of using mapper service to extract the relevant mapping details
* moved assertion to TimestampField class and added helper method for tests
* Improved logic that inserts timestamp field mapping into an mapping.
If the timestamp field path consisted out of object fields and
if the final mapping did not contain the parent field then an error
occurred, because the prior logic assumed that the object field existed.
When a local model is constructed, the cache hit miss count is incremented.
When a user calls _stats, we will include the sum cache hit miss count across ALL nodes. This statistic is important to in comparing against the inference_count. If the cache hit miss count is near the inference_count it indicates that the cache is overburdened, or inappropriately configured.
This commit bumps our JNA dependency from 4.5.1 to 5.5.0, so that we are
now on the latest maintained line, and pick up a large collection of bug
fixes that have accumulated.