This change introduces a new feature for indices so that they can be
hidden from wildcard expansion. The feature is referred to as hidden
indices. An index can be marked hidden through the use of an index
setting, `index.hidden`, at creation time. One primary use case for
this feature is indices created by the stack that contain data used
for display to the user and/or intended for querying by the user.
Keeping such indices hidden avoids confusing users who search across
all of the data they have indexed and get results back from indices
created by the system.
Hidden indices have the following properties:
* API calls for all indices (empty indices array, _all, or *) will not
return hidden indices by default.
* Wildcard expansion will not return hidden indices by default unless
the wildcard pattern begins with a `.`. This behavior is similar to
shell expansion of wildcards.
* REST API calls can enable the expansion of wildcards to hidden
indices with the `expand_wildcards` parameter. To expand wildcards
to hidden indices, use the value `hidden` in conjunction with `open`
and/or `closed`.
* Creation of a hidden index will ignore global index templates. A
global index template is one with a match-all pattern.
* Index templates can make an index hidden, with the exception of a
global index template.
* Accessing a hidden index directly requires no additional parameters.
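For example, a minimal sketch of creating a hidden index and then requesting hidden expansion (the index name and credentials here are placeholders, in the style of the examples elsewhere in these notes):
```
$ curl -u elastic:password -HContent-Type:application/json -XPUT localhost:9200/hidden-example?pretty -d'{
  "settings": {
    "index.hidden": true
  }
}'
# not matched by a plain wildcard; matched when hidden expansion is requested
$ curl -u elastic:password 'localhost:9200/_cat/indices/hidden-*?v&expand_wildcards=open,hidden'
```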
Backport of #50452
If 1000 different category definitions are created for a job in
the first 100 buckets it processes then an audit warning will now
be created. (This will cause a yellow warning triangle in the
ML UI's jobs list.)
Such a large number of categories suggests that the field that
categorization is working on is not well suited to the ML
categorization functionality.
Object fields cannot be used as features. At the moment the _explain
API includes them and, even worse, it does not error when an object
field is excluded. This creates the expectation for the user that all
child fields will also be excluded, while that is not the case.
This commit omits object fields from the _explain API and also
adds an error if an object field is included or excluded.
Backport of #51115
There have been occasional failures, presumably due to
too many tests running in parallel, caused by jobs taking
around 15 seconds to open. (You can see the job open
successfully during the cleanup phase shortly after the
failure of the test in these cases.) This change increases
the wait time from 10 seconds to 20 seconds to reduce the
risk of this happening.
Check it out:
```
$ curl -u elastic:password -HContent-Type:application/json -XPOST localhost:9200/test/_update/foo?pretty -d'{
  "dac": {}
}'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "x_content_parse_exception",
        "reason" : "[2:3] [UpdateRequest] unknown field [dac] did you mean [doc]?"
      }
    ],
    "type" : "x_content_parse_exception",
    "reason" : "[2:3] [UpdateRequest] unknown field [dac] did you mean [doc]?"
  },
  "status" : 400
}
```
The tricky thing about implementing this is that x-content doesn't
depend on Lucene. So this works by creating an extension point for the
error message using SPI. Elasticsearch's server module provides the
"spell checking" implementation.
* [ML][Inference] Adding classification_weights to ensemble models
Classification weights are a way to allow models to prefer specific
classification results over others. This might be advantageous if
classification value probabilities are a known quantity, and it can
improve model error rates.
In classes where the client is used directly rather than through a call to
executeAsyncWithOrigin, explicitly require the client to be an OriginSettingClient
rather than using the Client interface.
Also remove calls to deprecated ClientHelper.clientWithOrigin() method.
Adds a new parameter to regression and classification that enables computation
of importance for the top most important features. The computation of the importance
is based on the SHAP (SHapley Additive exPlanations) method.
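As a rough sketch of how this could be used in a job config (the parameter name `num_top_feature_importance_values` and all index/field names are assumptions here, not taken from the text above):
```
# parameter and index/field names are assumed for illustration
$ curl -HContent-Type:application/json -XPUT localhost:9200/_ml/data_frame/analytics/house_prices?pretty -d'{
  "source": { "index": "houses" },
  "dest": { "index": "houses_predicted" },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "num_top_feature_importance_values": 5
    }
  }
}'
```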
Backport of #50914
The models the system creates and the models we provide now use the `_xpack` user, for uniformity with our other features.
The `PUT` action is now an admin cluster action, and the XPackClient class now references the action instance.
* [ML][Inference] PUT API (#50852)
This adds the `PUT` API for creating trained models that support our format.
This includes
* HLRC change for the API
* API creation
* Validations of model format and call
* fixing backport
This commit removes validation logic of source and dest indices
for data frame analytics and replaces it with using the common
`SourceDestValidator` class which is already used by transforms.
This way the validations and their messages become consistent
while we reduce code.
This means that where these validations fail the error messages
will be slightly different for data frame analytics.
Backport of #50841
A very large number of recursive calls can cause a stack overflow
exception. This commit forks the recursive calls for non-async
processors. Once forked, each thread will handle at most 10
recursive calls to help keep the stack size and thread count
down to a reasonable size.
This adds the necessary named XContent classes to the HLRC for the lang ident model. This is so the HLRC can call `GET _ml/inference/lang_ident_model_1?include_definition=true` without XContent parsing errors.
The constructors are package private since these classes are used exclusively within the pre-packaged model (and require the specific weights, etc. to be of any use).
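For reference, the call in question (against a local node; credentials omitted):
```
$ curl 'localhost:9200/_ml/inference/lang_ident_model_1?include_definition=true&pretty'
```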
This PR adds per-field metadata that can be set in the mappings and is later
returned by the field capabilities API. This metadata is completely opaque to
Elasticsearch but may be used by tools that index data in Elasticsearch to
communicate metadata about fields with tools that then search this data. A
typical example that has been requested in the past is the ability to attach
a unit to a numeric field.
In order to not bloat the cluster state, Elasticsearch requires that this
metadata be small:
- keys can't be longer than 20 chars,
- values can only be numbers or strings of no more than 50 chars - no inner
arrays or objects,
- the metadata can't have more than 5 keys in total.
Given that metadata is opaque to Elasticsearch, field capabilities don't try to
do anything smart when merging metadata from multiple indices; the union of
all field metadata is returned.
Here is how the meta might look in mappings:
```json
{
  "properties": {
    "latency": {
      "type": "long",
      "meta": {
        "unit": "ms"
      }
    }
  }
}
```
And then in the field capabilities response:
```json
{
  "latency": {
    "long": {
      "searchable": true,
      "aggregatable": true,
      "meta": {
        "unit": [ "ms" ]
      }
    }
  }
}
```
When there are no conflicts, values are arrays of size 1, but when there are
conflicts, Elasticsearch includes all unique values in this array, without
providing a way to know which index has which metadata value:
```json
{
  "latency": {
    "long": {
      "searchable": true,
      "aggregatable": true,
      "meta": {
        "unit": [ "ms", "ns" ]
      }
    }
  }
}
```
Closes #33267
Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.
Fixes#50613
The end offset of a tokenizer is supposed to point one past the
end of the input, not to the end character of the input. The
ml_classic tokenizer was erroneously doing the latter.
Sharing a random generator may cause test failures as non-threadsafe random generators are periodically utilized in tests (see: https://github.com/elastic/elasticsearch/issues/50651)
This change calls `Randomness.get()` within the `bulkIndexWithRetry` method so that the returned `Random` object is only used in a single thread. Before, the member variable could have been shared between threads, which caused test failures.
* [ML][Inference] lang_ident model (#50292)
This PR contains a java port of Google's CLD3 compact NN model https://github.com/google/cld3
The ported model is formatted to fit within our inference model format, is stored as a resource in the `:xpack:ml:` plugin, and is under the Basic license.
The model is broken up into two major parts:
- Preprocessing through the custom embedding (based on CLD3's embedding layer)
- Pushing the embedded text through the two layers of fully connected shallow NN.
Main differences between this port and CLD3:
- We take advantage of Java's internal Unicode handling where possible (i.e. codepoints, characters, decoders, etc.)
- We do not trim down input text by removing duplicated tokens
- We do not encode doubles/floats as longs/integers.
Adds a `force` parameter to the delete data frame analytics
request. When `force` is `true`, the action force-stops the
jobs and then proceeds to the deletion. This can be used in
order to delete a non-stopped job with a single request.
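For example (the job id is a placeholder):
```
$ curl -XDELETE 'localhost:9200/_ml/data_frame/analytics/my_analytics_job?force=true&pretty'
```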
Closes #48124
Backport of #50553
Eclipse 4.13 shows a type mismatch error in the affected line because it cannot
correctly infer the boolean return type for the method call. Assigning return
value to a local variable resolves this problem.
In order to ensure any persisted model state is searchable by the moment
the job reports itself as `stopped`, we need to refresh the state index
before completing.
This should fix the occasional failures we see in #50168 and #50313 where
the model state appears missing.
Closes #50168
Closes #50313
Backport of #50322
This fixes support for nested fields
We now support fully nested, fully collapsed, or a mix of both on inference docs.
ES mappings allow the `_source` to be any combination of nested objects + dot delimited fields.
So, we should do our best to find the best path down the Map for the desired field.
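For example, all of the following `_source` layouts refer to the same logical field `one.two.three` (field names here are just for illustration) and should resolve to the same value:
```json
{ "one": { "two": { "three": "abc" } } }
{ "one.two.three": "abc" }
{ "one.two": { "three": "abc" } }
```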
This commit adds removal of unused data frame analytics state
from the _delete_expired_data API (and by extension the ML daily
maintenance task). At the moment the potential state docs
include the progress document and state for regression and
classification analyses.
Backport of #50243
Executing the data frame analytics _explain API with a config that contains
a field that is not in the includes list but is at the same time in the excludes
list results in trying to remove the field twice from the iterator. That causes
an `IllegalStateException`. This commit fixes this issue and adds a test that
captures the scenario.
Backport of #50192
This adds a new field for the inference processor.
`warning_field` is a place for us to write warnings provided from the inference call. When there are warnings, we do not write an inference result. The goal of this is to indicate that the data provided was too poor or too different for the model to make an accurate prediction.
The user can optionally include the `warning_field`. When it is not provided, it is assumed that no warnings are to be written.
The first of these warnings is for when ALL of the input fields are missing. If none of the trained fields are present, we don't bother inferencing against the model and instead provide a warning stating that the fields were missing.
Also, this adds checks to not allow duplicated fields during processor creation.
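A sketch of how the processor might be configured per the description above (pipeline, model, and field names are placeholders, and other processor options are omitted):
```
# names are placeholders; other inference processor options omitted
$ curl -HContent-Type:application/json -XPUT localhost:9200/_ingest/pipeline/my_inference_pipeline?pretty -d'{
  "processors": [
    {
      "inference": {
        "model_id": "my_model",
        "target_field": "ml.inference",
        "warning_field": "ml.inference_warning"
      }
    }
  ]
}'
```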
This exchanges the direct use of the `Client` for `ResultsPersisterService`. State doc persistence will now retry. Failures to persist state will still not throw, but will be audited and logged.
This commit fixes a bug that caused the data frame analytics
_explain API to time out in a multi-node setup when the source
index was missing. When we try to create the extracted fields detector,
we check the index settings. If the index is missing, that check responds
with a failure that could be wrapped as a remote exception.
While we unwrapped correctly to check if the cause was an
`IndexNotFoundException`, we then proceeded to cast the original
exception instead of the cause.
Backport of #50176
* [ML] Add graceful retry for anomaly detector result indexing failures (#49508)
All results indexing now retry the amount of times configured in `xpack.ml.persist_results_max_retries`. The retries are done in a semi-random, exponential backoff.
* fixing test
The `ClassificationIT.testTwoJobsWithSameRandomizeSeedUseSameTrainingSet`
test was previously set up to just have 10 rows. With `training_percent`
of 50%, only 5 rows will be used for training. There is a good chance that
all 5 rows will be of one class, which results in failure.
This commit increases the rows to 100. Now 50 rows should be used for training
and the chance of failure should be very small.
Backport of #50072
Adjusts the subclasses of `TransportMasterNodeAction` to use their own loggers
instead of the one for the base class.
Relates #50056.
Partial backport of #46431 to 7.x.
This adds a new `randomize_seed` for regression and classification.
When not explicitly set, the seed is randomly generated. One can
reuse the seed in a similar job in order to ensure the same docs
are picked for training.
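For example (index and field names are placeholders):
```
$ curl -HContent-Type:application/json -XPUT localhost:9200/_ml/data_frame/analytics/seeded_job?pretty -d'{
  "source": { "index": "my_data" },
  "dest": { "index": "my_data_results" },
  "analysis": {
    "classification": {
      "dependent_variable": "label",
      "randomize_seed": 42
    }
  }
}'
```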
Backport of #49990
When checking the cardinality of a field, the query should be taken into account. The user might know about some bad data in their index and want to filter down to the target_field values they care about.
Work in progress on the C++ side is increasing memory estimates
a bit and this test fails. At the time of this commit the memory
estimate when there is no source query is about 2Mb. So I
am relaxing the test to assert the memory estimate is less than 1Mb
instead of 500Kb.
Backport of #49924
This adds a `_source` setting under the `source` setting of a data
frame analytics config. The new `_source` is reusing the structure
of a `FetchSourceContext` like `analyzed_fields` does. Specifying
includes and excludes for source allows selecting which fields
will get reindexed and will be available in the destination index.
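For example (index and field names are placeholders):
```
$ curl -HContent-Type:application/json -XPUT localhost:9200/_ml/data_frame/analytics/filtered_source_job?pretty -d'{
  "source": {
    "index": "my_data",
    "_source": {
      "includes": [ "relevant_fields.*" ],
      "excludes": [ "relevant_fields.raw" ]
    }
  },
  "dest": { "index": "my_data_results" },
  "analysis": { "outlier_detection": {} }
}'
```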
Closes #49531
Backport of #49690
We depend on the number of data frame rows in order to report progress
for the writing of results, the last phase of a job run. However, results
include objects other than just the data frame rows (e.g., progress, inference model, etc.).
The problem this commit fixes is that if we receive the last data frame row results
we'll report that progress is complete even though we may still have more results
to process. If the job gets stopped for any reason at this point, we will not be able
to restart the job properly as we'll think that the job was completed.
This commit addresses this by limiting the max progress we can report for the
writing_results phase before the results processor completes to 98.
At the end, when the process is done we set the progress to 100.
The commit also improves failure capturing and reporting in the results processor.
Backport of #49551
The categorization job wizard in the ML UI will use this
information when showing the effect of the chosen categorization
analyzer on a sample of input.
Before this change, excluding an unsupported field resulted in
an error message that explained the excluded field could not be
detected, as if it did not exist. This error message is confusing.
This commit changes this so that there is no error in this
scenario. When excluding a field that does exist but has been
automatically excluded from the analysis there is no harm
(unlike excluding a missing field, which could be a typo).
Backport of #49535
This commit replaces the _estimate_memory_usage API with
a new API, the _explain API.
The API consolidates information that is useful before
creating a data frame analytics job.
It includes:
- memory estimation
- field selection explanation
Memory estimation is moved here from what was previously
calculated in the _estimate_memory_usage API.
Field selection is a new feature that explains to the user
whether each available field was selected to be included or
not in the analysis. In the case it was not included, it also
explains the reason why.
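For example, a sketch of calling the new API with a config body (index and field names are placeholders); the response then contains a memory estimation section and a per-field selection explanation:
```
$ curl -HContent-Type:application/json -XPOST localhost:9200/_ml/data_frame/analytics/_explain?pretty -d'{
  "source": { "index": "my_data" },
  "analysis": {
    "classification": { "dependent_variable": "label" }
  }
}'
```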
Backport of #49455
If a datafeed is stopped normally and force stopped at the same
time then it is possible that the force stop removes the
persistent task while the normal stop is performing actions.
Currently this causes the normal stop to error, but since
stopping a stopped datafeed is not an error this doesn't make
sense. Instead the force stop should just take precedence.
This is a followup to #49191 and should really have been
included in the changes in that PR.
This commit moves the async calls required to retrieve the components
that make up `ExtractedFieldsExtractor` out of `DataFrameDataExtractorFactory`
and into a dedicated `ExtractorFieldsExtractorFactory` class.
A few more refactorings are performed:
- The detector no longer needs the results field. Instead, it knows
whether to use it or not based on whether the task is restarting.
- We pass more accurately whether the task is restarting or not.
- The validation of fields that have a cardinality limit is now
performed in the detector after retrieving the respective
cardinalities.
Backport of #49315
The following edge cases were fixed:
1. A request to force-stop a stopping datafeed is no longer
ignored. Force-stop is an important recovery mechanism
if normal stop doesn't work for some reason, and needs
to operate on a datafeed in any state other than stopped.
2. If the node that a datafeed is running on is removed from
the cluster during a normal stop then the stop request is
retried (and will likely succeed on this retry by simply
cancelling the persistent task for the affected datafeed).
3. If there are multiple simultaneous force-stop requests for
the same datafeed we no longer fail the one that is
processed second. The previous behaviour was wrong as
stopping a stopped datafeed is not an error, so stopping
a datafeed twice simultaneously should not be either.
Backport of #49191
* [ML] ML Model Inference Ingest Processor (#49052)
* [ML][Inference] adds lazy model loader and inference (#47410)
This adds a couple of things:
- A model loader service that is accessible via transport calls. This service will load in models and cache them. They will stay loaded until a processor no longer references them
- A Model class and its first sub-class LocalModel. Used to cache model information and run inference.
- Transport action and handler for requests to infer against a local model
Related Feature PRs:
* [ML][Inference] Adjust inference configuration option API (#47812)
* [ML][Inference] adds logistic_regression output aggregator (#48075)
* [ML][Inference] Adding read/del trained models (#47882)
* [ML][Inference] Adding inference ingest processor (#47859)
* [ML][Inference] fixing classification inference for ensemble (#48463)
* [ML][Inference] Adding model memory estimations (#48323)
* [ML][Inference] adding more options to inference processor (#48545)
* [ML][Inference] handle string values better in feature extraction (#48584)
* [ML][Inference] Adding _stats endpoint for inference (#48492)
* [ML][Inference] add inference processors and trained models to usage (#47869)
* [ML][Inference] add new flag for optionally including model definition (#48718)
* [ML][Inference] adding license checks (#49056)
* [ML][Inference] Adding memory and compute estimates to inference (#48955)
* fixing version of indexed docs for model inference
This commit fixes an NPE problem as reported in #49150.
But this problem uncovered that we never added proper handling
of state for data frame analytics tasks.
In this commit we improve the `MlTasks.getDataFrameAnalyticsState`
method to handle null tasks and state tasks properly.
Closes #49150
Backport of #49186
Backport of #48849. Update `.editorconfig` to make the Java settings the
default for all files, and then apply a 2-space indent to all `*.gradle`
files. Then reformat all the files.
In the case multi-fields exist in the source index, we pick
all variants of them in our extracted fields detection for
data frame analytics. This means we may have multiple instances
of the same feature. The worst consequence of this is when the
dependent variable (for regression or classification) is also
duplicated which means we train a model on the dependent variable
itself.
Now that #48770 is merged, this commit is adding logic to
only select one variant of multi-fields.
Closes #48756
Backport of #48799
Aggregatable multi-fields are at the moment wrongly mapped
as normal doc_value fields and thus they support fetching from
source. However, they do not exist in the source. This results
in failure to extract such fields.
This commit fixes this bug. While a fix could be worked out
on top of the existing code, it is evident the extraction logic
has become difficult to understand and maintain. As we also
want to deduplicate multi-fields for data frame analytics,
it seemed appropriate to refactor the code to simplify and
better handle the extraction of multi-fields.
Relates #48756
Backport of #48770
At present the ML C++ artifact is always downloaded from
S3. This change adds an option to configure the location.
(The intention is to use a file:/// URL to pick up the
artifact built in a Docker container in ml-cpp PR builds
so that C++ changes that will break Java integration tests
can be detected before the ml-cpp PRs are merged.)
Relates elastic/ml-cpp#766
Moves common field extraction logic to its own package so that it can
be used both for anomaly detection and data frame analytics.
In preparation for refactoring extraction fields to be simpler and to
support multi-fields properly.
Backport of #48709
* [ML][Inference] separating definition and config object storage (#48651)
This separates out the `definition` object from being stored within the configuration object in the index.
This allows us to gather the config object without decompressing a potentially large definition.
Additionally, `input` is moved to the TrainedModelConfig object and out of the definition. This is so the trained input fields are accessible outside the potentially large model definition.
There is a watchdog in order to avoid long running (and expensive)
grok expressions. Currently the watchdog is thread based: threads
that run grok expressions are registered and unregister after completion.
If these threads stay registered for too long then the watchdog interrupts
them. Joni (the library that powers grok expressions) has a
mechanism that checks whether the current thread is interrupted and,
if so, aborts the pattern matching.
Newer versions have an additional method to abort long running pattern
matching inside joni. Instead of checking the thread's interrupted flag,
joni now also checks a volatile field that can be set via a `Matcher`
instance. This is a more efficient method for aborting long running matches.
(joni checks every 30k iterations whether the interrupted flag is set vs.
just checking a volatile field)
We recently upgraded to a newer joni version (#47374), and this PR
is a followup of that PR.
This change should also fix #43673, since it appears that when unit tests
are run, a test runner thread's interrupted flag may already have
been set, due to some thread reuse.
Reverting the change introducing IsoLocal.ROOT and introducing IsoCalendarDataProvider, which defaults the start of the week to Monday and requires a minimum of 4 days in the first week of a year. This extension uses the Java SPI mechanism and applies these defaults for Locale.ROOT only.
It requires the JVM property java.locale.providers to be set to SPI,COMPAT.
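For example, via a line in the `jvm.options` file:
```
-Djava.locale.providers=SPI,COMPAT
```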
closes #41670
backport #48209
If a job stops right after reindexing is finished but before
we refresh the destination index, we don't refresh at all.
If the job is started again right after, it jumps into the analyzing state.
However, the data is still not searchable.
This is why we were seeing test failures where we start the process
expecting X rows (where X is lower than the expected number of docs)
and end up getting more than X.
We fix this by moving the refresh of the dest index right before
we start the process so it always ensures the data is searchable.
Closes #47612
Backport of #48090