Commit Graph

764 Commits

Author SHA1 Message Date
Dimitris Athanasiou ec405978fc
[7.x][ML] Update reindexing task progress before persisting job progress (#61868) (#61875)
This fixes a bug introduced by #61782. In that PR I thought I could
simplify the persistence of progress by using the progress straight
from the stats holder in the task instead of calling the get
stats action. However, I overlooked that it is then possible to
have stale progress for the reindexing task as that is only updated
when the get stats API is called.

In this commit this is fixed by updating reindexing task progress
before persisting the job progress. This seems to be much more
lightweight than calling the get stats request.

Closes #61852

Backport of #61868
2020-09-02 21:44:18 +03:00
Dimitris Athanasiou 07ab0beea0
[7.x][ML] Improve handling of exception while starting DFA process (#61838) (#61847)
While starting the data frame analytics process it is possible
to get an exception before the process crash handler is in place.
In addition, right after starting the process, we check the process
is alive to ensure we capture a failed process. However, those exceptions
are unhandled.

This commit catches any exception thrown while starting the process
and sets the task to failed with the root cause error message.

I have also taken the chance to remove some unused parameters
in `NativeAnalyticsProcessFactory`.

Relates #61704

Backport of #61838
2020-09-02 16:32:45 +03:00
David Kyle d268540f20
[ML] Check and install the latest template in the DFA executor (#61589) (#61842)
During a rolling upgrade it is possible that a worker node will be upgraded before
the master in which case the DFA templates will not have been installed.
Before a DFA task starts check that the latest template is installed and install it if necessary.
2020-09-02 12:16:29 +01:00
Dimitris Athanasiou 2547cfbe54
[7.x][ML] Persist progress when setting DFA task to failed (#61782) (#61792)
When an error occurs and we set the task to failed via
the `DataFrameAnalyticsTask.setFailed` method we do not
persist progress. If the job is later restarted, this means
we do not correctly restore from where we can but instead
we start the job from scratch and have to redo the reindexing
phase.

This commit solves this bug by persisting the progress before
setting the task to failed.

Backport of #61782
2020-09-01 18:33:07 +03:00
Benjamin Trent 7dabaad7d9
[ML] refactor ml job node selection into its own class (#61521) (#61747)
This is a minor refactor where the job node load logic (node availability, etc.) is refactored into its own class.

This will allow future things (i.e. autoscaling decisions) to use the same node load detection class.
backport of #61521
2020-08-31 14:00:23 -04:00
David Kyle 49a5afc6c1
[ML] Increase wait for templates timeout in tests (#61623) (#61628) 2020-08-27 12:57:12 +01:00
David Kyle 25e811ced7
Rewrite Inference yml tests for better clean up (#61180) (#61555)
Inference processors asynchronously usage write stats to the .ml-stats index after they used. 
In tests the write can leak into the next test causing failures depending on which test follows.
This change waits for the usage stats docs to be written at the end of the test
2020-08-27 11:16:26 +01:00
Dimitris Athanasiou 3ed65eb418
[7.x][ML] Recover data frame extraction search from latest sort key (#61544) (#61572)
If a search failure occurs during data frame extraction we catch
the error and retry once. However, we retry another search that is
identical to the first one. This means we will re-fetch any docs
that were already processed. This may result either to training
a model using duplicate data or in the case of outlier detection to
an error message that the process received more records than it
expected.

This commit fixes this issue by tracking the latest doc's sort key
and then using that in a range query in case we restart the search
due to a failure.

Backport of #61544

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-08-26 17:54:00 +03:00
Benjamin Trent a6e7a3d65f
[7.x] [ML] write warning if configured memory limit is too low for analytics job (#61505) (#61528)
Backports the following commits to 7.x:

[ML] write warning if configured memory limit is too low for analytics job (#61505)

Having `_start` fail when the configured memory limit is too low can be frustrating. 

We should instead warn the user that their job might not run properly if their configured limit is too low. 

It might be that our estimate is too high, and their configured limit works just fine.
2020-08-26 10:35:38 -04:00
Przemyslaw Gomulka 9f566644af
Do not create two loggers for DeprecationLogger backport(#58435) (#61530)
DeprecationLogger's constructor should not create two loggers. It was
taking parent logger instance, changing its name with a .deprecation
prefix and creating a new logger.
Most of the time parent logger was not needed. It was causing Log4j to
unnecessarily cache the unused parent logger instance.

depends on #61515
backports #58435
2020-08-26 16:04:02 +02:00
Przemysław Witek 11c2710e7f
[7.x] [ML] Do not mark the DFA job as FAILED when a failure occurs after the node is shutdown (#61331) (#61526) 2020-08-26 09:53:13 +02:00
Przemyslaw Gomulka f3f7d25316
Header warning logging refactoring backport(#55941) (#61515)
Splitting DeprecationLogger into two. HeaderWarningLogger - responsible for adding a response warning headers and ThrottlingLogger - responsible for limiting the duplicated log entries for the same key (previously deprecateAndMaybeLog).
Introducing A ThrottlingAndHeaderWarningLogger which is a base for other common logging usages where both response warning header and logging throttling was needed.

relates #55699
relates #52369
backports #55941
2020-08-25 16:35:54 +02:00
David Kyle 539cf914bc
[ML] handle new model metadata stream from native process (#59725) (#61251)
This adds the serialization handling for the new model_metadata object from the native process.

Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
2020-08-24 15:52:13 -04:00
Dimitris Athanasiou 618dd65d5f
[7.x][ML] Add debug logging for field caps request during DF Analytics (#61459) (#61478)
Adds debug logging for the request and the response that is getting
field capabilities during a data frame analytics job.

Backport of #61459
2020-08-24 18:01:30 +03:00
Dimitris Athanasiou 18ca8a6be3
[7.x][ML] Remove redundant logging for creation of annotations index (#61461) (#61475)
This commit removes the log info message "Created ML annotations index and aliases".

The message comes in addition to elasticsearch's index creation logging and it does
not add to it. In addition, since #61107 that message may be logged multiple times.

Backport of #61461
2020-08-24 17:46:29 +03:00
David Kyle ba89af544f
[7.x] Respect ML upgrade mode in TrainedModelStatsService (#61143) (#61187)
When in upgrade mode the ml stats service should not write to the stats index.
2020-08-17 11:09:25 +01:00
Benjamin Trent 8f302282f4
[ML] adds new feature_processors field for data frame analytics (#60528) (#61148)
feature_processors allow users to create custom features from
individual document fields.

These `feature_processors` are the same object as the trained model's pre_processors.

They are passed to the native process and the native process then appends them to the
pre_processor array in the inference model.

closes https://github.com/elastic/elasticsearch/issues/59327
2020-08-14 10:32:20 -04:00
David Roberts d1b60269f4
[ML] Ensure annotations index mappings are up to date (#61142)
When the ML annotations index was first added, only the
ML UI wrote to it, so the code to create it was designed
with this in mind.  Now the ML backend also creates
annotations, and those mappings can change between
versions.

In this change:

1. The code that runs on the master node to create the
   annotations index if it doesn't exist but another ML
   index does also now ensures the mappings are up-to-date.
   This is good enough for the ML UI's use of the
   annotations index, because the upgrade order rules say
   that the whole Elasticsearch cluster must be upgraded
   prior to Kibana, so the master node should be on the
   newer version before Kibana tries to write an
   annotation with the new fields.
2. We now also check whether the annotations index exists
   with the correct mappings before starting an autodetect
   process on a node.  This is necessary because ML nodes
   can be upgraded before the master node, so could write
   an annotation with the new fields before the master node
   knows about the new fields.

Backport of #61107
2020-08-14 13:51:04 +01:00
Benjamin Trent a497263c47
[ML] ensure config index is updated before clearing finished_time (#61064) (#61085)
When a user upgrades between versions, they may stop their ML jobs.

Then when the upgrade is complete, they will want to open the jobs again.

But, when opening a job, we attempt to clear out the jobs finished_time. If the job configuration has adjusted between the versions (i.e. added a new field), it will dynamically update the .ml-config index.

We should instead manually change the mapping to be the updated version.
2020-08-13 08:12:10 -04:00
Benjamin Trent 4275a715c9
[ML] adjusting inference processor to support foreach usage (#60915) (#61022)
`foreach` processors store information within the `_ingest` metadata object.

This commit adds the contents of the `_ingest` metadata (if it is not empty).

And will append new inference results if the result field already exists.

This allows a `foreach` to execute and multiple inference results being written to the same result field.

closes https://github.com/elastic/elasticsearch/issues/60867
2020-08-12 08:34:18 -04:00
Dimitris Athanasiou 2e18c0f2ac
[7.x][ML] Audit force stopping data frame analytics (#60973) (#61004)
Audits a message when a data frame analytics job is force stopped.

Backport of #60973
2020-08-12 07:45:26 +03:00
Dimitris Athanasiou 6062672148
[7.x][ML] Monitor reindex response in DF analytics (#60911) (#60958)
Examines the reindex response in order to report potential
problems that occurred during the reindexing phase of
data frame analytics jobs.

Backport of #60911
2020-08-11 16:17:37 +03:00
David Kyle 18a65c5b9a
DFA Get Stats can return multiple responses if more than one error occurs (#60950)
If the search for get stats with multiple job Ids fails the listener is called for each failure. 
This change waits for all responses then returns the first error if there was one.
2020-08-11 10:28:05 +01:00
Benjamin Trent 66b3e89482
[ML] enable logging for test failures (#60902) (#60910) 2020-08-10 12:36:30 -04:00
Benjamin Trent bc17afc535
[7.x] [ML] have DELETE analytics ignore stats failures and clean up unused stats (#60776) (#60784)
* [ML] have DELETE analytics ignore stats failures and clean up unused stats (#60776)

When deleting an analytics configuration, the request MIGHT fail if
the .ml-stats index does not exist or is in strange state (shards unallocated).

Instead of making the request fail, we should log that we were unable to delete the stats docs and then
have them cleaned up in the 'delete_expire_data' janitorial process
2020-08-06 08:55:35 -04:00
Dimitris Athanasiou cedbe6968b
[7.x][ML] Include cause in logging during test inference (#60749) (#60805)
When an exception is thrown during test inference we are
not including the cause message in our logging. This commit
addresses this issue.

Backport of #60749
2020-08-06 11:45:59 +03:00
Przemysław Witek 0afa1bd972
Deprecate allow_no_jobs and allow_no_datafeeds in favor of allow_no_match (#60601) (#60727) 2020-08-05 13:39:40 +02:00
Przemysław Witek 9e27f7474c
Make MlDailyMaintenanceService delete jobs that are in deleting state anyway (#60121) (#60439) 2020-07-30 09:53:11 +02:00
David Roberts 2a0116f51b
[ML] Take more care that memory estimation uses unique named pipes (#60405)
Prior to this change ML memory estimation processes for a
given job would always use the same named pipe names.  This
would often cause one of the processes to fail.

This change avoids this risk by adding an incrementing counter
value into the named pipe names used for memory estimation
processes.

Backport of #60395
2020-07-29 17:29:55 +01:00
Benjamin Trent 76359aaa53
[ML] always write prediction_[score|probability] for classification inference (#60335) (#60397)
In order to unify model inference and analytics results we
need to write the same fields.

prediction_probability and prediction_score are now written
for inference calls against classification models.
2020-07-29 10:58:14 -04:00
Benjamin Trent a6da1fd73e
[ML] require alias when indexing to an alias that should be created (#60315) (#60394)
This sets up all indexing to one of our write aliases to require it actually be an alias.

This allows failures scenarios to be captured quickly, loudly, and then potentially recovered.
2020-07-29 10:52:36 -04:00
Dimitris Athanasiou ed7dcff7c4
[7.x][ML] Audit updates on data frame analytics jobs (#60126) (#60287)
Closes #59652

Backport of #60126
2020-07-28 16:33:35 +03:00
Dimitris Athanasiou 16ffcfb9f6
[7.x][ML] Ensure bulk requests are not over memory limit (#60219) (#60283)
Data frame analytics jobs that work with very large datasets
may produce bulk requests that are over the memory limit
for indexing. This commit adds a helper class that bundles
index requests in bulk requests that steer away from the
memory limit. We then use this class both from the results
joiner and the inference runner ensuring data frame analytics
jobs do not generate bulk requests that are too large.

Note the limit was implemented in #58885.

Backport of #60219
2020-07-28 16:04:03 +03:00
Dimitris Athanasiou 439b7f7e59
[7.x][ML] DFA result processor should only skip rows and model chunks on cancel (#60113) (#60193)
When the job is force-closed or shutting down due to a fatal error we clean
up all cancellable job operations. This includes cancelling the results processor.
However, this means that we might not persist objects that are written from the
process like stats, memory usage, etc.

In hindsight, we do not gain from cancelling the results processor in its
entirety. It makes more sense to skip row results and model chunks but keep
stats and instrumentation about the job as the latter may contain useful information
to understand what happened to the job.

Backport of #60113
2020-07-27 13:42:46 +03:00
Dimitris Athanasiou 6b9a362ec2
[7.x][ML] Skip test inference if DFA task has been stopped (#60116) (#60127)
If the job is stopped before starting inference on test data, we
should skip inference entirely.

Backport of #60116
2020-07-23 18:34:09 +03:00
Jay Modi c8ef2e18f7
Thread safe clean up of LocalNodeModeListeners (#60007)
This commit continues on the work in #59801 and makes other
implementors of the LocalNodeMasterListener interface thread safe in
that they will no longer allow the callbacks to run on different
threads and possibly race each other. This also helps address other
issues where these events could be queued to wait for execution while
the service keeps moving forward thinking it is the master even when
that is not the case.

In order to accomplish this, the LocalNodeMasterListener no longer has
the executorName() method to prevent future uses that could encounter
this surprising behavior.

Each use was inspected and if the class was also a
ClusterStateListener, the implementation of LocalNodeMasterListener
was removed in favor of a single listener that combined the logic. A
single listener is used and there is currently no guarantee on execution
order between ClusterStateListeners and LocalNodeMasterListeners,
so a future change there could cause undesired consequences. For other
classes, the implementations of the callbacks were inspected and if the
operations were lightweight, the overriden executorName method was
removed to use the default, which runs on the same thread.

Backport of #59932
2020-07-22 08:02:18 -06:00
Dimitris Athanasiou 7e652ca873
[7.x][ML] Include same fields during test inference as in training (#… (#60034)
In #58877, when we switched test inference on java, we just
use the doc's `_source` as features. However, this could be
missing out on features that were used during training,
e.g. alias fields, etc.

This commit addresses this by extracting fields to use as
features during inference the same way they are extracted
in `DataFrameDataExtractor` when they are used for training.

Backport of #59963
2020-07-22 12:54:13 +03:00
Benjamin Trent a28547c4b4
[7.x] [ML] add new `custom` field to trained model processors (#59542) (#59700)
* [ML] add new `custom` field to trained model processors (#59542)

This commit adds the new configurable field `custom`.

`custom` indicates if the preprocessor was submitted by a user or automatically created by the analytics job.

Eventually, this field will be used in calculating feature importance. When `custom` is true, the feature importance for
the processed fields is calculated. When `false` the current behavior is the same (we calculate the importance for the originating field/feature).

This also adds new required methods to the preprocessor interface. If users are to supply their own preprocessors
in the analytics job configuration, we need to know the input and output field names.
2020-07-16 10:57:38 -04:00
Przemysław Witek df4fea79cb
Add a "verbose" option to the data frame analytics stats endpoint (#59589) (#59621) 2020-07-16 09:51:31 +02:00
David Kyle df7fc8f967
Accounting for model size when models are not cached (#59607)
When an inference model is loaded it is accounted for in circuit breaker
and should not be released until there are no users of the model. Adds
a reference count to the model to track usage.
2020-07-15 18:06:15 +01:00
Ryan Ernst 3b688bfee5
Add license feature usage api (#59342) (#59571)
This commit adds a new api to track when gold+ features are used within
x-pack. The tracking is done internally whenever a feature is checked
against the current license. The output of the api is a list of each
used feature, which includes the name, license level, and last time it
was used. In addition to a unit test for the tracking, a rest test is
added which ensures starting up a default configured node does not
result in any features registering as used.

There are a couple features which currently do not work well with the
tracking, as they are checked in a manner that makes them look always
used. Those features will be fixed in followups, and in this PR they are
omitted from the feature usage output.
2020-07-14 14:34:59 -07:00
David Kyle 0d2ea1b881
Check for ml privilege when using the Inference Aggregation (#59530) (#59562)
The inference pipeline aggregation requires the user has permission to access
the ml get trained models endpoint (_ml/inference/)
2020-07-14 20:53:40 +01:00
Dimitris Athanasiou ee4610c0ca
[7.x][ML] Rename cross validation splitter package (#59529) (#59544)
Renames and moves the cross validation splitter package.

First, the package and classes are renamed from using
"cross validation splitter" to "train test splitter".
Cross validation as a term is overloaded and encompasses
more concepts than what we are trying to do here.

Second, the package used to be under `process` but it does
not make sense to be there, it can be a top level package
under `dataframe`.

Backport of #59529
2020-07-14 18:54:46 +03:00
Dimitris Athanasiou 37406487b9
[7.x][ML] Improve error for non-included field with unsupported type (#59424) (#59541)
When a field is not included yet its type is unsupported, we currently
state that the reason the field is excluded is that it is not in the
includes list. However, this implies the user could include it but
if the user tried to do so, they would get a failure as they would
be including a field with unsupported type.

This commit improves this by stating the reason a not included field
with unsupported type is excluded is because of its type.

Backport of #59424
2020-07-14 18:54:34 +03:00
Dimitris Athanasiou e302c66847
[7.x][ML] Fix NPE when starting classification with missing dependent_variable (#59524) (#59540)
Since we have added checking the cardinality of the dependent_variable
for classification, we have introduced a bug where an NPE is thrown
if the dependent_variable is a missing field.

This commit is fixing this issue.

Backport of #59524
2020-07-14 17:56:55 +03:00
David Kyle d86435938b
[7.x] Add ml licence check to the pipeline inference agg. (#59213) (#59412)
Ensures the licence is sufficient for the model used in inference
2020-07-14 14:03:10 +01:00
David Roberts 529aa345df [ML] Account for per-partition categorization in model memory estimate (#59458)
Now that we have per-partition categorization, the estimate for
the model memory limit required for a particular analysis config
needs to take into account whether categorization is operating
for the job as a whole or per-partition.
2020-07-14 09:16:28 +01:00
Tim Brooks 623df95a32
Adding indexing pressure stats to node stats API (#59467)
We have recently added internal metrics to monitor the amount of
indexing occurring on a node. These metrics introduce back pressure to
indexing when memory utilization is too high. This commit exposes these
stats through the node stats API.
2020-07-13 17:23:42 -06:00
Dimitris Athanasiou a7895ff458
[7.x][ML] Remove unused member var from ExtractedFieldsDetector (#59395) (#59406)
Removes member variable `index` from `ExtractedFieldsDetector`
as it is not used.

Backport of #59395

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-07-13 19:10:43 +03:00
Martijn van Groningen b1b7bf3912
Make data streams a basic licensed feature. (#59392)
Backport of #59293 to 7.x branch.

* Create new data-stream xpack module.
* Move TimestampFieldMapper to the new module,
  this results in storing a composable index template
  with data stream definition only to work with default
  distribution. This way data streams can only be used
  with default distribution, since a data stream can
  currently only be created if a matching composable index
  template exists with a data stream definition.
* Renamed `_timestamp` meta field mapper
   to `_data_stream_timestamp` meta field mapper.
* Add logic to put composable index template api
  to fail if `_data_stream_timestamp` meta field mapper
  isn't registered. So that a more understandable
  error is returned when attempting to store a template
  with data stream definition via the oss distribution.

In a follow up the data stream transport and
rest actions can be moved to the xpack data-stream module.
2020-07-13 17:26:46 +02:00