754 Commits

Author SHA1 Message Date
Albert Zaharovits
f55a361b64
Preserve Task Id for ML Datafeed (#54943)
This change preserves the task id for internal requests for the `StartDatafeedPersistentTask`.

Task ids are a way to express a relationship between related internal requests.
In this particular case, the task ids are used for debugging and (soon) security auditing,
but not for task cancellation, because there is already a graceful-shutdown of child
internal requests (given a task id) in place.
2020-04-09 13:22:29 +03:00
Jay Modi
3600c9862f
Reintroduce system index APIs for Kibana (#54935)
This change reintroduces the system index APIs for Kibana without the
changes made for marking what system indices could be accessed using
these APIs. In essence, this is a partial revert of #53912. The changes
for marking what system indices should be allowed access will be
handled in a separate change.

The APIs introduced here are wrapped versions of the existing REST
endpoints. A new setting is also introduced since the Kibana system
indices' names are allowed to be changed by a user in case multiple
instances of Kibana use the same instance of Elasticsearch.

Relates #52385
Backport of #54858
2020-04-08 09:08:49 -06:00
Ryan Ernst
37795d259a
Remove guava from transitive compile classpath (#54309) (#54695)
Guava was removed from Elasticsearch many years ago, but remnants of it
remain due to transitive dependencies. When a dependency pulls guava
into the compile classpath, devs can inadvertently begin using methods
from guava without realizing it. This commit moves guava to a runtime
dependency in the modules that it is needed.

Note that one special case is the html sanitizer in watcher. The third
party dep uses guava in the PolicyFactory class signature. However, only
calling a method on the PolicyFactory actually causes the class to be
loaded, a reference alone does not trigger compilation to look at the
class implementation. There we utilize a MethodHandle for invoking the
relevant method at runtime, where guava will continue to exist.
2020-04-07 23:20:17 -07:00
Dimitris Athanasiou
9b4ac60b53
[7.x][ML] Cancel reindex task from correct thread context (#54874) (#54898)
When a data frame analytics job is stopped, if the reindexing
task was still in progress we cancel it. Cancelling it should
be done from the same context as when we executed the reindexing
task. That means from a thread context with ML origin.

Backport of #54874
2020-04-07 18:11:58 +03:00
David Roberts
df4ae79b41
[TEST] Unmute CategorizationIT.testNumMatchesAndCategoryPreference (#54868)
Should work again now that https://github.com/elastic/ml-cpp/issues/1121
is resolved.

Backport of #54768
2020-04-07 11:04:31 +01:00
David Kyle
03bc368c14
Wait for ML templates after creating a new cluster in TooManyJobsIT (#54801) 2020-04-06 13:45:56 +01:00
Dimitris Athanasiou
ed4ef78330
[7.x][ML] Increase open job wait time in MlDistributedFailureIT (#54792) (#54798)
It seems the 20 seconds timeout is occasionally not enough.
We still get sporadic failures where the logs reveal the job
wasn't opened within 20 seconds. I'm increasing the wait time
to 30 seconds.

Closes #54448

Backport of #54792
2020-04-06 14:51:46 +03:00
David Roberts
470aa9a5f1 [TEST] Mute CategorizationIT.testNumMatchesAndCategoryPreference (#54717)
The test results are affected by the off-by-one error that is
fixed by https://github.com/elastic/ml-cpp/pull/1122

This test can be unmuted once that fix is merged and has been
built into ml-cpp snapshots.
2020-04-03 14:40:47 +01:00
Dimitris Athanasiou
e8c0351fd8
[7.x][ML] Allow force stopping failed and stopping DF analytics (#54650) (#54712)
Force stopping a failed job used to work but it
now puts the job in `stopping` state and hangs.
In addition, force stopping a `stopping` job is
not handled.

This commit addresses those issues with force
stopping data frame analytics. It inlines the
approach with that followed for anomaly detection
jobs.

Backport of #54650
2020-04-03 16:08:06 +03:00
Christoph Büscher
9f22c0d37c Fix Eclipse compile problem in ModelLoadingService (#54670)
Current Eclipse 4.14.0 cannot deal with the direct lambda notation, changing to
an exlicite one.
2020-04-03 11:56:30 +02:00
Benjamin Trent
6e73f67f3b
[ML] unmute categorization test for native backport (#54679) 2020-04-02 17:08:19 -04:00
Benjamin Trent
7fe38935f6
[ML] add training_percent to analytics process params (#54605) (#54678)
This adds training_percent parameter to the analytics process for Classification and Regression. This parameter is then used to give more accurate memory estimations.

See native side pr: elastic/ml-cpp#1111
2020-04-02 17:08:06 -04:00
Benjamin Trent
4a1610265f
[7.x] [ML] add new inference_config field to trained model config (#54421) (#54647)
* [ML] add new inference_config field to trained model config (#54421)

A new field called `inference_config` is now added to the trained model config object. This new field allows for default inference settings from analytics or some external model builder.

The inference processor can still override whatever is set as the default in the trained model config.

* fixing for backport
2020-04-02 12:25:10 -04:00
Benjamin Trent
65233383f6
[7.x] [ML] prefer secondary authorization header for data[feed|frame] authz (#54121) (#54645)
* [ML] prefer secondary authorization header for data[feed|frame] authz (#54121)

Secondary authorization headers are to be used to facilitate Kibana spaces support + ML jobs/datafeeds.

Now on PUT/Update/Preview datafeed, and PUT data frame analytics the secondary authorization is preferred over the primary (if provided).

closes https://github.com/elastic/elasticsearch/issues/53801

* fixing for backport
2020-04-02 11:20:25 -04:00
David Roberts
4b4800e096
[ML] Take more care that normalize processes use unique named pipes (#54641)
When one of ML's normalize processes fails to connect to the JVM
quickly enough and another normalize process for the same job
starts shortly afterwards it is possible that their named pipes
can get mixed up.

This change avoids the risk of that by adding an incrementing
counter value into the named pipe names used for normalize
processes.

Backport of #54636
2020-04-02 14:25:31 +01:00
Benjamin Trent
eb31be0e71
[7.x] [ML] add num_matches and preferred_to_categories to category defintion objects (#54214) (#54639)
* [ML] add num_matches and preferred_to_categories to category defintion objects (#54214)

This adds two new fields to category definitions.

- `num_matches` indicating how many documents have been seen by this category
- `preferred_to_categories` indicating which other categories this particular category supersedes when messages are categorized.

These fields are only guaranteed to be up to date after a `_flush` or `_close`

native change: https://github.com/elastic/ml-cpp/pull/1062

* adjusting for backport
2020-04-02 09:09:19 -04:00
Mayya Sharipova
bf4857d9e0
Search hit refactoring (#41656) (#54584)
Refactor SearchHit to have separate document and meta fields.
This is a part of bigger refactoring of issue #24422 to remove
dependency on MapperService to check if a field is metafield.

Relates to PR: #38373
Relates to issue #24422

Co-authored-by: sandmannn <bohdanpukalskyi@gmail.com>
2020-04-01 15:19:00 -04:00
Przemysław Witek
1fe2705826
Skip daily maintenance activity if upgrade mode is enabled (#54565) (#54571) 2020-04-01 13:29:34 +02:00
Jason Tedor
63e5f2b765
Rename META_DATA to METADATA
This is a follow up to a previous commit that renamed MetaData to
Metadata in all of the places. In that commit in master, we renamed
META_DATA to METADATA, but lost this on the backport. This commit
addresses that.
2020-03-31 17:30:51 -04:00
Jason Tedor
5fcda57b37
Rename MetaData to Metadata in all of the places (#54519)
This is a simple naming change PR, to fix the fact that "metadata" is a
single English word, and for too long we have not followed general
naming conventions for it. We are also not consistent about it, for
example, METADATA instead of META_DATA if we were trying to be
consistent with MetaData (although METADATA is correct when considered
in the context of "metadata"). This was a simple find and replace across
the code base, only taking a few minutes to fix this naming issue
forever.
2020-03-31 17:24:38 -04:00
David Roberts
b8f06df53f
[ML] Fix bug, add tests, improve estimates for estimate_model_memory (#54508)
This PR:

1. Fixes the bug where a cardinality estimate of zero could cause
   a 500 status
2. Adds tests for that scenario and a few others
3. Adds sensible estimates for the cases that were previously TODO

Backport of #54462
2020-03-31 17:59:38 +01:00
David Kyle
9150e77269
[7.x] Remove unused environment from anomaly detector classes (#54399) (#54456) 2020-03-31 16:55:37 +01:00
Dimitris Athanasiou
e4230c533c
[7.x][ML] Move DFA MemoryUsage to stats.common pkg (#54492) (#54512)
This belongs in stats.common

Backport of #54492
2020-03-31 18:36:05 +03:00
Dimitris Athanasiou
6d96ca9bc8
[7.x][ML] Reenable classification and regression integ tests (#54489) (#54494)
Relates #54401

Backport of #54489
2020-03-31 17:50:08 +03:00
Dimitris Athanasiou
b4b54efa73
[7.x][ML] Hyperparameter names should match config (#54401) (#54435)
Java side of elastic/ml-cpp#1096

Backport of #54401
2020-03-30 23:32:40 +03:00
Przemysław Witek
3c604da7f6
[7.x] Create an annotation when a model snapshot is stored (#53783) (#54405) 2020-03-30 15:17:08 +02:00
Martijn van Groningen
4b4fbc160d
Refactor AliasOrIndex abstraction. (#54394)
Backport of #53982

In order to prepare the `AliasOrIndex` abstraction for the introduction of data streams,
the abstraction needs to be made more flexible, because currently it really can be only
an alias or an index.

* Renamed `AliasOrIndex` to `IndexAbstraction`.
* Introduced a `IndexAbstraction.Type` enum to indicate what a `IndexAbstraction` instance is.
* Replaced the `isAlias()` method that returns a boolean with the `getType()` method that returns the new Type enum.
* Moved `getWriteIndex()` up from the `IndexAbstraction.Alias` to the `IndexAbstraction` interface.
* Moved `getAliasName()` up from the `IndexAbstraction.Alias` to the `IndexAbstraction` interface and renamed it to `getName()`.
* Removed unnecessary casting to `IndexAbstraction.Alias` by just checking the `getType()` method.

Relates to #53100
2020-03-30 10:12:16 +02:00
Jason Tedor
cf68ac8a2c
Do not stash environment in machine learning (#54371)
Today the machine learning plugin stashes a copy of the environment in
its constructor, and uses the stashed copy to construct its components
even though it is provided with an environment to create these
components. What is more, the environment it creates in its constructor
is not fully initialized, as it does not have the final copy of the
settings, but the environment passed in while creating components
does. This commit removes that stashed copy of the environment.
2020-03-28 12:46:16 -04:00
Stuart Tettemer
1630de4a42
Scripting: stats per context in nodes stats (#54008) (#54357)
Adds script cache stats to `_node/stats`.
If using the general cache:
```
      "script_cache": {
        "sum": {
          "compilations": 12,
          "cache_evictions": 9,
          "compilation_limit_triggered": 5
        }
      }

```
If using context caches:
```
      "script_cache": {
        "sum": {
          "compilations": 13,
          "cache_evictions": 9,
          "compilation_limit_triggered": 5
        },
        "contexts": [
          {
            "context": "aggregation_selector",
            "compilations": 8,
            "cache_evictions": 6,
            "compilation_limit_triggered": 3
          },
          {
            "context": "aggs",
            "compilations": 5,
            "cache_evictions": 3,
            "compilation_limit_triggered": 2
          },
```
Backport of: 32f46f2
Refs: #50152
2020-03-27 12:26:00 -06:00
Przemysław Witek
2eb079b67f
Add version guards around ML hidden indices settings (#54322) 2020-03-27 14:50:57 +01:00
Przemysław Witek
d40afc7871
[7.x] Do not fail Evaluate API when the actual and predicted fields' types differ (#54255) (#54319) 2020-03-27 10:05:19 +01:00
William Brafford
14204f8381
Use set-based interface for NodesStatsRequest (#53637) (#54141)
The NodesStatsRequest class uses a set of strings for its internal
serialization. This commit updates the class's interface so that we
no longer use hard-coded getters and setters, but rather
methods that add strings directly. For example, the old way of
adding "os" metrics to a request would be to call request.os(true).
The new way of doing this is to call request.addMetric("os").

For the time being, the canonical list of metrics is an enum in
NodesStatsRequest. This will eventually be replaced with something
pluggable.
2020-03-26 14:41:49 -04:00
Dimitris Athanasiou
13368aae37
[7.x][ML] DF Analytics should always display operational stats (#54210) (#54290)
This commit populates the _stats API response with sensible "empty"
`data_counts` and `memory_usage` objects when the job itself
has not started reporting them.

Backport of #54210
2020-03-26 20:03:14 +02:00
Dimitris Athanasiou
cc981fa377
[7.x][ML] Get ML filters size should default to 100 (#54207) (#54278)
When get filters is called without setting the `size`
paramter only up to 10 filters are returned. However,
100 filters should be returned. This commit fixes this
and adds an integ test to guard it.

It seems this was accidentally broken in #39976.

Closes #54206

Backport of #54207
2020-03-26 17:51:43 +02:00
Mark Vieira
7728ccd920
Encore consistent compile options across all projects (#54120)
(cherry picked from commit ddd068a7e92dc140774598664efdc15155ab05c2)
2020-03-25 08:24:21 -07:00
Dimitris Athanasiou
ba09a778dc
[7.x][ML] Unmute classification cardinality integ test (#54165) (#54173)
Adjusts test to work for new cardinality limit.

Backport of #54165
2020-03-25 15:00:34 +02:00
Benjamin Trent
ef05a4f416
[ML] relaxing parameters on stratified split test (#54127) (#54168)
Relaxing the error rate a bit on two of the tests.
Ran 1000s of times locally and never had a failure after these changes. 

closes https://github.com/elastic/elasticsearch/issues/54122
2020-03-25 08:06:15 -04:00
Tanguy Leroux
3a3930c7ec
Mute TooManyJobsIT.testCloseFailedJob on 7.x (#54163)
Relates #54162
2020-03-25 12:44:41 +01:00
Jason Tedor
381d7586e4
Introduce formal role for remote cluster client (#54138)
This commit introduce a formal role for identifying nodes that are
capable of making connections to remote clusters.

Relates #53924
2020-03-24 21:59:43 -04:00
David Roberts
7667004b20
[ML] Add a model memory estimation endpoint for anomaly detection (#54129)
A new endpoint for estimating anomaly detection job
model memory requirements:

POST _ml/anomaly_detectors/estimate_model_memory

Backport of #53507
2020-03-24 22:55:11 +00:00
Dimitris Athanasiou
c141c1dd89
[7.x][ML] Stratified cross validation split for classification (#54087) (#54104)
As classification now works for multiple classes, randomly
picking training/test data frame rows is not good enough.
This commit introduces a stratified cross validation splitter
that maintains the proportion of the each class in the dataset
in the sample that is used for training the model.

Backport of #54087
2020-03-24 18:47:36 +02:00
David Roberts
1421471556
[ML] Introduce a "starting" datafeed state for lazy jobs (#54065)
It is possible for ML jobs to open lazily if the "allow_lazy_open"
option in the job config is set to true.  Such jobs wait in the
"opening" state until a node has sufficient capacity to run them.

This commit fixes the bug that prevented datafeeds for jobs lazily
waiting assignment from being started.  The state of such datafeeds
is "starting", and they can be stopped by the stop datafeed API
while in this state with or without force.

Backport of #53918
2020-03-24 13:00:04 +00:00
Dimitris Athanasiou
be20bb5755
[7.x][ML] No refresh on indexing DFA stats (#53977) (#54064)
When we index data frame analytics stats docs we do not
need to refresh immediately.

Backport of #53977
2020-03-24 13:13:03 +02:00
Dimitris Athanasiou
5ce7c99e74
[7.x][ML] Data frame analytics data counts (#53998) (#54031)
This commit instruments data frame analytics
with stats for the data that are being analyzed.
In particular, we count training docs, test docs,
and skipped docs.

In order to account docs with missing values as skipped
docs for analyses that do not support missing values,
this commit changes the extractor so that it only ignores
docs with missing values when it collects the data summary,
which is used to estimate memory usage.

Backport of #53998
2020-03-24 11:30:43 +02:00
Benjamin Trent
19af869243
[ML] adds multi-class feature importance support (#53803) (#54024)
Adds multi-class feature importance calculation. 

Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
   "feature_name": "feature_0",
   "importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{ 
   “feature_name”: “feature_0”, 
   “importance”: 2.0, // sum(abs()) of class importances
   “foo”: 1.0, 
   “bar”: 0.5, 
   “baz”: -0.5 
},
```

For users to get the full benefit of aggregating and searching for feature importance, they should update their index mapping as follows (before turning this option on in their pipelines)
```
 "ml.inference.feature_importance": {
          "type": "nested",
          "dynamic": true,
          "properties": {
            "feature_name": {
              "type": "keyword"
            },
            "importance": {
              "type": "double"
            }
          }
        }
```
The mapping field name is as follows
`ml.<inference.target_field>.<inference.tag>.feature_importance`
if `inference.tag` is not provided in the processor definition, it is not part of the field path.
`inference.target_field` is defaulted to `ml.inference`.
//cc @lcawl ^ Where should we document this?

If this makes it in for 7.7, there shouldn't be any feature_importance at inference BWC worries as 7.7 is the first version to have it.
2020-03-23 18:49:07 -04:00
Benjamin Trent
d276058c6c
[ML] adjusting feature importance mapping for multi-class support (#53821) (#54013)
Feature importance storage format is changing to encompass multi-class.

Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
   "feature_name": "feature_0",
   "importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{
   “feature_name”: “feature_0”,
   “importance”: 2.0, // sum(abs()) of class importances
   “foo”: 1.0,
   “bar”: 0.5,
   “baz”: -0.5
},
```

This change adjusts the mapping creation for analytics so that the field is mapped as a `nested` type.

Native side change: https://github.com/elastic/ml-cpp/pull/1071
2020-03-23 15:50:12 -04:00
Przemysław Witek
88c5d520b3
[7.x] Verify that the field is aggregatable before attempting cardinality aggregation (#53874) (#54004) 2020-03-23 19:36:33 +01:00
Dimitris Athanasiou
965af3a68b
[7.x][ML] Delete DF analytics stats upon job deletion (#53933) (#53997)
Since a data frame analytics job may have associated docs
in the .ml-stats-* indices, when the job is deleted we
should delete those docs too.

Backport of #53933
2020-03-23 19:55:36 +02:00
Ryan Ernst
960d1fb578
Revert "Introduce system index APIs for Kibana (#53035)" (#53992)
This reverts commit c610e0893db3e713bb9eb7d5d1335b9053681638.

backport of #53912
2020-03-23 10:29:35 -07:00
Dimitris Athanasiou
3873510332
[7.x][ML] Refactor DFA custom processor to cross validation splitter (#53915) (#53956)
While `CustomProcessor` is generic and allows for flexibility, there
are new requirements that make cross validation a concept it's hard
to abstract behind custom processor. In particular, we would like to
add data_counts to the DFA jobs stats. Counting training VS. test
docs would be a useful statistic. We would also want to add a
different cross validation strategy for multiclass classification.

This commit renames custom processors to cross validation splitters
which allows for those enhancements without cryptically doing
things as a side effect of the abstract custom processing.

Backport of #53915
2020-03-23 17:15:14 +02:00