Commit Graph

366 Commits

Author SHA1 Message Date
Benjamin Trent cec102a391
[7.x] [ML] adds new n_gram_encoding custom processor (#61578) (#61935)
* [ML] adds new n_gram_encoding custom processor (#61578)

This adds a new `n_gram_encoding` feature processor for analytics and inference.

The focus of this processor is simple ngram encodings that allow:
 - multiple ngrams [1..5]
 - Prefix, infix, suffix
2020-09-04 08:36:50 -04:00
Tanguy Leroux c90ee32cdc
Mute ClassificationIT.testTooLowConfiguredMemoryStillStarts (#61915)
Relates #61913
2020-09-03 15:52:01 +02:00
Benjamin Trent c22415c241
[7.x] [ML] unmute testTooLowConfiguredMemoryStillStarts (#61846) (#61869)
* [ML] unmute testTooLowConfiguredMemoryStillStarts (#61846)

Native PR addresses this test failure: https://github.com/elastic/ml-cpp/pull/1465


closes https://github.com/elastic/elasticsearch/issues/61704

closes https://github.com/elastic/elasticsearch/issues/61561
2020-09-02 13:23:23 -04:00
Jake Landis 794aac717d
[7.x] Convert first 1/2 x-pack plugins from integTest to [yaml | java]RestTest or internalClusterTest (#60630) (#61855)
For 1/2 the plugins in x-pack, the integTest
task is now a no-op and all of the tests are now executed via a test,
yamlRestTest, javaRestTest, or internalClusterTest.

This includes the following projects:
async-search, autoscaling, ccr, enrich, eql, frozen-indicies,
data-streams, graph, ilm, mapper-constant-keyword, mapper-flattened, ml

A few of the more specialized qa projects within these plugins
have not been changed with this PR due to additional complexity which should
be addressed separately.

A follow up PR will address the remaining x-pack plugins (this PR is big enough as-is).

related: #61802
related: #56841
related: #59939
related: #55896
2020-09-02 11:19:24 -05:00
Nik Everett f8158bdb2d Skip failing test
Tracked by https://github.com/elastic/elasticsearch/issues/61561
2020-09-01 13:44:31 -04:00
Henning Andersen 4c9fe31da8 Mute testTooLowConfiguredMemoryStillStarts (#61705)
Related to #61704
2020-08-31 11:19:53 +02:00
David Kyle 25e811ced7
Rewrite Inference yml tests for better clean up (#61180) (#61555)
Inference processors asynchronously usage write stats to the .ml-stats index after they used. 
In tests the write can leak into the next test causing failures depending on which test follows.
This change waits for the usage stats docs to be written at the end of the test
2020-08-27 11:16:26 +01:00
Benjamin Trent a6e7a3d65f
[7.x] [ML] write warning if configured memory limit is too low for analytics job (#61505) (#61528)
Backports the following commits to 7.x:

[ML] write warning if configured memory limit is too low for analytics job (#61505)

Having `_start` fail when the configured memory limit is too low can be frustrating. 

We should instead warn the user that their job might not run properly if their configured limit is too low. 

It might be that our estimate is too high, and their configured limit works just fine.
2020-08-26 10:35:38 -04:00
David Kyle 539cf914bc
[ML] handle new model metadata stream from native process (#59725) (#61251)
This adds the serialization handling for the new model_metadata object from the native process.

Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
2020-08-24 15:52:13 -04:00
Benjamin Trent 43fc6c34bc
Muting analytics integration tests for change new native output model_metadata (#61158)
relates to elastic/ml-cpp#1456
2020-08-14 11:45:35 -04:00
Benjamin Trent 8f302282f4
[ML] adds new feature_processors field for data frame analytics (#60528) (#61148)
feature_processors allow users to create custom features from
individual document fields.

These `feature_processors` are the same object as the trained model's pre_processors.

They are passed to the native process and the native process then appends them to the
pre_processor array in the inference model.

closes https://github.com/elastic/elasticsearch/issues/59327
2020-08-14 10:32:20 -04:00
Benjamin Trent 4275a715c9
[ML] adjusting inference processor to support foreach usage (#60915) (#61022)
`foreach` processors store information within the `_ingest` metadata object.

This commit adds the contents of the `_ingest` metadata (if it is not empty).

And will append new inference results if the result field already exists.

This allows a `foreach` to execute and multiple inference results being written to the same result field.

closes https://github.com/elastic/elasticsearch/issues/60867
2020-08-12 08:34:18 -04:00
Benjamin Trent 66b3e89482
[ML] enable logging for test failures (#60902) (#60910) 2020-08-10 12:36:30 -04:00
Przemysław Witek 0afa1bd972
Deprecate allow_no_jobs and allow_no_datafeeds in favor of allow_no_match (#60601) (#60727) 2020-08-05 13:39:40 +02:00
Rene Groeschke bdd7347bbf
Merge test runner task into RestIntegTest (7.x backport) (#60600)
* Merge test runner task into RestIntegTest (#60261)
* Merge test runner task into RestIntegTest
* Reorganizing Standalone runner and RestIntegTest task
* Rework general test task configuration and extension
* Fix merge issues
* use former 7.x common test configuration
2020-08-04 14:46:32 +02:00
Rene Groeschke ed4b70190b
Replace immediate task creations by using task avoidance api (#60071) (#60504)
- Replace immediate task creations by using task avoidance api
- One step closer to #56610
- Still many tasks are created during configuration phase. Tackled in separate steps
2020-07-31 13:09:04 +02:00
Przemysław Witek 9e27f7474c
Make MlDailyMaintenanceService delete jobs that are in deleting state anyway (#60121) (#60439) 2020-07-30 09:53:11 +02:00
David Roberts 2a0116f51b
[ML] Take more care that memory estimation uses unique named pipes (#60405)
Prior to this change ML memory estimation processes for a
given job would always use the same named pipe names.  This
would often cause one of the processes to fail.

This change avoids this risk by adding an incrementing counter
value into the named pipe names used for memory estimation
processes.

Backport of #60395
2020-07-29 17:29:55 +01:00
Benjamin Trent 76359aaa53
[ML] always write prediction_[score|probability] for classification inference (#60335) (#60397)
In order to unify model inference and analytics results we
need to write the same fields.

prediction_probability and prediction_score are now written
for inference calls against classification models.
2020-07-29 10:58:14 -04:00
Benjamin Trent a6da1fd73e
[ML] require alias when indexing to an alias that should be created (#60315) (#60394)
This sets up all indexing to one of our write aliases to require it actually be an alias.

This allows failures scenarios to be captured quickly, loudly, and then potentially recovered.
2020-07-29 10:52:36 -04:00
Benjamin Trent 54c8936508
[ML] do not summerize importance for custom features (#60198) (#60333)
If a feature is created via a custom pre-processor,
we should return the importance for that feature.

This means we will not return the importance for the
original document field for custom processed features.

closes https://github.com/elastic/elasticsearch/issues/59330
2020-07-28 15:58:20 -04:00
Dimitris Athanasiou 981e436d6c
[7.x][ML] Improve assertion on regression alias field test (#60221) (#60264)
Previously the test was asserting the prediction on each document
was close 10.0 from the expected. It turned out that was not enough
as we occasionally saw the test failing by little.

Instead of relaxing that assertion, this commit changes it to
assert the mean prediction error is less than 10.0. This should
reduce the chances of the test failing significantly.

Fixes #60212

Backport of #60221
2020-07-28 11:48:00 +03:00
Benjamin Trent ea3c49979e
Test mute for issue 60212 (#60214) 2020-07-27 10:10:40 -04:00
David Roberts 89466eefa5
Don't require separate privilege for internal detail of put pipeline (#60190)
Putting an ingest pipeline used to require that the user calling
it had permission to get nodes info as well as permission to
manage ingest.  This was due to an internal implementaton detail
that was not visible to the end user.

This change alters the behaviour so that a user with the
manage_pipeline cluster privilege can put an ingest pipeline
regardless of whether they have the separate privilege to get
nodes info.  The internal implementation detail now runs as
the internal _xpack user when security is enabled.

Backport of #60106
2020-07-27 10:44:48 +01:00
Dimitris Athanasiou 7e652ca873
[7.x][ML] Include same fields during test inference as in training (#… (#60034)
In #58877, when we switched test inference on java, we just
use the doc's `_source` as features. However, this could be
missing out on features that were used during training,
e.g. alias fields, etc.

This commit addresses this by extracting fields to use as
features during inference the same way they are extracted
in `DataFrameDataExtractor` when they are used for training.

Backport of #59963
2020-07-22 12:54:13 +03:00
David Roberts 7358f9fb05 [ML] Mute ForecastIT.testOverflowToDisk in EAR builds (#60040)
Due to https://github.com/elastic/elasticsearch/issues/58806
2020-07-22 10:17:37 +01:00
Przemysław Witek 283a1f605c
Rename binary_soft_classification evaluation to outlier_detection (#59951) (#59970) 2020-07-21 15:15:04 +02:00
David Kyle c349fdcb89 Mute RegressionIT testWithDataStream (#59687)
For #59664
2020-07-16 09:45:29 +01:00
Martijn van Groningen 2a89e13e43
Move data stream transport and rest action to xpack (#59593)
Backport of #59525 to 7.x branch.

* Actions are moved to xpack core.
* Transport and rest actions are moved the data-streams module.
* Removed data streams methods from Client interface.
* Adjusted tests to use client.execute(...) instead of data stream specific methods.
* only attempt to delete all data streams if xpack is installed in rest tests
* Now that ds apis are in xpack and ESIntegTestCase
no longers deletes all ds, do that in the MlNativeIntegTestCase
class for ml tests.
2020-07-15 16:50:44 +02:00
Martijn van Groningen 35ae3d19db
Remove data stream feature flag (#59572)
so that it can used in the next minor release (7.9.0).

Backport of #59504 to 7.x branch.
Closes #53100
2020-07-14 23:50:41 +02:00
Albert Zaharovits 4eb310c777
Disallow mapping updates for doc ingestion privileges (#58784)
The `create_doc`, `create`, `write` and `index` privileges do not grant
the PutMapping action anymore. Apart from the `write` privilege, the other
three privileges also do NOT grant (auto) updating the mapping when ingesting
a document with unmapped fields, according to the templates.

In order to maintain the BWC in the 7.x releases, the above privileges will still grant
the Put and AutoPutMapping actions, but only when the "index" entity is an alias
or a concrete index, but not a data stream or a backing index of a data stream.
2020-07-14 23:39:41 +03:00
David Kyle 0d2ea1b881
Check for ml privilege when using the Inference Aggregation (#59530) (#59562)
The inference pipeline aggregation requires the user has permission to access
the ml get trained models endpoint (_ml/inference/)
2020-07-14 20:53:40 +01:00
Andrei Dan 7dcdaeae49
Default to @timestamp in composable template datastream definition (#59317) (#59516)
This makes the data_stream timestamp field specification optional when
defining a composable template.
When there isn't one specified it will default to `@timestamp`.

(cherry picked from commit 5609353c5d164e15a636c22019c9c17fa98aac30)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2020-07-14 12:36:54 +01:00
David Kyle 054d5236d4 Mute RegressionIT failure (#59414)
For #59413
2020-07-13 14:12:19 +01:00
Dimitris Athanasiou d07b11b86b
[7.x][ML] Perform test inference on java (#58877) (#59298)
Since we are able to load the inference model
and perform inference in java, we no longer need
to rely on the analytics process to be performing
test inference on the docs that were not used for
training. The benefit is that we do not need to
send test docs and fit them in memory of the c++
process.

Backport of #58877

Co-authored-by: Dimitris Athanasiou <dimitris@elastic.co>

Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
2020-07-09 16:30:49 +03:00
Martijn van Groningen 17bd559253
Fix the timestamp field of a data stream to @timestamp (#59210)
Backport of #59076 to 7.x branch.

The commit makes the following changes:
* The timestamp field of a data stream definition in a composable
  index template can only be set to '@timestamp'.
* Removed custom data stream timestamp field validation and reuse the validation from `TimestampFieldMapper` and
  instead only check that the _timestamp field mapping has been defined on a backing index of a data stream.
* Moved code that injects _timestamp meta field mapping from `MetadataCreateIndexService#applyCreateIndexRequestWithV2Template58956(...)` method
  to `MetadataIndexTemplateService#collectMappings(...)` method.
* Fixed a bug (#58956) that cases timestamp field validation to be performed
  for each template and instead of the final mappings that is created.
* only apply _timestamp meta field if index is created as part of a data stream or data stream rollover,
this fixes a docs test, where a regular index creation matches (logs-*) with a template with a data stream definition.

Relates to #58642
Relates to #53100
Closes #58956
Closes #58583
2020-07-08 17:30:46 +02:00
Benjamin Trent e343e066fc
[7.x] [ML] prefer secondary auth headers on evaluate (#59167) (#59183)
* [ML] prefer secondary auth headers on evaluate (#59167)

We should prefer the secondary auth headers when evaluating a data frame
2020-07-07 15:34:47 -04:00
David Roberts e217f9a1e8
[ML] Wait for shards to initialize after creating ML internal indices (#59087)
There have been a few test failures that are likely caused by tests
performing actions that use ML indices immediately after the actions
that create those ML indices.  Currently this can result in attempts
to search the newly created index before its shards have initialized.

This change makes the method that creates the internal ML indices
that have been affected by this problem (state and stats) wait for
the shards to be initialized before returning.

Backport of #59027
2020-07-07 10:52:10 +01:00
Jake Landis 604c6dd528
7.x - Create plugin for yamlTest task (#56841) (#59090)
This commit creates a new Gradle plugin to provide a separate task name
and source set for running YAML based REST tests. The only project
converted to use the new plugin in this PR is distribution/archives/integ-test-zip.
For which the testing has been moved to :rest-api-spec since it makes the most
sense and it avoids a small but awkward change to the distribution plugin.

The remaining cases in modules, plugins, and x-pack will be handled in followups.

This plugin is distinctly different from the plugin introduced in #55896 since
the YAML REST tests are intended to be black box tests over HTTP. As such they
should not (by default) have access to the classpath for that which they are testing.

The YAML based REST tests will be moved to separate source sets (yamlRestTest).
The which source is the target for the test resources is dependent on if this
new plugin is applied. If it is not applied, it will default to the test source
set.

Further, this introduces a breaking change for plugin developers that
use the YAML testing framework. They will now need to either use the new source set
and matching task, or configure the rest resources to use the old "test" source set that
matches the old integTest task. (The former should be preferred).

As part of this change (which is also breaking for plugin developers) the
rest resources plugin has been removed from the build plugin and now requires
either explicit application or application via the new YAML REST test plugin.

Plugin developers should be able to fix the breaking changes to the YAML tests
by adding apply plugin: 'elasticsearch.yaml-rest-test' and moving the YAML tests
under a yamlRestTest folder (instead of test)
2020-07-06 14:16:26 -05:00
Martijn van Groningen f0dd9b4ace
Add data stream timestamp validation via metadata field mapper (#59002)
Backport of #58582 to 7.x branch.

This commit adds a new metadata field mapper that validates,
that a document has exactly a single timestamp value in the data stream timestamp field and
that the timestamp field mapping only has `type`, `meta` or `format` attributes configured.
Other attributes can affect the guarantee that an index with this meta field mapper has a
useable timestamp field.

The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever
a new backing index of a data stream is created.

Relates to #53100
2020-07-06 11:32:33 +02:00
David Kyle f6a0c2c59d
[7.x] Pipeline Inference Aggregation (#58965)
Adds a pipeline aggregation that loads a model and performs inference on the
input aggregation results.
2020-07-03 09:29:04 +01:00
Przemysław Witek 751e84e4c8
Rename regression evaluation metrics to make the names consistent with loss functions (#58887) (#58927) 2020-07-02 17:35:55 +02:00
Przemysław Witek 8e074c4495
Rename "error" field to "value" for consistency between metrics (#58726) (#58870) 2020-07-02 09:08:56 +02:00
Benjamin Trent c64e283dbf
[7.x] [ML] handles compressed model stream from native process (#58009) (#58836)
* [ML] handles compressed model stream from native process (#58009)

This moves model storage from handling the fully parsed JSON string to handling two separate types of documents.

1. ModelSizeInfo which contains model size information 
2. TrainedModelDefinitionChunk which contains a particular chunk of the compressed model definition string.

`model_size_info` is assumed to be handled first. This will generate the model_id and store the initial trained model config object. Then each chunk is assumed to be in correct order for concatenating the chunks to get a compressed definition.


Native side change: https://github.com/elastic/ml-cpp/pull/1349
2020-07-01 15:14:31 -04:00
Przemysław Witek 909649dd15
[7.x] Implement pseudo Huber loss (PseudoHuber) evaluation metric for regression analysis (#58734) (#58825) 2020-07-01 14:52:06 +02:00
Julie Tibshirani ab65a57d70
Merge mappings for composable index templates (#58709)
This PR implements recursive mapping merging for composable index templates.

When creating an index, we perform the following:
* Add each component template mapping in order, merging each one in after the
last.
* Merge in the index template mappings (if present).
* Merge in the mappings on the index request itself (if present).

Some principles:
* All 'structural' changes are disallowed (but everything else is fine). An
object mapper can never be changed between `type: object` and `type: nested`. A
field mapper can never be changed to an object mapper, and vice versa.
* Generally, each section is merged recursively. This includes `object`
mappings, as well as root options like `dynamic_templates` and `meta`. Once we
reach 'leaf components' like field definitions, they always overwrite an
existing one instead of being merged.

Relates to #53101.
2020-06-30 08:01:37 -07:00
David Roberts d9e0e0bf95
[ML] Pass through the stop-on-warn setting for categorization jobs (#58738)
When per_partition_categorization.stop_on_warn is set for an analysis
config it is now passed through to the autodetect C++ process.

Also adds some end-to-end tests that exercise the functionality
added in elastic/ml-cpp#1356

Backport of #58632
2020-06-30 15:17:04 +01:00
Rene Groeschke d952b101e6
Replace compile configuration usage with api (7.x backport) (#58721)
* Replace compile configuration usage with api (#58451)

- Use java-library instead of plugin to allow api configuration usage
- Remove explicit references to runtime configurations in dependency declarations
- Make test runtime classpath input for testing convention
  - required as java library will by default not have build jar file
  - jar file is now explicit input of the task and gradle will ensure its properly build

* Fix compile usages in 7.x branch
2020-06-30 15:57:41 +02:00
Przemysław Witek 9ea9b7bd3b
[7.x] Implement MSLE (MeanSquaredLogarithmicError) evaluation metric for regression analysis (#58684) (#58731) 2020-06-30 14:09:11 +02:00
Przemysław Witek 3f7c45472e
[7.x] Introduce DataFrameAnalyticsConfig update API (#58302) (#58648) 2020-06-29 10:56:11 +02:00