OpenSearch

Commit Graph

Author	SHA1	Message	Date
Przemysław Witek	acbd48f834	[ML] Allow setting num_top_classes to a special value -1 (#63587 ) (#63602 )	2020-10-13 13:57:50 +02:00
Przemysław Witek	4366d58564	[7.x] [ML] Implement AucRoc metric for classification (#60502 ) (#63051 )	2020-09-30 12:55:52 +02:00
Jake Landis	5b7246157f	[7.x] Fix projects that failed to build within Intellij (#62258 ) (#62408 ) This commit address some build failures from the perspective of Intellij. These changes include: * changing an order of a dependency definition that seems to can cause Intellij build to fail. * introduction of an abstract class out of the test source set (seems to be an issue sharing classes cross projects with non-standard source sets. * a couple of missing dependency definitions (not sure how the command line worked prior to this)	2020-09-17 17:45:12 -05:00
Dimitris Athanasiou	d37f197efd	[7.x][ML] Allow training_percent to be any positive double up to hundred (#61977 ) (#61990 ) This changes the valid range of `training_percent` for regression and classification from [1, 100] to (0, 100]. Backport of #61977	2020-09-04 17:34:14 +03:00
Jake Landis	794aac717d	[7.x] Convert first 1/2 x-pack plugins from integTest to [yaml \| java]RestTest or internalClusterTest (#60630 ) (#61855 ) For 1/2 the plugins in x-pack, the integTest task is now a no-op and all of the tests are now executed via a test, yamlRestTest, javaRestTest, or internalClusterTest. This includes the following projects: async-search, autoscaling, ccr, enrich, eql, frozen-indicies, data-streams, graph, ilm, mapper-constant-keyword, mapper-flattened, ml A few of the more specialized qa projects within these plugins have not been changed with this PR due to additional complexity which should be addressed separately. A follow up PR will address the remaining x-pack plugins (this PR is big enough as-is). related: #61802 related: #56841 related: #59939 related: #55896	2020-09-02 11:19:24 -05:00
Przemysław Witek	0afa1bd972	Deprecate allow_no_jobs and allow_no_datafeeds in favor of allow_no_match (#60601 ) (#60727 )	2020-08-05 13:39:40 +02:00
Rene Groeschke	bdd7347bbf	Merge test runner task into RestIntegTest (7.x backport) (#60600 ) * Merge test runner task into RestIntegTest (#60261) * Merge test runner task into RestIntegTest * Reorganizing Standalone runner and RestIntegTest task * Rework general test task configuration and extension * Fix merge issues * use former 7.x common test configuration	2020-08-04 14:46:32 +02:00
David Roberts	89466eefa5	Don't require separate privilege for internal detail of put pipeline (#60190 ) Putting an ingest pipeline used to require that the user calling it had permission to get nodes info as well as permission to manage ingest. This was due to an internal implementaton detail that was not visible to the end user. This change alters the behaviour so that a user with the manage_pipeline cluster privilege can put an ingest pipeline regardless of whether they have the separate privilege to get nodes info. The internal implementation detail now runs as the internal _xpack user when security is enabled. Backport of #60106	2020-07-27 10:44:48 +01:00
Przemysław Witek	283a1f605c	Rename binary_soft_classification evaluation to outlier_detection (#59951 ) (#59970 )	2020-07-21 15:15:04 +02:00
Albert Zaharovits	4eb310c777	Disallow mapping updates for doc ingestion privileges (#58784 ) The `create_doc`, `create`, `write` and `index` privileges do not grant the PutMapping action anymore. Apart from the `write` privilege, the other three privileges also do NOT grant (auto) updating the mapping when ingesting a document with unmapped fields, according to the templates. In order to maintain the BWC in the 7.x releases, the above privileges will still grant the Put and AutoPutMapping actions, but only when the "index" entity is an alias or a concrete index, but not a data stream or a backing index of a data stream.	2020-07-14 23:39:41 +03:00
David Kyle	0d2ea1b881	Check for ml privilege when using the Inference Aggregation (#59530 ) (#59562 ) The inference pipeline aggregation requires the user has permission to access the ml get trained models endpoint (_ml/inference/)	2020-07-14 20:53:40 +01:00
Jake Landis	604c6dd528	7.x - Create plugin for yamlTest task (#56841 ) (#59090 ) This commit creates a new Gradle plugin to provide a separate task name and source set for running YAML based REST tests. The only project converted to use the new plugin in this PR is distribution/archives/integ-test-zip. For which the testing has been moved to :rest-api-spec since it makes the most sense and it avoids a small but awkward change to the distribution plugin. The remaining cases in modules, plugins, and x-pack will be handled in followups. This plugin is distinctly different from the plugin introduced in #55896 since the YAML REST tests are intended to be black box tests over HTTP. As such they should not (by default) have access to the classpath for that which they are testing. The YAML based REST tests will be moved to separate source sets (yamlRestTest). The which source is the target for the test resources is dependent on if this new plugin is applied. If it is not applied, it will default to the test source set. Further, this introduces a breaking change for plugin developers that use the YAML testing framework. They will now need to either use the new source set and matching task, or configure the rest resources to use the old "test" source set that matches the old integTest task. (The former should be preferred). As part of this change (which is also breaking for plugin developers) the rest resources plugin has been removed from the build plugin and now requires either explicit application or application via the new YAML REST test plugin. Plugin developers should be able to fix the breaking changes to the YAML tests by adding apply plugin: 'elasticsearch.yaml-rest-test' and moving the YAML tests under a yamlRestTest folder (instead of test)	2020-07-06 14:16:26 -05:00
David Kyle	f6a0c2c59d	[7.x] Pipeline Inference Aggregation (#58965 ) Adds a pipeline aggregation that loads a model and performs inference on the input aggregation results.	2020-07-03 09:29:04 +01:00
Rene Groeschke	01e9126588	Remove deprecated usage of testCompile configuration (#57921 ) (#58083 ) * Remove usage of deprecated testCompile configuration * Replace testCompile usage by testImplementation * Make testImplementation non transitive by default (as we did for testCompile) * Update CONTRIBUTING about using testImplementation for test dependencies * Fail on testCompile configuration usage	2020-06-14 22:30:44 +02:00
David Roberts	7aa0daaabd	[7.x][ML] More advanced model snapshot retention options (#56194 ) This PR implements the following changes to make ML model snapshot retention more flexible in advance of adding a UI for the feature in an upcoming release. - The default for `model_snapshot_retention_days` for new jobs is now 10 instead of 1 - There is a new job setting, `daily_model_snapshot_retention_after_days`, that defaults to 1 for new jobs and `model_snapshot_retention_days` for pre-7.8 jobs - For days that are older than `model_snapshot_retention_days`, all model snapshots are deleted as before - For days that are in between `daily_model_snapshot_retention_after_days` and `model_snapshot_retention_days` all but the first model snapshot for that day are deleted - The `retain` setting of model snapshots is still respected to allow selected model snapshots to be retained indefinitely Backport of #56125	2020-05-05 14:31:58 +01:00
Dimitris Athanasiou	75dadb7a6d	[7.x][ML] Add loss_function to regression (#56118 ) (#56187 ) Adds parameters `loss_function` and `loss_function_parameter` to regression. Backport of #56118	2020-05-05 14:59:51 +03:00
Martijn van Groningen	6d03081560	Add auto create action (#56122 ) Backport of #55858 to 7.x branch. Currently the TransportBulkAction detects whether an index is missing and then decides whether it should be auto created. The coordination of the index creation also happens in the TransportBulkAction on the coordinating node. This change adds a new transport action that the TransportBulkAction delegates to if missing indices need to be created. The reasons for this change: * Auto creation of data streams can't occur on the coordinating node. Based on the index template (v2) either a regular index or a data stream should be created. However if the coordinating node is slow in processing cluster state updates then it may be unaware of the existence of certain index templates, which then can load to the TransportBulkAction creating an index instead of a data stream. Therefor the coordination of creating an index or data stream should occur on the master node. See #55377 * From a security perspective it is useful to know whether index creation originates from the create index api or from auto creating a new index via the bulk or index api. For example a user would be allowed to auto create an index, but not to use the create index api. The auto create action will allow security to distinguish these two different patterns of index creation. This change adds the following new transport actions: AutoCreateAction, the TransportBulkAction redirects to this action and this action will actually create the index (instead of the TransportCreateIndexAction). Later via #55377, can improve the AutoCreateAction to also determine whether an index or data stream should be created. The create_index index privilege is also modified, so that if this permission is granted then a user is also allowed to auto create indices. This change does not yet add an auto_create index privilege. A future change can introduce this new index privilege or modify an existing index / write index privilege. Relates to #53100	2020-05-04 19:10:09 +02:00
Dimitris Athanasiou	d9685a0f19	[7.x][ML] Validate at least one feature is available for DF analytics (#55876 ) (#55914 ) We were previously checking at least one supported field existed when the _explain API was called. However, in the case of analyses with required fields (e.g. regression) we were not accounting that the dependent variable is not a feature and thus if the source index only contains the dependent variable field there are no features to train a model on. This commit adds a validation that at least one feature is available for analysis. Note that we also move that validation away from `ExtractedFieldsDetector` and the _explain API and straight into the _start API. The reason for doing this is to allow the user to use the _explain API in order to understand why they would be seeing an error like this one. For example, the user might be using an index that has fields but they are of unsupported types. If they start the job and get an error that there are no features, they will wonder why that is. Calling the _explain API will show them that all their fields are unsupported. If the _explain API was failing instead, there would be no way for the user to understand why all those fields are ignored. Closes #55593 Backport of #55876	2020-04-29 11:39:58 +03:00
Benjamin Trent	4a1610265f	[7.x] [ML] add new inference_config field to trained model config (#54421 ) (#54647 ) * [ML] add new inference_config field to trained model config (#54421) A new field called `inference_config` is now added to the trained model config object. This new field allows for default inference settings from analytics or some external model builder. The inference processor can still override whatever is set as the default in the trained model config. * fixing for backport	2020-04-02 12:25:10 -04:00
David Roberts	b8f06df53f	[ML] Fix bug, add tests, improve estimates for estimate_model_memory (#54508 ) This PR: 1. Fixes the bug where a cardinality estimate of zero could cause a 500 status 2. Adds tests for that scenario and a few others 3. Adds sensible estimates for the cases that were previously TODO Backport of #54462	2020-03-31 17:59:38 +01:00
Dimitris Athanasiou	cc981fa377	[7.x][ML] Get ML filters size should default to 100 (#54207 ) (#54278 ) When get filters is called without setting the `size` paramter only up to 10 filters are returned. However, 100 filters should be returned. This commit fixes this and adds an integ test to guard it. It seems this was accidentally broken in #39976. Closes #54206 Backport of #54207	2020-03-26 17:51:43 +02:00
Jake Landis	db3420d757	[7.x] Optimize which Rest resources are used by the Rest tests… (#53766 ) This should help with Gradle's incremental compile such that projects only depend upon the resources they use. related #52114	2020-03-19 12:28:59 -05:00
Benjamin Trent	4e43ede735	[ML] renaming inference processor field field_mappings to new name field_map (#53433 ) (#53502 ) This renames the `inference` processor configuration field `field_mappings` to `field_map`. `field_mappings` is now deprecated.	2020-03-13 15:40:57 -04:00
Dimitris Athanasiou	0fd0516d0d	[7.x][ML] Rename data frame analytics maximum_number_trees to max_trees (#53300 ) (#53390 ) Deprecates `maximum_number_trees` parameter of classification and regression and replaces it with `max_trees`. Backport of #53300	2020-03-11 12:45:27 +02:00
Jake Landis	b4179a8814	[7.x] Refactor watcher tests (#52799 ) (#52844 ) This PR moves the majority of the Watcher REST tests under the Watcher x-pack plugin. Specifically, moves the Watcher tests from: x-pack/plugin/test x-pack/qa/smoke-test-watcher x-pack/qa/smoke-test-watcher-with-security x-pack/qa/smoke-test-monitoring-with-watcher to: x-pack/plugin/watcher/qa/rest (/test and /qa/smoke-test-watcher) x-pack/plugin/watcher/qa/with-security x-pack/plugin/watcher/qa/with-monitoring Additionally, this disables Watcher from the main x-pack test cluster and consolidates the stop/start logic for the tests listed. No changes to the tests (beyond moving them) are included. 3rd party tests and doc tests (which also touch Watcher) are not included in the changes here.	2020-02-26 15:57:10 -06:00
Jake Landis	8d311297ca	[7.x] Smarter copying of the rest specs and tests (#52114 ) (#52798 ) * Smarter copying of the rest specs and tests (#52114) This PR addresses the unnecessary copying of the rest specs and allows for better semantics for which specs and tests are copied. By default the rest specs will get copied if the project applies `elasticsearch.standalone-rest-test` or `esplugin` and the project has rest tests or you configure the custom extension `restResources`. This PR also removes the need for dozens of places where the x-pack specs were copied by supporting copying of the x-pack rest specs too. The plugin/task introduced here can also copy the rest tests to the local project through a similar configuration. The new plugin/task allows a user to minimize the surface area of which rest specs are copied. Per project can be configured to include only a subset of the specs (or tests). Configuring a project to only copy the specs when actually needed should help with build cache hit rates since we can better define what is actually in use. However, project level optimizations for build cache hit rates are not included with this PR. Also, with this PR you can no longer use the includePackaged flag on integTest task. The following items are included in this PR: * new plugin: `elasticsearch.rest-resources` * new tasks: CopyRestApiTask and CopyRestTestsTask - performs the copy * new extension 'restResources' ``` restResources { restApi { includeCore 'foo' , 'bar' //will include the core specs that start with foo and bar includeXpack 'baz' //will include x-pack specs that start with baz } restTests { includeCore 'foo', 'bar' //will include the core tests that start with foo and bar includeXpack 'baz' //will include the x-pack tests that start with baz } } ```	2020-02-26 08:13:41 -06:00
David Kyle	7bbe5c8464	[Ml] Validate tree feature index is within range (#52514 ) This changes the tree validation code to ensure no node in the tree has a feature index that is beyond the bounds of the feature_names array. Specifically this handles the situation where the C++ emits a tree containing a single node and an empty feature_names list. This is valid tree used to centre the data in the ensemble but the validation code would reject this as feature_names is empty. This meant a broken workflow as you cannot GET the model and PUT it back	2020-02-19 14:41:43 +00:00
Benjamin Trent	fc994d9ce1	[ML][Inference] Adds validations for model PUT (#51376 ) (#51409 ) Adds validations making sure that * `input.field_names` is not empty * `ensemble.trained_models` is not empty * `tree.feature_names` is not empty closes https://github.com/elastic/elasticsearch/issues/51354	2020-01-24 09:29:12 -05:00
Benjamin Trent	fa116a6d26	[7.x] [ML][Inference] PUT API (#50852 ) (#50887 ) * [ML][Inference] PUT API (#50852) This adds the `PUT` API for creating trained models that support our format. This includes * HLRC change for the API * API creation * Validations of model format and call * fixing backport	2020-01-12 10:59:11 -05:00
Dimitris Athanasiou	422422a2bc	[7.x][ML] Reuse SourceDestValidator for data frame analytics (#50841 ) (#50850 ) This commit removes validation logic of source and dest indices for data frame analytics and replaces it with using the common `SourceDestValidator` class which is already used by transforms. This way the validations and their messages become consistent while we reduce code. This means that where these validations fail the error messages will be slightly different for data frame analytics. Backport of #50841	2020-01-10 14:24:13 +02:00
Dimitris Athanasiou	4edb2e7bb6	[7.x][ML] Add optional source filtering during data frame reindexing (#49690 ) (#49718 ) This adds a `_source` setting under the `source` setting of a data frame analytics config. The new `_source` is reusing the structure of a `FetchSourceContext` like `analyzed_fields` does. Specifying includes and excludes for source allows selecting which fields will get reindexed and will be available in the destination index. Closes #49531 Backport of #49690	2019-11-29 16:10:44 +02:00
Dimitris Athanasiou	8eaee7cbdc	[7.x][ML] Explain data frame analytics API (#49455 ) (#49504 ) This commit replaces the _estimate_memory_usage API with a new API, the _explain API. The API consolidates information that is useful before creating a data frame analytics job. It includes: - memory estimation - field selection explanation Memory estimation is moved here from what was previously calculated in the _estimate_memory_usage API. Field selection is a new feature that explains to the user whether each available field was selected to be included or not in the analysis. In the case it was not included, it also explains the reason why. Backport of #49455	2019-11-22 22:06:10 +02:00
Benjamin Trent	eefe7688ce	[7.x][ML] ML Model Inference Ingest Processor (#49052 ) (#49257 ) * [ML] ML Model Inference Ingest Processor (#49052) * [ML][Inference] adds lazy model loader and inference (#47410) This adds a couple of things: - A model loader service that is accessible via transport calls. This service will load in models and cache them. They will stay loaded until a processor no longer references them - A Model class and its first sub-class LocalModel. Used to cache model information and run inference. - Transport action and handler for requests to infer against a local model Related Feature PRs: * [ML][Inference] Adjust inference configuration option API (#47812) * [ML][Inference] adds logistic_regression output aggregator (#48075) * [ML][Inference] Adding read/del trained models (#47882) * [ML][Inference] Adding inference ingest processor (#47859) * [ML][Inference] fixing classification inference for ensemble (#48463) * [ML][Inference] Adding model memory estimations (#48323) * [ML][Inference] adding more options to inference processor (#48545) * [ML][Inference] handle string values better in feature extraction (#48584) * [ML][Inference] Adding _stats endpoint for inference (#48492) * [ML][Inference] add inference processors and trained models to usage (#47869) * [ML][Inference] add new flag for optionally including model definition (#48718) * [ML][Inference] adding license checks (#49056) * [ML][Inference] Adding memory and compute estimates to inference (#48955) * fixing version of indexed docs for model inference	2019-11-18 13:19:17 -05:00
Przemysław Witek	150db2b544	Throw an exception when memory usage estimation endpoint encounters empty data frame. (#49143 ) (#49164 )	2019-11-18 07:52:57 +01:00
Rory Hunter	c46a0e8708	Apply 2-space indent to all gradle scripts (#49071 ) Backport of #48849. Update `.editorconfig` to make the Java settings the default for all files, and then apply a 2-space indent to all `*.gradle` files. Then reformat all the files.	2019-11-14 11:01:23 +00:00
Dimitris Athanasiou	7667ea5f6f	[7.x][ML] Additional outlier detection parameters (#47600 ) (#47669 ) Adds the following parameters to `outlier_detection`: - `compute_feature_influence` (boolean): whether to compute or not feature influence scores - `outlier_fraction` (double): the proportion of the data set assumed to be outlying prior to running outlier detection - `standardization_enabled` (boolean): whether to apply standardization to the feature values Backport of #47600	2019-10-07 18:21:33 +03:00
Przemysław Witek	ee952da2e2	[7.x] Implement evaluation API for multiclass classification problem (#47126 ) (#47343 )	2019-10-04 17:54:51 +02:00
Przemysław Witek	ec9b77deaa	[7.x] Implement new analysis type: classification (#46537 ) (#47559 )	2019-10-04 13:47:19 +02:00
Dimitris Athanasiou	873ad3f942	[7.x][ML] Add option to regression to randomize training set (#45969 ) (#46017 ) Adds a parameter `training_percent` to regression. The default value is `100`. When the parameter is set to a value less than `100`, from the rows that can be used for training (ie. those that have a value for the dependent variable) we randomly choose whether to actually use for training. This enables splitting the data into a training set and the rest, usually called testing, validation or holdout set, which allows for validating the model on data that have not been used for training. Technically, the analytics process considers as training the data that have a value for the dependent variable. Thus, when we decide a training row is not going to be used for training, we simply clear the row's dependent variable.	2019-08-27 17:53:11 +03:00
Benjamin Trent	a3a4ae0ac2	[ML] fixing bug where analytics process starts with 0 rows (#45879 ) (#45988 ) The native process requires that there be a non-zero number of rows to analyze. If the flag --rows 0 is passed to the executable, it throws and does not start. When building the configuration for the process we should not start the native process if there are no rows. Adding some logging to indicate what is occurring.	2019-08-26 14:18:17 -05:00
Benjamin Trent	ba7b677618	[ML] better handle empty results when evaluating regression (#45745 ) (#45759 ) * [ML] better handle empty results when evaluating regression * adding new failure test to ml_security black list * fixing equality check for regression results	2019-08-20 17:37:04 -05:00
Przemysław Witek	1aed388a24	Add view_index_metadata to roles.yml and remove as many df analytics test cases from build.gradle blacklist as possible. (#45451 ) (#45465 )	2019-08-13 08:31:58 +02:00
Dimitris Athanasiou	27497ff75f	[7.x][ML] Add regression analysis to DF analytics (#45292 ) (#45388 ) This commit adds a first draft of a regression analysis to data frame analytics. There is high probability that the exact syntax might change. This commit adds the new analysis type and its parameters as well as appropriate validation. It also modifies the extractor and the fields detector to be able to handle categorical fields as regression analysis supports them.	2019-08-09 19:31:13 +03:00
Dimitris Athanasiou	8a6675b994	[7.x][ML] Check dest index is empty when starting DF analytics (#45094 ) (#45112 ) If one tries to start a DF analytics job that has already run, the result will be that the task will fail after reindexing the dest index from the source index. The results of the prior run will be gone and the task state is not properly set to failed with the failure reason. This commit improves the behavior in this scenario. First, we set the task state to `failed` in a set of failures that were missed. Second, a validation is added that if the destination index exists, it must be empty.	2019-08-02 00:19:48 +03:00
Andrei Stefan	2633d11eb7	Switch from using docvalue_fields to extracting values from _source (#44062 ) (#44804 ) * Switch from using docvalue_fields to extracting values from _source where applicable. Doing this means parsing the _source and handling the numbers parsing just like Elasticsearch is doing it when it's indexing a document. * This also introduces a minor limitation: aliases type of fields that are NOT part of a tree of sub-fields will not be able to be retrieved anymore. field_caps API doesn't shed any light into a field being an alias or not and at _source parsing time there is no way to know if a root field is an alias or not. Fields of the type "a.b.c.alias" can be extracted from docvalue_fields, only if the field they point to can be extracted from docvalue_fields. Also, not all fields in a hierarchy of fields can be evaluated to being an alias. (cherry picked from commit 8bf8a055e38f00df5f49c8d97f632f69d6e00c2c)	2019-07-25 10:02:41 +03:00
Ryan Ernst	7e06888bae	Convert testclusters to use distro download plugin (#44253 ) (#44362 ) Test clusters currently has its own set of logic for dealing with finding different versions of Elasticsearch, downloading them, and extracting them. This commit converts testclusters to use the DistributionDownloadPlugin.	2019-07-15 17:53:05 -07:00
Benjamin Trent	c82d9c5b50	[ML] Adds support for regression.mean_squared_error to eval API (#44140 ) (#44218 ) * [ML] Adds support for regression.mean_squared_error to eval API * addressing PR comments * fixing tests	2019-07-11 09:22:52 -05:00
Dimitris Athanasiou	cab879118d	[7.x][ML] Support multiple source indices for df-analytics (#43702 ) (#43731 ) This commit adds support for multiple source indices. In order to deal with multiple indices having different mappings, it attempts a best-effort approach to merge the mappings assuming there are no conflicts. In case conflicts exists an error will be returned. To allow users creating custom mappings for special use cases, the destination index is now allowed to exist before the analytics job runs. In addition, settings are no longer copied except for the `index.number_of_shards` and `index.number_of_replicas`.	2019-06-28 13:28:03 +03:00
Przemysław Witek	94f18da5df	Add version and create_time to data frame analytics config (#43683 ) (#43712 )	2019-06-28 07:37:21 +02:00
Dimitris Athanasiou	126c2fd2d5	[7.x][ML] Machine learning data frame analytics (#43544 ) (#43592 ) This merges the initial work that adds a framework for performing machine learning analytics on data frames. The feature is currently experimental and requires a platinum license. Note that the original commits can be found in the `feature-ml-data-frame-analytics` branch. A new set of APIs is added which allows the creation of data frame analytics jobs. Configuration allows specifying different types of analysis to be performed on a data frame. At first there is support for outlier detection. The APIs are: - PUT _ml/data_frame/analysis/{id} - GET _ml/data_frame/analysis/{id} - GET _ml/data_frame/analysis/{id}/_stats - POST _ml/data_frame/analysis/{id}/_start - POST _ml/data_frame/analysis/{id}/_stop - DELETE _ml/data_frame/analysis/{id} When a data frame analytics job is started a persistent task is created and started. The main steps of the task are: 1. reindex the source index into the dest index 2. analyze the data through the data_frame_analyzer c++ process 3. merge the results of the process back into the destination index In addition, an evaluation API is added which packages commonly used metrics that provide evaluation of various analysis: - POST _ml/data_frame/_evaluate	2019-06-25 20:29:11 +03:00

1 2

62 Commits