OpenSearch

Commit Graph

Author	SHA1	Message	Date
David Kyle	0d2ea1b881	Check for ml privilege when using the Inference Aggregation (#59530 ) (#59562 ) The inference pipeline aggregation requires the user has permission to access the ml get trained models endpoint (_ml/inference/)	2020-07-14 20:53:40 +01:00
Andrei Dan	7dcdaeae49	Default to @timestamp in composable template datastream definition (#59317 ) (#59516 ) This makes the data_stream timestamp field specification optional when defining a composable template. When there isn't one specified it will default to `@timestamp`. (cherry picked from commit 5609353c5d164e15a636c22019c9c17fa98aac30) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-07-14 12:36:54 +01:00
David Kyle	054d5236d4	Mute RegressionIT failure (#59414 ) For #59413	2020-07-13 14:12:19 +01:00
Dimitris Athanasiou	d07b11b86b	[7.x][ML] Perform test inference on java (#58877 ) (#59298 ) Since we are able to load the inference model and perform inference in java, we no longer need to rely on the analytics process to be performing test inference on the docs that were not used for training. The benefit is that we do not need to send test docs and fit them in memory of the c++ process. Backport of #58877 Co-authored-by: Dimitris Athanasiou <dimitris@elastic.co> Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>	2020-07-09 16:30:49 +03:00
Martijn van Groningen	17bd559253	Fix the timestamp field of a data stream to @timestamp (#59210 ) Backport of #59076 to 7.x branch. The commit makes the following changes: * The timestamp field of a data stream definition in a composable index template can only be set to '@timestamp'. * Removed custom data stream timestamp field validation and reuse the validation from `TimestampFieldMapper` and instead only check that the _timestamp field mapping has been defined on a backing index of a data stream. * Moved code that injects _timestamp meta field mapping from `MetadataCreateIndexService#applyCreateIndexRequestWithV2Template58956(...)` method to `MetadataIndexTemplateService#collectMappings(...)` method. * Fixed a bug (#58956) that causes timestamp field validation to be performed for each template instead of for the final mappings that are created. * Only apply _timestamp meta field if index is created as part of a data stream or data stream rollover, this fixes a docs test, where a regular index creation matches (logs-*) with a template with a data stream definition. Relates to #58642 Relates to #53100 Closes #58956 Closes #58583	2020-07-08 17:30:46 +02:00
Benjamin Trent	e343e066fc	[7.x] [ML] prefer secondary auth headers on evaluate (#59167 ) (#59183 ) * [ML] prefer secondary auth headers on evaluate (#59167) We should prefer the secondary auth headers when evaluating a data frame	2020-07-07 15:34:47 -04:00
David Roberts	e217f9a1e8	[ML] Wait for shards to initialize after creating ML internal indices (#59087 ) There have been a few test failures that are likely caused by tests performing actions that use ML indices immediately after the actions that create those ML indices. Currently this can result in attempts to search the newly created index before its shards have initialized. This change makes the method that creates the internal ML indices that have been affected by this problem (state and stats) wait for the shards to be initialized before returning. Backport of #59027	2020-07-07 10:52:10 +01:00
Jake Landis	604c6dd528	7.x - Create plugin for yamlTest task (#56841 ) (#59090 ) This commit creates a new Gradle plugin to provide a separate task name and source set for running YAML based REST tests. The only project converted to use the new plugin in this PR is distribution/archives/integ-test-zip. For which the testing has been moved to :rest-api-spec since it makes the most sense and it avoids a small but awkward change to the distribution plugin. The remaining cases in modules, plugins, and x-pack will be handled in followups. This plugin is distinctly different from the plugin introduced in #55896 since the YAML REST tests are intended to be black box tests over HTTP. As such they should not (by default) have access to the classpath for that which they are testing. The YAML based REST tests will be moved to separate source sets (yamlRestTest). Which source set is the target for the test resources depends on whether this new plugin is applied. If it is not applied, it will default to the test source set. Further, this introduces a breaking change for plugin developers that use the YAML testing framework. They will now need to either use the new source set and matching task, or configure the rest resources to use the old "test" source set that matches the old integTest task. (The former should be preferred). As part of this change (which is also breaking for plugin developers) the rest resources plugin has been removed from the build plugin and now requires either explicit application or application via the new YAML REST test plugin. Plugin developers should be able to fix the breaking changes to the YAML tests by adding apply plugin: 'elasticsearch.yaml-rest-test' and moving the YAML tests under a yamlRestTest folder (instead of test)	2020-07-06 14:16:26 -05:00
Martijn van Groningen	f0dd9b4ace	Add data stream timestamp validation via metadata field mapper (#59002 ) Backport of #58582 to 7.x branch. This commit adds a new metadata field mapper that validates, that a document has exactly a single timestamp value in the data stream timestamp field and that the timestamp field mapping only has `type`, `meta` or `format` attributes configured. Other attributes can affect the guarantee that an index with this meta field mapper has a useable timestamp field. The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever a new backing index of a data stream is created. Relates to #53100	2020-07-06 11:32:33 +02:00
David Kyle	f6a0c2c59d	[7.x] Pipeline Inference Aggregation (#58965 ) Adds a pipeline aggregation that loads a model and performs inference on the input aggregation results.	2020-07-03 09:29:04 +01:00
Przemysław Witek	751e84e4c8	Rename regression evaluation metrics to make the names consistent with loss functions (#58887 ) (#58927 )	2020-07-02 17:35:55 +02:00
Przemysław Witek	8e074c4495	Rename "error" field to "value" for consistency between metrics (#58726 ) (#58870 )	2020-07-02 09:08:56 +02:00
Benjamin Trent	c64e283dbf	[7.x] [ML] handles compressed model stream from native process (#58009 ) (#58836 ) * [ML] handles compressed model stream from native process (#58009) This moves model storage from handling the fully parsed JSON string to handling two separate types of documents. 1. ModelSizeInfo which contains model size information 2. TrainedModelDefinitionChunk which contains a particular chunk of the compressed model definition string. `model_size_info` is assumed to be handled first. This will generate the model_id and store the initial trained model config object. Then each chunk is assumed to be in correct order for concatenating the chunks to get a compressed definition. Native side change: https://github.com/elastic/ml-cpp/pull/1349	2020-07-01 15:14:31 -04:00
Przemysław Witek	909649dd15	[7.x] Implement pseudo Huber loss (PseudoHuber) evaluation metric for regression analysis (#58734 ) (#58825 )	2020-07-01 14:52:06 +02:00
Julie Tibshirani	ab65a57d70	Merge mappings for composable index templates (#58709 ) This PR implements recursive mapping merging for composable index templates. When creating an index, we perform the following: * Add each component template mapping in order, merging each one in after the last. * Merge in the index template mappings (if present). * Merge in the mappings on the index request itself (if present). Some principles: * All 'structural' changes are disallowed (but everything else is fine). An object mapper can never be changed between `type: object` and `type: nested`. A field mapper can never be changed to an object mapper, and vice versa. * Generally, each section is merged recursively. This includes `object` mappings, as well as root options like `dynamic_templates` and `meta`. Once we reach 'leaf components' like field definitions, they always overwrite an existing one instead of being merged. Relates to #53101.	2020-06-30 08:01:37 -07:00
David Roberts	d9e0e0bf95	[ML] Pass through the stop-on-warn setting for categorization jobs (#58738 ) When per_partition_categorization.stop_on_warn is set for an analysis config it is now passed through to the autodetect C++ process. Also adds some end-to-end tests that exercise the functionality added in elastic/ml-cpp#1356 Backport of #58632	2020-06-30 15:17:04 +01:00
Rene Groeschke	d952b101e6	Replace compile configuration usage with api (7.x backport) (#58721 ) * Replace compile configuration usage with api (#58451) - Use java-library instead of plugin to allow api configuration usage - Remove explicit references to runtime configurations in dependency declarations - Make test runtime classpath input for testing convention - required as java library will by default not have build jar file - jar file is now explicit input of the task and gradle will ensure its properly build * Fix compile usages in 7.x branch	2020-06-30 15:57:41 +02:00
Przemysław Witek	9ea9b7bd3b	[7.x] Implement MSLE (MeanSquaredLogarithmicError) evaluation metric for regression analysis (#58684 ) (#58731 )	2020-06-30 14:09:11 +02:00
Przemysław Witek	3f7c45472e	[7.x] Introduce DataFrameAnalyticsConfig update API (#58302 ) (#58648 )	2020-06-29 10:56:11 +02:00
Benjamin Trent	7a202b149e	Muting analytics tests (#58617 ) (#58618 )	2020-06-26 16:50:59 -04:00
Benjamin Trent	add8ff1ad3	[ML] assume data streams are enabled in data stream tests (#58502 ) (#58508 )	2020-06-24 14:14:48 -04:00
Przemysław Witek	551b8bcd73	[7.x] Use static methods (rather than constants) to obtain .ml-meta and .ml-config index names (#58484 ) (#58490 )	2020-06-24 15:52:45 +02:00
Luca Cavanna	dbbf2772d8	Mute newly added ml data streams tests (#58492 ) Relates to #58491	2020-06-24 15:11:40 +02:00
Benjamin Trent	a9b868b7a9	[7.x] [ML] allow data streams to be expanded for analytics and transforms (#58280 ) (#58455 ) This commit allows data streams to be a valid source for analytics and transforms. Data streams are fairly transparent and our `_search` and `_reindex` actions work without error. For `_transforms` the check-pointing works as desired as well. Data streams are effectively treated as an `alias` and the backing index values are stored within checkpointing information.	2020-06-23 14:40:35 -04:00
David Roberts	0d6bfd0ac3	[7.x][ML] Fix wire serialization for flush acknowledgements (#58443 ) There was a discrepancy in the implementation of flush acknowledgements: most of the class was designed on the basis that the "last finalized bucket time" could be null but the wire serialization assumed that it was never null. This works because, the C++ sends zero "last finalized bucket time" when it is not known or not relevant. But then the Java code will print that to XContent as it is assuming null represents not known or not relevant. This change corrects the discrepancies. Internally within the class null represents not known or not relevant, but this is translated from/to 0 for communications from the C++ and old nodes that have the bug. Additionally I switched from Date to Instant for this class and made the member variables final to modernise it a bit. Backport of #58413	2020-06-23 16:42:06 +01:00
Benjamin Trent	bf8641aa15	[7.x] [ML] calculate cache misses for inference and return in stats (#58252 ) (#58363 ) When a local model is constructed, the cache miss count is incremented. When a user calls _stats, we will include the sum of the cache miss counts across ALL nodes. This statistic is important when comparing against the inference_count. If the cache miss count is near the inference_count it indicates that the cache is overburdened, or inappropriately configured.	2020-06-19 09:46:51 -04:00
Przemysław Witek	9dd3d5aa48	[7.x] Delete auto-generated annotations when model snapshot is reverted (#58240 ) (#58335 )	2020-06-18 17:59:52 +02:00
Jason Tedor	b78b3edeea	Upgrade to JNA 5.5.0 (#58183 ) This commit bumps our JNA dependency from 4.5.1 to 5.5.0, so that we are now on the latest maintained line, and pick up a large collection of bug fixes that have accumulated.	2020-06-17 07:35:08 -04:00
Przemysław Witek	b22e91cefc	[7.x] Delete auto-generated annotations when job is deleted. (#58169 ) (#58219 )	2020-06-17 09:17:20 +02:00
Rene Groeschke	01e9126588	Remove deprecated usage of testCompile configuration (#57921 ) (#58083 ) * Remove usage of deprecated testCompile configuration * Replace testCompile usage by testImplementation * Make testImplementation non transitive by default (as we did for testCompile) * Update CONTRIBUTING about using testImplementation for test dependencies * Fail on testCompile configuration usage	2020-06-14 22:30:44 +02:00
Valeriy Khakhutskyy	c0f368bbf3	[7.x][ML] Adjust assertion for job case memory usage estimates (#57929 ) Since we changed the memory estimates for data frame analytics jobs from worst case to a realistic case, the strict less-than assertion in the test does not hold anymore. I replaced it with a less-than-or-equal assertion. Backport of #57882	2020-06-10 15:17:16 +02:00
Benjamin Trent	9666a895f7	[ML] inference performance optimizations and refactor (#57674 ) (#57753 ) This is a major refactor of the underlying inference logic. The main refactor is that we are now separating the model configuration and the inference interfaces. This has the following benefits: - we can store extra things with the model that are not necessary for inference (i.e. treenode split information gain) - we can optimize inference separate from model serialization and storage. - The user is oblivious to the optimizations (other than seeing the benefits). A major part of this commit is removing all inference related methods from the trained model configurations (ensemble, tree, etc.) and moving them to a new class. This new class satisfies a new interface that is ONLY for inference. The optimizations applied currently are: - feature maps are flattened once - feature extraction only happens once at the highest level (improves inference + feature importance throughput) - Only storing what we need for inference + feature importance on heap	2020-06-05 14:20:58 -04:00
Przemysław Witek	6b5f49d097	[7.x] Introduce ModelPlotConfig. annotations_enabled setting (#57539 ) (#57641 )	2020-06-04 15:15:35 +02:00
Benjamin Trent	34f1e0b6bb	[7.x] [ML] mark forecasts for force closed/failed jobs as failed (#57143 ) (#57374 ) * [ML] mark forecasts for force closed/failed jobs as failed (#57143) forecasts that are still running should be marked as failed/finished in the following scenarios: - Job is force closed - Job is re-assigned to another node. Forecasts are not "resilient". Their execution does not continue after a node failure. Consequently, forecasts marked as STARTED or SCHEDULED should be flagged as failed. These forecasts can then be deleted. Additionally, force closing a job kills the native task directly. This means that if a forecast was running, it is not allowed to complete and could still have the status of `STARTED` in the index. relates to https://github.com/elastic/elasticsearch/issues/56419	2020-05-29 14:48:10 -04:00
Benjamin Trent	35d5126cea	[7.x] [ML] adds new for_export flag to GET _ml/inference API (#57351 ) (#57368 ) * [ML] adds new for_export flag to GET _ml/inference API (#57351) Adds a new boolean flag, `for_export` to the `GET _ml/inference/<model_id>` API. This flag is useful for moving models between clusters.	2020-05-29 14:01:08 -04:00
Benjamin Trent	c8374dc9f3	[ML] add max_model_memory parameter to forecast request (#57254 ) (#57355 ) This adds a max_model_memory setting to forecast requests. This setting can take a string value that is formatted according to byte sizes (i.e. "50mb", "150mb"). The default value is `20mb`. There is a HARD limit at `500mb` which will throw an error if used. If the limit is larger than 40% of the anomaly job's configured model limit, the forecast limit is reduced to be strictly lower than that value. This reduction is logged and audited. related native change: https://github.com/elastic/ml-cpp/pull/1238 closes: https://github.com/elastic/elasticsearch/issues/56420	2020-05-29 11:16:08 -04:00
Przemysław Witek	ea2012778e	Mute failing test (#57112 ) (#57113 )	2020-05-25 14:06:29 +02:00
Benjamin Trent	297f864884	[ML] relax throttling on expired data cleanup (#56711 ) (#56895 ) Throttling nightly cleanup as much as we do has been overcautious. Nightly cleanup should be more lenient in its throttling. We still keep the same batch size, but now the requests per second scale with the number of data nodes. If we have more than 5 data nodes, we don't throttle at all. Additionally, the API now has `requests_per_second` and `timeout` set. So users calling the API directly can set the throttling. This commit also adds a new setting `xpack.ml.nightly_maintenance_requests_per_second`. This will allow users to adjust throttling of the nightly maintenance.	2020-05-18 08:46:42 -04:00
Dimitris Athanasiou	011e995165	[7.x][ML] Unmute ClassificationIT.testDependentVariableCardinalityTooHighButWithQueryMakesItWithinRange (#56268 ) (#56287 ) Closes #56240	2020-05-06 18:20:46 +03:00
Julie Tibshirani	49de092b38	Mute RegressionIT.testTwoJobsWithSameRandomizeSeedUseSameTrainingSet.	2020-05-05 16:25:36 -07:00
Julie Tibshirani	63062ec7bd	Mute ClassificationIT.testDependentVariableCardinalityTooHighButWithQueryMakesItWithinRange.	2020-05-05 13:48:35 -07:00
Benjamin Trent	e1c5ca421e	[7.x] [ML] lay ground work for handling >1 result indices (#55892 ) (#56192 ) * [ML] lay ground work for handling >1 result indices (#55892) This commit removes all but one reference to `getInitialResultsIndexName`. This is to support more than one result index for a single job.	2020-05-05 15:54:08 -04:00
David Roberts	7aa0daaabd	[7.x][ML] More advanced model snapshot retention options (#56194 ) This PR implements the following changes to make ML model snapshot retention more flexible in advance of adding a UI for the feature in an upcoming release. - The default for `model_snapshot_retention_days` for new jobs is now 10 instead of 1 - There is a new job setting, `daily_model_snapshot_retention_after_days`, that defaults to 1 for new jobs and `model_snapshot_retention_days` for pre-7.8 jobs - For days that are older than `model_snapshot_retention_days`, all model snapshots are deleted as before - For days that are in between `daily_model_snapshot_retention_after_days` and `model_snapshot_retention_days` all but the first model snapshot for that day are deleted - The `retain` setting of model snapshots is still respected to allow selected model snapshots to be retained indefinitely Backport of #56125	2020-05-05 14:31:58 +01:00
Dimitris Athanasiou	75dadb7a6d	[7.x][ML] Add loss_function to regression (#56118 ) (#56187 ) Adds parameters `loss_function` and `loss_function_parameter` to regression. Backport of #56118	2020-05-05 14:59:51 +03:00
Martijn van Groningen	6d03081560	Add auto create action (#56122 ) Backport of #55858 to 7.x branch. Currently the TransportBulkAction detects whether an index is missing and then decides whether it should be auto created. The coordination of the index creation also happens in the TransportBulkAction on the coordinating node. This change adds a new transport action that the TransportBulkAction delegates to if missing indices need to be created. The reasons for this change: * Auto creation of data streams can't occur on the coordinating node. Based on the index template (v2) either a regular index or a data stream should be created. However if the coordinating node is slow in processing cluster state updates then it may be unaware of the existence of certain index templates, which then can lead to the TransportBulkAction creating an index instead of a data stream. Therefore the coordination of creating an index or data stream should occur on the master node. See #55377 * From a security perspective it is useful to know whether index creation originates from the create index api or from auto creating a new index via the bulk or index api. For example a user would be allowed to auto create an index, but not to use the create index api. The auto create action will allow security to distinguish these two different patterns of index creation. This change adds the following new transport actions: AutoCreateAction, the TransportBulkAction redirects to this action and this action will actually create the index (instead of the TransportCreateIndexAction). Later via #55377, can improve the AutoCreateAction to also determine whether an index or data stream should be created. The create_index index privilege is also modified, so that if this permission is granted then a user is also allowed to auto create indices. This change does not yet add an auto_create index privilege. A future change can introduce this new index privilege or modify an existing index / write index privilege. Relates to #53100	2020-05-04 19:10:09 +02:00
Dimitris Athanasiou	17b904def5	[7.x][ML] Decouple DFA progress testing from analyses phases (#55925 ) (#56024 ) This refactors native integ tests to assert progress without expecting explicit phases for analyses. We can test those with yaml tests in a single place. Backport of #55925	2020-04-30 17:05:47 +03:00
Dimitris Athanasiou	d9685a0f19	[7.x][ML] Validate at least one feature is available for DF analytics (#55876 ) (#55914 ) We were previously checking at least one supported field existed when the _explain API was called. However, in the case of analyses with required fields (e.g. regression) we were not accounting that the dependent variable is not a feature and thus if the source index only contains the dependent variable field there are no features to train a model on. This commit adds a validation that at least one feature is available for analysis. Note that we also move that validation away from `ExtractedFieldsDetector` and the _explain API and straight into the _start API. The reason for doing this is to allow the user to use the _explain API in order to understand why they would be seeing an error like this one. For example, the user might be using an index that has fields but they are of unsupported types. If they start the job and get an error that there are no features, they will wonder why that is. Calling the _explain API will show them that all their fields are unsupported. If the _explain API was failing instead, there would be no way for the user to understand why all those fields are ignored. Closes #55593 Backport of #55876	2020-04-29 11:39:58 +03:00
Przemysław Witek	c89917c799	Register DFA jobs on putAnalytics rather than via a separate method (#55458 ) (#55708 )	2020-04-24 10:59:32 +02:00
Dimitris Athanasiou	b8379872a7	[7.x][ML] Logs error when DFA task is set to failed (#55545 ) (#55668 ) Also unmutes the integ test that stops and restarts an outlier detection job with the hope of learning more of the failure in #55068. Backport of #55545 Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>	2020-04-24 11:06:07 +03:00
David Roberts	da5aeb8be7	[ML] Return assigned node in start/open job/datafeed response (#55570 ) Adds a "node" field to the response from the following endpoints: 1. Open anomaly detection job 2. Start datafeed 3. Start data frame analytics job If the job or datafeed is assigned to a node immediately then this field will return the ID of that node. In the case where a job or datafeed is opened or started lazily the node field will contain an empty string. Clients that want to test whether a job or datafeed was opened or started lazily can therefore check for this. Backport of #55473	2020-04-22 12:06:53 +01:00
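
The model snapshot retention rules described in commit 7aa0daaabd can be sketched in a few lines of Python. This is a minimal illustration, not the Elasticsearch implementation. The `Snapshot` class and the `snapshots_to_delete` function are hypothetical names, and two behaviours are assumptions made for the example: age is measured against a caller-supplied `now`, and a snapshot with `retain` set does not count as the first snapshot of its day.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Snapshot:
    """Hypothetical stand-in for an ML model snapshot."""
    snapshot_id: str
    timestamp: datetime
    retain: bool = False


def snapshots_to_delete(snapshots, now, retention_days=10, daily_after_days=1):
    """Return the ids of snapshots the described policy would delete.

    Defaults mirror the values the commit message gives for new jobs:
    model_snapshot_retention_days=10 and
    daily_model_snapshot_retention_after_days=1.
    """
    to_delete = []
    days_with_kept_snapshot = set()
    # Oldest first, so the "first snapshot for that day" is seen first.
    for snap in sorted(snapshots, key=lambda s: s.timestamp):
        if snap.retain:
            # Explicitly retained snapshots are never deleted.
            # Assumption: they do not count as a day's first snapshot.
            continue
        age = now - snap.timestamp
        if age > timedelta(days=retention_days):
            # Older than the full retention window: delete.
            to_delete.append(snap.snapshot_id)
        elif age > timedelta(days=daily_after_days):
            # In between the two thresholds: keep only the first per day.
            day = snap.timestamp.date()
            if day in days_with_kept_snapshot:
                to_delete.append(snap.snapshot_id)
            else:
                days_with_kept_snapshot.add(day)
        # Newer than daily_after_days: always kept.
    return to_delete
```

With `now` set to 2020-05-20, a snapshot from 2020-05-01 is deleted, the second of two snapshots taken on 2020-05-15 is deleted, and both snapshots taken on 2020-05-19 are kept.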


335 Commits