Commit Graph

684 Commits

Author SHA1 Message Date
Julie Tibshirani 38ce428831
Create a class to hold field capabilities for one index. (#51844)
Currently, the same class `FieldCapabilities` is used both to represent the
capabilities for one index, and also the merged capabilities across indices. To
help clarify the logic, this PR proposes to create a separate class
`IndexFieldCapabilities` for the capabilities in one index. The refactor will
also help when adding `source_path` information in #49264, since the merged
source path field will have a different structure from the field for a single index.

Individual changes:
* Add a new class IndexFieldCapabilities.
* Remove extra constructor from FieldCapabilities.
* Combine the add and merge methods in FieldCapabilities.Builder.
2020-02-04 11:24:57 -08:00
David Roberts 9d55c45b5a [ML] Improve multiline_start_pattern for CSV in find_file_structure (#51737)
The work to switch file upload over to treating delimited files
like semi-structured text and using the ingest pipeline for CSV
parsing makes the multi-line start pattern used for delimited
files much more critical than it used to be.

Previously it was always based on the time field, even if that
was towards the end of the columns, and no multi-line pattern
was created if no timestamp was detected.

This change improves the multi-line start pattern by:

1. Never creating a multi-line pattern if the sample contained
   only single line records.  This improves the import
   efficiency in a common case.
2. Choosing the leftmost field that has a well-defined pattern,
   whether that be the time field or a boolean/numeric field.
   This reduces the risk of a field with newlines occurring
   earlier, and also means the algorithm doesn't automatically
   fail for data without a timestamp.
2020-02-04 12:37:48 +00:00
Benjamin Trent d293980a09
[7.x] [ML] add GET _cat/ml/datafeeds (#51500) (#51829)
* [ML] add GET _cat/ml/datafeeds (#51500)

This adds GET _cat/ml/datafeeds && _cat/ml/datafeeds/{datafeed_id}

* fixing for java8 compilation
2020-02-03 17:16:33 -05:00
David Roberts d5d8fb26fa [TEST] Remove obsolete test trace logging from NetworkDisruptionIT (#51746)
The issue this logging was added to fix (#49908) was closed in
December and the problem has not recurred so this logging is no
longer needed.
2020-02-03 11:25:53 +00:00
Dimitris Athanasiou 55b5c8f703
[7.x][ML] Remove index.unassigned.node_left.delayed_timeout setting from M… (#51740) (#51764)
This setting was introduced with the purpose of reducing the time took by
tests that shut nodes down. Tests like `MlDistributedFailureIT` and
`NetworkDisruptionIT`. However, it is unfortunate to have to set the value
to an explicit value in production. In addition, and most important, the dynamically
choosing the value for this setting makes it impossible to adopt static index template configs
that we register via `IndexTemplateRegistry`, which we need to use in order to start
registering ILM policies for the ML indices.

This commit removes this setting from our templates. I run the tests a few times and could
not see execution time differing significantly.

Backport of #51740
2020-01-31 20:28:29 +02:00
Benjamin Trent e372854d43
[ML][Inference] Fix model pagination with models as resources (#51573) (#51736)
This adds logic to handle paging problems when the ID pattern + tags reference models stored as resources. 

Most of the complexity comes from the issue where a model stored as a resource could be at the start, or the end of a page or when we are on the last page.
2020-01-31 07:52:19 -05:00
Gordon Brown 10c8179351
Use exclusions list instead of fake system indices (#51586)
This commit switches the strategy for managing dot-prefixed indices that
should be hidden indices from using "fake" system indices to an explicit
exclusions list that must be updated when those indices are converted to
hidden indices.
2020-01-30 16:31:27 -07:00
Benjamin Trent 1380dd439a
[7.x] [ML][Inference] Fix weighted mode definition (#51648) (#51695)
* [ML][Inference] Fix weighted mode definition (#51648)

Weighted mode inaccurately assumed that the "max value" of the input values would be the maximum class value. This does not make sense. 

Weighted Mode should know how many classes there are. Hence the new parameter `num_classes`. This indicates what the maximum class value to be expected.
2020-01-30 15:33:25 -05:00
Henning Andersen 149b68d850 [ML] Fix possible race condition starting datafeed (#51646)
Datafeeds being closed while starting could result in and NPE. This was
handled as any other failure, masking out the NPE. However, this
conflicts with the changes in #50886.

Related to #50886 and #51302
2020-01-30 08:23:45 +01:00
Przemysław Witek 683170b007
Increase the number of indexed documents to increase a chance that there are at least 2 training rows. (#51607) (#51615) 2020-01-29 17:17:19 +01:00
Gordon Brown 89c2834b24
Deprecate creation of dot-prefixed index names except for hidden and system indices (#49959)
This commit deprecates the creation of dot-prefixed index names (e.g.
.watches) unless they are either 1) a hidden index, or 2) registered by
a plugin that extends SystemIndexPlugin. This is the first step
towards more thorough protections for system indices.

This commit also modifies several plugins which use dot-prefixed indices
to register indices they own as system indices, and adds a plugin to
register .tasks as a system index.
2020-01-28 10:01:16 -07:00
David Roberts 550254ec7f [ML] Use CSV ingest processor in find_file_structure ingest pipeline (#51492)
Changes the find_file_structure response to include a CSV
ingest processor in the ingest pipeline it suggests.

Previously the Kibana file upload functionality parsed CSV
in the browser, but by parsing CSV in the ingest pipeline
it makes the Kibana file upload functionality more easily
interchangable with Filebeat such that the configurations
it creates can more easily be used to import data with the
same structure repeatedly in production.
2020-01-28 14:38:43 +00:00
David Roberts 3c223ceea1 [ML] Fix 2 digit year regex in find_file_structure (#51469)
The DATE and DATESTAMP Grok patterns match 2 digit years
as well as 4 digit years.  The pattern determination in
find_file_structure worked correctly in this case, but
the regex used to create a multi-line start pattern was
assuming a 4 digit year.  Also, the quick rule-out
patterns did not always correctly consider 2 digit years,
meaning that detection was inconsistent.

This change fixes both problems, and also extends the
tests for DATE and DATESTAMP to check both 2 and 4 digit
years.
2020-01-27 17:23:18 +00:00
Przemysław Witek dd3e2f1e18
[7.x] Update quantiles document in the index the document belongs to (#51135) (#51415) 2020-01-27 10:13:02 +01:00
Benjamin Trent bf53ca3380
[7.x] [ML] Add _cat/ml/anomaly_detectors API (#51364) (#51408)
[ML] Add _cat/ml/anomaly_detectors API (#51364)
2020-01-24 11:54:22 -05:00
Benjamin Trent fc994d9ce1
[ML][Inference] Adds validations for model PUT (#51376) (#51409)
Adds validations making sure that

* `input.field_names` is not empty
* `ensemble.trained_models` is not empty
* `tree.feature_names` is not empty

closes https://github.com/elastic/elasticsearch/issues/51354
2020-01-24 09:29:12 -05:00
Benjamin Trent 76660a5a4f
[7.x] [ML][Inference] add tags url param to GET (#51330) (#51404)
* [ML][Inference] add tags url param to GET (#51330)

Adds a new URL parameter, `tags` to the GET _ml/inference/<model_id> endpoint.

This parameter allows the list of models to be further reduced to those who contain all the provided tags.
2020-01-24 08:26:58 -05:00
Dimitris Athanasiou 3443d69883
[7.x][ML] Rename DataFrameAnalyticsIndex to DestinationIndex (#51353) (#51356)
As we prepare to introduce a new index for storing additional
information about data frame analytics jobs (e.g. intrumentation),
renaming this class to `DestinationIndex` better captures what it does
and leaves its prior name available for a more suitable use.

Backport of #51353

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-01-24 09:51:48 +02:00
David Kyle 0ac03ac5e7
[ML] Add parsers for inference configuration classes (#51300) 2020-01-22 17:03:01 +00:00
David Kyle ca4b90a001
[ML] Calculate results and snapshot retention using latest bucket timestamps (#51061) (#51301)
The retention period is calculated relative to the last bucket result or snapshot
time rather than wall clock
2020-01-22 14:52:33 +00:00
Dimitris Athanasiou 59687a9384
[7.x][ML] Validate classification dependent_variable cardinality is at lea… (#51232) (#51309)
Data frame analytics classification currently only supports 2 classes for the
dependent variable. We were checking that the field's cardinality is not higher
than 2 but we should also check it is not less than that as otherwise the process
fails.

Backport of #51232
2020-01-22 16:51:16 +02:00
Benjamin Trent 2a73e849d6
[ML][Inference] fixing ingest IT tests (#51267) (#51311)
Converts InferenceIngestIT into a `ESRestTestCase`.

closes #51201
2020-01-22 09:50:17 -05:00
David Roberts 932c63297f [ML] Fix possible race condition when starting datafeed (#51302)
The ID of the datafeed's associated job was being obtained
frequently by looking up the datafeed task in a map that
was being modified in other threads.  This could lead to
NPEs if the datafeed stopped running at an unexpected time.

This change reduces the number of places where a datafeed's
associated job ID is looked up to avoid the possibility of
failures when the datafeed's task is removed from the map
of running tasks during multi-step operations in other
threads.

Fixes #51285
2020-01-22 11:40:39 +00:00
Przemysław Witek bfcfcdee33
[7.x] Do not copy mapping from dependent variable to prediction field in regression analysis (#51227) (#51288) 2020-01-22 12:36:24 +01:00
Benjamin Trent a9b2bc525e
[ML] address two edge cases for categorization.GrokPatternCreator#findBestGrokMatchFromExamples (#51168) (#51255)
There are two edge cases that can be ran into when example input is matched in a weird way.

1. Recursion depth could continue many many times, resulting in a HUGE runtime cost. I put a limit of 10 recursions (could be adjusted I suppose). 
2. If there are no "fixed regex bits", exploring the grok space would result in a fence-post error during runtime (with assertions turned off)
2020-01-21 10:29:29 -05:00
David Roberts 0fa7db9a95 [ML] Make datafeeds work with nanosecond time fields (#51180)
Allows ML datafeeds to work with time fields that have
the "date_nanos" type _and make use of the extra precision_.
(Previously datafeeds only worked with time fields that were
exact multiples of milliseconds.  So datafeeds would work
with "date_nanos" only if the extra precision over "date" was
not used.)

Relates #49889
2020-01-21 09:59:50 +00:00
Jay Modi 107989df3e
Introduce hidden indices (#51164)
This change introduces a new feature for indices so that they can be
hidden from wildcard expansion. The feature is referred to as hidden
indices. An index can be marked hidden through the use of an index
setting, `index.hidden`, at creation time. One primary use case for
this feature is to have a construct that fits indices that are created
by the stack that contain data used for display to the user and/or
intended for querying by the user. The desire to keep them hidden is
to avoid confusing users when searching all of the data they have
indexed and getting results returned from indices created by the
system.

Hidden indices have the following properties:
* API calls for all indices (empty indices array, _all, or *) will not
  return hidden indices by default.
* Wildcard expansion will not return hidden indices by default unless
  the wildcard pattern begins with a `.`. This behavior is similar to
  shell expansion of wildcards.
* REST API calls can enable the expansion of wildcards to hidden
  indices with the `expand_wildcards` parameter. To expand wildcards
  to hidden indices, use the value `hidden` in conjunction with `open`
  and/or `closed`.
* Creation of a hidden index will ignore global index templates. A
  global index template is one with a match-all pattern.
* Index templates can make an index hidden, with the exception of a
  global index template.
* Accessing a hidden index directly requires no additional parameters.

Backport of #50452
2020-01-17 10:09:01 -07:00
David Roberts 295665b1ea [ML] Add audit warning for 1000 categories found early in job (#51146)
If 1000 different category definitions are created for a job in
the first 100 buckets it processes then an audit warning will now
be created.  (This will cause a yellow warning triangle in the
ML UI's jobs list.)

Such a large number of categories suggests that the field that
categorization is working on is not well suited to the ML
categorization functionality.
2020-01-17 16:28:45 +00:00
Przemysław Witek da73c9104e
[ML] Fix tests randomly failing on CI (#51142) (#51150) 2020-01-17 14:58:58 +01:00
Dimitris Athanasiou b70ebdeb96
[7.x][ML] DF Analytics _explain API should skip object fields (#51115) (#51147)
Object fields cannot be used as features. At the moment _explain
API includes them and even worse it allows it does not error when
an object field is excluded. This creates the expectation to the
user that all children fields will also be excluded while it's not
the case.

This commit omits object fields from the _explain API and also
adds an error if an object field is included or excluded.

Backport of #51115
2020-01-17 14:02:59 +02:00
Przemysław Witek b1a526d5e9
[7.x] [ML] Update DFA progress document in the index the document belongs to (#51111) (#51117) 2020-01-17 08:12:54 +01:00
Tom Veasey 32ec934b15
[7.x][ML] Assert top classes are ordered by score (#51028)
Backport #51003.
2020-01-16 12:23:15 +00:00
David Roberts 1536c3e622 [TEST] Increase ML distributed test job open timeout (#50998)
There have been occasional failures, presumably due to
too many tests running in parallel, caused by jobs taking
around 15 seconds to open.  (You can see the job open
successfully during the cleanup phase shortly after the
failure of the test in these cases.)  This change increases
the wait time from 10 seconds to 20 seconds to reduce the
risk of this happening.
2020-01-15 08:58:55 +00:00
Nik Everett fc5fde7950
Add "did you mean" to ObjectParser (#50938) (#50985)
Check it out:
```
$ curl -u elastic:password -HContent-Type:application/json -XPOST localhost:9200/test/_update/foo?pretty -d'{
  "dac": {}
}'

{
  "error" : {
    "root_cause" : [
      {
        "type" : "x_content_parse_exception",
        "reason" : "[2:3] [UpdateRequest] unknown field [dac] did you mean [doc]?"
      }
    ],
    "type" : "x_content_parse_exception",
    "reason" : "[2:3] [UpdateRequest] unknown field [dac] did you mean [doc]?"
  },
  "status" : 400
}
```

The tricky thing about implementing this is that x-content doesn't
depend on Lucene. So this works by creating an extension point for the
error message using SPI. Elasticsearch's server module provides the
"spell checking" implementation.
s
2020-01-14 17:53:41 -05:00
Benjamin Trent 72c270946f
[ML][Inference] Adding classification_weights to ensemble models (#50874) (#50994)
* [ML][Inference] Adding classification_weights to ensemble models

classification_weights are a way to allow models to
prefer specific classification results over others
this might be advantageous if classification value
probabilities are a known quantity and can improve
model error rates.
2020-01-14 12:40:25 -05:00
Tom Veasey de5713fa4b
[ML] Disable invalid assertion (#50988)
Backport #50986.
2020-01-14 17:35:00 +00:00
David Kyle 7f309a18f1
[7.x][ML] Explicitly require a OriginSettingClient in ML results iterators (#50981)
In classes where the client is used directly rather than through a call to 
executeAsyncWithOrigin explicitly require the client to be OriginSettingClient 
rather than using the Client interface. 

Also remove calls to deprecated ClientHelper.clientWithOrigin() method.
2020-01-14 17:14:39 +00:00
Dimitris Athanasiou 1d8cb3c741
[7.x][ML] Add num_top_feature_importance_values param to regression and classi… (#50914) (#50976)
Adds a new parameter to regression and classification that enables computation
of importance for the top most important features. The computation of the importance
is based on SHAP (SHapley Additive exPlanations) method.

Backport of #50914
2020-01-14 16:46:09 +02:00
Przemysław Witek 9c6ffdc2be
[7.x] Handle nested and aliased fields correctly when copying mapping. (#50918) (#50968) 2020-01-14 14:43:39 +01:00
Benjamin Trent eb8fd44836
[ML][Inference] minor fixes for created_by, and action permission (#50890) (#50911)
The system created and models we provide now use the `_xpack` user for uniformity with our other features

The `PUT` action is now an admin cluster action

And XPackClient class now references the action instance.
2020-01-13 07:59:31 -05:00
Benjamin Trent fa116a6d26
[7.x] [ML][Inference] PUT API (#50852) (#50887)
* [ML][Inference] PUT API (#50852)

This adds the `PUT` API for creating trained models that support our format.

This includes

* HLRC change for the API
* API creation
* Validations of model format and call

* fixing backport
2020-01-12 10:59:11 -05:00
Dimitris Athanasiou 422422a2bc
[7.x][ML] Reuse SourceDestValidator for data frame analytics (#50841) (#50850)
This commit removes validation logic of source and dest indices
for data frame analytics and replaces it with using the common
`SourceDestValidator` class which is already used by transforms.
This way the validations and their messages become consistent
while we reduce code.

This means that where these validations fail the error messages
will be slightly different for data frame analytics.

Backport of #50841
2020-01-10 14:24:13 +02:00
Jake Landis de6f132887
[7.x] Foreach processor - fork recursive call (#50514) (#50773)
A very large number of recursive calls can cause a stack overflow
exception. This commit forks the recursive calls for non-async
processors. Once forked, each thread will handle at most 10
recursive calls to help keep the stack size and thread count
down to a reasonable size.
2020-01-09 13:21:18 -06:00
Benjamin Trent cc0e64572a
[ML][Inference][HLRC] Add necessary lang ident classes (#50705) (#50794)
This adds the necessary named XContent classes to the HLRC for the lang ident model. This is so the HLRC can call `GET _ml/inference/lang_ident_model_1?include_definition=true` without XContent parsing errors.

The constructors are package private as since this classes are used exclusively within the pre-packaged model (and require the specific weights, etc. to be of any use).
2020-01-09 10:33:38 -05:00
Adrien Grand 31158ab3d5
Add per-field metadata. (#50333)
This PR adds per-field metadata that can be set in the mappings and is later
returned by the field capabilities API. This metadata is completely opaque to
Elasticsearch but may be used by tools that index data in Elasticsearch to
communicate metadata about fields with tools that then search this data. A
typical example that has been requested in the past is the ability to attach
a unit to a numeric field.

In order to not bloat the cluster state, Elasticsearch requires that this
metadata be small:
 - keys can't be longer than 20 chars,
 - values can only be numbers or strings of no more than 50 chars - no inner
   arrays or objects,
 - the metadata can't have more than 5 keys in total.

Given that metadata is opaque to Elasticsearch, field capabilities don't try to
do anything smart when merging metadata about multiple indices, the union of
all field metadatas is returned.

Here is how the meta might look like in mappings:

```json
{
  "properties": {
    "latency": {
      "type": "long",
      "meta": {
        "unit": "ms"
      }
    }
  }
}
```

And then in the field capabilities response:

```json
{
  "latency": {
    "long": {
      "searchable": true,
      "aggreggatable": true,
      "meta": {
        "unit": [ "ms" ]
      }
    }
  }
}
```

When there are no conflicts, values are arrays of size 1, but when there are
conflicts, Elasticsearch includes all unique values in this array, without
giving ways to know which index has which metadata value:

```json
{
  "latency": {
    "long": {
      "searchable": true,
      "aggreggatable": true,
      "meta": {
        "unit": [ "ms", "ns" ]
      }
    }
  }
}
```

Closes #33267
2020-01-08 16:21:18 +01:00
Benjamin Trent 060e0a6277
[ML][Inference] Add support for models shipped as resources (#50680) (#50700)
This adds support for models that are shipped as resources in the ML plugin. The first of which is the `lang_ident` model.
2020-01-07 09:21:59 -05:00
Przemysław Witek 4116452d90
Implement testStopAndRestart for ClassificationIT (#50585) (#50698) 2020-01-07 13:41:37 +01:00
David Roberts 35453e2b0e [ML] Improve uniqueness of result document IDs (#50644)
Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.

Fixes #50613
2020-01-07 10:24:45 +00:00
David Roberts 46d600c446 [ML] Fix off-by-one error in ml_classic tokenizer end offset (#50655)
The end offset of a tokenizer is supposed to point one past the
end of the input, not to the end character of the input.  The
ml_classic tokenizer was erroneously doing the latter.
2020-01-07 10:14:59 +00:00
Benjamin Trent 06cea5136e
[ML] construct new random generator on each persistence call (#50657) (#50684)
Sharing a random generator may cause test failures as non-threadsafe random generators are periodically utilized in tests (see: https://github.com/elastic/elasticsearch/issues/50651)

This change constructs a calls `Randomness.get()` within the  `bulkIndexWithRetry` method so that the returned `Random` object is only used in a single thread. Before, the member variable could have been used between threads, which caused test failures.
2020-01-06 16:26:29 -05:00
Benjamin Trent 5ab9e75e28
[7.x] [ML][Inference] lang_ident model (#50292) (#50675)
* [ML][Inference] lang_ident model (#50292)

This PR contains a java port of Google's CLD3 compact NN model https://github.com/google/cld3

The ported model is formatted to fit within our inference model formatting and stored as a resource in the `:xpack:ml:` plugin and is under basic license.

The model is broken up into two major parts:
- Preprocessing through the custom embedding (based on CLD3's embedding layer)
- Pushing the embedded text through the two layers of fully connected shallow NN. 

Main differences between this port and CLD3:
- We take advantage of Java's internal Unicode handling where possible (i.e. codepoints, characters, decoders, etc.)
- We do not trim down input text by removing duplicated tokens
- We do not encode doubles/floats as longs/integers.
2020-01-06 16:24:03 -05:00
Benjamin Trent f52af7977d
[ML][Inference] minor cleanup for inference (#50444) (#50676) 2020-01-06 14:05:04 -05:00
Dimitris Athanasiou ca0828ba07
[7.x][ML] Implement force deleting a data frame analytics job (#50553) (#50589)
Adds a `force` parameter to the delete data frame analytics
request. When `force` is `true`, the action force-stops the
jobs and then proceeds to the deletion. This can be used in
order to delete a non-stopped job with a single request.

Closes #48124

Backport of #50553
2020-01-03 13:46:02 +02:00
Przemysław Witek 8917c05df8
[7.x] Synchronize processInStream.close() call (#50581) 2020-01-03 10:23:51 +01:00
Przemysław Witek 4ecabe496f
Mute testStopAndRestart test case (#50551) 2020-01-02 15:28:20 +01:00
Christoph Büscher 1599af8428 Fix type conversion problem in Eclipse (#50549)
Eclipse 4.13 shows a type mismatch error in the affected line because it cannot
correctly infer the boolean return type for the method call. Assigning return
value to a local variable resolves this problem.
2020-01-02 14:29:20 +01:00
Przemysław Witek 3e3a93002f
[7.x] Fix accuracy metric (#50310) (#50433) 2019-12-20 15:34:38 +01:00
Przemysław Witek 14d95aae46
[7.x] Make each analysis report desired field mappings to be copied (#50219) (#50428) 2019-12-20 15:10:33 +01:00
Przemysław Witek 5bb668b866
[7.x] Get rid of maxClassesCardinality internal parameter (#50418) (#50423) 2019-12-20 14:24:23 +01:00
Przemysław Witek cc4bc797f9
[7.x] Implement `precision` and `recall` metrics for classification evaluation (#49671) (#50378) 2019-12-19 18:55:05 +01:00
Dimitris Athanasiou d3c83cd55a
[7.x][ML] Refresh state index before completing data frame analytics job (#50322) (#50324)
In order to ensure any persisted model state is searchable by the moment
the job reports itself as `stopped`, we need to refresh the state index
before completing.

This should fix the occasional failures we see in #50168 and #50313 where
the model state appears missing.

Closes #50168
Closes #50313

Backport of #50322
2019-12-18 22:19:59 +00:00
Benjamin Trent 4396a1f78b
[ML][Inference] fix support for nested fields (#50258) (#50335)
This fixes support for nested fields

We now support fully nested, fully collapsed, or a mix of both on inference docs.

ES mappings allow the `_source` to be any combination of nested objects + dot delimited fields.
So, we should do our best to find the best path down the Map for the desired field.
2019-12-18 15:47:06 -05:00
Dimitris Athanasiou 447bac27d2
[7.x][ML] Delete unused data frame analytics state (#50243) (#50280)
This commit adds removal of unused data frame analytics state
from the _delete_expired_data API (and in extend th ML daily
maintenance task). At the moment the potential state docs
include the progress document and state for regression and
classification analyses.

Backport of #50243
2019-12-18 12:30:11 +00:00
Przemysław Witek ac974c35c0
Pass processConnectTimeout to the method that fetches C++ process' PID (#50276) (#50290) 2019-12-17 21:32:37 +01:00
David Kyle 098f540f9d
[ML] Remove usage of base action logger in ml actions (#50074) (#50236) 2019-12-17 13:03:27 +00:00
David Kyle 5542686283 [ML] Wait for green after opening job in NetworkDisruptionIT (#50232)
Closes #49908
2019-12-16 14:55:58 +00:00
Dimitris Athanasiou 73add726d7
[7.x][ML] Fix exception when field is not included and excluded at the same time (#50192) (#50223)
Executing the data frame analytics _explain API with a config that contains
a field that is not in the includes list but at the same time is the excludes
list results to trying to remove the field twice from the iterator. That causes
an `IllegalStateException`. This commit fixes this issue and adds a test that
captures the scenario.

Backport of #50192
2019-12-16 11:30:06 +00:00
Benjamin Trent 4805d8ac7d
[ML][Inference] Adding a warning_field for warning msgs. (#49838) (#50183)
This adds a new field for the inference processor.

`warning_field` is a place for us to write warnings provided from the inference call. When there are warnings we are not going to write an inference result. The goal of this is to indicate that the data provided was too poor or too different for the model to make an accurate prediction.

The user could optionally include the `warning_field`. When it is not provided, it is assumed no warnings were desired to be written.

The first of these warnings is when ALL of the input fields are missing. If none of the trained fields are present, we don't bother inferencing against the model and instead provide a warning stating that the fields were missing.

Also, this adds checks to not allow duplicated fields during processor creation.
2019-12-13 10:39:51 -05:00
Benjamin Trent 41736dd6c3
[ML] retry bulk indexing of state docs (#50149) (#50185)
This exchanges the direct use of the `Client` for `ResultsPersisterService`. State doc persistence will now retry. Failures to persist state will still not throw, but will be audited and logged.
2019-12-13 10:39:34 -05:00
Dimitris Athanasiou fe3c9e71d1
[7.x][ML] Fix DFA explain API timeout when source index is missing (#50176) (#50180)
This commit fixes a bug that caused the data frame analytics
_explain API to time out in a multi-node setup when the source
index was missing. When we try to create the extracted fields detector,
we check the index settings. If the index is missing that responds
with a failure that could be wrapped as a remote exception.
While we unwrapped correctly to check if the cause was an
`IndexNotFoundException`, we then proceeded to cast the original
exception instead of the cause.

Backport of #50176
2019-12-13 17:00:55 +02:00
Dimitris Athanasiou e6cbcf7f7c
[7.x] [ML] Persist/restore state for DFA classification (#50040) (#50147)
This commit adds state persist/restore for data frame analytics classification jobs.

Backport of #50040
2019-12-13 10:33:19 +02:00
Benjamin Trent d7ffa7f8f7
[7.x][ML] Add graceful retry for anomaly detector result indexing failures(#49508) (#50145)
* [ML] Add graceful retry for anomaly detector result indexing failures (#49508)

All results indexing now retry the amount of times configured in `xpack.ml.persist_results_max_retries`. The retries are done in a semi-random, exponential backoff.

* fixing test
2019-12-12 12:24:58 -05:00
Benjamin Trent c043aa887f
[ML][Inference] Simplify inference processor options (#50105) (#50146)
* [ML][Inference] Simplify inference processor options

* addressing pr comments
2019-12-12 11:13:55 -05:00
David Roberts 13e47df97d [TEST] Increase timeout for ML internal cluster cleanup (#50142)
Closes #48511
2019-12-12 15:38:22 +00:00
David Kyle 7d4118dc4e Enable trace logging in failing ml NetworkDisruptionIT
https://github.com/elastic/elasticsearch/issues/49908
2019-12-12 11:16:01 +00:00
Dimitris Athanasiou 03ecaae221
[7.x][ML] Avoid classification integ test training on single class (#50072) (#50078)
The `ClassificationIT.testTwoJobsWithSameRandomizeSeedUseSameTrainingSet`
test was previously set up to just have 10 rows. With `training_percent`
of 50%, only 5 rows will be used for training. There is a good chance that
all 5 rows will be of one class which results to failure.

This commit increases the rows to 100. Now 50 rows should be used for training
and the chance of failure should be very small.

Backport of #50072
2019-12-11 18:50:26 +02:00
David Turner 285eacd267
Use more specific loggers in subclasses of TMNA (#50076)
Adjusts the subclasses of `TransportMasterNodeAction` to use their own loggers
instead of the one for the base class.

Relates #50056.
Partial backport of #46431 to 7.x.
2019-12-11 15:07:47 +00:00
Przemysław Witek 9b116c8fef
A few improvements to AnalyticsProcessManager class that make the code more readable. (#50026) (#50069) 2019-12-11 09:35:05 +01:00
Dimitris Athanasiou 8891f4db88
[7.x][ML] Introduce randomize_seed setting for regression and classification (#49990) (#50023)
This adds a new `randomize_seed` for regression and classification.
When not explicitly set, the seed is randomly generated. One can
reuse the seed in a similar job in order to ensure the same docs
are picked for training.

Backport of #49990
2019-12-10 15:29:19 +02:00
Benjamin Trent 0b6ce9683c
[ML] Use query in cardinality check (#49939) (#49984)
When checking the cardinality of a field, the query should be take into account. The user might know about some bad data in their index and want to filter down to the target_field values they care about.
2019-12-09 10:14:41 -05:00
Przemysław Witek 0965a10468
[7.x] Pass `prediction_field_type` to C++ analytics process (#49861) (#49981) 2019-12-09 14:43:01 +01:00
Benjamin Trent 049d854360
[ML][Inference] adjust so target_field always has inference result and optionally allow new top classes field in the classification config (#49923) (#49982) 2019-12-09 08:29:45 -05:00
Dimitris Athanasiou e4f838e764
[7.x][ML] Update expected mem estimate in explain API integ test (#49924) (#49979)
Work in progress in the c++ side is increasing memory estimates
a bit and this test fails. At the time of this commit the mem
estimate when there is no source query is a about 2Mb. So I
am relaxing the test to assert memory estimate is less than 1Mb
instead of 500Kb.

Backport of #49924
2019-12-09 11:52:06 +02:00
Przemysław Witek e60837aa3b
[7.x] Log whole analytics stats when the state assertion fails (#49906) (#49911) 2019-12-06 14:31:17 +01:00
Przemysław Witek 1d8e3d69d7
Make only a part of `stop()` method a critical section. (#49756) (#49788) 2019-12-03 09:54:16 +01:00
Dimitris Athanasiou 4edb2e7bb6
[7.x][ML] Add optional source filtering during data frame reindexing (#49690) (#49718)
This adds a `_source` setting under the `source` setting of a data
frame analytics config. The new `_source` is reusing the structure
of a `FetchSourceContext` like `analyzed_fields` does. Specifying
includes and excludes for source allows selecting which fields
will get reindexed and will be available in the destination index.

Closes #49531

Backport of #49690
2019-11-29 16:10:44 +02:00
Dimitris Athanasiou c23a2187da
[7.x][ML] Only report complete writing_results progress after completion (#49551) (#49577)
We depend on the number of data frame rows in order to report progress
for the writing of results, the last phase of a job run. However, results
include other objects than just the data frame rows (e.g, progress, inference model, etc.).

The problem this commit fixes is that if we receive the last data frame row results
we'll report that progress is complete even though we still have more results to process
potentially. If the job gets stopped for any reason at this point, we will not be able
to restart the job properly as we'll think that the job was completed.

This commit addresses this by limiting the max progress we can report for the
writing_results phase before the results processor completes to 98.
At the end, when the process is done we set the progress to 100.

The commit also improves failure capturing and reporting in the results processor.

Backport of #49551
2019-11-26 12:20:37 +02:00
Benjamin Trent 688c78c589
[ML] Stop timing stats failure propagation (#49495) (#49501) 2019-11-25 10:09:30 -05:00
David Roberts 62811c2272 [ML] Add default categorization analyzer definition to ML info (#49545)
The categorization job wizard in the ML UI will use this
information when showing the effect of the chosen categorization
analyzer on a sample of input.
2019-11-25 13:39:16 +00:00
Dimitris Athanasiou aca38f6882
[7.x][ML] DFA jobs should accept excluding an unsupported field (#49535) (#49544)
Before this change excluding an unsupported field resulted in
an error message that explained the excluded field could not be
detected as if it doesn't exist. This error message is confusing.

This commit commit changes this so that there is no error in this
scenario. When excluding a field that does exist but has been
automatically been excluded from the analysis there is no harm
(unlike excluding a missing field which could be a typo).

Backport of #49535
2019-11-25 15:13:00 +02:00
Dimitris Athanasiou c149c64dc4
[7.x][ML] Apply source query on data frame analytics memory estimation (#49517) (#49532)
Closes #49454

Backport of #49517
2019-11-25 12:51:57 +02:00
Dimitris Athanasiou 8eaee7cbdc
[7.x][ML] Explain data frame analytics API (#49455) (#49504)
This commit replaces the _estimate_memory_usage API with
a new API, the _explain API.

The API consolidates information that is useful before
creating a data frame analytics job.

It includes:

- memory estimation
- field selection explanation

Memory estimation is moved here from what was previously
calculated in the _estimate_memory_usage API.

Field selection is a new feature that explains to the user
whether each available field was selected to be included or
not in the analysis. In the case it was not included, it also
explains the reason why.

Backport of #49455
2019-11-22 22:06:10 +02:00
Benjamin Trent a7477ad7c3
[7.x] [ML][Inference] compressing model definition and lazy parsing (#49269) (#49446)
* [ML][Inference] compressing model definition and lazy parsing (#49269)

* [ML][Inference] compressing model definition and lazy parsing

* addressing PR comments

* adding commons io

* implementing simplified bounded stream

* adjusting for type inclusion
2019-11-21 15:32:32 -05:00
Benjamin Trent d41b2e3f38
[ML][Inference] allowing per-model licensing (#49398) (#49435)
* [ML][Inference] allowing per-model licensing

* changing to internal action + removing pre-mature opt
2019-11-21 09:46:34 -05:00
Przemysław Witek c7ac2011eb
[7.x] Implement accuracy metric for multiclass classification (#47772) (#49430) 2019-11-21 15:01:18 +01:00
David Roberts 20558cf61c [ML] Fix simultaneous stop and force stop datafeed (#49367)
If a datafeed is stopped normally and force stopped at the same
time then it is possible that the force stop removes the
persistent task while the normal stop is performing actions.
Currently this causes the normal stop to error, but since
stopping a stopped datafeed is not an error this doesn't make
sense. Instead the force stop should just take precedence.

This is a followup to #49191 and should really have been
included in the changes in that PR.
2019-11-20 12:52:47 +00:00
Przemysław Witek 9c0ec7ce23
[7.x] Make AnalyticsProcessManager class more robust (#49282) (#49356) 2019-11-20 10:08:16 +01:00
Dimitris Athanasiou 4d6e037e90
[7.x][ML] Extract creation of DFA field extractor into a factory (#49315) (#49329)
This commit moves the async calls required to retrieve the components
that make up `ExtractedFieldsExtractor` out of `DataFrameDataExtractorFactory`
and into a dedicated `ExtractorFieldsExtractorFactory` class.

A few more refactorings are performed:

  - The detector no longer needs the results field. Instead, it knows
  whether to use it or not based on whether the task is restarting.
  - We pass more accurately whether the task is restarting or not.
  - The validation of whether fields that have a cardinality limit
  are valid is now performed in the detector after retrieving the
  respective cardinalities.

Backport of #49315
2019-11-20 10:02:42 +02:00
Przemysław Witek 42bb8ae525
[7.x] Extract indexData method out of RegressionIT tests (#49306) (#49313) 2019-11-19 22:47:12 +01:00
Benjamin Trent 19602fd573
[ML][Inference] changing setting to be memorySizeSettting (#49259) (#49302) 2019-11-19 07:56:40 -05:00