636 Commits

Author SHA1 Message Date
David Roberts
46d600c446 [ML] Fix off-by-one error in ml_classic tokenizer end offset (#50655)
The end offset of a tokenizer is supposed to point one past the
end of the input, not to the end character of the input.  The
ml_classic tokenizer was erroneously doing the latter.
2020-01-07 10:14:59 +00:00
Benjamin Trent
06cea5136e
[ML] construct new random generator on each persistence call (#50657) (#50684)
Sharing a random generator may cause test failures as non-threadsafe random generators are periodically utilized in tests (see: https://github.com/elastic/elasticsearch/issues/50651)

This change constructs a calls `Randomness.get()` within the  `bulkIndexWithRetry` method so that the returned `Random` object is only used in a single thread. Before, the member variable could have been used between threads, which caused test failures.
2020-01-06 16:26:29 -05:00
Benjamin Trent
5ab9e75e28
[7.x] [ML][Inference] lang_ident model (#50292) (#50675)
* [ML][Inference] lang_ident model (#50292)

This PR contains a java port of Google's CLD3 compact NN model https://github.com/google/cld3

The ported model is formatted to fit within our inference model formatting and stored as a resource in the `:xpack:ml:` plugin and is under basic license.

The model is broken up into two major parts:
- Preprocessing through the custom embedding (based on CLD3's embedding layer)
- Pushing the embedded text through the two layers of fully connected shallow NN. 

Main differences between this port and CLD3:
- We take advantage of Java's internal Unicode handling where possible (i.e. codepoints, characters, decoders, etc.)
- We do not trim down input text by removing duplicated tokens
- We do not encode doubles/floats as longs/integers.
2020-01-06 16:24:03 -05:00
Benjamin Trent
f52af7977d
[ML][Inference] minor cleanup for inference (#50444) (#50676) 2020-01-06 14:05:04 -05:00
Dimitris Athanasiou
ca0828ba07
[7.x][ML] Implement force deleting a data frame analytics job (#50553) (#50589)
Adds a `force` parameter to the delete data frame analytics
request. When `force` is `true`, the action force-stops the
jobs and then proceeds to the deletion. This can be used in
order to delete a non-stopped job with a single request.

Closes #48124

Backport of #50553
2020-01-03 13:46:02 +02:00
Przemysław Witek
8917c05df8
[7.x] Synchronize processInStream.close() call (#50581) 2020-01-03 10:23:51 +01:00
Przemysław Witek
4ecabe496f
Mute testStopAndRestart test case (#50551) 2020-01-02 15:28:20 +01:00
Christoph Büscher
1599af8428 Fix type conversion problem in Eclipse (#50549)
Eclipse 4.13 shows a type mismatch error in the affected line because it cannot
correctly infer the boolean return type for the method call. Assigning return
value to a local variable resolves this problem.
2020-01-02 14:29:20 +01:00
Przemysław Witek
3e3a93002f
[7.x] Fix accuracy metric (#50310) (#50433) 2019-12-20 15:34:38 +01:00
Przemysław Witek
14d95aae46
[7.x] Make each analysis report desired field mappings to be copied (#50219) (#50428) 2019-12-20 15:10:33 +01:00
Przemysław Witek
5bb668b866
[7.x] Get rid of maxClassesCardinality internal parameter (#50418) (#50423) 2019-12-20 14:24:23 +01:00
Przemysław Witek
cc4bc797f9
[7.x] Implement precision and recall metrics for classification evaluation (#49671) (#50378) 2019-12-19 18:55:05 +01:00
Dimitris Athanasiou
d3c83cd55a
[7.x][ML] Refresh state index before completing data frame analytics job (#50322) (#50324)
In order to ensure any persisted model state is searchable by the moment
the job reports itself as `stopped`, we need to refresh the state index
before completing.

This should fix the occasional failures we see in #50168 and #50313 where
the model state appears missing.

Closes #50168
Closes #50313

Backport of #50322
2019-12-18 22:19:59 +00:00
Benjamin Trent
4396a1f78b
[ML][Inference] fix support for nested fields (#50258) (#50335)
This fixes support for nested fields

We now support fully nested, fully collapsed, or a mix of both on inference docs.

ES mappings allow the `_source` to be any combination of nested objects + dot delimited fields.
So, we should do our best to find the best path down the Map for the desired field.
2019-12-18 15:47:06 -05:00
Dimitris Athanasiou
447bac27d2
[7.x][ML] Delete unused data frame analytics state (#50243) (#50280)
This commit adds removal of unused data frame analytics state
from the _delete_expired_data API (and in extend th ML daily
maintenance task). At the moment the potential state docs
include the progress document and state for regression and
classification analyses.

Backport of #50243
2019-12-18 12:30:11 +00:00
Przemysław Witek
ac974c35c0
Pass processConnectTimeout to the method that fetches C++ process' PID (#50276) (#50290) 2019-12-17 21:32:37 +01:00
David Kyle
098f540f9d
[ML] Remove usage of base action logger in ml actions (#50074) (#50236) 2019-12-17 13:03:27 +00:00
David Kyle
5542686283 [ML] Wait for green after opening job in NetworkDisruptionIT (#50232)
Closes #49908
2019-12-16 14:55:58 +00:00
Dimitris Athanasiou
73add726d7
[7.x][ML] Fix exception when field is not included and excluded at the same time (#50192) (#50223)
Executing the data frame analytics _explain API with a config that contains
a field that is not in the includes list but at the same time is the excludes
list results to trying to remove the field twice from the iterator. That causes
an `IllegalStateException`. This commit fixes this issue and adds a test that
captures the scenario.

Backport of #50192
2019-12-16 11:30:06 +00:00
Benjamin Trent
4805d8ac7d
[ML][Inference] Adding a warning_field for warning msgs. (#49838) (#50183)
This adds a new field for the inference processor.

`warning_field` is a place for us to write warnings provided from the inference call. When there are warnings we are not going to write an inference result. The goal of this is to indicate that the data provided was too poor or too different for the model to make an accurate prediction.

The user could optionally include the `warning_field`. When it is not provided, it is assumed no warnings were desired to be written.

The first of these warnings is when ALL of the input fields are missing. If none of the trained fields are present, we don't bother inferencing against the model and instead provide a warning stating that the fields were missing.

Also, this adds checks to not allow duplicated fields during processor creation.
2019-12-13 10:39:51 -05:00
Benjamin Trent
41736dd6c3
[ML] retry bulk indexing of state docs (#50149) (#50185)
This exchanges the direct use of the `Client` for `ResultsPersisterService`. State doc persistence will now retry. Failures to persist state will still not throw, but will be audited and logged.
2019-12-13 10:39:34 -05:00
Dimitris Athanasiou
fe3c9e71d1
[7.x][ML] Fix DFA explain API timeout when source index is missing (#50176) (#50180)
This commit fixes a bug that caused the data frame analytics
_explain API to time out in a multi-node setup when the source
index was missing. When we try to create the extracted fields detector,
we check the index settings. If the index is missing that responds
with a failure that could be wrapped as a remote exception.
While we unwrapped correctly to check if the cause was an
`IndexNotFoundException`, we then proceeded to cast the original
exception instead of the cause.

Backport of #50176
2019-12-13 17:00:55 +02:00
Dimitris Athanasiou
e6cbcf7f7c
[7.x] [ML] Persist/restore state for DFA classification (#50040) (#50147)
This commit adds state persist/restore for data frame analytics classification jobs.

Backport of #50040
2019-12-13 10:33:19 +02:00
Benjamin Trent
d7ffa7f8f7
[7.x][ML] Add graceful retry for anomaly detector result indexing failures(#49508) (#50145)
* [ML] Add graceful retry for anomaly detector result indexing failures (#49508)

All results indexing now retry the amount of times configured in `xpack.ml.persist_results_max_retries`. The retries are done in a semi-random, exponential backoff.

* fixing test
2019-12-12 12:24:58 -05:00
Benjamin Trent
c043aa887f
[ML][Inference] Simplify inference processor options (#50105) (#50146)
* [ML][Inference] Simplify inference processor options

* addressing pr comments
2019-12-12 11:13:55 -05:00
David Roberts
13e47df97d [TEST] Increase timeout for ML internal cluster cleanup (#50142)
Closes #48511
2019-12-12 15:38:22 +00:00
David Kyle
7d4118dc4e Enable trace logging in failing ml NetworkDisruptionIT
https://github.com/elastic/elasticsearch/issues/49908
2019-12-12 11:16:01 +00:00
Dimitris Athanasiou
03ecaae221
[7.x][ML] Avoid classification integ test training on single class (#50072) (#50078)
The `ClassificationIT.testTwoJobsWithSameRandomizeSeedUseSameTrainingSet`
test was previously set up to just have 10 rows. With `training_percent`
of 50%, only 5 rows will be used for training. There is a good chance that
all 5 rows will be of one class which results to failure.

This commit increases the rows to 100. Now 50 rows should be used for training
and the chance of failure should be very small.

Backport of #50072
2019-12-11 18:50:26 +02:00
David Turner
285eacd267
Use more specific loggers in subclasses of TMNA (#50076)
Adjusts the subclasses of `TransportMasterNodeAction` to use their own loggers
instead of the one for the base class.

Relates #50056.
Partial backport of #46431 to 7.x.
2019-12-11 15:07:47 +00:00
Przemysław Witek
9b116c8fef
A few improvements to AnalyticsProcessManager class that make the code more readable. (#50026) (#50069) 2019-12-11 09:35:05 +01:00
Dimitris Athanasiou
8891f4db88
[7.x][ML] Introduce randomize_seed setting for regression and classification (#49990) (#50023)
This adds a new `randomize_seed` for regression and classification.
When not explicitly set, the seed is randomly generated. One can
reuse the seed in a similar job in order to ensure the same docs
are picked for training.

Backport of #49990
2019-12-10 15:29:19 +02:00
Benjamin Trent
0b6ce9683c
[ML] Use query in cardinality check (#49939) (#49984)
When checking the cardinality of a field, the query should be take into account. The user might know about some bad data in their index and want to filter down to the target_field values they care about.
2019-12-09 10:14:41 -05:00
Przemysław Witek
0965a10468
[7.x] Pass prediction_field_type to C++ analytics process (#49861) (#49981) 2019-12-09 14:43:01 +01:00
Benjamin Trent
049d854360
[ML][Inference] adjust so target_field always has inference result and optionally allow new top classes field in the classification config (#49923) (#49982) 2019-12-09 08:29:45 -05:00
Dimitris Athanasiou
e4f838e764
[7.x][ML] Update expected mem estimate in explain API integ test (#49924) (#49979)
Work in progress in the c++ side is increasing memory estimates
a bit and this test fails. At the time of this commit the mem
estimate when there is no source query is a about 2Mb. So I
am relaxing the test to assert memory estimate is less than 1Mb
instead of 500Kb.

Backport of #49924
2019-12-09 11:52:06 +02:00
Przemysław Witek
e60837aa3b
[7.x] Log whole analytics stats when the state assertion fails (#49906) (#49911) 2019-12-06 14:31:17 +01:00
Przemysław Witek
1d8e3d69d7
Make only a part of stop() method a critical section. (#49756) (#49788) 2019-12-03 09:54:16 +01:00
Dimitris Athanasiou
4edb2e7bb6
[7.x][ML] Add optional source filtering during data frame reindexing (#49690) (#49718)
This adds a `_source` setting under the `source` setting of a data
frame analytics config. The new `_source` is reusing the structure
of a `FetchSourceContext` like `analyzed_fields` does. Specifying
includes and excludes for source allows selecting which fields
will get reindexed and will be available in the destination index.

Closes #49531

Backport of #49690
2019-11-29 16:10:44 +02:00
Dimitris Athanasiou
c23a2187da
[7.x][ML] Only report complete writing_results progress after completion (#49551) (#49577)
We depend on the number of data frame rows in order to report progress
for the writing of results, the last phase of a job run. However, results
include other objects than just the data frame rows (e.g, progress, inference model, etc.).

The problem this commit fixes is that if we receive the last data frame row results
we'll report that progress is complete even though we still have more results to process
potentially. If the job gets stopped for any reason at this point, we will not be able
to restart the job properly as we'll think that the job was completed.

This commit addresses this by limiting the max progress we can report for the
writing_results phase before the results processor completes to 98.
At the end, when the process is done we set the progress to 100.

The commit also improves failure capturing and reporting in the results processor.

Backport of #49551
2019-11-26 12:20:37 +02:00
Benjamin Trent
688c78c589
[ML] Stop timing stats failure propagation (#49495) (#49501) 2019-11-25 10:09:30 -05:00
David Roberts
62811c2272 [ML] Add default categorization analyzer definition to ML info (#49545)
The categorization job wizard in the ML UI will use this
information when showing the effect of the chosen categorization
analyzer on a sample of input.
2019-11-25 13:39:16 +00:00
Dimitris Athanasiou
aca38f6882
[7.x][ML] DFA jobs should accept excluding an unsupported field (#49535) (#49544)
Before this change excluding an unsupported field resulted in
an error message that explained the excluded field could not be
detected as if it doesn't exist. This error message is confusing.

This commit commit changes this so that there is no error in this
scenario. When excluding a field that does exist but has been
automatically been excluded from the analysis there is no harm
(unlike excluding a missing field which could be a typo).

Backport of #49535
2019-11-25 15:13:00 +02:00
Dimitris Athanasiou
c149c64dc4
[7.x][ML] Apply source query on data frame analytics memory estimation (#49517) (#49532)
Closes #49454

Backport of #49517
2019-11-25 12:51:57 +02:00
Dimitris Athanasiou
8eaee7cbdc
[7.x][ML] Explain data frame analytics API (#49455) (#49504)
This commit replaces the _estimate_memory_usage API with
a new API, the _explain API.

The API consolidates information that is useful before
creating a data frame analytics job.

It includes:

- memory estimation
- field selection explanation

Memory estimation is moved here from what was previously
calculated in the _estimate_memory_usage API.

Field selection is a new feature that explains to the user
whether each available field was selected to be included or
not in the analysis. In the case it was not included, it also
explains the reason why.

Backport of #49455
2019-11-22 22:06:10 +02:00
Benjamin Trent
a7477ad7c3
[7.x] [ML][Inference] compressing model definition and lazy parsing (#49269) (#49446)
* [ML][Inference] compressing model definition and lazy parsing (#49269)

* [ML][Inference] compressing model definition and lazy parsing

* addressing PR comments

* adding commons io

* implementing simplified bounded stream

* adjusting for type inclusion
2019-11-21 15:32:32 -05:00
Benjamin Trent
d41b2e3f38
[ML][Inference] allowing per-model licensing (#49398) (#49435)
* [ML][Inference] allowing per-model licensing

* changing to internal action + removing pre-mature opt
2019-11-21 09:46:34 -05:00
Przemysław Witek
c7ac2011eb
[7.x] Implement accuracy metric for multiclass classification (#47772) (#49430) 2019-11-21 15:01:18 +01:00
David Roberts
20558cf61c [ML] Fix simultaneous stop and force stop datafeed (#49367)
If a datafeed is stopped normally and force stopped at the same
time then it is possible that the force stop removes the
persistent task while the normal stop is performing actions.
Currently this causes the normal stop to error, but since
stopping a stopped datafeed is not an error this doesn't make
sense. Instead the force stop should just take precedence.

This is a followup to #49191 and should really have been
included in the changes in that PR.
2019-11-20 12:52:47 +00:00
Przemysław Witek
9c0ec7ce23
[7.x] Make AnalyticsProcessManager class more robust (#49282) (#49356) 2019-11-20 10:08:16 +01:00
Dimitris Athanasiou
4d6e037e90
[7.x][ML] Extract creation of DFA field extractor into a factory (#49315) (#49329)
This commit moves the async calls required to retrieve the components
that make up `ExtractedFieldsExtractor` out of `DataFrameDataExtractorFactory`
and into a dedicated `ExtractorFieldsExtractorFactory` class.

A few more refactorings are performed:

  - The detector no longer needs the results field. Instead, it knows
  whether to use it or not based on whether the task is restarting.
  - We pass more accurately whether the task is restarting or not.
  - The validation of whether fields that have a cardinality limit
  are valid is now performed in the detector after retrieving the
  respective cardinalities.

Backport of #49315
2019-11-20 10:02:42 +02:00