Commit Graph

392 Commits

Author SHA1 Message Date
David Roberts b61202b0a8 [ML] Add a limit on line merging in find_file_structure (#42501)
When analysing a semi-structured text file the
find_file_structure endpoint merges lines to form
multi-line messages using the assumption that the
first line in each message contains the timestamp.
However, if the timestamp is misdetected then this
can lead to excessive numbers of lines being merged
to form massive messages.

This commit adds a line_merge_size_limit setting
(default 10000 characters) that halts the analysis
if a message bigger than this is created.  This
prevents significant CPU time being spent subsequently
trying to determine the internal structure of the
huge bogus messages.
2019-06-03 13:45:51 +01:00
David Roberts 10aca87389 [ML] Better detection of binary input in find_file_structure (#42707)
This change helps to prevent the situation where a binary
file uploaded to the find_file_structure endpoint is
detected as being text in the UTF-16 character set, and
then causes a large amount of CPU to be spent analysing
the bogus text structure.

The approach is to check the distribution of zero bytes
between odd and even file positions, on the grounds that
UTF-16BE or UTF16-LE would have a very skewed distribution.
2019-06-03 12:47:22 +01:00
David Roberts 48dc0dca57 [ML] Use map and filter instead of flatMap in find_file_structure (#42534)
Using map and filter avoids the garbage from all the
Stream.of calls that flatMap necessitated. Performance
is better when there are masses of fields.
2019-05-24 20:12:06 +01:00
David Roberts 34de68b007 [ML] Fix possible race condition when closing an opening job (#42506)
This change fixes a race condition that would result in an
in-memory data structure becoming out-of-sync with persistent
tasks in cluster state.

If repeated often enough this could result in it being
impossible to open any ML jobs on the affected node, as the
master node would think the node had capacity to open another
job but the chosen node would error during the open sequence
due to its in-memory data structure being full.

The race could be triggered by opening a job and then closing
it a tiny fraction of a second later.  It is unlikely a user
of the UI could open and close the job that fast, but a script
or program calling the REST API could.

The nasty thing is, from the externally observable states and
stats everything would appear to be fine - the fast open then
close sequence would appear to leave the job in the closed
state.  It's only later that the leftovers in the in-memory
data structure might build up and cause a problem.
2019-05-24 20:11:58 +01:00
David Roberts f472186b9f [ML] Improve file structure finder timestamp format determination (#41948)
This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.

Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM and MM/dd in the case where both numbers were 12 or less.
In order to do this sensibly it is best to look across all the
available timestamps and see if one of the numbers is greater
than 12 in any of them.  This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.

Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats.  There was no way out if a file to be
analysed had a timestamp that was sane yet not in the supported
set.  This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally S letter groups (fractional seconds) are supported
providing they occur after ss and separated from the ss by a dot,
comma or colon.  Spacing and punctuation is also permitted with
the exception of the question mark, newline and carriage return
characters, together with literal text enclosed in single quotes.

The full list of changes/improvements in this refactor is:

- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda
  format is no longer accepted
- Joda timestamp formats in outputs are now derived from the
  determined or overridden Java timestamp formats, not stored
  separately
- Functionality for determining the "best" timestamp format in
  a set of lines has been moved from TextLogFileStructureFinder
  to TimestampFormatFinder, taking advantage of the fact that
  TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok
  patterns when looking for timestamp formats has been changed
  from using simple regular expressions to the much faster
  approach of using the Shift-And method of sub-string search,
  but using an "alphabet" consisting of just 1 (representing any
  digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in
  Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
  whose definition is included within the date processor in the
  ingest pipeline
- Grok patterns that correspond to multiple Java date/time
  patterns are now handled better - the Grok pattern is accepted
  as matching broadly, and the required set of Java date/time
  patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns,
  when looking for the "best" timestamp in a set of lines
  timestamps are considered different if they are preceded by
  a different sequence of punctuation characters (to prevent
  timestamps far into some lines being considered similar to
  timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include
  %{DATE} and %{DATESTAMP}, which have indeterminate day/month
  ordering
- The order of day/month in formats with indeterminate day/month
  order is determined by considering all observed samples (plus
  the server locale if the observed samples still do not suggest
  an ordering)

Relates #38086
Closes #35137
Closes #35132
2019-05-24 09:10:08 +01:00
Dimitris Athanasiou a6eb20ad35
[ML] Include node name when native controller cannot start process (#42225) (#42338)
This adds the node name where we fail to start a process via the native
controller to facilitate debugging as otherwise it might not be known
to which node the job was allocated.
2019-05-22 12:42:04 +03:00
Yannick Welsch 770d8e9e39 Remove usage of max_local_storage_nodes in test infrastructure (#41652)
Moves the test infrastructure away from using node.max_local_storage_nodes, allowing us in a
follow-up PR to deprecate this setting in 7.x and to remove it in 8.0.

This also changes the behavior of InternalTestCluster so that starting up nodes will not automatically
reuse data folders of previously stopped nodes. If this behavior is desired, it needs to be explicitly
done by passing the data path from the stopped node to the new node that is started.
2019-05-22 11:04:55 +02:00
Ed Savage d97f4d5e28 [ML][TEST] Fix limits in AutodetectMemoryLimitIT (#42279)
Re-enable muted tests and accommodate recent backend changes
that result in higher memory usage being reported for a job
at the start of its life-cycle
2019-05-21 18:44:47 +01:00
Dimitris Athanasiou a4e6fb4dd2
[ML] Fix logger declaration in ML plugins (#42222) (#42238)
This corrects what appears to have been a copy-paste error
where the logger for `MachineLearning` and `DataFrame` was wrongly
set to be that of `XPackPlugin`.
2019-05-21 18:03:24 +03:00
Zachary Tong 6ae6f57d39
[7.x Backport] Force selection of calendar or fixed intervals (#41906)
The date_histogram accepts an interval which can be either a calendar
interval (DST-aware, leap seconds, arbitrary length of months, etc) or
fixed interval (strict multiples of SI units). Unfortunately this is inferred
by first trying to parse as a calendar interval, then falling back to fixed
if that fails.

This leads to confusing arrangement where `1d` == calendar, but
`2d` == fixed.  And if you want a day of fixed time, you have to
specify `24h` (e.g. the next smallest unit).  This arrangement is very
error-prone for users.

This PR adds `calendar_interval` and `fixed_interval` parameters to any
code that uses intervals (date_histogram, rollup, composite, datafeed, etc).
Calendar only accepts calendar intervals, fixed accepts any combination of
units (meaning `1d` can be used to specify `24h` in fixed time), and both
are mutually exclusive.

The old interval behavior is deprecated and will throw a deprecation warning.
It is also mutually exclusive with the two new parameters. In the future the
old dual-purpose interval will be removed.

The change applies to both REST and java clients.
2019-05-20 12:07:29 -04:00
Ed Savage 840af87a74 [ML] Temporarily muting failing tests
Muting a number of AutoDetectMemoryLimitIT tests to give CI a chance to
settle before easing in required backend changes.

relates elastic/ml-cpp#486
relates #42086
2019-05-19 08:29:50 -04:00
Ed Savage a68b04e47b [ML] Improve hard_limit audit message (#42086)
Improve the hard_limit memory audit message by reporting how many bytes
over the configured memory limit the job was at the point of the last
allocation failure.

Previously the model memory usage was reported, however this was
inaccurate and hence of limited use -  primarily because the total
memory used by the model can decrease significantly after the models
status is changed to hard_limit but before the model size stats are
reported from autodetect to ES.

While this PR contains the changes to the format of the hard_limit audit
message it is dependent on modifications to the ml-cpp backend to
send additional data fields in the model size stats message. These
changes will follow in a subsequent PR. It is worth noting that this PR
must be merged prior to the ml-cpp one, to keep CI tests happy.
2019-05-17 17:40:08 -04:00
David Roberts 226df35d96 [ML] Improve message misformation error in file structure finder (#42175)
This change replaces the extremely unfriendly message
"Number of messages analyzed must be positive" in the
case where the sample lines were incorrectly grouped
into just one message to an error that more helpfully
explains the likely root cause of the problem.
2019-05-16 18:29:38 +01:00
Benjamin Trent bf5a40c754
[ML] relax set upgrade mode test to match what is guaranteed (#41958) (#41979)
* [ML] relax set upgrade mode test to match what is guaranteed

* removing unused import
2019-05-09 14:28:50 -05:00
Jason Tedor d7fd51a84e
Provide names for all artifact repositories (#41857)
This commit adds a name for each Maven and Ivy repository used in the
build.
2019-05-07 06:35:28 -04:00
Ryan Ernst 6fd8924c5a Switch run task to use real distro (#41590)
The run task is supposed to run elasticsearch with the given plugin or
module. However, for modules, this is most realistic if using the full
distribution. This commit changes the run setup to use the default or
oss as appropriate.
2019-05-06 12:34:07 -07:00
Jason Tedor f4da98ca3d
Use a proper repository for ml-cpp artifacts (#41817)
This switches the strategy used to download machine learning artifacts
from a manual download through S3 to using an Ivy repository on top of
S3. This gives us all the benefits of Gradle dependency resolution
including local caching.
2019-05-04 12:44:19 -04:00
Jason Tedor 241c4ef97a
Use https for artifact locations
This commit switches to using https for some artifact locations.
2019-05-03 16:15:48 -04:00
Benjamin Trent a92c06ae09
[ML] Refactor NativeStorageProvider to enable reuse (#41414) (#41746)
* [ML] Refactor NativeStorageProvider to enable reuse

Moves `NativeStorageProvider` as a machine learning component
so that it can be reused for other job types. Also, we now
pass the persistent task description as unique identifier which
avoids conflicts between jobs of different type but with same ids.

* Adding nativeStorageProvider as component

Since `TransportForecastJobAction` is expected to get injected a `NativeStorageProvider` class, we need to make sure that it is a constructed component, as it does not have a zero parametered, public ctor.
2019-05-02 09:46:22 -05:00
Jason Tedor 7f3ab4524f
Bump 7.x branch to version 7.2.0
This commit adds the 7.2.0 version constant to the 7.x branch, and bumps
BWC logic accordingly.
2019-05-01 13:38:57 -04:00
Tom Veasey b3f4533e1c [ML] Update for model selection change and disable temporarily (#41482) (#41682) 2019-04-30 15:47:54 -05:00
Yogesh Gaikwad c0d40ae4ca
Remove deprecated stashWithOrigin calls and use the alternative (#40847) (#41562)
This commit removes the deprecated `stashWithOrigin` and
modifies its usage to use the alternative.
2019-04-28 21:25:42 +10:00
Christoph Büscher 52495843cc [Docs] Fix common word repetitions (#39703) 2019-04-25 20:47:47 +02:00
Benjamin Trent 07d36fdb23
[ML] refactoring the ML plugin to use the common auditor code (#41419) (#41485) 2019-04-24 09:56:59 -05:00
Dimitris Athanasiou eb2295ac81
[7.1][ML] Refactor autodetect service into its own class (#41378) (#41409)
This also improves aims to improve the corresponding unit tests
with regard to readability and maintainability.
2019-04-22 17:42:13 +03:00
Zachary Tong 7e62ff2823 [Rollup] Validate timezones based on rules not string comparision (#36237)
The date_histogram internally converts obsolete timezones (such as
"Canada/Mountain") into their modern equivalent ("America/Edmonton").
But rollup just stored the TZ as provided by the user.

When checking the TZ for query validation we used a string comparison,
which would fail due to the date_histo's upgrading behavior.

Instead, we should convert both to a TimeZone object and check if their
rules are compatible.
2019-04-17 13:46:44 -04:00
Iana Bondarska e090176f17 [ML] Exclude analysis fields with core field names from anomaly results (#41093)
Added "_index", "_type", "_id" to list of reserved fields.

Closes #39406
2019-04-17 16:08:03 +01:00
David Kyle 116167df55 [ML] Write header to autodetect before it is visible to other calls (#41085) 2019-04-16 13:51:29 +01:00
David Roberts 3f00c29adb [ML] Allow xpack.ml.max_machine_memory_percent higher than 100% (#41193)
Values higher than 100% are now allowed to accommodate use
cases where swapping has been determined to be acceptable.
Anomaly detector jobs only use their full model memory
during background persistence, and this is deliberately
staggered, so with large numbers of jobs few will generally
be persisting state at the same time.  Settings higher than
available memory are only recommended for OEM type
situations where a wrapper tightly controls the types of
jobs that can be created, and each job alone is considerably
smaller than what each node can handle.
2019-04-15 14:37:46 +01:00
Benjamin Trent 05cf53934a
[ML] checking if p-tasks metadata is null before updating state (#41091) (#41123)
* [ML] checking if p-tasks metadata is null before updating state

* Adding test that validates fix

* removing debug println
2019-04-11 13:54:41 -05:00
Przemysław Witek f5014ace64
[ML] Add validation that rejects duplicate detectors in PutJobAction (#40967) (#41072)
* [ML] Add validation that rejects duplicate detectors in PutJobAction

Closes #39704

* Add YML integration test for duplicate detectors fix.

* Use "== false" comparison rather than "!" operator.

* Refine error message to sound more natural.

* Put job description in square brackets in the error message.

* Use the new validation in ValidateJobConfigAction.

* Exclude YML tests for new validation from permission tests.
2019-04-10 15:43:35 +02:00
Mark Vieira 1287c7d91f
[Backport] Replace usages RandomizedTestingTask with built-in Gradle Test (#40978) (#40993)
* Replace usages RandomizedTestingTask with built-in Gradle Test (#40978)

This commit replaces the existing RandomizedTestingTask and supporting code with Gradle's built-in JUnit support via the Test task type. Additionally, the previous workaround to disable all tasks named "test" and create new unit testing tasks named "unitTest" has been removed such that the "test" task now runs unit tests as per the normal Gradle Java plugin conventions.

(cherry picked from commit 323f312bbc829a63056a79ebe45adced5099f6e6)

* Fix forking JVM runner

* Don't bump shadow plugin version
2019-04-09 11:52:50 -07:00
Jason Tedor ebba9393c1
Fix unsafe publication of invalid license enforcer (#40985)
The invalid license enforced is exposed to the cluster state update
thread (via the license state listener) before the constructor has
finished. This violates the JLS for safe publication of an object, and
means there is a concurrency bug lurking here. This commit addresses
this by avoiding publication of the invalid license enforcer before the
constructor has returned.
2019-04-09 13:51:37 -04:00
David Roberts d16f86f7ab [ML] Add created_by info to usage stats (#40518)
This change adds information about which UI path
(if any) created ML anomaly detector jobs to the
stats returned by the _xpack/usage endpoint.

Counts for the following possibilities are expected:

* ml_module_apache_access
* ml_module_apm_transaction
* ml_module_auditbeat_process_docker
* ml_module_auditbeat_process_hosts
* ml_module_nginx_access
* ml_module_sample
* multi_metric_wizard
* population_wizard
* single_metric_wizard
* unknown

The "unknown" count is for jobs that do not have a
created_by setting in their custom_settings.

Closes #38403
2019-04-04 10:55:20 +01:00
Dimitris Athanasiou 65cca2ee6f
[7.x][ML] Scrolling datafeed should clear scroll contexts on error (#40773) (#40794)
Closes #40772
2019-04-04 12:28:06 +03:00
David Kyle 1354696db9
[ML] Data Frame HLRC Get Stats API (#40443) 2019-03-26 11:17:13 +00:00
Ed Savage c20ea9a2dd [ML][TEST] Fix failing test testPersistJobOnGracefulShutdown_givenTimeAdvancedAfterNoNewData (#40363)
Ensure that there is at least a 1s delay between the time that state
is persisted by each of the two jobs in the test.

Model snapshot IDs use the current time in epoch seconds to
distinguish themselves, hence snapshots will be overwritten
by another if it occurs in the same 1s window.

Closes #40347
2019-03-25 17:55:10 +00:00
David Turner 1265a15b75 Mute testPersistJobOnGracefulShutdown_givenTimeAdvancedAfterNoNewData 2019-03-22 08:46:51 +00:00
Ed Savage 23d5f7babf
[ML] Add integration tests to check persistence (#40272) (#40315)
Additional checks to exercise the behaviour of
persistence on graceful close of an anomaly job.

Related to elastic/ml-cpp#393
Backports #40272
2019-03-21 17:01:10 +00:00
David Roberts 64028f3d8f Mute JobResultsProviderIT.testMultipleSimultaneousJobCreations
Due to https://github.com/elastic/elasticsearch/issues/40134
2019-03-17 07:50:08 +00:00
David Roberts 8d01b11918 [ML] Fix race condition when creating multiple jobs (#40049)
If multiple jobs are created together and the anomaly
results index does not exist then some of the jobs could
fail to update the mappings of the results index. This
lead them to fail to write their results correctly later.

Although this scenario sounds rare, it is exactly what
happens if the user creates their first jobs using the
Nginx module in the ML UI.

This change fixes the problem by updating the mappings
of the results index if it is found to exist during a
creation attempt.

Fixes #38785
2019-03-15 10:18:03 +00:00
David Kyle 78a9754318 Mute test NetworkDisruptionIT.testJobRelocation
Relates to #39858
2019-03-15 10:06:31 +00:00
Benjamin Trent 2016e23285
[ML] Refactor common utils out of ML plugin to XPack.Core (#39976) (#40009)
* [ML] Refactor common utils out of ML plugin to XPack.Core

* implementing GET filters with abstract transport

* removing added rest param

* adjusting how defaults can be supplied
2019-03-13 17:08:43 -05:00
Dimitris Athanasiou 79e414df86
[ML] Fix datafeed skipping first bucket after lookback when aggs are … (#39859) (#39958)
The problem here was that `DatafeedJob` was updating the last end time searched
based on the `now` even though when there are aggregations, the extactor will
only search up to the floor of `now` against the histogram interval.
This commit fixes the issue by using the end time as calculated by the extractor.

It also adds an integration test that uses aggregations. This test would fail
before this fix. Unfortunately the test is slow as we need to wait for the
datafeed to work in real time.

Closes #39842
2019-03-13 09:09:07 +02:00
David Kyle 48788269b0
[ML] Correct small inconsistencies in ml APIs spec and docs (#39907) 2019-03-11 14:02:50 +00:00
Benjamin Trent 4da04616c9
[ML] refactoring lazy query and agg parsing (#39776) (#39881)
* [ML] refactoring lazy query and agg parsing

* Clean up and addressing PR comments

* removing unnecessary try/catch block

* removing bad call to logger

* removing unused import

* fixing bwc test failure due to serialization and config migrator test

* fixing style issues

* Adjusting DafafeedUpdate class serialization

* Adding todo for refactor in v8

* Making query non-optional so it does not write a boolean byte
2019-03-10 14:54:02 -05:00
David Roberts 5f8f91c03b
[ML] Use scaling thread pool and xpack.ml.max_open_jobs cluster-wide dynamic (#39736)
This change does the following:

1. Makes the per-node setting xpack.ml.max_open_jobs
   into a cluster-wide dynamic setting
2. Changes the job node selection to continue to use the
   per-node attributes storing the maximum number of open
   jobs if any node in the cluster is older than 7.1, and
   use the dynamic cluster-wide setting if all nodes are on
   7.1 or later
3. Changes the docs to reflect this
4. Changes the thread pools for native process communication
   from fixed size to scaling, to support the dynamic nature
   of xpack.ml.max_open_jobs
5. Renames the autodetect thread pool to the job comms
   thread pool to make clear that it will be used for other
   types of ML jobs (data frame analytics in particular)

Backport of #39320
2019-03-06 12:29:34 +00:00
Dimitris Athanasiou 5c023770d2 [ML] Disable security audit trail in native integ tests suite (#39683)
Investigating how to make DeleteExpiredDataIT faster, it was
revealed that the security audit trail threads were quite hot.
Disabling that seems to be helping quite a bit with making this
test faster. This commit also unmutes the test to see how it goes
with the audit trail disabled.

Relates #39658
Closes #39575
2019-03-05 12:43:15 +02:00
David Kyle a58145f9e6
[ML] Transition to typeless (mapping) APIs (#39573)
ML has historically used doc as the single mapping type but reindex in 7.x
will change the mapping to _doc. Switching to the typeless APIs handles 
case where the mapping type is either doc or _doc. This change removes
deprecated typed usages.
2019-03-04 13:52:05 +00:00
David Roberts 085ff38122 Mute DeleteExpiredDataIT.testDeleteExpiredData
Due to https://github.com/elastic/elasticsearch/issues/39575
2019-03-03 18:34:30 +00:00
Dimitris Athanasiou 8843832039 [ML] Shave off DeleteExpiredDataIT runtime (#39557)
This commit parallelizes some parts of the test
and its remove an unnecessary refresh call.
On my local machine it shaves off about 15 seconds
for a test execution time of ~64s (down from ~80s).
This test is still slow but progress over perfection.

Relates #37339
2019-03-01 19:10:00 +02:00
Dimitris Athanasiou 8122650a55 [ML] Add integration test for interim results after advancing bucket (#39447)
This is an integration test that captures the issue described in
elastic/ml-cpp#324
2019-02-28 11:12:08 +02:00
Mehran Koushkebaghi 1d0097b5e8 [ML] Refactoring scheduled event to store instant instead of zoned time zone (#39380)
The ScheduledEvent class has never preserved the time
zone so it makes more sense for it to store the start and
end time using Instant rather than ZonedDateTime.

Closes #38620
2019-02-27 09:27:04 +00:00
David Roberts 4f2bd238d2 [ML] Increase datafeed integration test timeout for slow machines (#39311)
The assertBusy() that waits the default 10 seconds for a
datafeed to complete very occasionally times out on slow
machines.  This commit increases the timeout to 60 seconds.
It will almost never actually take this long, but it's
better to have a timeout that will prevent time being
wasted looking at spurious test failures.
2019-02-22 15:35:32 +00:00
Dimitris Athanasiou 1c6818fe74
[ML] Improve DeleteExpiredDataIT failure message (#39298) (#39310)
This test failed once in a very long time with the assertion
that there is no document for the `non_existing_job` in the
state index. I could not see how that is possible and I cannot
reproduce. With this commit the failure message will reveal
some examples of the left behind docs which might shed a light
about what could go wrong.
2019-02-22 16:15:11 +02:00
Benjamin Trent 109b6451fd
ML refactor DatafeedsConfig(Update) so defaults are not populated in queries or aggs (#38822) (#39119)
* ML refactor DatafeedsConfig(Update) so defaults are not populated in queries or aggs

* Addressing pr feedback
2019-02-19 12:45:56 -06:00
David Roberts 35e30b34f9 [ML] Stop the ML memory tracker before closing node (#39111)
The ML memory tracker does searches against ML results
and config indices.  These searches can be asynchronous,
and if they are running while the node is closing then
they can cause problems for other components.

This change adds a stop() method to the MlMemoryTracker
that waits for in-flight searches to complete.  Once
stop() has returned the MlMemoryTracker will not kick
off any new searches.

The MlLifeCycleService now calls MlMemoryTracker.stop()
before stopping stopping the node.

Fixes #37117
2019-02-19 15:12:40 +00:00
David Roberts bbcdea43c5 [ML] Allow stop unassigned datafeed and relax unset upgrade mode wait (#39034)
These two changes are interlinked.

Before this change unsetting ML upgrade mode would wait for all
datafeeds to be assigned and not waiting for their corresponding
jobs to initialise.  However, this could be inappropriate, if
there was a reason other that upgrade mode why one job was unable
to be assigned or slow to start up.  Unsetting of upgrade mode
would hang in this case.

This change relaxes the condition for considering upgrade mode to
be unset to simply that an assignment attempt has been made for
each ML persistent task that did not fail because upgrade mode
was enabled.  Thus after unsetting upgrade mode there is no
guarantee that every ML persistent task is assigned, just that
each is not unassigned due to upgrade mode.

In order to make setting upgrade mode work immediately after
unsetting upgrade mode it was then also necessary to make it
possible to stop a datafeed that was not assigned.  There was
no particularly good reason why this was not allowed in the past.
It is trivial to stop an unassigned datafeed because it just
involves removing the persistent task.
2019-02-19 14:07:10 +00:00
David Roberts b660d2cac6 [ML] More advanced post-test cleanup of ML indices (#39049)
The .ml-annotations index is created asynchronously when
some other ML index exists.  This can interfere with the
post-test index deletion, as the .ml-annotations index
can be created after all other indices have been deleted.

This change adds an ML specific post-test cleanup step
that runs before the main cleanup and:

1. Checks if any ML indices exist
2. If so, waits for the .ml-annotations index to exist
3. Deletes the other ML indices found in step 1.
4. Calls the super class cleanup

This means that by the time the main post-test index
cleanup code runs:

1. The only ML index it has to delete will be the
   .ml-annotations index
2. No other ML indices will exist that could trigger
   recreation of the .ml-annotations index

Fixes #38952
2019-02-18 14:16:03 +00:00
Martijn Laarman 9b4d96534b
Fix #38623 remove xpack namespace REST API (#38625) (#39036)
* Fix #38623 remove xpack namespace REST API

Except for xpack.usage and xpack.info API's, this moves the last remaining API's out of the xpack namespace

* rename xpack api's inside inside the files as well

* updated yaml tests references to xpack namespaces api's

* update callsApi calls in the IT subclasses

* make sure docs testing does not use xpack namespaced api's

* fix leftover xpack namespaced method names in docs/build.gradle

* found another leftover reference

(cherry picked from commit ccb5d934363c37506b76119ac050a254fa80b5e7)
2019-02-18 12:40:07 +01:00
Dimitris Athanasiou 21f76aba28
[ML] Extract base class for integ tests with native processes (#38850) (#38860) 2019-02-14 12:15:00 +02:00
Benjamin Trent d2ac05e249
ML allow aliased .ml-anomalies* index on PUT Job (#38821) (#38847) 2019-02-13 10:58:55 -06:00
Benjamin Trent 24a8ea06f5
ML: update set_upgrade_mode, add logging (#38372) (#38538)
* ML: update set_upgrade_mode, add logging

* Attempt to fix datafeed isolation

Also renamed a few methods/variables for clarity and added
some comments
2019-02-08 12:56:04 -06:00
Boaz Leskes 033ba725af
Remove support for internal versioning for concurrency control (#38254)
Elasticsearch has long [supported](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning) compare and set (a.k.a optimistic concurrency control) operations using internal document versioning. Sadly that approach is flawed and can sometime do the wrong thing. Here's the relevant excerpt from the resiliency status page:

> When a primary has been partitioned away from the cluster there is a short period of time until it detects this. During that time it will continue indexing writes locally, thereby updating document versions. When it tries to replicate the operation, however, it will discover that it is partitioned away. It won’t acknowledge the write and will wait until the partition is resolved to negotiate with the master on how to proceed. The master will decide to either fail any replicas which failed to index the operations on the primary or tell the primary that it has to step down because a new primary has been chosen in the meantime. Since the old primary has already written documents, clients may already have read from the old primary before it shuts itself down. The version numbers of these reads may not be unique if the new primary has already accepted writes for the same document 

We recently [introduced](https://www.elastic.co/guide/en/elasticsearch/reference/6.x/optimistic-concurrency-control.html) a new sequence number based approach that doesn't suffer from this dirty reads problem. 

This commit removes support for internal versioning as a concurrency control mechanism in favor of the sequence number approach.

Relates to #1078
2019-02-05 20:53:35 +01:00
David Turner f2dd5dd6eb
Remove DiscoveryPlugin#getDiscoveryTypes (#38414)
With this change we no longer support pluggable discovery implementations. No
known implementations of `DiscoveryPlugin` actually override this method, so in
practice this should have no effect on the wider world. However, we were using
this rather extensively in tests to provide the `test-zen` discovery type. We
no longer need a separate discovery type for tests as we no longer need to
customise its behaviour.

Relates #38410
2019-02-05 17:42:24 +00:00
David Roberts 92bc681705
[ML] Report index unavailable instead of waiting for lazy node (#38423)
If a job cannot be assigned to a node because an index it
requires is unavailable and there are lazy ML nodes then
index unavailable should be reported as the assignment
explanation rather than waiting for a lazy ML node.
2019-02-05 16:10:00 +00:00
Yogesh Gaikwad fe36861ada
Add support for API keys to access Elasticsearch (#38291)
X-Pack security supports built-in authentication service
`token-service` that allows access tokens to be used to 
access Elasticsearch without using Basic authentication.
The tokens are generated by `token-service` based on
OAuth2 spec. The access token is a short-lived token
(defaults to 20m) and refresh token with a lifetime of 24 hours,
making them unsuitable for long-lived or recurring tasks where
the system might go offline thereby failing refresh of tokens.

This commit introduces a built-in authentication service
`api-key-service` that adds support for long-lived tokens aka API
keys to access Elasticsearch. The `api-key-service` is consulted
after `token-service` in the authentication chain. By default,
if TLS is enabled then `api-key-service` is also enabled.
The service can be disabled using the configuration setting.

The API keys:-
- by default do not have an expiration but expiration can be
  configured where the API keys need to be expired after a
  certain amount of time.
- when generated will keep authentication information of the user that
   generated them.
- can be defined with a role describing the privileges for accessing
   Elasticsearch and will be limited by the role of the user that
   generated them
- can be invalidated via invalidation API
- information can be retrieved via a get API
- that have been expired or invalidated will be retained for 1 week
  before being deleted. The expired API keys remover task handles this.

Following are the API key management APIs:-
1. Create API Key - `PUT/POST /_security/api_key`
2. Get API key(s) - `GET /_security/api_key`
3. Invalidate API Key(s) `DELETE /_security/api_key`

The API keys can be used to access Elasticsearch using `Authorization`
header, where the auth scheme is `ApiKey` and the credentials, is the 
base64 encoding of API key Id and API key separated by a colon.
Example:-
```
curl -H "Authorization: ApiKey YXBpLWtleS1pZDphcGkta2V5" http://localhost:9200/_cluster/health
```

Closes #34383
2019-02-05 14:21:57 +11:00
David Roberts fb6a176caf
[ML] Add explanation so far to file structure finder exceptions (#38191)
The explanation so far can be invaluable for troubleshooting
as incorrect decisions made early on in the structure analysis
can result in seemingly crazy decisions or timeouts later on.

Relates elastic/kibana#29821
2019-02-04 14:32:35 +00:00
Boaz Leskes ff13a43144
Move ML Optimistic Concurrency Control to Seq No (#38278)
This commit moves the usage of internal versioning for CAS operations to use sequence numbers and primary terms

Relates to #36148
Relates to #10708
2019-02-04 10:41:08 +01:00
David Turner 1d82a6d9f9
Deprecate unused Zen1 settings (#38289)
Today the following settings in the `discovery.zen` namespace are still used:

- `discovery.zen.no_master_block`
- `discovery.zen.hosts_provider`
- `discovery.zen.ping.unicast.concurrent_connects`
- `discovery.zen.ping.unicast.hosts.resolve_timeout`
- `discovery.zen.ping.unicast.hosts`

This commit deprecates all other settings in this namespace so that they can be
removed in the next major version.
2019-02-04 08:52:08 +00:00
Benjamin Trent 5db305023d
ML: Fix error race condition on stop _all datafeeds and close _all jobs (#38113)
* ML: Ignore when task is not found for _all

* Addressing PR comments

* Update TransportStopDatafeedAction.java
2019-02-01 11:16:35 -06:00
David Roberts 1fa413a16d
[ML] Remove "8" prefixes from file structure finder timestamp formats (#38016)
In 7.x Java timestamp formats are the default timestamp format and
there is no need to prefix them with "8".  (The "8" prefix was used
in 6.7 to distinguish Java timestamp formats from Joda timestamp
formats.)

This change removes the "8" prefixes from timestamp formats in the
output of the ML file structure finder.
2019-02-01 15:36:04 +00:00
Benjamin Trent be381b4525
ML: better handle task state race condition (#38040) 2019-01-31 11:07:54 -06:00
Henning Andersen 68ed72b923
Handle scheduler exceptions (#38014)
Scheduler.schedule(...) would previously assume that caller handles
exception by calling get() on the returned ScheduledFuture.
schedule() now returns a ScheduledCancellable that no longer gives
access to the exception. Instead, any exception thrown out of a
scheduled Runnable is logged as a warning.

This is a continuation of #28667, #36137 and also fixes #37708.
2019-01-31 17:51:45 +01:00
Benjamin Trent 9782aaa1b8
ML: Add reason field in JobTaskState (#38029)
* ML: adding reason to job failure status

* marking reason as nullable

* Update AutodetectProcessManager.java
2019-01-30 11:56:24 -06:00
Benjamin Trent 8280a20664
ML: Add upgrade mode docs, hlrc, and fix bug (#37942)
* ML: Add upgrade mode docs, hlrc, and fix bug

* [DOCS] Fixes build error and edits text

* adjusting docs

* Update docs/reference/ml/apis/set-upgrade-mode.asciidoc

Co-Authored-By: benwtrent <ben.w.trent@gmail.com>

* Update set-upgrade-mode.asciidoc

* Update set-upgrade-mode.asciidoc
2019-01-30 06:51:11 -06:00
Adrien Grand c8af0f4bfa
Use mappings to format doc-value fields by default. (#30831)
Doc-value fields now return a value that is based on the mappings rather than
the script implementation by default.

This deprecates the special `use_field_mapping` docvalue format which was added
in #29639 only to ease the transition to 7.x and it is not necessary anymore in
7.0.
2019-01-30 10:31:51 +01:00
Benjamin Trent 34d61d3231
ML: ignore unknown fields for JobTaskState (#37982) 2019-01-29 12:51:34 -06:00
David Kyle 6d1693ff49 [ML] Prevent submit after autodetect worker is stopped (#37700)
Runnables can be submitted to
AutodetectProcessManager.AutodetectWorkerExecutorService
without error after it has been shutdown which can lead
to requests timing out as their handlers are never called
by the terminated executor.

This change throws an EsRejectedExecutionException if a
runnable is submitted after after the shutdown and calls
AbstractRunnable.onRejection on any tasks not run.

Closes #37108
2019-01-29 15:09:40 +00:00
Henrique Gonçalves eceb3185c7 [ML] Make GetJobStats work with arbitrary wildcards and groups (#36683)
The /_ml/anomaly_detectors/{job}/_stats endpoint now
works correctly when {job} is a wildcard or job group.

Closes #34745
2019-01-29 09:06:50 +00:00
Dimitris Athanasiou ebe9c95230
[ML] Audit all errors during job deletion (#37933)
This commit moves the auditing of job deletion related errors
to the final listener in the job delete action. This ensures
any error that occurs during job deletion is audited.
2019-01-29 10:23:50 +02:00
Benjamin Trent 7e4c0e6991
ML: Adds set_upgrade_mode API endpoint (#37837)
* ML: Add MlMetadata.upgrade_mode and API

* Adding tests

* Adding wait conditionals for the upgrade_mode call to return

* Adding tests

* adjusting format and tests

* Adjusting wait conditions for api return and msgs

* adjusting doc tests

* adding upgrade mode tests to black list
2019-01-28 09:07:30 -06:00
David Kyle c0409fb9f0
[ML] Marginal gains in slow multi node QA tests (#37825)
Move 2 tests that are simple rest tests and out of the QA suite and cut the number
of post data calls in ForecastIT
2019-01-28 10:00:59 +00:00
David Roberts 57d321ed5f
[ML] Tighten up use of aliases rather than concrete indices (#37874)
We have read and write aliases for the ML results indices.  However,
the job still had methods that purported to reliably return the name
of the concrete results index being used by the job.  After reindexing
prior to upgrade to 7.x this will be wrong, so the method has been
renamed and the comments made more explicit to say the returned index
name may not be the actual concrete index name for the lifetime of the
job.  Additionally, the selection of indices when deleting the job
has been changed so that it works regardless of concrete index names.

All these changes are nice-to-have for 6.7 and 7.0, but will become
critical if we add rolling results indices in the 7.x release stream
as 6.7 and 7.0 nodes may have to operate in a mixed version cluster
that includes a version that can roll results indices.
2019-01-28 09:38:46 +00:00
David Roberts f2c0c26d15
[ML] Adjust structure finder for Joda to Java time migration (#37306)
The ML file structure finder has always reported both Joda
and Java time format strings.  This change makes the Java time
format strings the ones that are incorporated into mappings
and ingest pipeline definitions.

The BWC syntax of prepending "8" to these formats is used.
This will need to be removed once Java time format strings
become the default in Elasticsearch.

This commit also removes direct imports of Joda classes in the
structure finder unit tests.  Instead the core Joda BWC class
is used.
2019-01-26 20:19:57 +00:00
Benjamin Trent 9e932f4869
ML: removing unnecessary upgrade code (#37879) 2019-01-25 13:57:41 -06:00
Christoph Büscher b4b4cd6ebd
Clean codebase from empty statements (#37822)
* Remove empty statements

There are a couple of instances of undocumented empty statements all across the
code base. While they are mostly harmless, they make the code hard to read and
are potentially error-prone. Removing most of these instances and marking blocks
that look empty by intention as such.

* Change test, slightly more verbose but less confusing
2019-01-25 14:23:02 +01:00
David Roberts deafce1acd
[ML] No need to add state doc mapping on job open in 7.x (#37759)
When upgrading from 5.4 to 5.5 to 6.7 (inclusive) it was
necessary to ensure there was a mapping for type "doc" on
the ML state index before opening a job.  This was because
5.4 created a multi-type ML state index.

In version 7.x we can be sure that any such 5.4 index is no
longer in use.  It would have had to be reindexed into the
6.x index format prior to the upgrade to version 7.x.
2019-01-25 13:15:35 +00:00
Jim Ferenczi 787acb14b9
Track total hits up to 10,000 by default (#37466)
This commit changes the default for the `track_total_hits` option of the search request
to `10,000`. This means that by default search requests will accurately track the total hit count
up to `10,000` documents, requests that match more than this value will set the `"total.relation"`
to `"gte"` (e.g. greater than or equals) and the `"total.value"` to `10,000` in the search response.
Scroll queries are not impacted, they will continue to count the total hits accurately.
The default is set back to `true` (accurate hit count) if `rest_total_hits_as_int` is set in the search request.
I choose `10,000` as the default because that's also the number we use to limit pagination. This means that users will be able to know how far they can jump (up to 10,000) even if the total number of hits is not accurate.

Closes #33028
2019-01-25 13:45:39 +01:00
David Kyle e1226f69b7
[ML] Increase close job timeout and lower the max number (#37770) 2019-01-24 09:18:48 +00:00
Lee Hinman 427bc7f940
Use ILM for Watcher history deletion (#37443)
* Use ILM for Watcher history deletion

This commit adds an index lifecycle policy for the `.watch-history-*` indices.
This policy is automatically used for all new watch history indices.

This does not yet remove the automatic cleanup that the monitoring plugin does
for the .watch-history indices, and it does not touch the
`xpack.watcher.history.cleaner_service.enabled` setting.

Relates to #32041
2019-01-23 10:18:08 -07:00
Alexander Reelsen daa2ec8a60
Switch mapping/aggregations over to java time (#36363)
This commit moves the aggregation and mapping code from joda time to
java time. This includes field mappers, root object mappers, aggregations with date
histograms, query builders and a lot of changes within tests.

The cut-over to java time is a requirement so that we can support nanoseconds
properly in a future field mapper.

Relates #27330
2019-01-23 10:40:05 +01:00
David Roberts 7b3dd3022d
[ML] Update ML results mappings on process start (#37706)
This change moves the update to the results index mappings
from the open job action to the code that starts the
autodetect process.

When a rolling upgrade is performed we need to update the
mappings for already-open jobs that are reassigned from an
old version node to a new version node, but the open job
action is not called in this case.

Closes #37607
2019-01-23 09:37:37 +00:00
Ryan Ernst fc99eb3e65
Add cache cleaning task for ML snapshot (#37505)
The ML subproject of xpack has a cache for the cpp artifact snapshots
which is checked on each build. The cache is outside of the build dir so
that it is not wiped on a typical clean, as the artifacts can be large
and do not change often. This commit adds a cleanCache task which will
wipe the cache dir, as over time the size of the directory can become
bloated.
2019-01-19 16:16:58 -08:00
Benjamin Trent 12cdf1cba4
ML: Add support for single bucket aggs in Datafeeds (#37544)
Single bucket aggs are now supported in datafeed aggregation configurations.
2019-01-18 15:08:53 -06:00
Benjamin Trent 5384162a42
ML: creating ML State write alias and pointing writes there (#37483)
* ML: creating ML State write alias and pointing writes there

* Moving alias check to openJob method

* adjusting concrete index lookup for ml-state
2019-01-18 14:32:34 -06:00
Yannick Welsch 6d64a2a901
Propagate Errors in executors to uncaught exception handler (#36137)
This is a continuation of #28667 and has as goal to convert all executors to propagate errors to the
uncaught exception handler. Notable missing ones were the direct executor and the scheduler. This
commit also makes it the property of the executor, not the runnable, to ensure this property. A big
part of this commit also consists of vastly improving the test coverage in this area.
2019-01-17 17:46:35 +01:00
David Kyle 75410dc632 [Ml] Prevent config snapshot failure blocking migration (#37493) 2019-01-16 11:51:15 +00:00
Hendrik Muhs 15d1b904a1
[ML] log minimum diskspace setting if forecast fails due to insufficient d… (#37486)
log minimum disk space setting if forecast fails due to insufficient disk space
2019-01-16 08:10:13 +01:00
David Kyle bea46f7b52
[ML] Migrate unallocated jobs and datafeeds (#37430)
Migrate ml job and datafeed config of open jobs and update
the parameters of the persistent tasks as they become unallocated
during a rolling upgrade. Block allocation of ml persistent tasks
until the configs are migrated.
2019-01-15 18:21:39 +00:00