Test clusters currently has its own set of logic for dealing with
finding different versions of Elasticsearch, downloading them, and
extracting them. This commit converts testclusters to use the
DistributionDownloadPlugin.
This commit creates new base classes for master node actions whose
response types still implement Streamable. This simplifies both finding
remaining classes to convert, as well as creating new master node
actions that use Writeable for their responses.
relates #34389
* HLRC: Fix '+' Not Correctly Encoded in GET Req.
* Encode `+` correctly as `%2B` in URL paths
* Keep encoding `+` as space in URL parameters
* Closes#33077
This commit moves the Supplier variant of HandledTransportAction to have
a different ordering than the Writeable.Reader variant. The Supplier
version is used for the legacy Streamable, and currently having the
location of the Writeable.Reader vs Supplier in the same place forces
using casts of Writeable.Reader to select the correct super constructor.
This change in ordering allows easier migration to Writeable.Reader.
relates #34389
Now that ML job configs are stored in an index rather than
cluster state, availability of the .ml-config index is very
important to the operation of ML. When a cluster starts up
the ML persistent tasks will be considered for node
assignment very early on. It is best in this case if
assignment is deferred until after the .ml-config index is
available.
The introduction of data frame analytics jobs has made this
problem worse, because anomaly detection jobs already waited
for the primary shards of the .ml-state, .ml-anomalies-shared
and .ml-meta indices to be available before doing node
assignment, and by coincidence this would probably lead to
the primary shards of .ml-config also being searchable. But
data frame analytics jobs had no other index checks prior to
this change.
This fixes problem 2 of #44156
By default, we don't check ranges while indexing geo_shapes. As a
result, it is possible to index geoshapes that contain contain
coordinates outside of -90 +90 and -180 +180 ranges. Such geoshapes
will currently break SQL and ML retrieval mechanism. This commit removes
these restriction from the validator is used in SQL and ML retrieval.
When the ML memory tracker is refreshed and a refresh is
already in progress the idea is that the second and
subsequent refresh requests receive the same response as
the currently in progress refresh.
There was a bug that if a refresh failed then the ML
memory tracker's view of whether a refresh was in progress
was not reset, leading to every subsequent request being
registered to receive a response that would never come.
This change makes the ML memory tracker pass on failures
as well as successes to all interested parties and reset
the list of interested parties so that further refresh
attempts are possible after either a success or failure.
This fixes problem 1 of #44156
Custom timestamp overrides provided to the find_file_structure
endpoint produced an invalid Grok pattern if the fractional
seconds separator was a dot rather than a comma or colon.
This commit fixes that problem and adds tests for this sort
of timestamp override.
Fixes#44110
The count should match the number of all df-analytics that
matched the id in the request. However, we set the count
to the number of df-analytics returned which was bound to the
`size` parameter. This commit fixes this by setting the count
to the count of the `get` response.
A bug was introduced in 6.6.0 when we added support for
rollup indices. Rollup caps does NOT support looking at
remote indices, consequently, since we always look up rollup
caps, the datafeed fails with an error if its config
includes a concrete remote index. (When all remote indices
in a datafeed config are wildcards the problem did not
occur.)
The rollups feature does not support remote indices, so if
there is any remote index in a datafeed config (wildcarded
or not), we can skip the rollup cap checks. This PR
implements that change.
This brings TokenizerFactory into line with CharFilterFactory and TokenFilterFactory,
and removes the need to pass around tokenizer names when building custom analyzers.
As this means that TokenizerFactory is no longer a functional interface, the commit also
adds a factory method to TokenizerFactory to make construction simpler.
This introduces a `failed` state to which the data frame analytics
persistent task is set to when something unexpected fails. It could
be the process crashing, the results processor hitting some error,
etc. The failure message is then captured and set on the task state.
From there, it becomes available via the _stats API as `failure_reason`.
The df-analytics stop API now has a `force` boolean parameter. This allows
the user to call it for a failed task in order to reset it to `stopped` after
we have ensured the failure has been communicated to the user.
This commit also adds the analytics version in the persistent task
params as this allows us to prevent tasks to run on unsuitable nodes in
the future.
Renames `_id_copy` to `ml__id_copy` as field names starting with
underscore are deprecated. The new field name `ml__id_copy` was
chosen as an obscure enough field that users won't have in their data.
Otherwise, this field is only intented to be used by df-analytics.
If a job is opened and then closed and does nothing in
between then it should not persist any results or state
documents. This change adapts the no-op job test to
assert no results in addition to no state, and to log
any documents that cause this assertion to fail.
Relates elastic/ml-cpp#512
Relates #43680
The Action base class currently works for both Streamable and Writeable
response types. This commit intorduces StreamableResponseAction, for
which only the legacy Action implementions which provide newResponse()
will extend. This eliminates the need for overriding newResponse() with
an UnsupportedOperationException.
relates #34389
Since #41817 was merged the ml-cpp zip file for any
given version has been cached indefinitely by Gradle.
This is problematic, particularly in the case of the
master branch where the version 8.0.0-SNAPSHOT will
be in use for more than a year.
This change tells Gradle that the ml-cpp zip file is
a "changing" dependency, and to check whether it has
changed every two hours. Two hours is a compromise
between checking on every build and annoying developers
with slow internet connections and checking rarely
causing bug fixes in the ml-cpp code to take a long
time to propagate through to elasticsearch PRs that
rely on them.
This commit adds support for multiple source indices.
In order to deal with multiple indices having different mappings,
it attempts a best-effort approach to merge the mappings assuming
there are no conflicts. In case conflicts exists an error will be
returned.
To allow users creating custom mappings for special use cases,
the destination index is now allowed to exist before the analytics
job runs. In addition, settings are no longer copied except for
the `index.number_of_shards` and `index.number_of_replicas`.
* Deduplicate org.elasticsearch.xpack.core.dataframe.utils.TimeUtils and org.elasticsearch.xpack.core.ml.utils.time.TimeUtils into a common class: org.elasticsearch.xpack.core.common.time.TimeUtils.
* Add unit tests for parseTimeField and parseTimeFieldToInstant methods
This change introduces a new setting,
xpack.ml.process_connect_timeout, to enable
the timeout for one of the external ML processes
to connect to the ES JVM to be increased.
The timeout may need to be increased if many
processes are being started simultaneously on
the same machine. This is unlikely in clusters
with many ML nodes, as we balance the processes
across the ML nodes, but can happen in clusters
with a single ML node and a high value for
xpack.ml.node_concurrent_job_allocations.
This merges the initial work that adds a framework for performing
machine learning analytics on data frames. The feature is currently experimental
and requires a platinum license. Note that the original commits can be
found in the `feature-ml-data-frame-analytics` branch.
A new set of APIs is added which allows the creation of data frame analytics
jobs. Configuration allows specifying different types of analysis to be performed
on a data frame. At first there is support for outlier detection.
The APIs are:
- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- POST _ml/data_frame/analysis/{id}/_stop
- DELETE _ml/data_frame/analysis/{id}
When a data frame analytics job is started a persistent task is created and started.
The main steps of the task are:
1. reindex the source index into the dest index
2. analyze the data through the data_frame_analyzer c++ process
3. merge the results of the process back into the destination index
In addition, an evaluation API is added which packages commonly used metrics
that provide evaluation of various analysis:
- POST _ml/data_frame/_evaluate
The error message if the native controller failed to run
(for example due to running Elasticsearch on an unsupported
platform) was not easy to understand. This change removes
pointless detail from the message and adds some hints about
likely causes.
Fixes#42341
This commit replaces usages of Streamable with Writeable for the
AcknowledgedResponse and its subclasses, plus associated actions.
Note that where possible response fields were made final and default
constructors were removed.
This is a large PR, but the change is mostly mechanical.
Relates to #34389
Backport of #43414
After the network disruption a partition is created,
one side of which can form a cluster the other can't.
Ensure requests are sent to a node on the correct side
of the cluster
This commit removes some very old test logging annotations that appeared
to be added to investigate test failures that are long since closed. If
these are needed, they can be added back on a case-by-case basis with a
comment associating them to a test failure.
* Return 0 for negative "free" and "total" memory reported by the OS
We've had a situation where the MX bean reported negative values for the
free memory of the OS, in those rare cases we want to return a value of
0 rather than blowing up later down the pipeline.
In the event that there is a serialization or creation error with regard
to memory use, this adds asserts so the failure will occur as soon as
possible and give us a better location for investigation.
Resolves#42157
* Fix test passing in invalid memory value
* Fix another test passing in invalid memory value
* Also change mem check in MachineLearning.machineMemoryFromStats
* Add background documentation for why we prevent negative return values
* Clarify comment a bit more
This trace logging looks like it was copy/pasted from another test,
where the logging in that test was only added to investigate a test
failure. This commit removes the trace logging.
The ML failover tests sometimes need to wait for jobs to be
assigned to new nodes following a node failure. They wait
10 seconds for this to happen. However, if the node that
failed was the master node and a new master was elected then
this 10 seconds might not be long enough as a refresh of the
memory stats will delay job assignment. Once the memory
refresh completes the persistent task will be assigned when
the next cluster state update occurs or after the periodic
recheck interval, which defaults to 30 seconds. Rather than
increase the length of the wait for assignment to 31 seconds,
this change decreases the periodic recheck interval to 1
second.
Fixes#43289
We were stopping a node in the cluster at a time when
the replica shards of the .ml-state index might not
have been created. This change moves the wait for
green status to a point where the .ml-state index
exists.
Fixes#40546Fixes#41742
Forward port of #43111
A static code analysis revealed that we are not closing
the input stream in the post_data endpoint. This
actually makes no difference in practice, as the
particular InputStream implementation in this case is
org.elasticsearch.common.bytes.BytesReferenceStreamInput
and its close() method is a no-op. However, it is good
practice to close the stream anyway.
The machine learning feature of xpack has native binaries with a
different commit id than the rest of code. It is currently exposed in
the xpack info api. This commit adds that commit information to the ML
info api, so that it may be removed from the info api.
Previously 10 digit numbers were considered candidates to be
timestamps recorded as seconds since the epoch and 13 digit
numbers as timestamps recorded as milliseconds since the epoch.
However, this meant that we could detect these formats for
numbers that would represent times far in the future. As an
example ISBN numbers starting with 9 were detected as milliseconds
since the epoch since they had 13 digits.
This change tweaks the logic for detecting such timestamps to
require that they begin with 1 or 2. This means that numbers
that would represent times beyond about 2065 are no longer
detected as epoch timestamps. (We can add 3 to the definition
as we get closer to the cutoff date.)
The description field of xpack featuresets is optionally part of the
xpack info api, when using the verbose flag. However, this information
is unnecessary, as it is better left for documentation (and the existing
descriptions describe anything meaningful). This commit removes the
description field from feature sets.
The tests for the ML TimeoutChecker rely on threads
not being interrupted after the TimeoutChecker is
closed. This change ensures this by making the
close() and setTimeoutExceeded() methods synchronized
so that the code inside them cannot execute
simultaneously.
Fixes#43097
* [ML] Adding support for geo_shape, geo_centroid, geo_point in datafeeds
* only supporting doc_values for geo_point fields
* moving validation into GeoPointField ctor
Get resources action sorts on the resource id. When there are no resources at
all, then it is possible the index does not contain a mapping for the resource
id field. In that case, the search api fails by default.
This commit adjusts the search request to ignore unmapped fields.
Closeselastic/kibana#37870
Both TransportAnalyzeAction and CategorizationAnalyzer have logic to build
custom analyzers for index-independent analysis. A lot of this code is duplicated,
and it requires the AnalysisRegistry to expose a number of internal provider
classes, as well as making some assumptions about when analysis components are
constructed.
This commit moves the build logic directly into AnalysisRegistry, reducing the
registry's API surface considerably.
Previously, a reindex request had two different size specifications in the body:
* Outer level, determining the maximum documents to process
* Inside the source element, determining the scroll/batch size.
The outer level size has now been renamed to max_docs to
avoid confusion and clarify its semantics, with backwards compatibility and
deprecation warnings for using size.
Similarly, the size parameter has been renamed to max_docs for
update/delete-by-query to keep the 3 interfaces consistent.
Finally, all 3 endpoints now support max_docs in both body and URL.
Relates #24344
A static code analysis revealed that we are not closing
the input stream in the find_file_structure endpoint.
This actually makes no difference in practice, as the
particular InputStream implementation in this case is
org.elasticsearch.common.bytes.BytesReferenceStreamInput
and its close() method is a no-op. However, it is good
practice to close the stream anyway.
This change adds the earliest and latest timestamps into
the field stats for fields of type "date" in the output of
the ML find_file_structure endpoint. This will enable the
cards for date fields in the file data visualizer in the UI
to be made to look more similar to the cards for date
fields in the index data visualizer in the UI.
Dots in the column names cause an error in the ingest
pipeline, as dots are special characters in ingest pipeline.
This PR changes dots into underscores in CSV field names
suggested by the ML find_file_structure endpoint _unless_
the field names are specifically overridden. The reason for
allowing them in overrides is that fields that are not
mentioned in the ingest pipeline can contain dots. But it's
more consistent that the default behaviour is to replace
them all.
Fixeselastic/kibana#26800
When analysing a semi-structured text file the
find_file_structure endpoint merges lines to form
multi-line messages using the assumption that the
first line in each message contains the timestamp.
However, if the timestamp is misdetected then this
can lead to excessive numbers of lines being merged
to form massive messages.
This commit adds a line_merge_size_limit setting
(default 10000 characters) that halts the analysis
if a message bigger than this is created. This
prevents significant CPU time being spent subsequently
trying to determine the internal structure of the
huge bogus messages.
This change helps to prevent the situation where a binary
file uploaded to the find_file_structure endpoint is
detected as being text in the UTF-16 character set, and
then causes a large amount of CPU to be spent analysing
the bogus text structure.
The approach is to check the distribution of zero bytes
between odd and even file positions, on the grounds that
UTF-16BE or UTF16-LE would have a very skewed distribution.
This change fixes a race condition that would result in an
in-memory data structure becoming out-of-sync with persistent
tasks in cluster state.
If repeated often enough this could result in it being
impossible to open any ML jobs on the affected node, as the
master node would think the node had capacity to open another
job but the chosen node would error during the open sequence
due to its in-memory data structure being full.
The race could be triggered by opening a job and then closing
it a tiny fraction of a second later. It is unlikely a user
of the UI could open and close the job that fast, but a script
or program calling the REST API could.
The nasty thing is, from the externally observable states and
stats everything would appear to be fine - the fast open then
close sequence would appear to leave the job in the closed
state. It's only later that the leftovers in the in-memory
data structure might build up and cause a problem.
This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.
Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM and MM/dd in the case where both numbers were 12 or less.
In order to do this sensibly it is best to look across all the
available timestamps and see if one of the numbers is greater
than 12 in any of them. This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.
Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats. There was no way out if a file to be
analysed had a timestamp that was sane yet not in the supported
set. This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally S letter groups (fractional seconds) are supported
providing they occur after ss and separated from the ss by a dot,
comma or colon. Spacing and punctuation is also permitted with
the exception of the question mark, newline and carriage return
characters, together with literal text enclosed in single quotes.
The full list of changes/improvements in this refactor is:
- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda
format is no longer accepted
- Joda timestamp formats in outputs are now derived from the
determined or overridden Java timestamp formats, not stored
separately
- Functionality for determining the "best" timestamp format in
a set of lines has been moved from TextLogFileStructureFinder
to TimestampFormatFinder, taking advantage of the fact that
TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok
patterns when looking for timestamp formats has been changed
from using simple regular expressions to the much faster
approach of using the Shift-And method of sub-string search,
but using an "alphabet" consisting of just 1 (representing any
digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in
Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
whose definition is included within the date processor in the
ingest pipeline
- Grok patterns that correspond to multiple Java date/time
patterns are now handled better - the Grok pattern is accepted
as matching broadly, and the required set of Java date/time
patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns,
when looking for the "best" timestamp in a set of lines
timestamps are considered different if they are preceded by
a different sequence of punctuation characters (to prevent
timestamps far into some lines being considered similar to
timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include
%{DATE} and %{DATESTAMP}, which have indeterminate day/month
ordering
- The order of day/month in formats with indeterminate day/month
order is determined by considering all observed samples (plus
the server locale if the observed samples still do not suggest
an ordering)
Relates #38086Closes#35137Closes#35132
This adds the node name where we fail to start a process via the native
controller to facilitate debugging as otherwise it might not be known
to which node the job was allocated.
Moves the test infrastructure away from using node.max_local_storage_nodes, allowing us in a
follow-up PR to deprecate this setting in 7.x and to remove it in 8.0.
This also changes the behavior of InternalTestCluster so that starting up nodes will not automatically
reuse data folders of previously stopped nodes. If this behavior is desired, it needs to be explicitly
done by passing the data path from the stopped node to the new node that is started.
Re-enable muted tests and accommodate recent backend changes
that result in higher memory usage being reported for a job
at the start of its life-cycle
This corrects what appears to have been a copy-paste error
where the logger for `MachineLearning` and `DataFrame` was wrongly
set to be that of `XPackPlugin`.
The date_histogram accepts an interval which can be either a calendar
interval (DST-aware, leap seconds, arbitrary length of months, etc) or
fixed interval (strict multiples of SI units). Unfortunately this is inferred
by first trying to parse as a calendar interval, then falling back to fixed
if that fails.
This leads to confusing arrangement where `1d` == calendar, but
`2d` == fixed. And if you want a day of fixed time, you have to
specify `24h` (e.g. the next smallest unit). This arrangement is very
error-prone for users.
This PR adds `calendar_interval` and `fixed_interval` parameters to any
code that uses intervals (date_histogram, rollup, composite, datafeed, etc).
Calendar only accepts calendar intervals, fixed accepts any combination of
units (meaning `1d` can be used to specify `24h` in fixed time), and both
are mutually exclusive.
The old interval behavior is deprecated and will throw a deprecation warning.
It is also mutually exclusive with the two new parameters. In the future the
old dual-purpose interval will be removed.
The change applies to both REST and java clients.
Muting a number of AutoDetectMemoryLimitIT tests to give CI a chance to
settle before easing in required backend changes.
relates elastic/ml-cpp#486
relates #42086
Improve the hard_limit memory audit message by reporting how many bytes
over the configured memory limit the job was at the point of the last
allocation failure.
Previously the model memory usage was reported, however this was
inaccurate and hence of limited use - primarily because the total
memory used by the model can decrease significantly after the models
status is changed to hard_limit but before the model size stats are
reported from autodetect to ES.
While this PR contains the changes to the format of the hard_limit audit
message it is dependent on modifications to the ml-cpp backend to
send additional data fields in the model size stats message. These
changes will follow in a subsequent PR. It is worth noting that this PR
must be merged prior to the ml-cpp one, to keep CI tests happy.
This change replaces the extremely unfriendly message
"Number of messages analyzed must be positive" in the
case where the sample lines were incorrectly grouped
into just one message to an error that more helpfully
explains the likely root cause of the problem.
The run task is supposed to run elasticsearch with the given plugin or
module. However, for modules, this is most realistic if using the full
distribution. This commit changes the run setup to use the default or
oss as appropriate.
This switches the strategy used to download machine learning artifacts
from a manual download through S3 to using an Ivy repository on top of
S3. This gives us all the benefits of Gradle dependency resolution
including local caching.
* [ML] Refactor NativeStorageProvider to enable reuse
Moves `NativeStorageProvider` as a machine learning component
so that it can be reused for other job types. Also, we now
pass the persistent task description as unique identifier which
avoids conflicts between jobs of different type but with same ids.
* Adding nativeStorageProvider as component
Since `TransportForecastJobAction` is expected to get injected a `NativeStorageProvider` class, we need to make sure that it is a constructed component, as it does not have a zero parametered, public ctor.
The date_histogram internally converts obsolete timezones (such as
"Canada/Mountain") into their modern equivalent ("America/Edmonton").
But rollup just stored the TZ as provided by the user.
When checking the TZ for query validation we used a string comparison,
which would fail due to the date_histo's upgrading behavior.
Instead, we should convert both to a TimeZone object and check if their
rules are compatible.
Values higher than 100% are now allowed to accommodate use
cases where swapping has been determined to be acceptable.
Anomaly detector jobs only use their full model memory
during background persistence, and this is deliberately
staggered, so with large numbers of jobs few will generally
be persisting state at the same time. Settings higher than
available memory are only recommended for OEM type
situations where a wrapper tightly controls the types of
jobs that can be created, and each job alone is considerably
smaller than what each node can handle.
* [ML] Add validation that rejects duplicate detectors in PutJobAction
Closes#39704
* Add YML integration test for duplicate detectors fix.
* Use "== false" comparison rather than "!" operator.
* Refine error message to sound more natural.
* Put job description in square brackets in the error message.
* Use the new validation in ValidateJobConfigAction.
* Exclude YML tests for new validation from permission tests.
* Replace usages RandomizedTestingTask with built-in Gradle Test (#40978)
This commit replaces the existing RandomizedTestingTask and supporting code with Gradle's built-in JUnit support via the Test task type. Additionally, the previous workaround to disable all tasks named "test" and create new unit testing tasks named "unitTest" has been removed such that the "test" task now runs unit tests as per the normal Gradle Java plugin conventions.
(cherry picked from commit 323f312bbc829a63056a79ebe45adced5099f6e6)
* Fix forking JVM runner
* Don't bump shadow plugin version
The invalid license enforced is exposed to the cluster state update
thread (via the license state listener) before the constructor has
finished. This violates the JLS for safe publication of an object, and
means there is a concurrency bug lurking here. This commit addresses
this by avoiding publication of the invalid license enforcer before the
constructor has returned.
This change adds information about which UI path
(if any) created ML anomaly detector jobs to the
stats returned by the _xpack/usage endpoint.
Counts for the following possibilities are expected:
* ml_module_apache_access
* ml_module_apm_transaction
* ml_module_auditbeat_process_docker
* ml_module_auditbeat_process_hosts
* ml_module_nginx_access
* ml_module_sample
* multi_metric_wizard
* population_wizard
* single_metric_wizard
* unknown
The "unknown" count is for jobs that do not have a
created_by setting in their custom_settings.
Closes#38403
Ensure that there is at least a 1s delay between the time that state
is persisted by each of the two jobs in the test.
Model snapshot IDs use the current time in epoch seconds to
distinguish themselves, hence snapshots will be overwritten
by another if it occurs in the same 1s window.
Closes#40347
If multiple jobs are created together and the anomaly
results index does not exist then some of the jobs could
fail to update the mappings of the results index. This
lead them to fail to write their results correctly later.
Although this scenario sounds rare, it is exactly what
happens if the user creates their first jobs using the
Nginx module in the ML UI.
This change fixes the problem by updating the mappings
of the results index if it is found to exist during a
creation attempt.
Fixes#38785
* [ML] Refactor common utils out of ML plugin to XPack.Core
* implementing GET filters with abstract transport
* removing added rest param
* adjusting how defaults can be supplied
The problem here was that `DatafeedJob` was updating the last end time searched
based on the `now` even though when there are aggregations, the extactor will
only search up to the floor of `now` against the histogram interval.
This commit fixes the issue by using the end time as calculated by the extractor.
It also adds an integration test that uses aggregations. This test would fail
before this fix. Unfortunately the test is slow as we need to wait for the
datafeed to work in real time.
Closes#39842
* [ML] refactoring lazy query and agg parsing
* Clean up and addressing PR comments
* removing unnecessary try/catch block
* removing bad call to logger
* removing unused import
* fixing bwc test failure due to serialization and config migrator test
* fixing style issues
* Adjusting DafafeedUpdate class serialization
* Adding todo for refactor in v8
* Making query non-optional so it does not write a boolean byte
This change does the following:
1. Makes the per-node setting xpack.ml.max_open_jobs
into a cluster-wide dynamic setting
2. Changes the job node selection to continue to use the
per-node attributes storing the maximum number of open
jobs if any node in the cluster is older than 7.1, and
use the dynamic cluster-wide setting if all nodes are on
7.1 or later
3. Changes the docs to reflect this
4. Changes the thread pools for native process communication
from fixed size to scaling, to support the dynamic nature
of xpack.ml.max_open_jobs
5. Renames the autodetect thread pool to the job comms
thread pool to make clear that it will be used for other
types of ML jobs (data frame analytics in particular)
Backport of #39320
Investigating how to make DeleteExpiredDataIT faster, it was
revealed that the security audit trail threads were quite hot.
Disabling that seems to be helping quite a bit with making this
test faster. This commit also unmutes the test to see how it goes
with the audit trail disabled.
Relates #39658Closes#39575
ML has historically used doc as the single mapping type but reindex in 7.x
will change the mapping to _doc. Switching to the typeless APIs handles
case where the mapping type is either doc or _doc. This change removes
deprecated typed usages.
This commit parallelizes some parts of the test
and its remove an unnecessary refresh call.
On my local machine it shaves off about 15 seconds
for a test execution time of ~64s (down from ~80s).
This test is still slow but progress over perfection.
Relates #37339
The ScheduledEvent class has never preserved the time
zone so it makes more sense for it to store the start and
end time using Instant rather than ZonedDateTime.
Closes#38620
The assertBusy() that waits the default 10 seconds for a
datafeed to complete very occasionally times out on slow
machines. This commit increases the timeout to 60 seconds.
It will almost never actually take this long, but it's
better to have a timeout that will prevent time being
wasted looking at spurious test failures.
This test failed once in a very long time with the assertion
that there is no document for the `non_existing_job` in the
state index. I could not see how that is possible and I cannot
reproduce. With this commit the failure message will reveal
some examples of the left behind docs which might shed a light
about what could go wrong.
The ML memory tracker does searches against ML results
and config indices. These searches can be asynchronous,
and if they are running while the node is closing then
they can cause problems for other components.
This change adds a stop() method to the MlMemoryTracker
that waits for in-flight searches to complete. Once
stop() has returned the MlMemoryTracker will not kick
off any new searches.
The MlLifeCycleService now calls MlMemoryTracker.stop()
before stopping stopping the node.
Fixes#37117
These two changes are interlinked.
Before this change unsetting ML upgrade mode would wait for all
datafeeds to be assigned and not waiting for their corresponding
jobs to initialise. However, this could be inappropriate, if
there was a reason other that upgrade mode why one job was unable
to be assigned or slow to start up. Unsetting of upgrade mode
would hang in this case.
This change relaxes the condition for considering upgrade mode to
be unset to simply that an assignment attempt has been made for
each ML persistent task that did not fail because upgrade mode
was enabled. Thus after unsetting upgrade mode there is no
guarantee that every ML persistent task is assigned, just that
each is not unassigned due to upgrade mode.
In order to make setting upgrade mode work immediately after
unsetting upgrade mode it was then also necessary to make it
possible to stop a datafeed that was not assigned. There was
no particularly good reason why this was not allowed in the past.
It is trivial to stop an unassigned datafeed because it just
involves removing the persistent task.
The .ml-annotations index is created asynchronously when
some other ML index exists. This can interfere with the
post-test index deletion, as the .ml-annotations index
can be created after all other indices have been deleted.
This change adds an ML specific post-test cleanup step
that runs before the main cleanup and:
1. Checks if any ML indices exist
2. If so, waits for the .ml-annotations index to exist
3. Deletes the other ML indices found in step 1.
4. Calls the super class cleanup
This means that by the time the main post-test index
cleanup code runs:
1. The only ML index it has to delete will be the
.ml-annotations index
2. No other ML indices will exist that could trigger
recreation of the .ml-annotations index
Fixes#38952
* Fix#38623 remove xpack namespace REST API
Except for xpack.usage and xpack.info API's, this moves the last remaining API's out of the xpack namespace
* rename xpack api's inside inside the files as well
* updated yaml tests references to xpack namespaces api's
* update callsApi calls in the IT subclasses
* make sure docs testing does not use xpack namespaced api's
* fix leftover xpack namespaced method names in docs/build.gradle
* found another leftover reference
(cherry picked from commit ccb5d934363c37506b76119ac050a254fa80b5e7)
* ML: update set_upgrade_mode, add logging
* Attempt to fix datafeed isolation
Also renamed a few methods/variables for clarity and added
some comments
Elasticsearch has long [supported](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-versioning) compare and set (a.k.a optimistic concurrency control) operations using internal document versioning. Sadly that approach is flawed and can sometime do the wrong thing. Here's the relevant excerpt from the resiliency status page:
> When a primary has been partitioned away from the cluster there is a short period of time until it detects this. During that time it will continue indexing writes locally, thereby updating document versions. When it tries to replicate the operation, however, it will discover that it is partitioned away. It won’t acknowledge the write and will wait until the partition is resolved to negotiate with the master on how to proceed. The master will decide to either fail any replicas which failed to index the operations on the primary or tell the primary that it has to step down because a new primary has been chosen in the meantime. Since the old primary has already written documents, clients may already have read from the old primary before it shuts itself down. The version numbers of these reads may not be unique if the new primary has already accepted writes for the same document
We recently [introduced](https://www.elastic.co/guide/en/elasticsearch/reference/6.x/optimistic-concurrency-control.html) a new sequence number based approach that doesn't suffer from this dirty reads problem.
This commit removes support for internal versioning as a concurrency control mechanism in favor of the sequence number approach.
Relates to #1078
With this change we no longer support pluggable discovery implementations. No
known implementations of `DiscoveryPlugin` actually override this method, so in
practice this should have no effect on the wider world. However, we were using
this rather extensively in tests to provide the `test-zen` discovery type. We
no longer need a separate discovery type for tests as we no longer need to
customise its behaviour.
Relates #38410
If a job cannot be assigned to a node because an index it
requires is unavailable and there are lazy ML nodes then
index unavailable should be reported as the assignment
explanation rather than waiting for a lazy ML node.
X-Pack security supports built-in authentication service
`token-service` that allows access tokens to be used to
access Elasticsearch without using Basic authentication.
The tokens are generated by `token-service` based on
OAuth2 spec. The access token is a short-lived token
(defaults to 20m) and refresh token with a lifetime of 24 hours,
making them unsuitable for long-lived or recurring tasks where
the system might go offline thereby failing refresh of tokens.
This commit introduces a built-in authentication service
`api-key-service` that adds support for long-lived tokens aka API
keys to access Elasticsearch. The `api-key-service` is consulted
after `token-service` in the authentication chain. By default,
if TLS is enabled then `api-key-service` is also enabled.
The service can be disabled using the configuration setting.
The API keys:-
- by default do not have an expiration but expiration can be
configured where the API keys need to be expired after a
certain amount of time.
- when generated will keep authentication information of the user that
generated them.
- can be defined with a role describing the privileges for accessing
Elasticsearch and will be limited by the role of the user that
generated them
- can be invalidated via invalidation API
- information can be retrieved via a get API
- that have been expired or invalidated will be retained for 1 week
before being deleted. The expired API keys remover task handles this.
Following are the API key management APIs:-
1. Create API Key - `PUT/POST /_security/api_key`
2. Get API key(s) - `GET /_security/api_key`
3. Invalidate API Key(s) `DELETE /_security/api_key`
The API keys can be used to access Elasticsearch using `Authorization`
header, where the auth scheme is `ApiKey` and the credentials, is the
base64 encoding of API key Id and API key separated by a colon.
Example:-
```
curl -H "Authorization: ApiKey YXBpLWtleS1pZDphcGkta2V5" http://localhost:9200/_cluster/health
```
Closes#34383
The explanation so far can be invaluable for troubleshooting
as incorrect decisions made early on in the structure analysis
can result in seemingly crazy decisions or timeouts later on.
Relates elastic/kibana#29821
Today the following settings in the `discovery.zen` namespace are still used:
- `discovery.zen.no_master_block`
- `discovery.zen.hosts_provider`
- `discovery.zen.ping.unicast.concurrent_connects`
- `discovery.zen.ping.unicast.hosts.resolve_timeout`
- `discovery.zen.ping.unicast.hosts`
This commit deprecates all other settings in this namespace so that they can be
removed in the next major version.
In 7.x Java timestamp formats are the default timestamp format and
there is no need to prefix them with "8". (The "8" prefix was used
in 6.7 to distinguish Java timestamp formats from Joda timestamp
formats.)
This change removes the "8" prefixes from timestamp formats in the
output of the ML file structure finder.
Scheduler.schedule(...) would previously assume that caller handles
exception by calling get() on the returned ScheduledFuture.
schedule() now returns a ScheduledCancellable that no longer gives
access to the exception. Instead, any exception thrown out of a
scheduled Runnable is logged as a warning.
This is a continuation of #28667, #36137 and also fixes#37708.
Doc-value fields now return a value that is based on the mappings rather than
the script implementation by default.
This deprecates the special `use_field_mapping` docvalue format which was added
in #29639 only to ease the transition to 7.x and it is not necessary anymore in
7.0.
Runnables can be submitted to
AutodetectProcessManager.AutodetectWorkerExecutorService
without error after it has been shutdown which can lead
to requests timing out as their handlers are never called
by the terminated executor.
This change throws an EsRejectedExecutionException if a
runnable is submitted after after the shutdown and calls
AbstractRunnable.onRejection on any tasks not run.
Closes#37108
This commit moves the auditing of job deletion related errors
to the final listener in the job delete action. This ensures
any error that occurs during job deletion is audited.
* ML: Add MlMetadata.upgrade_mode and API
* Adding tests
* Adding wait conditionals for the upgrade_mode call to return
* Adding tests
* adjusting format and tests
* Adjusting wait conditions for api return and msgs
* adjusting doc tests
* adding upgrade mode tests to black list
We have read and write aliases for the ML results indices. However,
the job still had methods that purported to reliably return the name
of the concrete results index being used by the job. After reindexing
prior to upgrade to 7.x this will be wrong, so the method has been
renamed and the comments made more explicit to say the returned index
name may not be the actual concrete index name for the lifetime of the
job. Additionally, the selection of indices when deleting the job
has been changed so that it works regardless of concrete index names.
All these changes are nice-to-have for 6.7 and 7.0, but will become
critical if we add rolling results indices in the 7.x release stream
as 6.7 and 7.0 nodes may have to operate in a mixed version cluster
that includes a version that can roll results indices.
The ML file structure finder has always reported both Joda
and Java time format strings. This change makes the Java time
format strings the ones that are incorporated into mappings
and ingest pipeline definitions.
The BWC syntax of prepending "8" to these formats is used.
This will need to be removed once Java time format strings
become the default in Elasticsearch.
This commit also removes direct imports of Joda classes in the
structure finder unit tests. Instead the core Joda BWC class
is used.