* Reenable Integ Tests in native-multi-node-tests
* The tests broken here were likely fixed by #45463 => let's reenable them and see if things run fine again
* Relates #45405, #45455
This commit adds a first draft of a regression analysis
to data frame analytics. There is high probability that
the exact syntax might change.
This commit adds the new analysis type and its parameters as
well as appropriate validation. It also modifies the extractor
and the fields detector to be able to handle categorical fields
as regression analysis supports them.
In the FIPS JVM the JVM default locale seems to leak into places
where it should be overridden. This change skips assertions
in TimestampFormatFinderTests.testGuessIsDayFirstFromLocale
that may be impacted.
Fixes#45140
When doing a fieldwise Levenshtein distance comparison
between CSV rows, this change ignores all fields that
have long values, not just the longest field.
This approach works better for CSV formats that have
multiple freeform text fields rather than just a single
"message" field.
Fixes#45047
If one tries to start a DF analytics job that has already run,
the result will be that the task will fail after reindexing the
dest index from the source index. The results of the prior run
will be gone and the task state is not properly set to failed
with the failure reason.
This commit improves the behavior in this scenario. First, we
set the task state to `failed` in a set of failures that were
missed. Second, a validation is added that if the destination
index exists, it must be empty.
In case closing the process throws an exception we should be catching
it no matter its type. The process may have terminated because of a
fatal error in which case closing the process will throw a server
error, not an `IOException`. If this happens we fail to mark the
persistent task as failed and the task gets in limbo.
As data frame rows with missing values for analyzed fields are skipped,
we can be more efficient by including a query that only picks documents
that have values for all analyzed fields. Besides improving the number
of documents we go through, we also provide a more accurate measurement
of how many rows we need which reduces the memory requirements.
This also adds an integration test that runs outlier detection on data
with missing fields.
TaskListener accepts today Throwable in its onFailure method. Though
looking at where it is called (TransportAction), it can never be
notified of a Throwable.
This commit changes the signature of TaskListener#onFailure so that it
accepts an `Exception` rather than a `Throwable` as second argument.
* Switch from using docvalue_fields to extracting values from _source
where applicable. Doing this means parsing the _source and handling the
numbers parsing just like Elasticsearch is doing it when it's indexing
a document.
* This also introduces a minor limitation: aliases type of fields that
are NOT part of a tree of sub-fields will not be able to be retrieved
anymore. field_caps API doesn't shed any light into a field being an
alias or not and at _source parsing time there is no way to know if a
root field is an alias or not. Fields of the type "a.b.c.alias" can be
extracted from docvalue_fields, only if the field they point to can be
extracted from docvalue_fields. Also, not all fields in a hierarchy of
fields can be evaluated to being an alias.
(cherry picked from commit 8bf8a055e38f00df5f49c8d97f632f69d6e00c2c)
* Mute failing test
tracked in #44552
* mute EvilSecurityTests
tracking in #44558
* Fix line endings in ESJsonLayoutTests
* Mute failing ForecastIT test on windows
Tracking in #44609
* mute BasicRenormalizationIT.testDefaultRenormalization
tracked in #44613
* fix mute testDefaultRenormalization
* Increase busyWait timeout windows is slow
* Mute failure unconfigured node name
* mute x-pack internal cluster test windows
tracking #44610
* Mute JvmErgonomicsTests on windows
Tracking #44669
* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot
Tracking #44671
* Mute NodeTests on Windows
Tracking #44256
Removes the warning suppression -Xlint:-deprecation,-rawtypes,-serial,-try,-unchecked.
Many warnings were unchecked warnings in the test code often because of the use of mocks.
These are suppressed with @SuppressWarning
many classes still use the Streamable constructors of HandledTransportAction,
this commit moves more of those classes to the new Writeable constructors.
relates #34389.
This commit adds constructors to AcknolwedgedRequest subclasses to
implement Writeable.Reader, and ensures all future subclasses implement
the same.
relates #34389
* Migrate ML Actions to use writeable ActionType (#44302)
This commit converts all the StreamableResponseActionType
actions in the ML core module to be ActionType and leverage
the Writeable infrastructure.
* Add Snapshot Lifecycle Management (#43934)
* Add SnapshotLifecycleService and related CRUD APIs
This commit adds `SnapshotLifecycleService` as a new service under the ilm
plugin. This service handles snapshot lifecycle policies by scheduling based on
the policies defined schedule.
This also includes the get, put, and delete APIs for these policies
Relates to #38461
* Make scheduledJobIds return an immutable set
* Use Object.equals for SnapshotLifecyclePolicy
* Remove unneeded TODO
* Implement ToXContentFragment on SnapshotLifecyclePolicyItem
* Copy contents of the scheduledJobIds
* Handle snapshot lifecycle policy updates and deletions (#40062)
(Note this is a PR against the `snapshot-lifecycle-management` feature branch)
This adds logic to `SnapshotLifecycleService` to handle updates and deletes for
snapshot policies. Policies with incremented versions have the old policy
cancelled and the new one scheduled. Deleted policies have their schedules
cancelled when they are no longer present in the cluster state metadata.
Relates to #38461
* Take a snapshot for the policy when the SLM policy is triggered (#40383)
(This is a PR for the `snapshot-lifecycle-management` branch)
This commit fills in `SnapshotLifecycleTask` to actually perform the
snapshotting when the policy is triggered. Currently there is no handling of the
results (other than logging) as that will be added in subsequent work.
This also adds unit tests and an integration test that schedules a policy and
ensures that a snapshot is correctly taken.
Relates to #38461
* Record most recent snapshot policy success/failure (#40619)
Keeping a record of the results of the successes and failures will aid
troubleshooting of policies and make users more confident that their
snapshots are being taken as expected.
This is the first step toward writing history in a more permanent
fashion.
* Validate snapshot lifecycle policies (#40654)
(This is a PR against the `snapshot-lifecycle-management` branch)
With the commit, we now validate the content of snapshot lifecycle policies when
the policy is being created or updated. This checks for the validity of the id,
name, schedule, and repository. Additionally, cluster state is checked to ensure
that the repository exists prior to the lifecycle being added to the cluster
state.
Part of #38461
* Hook SLM into ILM's start and stop APIs (#40871)
(This pull request is for the `snapshot-lifecycle-management` branch)
This change allows the existing `/_ilm/stop` and `/_ilm/start` APIs to also
manage snapshot lifecycle scheduling. When ILM is stopped all scheduled jobs are
cancelled.
Relates to #38461
* Add tests for SnapshotLifecyclePolicyItem (#40912)
Adds serialization tests for SnapshotLifecyclePolicyItem.
* Fix improper import in build.gradle after master merge
* Add human readable version of modified date for snapshot lifecycle policy (#41035)
* Add human readable version of modified date for snapshot lifecycle policy
This small change changes it from:
```
...
"modified_date": 1554843903242,
...
```
To
```
...
"modified_date" : "2019-04-09T21:05:03.242Z",
"modified_date_millis" : 1554843903242,
...
```
Including the `"modified_date"` field when the `?human` field is used.
Relates to #38461
* Fix test
* Add API to execute SLM policy on demand (#41038)
This commit adds the ability to perform a snapshot on demand for a policy. This
can be useful to take a snapshot immediately prior to performing some sort of
maintenance.
```json
PUT /_ilm/snapshot/<policy>/_execute
```
And it returns the response with the generated snapshot name:
```json
{
"snapshot_name" : "production-snap-2019.04.09-rfyv3j9qreixkdbnfuw0ug"
}
```
Note that this does not allow waiting for the snapshot, and the snapshot could
still fail. It *does* record this information into the cluster state similar to
a regularly trigged SLM job.
Relates to #38461
* Add next_execution to SLM policy metadata (#41221)
* Add next_execution to SLM policy metadata
This adds the next time a snapshot lifecycle policy will be executed when
retriving a policy's metadata, for example:
```json
GET /_ilm/snapshot?human
{
"production" : {
"version" : 1,
"modified_date" : "2019-04-15T21:16:21.865Z",
"modified_date_millis" : 1555362981865,
"policy" : {
"name" : "<production-snap-{now/d}>",
"schedule" : "*/30 * * * * ?",
"repository" : "repo",
"config" : {
"indices" : [
"foo-*",
"important"
],
"ignore_unavailable" : true,
"include_global_state" : false
}
},
"next_execution" : "2019-04-15T21:16:30.000Z",
"next_execution_millis" : 1555362990000
},
"other" : {
"version" : 1,
"modified_date" : "2019-04-15T21:12:19.959Z",
"modified_date_millis" : 1555362739959,
"policy" : {
"name" : "<other-snap-{now/d}>",
"schedule" : "0 30 2 * * ?",
"repository" : "repo",
"config" : {
"indices" : [
"other"
],
"ignore_unavailable" : false,
"include_global_state" : true
}
},
"next_execution" : "2019-04-16T02:30:00.000Z",
"next_execution_millis" : 1555381800000
}
}
```
Relates to #38461
* Fix and enhance tests
* Figured out how to Cron
* Change SLM endpoint from /_ilm/* to /_slm/* (#41320)
This commit changes the endpoint for snapshot lifecycle management from:
```
GET /_ilm/snapshot/<policy>
```
to:
```
GET /_slm/policy/<policy>
```
It mimics the ILM path only using `slm` instead of `ilm`.
Relates to #38461
* Add initial documentation for SLM (#41510)
* Add initial documentation for SLM
This adds the initial documentation for snapshot lifecycle management.
It also includes the REST spec API json files since they're sort of
documentation.
Relates to #38461
* Add `manage_slm` and `read_slm` roles (#41607)
* Add `manage_slm` and `read_slm` roles
This adds two more built in roles -
`manage_slm` which has permission to perform any of the SLM actions, as well as
stopping, starting, and retrieving the operation status of ILM.
`read_slm` which has permission to retrieve snapshot lifecycle policies as well
as retrieving the operation status of ILM.
Relates to #38461
* Add execute to the test
* Fix ilm -> slm typo in test
* Record SLM history into an index (#41707)
It is useful to have a record of the actions that Snapshot Lifecycle
Management takes, especially for the purposes of alerting when a
snapshot fails or has not been taken successfully for a certain amount of
time.
This adds the infrastructure to record SLM actions into an index that
can be queried at leisure, along with a lifecycle policy so that this
history does not grow without bound.
Additionally,
SLM automatically setting up an index + lifecycle policy leads to
`index_lifecycle` custom metadata in the cluster state, which some of
the ML tests don't know how to deal with due to setting up custom
`NamedXContentRegistry`s. Watcher would cause the same problem, but it
is already disabled (for the same reason).
* High Level Rest Client support for SLM (#41767)
* High Level Rest Client support for SLM
This commit add HLRC support for SLM.
Relates to #38461
* Fill out documentation tests with tags
* Add more callouts and asciidoc for HLRC
* Update javadoc links to real locations
* Add security test testing SLM cluster privileges (#42678)
* Add security test testing SLM cluster privileges
This adds a test to `PermissionsIT` that uses the `manage_slm` and `read_slm`
cluster privileges.
Relates to #38461
* Don't redefine vars
* Add Getting Started Guide for SLM (#42878)
This commit adds a basic Getting Started Guide for SLM.
* Include SLM policy name in Snapshot metadata (#43132)
Keep track of which SLM policy in the metadata field of the Snapshots
taken by SLM. This allows users to more easily understand where the
snapshot came from, and will enable future SLM features such as
retention policies.
* Fix compilation after master merge
* [TEST] Move exception wrapping for devious exception throwing
Fixes an issue where an exception was created from one line and thrown in another.
* Fix SLM for the change to AcknowledgedResponse
* Add Snapshot Lifecycle Management Package Docs (#43535)
* Fix compilation for transport actions now that task is required
* Add a note mentioning the privileges needed for SLM (#43708)
* Add a note mentioning the privileges needed for SLM
This adds a note to the top of the "getting started with SLM"
documentation mentioning that there are two built-in privileges to
assist with creating roles for SLM users and administrators.
Relates to #38461
* Mention that you can create snapshots for indices you can't read
* Fix REST tests for new number of cluster privileges
* Mute testThatNonExistingTemplatesAreAddedImmediately (#43951)
* Fix SnapshotHistoryStoreTests after merge
* Remove overridden newResponse functions that have been removed
* Fix compilation for backport
* Fix get snapshot output parsing in test
* [DOCS] Add redirects for removed autogen anchors (#44380)
* Switch <tt>...</tt> in javadocs for {@code ...}
Test clusters currently has its own set of logic for dealing with
finding different versions of Elasticsearch, downloading them, and
extracting them. This commit converts testclusters to use the
DistributionDownloadPlugin.
This commit creates new base classes for master node actions whose
response types still implement Streamable. This simplifies both finding
remaining classes to convert, as well as creating new master node
actions that use Writeable for their responses.
relates #34389
* HLRC: Fix '+' Not Correctly Encoded in GET Req.
* Encode `+` correctly as `%2B` in URL paths
* Keep encoding `+` as space in URL parameters
* Closes#33077
This commit moves the Supplier variant of HandledTransportAction to have
a different ordering than the Writeable.Reader variant. The Supplier
version is used for the legacy Streamable, and currently having the
location of the Writeable.Reader vs Supplier in the same place forces
using casts of Writeable.Reader to select the correct super constructor.
This change in ordering allows easier migration to Writeable.Reader.
relates #34389
Now that ML job configs are stored in an index rather than
cluster state, availability of the .ml-config index is very
important to the operation of ML. When a cluster starts up
the ML persistent tasks will be considered for node
assignment very early on. It is best in this case if
assignment is deferred until after the .ml-config index is
available.
The introduction of data frame analytics jobs has made this
problem worse, because anomaly detection jobs already waited
for the primary shards of the .ml-state, .ml-anomalies-shared
and .ml-meta indices to be available before doing node
assignment, and by coincidence this would probably lead to
the primary shards of .ml-config also being searchable. But
data frame analytics jobs had no other index checks prior to
this change.
This fixes problem 2 of #44156
By default, we don't check ranges while indexing geo_shapes. As a
result, it is possible to index geoshapes that contain contain
coordinates outside of -90 +90 and -180 +180 ranges. Such geoshapes
will currently break SQL and ML retrieval mechanism. This commit removes
these restriction from the validator is used in SQL and ML retrieval.
When the ML memory tracker is refreshed and a refresh is
already in progress the idea is that the second and
subsequent refresh requests receive the same response as
the currently in progress refresh.
There was a bug that if a refresh failed then the ML
memory tracker's view of whether a refresh was in progress
was not reset, leading to every subsequent request being
registered to receive a response that would never come.
This change makes the ML memory tracker pass on failures
as well as successes to all interested parties and reset
the list of interested parties so that further refresh
attempts are possible after either a success or failure.
This fixes problem 1 of #44156
Custom timestamp overrides provided to the find_file_structure
endpoint produced an invalid Grok pattern if the fractional
seconds separator was a dot rather than a comma or colon.
This commit fixes that problem and adds tests for this sort
of timestamp override.
Fixes#44110
The count should match the number of all df-analytics that
matched the id in the request. However, we set the count
to the number of df-analytics returned which was bound to the
`size` parameter. This commit fixes this by setting the count
to the count of the `get` response.
A bug was introduced in 6.6.0 when we added support for
rollup indices. Rollup caps does NOT support looking at
remote indices, consequently, since we always look up rollup
caps, the datafeed fails with an error if its config
includes a concrete remote index. (When all remote indices
in a datafeed config are wildcards the problem did not
occur.)
The rollups feature does not support remote indices, so if
there is any remote index in a datafeed config (wildcarded
or not), we can skip the rollup cap checks. This PR
implements that change.
This brings TokenizerFactory into line with CharFilterFactory and TokenFilterFactory,
and removes the need to pass around tokenizer names when building custom analyzers.
As this means that TokenizerFactory is no longer a functional interface, the commit also
adds a factory method to TokenizerFactory to make construction simpler.
This introduces a `failed` state to which the data frame analytics
persistent task is set to when something unexpected fails. It could
be the process crashing, the results processor hitting some error,
etc. The failure message is then captured and set on the task state.
From there, it becomes available via the _stats API as `failure_reason`.
The df-analytics stop API now has a `force` boolean parameter. This allows
the user to call it for a failed task in order to reset it to `stopped` after
we have ensured the failure has been communicated to the user.
This commit also adds the analytics version in the persistent task
params as this allows us to prevent tasks to run on unsuitable nodes in
the future.
Renames `_id_copy` to `ml__id_copy` as field names starting with
underscore are deprecated. The new field name `ml__id_copy` was
chosen as an obscure enough field that users won't have in their data.
Otherwise, this field is only intented to be used by df-analytics.
If a job is opened and then closed and does nothing in
between then it should not persist any results or state
documents. This change adapts the no-op job test to
assert no results in addition to no state, and to log
any documents that cause this assertion to fail.
Relates elastic/ml-cpp#512
Relates #43680
The Action base class currently works for both Streamable and Writeable
response types. This commit intorduces StreamableResponseAction, for
which only the legacy Action implementions which provide newResponse()
will extend. This eliminates the need for overriding newResponse() with
an UnsupportedOperationException.
relates #34389
Since #41817 was merged the ml-cpp zip file for any
given version has been cached indefinitely by Gradle.
This is problematic, particularly in the case of the
master branch where the version 8.0.0-SNAPSHOT will
be in use for more than a year.
This change tells Gradle that the ml-cpp zip file is
a "changing" dependency, and to check whether it has
changed every two hours. Two hours is a compromise
between checking on every build and annoying developers
with slow internet connections and checking rarely
causing bug fixes in the ml-cpp code to take a long
time to propagate through to elasticsearch PRs that
rely on them.
This commit adds support for multiple source indices.
In order to deal with multiple indices having different mappings,
it attempts a best-effort approach to merge the mappings assuming
there are no conflicts. In case conflicts exists an error will be
returned.
To allow users creating custom mappings for special use cases,
the destination index is now allowed to exist before the analytics
job runs. In addition, settings are no longer copied except for
the `index.number_of_shards` and `index.number_of_replicas`.
* Deduplicate org.elasticsearch.xpack.core.dataframe.utils.TimeUtils and org.elasticsearch.xpack.core.ml.utils.time.TimeUtils into a common class: org.elasticsearch.xpack.core.common.time.TimeUtils.
* Add unit tests for parseTimeField and parseTimeFieldToInstant methods
This change introduces a new setting,
xpack.ml.process_connect_timeout, to enable
the timeout for one of the external ML processes
to connect to the ES JVM to be increased.
The timeout may need to be increased if many
processes are being started simultaneously on
the same machine. This is unlikely in clusters
with many ML nodes, as we balance the processes
across the ML nodes, but can happen in clusters
with a single ML node and a high value for
xpack.ml.node_concurrent_job_allocations.
This merges the initial work that adds a framework for performing
machine learning analytics on data frames. The feature is currently experimental
and requires a platinum license. Note that the original commits can be
found in the `feature-ml-data-frame-analytics` branch.
A new set of APIs is added which allows the creation of data frame analytics
jobs. Configuration allows specifying different types of analysis to be performed
on a data frame. At first there is support for outlier detection.
The APIs are:
- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- POST _ml/data_frame/analysis/{id}/_stop
- DELETE _ml/data_frame/analysis/{id}
When a data frame analytics job is started a persistent task is created and started.
The main steps of the task are:
1. reindex the source index into the dest index
2. analyze the data through the data_frame_analyzer c++ process
3. merge the results of the process back into the destination index
In addition, an evaluation API is added which packages commonly used metrics
that provide evaluation of various analysis:
- POST _ml/data_frame/_evaluate
The error message if the native controller failed to run
(for example due to running Elasticsearch on an unsupported
platform) was not easy to understand. This change removes
pointless detail from the message and adds some hints about
likely causes.
Fixes#42341
This commit replaces usages of Streamable with Writeable for the
AcknowledgedResponse and its subclasses, plus associated actions.
Note that where possible response fields were made final and default
constructors were removed.
This is a large PR, but the change is mostly mechanical.
Relates to #34389
Backport of #43414
After the network disruption a partition is created,
one side of which can form a cluster the other can't.
Ensure requests are sent to a node on the correct side
of the cluster
This commit removes some very old test logging annotations that appeared
to be added to investigate test failures that are long since closed. If
these are needed, they can be added back on a case-by-case basis with a
comment associating them to a test failure.
* Return 0 for negative "free" and "total" memory reported by the OS
We've had a situation where the MX bean reported negative values for the
free memory of the OS, in those rare cases we want to return a value of
0 rather than blowing up later down the pipeline.
In the event that there is a serialization or creation error with regard
to memory use, this adds asserts so the failure will occur as soon as
possible and give us a better location for investigation.
Resolves#42157
* Fix test passing in invalid memory value
* Fix another test passing in invalid memory value
* Also change mem check in MachineLearning.machineMemoryFromStats
* Add background documentation for why we prevent negative return values
* Clarify comment a bit more
This trace logging looks like it was copy/pasted from another test,
where the logging in that test was only added to investigate a test
failure. This commit removes the trace logging.
The ML failover tests sometimes need to wait for jobs to be
assigned to new nodes following a node failure. They wait
10 seconds for this to happen. However, if the node that
failed was the master node and a new master was elected then
this 10 seconds might not be long enough as a refresh of the
memory stats will delay job assignment. Once the memory
refresh completes the persistent task will be assigned when
the next cluster state update occurs or after the periodic
recheck interval, which defaults to 30 seconds. Rather than
increase the length of the wait for assignment to 31 seconds,
this change decreases the periodic recheck interval to 1
second.
Fixes#43289
We were stopping a node in the cluster at a time when
the replica shards of the .ml-state index might not
have been created. This change moves the wait for
green status to a point where the .ml-state index
exists.
Fixes#40546Fixes#41742
Forward port of #43111
A static code analysis revealed that we are not closing
the input stream in the post_data endpoint. This
actually makes no difference in practice, as the
particular InputStream implementation in this case is
org.elasticsearch.common.bytes.BytesReferenceStreamInput
and its close() method is a no-op. However, it is good
practice to close the stream anyway.
The machine learning feature of xpack has native binaries with a
different commit id than the rest of code. It is currently exposed in
the xpack info api. This commit adds that commit information to the ML
info api, so that it may be removed from the info api.
Previously 10 digit numbers were considered candidates to be
timestamps recorded as seconds since the epoch and 13 digit
numbers as timestamps recorded as milliseconds since the epoch.
However, this meant that we could detect these formats for
numbers that would represent times far in the future. As an
example ISBN numbers starting with 9 were detected as milliseconds
since the epoch since they had 13 digits.
This change tweaks the logic for detecting such timestamps to
require that they begin with 1 or 2. This means that numbers
that would represent times beyond about 2065 are no longer
detected as epoch timestamps. (We can add 3 to the definition
as we get closer to the cutoff date.)
The description field of xpack featuresets is optionally part of the
xpack info api, when using the verbose flag. However, this information
is unnecessary, as it is better left for documentation (and the existing
descriptions describe anything meaningful). This commit removes the
description field from feature sets.
The tests for the ML TimeoutChecker rely on threads
not being interrupted after the TimeoutChecker is
closed. This change ensures this by making the
close() and setTimeoutExceeded() methods synchronized
so that the code inside them cannot execute
simultaneously.
Fixes#43097
* [ML] Adding support for geo_shape, geo_centroid, geo_point in datafeeds
* only supporting doc_values for geo_point fields
* moving validation into GeoPointField ctor
Get resources action sorts on the resource id. When there are no resources at
all, then it is possible the index does not contain a mapping for the resource
id field. In that case, the search api fails by default.
This commit adjusts the search request to ignore unmapped fields.
Closeselastic/kibana#37870
Both TransportAnalyzeAction and CategorizationAnalyzer have logic to build
custom analyzers for index-independent analysis. A lot of this code is duplicated,
and it requires the AnalysisRegistry to expose a number of internal provider
classes, as well as making some assumptions about when analysis components are
constructed.
This commit moves the build logic directly into AnalysisRegistry, reducing the
registry's API surface considerably.
Previously, a reindex request had two different size specifications in the body:
* Outer level, determining the maximum documents to process
* Inside the source element, determining the scroll/batch size.
The outer level size has now been renamed to max_docs to
avoid confusion and clarify its semantics, with backwards compatibility and
deprecation warnings for using size.
Similarly, the size parameter has been renamed to max_docs for
update/delete-by-query to keep the 3 interfaces consistent.
Finally, all 3 endpoints now support max_docs in both body and URL.
Relates #24344
A static code analysis revealed that we are not closing
the input stream in the find_file_structure endpoint.
This actually makes no difference in practice, as the
particular InputStream implementation in this case is
org.elasticsearch.common.bytes.BytesReferenceStreamInput
and its close() method is a no-op. However, it is good
practice to close the stream anyway.
This change adds the earliest and latest timestamps into
the field stats for fields of type "date" in the output of
the ML find_file_structure endpoint. This will enable the
cards for date fields in the file data visualizer in the UI
to be made to look more similar to the cards for date
fields in the index data visualizer in the UI.
Dots in the column names cause an error in the ingest
pipeline, as dots are special characters in ingest pipeline.
This PR changes dots into underscores in CSV field names
suggested by the ML find_file_structure endpoint _unless_
the field names are specifically overridden. The reason for
allowing them in overrides is that fields that are not
mentioned in the ingest pipeline can contain dots. But it's
more consistent that the default behaviour is to replace
them all.
Fixeselastic/kibana#26800
When analysing a semi-structured text file the
find_file_structure endpoint merges lines to form
multi-line messages using the assumption that the
first line in each message contains the timestamp.
However, if the timestamp is misdetected then this
can lead to excessive numbers of lines being merged
to form massive messages.
This commit adds a line_merge_size_limit setting
(default 10000 characters) that halts the analysis
if a message bigger than this is created. This
prevents significant CPU time being spent subsequently
trying to determine the internal structure of the
huge bogus messages.
This change helps to prevent the situation where a binary
file uploaded to the find_file_structure endpoint is
detected as being text in the UTF-16 character set, and
then causes a large amount of CPU to be spent analysing
the bogus text structure.
The approach is to check the distribution of zero bytes
between odd and even file positions, on the grounds that
UTF-16BE or UTF16-LE would have a very skewed distribution.
This change fixes a race condition that would result in an
in-memory data structure becoming out-of-sync with persistent
tasks in cluster state.
If repeated often enough this could result in it being
impossible to open any ML jobs on the affected node, as the
master node would think the node had capacity to open another
job but the chosen node would error during the open sequence
due to its in-memory data structure being full.
The race could be triggered by opening a job and then closing
it a tiny fraction of a second later. It is unlikely a user
of the UI could open and close the job that fast, but a script
or program calling the REST API could.
The nasty thing is, from the externally observable states and
stats everything would appear to be fine - the fast open then
close sequence would appear to leave the job in the closed
state. It's only later that the leftovers in the in-memory
data structure might build up and cause a problem.
This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.
Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM and MM/dd in the case where both numbers were 12 or less.
In order to do this sensibly it is best to look across all the
available timestamps and see if one of the numbers is greater
than 12 in any of them. This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.
Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats. There was no way out if a file to be
analysed had a timestamp that was sane yet not in the supported
set. This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally S letter groups (fractional seconds) are supported
providing they occur after ss and separated from the ss by a dot,
comma or colon. Spacing and punctuation is also permitted with
the exception of the question mark, newline and carriage return
characters, together with literal text enclosed in single quotes.
The full list of changes/improvements in this refactor is:
- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda
format is no longer accepted
- Joda timestamp formats in outputs are now derived from the
determined or overridden Java timestamp formats, not stored
separately
- Functionality for determining the "best" timestamp format in
a set of lines has been moved from TextLogFileStructureFinder
to TimestampFormatFinder, taking advantage of the fact that
TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok
patterns when looking for timestamp formats has been changed
from using simple regular expressions to the much faster
approach of using the Shift-And method of sub-string search,
but using an "alphabet" consisting of just 1 (representing any
digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in
Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
whose definition is included within the date processor in the
ingest pipeline
- Grok patterns that correspond to multiple Java date/time
patterns are now handled better - the Grok pattern is accepted
as matching broadly, and the required set of Java date/time
patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns,
when looking for the "best" timestamp in a set of lines
timestamps are considered different if they are preceded by
a different sequence of punctuation characters (to prevent
timestamps far into some lines being considered similar to
timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include
%{DATE} and %{DATESTAMP}, which have indeterminate day/month
ordering
- The order of day/month in formats with indeterminate day/month
order is determined by considering all observed samples (plus
the server locale if the observed samples still do not suggest
an ordering)
Relates #38086Closes#35137Closes#35132