This change adds the earliest and latest timestamps into
the field stats for fields of type "date" in the output of
the ML find_file_structure endpoint. This will enable the
cards for date fields in the file data visualizer in the UI
to be made to look more similar to the cards for date
fields in the index data visualizer in the UI.
When analysing a semi-structured text file the
find_file_structure endpoint merges lines to form
multi-line messages using the assumption that the
first line in each message contains the timestamp.
However, if the timestamp is misdetected then this
can lead to excessive numbers of lines being merged
to form massive messages.
This commit adds a line_merge_size_limit setting
(default 10000 characters) that halts the analysis
if a message bigger than this is created. This
prevents significant CPU time being spent subsequently
trying to determine the internal structure of the
huge bogus messages.
This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.
Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM and MM/dd in the case where both numbers were 12 or less.
In order to do this sensibly it is best to look across all the
available timestamps and see if one of the numbers is greater
than 12 in any of them. This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.
Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats. There was no way out if a file to be
analysed had a timestamp that was sane yet not in the supported
set. This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally S letter groups (fractional seconds) are supported
providing they occur after ss and separated from the ss by a dot,
comma or colon. Spacing and punctuation is also permitted with
the exception of the question mark, newline and carriage return
characters, together with literal text enclosed in single quotes.
The full list of changes/improvements in this refactor is:
- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda
format is no longer accepted
- Joda timestamp formats in outputs are now derived from the
determined or overridden Java timestamp formats, not stored
separately
- Functionality for determining the "best" timestamp format in
a set of lines has been moved from TextLogFileStructureFinder
to TimestampFormatFinder, taking advantage of the fact that
TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok
patterns when looking for timestamp formats has been changed
from using simple regular expressions to the much faster
approach of using the Shift-And method of sub-string search,
but using an "alphabet" consisting of just 1 (representing any
digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in
Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
whose definition is included within the date processor in the
ingest pipeline
- Grok patterns that correspond to multiple Java date/time
patterns are now handled better - the Grok pattern is accepted
as matching broadly, and the required set of Java date/time
patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns,
when looking for the "best" timestamp in a set of lines
timestamps are considered different if they are preceded by
a different sequence of punctuation characters (to prevent
timestamps far into some lines being considered similar to
timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include
%{DATE} and %{DATESTAMP}, which have indeterminate day/month
ordering
- The order of day/month in formats with indeterminate day/month
order is determined by considering all observed samples (plus
the server locale if the observed samples still do not suggest
an ordering)
Relates #38086Closes#35137Closes#35132
In 7.x Java timestamp formats are the default timestamp format and
there is no need to prefix them with "8". (The "8" prefix was used
in 6.7 to distinguish Java timestamp formats from Joda timestamp
formats.)
This change removes the "8" prefixes from timestamp formats in the
output of the ML file structure finder.
The ML file structure finder has always reported both Joda
and Java time format strings. This change makes the Java time
format strings the ones that are incorporated into mappings
and ingest pipeline definitions.
The BWC syntax of prepending "8" to these formats is used.
This will need to be removed once Java time format strings
become the default in Elasticsearch.
This commit also removes direct imports of Joda classes in the
structure finder unit tests. Instead the core Joda BWC class
is used.
The file structure finder endpoint can find the NDJSON
(newline-delimited JSON) file format, but called it
`json`. This change renames the `format` for this file
structure to `ndjson`, which is more precise and will
hopefully avoid confusion.
The ingest pipeline that is produced is very simple. It
contains a grok processor if the format is semi-structured
text, a date processor if the format contains a timestamp,
and a remove processor if required to remove the interim
timestamp field parsed out of semi-structured text.
Eventually the UI should offer the option to customize the
pipeline with additional processors to perform other data
preparation steps before ingesting data to an index.
This change fixes a potential deadlock problem in the unit
test introduced in #34117.
It also removes a piece of debug code and corrects a docs
formatting problem that were both added in that same PR.
This can be used to restrict the amount of CPU a single
structure finder request can use.
The timeout is not implemented precisely, so requests
may run for slightly longer than the timeout before
aborting.
The default is 25 seconds, which is a little below
Kibana's default timeout of 30 seconds for calls to
Elasticsearch APIs.
Previously the timestamp_formats field in the response
from the find_file_structure endpoint contained Joda
timestamp formats. This change makes that clear by
renaming the field to joda_timestamp_formats, and also
adds a java_timestamp_formats field containing the
equivalent Java time format strings.
Previously numeric values in the field_stats created by the
find_file_structure endpoint were always output with a
decimal point. This looked unfriendly and unnatural for
fields that clearly store integer values. This change
converts integer values to type Integer before output in
the file structure field stats.