Commit Graph

18 Commits

Author SHA1 Message Date
Lisa Cawley 42cb59f7b4 [DOCS] Updates ML APIs to use new API template (#43711) 2019-06-27 15:05:51 -07:00
lcawl d46e2bb26a [DOCS] Adds anchors and attributes to ML APIs 2019-06-27 09:44:56 -07:00
David Roberts b202a59f88 [ML] Add earliest and latest timestamps to field stats (#42890)
This change adds the earliest and latest timestamps into
the field stats for fields of type "date" in the output of
the ML find_file_structure endpoint.  This will enable the
cards for date fields in the file data visualizer in the UI
to be made to look more similar to the cards for date
fields in the index data visualizer in the UI.
2019-06-06 08:58:35 +01:00
David Roberts b61202b0a8 [ML] Add a limit on line merging in find_file_structure (#42501)
When analysing a semi-structured text file the
find_file_structure endpoint merges lines to form
multi-line messages using the assumption that the
first line in each message contains the timestamp.
However, if the timestamp is misdetected then this
can lead to excessive numbers of lines being merged
to form massive messages.

This commit adds a line_merge_size_limit setting
(default 10000 characters) that halts the analysis
if a message bigger than this is created.  This
prevents significant CPU time being spent subsequently
trying to determine the internal structure of the
huge bogus messages.
2019-06-03 13:45:51 +01:00
David Roberts f472186b9f [ML] Improve file structure finder timestamp format determination (#41948)
This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.

Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM and MM/dd in the case where both numbers were 12 or less.
In order to do this sensibly it is best to look across all the
available timestamps and see if one of the numbers is greater
than 12 in any of them.  This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.

Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats.  There was no way out if a file to be
analysed had a timestamp that was sane yet not in the supported
set.  This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally S letter groups (fractional seconds) are supported
providing they occur after ss and separated from the ss by a dot,
comma or colon.  Spacing and punctuation is also permitted with
the exception of the question mark, newline and carriage return
characters, together with literal text enclosed in single quotes.

The full list of changes/improvements in this refactor is:

- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda
  format is no longer accepted
- Joda timestamp formats in outputs are now derived from the
  determined or overridden Java timestamp formats, not stored
  separately
- Functionality for determining the "best" timestamp format in
  a set of lines has been moved from TextLogFileStructureFinder
  to TimestampFormatFinder, taking advantage of the fact that
  TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok
  patterns when looking for timestamp formats has been changed
  from using simple regular expressions to the much faster
  approach of using the Shift-And method of sub-string search,
  but using an "alphabet" consisting of just 1 (representing any
  digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in
  Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
  whose definition is included within the date processor in the
  ingest pipeline
- Grok patterns that correspond to multiple Java date/time
  patterns are now handled better - the Grok pattern is accepted
  as matching broadly, and the required set of Java date/time
  patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns,
  when looking for the "best" timestamp in a set of lines
  timestamps are considered different if they are preceded by
  a different sequence of punctuation characters (to prevent
  timestamps far into some lines being considered similar to
  timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include
  %{DATE} and %{DATESTAMP}, which have indeterminate day/month
  ordering
- The order of day/month in formats with indeterminate day/month
  order is determined by considering all observed samples (plus
  the server locale if the observed samples still do not suggest
  an ordering)

Relates #38086
Closes #35137
Closes #35132
2019-05-24 09:10:08 +01:00
Alexander Reelsen 8e5e48319e
Add documentation about breaking java time changes (#38886)
In addition remove joda time mentions across the docs, make 
sure links are updated to java time javadocs.

Forward port of #38720
2019-02-14 10:18:12 +01:00
David Roberts 1fa413a16d
[ML] Remove "8" prefixes from file structure finder timestamp formats (#38016)
In 7.x Java timestamp formats are the default timestamp format and
there is no need to prefix them with "8".  (The "8" prefix was used
in 6.7 to distinguish Java timestamp formats from Joda timestamp
formats.)

This change removes the "8" prefixes from timestamp formats in the
output of the ML file structure finder.
2019-02-01 15:36:04 +00:00
David Roberts f2c0c26d15
[ML] Adjust structure finder for Joda to Java time migration (#37306)
The ML file structure finder has always reported both Joda
and Java time format strings.  This change makes the Java time
format strings the ones that are incorporated into mappings
and ingest pipeline definitions.

The BWC syntax of prepending "8" to these formats is used.
This will need to be removed once Java time format strings
become the default in Elasticsearch.

This commit also removes direct imports of Joda classes in the
structure finder unit tests.  Instead the core Joda BWC class
is used.
2019-01-26 20:19:57 +00:00
lcawl 32bed098bb [DOCS] Synchs titles of X-Pack APIs 2018-12-20 10:27:24 -08:00
David Roberts 9e8cfbb40d
[ML] Deprecate X-Pack centric ML endpoints (#36315)
This commit is part of our plan to deprecate and
ultimately remove the use of _xpack in the REST APIs.

Relates #35958
2018-12-07 20:34:11 +00:00
David Roberts c455be7bc2
[ML] Rename the json file structure to ndjson (#34901)
The file structure finder endpoint can find the NDJSON
(newline-delimited JSON) file format, but called it
`json`.  This change renames the `format` for this file
structure to `ndjson`, which is more precise and will
hopefully avoid confusion.
2018-10-29 10:06:12 +01:00
David Roberts 21c759af0e
[ML] Add an ingest pipeline definition to structure finder (#34350)
The ingest pipeline that is produced is very simple.  It
contains a grok processor if the format is semi-structured
text, a date processor if the format contains a timestamp,
and a remove processor if required to remove the interim
timestamp field parsed out of semi-structured text.

Eventually the UI should offer the option to customize the
pipeline with additional processors to perform other data
preparation steps before ingesting data to an index.
2018-10-12 07:56:35 +01:00
David Roberts a1d2ded98d
[ML] Fix unit test deadlock problem (#34174)
This change fixes a potential deadlock problem in the unit
test introduced in #34117.

It also removes a piece of debug code and corrects a docs
formatting problem that were both added in that same PR.
2018-10-01 15:35:37 +01:00
lcawl 57052f617a [DOCS] Fixes callout in ML API 2018-09-28 10:06:02 -07:00
David Roberts f709c2f694
[ML] Add a timeout option to file structure finder (#34117)
This can be used to restrict the amount of CPU a single
structure finder request can use.

The timeout is not implemented precisely, so requests
may run for slightly longer than the timeout before
aborting.

The default is 25 seconds, which is a little below
Kibana's default timeout of 30 seconds for calls to
Elasticsearch APIs.
2018-09-28 17:32:35 +01:00
David Roberts dfe5af0411
[ML] Return both Joda and Java formats from structure finder (#33900)
Previously the timestamp_formats field in the response
from the find_file_structure endpoint contained Joda
timestamp formats.  This change makes that clear by
renaming the field to joda_timestamp_formats, and also
adds a java_timestamp_formats field containing the
equivalent Java time format strings.
2018-09-25 12:52:51 +01:00
David Roberts b89551c452
[ML] Display integers without .0 in file structure field stats (#33947)
Previously numeric values in the field_stats created by the
find_file_structure endpoint were always output with a
decimal point.  This looked unfriendly and unnatural for
fields that clearly store integer values.  This change
converts integer values to type Integer before output in
the file structure field stats.
2018-09-22 15:48:59 +01:00
David Roberts f5a2ffc3f6
[DOCS][ML] Document the ML find_file_structure endpoint (#33723)
Relates #33471
Relates #33630
2018-09-20 12:17:09 +01:00