Add an explanatory NOTE section to draw attention to the difference
between small and capital letters used for the index date patterns.
e.g.: HH vs hh, MM vs mm.
Closes: #22322
(cherry picked from commit c8125417dc33215651f9bb76c9b1ffaf25f41caf)
Fix a couple of wrong links because of the order of the anchor
and the usage of backquotes.
(cherry picked from commit 4e0c6525153b60a57202937c2ae57968c8e35285)
When analysing a semi-structured text file the
find_file_structure endpoint merges lines to form
multi-line messages using the assumption that the
first line in each message contains the timestamp.
However, if the timestamp is misdetected then this
can lead to excessive numbers of lines being merged
to form massive messages.
This commit adds a line_merge_size_limit setting
(default 10000 characters) that halts the analysis
if a message bigger than this is created. This
prevents significant CPU time being spent subsequently
trying to determine the internal structure of the
huge bogus messages.
Adding an example of how to re-implement the polish stempel analyzer
in case a user want to modify or extend it. In order for the analyzer to be
able to use polish stopwords, also registering a polish_stop filter for the
stempel plugin.
Closes#13150
This commit clones the existing AnalyzeRequest/AnalyzeResponse classes
to the high-level rest client, and adjusts request converters to use these new
classes.
This is a prerequisite to removing the Streamable interface from the internal
server version of these classes.
This PR updates the docs for `docvalue_fields` and `stored_fields` to clarify
that nested fields must be accessed through `inner_hits`. It also tweaks the
nested fields documentation to make this point more visible.
Addresses #23766.
In AsciiDoc, `subs="attributes,callouts,macros"` options were required
to render `include-tagged::` in a code block.
With elastic/docs#827, Elasticsearch Reference documentation migrated
from AsciiDoc to Asciidoctor.
In Asciidoctor, the `subs="attributes,callouts,macros"` options are no
longer needed to render `include-tagged::` in a code block. This commit
removes those unneeded options.
Resolves#41589
Several `ifdef::asciidoctor` conditionals were added so that AsciiDoc
and Asciidoctor doc builds rendered consistently.
With https://github.com/elastic/docs/pull/827, Elasticsearch Reference
documentation migrated completely to Asciidoctor. We no longer need to
support AsciiDoc so we can remove these conditionals.
Resolves#41722
Several `ifdef::asciidoctor` conditionals were added so that AsciiDoc
and Asciidoctor doc builds rendered consistently.
With https://github.com/elastic/docs/pull/827, Elasticsearch Reference
documentation migrated completely to Asciidoctor. We no longer need to
support AsciiDoc so we can remove these conditionals.
Resolves#41722
* Previously, we mentioned multiple times that each nested object was indexed as its own document. This is repetitive, and is also a bit confusing in the context of `index.mapping.nested_fields.limit`, as that applies to the number of distinct `nested` types in the mappings, not the number of nested objects. We now just describe the issue once at the beginning of the section, to illustrate why `nested` types can be expensive.
* Reference the ongoing example to clarify the meaning of the two settings.
Addresses #28363.
Since the max_score optimization landed in Elasticsearch 7,
the CommonTermsQuery is redundant and slower. Moreover the
cutoff_frequency parameter for MatchQuery and MultiMatchQuery
is redundant.
Relates to #27096
(cherry picked from commit 04b74497314eeec076753a33b3b6cc11549646e8)
Both of these classes are basically a bloated wrapper around a simple
construct that can simply be a DirectoryFactory interface. This change
removes both classes and replaces them with a simple stateless interface
that creates a new `Directory` per shard. The concept of `index.store` is preserved
since it makes sense from a configuration perspective.
This change contains a major refactoring of the timestamp
format determination code used by the ML find file structure
endpoint.
Previously timestamp format determination was done separately
for each piece of text supplied to the timestamp format finder.
This had the drawback that it was not possible to distinguish
dd/MM and MM/dd in the case where both numbers were 12 or less.
In order to do this sensibly it is best to look across all the
available timestamps and see if one of the numbers is greater
than 12 in any of them. This necessitates making the timestamp
format finder an instantiable class that can accumulate evidence
over time.
Another problem with the previous approach was that it was only
possible to override the timestamp format to one of a limited
set of timestamp formats. There was no way out if a file to be
analysed had a timestamp that was sane yet not in the supported
set. This is now changed to allow any timestamp format that can
be parsed by a combination of these Java date/time formats:
yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss,
a, XX, XXX, zzz
Additionally S letter groups (fractional seconds) are supported
providing they occur after ss and separated from the ss by a dot,
comma or colon. Spacing and punctuation is also permitted with
the exception of the question mark, newline and carriage return
characters, together with literal text enclosed in single quotes.
The full list of changes/improvements in this refactor is:
- Make TimestampFormatFinder an instantiable class
- Overrides must be specified in Java date/time format - Joda
format is no longer accepted
- Joda timestamp formats in outputs are now derived from the
determined or overridden Java timestamp formats, not stored
separately
- Functionality for determining the "best" timestamp format in
a set of lines has been moved from TextLogFileStructureFinder
to TimestampFormatFinder, taking advantage of the fact that
TimestampFormatFinder is now an instantiable class with state
- The functionality to quickly rule out some possible Grok
patterns when looking for timestamp formats has been changed
from using simple regular expressions to the much faster
approach of using the Shift-And method of sub-string search,
but using an "alphabet" consisting of just 1 (representing any
digit) and 0 (representing non-digits)
- Timestamp format overrides are now much more flexible
- Timestamp format overrides that do not correspond to a built-in
Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern
whose definition is included within the date processor in the
ingest pipeline
- Grok patterns that correspond to multiple Java date/time
patterns are now handled better - the Grok pattern is accepted
as matching broadly, and the required set of Java date/time
patterns is built up considering all observed samples
- As a result of the more flexible acceptance of Grok patterns,
when looking for the "best" timestamp in a set of lines
timestamps are considered different if they are preceded by
a different sequence of punctuation characters (to prevent
timestamps far into some lines being considered similar to
timestamps near the beginning of other lines)
- Out-of-the-box Grok patterns that are considered now include
%{DATE} and %{DATESTAMP}, which have indeterminate day/month
ordering
- The order of day/month in formats with indeterminate day/month
order is determined by considering all observed samples (plus
the server locale if the observed samples still do not suggest
an ordering)
Relates #38086Closes#35137Closes#35132
As a follow-up to #38540 we can use lambda functions and method
references where convenient in the low-level REST client.
Also, we need to update the docs to state that the minimum java version
required is 1.8.
This commit reworks and clarifies the docs for the `discovery-ec2` plugin:
- folds the tiny "Getting started with AWS" into the page on configuration
- spells out the name of each setting in full instead of noting the
`discovery.ec2` prefix at the top of the page.
- replaces each `(Secure)` marker with a sentence describing what that means in
situ
- notes some missing defaults
- clarifies the behaviour of `discovery.ec2.groups` (dependent on `.any_group`)
- clarifies what `discovery.ec2.host_type` is for
- adds `discovery.ec2.tag.TAGNAME` as a (meta-)setting rather than describing
it in a separate section
- notes that the tags mentioned in `discovery.ec2.tag.TAGNAME` cannot contain
colons (see #38406)
- clarifies the EC2-specific interface names and what they're for
- reorders and rewords the recommendations for storage
- expands on why you should not span a cluster across regions
- adds a suggestion on protecting instances against termination during scale-in
- reformat to 80 columns where possible
Fixes#38406