1211 Commits

Author SHA1 Message Date
Costin Leau
bff3c7470e
EQL: Replace SearchHit in response with Event (#61428) (#61522)
The building block of the eql response is currently the SearchHit. This
is a problem since it is tied to an actual search, and thus has scoring,
highlighting, shard information and a lot of other things that are not
relevant for EQL.
This becomes a problem when doing sequence queries since the response is
not generated from one search query and thus there are no SearchHits to
speak of.
Emulating one is not just conceptually incorrect but also problematic
since most of the data is missed or made-up.

As such this PR introduces a simple class, Event, that maps nicely to
the terminology while hiding the ES internals (the use of SearchHit or
GetResult/GetResponse depending on the API used).

Fix #59764
Fix #59779

Co-authored-by: Igor Motov <igor@motovs.org>
(cherry picked from commit 997376fbe6ef2894038968842f5e0635731ede65)
2020-08-25 17:32:42 +03:00
Benjamin Trent
1ae2923632
[7.x] [ML] adding docs + hlrc for data frame analysis feature_processors (#61149) (#61493)
* [ML] adding docs + hlrc for data frame analysis feature_processors (#61149)

Adds HLRC and some docs for the new feature_processors field in Data frame analytics.

Co-authored-by: Przemysław Witek <przemyslaw.witek@elastic.co>
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2020-08-24 12:56:21 -04:00
Armin Braun
d05649bfae
Fix PutPolicyRequestTests.testFromXContent (#61485) (#61494)
We only ever support `JSON` for the query source format in practice.
The reason this test worked before is a bug in xcontent parsing that parses
empty maps out of streams of the wrong format.

Closes #61483
2020-08-24 18:52:05 +02:00
Yang Wang
cd52233b94
Include authentication type for the authenticate response (#61247) (#61411)
Add a new "authentication_type" field to the response of "GET _security/_authenticate".
2020-08-21 22:59:43 +10:00
Andrei Stefan
5de0f19cc3
EQL: Return sequence join keys in the original type (#61268) (#61282)
(cherry picked from commit d54957d61faa0d502387656e3cace594017b6ea0)
2020-08-18 19:37:15 +03:00
Mark Tozzi
db1df6cc30
[7.x] Remove a bunch of type boilerplate from Aggs (#60852) (#61031) 2020-08-17 12:13:05 -04:00
Benjamin Trent
038cc26ac5
[ML] adjusts feature importance format for hlrc (#61150) (#61153)
related to PR https://github.com/elastic/elasticsearch/pull/61104
2020-08-14 11:33:41 -04:00
Andrei Dan
186e8b865d
HLRC: UpdateByQuery API with wait_for_completion being false (#58552) (#61081)
(cherry picked from commit 291f5bd1b2e889e9447d660e5407f3120cffb1a5)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>

Co-authored-by: Tuan Le <23419763+tumile@users.noreply.github.com>
2020-08-13 11:06:41 +01:00
Armin Braun
32423a486d
Simplify and Speed up some Compression Usage (#60953) (#61008)
Use thread-local buffers and deflater and inflater instances to speed up
compressing and decompressing from in-memory bytes.
Not manually invoking `end()` on these should be safe since their off-heap memory
will eventually be reclaimed by the finalizer thread which should not be an issue for thread-locals
that are not instantiated at a high frequency.
This significantly reduces the amount of byte copying and object creation relative to the previous approach
which had to create a fresh temporary buffer (that was then resized multiple times during operations), copied
bytes out of that buffer to a freshly allocated `byte[]`, used 4k stream buffers needlessly when working with
bytes that are already in arrays (`writeTo` handles efficient writing to the compression logic now) etc.

Relates #57284 which should be helped by this change to some degree.
Also, I expect this change to speed up mapping/template updates a little as those make heavy use of these
code paths.
2020-08-12 11:06:23 +02:00
Henning Andersen
a0b54b53fc Rest high level ReindexIT fix (#60834)
ReindexIT would rethrottle any delete or update by query task, fixed to
more precisely match the task started by the test.

Closes #60811
2020-08-11 10:35:15 +02:00
Martijn van Groningen
9163c9ce36
Adjust hlrc data streams integration test (#60804)
Backport of #60746

to wait for at least a single shard to be allocated for a backing index of a data stream,
so that total store size is larger than zero (which is what the tests expects).

Closes #60461
2020-08-06 11:57:12 +02:00
Hendrik Muhs
2b6891b584
[7.x][Transform] implement test suite to test continuous transforms (#60725)
implements a test suite for testing continuous transform with randomization in terms of mappings,
index settings, transform configuration. Add a test case for terms and date histogram. The test
covers:

 - continuous mode with several checkpoints created
 - correctness of results
 - optimizations (minimal necessary writes)
 - permutations of features (index settings, aggs, data types, index or data stream)
2020-08-05 16:56:01 +02:00
Przemysław Witek
0afa1bd972
Deprecate allow_no_jobs and allow_no_datafeeds in favor of allow_no_match (#60601) (#60727) 2020-08-05 13:39:40 +02:00
Yang Wang
54aaadade7
API key name should always be required for creation (#59836) (#60636)
The name is now required when creating or granting API keys.
2020-08-04 13:28:47 +10:00
Hendrik Muhs
aaed6b59d6
[7.x][Transform] add support for missing bucket (#59591) (#60390)
add support for "missing_bucket" in group_by

fixes #42941
fixes #55102
backport #59591
2020-07-30 08:26:51 +02:00
Dan Hermann
fe12217c7f
[7.x] Move REST specs for data streams (#60111) 2020-07-23 08:10:54 -05:00
James Baiera
b3363cf8f9
[7.x] Remove unneeded rest params from Data Stream Stats (#59575) (#59661)
This PR removes the expand_wildcards and forbid_closed_indices parameters from the Data 
Streams Stats REST endpoint. These options are required for broadcast requests, but are not 
needed for anything in terms of resolving data streams. Instead, we just set a default set of 
IndicesOptions on the transport request.
2020-07-21 15:59:16 -04:00
Przemysław Witek
283a1f605c
Rename binary_soft_classification evaluation to outlier_detection (#59951) (#59970) 2020-07-21 15:15:04 +02:00
Benjamin Trent
a28547c4b4
[7.x] [ML] add new custom field to trained model processors (#59542) (#59700)
* [ML] add new `custom` field to trained model processors (#59542)

This commit adds the new configurable field `custom`.

`custom` indicates if the preprocessor was submitted by a user or automatically created by the analytics job.

Eventually, this field will be used in calculating feature importance. When `custom` is true, the feature importance for
the processed fields is calculated. When `false` the current behavior is the same (we calculate the importance for the originating field/feature).

This also adds new required methods to the preprocessor interface. If users are to supply their own preprocessors
in the analytics job configuration, we need to know the input and output field names.
2020-07-16 10:57:38 -04:00
Martijn van Groningen
2a89e13e43
Move data stream transport and rest action to xpack (#59593)
Backport of #59525 to 7.x branch.

* Actions are moved to xpack core.
* Transport and rest actions are moved the data-streams module.
* Removed data streams methods from Client interface.
* Adjusted tests to use client.execute(...) instead of data stream specific methods.
* only attempt to delete all data streams if xpack is installed in rest tests
* Now that ds apis are in xpack and ESIntegTestCase
no longers deletes all ds, do that in the MlNativeIntegTestCase
class for ml tests.
2020-07-15 16:50:44 +02:00
James Baiera
5f7e7e9410
[7.x] Data Stream Stats API (#58707) (#59566)
This API reports on statistics important for data streams, including the number of data
streams, the number of backing indices for those streams, the disk usage for each data
stream, and the maximum timestamp for each data stream
2020-07-14 16:57:46 -04:00
Dan Hermann
59f639a279
Add auto_configure privilege 2020-07-14 08:23:49 -05:00
Andrei Dan
7dcdaeae49
Default to @timestamp in composable template datastream definition (#59317) (#59516)
This makes the data_stream timestamp field specification optional when
defining a composable template.
When there isn't one specified it will default to `@timestamp`.

(cherry picked from commit 5609353c5d164e15a636c22019c9c17fa98aac30)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2020-07-14 12:36:54 +01:00
Dimitris Athanasiou
b2243337d8
[7.x][ML] Data frame analytics max_num_threads setting (#59254) (#59308)
This adds a setting to data frame analytics jobs called
`max_number_threads`. The setting expects a positive integer.
When used the user specifies the max number of threads that may
be used by the analysis. Note that the actual number of threads
used is limited by the number of processors on the node where
the job is assigned. Also, the process may use a couple more threads
for operational functionality that is not the analysis itself.

This setting may also be updated for a stopped job.

More threads may reduce the time it takes to complete the job at the cost
of using more CPU.

Backport of #59254 and #57274
2020-07-09 19:15:46 +03:00
David Kyle
c5443f78ce
Add Inference Pipeline aggregation to HLRC (#59086) (#59250)
Adds InferencePipelineAggregationBuilder to the HLRC duplicating 
the server side classes
2020-07-09 13:38:45 +01:00
Andrei Stefan
c0e0bca84c
Remove search_after and implicit_join_key_field (#59232) (#59280)
(cherry picked from commit 6ede6c59eff321b9fedad30e19508b9e4f788b54)
2020-07-09 12:34:01 +03:00
Martijn van Groningen
17bd559253
Fix the timestamp field of a data stream to @timestamp (#59210)
Backport of #59076 to 7.x branch.

The commit makes the following changes:
* The timestamp field of a data stream definition in a composable
  index template can only be set to '@timestamp'.
* Removed custom data stream timestamp field validation and reuse the validation from `TimestampFieldMapper` and
  instead only check that the _timestamp field mapping has been defined on a backing index of a data stream.
* Moved code that injects _timestamp meta field mapping from `MetadataCreateIndexService#applyCreateIndexRequestWithV2Template58956(...)` method
  to `MetadataIndexTemplateService#collectMappings(...)` method.
* Fixed a bug (#58956) that cases timestamp field validation to be performed
  for each template and instead of the final mappings that is created.
* only apply _timestamp meta field if index is created as part of a data stream or data stream rollover,
this fixes a docs test, where a regular index creation matches (logs-*) with a template with a data stream definition.

Relates to #58642
Relates to #53100
Closes #58956
Closes #58583
2020-07-08 17:30:46 +02:00
Costin Leau
3e32d060bf EQL: Fix bug in skipping window (#59196)
Corrected condition that caused a sequence window to be skipped when a query
returns no results by checking not just the current stage but also following
ones as they can match with in-flight sequences.
Improve logging
Fix NPE when emptying a SequenceGroup
Increase randomization in testing
Make maxspan inclusive (up to and equal to value vs just up to)

(cherry picked from commit ad32c488688cb350c2934dfca03af86045e997b0)
2020-07-08 14:36:39 +03:00
Andrei Dan
24c6a30e2b
[7.9] GET data stream API returns additional information (#59128) (#59177)
* GET data stream API returns additional information (#59128)

This adds the data stream's index template, the configured ILM policy
(if any) and the health status of the data stream to the GET _data_stream
response.

Restoring a data stream from a snapshot could install a data stream that
doesn't match any composable templates. This also makes the `template`
field in the `GET _data_stream` response optional.

(cherry picked from commit 0d9c98a82353b088c782b6a04c44844e66137054)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
2020-07-07 20:30:09 +01:00
Costin Leau
f9c15d0fec EQL: Introduce sequencing fetch size (#59063)
The current internal sequence algorithm relies on fetching multiple results and then paginating through the dataset. Depending on the dataset and memory, setting a larger page size can yield better performance at the expense of memory.
This PR makes this behavior explicit by decoupling the fetch size from size, the maximum number of results desired.
As such, use in testing a minimum fetch size which exposed a number of bugs:

Jumping across data across queries causing valid data to be seen as a gap.
Incorrectly resuming searching across pages (again causing data to be discarded).
which have been addressed.

(cherry picked from commit 2f389a7724790d7b0bda67264d6eafcfa8b2116e)
2020-07-06 19:14:26 +03:00
Przemysław Witek
4a791e835b
Simplify parser declarations when specialist types are stored in strings (#58996) (#59056) 2020-07-06 13:05:03 +02:00
Przemysław Witek
f35ad0d4e1
Report peak model memory in ModelSizeStats (#59017) (#59055) 2020-07-06 12:55:12 +02:00
Benjamin Trent
b9d9964d10
[ML] add exponent output aggregator to inference (#58933) (#59016)
* [ML] add exponent output aggregator to inference

* fixing docs

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2020-07-03 14:51:00 -04:00
Przemysław Witek
751e84e4c8
Rename regression evaluation metrics to make the names consistent with loss functions (#58887) (#58927) 2020-07-02 17:35:55 +02:00
Przemysław Witek
8e074c4495
Rename "error" field to "value" for consistency between metrics (#58726) (#58870) 2020-07-02 09:08:56 +02:00
Yang Wang
a5a8b4ae1d
Add cache for application privileges (#55836) (#58798)
Add caching support for application privileges to reduce number of round-trips to security index when building application privilege descriptors.

Privilege retrieving in NativePrivilegeStore is changed to always fetching all privilege documents for a given application. The caching is applied to all places including "get privilege", "has privileges" APIs and CompositeRolesStore (for authentication).
2020-07-02 11:50:03 +10:00
Przemysław Witek
909649dd15
[7.x] Implement pseudo Huber loss (PseudoHuber) evaluation metric for regression analysis (#58734) (#58825) 2020-07-01 14:52:06 +02:00
Przemysław Witek
9ea9b7bd3b
[7.x] Implement MSLE (MeanSquaredLogarithmicError) evaluation metric for regression analysis (#58684) (#58731) 2020-06-30 14:09:11 +02:00
Yannick Welsch
b885cbff1a
Add index block api (#58716)
Adds an API for putting an index block in place, which also ensures for write blocks that, once successfully returning to
the user, all shards of the index are properly accounting for the block, for example that all in-flight writes to an index have
been completed after adding the write block.

This API allows coordinating more complex workflows, where it is crucial that an index is no longer receiving writes after
the API completes, useful for example when marking an index as read-only during an upgrade in order to reindex its
documents.
2020-06-30 14:06:52 +02:00
Andrei Stefan
3cb8f54f28
EQL: case sensitivity aware integration testing (#58624) (#58672)
* EQL: case sensitivity aware integration testing (#58624)

* Add DataLoader
* Rewrite case sensitivity settings:
NULL -> run both case sensitive and insensitive tests
TRUE -> run case sensitive test only
FALSE -> run case insensitive test only
* Rename test_queries_supported
* Add more toml tests from the Python client

Co-authored-by: Ross Wolf <31489089+rw-access@users.noreply.github.com>
(cherry picked from commit 34d383421599f060a5c083b40df35f135de49e39)
2020-06-29 18:40:07 +03:00
Przemysław Witek
3f7c45472e
[7.x] Introduce DataFrameAnalyticsConfig update API (#58302) (#58648) 2020-06-29 10:56:11 +02:00
Dimitris Athanasiou
1817b896c9
[7.x][ML] Add status and increased estimate to memory usage (#58588) (#58606)
Adds parsing of `status` and `memory_reestimate_bytes`
to data frame analytics `memory_usage`. When the training surpasses
the model memory limit, the status will be set to `hard_limit` and
`memory_reestimate_bytes` can be used to update the job's
limit in order to restart the job.

Backport of #58588
2020-06-28 16:27:26 +03:00
Igor Motov
20af856abd
[7.x] EQL: Adds an ability to execute an asynchronous EQL search (#58192)
Adds async support to EQL searches

Closes #49638

Co-authored-by: James Rodewig james.rodewig@elastic.co
2020-06-25 14:11:57 -04:00
Nik Everett
03e6d1b535
Add Variable Width Histogram Aggregation (backport of #42035) (#58440)
Implements a new histogram aggregation called `variable_width_histogram` which
dynamically determines bucket intervals based on document groupings. These
groups are determined by running a one-pass clustering algorithm on each shard
and then reducing each shard's clusters using an agglomerative
clustering algorithm.

This PR addresses #9572.

The shard-level clustering is done in one pass to minimize memory overhead. The
algorithm was lightly inspired by
[this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches
a small number of documents to sample the data and determine initial clusters.
Subsequent documents are then placed into one of these clusters, or a new one
if they are an outlier. This algorithm is described in more details in the
aggregation's docs.

At reduce time, a
[hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)
algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304)
continually merges the closest buckets from all shards (based on their
centroids) until the target number of buckets is reached.

The final values produced by this aggregation are approximate. Each bucket's
min value is used as its key in the histogram. Furthermore, buckets are merged
based on their centroids and not their bounds. So it is possible that adjacent
buckets will overlap after reduction. Because each bucket's key is its min,
this overlap is not shown in the final histogram. However, when such overlap
occurs, we set the key of the bucket with the larger centroid to the midpoint
between its minimum and the smaller bucket’s maximum:
`min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to
increases the accuracy of the clustering.

Nodes are unable to share centroids during the shard-level clustering phase. In
the future, resolving https://github.com/elastic/elasticsearch/issues/50863
would let us solve this issue.

It doesn’t make sense for this aggregation to support the `min_doc_count`
parameter, since clusters are determined dynamically. The `order` parameter is
not supported here to keep this large PR from becoming too complex.

Co-authored-by: James Dorfman <jamesdorfman@users.noreply.github.com>
2020-06-25 11:40:47 -04:00
Martijn van Groningen
7dda9934f9
Keep track of timestamp_field mapping as part of a data stream (#58400)
Backporting #58096 to 7.x branch.
Relates to #53100

* use mapping source direcly instead of using mapper service to extract the relevant mapping details
* moved assertion to TimestampField class and added helper method for tests
* Improved logic that inserts timestamp field mapping into an mapping.
If the timestamp field path consisted out of object fields and
if the final mapping did not contain the parent field then an error
occurred, because the prior logic assumed that the object field existed.
2020-06-22 17:46:38 +02:00
Benjamin Trent
bf8641aa15
[7.x] [ML] calculate cache misses for inference and return in stats (#58252) (#58363)
When a local model is constructed, the cache hit miss count is incremented.

When a user calls _stats, we will include the sum cache hit miss count across ALL nodes. This statistic is important to in comparing against the inference_count. If the cache hit miss count is near the inference_count it indicates that the cache is overburdened, or inappropriately configured.
2020-06-19 09:46:51 -04:00
Jason Tedor
be08268562
Allow follower indices to override leader settings (#58103)
Today when creating a follower index via the put follow API, or via an
auto-follow pattern, it is not possible to specify settings overrides
for the follower index. Instead, we copy all of the leader index
settings to the follower. Yet, there are cases where a user would want
some different settings on the follower index such as the number of
replicas, or allocation settings. This commit addresses this by allowing
the user to specify settings overrides when creating follower index via
manual put follower calls, or via auto-follow patterns. Note that not
all settings can be overrode (e.g., index.number_of_shards) so we also
have detection that prevents attempting to override settings that must
be equal between the leader and follow index. Note that we do not even
allow specifying such settings in the overrides, even if they are
specified to be equal between the leader and the follower
index. Instead, the must be implicitly copied from the leader index, not
explicitly set by the user.
2020-06-18 11:56:06 -04:00
Jim Ferenczi
82db0b575c
Allow index filtering in field capabilities API (#57276) (#58299)
This change allows to use an `index_filter` in the
field capabilities API. Indices are filtered from
the response if the provided query rewrites to `match_none`
on every shard:

````
GET metrics-*
{
  "index_filter": {
    "bool": {
      "must": [
        "range": {
          "@timestamp": {
            "gt": "2019"
          }
        }
      }
  }
}
````

The filtering is done on a best-effort basis, it uses the can match phase
to rewrite queries to `match_none` instead of fully executing the request.
The first shard that can match the filter is used to create the field
capabilities response for the entire index.

Closes #56195
2020-06-18 10:23:26 +02:00
Rory Hunter
e065d6cc91 Rename dangling index APIs (#58266)
The dangling_indices.import API name could cause issues in the client
libs because import is a reserved word in many languages. Rename the
API to avoid this, and rename the other APIs for consistency.

Related to #48366.
2020-06-18 08:58:32 +01:00
Przemko Robakowski
3249ee9a86
HLRC support for data streams (#58106) (#58202)
This change adds high level REST client support for data streams

Relates to #53100
2020-06-17 00:21:14 +02:00