Commit Graph

47656 Commits

Author SHA1 Message Date
Lee Hinman 52d7b03b49 Wait for no snapshots in state in testRetentionWhileSnapshotIn… (#46573)
This commit adds a wait/check for all running snapshots to be cleared
before taking another snapshot. The previous snapshot was successful but
had not yet been cleared from the cluster state, so the second snapshot
failed due to a `ConcurrentSnapshotException`.

Resolves #46508
2019-09-11 09:47:01 -06:00
Lisa Cawley c0ec6ade4b
[DOCS] Adds transform content (#46575) (#46578) 2019-09-11 08:44:03 -07:00
David Roberts 461de5b58e [TEST] Remove incorrect data frame analytics state assertion (#46597)
After starting the analytics job and checking its state
the state can be any of "started", "reindexing" or
"analyzing" depending on how quickly the work is done.
2019-09-11 16:33:14 +01:00
Lisa Cawley dea472c6fb [DOCS] Adds machine learning PRs to release notes (#46564) 2019-09-11 08:30:59 -07:00
David Roberts 07a0140260
[ML-DataFrame] Ensure latest index template exists before indexing docs (#46595)
When upgrading data nodes to a newer version before
master nodes there was a risk that a transform running
on an upgraded data node would index a document into
the new transforms internal index before its index
template was created.  This would cause the index to
be created with entirely dynamic mappings.

This change introduces a check before indexing any
internal transforms document to ensure that the required
index template exists and create it if it doesn't.

Backport of #46553
2019-09-11 16:27:26 +01:00
Mark Vieira ccf656a9d0
Repository plugin test cacheability fixes (#46572) 2019-09-11 08:24:55 -07:00
Jim Ferenczi 23bf310c84 Replace the SearchContext with QueryShardContext when building aggregator factories (#46527)
This commit replaces the `SearchContext` with the `QueryShardContext` when building aggregator factories. Aggregator factories are part of the `SearchContext` so they shouldn't require a `SearchContext` to create them.
The main changes here are the signatures of `AggregationBuilder#build` that now takes a `QueryShardContext` and `AggregatorFactory#createInternal` that passes the `SearchContext` to build the `Aggregator`.

Relates #46523
2019-09-11 16:43:30 +02:00
Armin Braun 27c15f137e
Remove Unused Method from BlobStoreRepository (#46204) (#46593)
This method isn't used anymore and I forgot to delete it.
2019-09-11 16:34:24 +02:00
William Brafford 8c9f15db44
Fix Path comparisons for Windows tests (#46503) (#46566)
* Fix Path comparisons for Windows tests

The test NodeEnvironmentTests#testCustonDataPaths worked just fine on
Darwin and Linux, but the comparison was breaking in Windows because one
path had the "C:\" prefix and the other one didn't. The simple fix is to
compare absolute paths rather than potentially relative ones.
2019-09-11 09:33:00 -04:00
Christoph Büscher aa0c586b73 Deprecate `_field_names` disabling (#42854)
Currently we allow `_field_names` fields to be disabled explicitely, but since
the overhead is negligible now we decided to keep it turned on by default and
deprecate the `enable` option on the field type. This change adds a deprecation
warning whenever this setting is used, going forward we want to ignore and finally
remove it.

Closes #27239
2019-09-11 14:58:08 +02:00
Hendrik Muhs efea581dcc
[7.x][Transform]Rename data frame plugin to transform: plugin and package names (#46583)
rename data frame transform plugin to transform:

 - rename plugin data-frame to transform
 - change all package names from o.e.*.dataframe.* to o.e.*.transform.*
 - necessary changes to fix loading/testing
2019-09-11 14:50:08 +02:00
Armin Braun 41633cb9b5
More Efficient Ordering of Shard Upload Execution (#42791) (#46588)
* More Efficient Ordering of Shard Upload Execution (#42791)
* Change the upload order of of snapshots to work file by file in parallel on the snapshot pool instead of merely shard-by-shard
* Inspired by #39657
* Cleanup BlobStoreRepository Abort and Failure Handling (#46208)
2019-09-11 13:59:20 +02:00
Jim Ferenczi 80bb08fbda Replace the SearchContext with QueryShardContext when building collapsing context (#46543)
This commit replaces the `SearchContext` with the `QueryShardContext` when building collapsing conteext
Collapse context is part of the `SearchContext` so it shouldn't require a `SearchContext` to create one.

Relates #46523
2019-09-11 12:25:38 +02:00
Jim Ferenczi 425b1a77e8 Add more context to QueryShardContext (#46584)
This change adds an IndexSearcher and the node's BigArrays in the QueryShardContext.
It's a spin off of #46527 as this change is required to allow aggregation builder to solely use the
query shard context.

Relates #46523
2019-09-11 12:24:51 +02:00
Dimitris Athanasiou 579af626f5
[7.x][ML] No error when datafeed stops during updating to started (#46495) (#46542)
Investigating the test failure reported in #45518 it appears that
the datafeed task was not found during a tast state update. There
are only two places where such an update is performed: when we set
the state to `started` and when we set it to `stopping`. We handle
`ResourceNotFoundException` in the latter but not in the former.

Thus the test reveals a rare race condition where the datafeed gets
requested to stop before we managed to update its state to `started`.
I could not reproduce this scenario but it would be my best guess.

This commit catches `ResourceNotFoundException` while updating the
state to `started` and lets the task terminate smoothly.

Closes #45518

Backport of #46495
2019-09-11 13:18:42 +03:00
Przemysław Witek e38e631dac
[7.x] Implement DataFrameAnalyticsAuditMessage and DataFrameAnalyticsAuditor (#45967) (#46519) 2019-09-11 12:17:26 +02:00
Ioannis Kakavas 35810bd2ae
Enforce realm name uniqueness (#46580)
We depend on file realms being unique in a number of places. Pre
7.0 this was enforced by the fact that the multiple realm types
with different name would mean identical configuration keys and
cause configuration parsing errors. Since we intoduced affix
settings for realms this is not the case any more as the realm type
is part of the configuration key.
This change adds a check when building realms which will explicitly
fail if multiple realms are defined with the same name.

Backport of #46253
2019-09-11 13:13:59 +03:00
Armin Braun f8d5145472
Fix SnapshotStatusApisIT (#46563) (#46582)
Obviously we have to run the status request again to busy wait for
the `STARTED` state, just busy waiting on an existing response
won't do anything.

Closes #45917
2019-09-11 11:58:42 +02:00
Colin Goodheart-Smithe 31697c71bd
Updates 7.4.0 release notes 2019-09-11 10:51:04 +01:00
Colin Goodheart-Smithe ac52588a7a
Updates 7.4.0 release notes 2019-09-11 10:14:31 +01:00
Tim Vernum 80064652f8
Fallback to realm authc if ApiKey fails (#46552)
This changes API-Key authentication to always fallback to the realm
chain if the API key is not valid. The previous behaviour was
inconsistent and would terminate on some failures, but continue to the
realm chain for others.

Backport of: #46538
2019-09-11 14:33:17 +10:00
Gordon Brown 7a2878b29b
Fix class used to initialize logger in Watcher (#46467)
This class has been using a logger configured for a different class for
quite a while. While the circumstance in which it logs is rare, it
should still use the correct logger.
2019-09-10 12:41:36 -06:00
Julie Tibshirani 10da998dfa Expand documentation around global ordinals. (#46517)
This commit updates the eager_global_ordinals documentation to give more
background on what global ordinals are and when they are used. The docs also now
mention that global ordinal loading may be expensive, and describes the cases
where in which loading them can be avoided.
2019-09-10 11:04:07 -07:00
Abhilash Bolla 20e93bca6b Fixed grammar in pattern replace char filter docs. (#46546)
Minor grammar fix in the pattern replace char filter docs.
2019-09-10 11:04:07 -07:00
Julie Tibshirani 1b9bd9a0a8 Use a literal block in the field data docs. (#46469)
Currently we use `quote`, which renders a bit strangely on the website.
2019-09-10 11:04:07 -07:00
Lisa Cawley 70c00621db [DOCS] Add missing xpack role attributes (#46468) 2019-09-10 10:46:14 -07:00
Nhat Nguyen 7f9e2f4d91 Ensure no ongoing peer recovery in translog yaml test (#46476)
We leave replicas unassigned until we reroute after the primary shard
starts. If a cluster health request with wait_for_no_initializing_shards
is executed before the reroute, it will return immediately although
there will be some initializing replicas. Peer recoveries of those
shards can prevent translog on the primary from trimming.

We add wait_for_events to the cluster health request so that it will
execute after the reroute.

Closes #46425
2019-09-10 12:43:31 -04:00
Mark Vieira 972d3569c1
Disable local build cache in CI (#46505)
(cherry picked from commit 4dcc226e142057ec44b313edf883e1c5401ec610)
2019-09-10 08:33:00 -07:00
Mark Vieira e5d315f6e1
Ensure rest api specs are always copied before using test classpath (#46514)
(cherry picked from commit 45202903b4fea3a43f62594fd357ab3c98c3dd15)
2019-09-10 08:31:21 -07:00
Lisa Cawley 7461259ba6 [DOCS] Adds missing icons to ML HLRC APIs (#46515) 2019-09-10 08:28:02 -07:00
Lee Hinman cdc3a260af
Add retention to Snapshot Lifecycle Management (backport of #4… (#46506)
* Add retention to Snapshot Lifecycle Management (#46407)

This commit adds retention to the existing Snapshot Lifecycle Management feature (#38461) as described in #43663. This allows a user to configure SLM to automatically delete older snapshots based on a number of criteria.

An example policy would look like:

```
PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": {
    "indices": ["foo-*", "important"]
  },
  // Newly configured retention options
  "retention": {
    // Snapshots should be deleted after 14 days
    "expire_after": "14d",
    // Keep a maximum of thirty snapshots
    "max_count": 30,
    // Keep a minimum of the four most recent snapshots
    "min_count": 4
  }
}
```

SLM Retention is run on a scheduled configurable with the `slm.retention_schedule` setting, which supports cron expressions. Deletions are run for a configurable time bounded by the `slm.retention_duration` setting, which defaults to 1 hour.

Included in this work is a new SLM stats API endpoint available through

``` json
GET /_slm/stats
```

That returns statistics about snapshot taken and deleted, as well as successful retention runs, failures, and the time spent deleting snapshots. #45362 has more information as well as an example of the output. These stats are also included when retrieving SLM policies via the API.

* Add base framework for snapshot retention (#43605)

* Add base framework for snapshot retention

This adds a basic `SnapshotRetentionService` and `SnapshotRetentionTask`
to start as the basis for SLM's retention implementation.

Relates to #38461

* Remove extraneous 'public'

* Use a local var instead of reading class var repeatedly

* Add SnapshotRetentionConfiguration for retention configuration (#43777)

* Add SnapshotRetentionConfiguration for retention configuration

This commit adds the `SnapshotRetentionConfiguration` class and its HLRC
counterpart to encapsulate the configuration for SLM retention.
Currently only a single parameter is supported as an example (we still
need to discuss the different options we want to support and their
names) to keep the size of the PR down. It also does not yet include version serialization checks
since the original SLM branch has not yet been merged.

Relates to #43663

* Fix REST tests

* Fix more documentation

* Use Objects.equals to avoid NPE

* Put `randomSnapshotLifecyclePolicy` in only one place

* Occasionally return retention with no configuration

* Implement SnapshotRetentionTask's snapshot filtering and delet… (#44764)

* Implement SnapshotRetentionTask's snapshot filtering and deletion

This commit implements the snapshot filtering and deletion for
`SnapshotRetentionTask`. Currently only the expire-after age is used for
determining whether a snapshot is eligible for deletion.

Relates to #43663

* Fix deletes running on the wrong thread

* Handle missing or null policy in snap metadata differently

* Convert Tuple<String, List<SnapshotInfo>> to Map<String, List<SnapshotInfo>>

* Use the `OriginSettingClient` to work with security, enhance logging

* Prevent NPE in test by mocking Client

* Allow empty/missing SLM retention configuration (#45018)

Semi-related to #44465, this allows the `"retention"` configuration map
to be missing.

Relates to #43663

* Add min_count and max_count as SLM retention predicates (#44926)

This adds the configuration options for `min_count` and `max_count` as
well as the logic for determining whether a snapshot meets this criteria
to SLM's retention feature.

These options are optional and one, two, or all three can be specified
in an SLM policy.

Relates to #43663

* Time-bound deletion of snapshots in retention delete function (#45065)

* Time-bound deletion of snapshots in retention delete function

With a cluster that has a large number of snapshots, it's possible that
snapshot deletion can take a very long time (especially since deletes
currently have to happen in a serial fashion). To prevent snapshot
deletion from taking forever in a cluster and blocking other operations,
this commit adds a setting to allow configuring a maximum time to spend
deletion snapshots during retention. This dynamic setting defaults to 1
hour and is best-effort, meaning that it doesn't hard stop a deletion
at an hour mark, but ensures that once the time has passed, all
subsequent deletions are deferred until the next retention cycle.

Relates to #43663

* Wow snapshots suuuure can take a long time.

* Use a LongSupplier instead of actually sleeping

* Remove TestLogging annotation

* Remove rate limiting

* Add SLM metrics gathering and endpoint (#45362)

* Add SLM metrics gathering and endpoint

This commit adds the infrastructure to gather metrics about the different SLM actions that a cluster
takes. These actions are stored in `SnapshotLifecycleStats` and perpetuated in cluster state. The
stats stored include the number of snapshots taken, failed, deleted, the number of retention runs,
as well as per-policy counts for snapshots taken, failed, and deleted. It also includes the amount
of time spent deleting snapshots from SLM retention.

This commit also adds an endpoint for retrieving all stats (further commits will expose this in the
SLM get-policy API) that looks like:

```
GET /_slm/stats
{
  "retention_runs" : 13,
  "retention_failed" : 0,
  "retention_timed_out" : 0,
  "retention_deletion_time" : "1.4s",
  "retention_deletion_time_millis" : 1404,
  "policy_metrics" : {
    "daily-snapshots2" : {
      "snapshots_taken" : 7,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 6,
      "snapshot_deletion_failures" : 0
    },
    "daily-snapshots" : {
      "snapshots_taken" : 12,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 12,
      "snapshot_deletion_failures" : 6
    }
  },
  "total_snapshots_taken" : 19,
  "total_snapshots_failed" : 0,
  "total_snapshots_deleted" : 18,
  "total_snapshot_deletion_failures" : 6
}
```

This does not yet include HLRC for this, as this commit is quite large on its own. That will be
added in a subsequent commit.

Relates to #43663

* Version qualify serialization

* Initialize counters outside constructor

* Use computeIfAbsent instead of being too verbose

* Move part of XContent generation into subclass

* Fix REST action for master merge

* Unused import

*  Record history of SLM retention actions (#45513)

This commit records the deletion of snapshots by the retention component
of SLM into the SLM history index for the purposes of reviewing operations
taken by SLM and alerting.

* Retry SLM retention after currently running snapshot completes (#45802)

* Retry SLM retention after currently running snapshot completes

This commit adds a ClusterStateObserver to wait until the currently
running snapshot is complete before proceeding with snapshot deletion.
SLM retention waits for the maximum allowed deletion time for the
snapshot to complete, however, the waiting time is not factored into
the limit on actual deletions.

Relates to #43663

* Increase timeout waiting for snapshot completion

* Apply patch

From 2374316f0d.patch

* Rename test variables

* [TEST] Be less strict for stats checking

* Skip SLM retention if ILM is STOPPING or STOPPED (#45869)

This adds a check to ensure we take no action during SLM retention if
ILM is currently stopped or in the process of stopping.

Relates to #43663

* Check all actions preventing snapshot delete during retention (#45992)

* Check all actions preventing snapshot delete during retention run

Previously we only checked to see if a snapshot was currently running,
but it turns out that more things can block snapshot deletion. This
changes the check to be a check for:

- a snapshot currently running
- a deletion already in progress
- a repo cleanup in progress
- a restore currently running

This was found by CI where a third party delete in a test caused SLM
retention deletion to throw an exception.

Relates to #43663

* Add unit test for okayToDeleteSnapshots

* Fix bug where SLM retention task would be scheduled on every node

* Enhance test logging

* Ignore if snapshot is already deleted

* Missing import

* Fix SnapshotRetentionServiceTests

* Expose SLM policy stats in get SLM policy API (#45989)

This also adds support for the SLM stats endpoint to the high level rest client.

Retrieving a policy now looks like:

```json
{
  "daily-snapshots" : {
    "version": 1,
    "modified_date": "2019-04-23T01:30:00.000Z",
    "modified_date_millis": 1556048137314,
    "policy" : {
      "schedule": "0 30 1 * * ?",
      "name": "<daily-snap-{now/d}>",
      "repository": "my_repository",
      "config": {
        "indices": ["data-*", "important"],
        "ignore_unavailable": false,
        "include_global_state": false
      },
      "retention": {}
    },
    "stats": {
      "snapshots_taken": 0,
      "snapshots_failed": 0,
      "snapshots_deleted": 0,
      "snapshot_deletion_failures": 0
    },
    "next_execution": "2019-04-24T01:30:00.000Z",
    "next_execution_millis": 1556048160000
  }
}
```

Relates to #43663

* Rewrite SnapshotLifecycleIT as as ESIntegTestCase (#46356)

* Rewrite SnapshotLifecycleIT as as ESIntegTestCase

This commit splits `SnapshotLifecycleIT` into two different tests.
`SnapshotLifecycleRestIT` which includes the tests that do not require
slow repositories, and `SLMSnapshotBlockingIntegTests` which is now an
integration test using `MockRepository` to simulate a snapshot being in
progress.

Relates to #43663
Resolves #46205

* Add error logging when exceptions are thrown

* Update serialization versions

* Fix type inference

* Use non-Cancellable HLRC return value

* Fix Client mocking in test

* Fix SLMSnapshotBlockingIntegTests for 7.x branch

* Update SnapshotRetentionTask for non-multi-repo snapshot retrieval

* Add serialization guards for SnapshotLifecyclePolicy
2019-09-10 09:08:09 -06:00
Mayya Sharipova 2c5f9b558b Fix highlighting for script_score query (#46507) 2019-09-10 08:26:47 -04:00
Ioannis Kakavas 690164d0be Change EmailSslTest for FIPS 140 JVMs (#46278)
This commit changes the SSLContext for the email server we use in
the tests so that it loads its key material from an in memory
keystore (that is in turn built from a pair of PEM encoded private key
and certificate) instead of a PKCS#12 one. This is done so that when 
we run our tests in FIPS 140-2 JVMs, the keystore is of a type that the
Security Provider actually supports.

This also mutes testCanSendMessageToSmtpServerByDisablingVerification
as we can't run tests with verification set to `none` in FIPS 140
JVMs.
2019-09-10 14:39:40 +03:00
Alpar Torok 0ac52d0e72 Mute test in 7.x
Tracked in #46529
2019-09-10 13:28:28 +03:00
David Turner 6c67b53932
Load metadata at start time not construction time (#46326)
Today we load the metadata from disk while constructing the node. However there
is no real need to do so, and this commit moves that code to run later while
the node is starting instead.
2019-09-10 11:15:10 +01:00
Alpar Torok b40ac6dee7 mute on 7.x fo windows
Tracking #44942
2019-09-10 12:34:16 +03:00
Henning Andersen 7125c101e6 HLRC multisearchTemplate forgot params (#46492)
Since 7.3, the request converter for multiSearchTemplate would silently
not set the two request parameters `typed_keys` and
`max_concurrent_searches`.

Closes #46488
2019-09-10 08:47:08 +02:00
Henning Andersen 9fce5a99d8 Rest Controller wildcard registration (#46487)
Registering two different http methods on the same path using different
wildcard names would result in the last wildcard name being active only.
Now throw an exception instead.

Closes #46482
2019-09-09 21:49:18 +02:00
Przemysław Witek e21deae535
Disallow persisting any documents when datafeed is isolated (#46485) (#46490) 2019-09-09 21:01:27 +02:00
James Rodewig b59ecde041
[DOCS] [2 of 5] Change // CONSOLE comments to [source,console] (#46353) (#46502) 2019-09-09 13:38:14 -04:00
Zachary Tong b21e417181 [DOCS] Add 7.3.2 Release Notes (#46454) 2019-09-09 13:36:11 -04:00
James Rodewig e253ee6ba6
[DOCS] Change // CONSOLE comments to [source,console] (#46440) (#46494) 2019-09-09 12:35:50 -04:00
Zachary Tong 8d17527050 [TEST] create larger cuckoo filters for tests (#46457)
The cuckoofilters could be randomly created with too small of
capacity or precision, which means that they can only absorb a few
values before collisions start to make all filters look identical.

This increases the size of filters we generate (capacity >> than
the test cases) and lower fpp rate.
2019-09-09 10:18:51 -04:00
Tanguy Leroux 88bed09119 Mutualize code in cloud-based repository integration tests (#46483)
This commit factors out some common code between the cloud-based
repository integration tests that were recently improved.

Relates #46376
2019-09-09 16:02:14 +02:00
David Turner 8428f8e6e8 Remove trailing comma from nodes lists (#46484)
Today when the membership of the cluster changes we log messages that describe
the change like this:

    added {{node-1}{OPdaTIGmSxaEXXOyg3o96w}{127.0.0.1}{127.0.0.1:9301}{di},}

The trailing comma suggests there is some missing string that might contain
extra information, but in fact it's an artefact of how these messages are
constructed. This commit removes the trailing comma from these lists.
2019-09-09 14:47:32 +01:00
Armin Braun ee3396735c
Execute SnapshotsService Error Callback on Generic Thread (#46277) (#46480)
I couldn't find a test for this, as it seems we only get
into this error handler on a bug. Regardless, we are
executing the snapshot finalization on the master update
thread here which shouldn't happen and will make
debugging a production issue resulting from this
trickier than it has to be (because we probably also
get a cluster state apply is slow warning in addition
to the original bug).
Used the generic pool here instead of the snapshot pool
because we're resolving the user callback here as
well and the generic pool seemed like the safer bet for
that.
2019-09-09 14:38:11 +02:00
Alpar Torok 31bee53fdd Fix the JVM we use for bwc nodes (#46314)
Before this change we would run bwc nodes with their bundled jdk if
these supported it, so the passed in runtime JDK was not honored.
This became obvius when running with FIPS.

Closes #41721
2019-09-09 15:24:58 +03:00
Suhel Khan d5529cb0bb [Docs] Fix typo in minimum-should-match.asciidoc (#46472) 2019-09-09 14:17:19 +02:00
Alexander Reelsen 0915bd7c6a Update mustache dependency to 0.9.6 (#46243) 2019-09-09 13:42:03 +02:00
Tanguy Leroux 023cf44025 Inject random server errors in AzureBlobStoreRepositoryTests (#46371)
This commit modifies the HTTP server used in 
AzureBlobStoreRepositoryTests so that it randomly returns 
server errors for any type of request executed by the Azure client.
2019-09-09 10:00:09 +02:00