OpenSearch

Commit Graph

Author	SHA1	Message	Date
Lee Hinman	fb0461ac76	[7.x] Add Snapshot Lifecycle Management (#44382 ) * Add Snapshot Lifecycle Management (#43934) * Add SnapshotLifecycleService and related CRUD APIs This commit adds `SnapshotLifecycleService` as a new service under the ilm plugin. This service handles snapshot lifecycle policies by scheduling based on the policies defined schedule. This also includes the get, put, and delete APIs for these policies Relates to #38461 * Make scheduledJobIds return an immutable set * Use Object.equals for SnapshotLifecyclePolicy * Remove unneeded TODO * Implement ToXContentFragment on SnapshotLifecyclePolicyItem * Copy contents of the scheduledJobIds * Handle snapshot lifecycle policy updates and deletions (#40062) (Note this is a PR against the `snapshot-lifecycle-management` feature branch) This adds logic to `SnapshotLifecycleService` to handle updates and deletes for snapshot policies. Policies with incremented versions have the old policy cancelled and the new one scheduled. Deleted policies have their schedules cancelled when they are no longer present in the cluster state metadata. Relates to #38461 * Take a snapshot for the policy when the SLM policy is triggered (#40383) (This is a PR for the `snapshot-lifecycle-management` branch) This commit fills in `SnapshotLifecycleTask` to actually perform the snapshotting when the policy is triggered. Currently there is no handling of the results (other than logging) as that will be added in subsequent work. This also adds unit tests and an integration test that schedules a policy and ensures that a snapshot is correctly taken. Relates to #38461 * Record most recent snapshot policy success/failure (#40619) Keeping a record of the results of the successes and failures will aid troubleshooting of policies and make users more confident that their snapshots are being taken as expected. This is the first step toward writing history in a more permanent fashion. * Validate snapshot lifecycle policies (#40654) (This is a PR against the `snapshot-lifecycle-management` branch) With the commit, we now validate the content of snapshot lifecycle policies when the policy is being created or updated. This checks for the validity of the id, name, schedule, and repository. Additionally, cluster state is checked to ensure that the repository exists prior to the lifecycle being added to the cluster state. Part of #38461 * Hook SLM into ILM's start and stop APIs (#40871) (This pull request is for the `snapshot-lifecycle-management` branch) This change allows the existing `/_ilm/stop` and `/_ilm/start` APIs to also manage snapshot lifecycle scheduling. When ILM is stopped all scheduled jobs are cancelled. Relates to #38461 * Add tests for SnapshotLifecyclePolicyItem (#40912) Adds serialization tests for SnapshotLifecyclePolicyItem. * Fix improper import in build.gradle after master merge * Add human readable version of modified date for snapshot lifecycle policy (#41035) * Add human readable version of modified date for snapshot lifecycle policy This small change changes it from: ``` ... "modified_date": 1554843903242, ... ``` To ``` ... "modified_date" : "2019-04-09T21:05:03.242Z", "modified_date_millis" : 1554843903242, ... ``` Including the `"modified_date"` field when the `?human` field is used. Relates to #38461 * Fix test * Add API to execute SLM policy on demand (#41038) This commit adds the ability to perform a snapshot on demand for a policy. This can be useful to take a snapshot immediately prior to performing some sort of maintenance. ```json PUT /_ilm/snapshot/<policy>/_execute ``` And it returns the response with the generated snapshot name: ```json { "snapshot_name" : "production-snap-2019.04.09-rfyv3j9qreixkdbnfuw0ug" } ``` Note that this does not allow waiting for the snapshot, and the snapshot could still fail. It does record this information into the cluster state similar to a regularly trigged SLM job. Relates to #38461 * Add next_execution to SLM policy metadata (#41221) * Add next_execution to SLM policy metadata This adds the next time a snapshot lifecycle policy will be executed when retriving a policy's metadata, for example: ```json GET /_ilm/snapshot?human { "production" : { "version" : 1, "modified_date" : "2019-04-15T21:16:21.865Z", "modified_date_millis" : 1555362981865, "policy" : { "name" : "<production-snap-{now/d}>", "schedule" : "/30 * * * ?", "repository" : "repo", "config" : { "indices" : [ "foo-", "important" ], "ignore_unavailable" : true, "include_global_state" : false } }, "next_execution" : "2019-04-15T21:16:30.000Z", "next_execution_millis" : 1555362990000 }, "other" : { "version" : 1, "modified_date" : "2019-04-15T21:12:19.959Z", "modified_date_millis" : 1555362739959, "policy" : { "name" : "<other-snap-{now/d}>", "schedule" : "0 30 2 * ?", "repository" : "repo", "config" : { "indices" : [ "other" ], "ignore_unavailable" : false, "include_global_state" : true } }, "next_execution" : "2019-04-16T02:30:00.000Z", "next_execution_millis" : 1555381800000 } } ``` Relates to #38461 * Fix and enhance tests * Figured out how to Cron * Change SLM endpoint from /_ilm/* to /_slm/* (#41320) This commit changes the endpoint for snapshot lifecycle management from: ``` GET /_ilm/snapshot/<policy> ``` to: ``` GET /_slm/policy/<policy> ``` It mimics the ILM path only using `slm` instead of `ilm`. Relates to #38461 * Add initial documentation for SLM (#41510) * Add initial documentation for SLM This adds the initial documentation for snapshot lifecycle management. It also includes the REST spec API json files since they're sort of documentation. Relates to #38461 * Add `manage_slm` and `read_slm` roles (#41607) * Add `manage_slm` and `read_slm` roles This adds two more built in roles - `manage_slm` which has permission to perform any of the SLM actions, as well as stopping, starting, and retrieving the operation status of ILM. `read_slm` which has permission to retrieve snapshot lifecycle policies as well as retrieving the operation status of ILM. Relates to #38461 * Add execute to the test * Fix ilm -> slm typo in test * Record SLM history into an index (#41707) It is useful to have a record of the actions that Snapshot Lifecycle Management takes, especially for the purposes of alerting when a snapshot fails or has not been taken successfully for a certain amount of time. This adds the infrastructure to record SLM actions into an index that can be queried at leisure, along with a lifecycle policy so that this history does not grow without bound. Additionally, SLM automatically setting up an index + lifecycle policy leads to `index_lifecycle` custom metadata in the cluster state, which some of the ML tests don't know how to deal with due to setting up custom `NamedXContentRegistry`s. Watcher would cause the same problem, but it is already disabled (for the same reason). * High Level Rest Client support for SLM (#41767) * High Level Rest Client support for SLM This commit add HLRC support for SLM. Relates to #38461 * Fill out documentation tests with tags * Add more callouts and asciidoc for HLRC * Update javadoc links to real locations * Add security test testing SLM cluster privileges (#42678) * Add security test testing SLM cluster privileges This adds a test to `PermissionsIT` that uses the `manage_slm` and `read_slm` cluster privileges. Relates to #38461 * Don't redefine vars * Add Getting Started Guide for SLM (#42878) This commit adds a basic Getting Started Guide for SLM. * Include SLM policy name in Snapshot metadata (#43132) Keep track of which SLM policy in the metadata field of the Snapshots taken by SLM. This allows users to more easily understand where the snapshot came from, and will enable future SLM features such as retention policies. * Fix compilation after master merge * [TEST] Move exception wrapping for devious exception throwing Fixes an issue where an exception was created from one line and thrown in another. * Fix SLM for the change to AcknowledgedResponse * Add Snapshot Lifecycle Management Package Docs (#43535) * Fix compilation for transport actions now that task is required * Add a note mentioning the privileges needed for SLM (#43708) * Add a note mentioning the privileges needed for SLM This adds a note to the top of the "getting started with SLM" documentation mentioning that there are two built-in privileges to assist with creating roles for SLM users and administrators. Relates to #38461 * Mention that you can create snapshots for indices you can't read * Fix REST tests for new number of cluster privileges * Mute testThatNonExistingTemplatesAreAddedImmediately (#43951) * Fix SnapshotHistoryStoreTests after merge * Remove overridden newResponse functions that have been removed * Fix compilation for backport * Fix get snapshot output parsing in test * [DOCS] Add redirects for removed autogen anchors (#44380) * Switch <tt>...</tt> in javadocs for {@code ...}	2019-07-16 07:37:13 -06:00
Armin Braun	099d52f3b0	Prevent Confusing Blocked Thread Warnings in MockNioTransport (#44356 ) (#44376 ) * Prevent Confusing Blocked Thread Warnings in MockNioTransport * We can run into a race where the stacktrace collection and subsequent logging happens after the thread has already unblocked thus logging a confusing stacktrace of wherever the transport thread was after it became unblocked * Fixed this by comparing whether or not the recorded timestamp is still the same before and after the stacktrace was recorded and not logging if it already changed	2019-07-16 04:40:50 +02:00
David Turner	86ee8eab3f	Allow RerouteService to reroute at lower priority (#44338 ) Today the `BatchedRerouteService` submits its delayed reroute task at `HIGH` priority, but in some cases a lower priority would be more appropriate. This commit adds the facility to submit delayed reroute tasks at different priorities, such that each submitted reroute task runs at a priority no lower than the one requested. It does not change the fact that all delayed reroute tasks are submitted at `HIGH` priority, but at least it makes this explicit.	2019-07-15 17:41:39 +01:00
Christoph Büscher	835b7a120d	Fix AnalyzeAction response serialization (#44284 ) Currently we loose information about whether a token list in an AnalyzeAction response is null or an empty list, because we write a 0 value to the stream in both cases and deserialize to a null value on the receiving side. This change fixes this so we write an additional flag indicating whether the value is null or not, followed by the size of the list and its content. Closes #44078	2019-07-14 10:35:11 +02:00
Przemyslaw Gomulka	e23ecc5838	JSON logging refactoring and X-Opaque-ID support backport(#41354 ) (#44178 ) This is a refactor to current JSON logging to make it more open for extensions and support for custom ES log messages used inDeprecationLogger IndexingSlowLog , SearchSLowLog We want to include x-opaque-id in deprecation logs. The easiest way to have this as an additional JSON field instead of part of the message is to create a custom DeprecatedMessage (extends ESLogMEssage) These messages are regular log4j messages with a text, but also carry a map of fields which can then populate the log pattern. The logic for this lives in ESJsonLayout and ESMessageFieldConverter. Similar approach can be used to refactor IndexingSlowLog and SearchSlowLog JSON logs to contain fields previously only present as escaped JSON string in a message field. closes #41350 backport #41354	2019-07-12 16:53:27 +02:00
Yannick Welsch	068286ca4b	Remove RemoteClusterConnection.ConnectedNodes (#44235 ) This instead exposes the set of connected nodes on ConnectionManager.	2019-07-12 14:54:21 +02:00
Armin Braun	6c02cf0241	Fix InternalTestCluster StopRandomNode Assertion (#44258 ) (#44265 ) * The assertion added in #44214 is tripped by tests running dedicated test clusters per test needlessly.This breaks existing tests like the one in #44245. * Closes #44245	2019-07-12 13:18:55 +02:00
David Turner	735c897ec6	Avoid counting votes from master-ineligible nodes (#43688 ) Today if a master-eligible node is converted to a master-ineligible node it may remain in the voting configuration, meaning that the master node may count its publish responses as an indication that it has properly persisted the cluster state. However master-ineligible nodes do not properly persist the cluster state, so it is not safe to count these votes. This change adjusts `CoordinationState` to take account of this from a safety point of view, and also adjusts the `Coordinator` to prevent such nodes from joining the cluster. Instead, it triggers a reconfiguration to remove from the voting configuration a node that now appears to be master-ineligible before processing its join. Backport of #43688, see #44260.	2019-07-12 11:30:52 +01:00
Alpar Torok	8d35583c43	Fix port range allocation with large worker IDs (#44213 ) * Fix port range allocation with large worker IDs Relates to #43983 The IDs gradle uses are incremented for the lifetime of the daemon which can result in port ranges that are outside the valid range. This change implements a modulo based formula to wrap the port ranges when the IDs get too large. Adresses #44134 but #44157 is also required to be able to close it.	2019-07-12 11:04:57 +03:00
Yogesh Gaikwad	91c342a888	fix and enable repository-hdfs secure tests (#44044 ) (#44199 ) Due to recent changes are done for converting `repository-hdfs` to test clusters (#41252), the `integTestSecure*` tasks did not depend on `secureHdfsFixture` which when running would fail as the fixture would not be available. This commit adds the dependency of the fixture to the task. The `secureHdfsFixture` is a `AntFixture` which is spawned a process. Internally it waits for 30 seconds for the resources to be made available. For my local machine, it took almost 45 seconds to be available so I have added the wait time as an input to the `AntFixture` defaults to 30 seconds and set it to 60 seconds in case of secure hdfs fixture. The integ test for secure hdfs was disabled for a long time and so the changes done in #42090 to fix the tests are also done in this commit.	2019-07-12 12:44:01 +10:00
Armin Braun	2768662822	Cleanup Stale Root Level Blobs in Sn. Repository (#43542 ) (#44226 ) * Cleans up all root level temp., snap-%s.dat, meta-%s.dat blobs that aren't referenced by any snapshot to deal with dangling blobs left behind by delete and snapshot finalization failures * The scenario that get's us here is a snapshot failing before it was finalized or a delete failing right after it wrote the updated index-(N+1) that doesn't reference a snapshot anymore but then fails to remove that snapshot * Not deleting other dangling blobs since that don't follow the snap-, meta- or tempfile naming schemes to not accidentally delete blobs not created by the snapshot logic * Follow up to #42189 * Same safety logic, get list of all blobs before writing index-N blobs, delete things after index-N blobs was written	2019-07-11 19:35:15 +02:00
Armin Braun	5f22370b6b	Fix ShrinkIndexIT (#44214 ) (#44223 ) * Fix ShrinkIndexIT * Move this test suit to cluster scope. Currently, `testShrinkThenSplitWithFailedNode` stops a random node which randomly turns out to be the only shared master node so the cluster reset fails on account of the fact that no shared master node survived. * Closes #44164	2019-07-11 17:58:00 +02:00
Nick Knize	374030a53f	Upgrade to lucene-8.2.0-snapshot-860e0be5378 (#44171 ) (#44184 ) Upgrades lucene library to lucene-8.2.0-snapshot-860e0be5378	2019-07-11 09:17:22 -05:00
Alpar Torok	7ba18732f7	Run some REST tests against a cluster running in docker containers (#39515 ) * Run REST tests against a cluster running on docker Closes #38053	2019-07-11 15:28:33 +03:00
Armin Braun	8ce8c627dd	Some Cleanup in o.e.i.shard (#44097 ) (#44208 ) * Some Cleanup in o.e.i.shard * Extract one duplicated method * Cleanup obviously unused code	2019-07-11 13:52:06 +02:00
Yannick Welsch	2ee07f1ff4	Simplify port usage in transport tests (#44157 ) Simplifies AbstractSimpleTransportTestCase to use JVM-local ports and also adds an assertion so that cases like #44134 can be more easily debugged. The likely reason for that one is that a test, which was repeated again and again while always spawning a fresh Gradle worker (due to Gradle daemon) kept increasing Gradle worker IDs, causing an overflow at some point.	2019-07-11 13:35:37 +02:00
Armin Braun	c0ed64bb92	Improve Repository Consistency Check in Tests (#44204 ) * Improve Repository Consistency Check in Tests (#44099) * Check that index metadata as well as snapshot metadata always exists when referenced by other metadata * Fix SnapshotResiliencyTests on ExtraFS (#44113) * As a result of #44099 we're now checking more directories and have to ignore the `extraN` folders for those like we do for indices already * Closes #44112	2019-07-11 11:14:37 +02:00
Armin Braun	8a554f9737	Remove IncompatibleSnapshots Logic from Codebase (#44096 ) (#44183 ) * The incompatible snapshots logic was created to track 1.x snapshots that became incompatible with 2.x * It serves no purpose at this point * It adds an additional GET request to every loading of RepositoryData (from loading the incompatible snapshots blob)	2019-07-11 07:15:51 +02:00
Armin Braun	d6f09fdb97	Add WARN Logging if Mock Network Accepts Huge Number of Connections (#44169 ) (#44182 ) * Add WARN Logging if Mock Network Accepts Huge Number of Connections * As discussed, added warn logging to rule out endless accept loops for #43387 * Had to handle it by the relatively awkward override in the mock nio because we don't have logging in the NIO module where (`ServerChannelContext` lives)	2019-07-10 22:08:36 +02:00
Ryan Ernst	fb77d8f461	Removed writeTo from TransportResponse and ActionResponse (#44092 ) The base classes for transport requests and responses currently implement Streamable and Writeable. The writeTo method on these base classes is implemented with an empty implementation. Not only does this complicate subclasses to think they need to call super.writeTo, but it also can lead to not implementing writeTo when it should have been implemented, or extendiong one of these classes when not necessary, since there is nothing to actually implement. This commit removes the empty writeTo from these base classes, and fixes subclasses to not call super and in some cases implement an empty writeTo themselves. relates #34389	2019-07-10 12:42:04 -07:00
Zachary Tong	92ad588275	Remove generic on AggregatorFactory (#43664 ) (#44079 ) AggregatorFactory was generic over itself, but it doesn't appear we use this functionality anywhere (e.g. to allow the super class to declare arguments/return types generically for subclasses to override). Most places use a wildcard constraint, and even when a concrete type is specified it wasn't used. But since AggFactories are widely used, this led to the generic touching many pieces of code and making type signatures fairly complex	2019-07-10 13:20:28 -04:00
David Turner	aec44fecbc	Decouple DiskThresholdMonitor & ClusterInfoService (#44105 ) Today the `ClusterInfoService` requires the `DiskThresholdMonitor` at construction time so that it can notify it when nodes report changes in their disk usage, but this is awkward to construct: the `DiskThresholdMonitor` requires a `RerouteService` which requires an `AllocationService` which comees from the `ClusterModule` which requires the `ClusterInfoService`. Today we break the cycle with a `LazilyInitializedRerouteService` which is itself a little ugly. This commit replaces this with a more traditional subject/observer relationship between the `ClusterInfoService` and the `DiskThresholdMonitor`.	2019-07-09 18:43:32 +01:00
David Turner	268971db03	Wait for blackholed connection before discovery (#44077 ) Since #42636 we no longer treat connections specially when simulating a blackholed connection. This means that at the end of the safety phase we may have just started a connection attempt which will time out, but the default timeout is 30 seconds, much longer than the 2 seconds we normally allow for post-safety-phase discovery. This commit adds time for such a connection attempt to time out. It also fixes some spurious logging of `this` that now refers to an object with an unhelpful `toString()` implementation introduced in #42636. Fixes #44073	2019-07-09 10:59:53 +01:00
Armin Braun	9eac5ceb1b	Dry up inputstream to bytesreference (#43675 ) (#44094 ) * Dry up Reading InputStream to BytesReference * Dry up spots where we use the same pattern to get from an InputStream to a BytesReferences	2019-07-09 09:18:25 +02:00
David Turner	6dce458ecc	Randomise retention lease expiry time (#44067 ) In today's test suite indices mostly use the default value of `12h` for the `index.soft_deletes.retention_lease.period` setting, which in the context of the test suite essentially means "never expires". In fact, the tests should all behave correctly even if the lease period is much shorter; tests that rely on leases not expiring should configure their indices appropriately. This commit randomises the lease expiry time for those indices created during tests which do not set a specific value for this setting.	2019-07-08 18:29:27 +02:00
Alpar Torok	0c8294e633	Make sure the clean task doesn't break test fixtures (#43641 ) Use a dedicated fixture dir.	2019-07-08 17:58:27 +03:00
Armin Braun	afe81fd625	Some Cleanup in Test Framework (#44039 ) (#44059 ) * Remove some obvious dead code * Move assert methods that were only used in a single test class to the child they belong to * Inline some redundant methods	2019-07-08 14:15:31 +02:00
Armin Braun	af9b98e81c	Recursively Delete Unreferenced Index Directories (#42189 ) (#44051 ) * Use ability to list child "folders" in the blob store to implement recursive delete on all stale index folders when cleaning up instead of using the diff between two `RepositoryData` instances to cover aborted deletes * Runs after ever delete operation * Relates #13159 (fixing most of this issues caused by unreferenced indices, leaving some meta files to be cleaned up only)	2019-07-08 10:55:39 +02:00
Nhat Nguyen	9089820d8f	Enable indexing optimization using sequence numbers on replicas (#43616 ) This PR enables the indexing optimization using sequence numbers on replicas. With this optimization, indexing on replicas should be faster and use less memory as it can forgo the version lookup when possible. This change also deactivates the append-only optimization on replicas. Relates #34099	2019-07-05 22:12:08 -04:00
Yannick Welsch	504a43d43a	Move ConnectionManager to async APIs (#42636 ) This commit converts the ConnectionManager's openConnection and connectToNode methods to async-style. This will allow us to not block threads anymore when opening connections. This PR also adapts the cluster coordination subsystem to make use of the new async APIs, allowing to remove some hacks in the test infrastructure that had to account for the previous synchronous nature of the connection APIs.	2019-07-05 20:40:22 +02:00
Yannick Welsch	d090fa514f	Use unique ports per test worker (#43983 ) * Use unique ports per test worker * Add test for system property * check presence of tests.gradle * Revert "check presence of tests.gradle" This reverts commit 2fee7512a28f95c94c5bf7a3312e808f918a9510.	2019-07-05 11:02:28 +02:00
Jim Ferenczi	cdf55cb5c5	Refactor index engines to manage readers instead of searchers (#43860 ) This commit changes the way we manage refreshes in the index engines. Instead of relying on a SearcherManager, this change uses a ReaderManager that creates ElasticsearchDirectoryReader when needed. Searchers are now created on-demand (when acquireSearcher is called) from the current ElasticsearchDirectoryReader. It also slightly changes the Engine.Searcher to extend IndexSearcher in order to simplify the usage in the consumer.	2019-07-04 22:49:43 +02:00
Alan Woodward	4b99255fed	Add name() method to TokenizerFactory (#43909 ) This brings TokenizerFactory into line with CharFilterFactory and TokenFilterFactory, and removes the need to pass around tokenizer names when building custom analyzers. As this means that TokenizerFactory is no longer a functional interface, the commit also adds a factory method to TokenizerFactory to make construction simpler.	2019-07-04 11:28:55 +01:00
Armin Braun	be20fb80e4	Recursive Delete on BlobContainer (#43281 ) (#43920 ) This is a prerequisite of #42189: * Add directory delete method to blob container specific to each implementation: * Some notes on the implementations: * AWS + GCS: We can simply exploit the fact that both AWS and GCS return blobs lexicographically ordered which allows us to simply delete in the same order that we receive the blobs from the listing request. For AWS this simply required listing without the delimiter setting (so we get a deep listing) and for GCS the same behavior is achieved by not using the directory mode on the listing invocation. The nice thing about this is, that even for very large numbers of blobs the memory requirements are now capped nicely since we go page by page when deleting. * For Azure I extended the parallelization to the listing calls as well and made it work recursively. I verified that this works with thread count `1` since we only block once in the initial thread and then fan out to a "graph" of child listeners that never block. * HDFS and FS are trivial since we have directory delete methods available for them * Enhances third party tests to ensure the new functionality works (I manually ran them for all cloud providers)	2019-07-03 17:14:57 +02:00
Armin Braun	455b12a4fb	Add Ability to List Child Containers to BlobContainer (#42653 ) (#43903 ) * Add Ability to List Child Containers to BlobContainer (#42653) * Add Ability to List Child Containers to BlobContainer * This is a prerequisite of #42189	2019-07-03 11:30:49 +02:00
Armin Braun	826f38cd70	Enable Parallel Deletes in Azure Repository (#42783 ) (#43886 ) * Parallel deletes via private thread pool	2019-07-03 09:28:39 +02:00
Zachary Tong	ea1794832f	Add RareTerms aggregation (#35718 ) This adds a `rare_terms` aggregation. It is an aggregation designed to identify the long-tail of keywords, e.g. terms that are "rare" or have low doc counts. This aggregation is designed to be more memory efficient than the alternative, which is setting a terms aggregation to size: LONG_MAX (or worse, ordering a terms agg by count ascending, which has unbounded error). This aggregation works by maintaining a map of terms that have been seen. A counter associated with each value is incremented when we see the term again. If the counter surpasses a predefined threshold, the term is removed from the map and inserted into a cuckoo filter. If a future term is found in the cuckoo filter we assume it was previously removed from the map and is "common". The map keys are the "rare" terms after collection is done.	2019-07-01 10:30:02 -04:00
Nhat Nguyen	598e00a689	Make peer recovery send file info step async (#43792 ) Relates #36195	2019-07-01 08:40:45 -04:00
Ryan Ernst	3a2c698ce0	Rename Action to ActionType (#43778 ) Action is a class that encapsulates meta information about an action that allows it to be called remotely, specifically the action name and response type. With recent refactoring, the action class can now be constructed as a static constant, instead of needing to create a subclass. This makes the old pattern of creating a singleton INSTANCE both misnamed and lacking a common placement. This commit renames Action to ActionType, thus allowing the old INSTANCE naming pattern to be TYPE on the transport action itself. ActionType also conveys that this class is also not the action itself, although this change does not rename any concrete classes as those will be removed organically as they are converted to TYPE constants. relates #34389	2019-06-30 22:00:17 -07:00
David Turner	fca7a19713	Avoid parallel reroutes in DiskThresholdMonitor (#43381 ) Today the `DiskThresholdMonitor` limits the frequency with which it submits reroute tasks, but it might still submit these tasks faster than the master can process them if, for instance, each reroute takes over 60 seconds. This causes a problem since the reroute task runs with priority `IMMEDIATE` and is always scheduled when there is a node over the high watermark, so this can starve any other pending tasks on the master. This change avoids further updates from the monitor while its last task(s) are still in progress, and it measures the time of each update from the completion time of the reroute task rather than its start time, to allow a larger window for other tasks to run. It also now makes use of the `RoutingService` to submit the reroute task, in order to batch this task with any other pending reroutes. It enhances the `RoutingService` to notify its listeners on completion. Fixes #40174 Relates #42559	2019-06-30 16:54:16 +01:00
Nhat Nguyen	55b3ec8d7b	Make peer recovery clean files step async (#43787 ) Relates #36195	2019-06-29 18:30:51 -04:00
Albert Zaharovits	5e17bc5dcc	Consistent Secure Settings #40416 Introduces a new `ConsistentSecureSettingsValidatorService` service that exposes a single public method, namely `allSecureSettingsConsistent`. The method returns `true` if the local node's secure settings (inside the keystore) are equal to the master's, and `false` otherwise. Technically, the local node has to have exactly the same secure settings - setting names should not be missing or in surplus - for all `SecureSetting` instances that are flagged with the newly introduced `Property.Consistent`. It is worth highlighting that the `allSecureSettingsConsistent` is not a consensus view across the cluster, but rather the local node's perspective in relation to the master.	2019-06-29 23:26:17 +03:00
Jim Ferenczi	7ca69db83f	Refactor IndexSearcherWrapper to disallow the wrapping of IndexSearcher (#43645 ) This change removes the ability to wrap an IndexSearcher in plugins. The IndexSearcherWrapper is replaced by an IndexReaderWrapper and allows to wrap the DirectoryReader only. This simplifies the creation of the context IndexSearcher that is used on a per request basis. This change also moves the optimization that was implemented in the security index searcher wrapper to the ContextIndexSearcher that now checks the live docs to determine how the search should be executed. If the underlying live docs is a sparse bit set the searcher will compute the intersection betweeen the query and the live docs instead of checking the live docs on every document that match the query.	2019-06-28 16:28:02 +02:00
Christoph Büscher	2cc7f5a744	Allow reloading of search time analyzers (#43313 ) Currently changing resources (like dictionaries, synonym files etc...) of search time analyzers is only possible by closing an index, changing the underlying resource (e.g. synonym files) and then re-opening the index for the change to take effect. This PR adds a new API endpoint that allows triggering reloading of certain analysis resources (currently token filters) that will then pick up changes in underlying file resources. To achieve this we introduce a new type of custom analyzer (ReloadableCustomAnalyzer) that uses a ReuseStrategy that allows swapping out analysis components. Custom analyzers that contain filters that are markes as "updateable" will automatically choose this implementation. This PR also adds this capability to `synonym` token filters for use in search time analyzers. Relates to #29051	2019-06-28 09:55:40 +02:00
Nhat Nguyen	ce8771feb7	Do not use MockInternalEngine in GatewayIndexStateIT (#43716 ) GatewayIndexStateIT#testRecoverBrokenIndexMetadata replies on the flushing on shutdown. This behaviour, however, can be randomly disabled in MockInternalEngine. Closes #43034	2019-06-27 18:28:04 -04:00
Yannick Welsch	6744344ef2	Handle situation where only voting-only nodes are bootstrapped (#43628 ) Adds support for the situation where only voting-only nodes are bootstrapped. In that case, they will still try to become elected and bring full master nodes into the cluster.	2019-06-27 18:10:15 +02:00
David Roberts	c5beb05f77	[ML][DataFrame] Consider data frame templates internal in REST tests (#43692 ) The data frame index template pattern was not in the list considered as internal and therefore not needing cleanup after every test.	2019-06-27 14:40:30 +01:00
Christoph Büscher	36360358b2	Move query builder caching check to dedicated tests (#43238 ) Currently `AbstractQueryTestCase#testToQuery` checks the search context cachable flag. This is a bit fragile due to the high randomization of query builders performed by this general test. Also we might only rarely check the "interesting" cases because they rarely get generated when fully randomizing the query builder. This change moved the general checks out ot #testToQuery and instead adds dedicated cache tests for those query builders that exhibit something other than the default behaviour. Closes #43200	2019-06-27 14:56:29 +02:00
Yannick Welsch	2049f715b3	Add voting-only master node (#43410 ) A voting-only master-eligible node is a node that can participate in master elections but will not act as a master in the cluster. In particular, a voting-only node can help elect another master-eligible node as master, and can serve as a tiebreaker in elections. High availability (HA) clusters require at least three master-eligible nodes, so that if one of the three nodes is down, then the remaining two can still elect a master amongst them-selves. This only requires one of the two remaining nodes to have the capability to act as master, but both need to have voting powers. This means that one of the three master-eligible nodes can be made as voting-only. If this voting-only node is a dedicated master, a less powerful machine or a smaller heap-size can be chosen for this node. Alternatively, a voting-only non-dedicated master node can play the role of the third master-eligible node, which allows running an HA cluster with only two dedicated master nodes. Closes #14340 Co-authored-by: David Turner <david.turner@elastic.co>	2019-06-26 08:07:56 +02:00
Zachary Tong	63fef5a31e	Add scripting support to AggregatorTestCase (#43494 ) This refactors AggregatorTestCase to allow testing mock scripts. The main change is to QueryShardContext. This was previously mocked, but to get the ScriptService you have to invoke a final method which can't be mocked. Instead, we just create a mostly-empty QueryShardContext and populate the fields that are needed for testing. It also introduces a few new helper methods that can be overridden to change the default behavior a bit. Most tests should be able to override getMockScriptService() to supply a ScriptService to the context, which is later used by the aggs. More complicated tests can override queryShardContextMock() as before. Adds a test to MaxAggregatorTests to test out the new functionality.	2019-06-25 11:52:12 -04:00
Yannick Welsch	d45f12799c	Sync global checkpoint on pending in-sync shards (#43526 ) At the end of a peer recovery the primary wants to mark the replica as in-sync. For that the persisted local checkpoint of the replica needs to have caught up with the global checkpoint on the primary. If translog durability is set to ASYNC, this means that information about the persisted local checkpoint can lag on the primary and might need to be explicitly fetched through a global checkpoint sync action. Unfortunately, that action will only be triggered after 30 seconds, and, even worse, will only run based on what the in-sync shard copies say (see IndexShard.maybeSyncGlobalCheckpoint). As the replica has not been marked as in-sync yet, it is not taken into consideration, and the primary might have its global checkpoint equal to the max seq no, so it thinks nothing needs to be done. Closes #43486	2019-06-24 18:35:57 +02:00
Armin Braun	1053a89b79	Log Blocked IO Thread State (#43424 ) (#43447 ) * Let's log the state of the thread to find out if it's dead-locked or just stuck after being suspended * Relates #43392	2019-06-20 22:31:36 +02:00
Yannick Welsch	29d76baf7d	Increase timeout for assertSeqNos Helps with tests that do async translog syncing	2019-06-20 19:06:49 +02:00
Zachary Tong	a8a81200d0	Better support for unmapped fields in AggregatorTestCase (#43405 ) AggregatorTestCase will NPE if only a single, null MappedFieldType is provided (which is required to simulate an unmapped field). While it's possible to test unmapped fields by supplying other, non-related field types... that's clunky and unnecessary. AggregatorTestCase just needs to filter out null field types when setting up.	2019-06-20 11:31:49 -04:00
Armin Braun	7d1983a7e3	Fix Operation Timestamps in Tests (#43155 ) (#43419 ) * For the issue in #43086 we were running into inactive shards because the random timestamps previously used would randomly make `org.elasticsearch.index.shard.IndexShard#checkIdle` see an incorrect+huge inactive time * Also fixed one other spot in tests that passed `ms` instead of `ns` for the same timestamp on an index op to correctly use relative `ns` * Closes #43086	2019-06-20 16:36:17 +02:00
David Turner	c8eb09f158	Fail connection attempts earlier in tests (#43320 ) Today the `DisruptibleMockTransport` always allows a connection to a node to be established, and then fails requests sent to that node such as the subsequent handshake. Since #42342, we log handshake failures on an open connection as a warning, and this makes the test logs rather noisy. This change fails the connection attempt first, avoiding these unrealistic warnings.	2019-06-20 14:45:24 +01:00
Armin Braun	5af9387fad	Fix Stuck IO Thread Logging Time Precision (#42882 ) * The precision of the timestamps we get from the cached time thread is only 200ms by default resulting in a number of needless ~200ms slow network thread execution logs * Fixed by making the warn threshold a function of the precision of the cached time thread found in the settings	2019-06-20 14:26:20 +02:00
Yannick Welsch	7f8e1454ab	Advance checkpoints only after persisting ops (#43205 ) Local and global checkpoints currently do not correctly reflect what's persisted to disk. The issue is that the local checkpoint is adapted as soon as an operation is processed (but not fsynced yet). This leaves room for the history below the global checkpoint to still change in case of a crash. As we rely on global checkpoints for CCR as well as operation-based recoveries, this has the risk of shard copies / follower clusters going out of sync. This commit required changing some core classes in the system: - The LocalCheckpointTracker keeps track now not only of the information whether an operation has been processed, but also whether that operation has been persisted to disk. - TranslogWriter now keeps track of the sequence numbers that have not been fsynced yet. Once they are fsynced, TranslogWriter notifies LocalCheckpointTracker of this. - ReplicationTracker now keeps track of the persisted local and persisted global checkpoints of all shard copies when in primary mode. The computed global checkpoint (which represents the minimum of all persisted local checkpoints of all in-sync shard copies), which was previously stored in the checkpoint entry for the local shard copy, has been moved to an extra field. - The periodic global checkpoint sync now also takes async durability into account, where the local checkpoints on shards only advance when the translog is asynchronously fsynced. This means that the previous condition to detect inactivity (max sequence number is equal to global checkpoint) is not sufficient anymore. - The new index closing API does not work when combined with async durability. The shard verification step is now requires an additional pre-flight step to fsync the translog, so that the main verify shard step has the most up-to-date global checkpoint at disposition.	2019-06-20 11:12:38 +02:00
Mark Vieira	0867ea75b7	Properly format reproduction lines for test methods that contain periods (#43255 )	2019-06-18 09:17:47 -07:00
Nhat Nguyen	0c5086d2f3	Rebuild version map when opening internal engine (#43202 ) With this change, we will rebuild the live version map and local checkpoint using documents (including soft-deleted) from the safe commit when opening an internal engine. This allows us to safely prune away _id of all soft-deleted documents as the version map is always in-sync with the Lucene index. Relates #40741 Supersedes #42979	2019-06-17 18:08:09 -04:00
Martijn Laarman	8b1b9f8ab9	Introduce stability description to the REST API specification (#38413 ) (#43278 ) * introduce state to the REST API specification * change state over to stability * CCR is no GA updated to stable * SQL is now GA so marked as stable * Introduce `internal` as state for API's, marks stable in terms of lifetime but unstable in terms of guarantees on its output format since it exposes internal representations * make setting a wrong stability value, or not setting it at all an error that causes the YAML test suite to fail * update spec files to be explicit about their stability state * Document the fact that stability needs to be defined Otherwise the YAML test runner will fail (with a nice exception message) * address check style violations * update rest spec unit tests to include stability * found one more test spec file not declaring stability, made sure stability appears after documentation everywhere * cluster.state is stable, mark response in some way to denote its a key value format that can be changed during minors * mark data frame API's as beta * remove internal and private as states for an API * removed the wrong enum values in the Stability Enum in the previous commit (cherry picked from commit 61c34bbd92f8f7e5f22fa411c6b682b0ebd8a99d)	2019-06-17 16:57:13 +02:00
Henning Andersen	ba15d08e14	Allow cluster access during node restart (#42946 ) (#43272 ) This commit modifies InternalTestCluster to allow using client() and other operations inside a RestartCallback (onStoppedNode typically). Restarting nodes are now removed from the map and thus all methods now return the state as if the restarting node does not exist. This avoids various exceptions stemming from accessing the stopped node(s).	2019-06-17 15:04:17 +02:00
Christoph Büscher	7af23324e3	SimpleQ.S.B and QueryStringQ.S.B tests should avoid `now` in query (#43199 ) Currently the randomization of the q.b. in these tests can create query strings that can cause caching to be disabled for this query if we query all fields and there is a date field present. This is pretty much an anomaly that we shouldn't generally test for in the "testToQuery" tests where cache policies are checked. This change makes sure we don't create offending query strings so the cache checks never hit these cases and adds a special test method to check this edge case. Closes #43112	2019-06-14 11:21:48 +02:00
Alpar Torok	7cc6dca697	Remove explicily enabled build fixture task	2019-06-14 10:42:08 +03:00
Jason Tedor	5bc3b7f741	Enable node roles to be pluggable (#43175 ) This commit introduces the possibility for a plugin to introduce additional node roles.	2019-06-13 15:15:48 -04:00
Yogesh Gaikwad	4ae1e30a98	Enable krb5kdc-fixture, kerberos tests mount urandom for kdc container (#41710 ) (#43178 ) Infra has fixed #10462 by installing `haveged` on CI workers. This commit enables the disabled fixture and tests, and mounts `/dev/urandom` for the container so there is enough entropy required for kdc. Note: hdfs-repository tests have been disabled, will raise a separate issue for it. Closes #40624 Closes #40678	2019-06-13 13:02:16 +10:00
Alan Woodward	9de1c69c28	IndexAnalyzers doesn't need to extend AbstractIndexComponent (#43149 ) AIC doesn't add anything here, and it removes the need to pass index settings to the constructor.	2019-06-12 17:48:31 +01:00
Henning Andersen	6108899a2d	Fix unresponsive network simulation (#42579 ) Unresponsive network simulation would throw away requests. However, then we no longer have any guarantees that a transport action either succeeds or fails, which could lead to hangs (example: unclosed IndexShard permits). Closes #42244	2019-06-11 17:54:48 +02:00
Nhat Nguyen	f2e66e22eb	Increase waiting time when check retention locks (#42994 ) WriteActionsTests#testBulk and WriteActionsTests#testIndex sometimes fail with a pending retention lock. We might leak retention locks when switching to async recovery. However, it's more likely that ongoing recoveries prevent the retention lock from releasing. This change increases the waiting time when we check for no pending retention lock and also ensures no ongoing recovery in WriteActionsTests. Closes #41054	2019-06-10 17:58:37 -04:00
Jason Tedor	915d2f2daa	Refactor put mapping request validation for reuse (#43005 ) This commit refactors put mapping request validation for reuse. The concrete case that we are after here is the ability to apply effectively the same framework to indices aliases requests. This commit refactors the put mapping request validation framework to allow for that.	2019-06-09 10:19:04 -04:00
henryptung	61b62125b8	Wire query cache into sorting nested-filter computation (#42906 ) Don't use Lucene's default query cache when filtering in sort. Closes #42813	2019-06-06 21:16:58 +02:00
Gordon Brown	6eb4600e93	Add custom metadata to snapshots (#41281 ) Adds a metadata field to snapshots which can be used to store arbitrary key-value information. This may be useful for attaching a description of why a snapshot was taken, tagging snapshots to make categorization easier, or identifying the source of automatically-created snapshots.	2019-06-05 17:30:31 -06:00
Przemyslaw Gomulka	ab5bc83597	Deprecation info for joda-java migration on 7.x (#42659 ) Some clusters might have been already migrated to version 7 without being warned about the joda-java migration changes. Deprecation api on that version will give them guidance on what patterns need to be changed. relates. This change is using the same logic like in 6.8 that is: verifying the pattern is from the incompatible set ('y'-Y', 'C', 'Z' etc), not from predifined set, not prefixed with 8. AND was also created in 6.x. Mappings created in 7.x are considered migrated and should not generate warnings There is no pipeline check (present on 6.8) as it is impossible to verify when the pipeline was created, and therefore to make sure the format is depracated or not #42010	2019-06-05 19:50:04 +02:00
Mark Vieira	e44b8b1e2e	[Backport] Remove dependency substitutions 7.x (#42866 ) * Remove unnecessary usage of Gradle dependency substitution rules (#42773) (cherry picked from commit 12d583dbf6f7d44f00aa365e34fc7e937c3c61f7)	2019-06-04 13:50:23 -07:00
Tim Vernum	8de3a88205	Log the status of security on license change (#42741 ) Whether security is enabled/disabled is dependent on the combination of the node settings and the cluster license. This commit adds a license state listener that logs when the license change causes security to switch state (or to be initialised). This is primarily useful for diagnosing cluster formation issues. Backport of: #42488	2019-06-04 14:25:43 +10:00
David Turner	df0f0b3d40	Rename autoMinMasterNodes to autoManageMasterNodes (#42789 ) Renames the `ClusterScope` attribute `autoMinMasterNodes` to reflect its broader meaning since 7.0. Backport of the relevant part of #42700 to `7.x`.	2019-06-03 12:12:07 +01:00
Jason Tedor	371cb9a8ce	Remove Log4j 1.2 API as a dependency (#42702 ) We had this as a dependency for legacy dependencies that still needed the Log4j 1.2 API. This appears to no longer be necessary, so this commit removes this artifact as a dependency. To remove this dependency, we had to fix a few places where we were accidentally relying on Log4j 1.2 instead of Log4j 2 (easy to do, since both APIs were on the compile-time classpath). Finally, we can remove our custom Netty logger factory. This was needed when we were on Log4j 1.2 and handled logging in our own unique way. When we migrated to Log4j 2 we could have dropped this dependency. However, even then Netty would still pick up Log4j 1.2 since it was on the classpath, thus the advantage to removing this as a dependency now.	2019-05-30 16:08:07 -04:00
Marios Trivyzas	ce30afcd01	Deprecate CommonTermsQuery and cutoff_frequency (#42619 ) (#42691 ) Since the max_score optimization landed in Elasticsearch 7, the CommonTermsQuery is redundant and slower. Moreover the cutoff_frequency parameter for MatchQuery and MultiMatchQuery is redundant. Relates to #27096 (cherry picked from commit 04b74497314eeec076753a33b3b6cc11549646e8)	2019-05-30 18:04:47 +02:00
Armin Braun	0e92ef1843	Fix Incorrect Time Math in MockTransport (#42595 ) (#42617 ) * Fix Incorrect Time Math in MockTransport * The timeunit here must be nanos for the current time (we even convert it accordingly in the logging) * Also, changed the log message when dumping stack traces a little to make it easier to grep for (otherwise it's the same as the message on unregister)	2019-05-28 17:58:23 +02:00
David Turner	746a2f41fd	Remove PRE_60_NODE_CHECKPOINT (#42531 ) This commit removes the obsolete `PRE_60_NODE_CHECKPOINT` constant for dealing with 5.x nodes' lack of sequence number support. Backport of #42527	2019-05-28 12:25:53 +01:00
Armin Braun	44bf784fe1	Add Infrastructure to Run 3rd Party Repository Tests (#42586 ) (#42604 ) * Add Infrastructure to Run 3rd Party Repository Tests * Add infrastructure to run third party repository tests using our standard JUnit infrastructure * This is a prerequisite of #42189	2019-05-28 10:46:22 +02:00
Armin Braun	c4f44024af	Remove Delete Method from BlobStore (#41619 ) (#42574 ) * Remove Delete Method from BlobStore (#41619) * The delete method on the blob store was used almost nowhere and just duplicates the delete method on the blob containers * The fact that it provided for some recursive delete logic (that did not behave the same way on all implementations) was not used and not properly tested either	2019-05-27 12:24:20 +02:00
Armin Braun	b68358945f	Dump Stacktrace on Slow IO-Thread Operations (#42000 ) (#42572 ) * Dump Stacktrace on Slow IO-Thread Operations * Follow up to #39729 extending the functionality to actually dump the stack when the thread is blocked not afterwards * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read,write,connect etc.) * Relates #41745	2019-05-27 11:44:36 +02:00
Armin Braun	489616da62	Fix testTracerLog Network Tests (#42286 ) (#42565 ) * Fix testTracerLog Network Tests * Start appender before using it like we do for e.g. the Netty leak detection appender to avoid interference from actions on the network threads that might still be dangling from previous tests in the same suite * Closes #41890	2019-05-27 11:39:59 +02:00
Nhat Nguyen	02739d038c	Mute accounting circuit breaker check after test (#42448 ) If we close an engine while a refresh is happening, then we might leak refCount of some SegmentReaders. We need to skip the ram accounting circuit breaker check until we have a new Lucene snapshot which includes the fix for LUCENE-8809. This also adds a test to the engine but left it muted so we won't forget to reenable this check. Closes #30290	2019-05-24 15:42:12 -04:00
Simon Willnauer	46ccfba808	Remove IndexStore and DirectoryService (#42446 ) Both of these classes are basically a bloated wrapper around a simple construct that can simply be a DirectoryFactory interface. This change removes both classes and replaces them with a simple stateless interface that creates a new `Directory` per shard. The concept of `index.store` is preserved since it makes sense from a configuration perspective.	2019-05-24 12:14:56 +02:00
David Turner	f864f6a740	Cluster state from API should always have a master (#42454 ) Today the `TransportClusterStateAction` ignores the state passed by the `TransportMasterNodeAction` and obtains its state from the cluster applier. This might be inconsistent, showing a different node as the master or maybe even having no master. This change adjusts the action to use the passed-in state directly, and adds tests showing that the state returned is consistent with our expectations even if there is a concurrent master failover. Fixes #38331 Relates #38432	2019-05-24 08:45:22 +01:00
Simon Willnauer	a79cd77e5c	Remove IndexShard dependency from Repository (#42213 ) * Remove IndexShard dependency from Repository In order to simplify repository testing especially for BlobStoreRepository it's important to remove the dependency on IndexShard and reduce it to Store and MapperService (in the snapshot case). This significantly reduces the dependcy footprint for Repository and allows unittesting without starting nodes or instantiate entire shard instances. This change deprecates the old method signatures and adds a unittest for FileRepository to show the advantage of this change. In addition, the unittesting surfaced a bug where the internal file names that are private to the repository were used in the recovery stats instead of the target file names which makes it impossible to relate to the actual lucene files in the recovery stats. * don't delegate deprecated methods * apply comments * test	2019-05-22 14:27:11 +02:00
Yannick Welsch	770d8e9e39	Remove usage of max_local_storage_nodes in test infrastructure (#41652 ) Moves the test infrastructure away from using node.max_local_storage_nodes, allowing us in a follow-up PR to deprecate this setting in 7.x and to remove it in 8.0. This also changes the behavior of InternalTestCluster so that starting up nodes will not automatically reuse data folders of previously stopped nodes. If this behavior is desired, it needs to be explicitly done by passing the data path from the stopped node to the new node that is started.	2019-05-22 11:04:55 +02:00
Christoph Büscher	3e59c31a12	Change IndexAnalyzers default analyzer access (#42011 ) Currently IndexAnalyzers keeps the three default as separate class members although they should refer to the same analyzers held in the additional analyzers map under the default names. This assumption should be made more explicit by keeping all analyzers in the map. This change adapts the constructor to check all the default entries are there and the getters to reach into the map with the default names when needed.	2019-05-10 18:08:51 +02:00
David Turner	4c909e93bb	Reject port ranges in `discovery.seed_hosts` (#41905 ) Today Elasticsearch accepts, but silently ignores, port ranges in the `discovery.seed_hosts` setting: ``` discovery.seed_hosts: 10.1.2.3:9300-9400 ``` Silently ignoring part of a setting like this is trappy. With this change we reject seed host addresses of this form. Closes #40786 Backport of #41404	2019-05-08 08:34:32 +01:00
Alan Woodward	4cca1e8fff	Correct spelling of MockLogAppender.PatternSeenEventExpectation (#41893 ) The class was called PatternSeenEventExcpectation. This commit is a straight class rename to correct the spelling.	2019-05-07 17:28:51 +01:00
Henning Andersen	f068a22f5f	SeqNo CAS linearizability (#38561 ) Add a test that stresses concurrent writes using ifSeqno/ifPrimaryTerm to do CAS style updates. Use linearizability checker to verify linearizability. Linearizability of successful CAS'es is guaranteed. Changed linearizability checker to allow collecting history concurrently. Changed unresponsive network simulation to wake up immediately when network disruption is cleared to ensure tests proceed in a timely manner (and this also seems more likely to provoke issues).	2019-05-07 14:04:38 +02:00
Jim Ferenczi	70bf432fa8	Fix full text queries test that start with now (#41854 ) Full text queries that start with now are not cacheable if they target a date field. However we assume in the query builder tests that all queries are cacheable and this assumption fails when the random generated query string starts with "now". This fails twice in several years since the probability that a random string starts with "now" is low but this commit ensures that isCacheable is correctly checked for full text queries that fall into this edge case. Closes #41847	2019-05-06 19:08:30 +02:00
Tim Brooks	927013426a	Read multiple TLS packets in one read call (#41820 ) This is related to #27260. Currently we have a single read buffer that is no larger than a single TLS packet. This prevents us from reading multiple TLS packets in a single socket read call. This commit modifies our TLS work to support reading similar to the plaintext case. The data will be copied to a (potentially) recycled TLS packet-sized buffer for interaction with the SSLEngine.	2019-05-06 09:51:32 -06:00
Nhat Nguyen	c7924014fa	Verify consistency of version and source in disruption tests (#41614 ) (#41661 ) With this change, we will verify the consistency of version and source (besides id, seq_no, and term) of live documents between shard copies at the end of disruption tests.	2019-05-03 18:47:14 -04:00
Jay Modi	8421e38887	Do not print null method name in reproduce line (#41691 ) This commit updates the reproduce line that is printed out when a test fails so that it does not output `.null` as the method name when the failure is not a specific method but a class level issue such as threads being leaked from the SUITE. Previously, when this occurred the reproduce line would look like: `./gradlew :server:integTest --tests "org.elasticsearch.indices.memory.breaker.CircuitBreakerServiceIT.null"` and after this change, the line no longer contains the `.null` after the class name.	2019-05-02 12:20:07 -06:00
Sandmannn	728fe2d409	Small correction in comments (#41623 )	2019-05-02 15:30:18 +03:00
Nhat Nguyen	887f3f2c83	Simplify initialization of max_seq_no of updates (#41161 ) Today we choose to initialize max_seq_no_of_updates on primaries only so we can deal with a situation where a primary is on an old node (before 6.5) which does not have MUS while replicas on new nodes (6.5+). However, this strategy is quite complex and can lead to bugs (for example #40249) since we have to assign a correct value (not too low) to MSU in all possible situations (before recovering from translog, restoring history on promotion, and handing off relocation). Fortunately, we don't have to deal with this BWC in 7.0+ since all nodes in the cluster should have MSU. This change simplifies the initialization of MSU by always assigning it a correct value in the constructor of Engine regardless of whether it's a replica or primary. Relates #33842	2019-04-30 15:14:52 -04:00
Tim Brooks	df3ef66294	Remove dedicated SSL network write buffer (#41654 ) This is related to #27260. Currently for the SSLDriver we allocate a dedicated network write buffer and encrypt the data into that buffer one buffer at a time. This requires constantly switching between encrypting and flushing. This commit adds a dedicated outbound buffer for SSL operations that will internally allocate new packet sized buffers as they are need (for writing encrypted data). This allows us to totally encrypt an operation before writing it to the network. Eventually it can be hooked up to buffer recycling. This commit also backports the following commit: Handle WRAP ops during SSL read It is possible that a WRAP operation can occur while decrypting handshake data in TLS 1.3. The SSLDriver does not currently handle this well as it does not have access to the outbound buffer during read call. This commit moves the buffer into the Driver to fix this issue. Data wrapped during a read call will be queued for writing after the read call is complete.	2019-04-29 17:59:13 -06:00
Nhat Nguyen	615a0211f0	Recovery should not indefinitely retry on mapping error (#41099 ) A stuck peer recovery in #40913 reveals that we indefinitely retry on new cluster states if indexing translog operations hits a mapper exception. We should not wait and retry if the mapping on the target is as recent as the mapping that the primary used to index the replaying operations. Relates #40913	2019-04-27 10:55:08 -04:00
Armin Braun	aad33121d8	Async Snapshot Repository Deletes (#40144 ) (#41571 ) Motivated by slow snapshot deletes reported in e.g. #39656 and the fact that these likely are a contributing factor to repositories accumulating stale files over time when deletes fail to finish in time and are interrupted before they can complete. * Makes snapshot deletion async and parallelizes some steps of the delete process that can be safely run concurrently via the snapshot thread poll * I did not take the biggest potential speedup step here and parallelize the shard file deletion because that's probably better handled by moving to bulk deletes where possible (and can still be parallelized via the snapshot pool where it isn't). Also, I wanted to keep the size of the PR manageable. * See https://github.com/elastic/elasticsearch/pull/39656#issuecomment-470492106 * Also, as a side effect this gives the `SnapshotResiliencyTests` a little more coverage for master failover scenarios (since parallel access to a blob store repository during deletes is now possible since a delete isn't a single task anymore). * By adding a `ThreadPool` reference to the repository this also lays the groundwork to parallelizing shard snapshot uploads to improve the situation reported in #39657	2019-04-26 15:36:09 +02:00
Tim Brooks	1f8ff052a1	Revert "Remove dedicated SSL network write buffer (#41283 )" This reverts commit `f65a86c258`.	2019-04-25 18:39:25 -06:00
Tim Brooks	f65a86c258	Remove dedicated SSL network write buffer (#41283 ) This is related to #27260. Currently for the SSLDriver we allocate a dedicated network write buffer and encrypt the data into that buffer one buffer at a time. This requires constantly switching between encrypting and flushing. This commit adds a dedicated outbound buffer for SSL operations that will internally allocate new packet sized buffers as they are need (for writing encrypted data). This allows us to totally encrypt an operation before writing it to the network. Eventually it can be hooked up to buffer recycling.	2019-04-25 14:30:54 -06:00
Armin Braun	40aef2b8aa	Introduce Delegating ActionListener Wrappers (#40129 ) (#41527 ) * Introduce Delegating ActionListener Wrappers * Dry up use cases of ActionListener that simply pass through the response or exception to another listener	2019-04-25 16:05:04 +02:00
Ryan Ernst	7e3875d781	Upgrade hamcrest to 2.1 (#41464 ) hamcrest has some improvements in newer versions, like FileMatchers that make assertions regarding file exists cleaner. This commit upgrades to the latest version of hamcrest so we can start using new and improved matchers.	2019-04-24 23:40:03 -07:00
Martijn Laarman	85b9dc18a7	fix #35262 define deprecations of API's as a whole and urls (#39063 ) * fix #35262 define deprecations of API's as a whole and urls * document hot threads deprecated paths * deprecate scroll_id as part of the URL, documented only as part of the body which is a safer behaviour as well * use version numbers up to patch version * rest spec parser picks up deprecated paths as paths too (cherry picked from commit 7e06023e7603b7584bfd9ee4e8a1ccd82c208ce7)	2019-04-23 14:28:36 +02:00
Jason Tedor	4a288af85f	Avoid concurrent modification in mock log appender (#41424 ) It can be the case that while we are setting up expectations that also a log message is appended. For example, if we are setting up these expectations after a cluster has formed and messages start being sent around the cluster. In this case, we would hit a concurrent modification exception while we are mutating the expectations, and also while the expectations are being iterated over as a message is appended. This commit avoids this by using a copy-on-write array list which is safe for concurrent modification and iteration. Note that another possible approach here is to use synchronized, but that seems unnecessary since we don't appear to rely on messages that are sent while we are setting up expectations. Rather, we are setting up some expectations and some situation that we think will cause those expectations to be met. Using copy-on-write array list here is nice since we avoid bottlenecking these tests on synchronizing these methods.	2019-04-22 21:47:16 -04:00
Adrien Grand	9fd5237fd4	Clean up Node#close. (#39317 ) (#41301 ) `Node#close` is pretty hard to rely on today: - it might swallow exceptions - it waits for 10 seconds for threads to terminate but doesn't signal anything if threads are still not terminated after 10 seconds This commit makes `IOException`s propagated and splits `Node#close` into `Node#close` and `Node#awaitClose` so that the decision what to do if a node takes too long to close can be done on top of `Node#close`. It also adds synchronization to lifecycle transitions to make them atomic. I don't think it is a source of problems today, but it makes things easier to reason about.	2019-04-17 16:10:53 +02:00
Armin Braun	c4e84e2b34	Add Bulk Delete Api to BlobStore (#40322 ) (#41253 ) * Adds Bulk delete API to blob container * Implement bulk delete API for S3 * Adjust S3Fixture to accept both path styles for bulk deletes since the S3 SDK uses both during our ITs * Closes #40250	2019-04-16 17:19:05 +02:00
Tim Brooks	ad3b7abaa3	Deprecate old transport settings (#41229 ) This is related to #36652. We intend to remove a number of old transport settings in 8.0. This commit deprecates those settings for 7.x.	2019-04-15 21:43:09 -06:00
Nhat Nguyen	e9999dfa1d	Init global checkpoint after copy commit in peer recovery (#40823 ) Today a new replica of a closed index does not have a safe commit invariant when its engine is opened because we won't initialize the global checkpoint on a recovering replica until the finalize step. With this change, we can achieve that property by creating a new translog with the global checkpoint from the primary at the end of phase 1.	2019-04-11 22:18:31 -04:00
David Turner	b522de975d	Move primary term from replicas proxy to repl op (#41119 ) A small refactoring that removes the primaryTerm field from ReplicasProxy and instead passes it directly in to the methods that need it. Relates #40706.	2019-04-11 21:19:27 +01:00
Armin Braun	233df6b73b	Make Transport Shard Bulk Action Async (#39793 ) (#41112 ) This is a dependency of #39504 Motivation: By refactoring `TransportShardBulkAction#shardOperationOnPrimary` to async, we enable using `DeterministicTaskQueue` based tests to run indexing operations. This was previously impossible since we were blocking on the `write` thread until the `update` thread finished the mapping update. With this change, the mapping update will trigger a new task in the `write` queue instead. This change significantly enhances the amount of coverage we get from `SnapshotResiliencyTests` (and other potential future tests) when it comes to tracking down concurrency issues with distributed state machines. The logical change is effectively all in `TransportShardBulkAction`, the rest of the changes is then simply mechanically moving the caller code and tests to being async and passing the `ActionListener` down. Since the move to async would've added more parameters to the `private static` steps in this logic, I decided to inline and dry up (between delete and update) the logic as much as I could instead of passing the listener + wait-consumer down through all of them.	2019-04-11 16:01:52 +02:00
Mark Vieira	1287c7d91f	[Backport] Replace usages RandomizedTestingTask with built-in Gradle Test (#40978 ) (#40993 ) * Replace usages RandomizedTestingTask with built-in Gradle Test (#40978) This commit replaces the existing RandomizedTestingTask and supporting code with Gradle's built-in JUnit support via the Test task type. Additionally, the previous workaround to disable all tasks named "test" and create new unit testing tasks named "unitTest" has been removed such that the "test" task now runs unit tests as per the normal Gradle Java plugin conventions. (cherry picked from commit 323f312bbc829a63056a79ebe45adced5099f6e6) * Fix forking JVM runner * Don't bump shadow plugin version	2019-04-09 11:52:50 -07:00
David Turner	2ff19bc1b7	Use Writeable for TransportReplAction derivatives (#40905 ) Relates #34389, backport of #40894.	2019-04-05 19:10:10 +01:00
Martijn van Groningen	809a5f13a4	Make -try xlint warning disabled by default. (#40833 ) Many gradle projects specifically use the -try exclude flag, because there are many cases where auto-closeable resource ignore is never referenced in body of corresponding try statement. Suppressing this warning specifically in each case that it happens using `@SuppressWarnings("try")` would be very verbose. This change removes `-try` from any gradle project and adds it to the build plugin. Also this change removes exclude flags from gradle projects that is already specified in build plugin (for example -deprecation). Relates to #40366	2019-04-05 08:02:26 +02:00
Jason Tedor	f377155f10	Use default memory lock setting in testing (#40730 ) Today we are running our internal tests with bootstrap.memory_lock enabled. This is not out default setting, and not the recommended value. This commit switches to use the default value, which is to not enable bootstrap.memory_lock.	2019-04-02 17:56:32 -04:00
Alpar Torok	293297ae3d	Fix repository-hdfs when no docker and unnecesary fixture The hdfs-fixture is actually executed in plugin/repository-hdfs as a dependency. The fixture is not needed and actually causes a failure because we have two copies now and both use the same ports.	2019-03-29 16:55:12 +02:00
Alpar Torok	2b91fb1cc0	Avoid building hdfs-fixure use an image that works instead Avoid the additional requirement for the debian package repos to be up, and depend on dockerhub only instead.	2019-03-29 16:55:11 +02:00
Alpar Torok	e8c0b53796	Add ability to mute and mute flaky fixture (#40630 )	2019-03-29 12:10:04 +02:00
Alpar Torok	d791e08932	Test fixtures krb5 (#40297 ) Replaces the vagrant based kerberos fixtures with docker based test fixtures plugin. The configuration is now entirely static on the docker side and no longer driven by Gradle, also two different services are being configured since there are two different consumers of the fixture that can run in parallel and require different configurations.	2019-03-28 17:26:58 +02:00
Yannick Welsch	fea91c6113	Mute testTracerLog Relates to #40586	2019-03-28 14:05:54 +01:00
David Turner	5a2ba34174	Get node ID from nodes info in REST tests (#40052 ) (#40532 ) We discussed recently that the cluster state API should be considered "internal" and therefore our usual cast-iron stability guarantees do not hold for this API. However, there are a good number of REST tests that try to identify the master node. Today they call `GET /_cluster/state` API and extract the master node ID from the response. In fact many of these tests just want an arbitary node ID (or perhaps a data node ID) so an alternative is to call `GET _nodes` or `GET _nodes/data:true` and obtain a node ID from the keys of the `nodes` map in the response. This change adds the ability for YAML-based REST tests to extract an arbitrary key from a map so that they can obtain a node ID from the nodes info API instead of using the master node ID from the cluster state API. Relates #40047.	2019-03-27 23:08:10 +00:00
Tim Brooks	ab44f5fd5d	Add InboundHandler for inbound message handling (#40430 ) This commit adds an InboundHandler to handle inbound message processing. With this commit, this code is moved out of the TcpTransport. Additionally, finer grained unit tests are added to ensure that the inbound processing works as expected	2019-03-27 12:33:26 -06:00
Yannick Welsch	64b31f44af	No mapper service and index caches for replicated closed indices (#40423 ) Replicated closed indices can't be indexed into or searched, and therefore don't need a shard with full indexing and search capabilities allocated. We can save on a lot of heap memory for those indices by not allocating a mapper service and caching infrastructure (which preallocates a constant amount per instance). Before this change, a 1GB ES instance could host 250 replicated closed metricbeat indices (each index with one shard). After this change, the same instance can host 7300 replicated closed metricbeat instances (not that this would be a recommended configuration). Most of the remaining memory is in the cluster state and the IndexSettings object.	2019-03-27 19:04:24 +01:00
Yannick Welsch	8f7c5732f1	Use default discovery implementation for single-node discovery (#40036 ) Switches "discovery.type: single-node" from using a separate implementation for single-node discovery to using the existing standard discovery implementation, with two small adaptions: - auto-bootstrapping, but requiring initial_master_nodes not to be set. - not actively pinging other nodes using the Peerfinder - not allowing other nodes to join its single-node cluster (if they have e.g. been set up using regular discovery and connect to the single-disco node).	2019-03-27 19:04:24 +01:00
Tim Brooks	3860ddd1a4	Move outbound message handling to OutboundHandler (#40336 ) Currently there are some components of message serializer and sending that still occur in TcpTransport. This commit makes it possible to send a message without the TcpTransport by moving all of the remaining application logic to the OutboundHandler. Additionally, it adds unit tests to ensure that this logic works as expected.	2019-03-27 11:47:36 -06:00
Tim Brooks	760cfffe4b	Move TransportMessageListener to TransportService (#40474 ) Currently the TransportMessageListener is applied and used in the Transport class. However, local requests and responses never make it to this class. This PR moves the listener add/remove methods to the TransportService. After this change the Transport can only have one listener set with it. This one listener is the TransportService, which will then propogate the events to the external listeners. Additionally this commit back ports #40237 Remove Tracer from MockTransportService Currently the TransportMessageListener is applied and used in the Transport class. However, local requests and responses never make it to this class. This PR moves the listener add/remove methods to the TransportService. After this change the Transport can only have one listener set with it. This one listener is the TransportService, which will then propogate the events to the external listeners.	2019-03-27 09:24:20 -06:00
Henning Andersen	bf444b9f02	Store Pending Deletions Fix (#40345 ) FilterDirectory.getPendingDeletions does not delegate, fixed temporarily by overriding in StoreDirectory. This in turn caused duplicate file name use after a trimUnsafeCommits had been done, since a new IndexWriter would not consider the pending deletes in IndexFileDeleter. This should only happen on windows (AFAIK). Reenabled doing index updates for all tests using IndexShardTests.indexOnReplicaWithGaps (which could fail due to above when using mocked WindowsFS). Added getPendingDeletions delegation to all elasticsearch FilterDirectory subclasses that were not trivial test-only overrides to minimize the risk of hitting this issue in another case.	2019-03-26 15:30:44 +01:00
Armin Braun	cafb83297c	Use Correct Enum in Wipe Snapshots Test Method (#40422 ) (#40438 ) * Mistake was made in #39662 * The response deserialized here is `org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsResponse` which uses `org.elasticsearch.snapshots.SnapshotInfo` which uses `org.elasticsearch.snapshots.SnapshotState` and not the shard state	2019-03-26 09:04:15 +01:00
Nhat Nguyen	efaf95628b	Use separate translog dir in testDeleteWithFatalError This test currently opens a new engine but shares the same translog directory of the previous opening engine.	2019-03-20 10:22:27 -04:00
Mayya Sharipova	49a7c6e0e8	Expose proximity boosting (#39385 ) (#40251 ) Expose DistanceFeatureQuery for geo, date and date_nanos types Closes #33382	2019-03-20 09:24:41 -04:00
Henning Andersen	4c2a8638ca	Cascading primary failure lead to MSU too low (#40249 ) If a replica were first reset due to one primary failover and then promoted (before resync completes), its MSU would not include changes since global checkpoint, leading to errors during translog replay. Fixed by re-initializing MSU before restoring local history.	2019-03-20 14:00:43 +01:00
Nhat Nguyen	a13b4bc8c5	Always fail engine if delete operation fails (#40117 ) Unlike index operations which can fail at the document level to analyzing errors, delete operations should never fail at the document level whether soft-deletes is enabled or not. With this change, we will always fail the engine if we fail to apply a delete operation to Lucene. Closes #33256	2019-03-19 13:09:23 -04:00
Nhat Nguyen	38e9522218	Remove wait for cluster state step in peer recovery (#40004 ) We introduced WAIT_CLUSTERSTATE action in #19287 (5.0), but then stopped using it since #25692 (6.0). This change removes that action and related code in 7.x and 8.0. Relates #19287 Relates #25692	2019-03-18 15:17:21 -04:00
Nhat Nguyen	9ba0bdf528	Dump cluster state if ensureGreen timed out in QA tests (#40133 ) When the method ensureGreen in QA tests is timed out, it does not provide enough info for us to investigate why the testing index is not green yet. With this change, we will dump the cluster state if ensureGreen timed out. Relates #32027	2019-03-18 15:17:21 -04:00
Nhat Nguyen	d720a64b9e	Ensure sendBatch not called recursively (#39988 ) This PR introduces AsyncRecoveryTarget which executes remote calls of peer recovery asynchronously. In this change, we also add a new assertion to ensure that method sendBatch, which sends a batch of history operations in phase2, is never called recursively on the same thread. This new assertion will also be used in method sendFileChunks.	2019-03-18 15:17:21 -04:00
Andrey Ershov	42602478b8	Unmute, fix, refactor and zen2ify NetworkDisruptionIT (#38351 ) This commit unmutes NetworkDisruptionIT. It makes changes necessary for Zen2 - avoids usage of autoMinMasterNodes and selects cluster size, such that there is no need to call AddVotingExclusion. This test also introduces refactors a single method prepareDistruptedCluster to be used by both test methods. Unfortunately, NetworkDisruption is broken and the testNetworkPartitionRemovalRestoresConnections "is fixed" by introducing assertBusy - #38348. Relates #36205 Relates #38348 (cherry picked from commit 97707c7f892636e5b75c3df546b067414acb27cd)	2019-03-18 16:39:43 +01:00
Jim Ferenczi	eb540125ea	Fix IndexSearcherWrapper visibility (#39071 ) (#40145 ) This change adds a wrapper for IndexSearcher that makes IndexSearcher#search(List, Weight, Collector) visible by sub-classes. The wrapper is used by the ContextIndexSearcher to call this protected method on a searcher created by a plugin. This ensures that an override of the protected method in an IndexSearcherWrapper plugin is called when a search is executed. Closes #30758	2019-03-18 11:33:54 +01:00
Tim Brooks	0b50a670a4	Remove transport name from tcp channel (#40074 ) Currently, we maintain a transport name ("mock-nio", "nio", "netty") that is passed to a `TcpTransportChannel` when a request is received. The value of this name is to associate with the task when we register a task with the task manager. However, it is only possible to run ES with one transport, so having an implementation specific name is unnecessary. This commit removes the name and replaces it with the generic "transport".	2019-03-15 12:04:13 -06:00
David Turner	a323132503	Create retention leases file during recovery (#39359 ) Today we load the shard history retention leases from disk whenever opening the engine, and treat a missing file as an empty set of leases. However in some cases this is inappropriate: we might be restoring from a snapshot (if the target index already exists then there may be leases on disk) or force-allocating a stale primary, and in neither case does it make sense to restore the retention leases from disk. With this change we write an empty retention leases file during recovery, except for the following cases: - During peer recovery the on-disk leases may be accurate and could be needed if the recovery target is made into a primary. - During recovery from an existing store, as long as we are not force-allocating a stale primary. Relates #37165	2019-03-15 07:49:49 +00:00
Jason Tedor	9181668edf	Stop returning cluster state size by default (#40016 ) Computing the compressed size of the cluster state on every invocation of cluster:monitor/state action is expensive, and the value of this field is dubious anyway. Therefore we want to remove computing this field. As a first step, we stop computing and return this field by default. To avoid breaking users, we will give them a system property to use to tide them over until the next major release when we will actually remove this field. This comes with a deprecation warning too, and the backport to the appropriate minor will also include a note in the migration guide. There will be a follow-up to remove this field in the next major version.	2019-03-14 08:57:55 -04:00
David Turner	049970af3e	Only connect to new nodes on new cluster state (#39629 ) Today, when applying new cluster state we attempt to connect to all of its nodes as a blocking part of the application process. This is the right thing to do with new nodes, and is a no-op on any already-connected nodes, but is questionable on known nodes from which we are currently disconnected: there is a risk that we are partitioned from these nodes so that any attempt to connect to them will hang until it times out. This can dramatically slow down the application of new cluster states which hinders the recovery of the cluster during certain kinds of partition. If nodes are disconnected from the master then it is likely that they are to be removed as part of a subsequent cluster state update, so there's no need to try and reconnect to them like this. Moreover there is no need to attempt to reconnect to disconnected nodes as part of the cluster state application process, because we periodically try and reconnect to any disconnected nodes, and handle their disconnectedness reasonably gracefully in the meantime. This commit alters this behaviour to avoid reconnecting to known nodes during cluster state application. Resolves #29025.	2019-03-12 19:26:13 +00:00
Tim Brooks	5612ed97ca	Add log warnings for long running event handling (#39729 ) Recently we have had a number of test issues related to blocking activity occuring on the io thread. This commit adds a log warning for when handling event takes a >150 milliseconds. This is implemented for the MockNioTransport which is the transport used in ESIntegTestCase.	2019-03-08 13:07:24 -07:00
Alpar Torok	0f89427eb6	Back port build changes from same version bwc tests (#39744 ) * Back port build changes from #39102 This back-ports how versions are determined and bwc test are set up from #39102 without enabling the bwc from current version tests so it's easier/possible to backmerge future buld changes. It's expected that the tets are lacking many of the required fixes in this version to enable them.	2019-03-07 17:25:09 +02:00
Alpar Torok	34ea84948c	Fix bwc tests failure to extract (#39619 ) * Set correct packaging for older versions Continue using zip packages for pre 7 Some other bwc fixes that hid behind this one. Closes #39441 #39751	2019-03-07 09:07:11 +02:00
Armin Braun	bb2f8485f1	Wipe Snapshots Before Indices in RestTests (#39662 ) (#39763 ) * Wipe Snapshots Before Indices in RestTests * If we have a snapshot ongoing from the previous test and enter this method, then deleting the indices fails, which in turn fails the whole wipe * Fixed by first deleting/aborting snapshots	2019-03-06 21:48:17 +01:00
Nhat Nguyen	1fe7cb594f	Don’t ack if unable to remove failing replica (#39584 ) Today when a replicated write operation fails to execute on a replica, the primary will reach out to the master to fail that replica (and mark it stale). We then won't ack that request until the master removes the failing replica; otherwise, we will lose the acked operation if the failed replica is still in the in-sync set. However, if a node with the primary is shutting down, we might ack such request even though we are unable to send a shard-failure request to the master. This happens because we ignore NodeClosedException which is triggered when the ClusterService is being closed. Closes #39467	2019-03-06 15:30:55 -05:00
Simon Willnauer	e620fb2e4a	Add option to force load term dict into memory (#39741 ) Lucene added an optimization to leave the term dictionary on disk for non-id like fields. This change happened very late in the release processes such that it's better to have an escape hatch if certain use-cases are hurt by this optimization. This setting might be removed in the future if it turns out to be unnecessary.	2019-03-06 15:29:04 +01:00
David Turner	295e39a8c8	Drop node if asymmetrically partitioned from master (#39598 ) When a node is joining the cluster we ensure that it can send requests to the master _at that time_. If it joins the cluster and _then_ loses the ability to send requests to the master then it should be removed from the cluster. Today this is not the case: the master can still receive responses to its follower checks, and receives acknowledgements to cluster state publications, so has no reason to remove the node. This commit changes the handling of follower checks so that they fail if they come from a master that the other node was following but which it now believes to have failed.	2019-03-06 09:41:57 +00:00
Armin Braun	aaecaf59a4	Optimize Bulk Message Parsing and Message Length Parsing (#39634 ) (#39730 ) * Optimize Bulk Message Parsing and Message Length Parsing * findNextMarker took almost 1ms per invocation during the PMC rally track * Fixed to be about an order of magnitude faster by using Netty's bulk `ByteBuf` search * It is unnecessary to instantiate an object (the input stream wrapper) and throw it away, just to read the `int` length from the message bytes * Fixed by adding bulk `int` read to BytesReference	2019-03-06 08:13:15 +01:00
Yannick Welsch	936dbb00e3	Isolate Zen1 (#39470 ) Cherry-picks a few commits from #39466 to align 7.x with master branch.	2019-03-04 15:51:17 +01:00
Adrien Grand	934946a232	Don't swallow exception in ThreadPool.terminate. (#39038 ) (#39623 ) The use of `closeWhileHandlingException` means that any exception while trying to close the threadpool is going to be swallowed. Relates #39030	2019-03-04 10:58:29 +01:00
Tanguy Leroux	e005eeb0b3	Backport support for replicating closed indices to 7.x (#39506 )(#39499 ) Backport support for replicating closed indices (#39499) Before this change, closed indexes were simply not replicated. It was therefore possible to close an index and then decommission a data node without knowing that this data node contained shards of the closed index, potentially leading to data loss. Shards of closed indices were not completely taken into account when balancing the shards within the cluster, or automatically replicated through shard copies, and they were not easily movable from node A to node B using APIs like Cluster Reroute without being fully reopened and closed again. This commit changes the logic executed when closing an index, so that its shards are not just removed and forgotten but are instead reinitialized and reallocated on data nodes using an engine implementation which does not allow searching or indexing, which has a low memory overhead (compared with searchable/indexable opened shards) and which allows shards to be recovered from peer or promoted as primaries when needed. This new closing logic is built on top of the new Close Index API introduced in 6.7.0 (#37359). Some pre-closing sanity checks are executed on the shards before closing them, and closing an index on a 8.0 cluster will reinitialize the index shards and therefore impact the cluster health. Some APIs have been adapted to make them work with closed indices: - Cluster Health API - Cluster Reroute API - Cluster Allocation Explain API - Recovery API - Cat Indices - Cat Shards - Cat Health - Cat Recovery This commit contains all the following changes (most recent first): * c6c42a1 Adapt NoOpEngineTests after #39006 * 3f9993d Wait for shards to be active after closing indices (#38854) * 5e7a428 Adapt the Cluster Health API to closed indices (#39364) * 3e61939 Adapt CloseFollowerIndexIT for replicated closed indices (#38767) * 71f5c34 Recover closed indices after a full cluster restart (#39249) * 4db7fd9 Adapt the Recovery API for closed indices (#38421) * 4fd1bb2 Adapt more tests suites to closed indices (#39186) * 0519016 Add replica to primary promotion test for closed indices (#39110) * b756f6c Test the Cluster Shard Allocation Explain API with closed indices (#38631) * c484c66 Remove index routing table of closed indices in mixed versions clusters (#38955) * 00f1828 Mute CloseFollowerIndexIT.testCloseAndReopenFollowerIndex() * e845b0a Do not schedule Refresh/Translog/GlobalCheckpoint tasks for closed indices (#38329) * cf9a015 Adapt testIndexCanChangeCustomDataPath for replicated closed indices (#38327) * b9becdd Adapt testPendingTasks() for replicated closed indices (#38326) * 02cc730 Allow shards of closed indices to be replicated as regular shards (#38024) * e53a9be Fix compilation error in IndexShardIT after merge with master * cae4155 Relax NoOpEngine constraints (#37413) * 54d110b [RCI] Adapt NoOpEngine to latest FrozenEngine changes * c63fd69 [RCI] Add NoOpEngine for closed indices (#33903) Relates to #33888	2019-03-01 14:48:26 +01:00
Lee Hinman	dae48ba262	Add details about what acquired the shard lock last (#38807 ) This adds a `details` parameter to shard locking in `NodeEnvironment`. This is intended to be used for diagnosing issues such as ``` 1> [2019-02-11T14:34:19,262][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s0] [.tasks/oSYOG0-9SHOx_pfAoiSExQ] deleting index 1> [2019-02-11T14:34:19,279][WARN ][o.e.i.IndicesService ] [node_s0] [.tasks/oSYOG0-9SHOx_pfAoiSExQ] failed to delete index 1> org.elasticsearch.env.ShardLockObtainFailedException: [.tasks][0]: obtaining shard lock timed out after 0ms 1> at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:736) ~[main/:?] 1> at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:655) ~[main/:?] 1> at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:601) ~[main/:?] 1> at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:554) ~[main/:?] ``` In the hope that we will be able to determine why the shard is still locked. Relates to #30290 as well as some other CI failures	2019-02-28 10:50:47 -07:00
Tal Levy	f538b30af9	ensure no initializing shards during cluster cleanup (#39283 ) (#39480 ) there are testing situations where newly created indices are being wiped before they are fully initialized. This results in an edge-case in the shard-locking strategy where an index cannot be deleted. This should fix that	2019-02-27 15:56:33 -08:00
Armin Braun	28b771f5db	Remove Dead Code Test Infrastructure (#39192 ) (#39436 ) * Just removing some obviously unused things	2019-02-27 09:38:47 +01:00
Tim Brooks	f24dae302d	Make security tests transport agnostic (#39411 ) Currently there are two security tests that specifically target the netty security transport. This PR moves the client authentication tests into `AbstractSimpleSecurityTransportTestCase` so that the nio transport will also be tested. Additionally the work to build transport configurations is moved out of the netty transport and tested independently.	2019-02-26 18:55:19 -07:00
Jason Tedor	a6c0166d68	Renew retention leases while following (#39335 ) This commit is the final piece of the integration of CCR with retention leases. Namely, we periodically renew retention leases and advance the retaining sequence number while following.	2019-02-25 17:14:19 -05:00
Nhat Nguyen	48219112e3	Do not wait for advancement of checkpoint in recovery (#39006 ) With this change, we won't wait for the local checkpoint to advance to the max_seq_no before starting phase2 of peer-recovery. We also remove the sequence number range check in peer-recovery. We can safely do these thanks to Yannick's finding. The replication group to be used is currently sampled after indexing into the primary (see `ReplicationOperation` class). This means that when initiating tracking of a new replica, we have to consider the following two cases: - There are operations for which the replication group has not been sampled yet. As we initiated the new replica as tracking, we know that those operations will be replicated to the new replica and follow the typical replication group semantics (e.g. marked as stale when unavailable). - There are operations for which the replication group has already been sampled. These operations will not be sent to the new replica. However, we know that those operations are already indexed into Lucene and the translog on the primary, as the sampling is happening after that. This means that by taking a snapshot of Lucene or the translog, we will be getting those ops as well. What we cannot guarantee anymore is that all ops up to `endingSeqNo` are available in the snapshot (i.e. also see comment in `RecoverySourceHandler` saying `We need to wait for all operations up to the current max to complete, otherwise we can not guarantee that all operations in the required range will be available for replaying from the translog of the source.`). This is not needed, though, as we can no longer guarantee that max seq no == local checkpoint. Relates #39000 Closes #38949 Co-authored-by: Yannick Welsch <yannick@welsch.lu>	2019-02-25 12:10:14 -05:00
Armin Braun	50d2736746	Fix Deadlock from Thread.suspend in Test (#39261 ) (#39341 ) * The lambda invoked by the `lockedExecutor` eventually gets JITed (which runs a static initializer that we will suspend in with a very tiny chance). * Fixed by creating the `Runnable` in the main test thread and using the same instance in all threads * Closes #35686	2019-02-25 09:15:19 +01:00
Tim Brooks	44df76251f	Rebuild remote connections on profile changes (#39146 ) Currently remote compression and ping schedule settings are dynamic. However, we do not listen for changes. This commit adds listeners for changes to those two settings. Additionally, when those settings change we now close existing connections and open new ones with the settings applied. Fixes #37201.	2019-02-21 14:00:39 -07:00
Armin Braun	1a21cc0357	Simplify and Fix Synchronization in InternalTestCluster (#39168 ) (#39241 ) * Simplify and Fix Synchronization in InternalTestCluster (#39168) * Remove unnecessary `synchronized` statements * Make `Predicate`s constants where possible * Cleanup some stream usage * Make unsafe public methods `synchronized` * Closes #37965 * Closes #37275 * Closes #37345	2019-02-21 16:27:18 +01:00
Lee Hinman	d9de899316	Wrap accounting breaker check in assertBusy (#39211 ) There may be situations where indices have not yet been closed from a Lucene perspective, causing the breaker to not immediately be at 0 Relates to #30290	2019-02-21 08:00:31 -07:00
Marios Trivyzas	1316825f52	Replace superfluous usage of Counter with Supplier (#39048 ) (#39225 ) `Counter` was used as a means of a functional argument to pass the relative cached time before `Supplier` iface was introduced.	2019-02-21 12:42:54 +02:00
Nhat Nguyen	820ba8169e	Add retention leases replication tests (#38857 ) This commit introduces the retention leases to ESIndexLevelReplicationTestCase, then adds some tests verifying that the retention leases replication works correctly in spite of the presence of the primary failover or out of order delivery of retention leases sync requests. Relates #37165	2019-02-20 19:21:00 -05:00
Tal Levy	cb7e3708bc	Rollup jobs should be cleaned up before indices are deleted (#38930 ) (#39144 ) Rollup jobs should be stopped + deleted before the indices are removed. It's possible for an active rollup job to issue a bulk request, the test ends and the cleanup code deletes all indices. The in-flight bulk request will then stall + error because the index no-longer exists... but this process might take longer than the StopRollup timeout. Which means the test fails, and often fails several other tests since the job is still active (e.g. other tests cannot create the same-named job, or fail to stop the job in their cleanup because it's still stalled). This tends to knock over several tests before the bulk finally times out and the job shuts down. Instead, we need to simply stop jobs first. Inflight bulks will resolve quickly, and we can carry on with deleting indices after the jobs are confirmed inactive. stop-job.asciidoc tended to trigger this issue because it executed an async stop API and then exited, which setup the above situation. In can and did happen with other tests though. As an extra precaution, the doc test was modified to substitute in wait_for_completion to help head off these issues too.	2019-02-20 11:12:01 -08:00
Adrien Grand	d8852b83d0	Don't swallow IOExceptions in InternalTestCluster. (#39068 ) Relates #39030	2019-02-19 15:03:47 +01:00
Ioannis Kakavas	59e9a0f4f4	Disable specific locales for tests in fips mode (#38938 ) * Disable specific locales for tests in fips mode The Bouncy Castle FIPS provider that we use for running our tests in fips mode has an issue with locale sensitive handling of Dates as described in https://github.com/bcgit/bc-java/issues/405 This causes certificate validation to fail if any given test that includes some form of certificate validation happens to run in one of the locales. This manifested earlier in #33081 which was handled insufficiently in #33299 This change ensures that the problematic 3 locales * th-TH * ja-JP-u-ca-japanese-x-lvariant-JP * th-TH-u-nu-thai-x-lvariant-TH will not be used when running our tests in a FIPS 140 JVM. It also reverts #33299	2019-02-19 08:46:08 +02:00
David Roberts	ae9243ad0a	Reduce single node test cleanup logging (#39060 ) As per https://github.com/elastic/elasticsearch/pull/39049#discussion_r257719530	2019-02-18 17:38:49 +00:00
Nhat Nguyen	2947ccf5c3	Add remote recovery to ShardFollowTaskReplicationTests (#39007 ) We simulate remote recovery in ShardFollowTaskReplicationTests by bootstrapping the follower with the safe commit of the leader. Relates #35975	2019-02-18 09:57:56 -05:00
Nhat Nguyen	7e20a92888	Advance max_seq_no before add operation to Lucene (#38879 ) Today when processing an operation on a replica engine (or the following engine), we first add it to Lucene, then add it to translog, then finally marks its seq_no as completed. If a flush occurs after step1, but before step-3, the max_seq_no in the commit's user_data will be smaller than the seq_no of some documents in the Lucene commit.	2019-02-15 21:04:28 -05:00
Christoph Büscher	9f6c77fad4	Fix FullClusterRestartIT#testSnapshotRestore (#38795 ) This test failed on 7.1 when running full cluster restart tests against pre-7.0 clusters (e.g. 6.6 clusters). The fixes the expected type in the templates after the cluster restart.	2019-02-15 20:12:26 +01:00
Lee Hinman	7d449c5f65	Check that delete index request succeeded in test teardown (#38903 ) (#38913 ) Backport of #38903 When tearing down from `ESSingleNodeTestCase` we perform a delete on "*" indices, it some cases, however, those indices are not fully deleted. Rather than have a failure occur later down the change (see: https://github.com/elastic/elasticsearch/issues/30290#issuecomment-463589008 ) the failure should occurr immediately so it can be diagnosed more easily.	2019-02-14 13:46:17 -07:00
Julie Tibshirani	e769cb4efd	Perform precise check for types warnings in cluster restart tests. (#37944 ) Instead of using `WarningsHandler.PERMISSIVE`, we only match warnings that are due to types removal. This PR also renames `allowTypeRemovalWarnings` to `allowTypesRemovalWarnings`. Relates to #37920.	2019-02-13 11:28:58 -08:00
Nhat Nguyen	a3f39741be	Adjust log and unmute testFailOverOnFollower (#38762 ) There were two documents (seq=2 and seq=103) missing on the follower in one of the failures of `testFailOverOnFollower`. I spent several hours on that failure but could not figure out the reason. I adjust log and unmute this test so we can collect more information. Relates #38633	2019-02-12 11:42:25 -05:00
Nhat Nguyen	225ebb6935	Ensure no snapshotted commit when close engine (#38663 ) With this change, we can automatically detect an implementation that acquires an index commit but fails to release.	2019-02-12 11:39:35 -05:00
Tim Brooks	023e3c207a	Concurrent file chunk fetching for CCR restore (#38656 ) Adds the ability to fetch chunks from different files in parallel, configurable using the new `ccr.indices.recovery.max_concurrent_file_chunks` setting, which defaults to 5 in this PR. The implementation uses the parallel file writer functionality that is also used by peer recoveries.	2019-02-09 21:19:57 -07:00
Tim Vernum	84483b26cf	Fix version logic when bumping major version (#38593 ) When we are preparing to release a major version the rules around "unreleased" versions and branches get a bit more complex. This change implements the following rules: - If the tip version on the previous major is a .0 (e.g. 6.7.0) then the tip of the minor before that (e.g. 6.6.1) must be unreleased. (This is because 6.7.0 would be "staged" in preparation for release, but 6.6.1 would be open for bug fixes on the release 6.6.x line) (in VersionCollection & VersionUtils) - The "major.x" branch (if it exists) will always point to the latest minor in that series. Anything that is not the latest minor, must therefore be on a the "major.minor" branch For example, if v7.1.0 exists then the "7.x" branch must be 7.1.0, and 7.0.0 must be on the "7.0" branch (in VersionCollection)	2019-02-08 18:00:03 +11:00
Jason Tedor	fdf6b3f23f	Add 7.1 version constant to 7.x branch (#38513 ) This commit adds the 7.1 version constant to the 7.x branch. Co-authored-by: Andy Bristol <andy.bristol@elastic.co> Co-authored-by: Tim Brooks <tim@uncontended.net> Co-authored-by: Christoph Büscher <cbuescher@posteo.de> Co-authored-by: Luca Cavanna <javanna@users.noreply.github.com> Co-authored-by: markharwood <markharwood@gmail.com> Co-authored-by: Ioannis Kakavas <ioannis@elastic.co> Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co> Co-authored-by: David Roberts <dave.roberts@elastic.co> Co-authored-by: Jason Tedor <jason@tedor.me> Co-authored-by: Alpar Torok <torokalpar@gmail.com> Co-authored-by: David Turner <david.turner@elastic.co> Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com> Co-authored-by: Tim Vernum <tim@adjective.org> Co-authored-by: Albert Zaharovits <albert.zaharovits@gmail.com>	2019-02-07 16:32:27 -05:00
David Turner	5a3c452480	Align docs etc with new discovery setting names (#38492 ) In #38333 and #38350 we moved away from the `discovery.zen` settings namespace since these settings have an effect even though Zen Discovery itself is being phased out. This change aligns the documentation and the names of related classes and methods with the newly-introduced naming conventions.	2019-02-06 11:34:38 +00:00
Luca Cavanna	a7046e001c	Remove support for maxRetryTimeout from low-level REST client (#38085 ) We have had various reports of problems caused by the maxRetryTimeout setting in the low-level REST client. Such setting was initially added in the attempts to not have requests go through retries if the request already took longer than the provided timeout. The implementation was problematic though as such timeout would also expire in the first request attempt (see #31834), would leave the request executing after expiration causing memory leaks (see #33342), and would not take into account the http client internal queuing (see #25951). Given all these issues, it seems that this custom timeout mechanism gives little benefits while causing a lot of harm. We should rather rely on connect and socket timeout exposed by the underlying http client and accept that a request can overall take longer than the configured timeout, which is the case even with a single retry anyways. This commit removes the `maxRetryTimeout` setting and all of its usages.	2019-02-06 08:43:47 +01:00
Tim Brooks	c2a8fe1f91	Prevent CCR recovery from missing documents (#38237 ) Currently the snapshot/restore process manually sets the global checkpoint to the max sequence number from the restored segements. This does not work for Ccr as this will lead to documents that would be recovered in the normal followering operation from being recovered. This commit fixes this issue by setting the initial global checkpoint to the existing local checkpoint.	2019-02-05 13:32:41 -06:00
David Turner	f2dd5dd6eb	Remove DiscoveryPlugin#getDiscoveryTypes (#38414 ) With this change we no longer support pluggable discovery implementations. No known implementations of `DiscoveryPlugin` actually override this method, so in practice this should have no effect on the wider world. However, we were using this rather extensively in tests to provide the `test-zen` discovery type. We no longer need a separate discovery type for tests as we no longer need to customise its behaviour. Relates #38410	2019-02-05 17:42:24 +00:00
David Turner	b7ab521eb1	Throw AssertionError when no master (#38432 ) Today we throw a fatal `RuntimeException` if an exception occurs in `getMasterName()`, and this includes the case where there is currently no master. However, sometimes we call this method inside an `assertBusy()` in order to allow for a cluster that is in the process of stabilising and electing a master. The trouble is that `assertBusy()` only retries on an `AssertionError` and not on a general `RuntimeException`, so the lack of a master is immediately fatal. This commit fixes the issue by asserting there is a master, triggering a retry if there is not. Fixes #38331	2019-02-05 17:11:20 +00:00
David Turner	3b2a0d7959	Rename no-master-block setting (#38350 ) Replaces `discovery.zen.no_master_block` with `cluster.no_master_block`. Any value set for the old setting is now ignored.	2019-02-05 08:47:56 +00:00
David Turner	2d114a02ff	Rename static Zen1 settings (#38333 ) Renames the following settings to remove the mention of `zen` in their names: - `discovery.zen.hosts_provider` -> `discovery.seed_providers` - `discovery.zen.ping.unicast.concurrent_connects` -> `discovery.seed_resolver.max_concurrent_resolvers` - `discovery.zen.ping.unicast.hosts.resolve_timeout` -> `discovery.seed_resolver.timeout` - `discovery.zen.ping.unicast.hosts` -> `discovery.seed_addresses`	2019-02-05 08:46:52 +00:00
Yogesh Gaikwad	fe36861ada	Add support for API keys to access Elasticsearch (#38291 ) X-Pack security supports built-in authentication service `token-service` that allows access tokens to be used to access Elasticsearch without using Basic authentication. The tokens are generated by `token-service` based on OAuth2 spec. The access token is a short-lived token (defaults to 20m) and refresh token with a lifetime of 24 hours, making them unsuitable for long-lived or recurring tasks where the system might go offline thereby failing refresh of tokens. This commit introduces a built-in authentication service `api-key-service` that adds support for long-lived tokens aka API keys to access Elasticsearch. The `api-key-service` is consulted after `token-service` in the authentication chain. By default, if TLS is enabled then `api-key-service` is also enabled. The service can be disabled using the configuration setting. The API keys:- - by default do not have an expiration but expiration can be configured where the API keys need to be expired after a certain amount of time. - when generated will keep authentication information of the user that generated them. - can be defined with a role describing the privileges for accessing Elasticsearch and will be limited by the role of the user that generated them - can be invalidated via invalidation API - information can be retrieved via a get API - that have been expired or invalidated will be retained for 1 week before being deleted. The expired API keys remover task handles this. Following are the API key management APIs:- 1. Create API Key - `PUT/POST /_security/api_key` 2. Get API key(s) - `GET /_security/api_key` 3. Invalidate API Key(s) `DELETE /_security/api_key` The API keys can be used to access Elasticsearch using `Authorization` header, where the auth scheme is `ApiKey` and the credentials, is the base64 encoding of API key Id and API key separated by a colon. Example:- ``` curl -H "Authorization: ApiKey YXBpLWtleS1pZDphcGkta2V5" http://localhost:9200/_cluster/health ``` Closes #34383	2019-02-05 14:21:57 +11:00
Mayya Sharipova	641704464d	Deprecate types in rollover index API (#38039 ) Relates to #35190	2019-02-04 16:07:45 -05:00
markharwood	578fd14257	Types removal - fix FullClusterRestartIT warning expectations (#38310 ) Relax test warning message checking to pre-empt PR 38022 landing in 6.7 with new warning messages. The relaxed test now just assumes any warning message starting with “[types removal]” is tolerated rather than the precise phrasing used in the 6.7 branch.	2019-02-04 20:09:07 +00:00
Jason Tedor	625d37a26a	Introduce retention lease background sync (#38262 ) This commit introduces a background sync for retention leases. The idea here is that we do a heavyweight sync when adding a new retention lease, and then periodically we want to background sync any retention lease renewals to the replicas. As long as the background sync interval is significantly lower than the extended lifetime of a retention lease, it is okay if from time to time a replica misses a sync (it will still have an older version of the lease that is retaining more data as we assume that renewals do not decrease the retaining sequence number). There are two follow-ups that will come after this commit. The first is to address the fact that we have not adapted the should periodically flush logic to possibly flush the retention leases. We want to do something like flush if we have not flushed in the last five minutes and there are renewed retention leases since the last time that we flushed. An additional follow-up will remove the syncing of retention leases when a retention lease expires. Today this sync could be invoked in the background by a merge operation. Rather, we will move the syncing of retention lease expiration to be done under the background sync. The background sync will use the heavyweight sync (write action) if a lease has expired, and will use the lightweight background sync (replication action) otherwise.	2019-02-04 10:35:29 -05:00
Lee Hinman	f19fdcd491	Re-enable accounting breaker check in InternalTestCluster (#38131 ) Relates to #30290 The intent for this is to see whether this failure still happens, and if so, provide more up-to-date logs for analysis.	2019-02-04 07:40:59 -07:00
David Turner	1d82a6d9f9	Deprecate unused Zen1 settings (#38289 ) Today the following settings in the `discovery.zen` namespace are still used: - `discovery.zen.no_master_block` - `discovery.zen.hosts_provider` - `discovery.zen.ping.unicast.concurrent_connects` - `discovery.zen.ping.unicast.hosts.resolve_timeout` - `discovery.zen.ping.unicast.hosts` This commit deprecates all other settings in this namespace so that they can be removed in the next major version.	2019-02-04 08:52:08 +00:00
David Turner	c311062476	Add CoordinatorTests for empty unicast hosts list (#38209 ) Today we have DiscoveryDisruptionIT tests for checking that discovery can still work once the cluster has formed, even if the cluster is misconfigured and only has a single master-eligible node in its unicast hosts list. In fact with Zen2 we can go one better: we do not need any nodes in the unicast hosts list, because nodes also use the contents of the last-committed cluster state for discovery. Additionally, the DiscoveryDisruptionIT tests were failing due to the overenthusiastic fault-detection timeouts. This commit replaces these tests with deterministic `CoordinatorTests` that verify the same behaviour. It also removes some duplication by extracting a test method called `testFollowerCheckerAfterMasterReelection()` Closes #37687	2019-02-02 07:54:56 +00:00
Jason Tedor	f181e17038	Introduce retention leases versioning (#37951 ) Because concurrent sync requests from a primary to its replicas could be in flight, it can be the case that an older retention leases collection arrives and is processed on the replica after a newer retention leases collection has arrived and been processed. Without a defense, in this case the replica would overwrite the newer retention leases with the older retention leases. This commit addresses this issue by introducing a versioning scheme to retention leases. This versioning scheme is used to resolve out-of-order processing on the replica. We persist this version into Lucene and restore it on recovery. The encoding of retention leases is starting to get a little ugly. We can consider addressing this in a follow-up.	2019-02-01 17:19:19 -05:00
Julie Tibshirani	c2e9d13ebd	Default include_type_name to false in the yml test harness. (#38058 ) This PR removes the temporary change we made to the yml test harness in #37285 to automatically set `include_type_name` to `true` in index creation requests if it's not already specified. This is possible now that the vast majority of index creation requests were updated to be typeless in #37611. A few additional tests also needed updating here. Additionally, this PR updates the test harness to set `include_type_name` to `false` in index creation requests when communicating with 6.x nodes. This mirrors the logic added in #37611 to allow for typeless document write requests in test set-up code. With this update in place, we can remove many references to `include_type_name: false` from the yml tests.	2019-02-01 11:44:13 -08:00
Andrey Ershov	bfd618cf83	Universal cluster bootstrap method for tests with autoMinMasterNodes=false (#38038 ) Currently, there are a few tests that use autoMinMasterNodes=false and hence override addExtraClusterBootstrapSettings, mostly this is 10-30 lines of codes that are copy-pasted from class to class. This PR introduces `InternalTestCluster.setBootstrapMasterNodeIndex` which is suitable for all classes and copy-paste could be removed. Removing code is always a good thing!	2019-02-01 11:34:31 +01:00
Armin Braun	0a604e3b24	Fix Two Races that Lead to Stuck Snapshots (#37686 ) * Fixes two broken spots: 1. Master failover while deleting a snapshot that has no shards will get stuck if the new master finds the 0-shard snapshot in `INIT` when deleting 2. Aborted shards that were never seen in `INIT` state by the `SnapshotsShardService` will not be notified as failed, leading to the snapshot staying in `ABORTED` state and never getting deleted with one or more shards stuck in `ABORTED` state * Tried to make fixes as short as possible so we can backport to `6.x` with the least amount of risk * Significantly extended test infrastructure to reproduce the above two issues * Two new test runs: 1. Reproducing the effects of node disconnects/restarts in isolation 2. Reproducing the effects of disconnects/restarts in parallel with shard relocations and deletes * Relates #32265 * Closes #32348	2019-02-01 05:45:40 +01:00
Yuri Astrakhan	f3cde06a1d	geotile_grid implementation (#37842 ) Implements `geotile_grid` aggregation This patch refactors previous implementation https://github.com/elastic/elasticsearch/pull/30240 This code uses the same base classes as `geohash_grid` agg, but uses a different hashing algorithm to allow zoom consistency. Each grid bucket is aligned to Web Mercator tiles.	2019-01-31 19:11:30 -05:00
Przemyslaw Gomulka	28b5c7ce78	Do not set up NodeAndClusterIdStateListener in test (#38110 ) When extending ESIntegTestCase are run on the same jvm, the static field in NodeAndClusterIdConverter will throw an AlreadySet exceptions. overriding the configuration method from Node.configureNodeAndClusterIdStateListener in the MockNode will prevent the listener registration from happening relates #32850	2019-01-31 18:59:40 +01:00
Henning Andersen	68ed72b923	Handle scheduler exceptions (#38014 ) Scheduler.schedule(...) would previously assume that caller handles exception by calling get() on the returned ScheduledFuture. schedule() now returns a ScheduledCancellable that no longer gives access to the exception. Instead, any exception thrown out of a scheduled Runnable is logged as a warning. This is a continuation of #28667, #36137 and also fixes #37708.	2019-01-31 17:51:45 +01:00
Jason Tedor	a9b12b38f0	Push primary term to replication tracker (#38044 ) This commit pushes the primary term into the replication tracker. This is a precursor to using the primary term to resolving ordering problems for retention leases. Namely, it can be that out-of-order retention lease sync requests arrive on a replica. To resolve this, we need a tuple of (primary term, version). For this to be, the primary term needs to be accessible in the replication tracker. As the primary term is part of the replication group anyway, this change conceptually makes sense.	2019-01-31 09:19:49 -05:00
Luca Cavanna	622fb7883b	Introduce ability to minimize round-trips in CCS (#37828 ) With #37566 we have introduced the ability to merge multiple search responses into one. That makes it possible to expose a new way of executing cross-cluster search requests, that makes CCS much faster whenever there is network latency between the CCS coordinating node and the remote clusters. The coordinating node can now send a single search request to each remote cluster, which gets reduced by each one of them. from + size results are requested to each cluster, and the reduce phase in each cluster is non final (meaning that buckets are not pruned and pipeline aggs are not executed). The CCS coordinating node performs an additional, final reduction, which produces one search response out of the multiple responses received from the different clusters. This new execution path will be activated by default for any CCS request unless a scroll is provided or inner hits are requested as part of field collapsing. The search API accepts now a new parameter called ccs_minimize_roundtrips that allows to opt-out of the default behaviour. Relates to #32125	2019-01-31 15:12:14 +01:00
Tim Vernum	cde126dbff	Enable SSL in reindex with security QA tests (#37600 ) Update the x-pack/qa/reindex-tests-with-security integration tests to run with TLS enabled on the Rest interface. Relates: #37527	2019-01-31 20:59:50 +11:00
Alexander Reelsen	160d1bd4dd	Work around JDK8 timezone bug in tests (#37968 ) The timezone GMT0 cannot be properly parsed on java8. The randomZone() method now excludes GMT0, if java8 is used. Closes #37814	2019-01-31 08:52:35 +01:00
David Turner	81c443c9de	Deprecate minimum_master_nodes (#37868 ) Today we pass `discovery.zen.minimum_master_nodes` to nodes started up in tests, but for 7.x nodes this setting is not required as it has no effect. This commit removes this setting so that nodes are started with more realistic configurations, and deprecates it.	2019-01-30 20:09:15 +00:00
Nik Everett	e97718245d	Test: Enable strict deprecation on all tests (#36558 ) This drops the option for tests to disable strict deprecation mode in the low level rest client in favor of configuring expected warnings on any calls that should expect warnings. This behavior is paranoid-by-default which is generally the right way to handle deprecations and tests in general.	2019-01-30 11:48:34 -05:00
Colin Goodheart-Smithe	21e392e95e	Removes typed calls from YAML REST tests (#37611 ) This PR attempts to remove all typed calls from our YAML REST tests. The PR adds include_type_name: false to create index requests that use a mapping and also to put mapping requests. It also removes _type from index requests where they haven't already been removed. The PR ignores tests named *_with_types.yml since this are specifically testing typed API behaviour. The change also includes changing the test harness to add the type _doc to index, update, get and bulk requests that do not specify the document type when the test is running against a mixed 7.x/6.x cluster.	2019-01-30 16:32:58 +00:00
Tim Brooks	f3f9cabd67	Add timeout for ccr recovery action (#37840 ) This is related to #35975. It adds a action timeout setting that allows timeouts to be applied to the individual transport actions that are used during a ccr recovery.	2019-01-29 12:29:06 -07:00
Armin Braun	7f1784e9f9	Remove Dead MockTransport Code (#34044 ) * All these methods are unused	2019-01-29 15:08:11 +01:00
Luca Cavanna	2325fb9cb3	Remove test only SearchShardTarget constructor (#37912 ) Remove SearchShardTarget test only constructor and replace all the usages with calls to the other constructor that accepts a ShardId.	2019-01-29 14:58:11 +01:00
Przemyslaw Gomulka	891320f5ac	Elasticsearch support to JSON logging (#36833 ) In order to support JSON log format, a custom pattern layout was used and its configuration is enclosed in ESJsonLayout. Users are free to use their own patterns, but if smooth Beats integration is needed, they should use ESJsonLayout. EvilLoggerTests are left intact to make sure user's custom log patterns work fine. To populate additional fields node.id and cluster.uuid which are not available at start time, a cluster state update will have to be received and the values passed to log4j pattern converter. A ClusterStateObserver.Listener is used to receive only one ClusteStateUpdate. Once update is received the nodeId and clusterUUid are set in a static field in a NodeAndClusterIdConverter. Following fields are expected in JSON log lines: type, tiemstamp, level, component, cluster.name, node.name, node.id, cluster.uuid, message, stacktrace see ESJsonLayout.java for more details and field descriptions Docker log4j2 configuration is now almost the same as the one use for ES binary. The only difference is that docker is using console appenders, whereas ES is using file appenders. relates: #32850	2019-01-29 07:20:09 +01:00
Jason Tedor	5fddb631a2	Introduce retention lease syncing (#37398 ) This commit introduces retention lease syncing from the primary to its replicas when a new retention lease is added. A follow-up commit will add a background sync of the retention leases as well so that renewed retention leases are synced to replicas.	2019-01-27 07:49:56 -05:00
Christoph Büscher	b4b4cd6ebd	Clean codebase from empty statements (#37822 ) * Remove empty statements There are a couple of instances of undocumented empty statements all across the code base. While they are mostly harmless, they make the code hard to read and are potentially error-prone. Removing most of these instances and marking blocks that look empty by intention as such. * Change test, slightly more verbose but less confusing	2019-01-25 14:23:02 +01:00
Jim Ferenczi	787acb14b9	Track total hits up to 10,000 by default (#37466 ) This commit changes the default for the `track_total_hits` option of the search request to `10,000`. This means that by default search requests will accurately track the total hit count up to `10,000` documents, requests that match more than this value will set the `"total.relation"` to `"gte"` (e.g. greater than or equals) and the `"total.value"` to `10,000` in the search response. Scroll queries are not impacted, they will continue to count the total hits accurately. The default is set back to `true` (accurate hit count) if `rest_total_hits_as_int` is set in the search request. I choose `10,000` as the default because that's also the number we use to limit pagination. This means that users will be able to know how far they can jump (up to 10,000) even if the total number of hits is not accurate. Closes #33028	2019-01-25 13:45:39 +01:00
Julie Tibshirani	e1d8df4ffa	Deprecate types in create index requests. (#37134 ) From #29453 and #37285, the include_type_name parameter was already present and defaulted to false. This PR makes the following updates: * Add deprecation warnings to RestCreateIndexAction, plus tests in RestCreateIndexActionTests. * Add a typeless 'create index' method to the Java HLRC, and deprecate the old typed version. To do this cleanly, I created new CreateIndexRequest and CreateIndexResponse objects that differ from the existing server ones.	2019-01-24 13:17:47 -08:00
Andrey Ershov	4974684003	Add tool elasticsearch-node unsafe-bootstrap (#37696 ) elasticsearch-node tool helps to restore cluster if half or more of master eligible nodes are lost. Of course, all bets are off, regarding data consistency. There are two parts of the tool: unsafe-bootstrap to be used when there is still at least one master-eligible node alive and detach-cluster, when there are no master-eligible nodes left. This commit implements the first part. Docs for the tool will be added separately as a part of #37812.	2019-01-24 19:25:55 +01:00
Tal Levy	289106a578	Refactor GeoHashGrid to be abstract and re-usable (#37742 ) This change split out all the specific GeoHash classes for the geohash_grid aggregation into abstract GeoGrid classes that can be re-used for specific hashing types, like `geohash`	2019-01-24 10:12:14 -08:00
Alpar Torok	37768b7eac	Testing conventions now checks for tests in main (#37321 ) * Testing conventions now checks for tests in main This is the last outstanding feature of the old NamingConventionsTask, so time to remove it. * PR review	2019-01-24 17:30:50 +02:00
Nhat Nguyen	a6abb28abf	Fix InternalEngineTests#assertOpsOnPrimary (#37746 ) The assertion `assertOpsOnPrimary` does not store seq_no and primary term of successful deletes to the `lastOpSeqNo` and `lastOpTerm`. This leads to failures of the subsequence CAS deletes or indexes with seq_no and term. Moreover, this assertion trips a translog assertion because it bumps the primary term of some operations but not the primary term of the engine. Relates #36467 Closes #37684	2019-01-24 10:02:48 -05:00
Yannick Welsch	feab59df03	Bubble exceptions up in ClusterApplierService (#37729 ) Exceptions thrown by the cluster applier service's settings and cluster appliers are bubbled up, and block the state from being applied instead of silently being ignored. In combination with the cluster state publishing lag detector, this will throw a node out of the cluster that can't properly apply cluster state updates.	2019-01-24 14:09:03 +01:00
Alexander Reelsen	daa2ec8a60	Switch mapping/aggregations over to java time (#36363 ) This commit moves the aggregation and mapping code from joda time to java time. This includes field mappers, root object mappers, aggregations with date histograms, query builders and a lot of changes within tests. The cut-over to java time is a requirement so that we can support nanoseconds properly in a future field mapper. Relates #27330	2019-01-23 10:40:05 +01:00
Boaz Leskes	52ba407931	Expose sequence number and primary terms in search responses (#37639 ) Users may require the sequence number and primary terms to perform optimistic concurrency control operations. Currently, you can get the sequence number via the `docvalues_fields` API but the primary term is not accessible because it is maintained by the `SeqNoFieldMapper` and the infrastructure can't find it. This commit adds a dedicated sub fetch phase to return both numbers that is connected to a new `seq_no_primary_term` parameter.	2019-01-23 09:01:58 +01:00
Henning Andersen	228611843c	Fail start of non-data node if node has data (#37347 ) * Fail start of non-data node if node has data Check that nodes started with node.data=false cannot start if they have shard data to avoid (old) indexes being resurrected into the cluster in red status. Issue #27073	2019-01-22 13:27:12 +01:00
Tim Brooks	21838d73b5	Extract message serialization from `TcpTransport` (#37034 ) This commit introduces a NetworkMessage class. This class has two subclasses - InboundMessage and OutboundMessage. These messages can be serialized and deserialized independent of the transport. This allows more granular testing. Additionally, the serialization mechanism is now a simple Supplier. This builds the framework to eventually move the serialization of transport messages to the network thread. This is the one serialization component that is not currently performed on the network thread (transport deserialization and http serialization and deserialization are all on the network thread).	2019-01-21 14:14:18 -07:00
Tim Brooks	f516d68fb2	Share `NioGroup` between http and transport impls (#37396 ) Currently we create dedicated network threads for both the http and transport implementations. Since these these threads should never perform blocking operations, these threads could be shared. This commit modifies the nio-transport to have 0 http workers be default. If the default configs are used, this will cause the http transport to be run on the transport worker threads. The http worker setting will still exist in case the user would like to configure dedicated workers. Additionally, this commmit deletes dedicated acceptor threads. We have never had these for the netty transport and they can be added back if a need is determined in the future.	2019-01-21 13:50:56 -07:00
Julie Tibshirani	8da7a27f3b	Deprecate types in the put mapping API. (#37280 ) From #29453 and #37285, the `include_type_name` parameter was already present and defaulted to false. This PR makes the following updates: - Add deprecation warnings to `RestPutMappingAction`, plus tests in `RestPutMappingActionTests`. - Add a typeless 'put mappings' method to the Java HLRC, and deprecate the old typed version. To do this cleanly, I opted to create a new `PutMappingRequest` object that differs from the existing server one.	2019-01-18 12:28:31 -08:00
Yannick Welsch	377d96e376	Remove initial_master_nodes on node restart (#37580 ) Some tests (e.g. testRestoreIndexWithShardsMissingInLocalGateway) were split-braining since being switched to Zen2 because the bootstrap setting was left around when nodes got restarted with data folders wiped. The test in question here was starting one node (which autobootstrapped to that single node), then another node. The first node was then shut down (after excluding it from the voting configuration), its data folder wiped, and restarted. After restart, the node had an empty data folder yet initial_master_nodes set to itself (i.e. same name). This made the node sometimes form a cluster of its own, and not rejoin the existing cluster with the other node.	2019-01-18 16:36:42 +01:00
Jason Tedor	687978b7d1	Reject all requests that have an unconsumed body (#37504 ) This commit removes some leniency from REST handling where we move to reject all requests that have a body where the body is not used during the course of handling the request. For example, DELETE /index { "query" : { "term" : { "field" : "value" } } } is now rejected.	2019-01-16 07:29:25 -05:00
Przemyslaw Gomulka	5e94f384c4	Remove the use of AbstracLifecycleComponent constructor #37488 (#37488 ) The AbstracLifecycleComponent used to extend AbstractComponent, so it had to pass settings to the constractor of its supper class. It no longer extends the AbstractComponent so there is no need for this constructor There is also no need for AbstracLifecycleComponent subclasses to have Settings in their constructors if they were only passing it over to super constructor. This is part 1. which will be backported to 6.x with a migration guide/deprecation log. part 2 will have this constructor removed in 7 relates #35560 relates #34488	2019-01-16 09:05:30 +01:00
Tanguy Leroux	23ae9808ba	Fix IndexShardTestCase.recoverReplica(IndexShard, IndexShard, boolean) (#37414 ) This commit fixes the IndexShardTestCase.recoverReplica(IndexShard, IndexShard, boolean) method where the startReplica parameter was not correctly propagated and the value true always used instead.	2019-01-15 12:48:21 +01:00
Marios Trivyzas	d6a104f52b	[TEST] Muted testDifferentRolesMaintainPathOnRestart Relates to #37462	2019-01-15 11:51:45 +02:00
Jason Tedor	e11a32eda8	Reformat some classes in the index universe This commit reformats some classes in the index universe with the purpose of breaking some long method definitions and invocations into a line per parameter. This has the advantage that for an upcoming change to these definitions and invocations, the diff for that change will be a single line per definition or invocation. That makes these sorts of changes easier to read.	2019-01-14 21:45:24 -05:00
Julie Tibshirani	36a3b84fc9	Update the default for include_type_name to false. (#37285 ) * Default include_type_name to false for get and put mappings. * Default include_type_name to false for get field mappings. * Add a constant for the default include_type_name value. * Default include_type_name to false for get and put index templates. * Default include_type_name to false for create index. * Update create index calls in REST documentation to use include_type_name=true. * Some minor clean-ups around the get index API. * In REST tests, use include_type_name=true by default for index creation. * Make sure to use 'expression == false'. * Clarify the different IndexTemplateMetaData toXContent methods. * Fix FullClusterRestartIT#testSnapshotRestore. * Fix the ml_anomalies_default_mappings test. * Fix GetFieldMappingsResponseTests and GetIndexTemplateResponseTests. We make sure to specify include_type_name=true during xContent parsing, so we continue to test the legacy typed responses. XContent generation for the typeless responses is currently only covered by REST tests, but we will be adding unit test coverage for these as we implement each typeless API in the Java HLRC. This commit also refactors GetMappingsResponse to follow the same appraoch as the other mappings-related responses, where we read include_type_name out of the xContent params, instead of creating a second toXContent method. This gives better consistency in the response parsing code. * Fix more REST tests. * Improve some wording in the create index documentation. * Add a note about types removal in the create index docs. * Fix SmokeTestMonitoringWithSecurityIT#testHTTPExporterWithSSL. * Make sure to mention include_type_name in the REST docs for affected APIs. * Make sure to use 'expression == false' in FullClusterRestartIT. * Mention include_type_name in the REST templates docs.	2019-01-14 13:08:01 -08:00
Nhat Nguyen	1e3702da0b	Relax assertSameDocIdsOnShards assertion If the checking node no longer holds the shard copy, the assertion assertSameDocIdsOnShards might fail. This is too harsh since the assertion is to ensure the consistency between active copies.	2019-01-14 15:28:48 -05:00
Nhat Nguyen	15aa3764a4	Reduce recovery time with compress or secure transport (#36981 ) Today file-chunks are sent sequentially one by one in peer-recovery. This is a correct choice since the implementation is straightforward and recovery is network bound in most of the time. However, if the connection is encrypted, we might not be able to saturate the network pipe because encrypting/decrypting are cpu bound rather than network-bound. With this commit, a source node can send multiple (default to 2) file-chunks without waiting for the acknowledgments from the target. Below are the benchmark results for PMC and NYC_taxis. - PMC (20.2 GB) \| Transport \| Baseline \| chunks=1 \| chunks=2 \| chunks=3 \| chunks=4 \| \| ----------\| ---------\| -------- \| -------- \| -------- \| -------- \| \| Plain \| 184s \| 137s \| 106s \| 105s \| 106s \| \| TLS \| 346s \| 294s \| 176s \| 153s \| 117s \| \| Compress \| 1556s \| 1407s \| 1193s \| 1183s \| 1211s \| - NYC_Taxis (38.6GB) \| Transport \| Baseline \| chunks=1 \| chunks=2 \| chunks=3 \| chunks=4 \| \| ----------\| ---------\| ---------\| ---------\| ---------\| -------- \| \| Plain \| 321s \| 249s \| 191s \| * \| * \| \| TLS \| 618s \| 539s \| 323s \| 290s \| 213s \| \| Compress \| 2622s \| 2421s \| 2018s \| 2029s \| n/a \| Relates #33844	2019-01-14 15:14:46 -05:00
Tim Brooks	5c68338a1c	Implement ccr file restore (#37130 ) This is related to #35975. It implements a file based restore in the CcrRepository. The restore transfers files from the leader cluster to the follower cluster. It does not implement any advanced resiliency features at the moment. Any request failure will end the restore.	2019-01-14 13:07:55 -07:00
Armin Braun	033e67fa59	Cleanup Deadcode in Rest Tests (#37418 ) * Either dead code outright or redundant overrides removed	2019-01-14 16:22:44 +01:00
Daniel Mitterdorfer	abe35fb99b	Remove unused index store in directory service With this commit we remove the unused field `indexStore` from all implementations of `FsDirectoryService`. Relates #37097	2019-01-14 13:44:32 +01:00
Jason Tedor	03be4dbaca	Introduce retention lease persistence (#37375 ) This commit introduces the persistence of retention leases by persisting them in index commits and recovering them when recovering a shard from store.	2019-01-12 14:43:19 -08:00
Nhat Nguyen	44a1071018	Make recovery source partially non-blocking (#37291 ) Today a peer-recovery may run into a deadlock if the value of node_concurrent_recoveries is too high. This happens because the peer-recovery is executed in a blocking fashion. This commit attempts to make the recovery source partially non-blocking. I will make three follow-ups to make it fully non-blocking: (1) send translog operations, (2) primary relocation, (3) send commit files. Relates #36195	2019-01-12 12:49:48 -05:00
Armin Braun	63fe3c6ed6	Fix PrimaryAllocationIT Race Condition (#37355 ) * Fix PrimaryAllocationIT Race Condition * Forcing a stale primary allocation on a green index was tripping the assertion that was removed * Added a test that this case still errors out correctly * Made the ability to wipe stopped datanode's data public on the internal test cluster and used it to ensure correct behaviour on the fixed test * Previously it simply passed because the test finished before the index went green and would NPE when the index was green at the time of the shard store status request, that would then come up empty * Closes #37345	2019-01-11 23:26:04 +01:00
Yannick Welsch	f4abf9628a	Mock connections more accurately in DisruptableMockTransport (#37296 ) This commit moves DisruptableMockTransport to use a more accurate representation of connection management, which allows to use the full connection manager and does not require mocking out any behavior. With this, we can implement restarting nodes in CoordinatorTests.	2019-01-11 16:06:48 +01:00
Jason Tedor	822626dadf	Make consistent empty retention lease supplier This commit makes the use of empty retention lease suppliers to always be an empty list as opposed to in some cases an empty set. This commit is solely for consistency reasons, there is no functional change here.	2019-01-10 18:34:55 -08:00
Yannick Welsch	d499233068	Zen2: Add join validation (#37203 ) Adds join validation to Zen2, which prevents a node from joining a cluster when the node does not have the right ES version or does not satisfy any other of the join validation constraints.	2019-01-10 12:57:50 +01:00
Alexander Reelsen	b2e8437424	Tests: Add ElasticsearchAssertions.awaitLatch method (#36777 ) * Tests: Add ElasticsearchAssertions.awaitLatch method Some tests are using assertTrue(latch.await(...)) in their code. This leads to an assertion error without any error message. This adds a method which has a nicer error message and can be used in tests. * fix forbidden apis * fix spaces	2019-01-10 09:25:36 +01:00
Armin Braun	eacc63b032	TESTS: Real Coordinator in SnapshotServiceTests (#37162 ) * TESTS: Real Coordinator in SnapshotServiceTests * Introduce real coordinator in SnapshotServiceTests to be able to test network disruptions realistically * Make adjustments to cluster applier service so that we can pass a mocked single threaded executor for tests	2019-01-09 16:53:49 +01:00
Tanguy Leroux	7f6fe14b66	Merge branch 'master' into close-index-api-refactoring	2019-01-09 09:26:05 +01:00
Mayya Sharipova	ec32e66088	Deprecate reference to _type in lookup queries (#37016 ) Relates to #35190	2019-01-08 18:46:41 -08:00
Tanguy Leroux	d70ebfd1d6	Merge branch 'master' into close-index-api-refactoring	2019-01-08 09:17:48 +01:00
Jason Tedor	c8c596cead	Introduce retention lease expiration (#37195 ) This commit implements a straightforward approach to retention lease expiration. Namely, we inspect which leases are expired when obtaining the current leases through the replication tracker. At that moment, we clean the map that persists the retention leases in memory.	2019-01-07 22:03:52 -08:00
Julie Tibshirani	c5aac4705d	Revert "Stop automatically nesting mappings in index creation requests. (#36924 )" This reverts commit `ac1c6940d2`.	2019-01-07 17:56:40 -08:00
Tanguy Leroux	97bf4d7176	Merge branch 'master' into close-index-api-refactoring	2019-01-07 18:38:27 +01:00
David Turner	9d0e0eb0f3	[Zen2] Remove initial master node count setting (#37150 ) The `cluster.unsafe_initial_master_node_count` setting was introduced as a temporary measure while the design of `cluster.initial_master_nodes` was being finalised. This commit removes this temporary setting, replacing it with usages of `cluster.initial_master_nodes` where appropriate.	2019-01-07 16:05:00 +00:00
Tanguy Leroux	e149b0852e	[Close Index API] Add unique UUID to ClusterBlock (#36775 ) This commit adds a unique id to cluster blocks, so that they can be uniquely identified if needed. This is important for the Close Index API where multiple concurrent closing requests can be executed at the same time. By adding a UUID to the cluster block, we can generate unique "closing block" that can later be verified on shards and then checked again from the cluster state before closing the index. When the verification on shard is done, the closing block is replaced by the regular INDEX_CLOSED_BLOCK instance. If something goes wrong, calling the Open Index API will remove the block. Related to #33888	2019-01-07 16:44:59 +01:00
Jason Tedor	c0f8c89172	Introduce shard history retention leases (#37167 ) This commit is the first in a series which will culminate with fully-functional shard history retention leases. Shard history retention leases are aimed at preventing shard history consumers from having to fallback to expensive file copy operations if shard history is not available from a certain point. These consumers include following indices in cross-cluster replication, and local shard recoveries. A future consumer will be the changes API. Further, index lifecycle management requires coordinating with some of these consumers otherwise it could remove the source before all consumers have finished reading all operations. The notion of shard history retention leases that we are introducing here will also be used to address this problem. Shard history retention leases are a property of the replication group managed under the authority of the primary. A shard history retention lease is a combination of an identifier, a retaining sequence number, a timestamp indicating when the lease was acquired or renewed, and a string indicating the source of the lease. Being leases they have a limited lifespan that will expire if not renewed. The idea of these leases is that all operations above the minimum of all retaining sequence numbers will be retained during merges (which would otherwise clear away operations that are soft deleted). These leases will be periodically persisted to Lucene and restored during recovery, and broadcast to replicas under certain circumstances. This commit is merely putting the basics in place. This first commit only introduces the concept and integrates their use with the soft delete retention policy. We add some tests to demonstrate the basic management is correct, and that the soft delete policy is correctly influenced by the existence of any retention leases. We make no effort in this commit to implement any of the following: - timestamps - expiration - persistence to and recovery from Lucene - handoff during primary relocation - sharing retention leases with replicas - exposing leases in shard-level statistics - integration with cross-cluster replication These will occur individually in follow-up commits.	2019-01-07 07:43:57 -08:00
Alpar Torok	a7c3d5842a	Split third party audit exclusions by type (#36763 )	2019-01-07 17:24:19 +02:00
Simon Willnauer	ac2e09b25a	Fix suite scope random initializaation (#37163 ) The initialization of a suite scope cluster had some sideffects on subsequent runs which causes issues when tests must be reproduced. This moves the suite scope initialization to a privte random context. Closes #36202	2019-01-07 14:20:17 +01:00
Tanguy Leroux	f5af79b9cd	Merge branch 'master' into close-index-api-refactoring	2019-01-07 12:43:03 +01:00
Armin Braun	31c33fdb9b	MINOR: Remove some Deadcode in Gradle (#37160 )	2019-01-07 09:21:25 +01:00
Jim Ferenczi	e38cf1d0dc	Add the ability to set the number of hits to track accurately (#36357 ) In Lucene 8 searches can skip non-competitive hits if the total hit count is not requested. It is also possible to track the number of hits up to a certain threshold. This is a trade off to speed up searches while still being able to know a lower bound of the total hit count. This change adds the ability to set this threshold directly in the track_total_hits search option. A boolean value (true, false) indicates whether the total hit count should be tracked in the response. When set as an integer this option allows to compute a lower bound of the total hits while preserving the ability to skip non-competitive hits when enough matches have been collected. Relates #33028	2019-01-04 20:36:49 +01:00
Julie Tibshirani	ac1c6940d2	Stop automatically nesting mappings in index creation requests. (#36924 ) Now that we unwrap mappings in DocumentMapperParser#extractMappings, it is not necessary for the mapping definition to always be nested under the type. This leniency around the mapping format was added in `2341825358`.	2019-01-03 17:41:28 -08:00
David Findley	d4e7660248	Fix weighted_avg parser not found for RestHighLevelClient (#37027 ) Add integration test for weighted avg sub aggregation Add weighted avg parser to DefaultNamedXContents Fixes #36861	2019-01-02 15:53:21 -06:00
Josh Soref	1df66d21fe	Spelling: replace uknown with unknown (#37056 )	2019-01-02 17:33:02 +01:00
Josh Soref	d3e98278c3	Spelling: replace cachable with cacheable (#37047 )	2019-01-02 14:10:30 +01:00
Armin Braun	85be9d6a89	SNAPSHOT: Deterministic ClusterState Tests (#36644 ) * Use `DeterministicTaskQueue` infrastructure to test `SnapshotsService`	2018-12-31 11:17:21 +01:00
Luca Cavanna	51fe20e0c3	Add support for local cluster alias to SearchRequest (#36997 ) With the upcoming cross-cluster search alternate execution mode, the CCS node will be able to split a CCS request into multiple search requests, one per remote cluster involved. In order to do that, the CCS node has to be able to signal to each remote cluster that such sub-requests are part of a CCS request. Each cluster does not know about the other clusters involved, and does not know either what alias it is given in the CCS node, hence the CCS coordinating node needs to be able to provide the alias as part of the search request so that it is used as index prefix in the returned search hits. The cluster alias is a notion that's already supported in the search shards iterator and search shard target, but it is currently used in CCS as both index prefix and connection lookup key when fanning out to all the shards. With CCS alternate execution mode the provided cluster alias needs to be used only as index prefix, as shards are local to each cluster hence no cluster alias should be used for connection lookups. The local cluster alias can be set to the SearchRequest at the transport layer only, and its constructor/getter methods are package private. Relates to #32125	2018-12-28 12:43:25 +01:00
Andrey Ershov	a02cfdf6e4	Switch InternalTestClusterTests to zen2 (#36977 ) Today InternalTestClusterTests is still using zen1. This commit fixes it. Two types of changes were required: 1. Explicitly pass file discovery host provider setting. It's done in ESIntegTestCase as a part of the Zen2 feature and should be done here as well. 2. For the test, that uses autoManageMinMasterNodes = false perform cluster bootstrap.	2018-12-27 22:21:37 +01:00
Nhat Nguyen	7580d9d925	Make SourceToParse immutable (#36971 ) Today the routing of a SourceToParse is assigned in a separate step after the object is created. We can easily forget to set the routing. With this commit, the routing must be provided in the constructor of SourceToParse. Relates #36921	2018-12-24 14:06:50 -05:00
Tim Brooks	c8a8391dfa	Only compress responses if request was compressed (#36867 ) This is a follow-up to some discussions around #36399. Currently we have relatively confusing compression behavior where compression can be configured for requests based on transport.compress or a specific setting for a remote cluster. However, we can only compress responses based on transport.compress as we do not know where a request is coming from (currently). This commit modifies the behavior to NEVER compress responses based on settings. Instead, a response will only be compressed if the request was compressed. This commit also updates the documentation to more clearly described transport level compression.	2018-12-21 10:14:00 -07:00
Tanguy Leroux	bd2af2c400	Merge branch 'master' into close-index-api-refactoring	2018-12-21 12:22:24 +01:00
Andrey Ershov	ca92d74e7e	[Zen2] Change unsafe bootstrap nodes count to nodes list in tests (#36559 ) This commit modifies ESSingleNodeTestCase and ESIntegTestCase and several concrete test classes to use node names when bootstrapping the cluster. Today ClusterBootstrapService.INITIAL_MASTER_NODE_COUNT_SETTING setting is used to bootstrap clusters in tests. Instead, we want to use ClusterBootrstapService.INITIAL_MASTER_NODES_SETTING and get rid of the former setting eventually. There were two main problems when refactoring InternalTestCluster: 1. Nodes are created one-by-one in buildNode method. And node.name is created in this method as well. It's not suitable for bootstrapping, because we need to have the names of all master eligible nodes in advance, before creating the node with bootstrapping configuration set. We address this issue by separating buildNode into two methods: getNodeSettings and buildNode. We first iterate over all nodes to get nodes settings, then change the setting for the bootstrapping node and then proceed with building the node. 2. If autoManageMinMasterNodes = false, there is no way for the test to set the list of bootstrapping nodes because node names are not known in advance. This problem is solved by adding updateNodesSettings method to NodeConfigurationSource and ESIntegTestCase (which could be overridden by concrete integration test class). Once we have the list of settings for all nodes, the integration test class is allowed to update it. In our case, we update the ClusterBootrstapService.INITIAL_MASTER_NODES_SETTING setting.	2018-12-20 15:20:33 +01:00
Tanguy Leroux	fb24469fe7	Merge branch 'master' into close-index-api-refactoring	2018-12-19 16:17:26 +01:00
Yannick Welsch	487a1c4f71	Fix cluster state persistence for single-node discovery (#36825 ) Single-node discovery is not persisting cluster states, which was caused by a recent 7.0-only refactoring. This commit ensures that the cluster state is properly persisted when using single-node discovery and adds a corresponding test.	2018-12-19 13:26:04 +01:00
Alan Woodward	344917efab	Add script filter to intervals (#36776 ) This commit adds the ability to filter out intervals based on their start and end position, and internal gaps: ``` POST _search { "query": { "intervals" : { "my_text" : { "match" : { "query" : "hot porridge", "filter" : { "script" : { "source" : "interval.start > 10 && interval.end < 20 && interval.gaps == 0" } } } } } } } ```	2018-12-19 11:12:18 +00:00
Tanguy Leroux	c99fd6a53b	Merge branch 'master' into close-index-api-refactoring	2018-12-19 09:34:59 +01:00
Alpar Torok	e9ef5bdce8	Converting randomized testing to create a separate unitTest task instead of replacing the builtin test task (#36311 ) - Create a separate unitTest task instead of Gradle's built in - convert all configuration to use the new task - the built in task is now disabled	2018-12-19 08:25:20 +02:00
Tim Brooks	aaf466ff5e	Revert transport.port change for tests (#36809 ) Commit #36786 updated docs and strings to reference transport.port instead of transport.tcp.port. However, this breaks backwards compatibility tests as the tests rely on string configurations and transport.port does not exist prior to 6.6. This commit reverts the places were we reference transport.tcp.port for tests. This work will need to be reintroduced in a backwards compatible way.	2018-12-18 19:01:13 -07:00
Tim Brooks	47a9a8de49	Update transport docs and settings for changes (#36786 ) This is related to #36652. In 7.0 we plan to deprecate a number of settings that make reference to the concept of a tcp transport. We mostly just have a single transport type now (based on tcp). Settings should only reference tcp if they are referring to socket options. This commit updates the settings in the docs. And removes string usages of the old settings. Additionally it adds a missing remote compress setting to the docs.	2018-12-18 13:09:58 -07:00
Ryan Ernst	8ec8342a52	Internal: Remove originalSettings from Node (#36569 ) This commit removes the originalSettings member from Node. It was only needed to allows test clusters to recreate the node in certain situations. Instead, the test cluster now keeps track of these settings.	2018-12-18 10:05:27 -08:00
Tanguy Leroux	0a0c969517	Merge branch 'master' into close-index-api-refactoring	2018-12-18 09:27:35 +01:00
Luca Cavanna	b57e12aa44	Add raw sort values to SearchSortValues transport serialization (#36617 ) In order for CCS alternate execution mode (see #32125) to be able to do the final reduction step on the CCS coordinating node, we need to serialize additional info in the transport layer as part of each `SearchHit`. Sort values are already present but they are formatted according to the provided `DocValueFormat` provided. The CCS node needs to be able to reconstruct the lucene `FieldDoc` to include in the `TopFieldDocs` and `CollapseTopFieldDocs` which will feed the `mergeTopDocs` method used to reduce multiple search responses (one per cluster) into one. This commit adds such information to the `SearchSortValues` and exposes it through a new getter method added to `SearchHit` for retrieval. This info is only serialized at transport and never printed out at REST.	2018-12-18 09:20:51 +01:00
Nicholas Knize	96d279ed83	Revert "[Geo] Integrate Lucene's LatLonShape (BKD Backed GeoShapes) as default `geo_shape` indexing approach (#35320 )" This reverts commit `5bc7822562`.	2018-12-17 20:09:46 -06:00
Nick Knize	5bc7822562	[Geo] Integrate Lucene's LatLonShape (BKD Backed GeoShapes) as default `geo_shape` indexing approach (#35320 ) This commit exposes lucene's LatLonShape field as the default type in GeoShapeFieldMapper. To use the new indexing approach, simply set "type" : "geo_shape" in the mappings without setting any of the strategy, precision, tree_levels, or distance_error_pct parameters. Note the following when using the new indexing approach: * geo_shape query does not support querying by MULTIPOINT. * LINESTRING and MULTILINESTRING queries do not yet support WITHIN relation. * CONTAINS relation is not yet supported. The tree, precision, tree_levels, distance_error_pct, and points_only parameters are deprecated.	2018-12-17 14:38:14 -06:00
Luca Cavanna	f1e1f93943	[TEST] fix float comparison in RandomObjects#getExpectedParsedValue This commit fixes a test bug introduced with #36597. This caused some test failure as stored field values comparisons would not work when CBOR xcontent type was used. Closes #29080	2018-12-17 21:19:59 +01:00
Tanguy Leroux	79999d37d4	Merge branch 'master' into close-index-api-refactoring	2018-12-17 10:14:38 +01:00
Boaz Leskes	733a6d34c1	Add seq no powered optimistic locking support to the index and delete transport actions (#36619 ) This commit add support for using sequence numbers to power [optimistic concurrency control](http://en.wikipedia.org/wiki/Optimistic_concurrency_control) in the delete and index transport actions and requests. A follow up will come with adding sequence numbers to the update and get results. Relates #36148 Relates #10708	2018-12-15 17:59:57 +01:00
Tim Brooks	3065300434	Unify transport settings naming (#36623 ) This commit updates our transport settings for 7.0. It generally takes a few approaches. First, for normal transport settings, it usestransport. instead of transport.tcp. Second, it uses transport.tcp, http.tcp, or network.tcp for all settings that are proxies for OS level socket settings. Third, it marks the network.tcp.connect_timeout setting for removal. Network service level settings are only settings that apply to both the http and transport modules. There is no connect timeout in http. Fourth, it moves all the transport settings to a single class TransportSettings similar to the HttpTransportSettings class. This commit does not actually remove any settings. It just adds the new renamed settings and adds todos for settings that will be deprecated.	2018-12-14 14:41:04 -07:00
Tim Brooks	fbf88b2ab7	Remove the `MockTcpTransport` (#36628 ) This commit removes all remaining usages of the `MockTcpTransport`. Additionally it removes the `MockTcpTransport` and its test case.	2018-12-14 10:59:07 -07:00
Luca Cavanna	bb3ae18da5	Increase coverage in SearchSortValuesTests (#36597 ) SearchSortValuesTests extends now `AbstractSerializingTestCase` which removes some code duplication and standardizes the way we test `fromXContent`, serialization and equals/hashcode. Also, we were never creating `SearchSortValues` through their public constructor that accept an array of `DocValueFormat` together with the array of raw sort values. That is covered now, which involved some conversion from `BytesRef` to String in the test. Also, the previous test was not using doing any equality check against the original and parsed versions in `testFromXContent` due to values being parsed with different types in some cases, which is now covered by converting those values using a new method added to `RandomObjects`. The code was already there as part of `randomStoredFieldValues`, but it is now exposed to be used in other scenarios.	2018-12-14 18:57:37 +01:00
Luca Cavanna	7dc3d3b78b	Add sort and collapse info to SearchHits transport serialization (#36555 ) In order for CCS alternate execution mode (see #32125) to be able to do the final reduction step on the CCS coordinating node, we need to serialize additional info in the transport layer as part of the `SearchHits`, specifically: - lucene `SortField[]` which contains info about the fields that sorting was performed on and their type, which depends on mappings (that the CCS node does not know about) - collapse field (`String`) that field collapsing was executed on, if requested - collapse values (`Object[]`) that field collapsing was based on, if requested This info is needed to be able to reconstruct the `TopFieldDocs` or `CollapseFieldTopDocs` in the CCS coordinating node to feed the `mergeTopDocs` method and reduce multiple search responses received (one per cluster) into one. This commit adds such information to the `SearchHits` class. It's nullable info that is not serialized through the REST layer. `SearchPhaseController` sets such info at the end of the hits reduction phase.	2018-12-14 12:22:54 +01:00
Armin Braun	c5b3ac5578	SNAPSHOTS: Allow Parallel Restore Operations (#36397 ) * Enable parallel restore operations * Add uuid to restore in progress entries to uniquely identify them * Adjust restore in progress entries to be a map in cluster state * Added tests for: * Parallel restore from two different snapshots * Parallel restore from a single snapshot to different indices to test uuid identifiers are correctly used by `RestoreService` and routing allocator * Parallel restore with waiting for completion to test transport actions correctly use uuid identifiers	2018-12-14 11:39:23 +01:00
Tanguy Leroux	8e5dd20efb	[Close Index API] Refactor MetaDataIndexStateService (#36354 ) The commit changes how indices are closed in the MetaDataIndexStateService. It now uses a 3 steps process where writes are blocked on indices to be closed, then some verifications are done on shards using the TransportVerifyShardBeforeCloseAction added in #36249, and finally indices states are moved to CLOSE and their routing tables removed. The closing process also takes care of using the pre-7.0 way to close indices if the cluster contains mixed version of nodes and a node does not support the TransportVerifyShardBeforeCloseAction. It also closes unassigned indices. Related to #33888	2018-12-13 17:36:23 +01:00
Boaz Leskes	f6b5d7e013	Add sequence numbers based optimistic concurrency control support to Engine (#36467 ) This commit add support to engine operations for resolving and verifying the sequence number and primary term of the last modification to a document before performing an operation. This is infrastructure to move our (optimistic concurrency control)[http://en.wikipedia.org/wiki/Optimistic_concurrency_control] API to use sequence numbers instead of internal versioning. Relates #36148 Relates #10708	2018-12-13 08:08:40 +01:00
Tal Levy	cd1bec3a06	[refactor] add Environment in BootstrapContext (#36573 ) There are certain BootstrapCheck checks that may need access environment-specific values. Watcher's EncryptSensitiveDataBootstrapCheck passes in the node's environment via a constructor to bypass the shortcoming in BootstrapContext. This commit pulls in the node's environment into BootstrapContext. Another case is found in #36519, where it is useful to check the state of the data-path. Since PathUtils.get and Paths.get are forbidden APIs, we rely on the environment to retrieve references to things like node data paths. This means that the BootstrapContext will have the same Settings used in the Environment, which currently differs from the Node's settings.	2018-12-12 21:07:21 -08:00
Julie Tibshirani	71a39d10be	Make sure that BWC tests run successfully, even with types deprecation messages. (#36511 )	2018-12-12 12:57:32 -08:00
Tim Brooks	7f612d5dd8	Always compress based on the settings (#36522 ) Currently TransportRequestOptions allows specific requests to request compression. This commit removes this and always compresses based on the settings. Additionally, it removes TransportResponseOptions as they are unused. This closes #36399.	2018-12-12 09:39:15 -07:00
Simon Willnauer	ff5dd14753	Fix test failures related to file corruption (#36530 ) * Fix CorruptFileIT to also take last DV generation into account We currently only prune old .liv generations. With soft_deletes it's important to also prune DV generations. * Fix CorruptionUtils to skip the footer bytes after the checksum is read. Today we read a broken checksum since we also checksum the 8 footer bytes that include the checksum algorithm and the footer magic. Closes #36526	2018-12-12 16:21:02 +01:00
Tim Brooks	e63d52af63	Move page size constants to PageCacheRecycler (#36524 ) `PageCacheRecycler` is the class that creates and holds pages of arrays for various uses. `BigArrays` is just one user of these pages. This commit moves the constants that define the page sizes for the recycler to be on the recycler class.	2018-12-12 07:00:50 -07:00

... 4 5 6 7 8 ...

2364 Commits