OpenSearch

Commit Graph

Author	SHA1	Message	Date
David Turner	86ee8eab3f	Allow RerouteService to reroute at lower priority (#44338 ) Today the `BatchedRerouteService` submits its delayed reroute task at `HIGH` priority, but in some cases a lower priority would be more appropriate. This commit adds the facility to submit delayed reroute tasks at different priorities, such that each submitted reroute task runs at a priority no lower than the one requested. It does not change the fact that all delayed reroute tasks are submitted at `HIGH` priority, but at least it makes this explicit.	2019-07-15 17:41:39 +01:00
Christoph Büscher	835b7a120d	Fix AnalyzeAction response serialization (#44284 ) Currently we loose information about whether a token list in an AnalyzeAction response is null or an empty list, because we write a 0 value to the stream in both cases and deserialize to a null value on the receiving side. This change fixes this so we write an additional flag indicating whether the value is null or not, followed by the size of the list and its content. Closes #44078	2019-07-14 10:35:11 +02:00
Przemyslaw Gomulka	e23ecc5838	JSON logging refactoring and X-Opaque-ID support backport(#41354 ) (#44178 ) This is a refactor to current JSON logging to make it more open for extensions and support for custom ES log messages used inDeprecationLogger IndexingSlowLog , SearchSLowLog We want to include x-opaque-id in deprecation logs. The easiest way to have this as an additional JSON field instead of part of the message is to create a custom DeprecatedMessage (extends ESLogMEssage) These messages are regular log4j messages with a text, but also carry a map of fields which can then populate the log pattern. The logic for this lives in ESJsonLayout and ESMessageFieldConverter. Similar approach can be used to refactor IndexingSlowLog and SearchSlowLog JSON logs to contain fields previously only present as escaped JSON string in a message field. closes #41350 backport #41354	2019-07-12 16:53:27 +02:00
Yannick Welsch	068286ca4b	Remove RemoteClusterConnection.ConnectedNodes (#44235 ) This instead exposes the set of connected nodes on ConnectionManager.	2019-07-12 14:54:21 +02:00
Armin Braun	6c02cf0241	Fix InternalTestCluster StopRandomNode Assertion (#44258 ) (#44265 ) * The assertion added in #44214 is tripped by tests running dedicated test clusters per test needlessly.This breaks existing tests like the one in #44245. * Closes #44245	2019-07-12 13:18:55 +02:00
David Turner	735c897ec6	Avoid counting votes from master-ineligible nodes (#43688 ) Today if a master-eligible node is converted to a master-ineligible node it may remain in the voting configuration, meaning that the master node may count its publish responses as an indication that it has properly persisted the cluster state. However master-ineligible nodes do not properly persist the cluster state, so it is not safe to count these votes. This change adjusts `CoordinationState` to take account of this from a safety point of view, and also adjusts the `Coordinator` to prevent such nodes from joining the cluster. Instead, it triggers a reconfiguration to remove from the voting configuration a node that now appears to be master-ineligible before processing its join. Backport of #43688, see #44260.	2019-07-12 11:30:52 +01:00
Alpar Torok	8d35583c43	Fix port range allocation with large worker IDs (#44213 ) * Fix port range allocation with large worker IDs Relates to #43983 The IDs gradle uses are incremented for the lifetime of the daemon which can result in port ranges that are outside the valid range. This change implements a modulo based formula to wrap the port ranges when the IDs get too large. Adresses #44134 but #44157 is also required to be able to close it.	2019-07-12 11:04:57 +03:00
Yogesh Gaikwad	91c342a888	fix and enable repository-hdfs secure tests (#44044 ) (#44199 ) Due to recent changes are done for converting `repository-hdfs` to test clusters (#41252), the `integTestSecure*` tasks did not depend on `secureHdfsFixture` which when running would fail as the fixture would not be available. This commit adds the dependency of the fixture to the task. The `secureHdfsFixture` is a `AntFixture` which is spawned a process. Internally it waits for 30 seconds for the resources to be made available. For my local machine, it took almost 45 seconds to be available so I have added the wait time as an input to the `AntFixture` defaults to 30 seconds and set it to 60 seconds in case of secure hdfs fixture. The integ test for secure hdfs was disabled for a long time and so the changes done in #42090 to fix the tests are also done in this commit.	2019-07-12 12:44:01 +10:00
Armin Braun	2768662822	Cleanup Stale Root Level Blobs in Sn. Repository (#43542 ) (#44226 ) * Cleans up all root level temp., snap-%s.dat, meta-%s.dat blobs that aren't referenced by any snapshot to deal with dangling blobs left behind by delete and snapshot finalization failures * The scenario that get's us here is a snapshot failing before it was finalized or a delete failing right after it wrote the updated index-(N+1) that doesn't reference a snapshot anymore but then fails to remove that snapshot * Not deleting other dangling blobs since that don't follow the snap-, meta- or tempfile naming schemes to not accidentally delete blobs not created by the snapshot logic * Follow up to #42189 * Same safety logic, get list of all blobs before writing index-N blobs, delete things after index-N blobs was written	2019-07-11 19:35:15 +02:00
Armin Braun	5f22370b6b	Fix ShrinkIndexIT (#44214 ) (#44223 ) * Fix ShrinkIndexIT * Move this test suit to cluster scope. Currently, `testShrinkThenSplitWithFailedNode` stops a random node which randomly turns out to be the only shared master node so the cluster reset fails on account of the fact that no shared master node survived. * Closes #44164	2019-07-11 17:58:00 +02:00
Nick Knize	374030a53f	Upgrade to lucene-8.2.0-snapshot-860e0be5378 (#44171 ) (#44184 ) Upgrades lucene library to lucene-8.2.0-snapshot-860e0be5378	2019-07-11 09:17:22 -05:00
Alpar Torok	7ba18732f7	Run some REST tests against a cluster running in docker containers (#39515 ) * Run REST tests against a cluster running on docker Closes #38053	2019-07-11 15:28:33 +03:00
Armin Braun	8ce8c627dd	Some Cleanup in o.e.i.shard (#44097 ) (#44208 ) * Some Cleanup in o.e.i.shard * Extract one duplicated method * Cleanup obviously unused code	2019-07-11 13:52:06 +02:00
Yannick Welsch	2ee07f1ff4	Simplify port usage in transport tests (#44157 ) Simplifies AbstractSimpleTransportTestCase to use JVM-local ports and also adds an assertion so that cases like #44134 can be more easily debugged. The likely reason for that one is that a test, which was repeated again and again while always spawning a fresh Gradle worker (due to Gradle daemon) kept increasing Gradle worker IDs, causing an overflow at some point.	2019-07-11 13:35:37 +02:00
Armin Braun	c0ed64bb92	Improve Repository Consistency Check in Tests (#44204 ) * Improve Repository Consistency Check in Tests (#44099) * Check that index metadata as well as snapshot metadata always exists when referenced by other metadata * Fix SnapshotResiliencyTests on ExtraFS (#44113) * As a result of #44099 we're now checking more directories and have to ignore the `extraN` folders for those like we do for indices already * Closes #44112	2019-07-11 11:14:37 +02:00
Armin Braun	8a554f9737	Remove IncompatibleSnapshots Logic from Codebase (#44096 ) (#44183 ) * The incompatible snapshots logic was created to track 1.x snapshots that became incompatible with 2.x * It serves no purpose at this point * It adds an additional GET request to every loading of RepositoryData (from loading the incompatible snapshots blob)	2019-07-11 07:15:51 +02:00
Armin Braun	d6f09fdb97	Add WARN Logging if Mock Network Accepts Huge Number of Connections (#44169 ) (#44182 ) * Add WARN Logging if Mock Network Accepts Huge Number of Connections * As discussed, added warn logging to rule out endless accept loops for #43387 * Had to handle it by the relatively awkward override in the mock nio because we don't have logging in the NIO module where (`ServerChannelContext` lives)	2019-07-10 22:08:36 +02:00
Ryan Ernst	fb77d8f461	Removed writeTo from TransportResponse and ActionResponse (#44092 ) The base classes for transport requests and responses currently implement Streamable and Writeable. The writeTo method on these base classes is implemented with an empty implementation. Not only does this complicate subclasses to think they need to call super.writeTo, but it also can lead to not implementing writeTo when it should have been implemented, or extendiong one of these classes when not necessary, since there is nothing to actually implement. This commit removes the empty writeTo from these base classes, and fixes subclasses to not call super and in some cases implement an empty writeTo themselves. relates #34389	2019-07-10 12:42:04 -07:00
Zachary Tong	92ad588275	Remove generic on AggregatorFactory (#43664 ) (#44079 ) AggregatorFactory was generic over itself, but it doesn't appear we use this functionality anywhere (e.g. to allow the super class to declare arguments/return types generically for subclasses to override). Most places use a wildcard constraint, and even when a concrete type is specified it wasn't used. But since AggFactories are widely used, this led to the generic touching many pieces of code and making type signatures fairly complex	2019-07-10 13:20:28 -04:00
David Turner	aec44fecbc	Decouple DiskThresholdMonitor & ClusterInfoService (#44105 ) Today the `ClusterInfoService` requires the `DiskThresholdMonitor` at construction time so that it can notify it when nodes report changes in their disk usage, but this is awkward to construct: the `DiskThresholdMonitor` requires a `RerouteService` which requires an `AllocationService` which comees from the `ClusterModule` which requires the `ClusterInfoService`. Today we break the cycle with a `LazilyInitializedRerouteService` which is itself a little ugly. This commit replaces this with a more traditional subject/observer relationship between the `ClusterInfoService` and the `DiskThresholdMonitor`.	2019-07-09 18:43:32 +01:00
David Turner	268971db03	Wait for blackholed connection before discovery (#44077 ) Since #42636 we no longer treat connections specially when simulating a blackholed connection. This means that at the end of the safety phase we may have just started a connection attempt which will time out, but the default timeout is 30 seconds, much longer than the 2 seconds we normally allow for post-safety-phase discovery. This commit adds time for such a connection attempt to time out. It also fixes some spurious logging of `this` that now refers to an object with an unhelpful `toString()` implementation introduced in #42636. Fixes #44073	2019-07-09 10:59:53 +01:00
Armin Braun	9eac5ceb1b	Dry up inputstream to bytesreference (#43675 ) (#44094 ) * Dry up Reading InputStream to BytesReference * Dry up spots where we use the same pattern to get from an InputStream to a BytesReferences	2019-07-09 09:18:25 +02:00
David Turner	6dce458ecc	Randomise retention lease expiry time (#44067 ) In today's test suite indices mostly use the default value of `12h` for the `index.soft_deletes.retention_lease.period` setting, which in the context of the test suite essentially means "never expires". In fact, the tests should all behave correctly even if the lease period is much shorter; tests that rely on leases not expiring should configure their indices appropriately. This commit randomises the lease expiry time for those indices created during tests which do not set a specific value for this setting.	2019-07-08 18:29:27 +02:00
Alpar Torok	0c8294e633	Make sure the clean task doesn't break test fixtures (#43641 ) Use a dedicated fixture dir.	2019-07-08 17:58:27 +03:00
Armin Braun	afe81fd625	Some Cleanup in Test Framework (#44039 ) (#44059 ) * Remove some obvious dead code * Move assert methods that were only used in a single test class to the child they belong to * Inline some redundant methods	2019-07-08 14:15:31 +02:00
Armin Braun	af9b98e81c	Recursively Delete Unreferenced Index Directories (#42189 ) (#44051 ) * Use ability to list child "folders" in the blob store to implement recursive delete on all stale index folders when cleaning up instead of using the diff between two `RepositoryData` instances to cover aborted deletes * Runs after ever delete operation * Relates #13159 (fixing most of this issues caused by unreferenced indices, leaving some meta files to be cleaned up only)	2019-07-08 10:55:39 +02:00
Nhat Nguyen	9089820d8f	Enable indexing optimization using sequence numbers on replicas (#43616 ) This PR enables the indexing optimization using sequence numbers on replicas. With this optimization, indexing on replicas should be faster and use less memory as it can forgo the version lookup when possible. This change also deactivates the append-only optimization on replicas. Relates #34099	2019-07-05 22:12:08 -04:00
Yannick Welsch	504a43d43a	Move ConnectionManager to async APIs (#42636 ) This commit converts the ConnectionManager's openConnection and connectToNode methods to async-style. This will allow us to not block threads anymore when opening connections. This PR also adapts the cluster coordination subsystem to make use of the new async APIs, allowing to remove some hacks in the test infrastructure that had to account for the previous synchronous nature of the connection APIs.	2019-07-05 20:40:22 +02:00
Yannick Welsch	d090fa514f	Use unique ports per test worker (#43983 ) * Use unique ports per test worker * Add test for system property * check presence of tests.gradle * Revert "check presence of tests.gradle" This reverts commit 2fee7512a28f95c94c5bf7a3312e808f918a9510.	2019-07-05 11:02:28 +02:00
Jim Ferenczi	cdf55cb5c5	Refactor index engines to manage readers instead of searchers (#43860 ) This commit changes the way we manage refreshes in the index engines. Instead of relying on a SearcherManager, this change uses a ReaderManager that creates ElasticsearchDirectoryReader when needed. Searchers are now created on-demand (when acquireSearcher is called) from the current ElasticsearchDirectoryReader. It also slightly changes the Engine.Searcher to extend IndexSearcher in order to simplify the usage in the consumer.	2019-07-04 22:49:43 +02:00
Alan Woodward	4b99255fed	Add name() method to TokenizerFactory (#43909 ) This brings TokenizerFactory into line with CharFilterFactory and TokenFilterFactory, and removes the need to pass around tokenizer names when building custom analyzers. As this means that TokenizerFactory is no longer a functional interface, the commit also adds a factory method to TokenizerFactory to make construction simpler.	2019-07-04 11:28:55 +01:00
Armin Braun	be20fb80e4	Recursive Delete on BlobContainer (#43281 ) (#43920 ) This is a prerequisite of #42189: * Add directory delete method to blob container specific to each implementation: * Some notes on the implementations: * AWS + GCS: We can simply exploit the fact that both AWS and GCS return blobs lexicographically ordered which allows us to simply delete in the same order that we receive the blobs from the listing request. For AWS this simply required listing without the delimiter setting (so we get a deep listing) and for GCS the same behavior is achieved by not using the directory mode on the listing invocation. The nice thing about this is, that even for very large numbers of blobs the memory requirements are now capped nicely since we go page by page when deleting. * For Azure I extended the parallelization to the listing calls as well and made it work recursively. I verified that this works with thread count `1` since we only block once in the initial thread and then fan out to a "graph" of child listeners that never block. * HDFS and FS are trivial since we have directory delete methods available for them * Enhances third party tests to ensure the new functionality works (I manually ran them for all cloud providers)	2019-07-03 17:14:57 +02:00
Armin Braun	455b12a4fb	Add Ability to List Child Containers to BlobContainer (#42653 ) (#43903 ) * Add Ability to List Child Containers to BlobContainer (#42653) * Add Ability to List Child Containers to BlobContainer * This is a prerequisite of #42189	2019-07-03 11:30:49 +02:00
Armin Braun	826f38cd70	Enable Parallel Deletes in Azure Repository (#42783 ) (#43886 ) * Parallel deletes via private thread pool	2019-07-03 09:28:39 +02:00
Zachary Tong	ea1794832f	Add RareTerms aggregation (#35718 ) This adds a `rare_terms` aggregation. It is an aggregation designed to identify the long-tail of keywords, e.g. terms that are "rare" or have low doc counts. This aggregation is designed to be more memory efficient than the alternative, which is setting a terms aggregation to size: LONG_MAX (or worse, ordering a terms agg by count ascending, which has unbounded error). This aggregation works by maintaining a map of terms that have been seen. A counter associated with each value is incremented when we see the term again. If the counter surpasses a predefined threshold, the term is removed from the map and inserted into a cuckoo filter. If a future term is found in the cuckoo filter we assume it was previously removed from the map and is "common". The map keys are the "rare" terms after collection is done.	2019-07-01 10:30:02 -04:00
Nhat Nguyen	598e00a689	Make peer recovery send file info step async (#43792 ) Relates #36195	2019-07-01 08:40:45 -04:00
Ryan Ernst	3a2c698ce0	Rename Action to ActionType (#43778 ) Action is a class that encapsulates meta information about an action that allows it to be called remotely, specifically the action name and response type. With recent refactoring, the action class can now be constructed as a static constant, instead of needing to create a subclass. This makes the old pattern of creating a singleton INSTANCE both misnamed and lacking a common placement. This commit renames Action to ActionType, thus allowing the old INSTANCE naming pattern to be TYPE on the transport action itself. ActionType also conveys that this class is also not the action itself, although this change does not rename any concrete classes as those will be removed organically as they are converted to TYPE constants. relates #34389	2019-06-30 22:00:17 -07:00
David Turner	fca7a19713	Avoid parallel reroutes in DiskThresholdMonitor (#43381 ) Today the `DiskThresholdMonitor` limits the frequency with which it submits reroute tasks, but it might still submit these tasks faster than the master can process them if, for instance, each reroute takes over 60 seconds. This causes a problem since the reroute task runs with priority `IMMEDIATE` and is always scheduled when there is a node over the high watermark, so this can starve any other pending tasks on the master. This change avoids further updates from the monitor while its last task(s) are still in progress, and it measures the time of each update from the completion time of the reroute task rather than its start time, to allow a larger window for other tasks to run. It also now makes use of the `RoutingService` to submit the reroute task, in order to batch this task with any other pending reroutes. It enhances the `RoutingService` to notify its listeners on completion. Fixes #40174 Relates #42559	2019-06-30 16:54:16 +01:00
Nhat Nguyen	55b3ec8d7b	Make peer recovery clean files step async (#43787 ) Relates #36195	2019-06-29 18:30:51 -04:00
Albert Zaharovits	5e17bc5dcc	Consistent Secure Settings #40416 Introduces a new `ConsistentSecureSettingsValidatorService` service that exposes a single public method, namely `allSecureSettingsConsistent`. The method returns `true` if the local node's secure settings (inside the keystore) are equal to the master's, and `false` otherwise. Technically, the local node has to have exactly the same secure settings - setting names should not be missing or in surplus - for all `SecureSetting` instances that are flagged with the newly introduced `Property.Consistent`. It is worth highlighting that the `allSecureSettingsConsistent` is not a consensus view across the cluster, but rather the local node's perspective in relation to the master.	2019-06-29 23:26:17 +03:00
Jim Ferenczi	7ca69db83f	Refactor IndexSearcherWrapper to disallow the wrapping of IndexSearcher (#43645 ) This change removes the ability to wrap an IndexSearcher in plugins. The IndexSearcherWrapper is replaced by an IndexReaderWrapper and allows to wrap the DirectoryReader only. This simplifies the creation of the context IndexSearcher that is used on a per request basis. This change also moves the optimization that was implemented in the security index searcher wrapper to the ContextIndexSearcher that now checks the live docs to determine how the search should be executed. If the underlying live docs is a sparse bit set the searcher will compute the intersection betweeen the query and the live docs instead of checking the live docs on every document that match the query.	2019-06-28 16:28:02 +02:00
Christoph Büscher	2cc7f5a744	Allow reloading of search time analyzers (#43313 ) Currently changing resources (like dictionaries, synonym files etc...) of search time analyzers is only possible by closing an index, changing the underlying resource (e.g. synonym files) and then re-opening the index for the change to take effect. This PR adds a new API endpoint that allows triggering reloading of certain analysis resources (currently token filters) that will then pick up changes in underlying file resources. To achieve this we introduce a new type of custom analyzer (ReloadableCustomAnalyzer) that uses a ReuseStrategy that allows swapping out analysis components. Custom analyzers that contain filters that are markes as "updateable" will automatically choose this implementation. This PR also adds this capability to `synonym` token filters for use in search time analyzers. Relates to #29051	2019-06-28 09:55:40 +02:00
Nhat Nguyen	ce8771feb7	Do not use MockInternalEngine in GatewayIndexStateIT (#43716 ) GatewayIndexStateIT#testRecoverBrokenIndexMetadata replies on the flushing on shutdown. This behaviour, however, can be randomly disabled in MockInternalEngine. Closes #43034	2019-06-27 18:28:04 -04:00
Yannick Welsch	6744344ef2	Handle situation where only voting-only nodes are bootstrapped (#43628 ) Adds support for the situation where only voting-only nodes are bootstrapped. In that case, they will still try to become elected and bring full master nodes into the cluster.	2019-06-27 18:10:15 +02:00
David Roberts	c5beb05f77	[ML][DataFrame] Consider data frame templates internal in REST tests (#43692 ) The data frame index template pattern was not in the list considered as internal and therefore not needing cleanup after every test.	2019-06-27 14:40:30 +01:00
Christoph Büscher	36360358b2	Move query builder caching check to dedicated tests (#43238 ) Currently `AbstractQueryTestCase#testToQuery` checks the search context cachable flag. This is a bit fragile due to the high randomization of query builders performed by this general test. Also we might only rarely check the "interesting" cases because they rarely get generated when fully randomizing the query builder. This change moved the general checks out ot #testToQuery and instead adds dedicated cache tests for those query builders that exhibit something other than the default behaviour. Closes #43200	2019-06-27 14:56:29 +02:00
Yannick Welsch	2049f715b3	Add voting-only master node (#43410 ) A voting-only master-eligible node is a node that can participate in master elections but will not act as a master in the cluster. In particular, a voting-only node can help elect another master-eligible node as master, and can serve as a tiebreaker in elections. High availability (HA) clusters require at least three master-eligible nodes, so that if one of the three nodes is down, then the remaining two can still elect a master amongst them-selves. This only requires one of the two remaining nodes to have the capability to act as master, but both need to have voting powers. This means that one of the three master-eligible nodes can be made as voting-only. If this voting-only node is a dedicated master, a less powerful machine or a smaller heap-size can be chosen for this node. Alternatively, a voting-only non-dedicated master node can play the role of the third master-eligible node, which allows running an HA cluster with only two dedicated master nodes. Closes #14340 Co-authored-by: David Turner <david.turner@elastic.co>	2019-06-26 08:07:56 +02:00
Zachary Tong	63fef5a31e	Add scripting support to AggregatorTestCase (#43494 ) This refactors AggregatorTestCase to allow testing mock scripts. The main change is to QueryShardContext. This was previously mocked, but to get the ScriptService you have to invoke a final method which can't be mocked. Instead, we just create a mostly-empty QueryShardContext and populate the fields that are needed for testing. It also introduces a few new helper methods that can be overridden to change the default behavior a bit. Most tests should be able to override getMockScriptService() to supply a ScriptService to the context, which is later used by the aggs. More complicated tests can override queryShardContextMock() as before. Adds a test to MaxAggregatorTests to test out the new functionality.	2019-06-25 11:52:12 -04:00
Yannick Welsch	d45f12799c	Sync global checkpoint on pending in-sync shards (#43526 ) At the end of a peer recovery the primary wants to mark the replica as in-sync. For that the persisted local checkpoint of the replica needs to have caught up with the global checkpoint on the primary. If translog durability is set to ASYNC, this means that information about the persisted local checkpoint can lag on the primary and might need to be explicitly fetched through a global checkpoint sync action. Unfortunately, that action will only be triggered after 30 seconds, and, even worse, will only run based on what the in-sync shard copies say (see IndexShard.maybeSyncGlobalCheckpoint). As the replica has not been marked as in-sync yet, it is not taken into consideration, and the primary might have its global checkpoint equal to the max seq no, so it thinks nothing needs to be done. Closes #43486	2019-06-24 18:35:57 +02:00
Armin Braun	1053a89b79	Log Blocked IO Thread State (#43424 ) (#43447 ) * Let's log the state of the thread to find out if it's dead-locked or just stuck after being suspended * Relates #43392	2019-06-20 22:31:36 +02:00
Yannick Welsch	29d76baf7d	Increase timeout for assertSeqNos Helps with tests that do async translog syncing	2019-06-20 19:06:49 +02:00
Zachary Tong	a8a81200d0	Better support for unmapped fields in AggregatorTestCase (#43405 ) AggregatorTestCase will NPE if only a single, null MappedFieldType is provided (which is required to simulate an unmapped field). While it's possible to test unmapped fields by supplying other, non-related field types... that's clunky and unnecessary. AggregatorTestCase just needs to filter out null field types when setting up.	2019-06-20 11:31:49 -04:00
Armin Braun	7d1983a7e3	Fix Operation Timestamps in Tests (#43155 ) (#43419 ) * For the issue in #43086 we were running into inactive shards because the random timestamps previously used would randomly make `org.elasticsearch.index.shard.IndexShard#checkIdle` see an incorrect+huge inactive time * Also fixed one other spot in tests that passed `ms` instead of `ns` for the same timestamp on an index op to correctly use relative `ns` * Closes #43086	2019-06-20 16:36:17 +02:00
David Turner	c8eb09f158	Fail connection attempts earlier in tests (#43320 ) Today the `DisruptibleMockTransport` always allows a connection to a node to be established, and then fails requests sent to that node such as the subsequent handshake. Since #42342, we log handshake failures on an open connection as a warning, and this makes the test logs rather noisy. This change fails the connection attempt first, avoiding these unrealistic warnings.	2019-06-20 14:45:24 +01:00
Armin Braun	5af9387fad	Fix Stuck IO Thread Logging Time Precision (#42882 ) * The precision of the timestamps we get from the cached time thread is only 200ms by default resulting in a number of needless ~200ms slow network thread execution logs * Fixed by making the warn threshold a function of the precision of the cached time thread found in the settings	2019-06-20 14:26:20 +02:00
Yannick Welsch	7f8e1454ab	Advance checkpoints only after persisting ops (#43205 ) Local and global checkpoints currently do not correctly reflect what's persisted to disk. The issue is that the local checkpoint is adapted as soon as an operation is processed (but not fsynced yet). This leaves room for the history below the global checkpoint to still change in case of a crash. As we rely on global checkpoints for CCR as well as operation-based recoveries, this has the risk of shard copies / follower clusters going out of sync. This commit required changing some core classes in the system: - The LocalCheckpointTracker keeps track now not only of the information whether an operation has been processed, but also whether that operation has been persisted to disk. - TranslogWriter now keeps track of the sequence numbers that have not been fsynced yet. Once they are fsynced, TranslogWriter notifies LocalCheckpointTracker of this. - ReplicationTracker now keeps track of the persisted local and persisted global checkpoints of all shard copies when in primary mode. The computed global checkpoint (which represents the minimum of all persisted local checkpoints of all in-sync shard copies), which was previously stored in the checkpoint entry for the local shard copy, has been moved to an extra field. - The periodic global checkpoint sync now also takes async durability into account, where the local checkpoints on shards only advance when the translog is asynchronously fsynced. This means that the previous condition to detect inactivity (max sequence number is equal to global checkpoint) is not sufficient anymore. - The new index closing API does not work when combined with async durability. The shard verification step is now requires an additional pre-flight step to fsync the translog, so that the main verify shard step has the most up-to-date global checkpoint at disposition.	2019-06-20 11:12:38 +02:00
Mark Vieira	0867ea75b7	Properly format reproduction lines for test methods that contain periods (#43255 )	2019-06-18 09:17:47 -07:00
Nhat Nguyen	0c5086d2f3	Rebuild version map when opening internal engine (#43202 ) With this change, we will rebuild the live version map and local checkpoint using documents (including soft-deleted) from the safe commit when opening an internal engine. This allows us to safely prune away _id of all soft-deleted documents as the version map is always in-sync with the Lucene index. Relates #40741 Supersedes #42979	2019-06-17 18:08:09 -04:00
Martijn Laarman	8b1b9f8ab9	Introduce stability description to the REST API specification (#38413 ) (#43278 ) * introduce state to the REST API specification * change state over to stability * CCR is no GA updated to stable * SQL is now GA so marked as stable * Introduce `internal` as state for API's, marks stable in terms of lifetime but unstable in terms of guarantees on its output format since it exposes internal representations * make setting a wrong stability value, or not setting it at all an error that causes the YAML test suite to fail * update spec files to be explicit about their stability state * Document the fact that stability needs to be defined Otherwise the YAML test runner will fail (with a nice exception message) * address check style violations * update rest spec unit tests to include stability * found one more test spec file not declaring stability, made sure stability appears after documentation everywhere * cluster.state is stable, mark response in some way to denote its a key value format that can be changed during minors * mark data frame API's as beta * remove internal and private as states for an API * removed the wrong enum values in the Stability Enum in the previous commit (cherry picked from commit 61c34bbd92f8f7e5f22fa411c6b682b0ebd8a99d)	2019-06-17 16:57:13 +02:00
Henning Andersen	ba15d08e14	Allow cluster access during node restart (#42946 ) (#43272 ) This commit modifies InternalTestCluster to allow using client() and other operations inside a RestartCallback (onStoppedNode typically). Restarting nodes are now removed from the map and thus all methods now return the state as if the restarting node does not exist. This avoids various exceptions stemming from accessing the stopped node(s).	2019-06-17 15:04:17 +02:00
Christoph Büscher	7af23324e3	SimpleQ.S.B and QueryStringQ.S.B tests should avoid `now` in query (#43199 ) Currently the randomization of the q.b. in these tests can create query strings that can cause caching to be disabled for this query if we query all fields and there is a date field present. This is pretty much an anomaly that we shouldn't generally test for in the "testToQuery" tests where cache policies are checked. This change makes sure we don't create offending query strings so the cache checks never hit these cases and adds a special test method to check this edge case. Closes #43112	2019-06-14 11:21:48 +02:00
Alpar Torok	7cc6dca697	Remove explicily enabled build fixture task	2019-06-14 10:42:08 +03:00
Jason Tedor	5bc3b7f741	Enable node roles to be pluggable (#43175 ) This commit introduces the possibility for a plugin to introduce additional node roles.	2019-06-13 15:15:48 -04:00
Yogesh Gaikwad	4ae1e30a98	Enable krb5kdc-fixture, kerberos tests mount urandom for kdc container (#41710 ) (#43178 ) Infra has fixed #10462 by installing `haveged` on CI workers. This commit enables the disabled fixture and tests, and mounts `/dev/urandom` for the container so there is enough entropy required for kdc. Note: hdfs-repository tests have been disabled, will raise a separate issue for it. Closes #40624 Closes #40678	2019-06-13 13:02:16 +10:00
Alan Woodward	9de1c69c28	IndexAnalyzers doesn't need to extend AbstractIndexComponent (#43149 ) AIC doesn't add anything here, and it removes the need to pass index settings to the constructor.	2019-06-12 17:48:31 +01:00
Henning Andersen	6108899a2d	Fix unresponsive network simulation (#42579 ) Unresponsive network simulation would throw away requests. However, then we no longer have any guarantees that a transport action either succeeds or fails, which could lead to hangs (example: unclosed IndexShard permits). Closes #42244	2019-06-11 17:54:48 +02:00
Nhat Nguyen	f2e66e22eb	Increase waiting time when check retention locks (#42994 ) WriteActionsTests#testBulk and WriteActionsTests#testIndex sometimes fail with a pending retention lock. We might leak retention locks when switching to async recovery. However, it's more likely that ongoing recoveries prevent the retention lock from releasing. This change increases the waiting time when we check for no pending retention lock and also ensures no ongoing recovery in WriteActionsTests. Closes #41054	2019-06-10 17:58:37 -04:00
Jason Tedor	915d2f2daa	Refactor put mapping request validation for reuse (#43005 ) This commit refactors put mapping request validation for reuse. The concrete case that we are after here is the ability to apply effectively the same framework to indices aliases requests. This commit refactors the put mapping request validation framework to allow for that.	2019-06-09 10:19:04 -04:00
henryptung	61b62125b8	Wire query cache into sorting nested-filter computation (#42906 ) Don't use Lucene's default query cache when filtering in sort. Closes #42813	2019-06-06 21:16:58 +02:00
Gordon Brown	6eb4600e93	Add custom metadata to snapshots (#41281 ) Adds a metadata field to snapshots which can be used to store arbitrary key-value information. This may be useful for attaching a description of why a snapshot was taken, tagging snapshots to make categorization easier, or identifying the source of automatically-created snapshots.	2019-06-05 17:30:31 -06:00
Przemyslaw Gomulka	ab5bc83597	Deprecation info for joda-java migration on 7.x (#42659 ) Some clusters might have been already migrated to version 7 without being warned about the joda-java migration changes. Deprecation api on that version will give them guidance on what patterns need to be changed. relates. This change is using the same logic like in 6.8 that is: verifying the pattern is from the incompatible set ('y'-Y', 'C', 'Z' etc), not from predifined set, not prefixed with 8. AND was also created in 6.x. Mappings created in 7.x are considered migrated and should not generate warnings There is no pipeline check (present on 6.8) as it is impossible to verify when the pipeline was created, and therefore to make sure the format is depracated or not #42010	2019-06-05 19:50:04 +02:00
Mark Vieira	e44b8b1e2e	[Backport] Remove dependency substitutions 7.x (#42866 ) * Remove unnecessary usage of Gradle dependency substitution rules (#42773) (cherry picked from commit 12d583dbf6f7d44f00aa365e34fc7e937c3c61f7)	2019-06-04 13:50:23 -07:00
Tim Vernum	8de3a88205	Log the status of security on license change (#42741 ) Whether security is enabled/disabled is dependent on the combination of the node settings and the cluster license. This commit adds a license state listener that logs when the license change causes security to switch state (or to be initialised). This is primarily useful for diagnosing cluster formation issues. Backport of: #42488	2019-06-04 14:25:43 +10:00
David Turner	df0f0b3d40	Rename autoMinMasterNodes to autoManageMasterNodes (#42789 ) Renames the `ClusterScope` attribute `autoMinMasterNodes` to reflect its broader meaning since 7.0. Backport of the relevant part of #42700 to `7.x`.	2019-06-03 12:12:07 +01:00
Jason Tedor	371cb9a8ce	Remove Log4j 1.2 API as a dependency (#42702 ) We had this as a dependency for legacy dependencies that still needed the Log4j 1.2 API. This appears to no longer be necessary, so this commit removes this artifact as a dependency. To remove this dependency, we had to fix a few places where we were accidentally relying on Log4j 1.2 instead of Log4j 2 (easy to do, since both APIs were on the compile-time classpath). Finally, we can remove our custom Netty logger factory. This was needed when we were on Log4j 1.2 and handled logging in our own unique way. When we migrated to Log4j 2 we could have dropped this dependency. However, even then Netty would still pick up Log4j 1.2 since it was on the classpath, thus the advantage to removing this as a dependency now.	2019-05-30 16:08:07 -04:00
Marios Trivyzas	ce30afcd01	Deprecate CommonTermsQuery and cutoff_frequency (#42619 ) (#42691 ) Since the max_score optimization landed in Elasticsearch 7, the CommonTermsQuery is redundant and slower. Moreover the cutoff_frequency parameter for MatchQuery and MultiMatchQuery is redundant. Relates to #27096 (cherry picked from commit 04b74497314eeec076753a33b3b6cc11549646e8)	2019-05-30 18:04:47 +02:00
Armin Braun	0e92ef1843	Fix Incorrect Time Math in MockTransport (#42595 ) (#42617 ) * Fix Incorrect Time Math in MockTransport * The timeunit here must be nanos for the current time (we even convert it accordingly in the logging) * Also, changed the log message when dumping stack traces a little to make it easier to grep for (otherwise it's the same as the message on unregister)	2019-05-28 17:58:23 +02:00
David Turner	746a2f41fd	Remove PRE_60_NODE_CHECKPOINT (#42531 ) This commit removes the obsolete `PRE_60_NODE_CHECKPOINT` constant for dealing with 5.x nodes' lack of sequence number support. Backport of #42527	2019-05-28 12:25:53 +01:00
Armin Braun	44bf784fe1	Add Infrastructure to Run 3rd Party Repository Tests (#42586 ) (#42604 ) * Add Infrastructure to Run 3rd Party Repository Tests * Add infrastructure to run third party repository tests using our standard JUnit infrastructure * This is a prerequisite of #42189	2019-05-28 10:46:22 +02:00
Armin Braun	c4f44024af	Remove Delete Method from BlobStore (#41619 ) (#42574 ) * Remove Delete Method from BlobStore (#41619) * The delete method on the blob store was used almost nowhere and just duplicates the delete method on the blob containers * The fact that it provided for some recursive delete logic (that did not behave the same way on all implementations) was not used and not properly tested either	2019-05-27 12:24:20 +02:00
Armin Braun	b68358945f	Dump Stacktrace on Slow IO-Thread Operations (#42000 ) (#42572 ) * Dump Stacktrace on Slow IO-Thread Operations * Follow up to #39729 extending the functionality to actually dump the stack when the thread is blocked not afterwards * Logging the stacktrace after the thread became unblocked is only of limited use because we don't know what happened in the slow callback from that (only whether we were blocked on a read,write,connect etc.) * Relates #41745	2019-05-27 11:44:36 +02:00
Armin Braun	489616da62	Fix testTracerLog Network Tests (#42286 ) (#42565 ) * Fix testTracerLog Network Tests * Start appender before using it like we do for e.g. the Netty leak detection appender to avoid interference from actions on the network threads that might still be dangling from previous tests in the same suite * Closes #41890	2019-05-27 11:39:59 +02:00
Nhat Nguyen	02739d038c	Mute accounting circuit breaker check after test (#42448 ) If we close an engine while a refresh is happening, then we might leak refCount of some SegmentReaders. We need to skip the ram accounting circuit breaker check until we have a new Lucene snapshot which includes the fix for LUCENE-8809. This also adds a test to the engine but left it muted so we won't forget to reenable this check. Closes #30290	2019-05-24 15:42:12 -04:00
Simon Willnauer	46ccfba808	Remove IndexStore and DirectoryService (#42446 ) Both of these classes are basically a bloated wrapper around a simple construct that can simply be a DirectoryFactory interface. This change removes both classes and replaces them with a simple stateless interface that creates a new `Directory` per shard. The concept of `index.store` is preserved since it makes sense from a configuration perspective.	2019-05-24 12:14:56 +02:00
David Turner	f864f6a740	Cluster state from API should always have a master (#42454 ) Today the `TransportClusterStateAction` ignores the state passed by the `TransportMasterNodeAction` and obtains its state from the cluster applier. This might be inconsistent, showing a different node as the master or maybe even having no master. This change adjusts the action to use the passed-in state directly, and adds tests showing that the state returned is consistent with our expectations even if there is a concurrent master failover. Fixes #38331 Relates #38432	2019-05-24 08:45:22 +01:00
Simon Willnauer	a79cd77e5c	Remove IndexShard dependency from Repository (#42213 ) * Remove IndexShard dependency from Repository In order to simplify repository testing especially for BlobStoreRepository it's important to remove the dependency on IndexShard and reduce it to Store and MapperService (in the snapshot case). This significantly reduces the dependcy footprint for Repository and allows unittesting without starting nodes or instantiate entire shard instances. This change deprecates the old method signatures and adds a unittest for FileRepository to show the advantage of this change. In addition, the unittesting surfaced a bug where the internal file names that are private to the repository were used in the recovery stats instead of the target file names which makes it impossible to relate to the actual lucene files in the recovery stats. * don't delegate deprecated methods * apply comments * test	2019-05-22 14:27:11 +02:00
Yannick Welsch	770d8e9e39	Remove usage of max_local_storage_nodes in test infrastructure (#41652 ) Moves the test infrastructure away from using node.max_local_storage_nodes, allowing us in a follow-up PR to deprecate this setting in 7.x and to remove it in 8.0. This also changes the behavior of InternalTestCluster so that starting up nodes will not automatically reuse data folders of previously stopped nodes. If this behavior is desired, it needs to be explicitly done by passing the data path from the stopped node to the new node that is started.	2019-05-22 11:04:55 +02:00
Christoph Büscher	3e59c31a12	Change IndexAnalyzers default analyzer access (#42011 ) Currently IndexAnalyzers keeps the three default as separate class members although they should refer to the same analyzers held in the additional analyzers map under the default names. This assumption should be made more explicit by keeping all analyzers in the map. This change adapts the constructor to check all the default entries are there and the getters to reach into the map with the default names when needed.	2019-05-10 18:08:51 +02:00
David Turner	4c909e93bb	Reject port ranges in `discovery.seed_hosts` (#41905 ) Today Elasticsearch accepts, but silently ignores, port ranges in the `discovery.seed_hosts` setting: ``` discovery.seed_hosts: 10.1.2.3:9300-9400 ``` Silently ignoring part of a setting like this is trappy. With this change we reject seed host addresses of this form. Closes #40786 Backport of #41404	2019-05-08 08:34:32 +01:00
Alan Woodward	4cca1e8fff	Correct spelling of MockLogAppender.PatternSeenEventExpectation (#41893 ) The class was called PatternSeenEventExcpectation. This commit is a straight class rename to correct the spelling.	2019-05-07 17:28:51 +01:00
Henning Andersen	f068a22f5f	SeqNo CAS linearizability (#38561 ) Add a test that stresses concurrent writes using ifSeqno/ifPrimaryTerm to do CAS style updates. Use linearizability checker to verify linearizability. Linearizability of successful CAS'es is guaranteed. Changed linearizability checker to allow collecting history concurrently. Changed unresponsive network simulation to wake up immediately when network disruption is cleared to ensure tests proceed in a timely manner (and this also seems more likely to provoke issues).	2019-05-07 14:04:38 +02:00
Jim Ferenczi	70bf432fa8	Fix full text queries test that start with now (#41854 ) Full text queries that start with now are not cacheable if they target a date field. However we assume in the query builder tests that all queries are cacheable and this assumption fails when the random generated query string starts with "now". This fails twice in several years since the probability that a random string starts with "now" is low but this commit ensures that isCacheable is correctly checked for full text queries that fall into this edge case. Closes #41847	2019-05-06 19:08:30 +02:00
Tim Brooks	927013426a	Read multiple TLS packets in one read call (#41820 ) This is related to #27260. Currently we have a single read buffer that is no larger than a single TLS packet. This prevents us from reading multiple TLS packets in a single socket read call. This commit modifies our TLS work to support reading similar to the plaintext case. The data will be copied to a (potentially) recycled TLS packet-sized buffer for interaction with the SSLEngine.	2019-05-06 09:51:32 -06:00
Nhat Nguyen	c7924014fa	Verify consistency of version and source in disruption tests (#41614 ) (#41661 ) With this change, we will verify the consistency of version and source (besides id, seq_no, and term) of live documents between shard copies at the end of disruption tests.	2019-05-03 18:47:14 -04:00
Jay Modi	8421e38887	Do not print null method name in reproduce line (#41691 ) This commit updates the reproduce line that is printed out when a test fails so that it does not output `.null` as the method name when the failure is not a specific method but a class level issue such as threads being leaked from the SUITE. Previously, when this occurred the reproduce line would look like: `./gradlew :server:integTest --tests "org.elasticsearch.indices.memory.breaker.CircuitBreakerServiceIT.null"` and after this change, the line no longer contains the `.null` after the class name.	2019-05-02 12:20:07 -06:00
Sandmannn	728fe2d409	Small correction in comments (#41623 )	2019-05-02 15:30:18 +03:00
Nhat Nguyen	887f3f2c83	Simplify initialization of max_seq_no of updates (#41161 ) Today we choose to initialize max_seq_no_of_updates on primaries only so we can deal with a situation where a primary is on an old node (before 6.5) which does not have MUS while replicas on new nodes (6.5+). However, this strategy is quite complex and can lead to bugs (for example #40249) since we have to assign a correct value (not too low) to MSU in all possible situations (before recovering from translog, restoring history on promotion, and handing off relocation). Fortunately, we don't have to deal with this BWC in 7.0+ since all nodes in the cluster should have MSU. This change simplifies the initialization of MSU by always assigning it a correct value in the constructor of Engine regardless of whether it's a replica or primary. Relates #33842	2019-04-30 15:14:52 -04:00
Tim Brooks	df3ef66294	Remove dedicated SSL network write buffer (#41654 ) This is related to #27260. Currently for the SSLDriver we allocate a dedicated network write buffer and encrypt the data into that buffer one buffer at a time. This requires constantly switching between encrypting and flushing. This commit adds a dedicated outbound buffer for SSL operations that will internally allocate new packet sized buffers as they are need (for writing encrypted data). This allows us to totally encrypt an operation before writing it to the network. Eventually it can be hooked up to buffer recycling. This commit also backports the following commit: Handle WRAP ops during SSL read It is possible that a WRAP operation can occur while decrypting handshake data in TLS 1.3. The SSLDriver does not currently handle this well as it does not have access to the outbound buffer during read call. This commit moves the buffer into the Driver to fix this issue. Data wrapped during a read call will be queued for writing after the read call is complete.	2019-04-29 17:59:13 -06:00
Nhat Nguyen	615a0211f0	Recovery should not indefinitely retry on mapping error (#41099 ) A stuck peer recovery in #40913 reveals that we indefinitely retry on new cluster states if indexing translog operations hits a mapper exception. We should not wait and retry if the mapping on the target is as recent as the mapping that the primary used to index the replaying operations. Relates #40913	2019-04-27 10:55:08 -04:00
Armin Braun	aad33121d8	Async Snapshot Repository Deletes (#40144 ) (#41571 ) Motivated by slow snapshot deletes reported in e.g. #39656 and the fact that these likely are a contributing factor to repositories accumulating stale files over time when deletes fail to finish in time and are interrupted before they can complete. * Makes snapshot deletion async and parallelizes some steps of the delete process that can be safely run concurrently via the snapshot thread poll * I did not take the biggest potential speedup step here and parallelize the shard file deletion because that's probably better handled by moving to bulk deletes where possible (and can still be parallelized via the snapshot pool where it isn't). Also, I wanted to keep the size of the PR manageable. * See https://github.com/elastic/elasticsearch/pull/39656#issuecomment-470492106 * Also, as a side effect this gives the `SnapshotResiliencyTests` a little more coverage for master failover scenarios (since parallel access to a blob store repository during deletes is now possible since a delete isn't a single task anymore). * By adding a `ThreadPool` reference to the repository this also lays the groundwork to parallelizing shard snapshot uploads to improve the situation reported in #39657	2019-04-26 15:36:09 +02:00

1 2 3 4 5 ...

2162 Commits