When three threads try to write checksums at the same time, it is possible for all three to obtain the same checksum file name A. The first thread enters the synchronized section, creates the file with name A, and exits. The second thread enters the synchronized section, sees that A exists, creates file A+1, and exits the critical section. It then proceeds to clean up and deletes all checksum files, including A. If that happens before the third thread enters the synchronized section, the third thread checks for A and, since it no longer exists, creates the checksum file A a second time, which triggers the "file _checksums-XXXXXXXXXXXXX was already written to" exception in MockDirectoryWrapper and fails recovery.
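A minimal, self-contained sketch of the interleaving (the class and method names here are hypothetical, not the actual recovery code):

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch: each step is individually synchronized, but a cleanup
    // between two writers lets a historical name be reused.
    class ChecksumNames {
        private final Set<String> existing = new HashSet<>();

        // check-then-create: picks the first name not currently on "disk"
        synchronized String create(String base) {
            String name = base;
            while (existing.contains(name)) {
                name = name + "+1";
            }
            existing.add(name);
            return name;
        }

        // deletes all checksum files, including ones other threads derived names from
        synchronized void cleanup() {
            existing.clear();
        }
    }

    // Interleaving that triggers the failure (T1..T3 all start from base name "A"):
    //   T1: create("A")  -> writes "A"
    //   T2: create("A")  -> sees "A", writes "A+1", then cleanup() deletes everything
    //   T3: create("A")  -> "A" no longer exists, so "A" is written a SECOND time,
    //       which MockDirectoryWrapper reports as "file ... was already written to"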
They were due to a combination of mapping propagation delays and the
behavior of MapperService.smartName(String), so mappings are now
configured up-front.
Makes it possible to delete snapshots that are missing some of their metadata files. This can happen if snapshot creation failed because the repository drive ran out of disk space.
Closes #6383
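For illustration, deleting such a partially-written snapshot would look something like this ("repo" and "snap" are placeholder names; client() is the integration-test helper):

    // Sketch: with this change, deleting a snapshot succeeds even when some of its
    // metadata files are missing (e.g. after the repository drive ran out of space).
    client().admin().cluster().prepareDeleteSnapshot("repo", "snap").get();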
The delayed mapping intro tests exposed a bug: if a new mapping is introduced but not yet updated on the master, and a full restart occurs, replay of the transaction log will not cause the new mapping to be re-introduced.
closes #6659
add comment on the method
Today, when a new mapping is introduced, the mapping is rebuilt (refreshSource) on the thread that performs the indexing request. This can become heavier and heavier as new mappings keep being introduced, so we can move this process to another thread that is responsible for refreshing the source and then sending the mapping update to the master (note, this doesn't change the semantics of new mapping introduction, since it is async anyhow).
When doing so, the thread can also try to batch as many updates as possible; this is especially handy when multiple shards of the same index exist on the same node. An internal setting that controls the time to wait for batches is also added (defaults to 0).
Testing-wise, a new support method, ElasticsearchIntegrationTest#waitForConcreteMappingsOnAll, is added to allow waiting for the concrete manifestation of mappings on all relevant nodes. Some tests mistakenly rely on the fact that there are no more pending tasks to mean mappings have been updated, so if we see timing-related failures later (all tests currently pass), those will need to be fixed to either awaitBusy on the master for the new mapping or, in the rare case, wait for the concrete mapping on all nodes using the new method.
closes #6648
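A rough sketch of the batching idea using plain JDK primitives; MappingUpdateBatcher and its methods are hypothetical names, not the actual implementation:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch: a dedicated thread drains new-mapping updates, waits a small
    // additional window to batch more of them, then refreshes the source once per batch.
    class MappingUpdateBatcher implements Runnable {
        private final BlockingQueue<String> pending = new LinkedBlockingQueue<>(); // index names
        private final long additionalWindowMillis; // the internal setting, defaults to 0

        MappingUpdateBatcher(long additionalWindowMillis) {
            this.additionalWindowMillis = additionalWindowMillis;
        }

        void onNewMapping(String index) {
            pending.add(index); // the indexing thread returns immediately
        }

        @Override
        public void run() {
            List<String> batch = new ArrayList<>();
            try {
                while (true) {
                    batch.clear();
                    batch.add(pending.take()); // block until at least one update arrives
                    if (additionalWindowMillis > 0) {
                        // wait a bit so updates for other shards of the same index coalesce
                        Thread.sleep(additionalWindowMillis);
                    }
                    pending.drainTo(batch);
                    refreshSourceAndSendToMaster(batch);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        private void refreshSourceAndSendToMaster(List<String> batch) {
            // placeholder: rebuild the mapping source once per batch, then publish to master
        }
    }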
allow changing the additional time window dynamically
better sorting on mappers when refreshing source
also, no need to call nodes info in the test; we already have the node names
clean up calls to mapping update so they always provide the doc mapper and UUID
also use the internal cluster support method to get the list of nodes an index is on
reverse the order to pick the latest change first
remove unused field
and fix constructor param
move to start/stop on mapping update action
randomize INDICES_MAPPING_ADDITIONAL_MAPPING_CHANGE_TIME
Try to push the system into a state where there is only a single worker, to expose potential deadlocks when we mistakenly execute blocking operations on the worker thread
closes #6635
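The kind of deadlock this setup tries to surface, as a minimal JDK sketch (not Elasticsearch code): a task running on the only worker blocks on work that can only ever run on that same worker.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // With a single worker thread, blocking on the result of another task submitted to
    // the same executor deadlocks: the inner task can never start.
    public class SingleWorkerDeadlock {
        public static void main(String[] args) throws Exception {
            ExecutorService worker = Executors.newFixedThreadPool(1);
            Future<?> outer = worker.submit(() -> {
                Future<String> inner = worker.submit(() -> "done");
                try {
                    return inner.get(); // blocks forever: the only worker is busy right here
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            outer.get(); // never returns
        }
    }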
only change recovery throttling to slow down recoveries. Recovery file chunk size updates are not picked up by ongoing recoveries, which causes recoveries to take too long even after the default settings are restored.
Also, change document creation to reuse field names in order to speed up the test.
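As a sketch of the throttling-only approach (1.x-era test code; client() is the integration-test helper):

    // Sketch: throttle recovery bandwidth instead of shrinking the file chunk size,
    // since ongoing recoveries ignore chunk size updates but do pick up throttling.
    client().admin().cluster().prepareUpdateSettings()
            .setTransientSettings(ImmutableSettings.settingsBuilder()
                    .put("indices.recovery.max_bytes_per_sec", "20kb")
                    .build())
            .get();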
We don't rely on GC to clean up MappedByteBuffers; we unmap them
explicitly on close in Lucene. But the JDK has crazy loops with
explicit GCs in exceptional cases to try to force unmapping.
In general we don't want any of our code or library code calling
this method, so it's banned in forbidden-apis as well.
We clone RateLimitedIndexOutput from Lucene just to collect pausing
statistics; we can do this in a more straightforward way with a
delegating RateLimiter.
Closes #6625
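A sketch of the delegating approach; the RateLimiter interface below is a stand-in for Lucene's, assuming pause() returns the nanoseconds actually slept:

    import java.util.concurrent.atomic.AtomicLong;

    // Stand-in for Lucene's RateLimiter: pause() returns nanoseconds actually slept.
    interface RateLimiter {
        long pause(long bytes);
    }

    // Delegating limiter that accumulates pause statistics, instead of cloning
    // RateLimitedIndexOutput just to count pauses.
    class StatsCollectingRateLimiter implements RateLimiter {
        private final RateLimiter delegate;
        private final AtomicLong totalPauseNanos = new AtomicLong();
        private final AtomicLong pauseCount = new AtomicLong();

        StatsCollectingRateLimiter(RateLimiter delegate) {
            this.delegate = delegate;
        }

        @Override
        public long pause(long bytes) {
            long nanos = delegate.pause(bytes);
            if (nanos > 0) {
                totalPauseNanos.addAndGet(nanos);
                pauseCount.incrementAndGet();
            }
            return nanos;
        }

        long totalPauseNanos() { return totalPauseNanos.get(); }
        long pauseCount() { return pauseCount.get(); }
    }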
Thread pool rejections should return the 429 Too Many Requests status code, and not 503, which is also used to indicate that the cluster is not available
relates to #6627, but only for rejections for now
closes #6629
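A self-contained sketch of the status mapping; the names mirror Elasticsearch's RestStatus and EsRejectedExecutionException, but this is illustrative rather than the real code:

    // Sketch: a thread pool rejection now maps to 429 instead of 503.
    enum RestStatus { SERVICE_UNAVAILABLE /* 503 */, TOO_MANY_REQUESTS /* 429 */ }

    class EsRejectedExecutionException extends RuntimeException {
        // Before: rejections surfaced as SERVICE_UNAVAILABLE, conflating "node is
        // overloaded" with "cluster is not available". After: TOO_MANY_REQUESTS.
        RestStatus status() {
            return RestStatus.TOO_MANY_REQUESTS;
        }
    }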
We want to make sure recycling will not fail for any reason while trying to send back a response that is itself caused by a failure. For example, if at some point we put a circuit breaker on it, sending an error back should not be affected by it.
closes #6631
the test failed but couldn't be reproduced (yet); at the very least, make sure we have the exception message as the reason, which can help track down the failure itself when it happens again
If the match query with cutoff_frequency encounters stacked tokens,
like synonyms in the same position, it returns a boolean query instead
of a common terms query. However, if the original operator was set
to "and", it ignored that and reset the operator to "or".
In fact, if the operator is "and", there is little benefit in using
a common terms query, as a must query is already
executed efficiently.
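For reference, the kind of query affected, using the 1.x-era Java API (builder method names as I recall them):

    import org.elasticsearch.index.query.MatchQueryBuilder;
    import org.elasticsearch.index.query.QueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;

    // A match query with cutoff_frequency and an explicit "and" operator. With stacked
    // tokens (e.g. synonyms at the same position) this falls back to a boolean query;
    // the fix keeps the "and" operator instead of silently resetting it to "or".
    QueryBuilder query = QueryBuilders.matchQuery("body", "quick brown fox")
            .cutoffFrequency(0.001f)
            .operator(MatchQueryBuilder.Operator.AND);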
Waiting for ongoing recoveries was not good enough, as the check can run before the master finishes processing the started events of primary shards, causing the recovery response to be erroneously empty
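One way to make a test wait for the master to finish processing those events is the cluster health API rather than only the recovery API (a sketch; client() is the integration-test helper):

    // Sketch: waiting for green health with no relocations ensures the master has
    // processed the shard-started events before the recovery response is inspected.
    client().admin().cluster().prepareHealth()
            .setWaitForGreenStatus()
            .setWaitForRelocatingShards(0)
            .get();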