Bukhtawar Khan f7e2984248
Introduce FS Health HEALTHY threshold to fail stuck node (#1167)
This will cause the leader stuck on IO during publication to step down and eventually trigger a leader election.

Issue Description
---
The publication of cluster state is time bound to 30s by a cluster.publish.timeout settings. If this time is reached before the new cluster state is committed, then the cluster state change is rejected and the leader considers itself to have failed. It stands down and starts trying to elect a new master.

There is a bug in leader that when it tries to publish the new cluster state it first tries acquire a lock to flush the new state under a mutex to disk. The same lock is used to cancel the publication on timeout. Below is the state of the timeout scheduler meant to cancel the publication. So essentially if the flushing of cluster state is stuck on IO, so will the cancellation of the publication since both of them share the same mutex. So leader will not step down and effectively block the cluster from making progress.

Signed-off-by: Bukhtawar Khan <bukhtawa@amazon.com>
2021-09-16 17:02:25 -07:00
2021-03-13 10:36:16 -06:00
2021-03-22 09:25:46 -05:00

Welcome!

OpenSearch is a community-driven, open source fork of Elasticsearch and Kibana following the licence change in early 2021. We're looking to sustain (and evolve!) a search and analytics suite for the multitude of businesses who are dependent on the rights granted by the original, Apache v2.0 License.

Project Resources

Code of Conduct

This project has adopted the Amazon Open Source Code of Conduct. For more information see the Code of Conduct FAQ, or contact opensource-codeofconduct@amazon.com with any additional questions or comments.

License

This project is licensed under the Apache v2.0 License.

Copyright OpenSearch Contributors. See NOTICE for details.

Description
🔎 Open source distributed and RESTful search engine.
Readme 546 MiB
Languages
Java 99.5%
Groovy 0.4%