Sometimes when running ChaosMonkey, I've found that we lose accounting of
region servers. I've taken to a manual process of checking the
reported list against a known reference. It occurs to me that
ChaosMonkey has a known reference, and it can do this accounting for
me.
Signed-off-by: Viraj Jasani <vjasani@apache.org>
Adds `protected abstract Logger getLogger()` to `Action` so that
implementations' names are logged when actions are performed.
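
A minimal sketch of the pattern, not the actual HBase code: the class names
`ActionSketch` and `KillServerActionSketch` are hypothetical, and
`java.util.logging.Logger` stands in for whatever `Logger` type the project
uses, so that the example is self-contained.

```java
import java.util.logging.Logger;

// Hypothetical stand-in for the Action base class.
abstract class ActionSketch {
  // Each subclass returns its own logger, so log records carry the
  // concrete implementation's name instead of the base class name.
  protected abstract Logger getLogger();

  // Name under which this action's messages are logged.
  public String loggerName() {
    return getLogger().getName();
  }

  public void perform() {
    getLogger().info("Performing action");
  }
}

// Hypothetical concrete action.
class KillServerActionSketch extends ActionSketch {
  private static final Logger LOG =
      Logger.getLogger(KillServerActionSketch.class.getName());

  @Override
  protected Logger getLogger() {
    return LOG;
  }
}
```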
Signed-off-by: stack <stack@apache.org>
Signed-off-by: Jan Hentschel <jan.hentschel@ultratendency.com>
Running `ServerKillingChaosMonkey` via `RESTApiClusterManager` for any
duration of time slowly leaks region servers. I see failures on the
RESTApi side go unreported on the ChaosMonkey side. It seems like
`RuntimeException`s are being thrown and lost.
`PolicyBasedChaosMonkey` uses a primitive means of thread management
anyway. Update to use a thread pool, thread groups, and an
uncaughtExceptionHandler.
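
A rough sketch of the approach, under assumptions: `MonkeyThreadPoolSketch`
and `buildPool` are hypothetical names, and the real change may be structured
differently. Note that tasks are handed to the pool with `execute`; with
`submit`, the exception is captured in the returned `Future` and the
handler is not invoked.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper; not the actual PolicyBasedChaosMonkey code.
class MonkeyThreadPoolSketch {
  static ExecutorService buildPool(int size, Thread.UncaughtExceptionHandler handler) {
    // All monkey threads share one group and get sequential names.
    ThreadGroup group = new ThreadGroup("ChaosMonkey");
    AtomicInteger counter = new AtomicInteger();
    ThreadFactory factory = runnable -> {
      Thread t = new Thread(group, runnable, "ChaosMonkey-" + counter.getAndIncrement());
      t.setDaemon(true);
      // Without a handler, a RuntimeException thrown by a policy only
      // reaches the default handler and is easy to miss.
      t.setUncaughtExceptionHandler(handler);
      return t;
    };
    return Executors.newFixedThreadPool(size, factory);
  }
}
```

A caller would pass a handler that logs the failure, for example
`buildPool(2, (t, e) -> System.err.println(t.getName() + " failed: " + e))`.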
Signed-off-by: Bharath Vissapragada <bharathv@apache.org>
Signed-off-by: Viraj Jasani <vjasani@apache.org>
`RollingBatchRestartRsAction` doesn't handle failure cases when
tracking its list of dead servers. The original author believed that a
failure to restart would result in a retry. However, by removing the
dead server from the failed list, that state is lost, and retry never
occurs. Because this action doesn't ever look back to the current
state of the cluster, relying only on its local state for the current
action invocation, it never realizes the abandoned server is still
dead. Instead, be more careful to only remove the dead server from the
list when the `startRs` invocation claims to have been successful.
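
The intended bookkeeping can be sketched roughly as follows. This is an
illustration only: `RestartTrackingSketch` and `tryRestartNext` are
hypothetical names, and the `Predicate` stands in for the real `startRs`
call, whose signature is not shown here.

```java
import java.util.Deque;
import java.util.function.Predicate;

// Hypothetical helper; not the actual RollingBatchRestartRsAction code.
class RestartTrackingSketch {
  /**
   * Tries to restart the server at the head of the dead list. The server
   * is removed only when the start attempt reports success, so a failed
   * attempt leaves it queued for a later retry.
   */
  static boolean tryRestartNext(Deque<String> deadServers, Predicate<String> startRs) {
    String server = deadServers.peekFirst();
    if (server == null) {
      return false; // nothing to restart
    }
    boolean started;
    try {
      started = startRs.test(server);
    } catch (RuntimeException e) {
      started = false; // treat an exception as a failed start
    }
    if (started) {
      deadServers.removeFirst();
    }
    return started;
  }
}
```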
Signed-off-by: stack <stack@apache.org>
(cherry picked from commit 0dae377f53)
* Add chaos monkey action for suspend/resume region servers
* Add these to relevant chaos monkeys
branch-1-backport-note: Graceful regionserver restart action wasn't
backported due to its dependency on the "RegionMover" script. Can be
done later if needed.
Signed-off-by: Balazs Meszaros <meszibalu@apache.org>
Signed-off-by: Peter Somogyi <psomogyi@apache.org>
Implements `ClusterManager` that relies on the new
`ShellExecEndpointCoprocessor` for remote shell command execution.
Signed-off-by: Bharath Vissapragada <bharathv@apache.org>
Co-authored-by: Nick Dimiduk <ndimiduk@apache.org>
We have this nice description in the Javadoc on ITBLL, but it's
unformatted and thus illegible. Add some formatting so that it can be
read by humans.
Signed-off-by: Jan Hentschel <janh@apache.org>
Signed-off-by: Josh Elser <elserj@apache.org>
Adds the unbalance.kill.meta.rs property, which controls whether the
monkey kills the region server that holds hbase:meta.
Change-Id: I2c871789645b6c1986104f5a16cc6b9badfbc172
Signed-off-by: Apekshit Sharma <appy@apache.org>
The framework sets a configuration property to control how long reads
should be executed. When writes take too long, no time remains for reads
and the user sees an error about a property they must set. We should
prevent this case and log an appropriate message.
Also fixes a rogue character in the class-level javadoc.
Signed-off-by: Michael Stack <stack@apache.org>
Replaces this mechanism with a JUnit @ClassRule to time out the test.
Also adds missing kdc deps in hbase-it/pom.xml.
Change-Id: I00930c2f974b4215e3f82a0ec007d9ef3ebd7cdd
Having thread names in logs and thread dumps greatly improves
debuggability. This patch simply adds names to the threads we spawn.
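
As a small illustration (the class and method names are hypothetical, not
taken from the patch), naming a thread at construction time looks like this:

```java
// Hypothetical helper showing a named thread; not the code from the patch.
class NamedThreadSketch {
  static Thread startNamed(String name, Runnable work) {
    // The name appears in log lines that print the thread and in jstack
    // output, in place of the default "Thread-N".
    Thread t = new Thread(work, name);
    t.setDaemon(true);
    t.start();
    return t;
  }
}
```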
Change-Id: I6ff22cc3804bb81147dde3a8e9ab671633c6f6ce