Sometimes running chaos monkey, I've found that we lose accounting of
region servers. I've taken to a manual process of checking the
reported list against a known reference. It occurs to me that
ChaosMonkey has a known reference, and it can do this accounting for
me.
Signed-off-by: Viraj Jasani <vjasani@apache.org>
Running `ServerKillingChaosMonkey` via `RESTApiClusterManager` for any
duration of time slowly leaks region servers. I see failures on the
RESTApi side go unreported on the ChaosMonkey side. It seems like
`RuntimeException`s are being thrown and lost.
`PolicyBasedChaosMonkey` uses a primitive means of thread management
anyway. Update to use a thread pool, thread groups, and an
uncaughtExceptionHandler.
Signed-off-by: Bharath Vissapragada <bharathv@apache.org>
Signed-off-by: Viraj Jasani <vjasani@apache.org>
* sometimes API calls return with null/empty response bodies. thus,
wrap all API calls in a retry loop.
* calls that submit work in the form of "commands" now retrieve the
commandId from successful command submission, and track completion
of that command before returning control to calling context.
* model CM's process state and use that model to guide state
transitions more intelligently. this guards against, for example,
the start command failing with an error message like "Role must be
stopped".
* improvements to logging levels, avoid spamming logs with the
side-effects of retries at this and higher contexts.
* include references to API documentation, such as it is.
Signed-off-by: stack <stack@apache.org>
`RollingBatchRestartRsAction` doesn't handle failure cases when
tracking its list of dead servers. The original author believed that a
failure to restart would result in a retry. However, by removing the
dead server from the failed list, that state is lost, and retry never
occurs. Because this action doesn't ever look back to the current
state of the cluster, relying only on its local state for the current
action invocation, it never realizes the abandoned server is still
dead. Instead, be more careful to only remove the dead server from the
list when the `startRs` invocation claims to have been successful.
Signed-off-by: stack <stack@apache.org>
Adds `protected abstract Logger getLogger()` to `Action` so that
implementation's names are logged when actions are performed.
Signed-off-by: stack <stack@apache.org>
Signed-off-by: Jan Hentschel <jan.hentschel@ultratendency.com>
Implements `ClusterManager` that relies on the new
`ShellExecEndpointCoprocessor` for remote shell command execution.
Signed-off-by: Bharath Vissapragada <bharathv@apache.org>
Use the correct GSON API for deserializing service responses. Add
simple unit test covering a very limited selection of the overall API
surface area, just enough to ensure deserialization works.
Signed-off-by: stack <stack@apache.org>
- MOB compaction is now handled in-line with per-region compaction on region
servers
- regions with mob data store per-hfile metadata about which mob hfiles are
referenced
- admin requested major compaction will also rewrite MOB files; periodic RS
initiated major compaction will not
- periodically a chore in the master will initiate a major compaction that
will rewrite MOB values to ensure it happens. controlled by
'hbase.mob.compaction.chore.period'. default is weekly
- control how many RS the chore requests major compaction on in parallel
with 'hbase.mob.major.compaction.region.batch.size'. default is as
parallel as possible.
- periodic chore in master will scan backing hfiles from regions to get the
set of referenced mob hfiles and archive those that are no longer
referenced. control period with 'hbase.master.mob.cleaner.period'
- Optionally, RS that are compacting mob files can limit write
amplification by not rewriting values from mob hfiles over a certain size
limit. opt-in by setting 'hbase.mob.compaction.type' to 'optimized'.
control threshold by 'hbase.mob.compactions.max.file.size'.
default is 1GiB
- Should smoothly integrate with existing MOB users via rolling upgrade.
will delay old MOB file cleanup until per-region compaction has managed
to compact each region at least once so that used mob hfile metadata can
be gathered.
We have this nice description in the java doc on ITBLL but it's
unformatted and thus illegible. Add some formatting so that it can be
read by humans.
Signed-off-by: Jan Hentschel <janh@apache.org>
Signed-off-by: Josh Elser <elserj@apache.org>
Instead of using the default properties when checking for monkey
properties, now we use the ones already extended with command line
params.
Change FillDiskCommandAction to try to stop the remote process if the
command failed with an exception.
Signed-off-by: stack <stack@apache.org>
Add monkey actions:
- manipulate network packages with tc (reorder, loose,...)
- add CPU load
- fill the disk
- corrupt or delete regionserver data files
Extend HBaseClusterManager to allow sudo calls.
Signed-off-by: Josh Elser <elserj@apache.org>
Signed-off-by: Balazs Meszaros <meszibalu@apache.org>
* Add chaos monkey action for suspend/resume region servers
* Add chaos monkey action for graceful rolling restart
* Add these to relevant chaos monkeys
Signed-off-by: Balazs Meszaros <meszibalu@apache.org>
Signed-off-by: Peter Somogyi <psomogyi@apache.org>
HBASE-15666 shaded dependencies for hbase-testing-util
Added new artifact hbase-shaded-testing-util. It wraps a whole hbase-server
with its testing dependencies. Users should use only the following dependency
in pom:
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-shaded-testing-util</artifactId>
<version>${hbase.version}</version>
<scope>test</scope>
</dependency>
Added hbase-shaded-testing-util-tester maven module which ensures
that hbase-shaded-testing-util works with a shaded client.
Signed-off-by: Josh Elser <elserj@apache.org>