* HBASE-25867 Extra doc around ITBLL
Minor edits to a few log messages.
Explain how the '-c' option works when passed to ChaosMonkeyRunner.
Some added notes on ITBLL.
Fix whacky 'R' and 'Not r' thing in Master (shows when you run ITBLL).
In HRS, report hostname and port when it checks in (was debugging issue
where Master and HRS had different notions of its hostname).
Spare a dirty FNFException on startup if base dir not yet in place.
* Address Review by Sean
Signed-off-by: Sean Busbey <busbey@apache.org>
IntegrationTestImportTsv is generating HFiles under the working directory of the
current hdfs user executing the tool, before bulkloading it into HBase.
Assuming you encrypt the HBase root directory within HDFS (using HDFS
Transparent Encryption), you can bulkload HFiles only if they sit in the same
encryption zone in HDFS as the HBase root directory itself.
When IntegrationTestImportTsv is executed against a real distributed cluster
and the working directory of the current user (e.g. /user/hbase) is not in the
same encryption zone as the HBase root directory (e.g. /hbase/data) then you
will get an exception:
```
ERROR org.apache.hadoop.hbase.regionserver.HRegion: There was a partial failure
due to IO when attempting to load d :
hdfs://mycluster/user/hbase/test-data/22d8460d-04cc-e032-88ca-2cc20a7dd01c/
IntegrationTestImportTsv/hfiles/d/74655e3f8da142cb94bc31b64f0475cc
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
/user/hbase/test-data/22d8460d-04cc-e032-88ca-2cc20a7dd01c/
IntegrationTestImportTsv/hfiles/d/74655e3f8da142cb94bc31b64f0475cc
can't be moved into an encryption zone.
```
In this commit I make it configurable where the IntegrationTestImportTsv
generates the HFiles.
Co-authored-by: Mate Szalay-Beko <symat@apache.com>
Signed-off-by: Peter Somogyi <psomogyi@apache.org>
Sometimes running chaos monkey, I've found that we lose accounting of
region servers. I've taken to a manual process of checking the
reported list against a known reference. It occurs to me that
ChaosMonkey has a known reference, and it can do this accounting for
me.
Signed-off-by: Viraj Jasani <vjasani@apache.org>
Running `ServerKillingChaosMonkey` via `RESTApiClusterManager` for any
duration of time slowly leaks region servers. I see failures on the
RESTApi side go unreported on the ChaosMonkey side. It seems like
`RuntimeException`s are being thrown and lost.
`PolicyBasedChaosMonkey` uses a primitive means of thread management
anyway. Update to use a thread pool, thread groups, and an
uncaughtExceptionHandler.
Signed-off-by: Bharath Vissapragada <bharathv@apache.org>
Signed-off-by: Viraj Jasani <vjasani@apache.org>
* sometimes API calls return with null/empty response bodies. thus,
wrap all API calls in a retry loop.
* calls that submit work in the form of "commands" now retrieve the
commandId from successful command submission, and track completion
of that command before returning control to calling context.
* model CM's process state and use that model to guide state
transitions more intelligently. this guards against, for example,
the start command failing with an error message like "Role must be
stopped".
* improvements to logging levels, avoid spamming logs with the
side-effects of retries at this and higher contexts.
* include references to API documentation, such as it is.
Signed-off-by: stack <stack@apache.org>
`RollingBatchRestartRsAction` doesn't handle failure cases when
tracking its list of dead servers. The original author believed that a
failure to restart would result in a retry. However, by removing the
dead server from the failed list, that state is lost, and retry never
occurs. Because this action doesn't ever look back to the current
state of the cluster, relying only on its local state for the current
action invocation, it never realizes the abandoned server is still
dead. Instead, be more careful to only remove the dead server from the
list when the `startRs` invocation claims to have been successful.
Signed-off-by: stack <stack@apache.org>
Adds `protected abstract Logger getLogger()` to `Action` so that
implementation's names are logged when actions are performed.
Signed-off-by: stack <stack@apache.org>
Signed-off-by: Jan Hentschel <jan.hentschel@ultratendency.com>
Implements `ClusterManager` that relies on the new
`ShellExecEndpointCoprocessor` for remote shell command execution.
Signed-off-by: Bharath Vissapragada <bharathv@apache.org>
Use the correct GSON API for deserializing service responses. Add
simple unit test covering a very limited selection of the overall API
surface area, just enough to ensure deserialization works.
Signed-off-by: stack <stack@apache.org>
We have this nice description in the java doc on ITBLL but it's
unformatted and thus illegible. Add some formatting so that it can be
read by humans.
Signed-off-by: Jan Hentschel <janh@apache.org>
Signed-off-by: Josh Elser <elserj@apache.org>
Instead of using the default properties when checking for monkey
properties, now we use the ones already extended with command line
params.
Change FillDiskCommandAction to try to stop the remote process if the
command failed with an exception.
Signed-off-by: stack <stack@apache.org>
Add monkey actions:
- manipulate network packages with tc (reorder, loose,...)
- add CPU load
- fill the disk
- corrupt or delete regionserver data files
Extend HBaseClusterManager to allow sudo calls.
Signed-off-by: Josh Elser <elserj@apache.org>
Signed-off-by: Balazs Meszaros <meszibalu@apache.org>
* Add chaos monkey action for suspend/resume region servers
* Add chaos monkey action for graceful rolling restart
* Add these to relevant chaos monkeys
Signed-off-by: Balazs Meszaros <meszibalu@apache.org>
Signed-off-by: Peter Somogyi <psomogyi@apache.org>
Added new artifact hbase-shaded-testing-util. It wraps a whole hbase-server
with its testing dependencies. Users should use only the following dependency
in pom:
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-shaded-testing-util</artifactId>
<version>${hbase.version}</version>
<scope>test</scope>
</dependency>
Added hbase-shaded-testing-util-tester maven module which ensures
that hbase-shaded-testing-util works with a shaded client.
Signed-off-by: Josh Elser <elserj@apache.org>
This is a reapply of a reverted commit. This commit includes
HBASE-22059 amendment and subsequent ammendments to HBASE-22052.
See HBASE-22052 for full story.
jersey-core is problematic. It was transitively included from hadoop
and polluting our CLASSPATH with an implementation of a 1.x version
of the javax.ws.rs.core.Response Interface from jsr311-api when we
want the javax.ws.rs-api 2.x version.
M hbase-endpoint/pom.xml
M hbase-http/pom.xml
M hbase-mapreduce/pom.xml
M hbase-rest/pom.xml
M hbase-server/pom.xml
M hbase-zookeeper/pom.xml
Remove redundant version specification (and the odd property define
done already up in parent pom).
M hbase-it/pom.xml
M hbase-rest/pom.xml
Exclude jersey-core explicitly.
M hbase-procedure/pom.xml
Remove redundant version and classifier.
M pom.xml
Add jersey-core exclusions to all dependencies that pull it in
except hadoop-minicluster. mr tests fail w/o the jersey-core
so let it in for minicluster and then in modules, exclude it
where it causes damage as in hbase-it.
* modify the jar checking script to take args; make hadoop stuff optional
* separate out checking the artifacts that have hadoop vs those that don't.
* * Unfortunately means we need two modules for checking things
* * put in a safety check that the support script for checking jar contents is maintained in both modules
* * have to carve out an exception for o.a.hadoop.metrics2. :(
* fix duplicated class warning
* clean up dependencies in hbase-server and some modules that depend on it.
* allow Hadoop to have its own htrace where it needs it
* add a precommit check to make sure we're not using old htrace imports
Conflicts:
hbase-backup/pom.xml
hbase-checkstyle/src/main/resources/hbase/checkstyle-suppressions.xml
Signed-off-by: Mike Drob <mdrob@apache.org>
Cannot go to latest (8.9) yet due to
https://github.com/checkstyle/checkstyle/issues/5279
* move hbaseanti import checks to checkstyle
* implment a few missing equals checks, and ignore one
* fix lots of javadoc errors
Signed-off-by: Sean Busbey <busbey@apache.org>
Remove commons-cli and commons-collections4 use. Account
for the newer internal protobuf version of 3.5.1.
Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: Mike Drob <mdrob@apache.org>
Allow that DisableTableProcedue can grab a region lock before
ServerCrashProcedure can. Cater to this cricumstance where SCP
was not unable to make progress by running the search for RIT
against the crashed server a second time, post creation of all
crashed-server assignemnts. The second run will uncover such as
the above DisableTableProcedure unassign and will interrupt its
suspend allowing both procedures to make progress.
M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto
Add new procedure step post-assigns that reruns the RIT finder method.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
Make this important log more specific as to what is going on.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
Better explanation as to what is going on.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
Add extra step and run handleRIT a second time after we've queued up
all SCP assigns. Also fix a but. SCP was adding an assign of a RIT
that was actually trying to unassign (made the deadlock more likely).