Commit Graph

219 Commits

Author SHA1 Message Date
Allan Yang e33591515c
HBASE-21083 Introduce a mechanism to bypass the execution of a stuck procedure 2018-08-28 20:18:47 -07:00
Michael Stack d954031d50 HBASE-21078 [amv2] CODE-BUG NPE in RTP doing Unassign 2018-08-24 13:22:16 -07:00
Allan Yang ee3507d456 HBASE-21050 Exclusive lock may be held by a SUCCESS state procedure forever
Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: zhangduo <zhangduo@apache.org>
2018-08-15 15:39:15 -07:00
Allan Yang e1188d27f5 HBASE-20978 [amv2] Worker terminating UNNATURALLY during MoveRegionProcedure Signed-off-by: Michael Stack <stack@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org> 2018-08-14 16:29:58 -07:00
Allan Yang 26828b1860 HBASE-20975 Lock may not be taken or released while rolling back procedure 2018-08-13 20:12:57 +08:00
jackbearden 2d12e1ecf0 HBASE-20981. Rollback stateCount accounting thrown-off when exception out of rollbackState
Signed-off-by: Michael Stack <stack@apache.org>
2018-08-11 11:58:12 -07:00
Michael Stack 88f3148810 HBASE-20989 Minor, miscellaneous logging fixes
Signed-off-by: Zach York <zyork@amazon.com>
Signed-off-by: Mingliang Liu <liuml07@apache.org>
2018-08-01 11:20:01 -07:00
Alex Leblang 31cbd7ab8f
HBASE-19369 Switch to Builder Pattern In WAL
This patch switches to the builder pattern by adding a helper method.
It also checks to ensure that the pattern is available (i.e. that
HBase is running on a hadoop version that supports it).

Amending-Author: Mike Drob <mdrob@apache.org>
Signed-off-by: tedyu <yuzhihong@gmail.com>
Signed-off-by: zhangduo <zhangduo@apache.org>
2018-07-27 23:43:08 -05:00
zhangduo 8bfdb19e85 HBASE-20939 There will be race when we call suspendIfNotReady and then throw ProcedureSuspendedException 2018-07-27 21:30:23 +08:00
zhangduo 1777ea3aae HBASE-20938 Set version to 2.1.1-SNAPSHOT for branch-2.1 2018-07-25 21:45:09 +08:00
zhangduo 833657c46d HBASE-20846 Restore procedure locks when master restarts 2018-07-25 14:37:36 +08:00
Michael Stack 46e5baf670 HBASE-20914 Trim Master memory usage
Add (weak reference) interning of ServerNames.

Correct Balancer regions x racks matrix.

Make smaller defaults when creating ArrayDeques.
2018-07-20 10:08:13 -07:00
zhangduo 113652eb88 HBASE-20847 The parent procedure of RegionTransitionProcedure may not have the table lock 2018-07-11 17:37:27 +08:00
zhangduo a2db3d27ff HBASE-20849 Set version as 2.1.0 in branch-2.1 in prep for first RC 2018-07-06 15:32:23 +08:00
Yu Li d61bb64e93 HBASE-20691 Change the default WAL storage policy back to "NONE""
This reverts commit 564c193d61 and added more doc
about why we choose "NONE" as the default.
2018-07-04 13:45:54 +08:00
Michael Stack 9eeb501825 HBASE-20745 Log when master proc wal rolls 2018-06-19 19:53:29 -07:00
zhangduo 3e33aecea2 HBASE-20708 Remove the usage of RecoverMetaProcedure in master startup 2018-06-19 15:09:11 +08:00
Josh Elser 1725094e6b HBASE-19735 Create a client-tarball assembly
Provides an extra client descriptor to build a second
tarball with a reduced set of dependencies. Not of great
impact now, but will build the way for better in the future.

Signed-off-by: Sean Busbey <busbey@apache.org>

 Conflicts:
	hbase-assembly/pom.xml

 Conflicts:
	hbase-spark/pom.xml
2018-06-18 14:03:22 -07:00
zhangduo 6befdc43ba HBASE-20700 Move meta region when server crash can cause the procedure to be stuck 2018-06-11 15:28:21 +08:00
zhangduo d834859404 HBASE-20634 Reopen region while server crash can cause the procedure to be stuck
A reattempt at fixing HBASE-20173 [AMv2] DisableTableProcedure concurrent to ServerCrashProcedure can deadlock

The scenario is a SCP after processing WALs, goes to assign regions that
were on the crashed server but a concurrent Procedure gets in there
first and tries to unassign a region that was on the crashed server
(could be part of a move procedure or a disable table, etc.). The
unassign happens to run AFTER SCP has released all RPCs that
were going against the crashed server. The unassign fails because the
server is crashed. The unassign used to suspend itself only it would
never be woken up because the server it was going against had already
been processed. Worse, the SCP could not make progress because the
unassign was suspended with the lock on a region that it wanted to
assign held making it so it could make no progress.

In here, we add to the unassign recognition of the state where it is
running post SCP cleanup of RPCs. If present, unassign moves to finish
instead of suspending itself.

Includes a nice unit test made by Duo Zhang that reproduces nicely the
hung scenario.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/FailedRemoteDispatchException.java
 Moved this class back to hbase-procedure where it belongs.

M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoNodeDispatchException.java
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoServerDispatchException.java
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NullTargetServerDispatchException.java
 Specializiations on FRDE so we can be more particular when we say there
 was a problem.

M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/RemoteProcedureDispatcher.java
 Change addOperationToNode so we throw exceptions that give more detail
 on issue rather than a mysterious true/false

M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto
 Undo SERVER_CRASH_HANDLE_RIT2. Bad idea (from HBASE-20173)

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
 Have expireServer return true if it actually queued an expiration. Used
 later in this patch.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
 Hide methods that shouldn't be public. Add a particular check used out
 in unassign procedure failure processing.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/MoveRegionProcedure.java
 Check that server we're to move from is actually online (might
 catch a few silly move requests early).

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
 Add doc on ServerState. Wasn't being used really. Now we actually stamp
 a Server OFFLINE after its WAL has been split. Means its safe to assign
 since all WALs have been processed. Add methods to update SPLITTING
 and to set it to OFFLINE after splitting done.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
 Change logging to be new-style and less repetitive of info.
 Cater to new way in which .addOperationToNode returns info (exceptions
 rather than true/false).

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
 Add looking for the case where we failed assign AND we should not
 suspend because we will never be woken up because SCP is beyond
 doing this for all stuck RPCs.

 Some cleanup of the failure processing grouping where we can proceed.

 TODOs have been handled in this refactor including the TODO that
 wonders if it possible that there are concurrent fails coming in
 (Yes).

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
 Doc and removing the old HBASE-20173 'fix'.
 Also updating ServerStateNode post WAL splitting so it gets marked
 OFFLINE.

A hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestServerCrashProcedureStuck.java
 Nice test by Duo Zhang.

Signed-off-by: Umesh Agashe <uagashe@cloudera.com>
Signed-off-by: Duo Zhang <palomino219@gmail.com>
Signed-off-by: Mike Drob <mdrob@apache.org>
2018-06-04 09:26:36 -07:00
zhangduo b785896cbd HBASE-20659 Implement a reopen table regions procedure 2018-05-30 20:03:35 +08:00
Sean Busbey 61f96b6ffa HBASE-20544 Make HBTU default to random ports.
Signed-off-by: Umesh Agashe <uagashe@cloudera.com>
Signed-off-by: Josh Elser <elserj@apache.org>

 Conflicts:
	hbase-backup/src/test/resources/hbase-site.xml
	hbase-spark-it/src/test/resources/hbase-site.xml
	hbase-spark/src/test/resources/hbase-site.xml
2018-05-09 23:45:39 -07:00
Chia-Ping Tsai 984fb5bd05
HBASE-20169 NPE when calling HBTU.shutdownMiniCluster (TestAssignmentManagerMetrics is flakey); AMENDMENT 2018-05-02 16:14:38 -07:00
Michael Stack da3e06afab HBASE-20492 UnassignProcedure is stuck in retry loop on region stuck in OPENING state
Add backoff when stuck in RegionTransitionProcedure, the subclass of
AssignProcedure and UnassignProcedure. Can happen when we go to
transition but the current Region state is not what we expect.

M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/Procedure.java
 Add doc on being able to suspend and wait on a timeout.

M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto
 Add 'attempt' counter so we can do backoff when we get stuck.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
 Add persistence of new 'attempt' counter

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
 Doc data members that are persisted by subclasses given this is 'odd'.
 Add a counter for 'attempts' used when 'stuck' to implement backoff.
 Add suspend with timeout when 'stuck'. Add callback when timeout is
 exhausted which does wakeup of this procedure.

A hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestUnexpectedStateException.java
 Test of backoff.
2018-04-30 17:58:27 -07:00
Wei-Chiu Chuang b1901c9a15 HBASE-20338 WALProcedureStore#recoverLease() should have fixed sleeps for retrying rollWriter()
Signed-off-by: Mike Drob <mdrob@apache.org>
Signed-off-by: Umesh Agashe <uagashe@cloudera.com>
Signed-off-by: Chia-Ping Tsai <chia7712@gmail.com>
2018-04-12 16:35:11 -05:00
Umesh Agashe ec6295bed0 HBASE-20330 ProcedureExecutor.start() gets stuck in recover lease on store
rollWriter() fails after creating the file and returns false. In next iteration of while loop in recoverLease() file list is refreshed.

Signed-off-by: Appy <appy@cloudera.com>
2018-04-11 16:07:23 -07:00
Josh Elser c3d82a283d HBASE-20223 Update to hbase-thirdparty 2.1.0
Remove commons-cli and commons-collections4 use. Account
for the newer internal protobuf version of 3.5.1.

Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: Mike Drob <mdrob@apache.org>
2018-03-26 16:07:39 -04:00
Umesh Agashe 96d63fee11 HBASE-20224 Web UI is broken in standalone mode
Changes for HBASE-20027 seem to cause UI not showing up on default port in standalone mode. For concurrent
unit test execution, individual tests can set hbase.localcluster.assign.random.ports to true or modify
test/resources/hbase-site.xml.
2018-03-22 20:28:08 -07:00
Michael Stack 79e4c9d925
Revert "HBASE-20224 Web UI is broken in standalone mode"
Broke shell tests.

This reverts commit dd9fe813ec.
2018-03-22 10:47:47 -07:00
Umesh Agashe dd9fe813ec HBASE-20224 Web UI is broken in standalone mode
Changes for HBASE-20027 seem to cause UI not showing up on default port in standalone mode. For concurrent
unit test execution, individual tests can set hbase.localcluster.assign.random.ports to true or modify
test/resources/hbase-site.xml.
2018-03-22 06:52:51 -07:00
Chia-Ping Tsai dd9e46bbf5 HBASE-20212 Make all Public classes have InterfaceAudience category
Signed-off-by: tedyu <yuzhihong@gmail.com>
Signed-off-by: Michael Stack <stack@apache.org>
2018-03-22 18:09:54 +08:00
Michael Stack fabb1d97cc HBASE-20169 NPE when calling HBTU.shutdownMiniCluster
Adds a prepare step to RecoverMetaProcedure in which we test for
cluster up and master being up. If not up, we fail the run.

Modified hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/HFileCleaner.java
Modified hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/ChunkCreator.java
 Minor log cleanup.

Modified hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RecoverMetaProcedure.java
 Add pepare step.

Modified hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerMetrics.java
 Debug for the failing test....

Added hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestRecoverMetaProcedure.java
 Test the prepare step goes down if master or cluster are down.
2018-03-20 13:09:43 -07:00
Michael Stack 3f1c86786c HBASE-20213 [LOGGING] Aligning formatting and logging less (compactions,
in-memory compactions)

Log less. Log using same format as used elsewhere in log.

Align logs in HFileArchiver with how we format elsewhere. Removed
redundant 'region' qualifiers, tried to tighten up the emissions so
easier to read the long lines.

M hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/ChunkCreator.java
 Add a label for each of the chunkcreators we make (I was confused by
two chunk creater stats emissions in log file -- didn't know that one
was for data and the other index).

M hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/CompactSplit.java
 Formatting. Log less.

M hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MemStoreCompactionStrategy.java
 Make the emissions in here trace-level. When more than a few regions,
log is filled with this stuff.
2018-03-16 13:07:34 -07:00
Umesh Agashe 55e3dda25d HBASE-20024 Fixed flakyness of TestMergeTableRegionsProcedure
We assumed that we can run for loop from 0 to lastStep sequentially. MergeTableRegionProcedure skips step 2. So, when i is 0 the procedure is already at step 3.
Added a method StateMachineProcedure#getCurrentStateId that can be used from test code only.
2018-03-09 12:44:49 -08:00
zhangduo 5e410d8140 HBASE-19524 Master side changes for moving peer modification from zk watcher to procedure 2018-03-09 20:55:48 +08:00
zhangduo 95af14fea6 HBASE-19216 Implement a general framework to execute remote procedure on RS 2018-03-09 20:55:48 +08:00
Sean Busbey 71cc7869db HBASE-20155 update branch-2 version to 2.1.0-SNAPSHOT
Signed-off-by: Peter Somogyi <psomogyi@apache.org>
2018-03-08 08:44:30 -08:00
Sean Busbey 9927c2e14a HBASE-20070 refactor website generation
* rely on git plumbing commands when checking if we've built the site for a particular commit already
* switch to forcing '-e' for bash
* add command line switches for: path to hbase, working directory, and publishing
* only export JAVA/MAVEN HOME if they aren't already set.
* add some docs about assumptions
* Update javadoc plugin to consistently be version 3.0.0
* avoid duplicative site invocations on reactor modules
* update use of cp command so it works both on linux and mac
* manually skip enforcer plugin during build
* still doing install of all jars due to MJAVADOC-490, but then skip rebuilding during aggregate reports.
* avoid the pager on git-diff by teeing to a log file, which also helps later reviewing in the case of big changesets.

Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: Misty Stanley-Jones <misty@apache.org>

 Conflicts:
	hbase-backup/pom.xml
	hbase-spark-it/pom.xml
2018-03-02 09:51:43 -06:00
Michael Stack a2de29560f HBASE-20113 Move branch-2 version from 2.0.0-beta-2-SNAPSHOT to 2.0.0-beta-2 2018-03-01 15:46:38 -08:00
Michael Stack 44544c7db0 HBASE-20069 fix existing findbugs errors in hbase-server 2018-02-26 10:55:53 -08:00
Michael Stack 8b3ae58e18 HBASE-20043 ITBLL fails against hadoop3
Fix MoveRandomRegionOfTableAction. It depended on old AM behavior.
Make it do explicit move as is required in AMv3; w/o it, it was just
closing region causing test to fail.

Fix pom so hadoop3 profile specifies a different netty3 version.

Bunch of logging format change that came of trying trying to read
the spew from this test.
2018-02-24 17:29:24 -08:00
Michael Stack 9be0360c5d HBASE-20024 TestMergeTableRegionsProcedure is STILL flakey 2018-02-20 11:07:36 -08:00
zhangduo c1fe9f441c HBASE-19978 The keepalive logic is incomplete in ProcedureExecutor 2018-02-19 17:13:16 -08:00
Michael Stack 8f1e01b6e5 HBASE-19951 Cleanup the explicit timeout value for test method 2018-02-07 16:39:54 -08:00
Michael Stack bac4687345 HBASE-19919 Tidying up logging 2018-02-02 22:42:30 -08:00
Michael Stack 90a75fb052 HBASE-19888 Move branch-2 version from 2.0.0-beta-1 to 2.0.0-beta-2-SNAPSHOT 2018-01-29 14:17:54 -08:00
Duo Zhang bbf3bae72a
HBASE-19873 Add a CategoryBasedTimeout ClassRule for all UTs 2018-01-29 12:41:14 -08:00
Thiruvel Thirumoolan c9950b5a79 HBASE-19756 Master NPE during completed failed proc eviction
Signed-off-by: Andrew Purtell <apurtell@apache.org>
2018-01-24 16:43:08 -08:00
Michael Stack 86ecc963e4 HBASE-19828 Flakey TestRegionsOnMasterOptions.testRegionsOnAllServers
Rename the PE Worker threads.

Send an interrupt if worker taking a long time to go down
(it may be RPC'ing out to a dead server, retrying so
interrupt). Also join on the ProcedureExecutor shutting down.
This will make problems shutting down more obvious.

Disable TestRegionsOnMasterOptions. Master carrying Regions is broke.
2018-01-19 21:54:44 -08:00
Michael Stack 7225899e01 HBASE-19527 Make ExecutorService threads daemon=true
Set the ProcedureExcecutor worker threads as daemon.
Ditto for the timeout thread.

Remove hack from TestRegionsOnMasterOptions that was
put in place because the test would not go down.
2018-01-18 11:30:46 -08:00