236 Commits

Author SHA1 Message Date
Duo Zhang
3fe8649b2c HBASE-21377 Add debug log for catching the root cause 2018-10-24 15:43:12 +08:00
Duo Zhang
b2fcf765ae HBASE-21363 Rewrite the buildingHoldCleanupTracker method in WALProcedureStore 2018-10-24 14:14:19 +08:00
Allan Yang
3b68e5393e HBASE-20973 ArrayIndexOutOfBoundsException when rolling back procedure 2018-10-23 16:09:05 +08:00
Duo Zhang
603bf4c551 HBASE-21354 Addendum fix compile error 2018-10-23 14:39:53 +08:00
Allan Yang
86f23128b0 HBASE-21354 Procedure may be deleted improperly during master restarts resulting in 'Corrupt' 2018-10-23 10:55:18 +08:00
zhangduo
3b66b65b9f HBASE-21336 Simplify the implementation of WALProcedureMap 2018-10-22 18:36:11 +08:00
Duo Zhang
7d7293049a Revert "HBASE-21336 Simplify the implementation of WALProcedureMap"
This reverts commit 7adf590106826b9e4432cfeee06acdc0ccff8c6e.
2018-10-22 09:32:55 +08:00
zhangduo
7adf590106 HBASE-21336 Simplify the implementation of WALProcedureMap 2018-10-20 21:59:46 +08:00
jingyuntian
5fbb227deb HBASE-21269 Forward-port HBASE-21213 [hbck2] bypass leaves behind state in RegionStates when assign/unassign 2018-10-18 06:22:52 -07:00
zhangduo
132bea9a1c HBASE-21323 Should not skip force updating for a sub procedure even if it has been finished 2018-10-18 14:24:34 +08:00
Duo Zhang
5efa5f6de4 HBASE-21330 ReopenTableRegionsProcedure will enter an infinite loop if we schedule a TRSP at the same time 2018-10-18 11:29:12 +08:00
Jingyun Tian
821e4d7de2 HBASE-21291 Add a test for bypassing stuck state-machine procedures
Signed-off-by: Allan Yang <allan163@apache.org>
2018-10-16 22:26:58 +08:00
Duo Zhang
fa652cc610 HBASE-21315 The getActiveMinProcId and getActiveMaxProcId of BitSetNode are incorrect if there are no active procedures 2018-10-16 15:42:01 +08:00
zhangduo
0d9982901a HBASE-21278 Do not rollback successful sub procedures when rolling back a procedure 2018-10-16 15:12:49 +08:00
Duo Zhang
9e9a1e0f0d HBASE-21254 Need to find a way to limit the number of proc wal files 2018-10-12 11:05:13 +08:00
zhangduo
118b074684 HBASE-21250 Refactor WALProcedureStore and add more comments for better understanding the implementation 2018-10-07 17:09:09 +08:00
meiyi
6bc7089f9e HBASE-21249 Add jitter for ProcedureUtil.getBackoffTimeMs
Signed-off-by: zhangduo <zhangduo@apache.org>
2018-09-28 21:27:45 +08:00
zhangduo
22ac655704 HBASE-21233 Allow the procedure implementation to skip persistence of the state after an execution 2018-09-28 10:50:29 +08:00
Umesh Agashe
dc767c06d2 HBASE-21023 Added bypassProcedure() API to HbckService 2018-09-19 15:18:16 -07:00
Michael Stack
76199a0a29 HBASE-21190 Log files and count of entries in each as we load from the MasterProcWAL store 2018-09-12 10:21:26 -07:00
TAK LON WU
7ecb435d9d HBASE-21181 Use the same filesystem for wal archive directory and wal directory
Signed-off-by: Andrew Purtell <apurtell@apache.org>
2018-09-11 15:32:51 -07:00
Duo Zhang
c59ecfb961 HBASE-21172 Reimplement the retry backoff logic for ReopenTableRegionsProcedure 2018-09-11 15:13:58 +08:00
Michael Stack
b83613fdce HBASE-21171 [amv2] Tool to parse a directory of MasterProcWALs standalone
Signed-off-by: Mike Drob <mdrob@apache.org>
2018-09-09 09:28:59 -07:00
Michael Stack
3ac3249423 HBASE-21155 Save on a few log strings and some churn in wal splitter by skipping out early if no logs in dir 2018-09-06 16:52:47 -07:00
Allan Yang
7c1fad4992 HBASE-21083 Introduce a mechanism to bypass the execution of a stuck procedure 2018-08-30 12:23:24 -07:00
zhangduo
bb3494134e HBASE-20881 Introduce a region transition procedure to handle all the state transition for a region 2018-08-21 06:12:09 +08:00
Allan Yang
159435ee40 HBASE-21050 Exclusive lock may be held by a SUCCESS state procedure forever
Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: zhangduo <zhangduo@apache.org>
2018-08-15 15:39:48 -07:00
subrat.mishra
49ae8549cf HBASE-21040 Replace call to printStackTrace() with proper logger call
Signed-off-by: tedyu <yuzhihong@gmail.com>
2018-08-15 08:29:54 -07:00
Allan Yang
1114a1a65e HBASE-20978 [amv2] Worker terminating UNNATURALLY during MoveRegionProcedure
Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: Duo Zhang <zhangduo@apache.org>
2018-08-14 16:30:20 -07:00
Allan Yang
a07e755625 HBASE-20975 Lock may not be taken or released while rolling back procedure 2018-08-13 20:23:04 +08:00
jackbearden
953e5aa88c HBASE-20981. Rollback stateCount accounting thrown-off when exception out of rollbackState
Signed-off-by: Michael Stack <stack@apache.org>
2018-08-11 11:58:38 -07:00
Michael Stack
c365c4084e HBASE-20989 Minor, miscellaneous logging fixes
Signed-off-by: Zach York <zyork@amazon.com>
Signed-off-by: Mingliang Liu <liuml07@apache.org>
2018-08-01 11:20:59 -07:00
Alex Leblang
e963694259 HBASE-19369 Switch to Builder Pattern In WAL
This patch switches to the builder pattern by adding a helper method.
It also checks to ensure that the pattern is available (i.e. that
HBase is running on a hadoop version that supports it).

Amending-Author: Mike Drob <mdrob@apache.org>
Signed-off-by: tedyu <yuzhihong@gmail.com>
Signed-off-by: zhangduo <zhangduo@apache.org>
2018-07-27 23:42:33 -05:00
zhangduo
7178a98258 HBASE-20939 There will be race when we call suspendIfNotReady and then throw ProcedureSuspendedException 2018-07-27 17:27:12 +08:00
zhangduo
d43e28dc82 Revert "HBASE-20949 Add logs for debugging"
This reverts commit 8b8de1f8a77b5b9f6d4b8cfb7eeb3d545a69d0f2.
2018-07-27 08:40:46 +08:00
zhangduo
8b8de1f8a7 HBASE-20949 Add logs for debugging 2018-07-26 22:43:14 +08:00
zhangduo
f3f17fa111 HBASE-20846 Restore procedure locks when master restarts 2018-07-25 14:37:26 +08:00
Michael Stack
067388bfd9 HBASE-20914 Trim Master memory usage
Add (weak reference) interning of ServerNames.

Correct Balancer regions x racks matrix.

Make smaller defaults when creating ArrayDeques.
2018-07-20 10:08:55 -07:00
zhangduo
a838f7631f HBASE-20847 The parent procedure of RegionTransitionProcedure may not have the table lock 2018-07-11 17:34:35 +08:00
Yu Li
ec8947f226 HBASE-20691 Change the default WAL storage policy back to "NONE"
This reverts commit 564c193d61cd1f92688a08a3af6d55ce4c4636d8 and added more doc
about why we choose "NONE" as the default.
2018-07-04 13:43:48 +08:00
zhangduo
0c97cda2a9 HBASE-19990 Create remote wal directory when transitioning to state S 2018-06-28 18:07:44 +08:00
Michael Stack
21684a32fa HBASE-20745 Log when master proc wal rolls 2018-06-19 19:53:51 -07:00
zhangduo
6dbbd78aa0 HBASE-20708 Remove the usage of RecoverMetaProcedure in master startup 2018-06-19 15:02:10 +08:00
Josh Elser
8710825a9a HBASE-19735 Create a client-tarball assembly
Provides an extra client descriptor to build a second
tarball with a reduced set of dependencies. Not of great
impact now, but will pave the way for better in the future.

Signed-off-by: Sean Busbey <busbey@apache.org>

 Conflicts:
	hbase-assembly/pom.xml
2018-06-18 11:31:13 -07:00
zhangduo
573b57d437 HBASE-20700 Move meta region when server crash can cause the procedure to be stuck 2018-06-11 14:57:31 +08:00
zhangduo
a472f24d17 HBASE-20634 Reopen region while server crash can cause the procedure to be stuck
A reattempt at fixing HBASE-20173 [AMv2] DisableTableProcedure concurrent to ServerCrashProcedure can deadlock

The scenario is a SCP after processing WALs, goes to assign regions that
were on the crashed server but a concurrent Procedure gets in there
first and tries to unassign a region that was on the crashed server
(could be part of a move procedure or a disable table, etc.). The
unassign happens to run AFTER SCP has released all RPCs that
were going against the crashed server. The unassign fails because the
server is crashed. The unassign used to suspend itself only it would
never be woken up because the server it was going against had already
been processed. Worse, the SCP could not make progress because the
unassign was suspended with the lock on a region that it wanted to
assign held making it so it could make no progress.

In here, we add to the unassign recognition of the state where it is
running post SCP cleanup of RPCs. If present, unassign moves to finish
instead of suspending itself.

Includes a nice unit test made by Duo Zhang that reproduces nicely the
hung scenario.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/FailedRemoteDispatchException.java
 Moved this class back to hbase-procedure where it belongs.

M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoNodeDispatchException.java
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoServerDispatchException.java
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NullTargetServerDispatchException.java
 Specializations on FRDE so we can be more particular when we say there
 was a problem.

M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/RemoteProcedureDispatcher.java
 Change addOperationToNode so we throw exceptions that give more detail
 on issue rather than a mysterious true/false

M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto
 Undo SERVER_CRASH_HANDLE_RIT2. Bad idea (from HBASE-20173)

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
 Have expireServer return true if it actually queued an expiration. Used
 later in this patch.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
 Hide methods that shouldn't be public. Add a particular check used out
 in unassign procedure failure processing.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/MoveRegionProcedure.java
 Check that server we're to move from is actually online (might
 catch a few silly move requests early).

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
 Add doc on ServerState. Wasn't being used really. Now we actually stamp
 a Server OFFLINE after its WAL has been split. Means it's safe to assign
 since all WALs have been processed. Add methods to update SPLITTING
 and to set it to OFFLINE after splitting done.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
 Change logging to be new-style and less repetitive of info.
 Cater to new way in which .addOperationToNode returns info (exceptions
 rather than true/false).

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
 Add looking for the case where we failed assign AND we should not
 suspend because we will never be woken up because SCP is beyond
 doing this for all stuck RPCs.

 Some cleanup of the failure processing grouping where we can proceed.

 TODOs have been handled in this refactor including the TODO that
 wonders if it is possible that there are concurrent fails coming in
 (Yes).

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
 Doc and removing the old HBASE-20173 'fix'.
 Also updating ServerStateNode post WAL splitting so it gets marked
 OFFLINE.

A hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestServerCrashProcedureStuck.java
 Nice test by Duo Zhang.

Signed-off-by: Umesh Agashe <uagashe@cloudera.com>
Signed-off-by: Duo Zhang <palomino219@gmail.com>
Signed-off-by: Mike Drob <mdrob@apache.org>
2018-06-04 09:26:56 -07:00
zhangduo
997747076d HBASE-20659 Implement a reopen table regions procedure 2018-05-30 20:03:25 +08:00
Sean Busbey
8ba2a7eeb9 HBASE-20544 Make HBTU default to random ports.
Signed-off-by: Umesh Agashe <uagashe@cloudera.com>
Signed-off-by: Josh Elser <elserj@apache.org>
2018-05-09 23:35:20 -07:00
Chia-Ping Tsai
4cb444e77b HBASE-20169 NPE when calling HBTU.shutdownMiniCluster (TestAssignmentManagerMetrics is flakey); AMENDMENT 2018-05-02 16:14:58 -07:00
Michael Stack
5a071dbe2b HBASE-20492 UnassignProcedure is stuck in retry loop on region stuck in OPENING state
Add backoff when stuck in RegionTransitionProcedure, the subclass of
AssignProcedure and UnassignProcedure. Can happen when we go to
transition but the current Region state is not what we expect.

M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/Procedure.java
 Add doc on being able to suspend and wait on a timeout.

M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto
 Add 'attempt' counter so we can do backoff when we get stuck.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
 Add persistence of new 'attempt' counter

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
 Doc data members that are persisted by subclasses given this is 'odd'.
 Add a counter for 'attempts' used when 'stuck' to implement backoff.
 Add suspend with timeout when 'stuck'. Add callback when timeout is
 exhausted which does wakeup of this procedure.

A hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestUnexpectedStateException.java
 Test of backoff.
2018-04-30 20:40:22 -07:00