hbase



zhangduo a472f24d17 HBASE-20634 Reopen region while server crash can cause the procedure to be stuck

A reattempt at fixing HBASE-20173 [AMv2] DisableTableProcedure concurrent to ServerCrashProcedure can deadlock

The scenario: an SCP, after processing WALs, goes to assign the regions
that were on the crashed server, but a concurrent Procedure gets in there
first and tries to unassign a region that was on the crashed server
(it could be part of a move procedure, a disable table, etc.). The
unassign happens to run AFTER SCP has released all RPCs that
were going against the crashed server. The unassign fails because the
server is crashed. The unassign used to suspend itself, but it would
never be woken up because the server it was going against had already
been processed. Worse, the SCP could not make progress either: the
suspended unassign still held the lock on a region that the SCP wanted
to assign.

This patch teaches the unassign to recognize when it is running after
the SCP cleanup of RPCs. In that case, the unassign moves to finish
instead of suspending itself.
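
The decision described above can be sketched roughly as follows. This is an illustration only, not the code in UnassignProcedure: the class, the `Action` enum, and both method names are hypothetical, and tracking "SCP has already released the RPCs" as a plain set of server names is an assumption made to keep the example small.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; every name here is hypothetical, not HBase API. */
final class UnassignFailureSketch {
  enum Action { SUSPEND, FINISH }

  // Servers whose outstanding RPCs the crash procedure has already released.
  private final Set<String> rpcCleanupDone = new HashSet<>();

  /** Called (in this sketch) once crash handling has released RPCs for a server. */
  void markRpcCleanupDone(String serverName) {
    rpcCleanupDone.add(serverName);
  }

  /**
   * Decide what a failed unassign should do. If cleanup already ran for the
   * target server, nothing will ever wake a suspended procedure, so finish
   * (and thereby release the region lock) instead of suspending.
   */
  Action onRemoteCallFailed(String serverName) {
    return rpcCleanupDone.contains(serverName) ? Action.FINISH : Action.SUSPEND;
  }
}
```

The point of the sketch is the ordering problem: the same failure leads to a different action depending on whether cleanup has already happened for that server.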

Includes a nice unit test by Duo Zhang that reproduces the hung
scenario.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/FailedRemoteDispatchException.java
Moved this class back to hbase-procedure where it belongs.

M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoNodeDispatchException.java
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoServerDispatchException.java
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NullTargetServerDispatchException.java
Specializations of FRDE so we can be more particular when we say there
was a problem.

M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/RemoteProcedureDispatcher.java
Change addOperationToNode so we throw exceptions that give more detail
on the issue rather than a mysterious true/false.
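
A rough sketch of the exceptions-instead-of-boolean idea. The exception class names are taken from this commit message, but everything else is an assumption: the classes below are local stand-ins (not the HBase ones), their constructors and base type are invented, and the condition that triggers each exception is a guess from the names rather than the dispatcher's actual logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; stand-in classes, not the HBase implementation. */
final class DispatchSketch {
  // Local stand-ins that merely share names with the classes in this commit.
  static class FailedRemoteDispatchException extends Exception {
    FailedRemoteDispatchException(String msg) { super(msg); }
  }
  static class NullTargetServerDispatchException extends FailedRemoteDispatchException {
    NullTargetServerDispatchException(String msg) { super(msg); }
  }
  static class NoServerDispatchException extends FailedRemoteDispatchException {
    NoServerDispatchException(String msg) { super(msg); }
  }
  static class NoNodeDispatchException extends FailedRemoteDispatchException {
    NoNodeDispatchException(String msg) { super(msg); }
  }

  private final Set<String> knownServers = new HashSet<>();
  private final Map<String, List<String>> nodeQueues = new HashMap<>();

  /** Hypothetical helper: register a server, optionally with a dispatch queue. */
  void registerServer(String server, boolean createNode) {
    knownServers.add(server);
    if (createNode) {
      nodeQueues.put(server, new ArrayList<>());
    }
  }

  /** Queue an operation, or say exactly why that was not possible (assumed conditions). */
  void addOperationToNode(String server, String operation)
      throws FailedRemoteDispatchException {
    if (server == null) {
      throw new NullTargetServerDispatchException("no target for " + operation);
    }
    if (!knownServers.contains(server)) {
      throw new NoServerDispatchException(server + " is unknown; " + operation);
    }
    List<String> queue = nodeQueues.get(server);
    if (queue == null) {
      throw new NoNodeDispatchException(server + " has no node; " + operation);
    }
    queue.add(operation);
  }

  int queuedCount(String server) {
    List<String> queue = nodeQueues.get(server);
    return queue == null ? 0 : queue.size();
  }
}
```

A caller can catch the common parent to handle any dispatch failure uniformly, or catch a specific subclass when the reason matters, which a bare true/false return cannot express.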

M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto
Undo SERVER_CRASH_HANDLE_RIT2. Bad idea (from HBASE-20173)

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
Have expireServer return true if it actually queued an expiration. Used
later in this patch.
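
As a minimal sketch of that contract, the method below returns true only when an expiration was actually queued. The class, its fields, and the rule "only an online server can be queued" are assumptions for illustration; the real ServerManager does considerably more.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Illustrative sketch only; hypothetical names, not the ServerManager code. */
final class ExpireServerSketch {
  private final Set<String> onlineServers = new HashSet<>();
  private final Queue<String> pendingExpirations = new ArrayDeque<>();

  void serverOnline(String serverName) {
    onlineServers.add(serverName);
  }

  /** Returns true only if this call queued an expiration for the server. */
  boolean expireServer(String serverName) {
    if (!onlineServers.remove(serverName)) {
      return false; // unknown or already expired: nothing was queued
    }
    pendingExpirations.add(serverName);
    return true;
  }

  int pendingCount() {
    return pendingExpirations.size();
  }
}
```

With a boolean result, a caller can tell "I triggered crash handling" apart from "someone else already did", which matters when deciding whether to wait for it.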

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
Hide methods that shouldn't be public. Add a particular check used out
in unassign procedure failure processing.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/MoveRegionProcedure.java
Check that server we're to move from is actually online (might
catch a few silly move requests early).

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
Add doc on ServerState, which wasn't really being used. Now we actually
stamp a Server OFFLINE after its WAL has been split, which means it's
safe to assign since all WALs have been processed. Add methods to update
SPLITTING and to set it to OFFLINE after splitting is done.
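
The server state stamping can be pictured with a small sketch. SPLITTING and OFFLINE are the states named above; the ONLINE starting state, the node class, and all method names here are assumptions, not the RegionStates API.

```java
/** Illustrative sketch only; hypothetical names, not the RegionStates code. */
final class ServerStateSketch {
  enum ServerState { ONLINE, SPLITTING, OFFLINE }

  static final class ServerStateNode {
    // Assumed initial state for a live server.
    private ServerState state = ServerState.ONLINE;

    /** Crash handling has started splitting this server's WALs. */
    synchronized void setSplitting() {
      state = ServerState.SPLITTING;
    }

    /** All of this server's WALs have been split. */
    synchronized void setOffline() {
      state = ServerState.OFFLINE;
    }

    synchronized ServerState getState() {
      return state;
    }

    /** Regions from this server may be assigned elsewhere only once WALs are done. */
    synchronized boolean safeToAssignRegions() {
      return state == ServerState.OFFLINE;
    }
  }
}
```

The sketch does not enforce transition order; it only shows that OFFLINE is the marker other procedures can check to learn that WAL processing has completed.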

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
Change logging to be new-style and less repetitive of info.
Cater to new way in which .addOperationToNode returns info (exceptions
rather than true/false).

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
Look for the case where we failed assign AND we should not
suspend, because we would never be woken up: SCP is already past
the point where it does this for all stuck RPCs.

Some cleanup of the failure processing grouping where we can proceed.

TODOs have been handled in this refactor, including the TODO that
wonders if it is possible that there are concurrent fails coming in
(Yes).

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
Doc and removing the old HBASE-20173 'fix'.
Also updating ServerStateNode post WAL splitting so it gets marked
OFFLINE.

A hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestServerCrashProcedureStuck.java
Nice test by Duo Zhang.

Signed-off-by: Umesh Agashe <uagashe@cloudera.com>
Signed-off-by: Duo Zhang <palomino219@gmail.com>
Signed-off-by: Mike Drob <mdrob@apache.org>

2018-06-04 09:26:56 -07:00

src	HBASE-20634 Reopen region while server crash can cause the procedure to be stuck	2018-06-04 09:26:56 -07:00
pom.xml	HBASE-20330 ProcedureExecutor.start() gets stuck in recover lease on store	2018-04-11 16:07:42 -07:00