From 2e4be3e77db6dc53626b3d99c26699273b91399d Mon Sep 17 00:00:00 2001
From: stack rsn
matches that of this crashed server.
+ */
+ protected boolean isMatchingRegionLocation(RegionStateNode rsn) {
+ return this.serverName.equals(rsn.getRegionLocation());
+ }
+
/**
* Assign the regions on the crashed RS to other Rses.
*
@@ -467,14 +476,17 @@ public class ServerCrashProcedure
regionNode.lock();
try {
// This is possible, as when a server is dead, TRSP will fail to schedule a RemoteProcedure
- // to us and then try to assign the region to a new RS. And before it has updated the region
+ // and then try to assign the region to a new RS. And before it has updated the region
// location to the new RS, we may have already called the am.getRegionsOnServer so we will
- // consider the region is still on us. And then before we arrive here, the TRSP could have
- // updated the region location, or even finished itself, so the region is no longer on us
- // any more, we should not try to assign it again. Please see HBASE-23594 for more details.
- if (!serverName.equals(regionNode.getRegionLocation())) {
- LOG.info("{} found a region {} which is no longer on us {}, give up assigning...", this,
- regionNode, serverName);
+ // consider the region is still on this crashed server. Then before we arrive here, the
+ // TRSP could have updated the region location, or even finished itself, so the region is
+ // no longer on this crashed server any more. We should not try to assign it again. Please
+ // see HBASE-23594 for more details.
+ // UPDATE: HBCKServerCrashProcedure overrides isMatchingRegionLocation; this check can get
+ // in the way of our clearing out 'Unknown Servers'.
+ if (!isMatchingRegionLocation(regionNode)) {
+ LOG.info("{} found {} whose regionLocation no longer matches {}, skipping assign...",
+ this, regionNode, serverName);
continue;
}
if (regionNode.getProcedure() != null) {
diff --git a/hbase-server/src/main/resources/hbase-webapps/master/hbck.jsp b/hbase-server/src/main/resources/hbase-webapps/master/hbck.jsp
index d90827c4018..f0a2ce17704 100644
--- a/hbase-server/src/main/resources/hbase-webapps/master/hbck.jsp
+++ b/hbase-server/src/main/resources/hbase-webapps/master/hbck.jsp
@@ -112,8 +112,7 @@
need to check the server still exists. If not, schedule ServerCrashProcedure for it. If exists,
restart Server2 and Server1):
3. More than one regionserver reports opened this region (Fix: restart the RegionServers).
- Notice: the reported online regionservers may be not right when there are regions in transition.
- Please check them in regionserver's web UI.
+ Note: the reported online regionservers may not be up-to-date when there are regions in transition.
- The below are Regions we've lost account of. To be safe, run bulk load of any data found in these Region orphan directories back into the HBase cluster.
- First make sure hbase:meta is in a healthy state, that there are no holes, overlaps or inconsistencies (else bulk load may complain);
+ The below are Regions we've lost account of. To be safe, run bulk load of any data found under these Region orphan directories to have the
+ cluster re-adopt data.
+ First make sure hbase:meta is in a healthy state, that there are no holes, overlaps or inconsistencies (else bulk load may fail);
run hbck2 fixMeta. Once this is done, per Region below, run a bulk
load -- $ hbase completebulkload REGION_DIR_PATH TABLE_NAME -- and then delete the desiccated directory content (HFiles are removed upon
successful load; all that is left are empty directories and occasionally a seqid marking file).
@@ -259,6 +259,21 @@
+ The below are servers mentioned in the hbase:meta table that are neither 'live' nor known to be 'dead'.
+ The server likely belongs to an older cluster epoch since replaced by a new instance because of a restart/crash.
+ To clear 'Unknown Servers', run 'hbck2 scheduleRecoveries UNKNOWN_SERVERNAME'. This will schedule a ServerCrashProcedure.
+ It will clear out 'Unknown Server' references and schedule reassigns of any Regions that were associated with this host.
+ But first, be sure the referenced Region is not currently stuck looping trying to OPEN. Does it show as a Region-In-Transition on the
+ Master home page? Is it mentioned in the 'Procedures and Locks' Procedures list? If so, perhaps it is stuck in a loop
+ trying to OPEN but unable to because of a missing reference or file.
+ Read the Master log looking for the most recent
+ mentions of the associated Region name. Try to address any such complaint first. If successful, a side-effect
+ should be the cleanup of the 'Unknown Servers' list. It may take a while. OPENs are retried forever but the interval
+ between retries grows. The 'Unknown Server' may be cleared because it is just the last RegionServer the Region was
+ successfully opened on; on the next open, the 'Unknown Server' will be purged.
+
+ Unknown Servers
+
RegionInfo