HBASE-24368 Let HBCKSCP clear 'Unknown Servers', even if RegionStateNode has RegionLocation == null

hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/GCMultipleMergedRegionsProcedure.java
 Edit a log.

hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/HBCKServerCrashProcedure.java
 Add override of isMatchingRegionLocation. Allow 'null' as a pass in
 HBCKSCP.

hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
 Add a method for HBCKSCP to override and be less strict filtering
 assigns.

hbase-server/src/main/resources/hbase-webapps/master/hbck.jsp
 Some doc on what 'Unknown Servers' are.
This commit is contained in:
stack 2020-05-13 22:19:25 -07:00
parent 2e5a664233
commit 32e2682310
4 changed files with 53 additions and 14 deletions

View File

@ -99,12 +99,11 @@ public class GCMultipleMergedRegionsProcedure extends
case GC_MERGED_REGIONS_PREPARE:
// If GCMultipleMergedRegionsProcedure processing is slower than the CatalogJanitor's scan
// interval, it will end resubmitting GCMultipleMergedRegionsProcedure for the same
// region, we can skip duplicate GCMultipleMergedRegionsProcedure while previous finished
// region. We can skip duplicate GCMultipleMergedRegionsProcedure while previous finished
List<RegionInfo> parents = MetaTableAccessor.getMergeRegions(
env.getMasterServices().getConnection(), mergedChild.getRegionName());
if (parents == null || parents.isEmpty()) {
LOG.info("Region=" + mergedChild.getShortNameToLog()
+ " info:merge qualifier has been deleted");
LOG.info("{} mergeXXX qualifiers have ALL been deleted", mergedChild.getShortNameToLog());
return Flow.NO_MORE_STATE;
}
setNextState(GCMergedRegionsState.GC_MERGED_REGIONS_PURGE);

View File

@ -30,6 +30,7 @@ import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.RegionInfo;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.master.RegionState;
import org.apache.hadoop.hbase.master.assignment.RegionStateNode;
import org.apache.hadoop.hbase.master.assignment.RegionStateStore;
import org.apache.yetus.audience.InterfaceAudience;
import org.slf4j.Logger;
@ -168,4 +169,16 @@ public class HBCKServerCrashProcedure extends ServerCrashProcedure {
return this.reassigns;
}
}
/**
* The RegionStateNode will not have a location if a confirm of an OPEN fails. On fail,
* the RegionStateNode regionLocation is set to null. This is 'looser' than the test done
* in the superclass. The HBCKSCP has been scheduled by an operator via hbck2 probably at the
* behest of a report of an 'Unknown Server' in the 'HBCK Report'. Let the operators operation
* succeed even in case where the region location in the RegionStateNode is null.
*/
@Override
protected boolean isMatchingRegionLocation(RegionStateNode rsn) {
return super.isMatchingRegionLocation(rsn) || rsn.getRegionLocation() == null;
}
}

View File

@ -450,6 +450,15 @@ public class ServerCrashProcedure
return false;
}
/**
* Moved out here so can be overridden by the HBCK fix-up SCP to be less strict about what
* it will tolerate as a 'match'.
* @return True if the region location in <code>rsn</code> matches that of this crashed server.
*/
protected boolean isMatchingRegionLocation(RegionStateNode rsn) {
return this.serverName.equals(rsn.getRegionLocation());
}
/**
* Assign the regions on the crashed RS to other Rses.
* <p/>
@ -467,14 +476,17 @@ public class ServerCrashProcedure
regionNode.lock();
try {
// This is possible, as when a server is dead, TRSP will fail to schedule a RemoteProcedure
// to us and then try to assign the region to a new RS. And before it has updated the region
// and then try to assign the region to a new RS. And before it has updated the region
// location to the new RS, we may have already called the am.getRegionsOnServer so we will
// consider the region is still on us. And then before we arrive here, the TRSP could have
// updated the region location, or even finished itself, so the region is no longer on us
// any more, we should not try to assign it again. Please see HBASE-23594 for more details.
if (!serverName.equals(regionNode.getRegionLocation())) {
LOG.info("{} found a region {} which is no longer on us {}, give up assigning...", this,
regionNode, serverName);
// consider the region is still on this crashed server. Then before we arrive here, the
// TRSP could have updated the region location, or even finished itself, so the region is
// no longer on this crashed server any more. We should not try to assign it again. Please
// see HBASE-23594 for more details.
// UPDATE: HBCKServerCrashProcedure overrides isMatchingRegionLocation; this check can get
// in the way of our clearing out 'Unknown Servers'.
if (!isMatchingRegionLocation(regionNode)) {
LOG.info("{} found {} whose regionLocation no longer matches {}, skipping assign...",
this, regionNode, serverName);
continue;
}
if (regionNode.getProcedure() != null) {

View File

@ -112,8 +112,7 @@
need to check the server still exists. If not, schedule <em>ServerCrashProcedure</em> for it. If exists,
restart Server2 and Server1):
3. More than one regionserver reports opened this region (Fix: restart the RegionServers).
Notice: the reported online regionservers may be not right when there are regions in transition.
Please check them in regionserver's web UI.
Note: the reported online regionservers may be not be up-to-date when there are regions in transition.
</span>
</p>
@ -165,8 +164,9 @@
</div>
<p>
<span>
The below are Regions we've lost account of. To be safe, run bulk load of any data found in these Region orphan directories back into the HBase cluster.
First make sure <em>hbase:meta</em> is in a healthy state, that there are no holes, overlaps or inconsistencies (else bulk load may complain);
The below are Regions we've lost account of. To be safe, run bulk load of any data found under these Region orphan directories to have the
cluster re-adopt data.
First make sure <em>hbase:meta</em> is in a healthy state, that there are no holes, overlaps or inconsistencies (else bulk load may fail);
run <em>hbck2 fixMeta</em>. Once this is done, per Region below, run a bulk
load -- <em>$ hbase completebulkload REGION_DIR_PATH TABLE_NAME</em> -- and then delete the desiccated directory content (HFiles are removed upon
successful load; all that is left are empty directories and occasionally a seqid marking file).
@ -259,6 +259,21 @@
<h2>Unknown Servers</h2>
</div>
</div>
<p>
<span>The below are servers mentioned in the hbase:meta table that are no longer 'live' or known 'dead'.
The server likely belongs to an older cluster epoch since replaced by a new instance because of a restart/crash.
To clear 'Unknown Servers', run 'hbck2 scheduleRecoveries UNKNOWN_SERVERNAME'. This will schedule a ServerCrashProcedure.
It will clear out 'Unknown Server' references and schedule reassigns of any Regions that were associated with this host.
But first!, be sure the referenced Region is not currently stuck looping trying to OPEN. Does it show as a Region-In-Transition on the
Master home page? Is it mentioned in the 'Procedures and Locks' Procedures list? If so, perhaps it stuck in a loop
trying to OPEN but unable to because of a missing reference or file.
Read the Master log looking for the most recent
mentions of the associated Region name. Try and address any such complaint first. If successful, a side-effect
should be the clean up of the 'Unknown Servers' list. It may take a while. OPENs are retried forever but the interval
between retries grows. The 'Unknown Server' may be cleared because it is just the last RegionServer the Region was
successfully opened on; on the next open, the 'Unknown Server' will be purged.
</span>
</p>
<table class="table table-striped">
<tr>
<th>RegionInfo</th>