Remove a bunch of places where we create ServerStateNode. We were
creating a SSN even though the server was long dead and processed.
The revived SSN was messing up the little dance we do unassigning
procedures. In particular, in UnassignProcedure, the check for a
dead server inside isLogSplittingDone should return true (we can
proceed because the server is dead) but fails if an SSN exists.
We were creating SSNs both when we didn't need them and inadvertently.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
Print serverstatenode when reporting expiration. Helps debugging.
Make moveFromOnlineToDeadServers return whether the server was online.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
Make do w/ serverName in place of serverNode in a few places.
In waitServerReportEvent, create a ServerStateNode if none though we
should not have to at this point; to figure out later: TODO.
addRegionToServer no longer automatically creates an SSN, so
explicitly create one when processing load meta and the region
is OPEN so we can associate OPEN regions with the SSN.
Do not schedule an SCP if server is not online, not in fs, and not in
dead servers. No point (and there may be cases where server is long
gone but hbase:meta still refers to it though it has not carried
regions in a long time; running an assign/unassign against such a
server will fail because it is not there but SCP won't clean up
the outstanding hung RPC because our region is not on the long-gone
server).
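The scheduling guard described above can be sketched roughly as follows.
This is a minimal illustration only: ScpGuard, shouldScheduleScp and the
three sets are hypothetical stand-ins for the online-servers list, the
WAL directories in the filesystem and the dead-servers list, not the
actual ServerManager/AssignmentManager code.

```java
import java.util.Collections;
import java.util.Set;

class ScpGuard {
  /**
   * Hypothetical helper: schedule an SCP only if the server is known
   * somewhere (online, has a WAL dir in the fs, or in dead servers).
   */
  static boolean shouldScheduleScp(String serverName, Set<String> online,
      Set<String> withWalDirs, Set<String> dead) {
    return online.contains(serverName)
        || withWalDirs.contains(serverName)
        || dead.contains(serverName);
  }

  public static void main(String[] args) {
    Set<String> none = Collections.<String>emptySet();
    Set<String> walDirs = Collections.singleton("rs1,16020,1");
    // Long-gone server that only hbase:meta still refers to: skip the SCP.
    System.out.println(shouldScheduleScp("rs9,16020,1", none, walDirs, none));
    // Server with leftover WALs: an SCP is still worth scheduling.
    System.out.println(shouldScheduleScp("rs1,16020,1", none, walDirs, none));
  }
}
```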
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
Just cleanup. Make it so addRegionToServer and remove can deal if no SSN.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterWalManager.java
Add isWALDirectoryNameWithWALS utility.
For RIT Duration, do better than print ms/seconds. Remove redundant UI
column dedicated to duration when we log it in the status field too.
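As an illustration of a friendlier duration than raw ms/seconds, here is
a small sketch. RitDuration and its output format are assumptions made
for this example, not the formatter the patch actually uses; it also
assumes a non-negative input.

```java
import java.time.Duration;

class RitDuration {
  /** Hypothetical formatter: renders millis as e.g. "1hrs, 2mins, 3.456sec". */
  static String format(long millis) {
    Duration d = Duration.ofMillis(millis);
    long hours = d.toHours();
    long minutes = d.toMinutes() % 60;
    long seconds = d.getSeconds() % 60;
    long ms = millis % 1000;
    StringBuilder sb = new StringBuilder();
    if (hours > 0) {
      sb.append(hours).append("hrs, ");
    }
    if (hours > 0 || minutes > 0) {
      sb.append(minutes).append("mins, ");
    }
    sb.append(seconds).append('.');
    // Zero-pad the millisecond part to three digits.
    if (ms < 100) {
      sb.append('0');
    }
    if (ms < 10) {
      sb.append('0');
    }
    sb.append(ms).append("sec");
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(format(3_723_456L)); // 1hrs, 2mins, 3.456sec
  }
}
```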
Make bypass log at INFO level.
Make it so on completion of a subprocedure, we note the count of outstanding
siblings so we have a clue how much further the parent has to go before
it is done (helpful when hundreds of servers are doing SCP).
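The sibling accounting can be pictured with the sketch below. The
ParentProgress class and the message text are hypothetical; the real
bookkeeping lives in the procedure framework and is not reproduced here.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ParentProgress {
  // Number of child procedures that have not yet completed.
  private final AtomicInteger outstanding;

  ParentProgress(int children) {
    this.outstanding = new AtomicInteger(children);
  }

  /** Records one finished child and returns the log line to emit. */
  String childDoneMessage(String childName) {
    int left = outstanding.decrementAndGet();
    return "Finished subprocedure " + childName + "; outstanding siblings=" + left;
  }

  public static void main(String[] args) {
    ParentProgress parent = new ParentProgress(3);
    System.out.println(parent.childDoneMessage("pid=11"));
    System.out.println(parent.childDoneMessage("pid=12"));
  }
}
```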
Have the SCP run the AP preflight check before creating an AP; saves
creation of thousands of APs during fixup.
Don't log tablename three times when reporting remote call failed.
If lock is held already, note who has it. Also log after we get lock
or if we have to wait rather than log on entrance though we may
later have to wait (or we may have just picked up the lock).
Signed-off-by: Mike Drob <mdrob@apache.org>
Adds override to assigns and unassigns. Renames the bypass 'force'
param to 'override' to align.
Adds recursive to 'bypass', a means of calling bypass on
parent and its subprocedures (usually bypass works on
leaf nodes rippling the bypass up to parent -- recursive
has us work in the opposite direction): EXPERIMENTAL.
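A rough sketch of the recursive direction, assuming a simple in-memory
tree. BypassNode is a hypothetical type for illustration only; it does
not model locking, persistence or the leaf-to-parent ripple of the
non-recursive case.

```java
import java.util.ArrayList;
import java.util.List;

class BypassNode {
  final String name;
  final List<BypassNode> children = new ArrayList<>();
  boolean bypassed;

  BypassNode(String name) {
    this.name = name;
  }

  BypassNode add(BypassNode child) {
    children.add(child);
    return this;
  }

  /**
   * Marks this node bypassed; when recursive, walks down and marks every
   * descendant too. Returns how many nodes were newly marked.
   */
  int bypass(boolean recursive) {
    int marked = 0;
    if (!bypassed) {
      bypassed = true;
      marked++;
    }
    if (recursive) {
      for (BypassNode child : children) {
        marked += child.bypass(true);
      }
    }
    return marked;
  }

  public static void main(String[] args) {
    BypassNode parent = new BypassNode("parent")
        .add(new BypassNode("child-1").add(new BypassNode("grandchild")))
        .add(new BypassNode("child-2"));
    System.out.println(parent.bypass(true)); // marks parent and all descendants
  }
}
```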
bypass on an assign/unassign leaves region in RIT and the
RegionStateNode loaded with the bypassed procedure. First
implementation had assign/unassign cleanup leftover state.
Second implementation, on feedback, keeps the state in place
as a fence against other Procedures assuming the region entity,
and instead adds an 'override' function that hbck2 can set on
assigns/unassigns to override the fencing.
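The fencing idea can be sketched as below. RegionFence and tryAttach are
hypothetical names used for illustration; in the patch the fence is the
procedure left loaded in the RegionStateNode.

```java
class RegionFence {
  // Procedure currently attached to the region, or null if none.
  private String owner;

  /**
   * Attaches proc to the region. If another procedure already owns it,
   * the attach is refused unless override is set.
   */
  synchronized boolean tryAttach(String proc, boolean override) {
    if (owner != null && !owner.equals(proc) && !override) {
      return false;
    }
    owner = proc;
    return true;
  }

  synchronized String owner() {
    return owner;
  }

  public static void main(String[] args) {
    RegionFence fence = new RegionFence();
    fence.tryAttach("bypassed-assign", false);
    // A plain unassign is fenced out; one with override takes ownership.
    System.out.println(fence.tryAttach("unassign", false));
    System.out.println(fence.tryAttach("unassign", true));
  }
}
```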
Note that the below also converts ProcedureExceptions that
come out of the Pv2 system into DoNotRetryIOEs. It is a
little awkward because DNRIOE is in client-module, not
in procedure module. Previously, we'd just keep retrying
the bypass, etc.
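The conversion pattern looks roughly like the sketch below. To keep it
self-contained, ProcedureFailure and NoRetryIOException are hypothetical
stand-ins for the procedure-module exception and the client-module
do-not-retry IOException; the real class hierarchy is not reproduced.

```java
import java.io.IOException;

class BypassRpc {
  /** Stand-in for the exception thrown by the procedure framework. */
  static class ProcedureFailure extends Exception {
    ProcedureFailure(String msg) {
      super(msg);
    }
  }

  /** Stand-in for an IOException that tells the client not to retry. */
  static class NoRetryIOException extends IOException {
    NoRetryIOException(String msg, Throwable cause) {
      super(msg, cause);
    }
  }

  interface Call<T> {
    T run() throws ProcedureFailure;
  }

  /** Runs the call, rethrowing framework failures as non-retriable. */
  static <T> T callNoRetry(Call<T> call) throws IOException {
    try {
      return call.run();
    } catch (ProcedureFailure e) {
      throw new NoRetryIOException(e.getMessage(), e);
    }
  }

  public static void main(String[] args) {
    try {
      callNoRetry(() -> {
        throw new ProcedureFailure("pid=1 does not exist");
      });
    } catch (IOException e) {
      System.out.println(e.getClass().getSimpleName() + ": " + e.getMessage());
    }
  }
}
```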
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/Procedure.java
Have bypass take an environment like all other methods so subclasses can use it.
Fix javadoc issues.
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java
Javadoc issues. Pass environment when we invoke bypass.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
Rename waitUntilNamespace... etc. to align with how these method types
are named elsewhere, i.e. waitFor rather than waitUntil.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
Clean up the message we emit when we find an existing procedure working
against this entity.
Add support for a force function which allows Assigns/Unassigns to force
ownership of the Region entity.
A hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestRegionBypass.java
Test bypass and force.
M hbase-shell/src/main/ruby/shell/commands/list_procedures.rb
Minor cleanup of the json output... do iso8601 timestamps.
This reverts commit b96905d1df.
i.e. a revert of a revert so a reapplication!
Revert so I can add signed-off-by....
Signed-off-by: Allan Yang <allan163@apache.org>
Remove unused methods from Sleeper (it's OK, it's @Private).
Remove notion of startTime from Sleeper handling (it is unused).
Allow passing in how long to sleep so it can be maintained externally.
In HRS, use a RetryCounter to calculate backoff sleep time for when
reportForDuty is failing against a struggling Master.
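For illustration, a capped exponential backoff of the kind a retry
counter might produce is sketched below. ReportBackoff and sleepMillis
are hypothetical and do not use the RetryCounter API; the base and cap
values are arbitrary.

```java
class ReportBackoff {
  /**
   * Sleep time for a zero-based attempt: baseMillis doubled once per
   * attempt, capped at maxMillis.
   */
  static long sleepMillis(int attempt, long baseMillis, long maxMillis) {
    long sleep = baseMillis;
    // Stop doubling once the cap is reached to avoid overflow.
    for (int i = 0; i < attempt && sleep < maxMillis; i++) {
      sleep *= 2;
    }
    return Math.min(sleep, maxMillis);
  }

  public static void main(String[] args) {
    for (int attempt = 0; attempt < 8; attempt++) {
      System.out.println("attempt " + attempt + " -> sleep "
          + sleepMillis(attempt, 100L, 5000L) + "ms");
    }
  }
}
```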
Add a check for hbase:meta being online before we go to read it.
If not online, move into a holding-pattern until rectified, probably
by external operator.
Incorporates bulk of patch made by Allan Yang over on HBASE-21035.
M hbase-common/src/main/java/org/apache/hadoop/hbase/util/RetryCounterFactory.java
Add a constructor for the case where retries are forever.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
Move stuff around so that the first hbase:meta read is the AM#loadMeta.
Previously, checking table state and/or favored nodes could end up
trying to read a meta that was not onlined holding up master startup.
Do similar for the namespace table. Adds new methods isMeta and
isNamespace which check that the regions/tables are online; if not,
we wait, logging with a back-off that assigns need to be run.
Signed-off-by: Allan Yang <allan163@apache.org>
Signed-off-by: Duo Zhang <zhangduo@apache.org>
HBASE-21113 Apply the branch-2 version of HBASE-21095, The timeout retry
logic for several procedures are broken after master restarts
I applied the patch HBASE-21095 and then reverted it so I could apply the
patch as HBASE-21113 (by reverting the HBASE-21095 revert but pushing
with this message!).
This reverts commit 4978db8102.
Write the hbase-1.x hbck1 lock file to block out hbck1 instances writing
state to an hbase-2.x cluster (could do damage).
Set hbase.write.hbck1.lock.file to false to disable this writing.
Look for the particular case where RS does the close of region w/o
involving Master and log special message in this case. Dodgy. But
until we have Master run shutdown of all regions, better than
the message we currently show.